TAP-Vid: A Benchmark for Tracking Any Point in a Video

Google DeepMind, University of Oxford

Visualization of TAP-Vid Dataset with Human Annotated Ground-Truth Point Tracks

Abstract

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
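To make the TAP task concrete, here is a minimal sketch in Python/NumPy of the interface such a model exposes: given a video and a set of query points, it must predict a position and a visibility flag for every query in every frame. The array names, shapes, and the trivial stub tracker below are illustrative assumptions, not the benchmark's actual data format.

```python
import numpy as np

def track_any_point(video: np.ndarray, query_points: np.ndarray):
    """Illustrative TAP interface (a stub, not a real tracker).

    Args (assumed shapes, for illustration only):
      video:        (num_frames, height, width, 3) uint8 frames.
      query_points: (num_queries, 3) rows of (t, y, x), i.e. the frame index
                    and pixel location at which each physical point is queried.

    Returns:
      tracks:   (num_queries, num_frames, 2) predicted (x, y) per frame.
      occluded: (num_queries, num_frames) boolean visibility flags.
    """
    num_frames = video.shape[0]
    num_queries = query_points.shape[0]
    tracks = np.zeros((num_queries, num_frames, 2), dtype=np.float32)
    occluded = np.zeros((num_queries, num_frames), dtype=bool)
    # Stub behaviour: repeat each query's initial location in every frame.
    # A real model such as TAP-Net predicts per-frame motion and occlusion.
    for i, (t, y, x) in enumerate(query_points):
        tracks[i, :] = (x, y)
    return tracks, occluded
```

Evaluation then compares the predicted tracks and occlusion flags against the human-annotated or synthetic ground truth.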

Video Summary


Annotation Workflow

A screen capture showing the typical annotation workflow and usage of the web interface while annotating three point tracks in a brief video. For brevity, we chose points where the optical-flow-based tracker worked exceptionally well and no further refinement was required, although this is not the general case.

TAP-Vid-DAVIS

TAP-Vid-Kinetics

This is the most challenging TAP-Vid subset. All videos are real-world footage obtained from YouTube. Annotations are carefully monitored, filtered, and refined over multiple iterations to ensure quality. Unfortunately, we cannot include Kinetics visualizations directly due to the licensing of these YouTube videos, but they may be visualized using the download scripts.

TAP-Vid-Kubric

TAP-Vid-RGB-Stacking

Improving Annotation with Flow based Tracker

Annotating videos with high-quality point tracks is a very challenging task. We use optical-flow-based point interpolation to aid the human annotators. In each comparison below, the left video shows annotations obtained without this tracker and the right video shows annotations obtained with it. The tracker yields more spatially consistent tracks and speeds up annotation by over 40%, substantially simplifying the annotation effort.
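As a rough illustration of the flow-assisted interpolation, the sketch below chains dense optical flow fields to propagate a single annotated point forward through a clip. This is a simplified, assumption-laden version (the function names, flow format, and bilinear sampler are ours, not the tool's actual implementation), and the real annotation pipeline additionally incorporates annotator corrections.

```python
import numpy as np

def propagate_point(flows, start_xy):
    """Chain per-frame optical flow to carry one annotated point forward.

    A minimal sketch, assuming `flows` is a list of dense (H, W, 2) flow
    fields where flows[t] maps frame t to frame t+1. The actual annotation
    tool also lets annotators refine the resulting track.
    """
    def sample_flow(flow, x, y):
        # Bilinear sampling of the flow field at a sub-pixel location.
        h, w, _ = flow.shape
        x0 = int(np.clip(np.floor(x), 0, w - 1))
        y0 = int(np.clip(np.floor(y), 0, h - 1))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        ax, ay = x - x0, y - y0
        top = (1 - ax) * flow[y0, x0] + ax * flow[y0, x1]
        bottom = (1 - ax) * flow[y1, x0] + ax * flow[y1, x1]
        return (1 - ay) * top + ay * bottom

    track = [np.asarray(start_xy, dtype=np.float32)]
    for flow in flows:
        x, y = track[-1]
        track.append(track[-1] + sample_flow(flow, x, y))
    return np.stack(track)  # (num_frames, 2) point positions over time
```

Because flow only compensates for easy, short-term motion such as camera shake, annotators still correct drift and re-mark points after occlusions; the chained estimate simply removes most of the tedious frame-by-frame work.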

Left: without flow tracker assist. Right: with flow tracker assist.

Note the high-frequency jitter for the points on the swan's back and neck that occurs without assistance from optical flow.

Left: without flow tracker assist. Right: with flow tracker assist.

Camera shake is particularly problematic, but optical flow compensates for it very well. The shake is almost unnoticeable when flow interpolation is used, whereas without it the points jump visibly three-quarters of the way through the video.

Left: without flow tracker assist. Right: with flow tracker assist.

The flow tracker remains consistent even with fast motions, greatly improving the precision of the points on the horse's saddle.

Left: without flow tracker assist. Right: with flow tracker assist.

Again, the flow tracker helps compensate for camera motion and zoom.

Licensing

The TAP-Vid annotations, as well as the RGB-Stacking videos, are released under a Creative Commons Attribution (CC BY) license. The original DAVIS source videos come from the DAVIS val set and are likewise licensed under Creative Commons licenses by their creators; see the DAVIS dataset for details. Kinetics videos are publicly available on YouTube but are subject to their own individual licenses; see the Kinetics dataset webpage for details.

Related Links

Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories proposes Persistent Independent Particles (PIPs), a new particle video method that tracks any pixel over time.

Kubric: A scalable dataset generator is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.