TAP-Vid: A Benchmark for Tracking Any Point in a Video

Google DeepMind · VGG, Department of Engineering Science, University of Oxford

Visualization of the TAP-Vid Dataset with Human-Annotated Ground-Truth Point Tracks

Abstract

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for making inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.

Video Summary


Annotation Workflow

A screen capture showing the typical annotation workflow and usage of the web interface while annotating three point tracks in a brief video. For brevity, we chose points where the optical-flow-based tracker worked exceptionally well and no further refinement was required, although this is not the general case.

TAP-Vid-DAVIS

TAP-Vid-Kinetics

This is the most challenging TAP-Vid subset. All videos are real-world clips obtained from YouTube. Annotations were carefully monitored, filtered, and refined over multiple iterations to ensure their quality. Unfortunately, we cannot include Kinetics visualizations directly due to the licensing of these YouTube videos, but they may be visualized using the download scripts.
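For readers working with the downloaded data, the sketch below shows one way to overlay annotated point tracks on video frames. The array shapes, key ordering, and the assumption that coordinates are in pixels are illustrative only; consult the official download scripts for the authoritative data format (if coordinates are stored normalized to [0, 1], scale them by the frame width and height before plotting).

```python
# Hypothetical visualization sketch: overlay annotated point tracks on frames.
# Shapes and the (x, y) ordering are assumptions, not the official format.
import numpy as np
import matplotlib.pyplot as plt


def plot_tracks(frames, points, occluded, stride=10):
    """frames: (num_frames, H, W, 3) uint8 video.
    points: (num_points, num_frames, 2) track locations in pixels, ordered (x, y).
    occluded: (num_points, num_frames) boolean occlusion mask.
    Shows every `stride`-th frame with its visible annotated points.
    """
    shown = range(0, frames.shape[0], stride)
    _, axes = plt.subplots(1, len(shown), figsize=(4 * len(shown), 4))
    for ax, t in zip(np.atleast_1d(axes), shown):
        ax.imshow(frames[t])
        visible = ~occluded[:, t]  # only draw points annotated as visible
        ax.scatter(points[visible, t, 0], points[visible, t, 1], s=8, c="red")
        ax.set_title(f"frame {t}")
        ax.axis("off")
    plt.show()
```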

TAP-Vid-Kubric

TAP-Vid-RGB-Stacking

Improving Annotation with a Flow-Based Tracker

Annotating videos with high-quality point tracks is a very challenging task. We use optical-flow-based point interpolation to aid the human annotators. In each comparison below, the left video shows annotations obtained without this tracker and the right video shows annotations obtained with it. The tracker produces more spatially consistent tracks and speeds up the annotation process by over 40%, substantially simplifying the annotation effort. A minimal code sketch of the idea follows the examples.

Without flow tracker assist | With flow tracker assist

Note the high-frequency jitter of the points on the swan's back and neck that occurs without assistance from optical flow.

Without flow tracker assist | With flow tracker assist

Camera shake is particularly problematic, but the optical flow compensates for it very well: with flow interpolation the shake is almost unnoticeable, whereas without it the points jump visibly about three-quarters of the way through the video.

Without flow tracker assist | With flow tracker assist

The flow tracker remains consistent even with fast motions, greatly improving the precision of the points on the horse's saddle.

Without flow tracker assist | With flow tracker assist

Again, the flow tracker helps compensate for camera motion and zoom.
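As a rough illustration of the assist (not the annotation tool's actual implementation), the sketch below chains per-frame optical flow with bilinear sampling to propagate an annotated point forward from a keyframe; the annotator then only needs to correct the frames where the chained flow drifts. The function names, flow arrays, and the assumption that flow channels are ordered (dx, dy) are illustrative.

```python
# Minimal sketch of flow-assisted point propagation, assuming precomputed
# dense forward flow fields between consecutive frames.
import numpy as np


def bilinear_sample(flow, xy):
    """Bilinearly sample a (H, W, 2) flow field at a continuous (x, y) location."""
    h, w = flow.shape[:2]
    x = np.clip(xy[0], 0, w - 1.001)
    y = np.clip(xy[1], 0, h - 1.001)
    x0, y0 = int(x), int(y)
    dx, dy = x - x0, y - y0
    top = flow[y0, x0] * (1 - dx) + flow[y0, x0 + 1] * dx
    bot = flow[y0 + 1, x0] * (1 - dx) + flow[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy


def propagate_point(flows, start_xy):
    """Chain per-frame forward flow to track a point from an annotated keyframe.

    flows: list of (H, W, 2) arrays; flows[t] maps frame t to frame t + 1,
        with channels assumed to be (dx, dy) displacements in pixels.
    start_xy: (x, y) location clicked by the annotator in the first frame.
    Returns an (num_frames, 2) array of estimated locations, which the
    annotator only corrects where the chained flow drifts.
    """
    track = [np.asarray(start_xy, dtype=np.float32)]
    for flow in flows:
        track.append(track[-1] + bilinear_sample(flow, track[-1]))
    return np.stack(track)
```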

Licensing

The annotations of TAP-Vid, as well as the RGB-Stacking videos, are released under a Creative Commons BY license. The original source videos of TAP-Vid-DAVIS come from the DAVIS val set and are also licensed under Creative Commons licenses by their creators; see the DAVIS dataset for details. Kinetics videos are publicly available on YouTube but are subject to their own individual licenses; see the Kinetics dataset webpage for details.

Related Links

Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories proposes Persistent Independent Particles (PIPs), a new particle-video method that tracks any pixel over time.