Generic motion understanding from video involves not only tracking objects, but also perceiving how their
surfaces deform and move. This information is useful for making inferences about 3D shape, physical
properties and object interactions. While the problem of tracking arbitrary physical points on surfaces
over longer video clips has received some attention, no dataset or benchmark for evaluation has existed until
now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a
companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations
of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction
of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to
compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder
sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end
point-tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained
on synthetic data.
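
To make the role of optical flow in the annotation pipeline concrete, the sketch below shows one simple way a point could be carried across frames by chaining per-frame flow estimates, so that easy motion such as a camera pan is followed without manual input. It is an illustration under stated assumptions, not the pipeline described in this paper: the function name `propagate_point`, the `(H, W, 2)` flow layout with `(dx, dy)` per pixel, and the nearest-neighbour lookup are all hypothetical choices made for brevity.

```python
import numpy as np


def propagate_point(point_xy, flows):
    """Chain per-frame dense flow to carry one point forward in time.

    Assumed (hypothetical) layout: each entry of `flows` has shape (H, W, 2),
    and flows[t][y, x] is the (dx, dy) displacement of pixel (x, y) from
    frame t to frame t + 1.

    Returns an array of shape (len(flows) + 1, 2) holding the (x, y) track.
    """
    track = [np.asarray(point_xy, dtype=np.float64)]
    for flow in flows:
        height, width = flow.shape[:2]
        x, y = track[-1]
        # Nearest-neighbour lookup, clamped so the index stays inside the image.
        xi = int(np.clip(np.rint(x), 0, width - 1))
        yi = int(np.clip(np.rint(y), 0, height - 1))
        track.append(track[-1] + flow[yi, xi])
    return np.stack(track)


# Toy input: a uniform shift of one pixel to the right per frame,
# standing in for a slow camera pan over three frame transitions.
flows = [np.tile(np.array([1.0, 0.0]), (8, 8, 1)) for _ in range(3)]
track = propagate_point((2.0, 3.0), flows)
```

In a semi-automatic setting, a track like this would only serve as an initial guess between annotated frames; chained flow accumulates drift and fails under occlusion, which is where human correction would still be needed.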