The Segmenting and Tracking Every Pixel (STEP) benchmark consists of two training sequences and two test sequences. It builds on the MOTChallenge and Multi-Object Tracking and Segmentation (MOTS) benchmarks and extends their annotations to the STEP task by adding dense pixel-wise segmentation labels. Every pixel has a semantic label, and all pixels belonging to the most salient object class, pedestrian, have a unique tracking ID. We evaluate submitted results using the Segmentation and Tracking Quality (STQ) metric. This benchmark is part of the ICCV21 workshop Segmenting and Tracking Every Point and Pixel. The labels were updated to follow this label format.
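For readers unfamiliar with STQ: in [1] it is defined as the geometric mean of Association Quality (AQ), which measures tracking, and Segmentation Quality (SQ), which measures semantic segmentation. The sketch below shows only this final combination step. It is not the official evaluation code; the function name `stq` and the example scores are hypothetical, and computing AQ and SQ from predictions is out of scope here.

```python
import math


def stq(association_quality: float, segmentation_quality: float) -> float:
    """Combine AQ and SQ into STQ as their geometric mean (hypothetical helper)."""
    for name, value in (("AQ", association_quality), ("SQ", segmentation_quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return math.sqrt(association_quality * segmentation_quality)


# Illustrative scores only; these are not results from the benchmark.
example_score = stq(0.64, 0.81)
```

Because the two terms are multiplied, a method that does well on only one of tracking or segmentation is penalized: if either AQ or SQ is zero, STQ is zero.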

Training Set

Name | FPS | Resolution | Length | Description | Source | Ref.
STEP-ICCV21-09 | 30 | 1920x1080 | 525 (00:18) | A pedestrian street scene filmed from a low | | [1]
STEP-ICCV21-02 | 30 | 1920x1080 | 600 (00:20) | People walking around a large | | [1]
Total | | | 1125 frm. (38 s.) | | |

Test Set

Name | FPS | Resolution | Length | Description | Source | Ref.
STEP-ICCV21-07 | 30 | 1920x1080 | 500 (00:17) | A busy pedestrian street filmed at eye level by a moving camera | link | [1]
STEP-ICCV21-01 | 30 | 1920x1080 | 450 (00:15) | People walking around a large | | [1]
Total | | | 950 frm. (32 s.) | | |


Get all images (357MB)
Get labels (23MB)
Development Kit


[1] Weber, M., Xie, J., Collins, M., Zhu, Y., Voigtlaender, P., Adam, H., Green, B., Geiger, A., Leibe, B., Cremers, D., Osep, A., Leal-Taixe, L. & Chen, L.C. STEP: Segmenting and Tracking Every Pixel. arXiv:2102.11859, 2021.