Dense Spatiotemporal Annotations. This visualization shows the queries and the corresponding object bounding boxes for each frame. The top shows the current frame number and a visual representation of the number of queries (as boxes), while the bottom displays the specific query content. The full video is available here.

Workshop

Understanding actions in videos goes beyond recognizing appearances: it requires tracking objects as they perform complex behaviors over time. While recent benchmarks have made strides in temporal action localization and referring object tracking, they often treat these tasks in isolation. A gap remains in evaluating models that must both localize and track multiple objects based on action-driven, natural language queries.

At the 8th edition of the BMTT workshop, we focus on action-aware multi-object tracking, aiming to bridge the divide between vision and language by introducing unified challenges that evaluate both temporal localization and object tracking. We invite the community to examine whether current models can reason about actions, follow fine-grained language instructions, and scale to more complex, real-world scenes.

Through invited talks, the challenge, and open discussions, this workshop will explore the limitations of existing vision-language models, analyze the performance of state-of-the-art trackers and temporal localizers on action-based queries, and promote the development of more robust, multimodal video understanding systems.

User-Instructed Spatiotemporal Detections. This visualization shows the bounding boxes of the referred objects over time for a given set of queries. The top shows the current frame number and the given set of queries, and the bottom shows the period in which each object appears. The full video is available here.

Info

Time: October 19, 2025
Venue: ICCV 2025 (Honolulu, Hawai'i)
Train data released: July 11th, 2025
Test data released: July 11th, 2025
Challenge submission deadline: September 19th, 2025
Technical report deadline: September 26th, 2025
Recordings: Will be available after the workshop!

Speakers

Schedule (EST)

Time Title Speaker
08:30 - 09:00 am Workshop introduction and presentation of the new MOT25 dataset Organizers
09:00 - 09:25 am Invited Talk 1 Katerina Fragkiadaki
09:25 - 09:45 am Oral: Third place of MOT25 Participants
09:45 - 10:10 am Invited Talk 2 Philipp Krähenbühl
10:10 - 10:30 am Oral: Second place of MOT25 Participants
10:30 - 10:40 am Coffee Break -
10:40 - 11:05 am Invited Talk 3 Deva Ramanan
11:05 - 11:25 am Oral: Winner of MOT25 Participants
11:25 - 11:50 am Invited Talk 4 Pascal Fua
11:50 am - 12:15 pm Discussion, closing remarks, and awards All speakers and organizers

Competition

In this edition of the workshop, we focus on a task that combines temporal localization and multi-object tracking under action-based natural language queries. The key question we pose to the community is: can current models accurately localize and track multiple objects based solely on complex, free-form action queries?

To this end, we organize a challenging competition in which participants are asked to develop models that solve this task. The goal of the challenge is to launch a new unified task for localization and tracking based on large-scale, manually annotated action descriptions.
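
To make the expected output concrete, here is a minimal, hypothetical sketch (in Python) of what a joint prediction for a single action query could look like: one temporal window plus identity-consistent, per-frame bounding boxes. The class and field names are illustrative assumptions and do not describe the official MOT25-StAG submission format.

    # Hypothetical container for one query's prediction; NOT the official
    # MOT25-StAG submission format, just an illustration of the task output.
    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class QueryPrediction:
        query: str                    # free-form action query, e.g. "the person lifting a box"
        window: Tuple[float, float]   # predicted (start, end) of the action, in seconds
        # track_id -> list of (frame_index, x, y, w, h) boxes within the window
        tracks: Dict[int, List[Tuple[int, float, float, float, float]]] = field(default_factory=dict)

    # Example with made-up values:
    pred = QueryPrediction(
        query="the black cat jumping onto the table",
        window=(3.2, 7.8),
        tracks={0: [(96, 210.0, 140.0, 55.0, 40.0)]},  # one box on frame 96 for track 0
    )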

MOT25-StAG track

For this track, we integrated three existing datasets (OVIS, MOT17, and MOT20) into the MOT25-StAG dataset and annotated them with a variety of action-based language queries. Participants can use all annotations in MOT25-StAG and evaluate their temporal window localization and bounding box tracking methods on the MOT25-StAG test set.

Rules:
  • Participants are allowed to train their models using open-source datasets.
  • Participants have to provide a technical report and code, showing which datasets were used for training.
  • Participants are required to use the MOT25-StAG test set for evaluation and upload their results to the server. Evaluations on any other validation set can be conducted locally using our provided codebase.


Dataset: Image sequences are available at the OVIS website, MOT17 website and MOT20 website. All annotations can be found here.
Baselines: Our baseline for spatial grounding is available here, and our baseline for temporal grounding is available here.
Metrics: m-HIoU, HOTA, mIoU, R1@X, R5@X, R10@X (an illustrative sketch of the temporal metrics follows below)
Evaluation script: Evaluation Toolkit
Test server: The test server will be made available on the Codabench website on July 21st.
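
As a rough illustration of the temporal-grounding side of these metrics, the Python sketch below computes the temporal IoU between two windows, mIoU over top-1 predictions, and R{K}@X (recall at rank K with IoU threshold X). It is not the official evaluation code; the exact definitions of all metrics, including HOTA and m-HIoU, are those of the Evaluation Toolkit above, and all function and variable names here are assumptions.

    # Illustrative sketch only; the authoritative definitions live in the
    # official Evaluation Toolkit.
    from typing import List, Tuple

    def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
        """IoU between two temporal windows given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def mean_iou(top1_preds: List[Tuple[float, float]],
                 gts: List[Tuple[float, float]]) -> float:
        """mIoU: average IoU of the top-1 predicted window over all queries."""
        return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)

    def recall_at_k(ranked_preds: List[List[Tuple[float, float]]],
                    gts: List[Tuple[float, float]],
                    k: int, iou_thresh: float) -> float:
        """R{k}@X: fraction of queries whose top-k windows contain at least
        one window with IoU >= X against the ground truth."""
        hits = sum(
            any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
            for preds, gt in zip(ranked_preds, gts)
        )
        return hits / len(gts)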


For the challenge, we will award the three best benchmark submissions. Challenge winners will be invited to give a short presentation describing their approach at the workshop event.

Technical report format

Please follow a two-column layout for your submission. The technical report should be at most 4 pages, including references; shorter reports of 2 pages are very welcome. Submissions are not blind, so please include all authors on the submission. Only participants with a submitted report are eligible for the award and for presenting at the workshop. Please make your challenge entry public once submitted and state clearly which method the report belongs to. All reports should be sent to Tanveer Hannan (hannan [at] dbs . ifi . lmu . de). The deadline is September 26th, 23:59 PST.

Organizers

Tanveer Hannan (LMU/MCML)

Shuaicong Wu (LMU)

Mark Weber (TUM)

Suprosanna Shit (UZH)

Rajat Koner (Amazon)

Jindong Gu (Google/Oxford)

Aljoša Ošep (NVIDIA)

Thomas Seidl (LMU/MCML)

Laura Leal-Taixé (NVIDIA/TUM)