Dense Spatiotemporal Annotations. This visualization shows the queries and the corresponding object bounding boxes for each frame. The top shows the current frame number and a visual representation of the number of queries (as boxes), while the bottom displays the specific query content. The full video is available here.

Workshop

Understanding actions in videos goes beyond recognizing appearances: it requires tracking objects as they perform complex behaviors over time. While recent benchmarks have made strides in temporal action localization and referring object tracking, they often treat these tasks in isolation. A gap remains in evaluating models that must both localize and track multiple objects based on action-driven, natural language queries.

At the 8th edition of the BMTT workshop, we focus on action-aware multi-object tracking, aiming to bridge the divide between vision and language by introducing unified challenges that evaluate both temporal localization and object tracking. We invite the community to examine whether current models can reason about actions, follow fine-grained language instructions, and scale to more complex, real-world scenes.

Through invited talks, the challenge, and open discussions, this workshop will explore the limitations of existing vision-language models, analyze the performance of state-of-the-art trackers and temporal localizers on action-based queries, and promote the development of more robust, multimodal video understanding systems.

User-Instructed Spatiotemporal Detections. This visualization shows the bounding boxes of the referred objects over time for a given set of queries. The top shows the current frame number and the given set of queries, and the bottom shows the period in which each object appears. The full video is available here.

Info

Time: October 19, 2025
Venue: ICCV 2025 (Honolulu, Hawai'i)
Train data released: July 11th, 2025
Test data released: July 11th, 2025
Challenge submission deadline: September 19th, 2025
Technical report deadline: September 26th, 2025
Recordings: Will be available after the workshop!

Speakers

Schedule (EST)

Time Title Speaker
08:30 - 09:00 am Workshop introduction and presentation of the new MOT25 dataset Organizers
09:00 - 09:25 am Invited Talk 1 Katerina Fragkiadaki
09:25 - 09:45 am Oral: Third place of MOT25 Participants
09:45 - 10:10 am Invited Talk 2 Philipp Krähenbühl
10:10 - 10:30 am Oral: Second place of MOT25 Participants
10:30 - 10:40 am Coffee Break -
10:40 - 11:05 am Invited Talk 3 Deva Ramanan
11:05 - 11:25 am Oral: Winner of MOT25 Participants
11:25 - 11:50 am Invited Talk 4 Pascal Fua
11:50 am - 12:15 pm Discussion, closing remarks, and awards All speakers and organizers

Competition

In this edition of the workshop, we focus on a task that combines temporal localization and multi-object tracking under action-based natural language queries. The key question we pose to the community is: can current models accurately localize and track multiple objects based solely on complex, free-form action queries?

To this end, we organize a challenging competition in which participants are asked to develop models that solve this task. The goal of the challenge is to launch a new unified task for localization and tracking based on large-scale, manually annotated action descriptions.
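
To make the expected output concrete, here is a minimal, hypothetical sketch (in Python) of what a joint prediction for a single action query could look like: one temporal window plus identity-consistent, per-frame bounding boxes. The class and field names are illustrative assumptions and do not describe the official MOT25-StAG submission format.

    # Hypothetical container for one query's prediction; NOT the official
    # MOT25-StAG submission format, just an illustration of the task output.
    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class QueryPrediction:
        query: str                    # free-form action query, e.g. "the person lifting a box"
        window: Tuple[float, float]   # predicted (start, end) of the action, in seconds
        # track_id -> list of (frame_index, x, y, w, h) boxes within the window
        tracks: Dict[int, List[Tuple[int, float, float, float, float]]] = field(default_factory=dict)

    # Example with made-up values:
    pred = QueryPrediction(
        query="the black cat jumping onto the table",
        window=(3.2, 7.8),
        tracks={0: [(96, 210.0, 140.0, 55.0, 40.0)]},  # one box on frame 96 for track 0
    )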

MOT25-StAG track

For this track, we integrated three existing datasets (OVIS, MOT17, and MOT20) into the MOT25-StAG dataset and annotated them with a variety of action-based language queries. Participants can use all annotations in MOT25-StAG and evaluate their temporal window localization and bounding box tracking methods on the MOT25-StAG test set.

Rules:
  • Participants are allowed to train their models using open-source datasets.
  • Participants have to provide a technical report and code, showing which datasets were used for training.
  • Participants are required to use the MOT25-StAG test set for evaluation and upload their results to the server. Evaluations on any other validation set can be conducted locally using our provided codebase.


Dataset: Image sequences are available at the OVIS website, MOT17 website and MOT20 website. All annotations can be found here.
Baselines: Our baseline for spatial grounding is available here, and our baseline for temporal grounding is available here.
Metrics: m-HIoU, HOTA, mIoU, R1@X, R5@X, R10@X (an illustrative sketch of the temporal metrics follows below)
Evaluation script: Evaluation Toolkit
Test server: The test server will be made available on the Codabench website on July 21st.
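
As a rough illustration of the temporal-grounding side of these metrics, the Python sketch below computes the temporal IoU between two windows, mIoU over top-1 predictions, and R{K}@X (recall at rank K with IoU threshold X). It is not the official evaluation code; the exact definitions of all metrics, including HOTA and m-HIoU, are those of the Evaluation Toolkit above, and all function and variable names here are assumptions.

    # Illustrative sketch only; the authoritative definitions live in the
    # official Evaluation Toolkit.
    from typing import List, Tuple

    def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
        """IoU between two temporal windows given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def mean_iou(top1_preds: List[Tuple[float, float]],
                 gts: List[Tuple[float, float]]) -> float:
        """mIoU: average IoU of the top-1 predicted window over all queries."""
        return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)

    def recall_at_k(ranked_preds: List[List[Tuple[float, float]]],
                    gts: List[Tuple[float, float]],
                    k: int, iou_thresh: float) -> float:
        """R{k}@X: fraction of queries whose top-k windows contain at least
        one window with IoU >= X against the ground truth."""
        hits = sum(
            any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
            for preds, gt in zip(ranked_preds, gts)
        )
        return hits / len(gts)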


For the challenge, we will award the three best benchmark submissions. Challenge winners will be invited to give a short presentation describing their approach at the workshop event.

Technical report format

Please follow a two-column layout for your submission. The technical report should be at most 4 pages, including references; shorter reports of 2 pages are very welcome. Submissions are not blind, so please include all authors on the submission. Only participants with a submitted report are eligible for the award and for presenting at the workshop. Please make your challenge entry public once submitted and state clearly which method the report belongs to. All reports should be sent to Tanveer Hannan (hannan [at] dbs . ifi . lmu . de). The deadline is September 26th, 23:59 PST.

Organizers

Tanveer Hannan (LMU/MCML)

Shuaicong Wu (LMU)

Mark Weber (TUM)

Suprosanna Shit (UZH)

Rajat Koner (Amazon)

Jindong Gu (Google/Oxford)

Aljoša Ošep (NVIDIA)

Thomas Seidl (LMU/MCML)

Laura Leal-Taixé (NVIDIA/TUM)