Workshop
Understanding actions in videos goes beyond recognizing appearances: it requires tracking objects as they perform complex behaviors over time. While recent benchmarks have made strides in temporal action localization and referring object tracking, they often treat these tasks in isolation. A gap remains in evaluating models that must both localize and track multiple objects based on action-driven natural-language queries.
At the 8th edition of the BMTT workshop, we focus on action-aware multi-object tracking, aiming to bridge the divide between vision and language by introducing a unified challenge that evaluates both temporal localization and object tracking. We invite the community to examine whether current models can reason about actions, follow fine-grained language instructions, and scale to more complex, real-world scenes.
Through invited talks, the challenge, and open discussions, this workshop will explore the limitations of existing vision-language models, analyze the performance of state-of-the-art trackers and temporal localizers on action-based queries, and promote the development of more robust, multimodal video understanding systems.

Info
Time                          | October 19, 2025
Venue                         | ICCV 2025 (Honolulu, Hawai'i)
Train data released           | July 11, 2025
Test data released            | July 11, 2025
Challenge submission deadline | September 19, 2025
Technical report deadline     | September 26, 2025
Recordings                    | Will be available after the workshop!