Workshop
Understanding actions in videos goes beyond recognizing appearances: it requires tracking objects as they perform complex behaviors over time. While recent benchmarks have made strides in temporal action localization and referring object tracking, they often treat these tasks in isolation. A gap remains in evaluating models that must both localize and track multiple objects based on action-driven, natural-language queries.
At the 8th edition of the BMTT workshop, we focus on action-aware multi-object tracking, aiming to bridge the divide between vision and language by introducing unified challenges that evaluate both temporal localization and object tracking. We invite the community to examine whether current models can reason about actions, follow fine-grained language instructions, and scale to more complex, real-world scenes.
Through invited talks, the challenge, and open discussions, this workshop will explore the limitations of existing vision-language models, analyze the performance of state-of-the-art trackers and temporal localizers on action-based queries, and promote the development of more robust, multimodal video understanding systems.
Info
| Item | Details |
| --- | --- |
| Workshop date | October 19, 2025 |
| Workshop page | ICCV 2025 Workshop |
| Workshop slides | Workshop Slides |
| Venue | ICCV 2025 (Honolulu, Hawai'i) |
| Train data released | July 11, 2025 |
| Test data released | July 11, 2025 |
| Challenge submission deadline | September 19, 2025 |
| Technical report deadline | September 26, 2025 |
| Recordings | Will be available after the workshop! |