"Multigranularity Localization Transformer With Collaborative Understanding for Referring Multiobject Tracking"

As an essential component of vision-based measurement (VBM), referring multiobject tracking (RMOT) involves localizing and tracking specific objects in video frames using linguistic prompts as references. To enhance the effectiveness of linguistic prompts during training, we introduce a novel multigranularity localization transformer with collaborative understanding (MGLT). Unlike previous methods, which focus on visual-language fusion and postprocessing, MGLT reevaluates RMOT by preventing linguistic clues from attenuating during propagation. MGLT comprises two key components: multigranularity implicit query bootstrapping (MGIQB) and multigranularity track-prompt alignment (MGTPA). MGIQB ensures that tracking and linguistic features are preserved in later layers of the network by bootstrapping the model to generate text-relevant and temporally enhanced track queries. Simultaneously, MGTPA uses multigranularity linguistic prompts to enhance the model’s localization ability by understanding the relative positions of different referred objects within a frame. Extensive experiments on well-recognized benchmarks demonstrate that MGLT achieves state-of-the-art performance. Notably, it shows significant improvements on the Refer-KITTI dataset of 2.73%, 7.95%, and 3.18% in HOTA, AssA, and IDF1, respectively. The code will be available at https://github.com/JiajunChern/MGLT.

Keywords: multigranularity localization; localization transformer; collaborative understanding; localization; transformer collaborative; multigranularity

Journal Title: IEEE Transactions on Instrumentation and Measurement
Year Published: 2025
