As an essential component of vision-based measurement (VBM), referring multiobject tracking (RMOT) involves localizing and tracking specific objects in video frames using linguistic prompts as references. To enhance the effectiveness of linguistic prompts during training, we introduce a novel multigranularity localization transformer with collaborative understanding, termed MGLT. Unlike previous methods, which focus on visual-language fusion and postprocessing, MGLT revisits RMOT by preventing linguistic clues from attenuating during propagation. MGLT comprises two key components: multigranularity implicit query bootstrapping (MGIQB) and multigranularity track-prompt alignment (MGTPA). MGIQB ensures that tracking and linguistic features are preserved in the later layers of the network by bootstrapping the model to generate text-relevant and temporally enhanced track queries. Meanwhile, MGTPA uses multigranularity linguistic prompts to enhance the model’s localization ability by understanding the relative positions of different referred objects within a frame. Extensive experiments on well-recognized benchmarks demonstrate that MGLT achieves state-of-the-art performance. Notably, on the Refer-KITTI dataset it improves HOTA, AssA, and IDF1 by 2.73%, 7.95%, and 3.18%, respectively. The code will be available at https://github.com/JiajunChern/MGLT.
               