Natural language tracking aims to locate the position of a target specified by a natural language description. Existing methods are trained on vision-language datasets with a small number of language descriptions, which may lead to limited semantic generalization. Moreover, they extract visual and language features separately, which limits their visual-semantic capabilities. To overcome these limitations, we propose a novel semantic-aware tracking framework, SATrack, which integrates a semantic-aware attention module and a cross-modal aggregation module. SATrack offers two main advantages. First, the semantic-aware attention module uses language semantics as a bridge to build associations between visual features, enabling stronger visual-semantic capabilities. Second, the cross-modal aggregation module transfers the semantic knowledge of CLIP into the tracking framework to improve semantic generalization. Extensive experimental results demonstrate that SATrack outperforms previous state-of-the-art trackers on four natural language tracking benchmarks.