
Agent-Based Control Prompt Tuning for Video-Text Retrieval

Large-scale image-text pre-trained models have shown promising transferability to various downstream tasks. Video-text retrieval benefits from this by transferring pre-trained CLIP to the video-text domain. Although these pre-trained models achieve impressive performance, full fine-tuning becomes prohibitively expensive as their size grows. To address this, parameter-efficient tuning methods have been proposed, and prompt tuning is one of the most promising directions. However, existing prompt tuning methods fall short in performance due to the lack of cross-modal interaction and of prompt reliability assurance. To address these issues, we propose an effective and efficient Agent-based Control Prompt Tuning method (AbC-PT) for parameter-efficient video-text retrieval. The proposed AbC-PT enjoys several merits. First, we design a parameter-efficient agent decoder with a carefully designed consistent attention mechanism to effectively capture video temporal information, mine contextual texts, and perform cross-modal interaction between them. Second, we introduce two sets of prompts: the vanilla prompt prepended to the input tokens, and the concept prompt serving as the agent of the agent decoder. To ensure cross-modal semantic consistency of the concept prompt, we design a semantic consistency constraint loss. Third, we devise a parameter-free prompt controller that adaptively calibrates each vanilla prompt based on its semantics in a data-driven way. Extensive experiments on five challenging benchmarks demonstrate that our method not only outperforms state-of-the-art parameter-efficient tuning methods but even surpasses full fine-tuning with only 0.46% parameter overhead.
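The abstract mentions two mechanisms that can be illustrated generically: vanilla prompts prepended to the input token sequence, and a semantic consistency constraint between prompts on the two modalities. The paper's actual formulation is not given here, so the following is only a minimal NumPy sketch under assumed shapes (CLIP-like 512-d embeddings, 8 hypothetical vanilla prompts) and an assumed cosine-similarity form of the consistency loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # embedding dimension (CLIP-like; illustrative only)

def prepend_prompts(prompts, tokens):
    """Concatenate learnable prompt vectors in front of the token embeddings,
    the standard 'prepend to input tokens' form of prompt tuning."""
    return np.concatenate([prompts, tokens], axis=0)

vanilla_prompts = rng.standard_normal((8, d)) * 0.02  # 8 hypothetical learnable prompts
text_tokens = rng.standard_normal((77, d))            # e.g. a CLIP text token sequence
seq = prepend_prompts(vanilla_prompts, text_tokens)
print(seq.shape)  # (85, 512)

def cosine_consistency_loss(a, b, eps=1e-8):
    """One plausible semantic consistency constraint: 1 minus the mean cosine
    similarity between paired prompt vectors from the two modalities."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(a_n * b_n, axis=-1)))

# Identical video-side and text-side concept prompts give (near-)zero loss.
video_concept = rng.standard_normal((4, d))
text_concept = video_concept.copy()
print(round(cosine_consistency_loss(video_concept, text_concept), 6))  # 0.0
```

In a real implementation the prompt and concept vectors would be trainable parameters optimized jointly with the retrieval objective; here they are fixed arrays purely to show the shapes and the loss behavior.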

Keywords: prompt tuning; video-text retrieval; agent

Journal Title: IEEE Transactions on Circuits and Systems for Video Technology
Year Published: 2025
