Large-scale image-text pre-trained models have shown promising transferability to various downstream tasks. Video-text retrieval benefits from this by transferring pre-trained CLIP to the video-text domain. Although these pre-trained models deliver impressive performance, full fine-tuning becomes prohibitively expensive as their size grows. To address this, parameter-efficient tuning methods have been proposed, and prompt tuning is one of the most promising directions. However, existing prompt tuning methods fall short in performance due to the lack of cross-modal interaction and of guarantees on prompt reliability. To address these issues, we propose an effective and efficient Agent-based Control Prompt Tuning method (AbC-PT) for parameter-efficient video-text retrieval. The proposed AbC-PT enjoys several merits. First, we design a parameter-efficient agent decoder with a carefully designed consistent attention mechanism to effectively capture video temporal information, mine contextual texts, and perform cross-modal interaction between them. Second, we introduce two different sets of prompts: the vanilla prompt prepended to the input tokens and the concept prompt serving as the agent of the agent decoder. In addition, to ensure cross-modal semantic consistency of the concept prompt, we design a semantic consistency constraint loss. Third, we devise a parameter-free prompt controller that adaptively calibrates each vanilla prompt based on its semantics in a data-driven way. Extensive experiments on five challenging benchmarks demonstrate that our method not only outperforms state-of-the-art parameter-efficient tuning methods but even surpasses full fine-tuning, with only 0.46% parameter overhead.
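For context, the "vanilla prompt prepended to the input tokens" follows the standard prompt-tuning recipe: a small set of learnable vectors is concatenated in front of the frozen backbone's token embeddings, and only those vectors are updated during training. The following NumPy sketch illustrates just that generic prepend step; the function name, shapes, and sizes are illustrative assumptions, not details from the paper:

```python
import numpy as np

def prepend_prompts(token_embeds, prompt_embeds):
    """Prepend learnable prompt vectors to a batch of token embeddings.

    token_embeds:  (batch, seq_len, dim)  -- embeddings from the frozen backbone
    prompt_embeds: (num_prompts, dim)     -- the small set of trainable parameters
    returns:       (batch, num_prompts + seq_len, dim)
    """
    batch = token_embeds.shape[0]
    # Share the same prompt vectors across every example in the batch.
    prompts = np.broadcast_to(prompt_embeds, (batch,) + prompt_embeds.shape)
    return np.concatenate([prompts, token_embeds], axis=1)

# Illustrative sizes: 2 sentences of 5 tokens, 512-d embeddings, 4 prompt vectors.
tokens = np.random.randn(2, 5, 512)
prompts = np.random.randn(4, 512)
out = prepend_prompts(tokens, prompts)
print(out.shape)  # (2, 9, 512)
```

Because only `prompt_embeds` would receive gradients in an actual training loop, the trainable parameter count stays tiny relative to the frozen backbone, which is what makes this family of methods parameter-efficient.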