
Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval


Video-text retrieval is a crucial task that powers multimedia data analysis and has attracted tremendous interest in the research community. Its core steps are feature representation and alignment, which bridge the heterogeneous gap between videos and texts. Existing methods not only take advantage of multi-modal information in videos but also explore local alignment to enhance retrieval accuracy. Although they perform well, these methods are deficient in three respects: a) the semantic correlations between different modal features are not considered, which introduces irrelevant noise into the feature representations; b) cross-modal relations and temporal associations are learned ambiguously by a single self-attention operation; and c) there is no training signal to optimize the semantic topic assignment used for local alignment. In this paper, we propose a novel Temporal Multi-modal Graph Transformer with Global-Local Alignment (TMMGT-GLA) for video-text retrieval. We model the input video as a sequence of semantic correlation graphs to exploit the structural information between multi-modal features. Graph and temporal self-attention layers are applied on the semantic correlation graphs to learn cross-modal relations and temporal associations, respectively. For local alignment, the encoded video and text features are assigned to a set of shared semantic topics, and the distances between residuals belonging to the same topic are minimized. To optimize the assignments, a minimum-entropy regularization term is proposed for training the overall framework. Experiments are conducted on the MSR-VTT, LSMDC, and ActivityNet Captions datasets, where our method outperforms previous approaches by a large margin and achieves state-of-the-art performance.
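
The abstract gives no implementation details, but the pipeline it describes (graph self-attention within each frame's semantic correlation graph, temporal self-attention across frames, topic-residual local alignment, and a minimum-entropy regularizer on the topic assignments) can be sketched roughly. The PyTorch snippet below is an illustrative assumption only: all module names, tensor shapes, and hyperparameters are invented for exposition and are not the authors' implementation.

```python
# Hypothetical sketch of the TMMGT-GLA building blocks named in the abstract.
# Shapes, names, and hyperparameters are assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTemporalBlock(nn.Module):
    """One encoder block: graph self-attention over the multi-modal nodes of
    each frame's semantic correlation graph, then temporal self-attention
    over the frame sequence for each node."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.graph_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, adj_mask):
        # x: (batch, time, nodes, dim); adj_mask: (nodes, nodes) boolean mask
        # derived from the semantic correlation graph (True = edge blocked).
        b, t, n, d = x.shape
        g = x.reshape(b * t, n, d)                       # attend within a frame
        g = self.norm1(g + self.graph_attn(g, g, g, attn_mask=adj_mask)[0])
        s = g.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        s = self.norm2(s + self.temp_attn(s, s, s)[0])   # attend across time
        return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

def topic_residuals(features, topics, assign):
    """Aggregate assignment-weighted residuals against shared topic centers;
    local alignment would minimize the distance between the video and text
    residuals of the same topic."""
    # features: (m, d); topics: (k, d); assign: (m, k) soft assignment.
    diff = features.unsqueeze(0) - topics.unsqueeze(1)   # (k, m, d)
    return (assign.t().unsqueeze(-1) * diff).sum(1)      # (k, d)

def entropy_regularizer(assign_logits):
    """Minimum-entropy regularization: adding this term to the loss pushes
    each feature's distribution over the shared semantic topics toward a
    confident (low-entropy) assignment."""
    p = F.softmax(assign_logits, dim=-1)                 # (features, topics)
    return -(p * torch.log(p + 1e-8)).sum(-1).mean()

# Toy usage: 2 clips, 16 frames, 5 modal nodes, fully connected graph.
block = GraphTemporalBlock()
video = torch.randn(2, 16, 5, 512)
mask = torch.zeros(5, 5, dtype=torch.bool)
out = block(video, mask)                                 # (2, 16, 5, 512)
```

Under this reading, the overall training objective would combine the global retrieval loss, the local topic-residual distances, and a weighted entropy term; the exact formulation is given in the paper itself.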

Keywords: video text; text retrieval; video; local alignment

Journal Title: IEEE Transactions on Circuits and Systems for Video Technology
Year Published: 2023
