Cross-modal image-text retrieval is an important vision-and-language task that models the similarity of image-text pairs by embedding their features into a shared space for alignment. To bridge the heterogeneous gap between the two modalities, current approaches achieve inter-modal alignment and intra-modal semantic relationship modeling through complex weighted combinations of items. In both the intra-modal association and the inter-modal interaction processes, items with higher weights contribute more to the global semantics. However, the same item always contributes differently in the two processes, because most traditional approaches focus only on alignment. This usually results in semantic changes and misalignment. To address this issue, this paper proposes Cross-modal Semantic Importance Consistency (CSIC), which keeps the semantics of items invariant during alignment. The proposed technique measures the semantic importance of items obtained from intra-modal and inter-modal self-attention, and learns a more reasonable representation vector by inter-calibrating the two importance distributions. We conducted extensive experiments on the Flickr30K and MS COCO datasets. The results show that our approach significantly improves retrieval performance, demonstrating its superiority and rationality.
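To make the idea of importance consistency concrete, here is a minimal sketch in Python. It is not the paper's formulation: the abstract does not give the loss, so the symmetric KL divergence, the function names, and the example scores below are all assumptions chosen for illustration. The sketch treats the attention scores that one set of items (image regions or words) receives in the intra-modal and inter-modal processes as two importance distributions and measures how far apart they are.

```python
import math


def softmax(scores):
    """Turn raw attention scores into an importance distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def importance_consistency(intra_scores, inter_scores):
    """Hypothetical consistency penalty: symmetric KL between the
    intra-modal and inter-modal importance distributions of the same items.
    This is an assumed stand-in, not the loss used in the paper."""
    p = softmax(intra_scores)
    q = softmax(inter_scores)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Made-up attention scores for four items (e.g. regions or words).
intra = [2.0, 1.0, 0.5, 0.1]
inter = [0.1, 0.5, 1.0, 2.0]

print(importance_consistency(intra, intra))  # same ranking in both processes
print(importance_consistency(intra, inter))  # reversed ranking
```

Under this assumed penalty, an item set whose importance ranking flips between the two processes is penalized, while one with identical distributions is not, which is the kind of mismatch the abstract attributes to semantic changes and misalignment.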
               