As a fundamental branch of cross-modal retrieval, image-text retrieval remains a challenging problem, largely due to the complementary and imbalanced relationship between different modalities. Existing works have not effectively scanned and aligned the semantic units distributed across different granularities of images and texts. To address these issues, we propose a dual-branch foreground-background fusion network (FB-Net), which fully explores and fuses the complementarity of semantic units collected from the foreground and background areas of instances (i.e., images and texts). Firstly, to generate multi-granularity semantic units from images and texts, multi-scale semantic scanning is conducted on both foreground and background areas through multi-level overlapped sliding windows. Secondly, to align semantic units between images and texts, a stacked cross-attention mechanism is used to calculate the initial image-text similarity. Thirdly, to further optimize the image-text similarity adaptively, a dynamically self-adaptive weighted loss is designed. Finally, to perform image-text retrieval, the similarities between multi-granularity foreground and background semantic units are fused to obtain the final image-text similarity. Experimental results show that the proposed FB-Net outperforms representative state-of-the-art methods for image-text retrieval, and ablation studies further verify the effectiveness of each component in FB-Net.
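
The sketch below illustrates, under stated assumptions, how a stacked cross-attention similarity between region-level and word-level semantic units might be computed and how foreground and background similarities could be fused into a final score. It follows the general stacked cross-attention formulation rather than FB-Net's exact implementation; the function names, the smoothing factor, and the convex fusion weight `alpha` are hypothetical choices for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """L2-normalize vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def stacked_cross_attention_similarity(regions, words, smooth=9.0):
    """Text-to-image stacked cross-attention similarity (SCAN-style sketch).

    regions: (k, d) array of image semantic-unit features
    words:   (n, d) array of text semantic-unit features
    Returns a scalar image-text similarity.
    """
    regions = l2_normalize(regions)
    words = l2_normalize(words)

    # Region-word cosine similarities, thresholded at zero and
    # normalized per word (one common stacked cross-attention variant).
    s = words @ regions.T                  # (n, k)
    s = np.maximum(s, 0.0)
    s = l2_normalize(s, axis=1)

    # Attention over regions for each word, then attended image vectors.
    attn = np.exp(smooth * s)
    attn = attn / attn.sum(axis=1, keepdims=True)   # (n, k)
    attended = attn @ regions                        # (n, d)

    # Word-level relevance = cosine(word, attended vector); pool by mean.
    rel = np.sum(words * l2_normalize(attended), axis=1)
    return float(rel.mean())


def fused_similarity(fg_units, bg_units, text_units, alpha=0.5):
    """Hypothetical fusion of foreground and background similarities
    (a simple convex combination; the actual fusion in FB-Net may differ)."""
    s_fg = stacked_cross_attention_similarity(fg_units, text_units)
    s_bg = stacked_cross_attention_similarity(bg_units, text_units)
    return alpha * s_fg + (1.0 - alpha) * s_bg


# Toy usage with random 256-dimensional features.
rng = np.random.default_rng(0)
fg = rng.normal(size=(36, 256))   # foreground semantic units
bg = rng.normal(size=(12, 256))   # background semantic units
txt = rng.normal(size=(20, 256))  # word-level semantic units
print(fused_similarity(fg, bg, txt))
```

In this sketch the two branches share the same attention machinery and differ only in the semantic units they receive, with the final score obtained by weighting the branch similarities; the multi-granularity sliding-window extraction and the dynamically self-adaptive weighted loss described in the abstract are not reproduced here.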
               