Image–text retrieval is a challenging task that aims to bridge the modality gap between vision and language. Although mainstream late-fusion schemes capture intramodality correlations well, they incur a heavy computational burden and achieve insufficient intermodal alignment. In this work, we propose the comprehensive framework of early and late fusion (CFELF), a universal framework that makes early fusion collaborate with late fusion. To enhance cross-modal correspondence, CFELF fuses local visual regions with global sentence representations at an early stage and aggregates the result on late-fusion backbones. The two fusion phases thus complement each other, capturing salient intramodality information while encouraging intermodal alignment. We extensively evaluate CFELF on four advanced late-fusion backbones and compare it with other early-fusion modules. Results on two public image–text datasets demonstrate that the comprehensive fusion framework improves retrieval performance while accelerating convergence.
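The abstract does not specify CFELF's early-fusion mechanism. As an illustrative assumption only, a minimal sketch of the general idea — re-weighting local visual region features by their relevance to a global sentence embedding before passing them to a late-fusion backbone — might look like the following (the function `early_fuse` and its attention form are hypothetical, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def early_fuse(regions, sentence):
    """Hypothetical early fusion of local visual regions with a global sentence.

    regions:  (n_regions, d) local region features extracted from an image
    sentence: (d,) global sentence embedding
    Returns region features with sentence context injected, which a
    late-fusion backbone could then consume as usual.
    """
    d = regions.shape[1]
    # relevance of each region to the sentence (scaled dot product)
    scores = regions @ sentence / np.sqrt(d)          # (n_regions,)
    weights = softmax(scores)                         # attention over regions
    # add sentence context to each region, weighted by its relevance
    fused = regions + weights[:, None] * sentence[None, :]
    return fused

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 128))   # e.g., 36 detected regions
sentence = rng.standard_normal(128)
fused = early_fuse(regions, sentence)
print(fused.shape)  # (36, 128)
```

Because the fused features keep the same shape as the original region features, such a module could in principle be dropped in front of different late-fusion backbones, which is consistent with the "universal framework" claim.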