Comprehensive Framework of Early and Late Fusion for Image–Sentence Retrieval

Image–text retrieval is a challenging task that requires bridging the modality gap between vision and language. Although mainstream late fusion schemes facilitate intramodality correlations, they impose a heavy computational burden and yield insufficient intermodal alignment. In this work, we propose a comprehensive framework of early and late fusion (CFELF), a universal framework that combines early fusion with late fusion. To enhance cross-modal correspondence, CFELF fuses local visual regions with global sentences at the early stage and aggregates the result on late fusion backbones. The fusions at the two stages of feature processing are therefore complementary: they capture salient intramodal information while encouraging intermodal alignment. We extensively evaluated CFELF on four advanced late fusion backbones and compared it with other early fusion modules. Results on two public image–text datasets show that the comprehensive fusion framework improves retrieval performance and accelerates convergence.
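To make the distinction between the two fusion stages concrete, the following is a minimal toy sketch in plain Python. It is not the CFELF architecture: the attention-style pooling, the mean pooling, the mixing weight `alpha`, and every function name are assumptions introduced for illustration only. In particular, the sketch simply averages two similarity scores, whereas the abstract describes feeding early-fused features into a late fusion backbone.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(regions):
    """Late-fusion-style image embedding: pooled without seeing the sentence."""
    n = len(regions)
    return [sum(col) / n for col in zip(*regions)]

def early_fuse(regions, sentence):
    """Hypothetical early fusion: weight each local region by its affinity
    to the global sentence vector, then pool the weighted regions."""
    weights = softmax([dot(r, sentence) for r in regions])
    dim = len(sentence)
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]

def late_score(regions, sentence):
    """Similarity between independently encoded image and sentence."""
    return cosine(mean_pool(regions), sentence)

def combined_score(regions, sentence, alpha=0.5):
    """Assumed combination rule: a convex mix of the two similarities."""
    early = cosine(early_fuse(regions, sentence), sentence)
    return alpha * early + (1 - alpha) * late_score(regions, sentence)

# Toy data: two region features and one sentence feature, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentence = [1.0, 0.0]
print(late_score(regions, sentence), combined_score(regions, sentence))
```

In this toy setting the sentence-guided pooling shifts weight toward the region that matches the sentence, so the early-fused similarity is higher than the sentence-agnostic late score. That is the kind of intermodal alignment the abstract attributes to the early stage.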

Keywords: fusion; framework early; comprehensive framework; late fusion; image

Journal Title: IEEE MultiMedia
Year Published: 2022
