Image–text retrieval is a challenging task that aims to bridge the modality gap between vision and language. Although mainstream late-fusion schemes capture intramodality correlations well, they incur a heavy computational burden and achieve insufficient intermodal alignment. In this work, we propose the comprehensive framework of early and late fusion (CFELF), a universal framework that makes early fusion collaborate with late fusion. To enhance cross-modal correspondence, CFELF fuses local visual regions with global sentence representations at an early stage and aggregates the result on late-fusion backbones. The two fusion phases thus complement each other, capturing salient intramodality information while encouraging intermodal alignment. We extensively evaluate CFELF on four advanced late-fusion backbones and compare it with other early-fusion modules. Results on two public image–text datasets demonstrate that the comprehensive fusion framework improves retrieval performance while accelerating convergence.
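The abstract does not specify CFELF's early-fusion mechanism. As an illustrative assumption only, a minimal sketch of the general idea — re-weighting local visual region features by their relevance to a global sentence embedding before passing them to a late-fusion backbone — might look like the following (the function `early_fuse` and its attention form are hypothetical, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def early_fuse(regions, sentence):
    """Hypothetical early fusion of local visual regions with a global sentence.

    regions:  (n_regions, d) local region features extracted from an image
    sentence: (d,) global sentence embedding
    Returns region features with sentence context injected, which a
    late-fusion backbone could then consume as usual.
    """
    d = regions.shape[1]
    # relevance of each region to the sentence (scaled dot product)
    scores = regions @ sentence / np.sqrt(d)          # (n_regions,)
    weights = softmax(scores)                         # attention over regions
    # add sentence context to each region, weighted by its relevance
    fused = regions + weights[:, None] * sentence[None, :]
    return fused

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 128))   # e.g., 36 detected regions
sentence = rng.standard_normal(128)
fused = early_fuse(regions, sentence)
print(fused.shape)  # (36, 128)
```

Because the fused features keep the same shape as the original region features, such a module could in principle be dropped in front of different late-fusion backbones, which is consistent with the "universal framework" claim.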