
Regularizing Visual Semantic Embedding With Contrastive Learning for Image-Text Matching

Learning visual semantic embedding for image-text matching has achieved great success by using a triplet loss to pull together positive image-text pairs, which share similar semantic meaning, and to push apart negative image-text pairs, which have different semantic meanings. Without modeling constraints from image-image or text-text pairs, the resulting visual semantic embedding inevitably suffers from semantic misalignment among similar images or among similar texts. To solve this problem, we present a contrastive visual semantic embedding framework, named ConVSE, which achieves intra-modal semantic alignment through contrastive learning on augmented image-image (or text-text) pairs and achieves inter-modal semantic alignment by applying a hardest-negative-enhanced triplet loss to image-text pairs. To the best of our knowledge, we are the first to show that contrastive learning benefits visual semantic embedding. Extensive experiments on the large-scale MSCOCO and Flickr30K datasets verify the effectiveness of the proposed ConVSE, which outperforms existing visual semantic embedding-based methods and achieves a new state of the art.
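
The two loss components named in the abstract are common building blocks, so a minimal illustrative sketch is possible even though the paper's exact formulation is not given here. The PyTorch code below shows (a) a triplet loss with in-batch hardest negatives over image-text pairs and (b) an InfoNCE-style contrastive loss over augmented same-modality pairs; the function names, the margin and temperature values, and the assumption of L2-normalized embeddings are ours, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Triplet loss with in-batch hardest negatives (VSE++-style sketch).

    img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each
    tensor is assumed to be a matching image-text pair.
    """
    scores = img_emb @ txt_emb.t()        # (B, B) cosine similarities
    pos = scores.diag()                   # scores of the positive pairs

    # Exclude the positives when searching for the hardest negatives.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    masked = scores.masked_fill(mask, float("-inf"))
    hardest_txt = masked.max(dim=1).values   # hardest caption for each image
    hardest_img = masked.max(dim=0).values   # hardest image for each caption

    loss_i2t = F.relu(margin + hardest_txt - pos)
    loss_t2i = F.relu(margin + hardest_img - pos)
    return (loss_i2t + loss_t2i).mean()

def intra_modal_contrastive_loss(view_a, view_b, temperature=0.07):
    """InfoNCE-style loss on two augmented views of the same modality
    (image-image or text-text pairs), pulling row-aligned views together."""
    logits = (view_a @ view_b.t()) / temperature   # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

In this reading, the overall training objective would combine the inter-modal triplet term with the two intra-modal contrastive terms, for example as a weighted sum; the abstract does not specify how ConVSE weights them.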

Keywords: image-text; semantic embedding; visual semantic; text pairs; image

Journal Title: IEEE Signal Processing Letters
Year Published: 2022
