In recent years, several retrieval methods for measuring the similarity between images and texts have been proposed. Although most of these methods are efficient, scalar cosine similarities may not be expressive enough to fully capture the intricate matching patterns between visual and textual features. In addition, hybrid methods integrate global and local matching similarities empirically, which reduces interpretability. This letter proposes a novel Multi-Level Matching Network (MLMN) that learns and integrates vector-based multi-level matching features. Two vector-based matching branches are first designed to learn more powerful matching features. An interpretable matching integration strategy is also proposed, which adaptively integrates the learned matching features according to the global matching information. Moreover, image-text retrieval is further cast as a binary classification problem, and the MLMN is trained with a binary cross-entropy loss with hardest negatives. Experiments on the MSCOCO and Flickr30K datasets demonstrate that MLMN outperforms state-of-the-art methods.
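To make the training objective concrete, the following is a minimal NumPy sketch of a binary cross-entropy loss with hardest negatives over an N×N image-text match-probability matrix. The function name, the assumption that the network outputs per-pair match probabilities with matched pairs on the diagonal, and the averaging over both retrieval directions are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def bce_hardest_negative_loss(match_probs: np.ndarray, eps: float = 1e-8) -> float:
    """Hedged sketch of BCE with hardest negatives.

    match_probs: N x N matrix of predicted match probabilities in (0, 1);
    diagonal entries are the matched (positive) image-text pairs,
    off-diagonal entries are mismatched (negative) pairs.
    """
    pos = np.diag(match_probs)          # probabilities of the matched pairs
    neg = match_probs.copy()
    np.fill_diagonal(neg, -np.inf)      # exclude positives from the negative pool
    hardest_i2t = neg.max(axis=1)       # hardest text negative for each image
    hardest_t2i = neg.max(axis=0)       # hardest image negative for each text
    # BCE: push positives toward 1, hardest negatives toward 0
    loss = (-np.log(pos + eps)
            - np.log(1.0 - hardest_i2t + eps)
            - np.log(1.0 - hardest_t2i + eps))
    return float(loss.mean())
```

Mining only the hardest negative per row and column (rather than summing over all negatives) focuses the gradient on the most confusable mismatched pairs, a common choice in image-text retrieval losses.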
               