Adaptive Semantic-Enhanced Transformer for Image Captioning.

In image captioning, rich semantic information serves as guidance for generating the key words of a caption. However, semantic information from offline object detectors includes many objects that do not appear in the caption, which introduces noise into the decoding process. To produce more accurate semantic guidance and further optimize decoding, we propose an end-to-end adaptive semantic-enhanced transformer (AS-Transformer) model for image captioning. For semantic-enhancement information extraction, we propose a constrained weakly supervised learning (CWSL) module, which uses a joint loss function to reconstruct the probability distribution of the semantic objects detected by multiple instance learning (MIL). The strengthened semantic objects drawn from the reconstructed probability distribution better depict the semantic meaning of images. For semantic-enhancement decoding, we propose an adaptive gated mechanism (AGM) module that adaptively adjusts the attention between visual and semantic information to generate caption words more accurately. Through the joint control of the CWSL and AGM modules, the proposed model forms a complete adaptive enhancement mechanism from encoding to decoding and obtains visual context better suited to captions. Experiments on the public Microsoft Common Objects in COntext (MSCOCO) and Flickr30K datasets show that the proposed AS-Transformer adaptively obtains effective semantic information and automatically adjusts the attention weights between semantic and visual information, producing more accurate captions than other semantic enhancement methods and outperforming state-of-the-art methods.
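
The abstract gives no equations, so the following is only an illustrative sketch of the two ideas it names, not the authors' formulation. It assumes a noisy-OR pooling rule for MIL (a common choice for image-level word detection) and a simple element-wise sigmoid gate for mixing visual and semantic context. The function names `noisy_or` and `gated_fusion`, and the idea of passing the gate logits in directly, are hypothetical; in a trained model the logits would presumably come from learned weights applied to the decoder state.

```python
import math
from typing import List, Sequence


def noisy_or(region_probs: Sequence[float]) -> float:
    """Image-level probability that a word applies, pooled over regions.

    Assumption: noisy-OR pooling, i.e. the word is present if at least
    one region supports it. The paper's MIL pooling may differ.
    """
    miss = 1.0
    for p in region_probs:
        miss *= 1.0 - p  # probability that every region so far misses the word
    return 1.0 - miss


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(
    visual: Sequence[float],
    semantic: Sequence[float],
    gate_logits: Sequence[float],
) -> List[float]:
    """Mix visual and semantic context element-wise with a sigmoid gate.

    Hypothetical form: out = g * visual + (1 - g) * semantic, where
    g = sigmoid(gate_logit). A gate near 1 favors the visual features,
    a gate near 0 favors the semantic features.
    """
    if not (len(visual) == len(semantic) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for v, s, z in zip(visual, semantic, gate_logits):
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * s)
    return fused
```

Under these assumptions, a word supported by two regions at probability 0.5 each gets an image-level probability of 0.75, and a gate logit of 0 weights the visual and semantic context equally.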

Keywords: information; semantic information; adaptive semantic; semantic enhanced; image captioning

Journal Title: IEEE Transactions on Neural Networks and Learning Systems
Year Published: 2022
