Image captioning has shown encouraging results with Transformer-based architectures, which typically rely on attention to establish semantic associations between objects in an image for caption prediction. However, when the appearance features of objects exhibit low interdependence, attention-based methods struggle to capture the semantic associations between them. Addressing this problem often requires knowledge beyond the task-specific dataset in order to produce more precise and meaningful captions. In this article, a semantic attention network is proposed to incorporate general-purpose knowledge into a Transformer attention block. The design fuses the visual and semantic properties of an image's internal knowledge in a single block, providing a reference point that aids the learning of vision-language alignments and improves visual attention and semantic association. The proposed framework is validated on the Microsoft COCO dataset, and experimental results demonstrate competitive performance against the current state of the art.
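The abstract does not specify the exact fusion mechanism; the following is a minimal sketch of one plausible way to let visual region features attend over concatenated visual and semantic (external-knowledge) embeddings inside a Transformer attention block. All module names, dimensions, and the concatenation strategy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: fusing visual and semantic features in one attention block.
import torch
import torch.nn as nn


class SemanticAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        # Project semantic (e.g., concept or word) embeddings into the same
        # space as the visual region features.
        self.sem_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual:   (batch, num_regions, d_model) appearance features
        # semantic: (batch, num_concepts, d_model) external knowledge embeddings
        semantic = self.sem_proj(semantic)
        # Concatenate both sources so queries over visual regions can attend
        # to semantic concepts as well, acting as a shared reference point.
        memory = torch.cat([visual, semantic], dim=1)
        fused, _ = self.attn(query=visual, key=memory, value=memory)
        return self.norm(visual + fused)  # residual connection


if __name__ == "__main__":
    block = SemanticAttentionBlock()
    v = torch.randn(2, 36, 512)   # e.g., 36 detected region features
    s = torch.randn(2, 10, 512)   # e.g., 10 retrieved concept embeddings
    print(block(v, s).shape)      # torch.Size([2, 36, 512])
```

In this sketch, appending the projected semantic embeddings to the attention memory lets weakly interdependent visual features borrow associations from the external knowledge; the actual paper may fuse the two modalities differently.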