"TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption"

Image captioning in remote sensing can help us understand the inner attributes of objects and the outer relationships between different objects. However, existing image captioning algorithms lack global representation ability and cannot capture object relationships over long distances. In addition, these algorithms generate captions randomly, without considering specific demands. To this end, we propose a pure transformer architecture with a caption-type controller (CTC) for remote sensing image captioning (RSIC). Specifically, a multiscale vision transformer is adopted for image representation, where both global and detailed content can be captured with multihead self-attention (MSA) layers. A transformer decoder is then introduced to successively translate the image features into comprehensive sentences. An optional block called the CTC is designed to account for caption types through caption-type matrix sets chosen according to the demands, embedding the required type into the learnable sentence feature. Comparison and ablation experiments conducted on the RSIC dataset (RSICD) demonstrate that the proposed framework outperforms current state-of-the-art image captioning methods. Experiments conducted on the FloodNet caption dataset further illustrate that the proposed method can effectively generate specific types of captions.
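The two ideas the abstract leans on, global interaction through self-attention and conditioning the sentence feature on a caption type, can be illustrated with a small sketch. This is not the TypeFormer implementation: the single attention head, the identity query/key/value projections, the additive type embedding, and the names `self_attention`, `add_caption_type`, and `type_table` are all assumptions made for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections are used instead of learned
    query/key/value matrices, to keep the sketch short.
    """
    dim = len(tokens[0])
    weights = []
    for q in tokens:
        # Every token is scored against every other token, so distant
        # image patches can interact directly.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def add_caption_type(word_embeddings, type_table, type_id):
    """Hypothetical type conditioning: add one type vector to every word embedding."""
    type_vec = type_table[type_id]
    return [[w + t for w, t in zip(word, type_vec)] for word in word_embeddings]


random.seed(0)
dim = 4
# Six stand-in patch tokens; a real encoder would produce these from the image.
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
features, weights = self_attention(patches)

# Three hypothetical caption types, each with its own embedding vector.
type_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
typed_words = add_caption_type(words, type_table, type_id=1)
```

Each row of `weights` is a distribution over all patches, which is what lets attention relate objects regardless of their distance in the image. The additive type vector is only one plausible way to inject a requested caption type; the abstract describes the CTC in terms of caption-type matrix sets and does not specify the mechanism at this level of detail.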

Keywords: caption; remote sensing; type; image captioning; transformer; image

Journal Title: IEEE Geoscience and Remote Sensing Letters
Year Published: 2022
