Image captioning in remote sensing can help us understand the inner attributes of objects and the outer relationships between different objects. However, existing image captioning algorithms lack the ability to build global representations and cannot capture object relationships over long distances. In addition, these algorithms generate captions indiscriminately, without considering specific demands. To this end, we propose a pure transformer architecture with a caption-type controller (CTC) for remote sensing image captioning (RSIC). Specifically, a multiscale vision transformer is adopted for image representation, where global and detailed content is captured with multihead self-attention (MSA) layers. A transformer decoder is then introduced to successively translate the image features into comprehensive sentences. An optional block, the CTC, is designed to account for the required caption type through caption-type matrix sets chosen according to the demand, embedding the learnable sentence feature with the required type. Comparison and ablation experiments conducted on the RSIC dataset (RSICD) demonstrate that the proposed framework outperforms current state-of-the-art image captioning methods. Experiments conducted on the FloodNet caption dataset further illustrate that the proposed method can effectively generate specific types of captions.
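
As an illustration only (not the authors' released code): below is a minimal PyTorch sketch of how a caption-type controller might condition a transformer decoder, by adding a caption-type embedding to a learnable sentence token that is prepended to the word sequence. The class names, dimensions, and the embedding-based form of the "caption-type matrix sets" are assumptions made for this sketch; the encoder output is stubbed with random patch tokens standing in for the multiscale vision transformer features.

```python
# Hypothetical sketch of a type-controlled caption decoder (assumptions noted above).
import torch
import torch.nn as nn


class CaptionTypeController(nn.Module):
    """Offsets a learnable sentence token by a caption-type embedding (assumed CTC form)."""

    def __init__(self, num_types: int, d_model: int):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, d_model)            # stand-in for caption-type matrices
        self.sentence_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable sentence feature

    def forward(self, batch_size: int, type_ids: torch.Tensor) -> torch.Tensor:
        # (batch, 1, d_model): sentence token conditioned on the requested caption type
        return self.sentence_token.expand(batch_size, -1, -1) + self.type_embed(type_ids).unsqueeze(1)


class ControlledCaptionDecoder(nn.Module):
    def __init__(self, vocab_size: int, num_types: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 4):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.ctc = CaptionTypeController(num_types, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens: torch.Tensor, word_ids: torch.Tensor,
                type_ids: torch.Tensor) -> torch.Tensor:
        # Prepend the type-conditioned sentence token to the embedded caption words.
        tgt = torch.cat([self.ctc(word_ids.size(0), type_ids),
                         self.word_embed(word_ids)], dim=1)
        # Causal mask so each position attends only to earlier positions.
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1), dtype=torch.bool), diagonal=1)
        hidden = self.decoder(tgt, image_tokens, tgt_mask=causal)
        return self.out(hidden[:, 1:])  # drop the sentence-token position


# Usage: image_tokens would come from the multiscale vision transformer encoder.
model = ControlledCaptionDecoder(vocab_size=10_000, num_types=3)
img_tokens = torch.randn(2, 49, 512)              # assumed encoder output (patch tokens)
words = torch.randint(0, 10_000, (2, 20))         # shifted ground-truth caption tokens
logits = model(img_tokens, words, type_ids=torch.tensor([0, 2]))
print(logits.shape)                               # torch.Size([2, 20, 10000])
```

The design choice sketched here is that the caption type acts as a global bias on the sentence-level query rather than on every word token, so the same decoder weights can produce different caption styles on demand.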
               