The recent vision transformer (ViT) has a stronger contextual feature representation capability than existing convolutional neural networks and thus has the potential to depict remote sensing scenes, which usually have more complicated object distributions and spatial arrangements than ground image scenes. However, recent research shows that while ViT learns global features, it tends to ignore key local features, which poses a bottleneck for understanding remote sensing scenes. In this letter, we tackle this challenge by proposing a novel multi-instance vision transformer (MITformer). Its originality mainly lies in the classic multiple instance learning (MIL) formulation, in which each image patch embedded by the ViT is regarded as an instance and each image as a bag. The benefit of designing the ViT under the MIL formulation is straightforward, as it helps highlight the feature response of key local regions of remote sensing scenes. Moreover, to enhance the propagation of local features, an attention-based multilayer perceptron (AMLP) head is embedded at the end of each encoder unit. Finally, to minimize the potential semantic prediction differences between the classic ViT head and our MIL head, a semantic consistency loss is designed. Experiments on three remote sensing scene classification benchmarks show that the proposed MITformer outperforms existing state-of-the-art methods and validate the effectiveness of each component of MITformer.
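To make the core idea concrete, the sketch below shows one way an attention-based MIL head over ViT patch tokens and a semantic consistency loss could be wired up in PyTorch. This is a hypothetical illustration, not the authors' released implementation: the class and function names (`AttentionMILHead`, `semantic_consistency_loss`), the hidden sizes, and the symmetric-KL form of the consistency loss are all assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMILHead(nn.Module):
    """Hypothetical attention-based MIL pooling over ViT patch tokens.

    Each patch embedding is treated as an instance and the whole image as a
    bag; a learned attention weight highlights key local regions before
    bag-level classification. Details differ from the paper's AMLP head.
    """

    def __init__(self, dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) -- the instances of each bag
        scores = self.attention(patch_tokens)               # (B, N, 1)
        weights = torch.softmax(scores, dim=1)               # attention over instances
        bag_feature = (weights * patch_tokens).sum(dim=1)    # (B, dim) bag embedding
        return self.classifier(bag_feature)                  # bag-level class logits


def semantic_consistency_loss(vit_logits: torch.Tensor,
                              mil_logits: torch.Tensor) -> torch.Tensor:
    # Symmetric KL divergence between the ViT head and the MIL head;
    # the paper's exact loss formulation may differ.
    p = F.log_softmax(vit_logits, dim=-1)
    q = F.log_softmax(mil_logits, dim=-1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))


if __name__ == "__main__":
    # Toy usage: 4 images, 196 patch tokens of dimension 768, 30 scene classes.
    tokens = torch.randn(4, 196, 768)
    head = AttentionMILHead(dim=768, num_classes=30)
    mil_logits = head(tokens)
    vit_logits = torch.randn(4, 30)  # stand-in for the classic ViT [CLS] head
    loss = semantic_consistency_loss(vit_logits, mil_logits)
    print(mil_logits.shape, loss.item())
```

The attention pooling replaces the usual [CLS]-token readout with an instance-weighted average, which is the standard way MIL formulations expose which local regions drive the bag-level decision.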