The recent vision transformer (ViT) has a stronger contextual feature representation capability than existing convolutional neural networks and thus has the potential to depict remote sensing scenes, which usually have more complicated object distributions and spatial arrangements than ground image scenes. However, recent research shows that while ViT learns global features, it tends to ignore key local features, which poses a bottleneck for understanding remote sensing scenes. In this letter, we tackle this challenge by proposing a novel multi-instance vision transformer (MITformer). Its originality mainly lies in the classic multiple instance learning (MIL) formulation, in which each image patch embedded by the ViT is regarded as an instance and each image as a bag. The benefit of designing the ViT under the MIL formulation is straightforward, as it helps highlight the feature response of key local regions of remote sensing scenes. Moreover, to enhance the propagation of local features, an attention-based multilayer perceptron (AMLP) head is embedded at the end of each encoder unit. Finally, to minimize the potential semantic prediction differences between the classic ViT head and our MIL head, a semantic consistency loss is designed. Experiments on three remote sensing scene classification benchmarks show that the proposed MITformer outperforms existing state-of-the-art methods and validate the effectiveness of each component of MITformer.
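To make the core idea concrete, the sketch below shows one way an attention-based MIL head over ViT patch tokens and a semantic consistency loss could be wired up in PyTorch. This is a hypothetical illustration, not the authors' released implementation: the class and function names (`AttentionMILHead`, `semantic_consistency_loss`), the hidden sizes, and the symmetric-KL form of the consistency loss are all assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMILHead(nn.Module):
    """Hypothetical attention-based MIL pooling over ViT patch tokens.

    Each patch embedding is treated as an instance and the whole image as a
    bag; a learned attention weight highlights key local regions before
    bag-level classification. Details differ from the paper's AMLP head.
    """

    def __init__(self, dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) -- the instances of each bag
        scores = self.attention(patch_tokens)               # (B, N, 1)
        weights = torch.softmax(scores, dim=1)               # attention over instances
        bag_feature = (weights * patch_tokens).sum(dim=1)    # (B, dim) bag embedding
        return self.classifier(bag_feature)                  # bag-level class logits


def semantic_consistency_loss(vit_logits: torch.Tensor,
                              mil_logits: torch.Tensor) -> torch.Tensor:
    # Symmetric KL divergence between the ViT head and the MIL head;
    # the paper's exact loss formulation may differ.
    p = F.log_softmax(vit_logits, dim=-1)
    q = F.log_softmax(mil_logits, dim=-1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))


if __name__ == "__main__":
    # Toy usage: 4 images, 196 patch tokens of dimension 768, 30 scene classes.
    tokens = torch.randn(4, 196, 768)
    head = AttentionMILHead(dim=768, num_classes=30)
    mil_logits = head(tokens)
    vit_logits = torch.randn(4, 30)  # stand-in for the classic ViT [CLS] head
    loss = semantic_consistency_loss(vit_logits, mil_logits)
    print(mil_logits.shape, loss.item())
```

The attention pooling replaces the usual [CLS]-token readout with an instance-weighted average, which is the standard way MIL formulations expose which local regions drive the bag-level decision.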