Generic image classification has been widely studied in the past decade. However, for bird's-eye-view aerial images, scene classification remains challenging due to dramatic variations in scale and object size. Existing methods usually learn the aerial scene representation with convolutional neural networks (CNNs), which focus on the local response of an image. In contrast, the recently developed vision transformers (ViTs) can learn a stronger global representation for aerial scenes, but struggle to highlight the key objects in an aerial scene under dramatic size and scale variation. To address this challenge, in this letter, we propose a local–global interactive ViT (LG-ViT) for this task. It is built on a deliberately designed local–global feature interactive learning scheme, which jointly exploits local and global feature representations. To realize this scheme in an end-to-end manner, the proposed LG-ViT consists of three key components: local–global feature extraction (LGFE), local–global feature interaction (LGFI), and local–global semantic constraints. Extensive experiments on three aerial scene classification benchmarks, namely the UC Merced Land Use Dataset (UCM), the Aerial Image Dataset (AID), and the Northwestern Polytechnical University (NWPU) dataset, demonstrate the effectiveness of the proposed LG-ViT against state-of-the-art methods. The effectiveness of each component and the generalization capability are also validated.
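The abstract does not specify how the LGFI component couples the two feature streams, but a common realization of local–global interaction is bidirectional cross-attention between CNN-style local patch features and ViT-style global tokens. The sketch below is purely illustrative under that assumption; all names, shapes, and the residual-update form are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys/values.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

# Hypothetical shapes: 49 local patch features and 4 global tokens, dim 32.
rng = np.random.default_rng(0)
local_feats = rng.standard_normal((49, 32))   # e.g., CNN feature-map patches
global_toks = rng.standard_normal((4, 32))    # e.g., ViT class/context tokens

# Local features query the global tokens (global -> local interaction) ...
local_enriched = local_feats + cross_attention(local_feats, global_toks, global_toks)
# ... and global tokens query the local features (local -> global interaction).
global_enriched = global_toks + cross_attention(global_toks, local_feats, local_feats)

print(local_enriched.shape, global_enriched.shape)  # (49, 32) (4, 32)
```

The residual form keeps each stream's own representation while mixing in context from the other, which is one plausible way to let global context highlight key local objects despite scale variation.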