"A Local–Global Interactive Vision Transformer for Aerial Scene Classification"

Generic image classification has been widely studied over the past decade. For bird's-eye-view aerial images, however, scene classification remains challenging because object size and scale vary dramatically. Existing methods usually learn aerial scene representations with convolutional neural networks (CNNs), which focus on the local responses of an image. In contrast, recently developed vision transformers (ViTs) can learn stronger global representations of aerial scenes, but they struggle to highlight the key objects in a scene because of this size and scale variation. To address this challenge, in this letter we propose a local–global interactive ViT (LG-ViT). It is built on a purpose-designed local–global feature interactive learning scheme that jointly exploits local and global feature representations. To realize this scheme end to end, LG-ViT consists of three key components: local–global feature extraction (LGFE), local–global feature interaction (LGFI), and local–global semantic constraints. Extensive experiments on three aerial scene classification benchmarks, namely the UC Merced Land Use Dataset (UCM), the Aerial Image Dataset (AID), and the Northwestern Polytechnical University (NWPU) dataset, demonstrate the effectiveness of LG-ViT against state-of-the-art methods. The effectiveness of each component and the generalization capability of the model are also validated.
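
The abstract does not specify how the local and global branches are built or how they interact, so the following is only a rough illustrative sketch of the general idea of extracting local and global features from patch tokens and letting each branch attend to the other. Everything in it is an assumption made for illustration: the function names (`local_branch`, `global_branch`, `interact`, `scene_descriptor`), the 3x3 neighbourhood averaging used as a stand-in for a CNN-like local response, the attention with identity projections (no learned weights), and the residual fusion. It is not the LG-ViT architecture, and it omits the semantic constraints entirely.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention with identity projections
    # (assumption: no learned weights, to keep the sketch small).
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def local_branch(tokens, height, width):
    # Hypothetical local branch: average each patch token with its
    # 3x3 grid neighbours, standing in for a convolution-like response.
    dim = len(tokens[0])
    out = []
    for r in range(height):
        for c in range(width):
            neigh = [tokens[rr * width + cc]
                     for rr in range(max(0, r - 1), min(height, r + 2))
                     for cc in range(max(0, c - 1), min(width, c + 2))]
            out.append([sum(t[d] for t in neigh) / len(neigh)
                        for d in range(dim)])
    return out

def global_branch(tokens):
    # Hypothetical global branch: plain self-attention over all tokens.
    return attend(tokens, tokens, tokens)

def interact(local, glob):
    # Hypothetical interaction: each branch queries the other, and the
    # result is added back to the querying branch as a residual.
    local_from_global = attend(local, glob, glob)
    global_from_local = attend(glob, local, local)
    fused_local = [[a + b for a, b in zip(x, y)]
                   for x, y in zip(local, local_from_global)]
    fused_global = [[a + b for a, b in zip(x, y)]
                    for x, y in zip(glob, global_from_local)]
    return fused_local, fused_global

def scene_descriptor(tokens, height, width):
    # Mean-pool both fused branches and concatenate them into one vector.
    fused_local, fused_global = interact(
        local_branch(tokens, height, width), global_branch(tokens))
    n, dim = len(tokens), len(tokens[0])
    pooled_local = [sum(t[d] for t in fused_local) / n for d in range(dim)]
    pooled_global = [sum(t[d] for t in fused_global) / n for d in range(dim)]
    return pooled_local + pooled_global
```

For a 2x2 grid of 3-dimensional patch tokens, `scene_descriptor(tokens, 2, 2)` returns a 6-dimensional vector: the first half is pooled from the local branch and the second half from the global branch. In a real model the descriptor would feed a classifier head, and both branches would use learned projections.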

Keywords: global interactive; aerial scene; local global; scene classification

Journal Title: IEEE Geoscience and Remote Sensing Letters
Year Published: 2023
