Pansharpening is a fundamental and actively studied topic in remote sensing image fusion. In recent years, the self-attention-based transformer has attracted considerable attention in natural language processing (NLP) and has been introduced to computer vision (CV) tasks. Inspired by the great success of the vision transformer (ViT) in image classification, we propose an improved, purely transformer-based model for pansharpening. In the proposed method, stacked multispectral (MS) and panchromatic (PAN) images are cropped into patches (i.e., tokens); after passing through a three-layer self-attention-based encoder, these tokens carry rich information. After upsampling and stitching, a high-spatial-resolution (HR) MS image is finally obtained. Whereas convolutional neural networks (CNNs) capture short-distance dependencies, the proposed method builds long-distance dependencies to make fuller use of informative features. The experiments were conducted on an open benchmark dataset, including IKONOS with four-band MS/PAN images and WorldView-2 MS images featuring eight bands. In addition, the experiments were performed on both reduced- and full-resolution datasets, with both qualitative and quantitative evaluation. The experimental results indicate that the proposed model performs competitively against other pansharpening methods, including state-of-the-art CNN-based pansharpening algorithms.
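The abstract describes a ViT-style pipeline: stack MS and PAN, patchify into tokens, run a three-layer self-attention encoder, then reconstruct an HR MS image. The sketch below illustrates that pipeline under stated assumptions; the layer sizes, class name, and patch-reconstruction choice (a linear projection followed by fold) are illustrative, not the authors' exact architecture.

```python
# Minimal sketch of the abstract's pipeline, assuming a ViT-style design:
# patch embedding -> 3-layer self-attention encoder -> patch reconstruction.
# Hyperparameters (dim, heads, patch size) are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PansharpeningViT(nn.Module):
    def __init__(self, ms_bands=4, patch=8, dim=256, heads=8, depth=3):
        super().__init__()
        in_ch = ms_bands + 1                      # stacked MS + PAN channels
        self.patch = patch
        # Crop the stacked input into non-overlapping patches (tokens).
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        # Three-layer self-attention encoder, as stated in the abstract.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Project each token back to an image patch ("upsample and stitch").
        self.to_patch = nn.Linear(dim, ms_bands * patch * patch)

    def forward(self, ms_up, pan):
        # ms_up: MS image upsampled to PAN resolution, (B, ms_bands, H, W)
        # pan:   PAN image, (B, 1, H, W)
        x = torch.cat([ms_up, pan], dim=1)        # stack MS and PAN
        tok = self.embed(x)                       # (B, dim, H/p, W/p)
        B, D, Hp, Wp = tok.shape
        tok = tok.flatten(2).transpose(1, 2)      # (B, N, dim) token sequence
        tok = self.encoder(tok)                   # global (long-distance) attention
        out = self.to_patch(tok).transpose(1, 2)  # (B, C*p*p, N)
        # Stitch the per-token patches back into a full-resolution MS image.
        return F.fold(out, (Hp * self.patch, Wp * self.patch),
                      kernel_size=self.patch, stride=self.patch)

# Usage on dummy tensors sized for a 4-band sensor such as IKONOS:
model = PansharpeningViT(ms_bands=4)
hrms = model(torch.randn(1, 4, 256, 256), torch.randn(1, 1, 256, 256))
print(hrms.shape)                                 # torch.Size([1, 4, 256, 256])
```

Because every token attends to every other token in the encoder, each output patch can draw on features from anywhere in the image, which is the long-distance dependency the abstract contrasts with the local receptive fields of CNNs.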