Transformers, especially the vision transformer (ViT), are attracting increasing attention in various computer vision (CV) tasks. However, two pressing problems exist for the ViT: 1) because it attends to an image at the patch level, the ViT performs well at capturing global representations but is limited in extracting local features, which is an inherent strength of the convolutional neural network (CNN); and 2) the learnable positional encoding plays a positive role but limits the cross-resolution ability of the network; specifically, the pretrained model can only generate images of the same size as those used during training. To address these two problems, we propose a novel convolution-embedded ViT with elastic positional encoding in this article. On the one hand, we propose a joint CNN and self-attention (CSA) network to collaboratively extract local and global features. On the other hand, we propose to integrate an elastic CNN-based positional encoder into the framework to remove the ViT's rigid limitation on cross-resolution inputs and improve performance. Extensive experiments were conducted on IKONOS and WorldView-2 data with 4- and 8-band multispectral (MS) images, respectively. The visual and numerical results demonstrate the competitive performance of the proposed method.
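To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' code: module names such as ConvPositionalEncoder and CSABlock, the depthwise-convolution positional encoder, and the additive fusion of the local and global branches are all assumptions used only for illustration. It shows how a convolution over the token grid can supply position information at any input resolution (the "elastic" property) and how a convolutional branch can run alongside self-attention to capture local and global features jointly.

```python
# Minimal sketch of the two ideas, assuming a PyTorch-style implementation.
# All module names and design details here are hypothetical illustrations.

import torch
import torch.nn as nn


class ConvPositionalEncoder(nn.Module):
    """Elastic, CNN-based positional encoding: a depthwise convolution over the
    2-D token grid produces position information for any input resolution,
    unlike a fixed-size learnable positional-embedding table."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, H*W, C) -> reshape to the spatial grid, convolve, add back.
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)


class CSABlock(nn.Module):
    """Joint CNN and self-attention block: a convolutional branch extracts local
    features while multi-head self-attention captures global context; the two
    outputs are fused by residual addition (one plausible fusion choice)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape
        x = self.norm(tokens)
        global_feat, _ = self.attn(x, x, x)                        # global branch
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        local_feat = self.local(grid).flatten(2).transpose(1, 2)   # local branch
        return tokens + global_feat + local_feat                   # residual fusion


if __name__ == "__main__":
    # The convolutional positional encoder accepts any token-grid size
    # (e.g. 16x16 or 32x32), which is the cross-resolution behaviour the
    # abstract refers to; a learnable embedding table could not do this.
    for h, w in [(16, 16), (32, 32)]:
        x = torch.randn(2, h * w, 64)
        x = ConvPositionalEncoder(64)(x, h, w)
        x = CSABlock(64)(x, h, w)
        print(x.shape)
```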
               