Two-dimensional position information of input tokens is essential for transformer-based semantic segmentation models, especially on high-resolution aerial images. However, recent transformer-based segmentation methods rely on position encoding to record position information, and most position encoding methods encode only the 1-D positions of tokens. We therefore propose a 2-D semantic transformer model (2DSegFormer) for semantic segmentation of aerial images. In 2DSegFormer, we design a novel 2-D positional attention that accurately records the 2-D position information required by the transformer. Furthermore, we design a dilated residual connection and use it in place of the skip connection in the deep stages to obtain a larger receptive field. Skip connections are retained in the shallow stages of 2DSegFormer to pass fine details to the corresponding stages of the decoder. Experimental results on the UAVid, Vaihingen, and AeroScapes datasets demonstrate the effectiveness of 2DSegFormer. Compared with state-of-the-art methods, 2DSegFormer achieves better performance and strong robustness across all three datasets. In particular, 2DSegFormer-B2 ranks first on the public leaderboard for the UAVid test set.
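To illustrate why 1-D token positions can misrepresent spatial layout, the following minimal sketch compares 1-D and 2-D offsets between tokens of a flattened image grid. It is not the 2-D positional attention of 2DSegFormer, whose details are not given in this abstract; the function names, the row-major flattening, and the 4-token grid width are all assumptions made for illustration.

```python
from typing import Tuple


def token_to_2d(index: int, width: int) -> Tuple[int, int]:
    """Map a flattened token index to (row, col), assuming row-major order."""
    return divmod(index, width)


def offset_1d(a: int, b: int) -> int:
    """Relative position as seen by a 1-D position encoding."""
    return b - a


def offset_2d(a: int, b: int, width: int) -> Tuple[int, int]:
    """Relative (row, col) displacement between two tokens on the grid."""
    row_a, col_a = token_to_2d(a, width)
    row_b, col_b = token_to_2d(b, width)
    return (row_b - row_a, col_b - col_a)


WIDTH = 4  # hypothetical grid width: 4 tokens per row

# Tokens 3 and 4 are neighbours in 1-D, yet token 3 ends row 0 and
# token 4 starts row 1, so they sit at opposite edges of the grid.
print(offset_1d(3, 4), offset_2d(3, 4, WIDTH))

# Tokens 1 and 5 are vertical neighbours on the grid, yet their
# 1-D offset equals a full row width.
print(offset_1d(1, 5), offset_2d(1, 5, WIDTH))
```

In this toy setting, a 1-D offset of 1 can correspond to a large horizontal jump, while true vertical neighbours are a full row apart in 1-D. This is the kind of mismatch that motivates recording 2-D positions explicitly.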
               