Benefiting from effective global information interaction, vision transformers (ViTs) have been widely used in the building extraction task. However, buildings in remote sensing (RS) images usually differ greatly in size. Mainstream ViT-based segmentation models for RS images are built on the Swin Transformer, which lacks multiscale information inside the ViT block. In addition, they connect only the output of the entire ViT encoder block to the decoder, which ignores the similarity information of the attention maps inside the ViT encoder block and therefore cannot provide better global dependencies for the decoder. To solve these problems, we introduce a novel shunted transformer, which enables the model to capture multiscale information internally while fully establishing global dependencies, to build a pure ViT-based U-shaped model for building extraction. Furthermore, unlike the single skip connection structure of previous U-shaped methods, we build a novel dual skip connection structure inside the model. It simultaneously transmits both the attention maps inside the ViT encoder block and the block's entire output to the decoder, thereby fully mining the information in the ViT encoder block and providing better global information guidance for the decoder. Accordingly, our model is named shunted dual skip connection UNet (SDSC-UNet). We also design a feature fusion module, the dual skip upsample fusion module (DSUFM), to aggregate this information. Our model achieves state-of-the-art (SOTA) performance [83.02% intersection over union (IoU)] on the Inria Aerial Image Labeling Dataset. Code will be available at https://github.com/stdcoutzrh/BuildingExtraction.
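
The dual skip connection idea can be illustrated with a minimal PyTorch sketch. The abstract does not specify the internals of the shunted attention or of DSUFM, so everything below is a hypothetical stand-in: `EncoderBlockWithAttn` is a plain ViT block that additionally exposes its attention map, and `DualSkipFusion` is an assumed DSUFM-like module that fuses upsampled decoder features with both skip signals (the block output and an attention-derived summary).

```python
import torch
import torch.nn as nn

class EncoderBlockWithAttn(nn.Module):
    """Simplified ViT encoder block (not the paper's shunted attention)
    that also returns its attention map as a second skip signal."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        # attn_map: (B, N, N), head-averaged attention weights.
        attn_out, attn_map = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_map

class DualSkipFusion(nn.Module):
    """Hypothetical DSUFM-style fusion: merges decoder features with
    both the encoder block output and its attention map."""
    def __init__(self, dim):
        super().__init__()
        self.attn_proj = nn.Linear(1, dim)   # lift per-token attention stats to feature dim
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, dec, enc_out, attn_map):
        # Summarize how much attention each token receives
        # (column mean of the map), then project it to the feature dim.
        attn_feat = self.attn_proj(attn_map.mean(dim=1).unsqueeze(-1))
        return self.fuse(torch.cat([dec, enc_out, attn_feat], dim=-1))

# Toy usage: one encoder stage feeding one decoder fusion step.
x = torch.randn(2, 196, 64)              # (batch, tokens, dim), e.g. a 14x14 grid
block = EncoderBlockWithAttn(64)
enc_out, attn_map = block(x)
dec = torch.randn(2, 196, 64)            # upsampled decoder features at the same resolution
y = DualSkipFusion(64)(dec, enc_out, attn_map)
print(y.shape)                           # torch.Size([2, 196, 64])
```

The point of the sketch is the interface, not the exact fusion: each encoder stage hands the decoder two signals instead of one, so the decoder sees not only what the block computed but also which tokens the block attended to.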