For visual question answering on remote sensing imagery (RSVQA), current methods scarcely account for geospatial objects, which typically exhibit large scale differences and position-sensitive properties. In addition, modeling and reasoning about the relationships between entities have rarely been explored, which leads to one-sided and inaccurate answer predictions. In this article, a novel method called the spatial hierarchical reasoning network (SHRNet) is proposed, which endows a remote sensing (RS) visual question answering (VQA) system with enhanced visual–spatial reasoning capability. Specifically, a hash-based spatial multiscale visual representation module is first designed to encode multiscale visual features embedded with spatial positional information. Then, spatial hierarchical reasoning is conducted to learn high-order inner-group object relations across multiple scales under the guidance of linguistic cues. Finally, a visual–question (VQ) interaction module is employed to learn an effective image–text joint embedding for final answer prediction. Experimental results on three public RS VQA datasets confirm the effectiveness and superiority of our SHRNet model.
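The abstract outlines a three-stage pipeline: multiscale visual encoding with spatial positional information, question-guided hierarchical reasoning across scales, and a VQ interaction module for answer prediction. Below is a minimal PyTorch sketch of that pipeline shape only; all module names, dimensions, the learned positional table standing in for the hash-based encoding, and the multi-head attention fusion are illustrative assumptions, not the authors' implementation of SHRNet.

```python
# Hypothetical sketch of an SHRNet-like pipeline (not the authors' code).
import torch
import torch.nn as nn


class SpatialMultiScaleEncoder(nn.Module):
    """Projects per-scale visual features and adds positional embeddings
    (a learned table used here as a stand-in for the hash-based encoding)."""

    def __init__(self, in_dims, d_model, num_positions=196):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in in_dims])
        self.pos = nn.Embedding(num_positions, d_model)  # assumed positional table

    def forward(self, feats):
        # feats: list of (B, N_s, C_s) tensors, one per scale
        out = []
        for proj, f in zip(self.proj, feats):
            idx = torch.arange(f.size(1), device=f.device)
            out.append(proj(f) + self.pos(idx))
        return out  # list of (B, N_s, d_model)


class HierarchicalReasoning(nn.Module):
    """Question-guided attention within each scale, then cross-scale fusion."""

    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, scale_feats, q):
        # q: (B, d_model) question embedding used as the attention query
        q = q.unsqueeze(1)                                           # (B, 1, d)
        per_scale = [self.intra(q, f, f)[0] for f in scale_feats]    # (B, 1, d) each
        stacked = torch.cat(per_scale, dim=1)                        # (B, S, d)
        fused, _ = self.cross(q, stacked, stacked)
        return fused.squeeze(1)                                      # (B, d)


class VQInteraction(nn.Module):
    """Fuses visual and question embeddings and predicts an answer class."""

    def __init__(self, d_model, num_answers):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, v, q):
        return self.classifier(self.fuse(torch.cat([v, q], dim=-1)))


if __name__ == "__main__":
    B, d = 2, 256
    feats = [torch.randn(B, n, c) for n, c in [(196, 512), (49, 1024)]]
    question = torch.randn(B, d)  # placeholder question embedding
    enc = SpatialMultiScaleEncoder([512, 1024], d)
    reason = HierarchicalReasoning(d)
    head = VQInteraction(d, num_answers=10)
    logits = head(reason(enc(feats), question), question)
    print(logits.shape)  # torch.Size([2, 10])
```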