Capturing discriminative features and making effective use of local information are key to fine-grained recognition, because objects in different subcategories differ only subtly and these differences are not straightforward to find. To address this problem, we propose a weakly supervised fine-grained classification network with Local Diversity Guidance (LDGNet). We design a Multi-Attention Semantic Fusion Module (MASF) that builds multi-layer attention maps with channel–spatial interaction, which effectively enhances the semantic representation of the attention maps. We also introduce a random selection strategy (RSS) that uses three feature extraction operations to force the network to learn more comprehensive and detailed information, and more local features, from the attention map. Finally, both the attention map obtained by RSS and the feature map are used for prediction through a fully connected layer. In addition, we establish a dataset of ancient towers and apply our method to ancient building recognition, a practical application of fine-grained image classification in natural scenes. Extensive experiments on four fine-grained datasets, together with explainable visualizations, demonstrate that LDGNet effectively improves discriminative region localization and detailed feature acquisition for fine-grained objects, achieving performance competitive with state-of-the-art algorithms.
               