The purpose of fine-grained image classification is to distinguish subcategories belonging to the same basic-level category, for example, two hundred subcategories of birds. It has remained a challenging problem in computer vision in recent years because the inter-class variance among different subcategories is small (e.g., in color and texture) while the intra-class variance within the same subcategory is large (e.g., in pose and viewpoint). In this paper, we propose Compound Model Scaling with Efficient Attention (CMSEA) for fine-grained image classification, which carefully balances the dimensions of network width, network depth, and image resolution in model scaling. Furthermore, the proposed method uses an additional attention module with low computational cost to efficiently learn subtler features from discriminative regions. In addition, regularization and data augmentation are employed to improve accuracy during training. Extensive experiments demonstrate that CMSEA achieves 90.63%, 94.51%, and 95.19% accuracy on the CUB-200-2011, FGVC-Aircraft, and Stanford Cars datasets, respectively. In particular, CMSEA on CUB-200-2011 obtains 2.3% higher accuracy with 18% fewer network parameters than the original approach. Consequently, our method offers better accuracy and parameter efficiency than most existing methods.
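
The abstract does not spell out how width, depth, and resolution are balanced. As a rough illustration only, the sketch below assumes the EfficientNet-style compound scaling rule, in which a single coefficient phi scales depth by alpha^phi, width by beta^phi, and resolution by gamma^phi, with alpha = 1.2, beta = 1.1, gamma = 1.15 taken from the EfficientNet paper. The names `ScaledConfig`, `compound_scale`, and `flops_factor`, and the base resolution of 224, are hypothetical and are not taken from CMSEA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    depth_mult: float   # multiplier on the number of layers
    width_mult: float   # multiplier on the number of channels
    resolution: int     # input image side length in pixels


def compound_scale(phi, base_resolution=224, alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and resolution jointly with one coefficient phi.

    Assumption: EfficientNet-style rule, not necessarily what CMSEA uses.
    """
    return ScaledConfig(
        depth_mult=alpha ** phi,
        width_mult=beta ** phi,
        resolution=int(round(base_resolution * gamma ** phi)),
    )


def flops_factor(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi


if __name__ == "__main__":
    for phi in range(4):
        cfg = compound_scale(phi)
        print(phi, cfg, round(flops_factor(phi), 2))
```

Under these assumed constants, alpha * beta^2 * gamma^2 is close to 2, so each unit increase in phi roughly doubles the compute budget while keeping the three dimensions in proportion.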
               