Building multi-person pose estimation (MPPE) models that can handle complex foregrounds and uncommon scenes is an important challenge in computer vision. Aside from designing novel models, strengthening training data is a promising direction but remains largely unexplored for the MPPE task. In this article, we systematically identify the key deficiencies of existing pose datasets that prevent well-designed models from reaching their full potential, and we propose corresponding solutions. Specifically, we find that traditional data augmentation techniques are inadequate for addressing two key deficiencies: imbalanced instance complexity (quantified by our new IC metric) and insufficient realistic scenes. To overcome these deficiencies, we propose a model-agnostic full-view data generation (Full-DG) method that enriches the training data from the perspectives of both poses and scenes. By hallucinating images with more balanced pose complexity and richer real-world scenes, Full-DG helps improve pose estimators' robustness and generalizability. In addition, we introduce a plug-and-play adaptive category-aware loss (AC-loss) to alleviate the severe pixel-level imbalance between keypoints and backgrounds (approximately 1:600). Full-DG together with AC-loss can be readily applied to both bottom-up and top-down models to improve their accuracy. Notably, when plugged into the representative estimators HigherHRNet and HRNet, our method achieves substantial gains of 1.0%-2.9% AP on the COCO benchmark and 1.0%-5.1% AP on the CrowdPose benchmark.
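As a rough illustration of the keypoint-versus-background pixel imbalance that the AC-loss targets, the sketch below implements a generic category-weighted heatmap MSE in PyTorch. This is not the paper's actual AC-loss formulation (which the abstract does not specify); the class name CategoryAwareHeatmapLoss, the fg_weight and fg_threshold parameters, and the thresholding scheme are all hypothetical stand-ins that merely show how foreground (keypoint) pixels can be up-weighted against the far more numerous background pixels.

```python
import torch
import torch.nn as nn


class CategoryAwareHeatmapLoss(nn.Module):
    """Illustrative weighted MSE over keypoint heatmaps.

    Pixels near keypoints (high target activation) are up-weighted
    relative to the abundant background pixels, countering the roughly
    1:600 keypoint-to-background imbalance noted in the abstract.
    NOTE: this weighting scheme is a hypothetical stand-in, not the
    paper's actual AC-loss.
    """

    def __init__(self, fg_weight: float = 10.0, fg_threshold: float = 0.1):
        super().__init__()
        self.fg_weight = fg_weight        # hypothetical up-weighting factor for keypoint pixels
        self.fg_threshold = fg_threshold  # target activation above this counts as "keypoint"

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred, target: (batch, num_keypoints, height, width) heatmaps
        weights = torch.where(
            target > self.fg_threshold,
            torch.full_like(target, self.fg_weight),
            torch.ones_like(target),
        )
        return (weights * (pred - target) ** 2).mean()


if __name__ == "__main__":
    loss_fn = CategoryAwareHeatmapLoss()
    pred = torch.rand(2, 17, 64, 48)     # e.g., 17 COCO keypoints at 64x48 resolution
    target = torch.zeros(2, 17, 64, 48)
    target[:, :, 32, 24] = 1.0           # one synthetic keypoint peak per channel
    print(loss_fn(pred, target).item())
```

Because the loss is a drop-in nn.Module operating on predicted and target heatmaps, a weighting of this kind can be plugged into either bottom-up or top-down estimators without architectural changes, consistent with the plug-and-play claim in the abstract.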