Deep Convolutional Neural Networks (CNNs) have achieved remarkable performance on image-based human pose estimation. Nevertheless, directly applying these image-based models to videos is not only computationally intensive but may also cause jitter and loss. The main reason is that image-based models focus purely on the local features of individual frames and ignore the temporal information across adjacent frames. Some existing methods address this temporal coherency issue. However, they need to be designed carefully and cannot be combined with existing image-based methods. In this paper, we propose a simple yet effective module that refines the estimated pose by exploiting the temporal coherency among the heatmaps of adjacent frames, and that can be easily inserted into image-based networks as a plug-in. We show that the temporal coherency issue among the heatmap frames can be reformulated as a graph path selection optimization problem. Moreover, to speed up the refinement process, we propose a hierarchical graph optimization that performs the refinement from coarse to fine. Experimental results on two large-scale video pose estimation benchmarks show that our module improves performance with little speed loss when combined with image-based methods as an efficient plug-in.
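To make the idea of graph path selection concrete, the following is a minimal sketch, not the paper's actual module. It assumes that each frame contributes a few candidate joint locations (for example, heatmap peaks) as `(x, y, score)` tuples, that every frame has at least one candidate, and that a path is scored by the negated heatmap scores plus a squared-distance penalty between consecutive frames. The function name `select_path`, the parameter `smooth_weight`, and the cost itself are hypothetical choices for illustration; the abstract does not specify them. The minimum-cost path is found with standard Viterbi-style dynamic programming.

```python
from math import inf


def select_path(candidates, smooth_weight=0.01):
    """Pick one candidate per frame so the whole path has minimum cost.

    candidates[t] is a non-empty list of (x, y, score) tuples for frame t.
    Assumed cost (illustrative only): -score per chosen candidate, plus
    smooth_weight * squared distance between consecutive choices.
    Returns the list of chosen candidate indices, one per frame.
    """
    if not candidates:
        return []

    # cost[t][i]: minimum cost of any path that ends at candidate i of frame t.
    cost = [[-score for (_, _, score) in candidates[0]]]
    # back[t - 1][i]: index in frame t - 1 that precedes candidate i of frame t.
    back = []

    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for (x, y, score) in candidates[t]:
            best_prev, best_val = -1, inf
            for j, (px, py, _) in enumerate(candidates[t - 1]):
                jump = (x - px) ** 2 + (y - py) ** 2
                val = cost[-1][j] + smooth_weight * jump
                if val < best_val:
                    best_prev, best_val = j, val
            row_cost.append(best_val - score)
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest endpoint in the last frame.
    idx = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [idx]
    for row_back in reversed(back):
        idx = row_back[idx]
        path.append(idx)
    path.reverse()
    return path


# Frame 1 has a high-scoring outlier at (50, 50) and a lower-scoring
# candidate at (11, 10) that agrees with its neighbouring frames.
frames = [
    [(10, 10, 0.9)],
    [(50, 50, 0.95), (11, 10, 0.8)],
    [(12, 10, 0.9)],
]
smoothed = select_path(frames)
per_frame_best = select_path(frames, smooth_weight=0.0)
```

With the temporal penalty switched off (`smooth_weight=0.0`), the selection reduces to taking the highest-scoring candidate in each frame, which is the behaviour that can produce jitter. With the penalty on, the temporally consistent candidate is preferred in frame 1 even though its score is lower.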
               