3D hand pose estimation from a single depth image is an essential topic in computer vision and human-computer interaction. Although the rise of deep learning has greatly improved accuracy, the problem remains hard to solve due to the complex structure of the human hand. The two existing types of deep-learning methods, i.e., regression-based and detection-based methods, either lose spatial information about the hand structure or lack direct supervision of the joint coordinates. In this paper, we propose a novel differentiable spatial regression method that combines the advantages of these two types of methods to overcome each other's shortcomings. Our method uses a spatial-form representation (SFR) to maintain spatial information and a differentiable decoder to establish direct supervision. Following the procedure suggested by our method, we design a particular model named SRNet, which uses a combination of 2D heatmaps and local offset maps as SFRs. Two modules, named Plane Regression and Depth Regression, are designed as differentiable decoders to regress the plane coordinates and depth coordinates, respectively. An ablation study demonstrates the superiority of our method over the two combined methods, since the differentiable decoder leads to better SFRs learned by the network itself rather than designed by hand. Extensive experiments on four public datasets demonstrate that SRNet is comparable with state-of-the-art models.
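The abstract does not spell out the decoder's internals, but a common way to realize a differentiable decoder over a heatmap-plus-offset SFR is a soft-argmax: take the expectation of the pixel grid under a softmax-normalized heatmap, refined by per-pixel offsets. The sketch below is an illustrative assumption, not the paper's exact Plane Regression module; the function names `softmax2d` and `plane_regression` are hypothetical.

```python
import numpy as np

def softmax2d(h):
    """Normalize a 2D score map into a probability map (differentiable)."""
    e = np.exp(h - h.max())
    return e / e.sum()

def plane_regression(heatmap, offset_x, offset_y):
    """Hypothetical differentiable decoder: decode one joint's (u, v)
    plane coordinates as the heatmap-weighted mean of the pixel grid,
    with per-pixel offset maps providing sub-pixel refinement."""
    H, W = heatmap.shape
    p = softmax2d(heatmap)
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    u = np.sum(p * (xs + offset_x))  # expected column + offset
    v = np.sum(p * (ys + offset_y))  # expected row + offset
    return u, v

# Usage: a sharply peaked heatmap at row 10, col 20 decodes to ~(20, 10).
hm = np.zeros((64, 64)); hm[10, 20] = 50.0
zeros = np.zeros((64, 64))
u, v = plane_regression(hm, zeros, zeros)
```

Because every step is a smooth function of the heatmap and offsets, the joint-coordinate loss back-propagates through the decoder, which is what allows the network to learn its own SFRs under direct coordinate supervision.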