Human activity recognition involves two key issues: spatial dependencies and temporal dependencies. Most recent methods focus on only one of them and therefore lack the descriptive power to recognize complex activities. In this paper, we propose a hierarchical spatio-temporal model (HSTM) that addresses this limitation by modeling spatial and temporal constraints simultaneously. The HSTM is a two-layer hidden conditional random field (HCRF): the bottom-layer HCRF describes the spatial relations within each frame and learns more discriminative representations, while the top-layer HCRF uses these high-level features to characterize temporal relations across the whole video sequence. The HSTM thus treats the bottom layer as building blocks for the top layer, aggregating evidence from the local to the global level. A novel learning algorithm is derived to train all model parameters efficiently, and its effectiveness is validated theoretically. Experimental results show that the HSTM classifies human activities more accurately than existing methods on single-person actions (UCF dataset). More importantly, the HSTM also achieves superior performance on more practical interactions, including human–human interactions (UT-Interaction, BIT-Interaction, and CASIA datasets) and human–object interactions (Gupta video dataset).
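For context, the model underlying both layers can be written in the standard HCRF form, in which the class posterior marginalizes over hidden variables. The exact feature functions and hidden-state structure used in each HSTM layer are not specified in the abstract, so the expression below is a generic sketch of an HCRF rather than the paper's precise model:

\[
P(y \mid \mathbf{x}; \boldsymbol{\theta}) \;=\; \sum_{\mathbf{h}} P(y, \mathbf{h} \mid \mathbf{x}; \boldsymbol{\theta}) \;=\; \frac{\sum_{\mathbf{h}} \exp\!\big(\boldsymbol{\theta}^{\top} \Phi(y, \mathbf{h}, \mathbf{x})\big)}{\sum_{y'} \sum_{\mathbf{h}'} \exp\!\big(\boldsymbol{\theta}^{\top} \Phi(y', \mathbf{h}', \mathbf{x})\big)},
\]

where \(\mathbf{x}\) denotes the observations (a single frame for the bottom layer, and presumably the sequence of bottom-layer outputs for the top layer), \(\mathbf{h}\) the hidden states, \(y\) the activity class, \(\boldsymbol{\theta}\) the model parameters, and \(\Phi\) a feature map scoring the compatibility among observations, hidden states, and the class label.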