Abstract In accurate action recognition, a discriminative human-region representation serving as auxiliary information is critical for fusing multiple visual cues in a video and further improving recognition performance. To this end, in this paper we propose integrating a novel representation, termed multilayer deep features (MDF), of both the human region and the whole image into an extended region-aware multiple kernel learning (ER-MKL) framework. Specifically, we first extract human cues with the help of off-the-shelf semantic segmentation models. Then, the more powerful MDF representations are constructed by concatenating activations from the last convolutional layer and the fully connected layer. Finally, the proposed ER-MKL framework is presented to learn a robust classifier that fuses the human-region MDF and the whole-region MDF. In addition to combining multiple kernels derived from features of heterogeneous image regions, ER-MKL also considers sets of pre-learned classifiers and incorporates prior knowledge of the different regions. Extensive evaluations on the JHMDB and UCF Sports datasets validate the effectiveness and superiority of the proposed approach.
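The abstract describes MDF as the concatenation of last-convolutional-layer and fully-connected-layer activations, computed separately for the human region and the whole frame. The sketch below illustrates this construction under assumptions not stated in the abstract: the backbone (VGG-16 from torchvision), the global pooling of the conv map, and the helper name multilayer_deep_features are all illustrative choices, not the authors' exact configuration.

```python
# Sketch of the MDF idea: concatenate last-conv activations (globally pooled)
# with fully-connected-layer activations from a pretrained CNN.
# Backbone, pooling, and layer choices are assumptions for illustration.
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def multilayer_deep_features(image: Image.Image) -> torch.Tensor:
    """Return the concatenation of pooled last-conv and FC-layer activations."""
    x = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        conv = backbone.features(x)                                  # last conv block: [1, 512, 7, 7]
        conv_vec = torch.flatten(F.adaptive_avg_pool2d(conv, 1), 1)  # pooled conv features: [1, 512]
        fc_in = torch.flatten(backbone.avgpool(conv), 1)             # input to the classifier head
        fc_vec = backbone.classifier[:5](fc_in)                      # activations of the second FC layer: [1, 4096]
    return torch.cat([conv_vec, fc_vec], dim=1).squeeze(0)           # MDF vector

# In the described pipeline, this would be applied to both the segmented human
# crop and the whole frame, and the two resulting MDF vectors (or kernels built
# from them) would then be fused by the ER-MKL classifier.
```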