Abstract In skeleton-based action recognition, mining information from the joints and limbs of human skeletons plays a key role. Previous studies either treated the skeleton data as vectors and could not explicitly capture joint interactions (e.g., RNN-based methods), or modeled joint interactions locally and may therefore lose important cues for lack of a global response mapping (e.g., CNN- and GCN (Graph Convolutional Network)-based methods). In this work, we address these problems by considering the potential relations of all node pairs and edge pairs on the skeleton graphs. A dilation group-specific convolution module is proposed to aggregate the relation messages of all unit pairs on the skeleton graphs. By enumerating all pair relations, joint interactions can be learned explicitly and globally. The module is further enhanced by attention pooling, comprising temporal, spatial, and channel attention. By stacking several such blocks, the relation messages of the node pairs are augmented through multi-layer propagation. Finally, late fusion of four streams combines the predictions from different inputs: node pairs, edge pairs, and their corresponding frame differences. The proposed method, termed the conv-relation network, achieves competitive performance on two large-scale datasets, NTU RGB+D and Kinetics.
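Since only the abstract is available, the following PyTorch sketch is purely illustrative: it shows one plausible way to enumerate all joint (node) pairs, aggregate the relation messages with a pointwise convolution, and gate the result with channel attention. The class name, tensor shapes, attention form, and fusion weights are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PairRelationBlock(nn.Module):
    """Hypothetical sketch: enumerate all joint pairs on the skeleton
    graph and aggregate relation messages, then gate with channel
    attention. Shapes and names are assumptions, not the paper's code."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convolution over concatenated pair features (assumed
        # stand-in for the paper's group-specific convolution).
        self.pair_conv = nn.Conv2d(2 * in_channels, out_channels, kernel_size=1)
        # Simple squeeze-and-excite style channel attention (assumed form).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        n, c, t, v = x.shape
        xi = x.unsqueeze(4).expand(n, c, t, v, v)     # joint i, broadcast over j
        xj = x.unsqueeze(3).expand(n, c, t, v, v)     # joint j, broadcast over i
        pairs = torch.cat([xi, xj], dim=1)            # all v*v node pairs
        pairs = pairs.reshape(n, 2 * c, t, v * v)
        rel = self.pair_conv(pairs)                   # per-pair relation messages
        rel = rel.reshape(n, -1, t, v, v).sum(dim=4)  # aggregate over partners
        return rel * self.channel_att(rel)            # channel-attention gating

# Usage on a toy skeleton sequence: batch 8, 3-D coordinates,
# 20 frames, 25 joints (NTU RGB+D uses 25 joints per body).
block = PairRelationBlock(in_channels=3, out_channels=64)
out = block(torch.randn(8, 3, 20, 25))  # -> (8, 64, 20, 25)

# Four-stream late fusion (equal weights assumed): sum the class scores
# of the node, edge, node-difference, and edge-difference streams.
# fused = sum(torch.softmax(s, dim=1) for s in four_stream_logits)
```

Temporal and spatial attention would be gated analogously over the frame and joint axes; they are omitted here to keep the sketch minimal.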
               