In the task of skeleton-based action recognition, long-term temporal dependencies are significant cues for sequential skeleton data. State-of-the-art methods rarely have access to long-term temporal information, due to the limitations of their receptive fields. Meanwhile, most recent multi-branch methods only consider different input modalities and ignore information at various temporal scales. To address these issues, we propose a multi-scale temporal transformer (MTT) for skeleton-based action recognition in this letter. First, the raw skeleton data are embedded by graph convolutional network (GCN) blocks and multi-scale temporal embedding modules (MT-EMs), which are designed as multiple branches to extract features at various temporal scales. Second, we introduce transformer encoders (TE) to integrate the embeddings and model long-term temporal patterns. Moreover, we propose a task-oriented lateral connection (LaC) to align semantic hierarchies: LaC distributes input embeddings to the downstream transformer encoders according to their semantic levels. Finally, the classification heads aggregate the results from the TEs and predict the action categories. Experiments demonstrate the efficiency and universality of the proposed method, which achieves state-of-the-art performance on three large datasets: NTU-RGBD 60, NTU-RGBD 120, and Kinetics-Skeleton 400.
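
Since the paper's implementation is not shown here, the following is a minimal PyTorch sketch of the pipeline the abstract describes. It is illustrative only: the GCN block is stubbed with a linear layer, each MT-EM branch is modeled as a temporal convolution with a branch-specific kernel size, and the lateral connection is reduced to routing each branch to its own transformer encoder. All module names, shapes, and hyper-parameters are assumptions, not the authors' architecture.

```python
# Hypothetical sketch of the MTT pipeline described in the abstract.
# Module names, shapes, and hyper-parameters are assumed for illustration.
import torch
import torch.nn as nn


class TemporalEmbedding(nn.Module):
    """One assumed MT-EM branch: a temporal convolution whose kernel size
    sets the temporal scale that this branch observes."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.conv(x)


class MTT(nn.Module):
    """Illustrative multi-scale temporal transformer."""

    def __init__(self, channels: int = 64, num_classes: int = 60,
                 scales=(3, 5, 7), num_layers: int = 2):
        super().__init__()
        # Stub for the GCN blocks; a real model would aggregate features
        # over the skeleton graph here.
        self.gcn = nn.Linear(channels, channels)
        # One temporal-embedding branch per temporal scale.
        self.branches = nn.ModuleList(
            TemporalEmbedding(channels, k) for k in scales)
        # One transformer encoder per branch; the lateral connection (LaC)
        # is simplified here to routing each branch to its own encoder.
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=channels, nhead=4,
                                       batch_first=True),
            num_layers=num_layers)
        self.encoders = nn.ModuleList(make_encoder() for _ in scales)
        # Classification head aggregating the per-branch results.
        self.head = nn.Linear(channels * len(scales), num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) -- joint features already flattened.
        x = self.gcn(x)                       # per-frame graph embedding
        outs = []
        for branch, enc in zip(self.branches, self.encoders):
            z = branch(x.transpose(1, 2)).transpose(1, 2)  # temporal conv
            z = enc(z)                        # long-term temporal modeling
            outs.append(z.mean(dim=1))        # temporal average pooling
        return self.head(torch.cat(outs, dim=-1))


model = MTT()
logits = model(torch.randn(8, 100, 64))  # 8 clips, 100 frames each
print(logits.shape)                      # torch.Size([8, 60])
```

The multi-branch structure mirrors the abstract's central idea: each branch embeds the sequence at a different temporal scale before the transformer encoders capture long-term dependencies, and the head fuses the branch outputs for the final prediction.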