In deep learning-based video action recognition, the role of the neural network is to acquire spatial information, motion information, and the associations between these two kinds of information over uneven time spans. This paper proposes a network that extracts video sequence semantic information through deep integration of local spatial-temporal information. The network uses a 2D Convolutional Neural Network (2DCNN) and a Multi Spatial-Temporal scale 3D Convolutional Neural Network (MST_3DCNN) to extract spatial information and motion information, respectively. Spatial information and motion information from the same time span are fused by 3D convolution to generate the preliminary spatial-temporal information of a single moment. The spatial-temporal information of multiple single moments then enters a Temporal Pyramid Net (TPN), which generates local spatial-temporal information at multiple time scales. Finally, a bidirectional recurrent neural network acts on all the local spatial-temporal information to acquire context information spanning the entire video, endowing the network with the ability to extract video-level context. Experiments on three common video action recognition data sets (UCF101, UCF11, and UCFSports) show that the proposed spatial-temporal information deep fusion network achieves a high recognition rate on the video action recognition task.
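The core multi-scale idea behind the Temporal Pyramid Net stage can be illustrated with a minimal NumPy sketch: per-frame features are averaged over contiguous temporal segments at several scales (coarse to fine) and concatenated. The function name `temporal_pyramid_pool`, the scale values, and the shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def temporal_pyramid_pool(features, scales=(1, 2, 4)):
    """Pool per-frame features over windows at several temporal scales.

    features: array of shape (T, D) -- one D-dimensional feature per frame.
    scales: number of equal temporal segments at each pyramid level
            (assumed values; the paper's actual scales may differ).
    Returns an array of shape (sum(scales), D): one averaged feature per
    segment, concatenated across scales from coarse to fine.
    """
    pooled = []
    for s in scales:
        # Split the T frames into s contiguous segments and average each,
        # giving a coarser temporal summary for small s, finer for large s.
        for segment in np.array_split(features, s, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.stack(pooled)

# Example: 8 frames, each described by a 4-dimensional feature vector.
feats = np.arange(32, dtype=float).reshape(8, 4)
out = temporal_pyramid_pool(feats, scales=(1, 2, 4))
print(out.shape)  # (7, 4): 1 + 2 + 4 pooled segment features
```

In the full network, each of these multi-scale local spatial-temporal features would then be fed to the bidirectional recurrent network to capture context across the whole video.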