Actions in continuous videos are correlated and may have hierarchical relationships. Densely labeled datasets of complex videos have revealed that actions occur simultaneously, but existing models fail to exploit these relationships to analyze actions in the context of a video and thus to better understand complex videos. We propose a novel architecture consisting of a correlation learning and input synthesis (CoLIS) network, a long short-term memory (LSTM) network, and a hierarchical classifier. First, the CoLIS network captures the correlation between features extracted from video sequences and pre-processes the input to the LSTM. Since the input becomes a weighted sum of multiple correlated features, it enhances the LSTM's ability to learn variable-length, long-term temporal dependencies. Second, we design a hierarchical classifier that utilizes the simultaneous occurrence of general actions, such as "run" and "jump", to refine the predictions on their correlated actions. Third, we use interleaved backpropagation through time for training. All of these networks are fully differentiable, so they can be integrated for end-to-end learning. The results show that the proposed approach improves action recognition accuracy by 1.0% and 2.2% on single-labeled and densely labeled datasets, respectively.
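The two key mechanisms above can be illustrated with a minimal sketch: correlation-driven weighting of feature streams to synthesize the LSTM input, and gating of fine-grained action scores by their general-action parents. This is an illustrative approximation only; the function names (`colis_weighted_input`, `hierarchical_refine`), the use of a Pearson-correlation matrix, the parent-gating scheme, and all shapes are assumptions, since the abstract does not specify the exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def colis_weighted_input(features):
    """Fuse correlated feature streams into a single LSTM input vector.

    features: array of shape (n_streams, d), one vector per extracted
    feature stream. Shapes and the correlation measure are hypothetical.
    """
    # Pairwise correlation between streams drives the mixing weights.
    corr = np.corrcoef(features)             # (n_streams, n_streams)
    weights = softmax(corr.mean(axis=1))     # one weight per stream
    return weights @ features                # weighted sum, shape (d,)

def hierarchical_refine(general_probs, fine_logits, parent_of):
    """Refine fine-grained action scores with their general-action parents.

    parent_of[i] is the index of fine action i's general parent
    (e.g. a hurdling class whose parent is "jump"). Additive log-gating
    is one simple way to realize the refinement the abstract describes.
    """
    gated = fine_logits + np.log(general_probs[parent_of] + 1e-8)
    return softmax(gated)
```

Because both steps are plain differentiable tensor operations, they compose with an LSTM into a single end-to-end trainable graph, consistent with the paper's claim of full differentiability.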