Abstract In video surveillance with multiple people, human interactions and their action categories preserve strong correlations, and the identification of interaction configuration is of significant importance to the success of… Click to show full abstract
Abstract In video surveillance with multiple people, human interactions and their action categories preserve strong correlations, and the identification of interaction configuration is of significant importance to the success of action recognition task. Interactions are typically estimated using heuristics or treated as latent variables. However, the former usually introduces incorrect interaction configuration while the latter amounts to solve challenging optimization problems. Here we address these problems systematically by proposing a novel structured learning framework which enables the joint prediction of actions and interactions. To this end, both the features learned via deep nets and human interaction context are leveraged to encode the correlations among actions and pairwise interactions in a structured model, and all model parameters are trained via a large-margin framework. To solve the associated inference problem, we present two optimization algorithms, one is alternating search and the other is belief propagation. Experiments on both synthetic and real dataset demonstrate the strength of the proposed approach.
               
Click one of the above tabs to view related content.