Collective activity recognition, which identifies the activity a group of people is performing, is a cutting-edge research topic in computer vision. Unlike actions performed by individuals, collective activities require modelling the complex interactions among different people. However, most previous works require exhaustive annotations, such as accurate labels for individual actions, pairwise interactions, and poses, which may not be readily available in practice. Moreover, most of them treat human detection as a decoupled task performed before collective activity recognition and use all detected persons. This not only ignores the mutual relation between the two tasks, which makes it hard to filter out irrelevant people, but also likely increases the computational burden of reasoning about collective activities. In this paper, we propose a fast weakly supervised deep learning architecture for collective activity recognition. For fast inference, we make actor detection and weakly supervised collective activity reasoning collaborate in an end-to-end framework by sharing convolutional layers between them. The joint learning unites the two tasks so that they reinforce each other, which makes it more effective to filter out outliers who are not involved in the activity. For weakly supervised learning, we propose a latent embedding scheme that mines person-group interactive relationships, removing the need for pairwise relations between people as well as individual action labels. The experimental results show that the proposed framework achieves performance comparable to, or even better than, the state of the art on three datasets. Our joint model reasons about collective activities at 22.65 fps, which to our knowledge is the fastest speed reported to date and brings collective activity recognition substantially closer to real-time applications.
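The abstract does not specify how the person-group relationship is computed, so the following is only a toy illustration of the general idea of scoring each person against a group-level context instead of against every other person. The function names, the mean-pooled context, the dot-product score, and the example feature vectors are all assumptions made for this sketch; they are not the paper's latent embedding scheme.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def group_embedding(person_feats):
    """Pool per-person features into one group feature (hypothetical helper).

    Each person gets a single relevance weight computed against a group
    context vector, so no pairwise person-person terms and no individual
    action labels are involved.
    """
    n = len(person_feats)
    dim = len(person_feats[0])
    # Assumption: the group context is simply the mean of all person features.
    context = [sum(f[d] for f in person_feats) / n for d in range(dim)]
    # One person-group score per detected person.
    weights = softmax([dot(f, context) for f in person_feats])
    # Weighted pooling: people who disagree with the group contribute less.
    pooled = [
        sum(w * f[d] for w, f in zip(weights, person_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Made-up features: three people with similar features and one outlier.
people = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
weights, pooled = group_embedding(people)
```

In this toy input the fourth person, whose features differ from the rest, receives the smallest weight, which mirrors the abstract's goal of down-weighting people who are not involved in the activity. In an actual system the features would come from the convolutional layers shared with the detector, and the scoring would be learned from group-level labels only.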
               