"Exploiting Mid-Level Semantics for Large-Scale Complex Video Classification"

As the amount of available video data has grown substantially, automatic video classification has become an urgent yet challenging task. Most video classification methods, especially deep learning methods, focus on acquiring discriminative spatial visual features and motion patterns for video representation, and they have achieved very good results on action recognition problems. However, the performance of most of these methods degrades drastically on more generic video classification tasks, where the video contents are much more complex. In this paper, the mid-level semantics of videos are therefore exploited to bridge the semantic gap between low-level features and high-level video semantics. Inspired by term frequency-inverse document frequency (TF-IDF), a word weighting method for text classification is introduced to the video domain. The visual objects in videos are regarded as the words in texts, and two new weighting methods are proposed to encode videos by weighting visual objects according to the characteristics of videos. In addition, the semantic similarities between video categories and visual objects are introduced from the text domain as privileged information to facilitate classifier training on the obtained semantic representations of videos. The proposed semantic encoding method (the semantic stream) is then fused with the popular two-stream CNN model to produce the final classification results. Experiments are conducted on two large-scale complex video datasets, CCV and ActivityNet, and the results validate the effectiveness of the proposed methods.
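
To make the analogy between visual objects and words concrete, the following is a minimal sketch of classic TF-IDF applied to per-video lists of detected object labels. It is not the paper's method: the abstract does not specify the two proposed weighting schemes, so plain TF-IDF is used here as a stand-in. The function name `tfidf_encode`, the object labels, and the input layout (one list of labels per video) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_encode(videos, vocabulary):
    """Encode each video as a TF-IDF vector over visual object labels.

    videos: list of lists of detected object labels, one list per video
            (hypothetical input format, assumed for this sketch).
    vocabulary: ordered list of object labels defining vector dimensions.
    """
    n_videos = len(videos)

    # Document frequency: in how many videos each object appears at least once.
    df = Counter()
    for objects in videos:
        df.update(set(objects))

    vectors = []
    for objects in videos:
        counts = Counter(objects)
        total = len(objects)
        vec = []
        for obj in vocabulary:
            # Term frequency: share of this video's detections that are `obj`.
            tf = counts[obj] / total if total else 0.0
            # Inverse document frequency: objects seen in every video get 0.
            idf = math.log(n_videos / df[obj]) if df[obj] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Toy example with made-up detections for three videos.
videos = [
    ["ball", "person", "person"],
    ["person", "dog"],
    ["ball", "ball", "hoop"],
]
vocabulary = ["ball", "dog", "hoop", "person"]
vectors = tfidf_encode(videos, vocabulary)
```

Under this plain weighting, an object that occurs in every video receives zero weight, while an object concentrated in a few videos is emphasized. Vectors of this kind could then be fed to a classifier as a semantic representation.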

Keywords: video; mid level; video classification; semantics

Journal Title: IEEE Transactions on Multimedia
Year Published: 2019
