"MCL: A Contrastive Learning Method for Multimodal Data Fusion in Violence Detection"

Multimodal learning over video and audio has shown significant performance improvements in violence detection. However, video and audio do not contribute consistently, and the video modality tends to dominate when determining whether a given scene contains violent events. In fact, a few recent multimodal learning methods for violence detection do not fully account for the data differences between modalities, which leads to an optimization imbalance during training and ultimately to insufficient performance. To address this issue, we propose a Multimodal Contrastive Learning (MCL) method that makes full use of video and audio information for violence detection. Specifically, to prevent the video modality from dominating model training, we design a multi-encoder framework that performs task-driven feature encoding on video and audio separately. To reduce information loss during multimodal fusion, we introduce a contrastive learning task that captures semantically consistent representations. We conduct extensive experiments on the XD-Violence dataset, showing that the proposed MCL improves average precision by 2.34% over the state-of-the-art baseline.
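The abstract does not specify the form of the contrastive task, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss that pulls together the video and audio embeddings of the same clip and pushes apart those of different clips. The function names, the temperature value, and the use of cosine similarity are all assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's definition).

    video_emb[i] and audio_emb[i] are treated as a positive pair; every other
    pairing in the batch is treated as a negative.
    """
    video = [l2_normalize(v) for v in video_emb]
    audio = [l2_normalize(a) for a in audio_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    sim = [
        [sum(p * q for p, q in zip(v, a)) / temperature for a in audio]
        for v in video
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_transposed))


# Toy batch of two clips with 2-dimensional embeddings (hypothetical values).
video_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
shuffled_audio = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(video_batch, aligned_audio)
shuffled_loss = contrastive_loss(video_batch, shuffled_audio)
```

In this toy batch the aligned pairing yields a much lower loss than the shuffled one, which is the behavior a contrastive objective of this kind is meant to reward. In a multi-encoder setup such as the one the abstract describes, a term like this would typically be added to the detection loss, though how MCL combines them is not stated here.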

Keywords: violence; video audio; violence detection; multimodal; contrastive learning

Journal Title: IEEE Signal Processing Letters
Year Published: 2023
