The complementarity of multimodal signals is essential for video anomaly detection. However, existing methods either fail to explore multimodal data or ignore the implicit alignment of multimodal features. In this work, we address this problem with a novel fusion method and propose a Multimodal Supervise-Attention enhanced Fusion (MSAF) framework under weak supervision. Our framework consists of two parts: 1) a multimodal label refinement part that refines video-level ground truth into pseudo clip-level labels for subsequent training, and 2) a multimodal supervise-attention fusion network that enhances features by implicitly aligning information from different modalities, then fuses them effectively to predict anomaly scores with the help of the refined labels. We validate our framework on four challenging datasets: ShanghaiTech, UCF-Crime, LAD, and XD-Violence. Extensive experiments demonstrate the effectiveness of our framework, which achieves comparable results on several benchmarks and outperforms current state-of-the-art methods on the XD-Violence audiovisual multimodal dataset.
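
The abstract does not give implementation details, so the following is a minimal PyTorch sketch of the two-stage pipeline it describes. The top-k pseudo-labelling rule, the feature dimensions (I3D-style visual, VGGish-style audio), and the cross-attention fusion design are all illustrative assumptions, not the authors' exact network.

```python
# Minimal sketch of the MSAF-style two-stage pipeline. All dimensions,
# the top-k refinement rule, and the fusion design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def refine_video_labels(clip_scores, video_labels, k=3):
    """Stage 1 (assumed): refine video-level labels into pseudo clip labels.

    clip_scores:  (B, T) anomaly score per clip.
    video_labels: (B,) with 0 = normal video, 1 = anomalous video.
    For anomalous videos, the k highest-scoring clips are marked anomalous;
    every clip of a normal video stays normal.
    """
    pseudo = torch.zeros_like(clip_scores)
    topk_idx = clip_scores.topk(k, dim=1).indices
    pseudo.scatter_(1, topk_idx, 1.0)
    return pseudo * video_labels.unsqueeze(1)


class SuperviseAttentionFusion(nn.Module):
    """Stage 2 (assumed): cross-modal attention implicitly aligns the two
    modalities before their features are fused and scored per clip."""

    def __init__(self, vis_dim=1024, aud_dim=128, hidden=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.v2a = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.a2v = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim)
        v = self.vis_proj(vis)
        a = self.aud_proj(aud)
        # Each modality attends to the other, aligning their features.
        v_aligned, _ = self.v2a(v, a, a)   # visual queries, audio keys/values
        a_aligned, _ = self.a2v(a, v, v)   # audio queries, visual keys/values
        fused = torch.cat([v + v_aligned, a + a_aligned], dim=-1)
        return self.scorer(fused).squeeze(-1)   # (B, T) clip-level scores


# Weakly supervised training step: pseudo labels supervise clip scores.
model = SuperviseAttentionFusion()
vis = torch.randn(4, 32, 1024)                 # 4 videos x 32 clips, visual
aud = torch.randn(4, 32, 128)                  # matching audio features
video_labels = torch.tensor([0., 1., 1., 0.])  # video-level ground truth

scores = model(vis, aud)
pseudo = refine_video_labels(scores.detach(), video_labels)
loss = F.binary_cross_entropy(scores, pseudo)
loss.backward()
```

In practice the pseudo labels would be recomputed as the model's scores improve over training; the choice of k and the feature dimensions here are placeholders.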