Gated PE-NL-MA: A multi-modal attention based network for video understanding

Abstract: In multi-modal learning tasks such as video understanding, the most important operations are feature extraction and feature enhancement within a single modality, and feature aggregation across modalities. In this paper, we present two attention-based algorithms: the Position-embedding Non-local (PE-NL) Network and the Multi-modal Attention (MA) feature aggregation method. Inspired by Non-local Neural Networks and the Transformer, PE-NL is a self-attention-like feature enhancement operation that can capture long-range dependencies and model relative positions. The MA aggregation method merges the visual and audio modalities while reducing the feature dimension and the number of parameters without losing too much accuracy. Both the PE-NL and MA blocks can be plugged into many multi-modal learning architectures. Our Gated PE-NL-MA network achieves competitive results on the YouTube-8M dataset.
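
The abstract gives no equations, so the following is only a rough NumPy sketch of the two ideas it names, not the authors' implementation. It assumes a Transformer-style sinusoidal position embedding, a scaled dot-product non-local block with a residual connection, and a softmax-weighted sum over modalities after projecting each to a smaller shared dimension. All function names, weight shapes, and the scoring vector are hypothetical, and the gating used in the full network is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(num_frames, dim):
    # Assumed position embedding (Transformer-style), shape (num_frames, dim).
    pos = np.arange(num_frames)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def pe_non_local(x, w_q, w_k, w_v, w_out):
    # Self-attention-like enhancement of frame features x with shape (T, D).
    # Every frame attends to every other frame, so long-range
    # dependencies are reachable in a single step.
    num_frames, dim = x.shape
    h = x + sinusoidal_positions(num_frames, dim)
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # (T, T)
    return x + (attn @ v) @ w_out, attn  # residual keeps the input shape

def modal_attention(video, audio, w_video, w_audio, score_vec):
    # Hypothetical aggregation: project both modalities to a shared,
    # smaller dimension d, then take an attention-weighted sum.
    stacked = np.stack([video @ w_video, audio @ w_audio])  # (2, d)
    weights = softmax(np.tanh(stacked) @ score_vec)         # (2,)
    return weights @ stacked, weights

# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
T, Dv, Da, dk, d = 8, 16, 4, 8, 6
frames = rng.normal(size=(T, Dv))
enhanced, attn = pe_non_local(
    frames,
    rng.normal(size=(Dv, dk)), rng.normal(size=(Dv, dk)),
    rng.normal(size=(Dv, dk)), rng.normal(size=(dk, Dv)),
)
video_vec = enhanced.mean(axis=0)   # simple mean pooling over frames
audio_vec = rng.normal(size=Da)
fused, weights = modal_attention(
    video_vec, audio_vec,
    rng.normal(size=(Dv, d)), rng.normal(size=(Da, d)), rng.normal(size=d),
)
```

In this toy setup the fused vector has dimension 6 rather than the 20 that concatenating the video and audio vectors would give, which illustrates the kind of dimension reduction the abstract attributes to MA aggregation.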

Keywords: network; attention; multi modal; attention based; video understanding

Journal Title: Neurocomputing
Year Published: 2021
