Abstract In multi-modal learning tasks such as video understanding, the most important operations are feature extraction and feature enhancement for a single modality, and feature aggregation between modalities. In this paper, we present two attention-based algorithms: the Position-embedding Non-local (PE-NL) Network and the Multi-modal Attention (MA) feature aggregation method. Inspired by Non-local Neural Networks and the Transformer, PE-NL is a self-attention-like feature enhancement operation that captures long-range dependencies and models relative positions. The MA aggregation method merges the visual and audio modalities while reducing feature dimensions and the number of parameters without losing much accuracy. Both the PE-NL and MA blocks can be plugged into many multi-modal learning architectures. Our Gated PE-NL-MA network achieves competitive results on the YouTube-8M dataset.
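The abstract does not include the authors' code. As a rough illustration of the PE-NL idea it describes (non-local-style self-attention over frame features, augmented with relative position modeling), a minimal PyTorch sketch might look like the following. The class name, shapes, and the relative-position bias-table formulation are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class PENLBlock(nn.Module):
    """Hypothetical sketch of a Position-embedding Non-local (PE-NL) style block.

    Self-attention over a sequence of frame features, with a learned bias per
    relative offset added to the attention logits (one simple way to "model
    relative positions"; the paper's exact mechanism may differ).
    """

    def __init__(self, dim, max_len=300):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # One learned scalar bias per relative offset in [-(max_len-1), max_len-1].
        self.rel_pos_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):  # x: (batch, seq_len, dim), seq_len <= max_len
        b, n, d = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Scaled dot-product attention logits: (batch, seq_len, seq_len).
        logits = torch.einsum("bid,bjd->bij", q, k) / d ** 0.5
        # Look up a bias for each relative offset (i - j).
        idx = torch.arange(n, device=x.device)
        rel = idx[:, None] - idx[None, :] + self.max_len - 1
        logits = logits + self.rel_pos_bias[rel]
        attn = logits.softmax(dim=-1)
        # Residual connection, as in Non-local Neural Networks.
        return x + torch.einsum("bij,bjd->bid", attn, v)
```

Used this way, the block keeps the input shape unchanged, so it can be dropped between existing layers of a video model, which is consistent with the abstract's claim that PE-NL is pluggable into many architectures.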