
A Unimodal Representation Learning and Recurrent Decomposition Fusion Structure for Utterance-level Multimodal Embedding Learning



Learning a unified embedding for utterance-level video has attracted significant attention recently, driven by the rapid development of social media and its broad applications. An utterance normally contains not only spoken language but also nonverbal behaviors such as facial expressions and vocal patterns. Instead of directly learning the utterance embedding from low-level features, we first explore a high-level representation for each modality separately via a unimodal representation learning gyroscope structure. In this way, the learned unimodal representations are more representative and carry more abstract semantic information. In the gyroscope structure, we introduce multi-scale kernel learning together with 'channel expansion' and 'channel fusion' operations to explore high-level features both spatially and channel-wise. Another insight of our method is that we fuse the representations of all modalities into a unified embedding by interpreting the fusion procedure as a flow of inter-modality information between modalities, which makes the fusion more specialized in terms of both the information to be fused and the fusion process. Specifically, considering that each modality carries both modality-specific and cross-modality interactions, we decompose the unimodal representations into intra- and inter-modality dynamics using a gating mechanism, and then fuse the inter-modality dynamics by passing them from earlier modalities to later ones through a recurrent neural fusion architecture. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmark datasets.
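The abstract does not give the architecture's details, but the core fusion idea can be sketched illustratively: a learned gate splits each unimodal representation into intra- and inter-modality dynamics, and a simple recurrence passes the inter-modality parts from one modality to the next. The following minimal NumPy sketch is a hypothetical toy version under assumed dimensions and random (untrained) weights; the weight matrices, the tanh recurrence standing in for the paper's recurrent fusion cell, and the final concatenation are all assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8  # illustrative embedding dimension (assumed)

# Hypothetical unimodal representations for the three modalities.
modalities = {m: rng.standard_normal(d) for m in ("language", "audio", "vision")}

# Gating mechanism: a gate g splits each representation h into
# intra-modality dynamics (g * h) and inter-modality dynamics ((1 - g) * h),
# so the two parts sum back to the original representation.
W_g = rng.standard_normal((d, d)) * 0.1

def decompose(h):
    g = sigmoid(W_g @ h)
    return g * h, (1.0 - g) * h  # (intra, inter)

# Recurrent fusion: the inter-modality dynamics flow from earlier
# modalities to later ones via a tanh recurrence (a stand-in for the
# paper's recurrent neural fusion architecture).
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1

state = np.zeros(d)
intras = []
for name, h in modalities.items():
    intra, inter = decompose(h)
    intras.append(intra)
    state = np.tanh(W_h @ state + W_x @ inter)  # inter-modality info flows forward

# A unified utterance embedding: modality-specific parts plus the fused state.
embedding = np.concatenate(intras + [state])
print(embedding.shape)  # (32,)
```

The gate guarantees a lossless split (intra + inter recovers the input), so the recurrence only has to model cross-modality interactions while modality-specific content bypasses it.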

Keywords: utterance; representation; modality; fusion; level; structure

Journal Title: IEEE Transactions on Multimedia
Year Published: 2021


