
D-MmT: A concise decoder-only multi-modal transformer for abstractive summarization in videos


Abstract: Multi-modal abstractive summarization for videos is an emerging task that aims to integrate multi-modal, multi-source inputs (video, audio transcript) into a compressed textual summary. Although recent multi-encoder-decoder models for this task have shown promising performance, they do not explicitly model the interactions among the multi-source inputs. Some strategies, such as co-attention, have been used to model these interactions, but given the ultra-long sequences and the additional decoder involved in this task, coupling multi-modal data across multiple encoders and a decoder requires a complicated structure and additional parameters. In this paper, we propose a concise Decoder-only Multi-modal Transformer (D-MmT) based on these observations. Specifically, we remove the encoder and introduce an in-out shared multi-modal decoder so that the multi-source inputs and the target fully interact and couple in a shared feature space, reducing parameter redundancy. We also design a concise cascaded cross-modal interaction (CXMI) module within the multi-modal decoder that generates joint fusion representations and spontaneously establishes fine-grained intra- and inter-modal associations. In addition, to make full use of the ultra-long sequence information, we introduce a joint in-out loss so that the input transcript also participates in backpropagation, enhancing the contextual feature representation. Experimental results on the How2 dataset show that the proposed model outperforms the current state-of-the-art approach with fewer model parameters. Further analysis and visualization demonstrate the effectiveness of the proposed framework.
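Two of the abstract's core ideas — a single shared decoder operating over the concatenated multi-modal inputs and target, and a joint in-out loss that supervises the input transcript positions as well as the summary positions — can be sketched roughly as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation: all dimensions, the single-head attention, the mean-squared losses, and the 0.5 loss weighting are assumptions, and the CXMI module is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, mask):
    # single-head scaled dot-product self-attention over the shared sequence
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    return softmax(scores) @ x

# toy dimensions (hypothetical)
d_model = 16
n_video, n_text, n_target = 4, 6, 3  # video frames, transcript tokens, summary tokens

video = rng.normal(size=(n_video, d_model))      # projected video features
transcript = rng.normal(size=(n_text, d_model))  # transcript token embeddings
target = rng.normal(size=(n_target, d_model))    # shifted summary embeddings

# decoder-only: one shared sequence, no separate per-modality encoders
seq = np.concatenate([video, transcript, target], axis=0)
n = seq.shape[0]

# attention mask: the multi-modal inputs attend bidirectionally to each other,
# while target tokens attend causally (to all inputs and to previous targets)
mask = np.ones((n, n), dtype=bool)
start = n_video + n_text
for i in range(start, n):
    mask[:start, i] = False  # input positions cannot see target tokens
    mask[i, i + 1:] = False  # causal mask within the target span

out = self_attention(seq, mask)

# joint in-out loss (sketch): supervise both the transcript positions and the
# summary positions so the input transcript also drives backpropagation
in_loss = np.mean((out[n_video:start] - transcript) ** 2)
out_loss = np.mean((out[start:] - target) ** 2)
joint_loss = out_loss + 0.5 * in_loss  # 0.5 is an assumed weighting
print(out.shape, float(joint_loss))
```

In a full model the shared decoder would stack several such masked self-attention layers, and the in-loss and out-loss would be cross-entropy terms over vocabulary logits rather than the mean-squared placeholders used here.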

Keywords: multi-modal; video summarization; abstractive summarization; decoder-only transformer

Journal Title: Neurocomputing
Year Published: 2021



