Abstract Multi-modal abstractive summarization for videos is an emerging task that aims to integrate multi-modal, multi-source inputs (video, audio transcript) into a compressed textual summary. Although recent multi-encoder-decoder models for this task have shown promising performance, they do not explicitly model the interactions among multi-source inputs. Strategies such as co-attention can model this interaction, but given the ultra-long sequences and the additional decoder involved in this task, coupling multi-modal data across multiple encoders and the decoder requires a complicated structure and additional parameters. In this paper, we propose a concise Decoder-only Multi-modal Transformer (D-MmT) based on the above observations. Specifically, we remove the encoder and introduce an in-out shared multi-modal decoder so that the multi-source inputs and the target fully interact and couple in a shared feature space, reducing parameter redundancy. We also design a concise cascaded cross-modal interaction (CXMI) module in the multi-modal decoder that generates joint fusion representations and spontaneously establishes fine-grained intra- and inter-modal associations. In addition, to make full use of the ultra-long sequence information, we introduce a joint in-out loss that lets the input transcript also participate in backpropagation, enhancing the contextual feature representation. Experimental results on the How2 dataset show that the proposed model outperforms the current state-of-the-art approach with fewer model parameters. Further analysis and visualization demonstrate the effectiveness of the proposed framework.
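To make the decoder-only design concrete, the following is a minimal PyTorch sketch of the overall idea described in the abstract: transcript and summary tokens share a single in-out decoder stream, video features are injected through cross-attention inside each block, and a joint in-out loss applies next-token supervision to both the transcript span and the summary span. All class names, layer layouts, default dimensions, and the weighting term `lambda_in` are assumptions for illustration; they are not the authors' released implementation of D-MmT or the exact internals of the CXMI module.

```python
import torch
import torch.nn as nn


class CascadedCrossModalBlock(nn.Module):
    """One decoder block in the spirit of CXMI: causal self-attention over the
    shared transcript+summary stream (intra-modal association), followed by
    cross-attention into video features (inter-modal association).
    The exact cascade/gating used in the paper is an assumption here."""

    def __init__(self, d_model=512, nhead=8, dim_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, text, video, causal_mask):
        # Intra-modal: causal self-attention over the in-out text stream.
        h, _ = self.self_attn(text, text, text, attn_mask=causal_mask)
        text = self.norm1(text + h)
        # Inter-modal: text queries attend to projected video features.
        h, _ = self.video_attn(text, video, video)
        text = self.norm2(text + h)
        return self.norm3(text + self.ffn(text))


class DMmTSketch(nn.Module):
    """Encoder-free multi-modal decoder: one token stream, video via cross-attention."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, video_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.blocks = nn.ModuleList(CascadedCrossModalBlock(d_model, nhead) for _ in range(num_layers))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, video_feats):
        # tokens: transcript tokens followed by summary tokens (shared in-out sequence).
        T = tokens.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        x = self.embed(tokens)
        v = self.video_proj(video_feats)
        for blk in self.blocks:
            x = blk(x, v, causal_mask)
        return self.lm_head(x)


def joint_in_out_loss(logits, tokens, transcript_len, lambda_in=0.5):
    """Joint in-out loss sketch: next-token cross-entropy on both the input
    (transcript) span and the output (summary) span, so the transcript also
    participates in backpropagation. lambda_in is a hypothetical weight."""
    ce = nn.CrossEntropyLoss(reduction="none")
    shifted_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shifted_labels = tokens[:, 1:].reshape(-1)
    per_token = ce(shifted_logits, shifted_labels).view(tokens.size(0), -1)
    in_loss = per_token[:, : transcript_len - 1].mean()   # supervision on the input transcript
    out_loss = per_token[:, transcript_len - 1:].mean()   # supervision on the target summary
    return out_loss + lambda_in * in_loss
```

A single forward pass would take the concatenated transcript-plus-summary token ids and a sequence of frame-level video features, and the loss would be computed over both spans; only the summary span would be generated autoregressively at inference time.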