ERM: Energy-Based Refined-Attention Mechanism for Video Question Answering

Spatiotemporal attention learning remains a challenging problem in video question answering (VideoQA) because it requires a sufficient understanding of cross-modal spatiotemporal information. Existing methods usually leverage different cross-modal attention mechanisms to reveal potential associations between video and question. While these methods effectively remove irrelevant information from the spatiotemporal attention, they ignore the pseudo-related information within the cross-modal interaction attention. To address this problem, we propose a novel energy-based refined-attention mechanism (ERM). ERM uses a significant-difference distribution, derived from question-guided cross-modal interaction information, as a discriminative criterion to separate question-related from question-unrelated cross-modal interaction information. Specifically, it measures the linear separability between a target neuron and the other neurons in the network to determine the importance of each neuron. In addition, to counter the statistical bias caused by differences between modalities in video tasks, ERM has learnable parameters, through which the correlation between modalities can be learned adaptively. ERM is flexible and modular while remaining lightweight. With the help of ERM, we construct a lightweight VideoQA model that efficiently integrates the cross-modal feature representations in an energy-based manner. To evaluate the effectiveness of our method, we carried out extensive experiments on five publicly available datasets and compared it with state-of-the-art VideoQA methods. The experimental results demonstrate that our method brings a noticeable performance improvement over state-of-the-art VideoQA methods. ERM can also be flexibly integrated into different VideoQA methods to improve their question-answering performance.
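
The abstract does not give ERM's energy function, so the following is only an illustrative sketch of the general idea of scoring a neuron by how linearly separable it is from the other neurons. It assumes the closed-form energy used by parameter-free attention modules such as SimAM, in which a neuron that deviates strongly from the mean of the others has low energy and therefore high importance. The function name, the regularizer `lam`, and the scale and shift `gamma` and `beta` (standing in for the "learnable parameters" the abstract mentions) are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def energy_refined_attention(channel, lam=1e-4, gamma=1.0, beta=0.0):
    """Reweight one channel of fused video-question features.

    channel: list of activations (one value per neuron / position).
    lam:     small regularizer that keeps the denominator positive (assumed).
    gamma, beta: hypothetical scale and shift that a real model would learn;
                 they are fixed constants in this sketch.
    """
    n = len(channel) - 1
    mu = sum(channel) / len(channel)
    # Squared deviation of each neuron from the channel mean.
    sq_dev = [(x - mu) ** 2 for x in channel]
    var = sum(sq_dev) / n
    # Inverse energy: larger when a neuron is easier to separate from the rest.
    inv_energy = [d / (4.0 * (var + lam)) + 0.5 for d in sq_dev]
    # Squash to (0, 1) and use the result as a multiplicative attention weight.
    weights = [sigmoid(gamma * e + beta) for e in inv_energy]
    refined = [x * w for x, w in zip(channel, weights)]
    return refined, weights


# Toy channel: one neuron (index 3) stands out from the others.
channel = [0.1, -0.2, 0.3, 8.0, 0.0, -0.1, 0.2, -0.3]
refined, weights = energy_refined_attention(channel)
most_important = max(range(len(weights)), key=weights.__getitem__)
```

In this toy input the neuron at index 3 deviates most from the channel mean, so it receives the largest weight, while the near-average neurons are damped. In a trained model, `gamma` and `beta` would be optimized jointly with the rest of the network instead of being held constant.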

Keywords: video question; question; cross modal; energy based; attention

Journal Title: IEEE Transactions on Circuits and Systems for Video Technology
Year Published: 2023
