Spatiotemporal attention learning remains challenging in video question answering (VideoQA) because it requires a sufficient understanding of cross-modal spatiotemporal information. Existing methods usually leverage different cross-modal attention mechanisms to reveal potential associations between video and question. While these methods effectively remove irrelevant information from the spatiotemporal attention, they ignore the pseudo-related information within the cross-modal interaction attention. To address this problem, we propose a novel energy-based refined-attention mechanism (ERM). ERM uses the significant difference distribution, derived from question-guided cross-modal interaction information, as a discriminative criterion to separate question-related from question-irrelevant cross-modal interaction information. Specifically, it measures the linear separability between a target neuron and the other neurons in the network to determine the importance of each neuron. In addition, to reduce the statistical bias caused by differences between modalities in video tasks, ERM has learnable parameters through which the correlation between modalities can be learned adaptively. The advantages of the proposed ERM are that it is flexible and modular while remaining lightweight. With the help of ERM, we construct a lightweight VideoQA model that efficiently integrates the cross-modal feature representations in an energy-based manner. To evaluate the effectiveness of our method, we carried out extensive experiments on five publicly available datasets and compared it with state-of-the-art VideoQA methods. The experimental results demonstrate that our method brings a noticeable performance improvement over state-of-the-art VideoQA methods. ERM can also be flexibly integrated into different VideoQA methods to improve their question answering performance.
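
The abstract does not give the energy function itself, so the following is only an illustrative sketch of how a neuron's importance could be scored by its linear separability from the other neurons. It assumes a closed-form minimal energy of the kind used in parameter-free attention modules such as SimAM; whether ERM uses this exact form is an assumption. The names `energy_gate`, `refine`, `lam`, `scale`, and `shift` are hypothetical, and `scale` and `shift` merely stand in for the learnable parameters mentioned above.

```python
import math


def energy_gate(activations, lam=1e-4, scale=1.0, shift=0.0):
    """Return one gate in (0, 1) per neuron (illustrative, not the paper's code).

    A neuron far from the mean of its peers is easier to separate
    linearly from them, has lower energy, and so gets a larger gate.
    `scale` and `shift` are hypothetical stand-ins for learnable parameters.
    """
    n = len(activations)
    mu = sum(activations) / n
    sq_dev = [(a - mu) ** 2 for a in activations]
    # Variance estimate; `lam` is a small regulariser that avoids division by zero.
    var = sum(sq_dev) / max(n - 1, 1)
    gates = []
    for d in sq_dev:
        # Inverse of the assumed minimal energy (up to a constant factor).
        inv_energy = d / (4.0 * (var + lam)) + 0.5
        gates.append(1.0 / (1.0 + math.exp(-(scale * inv_energy + shift))))
    return gates


def refine(activations, **kwargs):
    """Reweight each activation by its energy-based gate."""
    gates = energy_gate(activations, **kwargs)
    return [a * g for a, g in zip(activations, gates)]


if __name__ == "__main__":
    # Four similar neurons and one distinctive neuron.
    feats = [0.1, 0.1, 0.1, 0.1, 2.0]
    print(energy_gate(feats))
    print(refine(feats))
```

In this sketch the distinctive neuron receives the largest gate, while neurons that are indistinguishable from their peers all receive the same baseline gate. In a trained model, `scale` and `shift` would be optimised jointly with the rest of the network rather than fixed by hand.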
               