Many pioneering approaches have verified the effectiveness of combining global temporal and local object information for video understanding tasks and have achieved significant progress. However, existing methods rely on object detectors to extract all objects from every video frame, which may degrade performance due to both spatial and temporal information redundancy. To address this problem, we propose an adaptive spatial location module for the video captioning task that dynamically predicts the important spatial location in each video frame while generating the description sentence. The proposed adaptive spatial location method not only makes our model focus on local object information, but also reduces the time and memory consumption caused by temporal redundancy across the many video frames and improves the accuracy of the generated descriptions. In addition, we propose a balanced loss function to address the class imbalance present in the training data. The proposed balanced loss assigns a different weight to each word of the ground-truth sentence during training, which yields more diversified description sentences. Extensive experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared to state-of-the-art methods.
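
A minimal sketch of how such a word-level balanced loss could be realized is shown below. The abstract does not specify the weighting formula, so the inverse-frequency scheme, the smoothing term, and the helper names (build_word_weights, balanced_caption_loss) are illustrative assumptions, not the authors' implementation.

# Sketch: frequency-weighted cross-entropy over ground-truth caption words (PyTorch).
# Assumption: rarer words receive larger weights via inverse document frequency.
import torch
import torch.nn.functional as F

def build_word_weights(word_counts, vocab_size, smoothing=1.0):
    """Map each vocabulary index to a weight inversely related to its corpus frequency."""
    counts = torch.full((vocab_size,), smoothing)
    for idx, c in word_counts.items():
        counts[idx] += c
    # Normalize so frequent words are down-weighted and rare words up-weighted.
    return counts.sum() / (vocab_size * counts)

def balanced_caption_loss(logits, targets, word_weights, pad_idx=0):
    """Weighted cross-entropy over caption tokens.

    logits:  (batch, seq_len, vocab_size) decoder outputs
    targets: (batch, seq_len) ground-truth word indices
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size),
        targets.reshape(-1),
        weight=word_weights,
        ignore_index=pad_idx,
    )

if __name__ == "__main__":
    vocab_size, batch, seq_len = 1000, 2, 8
    # Hypothetical corpus counts: word 5 is very frequent, word 42 is rare.
    weights = build_word_weights({5: 50000, 42: 3}, vocab_size)
    logits = torch.randn(batch, seq_len, vocab_size)
    targets = torch.randint(1, vocab_size, (batch, seq_len))
    print(balanced_caption_loss(logits, targets, weights))

In this sketch the per-word weights simply rescale each token's contribution to the cross-entropy, so frequent function words dominate the gradient less and rarer content words are encouraged, which is one plausible way to obtain the more diversified descriptions the abstract mentions.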