A portion of the data in video captioning datasets is noisy and unsuitable for models to learn from at early stages of training. For example, given an average caption length of around 9 words, a dataset may contain both a generic 4-word caption that lacks distinctive details of the video content and a 19-word description with rare words and complex structure. The conventional training method, i.e., sampling randomly and indiscriminately from the whole training set, may cause data bias problems and undermine model performance. In this work, we present a novel learning strategy, Adaptive Curriculum Learning (ACL), to alleviate the adverse effect of such problems. The main idea of our approach is to let a model learn from data within its competence. Specifically, a difficulty measurement is first defined to evaluate the learning difficulty of video-caption pairs, so that the training data can be ranked accordingly. Then, based on the learning difficulties and the model competence, an adaptive sampling approach is developed to provide suitable training subsets for video captioning models at different training stages. Notably, the proposed ACL is applicable to most existing video captioning models, as it requires no modification of the model architecture. Extensive experiments are conducted on the mainstream benchmarks MSVD and MSR-VTT. The results show that both RNN-based and Transformer-based models achieve consistent performance improvements with our ACL strategy.
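The abstract does not specify the difficulty measurement or the sampling rule, so the following is only a minimal sketch of the general idea of competence-based sampling, not the ACL method itself. It assumes a hypothetical difficulty score (how far a caption's length deviates from the mean length) and a square-root competence schedule borrowed from generic competence-based curriculum learning. The names `length_difficulty`, `competence`, and `sample_batch` are illustrative only.

```python
import math
import random


def length_difficulty(captions):
    """Hypothetical difficulty score (an assumption, not the paper's measure):
    distance of each caption's word count from the mean, scaled to [0, 1]."""
    lengths = [len(c.split()) for c in captions]
    mean = sum(lengths) / len(lengths)
    deviations = [abs(n - mean) for n in lengths]
    max_dev = max(deviations) or 1.0  # avoid division by zero if all lengths match
    return [d / max_dev for d in deviations]


def competence(step, total_steps, initial=0.1):
    """Assumed square-root schedule: rises from `initial` at step 0
    to 1.0 at `total_steps`, then stays at 1.0."""
    progress = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(progress * (1 - initial ** 2) + initial ** 2))


def sample_batch(pairs, difficulties, step, total_steps, batch_size, rng=random):
    """Sample a batch from the easiest fraction of the data, where the
    fraction equals the current competence."""
    # Rank example indices from easiest to hardest.
    ranked = sorted(range(len(pairs)), key=lambda i: difficulties[i])
    # Keep at least one example so early batches are never empty.
    cutoff = max(1, math.ceil(competence(step, total_steps) * len(ranked)))
    eligible = ranked[:cutoff]
    chosen = rng.sample(eligible, min(batch_size, len(eligible)))
    return [pairs[i] for i in chosen]
```

Because the sampler only decides which video-caption pairs reach the model at each step, a sketch like this can wrap an existing training loop without touching the model architecture, which is the property the abstract highlights.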
               