Dense video captioning aims to localize and describe multiple events in untrimmed videos, a challenging task that has recently drawn increasing attention in computer vision. Although existing methods have achieved impressive performance, most of them focus only on the local information of event segments or on very simple event-level context, overlooking the complexity of event-event relationships and the holistic scene. As a result, the coherence of captions within the same video can suffer. In this article, we propose a novel event-centric hierarchical representation to alleviate this problem. We enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning. A caption generator with late fusion then produces surrounding-event-aware and topic-aware sentences, conditioned on the hierarchical representation of visual cues from the scene level, the event level, and the frame level. Furthermore, we propose a duplicate-removal method, temporal-linguistic non-maximum suppression (TL-NMS), to suppress redundancy in both the localization and captioning stages. Quantitative and qualitative evaluations on the ActivityNet Captions and YouCook2 datasets demonstrate that our method improves the quality of generated captions and achieves state-of-the-art performance on most metrics.
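
To make the TL-NMS idea concrete, the sketch below shows a minimal, hypothetical greedy suppression that combines a temporal overlap test with a linguistic similarity test over candidate event captions. The thresholds, the Jaccard word-overlap similarity, and the data layout are illustrative assumptions for exposition only; they are not the paper's actual formulation or implementation.

    # Minimal sketch of temporal-linguistic duplicate removal (TL-NMS-style).
    # Assumptions: each candidate is a dict with 'segment' (start, end), 'caption',
    # and 'score'; thresholds and the word-overlap similarity are placeholders,
    # not the paper's exact method.

    def temporal_iou(a, b):
        """Temporal intersection-over-union of two (start, end) segments."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    def caption_similarity(c1, c2):
        """Crude linguistic similarity: Jaccard overlap of lowercased word sets."""
        w1, w2 = set(c1.lower().split()), set(c2.lower().split())
        return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

    def tl_nms(candidates, iou_thr=0.7, sim_thr=0.6):
        """Greedily keep high-scoring candidates; drop one that overlaps a kept
        candidate both temporally and linguistically."""
        kept = []
        for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
            redundant = any(
                temporal_iou(cand["segment"], k["segment"]) > iou_thr
                and caption_similarity(cand["caption"], k["caption"]) > sim_thr
                for k in kept
            )
            if not redundant:
                kept.append(cand)
        return kept

    # Example: the second proposal is suppressed as a temporal and linguistic duplicate.
    proposals = [
        {"segment": (0.0, 12.0), "caption": "a man slices a tomato", "score": 0.92},
        {"segment": (0.5, 11.5), "caption": "a man slices a tomato on a board", "score": 0.85},
        {"segment": (20.0, 35.0), "caption": "he fries the slices in a pan", "score": 0.80},
    ]
    print([p["caption"] for p in tl_nms(proposals)])

Requiring both conditions (rather than temporal overlap alone, as in standard NMS) reflects the stated goal of distinguishing redundancy in the localization and captioning stages jointly: overlapping segments describing different events, or similar sentences about disjoint segments, are both retained in this sketch.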
               