Accelerated masked transformer for dense video captioning

Abstract: Dense video captioning aims to generate dense descriptions for all possible events in an untrimmed video. The task is challenging because it requires accurately localizing events in the video and simultaneously describing each event with a sentence. Current approaches usually decompose the task into two independent stages, proposal localization and caption generation, which results in a suboptimal solution. The Masked Transformer (MT) model [30] has been proposed to integrate the two stages and optimize them end to end. Despite the superior performance that MT has achieved, its runtime efficiency is unsatisfactory, which severely limits its applicability in real-world scenarios. In this paper, we devise an Accelerated Masked Transformer (AMT) model that is both effective and efficient. Taking MT as our reference model, we introduce accelerating strategies to each of the two stages: 1) in the proposal localization stage, a lightweight anchor-free proposal together with a local attention mechanism; and 2) in the caption generation stage, a single-shot feature masking strategy along with an average attention mechanism. Extensive experiments on two benchmark datasets, ActivityNet-Caption and YouCookII, demonstrate that AMT achieves competitive performance on both with a significant speed improvement. On the ActivityNet-Caption dataset, AMT reduces running time by up to 2× with performance comparable to the reference MT model.
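
The abstract names a local attention mechanism and an average attention mechanism but does not spell out either formulation. As a rough illustration only, the sketch below assumes two common readings: local attention as a fixed-width band mask over temporal positions, and average attention as a causal cumulative mean over the preceding positions. The function names, the `window` parameter, and both formulations are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def local_attention_mask(length: int, window: int) -> List[List[bool]]:
    """Band mask (assumed form): position i may attend to j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(length)] for i in range(length)]


def average_attention(values: List[List[float]]) -> List[List[float]]:
    """Causal average (assumed form): output at step t is the mean of values[0..t].

    A running sum replaces the query-key dot products of standard attention,
    so each step costs O(d) instead of O(t * d).
    """
    outputs: List[List[float]] = []
    running_sum = [0.0] * len(values[0]) if values else []
    for t, v in enumerate(values, start=1):
        running_sum = [s + x for s, x in zip(running_sum, v)]
        outputs.append([s / t for s in running_sum])
    return outputs


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(average_attention(feats))
    print(local_attention_mask(4, 1))
```

Both pieces avoid computing a full attention matrix over all positions, which is one plausible source of the speedup the abstract reports; the actual AMT design may differ.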

Keywords: video; accelerated masked; masked transformer; video captioning; dense video

Journal Title: Neurocomputing
Year Published: 2021
