LAUSR.org creates dashboard-style pages of related content for over 1.5 million academic articles. Sign Up to like articles & get recommendations!

Deep multimodal embedding for video captioning

Photo by mattykwong1 from unsplash

Automatically generating natural language descriptions from videos, which is simply called video captioning, is very challenging work in computer vision. Thanks to the success of image captioning, in recent years,… Click to show full abstract

Automatically generating natural language descriptions from videos, which is simply called video captioning, is very challenging work in computer vision. Thanks to the success of image captioning, in recent years, there has been rapid progress in the video captioning. Unlike images, videos have a variety of modality information, such as frames, motion, audio, and so on. However, since each modality has different characteristic, how they are embedded in a multimodal video captioning network is very important. This paper proposes a deep multimodal embedding network based on analysis of the multimodal features. The experimental results show that the captioning performance of the proposed network is very competitive in comparison with conventional networks.

Keywords: video; multimodal embedding; embedding video; video captioning; deep multimodal

Journal Title: Multimedia Tools and Applications
Year Published: 2019

Link to full text (if available)


Share on Social Media:                               Sign Up to like & get
recommendations!

Related content

More Information              News              Social Media              Video              Recommended



                Click one of the above tabs to view related content.