Automatically generating natural language descriptions from videos, commonly called video captioning, is a very challenging task in computer vision. Thanks to the success of image captioning, video captioning has made rapid progress in recent years. Unlike images, videos carry multiple modalities of information, such as frames, motion, and audio. Because each modality has different characteristics, how these modalities are embedded in a multimodal video captioning network matters greatly. This paper proposes a deep multimodal embedding network based on an analysis of the multimodal features. The experimental results show that the captioning performance of the proposed network is highly competitive with that of conventional networks.
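The abstract does not specify the network's actual design, but the core idea of embedding heterogeneous modalities into a shared space before caption decoding can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the class name `MultimodalEmbedding`, the feature dimensions, and the choice of per-modality linear projections with ReLU followed by concatenation-based fusion are hypothetical, not the authors' method.

```python
# A minimal sketch of multimodal feature embedding, NOT the paper's
# actual architecture. All module names and dimensions are hypothetical.
import torch
import torch.nn as nn

class MultimodalEmbedding(nn.Module):
    def __init__(self, frame_dim=2048, motion_dim=1024, audio_dim=128,
                 embed_dim=512):
        super().__init__()
        # One projection per modality into a shared embedding space.
        self.frame_proj = nn.Linear(frame_dim, embed_dim)
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Fuse the concatenated per-modality embeddings.
        self.fuse = nn.Linear(3 * embed_dim, embed_dim)

    def forward(self, frame_feat, motion_feat, audio_feat):
        # Embed each modality separately, then concatenate and fuse
        # into a single vector a caption decoder could consume.
        f = torch.relu(self.frame_proj(frame_feat))
        m = torch.relu(self.motion_proj(motion_feat))
        a = torch.relu(self.audio_proj(audio_feat))
        return self.fuse(torch.cat([f, m, a], dim=-1))

# Hypothetical usage with a batch of 4 videos:
model = MultimodalEmbedding()
fused = model(torch.randn(4, 2048), torch.randn(4, 1024),
              torch.randn(4, 128))
print(fused.shape)  # torch.Size([4, 512])
```

Projecting each modality separately before fusion is one common way to handle the differing characteristics the abstract mentions; the paper's analysis presumably motivates its own specific embedding strategy.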
               