This paper proposes a novel coupled content semantic embedding (CLOSE) method and applies it to video captioning. The motivation behind this design is to seek a consistent latent space for content–semantic pairs, in which pairs sharing the same attributes lie close to each other. Within this content–semantic embedding framework, CLOSE first learns two independent, reversible embeddings (content–content and semantic–semantic, respectively) and then aggregates the two via a coupled content–semantic embedding. Benefiting from the reversible property, CLOSE can be pretrained on large quantities of unlabeled data. In addition, building on this feature-embedding setting, a paradigm named multi-content embedding (MCE) is developed to describe multi-focus information; MCE learns a feature embedding that captures multiple discriminative contents. Extensive experiments against state-of-the-art methods on the benchmark datasets MSVD and MSR-VTT demonstrate the effectiveness and superiority of the proposed CLOSE.
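To make the coupling idea concrete, below is a minimal sketch, not the paper's actual implementation: the module and function names (AutoEncoder, close_loss), the feature dimensions, and the use of encoder/decoder reconstruction as a stand-in for the reversible embeddings are all illustrative assumptions. The reconstruction terms can be trained on unlabeled data alone, which mirrors the pretraining property the abstract describes, while the coupling term pulls matched content and semantic codes together in the shared latent space.

```python
# Illustrative sketch only; names, dims, and losses are assumptions,
# not the CLOSE paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Encoder/decoder pair; the reconstruction objective keeps the
    embedding approximately reversible (a stand-in for the paper's
    reversible content-content / semantic-semantic embeddings)."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def close_loss(content_ae, semantic_ae, v, s, alpha=1.0):
    """Reconstruction terms (trainable on unlabeled data) plus a coupling
    term that aligns matched content/semantic codes in one latent space."""
    zv, v_rec = content_ae(v)
    zs, s_rec = semantic_ae(s)
    recon = F.mse_loss(v_rec, v) + F.mse_loss(s_rec, s)
    couple = F.mse_loss(zv, zs)  # matched pair -> nearby latent codes
    return recon + alpha * couple

# Toy usage: 2048-d video content features, 300-d semantic attribute vectors
# (hypothetical dimensions), coupled in a shared 256-d latent space.
content_ae = AutoEncoder(2048, 256)
semantic_ae = AutoEncoder(300, 256)
v, s = torch.randn(8, 2048), torch.randn(8, 300)
loss = close_loss(content_ae, semantic_ae, v, s)
loss.backward()
```

Setting alpha to zero recovers pure unsupervised pretraining of the two independent embeddings; the coupling term is then switched on when paired content–semantic data is available.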