CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition

Visual Speech Recognition (VSR) is the task of transcribing speech into text from the external appearance of the face (i.e., the lips). Because visual lip movements alone do not carry enough information to fully represent speech, VSR is considered a challenging problem. One possible remedy is to additionally use audio, which contains rich information for speech recognition. However, audio is not always available, for example in crowded situations. Thus, it is necessary to find a way to provide enough information for speech recognition from visual inputs only. In this paper, we alleviate the information insufficiency of visual lip movements by proposing a cross-modal memory augmented VSR framework with a Visual-Audio Memory (VAM). The proposed framework exploits the complementary information of audio even when audio inputs are not provided at inference time. Concretely, the proposed VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories: a lip-video key memory and an audio value memory. We guide the audio value memory to imprint the audio features and the lip-video key memory to memorize the location of the imprinted audio. In this way, the VAM can exploit rich audio information by accessing the memory with visual inputs only. Experimental results show that the proposed method achieves state-of-the-art performance on both word- and sentence-level VSR. In addition, we verify that the learned representations inside the VAM contain meaningful information for VSR.
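
The key-value lookup described in the abstract can be illustrated with a minimal sketch. It covers only the inference-time read: a visual feature addresses the lip-video key memory, and the resulting weights read from the audio value memory. The abstract does not specify the addressing function, so cosine similarity followed by a softmax is an assumption here, as are all names (`address`, `recall_audio`, `scale`), the slot count, and the feature size. The memories are random placeholders rather than learned parameters, and the training objectives that imprint audio into the memory are not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def address(query, key_memory, scale=16.0):
    # Assumed addressing scheme: cosine similarity between each query
    # and each key slot, sharpened by `scale`, then normalized by softmax.
    q = query / (np.linalg.norm(query, axis=-1, keepdims=True) + 1e-8)
    k = key_memory / (np.linalg.norm(key_memory, axis=-1, keepdims=True) + 1e-8)
    return softmax(scale * (q @ k.T), axis=-1)


def recall_audio(visual_feat, key_memory, value_memory, scale=16.0):
    # Read from the audio value memory using visual features only.
    weights = address(visual_feat, key_memory, scale)
    return weights @ value_memory, weights


rng = np.random.default_rng(0)
num_slots, dim = 8, 4  # hypothetical sizes, chosen for illustration

# Placeholders for the two learned memories (random here, not trained).
key_memory = rng.standard_normal((num_slots, dim))    # lip-video keys
value_memory = rng.standard_normal((num_slots, dim))  # audio values

# Three clip-level visual features, e.g. from a visual front-end.
visual_feat = rng.standard_normal((3, dim))

recalled, weights = recall_audio(visual_feat, key_memory, value_memory)
print(recalled.shape, weights.shape)
```

In a full model of this kind, the recalled audio-like feature would presumably be combined with the visual feature before decoding; how the paper does this is not stated in the abstract.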

Keywords: memory; speech recognition; information; vsr

Journal Title: IEEE Transactions on Multimedia
Year Published: 2022
