As technology advances, a growing number of Internet of Things (IoT) devices with displays are making “face-to-face” interaction through visualization a reality. To protect user privacy, participants can be represented by avatars animated with audio-driven real-time speech animation. However, when audio is the only available input, the quality of the animation depends heavily on the accuracy and latency of real-time phoneme recognition. This article introduces a novel deep-learning-based real-time phoneme recognition network (RealPRNet) that leverages spatial and temporal patterns in the input audio data. With its long short-term memory (LSTM) stack block and long short-term features, RealPRNet achieves superior phoneme recognition performance. Our comprehensive empirical results show that, compared with state-of-the-art algorithms, RealPRNet achieves a 20% improvement in phoneme error rate (PER) and a 4% improvement in block error distance (BDE) in the best case.
               