"Japanese Neural Incremental Text-to-Speech Synthesis Framework With an Accent Phrase Input"

Neural incremental text-to-speech (iTTS), which is attracting increasing attention, pursues low-latency processing by generating speech on the fly before a complete sentence has been read. Most current state-of-the-art iTTS systems use a prefix-to-prefix neural iTTS framework with a look-ahead of 1-2 unit segments (i.e., phonemes or words). However, because the Japanese language is based on accent phrase units that are longer than words, using a prefix-to-prefix neural iTTS with a look-ahead approach increases latency. Here, we propose an alternative to the end-to-end neural iTTS architecture that does not apply look-ahead input when synthesizing speech chunks. We further propose a method that uses information from the previous time step by passing the previously synthesized vector and the model’s internal state to the current time step. We experimentally investigated the latency of various iTTS systems with different modeling and synthesis chunks. The experimental results show that, for Japanese, the proposed iTTS synthesizes speech of better quality, at a similar latency range, than the conventional baseline prefix-to-prefix neural iTTS with word units. Moreover, we found that the proposed approach improved prosodic naturalness across synthesized units in Japanese. Subjective evaluations also revealed that the proposed approach with an incremental unit of two accent phrases achieved the best scores among the Japanese iTTS systems evaluated.
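The chunk-by-chunk procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: `toy_synthesize_chunk` is a hypothetical stand-in for a neural synthesizer, `CarryOver` is an assumed container for the last synthesized vector and the internal state, and the romanized accent phrases are made-up placeholders. The sketch only shows the control flow: each chunk is synthesized without look-ahead, and the previous step's output and state are handed to the next step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CarryOver:
    """Information passed from one chunk to the next (assumed structure)."""
    last_frame: float = 0.0  # stands in for the last synthesized vector
    hidden: float = 0.0      # stands in for the model's internal state


def toy_synthesize_chunk(
    chunk: List[str], carry: CarryOver
) -> Tuple[List[float], CarryOver]:
    """Hypothetical stand-in for a neural model: one fake frame per unit."""
    frames: List[float] = []
    hidden, prev = carry.hidden, carry.last_frame
    for unit in chunk:
        hidden = 0.5 * hidden + len(unit)  # toy recurrent state update
        prev = 0.5 * prev + hidden         # toy frame depends on previous frame
        frames.append(prev)
    return frames, CarryOver(last_frame=prev, hidden=hidden)


def incremental_synthesis(
    accent_phrases: List[str], chunk_size: int = 2
) -> List[float]:
    """Synthesize chunk by chunk with no look-ahead input."""
    carry = CarryOver()
    output: List[float] = []
    for start in range(0, len(accent_phrases), chunk_size):
        chunk = accent_phrases[start:start + chunk_size]
        # Only the current chunk and the carried-over state are used;
        # later accent phrases are never read at this step.
        frames, carry = toy_synthesize_chunk(chunk, carry)
        output.extend(frames)
    return output


if __name__ == "__main__":
    phrases = ["kyoowa", "iitenkidesu", "ne"]  # placeholder accent phrases
    print(incremental_synthesis(phrases, chunk_size=2))
```

In this toy recurrence, carrying the state makes the chunked output identical to processing the whole input at once. A real neural model would not generally have that property; the sketch is only meant to show why handing over the previous vector and state helps keep consecutive chunks consistent.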

Keywords: neural incremental; text speech; speech; incremental text; neural itts; itts

Journal Title: IEEE Access
Year Published: 2023
