Work on neural incremental text-to-speech (iTTS), which is attracting increasing attention, has recently pursued low-latency processing by generating speech on the fly before reading complete sentences. Most current state-of-the-art iTTS systems use a prefix-to-prefix neural framework with a look-ahead of one to two unit segments (i.e., phonemes or words). However, since Japanese is based on accent-phrase units that are longer than words, using a prefix-to-prefix neural iTTS with look-ahead increases latency. Here, we propose an alternative end-to-end neural iTTS architecture that does not apply look-ahead input when synthesizing speech chunks. We further propose a method that uses information from the previous time step by connecting the synthesized vector and the model's internal state to the current time step. We experimentally investigated the latency of various iTTS systems with different modeling and synthesis chunks. The experimental results show that, for Japanese, the proposed iTTS synthesizes speech of better quality, at a similar latency, than the conventional baseline prefix-to-prefix neural iTTS with word units. Moreover, we found that our proposed approach improved the prosodic naturalness across synthesized units in Japanese. Subjective evaluations also revealed that the proposed approach with an incremental unit of two accent phrases achieved the best scores among the Japanese iTTS systems.
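The core idea of the no-look-ahead approach can be illustrated with a minimal sketch (not the authors' code; all names and the toy "decoder" are hypothetical): each chunk is synthesized immediately on arrival, and continuity across chunk boundaries is maintained by feeding the previous chunk's final synthesized vector and the model's internal state into the next step.

```python
# Minimal sketch of chunk-by-chunk iTTS without look-ahead.
# A real system would run an end-to-end neural acoustic model;
# here a toy recurrence stands in for the decoder.

def synthesize_chunk(text_chunk, prev_frame, prev_state):
    """Toy stand-in for a neural decoder: produces one 'frame' per
    character plus an updated internal state."""
    state = prev_state
    frame = prev_frame
    frames = []
    for ch in text_chunk:
        # Conditioning on the previous frame and the carried state
        # supplies the cross-chunk continuity that look-ahead input
        # would otherwise provide.
        state = (state + ord(ch)) % 997
        frame = 0.5 * frame + 0.5 * (state / 997.0)
        frames.append(frame)
    return frames, frame, state

def incremental_tts(chunks):
    """Synthesize speech chunk by chunk with zero look-ahead:
    each chunk is emitted as soon as it is produced."""
    frame, state = 0.0, 0
    audio = []
    for chunk in chunks:
        frames, frame, state = synthesize_chunk(chunk, frame, state)
        audio.extend(frames)  # emit immediately: low latency
    return audio

audio = incremental_tts(["kyou wa ", "ii tenki ", "desu ne"])
```

Because the state and last frame are threaded through every call, splitting the input into different chunk sizes (e.g., words versus accent phrases) changes only the latency granularity, not the conditioning mechanism.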