With the development of deep learning technology, speech synthesis based on deep neural networks has gradually become the mainstream method in the field of speech synthesis. In this paper, we… Click to show full abstract
With the development of deep learning technology, speech synthesis based on deep neural networks has gradually become the mainstream method in the field of speech synthesis. In this paper, we explored the Tacotron2 model for Lhasa-Tibetan dialect speech synthesis by constructing a feature prediction network with a seq2seq structure which maps the character vector to Mel spectrum, and combining with the WaveNet model trained in a semi-supervised way to synthesize the Mel spectrum into a time domain waveform. The model avoids processing front-end text analysis that requires extensive prior knowledge in Lhasa-Tibetan dialect and reduces the need of a large amount of transcribed speech data. Experimental results show that the proposed method is effective and has higher clarity and naturalness than other related synthesis models for Lhasa-Tibetan dialect.
               
Click one of the above tabs to view related content.