"An E2E-ASR-Based Iteratively-Trained Timestamp Estimator"

Text-to-speech alignment, also known as time alignment, is essential for automatic speech recognition (ASR) systems used in speech retrieval tasks such as keyword search and speech segment extraction. Previous work has used Gaussian mixture model-hidden Markov model (GMM-HMM) forced alignment to improve alignment performance. However, when used with end-to-end (E2E) ASR, GMM-HMM forced alignment adds a dependence on expert resources such as pronunciation lexica. It also increases system complexity because GMM-HMMs are very dissimilar to E2E models. To tackle these two problems, we propose an E2E-ASR-based iteratively-trained timestamp estimator (ITSE), which aligns a token-level transcription with speech. We first train ITSE with coarse initial alignment targets generated from connectionist temporal classification (CTC) posteriors. During training, we iteratively perform realignment to update the targets. We attribute the effectiveness of the iterative training to two vital features of ITSE. First, ITSE performs alignment using similarities between token and speech embeddings instead of frame-wise token classification posteriors. Second, ITSE uses speech embeddings that are aware of left context rather than global context. ITSE significantly outperforms CTC-based baselines in word alignment accuracy and is comparable to a GMM-HMM forced aligner. In short, ITSE is an accurate, lightweight text-to-speech alignment module implemented without expert resources such as pronunciation lexica.
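As a rough illustration of what aligning by similarities between token and speech embeddings can look like, the sketch below scores every (frame, token) pair with a dot product and then picks the best monotonic path by dynamic programming. It is a minimal sketch under stated assumptions, not the ITSE model from the paper: the embeddings are hand-made toy vectors, the dot-product score and the Viterbi-style search are stand-ins chosen for illustration, and the names `similarity_matrix`, `monotonic_align`, and `token_spans` are hypothetical.

```python
import math


def similarity_matrix(frames, tokens):
    """Dot-product similarity between every speech frame and every token.

    Assumption: both are lists of equal-length float vectors (toy embeddings).
    """
    return [[sum(a * b for a, b in zip(f, tok)) for tok in tokens] for f in frames]


def monotonic_align(sim):
    """Assign each frame to one token, in order, maximizing total similarity.

    Every token gets at least one frame; the path starts at token 0 and
    ends at the last token (a Viterbi-style search, used here as a stand-in).
    """
    num_frames, num_tokens = len(sim), len(sim[0])
    if num_frames < num_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = -math.inf
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    back = [[0] * num_tokens for _ in range(num_frames)]  # 1 = moved to next token
    score[0][0] = sim[0][0]
    for t in range(1, num_frames):
        for k in range(min(t + 1, num_tokens)):
            stay = score[t - 1][k]
            advance = score[t - 1][k - 1] if k > 0 else neg_inf
            if advance > stay:
                score[t][k] = advance + sim[t][k]
                back[t][k] = 1
            else:
                score[t][k] = stay + sim[t][k]
    # Trace the best path backwards from the last frame and last token.
    path = [0] * num_frames
    k = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = k
        k -= back[t][k]
    return path


def token_spans(path, num_tokens):
    """Convert a frame-to-token path into inclusive (start, end) frame spans."""
    spans = []
    for k in range(num_tokens):
        idx = [t for t, p in enumerate(path) if p == k]
        spans.append((idx[0], idx[-1]))
    return spans


# Toy data: two tokens and five frames in a 2-D embedding space.
tokens = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
path = monotonic_align(similarity_matrix(frames, tokens))
spans = token_spans(path, len(tokens))
print(path)   # [0, 0, 1, 1, 1]
print(spans)  # [(0, 1), (2, 4)]
```

Multiplying the frame indices by the frame shift of the speech encoder would turn these spans into timestamps. In an iterative scheme like the one the abstract describes, spans of this kind could serve as updated training targets after each realignment pass, although the sketch does not model the training itself.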

Keywords: itse; speech; asr based; e2e asr; iteratively trained; based iteratively

Journal Title: IEEE Signal Processing Letters
Year Published: 2022
