The emotion recognition performance of deep learning models is influenced by multiple factors, such as acoustic conditions, textual content, and the style of emotion expression (e.g., acted or natural). In this paper, these factors are analysed by training and evaluating state-of-the-art deep learning models on the input modalities speech, text, and their combination across six emotional speech corpora. A novel deep learning architecture is presented that further improves the state of the art in multimodal emotion recognition with speech and text on the IEMOCAP corpus. Results from models trained on individual corpora show that combining speech and text improves performance only on corpora where the text of utterances varies across emotions; on corpora with fixed text expressed in different emotions, it reduces performance and speech-only models perform better. Further, cross-corpus investigations are presented to assess robustness to changing acoustic and textual content. Results show that models perform significantly better in matched conditions; in particular, single-corpus models outperform multi-corpus models, although the latter tend to be more robust to acoustic variations, with performance still depending on the characteristics of both the training corpora and the test corpus.
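The abstract does not detail the proposed architecture, but as a rough illustration of how speech and text can be combined for emotion classification, a minimal late-fusion sketch in PyTorch might look as follows. All names, dimensions, and layer choices here are assumptions for illustration, not the paper's actual model:

```python
# Hypothetical sketch of a late-fusion speech+text emotion classifier.
# The paper's actual architecture is not specified in the abstract; all
# layer choices, dimensions, and names below are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, n_acoustic_feats=40, vocab_size=10000,
                 hidden=128, n_emotions=4):
        super().__init__()
        # Speech branch: BiLSTM over frame-level acoustic features
        # (e.g. log-Mel filterbanks).
        self.speech_rnn = nn.LSTM(n_acoustic_feats, hidden,
                                  batch_first=True, bidirectional=True)
        # Text branch: word embeddings + BiLSTM over the transcript.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_rnn = nn.LSTM(hidden, hidden,
                                batch_first=True, bidirectional=True)
        # Late fusion: concatenate pooled per-modality representations,
        # then classify into emotion categories.
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, speech, tokens):
        # speech: (batch, frames, n_acoustic_feats)
        # tokens: (batch, words) integer word indices
        s_out, _ = self.speech_rnn(speech)
        t_out, _ = self.text_rnn(self.embed(tokens))
        # Mean-pool each modality over time, then concatenate.
        fused = torch.cat([s_out.mean(dim=1), t_out.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Usage: emotion logits for a toy batch of two utterances.
model = MultimodalEmotionClassifier()
speech = torch.randn(2, 300, 40)           # 300 acoustic frames each
tokens = torch.randint(0, 10000, (2, 20))  # 20-word transcripts
logits = model(speech, tokens)             # shape: (2, 4)
```

A late-fusion design like this keeps the two encoders independent, which makes it easy to compare the speech-only, text-only, and combined configurations evaluated in the paper by simply masking out one branch.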