The emotion recognition performance of deep learning models is influenced by multiple factors, such as acoustic conditions, textual content, and the style of emotion expression (e.g., acted or natural). In this paper, these factors are analysed by training and evaluating state-of-the-art deep learning models on the input modalities speech, text, and their combination across six emotional speech corpora. A novel deep learning architecture is presented that further improves the state of the art in multimodal emotion recognition with speech and text on the IEMOCAP corpus. Results from models trained on individual corpora show that combining speech and text improves performance only on corpora where the text of utterances varies across emotions; on corpora with fixed text expressed in different emotions, it reduces performance and speech-only models perform better. Further, cross-corpus investigations are presented to assess robustness to changing acoustic and textual content. Results show that models perform significantly better in matched conditions; in particular, single-corpus models outperform multi-corpus models, although the latter tend to be more robust to acoustic variations, with performance still depending on the characteristics of both the training corpora and the test corpus.
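The abstract does not detail the proposed architecture, but as a rough illustration of how speech and text can be combined for emotion classification, a minimal late-fusion sketch in PyTorch might look as follows. All names, dimensions, and layer choices here are assumptions for illustration, not the paper's actual model:

```python
# Hypothetical sketch of a late-fusion speech+text emotion classifier.
# The paper's actual architecture is not specified in the abstract; all
# layer choices, dimensions, and names below are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, n_acoustic_feats=40, vocab_size=10000,
                 hidden=128, n_emotions=4):
        super().__init__()
        # Speech branch: BiLSTM over frame-level acoustic features
        # (e.g. log-Mel filterbanks).
        self.speech_rnn = nn.LSTM(n_acoustic_feats, hidden,
                                  batch_first=True, bidirectional=True)
        # Text branch: word embeddings + BiLSTM over the transcript.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_rnn = nn.LSTM(hidden, hidden,
                                batch_first=True, bidirectional=True)
        # Late fusion: concatenate pooled per-modality representations,
        # then classify into emotion categories.
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, speech, tokens):
        # speech: (batch, frames, n_acoustic_feats)
        # tokens: (batch, words) integer word indices
        s_out, _ = self.speech_rnn(speech)
        t_out, _ = self.text_rnn(self.embed(tokens))
        # Mean-pool each modality over time, then concatenate.
        fused = torch.cat([s_out.mean(dim=1), t_out.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Usage: emotion logits for a toy batch of two utterances.
model = MultimodalEmotionClassifier()
speech = torch.randn(2, 300, 40)           # 300 acoustic frames each
tokens = torch.randint(0, 10000, (2, 20))  # 20-word transcripts
logits = model(speech, tokens)             # shape: (2, 4)
```

A late-fusion design like this keeps the two encoders independent, which makes it easy to compare the speech-only, text-only, and combined configurations evaluated in the paper by simply masking out one branch.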