Speech emotion recognition (SER) systems are often evaluated in a speaker-independent manner. However, the variation in acoustic features across the speakers used during training and evaluation leads to a significant drop in accuracy at evaluation time. While speaker-adaptive techniques have been used for speech recognition, to the best of our knowledge they have not been employed for emotion recognition. Motivated by this, a speaker-adaptive DNN-HMM-based SER system is proposed in this paper. The feature-space maximum likelihood linear regression (fMLLR) technique is used for speaker adaptation during both the training and testing phases. The proposed system uses MFCC and epoch-based features. We exploit our earlier work on robust detection of epochs from emotional speech to obtain emotion-specific epoch-based features, namely the instantaneous pitch, phase, and strength of excitation. The combined feature set improves on the MFCC features, the standard baseline for SER systems in the literature, by 5.07%, and on the state-of-the-art techniques by 7.13%. Using just the MFCC features, the proposed model improves upon the state of the art by 2.06%. These results bring out the importance of speaker adaptation for SER systems and highlight the complementary nature of MFCC and epoch-based features for emotion recognition from speech. All experiments were carried out on the IEMOCAP emotional dataset.
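For orientation, the sketch below illustrates the feature pipeline the abstract describes: MFCC extraction followed by an fMLLR-style per-speaker affine transform of each frame. It is a minimal illustration, not the authors' implementation; the use of librosa, the 13-dimensional feature size, the synthetic input signal, and the identity-initialized transform are all assumptions for the example.

```python
import numpy as np
import librosa

# Synthetic 1-second tone stands in for a real utterance (assumption:
# the paper works on IEMOCAP recordings, not synthetic audio).
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)

# 13-dimensional MFCCs, a common SER baseline configuration (assumed here).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

# fMLLR applies a speaker-specific affine transform x' = A x + b to every
# feature frame. In practice, A and b are estimated per speaker by maximum
# likelihood against the acoustic model; identity/zero below are stand-ins.
A = np.eye(13)
b = np.zeros((13, 1))

adapted = A @ mfcc + b  # speaker-normalized features fed to the DNN-HMM
```

In a full system, the same transform estimation would be run for each training and test speaker, which is how fMLLR reduces the train/test speaker mismatch the abstract identifies.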