Speech emotion recognition (SER) aims to endow machines with the ability to perceive latent affective components in speech. However, because of their structures, existing deep-learning-based SER models struggle to jointly consider time–frequency and sequential information in speech, which may lead to deficiencies in learning reasonable local emotional representations. To address this, we propose in this article a convolutional-recurrent neural network with multiple attention mechanisms (CRNN-MA) for SER. It comprises parallel convolutional neural network (CNN) and long short-term memory (LSTM) modules, which take extracted Mel-spectra and frame-level features as input, respectively, so that time–frequency and sequential information are acquired simultaneously. Furthermore, we adopt three strategies in the proposed CRNN-MA: 1) a multiple self-attention layer in the CNN module operating on frame-level weights; 2) a multidimensional attention layer applied to the input features of the LSTM module; and 3) a fusion layer summarizing the features of the two modules. Experimental results on three conventional SER corpora demonstrate the effectiveness of the proposed approach, and of its convolutional-recurrent and multiple-attention modules, compared with other related models and existing state-of-the-art approaches.
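The two-branch idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention or fusion formulas, so the code assumes plain dot-product attention pooling over frames and fusion by concatenation. The names `attention_pool` and `fuse`, the query vectors, and the toy inputs are all hypothetical, and the learned CNN and LSTM encoders are replaced by fixed frame vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Summarize a sequence of frame vectors with attention weights.

    Assumption: each frame is scored by its dot product with a query
    vector; the abstract does not say how CRNN-MA computes its weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


def fuse(cnn_vector, lstm_vector):
    """Assumed fusion: concatenate the summaries of the two branches."""
    return list(cnn_vector) + list(lstm_vector)


# Toy stand-ins for the outputs of the CNN branch (time-frequency view)
# and the LSTM branch (sequential view); real inputs would be learned.
cnn_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lstm_frames = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

cnn_summary, cnn_weights = attention_pool(cnn_frames, query=[1.0, 1.0])
lstm_summary, lstm_weights = attention_pool(lstm_frames, query=[0.0, 0.0, 0.0])
fused = fuse(cnn_summary, lstm_summary)
```

In this sketch the fused vector would be passed to an emotion classifier. Concatenation keeps the two views separate and leaves it to the classifier to weigh them, which is one common choice among several possible fusion schemes.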
               