The use of sufficiently large datasets is important for most deep learning tasks, and emotion recognition is no exception. Multimodal emotion recognition considers multiple modalities simultaneously to improve accuracy and robustness, typically visual, audio, and text. As with other deep learning tasks, large datasets are required, and the available datasets are heterogeneous: unimodal datasets built for traditional unimodal recognition as well as bimodal and trimodal datasets built for multimodal emotion recognition. A trimodal emotion recognition model achieves high performance and robustness by considering all three modalities jointly; however, it cannot directly exploit unimodal or bimodal datasets, because they lack one or more of the required modalities. In this study, we propose a novel method to improve emotion recognition performance based on a cross-modal translator that can translate between the three modalities. The proposed method can train a trimodal model on heterogeneous datasets of different types and does not require the datasets to be aligned across the visual, audio, and text modalities. By adding unimodal and bimodal datasets to the trimodal dataset, we achieved performance exceeding the baseline on CMU-MOSEI and IEMOCAP, two representative multimodal datasets.
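To make the idea of a cross-modal translator more concrete, the following is a minimal PyTorch sketch of one plausible realization: a small network per ordered modality pair that maps one modality's embedding into another's, so that samples missing a modality can be completed before being fed to a trimodal classifier. The feature dimensions, network sizes, module names, and the "use the first available modality" fill policy are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a cross-modal translator for heterogeneous
# (unimodal / bimodal / trimodal) emotion datasets. All names and sizes
# below are assumptions for illustration only.
import torch
import torch.nn as nn

class ModalityTranslator(nn.Module):
    """Maps an embedding of one modality into the embedding space of another."""
    def __init__(self, src_dim: int, tgt_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, tgt_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Assumed per-modality feature sizes (visual, audio, text).
DIMS = {"visual": 512, "audio": 128, "text": 768}

# One translator per ordered modality pair, e.g. "text2audio".
translators = nn.ModuleDict({
    f"{src}2{tgt}": ModalityTranslator(DIMS[src], DIMS[tgt])
    for src in DIMS for tgt in DIMS if src != tgt
})

def fill_missing(features: dict) -> dict:
    """Given features for any subset of modalities, synthesize the missing ones
    by translating from an available modality (here, simply the first one)."""
    present = [m for m in DIMS if m in features]
    out = dict(features)
    for tgt in DIMS:
        if tgt not in out:
            src = present[0]
            out[tgt] = translators[f"{src}2{tgt}"](features[src])
    return out

# Example: a bimodal sample (audio + text) is completed to all three modalities,
# so a trimodal emotion classifier can consume it without modality alignment.
sample = {"audio": torch.randn(4, DIMS["audio"]), "text": torch.randn(4, DIMS["text"])}
completed = fill_missing(sample)
print({k: v.shape for k, v in completed.items()})
```

In such a setup, the translators would typically be trained with a translation or reconstruction loss on whatever modality pairs each dataset happens to provide, which is what would allow unaligned unimodal and bimodal datasets to contribute to a trimodal model.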
               