The primary objective of affective computing is the precise and automated identification of human affective states. In recent years, affective recognition has gained substantial attention owing to its broad potential for enhancing human–computer interaction. Unlike conventional approaches, which mostly rely on a single modality and frequently overlook the intrinsic relationships between modalities, this study presents a differential multimodal transformer, an end-to-end framework for multimodal affective state recognition that integrates heterogeneous data sources. The architecture is organized into three parts: a differential multi-scale feature extractor (DMFE) that aligns signals and emphasizes meaningful temporal changes, a global cross-attention encoder (GCE) that models relationships across modalities, and a difference-augmented weighted fusion (DWF) module that merges the features through two self-attention pooling layers and passes the fused representation to the classifier. On the DEAP dataset, the model achieved accuracies of 94.23% (valence) and 93.91% (arousal), with corresponding F1-scores of 94.89% and 94.92%. On WESAD, the framework attained 94.63% accuracy and a 93.82% F1-score. Performance on CogLoad reached 92.08% accuracy and a 92.91% F1-score, while on MOCAS it achieved 92.04% accuracy and a 91.89% F1-score. These results demonstrate that the proposed architecture outperforms existing models. Ablation studies further confirm the contribution and significance of each core architectural component.
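The abstract names its three components (DMFE, GCE, DWF) but gives no implementation detail. Purely as an illustration of the two mechanisms it does describe, cross-attention across modalities and self-attention pooling before classification, the following PyTorch sketch shows one plausible realization. Every class name, dimension, and design choice below is an assumption for exposition, not the authors' code.

```python
# Hypothetical sketch of cross-modal attention fusion with self-attention
# pooling, loosely following the GCE/DWF roles described in the abstract.
# Module names, dimensions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionEncoder(nn.Module):
    """Models inter-modality relationships: queries from one modality
    attend over keys/values from another (a generic stand-in for GCE)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod, context_mod):
        fused, _ = self.attn(query_mod, context_mod, context_mod)
        return self.norm(query_mod + fused)  # residual connection

class SelfAttentionPool(nn.Module):
    """Collapses a temporal sequence into one vector via learned attention
    weights (one plausible reading of the pooling layers in DWF)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return (w * x).sum(dim=1)                # (batch, dim)

class FusionClassifier(nn.Module):
    """Two-modality fusion head: bidirectional cross-attention, one pooling
    layer per stream, concatenation, then a linear classifier."""
    def __init__(self, dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.cross_ab = CrossAttentionEncoder(dim)
        self.cross_ba = CrossAttentionEncoder(dim)
        self.pool_a = SelfAttentionPool(dim)
        self.pool_b = SelfAttentionPool(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, mod_a, mod_b):  # e.g. EEG and peripheral features
        a = self.cross_ab(mod_a, mod_b)  # modality A attends to B
        b = self.cross_ba(mod_b, mod_a)  # modality B attends to A
        fused = torch.cat([self.pool_a(a), self.pool_b(b)], dim=-1)
        return self.head(fused)

# Usage with random stand-in features (batch=8, 128 time steps, dim=64):
model = FusionClassifier()
logits = model(torch.randn(8, 128, 64), torch.randn(8, 128, 64))
print(logits.shape)  # torch.Size([8, 2])
```

The bidirectional cross-attention (each modality attending to the other) is one common way to realize "relationships across modalities"; the paper's actual DMFE alignment and difference-augmented weighting are not reproduced here.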