A Deep Learning Feature Mapping Algorithm for Emotion Detection via Facial and Audio Signals

Tayarani, Mohammad, Shahid, Shamim, Foerster, Frank and Steuber, Volker (2026) A Deep Learning Feature Mapping Algorithm for Emotion Detection via Facial and Audio Signals. Applied Soft Computing, 197: 114998. ISSN 1568-4946

Automatic emotion recognition plays a critical role in areas such as mental-health monitoring, human–robot interaction, and personalised learning systems, yet current multimodal approaches often struggle with high intra-class variability and the limited discriminative power of raw audio–visual features. Existing methods typically classify audio or facial data directly, without explicitly enforcing a structured joint embedding in which emotional categories become separable. This paper addresses this limitation by proposing a supervised contrastive feature-mapping algorithm that transforms temporal audio and video features into a representation that minimises intra-class distances while maximising inter-class distances. In contrast to prior work, which usually relies on handcrafted feature engineering or end-to-end classifiers, our approach explicitly learns a discriminative metric space that improves the geometry of the feature distribution. The method is evaluated on the RAVDESS and CREMA-D benchmark datasets. Experimental results show that the proposed mapping yields consistent accuracy improvements over strong machine-learning baselines, with gains of up to approximately 6%, achieving 96.07% accuracy on RAVDESS and competitive performance on CREMA-D, while outperforming or matching recent state-of-the-art multimodal emotion-recognition pipelines. Statistical tests (Kruskal–Wallis and paired t-tests) confirm that the learned representation significantly increases class separability ($p < 10^{-5}$). The method assumes paired audio–visual inputs but requires no explicit temporal alignment, and the learned feature space is compact, discriminative, and well suited to downstream tasks such as affect-aware dialogue systems, rehabilitation monitoring, and adaptive educational interfaces. These results demonstrate that contrastive feature mapping provides a robust and generalisable framework for multimodal emotion analysis.
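The core idea of the abstract — mapping features so that same-emotion samples are pulled together and different-emotion samples pushed apart — can be illustrated with a toy contrastive objective. The function below is a minimal sketch, not the authors' implementation; the margin hinge and Euclidean metric are illustrative assumptions.

```python
import numpy as np

def contrastive_mapping_loss(embeddings, labels, margin=1.0):
    """Toy supervised contrastive objective: penalise distance between
    same-class pairs and penalise cross-class pairs that fall within
    `margin`. Names and details are illustrative, not from the paper."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dist[same & off_diag]   # same-emotion pairs: minimise
    inter = dist[~same]             # cross-emotion pairs: push beyond margin
    loss = intra.mean() + np.maximum(0.0, margin - inter).mean()
    return loss, intra.mean(), inter.mean()
```

On well-separated clusters the intra-class term shrinks and the hinge term vanishes, so a mapping trained to minimise this loss produces exactly the tight, separable class geometry the abstract describes.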


A_deep_learning_feature_mapping_algorithm_for_emotion_detection_via_facial_and_audio_signals.pdf
Published Version
Available under Creative Commons: BY 4.0


