Investigating HuBERT-based Speech Emotion Recognition Generalisation Capability
Authors
Li, Letian
Glackin, Cornelius
Cannings, Nigel
Veneziano, Vito
Barker, Jack
Oduola, Olakunle
Woodruff, Chris
Laird, Thea
Laird, James
Sun, Yi
Abstract
Transformer-based architectures have made significant progress in speech emotion recognition (SER). However, most published SER research trains and tests models on data from the same corpus, which results in poor generalisation to unseen data from other corpora. To address this, we applied the HuBERT model to a combined training set consisting of five publicly available datasets (IEMOCAP, RAVDESS, TESS, CREMA-D, and 80% of CMU-MOSEI) and conducted cross-corpus testing on the Strong Emotion (StrEmo) Dataset (a natural dataset collected by the authors) and two publicly available test sets (SAVEE and the remaining 20% of CMU-MOSEI). Our best model achieved an F1 score of 0.78 across the three test sets, and 0.86 on StrEmo specifically. Additionally, we release a spreadsheet of key information about the StrEmo dataset as supplementary material for the conference.
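The approach summarised above can be illustrated with a minimal sketch of HuBERT-based utterance-level emotion classification and cross-corpus F1 scoring. This is not the authors' pipeline: it assumes the Hugging Face transformers library, the public facebook/hubert-base-ls960 checkpoint, an illustrative four-class label set, and macro-averaged F1; the classification head is newly initialised and would still need fine-tuning on the pooled training corpora.

```python
# Sketch only: HuBERT fine-tuning setup for SER plus cross-corpus F1 scoring.
# Checkpoint, label set, and macro averaging are assumptions, not the paper's
# exact configuration.
import torch
from sklearn.metrics import f1_score
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16_000)
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960",  # public HuBERT base checkpoint (assumed)
    num_labels=len(EMOTIONS),      # classifier head is randomly initialised
)
model.eval()


def classify(waveform_16khz) -> int:
    """Predict an emotion index for one 16 kHz mono waveform (1-D float array)."""
    inputs = feature_extractor(
        waveform_16khz, sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))


def cross_corpus_f1(waveforms, labels) -> float:
    """Score predictions on an unseen test corpus with macro-averaged F1."""
    preds = [classify(w) for w in waveforms]
    return f1_score(labels, preds, average="macro")
```

In this sketch, training on the pooled corpora would precede evaluation; cross_corpus_f1 then mirrors the abstract's evaluation protocol by scoring the model on utterances drawn from a corpus never seen during training.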