Investigating Salient Representations and Label Variance Modeling in Dimensional Speech Emotion Analysis


Representations from models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) have helped achieve state-of-the-art performance in dimensional speech emotion recognition. Both HuBERT and BERT generate high-dimensional representations, and neither model was trained with the emotion recognition task in mind. Such high-dimensional representations lead to speech emotion models with a large number of parameters, which increases both memory and computational cost. In this work, we investigate selecting representations based on their task saliency, which may help reduce model complexity without sacrificing dimensional emotion estimation performance. In addition, we investigate modeling label uncertainty in the form of grader opinion variance, and demonstrate that such information can help improve the model's generalization capacity and robustness. Finally, we analyze the robustness of the speech emotion model against acoustic degradation and observe that selecting salient representations from pre-trained models and modeling label uncertainty help improve the model's generalization to unseen data containing acoustic distortions in the form of environmental noise and reverberation.
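
The two ideas in the abstract, saliency-based selection of representation dimensions and the use of grader opinion variance as a label, can be illustrated with a minimal sketch. The abstract does not state how saliency is computed, so the criterion used here (absolute Pearson correlation between each embedding dimension and the mean grader score) is an assumption chosen for illustration only, and the function names `pearson`, `select_salient_dims`, and `grader_targets`, as well as the toy data, are hypothetical.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_salient_dims(features, targets, k):
    """Return the indices of the k dimensions most correlated with targets.

    ASSUMPTION: |Pearson r| stands in for "task saliency"; the source
    abstract does not specify the actual saliency measure.
    """
    n_dims = len(features[0])
    scores = [
        abs(pearson([row[d] for row in features], targets))
        for d in range(n_dims)
    ]
    top = sorted(range(n_dims), key=lambda d: scores[d], reverse=True)[:k]
    return sorted(top)


def grader_targets(ratings):
    """Map each utterance's grader scores to (mean, variance).

    The variance is the label-uncertainty signal a model could be
    trained to predict alongside the mean.
    """
    return [(mean(r), pvariance(r)) for r in ratings]


# Toy data: 4 utterances, 3 embedding dimensions (dimension 1 is constant).
features = [
    [0.1, 5.0, 1.0],
    [0.2, 5.0, 2.0],
    [0.3, 5.0, 2.9],
    [0.4, 5.0, 4.2],
]
# Three hypothetical graders per utterance.
ratings = [[1, 2, 3], [2, 2, 2], [3, 4, 5], [4, 4, 4]]

targets_with_var = grader_targets(ratings)
mean_targets = [m for m, _ in targets_with_var]
selected = select_salient_dims(features, mean_targets, k=2)
print(selected)  # indices of the retained dimensions
```

In this toy setting the constant dimension carries no information about the target and is dropped, which shrinks the input to any downstream emotion model. A practical system would compute saliency on a training set only and apply the same index set to held-out data.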


