Original scientific paper
https://doi.org/10.1080/00051144.2024.2371249
Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition
Thomas Mary Little Flower*; Department of ECE, St.Xavier’s Catholic College of Engineering, Chunkankadai, India
Thirasama Jaya; Department of ECE, Saveetha College of Engineering, Chennai, India
Sreedharan Christopher Ezhil Singh; Department of Mechanical Engineering, Vimal Jyothi Engineering College, Kannur, India
* Corresponding author.
Abstract
Speech emotion recognition (SER) is attractive in several domains, such as automated translation,
call centres, intelligent healthcare, and human–computer interaction. Deep learning models for
emotion identification need considerable labelled data, which is not always available in the
SER field. Efficient emotion identification requires a database with enough speech samples,
informative features, and a strong classifier. This study uses data augmentation to increase the
number of input voice samples and address the data shortage. The augmentation enlarges the
database by adding white noise to the speech signals. In this work, the Mel-frequency Cepstral
Coefficient (MFCC) and Mel-frequency Magnitude Coefficient (MFMC) features, along with a
one-dimensional convolutional neural network (1D-CNN), are used to classify speech emotions. The
datasets used to evaluate the model’s performance were AESDD, CAFE, EmoDB, IEMOCAP, and
MESD. The 1D-CNN (MFMC) model with data augmentation performed best, with an average
accuracy of 99.2% for AESDD, 99.5% for CAFE, 97.5% for EmoDB, 92.4% for IEMOCAP and 96.9%
for the MESD database. The proposed 1D-CNN (MFMC) with data augmentation outperforms the
1D-CNN (MFCC) without data augmentation in emotion recognition.
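The white-noise augmentation described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names add_white_noise and augment_dataset, the Gaussian noise model, and the 20 dB signal-to-noise ratio are all assumptions, since the abstract does not state the noise level or how many noisy copies are generated.

```python
import numpy as np


def add_white_noise(signal, snr_db=20.0, rng=None):
    """Return a copy of `signal` with additive white Gaussian noise.

    `snr_db` is an assumed target signal-to-noise ratio in decibels;
    the abstract does not specify the value used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    # Average power of the clean signal.
    signal_power = np.mean(signal ** 2)
    # Noise power that yields the requested SNR.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def augment_dataset(signals, labels, snr_db=20.0, seed=0):
    """Double a dataset by appending one noisy copy of every utterance.

    Each noisy copy keeps the emotion label of its clean original.
    """
    rng = np.random.default_rng(seed)
    noisy = [add_white_noise(s, snr_db, rng) for s in signals]
    return list(signals) + noisy, list(labels) + list(labels)


if __name__ == "__main__":
    # Hypothetical stand-in for one second of speech sampled at 16 kHz.
    t = np.linspace(0.0, 1.0, 16000, endpoint=False)
    clean = np.sin(2.0 * np.pi * 220.0 * t)
    signals, labels = augment_dataset([clean], ["happy"], snr_db=20.0)
    print(len(signals), labels)
```

Features such as MFCC or MFMC would then be extracted from both the clean and the noisy signals before training the classifier, so the network sees each utterance under more than one acoustic condition.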
Keywords
Neural networks; affective computing; emotion recognition; audio database; accuracy
Hrčak ID: 326329
Publication date: 3.7.2024.