Impact of autoencoder based compact representation on emotion detection from audio.

Nivedita Patel¹, Shireen Patel¹, Sapan H Mankad¹.

Abstract

Emotion recognition from speech has many applications, and consequently extensive research has been done in this field over the past few years. However, many of the existing solutions are not yet ready for real-time applications. In this work, we propose a compact representation of audio using conventional autoencoders for dimensionality reduction, and test the approach on two publicly available benchmark datasets. Such compact and simple classification systems, where the computing cost is low and memory is managed efficiently, may be more useful for real-time applications. The system is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). Three classifiers, namely support vector machines (SVM), a decision tree classifier, and convolutional neural networks (CNN), have been implemented to judge the impact of the approach. The results obtained by attempting classification with AlexNet and ResNet50 are also reported. Observations showed that the introduction of autoencoders can indeed improve the classification accuracy of the emotion in the input audio files. It can be concluded that, in emotion recognition from speech, the choice and application of dimensionality reduction of audio features impacts the results that are achieved; therefore, by working on this aspect of the general speech emotion recognition model, it may be possible to make great improvements in the future.

Keywords: Audio; Autoencoder; Emotion; RAVDESS; TESS

Year: 2021 PMID: 33686349 PMCID: PMC7927770 DOI: 10.1007/s12652-021-02979-3

Source DB: PubMed Journal: J Ambient Intell Humaniz Comput

Introduction

Speech is one of the major communication methods used by humans (Mustaqeem and Kwon 2019). Emotions are forms of expression for humans, and emotion is therefore naturally used in everyday speech to express sentiments clearly. Speech contains both linguistic and non-linguistic information (Mansour et al. 2019). A speech signal carries information such as the intended message, speaker identity, and the emotional state of the speaker (Bhaykar et al. 2013). Efficient communication through language and speech has enabled people to share ideas, messages, and perceptions with one another. In voice-based signals, two factors are of primary importance: acoustic variation and the words that are spoken. Acoustic features such as the pitch, timing, voice quality, and articulation of the speech signal correlate highly with the underlying emotion due to the effects of arousal in the nervous system, increased heart rate, etc. The variation of these features forms the basis of emotion recognition in speech. Speech emotion recognition is the task of extracting the emotions of the speaker from his or her speech signal. Detecting these emotions provides insight into deeper complexities, which helps to navigate real-time situations. Emotion recognition from speech is one of the major challenges in the field of human-computer interaction. The formulation of powerful emotion recognition systems is thus beneficial, and the objective of a good emotion recognition system is to mimic human perception, in the way that humans are able to detect emotions such as anger, sadness, and happiness while talking to one another (Basu et al. 2017). Despite extensive research in emotion recognition from speech, several challenges remain, such as imperfect databases, low quality of recorded utterances, cross-database performance, and difficulties with speaker-independent recognition, as each person has a different way of speaking.
Speech emotion recognition systems are pattern recognition systems, and are generally composed of three parts: (1) speech signal acquisition, (2) feature extraction, and (3) emotion recognition through the use of classifiers (Huang et al. 2014). Speech features can be broadly categorized into four types: continuous features, qualitative features, spectral features, and Teager energy operator (TEO)-based features. In this work, three classifiers are implemented to demonstrate the impact of the proposed model: a decision tree classifier, support vector machines, and a convolutional neural network. Figure 1 describes a typical speech-based emotion recognition system.

Fig. 1

Block diagram of a general speech emotion recognition system

Applications of audio-based emotion recognition systems provide aid in mental health assessments. Speech processing technology can help diagnose disorders and also detect their severity (Low et al. 2020). Additionally, it can aid in speech therapy, which aims to help with people’s speech impairments (Schipor et al. 2014). Moreover, applications like health care and counseling can benefit greatly from such automated systems. Speech emotion recognition systems are particularly useful where man and machine interaction (MMI) is required, as in web movies, computer tutorial applications, online learning (Cen et al. 2016), call center communications, mobile communications, etc., because the response in such systems depends on the sentiment of the user. The objective of such systems in the case of call center communications would be to detect the emotional state and urgency of the caller (Bojani et al. 2020). This would help to improve the functionality of call centers, especially emergency call centers and those giving health care support to elderly people. Emotion recognition is also used in car systems to detect the mental state of drivers, which is directly correlated with the probability of rash driving and subsequent accidents (Kamaruddin and Wahab 2010). These systems can help ensure the safety of drivers, passengers, and people on roads. It can also be used for lie detection in criminal and forensic investigation. Furthermore, research shows its application in the detection of school violence based on children’s speech (Han et al. 2018). Lastly, it can improve other artificial intelligence applications such as playing customized music based on the emotion of the speaker on a call, marketing, and intelligent toys.
The rest of this paper is organized as follows. Section 2 discusses previous work on emotion recognition from speech. Section 3 describes the proposed methodology for emotion classification. The experimental work and results are discussed in Sects. 4 and 5, respectively. Section 6 concludes the paper with observations and future remarks.

Related work

A substantial amount of research has been carried out in the field of speech emotion recognition (SER). In this section, we present a brief review of the work done on emotion detection from audio. Many of the current research methodologies are based on two different classification approaches. The first is the use of classical classifiers such as SVM and artificial neural networks (ANN) and the second is the use of classifiers based on deep learning such as convolutional neural networks (CNN) and deep neural networks (DNN) (Akçay and Oğuz 2020). Using both linguistic (probabilistic and entropy-based models of words and phrases) and acoustic (pitch, loudness, spectral characteristic) feature modeling, SVM was used as a classifier for anger recognition (Polzehl et al. 2011). This study showed that the acoustic modeling outperforms linguistic modeling. Accuracy of 75% for the WoZ database and approximately 79% for IVR datasets were achieved. In Zhang et al. (2016), four models for the binary classification problem: the simple model, the single task (ST) model, the multi-task feature selection/learning (MTFS/MTFL) model, and the group multi-task feature selection/learning (GMTFS/GMTFL) model were implemented. Feature extraction of acoustic low level descriptors (LLDs) was done and then four models are used for each emotion classification. It was tested on the RAVDESS dataset and the maximum accuracy achieved was 64.29%. Support vector machines have been used as a classification technique by many researchers. Feature extraction using MFCCs, Spectral Centroids, and Delta and Delta–Delta MFCCs along with a bagging ensemble with SVM as a classifier was used for speech detection on three different datasets, namely, IITKGP-SEHSC, RAVDESS, and Berlin EMO-DB. 75.69% accuracy was obtained on the RAVDESS dataset using the proposed methodology (Bhavan et al. 2019). Another study Tomba et al. 
(2018) aimed to detect stress through speech analysis using mean energy, mean intensity, and MFCC features. Using SVM and neural networks on the RAVDESS dataset, accuracies of 78.75% and 89.16% were achieved. In Deb and Dandapat (2016), a relatively new feature, residual sinusoidal peak amplitude (RSPA), was utilized for emotion classification. The RSPA feature is evaluated from the LP residual of the speech signal using a sinusoidal model. Again, an SVM classifier was used and evaluated on the EMO-DB dataset, giving a maximum accuracy of 74.4%. Furthermore, architectures such as convolutional neural network (CNN) and long short-term memory (LSTM) have also been used to test the emotion capturing capability of various standard speech representations such as mel spectrogram, magnitude spectrogram, and Mel-frequency cepstral coefficients (MFCCs). In Pandey et al. (2019), a bidirectional long short-term memory network and a convolutional neural network were used, and the best accuracy, 82.35%, was achieved by the CNN + BLSTM architecture with MFCC as input on EMO-DB. A convolutional neural network model was evaluated on RAVDESS in Jannat et al. (2018), but the accuracy of the audio-only tests is comparatively low at 66.41%. In Zhao et al. (2019), one 1D CNN LSTM network and one 2D CNN LSTM network were constructed to learn local and global emotion-related features from speech and log-Mel spectrograms, respectively. Accuracies of 95.33% and 95.89% were achieved on Berlin EmoDB in speaker-dependent and speaker-independent experiments, and of 89.16% and 52.14% on the IEMOCAP database in speaker-dependent and speaker-independent experiments, respectively. Recurrent neural network (RNN) architectures have also been used for SER, achieving 63.5% accuracy on the IEMOCAP corpus in Mirsamadi et al. (2017). Popova et al. (2018) used a fine-tuned DNN to classify the mel spectrograms obtained from the speech samples of the RAVDESS dataset. 
The authors obtained the accuracy of 71% using VGG-16 network as a classifier. A sparse autoencoder method for feature transfer learning for speech emotion recognition was proposed in Deng et al. (2013). Average accuracy of 51.6% (original) and 59.9% (reconstructed) was achieved for the datasets. To learn from labelled and unlabelled data, the semi-supervised autoencoder (SS-AE) was introduced in Deng et al. (2018). It extends a popular unsupervised deep denoising autoencoder. A variant of SS-AE that introduces skip connections from the lower layer to the upper one called SS-AE-Skip was also implemented. SS-AE and SS-AE-Skip obtain an average UAR of 42.7% and 42.8%, respectively. Deng et al. (2017) also introduces Universum learning to a deep autoencoder, leading to reducing the inherent mismatch between the training and test data by simultaneously learning common knowledge from labelled and unlabelled data. The Universum Autoencoder achieves an accuracy of 59.3%, which is comparable to the SVM UAR 54.1%. In Aouani and Ben Ayed (2018), the model implements stack and simple auto encoder after MFCC feature extraction. The experimental results show that DSVM method outperforms the standard SVM with a classification rate of 69.84% and 68.25% using 39 MFCC, respectively. Additionally, the auto-encoder method outperforms the standard SVM, yielding a classification rate of 73.01%. A brief review of the work done on emotion detection from audio is presented in Table 1.

Table 1

Summary of different methodologies used for SER

No.	Dataset	Methodology	Results (accuracy)	Author
1	IVR customer care domain, database from WoZ data collection^a	SVM	79%, 75%	Polzehl et al. (2011)
2	IEMOCAP corpus^b	RNN	63.5%	Mirsamadi et al. (2017)
3	EMO-DB, VAM, and TUM AVIC	SVM	51.6%	Deng et al. (2013)
4	Berlin EmoDB and IEMOCAP	CNN, LSTM	95.33%, 95.89% on Berlin EmoDB; 89.16%, 52.14% on IEMOCAP	Zhao et al. (2019)
5	EMO-DB	SVM	74.4%	Deb and Dandapat (2016)
6	EMO-DB and IEMOCAP	Bidirectional LSTM and CNN	82.35%	Pandey et al. (2019)
7	(UMSSED^c) and (RAVDESS^d)	Four models for binary classification	64.29%	Zhang et al. (2016)
8	RAVDESS	CNN	66.41%	Jannat et al. (2018)
9	RAVDESS	SVM, NN	78.75%, 89.16%	Tomba et al. (2018)
10	RAVDESS	SVM	75.69%	Bhavan et al. (2019)
11	GeWEC	Universum AE	59.3%	Deng et al. (2017)
12	GeWEC	SSAE	51.6%	Deng et al. (2018)
13	SAVEE	DSVM, SVM, AE	69.84%, 68.25%, 73.01%	Aouani and Ben Ayed (2018)

^a http://dicit.fbk.eu/index.php?location=woz

^b https://sail.usc.edu/iemocap/

^c https://web.eecs.umich.edu/~emilykmp/umssed.html

^d https://zenodo.org/record/1188976

In this work, we attempt to use both traditional classifiers and deep learning classifiers with the addition of an autoencoder, which is a deep learning based enhancement technique. The results achieved with the implementation of some state-of-the-art classifiers such as AlexNet and ResNet50 are also presented. To the best of our knowledge, our model outperforms, in terms of accuracy, all current works that have been evaluated on the same datasets, namely RAVDESS and TESS, with the exception of Tomba et al. (2018). However, our model may be comparable in terms of simplicity and reliability, as the implementation in this work consists of simple autoencoders along with some classical classifiers. The highest accuracy we report is 96%, obtained by evaluating the CNN on the TESS dataset, which also outperforms classifier models that have been tested on other datasets.

Proposed methodology

Any SER system consists of two components: a processing unit that extracts the appropriate features from the speech data, and a classifier that ultimately decides the emotion of the underlying speech utterance. In this section, the methodology used for feature extraction, dimensionality reduction, and classification in the proposed model is presented. The use of autoencoders for dimensionality reduction, and its impact on classification, is also discussed.

Features

The first step is preprocessing, which includes the extraction and selection of a set of specific acoustic features as well as normalization, noise reduction, etc. In some works, basic acoustic features like pitch-related, intensity-related, and duration-related features have been extracted (Chen et al. 2012). Feature extraction is an important stage of recognition. There are many kinds of feature extraction methods; some parametric representations are the Mel-frequency cepstrum coefficients (MFCC), the linear-frequency cepstrum coefficients (LFCC), the linear prediction coefficients (LPC), and the reflection coefficients (RC). MFCC-based features are very common and are used in many SER models to this day, such as Likitha et al. (2017) and Sowmya and Rajeswari (2020). MFCCs represent audio based on perception, with their frequency bands logarithmically positioned. They capture the power spectrum and unique characteristics of humans. The main steps of MFCC feature extraction are pre-emphasis, frame-blocking, fast Fourier transform (FFT), Mel frequency warping, and discrete cosine transform (DCT) (Muljono et al. 2019). Pre-emphasis is a filtering process applied to a signal before performing feature extraction on it. Framing consists of splitting the signal into several frames. The FFT stage converts each frame from the time domain to the frequency domain; FFT is a fast algorithm for computing the discrete Fourier transform (DFT). In the mel scaling stage, the spectrum is measured on the ‘mel’ scale. The ‘mel’ scale is approximately linear in frequency below 1000 Hz and logarithmic above 1000 Hz. Mel scaling is performed as shown in Eq. (1):

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$

where f is the frequency in Hz. At the discrete cosine transform (DCT) stage, the mel spectrum coefficients are converted into the time domain. The result is called MFCC. Figure 2 explains the MFCC extraction process from an audio signal.
MFCC has numerous advantages, such as simple calculation, good discriminative ability, and high robustness to noise. We have used MFCC features to represent audio samples in this work.
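
The early MFCC stages described above (pre-emphasis, framing, and mel warping) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the feature extractor used in this work: the function names, the pre-emphasis coefficient of 0.97, and the 2595·log10(1 + f/700) form of the mel mapping are assumptions chosen because they are common conventions.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (assumed 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def pre_emphasis(signal, alpha=0.97):
    """First-order filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed typical value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def mel_band_edges(n_filters, f_min, f_max):
    """Edge frequencies (Hz) of n_filters bands spaced equally on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
```

With this mapping, 1000 Hz lands very close to 1000 mel, and the band edges returned by `mel_band_edges` grow further apart in Hz as frequency increases, which is the logarithmic positioning mentioned in the text. The FFT and DCT stages are omitted here; in practice they would usually come from a numerical library.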

Fig. 2

Block diagram for MFCC

Dimensionality reduction

Dimensionality reduction is defined as the process of reducing the number of features that describe some data. It is a necessary approach to downsize data. There are many methodologies that can be used in order to reduce the dimensionality of data such as principal component analysis (PCA), Linear discriminant analysis (LDA), Random forests, etc. PCA seems to be one of the most popular methodologies when it comes to SER. PCA is a preprocessing linear transformation technique. Chen et al. (2012) describe principal component analysis (PCA) which is used to find a subspace whose basis vectors correspond to the maximum-variance in the original space. They also describe Linear discriminant analysis (LDA) which selects those vectors that best discriminate among classes and how these methods may be selected for application in speech features. Further, they present an independent, comparative analysis of PCA, LDA and PCA + LDA used in speech emotion recognition. It is found that none of the three algorithms is the state-of-the-art for all emotion categories. A new integrated approach was also introduced. Furthermore, Daneshfar and Kabudian (2019) propose a system that is based on a modified quantum-behaved particle swarm optimization (QPSO) algorithm for feature-vector dimension reduction. The proposed method improves the accuracy of the SER system compared to classical methods such as PCA, LDA, and standard QPSO. Autoencoders can also be used for dimensionality reduction. Deep autoencoders have already proved to be effective tools for denoising (Xia et al. 2014) and classification (Cibau et al. 2013) for SER. They are being extended to the process of dimensionality reduction. Autoencoder is an unsupervised learning process that does not require external labels. The autoencoder algorithm belongs to a special family of dimensionality reduction methods that is implemented using artificial neural networks. 
It aims to learn a compressed representation of an input while simultaneously minimizing its reconstruction error (Wang et al. 2014). For example, Zabalza et al. (2016) propose the use of a stacked autoencoder for dimensionality reduction and feature extraction in hyperspectral imaging. Stacked autoencoders are an extension of the autoencoder framework, as they contain several layers between the input and the output; final features are therefore obtained through progressive abstraction levels. Variational autoencoders, which use variational inference to generate a latent representation of the data, have also been used for the task of dimensionality reduction (Martin et al. 2019). Finally, Sahay et al. (2019) suggest the use of a cascaded autoencoder that can perform both denoising and dimensionality reduction. Thus, autoencoders prove to be a useful tool for dimensionality reduction, with added benefits over traditional methods such as PCA. This is because they remove the need to select meaningful features from the entire list of components, reducing subjectivity and significant human interaction in the analysis (Thomas et al. 2016). In addition, depending on the size of the dataset and the application, autoencoders have often been shown to perform better than principal component analysis (Wang et al. 2012). Unlike PCA, there are no guidelines for choosing the size of the bottleneck layer in an autoencoder. Autoencoders aim to retain the information of the original data set in the reduced layer, so the decoder is in turn better equipped to reconstruct the original data set, and the representation is more optimized than that of PCA. The drawbacks of using an autoencoder for dimension reduction include the greater computation and tuning required, but the trade-off is higher accuracy.
In this paper too, an autoencoder has been used for dimensionality reduction before attempting to classify the data.

Autoencoder

The most basic architecture of an autoencoder has the same number of dimensions in the input layer as in the output layer, but the hidden layer has fewer dimensions, which is where the dimension reduction occurs. A general representation of an autoencoder with a single hidden layer is depicted in Fig. 3. The hidden layer contains learned information of the input data in a compressed manner. As with other neural networks, there is a lot of flexibility in how autoencoders can be constructed, including variation in the number of hidden layers and the number of nodes in each. As shown in Fig. 3, the encoder takes an input $x_i$ and reduces it to a form $y_i$ in the hidden layer through the use of a function f(), which is a standard activation function: either an identity function for a linear projection or a sigmoid function for a non-linear mapping. With W a weight matrix, and ignoring the bias of the neural network, the encoding process is represented as follows:

$$y_i = f(W x_i)$$

$W'$ represents another weight matrix, and the decoding process is represented by the following equation:

$$x'_i = g(W' y_i)$$

Here, g() is either a sigmoid function for non-linear reconstruction or an identity function for linear reconstruction, similar to f(). $d_x$ refers to the dimension of the inputs and $d_y$ refers to the dimension of the output after dimensionality reduction. $Y = \{y_i\}_{i=1}^{n}$ represents the set of $d_y$-dimensional output data vectors and $X = \{x_i\}_{i=1}^{n}$ represents the set of $d_x$-dimensional input data vectors. The decoder reconstructs sets of instances that are indexed by $\Omega_i$ and have specific weights $S_i = \{s_{ij}\}$ for $x_i$, to get a weighted reconstruction error $e_i$:

$$e_i(W, W') = \sum_{j \in \Omega_i} s_{ij} \, L(x_j, x'_i)$$

where L denotes the reconstruction loss. The total weighted reconstruction error for all the n input samples to the autoencoder is E:

$$E(W, W') = \sum_{i=1}^{n} e_i(W, W') = \sum_{i=1}^{n} \sum_{j \in \Omega_i} s_{ij} \, L(x_j, x'_i)$$

A general autoencoder iteratively computes and updates the values of $\Omega_i$ and $S_i$ by using an algorithm such as the K-nearest neighbor algorithm. Furthermore, using stochastic gradient descent, the autoencoder minimizes the total weighted reconstruction error.
Finally, it updates the parameters W and $W'$. This is done iteratively until the convergence point is reached.
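
To make the encode, decode, and update cycle concrete, the following is a minimal sketch of a single-hidden-layer autoencoder trained with stochastic gradient descent. It is a simplification under stated assumptions, not the model used in this work: f() and g() are taken to be identity functions, each sample reconstructs only itself with unit weight (so the weighted error reduces to a plain squared error), and all function names, the learning rate, and the epoch count are hypothetical.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def init_matrix(rows, cols, rng):
    """Small random initial weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode(W, x):
    """Encoder: hidden code = f(W x) with f = identity (assumed)."""
    return matvec(W, x)

def decode(W2, y):
    """Decoder: reconstruction = g(W' y) with g = identity (assumed)."""
    return matvec(W2, y)

def recon_error(W, W2, data):
    """Total squared reconstruction error over all samples."""
    total = 0.0
    for x in data:
        xr = decode(W2, encode(W, x))
        total += sum((a - b) ** 2 for a, b in zip(x, xr))
    return total

def train(data, d_y, epochs=500, lr=0.05, seed=0):
    """Fit encoder W (d_y x d_x) and decoder W2 (d_x x d_y) by per-sample SGD."""
    rng = random.Random(seed)
    d_x = len(data[0])
    W = init_matrix(d_y, d_x, rng)
    W2 = init_matrix(d_x, d_y, rng)
    for _ in range(epochs):
        for x in data:
            y = encode(W, x)
            xr = decode(W2, y)
            # Gradient of 0.5 * ||xr - x||^2 with respect to the reconstruction
            err = [r - a for r, a in zip(xr, x)]
            # Back-propagate to the hidden code before W2 is modified
            grad_y = [sum(W2[i][j] * err[i] for i in range(d_x))
                      for j in range(d_y)]
            for i in range(d_x):
                for j in range(d_y):
                    W2[i][j] -= lr * err[i] * y[j]
            for j in range(d_y):
                for k in range(d_x):
                    W[j][k] -= lr * grad_y[j] * x[k]
    return W, W2
```

For data that lies along a single direction in three dimensions, a one-unit hidden layer is enough, and training should drive the reconstruction error well below its initial value. A sigmoid activation, bias terms, or the neighbor-weighted error described in the text would each require extending this sketch.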

Fig. 3

Architecture of a general autoencoder

Classifiers

Each classifier has a unique set of advantages and limitations and therefore, the performance may vary with each classifier. The objective of this section is to provide an overview of the classifiers used in this work.

Support vector machine

Generally, SVM is used as a binary classifier; however, it can also be used as a multi-class classifier. It is a highly effective machine learning tool and is widely used in all types of pattern recognition problems. Especially when limited training data is available, it has been known to outperform other classifiers. SVM is based on the use of kernel functions to non-linearly map the original features to a high-dimensional space, where the data is then classified using a linear manifold. It has been used extensively for classification, especially image classification. It has proven successful in applications such as thyroid disease detection (Shankar et al. 2020), classification of mammograms in breast cancer detection (Vijayarajeswari et al. 2019), and even determination of poverty (Naviamos and Niguidula 2020). SVM has shown superior performance for emotion recognition in comparison to linear discriminant classifiers and nearest neighbor classifiers. It has been used as a classifier for sound-based emotion recognition (Sonawane et al. 2017) and has shown high accuracy. Furthermore, deep support vector machines were tested for speech emotion recognition in Aouani and Ayed (2019) and gave better performance than previous studies. A decision tree SVM model with Fisher feature selection for speech emotion recognition was also implemented and achieved accuracy as high as 98.29% (Sun et al. 2019). Therefore, in this work we decided to implement SVM as a classifier, as it can be considered one of the most successful classifiers.

Decision tree classifier

Decision trees are widely used for classification and regression. They are tree-like structures consisting of three types of components: a root node, internal nodes, and terminal nodes, as shown in Fig. 4. Each internal and terminal node in the tree must have a parent node, which denotes the data source, and at least two child nodes are created from each parent node depending on decision rules that may differ between scenarios (Pantazi et al. 2020).

Fig. 4

Flowchart of decision tree algorithm (Pantazi et al. 2020)

Decision tree classifiers have been applied in diverse areas such as Agile Management System (AMS), intelligent data mining in agriculture (Pantazi et al. 2020), microscopic image analysis, character recognition, speech recognition, and radar signal classification. Decision trees are able to break a complex decision-making process down into simpler decisions in a hierarchical manner, making it easier to interpret. Decision tree classifiers have high adaptability and effective features, making them capable of extracting decision-making knowledge from the given data.

Convolutional neural networks

This is one of the most popular deep learning methods, applied in areas such as face recognition, handwriting recognition, and many other processing and recognition problems. Recently, CNN has also been applied to COVID-19 related applications in order to facilitate screening approaches during the pandemic. In a recent study, COVID-Net, an open source deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images, was introduced (Wang and Wong 2020). The results achieved by COVID-Net on the COVIDx test dataset are promising. Similarly, a deep CNN called decompose, transfer, and compose (DeTraC) was adopted for the classification of COVID-19 chest X-ray images, and accuracies up to 95.12% were achieved (Abbas et al. 2020). Also, three different convolutional neural network based models (ResNet50, InceptionV3 and Inception-ResNetV2) have been proposed for the detection of coronavirus pneumonia infected patients using chest X-ray radiographs in Narin et al. (2020). It is observed that the pre-trained ResNet50 model provides the highest classification performance, with 98% accuracy. Outside medical imaging, Barra et al. (2020) present a study that exploits an ensemble of CNNs, trained over Gramian angular fields (GAF) images, for market financial forecasting and trend analysis in the US. A multiresolution imaging approach is used to feed each CNN, which enables the analysis of different time intervals for a single observation. Apart from computer vision, CNNs have also been used specifically for the task of speech emotion recognition. A method for speech emotion recognition using spectrograms and a deep convolutional neural network (CNN) is reported to predict emotions accurately and efficiently (Badshah et al. 2017). Spectrograms generated from the speech signals are input to the deep CNN. This study also investigates the effectiveness of transfer learning for emotion recognition using a pre-trained AlexNet model. 
However, they conclude that the results are not satisfactory. Zheng et al. (2018) proposed an SER model based on CNN feature extraction followed by random forest classification; CNN can therefore be used in multiple ways. Huang et al. (2014) achieved results using CNN by trying to learn salient feature maps using an auto-encoder. Similarly, another study applied deep convolutional neural networks but failed to reach an accuracy of more than 40%. A one-dimensional CNN has also been used successfully to produce an accuracy of about 80% (Basu et al. 2017). CNN models thus have extensive applications; they are well known and proven in use. Deep CNNs have two essential ingredients: a rectified linear unit (ReLU), defined as a univariate nonlinear function given by

$$\sigma(u) = \max\{u, 0\}, \quad u \in \mathbb{R},$$

and a sequence of convolutional filter masks inducing sparse convolutional structures (Zhou 2020). A filter mask $w = (w_k)_{k=-\infty}^{\infty}$ defines a sequence of filter coefficients, where the filter length $s$ is a fixed integer used to control the sparsity, and it is assumed that $w_k \neq 0$ only when $0 \le k \le s$. When a filter mask w is convolved with a sequence $v = (v_0, \ldots, v_D)$, we get a new sequence $w * v$ defined as $(w * v)_i = \sum_{k=0}^{D} w_{i-k} v_k$. This generates a Toeplitz type convolutional matrix T that has constant diagonals. The matrix has a larger number of rows than columns, thus allowing deep neural networks to represent more complex and richer functions. The convolutional layer in a CNN extracts features from the input. The filters are used to extract local patterns and form feature maps. Mathematically, this layer can be represented as shown in Eq. (3) (Pandey et al. 2019):

$$(h_k)_{ij} = (W_k * q)_{ij} + b_k \tag{3}$$

where $(h_k)_{ij}$ is the (i, j)th element of the kth output feature map, $W_k$ is the kth filter, $b_k$ is the kth bias, q denotes the input feature maps, and * is the 2D spatial convolution operation. The pooling layer generally follows, which reduces the number of parameters and hence the complexity of the model. Max pooling and average pooling are the two types mainly used. Max pooling chooses the maximum from the specified window, whereas average pooling calculates the average of the specified window.
In addition to convolutional layers and pooling layers, the CNN model also consists of dropout layers, dense layers, and the last fully connected layer, which is responsible for generating an output for regression/classification tasks.
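
The building blocks named in this subsection (ReLU, convolution expressed through a Toeplitz-type matrix, and max or average pooling) can be illustrated in one dimension with plain Python. This is a hedged sketch with hypothetical function names, assuming a filter of length s + 1 and an input of length n so that the matrix has n + s rows and n columns; it is not the CNN architecture evaluated in this work.

```python
def relu(u):
    """Rectified linear unit: max(u, 0)."""
    return max(u, 0.0)

def toeplitz_matrix(w, n):
    """Build the (n + s) x n convolution matrix for filter w = (w_0, ..., w_s).

    Entry (i, k) is w[i - k] when 0 <= i - k <= s and zero otherwise,
    so every diagonal of the matrix is constant.
    """
    s = len(w) - 1
    return [[w[i - k] if 0 <= i - k <= s else 0.0 for k in range(n)]
            for i in range(n + s)]

def convolve(w, v):
    """Full 1D convolution of filter w with input v via the Toeplitz matrix."""
    T = toeplitz_matrix(w, len(v))
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def pool(seq, size, mode="max"):
    """Non-overlapping pooling; a trailing partial window is dropped."""
    windows = [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
    if mode == "max":
        return [max(win) for win in windows]
    return [sum(win) / len(win) for win in windows]

# Example: a difference filter applied to a short ramp signal
feature_map = [relu(u) for u in convolve([1.0, -1.0], [1.0, 2.0, 3.0])]
pooled = pool(feature_map, 2, mode="max")
```

Here the convolution output has four entries for a three-entry input, matching the remark that the matrix has more rows than columns, and pooling with a window of two halves the length of the feature map.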

Experimental scenario

The proposed system is evaluated on two datasets: the RAVDESS dataset and the Toronto Emotional Speech Set (TESS). Feature extraction from raw audio files is done with the help of MFCC. Further, the audio files are fed into an autoencoder model for the purpose of dimension reduction. Then, the newly reconstructed data is used as input for the SVM model, the decision tree classifier, and the CNN. The performance is compared, on the basis of different evaluation measures, before and after applying the autoencoder. Figure 5 illustrates the system model that has been proposed in this paper.

Fig. 5

The proposed system model

Dataset

The three classifying models have been evaluated on two publicly available speech emotion datasets: (1) Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)^1 (Livingstone and Russo 2018), and (2) Toronto Emotional Speech Set (TESS)^2 (Pichora-Fuller and Dupuis 2020).

The RAVDESS dataset

The RAVDESS dataset contains a complete set of 7356 files (24.8 GB) of audio and video, speech and song. It is a dynamic and multimodal set consisting of facial and vocal expressions in North American English. The database consists of 24 actors who vocalize two lexically matched statements in a neutral North American accent. The eight speech emotions are neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Each expression is produced at two levels of intensity, strong and normal, with the neutral expression as an addition. All data are available in three modality formats: audio-video, audio only, and video only. However, for this work, we use only the audio files, which make up 1440 files (24 actors × 60 trials per actor). Tables 2 and 3 present the distribution of wave files of the RAVDESS dataset and the filename identifiers, respectively. Figure 6 depicts the file naming convention used in this dataset for each audio file.
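
As an illustration of how the filename identifiers can be decoded, the sketch below parses a RAVDESS-style filename into labelled fields. It rests on assumptions that are not stated in this section: that a filename consists of seven two-digit, hyphen-separated fields in the order modality, vocal channel, emotion, intensity, statement, repetition, actor (as in the public RAVDESS documentation), that intensity code 01 means normal and 02 means strong, and that emotion codes follow the order in which the emotions are listed above. The function and dictionary names are hypothetical.

```python
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
# Emotion codes assumed to follow the listing order in the text
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
# Intensity codes are an assumption taken from the public dataset description
INTENSITY = {"01": "normal", "02": "strong"}

def parse_ravdess_filename(filename):
    """Decode a name such as '03-01-05-02-01-01-12.wav' into labelled fields."""
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError("expected 7 hyphen-separated fields: " + filename)
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": int(statement),
        "repetition": int(repetition),
        "actor": int(actor),
    }

def expected_audio_count(actors=24, trials_per_actor=60):
    """Number of audio-only speech files implied by the counts in the text."""
    return actors * trials_per_actor
```

For example, under these assumptions `parse_ravdess_filename("03-01-05-02-01-01-12.wav")` yields an audio-only speech recording labelled angry at strong intensity from actor 12, and `expected_audio_count()` reproduces the 1440 files used in this work.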

Table 2

RAVDESS-wave only audio files description

Gender	Count	Trials per actor	# Of audio samples
Female	12	60	1440
Male	12	60	1440

Table 3

Filename identifiers(RAVDESS)

Modality	01 = full-AV, 02 = video-only, 03 = audio-only
Vocal Channel	01 = speech, 02 = song
Emotion	01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised
Intensity	01 = normal, 02 = strong (Note: there is no strong intensity for the neutral emotion)
Statement	01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”
Repetition	01 = 1st repetition, 02 = 2nd repetition
Actor	01 to 24 (odd-numbered actors are male, even-numbered actors are female)

Fig. 6

Filename convention for a sample audio file from RAVDESS corpus


The TESS dataset

The TESS dataset, summarized in Table 4, consists of recordings of two women, aged 26 and 64 years, portraying seven emotions: anger, happiness, disgust, fear, sadness, neutral, and pleasant surprise. Each woman spoke 200 target words in the carrier phrase “Say the word _” for each of the seven emotions. The two women chosen for the recordings were from the Toronto area. Both have received musical training, are university educated, and speak English as their first language.

Table 4

TESS dataset description

Actor/subject	Words per emotion	# Of emotions	# Of audio files
Female 1 (age 26)	200	7	2800
Female 2 (age 64)	200	7	2800


Data representation

Mel-frequency cepstral coefficients (MFCCs) are used for feature extraction on both the RAVDESS and TESS datasets. We adopted MFCCs because they have been used extensively to extract features from audio signals while discarding unnecessary background noise. The MFCC features are calculated for the speech files with the default sliding window size of 25 ms and a shift of 10 ms. In total, 128 MFCC features were extracted from each input audio file.
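As a rough illustration of the framing described above, the following sketch converts the 25 ms window and 10 ms shift into sample counts and counts the frames that result. The sampling rate (16 kHz) and clip duration (3 s) are assumptions chosen for the example, not values reported in this work, and the count assumes no padding at the signal edges.

```python
def frame_params(sample_rate_hz, window_ms=25, shift_ms=10):
    """Convert window and shift durations (ms) to whole sample counts."""
    window = sample_rate_hz * window_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return window, shift


def num_frames(num_samples, window, shift):
    """Number of full frames that fit in the signal without padding."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


SAMPLE_RATE_HZ = 16_000  # assumed for illustration
CLIP_SECONDS = 3         # assumed for illustration

window, shift = frame_params(SAMPLE_RATE_HZ)
frames = num_frames(SAMPLE_RATE_HZ * CLIP_SECONDS, window, shift)
print(f"window={window} samples, shift={shift} samples, frames={frames}")
```

Under these assumptions, consecutive frames overlap by 15 ms, so each audio sample contributes to two or three frames.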

Dimensionality reduction using autoencoder

For our study, we used a simple autoencoder with a single fully connected layer as the encoder and another as the decoder. The same autoencoder architecture is used for both datasets. The encoding dimension is 64 and the input shape is (128,). The ‘encoded’ label stores the encoded representation of the input, and the ‘decoded’ label represents the lossy reconstruction of the input. Separate encoder and decoder models are also built. To train the autoencoder to reconstruct its input, the model is first configured to use binary cross-entropy as the loss function and the AdaDelta optimizer to minimize it. The autoencoder is then trained for 100 epochs with a batch size of 256. The Matplotlib library in Python was used to visualize the encoded representations and the reconstructed inputs. Because the model tries to reconstruct its input, and a mean squared error (MSE) based metric measures the difference between the predicted output and the target, we used MSE as the error function. The Adam optimizer was used to update the weights. Figures 7 and 8 show the performance of the autoencoder on the RAVDESS and TESS datasets, respectively, by plotting the original input against the reconstructed input. Since the input is 128 floats and the encoding dimension is 64, the input is compressed by a factor of 2 (i.e., 128/64), giving 64-dimensional encoded representations. The decoded output, produced by passing the encoded input through the decoder layer, has the same shape (128,) as the original input but is a lossy reconstruction of it.
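The following is a minimal NumPy sketch of a single-layer autoencoder with the dimensions given above (128 inputs, 64-dimensional code). It is not the authors' implementation: the ReLU and sigmoid activations, the weight initialization, and the use of plain gradient descent on the MSE (in place of the optimizers named in the text) are all assumptions, and random values stand in for scaled MFCC vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 128    # MFCC feature vector length, as in the text
ENCODING_DIM = 64  # size of the compact representation, as in the text

# Stand-in data in [0, 1]; real inputs would be scaled MFCC vectors.
x = rng.random((256, INPUT_DIM))

# One dense layer for the encoder and one for the decoder.
w_enc = rng.normal(0.0, 0.1, (INPUT_DIM, ENCODING_DIM))
b_enc = np.zeros(ENCODING_DIM)
w_dec = rng.normal(0.0, 0.1, (ENCODING_DIM, INPUT_DIM))
b_dec = np.zeros(INPUT_DIM)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(batch):
    return np.maximum(0.0, batch @ w_enc + b_enc)  # ReLU (assumed)


def decode(code):
    return sigmoid(code @ w_dec + b_dec)  # sigmoid (assumed)


def mse(a, b):
    return float(np.mean((a - b) ** 2))


loss_before = mse(decode(encode(x)), x)

LEARNING_RATE = 0.5  # assumed; plain gradient descent, not Adam or Adadelta
for _ in range(200):
    pre_act = x @ w_enc + b_enc
    code = np.maximum(0.0, pre_act)
    recon = sigmoid(code @ w_dec + b_dec)
    # Backpropagate the mean squared error through both layers.
    grad_recon = 2.0 * (recon - x) / x.size
    grad_z_dec = grad_recon * recon * (1.0 - recon)
    grad_code = grad_z_dec @ w_dec.T
    grad_pre = grad_code * (pre_act > 0)
    w_dec -= LEARNING_RATE * (code.T @ grad_z_dec)
    b_dec -= LEARNING_RATE * grad_z_dec.sum(axis=0)
    w_enc -= LEARNING_RATE * (x.T @ grad_pre)
    b_enc -= LEARNING_RATE * grad_pre.sum(axis=0)

encoded = encode(x)
reconstructed = decode(encoded)
loss_after = mse(reconstructed, x)
print(encoded.shape, reconstructed.shape)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The encoded array has half as many columns as the input, while the reconstruction has the original width, which is the compression by a factor of 2 described above.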

Fig. 7

After applying autoencoder model to RAVDESS dataset

Fig. 8

After applying autoencoder model to TESS dataset

The accuracies of the reconstructed input after applying our proposed autoencoder model to the TESS and RAVDESS datasets are 91.92% and 90.58%, respectively.

Architecture

The audio dataset is split into training and testing sets using the train_test_split() function from scikit-learn. The training set is used to build the model, and the testing set is used to evaluate its performance on unseen data. The testing set is 33% of the total dataset, so the training set is the remaining 67%, and a random state of 42 is used to make the random split reproducible. The data is then fed into the different classifiers described in this section.
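To make the 33%/67% split concrete, the sketch below computes the resulting sample counts. It assumes the test count is rounded up and the remainder goes to training, which is how scikit-learn handles a fractional test size; the helper name is hypothetical, and the totals 2880 and 2800 are the sums of the training and testing counts reported in the results sections.

```python
def split_sizes(n_samples, test_percent=33):
    """Return (n_train, n_test) with the test count rounded up."""
    n_test = (test_percent * n_samples + 99) // 100  # integer ceiling
    return n_samples - n_test, n_test


for total in (2880, 2800):
    n_train, n_test = split_sizes(total)
    print(f"total={total}: train={n_train}, test={n_test}")
```

With these totals the sketch reproduces the 1929/951 and 1876/924 training and testing counts reported for the two datasets.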

CNN architecture for RAVDESS dataset

The CNN architecture used for the RAVDESS dataset has two convolutional layers followed by pooling, dropout, and flatten layers, in addition to fully connected dense layers with a softmax output, as depicted in Fig. 9. We use a 1D CNN because our data consists of fixed-length audio feature vectors, and a 1D CNN is very effective at extracting features from fixed-length segments. The first convolutional layer has 32 parallel feature maps and a kernel size of 5. The next convolutional layer has 64 parallel feature maps with a kernel size of 5 and ReLU as the activation function. ReLU is used to introduce non-linearity into the model. We use two convolutional layers to help the model learn features from the input. Next is a pooling layer of size 8, which reduces the number of learned features and keeps only the most important elements. It is followed by a dropout layer with a rate of 0.25, which leaves the size of the output matrix unchanged. The dropout layer is needed to slow down the learning process of the model and in turn avoid overfitting. The matrix is then flattened into a vector of height 1024. After a dense layer and one more dropout layer of rate 0.5, the final dense layer with the softmax function produces a vector of 8, since in RAVDESS we predict among eight classes of emotions. Adadelta, a gradient descent based algorithm, is used as the optimizer, and sparse categorical cross-entropy is used as the loss function because our labels are in integer format and the model is built for multi-class classification. The model is trained with a batch size of 128 for 1000 epochs. The loss function and optimizer are the same as those used in the TESS CNN model.
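The layer sizes can be checked with a small shape trace. The sketch below assumes an input of length 128 with one channel, stride-1 convolutions, and non-overlapping pooling; none of these are stated explicitly above. It compares 'same' and 'valid' padding to see which one yields the flattened height of 1024 given in the text.

```python
def conv1d_length(length, kernel_size, padding):
    """Output length of a stride-1 1D convolution."""
    if padding == "same":
        return length
    if padding == "valid":
        return length - kernel_size + 1
    raise ValueError(f"unknown padding: {padding}")


def trace_ravdess_cnn(padding, input_length=128):
    """Return the (length, channels) shape after each layer, then the flat size."""
    length = conv1d_length(input_length, 5, padding)  # conv with 32 feature maps
    shapes = [(length, 32)]
    length = conv1d_length(length, 5, padding)        # conv with 64 feature maps
    shapes.append((length, 64))
    length //= 8                                      # pooling layer of size 8
    shapes.append((length, 64))                       # dropout keeps this shape
    return shapes, length * 64                        # flatten


for padding in ("same", "valid"):
    shapes, flat = trace_ravdess_cnn(padding)
    print(padding, shapes, "flattened:", flat)
```

Only the 'same' padding case gives 1024 under these assumptions, which suggests, but does not confirm, that the convolutions preserve the input length.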

Fig. 9

Keras visualization of the 1D CNN model applied to RAVDESS

CNN architecture for TESS dataset

The CNN architecture used for the TESS dataset consists of 1D convolutional layer with a pooling layer, dropout and flatten layers along with the fully connected dense layers with softmax layer as the output as shown in Fig. 10.

Fig. 10

Conv1D model keras visualization for the TESS dataset

The input passes through a convolutional layer with a kernel size of 5 and 32 feature maps. A pooling layer of size 8 follows in order to reduce the complexity of the output and prevent overfitting. Next comes a dropout layer, which is used to improve accuracy on unseen data: during training it randomly sets the outputs of a fraction of the neurons to zero, making the network less sensitive to small variations in the data. Since we chose a rate of 0.25, 25% of the neuron outputs are zeroed. The size of the output matrix remains the same. The output of the dropout layer is then flattened into a vector of height 512. Passing this vector to the dense layer reduces its size to 128, and the result is again given to a dropout layer of rate 0.5. The final fully connected dense layer, with softmax as the activation function, reduces the vector of height 128 to a vector of 7, since we need to make predictions over seven classes of emotions. The model is trained with a batch size of 128 for 70 epochs, using the Adadelta optimizer with a default learning rate of 0.001 and sparse categorical cross-entropy as the loss function. Finally, the model is evaluated on the test dataset.
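To give a sense of how compact this model is, the sketch below counts its trainable parameters. The input length of 128, the single input channel, and the length-preserving convolution are assumptions (chosen because they yield the flattened height of 512 given above); the layer widths come from the text.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


INPUT_LENGTH = 128  # assumed
INPUT_CHANNELS = 1  # assumed

conv_shape = (INPUT_LENGTH, 32)                     # length-preserving conv assumed
pooled_shape = (conv_shape[0] // 8, conv_shape[1])  # pooling layer of size 8
flat_size = pooled_shape[0] * pooled_shape[1]

params = {
    "conv1d": conv1d_params(5, INPUT_CHANNELS, 32),
    "dense_128": dense_params(flat_size, 128),
    "dense_7": dense_params(128, 7),
}
total_params = sum(params.values())

print("flattened size:", flat_size)
for name, count in params.items():
    print(f"{name}: {count}")
print("total trainable parameters:", total_params)
```

Pooling, dropout, and flatten layers add no parameters, so under these assumptions nearly all of the weights sit in the first dense layer.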

Applying Alexnet and Resnet50 to RAVDESS and TESS

In our proposed methodology, along with the SVM, decision tree, and 1D CNN techniques, we also implemented the AlexNet and ResNet50 models on our two datasets, RAVDESS and TESS. As the AlexNet and ResNet50 architectures are defined for image classification, we converted the audio files into spectrograms and then converted the spectrograms to RGB images. An audio signal is represented in the time domain and must be converted into the frequency domain to be represented as a spectrogram. The fast Fourier transform (FFT) is a mathematical tool that analyses the frequency content of audio, and it is calculated over a series of overlapping window segments. In a spectrogram, the y-axis, which represents frequency, is converted to a log scale and mapped onto the mel scale to obtain a mel spectrogram. Our approach uses mel spectrograms. The FFT size is 1024, which also defines the window length, and the hop length, which defines the step between windows, is 100. The amplitude is then transformed into decibels to obtain a logarithmic scale, and the resulting spectrograms are saved in a specific folder. Once we have spectrograms of all our audio files, we convert them into RGB images, so that each image is represented by 3 channels. As the AlexNet architecture requires images of size (227, 227, 3) and the ResNet50 model takes input of size (224, 224, 3), we resize all images accordingly. Thus, all 1440 spectrograms of the RAVDESS dataset are resized to (227, 227, 3) for AlexNet and to (224, 224, 3) for ResNet50. Similarly, all 2880 spectrograms of the TESS dataset are resized to (227, 227, 3) for AlexNet and (224, 224, 3) for ResNet50. The images are then shuffled in order to prevent any bias during training. The next step is to normalize the data to ensure that all the input images have a common scale.
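The windowing and decibel steps above can be sketched with NumPy alone. This example uses the FFT size of 1024 and hop length of 100 from the text, but it stops before the mel mapping and image conversion; the Hann window, the absence of edge padding, the sampling rate, and the synthetic test tone are all assumptions made for illustration.

```python
import numpy as np

N_FFT = 1024          # FFT size and window length, as in the text
HOP_LENGTH = 100      # step between windows, as in the text
SAMPLE_RATE = 22_050  # assumed for illustration


def power_spectrogram_db(signal, n_fft=N_FFT, hop=HOP_LENGTH):
    """Short-time power spectrum in dB relative to its maximum (no padding)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)  # Hann window (assumed)
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    power = np.maximum(power, 1e-10)  # avoid taking the log of zero
    return (10.0 * np.log10(power / power.max())).T  # (freq bins, frames)


t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)      # synthetic 440 Hz test tone
spec_db = power_spectrogram_db(tone)
print(spec_db.shape, float(spec_db.max()))
```

With a hop of 100 samples and a window of 1024, consecutive windows overlap by roughly 90%, which produces many frames per second of audio and hence a wide spectrogram image before resizing.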
The input is then split into training and testing sets using the train_test_split() function, with a test size of 20% and a random state of 42. The AlexNet model has 8 layers, i.e., 5 convolutional and 3 fully connected layers. The first convolutional layer has 96 kernels of size (11, 11) with stride 4, followed by a max pooling layer of size (3, 3). The second convolutional layer consists of 256 kernels of size (5, 5), followed by another max pooling layer of size (3, 3) with stride 2. The next three convolutional layers are connected directly and have 384, 384, and 256 kernels, respectively, each with kernel size (3, 3) and stride 1. The max pooling layer after the fifth convolutional layer feeds its output into a series of two fully connected dense layers, whose output is then passed to the third fully connected layer with the softmax function. The AlexNet model uses ReLU as the activation function to introduce non-linearity. Adam is used as the optimizer and sparse categorical cross-entropy as the loss function. The model is trained on the training set for 100 epochs with a batch size of 128 for the TESS dataset, and for 100 epochs with a batch size of 32 for the RAVDESS dataset. We also used the pre-trained ResNet50 model built into Keras, which has been trained on the ImageNet data, and applied it to our images of size (224, 224, 3). The layers are initialized with ImageNet weights. The pre-trained model is followed by a classifier using the softmax activation function. Adam is used as the optimizer and categorical cross-entropy as the loss function. The model is trained on the training set for 20 epochs with a batch size of 64 for the RAVDESS dataset, and for 20 epochs with a batch size of 128 for the TESS dataset.
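The spatial sizes inside AlexNet follow from the usual formula floor((size + 2 * padding - kernel) / stride) + 1. The sketch below traces a 227 x 227 input through the layers described above. Kernel sizes, kernel counts, and the stated strides come from the text; every padding value, the strides of the first and last pooling layers, and the stride of the second convolution are assumptions taken from the commonly used AlexNet configuration, not from this work.

```python
def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, output channels); None keeps the channel count.
LAYERS = [
    ("conv1", 11, 4, 0, 96),   # kernel and stride from the text
    ("pool1", 3, 2, 0, None),  # stride 2 assumed
    ("conv2", 5, 1, 2, 256),   # stride 1 and padding 2 assumed
    ("pool2", 3, 2, 0, None),  # stride 2 from the text
    ("conv3", 3, 1, 1, 384),   # padding 1 assumed for conv3 to conv5
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),  # size and stride assumed
]

size, channels = 227, 3
trace = []
for name, kernel, stride, padding, out_channels in LAYERS:
    size = output_size(size, kernel, stride, padding)
    channels = out_channels or channels
    trace.append((name, size, channels))
    print(f"{name}: {size} x {size} x {channels}")

flattened = size * size * channels
print("flattened input to the dense layers:", flattened)
```

Under these assumptions the first dense layer receives several thousand inputs per image, which helps explain the higher computational cost of this model compared with the 1D CNN.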

Performance evaluation

Speech-based emotion recognition is a classification task, and the following performance measures are used in this work to assess the system. Classification accuracy is the percentage of test samples predicted correctly by the classifier; it gives the overall success rate. Precision (Pr) is the ratio of correctly predicted positive samples to all predicted positive samples, and recall (R) is the ratio of correctly predicted positive samples to all actual positive samples. The F1-score is the harmonic mean of precision and recall for a specific class. Macro precision, recall, and F1-score are the averages of the corresponding per-class values, computed with uniform weights across classes. These measures are typically used in multi-class classifier settings. If each class is instead weighted by its number of samples, we obtain the weighted average precision, recall, and F1-score.
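These definitions can be illustrated on a small confusion matrix. The sketch below uses made-up counts for three classes (not data from this study) and computes per-class precision, recall, and F1-score together with their macro and weighted averages.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    results = []
    for c in range(n):
        true_pos = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))
        actual = sum(confusion[c])
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, actual))
    return results


def macro_average(values):
    """Uniform weights: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Each class is weighted by its number of samples."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Toy 3-class confusion matrix (made-up counts, for illustration only).
confusion = [
    [8, 1, 1],
    [2, 16, 2],
    [0, 5, 15],
]

metrics = per_class_metrics(confusion)
recalls = [m[1] for m in metrics]
supports = [m[3] for m in metrics]
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))

for label, (precision, recall, f1, support) in enumerate(metrics):
    print(f"class {label}: Pr={precision:.2f} R={recall:.2f} F1={f1:.2f} n={support}")
print(f"accuracy={accuracy:.2f}")
print(f"macro recall={macro_average(recalls):.4f}")
print(f"weighted recall={weighted_average(recalls, supports):.4f}")
```

In this example the weighted average recall coincides with the overall accuracy, whereas the macro average differs because the classes are not the same size.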

Results and discussion

The performance of the systems implemented in this work is compared on the two datasets using different evaluation measures. This section describes the results and some observations.

Results for the RAVDESS dataset

The numbers of training and testing samples are 1929 and 951, respectively. Table 5 compares the accuracies of the three classifiers before and after applying the autoencoder. The average speedup in accuracy, i.e., the relative improvement in accuracy, is calculated as [(model accuracy after applying autoencoder − model accuracy on original data)/model accuracy on original data]*100. For example, for SVM, it is [(40.16 − 30.17)/30.17]*100, i.e., 33.11%. The values for the decision tree and CNN are calculated in the same way and shown in Table 5. The CNN model used for the original data, before reconstructing the input files, has the same architecture as the CNN model used for the reconstructed input, with a batch size of 128 and 500 epochs. Similarly, the SVM and decision tree classifiers are implemented with Python scikit-learn in the same way for the original data as for the reconstructed data. CNN achieves the best accuracy on the reconstructed data (80%, up from 75% on the original data), while the decision tree is slightly better on the original data (77%).
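The relative improvement formula can be applied directly to the RAVDESS accuracies in Table 5, as in this short sketch (the function name is chosen for illustration).

```python
def relative_improvement(before, after):
    """Relative change in accuracy, in percent of the original accuracy."""
    return (after - before) / before * 100.0


# (accuracy on original data, accuracy after applying the autoencoder), in %
ravdess_accuracy = {
    "SVM": (30.17, 40.16),
    "Decision tree": (77.0, 76.0),
    "CNN": (75.0, 80.0),
}

for name, (before, after) in ravdess_accuracy.items():
    print(f"{name}: {relative_improvement(before, after):.2f}%")
```

Rounded to two decimals, this gives 33.11, −1.30, and 6.67; Table 5 lists the last two as −1.29 and 6.66, which appears to be truncation rather than rounding.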

Table 5

Comparison between performance of models (in terms of % accuracy) implemented on RAVDESS dataset

	SVM	Decision tree	CNN
Model accuracy on original data	30.17	77	75
Model accuracy after applying autoencoder	40.16	76	80
Average speedup in accuracy (%)	33.11	− 1.29	6.66

Tables 6 and 7 display the classification results for the RAVDESS dataset. The precision and recall percentages of each class are shown in the two tables along with the F1-score. For the decision tree classifier, predictions for the Happy and Sad classes improve after applying the autoencoder, as their F1-scores increase. For the CNN, the F1-scores of all classes improve except Angry, which is unchanged at 0.81.

Table 6

Classification results of RAVDESS dataset on original data

Classes	Decision tree classifier			CNN classifier
Classes	Precision (%)	Recall (%)	F-1 score	Precision (%)	Recall (%)	F-1 score
0 (Neutral)	79	84	0.82	61	59	0.6
1 (Calm)	84	81	0.83	77	84	0.8
2 (Happy)	72	77	0.75	59	79	0.67
3 (Sad)	64	70	0.67	68	68	0.68
4 (Angry)	76	76	0.76	93	71	0.81
5 (Fearful)	78	83	0.8	73	74	0.74
6 (Disgust)	78	75	0.76	83	70	0.76
7 (Surprised)	86	72	0.78	79	76	0.78
Macro average	77	77	0.77	74	73	0.73
Weighted average	77	77	0.77	75	74	0.74

Table 7

Classification results of RAVDESS dataset on encoded data

Classes	Decision tree classifier			CNN classifier
Classes	Precision (%)	Recall (%)	F-1 score	Precision (%)	Recall (%)	F-1 score
0 (Neutral)	78	78	0.78	74	75	0.74
1 (Calm)	83	75	0.79	84	95	0.89
2 (Happy)	83	80	0.82	83	71	0.77
3 (Sad)	67	72	0.7	84	71	0.77
4 (Angry)	77	73	0.75	76	87	0.81
5 (Fearful)	72	83	0.77	71	81	0.76
6 (Disgust)	77	70	0.73	79	77	0.78
7 (Surprised)	72	76	0.74	88	77	0.82
Macro average	76	76	0.76	80	79	0.79
Weighted average	76	76	0.76	80	80	0.8


Results for the TESS dataset

The numbers of training and testing samples are 1876 and 924, respectively. Table 8 compares the performance of the three classifiers before and after applying the autoencoder. The CNN model used for the original data, before reconstructing the input files, has the same architecture as the CNN model used for the reconstructed input, with a batch size of 128 and 70 epochs. Similarly, the SVM and decision tree classifiers are applied in the same way to the original data as to the reconstructed data. CNN achieves the best accuracy, i.e., 94% on the original data and 96% on the reconstructed data.

Table 8

Comparison between performance of models implemented on TESS dataset

	SVM	Decision tree classifier	CNN
Model accuracy on original data	86.14%	90%	94%
Model accuracy after applying autoencoder	91.99%	90%	96%
Average speedup in accuracy (%)	6.79	0.0	2.12

Tables 9 and 10 present the classification results for the TESS dataset. For the CNN classifier, there is no significant improvement after applying the proposed approach except for the Surprise and Sad classes, while the decision tree shows a negligible change in performance.

Table 9

Classification results of TESS dataset on original data

Classes	Decision tree classifier			CNN classifier
Classes	Precision (%)	Recall (%)	F-1 score	Precision (%)	Recall (%)	F-1 score
0 (Angry)	92	91	0.91	100	100	1
1 (Disgust)	94	91	0.93	100	98	0.99
2 (Fear)	93	90	0.91	88	97	0.92
3 (Happy)	98	91	0.94	96	100	0.98
4 (Neutral)	86	93	0.89	100	96	0.98
5 (Surprise)	83	84	0.84	96	79	0.87
6 (Sad)	85	89	0.87	78	87	0.82
Macro average	90	90	0.9	94	94	0.94
Weighted average	90	90	0.9	94	94	0.94

Table 10

Classification results of TESS dataset on encoded data

Classes	Decision tree classifier			CNN classifier
Classes	Precision (%)	Recall (%)	F-1 score	Precision (%)	Recall (%)	F-1 score
0 (Angry)	93	97	0.95	98	99	0.99
1 (Disgust)	94	97	0.95	98	98	0.98
2 (Fear)	89	87	0.88	95	96	0.96
3 (Happy)	90	86	0.88	99	96	0.99
4 (Neutral)	86	90	0.88	95	97	0.96
5 (Surprise)	87	84	0.86	97	91	0.94
6 (Sad)	89	87	0.88	91	95	0.93
Macro average	90	90	0.9	96	96	0.96
Weighted average	90	90	0.9	96	96	0.96


Comparison with state-of-the-art techniques

Having looked at the performance of the SVM, decision tree classifier, and 1D CNN, we now discuss the results obtained with the AlexNet and ResNet50 models. The accuracies of the trained AlexNet model on the testing sets of RAVDESS and TESS were 54.17% and 82.32%, respectively. After incorporating the autoencoder for dimensionality reduction, however, the respective accuracies were 21.18% and 43.03%. When the ResNet50 model was applied to the RAVDESS and TESS testing sets, we obtained accuracies of 15.97% and 13.03%, respectively, on the original data, and 12.84% and 15.71%, respectively, on the reconstructed data. Comparing AlexNet and ResNet50 with the SVM, decision tree, and 1D CNN reported in the previous sections, the maximum accuracy for both TESS and RAVDESS is obtained with the 1D CNN and not with either of the deeper models. On the RAVDESS and TESS datasets, AlexNet and ResNet50 combine high computational cost and high processing delays with low performance. We tried to use AlexNet and the pre-trained ResNet50 model for the SER problem, but the results were not satisfactory. In this paper, our contributions are to increase the accuracy of speech emotion recognition relative to these state-of-the-art models and to reduce the computational complexity of the SER model, both achieved using SVM, decision tree, and 1D CNN. We therefore focus mainly on SVM, decision tree, and 1D CNN, as these techniques are compact and simple in structure, cost-effective, and memory efficient.

Observations

It can be seen from Tables 5 and 8 that the performance of SVM and CNN improves after using the autoencoder for dimensionality reduction. However, the decision tree classifier doesn’t show any improvement in accuracy: for the TESS dataset it is the same 90% in both scenarios, while for RAVDESS it drops from 77% to 76%. Observations from Fig. 11 indicate that the RAVDESS dataset is more challenging, as the system achieved lower performance across all classifiers than on the TESS dataset. Another conclusion is that the decision tree classifier is mostly invariant to the proposed method, i.e., the compression hasn’t affected its performance much. The other two classifiers, however, show promising performance with the compact representation. This indicates that the efficiency of the system is data-driven and classifier-dependent.

Fig. 11

Performance of a SVM, b decision tree, and c CNN classifier on both the datasets

Conclusion and future work

In this paper, we demonstrated the impact of an autoencoder-based compact representation of audio data on recognizing human emotions. An improvement was observed with the aid of this compact representation on two benchmark datasets. The improvements varied with the type of classifier and with the dataset. Averaged over the three classifiers, accuracy improved by 4.66 percentage points on the RAVDESS dataset and by 2.616 percentage points on the TESS dataset. To the best of our knowledge, this is the first attempt to exploit autoencoders directly on audio files for audio emotion detection, achieving a highest accuracy of 96% on the TESS dataset. For future work, we suggest replacing the decision tree classifier with other classifiers such as long short-term memory (LSTM) networks or their combination with CNN. Further, the proposed model employs a simple autoencoder, and the results are likely to improve with different autoencoders, possibly used in succession, such as denoising autoencoders and convolutional autoencoders.
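The two averages quoted above can be reproduced from Tables 5 and 8 as the mean of the absolute accuracy gains over the three classifiers. The sketch below shows that calculation; the interpretation as a mean of absolute gains is inferred from the arithmetic rather than stated in the text.

```python
from statistics import mean

# (accuracy on original data, accuracy after applying the autoencoder), in %
accuracy = {
    "RAVDESS": {"SVM": (30.17, 40.16), "Decision tree": (77.0, 76.0), "CNN": (75.0, 80.0)},
    "TESS": {"SVM": (86.14, 91.99), "Decision tree": (90.0, 90.0), "CNN": (94.0, 96.0)},
}


def mean_gain(results):
    """Mean absolute accuracy gain, in percentage points."""
    return mean(after - before for before, after in results.values())


gains = {dataset: mean_gain(results) for dataset, results in accuracy.items()}
for dataset, gain in gains.items():
    print(f"{dataset}: {gain:.3f} percentage points")
```

This yields about 4.663 for RAVDESS and 2.617 for TESS, matching the quoted 4.66 and 2.616 up to truncation.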
