Rakesh Kumar Das, Nahidul Islam, Md Rayhan Ahmed, Salekul Islam, Swakkhar Shatabda, A K M Muzahidul Islam.
Abstract
A speech emotion recognition (SER) system determines a speaker's emotional state by analyzing the speaker's speech audio signal. It is an essential yet challenging task in human-computer interaction systems and one of the most demanding areas of research using artificial intelligence and deep learning architectures. Despite being the world's seventh most widely spoken language, Bangla is still classified as a low-resource language for speech emotion recognition because little data is available, and there is an apparent lack of SER datasets for research in Bangla. This article presents BanglaSER, a Bangla speech-audio emotion recognition dataset, to address this problem. It consists of speech-audio data from 34 nonprofessional participating actors (17 male and 17 female) aged between 19 and 47 years. The dataset contains 1467 Bangla speech-audio recordings of five basic human emotional states: angry, happy, neutral, sad, and surprise. Three trials are conducted for each emotional state. Hence, the total comprises 3 statements × 3 repetitions × 4 emotional states (angry, happy, sad, and surprise) × 34 participating speakers = 1224 recordings, plus 3 statements × 3 repetitions × 1 emotional state (neutral) × 27 participating speakers = 243 recordings, giving 1467 recordings in all. The BanglaSER dataset is created by recording speech audio through smartphones and laptops. It has a balanced number of recordings in each category and evenly distributed male and female actors, and would serve as an essential training dataset for Bangla speech emotion recognition models in terms of generalization. BanglaSER is compatible with various deep learning architectures such as convolutional neural networks, long short-term memory, gated recurrent units, and Transformers. The dataset is available at https://data.mendeley.com/datasets/t9h6p943xy/5 and can be used for research purposes.
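The recording count stated in the abstract can be reproduced with a few lines of arithmetic. The sketch below is only an illustration; the function name `count_recordings` is hypothetical, and the numbers are taken directly from the abstract.

```python
def count_recordings(statements, repetitions, emotions, speakers):
    """Number of clips for one group of emotions recorded by one group of speakers."""
    return statements * repetitions * emotions * speakers

# Angry, happy, sad, and surprise: recorded by all 34 speakers.
non_neutral = count_recordings(statements=3, repetitions=3, emotions=4, speakers=34)

# Neutral: recorded by 27 speakers, as stated in the abstract.
neutral = count_recordings(statements=3, repetitions=3, emotions=1, speakers=27)

total = non_neutral + neutral
print(non_neutral, neutral, total)
```

Each non-neutral emotion therefore contributes 3 × 3 × 34 = 306 clips, while the neutral class contributes 243.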
Keywords: Bangla language; Deep Learning; Sound processing; Speech emotion recognition
Year: 2022 PMID: 35392615 PMCID: PMC8980634 DOI: 10.1016/j.dib.2022.108091
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Class-wise speech-audio recording distribution in the BanglaSER dataset.
Fig. 2. Sample waveform plots of one randomly selected sample per emotional state of the BanglaSER dataset: (a) angry, (b) happy, (c) neutral, (d) sad, and (e) surprise.
A comparative summary among different public SER datasets and BanglaSER dataset (- Not mentioned, ✔- Yes, x - No).
| Specifications | IEMOCAP | RAVDESS | MSP-IMPROV | EMO-DB | SAVEE | AESDD | CADKES | SUBESCO | BanglaSER |
|---|---|---|---|---|---|---|---|---|---|
| # scripted / target audio records | 5255 | 1440 | 620 | 535 | 480 | 500 | 6760 | 7000 | 1467 |
| # of speech-audio emotions | 9 | 8 | 4 | 7 | 7 | 5 | 5 | 7 | 5 |
| # of target sentences | 3 | 2 | 15 | 10 | 15 | 19 | 52 | 10 | 3 |
| # of participating actors | 10 | 24 | 12 | 10 | 4 | 5 | 26 | 20 | 34 |
| Professional actors | ✔ | ✔ | ✔ | ✔ | - | ✔ | x | ✔ | x |
| Sampling Rate | 16 kHz | 48 kHz | 48 kHz | 16 kHz | 44.1 kHz | - | 16 kHz | 48 kHz | 44.1 kHz |
| Class balance | ✔ | x | ✔ | x | x | ✔ | ✔ | ✔ | ✔ |
| Gender balance | ✔ | ✔ | ✔ | ✔ | x | x | ✔ | ✔ | ✔ |
| Language | English | English | English | German | English | Greek | Korean | Bangla | Bangla |
Fig. 3. Sample (a) Mel-spectrograms, (b) spectral rolloff, (c) MFCCs, and (d) tonal centroids of every category of emotions from the BanglaSER dataset.
A suggested feature vector set for a machine-learning model to evaluate the BanglaSER dataset for the SER task.
| Domain Name | Number of Features | Parameters |
|---|---|---|
| Time | 3 | Amplitude envelope (frame size = 1024, hop length = 512) |
| Frequency | 145 | Band energy ratio (split frequency = 2000 Hz, sample rate = 22050 Hz) |
| Cepstral | 20 | MFCCs (coefficients: 1-20, Discrete cosine transform type = 2). |
| Statistics | 3 | Entropy, Kurtosis, and Skewness. |
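As a rough illustration of the time-domain and statistical rows of this table, the sketch below computes an amplitude envelope with the listed frame size and hop length, plus skewness and kurtosis, using only the Python standard library. It rests on assumptions not stated in the source: the envelope is taken as the maximum absolute sample per frame, kurtosis is the non-excess (Pearson) form, and all function names are hypothetical. The frequency and cepstral features and the entropy measure are left out, since the source does not specify how they are computed.

```python
import math

def amplitude_envelope(signal, frame_size=1024, hop_length=512):
    """Maximum absolute amplitude of each frame (assumed definition)."""
    return [
        max(abs(sample) for sample in signal[start:start + frame_size])
        for start in range(0, len(signal), hop_length)
    ]

def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment (non-excess form, assumed)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Synthetic stand-in for an audio clip: 1 s of a 440 Hz sine at 22050 Hz.
SAMPLE_RATE = 22050
signal = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

envelope = amplitude_envelope(signal)
print(len(envelope), max(envelope), skewness(signal), kurtosis(signal))
```

On a real clip, `signal` would be the decoded samples of a WAV file resampled to the sample rate listed in the table.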
A descriptive summarization of the proposed BanglaSER dataset.
| Year | 2022 |
|---|---|
| Language | Bangla |
| Dataset type | Acted, Scripted |
| File Type | Audio only |
| The file format of audio clips | .WAV |
| Sampling rate | 44.1 kHz |
| No. of actors | 34 (17 males and 17 females) |
| Age group of actors | 19 to 47 years |
| No. of emotions | 5 |
| Emotional states | Angry, Happy, Neutral, Sad, Surprise |
| No. of statements | 3 |
| No. of audio clips | 1467 |
| Size of the dataset | 776.8 MB |
| Unit level | Sentence |
| No. of words | 11 |
| No. of vowels in the texts (Phonetic) | 23 |
| No. of consonants in the texts (Phonetic) | 24 |
| No. of other phonemes in the texts (Phonetic) | 0 Diphthongs and 0 nasalization |
| The average duration of each clip | 3 to 4 s |
| Utilized Software | Audacity |
| Total duration | 1 h 29 min (approx.) |
| Human accuracy | 80.5% (approx.) |
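The clip count and total duration in this summary can be cross-checked against the stated 3 to 4 s clip length. This is only a consistency sketch using the table's approximate figures; the variable names are hypothetical.

```python
NUM_CLIPS = 1467                     # "No. of audio clips"
TOTAL_SECONDS = (1 * 60 + 29) * 60   # "1 h 29 min (approx.)" expressed in seconds

# Mean clip length implied by the two figures above.
mean_clip_seconds = TOTAL_SECONDS / NUM_CLIPS
print(f"{mean_clip_seconds:.2f} s per clip on average")
```

The implied mean falls inside the 3 to 4 s range given for each clip.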
Fig. 4. Workflow diagram of the BanglaSER dataset preparation.
Description of the filename convention.
| Identifier | Meaning |
|---|---|
| Mode | 03 = Audio-only |
| Statement type | 01 = Scripted |
| Emotion | 01 = Happy, 02 = Sad, 03 = Angry, 04 = Surprise, 05 = Neutral |
| Intensity | 01 = Normal, 02 = Strong |
| Statements | 01 = |
| Repetition | 01 = 1st repetition, 02 = 2nd repetition, 03 = 3rd repetition |
| Actor | 01 = First actor, 02 = Second actor, ..., 34 = Thirty-fourth actor (odd = male, even = female) |
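The identifiers above can be decoded programmatically. The sketch below is a minimal example under assumptions the table does not confirm: that a filename consists of the seven two-digit identifiers in the order listed, separated by hyphens (for example `03-01-03-02-01-02-34.wav`). The function name `parse_filename` and the returned keys are hypothetical.

```python
from pathlib import Path

# Assumed field order: the order of the rows in the filename convention table.
FIELDS = ["mode", "statement_type", "emotion", "intensity",
          "statement", "repetition", "actor"]

EMOTIONS = {"01": "happy", "02": "sad", "03": "angry", "04": "surprise", "05": "neutral"}
INTENSITIES = {"01": "normal", "02": "strong"}

def parse_filename(name):
    """Decode a hyphen-separated BanglaSER-style filename into labels."""
    parts = Path(name).stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} identifiers, got {len(parts)}: {name}")
    codes = dict(zip(FIELDS, parts))
    actor = int(codes["actor"])
    return {
        "emotion": EMOTIONS[codes["emotion"]],
        "intensity": INTENSITIES[codes["intensity"]],
        "statement": int(codes["statement"]),
        "repetition": int(codes["repetition"]),
        "actor": actor,
        # Odd actor numbers are male, even are female (from the table).
        "gender": "male" if actor % 2 == 1 else "female",
    }

example = parse_filename("03-01-03-02-01-02-34.wav")
print(example)
```

If the actual files use a different separator or field order, only `FIELDS` and the `split` call need to change.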
Confusion matrix (in %) for the actual (rows) versus perceived (columns) emotions for target statements during the dataset evaluation process. Each row sums to 100%, and the overall perceived accuracy is 80.5%.
| Actual emotion | Angry (%) | Happy (%) | Neutral (%) | Sad (%) | Surprise (%) |
|---|---|---|---|---|---|
| Angry | 84.5 | 2 | 0.5 | 5 | 8 |
| Happy | 3.5 | 80 | 3 | 1.5 | 12 |
| Neutral | 1 | 3.5 | 81.5 | 11.5 | 2.5 |
| Sad | 2.5 | 1.5 | 14 | 79.5 | 2.5 |
| Surprise | 3 | 13.5 | 5 | 1.5 | 77 |
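The correctly perceived (diagonal) percentage of each emotion follows from the off-diagonal confusions, and the mean of those percentages can be cross-checked against the reported 80.5% human accuracy. The sketch below assumes that every row of the matrix sums to 100% and that the four confusion values listed per row belong to the other four emotions in column order; the names used are hypothetical.

```python
# Off-diagonal confusions (%), keyed by actual emotion, then perceived emotion.
OFF_DIAGONAL = {
    "Angry":    {"Happy": 2.0, "Neutral": 0.5, "Sad": 5.0, "Surprise": 8.0},
    "Happy":    {"Angry": 3.5, "Neutral": 3.0, "Sad": 1.5, "Surprise": 12.0},
    "Neutral":  {"Angry": 1.0, "Happy": 3.5, "Sad": 11.5, "Surprise": 2.5},
    "Sad":      {"Angry": 2.5, "Happy": 1.5, "Neutral": 14.0, "Surprise": 2.5},
    "Surprise": {"Angry": 3.0, "Happy": 13.5, "Neutral": 5.0, "Sad": 1.5},
}

def correct_rates(off_diagonal):
    """Diagonal entries, assuming each row of the matrix sums to 100%."""
    return {actual: 100.0 - sum(confusions.values())
            for actual, confusions in off_diagonal.items()}

rates = correct_rates(OFF_DIAGONAL)
mean_accuracy = sum(rates.values()) / len(rates)
print(rates, mean_accuracy)
```

Under these assumptions the mean of the five per-emotion rates agrees with the reported overall figure, and the largest confusions are between happy and surprise and between neutral and sad.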
Specifications table.
| Subject | Signal Processing |
|---|---|
| Specific subject area | Speech emotion recognition, Sound analysis |
| Type of data | Digital audio files |
| How the data were acquired | Bangladeshi voice data are recorded using the smartphone's default recording application, laptop, and microphone. Recordings are set to a duration of 3 to 4 s, and surrounding noise is removed using the Audacity software. Tools used: smartphone, microphone, headset, Asus GL503GE laptop, and Audacity software. |
| Data format | Waveform Audio File Format (WAV). |
| Description of data collection | The speech-audio have been recorded in different categories: |
| Data source location | City: Dhaka |
| Data accessibility | The dataset is available at https://data.mendeley.com/datasets/t9h6p943xy/5 |