Bagus Tris Atmaja, Akira Sasou.
Abstract
Data augmentation techniques have recently gained wider adoption in speech processing, including speech emotion recognition. Although more data tend to be more effective, there may be a trade-off in which more data do not yield a better model. This paper reports experiments investigating the effects of data augmentation on speech emotion recognition. The investigation aims to find the most useful type of data augmentation and the most useful number of data augmentations for speech emotion recognition under various conditions. The experiments are conducted on the Japanese Twitter-based emotional speech (JTES) and IEMOCAP datasets. The results show that for speaker-independent data, two data augmentations, glottal source extraction and silence removal, exhibited the best performance, even compared with using more data augmentation techniques. For text-independent data (including speaker- and text-independent data), more data augmentations tend to improve speech emotion recognition performance. The results highlight the trade-off between the number of data augmentations and the performance of speech emotion recognition, showing the necessity of choosing a proper data augmentation technique for a specific condition.
Keywords: SVM; affective computing; data augmentations; speech emotion recognition; wav2vec 2.0
Year: 2022 PMID: 36015717 PMCID: PMC9415521 DOI: 10.3390/s22165941
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Flow of data selection for training speaker-independent SER in the data augmentation experiments. J: Japanese (JTES); E: English (IEMOCAP). For the text-independent split, training and test data are divided by sentences instead of speakers.
Figure 2. Source-filter model based on CAR-HMM. The AR filter and the HMM, as depicted in (b,f), represent a vocal tract and a generative model of a glottal flow derivative, respectively. Given a state transition sequence, the expectations and variances of each state’s output PDFs can align (for example) as depicted in (d,e). The glottal flow derivative, as depicted in (c), is then defined by the realized values of the aligned output PDFs. Finally, the CAR-HMM is assumed to generate the voiced speech as depicted in (a) by filtering the glottal flow derivative with the AR filter.
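The final step of the caption above, filtering a source signal with an AR (all-pole) filter, can be sketched as follows. This is a minimal illustration and not the CAR-HMM itself: the HMM that generates the glottal flow derivative is not modeled, the excitation is a toy impulse train, and the function name `ar_filter`, the coefficient value, and the sign convention of the recursion are assumptions made for this example.

```python
def ar_filter(excitation, ar_coeffs):
    """Apply an all-pole (AR) filter to an excitation signal.

    Assumed convention: y[n] = x[n] - sum_{k=1..p} a[k] * y[n-k].
    """
    output = []
    for n, x_n in enumerate(excitation):
        y_n = x_n
        for k, a_k in enumerate(ar_coeffs, start=1):
            if n - k >= 0:
                y_n -= a_k * output[n - k]
        output.append(y_n)
    return output


# Toy excitation: one impulse every 8 samples (a stand-in for the
# glottal flow derivative, which has a different shape in practice).
excitation = [1.0 if n % 8 == 0 else 0.0 for n in range(16)]

# Hypothetical first-order filter; with a[1] = -0.5 each impulse decays by half per sample.
speech = ar_filter(excitation, ar_coeffs=[-0.5])
print(speech[:4])
```

A real vocal-tract model would use a higher-order filter with coefficients estimated from speech; the first-order filter here only makes the recursion easy to follow.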
Figure 3. Plots of the original and augmented data from a sample in the JTES dataset.
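Two of the augmentation types named in this record, noise addition and impulse response, can be sketched in a few lines. This is an illustrative sketch only: the Gaussian noise, the target signal-to-noise ratio, the toy impulse response, and the function names `add_noise` and `apply_impulse_response` are assumptions, since the noise and impulse-response sources used in the study are not given here.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target SNR in dB (assumed noise type)."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def apply_impulse_response(signal, impulse_response):
    """Convolve a signal with an impulse response (full convolution)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [math.sin(2 * math.pi * n / 20) for n in range(200)]  # toy waveform
noisy = add_noise(clean, snr_db=10.0, rng=rng)  # hypothetical 10 dB SNR
reverberant = apply_impulse_response([1.0, 2.0, 3.0], [1.0, 0.5])  # toy two-tap response
print(len(noisy), reverberant)
```

In practice the noise and impulse responses would come from recorded corpora, and the convolution would be done with an FFT-based routine for speed.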
Number of utterances (training data) for each number of data augmentations, with the test data in the last row.
| Number of Data Augmentations | JTES-SI | JTES-TI | JTES-STI | IEMOCAP |
|---|---|---|---|---|
| Without augmentation | 16,000 | 16,000 | 14,400 | 4290 |
| With one augmentation | 32,000 | 32,000 | 28,800 | 8580 |
| With two augmentations | 48,000 | 48,000 | 43,200 | 12,870 |
| With three augmentations | 64,000 | 64,000 | 57,600 | 17,160 |
| With four augmentations | 80,000 | 80,000 | 72,000 | 21,450 |
| Test data | 2000 | 2000 | 400 | 1241 |
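The training counts in the table above are consistent with each augmentation adding one augmented copy of every original training utterance, so the size grows as base × (1 + number of augmentations). The short check below reproduces the table under that reading; the dictionary and function names are chosen for this example.

```python
# Training-set sizes without augmentation, taken from the table above.
BASE_TRAINING_SIZES = {
    "JTES-SI": 16000,
    "JTES-TI": 16000,
    "JTES-STI": 14400,
    "IEMOCAP": 4290,
}


def training_size(base, n_augmentations):
    """Original utterances plus one augmented copy per augmentation."""
    return base * (1 + n_augmentations)


table = {
    n: {name: training_size(base, n) for name, base in BASE_TRAINING_SIZES.items()}
    for n in range(5)
}

for n, row in table.items():
    print(n, row)
```

The test sets are not augmented, so their sizes stay fixed across all rows.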
Figure 4. Flowchart of the main methodology from datasets to unweighted accuracy (UA).
Unweighted average recall (UAR, %) of SER with different data augmentation techniques on JTES-SI. orig = original JTES dataset, glt = glottal source extraction, spc = speech cleaned, ir = impulse response, noi = noise addition. The highest scores for each experiment are in bold.
| Data | Exp. #1 | Exp. #2 |
|---|---|---|
| Without augmentation | ||
| orig | 97.10 | 97.10 |
| With one augmentation | ||
| orig + glt | 96.45 | 96.45 |
| orig + spc | 97.20 | 97.20 |
| orig + ir | 97.73 | 97.43 |
| orig + noi | 97.55 | 97.60 |
| With two augmentations | ||
| orig + glt + spc | 96.90 | 96.90 |
| orig + spc + ir | 97.78 | 97.58 |
| orig + ir + noi | 97.85 | 97.83 |
| orig + noi + glt | 97.23 | 97.50 |
| orig + glt + ir | 97.33 | 97.10 |
| orig + spc + noi | 97.65 | 97.65 |
| With three augmentations | ||
| orig + glt + spc + ir | 97.53 | 97.40 |
| orig + spc + ir + noi | 97.98 | |
| orig + ir + noi + glt | 97.78 | 97.78 |
| orig + noi + glt + spc | 97.60 | 97.68 |
| With four augmentations | ||
| orig + glt + spc + ir + noi | | 97.90 |
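The metric reported in these tables, unweighted average recall (UAR), is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. A minimal sketch follows; the labels and predictions are toy values invented for illustration and are not taken from the study.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, in percent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return 100.0 * sum(recalls) / len(recalls)


# Toy example with an imbalanced label distribution (hypothetical labels).
y_true = ["ang", "ang", "ang", "ang", "joy", "joy"]
y_pred = ["ang", "ang", "ang", "joy", "joy", "ang"]

uar = unweighted_average_recall(y_true, y_pred)
print(uar)  # recall(ang) = 3/4, recall(joy) = 1/2, so UAR = 62.5
```

On the same toy data, plain accuracy is 4/6 (about 66.7%), which shows how UAR differs from accuracy when classes are imbalanced.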
Unweighted average recall (UAR, %) of SER with different data augmentation techniques on JTES-TI. orig = original JTES dataset, glt = glottal source extraction, spc = speech cleaned, ir = impulse response, noi = noise addition. The highest scores for each experiment are in bold.
| Data | Exp. #1 | Exp. #2 |
|---|---|---|
| Without augmentation | ||
| orig | 75.08 | 75.05 |
| With one augmentation | ||
| orig + glt | 76.43 | 76.45 |
| orig + spc | 76.23 | 76.23 |
| orig + ir | 75.30 | 74.70 |
| orig + noi | 75.38 | 75.20 |
| With two augmentations | ||
| orig + glt + spc | | |
| orig + spc + ir | 75.38 | 74.68 |
| orig + ir + noi | 75.30 | 74.73 |
| orig + noi + glt | 76.05 | 75.98 |
| orig + glt + ir | 75.50 | 75.10 |
| orig + spc + noi | 75.80 | 75.45 |
| With three augmentations | ||
| orig + glt + spc + ir | 75.70 | 74.75 |
| orig + spc + ir + noi | 75.48 | 74.25 |
| orig + ir + noi + glt | 75.43 | 75.00 |
| orig + noi + glt + spc | 75.93 | 75.00 |
| With four augmentations | ||
| orig + glt + spc + ir + noi | 75.63 | 74.50 |
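The row groups in these tables match the unordered subsets of the four augmentation techniques: 4 single augmentations, 6 pairs, 4 triples, and 1 set of all four, plus the unaugmented baseline. The enumeration can be sketched as below; the function name `augmentation_sets` is chosen for this example, and the ordering of techniques within a row may differ from the tables.

```python
from itertools import combinations

# Abbreviations as defined in the table captions.
TECHNIQUES = ["glt", "spc", "ir", "noi"]


def augmentation_sets(techniques):
    """Group every subset of techniques by size, always keeping the original data."""
    return {
        k: [("orig",) + combo for combo in combinations(techniques, k)]
        for k in range(len(techniques) + 1)
    }


configs = augmentation_sets(TECHNIQUES)
for k, rows in configs.items():
    print(k, [" + ".join(row) for row in rows])
```

This yields 16 training configurations per dataset split, which is the number of data rows in each results table.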
Unweighted average recall (UAR, %) of SER with different data augmentation techniques on JTES-STI. orig = original JTES dataset, glt = glottal source extraction, spc = speech cleaned, ir = impulse response, noi = noise addition. The highest scores for each experiment are in bold.
| Data | Exp. #1 | Exp. #2 |
|---|---|---|
| Without augmentation | ||
| orig | 74.50 | 74.50 |
| With one augmentation | ||
| orig + glt | 75.50 | 75.50 |
| orig + spc | 76.50 | 76.50 |
| orig + ir | 74.75 | 73.75 |
| orig + noi | 76.25 | 75.50 |
| With two augmentations | ||
| orig + glt + spc | | |
| orig + spc + ir | 76.00 | 74.50 |
| orig + ir + noi | 75.50 | 74.50 |
| orig + noi + glt | 75.25 | 75.25 |
| orig + glt + ir | 75.00 | 73.75 |
| orig + spc + noi | 75.75 | 74.75 |
| With three augmentations | ||
| orig + glt + spc + ir | 76.00 | 74.75 |
| orig + spc + ir + noi | 74.50 | 74.25 |
| orig + ir + noi + glt | 74.00 | 75.00 |
| orig + noi + glt + spc | 75.50 | 75.00 |
| With four augmentations | ||
| orig + glt + spc + ir + noi | 74.75 | 74.50 |
Unweighted average recall (UAR, %) of SER with different data augmentation techniques on IEMOCAP. orig = original IEMOCAP dataset, glt = glottal source extraction, spc = speech cleaned, ir = impulse response, noi = noise addition. The highest scores for each experiment are in bold.
| Data | Exp. #1 | Exp. #2 |
|---|---|---|
| Without augmentation | ||
| orig | 74.88 | 74.88 |
| With one augmentation | ||
| orig + glt | 74.23 | 74.23 |
| orig + spc | 75.44 | 75.44 |
| orig + ir | 75.03 | 74.80 |
| orig + noi | 75.43 | 75.07 |
| With two augmentations | ||
| orig + glt + spc | 75.66 | 75.66 |
| orig + spc + ir | 75.68 | 75.37 |
| orig + ir + noi | 75.15 | 75.46 |
| orig + noi + glt | 75.11 | 75.54 |
| orig + glt + ir | 75.01 | 74.50 |
| orig + spc + noi | 75.62 | 75.33 |
| With three augmentations | ||
| orig + glt + spc + ir | 75.73 | 75.39 |
| orig + spc + ir + noi | 76.16 | |
| orig + ir + noi + glt | 75.16 | 74.54 |
| orig + noi + glt + spc | 76.03 | 75.71 |
| With four augmentations | ||
| orig + glt + spc + ir + noi | | 75.80 |
Comparison of JTES performances (Unweighted Accuracy, UA). The highest scores for each split are in bold.
| Reference | Split | Features | Augmentation | UA (%) |
|---|---|---|---|---|
| [ | SI | emo_large | No | 87.88 |
| [ | SI | Mel-cepstrum | No | 71.31 |
| [ | SI | ComParE | No | 81.44 |
| This study | SI | wav2vec 2.0 | Yes | |
| [ | TI | emo_large | No | 64.36 |
| This study | TI | wav2vec 2.0 | Yes | |
| [ | STI | emo_large | No | 69.56 |
| [ | STI | HSFs of 5 frames | Yes | 73.40 |
| [ | STI | LLDs | No | 64.40 |
| This study | STI | wav2vec 2.0 | No | 74.50 |
| This study | STI | wav2vec 2.0 | Yes | |
Comparison of IEMOCAP performances. The highest scores in this study are in bold.
| Reference | Test Set | Features | Aug. | WA (%) | UA (%) |
|---|---|---|---|---|---|
| [ | CV | UniSpeech-SAT Large | No | 70.78 | - |
| [ | CV | HuBERT large | No | 67.56 | - |
| [ | CV | Audio25 + GloVe + BERT | No | 77.51 | 78.41 |
| [ | Session 5 | Audio25 + GloVe + BERT | No | 83.08 | 83.22 |
| [ | Session 5 | HuBERT large | No | 63.90 | 64.54 |
| This study | Session 5 | wav2vec 2.0 | No | 74.13 | 74.88 |
| This study | Session 5 | wav2vec 2.0 | Yes | | |