Nikola Simić1, Siniša Suzić1, Tijana Nosek1, Mia Vujović1, Zoran Perić2, Milan Savić3, Vlado Delić1.
Abstract
Speaker recognition is an important classification task that can be solved using several approaches. Although building a speaker recognition model on a closed set of speakers under neutral speaking conditions is a well-researched task with solutions that provide excellent performance, the classification accuracy of such models decreases significantly when they are applied to emotional speech or in the presence of interference. Furthermore, deep models may require a large number of parameters, so constrained solutions are desirable for implementation on edge devices in Internet of Things systems for real-time detection. The aim of this paper is to propose a simple and constrained convolutional neural network for speaker recognition tasks and to examine its robustness in emotional speech conditions. We examine three quantization methods for developing a constrained network: 8-bit floating-point (FP8) format, ternary scalar quantization, and binary scalar quantization. The results are demonstrated on the recently recorded SEAC dataset.
Keywords: convolutional neural network; emotional speech; quantization; speaker recognition
Year: 2022; PMID: 35327924; PMCID: PMC8947568; DOI: 10.3390/e24030414
Source DB: PubMed; Journal: Entropy (Basel); ISSN: 1099-4300; Impact factor: 2.524
Figure 1. Examples of spectrograms for various styles of emotional speech obtained from the SEAC database: (a) neutral; (b) anger; (c) joy; (d) fear; and (e) sadness.
The proposed CNN model.
| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 16, kernel size = (9, 3), input shape (128, 170, 1) | 448 |
| MaxPooling2D | Pool size = (2, 2) | |
| Convolution2D | Filters = 32, kernel size = (3, 1) | 1568 |
| MaxPooling2D | Pool size = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 128 | 4,989,056 |
| Dropout | Rate = 0.2 | |
| Dense_2 | Nodes = 23 | 2967 |
| Total number of parameters | | 4,994,039 |
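The parameter counts in the table above can be reproduced with a short calculation. The sketch below is not taken from the paper: it assumes unpadded ("valid") convolutions, pooling with a stride equal to the pool size, and one bias per filter or node. The helper names are hypothetical. Under those assumptions the per-layer counts and the total match the table.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter (assumed layout)."""
    kh, kw = kernel
    return kh * kw * in_channels * filters + filters


def dense_params(inputs, nodes):
    """Weights plus one bias per node (assumed layout)."""
    return inputs * nodes + nodes


def valid_out(size, window, stride=1):
    """Output length of an unpadded convolution or pooling step."""
    return (size - window) // stride + 1


# Trace the feature-map shape through the proposed model.
h, w = 128, 170                                   # input shape (128, 170, 1)
conv1 = conv2d_params((9, 3), 1, 16)
h, w = valid_out(h, 9), valid_out(w, 3)           # 120 x 168
h, w = valid_out(h, 2, 2), valid_out(w, 2, 2)     # 60 x 84 after pooling
conv2 = conv2d_params((3, 1), 16, 32)
h, w = valid_out(h, 3), valid_out(w, 1)           # 58 x 84
h, w = valid_out(h, 2, 2), valid_out(w, 2, 2)     # 29 x 42 after pooling
flat = h * w * 32                                 # size of the flattened vector
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 23)
total = conv1 + conv2 + dense1 + dense2

print(conv1, conv2, dense1, dense2, total)
```

Almost all of the parameters sit in Dense_1, which is why the size of the flattened feature map dominates the model size.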
The number of spectrograms in folds.
| Fold | Number of Spectrograms |
|---|---|
| 1 | 607 |
| 2 | 587 |
| 3 | 593 |
| 4 | 583 |
| 5 | 684 |
| Total | 3054 |
Cross-validation performance of the proposed full-precision model.
| Fold | Classification Accuracy (%) | Weighted Precision | Weighted Recall | Weighted F1 Score |
|---|---|---|---|---|
| 1 | 99.51 | 1.00 | 1.00 | 1.00 |
| 2 | 99.49 | 1.00 | 0.99 | 0.99 |
| 3 | 99.66 | 1.00 | 1.00 | 1.00 |
| 4 | 98.46 | 0.98 | 0.98 | 0.98 |
| 5 | 99.12 | 0.99 | 0.99 | 0.99 |
| Average | 99.248 | 0.994 | 0.992 | 0.992 |
The number of spectrograms used per emotion within the experiment.
| | Emotion | Number of Spectrograms |
|---|---|---|
| Training | Neutral | 2370 |
| Testing | Neutral | 684 |
| Testing | Anger | 513 |
| Testing | Joy | 608 |
| Testing | Fear | 579 |
| Testing | Sadness | 635 |
Figure 2. Training and validation performance of the full-precision CNN model.
Classification accuracy (%) of the proposed full-precision and quantized CNN models for each emotion.
| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 99.12 | 86.16 | 84.46 | 79.84 | 85.69 |
| FP8 (1, 4, 3) | 99.12 | 86.35 | 84.46 | 79.84 | 85.69 |
| FP8 (1, 5, 2) | 99.12 | 86.55 | 84.46 | 79.84 | 85.86 |
| Ternary quant. | 97.22 | 86.35 | 83.07 | 76.54 | 84.54 |
| Binary quant. | 94.01 | 83.82 | 77.37 | 69.29 | 83.06 |
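To make the FP8 labels concrete, the sketch below rounds a value to a small floating-point grid. It reads "FP8 (1, 4, 3)" as 1 sign bit, 4 exponent bits and 3 mantissa bits, which is an assumption about the notation. The bias, the handling of very small values and the reserved top exponent code follow IEEE-style conventions and are also assumptions, not details taken from the paper. The function name `quantize_fp` is hypothetical.

```python
import math


def quantize_fp(x, exp_bits, man_bits):
    """Round x to the nearest value on a small floating-point grid.

    Assumed layout: bias 2**(exp_bits - 1) - 1, gradual underflow, and the
    top exponent code reserved (so it is not used for finite numbers).
    """
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                       # smallest normal exponent
    e_max = (2 ** exp_bits - 2) - bias     # largest normal exponent
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** e_max
    # math.frexp returns x = m * 2**e with 0.5 <= |m| < 1,
    # so floor(log2(|x|)) equals e - 1.
    e = math.frexp(x)[1] - 1
    e = min(max(e, e_min), e_max)
    step = 2.0 ** (e - man_bits)           # grid spacing near x
    q = round(x / step) * step
    return max(-max_val, min(max_val, q))  # saturate instead of overflowing


print(quantize_fp(0.35, 4, 3))   # 3 mantissa bits: finer grid
print(quantize_fp(0.35, 5, 2))   # 2 mantissa bits: coarser grid
```

With fewer mantissa bits the grid near a given weight is coarser, which fits the nearly unchanged accuracy of both FP8 variants in the table only if the weights tolerate that rounding; the sketch itself makes no claim about accuracy.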
SQNR for various quantization models.
| Proposed Model | SQNR (dB) |
|---|---|
| FP8 (1, 4, 3) | 30.98 |
| FP8 (1, 5, 2) | 25.58 |
| Ternary quant. (1/16) | 5.90 |
| Binary quant. | −31.483 |
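The signal-to-quantization-noise ratio (SQNR) compares the power of the original weights with the power of the quantization error, in decibels. The sketch below uses the standard definition together with simple binary and ternary quantizers. The quantizer parameters (`level`, `threshold`) and the example weights are hypothetical, and the meaning of "(1/16)" in the table is not stated in this excerpt, so it is not modelled here.

```python
import math


def sqnr_db(weights, quantized):
    """10 * log10(signal power / quantization-noise power)."""
    signal = sum(w * w for w in weights)
    noise = sum((w - q) ** 2 for w, q in zip(weights, quantized))
    return 10.0 * math.log10(signal / noise)


def binary_quantize(weights, level=1.0):
    """Map each weight to +level or -level according to its sign."""
    return [level if w >= 0 else -level for w in weights]


def ternary_quantize(weights, threshold, level):
    """Map each weight to -level, 0 or +level using a symmetric threshold."""
    return [
        level if w > threshold else -level if w < -threshold else 0.0
        for w in weights
    ]


# Hypothetical small weights, typical in magnitude for a trained layer.
weights = [0.03, -0.05, 0.01, -0.02, 0.08]
binary_sqnr = sqnr_db(weights, binary_quantize(weights))
ternary_sqnr = sqnr_db(weights, ternary_quantize(weights, 0.025, 0.05))
print(binary_sqnr, ternary_sqnr)
```

A negative SQNR means the error power exceeds the weight power. In this sketch that happens when small weights are mapped to ±1; whether the same mechanism explains the negative binary value in the table is not confirmed by this excerpt.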
Weighted precision of the proposed full-precision and quantized CNN models.
| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| FP8 (1, 4, 3) | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| FP8 (1, 5, 2) | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| Ternary quant. | 0.97 | 0.88 | 0.87 | 0.82 | 0.87 |
| Binary quant. | 0.95 | 0.86 | 0.84 | 0.72 | 0.87 |
Weighted recall of the proposed full-precision and quantized CNN models.
| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| FP8 (1, 4, 3) | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| FP8 (1, 5, 2) | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| Ternary quant. | 0.97 | 0.86 | 0.83 | 0.77 | 0.85 |
| Binary quant. | 0.94 | 0.84 | 0.77 | 0.69 | 0.83 |
Weighted F1 score of the proposed full-precision and quantized CNN models.
| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| FP8 (1, 4, 3) | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| FP8 (1, 5, 2) | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| Ternary quant. | 0.97 | 0.86 | 0.83 | 0.75 | 0.85 |
| Binary quant. | 0.94 | 0.84 | 0.77 | 0.67 | 0.83 |
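The weighted precision, recall and F1 score in the tables above are commonly defined as per-class scores averaged with weights proportional to the number of true samples of each class. The sketch below implements that usual definition in plain Python. Whether the paper computed its metrics exactly this way is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all true classes."""
    support = Counter(y_true)              # true samples per class
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n                 # class share of the test set
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy example with two speakers "a" and "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(weighted_scores(y_true, y_pred))
```

Under this definition the weighted recall equals the overall accuracy, which is consistent with the tables: the weighted recall values match the classification accuracies after rounding to two decimals.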
The CNN model from [17].
| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 32, kernel size = (4, 4), input shape (128, 170, 1) | 544 |
| MaxPooling2D | Pool size = (4, 4), strides = (2, 2) | |
| Convolution2D | Filters = 64, kernel size = (4, 4) | 32,832 |
| MaxPooling2D | Pool size = (4, 4), strides = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 230 | 15,662,310 |
| Dropout | Rate = 0.5 | |
| Dense_2 | Nodes = 115 | 26,565 |
| Dense_3 | Nodes = 23 | 2668 |
| Total number of parameters | | 15,724,919 |
Classification accuracy (%) of the CNN model from [17] for each emotion: full-precision and additionally quantized scenarios.
| CNN Model from [17] | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 98.83 | 78.75 | 83.94 | 78.43 | 84.70 |
| FP8 (1, 4, 3) | 98.98 | 78.75 | 84.28 | 78.58 | 84.70 |
| FP8 (1, 5, 2) | 98.83 | 78.75 | 84.11 | 77.95 | 84.87 |
| Ternary quant. | 98.83 | 82.65 | 85.49 | 75.59 | 86.35 |
| Binary quant. | 95.47 | 76.41 | 81.17 | 72.91 | 83.39 |
SQNR for various quantization models applied to the model from [17].
| CNN from [17] | SQNR (dB) |
|---|---|
| FP8 (1, 4, 3) | 30.94 |
| FP8 (1, 5, 2) | 25.54 |
| Ternary quant. (1/16) | 5.219 |
| Binary quant. | −33.94 |
Figure 3. Classification accuracy gain over the model from [17].
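The figure itself is not reproduced here, but a gain can be derived from the two full-precision accuracy rows in the tables above. The sketch assumes that "gain" means the difference in percentage points between the proposed model and the model from [17]; the figure may define it differently.

```python
emotions = ["Neutral", "Anger", "Fear", "Sadness", "Joy"]

# Full-precision classification accuracy (%) copied from the tables above.
proposed = [99.12, 86.16, 84.46, 79.84, 85.69]
model_17 = [98.83, 78.75, 83.94, 78.43, 84.70]

# Assumed definition: gain = proposed accuracy minus reference accuracy,
# in percentage points, rounded to two decimals.
gain = {
    emotion: round(p - r, 2)
    for emotion, p, r in zip(emotions, proposed, model_17)
}

for emotion, value in gain.items():
    print(f"{emotion}: {value:+.2f} percentage points")
```

Under this definition the largest difference between the two full-precision models is for angry speech.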
The VGGish-based architecture.
| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 64, kernel size = (3, 3), strides = (1, 1), | 640 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 128, kernel size = (3, 3), strides = (1, 1) | 73,856 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 256, kernel size = (3, 3), strides = (1, 1) | 295,168 |
| Convolution2D | Filters = 256, kernel size = (3, 3), strides = (1, 1) | 590,080 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 512, kernel size = (3, 3), strides = (1, 1) | 1,180,160 |
| Convolution2D | Filters = 512, kernel size = (3, 3), strides = (1, 1) | 2,359,808 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 4096 | 184,553,472 |
| Dense_2 | Nodes = 4096 | 16,781,312 |
| Dense_3 | Nodes = 23 | 94,231 |
| Total number of parameters | | 205,928,727 |
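The three parameter totals reported in the tables can be turned into rough weight-storage estimates. The sketch below assumes 32 bits per weight for full precision and 2 bits per ternary weight, and it ignores any storage overhead; these bit widths are assumptions, not figures from the paper. Only the parameter counts come from the tables.

```python
# Total parameter counts taken from the three architecture tables.
params = {
    "Proposed CNN": 4_994_039,
    "CNN from [17]": 15_724_919,
    "VGGish-based": 205_928_727,
}

# Bits per weight; the 32-bit and 2-bit entries are assumptions.
bits_per_weight = {
    "full precision (32-bit, assumed)": 32,
    "FP8": 8,
    "ternary (2-bit, assumed)": 2,
    "binary": 1,
}


def size_mib(n_params, bits):
    """Weight storage in MiB for n_params weights of the given bit width."""
    return n_params * bits / 8 / 2**20


for model, n in params.items():
    sizes = ", ".join(
        f"{name}: {size_mib(n, bits):.2f} MiB"
        for name, bits in bits_per_weight.items()
    )
    print(f"{model}: {sizes}")

# Relative size of the other models compared with the proposed CNN.
ratio_17 = params["CNN from [17]"] / params["Proposed CNN"]
ratio_vggish = params["VGGish-based"] / params["Proposed CNN"]
print(round(ratio_17, 1), round(ratio_vggish, 1))
```

The ratios depend only on the parameter counts, so they hold regardless of the assumed bit widths.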
Classification accuracy (%) of the full-precision VGGish-based architecture for each emotion.
| VGGish-Based Architecture | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 98.83 | 79.93 | 85.84 | 79.21 | 85.03 |
Figure 4. Classification accuracy gain over the various models: SVM, kNN and MLP [10]; VGGish-based architecture [37]; CNN model from [17].