Shahin Amiriparian, Tobias Hübner, Vincent Karas, Maurice Gerczuk, Sandra Ottl, Björn W. Schuller.
Abstract
Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilize them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image Convolutional Neural Networks (CNNs). The framework creates and augments Mel spectrogram plots on the fly from raw audio signals, which are then used to fine-tune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real-time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone. DeepSpectrumLite operates decentrally, eliminating the need for data upload for further processing. We demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing by obtaining state-of-the-art results on a set of paralinguistic and general audio tasks, including speech and music emotion recognition, social signal processing, COVID-19 cough and COVID-19 speech analysis, and snore sound classification. We provide an extensive command-line interface for users and developers which is comprehensively documented and publicly available at https://github.com/DeepSpectrum/DeepSpectrumLite.
Keywords: audio processing; computational paralinguistics; deep spectrum; embedded devices; transfer learning
Year: 2022 PMID: 35372830 PMCID: PMC8969434 DOI: 10.3389/frai.2022.856232
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. A general overview of a DeepSpectrumLite model deployed on a target device for inference. Raw audio (from the device's microphone) is first converted to a spectrogram representation and the values are mapped to the red-green-blue (RGB) color space according to a certain color mapping definition. These spectrogram plots are then forwarded through the TFLite version of a trained CNN model, and an MLP classifier head generates predictions for the task at hand.
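As an illustration of this pipeline, the following sketch converts raw audio into an RGB Mel spectrogram image and classifies it with a TFLite interpreter. It is an approximation under stated assumptions, not the DeepSpectrumLite implementation: librosa and matplotlib stand in for the framework's own plotting code, and the model file name, color map, number of Mel bands, and input resolution are illustrative placeholders.

```python
# Illustrative sketch of the Figure 1 inference pipeline (not the
# DeepSpectrumLite API): raw audio -> Mel spectrogram -> RGB image -> TFLite CNN.
import numpy as np
import librosa
import tensorflow as tf
from matplotlib import cm

def audio_to_rgb_spectrogram(wav_path, sr=16000, n_mels=128, size=224):
    """Convert a raw audio file into an RGB Mel spectrogram image."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Normalize to [0, 1] and map to RGB via a color map (viridis is an assumption).
    mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = cm.get_cmap("viridis")(mel_norm)[..., :3].astype(np.float32)
    # Resize to the CNN's expected input resolution (224x224 assumed here).
    return tf.image.resize(rgb, (size, size)).numpy()

def tflite_predict(rgb_image, model_path="model.tflite"):
    """Run one forward pass through a converted TFLite CNN classifier."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], rgb_image[np.newaxis, ...])
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])
```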
Statistics of the databases utilized in our experiments in terms of number of samples (#), number of speakers (Sp.), number of classes (C.), total duration (Dur.) in minutes, and mean and standard deviation of the duration (Mean & Std dur.) in seconds.
| Dataset | # | C. | Sp. | Dur. [min] | Mean dur. [s] | Std dur. [s] |
|---|---|---|---|---|---|---|
|  | 725 | 2 | 397 | 97.8 | 6.34 | 2.2679 |
|  | 893 | 2 | 366 | 194.4 | 13.16 | 5.4784 |
|  | 9,365 | 7 | 65 | 445.6 | 2.86 | 1.2564 |
|  | 914 | 3 | 21 | 58.7 | 3.86 | 1.6418 |
|  | 5,531 | 4 | 10 | 746.2 | 4.46 | 3.0645 |
|  | 828 | 4 | 219 | 20.8 | 1.51 | 0.3464 |
|  | 1,012 | 6 | 23 | 78.4 | 4.65 | 0.4213 |
|  | 16,462 | 2 | 915 | 1,059.8 | 3.86 | 0.6399 |
This table shows the configuration of the different hyperparameters for each of the datasets used in our experiments.
| Hyperparameter |  |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|
| Classifier units | 512 | 700 | 512 | 512 | 512 | 512 | 128 | 512 |
| Dropout rate | 0.25 | 0.4 | 0.25 | 0.25 | 0.25 | 0.25 | 0.5 | 0.25 |
| Initial learning rate | 0.001 | 0.01 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| Epochs of first phase | 40 | 20 | 40 | 40 | 40 | 40 | 40 | 40 |
| Epochs of second phase | 200 | – | 120 | 200 | 200 | 120 | 80 | 80 |
| Fine-tuned layers | 298 | 0 | 128 | 298 | 128 | 85 | 42 | 128 |
| Audio chunk length [s] | 3.0 | – | 3.0 | 3.0 | 4.0 | 1.0 | 4.0 | 4.0 |
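The "first phase"/"second phase" rows reflect a two-stage transfer-learning scheme: the randomly initialized classifier head is first trained on top of a frozen, ImageNet pre-trained CNN, after which a given number of the topmost CNN layers is unfrozen and fine-tuned jointly with the head. The Keras sketch below illustrates this scheme; it is not the DeepSpectrumLite training code, the optimizer choice and reduced second-phase learning rate are assumptions, and the concrete numbers are example values taken from the table above.

```python
# Minimal two-phase fine-tuning sketch with Keras (illustrative only).
import tensorflow as tf

def build_model(num_classes, units=512, dropout=0.25, input_shape=(224, 224, 3)):
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    base.trainable = False  # phase 1: train only the MLP classifier head
    x = tf.keras.layers.Dropout(dropout)(base.output)
    x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(base.input, out), base

model, base = build_model(num_classes=2)  # binary task, e.g. a 2-class dataset

# Phase 1: train the head with the initial learning rate from the table.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=dev_ds, epochs=40)

# Phase 2: unfreeze the topmost CNN layers and fine-tune them together with
# the head; the lower learning rate here is an assumption, not from the paper.
fine_tuned_layers = 128  # example value from the table
base.trainable = True
for layer in base.layers[:-fine_tuned_layers]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=dev_ds, epochs=200)
```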
Results of the transfer learning experiments with DeepSpectrumLite (DS Lite) on three of the ComParE 2021 Challenge tasks (CCS, CSS, and ESS) and IEMOCAP compared against Deep Spectrum feature extraction + Support Vector Machine (SVM).
| Model | CCS Dev | CCS Test | CCS CI | CSS Dev | CSS Test | CSS CI | ESS Dev | ESS Test | ESS CI | IEMOCAP Dev | IEMOCAP Test | IEMOCAP CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 63.3 | 64.1 | 55.7–72.8 | 56.0 | 60.4 | 55.9–64.9 | 64.2 | 56.4 | 51.5–61.3 | 53.0 | 56.3 | 54.2–58.2 |
|  | 56.5 | 71.1 | 62.2–79.5 | 61.6 | 61.2 | 55.1–66.8 | 43.1 | 60.0 | 54.3–66.1 | 55.1 | 56.4 | 55.2–63.9 |
|  | 57.1 | 71.4 | 62.5–79.4 | 62.2 | 62.3 | 56.4–68.4 | 43.1 | 59.9 | 54.1–65.7 | 55.5 |  | 55.6–64.0 |
|  | 58.4 | 72.7 | 63.9–80.9 | 60.7 | 63.6 | 58.1–69.0 | 48.1 | 61.3 | 55.4–67.2 | 55.2 | 59.3 | 54.9–63.5 |
|  | 59.0 |  | 66.3–82.4 | 60.7 |  | 55.1–66.8 | 47.2 |  | 55.8–67.3 | 53.9 | 59.2 | 54.9–63.5 |
For the ComParE tasks, we evaluate against the official .
Results of the transfer learning experiments with DeepSpectrumLite (DS Lite) on four datasets: the ComParE 2018 Snore Sub-Challenge task (Snore), the ComParE 2019 Continuous Sleepiness Sub-Challenge (SLEEP), RAVDESS emotional song, and DEMoS.
| Model | Snore Dev | Snore Test | Snore CI | SLEEP Dev | SLEEP Test | SLEEP CI | RAVDESS Dev | RAVDESS Test | RAVDESS CI | DEMoS Dev | DEMoS Test | DEMoS CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 33.5 | 39.4 | 33.4–45.7 | 66.0 | 65.7 | 62.9–68.5 | 84.7 | 76.6 | 72.7–84.5 | 59.1 | 58.5 | 55.8–61.4 |
|  | 43.7 | 50.0 | 43.7–55.6 | 71.7 |  | 66.5–71.6 | 84.0 | 78.6 | 72.6–84.1 | 65.0 | 62.1 | 59.1–65.1 |
|  | 44.4 | 52.0 | 46.3–57.9 | 66.3 | 66.8 | 65.1–68.7 | 77.4 |  | 74.6–87.1 | 69.5 | 69.7 | 67.1–72.3 |
|  | 39.2 |  | 49.4–58.4 | 67.6 | 67.6 | 64.9–70.4 | 71.8 | 75.0 | 68.9–80.5 | 73.8 |  | 70.1–75.0 |
For the Sleep task, we discretize the label into two classes, while for the DEMoS task, we recreate the dataset partitioning used in Baird et al.
This table shows the mean execution time, the number of parameters, the mean requested memory, and the model size of our preprocessing (prepr.), the DenseNet121 TensorFlow (TF) model, and the TF Lite model.
|  | Prepr. | TF model | TF Lite model |
|---|---|---|---|
| Mean time [ms] | 7.1 | 240.0 | 89.7/242.0 |
| FLOPs | – | 3.1 G | – |
| Parameters | – | 7.6 M | 7.6 M |
| Mean memory [MB] | 4.5 | 116.5 | 185.4/292.8 |
| Model size | 150.0 kB | 82.1 MB | 30.0 MB |
In the TF Lite Model column, the values before the slash are from the test on the CPU system, whereas the values after the slash are from the on-device test. Details regarding the test setup are described in the text. FLOPs, floating point operations.
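The reduction from the 82.1 MB TF model to the 30.0 MB TF Lite model is the kind of effect a standard TensorFlow Lite conversion produces. The sketch below shows such a conversion plus a simple CPU latency measurement; it is illustrative only and not the benchmark script behind the table, and the conversion options, input shape, and number of timing runs are assumptions.

```python
# Hedged sketch: convert a trained Keras model to TFLite and time CPU inference
# (not the exact benchmark setup used for the table above).
import time
import numpy as np
import tensorflow as tf

def convert_to_tflite(keras_model, out_path="model.tflite"):
    """Convert a Keras model to a TFLite flatbuffer and write it to disk."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # assumed optimization setting
    with open(out_path, "wb") as f:
        f.write(converter.convert())
    return out_path

def mean_inference_time_ms(model_path, input_shape=(1, 224, 224, 3), runs=100):
    """Average single-sample inference latency of a TFLite model in milliseconds."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.rand(*input_shape).astype(np.float32)
    times = []
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        start = time.perf_counter()
        interpreter.invoke()
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(times))
```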
Figure 2. Visualization of the SHapley Additive exPlanations (SHAP) outputs for sample Mel spectrograms from negative and positive classes of the CSS and CCS datasets. The x-axis of the spectrograms represents time [0–5 s] and the y-axis represents the frequency [0–4,096 Hz]. The range of the SHAP values for each model's output is given in the color bar below each image. Areas that increase the probability of the class are colored in red, and blue areas decrease the probability. A detailed account of the spectrogram and SHAP values analysis is given in Section 3.5. (A) SHAP outputs from the COVID-19 speech (CSS) model. (B) SHAP outputs from the COVID-19 cough (CCS) model.
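Attribution maps of this kind can be produced with the shap library's gradient explainer applied to a trained Keras CNN. The minimal sketch below assumes a trained `model`, a background batch `background`, and test spectrogram images `x_test`, all of which are placeholder names; it is not necessarily the exact explanation setup used for Figure 2.

```python
# Minimal sketch of SHAP attributions for a spectrogram CNN (illustrative).
import shap

# `model` is the trained Keras CNN; `background` is a small batch of training
# spectrogram images used as the reference distribution, and `x_test` holds the
# images to be explained (both arrays of shape [N, H, W, 3]).
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(x_test)

# Overlay per-pixel attributions on the input spectrograms, one column per
# class output (red: increases class probability, blue: decreases it).
shap.image_plot(shap_values, x_test)
```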
Figure 3. Confusion matrices for the test results obtained by the best CSS (A) and CCS (B) models. Each cell gives the absolute number of cases together with the percentage of samples of the row's class that were classified as the respective column's class. The percentage values are also indicated by a color scale: the darker the cell, the higher the value. A detailed account of the confusion matrices analysis is given in Section 3.5.