| Literature DB >> 34954610 |
Madhurananda Pahar, Marisa Klopper, Robin Warren, Thomas Niesler.
Abstract
We present an experimental investigation into the effectiveness of transfer learning and bottleneck feature extraction in detecting COVID-19 from audio recordings of cough, breath and speech. This type of screening is non-contact, does not require specialist medical expertise or laboratory facilities, and can be deployed on inexpensive consumer hardware such as a smartphone. We use datasets that contain cough, sneeze, speech and other noises, but do not contain COVID-19 labels, to pre-train three deep neural networks: a CNN, an LSTM and a Resnet50. These pre-trained networks are subsequently either fine-tuned using smaller datasets of coughing with COVID-19 labels in the process of transfer learning, or are used as bottleneck feature extractors. Results show that a Resnet50 classifier trained by this transfer learning process delivers optimal or near-optimal performance across all datasets, achieving areas under the receiver operating characteristic curve (ROC AUC) of 0.98, 0.94 and 0.92 respectively for the three sound classes: coughs, breaths and speech. This indicates that coughs carry the strongest COVID-19 signature, followed by breath and speech. Our results also show that applying transfer learning and extracting bottleneck features using the larger datasets without COVID-19 labels led not only to improved performance, but also to a marked reduction in the standard deviation of the classifier AUCs measured over the outer folds during nested cross-validation, indicating better generalisation. We conclude that deep transfer learning and bottleneck feature extraction can improve COVID-19 cough, breath and speech audio classification, yielding automatic COVID-19 detection with a better and more consistent overall performance.
Keywords: Bottleneck features; Breath; COVID-19; Cough; Speech; Transfer learning
Year: 2021 PMID: 34954610 PMCID: PMC8679499 DOI: 10.1016/j.compbiomed.2021.105153
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 6.698
Summary of the datasets used in pre-training. Classifiers are pre-trained on 10.29 h of audio recordings annotated with four class labels: cough, sneeze, speech and noise. These datasets do not include any COVID-19 labels.
| Type | Dataset | Sampling Rate | No of Events | Total audio | Average length | Standard deviation |
|---|---|---|---|---|---|---|
| Cough | TASK dataset | 44.1 kHz | 6000 | 91 min | 0.91 s | 0.25 s |
| | Brooklyn dataset | 44.1 kHz | 746 | 6.29 min | 0.51 s | 0.21 s |
| | Wallacedene dataset | 44.1 kHz | 1358 | 17.42 min | 0.77 s | 0.31 s |
| | Google Audio Set & Freesound | 16 kHz | 3098 | 32.01 min | 0.62 s | 0.23 s |
| | Total (Cough) | — | 11 202 | 2.45 h | 0.79 s | 0.23 s |
| Sneeze | Google Audio Set & Freesound | 16 kHz | 1013 | 13.34 min | 0.79 s | 0.21 s |
| | Google Audio Set & Freesound + SMOTE | 16 kHz | 9750 | 2.14 h | 0.79 s | 0.23 s |
| | Total (Sneeze) | — | 10 763 | 2.14 h | 0.79 s | 0.23 s |
| Speech | Google Audio Set & Freesound | 16 kHz | 2326 | 22.48 min | 0.58 s | 0.14 s |
| | LibriSpeech | 16 kHz | 56 | 2.54 h | 2.72 min | 0.91 min |
| | Total (Speech) | — | 2382 | 2.91 h | 4.39 s | 0.42 s |
| Noise | TASK dataset | 44.1 kHz | 12 714 | 2.79 h | 0.79 s | 0.23 s |
| | Google Audio Set & Freesound | 16 kHz | 1027 | 11.13 min | 0.65 s | 0.26 s |
| | Total (Noise) | — | 13 741 | 2.79 h | 0.79 s | 0.23 s |
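The sneeze class above was grown from 1013 to 10 763 events by adding SMOTE-generated samples. The core of SMOTE is interpolation between a minority-class feature vector and one of its nearest neighbours; the sketch below illustrates that idea in numpy (the function name, the neighbour count and the use of generic feature vectors are our own choices, not the paper's implementation):

```python
import numpy as np

def smote_oversample(X, n_new, k=5, seed=None):
    """Minimal SMOTE-style oversampling sketch: each synthetic sample is a
    random interpolation between a real minority-class sample and one of
    its k nearest neighbours. Illustrative only -- not the reference
    SMOTE implementation used for the sneeze recordings."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbours
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))                     # pick a real sample
        j = nn[i, rng.integers(min(k, len(X) - 1))]  # pick one neighbour
        lam = rng.random()                           # interpolation factor
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(out)
```

Because each synthetic point lies on a line segment between two real minority samples, the oversampled class keeps its original average event length, which is consistent with the near-identical statistics of the two sneeze rows in the table.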
Fig. 1: Pre-processed breath signals from COVID-19 positive and COVID-19 negative subjects in the Coswara dataset. Breaths corresponding to inhalation are marked by arrows and are each followed by an exhalation.
Fig. 2: Pre-processed speech (counting from 1 to 20 at a normal pace) from COVID-19 positive and COVID-19 negative subjects in the Coswara dataset. In contrast to breath (Fig. 1), the spectral energy of this speech is concentrated below 1 kHz.
Summary of the datasets used for COVID-19 classification. Cough, breath and speech signals were extracted from the Coswara, ComParE and Sarcos datasets. COVID-19 positive subjects are under-represented in all three.
| Type | Dataset | Sampling Rate | Label | Subjects | Total audio | Average per subject | Standard deviation |
|---|---|---|---|---|---|---|---|
| Cough | Coswara | 44.1 kHz | COVID-19 Positive | 92 | 4.24 min | 2.77 s | 1.62 s |
| | | | Healthy | 1079 | 0.98 h | 3.26 s | 1.66 s |
| | | | Total | 1171 | 1.05 h | 3.22 s | 1.67 s |
| | ComParE | 16 kHz | COVID-19 Positive | 119 | 13.43 min | 6.77 s | 2.11 s |
| | | | Healthy | 398 | 40.89 min | 6.16 s | 2.26 s |
| | | | Total | 517 | 54.32 min | 6.31 s | 2.24 s |
| | Sarcos | 44.1 kHz | COVID-19 Positive | 18 | 0.87 min | 2.91 s | 2.23 s |
| | | | COVID-19 Negative | 26 | 1.57 min | 3.63 s | 2.75 s |
| | | | Total | 44 | 2.45 min | 3.34 s | 2.53 s |
| Breath | Coswara | 44.1 kHz | COVID-19 Positive | 88 | 8.58 min | 5.85 s | 5.05 s |
| | | | Healthy | 1062 | 2.77 h | 9.39 s | 5.23 s |
| | | | Total | 1150 | 2.92 h | 9.126 s | 5.29 s |
| Speech | Coswara (normal) | 44.1 kHz | COVID-19 Positive | 88 | 12.42 min | 8.47 s | 4.27 s |
| | | | Healthy | 1077 | 2.99 h | 9.99 s | 3.09 s |
| | | | Total | 1165 | 3.19 h | 9.88 s | 3.22 s |
| | Coswara (fast) | 44.1 kHz | COVID-19 Positive | 85 | 7.62 min | 5.38 s | 2.76 s |
| | | | Healthy | 1074 | 1.91 h | 6.39 s | 1.77 s |
| | | | Total | 1159 | 2.03 h | 6.31 s | 1.88 s |
| | ComParE | 16 kHz | COVID-19 Positive | 214 | 44.02 min | 12.34 s | 5.35 s |
| | | | Healthy | 396 | 1.46 h | 13.25 s | 4.67 s |
| | | | Total | 610 | 2.19 h | 12.93 s | 4.93 s |
Fig. 3: Feature extraction process for a breath recording. The frame overlap δ is calculated so that the entire recording is divided into a fixed number of segments. For MFCCs, for example, this yields a feature matrix whose dimensions do not depend on the length of the recording.
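The overlap rule described in Fig. 3 can be sketched in a few lines. The hop formula `(L - F) / (S - 1)` is our assumption, chosen so that a fixed number of equally spaced frames exactly spans the recording, consistent with the caption:

```python
import numpy as np

def split_fixed_segments(audio, frame_len, n_segments):
    """Slice a recording into a fixed number of equally spaced, possibly
    overlapping frames. The hop (the frame overlap delta in Fig. 3) is
    chosen so that the frames exactly span the whole recording: an assumed
    rule consistent with the caption, not taken verbatim from the paper."""
    L = len(audio)
    assert L >= frame_len and n_segments >= 2
    # spacing between frame starts; the last frame ends at the last sample
    hop = (L - frame_len) / (n_segments - 1)
    starts = np.round(np.arange(n_segments) * hop).astype(int)
    return np.stack([audio[s:s + frame_len] for s in starts])

# e.g. about 1 s of 44.1 kHz audio downsampled-like dummy signal:
frames = split_fixed_segments(np.arange(48000, dtype=float), 1024, 150)
print(frames.shape)  # (150, 1024)
```

With a frame length of 1024 samples and 150 segments (the pre-training values below), every recording maps to the same feature-matrix shape, which is what lets fixed-input networks consume variable-length audio.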
Primary feature (PF) extraction hyperparameters. We used between 13 and 65 MFCCs and between 40 and 200 linearly spaced filters to extract log energies.
| Hyperparameter | Description | Range |
|---|---|---|
| MFCCs | lower-order MFCCs to keep | 13 to 65 in steps of 13 |
| Linearly spaced filters | used to extract log energies | 40 to 200 in steps of 20 |
| Frame length | into which audio is segmented | powers of 2 |
| Segments | number of frames extracted from audio | multiples of 10 |
Hyperparameters of the pre-trained networks: Feature extraction hyperparameters were adopted from the optimal values in previous related work [20], while classifier hyperparameters were optimised on the pre-training data using cross-validation.
| FEATURE EXTRACTION HYPERPARAMETERS | ||
|---|---|---|
| Hyperparameters | Values | |
| MFCCs | 39 | |
| Frame length | 2¹⁰ = 1024 | |
| Segments | 150 | |
| CLASSIFIER HYPERPARAMETERS | ||
| Hyperparameters | Classifier | Values |
| Convolutional filters | CNN | 256 & 128 & 64 |
| Kernel size | CNN | 2 |
| Dropout rate | CNN, LSTM | 0.2 |
| Dense layer (for pre-training) | CNN, LSTM, Resnet50 | 512 & 64 & 4 |
| Dense layer (for fine-tuning) | CNN, LSTM, Resnet50 | 32 & 2 |
| LSTM units | LSTM | 512 & 256 & 128 |
| Learning rate | LSTM | 10⁻³ = 0.001 |
| Batch Size | CNN, LSTM, Resnet50 | 2⁷ = 128 |
| Epochs | CNN, LSTM, Resnet50 | 70 |
Fig. 4: CNN transfer learning architecture. Cross-validation on the pre-training data determined the optimal CNN architecture to have three convolutional layers with 256, 128 and 64 (2 × 2) kernels respectively, each followed by (2, 2) max-pooling. The convolutional layers were followed by two dense layers with 512 and 64 ReLU units respectively, and the network was terminated by a 4-dimensional softmax. To apply transfer learning, the final two layers were removed and replaced with a new dense layer and a terminating 2-dimensional softmax to account for the COVID-19 positive and negative classes. Only this newly added portion of the network was trained for classification on the data with COVID-19 labels. In addition, the outputs of the third-last layer (the 512-dimensional dense ReLU layer) of the pre-trained network were used as bottleneck features.
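The head-swap and bottleneck extraction described in Fig. 4 can be illustrated with a small numpy sketch. Here a fixed random ReLU projection stands in for the frozen pre-trained network (the paper uses a CNN, LSTM or Resnet50), and only the new 2-class head is trained; all data, shapes and the seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pre-trained network: a fixed random ReLU layer
# playing the role of the 512-dimensional dense layer whose outputs serve
# as bottleneck features. (Purely illustrative, not the paper's network.)
W_frozen = rng.standard_normal((20, 64)) / np.sqrt(20)

def bottleneck(X):
    return np.maximum(X @ W_frozen, 0.0)    # frozen weights: never updated

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy labelled data standing in for recordings with COVID-19 labels.
X = rng.standard_normal((200, 20))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y]

# New 2-class head (dense + softmax): the only trainable part,
# mirroring the head-swap in Fig. 4.
W_head = np.zeros((64, 2))

H = bottleneck(X)                   # bottleneck features, computed once
for _ in range(300):                # gradient descent on the head only
    P = softmax(H @ W_head)
    W_head -= 0.1 * H.T @ (P - Y) / len(X)

acc = (softmax(H @ W_head).argmax(axis=1) == y).mean()
```

In the paper the replaced head is a 32-unit dense layer plus a 2-way softmax (per the hyperparameter tables); a single softmax layer keeps this sketch short while preserving the key property that the pre-trained weights are never updated.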
Classifier hyperparameters, optimised using leave-p-out nested cross-validation.
| Hyperparameter | Classifier | Range |
|---|---|---|
| Regularisation strength | LR, SVM | powers of 10 |
| ℓ1 penalty ratio | LR | 0 to 1 in steps of 0.05 |
| ℓ2 penalty | LR, MLP | 0 to 1 in steps of 0.05 |
| Kernel coefficient | SVM | powers of 10 |
| No. of neighbours | KNN | 10 to 100 in steps of 10 |
| Leaf size | KNN | 5 to 30 in steps of 5 |
| No. of neurons | MLP | 10 to 100 in steps of 10 |
| No. of convolutional filters | CNN | 3 × powers of 2 |
| Kernel size | CNN | 2 and 3 |
| Dropout rate | CNN, LSTM | 0.1 to 0.5 in steps of 0.2 |
| Dense layer size | CNN, LSTM | powers of 2 |
| LSTM units | LSTM | powers of 2 |
| Learning rate | LSTM, MLP | powers of 10 |
| Batch size | CNN, LSTM | powers of 2 |
| Epochs | CNN, LSTM | 10 to 250 in steps of 20 |
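The hyperparameters above are selected inside nested cross-validation, and the σ values reported in the performance tables below are standard deviations over the outer folds. A skeleton of that protocol in plain Python (a k-fold stand-in for the paper's leave-p-out scheme; `train_eval` and all names are illustrative):

```python
import random

def nested_cv(n, param_grid, train_eval, k_outer=5, k_inner=4, seed=0):
    """Nested cross-validation skeleton (a k-fold stand-in for the paper's
    leave-p-out scheme). Hyperparameters are selected on inner folds only;
    each outer test fold is scored exactly once. `train_eval(params,
    train_idx, test_idx)` is a caller-supplied routine returning a score
    such as an AUC. Returns the outer-fold mean score and its standard
    deviation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    outer = [idx[i::k_outer] for i in range(k_outer)]

    def inner_score(params, dev_idx):
        inner = [dev_idx[i::k_inner] for i in range(k_inner)]
        return sum(
            train_eval(params,
                       [j for f in range(k_inner) if f != v for j in inner[f]],
                       inner[v])
            for v in range(k_inner)) / k_inner

    scores = []
    for o in range(k_outer):
        dev_idx = [j for f in range(k_outer) if f != o for j in outer[f]]
        # choose hyperparameters without ever touching the outer test fold
        best = max(param_grid, key=lambda p: inner_score(p, dev_idx))
        scores.append(train_eval(best, dev_idx, outer[o]))

    mean = sum(scores) / len(scores)
    sigma = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, sigma
```

The outer-fold σ returned here is the quantity the abstract highlights: transfer learning and bottleneck features reduce it, indicating more consistent generalisation.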
COVID-19 cough classification performance. For the Coswara, Sarcos and ComParE datasets the highest AUCs of 0.982, 0.961 and 0.944 respectively were achieved by a Resnet50 trained by transfer learning in the first two cases and a KNN classifier using 12 primary features determined by sequential forward selection (SFS) in the third. When Sarcos is used exclusively as a validation set for a classifier trained on the Coswara data, an AUC of 0.954 is achieved.
| Dataset | ID | Classifier | Best Feature Hyperparameters | Best Classifier Hyperparameters (Optimised inside nested cross-validation) | Spec | Sens | Acc | AUC | σ(AUC) |
|---|---|---|---|---|---|---|---|---|---|
| Coswara | C1 | Resnet50 + TL | | Default Resnet50 (Table 1 in […]) | — | — | — | 0.982 | — |
| | C2 | CNN + TL | ” | | 92% | 98% | 95% | 0.972 | 3 × 10⁻³ |
| | C3 | LSTM + TL | ” | ” | 93% | 95% | 94% | 0.964 | 3 × 10⁻³ |
| | C4 | MLP + BNF | ” | | 92% | 96% | 94% | 0.963 | 4 × 10⁻³ |
| | C5 | SVM + BNF | ” | | 89% | 93% | 91% | 0.942 | 3 × 10⁻³ |
| | C6 | KNN + BNF | ” | | 88% | 90% | 89% | 0.917 | 7 × 10⁻³ |
| | C7 | LR + BNF | ” | | 84% | 86% | 85% | 0.898 | 8 × 10⁻³ |
| | C8 | Resnet50 + PF […] | Table 4 in […] | Default Resnet50 (Table 1 in […]) | 98% | 93% | 95% | 0.976 | 18 × 10⁻³ |
| | C9 | CNN + PF […] | ” | Table 4 in […] | 99% | 90% | 95% | 0.953 | 39 × 10⁻³ |
| | C10 | LSTM + PF […] | ” | ” | 97% | 91% | 94% | 0.942 | 43 × 10⁻³ |
| Sarcos | C11 | Resnet50 + TL | | | — | — | — | 0.961 | — |
| | C12 | LSTM + TL | ” | | 92% | 92% | 92% | 0.943 | 3 × 10⁻³ |
| | C13 | CNN + TL | ” | ” | 89% | 91% | 90% | 0.917 | 4 × 10⁻³ |
| | C14 | MLP + BNF | ” | | 88% | 90% | 89% | 0.913 | 7 × 10⁻³ |
| | C15 | SVM + BNF | ” | | 88% | 89% | 89% | 0.904 | 6 × 10⁻³ |
| | C16 | KNN + BNF | ” | | 85% | 87% | 86% | 0.883 | 8 × 10⁻³ |
| | C17 | LR + BNF | ” | | 83% | 86% | 85% | 0.867 | 9 × 10⁻³ |
| Sarcos (val only) | C18 | Resnet50 + TL (trained on Coswara) | ” | | — | — | — | 0.954 | – |
| | C19 | LSTM + PF […] | Table 5 in […] | Table 5 in […] | 73% | 75% | 74% | 0.779 | – |
| | C20 | LSTM + PF + SFS […] | ” | ” | 96% | 91% | 93% | 0.938 | – |
| ComParE | C21 | Resnet50 + TL | | Default Resnet50 (Table 1 in […]) | 89% | 93% | 91% | 0.934 | 4 × 10⁻³ |
| | C22 | LSTM + TL | ” | | 88% | 92% | 90% | 0.916 | 4 × 10⁻³ |
| | C23 | CNN + TL | ” | ” | 86% | 90% | 88% | 0.898 | 4 × 10⁻³ |
| | C24 | MLP + BNF | ” | | 85% | 90% | 88% | 0.912 | 5 × 10⁻³ |
| | C25 | SVM + BNF | ” | | 85% | 90% | 88% | 0.903 | 6 × 10⁻³ |
| | C26 | KNN + BNF | ” | | 85% | 86% | 86% | 0.882 | 8 × 10⁻³ |
| | C27 | LR + BNF | ” | | 84% | 86% | 85% | 0.863 | 8 × 10⁻³ |
| | C28 | KNN + PF + SFS | | | — | — | — | 0.944 | — |
| | C29 | KNN + PF | | | 78% | 80% | 80% | 0.855 | 13 × 10⁻³ |
| | C30 | MLP + PF | | | 76% | 80% | 78% | 0.839 | 14 × 10⁻³ |
| | C31 | SVM + PF | | | 75% | 78% | 77% | 0.814 | 12 × 10⁻³ |
| | C32 | LR + PF | | | 69% | 73% | 71% | 0.789 | 13 × 10⁻³ |
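The Spec, Sens and Acc columns above follow the standard screening definitions, computed from confusion-matrix counts with COVID-19 positive as the positive class. A quick self-contained sketch (the counts are invented for illustration):

```python
def screening_metrics(tp, fn, tn, fp):
    """Specificity, sensitivity and accuracy from confusion counts,
    with COVID-19 positive treated as the positive class."""
    sens = tp / (tp + fn)                 # true positive rate
    spec = tn / (tn + fp)                 # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)
    return spec, sens, acc

# e.g. 98 of 100 positives and 92 of 100 negatives correctly classified:
print(screening_metrics(tp=98, fn=2, tn=92, fp=8))  # (0.92, 0.98, 0.95)
```

For a screening tool, sensitivity (not missing infected subjects) is usually weighted more heavily than specificity, which is why the tables report both rather than accuracy alone.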
COVID-19 breath classifier performance: For breaths, the best performance was achieved by an SVM using bottleneck features (AUC = 0.942). The Resnet50 classifier trained by transfer learning achieves a similar AUC of 0.934.
| Dataset | ID | Classifier | Best Feature Hyperparameters | Best Classifier Hyperparameters (Optimised inside nested cross-validation) | Spec | Sens | Acc | AUC | σ(AUC) |
|---|---|---|---|---|---|---|---|---|---|
| Coswara | B1 | Resnet50 + TL | | Default Resnet50 (Table 1 in […]) | 87% | 93% | 90% | 0.934 | 3 × 10⁻³ |
| | B2 | LSTM + TL | ” | | 86% | 90% | 88% | 0.927 | 3 × 10⁻³ |
| | B3 | CNN + TL | ” | ” | 85% | 89% | 87% | 0.914 | 3 × 10⁻³ |
| | B4 | SVM + BNF | ” | | — | — | — | 0.942 | — |
| | B5 | MLP + BNF | ” | | 87% | 93% | 90% | 0.923 | 6 × 10⁻³ |
| | B6 | KNN + BNF | ” | | 87% | 93% | 90% | 0.922 | 9 × 10⁻³ |
| | B7 | LR + BNF | ” | | 86% | 90% | 88% | 0.891 | 8 × 10⁻³ |
| | B8 | Resnet50 + PF | | Default Resnet50 (Table 1 in […]) | 92% | 90% | 91% | 0.923 | 34 × 10⁻³ |
| | B9 | LSTM + PF | | | 90% | 86% | 88% | 0.917 | 41 × 10⁻³ |
| | B10 | CNN + PF | | | 87% | 85% | 86% | 0.898 | 42 × 10⁻³ |
COVID-19 speech classifier performance: For Coswara normal speech, Coswara fast speech and ComParE speech, the highest AUCs were 0.893, 0.861 and 0.923 respectively, achieved by a Resnet50 trained by transfer learning in the first two cases and by an SVM using bottleneck features in the third.
| Dataset | ID | Classifier | Best Feature Hyperparameters | Best Classifier Hyperparameters (Optimised inside nested cross-validation) | Spec | Sens | Acc | AUC | σ(AUC) |
|---|---|---|---|---|---|---|---|---|---|
| Coswara normal speech | S1 | Resnet50 + TL | | | — | — | — | 0.893 | — |
| | S2 | LSTM + TL | ” | | 88% | 82% | 85% | 0.877 | 4 × 10⁻³ |
| | S3 | CNN + TL | ” | ” | 88% | 81% | 85% | 0.875 | 4 × 10⁻³ |
| | S4 | MLP + BNF | ” | | 83% | 85% | 84% | 0.871 | 8 × 10⁻³ |
| | S5 | SVM + BNF | ” | | 83% | 85% | 84% | 0.867 | 7 × 10⁻³ |
| | S6 | KNN + BNF | ” | | 80% | 85% | 83% | 0.868 | 6 × 10⁻³ |
| | S7 | LR + BNF | ” | | 79% | 83% | 81% | 0.852 | 7 × 10⁻³ |
| | S8 | Resnet50 + PF | | Default Resnet50 (Table 1 in […]) | 84% | 80% | 82% | 0.864 | 51 × 10⁻³ |
| | S9 | LSTM + PF | | | 84% | 78% | 81% | 0.844 | 51 × 10⁻³ |
| | S10 | CNN + PF | | | 82% | 78% | 80% | 0.832 | 52 × 10⁻³ |
| Coswara fast speech | S11 | Resnet50 + TL | | | — | — | — | 0.861 | — |
| | S12 | LSTM + TL | ” | | 83% | 78% | 81% | 0.860 | 3 × 10⁻³ |
| | S13 | CNN + TL | ” | ” | 82% | 76% | 79% | 0.851 | 3 × 10⁻³ |
| | S14 | MLP + BNF | ” | | 78% | 83% | 81% | 0.858 | 7 × 10⁻³ |
| | S15 | SVM + BNF | ” | | 78% | 83% | 81% | 0.856 | 8 × 10⁻³ |
| | S16 | KNN + BNF | ” | | 77% | 83% | 81% | 0.854 | 8 × 10⁻³ |
| | S17 | LR + BNF | ” | | 77% | 82% | 80% | 0.841 | 11 × 10⁻³ |
| | S18 | LSTM + PF | | | 84% | 80% | 82% | 0.856 | 47 × 10⁻³ |
| | S19 | Resnet50 + PF | | Default Resnet50 (Table 1 in […]) | 82% | 78% | 80% | 0.822 | 45 × 10⁻³ |
| | S20 | CNN + PF | | | 79% | 77% | 78% | 0.810 | 41 × 10⁻³ |
| ComParE | S21 | Resnet50 + TL | | Default Resnet50 (Table 1 in […]) | 84% | 90% | 87% | 0.914 | 4 × 10⁻³ |
| | S22 | LSTM + TL | ” | | 82% | 88% | 85% | 0.897 | 5 × 10⁻³ |
| | S23 | CNN + TL | ” | ” | 80% | 88% | 84% | 0.892 | 5 × 10⁻³ |
| | S24 | SVM + BNF | ” | | — | — | — | 0.923 | — |
| | S25 | MLP + BNF | ” | | 80% | 88% | 84% | 0.905 | 6 × 10⁻³ |
| | S26 | KNN + BNF | ” | | 80% | 86% | 83% | 0.891 | 7 × 10⁻³ |
| | S27 | LR + BNF | ” | | 81% | 85% | 83% | 0.890 | 7 × 10⁻³ |
| | S28 | MLP + PF + SFS | | | 82% | 88% | 85% | 0.912 | 11 × 10⁻³ |
| | S29 | MLP + PF | | | 81% | 85% | 83% | 0.893 | 14 × 10⁻³ |
| | S30 | KNN + PF | | | 80% | 84% | 82% | 0.847 | 16 × 10⁻³ |
| | S31 | SVM + PF | | | 79% | 81% | 80% | 0.836 | 15 × 10⁻³ |
| | S32 | LR + PF | | | 69% | 72% | 71% | 0.776 | 18 × 10⁻³ |
Fig. 5: COVID-19 cough classification: A Resnet50 classifier trained by transfer learning achieved the highest AUCs in classifying COVID-19 coughs for the Coswara and Sarcos datasets (0.982 and 0.961 respectively). For the ComParE dataset, AUCs of 0.944 and 0.934 were achieved by a KNN classifier using 12 features identified by SFS and by a Resnet50 classifier trained by transfer learning respectively.
Fig. 6: COVID-19 breath classification: An SVM classifier using bottleneck features (BNF) achieved the highest AUC of 0.942 when classifying COVID-19 breath. Resnet50 classifiers with and without transfer learning achieved AUCs of 0.934 and 0.923 respectively, with a higher σ for the latter (Table 7).
Fig. 7: COVID-19 speech classification: An SVM classifier using bottleneck features (BNF) achieved the highest AUC of 0.923 when classifying COVID-19 speech in the ComParE dataset. A Resnet50 trained by transfer learning achieved a slightly lower AUC of 0.914. Normal and fast speech in the Coswara dataset classified COVID-19 with AUCs of 0.893 and 0.861 respectively, using a Resnet50 trained by transfer learning.
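All the comparisons in these figures rest on the area under the ROC curve, which equals the probability that a randomly chosen positive recording scores above a randomly chosen negative one (the Wilcoxon–Mann–Whitney statistic). A minimal sketch with made-up scores:

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the probability that a
    randomly chosen positive scores above a randomly chosen negative
    (ties count half). O(n*m) pairwise version, fine for small sets."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.1]))  # 8/9, roughly 0.889
```

Unlike accuracy, this statistic is independent of the decision threshold and of the class imbalance noted in the dataset tables, which is why AUC is the primary metric throughout.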