Conor Wall, Li Zhang, Yonghong Yu, Akshi Kumar, Rong Gao.
Abstract
Medical audio classification for lung abnormality diagnosis is a challenging problem owing to comparatively unstructured audio signals present in the respiratory sound clips. To tackle such challenges, we propose an ensemble model by incorporating diverse deep neural networks with attention mechanisms for undertaking lung abnormality and COVID-19 diagnosis using respiratory, speech, and coughing audio inputs. Specifically, four base deep networks are proposed, which include attention-based Convolutional Recurrent Neural Network (A-CRNN), attention-based bidirectional Long Short-Term Memory (A-BiLSTM), attention-based bidirectional Gated Recurrent Unit (A-BiGRU), as well as Convolutional Neural Network (CNN). A Particle Swarm Optimization (PSO) algorithm is used to optimize the training parameters of each network. An ensemble mechanism is used to integrate the outputs of these base networks by averaging the probability predictions of each class. Evaluated using respiratory ICBHI, Coswara breathing, speech, and cough datasets, as well as a combination of ICBHI and Coswara breathing databases, our ensemble model and base networks achieve ICBHI scores ranging from 0.920 to 0.9766. Most importantly, the empirical results indicate that a positive COVID-19 diagnosis can be distinguished to a high degree from other more common respiratory diseases using audio recordings, based on the combined ICBHI and Coswara breathing datasets.
Keywords: Convolutional Neural Network; Gated Recurrent Unit; Long Short-Term Memory; attention mechanism; audio lung abnormality classification; bidirectional Recurrent Neural Network; ensemble model
Year: 2022 PMID: 35898070 PMCID: PMC9332569 DOI: 10.3390/s22155566
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The proposed ensemble model comprising four base networks for diverse lung abnormality and COVID-19 diagnosis.
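The ensemble mechanism described in the abstract integrates the four base networks by averaging their per-class probability predictions. A minimal sketch of that averaging rule (the function name and toy probabilities below are illustrative, not taken from the paper):

```python
import numpy as np

def ensemble_average(prob_list):
    """Average the class-probability outputs of several base networks.

    prob_list: list of arrays of shape (n_samples, n_classes), one per
    base network (e.g. A-CRNN, A-BiLSTM, A-BiGRU, CNN).
    Returns the averaged probabilities and the predicted class indices.
    """
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return avg, np.argmax(avg, axis=1)

# Toy example with two base networks and three classes
p1 = np.array([[0.7, 0.2, 0.1]])
p2 = np.array([[0.4, 0.5, 0.1]])
avg, pred = ensemble_average([p1, p2])  # avg -> [[0.55, 0.35, 0.1]]
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident base network outweigh uncertain ones.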
A summary of existing studies on sound classification.
| Related Studies | Methodologies | Novel Strategies |
|---|---|---|
| Choi et al. [ | CRNN with four conv2d layers and two GRU layers for music classification using the Million Song dataset | - |
| Chen and Li [ | (1) CNN-BiLSTM and 1D DNN for audio emotion classification, and (2) 1D DNN for lyrics emotion classification, using the Million Song dataset. (3) A stacking ensemble used to combine emotion classification results from both audio and text inputs. | A stacking ensemble combining emotion classification results from audio and text inputs |
| Perna [ | 2D CNN | - |
| Perna and Tagarelli [ | LSTM to train and classify the respiratory ICBHI dataset on both pathology and anomaly levels. The pathology-driven classification includes two tasks, i.e., binary (healthy/unhealthy) and 3-class (healthy/chronic/non-chronic) classification. On the other hand, for anomaly-driven diagnosis, a 4-class prediction is performed to detect normal/wheeze/crackle/both crackle and wheeze conditions. | Using different sliding window settings for data preparation |
| Pahar et al. [ | Resnet50, LSTM, CNN, MLP, SVM, and LR for the classification of different lung abnormalities using the Coswara dataset and the SARS-CoV-2 South Africa (Sarcos) dataset. | - |
| Zhang et al. [ | CRNN with attention mechanisms for environmental sound classification using ESC-10 and ESC-50 datasets. | CRNN with attention mechanisms |
| Wall et al. [ | BiLSTM and BiGRU with attention mechanisms for respiratory and coughing sound classification | BiLSTM and BiGRU with attention mechanisms |
| Wall et al. [ | BiLSTM for 2-class (health/unhealthy) respiratory sound classification | - |
| Zhang et al. [ | An evolving ensemble of CRNNs for respiratory abnormality (healthy/chronic/non-chronic) classification, as well as heart sound and environmental sound classification. | Hyper-parameter fine-tuning using PSO (but for 3-class respiratory abnormality detection) |
| García-Ordás et al. [ | 2D CNN with two convolutional layers in combination with different data augmentation and oversampling techniques for respiratory abnormality classification | Adopting different oversampling techniques |
| Li et al. [ | 1D CNN with three convolutional layers for heart sound classification | - |
| Xiao et al. [ | 1D CNN with clique and transition blocks for heart sound classification | 1D CNN with clique and transition blocks |
| Boddapati et al. [ | AlexNet and GoogLeNet for environmental sound classification | - |
| Sait et al. [ | Transfer learning based on Inception-v3 combined with MLP for COVID-19 diagnosis using breathing and chest X-ray image inputs | Transfer learning based on Inception-v3 combined with MLP for multimodal COVID-19 diagnosis |
| Zhang et al. [ | 2D CNN combined with sound mix-up | Sound mix-up for model training |
| This research | An evolving ensemble of A-CRNN, A-BiLSTM, A-BiGRU, and 1D CNN, with PSO-based hyper-parameter optimization | (1) CRNN, BiLSTM, and BiGRU with attention mechanisms (i.e., A-CRNN, A-BiLSTM, and A-BiGRU), as well as 1D CNN for audio classification. (2) PSO-based hyper-parameter tuning, and (3) an ensemble model combining the devised A-CRNN, A-BiLSTM, A-BiGRU, and 1D CNN. |
Dataset properties.
| Dataset | Dataset Name | Class | No. of Files |
|---|---|---|---|
| D1 | ICBHI | COPD | 793 |
| | | Healthy | 35 |
| | | Bronchiectasis | 16 |
| | | Bronchiolitis | 13 |
| | | URTI | 23 |
| | | Pneumonia | 37 |
| | | Asthma | 1 |
| | | LRTI | 2 |
| D2 | Coswara Cough | COVID-19 Positive | 110 |
| | | COVID-19 Negative | 107 |
| D3 | Coswara Speech | COVID-19 Positive | 103 |
| | | COVID-19 Negative | 104 |
| D4 | Coswara Breathing | COVID-19 Positive | 101 |
| | | COVID-19 Negative | 103 |
| D5 | ICBHI + Coswara Breathing | COPD | 793 |
| | | Healthy | 35 |
| | | Bronchiectasis | 16 |
| | | Bronchiolitis | 13 |
| | | URTI | 23 |
| | | Pneumonia | 37 |
| | | COVID-19 | 101 |
The subject-independent train–test split for ICBHI used in our experiments.
| Class | Training Set | Augmented Training Set | Test Set |
|---|---|---|---|
| Bronchiectasis | 14 | 672 | 2 |
| Bronchiolitis | 7 | 672 | 6 |
| URTI | 16 | 672 | 7 |
| Healthy | 18 | 672 | 17 |
| Pneumonia | 30 | 660 | 7 |
| COPD | 648 | 648 | 145 |
| Total | 733 | 3996 | 184 |
The subject-independent train–test split for the combined dataset (D5) based on ICBHI and Coswara breathing databases.
| Dataset | Class | Training Set | Augmented Training Set | Test Set |
|---|---|---|---|---|
| ICBHI (D1) | Bronchiectasis | 14 | 672 | 2 |
| | Bronchiolitis | 7 | 672 | 6 |
| | URTI | 16 | 672 | 7 |
| | Healthy | 18 | 672 | 17 |
| | Pneumonia | 30 | 660 | 7 |
| Coswara Breathing (D4) | COVID-19 | 81 | 648 | 20 |
| Total | | 814 | 4644 | 204 |
Model 1—The proposed A-CRNN model architecture.
| Layer# | Layer Description | Unit Setting | Kernel Size |
|---|---|---|---|
| L1 | Conv1D | 512 | 3 |
| L2 | Conv1D | 256 | 3 |
| L3 | MaxPooling1D | N/A | N/A |
| L4 | BiLSTM | 512 | N/A |
| L5 | Attention Mechanism | N/A | N/A |
| L6 | LSTM | 256 | N/A |
| L7 | Dense | 128 | N/A |
| L8 | FC Dense (Softmax) | Number of classes | N/A |
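Under the layer settings in the table, the A-CRNN can be sketched in Keras roughly as follows. The activations, padding, pooling size, input shape, and the exact attention formulation are assumptions, since the table specifies only layer types, unit counts, and kernel sizes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def attention_block(x):
    # Score each timestep, softmax over the time axis, and reweight the
    # sequence (one common soft-attention variant; the paper's exact
    # formulation may differ).
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    return layers.Multiply()([x, weights])

def build_a_crnn(input_shape, n_classes):
    inp = layers.Input(shape=input_shape)  # (timesteps, features)
    x = layers.Conv1D(512, 3, activation="relu", padding="same")(inp)      # L1
    x = layers.Conv1D(256, 3, activation="relu", padding="same")(x)        # L2
    x = layers.MaxPooling1D()(x)                                           # L3
    x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)   # L4
    x = attention_block(x)                                                 # L5
    x = layers.LSTM(256)(x)                                                # L6
    x = layers.Dense(128, activation="relu")(x)                            # L7
    out = layers.Dense(n_classes, activation="softmax")(x)                 # L8
    return models.Model(inp, out)

# e.g. 128 frames of 40 spectral features, 6 ICBHI classes (assumed shapes)
model = build_a_crnn((128, 40), 6)
```

Note that attention here reweights timesteps while keeping the sequence shape, so the subsequent LSTM (L6) can still consume a sequence.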
Model 2—The proposed A-BiLSTM network architecture.
| Layer# | Layer Description | Unit Setting |
|---|---|---|
| L1 | BiLSTM | 512 |
| L2 | LSTM | 256 |
| L3 | Attention Mechanism | N/A |
| L4 | Dense | 128 |
| L5 | Dropout | 0.6 |
| L6 | Dense | 64 |
| L7 | FC Dense (Softmax) | Number of classes |
Model 3—The proposed A-BiGRU network architecture.
| Layer# | Layer Description | Unit Setting |
|---|---|---|
| L1 | BiGRU | 512 |
| L2 | GRU | 256 |
| L3 | Attention Mechanism | N/A |
| L4 | Dense | 128 |
| L5 | Dropout | 0.6 |
| L6 | Dense | 64 |
| L7 | FC Dense (Softmax) | Number of classes |
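The "Attention Mechanism" layer shared by Models 1–3 is, in its common soft-attention form, a learned softmax weighting over timesteps. A NumPy sketch of that weighting, where `w` and `b` stand in for learned parameters (the paper's exact formulation may differ):

```python
import numpy as np

def soft_attention(h, w, b=0.0):
    """Compute attention weights over timesteps of hidden states h (T, d)
    and return the attention-weighted context vector (d,) plus weights."""
    scores = np.tanh(h @ w + b)        # one scalar score per timestep, (T,)
    e = np.exp(scores - scores.max())  # numerically stable softmax
    alpha = e / e.sum()                # weights sum to 1 over timesteps
    return alpha @ h, alpha            # context vector and weights

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))        # 5 timesteps, 8 hidden units
w = rng.standard_normal(8)
context, alpha = soft_attention(h, w)
```

The weights `alpha` make the model focus on informative frames of the audio clip instead of treating all timesteps equally.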
Model 4—The proposed CNN architecture.
| Layer# | Layer Description | Unit Setting | Kernel Size |
|---|---|---|---|
| L1 | Conv1D | 128 | 3 |
| L2 | Conv1D | 128 | 3 |
| L3 | Conv1D | 128 | 3 |
| L4 | MaxPooling1D | N/A | N/A |
| L5 | Conv1D | 256 | 3 |
| L6 | Conv1D | 256 | 3 |
| L7 | Conv1D | 256 | 3 |
| L8 | MaxPooling1D | N/A | N/A |
| L9 | Conv1D | 512 | 3 |
| L10 | Conv1D | 512 | 1 |
| L11 | Conv1D | 2 | 1 |
| L12 | GlobalAveragePooling1D | N/A | N/A |
| L13 | Activation | N/A | N/A |
Optimized hyper-parameter settings with respect to D1.
| Model | Hyper-Parameter | Setting |
|---|---|---|
| A-CRNN | Learning Rate | 0.00159 |
| | Batch Size | 128 |
| | Epoch | 37 |
| A-BiLSTM | Learning Rate | 0.00095 |
| | Batch Size | 128 |
| | Epoch | 105 |
| A-BiGRU | Learning Rate | 0.00193 |
| | Batch Size | 128 |
| | Epoch | 45 |
| CNN | Learning Rate | 0.00019 |
| | Batch Size | 128 |
| | Epoch | 53 |
Optimized hyper-parameter settings with respect to D2.
| Model | Hyper-Parameter | Setting |
|---|---|---|
| A-CRNN | Learning Rate | 0.000106 |
| | Batch Size | 64 |
| | Epoch | 26 |
| A-BiLSTM | Learning Rate | 0.000101 |
| | Batch Size | 64 |
| | Epoch | 15 |
| A-BiGRU | Learning Rate | 0.00909 |
| | Batch Size | 64 |
| | Epoch | 16 |
| CNN | Learning Rate | 0.00013 |
| | Batch Size | 64 |
| | Epoch | 23 |
Optimized hyper-parameter settings with respect to D3.
| Model | Hyper-Parameter | Setting |
|---|---|---|
| A-CRNN | Learning Rate | 0.000143 |
| | Batch Size | 512 |
| | Epoch | 48 |
| A-BiLSTM | Learning Rate | 0.000099 |
| | Batch Size | 512 |
| | Epoch | 96 |
| A-BiGRU | Learning Rate | 0.00187 |
| | Batch Size | 512 |
| | Epoch | 33 |
| CNN | Learning Rate | 0.000122 |
| | Batch Size | 512 |
| | Epoch | 130 |
Optimized hyper-parameter settings with respect to D4.
| Model | Hyper-Parameter | Setting |
|---|---|---|
| A-CRNN | Learning Rate | 0.000163 |
| | Batch Size | 512 |
| | Epoch | 48 |
| A-BiLSTM | Learning Rate | 0.000098 |
| | Batch Size | 512 |
| | Epoch | 96 |
| A-BiGRU | Learning Rate | 0.00083 |
| | Batch Size | 512 |
| | Epoch | 33 |
| CNN | Learning Rate | 0.000103 |
| | Batch Size | 512 |
| | Epoch | 130 |
Optimized hyper-parameter settings with respect to D5.
| Model | Hyper-Parameter | Setting |
|---|---|---|
| A-CRNN | Learning Rate | 0.000157 |
| | Batch Size | 128 |
| | Epoch | 42 |
| A-BiLSTM | Learning Rate | 0.000197 |
| | Batch Size | 128 |
| | Epoch | 30 |
| A-BiGRU | Learning Rate | 0.00192 |
| | Batch Size | 128 |
| | Epoch | 38 |
| CNN | Learning Rate | 0.000083 |
| | Batch Size | 128 |
| | Epoch | 43 |
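The PSO search behind the settings above treats each candidate hyper-parameter vector (learning rate, batch size, epochs) as a particle drawn toward its personal best and the swarm's global best. A minimal PSO loop over a toy one-dimensional surrogate objective; in practice the objective would be the validation loss from actually training a network, and all constants here (particle count, inertia, acceleration coefficients, bounds) are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(42)

def pso(objective, bounds, n_particles=8, n_iters=20, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO minimizing `objective` over box `bounds` of shape (dim, 2)."""
    dim = len(bounds)
    lo, hi = bounds[:, 0], bounds[:, 1]
    pos = rng.uniform(lo, hi, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                                  # personal bests
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()            # global best
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Toy surrogate: pretend validation loss is minimized near log10(lr) = -3
loss = lambda p: (p[0] + 3.0) ** 2
best, best_val = pso(loss, np.array([[-5.0, -1.0]]))
```

Searching the learning rate on a log scale, as in the surrogate above, is a common choice because useful values span several orders of magnitude.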
Results of base and ensemble models using a subject-independent train–test split for D1, i.e., the ICBHI dataset.
| Models | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| A-CRNN | 0.8947 | 1 | 0.9474 | 0.8989 |
| A-BiLSTM | 0.8947 | 0.8571 | 0.8759 | 0.8933 |
| A-BiGRU | 0.8655 | 0.8571 | 0.8613 | 0.8652 |
| CNN | 0.883 | 0.8571 | 0.8701 | 0.882 |
| Ensemble | 0.9532 | 1 | 0.9766 | 0.9551 |
Confusion matrix of the proposed ensemble model for D1, i.e., the ICBHI dataset.
| Actual \ Predicted | Bronchiectasis | Bronchiolitis | COPD | Healthy | Pneumonia | URTI |
|---|---|---|---|---|---|---|
| Bronchiectasis | 1 | 0 | 0 | 0 | 0 | 0 |
| Bronchiolitis | 0 | 1 | 0 | 0 | 0 | 0 |
| COPD | 0.0261 | 0 | 0.9673 | 0.0065 | 0 | 0 |
| Healthy | 0 | 0 | 0 | 1 | 0 | 0 |
| Pneumonia | 0 | 0 | 0.2 | 0 | 0.8 | 0 |
| URTI | 0 | 0 | 0 | 0.4 | 0 | 0.6 |
Results of base and ensemble models for D2, i.e., the Coswara cough dataset.
| Models | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| A-CRNN | 0.9231 | 0.8846 | 0.9038 | 0.9060 |
| A-BiLSTM | 1 | 0.9391 | 0.9700 | 0.9754 |
| A-BiGRU | 1 | 0.9600 | 0.9800 | 0.9825 |
| CNN | 0.9524 | 0.9077 | 0.9300 | 0.9297 |
| Ensemble | 1 | 0.9420 | 0.9710 | 0.9750 |
Confusion matrix of the proposed ensemble model for D2, i.e., the Coswara cough dataset.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | 1 | 0 |
| Negative | 0.058 | 0.942 |
Results of base and ensemble models for D3, i.e., the Coswara speech dataset.
| Models | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| A-CRNN | 0.9289 | 0.8747 | 0.9018 | 0.9023 |
| A-BiLSTM | 0.9422 | 0.8410 | 0.8916 | 0.8894 |
| A-BiGRU | 0.8344 | 0.8258 | 0.8300 | 0.8304 |
| CNN | 0.8965 | 0.8795 | 0.8880 | 0.8881 |
| Ensemble | 0.9480 | 0.8920 | 0.9200 | 0.9240 |
Confusion matrix of the proposed ensemble model for D3, i.e., the Coswara speech dataset.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | 0.9480 | 0.0520 |
| Negative | 0.1080 | 0.8920 |
Results of base and ensemble models for D4, i.e., the Coswara breathing dataset.
| Models | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| A-CRNN | 0.9073 | 0.8746 | 0.8909 | 0.8936 |
| A-BiLSTM | 0.9909 | 0.9530 | 0.9720 | 0.9724 |
| A-BiGRU | 0.8654 | 0.8069 | 0.8362 | 0.8320 |
| CNN | 0.9497 | 0.8562 | 0.9030 | 0.9066 |
| Ensemble | 0.9810 | 0.8770 | 0.9290 | 0.9300 |
Confusion matrix of the proposed ensemble model for D4, i.e., the Coswara breathing dataset.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | 0.9810 | 0.019 |
| Negative | 0.1230 | 0.8770 |
Results of the base and ensemble models using a subject-independent train–test split for D5, i.e., the combination of ICBHI and Coswara breathing datasets.
| Models | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| A-CRNN | 0.911 | 1 | 0.9555 | 0.9141 |
| A-BiLSTM | 0.9215 | 0.8571 | 0.8893 | 0.9192 |
| A-BiGRU | 0.8743 | 0.8571 | 0.8657 | 0.8737 |
| CNN | 0.9162 | 0.8571 | 0.8867 | 0.9141 |
| Ensemble | 0.9424 | 1 | 0.9712 | 0.9444 |
Confusion matrix of the proposed ensemble model for D5, i.e., the combination of ICBHI and Coswara breathing datasets.
| Actual \ Predicted | Bronchiectasis | Bronchiolitis | COPD | Healthy | Pneumonia | URTI | COVID-19 |
|---|---|---|---|---|---|---|---|
| Bronchiectasis | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Bronchiolitis | 0 | 0.8333 | 0 | 0.1667 | 0 | 0 | 0 |
| COPD | 0.0261 | 0 | 0.9542 | 0.0196 | 0 | 0 | 0 |
| Healthy | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Pneumonia | 0 | 0 | 0 | 0 | 0.8 | 0.2 | 0 |
| URTI | 0 | 0 | 0 | 0.4 | 0 | 0.6 | 0 |
| COVID-19 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Ensemble results for the five test datasets.
| Dataset | Sensitivity | Specificity | ICBHI Score | Accuracy |
|---|---|---|---|---|
| D1 | 0.9532 | 1 | 0.9766 | 0.9551 |
| D2 | 1 | 0.942 | 0.971 | 0.975 |
| D3 | 0.948 | 0.892 | 0.920 | 0.924 |
| D4 | 0.981 | 0.877 | 0.929 | 0.930 |
| D5 | 0.9424 | 1 | 0.9712 | 0.9444 |
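The ICBHI score reported throughout follows the ICBHI challenge convention of averaging sensitivity and specificity (the definition is assumed from the challenge, not restated in this record). A quick check against the ensemble rows above confirms the reported scores are consistent with that definition:

```python
# Verify ICBHI score = (sensitivity + specificity) / 2 for each
# ensemble row: (sensitivity, specificity, reported ICBHI score).
rows = {
    "D1": (0.9532, 1.0, 0.9766),
    "D2": (1.0, 0.942, 0.971),
    "D3": (0.948, 0.892, 0.920),
    "D4": (0.981, 0.877, 0.929),
    "D5": (0.9424, 1.0, 0.9712),
}
for name, (sens, spec, icbhi) in rows.items():
    # tolerance allows for the table's rounding to 3-4 decimal places
    assert abs((sens + spec) / 2 - icbhi) < 5e-4, name
```

Unlike plain accuracy, this score is insensitive to the heavy class imbalance in ICBHI (COPD dominates the recordings), which is why the paper reports both.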
Performance comparison with existing studies.
| Dataset | Existing Studies | Methodology | No. of Classes | Evaluation Strategies | Results |
|---|---|---|---|---|---|
| ICBHI | Wall et al. [ | BiLSTM with attention mechanisms | 6 | 90–10 (random) | Accuracy rate: 0.962 |
| | Zhang et al. [ | An evolving ensemble of CRNNs | 3 (healthy, chronic, and non-chronic) | 80–20 (subject-independent) | ICBHI score: 0.9803 |
| | Wall et al. [ | BiLSTM | 2 (healthy and unhealthy) | 80–20 (random) | ICBHI score: 0.957 |
| | Perna [ | 2D CNN | 3 (healthy, chronic, and non-chronic) | 80–20 (random) | ICBHI score: 0.83 |
| | Perna and Tagarelli [ | LSTM with 50% overlap between windows | 3 (healthy, chronic, and non-chronic) | 80–20 (random) | ICBHI score: 0.9 |
| | Perna and Tagarelli [ | LSTM without overlap | 3 (healthy, chronic, and non-chronic) | 80–20 (random) | ICBHI score: 0.89 |
| | García-Ordás et al. [ | 2D CNN with the Synthetic Minority Oversampling Technique | 3 (healthy, chronic, and non-chronic) | 10-fold (random) | ICBHI score: 0.558 |
| | García-Ordás et al. [ | 2D CNN with the Adaptive Synthetic Sampling Method | 3 (healthy, chronic, and non-chronic) | 10-fold (random) | ICBHI score: 0.911 |
| | García-Ordás et al. [ | 2D CNN with dataset weighting | 3 (healthy, chronic, and non-chronic) | 10-fold (random) | ICBHI score: 0.476 |
| | This research | Ensemble of optimized A-CRNN, A-BiLSTM, A-BiGRU, and 1D CNN | 6 | 80–20 (subject-independent) | ICBHI score: 0.9766 |
| Coswara (cough) | Wall et al. [ | BiLSTM with attention mechanisms | 2 | 90–10 (random) | Accuracy rate: 0.968 |
| | This research | Ensemble of optimized A-CRNN, A-BiLSTM, A-BiGRU, and 1D CNN | 2 | 80–20 (random) | ICBHI score: 0.971 |