Jane Saldanha¹, Shaunak Chakraborty², Shruti Patil¹, Ketan Kotecha¹, Satish Kumar¹, Anand Nayyar³.
Abstract
Computerized auscultation of lung sounds is gaining importance today with the availability of digital lung sound recordings and their potential to overcome the limitations of traditional diagnosis methods for respiratory diseases. The publicly available ICBHI respiratory sounds database is severely imbalanced, making it difficult for a deep learning model to generalize and provide reliable results. This work aims to synthesize respiratory sounds of various categories using variants of Variational Autoencoders (VAEs), namely Multilayer Perceptron VAE (MLP-VAE), Convolutional VAE (CVAE) and Conditional VAE, and to compare the influence of augmenting the imbalanced dataset on the performance of various lung sound classification models. We evaluated the quality of the synthetic respiratory sounds using metrics such as Fréchet Audio Distance (FAD), Cross-Correlation and Mel Cepstral Distortion. Our results showed that MLP-VAE achieved an average FAD of 12.42 over all classes, whereas Convolutional VAE and Conditional VAE achieved average FADs of 11.58 and 11.64 over all classes, respectively. Upon augmenting the imbalanced dataset, a significant improvement in the classification performance metrics was observed for certain minority classes and a marginal improvement for the other classes. Hence, our work shows that deep learning-based lung sound classification models are not only a promising solution over traditional methods but can also achieve a significant performance boost upon augmenting an imbalanced training set.
Year: 2022 PMID: 35960763 PMCID: PMC9374267 DOI: 10.1371/journal.pone.0266467
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Summary of techniques used in automated respiratory sound auscultation.
Literature review of classification models proposed for lung sound auscultation.
| Author and Year | Pathology-driven classification | Anomaly-driven classification | Input/Features | Technique | Results |
|---|---|---|---|---|---|
| [ | ✖ | ✔ | Linear Predictive Cepstral Coefficients (LPCC) | Multilayer Perceptron classifier | Accuracy of 99.22% was obtained |
| [ | ✖ | ✔ | Mel Spectrograms with clipped black (zero-energy) regions | RespireNet framework consisting of a ResNet-34 trained on concatenation-based augmented samples of respiratory cycles, along with device-specific optimizations | Achieved a sensitivity of 0.54 and specificity of 0.83 in classifying wheezes (W), crackles (C), both wheezes and crackles (B) and healthy/normal (N) respiratory cycles |
| [ | ✖ | ✔ | Mel Frequency Cepstral Coefficients (MFCCs) and Power Spectrum Density (PSD) | For the breath detector, models such as KNN, Random Forest and Logistic Regression were proposed | All models achieved a precision of 0.98; recalls were 0.99, 0.98 and 0.99, respectively |
| | | | | For the anomaly detection engine, models such as Logistic Regression, SVM, ANN, Random Forest and KNN were used | SVM and Logistic Regression achieved precision and recall of 0.93/0.94 and 0.91/0.91, respectively; all other models achieved a precision of 0.92 and a recall of 0.91 |
| [ | ✖ | ✔ | Short-Time Fourier Transform (STFT) | Pretrained deep convolutional network + SVM classifier | Accuracy of 65.5% was obtained |
| | | | | Pretrained deep convolutional network fine-tuned for classification | Accuracy of 63.09% was obtained |
| [ | ✔ | ✔ | Mel Frequency Cepstral Coefficients (MFCCs) | Recurrent neural networks such as LSTM, GRU, BiGRU and BiLSTM | Sensitivity of 64% and specificity of 82% were obtained |
| [ | ✖ | ✔ | Mel Frequency Cepstral Coefficients (MFCCs) | Noise Masking Recurrent Neural Network (NMRNN) | Sensitivity of 56% and specificity of 73.6% were obtained for end-to-end classification |
| [ | ✖ | ✔ | Mel Frequency Cepstral Coefficients (MFCCs) | Hidden Markov Models in combination with Gaussian mixture models | Best score achieved in the second evaluation phase of ICBHI was 39.56 |
| [ | ✔ | ✖ | Abnormal lung sounds, presence of breathlessness, peak flow meter readings and family history | Logistic Regression with L1 regularization | Achieved an AUC of 0.95 in separating COPD and asthma patients from other disease categories, and an AUC of 0.97 in distinguishing COPD from asthma |
A literature review of data augmentation techniques for audio classification.
| Author and Year | Purpose | Data Augmentation Technique(s) | Input to augmentation technique | Results / Impact of augmentation on the performance of classification models |
|---|---|---|---|---|
| [ | Environmental sound classification | Time stretching, pitch shifting, dynamic range compression and background noise addition | Log-Mel Spectrogram | The accuracy of the proposed CNN (SB-CNN) increased from 73% (before augmentation) to 79% (after augmentation) |
| [ | Speech recognition | Mixup augmentation | Normalized Spectrogram | The authors compared the classification performance of a VGG-11 model trained with empirical risk minimization and with mixup augmentation, and observed a lower classification error with mixup augmentation |
| [ | Speech recognition | Variational Autoencoder | Discrete Fourier Transform | The authors proposed four classification models and evaluated them using Word Error Rate (WER). However, all four classification models suffered an increase in WER after augmentation |
| [ | Speech recognition | SpecAugment | Log Mel Spectrogram | Listen, Attend and Spell (LAS) obtained a WER of 2.8 with augmentation and without a language model, whereas it obtained a WER of 4.1 without augmentation |
| [ | Acoustic scene classification | Spectrogram rolling and mixup | Mel Frequency Cepstral Coefficient | The mean accuracy of the ResNet model increased from 80.97% before augmentation to 82.85% after augmentation |
| [ | Monaural singing voice separation | Variational Autoencoder-Generative Adversarial Network (VAE-GAN) | Short-Time Fourier Transform | The authors used Source-to-Interference Ratio (SIR), Source-to-Artifacts Ratio (SAR) and Source-to-Distortion Ratio (SDR) to evaluate the separation quality of a deep recurrent neural network and VAE-GAN, with higher values indicating better separation. VAE-GAN had a higher SDR and SAR, whereas the RNN had a higher SIR |
| [ | Environmental sound classification | WaveGAN | Raw audio | The baseline method achieved an accuracy of 94.84%, whereas with GAN-based augmentation the accuracy reached 97.03% |
| [ | Animal audio classification | Signal speed scaling, pitch shift, volume increase/decrease, addition of random noise and time shift | Raw audio | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 85.59% with augmentation |
| | | Pitch shift, time shift, summing two spectrograms from the same class, random cropping followed by cutting the spectrogram into 10 temporal slices and applying a function to each, and time shift with a randomly picked shift T | Mel Spectrogram | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 90.68% with augmentation |
| [ | Abnormal respiratory sound detection | Convolutional Variational Autoencoder | Mel Spectrogram | Upon augmentation, the specificity, sensitivity and F-score of the respiratory sound classification model increased from 0.286 to 0.986, from 0.888 to 0.988 and from 0.349 to 0.900, respectively |
| [ | Acoustic scene classification | Zero-value masking | Log Mel Spectrogram | The accuracy on the DCASE 18 dataset is 76.2% |
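Several of the augmentations surveyed above operate directly on the raw waveform. The following is a minimal sketch of three of them (time stretching, pitch shifting and additive noise) using librosa; the file name, stretch rate, semitone step and noise level are illustrative assumptions, not values from the works cited.

```python
# Minimal waveform-level augmentation sketch with librosa.
# "cycle.wav", the stretch rate, semitone step and noise scale are all
# illustrative assumptions, not parameters from the surveyed papers.
import numpy as np
import librosa

y, sr = librosa.load("cycle.wav", sr=None)  # hypothetical respiratory cycle

stretched = librosa.effects.time_stretch(y, rate=0.9)        # 10% slower, same pitch
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # up two semitones
noisy = y + 0.005 * np.random.randn(len(y))                  # additive Gaussian noise
```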
Fig 2. Distribution of crackles and wheezes in the respiratory cycle.
Fig 3. Patient-wise diagnosis in the ICBHI dataset.
Fig 4. Count of audio files for various respiratory diseases.
Fig 5. Distribution of respiratory cycles per class.
Fig 6. Class-wise split of audio segments into train and test sets.
Fig 7. Proposed methodology.
Fig 8. Histogram showing the distribution of respiratory cycle durations.
Fig 9. Padded raw audio segments of all classes used in the study.
Fig 10. Structure of the variational autoencoder.
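To make the structure in Fig 10 concrete, here is a minimal, illustrative Keras sketch of a VAE of the MLP flavour: an encoder producing the mean and log-variance of q(z|x), the reparameterization trick, and a decoder trained with the reconstruction-plus-KL (ELBO) loss. The input size, latent size and layer widths are assumptions, not the paper's exact architecture.

```python
# Illustrative VAE sketch (Keras). INPUT_DIM, LATENT_DIM and the 512-unit
# layers are assumptions for demonstration, not the authors' configuration.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM = 128 * 128   # flattened mel spectrogram (assumed size)
LATENT_DIM = 32         # latent space dimensionality (assumed)

class VAE(keras.Model):
    def __init__(self):
        super().__init__()
        self.enc = layers.Dense(512, activation="relu")
        self.mu = layers.Dense(LATENT_DIM)
        self.log_var = layers.Dense(LATENT_DIM)
        self.dec = keras.Sequential([
            layers.Dense(512, activation="relu"),
            layers.Dense(INPUT_DIM, activation="sigmoid"),
        ])

    def call(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # which keeps the sampling step differentiable.
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps
        x_hat = self.dec(z)
        # ELBO: reconstruction error plus KL divergence to the N(0, I) prior.
        recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
        kl = -0.5 * tf.reduce_sum(
            1 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
        self.add_loss(tf.reduce_mean(recon + kl))
        return x_hat

# Usage sketch: vae = VAE(); vae.compile(optimizer="adam"); vae.fit(x_train, ...)
# New samples are then decoded from draws of the prior:
# x_new = vae.dec(tf.random.normal((n_samples, LATENT_DIM)))
```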
Fig 11. Mel spectrograms of various respiratory diseases.
Fig 12. Overall architecture of MLP-VAE.
Fig 13. Architecture of CNN-VAE.
Fig 14. Overall architecture of the Conditional VAE.
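The conditional variant in Fig 14 differs from a plain VAE mainly in where the class label enters: a one-hot label is concatenated to the encoder input and to the latent code, so a single trained model can be asked for samples of a chosen class. A structural sketch follows, with the sampling step elided and all sizes assumed.

```python
# Structural sketch of label conditioning in a Conditional VAE; sizes are
# assumptions, and the log-variance head / sampling step of the plain VAE
# are elided for brevity.
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM, LATENT_DIM, N_CLASSES = 128 * 128, 32, 6   # assumed sizes

x_in = layers.Input(shape=(INPUT_DIM,))
y_in = layers.Input(shape=(N_CLASSES,))               # one-hot class label

# Encoder sees (x, y); decoder sees (z, y). Generating a specific class then
# amounts to decoding z ~ N(0, I) concatenated with that class's one-hot label.
h = layers.Dense(512, activation="relu")(layers.Concatenate()([x_in, y_in]))
z = layers.Dense(LATENT_DIM)(h)                       # stands in for the sampled latent
d = layers.Dense(512, activation="relu")(layers.Concatenate()([z, y_in]))
x_hat = layers.Dense(INPUT_DIM, activation="sigmoid")(d)

cvae = keras.Model([x_in, y_in], x_hat)
```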
Samples generated by the proposed variational autoencoders.
| Generative model | LRTI | URTI | Bronchiectasis | Bronchiolitis | Healthy | Pneumonia |
|---|---|---|---|---|---|---|
| MLP-VAE | 5641 | 5641 | 5642 | 5641 | 5636 | 5637 |
| CNN-VAE | 5633 | 5640 | 5641 | 5640 | 5641 | 5640 |
| Conditional VAE | 5657 | 5569 | 5577 | 5636 | 5646 | 5613 |
Fig 15. Procedure for computing MFCCs.
Fig 16. MFCCs for various respiratory classes.
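As a pointer to how the procedure in Fig 15 is typically realised in code, a minimal librosa sketch follows; the frame and coefficient settings are common defaults assumed here, not necessarily the paper's.

```python
# Minimal MFCC extraction sketch with librosa; n_mfcc/n_fft/hop_length are
# common defaults assumed here, not necessarily the paper's settings.
import librosa

y, sr = librosa.load("segment.wav", sr=None)  # hypothetical respiratory segment
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print(mfcc.shape)  # (n_mfcc, n_frames)
```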
Hyperparameter configuration of the proposed classification models.
| Model | Hyperparameters | Value |
|---|---|---|
| MLP | Number of neurons in hidden layers (1–3), layer 4 and hidden layers (5–7) | 512, 1024, 512 |
| | Activation function used in hidden layers | ReLU |
| | Optimizer and learning rate | Adam and 0.0001 |
| CNN | Number of filters in Conv2D layers | 32 |
| | Stride in Conv2D layers | (1,1) |
| | Pool size in MaxPool2D layers | (2,2) |
| | Stride in MaxPool2D layers | (2,2) |
| | Kernel size in Conv2D layers 1 and 2 | (3,3) and (2,2) |
| | Number of neurons in Dense layers (1–4) | 64, 128, 128, 64 |
| | Activation function used in Dense layers | ReLU |
| | Optimizer and learning rate | Adam and 0.0001 |
| RNN-LSTM | Number of memory cells in LSTM layers 1 and 2 | 64 and 128 |
| | Number of neurons in Dense layers (1–3) | 64, 256, 128 |
| | Activation function used in LSTM layers 1 and 2 | tanh |
| | Activation function used in Dense layers (1–3) | ReLU |
| | Optimizer and learning rate | Adam and 0.0001 |
| RESNET-50 | Number of neurons in Dense layers (1–6) | 256, 128, 64, 512, 512, 512 |
| | Activation function used in Dense layers (1–6) | ReLU |
| | Optimizer and learning rate | Adam and 0.0001 |
| EFFICIENT NET B0 | Number of neurons in Dense layers (1–3) | 256, 128, 64 |
| | Activation function used in Dense layers (1–3) | ReLU |
| | Optimizer and learning rate | Adam and 0.0001 |
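For concreteness, the MLP row of the table above translates into roughly the following Keras model. The layer widths, activation, optimizer and learning rate follow the table; the input dimensionality and number of output classes are assumptions.

```python
# MLP classifier per the table: hidden layers 1-3 and 5-7 with 512 units,
# layer 4 with 1024 units, ReLU activations, Adam at 0.0001. The input size
# and class count below are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 3367  # flattened MFCC feature vector (assumed size)
N_CLASSES = 7      # number of diagnosis classes (assumed)

units = [512, 512, 512, 1024, 512, 512, 512]
mlp = keras.Sequential(
    [layers.Input(shape=(N_FEATURES,))]
    + [layers.Dense(u, activation="relu") for u in units]
    + [layers.Dense(N_CLASSES, activation="softmax")]
)
mlp.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
            loss="categorical_crossentropy", metrics=["accuracy"])
```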
Fig 17. Visual representation of the MLP model.
Fig 18. Visual representation of the CNN model.
Fig 19. Visual representation of the RNN-LSTM model.
Fig 20. Visual representation of the RESNET-50 transfer learning model.
Fig 21. Visual representation of the EFFICIENT NET B0 transfer learning model.
Fig 22. Computation of FAD.
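Fig 22 outlines the FAD computation; at its core is the Fréchet distance between two multivariate Gaussians fitted to embeddings (typically VGGish) of the real and synthetic audio sets. A sketch, under the assumption that `real_emb` and `synth_emb` are (n_samples × embedding_dim) arrays of such embeddings:

```python
# Frechet distance between Gaussians fitted to real and synthetic embeddings:
# FAD = ||mu_r - mu_s||^2 + tr(C_r + C_s - 2 (C_r C_s)^(1/2)).
# real_emb / synth_emb are assumed (n_samples x embedding_dim) arrays of
# audio embeddings (e.g. from VGGish).
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb, synth_emb):
    mu_r, mu_s = real_emb.mean(axis=0), synth_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_s = np.cov(synth_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)       # matrix square root of the product
    if np.iscomplexobj(covmean):         # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu_r - mu_s) ** 2).sum() + np.trace(cov_r + cov_s - 2 * covmean))
```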
FAD of synthetic samples of minority classes with respect to real samples.
| Generative model | Bronchiectasis | Bronchiolitis | Healthy | Pneumonia | LRTI | URTI |
|---|---|---|---|---|---|---|
| MLP-VAE | 28.40 | 11.72 | 4.81 | 12.34 | 3.16 | 14.11 |
| CNN-VAE | 12.47 | 10.86 | 12.05 | 11.56 | 12.10 | 10.44 |
| Conditional VAE | 13.96 | 10.88 | 12.07 | 11.62 | 10.79 | 10.57 |
Fig 23. FAD of synthetic samples with respect to real samples for minority classes.
Fig 24. Principal components of MFCCs of synthetic (MLP-VAE) and real samples of minority classes.
Fig 25. Principal components of MFCCs of synthetic (CNN-VAE) and real samples of minority classes.
Fig 26. Principal components of MFCCs of synthetic (Conditional VAE) and real samples of minority classes.
Cross-correlation between sampled synthetic and real audio segments for each class.
| Class | MLP-VAE | CNN-VAE | Conditional VAE |
|---|---|---|---|
| Bronchiectasis | 0.10 ± 0.13 | 0.40 ± 0.15 | |
| Bronchiolitis | 0.34 ± 0.18 | 0.52 ± 0.12 | |
| Healthy | 0.01 ± 0.16 | 0.52 ± 0.13 | |
| LRTI | 0.46 ± 0.15 | 0.50 ± 0.17 | |
| Pneumonia | 0.32 ± 0.17 | 0.55 ± 0.14 | |
| URTI | 0.17 ± 0.15 | 0.39 ± 0.15 |
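The values above can be read as Pearson-style correlations between paired real and synthetic segments. One plausible computation is the peak of the normalized cross-correlation, sketched below; the exact normalization and lag handling used by the authors are not specified here, so this is an assumption.

```python
# Peak normalized cross-correlation between a real and a synthetic segment;
# the normalization/lag convention is an assumption, not necessarily the
# authors' exact procedure.
import numpy as np

def peak_normalized_xcorr(a, b):
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.max(np.correlate(a, b, mode="full")))
```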
Fig 27. Correlation heatmap between sampled synthetic (MLP-VAE) and real audio segments for all minority classes.
Fig 28. Correlation heatmap between sampled synthetic (CNN-VAE) and real audio segments for all minority classes.
Fig 29. Correlation heatmap between sampled synthetic (Conditional VAE) and real audio segments for all minority classes.
Fig 30. Mean Mel Cepstral Distortion between the mel cepstra of the synthetic and real audio samples for all classes.
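For reference, a common formulation of the Mel Cepstral Distortion used in Fig 30 is MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over aligned frames. A sketch follows, with naive truncation-based frame alignment as an assumption (DTW alignment is also common).

```python
# Mel Cepstral Distortion under the common (10 / ln 10) * sqrt(2 * sum sq diff)
# convention; truncation-based frame alignment is an assumption here.
import numpy as np

def mel_cepstral_distortion(c_real, c_synth):
    """c_real, c_synth: (n_frames, n_coeffs) mel cepstra."""
    n = min(len(c_real), len(c_synth))        # naive alignment by truncation
    diff = c_real[:n] - c_synth[:n]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())
```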
Fig 31. Confusion matrix.
Fig 33. Confusion matrices for the ANN classifier with imbalanced and augmented training sets.
Fig 37. Confusion matrices for the Efficient Net B0 classifier with imbalanced and augmented training sets.
Impact of VAE augmentation on the performance of classification models.
| Dataset | Metric | MLP | CNN | LSTM | RESNET-50 | EFFICIENT NET B0 |
|---|---|---|---|---|---|---|
| Imbalanced | Sensitivity | 0.95±0.09 | 0.96±0.06 | 0.92±0.16 | 0.97±0.05 | 0.94±0.11 |
| | Specificity | 0.48±0.3 | 0.29±0.37 | 0.41±0.28 | 0.75±0.2 | 0.47±0.32 |
| | | 0.43±0.35 | 0.34±0.39 | 0.32±0.28 | 0.62±0.25 | 0.37±0.31 |
| | | 0.43±0.32 | 0.3±0.36 | 0.34±0.26 | 0.64±0.19 | 0.39±0.31 |
| Augmented (MLP-VAE) | Sensitivity | 0.97±0.05 | 0.96±0.08 | 0.92±0.15 | 0.98±0.05 | 0.96±0.07 |
| | Specificity | 0.51±0.29 | 0.61±0.2 | 0.41±0.24 | 0.71±0.16 | 0.55±0.22 |
| | | 0.53±0.32 | 0.74±0.18 | 0.49±0.23 | 0.78±0.14 | 0.67±0.2 |
| | | | | | | |
| Augmented (CNN-VAE) | Sensitivity | 0.95±0.11 | 0.96±0.07 | 0.92±0.17 | 0.98±0.04 | 0.96±0.06 |
| | Specificity | 0.45±0.33 | 0.62±0.23 | 0.38±0.26 | 0.71±0.14 | 0.56±0.26 |
| | | 0.58±0.33 | 0.76±0.17 | 0.46±0.23 | 0.77±0.19 | 0.62±0.25 |
| | | | | | | |
| Augmented (Conditional VAE) | Sensitivity | 0.96±0.07 | 0.96±0.05 | 0.91±0.19 | 0.98±0.04 | 0.96±0.07 |
| | Specificity | 0.42±0.34 | 0.4±0.33 | 0.36±0.27 | 0.7±0.17 | 0.55±0.25 |
| | | 0.48±0.33 | 0.48±0.35 | 0.52±0.23 | 0.76±0.17 | 0.6±0.23 |
| | | 0.41±0.32 | | | | |
Note: Improvements in F1 score after augmentation are marked in bold.
Fig 32. Class-wise comparison of F1 scores achieved by the classifiers with different training sets.
Statistical significance of performance metrics achieved by various classifiers with imbalanced and augmented training sets.
| Performance Metric | F ratio | Critical F value | p-value |
|---|---|---|---|
| | 0.13 | 2.63 | 0.94 |
| | 9.01 | 2.63 | 8.72 × 10⁻⁶ |
| | 5.97 | 2.63 | 0.0005 |
| | 9.52 | 2.63 | 4.35 × 10⁻⁶ |
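The table above reports a one-way ANOVA across training-set conditions. A sketch of how such F ratios, critical values and p-values are obtained with SciPy follows, assuming four groups of per-class scores; the arrays below are random placeholders, not the paper's data.

```python
# One-way ANOVA sketch with SciPy. The four groups below are random
# placeholders standing in for per-class metric scores under the imbalanced
# and three augmented training-set conditions; they are NOT the paper's data.
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(0)
groups = [rng.uniform(0.3, 0.9, size=8) for _ in range(4)]  # placeholder scores

f_stat, p_value = f_oneway(*groups)
dfn = len(groups) - 1                                # between-groups df
dfd = sum(len(g) for g in groups) - len(groups)      # within-groups df
f_crit = f.ppf(0.95, dfn, dfd)                       # critical F at alpha = 0.05
print(f_stat, f_crit, p_value)
```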
Comparison of our results with recent works on multi-class respiratory disease classification.
| Authors and Year | Dataset used | Features / Input to model | Proposed Model(s) | Sensitivity | Specificity | ICBHI Score |
|---|---|---|---|---|---|---|
| [ | ICBHI Dataset with healthy and | MFCCs combined with their first-order derivatives | LSTM | 0.98 | 0.82 | 0.90 |
| [ | ICBHI Dataset with CNN-VAE-generated synthetic samples of | Mel Spectrograms of respiratory sounds | CNN | 0.99 | 0.99 | 0.99 |
| [ | ICBHI Dataset with augmented samples of | MFCCs | CNN | 0.92 | 0.92 | 0.92 |
| [ | King Abdullah University Hospital + ICBHI Database with | Entropy-based features | Boosted Decision Trees | 0.95 | 0.99 | 0.97 |
| Our work | ICBHI dataset with VAE-generated synthetic samples of | MFCCs of respiratory sound segments | MLP | 0.97 | 0.51 | 0.74 |
| | | | CNN | 0.96 | 0.62 | 0.79 |
| | | | LSTM | 0.92 | 0.41 | 0.67 |
| | | | RESNET-50 | 0.98 | 0.71 | 0.85 |
| | | | EFFICIENT NET B0 | 0.96 | 0.56 | 0.76 |
Fig 38. Comparative summary of recent works on respiratory sound classification.