Suman Deb, Pankaj Warule, Amrita Nair, Haider Sultan, Rahul Dash, Jarek Krajewski
Abstract
This paper presents a deep learning-based analysis and classification of cold speech, that is, speech produced by a person who has the common cold. The common cold is a viral infectious disease that affects the throat and the nose. Because speech is produced by the vocal tract linearly filtering the excitation source, a cold that affects the throat and the nose changes the attributes of the speech signal. The study develops a deep learning-based classification model that predicts from speech whether a person has a cold. Cold-related information is captured from the speech signal using Mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). Class imbalance is handled with the SMOTE-Tomek links resampling strategy. A deep learning-based model is then trained on the MFCC and LPC features and used to classify cold speech. Its performance is compared with logistic regression, random forest, and gradient boosted tree classifiers. The proposed model is less complex and uses a smaller feature set while giving results comparable to other state-of-the-art methods. The proposed method achieves an unweighted average recall (UAR) of 67.71%, higher than the benchmark OpenSMILE SVM result of 64%. If successful, the study will yield a noninvasive method for cold detection that can be extended to detect other speech-affecting pathologies.
Keywords: Cold speech; Deep neural network; Gradient boosted trees; LPC; MFCC; Random forest
Year: 2022 PMID: 36212727 PMCID: PMC9529162 DOI: 10.1007/s00034-022-02189-y
Source DB: PubMed Journal: Circuits Syst Signal Process ISSN: 0278-081X Impact factor: 2.311
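
The abstract names LPC as one of the two feature sets, but this record does not give the analysis settings. The following is a minimal sketch of LPC analysis for a single frame, assuming the textbook autocorrelation method with the Levinson-Durbin recursion; it is not confirmed to be the authors' procedure. The function name `lpc_coefficients`, the frame length, and the order are hypothetical choices made for illustration.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Return (a, err) for the prediction-error filter
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Assumption: autocorrelation method + Levinson-Durbin recursion.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy at order 0
    for i in range(1, order + 1):
        # Correlation of the current residual with lag i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


# Toy frame (not speech): x[n] = 0.5 * x[n-1], so a[1] should be close to -0.5
frame = 0.5 ** np.arange(64)
a, err = lpc_coefficients(frame, order=2)
```

With this sign convention the predictor is x[n] ≈ -(a[1] x[n-1] + ... + a[p] x[n-p]). In practice each speech frame would typically be windowed before the analysis, and the per-frame coefficients would be pooled into an utterance-level feature vector.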
Fig. 1 Block diagram of the proposed method
Fig. 2 Mean values of the MFCC feature for the Cold and No cold classes of speech
Fig. 3 Mean values of the delta MFCC feature for the Cold and No cold classes of speech
Fig. 4 LPC feature extraction
Fig. 5 The proposed architecture of the DNN
Fig. 6 The loss vs. epoch curve during training of the DNN using a combination of MFCC and LPC features
UAR (%) scores of different methods on the URTIC dataset
| Feature | LR | RF | XGBoost | DNN |
|---|---|---|---|---|
| MFCC | 62.90 | 65.32 | 65.21 | 65.64 |
| LPC | 62.03 | 62.87 | 60.20 | 59.78 |
| MFCC + LPC | 64.80 | 64.67 | 66.49 | 64.97 |
| MFCC + LPC with Under-sampling | 65.11 | 64.95 | 65.51 | 63.64 |
| MFCC + LPC with SMOTE + Tomek links | 64.67 | 65.69 | 66.71 | 67.71 |
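
The scores in this table are unweighted average recall (UAR): the mean of the per-class recalls, with every class weighted equally regardless of size. The sketch below shows why this matters on imbalanced data. The labels and counts are made up for illustration and are not taken from the URTIC dataset; the function name is hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, each class weighted equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up imbalanced labels: 10 "cold" and 90 "no_cold" utterances
y_true = ["cold"] * 10 + ["no_cold"] * 90

# A trivial classifier that always answers "no_cold"
always_no_cold = ["no_cold"] * 100
acc_trivial = accuracy(y_true, always_no_cold)
uar_trivial = unweighted_average_recall(y_true, always_no_cold)

# A classifier that finds 6 of 10 colds and 72 of 90 non-colds
y_pred = ["cold"] * 6 + ["no_cold"] * 4 + ["no_cold"] * 72 + ["cold"] * 18
acc_model = accuracy(y_true, y_pred)
uar_model = unweighted_average_recall(y_true, y_pred)
```

In this toy case the trivial classifier has the higher accuracy (0.90 against 0.78) but the lower UAR (0.50 against 0.70), which is why UAR is a more informative score when one class dominates.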
Fig. 7 The confusion matrix of the DNN for classification of the URTIC dataset
UAR (%) performance comparison of the proposed method with state-of-the-art methods on the URTIC dataset
| Model | UAR(%) |
|---|---|
| ComParE functionals + SVM (Schuller et al.) | 64.00 |
| ComParE BoAW + SVM (Schuller et al.) | 64.20 |
| VOI + SVM (Huckvale and Beke) | 66.34 |
| VOW + SVM (Huckvale and Beke) | 66.47 |
| MOD + SVM (Huckvale and Beke) | 67.95 |
| GPPS + SVM (Huckvale and Beke) | 66.07 |
| MFCC + GMM (Cai et al.) | 64.80 |
| CQCC + GMM (Cai et al.) | 65.40 |
| PSP + SVM (Suresh et al.) | 64.00 |
| VMD + SVM (Deb et al.) | 66.84 |
| MFCC + Autoencoder (Kao et al.) | 65.81 |
| eGeMAPS + NN (Teixeira et al.) | 66.90 |
| MFCC + FV + PCA + SVM (Vicente et al.) | 64.92 |
| Vowel-like regions MFCC + DNN | 61.93 |
| Proposed (MFCC + LPC + SMOTE–Tomek links + DNN) | 67.71 |