Machine learning for detecting COVID-19 from cough sounds: An ensemble-based MCDM method.

Nihad Karim Chowdhury¹, Muhammad Ashad Kabir², Md Muhtadir Rahman³, Sheikh Mohammed Shariful Islam⁴.

Abstract

This research aims to analyze the performance of state-of-the-art machine learning techniques for classifying COVID-19 from cough sounds and to identify the model(s) that consistently perform well across different cough datasets. Different performance evaluation metrics (precision, sensitivity, specificity, AUC, accuracy, etc.) make selecting the best performance model difficult. To address this issue, in this paper, we propose an ensemble-based multi-criteria decision making (MCDM) method for selecting top performance machine learning technique(s) for COVID-19 cough classification. We use four cough datasets, namely Cambridge, Coswara, Virufy, and NoCoCoDa to verify the proposed method. At first, our proposed method uses the audio features of cough samples and then applies machine learning (ML) techniques to classify them as COVID-19 or non-COVID-19. Then, we consider a multi-criteria decision-making (MCDM) method that combines ensemble technologies (i.e., soft and hard) to select the best model. In MCDM, we use the technique for order preference by similarity to ideal solution (TOPSIS) for ranking purposes, while entropy is applied to calculate evaluation criteria weights. In addition, we apply the feature reduction process through recursive feature elimination with cross-validation under different estimators. The results of our empirical evaluations show that the proposed method outperforms the state-of-the-art models. We see that when the proposed method is used for analysis using the Extra-Trees classifier, it has achieved promising results (AUC: 0.95, Precision: 1, Recall: 0.97).

Keywords: COVID-19; Classification; Cough; Ensemble; Entropy; MCDM; Machine learning; TOPSIS

Year: 2022 PMID: 35318171 PMCID: PMC8926945 DOI: 10.1016/j.compbiomed.2022.105405

Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 6.698

Introduction

The outbreak of the second wave of the COVID-19 pandemic has resulted in an increased loss of human life. As has been observed, the second wave is destroying some countries’ health care systems. To limit the spread of the virus, regional regular testing and contact tracing can substitute for regional restraints [1], and the “Trace, Test and Treat” policy has flattened the pandemic trajectory (for instance, in Singapore, South Korea and China) in its initial stages [2]. Therefore, to reduce the infection rate and limit the impact on medical resources, fast and relatively cheap COVID-19 infection detection methods are indispensable. Infected countries have implemented many strategies to limit the spread of this virus. Such strategies include, encouraging people to maintain social distancing and personal hygiene, enhancing infection screening systems through multi-functional testing, pursuing mass vaccination to reduce the pandemic ahead of time, etc. Developing or underdeveloped countries are still striving to improve their detection capabilities because the current methods of detecting COVID-19 (such as reverse transcription-polymerase chain reaction (RT-PCR)) require expensive kits for on-site testing, and these kits are not always easy to obtain. Hence, low-cost, distributable, and reliable pre-screening tests are essential for identifying and diagnosing COVID-19 and limiting local outbreaks of COVID-19 infection. Besides the RT-PCR standard diagnostic scheme, several artificial intelligence (AI)-based methods have recently been proposed that use chest X-rays [[3], [4], [5]] and CT scans [6,7] to distinguish COVID-19 from other bacterial/viral infections. At the same time, to use RT-PCR, CT scans and X-rays for diagnosis, it is essential to go to a testing center or well-equipped clinical facilities. 
Since the above-mentioned test protocols involve multiple people at close range, and COVID-19 is highly infectious, they carry a high risk of spreading the infection further. To limit the exponential growth of COVID-19 cases, one solution is to design a model that can perform tests without involving many people. For this reason, many AI-based applications that use audio, and therefore require less human contact, have been used for testing and the early detection of respiratory diseases. Cough is a distinctive symptom of many respiratory diseases, and cough sounds have been used to detect different respiratory diseases such as pulmonary edema, tuberculosis, pneumonia, whooping cough, and asthma through AI-based models [[8], [9], [10], [11]]. COVID-19 infects the respiratory system, affecting the sound of a person's cough, breathing, and voice. Recently, several studies have proposed audio-based AI models [2,[12], [13], [14], [15], [16], [17], [18]] for detecting the infection status of COVID-19. This paper proposes a machine learning (ML)-based COVID-19 detection architecture using audio recordings, particularly cough sounds. Our work uses crowd-sourced data from the University of Cambridge [12], which contains two categories, namely asymptomatic and symptomatic, to explore the use of human coughing as a unique marker of COVID-19. Subsequently, we validate the proposed method using other datasets, namely Coswara [13], Virufy [17], and Virufy integrated with NoCoCoDa [19]. The key idea of our work is to generate audio features, such as Mel-Frequency Cepstral Coefficients (MFCCs), Chromagram, Mel-Scaled Spectrogram, Spectral Contrast and Tonal Centroid, before inputting the data to a classifier, while maintaining a level of detection performance acceptable for COVID-19 cases. We then use some popular ML-based classification techniques for binary classification (i.e., distinguishing between COVID-19 and non-COVID-19). 
After that, we use a multi-criteria decision-making (MCDM) [20] method to evaluate the results of each classification technique under three different training strategies with different frameworks and hyper-parameter choices (see Section 3.4). Entropy is used to compute the weights of the different evaluation criteria, and these weights are then passed to the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) [21], which ranks the models in the MCDM method. The MCDM outputs from each training strategy are aggregated through soft and hard ensembles to select the best model. Indeed, model comparisons that only consider a few evaluation criteria (e.g., accuracy, precision) cannot reflect the actual model performance when the dataset is imbalanced. Therefore, we consider an MCDM method that deals with various evaluation criteria, namely Accuracy (Acc.), Receiver Operating Characteristic-Area Under Curve (ROC-AUC), Precision, Recall, Specificity, F1-score, False Positive Rate (FPR), and False Negative Rate (FNR), and uses them to select the best model. Moreover, MCDM has proven its effectiveness in some aspects of COVID-19 management systems [22,23]. We have also integrated ensemble methods into the MCDM framework, thereby reducing the decision bias in choosing the best model. To the best of our knowledge, this is the first attempt to explore an ensemble-based MCDM in detecting COVID-19 from cough sounds. Furthermore, to support the development of the proposed architecture, we perform an extensive experiment using Recursive Feature Elimination with Cross-Validation (RFECV) to rank audio features. By using the top-ranked features, we have increased the AUC score of the asymptomatic category by 3% and the AUC score of the symptomatic category by 12% compared to the baseline AUC score without feature selection. 
The research results show that our proposed architecture can effectively detect a COVID-19 cough. In addition, the results of the ensemble-based MCDM of different ML models can help medical practitioners choose the best-performing model under different experimental settings. The main contributions of this paper are summarized as follows: (1) We propose an ensemble-based MCDM method for detecting COVID-19 from cough sound data; (2) We propose three training strategies with different frameworks and hyper-parameter optimization to analyze the detection performance of existing baseline ML models and identify the best model; (3) We apply feature selection methods to identify the most important features, thereby significantly improving prediction performance; (4) We consider four independent cough datasets for validation to confirm the effectiveness of our proposed method; (5) We conduct an empirical evaluation of the model and compare it with the state-of-the-art models to evaluate the effectiveness of the proposed method in distinguishing COVID-19 from non-COVID-19. The rest of this article is organized as follows. Section 2 presents related work, and Section 3 describes the methodology and explains our proposed method. Section 4 reports our experimental results. Finally, Section 5 summarizes the paper and identifies future work.

Related works

Early research findings [24,25] indicate that coughs originating from specific infections or diseases have sufficiently distinguishing characteristics for ML-based models to use in classification. Furthermore, several ML-based methods [[26], [27], [28], [29]] have shown significantly superior performance in using sound to diagnose various respiratory diseases through automatic audio interpretation. Many researchers have now begun to explore the respiratory sounds (i.e., cough, breath and voice) of patients who have tested positive for COVID-19 to distinguish them from healthy people's sounds. The first step involves creating a valid audio benchmark dataset to diagnose COVID-19 effectively. Many researchers have made significant efforts to create such datasets, which include Cambridge University sound data [12], Coswara [13], Cough against COVID [14], COVID-19 cough dataset [15], AI4COVID [2], COUGHVID [16], Virufy [17], Novel Coronavirus Cough Database (NoCoCoDa) [19], Breathe for Science, and SARS COVID-19 in South Africa (Sarcos). With their release, several studies have been conducted that focus on ML-based COVID-19 detection models from audio samples. We divide the literature review of ML-based COVID-19 detection methods based on audio samples into four groups: 1) speech and voice, 2) cough, breath and voice, 3) cough and breath, and 4) cough only. In the following, we review the most relevant research work. Some studies [30,31] used only speech and voice sounds for classifying COVID-19. Other studies [13,32] explored cough, breath, and voice samples for COVID-19 detection. Some studies [12,[33], [34], [35]] use cough and breath samples as diagnostic symptoms for COVID-19 testing. Brown et al. [12] proposed a binary predictive model in which they used cough and breath sounds to distinguish people with COVID-19 from people with asthma or healthy people. 
They extracted audio features and combined them with the output of a pre-trained audio neural network. Their model achieved a receiver operating characteristic-area under the curve (ROC-AUC) of over 0.80 in all tasks designed during the experiment. In Ref. [33], the raw breath and cough audio and spectrogram were used to identify whether the patient was infected with COVID-19 through the ensemble of neural networks. Here, Bayesian optimization and hyperband combined were considered for automatic hyper-parameter selection, which achieved an unweighted average recall rate (UAR) of 0.74 or an AUC of 0.80. Harry et al. [34] proposed a novel modeling approach that utilizes a custom deep neural network based on ResNet [36] to diagnose COVID-19 from mutual breathing and cough representation, with an AUC of 0.846. QUCoughScope [35] is a mobile application that uses the Cambridge University dataset to automatically detect asymptomatic COVID-19 patients using the cough and breathing sounds. Many studies [2,14,15,17,[37], [38], [39], [40], [41]] considered the analysis of cough audio signals as a workable course of action for an initial COVID-19 diagnosis. In Ref. [14], cough sounds were analyzed through an AI-based model, and the proposed model showed a statistically significant signal, indicating the status of COVID-19. The authors used microbiologically confirmed COVID-19 coughs and obtained an AUC score of 0.72 using the CNN architecture ResNet18. Using cough sounds, Ankit et al. [37] proposed an AI framework for diagnosing COVID-19 with interpretable features. The proposed framework combined cough sound characteristics with patient symptoms during empirical evaluation, and included four cough categories, COVID-19, asthma, bronchitis and healthy. In Ref. [15], the AI speech processing framework for COVID-19 is pre-screened from cough records using the speech biomarker feature extractor. 
In this method, cough records are converted by MFCC and put into a CNN-based architecture, which consists of a Poisson biomarker layer and three pre-trained ResNet50's [36] in parallel. Imran et al. [2] proposed a model called AI4COVID, which can distinguish the pathomorphological changes caused by COVID-19 infection in the respiratory system and compare it with other respiratory infections (such as pertussis and bronchitis) and a normal respiratory tract. Also, the authors developed a tri-pronged mediator-centered AI engine to reduce the misdiagnosis risk for the cough-based diagnosis of COVID-19. Madhurananda et al. [38] used two datasets, Coswara and Sarcos, to diagnose COVID-19 from cough samples. The authors explored seven ML-based approaches, and from empirical evaluation, it was shown that ResNet50 and LSTM got higher AUC scores than the other ML methods. Javier et al. [39] proposed a COVID-19 cough detection algorithm based on empirical mode decomposition (EMD), and then introduced the acoustic sonography tensor and a deep artificial neural network classifier with convolutional layers for subsequent classification. Another study [40] developed a classifier for the COVID-19 pre-screening model from two publicly available crowd-sourced cough sound samples, in which they divided the cough sound samples into non-overlapping coughs, and extracted six cough features from each. The authors conducted many experiments on shallow ML, convolutional neural networks (CNN) and pre-trained CNN models, and reported that an ensemble of CNN can achieve better accuracy. There are some limitations accompanied by the previous studies. Previous studies have used a number of evaluation criteria such as accuracy, AUC, precision, recall, and F1-score, and these criteria are always expected to be higher. However, these evaluation criteria are sensitive when there is a minority class. 
At the same time, it is often difficult to choose the best model when a model exhibits the best result for some evaluation criteria, but not for all. To address this problem, we consider MCDM, which handles a mix of evaluation criteria, some of which are expected to be higher, while others are expected to be lower. Indeed, MCDM deals with various evaluation criteria and selects the best model. In addition, previous studies have conducted experiments using a variety of experimental settings, such as the choice of cross-validation techniques, up-sampling/down-sampling techniques, and hyper-parameter optimization techniques, and did not provide any relative performance comparisons of different experimental settings to select the best model. To solve this problem, we propose three training strategies under different experimental settings, and apply MCDM in each training strategy. The MCDM results of each training strategy are integrated through ensemble methods to make the best decision for selecting the best model.

Methodology

Motivated by the current progress of ML-based audio applications, we have developed an end-to-end ML-based framework that takes cough samples as input and directly predicts binary classification labels, indicating the possibility of COVID-19. As the backbone of our proposed method, we use audio features, including Mel-Frequency Cepstral Coefficients, Mel-Scaled Spectrogram, Tonal Centroid, Chromagram and Spectral Contrast, and then perform feature fusion. The output of the feature fusion passes to the trained classifier layer, which consists of 10 classification methods: Extremely Randomized Trees (Extra-Trees), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGBoost), Gradient Boosting (GBoost), Logistic Regression (LR), k-Nearest Neighbor (k-NN) and Histogram-based Gradient Boosting (HGBoost). Each classifier is trained using different training strategies as detailed in Section 3.4. In addition, to select an optimized COVID-19 cough diagnosis model, we use the MCDM method that considers the decision matrix generated from different evaluation criteria outlined in Section 3.5. After that, we calculate the relative closeness score of each training strategy by integrating TOPSIS and entropy. Then, we use two ensemble strategies (soft ensemble and hard ensemble) to rank the models. We further analyze the effect of feature dimensionality reduction. In this regard, we use Recursive Feature Elimination with Cross-Validation (RFECV). Finally, we feed the selected features into the best classifier to detect COVID-19. The following sections outline the dataset description, the proposed method (including feature extraction and classification), the training strategies used, and the details of the optimization techniques used to select the best model. An overview of our proposed method can be seen in Fig. 1.

Fig. 1

An overview of the proposed method for detecting COVID-19 from cough samples.

Dataset description and preprocessing

In this section, we will describe in detail the datasets used for analysis in this article. We have used four datasets in the experimental evaluation: Cambridge [12], Coswara [13], Virufy [17], and Virufy integrated with NoCoCoDa [19]. Table 1 shows the distribution of cough samples used during the experiment. Each cough sample is resampled with a sampling rate of 22.5 kHz, and a window type of Hann.

Table 1

Datasets description.

Dataset	Category	COVID-19	Non-COVID-19	Total
Cambridge	Asymptomatic	141	298	439
Cambridge	Symptomatic	54	32	86
Coswara	-	185	1134	1319
Virufy	-	48	73	121
NoCoCoDa	-	73	-	73
Virufy + NoCoCoDa	-	121	73	194

Cambridge dataset

The University of Cambridge has launched a web-based application and a mobile application for people to provide coughing, breathing, and voice data when reading a prescribed sentence. In the case of the Cambridge dataset, we consider two categories, namely asymptomatic and symptomatic, to distinguish COVID-19 positive from non-COVID-19. Fig. 2 shows asymptomatic and symptomatic COVID-19 and non-COVID-19 samples from the Cambridge dataset. Since the authors of the University of Cambridge dataset released it under a one-to-one legal agreement, we complied with its restrictions and used the dataset for research purposes only, not for commercial purposes.

Fig. 2

COVID-19 and Non-COVID-19 cough samples of the Cambridge dataset.

Asymptomatic: Distinguish people who tested positive for COVID-19 from those who tested negative, had a clean medical history, had never smoked, and were asymptomatic. In the dataset, there are 141 cough samples from people who have tested positive for COVID-19 and 298 cough samples from people who do not have COVID-19 (those who have a clean medical history, have never smoked, and have no symptoms).
Symptomatic: Distinguish those who tested positive for COVID-19 and declared cough as a symptom from those who tested negative and had a cough as a symptom. Moreover, these people had a clean medical history and had never smoked. This task distinguishes 54 symptomatic COVID-19 samples from 32 symptomatic non-COVID-19 samples.

Coswara dataset

In addition to the Cambridge dataset, we also consider the Coswara dataset developed by the Indian Institute of Science (IISc), Bangalore. The dataset is now publicly available. We collected samples from the Coswara dataset between April 2020 and May 2021. Since the recording categories of the Coswara dataset differ from those of the Cambridge dataset, to make it consistent with the Cambridge dataset, we only consider the heavy cough variants of the COVID-19 and healthy (non-COVID-19) categories. From the Coswara dataset, we have considered a total of 185 COVID-19 and 1,134 non-COVID-19 cough samples for training and testing.

Virufy dataset

The Virufy COVID-19 open cough dataset is the first freely available COVID-19 cough sound dataset collected in a hospital under the supervision of a doctor, following standard operating procedures (SOP) and with patients' informed consent. The dataset is preprocessed and labeled with COVID-19 status (obtained through PCR testing) and patient demographic data. A total of 121 segmented cough samples (48 COVID-19 positive and 73 COVID-19 negative) from 16 patients were considered for experimental evaluation.

NoCoCoDa dataset

The NoCoCoDa dataset includes coughing events during or after the critical phase of COVID-19 patients recorded through public media interviews. A total of 73 individual cough events were obtained, and the cough phases were marked after the interview was manually segmented. Since the NoCoCoDa dataset only has COVID-19 samples, in the experiment, we have integrated it with the Virufy dataset consisting of COVID-19 positive and healthy samples.

Feature extraction methods

The sound waveform considered in the feature extraction process is sampled at a rate of 22 kHz to ensure uniformity, as it is a standard frequency for audio applications. Five spectral features (i.e., Mel-Frequency Cepstral Coefficients, Mel-Scaled Spectrogram, Tonal Centroid, Chromagram, and Spectral Contrast) are extracted from the sampled audio using the Python library librosa [42].
Mel-Frequency Cepstral Coefficients (MFCCs): MFCCs have already shown their usefulness in the analysis of dry and wet cough detection [43], and have been highlighted as successful features for audio analysis. In MFCC feature extraction, after the windowing operation, the fast Fourier transform (FFT) is applied to find the power spectrum of each frame. Afterward, the Mel scale is used to perform filter bank processing on the power spectrum. The Mel-scaled filters are calculated from the physical frequency (f) by Equation (1). After converting the power spectrum to the logarithmic domain, the discrete cosine transform (DCT) is applied to obtain the MFCC coefficients.
Mel-Scaled Spectrogram: In ML applications concerning audio analysis, we often need to represent the power spectrogram in the Mel scale domain. The feature extraction process of the Mel-scaled Spectrogram includes several steps to generate the spectrogram. Before calculating the FFT, we set the window size to 2048 and the hop length to 512. After that, we set the number of Mel bands to 128, which are evenly spaced on the Mel scale. Finally, the magnitude of the signal is decomposed into components corresponding to the frequencies in the Mel scale.
Tonal Centroid: The tonal centroid feature projects a 12-bin tuned chromagram onto a 6-dimensional vector, as described in Equation (2) [44], where ζ is the tonal centroid vector, which for time frame n is given by the product of the transformation matrix Φ and the chroma vector c.
Chromagram: We calculate the chromagram from the short-time Fourier transform (STFT) power spectrum. We initialize the window size to 2048 and the hop length to 512. The number of chroma bins generated is 12. Finally, we extract the normalized energy of each chroma bin on each frame, which is the required feature vector.
Spectral Contrast: First, the FFT is performed on the audio samples to obtain the frequency spectrum. Using several octave-scale filters, the frequency domain is partitioned into sub-bands. In the feature extraction process, the number of frequency bands is set to 6. The strength of spectral valleys, peaks, and their differences are evaluated in each sub-band, as stated in Equations (3), (4), (5) [45], where N is the total number in the k-th sub-band, k ∈ [1,6], and α is a constant ranging from 0.02 to 0.2. After being converted to the logarithmic domain, the original spectral contrast features are mapped to an orthogonal space.
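
As a rough illustration of the quantities mentioned above, the following Python sketch converts frequencies to the Mel scale, places 128 Mel bands between 0 Hz and the Nyquist frequency, and counts the analysis frames for a hop length of 512. The conversion formula used here, m = 2595 log10(1 + f/700), is one widely used form of the Mel scale and is an assumption, since Equation (1) is not reproduced in this text; librosa's default filters may use a different variant. The function names and the centered-frame convention are also assumptions made for illustration.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_mels: int) -> list:
    """Return n_mels + 2 frequencies (Hz) that are evenly spaced on the Mel scale.

    Consecutive triples of these points would define the n_mels filter bands.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def num_frames(n_samples: int, hop_length: int = 512) -> int:
    """Frame count when the signal is padded so that frames are centered."""
    return 1 + n_samples // hop_length


sample_rate = 22_000  # Hz, the rate stated in the text
edges = mel_band_edges(0.0, sample_rate / 2, n_mels=128)
frames_per_second = num_frames(sample_rate, hop_length=512)
```

Because the band edges are evenly spaced in mels rather than in Hz, the bands are narrow at low frequencies and wide at high frequencies.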

Trained classifiers

We consider ten ML algorithms in our proposed method for classification, i.e., Extremely Randomized Trees (Extra-Trees), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGBoost), Gradient Boosting (GBoost), Logistic Regression (LR), k-Nearest Neighbor (k-NN) and Histogram-based Gradient Boosting (HGBoost). In the following, we will briefly describe each of the different classifiers evaluated in our experimental evaluation. Extremely Randomized Trees (Extra-Trees) is a classifier that can fit multiple random decision trees to each sub-sample of the dataset, so it can control overfitting and uses the average to improve detection accuracy. The Extra-Trees classifier has proven useful in diagnosing patients with chronic obstructive pulmonary disease [46]. Support Vector Machine (SVM) is a popular supervised technique that can effectively perform classification tasks. Several SVM kernels (such as Gaussian function, polynomial function, or quadratic function) can be used during the classification task. Some previous studies [2,12,14,32,33,38,47] have successfully applied SVM to detect COVID-19 in audio samples. Random Forest (RF) is a collection of decision trees widely used in classification tasks. By growing a combination of trees and voting for each category of trees, we can observe significant classification accuracy. Random Forest has achieved success in classifying cough, breath, and sound events [13]. Adaptive Boosting (AdaBoost) is a classifier that first fits the classifier to the original dataset, and then fits other copies of the classifier to the same dataset. However, the weights of misclassified instances are adapted to force successive classifiers to pay more attention to hard events. Multilayer Perceptron (MLP) has adapted to the concept of human biological neural networks and can learn non-linear relationships. 
The training of the network depends on iteration, bias, weight adjustment, learning rate, and optimization. It effectively detects COVID-19 coughs [32,37,38] and other types of coughs. Extreme Gradient Boosting (XGBoost) classifier is a decision-tree-based ensemble ML technique that utilizes a gradient boosting structure. This advanced and powerful technique can deal with data irregularities and further reduce overfitting [47]. Some previous studies have reported the performance of the XGBoost classifier in detecting COVID-19 in cough samples [14,47]. Gradient Boosting (GBoost) generates an additive model according to the forwarding stage-wise, and summarizes it by optimizing the differentiable loss function [48]. At each stage, regression trees (equal to the total number of classes) are fitted to the negative gradient of the binomial or multinomial deviation loss function. Logistic Regression (LR) is a parametric classification model with fixed parametric numbers that predict categorical or discrete output for given input features. We can use multinomial logistic regression in scenarios with multiple categories rather than two categories [49]. Madhurananda et al. [38] successfully used it for COVID-19 cough detection. k-Nearest Neighbor (k-NN) is a well-known classifier that appears in large-scale ML applications. As we have seen from previous studies, researchers used k-NN in non-COVID-19 applications such as night coughing and sniffing [50] and used k-NN to detect COVID-19 in cough samples [32,38,47,51]. Histogram-based Gradient Boosting (HGBoost) is a highly desirable ML technology, where the application needs to get better quality performance in less inference time. The main advantage of histogram-based gradient boosting technology is speed. Chung et al. [52] successfully explored this method to predict the severity of COVID-19.

Training strategies and hyper-parameters optimization

We introduce three training strategies, namely training strategies 1, 2 and 3, to evaluate the effectiveness of different factors of the proposed method. It is evident from the dataset that the positive category of COVID-19 is under-represented, which may adversely affect the performance of the ML classifier. Therefore, we have used the Synthetic Minority Oversampling Technique (SMOTE) [53] during training to balance the dataset and enhance the ML classifier's performance. The difference between training strategy 1 and strategy 2 is that strategy 1 does not apply SMOTE in the training process, while strategy 2 does. However, they both use the same hyper-parameters. On the other hand, strategy 3 differs from strategies 1 and 2 in that it integrates nested cross-validation with hyper-parameter optimization. The nested cross-validation includes an inner loop of 5-fold stratified cross-validation for hyper-parameter optimization and an outer loop of 10-fold stratified cross-validation, in which SMOTE is applied during training. The hyper-parameters used during empirical evaluation for optimization are listed in Table 2. For classification problems with class imbalance, the default threshold (i.e., 0.50) can lead to poor performance. Therefore, we apply the threshold moving technique to adjust the probability threshold that maps predicted probabilities to class labels. In our experiments, in each fold of cross-validation, we generate ROC-AUC scores for all threshold values from 0.1 to 1 in increments of 0.001, and select the threshold that produces the highest ROC-AUC score. Table 3 shows the configurations of the training strategies. In training strategies 1 and 2, we use fixed hyper-parameters for all classifiers. In both strategies, we use 10-fold cross-validation, in which the dataset is divided into train and test sets. 
In strategy 3, we apply nested cross-validation, where outer cross-validation divides the dataset into train and test sets. In the inner loop of cross-validation, we apply grid search to get the best parameters using the training set. Once we obtain the best parameters, we train classifiers using those parameters using the same train set, which creates the outer loop of cross-validation. Afterward, we evaluate the proposed model using a never-before-used test set.
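
The threshold moving step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the "ROC-AUC score" of a single threshold is computed on the resulting hard labels, in which case it equals the mean of sensitivity and specificity. The function names and the toy labels and probabilities are hypothetical.

```python
def thresholded_auc(y_true, y_prob, threshold):
    """ROC-AUC of the hard labels obtained at one threshold.

    For a single operating point this equals (sensitivity + specificity) / 2.
    """
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2.0


def best_threshold(y_true, y_prob, start=0.1, stop=1.0, step=0.001):
    """Scan thresholds from start to stop and keep the first best one."""
    n_steps = int(round((stop - start) / step))
    best_t, best_score = start, -1.0
    for i in range(n_steps + 1):
        t = start + i * step  # avoids accumulating floating-point error
        score = thresholded_auc(y_true, y_prob, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy, made-up fold: the positive class is rare and receives low probabilities.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
default_score = thresholded_auc(y_true, y_prob, 0.5)
moved_threshold, moved_score = best_threshold(y_true, y_prob)
```

In this toy fold, the default threshold of 0.5 labels every sample as negative, while a lower threshold separates the two classes.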

Table 2

Hyper-parameters search space of classifiers for optimization.

Classifiers	Hyper-parameters	Range
Extra-Trees	Estimators	600, 700, 800
Extra-Trees	Criterion	Gini, Entropy
Extra-Trees	Max. features	Auto, Sqrt, Log2
SVM	C	0.10 to 1.0, step = 0.10
SVM	Kernel	Linear, Poly, rbf, Sigmoid
SVM	Gamma	Auto, Scale
RF	Estimators	600, 700, 800
RF	Max. features	Auto, Sqrt, Log2
AdaBoost	Estimators	600, 700, 800
AdaBoost	Algorithm	SAMME, SAMME.R
MLP	Hidden layer sizes	(64), (64,64), (128), (128,128)
MLP	Activation	identity, logistic, tanh, relu
MLP	Solver	lbfgs, sgd, adam
MLP	Learning rate	constant, invscaling, adaptive
XGBoost	Estimators	600, 700, 800
XGBoost	Max. depth	4, 5, 6
GBoost	Estimators	600, 700, 800
GBoost	Criterion	friedman_mse, mse
GBoost	Max. features	auto, sqrt, log2
GBoost	Loss	deviance, exponential
LR	Penalty	l1, l2, elasticnet
LR	Solver	newton-cg, lbfgs, liblinear, sag, saga
k-NN	Number of neighbours	5 to 8, step = 1
k-NN	Algorithm	auto, ball_tree, kd_tree, brute
HGBoost	Max. iteration	100 to 600, step = 100
HGBoost	Loss	binary_crossentropy

Table 3

Configurations of different training strategies.

Training Strategy #	Cross-Validation Method	Cross-Validation Folds	Up-sampling Method	Threshold Moving	Hyper-parameters Selection Method
Strategy 1	Stratified	10	N/A	✓	Fixed
Strategy 2	Stratified	10	SMOTE	✓	Fixed
Strategy 3	Stratified	10	SMOTE	✓	Optimized using Nested Cross-Validation with Grid Search
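
Threshold moving, which Table 3 lists for all three strategies, can be sketched in plain Python as below. The text does not state how the per-threshold ROC-AUC is computed; this sketch assumes it is the ROC-AUC of the hard labels produced at that threshold, which for binary predictions equals (sensitivity + specificity) / 2. The function names and the toy labels and probabilities are hypothetical.

```python
def auc_at_threshold(y_true, y_prob, threshold):
    """ROC-AUC of the hard labels obtained at one threshold.

    For binary predictions this equals (sensitivity + specificity) / 2.
    Both classes must be present in y_true.
    """
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def best_threshold(y_true, y_prob, start=0.1, stop=1.0, step=0.001):
    """Scan thresholds from start to stop and keep the first one with the best score."""
    n_steps = int(round((stop - start) / step))
    best_t, best_score = start, -1.0
    for k in range(n_steps + 1):
        t = round(start + k * step, 3)
        score = auc_at_threshold(y_true, y_prob, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy imbalanced fold: the two positives receive probabilities below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.40]

default_score = auc_at_threshold(y_true, y_prob, 0.5)
threshold, score = best_threshold(y_true, y_prob)
print(default_score, threshold, score)
```

In this toy fold the default threshold of 0.50 labels every sample negative, whereas a lower threshold recovers both positives at the cost of one false positive.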


Ensemble-based MCDM

To select an optimized COVID-19 cough diagnostic model, we employ an MCDM method that considers different evaluation criteria. Selecting the best model using one or a few evaluation criteria (accuracy, precision, etc.) is unreliable when the data are biased, i.e., under class imbalance, where most data belong to one class. To address this problem, we use MCDM, which combines several evaluation criteria with higher and lower influence. For example, some evaluation criteria are expected to have high values, such as accuracy and precision, while others are expected to have low values, such as the false positive rate and the false negative rate. One widely accepted approach to MCDM integrates the Entropy and TOPSIS methods: Entropy calculates the weight of each evaluation criterion, and TOPSIS combines these weights with a decision matrix to produce an outcome that reflects the best-performing model. TOPSIS has the following advantages: (1) it is suitable for processing many alternatives and attributes; (2) the process is simple and easy to use; (3) regardless of the number of attributes, it maintains the same processing steps [20]. The core of the TOPSIS method is the decision matrix, which is formed from the evaluation criteria values of each alternative, as defined in Equation (6):

$$D = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \tag{6}$$

where the rows correspond to $A_1, A_2, \ldots, A_m$, the alternatives to be ranked, the columns correspond to the evaluation criteria $C_1, C_2, \ldots, C_n$, and $x_{ij}$ represents the score of the alternative $A_i$ with respect to the criterion $C_j$. The entropy-based weight measures the information in the decision matrix, is a prerequisite of the TOPSIS method, and is used to determine each criterion's weight. We use entropy not only to measure the data quantitatively but also to calculate proportional weight information. We have summarized the complete working steps for determining the weight of each evaluation criterion in Algorithm 1. 
Suppose there are $m$ alternatives and $n$ criteria in the decision matrix $D$, and $x_{ij}$ is the value of the $j$-th criterion for the $i$-th alternative. The algorithm includes several steps: the standardization of the index, the element-wise projection, the measurement of the entropy of the $j$-th index, and the calculation of the weight of each criterion.

We outline the functional steps of the TOPSIS method in Algorithm 2. After the initial steps of TOPSIS, i.e., normalization of the decision matrix and determination of the weighted decision matrix, step 3 in Algorithm 2 defines the ideal best and the ideal worst solutions as follows:

$$V^{+} = \left\{ \max_{i} v_{ij} \mid j \in J^{+};\ \min_{i} v_{ij} \mid j \in J^{-} \right\} \tag{7}$$

$$V^{-} = \left\{ \min_{i} v_{ij} \mid j \in J^{+};\ \max_{i} v_{ij} \mid j \in J^{-} \right\} \tag{8}$$

where $v_{ij}$ denotes an entry of the weighted normalized decision matrix, and $J^{+}$ and $J^{-}$ are the sets of criteria having positive and negative impact, respectively. Step 4 calculates the distance between each feasible solution and the ideal positive and ideal negative solutions. Next, step 5 measures the relative closeness to the ideal solution, and finally, step 6 ranks the alternatives according to the relative closeness value.

We also integrate ensemble methods into MCDM in combination with the multiple training strategies discussed in Section 3.4. The training strategies correspond to different experimental settings, and each experimental setup contains unique optimization parameters. Therefore, an ensemble in MCDM over multiple training strategies is more efficacious than MCDM based on a single training strategy, thus providing a better model choice for diagnosing COVID-19 from cough. We use two ensemble methods (soft ensemble and hard ensemble) to select the best model in MCDM for classifying cough samples as COVID-19 or non-COVID-19, as described in Algorithm 3 (steps of the soft and hard ensemble method during validation). 
In Algorithm 3, steps 1 to 7 measure the relative closeness of MCDM of each model for each training strategy. The soft ensemble averages the relative closeness values over all training strategies and ranks the models according to this average, as described in steps 9 and 10. In the hard ensemble, the outcome of an MCDM is a transformation of the relative closeness score into a vote. The final ensemble aggregates the votes of all training strategies for all alternatives (i.e., classification models) and selects the alternative with the highest number of votes, as shown in step 11.
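
The entropy-based weighting can be illustrated with a short sketch. The exact steps of Algorithm 1 are not reproduced in this text, so the code below assumes the standard entropy weight method (column-wise proportions, Shannon entropy scaled by 1/ln m, and weights proportional to 1 minus the entropy); it is not guaranteed to reproduce the authors' weights. The function name is hypothetical, and the toy matrix takes the Acc., Recall, and FPR values of Extra-Trees, SVM, and RF under strategy 1 from Table 4.

```python
import math


def entropy_weights(matrix):
    """Entropy-based criterion weights for an m x n decision matrix.

    Rows are alternatives, columns are criteria. Entries must be
    non-negative and every column must have a positive sum.
    """
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # scales each entropy into [0, 1]
    diversities = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total  # proportion of alternative i on criterion j
            if p > 0:          # 0 * log(0) is treated as 0
                entropy -= k * p * math.log(p)
        diversities.append(1.0 - entropy)
    total_diversity = sum(diversities)
    return [d / total_diversity for d in diversities]


# Columns: Acc., Recall, FPR (Extra-Trees, SVM, RF; strategy 1, Table 4).
toy_matrix = [
    [0.86, 0.62, 0.02],
    [0.81, 0.54, 0.06],
    [0.85, 0.62, 0.03],
]
weights = entropy_weights(toy_matrix)
print(weights)
```

A criterion whose values vary strongly across alternatives (here FPR) has low entropy and therefore receives a large weight, while a nearly constant criterion receives a weight close to zero.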

Experiments and results

In this section, we present our experimental results to detect COVID-19 from cough sound. We first describe the evaluation criteria used in the experimental evaluation (Section 4.1). After that, we present the classification performance of our approach (Sections 4.2, 4.3) using the Cambridge dataset and the ranking of the classification models using ensemble-based MCDM (Section 4.4). Then, we discuss the feature selection process using the Recursive Feature Elimination with Cross-Validation (RFECV) method and apply this process to all datasets used in this experiment. Finally, we compare our approach with the state-of-the-art approaches and show the results of other datasets.

Evaluation criteria

We use eight standard evaluation metrics: Accuracy (Acc.), Receiver Operating Characteristic-Area Under Curve (ROC-AUC), Precision, Recall, Specificity, F1-score, False Positive Rate (FPR), and False Negative Rate (FNR), reported across all folds of the 10-fold stratified cross-validation.
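
As a reminder of how these criteria relate to one another, the sketch below computes seven of them from confusion-matrix counts and the ROC-AUC from predicted scores using the rank (Mann-Whitney) formulation. The function names and the counts and scores in the usage lines are hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Seven of the eight criteria, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),       # equals 1 - specificity
        "fnr": fn / (fn + tp),       # equals 1 - recall
    }


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half; both classes must be present.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


metrics = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(metrics, auc)
```

FPR and FNR are the complements of specificity and recall, which is why they are minimized while the other six criteria are maximized.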

Prediction performance of the asymptomatic category

We present the decision matrix related to the classification performance of the various classifiers in Table 4 for the asymptomatic category of the Cambridge dataset. The evaluation criterion linked to the upward arrow expects a higher value, while the downward arrow is the opposite. For training strategy 1, the results indicate that the Extra-Trees classifier provides the best performance, with AUC, accuracy, precision, and recall of 0.85, 0.86, 0.93, and 0.62, respectively. In addition, HGBoost and RF classifiers also show excellent performance, with AUC of 0.83 and 0.81, respectively. However, XGBoost classifier manifests relatively low performance, with an AUC of 0.68. We also see that for strategy 2, Extra-Trees, RF, XGBoost, and HGBoost classifiers achieve better performance than other classifiers under most evaluation criteria. In addition, the results confirm that Extra-Trees and HGBoost classifiers can also achieve better classification performance than RF and XGBoost in most evaluation criteria. When training the classifier using strategy 3, we see that RF and GBoost perform better than other classifiers. GBoost and XGBoost can achieve the best AUC of 0.85, but compared to GBoost, XGBoost shows a better recall. The results also show that when we integrate SMOTE during training in strategies 2 and 3, we get an average recall of 0.76 for both strategies compared to 0.56 for strategy 1. Therefore, we can conclude that strategy 2 and strategy 3 would be effective predictors for screening COVID-19.

Table 4

Decision matrix of the proposed method for the asymptomatic category under the three training strategies. The evaluation criteria are divided into two groups based on maximization and minimization: Acc., AUC, Precision, Recall, Specificity, and F1-score should be maximized; in contrast, FPR and FNR should be minimized.

Training strategies	Classifiers	Evaluation Criteria
Training strategies	Classifiers	Acc.(↑)	AUC(↑)	Precision(↑)	Recall(↑)	Specificity(↑)	F1-score(↑)	FPR(↓)	FNR(↓)
Strategy 1	Extra-Trees	0.86	0.85	0.93	0.62	0.98	0.75	0.02	0.38
	SVM	0.81	0.81	0.82	0.54	0.94	0.65	0.06	0.46
	RF	0.85	0.81	0.90	0.62	0.97	0.73	0.03	0.38
	AdBoost	0.82	0.80	0.82	0.55	0.94	0.66	0.06	0.45
	MLP	0.81	0.83	0.84	0.51	0.95	0.63	0.05	0.49
	XGBoost	0.77	0.68	0.90	0.33	0.98	0.48	0.02	0.67
	GBoost	0.81	0.80	0.78	0.57	0.92	0.66	0.08	0.43
	LR	0.80	0.78	0.76	0.57	0.91	0.65	0.09	0.43
	k-NN	0.80	0.80	0.82	0.48	0.95	0.61	0.05	0.52
	HGBoost	0.84	0.83	0.92	0.56	0.98	0.70	0.02	0.44
Strategy 2	Extra-Trees	0.85	0.85	0.76	0.77	0.88	0.76	0.12	0.23
	SVM	0.77	0.79	0.62	0.79	0.77	0.69	0.23	0.21
	RF	0.82	0.84	0.71	0.77	0.85	0.74	0.15	0.23
	AdBoost	0.78	0.79	0.64	0.72	0.81	0.68	0.19	0.28
	MLP	0.80	0.81	0.69	0.70	0.85	0.69	0.15	0.30
	XGBoost	0.83	0.84	0.70	0.79	0.84	0.75	0.16	0.21
	GBoost	0.78	0.80	0.64	0.76	0.80	0.69	0.20	0.24
	LR	0.78	0.79	0.62	0.78	0.78	0.69	0.22	0.22
	k-NN	0.80	0.81	0.68	0.71	0.84	0.69	0.16	0.29
	HGBoost	0.84	0.86	0.72	0.81	0.85	0.76	0.15	0.19
Strategy 3	Extra-Trees	0.84	0.83	0.75	0.74	0.88	0.74	0.12	0.26
	SVM	0.81	0.83	0.67	0.79	0.82	0.72	0.18	0.21
	RF	0.84	0.84	0.75	0.77	0.88	0.76	0.12	0.23
	AdBoost	0.79	0.82	0.65	0.78	0.8	0.71	0.20	0.22
	MLP	0.82	0.82	0.71	0.74	0.86	0.72	0.14	0.26
	XGBoost	0.83	0.85	0.71	0.79	0.85	0.75	0.15	0.21
	GBoost	0.84	0.85	0.74	0.76	0.88	0.75	0.12	0.24
	LR	0.78	0.79	0.63	0.72	0.8	0.68	0.20	0.28
	k-NN	0.79	0.79	0.66	0.70	0.83	0.68	0.17	0.30
	HGBoost	0.83	0.85	0.71	0.79	0.85	0.75	0.15	0.21

Prediction performance of the symptomatic category

The symptomatic category refers to the binary classification of symptomatic COVID-19 versus non-COVID-19, where individuals have been tested for COVID-19 and declare that they have a cough. Using strategy 1, the Extra-Trees and RF classifiers provide the best performance, each with an AUC and accuracy of 0.87. However, Extra-Trees has the better precision, while RF is the best in terms of recall. In contrast, k-NN shows comparatively lower performance, with an accuracy of 0.73 and an AUC of 0.75 (Table 5). For strategy 2, the performance of Extra-Trees and MLP is almost the same: both classifiers provide the same AUC score, but MLP achieves the best accuracy, 0.87. Furthermore, LR provides a recall of 0.89, the best among all classifiers. For strategy 3, Extra-Trees and RF maintain almost the same performance as in strategy 1, while k-NN shows the worst performance. The results also show that SMOTE can effectively deal with the class imbalance problem in the dataset, thereby improving the classification performance in strategies 2 and 3.

Model selection using ensemble-based MCDM

This section presents the results of selecting an optimal diagnostic model for COVID-19 through ensemble-based MCDM. Tables 4 and 5 provide the decision matrices, considering all training strategies, for the asymptomatic and symptomatic categories, respectively. Table 6 shows the entropy-based weights of the decision matrix for all evaluation criteria (Algorithm 1 shows the steps required for the calculation). FPR and FNR carry the maximum weights in strategies 1 and 3 for both the asymptomatic and symptomatic categories. In strategy 2, AUC carries the maximum weight for the asymptomatic category and the second-largest weight, after FNR, for the symptomatic category (Table 6). The criterion with the highest weight is the most important, and the least important criterion has the lowest weight. Next, the normalized decision matrix is multiplied by the weights to obtain the weighted normalized decision matrix, as described in step 2 of Algorithm 2. Furthermore, Table 7 shows the ideal best and ideal worst values generated from the weighted decision matrix, as described in step 3 of Algorithm 2 and Equations (7) and (8).
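
The TOPSIS steps referred to above (normalization, weighting, ideal best and worst, separations, relative closeness) can be sketched as follows. The details of Algorithm 2 are not reproduced in this text, so vector (Euclidean) normalization and Euclidean separations are assumptions here, as are the function name and the toy inputs.

```python
import math


def topsis(matrix, weights, benefit):
    """Relative closeness (higher is better) for each row of `matrix`.

    matrix  : m x n list of lists (alternatives x criteria)
    weights : n criterion weights
    benefit : n booleans, True to maximise a criterion, False to minimise it
    Every column must contain at least one non-zero value.
    """
    n = len(matrix[0])
    # Steps 1-2: vector-normalise each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    # Step 3: ideal best and ideal worst value per criterion.
    best = [max(r[j] for r in v) if benefit[j] else min(r[j] for r in v)
            for j in range(n)]
    worst = [min(r[j] for r in v) if benefit[j] else max(r[j] for r in v)
             for j in range(n)]
    scores = []
    for r in v:
        # Step 4: separations from the ideal best and the ideal worst.
        s_plus = math.sqrt(sum((r[j] - best[j]) ** 2 for j in range(n)))
        s_minus = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in range(n)))
        # Step 5: relative closeness to the ideal solution.
        scores.append(s_minus / (s_plus + s_minus))
    return scores


# Toy example: one benefit criterion (e.g. AUC) and one cost criterion (e.g. FPR).
toy_matrix = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
closeness = topsis(toy_matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(closeness)), key=lambda i: closeness[i], reverse=True)
print(closeness, ranking)
```

An alternative that attains the ideal best on every criterion has zero separation from the ideal best and therefore a relative closeness of exactly 1, which is how a score of 1 such as the one reported for Extra-Trees under strategy 1 in Table 8 can arise.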

Table 5

Training strategies	Classifiers	Evaluation Criteria
Training strategies	Classifiers	Acc.(↑)	AUC(↑)	Precision(↑)	Recall(↑)	Specificity(↑)	F1-score(↑)	FPR(↓)	FNR(↓)
Strategy 1	Extra-Trees	0.87	0.87	1	0.8	1	0.89	0	0.20
	SVM	0.79	0.78	0.93	0.72	0.91	0.81	0.09	0.28
	RF	0.87	0.87	0.96	0.83	0.94	0.89	0.06	0.17
	AdBoost	0.79	0.78	0.93	0.72	0.91	0.81	0.09	0.28
	MLP	0.83	0.81	0.98	0.74	0.97	0.84	0.03	0.26
	XGBoost	0.84	0.81	0.95	0.78	0.94	0.86	0.06	0.22
	GBoost	0.79	0.75	0.93	0.72	0.91	0.81	0.09	0.28
	LR	0.84	0.8	0.95	0.78	0.94	0.86	0.06	0.22
	k-NN	0.73	0.75	0.92	0.63	0.91	0.75	0.09	0.37
	HGBoost	0.77	0.70	0.87	0.74	0.81	0.80	0.19	0.26
Strategy 2	Extra-Trees	0.86	0.86	0.98	0.80	0.97	0.88	0.03	0.20
	SVM	0.84	0.80	0.92	0.81	0.88	0.86	0.13	0.19
	RF	0.84	0.85	0.95	0.78	0.94	0.86	0.06	0.22
	AdBoost	0.80	0.79	0.91	0.76	0.88	0.83	0.13	0.24
	MLP	0.87	0.86	0.94	0.85	0.91	0.89	0.09	0.15
	XGBoost	0.81	0.84	1	0.70	1	0.83	0	0.30
	GBoost	0.86	0.83	0.94	0.83	0.91	0.88	0.09	0.17
	LR	0.86	0.83	0.89	0.89	0.81	0.89	0.19	0.11
	k-NN	0.72	0.74	0.97	0.57	0.97	0.72	0.03	0.43
	HGBoost	0.77	0.74	0.93	0.69	0.91	0.79	0.09	0.31
Strategy 3	Extra-Trees	0.84	0.83	1	0.74	1	0.85	0	0.26
	SVM	0.80	0.79	0.93	0.74	0.91	0.82	0.09	0.26
	RF	0.87	0.85	0.96	0.83	0.94	0.89	0.06	0.17
	AdBoost	0.83	0.80	0.95	0.76	0.94	0.85	0.06	0.24
	MLP	0.83	0.78	0.88	0.83	0.81	0.86	0.19	0.17
	XGBoost	0.84	0.81	0.92	0.81	0.88	0.86	0.13	0.19
	GBoost	0.88	0.87	0.98	0.83	0.97	0.90	0.03	0.17
	LR	0.78	0.76	0.95	0.69	0.94	0.80	0.06	0.31
	k-NN	0.69	0.71	0.97	0.52	0.97	0.67	0.03	0.48
	HGBoost	0.83	0.80	0.91	0.80	0.88	0.85	0.13	0.20

Table 6

Evaluation criteria and weights based on the entropy of all categories.

Category	Training Strategies	Evaluation criteria
Category	Training Strategies	Acc.	AUC	Precision	Recall	Specificity	F1-score	FPR	FNR
Asymptomatic	Strategy 1	0.10	0.06	0.13	0.06	0.11	0.07	0.25	0.22
	Strategy 2	0.13	0.19	0.14	0.09	0.09	0.18	0.09	0.10
	Strategy 3	0.10	0.10	0.09	0.08	0.12	0.11	0.19	0.21
Symptomatic	Strategy 1	0.13	0.13	0.11	0.10	0.09	0.12	0.15	0.16
	Strategy 2	0.09	0.16	0.14	0.09	0.10	0.08	0.15	0.18
	Strategy 3	0.07	0.09	0.10	0.07	0.09	0.07	0.15	0.37

Table 7

The results of the ideal best and the ideal worst value of each task for each training strategy.

Category	Evaluation criteria	Strategy 1 V⁺	Strategy 1 V⁻	Strategy 2 V⁺	Strategy 2 V⁻	Strategy 3 V⁺	Strategy 3 V⁻
Asymptomatic	Acc.	0.032	0.028	0.042	0.038	0.031	0.029
	AUC	0.020	0.016	0.064	0.059	0.034	0.032
	Precision	0.045	0.037	0.049	0.040	0.032	0.027
	Recall	0.023	0.013	0.030	0.026	0.025	0.022
	Specificity	0.037	0.034	0.029	0.025	0.041	0.037
	F1-score	0.025	0.016	0.059	0.053	0.036	0.032
	FPR	0.029	0.131	0.019	0.036	0.045	0.075
	FNR	0.057	0.100	0.026	0.041	0.058	0.083
Symptomatic	Acc.	0.043	0.036	0.031	0.026	0.024	0.019
	AUC	0.046	0.037	0.055	0.047	0.030	0.025
	Precision	0.036	0.031	0.046	0.041	0.034	0.030
	Recall	0.036	0.028	0.034	0.022	0.024	0.015
	Specificity	0.032	0.026	0.035	0.028	0.029	0.024
	F1-score	0.041	0.035	0.028	0.023	0.022	0.017
	FPR	0	0.102	0	0.092	0	0.095
	FNR	0.033	0.073	0.025	0.096	0.077	0.216

Table 5 gives the decision matrix of the proposed method for the symptomatic category under the three training strategies. As in Table 4, the evaluation criteria fall into two groups: Acc., AUC, Precision, Recall, Specificity, and F1-score should be maximized, whereas FPR and FNR should be minimized.

According to Table 7, each COVID-19 diagnostic model differs from the ideal best and ideal worst values on each criterion. Before calculating the relative closeness value, we need to measure two separations, S⁺ and S⁻, which reflect how close each classifier is to the ideal best and the ideal worst, respectively (see step 4 of Algorithm 2). The underlying hypothesis is that the best model has the smallest S⁺ value and a relatively high S⁻ value compared with the other models. Table 8 shows the relative closeness value (C_mj) of each of the ten classifiers for each training strategy, obtained using step 5 of Algorithm 2. We integrate these relative closeness values into the ensemble methods (soft ensemble and hard ensemble) to rank the models. In the soft ensemble, we average the relative closeness values and base the final ranking on the average; the highest average value reflects the best model. In this way, Extra-Trees becomes the top model for both the asymptomatic and symptomatic categories. In the hard ensemble, we assign points (C_mjP) from 1 to 10 to each C_mj value, where the highest point is assigned to the highest C_mj. If two or more models have the same C_mj value, we assign them the same point. Summing all the points gives the top-ranked model. Table 8 shows that, under the hard ensemble, HGBoost is the best model for the asymptomatic category, while the Extra-Trees classifier is at the top for the symptomatic category.
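
The soft and hard ensembles can be sketched with the asymptomatic relative closeness scores reported in Table 8. The tie rule used here (tied models share the larger point) is inferred from the points shown in Table 8 rather than stated in the text, and the function names are hypothetical.

```python
# Relative closeness per training strategy (asymptomatic category, Table 8).
closeness = {
    "Extra-Trees": [1.000, 0.806, 0.701],
    "SVM": [0.478, 0.370, 0.535],
    "RF": [0.871, 0.683, 0.867],
    "AdBoost": [0.483, 0.256, 0.422],
    "MLP": [0.579, 0.428, 0.614],
    "XGBoost": [0.690, 0.699, 0.736],
    "GBoost": [0.314, 0.351, 0.807],
    "LR": [0.267, 0.357, 0.132],
    "k-NN": [0.561, 0.405, 0.262],
    "HGBoost": [0.920, 0.806, 0.736],
}


def soft_ensemble(scores):
    """Average the relative closeness over all training strategies."""
    return {name: sum(values) / len(values) for name, values in scores.items()}


def hard_ensemble(scores):
    """Convert each strategy's closeness into points and sum the points."""
    names = list(scores)
    n_strategies = len(scores[names[0]])
    totals = dict.fromkeys(names, 0)
    for s in range(n_strategies):
        column = [scores[name][s] for name in names]
        for name in names:
            # Points = number of models scoring no higher than this one,
            # so tied models share the larger point.
            totals[name] += sum(1 for c in column if c <= scores[name][s])
    return totals


soft = soft_ensemble(closeness)
hard = hard_ensemble(closeness)
soft_best = max(soft, key=soft.get)
hard_best = max(hard, key=hard.get)
print(soft_best, hard_best)
```

With these inputs the soft ensemble ranks Extra-Trees first and the hard ensemble ranks HGBoost first, with point totals of 26 and 27, in agreement with the asymptomatic rows of Table 8.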

Table 8

Results of MCDM with integration of ensemble.

Category	Classifiers	C_m1	C_m2	C_m3	Soft: Avg.(C_mj)	Soft: Rank	C_m1P	C_m2P	C_m3P	Hard: Total(C_mjP)	Hard: Rank
Asymptomatic	Extra-Trees	1	0.806	0.701	0.835	1	10	10	6	26	2
	SVM	0.478	0.370	0.535	0.461	7	3	4	4	11	8
	RF	0.871	0.683	0.867	0.807	3	8	7	10	25	3
	AdBoost	0.483	0.256	0.422	0.387	9	4	1	3	8	9
	MLP	0.579	0.428	0.614	0.540	5	6	6	5	17	5
	XGBoost	0.690	0.699	0.736	0.708	4	7	8	8	23	4
	GBoost	0.314	0.351	0.807	0.490	6	2	2	9	13	6
	LR	0.267	0.357	0.132	0.252	10	1	3	1	5	10
	k-NN	0.561	0.405	0.262	0.409	8	5	5	2	12	7
	HGBoost	0.920	0.806	0.736	0.821	2	9	10	8	27	1
Symptomatic	Extra-Trees	0.947	0.790	0.772	0.836	1	10	10	8	28	1
	SVM	0.515	0.484	0.647	0.548	8	5	4	4	13	7
	RF	0.717	0.675	0.837	0.743	2	8	8	9	25	2
	AdBoost	0.515	0.427	0.743	0.561	7	5	2	7	14	6
	MLP	0.784	0.643	0.596	0.674	5	9	7	3	19	4
	XGBoost	0.693	0.694	0.672	0.686	3	7	9	6	22	3
	GBoost	0.514	0.626	0.915	0.685	4	3	6	10	19	4
	LR	0.692	0.440	0.589	0.573	6	7	3	2	12	8
	k-NN	0.457	0.511	0.362	0.443	9	2	5	1	8	9
	HGBoost	0.176	0.467	0.662	0.435	10	1	1	5	7	10

-The underlined boldface indicates the highest-ranked models.

After analyzing the results of integrating MCDM (Table 8), we can say that the proposed method using the Extra-Trees and HGBoost classifiers is better than other classifiers. Table 9 shows the comparison of the detection of asymptomatic and symptomatic COVID-19 from cough samples using the Extra-Trees and HGBoost classifiers based on our proposed method. For the asymptomatic category, we see that our proposed method's AUC using the HGBoost classifier is higher than the Extra-Trees classifier. The Extra-Trees classifier shows higher precision, but the AUC and recall rate lag behind HGBoost. When comparing the precision results, for the symptomatic category, we see that the Extra-Trees classifier shows impressive results when classifying COVID-19 symptomatic cough, with a precision rate of 1. On the other hand, HGBoost achieves a recall of 0.80, which is higher than Extra-Trees.

Table 9

Comparison of the proposed methods for COVID-19 cough detection.

Category	Method	AUC	Precision	Recall
Asymptomatic	Proposed (Audio Features + Extra-Trees)	0.83	0.75	0.74
Asymptomatic	Proposed (Audio Features + HGBoost)	0.85	0.71	0.79
Symptomatic	Proposed (Audio Features + Extra-Trees)	0.83	1	0.74
Symptomatic	Proposed (Audio Features + HGBoost)	0.80	0.91	0.80

-Bold values indicate the highest.

Fig. 3 shows the confusion matrices of the proposed method using the Extra-Trees classifier for all training strategies. In Fig. 3(b)–(c), for COVID-19 asymptomatic cough detection, strategy 2 provides results that are 3% better than strategy 3. Moreover, the proposed method can effectively detect non-COVID-19 asymptomatic coughs, with identical performance under strategies 2 and 3. Although strategy 1 shows relatively low performance compared to the other strategies for asymptomatic COVID-19 cough detection, it outperforms the other strategies for the symptomatic category. When comparing strategies 2 and 3 for non-COVID-19 symptomatic cough detection, the Extra-Trees classifier provides excellent results with strategy 3. In addition, for asymptomatic and symptomatic COVID-19 cough detection, training strategy 2 outperforms strategy 3 by 3% to 6%.

Fig. 3

Normalized confusion matrices of the Extra-Trees classifier with 10-fold cross-validation for all training strategies. Panels (a)–(c) show the confusion matrices for the asymptomatic category, and panels (d)–(f) those for the symptomatic category. The values for each class sum to 1. Note that 0 represents COVID-19 and 1 represents non-COVID-19 cough.

Feature dimension reduction

We analyze the effect of feature dimensionality reduction on the asymptomatic and symptomatic categories using recursive feature elimination with cross-validation (RFECV), which uses feature importance weights together with cross-validation to select the number of features automatically. We fit the method with three supervised learning estimators that provide information about feature importance: Extra-Trees, LinearSVC, and LDA. Fig. 4a shows the optimal number of selected features using the different estimators for the asymptomatic category. The Extra-Trees estimator achieves a reasonably good AUC score, exceeding 0.80, with its best feature subset. The other estimators, LinearSVC and LDA, achieve lower AUC than Extra-Trees. The Extra-Trees estimator selects 38 best features (18 MFCC, 6 Chromagram, 9 Mel-Scaled Spectrogram, and 5 Spectral Contrast), whereas the LinearSVC and LDA estimators select 6 (2 MFCC, 2 Mel-Scaled Spectrogram, and 2 Spectral Contrast) and 78 (8 MFCC, 1 Chromagram, 65 Mel-Scaled Spectrogram, and 4 Spectral Contrast), respectively.
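
A minimal sketch of RFECV with an Extra-Trees estimator, using scikit-learn, is given below. The synthetic data, the number of trees, the inner 5-fold split, and the random seeds are assumptions made so the example runs quickly; they are not the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the audio feature matrix (assumption).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

selector = RFECV(
    estimator=ExtraTreesClassifier(n_estimators=50, random_state=0),
    step=1,                      # drop one feature per elimination round
    min_features_to_select=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",           # keep the subset size with the best mean AUC
)
selector.fit(X, y)

X_reduced = selector.transform(X)  # keeps only the selected columns
print(selector.n_features_, X_reduced.shape)
```

Any estimator that exposes `feature_importances_` or `coef_` after fitting can be substituted for the Extra-Trees estimator, which is what allows different estimators to be compared.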

Fig. 4

Optimal numbers of feature selection using recursive feature elimination with cross-validation for Cambridge asymptomatic and symptomatic categories. Note that RFECV stands for Recursive Feature Elimination with Cross-Validation.

For symptomatic, we observe a similar trend in Fig. 4b. Extra-Trees obtains a higher AUC than LinearSVC and LDA while retaining the best features. The Extra-Trees estimator selects a total of 6 (3 MFCC, 1 Mel-Scaled Spectrogram, and 2 Spectral Contrast) best features, while the LinearSVC and LDA estimators select a total of 1 (1 Tonal Centroid) and 3 (all 3 from Spectral Contrast) best features, respectively. Here, we observe that both categories (i.e., asymptomatic and symptomatic) produce comparable AUC scores while using Extra-Trees as an estimator, but the symptomatic category retains fewer features than the asymptomatic category.

Comparison

Table 10 compares our proposed model with integrated feature selection against the state-of-the-art models for detecting COVID-19 from cough samples.

Table 10

Comparison of our proposed approach with the state-of-the-art approaches.

Dataset		Method	AUC	Precision	Recall
Cambridge	Asymptomatic	Brown et al. [12]	0.80	0.72	0.69
		Proposed (RFECV + Extra-Trees)	0.88	0.75	0.81
		Proposed (RFECV + HGBoost)	0.85	0.76	0.73
	Symptomatic	Brown et al. [12]	0.87	0.70	0.90
		Muhammad et al. [35]	-	0.87	0.82
		Proposed (RFECV + Extra-Trees)	0.95	1	0.91
		Proposed (RFECV + HGBoost)	0.81	0.93	0.80
Coswara		Proposed (RFECV + Extra-Trees)	0.64	0.70	0.58
Coswara		Proposed (RFECV + HGBoost)	0.66	0.76	0.47
Virufy		Proposed (RFECV + Extra-Trees)	0.92	0.89	0.88
Virufy		Proposed (RFECV + HGBoost)	0.94	0.89	0.98
Virufy + NoCoCoDa		Melek [41]	0.99	0.99	0.97
		Proposed (RFECV + Extra-Trees)	0.97	1	0.92
		Proposed (RFECV + HGBoost)	0.98	0.99	0.98
Combined dataset		Proposed (RFECV + Extra-Trees)	0.79	0.61	0.67
Combined dataset		Proposed (RFECV + HGBoost)	0.78	0.61	0.66

-Bold values indicate the highest.

The purpose is not to make a direct comparison, except with the work in [12], because the implementation details of the other works are not available or their datasets differ from ours. When comparing the “with feature selection” and “no feature selection” approaches for the asymptomatic category, the AUC and recall of our proposed Extra-Trees classifier with feature selection are higher, at 0.88 and 0.81, respectively. Likewise, our proposed method with feature selection provides considerably better results than without feature selection for the symptomatic category. Note that the “no feature selection” results are reported in Table 9, while Table 10 shows the “with feature selection” results. With the feature selection step, the Extra-Trees classifier performs better than the HGBoost classifier. Compared with Brown et al. [12] in the asymptomatic category, our proposed method achieves a higher AUC and recall using the Extra-Trees classifier. Moreover, HGBoost achieves a precision of 0.76, the highest among the compared methods. HGBoost shows a better result than the previous study [12], but its AUC and recall lag behind those of Extra-Trees. For the symptomatic category, the proposed method using the Extra-Trees classifier outperforms the previous study [12]. The Extra-Trees classifier also shows impressive results when classifying COVID-19 symptomatic cough, with a precision of 1. Brown et al. [12] achieved a recall of 0.90, which is comparable to that of Extra-Trees. In addition, the overall precision of the model in [35] is 0.87, whereas the precision of the proposed method for the symptomatic category is 1. However, the dataset setting of the symptomatic category in [35] is different from ours. 
For the Coswara dataset, the precision and recall of the Extra-Trees classifier are 0.70 and 0.58, respectively. The HGBoost classifier shows better AUC and precision than the Extra-Trees classifier, but it lags significantly behind in recall. For the Virufy dataset, the AUC, precision, and recall for detecting COVID-19 are 0.94, 0.89, and 0.98, respectively, which indicates that our proposed model has high detection performance with the HGBoost classifier. When Virufy is combined with the NoCoCoDa dataset, our proposed model achieves high AUC values of 0.97 and 0.98 for Extra-Trees and HGBoost, respectively, which means that our model has low false negative and false positive rates. In addition, the recall of the HGBoost classifier is as high as 0.98. Such a high recall means that our proposed model produces very few false negatives for COVID-19, making it suitable as a screening tool for detecting COVID-19. Our detection performance is almost the same as that of Melek [41], but Melek [41] considered 59 COVID-19 samples from the NoCoCoDa dataset, while we considered all 73 COVID-19 samples. We further created a dataset by combining all the datasets and applied our approach to it to address the question of practical use in the field, since in real life such a classifier is not limited to operating on a specific set of people. The results presented in Table 10 for the combined dataset show that although the HGBoost classifier provides the same precision as the Extra-Trees classifier, the Extra-Trees classifier outperforms HGBoost in terms of AUC and recall.

Conclusion and future work

In this paper, we present an ensemble-based MCDM method for detecting COVID-19 from cough samples. In particular, we address the challenge of selecting the best classification model considering eight evaluation criteria that vary among themselves. First, we generate features from the audio analysis of cough samples. In the training process, we consider three training strategies with different parameter settings to assess the effectiveness of various aspects of the proposed method. After that, we construct a decision matrix of ten ML-driven classifiers with eight evaluation criteria for each training strategy. Next, the proposed method integrates TOPSIS to rank the models of each training strategy, where the weights of the evaluation criteria are calculated using entropy. Subsequently, using ensemble methods, namely the soft ensemble and the hard ensemble, the best COVID-19 diagnostic model is identified based on quantitative information (such as the average of, and the votes corresponding to, the relative closeness values). The reason for choosing the ensemble strategy is that it reduces the bias in selecting the best model, as the relative closeness values of different training strategies significantly affect the model's ranking. Our empirical evaluation shows that the proposed method using the Extra-Trees and HGBoost classifiers provides better results. It also confirms that the tree-based ensemble learning classifiers perform better than the non-tree-based ensemble learning classifiers. Furthermore, we believe that our approach could also be used in other application domains, including epileptic seizure detection [54], atrophic gastritis screening [55], and time series classification [56]. In future work, we will study cross-institutional datasets to enhance our proposed method. Furthermore, we will apply deep learning models for COVID-19 cough classification. In addition, we plan to analyze the severity of COVID-19 using cough sounds.

Declaration of competing interest

All authors declare that there is no conflict of interest in this work.
