Madhurananda Pahar (1), Marisa Klopper (2), Robin Warren (3), Thomas Niesler (4)
1. Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa. Electronic address: mpahar@sun.ac.za.
2. SAMRC Centre for Tuberculosis Research, DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, South Africa. Electronic address: marisat@sun.ac.za.
3. SAMRC Centre for Tuberculosis Research, DSI-NRF Centre of Excellence for Biomedical Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, South Africa. Electronic address: rw1@sun.ac.za.
4. Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa. Electronic address: trn@sun.ac.za.
Abstract
We present a machine learning based COVID-19 cough classifier which can discriminate COVID-19 positive coughs from both COVID-19 negative and healthy coughs recorded on a smartphone. This type of screening is non-contact, easy to apply, and can reduce the workload in testing centres as well as limit transmission by recommending early self-isolation to those who have a cough suggestive of COVID-19. The datasets used in this study include subjects from all six continents and contain both forced and natural coughs, indicating that the approach is widely applicable. The publicly available Coswara dataset contains 92 COVID-19 positive and 1079 healthy subjects, while the second smaller dataset was collected mostly in South Africa and contains 18 COVID-19 positive and 26 COVID-19 negative subjects who have undergone a SARS-CoV laboratory test. Both datasets indicate that COVID-19 positive coughs are 15%-20% shorter than non-COVID coughs. Dataset skew was addressed by applying the synthetic minority oversampling technique (SMOTE). A leave-p-out cross-validation scheme was used to train and evaluate seven machine learning classifiers: logistic regression (LR), k-nearest neighbour (KNN), support vector machine (SVM), multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM) and a residual-based neural network architecture (Resnet50). Our results show that although all classifiers were able to identify COVID-19 coughs, the best performance was exhibited by the Resnet50 classifier, which was best able to discriminate between the COVID-19 positive and the healthy coughs with an area under the ROC curve (AUC) of 0.98. An LSTM classifier was best able to discriminate between the COVID-19 positive and COVID-19 negative coughs, with an AUC of 0.94 after selecting the best 13 features from a sequential forward selection (SFS). 
Since this type of cough audio classification is cost-effective and easy to deploy, it is potentially a useful and viable means of non-contact COVID-19 screening.
COVID-19 (COrona VIrus Disease of 2019), caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) virus, was declared a global pandemic on March 11, 2020 by the World Health Organisation (WHO). It is a new coronavirus but similar to other coronaviruses, including SARS-CoV (severe acute respiratory syndrome coronavirus) and MERS-CoV (Middle East respiratory syndrome coronavirus), which caused disease outbreaks in 2002 and 2012, respectively [1,2].

The most common symptoms of COVID-19 are fever, fatigue and dry coughs [3]. Other symptoms include shortness of breath, joint pain, muscle pain, gastrointestinal symptoms and loss of smell or taste [4]. At the time of writing, there were 142.1 million active cases of COVID-19 globally, and there had been 3 million deaths, with the USA reporting the highest number of cases (31.7 million) and deaths (567,729) [5]. The scale of the pandemic has caused some health systems to be overrun by the need for testing and the management of cases.

Several attempts have been made to identify early symptoms of COVID-19 through the use of artificial intelligence applied to images. The 50-layer residual neural network (Resnet50) architecture has been shown to perform better than other pre-trained models such as AlexNet, GoogLeNet and VGG16 in these tasks. For example, it has been demonstrated that COVID-19 can be detected from computed tomography (CT) images with an accuracy of 96.23% by using a Resnet50 architecture [6]. The same architecture was shown to detect pneumonia due to COVID-19 with an accuracy of 96.7% [7] and to detect COVID-19 from x-ray images with an accuracy of 96.30% [8].

Coughing is one of the predominant symptoms of COVID-19 [9] and also a symptom of more than 100 other diseases, and its effect on the respiratory system is known to vary [10]. For example, lung diseases can cause the airway to be either restricted or obstructed and this can influence the acoustics of the cough [11]. It has also been postulated that the glottis behaves differently under different pathological conditions [12,13] and this makes it possible to distinguish between coughs due to TB [14,15], asthma [16], bronchitis and pertussis (whooping cough) [17–20].

Respiratory data such as breathing, sneezing, speech, eating behaviour and coughing can be processed by machine learning algorithms to diagnose respiratory illness [21–23]. Simple machine learning tools, like binary classifiers, are able to distinguish COVID-19 respiratory sounds from healthy counterparts with an area under the ROC curve (AUC) exceeding 0.80 [24]. Detecting COVID-19 by analysing only the cough sounds is also possible. AI4COVID-19 is a mobile app that records 3 s of cough audio which is analysed automatically to provide an indication of COVID-19 status within 2 min [25]. A deep neural network (DNN) was shown to distinguish between COVID-19 and other coughs with an accuracy of 96.83% on a dataset containing 328 coughs from 150 patients of four different classes: COVID-19, asthma, bronchitis and healthy [26]. There appear to be unique patterns in COVID-19 coughs that allow a pre-trained Resnet18 classifier to identify COVID-19 coughs with an AUC of 0.72. In this case, cough samples were collected over the phone from 3621 individuals with confirmed COVID-19 [27]. COVID-19 coughs were classified with a higher AUC of 0.97 (sensitivity = 98.5% and specificity = 94.2%) by a Resnet50 architecture, trained on coughs from 4256 subjects and evaluated on 1064 subjects that included both COVID-19 positive and COVID-19 negative subjects by implementing four biomarkers [28]. A high AUC exceeding 0.98 was also achieved in Ref. [29] when discriminating COVID-19 positive coughs from COVID-19 negative coughs on a clinically validated dataset consisting of 2339 COVID-19 positive and 6041 COVID-19 negative subjects using DNN based classifiers.

Data collection from COVID-19 patients is challenging and the datasets are often not publicly available. Nevertheless, efforts have been made to compile such datasets. For example, a dataset consisting of coughing sounds recorded during or after the acute phase of COVID-19 from patients via public media interviews has been developed in Ref. [30]. The Coswara dataset is publicly available and collected in a more controlled and targeted manner [31]. At the time of writing, this dataset included useable ‘deep cough’ (i.e. loud cough) recordings from 92 COVID-19 positive and 1079 healthy subjects. We have also begun to compile our own dataset by collecting recordings from subjects who have undergone a SARS-CoV laboratory test in South Africa. This Sarcos (SARS COVID-19 South Africa) dataset is currently still small and includes only 44 subjects (18 COVID-19 positive and 26 COVID-19 negative).

Both the Coswara and Sarcos datasets are imbalanced since COVID-19 positive subjects are outnumbered by non-COVID-19 subjects. Nevertheless, collectively these two datasets contain recordings from all six continents, as shown in Fig. 1. To improve machine learning classification performance, we have applied the synthetic minority over-sampling technique (SMOTE) to balance our datasets. Furthermore, we have found that the COVID-19 positive coughs are 15%–20% shorter than non-COVID coughs. Hence, feature extraction is designed to preserve the time-domain patterns over an entire cough. Classifier hyperparameters were optimised by using leave-p-out cross-validation, followed by training and evaluation of machine learning approaches, namely logistic regression (LR), k-nearest neighbour (KNN), support vector machine (SVM), multilayer perceptron (MLP) and deep neural networks (DNN) such as a convolutional neural network (CNN), long short-term memory (LSTM) and Resnet50. The Resnet50 produced the highest AUC of 0.976 ≈ 0.98 when trained and evaluated on the Coswara dataset, outperforming the baseline results presented in Ref. [32].

No classifier has been trained on the Sarcos dataset due to its small size. It also cannot be combined with Coswara as it contains slightly different classes. Instead, this dataset has been used for an independent validation of the best-performing DNN classifiers developed on the Coswara dataset. In these validation experiments, it was found that the highest AUC of 0.938 ≈ 0.94 is achieved when using the best 13 features identified using the greedy sequential forward selection (SFS) algorithm and an LSTM classifier. We conclude that it is possible to identify COVID-19 on the basis of cough audio recorded using a smartphone. Furthermore, this discrimination between COVID-19 positive and both COVID-19 negative and healthy coughs is possible for audio samples collected from subjects located all over the world. However, additional validation is still required to obtain approval from regulatory bodies for use as a diagnostic tool.
Fig. 1
Location of participants in the Coswara and the Sarcos datasets: Participants in the Coswara dataset were located on five different continents, excluding Africa. The majority (91%) of participants in the Coswara dataset are from Asia, as indicated in Fig. 2. Sarcos participants who supplied geographical information are mostly (75%) from South Africa, as shown in Fig. 3.
Fig. 2
Coswara dataset at the time of experimentation: There are 1079 healthy and 92 COVID-19 positive subjects in the pre-processed dataset, used for feature extraction and classifier training. Most of the subjects are aged between 20 and 50. There are 282 female and 889 male subjects and most of them are from Asia. Subjects are from five continents: Asia (Bahrain, Bangladesh, China, India, Indonesia, Iran, Japan, Malaysia, Oman, Philippines, Qatar, Saudi Arabia, Singapore, Sri Lanka, United Arab Emirates), Australia, Europe (Belgium, Finland, France, Germany, Ireland, Netherlands, Norway, Romania, Spain, Sweden, Switzerland, Ukraine, United Kingdom), North America (Canada, United States), and South America (Argentina, Mexico).
Fig. 3
Sarcos dataset at the time of experimentation: There are 26 COVID-19 negative and 18 COVID-19 positive subjects in the processed dataset. Unlike the Coswara dataset, there are more female than male subjects. Most of the subjects had their lab test performed within two weeks of participation. Only 19 of the subjects reported coughing as a symptom, and for these the reported duration of coughing symptoms was variable. There were 33 subjects from Africa (South Africa), 1 from South America (Brazil), 1 from Asia (India) and the rest declined to specify their geographic location.
Data
We have used two datasets in our experimental evaluation: the Coswara dataset and the Sarcos dataset.
The Coswara dataset
The Coswara project is aimed at developing a diagnostic tool for COVID-19 based on respiratory, cough and speech sounds [31]. Public participants were asked to contribute cough recordings via a web-based data collection platform using their smartphones (https://coswara.iisc.ac.in). The collected audio data includes fast and slow breathing, deep and shallow coughing, phonation of sustained vowels and spoken digits. Age, gender, geographical location, current health status and pre-existing medical conditions are also recorded. Health status is recorded as ‘healthy’, ‘exposed’, ‘cured’ or ‘infected’. Audio recordings were sampled at 44.1 kHz and subjects were from all continents except Africa, as shown in Fig. 2. In this study, we have made use of the raw audio recordings and applied pre-processing as described in Section 2.3.
The Sarcos dataset
A similar initiative in South Africa encouraged participants to voluntarily record their coughs using an online platform (https://coughtest.online) under the research project name: ‘COVID-19 screening by cough sound analysis’. Only coughs were collected as audio samples, and only subjects who had recently undergone a SARS-CoV laboratory test were asked to participate. The sampling rate for the audio recordings was 44.1 kHz. In addition to providing the cough audio recordings, subjects gave informed consent and were presented with a voluntary and anonymous questionnaire. The questionnaire prompted for the following information:

- Age and gender.
- Whether tested by an authorised COVID-19 testing centre.
- Days since the test was performed.
- Lab result (COVID-19 positive or negative).
- Country of residence.
- Known contact with COVID-19 positive patient.
- Known lung disease.
- Symptoms and temperature.
- Whether they are a regular smoker.
- Whether they have a current cough and for how many days.

Among the 44 participants, 33 (75%) subjects asserted that they are South African residents and therefore represent the African continent, as shown in Fig. 3. As there were no subjects from Africa in the Coswara dataset, together the Coswara and Sarcos datasets include subjects from all six continents.
Data pre-processing
The raw cough audio recordings from both datasets have a sampling rate (μ) of 44.1 kHz and are subjected to the simple pre-processing steps described below. We denote the time-window length by λ = 0.05 s and the amplitude threshold by Φ = 0.005. Both values were determined manually and interactively, and the silence removal was validated by visual inspection in all cases.

The original cough audio is first normalised by following Equation (1). The normalised signal is then divided into consecutive time-windows I of μλ samples each, as defined in Equation (2), so that window j covers samples jμλ to (j + 1)μλ, with j running over the full length Ξ of the signal. For example, when j = 0, the window is the portion of the signal from sample 0 to sample 2205, as μ = 44100 Hz and λ = 0.05 s. The processed final cough audio, denoted C(t) and shown in Fig. 4, is calculated by following Equation (3), where ⊕ denotes concatenation: C(t) is the concatenation of those windows whose normalised amplitudes, taken over all i ∈ I, satisfy the threshold Φ.

Thus, the amplitudes of the raw audio data in the Coswara and the Sarcos datasets were normalised, after which periods of silence were removed from the signal to within a 50 ms margin using a simple energy detector. Fig. 4 shows an example of the original raw audio, as well as the pre-processed audio.

Fig. 4
A processed COVID-19 cough audio which is shorter than the original cough audio but keeps all spectrum resolution. Amplitudes are normalised and extended silences are removed in the pre-processing.

After pre-processing, the Coswara dataset contains 92 COVID-19 positive and 1079 healthy subjects and the Sarcos dataset contains 18 COVID-19 positive and 26 COVID-19 negative subjects, as summarised in Table 1. In both datasets, COVID-19 positive coughs are 15%–20% shorter than non-COVID coughs.
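The pre-processing described in this section (amplitude normalisation followed by removal of silent 50 ms windows) might be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `preprocess_cough`, the choice of peak normalisation, and the rule that a window is kept when any of its samples reaches the threshold are all assumptions, since the exact forms of Equations (1) to (3) are not reproduced here.

```python
import numpy as np

def preprocess_cough(c, mu=44100, lam=0.05, phi=0.005):
    """Normalise a cough signal and drop windows that look silent.

    mu  : sampling rate in Hz (44.1 kHz in both datasets)
    lam : time-window length in seconds (0.05 s)
    phi : amplitude threshold (0.005)
    """
    c = np.asarray(c, dtype=float)
    peak = np.max(np.abs(c)) if c.size else 0.0
    if peak == 0.0:
        return np.array([])              # the recording contains only silence
    c_n = c / peak                       # ASSUMPTION: peak normalisation to [-1, 1]
    win = int(round(mu * lam))           # 2205 samples for 44.1 kHz and 0.05 s
    kept = []
    for start in range(0, len(c_n), win):
        window = c_n[start:start + win]
        # ASSUMPTION: keep a window if any sample magnitude reaches phi
        if np.max(np.abs(window)) >= phi:
            kept.append(window)
    # Concatenate the retained windows to form the processed signal
    return np.concatenate(kept) if kept else np.array([])
```

Because silence is removed in whole windows of μλ samples, at most one window of silence can survive on either side of a cough, which matches the 50 ms margin mentioned in the text.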
Table 1
Summary of the Coswara and Sarcos Datasets: In the Coswara dataset, there were 1171 subjects with useable ‘deep cough’ recordings, 92 of whom were COVID-19 positive while 1079 were healthy. This amounts to a total of 1.05 h of cough audio recordings (after pre-processing) that will be used for experimentation. The Sarcos dataset contains data from a total of 44 subjects, 18 of whom are COVID-19 positive and 26 who are not. This amounts to a total of 2.45 min of cough audio recordings (after pre-processing) that has been used for experimentation. COVID-19 positive coughs are 15%–20% shorter than non-COVID coughs.

| Dataset | Label | Subjects | Total audio | Average per subject | Standard deviation |
|---|---|---|---|---|---|
| Coswara | COVID-19 Positive | 92 | 4.24 min | 2.77 s | 1.62 s |
| Coswara | Healthy | 1079 | 0.98 h | 3.26 s | 1.66 s |
| Coswara | Total | 1171 | 1.05 h | 3.22 s | 1.67 s |
| Sarcos | COVID-19 Positive | 18 | 0.87 min | 2.91 s | 2.23 s |
| Sarcos | COVID-19 Negative | 26 | 1.57 min | 3.63 s | 2.75 s |
| Sarcos | Total | 44 | 2.45 min | 3.34 s | 2.53 s |
Dataset balancing
Table 1 shows that COVID-19 positive subjects are under-represented in both datasets. To compensate for this imbalance, which can detrimentally affect machine learning [33,34], we have applied SMOTE data balancing to create an equal number of COVID-19 positive coughs during training [35,36]. This technique has previously been successfully applied to cough detection and classification based on audio recordings [15,18,37].

SMOTE oversamples the minority class by generating synthetic examples rather than, for example, by random oversampling. In our dataset, for each COVID-19 positive cough x, 5 other COVID-19 positive coughs were randomly chosen and the closest in terms of the Euclidean distance was identified as x_NN. The synthetic COVID-19 positive samples were then created using Equation (4):

$$x_{\mathrm{SMOTE}} = x + u \cdot (x_{NN} - x) \tag{4}$$

The multiplicative factor u is uniformly distributed between 0 and 1 [38].

We have also implemented other extensions of SMOTE such as borderline-SMOTE [39,40] and adaptive synthetic sampling [41]. However, the best results were obtained by using SMOTE without any modification.
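The oversampling step described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `smote_like` and its parameters are hypothetical, each cough is assumed to be represented by a flattened feature vector, and the base cough for each synthetic sample is drawn at random, whereas the text describes processing each positive cough.

```python
import numpy as np

def smote_like(X_pos, n_new, n_candidates=5, seed=0):
    """Create n_new synthetic minority-class samples.

    X_pos        : array of shape (n_samples, n_features), minority class only
    n_candidates : number of other samples drawn at random (5 in the text)
    """
    rng = np.random.default_rng(seed)
    X_pos = np.asarray(X_pos, dtype=float)
    n = len(X_pos)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                       # ASSUMPTION: random base sample
        x = X_pos[i]
        others = np.delete(np.arange(n), i)       # every index except the base
        cand = rng.choice(others, size=min(n_candidates, n - 1), replace=False)
        # Closest candidate in terms of Euclidean distance
        dists = np.linalg.norm(X_pos[cand] - x, axis=1)
        x_nn = X_pos[cand[np.argmin(dists)]]
        u = rng.uniform(0.0, 1.0)                 # multiplicative factor in [0, 1)
        synthetic.append(x + u * (x_nn - x))      # point on the segment x -> x_nn
    return np.array(synthetic)
```

Each synthetic sample lies on the line segment between two real minority-class samples, so it never falls outside the region already spanned by the observed positive coughs. In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written loop.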
Feature extraction
The feature extraction process is illustrated in Fig. 5. Features such as mel-frequency cepstral coefficients (MFCCs), log frame energies, zero crossing rate (ZCR) and kurtosis are extracted. MFCCs have been used very successfully as features in audio analysis and especially in automatic speech recognition [42,43]. They have also been found to be useful in differentiating dry coughs from wet coughs [44] and classifying tuberculosis coughs [45]. We have used the traditional MFCC extraction method considering higher resolution MFCCs along with the velocity (first-order difference, Δ) and acceleration (second-order difference, ΔΔ) as adding these has shown classifier improvement in the past [46]. Log frame energies can improve the performance in audio classification tasks [47]. The ZCR [48] is the number of times a signal changes sign within a frame, indicating the variability present in the signal. The kurtosis [49] indicates the tailedness of a probability density. For the samples of an audio signal, it indicates the prevalence of higher amplitudes. These features have been extracted by using the hyperparameters described in Table 2 for all cough recordings.
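The three scalar features described above (log frame energy, ZCR and kurtosis) are simple to compute for a single frame. The sketch below is illustrative only: the function name `frame_features` is hypothetical, the small constant added before taking the logarithm is an assumption to avoid log(0), and the kurtosis is the plain (non-excess) fourth standardised moment, which may differ from the convention used by the authors. MFCC extraction is omitted because it would normally rely on an audio library.

```python
import numpy as np

def frame_features(frame):
    """Return (log energy, zero crossing count, kurtosis) of one frame."""
    frame = np.asarray(frame, dtype=float)
    # Log frame energy; the 1e-12 offset is an ASSUMPTION to avoid log(0)
    log_energy = np.log(np.sum(frame ** 2) + 1e-12)
    # ZCR: number of sign changes between consecutive samples
    # (exact zeros are not counted as crossings in this sketch)
    signs = np.sign(frame)
    zcr = int(np.sum(signs[:-1] * signs[1:] < 0))
    # Kurtosis: fourth central moment divided by the squared variance
    centred = frame - frame.mean()
    var = np.mean(centred ** 2)
    kurt = np.mean(centred ** 4) / var ** 2 if var > 0 else 0.0
    return log_energy, zcr, kurt
```

A frame dominated by a few large-amplitude samples yields a high kurtosis, while a frame that oscillates rapidly around zero yields a high ZCR.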
Fig. 5
Feature Extraction: Pre-processed cough audio recordings, shown in Fig. 4, are split into individual segments after which features such as MFCCs, MFCCs velocity (Δ), MFCCs acceleration (ΔΔ), log frame energies, ZCR and kurtosis are extracted. So, for M MFCCs and S segments, the final feature matrix has (3M + 3, S) dimensions.
Table 2
Feature extraction hyperparameters optimised using the leave-p-out cross-validation as described in Section 5.2

| Hyperparameter | Description | Range |
|---|---|---|
| MFCC (M) | Number of lower-order MFCCs to keep | 13 × k1, where k1 = 1, 2, 3, 4, 5 |
| Frame (F) | Frame-size in which audio is segmented | 2^k2, where k2 = 8, …, 12 |
| Seg (S) | Number of frames extracted from the audio | 10 × k3, where k3 = 5, 7, 10, 12, 15 |
We have extracted features in a way that preserves the information regarding the beginning and the end of a cough event, allowing time-domain patterns in the recordings to be discovered while maintaining the fixed input dimensionality expected by, for example, a CNN. From every recording, we extract a fixed number of features by distributing the fixed-length analysis frames uniformly over the time-interval of the cough. The input feature matrix for the classifiers then always has the dimension (3M + 3, S): M MFCCs, M velocity (Δ) and M acceleration (ΔΔ) coefficients, together with the log frame energy, ZCR and kurtosis, for each of the S segments, as illustrated in Fig. 5. If Λ is the number of samples in the cough audio, we can calculate the number of samples between consecutive frames, δ, using Equation (5):

$$\delta = \left\lceil \frac{\Lambda}{S} \right\rceil \tag{5}$$

So, for example, a 2.2 s long cough audio event contains 97020 samples, as the sampling rate is 44.1 kHz. If the frame length is 1024 samples and the number of segments is 100, then the frame skip is δ = ⌈97020/100⌉ = 971 samples.

In contrast with the more conventionally applied fixed frame rates, this way of extracting features ensures that the entire recording is captured within a fixed number of frames, allowing especially the CNN classifiers to discover more useful temporal patterns and provide better classification performance. This particular method of feature extraction has also shown promising results in classifying COVID-19 breath and speech [37].
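The worked example above (97020 samples, 100 segments, a frame skip of 971 samples) can be reproduced with a short sketch. The function names `frame_skip` and `split_into_frames` are hypothetical, and zero-padding frames that run past the end of the recording is an assumption, since the text does not say how the final frames are handled.

```python
import math
import numpy as np

def frame_skip(n_samples, n_segments):
    """Samples between the starts of consecutive frames, rounded up."""
    return math.ceil(n_samples / n_segments)

def split_into_frames(audio, frame_len, n_segments):
    """Distribute n_segments fixed-length frames uniformly over the audio."""
    audio = np.asarray(audio, dtype=float)
    delta = frame_skip(len(audio), n_segments)
    frames = np.zeros((n_segments, frame_len))
    for s in range(n_segments):
        chunk = audio[s * delta : s * delta + frame_len]
        # ASSUMPTION: frames extending past the end are zero-padded
        frames[s, :len(chunk)] = chunk
    return frames

# Example from the text: 2.2 s of audio at 44.1 kHz, 100 segments
print(frame_skip(97020, 100))  # 971
```

Because the frame skip scales with the recording length, a short cough produces heavily overlapping frames and a long cough produces sparsely spaced ones, while the number of frames stays fixed.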
Classifier architectures
We have trained and evaluated seven machine learning classifiers in total. LR models have been found to outperform other more complex classifiers such as classification trees, random forests and SVMs in some clinical prediction tasks [14,50,51]. We have used gradient descent weight regularisation as well as lasso (l1 penalty) and ridge (l2 penalty) estimators during training [52,53]. This LR classifier has been intended primarily as a baseline against which any improvements offered by the more complex architectures can be measured. A KNN classifier bases its decision on the class labels of the k nearest neighbours in the training set and in the past has been able to both detect [54–56] and classify [17,45,57] sounds such as coughs and snores successfully. SVM classifiers have also performed well in both detecting [58,59] and classifying [60] cough events. The independent term in kernel functions is chosen as a hyperparameter while optimising the SVM classifier. An MLP, a neural network with multiple layers of neurons separating the input and output [61], is capable of learning non-linear relationships and has, for example, been shown to be effective when discriminating influenza coughs from other coughs [62]. MLPs have also been applied to classify tuberculosis coughs [45,59] and detect coughs in general [63,64]. The penalty ratios, along with the number of neurons, are used as the hyperparameters, which were optimised using the leave-p-out cross-validation process (Fig. 8 and Section 5.2).
Fig. 8
Leave-p-out cross-validation, used to train and evaluate the classifiers. The development set (DEV), consisting of K subjects, has been used to optimise the hyperparameters while training on the TRAIN set, consisting of N − J − K subjects. The final evaluation of the classifiers in terms of the AUC occurs on the TEST set, consisting of J subjects.
A CNN is a popular deep neural network architecture, primarily used in image classification [65]. For example, in the past two decades CNNs have been applied successfully to complex tasks such as face recognition [66]. They have also performed well in classifying COVID-19 breath and speech [37]. A CNN architecture [67,68] along with the optimised hyperparameters (Table 3) is shown in Fig. 6. An LSTM model is a type of recurrent neural network whose architecture allows it to remember previously-seen inputs when making its classification decision [69]. It has been successfully used in automatic cough detection [15,70], and also in other types of acoustic event detection [71,72]. The hyperparameters optimised for the LSTM classifier [73] are listed in Table 3 and illustrated in Fig. 7. The 50-layer deep residual learning (Resnet50) neural network [74] is a very deep architecture that contains skip connections, and has been found to outperform other very deep architectures such as VGGNet. It performs particularly well on image classification tasks on datasets such as ILSVRC and CIFAR10, and on the COCO object detection dataset [75]. Resnet50 has already been used successfully to detect COVID-19 from CT images [6], coughs [28], breath and speech [37], as well as to detect Alzheimer's [76]. Due to the extreme computational load, we have used the default Resnet50 structure given in Table 1 of [74].
Table 3
Classifier hyperparameters, optimised using the leave-p-out cross-validation as described in Section 5.2

| Hyperparameter | Description | Classifier | Range |
|---|---|---|---|
| ν1 | Regularisation strength | LR | 10^i1, where i1 = −7, −6, …, 6, 7 (10^−7 to 10^7) |
| ν2 | l1 penalty | LR | 0 to 1 in steps of 0.05 |
| ν3 | l2 penalty | LR | 0 to 1 in steps of 0.05 |
| ξ1 | Number of neighbours | KNN | 10 to 100 in steps of 10 |
| ξ2 | Leaf size | KNN | 5 to 30 in steps of 5 |
| ζ1 | Regularisation strength | SVM | 10^i3, where i3 = −7, −6, …, 6, 7 (10^−7 to 10^7) |
| ζ2 | Kernel coefficient | SVM | 10^i4, where i4 = −7, −6, …, 6, 7 (10^−7 to 10^7) |
| η1 | No. of neurons | MLP | 10 to 100 in steps of 10 |
| η2 | l2 penalty | MLP | 10^i2, where i2 = −7, −6, …, 6, 7 (10^−7 to 10^7) |
| η3 | Stochastic gradient descent | MLP | 0 to 1 in steps of 0.05 |
| α1 | No. of Conv filters | CNN | 3 × 2^k4, where k4 = 3, 4, 5 |
| α2 | Kernel size | CNN | 2 and 3 |
| α3 | Dropout rate | CNN, LSTM | 0.1 to 0.5 in steps of 0.2 |
| α4 | Dense layer size | CNN, LSTM | 2^k5, where k5 = 4, 5 |
| β1 | LSTM units | LSTM | 2^k6, where k6 = 6, 7, 8 |
| β2 | Learning rate | LSTM | 10^k7, where k7 = −2, −3, −4 |
| β3 | Batch size | CNN, LSTM | 2^k8, where k8 = 6, 7, 8 |
| β4 | No. of epochs | CNN, LSTM | 10 to 250 in steps of 20 |
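To give a sense of the size of the search space implied by Table 3, the LSTM-related ranges can be written out as a Python dictionary and enumerated. This is a sketch only: the dictionary keys and the helper `iter_configs` are hypothetical names, and an exhaustive grid search is an assumption, as the source does not state how the ranges were traversed.

```python
import itertools

# LSTM hyperparameter ranges transcribed from Table 3
lstm_grid = {
    "dropout_rate": [0.1, 0.3, 0.5],                     # α3: 0.1 to 0.5, step 0.2
    "dense_size": [2 ** k for k in (4, 5)],              # α4: 16, 32
    "lstm_units": [2 ** k for k in (6, 7, 8)],           # β1: 64, 128, 256
    "learning_rate": [10.0 ** k for k in (-2, -3, -4)],  # β2
    "batch_size": [2 ** k for k in (6, 7, 8)],           # β3: 64, 128, 256
    "epochs": list(range(10, 251, 20)),                  # β4: 10, 30, ..., 250
}

def iter_configs(grid):
    """Yield every combination of values as a dictionary."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

n_configs = sum(1 for _ in iter_configs(lstm_grid))
print(n_configs)  # 2106
```

Even this modest grid contains over two thousand configurations, each of which must be trained once per cross-validation fold, which helps explain why the much larger Resnet50 was used with its default structure.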
Fig. 6
CNN Classifier: Our CNN classifier uses α1 two-dimensional convolutional layers with kernel size α2, rectified linear units as activation functions and a dropout rate of α3. After max-pooling, two dense layers with α4 and 8 units respectively and rectified linear activation functions follow. The network is terminated by a two-dimensional softmax where one output (1) represents the COVID-19 positive class and the other (0) healthy or COVID-19 negative class. During training, features are presented to the neural network in batches of size β3 for β4 epochs.
Fig. 7
LSTM classifier: Our LSTM classifier has β1 LSTM units, each with rectified linear activation functions and a dropout rate of α3. This is followed by two dense layers with α4 and 8 units respectively and rectified linear activation functions. The network is terminated by a two-dimensional softmax where one output (1) represents the COVID-19 positive class and the other (0) healthy or COVID-19 negative class. During training, features are presented to the neural network in batches of size β3 for β4 epochs.
Classification process
Hyperparameter optimisation
Both feature extraction and classifier architectures have a number of hyperparameters. They are listed in Tables 2 and 3 and were optimised by using a leave-p-out cross-validation scheme.
As the sampling rate is 44.1 kHz in both the Coswara and Sarcos datasets, varying the frame length from 2^8 to 2^12 samples (i.e. 256 to 4096 samples) means that features are extracted from frames whose duration varies between approximately 5 and 100 ms. Different phases in a cough carry important features [44], and thus each cough has been divided into between 50 and 150 segments, in steps of 20–30, as shown in Fig. 5. By varying the number of lower-order MFCCs to keep (from 13 to 65, in steps of 13), the spectral resolution of the features was varied.
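The relationship between frame length and frame duration at 44.1 kHz can be checked with a few lines of arithmetic. The sketch below only enumerates the frame lengths and MFCC counts stated above; the variable and function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100  # sampling rate of both datasets


def frame_duration_ms(n_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration in milliseconds of a frame of n_samples samples."""
    return 1000.0 * n_samples / sample_rate_hz


# Frame lengths from 2^8 to 2^12 samples.
frame_lengths = [2 ** k for k in range(8, 13)]

# Number of lower-order MFCCs to keep: 13 to 65 in steps of 13.
mfcc_counts = list(range(13, 66, 13))

for n in frame_lengths:
    print(f"{n:5d} samples -> {frame_duration_ms(n):6.2f} ms")
```

The shortest frame (256 samples) lasts about 5.8 ms and the longest (4096 samples) about 92.9 ms, which matches the approximate range of 5 to 100 ms.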
Cross-validation
All our classifiers have been trained and evaluated by using a nested leave-p-out cross-validation scheme, as shown in Fig. 8 [77]. Since only the Coswara dataset was used for training and parameter optimisation, N = 1171 in Fig. 8. We have set the train and test split as 4:1, as this ratio has been used effectively in medical classification tasks [78]. Thus, J = 234 and K = 187 in our experiments.
The figure shows that, in an outer loop, J subjects are removed from the complete set of N subjects to be used for later independent testing. Then, a further K subjects are removed from the remaining N − J subjects to serve as a development set to optimise the hyperparameters listed in Table 3. The inner loop considers all such sets of K subjects, and the optimal hyperparameters are chosen on the basis of all these partitions. The resulting optimal hyperparameters are used to train a final system on all N − J subjects, which is evaluated on the test set consisting of J subjects. If the N − J subjects in the training portion contain C1 COVID-19 positive and C2 COVID-19 negative coughs, then (C2 − C1) synthetic COVID-19 positive coughs are created by using SMOTE. AUC has always been the optimisation criterion in this cross-validation. This entire procedure is repeated for all possible non-overlapping test sets in the outer loop. The final performance is evaluated by calculating and averaging AUC over these outer loops.
This cross-validation procedure makes the best use of our small dataset by allowing all subjects to be used for both training and testing purposes while ensuring unbiased hyperparameter optimisation and a strict per-subject separation between cross-validation folds.
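The partition sizes quoted above follow from applying the 4:1 split twice. The sketch below reproduces that arithmetic and enumerates non-overlapping test and development sets. It assumes that held-out sizes are rounded down, that the subject order has already been shuffled, and that any leftover subjects remain in the training portion; none of these details is stated in the text, and all function names are hypothetical.

```python
def held_out_size(n_subjects: int, parts: int = 5) -> int:
    """Held-out set size for a 4:1 split (one fifth, rounded down: an assumption)."""
    return n_subjects // parts


def non_overlapping_sets(subjects, size):
    """Consecutive, non-overlapping held-out sets of exactly `size` subjects."""
    return [subjects[i:i + size] for i in range(0, len(subjects) - size + 1, size)]


def nested_splits(subjects, j, k):
    """Yield (test, train_dev, inner); inner is a list of (train, dev) pairs."""
    for test in non_overlapping_sets(subjects, j):
        held_out = set(test)
        train_dev = [s for s in subjects if s not in held_out]  # N - J subjects
        inner = []
        for dev in non_overlapping_sets(train_dev, k):
            dev_ids = set(dev)
            train = [s for s in train_dev if s not in dev_ids]  # N - J - K subjects
            inner.append((train, dev))
        yield test, train_dev, inner


def synthetic_positives_needed(c1: int, c2: int) -> int:
    """Number of SMOTE samples that balances C1 positives against C2 negatives."""
    return c2 - c1


N = 1171
J = held_out_size(N)       # outer-loop test set size
K = held_out_size(N - J)   # inner-loop development set size
splits = list(nested_splits(list(range(N)), J, K))
```

With N = 1171 this gives J = 234, K = 187 and 750 training subjects per inner partition, in agreement with the values reported above.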
Classifier evaluation
Receiver operating characteristic (ROC) curves were calculated within the inner and outer loops shown in Fig. 8. The area under the ROC curve (AUC) indicates how well the classifier has performed over a range of decision thresholds [79]. From these ROC curves, the decision threshold that achieves an equal error rate (γ) was computed. This is the threshold at which the classifier's false positive rate (FPR) equals its false negative rate (1 − TPR, where TPR is the true positive rate), so that the difference between sensitivity and specificity is minimised.
We denote the mean per-frame probability that a cough is from a COVID-19 positive subject by $\hat{P}$:

$$\hat{P} = \frac{1}{K} \sum_{i=1}^{K} P(Y = 1 \mid X_i, \theta) \tag{6}$$

where K indicates the number of frames in the cough and $P(Y = 1 \mid X_i, \theta)$ is the output of the classifier for feature vector $X_i$ and parameters θ for the ith frame. Now we define the indicator variable C as:

$$C = \begin{cases} 1 & \text{if } \hat{P} \geq \gamma \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

We then define two COVID-19 index scores ($COVID\_I_1$ and $COVID\_I_2$) in Equations (8) and (9) respectively.

$$COVID\_I_1 = \frac{1}{N_1} \sum_{j=1}^{N_1} C_j \tag{8}$$

$$COVID\_I_2 = \frac{1}{N_2} \sum_{i=1}^{N_2} P(Y = 1 \mid X_i, \theta) \tag{9}$$

In Equation (8), $N_1$ is the number of coughs from the subject in the recording and $C_j$ is the value of C for the jth cough, while in Equation (9), $N_2$ indicates the total number of frames of cough audio gathered from the subject. Hence Equation (6) computes a per-cough average probability while Equation (9) computes a per-frame average probability. For the Coswara dataset, $N_1 = 1$.
The COVID-19 index scores, given by Equations (8) and (9), can both be used to make classification decisions. We have found that for some classifier architectures one will lead to better performance than the other. Therefore, we have made the choice of the scoring function an additional hyperparameter to be optimised during cross-validation.
Specificity and sensitivity were calculated by comparing these predicted values with the actual labels, and the AUC was then calculated and used as the evaluation measure. The mean specificity, sensitivity, accuracy and AUC along with the optimal hyperparameters for each classifier are shown in Tables 4 and 5.
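The two index scores and the equal-error-rate threshold can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the indicator C is taken to be 1 when the mean per-frame probability of a cough reaches γ, the equal error rate is taken in its usual sense (the threshold at which the false positive and false negative rates are closest), and all function names and example values are hypothetical.

```python
def mean_cough_probability(frame_probs):
    """Mean of the per-frame outputs P(Y=1|X_i) over the K frames of one cough."""
    return sum(frame_probs) / len(frame_probs)


def covid_index_1(coughs, gamma):
    """Fraction of a subject's N1 coughs whose mean probability reaches gamma."""
    decisions = [1 if mean_cough_probability(c) >= gamma else 0 for c in coughs]
    return sum(decisions) / len(decisions)


def covid_index_2(coughs):
    """Mean of P(Y=1|X_i) over all N2 frames gathered from a subject."""
    frames = [p for cough in coughs for p in cough]
    return sum(frames) / len(frames)


def equal_error_threshold(scores, labels):
    """Candidate threshold at which |FPR - FNR| is smallest.

    Assumes at least one positive (1) and one negative (0) label.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gamma, best_gap = None, float("inf")
    for gamma in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= gamma)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= gamma)
        gap = abs(fp / n_neg - (1.0 - tp / n_pos))
        if gap < best_gap:
            best_gamma, best_gap = gamma, gap
    return best_gamma


# Hypothetical subject with two coughs of three and two frames.
coughs = [[0.9, 0.7, 0.8], [0.2, 0.4]]
index_1 = covid_index_1(coughs, gamma=0.5)
index_2 = covid_index_2(coughs)

# Hypothetical subject-level scores and labels for threshold selection.
gamma_eer = equal_error_threshold([0.1, 0.3, 0.4, 0.6, 0.8, 0.9], [0, 0, 1, 0, 1, 1])
```

In this toy example one of the two coughs exceeds the threshold, so the first index is 0.5, while the second index averages all five frame probabilities.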
Table 4
Classifier performance when training and evaluating on the Coswara dataset. The best two classifiers along with their feature extraction and optimal classifier hyperparameters are mentioned. The area under the ROC curve (AUC) has been the optimisation criterion during cross-validation. The mean specificity (spec), sensitivity (sens), accuracy (ACC) and standard deviation of AUC (σ) are also shown. The best performance is achieved by the Resnet50.
Table 5
Classifier performance when training on the Coswara dataset and evaluating on the Sarcos dataset. The best performance was achieved by the LSTM classifier, and further improvements were achieved by applying SFS.
Results
Coswara dataset
Classification performance for the Coswara dataset is shown in Table 4. The Coswara results are the average specificity, sensitivity, accuracy and AUC, along with the standard deviation of the AUC, calculated over the outer-loop test sets during cross-validation. Tables 4 and 5 also show the values of the hyperparameters which produce the highest AUC during cross-validation.
Table 4 shows that all seven classifiers can classify COVID-19 coughs and that the Resnet50 classifier exhibits the best performance, with an AUC of 0.976 when using a 120-dimensional feature matrix consisting of 39 MFCCs with appended velocity and acceleration extracted from frames that are 1024 samples long, and when grouping the coughs into 50 segments. The corresponding accuracy is 95.3%, with sensitivity 93% and specificity 98%. The CNN and LSTM classifiers also exhibited good performance, with AUCs of 0.953 and 0.942 respectively, thus comfortably outperforming the MLP, which achieved an AUC of 0.897. The optimised LR and SVM classifiers showed substantially weaker performance, with AUCs of 0.736 and 0.815 respectively. Table 4 also shows that DNN classifiers exhibit lower standard deviation across the folds than other classifiers. This suggests that DNN classifiers are also more likely to perform well on new datasets without further hyperparameter optimisation.
The mean ROC curves for the optimised classifier of each architecture are shown in Fig. 9. We see that the LSTM, CNN and Resnet50 classifiers achieve better performance than the remaining architectures at most operating points. Furthermore, the figure confirms that the Resnet50 architecture in most cases also achieved better classification performance than the CNN and LSTM. There appears to be a small region of the curve where the CNN outperforms the Resnet50 classifier, but this will need to be verified by further experimentation with a larger dataset.
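The reported mean AUC and its standard deviation are aggregated over the outer-loop test sets. A minimal sketch of that aggregation is given below, using the rank-based (pairwise) definition of AUC. The fold scores are invented for illustration, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def auc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count one half). Assumes both classes are present."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical (scores, labels) pairs for two outer-loop test sets.
folds = [
    ([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]),
    ([0.20, 0.30, 0.70, 0.90], [0, 0, 1, 1]),
]
fold_aucs = [auc(scores, labels) for scores, labels in folds]
mean_auc = mean(fold_aucs)
sigma_auc = pstdev(fold_aucs)  # population standard deviation (assumption)
```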
Fig. 9
Mean ROC curves for the classifiers trained and evaluated on the Coswara dataset: The highest AUC of 0.98 was achieved by the Resnet50, while the LR classifier has the lowest AUC of 0.74.
We also see from Table 4 that using a larger number of MFCCs consistently leads to improved performance. Since the spectral resolution used to compute the 39-dimensional MFCCs surpasses that of the human auditory system, we conclude that the classifiers are using information not generally perceivable to human listeners. We have come to similar conclusions in previous work considering the classification of coughing sounds due to tuberculosis [14].
Sarcos dataset
Classification performance for the Sarcos dataset is shown in Table 5. Here the CNN, LSTM and Resnet50 classifiers trained on the Coswara dataset (as shown in Table 4) were tested on the 44 subjects in the Sarcos dataset. No further hyperparameter optimisation was performed, and hence Table 5 simply notes the same hyperparameters presented in Table 4. We see that performance has in all cases deteriorated relative to the better-matched Coswara dataset. The best performance was achieved by the LSTM classifier, which achieved an AUC of 0.779. In the next section, we improve this classifier by applying feature selection.
Feature selection
As an additional experiment, SFS has been applied to the best-performing system in Table 5, the LSTM. SFS is a greedy selection method for the individual feature dimensions that contribute the most towards the classifier performance [80].
The feature extraction hyperparameters in these experiments were 13 MFCCs extracted from frames 2048 samples (i.e. approximately 46 ms) long, with coughs grouped into 70 segments. Thus, SFS could select from a total of 42 features: 13 MFCCs along with their velocity (Δ) and acceleration (ΔΔ), log frame energy, ZCR and kurtosis (Equation (5)). After applying SFS to the LSTM classifier, a peak AUC of 0.938 was observed on the Sarcos dataset when using the best 13 features among those 42, as shown in Fig. 10 and Table 5. These 13 selected features led to an improvement in AUC from 0.779 to 0.938 (Fig. 11). They include MFCCs ranging from 3 to 12 along with their velocity (Δ) and acceleration (ΔΔ), suggesting that all dimensions of the feature matrix carry equally important COVID-19 signatures.
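The greedy nature of SFS can be sketched as follows. At each step the single feature that most improves the score of the current subset is added, and the subset with the highest score seen along the way is returned. The scoring function here is a toy stand-in for the cross-validated AUC of the LSTM; it and all names are hypothetical.

```python
def sequential_forward_selection(n_features, score_fn):
    """Greedy forward selection over feature indices 0 .. n_features - 1.

    score_fn maps a list of feature indices to a score (e.g. an AUC).
    Returns (best_subset, best_score, history), where history holds the
    subset and score after each greedy step.
    """
    selected, remaining, history = [], list(range(n_features)), []
    while remaining:
        # Add the feature that gives the best score together with those already chosen.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score_fn(selected)))
    best_subset, best_score = max(history, key=lambda item: item[1])
    return best_subset, best_score, history


# 13 MFCCs with velocity and acceleration, plus log energy, ZCR and kurtosis.
N_FEATURES = 3 * 13 + 3

# Toy score: three (hypothetical) informative features help, all others hurt slightly.
INFORMATIVE = {2, 5, 7}


def toy_score(subset):
    useful = len(INFORMATIVE.intersection(subset))
    useless = len(subset) - useful
    return 0.5 + 0.1 * useful - 0.01 * useless


best_subset, best_score, history = sequential_forward_selection(N_FEATURES, toy_score)
```

In the toy example the score peaks once the three informative features have been chosen and declines as further features are added, which mirrors the peak-then-decline behaviour that SFS is designed to find.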
Fig. 10
Sequential Forward Selection, when applied to a feature matrix composed of 13 MFCCs with appended velocity (Δ) and acceleration (ΔΔ), log frame energies, ZCR and kurtosis (Equation (5)). Peak performance is observed after selecting the best 13 features.
Fig. 11
Mean ROC curve for the best-performing LSTM classifier trained on the Coswara dataset and evaluated on the Sarcos dataset: An AUC of 0.78 was achieved when using all 42 features. After applying SFS and selecting the best 13 features, the AUC improved to 0.94.
Conclusion and future work
We have developed COVID-19 cough classifiers using smartphone audio recordings and seven machine learning architectures. To train and evaluate these classifiers, we have used two datasets. The first, larger, dataset is publicly available and contains data from 1171 subjects (92 COVID-19 positive and 1079 healthy) residing on five continents (all except Africa). The smaller second dataset contains recordings from 18 COVID-19 positive and 26 COVID-19 negative subjects, 75% of whom reside in South Africa. Thus, together the two datasets include data from subjects residing on all six continents. After pre-processing the cough audio recordings, we found that the COVID-19 positive coughs are 15%–20% shorter than non-COVID coughs. We then extracted MFCCs, log frame energy, ZCR and kurtosis features from the cough audio using a special feature extraction technique which preserves the time-domain patterns, and trained and evaluated the seven classifiers using nested leave-p-out cross-validation. Our best-performing classifier is the Resnet50 architecture, which is able to discriminate between COVID-19 coughs and healthy coughs with an AUC of 0.98 on the Coswara dataset. These results outperform the baseline AUC of 0.7 reported in Ref. [32]. When testing on the Sarcos dataset, the LSTM model trained on the Coswara dataset exhibits the best performance, discriminating COVID-19 positive coughs from COVID-19 negative coughs with an AUC of 0.94 while using the best 13 features determined by sequential forward selection (SFS).
Furthermore, since better performance is achieved using a larger number of MFCCs than is required to mimic the human auditory system, we also conclude that at least some of the information used by the classifiers to discriminate between the COVID-19 coughs and the non-COVID coughs may not be perceivable to the human ear.
Although the systems we describe require more stringent validation on a larger dataset, the results we have presented are very promising and indicate that COVID-19 screening based on automatic classification of coughing sounds is viable. Since the data has been captured on smartphones, and since the classifier can in principle also be implemented on such a device, such cough classification is cost-efficient and easy to apply and deploy. Furthermore, it could be applied remotely, thus avoiding contact with medical personnel.
In ongoing work, we are continuing to enlarge our dataset and to apply transfer learning in order to take advantage of other, larger datasets. We are also beginning to consider the best means of implementing the classifier on a readily available consumer smartphone.