The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing, and speech sounds, and using them to build POCTs, can offer fast, contactless, and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing, and speech sound recordings via worldwide crowdsourcing. With this data, a call for the development of diagnostic tools was announced at Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks. Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided to the participants, without a competitive leaderboard. The challenge attracted 85-plus registrations, with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system) and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area under the receiver operating characteristic curve (AUC-ROC) of 95.1% on the blind test data. By summarizing the lessons learned, we foresee the challenge overview in this paper helping to accelerate the technological development of acoustics-based POCTs.
The viral respiratory infection caused by the novel coronavirus SARS-CoV-2, termed the coronavirus disease 2019 (COVID-19), was declared a pandemic by the World Health Organization (WHO) in March 2020. The current understanding of COVID-19 suggests that the virus infects the nasopharynx and then spreads to the lower respiratory tract (Schaefer et al., 2020). One of the key strategies to combat the rapid spread of infection across populations is to perform rapid and large-scale testing.

Currently, the prominent COVID-19 testing methodologies take a molecular sensing approach. The gold-standard technique, termed reverse transcription polymerase chain reaction (RT-PCR) (Corman et al., 2020), relies on nasopharyngeal or throat swab samples. The swab sample is treated with chemical reagents enabling isolation of the ribonucleic acid (RNA), followed by deoxyribonucleic acid (DNA) formation, amplification, and analysis, facilitating the detection of the viral genome in the sample. However, this approach has several limitations. The swab sample collection procedure violates physical distancing (Target product profiles, 2020). The processing of these samples requires a well-equipped laboratory, with readily available chemical reagents and expert analysts. Further, the turnaround time for test results can vary from several hours to a few days. Protein-based rapid antigen testing (RAT) (Peeling et al., 2021) improves on the speed of detection but is inferior to RT-PCR in detection performance, and it too requires chemical reagents.

In view of the above limitations of molecular testing approaches (namely, RT-PCR and RAT), there is a need to design highly specific, rapid, and easy-to-use point-of-care tests (POCTs) that can identify infected individuals in a decentralized manner. Using acoustics for developing such a POCT would overcome various limitations in terms of speed and cost, and would also allow scalable remote testing.
Exploring acoustics based testing
The use of acoustics for the diagnosis of pertussis (Pramono et al., 2016), tuberculosis (Botha et al., 2018), childhood pneumonia (Abeyratne et al., 2013), and asthma (Hee et al., 2019) has been explored using cough sounds recorded with portable devices. As COVID-19 is an infection affecting the respiratory pathways (Li et al., 2021), researchers have recently made efforts towards COVID-19 acoustic data collection. A list of acoustic datasets is provided in Table 1. Building on these datasets, a few studies have evaluated the possibility of COVID-19 detection using acoustics. Brown et al. (2020) used cough and breathing sounds jointly and attempted a binary classification task of separating COVID-19 infected individuals from healthy ones; the dataset was collected through crowdsourcing, and the authors reported performance in terms of the AUC-ROC (area under the receiver operating characteristic curve). Agbley et al. (2020) demonstrated 81% specificity (at 43% sensitivity) on a subset of the COUGHVID dataset (Orlandic et al., 2021). Imran et al. (2020) studied cough sound samples from four groups of individuals, namely, healthy individuals and those with bronchitis, pertussis, and COVID-19 infection, and reported an accuracy of 92.6%. Laguarta et al. (2020) used a large sample set of COVID-19 infected individuals and reported an AUC-ROC of 97.0%. Andreu-Perez et al. (2021) created a controlled dataset by collecting cough sound samples from patients visiting hospitals, and reported 98.8% AUC-ROC.
Table 1
A list of publicly accessible COVID-19 audio datasets.

| Ref | Dataset | Sound categories | Access | COVID/non-COVID samples^a | Method |
|-----|---------|------------------|--------|---------------------------|--------|
| Orlandic et al. (2021) | COUGHVID | Cough | Public | 1155/27,550 | Crowdsourced |
| Sharma et al. (2020) | Coswara | Cough, speech, breathing | Public | 345/1785 | Crowdsourced |
| Virufy COVID-19 Open Cough Dataset (2021) | Virufy | Cough | Public | 7/9 | Hospital |
| Cohen-McFarlane et al. (2020) | NoCoCoDa | Cough | On request | 13/NA | YouTube |
| Brown et al. (2020) | COVID-19 sounds | Cough, breathing | On request | 141/318 | Crowdsourced |

^a Samples refers to the count of distinct audio records from human subjects. Each audio record is composed of a set of sound recordings corresponding to the stated sound categories.
Although these studies are encouraging, they suffer from some limitations. They do not use a common dataset, and a few are based on privately collected datasets. The ratio of COVID-19 patients to healthy (or non-COVID) subjects differs in every study. The performance metrics also differ across studies: some report performance per cough bout, and others report it per patient. Further, most of the studies have not benchmarked on other open-source datasets, making it difficult to compare the various propositions.
Contribution
We launched the “Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge” (Muguli et al., 2021) with two primary goals. The first was to encourage speech and audio researchers to analyze the acoustics of cough and speech sounds for a problem of immediate societal relevance. The challenge was launched under the umbrella of Interspeech 2021, and participants were given the option to submit their findings to a special session of this flagship conference. The second, and more important, goal was to provide a benchmark for monitoring progress in acoustics-based diagnostics of COVID-19. The development and (blind) test datasets were provided to the participants to facilitate the design and evaluation of classifier systems. A leaderboard was created, allowing participants to rank their performance against others. This paper describes the details of the challenge, including the dataset and the baseline system (Section 2), and provides a summary of the various submitted systems (Section 3). An analysis of the scores submitted by the top teams (Section 4) and the insights gained from the challenge (Section 6) are also presented.
DiCOVA Challenge
The DiCOVA challenge was launched in February 2021 and lasted till March 2021. Participation was through a registration process. A development set, a baseline system, and a blind test set were provided to all registered participants. A timeline of the challenge is shown in Fig. 1. A remote-server-based scoring system with a leaderboard was created, providing near real-time ranking and monitoring of each team's progress on the blind test set. The call for participation in the challenge attracted 85-plus registrations. Of these, 29 teams made final submissions on the blind test set.
Fig. 1
The DiCOVA challenge timeline.
Dataset
The challenge dataset is derived from the Coswara dataset (Sharma et al., 2020), a crowd-sourced dataset of sound recordings collected using a website. Volunteers from across the globe, spanning diverse age groups and health conditions, were requested to record their sound data in a quiet environment using an internet-connected device (such as a mobile phone or computer). The participants first provide demographic information such as age and gender. An account of their current health status, in the form of a questionnaire covering symptoms as well as pre-existing conditions such as respiratory ailments and co-morbidities, is also recorded. The web-based tool additionally records the result of any COVID-19 test conducted and the possibility of exposure to the virus through primary contacts.

The acoustic data from each subject covers nine audio categories, namely, shallow and deep breathing (2 types), shallow and heavy cough (2 types), sustained phonation of the vowels [æ] (as in bat), [i] (as in beet), and [u] (as in boot) (3 types), and fast- and normal-pace number counting (2 types). The dataset collection protocol was approved by the Human Ethics Committee of the Indian Institute of Science, Bangalore, and P. D. Hinduja National Hospital and Medical Research Center, Mumbai, India.

The DiCOVA Challenge used a subset of the Coswara dataset, sampled from the data collected between April 2020 and February 2021. The sampling was restricted to a fixed age range. Subjects with a health status of “recovered” (COVID-19 positive but fully recovered from the infection) or “exposed” (suspecting exposure to the virus) were not included. Further, subjects with audio recordings shorter than a minimum duration were discarded. The resulting curated subject pool was divided into the following two groups:

non-COVID: Subjects self-reported as healthy, having symptoms such as cold/cough, or having pre-existing respiratory ailments (such as asthma, pneumonia, or chronic lung disease), but not tested positive for COVID-19.

COVID: Subjects self-declared as COVID-19 positive (asymptomatic, or symptomatic with mild/moderate infection).

The DiCOVA 2021 challenge featured two tracks. The Track-1 development dataset was composed of (heavy) cough sound recordings from the curated subject pool. The Track-2 development dataset was composed of deep breathing, vowel [i], and number counting (normal pace) recordings. An illustration of the important metadata in the development set is provided in Fig. 2. About 70% of the subjects were male, and the majority of the participants were below 40 years of age. Also, the dataset is highly imbalanced, with only a small fraction of the participants belonging to the COVID category. We retained this class imbalance in the challenge as it reflects the typical real-world scenario.
Fig. 2
An illustration of Track-1 and Track-2 development datasets. Here, (a,d) show the COVID and non-COVID pool size in terms of number of individuals; (b,e) show the breakdown of non-COVID individuals into categories of no symptoms, symptoms (cold, cough), and pre-existing respiratory ailment (asthma, chronic lung disease, pneumonia); (c,f) depicts the age group distribution in the development dataset.
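To make the curation procedure above concrete, the following is a minimal pandas sketch of the filtering logic, assuming a hypothetical metadata table with columns health_status and duration_sec (the actual Coswara metadata schema may differ, and the duration threshold below is a placeholder):

```python
import pandas as pd

# Hypothetical metadata table; column names are illustrative,
# not the actual Coswara schema.
meta = pd.read_csv("coswara_metadata.csv")

# Drop "recovered" and "exposed" subjects, as in the challenge curation.
meta = meta[~meta["health_status"].isin(["recovered", "exposed"])]

# Discard recordings shorter than a minimum duration
# (the exact threshold used in the challenge is not reproduced here).
MIN_DURATION_SEC = 0.5  # assumed placeholder value
meta = meta[meta["duration_sec"] >= MIN_DURATION_SEC]

# Binary label: COVID (self-declared positive) vs. non-COVID (all others).
meta["label"] = (meta["health_status"] == "positive").astype(int)
```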
In the data release, the development dataset was further divided into train and validation splits, illustrated in Fig. 3. A meta-analysis of COVID-19 symptoms by Li et al. (2021) found cough (53.9%) to be a common symptom among 281,641 COVID-19 infected individuals. In addition, prior efforts on data collection and modeling largely focused on cough samples (see Table 1) (Orlandic et al., 2021, Brown et al., 2020). Owing to this, the challenge emphasized progress in Track-1: a leaderboard was created, and the participants competed by uploading their scores for a blind test dataset and monitoring the performance. Track-2 featured a test dataset without any leaderboard-style competition and encouraged the participants to carry out exploratory analysis.
Fig. 3
Illustration of dataset splits for Track-1 (cough) and Track-2 (breathing and speech).
Audio specifications
The crowd-sourced dataset provides a good representation of real-world data, with sensor variability arising from diverse recording devices. For the challenge, we re-sampled all audio recordings to 44.1 kHz and compressed them to the FLAC (Free Lossless Audio Codec) format for ease of distribution. The average durations of the Track-1 development set cough recordings, and of the Track-2 development set breathing, vowel [i], and number-counting recordings, varied across the sound categories.
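A minimal sketch of this resampling-and-compression step using the librosa and soundfile Python libraries (the challenge's actual conversion tooling is not specified, so this is one plausible implementation):

```python
import librosa
import soundfile as sf

def to_flac_44k(in_path: str, out_path: str) -> None:
    """Resample an audio file to 44.1 kHz and store it losslessly as FLAC."""
    # librosa resamples on load when a target sampling rate is given.
    audio, sr = librosa.load(in_path, sr=44100)
    # soundfile infers the FLAC codec from the .flac file extension.
    sf.write(out_path, audio, sr)

to_flac_44k("cough_0001.wav", "cough_0001.flac")  # hypothetical file names
```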
Task
Track-1: The task focused on cough audio samples only. This was the primary track of the challenge, with most teams participating only in this track. A leaderboard website was hosted for the challenge, enabling teams to evaluate their system performance on the validation and blind test sets. The participating teams were required to submit a COVID probability score for each audio file in the validation and test sets. The leaderboard website computed the AUC-ROC and the specificity/sensitivity. Every team was provided a fixed maximum number of tickets for submitting scores to the leaderboard.

Track-2: Track-2 explored the use of recordings other than cough for the task of COVID diagnostics. The audio recordings released in this track comprised breathing, sustained phonation of the vowel [i], and number counting. The development and (non-blind) test sets were released concurrently, without any formal leaderboard-style evaluation or competition.

The data and the baseline system setup were provided to the registered teams after they signed a terms and conditions document. As per the document, the teams were not allowed to use the publicly available Coswara dataset.
Evaluation metrics
The focus of the challenge was binary classification, that is, detecting COVID or non-COVID status using acoustics. As the dataset was imbalanced, we chose not to use accuracy as an evaluation metric. Each team submitted COVID probability scores (between 0 and 1, with a higher value indicating a higher likelihood of COVID infection) for the list of validation/test audio recordings. For performance evaluation, we used the scores together with the ground truth labels to compute the receiver operating characteristic (ROC) curve. The curve was obtained by varying the decision threshold between 0 and 1 with a step size of 0.0001. The area under the resulting ROC curve, AUC-ROC, computed using the trapezoidal method, formed the primary evaluation metric. Further, specificity (true negative rate) at a sensitivity (true positive rate) greater than or equal to 80% was used as a secondary evaluation metric. For brevity, we refer to AUC-ROC as AUC in the rest of the paper.
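As an illustration of this procedure, the following is a minimal sketch that computes the AUC via the described threshold sweep and trapezoidal integration, along with the secondary metric (a pedagogical re-implementation, not the official challenge scoring script):

```python
import numpy as np

def auc_roc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUC-ROC via a threshold sweep (step 0.0001) and the trapezoidal rule."""
    thresholds = np.arange(0.0, 1.0001, 0.0001)
    pos, neg = labels == 1, labels == 0
    tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
    order = np.argsort(fpr)          # sort by FPR so integration is well defined
    tpr, fpr = tpr[order], fpr[order]
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))

def specificity_at_sensitivity(scores, labels, target=0.8):
    """Secondary metric: best specificity subject to sensitivity >= 80%."""
    best = 0.0
    for t in np.arange(0.0, 1.0001, 0.0001):
        sens = (scores[labels == 1] >= t).mean()
        spec = (scores[labels == 0] < t).mean()
        if sens >= target:
            best = max(best, spec)
    return best
```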
Baseline system
The baseline system was implemented using tools from the scikit-learn Python library (Pedregosa et al., 2011).

Pre-processing: For every audio file, the signal was normalized in amplitude. Using a sound activity detection threshold of 0.01 and a short buffer on either side of a sample, any region of the audio signal with amplitude lower than the threshold was discarded. Also, the initial and final snippets of the audio were removed to avoid abrupt start and end activity in the recordings.

Feature extraction: The baseline system used 13-dimensional mel-frequency cepstral coefficients (MFCCs) along with their delta and delta-delta coefficients, computed over 1024-sample windows (23.2 ms) with a shorter hop between successive frames. The resulting feature dimension was 39 × 1 per frame.

Classifiers: The following three classifiers were designed.

Logistic Regression (LR): The LR classifier was trained using the binary cross entropy (BCE) loss with a regularization strength of 0.01.

Multi-layer perceptron (MLP): A single-hidden-layer perceptron model with tanh() activation was used. Similar to the LR model, the BCE loss, with a regularization strength of 0.001, was optimized for parameter estimation, using the Adam optimizer with an initial learning rate of 0.001. The COVID samples were over-sampled to compensate for the data imbalance (weighted BCE loss).

Random Forests (RF): A random forest classifier was trained using the Gini impurity criterion for tree growing.

In the weighted BCE loss used in LR and MLP, the class errors are weighted. That is,
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,w_{1}\, y_i \log \hat{y}_i + w_{0}\,(1-y_i)\log(1-\hat{y}_i)\,\right],$$
where $\mathcal{L}$ is the loss, $N$ is the number of training samples, $y_i$ and $\hat{y}_i$ are the true and predicted labels, and $w_0$ and $w_1$ are the weights associated with the non-COVID and COVID classes. We chose $w_0$ and $w_1$ as the inverse of the fraction of training samples associated with the corresponding class. In RF, class weights are used to weigh the Gini impurity criterion for finding splits during tree growing. In the terminal nodes of each tree, class weights are again taken into consideration for class prediction via a “weighted majority” vote (Chen et al., 2004).

In the MLP design, the goal was to develop a shallow architecture for the baseline system and encourage participants to build deep networks with pre-training from other datasets. Our internal analysis had shown over-fitting issues for deep architectures when trained only with the challenge data. Hence, a single-hidden-layer architecture with tanh() activation was chosen. Ablation experiments were carried out to decide the number of nodes in the hidden layer of this MLP. A grid search over the number of hidden nodes showed that the average AUC-ROC (over the five validation folds) initially improved as nodes were added, after which the increase was not monotonic. This is shown in Fig. 5. Also shown is the AUC-ROC obtained using a two-hidden-layer MLP; its performance is in a similar range to that of the single-hidden-layer MLP. For the baseline system, we opted for the single-hidden-layer MLP.
Fig. 5
Variation in AUC as a function of number of nodes in the hidden layer of an MLP classifier. The AUC-ROC% is the average over the five validation folds.
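For reference, the following is a minimal scikit-learn sketch of a class-weighted, frame-level training and scoring pipeline in the spirit of the baseline. Hyper-parameters are illustrative; class_weight="balanced" stands in for the explicit weighted-BCE formulation above (scikit-learn's MLPClassifier lacks a class_weight option, which is why the baseline over-sampled the minority class instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def train_frame_classifier(feat_list, labels):
    """feat_list: list of (num_frames, 39) MFCC arrays, one per recording."""
    X = np.vstack(feat_list)
    # Repeat each recording's label for all of its frames.
    y = np.concatenate([np.full(len(f), l) for f, l in zip(feat_list, labels)])
    # 'balanced' weights classes by inverse class frequency, mirroring the
    # weighted-loss strategy described above.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    # A class-weighted random forest is a drop-in alternative:
    # clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")
    clf.fit(X, y)
    return clf

def score_recording(clf, frames):
    """Average frame-level COVID probabilities into one recording-level score."""
    return clf.predict_proba(frames)[:, 1].mean()
```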
Inference and performance: To obtain a classification score for an audio recording, the file was pre-processed, frame-level MFCC features were extracted, frame-level probability scores were computed using the trained model(s), and all the frame scores were averaged to obtain a single COVID probability score for the recording. For evaluation on the test set files, the probability scores from the five validation-fold models (for a given classifier type) were averaged to obtain the final score. Table 2 depicts the performance of the three classifiers on the validation folds and the test sets. All classifiers performed better than chance. For Track-1, the test-set AUC was best for the MLP classifier (69.85%). For Track-2, RF gave the best AUC in all sound categories.
Table 2
The baseline system performance on Track-1 and Track-2, on the development set (5-fold validation) and the test set.

| Track | Sound | Model | Val. AUC% (std. dev) | Test AUC% |
|-------|-----------|-------|----------------------|-----------|
| 1 | Cough | LR | 66.95 (±3.89) | 61.97 |
| 1 | Cough | MLP | 68.54 (±3.69) | 69.85 |
| 1 | Cough | RF | 70.69 (±3.10) | 67.59 |
| 2 | Breathing | LR | 60.95 (±4.85) | 60.94 |
| 2 | Breathing | MLP | 72.47 (±4.38) | 71.52 |
| 2 | Breathing | RF | 75.17 (±2.75) | 76.85 |
| 2 | Vowel [i] | LR | 71.48 (±1.23) | 67.71 |
| 2 | Vowel [i] | MLP | 70.39 (±4.11) | 73.19 |
| 2 | Vowel [i] | RF | 69.73 (±4.31) | 75.47 |
| 2 | Speech | LR | 68.93 (±2.44) | 61.22 |
| 2 | Speech | MLP | 73.57 (±1.59) | 61.13 |
| 2 | Speech | RF | 69.61 (±3.49) | 65.27 |
Further, among the sound categories, the breathing samples provided the best AUC (76.85%), followed by the vowel [i] (75.47%). The baseline system code was provided to the participants as a reference for setting up a classifier training and scoring pipeline.
Track-1: Submitted systems overview
A total of 29 teams (plus the baseline system) participated in the Track-1 leaderboard. A subset of these teams submitted system reports describing the explored approaches. In this section, we provide a brief overview of the submissions, emphasizing the obtained performances and the explored classifiers, features, model ensembling, and data augmentation techniques.
Performance
In total, several of the teams reported a performance better than the baseline system. We refer to the teams with Team IDs corresponding to their rank on the leaderboard; that is, the team with the best AUC performance is T-1, and so on. A performance summary of all the submitted systems on the validation and blind test data is given in Fig. 4. Fig. 4(a) depicts a comparison of the validation and test results; interestingly, there is a slight positive correlation between test and validation performance. For some teams, the validation performances were considerably higher than their test performances; deducing from the system reports, these performances are primarily due to training on the whole development dataset without holding out the validation data. Fig. 4(b) depicts the best AUC posted by each participating team (including the baseline) on the blind test data. The best AUC performance on the test data was 87.07%, a significant improvement over the baseline AUC of 69.85%.
Fig. 4
(a) A scatter plot of the average five-fold validation AUC versus test AUC performance for every submission on the leaderboard. (b) Test set AUC performance in rank ordered manner for each of the system submissions. Here, AUC refers to AUC-ROC.
Features
The teams designed and experimented with a wide spectrum of features; a concise summary is provided in Table 3. A majority of the submissions used mel-spectrograms, mel-frequency cepstral coefficients (Davis and Mermelstein, 1980), or equivalent rectangular bandwidth (ERB) (Smith and Abel, 1999) spectrograms. Further, the openSMILE features (Eyben et al., 2010), which consist of statistical measures extracted from low-level acoustic feature descriptors, were explored by several teams. A few teams explored features derived using Teager energy based cepstral coefficients (TECC; Kamble and Patil, 2019) (T-15), and pools of short-term features such as short-term energy, zero-crossing rate, and voicing (T-5, T-14, T-27). Other teams resorted to using embeddings derived from pre-trained neural networks as features. These included VGGish (Hershey et al., 2017), DeepSpectrum (Amiriparian et al., 2017), OpenL3 (Cramer et al., 2019), and YAMNet (Plakal and Ellis, 2020) embeddings (T-7, T-12), and x-vectors (Snyder et al., 2018) (T-15).
Table 3
Summary of submitted systems in terms of feature and model configurations. The specificity (%) is reported at a sensitivity of 80%.
Classifiers
The teams explored various classifier models (see Table 3). These included classical machine learning models, such as decision trees, random forests (RFs), and support vector machines (SVMs), and modern deep learning models, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and residual networks (ResNets). Several teams also attempted an ensemble of models to improve the final system performance.

CNNs were explored by four teams (T-1, T-5, T-10, T-17). Variants of CNNs with residual connections and recording-level average pooling, to deal with variable-length input, were developed by eight teams (T-2, T-5, T-8, T-9, T-10, T-13, T-16, T-21). Citing the improved ability of LSTMs to handle variable-length inputs, four teams (T-3, T-5, T-12, T-27) explored these models. The classical ML approaches of random forests, logistic regression, and SVMs were used by T-4, T-6, T-12, and T-18. A LightGBM (Light Gradient Boosting Machine) (Ke et al., 2017) model was explored by T-15, and extra-trees classifiers were studied by T-7. Pre-training was also studied in several systems (T-3, T-17); autoencoder-style pre-training was used by T-17. Several teams also experimented with transfer learning from architectures pre-trained on image data (T-2, T-8, T-13) and audio data (T-10, T-13).
Model ensembling
The fusion of scores from different classifier architectures was explored by multiple teams (T-3, T-4, T-6, T-10, T-11, T-12, T-13). The fusion of multiple features was explored by T-13. Further, T-2 and T-3 investigated score fusion of the outputs obtained from the models tuned on the five validation folds.
Data augmentation
Data augmentation is a popular strategy in which external or synthetic audio data is used in the training of deep network models. Five teams reported using this strategy, by including the publicly available COUGHVID cough dataset (Orlandic et al., 2021), adding Gaussian noise at varying SNRs, or applying audio manipulations (pitch shifting, time-scaling, etc., via tools such as audiomentations).
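A minimal sketch of such an augmentation chain using the audiomentations library (the parameter values are illustrative, not those used by any particular team):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain; each transform is applied to a raw
# waveform with probability p.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

waveform = np.random.randn(44100).astype(np.float32)  # stand-in for a cough clip
augmented = augment(samples=waveform, sample_rate=44100)
```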
A few teams also used data augmentation approaches to circumvent the problem of class imbalance. These included T-1 using mixup (Zhang et al., 2017), T-3, T-9, and T-11 using SpecAugment (Park et al., 2019), T-2, T-5, and T-9 using additive noise, T-21 using sample replication, and T-5 using Vocal-Tract Length Perturbation (VTLP) (Jaitly and Hinton, 2013), all to increase the sample counts of the minority class.

Besides these, other strategies for training included gender-aware training (T-21), using the focal loss (Lin et al., 2017) objective function (T-2, T-8, T-11), and hyper-parameter tuning using the model search algorithm TPOT (Le et al., 2019) (T-7).

In the next section, we discuss in detail the approaches used by the four top-performing teams in Track-1.
Track-1: Top performers
T-1: The brogrammers
The team (Mahanta et al., 2021) focused on using a multi-layered CNN architecture, with special emphasis on keeping the number of learnable parameters small. Every audio recording was trimmed or zero-padded to a fixed duration. For feature extraction, each recording was represented using 15-dimensional MFCC features per frame, yielding a 15 × 302 feature matrix. A cascade of CNN and fully connected layers, with max-pooling and ReLU non-linearities, was used in the neural network architecture. For data augmentation, the team used the audiomentations tool. The classifier was trained using the binary cross entropy (BCE) loss to output a COVID probability score. Unlike several other participating teams, the team did not report performing any system combination.
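For illustration, a minimal PyTorch sketch of a compact CNN of this general shape, operating on a fixed 15 × 302 MFCC matrix (the layer sizes are our assumptions, not the team's exact architecture):

```python
import torch
import torch.nn as nn

class SmallCoughCNN(nn.Module):
    """Compact CNN over a (1, 15, 302) MFCC matrix; outputs a COVID logit."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # -> (16, 7, 151)
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # -> (32, 3, 75)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 3 * 75, 64), nn.ReLU(),
            nn.Linear(64, 1),                     # BCE-with-logits target
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCoughCNN()
logits = model(torch.randn(8, 1, 15, 302))        # batch of 8 recordings
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.ones(8))
```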
T-2: NUS-Mornin system
The team (Chang et al., 2021) focused on using a residual network (ResNet) model with spectrogram images as features. To overcome the limitations of data scarcity and imbalance, the team resorted to three key strategies. First, data augmentation was done by adding Gaussian noise to the spectrograms. Second, the focal loss function was used instead of the binary cross entropy loss. Third, a ResNet50 was pre-trained on ImageNet and then fine-tuned on the DiCOVA development set, and an ensemble of four models was used to generate the final COVID probability scores.
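The focal loss down-weights easy examples so that training focuses on the hard ones. A minimal PyTorch sketch of the binary focal loss (the α and γ values below are common defaults, not necessarily the team's settings):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: BCE scaled by (1 - p_t)^gamma and a class weight."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

loss = binary_focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```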
T-3: UIUC SST system
The team (Harvill et al., 2021) used long short-term memory (LSTM) models. Motivated by generative modeling of the mel-spectrogram for capturing informative features of cough, the team proposed using auto-regressive predictive coding (APC) (Oord et al., 2018). APC is used to pre-train the initial LSTM layers operating on the input mel-spectrogram. The additional layers of the full network, composed of BLSTM and fully connected layers, were trained using the DiCOVA development set. As the number of model parameters was high, the team also used data augmentation with the COUGHVID dataset (Orlandic et al., 2021) and the SpecAugment (Park et al., 2019) tool. Binary cross entropy was chosen as the loss function. The final COVID probability score was obtained as an average over several similar models, trained on development data subsets or sampled at different checkpoints during training.
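A minimal sketch of APC-style pre-training, in which an LSTM learns to predict a mel-spectrogram frame a few steps ahead using an L1 loss (the time shift and layer sizes are illustrative assumptions, not the team's configuration):

```python
import torch
import torch.nn as nn

class APCEncoder(nn.Module):
    """LSTM encoder pre-trained to predict future mel-spectrogram frames."""
    def __init__(self, n_mels=80, hidden=256, shift=3):
        super().__init__()
        self.shift = shift                     # predict `shift` frames ahead
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, mel):                    # mel: (batch, time, n_mels)
        hidden_states, _ = self.lstm(mel)
        return self.proj(hidden_states), hidden_states

encoder = APCEncoder()
mel = torch.randn(4, 200, 80)                  # batch of mel-spectrograms
pred, _ = encoder(mel)
# L1 loss between the prediction at time t and the frame at time t + shift.
loss = nn.L1Loss()(pred[:, :-encoder.shift], mel[:, encoder.shift:])
loss.backward()
# After pre-training, `hidden_states` serve as features for the classifier.
```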
T-4: The North system
The team (Södergren et al., 2021) explored classical machine learning models, namely random forests (RF), support vector machines (SVM), and multi-layer perceptrons (MLP), rather than deep learning models. The features used were the openSMILE functional features (Eyben et al., 2010), z-score normalized to prevent feature domination. The hyper-parameters of the models were tuned to obtain the best results. The SVM model alone provided an AUC of 85.1% on the test data; the RF and the MLP scored AUCs of 82.15% and 75.65%, respectively. The final scores were obtained as a weighted average of the probability scores from the RF and SVM models, with weights of 0.25 and 0.75, respectively.
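A minimal scikit-learn sketch of this recipe: z-scored functional features, an SVM and an RF producing probability scores, and a 0.75/0.25 weighted average (the feature dimension and hyper-parameters are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for openSMILE functional features.
X_train, y_train = np.random.randn(100, 88), np.random.randint(0, 2, 100)
X_test = np.random.randn(20, 88)

# z-score normalization folded into the SVM pipeline.
svm = make_pipeline(StandardScaler(), SVC(probability=True))
rf = RandomForestClassifier(n_estimators=200)
svm.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Weighted average of probability scores: 0.75 * SVM + 0.25 * RF.
scores = 0.75 * svm.predict_proba(X_test)[:, 1] \
       + 0.25 * rf.predict_proba(X_test)[:, 1]
```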
Top 4 teams: Fairness
Here, we present a fairness analysis of the scores generated by the top four teams, focusing on gender-wise and age-wise performance on the test set. Fig. 6 depicts this performance. Interestingly, all four teams gave a better performance for female subjects. Similarly, the test dataset was divided into two groups based on subject age: at most 40 years, and above 40 years. Here, the top two teams had a considerably higher AUC for the above-40 subjects, while T-3 had a lower AUC for this age group and T-4 had the highest. In summary, the performance of the top four teams did not reflect the bias in the development data (70% male subjects, largely below 40 years of age; see Fig. 2).
Fig. 6
Illustration of AUC performance on the full test set, and on the test set split by gender (male/female) and age (at most 40 years / above 40 years).
Top 4 teams: Score fusion
The systems from the top four teams differ in terms of features, model architectures, and data augmentation strategies. We consider a simple arithmetic-mean fusion of the scores from the top teams. Let $s_{ij}$, $1 \leq i \leq K$, $1 \leq j \leq M$, be the COVID probability score predicted by the $i$th team's submission for the $j$th subject in the test data. Here, $M$ denotes the number of subjects in the test set, and $K$, the number of top teams, is four. The scores are first calibrated by correcting for the range as follows:
$$\tilde{s}_{ij} = \frac{s_{ij} - a_i}{b_i - a_i}, \quad \text{where } a_i = \min_{j} s_{ij} \text{ and } b_i = \max_{j} s_{ij}.$$
The fused scores are obtained as
$$f_j = \frac{1}{K}\sum_{i=1}^{K} \tilde{s}_{ij}.$$
The ROC obtained using these prediction scores is denoted by Fusion in Fig. 7. The Fusion system gives an AUC of 95.10%, a significant improvement over each of the individual system results. Table 4 depicts the sensitivity of the top four systems, the fusion, and the baseline (MLP) at 95% specificity. The fused model surpasses all the other models and achieves a sensitivity of 70.7%.
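A minimal NumPy sketch of this calibration-and-fusion step (the test-set size below is a placeholder):

```python
import numpy as np

def fuse_scores(scores: np.ndarray) -> np.ndarray:
    """scores: (K, M) array of per-team COVID probabilities for M subjects."""
    # Per-team min-max calibration so all score ranges match.
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    calibrated = (scores - lo) / (hi - lo)
    # Arithmetic mean across the K teams.
    return calibrated.mean(axis=0)

team_scores = np.random.rand(4, 100)   # 4 teams, hypothetical test-set size
fused = fuse_scores(team_scores)
```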
Fig. 7
Illustration of ROCs obtained on the test set for the top four teams. The ROC associated with the hypothetical score fusion system obtained using the top four teams is also shown.
Table 4
A comparison of the AUC and sensitivity of the top four teams, their score fusion, and the baseline system.

| Performance measure | Team T-1 | Team T-2 | Team T-3 | Team T-4 | Fusion | Baseline |
|---------------------|----------|----------|----------|----------|--------|----------|
| AUC (%) | 87.07 | 85.43 | 85.35 | 85.21 | 95.07 | 69.85 |
| Sensitivity at 95% specificity (%) | 46.34 | 39.02 | 60.97 | 29.27 | 70.73 | 17.07 |
Track-2: Systems overview
This track was exploratory: it did not feature a leaderboard and did not require a system report submission to the organizers. Hence, we summarize the explorations of only the two teams whose Track-2 reports were available.

Ritwik et al. (2021) performed a detailed analysis of the COVID detection performance obtained with different spectral features for each of the three sound categories, namely, breathing, vowel [i], and counting. The acoustic features explored included MFCCs, Gaussian mixture model (GMM) based supervectors, formant frequency features, and fundamental frequency values and their harmonics. The study suggests formant frequency features as the best-performing feature for the binary task. Further, complementary information is present in the three sound categories: a fusion of the probability scores from each sound category, for each subject, gave an AUC of 73.4% on the validation folds and 71.7% on the test (same as eval) dataset.

Deshpande and Schuller (2021) explored estimates of breathing patterns, obtained from the different sound categories, for COVID-19 detection. Towards this, an encoder that predicts the breathing pattern from speech signals was designed using a subset of the UCL Speech Breath Monitoring (UCL-SBM) database (Schuller et al., 2020). This pre-trained encoder was then used to predict the breathing patterns from the breathing, vowel [i], and counting sound categories separately. The estimated breathing patterns were then used as feature vectors to train a decoder model to predict COVID-19 status. Interestingly, the breathing features performed better than MFCCs for the vowel and counting sound categories. Further, a combination of breathing and MFCC features performed better than either feature alone. The average validation and test AUCs varied across the three sound categories provided in Track-2.
Discussion
Challenge accomplishments
The challenge problem statement for Track-1 required the design of a binary classifier. A clear problem statement, with a well-defined evaluation metric (AUC), encouraged a significant number of registrations. This included 85-plus teams from around the globe, with good representation from both industry and academia. The teams that completed the challenge came from several different countries, and multiple teams were associated with industry. Among the submissions, many of the teams exhibited a performance well above the baseline system AUC (see Fig. 4(b)).

Altogether, the challenge provided a platform for researchers to explore a healthcare problem of immense and timely societal impact. The results indicate potential in using acoustics for COVID-19 POCT development. The challenge turnaround time was short (spanning February to March 2021), and the progress made by the different teams in this span highlighted their efforts. Eleven studies pursued in this challenge (Muguli et al., 2021, Das et al., 2021, Mallol-Ragolta et al., 2021, Ritwik et al., 2021, Deshpande and Schuller, 2021, Karas and Schuller, 2021, Bhosale et al., 2021, Södergren et al., 2021, Harvill et al., 2021, Kamble et al., 2021, Avila et al., 2021), after going through the peer review process, were presented at the DiCOVA Special Session of the Interspeech 2021 Conference (August 2021).

The World Health Organization (WHO) has stated that a sensitivity of at least 80% (at a specificity of at least 97%) is necessary for an acceptable POCT tool (Target product profiles, 2020). The top four teams fell short of this benchmark (see Table 4), indicating that there is scope for further development. Interestingly, a simple combination of the scores from the systems of these teams achieves a performance much closer to the benchmark, suggesting that collaboration between multiple teams can be leveraged for improved tool development. An acoustics-based diagnostic tool for COVID-19 would offer multiple advantages in terms of speed, cost, portability, and accuracy.
Limitations and future scope
The challenge, being the first of its kind, also had its limitations. We discuss some of these below.

The development dataset had class imbalance, with a majority of the samples belonging to the non-COVID class. Although the imbalance reflects the prevalence of the infection in the population, it would be ideal to improve the balance in future challenges. The Coswara dataset (Sharma et al., 2020) developed by our team is being regularly updated with more samples, and as of August 2021 it contains data from a substantially larger pool of COVID-19 positive and non-COVID individuals.

A majority of the DiCOVA dataset samples came from India. While the cultural dependence of cough and breathing is not well established, it would be ideal to evaluate performance on datasets collected from multiple geographical sites. Towards this, future challenges can include demographically balanced datasets, with close collaboration between multiple sites involved in the data collection efforts.

The task involved in the challenge was simplified to a binary classification setting. However, in a practical scenario, there are multiple respiratory ailments resulting from bacterial, fungal, or viral infections, with each condition potentially leaving a unique bio-marker. Future challenges can target multi-class categorization, which will also widen the usability of the tool.

The data did not contain information regarding the progression of the disease (or the time elapsed since the positive COVID-19 test). Also, the subjects in the “recovered from COVID-19” and “exposed to a COVID-19 patient” categories were excluded from the challenge dataset. The leaderboard and the system highlights reported here were limited to the cough recordings only. As seen in Table 2, analysis using breathing and speech signals can yield performance comparable to that observed with cough recordings. In addition, the Coswara tool (Sharma et al., 2020) also records symptom data from participants. Using a combination of the various sound categories and symptoms in developing a COVID-19 detection tool might further push the detection performance.

In the DiCOVA challenge, the performance ranking of the teams was based on the AUC-ROC metric. This conveyed a model's ability to perform binary classification of COVID and non-COVID subjects. However, the challenge did not emphasize model interpretability and explainability as key requirements. In a healthcare scenario, the interpretability of model decisions may be as important as accuracy; hence, future challenges should encourage this aspect. For example, a recent work by Xia et al. (2021) proposes an ensemble framework for quantifying decision uncertainty: multiple classification models are developed, and disagreement across the learned models during the testing phase is used as a measure of decision uncertainty. In future, it is also important to focus on the reproducibility of the models, and on lower memory and computational footprints, as these will benefit the design and deployment of tools on mobile devices.

Recently, on 12 Aug 2021, we launched the Second DiCOVA Challenge, which attempts to circumvent some of the above limitations. It features three sound categories, namely, breathing, cough, and speech, and a leaderboard for each category. In a separate track, participants are also encouraged to fuse scores or decisions from classifiers designed on multiple sound categories. Further, in comparison to the first DiCOVA Challenge, the dataset is larger.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Imran, A., Posokhova, I., Qureshi, H.N., Masood, U., Riaz, M.S., Ali, K., John, C.N., Hussain, M.I., Nabeel, M., 2020. AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app. Informatics in Medicine Unlocked.
Corman, V.M., Landt, O., Kaiser, M., Molenkamp, R., Meijer, A., Chu, D.K., Bleicker, T., Brünink, S., Schneider, J., Schmidt, M.L., Mulders, D.G., Haagmans, B.L., van der Veer, B., van den Brink, S., Wijsman, L., Goderski, G., Romette, J.-L., Ellis, J., Zambon, M., Peiris, M., Goossens, H., Reusken, C., Koopmans, M.P., Drosten, C., 2020. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance.
Schaefer, I.-M., Padera, R.F., Solomon, I.H., Kanjilal, S., Hammer, M.M., Hornick, J.L., Sholl, L.M., 2020. In situ detection of SARS-CoV-2 in lungs and airways of patients with COVID-19. Modern Pathology.
Anibal, J., Landa, A., Nguyen, H., Peltekian, A., Shin, A., Christou, A., Hazen, L., Song, M., Rivera, J., Morhard, R., Bagci, U., Li, M., Clifton, D., Wood, B., 2022. medRxiv.