
Machine Learning-based Voice Assessment for the Detection of Positive and Recovered COVID-19 Patients.

Carlo Robotti¹, Giovanni Costantini², Giovanni Saggio³, Valerio Cesarini⁴, Anna Calastri⁵, Eugenia Maiorano⁵, Davide Piloni⁶, Tiziano Perrone⁷, Umberto Sabatini⁷, Virginia Valeria Ferretti⁸, Irene Cassaniti⁹, Fausto Baldanti¹⁰, Andrea Gravina¹¹, Ahmed Sakib¹¹, Elena Alessi¹², Matteo Pascucci¹², Daniele Casali⁴, Zakarya Zarezadeh⁴, Vincenzo Del Zoppo⁴, Antonio Pisani¹³, Marco Benazzo¹⁴.

Abstract

Many virological tests have been implemented during the Coronavirus Disease 2019 (COVID-19) pandemic for diagnostic purposes, but they appear unsuitable for screening purposes. Furthermore, current screening strategies are not accurate enough to effectively curb the spread of the disease. Therefore, the present study was conducted within a controlled clinical environment to determine whether detectable variations exist in the voices of COVID-19 patients, recovered subjects and healthy subjects, and whether machine learning-based voice assessment (MLVA) can accurately discriminate between them, thus potentially serving as a more effective mass-screening tool. Three different subpopulations were consecutively recruited: positive COVID-19 patients, recovered COVID-19 patients and healthy individuals as controls. Positive patients were recruited within 10 days from nasal swab positivity. Recovery from COVID-19 was established clinically, virologically and radiologically. Healthy individuals reported no COVID-19 symptoms and yielded negative results at serological testing. All study participants provided three trials for multiple vocal tasks (sustained vowel phonation, speech, cough). All recordings were initially divided into three different binary classifications with a feature selection, ranking and cross-validated RBF-SVM pipeline. This yielded a mean accuracy of 90.24%, a mean sensitivity of 91.15%, a mean specificity of 89.13% and a mean AUC of 0.94 across all tasks and all comparisons, and identified the sustained vowel as the most effective vocal task for COVID discrimination. Moreover, a three-way classification was carried out on an external test set of 30 subjects, 10 per class, with a mean accuracy of 80% and an accuracy of 100% for the detection of positive subjects. Within this assessment, recovered individuals proved to be the most difficult class to identify, and all the misclassified subjects were declared positive; this might be related to short and mid-term vocal traces of COVID-19, even after the clinical resolution of the infection. In conclusion, MLVA may accurately discriminate between positive COVID-19 patients, recovered COVID-19 patients and healthy individuals. Further studies should test MLVA among larger populations and asymptomatic positive COVID-19 patients to validate this novel screening technology and test its application as a potentially more effective surveillance strategy for COVID-19.

Entities: Chemical

Keywords: Accuracy; Cough; SARS-CoV-2; Screening test; Sensitivity; Surveillance

Year: 2021 PMID: 34965907 PMCID: PMC8616736 DOI: 10.1016/j.jvoice.2021.11.004

Source DB: PubMed Journal: J Voice ISSN: 0892-1997 Impact factor: 2.009

INTRODUCTION

The Coronavirus Disease 2019 (COVID-19) pandemic, caused by Severe Acute Respiratory Syndrome Coronavirus 2, has reached more than 200 countries to date, jeopardizing healthcare systems and national administrations worldwide.1, 2, 3 At the time of writing (September 2021), over 225 million confirmed cases and more than 4.5 million deaths had been reported globally. To curb the alarming spread of the disease, several virological tests were promptly implemented, including reverse transcription-polymerase chain reaction (RT-PCR) for viral RNA detection, serologic testing for immunoglobulins M (IgM) and G (IgG) quantification and rapid diagnostic kits. Nevertheless, available testing strategies suffer from critical limitations in accuracy and most appropriate clinical applications. Furthermore, inadequate testing infrastructures, high costs, lack of testing components, and long waiting times for results might have contributed to the poor control of the pandemic, leading to an underestimation of the infection's actual burden. Specifically, the limitations of current screening strategies (symptoms checklists, temperature checking) stress the need for new instruments, which should be highly sensitive, but also widely accessible, non-invasive, cost-effective, and able to provide results quickly and at scale. Within this frame, the present research project was designed to test a novel COVID-19 screening tool based on voice analysis through machine learning (ML). Conventional voice analysis proved useful in detecting distinguishing acoustic features of pathologies impairing all structures and systems responsible for phonation, including lungs,17, 18, 19 trachea, larynx,21, 22, 23 vocal folds and central nervous system.26, 27, 28, 29, 30 Furthermore, encouraging results had been obtained for disorders impairing voice production mechanisms only secondarily, including cardiovascular diseases31, 32, 33, 34 and diabetes. 
Advantageously, ML-based voice assessment (MLVA) allows processing thousands of acoustic variables simultaneously, benefiting from the computational power of properly trained algorithms. Moreover, MLVA confirmed its efficacy even on samples gathered through non-professional recording instruments (ie, smartphones), making this technology potentially available on a global scale through mobile devices. Deep Learning (DL), especially based on Convolutional Neural Networks (CNN) applied to images or Long Short-Term Memory networks, is commonly considered a substantial alternative to traditional machine learning pipeline methods. The advantages include the possibility to extract very complex features (through repeated non-linear transformation of data) and the fact that it is completely data-driven, therefore not requiring preprocessing of the training set. However, DL is often difficult to use with small datasets, requiring larger sets (often in the order of thousands of entries or more) and/or a suitable data augmentation procedure, which indeed constitutes a preprocessing artifact. Moreover, DL models usually comprise a large number of parameters and require more hardware resources for their training. Several studies demonstrated the possibilities of DL for pathological speech assessment; yet, the datasets and the accuracy values have often been shown to be comparable with those of traditional ML methods for the same pathologies, namely Parkinson's disease or dysphonia. Moreover, a study by Hasan et al compared CNN and SVM for hyper-spectral image recognition with the results being very similar, slightly favouring SVM. Although apparently different from our task, hyper-spectral images are in fact translated as image inputs for CNN and as a set of complex features for the SVM, which is exactly how a speech-based pathology detection is carried out. We considered the two approaches as equally promising, and significantly problem-dependent. 
For this specific study, we chose a traditional SVM-based machine learning pipeline because of its many state-of-the-art results on pathology identification with small datasets, and also because we needed to retain clinically relevant features in our analysis, which would be a much more difficult task using a high-abstraction DL-based approach. Interestingly, over the last few months, some research groups have been racing worldwide to search for COVID-19 acoustic biomarkers, mainly focusing on cough and breath sounds and yielding promising results.47, 48, 49 The DiCOVA challenge within the Interspeech 2021 conference is also worth mentioning, where several teams tested algorithms for the identification of COVID-19 from crowdsourced voice samples, with the winning team accuracy being 87%. Nonetheless, several limitations of these studies appear noteworthy. Firstly, data was mainly collected through web-based platforms, thus forcing researchers to rely on patients’ self-declarations. Secondly, being conducted outside of controlled clinical settings, incomplete clinical data were often provided (ie, testing type and timing, inclusion and exclusion criteria, COVID-19 symptoms). Thirdly, to the best of our knowledge, only two studies included vocal samples other than cough in the analyses. Specifically, in our opinion, proper speech tasks could provide valuable additional features for MLVA implementation, because of the complex interactions between voice-production subsystems and their peculiar impairments in COVID-19.53, 54, 55, 56, 57 Lastly, no ML study enrolled recovered COVID-19 patients to date, even though they may represent a crucial population for multiple reasons, including potential residual viral spreading58, 59, 60 and severe long-term disabilities.61, 62, 63 Therefore, the present study was designed to test MLVA as a potential screening tool for COVID-19 within a controlled clinical setting, collecting multiple vocal tasks (sustained phonation, speech, cough) with commercially available smartphones from three different subpopulations (positive COVID-19 patients, recovered negative COVID-19 patients, healthy controls).

MATERIAL AND METHODS

Study design

The study was conducted between March 2020 and October 2020 at three Italian COVID-19 units after approval of the ethics committees (IRCCS San Matteo Foundation, Pavia, Italy, reference number 20200053388; Ospedale dei Castelli, Ariccia, Italy, reference number 0064181/2020; Policlinico Tor Vergata Foundation, Rome, Italy, reference number 0012909/2020). The study was conducted following the principles stated by the Helsinki Declaration.

Study population

For the present study, 70 positive COVID-19 patients (group P), 70 recovered negative COVID-19 patients (group R) and 70 healthy individuals (group H) matching inclusion and exclusion criteria (Table 1) were consecutively recruited. An additional test sample population of 10 subjects per group (15 males and 15 females; median age 53 years; interquartile range 37-63) was also recruited. Positive COVID-19 patients were recruited within 10 days from nasal swab (NS) positivity. For positive patients, COVID-19 pneumonia was diagnosed clinically and radiologically through chest computed tomography (CT). Recovery from COVID-19 was confirmed clinically, radiologically, and with two consecutively negative NS tests. Moreover, only recovered patients with a Lung Ultrasound score ≤ 3 were recruited, to exclude subjects with residual pulmonary fibrosis.65, 66, 67 Healthy individuals were recruited among hospital staff members and their acquaintances if they had never tested positive for COVID-19, reported no COVID-19 symptoms, and had no unprotected exposure to COVID-19 cases (known or suspected). Serum samples (SS) for antibody quantification were collected from healthy participants at least 20 days after recording sessions,68, 69, 70 yielding negative results. Written informed consent was obtained from all participants. All data was pseudonymized.

TABLE 1

Inclusion and Exclusion Criteria for the Three Study Groups

Inclusion Criteria	Group P	Group R	Group H	Exclusion Criteria	Group P	Group R	Group H
Age between 18 and 80 y	■	■	■	Drugs acting on CNS	■	■	■
European ethnicity	■	■	■	Head and neck cancer	■	■	■
Italian native speaker	■	■	■	Lung cancer	■	■	■
Positive NS (< 10 d)	■	■	NA	Chemoradiation therapy	■	■	■
Two consecutive negative NS	NA	■	NA	C-PAP therapy	■	■	■
LUS ≤ 3	NA	■	NA	Tracheal intubation	■	■	■
Negative SS (> 20 d)	NA	NA	■	Tracheostomy	■	■	■

Abbreviations: P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects; NS, SARS-CoV-2 nasal swab for RNA detection; LUS, lung ultrasound score; SS, SARS-CoV-2 serum sample for IgM and IgG quantification; CNS, Central Nervous System; C-PAP, Continuous Positive Airway Pressure; NA, not applicable.


Voice recordings

Recording sessions were conducted in similar hospital rooms, with quiet environments and tolerable levels of background noise. Specifically, no machines producing static or impulsive noises were running in the background and no other voices were captured while recording. Moreover, the global quality of the recordings was assessed by ear by three independent audio engineers, rating samples as “acceptable” based on voice clarity, absence of noticeable reverb, absence of noticeable hiss or hum noises, and intelligible phonation. Voice samples were recorded with Huawei Y6-2019 smartphones (Huawei Technologies Co., Ltd., Shenzhen, China), in high quality and uncompressed format (.wav, 16-bit, 44.1 kHz). Devices were carefully disinfected after each use, according to the manufacturer's instructions. Participants were instructed to sit up straight on chairs with no armrests, keeping elbows and arms relaxed to avoid arm and shoulder strain. During recording sessions, all participants removed their masks not to alter acoustic signals nor speech intelligibility.71, 72, 73 The device's microphone was placed 15-20 cm in front of the participants’ mouths. Three distinct vocal-tasks were performed by each participant: (1) sustained voicing of the vowel /a/ (like in “bra”), at comfortable pitch and loudness, for at least 5 seconds; (2) a common Italian saying (“a caval donato non si guarda in bocca,” literally “do not look a gift horse in the mouth”); (3) cough. Three trials were recorded for each task. For the vowel and the sentence tasks, the trial with the lowest competing noise was then selected for MLVA, while all three cough trials were considered for the analyses. Recordings with poor audio quality or mispronunciation errors were then discarded from the analyses (Table S1 in the Supplement). All recordings were uploaded to a secure institutional server. Audio files were then trimmed to retain only vocalizing sections. 
Each participant provided three trials for each task effortlessly. Specifically, recording sessions required no more than 2 minutes for each participant.

Machine learning-based voice assessment (MLVA)

MLVA was performed in five steps: preprocessing, feature extraction (FE), feature selection (FS), feature ranking (FR), and classification (CL). First, raw audio data of all vocal tasks underwent preprocessing. Specifically, Root Mean Square normalization was applied to feed the algorithms with normalized data, thus mitigating variations related to different recording environments. Subsequently, FE was performed embedding OpenSMILE (OpenSMILE; audEERING GmbH, Munich, Germany) in a bash script, following previously validated protocols. A total of 6373 unidimensional features was extracted using the configuration file of the INTERSPEECH 2016 Computational Paralinguistics Challenge (IS ComParE 2016) feature dataset. Subsequently, FS, FR and CL were performed using the software Weka (Waikato Environment for Knowledge Analysis; University of Waikato, Waikato, New Zealand). Starting from FS, audio files were organized into nine different datasets, with three binary comparisons between the classes, each one being based on the three vocal tasks. Thus, the training set was arranged for one-versus-one comparisons, namely P versus H, P versus R and R versus H. According to a Greedy-Stepwise search method, each binary dataset underwent FS with a Correlation-Based approach, retaining approximately 2% of the previously extracted features. FR was performed based on heuristic merit factors through a linear SVM classifier. The first 50 top-ranking features were preserved for each dataset, retaining the most informative content while maintaining a standardized number of features. Finally, CL was conducted through the SVM classifier (Radial Basis SVM), which was selected for its effectiveness within the analysis of relatively small datasets. Accuracy on the training set, for each binary classifier (binary MLVA), was calculated by means of a 10-fold cross-validation, dividing the whole set into folds and using a different one of those as a validation set each time. 
Final accuracy is the average of the 10 accuracies obtained on each nine-to-one set. On the other hand, the test accuracy on the external set was obtained by running the pretrained binary models on the test data, unified with a majority voting system. Since each binary model is comprised of three sub-classifiers, one per vocal task, a majority voting system was also used to unify the outputs of the three sub-classifiers. Those subjects who received three different responses from the three binary classifiers were deemed as “uncertain.” Sensitivity and specificity were obtained through confusion matrices, as well as their three-class equivalent (H Accuracy, P Accuracy, R Accuracy).
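The Root Mean Square normalization used in the preprocessing step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the target level of 0.1 are assumptions made for illustration, and the input is assumed to be a non-empty sequence of floating-point samples.

```python
import math

def rms(samples):
    """Root mean square of a non-empty sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def rms_normalize(samples, target_rms=0.1):
    """Scale samples so that their RMS equals target_rms.

    target_rms is a hypothetical level chosen for illustration.
    A silent signal (RMS of zero) is returned unchanged.
    """
    current = rms(samples)
    if current == 0.0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]

# Two toy recordings with the same shape but different loudness.
quiet = [0.01, -0.02, 0.015, -0.005]
loud = [0.4, -0.8, 0.6, -0.2]
print(rms(rms_normalize(quiet)), rms(rms_normalize(loud)))
```

After normalization both toy signals share the same RMS level, which is the property that makes recordings from different rooms more comparable before feature extraction.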

Statistical analysis

To compare clinical and demographic characteristics, statistical tests were performed using Stata (StataCorp 2019, Stata Statistical Software: Release 16, College Station, TX). Qualitative variables were summarized as absolute counts and percentages of each category, while quantitative variables were summarized as medians and interquartile ranges. Fisher's exact test was used to compare categorical variables between the groups of patients. Mann-Whitney test and Kruskal-Wallis test (with Dunn's test for post hoc comparisons) were used to compare quantitative variables between two or more groups of patients, respectively. Bonferroni's correction was applied to allow for multiple comparisons. Two-sided P values were considered statistically significant when lower than 0.05.
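As an illustration of the multiple-comparison adjustment mentioned above, the following Python sketch applies Bonferroni's correction to a set of pairwise P values. The raw P values are hypothetical and are not taken from the study, and the function name is an assumption; the analysis itself was run in Stata.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p value, significant?) pairs under Bonferroni's correction.

    Each raw p value is multiplied by the number of comparisons and capped at 1.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((adjusted, adjusted < alpha))
    return results

# Hypothetical raw p values for three pairwise comparisons (not study data).
raw = {"P versus H": 0.004, "P versus R": 0.03, "R versus H": 0.2}
for name, (adjusted, significant) in zip(raw, bonferroni(list(raw.values()))):
    print(f"{name}: adjusted p = {adjusted:.3f}, significant = {significant}")
```

With three comparisons, a raw P value of 0.03 is no longer significant at the 0.05 level after adjustment, which shows why the correction matters when several pairwise tests follow a global test.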

RESULTS

Seventy positive COVID-19 patients (group P) were consecutively enrolled at the COVID-19 Units of the participating institutions. To match this population, 70 recovered negative COVID-19 patients and 70 healthy individuals were consecutively recruited. These subjects make up the training set for the binary MLVA. Clinical and demographic data are reported in Table 2.

TABLE 2

Clinical and Demographic Characteristics of the Three Study Groups

Variables	Group P (n = 70)	Group R (n = 70)	Group H (n = 70)	P value (global)	P value (P versus H)	P value (P versus R)	P value (R versus H)
Age, median (IQR), years	57 (39-67)	59 (48-69)	41 (29-54)	< 0.001	< 0.001	0.215	< 0.001
Gender
Males, n (%)	40 (57%)	45 (64%)	37 (53%)	0.402	NC	NC	NC
Females, n (%)	30 (43%)	25 (36%)	33 (47%)	0.402	NC	NC	NC
BMI, median (IQR), kg/m²	27.8 (26.1-31.2)	26.5 (24.4-30.5)	24.3 (22.4-28.6)	0.015	0.006	0.458	0.043
Smoking habits
Non-smokers, n (%)	35 (50%)	38 (54%)	38 (54%)	0.005	0.333	0.522	0.003
Smokers, n (%)	8 (11%)	2 (3%)	15 (21%)
Ex-smokers, n (%)	27 (39%)	30 (43%)	17 (24%)
COVID-19 pneumonia diagnosis, n (%)²	40 (57%)	67 (96%)	-	< 0.001	-	-	-
COVID-19 symptoms
Presence of symptoms, n (%)	54 (77%)	54 (77%)	-	> 0.90	-	-	-
Number of symptoms, median (IQR)	2 (1-4)	2 (1-3)	-	0.096	-	-	-
Asthenia (n, %)	29 (41%)	39 (56%)	-	0.128	-	-	-
Dyspnea on exertion (n, %)	29 (41%)	31 (44%)	-	0.864	-	-	-
Cough (n, %)	34 (49%)	8 (11%)	-	< 0.001	-	-	-
Muscle pain (n, %)	10 (14%)	25 (36%)	-	0.006	-	-	-
Dysphonia (n, %)	23 (33%)	5 (7%)	-	< 0.001	-	-	-
Olfaction disorder (n, %)	13 (19%)	6 (9%)	-	0.137	-	-	-
Taste disorder (n, %)	12 (17%)	5 (7%)	-	0.119	-	-	-
Olfaction and taste disorder (n, %)	13 (19%)	6 (9%)	-	0.137	-	-	-
Dyspnea at rest (n, %)	15 (21%)	2 (3%)	-	0.001	-	-	-
Blocked nose (n, %)	11 (16%)	2 (3%)	-	0.017	-	-	-
Headache (n, %)	6 (9%)	7 (10%)	-	> 0.90	-	-	-
Fever (n, %)	7 (10%)	0 (0%)	-	0.013	-	-	-
Dysphagia (n, %)	1 (1%)	5 (7%)	-	0.209	-	-	-
Chest pain (n, %)	2 (3%)	3 (4%)	-	> 0.90	-	-	-

Data regarding COVID-19 pneumonia and COVID-19 symptoms were collected only for positive and recovered COVID-19 patients, therefore cells are left blank for healthy control subjects. Data about pneumonia for group P refer to ongoing COVID-19 pneumonia diagnosis at the time of enrollment, while for group R they refer to previously diagnosed and currently recovered COVID-19 pneumonia. Significant p values are reported in bold font.

Abbreviations: P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects; IQR, interquartile range; NC, not calculated; BMI, body mass index; COVID-19, coronavirus disease 2019.

Moreover, 30 subjects (10 per class) were subsequently recruited as an external test sample, which allowed MLVA to act as a three-way classifier. COVID-19 symptoms were reported by 77% of both positive and recovered patients (P > 0.90). More than 40% of symptomatic participants of both groups reported dyspnea on exertion and asthenia (P > 0.05). Cough, dyspnea at rest, blocked nose, and fever were reported more frequently by positive COVID-19 patients (P < 0.02 in all comparisons). Conversely, muscle pain was reported at a higher rate by recovered subjects as a residual symptom (P = 0.006). Finally, no relevant differences were highlighted for the remaining screened symptoms (P > 0.05). At the time of enrollment, COVID-19-related pneumonia had been diagnosed clinically and radiologically in 57% of positive patients; former diagnoses of COVID-19-related pneumonia were recorded instead for 96% of recovered patients (P < 0.001).

Machine learning based voice assessment (MLVA)

Receiver Operating Characteristic (ROC) curves describing MLVA performances for each binary comparison between groups are depicted in Figure 1 and in Figures S1 and S2 in the Supplement. Table 3 reports accuracy, sensitivity, specificity, and area under the ROC curve (AUC) values. Overall, MLVA for binary classifications (on the training set) demonstrated a mean accuracy of 90.24% (range 87.88%-92.81%), a mean sensitivity of 91.15% (range 83.58%-93.27%), a mean specificity of 89.13% (range 85.51%-92.31%) and a mean AUC of 0.94 (range 0.91-0.97) across all tasks and all comparisons. According to accuracy values, the vowel task performed as the best discriminator within the comparison between groups P and H (90.07%) and between groups P and R (92.81%). In contrast, the cough task performed as the best discriminator within the comparison between groups R and H (90.49%). Finally, radar charts highlighting the top-ranking acoustic features for all tasks and comparisons are depicted in Figure 2 and in Figures S3 to S10 in the Supplement. The lists of all top-ranking features are reported in Tables S2 to S10 in the Supplement.

FIGURE 1

ROC curves comparing MLVA performances for all tasks within the discrimination between positive COVID-19 patients (group P) and healthy individuals (group H).

TABLE 3

Accuracy, Sensitivity, Specificity and Area Under the Curve (AUC) of Machine-Learning Based Voice Analysis for All Tasks and All Comparisons Between Groups

Comparison	Vocal Task	Accuracy (%)	Sensitivity (%)	Specificity (%)	AUC (CI)	Cut-Off
Group P versus Group H	Vowel /a/	90.07	92.11	88.00	0.94 (0.90-0.98)	0.93
	Sentence	87.88	83.58	92.31	0.91 (0.86-0.96)	0.85
	Cough	89.44	91.28	87.50	0.92 (0.90-0.94)	0.91
Group P versus Group R	Vowel /a/	92.81	92.86	91.43	0.97 (0.95-1.00)	0.94
	Sentence	91.18	91.04	91.30	0.96 (0.92-1.00)	0.94
	Cough	91.50	93.27	89.58	0.94 (0.92-0.96)	0.92
Group R versus Group H	Vowel /a/	89.21	92.86	85.51	0.92 (0.87-0.97)	0.93
	Sentence	89.55	92.75	86.15	0.96 (0.92-0.99)	0.93
	Cough	90.49	90.63	90.36	0.92 (0.90-0.94)	0.91

Abbreviations: P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects; CI, 95% confidence interval.
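The summary figures quoted in the text can be recomputed from Table 3. The Python sketch below copies the table values into a list and averages each column; the variable and function names are illustrative assumptions.

```python
from statistics import mean

# (comparison, task, accuracy %, sensitivity %, specificity %, AUC), copied from Table 3.
TABLE_3 = [
    ("P versus H", "Vowel /a/", 90.07, 92.11, 88.00, 0.94),
    ("P versus H", "Sentence", 87.88, 83.58, 92.31, 0.91),
    ("P versus H", "Cough", 89.44, 91.28, 87.50, 0.92),
    ("P versus R", "Vowel /a/", 92.81, 92.86, 91.43, 0.97),
    ("P versus R", "Sentence", 91.18, 91.04, 91.30, 0.96),
    ("P versus R", "Cough", 91.50, 93.27, 89.58, 0.94),
    ("R versus H", "Vowel /a/", 89.21, 92.86, 85.51, 0.92),
    ("R versus H", "Sentence", 89.55, 92.75, 86.15, 0.96),
    ("R versus H", "Cough", 90.49, 90.63, 90.36, 0.92),
]

def column_mean(rows, index):
    """Mean of one numeric column across all rows, rounded to two decimals."""
    return round(mean(row[index] for row in rows), 2)

def best_task(rows, comparison):
    """Vocal task with the highest accuracy for a given comparison."""
    candidates = [row for row in rows if row[0] == comparison]
    return max(candidates, key=lambda row: row[2])[1]

print(column_mean(TABLE_3, 2), column_mean(TABLE_3, 3),
      column_mean(TABLE_3, 4), column_mean(TABLE_3, 5))
print(best_task(TABLE_3, "R versus H"))
```

The column means reproduce the reported 90.24% accuracy, 91.15% sensitivity, 89.13% specificity and 0.94 AUC, and the per-comparison lookup returns the vowel task for the two comparisons involving group P and the cough task for R versus H.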

FIGURE 2

Discrimination between positive COVID-19 patients and healthy individuals based on the first 20 top ranking features of the vowel task. The red line of this radar plot corresponds to positive COVID-19 patients (group P), while the blue line corresponds to healthy individuals (group H). Each radius represents a distinct audio feature. Each point on the red line represents the feature's mean value for group P, normalized to its mean value for group H. Out of the original 50 top-ranking features, only the first 20 were reported for ease of viewing. The list of all 20 top-ranking features is reported in Table S2. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

For the three-way classification carried out on the external test set, accuracies of 80%, 100% and 60% were obtained for the identification of healthy, positive and recovered subjects respectively, giving a mean accuracy of 80%. Notably, one recovered subject was deemed “uncertain.” For this external test, three binary classifiers were ensembled, each comprising three ensembled sub-classifiers, one per vocal task. This unification procedure was carried out with a majority voting system. 
Binary accuracies for each sub-classifier were calculated only for the test subjects pertaining to the two classes considered by each binary classifier: for example, recovered subjects were not considered in evaluating the accuracy within the P versus the H classifier. Confusion matrices for each sub-classifier along with binary accuracy, sensitivity and specificity are reported in Table 4. The final confusion matrix for all the test subjects is reported in Table 5, while the compact 3 × 4 matrix is reported in Table 6.
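The two-level majority voting described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, and the three example subjects are copied verbatim from the rows of Table 4.

```python
from collections import Counter

def majority(labels):
    """Return the most frequent label, or "uncertain" when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "uncertain"
    return counts[0][0]

def classify_subject(task_outputs):
    """Combine sub-classifier outputs into one three-way prediction.

    task_outputs maps each binary comparison to the (vowel, sentence, cough)
    outputs of its three sub-classifiers. Level 1 votes within a comparison,
    level 2 votes across the three comparisons.
    """
    binary_votes = [majority(outputs) for outputs in task_outputs.values()]
    return majority(binary_votes)

# Sub-classifier outputs for test subjects 7, 11 and 27, as listed in Table 4.
subject_7 = {"P vs H": ("P", "H", "P"), "P vs R": ("R", "P", "P"), "R vs H": ("H", "H", "R")}
subject_11 = {"P vs H": ("P", "P", "P"), "P vs R": ("P", "P", "P"), "R vs H": ("R", "R", "R")}
subject_27 = {"P vs H": ("P", "H", "H"), "P vs R": ("R", "P", "P"), "R vs H": ("R", "R", "P")}

for name, subject in [("7", subject_7), ("11", subject_11), ("27", subject_27)]:
    print(name, classify_subject(subject))
```

Subject 27 receives three different binary responses (H, P and R), so the second-level vote has no winner and the subject is labeled "uncertain", in line with the rule stated in the methods.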

TABLE 4

Confusion Matrices for Each Sub-Classifier Over the External Test Set, Along With Binary Accuracy, Sensitivity and Specificity Calculated on the Two Respective Classes for Each Comparison

#	Real Class	P Versus H			P Versus R			R Versus H
		Vowel /a/	Sentence	Cough	Vowel /a/	Sentence	Cough	Vowel /a/	Sentence	Cough
1	H	H	H	H	P	P	P	H	H	R
2	H	H	H	H	R	P	P	H	H	R
3	H	P	H	H	P	P	P	H	H	R
4	H	H	H	H	P	P	P	H	H	H
5	H	H	H	H	P	P	R	H	H	R
6	H	H	H	H	P	P	P	H	H	R
7	H	P	H	P	R	P	P	H	H	R
8	H	P	H	P	P	P	P	H	H	H
9	H	H	H	H	P	P	P	H	H	R
10	H	H	H	H	P	P	P	H	H	H
11	P	P	P	P	P	P	P	R	R	R
12	P	P	P	P	P	P	P	R	R	R
13	P	P	P	P	P	P	P	R	R	R
14	P	P	H	P	P	P	P	H	R	R
15	P	P	P	P	P	R	P	R	R	R
16	P	H	P	P	P	P	P	R	R	R
17	P	P	P	P	P	P	P	R	R	R
18	P	H	P	P	P	R	P	R	R	R
19	P	P	P	P	P	P	P	R	R	R
20	P	P	P	P	P	P	P	H	R	R
21	R	H	H	H	R	R	R	R	R	R
22	R	P	P	H	R	R	R	R	R	R
23	R	H	H	H	R	R	P	R	R	R
24	R	P	P	H	R	P	P	R	R	R
25	R	P	P	H	R	P	P	R	R	R
26	R	P	H	H	R	R	P	R	R	R
27	R	P	H	H	R	P	P	R	R	P
28	R	H	P	H	R	R	P	R	P	R
29	R	P	P	H	R	P	P	R	R	R
30	R	H	H	H	R	R	P	R	R	R
Accuracy (%)		75	95	90	100	80	60	100	95	60
Sensitivity (%)		80	90	100	100	100	100	100	90	90
Specificity (%)		70	100	80	100	60	20	100	100	30

Abbreviations: #, number of test subject; P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects.
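The summary rows of Table 4 follow directly from the per-subject outputs. As a worked example, the Python sketch below recomputes accuracy, sensitivity and specificity for the vowel sub-classifier of the P versus H comparison, using only test subjects 1 to 20 (the H and P subjects), with P treated as the positive class. The function name is an assumption made for illustration.

```python
def binary_metrics(true_labels, predicted, positive):
    """Accuracy, sensitivity and specificity (in %) for a binary comparison."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n_pos = sum(t == positive for t in true_labels)
    n_neg = len(pairs) - n_pos
    return {
        "accuracy": 100 * (tp + tn) / len(pairs),
        "sensitivity": 100 * tp / n_pos,
        "specificity": 100 * tn / n_neg,
    }

# Real classes of test subjects 1-20 and the vowel /a/ outputs of the
# P versus H classifier, copied from Table 4.
true_labels = ["H"] * 10 + ["P"] * 10
vowel_p_vs_h = ["H", "H", "P", "H", "H", "H", "P", "P", "H", "H",
                "P", "P", "P", "P", "P", "H", "P", "H", "P", "P"]

print(binary_metrics(true_labels, vowel_p_vs_h, positive="P"))
```

The result is 75% accuracy, 80% sensitivity and 70% specificity, matching the first column of the summary rows in Table 4.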

TABLE 5

Final Confusion Matrix for the Three Classifiers on the External Test Set Along With Mean Accuracy and Per-Class Accuracies

#	Real Class	Binary Classifiers Output:			Final	Error
		P Versus H	R Versus H	P Versus R
1	H	H	H	P	H
2	H	H	H	P	H
3	H	H	H	P	H
4	H	H	H	P	H
5	H	H	H	P	H
6	H	H	H	P	H
7	H	P	H	P	P	yes
8	H	P	H	P	P	yes
9	H	H	H	P	H
10	H	H	H	P	H
11	P	P	R	P	P
12	P	P	R	P	P
13	P	P	R	P	P
14	P	P	R	P	P
15	P	P	R	P	P
16	P	P	R	P	P
17	P	P	R	P	P
18	P	P	R	P	P
19	P	P	R	P	P
20	P	P	R	P	P
21	R	H	R	R	R
22	R	P	R	R	R
23	R	H	R	R	R
24	R	P	R	P	P	yes
25	R	P	R	P	P	yes
26	R	P	R	R	R
27	R	H	R	P	uncertain	uncertain
28	R	H	R	R	R
29	R	P	R	P	P	yes
30	R	H	R	R	R
Accuracy (%)					80
H accuracy (%)					80
P accuracy (%)					100
R accuracy (%)					60

Abbreviations: #, number of test subject; P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects; Final, final prediction obtained through majority voting of the three classifiers; Error, whether a mis-classification has occurred.

TABLE 6

Final 3 × 4 Confusion Matrix

True Class	Classified as
	H (%)	P (%)	R (%)	Uncertain (%)
H (%)	80	20	0	0
P (%)	0	100	0	0
R (%)	0	30	60	10

Abbreviations: P, positive COVID-19 patients; R, recovered negative COVID-19 patients; H, healthy control subjects.
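Table 6 can be derived from the final predictions listed in Table 5. The Python sketch below rebuilds the row-normalized 3 × 4 matrix and the mean accuracy; the function and variable names are illustrative assumptions, and the labels are copied from the "Real Class" and "Final" columns of Table 5.

```python
from collections import Counter

CLASSES = ["H", "P", "R"]
OUTCOMES = ["H", "P", "R", "uncertain"]

def confusion_percentages(true_labels, final_labels):
    """Row-normalized confusion matrix (in %) with an extra 'uncertain' column."""
    matrix = {}
    for cls in CLASSES:
        outcomes = Counter(f for t, f in zip(true_labels, final_labels) if t == cls)
        total = sum(outcomes.values())
        matrix[cls] = {o: 100 * outcomes[o] / total for o in OUTCOMES}
    return matrix

# Real classes and final predictions of test subjects 1-30, copied from Table 5.
true_labels = ["H"] * 10 + ["P"] * 10 + ["R"] * 10
final_labels = (
    ["H", "H", "H", "H", "H", "H", "P", "P", "H", "H"]
    + ["P"] * 10
    + ["R", "R", "R", "P", "P", "R", "uncertain", "R", "P", "R"]
)

matrix = confusion_percentages(true_labels, final_labels)
per_class = {cls: matrix[cls][cls] for cls in CLASSES}
mean_accuracy = sum(per_class.values()) / len(per_class)
print(matrix)
print(per_class, mean_accuracy)
```

Because each class has 10 test subjects, the mean of the per-class accuracies (80%, 100% and 60%) coincides with the overall accuracy of 24 correct predictions out of 30.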


DISCUSSION

The present investigation, conducted within a controlled clinical setting, demonstrated that MLVA can accurately discriminate between positive COVID-19 patients, recovered negative COVID-19 patients, and healthy individuals, by detecting highly distinguishing patterns of audio features for all tasks across all study groups. In comparison to most previous works, this study expands MLVA analyses to proper speech tasks, providing further evidence in support of the potential clinical application of this novel screening tool for COVID-19. Binary MLVA (cross-validated on the training set) yielded promising results for all tasks and all comparisons between groups, with a satisfactory mean accuracy of 90.24% and a high mean AUC of 0.94. Previous studies testing cough and breath sounds reported encouraging results in terms of accuracy (range 88.89%-98.50%) and AUC (range 0.80-0.98). However, comparisons with the literature appear scarcely feasible, primarily for methodological reasons. Firstly, previous studies searching for COVID-19 acoustic biomarkers relied almost exclusively on cough, since it represents a well-known COVID-19 core symptom.83, 84, 85 Pre-COVID-19 ML studies demonstrated the relevance of cough samples for detecting multiple respiratory conditions. However, it is conceivable that proper speech tasks may provide additional valuable features for MLVA, potentially even more representative of the multifaceted interactions between phonatory subsystems and their impairment in COVID-19.53, 54, 55 Indeed, the vowel task achieved higher accuracy and AUC values than cough when discriminating between groups P and H and between groups P and R, showing lower accuracy only when discriminating between groups R and H, thus supporting the view that speech tasks may have at least similar informative content. 
With regard to sensitivity (the ability to detect subjects with the disease) and specificity (the ability to identify healthy individuals), binary MLVA yielded satisfactory results. In particular, when discriminating between groups P and H, the vowel task demonstrated the highest sensitivity (92.11%), while the sentence task showed the highest specificity (92.31%). A preliminary observation of the selected acoustic features highlights a trend in which domains other than frequency are the most prevalent. This is in line with the fact that differences in voices across the three classes are not always detectable by ear. Moreover, a predominance of RASTA-related features can be observed. The RASTA domain is based on cepstral coefficients of a PLP autoregressive model, high-pass filtered in the mel-frequency domain. Therefore, it is inherently insensitive to slowly varying spectral components, which are most often represented by background noise and differences in recording hardware and environment. On the other hand, RASTA is sensitive to any other background voices, which we were very careful not to include in our recordings. With regard to proper speech tasks, Pinkas and Shimon also obtained encouraging results through their preliminary analyses (vowel /a/, counting from 50 to 80). However, their data were gathered from smaller populations of positive COVID-19 subjects and their analyses yielded lower AUC and accuracy values. Secondly, most studies gathered cough samples from crowdsourced databases, presumably to promptly gather large datasets, but these often provide incomplete medical data. Foreseeing the need for an automatic tool, a three-way classifier was also developed by combining the three binary models with a majority voting system. Thus, the features used for this classification were the same as those used for the binary MLVA.
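The majority voting step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name and the "uncertain" tie rule are assumptions, the latter based on the "uncertain" outcome listed in the final confusion matrix. Each of the three binary classifiers (P vs H, P vs R, R vs H) contributes one vote, and a class wins only when two classifiers agree on it.

```python
from collections import Counter

UNCERTAIN = "uncertain"  # assumed label for a three-way tie


def majority_vote(p_vs_h: str, p_vs_r: str, r_vs_h: str) -> str:
    """Combine the outputs of three binary classifiers into one label.

    Each argument is the class predicted by one binary model:
    p_vs_h in {"P", "H"}, p_vs_r in {"P", "R"}, r_vs_h in {"R", "H"}.
    A class needs two votes to win; one vote each means no majority.
    """
    votes = Counter([p_vs_h, p_vs_r, r_vs_h])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else UNCERTAIN


if __name__ == "__main__":
    # Both classifiers that involve class P agree on it.
    print(majority_vote("P", "P", "H"))
    # Every class receives exactly one vote, so no majority exists.
    print(majority_vote("H", "P", "R"))
```

With three pairwise models, each class can receive at most two votes, so a tie occurs only when the three votes are all different.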
This classification yielded 80% accuracy for the identification of subjects of group H, 100% accuracy for subjects of group P and 60% accuracy for subjects of group R, with 10% classified as “uncertain.” With a mean of 80%, the classifier still yields relevant accuracy, comparable to that of conventional nasal swabs, with the additional advantage of preliminarily discriminating recovered subjects. Furthermore, the trend of the vowel being the best discriminating task, closely followed by the sentence, was confirmed. It is worth noting that all misclassified recovered subjects were labelled as positive. This might suggest that the COVID-19 “signature” persists in the voice in the short and mid term, even when the clinical course of infection is over. This is in line with considerations by Helding et al on COVID-induced long-lasting damage to the phonatory system, which is especially concerning for voice professionals such as singers and therefore deserves attention. These promising results further support the potential use of MLVA as a COVID-19 screening tool. Regarding possible on-site examinations, setting up uncrowded and noise-free recording environments would be very beneficial, especially given the complexity of the problem, which requires clean (high-quality and homogeneous) datasets. However, the presence of noise-robust features like RASTA, together with the precaution of avoiding background voices in the recordings, could be sufficient, to an extent that still needs to be thoroughly tested, for an on-site examination outside the clinical environment, such as in enclosed spaces set up in the community or even in a quiet room at home. Naturally, an automatic tool in a “real” environment will possibly require a new training process. The present research was conceived to overcome the critical issues of current screening strategies, such as symptom checklists and temperature checking.
Regarding symptom checklists, a recent review concluded that commonly screened COVID-19 signs and symptoms have low diagnostic accuracy, since neither the presence nor the absence of symptoms is accurate enough to confirm or rule out the disease. Temperature screening appears to be an unreliable, high-cost, and low-yield strategy. Indeed, in a study investigating European patients' clinical features, only 45% of mild-to-moderate COVID-19 patients had fever, and the rate dropped to 9% when asymptomatic individuals were also considered. Therefore, based on our preliminary results, we believe that MLVA could represent a more reliable, cost-effective, non-invasive, and widely deployable COVID-19 screening tool. Specifically, several MLVA screening scenarios could be envisioned: (1) daily population screening, potentially localizing new viral hotspots; (2) remote testing, limiting infectious risk for healthcare workers by reducing in-person interactions; (3) alternative COVID-19 testing where virological tests are poorly accessible or scarcely available. Furthermore, MLVA could also be employed at scale to preselect candidates for virological testing because of its high sensitivity and specificity. Interestingly, a recent meta-analysis highlighted that the sensitivity and specificity of currently available COVID-19 diagnostics are not equally high, ranging from 97.2% for RT-PCR analyses of sputum samples to 73.3% and even 62.3% when RT-PCR is performed on NS specimens and saliva, respectively. This variability may be primarily related to the clinical course of COVID-19, since the chances of viral detection in biological samples depend on specific collection times. Moreover, being time-consuming and expensive, these technologies appear unsuitable for repeated population screening.
Conversely, effective surveillance regimens (aimed at rapidly filtering infected individuals out from the population, thus preventing further spreading) should focus instead on high-frequency testing, even with lower analytic sensitivity. Both high-sensitivity and low-sensitivity tests can detect the infection within its narrow transmission window, but only frequently repeated tests can spot it during its very early phases. In this respect, being based on a low-cost and widely available technology (ie, smartphones), MLVA could potentially be employed to test large populations recurrently over time, prompting confirmation through virological diagnostics when suspected cases are detected and making this novel technology a more effective COVID-19 filter. It should be stressed that, in the case of relatively rare diseases such as COVID-19, screening tests with high specificity and positive predictive value (PPV, the probability that subjects with a positive test truly have the disease) are preferable, as they offer a better “rule in” test. However, the diagnostic parameters of a screening test (such as accuracy, specificity and sensitivity) are not intrinsic properties of the test itself, but strongly depend upon the clinical setting in which the test is applied. Therefore, in order to reduce false-positive results, data regarding the actual prevalence of COVID-19 should be taken into account when weighting the results of this screening tool. Nevertheless, patients should always be referred to conventional diagnostics (ie, RT-PCR) for confirmation in case of positive results. Although the test set is not large, the preliminary results highlight the potential utility of MLVA as an on-site screening tool used instead of or in addition to nasal swabs for prediagnosis, as well as the possibility of developing an application for real-time, remote self-assessment.
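To illustrate how the positive predictive value discussed above depends on prevalence, the following sketch applies Bayes' rule. It is not part of the original analysis: the function name and the prevalence values are illustrative assumptions, and the sensitivity and specificity are the mean cross-validated binary values reported in the abstract (91.15% and 89.13%), used here only as example inputs.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Probability that a subject with a positive test truly has the disease.

    Bayes' rule: PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev)).
    """
    p_true_positive = sensitivity * prevalence
    p_false_positive = (1.0 - specificity) * (1.0 - prevalence)
    denominator = p_true_positive + p_false_positive
    if denominator == 0.0:
        raise ValueError("PPV is undefined when no positive result can occur")
    return p_true_positive / denominator


# Example inputs taken from the abstract (illustrative use only).
SENSITIVITY = 0.9115
SPECIFICITY = 0.8913

if __name__ == "__main__":
    # Hypothetical prevalence values, chosen only to show the trend.
    for prevalence in (0.01, 0.10, 0.50):
        ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

At low prevalence the PPV falls well below the sensitivity and specificity themselves, which is why positive screening results need to be weighted by the actual prevalence and confirmed by conventional diagnostics.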
In this regard, we consider MLVA to be only a preliminary tool, which should prompt a more extensive examination in case of a positive outcome. Recruited healthy individuals had to meet strict inclusion criteria. Furthermore, SS testing was conducted at least 20 days after recording sessions, yielding negative results. Similarly, Laguarta crowdsourced cough samples from healthy individuals who declared having tested negative, without, however, specifying the type of test. For the present investigation, control subjects did not undergo baseline NS testing, since NS may yield false-negative results, potentially at high rates, especially in the early phases of COVID-19. Instead, numerous studies demonstrated that seroconversion rates in positive COVID-19 patients reach almost 100% 15 to 19 days after symptom onset,68, 69, 70 suggesting that delayed immunological confirmation might offer a more reliable strategy when recruiting healthy control subjects during the present pandemic. Notably, this is the first study testing MLVA on recovered COVID-19 patients, with promising results. Specifically, the satisfactory classification accuracy obtained when discriminating between positive and recovered patients suggests that MLVA may detect different COVID-19 clinical phases. Therefore, further studies should test MLVA in monitoring disease progression. Moreover, the results obtained in the discrimination between recovered and healthy individuals suggest that COVID-19 may leave detectable vocal traces even without clinically evident pulmonary impairments.
In fact, all recovered patients had a Lung Ultrasound Score of 3 or lower.65, 66, 67 The importance of recruiting recovered patients lies in the fact that these subjects might test positive again, although the reasons behind it (ie, reinfections, new viral variants, reactivation of former infections) and possible residual viral spreading are still debated.58, 59, 60 Ultimately, it is expected that healthcare systems will face critical challenges in the future in the management of recovered COVID-19 patients due to potential long-term disabling sequelae.61, 62, 63 In this respect, MLVA could offer a feasible and low-cost strategy to detect these subjects among the general population. Lastly, some limitations of the present investigation must be stated. Firstly, most positive and recovered COVID-19 patients (77%) presented clinical symptoms of the disease. Future studies should extend this experimental approach to positive but asymptomatic patients, thus improving MLVA performance in the preclinical phases of COVID-19. Secondly, we are aware that our study's sample size was limited, and that the lack of a sample size calculation limits the ability to draw definitive conclusions in support of a prompt adoption of MLVA in clinical practice. Therefore, although promising, the results of the present study should be regarded as preliminary. However, the rigorous methodology adopted and the homogeneous population of this study (same ethnicity, language and nationality) support the quality of our results, hopefully dispelling some skepticism towards this pioneering screening technology. Wider multicultural and multilanguage studies should be designed to confirm our findings among international populations, in order to rapidly answer the pressing need for a more effective surveillance strategy for COVID-19.

CONCLUSIONS

In conclusion, the present MLVA model demonstrated high accuracy for the discrimination between positive COVID-19 patients, recovered negative COVID-19 patients and healthy control subjects within a controlled clinical setting. A preliminary three-way classification supports the feasibility of an automatic tool. Moreover, the prevalence of noise-robust acoustic features like those of the RASTA domain suggests that an on-site examination is possible, especially in sufficiently noise-free environments. Further studies should test MLVA with pauci-symptomatic positive subjects, who are prevalent in the postvaccine era, and should also focus on long-term recovered subjects. Moreover, further examinations with wider datasets among larger populations would be especially beneficial, in order to validate this novel screening instrument and answer the pressing need for a more effective surveillance strategy for COVID-19.

Data statement

The lists of the top 100 features obtained through the feature selection process for all tasks and all study groups are available at https://figshare.com/articles/dataset/MLVA_COVID-19/14130239. Clinical data and audio files are not publicly available due to privacy and consent restrictions, since the data contain potentially identifying or sensitive patient information. However, they may be made available to research institutions by the authors upon reasonable request.

78 in total

1. Fast coronavirus tests: what they can and can't do.

Authors: Giorgia Guglielmi
Journal: Nature Date: 2020-09 Impact factor: 49.962

2. Antibody responses to SARS-CoV-2 in patients with COVID-19.

Authors: Quan-Xin Long; Bai-Zhong Liu; Hai-Jun Deng; Gui-Cheng Wu; Kun Deng; Yao-Kai Chen; Pu Liao; Jing-Fu Qiu; Yong Lin; Xue-Fei Cai; De-Qiang Wang; Yuan Hu; Ji-Hua Ren; Ni Tang; Yin-Yin Xu; Li-Hua Yu; Zhan Mo; Fang Gong; Xiao-Li Zhang; Wen-Guang Tian; Li Hu; Xian-Xiang Zhang; Jiang-Lin Xiang; Hong-Xin Du; Hua-Wen Liu; Chun-Hui Lang; Xiao-He Luo; Shao-Bo Wu; Xiao-Ping Cui; Zheng Zhou; Man-Man Zhu; Jing Wang; Cheng-Jun Xue; Xiao-Feng Li; Li Wang; Zhi-Jie Li; Kun Wang; Chang-Chun Niu; Qing-Jun Yang; Xiao-Jun Tang; Yong Zhang; Xia-Mao Liu; Jin-Jing Li; De-Chun Zhang; Fan Zhang; Ping Liu; Jun Yuan; Qin Li; Jie-Li Hu; Juan Chen; Ai-Long Huang
Journal: Nat Med Date: 2020-04-29 Impact factor: 53.440

Review 3. Neurobiology of COVID-19.

Authors: Majid Fotuhi; Ali Mian; Somayeh Meysami; Cyrus A Raji
Journal: J Alzheimers Dis Date: 2020 Impact factor: 4.472

4. A prospective multicentre study testing the diagnostic accuracy of an automated cough sound centred analytic system for the identification of common respiratory disorders in children.

Authors: Paul Porter; Udantha Abeyratne; Vinayak Swarnkar; Jamie Tan; Ti-Wan Ng; Joanna M Brisbane; Deirdre Speldewinde; Jennifer Choveaux; Roneel Sharan; Keegan Kosasih; Phillip Della
Journal: Respir Res Date: 2019-06-06

5. COVID-19 After Effects: Concerns for Singers.

Authors: Lynn Helding; Thomas L Carroll; John Nix; Michael M Johns; Wendy D LeBorgne; David Meyer
Journal: J Voice Date: 2020-08-06 Impact factor: 2.300

6. Artificial intelligence enabled preliminary diagnosis for COVID-19 from voice cues and questionnaires.

Authors: Carmi Shimon; Gabi Shafat; Inbal Dangoor; Asher Ben-Shitrit
Journal: J Acoust Soc Am Date: 2021-02 Impact factor: 1.840

7. Retest positive for SARS-CoV-2 RNA of "recovered" patients with COVID-19: Persistence, sampling issues, or re-infection?

Authors: Hanyujie Kang; Yishan Wang; Zhaohui Tong; Xuefeng Liu
Journal: J Med Virol Date: 2020-06-09 Impact factor: 20.693

8. Recurrence of COVID-19 after recovery: a case report from Italy.

Authors: Daniela Loconsole; Francesca Passerini; Vincenzo Ostilio Palmieri; Francesca Centrone; Anna Sallustio; Stefania Pugliese; Lucia Donatella Grimaldi; Piero Portincasa; Maria Chironna
Journal: Infection Date: 2020-05-16 Impact factor: 3.553

9. Antibody Responses to SARS-CoV-2 in Patients With Novel Coronavirus Disease 2019.

Authors: Juanjuan Zhao; Quan Yuan; Haiyan Wang; Wei Liu; Xuejiao Liao; Yingying Su; Xin Wang; Jing Yuan; Tingdong Li; Jinxiu Li; Shen Qian; Congming Hong; Fuxiang Wang; Yingxia Liu; Zhaoqin Wang; Qing He; Zhiyong Li; Bin He; Tianying Zhang; Yang Fu; Shengxiang Ge; Lei Liu; Jun Zhang; Ningshao Xia; Zheng Zhang
Journal: Clin Infect Dis Date: 2020-11-19 Impact factor: 9.079

10. Proposal for International Standardization of the Use of Lung Ultrasound for Patients With COVID-19: A Simple, Quantitative, Reproducible Method.

Authors: Gino Soldati; Andrea Smargiassi; Riccardo Inchingolo; Danilo Buonsenso; Tiziano Perrone; Domenica Federica Briganti; Stefano Perlini; Elena Torri; Alberto Mariani; Elisa Eleonora Mossolani; Francesco Tursi; Federico Mento; Libertario Demi
Journal: J Ultrasound Med Date: 2020-04-13 Impact factor: 2.754
