Prediction of SARS-CoV-2 infection with a Symptoms-Based model to aid public health decision making in Latin America and other low and middle income settings.

Andrea Ramírez Varela¹, Sergio Moreno López¹, Sandra Contreras-Arrieta¹, Guillermo Tamayo-Cabeza¹, Silvia Restrepo-Restrepo¹, Ignacio Sarmiento-Barbieri¹, Yuldor Caballero-Díaz¹, Luis Jorge Hernandez-Florez¹, John Mario González¹, Leonardo Salas-Zapata², Rachid Laajaj¹, Giancarlo Buitrago-Gutierrez³, Fernando de la Hoz-Restrepo⁴, Martha Vives Florez¹, Elkin Osorio², Diana Sofía Ríos-Oliveros², Eduardo Behrentz¹.

Abstract

Symptoms-based models for predicting SARS-CoV-2 infection may improve clinical decision-making and be an alternative for guiding resource allocation in under-resourced settings. In this study we aimed to test a model based on symptoms to predict a positive test result for SARS-CoV-2 infection during the COVID-19 pandemic using logistic regression and a machine-learning approach, in Bogotá, Colombia. Participants from the CoVIDA project were included. The logistic regression model was chosen based on biological plausibility and the Akaike information criterion. We also performed an analysis using machine learning with random forest, support vector machine, and extreme gradient boosting. The study included 58,577 participants with a positivity rate of 5.7%. The logistic regression showed that anosmia (aOR = 7.76, 95% CI [6.19, 9.73]), fever (aOR = 4.29, 95% CI [3.07, 6.02]), headache (aOR = 3.29, 95% CI [1.78, 6.07]), dry cough (aOR = 2.96, 95% CI [2.44, 3.58]), and fatigue (aOR = 1.93, 95% CI [1.57, 2.93]) were independently associated with SARS-CoV-2 infection. Our final model had an area under the curve of 0.73. The symptoms-based model correctly identified over 85% of participants. This model can be used to prioritize resource allocation related to COVID-19 diagnosis and to decide on early isolation and contact-tracing strategies in individuals with a high probability of infection before receiving a confirmatory test result. This strategy has public health and clinical decision-making significance in low- and middle-income settings like Latin America.

Keywords: Anosmia; COVID-19; Logistic model; Machine learning; SARS-CoV-2; Symptoms

Year: 2022 PMID: 35469291 PMCID: PMC9020649 DOI: 10.1016/j.pmedr.2022.101798

Source DB: PubMed Journal: Prev Med Rep ISSN: 2211-3355

Introduction

The SARS-CoV-2 pandemic has been one of the most significant public health challenges in modern history. By April 2022, more than 500 million cases had been reported, with over 6 million deaths globally (Johns Hopkins University, 2022). The COVID-19 pandemic has posed a significant challenge in terms of the readiness of healthcare systems to mobilize resources for infection diagnosis, contact tracing, and supporting adequate infrastructure for treatment. Particularly for diagnosis, since the pandemic’s start, low- and middle-income countries have reported inequities and delays in access to testing, time-to-test consultation, and test result turnaround (Lau et al., 2020). Evidence shows that in high-income countries, test turnaround times can be <3 days (McGarry et al., 2021, National Health Services, 2021), in contrast to middle-low- and low-income countries, where test turnaround times can be between 8 and 10 days, depending on socioeconomic status (Laajaj et al., 2021). Moreover, a modeling study suggests that delays of over 3 days from symptom onset to test result turnaround significantly decrease the effectiveness of nonpharmacological strategies such as contact tracing (Kretzschmar et al., 2020). Given these limitations in testing, other methods for rapid diagnostics have been used. Tools such as imaging have been used as an alternative to clinical testing for assessing COVID-19 diagnosis and disease severity. These analyses have included both chest X-ray (CXR) and computed tomography (CT) images combined with various methods, including machine learning (ML) neural networks, among others (Chen et al., 2021, Li et al., 2020a, Singh et al., 2021, Wang et al., 2022). However, most interventions related to diagnostic imaging require some level of face-to-face medical care, implying an additional challenge to healthcare systems in low-to-middle income countries with accessibility issues such as Colombia and most of Latin America.
Therefore, clinical guidelines have provided common symptoms that are suggestive of the infection. However, given the low predictive capacity of these symptoms when they present alone, relying on them may lead to isolating people with a very low probability of infection, or to no isolation at all. For this reason, recent literature suggests using symptom clusters to improve detection of COVID-19. This strategy is particularly helpful for clinical decision-making and testing in primary healthcare settings with scarce diagnostic resources, in particular during pandemic peaks with high community transmission when unlimited mass testing is not possible (Li et al., 2020b, Long et al., 2020, Mercer and Salit, 2021). Also, in the case of telemedicine, this can expand healthcare access without greater infrastructure and overcome difficulties in patient care (Hincapié et al., 2020, Monkowski et al., 2019). Some studies have approached COVID-19 diagnosis using symptoms-based models. To our knowledge, two studies have been conducted in Latin America regarding SARS-CoV-2 infection prediction. The first study used a database from a symptoms-tracking app in Brazil with serological testing for SARS-CoV-2. Symptoms such as loss of smell/anosmia and shortness of breath were independently associated with SARS-CoV-2 infection with a negative predictive value (NPV) of 93% (Dantas et al., 2021). The second study was also conducted in Brazil, assessing national seroprevalence and its association with reported symptoms such as anosmia, fever, and body ache through a conditional inference tree analysis. In this study, participants who did not report these symptoms had a positivity rate of 0.8%, compared to 18.3% among those with loss of smell and fever (Menezes et al., 2021).
In settings with low testing capacity, such as Colombia and the rest of Latin America, where adherence to non-pharmacological interventions such as isolation is low for socioeconomic reasons, a model for predicting SARS-CoV-2 infection may be an alternative for guiding resource allocation, including diagnostic testing capacities and prioritization in healthcare and contact tracing. The CoVIDA project was an intensified epidemiological surveillance study for SARS-CoV-2 performed in Bogotá, the most populated city in Colombia (with 7 million inhabitants). Over 55,000 RT-PCR tests were performed to increase the city’s testing capacity during the first two pandemic waves. RT-PCR tests conducted by the CoVIDA project allowed the local health authorities to identify positive cases in individuals with mild symptoms that may have been missed by traditional epidemiological surveillance that focuses mainly on people with moderate to severe COVID-19 (Varela et al., 2021). Therefore, in this study, we aimed to test a prediction model based on symptoms to predict a positive RT-PCR test result for SARS-CoV-2 infection among high-mobility working adult participants of the CoVIDA project during the first year of the COVID-19 pandemic in Bogotá, Colombia.

Materials and methods

Study design and population

This study used data from the CoVIDA project, a large sentinel epidemiological surveillance study conducted in Bogotá, Colombia, from April 2020 to March 2021. This project was created to detect transmission patterns among asymptomatic or mildly symptomatic populations with high mobility due to their occupation (healthcare workers, public transportation workers, employees in public markets, grocery stores, food delivery, construction, cleaning and other home services, education, informal workers, police, military forces, firefighters, and other essential services). Further description of the CoVIDA project sampling and testing allocation methods can be found elsewhere (Varela et al., 2021). After participants gave informed consent via a telephone call, they completed a questionnaire about sociodemographic characteristics, including sex, age, socioeconomic strata, contact with a confirmed case of COVID-19, comorbidities, and self-reported symptoms related to SARS-CoV-2 infection (sore throat, dry cough, fatigue, anosmia, diarrhea, fever, dyspnea, confusion, headache, myalgias, dysgeusia, chills, vomiting/nausea, rhinorrhea) (Instituto Nacional de Salud, 2020, Ministerio de Salud y Protección Social de Colombia, 2020). In the case of a positive test, the CoVIDA contact center informed the participants of the results and provided recommendations. Positive test results were reported to the Colombian health authorities according to national guidelines. This study was approved by the ethics committee of Universidad de los Andes (Act No. 1278 of 2020). Informed consent was obtained via telephone call in order to comply with physical distancing.

Specimen collection and testing

The CoVIDA testing centers performed the RT-PCR test for SARS-CoV-2 infection following two testing models: (a) home visits for mildly symptomatic participants and (b) drive/walk-through testing for asymptomatic participants. Participants underwent laboratory testing for SARS-CoV-2 using RT-PCR tests with nasopharyngeal swab sampling. The samples were processed by the GenCore Sequencing Center of the Universidad de los Andes following the international Berlin protocol and using the U-TOP™ COVID-19 detection kit for one-step real-time RT-PCR (Corman et al., 2020).

Statistical analysis

We reported descriptive results of the complete database using absolute and relative frequencies and central tendency measurements according to the distribution of the continuous variables. For bivariate analysis, we compared the distribution of categorical sociodemographic characteristics and symptoms by test result (positive or negative) using the Chi-square test or the Fisher exact test (for variables with reporting frequencies <5). For continuous variables, we used the Mann-Whitney U test. A p value of 0.05 or lower was considered statistically significant.
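As an illustration of the bivariate comparison described above, the following minimal Python sketch computes a Pearson chi-square test for a 2×2 table using only the standard library. The counts are the "contact with a COVID-19 confirmed case" rows of Table 1. The function name `chi_square_2x2` is hypothetical, no continuity correction is applied, and the study's analyses were run in Stata and R, so this is not the authors' code.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: contact yes / no; columns: positive / negative RT-PCR (counts from Table 1).
chi2, p_value = chi_square_2x2(796, 9962, 2529, 45269)
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

The closed-form p-value used here holds only for one degree of freedom, which is the case for any 2×2 table.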

Variable selection for prediction models

The selection of variables for the prediction model used in the logistic regression is presented in Fig. 1A. As we included many symptoms and biological features in the questionnaire, we aimed to choose, through variable selection, the combination of variables that offered the best diagnostic accuracy. First, the choice of variables was based on biological plausibility regarding symptoms used to predict SARS-CoV-2 infection according to the literature. Second, we included all possible interactions between variables (Sauerbrei and Royston, 2011) using the multivariable fractional polynomials method for the complete database. Variables for the final models were selected with the leaps-and-bounds algorithm, which allowed us to reach the best combination of predictors for the model using the best Akaike information criterion to reduce false-positive rates (Lindsey and Sheather, 2015). We calculated the logistic regression’s odds ratios (ORs) and 95% confidence intervals (95% CI).

Fig. 1

Modeling framework for the analysis of symptoms association and SARS-CoV-2 prediction. a) Variable selection for the final logistic regression model; b) Sampling procedure and holdout validation for the logistic regression model and the ML approach.

Given the low prevalence of the event (an overall positivity rate of 5.7% observed in the CoVIDA project) (Varela et al., 2021), resampling and oversampling procedures were performed to address the data imbalance. We performed an independent sensitivity analysis using the following datasets: (a) the complete database; (b) an undersampling dataset resulting from a random resampling using a 4:1 ratio of participants with negative and positive RT-PCR results, respectively; (c) an oversampling dataset obtained by the ROSE method; and (d) an oversampling dataset obtained by SMOTE. We performed a holdout validation of the resulting model by randomly splitting all datasets into training and testing datasets in a 70:30 ratio (see Fig. 1B). Analyses were performed in all datasets. A complete definition of the sample balancing techniques can be found in Supplementary Note 1.
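The undersampling and holdout steps described above can be sketched as follows. This is a minimal standard-library illustration on synthetic records, not the authors' procedure: the function names `undersample` and `holdout_split`, the record layout, and the seeds are assumptions, and the split is a simple random (non-stratified) one because the text does not say whether stratification was used.

```python
import random

def undersample(records, negatives_per_positive=4, seed=0):
    """Keep all positives and a random sample of negatives at the given ratio."""
    rng = random.Random(seed)
    positives = [r for r in records if r["positive"]]
    negatives = [r for r in records if not r["positive"]]
    n_keep = min(len(negatives), negatives_per_positive * len(positives))
    sampled = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sampled)
    return sampled

def holdout_split(records, train_percent=70, seed=0):
    """Randomly split records into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Synthetic data: 1,000 records, 57 of them positive (5.7% positivity).
records = [{"id": i, "positive": i < 57} for i in range(1000)]
balanced = undersample(records)
train, test = holdout_split(balanced)
print(len(balanced), len(train), len(test))
```

Undersampling raises the share of positives in the analysis set from 5.7% to 20%, which is why performance measures that depend on prevalence (PPV, NPV, accuracy) are not directly comparable between the complete and resampled datasets.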

Machine learning analysis

Sensitivity analysis to assess the diagnostic performance of the prediction model was conducted using multiple ML methods: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XG boost). The RF model was fitted using a classification approach with reduction in Gini impurity, based on Breiman’s random forest algorithm. The SVM method used a linear kernel. The XG boosting model was trained on the training set with the hyperparameter settings described in Supplementary Note 2. Each ML approach included variables in the prediction model in order of importance. We graphically assessed the importance of each variable included in the ML models. We assessed the performance of the obtained models with all datasets using the following parameters: area under the curve (AUC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and total accuracy (participants correctly classified). We conducted analyses using Stata 16.0 and R version 4.0.5.
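The performance parameters listed above all derive from the four cells of a confusion matrix. The short Python sketch below shows the standard definitions; the function name `diagnostic_metrics` and the counts are hypothetical, chosen only to give a sensitivity of 26% and a specificity of 98% in a 1:4 test set, and they are not the study's data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return SE, SP, PPV, NPV, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all infected
        "specificity": tn / (tn + fp),  # true negatives among all uninfected
        "ppv": tp / (tp + fp),          # infected among predicted positives
        "npv": tn / (tn + fn),          # uninfected among predicted negatives
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical test set: 100 infected and 400 uninfected participants.
metrics = diagnostic_metrics(tp=26, fp=8, fn=74, tn=392)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Accuracy can be high even when sensitivity is low, because the uninfected majority dominates the count of correctly classified participants.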

Results

Data used in this study come from the CoVIDA project, an intensified epidemiological surveillance study that performed 58,577 RT-PCR tests for SARS-CoV-2 from April 2020 to March 2021. A positivity rate (response variable) of 5.7% was observed in the CoVIDA sample (Varela et al., 2021).

Demographic characteristics among participants

Table 1 shows sociodemographic characteristics and symptoms related to SARS-CoV-2 infection. The median age of the sample was 36 years (IQR = 28–48). Participants with a positive test result had a median age of 35 years (IQR = 27–47), while those with a negative test result had a median age of 36 years (IQR = 28–49). Most positive RT-PCR results for SARS-CoV-2 (56.5%) occurred in participants aged 30 to 59 years. Regarding the sex of participants, 50.8% were female. Overall, 18.4% of participants reported contact with a confirmed COVID-19 case. The cumulative positivity rate for those with contact with a confirmed case was 7.4%, compared to 5.3% among participants who did not report contact. The most frequent comorbidities were arterial hypertension and smoking, with 7.4% and 5.1%, respectively (see Table 1).

Table 1

Sociodemographic characteristics and symptoms related to SARS-CoV-2 infection (N = 58,556).

Variable	Total N = 58,556	Positive n = 3,325	Negative n = 55,231	p-value
Age (years) median (IQR)	36 (28–48)	35 (27–47)	36 (28–49)	< ·001*

Age (years) (n, %)				·001*
<18	235 (0·4)	30 (0·90)	205 (0·4)
18–29	17,718 (30·3)	1,096 (32·9)	16,622 (30·1)
30–59	35,119 (59·9)	1,880 (56·5)	33,239 (60·2)
>60	5,484 (9·4)	319 (9·6)	5,165 (9·4)

Sex (n, %)				·183
Female	29,736 (50·8)	1,652 (5·6)	28,084 (94·4)
Male	28,794 (49·2)	1,673 (5·8)	27,121 (94·2)

Contact with a COVID-19 confirmed case (n, %)				< ·001*
Yes	10,758 (18·4)	796 (7·4)	9,962 (92·6)
No	47,798 (81·6)	2,529 (5·3)	45,269 (94·7)

Comorbidities (n, %)
Arterial hypertension	4,352 (7·4)	228 (5·2)	4,124 (94·8)	·193
Smoking	2,969 (5·1)	133 (4·5)	2,836 (95·5)	·004*
Obesity	2,551 (4·4)	139 (5·5)	2,412 (94·6)	·609
Asthma	2,311 (3·9)	104 (4·5)	2,207 (95·5)	·007*
Diabetes mellitus	1,379 (2·4)	71 (5·2)	1,308 (94·9)	·478
Chronic obstructive lung disease	295 (0·5)	19 (6·4)	276 (93·6)	·652

Symptoms related to SARS-CoV-2 (n, %)
Asymptomatic	51,254 (87·5)	1,852 (3·6)	49,402 (96·4)	< ·001*
Symptomatic	7,302 (12·5)	1,473 (20·2)	5,829 (79·8)	< ·001*
Sore throat	4,230 (7·2)	790 (18·7)	3,440 (81·3)	< ·001*
Dry cough	3,407 (5·8)	893 (26·2)	2,514 (73·8)	< ·001*
Fatigue	2,713 (4·6)	707 (26·1)	2,006 (73·9)	< ·001*
Anosmia	1,802 (3·1)	749 (41·6)	1,053 (58·4)	< ·001*
Diarrhea	1,871 (3·2)	413 (22·1)	1,458 (77·9)	< ·001*
Fever	1,169 (2·0)	444 (37·9)	725 (62·0)	< ·001*
Dyspnea	1,376 (2·6)	383 (27·8)	993 (72·2)	< ·001*
Confusion	413 (0·7)	119 (28·8)	294 (71·2)	< ·001*
Headache	254 (0·4)	50 (19·7)	204 (80·3)	< ·001*
Myalgias	68 (0·1)	15 (22·1)	53 (77·9)	< ·001*
Dysgeusia	16 (0·03)	13 (81·3)	3 (18·8)	< ·001*†
Chills	25 (0·04)	2 (8·0)	23 (92·0)	·616†
Vomiting/nausea	25 (0·04)	3 (12·0)	22 (88·0)	·172†
Rhinorrhea	13 (0·02)	0 (0)	13 (100·0)	·376†

*p value <·05.

†Fisher exact test.

Asymptomatic participants accounted for 87.5% of the sample. Of these asymptomatic participants, 3.6% had a positive RT-PCR result. Among the symptomatic participants, 20.2% had a positive test result. The most frequently reported symptoms among the entire cohort were sore throat (7.2%), dry cough (5.8%), fatigue (4.6%), diarrhea (3.2%), and anosmia (3.1%). Among participants reporting each symptom, the proportion with a positive test result was 81.3% for dysgeusia, 41.6% for anosmia, 37.9% for fever, 27.8% for dyspnea, 26.2% for dry cough, and 26.1% for fatigue (see Table 1).
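The percentages in the Positive and Negative columns of the symptom rows in Table 1 are row percentages, that is, the share of participants reporting a given symptom who tested positive or negative. A minimal Python sketch of that arithmetic, using counts copied from Table 1 (the variable and function names are hypothetical):

```python
# (positive, total) counts copied from Table 1.
table_1_counts = {
    "Asymptomatic": (1852, 51254),
    "Symptomatic": (1473, 7302),
    "Anosmia": (749, 1802),
    "Dry cough": (893, 3407),
    "Fatigue": (707, 2713),
}

def positivity_percent(positive, total):
    """Percentage of participants in a row who had a positive RT-PCR result."""
    return 100 * positive / total

positivity = {
    name: positivity_percent(positive, total)
    for name, (positive, total) in table_1_counts.items()
}
for name, pct in positivity.items():
    print(f"{name}: {pct:.1f}%")
```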

Model training

Given the low prevalence of the event (an overall positivity rate of 5.7% observed in the CoVIDA project) (Varela et al., 2021), resampling and oversampling were performed to address the data imbalance. The multivariable fractional polynomial algorithm with the complete database suggested an interaction between dry cough and anosmia. However, this interaction variable was not included in the final model by the leaps-and-bounds algorithm. The variables recommended by the algorithm for the logistic regression model were age, socioeconomic strata, contact with a confirmed COVID-19 case, arterial hypertension, and symptoms related to SARS-CoV-2 (fever, anosmia, dry cough, fatigue, and headache).

Table 2 presents the unadjusted and adjusted odds ratios obtained in the logistic regression. Contact with a confirmed COVID-19 case was associated with higher odds of SARS-CoV-2 infection. Regarding symptoms, anosmia showed the highest odds of a positive test (aOR = 7.76, 95% CI [6.19, 9.73]) compared to participants who did not report it. Fever had an aOR of 4.29 (95% CI [3.07, 6.02]), headache an aOR of 3.29 (95% CI [1.78, 6.07]), dry cough an aOR of 2.96 (95% CI [2.44, 3.58]), and fatigue an aOR of 1.93 (95% CI [1.57, 2.93]). Fig. 2A presents the adjusted ORs obtained in the logistic regression model in a forest plot. A diagnostic performance assessment of the model was conducted in the testing dataset of the undersampling database. The AUC was 0.73, with an SE of 26%, SP of 98%, PPV of 73%, and NPV of 86%. With the logistic regression model, 85% of participants were correctly classified. Fig. 2B shows the logistic regression model’s AUC. The variables included in each model and the logistic regression estimates for the complete database, ROSE dataset, and SMOTE dataset are presented in Supplementary Tables 1-3.
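The model equation itself is not reproduced in this text. As a hedged sketch only, a multivariable logistic regression with the predictors listed above has the following general form. The symbols $p$ (probability of a positive RT-PCR result), $\beta_0, \dots, \beta_8$, and $\gamma_j$ (coefficients of the indicator variables for socioeconomic strata) are notation assumed here, and the intercept and coefficient values are not given in this section.

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\text{age} + \sum_{j} \gamma_j\,\text{stratum}_j
  + \beta_2\,\text{contact} + \beta_3\,\text{hypertension}
  + \beta_4\,\text{fever} + \beta_5\,\text{anosmia} + \beta_6\,\text{dry cough}
  + \beta_7\,\text{fatigue} + \beta_8\,\text{headache}
```

In this form each adjusted odds ratio is the exponential of the corresponding coefficient, $\mathrm{aOR}_k = e^{\beta_k}$. For example, the aOR of 7.76 reported for anosmia corresponds to $\beta_5 = \ln 7.76 \approx 2.05$.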

Table 2

Logistic regression with undersampling dataset (n = 14,475).

Variable	Unadjusted OR	95% CI	p-value	Adjusted OR	95% CI	p-value
Age	0·99	[0·99, 1·00]	·108	1·00	[0·99, 1·01]	·052

Socioeconomic strata
Low-low	3·71	[2·42, 5·70]	< ·001*	3·16	[1·98, 5·02]	< ·001
Low	3·13	[2·16, 4·52]	< ·001*	2·70	[1·82, 5·02]	< ·001
Middle-low	2·74	[1·91, 3·95]	< ·001*	2·27	[1·81, 4·01]	< ·001
Middle	1·37	[0·93, 2·02]	·103	1·22	[0·81, 1·84]	·337
Middle-high	1·30	[0·84, 2·02]	·231	1·27	[0·79, 2·01]	·332

Contact with confirmed COVID-19	1·62	[1·45, 1·82]	< ·001*	1·27	[1·12, 1·46]	< ·001*

Arterial hypertension	0·78	[0·64, 0·96]	·020*	0·79	[1·12, 1·46]	·058

Symptoms related to SARS-CoV-2
Fever	15·75	[11·81, 21·01]	< ·001*	4·29	[3·07, 6·02]	< ·001*
Anosmia	17·19	[13·97, 20·91]	< ·001*	7·76	[6·19, 9·73]	< ·001*
Dry cough	8·02	[6·90, 9·33]	< ·001*	2·96	[2·44, 3·58]	< ·001*
Fatigue	7·09	[6·03, 8·33]	< ·001*	1·93	[1·57, 2·93]	< ·001*
Headache	4·85	[2·80, 8·37]	< ·001*	3·29	[1·78, 6·07]	< ·001*

*p value < 0·05.

Fig. 2

Logistic regression model obtained under BIC criterion. a) Forest plot for association between sociodemographic characteristics, COVID-19 related symptoms and SARS-CoV-2 positive RT-PCR test; b) ROC curve for prediction of SARS-CoV-2 positive RT-PCR test result.

Comparison of the diagnostic performance of logistic regression and ML methods is shown in Table 3. Similar AUCs were obtained using the complete database, the SMOTE dataset and the undersampling dataset. The ROSE dataset had higher AUC, SE, and PPV but lower SP and proportion of correctly classified individuals. Similar SP and PPV values were obtained using all datasets. Variable importance for each dataset using the ML approach is presented in Supplementary Figs. 1 through 4. Anosmia was considered the primary classification variable when using the complete and undersampling datasets.

Table 3

Diagnostic performance of prediction models.

Dataset	Model	AUC	SE	SP	PPV	NPV	Correctly classified
Complete data set	Logistic regression	·71	·10	·99	·57	·95	·95
	RF	·66	·04	·99	·54	·94	·95
	SVM	·59	·02	·99	·48	·95	·94
	XG boosting	·73	·07	·99	·55	·95	·95
Undersampling	Logistic regression	·73	·26	·98	·73	·86	·85
	RF	·81	·28	·98	·72	·86	·85
	SVM	·73	·29	·98	·74	·86	·85
	XG boosting	·77	·09	·99	·88	·83	·84
SMOTE	Logistic regression	·72	·22	·94	·71	·66	·67
	RF	·87	·69	·96	·91	·83	·86
	SVM	·66	·24	·94	·71	·66	·67
	XG boosting	·90	·98	·99	·97	·81	·86
ROSE	Logistic regression	·74	·47	·88	·79	·62	·67
	RF	·81	·65	·84	·80	·71	·75
	SVM	·73	·47	·88	·79	·62	·67
	XG boosting	·77	·57	·81	·76	·65	·69

Discussion

The main finding in this study is the high prediction capacity of this symptoms-based model. Our model correctly classified over 8 out of 10 participants, with an AUC of 0.73. We found that an individual with anosmia alone has over seven times the odds of having a positive RT-PCR test for SARS-CoV-2 relative to those without this symptom. Fever, headache, fatigue, and dry cough were also important symptoms in predicting SARS-CoV-2 infection. Combining classical statistical methods such as logistic regression with more modern ones such as machine learning with RF, SVM, and XG boosting aided both the robustness and the interpretation of the results. A model that uses symptoms to predict SARS-CoV-2 infection with high SP may aid clinical decision-making and the application of non-pharmacological interventions such as selective lockdowns, isolation, contact tracing, and testing. This strategy can serve as a valuable tool in limited-resource settings with scarce testing availability to allow early and precise decisions to improve epidemiological surveillance during the COVID-19 pandemic. Anosmia has been reported as a common symptom in COVID-19 patients (Lechien et al., 2020, Saniasiaya et al., 2021). This symptom has been the main feature addressed by several prediction models (Callejon-Leblic et al., 2021, Lechien et al., 2020, Saniasiaya et al., 2021). While most studies have used a binary categorization for anosmia, some studies used a visual analog scale (VAS) assessment of symptoms, resulting in better diagnostic performance (Callejon-Leblic et al., 2021, Gerkin et al., 2021). Anosmia has also been used to predict local healthcare stress and to develop public health policies to prevent strain in healthcare systems (Pierron et al., 2020).
Despite the potential of anosmia alone in predicting a positive SARS-CoV-2 test result, we found that a combination of symptoms (anosmia, fever, headache, dry cough, and fatigue) and clinical features provided the best accuracy. The study conducted by Menezes et al. used conditional inference tree analyses to identify which combinations of symptoms were most likely to predict positive test results. Changes in smell or taste, fever, and body aches had the best diagnostic performance (Menezes et al., 2021). Other symptoms such as cough, fever, fatigue, malaise or body aches, sore throat, and headache have also been included in various models. In a systematic review that evaluated the diagnostic accuracy of over 80 signs and symptoms related to SARS-CoV-2 infection, anosmia alone, ageusia alone, and anosmia and ageusia combined had the best results, with sensitivities below 50% but specificities over 90%. Various combinations of symptoms assessed in this review, primarily including fever and cough with other symptoms, had an SP above 80% but at the cost of very low SE (<30%) (Struyf et al., 2020). While our model had an AUC of 0.73, with an SP of 98% and PPV of 73%, the diagnostic performance of the models reported in the literature has varied widely. Other studies reported highly accurate models with high AUC that included only symptoms-related variables (Callejon-Leblic et al., 2021, Roland et al., 2020). Other models that have included symptoms such as fatigue, cough, fever, and respiratory and gastrointestinal symptoms have had slightly lower accuracy (La Torre et al., 2020, Lan et al., 2020, Menni et al., 2020, Tudrej et al., 2020). Diagnostic imaging has also been used to predict COVID-19 diagnosis. Deep learning models using X-rays (XR) and computed tomography (CT) radiomics have shown accuracy with AUC ranging from 0.87 (Wang et al., 2021) to 0.99 (Chen et al., 2021, Li et al., 2020a, Singh et al., 2022, Singh et al., 2021).
Other models, using radiomics and clinical information to predict severity of the disease, had higher AUC (0.897 compared to 0.847 for radiomics alone and 0.767 for clinical variables alone) (Purkayastha et al., 2021). However, accessibility to XR or CT is problematic in low-resource settings. Countries like Colombia have highly dispersed rural areas where primary care does not include clinical or imaging testing for SARS-CoV-2. In highly populated cities such as Bogotá, healthcare infrastructure remains insufficient, especially for management of mild cases that do not require fixed medical attention but may overwhelm healthcare facilities for more severe cases. Classical epidemiological methods may be limited for assessing prediction models for larger populations, especially when the low prevalence of the outcome causes an imbalance in the response variable. ML methods have been applied in population-based systems to complement classical statistical methods given their capabilities of processing large datasets, detecting patterns, and analyzing trends to provide important information to public health (Zeng et al., 2021). Recent literature has demonstrated the use of ML in research during the COVID-19 pandemic, specifically its advantages in diagnostics in middle-low- and low-income countries (Li et al., 2020c, Naseem et al., 2020). In fact, the ML methods applied in this paper are widely used for tabular data, including XG boosting, with promising results in diagnostic medicine and clinical data on COVID-19 (Li et al., 2020c). We also included two traditional ML methods, random forest and SVM, that have previously shown appropriate diagnostic performance when assessing severity in the COVID-19 pandemic (Alotaibi et al., 2021, Kumar et al., 2021) and in other medical contexts such as genomics (Ogutu et al., 2011).
Nevertheless, methods such as logistic regression provide interpretable output that lets clinicians and decision makers assess risk more intuitively than ML methods do. Hence, both methods can complement each other in prediction for public health-related events such as COVID-19. Other studies have approached COVID-19 with similar ML methods, including resampling strategies to address imbalance in the response variable, as in our case. In the specific case of Dantas et al., a large sample and low outcome prevalence (positivity rate of 11.8%) led them to evaluate several methods to arrive at a better prediction model. In their case, they used the up-sampling balancing strategy (Dantas et al., 2021). Other studies have used ML methods for diagnosis prediction. The model proposed by Zoabi et al. used a large dataset provided by the Israeli Ministry of Health and included symptoms such as cough, fever, headache, sore throat, and shortness of breath, as well as variables such as age, sex, and history of contact with a confirmed case of COVID-19, with an AUC of 0.86 (Zoabi et al., 2021). Nonetheless, this model did not include anosmia (unlike our study), which has proved to be a relevant symptom in other prediction models. The main strengths of our study were the large sample size and the comparison of several statistical approaches to develop a symptoms-based prediction model. The inclusion of several symptoms in the questionnaire applied to the participants allowed assessing relevant clinical characteristics. Our model’s high SP and NPV could be advantageous in redirecting the limited testing resources to those patients with a higher probability of having COVID-19. Such a strategy provides elements for clinical decision making, especially in countries such as Colombia that require diagnostic test results for case definition.
In the case of saturation of the public health systems, a triage testing strategy could also be helpful in assessing which individuals have the highest probability of being infected. Some limitations of our study should also be considered. The first is that symptoms were self-reported via telephone survey. Participants could therefore have reported apparent COVID-19-related symptoms that were nonspecific, and their reports could have been biased by information spread by the media and other sources. Also, the test result and self-reported symptoms may not have reflected the specific moment of the infection, which could have affected the model’s accuracy.

Conclusions

Our study produced a symptoms-based model with multiple statistical approaches that correctly identified over 85% of participants. This model can be used to strengthen epidemiological surveillance protocols, to prioritize resource allocation related to COVID-19 diagnosis, to decide on early isolation, and for contact tracing strategies in individuals with a high probability of infection and before receiving a confirmatory test result. This strategy has public health and clinical decision-making significance in low- and middle-income settings like Latin America.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

33 in total

1. A Retrospective Cohort Study to Assess the Impact of an Inpatient Infectious Disease Telemedicine Consultation Service on Hospital and Patient Outcomes.

Authors: Daniel Monkowski; Luther V Rhodes; Suzanne Templer; Sharon Kromer; Jessica Hartner; Kimberly Pianucci; Hope Kincaid
Journal: Clin Infect Dis Date: 2020-02-14 Impact factor: 9.079

2. COVID-19 symptoms predictive of healthcare workers' SARS-CoV-2 PCR results.

Authors: Fan-Yun Lan; Robert Filler; Soni Mathew; Jane Buley; Eirini Iliaki; Lou Ann Bruno-Murtha; Rebecca Osgood; Costas A Christophi; Alejandro Fernandez-Montero; Stefanos N Kales
Journal: PLoS One Date: 2020-06-26 Impact factor: 3.240

3. Impact of delays on effectiveness of contact tracing strategies for COVID-19: a modelling study.

Authors: Mirjam E Kretzschmar; Ganna Rozhnova; Martin C J Bootsma; Michiel van Boven; Janneke H H M van de Wijgert; Marc J M Bonten
Journal: Lancet Public Health Date: 2020-07-16

4. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR.

Authors: Victor M Corman; Olfert Landt; Marco Kaiser; Richard Molenkamp; Adam Meijer; Daniel Kw Chu; Tobias Bleicker; Sebastian Brünink; Julia Schneider; Marie Luisa Schmidt; Daphne Gjc Mulders; Bart L Haagmans; Bas van der Veer; Sharon van den Brink; Lisa Wijsman; Gabriel Goderski; Jean-Louis Romette; Joanna Ellis; Maria Zambon; Malik Peiris; Herman Goossens; Chantal Reusken; Marion Pg Koopmans; Christian Drosten
Journal: Euro Surveill Date: 2020-01

5. App-based symptom tracking to optimize SARS-CoV-2 testing strategy using machine learning.

Authors: Leila F Dantas; Igor T Peres; Leonardo S L Bastos; Janaina F Marchesi; Guilherme F G de Souza; João Gabriel M Gelli; Fernanda A Baião; Paula Maçaira; Silvio Hamacher; Fernando A Bozza
Journal: PLoS One Date: 2021-03-25 Impact factor: 3.240

6. Implementation and Usefulness of Telemedicine During the COVID-19 Pandemic: A Scoping Review.

Authors: María Alejandra Hincapié; Juan Carlos Gallego; Andrés Gempeler; Jorge Arturo Piñeros; Daniela Nasner; María Fernanda Escobar
Journal: J Prim Care Community Health Date: 2020 Jan-Dec

7. Testing at scale during the COVID-19 pandemic.

Authors: Tim R Mercer; Marc Salit
Journal: Nat Rev Genet Date: 2021-05-04 Impact factor: 59.581

8. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis.

Authors: Wei Tse Li; Jiayan Ma; Neil Shende; Grant Castaneda; Jaideep Chakladar; Joseph C Tsai; Lauren Apostol; Christine O Honda; Jingyue Xu; Lindsay M Wong; Tianyi Zhang; Abby Lee; Aditi Gnanasekar; Thomas K Honda; Selena Z Kuo; Michael Andrew Yu; Eric Y Chang; Mahadevan Raj Rajasekaran; Weg M Ongkeko
Journal: BMC Med Inform Decis Mak Date: 2020-09-29 Impact factor: 2.796

9. Exploring the Potential of Artificial Intelligence and Machine Learning to Combat COVID-19 and Existing Opportunities for LMIC: A Scoping Review.

Authors: Maleeha Naseem; Ramsha Akhund; Hajra Arshad; Muhammad Talal Ibrahim
Journal: J Prim Care Community Health Date: 2020 Jan-Dec