
Determinants of coronavirus disease 2019 infection by artificial intelligence technology: A study of 28 countries.

Hsiao-Ya Peng, Yen-Kuang Lin, Phung-Anh Nguyen, Jason C Hsu, Chun-Liang Chou, Chih-Cheng Chang, Chia-Chi Lin, Carlos Lam, Chang-I Chen, Kai-Hsun Wang, Christine Y Lu.

Abstract

OBJECTIVES: The coronavirus disease 2019 pandemic has affected countries around the world since 2020, and an increasing number of people are being infected. The purpose of this research was to use big data and artificial intelligence technology to find key factors associated with coronavirus disease 2019 infection. The results can be used as a reference for disease prevention in practice.
METHODS: This study obtained data from the "Imperial College London YouGov Covid-19 Behaviour Tracker Open Data Hub", covering a total of 291,780 questionnaire results from 28 countries (April 1 to August 31, 2020). Data included basic characteristics, lifestyle habits, disease history, and symptoms of each subject. Four types of machine learning classification models (logistic regression, random forest, support vector machine, and artificial neural network) were used to build prediction models. The performance of each model is presented as the area under the receiver operating characteristic curve. This study then further processed the important factors selected by each model to obtain an overall ranking of determinants.
RESULTS: The areas under the receiver operating characteristic curve of the prediction models established by the four machine learning methods were all >0.95, and the random forest (RF) had the highest performance (area under the receiver operating characteristic curve of 0.988). The top ten factors associated with coronavirus disease 2019 infection were identified in order of importance: whether the family had been tested, having no symptoms, loss of smell, loss of taste, a history of epilepsy, acquired immune deficiency syndrome, cystic fibrosis, sleeping alone, country, and the number of times leaving home in a day.
CONCLUSIONS: This study used big data from 28 countries and artificial intelligence methods to determine the predictors of coronavirus disease 2019 infection. The findings provide important insights for coronavirus disease 2019 infection prevention strategies.

Year:  2022        PMID: 36018862      PMCID: PMC9417026          DOI: 10.1371/journal.pone.0272546

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

The coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread rapidly around the world since December 2019, causing global panic and affecting all aspects of people’s lives and the economy. As of July 2021, there had been more than 188 million confirmed COVID-19 cases worldwide and at least 4.06 million deaths [1]. Identifying high-risk groups, taking preventive measures as early as possible, and caring for those who may get sick are important goals for preventing further spread of the pandemic. Traditionally, logistic regression has often been used to explore which factors are significantly associated with the occurrence of disease, and thus to inform prevention efforts. With the rise of artificial intelligence (AI) in recent years and the development of various algorithms, including machine learning and deep learning algorithms, researchers can use available data to build more accurate prediction models [2]. A prediction model generated by a single algorithm, based on one operating logic, may not be the most suitable model. Integrating multiple prediction models built with algorithms based on different operating logics can generate more comprehensive and objective results. Researchers are increasingly using AI methods to predict and prevent the occurrence of diseases. In response to the COVID-19 pandemic, medical and academic professionals around the world have also adopted various machine learning and deep learning methods to conduct research on preventing and treating COVID-19. For example, a previous study identified weather and climate conditions, such as temperature and humidity, that might affect the spread of the virus [3].
AI technology has also been applied to medical images (chest X-ray images) to predict whether patients were infected [4], to track the chain of virus transmission, and to assist in the development of vaccines and drugs [5]. Demographic data (e.g., age) and clinical data (e.g., renal function and the results of COVID-19 RT-PCR tests) have been used as predictive indicators to assist in diagnosis [6, 7]. In addition, the combination of modern medical and AI technologies has greatly improved the screening, prediction, and tracking of virus contacts, and increased the reliability of vaccine and medication development [8, 9]. Many studies also focus on confirmed COVID-19 patients, using machine learning methods to build predictive models for disease prognosis, including severity or mortality [10-12]. Furthermore, some scholars have used AI technologies to predict the trend of the spread of COVID-19 [13, 14] and health system failure [15] from a public health perspective. None of the abovementioned studies used data from multiple countries together with multiple algorithms. To help fill this gap in knowledge, this study investigated the factors associated with COVID-19 infection using big data from multiple countries and multiple algorithms. The study had two purposes. The first was to use machine learning methods to generate a predictive model for COVID-19 infection that uses simple information to check preliminarily whether an infection is possible. The second was to determine important features of COVID-19 infection and to propose precautions and preventive measures to the public based on the results. This study used publicly available questionnaire survey data, which included basic information, living habits, disease history, and symptoms of respondents from 28 countries.
The predictive model established with AI technology can help us understand the determinants of COVID-19 infection and may help people avoid unnecessary hospital visits and nosocomial infections.

Methods

This section describes the data sources, cohort selection, descriptive statistics, algorithms used in this study, methods of comparing results obtained from different algorithms, and the way key determinants were identified.

Data sources

Data used in this study were from the Imperial College London YouGov Covid-19 Behavior Tracker Data Hub. YouGov partnered with the Institute of Global Health Innovation at Imperial College London to gather global insights on people’s behaviors in response to COVID-19. The data in this database came from a questionnaire survey of people in 28 countries [16]. Use of data from online open databases for research purposes is exempt from review by the Institutional Review Board (IRB) in Taiwan because the data used are public information. This study used data collected in the above database between April 1 and August 31, 2020. Based on the results of the literature review, we applied a clinical perspective and consulted with clinicians and experts to determine 52 factors (including basic characteristics, lifestyle habits, disease histories, and symptoms) that may lead to COVID-19 infection, and used them to build predictive models. Four categories of possible influencing factors were collected. The first category consisted of basic characteristics, including gender, age, number of people in the household, number of children in the household, and country. The second category was lifestyle habits, including number of times washing, sanitizer washing, soap washing, frequency of cleaning, eating alone, sleeping alone, frequency of mask wearing, frequency of covering the nose and mouth, the number of contacts with people inside the home, the number of contacts with people outside the home, number of times of leaving home in a day, avoiding having guests, avoiding contacting people, avoiding going outside, avoiding going to shops, avoiding going to the hospital, avoiding taking public transportation, avoiding small social gatherings, avoiding medium-sized social gatherings, avoiding large-sized social gatherings, avoiding crowded areas, avoiding touching objects, self-isolating, having difficulties isolating, being willing to isolate, and whether the family had been tested.
The third category was disease history, including acquired immune deficiency syndrome (AIDS), arthritis, asthma, cancer, cystic fibrosis, chronic obstructive pulmonary disease (COPD), diabetes, epilepsy, heart disease, hyperlipidemia, hypertension, mental disease, multiple sclerosis, not willing to say, and no disease. The last category was symptoms, including cough, fever, loss of smell, loss of taste, having difficulty breathing, and no symptoms (see S1 Appendix). In total, 52 possible influential factors were assessed in this study.

Cohort selection

This study retrieved the original data of 315,276 interviewees from the above database (April 1 to August 31, 2020). After excluding records with missing data (n = 10,106) and outliers (n = 13,390), 291,780 people remained in this study. Outliers were unreasonable values, such as washing more than 50 times a day or leaving home more than 20 times a day. This study finally selected cases from 28 countries and used a total of 52 influencing variables to establish a prediction model for COVID-19 infection (see S1 Appendix). Among the 291,780 cases, only 3,179 were COVID-infected patients (positive samples), and the other 288,601 were non-infected patients (negative samples). Because the two groups differed so greatly in size, a prediction model trained on such imbalanced data might not be accurate. Therefore, this study used the Synthetic Minority Over-sampling Technique (SMOTE) [17] to resolve this data imbalance problem. SMOTE was used to generate additional synthetic positive samples with similar distributions, based on the distribution characteristics of the original positive samples. After the samples in this study were processed by SMOTE, the final number of positive samples was 12,716, and the number of negative samples was 14,305. Differences between variables in the two groups are shown in Table 1 (continuous variables) and Table 2 (categorical variables).
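The interpolation idea behind SMOTE can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (they do not report which SMOTE implementation or settings they used): the function name `smote`, the neighbour count `k`, the seed, and the toy data are all assumptions made for illustration. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Illustrative sketch only: `k` and `seed` are assumed values, not
    settings reported in the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four positive samples with two numeric features.
positives = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(positives, n_synthetic=12, k=2)
print(len(positives) + len(new_points))
```

In the toy call, 12 synthetic points are added to 4 originals, a fourfold increase; the reported positive counts (3,179 before and 12,716 after) are likewise in a ratio of exactly four. Plain interpolation suits numeric features; how the categorical survey items were handled is not stated in the text.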
Table 1

Study group comparison of continuous variables.

COVID-19 infection: Yes (N = 12,716); No (N = 14,305).

| Numeric variable | Yes: mean | Yes: SD | No: mean | No: SD | p-value |
|---|---|---|---|---|---|
| Basic characteristics | | | | | |
| age | 32.96 | 9.41 | 42.76 | 16.33 | <0.01 |
| Lifestyle habits | | | | | |
| number of times of washing | 7.59 | 5.43 | 9.75 | 6.99 | <0.01 |
| sanitizer washing | 1.67 | 0.88 | 1.94 | 1.18 | <0.01 |
| soap washing | 1.73 | 0.96 | 1.45 | 0.8 | <0.01 |
| frequency of cleaning | 1.7 | 0.88 | 2.29 | 1.22 | <0.01 |
| eating alone | 1.95 | 1.05 | 3.46 | 1.59 | <0.01 |
| sleeping alone | 1.8 | 1.05 | 3.54 | 1.69 | <0.01 |
| frequency of mask wearing | 1.51 | 0.87 | 2.28 | 1.67 | <0.01 |
| frequency of covering the nose and mouth | 1.67 | 0.94 | 1.43 | 0.88 | <0.01 |
| the number of contacts with people inside the home | 4.23 | 2.01 | 3.45 | 2.21 | <0.01 |
| the number of contacts with people outside the home | 4.86 | 4.79 | 6.85 | 9.15 | <0.01 |
| number of times of leaving home in a day | 3.61 | 2 | 2.44 | 1.63 | <0.01 |
| avoiding having guests | 1.91 | 1.01 | 2.05 | 1.3 | 0.457 |
| avoiding contacting people | 1.72 | 0.95 | 1.65 | 1.2 | <0.01 |
| avoiding going outside | 1.84 | 0.97 | 2.39 | 1.29 | <0.01 |
| avoiding going to shops | 1.93 | 0.99 | 2.52 | 1.25 | <0.01 |
| avoiding going to the hospital | 1.87 | 1.1 | 2.09 | 1.42 | 0.501 |
| avoiding taking public transportation | 1.72 | 0.97 | 1.88 | 1.33 | <0.01 |
| avoiding small social gatherings | 1.89 | 0.99 | 2.21 | 1.34 | <0.01 |
| avoiding medium-sized social gatherings | 1.85 | 1.01 | 1.91 | 1.25 | <0.01 |
| avoiding large-sized social gatherings | 1.85 | 0.98 | 1.63 | 1.14 | <0.01 |
| avoiding crowded areas | 1.7 | 0.91 | 1.68 | 1.03 | <0.01 |
| avoiding touching objects | 1.67 | 0.94 | 2 | 1.15 | <0.01 |

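Table 2

Study group comparison of categorical variables.

COVID-19 infection: Yes (N = 12,716); No (N = 14,305).

| Categorical variable | Yes: n | Yes: % | No: n | No: % | p-value |
|---|---|---|---|---|---|
| Basic characteristics | | | | | |
| gender (male) | 8006 | 63.00% | 7148 | 50.00% | <0.01 |
| the number of people in the household | | | | | |
| 0 or not sure | 279 | 2.20% | 249 | 1.70% | <0.01 |
| 1 | 2355 | 18.50% | 1981 | 13.80% | <0.01 |
| 2 | 1485 | 11.70% | 3662 | 25.60% | <0.01 |
| 3 | 1675 | 13.20% | 2934 | 20.50% | <0.01 |
| 4 | 1843 | 14.50% | 2782 | 19.40% | <0.01 |
| 5 | 2121 | 16.70% | 1494 | 10.40% | <0.01 |
| 6 | 830 | 6.50% | 651 | 4.60% | <0.01 |
| 7 | 640 | 5.00% | 279 | 2.00% | <0.01 |
| 8 or more | 1488 | 11.70% | 273 | 1.90% | <0.01 |
| number of children in the household | | | | | |
| 0 or not sure | 2505 | 19.70% | 7532 | 52.70% | <0.01 |
| 1 | 3929 | 30.90% | 3280 | 22.90% | <0.01 |
| 2 | 2771 | 21.80% | 2146 | 15.00% | <0.01 |
| 3 | 1011 | 8.00% | 713 | 5.00% | <0.01 |
| 4 | 627 | 4.90% | 337 | 2.40% | <0.01 |
| 5 | 5 | 0.00% | 2 | 0.00% | <0.01 |
| 6 or more | 1868 | 14.70% | 295 | 2.10% | <0.01 |
| country | | | | | |
| Australia | 321 | 2.50% | 653 | 4.60% | <0.01 |
| Brazil | 231 | 1.80% | 423 | 3.00% | <0.01 |
| Canada | 99 | 0.80% | 398 | 2.80% | <0.01 |
| China | 333 | 2.60% | 652 | 4.60% | <0.01 |
| Denmark | 77 | 0.60% | 459 | 3.20% | <0.01 |
| Finland | 53 | 0.40% | 491 | 3.40% | <0.01 |
| France | 211 | 1.70% | 711 | 5.00% | <0.01 |
| Germany | 170 | 1.30% | 629 | 4.40% | <0.01 |
| Hong Kong | 118 | 0.90% | 271 | 1.90% | <0.01 |
| India | 615 | 4.80% | 639 | 4.50% | <0.01 |
| Indonesia | 191 | 1.50% | 460 | 3.20% | <0.01 |
| Italy | 253 | 2.00% | 650 | 4.50% | <0.01 |
| Japan | 43 | 0.30% | 251 | 1.80% | <0.01 |
| Malaysia | 194 | 1.50% | 491 | 3.40% | <0.01 |
| Mexico | 127 | 1.00% | 461 | 3.20% | <0.01 |
| Netherlands | 137 | 1.10% | 243 | 1.70% | <0.01 |
| Norway | 159 | 1.30% | 435 | 3.00% | <0.01 |
| Philippines | 124 | 1.00% | 462 | 3.20% | <0.01 |
| Saudi Arabia | 1339 | 10.50% | 398 | 2.80% | <0.01 |
| South Korea | 212 | 1.70% | 227 | 1.60% | <0.01 |
| Spain | 126 | 1.00% | 600 | 4.20% | <0.01 |
| Sweden | 178 | 1.40% | 580 | 4.10% | <0.01 |
| Taiwan | 127 | 1.00% | 449 | 3.10% | <0.01 |
| Thailand | 1925 | 15.10% | 415 | 2.90% | <0.01 |
| United Arab Emirates | 1972 | 15.50% | 376 | 2.60% | <0.01 |
| United Kingdom | 68 | 0.50% | 1013 | 7.10% | <0.01 |
| United States | 488 | 3.80% | 656 | 4.60% | <0.01 |
| Vietnam | 2825 | 22.20% | 812 | 5.70% | <0.01 |
| Lifestyle habits | | | | | |
| self-isolating | 8161 | 64.20% | 9802 | 68.50% | <0.01 |
| having difficulties isolating | | | | | |
| Very easy | 7586 | 59.70% | 4589 | 32.10% | <0.01 |
| Somewhat easy | 2029 | 16.00% | 4407 | 30.80% | <0.01 |
| Neither easy nor difficult | 1423 | 11.20% | 2480 | 17.30% | <0.01 |
| Somewhat difficult | 971 | 7.60% | 1690 | 11.80% | <0.01 |
| Very difficult | 361 | 2.80% | 703 | 4.90% | <0.01 |
| Not sure | 346 | 2.70% | 436 | 3.00% | <0.01 |
| being willing to isolate | | | | | |
| Very willing | 7449 | 58.60% | 8257 | 57.70% | <0.01 |
| Somewhat willing | 2930 | 23.00% | 3635 | 25.40% | <0.01 |
| Neither willing nor unwilling | 1127 | 8.90% | 1371 | 9.60% | <0.01 |
| Somewhat unwilling | 641 | 5.00% | 423 | 3.00% | <0.01 |
| Very unwilling | 216 | 1.70% | 219 | 1.50% | <0.01 |
| Not sure | 353 | 2.80% | 400 | 2.80% | <0.01 |
| whether the family had been tested | 7181 | 56.50% | 54 | 0.40% | <0.01 |
| Disease history | | | | | |
| AIDS | 4736 | 37.20% | 67 | 0.50% | <0.01 |
| arthritis | 5532 | 43.50% | 879 | 6.10% | <0.01 |
| asthma | 5591 | 44.00% | 1124 | 7.90% | <0.01 |
| cancer | 4864 | 38.30% | 389 | 2.70% | <0.01 |
| cystic fibrosis | 4545 | 35.70% | 72 | 0.50% | <0.01 |
| COPD | 4902 | 38.50% | 290 | 2.00% | <0.01 |
| diabetes | 5250 | 41.30% | 943 | 6.60% | <0.01 |
| epilepsy | 4749 | 37.30% | 113 | 0.80% | <0.01 |
| heart disease | 5283 | 41.50% | 2116 | 14.80% | <0.01 |
| hypertension | 4807 | 37.80% | 483 | 3.40% | <0.01 |
| mental disease | 4837 | 38.00% | 791 | 5.50% | <0.01 |
| multiple sclerosis | 4454 | 35.00% | 71 | 0.50% | <0.01 |
| not willing to say | 471 | 3.70% | 598 | 4.20% | <0.01 |
| no disease | 3501 | 27.50% | 8677 | 60.70% | <0.01 |
| Symptoms | | | | | |
| cough | 5797 | 45.60% | 824 | 5.80% | <0.01 |
| fever | 6071 | 47.70% | 454 | 3.20% | <0.01 |
| loss of smell | 5543 | 43.60% | 319 | 2.20% | <0.01 |
| loss of taste | 5745 | 45.20% | 321 | 2.20% | <0.01 |
| having difficulty breathing | 5609 | 44.10% | 552 | 3.90% | <0.01 |
| no symptoms | 5159 | 40.60% | 12979 | 90.70% | <0.01 |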

Descriptive statistics

This study used the Wilcoxon rank-sum test for quantitative variables such as age and the chi-square test for proportions. Analyses were performed in R, and two-tailed p values of <0.05 were considered statistically significant.

Algorithms used in this study for prediction models

To evaluate whether a given subject will be diagnosed with COVID-19 according to both geographical and lifestyle features based on the survey items, the target variable was coded 1 for cases diagnosed with COVID-19 and 0 for individuals not diagnosed with COVID-19. As this is a typical classification problem, this study used four types of machine learning classification models: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN). Four models were chosen so that their performance, and the features each of them selected, could be compared. This study randomly divided the data into an 80% training set and a 20% validation set. Models were trained on the training dataset and verified using the validation dataset. The generalizability of each model was assessed on the validation dataset. The four models used in this study are described below. LR is used to classify binary categories by predicting the probabilities of outcomes. It is the most popular and simplest method applied to classification problems [18, 19]. One of the advantages of LR is that it is easy to understand how it operates, and it can also be applied to select important variables. RF is an ensemble learning method for classification, and it is often viewed as an extension of the decision tree. RF works by constructing a multitude of decision trees and taking the mode of their predicted classes as the final class. That is, every tree carries the same weight. Each tree is treated as a voter that classifies a data point into one category, and the majority decision of all trees is the final classification of the data point. The advantage of the RF is that it is less prone to overfitting than a single decision tree [20]. SVM tries to find an optimal hyperplane with which to classify data [21].
The optimal hyperplane is the decision boundary that maximizes the margin between the two classes. Data points lying on the margin are called support vectors. The advantage of the SVM is that it can be applied to high-dimensional datasets by adjusting the kernel function, but it requires more computation time than other models [22]. The development of the ANN is based on simulating how the human brain operates [23]. An ANN is made up of neurons arranged in layers: one input layer, one or two hidden layers, and one output layer. Neurons in a layer connect to those in a neighboring layer with different weights. The model is trained by adjusting the weights to minimize the error function. Although training a neural network is complicated, it provides good performance on classification tasks [24]. We used the caret package (Classification And REgression Training), which contains functions to streamline the model training process [25]. For the LR model, we used the method glm, which has no tuning parameters; for the RF model, we used the method rf, whose tuning parameter is mtry (the number of randomly selected predictors); for the SVM model, we used the method svmLinear, whose tuning parameter is C (cost); and for the ANN model, we used the default method mlp, whose tuning parameter is size (the number of hidden units). In this study, the ANN model was built with 2 hidden layers. The rectified linear unit (ReLU) and softmax functions were used as the activation functions of the hidden layers and the output layer, respectively.
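The random 80%/20% division described above can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code: the function name `train_validation_split`, the seed, and the use of 27,021 rows (12,716 + 14,305, the sample counts reported after SMOTE) are assumptions, since the text does not state at which stage the split was made.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an assumed value
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in record identifiers; 27,021 = 12,716 positives + 14,305 negatives.
records = list(range(12716 + 14305))
train, valid = train_validation_split(records)
print(len(train), len(valid))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same validation set.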

Comparison of results obtained by different algorithms

Six performance metrics were used to evaluate the models: accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUROC). Accuracy is the sum of true positive and true negative predictions divided by the total number of positive and negative samples. Sensitivity measures the proportion of positives that are correctly identified (i.e., the proportion of those correctly identified as having the condition among those who are affected). Specificity measures the proportion of negatives that are correctly identified (i.e., the proportion of those correctly identified as not having the condition among those who are unaffected). The PPV and NPV are the proportions of positive and negative predictions, respectively, that are correct; a higher value indicates greater accuracy. Unlike the true positive and true negative rates, the PPV and NPV are not intrinsic to the test; they also depend on the prevalence. The AUROC measures the entire two-dimensional area underneath the receiver operating characteristic (ROC) curve, where the ROC curve plots the true positive rate against the false positive rate. The higher the AUROC, the better the model is at distinguishing between patients with the disease and those without it.
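The six indices defined above can be computed from predicted labels and scores as in the following Python sketch. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration; they are not taken from the study. The AUROC is computed through its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and model scores for eight subjects.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
metrics["auroc"] = auroc(y_true, scores)
print(metrics)
```

The pairwise AUROC loop is quadratic in the sample size, which is fine for a small illustration; a rank-based formula would be preferable for a validation set of several thousand rows.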

Determinants of coronavirus disease 2019 infection

To obtain the important variables, we used the function varImp(object = [model_name]) [26]. For the SVM classification models, the default behavior is to compute the area under the ROC curve, and this area is used as the measure of variable importance. For the ANN models, the method uses combinations of the absolute values of the weights, as introduced by Gevrey et al. (2003) [27]. First, this study used the analytical results of the four models to identify the 15 most important features of COVID-19 infection in each model. This study assigned 15 points to the most important feature of each model, 14 points to the second most important feature, and so on. Then, this study calculated the total score of each feature through a composite weighted scoring method, and finally sorted the total scores from high to low.
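The composite scoring rule can be sketched in Python as follows. The function name `composite_scores` and the tie-breaking by feature name are assumptions; the scoring rule itself (top rank receives `top_n` points, the next one point fewer, unranked features receive none) follows the description above. To keep the example short, it uses only the first three ranks of each model from Table 4 with `top_n=3`, so the resulting totals are illustrative and do not reproduce Table 5.

```python
from collections import defaultdict


def composite_scores(rankings, top_n=15):
    """Sum rank-based points per feature across models.

    `rankings` maps a model name to its features ordered from most to
    least important; rank 1 earns `top_n` points, rank 2 earns
    `top_n - 1`, and so on.
    """
    totals = defaultdict(int)
    for ranked in rankings.values():
        for position, feature in enumerate(ranked[:top_n]):
            totals[feature] += top_n - position
    # Highest total first; ties broken alphabetically (an assumed convention).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# First three ranks of each model, truncated from Table 4 for illustration.
rankings = {
    "LR": ["whether the family had been tested", "no symptoms", "sleeping alone"],
    "RF": ["whether the family had been tested", "no symptoms", "country"],
    "SVM": ["whether the family had been tested",
            "number of times of leaving home in a day",
            "number of times of washing"],
    "ANN": ["whether the family had been tested", "no symptoms",
            "multiple sclerosis"],
}

for feature, total in composite_scores(rankings, top_n=3):
    print(total, feature)
```

With the full 15-item rankings and `top_n=15`, the same function yields one total per feature that can be compared against the published weighted table.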

Results

Table 1 shows differences between the two groups in various continuous variables. Compared to non-infected patients, infected patients were younger. This study found that compared to non-infected patients, infected patients had a lower number of times washing, number of times washing with sanitizer, frequency of cleaning, frequency of mask wearing, and number of times contacting people outside the home, and lower rates of eating alone, sleeping alone, avoiding having guests, avoiding going outside, avoiding going to shops, avoiding going to the hospital, avoiding taking public transportation, avoiding small social gatherings, avoiding medium-sized social gatherings, and avoiding touching objects. Table 2 shows differences between the two groups in various categorical variables. Compared to non-infected patients, infected patients had a higher proportion of males, a higher number of people (or children) in the household, and higher proportions with a history of various diseases and with each symptom. The countries that each accounted for more than 10% of all infected cases were Vietnam, the United Arab Emirates, Thailand, and Saudi Arabia. Table 3 shows the accuracy, sensitivity, specificity, PPV, NPV, and AUROC of the four prediction models. The accuracy of the RF model was the highest (0.957); the SVM had the highest sensitivity (0.967); the LR had the highest specificity (0.968); the LR had the highest PPV (0.963); and the SVM had the highest NPV (0.972). The RF had the highest AUROC (0.988), followed by the SVM (0.987), ANN (0.986), and LR (0.953). The ROC curves in Fig 1 show that the AUROC values of the RF, SVM, and ANN were the best and were similar. Although the AUROC of the LR was lower than those of the other models, it was still >0.95.
Table 3

Machine learning model indices.

| Index | LR | RF | SVM | ANN |
|---|---|---|---|---|
| Accuracy | 0.952 | 0.957* | 0.953 | 0.953 |
| Sensitivity | 0.935 | 0.957 | 0.967* | 0.963 |
| Specificity | 0.968* | 0.959 | 0.94 | 0.942 |
| PPV | 0.963* | 0.954 | 0.931 | 0.934 |
| NPV | 0.943 | 0.96 | 0.972* | 0.968 |
| AUROC | 0.953 | 0.988* | 0.987 | 0.986 |

LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.

*: the best performing model for each index

Fig 1

ROC curve.

LR = logistic regression; DT = decision tree; RF = random forest; SVM = support vector machine; NN = artificial neural network.

Table 4 summarizes the 15 most important variables of COVID-19 infection based on the four algorithms. “Whether the family had been tested” is the top-ranked variable in all models, and “no symptoms” ranks second in the LR, RF and ANN models. After weighting, “whether the family had been tested” is the most critical factor, which suggests that having at least one family member who had been exposed to and tested for COVID-19 was a strong predictor of COVID-19 infection among respondents. This was followed by “no symptoms”, “loss of smell”, “loss of taste”, “epilepsy”, “AIDS”, “cystic fibrosis”, “sleeping alone”, “country” and “the number of times of leaving home in a day” (see Table 5).
Table 4

Variables by importance in four models.

| Rank | LR | RF | SVM | ANN |
|---|---|---|---|---|
| 1 | [2] whether the family had been tested | [2] whether the family had been tested | [2] whether the family had been tested | [3] whether the family had been tested |
| 2 | [4] no symptoms* | [4] no symptoms | [2] number of times of leaving home in a day | [4] no symptoms* |
| 3 | [2] sleeping alone | [1] country | [2] number of times of washing | [3] multiple sclerosis |
| 4 | [1] Thailand | [2] sleeping alone | [2] frequency of covering the nose and mouth | [3] cystic fibrosis |
| 5 | [3] epilepsy | [4] loss of taste | [3] COPD | [4] loss of smell |
| 6 | [2] number of times of leaving home in a day | [4] fever | [2] avoiding crowded areas | [3] epilepsy |
| 7 | [3] cystic fibrosis | [4] loss of smell | [3] AIDS | [3] AIDS |
| 8 | [4] loss of smell | [2] eat alone | [4] loss of smell | [4] loss of taste |
| 9 | [1] age* | [3] AIDS | [2] soap washing | [3] cancer |
| 10 | [4] loss of taste | [3] epilepsy | [4] loss of taste | [3] heart disease |
| 11 | [3] cancer | [4] having difficulty breathing | [1] the number of contacts with people inside the home | [2] frequency of cleaning |
| 12 | [3] arthritis | [4] cough | [3] cancer | [3] COPD |
| 13 | [3] AIDS | [3] cystic fibrosis | [3] cystic fibrosis | [3] arthritis |
| 14 | [3] multiple sclerosis | [3] COPD | [3] epilepsy | [4] fever |
| 15 | [3] COPD | [2] number of times of leaving home in a day | [2] avoiding medium-sized social gatherings | [2] frequency of covering the nose and mouth |

LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.

Four types: [1] Basic characteristics [2] Lifestyle habits [3] Disease history [4] Symptom

*: means negative correlation

Table 5

Weighted importance of variables by model.

| Category | Variable | LR | RF | SVM | ANN | Total |
|---|---|---|---|---|---|---|
| Lifestyle habits | whether the family had been tested | 15 | 15 | 15 | 15 | 60 |
| Symptoms | no symptoms | 14 | 14 | 0 | 14 | 42 |
| Symptoms | loss of smell | 8 | 9 | 8 | 11 | 36 |
| Symptoms | loss of taste | 6 | 11 | 6 | 8 | 31 |
| Disease history | epilepsy | 11 | 6 | 2 | 10 | 29 |
| Disease history | AIDS | 3 | 7 | 9 | 9 | 28 |
| Disease history | cystic fibrosis | 9 | 3 | 3 | 12 | 27 |
| Lifestyle habits | sleeping alone | 13 | 12 | 0 | 0 | 25 |
| Basic characteristics | country | 12 | 13 | 0 | 0 | 25 |
| Lifestyle habits | the number of times of leaving home in a day | 10 | 1 | 14 | 0 | 25 |
| Disease history | COPD | 1 | 2 | 11 | 4 | 18 |
| Disease history | cancer | 5 | 0 | 4 | 7 | 16 |
| Disease history | multiple sclerosis | 2 | 0 | 0 | 13 | 15 |
| Lifestyle habits | number of times of washing | 0 | 0 | 13 | 0 | 13 |
| Lifestyle habits | frequency of covering the nose and mouth | 0 | 0 | 12 | 1 | 13 |
| Symptoms | fever | 0 | 10 | 0 | 2 | 12 |
| Lifestyle habits | avoiding crowded areas | 0 | 0 | 10 | 0 | 10 |
| Disease history | arthritis | 4 | 0 | 0 | 3 | 7 |
| Basic characteristics | age | 7 | 0 | 0 | 0 | 7 |
| Lifestyle habits | soap washing | 0 | 0 | 7 | 0 | 7 |
| Disease history | heart disease | 0 | 0 | 0 | 6 | 6 |
| Lifestyle habits | eating alone | 0 | 5 | 0 | 0 | 5 |
| Symptoms | having difficulty breathing | 0 | 5 | 0 | 0 | 5 |
| Lifestyle habits | frequency of cleaning | 0 | 0 | 0 | 5 | 5 |
| Symptoms | cough | 0 | 4 | 0 | 0 | 4 |
| Lifestyle habits | avoiding medium-sized social gatherings | 0 | 0 | 1 | 0 | 1 |
| Basic characteristics | the number of contacts with people outside the home | 0 | 0 | 0 | 0 | 0 |

LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.

Discussion

This is one of the first studies to use a large amount of survey data from 28 countries (315,276 interviewees), covering basic characteristics, lifestyle, disease history, and COVID-19 symptoms, together with AI technologies to predict COVID-19 infection. The AUROC of the models ranged from 0.953 to 0.988, and the RF model had the highest AUROC (0.988). The prediction accuracy of all models was higher than 93%, with high sensitivity (≥91%) and high specificity (≥94%). Among them, the RF’s accuracy (95.7%) was the highest. The results indicated that the most important factors of COVID-19 infection were, in order, whether the family had been tested, having no symptoms, loss of smell, loss of taste, a history of epilepsy, AIDS and cystic fibrosis. In contrast to studies relying on high-cost and difficult-to-access medical imaging data, this study used a questionnaire survey based on basic characteristics and behaviors of individuals across many countries, and used machine learning methods to obtain very high accuracy (93% to 96%) for COVID-19 infection prediction models. This study included four major categories of variables (basic characteristics, lifestyle habits, disease histories, and symptoms), with a total of 52 variables. These variables allow a detailed examination of multiple factors possibly affecting COVID-19 infection. Based on the findings, this study recommends the following for COVID-19 prevention in countries around the world. (1) Age: Young people are more susceptible to infection, possibly because they have more opportunities to socialize and contact others. (2) High-risk groups based on medical history (prevention): People with a history of epilepsy, AIDS or cystic fibrosis should pay special attention. (3) High-risk groups based on symptoms (emergency): Patients with symptoms of loss of smell and loss of taste should pay more attention.
(4) The importance of screening when a person is exposed: people who have family members being tested are more likely to be found to be infected. (5) Lifestyle recommendations: individuals who sleep alone and leave home less often might reduce their risk of COVID-19 infection. This study has several limitations. The data source of the study was a questionnaire survey across 28 countries. The study was based on survey responses, which are vulnerable to recall bias and to underestimation attributable to bias in the detection and reporting of COVID-19 infection. Further, this study is a secondary analysis of existing data sourced from an international survey. Therefore, the analysis and findings are restricted to the range of information and level of detail collected by the original survey. The survey may underrepresent the most socially disadvantaged individuals and those in remote areas, particularly those without phones, those speaking other languages, or those whose health limited their participation. Possible sources of non-sampling error in the original survey include non-response bias and cultural differences in question interpretation. While the analysis provides insights into behaviors for preventing COVID-19 infection, this study did not assess the actual effects of the recommended behaviors (such as leaving the home less often) on avoiding infection, which is beyond the scope of this study. Moreover, this study did not have information on the severity or the outcome of COVID-19 infection (such as death). Future studies are warranted to predict severe COVID-19 infection and COVID-related mortality. Finally, this study did not have information for developing prediction models specific to regions and ethnic groups [28]; this should be an important area for future research as it may be informative for prevention strategy development. Nevertheless, AI models built on big data can serve as an exemplar for disease risk prediction.

Conclusions

To date, the health, lives, and economies of people in all countries around the world are still being greatly affected by the COVID-19 pandemic. This study used international survey data, including disease history and lifestyle habits, and AI methods to predict COVID-19 infection. The findings indicate that young people, those with a history of epilepsy, AIDS or cystic fibrosis, and those with symptoms such as loss of smell and loss of taste are at high risk of COVID-19 infection. Important prevention behaviors include COVID screening (especially when a family member is being tested for COVID), sleeping alone, and leaving home less often. These findings can be applied in practice, including to help identify high-risk groups and to avoid COVID-19 infection through changes in lifestyle habits.

S1 Appendix. Variables type and description.

(DOC)