Hsiao-Ya Peng, Yen-Kuang Lin, Phung-Anh Nguyen, Jason C Hsu, Chun-Liang Chou, Chih-Cheng Chang, Chia-Chi Lin, Carlos Lam, Chang-I Chen, Kai-Hsun Wang, Christine Y Lu.
Abstract
OBJECTIVES: The coronavirus disease 2019 (COVID-19) pandemic has affected countries around the world since 2020, and the number of people infected continues to rise. The purpose of this research was to use big data and artificial intelligence technology to identify key factors associated with COVID-19 infection. The results can serve as a reference for disease prevention in practice.
Year: 2022; PMID: 36018862; PMCID: PMC9417026; DOI: 10.1371/journal.pone.0272546
Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.752
Study group comparison of continuous variables.
| Numeric variables | COVID-19 infection: yes (N = 12,716), mean | Yes, SD | COVID-19 infection: no (N = 14,305), mean | No, SD | p-value | |
|---|---|---|---|---|---|---|
| Basic characteristics | | | | | | |
| age | 32.96 | 9.41 | 42.76 | 16.33 | <0.01 | |
| Lifestyle habits | | | | | | |
| number of times of washing | 7.59 | 5.43 | 9.75 | 6.99 | <0.01 | |
| sanitizer washing | 1.67 | 0.88 | 1.94 | 1.18 | <0.01 | |
| soap washing | 1.73 | 0.96 | 1.45 | 0.8 | <0.01 | |
| frequency of cleaning | 1.7 | 0.88 | 2.29 | 1.22 | <0.01 | |
| eating alone | 1.95 | 1.05 | 3.46 | 1.59 | <0.01 | |
| sleeping alone | 1.8 | 1.05 | 3.54 | 1.69 | <0.01 | |
| frequency of mask wearing | 1.51 | 0.87 | 2.28 | 1.67 | <0.01 | |
| frequency of covering the nose and mouth | 1.67 | 0.94 | 1.43 | 0.88 | <0.01 | |
| the number of contacts with people inside the home | 4.23 | 2.01 | 3.45 | 2.21 | <0.01 | |
| the number of contacts with people outside the home | 4.86 | 4.79 | 6.85 | 9.15 | <0.01 | |
| number of times of leaving home in a day | 3.61 | 2 | 2.44 | 1.63 | <0.01 | |
| avoiding having guests | 1.91 | 1.01 | 2.05 | 1.3 | 0.457 | |
| avoiding contacting people | 1.72 | 0.95 | 1.65 | 1.2 | <0.01 | |
| avoiding going outside | 1.84 | 0.97 | 2.39 | 1.29 | <0.01 | |
| avoiding going to shops | 1.93 | 0.99 | 2.52 | 1.25 | <0.01 | |
| avoiding going to the hospital | 1.87 | 1.1 | 2.09 | 1.42 | 0.501 | |
| avoiding taking public transportation | 1.72 | 0.97 | 1.88 | 1.33 | <0.01 | |
| avoiding small social gatherings | 1.89 | 0.99 | 2.21 | 1.34 | <0.01 | |
| avoiding medium-sized social gatherings | 1.85 | 1.01 | 1.91 | 1.25 | <0.01 | |
| avoiding large-sized social gatherings | 1.85 | 0.98 | 1.63 | 1.14 | <0.01 | |
| avoiding crowded areas | 1.7 | 0.91 | 1.68 | 1.03 | <0.01 | |
| avoiding touching objects | 1.67 | 0.94 | 2 | 1.15 | <0.01 | |
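The record does not state which test produced the p-values above. As a minimal sketch, assuming a Welch two-sample t-test is an acceptable stand-in, the comparison can be recomputed from the reported means, SDs, and group sizes alone. The function names are illustrative, and the p-value relies on a normal approximation that is only reasonable because both groups exceed 10,000 respondents.

```python
import math

def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and a two-sided p-value from summary statistics.

    Assumption: the p-value uses a standard normal approximation,
    which is adequate only for very large groups such as these.
    """
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    t = (mean1 - mean2) / se
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided tail of the standard normal
    return t, p

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Group sizes and the "age" row, copied from the table
N_YES, N_NO = 12716, 14305
t_age, p_age = welch_from_summary(32.96, 9.41, N_YES, 42.76, 16.33, N_NO)
d_age = cohens_d(32.96, 9.41, N_YES, 42.76, 16.33, N_NO)
print(f"t = {t_age:.1f}, p = {p_age:.3g}, d = {d_age:.2f}")
```

With samples this large, almost any difference reaches p < 0.01, so an effect size such as Cohen's d is often more informative than the p-value when reading the rows.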
Study group comparison of categorical variables.
| Categorical variables | COVID-19 infection: yes (N = 12,716), n | Yes, % | COVID-19 infection: no (N = 14,305), n | No, % | p-value | |
|---|---|---|---|---|---|---|
| Basic characteristics | | | | | | |
| gender(male) | 8006 | 63.00% | 7148 | 50.00% | <0.01 | |
| the number of people in the household | | | | | | |
| 0 or not sure | 279 | 2.20% | 249 | 1.70% | <0.01 | |
| 1 | 2355 | 18.50% | 1981 | 13.80% | <0.01 | |
| 2 | 1485 | 11.70% | 3662 | 25.60% | <0.01 | |
| 3 | 1675 | 13.20% | 2934 | 20.50% | <0.01 | |
| 4 | 1843 | 14.50% | 2782 | 19.40% | <0.01 | |
| 5 | 2121 | 16.70% | 1494 | 10.40% | <0.01 | |
| 6 | 830 | 6.50% | 651 | 4.60% | <0.01 | |
| 7 | 640 | 5.00% | 279 | 2.00% | <0.01 | |
| 8 or more | 1488 | 11.70% | 273 | 1.90% | <0.01 | |
| number of children in the household | | | | | | |
| 0 or not sure | 2505 | 19.70% | 7532 | 52.70% | <0.01 | |
| 1 | 3929 | 30.90% | 3280 | 22.90% | <0.01 | |
| 2 | 2771 | 21.80% | 2146 | 15.00% | <0.01 | |
| 3 | 1011 | 8.00% | 713 | 5.00% | <0.01 | |
| 4 | 627 | 4.90% | 337 | 2.40% | <0.01 | |
| 5 | 5 | 0.00% | 2 | 0.00% | <0.01 | |
| 6 or more | 1868 | 14.70% | 295 | 2.10% | <0.01 | |
| country | | | | | | |
| Australia | 321 | 2.50% | 653 | 4.60% | <0.01 | |
| Brazil | 231 | 1.80% | 423 | 3.00% | <0.01 | |
| Canada | 99 | 0.80% | 398 | 2.80% | <0.01 | |
| China | 333 | 2.60% | 652 | 4.60% | <0.01 | |
| Denmark | 77 | 0.60% | 459 | 3.20% | <0.01 | |
| Finland | 53 | 0.40% | 491 | 3.40% | <0.01 | |
| France | 211 | 1.70% | 711 | 5.00% | <0.01 | |
| Germany | 170 | 1.30% | 629 | 4.40% | <0.01 | |
| Hong Kong | 118 | 0.90% | 271 | 1.90% | <0.01 | |
| India | 615 | 4.80% | 639 | 4.50% | <0.01 | |
| Indonesia | 191 | 1.50% | 460 | 3.20% | <0.01 | |
| Italy | 253 | 2.00% | 650 | 4.50% | <0.01 | |
| Japan | 43 | 0.30% | 251 | 1.80% | <0.01 | |
| Malaysia | 194 | 1.50% | 491 | 3.40% | <0.01 | |
| Mexico | 127 | 1.00% | 461 | 3.20% | <0.01 | |
| Netherlands | 137 | 1.10% | 243 | 1.70% | <0.01 | |
| Norway | 159 | 1.30% | 435 | 3.00% | <0.01 | |
| Philippines | 124 | 1.00% | 462 | 3.20% | <0.01 | |
| Saudi Arabia | 1339 | 10.50% | 398 | 2.80% | <0.01 | |
| South Korea | 212 | 1.70% | 227 | 1.60% | <0.01 | |
| Spain | 126 | 1.00% | 600 | 4.20% | <0.01 | |
| Sweden | 178 | 1.40% | 580 | 4.10% | <0.01 | |
| Taiwan | 127 | 1.00% | 449 | 3.10% | <0.01 | |
| Thailand | 1925 | 15.10% | 415 | 2.90% | <0.01 | |
| United Arab Emirates | 1972 | 15.50% | 376 | 2.60% | <0.01 | |
| United Kingdom | 68 | 0.50% | 1013 | 7.10% | <0.01 | |
| United States | 488 | 3.80% | 656 | 4.60% | <0.01 | |
| Vietnam | 2825 | 22.20% | 812 | 5.70% | <0.01 | |
| Lifestyle habits | | | | | | |
| self-isolating | 8161 | 64.20% | 9802 | 68.50% | <0.01 | |
| having difficulties isolating | | | | | | |
| Very easy | 7586 | 59.70% | 4589 | 32.10% | <0.01 | |
| Somewhat easy | 2029 | 16.00% | 4407 | 30.80% | <0.01 | |
| Neither easy nor difficult | 1423 | 11.20% | 2480 | 17.30% | <0.01 | |
| Somewhat difficult | 971 | 7.60% | 1690 | 11.80% | <0.01 | |
| Very difficult | 361 | 2.80% | 703 | 4.90% | <0.01 | |
| Not sure | 346 | 2.70% | 436 | 3.00% | <0.01 | |
| being willing to isolate | | | | | | |
| Very willing | 7449 | 58.60% | 8257 | 57.70% | <0.01 | |
| Somewhat willing | 2930 | 23.00% | 3635 | 25.40% | <0.01 | |
| Neither willing nor unwilling | 1127 | 8.90% | 1371 | 9.60% | <0.01 | |
| Somewhat unwilling | 641 | 5.00% | 423 | 3.00% | <0.01 | |
| Very unwilling | 216 | 1.70% | 219 | 1.50% | <0.01 | |
| Not sure | 353 | 2.80% | 400 | 2.80% | <0.01 | |
| whether the family had been tested | 7181 | 56.50% | 54 | 0.40% | <0.01 | |
| Disease history | | | | | | |
| AIDS | 4736 | 37.20% | 67 | 0.50% | <0.01 | |
| arthritis | 5532 | 43.50% | 879 | 6.10% | <0.01 | |
| asthma | 5591 | 44.00% | 1124 | 7.90% | <0.01 | |
| cancer | 4864 | 38.30% | 389 | 2.70% | <0.01 | |
| cystic fibrosis | 4545 | 35.70% | 72 | 0.50% | <0.01 | |
| COPD | 4902 | 38.50% | 290 | 2.00% | <0.01 | |
| diabetes | 5250 | 41.30% | 943 | 6.60% | <0.01 | |
| epilepsy | 4749 | 37.30% | 113 | 0.80% | <0.01 | |
| heart disease | 5283 | 41.50% | 2116 | 14.80% | <0.01 | |
| hypertension | 4807 | 37.80% | 483 | 3.40% | <0.01 | |
| mental disease | 4837 | 38.00% | 791 | 5.50% | <0.01 | |
| multiple sclerosis | 4454 | 35.00% | 71 | 0.50% | <0.01 | |
| not willing to say | 471 | 3.70% | 598 | 4.20% | <0.01 | |
| no disease | 3501 | 27.50% | 8677 | 60.70% | <0.01 | |
| Symptoms | | | | | | |
| cough | 5797 | 45.60% | 824 | 5.80% | <0.01 | |
| fever | 6071 | 47.70% | 454 | 3.20% | <0.01 | |
| loss of smell | 5543 | 43.60% | 319 | 2.20% | <0.01 | |
| loss of taste | 5745 | 45.20% | 321 | 2.20% | <0.01 | |
| having difficulty breathing | 5609 | 44.10% | 552 | 3.90% | <0.01 | |
| no symptoms | 5159 | 40.60% | 12979 | 90.70% | <0.01 | |
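For the binary rows, a group comparison can be sketched as a Pearson chi-square test on the 2x2 table of counts. This is an assumption about the kind of test involved, not a statement of the authors' method, and the helper names are illustrative. The example uses the "gender(male)" row and the group sizes given in the header.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    return chi2, p

def odds_ratio(a, b, c, d):
    """Odds of the attribute in group 1 divided by the odds in group 2."""
    return (a * d) / (b * c)

# Counts copied from the table: males among infected and non-infected respondents
N_YES, N_NO = 12716, 14305
male_yes, male_no = 8006, 7148

chi2, p = chi_square_2x2(male_yes, N_YES - male_yes, male_no, N_NO - male_no)
or_male = odds_ratio(male_yes, N_YES - male_yes, male_no, N_NO - male_no)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, odds ratio = {or_male:.2f}")
```

Multi-level variables such as country would need a test on the full contingency table rather than one 2x2 table per level.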
Machine learning model indices.
| Index | LR | RF | SVM | ANN |
|---|---|---|---|---|
| Accuracy | 0.952 | 0.957* | 0.953 | 0.953 |
| Sensitivity | 0.935 | 0.957 | 0.967* | 0.963 |
| Specificity | 0.968* | 0.959 | 0.940 | 0.942 |
| PPV | 0.963* | 0.954 | 0.931 | 0.934 |
| NPV | 0.943 | 0.960 | 0.972* | 0.968 |
| AUROC | 0.953 | 0.988* | 0.987 | 0.986 |
LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.
*: the best performing model for each index
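All indices in the table except AUROC follow from a confusion matrix. The sketch below shows the standard definitions. The counts are invented for illustration, since the record does not report the underlying confusion matrices, and the function name is hypothetical.

```python
def classification_indices(tp, fp, tn, fn):
    """Standard indices derived from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of infected cases correctly flagged
        "specificity": tn / (tn + fp),  # share of non-infected cases correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical test-set counts, NOT taken from the study
m = classification_indices(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and NPV rise together as false negatives fall, while specificity and PPV rise together as false positives fall, which is consistent with the same model leading both indices of each pair in the table.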
Fig 1. ROC curve.
LR = logistic regression; DT = decision tree; RF = random forest; SVM = support vector machine; NN = artificial neural network.
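The area under the ROC curve can be read as the probability that a randomly chosen infected respondent receives a higher predicted score than a randomly chosen non-infected one. A minimal sketch of that pairwise definition follows. The labels and scores are made up for illustration, and the quadratic loop is meant for clarity rather than for data sets of this study's size.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = infected) and model scores, NOT from the study
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auroc(labels, scores)
print(f"AUROC = {auc:.3f}")
```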
Variables by importance in four models.
[Table body not recoverable from the source extraction: rows ranked 1 to 15, with one column each for LR, RF, SVM, and ANN.]
LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.
Four types: [1] Basic characteristics; [2] Lifestyle habits; [3] Disease history; [4] Symptoms.
*: negative correlation.
Weighted importance of variables by model.
| Categories | Variables | LR | RF | SVM | ANN | Total |
|---|---|---|---|---|---|---|
| Lifestyle habits | whether the family had been tested | 15 | 15 | 15 | 15 | 60 |
| Symptoms | no symptoms | 14 | 14 | 0 | 14 | 42 |
| Symptoms | loss of smell | 8 | 9 | 8 | 11 | 36 |
| Symptoms | loss of taste | 6 | 11 | 6 | 8 | 31 |
| Disease history | epilepsy | 11 | 6 | 2 | 10 | 29 |
| Disease history | AIDS | 3 | 7 | 9 | 9 | 28 |
| Disease history | cystic fibrosis | 9 | 3 | 3 | 12 | 27 |
| Lifestyle habits | sleeping alone | 13 | 12 | 0 | 0 | 25 |
| Basic characteristics | country | 12 | 13 | 0 | 0 | 25 |
| Lifestyle habits | the number of times of leaving home in a day | 10 | 1 | 14 | 0 | 25 |
| Disease history | COPD | 1 | 2 | 11 | 4 | 18 |
| Disease history | cancer | 5 | 0 | 4 | 7 | 16 |
| Disease history | multiple sclerosis | 2 | 0 | 0 | 13 | 15 |
| Lifestyle habits | number of times of washing | 0 | 0 | 13 | 0 | 13 |
| Lifestyle habits | frequency of covering the nose and mouth | 0 | 0 | 12 | 1 | 13 |
| Symptoms | fever | 0 | 10 | 0 | 2 | 12 |
| Lifestyle habits | avoiding crowded areas | 0 | 0 | 10 | 0 | 10 |
| Disease history | arthritis | 4 | 0 | 0 | 3 | 7 |
| Basic characteristics | age | 7 | 0 | 0 | 0 | 7 |
| Lifestyle habits | soap washing | 0 | 0 | 7 | 0 | 7 |
| Disease history | heart disease | 0 | 0 | 0 | 6 | 6 |
| Lifestyle habits | eating alone | 0 | 5 | 0 | 0 | 5 |
| Symptoms | having difficulty breathing | 0 | 5 | 0 | 0 | 5 |
| Lifestyle habits | frequency of cleaning | 0 | 0 | 0 | 5 | 5 |
| Symptoms | cough | 0 | 4 | 0 | 0 | 4 |
| Lifestyle habits | avoiding medium-sized social gatherings | 0 | 0 | 1 | 0 | 1 |
| Basic characteristics | the number of contacts with people outside the home | 0 | 0 | 0 | 0 | 0 |
LR = logistic regression; RF = random forest; SVM = support vector machine; ANN = artificial neural network.
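The Total column is the sum of the four per-model weights. The sketch below reproduces that sum for the first five rows of the table. The `rank_to_weight` helper encodes an assumed scoring rule (rank 1 earns 15 points, rank 15 earns 1, unranked variables earn 0) that is consistent with the weights shown but is not stated in this record.

```python
def rank_to_weight(rank, top_k=15):
    """Assumed rule: rank 1 earns top_k points, rank top_k earns 1, anything else earns 0."""
    return top_k + 1 - rank if 1 <= rank <= top_k else 0

# Per-model weights (LR, RF, SVM, ANN) copied from the first five rows of the table
weights = {
    "whether the family had been tested": (15, 15, 15, 15),
    "no symptoms": (14, 14, 0, 14),
    "loss of smell": (8, 9, 8, 11),
    "loss of taste": (6, 11, 6, 8),
    "epilepsy": (11, 6, 2, 10),
}

# Total weighted importance per variable, sorted from most to least important
totals = {name: sum(w) for name, w in weights.items()}
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for name, total in ranking:
    print(f"{total:>3}  {name}")
```

Summing ranks across models rewards variables that several models agree on, which is why a variable ranked first by all four models reaches the maximum of 60.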