Elijah A Adeoye, Yelena Rozenfeld, Jennifer Beam, Karen Boudreau, Emily J Cox, James M Scanlan.
Abstract
Notable discrepancies in vulnerability to COVID-19 infection have been identified between specific population groups and regions in the USA. The purpose of this study was to estimate the likelihood of COVID-19 infection using a machine-learning algorithm that can be updated continuously based on health care data. Patient records were extracted for all COVID-19 nasal swab PCR tests performed within the Providence St. Joseph Health system from February to October of 2020. A total of 316,599 participants were included in this study, and approximately 7.7% (n = 24,358) tested positive for COVID-19. A gradient boosting model, LightGBM (LGBM), predicted risk of initial infection with an area under the receiver operating characteristic curve of 0.819. Factors that predicted infection were cough, fever, being a member of the Hispanic or Latino community, being Spanish speaking, having a history of diabetes or dementia, and living in a neighborhood with housing insecurity. A model trained on sociodemographic, environmental, and medical history data performed well in predicting risk of a positive COVID-19 test. This model could be used to tailor education, public health policy, and resources for communities that are at the greatest risk of infection.
Keywords: COVID-19; Infection; Risk; Social determinants of health
Year: 2022 PMID: 35538201 PMCID: PMC9090454 DOI: 10.1007/s11517-022-02549-5
Source DB: PubMed Journal: Med Biol Eng Comput ISSN: 0140-0118 Impact factor: 3.079
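The abstract summarizes model quality as an area under the receiver operating characteristic curve (AUC). As a minimal sketch of what that number measures, the following computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = tested positive, 0 = tested negative.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889 (8 of 9 positive/negative pairs are ordered correctly)
```

The pairwise loop is quadratic in the number of cases, so for a data set of this study's size a rank-based or library implementation would be the practical choice.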
Fig. 1 Schematic of the predictive modeling experiments performed to predict risk of initial COVID-19 infection. Legend: ^LGBM outperformed the other models on the 25% test set. We therefore re-trained an LGBM model on an 80/10/10 split (1) to increase the size of the training set and (2) to guard against overfitting by evaluating its performance on both a test set and a validation set. The 80% training split contained 253,279 samples (train_negative = 233,889; train_positive = 19,390). After case–control augmentation (downsampling the negative training samples to match the number of positive samples), train_negative = 19,390 and train_positive = 19,390. RFE = recursive feature elimination, LR = logistic regression, LGBM = light gradient boosting machine, PCA = principal component analysis, SMOTE = synthetic minority oversampling technique. *LGBM was the final selected model. The refresh icon indicates that the LGBM model was put through a second round of modeling with a train, test, and validation split of 80/10/10, respectively, for the final steps
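The split and the case–control downsampling described in the Fig. 1 legend can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the split is a plain random shuffle (the legend does not say whether it was stratified), and the labels are synthetic placeholders built only from the counts given in the legend.

```python
import random


def split_80_10_10(n_rows, seed=0):
    """Shuffle row indices and cut them into 80% train, 10% test, 10% validation."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut_train = n_rows * 8 // 10  # integer arithmetic avoids float rounding
    cut_test = n_rows * 9 // 10
    return idx[:cut_train], idx[cut_train:cut_test], idx[cut_test:]


def downsample_negatives(labels, seed=0):
    """Keep every positive and an equally sized random sample of negatives.

    Assumes negatives outnumber positives; random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg_idx, k=len(pos_idx))
    return sorted(pos_idx + kept_neg)


# Split sizes for the full cohort of 316,599 tested people.
train_idx, test_idx, val_idx = split_80_10_10(316_599)
print(len(train_idx), len(test_idx), len(val_idx))  # 253279 31660 31660

# Synthetic training labels with the class counts reported in the legend.
train_labels = [1] * 19_390 + [0] * 233_889
balanced_idx = downsample_negatives(train_labels)
n_pos = sum(train_labels[i] for i in balanced_idx)
n_neg = len(balanced_idx) - n_pos
print(n_pos, n_neg)  # 19390 19390
```

Downsampling is applied to the training portion only in this sketch, so the test and validation sets keep the original class imbalance.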
Study participant demographics and characteristics
| Characteristic | Tested people, N | % of total (a) | Tested positive, N | In-group, % (b) | Tested negative, N | In-group, % (b) |
|---|---|---|---|---|---|---|
| Sociodemographic | ||||||
| Age | ||||||
| < 18 | 25,640 | 8.10 | 1766 | 6.89 | 23,874 | 93.11 |
| 18–29 | 51,328 | 16.21 | 4992 | 9.73 | 46,336 | 90.27 |
| 30–39 | 49,570 | 15.66 | 3875 | 7.82 | 45,695 | 92.18 |
| 40–49 | 41,634 | 13.15 | 3565 | 8.56 | 38,069 | 91.44 |
| 50–59 | 45,760 | 14.45 | 3707 | 8.10 | 42,053 | 91.90 |
| 60–69 | 45,976 | 14.52 | 2804 | 6.10 | 43,172 | 93.90 |
| 70–79 | 34,057 | 10.76 | 1941 | 5.70 | 32,116 | 94.30 |
| 80 + | 22,634 | 7.15 | 1708 | 7.55 | 20,926 | 92.45 |
| Gender | ||||||
| Female | 179,381 | 56.66 | 12,826 | 7.15 | 166,555 | 92.85 |
| Male | 137,218 | 43.34 | 11,532 | 8.40 | 125,686 | 91.60 |
| Education | ||||||
| Education < 12 years | 219,444 | 69.31 | 13,409 | 6.11 | 206,035 | 93.89 |
| Employment | ||||||
| Student | 17,475 | 5.52 | 1574 | 9.01 | 15,901 | 90.99 |
| Employed | 131,019 | 41.38 | 10,725 | 8.19 | 120,294 | 91.81 |
| Not employed | 58,380 | 18.44 | 4946 | 8.47 | 53,434 | 91.53 |
| Retired | 63,324 | 20.00 | 3864 | 6.10 | 59,460 | 93.90 |
| Unknown | 46,401 | 14.66 | 3249 | 7.00 | 43,152 | 93.00 |
| Race | ||||||
| White | 199,492 | 63.01 | 9742 | 4.88 | 189,750 | 95.12 |
| American Indian/Alaska Native | 4069 | 1.29 | 293 | 7.20 | 3776 | 92.80 |
| Asian | 13,334 | 4.21 | 1044 | 7.83 | 12,290 | 92.17 |
| Black/African American | 12,018 | 3.80 | 1095 | 9.11 | 10,923 | 90.89 |
| Native Hawaiian/Pacific Islander | 2700 | 0.85 | 424 | 15.70 | 2276 | 84.30 |
| Hispanic/Latino | 39,997 | 12.63 | 7962 | 19.91 | 32,035 | 80.09 |
| Unknown | 44,989 | 14.21 | 3798 | 8.44 | 41,191 | 91.56 |
| Ethnicity | ||||||
| Other ethnic groups | 276,602 | 87.37 | 16,396 | 5.93 | 260,206 | 94.07 |
| Hispanic or Latino | 39,997 | 12.63 | 7962 | 19.91 | 32,035 | 80.09 |
| Religious affiliation | ||||||
| Agnostic | 90,655 | 28.63 | 5585 | 6.16 | 85,070 | 93.84 |
| Christian | 121,557 | 38.39 | 10,293 | 8.47 | 111,264 | 91.53 |
| Other religion | 10,534 | 3.33 | 679 | 6.45 | 9855 | 93.55 |
| Unknown | 93,853 | 29.64 | 7801 | 8.31 | 86,052 | 91.69 |
| Relationship | ||||||
| Single | 123,850 | 39.12 | 10,096 | 8.15 | 113,754 | 91.85 |
| Divorced or legally separated | 37,797 | 11.94 | 2412 | 6.38 | 35,385 | 93.62 |
| Married or significant other | 128,944 | 40.73 | 9817 | 7.61 | 119,127 | 92.39 |
| Unknown | 26,008 | 8.21 | 2033 | 7.82 | 23,975 | 92.18 |
| Language | ||||||
| English | 288,252 | 91.05 | 18,964 | 6.58 | 269,288 | 93.42 |
| Sino-Tibetan | 2192 | 0.69 | 244 | 11.13 | 1948 | 88.87 |
| Spanish | 12,435 | 3.93 | 3679 | 29.59 | 8756 | 70.41 |
| Other languages | 13,720 | 4.33 | 1471 | 10.72 | 12,249 | 89.28 |
| Clinical | ||||||
| Body mass index | ||||||
| Normal | 66,179 | 20.90 | 4231 | 6.39 | 61,948 | 93.61 |
| Underweight | 5180 | 1.64 | 296 | 5.71 | 4884 | 94.29 |
| Moderately obese | 45,918 | 14.50 | 4061 | 8.84 | 41,857 | 91.16 |
| Overweight | 70,933 | 22.40 | 5918 | 8.34 | 65,015 | 91.66 |
| Severely obese | 23,334 | 7.37 | 2078 | 8.91 | 21,256 | 91.09 |
| Very severely obese | 19,981 | 6.31 | 1643 | 8.22 | 18,338 | 91.78 |
| Unknown | 85,074 | 26.87 | 6131 | 7.21 | 78,943 | 92.79 |
| Number of chronic conditions | ||||||
| 0 | 141,916 | 44.83 | 12,551 | 8.84 | 129,365 | 91.16 |
| 1–2 | 103,464 | 32.68 | 7629 | 7.37 | 95,835 | 92.63 |
| 3–4 | 46,632 | 14.73 | 2905 | 6.23 | 43,727 | 93.77 |
| 5 + | 24,587 | 7.77 | 1273 | 5.18 | 23,314 | 94.82 |
| Clinical diagnosis | ||||||
| Diagnosis of diabetes | 34,930 | 11.03 | 3340 | 9.56 | 31,992 | 91.59 |
| Diagnosis of kidney disease | 789 | 0.25 | 94 | 11.91 | 709 | 89.86 |
| Diagnosis of HIV/AIDS | 767 | 0.24 | 54 | 7.04 | 718 | 93.61 |
| Diagnosis of dementia | 7316 | 2.31 | 910 | 12.44 | 6510 | 88.98 |
| Polypharmacy | ||||||
| 0 prescriptions | 104,273 | 32.94 | 9066 | 8.69 | 95,207 | 91.31 |
| 1–9 prescriptions | 160,387 | 50.66 | 12,403 | 7.73 | 147,984 | 92.27 |
| 10–19 prescriptions | 38,656 | 12.21 | 2238 | 5.79 | 36,418 | 94.21 |
| 20–29 prescriptions | 9809 | 3.10 | 481 | 4.90 | 9328 | 95.10 |
| 30 + prescriptions | 3474 | 1.10 | 170 | 4.89 | 3304 | 95.11 |
| Mental health and substance use | ||||||
| History of illicit drug use | 35,588 | 11.24 | 1561 | 4.39 | 34,027 | 95.61 |
| History of tobacco use | 40,352 | 12.75 | 1836 | 4.55 | 38,516 | 95.45 |
| Diagnosis of serious persistent mental illness | 30,246 | 9.55 | 1286 | 4.25 | 28,960 | 95.75 |
| Diagnosis of substance use disorder | 24,757 | 7.82 | 1071 | 4.33 | 23,686 | 95.67 |
| Primary care affiliation | ||||||
| Internal primary care provider | 112,191 | 35.44 | 7017 | 6.25 | 105,174 | 93.75 |
| External primary care provider | 116,348 | 36.75 | 8708 | 7.48 | 107,640 | 92.52 |
| Unknown primary care provider | 88,060 | 27.81 | 8633 | 9.80 | 79,427 | 90.20 |
| Symptoms | ||||||
| Fever | 101,388 | 32.02 | 15,157 | 14.95 | 86,231 | 85.05 |
| Cough | 113,047 | 35.71 | 16,319 | 14.44 | 96,728 | 85.56 |
| Breath | 107,216 | 33.86 | 13,642 | 12.72 | 93,574 | 87.28 |
| Chills | 6443 | 2.04 | 950 | 14.74 | 5493 | 85.26 |
| Myalgia | 8587 | 2.71 | 1686 | 19.63 | 6901 | 80.37 |
| Environmental | ||||||
| Region | ||||||
| Oregon | 83,293 | 26.31 | 5018 | 6.02 | 78,275 | 93.98 |
| Alaska | 17,269 | 5.45 | 857 | 4.96 | 16,412 | 95.04 |
| Puget Sound | 34,437 | 10.88 | 2144 | 6.23 | 32,293 | 93.77 |
| Southern California | 65,815 | 20.79 | 7389 | 11.23 | 58,426 | 88.77 |
| Washington/Montana | 115,589 | 36.51 | 8931 | 7.73 | 106,658 | 92.27 |
| Unknown | 196 | 0.06 | 19 | 9.69 | 177 | 90.31 |
| Age-stratified communal living | ||||||
| Non-communal living | 230,410 | 72.78 | 16,624 | 7.21 | 213,786 | 92.79 |
| Adult community | 12,534 | 3.96 | 1055 | 8.42 | 11,479 | 91.58 |
| Adult and youth | 46,996 | 14.84 | 4460 | 9.49 | 42,536 | 90.51 |
| Multigenerational | 15,481 | 4.89 | 1535 | 9.92 | 13,946 | 90.08 |
| Senior living | 2876 | 0.91 | 300 | 10.43 | 2576 | 89.57 |
| Other | 8302 | 2.62 | 384 | 4.63 | 7918 | 95.37 |
| Financial insecurity | 98,537 | 31.12 | 10,285 | 10.44 | 88,252 | 89.56 |
| Housing insecurity | 72,081 | 22.77 | 8849 | 12.28 | 63,232 | 87.72 |
| Transportation insecurity | 88,401 | 27.92 | 7240 | 8.19 | 81,161 | 91.81 |
Legend: Characteristics of the patient population included in this analysis
(a) % of total is the number of tested people in the row as a percentage of the total N (316,599)
(b) In-group % is the percentage of the tested people in each row who tested positive or negative
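The two percentage columns defined in the legend can be reproduced from the raw counts. The short sketch below does so for the "< 18" age row; the helper names are hypothetical, and the counts are copied from the table.

```python
TOTAL_TESTED = 316_599  # total N from the table legend


def pct_of_total(n_group, total=TOTAL_TESTED):
    """Share of all tested people who fall in this row, in percent."""
    return round(100 * n_group / total, 2)


def in_group_pct(n_subset, n_group):
    """Share of the row's tested people with a given test result, in percent."""
    return round(100 * n_subset / n_group, 2)


# Row "< 18": 25,640 tested, 1766 positive, 23,874 negative.
share_of_cohort = pct_of_total(25_640)
positive_rate = in_group_pct(1766, 25_640)
negative_rate = in_group_pct(23_874, 25_640)
print(share_of_cohort, positive_rate, negative_rate)  # 8.1 6.89 93.11
```

For rows built from mutually exclusive counts, the positive and negative in-group percentages should sum to 100, which gives a quick consistency check on any row.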
Area under the curve (AUC) of modeling experiments run to predict COVID-19 risk of infection
| Trial | Augmentation/feature reduction | Model | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 1 | RFE | LR | 0.767 | 0.093 | 0.994 |
| 2 | CC | LGBM* | 0.814 | 0.718 | 0.754 |
| 3 | CC | LGBM** | 0.727 | 0.623 | 0.713 |
| 4 | CC | Ensemble | 0.816 | 0.717 | 0.760 |
| 5 | CC | LR | 0.800 | 0.721 | 0.730 |
| 6 | CC-PCA | Ensemble | 0.805 | 0.714 | 0.745 |
| 7 | CC-RFE | Ensemble | 0.816 | 0.715 | 0.759 |
| 8 | CC-RFE | LR | 0.800 | 0.721 | 0.731 |
| 9 | SMOTE | Ensemble | 0.797 | 0.552 | 0.864 |
| 10 | SMOTE | LR | 0.759 | 0.624 | 0.759 |
| 11 | SMOTE-PCA | Ensemble | 0.802 | 0.622 | 0.823 |
| 12 | SMOTE-RFE | Ensemble | 0.792 | 0.555 | 0.858 |
| 13 | SMOTE-RFE | LR | 0.756 | 0.621 | 0.760 |
Legend: All models included symptoms as predictors except for trial 3. Except for the light gradient boosting machine (LGBM) model, the reported area under the receiver operating characteristic curve (AUC ROC) scores are for the 25% held-out test set of the 75/25 train/test split. For the LGBM model, an 80/10/10 training/test/validation split was used, and AUC is given for performance on the final validation set
RFE = recursive feature elimination, LR = logistic regression, CC = case–control, LGBM = light gradient boosting machine, PCA = principal component analysis, SMOTE = synthetic minority oversampling technique
*Final selected model. This was the model that was used for the SHAP scores with symptoms presented in Fig. 2
**Final selected model without symptoms. This was the model that was used for the SHAP scores without symptoms presented in Fig. 3
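The sensitivity and specificity columns follow the standard confusion-matrix definitions: sensitivity is the fraction of true positives that the model flags, and specificity is the fraction of true negatives that it clears. A minimal sketch, using hypothetical toy labels and predictions rather than study data:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy example: 4 true positives and 6 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3))  # 0.75 0.667
```

Read against the table, a row such as trial 1 (sensitivity 0.093, specificity 0.994) describes a model that labels almost every case negative, which is the pattern the class-balancing steps (CC, SMOTE) are meant to counter.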
Fig. 2 Relative contribution of predictors in a machine learning model predicting COVID-19 infection based on symptoms and demographic information. Legend: (A) SHapley Additive exPlanations (SHAP) scores showing the average impact of each predictor on the model. SHAP values were computed using the final LGBM model. Higher SHAP values correspond to increased COVID-19 infection risk. (B) The relative importance of the top 20 COVID-19 predictors, shown in descending order. Each dot in the plot corresponds to the prediction for a single patient. The horizontal axis shows the relative impact of a low or high prediction value for each variable, with the impact ranging from blue (least associated with infection) to red (most associated with infection). Blue on the left to red on the right shows increasing infection risk as the feature increases (e.g., Cough: 0 = No Cough, 1 = Cough). Red on the left to blue on the right shows decreasing infection risk as the feature increases (e.g., polypharmacy)
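An "average impact" ranking of the kind shown in panel A is commonly obtained by averaging the absolute SHAP value of each feature over all patients. The sketch below assumes that convention; the caption does not state the exact aggregation used. The function name and the SHAP values are hypothetical placeholders, not values from the study.

```python
def rank_by_mean_abs_shap(feature_names, shap_rows):
    """Rank features by mean |SHAP| across patients, largest first.

    shap_rows holds one list per patient, with one SHAP value per feature.
    """
    n = len(shap_rows)
    means = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for three patients and three predictors.
features = ["cough", "fever", "polypharmacy"]
shap_rows = [
    [0.50, 0.20, -0.10],
    [-0.30, 0.40, -0.20],
    [0.40, -0.10, 0.05],
]
ranking = rank_by_mean_abs_shap(features, shap_rows)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
# cough: 0.400
# fever: 0.233
# polypharmacy: 0.117
```

Taking absolute values means this ranking reflects only the size of each feature's effect; the direction (raising or lowering predicted risk) is what the per-patient dots in panel B convey.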
Fig. 3 Relative contribution of predictor variables in a machine learning model trained to predict COVID-19 infection based on demographic information alone. Legend: (A) SHAP scores showing the average impact of each predictor on the model, computed using the final LGBM model. Higher SHAP values correspond to increased COVID-19 infection risk. (B) The top 20 COVID-19 demographic predictors, without symptoms, shown in descending order. All other computational and graphic elements (use of dots, color coding, variable score association strength shown by horizontal axis) are identical to those used for Fig. 2A and B