Tarun Karthik Kumar Mamidi¹, Thi K Tran-Nguyen², Ryan L Melvin³, Elizabeth A Worthey¹˒².
Abstract
Developing an accurate and interpretable model to predict an individual's risk for Coronavirus Disease 2019 (COVID-19) is a critical step to efficiently triage testing and other scarce preventative resources. To aid in this effort, we have developed an interpretable risk calculator that uses de-identified electronic health records (EHR) from the University of Alabama at Birmingham Informatics for Integrating Biology and the Bedside (UAB-i2b2) COVID-19 repository under the U-BRITE framework. The generated risk scores are analogous to commonly used credit scores, where higher scores indicate higher risk of COVID-19 infection. By design, these risk scores can easily be calculated in spreadsheets or even with pen and paper. To predict risk, we implemented a Credit Scorecard modeling approach on longitudinal EHR data from 7,262 patients enrolled in the UAB Health System who were evaluated and/or tested for COVID-19 between January and June 2020. In this cohort, 912 patients were positive for COVID-19. Our workflow considered the timing of symptoms and medical conditions and tested its effect by applying different variable selection techniques such as LASSO and Elastic-Net. Within the two weeks before a COVID-19 diagnosis, the most predictive features were respiratory symptoms such as cough, abnormalities of breathing, and pain in the throat and chest, as well as chronic conditions including nicotine dependence and major depressive disorder. When the timeframe was extended to include all medical conditions across all time, our models also uncovered several chronic conditions affecting the respiratory, cardiovascular, central nervous, and urinary organ systems. The whole pipeline of data processing, risk modeling, and web-based risk calculator can be applied to any EHR data following the OMOP common data format. The results can be used to generate questionnaires that estimate COVID-19 risk for screening at building entries or to optimize hospital resources.
Keywords: COVID-19; ICD-10; credit scorecard model; electronic health record; risk prediction
Year: 2021 PMID: 34151259 PMCID: PMC8211871 DOI: 10.3389/fdata.2021.675882
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
Demographics and Clinical Characteristics of the UAB LDS N3C Cohort.
| UAB LDS N3C cohort ( | ||
|---|---|---|
| COVID-19 testing: | ||
| COVID-19 results | Positive ( | Negative ( |
| Total COVID tests | 1,328 | 7,596 |
| COVID Tests/Person | 1.46 | 1.20 |
| All medical tests: | ||
| All tests | 1,951,404 | 17,395,613 |
| All tests/person | 2,139 | 2,739 |
| Age | mean = 52 (10–119) | mean = 52 (<1–119) |
| Gender: | ||
| Male (%) | 394 (43%) | 3,035 (48%) |
| Female (%) | 516 (57%) | 3,314 (52%) |
| Unknown (%) | 2 (0%) | 1 (0%) |
| Race: | ||
| White (%) | 337 (37%) | 3,441 (54%) |
| Black (%) | 416 (46%) | 2,497 (39%) |
| Asian (%) | 27 (3%) | 70 (1%) |
| Hispanic (%) | 28 (3%) | 174 (3%) |
| Others (%) | 104 (11%) | 168 (3%) |
| Conditions: | ||
| Total conditions | 129,091 | 1,133,396 |
| Unique conditions | 9,224 | 24,101 |
| #Conditions/Person | 142 | 178 |
| #Unique conditions/Person | 10 | 4 |
| Smoking: | ||
| Current smoker | 81 (9%) | 1,602 (25%) |
| Former smoker | 196 (21.5%) | 1,625 (26%) |
| Never smoker | 368 (40%) | 2,589 (41%) |
| Unknown | 13 (1%) | 64 (1%) |
| Substance use: | ||
| Current substance abuse | 27 (3%) | 895 (14%) |
| No substance abuse | 632 (69%) | 4,716 (74%) |
| Former substance abuse | 32 (3.5%) | 402 (6%) |
| Unknown | 15 (1.6%) | 74 (1%) |
| Alcohol use: | ||
| Current alcohol | 273 (30%) | 1,954 (31%) |
| Former alcohol | 58 (6%) | 652 (10%) |
| No alcohol | 379 (41.5%) | 3,459 (54.5%) |
| Unknown | 12 (1.3%) | 80 (1%) |
| Weight: | ||
| Underweight (BMI < 19) | 20 (2%) | 271 (4%) |
| Normal weight (BMI = 20–25) | 49 (5%) | 563 (9%) |
| Overweight (BMI = 25–40) | 320 (35%) | 2,439 (38%) |
| Obese (BMI > 40) | 120 (13%) | 773 (12%) |
FIGURE 1 Overview of workflow.
FIGURE 2 LASSO vs. Elastic-Net model performance on two sets of data. Receiver operating characteristic (ROC) curves are shown for the final model for each of the four assessed techniques (A, B), and the corresponding areas under the curve (AUC) are presented in the figure legend. By AUC on holdout data (0.815), the models built on data filtered to the two weeks before a COVID-19 (non)diagnosis perform best (B).
Model metrics. Evaluation of four models (LASSO and Elastic-Net, each with patient condition information from two timeframes) on the training and testing (i.e., holdout) data sets. For each model, the accuracy, F-score, and AUC with 95% CI using DeLong's method (DeLong et al., 1988) are shown. Accuracy is the fraction of correct predictions. F-score is the harmonic mean of precision and recall. Area under the receiver operating characteristic curve (AUC) is the area under the curve obtained by plotting the true positive rate against the false positive rate.
| Training metrics | All-Time + LASSO | All-Time + Elastic-Net |
|---|---|---|
| Accuracy | 0.746 | 0.755 |
| F-Score | 0.834 | 0.840 |
| AUC | 0.838 | 0.840 |
| 95% AUC CI | [0.82, 0.86] | [0.82, 0.86] |
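As an illustration of the three metrics defined in the caption above, the following minimal Python sketch computes accuracy and F-score from confusion-matrix counts, and AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The function names and the toy inputs are hypothetical and are not taken from the study's pipeline; the DeLong confidence interval is not reproduced here.

```python
from typing import Sequence


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    case receives the higher score; ties count as half a win.
    Assumes at least one positive (1) and one negative (0) label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy counts and scores, invented for illustration only.
    print(accuracy(tp=8, fp=2, tn=6, fn=4))
    print(f_score(tp=8, fp=2, fn=4))
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise form of AUC is quadratic in the number of cases, so it is suited to small illustrations rather than a cohort-sized data set.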
FIGURE 3 Confusion matrices. Confusion matrices using training (A–D) and holdout (E–H) data are shown for the final model for each of the four assessed techniques. Because these models are built to recommend COVID-19 testing, we sought to avoid false negative predictions while being more lenient toward false positive errors.
FIGURE 4 Web application demonstration. Four representative snapshots with different scores from the COVID-19 risk predictor web application are shown. Scores were calculated from participant answers to questions about their symptoms and conditions using the Credit Scorecard method.
Example questionnaire. Example questionnaire built from our selected model on the UAB-i2b2 data: the LASSO method on the 2-week filtered data. The base score is 320, and the risk score increases or decreases based on the answers in the questionnaire. Any score between 450 and 696 is considered high risk for infection. Disclaimer: This questionnaire is intended only as an example output from a model built using our pipeline. It is not itself a diagnostic tool.
| Questions | Yes | No |
|---|---|---|
| Do you have chronic kidney disease? | 36 | −6 |
| Do you have cough? | 36 | −44 |
| Have you delivered a baby? | 35 | −2 |
| Are you having acute upper respiratory infections? | 30 | −6 |
| Do you have fever? | 24 | −5 |
| Are you having depression, anxiety, problems with cognitive functions or other brain disorders? | 17 | −4 |
| Are you having pneumonia? | 17 | −3 |
| Are you having respiratory failure? | 16 | −3 |
| Are you dependent on nicotine? | 14 | −4 |
| Do you have allergic rhinitis? | 14 | −2 |
| Do you have retention of urine? | 14 | −1 |
| Do you have pain? | 14 | −1 |
| Do you have hernia? | 13 | −1 |
| Do you have liver fibrosis/cirrhosis? | 13 | −1 |
| Do you have disturbances of skin sensation? | 12 | −2 |
| Are you having anemia? | 10 | −1 |
| Are you having bacterial infection? | 9 | −1 |
| Do you have complications from heart disease? | 8 | −2 |
| Do you have hypotension? | 8 | −1 |
| Do you have complications of cardiac and vascular prosthetic devices, implants and grafts? | 6 | 0 |
| Are you vitamin D deficient? | 2 | 0 |
| Do you have cardiac arrhythmias? | 2 | 0 |
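The pen-and-paper arithmetic behind the questionnaire can be sketched in a few lines of Python. This is only an illustration of how the table's points combine with the base score of 320, not the authors' web application: the short dictionary keys, the function names, the rule that every unlisted question is answered "No", and the treatment of 450 as an inclusive lower bound for high risk are all assumptions made for this example. Like the questionnaire itself, it is not a diagnostic tool.

```python
BASE_SCORE = 320
HIGH_RISK_THRESHOLD = 450  # assumed inclusive lower bound of the high-risk range

# (points for "Yes", points for "No"), copied from the questionnaire table.
# The short keys are hypothetical labels for the full question text.
POINTS = {
    "chronic_kidney_disease": (36, -6),
    "cough": (36, -44),
    "delivered_baby": (35, -2),
    "acute_upper_respiratory_infection": (30, -6),
    "fever": (24, -5),
    "depression_anxiety_cognitive": (17, -4),
    "pneumonia": (17, -3),
    "respiratory_failure": (16, -3),
    "nicotine_dependence": (14, -4),
    "allergic_rhinitis": (14, -2),
    "urine_retention": (14, -1),
    "pain": (14, -1),
    "hernia": (13, -1),
    "liver_fibrosis_cirrhosis": (13, -1),
    "skin_sensation_disturbance": (12, -2),
    "anemia": (10, -1),
    "bacterial_infection": (9, -1),
    "heart_disease_complications": (8, -2),
    "hypotension": (8, -1),
    "cardiac_device_complications": (6, 0),
    "vitamin_d_deficiency": (2, 0),
    "cardiac_arrhythmias": (2, 0),
}


def risk_score(yes_answers: set[str]) -> int:
    """Base score plus the Yes or No points for every question.

    Questions not listed in yes_answers are scored as "No" (an assumption).
    """
    unknown = yes_answers - POINTS.keys()
    if unknown:
        raise KeyError(f"unknown question keys: {sorted(unknown)}")
    return BASE_SCORE + sum(
        yes if key in yes_answers else no
        for key, (yes, no) in POINTS.items()
    )


def is_high_risk(score: int) -> bool:
    """True when the score reaches the assumed high-risk threshold."""
    return score >= HIGH_RISK_THRESHOLD


if __name__ == "__main__":
    score = risk_score({"cough", "fever"})
    print(score, is_high_risk(score))  # 339 False
```

For example, answering "Yes" only to cough and fever gives 320 + 36 + 24 − 41 = 339, where −41 is the sum of the "No" points for the remaining 20 questions; this is below the 450 threshold.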