Katherine P Liao, Ashwin N Ananthakrishnan, Vishesh Kumar, Zongqi Xia, Andrew Cagan, Vivian S Gainer, Sergey Goryachev, Pei Chen, Guergana K Savova, Denis Agniel, Susanne Churchill, Jaeyoung Lee, Shawn N Murphy, Robert M Plenge, Peter Szolovits, Isaac Kohane, Stanley Y Shaw, Elizabeth W Karlson, Tianxi Cai.
Abstract
BACKGROUND: Algorithms that classify phenotypes using electronic medical record (EMR) data have typically been developed to perform well in a specific patient population. There is increasing interest in analyses that allow study of a specific outcome across different diseases. Such a study in the EMR requires an algorithm that can be applied across different patient populations. Our objectives were (1) to develop an algorithm that would enable the study of coronary artery disease (CAD) across diverse patient populations, and (2) to study the impact of adding narrative data extracted using natural language processing (NLP) to the algorithm. Additionally, we demonstrate how to implement the CAD algorithm to compare risk across 3 chronic diseases in a preliminary study.
Year: 2015 PMID: 26301417 PMCID: PMC4547801 DOI: 10.1371/journal.pone.0136651
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of approach to developing the CAD algorithm in the RA cohort.
Fig 2. Validation of the CAD algorithm in the IBD cohort.
Clinical characteristics of subjects in the RA, IBD and DM cohorts.
| Clinical characteristics | RA, n = 4453 | IBD, n = 10,974 | DM, n = 65,099 |
|---|---|---|---|
| Age, mean (SD) | 60.9 (14.8) | 47.3 (18.8) | 64.6 (15.4) |
| Female gender, % | 79.1 | 53.2 | 46.9 |
| Race, n (%) | | | |
| White | 66.9 | 85.3 | 67.2 |
| Black | 5.7 | 3 | 12.1 |
| Hypertension, n (%) | 38.4 | 25.7 | 80.6 |
| Diabetes mellitus, n (%) | 14.4 | 9.2 | 92.1 |
| Hyperlipidemia, n (%) | 29.1 | 23.1 | 68 |
| Ever smoker, n (%) | 48.5 | 59 | 73.2 |
| Mean f/u time in EMR, yrs (SD) | 8.6 (5.5) | 7.1 (5.7) | 8.1 (5.9) |
Variables in the final CAD algorithm.
| Variable | Variable type (Structured or NLP) | Standardized coefficient | Standard error | |
|---|---|---|---|---|
| Coronary artery disease | ✓ | 1.44 | 0.19 | |
| ICD9 codes, normalized | ✓ | 0.4 | 0.35 | |
| Ischemic heart disease | ✓ | 0.35 | 0.17 | |
| CAD procedures | ✓ | 0.33 | 0.16 | |
| EMR follow-up time (months) | ✓ | 0.22 | 0.12 | |
| Coronary artery disease | ✓ | 0.22 | 0.15 | |
| CABG, PCI | ✓ | 0.20 | 0.24 | |
| No LDL values in EMR | ✓ | 0.10 | 0.11 | |
| Age | ✓ | 0.10 | 0.10 | |
| Mean LDL | ✓ | -0.01 | 0.08 | |
| Never smoker | ✓ | -0.01 | 0.10 | |
| Current smoker | ✓ | -0.07 | 0.11 | |
| Echocardiogram performed | ✓ | -0.16 | 0.13 | |
| Hypertension | ✓ | -0.19 | 0.16 | |
| ICD9 codes, total number | ✓ | -0.63 | 0.23 | |
| Intercept | | -10.19 | 4.61 | |
*Please refer to S1 Appendix for a full description of each variable.
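To make the role of the coefficients concrete, the sketch below combines them into a single score. It assumes, without confirmation from this excerpt, that the algorithm is a logistic-regression-style model in which the intercept plus the weighted sum of standardized inputs is passed through a logistic function. The function names, the row-order input convention, and the "(first row)" and "(second row)" labels for the two coronary artery disease variables are hypothetical; the actual variable definitions and standardization are described in S1 Appendix.

```python
import math

# Intercept and standardized coefficients copied from the table above, in row order.
# Labels are for readability only; the "(first row)" / "(second row)" suffixes are
# hypothetical and only distinguish the two identically named variables.
INTERCEPT = -10.19
COEFFICIENTS = [
    ("Coronary artery disease (first row)", 1.44),
    ("ICD9 codes, normalized", 0.40),
    ("Ischemic heart disease", 0.35),
    ("CAD procedures", 0.33),
    ("EMR follow-up time (months)", 0.22),
    ("Coronary artery disease (second row)", 0.22),
    ("CABG, PCI", 0.20),
    ("No LDL values in EMR", 0.10),
    ("Age", 0.10),
    ("Mean LDL", -0.01),
    ("Never smoker", -0.01),
    ("Current smoker", -0.07),
    ("Echocardiogram performed", -0.16),
    ("Hypertension", -0.19),
    ("ICD9 codes, total number", -0.63),
]


def linear_score(values):
    """Intercept plus the weighted sum of inputs, given in table row order."""
    if len(values) != len(COEFFICIENTS):
        raise ValueError("expected one value per table row")
    return INTERCEPT + sum(coef * v for (_, coef), v in zip(COEFFICIENTS, values))


def predicted_probability(values):
    """Logistic transform of the linear score (assumed model form)."""
    return 1.0 / (1.0 + math.exp(-linear_score(values)))


# Hypothetical input: every variable at zero except the first one.
example = [0.0] * len(COEFFICIENTS)
example[0] = 5.0
print(predicted_probability(example))
```

Under this assumed form, a subject is classified with CAD when the predicted probability exceeds a cutoff chosen for the cohort.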
Validation of the accuracy of the two-step classification of CAD (screening + algorithm) using structured data only, compared with structured data + NLP.
Natural language processing = NLP; negative predictive value = NPV; positive predictive value = PPV.
| Disease cohort | Sensitivity | Specificity | PPV | NPV | Additional subjects classified with CAD (%) |
|---|---|---|---|---|---|
| IBD, structured data only | 59 | 99.6 | 90 | 98.1 | ref |
| IBD, structured + NLP | 73 | 99.6 | 90 | 98.6 | 17.2 |
| DM, structured data only | 84 | 96.2 | 90 | 93.8 | ref |
| DM, structured + NLP | 87 | 96.3 | 90 | 94.5 | 10.2 |
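The four accuracy measures in this table follow from a 2×2 comparison of the algorithm's classification against the reference standard. A minimal sketch of the standard definitions is given below; the function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV as fractions.

    tp/fn: subjects with CAD classified as CAD / not CAD.
    fp/tn: subjects without CAD classified as CAD / not CAD.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true CAD cases detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly excluded
        "ppv": tp / (tp + fp),          # share of positive calls that are true cases
        "npv": tn / (tn + fn),          # share of negative calls that are true non-cases
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=73, fp=4, fn=27, tn=996)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

When the cutoff is chosen to hold PPV at a fixed level, as in the table, a gain in sensitivity means more true cases are captured at the same level of confidence in each positive call.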
Clinical characteristics of subjects classified with CAD in the RA, IBD and DM cohorts (PPV of CAD classification ≥ 90%*).
| Clinical characteristics | RA (n = 4453), CAD yes, n = 245 (5.0%) | RA, CAD no, n = 4208 | IBD (n = 10,974), CAD yes, n = 457 (4.2%) | IBD, CAD no, n = 10,517 | DM (n = 65,099), CAD yes, n = 16,962 (26.1%) | DM, CAD no, n = 48,136 |
|---|---|---|---|---|---|---|
| Age, mean (SD) | 72.9 (10.1) | 60.2 (14.7) | 71.3 (11.0) | 46.3 (18.4) | 71.7 (11.1) | 62.1 (15.9) |
| Male gender (%) | 45.7 | 19.4 | 70.7 | 45.7 | 66.9 | 48.3 |
| Race (%) | | | | | | |
| White | 80.0 | 66.1 | 93.0 | 85.0 | 78.7 | 63.7 |
| Black | 7.4 | 5.6 | 2.2 | 3.0 | 7.1 | 14.0 |
| Comorbidity | | | | | | |
| Hypertension (%) | 88.2 | 35.5 | 85.6 | 23.1 | 92.2 | 75.4 |
| Diabetes mellitus (%) | 40.0 | 12.2 | 35.2 | 7.1 | N/A | N/A |
| Hyperlipidemia (%) | 82.5 | 26.0 | 81.0 | 20.6 | 83.6 | 61.5 |
| Smoking status, ever vs never (%) | 66.9 | 30.1 | 75.7 | 33.2 | 68.8 | 45.9 |
| Follow-up (months), mean (SD) | 139.7 (58.5) | 101.2 (65.5) | 112.9 (75.0) | 83.7 (68.1) | 98.1 (69.6) | 96.7 (71.0) |
*Selected specificity cutoff for each cohort based on PPV ≥ 90%; RA based on medical record review.
Unadjusted and adjusted odds ratios comparing risk of CAD in IBD and RA to DM.
| Clinical variables | Unadjusted OR (95% CI) | Adjusted OR (95% CI) |
|---|---|---|
| Age | - | 1.05 (1.05, 1.05) |
| Male gender | - | 2.27 (2.18, 2.36) |
| Hyperlipidemia | - | 2.48 (2.36, 2.61) |
| Ever smoker | - | 1.92 (1.84, 1.99) |
| Hypertension | - | 1.80 (1.68, 1.91) |
| IBD (vs DM) | 0.12 (0.11, 0.14) | |
| RA (vs DM) | 0.17 (0.15, 0.19) | |
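As a consistency check, the two odds ratios at the end of this table can be compared with unadjusted values computed from the CAD yes/no counts in the preceding table, with DM as the reference cohort. The sketch below assumes a Wald (log-scale normal approximation) 95% confidence interval; the excerpt does not state how the published intervals were obtained, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(group_yes, group_no, ref_yes, ref_no, z=1.96):
    """Unadjusted odds ratio of a group versus a reference, with a Wald CI."""
    odds_ratio = (group_yes * ref_no) / (group_no * ref_yes)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / group_yes + 1 / group_no + 1 / ref_yes + 1 / ref_no)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# CAD yes / CAD no counts from the table of subjects classified with CAD.
DM = (16962, 48136)
IBD = (457, 10517)
RA = (245, 4208)

for label, counts in (("IBD vs DM", IBD), ("RA vs DM", RA)):
    odds_ratio, lower, upper = odds_ratio_wald(*counts, *DM)
    print(f"{label}: OR {odds_ratio:.3f} (95% CI {lower:.3f}, {upper:.3f})")
```

Under these assumptions the results come out close to 0.12 for IBD and 0.17 for RA, which suggests that the two surviving values are the unadjusted odds ratios for IBD and RA relative to DM.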