Su H Chu, Emily S Wan, Michael H Cho, Sergey Goryachev, Vivian Gainer, James Linneman, Erica J Scotty, Scott J Hebbring, Shawn Murphy, Jessica Lasky-Su, Scott T Weiss, Jordan W Smoller, Elizabeth Karlson.
Abstract
Electronic health records (EHR) provide an unprecedented opportunity to conduct large, cost-efficient, population-based studies. However, the studies of heterogeneous diseases, such as chronic obstructive pulmonary disease (COPD), often require labor-intensive clinical review and testing, limiting widespread use of these important resources. To develop a generalizable and efficient method for accurate identification of large COPD cohorts in EHRs, a COPD datamart was developed from 3420 participants meeting inclusion criteria in the Mass General Brigham Biobank. Training and test sets were selected and labeled with gold-standard COPD classifications obtained from chart review by pulmonologists. Multiple classes of algorithms were built utilizing both structured (e.g. ICD codes) and unstructured (e.g. medical notes) data via elastic net regression. Models explicitly including and excluding spirometry features were compared. External validation of the final algorithm was conducted in an independent biobank with a different EHR system. The final COPD classification model demonstrated excellent positive predictive value (PPV; 91.7%), sensitivity (71.7%), and specificity (94.4%). This algorithm performed well not only within the MGBB, but also demonstrated similar or improved classification performance in an independent biobank (PPV 93.5%, sensitivity 61.4%, specificity 90%). Ancillary comparisons showed that the classification model built including a binary feature for FEV1/FVC produced substantially higher sensitivity than those excluding. This study fills a gap in COPD research involving population-based EHRs, providing an important resource for the rapid, automated classification of COPD cases that is both cost-efficient and requires minimal information from unstructured medical records.
Year: 2021 PMID: 34620889 PMCID: PMC8497529 DOI: 10.1038/s41598-021-98719-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of COPD datamart selection and developed algorithms.
Figure 2. Broad overview of steps in phenotyping algorithm development.
Algorithms for classifying chronic obstructive pulmonary disease.
| Classification method | Classifier description | Minimum selection criteria: ICD9/10 | Minimum selection criteria: visits | Minimum selection criteria: other |
|---|---|---|---|---|
| ICD-strict^a | 3 COPD-specific codes | 3 or more COPD-specific codes | None | |
| ICD-broad^b | 2 COPD-specific codes | 2 or more COPD-specific codes | None | |
| Control selection | 0 COPD-specific codes | Subjects with no history of COPD-related codes | 2 encounters in MGB Biobank | |
| SAFE-NLP | Model selected from surrogate assisted feature extraction with natural language processing of unstructured EHR data (narrative text from clinic notes) | At least 1 COPD-specific code and at least 3 broad COPD codes | 1 visit with electronic clinical note in the EHR | Selected by classifier |
| CRTPFT- | Model selected from literature-based and expert-curated feature inputs primarily derived from structured data, excluding feature weights for spirometric FEV1/FVC performance | At least 1 COPD-specific code and at least 3 broad COPD codes | 1 visit with electronic clinical note in the EHR | Selected by classifier |
| CRTPFT+ | Model selected from the feature space of CRTPFT-, but inclusive of feature weights for spirometric FEV1/FVC performance | At least 1 COPD-specific code and at least 3 broad COPD codes | 1 visit with electronic clinical note in the EHR | Selected by classifier |
| CRT + SAFE | Model based on combining the full feature space for CRTPFT+ and SAFE | At least 1 COPD-specific code and at least 3 broad COPD codes | 1 visit with electronic clinical note in the EHR | Selected by classifier |
^a COPD-specific codes include: 1) ICD9: 491.2, 493.2, and 496.*; 2) ICD10: J43.* or J44.*.
^b Broad COPD codes include any codes with the following base numbers: 1) ICD9: 491.*, 492.*, 493.2*, and 496.*; 2) ICD10: J40.*, J41.*, J42.*, J43.*, J44.*.
^c All model-based algorithms (SAFE-NLP, CRTPFT-, CRTPFT+, CRT + SAFE) were developed using probability-based thresholding via logistic regression models selected using a threshold for specificity at 95%.
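The code-count rules in this table can be illustrated with a short sketch. It is not the authors' implementation. It assumes that each listed code acts as a prefix (so "491.2" also matches "491.21"), that every coded occurrence counts once, and that ICD-broad counts codes from the broad list in footnote b; the source does not spell out any of these details. The function names and the `n_encounters` parameter are hypothetical.

```python
from typing import Iterable

# Code prefixes transcribed from the table footnotes. Treating each entry as a
# prefix is an assumption of this sketch, not something the source states.
SPECIFIC_PREFIXES = ("491.2", "493.2", "496", "J43", "J44")
BROAD_PREFIXES = ("491", "492", "493.2", "496", "J40", "J41", "J42", "J43", "J44")


def count_matches(codes: Iterable[str], prefixes: tuple) -> int:
    """Count coded diagnoses whose ICD code starts with any listed prefix."""
    return sum(1 for code in codes if code.strip().upper().startswith(prefixes))


def classify_by_icd(codes: list, n_encounters: int = 0) -> dict:
    """Apply the count thresholds from the table to one subject's code list."""
    n_specific = count_matches(codes, SPECIFIC_PREFIXES)
    n_broad = count_matches(codes, BROAD_PREFIXES)
    return {
        "n_specific": n_specific,
        "n_broad": n_broad,
        "icd_strict": n_specific >= 3,
        # Assumption: ICD-broad is read as 2 or more codes from the broad list.
        "icd_broad": n_broad >= 2,
        # Controls: no COPD-related codes and at least 2 encounters.
        "control_candidate": n_broad == 0 and n_encounters >= 2,
        # ICD part of the entry criteria for the model-based algorithms.
        "model_eligible": n_specific >= 1 and n_broad >= 3,
    }


if __name__ == "__main__":
    print(classify_by_icd(["J44.1", "J44.9", "496", "J42"], n_encounters=5))
```

The visit criterion for the model-based algorithms (1 visit with an electronic clinical note) is not modeled here, because it depends on note metadata rather than on codes.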
Comparison of performance characteristics between different electronic medical record COPD classification algorithms within Mass General Brigham Biobank training set (N = 182).
| Algorithm | True positive (N) | True negative (N) | False positive (N) | False negative (N) | Sensitivity (95% CI)* | Specificity (95% CI)* | PPV (95% CI)* | NPV (95% CI)* |
|---|---|---|---|---|---|---|---|---|
| ICD-Strict | 103 | 9 | 68 | 2 | 0.981 | 0.117 | 0.602 | 0.818 |
| ICD-Broad | 105 | 3 | 74 | 0 | 1 | 0.039 | 0.587 | 1 |
| SAFE-NLP | 38 | 73 | 4 | 66 | 0.365 (0.270–0.462) | 0.948 (0.896–0.987) | 0.905 (0.816–0.977) | 0.525 (0.490–0.567) |
| CRTPFT- | 42 | 72 | 5 | 63 | 0.400 (0.314–0.495) | 0.935 (0.883–0.987) | 0.894 (0.808–0.976) | 0.533 (0.493–0.578) |
| CRTPFT+ | 62 | 73 | 4 | 43 | 0.590 (0.495–0.676) | 0.948 (0.896–0.987) | 0.939 (0.878–0.987) | 0.629 (0.577–0.688) |
| CRT + SAFE | 47 | 73 | 4 | 62 | 0.404 (0.317–0.490) | 0.948 (0.896–0.987) | 0.913 (0.830–0.979) | 0.541 (0.503–0.583) |
^a COPD-specific codes include: 1) ICD9: 491.2, 493.2, and 496; 2) ICD10: J43 or J44.
^b Broad COPD codes include any codes with the following base numbers: 1) ICD9: 491, 492, 493.2, and 496; 2) ICD10: J40, J41, J42, J43, J44.
*All probabilistic algorithms were assessed at their corresponding thresholds specifying 95% specificity.
SAFE-NLP: Model based on surrogate assisted feature extraction with natural language processing of unstructured EHR data (free text); CRTPFT-: Model based on literature and expert-curated feature inputs primarily derived from structured data, excluding feature weights for spirometric FEV1/FVC performance; CRTPFT+: Model based on feature space of CRTPFT-, but inclusive of feature weights for spirometric FEV1/FVC performance; CRT + SAFE: Model based on combining the full feature space for CRTPFT+ and SAFE.
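Assessing each probabilistic algorithm at a threshold specifying 95% specificity amounts to choosing a cutoff on the predicted probability using the non-COPD subjects. The sketch below shows one simple way to do that. The source does not describe the exact selection rule, so the function name, the "at least the target" rule, and the convention that scores at or above the cutoff are cases are all assumptions. The training-set specificity of 0.948 (73 of 77) reported above suggests the authors' rule is not identical to this one.

```python
def threshold_for_specificity(control_scores, target=0.95):
    """Return the lowest cutoff t such that at least `target` of the
    non-COPD scores fall strictly below t (scores >= t are called cases)."""
    if not control_scores:
        raise ValueError("need at least one control score")
    n = len(control_scores)
    # Candidate cutoffs: every observed score, plus infinity (call nobody a case).
    candidates = sorted(set(control_scores)) + [float("inf")]
    for t in candidates:
        specificity = sum(score < t for score in control_scores) / n
        if specificity >= target:
            return t
    # Only reachable if target > 1, which is not a valid specificity.
    raise ValueError("target specificity must be at most 1")


if __name__ == "__main__":
    # Hypothetical predicted probabilities for 20 chart-review negatives.
    controls = [i / 20 for i in range(20)]
    print(threshold_for_specificity(controls, target=0.95))
```

With a small number of controls the achievable specificities are discrete, so a cutoff that meets the target exactly may not exist; this sketch then returns the first cutoff that exceeds it.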
Figure 3. Receiver-operator characteristic curves to assess classification performance of model-based algorithms.
Comparison of performance characteristics of probabilistic electronic medical record COPD classification algorithms within Mass General Brigham Biobank validation set (N = 100) and external, independent validation of final algorithm in the Marshfield Clinic (N = 100).
| Cohort | Feature set | Algorithm | True positive (N) | True negative (N) | False positive (N) | False negative (N) | Sensitivity* | Specificity* | PPV* | NPV* |
|---|---|---|---|---|---|---|---|---|---|---|
| Mass General Brigham Biobank validation set | Automatic NLP features | SAFE-NLP | 17 | 51 | 3 | 28 | 0.370 | 0.944 | 0.850 | 0.638 |
| Mass General Brigham Biobank validation set | Curated features | CRTPFT- | 20 | 51 | 3 | 26 | 0.435 | 0.944 | 0.870 | 0.662 |
| Mass General Brigham Biobank validation set | Curated features | CRTPFT+ | 33 | 51 | 3 | 13 | 0.717 | 0.944 | 0.917 | 0.797 |
| Mass General Brigham Biobank validation set | Mixed features | CRT + SAFE | 20 | 53 | 1 | 26 | 0.435 | 0.981 | 0.952 | 0.671 |
| Marshfield Clinic (external validation) | Curated features | CRTPFT+ | 43 | 27 | 3 | 27 | 0.614 | 0.900 | 0.935 | 0.500 |
*All probabilistic algorithms were assessed at their corresponding thresholds specifying 95% specificity.
SAFE-NLP: Model based on surrogate assisted feature extraction with natural language processing of unstructured EHR data (free text); CRTPFT-: Model based on literature and expert-curated feature inputs primarily derived from structured data, excluding feature weights for spirometric FEV1/FVC performance; CRTPFT+: Model based on feature space of CRTPFT-, but inclusive of feature weights for spirometric FEV1/FVC performance; CRT + SAFE: Model based on combining the full feature space for CRTPFT+ and SAFE.
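The performance columns follow directly from the four counts in each row. As a minimal sketch (the function name is hypothetical), the standard definitions can be checked against the CRTPFT+ row for the Mass General Brigham Biobank validation set:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts.

    No guard against empty denominators: this sketch assumes every margin
    (cases, controls, predicted positives, predicted negatives) is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # cases correctly flagged
        "specificity": tn / (tn + fp),  # controls correctly cleared
        "ppv": tp / (tp + fp),          # flagged subjects who are cases
        "npv": tn / (tn + fn),          # cleared subjects who are controls
    }


if __name__ == "__main__":
    # Counts taken from the CRTPFT+ row, MGB Biobank validation set.
    metrics = classification_metrics(tp=33, tn=51, fp=3, fn=13)
    print({name: round(value, 3) for name, value in metrics.items()})
    # {'sensitivity': 0.717, 'specificity': 0.944, 'ppv': 0.917, 'npv': 0.797}
```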
Patient medical history features and weights used in the final Mass General Brigham Biobank CRTPFT+ algorithm for classification of COPD.
| Model feature | Model weight | Variable type | Description |
|---|---|---|---|
| Intercept | −1.871 | | |
| everPFTlt70 | 1.750 | NLP | Ever had a pulmonary function test with spirometry indicating a pre-bronchodilator or post-bronchodilator FEV1/FVC ratio < 0.7 |
| nCOPDGTE3_365 | 0.465 | Coded | Ever diagnosed with 3 or more COPD-related ICD codes within any rolling time window of 365 days |
| everTiotropium | 0.334 | Coded | Ever been prescribed tiotropium |
| iNotWhite | −0.239 | Coded | Race category denoting whether subject is White or Not White |
| smkEver | 0.175 | NLP | Any current/former history of smoking |
| everdxAtPulmClinic | 0.056 | Coded | Ever diagnosed with a COPD-related ICD code at a pulmonary clinic |
| everCOPDmed | 0.048 | Coded | Ever been prescribed a medication used to treat COPD |
| nmedLAMA | 0.017 | Coded | Total count of distinct prescription codes for long-acting muscarinic antagonists in participant medical record for treatment of lung diseases |
| pftCount | −0.016 | Coded | Total count of any kind of pulmonary function test |
| ageCOPDt1Specific | 0.013 | Coded | Age (in years) at first ICD code specific to COPD |
| nCOPD_ICD | 0.013 | Coded | Count of distinct dates on which a subject has a COPD ICD code (ICD10: J40–J44; ICD9: 491, 492, 493.2, 496) |
| nBronchitis | −0.009 | Coded | Count of distinct dates on which a subject has a Bronchitis ICD code (ICD10: J40, J41, J42; ICD9: 490, 491) |
| nBronchiectasis | −0.008 | Coded | Count of distinct dates on which a subject has a Bronchiectasis ICD code (ICD10: J47; ICD9: 494) |
| patient_dxenct | −0.001 | Coded | Total number of encounters (visits) per subject with a coded diagnosis (any diagnosis, not limited to COPD) |
The case assignment threshold for this model, holding specificity at 95%, was 0.754. For subjects who were missing PFT results, the everPFTlt70 variable was classified as ‘No’ in this model.
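To make the table concrete, the sketch below turns the weights and the 0.754 threshold into a logistic score for one subject. It is an illustration under several assumptions, not the published implementation: it assumes the weights apply to raw, unstandardized feature values; that binary features are coded 0/1, with iNotWhite equal to 1 for Not White; that a probability equal to the threshold counts as a case; and that any feature not supplied defaults to 0. The source states that default only for everPFTlt70. The function names are hypothetical.

```python
import math

INTERCEPT = -1.871
WEIGHTS = {
    "everPFTlt70": 1.750,
    "nCOPDGTE3_365": 0.465,
    "everTiotropium": 0.334,
    "iNotWhite": -0.239,
    "smkEver": 0.175,
    "everdxAtPulmClinic": 0.056,
    "everCOPDmed": 0.048,
    "nmedLAMA": 0.017,
    "pftCount": -0.016,
    "ageCOPDt1Specific": 0.013,
    "nCOPD_ICD": 0.013,
    "nBronchitis": -0.009,
    "nBronchiectasis": -0.008,
    "patient_dxenct": -0.001,
}
THRESHOLD = 0.754  # case assignment threshold reported for this model


def copd_probability(features: dict) -> float:
    """Logistic probability from the tabulated weights.

    Features absent from `features` are treated as 0 (an assumption, except
    for everPFTlt70, where the source codes missing PFT results as 'No').
    """
    unknown = set(features) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unrecognized features: {sorted(unknown)}")
    score = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-score))


def is_copd_case(features: dict) -> bool:
    """Assign case status when the probability reaches the threshold."""
    return copd_probability(features) >= THRESHOLD


if __name__ == "__main__":
    # Hypothetical subject: obstructive spirometry, 3+ codes in a year,
    # tiotropium, smoking history, first COPD-specific code at age 65.
    subject = {
        "everPFTlt70": 1,
        "nCOPDGTE3_365": 1,
        "everTiotropium": 1,
        "smkEver": 1,
        "ageCOPDt1Specific": 65,
    }
    print(round(copd_probability(subject), 3), is_copd_case(subject))
```

Under these assumptions the spirometry indicator dominates the score, which is consistent with the sensitivity gain reported for CRTPFT+ over CRTPFT-.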