Abstract
The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in using EHR data include data availability, missing data, incorrect data, and vast quantities of unstructured narrative text. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods may be needed to process narrative text data. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Because EHR data include a broad sampling of clinically relevant phenotypic information, they may enable multiple genomic investigations upon a single set of genotyped individuals.
This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
Year: 2012 PMID: 23300414 PMCID: PMC3531280 DOI: 10.1371/journal.pcbi.1002823
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Strengths and weaknesses of data classes within EHRs.
| ICD codes | CPT codes | Laboratory Data | Medication records | Clinical Documentation |
|---|---|---|---|---|
| Near-universal | Near-universal | Near-universal | Variable | Variable |
| Medium | Poor | Medium | Inpatient: High; Outpatient: Variable | Medium |
| Medium | High | High | Inpatient: High; Outpatient: Variable | Medium-High |
| Medium | High | Medium-High | Medium | Low-Medium |
| Structured | Structured | Mostly structured | Structured, text queries, and NLP | NLP, text queries, and rarely structured |
| Easy to query; serves as a good first pass of disease status | Easy to query; high precision | Value depends on test; high data validity | Can have high validity | Best record of what providers thought |
| Disease codes often used for screening when disease not actually present; accuracy hindered by billing realities and clinic workflow | Most susceptible to missing data errors (e.g., performed at another hospital); procedure receipt influenced by patient and payer factors external to disease process | May need to aggregate different variations of the same data elements; normal ranges and units may change over time | Often need to interface inpatient and outpatient records; medication records from outside providers not present; medications prescribed not necessarily taken | Difficult to process automatically; interpretation accuracy depends on assessment method; may suffer from significant “cut and paste”; not universally available in EHRs; may be self-contradictory |
| Essential first element for electronic phenotyping | Helpful addition if relevant | Helpful addition if relevant | Useful for confirmation and a marker of severity | Useful for confirming common diagnoses or for finding rare ones |
Figure 1. Comparison of natural language processing (NLP) and CPT codes to detect completed colonoscopies in 200 patients.
In this study, more completed colonoscopies were found via NLP than with billing codes alone, and only one colonoscopy was found with billing codes that was not found with NLP. NLP examples were reviewed for accuracy.
Figure 2. Use of Intelligent Character Recognition to codify handwriting.
Figure courtesy of Luke Rasmussen, Northwestern University.
Figure 3. General figure for identifying cases and controls using EHR data.
Application of electronic selection algorithms leads to division of a population of patients into four groups, the largest of which comprises patients who were excluded because they lack sufficient evidence to be either a case or control patient. Definite cases and controls cross some predefined threshold of positive predictive value (e.g., PPV≥95%), and thus do not require manual review. For very rare phenotypes or complicated case definitions, the category of “possible” cases may need to be reviewed manually to increase the sample size.
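The four-way split described for Figure 3 can be sketched in Python. This is a minimal illustration under stated assumptions, not the selection algorithm used in any of the studies: the `PatientEvidence` fields, the `classify` function, and every cutoff below are hypothetical. In practice, each rule set would be checked against manual chart review until the definite groups reach the chosen PPV threshold.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PatientEvidence:
    """Hypothetical per-patient evidence summary (all field names are assumptions)."""
    icd_code_count: int          # relevant billing codes on record
    on_disease_medication: bool  # a disease-specific medication is recorded
    nlp_mentions: int            # affirmed mentions found in narrative text
    visit_count: int             # how much record exists to judge absence


def classify(p: PatientEvidence) -> str:
    """Assign a patient to one of four groups using illustrative cutoffs."""
    # Strong, concordant evidence: treated as a definite case.
    if p.icd_code_count >= 2 and p.on_disease_medication:
        return "definite_case"
    # Some evidence, but not enough: candidate for manual review.
    if p.icd_code_count >= 1 or p.on_disease_medication or p.nlp_mentions >= 1:
        return "possible_case"
    # No evidence of disease and enough record to trust its absence.
    if p.visit_count >= 2:
        return "control"
    # Too little information to call the patient a case or a control.
    return "excluded"


cohort = [
    PatientEvidence(3, True, 4, 12),
    PatientEvidence(1, False, 0, 6),
    PatientEvidence(0, False, 0, 8),
    PatientEvidence(0, False, 0, 1),
]
print(Counter(classify(p) for p in cohort))
```

Keeping "possible" cases as a separate group means they can be sent to manual review only when the phenotype is rare enough to need the extra sample size.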
Methods of finding cases and controls for genetic analysis of five common diseases.
| Disease | Methods | Cases | Controls | Case PPV | Control PPV |
|---|---|---|---|---|---|
| Atrial fibrillation | NLP of ECG impressions; ICD9 codes; CPT codes | 168 | 1695 | 98% | 100% |
| Crohn's Disease | ICD9 codes; Medications (text) | 116 | 2643 | 100% | 100% |
| Type 2 Diabetes | ICD9 codes; Medications (text); Text searches (controls) | 570 | 764 | 100% | 100% |
| Multiple Sclerosis | ICD9 codes or text diagnosis | 66 | 1857 | 87% | 100% |
| Rheumatoid Arthritis | ICD9 codes; Medications (text); Text searches (exclusions) | 170 | 701 | 97% | 100% |
Given the small number of multiple sclerosis cases, all possible cases were manually validated to ensure high recall.
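Positive predictive value is the fraction of patients selected by an algorithm who are confirmed as correctly classified, typically by manual chart review. A minimal sketch of that calculation follows; the function name and the review counts in the usage line are illustrative assumptions, not data from the studies summarized in the table.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP), computed over the patients that were reviewed."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed patients; PPV is undefined")
    return true_positives / reviewed


# Hypothetical review: 97 of 100 sampled algorithm-selected cases confirmed.
print(f"{positive_predictive_value(97, 3):.0%}")  # prints 97%
```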
eMERGE network participants.
| Institution | Biorepository Overview | Model | Size | EHR Summary | Phenotyping Methods |
|---|---|---|---|---|---|
| | | Disease specific cohort | 4000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP |
| | | Population based | 20,000 | Comprehensive internally developed EHR since 1985 | Structured data extraction, NLP, Intelligent Character Recognition |
| | | Disease specific cohorts | 16,500 | Comprehensive internally developed EHR since 1995 | Structured data extraction, NLP |
| | | Population based | >10,000 | Comprehensive vendor based Inpatient and Outpatient (different systems) EHR since 2000 | Structured data extraction, text searches, NLP |
| | | Population based | 150,000 | Comprehensive internally developed EHR since 2000 | Structured data extraction, NLP |
| | | Population based | >30,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP |
| | | Population based | >30,000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP |
| | General and disease cohorts. | Population based | >3,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP |
| | General and disease cohorts. | Population based | >100,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP |
| | | Disease based | Virtual | Comprehensive internally developed EHR | Structured data extraction, NLP |
Sizes represent approximate sizes as of 2012; many sites are still actively recruiting. NLP = Natural Language Processing. Sites joined with ¹eMERGE-I in 2007, ²eMERGE-II in 2011, or as ³pediatric sites in 2012.
Figure 4. Use of NLP to identify patients without heart disease for a genome-wide analysis of normal cardiac conduction.
Using simple text searching, 1564 patients would have been eliminated unnecessarily due to negated terms, family medical history of heart disease, or low dose medication use that would not affect measurements on the electrocardiogram. Use of NLP improves recall of these cases without sacrificing positive predictive value. The final case cohort represented the patients used for GWAS in [71].
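The difference between simple text searching and context-aware matching can be sketched as follows. This is a deliberately simplified illustration, not the NLP system used in the study: the cue lists, the sentence splitter, and both function names are assumptions, and the medication-dose check mentioned in the caption is not covered.

```python
import re

# Hypothetical cue lists; a real NLP system would use far richer lexicons.
NEGATION = re.compile(r"\b(no|not|denies|without|negative for)\b")
FAMILY = re.compile(r"\b(family history|mother|father|brother|sister)\b")


def simple_search(note: str, term: str) -> bool:
    """Plain substring match: flags negated and family-history mentions too."""
    return term.lower() in note.lower()


def affirmed_mention(note: str, term: str) -> bool:
    """True only if some sentence asserts `term` about the patient."""
    term = term.lower()
    for sentence in re.split(r"[.;\n]", note.lower()):
        idx = sentence.find(term)
        if idx == -1:
            continue
        if NEGATION.search(sentence[:idx]):  # negation cue precedes the term
            continue
        if FAMILY.search(sentence):          # mention refers to a relative
            continue
        return True
    return False


note = "Patient denies chest pain. No heart failure. Father had heart failure."
print(simple_search(note, "heart failure"))     # True: would exclude this patient
print(affirmed_mention(note, "heart failure"))  # False: patient stays eligible
```

Under simple searching this patient would be dropped from a cohort of people without heart disease, even though every mention in the note is either negated or about a relative.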
Figure 5. A PheWAS plot for rs3135388 in HLA-DRA.
This region has known associations with multiple sclerosis. The red line indicates statistical significance at Bonferroni correction. The blue line represents p<0.05. This plot is generated from updated data from [78] and the updated PheWAS methods as described in [73].
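The Bonferroni line in such a plot comes from dividing the nominal significance level by the number of tests performed. A minimal sketch follows; the function name and the count of 1,000 phenotypes are assumptions for illustration, not the number tested in the study. The negative log transform is shown because PheWAS plots commonly display p-values on a −log10 scale.

```python
import math


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical PheWAS testing one SNP against 1,000 phenotypes.
threshold = bonferroni_threshold(1000)
print(threshold, -math.log10(threshold))
```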