Wenyu Song, Hailiang Huang, Cheng-Zhong Zhang, David W Bates, Adam Wright.
Abstract
Genome-wide association studies depend on accurate ascertainment of patient phenotype. However, phenotyping is difficult, and it is often treated as an afterthought in these studies because of the expense involved. Electronic health records (EHRs) may provide higher fidelity phenotypes for genomic research than other sources such as administrative data. We used whole genome association models to evaluate different EHR and administrative data-based phenotyping methods in a cohort of 16,858 Caucasian subjects for type 1 diabetes mellitus, type 2 diabetes mellitus, coronary artery disease and breast cancer. For each disease, we trained and evaluated polygenic models using three different phenotype definitions: phenotypes derived from billing data, the clinical problem list, or a curated phenotyping algorithm. We observed that for these diseases, the curated phenotype outperformed the problem list, and the problem list outperformed administrative billing data. This suggests that using advanced EHR-derived phenotypes can further increase the power of genome-wide association studies.
Year: 2018 PMID: 30054501 PMCID: PMC6063939 DOI: 10.1038/s41598-018-29634-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of three phenotype extraction methods.
| Phenotype Method | Description | Data Source Structure | Original Purpose | Reviewed by Physicians | Merits/Shortcomings |
|---|---|---|---|---|---|
| Billing Data | Code system created to record all actions that require insurance payment | Structured data | Insurance reimbursement | No | High sensitivity, low specificity |
| Problem List | Temporal record of important problems that happened to patients | Structured data | Diagnosis | Yes | Low sensitivity, high specificity |
| Phenotype Algorithm | Phenotype calculated from a combination of ICD codes and NLP-processed patient notes | Structured data + unstructured data | Phenotype extraction | No | Balances sensitivity and specificity |
For each disease, we used three EHR-derived phenotyping methods to identify the case cohort: billing data, the clinical problem list, and a phenotype algorithm developed by a Partners team.
Figure 1. (a) Schematic diagram of patients identified by three different EHR extraction methods: Billing Data alone, Problem List, or Phenotype Algorithm. (b) Bar chart of the percentages of patients in each phenotyping method who were identified by only that method (Non-overlapping), by two phenotyping methods (Two-way overlapping), or by all three phenotyping methods (Three-way overlapping). For comparison, 69, 1823, 2407 and 466 patients were recognized by all three methods in T1DM, T2DM, CAD and BC, which account for 7%, 49%, 47% and 40% of the total billing code case populations. For the phenotype algorithm, these percentages were 59%, 90%, 79% and 60%, while the problem list approach had slightly lower percentages than the phenotype algorithm. Meanwhile, 826, 1170, 1497 and 392 patients were identified only by billing code, corresponding to 83%, 32%, 29% and 34%, while both the problem list and the phenotype algorithm uniquely identified only a few patients. T1DM: type 1 diabetes mellitus; T2DM: type 2 diabetes mellitus; CAD: coronary artery disease; BC: breast cancer.
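The overlap categories in panel (b) can be illustrated with plain set operations. The following is a minimal sketch, not the authors' code: the function name `overlap_breakdown` and the toy patient IDs are hypothetical, and it assumes each method yields a set of patient identifiers.

```python
def overlap_breakdown(billing, problem_list, algorithm):
    """For each method, count its cases found by that method only,
    by exactly two methods, or by all three methods."""
    methods = {
        "billing": set(billing),
        "problem_list": set(problem_list),
        "algorithm": set(algorithm),
    }
    result = {}
    for name, cases in methods.items():
        # The two other methods' case sets.
        others = [s for other, s in methods.items() if other != name]
        three_way = cases & others[0] & others[1]
        only = cases - others[0] - others[1]
        two_way = cases - only - three_way
        total = len(cases)
        result[name] = {
            "non_overlapping": len(only),
            "two_way": len(two_way),
            "three_way": len(three_way),
            # Share of this method's cases confirmed by both other methods.
            "three_way_pct": 100 * len(three_way) / total if total else 0.0,
        }
    return result


# Toy patient IDs (hypothetical, for illustration only).
billing = {1, 2, 3, 4, 5, 6}
problem_list = {4, 5, 6, 7}
algorithm = {5, 6, 8}
breakdown = overlap_breakdown(billing, problem_list, algorithm)
```

Because the three-way count is the same for every method, the percentage differs only through each method's denominator, which is why a large billing cohort yields a low three-way percentage.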
Summary of genetic evaluations for three phenotypes.
| Disease | Phenotype | PRS Mean Difference (Case-Control) (S.D.) | AUC (S.D.) | Odds Ratio (S.D.) |
|---|---|---|---|---|
| T1DM | Billing Data | 0.00054 (0.00003) | 0.5480 (0.0170) | 1.18 (0.06) |
| T1DM | Problem List | 0.00211 (0.00053) | 0.6378 (0.0314) | 1.33 (0.09) |
| T1DM | Phenotype Algorithm | 0.04516 (0.00783) | 0.7012 (0.0318) | 1.85 (0.07) |
| T2DM | Billing Data | 0.00010 (0.00002) | 0.5473 (0.0063) | 1.24 (0.05) |
| T2DM | Problem List | 0.00028 (0.00004) | 0.5854 (0.0067) | 1.32 (0.06) |
| T2DM | Phenotype Algorithm | 0.00062 (0.00005) | 0.5878 (0.0038) | 1.41 (0.04) |
| CAD | Billing Data | 0.00299 (0.00025) | 0.5604 (0.0049) | 1.22 (0.06) |
| CAD | Problem List | 0.00437 (0.00019) | 0.5973 (0.0095) | 1.35 (0.06) |
| CAD | Phenotype Algorithm | 0.01206 (0.00031) | 0.6249 (0.0041) | 1.68 (0.03) |
| BC | Billing Data | 0.00006 (0.000006) | 0.5291 (0.0029) | 1.07 (0.02) |
| BC | Problem List | 0.00009 (0.000003) | 0.5382 (0.0052) | 1.11 (0.07) |
| BC | Phenotype Algorithm | 0.00015 (0.00002) | 0.5681 (0.0039) | 1.20 (0.03) |
Mean PRS differences between case and control groups for four diseases under three EHR phenotype extraction methods. Odds ratios were obtained from logistic regression, and the predictive performance of the polygenic models was estimated by the area under the curve (AUC). PRS: polygenic risk score; Odds Ratio: odds ratio per standard deviation increase.
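The three reported metrics can be sketched for a single case/control split as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names and toy scores are hypothetical, the odds ratio is assumed to be per standard deviation of the PRS, and the logistic regression is a plain two-parameter Newton-Raphson fit with no covariates.

```python
import math
from statistics import mean, pstdev


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control
    (ties count half); equals the area under the ROC curve."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def odds_ratio_per_sd(scores, labels, iterations=25):
    """Fit logit(P(case)) = b0 + b1 * z by Newton-Raphson, where z is the
    standardized score, and return exp(b1)."""
    mu, sd = mean(scores), pstdev(scores)
    z = [(s - mu) / sd for s in scores]
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(z, labels):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
            h00 += w             # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)


# Toy PRS values (hypothetical); cases and controls overlap so the fit is finite.
cases = [0.2, 0.5, 0.9, 1.1, 0.4]
controls = [0.1, 0.3, 0.6, 0.0, 0.7]

mean_diff = mean(cases) - mean(controls)
model_auc = auc(cases, controls)
or_per_sd = odds_ratio_per_sd(cases + controls, [1] * len(cases) + [0] * len(controls))
```

The pairwise AUC is quadratic in cohort size and is used here only for clarity; a rank-based computation would be preferable for cohorts of the size described above.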
Figure 2. The receiver operating characteristic curve (ROC curve) for polygenic models in four diseases using three different EHR phenotype extraction methods.
Summary of genetic evaluations for three phenotypes using published odds ratios.
| Disease | Phenotype | PRS Mean Difference (Case-Control) | AUC | Odds Ratio |
|---|---|---|---|---|
| T1DM | Billing Data | 0.00056 | 0.5381 | 1.12 |
| T1DM | Problem List | 0.00329 | 0.6219 | 1.39 |
| T1DM | Phenotype Algorithm | 0.00619 | 0.7212 | 1.83 |
| T2DM | Billing Data | 0.00012 | 0.5439 | 1.19 |
| T2DM | Problem List | 0.00018 | 0.5632 | 1.38 |
| T2DM | Phenotype Algorithm | 0.00021 | 0.5821 | 1.38 |
| CAD | Billing Data | 0.00012 | 0.5639 | 1.42 |
| CAD | Problem List | 0.00029 | 0.5891 | 1.44 |
| CAD | Phenotype Algorithm | 0.00037 | 0.6372 | 1.61 |
| BC | Billing Data | 0.00052 | 0.5339 | 1.42 |
| BC | Problem List | 0.00058 | 0.5613 | 1.47 |
| BC | Phenotype Algorithm | 0.00061 | 0.5701 | 1.61 |
Mean PRS differences between case and control groups for four diseases. We used summary statistics from published GWAS studies of these diseases to calculate the PRS. Odds ratios were obtained from logistic regression, and the predictive performance of the polygenic models was estimated by the area under the curve (AUC). PRS: polygenic risk score; Odds Ratio: odds ratio per standard deviation increase.
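One common way to build a PRS from published summary statistics is a weighted allele count, with each variant weighted by the natural log of its published odds ratio. The record above does not state the exact formula used, so the sketch below is an assumption; the function name, SNP identifiers, and odds ratios are all hypothetical.

```python
import math


def polygenic_risk_score(dosages, published_odds_ratios):
    """Sum of risk-allele counts weighted by log(published odds ratio).

    dosages: mapping of SNP id -> risk-allele count (0, 1, or 2) for one person.
    published_odds_ratios: mapping of SNP id -> per-allele odds ratio.
    SNPs without a published odds ratio are skipped.
    """
    return sum(
        math.log(published_odds_ratios[snp]) * count
        for snp, count in dosages.items()
        if snp in published_odds_ratios
    )


# Hypothetical SNP ids and odds ratios, for illustration only.
odds_ratios = {"snp_a": 1.5, "snp_b": 0.8}
person = {"snp_a": 2, "snp_b": 1, "snp_unlisted": 1}
score = polygenic_risk_score(person, odds_ratios)
```

Working on the log scale makes the score additive across variants: here it equals log(1.5 × 1.5 × 0.8).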
Summary of genetic evaluations for billing code sub-phenotypes.
| Disease | Phenotype | PRS Mean Difference (Case-Control) (S.D.) | AUC (S.D.) | Odds Ratio (S.D.) |
|---|---|---|---|---|
| T1DM | Billing_visit1 | 0.00054 (0.00003) | 0.5480 (0.0170) | 1.18 (0.06) |
| T1DM | Billing_visit2 | 0.00138 (0.00007) | 0.5689 (0.0210) | 1.26 (0.05) |
| T1DM | Billing_visit3 | 0.00153 (0.00008) | 0.5512 (0.0190) | 1.26 (0.09) |
| T2DM | Billing_visit1 | 0.00010 (0.00002) | 0.5473 (0.0063) | 1.24 (0.05) |
| T2DM | Billing_visit2 | 0.00014 (0.00003) | 0.5666 (0.0048) | 1.25 (0.03) |
| T2DM | Billing_visit3 | 0.00013 (0.00001) | 0.5656 (0.0039) | 1.25 (0.05) |
| CAD | Billing_visit1 | 0.00299 (0.00025) | 0.5604 (0.0049) | 1.22 (0.06) |
| CAD | Billing_visit2 | 0.00475 (0.00031) | 0.5749 (0.0042) | 1.31 (0.05) |
| CAD | Billing_visit3 | 0.00472 (0.00019) | 0.5779 (0.0039) | 1.31 (0.09) |
| BC | Billing_visit1 | 0.00006 (0.000006) | 0.5291 (0.0029) | 1.07 (0.02) |
| BC | Billing_visit2 | 0.00009 (0.000003) | 0.5390 (0.0017) | 1.12 (0.03) |
| BC | Billing_visit3 | 0.00009 (0.000008) | 0.5293 (0.0023) | 1.10 (0.05) |
Mean PRS differences between case and control groups for four diseases. We subset the billing data patient cohorts by number of hospital visits: patients with at least one, at least two, or at least three visits. Odds ratios were obtained from logistic regression, and the predictive performance of the polygenic models was estimated by the area under the curve (AUC). PRS: polygenic risk score; Billing_visit1: billing code patients with at least one visit; Billing_visit2: billing code patients with at least two visits; Billing_visit3: billing code patients with at least three visits.
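The visit-count sub-phenotypes can be sketched as a threshold on the number of visits per patient. This is a minimal illustration with hypothetical names and toy records; it assumes each record is a `(patient_id, visit_date)` pair carrying the disease's billing code, and that repeated codes on the same date count as one visit.

```python
from collections import Counter


def billing_cases_by_visit_threshold(billing_records, thresholds=(1, 2, 3)):
    """Return, for each threshold t, the set of patients with at least t
    distinct coded visits, keyed as Billing_visit<t>."""
    # Deduplicate so several codes on the same date count as a single visit.
    visits = Counter(patient_id for patient_id, _visit_date in set(billing_records))
    return {
        f"Billing_visit{t}": {pid for pid, n in visits.items() if n >= t}
        for t in thresholds
    }


# Toy billing records (hypothetical patient ids and dates).
records = [
    ("p1", "2015-01-03"),
    ("p1", "2015-02-10"),
    ("p1", "2015-02-10"),  # duplicate code on the same date
    ("p1", "2016-05-01"),
    ("p2", "2015-03-01"),
    ("p3", "2014-07-07"),
    ("p3", "2014-09-09"),
]
cohorts = billing_cases_by_visit_threshold(records)
```

Each higher threshold yields a subset of the previous cohort, so Billing_visit1 is identical to the unfiltered billing data phenotype.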