Matthew K Breitenstein, Hongfang Liu, Kara N Maxwell, Jyotishman Pathak, Rui Zhang.
Abstract
Precision medicine is at the forefront of biomedical research. Cancer registries provide rich perspectives, and electronic health records (EHRs) are commonly used to gather the additional clinical data elements needed for translational research. However, manual annotation is resource-intensive and not readily scalable. Informatics-based phenotyping presents an ideal solution, but the perspectives obtained can be affected by both data source and algorithm selection. We derived breast cancer (BC) receptor status phenotypes from structured and unstructured EHR data using rule-based algorithms, including natural language processing (NLP). Overall, the use of NLP increased BC receptor status coverage by 39.2%, from 69.1% with structured medication information alone. Using all available EHR data, estrogen receptor-positive BC cases were ascertained with high precision (P = 0.976) and recall (R = 0.987) compared with gold standard chart-reviewed patients. However, status negation (R = 0.591) decreased by 40.2% when relying on structured medications alone. Using multiple EHR data types (and a thorough understanding of the perspectives offered) is necessary to derive robust EHR-based precision medicine phenotypes.
Year: 2018 PMID: 29084368 PMCID: PMC5759745 DOI: 10.1111/cts.12514
Source DB: PubMed Journal: Clin Transl Sci ISSN: 1752-8054 Impact factor: 4.689
Figure 1. Overview of cohorts and pseudo code. A series of steps were taken to develop the breast cancer precision medicine phenotype; these included: 1) Data preprocessing, where data were extracted from both structured and unstructured (i.e., free text notes) electronic health record (EHR) data sources and linked with cancer registry data; 2) Ascertainment of receptor status from multiple EHR perspectives, initiated with extraction of necessary data features and subsequently attributed via a series of rules; 3) Development of a Gold Standard Cohort, consisting of patients manually chart reviewed and annotated for receptor status, to evaluate performance of the EHR rule-based algorithm; and 4) Perspectives and methodologies utilized to evaluate performance of the EHR rule-based algorithm.
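Step 2 of the figure (feature extraction from free text, followed by rule-based attribution) can be illustrated with a minimal sketch. The regular expression, the function name `er_status_from_note`, and the conflict rule below are hypothetical and chosen only for illustration; they are not the study's actual patterns or rules.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the study's rule set): an ER mention,
# optional filler such as "status" or "is", then a positive/negative value.
_ER_MENTION = re.compile(
    r"\b(?:ER|estrogen receptor)\b[\s:-]*(?:(?:status|is|was)\b[\s:]*)*"
    r"(?P<value>positive|negative|\+|-)",
    re.IGNORECASE,
)


def er_status_from_note(text: str) -> Optional[str]:
    """Return '+', '-', or None (unresolved) for ER status in one note."""
    found = set()
    for match in _ER_MENTION.finditer(text):
        value = match.group("value").lower()
        found.add("+" if value in {"positive", "+"} else "-")
    # Attribute a status only when every mention in the note agrees.
    if len(found) == 1:
        return found.pop()
    return None  # no mention, or conflicting mentions


print(er_status_from_note("Invasive ductal carcinoma, ER-positive, PR negative."))
print(er_status_from_note("ER+ on biopsy; repeat testing ER negative"))
```

Returning `None` for conflicting mentions mirrors the "nr" (unable to resolve) category used in the coverage table; a production rule set would also need to handle negation scope, dates, and multiple tumors.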
Gold Standard Cohort receptor status
| Cohort | n | ER (+) | ER (–) | ER na | PR (+) | PR (–) | PR na | HER2 (+) | HER2 (–) | HER2 na | TNBC yes | TNBC no | TNBC na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gold standard | 871 | 751 | 113 | 7 | 678 | 187 | 6 | 116 | 453 | 302 | 44 | 682 | 145 |
| Training | 436 | 379 | 54 | 3 | 345 | 90 | 1 | 64 | 216 | 156 | 20 | 346 | 70 |
| Validation | 435 | 372 | 59 | 4 | 333 | 97 | 5 | 52 | 237 | 146 | 24 | 336 | 75 |
ER = estrogen receptor, PR = progesterone receptor, HER2 = human epidermal growth factor receptor 2, TNBC = triple negative breast cancer, (+) = status positive, (–) = status negative, na = not available.
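Triple negative status is a composite of the three individual receptor statuses. The sketch below shows one plausible way to combine them; the function name `derive_tnbc` and the handling of missing values are assumptions, and the annotation rules actually used for the Gold Standard Cohort are not stated here.

```python
from typing import Optional


def derive_tnbc(er: Optional[str], pr: Optional[str], her2: Optional[str]) -> str:
    """Combine ER, PR, and HER2 statuses ('+', '-', or None) into 'yes', 'no', or 'na'.

    Assumed rule (hypothetical, for illustration only):
      - any positive receptor        -> 'no'
      - all three receptors negative -> 'yes'
      - otherwise                    -> 'na'
    """
    statuses = (er, pr, her2)
    if any(s == "+" for s in statuses):
        return "no"   # a single positive receptor rules out triple negative disease
    if all(s == "-" for s in statuses):
        return "yes"  # all three receptors are documented as negative
    return "na"       # no positive receptor, but at least one status is missing


print(derive_tnbc("-", "-", "-"))   # yes
print(derive_tnbc("+", None, None)) # no
print(derive_tnbc("-", "-", None))  # na
```

Under this rule a single positive receptor is enough to assign "no", whereas "yes" requires all three statuses to be available, so missing HER2 results translate directly into unresolved TNBC status.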
Coverage of individual EHR data sources and phenotypes
| Receptor status | Status | Prescribed medications: EDT | Prescribed medications: clinical notes | Prescribed medications: EDT or clinical notes | Clinical narrative: clinical notes | Clinical narrative: pathology notes | Clinical narrative: clinical or pathology | All EHR perspective |
|---|---|---|---|---|---|---|---|---|
| Observation Cohort coverage | | 8,826 | 8,078 | 9,851 | 11,287 | 10,236 | 11,766 | 12,291 |
| ER feature coverage | (+) | 3,761 | 3,507 | 4,491 | 8,305 | 7,045 | 8,771 | 9,069 |
| | (–) | 5,065 | 4,571 | 5,360 | 2,214 | 1,921 | 2,374 | 2,241 |
| | nr | 0 | 0 | 0 | 768 | 1,270 | 621 | 981 |
| PR feature coverage | (+) | — | — | — | 7,205 | 5,341 | 7,531 | 7,531 |
| | (–) | — | — | — | 3,030 | 2,728 | 3,255 | 3,255 |
| | nr | 8,826 | 8,078 | 9,851 | 1,052 | 2,167 | 980 | 1,505 |
| HER2 feature coverage | (+) | 6 | 121 | 121 | 1,611 | 1,438 | 1,770 | 1,786 |
| | (–) | 8,820 | 7,957 | 9,730 | 5,398 | 4,589 | 5,903 | 5,897 |
| | nr | 0 | 0 | 0 | 4,278 | 4,209 | 4,093 | 4,608 |
| TNBC feature coverage | yes | 0 | 0 | 0 | 1,014 | 606 | 1,102 | 1,035 |
| | no | 3,763 | 3,556 | 4,538 | 7,162 | 5,415 | 7,582 | 9,876 |
| | nr | 5,063 | 4,522 | 5,313 | 3,111 | 4,215 | 3,082 | 1,380 |
Receptor status phenotype coverage by clinical data source. Note: total cohort size n = 12,770; cohort coverage refers to coverage of that clinical data source out of the total cohort size. ER = estrogen receptor, PR = progesterone receptor, HER2 = human epidermal growth factor receptor 2, TNBC = triple negative breast cancer, nr = true missing or unable to resolve status; % = percentage of patients with relevant EHR data source coverage for individual receptor status phenotypes.
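As a worked example of the coverage definition in the note, the sketch below divides the Observation Cohort counts for two data sources by the total cohort size (n = 12,770). The helper names are illustrative. Treating the abstract's 39.2% as a relative increase between the rounded coverage percentages is an assumption about how that figure was derived.

```python
TOTAL_COHORT = 12_770  # total cohort size stated in the table note


def coverage_fraction(covered: int, total: int = TOTAL_COHORT) -> float:
    """Fraction of the total cohort covered by a given data source."""
    return covered / total


def relative_increase(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved (0.10 means +10%)."""
    return (improved - baseline) / baseline


edt = coverage_fraction(8_826)       # structured medication data (EDT column)
all_ehr = coverage_fraction(12_291)  # all EHR perspective

print(f"{edt:.1%}")      # 69.1%
print(f"{all_ehr:.1%}")  # 96.2%

# Assumption: relative increase computed from the rounded percentages.
print(round(relative_increase(69.1, 96.2), 3))
```

The first value reproduces the 69.1% baseline quoted in the abstract; the relative increase between the rounded percentages comes out close to the reported 39.2%.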
Receptor status phenotype performance compared with the manually chart-reviewed Gold Standard Cohort (testing subset)
| Receptor status | Metric | Prescribed medications: EDT | Prescribed medications: clinical notes | Prescribed medications: EDT or clinical notes | Clinical narratives: clinical | Clinical narratives: pathology | Clinical narratives: clinical or pathology | All EHR sources |
|---|---|---|---|---|---|---|---|---|
| Coverage | | 374 | 335 | 374 | 377 | 360 | 377 | 435 |
| ER | P | 0.9849 | 0.9702 | 0.9710 | 0.9877 | 0.9861 | 0.9877 | 0.9758 |
| | R | 0.5909 | 0.5909 | 0.6091 | 0.9786 | 0.9100 | 0.9847 | 0.9877 |
| | F | 0.7386 | 0.7345 | 0.7486 | 0.9831 | 0.9465 | 0.9862 | 0.9818 |
| PR | P | | | | 0.9784 | 0.9730 | 0.9857 | 0.9857 |
| | R | | | | 0.9347 | 0.7780 | 0.9418 | 0.9418 |
| | F | | | | 0.9561 | 0.8657 | 0.9632 | 0.9632 |
| HER2 | P | 0.0000 | 1.0000 | 1.0000 | 0.7750 | 0.4583 | 0.6977 | 0.6977 |
| | R | 0.0000 | 0.0294 | 0.0222 | 0.6889 | 0.5116 | 0.6667 | 0.6667 |
| | F | 0.0000 | 0.0571 | 0.0435 | 0.7294 | 0.4835 | 0.6818 | 0.6818 |
| TN | P | | | | 0.6522 | 0.8462 | 0.7000 | 0.7222 |
| | R | | | | 0.7895 | 0.5790 | 0.7368 | 0.6842 |
| | F | | | | 0.7143 | 0.6875 | 0.7180 | 0.7027 |
All comparisons were made to the “gold standard” validation cohort (n = 435). P = precision, R = recall, F = F1 score (harmonic mean of precision and recall); ER = estrogen receptor, PR = progesterone receptor, HER2 = human epidermal growth factor receptor 2, TN = triple negative. PR and TN are blank under prescribed medications because they cannot be directly inferred from a prescribed medications perspective.
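The three metrics in the table follow their standard definitions, sketched below. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here (it matches the 0.0000 entries reported for HER2 under EDT, but the study's own handling is not stated).

```python
def precision(tp: int, fp: int) -> float:
    """Share of algorithm-positive cases that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly positive cases that the algorithm found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# ER, prescribed medications (EDT): P = 0.9849, R = 0.5909 in the table.
print(round(f1_score(0.9849, 0.5909), 4))  # 0.7386

# HER2, prescribed medications (EDT): P = R = 0, so F is defined as 0 here.
print(f1_score(0.0, 0.0))  # 0.0
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the high ER precision under EDT (0.9849) cannot compensate for the low recall (0.5909), which is why F drops to 0.7386.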