Kenneth L Kehl, Wenxin Xu, Alexander Gusev, Ziad Bakouny, Toni K Choueiri, Irbaz Bin Riaz, Haitham Elmarakeby, Eliezer M Van Allen, Deborah Schrag.
Abstract
To accelerate cancer research that correlates biomarkers with clinical endpoints, methods are needed to ascertain outcomes from electronic health records at scale. Here, we train deep natural language processing (NLP) models to extract outcomes for participants with any of 7 solid tumors in a precision oncology study. Outcomes are extracted from 305,151 imaging reports for 13,130 patients and 233,517 oncologist notes for 13,511 patients, including patients with 6 additional cancer types. NLP models recapitulate outcome annotation from these documents, including the presence of cancer, progression/worsening, response/improvement, and metastases, with excellent discrimination (AUROC > 0.90). Models generalize to cancers excluded from training and yield outcomes correlated with survival. Among patients receiving checkpoint inhibitors, we confirm that high tumor mutation burden is associated with superior progression-free survival ascertained using NLP. Here, we show that deep NLP can accelerate annotation of molecular cancer datasets with clinically meaningful endpoints to facilitate discovery.
Year: 2021 PMID: 34911934 PMCID: PMC8674229 DOI: 10.1038/s41467-021-27358-6
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Characteristics of patients with radiology reports for analysis.
| Characteristic | All patients, n (%) | All radiology reports, n (%) | Patients with unlabeled radiology reports, n (%) | Unlabeled radiology reports, n (%) | Patients with labeled radiology reports, n (%) | Labeled radiology reports, n (%) |
|---|---|---|---|---|---|---|
| Total cohort | 13130 (100) | 304160 (100) | 10300 (100) | 272964 (100) | 2830 (100) | 31196 (100) |
| Sex | ||||||
| Male | 5621 (43) | 105503 (35) | 4055 (39) | 89849 (33) | 1566 (55) | 15654 (50) |
| Female | 7509 (57) | 198657 (65) | 6245 (61) | 183115 (67) | 1264 (45) | 15542 (50) |
| Age at next generation genomic sequencing | ||||||
| <40 | 625 (5) | 14439 (5) | 488 (5) | 12835 (5) | 137 (5) | 1604 (5) |
| 40–49 | 1329 (10) | 30868 (10) | 999 (10) | 26490 (10) | 330 (12) | 4378 (14) |
| 50–59 | 3092 (24) | 75681 (25) | 2400 (23) | 67920 (25) | 692 (24) | 7761 (25) |
| 60–69 | 4172 (32) | 99399 (33) | 3295 (32) | 90158 (33) | 877 (31) | 9241 (30) |
| 70–79 | 2944 (22) | 65229 (21) | 2335 (23) | 58700 (22) | 609 (22) | 6529 (21) |
| 80+ | 968 (7) | 18544 (6) | 783 (8) | 16861 (6) | 185 (7) | 1683 (5) |
| Race as recorded in the electronic health record | ||||||
| Asian | 424 (3) | 10724 (4) | 353 (3) | 9716 (4) | 71 (3) | 1008 (3) |
| African-American | 458 (3) | 10649 (4) | 348 (3) | 9470 (3) | 110 (4) | 1179 (4) |
| Native American | 11 (<1) | 193 (<1) | 10 (<1) | 184 (<1) | 1 (<1) | 9 (<1) |
| Pacific Islander | 4 (<1) | 144 (<1) | 4 (<1) | 144 (<1) | 0 (0) | 0 (0) |
| White | 11760 (90) | 272156 (89) | 9205 (89) | 244173 (89) | 2555 (90) | 27983 (90) |
| More than one race | 39 (<1) | 729 (<1) | 33 (<1) | 652 (<1) | 6 (<1) | 77 (<1) |
| Other/unknown | 434 (3) | 9565 (3) | 347 (3) | 8625 (3) | 87 (3) | 940 (3) |
| Cancer type | ||||||
| Breast | 2029 (15) | 63789 (21) | 1676 (16) | 58209 (21) | 352 (12) | 5527 (18) |
| Colorectal | 1958 (15) | 37570 (12) | 1493 (14) | 32986 (12) | 466 (16) | 4588 (15) |
| Endometrial | 482 (4) | 9801 (3) | 482 (5) | 9801 (4) | 0 (0) | 0 (0) |
| Gastroesophageal | 878 (7) | 19794 (7) | 878 (9) | 19794 (7) | 0 (0) | 0 (0) |
| Head and neck | 461 (4) | 8796 (3) | 460 (4) | 8795 (3) | 0 (0) | 0 (0) |
| Leiomyosarcoma | 144 (1) | 6241 (2) | 144 (1) | 6241 (2) | 0 (0) | 0 (0) |
| Non-small cell lung | 3378 (26) | 82609 (27) | 2763 (27) | 73758 (27) | 614 (22) | 8838 (28) |
| Melanoma | 733 (6) | 20621 (7) | 731 (7) | 20591 (8) | 0 (0) | 0 (0) |
| Ovarian | 646 (5) | 22248 (7) | 646 (6) | 22248 (8) | 0 (0) | 0 (0) |
| Pancreatic | 685 (5) | 7854 (3) | 295 (3) | 4477 (2) | 394 (14) | 3450 (11) |
| Prostate | 617 (5) | 7506 (2) | 164 (2) | 2851 (1) | 453 (16) | 4676 (15) |
| Renal cell carcinoma | 499 (4) | 4737 (2) | 84 (<1) | 1721 (<1) | 415 (15) | 3016 (10) |
| Urothelial carcinoma | 620 (5) | 12594 (4) | 484 (5) | 11492 (4) | 136 (5) | 1101 (4) |
| Common tumor genomic variants | ||||||
| TP53 mutation | 5330 (41) | 124663 (41) | 4286 (42) | 112237 (41) | 1044 (37) | 12426 (40) |
| KRAS mutation | 2785 (21) | 53735 (18) | 2012 (20) | 45775 (17) | 773 (27) | 7960 (26) |
| PIK3CA mutation | 1738 (13) | 43168 (14) | 1455 (14) | 39788 (15) | 283 (10) | 3380 (11) |
| APC mutation | 1215 (9) | 24381 (8) | 942 (9) | 21627 (8) | 273 (10) | 2754 (9) |
| BRAF mutation | 688 (5) | 15938 (5) | 587 (6) | 14918 (5) | 101 (4) | 1020 (3) |
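The cells in the table above follow a "count (percent)" convention in which nonzero shares below 1% appear as "<1". The following is a minimal sketch of that formatting, inferred from the table entries rather than taken from the authors' code; the function name `n_pct` and the half-up rounding rule are assumptions made for illustration.

```python
def n_pct(count: int, total: int) -> str:
    """Format a count as 'n (%)' relative to a column total.

    Assumed convention (inferred from the table, not stated by the authors):
    zero counts print as '0 (0)', nonzero shares below 1% print as '<1',
    and everything else is rounded half-up to a whole percent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if count == 0:
        return "0 (0)"
    pct = 100.0 * count / total
    if pct < 1.0:
        return f"{count} (<1)"
    return f"{count} ({int(pct + 0.5)})"


# Values taken from the radiology report table.
print(n_pct(5621, 13130))  # male patients in the total cohort
print(n_pct(84, 10300))    # renal cell carcinoma patients, unlabeled reports
print(n_pct(0, 2830))      # endometrial patients, labeled reports
```

Recomputing cells this way is a quick check for transcription slips in the percentage columns.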
Characteristics of patients with medical oncologist notes for analysis.
| Characteristic | All patients, n (%) | All medical oncologist notes, n (%) | Patients with unlabeled notes, n (%) | Unlabeled notes, n (%) | Patients with labeled notes, n (%) | Labeled notes, n (%) |
|---|---|---|---|---|---|---|
| Total cohort | 13511 (100) | 232575 (100) | 10764 (100) | 200264 (100) | 2747 (100) | 32311 (100) |
| Sex | ||||||
| Male | 5561 (41) | 88755 (38) | 4088 (38) | 71790 (36) | 1473 (54) | 16965 (53) |
| Female | 7950 (59) | 143820 (62) | 6676 (62) | 128474 (64) | 1274 (46) | 15346 (47) |
| Age at next generation tumor genomic sequencing | ||||||
| <40 | 733 (5) | 13111 (6) | 574 (5) | 11226 (6) | 159 (6) | 1885 (6) |
| 40–49 | 1477 (11) | 26420 (11) | 1139 (11) | 21511 (11) | 338 (12) | 4909 (15) |
| 50–59 | 3297 (24) | 61142 (26) | 2616 (24) | 52618 (26) | 681 (25) | 8524 (26) |
| 60–69 | 4277 (32) | 74818 (32) | 3432 (32) | 65689 (33) | 845 (31) | 9129 (28) |
| 70–79 | 2864 (21) | 45914 (20) | 2301 (21) | 39553 (20) | 563 (20) | 6361 (20) |
| 80+ | 863 (6) | 11170 (5) | 702 (7) | 9667 (5) | 161 (6) | 1503 (5) |
| Race as recorded in the electronic health record | ||||||
| Asian | 439 (3) | 7914 (3) | 361 (3) | 6902 (3) | 78 (3) | 1012 (3) |
| African-American | 445 (3) | 7785 (3) | 344 (3) | 6550 (3) | 101 (4) | 1235 (4) |
| Native American | 13 (<1) | 99 (<1) | 11 (<1) | 93 (<1) | 2 (<1) | 6 (<1) |
| Pacific Islander | 4 (<1) | 123 (<1) | 4 (<1) | 123 (<1) | 0 (0) | 0 (0) |
| White | 12132 (90) | 207897 (89) | 9653 (90) | 179147 (89) | 2479 (90) | 28750 (89) |
| More than one race | 36 (<1) | 738 (<1) | 31 (<1) | 655 (<1) | 5 (<1) | 83 (<1) |
| Other/unknown | 442 (3) | 8019 (3) | 360 (3) | 6794 (3) | 82 (3) | 1225 (4) |
| Cancer type | ||||||
| Breast | 2382 (18) | 47595 (20) | 1972 (18) | 41462 (21) | 409 (15) | 6105 (19) |
| Colorectal | 2447 (18) | 35459 (15) | 1922 (18) | 29451 (15) | 526 (19) | 6011 (19) |
| Endometrial | 524 (4) | 6754 (3) | 524 (5) | 6754 (3) | 0 (0) | 0 (0) |
| Gastroesophageal | 1019 (8) | 16363 (7) | 1019 (9) | 16363 (8) | 0 (0) | 0 (0) |
| Head and neck | 447 (3) | 10901 (5) | 446 (4) | 10898 (5) | 0 (0) | 0 (0) |
| Leiomyosarcoma | 168 (1) | 4581 (2) | 168 (2) | 4581 (2) | 0 (0) | 0 (0) |
| Non-small cell lung | 2838 (21) | 43360 (19) | 2297 (21) | 38090 (19) | 540 (20) | 5237 (16) |
| Melanoma | 756 (6) | 19100 (8) | 754 (7) | 19064 (10) | 0 (0) | 0 (0) |
| Ovarian | 713 (5) | 19885 (9) | 713 (7) | 19885 (10) | 0 (0) | 0 (0) |
| Pancreatic | 878 (6) | 9111 (4) | 397 (4) | 5016 (3) | 485 (18) | 4173 (13) |
| Prostate | 549 (4) | 7818 (3) | 99 (<1) | 1678 (<1) | 451 (16) | 6167 (19) |
| Renal cell carcinoma | 364 (3) | 4434 (2) | 93 (<1) | 756 (<1) | 271 (10) | 3680 (11) |
| Urothelial carcinoma | 426 (3) | 7214 (3) | 360 (3) | 6266 (3) | 65 (2) | 938 (3) |
| Common tumor genomic variants | ||||||
| TP53 mutation | 5780 (43) | 99351 (43) | 4675 (43) | 87185 (44) | 1105 (40) | 12166 (38) |
| KRAS mutation | 2993 (22) | 39485 (17) | 2161 (20) | 31814 (16) | 832 (30) | 7671 (24) |
| PIK3CA mutation | 1897 (14) | 32896 (14) | 1618 (15) | 29345 (15) | 279 (10) | 3551 (11) |
| APC mutation | 1457 (11) | 22278 (10) | 1147 (11) | 18628 (9) | 310 (11) | 3650 (11) |
| BRAF mutation | 727 (5) | 13091 (6) | 628 (6) | 12071 (6) | 99 (4) | 1020 (3) |
Areas under the receiver operating characteristic curve (AUROCs) for NLP models that interpret imaging report text to ascertain clinical outcomes, as evaluated in the labeled test set.
| Clinical outcome | All patients | Breast cancer | Colorectal cancer | NSCLC | Pancreatic cancer | Prostate cancer | Renal cell carcinoma | Urothelial carcinoma |
|---|---|---|---|---|---|---|---|---|
| Models trained on imaging reports for patients with all listed cancer types | ||||||||
| Any cancer | 0.98 | 0.98 | 0.98 | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 |
| Progression | 0.95 | 0.95 | 0.96 | 0.96 | 0.93 | 0.96 | 0.96 | 0.94 |
| Response | 0.97 | 0.98 | 0.99 | 0.96 | 0.97 | 0.94 | 0.97 | 0.99 |
| Brain metastasis | 0.99 | 0.97 | 1.0 | 0.99 | 1.0 | 0.98 | 0.97 | * |
| Bone metastasis | 0.99 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 |
| Adrenal metastasis | 0.99 | 0.99 | 0.99 | 1.0 | 1.0 | * | 0.95 | 0.99 |
| Liver metastasis | 0.99 | 1.0 | 0.99 | 1.0 | 0.97 | 0.98 | 1.0 | 1.0 |
| Lung metastasis | 0.98 | 0.99 | 0.99 | 0.97 | 0.98 | 0.97 | 0.98 | 0.99 |
| Nodal metastasis | 0.98 | 0.99 | 0.97 | 0.98 | 0.96 | 0.99 | 0.99 | 0.97 |
| Peritoneal metastasis | 0.99 | 0.99 | 0.99 | 1.0 | 0.96 | 1.0 | * | 0.81 |
| Models trained on imaging reports for patients with all cancers except for the type under evaluation | ||||||||
| Any cancer | † | 0.98 | 0.98 | 0.95 | 0.95 | 0.98 | 0.98 | 0.96 |
| Progression | † | 0.94 | 0.96 | 0.96 | 0.92 | 0.96 | 0.95 | 0.94 |
| Response | † | 0.98 | 0.99 | 0.96 | 0.97 | 0.94 | 0.96 | 0.99 |
| Brain metastasis | † | 0.97 | 1.0 | 0.99 | 1.0 | 0.98 | 0.99 | * |
| Bone metastasis | † | 0.97 | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 1.0 |
| Adrenal metastasis | † | 0.99 | 0.99 | 0.99 | 1.0 | * | 0.94 | 0.99 |
| Liver metastasis | † | 1.0 | 0.99 | 0.99 | 0.96 | 0.98 | 1.0 | 1.0 |
| Lung metastasis | † | 0.99 | 0.99 | 0.94 | 0.97 | 0.97 | 0.97 | 0.99 |
| Nodal metastasis | † | 0.99 | 0.98 | 0.98 | 0.96 | 0.99 | 0.99 | 0.97 |
| Peritoneal metastasis | † | 0.99 | 0.98 | 1.0 | 0.97 | 1.0 | * | 0.77 |
*Not enough variation in outcome label to evaluate AUROC
†Not applicable
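The AUROC values above, and the starred cells where the outcome label did not vary, can be illustrated with a small rank-based calculation. This is a minimal sketch using only the standard library, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are hypothetical.

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Rank-based AUROC: the probability that a randomly chosen positive
    document scores higher than a randomly chosen negative one.

    Returns None when only one class is present, which corresponds to the
    'not enough variation in outcome label' case in the table.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical per-report labels and model scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auroc(y_true, y_score))             # 0.75
print(auroc([0, 0, 0], [0.2, 0.5, 0.9]))  # None
```

The pairwise loop is quadratic in the number of documents, which is fine for a sketch; a library routine would be the natural choice at the scale of the test sets described here.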
Areas under the receiver operating characteristic curve (AUROCs) for NLP models that interpret medical oncologist progress note text to ascertain clinical outcomes, as evaluated in the labeled test set.
| Clinical outcome | All patients | Breast cancer | Colorectal cancer | NSCLC | Pancreatic cancer | Prostate cancer | Renal cell carcinoma | Urothelial carcinoma |
|---|---|---|---|---|---|---|---|---|
| Models trained on all cancer types | ||||||||
| Any cancer | 0.93 | 0.98 | 0.98 | 0.95 | 0.78 | 0.91 | 0.91 | 0.95 |
| Progression | 0.92 | 0.97 | 0.91 | 0.96 | 0.92 | 0.87 | 0.86 | 0.78 |
| Response | 0.93 | 0.96 | 0.95 | 0.95 | 0.91 | 0.87 | 0.89 | 0.99 |
| Models trained on all cancers except for the type under evaluation | ||||||||
| Any cancer | † | 0.98 | 0.97 | 0.95 | 0.79 | 0.85 | 0.90 | 0.95 |
| Progression | † | 0.96 | 0.90 | 0.95 | 0.91 | 0.84 | 0.84 | 0.78 |
| Response | † | 0.94 | 0.94 | 0.94 | 0.88 | 0.83 | 0.88 | 1.0 |
†Not applicable
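The second block of each AUROC table evaluates models trained on every labeled cancer type except the one under evaluation. A minimal sketch of how such a leave-one-cancer-type-out split could be constructed is shown here; the document identifiers, the `DOCS` list, and the function name are hypothetical, and the authors' actual pipeline may differ.

```python
from typing import Iterator, List, Tuple

# Hypothetical labeled documents: (document_id, cancer_type).
DOCS: List[Tuple[str, str]] = [
    ("d1", "breast"), ("d2", "breast"),
    ("d3", "prostate"),
    ("d4", "nsclc"), ("d5", "nsclc"),
]


def leave_one_type_out(
    docs: List[Tuple[str, str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_type, train_ids, test_ids) once per cancer type.

    Training uses every document from the other cancer types; evaluation
    uses only documents from the held-out type.
    """
    for held_out in sorted({cancer for _, cancer in docs}):
        train = [doc for doc, cancer in docs if cancer != held_out]
        test = [doc for doc, cancer in docs if cancer == held_out]
        yield held_out, train, test


splits = {h: (train, test) for h, train, test in leave_one_type_out(DOCS)}
for held_out, (train, test) in splits.items():
    print(held_out, "train:", train, "test:", test)
```

Because the held-out type never appears in training, similar AUROCs in the two table blocks suggest the models rely on language that transfers across cancers rather than on type-specific cues.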
Association between PRISSMM outcomes and overall survival among patients receiving palliative-intent systemic therapy (Hazard ratio, 95% confidence interval).
| Cohort | Imaging reports: N | Imaging reports: progression/worsening | Imaging reports: response/improvement | Medical oncologist notes: N | Medical oncologist notes: progression/worsening | Medical oncologist notes: response/improvement |
|---|---|---|---|---|---|---|
| All patients* | 4953 | 2.03 (1.87–2.20) | 0.36 (0.30–0.43) | 5064 | 4.34 (4.02–4.70) | 0.45 (0.38–0.54) |
| All patients with labeled cancer types* | 3712 | 2.10 (1.91–2.31) | 0.39 (0.32–0.47) | 3797 | 4.25 (3.88–4.64) | 0.44 (0.37–0.55) |
| Breast cancer† | 1058 | 2.19 (1.81–2.65) | 0.44 (0.30–0.65) | 1080 | 6.30 (5.23–7.58) | 0.56 (0.36–0.87) |
| Colorectal cancer† | 674 | 1.91 (1.54–2.37) | 0.19 (0.10–0.36) | 701 | 3.97 (3.24–4.87) | 0.27 (0.14–0.55) |
| NSCLC† | 1151 | 2.25 (1.92–2.64) | 0.46 (0.34–0.62) | 1141 | 3.53 (3.02–4.12) | 0.37 (0.27–0.52) |
| Pancreatic cancer† | 286 | 1.82 (1.34–2.47) | 0.24 (0.11–0.51) | 305 | 4.13 (3.13–5.45) | 0.35 (0.18–0.70) |
| Prostate cancer† | 197 | 1.74 (1.11–2.74) | 0.87 (0.35–2.14) | 211 | 3.41 (2.28–5.10) | 0.61 (0.18–2.11) |
| Renal cell carcinoma† | 161 | 1.91 (1.22–3.00) | 0.69 (0.26–1.85) | 173 | 2.95 (2.04–4.26) | 0.68 (0.40–1.15) |
| Urothelial carcinoma† | 185 | 2.04 (1.35–3.09) | 0.23 (0.07–0.73) | 186 | 3.85 (2.46–6.03) | 0.57 (0.30–1.09) |
| All patients with unlabeled cancer types* | 1241 | 1.84 (1.56–2.16) | 0.27 (0.18–0.41) | 1267 | 4.68 (3.99–5.48) | 0.46 (0.30–0.70) |
| Endometrial | 107 | 1.52 (0.88–2.63) | 0.21 (0.06–0.87) | 107 | 6.47 (3.71–11.29) | 0.68 (0.17–2.70) |
| Gastroesophageal cancer | 364 | 2.05 (1.57–2.69) | 0.14 (0.06–0.35) | 368 | 5.71 (4.36–7.48) | 0.22 (0.08–0.61) |
| Head and neck cancer | 91 | 1.91 (1.06–3.49) | 0.39 (0.13–1.24) | 96 | 2.84 (1.68–4.80) | 0.51 (0.16–1.61) |
| Leiomyosarcoma | 72 | 1.09 (0.57–2.10) | 0.14 (0.02–1.17) | 72 | ‡ | ‡ |
| Melanoma | 289 | 3.58 (2.41–5.31) | 0.60 (0.28–1.27) | 303 | 5.69 (3.96–8.16) | 0.85 (0.44–1.62) |
| Ovarian cancer | 318 | 1.13 (0.82–1.55) | 0.26 (0.12–0.60) | 321 | 3.16 (2.34–4.27) | 0.58 (0.24–1.41) |
Hazard ratios derived from multivariable models including all PRISSMM imaging outcomes, which were treated as time-varying covariates. Hazard ratios therefore capture the differential mortality risks associated with time periods following either cancer progression/worsening or cancer response/improvement. Only time following genomic testing was treated as at risk for mortality, since genomic testing was a cohort eligibility criterion. Automated annotations were derived from NLP models using a cross-validation approach, such that inference for each patient was performed using models that excluded that patient from training.
*Analysis also adjusted for cancer type.
†Labeled cancer type (labeled data for this individual cancer type were used to train NLP models).
‡Insufficient data for stable regression estimates.
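The table notes describe outcomes treated as time-varying covariates, with at-risk time starting at genomic testing. One common way to prepare data for such a model is the counting-process format, where each patient's follow-up is split into intervals with a constant covariate value. The sketch below assumes a simplified covariate that switches on at the first flagged progression and stays on; the function name, argument names, and time units are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Optional, Tuple

# (start, stop, progressed, died) for one at-risk interval.
Interval = Tuple[float, float, int, int]


def split_follow_up(
    entry: float,
    end: float,
    died: bool,
    progression: Optional[float],
) -> List[Interval]:
    """Split one patient's at-risk time at the first flagged progression.

    entry:       time of genomic testing (at-risk time starts here)
    end:         time of death or last follow-up
    died:        True if follow-up ended in death
    progression: time of first NLP-flagged progression, or None
    """
    if end <= entry:
        return []  # no at-risk time after genomic testing
    event = int(died)
    if progression is None or progression >= end:
        return [(entry, end, 0, event)]
    if progression <= entry:
        # Progression preceded cohort entry: covariate is already on.
        return [(entry, end, 1, event)]
    return [(entry, progression, 0, 0), (progression, end, 1, event)]


# Hypothetical patient: tested at month 3, progressed at month 9, died at month 20.
rows = split_follow_up(entry=3.0, end=20.0, died=True, progression=9.0)
print(rows)  # [(3.0, 9.0, 0, 0), (9.0, 20.0, 1, 1)]
```

Rows in this shape can be passed to a Cox model that accepts start and stop times, so the hazard ratio compares mortality risk during post-progression intervals with risk during pre-progression intervals.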
Fig. 1 Example of a clinico-genomic analysis based on outcomes ascertained using natural language processing models: Association between TMB and progression-free survival after initiation of immunotherapy.
High tumor mutational burden was defined as ≥20 mutations per megabase. Results in this figure represent unadjusted Kaplan-Meier curves. Events were recorded using the “PFS-I-and-M” endpoint, defined as the earlier of death, or the time by which both a medical oncologist note and an imaging report had described cancer progression/worsening. Progression/worsening was defined using natural language processing models applied to imaging reports and medical oncologist notes. Survival curves were not adjusted for left truncation, since progression events were possible prior to genomic testing and cohort eligibility.
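The "PFS-I-and-M" endpoint in the legend can be read as: take the later of the two progression times (so that both sources have described progression), then take the earlier of that time and death. The sketch below follows that reading; the function and argument names are hypothetical, times are assumed to be measured from immunotherapy initiation, and patients without an event are assumed to be censored at last follow-up.

```python
from typing import Optional, Tuple

TMB_HIGH_THRESHOLD = 20.0  # mutations per megabase, from the figure legend


def is_tmb_high(tmb: float) -> bool:
    """High TMB is defined in the legend as >= 20 mutations per megabase."""
    return tmb >= TMB_HIGH_THRESHOLD


def pfs_i_and_m(
    death: Optional[float],
    imaging_progression: Optional[float],
    note_progression: Optional[float],
    last_follow_up: float,
) -> Tuple[float, int]:
    """Return (time, event) for the PFS-I-and-M endpoint.

    event is 1 for progression or death, 0 for censoring (an assumption:
    the legend does not spell out the censoring rule).
    """
    candidates = []
    if imaging_progression is not None and note_progression is not None:
        # Both sources must describe progression, so the later time counts.
        candidates.append(max(imaging_progression, note_progression))
    if death is not None:
        candidates.append(death)
    if candidates:
        return min(candidates), 1
    return last_follow_up, 0


# Hypothetical patients, times in months.
print(pfs_i_and_m(30.0, 6.0, 10.0, 30.0))  # (10.0, 1): both sources by month 10
print(pfs_i_and_m(None, 6.0, None, 18.0))  # (18.0, 0): imaging only, censored
print(pfs_i_and_m(8.0, 6.0, None, 8.0))    # (8.0, 1): death comes first
```

Requiring agreement between imaging reports and oncologist notes makes the endpoint more conservative than either source alone, since a single flagged document does not count as an event.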