Albee Y Ling, Allison W Kurian, Jennifer L Caswell-Jin, George W Sledge, Nigam H Shah, Suzanne R Tamang.
Abstract
OBJECTIVES: Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data.
Keywords: SEER; cancer distant recurrence; electronic medical records; natural language processing; semi-supervised machine learning
Year: 2019 PMID: 32025650 PMCID: PMC6994019 DOI: 10.1093/jamiaopen/ooz040
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Flowchart of Oncoshare Patient Count by Step. SHC: Stanford Health Care; PAMF: Palo Alto Medical Foundation; MBC: metastatic breast cancer.
Recurrent metastatic breast cancer term to concept mappings
| Custom word class | Short description | Example terms |
|---|---|---|
| DRECUR | Distant recurrence | Recurrent metastatic tnbc, distant relapse, distant recurrences, distant metastatic disease involving |
| LRECUR | Local or regional recurrence | Regional recurrence, nodal recurrence, loco-regional failure, locally recur, in-breast recurrence, local recur |
| MBC | Metastatic breast cancer | Widespread metastatic breast cancer, widely metastatic triple, metastatic breast carcinoma, metastatic tnbc |
| MBCLOW | Metastatic breast cancer (low confidence) | Metastic |
| METSBONE | Metastatic disease to the bone | Bone mets, bone metastasis, bone metasteses |
| METSBRAIN | Metastatic disease to the brain | Metastatic disease involving the brain, brain mets, mets to brain, brain metastasis, brain metastases |
| METSLIVER | Metastatic disease to the liver | Liver mets, liver metastasis, liver metastases, hepatic mets, hepatic metastasis, hepatic metastases |
| METSLUNG | Metastatic disease to the lung | Mets to lung, pulm mets, lung mets, mbc pulm, lung metastases |
| METSNOS | Metastatic disease (distant organ not specified) | Widespread metastatic disease, stage4, newly diagnosed metastatic, stage iv |
| RECUR | Recurrence | Recur, rapid recurrence, multiple recurrences, recurrent disease, reoccurrence, reoccurring |
| DIED | Death | Passed away, expired on, deceased |
Example terms are the original spellings found in the clinical notes; misspellings are left in intentionally.
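The term-to-concept mapping in the table above can be illustrated with a small dictionary lookup. The following is a minimal sketch, not the authors' pipeline: the lexicon is a hand-picked subset of the example terms, and the function names `tag_concepts` and `class_counts` are hypothetical. Counting word classes per note is an assumption about how such mappings might be turned into features.

```python
import re
from collections import Counter
from typing import List, Tuple

# Illustrative subset of the example terms in the table; the full lexicon is larger.
# Misspellings ("metastic", "bone metasteses") are kept, as in the clinical notes.
TERM_TO_CLASS = {
    "distant relapse": "DRECUR",
    "distant recurrences": "DRECUR",
    "nodal recurrence": "LRECUR",
    "metastatic breast carcinoma": "MBC",
    "metastic": "MBCLOW",
    "bone mets": "METSBONE",
    "bone metasteses": "METSBONE",
    "brain mets": "METSBRAIN",
    "hepatic metastases": "METSLIVER",
    "pulm mets": "METSLUNG",
    "stage iv": "METSNOS",
    "recurrent disease": "RECUR",
    "passed away": "DIED",
}

# Longest terms first so multi-word phrases win over shorter overlapping ones.
_TERMS = sorted(TERM_TO_CLASS, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in _TERMS) + r")\b")


def tag_concepts(note: str) -> List[Tuple[str, str]]:
    """Return (matched term, word class) pairs in order of appearance."""
    return [(m.group(1), TERM_TO_CLASS[m.group(1)])
            for m in _PATTERN.finditer(note.lower())]


def class_counts(note: str) -> Counter:
    """Count how often each word class occurs in a note (assumed feature form)."""
    return Counter(word_class for _, word_class in tag_concepts(note))


if __name__ == "__main__":
    print(tag_concepts("Pt with bone mets and brain mets; she passed away."))
```

Matching is case-insensitive and bounded by word boundaries, so a low-confidence spelling such as "metastic" is tagged separately from correctly spelled terms.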
Metastatic breast cancer (MBC) case detection results by metastatic breast cancer status
| | Recurrent MBC (stage 0–III at diagnosis) | | De novo MBC (stage IV at diagnosis) | | Non-MBC | |
|---|---|---|---|---|---|---|
| | No. | % | No. | % | No. | % |
| Total | 1302 | 100 | 495 | 100 | 7590 | 100 |
| Age at diagnosis: mean (SD) | 52.99 (13.04) | | 54.61 (13.62) | | 55.36 (12.99) | |
| Year of breast cancer diagnosis | ||||||
| Before 2005 | 526 | 40.40 | 116 | 23.43 | 1738 | 22.90 |
| 2005–2009 | 463 | 35.56 | 189 | 38.18 | 2787 | 36.72 |
| 2010–2015 | 313 | 24.04 | 190 | 38.38 | 3065 | 40.38 |
| Race | ||||||
| White | 1025 | 78.73 | 394 | 79.60 | 5887 | 77.56 |
| Black | 51 | 3.92 | 25 | 5.05 | 229 | 3.02 |
| Asian/Pacific Islander | 206 | 15.82 | 70 | 14.14 | 1371 | 18.06 |
| Other | 4 | 0.31 | 2 | 0.40 | 57 | 0.75 |
| Missing | 16 | 1.23 | 4 | 0.81 | 46 | 0.61 |
| Ethnicity | ||||||
| Hispanic | 129 | 9.91 | 55 | 11.11 | 588 | 7.75 |
| Non-Hispanic | 1170 | 89.86 | 439 | 88.69 | 6968 | 91.81 |
| Missing | 3 | 0.23 | 1 | 0.20 | 34 | 0.45 |
| Neighborhood socioeconomic status | ||||||
| Lowest quintile | 45 | 3.46 | 40 | 8.08 | 342 | 4.51 |
| Second quintile | 121 | 9.29 | 64 | 12.93 | 631 | 8.31 |
| Third quintile | 215 | 16.51 | 81 | 16.36 | 988 | 13.02 |
| Fourth quintile | 254 | 19.51 | 105 | 21.21 | 1433 | 18.88 |
| Highest quintile | 646 | 49.62 | 195 | 39.39 | 4015 | 52.90 |
| Missing | 21 | 1.61 | 10 | 2.02 | 181 | 2.39 |
| Stage | ||||||
| 0 | 72 | 5.53 | 0 | 0.00 | 1813 | 23.87 |
| I | 302 | 23.20 | 0 | 0.00 | 2837 | 37.37 |
| II | 585 | 44.93 | 0 | 0.00 | 2186 | 28.80 |
| III | 307 | 23.58 | 0 | 0.00 | 616 | 8.10 |
| IV | 0 | 0.00 | 495 | 100.00 | 0 | 0.00 |
| Missing | 36 | 2.76 | 0 | 0.00 | 141 | 1.86 |
| Tumor receptor subtype | ||||||
| Estrogen receptor (ER) and/or progesterone receptor (PR)-positive and HER2-negative | 608 | 46.70 | 247 | 49.90 | 3514 | 46.30 |
| HER2-positive | 259 | 19.89 | 117 | 23.64 | 969 | 12.77 |
| Triple-negative | 223 | 17.13 | 63 | 12.73 | 689 | 9.08 |
| Missing | 212 | 16.28 | 68 | 13.74 | 2418 | 31.86 |
| Grade | ||||||
| 1 | 163 | 12.52 | 21 | 4.24 | 1511 | 19.91 |
| 2 | 446 | 34.26 | 167 | 33.74 | 3020 | 39.79 |
| 3 | 580 | 44.55 | 177 | 35.76 | 2400 | 31.62 |
| Missing | 113 | 8.68 | 130 | 26.26 | 659 | 8.68 |
| Histology | ||||||
| Ductal | 1146 | 88.02 | 413 | 83.43 | 6433 | 84.76 |
| Lobular | 117 | 8.99 | 56 | 11.31 | 703 | 9.26 |
| Other | 39 | 3.00 | 26 | 5.25 | 454 | 5.98 |
| Marital status | ||||||
| Single | 202 | 15.52 | 95 | 19.19 | 1133 | 14.93 |
| Married | 865 | 66.44 | 289 | 58.38 | 5096 | 67.14 |
| Divorced | 120 | 9.22 | 47 | 9.49 | 602 | 7.93 |
| Widowed | 81 | 6.22 | 44 | 8.89 | 518 | 6.82 |
| Separated, unmarried, or domestic partner | 20 | 1.54 | 15 | 3.03 | 170 | 2.24 |
| Missing | 14 | 1.07 | 5 | 1.01 | 71 | 0.94 |
| Payer | ||||||
| Not insured | 11 | 0.84 | 3 | 0.61 | 35 | 0.46 |
| Insurance, not otherwise specified | 144 | 11.06 | 45 | 9.09 | 729 | 9.60 |
| Managed care/HMO/PPO | 695 | 53.38 | 241 | 48.69 | 4489 | 59.14 |
| Medicaid | 121 | 9.29 | 60 | 12.12 | 390 | 5.14 |
| Medicare | 267 | 20.51 | 118 | 23.84 | 1610 | 21.21 |
| Others | 21 | 1.61 | 11 | 2.22 | 115 | 1.52 |
| Missing | 43 | 3.30 | 17 | 3.43 | 222 | 2.92 |
Neighborhood socioeconomic status (SES) quintile was assigned based on a previously developed measurement by Yost et al for cases diagnosed from 2000 to 2005, and Shariff-Marco et al for cases diagnosed 2006 to 2015.
Triple negative: estrogen receptor, progesterone receptor and HER2 all negative. HER2 positive: HER2 positive, regardless of estrogen receptor or progesterone receptor status.
Performance of distant labels and classification models using 146 manually reviewed gold standard patients
| Model | Area under the curve (AUC) | Sensitivity | Specificity | PPV | NPV | F-1 score | Accuracy |
|---|---|---|---|---|---|---|---|
| Distant labels | NA | 0.889 (0.818, 0.957) | 0.797 (0.700, 0.889) | 0.810 (0.723, 0.899) | 0.881 (0.797, 0.952) | 0.848 (0.783, 0.908) | 0.842 (0.781, 0.904) |
| Classifier A (CCR) | 0.789 (0.716, 0.861) | 0.542 (0.423, 0.662) | 0.824 (0.727, 0.899) | 0.750 (0.633, 0.863) | 0.649 (0.543, 0.744) | 0.629 (0.521, 0.726) | 0.685 (0.603, 0.760) |
| Classifier B (NLP) | 0.917 (0.868, 0.966) | 0.861 (0.778, 0.933) | 0.878 (0.800, 0.944) | 0.873 (0.794, 0.943) | 0.867 (0.783, 0.936) | 0.867 (0.800, 0.925) | 0.870 (0.815, 0.925) |
| Classifier C (NLP + CCR) | 0.925 (0.880, 0.969) | 0.861 (0.778, 0.933) | 0.878 (0.800, 0.944) | 0.873 (0.794, 0.943) | 0.867 (0.783, 0.936) | 0.867 (0.800, 0.925) | 0.870 (0.815, 0.925) |
Note that positive predictive value (PPV), negative predictive value (NPV), F-1 score, and overall accuracy are highly dependent on the prevalence of the condition, which in our case is 72/146 = 0.49. The actual prevalence of recurrent metastatic breast cancer in our study population is likely to be much lower. However, sensitivity, specificity, and area under the curve (AUC) are intrinsic properties of the classifier and are insensitive to the prevalence of cases.
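The prevalence dependence described in this note follows from Bayes' rule and can be checked numerically. The sketch below takes the reported sensitivity and specificity of classifier B (NLP) and recomputes PPV and NPV at the gold-standard prevalence of 72/146, then at a lower prevalence. The value 0.10 is purely hypothetical and chosen for illustration; the source does not report the population prevalence. The function name `ppv_npv` is likewise an assumption.

```python
def ppv_npv(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence              # true-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Reported point estimates for classifier B (NLP).
SENS, SPEC = 0.861, 0.878

# Prevalence in the 146-patient gold standard set.
ppv_gold, npv_gold = ppv_npv(SENS, SPEC, 72 / 146)

# Hypothetical lower prevalence, for illustration only.
ppv_low, npv_low = ppv_npv(SENS, SPEC, 0.10)

if __name__ == "__main__":
    print(f"prevalence 0.49: PPV={ppv_gold:.3f}, NPV={npv_gold:.3f}")
    print(f"prevalence 0.10: PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
```

At the gold-standard prevalence the function reproduces the tabulated PPV and NPV to rounding. At the hypothetical prevalence of 0.10, PPV falls to about 0.44 while NPV rises above 0.98, even though sensitivity and specificity are unchanged.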
Characteristics of all studied breast cancer patients (N = 11 459) derived from the California Cancer Registry
| Stage at diagnosis | Stage 0 | | Stage I | | Stage II | | Stage III | | Stage IV | | Missing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | No. | % | No. | % | No. | % | No. | % | No. | % | No. | % |
| Total | 2335 | 100 | 3820 | 100 | 3443 | 100 | 1120 | 100 | 495 | 100 | 246 | 100 |
| Age at diagnosis: mean (SD) | 55.31 (11.93) | | 56.68 (12.97) | | 53.26 (13.2) | | 51.77 (12.74) | | 54.61 (13.62) | | 56.71 (15.91) | |
| Year of breast cancer diagnosis | ||||||||||||
| Before 2005 | 586 | 25.10 | 1106 | 28.95 | 1137 | 33.02 | 245 | 21.88 | 116 | 23.43 | 75 | 30.49 |
| 2005–2009 | 1039 | 44.50 | 1396 | 36.54 | 1277 | 37.09 | 474 | 42.32 | 189 | 38.18 | 118 | 47.97 |
| 2010–2015 | 710 | 30.41 | 1318 | 34.50 | 1029 | 29.89 | 401 | 35.80 | 190 | 38.38 | 53 | 21.54 |
| Race | ||||||||||||
| White | 1762 | 75.46 | 3075 | 80.50 | 2699 | 78.39 | 871 | 77.77 | 394 | 79.60 | 191 | 77.64 |
| Black | 66 | 2.83 | 109 | 2.85 | 114 | 3.31 | 53 | 4.73 | 25 | 5.05 | 10 | 4.07 |
| Asian/Pacific Islander | 472 | 20.21 | 599 | 15.68 | 586 | 17.02 | 179 | 15.98 | 70 | 14.14 | 32 | 13.01 |
| Other | 21 | 0.90 | 23 | 0.60 | 17 | 0.49 | 5 | 0.45 | 2 | 0.40 | 11 | 4.47 |
| Missing | 14 | 0.60 | 14 | 0.37 | 27 | 0.78 | 12 | 1.07 | 4 | 0.81 | 2 | 0.81 |
| Ethnicity | ||||||||||||
| Hispanic | 161 | 6.90 | 256 | 6.70 | 327 | 9.50 | 124 | 11.07 | 55 | 11.11 | 29 | 11.79 |
| Non-Hispanic | 2155 | 92.29 | 3553 | 93.01 | 3104 | 90.15 | 995 | 88.84 | 439 | 88.69 | 206 | 83.74 |
| Missing | 19 | 0.81 | 11 | 0.29 | 12 | 0.35 | 1 | 0.09 | 1 | 0.20 | 11 | 4.47 |
| Neighborhood socioeconomic status (SES) | ||||||||||||
| Lowest quintile | 104 | 4.45 | 158 | 4.14 | 151 | 4.39 | 62 | 5.54 | 40 | 8.08 | 25 | 10.16 |
| Second quintile | 188 | 8.05 | 298 | 7.80 | 313 | 9.09 | 122 | 10.89 | 64 | 12.93 | 29 | 11.79 |
| Third quintile | 315 | 13.49 | 489 | 12.80 | 520 | 15.10 | 188 | 16.79 | 81 | 16.36 | 35 | 14.23 |
| Fourth quintile | 468 | 20.04 | 735 | 19.24 | 666 | 19.34 | 230 | 20.54 | 105 | 21.21 | 46 | 18.70 |
| Highest quintile | 1202 | 51.48 | 2062 | 53.98 | 1737 | 50.45 | 496 | 44.29 | 195 | 39.39 | 106 | 43.09 |
| Missing | 58 | 2.48 | 78 | 2.04 | 56 | 1.63 | 22 | 1.96 | 10 | 2.02 | 5 | 2.03 |
| Tumor receptor subtype | ||||||||||||
| Estrogen (ER) and/or progesterone receptor (PR)-positive and HER2-negative | 118 | 5.05 | 2336 | 61.15 | 1800 | 52.28 | 567 | 50.63 | 247 | 49.90 | 66 | 26.83 |
| HER2-positive | 48 | 2.06 | 545 | 14.27 | 651 | 18.91 | 254 | 22.68 | 117 | 23.64 | 22 | 8.94 |
| Triple-negative | 12 | 0.51 | 366 | 9.58 | 556 | 16.15 | 199 | 17.77 | 63 | 12.73 | 18 | 7.32 |
| Missing | 2157 | 92.38 | 573 | 15.00 | 436 | 12.66 | 100 | 8.93 | 68 | 13.74 | 140 | 56.91 |
| Grade | ||||||||||||
| 1 | 209 | 8.95 | 1199 | 31.39 | 470 | 13.65 | 120 | 10.71 | 21 | 4.24 | 31 | 12.60 |
| 2 | 878 | 37.60 | 1551 | 40.60 | 1356 | 39.38 | 387 | 34.55 | 167 | 33.74 | 50 | 20.33 |
| 3 | 865 | 37.04 | 820 | 21.47 | 1420 | 41.24 | 521 | 46.52 | 177 | 35.76 | 64 | 26.02 |
| Missing | 383 | 16.40 | 250 | 6.54 | 197 | 5.72 | 92 | 8.21 | 130 | 26.26 | 101 | 41.06 |
| Histology | ||||||||||||
| Ductal | 1989 | 85.18 | 3296 | 86.28 | 2919 | 84.78 | 923 | 82.41 | 413 | 83.43 | 168 | 68.29 |
| Lobular | 181 | 7.75 | 283 | 7.41 | 341 | 9.90 | 174 | 15.54 | 56 | 11.31 | 12 | 4.88 |
| Other | 165 | 7.07 | 241 | 6.31 | 183 | 5.32 | 23 | 2.05 | 26 | 5.25 | 66 | 26.83 |
| Marital status | ||||||||||||
| Single | 333 | 14.26 | 577 | 15.10 | 512 | 14.87 | 184 | 16.43 | 95 | 19.19 | 41 | 16.67 |
| Married | 1577 | 67.54 | 2555 | 66.88 | 2322 | 67.44 | 747 | 66.70 | 289 | 58.38 | 122 | 49.59 |
| Divorced | 184 | 7.88 | 301 | 7.88 | 293 | 8.51 | 100 | 8.93 | 47 | 9.49 | 27 | 10.98 |
| Widowed | 169 | 7.24 | 286 | 7.49 | 217 | 6.30 | 55 | 4.91 | 44 | 8.89 | 24 | 9.76 |
| Separated, unmarried, or domestic partner | 50 | 2.14 | 66 | 1.73 | 57 | 1.66 | 26 | 2.32 | 15 | 3.03 | 28 | 11.38 |
| Missing | 22 | 0.94 | 35 | 0.92 | 42 | 1.22 | 8 | 0.71 | 5 | 1.01 | 4 | 1.63 |
| Payer | ||||||||||||
| Not insured | 11 | 0.47 | 17 | 0.45 | 22 | 0.64 | 6 | 0.54 | 3 | 0.61 | 5 | 2.03 |
| Insurance, not otherwise specified | 244 | 10.45 | 367 | 9.61 | 362 | 10.51 | 111 | 9.91 | 45 | 9.09 | 20 | 8.13 |
| Managed care/HMO/PPO | 1455 | 62.31 | 2218 | 58.06 | 2031 | 58.99 | 658 | 58.75 | 241 | 48.69 | 111 | 45.12 |
| Medicaid | 90 | 3.85 | 146 | 3.82 | 228 | 6.62 | 128 | 11.43 | 60 | 12.12 | 13 | 5.28 |
| Medicare | 424 | 18.16 | 900 | 23.56 | 639 | 18.56 | 162 | 14.46 | 118 | 23.84 | 61 | 24.80 |
| Others | 41 | 1.76 | 50 | 1.31 | 48 | 1.39 | 13 | 1.16 | 11 | 2.22 | 5 | 2.03 |
| Missing | 70 | 3.00 | 122 | 3.19 | 113 | 3.28 | 42 | 3.75 | 17 | 3.43 | 31 | 12.60 |
Neighborhood socioeconomic status (SES) quintile was assigned based on a previously developed measurement by Yost et al for cases diagnosed from 2000 to 2005, and Shariff-Marco et al for cases diagnosed 2006 to 2015.
Triple negative: estrogen receptor, progesterone receptor and HER2 all negative. HER2 positive: HER2 positive, regardless of estrogen receptor or progesterone receptor status.
Figure 2. Receiver Operating Characteristic Curve (ROC) of Statistical Classifiers Evaluated using the Test Set of 146 Patients. The area under the curve (AUC) of the classifier with CCR and NLP features is 0.925 with 95% confidence interval 0.880–0.969. The AUC of the classifier with CCR features only is 0.789 with 95% confidence interval 0.716–0.861. The AUC of the classifier with NLP features only is 0.917 with 95% confidence interval 0.868–0.966.