Zexian Zeng1, Sasa Espino2, Ankita Roy2, Xiaoyu Li3, Seema A Khan2, Susan E Clare2, Xia Jiang4, Richard Neapolitan1, Yuan Luo5.
Abstract
BACKGROUND: Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. Developing a model using natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of a manual chart review.
Keywords: Breast cancer local recurrence; EHR; NLP; SVM
Year: 2018 PMID: 30591037 PMCID: PMC6309052 DOI: 10.1186/s12859-018-2466-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Diagram of the workflow of the study. Processing steps are in the circles; narratives, concepts, and features are in the squares. NP represents the number of pathology reports generated at least 120 days after the first primary diagnosis. Pipeline 1 starts with a manual review of a development corpus of 50 randomly selected positive progress notes to build a positive concept set. Pipeline 2 then processes every patient's progress notes. The dashed line indicates that only concepts falling in the positive concept set are retained.
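The filtering step marked by the dashed line can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `filter_concepts` and `build_features`, the use of counts as feature values, and the placeholder CUI `C0000000` are all hypothetical; the other two CUIs are borrowed from the top-ranked feature table.

```python
from collections import Counter

def filter_concepts(note_concepts, positive_concepts):
    """Keep only the concepts that appear in the positive concept set."""
    return [cui for cui in note_concepts if cui in positive_concepts]

def build_features(note_concepts, positive_concepts):
    """Count each retained concept (counts as features are an assumption)."""
    return Counter(filter_concepts(note_concepts, positive_concepts))

# Pipeline 1 (assumed output): concept set built from the development corpus.
positive_concepts = {"C0278493", "C0920420"}

# Pipeline 2 (hypothetical input): concepts extracted from one patient's notes.
# "C0000000" is a made-up placeholder for a concept outside the positive set.
note_concepts = ["C0278493", "C0000000", "C0920420", "C0278493"]

features = build_features(note_concepts, positive_concepts)
```

Here the placeholder concept is dropped, and only the two concepts from the positive set contribute to the feature vector.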
Positive and negative examples of partial sentences indicating local recurrences
| Category | Partial sentence |
|---|---|
| Positive | Now with newly diagnosed DCIS recurrence |
| Positive | She initially received breast-conserving therapy with radiotherapy for a right breast cancer in 1990 and now she presents with a right-sided pT1b N0 (SN) infiltrating lobular carcinoma. |
| Positive | She was found to have an ipsilateral breast tumor recurrence |
| Positive | Female with a history of left breast stage IIIC infiltrating ductal carcinoma who was treated with breast conserving surgery and adjuvant chemo radiation in 2010 who was recently diagnosed with a cancer in the ipsilateral breast |
| Positive | Is currently receiving chemotherapy for her recurrent breast cancer |
| Negative | We recommended the patient undergo adjuvant radiation therapy with the goal of decreasing local regional recurrence and possibly increasing the overall long-term survival |
| Negative | Carefully explained to her that removing the right breast would not change her overall survival and had minimal impact on her recurrence |
| Negative | She presents with recurrence of depressive symptoms associated with new breast cancer diagnosis |
| Negative | Pt very concerned and anxious as some of her friends have been diagnosed with recurrent breast cancer. |
| Negative | Despite this stressor, and the attendant emotional strain, she appears to be coping well at present and is dealing well with fear of recurrence and medical issues |
Fig. 2 Constructing features using one partial sentence. Green circles are generated features for the model
Training cohort and test cohort distributions
| Set | Total | Local Recurrence | Percentage (%) | Overall percentage (%) |
|---|---|---|---|---|
| Double annotation set | 701 | 193 | 27.53 | 8.13 |
| Cross-validation set | 490 | 143 | 29.18 | |
| Held-out test set | 211 | 50 | 23.70 | |
| Single annotation set | 6198 | 368 | 5.94 | |
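The percentages in this table can be recomputed from the counts. The short check below is only an arithmetic sketch; the helper name `percent` is hypothetical, and reading the overall percentage as the pooled rate across the double and single annotation sets is an assumption that the numbers happen to support.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)

# Counts copied from the table: (total patients, local recurrences).
double_total, double_pos = 701, 193
cv_total, cv_pos = 490, 143
test_total, test_pos = 211, 50
single_total, single_pos = 6198, 368

double_pct = percent(double_pos, double_total)    # double annotation set
cv_pct = percent(cv_pos, cv_total)                # cross-validation set
test_pct = percent(test_pos, test_total)          # held-out test set
single_pct = percent(single_pos, single_total)    # single annotation set

# Assumed definition: recurrences pooled over both annotation sets.
overall_pct = percent(double_pos + single_pos, double_total + single_total)
```

The cross-validation and held-out test sets together account for the full double annotation set (490 + 211 = 701 patients, 143 + 50 = 193 recurrences).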
Fig. 3 Histogram plot for number of pathology reports. (a) Frequency distribution of subjects with and without local recurrence; (b) density distribution of subjects with and without local recurrence
Cross-validation results using different methods
| Methods | P (SD) | R (SD) | F (SD) | AUC (SD) |
|---|---|---|---|---|
| Filtered MetaMap + Pathology Report Count (4151) | 0.84 (0.04) | | | 0.93 (0.01) |
| Full MetaMap (17897) | 0.80 (0.06) | 0.48 (0.05) | 0.60 (0.05) | 0.83 (0.03) |
| Filtered MetaMap (4150) | 0.82 (0.03) | 0.67 (0.02) | 0.74 (0.02) | 0.90 (0.01) |
| Bag of Words (57612) | 0.69 (0.07) | 0.42 (0.062) | 0.52 (0.06) | 0.78 (0.03) |
The number in parentheses in the first column is the number of features. The numbers in parentheses in the second through fifth columns are standard deviations
Gray shade indicates baseline methods
P stands for precision, R stands for recall, F stands for F-score, AUC stands for area under the receiver operating characteristic curve, and SD is standard deviation
Held-out test results using different methods
| Methods | P | R | F | AUC |
|---|---|---|---|---|
| Filtered MetaMap + Pathology Report Count (4151) | 0.74 | 0.84 | 0.79 | 0.87 |
| Full MetaMap (17897) | 0.66 | 0.34 | 0.45 | 0.80 |
| Filtered MetaMap (4150) | 0.71 | 0.78 | 0.74 | 0.84 |
| Bag of Words (57612) | 0.53 | 0.43 | 0.48 | 0.74 |
The number in parentheses in the first column is the number of features
Gray shade indicates baseline methods
P stands for precision, R stands for recall, F stands for F-score, AUC stands for area under the receiver operating characteristic curve
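As a worked check on how the F column relates to P and R, the sketch below assumes F is the balanced F1 score (the harmonic mean of precision and recall), which the source does not state explicitly. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)

# Held-out values for Filtered MetaMap + Pathology Report Count.
f_best = f_score(0.74, 0.84)
f_best_rounded = round(f_best, 2)
```

Under that assumption the first three held-out rows reproduce the reported F values. For the Bag of Words row the rounded inputs give 0.47 rather than the reported 0.48, which is presumably a rounding effect, since the table values are themselves rounded.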
The top-ranked features with the corresponding coefficient in the model and the UMLS concept preferred name for the CUIs
| Feature | Coefficient | UMLS Concept Preferred Name |
|---|---|---|
| C0278493 | 0.66 | ‘Recurrent breast cancer’ |
| {C0007124; C0222600; C0222600} | 0.46 | {‘Noninfiltrating Intraductal Carcinoma’; ‘Right breast’; ‘Right breast’} |
| C0920420 | 0.43 | ‘Cancer recurrence’ |
| C1458156 | 0.41 | ‘Recurrent Malignant Neoplasm’ |
| C2945760 | 0.40 | ‘Recurrent’ |
| C0235653 | −0.36 | ‘Malignant neoplasm of female breast’ |
| C0277556 | 0.36 | ‘Recurrent disease’ |
| C1512083 | −0.35 | ‘Ductal’ |
| {C0007124; C0205090; C0262512} | 0.32 | {‘Noninfiltrating Intraductal Carcinoma’; ‘Right’; ‘History of present illness’} |
| C4042789 | 0.30 | ‘Right-Sided Breast Neoplasms’ |
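The rows above appear to be ordered by the absolute value of the coefficient, so that strongly negative features sit alongside strongly positive ones. A minimal sketch of that ranking, using a subset of the table's values, is shown below; the function name `rank_by_magnitude` is hypothetical, and how the authors actually produced the ranking is not stated in the source.

```python
# Subset of (feature, coefficient) pairs copied from the table above.
coefficients = {
    "C4042789": 0.30,
    "C0235653": -0.36,
    "C0278493": 0.66,
    "C1512083": -0.35,
    "C0920420": 0.43,
}

def rank_by_magnitude(coefs):
    """Sort features by |coefficient|, largest first."""
    return sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)

ranked = rank_by_magnitude(coefficients)
top_feature, top_coefficient = ranked[0]
```

A positive coefficient pushes the model toward predicting a local recurrence and a negative one pushes away from it, so ranking by magnitude surfaces the most influential features in either direction.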