Pei-Tse Yang1, Wen-Shuo Wu1, Chia-Chun Wu1, Yi-Nuo Shih2, Chung-Ho Hsieh3, Jia-Lien Hsu1.
Abstract
Breast cancer is one of the most common cancers in women worldwide. Thanks to improved medical treatment, most breast cancer patients achieve remission. These patients then face the next challenge: recurrence, which may have more severe effects and can even cause death. Predicting recurrence is therefore crucial for reducing mortality. This paper proposes a prediction model for breast cancer recurrence based on clinical nominal and numeric features. Our data consist of 1,061 patients from the Breast Cancer Registry of Shin Kong Wu Ho-Su Memorial Hospital between 2011 and 2016, of which 37 records are labeled as recurrent. Each record has 85 features. Our approach consists of three stages. First, we apply data preprocessing and feature selection to consolidate the dataset; six features are retained for the following stages. Next, we apply resampling techniques to address class imbalance. Finally, we construct two classifiers, AdaBoost and cost-sensitive learning, to predict the risk of recurrence, and evaluate them with three-fold cross-validation. With AdaBoost alone, we achieve an accuracy of 0.973 and a sensitivity of 0.675. Combining AdaBoost with the cost-sensitive method, we achieve an accuracy of 0.468 and a substantially higher sensitivity of 0.947, so that almost no recurrent case is dismissed. Our model can be used as a supporting tool in scheduling and evaluating follow-up visits, enabling early intervention and more advanced treatment to lower cancer mortality.
Keywords: AdaBoost; classification; cost-sensitive method; machine learning; recurrent breast cancer
Year: 2021 PMID: 34027105 PMCID: PMC8122465 DOI: 10.1515/med-2021-0282
Source DB: PubMed Journal: Open Med (Wars)
The statistics of duration features
| Features | Total |
|---|---|
| Duration of first contact | 1.88 ± 1.31 |
| Duration of initial diagnosis | 0.94 ± 1.28 |
| Duration of first microscopic confirmation | 0.93 ± 1.29 |
| Duration of first course of treatment | 1.17 ± 1.36 |
| Duration of first surgical procedure | 1.13 ± 1.36 |
| Duration of most definite surgical resection of the primary site | 1.13 ± 1.35 |
| Duration of RT (days) | 42.61 ± 6.37 |
| Duration of chemotherapy started at this facility | 1.16 ± 1.37 |
| Duration of hormone/steroid therapy started at this facility | 1.06 ± 1.33 |
| Duration of immunotherapy started at this facility | 0.50 ± 0.71 |
| Duration of HT and EP started at this facility | N/A |
| Duration of target therapy started at this facility | 0.98 ± 1.27 |
*Duration features are in years, except the duration of RT, which is in days.
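The duration features above are elapsed times derived from registry dates. The following is a minimal sketch of such a conversion, not the authors' procedure: the source does not state which reference date the durations are measured against, so the choice of "date of last contact" here is an assumption, as are the 365.25-day year and the invented example dates.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length used for conversion


def duration_years(event: date, reference: date) -> float:
    """Elapsed time from `event` to `reference`, expressed in years."""
    return (reference - event).days / DAYS_PER_YEAR


def duration_days(start: date, end: date) -> int:
    """Elapsed time in whole days, e.g. the length of a radiotherapy course."""
    return (end - start).days


# Hypothetical record: all dates are invented for illustration only.
first_contact = date(2014, 3, 1)
last_contact = date(2016, 3, 1)   # assumed reference date
rt_started = date(2014, 6, 2)
rt_ended = date(2014, 7, 14)

print(round(duration_years(first_contact, last_contact), 2))
print(duration_days(rt_started, rt_ended))
```

Keeping the day-based and year-based helpers separate mirrors the table, where only the duration of RT is reported in days.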
Figure 1. The approach and the process flow of our system architecture.
The confusion matrix
| | Positive prediction | Negative prediction |
|---|---|---|
| Positive actual class | True positive (TP) | False negative (FN) |
| Negative actual class | False positive (FP) | True negative (TN) |
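The evaluation measures reported in the tables that follow can all be derived from the four cells of this confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard metrics computed from confusion-matrix counts."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the positive class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Invented counts for a heavily imbalanced problem (37 positives, 1,024 negatives).
metrics = classification_metrics(tp=30, fn=7, fp=100, tn=924)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With so few positive cases, accuracy is dominated by the negative class, which is why sensitivity and precision are reported alongside it.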
The statistics of selected features by the feature selection algorithm
| Variable | Total | Nonrecurrent | Recurrent |
|---|---|---|---|
| Regional lymph nodes positive | 1.24 ± 3.36 | 1.12 ± 3.21 | 3.53 ± 5.07 |
| Duration of first contact (year) | 1.88 ± 1.31 | 1.83 ± 1.29 | 3.08 ± 1.32 |
| Tumor size (mm) | 26.23 ± 21.8 | 25.34 ± 18.94 | 49.16 ± 54.89 |
| Cancer status | 1,061 (100%) | 1,024 (100%) | 37 (100%) |
| No evidence of the existence of this primary cancer | 753 (71.0%) | 743 (72.6%) | 10 (27.0%) |
| The presence of this primary cancer | 308 (29.0%) | 281 (27.4%) | 27 (73.0%) |
| Response to neoadjuvant therapy | 1,061 (100%) | 1,024 (100%) | 37 (100%) |
| Complete response | 13 (1.2%) | 11 (1.1%) | 2 (5.4%) |
| Moderate response | 2 (0.2%) | 2 (0.2%) | 0 (0.0%) |
| Poor response | 18 (1.7%) | 15 (1.5%) | 3 (8.1%) |
| w/o neoadjuvant therapy | 951 (89.6%) | 928 (90.6%) | 23 (62.2%) |
| w/o response | 44 (4.1%) | 38 (3.7%) | 6 (16.2%) |
| N/A (missing value) | 33 (3.1%) | 30 (2.9%) | 3 (8.1%) |
| Clinical N | 1,061 (100%) | 1,024 (100%) | 37 (100%) |
| NX | 26 (2.5%) | 26 (2.5%) | 0 (0.0%) |
| N0 | 747 (70.4%) | 732 (71.5%) | 15 (40.5%) |
| N1 | 193 (18.2%) | 184 (18.0%) | 9 (24.3%) |
| N2 | 40 (3.8%) | 32 (3.1%) | 8 (21.6%) |
| N2a | 3 (0.3%) | 2 (0.2%) | 1 (2.7%) |
| N3 | 8 (0.8%) | 7 (0.7%) | 1 (2.7%) |
| N3a | 1 (0.1%) | 1 (0.1%) | 0 (0.0%) |
| N3b | 1 (0.1%) | 0 (0.0%) | 1 (2.7%) |
| N3c | 7 (0.7%) | 7 (0.7%) | 0 (0.0%) |
| No suitable definition | 3 (0.3%) | 2 (0.2%) | 1 (2.7%) |
| N/A (missing value) | 32 (3.0%) | 31 (3.0%) | 1 (2.7%) |
*Data are presented as number (%) or mean ± std dev.
Performance of all features vs six-selected features by using AdaBoost
| # of features | Accuracy | Sensitivity | Precision | Specificity | ROC Area | F-measure |
|---|---|---|---|---|---|---|
| All features | 0.972 | 0.352 | 0.700 | 0.994 | 0.760 | 0.610 |
| Six-selected features | 0.969 | 0.137 | 0.917 | 0.999 | 0.912 | 0.238 |
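The abstract states that performance is evaluated with three-fold cross-validation. With only 37 recurrent records among 1,061, how the folds are formed affects how many positive cases each fold contains. The sketch below shows a stratified split, which keeps the class mix roughly equal across folds; the source does not say whether stratification was used, so this is an assumption, and `stratified_folds` is a hypothetical helper.

```python
import random


def stratified_folds(labels, k=3, seed=0):
    """Return k lists of indices, each with roughly the same class mix."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Class sizes from the abstract: 37 recurrent (1) and 1,061 - 37 = 1,024 nonrecurrent (0).
labels = [1] * 37 + [0] * 1024
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

Each fold then holds 12 or 13 recurrent records, so a single misclassified positive shifts per-fold sensitivity by roughly eight percentage points.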
Applying AdaBoost with resampling techniques
| Method | no-R/R | Accuracy | Sensitivity | Precision | Specificity | ROC Area | F-measure |
|---|---|---|---|---|---|---|---|
| w/o resampling techniques | 28:1 | 0.969 | 0.137 | 0.917 | 0.999 | 0.912 | 0.238 |
| SMOTE (2) | 14:1 | 0.968 | 0.222 | 0.421 | 0.995 | 0.911 | 0.291 |
| SMOTE (4) | 7:1 | 0.977 | 0.541 | 0.759 | 0.993 | 0.888 | 0.632 |
| SMOTE (8) | 3.5:1 | 0.974 | 0.622 | 0.686 | 0.986 | 0.889 | 0.652 |
| SMOTE (16) | 1.7:1 | 0.968 | | 0.601 | 0.978 | 0.900 | 0.636 |
| SMOTE (8) w/U-S (15) | 3:1 | 0.973 | 0.675 | 0.640 | 0.983 | 0.890 | 0.657 |
| SMOTE (8) w/U-S (30) | 2.5:1 | 0.970 | | 0.617 | 0.981 | 0.894 | 0.644 |
1 no-R/R: the ratio of nonrecurrent to recurrent records.
2 SMOTE (m): using SMOTE on the minority group by a factor of m times.
3 U-S (n): applying under-sampling to reduce the majority group by n percent.
The bold values are the largest values with respect to the corresponding column.
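The no-R/R column follows arithmetically from the footnotes. The sketch below assumes that SMOTE (m) leaves the minority class at m times its original size and that U-S (n) removes n percent of the majority class; both readings are interpretations of the footnotes rather than statements from the source, and `class_ratio` is a hypothetical helper. The class sizes (1,024 nonrecurrent, 37 recurrent) come from the abstract.

```python
def class_ratio(majority, minority, smote_factor=1, undersample_percent=0):
    """Majority-to-minority ratio after oversampling and under-sampling."""
    minority_after = minority * smote_factor
    majority_after = majority * (100 - undersample_percent) / 100
    return majority_after / minority_after


NONRECURRENT, RECURRENT = 1024, 37

settings = [
    ("w/o resampling", 1, 0),
    ("SMOTE (2)", 2, 0),
    ("SMOTE (4)", 4, 0),
    ("SMOTE (8)", 8, 0),
    ("SMOTE (16)", 16, 0),
    ("SMOTE (8) w/U-S (15)", 8, 15),
    ("SMOTE (8) w/U-S (30)", 8, 30),
]
for name, factor, percent in settings:
    ratio = class_ratio(NONRECURRENT, RECURRENT, factor, percent)
    print(f"{name}: {ratio:.1f}:1")
```

Under these assumptions the computed ratios agree with the reported column to within rounding for most rows; the last setting comes out near 2.4:1 rather than the reported 2.5:1.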
Performance of combining AdaBoost and cost-sensitive methods
| Penalty | Accuracy | Sensitivity | Precision | Specificity | ROC Area | F-measure |
|---|---|---|---|---|---|---|
| 1 | 0.973 | 0.675 | 0.640 | 0.983 | 0.890 | 0.657 |
| 10 | 0.811 | 0.754 | 0.143 | 0.813 | 0.897 | 0.241 |
| 20 | 0.710 | 0.810 | 0.091 | 0.707 | 0.886 | 0.163 |
| 30 | 0.715 | 0.835 | 0.094 | 0.711 | 0.888 | 0.169 |
| 40 | 0.692 | 0.891 | 0.093 | 0.685 | 0.875 | 0.168 |
| 50 | 0.665 | 0.891 | 0.086 | 0.656 | 0.882 | 0.157 |
| 60 | 0.665 | 0.891 | 0.086 | 0.656 | 0.881 | 0.157 |
| 70 | 0.638 | 0.891 | 0.080 | 0.629 | 0.900 | 0.147 |
| 80 | 0.599 | 0.891 | 0.073 | 0.589 | 0.900 | 0.135 |
| 90 | 0.577 | 0.891 | 0.070 | 0.565 | 0.900 | 0.130 |
| 100 | 0.506 | 0.919 | 0.063 | 0.491 | 0.900 | 0.118 |
| 110 | 0.543 | 0.919 | 0.067 | 0.529 | 0.907 | 0.125 |
| 120 | 0.543 | 0.919 | 0.067 | 0.529 | 0.907 | 0.125 |
| 130 | 0.468 | 0.947 | 0.061 | 0.450 | 0.907 | 0.114 |
| 140 | 0.505 | 0.919 | 0.062 | 0.490 | 0.894 | 0.117 |
| 150 | 0.502 | 0.919 | 0.062 | 0.487 | 0.894 | 0.116 |
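The table shows sensitivity rising and precision falling as the penalty grows. One common way a penalty produces this effect is the minimum-expected-cost decision rule sketched below. The source does not specify how the penalty enters the model, so this sketch assumes the penalty is the cost of a false negative relative to a cost of 1 for a false positive; the function names are hypothetical.

```python
def cost_sensitive_label(p_recurrent: float, penalty: float) -> int:
    """Pick the label (1 = recurrent) with the lower expected misclassification cost."""
    cost_if_predict_negative = p_recurrent * penalty        # expected cost of a false negative
    cost_if_predict_positive = (1.0 - p_recurrent) * 1.0    # expected cost of a false positive
    return 1 if cost_if_predict_positive < cost_if_predict_negative else 0


def decision_threshold(penalty: float) -> float:
    """Probability above which the rule predicts recurrence."""
    return 1.0 / (1.0 + penalty)


# A patient with a 10% estimated recurrence probability (invented value).
for penalty in (1, 10, 130):
    print(penalty, round(decision_threshold(penalty), 4), cost_sensitive_label(0.10, penalty))
```

With a penalty of 1 the rule reduces to the usual 0.5 threshold; larger penalties push the threshold toward zero, so more patients are flagged, which raises sensitivity at the expense of precision and accuracy.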
Performance comparison of breast cancer recurrence prediction models
| Method | Accuracy | Sensitivity | Selected features | Dataset size (total/# of recurrence) |
|---|---|---|---|---|
| BCRSVM | 0.846 | 0.890 | Histological grade, local invasion of tumor, no. of tumors, tumor size, LVI, ER, no. of metastatic lymph nodes | 679/195 (29%) |
| SVM | 0.957 | 0.971 | Age at diagnosis, age at menarche, age at menopause, tumor size, LN involvement, grade, nexion (lymph node dissection), HER2 | 547/117 (21%) |
| Bagging | 0.923 | 0.923 | Tumor grade, molecular subtype, cancer focality, LVI, menopause, DCIS type, age, and dimension of primary tumor | 1,475/142 (10%) |
| OneR | 0.901 | 0.901 | | |
| ANN | 0.988 | 0.954 | Surgeon volume, hospital volume, tumor stage | 1,140/225 (20%) |
| SVM | 0.897 | 0.704 | | |
| KPCA-SVM | 0.785 | 0.833 | LN involvement rate, HER2 value, tumor size, tumor margin | 5,471/2,517 (46%) |
| C5.0 | 0.819 | 0.869 | | |
| AdaBoost | 0.973 | 0.675 | Regional lymph nodes positive, duration of first contact, tumor size (mm), cancer status, response to neoadjuvant therapy, clinical N | 1,061/37 (3.5%) |
| AdaBoost + cost-sensitive method | 0.468 | 0.947 | | |
| Feature categories | Features |
|---|---|
| Case confirmation | (1) Sex |
| | (2) Date of birth |
| Cancer confirmation | (3) Age at diagnosis |
| | (4) Sequence number |
| | (5) Date of first contact |
| | (6) Date of initial diagnosis |
| | (7) Primary site |
| | (8) Laterality |
| | (9) Histology |
| | (10) Behavior code |
| | (11) Grade/differentiation |
| | (12) Diagnostic confirmation |
| | (13) Date of first microscopic confirmation |
| | (14) Tumor size |
| | (15) Regional lymph nodes examined |
| | (16) Regional lymph nodes positive |
| Stage of disease at initial diagnosis | (17) Clinical T |
| | (18) Clinical N |
| | (19) Clinical M |
| | (20) Clinical stage group |
| | (21) Clinical stage (prefix/suffix) descriptor |
| Treatment | (22) Date of first course of treatment |
| | (23) Date of first surgical procedure |
| | (24) Date of most definite surgical resection of the primary site |
| | (25) Surgical procedure of primary site at other facility |
| | (26) Surgical procedure of primary site at this facility |
| | (27) Surgical margins of the primary site |
| | (28) Scope of regional lymph node surgery at other facility |
| | (29) Scope of regional lymph node surgery at this facility |
| | (30) Surgical procedure/other site at other facility |
| | (31) Surgical procedure/other site at this facility |
| | (32) Reason for no surgery of primary site |
| | (33) RT target summary |
| | (34) RT technique |
| | (35) Date of RT started |
| | (36) Date of RT ended |
| | (37) Sequence of radiotherapy and surgery |
| | (38) Sequence of locoregional therapy and systemic therapy |
| | (39) Institute of RT |
| | (40) Reasons for no RT |
| | (41) EBRT instruments |
| | (42) Target of CTV_H |
| | (43) Dose to CTV_H (cGy) |
| | (44) Number of fractions to CTV_H |
| | (45) Target of CTV_L |
| | (46) Dose to CTV_L (cGy) |
| | (47) Number of fractions to CTV_L |
| | (48) Other RT technique |
| | (49) Other RT instruments |
| | (50) Target of other RT |
| | (51) Dose to target of other RT |
| | (52) Number of fractions to other RT |
| | (53) Chemotherapy at other facility |
| | (54) Chemotherapy at this facility |
| | (55) Date of chemotherapy started at this facility |
| | (56) Hormone/steroid therapy at other facility |
| | (57) Hormone/steroid therapy at this facility |
| | (58) Date of hormone/steroid therapy started at this facility |
| | (59) Immunotherapy at other facility |
| | (60) Immunotherapy at this facility |
| | (61) Date of immunotherapy started at this facility |
| | (62) Hematologic transplant and endocrine procedure |
| | (63) Date of HT and EP started at this facility |
| | (64) Target therapy at other facility |
| | (65) Target therapy at this facility |
| | (66) Date of target therapy started at this facility |
| | (67) Palliative care at this facility |
| Treatment result | (68) Vital status |
| | (69) Cancer status |
| | (70) Recurrence |
| | (71) Date of last contact or death |
| Breast cancer site-specific factors | (72) Estrogen receptor assay |
| | (73) Progesterone receptor assay |
| | (74) Response to neoadjuvant therapy |
| | (75) No. of sentinel lymph nodes examined |
| | (76) No. of sentinel lymph nodes positive |
| | (77) Nottingham or Bloom-Richardson (BR) score/grade |
| | (78) HER2 (human epidermal growth factor receptor 2) IHC test lab value |
| | (79) Paget disease |
| | (80) Lymph vessels or vascular invasion (LVI) |
| Other factors | (81) Height |
| | (82) Weight |
| | (83) Smoking behavior |
| | (84) Betel nut chewing behavior |
| | (85) Drinking behavior |
| Feature categories | Features |
|---|---|
| Case confirmation | (1) Sex |
| | (2) Age (Date of birth) |
| Cancer confirmation | (3) Age at diagnosis |
| | (4) Sequence number |
| | (5) Duration of first contact |
| | (6) Duration of initial diagnosis |
| | (7) Primary site |
| | (8) Laterality |
| | (9) Histology |
| | (10) Behavior code |
| | (11) Grade/differentiation |
| | (12) Diagnostic confirmation |
| | (13) Duration of first microscopic confirmation |
| | (14) Tumor size |
| | (15) Regional lymph nodes examined |
| | (16) Regional lymph nodes positive |
| Stage of disease at initial diagnosis | (17) Clinical T |
| | (18) Clinical N |
| | (19) Clinical M |
| | (20) Clinical stage group |
| | (21) Clinical stage (prefix/suffix) descriptor |
| Treatment | (22) Duration of first course of treatment |
| | (23) Duration of first surgical procedure |
| | (24) Duration of most definite surgical resection of the primary site |
| | (25) Surgery |
| | (26) Surgical procedure of primary site at other facility |
| | (27) Surgical procedure of primary site at this facility |
| | (28) Surgical margins of the primary site |
| | (29) Scope of regional lymph node surgery at other facility |
| | (30) Scope of regional lymph node surgery at this facility |
| | (31) Surgical procedure/other site at other facility |
| | (32) Surgical procedure/other site at this facility |
| | (33) Reason for no surgery of primary site |
| | (34) RT target summary |
| | (35) RT technique |
| | (36) Duration of RT (days) |
| | (37) RT |
| | (38) Sequence of radiotherapy and surgery |
| | (39) Sequence of locoregional therapy and systemic therapy |
| | (40) Institute of RT |
| | (41) Reasons for no RT |
| | (42) EBRT instruments |
| | (43) Target of CTV_H |
| | (44) Dose to CTV_H (cGy) |
| | (45) Number of fractions to CTV_H |
| | (46) Target of CTV_L |
| | (47) Dose to CTV_L (cGy) |
| | (48) Number of fractions to CTV_L |
| | (49) Other RT technique |
| | (50) Other RT instruments |
| | (51) Target of other RT |
| | (52) Dose to target of other RT |
| | (53) Number of fractions to other RT |
| | (54) Chemotherapy at other facility |
| | (55) Chemotherapy at this facility |
| | (56) Duration of chemotherapy started at this facility |
| | (57) Chemotherapy |
| | (58) Hormone/steroid therapy at other facility |
| | (59) Hormone/steroid therapy at this facility |
| | (60) Duration of hormone/steroid therapy started at this facility |
| | (61) Hormone/steroid therapy |
| | (62) Immunotherapy at other facility |
| | (63) Immunotherapy at this facility |
| | (64) Duration of immunotherapy started at this facility |
| | (65) Immunotherapy |
| | (66) Hematologic transplant and endocrine procedure |
| | (67) Duration of HT and EP started at this facility |
| | (68) Target therapy at other facility |
| | (69) Target therapy at this facility |
| | (70) Duration of target therapy started at this facility |
| | (71) Target therapy |
| | (72) Palliative care at this facility |
| Treatment result | (73) Vital status |
| | (74) Cancer status |
| | (75) Recurrence |
| Breast cancer site-specific factors | (76) Estrogen receptor assay |
| | (77) Progesterone receptor assay |
| | (78) Response to neoadjuvant therapy |
| | (79) No. of sentinel lymph nodes examined |
| | (80) No. of sentinel lymph nodes positive |
| | (81) Nottingham or Bloom-Richardson (BR) score/grade |
| | (82) HER2 (human epidermal growth factor receptor 2) IHC test lab value |
| | (83) Paget disease |
| | (84) Lymph vessels or vascular invasion (LVI) |
| Other factors | (85) BMI |
| | (86) Smoking behavior |
| | (87) Betel nut chewing behavior |
| | (88) Drinking behavior |
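In this consolidated list, a single BMI feature appears in the "Other factors" category. A minimal sketch of the standard body mass index formula follows; the units (kilograms and centimeters) and the function name are assumptions, since the source does not state how height and weight are recorded.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by squared height in meters."""
    height_m = height_cm / 100  # assumption: height is recorded in centimeters
    return weight_kg / (height_m ** 2)


# Invented example values.
print(round(bmi(weight_kg=60, height_cm=160), 2))
```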