Mohammad R Mohebian, Hamid R Marateb, Marjan Mansourian, Miguel Angel Mañanas, Fariborz Mokarian.
Abstract
Cancer is a collection of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Breast cancer is the second leading cause of cancer death among women. This manuscript presents a method for predicting 5-year breast cancer recurrence. Clinicopathologic characteristics of 579 breast cancer patients (recurrence prevalence of 19.3%) were analyzed, and discriminative features were selected using statistical feature selection methods. These features were further refined by Particle Swarm Optimization (PSO) and used as inputs to an ensemble-learning classifier (Bagged Decision Tree, BDT). The PSO algorithm identified the proper combination of selected categorical features and the weight (importance) of the selected interval-measurement-scale features. The performance of HPBCR (hybrid predictor of breast cancer recurrence) was assessed using holdout and 4-fold cross-validation. Three other classifiers, namely support vector machine (SVM), decision tree (DT), and multilayer perceptron (MLP) neural network, were used for comparison. The selected features were diagnosis age, tumor size, lymph node involvement ratio, number of involved axillary lymph nodes, progesterone receptor expression, hormone therapy, and type of surgery. The minimum sensitivity, specificity, precision and accuracy of HPBCR were 77%, 93%, 95% and 85%, respectively, across all cross-validation folds and the hold-out test fold. HPBCR outperformed the other tested classifiers. It showed excellent agreement with the gold standard (i.e. the oncologist's opinion after blood tumor marker and imaging tests, and tissue biopsy). This algorithm is thus a promising online tool for the prediction of breast cancer recurrence.
Keywords: Breast cancer; CAD, computer-aided diagnosis; Cancer recurrence; Computer-assisted diagnosis; DT, decision tree; FH, family history of cancer; HPBCR, the proposed hybrid predictor of breast cancer recurrence; HRT, hormone therapy; I. Node, number of involved axillary lymph nodes; Machine learning; NR, lymph node involvement ratio; Prognosis; T. Node, number of dissected axillary lymph nodes; TS, tumor size; XRT, radiotherapy
Year: 2016 PMID: 28018557 PMCID: PMC5173316 DOI: 10.1016/j.csbj.2016.11.004
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. The structure of the proposed prognosis system (HPBCR). Other classifiers such as SVM and MLP could be used instead of BDT. The pseudo-code of HPBCR is provided in the Supplementary material S3. The input features in HPBCR were: diagnosis age, nodal ratio, menarche age, number of pregnancies, tumor size, Ki67, and the numbers of involved and dissected nodes as interval features; type of surgery, molecular subtypes, family history of cancer, multifocal tumor, estrogen and progesterone receptor status, p53 mutation, Her2 expression, Cathepsin-D protein status, hormone therapy, and radiotherapy as nominal features; and tumor grade as an ordinal feature. Briefly, the input features are first screened using statistical feature selection. The selected features are then used by the Bagged Decision Tree to build the classifier. The optimal feature set and the feature weights are estimated using Particle Swarm Optimization during learning. The algorithm stops if no significant improvement is seen in the objective function or the maximum number of iterations (set to 100 in our study) is reached.
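The optimization loop and its two stopping rules can be illustrated with a minimal binary PSO sketch. This is not the authors' implementation (their pseudo-code is in Supplementary material S3): it searches only over 0/1 feature masks and omits the feature weights, and the function name `binary_pso`, the swarm constants, the `patience` and `tol` parameters, and the toy fitness function are all assumptions made for illustration. In HPBCR the fitness would be the F-score of the bagged decision tree on the training set.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, max_iter=100,
               patience=10, tol=1e-4, seed=0):
    """Maximise fitness(mask) over 0/1 feature masks with a basic binary PSO.

    Stops after max_iter iterations, or earlier when the best score has not
    improved by more than tol for patience consecutive iterations.
    """
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration constants (assumed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[g][:], pbest_score[g]
    stale, iteration = 0, 0
    for iteration in range(1, max_iter + 1):
        improved = False
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # The sigmoid of the velocity is the probability of keeping feature d.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
            if score > gbest_score + tol:
                gbest, gbest_score = pos[i][:], score
                improved = True
        stale = 0 if improved else stale + 1
        if stale >= patience:  # no significant improvement: stop early
            break
    return gbest, gbest_score, iteration


# Toy fitness (hypothetical): fraction of positions matching a made-up target mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(int(m == t) for m, t in zip(mask, target)) / len(target)


best_mask, best_score, n_iter = binary_pso(toy_fitness, n_features=len(target))
print(best_mask, best_score, n_iter)
```

Setting `patience` to a value larger than `max_iter` leaves the iteration cap as the only termination criterion.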
The classification performance measures used in our study.
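| $\mathrm{Se} = \dfrac{TP}{TP + FN}$ | $\mathrm{Sp} = \dfrac{TN}{TN + FP}$ | $\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$ |
| $\mathrm{Pr} = \dfrac{TP}{TP + FP}$ | $\mathrm{Alpha} = 1 - \mathrm{Sp}$ | $\mathrm{Beta} = 1 - \mathrm{Se}$ |
| $\text{F-score} = \dfrac{2 \cdot \mathrm{Pr} \cdot \mathrm{Se}}{\mathrm{Pr} + \mathrm{Se}}$ | $\mathrm{AUC} = \dfrac{\mathrm{Se} + \mathrm{Sp}}{2}$ | |
| $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ | ||
| $\mathrm{DP} = \dfrac{\sqrt{3}}{\pi}\left[\log\dfrac{\mathrm{Se}}{1 - \mathrm{Se}} + \log\dfrac{\mathrm{Sp}}{1 - \mathrm{Sp}}\right]$ | ||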
True positive (TP): subjects with cancer recurrence, correctly identified; false positive (FP): subjects without recurrence, incorrectly identified; true negative (TN): subjects without recurrence, correctly identified; false negative (FN): subjects with recurrence, incorrectly identified; Se: sensitivity = Power; Sp: specificity; Acc: accuracy; Pr: precision; AUC: area under the receiver operating characteristic (ROC) curve; LR: likelihood ratio; DOR: diagnostic odds ratio; MCC: Matthews correlation coefficient; DP: discriminant power.
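As a rough illustration of how these measures follow from the four confusion-matrix counts, the sketch below computes them with their standard definitions. The function name and the example counts are hypothetical; the counts were chosen only so that Se = 81% and Sp = 98%. Two points are assumptions rather than statements from the text: the AUC is taken as the single-operating-point value (Se + Sp)/2, and the discriminant power uses a base-10 logarithm, which is what the reported DOR and DP values appear to be consistent with.

```python
import math


def classification_measures(tp, fp, tn, fn):
    """Compute the listed performance measures from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity (power)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    pr = tp / (tp + fp)                    # precision
    f_score = 2 * pr * se / (pr + se)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    dor = (tp * tn) / (fp * fn)            # diagnostic odds ratio
    # Assumption: base-10 logarithm, inferred from the reported values.
    dp = (math.sqrt(3) / math.pi) * math.log10(dor)
    return {
        "Se": se, "Sp": sp, "Acc": acc, "Pr": pr, "F-score": f_score,
        "Alpha": 1 - sp, "Beta": 1 - se,
        "AUC": (se + sp) / 2,              # single operating point (assumed)
        "MCC": mcc, "DOR": dor, "DP": dp,
    }


# Hypothetical counts: 100 recurrent and 100 non-recurrent test subjects.
measures = classification_measures(tp=81, fp=2, tn=98, fn=19)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The function divides by `fp * fn`, so a classifier with no false positives or no false negatives would need separate handling for DOR and DP.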
Supplementary material S5. A snapshot of the developed online HPBCR tool. The information of a subject with breast cancer was entered. Based on the prediction, the subject is likely to have a recurrence within 5 years after diagnosis.
Comparison of clinical and biochemical features (with interval measurement scale) of included subjects with/without cancer recurrence in μ ± σ [min, max].
| Variable | With recurrence | Without recurrence |
|---|---|---|
| Age | 45.4 ± 10.1 (ND) | 47.2 ± 9.9 |
| NR | 0.47 ± 0.36 | 0.24 ± 0.32 |
| Menarche | 13.3 ± 1.3 | 13.4 ± 1.4 |
| No. Preg | 3.5 ± 1.9 | 3.7 ± 2.1 |
| TS | 4.0 ± 1.7 (ND) | 3.7 ± 1.8 |
| Ki67 | 21.1 ± 12.6 | 18.0 ± 11.8 |
| I. Node | 6.6 ± 7.3 | 3.0 ± 4.9 |
| T. Node | 12.8 ± 6.1 | 11.4 ± 5.4 |
| No. Chemo | 7.7 ± 0.9 | 7.4 ± 1.3 |
Age (age at diagnosis); NR (lymph node involvement ratio); Menarche (menarche age); No. Preg (number of pregnancies); TS (tumor size); Ki67 (Ki67 proliferation marker); I. Node: number of involved axillary lymph nodes; T. Node: number of dissected axillary lymph nodes; No. Chemo: number of chemotherapy; ND: normally distributed. The sample size was 579 and the recurrence prevalence was 19.3%.
Statistically significant features (P < 0.05).
Comparison of clinical and biochemical features (with nominal measurement scale) of included subjects with/without cancer recurrence in percentage.
| Variable | With recurrence | Without recurrence |
|---|---|---|
| FH | P: 18 | P: 22 |
| Multifocal | 33 | 27 |
| ER | 50 | |
| PR | 50 | |
| P53 | 42 | 34 |
| Surgery | ||
| Her2 | ||
| Cathepsin | ||
| HRT | ||
| XRT | ||
| Subtypes | LA: 17 | 14 |
FH (family history of cancer); Multifocal (having more than one tumor in the breast); ER (estrogen receptor); PR (progesterone receptor); P53 (tumor protein 53); Surgery: type of surgery (MRM: modified radical mastectomy, BCS: breast-conserving surgery, Mast: mastectomy); Her2 (human epidermal growth factor receptor 2); Cathepsin (Cathepsin-D); HRT (hormone therapy); XRT (radiotherapy); Subtypes: cancer molecular subtypes (LA: luminal A, LB: luminal B, HLB: HER2-positive luminal B, NLH: non-luminal Her2, 3N: triple negative). For binary variables, the characteristics are shown as positive and negative percentages (i.e. a relative frequency table), and the mode is underlined. The sample size was 579 and the recurrence prevalence was 19.3%.
Statistically significant features (P < 0.05).
Comparison of the tumor “histological grade” feature (with ordinal measurement scales) of included subjects with/without cancer recurrence.
| Grade categories | With recurrence | Without recurrence |
|---|---|---|
| 1 | 22 | 10 |
| 2 | ||
| 3 | 33 | 29 |
| 4 | 1 | 2 |
The characteristics are shown as percentages (i.e. a relative frequency table), and the mode (the most frequent item) is underlined in each category (recurrent or non-recurrent patient groups). Tumor grade did not differ significantly between subjects with and without recurrence. The sample size was 579 and the recurrence prevalence was 19.3%.
Fig. 2. The value of the fitness function (the F-score of the proposed classifier, HPBCR, on the training set; solid line) and the F-score on the test set (dash-dot line) during the optimization procedure. In this plot, the only termination criterion was the maximum number of iterations (i.e. 100).
The holdout performance estimate (%) of the selected classifiers.
| Method | Se | Sp | Acc | Pr | F-score % | Alpha | Beta | AUC | MCC | DOR | DP | Kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HPBCR | 81 | 98 | 90 | 97 | 89 | 0.02 | 0.19 | 0.90 | 0.81 | 208.9 | 1.28 | 0.83 |
| SVM | 67 | 88 | 78 | 84 | 75 | 0.12 | 0.33 | 0.77 | 0.57 | 14.9 | 0.65 | 0.73 |
| Decision tree | 75 | 78 | 77 | 79 | 77 | 0.22 | 0.25 | 0.76 | 0.58 | 10.6 | 0.57 | 0.72 |
| MLP | 69 | 81 | 76 | 79 | 73 | 0.19 | 0.31 | 0.75 | 0.52 | 10.63 | 0.57 | 0.71 |
Se: sensitivity; Sp: specificity; Acc: accuracy; Pr: precision; AUC: area under the curve; MCC: Matthews correlation coefficient; DOR: diagnostic odds ratio; DP: discriminant power; SVM: support vector machine; MLP: multilayer perceptron artificial neural network. Seventy percent of the data was used for training and the rest for validation.
The performance of the classifiers in percent based on 4-fold cross-validation (mean ± SD) [min, max].
| Method | Sensitivity | Specificity | Accuracy | Precision |
|---|---|---|---|---|
| Proposed method | 80.0 ± 0.3 | 96.1 ± 1.0 | 89.2 ± 0.6 | 96.5 ± 1.0 |
| SVM | 73.0 ± 1.0 | 85.3 ± 1.2 | 77.6 ± 0.4 | 82.1 ± 0.3 |
| Decision tree | 74.0 ± 0.9 | 77.0 ± 0.8 | 77.1 ± 0.2 | 78.1 ± 0.6 |
| MLP | 67.5 ± 0.2 | 84.3 ± 1.0 | 76.0 ± 0.3 | 84.2 ± 1.8 |
SVM: support vector machine; MLP: multilayer perceptron artificial neural network.
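A 4-fold split of the cohort can be sketched as follows. Whether the original folds were stratified by recurrence status is not stated, so stratification is an assumption here, as are the function name and the seed. The label vector is also illustrative: 112 recurrent subjects out of 579 approximates the reported 19.3% prevalence.

```python
import random


def stratified_k_fold(labels, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class round-robin into the folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Illustrative labels: 1 = recurrence, 0 = no recurrence (about 19.3% positive).
labels = [1] * 112 + [0] * 467
for fold, (train_idx, test_idx) in enumerate(stratified_k_fold(labels), start=1):
    n_pos = sum(labels[i] for i in test_idx)
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"recurrent in test={n_pos}")
```

Each subject appears in exactly one test fold, so per-fold scores can be summarized as mean ± SD as in the table.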
The overall confusion matrix of HPBCR on the test set.
| Total population | Condition positive (as determined by “gold standard”) | Condition negative (as determined by “gold standard”) |
|---|---|---|
| Test outcome positive | True positive | False positive |
| Test outcome negative | False negative | True negative |
Gold standard: recurrent cancer; test outcome: the decision of the proposed diagnosis system (HPBCR). The classifier was trained on the training set (70% of the samples) and its performance was shown on the test set.
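The four cells of such a confusion matrix can be tallied from paired gold-standard and predicted labels, as in this minimal sketch. The function name and the six example labels are hypothetical and are not study data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary test against the gold standard."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # recurrence, correctly identified
            else:
                fp += 1  # no recurrence, flagged as recurrence
        else:
            if truth == positive:
                fn += 1  # recurrence, missed
            else:
                tn += 1  # no recurrence, correctly identified
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels: 1 = recurrence, 0 = no recurrence.
gold = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(gold, predicted))
```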
The holdout performance estimate (%) of the prognosis system with modifications.
| Scenario | Se | Sp | Acc | Pr | F-score % | Alpha | Beta | AUC | MCC | DOR | DP | Kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 81 | 98 | 90 | 97 | 89 | 0.02 | 0.19 | 0.90 | 0.81 | 208.9 | 1.28 | 0.83 |
| 1 | 78 | 81 | 78 | 93 | 86 | 0.18 | 0.21 | 0.79 | 0.51 | 15.1 | 0.65 | 0.78 |
| 2 | 81 | 98 | 90 | 96 | 89 | 0.02 | 0.19 | 0.90 | 0.81 | 208.9 | 1.28 | 0.83 |
| 3 | 80 | 96 | 88 | 97 | 87 | 0.04 | 0.2 | 0.88 | 0.79 | 116.9 | 1.14 | 0.81 |
Baseline: the originally developed HPBCR; Scenario 1: excluding PSO from the algorithm; Scenario 2: using an SVM classifier instead of BDT; Scenario 3: using an MLP classifier instead of BDT. Se: sensitivity; Sp: specificity; Acc: accuracy; Pr: precision; AUC: area under the curve; MCC: Matthews correlation coefficient; DOR: diagnostic odds ratio; DP: discriminant power; SVM: support vector machine; MLP: multilayer perceptron artificial neural network. Seventy percent of the data was used for training and the rest for validation.