Yanfeng Wang1, Yuli Yang1, Junwei Sun1, Lidong Wang2, Xin Song2, Xueke Zhao2.
Abstract
Diagnosing the degree of differentiation of tumor cells helps physicians detect disease in time and choose appropriate treatment for the patient's condition. In this study, the original dataset is clustered into two independent types by the Kohonen clustering algorithm. One type is used as the development sets to find correlated indicators and establish predictive models of differentiation, while the other is used as the validation sets to test those indicators and models. In the development sets, thirteen indicators significantly associated with the degree of differentiation of esophageal squamous cell carcinoma (ESCC) are found by the Kohonen clustering algorithm. These thirteen indicators are used as input features, and the degree of tumor differentiation is used as the output. Ten classification algorithms are used to predict the differentiation of ESCC. Artificial bee colony-support vector machine (ABC-SVM) predicts better than the other nine algorithms, with an average accuracy of 81.5% under 10-fold cross-validation. Based on logistic regression and the ReliefF algorithm, five models with the greater merit for the degree of differentiation are found in the development sets. The AUC values of the five models in the development sets are 0.672, 0.628, 0.630, 0.628, and 0.608 (P < 0.05). The AUC values of the five models in the validation sets are 0.753, 0.728, 0.744, 0.776, and 0.868 (P < 0.0001). The predicted values of the five models are then used as the input features of ABC-SVM. The 10-fold cross-validation accuracy reaches 82.0% and 86.5% in the development sets and the validation sets, respectively.
Keywords: ABC-SVM; ESCC; ROC; ReliefF algorithm; clustering algorithm; degree of differentiation; prediction model
Year: 2020 PMID: 33193745 PMCID: PMC7645151 DOI: 10.3389/fgene.2020.595638
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
The population proportions of the original data sets.
| Genders | Male | 135 | 64% |
| Ages | ≥58 | 68 | 32% |
| Tumor site | Upper chest | 26 | 12% |
Critical threshold for age in the sample sets. Age is used as the test variable, the degree of tumor differentiation is used as the categorical variable, and the ROC curve is drawn. The age cutoff with the largest Youden index, 58, is taken as the critical threshold for the degree of differentiation (P < 0.05; AUC greater than 0.5). The Youden index is given by Equation (14).
Youden index = Sensitivity − (1 − Specificity) (14)
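The cutoff search behind Equation (14) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `youden_threshold`, the decision rule "value >= cutoff means positive", and the toy data are all assumptions made for the example.

```python
def youden_threshold(values, labels):
    """Return (best_cutoff, best_J) for the rule 'value >= cutoff => positive'.

    values: numeric marker values (for example, age).
    labels: 1 for the positive class, 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= cutoff and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= cutoff and y == 0)
        sensitivity = tp / positives
        specificity = 1 - fp / negatives
        # Equation (14): J = Sensitivity - (1 - Specificity)
        j = sensitivity - (1 - specificity)
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data invented for illustration only (not taken from the study).
ages = [45, 50, 52, 55, 58, 60, 63, 67, 70, 72]
positive_class = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
cutoff, j = youden_threshold(ages, positive_class)
```

Every distinct observed value is tried as a cutoff, and the one with the largest J is kept; on real data, ties and the direction of the decision rule would need to be chosen to match the clinical question.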
The original data information.
| Tumor length | 3.873459716 | 4 (1.5~10.5) | 2.548625592 |
| Tumor width | 2.538862559 | 2.5 (1~7) | 0.941911081 |
| Tumor thickness | 1.140758294 | 1 (0.8~5) | 0.31937847 |
| WBC count | 6.484549763 | 6 (2.5~15.3) | 4.841624915 |
| Lymphocyte count | 1.869336493 | 1.8 (0.4~11.7) | 0.849811939 |
| Monocyte count | 0.422985782 | 0.4 (0~1.4) | 0.1666239 |
| Neutrophil count | 3.8907109 | 3.4 (0.5~10.6) | 3.913749492 |
| Eosinophil count | 0.133649289 | 0.1 (0~0.6) | 0.022338524 |
| Basophil count | 0.04985782 | 0 (0~1) | 0.007075694 |
| Red blood cell count | 5.131516588 | 4.56 (2.93~5.75) | 83.44124436 |
| Hemoglobin concentration | 139.2938389 | 140 (95~100) | 250.5799142 |
| Platelets count | 235.6729858 | 226 (100~418) | 5328.859219 |
| Total protein | 71.32701422 | 71 (46~92) | 57.49731438 |
| Albumin | 42.54976303 | 43 (26~79) | 29.04870232 |
| Globulin | 28.87203791 | 28 (17~45) | 31.91211916 |
| PT | 10.15876777 | 10.1 (7~16.6) | 2.333767998 |
| INR | 0.775829384 | 0.77 (0.45~1.64) | 0.027433952 |
| APTT | 37.11374408 | 36.1 (19.7~56.7) | 59.97528639 |
| TT | 15.45118483 | 15.6 (8.3~21.3) | 4.021748589 |
| FIB | 349.3134218 | 344.029 (245.68~710.56) | 4881.231632 |
Here, tumor length, tumor width, and tumor thickness are given in cm. WBC count, lymphocyte count, monocyte count, neutrophil count, eosinophil count, basophil count, red blood cell count, hemoglobin concentration, platelets count, total protein, albumin, and globulin are given in g/L. PT, APTT, and TT are given in seconds. FIB is given in mg/L. INR is the international normalized ratio, which can be expressed by formula 1. ISI is the international sensitivity index of the measuring reagent.
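$$\mathrm{INR} = \left(\frac{\mathrm{PT}_{\text{test}}}{\mathrm{PT}_{\text{normal}}}\right)^{\mathrm{ISI}} \qquad (1)$$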
Figure 1. Kohonen neural network with a 36–21 structure. η is the learning rate, k denotes the k-th node of the output layer, and ω is the connection weight. X is the initial (input) vector and i denotes the i-th node of the input layer.
Figure 2. Flow diagram of the Kohonen neural network algorithm.
Numbers of samples in the development sets and validation sets.
| Poorly differentiated | 37 | 16 | 28 | 14 | 95 |
| Moderate differentiation | 36 | 25 | 34 | 21 | 116 |
| Total number of samples | 114 | 97 | 211 | ||
Information of 13 indicators that are significantly related to the degree of differentiation in the development sets.
| Lower chest | 2.5 | 3 | 0.5 | 7.4 | 2 | 0.7 | 4.3 | 0.4 | 0 | 4.09 | 10 | 0.75 | Poorly differentiated |
| Middle chest | 3.5 | 3 | 0.5 | 10.4 | 3.9 | 0.7 | 5.6 | 0.1 | 0.1 | 5.02 | 9.6 | 0.71 | Poorly differentiated |
| Lower chest | 4 | 1 | 0.6 | 6 | 3.4 | 0.4 | 2.2 | 0 | 0 | 4.13 | 7.1 | 0.46 | Poorly differentiated |
| Middle chest | 4 | 3 | 1 | 7.1 | 3.5 | 0.3 | 3 | 0.3 | 0 | 4.31 | 11.8 | 0.95 | Moderate differentiation |
| Upper chest | 8 | 5 | 1 | 6.6 | 1.2 | 0.6 | 4.8 | 0 | 0 | 4.24 | 7 | 0.45 | Moderate differentiation |
| Middle chest | 2 | 2 | 1.5 | 7.7 | 2.1 | 0.4 | 5 | 0.2 | 0 | 4.07 | 8.8 | 0.63 | Moderate differentiation |
The prediction results of the degree of differentiation of 10 classification algorithms based on 21 indicators and 13 indicators.
| Development sets | 114 | SVM | 21 | 0.4339 | 57.9 |
| 13 | 0.3941 | 53 | |||
| QDA | 21 | 0.5149 | 57.9 | ||
| 13 | 0.3726 | 57.4 | |||
| CART | 21 | 2.7506 | 54.4 | ||
| 13 | 0.3985 | 59.1 | |||
| LDA | 21 | 1.1738 | 59.6 | ||
| 13 | 0.3570 | 57.4 | |||
| KNN | 21 | 0.4219 | 57.9 | ||
| 13 | 0.3879 | 53.9 | |||
| Ensemble | 21 | 3.1797 | 61.4 | ||
| 13 | 3.3533 | 61.7 | |||
| ELM | 21 | 0.0691 | 58 | ||
| 13 | 0.0121 | 47 | |||
| PSO-SVM | 21 | 134.23 | 58 | ||
| 13 | 109.11 | 59 | |||
| GA-SVM | 21 | 8.4211 | 51 | ||
| 13 | 6.7524 | 50.01 | |||
| ABC-SVM | 21 | 0.9919 | 75 | ||
| Validation sets | 97 | SVM | 21 | 0.3991 | 58.8 |
| 13 | 0.3852 | 54.8 | |||
| QDA | 21 | 0.3802 | 52.6 | ||
| 13 | 0.3642 | 52.3 | |||
| CART | 21 | 0.3444 | 48.5 | ||
| 13 | 0.3609 | 48.7 | |||
| LDA | 21 | 0.3929 | 49.5 | ||
| 13 | 0.3568 | 61.9 | |||
| KNN | 21 | 0.4039 | 60.8 | ||
| 13 | 0.3976 | 53 | |||
| Ensemble | 21 | 3.2612 | 61.9 | ||
| 13 | 3.3079 | 57.4 | |||
| ELM | 21 | 0.0541 | 53 | ||
| 13 | 0.0113 | 53 | |||
| PSO-SVM | 21 | 258.12 | 60 | ||
| 13 | 139.68 | 58.89 | |||
| GA-SVM | 21 | 6.4214 | 58 | ||
| 13 | 4.9353 | 63 | |||
| ABC-SVM | 21 | 0.62332 | 76 | ||
Here, SVM is support vector machine, QDA is quadratic discriminant analysis, CART is classification and regression tree, LDA is linear discriminant analysis, KNN is k-nearest neighbor, Ensemble is ensemble bagged tree, ELM is extreme learning machine, PSO-SVM is particle swarm optimization-support vector machine, GA-SVM is genetic algorithm-support vector machine, and ABC-SVM is artificial bee colony-support vector machine.
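The accuracies in this table are averages over 10-fold cross-validation. The sketch below shows how such an average can be computed. It is an illustration under stated assumptions: `k_fold_accuracy` and `nearest_neighbour` are hypothetical names, the 1-nearest-neighbour classifier is only a stand-in (it is not ABC-SVM or any of the tuned models above), and the data are synthetic.

```python
import random


def k_fold_accuracy(features, labels, classify, k=10, seed=0):
    """Mean accuracy of `classify` over k folds.

    classify(train_x, train_y, test_x) must return a list of predicted labels.
    """
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predictions = classify(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in fold],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


def nearest_neighbour(train_x, train_y, test_x):
    """1-NN with squared Euclidean distance (illustrative stand-in classifier)."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[dists.index(min(dists))]
    return [predict(x) for x in test_x]


# Synthetic, well-separated two-class data (not study data).
rng = random.Random(1)
xs = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(50)]
xs += [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(50)]
ys = [0] * 50 + [1] * 50
accuracy = k_fold_accuracy(xs, ys, nearest_neighbour)
```

Any classifier with the same `classify(train_x, train_y, test_x)` shape can be swapped in, which is how a comparison across several algorithms on identical folds could be organized.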
Figure 3. ROC curves for the five models in the development sets. (A) ROC curve of Model 1. (B) ROC curve of Model 2. (C) ROC curve of Model 3. (D) ROC curve of Model 4. (E) ROC curve of Model 5. The ordinate is "Sensitivity" and the abscissa is "1-Specificity." Each curve lies clearly to the upper left of the diagonal and is statistically significant.
Results of ROC curve in the development sets.
| Area under the ROC curve (AUC) | 0.672 | 0.628 | 0.630 | 0.628 | 0.608 |
| Standard Error | 0.0505 | 0.0522 | 0.0522 | 0.0521 | 0.0532 |
| 95% Confidence interval | 0.578 to 0.757 | 0.533 to 0.717 | 0.534 to 0.718 | 0.533 to 0.717 | 0.512 to 0.698 |
| Z statistic | 3.406 | 2.452 | 2.483 | 2.459 | 2.037 |
| Significance level P (Area = 0.5) | |||||
| Youden index J | 0.3498 | 0.2162 | 0.2187 | 0.2088 | 0.2459 |
| Associated criterion | >-17.25294 | >0.98361 | >-0.79807 | >0.732 | >1.9012 |
| Sensitivity | 79.25 | 77.36 | 79.25 | 71.7 | 100 |
| Specificity | 55.74 | 44.26 | 42.62 | 49.18 | 24.59 |
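The Z statistics reported for the development sets appear consistent with the usual test of AUC against 0.5, z = (AUC − 0.5) / SE. The check below is a sketch under that assumption: the AUC values are the rounded ones quoted in the abstract, the standard errors and Z values are copied from the table, and the function name is illustrative.

```python
def auc_z_statistic(auc, standard_error):
    """Z statistic for testing the null hypothesis AUC = 0.5."""
    return (auc - 0.5) / standard_error


# AUC values as quoted in the abstract (rounded to three decimals).
development_auc = [0.672, 0.628, 0.630, 0.628, 0.608]
# Standard errors and Z statistics as listed in the table.
development_se = [0.0505, 0.0522, 0.0522, 0.0521, 0.0532]
reported_z = [3.406, 2.452, 2.483, 2.459, 2.037]

z_values = [
    round(auc_z_statistic(auc, se), 3)
    for auc, se in zip(development_auc, development_se)
]
largest_gap = max(abs(z - r) for z, r in zip(z_values, reported_z))
```

The recomputed values agree with the reported ones only up to the rounding of the published AUCs, so small differences in the third decimal are expected.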
Figure 4. ROC curves for the five models in the validation sets. (A) ROC curve of Model 1. (B) ROC curve of Model 2. (C) ROC curve of Model 3. (D) ROC curve of Model 4. (E) ROC curve of Model 5. The ordinate is "Sensitivity" and the abscissa is "1-Specificity." Each curve lies clearly to the upper left of the diagonal and is statistically significant.
Results of ROC curve in the validation sets.
| Area under the ROC curve (AUC) | 0.753 | 0.728 | 0.744 | 0.776 | 0.868 |
| Standard Error | 0.0505 | 0.053 | 0.0517 | 0.0507 | 0.0409 |
| 95% Confidence interval | 0.655 to 0.835 | 0.628 to 0.814 | 0.645 to 0.827 | 0.680 to 0.855 | 0.784 to 0.928 |
| Z statistic | 4.999 | 4.307 | 4.716 | 5.448 | 8.995 |
| Significance level P (Area = 0.5) | |||||
| Youden index J | 0.4404 | 0.3984 | 0.4593 | 0.527 | 0.6441 |
| Associated criterion | ≤ −12.37098 | ≤ 3.96107 | ≤ 0.92197 | ≤ 1.074 | ≤ 3.4248 |
| Sensitivity | 73.58 | 83.02 | 75.47 | 86.79 | 96.23 |
| Specificity | 70.45 | 56.82 | 70.45 | 65.91 | 68.18 |
ABC-SVM prediction results based on the new features constructed from the five models.
| Development sets | 114 | 21 indicators | 75% |
| 13 indicators | 81.5% | ||
| Validation sets | 97 | 21 indicators | 76% |
| 13 indicators | 80% | ||
Framework of the Kohonen neural network algorithm.
| 1: Data normalization
|
| 2: Randomly set the vector of the initial connection weight value between the mapping layer and the input layer. The initial value η of the learning rate is 0.7, η ∈ (0, 1). The initial neighborhood is set to |
| 3: Input of initial vector |
| 4: Calculate the distance between the weight vector of the mapping layer and the initial vector
|
| 5: Weight learning
|
| 6: Winning neurons are labeled. |
| 7: End |
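The seven steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the formulas and the initial neighborhood are not recoverable from the framework as printed, so min-max normalization, squared Euclidean distance, the update w ← w + η(x − w), a linearly decaying learning rate, and the index-based `radius` parameter are all assumptions. Only the initial learning rate of 0.7 comes from the text.

```python
import random


def min_max_normalize(rows):
    """Step 1: scale every feature column to [0, 1] (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def winner(weights, x):
    """Step 4: index of the node whose weight vector is closest to x."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))


def train_kohonen(rows, n_nodes=2, eta=0.7, radius=0, epochs=50, seed=0):
    """Return the winning-node label of every sample after training."""
    rng = random.Random(seed)
    data = min_max_normalize(rows)
    # Step 2: random initial connection weights, initial learning rate eta.
    weights = [[rng.random() for _ in data[0]] for _ in range(n_nodes)]
    for epoch in range(epochs):
        rate = eta * (1 - epoch / epochs)  # assumed linear decay of eta
        for x in data:  # Step 3: present each input vector
            best = winner(weights, x)
            # Step 5: move the winner (and index neighbours within `radius`,
            # a hypothetical 1-D neighbourhood) towards the input.
            for k in range(max(0, best - radius), min(n_nodes, best + radius + 1)):
                weights[k] = [w + rate * (xi - w) for w, xi in zip(weights[k], x)]
    # Step 6: label every sample with its winning neuron.
    return [winner(weights, x) for x in data]


# Toy data with two obvious groups (not study data).
samples = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
cluster_labels = train_kohonen(samples)
```

With two output nodes the labels split the samples into two groups, which mirrors how the study separates the data into development and validation types; the node indices themselves carry no meaning.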
Framework of ReliefF algorithm.
| 1: Set training data as |
| 2: for |
| 3: Randomly select a sample |
| 4: In the same sample sets of |
| 5: for |
| 6: |
| 7: + |
| 8: end |
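Because most of the symbols in the framework above did not survive, the sketch below follows the commonly described form of ReliefF rather than the authors' exact listing. The function name `relieff`, the parameters `n_iterations` and `k`, the range-normalized absolute difference, and the toy data are assumptions made for illustration.

```python
import random


def relieff(rows, labels, n_iterations=50, k=3, seed=0):
    """Return one ReliefF weight per feature (larger means more relevant)."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    spans = [(max(c) - min(c)) or 1.0 for c in zip(*rows)]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}

    def diff(a, x, y):
        """Range-normalized difference of feature a between samples x and y."""
        return abs(x[a] - y[a]) / spans[a]

    def nearest(i, cls):
        """Up to k nearest neighbours of sample i inside class cls."""
        others = [j for j, y in enumerate(labels) if y == cls and j != i]
        others.sort(key=lambda j: sum(diff(a, rows[i], rows[j])
                                      for a in range(n_features)))
        return others[:k]

    def mean_diff(a, i, neighbours):
        if not neighbours:
            return 0.0
        return sum(diff(a, rows[i], rows[j]) for j in neighbours) / len(neighbours)

    weights = [0.0] * n_features
    for _ in range(n_iterations):
        i = rng.randrange(len(rows))            # randomly selected sample
        hits = nearest(i, labels[i])            # same-class neighbours
        misses = {c: nearest(i, c) for c in priors if c != labels[i]}
        for a in range(n_features):
            penalty = mean_diff(a, i, hits)
            reward = sum(
                priors[c] / (1 - priors[labels[i]]) * mean_diff(a, i, near)
                for c, near in misses.items()
            )
            weights[a] += (reward - penalty) / n_iterations
    return weights


# Toy data: feature 0 separates the classes, feature 1 is noise (not study data).
rows = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.1, 0.6],
        [0.9, 0.2], [1.0, 0.8], [0.8, 0.5], [0.9, 0.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feature_weights = relieff(rows, labels)
```

A feature gains weight when it differs between a sample and its nearest neighbours from other classes and loses weight when it differs among same-class neighbours, so the informative feature in the toy data ends up with the larger weight.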