Baoshan Ma, Bingjie Chai, Heng Dong, Jishuang Qi, Pengcheng Wang, Tong Xiong, Yi Gong, Di Li, Shuxin Liu, Fengju Song.
Abstract
The potential role of DNA methylation from paracancerous tissues in cancer diagnosis has not been explored until now. In this study, we built classification models using well-known machine learning methods based on DNA methylation profiles of paracancerous tissues. We evaluated our methods on nine cancer datasets collected from The Cancer Genome Atlas (TCGA) and used fivefold cross-validation to assess the performance of the models. Additionally, we performed gene ontology (GO) enrichment analysis on the significant CpG sites selected by the feature importance scores of the XGBoost model, aiming to identify biological pathways involved in cancer progression. We also used the XGBoost algorithm to classify cancer types using DNA methylation profiles of paracancerous tissues in external validation datasets. Comparative experiments suggested that XGBoost achieved better predictive performance than the other four machine learning methods in predicting cancer stage. GO enrichment analysis revealed key pathways involved, highlighting the importance of paracancerous tissues in cancer progression. Furthermore, the XGBoost model can accurately classify nine different cancers from TCGA, and the feature sets selected by XGBoost can also effectively predict seven cancer types on independent GEO datasets. This study provided new insights into cancer diagnosis from an epigenetic perspective and may facilitate the development of personalized diagnosis and treatment strategies.
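The fivefold cross-validation mentioned in the abstract can be illustrated with a minimal sketch in plain Python. Whether the authors stratified their folds by stage is not stated here, so stratification is an assumption of this example, as are the function name `stratified_folds` and the random seed. The label vector is hypothetical and only mirrors the HNSC class sizes (8 early, 42 late).

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal indices round-robin, continuing where the previous class
        # stopped, so that fold sizes stay balanced overall.
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds


# Hypothetical labels with the HNSC class sizes (8 early, 42 late).
labels = ["early"] * 8 + ["late"] * 42
folds = stratified_folds(labels, k=5)

for test_idx in folds:
    # Each fold serves once as the test set; the rest form the training set.
    train_idx = [i for f in folds if f is not test_idx for i in f]
    n_early = sum(labels[i] == "early" for i in test_idx)
    print(len(train_idx), len(test_idx), n_early)
```

With only eight early-stage samples, each test fold contains one or two of them, so per-fold metrics on such a dataset are necessarily coarse.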
Year: 2022 PMID: 35739223 PMCID: PMC9226137 DOI: 10.1038/s41598-022-14786-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Comparison of prediction performance of different classification models on different datasets.
| Cancer type | Model | AUC | ACC | AUPR | MCC | Precision | Recall |
|---|---|---|---|---|---|---|---|
| KIRC | XGBoost | 0.353 | 0.703 | 0.747 | |||
| SVM | 0.764 | 0.650 | 0.827 | 0.298 | 0.700 | ||
| RF | 0.743 | 0.600 | 0.817 | 0.205 | 0.643 | 0.683 | |
| KNN | 0.741 | 0.669 | 0.795 | 0.542 | |||
| NB | 0.674 | 0.656 | 0.795 | 0.350 | 0.747 | 0.631 | |
| BRCA | XGBoost | 0.040 | |||||
| SVM | 0.456 | 0.779 | 0.292 | 0.000 | 0.000 | 0.000 | |
| RF | 0.372 | 0.779 | 0.184 | 0.000 | 0.000 | 0.000 | |
| KNN | 0.505 | 0.726 | 0.263 | -0.051 | 0.050 | 0.050 | |
| NB | 0.432 | 0.632 | 0.207 | -0.126 | 0.133 | 0.080 | |
| THCA | XGBoost | 0.538 | 0.137 | 0.183 | |||
| SVM | 0.773 | 0.566 | 0.202 | 0.250 | |||
| RF | 0.719 | 0.733 | 0.441 | 0.045 | 0.200 | 0.050 | |
| KNN | 0.681 | 0.662 | 0.103 | 0.333 | 0.317 | ||
| NB | 0.620 | 0.679 | 0.515 | ||||
| HNSC | XGBoost | 0.925 | 0.000 | ||||
| SVM | 0.622 | 0.840 | 0.000 | 0.840 | 1.000 | ||
| RF | 0.614 | 0.820 | 0.924 | -0.022 | 0.838 | 0.978 | |
| KNN | 0.603 | 0.840 | 0.917 | 0.000 | 0.840 | 1.000 | |
| NB | 0.500 | 0.840 | 0.920 | 0.000 | 0.840 | 1.000 | |
| KIRP | XGBoost | 0.444 | 0.683 | 0.087 | 0.406 | 0.670 | |
| SVM | 0.541 | 0.444 | 0.680 | 0.404 | |||
| RF | 0.514 | 0.621 | 0.011 | 0.533 | 0.549 | ||
| KNN | 0.576 | 0.467 | 0.089 | 0.402 | |||
| NB | 0.407 | 0.422 | 0.582 | -0.163 | 0.450 | 0.404 | |
| LUSC | XGBoost | 0.180 | 0.000 | 0.000 | 0.000 | ||
| SVM | 0.513 | 0.828 | 0.197 | 0.000 | 0.000 | 0.000 | |
| RF | 0.517 | 0.828 | 0.315 | 0.000 | 0.000 | 0.000 | |
| KNN | 0.556 | 0.828 | 0.282 | ||||
| NB | 0.500 | 0.828 | 0.000 | 0.000 | 0.000 | ||
| LIHC | XGBoost | 0.675 | 0.477 | -0.087 | 0.000 | 0.000 | |
| SVM | 0.550 | 0.000 | 0.000 | 0.000 | |||
| RF | 0.625 | 0.725 | 0.483 | 0.000 | 0.000 | 0.000 | |
| KNN | 0.638 | 0.725 | 0.473 | ||||
| NB | 0.513 | 0.725 | 0.608 | 0.030 | 0.100 | 0.067 | |
| COAD | XGBoost | 0.589 | 0.671 | 0.291 | 0.617 | 0.550 | |
| SVM | 0.527 | 0.618 | 0.658 | 0.000 | 0.000 | 0.000 | |
| RF | 0.713 | 0.643 | 0.599 | 0.340 | 0.433 | ||
| KNN | 0.732 | 0.582 | 0.188 | 0.333 | 0.433 | ||
| NB | 0.682 | 0.689 | |||||
| UCEC | XGBoost | 0.688 | 0.300 | ||||
| SVM | 0.617 | 0.638 | 0.000 | 0.000 | 0.000 | ||
| RF | 0.547 | 0.581 | 0.608 | 0.061 | 0.067 | 0.200 | |
| KNN | 0.612 | 0.557 | 0.562 | 0.124 | 0.333 | 0.350 | |
| NB | 0.600 | 0.614 | 0.605 | 0.217 | 0.400 |
KIRC kidney renal clear cell carcinoma, BRCA breast invasive carcinoma, THCA thyroid carcinoma, HNSC head and neck squamous cell carcinoma, KIRP kidney renal papillary cell carcinoma, LUSC lung squamous cell carcinoma, LIHC liver hepatocellular carcinoma, COAD colon adenocarcinoma, UCEC uterine corpus endometrial carcinoma, XGBoost Extreme gradient boosting, SVM Support vector machine, RF Random forest, KNN K-Nearest Neighbor, NB Naive Bayes. AUC the area under the receiver operating characteristic curve, ACC accuracy, AUPR the area under the precision-recall curve, MCC Matthews correlation coefficient. Significant values are in bold.
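The threshold-based metrics defined above (ACC, precision, recall, MCC) can all be computed from the four confusion-matrix counts. The sketch below uses the standard formulas. Returning 0 when a denominator is zero is a common convention and an assumption here, and treating the late stage as the positive class is also an assumption. The example counts correspond to a classifier that labels all 50 HNSC samples (8 early, 42 late, as listed in the TCGA dataset description) as late.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, precision, recall and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    # Zero denominators are mapped to 0.0 (a convention, assumed here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Precision": precision, "Recall": recall, "MCC": mcc}


# Every HNSC sample predicted as late (positive class assumed to be late).
always_late = binary_metrics(tp=42, fp=8, tn=0, fn=0)
print(always_late)
```

Under these assumptions the result is ACC 0.84, precision 0.84, recall 1.0 and MCC 0.0, the same pattern as the HNSC rows for KNN and NB in the table. This is why MCC is worth reading alongside accuracy on imbalanced datasets.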
Figure 1 The ROC curves of XGBoost, SVM, RF, KNN and NB on nine datasets. (a) KIRC, (b) BRCA, (c) THCA, (d) HNSC, (e) KIRP, (f) LUSC, (g) LIHC, (h) COAD, (i) UCEC.
Figure 2 ClueGO analysis of GO terms on nine datasets. Each node is a GO term; the bigger the node, the smaller the P value. Each line indicates the correlation between functions, and a thicker line represents a larger kappa coefficient. Different colors denote the functional enrichment classification of GO terms. Networks were generated with ClueGO (version 2.5.7) in Cytoscape (version 3.6.0) (http://apps.cytoscape.org/apps/cluego). (a) KIRC, (b) BRCA, (c) THCA, (d) HNSC, (e) KIRP, (f) LUSC, (g) LIHC, (h) COAD, (i) UCEC.
Classification accuracy of XGBoost on TCGA and GEO datasets.
| Cancer type | XGBoost (TCGA dataset) (%) | XGBoost (GEO dataset) (%) |
|---|---|---|
| KIRC | 100 | 100 |
| BRCA | 100 | 97.6 |
| THCA | 100 | 97.6 |
| HNSC | 100 | 96.6 |
| KIRP | 100 | – |
| LUSC | 100 | 71.4 |
| LIHC | 100 | 65.1 |
| COAD | 100 | 81.8 |
| UCEC | 100 | – |
| Overall accuracy | 100 | 86.1 |
KIRC kidney renal clear cell carcinoma, BRCA breast invasive carcinoma, THCA thyroid carcinoma, HNSC head and neck squamous cell carcinoma, KIRP kidney renal papillary cell carcinoma, LUSC lung squamous cell carcinoma, LIHC liver hepatocellular carcinoma, COAD colon adenocarcinoma, UCEC uterine corpus endometrial carcinoma. XGBoost Extreme gradient boosting.
The description of TCGA datasets used in this study.
| Cancer type | Patient class | Total of patients | Total of methylation profiles |
|---|---|---|---|
| KIRC | Early | 71 | 395,708 |
| KIRC | Late | 89 | 395,708 |
| BRCA | Early | 74 | 395,479 |
| BRCA | Late | 21 | 395,479 |
| THCA | Early | 41 | 395,661 |
| THCA | Late | 15 | 395,661 |
| HNSC | Early | 8 | 395,363 |
| HNSC | Late | 42 | 395,363 |
| KIRP | Early | 22 | 395,392 |
| KIRP | Late | 23 | 395,392 |
| LUSC | Early | 34 | 395,680 |
| LUSC | Late | 7 | 395,680 |
| LIHC | Early | 29 | 395,564 |
| LIHC | Late | 11 | 395,564 |
| COAD | Early | 23 | 395,552 |
| COAD | Late | 15 | 395,552 |
| UCEC | Early | 22 | 395,616 |
| UCEC | Late | 12 | 395,616 |
KIRC kidney renal clear cell carcinoma, BRCA breast invasive carcinoma, THCA thyroid carcinoma, HNSC head and neck squamous cell carcinoma, KIRP kidney renal papillary cell carcinoma, LUSC lung squamous cell carcinoma, LIHC liver hepatocellular carcinoma, COAD colon adenocarcinoma, UCEC uterine corpus endometrial carcinoma.
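Because the early and late classes are unevenly sized in most of these datasets, it helps to know what accuracy a trivial classifier would reach by always predicting the larger class. The sketch below computes that baseline from the patient counts in the table above. The names `tcga_counts` and `majority_baseline` are illustrative only.

```python
# (early, late) patient counts per cancer type, copied from the table above.
tcga_counts = {
    "KIRC": (71, 89),
    "BRCA": (74, 21),
    "THCA": (41, 15),
    "HNSC": (8, 42),
    "KIRP": (22, 23),
    "LUSC": (34, 7),
    "LIHC": (29, 11),
    "COAD": (23, 15),
    "UCEC": (22, 12),
}


def majority_baseline(early, late):
    """Accuracy obtained by always predicting the larger class."""
    return max(early, late) / (early + late)


baseline = {
    cancer: round(majority_baseline(early, late), 3)
    for cancer, (early, late) in tcga_counts.items()
}

for cancer, acc in baseline.items():
    print(f"{cancer}: {acc:.3f}")
```

Several accuracies in the model comparison table coincide with these baselines, for example 0.779 for SVM and RF on BRCA, 0.725 for RF on LIHC and 0.840 for SVM, KNN and NB on HNSC, in each case with an MCC of 0.000. This suggests those models predicted a single class on these datasets.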
The description of GEO datasets used in this study.
| Cancer type | Accession number | Total of patients | Total of methylation profiles |
|---|---|---|---|
| KIRC | GSE61441 | 46 | 229,845 |
| BRCA | GSE69914 | 42 | 485,512 |
| THCA | GSE86961 | 41 | 448,547 |
| HNSC | GSE75537 | 29 | 485,512 |
| LUSC | GSE94785 | 28 | 452,162 |
| LIHC | GSE54503 | 66 | 485,577 |
| COAD | GSE42752 | 22 | 485,577 |
KIRC kidney renal clear cell carcinoma, BRCA breast invasive carcinoma, THCA thyroid carcinoma, HNSC head and neck squamous cell carcinoma, LUSC lung squamous cell carcinoma, LIHC liver hepatocellular carcinoma, COAD colon adenocarcinoma.
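The overall GEO accuracy of 86.1% reported in the classification accuracy table is consistent with a sample-weighted average of the per-cancer accuracies rather than their plain mean. The sketch below checks this arithmetic. It assumes that the patient counts in the table above are the per-class sizes of the external validation set, which the source does not state explicitly.

```python
# Per-cancer GEO accuracy (%) and patient counts, copied from the two tables.
geo = {
    "KIRC": (100.0, 46),
    "BRCA": (97.6, 42),
    "THCA": (97.6, 41),
    "HNSC": (96.6, 29),
    "LUSC": (71.4, 28),
    "LIHC": (65.1, 66),
    "COAD": (81.8, 22),
}

total_patients = sum(n for _, n in geo.values())

# Weighted mean: each cancer type contributes in proportion to its sample count.
weighted = sum(acc * n for acc, n in geo.values()) / total_patients

# Unweighted (macro) mean: each cancer type counts equally.
unweighted = sum(acc for acc, _ in geo.values()) / len(geo)

print(total_patients, round(weighted, 1), round(unweighted, 1))
```

Under this assumption the weighted mean over the 274 patients rounds to 86.1, matching the reported overall accuracy, while the unweighted mean rounds to 87.2. The difference is driven mainly by LIHC, the largest and least accurately classified GEO dataset.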
Figure 3 Schematic overview of the framework developed for classifying tumor stages.