Yulin Zhang, Tong Feng, Shudong Wang, Ruyi Dong, Jialiang Yang, Jionglong Su, Bo Wang.
Abstract
The discovery of cancer of unknown primary (CUP) is of great significance in designing more effective treatments and improving the diagnostic efficiency in cancer patients. In this study, we develop a machine learning model for tracing the tissue of origin of CUP with high accuracy after feature engineering and model evaluation. Based on copy number variation data consisting of 4,566 training cases and 1,262 independent validation cases, an XGBoost classifier is applied to 10 types of cancer. Extremely randomized tree (Extra tree) is used for dimension reduction so that fewer variables replace the original high-dimensional variables. The features with the top 300 weights are selected, and principal component analysis is applied to eliminate noise. We find that the XGBoost classifier achieves the highest overall accuracy for predicting tumor tissue of origin: 0.8913 in 10-fold cross-validation on the training samples and 0.7421 on the independent validation datasets. Furthermore, by contrasting various performance indices, such as precision and recall, the experimental results show that the XGBoost classifier significantly improves the classification performance for various tumors with less prediction error, as compared to other classifiers, such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and Adaboost. Our method can infer tissue of origin for the 10 cancer types with acceptable accuracy in both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
Keywords: XGBoost; copy number variations; extremely randomized tree; multiclass; principal component analysis; tissue-of-origin
Year: 2020 PMID: 33329723 PMCID: PMC7716814 DOI: 10.3389/fgene.2020.585029
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Flow chart of the method.
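The workflow described in the abstract (tree-based feature ranking, principal component analysis, then a boosted-tree classifier scored by 10-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the data are synthetic, the sizes (50 selected features, 10 components, 3 classes) are arbitrary and much smaller than the 300 features and 10 cancer types in the study, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the sketch depends on scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a CNV matrix: rows are samples, columns are genes.
# All sizes here are illustrative assumptions, not values from the study.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    # Rank features by Extra-Trees importance and keep the top 50.
    # threshold=-np.inf makes max_features the only selection criterion.
    ("select", SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=0),
        max_features=50, threshold=-np.inf,
    )),
    # Project the selected features onto a few principal components.
    ("pca", PCA(n_components=10, random_state=0)),
    # Boosted-tree classifier (stand-in for an XGBoost classifier).
    ("clf", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

# 10-fold stratified cross-validation; selection and PCA are refit per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"mean 10-fold accuracy: {mean_accuracy:.4f}")

# Fit once on all data to inspect how many features were kept.
pipeline.fit(X, y)
n_selected = int(pipeline.named_steps["select"].get_support().sum())
```

Placing the selection and PCA steps inside the pipeline means they are refit on each training fold, which keeps the held-out fold from influencing which features are kept. Whether the study applied feature selection inside or outside the cross-validation loop is not stated here.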
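Description of the datasets.
| Cancer type | Abbreviation | Training cases | Independent validation cases |
| --- | --- | --- | --- |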
| Bladder urothelial carcinoma | BLCA | 135 | 112 |
| Breast invasive carcinoma | BRCA | 847 | 235 |
| Colorectal adenocarcinoma | COADREAD | 575 | 170 |
| Glioblastoma multiforme | GBM | 563 | 28 |
| Head and neck squamous cell carcinoma | HNSC | 306 | 216 |
| Kidney renal clear cell carcinoma | KIRC | 490 | 41 |
| Lung adenocarcinoma | LUAD | 356 | 120 |
| Lung squamous cell carcinoma | LUSC | 289 | 212 |
| Ovarian serous cystadenocarcinoma | OV | 562 | 22 |
| Uterine corpus endometrial carcinoma | UCEC | 443 | 106 |
FIGURE 2. Flow chart of feature selection using extremely randomized tree (Extra tree).
FIGURE 3. Classification accuracy for different dimensions using extremely randomized tree (Extra tree) and principal component analysis (PCA).
Classification accuracy for each cancer with XGBoost classifier on training datasets via 10-fold cross-validation.
| Cancer type | Accuracy |
| --- | --- |
| BLCA | 0.9611 |
| BRCA | 0.7852 |
| COADREAD | 0.8716 |
| GBM | 0.9101 |
| HNSC | 0.9378 |
| KIRC | 0.9776 |
| LUAD | 0.8928 |
| LUSC | 0.9453 |
| OV | 0.8052 |
| UCEC | 0.7988 |
FIGURE 4. Receiver operating characteristic to predict the tissue of origin of CNV_origin.
Comparison with other algorithms on independent datasets.
| Cancer type | Classifier | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| BLCA | XGBoost | | 0.4464 | 0.5917 |
| | KNN | 0.5984 | | |
| | Bayes | 0.3232 | 0.2857 | 0.3033 |
| | Adaboost | 0.8000 | 0.1785 | 0.2919 |
| | SVM | 0.5555 | 0.6250 | 0.5882 |
| BRCA | XGBoost | 0.7034 | | |
| | KNN | | 0.4170 | 0.5429 |
| | Bayes | 0.3153 | 0.1489 | 0.2023 |
| | Adaboost | 0.4192 | 0.6297 | 0.5034 |
| | SVM | 0.7777 | 0.5361 | 0.6347 |
| COADREAD | XGBoost | | | |
| | KNN | 0.5265 | 0.7588 | 0.6216 |
| | Bayes | 0.1250 | 0.1000 | 0.1574 |
| | Adaboost | 0.6891 | 0.6000 | 0.6415 |
| | SVM | 0.7986 | 0.4861 | 0.7324 |
| GBM | XGBoost | | | |
| | KNN | 0.5135 | 0.6785 | 0.5846 |
| | Bayes | 0.1250 | 0.0357 | 0.0555 |
| | Adaboost | 0.5483 | 0.6071 | 0.5762 |
| | SVM | 0.6176 | 0.7500 | 0.6774 |
| HNSC | XGBoost | | | |
| | KNN | 0.6371 | 0.6667 | 0.6515 |
| | Bayes | 0.3297 | 0.2870 | 0.3069 |
| | Adaboost | 0.5363 | 0.5462 | 0.5412 |
| | SVM | 0.6774 | 0.4861 | 0.5660 |
| KIRC | XGBoost | | | |
| | KNN | 0.4137 | 0.8780 | 0.5625 |
| | Bayes | 0.0723 | 0.9268 | 0.1342 |
| | Adaboost | 0.6341 | 0.6341 | 0.6341 |
| | SVM | 0.6271 | 0.9024 | 0.7400 |
| LUAD | XGBoost | | | |
| | KNN | 0.3535 | 0.2916 | 0.3196 |
| | Bayes | 0.1212 | 0.0333 | 0.0522 |
| | Adaboost | 0.2341 | 0.3083 | 0.2661 |
| | SVM | 0.4869 | 0.4666 | 0.4765 |
| LUSC | XGBoost | | | |
| | KNN | 0.7486 | 0.6745 | 0.7096 |
| | Bayes | 0.5000 | 0.0849 | 0.1450 |
| | Adaboost | 0.7227 | 0.3443 | 0.4664 |
| | SVM | 0.7142 | 0.6603 | 0.6862 |
| OV | XGBoost | | | |
| | KNN | 0.3090 | 0.7727 | 0.4415 |
| | Bayes | 0.0408 | 0.3636 | 0.0733 |
| | Adaboost | 0.2142 | 0.4090 | 0.2812 |
| | SVM | 0.3508 | 0.9090 | 0.5063 |
| UCEC | XGBoost | 0.4864 | 0.6698 | 0.5669 |
| | KNN | 0.4927 | 0.3207 | 0.3885 |
| | Bayes | 0.0500 | 0.0094 | 0.0158 |
| | Adaboost | | | |
| | SVM | 0.3785 | 0.6792 | 0.5062 |
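The three numeric columns in the comparison above appear to be per-class precision, recall, and F1-score; that reading is an assumption, supported by the fact that fully populated rows satisfy F1 = 2PR / (P + R). A small sketch of how these metrics relate, with a consistency check against two rows of the table (the function names and the counts in the last example are hypothetical):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on two fully populated rows of the table.
print(round(f1_score(0.5555, 0.6250), 4))  # BLCA, SVM      -> 0.5882
print(round(f1_score(0.6341, 0.6341), 4))  # KIRC, Adaboost -> 0.6341

# Hypothetical counts, for illustration only: 8 correct calls,
# 2 false alarms, 8 missed cases for one cancer type.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 4), round(r, 4), round(f, 4))  # -> 0.8 0.5 0.6154
```

Because F1 is a harmonic mean, it always lies between precision and recall, which makes it a quick sanity check for any row of the table.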
FIGURE 5. Comparison of the overall accuracy for the classifiers on training datasets and independent validation datasets.
Top six genes and corresponding molecular function.
| Gene ID | Function | GO term | Cancer type | |
| --- | --- | --- | --- | --- |
| 28512 | Intracellular signal transduction | GO:0035556 | LUAD KIRC | |
| 1030 | Positive regulation of transforming growth factor beta receptor signaling pathway | GO:0030511 | HNSC | |
| 64848 | Positive regulation by host of viral genome replication | GO:0044829 | COADREAD BRCA | |
| 2122 | Negative regulation of JNK cascade | GO:0046329 | COADREAD | |
| 54715 | Regulation of alternative mRNA splicing, via spliceosome | GO:0000381 | COADREAD | |
| 51560 | Intra-Golgi vesicle-mediated transport | GO:0006891 | COADREAD |
FIGURE 6. Enriched terms bar graph colored by p-values in gene lists.