Yongchang Miao1,2,3, Xueliang Zhang4, Sijie Chen5, Wenjing Zhou6, Dalai Xu7, Xiaoli Shi8,9, Jian Li5, Jinhui Tu5, Xuelian Yuan8, Kebo Lv5, Geng Tian8,9.
Abstract
Cancer of unknown primary (CUP) refers to cancer whose primary lesion cannot be identified by routine pathological and clinical diagnostic methods. This kind of cancer is extremely difficult to treat, and patients with CUP usually have a very short survival time. Recent studies have suggested that treatment targeting the primary lesion significantly improves the survival of CUP patients. Thus, it is critical to develop accurate yet fast methods to infer the tissue-of-origin (TOO) of CUP. In recent years, a few computational methods have been developed to infer TOO from single-omics data such as gene expression, methylation, or somatic mutation. However, tumor metastasis involves interactions among multiple levels of biological molecules. In this study, we developed a novel computational method to predict the TOO of CUP patients by explicitly integrating expression quantitative trait loci (eQTL) into an XGBoost classification model. We trained our model on The Cancer Genome Atlas (TCGA) data comprising over 7,000 samples across 20 types of solid tumors. In 10-fold cross-validation, the prediction accuracy of the model with eQTL was over 0.96, better than that without eQTL. We also tested our model on an independent dataset downloaded from Gene Expression Omnibus (GEO) consisting of 87 samples across 4 cancer types, where it achieved an f1-score of 0.7-1 depending on the cancer type. In summary, eQTL is an important source of information for inferring cancer TOO, and the model might be applied in routine clinical testing for CUP patients in the future.
Keywords: GEO; TCGA; XGBoost; cancer of unknown primary; expression quantitative trait loci; tissue-of-origin
Year: 2022 PMID: 36016607 PMCID: PMC9396384 DOI: 10.3389/fonc.2022.946552
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738
Data size and proportion.
| Training Data from TCGA | | |
|---|---|---|
| Cancer Type | Amount | Percent |
| Breast invasive carcinoma (BRCA) | 1,056 | 13.68% |
| Kidney renal clear cell carcinoma (KIRC) | 526 | 6.81% |
| Uterine corpus endometrial carcinoma (UCEC) | 516 | 6.68% |
| Thyroid carcinoma (THCA) | 500 | 6.48% |
| Lung adenocarcinoma (LUAD) | 486 | 6.29% |
| Head and neck squamous cell carcinoma (HNSC) | 480 | 6.22% |
| Colon adenocarcinoma (COAD) | 451 | 5.84% |
| Brain lower-grade glioma (LGG) | 439 | 5.69% |
| Stomach adenocarcinoma (STAD) | 415 | 5.37% |
| Prostate adenocarcinoma (PRAD) | 379 | 4.91% |
| Bladder urothelial carcinoma (BLCA) | 301 | 3.90% |
| Liver hepatocellular carcinoma (LIHC) | 294 | 3.81% |
| Ovarian serous cystadenocarcinoma (OV) | 261 | 3.38% |
| Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 258 | 3.34% |
| Kidney renal papillary cell carcinoma (KIRP) | 222 | 2.88% |
| Acute myeloid leukemia (LAML) | 173 | 2.24% |
| Glioblastoma multiforme (GBM) | 153 | 1.98% |
| Rectum adenocarcinoma (READ) | 153 | 1.98% |
| Pancreatic adenocarcinoma (PAAD) | 142 | 1.84% |
| Skin cutaneous melanoma (SKCM) | 80 | 1.04% |
| Unknown cancer | 430 | 5.57% |
| Testing Data from GEO | | |
| Cancer Type | Amount | Percent |
| PRAD | 44 | 38.60% |
| BRCA | 25 | 45.61% |
| LUAD | 1 | 0.88% |
| OV | 17 | 14.91% |
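The Percent column for the training data appears to be each count divided by the total number of training samples, including the "Unknown cancer" row. That is an inference from the numbers rather than something the table states. The sketch below recomputes the shares from the counts listed above; the variable and function names are illustrative only.

```python
# Sample counts copied from the training-data rows of the table above.
train_counts = {
    "BRCA": 1056, "KIRC": 526, "UCEC": 516, "THCA": 500, "LUAD": 486,
    "HNSC": 480, "COAD": 451, "LGG": 439, "STAD": 415, "PRAD": 379,
    "BLCA": 301, "LIHC": 294, "OV": 261, "CESC": 258, "KIRP": 222,
    "LAML": 173, "GBM": 153, "READ": 153, "PAAD": 142, "SKCM": 80,
    "Unknown": 430,
}


def class_percentages(counts):
    """Return each class's share of the total sample count, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total_samples = sum(train_counts.values())
percentages = class_percentages(train_counts)

# Print classes from largest to smallest share to make the imbalance visible.
for name, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {train_counts[name]:5d} {pct:6.2f}%")
```

Under this assumption the recomputed shares agree with the tabulated ones to within about 0.01 percentage points, and they show how uneven the classes are: BRCA alone contributes more than ten times as many samples as SKCM.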
Figure 1. The flow diagram of common eQTL analysis processes. The eQTL data we analyzed were also generated by these processes.
Figure 2. The performance of the model against the number of genes. Tenfold cross-validation was used to train the model, and data not used for training were held out for independent testing. XGBoost and MLP were each used for classification. The training and testing accuracies are shown in this figure.
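A minimal sketch of how tenfold cross-validation splits could be generated is given below. It uses only the Python standard library and does not train XGBoost or MLP; it only produces the index splits a classifier would be fitted and scored on. The source does not say whether the folds were stratified by cancer type, so stratification is an assumption here, and the function name `stratified_kfold` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs with a similar class mix in each fold.

    Stratification is an assumption of this sketch, not a detail from the source.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


# Toy labels (hypothetical, not the TCGA data): three classes of unequal size.
toy_labels = ["BRCA"] * 50 + ["LUAD"] * 30 + ["SKCM"] * 20
splits = list(stratified_kfold(toy_labels, k=10, seed=42))

for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test")
```

In each iteration a model would be fitted on `train_idx` and scored on `test_idx`, and the ten scores would then be averaged. Stratifying matters more when classes are as unbalanced as in the training table, because a plain random split can leave a small class nearly absent from some folds.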
The accuracy of training data and testing data.
| Number of features | Accuracy of XGB in training data | Accuracy of XGB in testing data | Accuracy of MLP in training data | Accuracy of MLP in testing data |
|---|---|---|---|---|
| 200 | 0.943393782383419 | 0.945990297099496 | 0.832642487 | 0.83364232 |
| 300 | 0.950647668393782 | 0.946705989675118 | 0.854015544 | 0.856544482 |
| 400 | 0.956865284974093 | 0.950883005411232 | 0.867098446 | 0.871523024 |
| 500 | 0.956476683937823 | 0.95160761304501 | 0.87992228 | 0.878871727 |
| 600 | 0.9610103626943 | 0.954631683702029 | 0.884455959 | 0.894136587 |
| 700 | 0.960621761658031 | 0.953910807953061 | 0.89119171 | 0.898887898 |
| 800 | 0.962046632124352 | | 0.894041451 | 0.904075011 |
| 900 | | 0.955063338378288 | 0.899093264 | 0.904507288 |
| 1,000 | 0.961917098445595 | 0.955207430597308 | | |
Bold values indicate the highest accuracy in each data set.
Figure 3. The receiver operating characteristic (ROC) curves for classification. (A–C) ROC curves for the 20 cancer types from the best 10-fold cross-validation results. (D) The average ROC curve.
Figure 4. The performance of the model in the testing data. (A) The model test results (R2-score) on four cancers. (B) The confusion matrix on testing data.
The model test results (precision, recall, and f1-score) of 4 cancers on the GEO dataset.
| Abbreviation | Precision | Recall | f1-score | Support |
|---|---|---|---|---|
| PRAD | 1 | 0.729545455 | 0.83772727 | 44 |
| BRCA | 1 | 0.940576923 | 0.97230769 | 52 |
| OV | 1 | 0.71 | 0.83 | 17 |
| LUAD | 1 | 1 | 1 | 1 |
| avg/total | 1 | 0.825263158 | 0.89938596 | 114 |
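The avg/total row is consistent with a support-weighted mean of the per-class rows, although the source does not state how it was computed. The sketch below recomputes that mean from the values in the table; the names `geo_report` and `weighted_average` are illustrative.

```python
# Rows copied from the table above: (precision, recall, f1-score, support).
geo_report = {
    "PRAD": (1.0, 0.729545455, 0.83772727, 44),
    "BRCA": (1.0, 0.940576923, 0.97230769, 52),
    "OV": (1.0, 0.71, 0.83, 17),
    "LUAD": (1.0, 1.0, 1.0, 1),
}


def weighted_average(report, column):
    """Support-weighted mean of one metric (0=precision, 1=recall, 2=f1-score)."""
    total_support = sum(row[3] for row in report.values())
    weighted_sum = sum(row[column] * row[3] for row in report.values())
    return weighted_sum / total_support


avg_precision = weighted_average(geo_report, 0)
avg_recall = weighted_average(geo_report, 1)
avg_f1 = weighted_average(geo_report, 2)
print(f"precision={avg_precision:.6f} recall={avg_recall:.6f} f1={avg_f1:.6f}")
```

Weighting by support means the single LUAD sample, despite its perfect scores, contributes only 1/114 of the average, while BRCA and PRAD dominate it.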
Figure 5. The heatmaps of gene expression. The expression values of 15 genes for each cancer sample in the training data (A) and testing data (B) were averaged and then log-transformed. Red represents high expression and blue represents low expression.
Figure 6. The enrichment analysis. (A) KEGG enrichment histogram. The pathways enriched among the 800 genes are shown (p < 0.01). (B) GO enrichment histogram. The top 20 pathways enriched among the 800 genes are shown (p < 0.01). The pathway association networks of KEGG and GO are shown in (C) and (D). In the networks, each node represents a pathway, and an edge between two nodes indicates that the pathways share genes.