Binsheng He1, Jidong Lang2, Bo Wang2, Xiaojun Liu2, Qingqing Lu2, Jianjun He1, Wei Gao3, Pingping Bing1, Geng Tian2, Jialiang Yang2.
Abstract
Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the accumulation of large cancer datasets in The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it is now feasible to predict cancer tissue-of-origin with computational tools. A metastatic tumor inherits characteristics from its tissue-of-origin, and both gene expression profiles and somatic mutations are tissue specific. We therefore developed a computational framework, TOOme, that infers tumor tissue-of-origin by integrating gene mutation and expression. Specifically, we first perform feature selection on both gene expression and mutation data with a random forest method. The selected features are then used to build a multi-label classification model to infer cancer tissue-of-origin. We adopt several popular multi-label classification methods and compare them by 10-fold cross-validation. We applied TOOme to TCGA data containing 7,008 non-metastatic samples across 20 solid tumors. The random forest procedure selects 74 genes from the gene expression profiles and 6 genes from the mutation data. These genes fall into two categories: (1) cancer-type-specific genes and (2) genes expressed or mutated in several cancers at different expression levels or mutation rates. Functional analysis indicates that the selected genes are significantly enriched in gland development, urogenital system development, hormone metabolic process, thyroid hormone generation, prostate hormone generation, and so on. Among the multi-label classification methods, random forest performs best, with a 10-fold cross-validation prediction accuracy of 96%.
We also use the 19 metastatic samples from TCGA and 256 cancer samples downloaded from GEO as independent testing data, on which TOOme achieves a prediction accuracy of 89%. The cross-validation accuracy is better than that obtained using gene expression alone (95%) or gene mutation alone (53%). In conclusion, TOOme provides a quick yet accurate alternative to traditional medical methods for inferring cancer tissue-of-origin. In addition, methods combining somatic mutation and gene expression outperform those using gene expression or mutation alone.
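The two-step procedure described in the abstract (random forest feature selection followed by a classifier evaluated with 10-fold cross-validation) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code: the data are synthetic stand-ins generated by `make_classification`, and the sample count, feature count, number of classes, `top_k`, and tree counts are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a samples x genes matrix (NOT TCGA data).
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=20, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

top_k = 20  # hypothetical; the abstract reports 80 selected genes

# Step 1: keep the top_k features ranked by random forest importance.
# threshold=-inf makes max_features the only selection criterion.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=-np.inf,
    max_features=top_k,
)

# Step 2: classify on the selected features with a second random forest.
model = Pipeline([
    ("select", selector),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# 10-fold stratified cross-validation; selection is refit inside each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"10-fold CV accuracy: {mean_accuracy:.3f}")
```

Placing the selector inside the pipeline is a design choice of this sketch: features are re-selected on each training fold, so the held-out fold does not influence which features are kept. The abstract does not state whether selection was nested in this way.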
Keywords: cross-validation; gene expression; random forest; somatic mutation; tissue-of-origin
Year: 2020 PMID: 32509741 PMCID: PMC7248358 DOI: 10.3389/fbioe.2020.00394
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. The complete workflow for predicting cancer tissue-of-origin.
Sample distribution of each cancer from ICGC database.
| Cancer type | Abbreviation | Samples | Percentage |
| --- | --- | --- | --- |
| Bladder urothelial carcinoma | BLCA | 294 | 4.20% |
| Breast invasive carcinoma | BRCA | 970 | 13.84% |
| Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC | 241 | 3.44% |
| Colon adenocarcinoma | COAD | 390 | 5.57% |
| Glioblastoma multiforme | GBM | 148 | 2.11% |
| Head and neck squamous cell carcinoma | HNSC | 460 | 6.56% |
| Kidney renal clear cell carcinoma | KIRC | 345 | 4.92% |
| Kidney renal papillary cell carcinoma | KIRP | 216 | 3.08% |
| Acute myeloid leukemia | LAML | 121 | 1.73% |
| Brain lower grade glioma | LGG | 433 | 6.18% |
| Liver hepatocellular carcinoma | LIHC | 282 | 4.02% |
| Lung adenocarcinoma | LUAD | 475 | 6.78% |
| Lung squamous cell carcinoma | LUSC | 411 | 5.87% |
| Ovarian serous cystadenocarcinoma | OV | 185 | 2.64% |
| Pancreatic adenocarcinoma | PAAD | 134 | 1.91% |
| Prostate adenocarcinoma | PRAD | 374 | 5.34% |
| Rectum adenocarcinoma | READ | 137 | 1.95% |
| Stomach adenocarcinoma | STAD | 412 | 5.88% |
| Thyroid carcinoma | THCA | 486 | 6.93% |
| Uterine corpus endometrial carcinoma | UCEC | 494 | 7.05% |
| Total | | 7008 | 100% |
Classification performance using the combination of somatic mutation and gene expression (80 genes).
| Cancer | Precision | Recall | F1-score | Support | AUC |
| --- | --- | --- | --- | --- | --- |
| BLCA | 0.8906 | 0.9354 | 0.9124 | 294 | 0.9950 |
| BRCA | 0.9987 | 0.9947 | 0.9967 | 970 | 0.9998 |
| CESC | 0.9148 | 0.8859 | 0.9001 | 241 | 0.9971 |
| COAD | 0.7548 | 0.9644 | 0.8468 | 390 | 0.9815 |
| GBM | 0.9940 | 1.0000 | 0.9970 | 148 | 0.9999 |
| HNSC | 0.9916 | 1.0000 | 0.9958 | 460 | 0.9994 |
| KIRC | 0.9850 | 0.9516 | 0.9680 | 345 | 0.9992 |
| KIRP | 0.9344 | 0.9630 | 0.9485 | 216 | 0.9979 |
| LAML | 1.0000 | 1.0000 | 1.0000 | 121 | 1.0000 |
| LGG | 0.9926 | 0.9977 | 0.9952 | 433 | 0.9995 |
| LIHC | 0.9925 | 0.9844 | 0.9884 | 282 | 0.9997 |
| LUAD | 0.9358 | 0.9448 | 0.9403 | 475 | 0.9953 |
| LUSC | 0.9408 | 0.9000 | 0.9199 | 411 | 0.9965 |
| OV | 1.0000 | 0.9946 | 0.9973 | 185 | 1.0000 |
| PAAD | 0.9378 | 0.9552 | 0.9464 | 134 | 0.9988 |
| PRAD | 0.9973 | 1.0000 | 0.9987 | 374 | 0.9998 |
| READ | 0.7569 | 0.1591 | 0.2627 | 137 | 0.9990 |
| STAD | 0.9947 | 0.9976 | 0.9961 | 412 | 0.9997 |
| THCA | 1.0000 | 0.9979 | 0.9990 | 486 | 1.0000 |
| UCEC | 0.9673 | 0.9816 | 0.9744 | 494 | 0.9975 |
| Accuracy | 0.9577 | 0.9577 | 0.9577 | 0.0000 | |
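In the per-cancer rows above, the third numeric column is consistent with the F1-score, that is, the harmonic mean of the first two columns read as precision and recall. The short check below is a sketch under that reading; the helper name `f1_score` is hypothetical, and the three rows are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported third column) copied from the table rows.
rows = {
    "BLCA": (0.8906, 0.9354, 0.9124),
    "COAD": (0.7548, 0.9644, 0.8468),
    "READ": (0.7569, 0.1591, 0.2627),
}

for cancer, (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    # Small differences are expected because the inputs are rounded.
    print(f"{cancer}: computed {computed:.4f}, reported {reported:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, the READ score stays low (recall 0.1591) even though its precision is 0.7569.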
FIGURE 2. Classification accuracy using somatic mutation, gene expression, and the combination of somatic mutation and gene expression, respectively.
Prediction probabilities of each sample for each cancer.
FIGURE 3. Heatmap of the mean gene expression value for each cancer.
FIGURE 4. Heatmap of the mean somatic mutation value for each cancer.
FIGURE 5. The 80 top-ranked selected genes enriched in cellular component, biological process, and molecular function.