| Literature DB >> 33330438 |
Xin Liang1,2,3, Wen Zhu1,2,3, Bo Liao1,2,3, Bo Wang4,5, Jialiang Yang4,5, Xiaofei Mo4,5, Ruixi Li1,2,3.
Abstract
Some carcinomas present with one or more metastatic sites whose primary origin is unknown. Identifying whether tumor tissue is primary or metastatic is crucial for physicians developing precise treatment plans. When the primary site is unknown, designing a specific plan is challenging, so these patients usually receive broad-spectrum chemotherapy and still have a poor prognosis. Machine learning has been widely used in clinical practice and has already shown significant advantages. In this study, we classify and predict a large number of tumor samples of uncertain origin by applying the random forest and naive Bayes algorithms. We use precision, recall, and other measures to evaluate the performance of our approach. The results show a prediction accuracy of 90.4% for 7,713 samples and 80% for 20 metastatic tumor samples. In addition, 10-fold cross-validation was used to evaluate classification accuracy, which reaches 91%.
Keywords: machine learning; naive Bayes; random forest; the ability of tissue tracing; uncertain origins
Year: 2020 PMID: 33330438 PMCID: PMC7732438 DOI: 10.3389/fbioe.2020.607126
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. Flow chart of the article.
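Detailed information on the data covering 21 cancers downloaded from TCGA.
| Cancer type | Samples | Female | Male | Proportion (%) | Note |
| --- | --- | --- | --- | --- | --- |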
| BLCA | 301 | 80 | 221 | 3.9 | |
| BRCA | 1,056 | 1,044 | 11 | 13.7 | 1 person has no clinical information |
| CESC | 258 | 258 | 0 | 3.3 | |
| COAD | 451 | 215 | 236 | 5.8 | |
| GBM | 153 | 53 | 100 | 2.0 | |
| HNSC | 480 | 128 | 352 | 6.2 | |
| KIRC | 526 | 184 | 342 | 6.8 | |
| KIRP | 222 | 63 | 159 | 2.9 | |
| LAML | 173 | 80 | 93 | 2.2 | |
| LGG | 439 | 192 | 247 | 5.7 | |
| LIHC | 294 | 99 | 195 | 3.8 | |
| LUAD | 486 | 262 | 224 | 6.3 | |
| LUSC | 428 | 109 | 319 | 5.5 | |
| OV | 261 | 261 | 0 | 3.4 | |
| PAAD | 142 | 64 | 78 | 1.8 | |
| PRAD | 379 | 0 | 379 | 4.9 | |
| READ | 153 | 70 | 82 | 2.0 | 1 person has no clinical information |
| SKCM | 80 | 34 | 46 | 1.0 | |
| STAD | 415 | 147 | 268 | 5.4 | |
| THCA | 500 | 367 | 133 | 6.5 | |
| UCEC | 516 | 516 | 0 | 6.7 | |
| Total | 7,713 | 4,226 | 3,485 | 99.8 | |
Detailed information on the data covering five cancers downloaded from GEO.
| Cancer type | Samples | Proportion (%) |
| --- | --- | --- |
| LIHC | 9 | 18.75 |
| UCEC | 6 | 12.5 |
| THCA | 8 | 16.67 |
| BLCA | 11 | 22.92 |
| PAAD | 14 | 29.17 |
| Total | 48 | 99.98 |
FIGURE 2. Accuracy as a function of the number of genes. Under 10-fold cross-validation, accuracy increases up to 1,700 genes, after which it remains stable at 91.07%.
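The curve in Figure 2 comes from repeating a 10-fold cross-validation for gene sets of increasing size. The following is a minimal pure-Python sketch of the fold-splitting and accuracy-averaging step. The `nearest_centroid` classifier, the toy data, and all function names are illustrative assumptions, not the authors' random forest and naive Bayes pipeline, and the columns are assumed to be already ordered by importance.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid(train_X, train_y, test_X):
    """Hypothetical stand-in classifier: pick the class with the closest mean."""
    sums, counts = {}, {}
    for x, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in test_X]

def cv_accuracy(X, y, n_genes, k=10, seed=0):
    """Mean accuracy over k folds, using only the first n_genes columns of X."""
    Xk = [row[:n_genes] for row in X]
    fold_acc = []
    for test_idx in k_fold_indices(len(Xk), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(Xk)) if i not in held_out]
        pred = nearest_centroid([Xk[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                [Xk[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(pred, test_idx))
        fold_acc.append(correct / len(test_idx))
    return sum(fold_acc) / len(fold_acc)

# Toy data (made up): 40 samples, 3 "genes", only the first one is informative.
rng = random.Random(1)
X = [[(5.0 if i % 2 else 0.0) + rng.random(), rng.random(), rng.random()]
     for i in range(40)]
y = ["A" if i % 2 else "B" for i in range(40)]

for n_genes in (1, 2, 3):
    print(n_genes, cv_accuracy(X, y, n_genes))
```

Plotting `cv_accuracy` against `n_genes` gives a curve of the same kind as Figure 2; the plateau value depends on the data and the classifier used.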
FIGURE 3. Confusion matrix for the classifier built on 2,284 genes. Red marks cases where the predicted cancer type is inconsistent with the primary cancer type.
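A confusion matrix such as the one in Figure 3 is a table of (true type, predicted type) counts, and the off-diagonal cells are the inconsistent predictions. The sketch below shows one way to build it with the standard library; the function names and the labels are made up for illustration and are not the paper's results.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))

def as_matrix(counts, classes):
    """Lay the counts out as rows = true type, columns = predicted type."""
    return [[counts.get((t, p), 0) for p in classes] for t in classes]

# Made-up labels for illustration only.
true_labels = ["COAD", "COAD", "READ", "READ", "LUAD", "LUAD"]
pred_labels = ["COAD", "READ", "READ", "READ", "LUAD", "LUSC"]
classes = ["COAD", "READ", "LUAD", "LUSC"]

counts = confusion_counts(true_labels, pred_labels)
matrix = as_matrix(counts, classes)

# Off-diagonal cells: predicted type differs from the true type.
mismatches = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}

for name, row in zip(classes, matrix):
    print(name, row)
print("inconsistent cells:", mismatches)
```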
FIGURE 4. Gene enrichment in biological process, cellular component, and molecular function terms for the first 100-gene set, drawn with ClueGO.
FIGURE 5. Heat map of the first 100 genes screened by the random forest algorithm. Rows are cancer types and columns are genes. Gene expression level is measured in RPKM, and the mean over the samples of each cancer type is used to show differences in gene expression.
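The matrix behind a heat map like Figure 5 can be obtained by averaging each gene's RPKM over the samples of one cancer type. The sketch below illustrates that averaging step only; the gene names `GENE_A` and `GENE_B`, the values, and the function name are hypothetical, and every sample is assumed to report every gene.

```python
from collections import defaultdict

def mean_expression_by_type(samples):
    """samples: iterable of (cancer_type, {gene: rpkm}) pairs.

    Returns {cancer_type: {gene: mean rpkm over that type's samples}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    n_samples = defaultdict(int)
    for cancer_type, expression in samples:
        n_samples[cancer_type] += 1
        for gene, rpkm in expression.items():
            totals[cancer_type][gene] += rpkm
    return {
        cancer_type: {gene: s / n_samples[cancer_type] for gene, s in genes.items()}
        for cancer_type, genes in totals.items()
    }

# Made-up RPKM values for illustration only.
samples = [
    ("LIHC", {"GENE_A": 10.0, "GENE_B": 2.0}),
    ("LIHC", {"GENE_A": 14.0, "GENE_B": 4.0}),
    ("THCA", {"GENE_A": 1.0, "GENE_B": 8.0}),
]
means = mean_expression_by_type(samples)

# One heat-map row per cancer type, one column per gene.
genes = ["GENE_A", "GENE_B"]
rows = {cancer_type: [means[cancer_type][g] for g in genes] for cancer_type in means}
print(rows)
```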
FIGURE 6. Results of independent verification. Blue represents the primary tumor, orange represents the accuracy of the prediction, light red represents the predicted tumor type, and dark red represents the number of predicted tumor types.
FIGURE 7. Recall and precision after 10-fold cross-validation.
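For a multi-class problem, the per-type precision and recall reported in Figure 7 are usually computed one-vs-rest: precision is TP / (TP + FP) and recall is TP / (TP + FN) for each cancer type. The sketch below follows those standard definitions. The labels are made up, the function name is illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the source.

```python
def precision_recall(true_labels, predicted_labels):
    """Return {class: (precision, recall)}, computed one-vs-rest per class."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == c and p == c for t, p in pairs)  # correctly predicted as c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly predicted as c
        fn = sum(t == c and p != c for t, p in pairs)  # true c predicted as other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (precision, recall)
    return scores

# Made-up labels for illustration only.
true_labels = ["BLCA", "BLCA", "BLCA", "PAAD", "PAAD", "THCA"]
pred_labels = ["BLCA", "BLCA", "PAAD", "PAAD", "PAAD", "THCA"]

scores = precision_recall(true_labels, pred_labels)
for cancer_type, (precision, recall) in scores.items():
    print(cancer_type, round(precision, 3), round(recall, 3))
```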
FIGURE 8. Results of three other algorithms. The first panel shows the k-nearest neighbor algorithm (k = 5), with a prediction accuracy of only 88%; the second shows the decision tree algorithm, with a classification accuracy of only 88%; the third shows the naive Bayes algorithm, with a classification accuracy reaching 90%.