Lei Chen, Zhandong Li, Tao Zeng, Yu-Hang Zhang, Dejing Liu, Hao Li, Tao Huang, Yu-Dong Cai.
Abstract
Cancer can be broadly defined as a cluster of systematic diseases triggered by abnormal cell proliferation and growth. With the development of biological sciences and biotechnologies, the etiology of cancer has been partially revealed, including some of the most substantial pathogenic factors, either endogenous (genetic) or exogenous (environmental). However, some factors that contribute to tumorigenesis have not yet been analyzed and discussed in detail. For instance, some typical correlations between microorganisms and tumorigenesis have already been reported, but previous studies are sporadic, focus on single microorganism-cancer subtype pairs, and do not explain or validate the specific contribution of the microbiome to tumorigenesis. On the basis of systematic microbiome analyses of blood and cancer-associated tissues from cancer patients and controls in the public domain, we performed interpretable analyses. We identified several core regulatory microorganisms that contribute to the classification of multiple tumor subtypes and established quantitative predictive models for interpretable prediction by using multiple machine learning methods. We also compared the optimal features (microorganisms) and rules identified from microbiome profiles processed using Kraken and SHOGUN. Collectively, our study identified new microbiome signatures and their interpretable classification rules for cancer discrimination and carried out a reliable methodological comparison for robust cancer microbiome analyses, thereby advancing the study of tumor etiology at the microbiome level.
Keywords: cancer type; decision tree; machine learning algorithm; microbiota; rules
Year: 2020 PMID: 33330634 PMCID: PMC7672214 DOI: 10.3389/fmolb.2020.604794
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
Summary of the Kraken and SHOGUN datasets.
| Index | Cancer Type | Sample size (Kraken dataset) | Sample size (SHOGUN dataset) |
|---|---|---|---|
| 1 | Adrenocortical carcinoma | 79 | 79 |
| 2 | Bladder urothelial carcinoma | 729 | 729 |
| 3 | Brain lower grade glioma | 731 | 731 |
| 4 | Breast invasive carcinoma | 1483 | 1483 |
| 5 | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 451 | 451 |
| 6 | Cholangiocarcinoma | 45 | 45 |
| 7 | Colon adenocarcinoma | 1006 | 417 |
| 8 | Esophageal carcinoma | 340 | 340 |
| 9 | Glioblastoma multiforme | 489 | 338 |
| 10 | Head and Neck squamous cell carcinoma | 906 | 297 |
| 11 | Kidney chromophobe | 191 | 65 |
| 12 | Kidney renal clear cell carcinoma | 1141 | 1114 |
| 13 | Kidney renal papillary cell carcinoma | 393 | 23 |
| 14 | Liver hepatocellular carcinoma | 523 | 162 |
| 15 | Lung adenocarcinoma | 911 | 911 |
| 16 | Lung squamous cell carcinoma | 638 | 534 |
| 17 | Lymphoid neoplasm diffuse large B-cell lymphoma | 61 | 61 |
| 18 | Mesothelioma | 87 | 87 |
| 19 | Ovarian serous cystadenocarcinoma | 1031 | 1031 |
| 20 | Pancreatic adenocarcinoma | 183 | 183 |
| 21 | Pheochromocytoma and Paraganglioma | 186 | 186 |
| 22 | Prostate adenocarcinoma | 829 | 829 |
| 23 | Rectum adenocarcinoma | 372 | 372 |
| 24 | Sarcoma | 347 | 347 |
| 25 | Skin cutaneous melanoma | 792 | 667 |
| 26 | Stomach adenocarcinoma | 1079 | 1079 |
| 27 | Testicular germ cell tumors | 139 | 139 |
| 28 | Thymoma | 122 | 122 |
| 29 | Thyroid carcinoma | 880 | 287 |
| 30 | Uterine carcinosarcoma | 57 | 57 |
| 31 | Uterine corpus endometrial carcinoma | 1222 | 169 |
| 32 | Uveal melanoma | 182 | 182 |
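The sample sizes above are strongly imbalanced (for example, 23 versus 1483 samples in the SHOGUN dataset), which is the situation the synthetic minority oversampling technique (SMOTE) used in the analysis pipeline is designed for. The following is a minimal pure-Python sketch of the general SMOTE idea, not the implementation used in the study: the function name `smote`, its parameters, and the toy points are illustrative assumptions.

```python
import random
from math import dist


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority class with three 2-D samples (placeholder values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Each synthetic sample lies on the line segment between a real minority sample and one of its nearest minority neighbors, so oversampling adds plausible points without duplicating existing ones.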
FIGURE 1. Venn diagram showing the shared and dataset-specific features of the two datasets. Each dataset contains several exclusive features.
FIGURE 2. Flow chart of the analysis procedure. Each of the two datasets is analyzed with minimum redundancy maximum relevance (mRMR), yielding a feature list for each dataset. Incremental feature selection (IFS), which incorporates the synthetic minority oversampling technique (SMOTE), four classification algorithms, and 10-fold cross-validation, is then applied to each feature list. The results include interpretable classification rules, key features, and efficient classification models.
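The incremental feature selection step described in the flow chart can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the ranked feature list is taken as given (mRMR itself is not implemented), a 1-nearest-neighbor classifier stands in for the four algorithms, accuracy replaces MCC as the selection score, folds are interleaved deterministically, and SMOTE is omitted. All names and the toy data are hypothetical.

```python
from math import dist


def one_nn_predict(train_X, train_y, x):
    """Predict the label of x as the label of its nearest training sample."""
    nearest = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[nearest]


def cv_accuracy(X, y, k=5):
    """k-fold cross-validated accuracy with deterministic interleaved folds."""
    correct = 0
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(one_nn_predict(train_X, train_y, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, k=5):
    """Score the top-1, top-2, ... feature subsets and return the best one."""
    curve = []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        X_sub = [[row[j] for j in subset] for row in X]
        curve.append((n, cv_accuracy(X_sub, y, k)))
    # max() keeps the first maximum, so ties favor the smaller subset
    best_n, best_score = max(curve, key=lambda item: item[1])
    return ranked_features[:best_n], best_score, curve


# Toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
X = [[0.0, 5.0, 1.0], [0.1, -4.0, 1.0], [0.2, 3.0, 1.0], [0.3, -2.0, 1.0],
     [1.0, 4.5, 1.0], [1.1, -3.5, 1.0], [1.2, 2.5, 1.0], [1.3, -1.5, 1.0]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
ranked = [0, 2, 1]  # hypothetical ranking, assumed to come from mRMR

best_subset, best_score, ifs_curve = incremental_feature_selection(X, y, ranked, k=4)
```

The list `ifs_curve` holds one (number of features, score) pair per subset size, which is the kind of data an IFS curve plots.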
FIGURE 3. IFS curves yielded by models with different classification algorithms on the Kraken dataset. The random forest with the top 582 features produces the highest MCC of 0.918.
Summary of the performance of the best model with different classification algorithms on two datasets.
| Classification algorithm | Kraken: number of features | Kraken: ACC | Kraken: MCC | SHOGUN: number of features | SHOGUN: ACC | SHOGUN: MCC |
|---|---|---|---|---|---|---|
| Random forest | 582 | 0.921 | 0.918 | 146 | 0.884 | 0.878 |
| Support vector machine | 1989 | 0.588 | 0.575 | 1592 | 0.633 | 0.616 |
| k-nearest neighbor | 682 | 0.812 | 0.804 | 277 | 0.895 | 0.889 |
| Decision tree | 580 | 0.736 | 0.724 | 1481 | 0.824 | 0.814 |
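The table reports accuracy (ACC) and the Matthews correlation coefficient (MCC) for a 32-class problem. The record does not state which multiclass form of MCC was used, so the sketch below assumes the common generalization of MCC to several classes (Gorodkin's formulation, computed from per-class true and predicted counts). Function names are illustrative.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (assumed Gorodkin form)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified samples
    true_counts = Counter(y_true)                    # samples truly in each class
    pred_counts = Counter(y_pred)                    # samples predicted as each class
    labels = set(true_counts) | set(pred_counts)
    cov_tp = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = s * s - sum(n * n for n in true_counts.values())
    cov_pp = s * s - sum(n * n for n in pred_counts.values())
    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0 when the denominator vanishes (e.g. a single predicted class)
    return cov_tp / denom if denom else 0.0
```

For two classes this reduces to the usual binary MCC. Unlike ACC, it accounts for how predictions are distributed across classes, which matters when class sizes differ widely.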
FIGURE 4. Violin plot showing the per-cancer-type accuracies produced by the best model of each classification algorithm on the two datasets. (A) Kraken dataset, where the RF model yields the highest accuracies; (B) SHOGUN dataset, where the kNN model yields the highest accuracies.
FIGURE 5. IFS curves yielded by models with different classification algorithms on the SHOGUN dataset. The k-nearest neighbor model with the top 227 features generates the highest MCC of 0.889.
FIGURE 6. Venn diagrams showing the shared and distinct features of the optimum feature subsets obtained by applying a given classification algorithm to each of the two datasets. (A) Random forest; (B) Support vector machine; (C) k-nearest neighbor; (D) Decision tree.
FIGURE 7. Venn diagram showing the shared and distinct features of the global optimum feature subsets on the two datasets.
Number of rules for each cancer type on two datasets.
| Index | Cancer Type | Number of rules (Kraken dataset) | Number of rules (SHOGUN dataset) |
|---|---|---|---|
| 1 | Adrenocortical carcinoma | 24 | 8 |
| 2 | Bladder urothelial carcinoma | 235 | 168 |
| 3 | Brain lower grade glioma | 128 | 53 |
| 4 | Breast invasive carcinoma | 310 | 173 |
| 5 | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 91 | 89 |
| 6 | Cholangiocarcinoma | 12 | 13 |
| 7 | Colon adenocarcinoma | 217 | 112 |
| 8 | Esophageal carcinoma | 55 | 33 |
| 9 | Glioblastoma multiforme | 31 | 18 |
| 10 | Head and Neck squamous cell carcinoma | 228 | 85 |
| 11 | Kidney chromophobe | 40 | 16 |
| 12 | Kidney renal clear cell carcinoma | 164 | 104 |
| 13 | Kidney renal papillary cell carcinoma | 132 | 22 |
| 14 | Liver hepatocellular carcinoma | 122 | 60 |
| 15 | Lung adenocarcinoma | 232 | 150 |
| 16 | Lung squamous cell carcinoma | 143 | 135 |
| 17 | Lymphoid neoplasm diffuse large B-cell lymphoma | 15 | 18 |
| 18 | Mesothelioma | 19 | 32 |
| 19 | Ovarian serous cystadenocarcinoma | 59 | 35 |
| 20 | Pancreatic adenocarcinoma | 65 | 59 |
| 21 | Pheochromocytoma and Paraganglioma | 26 | 29 |
| 22 | Prostate adenocarcinoma | 203 | 137 |
| 23 | Rectum adenocarcinoma | 89 | 86 |
| 24 | Sarcoma | 56 | 68 |
| 25 | Skin cutaneous melanoma | 210 | 94 |
| 26 | Stomach adenocarcinoma | 129 | 69 |
| 27 | Testicular germ cell tumors | 38 | 23 |
| 28 | Thymoma | 33 | 14 |
| 29 | Thyroid carcinoma | 199 | 101 |
| 30 | Uterine carcinosarcoma | 10 | 10 |
| 31 | Uterine corpus endometrial carcinoma | 228 | 11 |
| 32 | Uveal melanoma | 36 | 5 |
| Total | – | 3579 | 2030 |
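One way to read the rule counts above, assuming each rule corresponds to one root-to-leaf path of a trained decision tree, is sketched below. The nested-dictionary tree format, the function name `extract_rules`, and the taxon names and thresholds are hypothetical placeholders, not results from the study.

```python
from collections import Counter


def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf: the path walked so far is a complete rule
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


# Hypothetical toy tree; feature names and thresholds are placeholders.
toy_tree = {
    "feature": "genus_A", "threshold": 0.5,
    "left": {"label": "Colon adenocarcinoma"},
    "right": {
        "feature": "genus_B", "threshold": 1.2,
        "left": {"label": "Stomach adenocarcinoma"},
        "right": {"label": "Colon adenocarcinoma"},
    },
}

rules = extract_rules(toy_tree)
rules_per_type = Counter(label for _, label in rules)  # analogous to the per-type counts
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN {label}")
```

Under this reading, the per-cancer-type numbers count the leaves that predict each type, and the totals count all leaves of the tree built on each dataset.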