Dongmei Ai, Hongfei Pan, Rongbao Han, Xiaoxin Li, Gang Liu, Li C Xia.
Abstract
An imbalance of the human gut microbiota has been associated with colorectal cancer. In recent years, metagenomics research has produced a large amount of data, enabling the study of the specific roles that gut microbes play in the onset and progression of cancer. We removed unrelated and redundant features during feature selection using mutual information. We then trained a random forest classifier on a large metagenomics dataset of colorectal cancer patients and healthy people, assembled from published reports, and extracted and analysed the information from the learned decision trees. We identified key microbial species associated with colorectal cancer. These microbes included Porphyromonas asaccharolytica, Peptostreptococcus stomatis, Fusobacterium, Parvimonas sp., Streptococcus vestibularis and Flavonifractor plautii. We obtained the optimal splitting abundance thresholds for these species to distinguish between healthy and colorectal cancer samples. The extracted consensus decision tree may be applicable to the diagnosis of colorectal cancer.
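The mutual-information filter mentioned in the abstract can be illustrated with a minimal sketch. It assumes abundances are binarized into present/absent before scoring, which is a simplification chosen here and not necessarily the discretization the authors used. The function name `mutual_information`, the microbe names, and the toy data are all hypothetical, and the sketch covers only relevance to the class label, not the removal of redundant features.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy data (hypothetical): 1 = colorectal cancer, 0 = healthy
labels = [1, 1, 1, 1, 0, 0, 0, 0]
abundances = {
    "microbe_a": [0.01, 0.02, 0.03, 0.0, 0.0, 0.0, 0.0, 0.0],   # mostly seen in cases
    "microbe_b": [0.01, 0.0, 0.02, 0.0, 0.03, 0.0, 0.01, 0.0],  # unrelated to the label
}

# Assumption: presence/absence is used as the discrete feature value.
scores = {
    name: mutual_information([int(v > 0) for v in values], labels)
    for name, values in abundances.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Features whose score falls near zero, such as `microbe_b` here, would be the candidates to drop before training the classifier.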
Keywords: colorectal cancer; microbial community analysis; microbial relative abundances; mutual information; random forest
Year: 2019 PMID: 30717284 PMCID: PMC6410271 DOI: 10.3390/genes10020112
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The workflow for metagenomic analysis.
Figure 2. Schematic diagram of one decision tree. Decision trees have nodes, and every node includes a feature ID (microbe ID), split value, Gini index, and sample number.
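
The Gini index stored at each node measures how mixed the classes are among the samples reaching that node. A minimal sketch of the standard definition follows; the function name `gini_index` and the class counts are hypothetical and are not taken from the study data.

```python
def gini_index(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical node contents as [healthy, cancer] counts
pure_node = gini_index([5, 0])       # only one class present
balanced_node = gini_index([50, 50])  # evenly mixed, the two-class maximum
mixed_node = gini_index([3, 2])
```

For two classes the value ranges from 0 (pure node) to 0.5 (evenly mixed node).
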
Schematic diagram of node information in a decision tree.
| Tree ID | Node Index | Father | Layer | Microbe ID | Split Value | Gini Index | Sample Number |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 324 | 0 | 0.4829 | 126 |
| 0 | 1 | 0 | 1 | 443 | 0.0006 | 0.3648 | 76 |
| 0 | 2 | 1 | 2 | 313 | 0.0001 | 0.4861 | 19 |
| 0 | 3 | 2 | 3 | 170 | 0.0001 | 0.4688 | 13 |
| 0 | 4 | 3 | 4 | 275 | 0 | 0.48 | 6 |
| 0 | 9 | 1 | 2 | 85 | 0 | 0.1818 | 54 |
| 0 | 11 | 9 | 3 | 428 | 0.0003 | 0.0907 | 51 |
| 0 | 13 | 11 | 4 | 429 | 0 | 0.2975 | 14 |
| 0 | 16 | 0 | 1 | 167 | 0 | 0.4543 | 53 |
| 0 | 17 | 16 | 2 | 90 | 0.0031 | 0.1327 | 8 |
| 0 | 20 | 16 | 2 | 440 | 0.0034 | 0.3607 | 45 |
| 0 | 21 | 20 | 3 | 273 | 0.0011 | 0.2706 | 39 |
| 0 | 22 | 21 | 4 | 319 | 0.0027 | 0.1349 | 35 |
| 0 | 25 | 21 | 4 | 458 | 0 | 0.2449 | 4 |
| 0 | 28 | 20 | 3 | 132 | 0.002 | 0.42 | 6 |
| 0 | 29 | 28 | 4 | 233 | 0.0002 | 0.2188 | 5 |
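
The Father and Layer columns are enough to rebuild the path from any node back to the root. The sketch below uses a few rows copied from the node table; the `Node` record and the helper names are hypothetical, and treating the node at layer 0 as the root is an assumption based on how the table reads.

```python
from typing import NamedTuple

class Node(NamedTuple):
    tree_id: int
    index: int
    father: int
    layer: int
    microbe_id: int
    split_value: float
    gini: float
    samples: int

# Rows copied from the node table (tree 0)
rows = [
    (0, 0, 0, 0, 324, 0.0, 0.4829, 126),
    (0, 1, 0, 1, 443, 0.0006, 0.3648, 76),
    (0, 2, 1, 2, 313, 0.0001, 0.4861, 19),
    (0, 16, 0, 1, 167, 0.0, 0.4543, 53),
    (0, 20, 16, 2, 440, 0.0034, 0.3607, 45),
]
nodes = {row[1]: Node(*row) for row in rows}

def path_to_root(index):
    """Node indices from the root down to the given node, following Father links."""
    path = [index]
    while nodes[path[-1]].layer > 0:
        path.append(nodes[path[-1]].father)
    return path[::-1]

def layers_consistent():
    """Check that every non-root node sits one layer below its father."""
    return all(
        n.layer == 0 or nodes[n.father].layer + 1 == n.layer
        for n in nodes.values()
    )

path = path_to_root(20)
microbes_on_path = [nodes[i].microbe_id for i in path]
```

Here `path` lists node indices and `microbes_on_path` lists the microbe IDs tested on the way down to node 20.
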
Examples of the number of occurrences in various decision tree layers and the overall score of features.
| Microbe ID | Layer 0 | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Score | Microbial Species |
|---|---|---|---|---|---|---|---|
| 334 | 213 | 262 | 270 | 239 | 181 | 232.437 | |
| 200 | 168 | 160 | 146 | 122 | 111 | 154.21 | |
| 324 | 151 | 156 | 155 | 134 | 94 | 146.661 | |
| 220 | 177 | 127 | 129 | 117 | 80 | 145.268 | |
| 350 | 144 | 147 | 157 | 131 | 132 | 144.319 | |
| 443 | 117 | 149 | 170 | 151 | 136 | 136.618 | |
| 343 | 119 | 129 | 161 | 138 | 122 | 129.228 | |
| 332 | 130 | 138 | 118 | 118 | 89 | 125.879 | |
| 226 | 131 | 128 | 125 | 115 | 68 | 123.025 | |
| 323 | 135 | 122 | 81 | 133 | 107 | 122.101 | |
| 233 | 117 | 104 | 123 | 103 | 111 | 112.955 | |
| 213 | 82 | 130 | 131 | 142 | 125 | 109.201 | |
| 217 | 135 | 103 | 82 | 78 | 47 | 107.802 | |
| 139 | 103 | 113 | 111 | 96 | 82 | 104.154 | |
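
The per-layer occurrence counts can be obtained by tallying, over all trees in the forest, how often each microbe ID is used as a split feature at each layer. The sketch below shows only that tally on a hypothetical three-tree forest. The formula that combines the counts into the overall score is not given in this text, so it is deliberately not reproduced; the function name and the input triples are assumptions.

```python
from collections import defaultdict

def count_layer_occurrences(forest_nodes, max_layer=4):
    """Map microbe_id -> list of split counts for layers 0..max_layer."""
    counts = defaultdict(lambda: [0] * (max_layer + 1))
    for _tree_id, layer, microbe_id in forest_nodes:
        if layer <= max_layer:
            counts[microbe_id][layer] += 1
    return dict(counts)

# Hypothetical (tree_id, layer, microbe_id) triples from three small trees
forest_nodes = [
    (0, 0, 324), (0, 1, 443), (0, 1, 167),
    (1, 0, 334), (1, 1, 324), (1, 2, 443),
    (2, 0, 334), (2, 1, 443), (2, 2, 324),
]
occurrences = count_layer_occurrences(forest_nodes)
```

Each value in `occurrences` corresponds to one table row: the counts for layers 0 through 4 of a single microbe ID.
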
Information about the two sample groups.
| Study Population | Healthy | Small Adenoma (<1 cm) | Large Adenoma (≥1 cm) | CRC Stage 0 (Early) | CRC Stage I (Early) | CRC Stage II (Early) | CRC Stage III (Late) | CRC Stage IV (Late) | Country of Residence |
|---|---|---|---|---|---|---|---|---|---|
| F | 61 | 27 | 15 | 0 | 15 | 7 | 10 | 21 | France |
| A | 63 | 47 (total) | | 46 (total) | | | | | Austria |
Microbial species with high scores and abundance thresholds.
| Microbe ID | Microbial Species | Score | Abundance Threshold |
|---|---|---|---|
| 334 | | 232.437 | 3.052 × 10⁻⁵ |
| 200 | | 154.21 | 0.006662 |
| 324 | | 146.661 | 1.391 × 10⁻⁵ |
| 220 | | 145.268 | 0 |
| 350 | | 144.319 | 0 |
| 443 | | 136.618 | 0.0006701 |
| 343 | | 129.228 | 0.000179 |
| 332 | | 125.879 | 9.154 × 10⁻⁵ |
| 226 | | 123.025 | 9.15 × 10⁻⁵ |
| 323 | | 122.101 | 7.63 × 10⁻⁵ |
| 233 | | 112.955 | 5.19 × 10⁻⁵ |
| 213 | | 109.201 | 9.83 × 10⁻⁵ |
| 217 | | 107.802 | 0 |
| 139 | | 104.154 | 0.000912 |
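
An abundance threshold partitions samples the way a single decision-tree node would. The sketch below applies the threshold listed for microbe ID 443 to three hypothetical samples. Sending values at or below the threshold to the left branch is an assumed convention, and which branch corresponds to colorectal cancer is not stated in this text, so the sketch makes no diagnostic call.

```python
def split_by_threshold(samples, microbe_id, threshold):
    """Partition sample IDs: abundance <= threshold goes left, otherwise right."""
    left, right = [], []
    for sample_id, abundances in samples.items():
        # A microbe missing from a sample is treated as abundance 0.0 (assumption).
        value = abundances.get(microbe_id, 0.0)
        (left if value <= threshold else right).append(sample_id)
    return left, right

# Hypothetical relative abundances keyed by microbe ID
samples = {
    "s1": {443: 0.002},
    "s2": {443: 0.0001},
    "s3": {},
}
THRESHOLD_443 = 0.0006701  # taken from the threshold table for microbe ID 443
left, right = split_by_threshold(samples, 443, THRESHOLD_443)
```

A threshold of 0, as listed for IDs 220, 350 and 217, reduces under this convention to a presence/absence test.
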
Figure 3. Top 20 microbial species with high relative abundance in samples of different disease stages. (A) The relative abundances of the top 20 microbes in the healthy samples. (B) The relative abundances of the top 20 microbes in small adenoma patients. (C) The relative abundances of the top 20 microbes in large adenoma patients. (D) The relative abundances of the top 20 microbes in colorectal cancer patients. The horizontal axis shows the abbreviated names of the microbial species, the vertical axis shows the relative abundance of the corresponding microbes, and the different colored bars in the box plot indicate different microbial genera in Figure 3A–D.