Carlos S Casimiro-Soriguer, Carlos Loucera, María Peña-Chilet, Joaquin Dopazo.
Abstract
The gut microbiome is gaining interest because of its links with several diseases, including colorectal cancer (CRC), and because it could yield non-intrusive predictive disease biomarkers. Here we performed a meta-analysis of 1042 fecal metagenomic samples from seven publicly available studies. We used an interpretable machine learning approach based on functional profiles, instead of the conventional taxonomic profiles, to produce a highly accurate predictor of CRC with better precision than those of previous proposals. Moreover, this approach is also able to discriminate samples with adenoma, which makes it very promising for CRC prevention by detecting early stages in which intervention is easier and more effective. In addition, interpretable machine learning methods allow extracting the features relevant for the classification, which reveals basic molecular mechanisms accounting for the changes undergone by the microbiome functional landscape in the transition from healthy gut to adenoma and CRC conditions. Functional profiles have demonstrated higher accuracy than taxonomic profiles in predicting CRC and adenoma conditions and, in a context of explainable machine learning, additionally provide useful hints on the molecular mechanisms operating in the microbiota behind these conditions.
Year: 2022 PMID: 35013454 PMCID: PMC8748837 DOI: 10.1038/s41598-021-04182-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Datasets used in the study.
| Project ID | Dataset name | References | Samples | Mean aligned reads |
|---|---|---|---|---|
| PRJNA389927 | Hannigan | [ | 82 | 2,308,712 |
| PRJEB12449 | Vogtmann | [ | 104 | 3,897,639 |
| PRJEB6070 | Zeller | [ | 199 | 4,517,730 |
| PRJEB7774 | Feng | [ | 132 | 9,154,788 |
| PRJEB10878 | Yu | [ | 128 | 18,372,510 |
| PRJNA447983 | Thomas0 | [ | 124 | 14,841,290 |
| PRJEB27928 | Thomas1 | [ | 82 | 6,518,536 |
Figure 1. Stability estimate along with the confidence interval (alpha = 0.05) for the Random Stability Sub Sampling (RSSS-test) and 20-times tenfold cross-validation (CV-test) splitting schemas, for each metagenomic profile (KEGG, eggNOG and taxonomic) and project. The vertical bars indicate the theoretical thresholds on the effect size: scores below 0.4 represent poor agreement, scores between 0.4 and 0.7 a good enough agreement, and scores above 0.7 a near-perfect agreement.
Figure 2. Rank stability estimate (the mean of all the hyperbolic-weighted tau pairwise rank comparisons) along with the 0.25 and 0.75 quantiles for the Random Stability Sub Sampling (RSSS-test) and 20-times tenfold cross-validation (CV-test) splitting schemas, for each metagenomic profile (KEGG, eggNOG and taxonomic) and project.
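The rank stability statistic named in this caption can be illustrated with a small sketch. It assumes the additive, hyperbolically weighted form of Kendall's tau (weight 1/(r + 1) for rank r, averaged over both ranking directions); the function names and the toy relevance scores are hypothetical and are not taken from the paper. SciPy's `scipy.stats.weightedtau` provides a library implementation with hyperbolic weights by default.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def weighted_tau_one_way(x, y):
    """Weighted tau with additive hyperbolic weights; ranks come from x (ties broken by y)."""
    n = len(x)
    # Rank 0 is the most relevant feature: sort by decreasing (x, y).
    order = sorted(range(n), key=lambda i: (x[i], y[i]), reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    num = norm_x = norm_y = 0.0
    for i, j in combinations(range(n), 2):
        # Disagreements among top-ranked features weigh more than those in the tail.
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        sx, sy = sign(x[i] - x[j]), sign(y[i] - y[j])
        num += sx * sy * w
        norm_x += sx * sx * w
        norm_y += sy * sy * w
    # Undefined (division by zero) if either score list is constant.
    return num / sqrt(norm_x * norm_y)


def weighted_tau(x, y):
    """Average of the two one-way values, so neither list is privileged."""
    return 0.5 * (weighted_tau_one_way(x, y) + weighted_tau_one_way(y, x))


def rank_stability(score_lists):
    """Mean weighted tau over all pairs of relevance-score lists."""
    taus = [weighted_tau(a, b) for a, b in combinations(score_lists, 2)]
    return sum(taus) / len(taus)


# Hypothetical relevance scores for four features across three resampling runs.
runs = [
    [2.4, 1.6, 0.9, 0.1],
    [2.2, 1.7, 0.8, 0.2],
    [2.5, 0.7, 1.5, 0.1],
]
print(rank_stability(runs))
```

The first two runs rank the features identically, so their pairwise value is 1; the third run swaps two features and pulls the mean below 1.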
Figure 3. Mean of the area under the receiver operating characteristic curve (AUROC) along with the 0.25 and 0.75 quantiles for the Random Stability Sub Sampling (RSSS-test) and 20-times tenfold cross-validation (CV-test) splitting schemas when discriminating between CRC and healthy samples, for each metagenomic profile (KEGG, eggNOG and taxonomic) and project.
Figure 4. Cross-prediction matrix that measures the performance of the proposed model in terms of the area under the receiver operating characteristic curve (AUROC) for (A) taxonomic, (B) KEGG and (C) eggNOG metagenomic profiles. The diagonal represents the intra-project performance by reporting the mean AUROC of a 20-times tenfold cross-validation, whereas the off-diagonal cells show the cross-dataset performance, i.e. training on the project indicated in the row and testing on the project in the column. Finally, the Leave one Project Out (LOPO) row reports the performance of predicting the dataset referred to in the column while training with the other datasets, whereas the oLOPO row is the same experiment but using the functional signature learned during the LOPO procedure.
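The Leave one Project Out splitting described in this caption can be sketched as follows. This is a minimal illustration of the splitting logic only, with no model training; the function name and the toy sample-to-project assignment are assumptions made for the example.

```python
def leave_one_project_out(projects):
    """Yield (held_out, train_idx, test_idx) once per distinct project label."""
    for held_out in sorted(set(projects)):
        # Train on every sample that does not belong to the held-out project.
        train_idx = [i for i, p in enumerate(projects) if p != held_out]
        # Test only on the samples of the held-out project.
        test_idx = [i for i, p in enumerate(projects) if p == held_out]
        yield held_out, train_idx, test_idx


# Project names come from the datasets table; the per-sample assignment is made up.
sample_projects = ["Zeller", "Zeller", "Feng", "Yu", "Yu", "Yu", "Feng"]
splits = list(leave_one_project_out(sample_projects))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Each column of the LOPO row would then correspond to one such split: a model fitted on `train_idx` and scored on `test_idx`.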
Figure 5. Significance of the cross-validated score through the use of the target permutation technique for each metagenomic profile (KEGG, eggNOG and taxonomic). The p-value approximates the probability that the score for each profile would be the result of chance. The number of permutations is 100 for each profile, using the consensus signature previously learnt and a 100-times tenfold cross-validation schema. Note that the worst outcome is 1 and the best is ~0.009. The vertical lines for each profile report the true score without permuting the outcome (being CRC or healthy) and the luck threshold (in black), whereas the continuous color lines show the distribution of the permutation scores (i.e. the null distribution) for each profile.
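A target permutation test of this kind can be sketched as below. The sketch makes two simplifying assumptions that are not in the source: the model scores are held fixed instead of being re-estimated by cross-validation for every permutation, and the p-value uses the common (C + 1) / (n + 1) estimator, where C is the number of permutations scoring at least as well as the true labels. With n = 100 that estimator cannot fall below 1/101 ≈ 0.0099, which is consistent with the best attainable value quoted in the caption. Function names and toy data are hypothetical.

```python
import random


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations=100, seed=0):
    """Return the true AUROC and its permutation p-value."""
    rng = random.Random(seed)
    true_score = auroc(scores, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        # Shuffling the outcome breaks any real link between scores and labels.
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= true_score:
            at_least_as_good += 1
    return true_score, (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical predicted CRC probabilities (1 = CRC, 0 = healthy).
scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
true_score, p_value = permutation_p_value(scores, labels)
print(true_score, p_value)
```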
Figure 6. Radar plot with the performances of the different comparisons of the distribution of the probabilities between pairs of categories of samples using a Mann–Whitney rank test. Comparisons are, clockwise: adenoma < tumor (A < T), healthy < adenoma (H < A), healthy < small adenoma (H < S), small adenoma < adenoma (S < A) and healthy < tumor (H < T). Models were trained with taxonomic (Taxo) and functional features (KEGG and eggNOG).
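A one-sided comparison such as H < A can be illustrated with the following sketch of a Mann–Whitney rank test. It uses the normal approximation without tie or continuity correction, which is an assumption of this example and not necessarily the variant used in the paper; the function name and the predicted probabilities are hypothetical.

```python
from statistics import NormalDist


def mann_whitney_less(x, y):
    """One-sided test that x tends to be smaller than y (normal approximation)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value (ties count 0.5).
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    # A small U (few x values above y values) gives a small p-value.
    p_value = NormalDist().cdf((u - mean_u) / sd_u)
    return u, p_value


# Hypothetical predicted CRC probabilities for two sample categories.
healthy = [0.10, 0.15, 0.22, 0.30, 0.35]
adenoma = [0.28, 0.40, 0.45, 0.52, 0.61]
u_stat, p_less = mann_whitney_less(healthy, adenoma)
print(u_stat, p_less)
```

A small p-value here supports the ordering healthy < adenoma in predicted probability; swapping the two arguments tests the opposite ordering.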
The 20 most relevant taxa selected by the machine learning method used.
| Relevance score | Name | Taxon ID |
|---|---|---|
| 2.41252 | | 33033 |
| 1.59494 | | 457405 |
| 1.56152 | | 469607 |
| 1.28725 | | 76859 |
| 0.96797 | | 879243 |
| 0.81790 | | 525283 |
| 0.70496 | | 39950 |
| 0.62700 | | 76857 |
| 0.61251 | | 29391 |
| 0.57994 | | 143387 |
| 0.53820 | | 1509 |
| 0.53034 | | 862971 |
| 0.50665 | | 2584943 |
| 0.49567 | | 649756 |
| 0.47842 | | 546 |
| 0.46420 | | 155615 |
| 0.44508 | | 361101 |
| 0.44365 | | 537007 |
| 0.43355 | | 469604 |
| 0.41181 | | 469602 |
The 20 most relevant KEGG features selected by the machine learning method used here.
| Relevance score | Name | KEGG ID |
|---|---|---|
| 1.50544 | glmS, mutS, mamA; methylaspartate mutase sigma subunit [EC:5.4.99.1] | K01846 |
| 1.34932 | mal; methylaspartate ammonia-lyase [EC:4.3.1.2] | K04835 |
| 0.77486 | rocR; arginine utilization regulatory protein | K06714 |
| 0.73183 | pldB; lysophospholipase [EC:3.1.1.5] | K01048 |
| 0.73001 | 6GAL; galactan endo-1,6-beta-galactosidase [EC:3.2.1.164] | K18579 |
| 0.72130 | pdaA; peptidoglycan-N-acetylmuramic acid deacetylase [EC:3.5.1.-] | K01567 |
| 0.71882 | MARS, metG; methionyl-tRNA synthetase [EC:6.1.1.10] | K01874 |
| 0.70670 | thiQ; thiamine transport system ATP-binding protein [EC:7.6.2.15] | K02062 |
| 0.70362 | epr; minor extracellular protease Epr [EC:3.4.21.-] | K13277 |
| 0.66044 | kamA; lysine 2,3-aminomutase [EC:5.4.3.2] | K01843 |
| 0.62861 | E2.1.1.77, pcm; protein-L-isoaspartate(D-aspartate) O-methyltransferase [EC:2.1.1.77] | K00573 |
| 0.62662 | troC, mntC, znuB; manganese/zinc/iron transport system permease protein | K11708 |
| 0.62170 | glpX; fructose-1,6-bisphosphatase II [EC:3.1.3.11] | K02446 |
| 0.61670 | bpsB, srsB; methyltransferase | K16168 |
| 0.61284 | spoIIP; stage II sporulation protein P | K06385 |
| 0.60627 | waaY, rfaY; heptose II phosphotransferase [EC:2.7.1.-] | K02850 |
| 0.60596 | FBA, fbaA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13] | K01624 |
| 0.60594 | rgpF; rhamnosyltransferase [EC:2.4.1.-] | K07272 |
| 0.59660 | tex; protein Tex | K06959 |
| 0.58759 | murA; UDP-N-acetylglucosamine 1-carboxyvinyltransferase [EC:2.5.1.7] | K00790 |
Metabolites described as systematically deregulated in cancer and their relevance in the model using KEGG functional features.
| Metabolite_name | HMDB_ID | KEGG_compound_ID | Frequency | model_KEGG_KO_score |
|---|---|---|---|---|
| Glycine | HMDB0000123 | C00037 | 24 | 4.823568715 |
| L-Valine | HMDB0000883 | C00183 | 23 | 0.6930878983 |
| L-Alanine | HMDB0000161 | C00041 | 22 | 3.22739351 |
| L-Lactic acid | HMDB0000190 | C00186 | 22 | 0.5279944356 |
| L-Phenylalanine | HMDB0000159 | C00079 | 20 | 2.117857375 |
| L-Proline | HMDB0000162 | C00148 | 20 | 1.434113835 |
| L-Leucine | HMDB0000687 | C00123 | 20 | 0.3006354234 |
| L-Glutamic acid | HMDB0000148 | C00025 | 17 | 13.32534903 |
| Taurine | HMDB0000251 | C00245 | 16 | 0.9103833031 |
| Palmitic acid | HMDB0000220 | C00249 | 15 | 0.3209053449 |
| L-Methionine | HMDB0000696 | C00073 | 15 | 4.503947812 |
| Glycerol | HMDB0000131 | C00116 | 14 | 1.125734858 |
| L-Tyrosine | HMDB0000158 | C00082 | 14 | 1.718910949 |
| L-Threonine | HMDB0000167 | C00188 | 14 | 1.354361221 |
| L-Isoleucine | HMDB0000172 | C00407 | 14 | 0.3873029906 |
| L-Serine | HMDB0000187 | C00065 | 14 | 2.042185229 |
| L-Aspartic acid | HMDB0000191 | C00049 | 14 | 3.752760624 |
| D-Glucose | HMDB0000122 | C00221 | 13 | 0.9079622305 |
| L-Lysine | HMDB0000182 | C00047 | 12 | 1.845182919 |
| L-Arginine | HMDB0000517 | C00062 | 12 | 1.818335229 |
| L-Glutamine | HMDB0000641 | C00064 | 12 | 5.074283424 |
| Choline | HMDB0000097 | C00114 | 11 | 0.3910553328 |
| L-Asparagine | HMDB0000168 | C00152 | 11 | 0.7764985602 |
| myo-Inositol | HMDB0000211 | C00137 | 11 | 0.3847705919 |
| Succinic acid | HMDB0000254 | C00042 | 11 | 1.819066672 |
| L-Tryptophan | HMDB0000929 | C00078 | 11 | 1.028295479 |
| Acetic acid | HMDB0000042 | C00033 | 10 | 4.986949589 |
| Uridine | HMDB0000296 | C00299 | 10 | 1.376520768 |
HMDB_ID is the identifier in the Human Metabolome Database (https://hmdb.ca/), and the Frequency column denotes the number of studies in which the metabolite was found to be deregulated according to a recent review [53]. The metabolite scores were calculated by adding the scores, from the machine learning model, of the KEGG KOs associated with each metabolite.
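The score aggregation described in this note amounts to a grouped sum, sketched below. The three KO scores are copied from the KEGG feature table, but the KO-to-compound links are placeholders chosen only to show the arithmetic; the source does not list the mapping, so the resulting totals do not reproduce the values in the metabolite table.

```python
from collections import defaultdict

# Relevance scores copied from the KEGG feature table.
ko_scores = {"K01846": 1.50544, "K04835": 1.34932, "K01874": 0.71882}

# Hypothetical KO -> KEGG compound links, for illustration only.
ko_to_compounds = {
    "K01846": ["C00025"],
    "K04835": ["C00025"],
    "K01874": ["C00073"],
}


def metabolite_scores(ko_scores, ko_to_compounds):
    """Sum the model score of every KO linked to each compound."""
    totals = defaultdict(float)
    for ko, score in ko_scores.items():
        for compound in ko_to_compounds.get(ko, []):
            totals[compound] += score
    return dict(totals)


totals = metabolite_scores(ko_scores, ko_to_compounds)
print(totals)
```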
The 20 most relevant eggNOG features selected by the model.
| Score | Feature ID (eggNOG 4.5) | Taxonomic Level | Description |
|---|---|---|---|
| 2.47062 | 08XIZ | bactNOG | Integral membrane protein TIGR02185 |
| 2.30583 | 06J4I | bactNOG | N/A |
| 2.22012 | 0NI2F | firmNOG | Integral membrane protein TIGR02185 |
| 1.93081 | 00DN8 | actNOG | One of the primary rRNA binding proteins, it binds directly to 16S rRNA where it nucleates assembly of the head domain of the 30S subunit. Is located at the subunit interface close to the decoding center, probably blocks exit of the E-site tRNA (By similarity) |
| 1.92116 | 0NTFT | firmNOG | N/A |
| 1.76985 | 0Y9D1 | NOG | N/A |
| 1.73496 | 0EX7J | cloNOG | N/A |
| 1.47985 | 05DDE | bactNOG | Outer membrane autotransporter barrel domain-containing protein |
| 1.46286 | 057E2 | bacteNOG | DNA binding protein, excisionase family |
| 1.4445 | 0587E | bacteNOG | Protein of unknown function (DUF1446) |
| 1.37804 | 06F02 | bactNOG | N/A |
| 1.34625 | 05CMH | bactNOG | DEHYDRATASE |
| 1.33923 | 08NTT | bactNOG | N/A |
| 1.33323 | 079XJ | bactNOG | Major outer membrane protein |
| 1.31216 | 08C1U | bactNOG | N/A |
| 1.29914 | 08HM2 | bactNOG | Cell wall binding repeat 2-containing protein |
| 1.26147 | 059H0 | bacteNOG | Methylaspartate mutase, E subunit |
| 1.25394 | 05DCU | bactNOG | 2-Hydroxyglutaryl-CoA dehydratase |
| 1.24681 | 08BIB | bactNOG | Hypothetical bacterial integral membrane protein (Trep_Strep) |
| 1.23689 | 07I7J | bactNOG | s-layer protein |