Alan Le Goallec1,2, Braden T Tierney1,3,4,5, Jacob M Luber1, Evan M Cofer6, Aleksandar D Kostic3,4,5, Chirag J Patel1.
Abstract
The microbiome is a new frontier for building predictors of human phenotypes. However, machine learning in the microbiome is fraught with issues of reproducibility, driven in large part by the wide range of analytic models and metagenomic data types available. We aimed to build robust metagenomic predictors of host phenotype by comparing prediction performances and biological interpretation across 8 machine learning methods and 4 different types of metagenomic data. Using 1,570 samples from 300 infants, we fit 7,865 models for 6 host phenotypes. We demonstrate the dependence of accuracy on algorithm choice and feature definition in microbiome data and propose a framework for building microbiome-derived indicators of host phenotype. We additionally identify biological features predictive of age, sex, breastfeeding status, historical antibiotic usage, country of origin, and delivery type. Our complete results can be viewed at http://apps.chiragjpgroup.org/ubiome_predictions/.
Year: 2020 PMID: 32392251 PMCID: PMC7241849 DOI: 10.1371/journal.pcbi.1007895
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. A) Data/feature processing pipeline. We aggregate our data, and for each sample we identify the abundance of each species found within it via MetaPhlAn2. We de novo assemble each sample and identify the non-redundant set of microbial genes within them. We quantify and normalize the abundance of each gene and then cluster the genes by co-occurrence into CAGs. We then collapse raw genes into BioCyc pathways. Finally, we extract genes for modeling from phenotype-associated CAGs. B) Machine learning pipeline. Raw data are cleaned according to phenotypic variable completeness. We then use nested cross-validation and a suite of machine learning tools to run our prediction analysis.
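The nested cross-validation named in panel B can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy data, the 1-D nearest-neighbour regressor, the fold counts, and every function name below are assumptions chosen only to show the structure, in which an inner loop picks a hyperparameter using the outer-training data alone and the outer loop scores that choice on held-out samples.

```python
import random
from statistics import mean


def k_folds(indices, k, rng):
    """Shuffle the indices and deal them into k nearly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (toy model)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(y for _, y in nearest)


def mse(train, test, k):
    """Mean squared error of the toy model fit on `train`, scored on `test`."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in test)


def nested_cv(data, grid, outer_k=5, inner_k=3, seed=0):
    """Return one held-out error per outer fold."""
    rng = random.Random(seed)
    outer = k_folds(list(range(len(data))), outer_k, rng)
    outer_scores = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train_idx, inner_k, rng)

        def inner_error(k):
            # Inner loop: cross-validate one hyperparameter value
            # without touching the outer test fold.
            errors = []
            for m, val_idx in enumerate(inner):
                fit_idx = [j for n, fold in enumerate(inner) if n != m for j in fold]
                errors.append(mse([data[j] for j in fit_idx],
                                  [data[j] for j in val_idx], k))
            return mean(errors)

        best_k = min(grid, key=inner_error)
        # Refit on all outer-training data and score on the held-out fold.
        outer_scores.append(mse([data[j] for j in train_idx],
                                [data[j] for j in test_idx], best_k))
    return outer_scores


# Hypothetical toy data: a noisy linear trend standing in for a phenotype.
demo_rng = random.Random(1)
demo_data = [(float(x), 2.0 * x + demo_rng.gauss(0, 1)) for x in range(60)]
demo_scores = nested_cv(demo_data, grid=[1, 3, 5, 9])
```

Because the study has repeated samples per infant (1,570 samples from 300 infants), a real analysis would also need to decide whether samples from one infant may fall on both sides of a split; the sketch ignores that question.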
Best-performing machine learning algorithms on the testing set for both experimental groups (including microbiome data) versus the control group (demographic data only).
| Metric | Best Predictor Set | Best Algorithm, Experimental | Best Experimental Performance | Best Algorithm, Baseline | Baseline Performance | Difference Between Best Experimental and Baseline Performance |
|---|---|---|---|---|---|---|
| R-Squared | CAGs + Demographics | Random Forest (Caret) | 0.625 ± 0.021 | Random Forest 2 | 0.120 ± 0.013 | 0.505 |
| AUC of the ROC | Genes + Demographics | Elastic Net (Caret) | 0.796 ± 0.013 | Random Forest 2 | 0.786 ± 0.013 | 0.010 |
| AUC of the ROC | Genes + Demographics | Gradient Boosted Machine (Caret) | 0.794 ± 0.012 | Gradient Boosted Machine (Caret) | 0.786 ± 0.013 | 0.008 |
| AUC of the ROC | Genes + Demographics | Elastic Net (Caret) | 0.760 ± 0.021 | Gradient Boosted Machine 2 | 0.587 ± 0.025 | 0.173 |
| AUC of the ROC | Genes | Gradient Boosted Machine (Caret) | 0.605 ± 0.016 | Naive Bayes | 0.529 ± 0.019 | 0.076 |
| Mean Class Accuracies | Genes + Demographics | Gradient Boosted Machine 2 | 0.807 ± 0.012 | Gradient Boosted Machine (Caret) | 0.651 ± 0.014 | 0.156 |
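The last column of the table is the best experimental score minus the baseline score. As a quick arithmetic check, the sketch below recomputes it from the point estimates in the table; the row order follows the table, and the variable and function names are illustrative only.

```python
# (metric, best experimental score, baseline score, reported difference),
# copied from the table in row order.
rows = [
    ("R-Squared", 0.625, 0.120, 0.505),
    ("AUC of the ROC", 0.796, 0.786, 0.010),
    ("AUC of the ROC", 0.794, 0.786, 0.008),
    ("AUC of the ROC", 0.760, 0.587, 0.173),
    ("AUC of the ROC", 0.605, 0.529, 0.076),
    ("Mean Class Accuracies", 0.807, 0.651, 0.156),
]


def improvement(experimental, baseline):
    """Gain from adding microbiome features, rounded to three decimals."""
    return round(experimental - baseline, 3)


# Rows whose recomputed difference disagrees with the reported one.
mismatches = [row for row in rows if improvement(row[1], row[2]) != row[3]]
```

All six recomputed differences agree with the reported column. The gains range from 0.008 to 0.505, so the added value of microbiome features depends strongly on the phenotype being predicted.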
Fig 2. Concordance between most important predictors, measured by similarity in ranking and relative importance, between A) phenotype and data type and B) algorithm choice.
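The caption does not say which statistic measures "similarity in ranking". One common choice, assumed here purely for illustration, is the Spearman rank correlation between two vectors of feature importances. The sketch below implements it with average ranks for ties; the function names and the example importances are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (largest) downward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(importances_a, importances_b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(importances_a), average_ranks(importances_b))


# Hypothetical importances of the same four features under two algorithms.
algo_1 = [0.40, 0.30, 0.20, 0.10]
algo_2 = [0.35, 0.45, 0.15, 0.05]
concordance = spearman(algo_1, algo_2)
```

A value near 1 would mean the two algorithms rank the features in nearly the same order, and a value near -1 would mean the orders are reversed.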
Fig 3. Classification performances by data type and algorithm for all variables other than age.
Fig 4. A) Performance of metagenomic predictors of infant age by algorithm and data type. B) Correlation of top 25 most predictive CAGs with age. C)-F) Taxonomic breakdown and spline fits (relative to age) for a representative positively-associated and negatively-associated CAG.