Kristen D Curry, Michael G Nute, Todd J Treangen.
Abstract
Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and irritable bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.
Keywords: bioinformatics; host–microbe interactions; machine learning; microbiome
Year: 2021; PMID: 34779841; PMCID: PMC8786294; DOI: 10.1042/ETLS20210213
Source DB: PubMed; Journal: Emerg Top Life Sci; ISSN: 2397-8554
Figure 1. Standard workflow for determining microbiome-disease associations through a case-control study or ML model.
Both approaches begin by separating study participants into diseased and healthy cohorts, collecting samples, and performing high-throughput sequencing. Sequencing uses either a whole-genome shotgun (WGS) or 16S approach, and reads are then converted to k-mer counts [21], microbial profiles, or functional annotations. In a standard case-control study (left path), alpha diversity, beta diversity, and multivariate analysis are used to establish statistically significant differences between the two cohorts. A manual literature review is then performed to determine whether findings are consistent across studies. In a standard ML approach, by contrast, features are extracted from sequence information and a model is constructed to detect trends separating the two groups. Cross-study validation is then performed by calculating classification accuracy on test data sets from other studies.
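The quantities named in this workflow can be made concrete with a small sketch. The Python below is an illustration only, not the pipeline of any reviewed method: it counts k-mers in reads, normalizes them to a relative-abundance profile, and computes one common alpha diversity measure (Shannon index) and one common beta diversity measure (Bray-Curtis dissimilarity). The function names and the toy reads are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_abundance(counts):
    """Scale raw counts so that they sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def shannon_index(profile):
    """Alpha diversity: Shannon entropy (natural log) of one profile."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)


def bray_curtis(profile_a, profile_b):
    """Beta diversity: Bray-Curtis dissimilarity between two profiles."""
    keys = set(profile_a) | set(profile_b)
    shared = sum(min(profile_a.get(key, 0.0), profile_b.get(key, 0.0)) for key in keys)
    total = sum(profile_a.values()) + sum(profile_b.values())
    return 1.0 - 2.0 * shared / total


# Toy reads, invented for illustration only (hypothetical samples).
sample_a = relative_abundance(kmer_counts(["ACGTAC", "ACGT"], k=2))
sample_b = relative_abundance(kmer_counts(["GGGGGG"], k=2))

print("Shannon index, sample A:", shannon_index(sample_a))
print("Bray-Curtis, A vs B:", bray_curtis(sample_a, sample_b))
```

In the case-control path, values like these are compared between cohorts with statistical tests. In the ML path, the profiles themselves would serve as the feature vectors passed to a classifier.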
Summary statistics for discussed data sets. Here, ‘x’ denotes use of data set in method publication.
| Disease | Cases | Controls | MetAML | PopPhy-CNN | Met2Img | MetaPheno | DeepMicro | MVIB |
|---|---|---|---|---|---|---|---|---|
| Liver cirrhosis | 118 | 114 | x | x | x | – | x | x |
| IBD | 25 | 85 | x | – | x | – | x | x |
| Obesity | 164 | 89 | x | x | x | x | x | x |
| Type 2 diabetes | 170 | 174 | x | x* | x | x | x | x |
x*: reported results for disease include additional samples (53 case, 43 controls).
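The cohort sizes in this table differ in how balanced they are, which matters when classification results are read as accuracy. As a rough sketch (the helper name `majority_baseline` is an assumption for this example), the following Python computes the accuracy a trivial classifier would reach by always predicting the larger cohort, using the case and control counts from the table above.

```python
# (cases, controls) copied from the summary table above.
DATASETS = {
    "Liver cirrhosis": (118, 114),
    "IBD": (25, 85),
    "Obesity": (164, 89),
    "Type 2 diabetes": (170, 174),
}


def majority_baseline(cases, controls):
    """Accuracy obtained by always predicting the larger cohort."""
    return max(cases, controls) / (cases + controls)


for name, (cases, controls) in DATASETS.items():
    print(f"{name}: {majority_baseline(cases, controls):.3f}")
```

A threshold-independent metric such as AUC is less sensitive to this imbalance, since a classifier that gives every sample the same score has an AUC of 0.5 regardless of cohort sizes.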
A selection of machine learning methods for disease classification from metagenomic sequences. Best AUC here denotes the highest AUC value reported in publication for specified data set.
| Software | Model input | Model description | Best AUC (Cirr.) | Best AUC (IBD) | Best AUC (T2D) | Best AUC (Obes.) | Novelty |
|---|---|---|---|---|---|---|---|
| MetAML 2016 [ | sp. rel. ab. or strain markers | Parameter sweep for 4 classifiers (SVM, RF, Lasso, ENet) with 3 feature selection methods (RF | 0.96 SVM | 0.91 SVM | 0.76 SVM | 0.66 SVM | Foundational cross-validation test data and framework; first parameter sweep of metagenome disease prediction from off-the-shelf ML models |
| PopPhy-CNN 2020 [ | OTU rel. ab. | PhyloT tree construction; populated with input OTU rel. ab.; transformed to 2D matrix; CNN with ELU | 0.95 | N/A | 0.69 | 0.67 | CNN with spatial quantitative relationship in input taxonomy data; novel alg for selecting most important features from first convolutional layer |
| Met2Img 2018 [ | sp. or genus rel. ab. | Rel. ab. binned, colored, and visualized with Fill-up or t-SNE; 24x24 px (or smaller) images input into CNN with ReLU | 0.91 Fillup SPB | 0.87 Fillup SPB | 0.68 tSNE QTF | 0.69 tSNE SPB | Colored pixel visualization for microbiome profile; explores 3 binning methods (PR, QTF, SPB) with color and gray colormaps |
| MicroPheno 2018 [ | 16S raw seqs | Find subsample size for stable k-mer profile; find best | N/A | N/A | N/A | N/A | 16S sequences; k-mer distribution from shallow sub-samples outperformed OTU features; first 16S deep learning metagenome-phenotype exploration |
| MetaPheno 2019 [ | sp. rel. ab. or raw seqs | Jellyfish k-mer counts; identify sig. k-mers with cohort p-values; apply hyper-parameter grid search models | N/A | N/A | 0.76 gcF, k-mer | 0.65 gcF, rel. ab. | Review of current methods; compares features: k-mers and rel. ab. with classifiers: SVM, RF, XGBoost, gcForest, AE-pretrained DNN (AutoNN) |
| DeepMicro 2020 [ | sp. rel. ab. or strain markers | Low-dimensional profile representation from autoencoder; input into MLP with ReLU or hyper-parameter grid SVM or RF | 0.94 SVM CAE | 0.96 SVM SAE | 0.76 MLP CAE | 0.67 RF DAE | 4 autoencoders (shallow, deep, variational, convolutional) to reduce microbiome dimension; combines with MLP, SVM, and RF param. sweep |
| MVIB 2021 [ | sp. rel. ab. and strain markers | MLP for each modality (rel. ab., strain marker, metabolomics); Information Bottleneck theory to learn joint stochastic encoding | 0.93 D | 0.94 J;T | 0.76 J;T | 0.67 D | Combine multiple heterogeneous data modalities; explore default and joint pre-processing (D,J); optional triple margin loss extension (T) |
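For readers less familiar with the metric in this table, AUC can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The sketch below is a minimal pure-Python illustration of that pairwise definition, not the evaluation code of any listed method. The function name `roc_auc` and the toy labels and scores are assumptions for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for case, 0 for control. Tied scores count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy labels and scores, invented for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

On this reading, the obesity values near 0.65 to 0.69 mean that roughly two of every three case-control pairs are ordered correctly, while the cirrhosis values above 0.9 indicate nearly all pairs are.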
Challenges presented by microbiome data as input for ML models and the approaches taken by discussed methods to tackle these challenges.