| Literature DB >> 33680353 |
Ryan B Ghannam, Stephen M Techtmann.
Abstract
Advances in nucleic acid sequencing technology have enabled expansion of our ability to profile microbial diversity. These large datasets of taxonomic and functional diversity are key to better understanding microbial ecology. Machine learning has proven to be a useful approach for analyzing microbial community data and making predictions about outcomes including human and environmental health. Machine learning applied to microbial community profiles has been used to predict disease states in human health, environmental quality and presence of contamination in the environment, and as trace evidence in forensics. Machine learning has appeal as a powerful tool that can provide deep insights into microbial communities and identify patterns in microbial community data. However, machine learning models are often used as black boxes to predict a specific outcome, with little understanding of how the models arrived at their predictions. Complex machine learning algorithms may prioritize higher accuracy and performance at the sacrifice of interpretability. In order to leverage machine learning into more translational research related to the microbiome and strengthen our ability to extract meaningful biological information, it is important for models to be interpretable. Here we review current trends in machine learning applications in microbial ecology as well as some of the important challenges and opportunities for more broad application of machine learning to understanding microbial communities.
Keywords: 16S rRNA; ANN, Artificial Neural Networks; ASV, Amplicon Sequence Variant; AUC, Area Under the Curve; Forensics; GB, Gradient Boosting; ML, Machine Learning; Machine learning; Marker genes; Metagenomics; PCoA, Principal Coordinate Analysis; RF, Random Forests; ROC, Receiver Operating Characteristic; SML, Supervised Machine Learning; SVM, Support Vector Machines; USML, Unsupervised Machine Learning; tSNE, t-distributed Stochastic Neighbor Embedding
Year: 2021 PMID: 33680353 PMCID: PMC7892807 DOI: 10.1016/j.csbj.2021.01.028
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 Illustrative pipeline for the investigation of microbial communities using metagenomics.
Fig. 2 Schematic representation of unsupervised and supervised forms of learning and several ML methods predicting three conditional response labels (blue/red/yellow). (A) Depicts a common microbial frequency matrix containing observations or samples (N), features (X1, …, X23) and multiple class labels (Y). Input data are algorithmized and processed to either predict which cluster Y belongs to (unsupervised) or to find a best-fit decision boundary between X and Y (supervised). (B) Linear SVM classifier demonstrating separation between class labels, where the hyperplane maximizes the distance (margin) to the nearest training data points. Support vectors refer to the three position vectors drawn from the origin to the sample positions (dashed circle) with the goal of maximizing the distance between the optimal hyperplane and the support vectors (max-margin) so that a decision boundary can be drawn. (C) A decision tree constructed for the classification of samples into Y based on input feature values. Trees start from a root node (t0) and are grown to various leaf nodes (closed circle) to end at a terminal node (dashed circle) so that bootstrap-aggregated predictions across terminal nodes are averaged across k-trees for best predictions of Ŷ. (D) A neural network displaying the structure of successive layers. Input values of X are transmitted to the succeeding hidden layer, which passes weighted connections to the output layer for predictions of Ŷ. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
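As a rough, hypothetical illustration of the supervised setting in Fig. 2, the sketch below fits the three classifier families from panels (B)-(D) — a linear SVM, a random forest, and a small feed-forward neural network — to a randomly generated stand-in for the microbial frequency matrix in panel (A). scikit-learn is assumed here purely for illustration; the data, dimensions and settings are synthetic placeholders, not from the review.

```python
# Minimal sketch (synthetic data): the three supervised learners from Fig. 2
# applied to a toy microbial frequency matrix X (samples x taxa) with labels Y.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.poisson(lam=5, size=(60, 23)).astype(float)  # 60 samples, 23 taxa (as in Fig. 2A)
X = X / X.sum(axis=1, keepdims=True)                 # convert counts to relative abundances
Y = rng.integers(0, 3, size=60)                      # three class labels (blue/red/yellow)

models = {
    "linear SVM (Fig. 2B)": SVC(kernel="linear", C=1.0),
    "random forest (Fig. 2C)": RandomForestClassifier(n_estimators=500, random_state=0),
    "neural network (Fig. 2D)": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, Y, cv=5).mean()  # 5-fold cross-validated accuracy
    print(f"{name}: mean CV accuracy = {acc:.2f}")
```

With random labels the cross-validated accuracies hover near chance; the point is only to show how each model family from the figure consumes the same frequency matrix.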
Summary of ML techniques used for microbiome-based prediction tasks. This table briefly summarizes each technique, provides the source of the software, notes noteworthy ML implementations, and indicates how its results can be interpreted, with reference to either the source study or specific studies that have applied these techniques for microbiome profiling. The table is not exhaustive but lists current and commonly employed ML and ML-related pipelines that are either tailored to the characteristics of microbiome data or domain-agnostic but relevant to research questions relating to the microbiome.
| Software name | Summary | Source | Example implementation | Remarks | URL |
|---|---|---|---|---|---|
| SIAMCAT (*) | | R package ‘SIAMCAT’ | FS, ML, INTERP, VIS | Confounder analysis | |
| DeepMicro (*) | Deep representation learning for disease prediction based on microbiome data | Python: | DR, ML | Deep representation learning using autoencoders to handle high-dimensional data | |
| MetAML (*) | Metagenomic prediction Analysis based on Machine Learning | Python: | FS, ML, INTERP, VIS | Enables cross-study comparison of models on single cohorts, across stages of the same study and across different studies | |
| mAML (*) | An automated machine learning pipeline with a microbiome repository for human disease classification | Python: | FS, ML, INTERP, VIS | Automates optimized, interpretable and reproducible models | |
| BiomMiner (*) | An advanced exploratory microbiome analysis and visualization pipeline | Docker: | FS, DR, ML, INTERP, VIS | Automatically tunes optimal hyper-parameters | |
| MIPMLP (*) | Microbiome Preprocessing Machine Learning Pipeline | Python: | FS, DR, ML, INTERP, VIS | Approaches for standardized ML preprocessing | |
| MicrobiomeAnalystR (*) | Comprehensive statistical, functional, and meta-analysis of microbiome data | R package ‘MicrobiomeAnalystR’ | FS, DR, ML, INTERP, VIS | Comprehensive analysis reporting | |
| Meta-Signer (*) | | Python: | FS, ML, INTERP | Ensemble learning for feature ranking | |
| QIIME2 (*) | | | FS, DR, ML, INTERP, VIS | Automatic tracking of data provenance | |
| mothur (*) | Microbial community analysis pipeline | | FS, DR, ML, INTERP, VIS | Can handle data from multiple sequencing platforms | |
| scikit-learn | Simple and efficient tools for predictive data analysis | Python: | FS, DR, ML, INTERP, VIS | Robust machine learning library and support system | |
| Keras | Simple deep learning API | R package ‘keras’ | FS, DR, ML, INTERP, VIS | High-level learning API that limits the number of user actions | |
| caret | | R package ‘caret’ | FS, DR, ML, INTERP, VIS | Streamlines complex predictive tasks | https://www.jstatsoft.org/article/view/v028i05 |
| mlr | Machine learning in R | R package ‘mlr3’ | FS, DR, ML, INTERP, VIS | Modern and extensible ML framework for developers and practitioners | |
| H2O.ai | Fast scalable ML API | R package ‘h2o’ | FS, ML, DR, INTERP, VIS | End-to-end engine specialized for big data | |
| iml | Interpretable machine learning | R package ‘iml’ | FS, ML, INTERP, VIS | Feature effects on the influence of predictions | |
| LIME | Local interpretable model-agnostic explanations | R package ‘lime’ | FS, ML, INTERP, VIS | Explains individual predictions of a black box ML model | |
| inTrees | Interpretable tree ensembles | R package ‘inTrees’ | FS, ML, INTERP | Extracts, measures, prunes, selects and summarizes rules from a tree ensemble | |
| dtreeviz | Decision Tree Visualization | Python: | FS, ML, INTERP | Advanced visualizations | |
| ranger | | R package ‘ranger’ | FS, ML, INTERP | Fast implementations of random forests optimized for high-dimensional data | |
| partykit | A toolkit for recursive partitioning | R package ‘partykit’ | FS, ML, INTERP | Can coerce tree models from different sources into a unified infrastructure | |
FS, Feature Selection; DR, Dimensionality Reduction; ML, Machine Learning; INTERP, Interpretation Measures; VIS, Visualization Outputs. (*) Denotes that the software is microbiome-specific (as opposed to domain-agnostic).
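To make the FS, DR, ML and INTERP workflow categories from the table concrete, a minimal sketch follows using scikit-learn (one of the domain-agnostic libraries listed above). The abundance matrix, labels, and parameter choices are synthetic placeholders and are not drawn from any study cited in the review.

```python
# Minimal sketch of the FS -> DR -> ML workflow categories from the table,
# applied to a synthetic abundance matrix (not real microbiome data).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif   # FS
from sklearn.decomposition import PCA                           # DR
from sklearn.ensemble import GradientBoostingClassifier         # ML
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report               # INTERP (performance summary)

rng = np.random.default_rng(1)
X = rng.random((100, 200))            # 100 samples x 200 taxa (placeholder abundances)
y = rng.integers(0, 2, size=100)      # binary phenotype label

pipe = Pipeline([
    ("fs", SelectKBest(f_classif, k=50)),   # keep the 50 most informative taxa
    ("dr", PCA(n_components=10)),           # compress to 10 components
    ("ml", GradientBoostingClassifier(random_state=1)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))
```

Microbiome-specific tools in the table (e.g., SIAMCAT, MetAML, mAML) wrap comparable steps behind interfaces tailored to taxon abundance tables; the pipeline above only illustrates the generic shape of such a workflow.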
Fig. 3 Depiction of the performance-interpretability trade-off and random forest interpretation. Note that these figures are fictional and are not based on experimental quantification (the axes in this figure lack meaning). (A) Performance-interpretability trade-off of commonly deployed algorithms in microbiome research. In practice, however, the models characterized here exhibit varying degrees of accuracy and interpretability depending on experimental procedure. Had a plot been generated from experiment, model choice and complexity could vary such that inconsistent illustrations could arise. By way of example: tuning models to become more accurate could foster the belief that more accurate models are less interpretable, regardless of whether the model infrastructure supports inherently easier interpretation. (B) Hypothetical extraction of ‘association’ rules that measure frequent microbial community member interactions from fictional decision tree ensembles for low-error predictions of Ŷ. Also diagrammed is a feature ‘importance’ schematic that scores each feature on its relative importance in making predictions of Ŷ.
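A minimal sketch of the feature 'importance' idea in Fig. 3B follows, assuming scikit-learn and synthetic data: each taxon is scored by its contribution to a random forest's predictions, both via the impurity-based importances stored on the fitted ensemble and via a permutation test.

```python
# Minimal sketch of the feature-'importance' schematic in Fig. 3B: score each taxon
# by how much it contributes to random-forest predictions (synthetic data only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.random((80, 30))                          # 80 samples x 30 taxa (placeholder)
y = (X[:, 3] + X[:, 7] > 1.0).astype(int)         # label driven by taxa 3 and 7

rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X, y)

# Impurity-based importances come for free from the fitted ensemble ...
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top taxa (impurity):", top)

# ... while permutation importance measures the accuracy drop when a taxon is shuffled.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=2)
print("top taxa (permutation):", np.argsort(perm.importances_mean)[::-1][:5])
```

Because the synthetic label is built from taxa 3 and 7, both rankings should place those features near the top, mirroring how Fig. 3B highlights the community members most responsible for predictions of Ŷ.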
Studies using machine learning in microbial ecology and microbiome research.
| System | Classification | Input data | Number of samples | Method | Training and Validation | Reference |
|---|---|---|---|---|---|---|
| Human | Colonic screen relevant neoplasias | 16S rRNA | 172 patients with normal colonoscopies, 198 with adenomas, and 120 with carcinomas | L2-regularized logistic regression, L1- and L2-regularized SVM with linear and radial basis function kernels, a decision tree, RF, and gradient boosted trees | 80% Training, 20% Validation, 20% Test, Five-fold cross validation | Topçuoğlu et al 2020 |
| Human | Personalized postprandial glycemic response | 16S rRNA | 900 samples: 800 in training, 100 in validation | Gradient boosted trees | 800 samples used and validated with a leave-one-out cross-validation scheme; 100-sample validation cohort | Zeevi et al 2015 |
| Environmental | Crop Productivity | Shotgun metagenomic | 12 samples | RF | 10 samples as training set, 2 samples as validation set (all combinations of the 12 samples) | Chang et al 2017 |
| Environmental | DOC level | 16S rRNA | 302 samples | feed-forward neural network regression and RF | 257 samples as training set and 51 as test set | Thompson et al 2019 |
| Environmental | Environmental quality status associated with salmon farms | SSU rRNA (bacteria and ciliates) | 152 across seven salmon farms | RF and SVM | Models trained on six of the salmon farms and tested with the seventh | Cordier et al 2018 |
| Environmental | Environmental impacts of marine aquaculture | SSU rRNA (five marker genes – one bacterial, one foraminiferal, and three universal eukaryote) | 144 Sediment samples | RF | Models trained on four of the salmon farms and tested with the other farm | Frühe et al 2020 |
| Environmental | Environmental quality status associated with salmon farms | Bacterial 16S rRNA | 12 sediment samples collected from six sites | RF | 12 samples validated with a leave one out cross validation scheme | Dully et al 2020 |
| Environmental | Contamination state (uranium, nitrate, oil) | 16S rRNA | 93 samples for ground water contamination, 42 samples for oil contamination | RF | Performance metrics were determined from a confusion matrix based on out-of-bag predictions | Smith et al 2015 |
| Environmental | Glyphosate presence | 16S rRNA | 32 16S rRNA gene samples and 32 16S rRNA samples | ANN and RF | 32 samples used and validated with a leave one out cross validation scheme | Janßen et al 2019 |
| Forensic | Postmortem Interval | 16S rRNA | 144 sample swabs were taken from a total of 21 cadavers | SVR, K-nearest Neighbors Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, RF Regression, Bayesian Ridge Regression | 80% of samples for training set and 20% of samples for validation set | Johnson et al 2016 |
| Forensic | Postmortem Interval | 16S rRNA | 176 samples | RF, SVM, ANN | 70% for training and 30% for testing. Accuracy determined by mean absolute error and goodness of fit of 15 models | Liu et al 2020 |
| Forensic | Geospatial location (port of origin) | 16S rRNA | 1,218 samples | RF | Repeated k-fold cross validation (k = 10 with 3 repeats) | Ghannam et al 2020 |
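Several of the validation schemes in the table (hold-out splits, leave-one-out, and repeated k-fold cross-validation) can be reproduced with standard tooling. A minimal sketch follows for repeated k-fold cross-validation (k = 10 with 3 repeats) scored by ROC AUC, assuming scikit-learn and synthetic data rather than any dataset from these studies.

```python
# Minimal sketch of a validation scheme from the table: repeated k-fold
# cross-validation (k = 10, 3 repeats) scored by ROC AUC, on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.random((120, 50))             # 120 samples x 50 taxa (placeholder abundances)
y = rng.integers(0, 2, size=120)      # binary outcome (e.g., contaminated vs. clean)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=3)
scores = cross_val_score(RandomForestClassifier(random_state=3), X, y,
                         cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.2f} +/- {scores.std():.2f} over {len(scores)} folds")
```

On random labels the AUC sits near 0.5; reporting the spread across the 30 folds, as several of the studies above do, gives a sense of how stable a model's performance estimate is.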