Abstract
With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.
Keywords: disease; machine learning; modeling; phenotype; prediction
Year: 2019 PMID: 31293616 PMCID: PMC6603228 DOI: 10.3389/fgene.2019.00579
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Schematic illustration of several machine learning prediction methods using case/control (red/blue) status. For two features, (A) illustrates linear discrimination methods. The solid line shows the linear discriminant line corresponding to equally probable outcomes, while the dashed line shows the midpoint of the maximum-margin support vector machine. (B) For k-nearest neighbors, the gray point is predicted using an average of the neighbors (red, in this instance). (C) Decision tree ensembles include random forests, which average over bootstrapped trees, and boosted trees, where successive residuals are used for fitting. Trees may not extend to the level of individual observations, and modal or mean values in the terminal nodes are used for prediction. (D) A neural network with few hidden layers.
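To make the Figure 1 method families concrete, here is a minimal Python sketch (scikit-learn based; this is not the code released with the paper) that fits one representative of each family to synthetic two-feature case/control data. The synthetic data and all parameter choices are illustrative assumptions.

```python
# Illustrative sketch of the Figure 1 method families on synthetic data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Two informative features, mimicking the two-axis illustration in Figure 1.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

models = {
    "Linear discriminant (panel A)": LinearDiscriminantAnalysis(),
    "Max-margin SVM (panel A)": SVC(kernel="linear"),
    "k-nearest neighbors (panel B)": KNeighborsClassifier(n_neighbors=5),
    "Random forest (panel C)": RandomForestClassifier(n_estimators=500, random_state=0),
    "Boosted trees (panel C)": GradientBoostingClassifier(random_state=0),
    "Neural network (panel D)": MLPClassifier(hidden_layer_sizes=(10,),
                                              max_iter=2000, random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy
    print(f"{name}: mean CV accuracy = {acc:.2f}")
```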
Table 1. Review of published prediction accuracy comparisons.
| Publication | Data source | Trait | n | Cases | Controls | No. of features | Taxonomic level | Method | Metric | Value |
|---|---|---|---|---|---|---|---|---|---|---|
| Pasolli et al. | Qin et al. | Liver cirrhosis | 232 | 118 | 114 | 542 | Species | Random forest | AUC | 0.95 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.92 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.91 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.88 |
|  | Zeller et al. | Colorectal cancer | 121 | 48 | 73 | 503 | Species | Random forest | AUC | 0.87 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.81 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.79 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.73 |
|  | Qin et al. | IBD | 110 | 25 | 85 | 443 | Species | Random forest | AUC | 0.89 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.86 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.83 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.81 |
|  | Le Chatelier et al. | Obesity | 253 | 164 | 89 | 465 | Species | Random forest | AUC | 0.66 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.65 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.64 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.60 |
|  | Qin et al. | Type II diabetes | 344 | 170 | 174 | 572 | Species | Random forest | AUC | 0.74 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.66 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.70 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.71 |
|  | Karlsson et al. | Type II diabetes | 96 | 53 | 43 | 381 | Species | Random forest | AUC | 0.76 |
|  |  |  |  |  |  |  |  | SVM | AUC | 0.66 |
|  |  |  |  |  |  |  |  | Elastic net | AUC | 0.60 |
|  |  |  |  |  |  |  |  | Lasso | AUC | 0.54 |
| Johnson et al. |  | Post-mortem interval (PMI) | 67 | NA | NA | 52 | Phylum | Ridge | Error rate | 0.46 |
|  |  |  |  |  |  | 52 | Phylum | Elastic net | Error rate | 0.48 |
|  |  |  |  |  |  | 3,130 | Species | Lasso | Error rate | 0.49 |
|  |  |  |  |  |  | 52 | Phylum | SVM | Error rate | 0.50 |
|  |  |  |  |  |  | 3,130 | Species | Ridge | Error rate | 0.51 |
|  |  |  |  |  |  | 3,130 | Species | Elastic net | Error rate | 0.52 |
|  |  |  |  |  |  | 52 | Phylum | Lasso | Error rate | 0.52 |
| Ditzler et al. | Rousk, 2010 | Soil pH (low/medium/high) | 22 | NA | NA | 500 | Various | Recursive neural network (RNN) (50) | Error rate | 0.15 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (500) | Error rate | 0.08 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (750) | Error rate | 0.08 |
|  |  |  |  |  |  |  |  | Random forest | Error rate | 0.15 |
|  |  |  |  |  |  |  |  | Multi-layer perceptron neural network (MLPNN) (500) | Error rate | 0.00 |
|  | Caporaso et al. | Host gender | 1,967 | NA | NA | 500 | Various | Recursive neural network (RNN) (250) | Error rate | 0.15 |
|  |  |  |  |  |  |  |  | Recursive neural network (RNN) (500) | Error rate | 0.19 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (250) | Error rate | 0.24 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (500) | Error rate | 0.24 |
|  |  |  |  |  |  |  |  | Random forest | Error rate | 0.03 |
|  |  |  |  |  |  |  |  | Multi-layer perceptron neural network (MLPNN) (500) | Error rate | 0.08 |
|  | Caporaso et al. | Three body sites | 1,967 | NA | NA | 500 | Various | Recursive neural network (RNN) (250) | Error rate | 0.17 |
|  |  |  |  |  |  |  |  | Recursive neural network (RNN) (500) | Error rate | 0.16 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (250) | Error rate | 0.03 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (500) | Error rate | 0.03 |
|  |  |  |  |  |  |  |  | Random forest | Error rate | 0.01 |
|  |  |  |  |  |  |  |  | Multi-layer perceptron neural network (MLPNN) (500) | Error rate | 0.01 |
| Reiman et al. | Caporaso et al. | Three body sites | 1,967 | NA | NA | 1,706 | Various | Recursive neural network (RNN) (250) | Accuracy | 0.83 |
|  |  |  |  |  |  |  |  | Recursive neural network (RNN) (500) | Accuracy | 0.84 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (250) | Accuracy | 0.97 |
|  |  |  |  |  |  |  |  | Deep belief network (DBN) (500) | Accuracy | 0.97 |
|  |  |  |  |  |  |  |  | Multi-layer perceptron neural network (MLPNN) (500) | Accuracy | 0.99 |
|  |  |  |  |  |  |  |  | Random forest | Accuracy | 0.99 |
|  |  |  |  |  |  |  |  | Convolutional neural network (CNN-1D) | Accuracy | 0.95 |
|  |  |  |  |  |  |  |  | Convolutional neural network (CNN-2D) | Accuracy | 0.99 |
| Moitinho-Silva et al. |  | Microbial abundance from sponges (high/low) | 1,232 | NA | NA | 30 | Phylum | Random forest | Accuracy | 0.97 |
|  |  |  |  |  |  |  |  | Adaptive boosting (AdaBoost) | Accuracy | 0.95 |
|  |  |  |  |  |  | 76 | Class | Random forest | Accuracy | 0.95 |
|  |  |  |  |  |  |  |  | Adaptive boosting (AdaBoost) | Accuracy | 0.91 |
|  |  |  |  |  |  | 2,322 | Various | Random forest | Accuracy | 0.50 |
|  |  |  |  |  |  |  |  | Adaptive boosting (AdaBoost) | Accuracy | 0.91 |
| Ai et al. |  | Colorectal cancer (CRC) | 141 | 42 | 99 | 1,171 | Species | Bayes net | AUC | 0.93 |
|  |  |  |  |  |  |  |  | Random forest | AUC | 0.94 |
|  |  |  |  |  |  |  |  | Logistic | AUC | 0.98 |
|  |  |  | 141 | 53 | 88 | 783 | Species | Bayes net | AUC | 0.86 |
|  |  |  |  |  |  |  |  | Random forest | AUC | 0.86 |
|  |  |  |  |  |  |  |  | Logistic | AUC | 0.71 |
| Wu et al. |  | Three diseases | 806 | 423 | 383 | 300 | Genus | Logistic | F1 | 0.91 |
|  |  |  |  |  |  |  |  | k-nearest neighbor | F1 | 0.86 |
|  |  |  |  |  |  |  |  | Random forest | F1 | 0.83 |
|  |  |  |  |  |  |  |  | SVM | F1 | 0.91 |
|  |  |  |  |  |  |  |  | Gradient boosting | F1 | 0.87 |
|  |  |  |  |  |  |  |  | Adaptive boosting | F1 | 0.90 |
| Nakano et al. |  | Oral malodour | 90 | 45 | 45 | 37 | Genus | SVM | Accuracy | 0.79 |
|  |  |  |  |  |  |  |  | Deep learning | Accuracy | 0.97 |
| Asgari et al. | HMP | Five body sites | 1,192 | NA | NA | 20,589 | Various | Random forest | F1 | 0.89 |
|  |  |  |  |  |  |  |  | SVM | F1 | 0.85 |
|  | Gevers et al. | Crohn's disease | 1,359 | 731 | 628 | 9,511 | Various | Random forest | F1 | 0.74 |
|  |  |  |  |  |  |  |  | SVM | F1 | 0.68 |
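Most classification rows above report cross-validated AUC. Below is a minimal sketch of that evaluation loop in Python, assuming a hypothetical input file otu_table.csv that holds per-sample taxon abundances plus a binary disease column (file name and column names are placeholders, not from the paper).

```python
# Hedged sketch: k-fold cross-validated AUC for a classifier on a
# taxa-abundance matrix, as reported throughout Table 1.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rows = samples; columns = taxon abundances plus a 0/1 "disease" label.
data = pd.read_csv("otu_table.csv")          # hypothetical input file
X = data.drop(columns=["disease"]).values    # feature matrix (abundances)
y = data["disease"].values                   # case/control labels

rf = RandomForestClassifier(n_estimators=500, random_state=0)
aucs = cross_val_score(rf, X, y, cv=10, scoring="roc_auc")
print(f"Mean 10-fold AUC: {aucs.mean():.2f} (SD {aucs.std():.2f})")
```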
Figure 2. (A–C) ROC curves for each machine learning method using all OTUs. The AUC values are shown in the legend. The size of each dataset (# cases/controls × # OTUs) is shown in the title. (D) Bar graph showing the average Pearson correlation (R) between predicted and actual BMI in the Goodrich dataset, using BMI as a continuous trait.
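Figure 2 combines two kinds of evaluation: ROC/AUC for case/control prediction, and Pearson correlation between predicted and observed BMI for the continuous-trait analysis. A sketch of both computations on synthetic stand-in data (the curve plotting itself is omitted; all sizes are illustrative):

```python
# Sketch of the two evaluations shown in Figure 2.
from scipy.stats import pearsonr
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split, cross_val_predict

# (A-C)-style evaluation: ROC curve from held-out predicted probabilities.
Xc, yc = make_classification(n_samples=300, n_features=50, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.3, random_state=0)
probs = RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, _ = roc_curve(yte, probs)
print(f"AUC = {auc(fpr, tpr):.2f}")

# (D)-style evaluation: Pearson R between cross-validated predictions
# and the observed continuous trait (BMI, in the Goodrich analysis).
Xr, bmi = make_regression(n_samples=300, n_features=50, noise=10, random_state=0)
pred = cross_val_predict(RandomForestRegressor(random_state=0), Xr, bmi, cv=5)
print(f"Pearson R = {pearsonr(bmi, pred)[0]:.2f}")
```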
Figure 3. ROC curves after collapsing OTUs to the genus level: (A) the Singh dataset, (B) the Vincent dataset, and (C) the Goodrich dataset.
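Collapsing OTUs to the genus level, as in Figure 3, amounts to summing the abundance columns of OTUs that share a genus assignment. A sketch assuming hypothetical files otu_abundances.csv (samples × OTUs) and otu_taxonomy.csv (an OTU-to-genus map); both file names and the "genus" column are placeholders:

```python
# Sketch: collapse an OTU table to genus-level features by summation.
import pandas as pd

abund = pd.read_csv("otu_abundances.csv", index_col=0)   # hypothetical: samples x OTUs
taxonomy = pd.read_csv("otu_taxonomy.csv", index_col=0)  # hypothetical: OTU -> taxonomy
genus_of = taxonomy["genus"]  # Series indexed by OTU ID

# Transpose so OTUs are rows, group rows by genus, sum, transpose back.
genus_table = abund.T.groupby(genus_of).sum().T
print(genus_table.shape)  # far fewer columns than the OTU-level table
```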
Figure 4. ROC curves after applying the HFE method to select a subset of informative features: (A) the Singh dataset, (B) the Vincent dataset, and (C) the Goodrich dataset.
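HFE exploits the taxonomy hierarchy to engineer and select features. The sketch below is not the HFE algorithm; it substitutes a simple mutual-information filter purely to show how a reduced feature set plugs into the cross-validated AUC comparison of Figures 4 and 5, with selection nested inside each fold to avoid selection bias. All sizes and the k=50 cutoff are illustrative assumptions.

```python
# Stand-in for taxonomy-informed feature selection (NOT the HFE method):
# keep the features most informative about the label, then compare
# cross-validated AUC with and without the filter.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

# Pipeline refits the filter inside each CV fold (no information leakage).
pipe = make_pipeline(SelectKBest(mutual_info_classif, k=50),
                     RandomForestClassifier(n_estimators=500, random_state=0))
auc_subset = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
auc_full = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                           X, y, cv=5, scoring="roc_auc").mean()
print(f"AUC full: {auc_full:.2f}  AUC subset: {auc_subset:.2f}")
```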
Figure 5. Scatterplot comparing the average AUCs between the full dataset and the HFE subset: (A) the Singh dataset, (B) the Vincent dataset, and (C) the Goodrich dataset.