
It takes guts to learn: machine learning techniques for disease detection from the gut microbiome.

Kristen D Curry¹, Michael G Nute¹, Todd J Treangen¹.

Abstract

Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and inflammatory bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.


Keywords: bioinformatics; host–microbe interactions; machine learning; microbiome

Year: 2021 PMID: 34779841 PMCID: PMC8786294 DOI: 10.1042/ETLS20210213

Source DB: PubMed Journal: Emerg Top Life Sci ISSN: 2397-8554

Introduction

The collection of microscopic organisms residing in the intestinal tract is commonly referred to as the gut microbiome [1]. This community of microorganisms is associated with the well-being of the host [2-7], yet the specific roles and contributions of the individual microbes towards disease are often unknown [8, 9]. The advent of high-throughput sequencing, along with pioneering work on population-level analysis of the human gut microbiome from the Human Microbiome Project (HMP) [10, 11], Belgian Flemish Gut Flora Project [12] and METAgenomics of the Human Intestinal Tract (MetaHIT) Project [13], has broadened our understanding of host–microbiome interactions and raised the possibility that the gut microbiome could be an avenue for new forms of medical intervention. Microbiome data is especially enticing in human health research as it has the potential to explain medical mysteries that current clinical information has not been able to resolve. Host DNA, for example, can be used to calculate disease risk, but the static nature of this material means it cannot be used to measure the current health state of an individual [14]. Microbiome information is still unique to each individual [15, 16], yet has proven to change with many common illnesses and infections, providing a real-time snapshot into the health state of the host [17, 18]. Microbiome information is also intriguing from a data science perspective since it may not be just the presence of specific microbes that influences disease expression, but rather microbe abundances, the phylogenetic relationship between microbes, or the communication between microbes and their environment [19, 20]. A natural first question is whether the data and descriptive statistics from the gut microbial community for any particular disease contain predictive power as to the disease state of the host. 
If so, a stool sample could be applied directly as a non-invasive clinical diagnostic tool provided the accuracy is robust and reproducible. But more broadly, as with any data science discipline, exploring and parsing the relationship between variables and outcome is the route to additional discovery. In this review, we present a survey of machine learning (ML) models designed specifically for this purpose: classification of host disease status based on features derived from DNA sequencing of the gut microbiome. A standard study design for investigating associations between microbes and host health is a case-control study, illustrated in Figure 1.

Figure 1.

Standard workflow for determining microbiome-disease associations through a case-control study or ML model.

Both approaches begin by separating study participants into diseased and healthy cohorts, collecting samples, then performing high-throughput sequencing. Sequencing is completed through either a WGS or 16S approach, then reads are converted to k-mer counts [21], microbial profiles or functional annotations. In a standard case-control study (left path), alpha diversity, beta diversity and multivariate analysis are used to establish statistically significant differences between the two cohorts. A manual literature review is then performed to determine if findings are consistent across various studies. However, in a standard ML approach, features are extracted from sequence information and a model is constructed to detect trends separating the two groups. Cross-study validation is then performed by calculating accuracy in classification results from other test data sets.

First, subjects are separated into two equally sized cohorts based on disease state (sick and healthy) with the subjects chosen to make the two groups as closely matched as possible in terms of other potentially confounding variables. The gut microbiome for each participant is then established by sequencing a fecal sample through either a whole genome sequencing (WGS) or 16S rRNA approach. In WGS, all genetic material is sequenced, providing a complete understanding of all cells in the sample with limited bias [22, 23]. A more cost-effective alternative is to perform targeted amplicon sequencing for the 16S rRNA gene, which restricts results so that only the microbes’ abundances relative to each other can be assumed [24, 25]. The resulting data are processed using various computational pipelines [21, 26–30] to distill the large volume of unstructured sequences into more manageable descriptive data that can be mined for signal.

A common first analysis is a basic comparison to test for statistically significant differences in the average community of the two groups. This is analogous to an ANOVA test in single-variable statistics, though in practice it is done using summary metrics like alpha diversity, beta diversity or multiple-hypothesis tests for differences in a larger set of values (e.g. microbe abundance) with appropriate software [31, 32]. This is not a predictive modeling exercise per se, but establishing that there is a non-zero average difference in the communities is a primitive version of classification. Variations of these statistical models have been valuable for finding trends in the study-specific associations between microbes and a particular disease [33-36], but have been unsuccessful in terms of disease diagnosis or prevention because individual studies often report inconsistent findings [19, 37–41], raising reproducibility questions. Many researchers are voicing concerns with these statistical tests due to their ability to introduce bias [42, 43]. 
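The summary metrics named above can be made concrete with a small sketch. The Python below assumes each sample is a list of non-negative per-taxon counts in a shared taxon order, and it picks the Shannon index for alpha diversity and Bray-Curtis dissimilarity for beta diversity; these are common choices, not necessarily the ones used in the cited studies. All function names are illustrative.

```python
import math

def relative_abundance(counts):
    """Convert raw per-taxon counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]

def shannon_index(counts):
    """Alpha diversity: Shannon entropy (natural log) within one sample."""
    props = relative_abundance(counts)
    return -sum(p * math.log(p) for p in props if p > 0)

def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    Computed on relative abundances, so the usual formula
    1 - 2 * sum(min) / (sum_a + sum_b) reduces to 1 - sum(min).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must share the same taxon order")
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    return 1.0 - sum(min(x, y) for x, y in zip(pa, pb))

if __name__ == "__main__":
    healthy = [40, 30, 20, 10]   # hypothetical counts for four taxa
    diseased = [85, 10, 5, 0]
    print("alpha (healthy): ", shannon_index(healthy))
    print("alpha (diseased):", shannon_index(diseased))
    print("beta (between):  ", bray_curtis(healthy, diseased))
```

A cohort-level comparison would then test whether these values differ between cases and controls, which is the step the statistical packages cited above carry out.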
ML models are enticing for microbiome-phenotype classification tasks because, theoretically, disease profiles and biomarkers can be identified with only limited prior knowledge of the underlying system [44]. In addition, ML methods have shown success in other avenues of microbiome analysis, such as taxonomic classification [26, 45]. The potential scope of features is virtually limitless, as we will discuss later, but the approaches presented in this review rely on a few usual suspects: species-level relative abundance, a set of strain-specific markers from MetaPhlAn2 [27], k-mer profiles, and operational taxonomic units (OTUs). To produce relative abundance or biomarker detection results, a reference database is used to assign each sequence to its closest match. K-mer profile generation, however, skips the sequence classification step by simply counting the frequency of all substrings of length k. Methods utilizing this technique have shown great success in reducing computational complexity while still producing accurate results in a range of microbiome analysis tasks [30, 46, 47]. In OTU clustering, similar sequences are grouped together then assigned either a consensus sequence or a reference-based taxonomic classification. A more recent form of clustering involves amplicon sequence variants (ASVs) and includes an error reduction step to establish exact sequences, filtering out reads based on a confidence threshold [48]. Any of these features can be fed into an off-the-shelf ML algorithm for supervised learning (e.g. random forest [49], support vector machines [50], neural networks [51]). Model output is then a simple yes/no disease classification with the potential to extract the most influential features through either built-in methods or post-processing pipelines [52], which can provide clinical value once interpreted [53, 54]. 
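The k-mer profile described above (counting every substring of length k) can be sketched in a few lines. This is a minimal illustration, not the implementation used by any of the reviewed tools: it ignores reverse complements, skips k-mers containing ambiguous bases, and the function names are hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def kmer_profile(reads, k):
    """Count every length-k substring across all reads.

    k-mers containing characters other than A, C, G, T (for example N)
    are skipped; that filtering rule is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts

def normalize(counts):
    """Turn raw k-mer counts into frequencies so samples are comparable."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

if __name__ == "__main__":
    reads = ["ACGTAC", "acnGT"]  # toy reads, not real data
    profile = kmer_profile(reads, k=2)
    print(profile)
    print(normalize(profile))
```

Because no reference database is consulted, sequences from organisms missing from any database still contribute to the feature vector.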
Accuracy can be validated on some kind of holdout sample, although in practice a more robust method is to apply the model to a totally separate study under different conditions, which truly tests the generalizability of the model. This cross-study validation step is challenging, however, and is not always completed. In this article, we review several ML approaches for disease prediction from metagenomic data and their strategies for overcoming computational and biological challenges. We have chosen seven methods for inclusion, for two key reasons: (1) they are all recent additions to the literature that made specific design decisions putting them beyond simple off-the-shelf models, (2) six of the seven have been evaluated on the same set of benchmark case-control studies (Table 1), which allows us to assess these methods based on a common benchmark data set. Following this, we will discuss the major challenges that remain, as well as prospective developments and advances.

Table 1.

Summary statistics for discussed data sets. Here, ‘x’ denotes use of data set in method publication.

Disease	Cases	Controls	MetAML	PopPhy-CNN	Met2Img	MetaPheno	DeepMicro	MVIB
Liver cirrhosis	118	114	x	x	x	–	x	x
IBD	25	85	x	–	x	–	x	x
Obesity	164	89	x	x	x	x	x	x
Type 2 diabetes	170	174	x	x*	x	x	x	x

x*: reported results for disease include additional samples (53 case, 43 controls).

Current approaches

The HMP and MetaHIT projects spurred the development of several foundational 16S and WGS microbiome tools for taxonomic classification, phylogenetic placement, functional annotation and clustering [27, 30, 55–63]. These tools and data sets have been used in both single-trial studies and encompassing meta-analyses for detecting microbiome trends, or lack thereof, in relation to disease [37, 38, 40]. The methods covered in this section (Table 2), along with several others [54, 71–78], take this idea one step further and aim to establish generalizable models for discovering consistent microbiome-phenotype trends across individual studies.

Table 2.

A selection of machine learning methods for disease classification from metagenomic sequences. Best AUC here denotes the highest AUC value reported in publication for specified data set.

Software	Model input	Model description	Best AUC: Cirr.	Best AUC: IBD	Best AUC: T2D	Best AUC: Obes.	Novelty
MetAML 2016 [64]	sp. rel. ab. or strain markers	Parameter sweep for 4 classifiers (SVM, RF, Lasso, ENet) with 3 feature selection methods (RF n most important, Lasso, ENet)	0.96 SVM	0.91 SVM	0.76 SVM	0.66 SVM	Foundational cross-validation test data and framework; first parameter sweep of metagenome disease prediction from off-the-shelf ML models
PopPhy-CNN 2020 [65]	OTU rel. ab.	PhyloT tree construction; populated with input OTU rel. ab.; transformed to 2D matrix; CNN with ELU	0.95	N/A	0.69	0.67	CNN with spatial quantitative relationship in input taxonomy data; novel alg for selecting most important features from first convolutional layer
Met2Img 2018 [66]	sp. or genus rel. ab.	Rel. ab. binned, colored, and visualized with Fill-up or t-SNE; 24x24 px (or smaller) images input into CNN with ReLU	0.91 Fillup SPB	0.87 Fillup SPB	0.68 tSNE QTF	0.69 tSNE SPB	Colored pixel visualization for microbiome profile; explores 3 binning methods (PR, QTF, SPB) with color and gray colormaps
MicroPheno 2018 [67]	16S raw seqs	Find subsample size for stable k-mer profile; find best k; input k-mers to DNN (MLP w/ ReLU), RF, or multi-class linear SVM	N/A	N/A	N/A	N/A	16S sequences; k-mer distribution from shallow sub-samples outperformed OTU features; first 16S deep learning metagenome-phenotype exploration
MetaPheno 2019 [68]	sp. rel. ab. or raw seqs	Jellyfish k-mer counts; identify sig. k-mers with cohort p-values; apply hyper-parameter grid search models	N/A	N/A	0.76 gcF, k-mer	0.65 gcF, rel. ab.	Review of current methods; compares features: k-mers and rel. ab. with classifiers: SVM, RF, XGBoost, gcForest, AE-pretrained DNN (AutoNN)
DeepMicro 2020 [69]	sp. rel. ab. or strain markers	Low-dimensional profile representation from autoencoder; input into MLP with ReLU or hyper-parameter grid SVM or RF	0.94 SVM CAE	0.96 SVM SAE	0.76 MLP CAE	0.67 RF DAE	4 autoencoders (shallow, deep, variational, convolutional) to reduce microbiome dimension; combines with MLP, SVM, and RF param. sweep
MVIB 2021 [70]	sp. rel. ab. and strain markers	MLP for each modality (rel. ab., strain marker, metabolomics); Information Bottleneck theory to learn joint stochastic encoding	0.93 D	0.94 J;T	0.76 J;T	0.67 D	Combine multiple heterogeneous data modalities; explore default and joint pre-processing (D,J); optional triplet margin loss extension (T)

MetAML

The metagenomic prediction analysis based on machine learning (MetAML) [64] software laid the groundwork for detecting microbiome-phenotype associations by generating the first validated toolbox for disease prediction from shotgun metagenomes. MetAML established a computational ML framework for metagenome-based prediction tasks with implemented classifiers: support vector machine (SVM), random forest (RF), Lasso [79] and Elastic Net (ENet) [80]. MetAML also established the quantitative assessment to evaluate accuracy of each model and its ability to translate to the general population through cross-validation (prediction strength on metagenomic data) and cross-study (generalization of model on different studies) analyses. Results are measured with accuracy metrics: overall accuracy (OA), precision, recall, F1 and area under the curve (AUC). The MetAML framework was evaluated on metagenomic case-control datasets from five different diseases: inflammatory bowel disease (IBD), obesity, type-2 diabetes (T2D), liver cirrhosis and colorectal cancer. Features for tested models were generated from either species-level profiles or strain-specific presence markers based on results from MetaPhlAn2 [27] and further feature selection was also conducted with Lasso, ENet and RF embedded feature selection to give emphasis to features with greater discrepancy between cohorts. MetAML reported AUC scores over 0.88 for liver cirrhosis, colorectal cancer and IBD prediction, which highlighted the potential for disease detection from gut metagenomic data. Additionally, this exploratory analysis showed improved results when healthy cohorts were included in training models, features were extracted from a lower taxonomic rank (strain-specific markers), and methods of feature reduction were implemented. 
Despite these promising results, T2D and obesity datasets reported lower AUC scores (<0.80), encouraging researchers to explore alternative approaches to improve upon MetAML and yield better disease prediction results for these datasets.
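The accuracy metrics listed above can be written out explicitly. The sketch below assumes binary labels coded as 1 (case) and 0 (control); the AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. It is a plain-Python illustration with hypothetical function names, not code from MetAML.

```python
def classification_metrics(y_true, y_pred):
    """Overall accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    y_true = [1, 1, 0, 0]              # toy labels
    scores = [0.9, 0.4, 0.5, 0.1]      # toy classifier scores
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(classification_metrics(y_true, y_pred))
    print("AUC:", roc_auc(y_true, scores))
```

Unlike accuracy, precision, recall and F1, the AUC does not depend on a decision threshold, which is one reason it is convenient for comparing methods across data sets with different case/control ratios.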

PopPhy-CNN

PopPhy-CNN [65] aimed to improve classification accuracy of the liver cirrhosis, type 2 diabetes and obesity datasets from the MetAML package with a convolutional neural network (CNN) [81, 82] learning framework, where each layer uses the exponential linear unit (ELU) activation function. This method uses genus- and species-level relative abundances as well as a phylogenetic tree to empower the neural network to explore both quantitative characteristics from metagenomic data and spatial relationships from the taxonomic tree. The novelty of this approach lies in the use of taxonomic relationships between microbes and a custom-built feature extraction algorithm, yet results did not show significant improvement over MetAML across tested datasets.
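To illustrate how a taxonomic tree populated with abundances can become a 2D input for a CNN, here is a simplified sketch. It assumes that each internal node carries the sum of its children's abundances and that each tree depth becomes one zero-padded matrix row; the actual PopPhy-CNN layout may differ in detail. The tree, taxon names and function names are hypothetical.

```python
def populate(node, children, leaf_abundance, totals):
    """Fill `totals` so every node holds the summed abundance of its subtree."""
    kids = children.get(node, [])
    if kids:
        value = sum(populate(k, children, leaf_abundance, totals) for k in kids)
    else:
        value = leaf_abundance.get(node, 0.0)
    totals[node] = value
    return value

def tree_to_matrix(root, children, leaf_abundance):
    """Lay the populated tree out as a matrix: one row per tree depth.

    Nodes keep their left-to-right order within a level, and shorter
    rows are padded with zeros so the result is rectangular.
    """
    totals = {}
    populate(root, children, leaf_abundance, totals)
    rows, level = [], [root]
    while level:
        rows.append([totals[n] for n in level])
        level = [k for n in level for k in children.get(n, [])]
    width = max(len(r) for r in rows)
    return [r + [0.0] * (width - len(r)) for r in rows]

if __name__ == "__main__":
    # Toy taxonomy: parent -> ordered list of children.
    children = {
        "root": ["Firmicutes", "Bacteroidetes"],
        "Firmicutes": ["Lactobacillus", "Clostridium"],
        "Bacteroidetes": ["Bacteroides"],
    }
    sample = {"Lactobacillus": 0.2, "Clostridium": 0.3, "Bacteroides": 0.5}
    for row in tree_to_matrix("root", children, sample):
        print(row)
```

Placing related taxa next to each other is what lets convolutional filters pick up on patterns shared by neighbouring branches of the tree.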

Met2Img

Met2Img [66] also built a CNN, but this time with a rectified linear unit (ReLU) activation function. This approach incorporated a creative feature extraction step where each sample is transformed into an image containing colored pixels representing the various microbes and their relative quantities. Images are generated by one of two different methods: phylogenetic sorting using Fill-up or a t-distributed stochastic neighbor embedding (t-SNE) visualization method. The resulting images are then used as features for the neural network. Met2Img reports improved accuracy over the MetAML RF model for three of the diseases (liver cirrhosis, IBD and obesity), but little to no change for the remaining diseases (colorectal cancer and T2D).
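The image-based encoding can be sketched as follows. This simplified version assumes the abundances are already in the desired (for example phylogenetic) order, uses equal-width bins rather than any of the binning schemes explored by Met2Img, and stores a bin index per pixel instead of a color. Function names and the default of 10 bins are illustrative assumptions.

```python
import math

def bin_abundance(value, n_bins=10, max_value=1.0):
    """Map an abundance to an integer bin: 0 means absent, 1..n_bins present."""
    if value <= 0:
        return 0
    return min(n_bins, 1 + int(value / max_value * n_bins))

def fill_up(abundances, n_bins=10):
    """Arrange binned abundances row by row into the smallest square image.

    Unused pixels at the end of the image are left at 0.
    """
    n = len(abundances)
    if n == 0:
        raise ValueError("need at least one feature")
    side = math.isqrt(n - 1) + 1          # smallest side with side*side >= n
    pixels = [bin_abundance(v, n_bins) for v in abundances]
    pixels += [0] * (side * side - n)
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

if __name__ == "__main__":
    sample = [0.5, 0.0, 0.25, 0.05, 0.22]  # toy profile, already ordered
    for row in fill_up(sample):
        print(row)
```

In an approach of this kind, each bin index would then be mapped to a color or gray level before the image is passed to the CNN.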

MicroPheno

MicroPheno [67] simplified the sequencing step of the pipeline by utilizing k-mers from short-read 16S rRNA data, rather than shotgun metagenomic sequences, in a deep learning model. MicroPheno extracts k-mers from shallow sub-samples and applies hidden layer dropout in its multi-layer perceptron (MLP) [83] neural network architecture, allowing for a computationally inexpensive pipeline. When applied to a data set consisting of samples from different human body sites, an F1 score of over 0.90 was reported. However, this number dropped to 0.75 when applied to a Crohn’s disease dataset. While this simplified pipeline may prevent the model from accurately classifying complex samples, it did raise the idea of k-mer based feature extraction.

MetaPheno

LaPierre et al. [68] compared and contrasted existing metagenomic methods that used MetAML datasets in their publication results in an evaluation called MetaPheno. The authors hypothesize that classification accuracy falls short on T2D and obesity datasets due to overfitting, and specifically explore ways to improve upon these results. They implemented k-mer-based feature extraction as shown in MicroPheno, but this time k-mers were derived from shotgun metagenomic data rather than subsampled from 16S. The MetaPheno pipeline counts k-mer frequencies with Jellyfish [21], extracts significant k-mers through a statistical model, then applies a machine learning model (SVM, RF, XGBoost [84], gcForest [85] or an autoencoder-pretrained deep neural network (DNN) [83, 86, 87]). Results showed that no single model outperformed others in all metrics and that the explored methods of feature reduction were unsuccessful in drastically improving accuracy over MetAML findings. The authors ultimately were not successful with T2D and obesity; they speculate that prediction using only metagenomic reads may not be possible and perhaps there is an upper limit on the accuracy that can be achieved with this input data. They recommend future work in methods utilizing host genomic or additional multi-omic data sources alongside metagenomic data, then building a deep learning model such as a similarity network fusion [88].

DeepMicro

Following the publication of MetaPheno, DeepMicro [69] was released as another deep learning method evaluated on the datasets from MetAML, specifically focused on feature extraction. DeepMicro experiments with converting high-dimensional microbiome data to low-dimensional representations through an autoencoder (AE) [87]. Datasets from all 5 MetAML diseases were tested with both species-level relative abundance and strain-level marker output from MetaPhlAn2, with four different autoencoders (shallow (SAE), deep (DAE), variational (VAE) and convolutional (CAE)), and with three different classification algorithms (SVM, RF and MLP). In the results, the best performing autoencoder is highly dependent on the problem complexity and intrinsic properties of the input data. Additionally, incorporating healthy controls into the model worsened performance, a contrast with findings from MetAML. Still, the best DeepMicro approach outperformed the best MetAML approach in all but one of the tested diseases (colorectal cancer), highlighting the importance of effective feature extraction techniques for metagenomic data. However, and yet again, the AUC score for obesity is still only 0.674, leaving plenty of room for future success in this application.

MVIB

Accurate classification of obesity and T2D with ML approaches is still an open area of research that continues to improve as new ML techniques are developed. One up-and-coming method is the multimodal variational information bottleneck (MVIB) [70], which takes advantage of both the species-level abundance and strain-level markers from MetaPhlAn2 output by computing a joint stochastic encoding from both profiles. MVIB is a multimodal generalization of the Deep Variational Information Bottleneck [89], which allows a model to learn a joint encoding from heterogeneous input data modalities with a deep neural network. MVIB reports an improvement over DeepMicro with VAE in each of the test datasets, emphasizing the value in obtaining multiple sources of information for each sample in a single model. This is especially valuable to this line of research as it sets a foundation for incorporating additional input parameters and clinical data which could potentially improve the classification accuracy of future models. In addition, the authors experiment with adding a joint collection pre-processing step, where input abundances and markers were made homogeneous across all diseases, as well as transfer learning from non-targeted disease data sets.

Open challenges

Tailoring ML models to detect disease patterns from microbiome data presents a range of challenges (Table 3). The first is the standard ‘big-p, little-n’ issue, where the number of variables in the input data dwarfs the number of samples available. Importantly, this is the defining challenge of phenotype classification from microbiome features. Sequencing costs combined with logistical challenges of sample collection and patient recruitment limit the number of samples available for a given study. Increasing the size of training data has long been a reliable way to improve model performance in most disciplines, but that is commonly not an option for microbiome studies. As a result, this situation often leads to overfitting, yielding a suboptimal model and in turn preventing the model from accurate classification on test data.

Table 3.

Challenges presented by microbiome data as input for ML models and the approaches taken by discussed methods to tackle these challenges.

Challenge	Method	Approach
Large feature space; small sample size	MetAML	feature selection with Lasso, ENet, or RF n most important
Large feature space; small sample size	PopPhy-CNN	feature selection with novel alg; network regularization
Large feature space; small sample size	DeepMicro	autoencoder for low-dimensionality representation; early stopping
Large feature space; small sample size	MVIB	stochastic probabilistic encoders
Large feature space; small sample size	MicroPheno	shallow subset of 16S k-mers; early stopping; dropout hidden layers
Large feature space; small sample size	MetaPheno	select k-mer counts from 1000 most significant k-mers
Large feature space; small sample size	Met2Img	convert profile to binned image; early stopping
Presence of novel species	MicroPheno & MetaPheno	raw sequence input data (k-mers)
Temporal fluctuations in microbe abundances	MetAML	include multiple samples from a single test subject
Temporal fluctuations in microbe abundances	MVIB	combine abundance and marker profiles for each sample

To overcome this issue, each of the methods discussed above includes a feature reduction step. Both MetAML and PopPhy-CNN select only features that are considered the most important for the model. MetaPheno follows the same approach by embedding an algorithm to select the 1000 features with the smallest p-value between case and control groups. DeepMicro explores four different autoencoders, which are used in the model to learn low-dimensional representations from complete microbiome profiles. MVIB incorporates a stochastic encoder for each input data modality. MetaPheno and MicroPheno both use k-mer counts and reduce complexity in their neural networks with hidden layer dropout tactics. Met2Img takes an entirely unique approach and converts microbial quantities into binned images, which essentially transforms each sample into a single feature.

A second challenge encountered is the presence of novel species. This challenge highlights the limitations of reference-based feature computation, where the most similar database entry is used for labeling and therefore reads can only be classified as an existing database entry. This results in novel microbes being either misclassified as an incorrect organism or grouped together in an ‘unclassified’ category. Both MicroPheno and MetaPheno overcome this challenge by using k-mer count profiles, avoiding the classification step entirely. An alternative approach would be the use of a reference-free and therefore database-agnostic method [90].

A third challenge presented is the fluctuating nature of microbial communities. Although the gut microbiome in healthy adults is shown to be relatively stable over time [91], more substantial fluctuations have been observed in subjects exhibiting illness [92-94]. This dynamic environment raises concern since the most informative time to collect samples is not yet established and, beyond this, the true separation between disease states may lie in the changes that occur over time [95]. MetAML takes this knowledge into account by incorporating samples from various stages throughout a participant’s illness in the liver cirrhosis and T2D test data sets. MVIB aims to gain a more thorough representation of each changing environment by including multiple profiles (species-level abundance and strain-level markers) for each input sample. However, none of the methods presented exploit longitudinal patient samples from both a healthy and diseased state and therefore cannot truly explore temporal changes in an individual that arise with disease onset. Although this would introduce a layer of complexity to the approach illustrated in Figure 1, it would provide a comprehensive tactic to account for the gut microbe community changes that have been observed in disease.
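The significance-based feature reduction discussed earlier in this section (keeping only the features that differ most between cases and controls) can be sketched as below. To stay dependency-free, the sketch ranks features by the absolute Welch t-statistic instead of computing p-values, so it is a stand-in for, not a reproduction of, the MetaPheno selection step. It assumes at least two samples per group; function names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(case_values, control_values):
    """Welch t-statistic for one feature (unequal variances allowed)."""
    diff = mean(case_values) - mean(control_values)
    se = math.sqrt(variance(case_values) / len(case_values)
                   + variance(control_values) / len(control_values))
    if se == 0:
        # Constant within both groups: no signal, or perfect separation.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def top_features(X, y, n_keep):
    """Return sorted column indices of the n_keep most separating features.

    X is a list of samples (rows) with one value per feature (column);
    y holds 1 for cases and 0 for controls.
    """
    scored = []
    for j in range(len(X[0])):
        case = [row[j] for row, label in zip(X, y) if label == 1]
        control = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((abs(welch_t(case, control)), j))
    scored.sort(key=lambda s: (-s[0], s[1]))  # strongest first, ties by index
    return sorted(j for _, j in scored[:n_keep])

if __name__ == "__main__":
    # Toy data: feature 0 separates the groups, features 1 and 2 do not.
    X = [[1, 5, 0], [2, 5, 1], [9, 5, 0], [10, 5, 1]]
    y = [0, 0, 1, 1]
    print(top_features(X, y, n_keep=1))
```

When a step like this is combined with cross-validation, the ranking should be computed on the training portion only; selecting features on the full data set leaks label information into the held-out samples and inflates reported accuracy.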

Future research directions

Despite repeatedly observed associations of the gut microbiota in T2D and obesity studies, the presented methods have encountered difficulty providing accurate phenotype classification for these diseases [34, 35, 96–102]. This raises the question: do the current feature sets, exclusively derived from metagenomic data, inherently limit predictive power for some phenotypes? It may be that microbes are only one of several contributing factors in the condition, or it may be that the microbes do not contribute at all but rather respond in a consistent way to the environmental change brought about by the disease. A major source of future advancement in phenotype-prediction would be the result of discovering new data sources or feature types that have complementary predictive power, then utilizing the appropriate model structures for leveraging additional information. Here, we consider two biological factors that interact with the microbiome to have a combined effect on host health (host genetics and microbe-derived metabolites) and explore their potential to act as complementary predictors in microbiome-disease relations.

Gut microbiome and host genetics

Because gut microbes impact host health through interactions with host cells [19], incorporation of host DNA may improve future models. In addition, associations between gut microbes and host genes have been detected in a variety of studies [103-108]. One case where inclusion of host genomic information strengthened findings was a study in which Ryan et al. [109] examined the colonic microbiota in relation to IBD through host transcriptomics, epigenomics and genetics data. The authors concluded that while the microbiota appeared to be linked to the disease, there was no evidence of a distinct microbial diagnostic signature, likely due to heterogeneous host–microbe interactions. They then included epithelial DNA methylation in the algorithm, which yielded better classification results.

Gut microbiome and metabolites

Metabolites are the intermediate or end products of interactions between microbes and host cells [110]. Short-chain fatty acids (SCFA) and bile acids are metabolites that can have beneficial or damaging impacts on the host tissue [111] and have shown associations with a range of illnesses [102, 112–116]. This promotes the idea of incorporating metabolite information in disease state analyses. Jeffery et al. found that only the fecal metabolome, rather than the microbiome alone, could distinguish between IBD patients that expressed the bile acid malabsorption phenotype. In a separate study, Sanna et al. [117, 118] leveraged SCFA levels in addition to gut microbial data to discover an association with risk of T2D, which is likely due to the fact that microbe-derived SCFA acts as an additional energy source and therefore can increase likelihood for obesity. In both these studies, inclusion of microbe-derived metabolites was essential for observing associations between the microbiome and disease states. Promising results produced from leveraging both these data types have led to the development of further studies and software tools for effective combination of metabolite and microbiome information as input data [70, 119–122].

Microbiome-based models for obesity classification

Current microbiome-based models for obesity classification will likely be improved by incorporating variables containing additional signal [123, 124]. Increased blood glucose levels heavily influence obesity [125], but are difficult to control since foods elicit vastly different responses across individuals [126-128]. However, Zeevi et al. showed great success in developing an ML algorithm for predicting postprandial blood glucose levels by integrating blood parameters, dietary habits, anthropometrics, physical activity and gut microbiome data. This model was constructed from an 800-person cohort, then tested on a separate 100-person cohort where it accurately predicted personalized glucose response after consumption of different foods. In addition, the model was adjusted and applied in a blinded randomized controlled dietary intervention, where participants effectively altered their gut microbiota to successfully lower postprandial responses based on personalized diet recommendations from the algorithm. This advancement in blood glucose response prediction suggests the possibility of a similarly accurate model for obesity classification.

Another avenue for future research in phenotype prediction models is the integration of the causal relationship between microbiota and host disease, given one exists. This relationship is likely to differ between illnesses and is currently difficult to ascertain in many cases [41, 129]. Plausibly addressing causality is challenging since the fine control over experimental conditions required can only be achieved in animals. Given that microbial communities vary substantially across hosts, conclusions in animal models often are not transferable to humans [130], even with increased validation from replication of findings across different animal hosts. As the causal relationship of gut microbes becomes further established for each illness, leveraging this information appears promising for future disease classification ML models.

Conclusion

Recent advances in ML methods have opened the door to deciphering the intricate role of gut microbes in host health and disease. Methods have proven successful in classification of multiple illnesses using solely metagenomic information due to their inherent ability to handle multi-dimensional data and identify trends with little upfront knowledge. Published methods have explored a broad range of standard model platforms including random forest, deep neural networks and support vector machines as well as unconventional approaches to overcome challenges presented by microbial communities. While notable accuracy of inflammatory bowel disease and liver cirrhosis classification has been reported, less success has been observed for obesity [64, 68, 70]. After exhausting model types and parameter settings through trial-and-error, the current limiting factors appear to be the unknown causal roles of microbes and a lack of further influential features. Additional clinical data, including but not limited to human genetics, metabolomes and lifestyle factors, combined with microbial information and the appropriate feature reduction technique show promise for improved disease prediction accuracy in future ML algorithms for these complex gut microbiota relationships.

Several ML methods have been developed for host disease detection from microbiome sequences with various classifiers and feature reduction approaches. Methods have proven successful in predicting liver cirrhosis and IBD from gut metagenomic samples, but have not shown such accuracy with obesity prediction. Future perspectives for improvement in disease detection algorithms include incorporating additional biological factors as input into ML models, such as host genomics, microbiome-derived metabolite levels or blood glucose response predictions.
