Duo Jiang1, Courtney R Armour2, Chenxiao Hu1, Meng Mei1, Chuan Tian1, Thomas J Sharpton1,2, Yuan Jiang1.
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Keywords: compositionality; heterogeneity; microbiome networks; multi-omics data integration; network analysis; normalization; sparsity
Year: 2019 PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of available network-based procedures.
| Method type | Network type | Representative methods (software: packages) | Advantages | Disadvantages |
|---|---|---|---|---|
| Marginal correlation analysis | Undirected | Pearson’s correlation, Spearman’s rank correlation, Kendall’s tau (R: base); Local similarity analysis (Linux: ELSA); WGCNA (R: WGCNA) | Easy to implement; nonparametric options available. | Subject to spurious findings due to confounding. |
| Dimension reduction methods | Typically undirected | PCA (R: base); CCA (R: CCA); PLS (R: pls); CIA (R: ade4); Sparse CCA, Sparse multiple CCA (R: PMA); Sparse PLS (R: spls); Sparse CIA (R: pCIA); Kernel PCA, kernel CCA (R: kernlab) | Can be used to construct networks linking modules of features. | Poor interpretability because each node represents multiple, if not all, features. |
| Regression-based methods | Directed or undirected | Linear and generalized linear models (R: base); Linear and generalized linear mixed models (R: nlme, lme4); Regularized regression: Lasso, ridge, elastic net (R: glmnet), SCAD, MCP (R: ncvreg), Group lasso, group elastic net, group SCAD, group MCP (R: grpreg); Regularized multivariate regression: Graph-guided fused lasso (R: GFLASSO), remMap (R: remMap), Reduced-rank regression (R: rrpack) | Easy to incorporate covariates; a large number of statistical methods and software tools are available. | Need to specify each feature as either a response variable or a predictor. |
| Graphical models | Undirected | Graphical lasso (R: glasso, huge); Neighbourhood selection (R: huge); Joint graphical lasso (R: JGL); Conditional graphical models; Covariate-adjusted graphical models (R code: caPC) | Conditional dependency captures direct biological interactions more effectively than methods based on marginal correlations. | Most methods assume a multivariate normal distribution. |
| Bayesian networks | Directed | CONEXIC (Linux: CONEXIC); QTLnet (R: qtlnet); Bayesian Network Prior (MATLAB: BNP); Search-and-score approaches, constraint-based approaches (R: bnlearn) | Links more directly related to causality; ability to incorporate prior knowledge; possibility to handle data following disparate distribution types. | Current methods do not scale well to massive data sets. |
| Network integration | Undirected | GeneMANIA (Cytoscape/Web: GeneMANIA); SNF (R: SNFtool); DCA (MATLAB: Mashup) | Often simple to implement; ability to borrow information from multiple networks. | Individual networks that serve as the input of the methods must be reliably estimated; a shared biological mechanism is assumed. |
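
As an illustration of the first row of the table, a marginal correlation network can be sketched in a few lines. The example below is a minimal sketch, not the procedure of any package listed above: it computes Spearman's rank correlation for every pair of features and keeps pairs whose absolute correlation meets a cutoff. The feature names, the values, and the threshold of 0.8 are hypothetical, and the function names (`average_ranks`, `spearman`, `correlation_network`) are introduced here only for illustration.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant feature as uncorrelated
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_network(features, threshold=0.8):
    """Return undirected edges (a, b, rho) with |rho| >= threshold."""
    edges = []
    for a, b in combinations(sorted(features), 2):
        rho = spearman(features[a], features[b])
        if abs(rho) >= threshold:
            edges.append((a, b, round(rho, 3)))
    return edges


# Hypothetical measurements across six samples (not data from the review).
mock = {
    "taxon_V": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    "taxon_Y": [0.60, 0.50, 0.40, 0.30, 0.20, 0.10],
    "metabolite_1": [1.0, 2.1, 2.9, 4.2, 5.1, 5.8],
    "metabolite_2": [3.0, 1.0, 4.0, 1.5, 5.0, 2.0],
}
edges = correlation_network(mock, threshold=0.8)
for edge in edges:
    print(edge)
```

As the table notes, edges found this way reflect marginal association only and may be spurious when a confounder drives both features, so a sketch like this is a starting point rather than evidence of a direct interaction.
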
Figure 1. Visualizing the unique challenges of microbiome data. A mock set of bacterial samples from two populations where each colored shape is a bacterial taxon. (A) Compositionality. The taxon abundance table depicts the count of each observed taxon in each sample. When sequencing microbiome samples, the resulting counts of taxa are not representative of the actual taxa counts in the sample due to constraints of sequencing. Due to this, relative abundances are generally used in analysis of microbiome data. The bar plots illustrate the difference in community representation between raw counts (top) and relative abundances (bottom). (B) Normalization. Due to the constraints of sequencing, the overall sequencing depth of a sample can impact the results. For example, shallow sequencing may miss rare taxa such as the green taxon V in the example sample A that is present in low abundance in the community. (C) Sparsity. Microbiome data are often very sparse, where most observations are zero. This is illustrated by the histogram of taxa counts for each sample where most counts are zero and there are few taxa with high counts. This can also be seen in the table for part A, where many entries are zero. (D) Heterogeneity. The table summarizes the taxonomic heterogeneity in the mock dataset between the two populations. Each sample has a unique taxonomic composition, but there are also population-specific signatures. The samples in each population are dominated by a few taxa, and these dominant taxa are different for the two populations. Additionally, there are taxa that are highly abundant in one sample and absent from the rest, such as the purple taxon Y in sample A.
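
The compositionality and sparsity panels described in the caption can be made concrete with a small numerical sketch. The count table below is hypothetical and is not the mock dataset drawn in the figure, and the helper names `relative_abundance` and `zero_fraction` are introduced here for illustration. The sketch shows two things: converting counts to relative abundances removes differences in sequencing depth, and it also couples the taxa, so growth in one taxon lowers the relative abundance of the others even when their counts are unchanged.

```python
def relative_abundance(counts):
    """Convert one sample's taxon counts to proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def zero_fraction(table):
    """Fraction of entries in a sample-by-taxon count table that are zero."""
    cells = [c for row in table.values() for c in row]
    return sum(1 for c in cells if c == 0) / len(cells)


# Hypothetical counts: each row is a sample, each column one of five taxa.
counts = {
    "sample_A": [50, 30, 20, 0, 0],
    "sample_B": [500, 300, 200, 0, 0],  # same community as A, sequenced 10x deeper
    "sample_C": [50, 30, 120, 0, 0],    # as A, but only the third taxon has grown
}

rel = {name: relative_abundance(row) for name, row in counts.items()}
sparsity = zero_fraction(counts)

for name, row in rel.items():
    print(name, [round(p, 2) for p in row])
print("fraction of zero entries:", round(sparsity, 2))
```

In this sketch, samples A and B yield the same proportions despite a tenfold difference in depth, whereas the first taxon falls from 0.5 in sample A to 0.25 in sample C although its count is identical. This dependence among proportions is the reason correlation measures applied to relative abundances need to be interpreted with care.
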