Abstract
Metabolomics deals with multiple and complex chemical reactions within living organisms and how these are influenced by external or internal perturbations. It lies at the heart of omics profiling technologies not only as the underlying biochemical layer that reflects information expressed by the genome, the transcriptome and the proteome, but also as the closest layer to the phenome. The combination of metabolomics data with the information available from genomics, transcriptomics, and proteomics offers unprecedented possibilities to enhance current understanding of biological functions, elucidate their underlying mechanisms and uncover hidden associations between omics variables. As a result, a vast array of computational tools has been developed to assist with integrative analysis of metabolomics data with other omics data. Here, we review and propose five criteria (hypothesis, data types, strategies, study design and study focus) to classify statistical multi-omics data integration approaches into state-of-the-art classes under which all existing statistical methods fall. The purpose of this review is to look at the various aspects that guide the choice of the statistical integrative analysis pipeline in terms of the different classes. We draw particular attention to metabolomics and genomics data to assist those new to this field in the choice of the integrative analysis pipeline.
Keywords: data integration; genomics; integration strategies; multi-omics
Year: 2021 PMID: 33801081 PMCID: PMC8003953 DOI: 10.3390/metabo11030184
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Classification of different data integration approaches. The examples list is by no means exhaustive.
| Integrative Analysis | Description | Examples |
|---|---|---|
| Study design | | |
| Repeated study | In a repeated study, the experiment is repeated at another time or place to generate a second type of data. | Cavill et al. [ |
| Replicate matched study | In a replicate matched study, biological replicates are used to generate additional types of data. | Cavill et al. [ |
| Split sample study | In a split sample study, the same biological sample is split for profiling with different omics technologies. | Cavill et al. [ |
| Source matched study | In a source matched study, different samples from the same biological organism are extracted and used to generate different types of data. | Cavill et al. [ |
| Data types | | |
| Horizontal or homogeneous data integration (meta-analysis) | Horizontal integration involves combining measurements of the same omics entities across various cohorts, labs or studies. | Richardson et al. [ |
| Vertical or heterogeneous data integration | Vertical integration involves combining entities from different omics levels, often measured using different platforms. | Richardson et al. [ |
| Hypothesis | | |
| Multi-staged | In multi-staged integration, inter-omics variation (variation between omics) is assumed to be unidirectional from the genome to the metabolome. | Nicholson et al. [ |
| Meta-dimensional | In meta-dimensional integration, inter-omics variation is assumed to be multi-directional or simultaneous. | Smolinska et al. [ |
| Strategies | | |
| Early integration | Early integration combines two datasets by simply concatenating them into a single data matrix. | Fridley et al. [ |
| Intermediate integration | Intermediate integration involves a data transformation step to be performed prior to modeling. | Le et al. [ |
| Late integration | Late integration consists of combining single data models into a high level model. | Acharjee et al. [ |
| Study focus | | |
| Sequential analysis | Does the additional data type enhance understanding of the first data type? | Yuan et al. [ |
| Biological analysis | What are the underlying processes leading to phenotypical changes? Which mechanisms explain the prevalence of a phenotype? | Hirai et al. [ |
| Model-based analysis | Which variables are phenotypically relevant or significantly associated? Can predictive ability be improved? | Smolinska et al. [ |
Case study examples underlining considerations that researchers should make when carrying out multi-omics experiments and analyses. Integrative analysis that is driven by a hypothesis should result in a data interpretation that links back to that hypothesis (see Section 5). Hence the underlying hypothesis should be considered not only alongside the research question but also at the data interpretation step.
| Workflow | Considerations | Choices and Comments |
|---|---|---|
| Example from Le et al. [ | | |
| Study focus | Research questions | Is it possible to predict metabolite abundance from bacteria abundance in inflammatory bowel disease (IBD)? Can we learn the synergistic relationship between the gut microbiome and their surrounding metabolites? These questions suggest an interest in complex associations between the metabolome and the microbiome which will be investigated through model-based analysis. The choice of a model-based analysis highly affects the integrative strategy while requiring it to comply with the hypothesis. |
| | Hypothesis | As suggested by the research question, the authors assume that there exist intermediate factors acting in the middle of the process that transforms microbes into metabolites, and that the processes through which microbes affect metabolites are highly interdependent, following a multi-staged integrative approach. |
| Study design, sample collection and data acquisition | Study type | Paired data from a cohort of inflammatory bowel disease patients. |
| | Omics layers | Microbiome and metabolome |
| | Biological samples | Fecal samples |
| | Platforms | Next-Generation Sequencing (NGS) and LC-MS |
| | Preprocessing | In addition to the standard pre-processing workflow applied to each platform, the authors used compositional methods, e.g., centered log-ratio transformation, to ensure that their workflow will generalize to any pair of omics data. |
| | Data types | Vertical data integration on paired data with heterogeneous features: microbe abundance and metabolite abundance. |
| Data analysis | Strategies | Intermediate integration via neural encoder-decoder networks. Non-negative weights are imposed on the networks to enforce a unidirectional variation from the microbiome to the metabolome. |
| Data interpretation | Hypothesis | Microbe abundance is able to reliably predict abundance of a range of metabolites while empowering clinically relevant relationships. The findings also suggest that the “microbe-metabolite axis itself, not just the microbes and metabolites alone, is an IBD-specific biomarker signature.” |
| Example from Nicholson et al. [ | | |
| Study focus | Research question | Are there 1H NMR-detectable metabolites in urine or plasma that are strongly influenced by common single-locus genetic variation? This question involves, but is not restricted to, a model-based integrative analysis and will guide the study design, data analysis and data interpretation. |
| | Hypothesis | Variation is unidirectional downstream from genes to metabolites. |
| Study design, sample collection and data acquisition | Study type | Cohort study |
| | Omics layers | Genome and metabolome |
| | Biological samples | Whole-blood, plasma and urine |
| | Platforms | Untargeted 1H NMR and targeted flow-injection tandem MS: The sets of metabolites observed from the two platforms were minimally overlapping and therefore complementary. The genotyping assay used Illumina arrays. |
| | Longitudinal profiling | Measurements of heterogeneous omics entities were recorded at the same time point. The longitudinal design allowed detailed variance-components analysis of the sources of population variation in metabolite levels. |
| | Preprocessing | Preprocessing, including metabolite annotation, was performed using standard pipelines for each platform. |
| | Data types | The authors considered two cohorts from the MolPAGE study with the aim of using one cohort to replicate findings of the other one (Sequential integration). Vertical data integration was performed on genome-wide SNP genotypes and metabolic features. |
| Data analysis | Strategies | Early integration through Genome-Wide Metabolic QTL Analysis to identify associations. |
| Data interpretation | Hypothesis | The mQTLs explained a significant share of the biological population variation in the corresponding metabolites’ concentrations, which is well aligned with the hypothesis of a multi-staged integrative analysis. This is also coherent with the research question (study focus) and the strategy adopted. |
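
The centered log-ratio (CLR) transformation named in the preprocessing step of the Le et al. example can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `clr`, the pseudocount of 1 used to handle zero counts, and the toy abundance vector are assumptions introduced here.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's abundance vector.

    The pseudocount is an assumption of this sketch; it avoids log(0)
    for features that were not observed in the sample.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

# Hypothetical microbe counts for a single fecal sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
```

By construction, the transformed values of a sample sum to zero, which removes the dependence on the total count of that sample while preserving the ranking of features within it.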
Figure 1. This figure illustrates how different data types can be coupled to each other. Example (a): Meta-analysis or horizontal integrative analysis involves data collection under different conditions, resulting in two datasets that share the same features (e.g., only metabolomic features) but different samples. These observations can be combined into one data matrix after meta-analysis. Example (b): In heterogeneous or vertical integrative analysis, data are acquired from samples profiled under the same conditions but do not share the same features, e.g., genomic features vs. metabolomic features. Strategies that can be used for these types of integrative analysis are depicted in Figure 3.
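The two couplings described in Figure 1 can be sketched with plain Python data structures. The dictionary layout (`samples`, `features`, `values`), the function names and the toy values are assumptions made for illustration only; horizontal integration stacks samples that share features, whereas vertical integration joins different features measured on the same samples.

```python
def integrate_horizontal(cohort_a, cohort_b):
    """Same features, different samples: stack the rows."""
    if cohort_a["features"] != cohort_b["features"]:
        raise ValueError("horizontal integration needs identical features")
    return {
        "samples": cohort_a["samples"] + cohort_b["samples"],
        "features": list(cohort_a["features"]),
        "values": cohort_a["values"] + cohort_b["values"],
    }

def integrate_vertical(omics_a, omics_b):
    """Same samples, different features: join the columns on sample ID."""
    rows_a = dict(zip(omics_a["samples"], omics_a["values"]))
    rows_b = dict(zip(omics_b["samples"], omics_b["values"]))
    shared = [s for s in omics_a["samples"] if s in rows_b]
    return {
        "samples": shared,
        "features": omics_a["features"] + omics_b["features"],
        "values": [rows_a[s] + rows_b[s] for s in shared],
    }

# Hypothetical toy data: two metabolomics cohorts and one genomics block.
cohort_1 = {"samples": ["s1", "s2"], "features": ["met1", "met2"],
            "values": [[1.0, 2.0], [1.5, 2.5]]}
cohort_2 = {"samples": ["s3"], "features": ["met1", "met2"],
            "values": [[0.8, 2.2]]}
genomics = {"samples": ["s1", "s2"], "features": ["snp1"],
            "values": [[0], [2]]}

horizontal = integrate_horizontal(cohort_1, cohort_2)  # 3 samples x 2 features
vertical = integrate_vertical(genomics, cohort_1)      # 2 samples x 3 features
```

The sketch only pairs the data. Batch correction between cohorts in the horizontal case, and the choice of integration strategy in the vertical case, are separate steps that it does not cover.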
Figure 2. Examples of a multi-stage integrative analysis approach. Example (a) illustrates a three-step framework where genomic and metabolomic datasets are concurrently tested for association with the phenotype, resulting in smaller datasets. These datasets are then investigated to infer linked variables. Example (b) illustrates a typical scenario where genomic variables are tested for association with transcripts, which are in turn associated with metabotypes. These metabotypes might, for instance, explain the expression of a given phenotype. These models are useful for vertical data integration but not suitable for meta-analysis since they assume that different omics entities are observed.
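The three-step framework of example (a) can be sketched as a screening step per omics block followed by a linking step. This is a simplified illustration under stated assumptions: Pearson correlation with a fixed threshold stands in for a proper association test with multiple-testing control, and the function names, the threshold of 0.8 and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def screen(features, phenotype, threshold):
    """Steps 1 and 2: keep features associated with the phenotype."""
    return {name: values for name, values in features.items()
            if abs(pearson(values, phenotype)) >= threshold}

def link(block_a, block_b, threshold):
    """Step 3: report pairs of retained features that are associated."""
    return [(a, b, pearson(va, vb))
            for a, va in block_a.items()
            for b, vb in block_b.items()
            if abs(pearson(va, vb)) >= threshold]

# Hypothetical data for six samples.
phenotype = [1.0, 1.2, 2.1, 2.9, 3.2, 4.0]
snps = {"snp1": [0, 0, 1, 1, 2, 2], "snp2": [1, 0, 2, 0, 1, 0]}
metabolites = {"met1": [0.9, 1.1, 2.0, 3.1, 3.0, 4.2],
               "met2": [5.0, 4.9, 5.1, 5.0, 4.8, 5.2]}

kept_snps = screen(snps, phenotype, threshold=0.8)
kept_metabolites = screen(metabolites, phenotype, threshold=0.8)
links = link(kept_snps, kept_metabolites, threshold=0.8)
```

Screening each block first reduces the number of pairwise tests in the linking step, which is the practical motivation for the "smaller datasets" mentioned in the caption.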
Figure 3. Different data integration strategies. (a) illustrates early integration, where data are combined into a single data matrix before modeling. (b) depicts the intermediate data integration level, where data matrices are transformed or mapped into a common meaningful representation before modeling. In (c), each data model is generated separately and is then combined with models based on other data types to generate the integrative or high-level model. Early integration is often used in meta-analysis [76]. Intermediate and late data integration strategies can be applied for meta-analysis, but such applications are scarce in the literature.
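The data flow of the three strategies in Figure 3 can be contrasted in a short sketch. Everything here is an assumption chosen for brevity rather than a recommended method: the averaged linear kernel stands in for the "common representation" of intermediate integration, a one-predictor least-squares line on the row mean stands in for each single-block model, and late integration simply averages the two sets of fitted values.

```python
def early(block_a, block_b):
    """(a) Concatenate the feature columns of paired samples."""
    return [row_a + row_b for row_a, row_b in zip(block_a, block_b)]

def gram(block):
    """Linear kernel: sample-by-sample inner products."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in block]
            for ri in block]

def intermediate(block_a, block_b):
    """(b) Map each block to a sample-by-sample kernel and average them."""
    kernel_a, kernel_b = gram(block_a), gram(block_b)
    return [[(x + y) / 2 for x, y in zip(ra, rb)]
            for ra, rb in zip(kernel_a, kernel_b)]

def fit_line(x, y):
    """Ordinary least squares for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x

def late(block_a, block_b, response):
    """(c) Fit one model per block, then average their fitted values."""
    fitted = []
    for block in (block_a, block_b):
        score = [sum(row) / len(row) for row in block]  # toy block summary
        slope, intercept = fit_line(score, response)
        fitted.append([slope * s + intercept for s in score])
    return [(p + q) / 2 for p, q in zip(*fitted)]

# Hypothetical paired data: four samples, two SNPs and three metabolites.
genomics = [[0, 1], [1, 1], [2, 0], [2, 2]]
metabolomics = [[0.5, 1.0, 0.2], [0.7, 1.4, 0.1],
                [1.1, 2.0, 0.4], [1.6, 2.9, 0.3]]
response = [1.0, 1.5, 2.2, 3.1]

early_matrix = early(genomics, metabolomics)          # 4 samples x 5 features
shared_kernel = intermediate(genomics, metabolomics)  # 4 x 4 representation
late_fitted = late(genomics, metabolomics, response)  # 4 combined predictions
```

In practice each block would usually be scaled before concatenation or kernel computation so that the block with more features or larger variance does not dominate; that step is omitted here.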
Brief overview of some multi-omics tools and techniques supporting integrative analysis in alphabetical order.
| Resource | Core Integrative Analysis Tasks | Interface | Study Focus | Reference |
|---|---|---|---|---|
| GAIT-GM | Annotation, network modeling and pathway analysis | Python | Sequential analysis & Biological-based integration | McIntyre et al. [ |
| iOmicsPASS | Network-based analysis and predictive feature selection | C++ | Model-based integration & Biological-based integration | Koh et al. [ |
| INDEED | Network analysis | R | Model-based integration | Zuo et al. [ |
| OmicsTIDE | Clustering and visualisation | online | Model-based integration & Sequential integration | Harbig et al. [ |
| mbpls | Dimension reduction (Multi-block PLS) | Python | Model-based analysis & Sequential integration | Baum and Vermue [ |
| MetaboAnalyst | Enrichment analysis | online, R | Biological-based integration | Xia et al. [ |
| MetaBridge | Pathway mapping | online | Biological-based integration | Hinshaw et al. [ |
| MetExplore | Pathway mapping and graph-based analysis | online | Biological-based integration | Cottret et al. [ |
| mixOmics | Dimension reduction and feature selection | R | Model-based integration | Rohart et al. [ |
| multiGSEA | Enrichment analysis | R | Biological-based integration | Canzler et al. [ |
| NetMet | Network modeling | online | Biological-based integration | Tal et al. [ |
| paintOmics 3 | Pathway visualisation | online | Biological-based integration | García-Alcalde et al. [ |
| ROSA | Dimension reduction (Multi-block PLS) | R | Model-based analysis & Sequential integration | Liland et al. [ |