When one and one gives more than two: challenges and opportunities of integrative omics.

Abstract

Since the dawn of the post-genomic era a myriad of novel high-throughput technologies have been developed that are capable of measuring thousands of biological molecules at once, giving rise to various "omics" platforms. These advances offer the unique opportunity to study how individual parts of a biological system work together to produce emerging phenotypes. Today, many research laboratories are moving toward applying multiple omics platforms to analyze the same biological samples. In addition, network information of interacting molecules is being incorporated more and more into the analysis and interpretation of these multiple omics datasets, which provides novel ways to integrate multiple layers of heterogeneous biological information into a single coherent picture. Here, we provide a perspective on how such recent "integrative omics" efforts are likely going to shift biological paradigms once again, and what challenges lie ahead.

Keywords: data integration; omics; statistical data analysis; systems biology

Year: 2012 PMID: 22303399 PMCID: PMC3262227 DOI: 10.3389/fgene.2011.00105

Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599

Introduction

The first generation of whole-genome sequencing projects has inspired the development of technologies aimed at comprehensively characterizing various types of biological molecules, opening up entirely new fields such as genomics, transcriptomics, proteomics, metabolomics, and so forth. Thanks to these technological advances, one can now routinely sequence the entire genome of an organism to scan for genetic polymorphisms, measure the abundance of genes and their products, map epigenetic modifications and transcriptional regulation, chart the global networks of genetic interactions or protein–protein interactions (PPI), and comprehensively measure sugars, lipids, and metabolites in virtually any biological specimen. The systems-level information provided by each omics platform offers a unique insight into the complexity of a biological system and, as a consequence, scientific discoveries and their clinical applications have benefited immensely from omics data over the past decade (Van de Vijver et al., 2002; Van ’t Veer et al., 2002; Hanash et al., 2008; Stratton et al., 2009; 1000 Genomes Project Consortium, 2010; Hudson et al., 2010; Meyerson et al., 2010; Pang et al., 2010; Solit and Mellinghoff, 2010). Microarrays were among the first omics platforms to be developed, and from their first appearance it was clear that microarray data would have to be integrated with other levels of biological information to allow researchers to see the “big picture” (Kohane et al., 2002). As experimental protocols evolve and costs decline, scientists are now starting to apply multiple omics platforms to analyze the same biological samples (Ideker et al., 2001; Joyce and Palsson, 2006; Zhang et al., 2010).
Studies of this type will be critically useful for biologists, since they measure molecular changes at multiple levels simultaneously and bring us one step closer to understanding how biological systems work as a whole, which is one of the primary goals of “systems biology” (Kitano, 2002; Ge et al., 2003; Fukushima et al., 2009). As such, combining multiple omics, or “integrative omics,” holds great potential to revolutionize the systems-level analysis of complex biological phenomena, and several efforts are already under way in various directions. Given the enormous promise of integrative omics, questions of how to design experiments and jointly analyze the heterogeneous data are quickly gaining interest. Indeed, these new technologies generate an unprecedented amount of data and, ironically, the sheer volume makes it difficult to find a reasonable interpretation of the data. Thus successful application will depend on properly designed experiments, statistically sound data analysis, and appropriate interpretation of the data. In this Perspective, we review both challenges and opportunities encountered by systems biologists, bioinformaticians, and statisticians undertaking the exciting and daunting task of integrating multiple heterogeneous omics datasets.

Opportunities of Integrative Omics

Biological opportunities

Many problems in systems biology can be addressed only by integrating multiple layers of biological information. For example, numerous genetic studies using single-nucleotide polymorphism (SNP) microarrays or high-throughput sequencing often report hundreds of point mutations above the minimal allele frequency as potential disease markers (Carlson et al., 2004; Manolio et al., 2009). However, many of these markers lack predictive power and fail to reproduce across different study populations (Altshuler et al., 2008). This implies that these candidate markers must be further prioritized with additional information such as transcriptional or translational regulation of the gene products affected by the mutations. Accordingly, recent genetics research frequently explores the “genetical genomics” approach (Li and Burmeister, 2005) to integrate population-wide SNP data and transcriptomics data, aiming to identify expression quantitative trait loci (eQTL; Cheung and Spielman, 2009; Cookson et al., 2009; Montgomery et al., 2011). The paired genotype and gene expression data reveal the impact of genetic mutations on transcriptional expression, which is the major mechanism channeling genetic abnormalities into phenotypes. On a similar front, many research articles have reported the integration of copy number data and gene expression data in cancer or adaptive evolution studies (Pollack et al., 2002; Chin et al., 2006; Gresham et al., 2008; Rancati et al., 2008). The resulting data explain how various forms of copy number aberration, such as point amplification/deletion, segmental changes, and aneuploidy, induce gene expression changes (Bussey et al., 2006; Stranger et al., 2007).
Besides the integration of genomic datasets, advances in tandem mass spectrometry have gradually allowed us to integrate transcriptomics data with quantitative proteomics data (Griffin et al., 2002; Cox et al., 2005; Lu et al., 2007; Fournier et al., 2010; Pavelka et al., 2010), where proteomics data provide direct information to assess the impact of transcriptional changes on the gene products. So far we have reviewed the opportunities that arise when the same genes are profiled at different levels of the primary omics. However, additional network information is generated by other high-throughput technologies, through which the correlation between interacting molecules can be explicitly modeled. These include various assays for screening PPI (Rual et al., 2005; Gingras et al., 2007; Costanzo et al., 2010), protein–DNA interaction data for mapping transcriptional regulation and epigenetics (Ren et al., 2000; Johnson et al., 2007), post-transcriptional regulation mediated by microRNAs (Bartel, 2009; Hafner et al., 2010), and so forth. Using this information, the association between different molecules, and the lack thereof, can be adjusted for other interacting molecules causally linked across available omics datasets. For instance, transcriptomics and metabolomics data were integrated to identify clusters of genes and metabolites that were coordinately modulated in response to specific nutritional stresses in the model plant Arabidopsis thaliana (Hirai et al., 2004). In addition, transcriptomics data were coupled with PPI networks to determine under which circumstances protein hubs are co-expressed with their respective interacting partners (Taylor et al., 2009) and to use joint expression levels of genes belonging to interaction subnetworks to establish more predictive breast cancer biomarkers (Chuang et al., 2007). Transcriptomics data were also combined with protein–DNA interaction data to infer gene regulatory networks (Lin et al., 2009; Ouyang et al., 2009).
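
To make the eQTL idea from this section concrete, the following minimal Python sketch regresses the expression level of one gene on the allele dosage (0, 1, or 2 copies of an allele) at one SNP across individuals. The function name `eqtl_association` and the toy numbers are illustrative assumptions and are not taken from any of the cited studies; real eQTL analyses would additionally adjust for covariates, population structure, and multiple testing.

```python
from math import sqrt
from statistics import mean


def eqtl_association(genotypes, expression):
    """Return (slope, pearson_r) for expression regressed on allele dosage.

    genotypes  -- allele dosage per individual (0, 1, or 2)
    expression -- expression level of one gene in the same individuals
    """
    if len(genotypes) != len(expression):
        raise ValueError("genotype and expression vectors must be paired")
    g_bar, e_bar = mean(genotypes), mean(expression)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    syy = sum((e - e_bar) ** 2 for e in expression)
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    sxy = sum((g - g_bar) * (e - e_bar) for g, e in zip(genotypes, expression))
    slope = sxy / sxx               # expression change per extra allele copy
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, pearson_r


if __name__ == "__main__":
    # Hypothetical toy data: six individuals, one SNP, one gene.
    dosage = [0, 0, 1, 1, 2, 2]
    expr = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
    print(eqtl_association(dosage, expr))
```

A slope far from zero suggests that the SNP is associated with the transcript level of the gene; in a genome-wide scan, the same calculation would be repeated over many SNP and gene pairs.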

Statistical opportunities

Integrative omics also opens an opportunity for improved statistical analysis. For one, parallel omics datasets can help implement procedures to infer missing data. Many omics platforms are known to be subject to missing observations due to insufficient depth, exemplified by the poor coverage of next-generation sequencing (NGS) in repeat-rich regions and the faltering peptide identification of tandem mass spectrometry for low-abundance proteins. Some transcriptomic platforms such as microarrays are also subject to the limitation that only a fixed form of transcripts can be measured while other isoforms present in the sample go undetected. By generating both transcriptomic and proteomic data, however, one can perform statistical inference on the missing observations in one platform using the observations in the other platform, since the two datasets are expected to be correlated within the same biological sample. Recent endeavors to improve peptide sequencing in tandem mass spectrometry (MS/MS) using parallel transcriptomic data are good examples of this kind (Ramakrishnan et al., 2009; Ning and Nesvizhskii, 2010), but a more sophisticated treatment of missing data using external sources is yet to be developed. Another important problem in omics data analysis is the control or estimation of false positives and false negatives, which are incurred when many statistical decisions are to be made simultaneously, i.e., the multiple testing problem. As simultaneous hypothesis testing typically leads to excessively many selections in omics data, currently existing multiple hypothesis testing methods are geared toward controlling the number of false positives, as evidenced by the development of false discovery rate estimation procedures (Benjamini and Hochberg, 1995; Efron et al., 2001).
Although these procedures are applicable to the analysis of a single omics platform, the methods are easily generalizable to multivariate cases for more sophisticated hypothesis testing when the data are available from more than a single omics platform. Suppose that differential expression is tested at the mRNA and protein level simultaneously. Then the hypothesis testing can be performed using bivariate statistics, which is expected to be more powerful than using two independent univariate statistics, since the correlation between the two datasets can be explicitly accounted for. In addition, the added complexity in the joint testing allows differentiation of the genes differentially expressed at both levels versus the genes regulated at only one of the two levels, providing additional information to infer the underlying regulatory mechanism. Unfortunately, such routines using correlated statistics have rarely been implemented in integrative omics data analysis so far, but we can envision that as the number of such integrated datasets increases, so will the level of sophistication in the statistical analysis. More importantly, the ultimate statistical opportunity in integrative omics data is the possibility of systems-level probabilistic modeling of multiple data types. In practice, one may well perform a crude form of integrative analysis, i.e., analyze each molecular level separately and aggregate the results in a post hoc manner (Figure 1A). This approach, however, fails to capitalize on the power of the correlated data, especially for detecting weak yet consistent signals from multiple data sources (Ideker et al., 2011). Hence one can start using a slightly more sophisticated approach where the data measured at different molecular levels are modeled using multivariate probability models (Figure 1B).
As the bivariate example above showed, incorporating data from multiple molecular levels can strengthen the statistical power, since the effects we aim to measure at one molecular level can be adjusted by the data at the other levels. Furthermore, the new threads of network-level information that are becoming increasingly available (such as transcriptional regulatory networks, genetic interaction networks, PPI networks, signal transduction pathways, and metabolic networks) allow computational biologists to integrate omics datasets at the level of nodes and edges of biological networks (Figure 1C) and to move beyond statistical analysis under the assumption of full independence among the different molecules. For instance, versatile statistical techniques such as graphical models can be used in conjunction with experimentally validated networks, which provide the underlying backbone of the correlation structure. Such models give an efficient probabilistic representation of the complex, systems-wide molecular profiles and considerably improve the statistical power in the analysis.
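
The false discovery rate control discussed in this section can be illustrated with the step-up procedure of Benjamini and Hochberg (1995). The sketch below is a plain-Python rendering of that procedure; the function name `benjamini_hochberg` and the example p-values are our own illustrative choices, and the usual guarantee assumes independent (or positively dependent) tests, which may not hold for correlated omics measurements.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for four genes tested for differential expression.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.05))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list passes its threshold.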

Figure 1

(A) Independent analysis of each data type for each gene and protein. Significant findings are filtered within each data type and aggregated by looking at the overlap of the results. (B) Joint modeling of the bivariate distribution of transcript and protein level data for each gene. (C) Joint modeling incorporating network information such as the interaction between transcription factors and their targets.
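
The contrast between panels (A) and (B) can be sketched for a single gene. Assuming paired log fold changes at the transcript and protein level measured in the same samples, panel (A) corresponds to two separate one-sample t statistics, whereas panel (B) can be represented by a bivariate statistic that uses the covariance between the two levels. The text does not prescribe a particular test; Hotelling's one-sample T² is used here as one standard choice, and the function names and toy data are assumptions made for illustration. Whether the joint test actually gains power depends on the correlation structure and on the direction of the effects.

```python
from math import sqrt
from statistics import mean


def one_sample_t(values):
    """Univariate t statistic for H0: mean == 0 (panel A, one level at a time)."""
    n = len(values)
    m = mean(values)
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m / sqrt(var / n)


def hotelling_t2(mrna, protein):
    """Bivariate Hotelling T^2 for H0: both means == 0 (panel B).

    Returns (t2, f_stat). Under H0 and bivariate normality, f_stat follows
    an F distribution with (2, n - 2) degrees of freedom.
    """
    n = len(mrna)
    if n != len(protein) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mx, my = mean(mrna), mean(protein)
    sxx = sum((x - mx) ** 2 for x in mrna) / (n - 1)
    syy = sum((y - my) ** 2 for y in protein) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(mrna, protein)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("sample covariance matrix is singular")
    # n * mean^T * S^{-1} * mean, with the 2x2 inverse written out explicitly
    t2 = n * (syy * mx * mx - 2 * sxy * mx * my + sxx * my * my) / det
    f_stat = (n - 2) / (2 * (n - 1)) * t2
    return t2, f_stat


if __name__ == "__main__":
    # Hypothetical paired log fold changes for one gene in four samples.
    mrna_lfc = [1.0, 3.0, 1.0, 3.0]
    protein_lfc = [1.0, 1.0, 3.0, 3.0]
    print(one_sample_t(mrna_lfc), one_sample_t(protein_lfc))
    print(hotelling_t2(mrna_lfc, protein_lfc))
```

When the sample covariance between the two levels is zero, T² reduces to the sum of the two squared t statistics; a nonzero covariance is what distinguishes the joint test from the post hoc overlap of panel (A).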

Challenges of Integrative Omics

Bioinformatics challenges

The first problem bioinformaticians face when asked to integrate, for instance, a transcriptomics dataset and a proteomics dataset is how to map transcript identifiers to protein identifiers. While the one-gene-one-protein hypothesis still holds relatively well in prokaryotes and some lower eukaryotes, the same is certainly not true in higher organisms: genes often encode multiple transcripts by means of alternative splicing (Graveley, 2001) and transcripts can be translated into multiple protein isoforms by means of alternative translation initiation sites (Cavener and Ray, 1991) and post-translational modifications (Mann and Jensen, 2003). A partial solution to this problem is provided by genome-centric databases such as EnsEMBL (Hubbard et al., 2002), protein-centric databases such as UniProt (Apweiler et al., 2004), or more general-purpose web services such as Babelomics (Al-Shahrour et al., 2005), which provide coherent mappings between gene, transcript, and protein identifiers. The challenge becomes even more daunting when one starts to venture outside the central dogma of molecular biology and attempts to integrate a transcriptomics or proteomics dataset with a metabolomic, glycomic, or lipidomic dataset. Here, one could take advantage of the knowledge of metabolic networks to map enzymes involved in the synthesis or chemical conversion of metabolites (e.g., as provided by KEGG, Kanehisa and Goto, 2000, or Reactome, Joshi-Tope et al., 2005) to establish links between the two types of datasets (Antonov et al., 2010). To this end, the systems biology markup language (SBML) represents one of the first and most successful efforts in developing a unified language to represent complex models of interacting biological molecules (Hucka et al., 2003) and has been widely implemented by several software tools. However, only a fraction of the genes in a genome typically encode metabolic enzymes, the rest being structural, regulatory, or signal transduction proteins.
Unfortunately, it is not immediately obvious how to close these gaps. It is thus expected that integrative omics data analysis methods will have to deal with the existence of “orphan” molecules that cannot be directly mapped between the two types of datasets. Another bioinformatic issue is the existence of heterogeneous repositories of primary data sources. Due to the different nature of omics platforms, databases of microarray, NGS, proteomics, or metabolomics experiments have been designed according to different schemes. While it is true that each omics domain has developed its own standards (such as MIAME, Brazma et al., 2001, and MAGE-ML, Spellman et al., 2002, for microarray data, or mzXML, Pedrioli et al., 2004, and HUPO-PSI for proteomics data, Orchard et al., 2003), the lack of shared data standards and of standardized nomenclature across different data repositories makes the coherent retrieval and assembly of integrated datasets a non-trivial task. One way to address this issue is the development of so-called “data warehouses,” in which developers invest significant effort up front to store and integrate heterogeneous primary databases into a coherent scheme by making use of intermediate abstraction layers between the raw data layer and the user access layer (Rhodes et al., 2004; Chen et al., 2010). An alternative promising approach to data integration in the life sciences is offered by Semantic Web technologies (Splendiani et al., 2011). These technologies enable an immediate “connection” between data, which can be easily queried across different databases. At the same time they allow a precise characterization of the “semantics” of the data, i.e., which entities are represented and what their relations are (Berners-Lee and Hendler, 2001).
Such semantic characterization can then provide an integration of information across different databases, which can easily cope with a variety of rapidly evolving data sources and types (Cheung et al., 2005; Smith et al., 2007). How widely this technology will be adopted is likely tied to how well developers of primary omics databases will implement such data representation methods.
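
The identifier mapping problem described at the start of this section, together with the resulting “orphan” molecules, can be sketched in a few lines of Python. All identifiers below are hypothetical placeholders, and the function name `join_by_mapping` is our own; in practice the mapping table would be obtained from a resource such as EnsEMBL or UniProt rather than written by hand.

```python
def join_by_mapping(transcript_values, protein_values, tx_to_proteins):
    """Pair transcript and protein measurements through a mapping table.

    transcript_values -- {transcript_id: abundance}
    protein_values    -- {protein_id: abundance}
    tx_to_proteins    -- {transcript_id: [protein_id, ...]} (may be many-to-many)

    Returns (pairs, orphan_transcripts, orphan_proteins), where pairs is a
    list of (transcript_id, protein_id, transcript_value, protein_value).
    """
    pairs = []
    orphan_transcripts = []
    matched_proteins = set()
    for tx, tx_value in transcript_values.items():
        # Keep only mapped proteins that were actually measured.
        targets = [p for p in tx_to_proteins.get(tx, []) if p in protein_values]
        if not targets:
            orphan_transcripts.append(tx)
            continue
        for prot in targets:
            pairs.append((tx, prot, tx_value, protein_values[prot]))
            matched_proteins.add(prot)
    orphan_proteins = sorted(set(protein_values) - matched_proteins)
    return pairs, sorted(orphan_transcripts), orphan_proteins


if __name__ == "__main__":
    # Hypothetical identifiers: two splice variants map to the same protein,
    # one transcript has no measured protein, one protein has no transcript.
    transcripts = {"TX_A1": 5.0, "TX_A2": 3.0, "TX_B1": 1.0}
    proteins = {"PROT_A": 10.0, "PROT_C": 2.0}
    mapping = {"TX_A1": ["PROT_A"], "TX_A2": ["PROT_A"], "TX_B1": ["PROT_B"]}
    print(join_by_mapping(transcripts, proteins, mapping))
```

Returning the orphans explicitly, rather than silently dropping them, keeps the unmapped fraction visible to downstream analysis, which then has to decide how to handle it.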

Statistical challenges

In addition to the bioinformatics issues, there are important statistical challenges in integrative omics analysis. As we build more complex models such as multivariate or inter-molecular models, we must revisit some limitations that have plagued single-source omics data analysis. First, it is likely that the number of biological samples analyzed in a typical integrative study will remain limited, e.g., on the order of a few tens in case–control studies and at most several replicate experiments per comparative condition in studies using cell lines. To address this limitation, one can utilize efficient statistical methods such as hierarchical models, which are capable of pooling statistical information across different molecular levels (Parmigiani et al., 2002; Sharpf et al., 2009; Ji and Liu, 2010). Second, as we consider modeling the correlations among an increasing number of molecules in the statistical model, the model parameter space will expand in a computationally intractable manner and the limited sample size will make over-fitting even more likely. As such, although advanced statistical methods for model selection (e.g., regularization; Tibshirani, 1996) may facilitate the choice of predictive models, it must be remembered that there is a trade-off between the gain in power from the added complexity and the loss in specificity due to a poor model fit, where the latter is mainly determined by experimental design issues such as the sample size. Therefore, when complex models are employed, the interaction between model complexity and experimental design factors must be thoroughly evaluated in terms of the sensitivity–specificity profile and the reproducibility of results. In sum, it is necessary to find the right balance between complexity and model sparsity to deliver the most reproducible system-wide models from multi-layered omics data.
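
As a hedged illustration of the regularization idea cited above (Tibshirani, 1996), the sketch below fits a lasso regression by cyclic coordinate descent, minimizing (1/2n)·||y − Xβ||² + λ·||β||₁. The solver choice, the function names, and the toy data are our assumptions; the penalty λ would normally be tuned by cross-validation, which is omitted here.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Lasso coefficients for rows X (list of lists) and responses y.

    Minimizes (1 / (2 n)) * sum of squared residuals + lam * sum(|beta_j|).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # an all-zero feature carries no information
            # Correlation of feature j with the residual that ignores feature j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


if __name__ == "__main__":
    # Hypothetical toy design: feature 0 drives the response, feature 1 is noise.
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    y = [2.0, 0.1, 2.0, 0.1]
    print(lasso_coordinate_descent(X, y, lam=0.1))
```

The penalty sets weak coefficients exactly to zero, which is one way to trade model complexity for sparsity when the number of molecules far exceeds the number of samples.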

Conclusion

As it is becoming increasingly clear that integrating multiple omics datasets allows researchers to explore previously uncharted territories describing the functioning of biological systems, more advanced data analysis methods will be required to fully translate this enormous wave of information into biological knowledge. Will the field of bioinformatics and computational biology be able to keep pace with the exponential development of omics technologies? While it is currently difficult to predict whether this gap will eventually be filled, we argue that if careful statistical considerations are taken into account already at the experimental design phase of a multi-omics project, then there is an opportunity to build rigorous systems-level statistical models that take full advantage of the interdependent workings of biological molecules. Finally, to foster further advancement of the field, it will be critical to build integrated multi-omics statistical models that are both reusable and easily extendable by other researchers.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

70 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Genome-wide location and function of DNA binding proteins.

Authors: B Ren; F Robert; J J Wyrick; O Aparicio; E G Jennings; I Simon; J Zeitlinger; J Schreiber; N Hannett; E Kanin; T L Volkert; C J Wilson; S P Bell; R A Young
Journal: Science Date: 2000-12-22 Impact factor: 47.728

Review 3. Mapping complex disease loci in whole-genome association studies.

Authors: Christopher S Carlson; Michael A Eberle; Leonid Kruglyak; Deborah A Nickerson
Journal: Nature Date: 2004-05-27 Impact factor: 49.962

Review 4. The model organism as a system: integrating 'omics' data sets.

Authors: Andrew R Joyce; Bernhard Ø Palsson
Journal: Nat Rev Mol Cell Biol Date: 2006-03 Impact factor: 94.444

Review 5. Integrated omics approaches in plant systems biology.

Authors: Atsushi Fukushima; Miyako Kusano; Henning Redestig; Masanori Arita; Kazuki Saito
Journal: Curr Opin Chem Biol Date: 2009-12 Impact factor: 8.822

Review 6. Boosting signal-to-noise in complex biology: prior knowledge is power.

Authors: Trey Ideker; Janusz Dutkowski; Leroy Hood
Journal: Cell Date: 2011-03-18 Impact factor: 41.582

7. Towards a comprehensive structural variation map of an individual human genome.

Authors: Andy W Pang; Jeffrey R MacDonald; Dalila Pinto; John Wei; Muhammad A Rafiq; Donald F Conrad; Hansoo Park; Matthew E Hurles; Charles Lee; J Craig Venter; Ewen F Kirkness; Samuel Levy; Lars Feuk; Stephen W Scherer
Journal: Genome Biol Date: 2010-05-19 Impact factor: 13.583

Review 8. The cancer genome.

Authors: Michael R Stratton; Peter J Campbell; P Andrew Futreal
Journal: Nature Date: 2009-04-09 Impact factor: 49.962

9. Genome-wide mapping of in vivo protein-DNA interactions.

Authors: David S Johnson; Ali Mortazavi; Richard M Myers; Barbara Wold
Journal: Science Date: 2007-05-31 Impact factor: 47.728

10. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments.

Authors: Fátima Al-Shahrour; Pablo Minguez; Juan M Vaquerizas; Lucía Conde; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
