Claudia Angelini, Valerio Costa.
Abstract
The availability of omic data produced by international consortia and by laboratories worldwide makes it possible both to answer long-standing questions in biomedicine and molecular biology and to formulate novel hypotheses to test. However, such data are not fully exploited because tools and methods for multi-omic data integration remain scarce. In this paper, we discuss the interplay between gene expression and epigenetic markers/transcription factors, and we show how integrating ChIP-seq and RNA-seq data can help to elucidate gene regulatory mechanisms. In particular, we discuss the following two questions: (i) Can transcription factor occupancies or histone modification data predict gene expression? (ii) Can ChIP-seq and RNA-seq data be used to infer gene regulatory networks? We propose potential directions for statistical data integration and discuss the importance of incorporating underestimated aspects, such as alternative splicing and long-range chromatin interactions. We also highlight the lack of data benchmarks and the need to develop tools for data integration from a statistical viewpoint, designed in the spirit of reproducible research.
Keywords: ChIP-seq; RNA-seq; data integration; gene regulatory mechanisms; statistics
Year: 2014 PMID: 25364758 PMCID: PMC4207007 DOI: 10.3389/fcell.2014.00051
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
Figure 1. (A) Schematic representation of the dynamic interactions among chromatin modifications and TFs, and their impact on gene transcription in a cell. Different cells share the same TF binding sites despite differences in functionality, shape and differentiation state. Transcriptional patterns are controlled by differential TF binding and by other factors, such as local chromatin states and epigenetic modifications. These factors can limit or promote TF occupancy at specific loci and thereby regulate gene transcription. (B) Each NGS coverage data track (bedgraph format) represents the result of a single omic data analysis (i.e., a ChIP-seq or RNA-seq experiment). Visualizing several tracks together allows a qualitative study of a specific gene locus. The computational analysis of single omics allows investigating, on a genome-wide scale, different epigenetic modifications (TFs, HMs, CpG methylation, chromatin accessibility) and measuring gene expression. (C) When only a limited number of ChIP-seq (TF binding and/or HMs) and RNA-seq datasets are available, simple predictive models based on PCA and log-linear or support vector regression are used to predict gene expression and to reveal the epigenetic signatures that best explain it. Plotting the loading factors reveals that epigenetic signatures can act either as activators or as repressors of transcription at different loci (see Section Can TF occupancies or histone modification data predict gene expression?). (D) When a large number of gene expression datasets is available, more sophisticated models can be used to infer complex GRNs. Such a network allows visualizing TF-gene relations. In particular, it shows that a given TF can control several genes and that genes are strongly interconnected (see Section Can ChIP-seq and RNA-seq data be used to infer gene regulatory networks?).
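The kind of model described in panel (C) can be sketched as a principal component regression on log-transformed signals. This is only a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the function names (`make_toy_data`, `fit_pcr`, `predict_log_expression`), the pseudocount of 1, and the synthetic data are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def make_toy_data(n_genes=200, seed=0):
    """Simulate a gene-by-feature signal matrix X and expression values y.

    Hypothetical data: four epigenetic features, of which the first and
    third act as activators, the second as a repressor, the fourth is inert.
    """
    rng = np.random.default_rng(seed)
    X = rng.gamma(shape=2.0, scale=2.0, size=(n_genes, 4))
    w = np.array([0.8, -0.5, 0.3, 0.0])
    signal = 2.0 + np.log(X + 1.0) @ w + rng.normal(0.0, 0.1, n_genes)
    y = np.expm1(np.maximum(signal, 0.0))  # keep expression non-negative
    return X, y, w


def fit_pcr(X, y, n_components=2, pseudocount=1.0):
    """Regress log expression on the leading principal components of log X."""
    log_X = np.log(np.asarray(X, dtype=float) + pseudocount)
    log_y = np.log(np.asarray(y, dtype=float) + pseudocount)
    mean = log_X.mean(axis=0)
    centered = log_X - mean
    # PCA via SVD: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    loadings = Vt[:n_components].T  # shape (features, components)
    scores = centered @ loadings
    design = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(design, log_y, rcond=None)
    return {
        "mean": mean,
        "loadings": loadings,
        "coef": coef,
        "pseudocount": pseudocount,
        # Per-feature weights: the sign suggests activator (+) or repressor (-).
        "feature_weights": loadings @ coef[1:],
    }


def predict_log_expression(model, X):
    """Predict log(expression + pseudocount) for new signal rows."""
    log_X = np.log(np.asarray(X, dtype=float) + model["pseudocount"])
    scores = (log_X - model["mean"]) @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]


if __name__ == "__main__":
    X, y, _ = make_toy_data()
    model = fit_pcr(X, y, n_components=4)
    print(np.round(model["feature_weights"], 2))
```

Projecting the component coefficients back onto the original features (`feature_weights`) is one simple way to read off which signatures push expression up or down, in the spirit of the loading-factor plots mentioned in the caption.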
Figure 2. One of the key points in the integration process is the way in which the epigenetic and transcriptional signals are transformed into a statistical model that relates a response vector Y to a matrix X of epigenetic signatures. (A) A scheme showing gene transcription and the molecular factors involved (TFs and HMs) is illustrated in the upper part. (B) Different models have been proposed to build the so-called gene-to-epigenetic-signature matrix X. Naive models use a binary matrix to integrate epigenetic signatures with gene expression: 0/1 values associate a given TF or HM with a specific gene according to a proximity measure between the peak (and/or the enriched region) and the TSS of the corresponding gene. More advanced models, such as the one from Ouyang et al. (2009), use a weighted sum of peaks around the TSS. In this way, the contributions of binding strength and of distance from the TSS can be tuned in a continuous way. Along the same lines, Sikora-Wohlfeld et al. (2013) compared several other measures to build X. All such approaches build matrix X with respect to the position of the TSSs (or using reads in a window around the TSSs), collapsing each epigenetic feature into a single value per gene. A slightly different, and more sophisticated, approach consists in mapping each epigenetic feature into a vector of several components measured (in several bins) at both the TSSs and the TTSs, as proposed in the series of papers by Cheng and colleagues. In this way, they showed that the best predictive power for TFs is indeed achieved at TSSs, whereas for HMs the information available at TTSs can provide further improvement. Finally, Althammer et al. (2012) use a set of 13 features for each epigenetic mark to classify genes as up-regulated, down-regulated, or unchanged between two experimental conditions. The features are evaluated over the gene body and over its upstream and downstream regions (including promoters, TSSs, first exons, first introns, etc.). (C) Gene expression Y (usually measured in terms of fragments per kilobase of exon per million fragments mapped, FPKM) is obtained from RNA-seq data.
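The two ways of filling an entry of X described in panel (B), and the FPKM normalization used for Y in panel (C), can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the 1 kb window and the exponential decay with characteristic distance `d0` are illustrative choices made here. The caption only states that the weighted sum depends continuously on binding strength and on distance from the TSS.

```python
import math


def binary_entry(peak_positions, tss, window=1000):
    """Naive 0/1 entry: 1 if any peak lies within `window` bp of the TSS."""
    return int(any(abs(pos - tss) <= window for pos in peak_positions))


def weighted_entry(peaks, tss, d0=5000.0):
    """Continuous entry: sum of peak strengths, down-weighted by distance.

    `peaks` is a list of (position, strength) pairs. The exponential decay
    exp(-distance / d0) is an assumed weighting chosen for illustration.
    """
    return sum(
        strength * math.exp(-abs(pos - tss) / d0) for pos, strength in peaks
    )


def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical gene with its TSS at position 1000 and two nearby peaks.
    peaks = [(900, 2.0), (6000, 1.0)]
    print(binary_entry([pos for pos, _ in peaks], tss=1000))
    print(weighted_entry(peaks, tss=1000))
    print(fpkm(fragments=500, exon_length_bp=2000,
               total_mapped_fragments=20_000_000))
```

Unlike the binary entry, the weighted entry lets a strong distal peak and a weak proximal peak contribute comparable amounts, which is the continuous tuning the caption refers to.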