Literature DB >> 29487738

A multi-tool recipe to identify regions of protein-DNA binding and their influence on associated gene expression.

Daniel Carlin¹, Kassi Kosnicki¹, Sara Garamszegi², Trey Ideker¹, Helga Thorvaldsdóttir², Michael Reich¹, Jill Mesirov^1,3.

Abstract

One commonly performed bioinformatics task is to infer functional regulation of transcription factors by observing differential expression under a knockout, and integrating DNA binding information of that transcription factor. However, until now, this task has required dedicated bioinformatics support to perform the necessary data integration. GenomeSpace provides a protocol, or "recipe", and a user interface with inter-operating software tools to identify protein occupancies along the genome from a ChIP-seq experiment and associated differentially regulated genes from a RNA-Seq experiment. By integrating RNA-Seq and ChIP-seq analyses, a user is easily able to associate differing expression phenotypes with changing epigenetic landscapes.

Entities: Chemical Disease Gene Species

Keywords: ChIP-seq; RNA-Seq; data integration; differential expression; epigenetics; histone modification; regulation; transcription factor

Year: 2017 PMID： 29487738 PMCID： PMC5806050 DOI： 10.12688/f1000research.11616.2

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

The genetic make-up of an organism plays a key role in gene regulation, especially during early cell differentiation and development. We can observe this phenomenon in siblings who possess different eye and hair color as a result of differing genetic code. However, epigenetic mechanisms, such as histone modifications, transcription factor binding and DNA methylation, also contribute to the complexity of individuals’ phenotypes as is observed in identical twins who possess the same genetic code while having slightly different features. Phenotypic differences associated with disease and varying stages of development have been mapped to changing patterns in gene regulation; and phenotype can often be attributed to a changing epigenetic landscape rather than hard-coded genetic features. In order to decode these epigenetic differences, biologists often turn to an analysis based on two experimental assays; RNA sequencing (RNA-Seq) ( Nagalakshmi ; Wilhelm ), which quantifies the amount of (usually messenger) RNA in a cell, and Chromatin Immuno-precipitation sequencing (ChIP-seq) ( Johnson ; Robertson ), which shows where a particular protein binds the genome. Commonly, this protein is expected to have some influence on the mRNA expression of nearby genes (i.e., it is a transcription factor). Thus, by knocking out the gene that codes for the DNA binding protein and observing changes in mRNA expression, the biologist can infer the direct effect of the protein on expression. When analyzing genomic data, today’s computational biologist may utilize a variety of different tools specific to each step of their analysis process. Not only must they be able to create the perfect marriage between the type of data and the tool, but they must be able to correctly manipulate the output, both for interpretation and for format conversion between tools. For the non-programming biologist, smooth integration of many of these tools is provided through GenomeSpace ( Qu , www.genomespace.org) and its user-friendly “recipes” ( recipes.genomespace.org). GenomeSpace is a web-based visual workbench that supports a diverse range of bioinformatics tools and data resources popularly used in genomic analyses. Because GenomeSpace provides the ability to reformat data as it moves between software tools, one can create easy to use step-by-step workflows specific to a given analysis task. We refer to these published workflows as “recipes”. We present one such recipe, currently available in GenomeSpace, which identifies differentially expressed genes between two samples, and compares that gene list with differential transcription factor occupancy from a ChIP-Seq experiment. This recipe is designed to elucidate which DNA-protein binding events are responsible for an observed change in mRNA expression. By identifying protein occupancies throughout the genome and comparing them to observed differences in mRNA expression, we can support hypotheses of functional regulation.

Methods

This recipe takes as input the aligned reads from a differential RNA-seq transcription factor knockout experiment, and aligned reads from a ChIP-Seq experiment for the transcription factor that was knocked out. The output is a visualization of the genomic regions containing both differentially expressed genes and a binding site for the transcription factor. Since all tools used in this recipe are hosted remotely, running the recipe has no system requirements beyond an internet connection. An overall workflow diagram appears in Figure 1. We describe the individual steps of the recipe here.

Figure 1.

Workflow diagram.

Obtaining and loading data

We start by obtaining a reference genome matching our model organism and aligning RNA-seq reads from two or more conditions (e.g. experimental and control) and ChIP-Seq reads from at least two samples, an input control and an experiment. In ChIP-Seq, the input control is a sample that has been run through all of the same preparatory and sequencing steps as the experiment, except for the antibody binding. This controls for the natural background of reads that are not selected by the binding of the target protein. Both RNA-seq and ChIP-Seq read data are uploaded to GenomeSpace in the BAM (Binary sequence Alignment MaP) format and the reference genome in the GTF (Gene Transfer Format). If users are beginning with unaligned reads, they can use a previously published RNA-seq recipe (http://recipes.genomespace.org/view/5) to pre-process their reads.

Differential gene expression analysis

We next perform differential expression analysis using GenePattern ( Reich , genepattern.broadinstitute.org), which can be launched from the GenomeSpace user interface. We use GenePattern’s Cuffdiff module to identify genes with differential expression between samples, measured by their FPKM (Fragments Per Kilobase of transcript per Million mapped reads) value. For each condition, we input the read data for an individual sample followed by the GTF reference genome. The output of the differential analysis is exported to GenomeSpace in Cuffdiff’s tabular format.

Filtering and formatting differential gene expression data

We next launch Galaxy ( Afgan ; Giardine , galaxyproject.org), again available through the GenomeSpace interface, and import RNA-seq reads from both conditions along with a file containing differential expression for each gene. This data is directly available through GenomeSpace. Using a Galaxy workflow, we filter genes that are significantly (q-value < 0.05) differentially expressed between the experiment (in this case a knockout) and control samples and extract their chromosome number, gene region start, gene region end, and gene symbol. Next we use Galaxy’s SAMtools ( Li ) Filter subtool, which extracts this data from the original RNA-seq reads in the BAM format. We convert the BAM files to the bigWig format so that they can be viewed in the Integrative Genomics Viewer (IGV) ( Robinson ; Thorvaldsdottir ).

Identifying transcription factor binding sites

Next, we use GenomeSpace to import the ChIP-seq files from both the input control and experimental samples to Galaxy. Using Galaxy’s MACS2 ( Feng ) callpeak subtool, we obtain a bedGraph file containing peak data of both our experimental and input control files. Additionally, we use the MACS2 callpeak tool to identify differential peaks along the genome, indicative of transcription factor binding sites, and output this data as a bedGraph file. The two bedGraph files are converted in Galaxy to the bigWig format for visualization in IGV.

Visualizing transcription factor binding sites and expression of associated genes

We next launch IGV through the GenomeSpace user interface. We select the appropriate reference genome included in IGV, and load all gene expression and peak-enrichment Bigwig files from GenomeSpace. Tracks are then scaled by group for visualization.

Use case

We applied the recipe described above to an example dataset from Laurent , accession GSE6328, from NCBI’s Gene Expression Omnibus (GEO) database ( Barrett ; Edgar ). We can identify the interplay between the epigenetics and transcriptomics of mouse embryonic stems cells by observing how the binding of the transcription factor, Prep1, influences gene expression. Prep1 is known for its contribution in embryonic development ( Laurent ). In comparing genome-wide maps of mouse embryonic cells expressing Prep1 to those that do not, we can identify potential target genes that are being differentially regulated by these binding events. One such example of this is illustrated in Figure 2. Here, the transcription factor binding site has been identified and shown to up-regulate the expression of the gene Igf2.

Figure 2.

Epigenetic landscape of Prep1 binding and associated regulation of Igf2.

The left panel illustrates the binding of the Prep1 transcription factor. In the right panel, we see the up-regulation of the gene, Igf2, as a result of this binding event.

Epigenetic landscape of Prep1 binding and associated regulation of Igf2.

The left panel illustrates the binding of the Prep1 transcription factor. In the right panel, we see the up-regulation of the gene, Igf2, as a result of this binding event.

Variations of this recipe

This recipe can be used, not only to identify the regulation of genes by transcription factor binding, but also to identify any epigenetic mechanism that can be analyzed by ChIP-sequencing. For example, we can identify regions in the genome where histone modifications have occurred, and match those regions to observed changes in expression presumably resulting from the histone modifications. However, we must consider the nature of the data when selecting parameters in the MACS2 tool in Galaxy. For example, when performing peak enrichment on histone modification occupancies, a user must select an advanced option to include broader regions, since histone modifications are represented by a much broader peak area along the genome. The GenomeSpace recipes use the tools with which the authors are most familiar and comfortable. However, we do recognize that there are alternative tools and approaches to completing the same analysis. For instance, DESeq2 ( Love ) can be employed in Gene Pattern as an alternative method for calling differential expression. Furthermore, there are many other alternatives for calling differentially expressed genes though they may not be available within the family of GenomeSpace tools, e.g., EdgeR ( Robinson ), and so any issues of format compatibility would have to be handled by the user. We note that CuffDiff ( Trapnell ) and MACS2 are available through Galaxy in addition to GenePattern, and can be employed through the framework if the user is more comfortable with the Galaxy tool.

Data availability

The original ChIP-seq and RNA-seq data of this experiment have been deposited in GEO, with accession number GSE63282. The recipe providing all the detailed steps and corresponding videos associated with this process is accessible at: http://recipes.genomespace.org/view/69. The authors have mostly addressed the reviewers concerns. However, one minor point: the authors still mention the peaks called by MACS2 as differential peaks. As reviewer 1 explains, this is misleading. Differential peaks are those derived by comparison between 2 experiments, not whats MACS does (that is calculating enrichment in experiment over input). I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The authors present a workflow (which they refer to as a “recipe”) for the integration of RNA-Seq and ChIP-Seq experiments to find associations between genomic binding of TFs and their potential direct effects on the mRNA expression using the web-based work-bench GenomeSpace. The workflow enables integration of differential gene expression analysis following transcription factor knockdown (RNA-seq data) with binding data for the same transcription factor (ChIP-seq data). While there are many useful pipelines available for analysing and integrating sequencing data, GenomeSpace and its associated “recipes” make the analysis and integration of the data less daunting for a biologist with little or no programming experience. Overall this “recipe” will likely be useful for biologists. However, we would like to make a few comments: 1. The authors do not make it clear whether the user has to pre-align the files to the reference genome before uploading them to GenomeSpace or the alignment itself can be done via their recipe. If not then it will be useful to provide the necessary tools and guidance required for alignment given that aligning the reads to genome is one of the most memory and time consuming steps. 2.The authors use the Cuffdiff module from the Cufflinks package to perform differential expression. Although the goal of this recipe is providing user-friendly and simplified workflow for integration of data the authors should mention the advantages of alternate tools such as EdgeR and DeSeq for identifying differential expression. These tools are available in GenePattern therefore it will be sensible to provide the user with all options. Especially since these tools are known to have better normalisation techniques and perform a more robust and reliable identification of differentially expressed genes compared to Cuffdiff. 3. The authors should not confuse the Enriched peak data obtained by using the MACS callpeak tool on the ChIP data and its input with Differential peaks. Differential peaks are obtained between two experimental conditions and not between the ChIP experiment and its input. This language can be misleading especially for beginners. 4. A flowchart of the analysis steps in the paper would be highly useful to get started. 5. Figure 1 is not entirely clear as presented. It is not clear why two separate panels are provided rather than a single panel that shows the location of the ChIP-seq peaks relative to the gene expression changes. Also, indicating what the colours represent in the gene expression data. The track labelling on the left is also not clear, and presumably “overlay” is the RNAseq data and “peaks” the ChIP-seq data. While the current labelling is presumably driven by the naming of the original files, better labelling is suggested in the context of this figure so it is clear to the reader. Finally, looking at the coordinates provided, we are not convinced that this is a good example to provide. The TF binding event Appears to be over 2.5Mb from the TSS of the putative target gene, meaning that any links here would be fairly low confidence. This would of course become more obvious if displayed on a single panel. 6. To make the “recipe” really useful, it would be good to have outputs beyond simple genome browser views. Having a tabular output of differentially expressed genes and the relative location(s) (and coordinates) of any binding peaks for the transcription factor in question would be useful to have. We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The authors have made a pipeline integrating differential RNA-Seq expression analysis with ChIP-Seq analysis and implemented it through the GenomeSpace platform. Though as mentioned by the authors in the paper: this is a commonly performed bioinformatic task, their aim is to make this integrated analysis easily accessible to non-bioinformatic users. For this purpose their workflow on the web-based workbench involves integrating multiple tools like Cuffdiff module in GenePattern (for Differential RNA-Seq analysis) with MACS2 in Galaxy (for ChIP-Seq analysis). Though the workflow presented by the authors seems easy to use by any biologist, it also appears to be severely limited not just in its scope of application but also in its choice of tools which are hardcoded. For example the authors use "CuffDiff" for Differential RNA-Seq expression analysis. The authors do not state why they chose this particular method or even why they chose its GenePattern module and not the Galaxy implementation. Though admittedly this is a popular tool and has the advantage of transcript level analysis, it also suffers from known limitation of underestimating the number of differential genes. Other count based method like DESeq2 tool (also implemented in Galaxy) might be better suited for most gene-level differential RNA-Seq analysis. Also, the authors do not clearly explain why their workflow is better or easier for a biologist to implement than using the same tools through Galaxy directly (which has been made for a non-coding biologist). I would argue that working directly on Galaxy even if slightly more complicated would be more rewarding to users as it offers not just greater flexibility of tools but also the option to select different parameters than default. Hence in conclusion, to make this manuscript better the authors should 1) provide a clearer explanation for their choice of tools and why is it easier/better to use their pipeline than the same tools on Galaxy directly, 2) If possible the authors should try to expand their workflow to provide a greater flexibility to the user to choose their tools for RNA-Seq and ChIP-Seq analysis. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. We greatly appreciate the reviewers' comments pointing out that the "Variations of the Recipe" section of this manuscript did not discuss alternative tools. The GenomeSpace recipes are subjective in that they use tools with which the authors are most familiar and comfortable. However, we do recognize that there are, of course, other options. We will revise the manuscript (and the recipe) to include some of the other options that could be used. In the past that has been our custom and we apologize that we did not catch the lack of this information in this recipe. In fact, on the best practices page of our website - http://recipes.genomespace.org/page/best-practices-for-creating-recipes we suggest authors supply alternatives. We would invite both the reviewers and other readers of the manuscript to contribute versions of this recipe using their preferred tools and approaches. GenomeSpace is a community project and contribution of recipes is both welcome and encouraged.

18 in total

1. GenePattern 2.0.

Authors: Michael Reich; Ted Liefeld; Joshua Gould; Jim Lerner; Pablo Tamayo; Jill P Mesirov
Journal: Nat Genet Date: 2006-05 Impact factor: 38.330

2. The transcriptional landscape of the yeast genome defined by RNA sequencing.

Authors: Ugrappa Nagalakshmi; Zhong Wang; Karl Waern; Chong Shou; Debasish Raha; Mark Gerstein; Michael Snyder
Journal: Science Date: 2008-05-01 Impact factor: 47.728

3. Identifying ChIP-seq enrichment using MACS.

Authors: Jianxing Feng; Tao Liu; Bo Qin; Yong Zhang; Xiaole Shirley Liu
Journal: Nat Protoc Date: 2012-08-30 Impact factor: 13.491

4. Differential analysis of gene regulation at transcript resolution with RNA-seq.

Authors: Cole Trapnell; David G Hendrickson; Martin Sauvageau; Loyal Goff; John L Rinn; Lior Pachter
Journal: Nat Biotechnol Date: 2012-12-09 Impact factor: 54.908

5. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

6. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.

Authors: Gordon Robertson; Martin Hirst; Matthew Bainbridge; Misha Bilenky; Yongjun Zhao; Thomas Zeng; Ghia Euskirchen; Bridget Bernier; Richard Varhol; Allen Delaney; Nina Thiessen; Obi L Griffith; Ann He; Marco Marra; Michael Snyder; Steven Jones
Journal: Nat Methods Date: 2007-06-11 Impact factor: 28.547

7. Genome-wide mapping of in vivo protein-DNA interactions.

Authors: David S Johnson; Ali Mortazavi; Richard M Myers; Barbara Wold
Journal: Science Date: 2007-05-31 Impact factor: 47.728

8. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.

Authors: Helga Thorvaldsdóttir; James T Robinson; Jill P Mesirov
Journal: Brief Bioinform Date: 2012-04-19 Impact factor: 11.622

9. ChIP-Seq and RNA-Seq analyses identify components of the Wnt and Fgf signaling pathways as Prep1 target genes in mouse embryonic stem cells.

Authors: Audrey Laurent; Manuela Calabrese; Hans-Jörg Warnatz; Marie-Laure Yaspo; Vsevolod Tkachuk; Miguel Torres; Francesco Blasi; Dmitry Penkov
Journal: PLoS One Date: 2015-04-13 Impact factor: 3.240

10. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937