| Literature DB >> 27377783 |
Nathan Cormier1, Tyler Kolisnik1, Mark Bieda2.
Abstract
BACKGROUND: There has been an enormous expansion of use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking.Entities:
Keywords: Bioconductor; ChIP-seq analysis; Scientific workflows; Software packages
Mesh:
Year: 2016 PMID: 27377783 PMCID: PMC4932705 DOI: 10.1186/s12859-016-1125-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Feature Comparison
| Feature | CisGenome [ | unix tool scripting [ | Fish the Chips [ | R pipeline [ | HiChIP [ | Cistrome [ | This package |
|---|---|---|---|---|---|---|---|
| Primary implementation | C | unix tools | C++ | C and R | R | Galaxy | R and Kepler |
| Single modules available | no | noa | no | no | no | yes | yes |
| Single turnkey pipeline | no | no | yes | no | yes | no | yes |
| Interface | GUI, command line | command line | GUI, command line | command line | command line | web | GUI (Kepler) + command line |
| Data provenance tracking | no | no | no | no | no | yes (Galaxy) | yes (Kepler) |
| Generate genome browser track | yes (custom) | yes (UCSC) | yes (UCSC) | no | yes (UCSC) | yes (UCSC) | yes (UCSC) |
| Peak calling | yes (custom) | yes (MACS) | yes (MACS) | yes (PICS) | yes (MACS and SICER) | yes (MACS) | yes (MACS) |
| Summary peak statistics | yes | yes | yes | no | yes | yes | yes |
| Summary location statistics | yes | yes | yes | no | yes | yes | yes |
| Map to nearby genes | yes | yes | yes | no | yes | yes | yes |
| Automatic generation of peak graphs | no | no | no | no | no | no | yes |
| Heatmap of read density | no | no | no | no | yes | yes | yes |
| Average profiles | no | no | no | no | yes | yes | yes |
| De novo motif analysis | yes | yes (MEME) | no | yes (rGADEM) | yes (MEME) | yes (SeqPos) | yes (MEME) |
| Motif enrichment analysis | yes | yes | no | yes | no | yes | no |
| Gene ontology | no | yes | no | no | yes(custom) | nob | yes |
| Pathway analysis | no | no | no | no | no | no | yes |
aSoftware is a listing of code in a protocol bGene ontology implemented for microarray data but not ChIP-seq data
Fig. 1Conceptual diagram of overall project components. Components and relationships for a full analysis of a ChIP-seq dataset. These steps are present in the full workflow. It is provided in Kepler and R script versions (R script version does not have read mapping) and each component is also implemented individually as separate Kepler workflows and R scripts. Names in parentheses refer to critical external, stand-alone programs (e.g., bowtie, MACS, ngsplot, MEME). Most components use R and Bioconductor packages. Some minor steps in the full pipeline (e.g., file conversion), accessory and other utility workflows are not displayed
List of Software Modules
| Name | Goal |
|---|---|
| Full Pipelines | |
| Pipeline_UnmappedReads.kar | Full pipeline starting with unmapped reads |
| Pipeline_MappedReads(.kar/.R) | Full pipeline starting with mapped reads |
| Pipeline_MappedReadsAndExternalPeakFile(.kar/.R) | Full pipeline starting with mapped reads and a file of peaks |
| Pipeline_onlyExternalPeakFile(.kar/.R) | Full pipeline when only file of peaks is available |
| Independent Modules | |
| MapReads.kar | Map reads to a reference genome |
| AnnotatePeaks(.kar/.R) | Map peaks to genes |
| CalcDistanceToTSS(.kar/.R) | Summary of peak distance to TSS |
| CalcPeakStats(.kar/.R) | Summary peak statistics |
| GeneOntologyAnalysis(.kar/.R) | Gene Ontology Analysis |
| GeneratePeakExamples(.kar/.R) | Generate a set of graphs of peak and non-peak regions automatically |
| GetPeakSequences(.kar/.R) | Generate set of DNA sequences from peaks |
| MakeTSSHeatmapAndDensityPlots(.kar/.R) | Makes a heatmap of reads around all TSSs and also an average profile plot |
| MakeUCSCfile(.kar/.R) | Makes a bedgraph.tar.gz file ready for direct upload to the UCSC genome browser |
| MotifDiscovery(.kar/.R) | Runs MEME to attempt motif finding from peaks |
| PathwayAnalysis(.kar/.R) | Performs a pathway analysis, generates list of high-scoring pathways and images of results |
| RunMACS(.kar/.R) | Runs MACS to predict peaks |
| Utility | |
| BamToBed(.kar/.R) | Converts bam format to bed format |
| IndexBAM(.kar/.R) | Indexes a bam file |
| SamToBed(.kar/.R) | Converts sam format to bed format |
| Extra | |
| GraphSingleDataRange(.kar/.R) | Generates a graph of read density for any region of the genome |
Notes: (.kar/.R) refers to workflows having Kepler (.kar) and R (.R) versions. The same root name is used; for example RunMACS.kar and RunMACS.R
Fig. 2MapReads.kar Kepler workflow. This workflow displays some fundamental features of Kepler and Kepler’s ability to use external computing and data resources. The workflow is executed from the user’s machine but performs mapping of a set of sequence reads on an external high performance machine. A sequence read file (fastq format) is copied to an external machine/cluster and the program bowtie [13] is used to align reads to the genome specified by parameter genomeDirectory. The output file (sam format) is then copied back to the user’s machine. The number of CPU cores used for this computation can be limited by the ‘Available CPU cores’ parameter. This is a screenshot of the actual Kepler canvas that is displayed to the user
Fig. 3Generation of peak examples. a Screenshot of Kepler canvas for this workflow. This workflow shows the typical design for most of our single module Kepler workflows. This task is part of each of the full workflows for ChIP-seq, except those starting with only predicted peaks. It is available as an individual Kepler workflow and as an independent R script. b, c Output from workflow applied to H3K4me3 dataset (see IMPLEMENTATION AND RESULTS). This workflow produces graphs of positive peaks (example shown in b; x-axis range is chr16:2379054–2400852) and graphs of negative regions (example shown in c; x-axis range is chr19:34482526–34487526). Workflow output graphs are designed for on-screen viewing and graph aesthetics have been adjusted for clarity of presentation here. The number of produced graph examples and axis ranges are controlled by the user. Standard R graphics (“base graphics”) are used for graph generation, allowing the experienced user to easily modify graph properties
Fig. 4A full analysis pipeline for ChIP-seq data. This displayed workflow begins with raw sequence reads; other variants begin with mapped data or mapped data with called peaks (see Table 2). The workflow implements steps displayed conceptually in Fig. 1. This workflow is also available as a standalone R script (but without initial read mapping step)
Fig. 5A partial set of outputs of the full pipeline and component workflows. See Fig. 6 for more outputs. All of these are outputs of all the full workflows, except some are not available in the pipeline beginning with only peak data. Each individual output is available as the product of individual workflows or as standalone R scripts (see Table 2). All data is from analysis of the H3K4me3 dataset except E, which is from analysis of the GATA1 dataset. a GeneratePeakExamples component output. (left) A predicted peak (x-axis: chr11:113174680–113196653) (right) A non-peak region (x-axis: chr3:50344686–50349686). Also see Fig. 3. b Partial output of CalcPeakStats module. c Partial output of mapping to genes workflow (AnnotatePeaks). Several additional columns of information are not shown here and full table has many more rows. d (i) Heatmap of coverage around transcription start sites (TSSs). (ii) Density plot of coverage around TSSs. (iii) Table of distances from TSS. Note expected pattern of TSS-centric location of H3K4me3, as indicated in previous studies. e Output of motif discovery component for GATA1 dataset. Note close match to expected GATA1 motif [27] is obtained, including central GATA motif (positions 2–5). f Comparison of workflow-generated UCSC track (top) to ENCODE-generated UCSC track (middle) for H3K4me3 dataset. Image from UCSC browser
Fig. 6More outputs from the full pipelines and individual components. All outputs are produced from all full pipelines (except the one starting with only peak data), appropriate individual workflows, and Kepler-independent R scripts (see Table 2). Genes possessing a peak in the proximal promoter are made into a list with an associated score equal to score of maximum scoring peak in the proximal promoter. This list is the input into this analysis. a Partial gene ontology output for GATA1 dataset. Top four gene ontology categories are shown. b Partial GATA1 ChIP-seq pathway analysis results. Top four pathways are shown. c Porphyrin pathway (hsa00860) image generated by workflow. Image is truncated for presentation clarity. Red color indicates a gene with a peak in the proximal promoter (see IMPLEMENTATION AND RESULTS for details). Genes without any color coding did not have peaks in their promoters. Note that this pathway is strongly implicated in GATA1 function (e.g., [43])