| Literature DB >> 34599234 |
Enrique Blanco1, Mar González-Ramírez2, Luciano Di Croce3,4,5.
Abstract
Large-scale sequencing techniques to chart genomes are entirely consolidated. Stable computational methods to perform primary tasks such as quality control, read mapping, peak calling, and counting are likewise available. However, there is a lack of uniform standards for graphical data mining, which is also of central importance. To fill this gap, we developed SeqCode, an open suite of applications that analyzes sequencing data in an elegant but efficient manner. Our software is a portable resource written in ANSI C that can be expected to work for almost all genomes in any computational configuration. Furthermore, we offer a user-friendly front-end web server that integrates SeqCode functions with other graphical analysis tools. Our analysis and visualization toolkit represents a significant improvement in terms of performance and usability as compare to other existing programs. Thus, SeqCode has the potential to become a key multipurpose instrument for high-throughput professional analysis; further, it provides an extremely useful open educational platform for the world-wide scientific community. SeqCode website is hosted at http://ldicrocelab.crg.eu , and the source code is freely distributed at https://github.com/eblancoga/seqcode .Entities:
Mesh:
Year: 2021 PMID: 34599234 PMCID: PMC8486768 DOI: 10.1038/s41598-021-98889-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1Modular architecture of SeqCode. Diagram of subroutines implemented in SeqCode to perform multiple tasks. Functions were classified depending on the object of analysis: genes, genomic regions or peaks, genome tracks, and signal levels. Functions are represented as boxes; arrows indicate the dataflow of each pipeline.
List of SeqCode functions.
| Name | Description | Input | Output | |
|---|---|---|---|---|
| Custom tracks for genome browsers | buildChIPprofile | Generates a custom track from a sequencing experiment to be visualized in current genome browsers | One SAM/BAM file | The custom track in BedGraph format |
| combineChIPprofiles | Generates a custom track by subtracting the second sequencing experiment from the first one to be visualized in current genome browsers | Two SAM/BAM files | ||
| Average occupancy plots | combineTSSplots | Draws the average distribution by subtracting the second sequencing experiment from the first one around the TSS of selected genes | Two SAM/BAM files, a list of genes | The average plot in PDF, the signal values and the R script |
| produceGENEplots | Draws the average distribution of a sequencing experiment along the body of the meta-gene of selected genes | One SAM/BAM file, a list of genes | ||
| producePEAKplots | Draws the average distribution of a sequencing experiment around the center of selected peaks | One SAM/BAM file, a list of BED peaks | ||
| produceTESplots | Draws the average distribution of a sequencing experiment around the TES of selected genes | One SAM/BAM file, a list of genes | ||
| produceTSSplots | Draws the average distribution of a sequencing experiment around the TSS of selected genes | One SAM/BAM file, a list of genes | ||
| Density heatmaps | combineTSSmaps | Draws the heatmap by subtracting the second experiment from the first one around the TSS of selected genes | Two SAM/BAM files, a list of genes | The heat map in PDF, the rank of signal values and the R script |
| produceGENEmaps | Draws the heatmap of reads of an experiment along the body of the meta-gene of selected genes | One SAM/BAM file, a list of genes | ||
| producePEAKmaps | Draws the heatmap of reads of an experiment around the centre of selected peaks | One SAM/BAM file, a list of BED peaks | ||
| produceTESmaps | Draws the heatmap of a sequencing experiment around the TES of selected genes | One SAM/BAM file, a list of genes | ||
| produceTSSmaps | Draws the heatmap of a sequencing experiment around the TSS of selected genes | One SAM/BAM file, a list of genes | ||
| Signal levels | recoverChIPlevels | Calculates the average, maximum, and total number of normalized reads of a sequencing experiment inside a set of regions of the genome | One SAM/BAM file, a list of regions in BED format | Average, maximum and total number of reads inside each region |
| Peak analysis | genomeDistribution | Distributes a list of regions of the genome into distinct gene features | A list of regions in BED format | The genome distribution in PDF, the annotation of the regions |
| matchpeaks | Calculates the overlap between the components of two lists of peaks | Two lists of BED peaks | The list of overlapping peaks and the subsets of peaks that do not overlap on the other set | |
| matchpeaksgenes | Identifies genes genome-wide that contain one (or more) peaks from a list of selected peaks defined by the user accordingly to a set of rules and distances | A list of BED peaks | The list of genes in the genome that are target of the peaks | |
| Evolutionary conservation | scorePhastCons | Calculates the average, maximum value, and total PhastCons score inside a set of genomic regions | A list of regions in BED format | Average, maximum and total score inside each region |
Figure 2SeqCode ChIP-seq and RNA-seq profiles in mESCs for visualizing in genome browsers. (a) Actively transcribed region containing the Oct4/Pou5f1 pluripotency gene. (b) Region repressed for transcription by PcG proteins containing the HoxA complex. Raw data were retrieved from[32,33,38,75]. The SeqCode buildChIPprofile function (window size = 10) was used to generate each custom track from resulting BAM files. The SeqCode combineChIPprofiles function was used to generate the RNA-seq strand-specific profiles. Composite ChIP-seq and RNA-seq supertracks integrate all information from each individual track shown below.
Figure 3Basic analysis of the H3K4me3 ChIP-seq sample in mESCs using SeqCode. (a) From left to right: average distributions around the TSS of H3K4me3 target genes (produceTSSplots), along the gene body of the same genes (produceGENEplots), and around the center of H3K4me3 peaks (producePEAKplots). (b) From left to right: heatmap around the TSS of H3K4me3 target genes (produceTSSmaps), along the gene body of the same genes (produceGENEmaps), and around the center of H3K4me3 peaks (producePEAKmaps). (c) Basic and detailed genome distribution of H3K4me3 peaks (genomeDistribution). (d) ChIP-seq signal levels of selected genes (recoverChIPlevels). Raw data were retrieved from[32].
Figure 4Evolutionary conservation landscape of distinct features of the mouse genome calculated by SeqCode. (a) PhastCons average score distribution calculated by the SeqCode scorePhastCons function on distinct gene features: protein-coding regions (CDS), 5′ untranslated regions (5′UTR), 3′ untranslated regions (3′UTR), promoters 1000 bp upstream of the TSS (Upstream), regions 1000 bp downstream the TES (Downstream), intronic regions (Introns), and intergenic regions (Intergenic). RefSeq annotations were used to extract the coordinates of all instances of each feature in the mouse genome (mm9). Left, distribution of the score (from 0 to 1); right, boxplot summarizing differences between features. (b–d) PhastCons average score calculated by SeqCode on distinct regulatory features: (b) super-enhancers identified as significant concentrations of H3K27ac in mESCs[76], (red); (c) broad domains of H3K4me3 reported in mESCs[77] (green); and (d) computational predictions of TATA boxes using Jaspar[78], before and after the conservation filtering (TBP and TBP conserved), to highlight the significant drop in the number of ab initio predictions (blue). Left, the global distribution of each class of elements; right, an example of high conservation. The phastCons30way track of mouse was used to score each set of regions with the SeqCode scorePhastCons function. Raw data were retrieved from[32,79].
Figure 5Epigenetic signature of active and bivalent genes from mESCs reconstructed with SeqCode. (a) Bivalent region repressed for transcription by PcG proteins containing the HoxD complex. Genome-wide profiles and peaks of H3K4me3 and H3K27me3 are shown along this locus. (b) Overlap of H3K4me3 and H3K27me3 peaks produces three classes of sites: active (H3K4me3 + /H3K27me3–; orange), bivalent (H3K4me3 + /H3K27me3 + ; violet), and silent (H3K4me3–/H3K27me3 + ; blue). Determination of signal strength and genomic distribution of each class of peaks. (c) Overlap of H3K4me3 and H3K27me3 target genes, showing the functional analysis of active genes (orange), bivalent genes (violet), and silent genes (blue). (d) Average distribution and heatmap of the ChIP-seq signal around the TSS of active and bivalent genes for H3K4me3, H3K27me3, H3K36me3, and Suz12. (e) ChIP-seq levels of each experiment are shown for active and bivalent genes (left and right boxes in the boxplots, respectively). For comparison, gene expression (in RPKMs) is shown for both gene sets. Raw data of ChIP-seq and RNA-seq experiments were retrieved from[36]. The SeqCode buildChIPprofile function was used to generate each custom track from the resulting BAM files after mapping; the matchpeaks application compared the H3K4me3 and H3K27me3 peaks; the recoverChIPlevels application determined the strength of the ChIP-seq signal at each subset of peaks; the genomeDistribution program calculated the genomic composition of each collection of peaks, according to RefSeq annotations; the matchpeaksgenes routine associated ChIP-seq peaks and target genes; and the produceTSSplots and produceTSSmaps programs generated the average distribution meta-plots and heatmaps of each ChIP-seq sample for signal around the TSS of active and bivalent genes, respectively. GO term enrichment was analyzed with Enrichr[80]. For boxplots in (b,e), ChIP-seq counts were normalized by the total number of mapped reads, and RNA-seq expression values were calculated as RPKMs.
Figure 6Panel of active and repressive actors in mESCs generated by SeqCode. For each ChIP-seq signal and corresponding set of target genes, the average distribution (produceTSSplots) and heatmap (produceTSSmaps) around the TSS, average distribution along the gene body (produceGENEplots), average distribution (producePEAKplots) and heatmap (producePEAKmaps) around the center of the peaks, genome distribution of each set of peaks (genomeDistribution), and ChIP-seq levels normalized by the total number of reads (recoverChIPlevels) are shown for 100 highly expressed genes (more than 1000 RPKMs; red), 100 moderately expressed genes (100–500 RPKMs; yellow) and 100 silenced genes (0–1 RPKMs; blue). Raw data were retrieved from[32,33,37,38]. Target genes of each ChIP-seq were identified with the matchpeaksgenes routine.
Figure 7Evaluation of ChIP-seq experiments after genetic perturbation with SeqCode. (a) Study of the efficiency of MLL2 knockout (KO) in mESCs (left) and its effect over the occupancy pattern of H3K4me3 (right). (b) Study of the efficiency of RING1B knockdown (KD) in the Ewing sarcoma A673 cell line (left) and its effect over the binding pattern of EWS-FLI1 oncogenic protein (right). (c) Study of the efficiency of PHF19 KD in the prostatic cancer DU145 cell line (left) and its effect over the pattern of MTF2 (right). Peaks called in the wild-type (WT) condition of the factor being analyzed on each case were used as a reference in both conditions. For every pair of WT-KO/KD experiments, ChIP-seq normalized levels (recoverChIPlevels), and the average distribution (producePEAKplots) and heatmap (producePEAKmaps) around the peaks are shown. Below, genome-wide screenshots of the WT and KO/KD profiles on a locus containing peaks of this class were generated with the SeqCode buildChIPprofile function. Boxplots of ChIP-seq values are shown in log scale. All samples were normalized by the total number of reads of each experiment, except MTF2 ChIP-seq samples that were normalized by the total number of spike-in fruit-fly reads. Raw data were retrieved from[40–42].
Figure 8Homepage of the SeqCode website. The main menu of the web server is divided into four categories: NGS Tools (sequencing plots), Datasets (distribution plots), Gene Sets (intersection plots) and List Ops (operations with lists of identifiers). Additional help documentation and a section with downloading links are also provided here.
List of bioinformatics tools for graphical analysis of sequencing information.
| Software | References |
|---|---|
| SeqCode | This article |
| BindDB | Livyatan et al.[ |
| CGAT | Sims et al.[ |
| ChAsE | Younesy et al.[ |
| ChIPseeker | Yu et al.[ |
| ChIP-seq tools | Ambrosini et al.[ |
| CisGenome | Ji et al.[ |
| Cistrome | Liu et al.[ |
| deepTools | Ramirez et al.[ |
| EaSeq | Lerdrup et al.[ |
| Epidaurus | Wang et al.[ |
| GeneProf | Halbritter et al.[ |
| Genomation | Akalin et al.[ |
| HOMER | Heinz et al.[ |
| ngs.plot | Shen et al.[ |
| PyBedGraph | Zhang et al.[ |
| seqMINER | Ye et al.[ |
| Spark | Nielsen et al.[ |
Figure 9Evaluation of tools for the visualization of sequencing data. (a) Radar chart displaying the eight attributes that have been used to describe the characteristics of the existing applications. (b) Radar charts of each selected bioinformatics software for graphical summarizing sequencing data (see Table 3 for the values that are graphically displayed for each program here).
Characteristic features of SeqCode and other similar resources.
| Feature | SeqCode | BindDB | CGAT | ChAsE | ChIPseeker | ChIP-seq | CisGenome | Cistrome | deepTools | EaSeq | Epidaurus | GeneProf | genomation | HOMER | ngs.plot | PyBedGraph | seqMINER | Spark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Selection | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |
| 2. All genomes | X | X | X | X | X | X | ||||||||||||
| 3. All assemblies | X | X | X | X | X | X | X | |||||||||||
| 1. Gene | X | X | X | X | X | X | X | |||||||||||
| 2. Enhancer | X | X | X | X | X | |||||||||||||
| 3. Genomic | X | X | X | X | X | X | X | X | X | X | X | X | X | X | ||||
| 1. ChIP-seq | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| 2. RNA-seq | X | X | X | X | X | X | X | X | X | |||||||||
| 3. ATAC-seq | X | X | X | X | X | X | X | X | X | |||||||||
| 1. Aggregated | X | X | X | X | X | X | X | X | X | X | X | X | ||||||
| 2. Heatmaps | X | X | X | X | X | X | X | X | X | X | X | X | ||||||
| 3. Distributions | X | X | X | X | X | X | X | X | X | |||||||||
| 1. Custom tracks | X | X | X | X | X | X | X | X | X | |||||||||
| 2. Signal strength | X | X | X | X | X | X | X | |||||||||||
| 3. Conservation | X | X | X | X | ||||||||||||||
| 1. Local | X | X | X | X | X | `X | X | X | X | X | X | X | X | |||||
| 2. Flexible | X | X | X | X | X | X | ||||||||||||
| 3. Dynamic | X | X | X | X | X | X | X | X | X | X | X | X | X | X | ||||
| 1. Executable | X | X | X | X | X | |||||||||||||
| 2. Parametric | X | X | X | X | X | X | ||||||||||||
| 3. Portable | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |
| 1. Web site | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | |||
| 2. Tutorials | X | X | X | X | X | X | X | X | X | X | ||||||||
| 3. Self-docs | X | X | ||||||||||||||||
List of GEO accession codes of the raw data analyzed in this work.
| GEO accession | Sample | Cell/Tissue | References |
|---|---|---|---|
| SRX367147 | H3K4me3 | mESCs | Tee et al.[ |
| SRX336228 | Jarid2 | ||
| GSM1019769 | H3K36me3 | Ballare et al.[ | |
| GSM1019772 | H3K27me3 | ||
| GSM1019771 | Suz12 | ||
| GSM850467 | Ser5P | Brookes et al.[ | |
| GSM850470 | Ser2P | ||
| GSM850471 | H2Aub1 | ||
| GSM1562339 | RNA-seq | Jacinto et al.[ | |
| GSM1526287 | H3K27ac | Ji et al.[ | |
| GSM2098958 | H3K4me3 | Beringer et al.[ | |
| GSM2098952 | H3K27me3 | ||
| GSM1041372 | Ring1b | Morey et al.[ | |
| GSM2645501 | MLL2 WT | Mas et al.[ | |
| GSM2645502 | MLL2 KO | ||
| GSM2645495 | H3K4me3 WT | ||
| GSM2645496 | H3K4me3 KO | ||
| GSM4616798 | RING1B WT | A673 | Sánchez et al.[ |
| GSM4616801 | RING1B KD | ||
| GSM4616799 | EWSR1-FLI1 WT | ||
| GSM4616802 | EWSR1-FLI1 KD | ||
| GSM4023634 | PHF19 WT | DU145 | Jain et al.[ |
| GSM4023635 | PHF19 KD | ||
| GSM4023643 | MTF2 WT | ||
| GSM4023644 | MTF2 KD | ||
| GSM593408 | H3K4me3 | Wing imaginal disc (fly) | Perez-Lluch et al.[ |
| GSM593409 | H3K27me3 | ||
| GSM593410 | H3K36me3 | ||
| GSM593412 | Ser5P | ||
| GSM593411 | Ser2P | ||
| GSM1182471 | H3K4me1 | Perez-Lluch et al.[ | |
| GSM1182472 | H3K27ac | ||
| GSM1363590 | H3K9ac | ||
| GSM1060715 | RNA-seq | ||
| GSM1005586 | Cabut | Ruiz-Romero et al.[ |