Alejandro Saettone, Marcelo Ponce, Syed Nabeel-Shah, Jeffrey Fillingham.
Abstract
BACKGROUND: Chromatin immunoprecipitation coupled to next generation sequencing (ChIP-Seq) is a widely used molecular method to investigate the function of chromatin-related proteins by identifying their associated DNA sequences on a genomic scale. ChIP-Seq generates large quantities of data that are difficult to process and analyze, particularly for organisms with contig-based sequenced genomes, which typically have minimal annotation of their genes beyond coordinates predicted primarily by gene-finding programs. Poorly annotated genome sequences make comprehensive analysis of ChIP-Seq data difficult, and as a result standardized analysis pipelines are lacking.
Keywords: Bioinformatics pipeline; Chromatin immunoprecipitation; High-performance computing; Next generation sequencing; Tetrahymena thermophila
Year: 2019 PMID: 31664892 PMCID: PMC6819487 DOI: 10.1186/s12859-019-3100-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Core RACS pipeline overview. This flowchart represents the logical steps implemented in the core pipeline. Boxes represent files and file types as indicated in the text. Files with thick boxes are the input files for the intergenic calculations. Files in green are to be uploaded to IGV. The file in blue is needed by IGV but does not have to be uploaded to it; it must be kept in the same directory as the sorted BAM file
Fig. 2 Schematic diagram of the tasks implemented for the RACS core pipeline. Included are details of the processing stages in relation to the scaffold-based genome and the breakdown of each of these steps. Bold names indicate bioinformatics-specific modules, while normal fonts represent generic ones. The bifurcation represents tasks that can be executed in parallel, as there is no data dependency among them
RACS scaling and performance trends for the ORF part of the pipeline. We performed a standard strong scaling analysis and repeated it for data sets of different sizes
| Initial data size | Number of processors | Workspace usage | Walltime (s) |
|---|---|---|---|
| ≈3 GB (a) | 1 | ≈27 GB | 7037 |
| ≈3 GB | 2 | ≈27 GB | 5059 |
| ≈3 GB | 4 | ≈27 GB | 3856 |
| ≈3 GB | 8 | ≈27 GB | 3238 |
| ≈3 GB | 16 | ≈27 GB | 2940 |
| ≈3 GB | 32 | ≈27 GB | 2801 |
| ≈3 GB | 64 (b) | ≈27 GB | 2463 |
| ≈2.4 GB (c) | 1 | ≈20 GB | 5477 |
| ≈2.4 GB | 2 | ≈20 GB | 4005 |
| ≈2.4 GB | 4 | ≈20 GB | 3128 |
| ≈2.4 GB | 8 | ≈20 GB | 2678 |
| ≈2.4 GB | 16 | ≈20 GB | 2456 |
| ≈2.4 GB | 32 | ≈20 GB | 2344 |
| ≈2.4 GB | 64 | ≈20 GB | 2161 |
| ≈6.8 GB (d) | 1 | ≈50.3 GB | 6987 |
| ≈6.8 GB | 2 | ≈50.3 GB | 5662 |
| ≈6.8 GB | 4 | ≈50.3 GB | 4864 |
| ≈6.8 GB | 8 | ≈50.3 GB | 4451 |
| ≈6.8 GB | 16 | ≈50.3 GB | 4245 |
| ≈6.8 GB | 32 | ≈50.3 GB | 4148 |
| ≈6.8 GB | 64 | ≈50.3 GB | 4155 |
| ≈7.1 GB (e) | 1 | ≈53.4 GB | 7728 |
| ≈7.1 GB | 2 | ≈53.4 GB | 6191 |
| ≈7.1 GB | 4 | ≈53.4 GB | 5255 |
| ≈7.1 GB | 8 | ≈53.4 GB | 4740 |
| ≈7.1 GB | 16 | ≈53.4 GB | 4529 |
| ≈7.1 GB | 32 | ≈53.4 GB | 4413 |
| ≈7.1 GB | 64 | ≈53.4 GB | 4249 |
| ≈1.4 GB (f) | 1 | ≈8.3 GB | 2874 |
| ≈1.4 GB | 2 | ≈8.3 GB | 1796 |
| ≈1.4 GB | 4 | ≈8.3 GB | 1218 |
| ≈1.4 GB | 8 | ≈8.3 GB | 920 |
| ≈1.4 GB | 16 | ≈8.3 GB | 773 |
| ≈1.4 GB | 32 | ≈8.3 GB | 702 |
| ≈1.4 GB | 64 | ≈8.3 GB | 639 |
(a) Ibd1-1 data set for T. thermophila [16].
(b) Although the TDS/Niagara nodes have 40 physical cores, hyperthreading is enabled, so up to 80 logical cores can be used.
(c) Ibd1-2 data set for T. thermophila [16].
(d) MED31-1 data set for T. thermophila [48].
(e) MED31-2 data set for T. thermophila [48].
(f) Data set for O. trifallax.
As can be seen, the working space (in this case memory utilization) can reach up to 9-10× the size of the initial data to be processed. Further details about memory consumption can be found in the README document and the “doc” directory included in the RACS repository. These tests were run on the TDS system (i.e., one Lenovo SD530 node with 40 cores and 192 GB of RAM running the CentOS 7.4 operating system) of the Niagara supercomputer [27], using a RAMDISK as working space
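The strong scaling trend in the table can be quantified with a few lines of Python. The sketch below is illustrative only and is not part of RACS: the function names `strong_scaling` and `workspace_ratio` are hypothetical, the walltimes are copied from the Ibd1-1 rows of the table, and the serial fraction is the standard Karp-Flatt estimate, used here as an assumption about how one might summarize the saturation.

```python
# Walltimes in seconds, copied from the scaling table for the Ibd1-1 data set (about 3 GB).
WALLTIMES_IBD1_1 = {1: 7037, 2: 5059, 4: 3856, 8: 3238, 16: 2940, 32: 2801, 64: 2463}


def strong_scaling(walltimes):
    """Return {p: (speedup, efficiency, serial_fraction)} relative to the 1-processor run.

    serial_fraction is the Karp-Flatt estimate; it is undefined for p = 1,
    so None is stored in that case.
    """
    t1 = walltimes[1]
    rows = {}
    for p in sorted(walltimes):
        speedup = t1 / walltimes[p]
        efficiency = speedup / p
        if p == 1:
            serial = None
        else:
            serial = (1 / speedup - 1 / p) / (1 - 1 / p)
        rows[p] = (speedup, efficiency, serial)
    return rows


def workspace_ratio(workspace_gb, input_gb):
    """Ratio of working space to initial data size (both in GB)."""
    return workspace_gb / input_gb


if __name__ == "__main__":
    for p, (s, e, f) in strong_scaling(WALLTIMES_IBD1_1).items():
        serial_txt = "n/a" if f is None else f"{f:.2f}"
        print(f"p={p:2d}  speedup={s:.2f}  efficiency={e:.2f}  serial fraction={serial_txt}")
    print(f"workspace/input for Ibd1-1: {workspace_ratio(27, 3):.1f}x")
```

For the Ibd1-1 data set, going from 1 to 64 processors gives a speedup of 7037/2463 ≈ 2.86, which suggests that a substantial part of the ORF stage behaves serially on this hardware.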
Fig. 3 Diagram summarizing the ChIP-Seq technique used to prepare the samples and generate the data from the “wet-lab”: 1) Native state of chromatin. 2) Specific antibodies recognize the tagged proteins. 3) Isolation of tagged protein plus its interacting chromatin. 4) After DNA purification and library preparation NGS is performed. 5) The output data from NGS is aligned to Tetrahymena thermophila’s genome assembly
Fig. 4 a and b Visualization of genic and intergenic regions using IGV. The top track shows MACS2 broad and gapped peaks. The middle track shows the RACS visual representation of read accumulation. Note that RACS graphically displays read behaviour and accumulation preferences. The bottom track shows T. thermophila’s genes. On the other hand, MACS2 found two weak peaks that can be interpreted as background by our pipeline. The range inside the brackets represents the highest number of reads for that specific track. c Ibd1 localizes to more intergenic than genic regions. d The majority of reads are found in genic regions. These results take into consideration the updated information provided by the mock samples
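The genic/intergenic distinction used in these panels relies on gene coordinates along each scaffold. As a minimal sketch, and not the RACS implementation, intergenic regions can be derived as the gaps between annotated genes. The function `intergenic_regions` and its 1-based inclusive coordinate convention are assumptions made for illustration.

```python
def intergenic_regions(genes, scaffold_length):
    """Return the gaps between genes on a single scaffold.

    genes: iterable of (start, end) pairs, 1-based and inclusive (assumed convention).
    scaffold_length: length of the scaffold in base pairs.
    Overlapping or nested genes are merged implicitly.
    """
    regions = []
    cursor = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > cursor:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= scaffold_length:
        regions.append((cursor, scaffold_length))
    return regions


if __name__ == "__main__":
    # Hypothetical gene coordinates on a 1000 bp scaffold.
    print(intergenic_regions([(100, 200), (150, 300), (500, 600)], 1000))
```

Counting reads per region of each kind then yields the two views in panels c and d: the number of bound regions and the number of reads can favour different categories.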
Comparison of Ibd1 localization presented in [16] (without untagged controls) and analyzed by RACS using untagged controls (current study)
(a) Ibd1 localization presented in both studies.

| Expression level | Ibd1 localization % using untagged controls | Ibd1 localization % without untagged controls |
|---|---|---|
| High expression | 51 | 54 |
| Moderate expression | 6 | 16 |
| Low to no-expression | 16 | 14 |
| Non-available expression for the TTHERMs in the GSM692081 data set | 27 | 16 |
(b) t-Test: Paired Two Sample for Means

| Statistic | Ibd1 localization % using untagged controls | Ibd1 localization % without untagged controls |
|---|---|---|
| Mean | 25 | 25 |
| Variance | 374 | 374.6667 |
| Observations | 4 | 4 |
| Pearson correlation | 0.89581514 | |
| Hypothesized mean difference | 0 | |
| df | 3 | |
| t Stat | 0 | |
| P(T<=t) one-tail | 0.5 | |
| t Critical one-tail | 2.35336343 | |
| P(T<=t) two-tail | 1 | |
| t Critical two-tail | 3.18244631 | |
The correlation between the two data sets is 0.896, and there is no statistically significant difference between them. The data presented in [16] use an arbitrary cut-off. The data presented in this paper do not; instead, the cut-off is set by the values obtained from the analysis of the untagged samples
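The summary statistics in part (b) can be recomputed from the four percentages in part (a). The following sketch uses only the Python standard library; the helper names `pearson` and `paired_t_statistic` are hypothetical and were chosen for this illustration, not taken from the original analysis.

```python
import math
import statistics

# Ibd1 localization percentages from part (a), in the order:
# high, moderate, low to no-expression, non-available expression.
with_controls = [51, 6, 16, 27]
without_controls = [54, 16, 14, 16]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def paired_t_statistic(x, y):
    """Return (t, df) for a paired t-test with hypothesized mean difference 0."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    print("means:", statistics.mean(with_controls), statistics.mean(without_controls))
    print("variances:", statistics.variance(with_controls), statistics.variance(without_controls))
    print("Pearson r:", pearson(with_controls, without_controls))
    print("t, df:", paired_t_statistic(with_controls, without_controls))
```

Because both columns are percentages that sum to 100 over the same four categories, their means are identical and the paired t statistic is exactly zero; the correlation is therefore the more informative number in this comparison.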
Fig. 5 Gene Ontology (GO) analysis of genes controlled by Ibd1. GO predicted that the majority of Ibd1-bound genes are related to housekeeping functions
Fig. 6 RACS analysis using Oxytricha trifallax ChIP-Seq Rpb1 data gives results comparable to those reported in [17]. a and b Rpb1 is enriched along Contig22209.0 and Contig451.1. The range inside the brackets represents the highest number of reads for that specific track. c Rpb1 binds a similar number of genic and intergenic regions throughout the genome; however, d most of the reads pulled down by Rpb1 are within genic regions