Abdullah El-Kurdi, Ghiwa Ali Khalil, Georges Khazen, Pierre Khoueiry.
Abstract
BACKGROUND: Finding combinations of homotypic or heterotypic genomic sites obeying a specific grammar in DNA sequences is a frequent task in bioinformatics. A typical case is the identification of cis-regulatory modules, which are characterized by a combination of transcription factor binding sites within a defined window size. Although previous studies have identified clusters of genomic sites in species with varying genome sizes, a dedicated and versatile tool to search for such clusters is lacking.
Keywords: Bed; Bioconductor; Cis-regulatory modules; GRanges; Genome scan; Genomic clusters; Next generation sequencing; Transcription factor binding sites; Variants data; Vcf
Year: 2020 PMID: 32429868 PMCID: PMC7236483 DOI: 10.1186/s12859-020-3536-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 fcScan search strategy and performance. Left: Example of data input, which can be a GRanges object, a data frame or a BED/VCF file. Right: Schematic representation of a search for clusters containing a combination of 2 heterotypic genomic features (orange circles and green squares) and excluding a third genomic feature (yellow triangle) in a window size of 200 bp. Sites within identified clusters must obey the defined order and orientation. Bottom: Function call on the data represented in the upper left corner, with its corresponding output. Only one cluster, marked by the “correct” sign, is called based on the criteria above. All remaining clusters are eliminated for the reason given above each case.
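The kind of search described in Fig. 1 can be illustrated with a minimal Python sketch. This is not fcScan's implementation or API: the names `Site` and `find_clusters` and the parameters `window`, `required` and `exclude` are hypothetical, the scan is a simple left-to-right pass, and the order and orientation constraints mentioned in the caption are not modeled.

```python
from collections import Counter, namedtuple

# Hypothetical record type for one genomic site (not an fcScan structure).
Site = namedtuple("Site", "chrom start end name")


def find_clusters(sites, window, required, exclude=()):
    """Return (chrom, start, end) spans holding the required site counts.

    Each span starts at a required site and grows to the right until it
    holds at least required[name] sites of every required name. Growth
    stops without a cluster if the span would exceed `window` bp or if an
    excluded site is met first.
    """
    by_chrom = {}
    for site in sites:
        by_chrom.setdefault(site.chrom, []).append(site)

    clusters = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom], key=lambda s: (s.start, s.end))
        for i, first in enumerate(ordered):
            if first.name not in required:
                continue  # a cluster must start at a required site
            counts = Counter()
            span_end = first.end
            for site in ordered[i:]:
                span_end = max(span_end, site.end)
                if span_end - first.start > window:
                    break  # span would exceed the window size
                if site.name in exclude:
                    break  # an excluded feature disqualifies the span
                counts[site.name] += 1
                if all(counts[n] >= k for n, k in required.items()):
                    clusters.append((chrom, first.start, span_end))
                    break
    return clusters


# Made-up example data: feature "c" plays the role of the excluded site.
sites = [
    Site("chr1", 100, 110, "a"),
    Site("chr1", 150, 160, "b"),
    Site("chr1", 400, 410, "a"),
    Site("chr1", 450, 460, "c"),
    Site("chr1", 500, 510, "b"),
    Site("chr1", 900, 910, "a"),
]

print(find_clusters(sites, window=200, required={"a": 1, "b": 1}, exclude={"c"}))
# prints [('chr1', 100, 160)]
```

In this toy input, the second candidate (positions 400 to 510) is rejected only because the excluded feature lies between its two required sites; calling the function without `exclude` reports it as a second cluster.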
Fig. 2 fcScan is optimized to scan and cluster large input data. a Call used for benchmarking the “getCluster” function in fcScan on randomly generated datasets of up to 1,000,000 features and 3 heterotypic genomic features (a, b, and c) over 5 chromosomes, using the argument “c = c(“a” = 1, “b” = 2, “c” = 1)” with “greedy = FALSE” for panel “b” and “greedy = TRUE” for panel “c”, and with either 1 or 5 computing cores (1 computing core per chromosome). All values used for the input “x”, the window size “w” and the “greedy” option are shown. b, c Left panels correspond to a fixed input size of 1,000,000 features, with runtime shown as a function of window size. Right panels correspond to a fixed window size of 500 bp, with runtime shown as a function of input size. The y-axis represents rounded mean runtimes in seconds over 10 runs.
Fig. 3 fcScan identified CTCF clusters delimiting the shh/ZRS TAD in forward and reverse nomenclatures. Genome browser view showing tracks of CTCF ChIP-seq on K562 and GM12878, SMC3 ChIP-seq on K562, CTCF PWM hits in forward (blue) and reverse (red) orientations, Hi-C based TADs in K562, and fcScan-identified CTCF clusters for forward (blue) and reverse (red) CTCF sites. The last track shows the gene models, with the shh gene highlighted in red. CTCF PWMs for the forward and reverse orientations are shown below. The COSMIC data track was omitted for clarity. The function call used to identify the clusters is shown on top and was run independently for CTCF PWM-based binding sites on the positive (the call shown) and negative strands.