Michael Hackenberg, Pedro Carpena, Pedro Bernaola-Galván, Guillermo Barturen, Angel M Alganza, José L Oliver.
Abstract
BACKGROUND: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.
Year: 2011 PMID: 21261981 PMCID: PMC3037320 DOI: 10.1186/1748-7188-6-2
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Distance distributions. Expected and observed distance distributions for human chromosomes 16 (above) and 5 (below). For chromosome 16, the median, the chromosome intersection and the genome intersection are very close (within 1 bp), while for chromosome 5 notable differences exist (from 33 bp to 49 bp).
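The quantities behind Figure 1 can be illustrated with a minimal sketch. It collects the positions of a word, derives the observed distances between neighbouring occurrences and their median, and evaluates an expected distribution. Several things here are assumptions that this section does not state: the distance is taken as the difference of start coordinates, the expected distribution is modelled as geometric (random placement with a fixed per-position probability), and the function names and the toy sequence are hypothetical.

```python
from statistics import median


def word_positions(seq, word):
    """Return 0-based start positions of every occurrence of word (overlaps allowed)."""
    seq, word = seq.upper(), word.upper()
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def consecutive_distances(positions):
    """Distances between neighbouring occurrences (assumed: difference of start coordinates)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def geometric_pmf(d, p):
    """Assumed null model: probability of distance d if each position holds the word with probability p."""
    return p * (1.0 - p) ** (d - 1)


toy_seq = "CGCGATCGTTTTTTTTCGACGCG"  # toy sequence, not real genomic data
positions = word_positions(toy_seq, "CG")
observed = consecutive_distances(positions)
p = len(positions) / len(toy_seq)  # crude per-position word probability

print("positions:", positions)
print("observed distances:", observed)
print("median distance:", median(observed))
print("expected P(d=2):", round(geometric_pmf(2, p), 4))
```

On a real chromosome, the observed histogram of distances would be compared with the expected one to locate an intersection point; the toy sequence is far too short for that comparison to be meaningful.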
WordCluster predictions of CpG clusters*
| Method | # | Length ± SD | GC ± SD | OE ± SD |
|---|---|---|---|---|
| cpg50 | 198703 | 273.2 ± 246.4 | 63.8 ± 7.5 | 0.855 ± 0.265 |
| cpgISc | 194725 | 218.7 ± 200.1 | 65.6 ± 7.7 | 0.916 ± 0.273 |
| cpgISg | 204238 | 202.6 ± 183.8 | 66.3 ± 7.5 | 0.930 ± 0.274 |
*Basic statistics of CpG island predictions using three different distance models: cpgISg (genome intersection), cpg50 (median) and cpgISc (chromosome intersection). The number of predicted islands, the length, the G+C content and the observed-to-expected ratios are shown. Note that the original cpg50 algorithm predicts 198702 islands, i.e. one fewer than WordCluster with the median model. This is due to the changes introduced regarding the N-runs (see main text).
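The GC and OE columns of the table above can be computed per island along the following lines. This is a sketch under a stated assumption: the observed-to-expected ratio is taken to be the widely used definition, CpG count times sequence length divided by the product of the C and G counts, which this section does not spell out. The function names and the example sequence are hypothetical.

```python
def gc_content(seq):
    """G+C content of seq as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """CpG observed/expected ratio, assuming O/E = (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # ratio is undefined without both bases; report 0 by convention
    # "CG" cannot overlap itself, so str.count finds every occurrence
    return seq.count("CG") * len(seq) / (n_c * n_g)


island = "CGCGATCG"  # toy sequence, not a real predicted island
print(f"GC = {gc_content(island):.1f}%, O/E = {cpg_obs_exp(island):.3f}")
```

Averaging these two values over all predicted islands, together with their standard deviations, would give figures of the kind reported in the GC ± SD and OE ± SD columns.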
Biological meaning of WordCluster predictions*
| Method | #islands | #TSS overlap | #R13 overlap | #Alu overlap | #PhastCons overlap |
|---|---|---|---|---|---|
| cpg50 | 198703 | 12432 (6.3%) | 30660 (15.4%) | 80323 (40.4%) | 48787 (24.6%) |
| cpgISc | 194724 | 11926 (6.1%) | 34567 (17.8%) | 70144 (36.0%) | 48930 (25.1%) |
| cpgISg | 204238 | 12156 (6.0%) | 37616 (18.4%) | 70456 (34.5%) | 52335 (25.6%) |
*Comparison of three WordCluster predictions of CG clusters (CpG islands) using three different distance models: cpgISg (genome intersection), cpg50 (median) and cpgISc (chromosome intersection). The overlap with two gene regions (TSS and R13), Alu elements and phylogenetically conserved PhastCons elements was measured; both absolute numbers and percentages are given.
Clusters of CWG trinucleotides*
| N | 84996 |
|---|---|
| Genome coverage (bp) | 15700789 |
| Average length (bp) | 184.7 |
| No. of clusters co-locating with gene regions: | |
| TSS | 272 |
| TSS ± 100 bp | 686 |
| 5'UTR | 4712 |
| Introns | 29326 |
| Exons | 1852 |
| 3'UTR | 1658 |
*Statistically significant clusters of CAG and CTG trinucleotides detected by WordCluster in the human genome (hg18). We used the "genome intersection" distance model and a p-value threshold of 1E-05.
Clusters of OR genes in human chromosome 11*
| Cluster | chromStart | chromEnd | length | count | p-value |
|---|---|---|---|---|---|
| 1 | 4345160 | 5178488 | 833329 | 53 | 1.60E-49 |
| 2 | 5269273 | 5559687 | 290415 | 21 | 6.80E-21 |
| 3 | 5697096 | 6177989 | 480894 | 28 | 2.70E-25 |
| 4 | 48194938 | 48344593 | 149656 | 9 | 2.50E-08 |
| 5 | 48398372 | 48505102 | 106731 | 9 | 1.70E-09 |
| 6 | 49876392 | 49960613 | 84222 | 7 | 2.60E-07 |
| 7 | 51250039 | 51384376 | 134338 | 11 | 1.60E-11 |
| 8 | 54842612 | 55380573 | 537962 | 32 | 4.00E-29 |
| 9 | 55427396 | 56344568 | 917173 | 66 | 6.30E-65 |
| 10 | 56495101 | 56580184 | 85084 | 7 | 2.90E-07 |
| 11 | 57555001 | 57964200 | 409200 | 22 | 3.20E-19 |
| 12 | 58833691 | 59056759 | 223069 | 12 | 1.30E-10 |
| 13 | 123181329 | 123481891 | 300563 | 16 | 5.40E-14 |
*Chromosome coordinates, length, number of OR genes and p-values for all statistically significant OR gene clusters in chromosome 11.
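The length column in the table above is consistent with inclusive coordinates, that is, length = chromEnd - chromStart + 1. The short check below reproduces this for clusters 1, 2 and 9, using the values copied from the table. The genes-per-megabase figure is an illustrative derived quantity, not one reported in the source, and the helper names are hypothetical.

```python
# (cluster id, chromStart, chromEnd, reported length, OR gene count), copied from the table
rows = [
    (1, 4345160, 5178488, 833329, 53),
    (2, 5269273, 5559687, 290415, 21),
    (9, 55427396, 56344568, 917173, 66),
]


def cluster_length(start, end):
    """Length of a cluster when both coordinates are inclusive."""
    return end - start + 1


def genes_per_mb(count, length):
    """Illustrative gene density: number of genes per megabase of cluster."""
    return count / (length / 1_000_000)


for cluster_id, start, end, reported, count in rows:
    computed = cluster_length(start, end)
    status = "matches" if computed == reported else "differs from"
    print(f"cluster {cluster_id}: computed length {computed} {status} the table; "
          f"{genes_per_mb(count, computed):.1f} OR genes per Mb")
```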
Figure 2. Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters than the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which are detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].