Julian L Huppert, Shankar Balasubramanian.
Abstract
Guanine-rich DNA sequences of a particular form have the ability to fold into four-stranded structures called G-quadruplexes. In this paper, we present a working rule to predict which primary sequences can form this structure, and describe a search algorithm to identify such sequences in genomic DNA. We count the number of quadruplexes found in the human genome and compare that with the figure predicted by modelling DNA as a Bernoulli stream or as a Markov chain, using windows of various sizes. We demonstrate that the distribution of loop lengths is significantly different from what would be expected in a random case, providing an indication of the number of potentially relevant quadruplex-forming sequences. In particular, we show that there is a significant repression of quadruplexes in the coding strand of exonic regions, which suggests that quadruplex-forming patterns are disfavoured in sequences that will form RNA.
Year: 2005 PMID: 15914667 PMCID: PMC1140081 DOI: 10.1093/nar/gki609
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Left: hydrogen bond pattern in a G-tetrad. A monovalent cation occupies the central position. Right: schematic diagram of a unimolecular G-quadruplex structure.
Figure 2. Process for generating Markov windowed simulates. A real chromosome (top) is separated into discrete windows. For each window, a table of base and diad frequencies is generated (middle) and used to generate a simulated window (bottom); the simulated windows are then joined to produce the replicate chromosome.
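The windowed procedure in Figure 2 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `simulate_window` and `simulate_chromosome`, the choice of the first base from the window's base frequencies, and the fallback used when a base has no observed successor are all assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def simulate_window(window, rng):
    """Return a simulated window of the same length, drawn from the
    base and diad (base-pair transition) frequencies of `window`."""
    if len(window) < 2:
        return window
    # Diad table: how often each base follows each previous base.
    transitions = defaultdict(Counter)
    for prev, cur in zip(window, window[1:]):
        transitions[prev][cur] += 1
    base_counts = Counter(window)
    bases = list(base_counts)
    # Assumption: the first base is drawn from the plain base frequencies.
    out = [rng.choices(bases, weights=[base_counts[b] for b in bases])[0]]
    while len(out) < len(window):
        # Assumption: fall back to base frequencies if the previous base
        # was never followed by anything inside this window.
        source = transitions.get(out[-1]) or base_counts
        options = list(source)
        out.append(rng.choices(options, weights=[source[o] for o in options])[0])
    return "".join(out)


def simulate_chromosome(seq, window_size=75, seed=0):
    """Split `seq` into discrete windows, simulate each, and join them."""
    rng = random.Random(seed)
    return "".join(
        simulate_window(seq[i:i + window_size], rng)
        for i in range(0, len(seq), window_size)
    )
```

Calling `simulate_chromosome(real_sequence, window_size=75)` would give one replicate; repeating with different seeds gives independent replicates.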
Number of X-patterns of the form $d(X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+})$, where X refers to the base being examined, for the whole human genome (NCBI build 34, accessed via ENSEMBL)
| Pattern | Count | Pattern | Count |
|---|---|---|---|
| G-patterns | 188 836 | A-patterns | 1 624 670 |
| C-patterns | 187 610 | T-patterns | 1 638 487 |
| Total GC-patterns | 376 446 | Total AT-patterns | 3 263 157 |
G-patterns have a physical reality, and C-patterns identify G-patterns in the complementary strand. A- and T-patterns have no known physical meaning.
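A pattern of this form (four runs of at least three X bases, separated by three loops of one to seven arbitrary bases) can be expressed as a regular expression. The sketch below is an illustration only: the function name `count_x_patterns` is hypothetical, and counting non-overlapping matches from left to right is an assumption, since the treatment of overlapping candidates is not described in this section.

```python
import re


def count_x_patterns(seq, base="G", min_run=3, loop_min=1, loop_max=7):
    """Count non-overlapping X-patterns for the given base in `seq`.

    Assumption: matches are counted left to right without overlap.
    """
    run = f"{base}{{{min_run},}}"                 # e.g. G{3,}
    loop = f"[ACGT]{{{loop_min},{loop_max}}}"     # e.g. [ACGT]{1,7}
    pattern = re.compile(f"{run}(?:{loop}{run}){{3}}")
    return sum(1 for _ in pattern.finditer(seq.upper()))
```

Searching one strand for both G-patterns and C-patterns, as in the table, covers both strands, because a C-pattern marks a G-pattern on the complementary strand.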
Diad analysis of every human chromosome
| Base | Previous base G | Deviation | Previous base C | Deviation | Previous base A | Deviation | Previous base T | Deviation | Total |
|---|---|---|---|---|---|---|---|---|---|
| G | 0.26 | +0.05 | 0.05 | −0.16 | 0.24 | +0.03 | 0.25 | +0.04 | 0.21 |
| C | 0.21 | — | 0.26 | +0.05 | 0.17 | −0.04 | 0.20 | — | 0.21 |
| A | 0.29 | — | 0.35 | +0.06 | 0.33 | +0.04 | 0.22 | −0.08 | 0.29 |
| T | 0.25 | −0.05 | 0.34 | +0.05 | 0.26 | −0.04 | 0.33 | +0.04 | 0.29 |
Each pair of columns gives the probability of each base following a given previous base, followed by its deviation from the probability expected if each base were independent (the Total column). For clarity, data resulting from the borders of unsequenced regions of DNA have been suppressed.
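A diad table of this kind can be computed directly from a sequence. The following is a minimal sketch under stated assumptions: the function name `diad_table` is hypothetical, and skipping any pair that contains a character other than A, C, G or T is used here as a simple stand-in for suppressing the borders of unsequenced regions.

```python
from collections import Counter


def diad_table(seq):
    """Map (base, previous_base) -> (P(base | previous_base), deviation),
    where deviation is the conditional probability minus the overall
    frequency of `base` among the counted pairs."""
    seq = seq.upper()
    # Assumption: pairs touching a non-ACGT character (e.g. N) are skipped.
    pairs = Counter(
        (prev, cur)
        for prev, cur in zip(seq, seq[1:])
        if prev in "ACGT" and cur in "ACGT"
    )
    total = sum(pairs.values())
    if total == 0:
        return {}
    prev_totals = Counter()
    cur_totals = Counter()
    for (prev, cur), n in pairs.items():
        prev_totals[prev] += n
        cur_totals[cur] += n
    table = {}
    for cur in "GCAT":
        overall = cur_totals[cur] / total
        for prev in "GCAT":
            if prev_totals[prev]:
                cond = pairs[(prev, cur)] / prev_totals[prev]
                table[(cur, prev)] = (cond, cond - overall)
    return table
```

A strongly negative deviation for the entry `("G", "C")` would correspond to the depletion of G after C seen in the table.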
Total number of GC- and AT-patterns found in the real human genome and simulates using various methods
| Method | GC-patterns | AT-patterns |
|---|---|---|
| Markov, size 50 | 687 k | 4.01 M |
| **Markov, size 75** | **514 k** | **3.26 M** |
| Markov, size 100 | 420 k | 2.81 M |
| Markov, size 150 | 320 k | 2.29 M |
| Markov, size 200 | 269 k | 2.02 M |
| Markov, size 400 | 185 k | 1.56 M |
| Markov, size 1000 | 123 k | 1.20 M |
| Markov, size 2000 | 93 k | 1.02 M |
| Markov, size 4000 | 75 k | 0.89 M |
| Bernoulli | 8 k | 0.30 M |
| Real human genome | 376 k | 3.26 M |
In the window methods, simulates were generated conserving diad base frequencies in windows of the size shown. Five independent analyses were performed, and the SD was in all cases <1%. The ‘Bernoulli’ method treats DNA as a stream of independent bases, with base frequencies homogeneous across each chromosome. The Markov model that correctly predicted the number of AT-patterns (window size 75 bp) is shown in boldface.
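For comparison with the windowed Markov simulates, the Bernoulli model can be sketched as independent draws from chromosome-wide base frequencies. This is an illustrative sketch: the function name `bernoulli_simulate` is hypothetical, and ignoring non-ACGT characters (so that the simulate contains only the sequenced bases) is an assumption.

```python
import random
from collections import Counter


def bernoulli_simulate(seq, seed=0):
    """Return a stream of independent bases whose frequencies match
    those of `seq` taken as a whole.

    Assumption: characters other than A, C, G, T are ignored, so the
    output length equals the number of A/C/G/T bases in `seq`.
    """
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    if not counts:
        return ""
    rng = random.Random(seed)
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=sum(weights)))
```

Because each base is drawn independently, runs such as GGG arise only by chance, which is consistent with the much lower pattern counts reported for this model.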
Frequencies of X-patterns of the form $d(X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+})$ for X = G, C, A, T under various conditions, normalized such that the frequency of T-patterns in each column is 1
| X | Actually observed | Bernoulli | Markov 50 bp | Markov 75 bp | Markov 100 bp | Markov 150 bp | Markov 200 bp | Markov 400 bp | Markov 1000 bp | Markov 2000 bp | Markov 4000 bp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| G | 0.12 | 0.03 | 0.17 | 0.16 | 0.15 | 0.14 | 0.13 | 0.12 | 0.10 | 0.09 | 0.08 |
| C | 0.12 | 0.03 | 0.17 | 0.16 | 0.15 | 0.14 | 0.13 | 0.12 | 0.10 | 0.09 | 0.08 |
| A | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| T | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The actually observed data are taken from NCBI build 34 of the human genome. The Bernoulli and Markov models are described elsewhere in the text, and the number after the word ‘Markov’ is the window size used. Throughout these data, the pseudo-Chargaff rule G = C and A = T holds to within 1%. The data also show that the relative depletion of GC-patterns increases with increasing window size.
Frequencies of bases, diads and patterns for each base in the exonic regions
| Base | Frequency | Diad repeat frequency | Observed pattern frequency | Markov predicted pattern frequency | Observed/Markov predicted frequency ratio |
|---|---|---|---|---|---|
| G | 0.25 | 0.27 | 0.48 | 0.91 | 0.53 |
| C | 0.25 | 0.29 | 0.83 | 1.10 | 0.75 |
| A | 0.26 | 0.29 | 0.93 | 1.21 | 0.77 |
| T | 0.24 | 0.27 | 1 | 1 | 1 |
Frequency lists the frequency of each base in the relevant region. Diad repeat frequency refers to the chance that after a given base, the same base will be repeated. The observed and predicted pattern frequencies refer to the relative frequencies of patterns of the form $d(X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+}N_{1-7}X_{3+})$ for X = G, C, A, T, normalized to 1 for the frequency of T-patterns, either in the actual human genome or in a simulate using a Markov model with a window size of 75 bp. The data show that G-patterns are dramatically underrepresented, with weaker effects on C-patterns and on A-patterns.
Figure 3. Left: frequency distributions of loops of lengths 1–7 bases for the entire human genome. Right: percentage excesses of loop 2 counts over the averages of loops 1 and 3 for the entire human genome.
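One plausible reading of the quantity in the right-hand panel is, for a given loop length, the loop 2 count expressed as a percentage above the mean of the loop 1 and loop 3 counts at that length. The helper below sketches that reading; the function name `loop2_excess_percent` and the exact formula are assumptions, not taken from the paper.

```python
def loop2_excess_percent(loop1_count, loop2_count, loop3_count):
    """Percentage by which the loop 2 count exceeds the average of the
    loop 1 and loop 3 counts (assumed definition)."""
    baseline = (loop1_count + loop3_count) / 2
    return 100.0 * (loop2_count - baseline) / baseline
```

With hypothetical counts of 100, 120 and 100 for loops 1, 2 and 3 at some length, the excess under this definition is 20%.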
Figure 4. Mosaic plot representing the loop lengths of all putative quadruplexes found in the human genome. The seven principal columns represent the lengths of the first loop, the seven rows the lengths of the second loop, and the seven segments in each box the lengths of the third loop. The area of each box is proportional to the number of sequences found with that combination of loop lengths. The plot was produced in R using the command mosaicplot.
The 20 most common and 20 least common sets of observed PQS loop lengths
| Most common: Loop 1 | Loop 2 | Loop 3 | Number | Least common: Loop 1 | Loop 2 | Loop 3 | Number |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 47 475 | 6 | 5 | 7 | 441 |
| 1 | 4 | 1 | 11 328 | 7 | 6 | 5 | 441 |
| 1 | 2 | 1 | 10 656 | 7 | 6 | 3 | 447 |
| 1 | 1 | 2 | 10 415 | 5 | 6 | 7 | 447 |
| 2 | 1 | 1 | 10 040 | 6 | 6 | 7 | 449 |
| 2 | 2 | 2 | 9411 | 6 | 7 | 7 | 450 |
| 1 | 3 | 1 | 9127 | 7 | 5 | 6 | 452 |
| 1 | 5 | 1 | 7799 | 5 | 7 | 6 | 484 |
| 5 | 1 | 1 | 7379 | 5 | 5 | 7 | 501 |
| 1 | 1 | 5 | 7337 | 5 | 6 | 3 | 505 |
| 3 | 3 | 3 | 6827 | 7 | 7 | 6 | 505 |
| 3 | 1 | 1 | 6458 | 6 | 6 | 5 | 506 |
| 1 | 1 | 3 | 6403 | 5 | 6 | 6 | 511 |
| 1 | 1 | 4 | 6196 | 3 | 6 | 7 | 521 |
| 4 | 1 | 1 | 6189 | 7 | 6 | 6 | 523 |
| 2 | 2 | 1 | 5123 | 6 | 7 | 3 | 525 |
| 1 | 2 | 2 | 5046 | 7 | 7 | 4 | 528 |
| 2 | 1 | 2 | 4780 | 3 | 7 | 6 | 533 |
| 1 | 6 | 1 | 4556 | 5 | 7 | 7 | 536 |
| 6 | 1 | 1 | 4462 | 6 | 7 | 5 | 538 |
Loops are numbered from 5′ to 3′ of the G-rich strand.
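A tally of loop-length combinations like the one in this table can be sketched by capturing the three loops of each match, read 5′ to 3′ along the G-rich strand. This is an illustration under stated assumptions: the names `PQS` and `loop_length_counts` are hypothetical, and using lazy loop quantifiers (so that each loop is as short as possible) is one arbitrary way to resolve sequences whose loop boundaries are ambiguous.

```python
import re
from collections import Counter

# Four G-runs of 3+ bases separated by three captured loops of 1-7 bases.
# Assumption: lazy quantifiers assign the shortest possible loops.
PQS = re.compile(
    r"G{3,}([ACGT]{1,7}?)G{3,}([ACGT]{1,7}?)G{3,}([ACGT]{1,7}?)G{3,}"
)


def loop_length_counts(seq):
    """Count (loop1, loop2, loop3) length triples over non-overlapping
    matches on the given strand."""
    counts = Counter()
    for match in PQS.finditer(seq.upper()):
        counts[tuple(len(loop) for loop in match.groups())] += 1
    return counts
```

Calling `loop_length_counts(seq).most_common(20)` would list the twenty most frequent combinations for a sequence `seq`.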