Alan K Todd, Matthew Johnston, Stephen Neidle.
Abstract
We report here the results of a systematic search for the existence and prevalence of potential intramolecular G-quadruplex-forming sequences in the human genome. We have also examined the tendency for particular sequences of 'loop' regions to occur in particular positions with respect to the G-tracts in a quadruplex. Using arithmetic ratio and probability techniques, we have discovered frequent and systematic occurrence of certain sequence types, the most prominent being a potential quadruplex containing CCTGT in the first 'loop' position. Being able to highlight types of potential quadruplex sequences in G-rich regions is an important step in searching for biologically relevant sequences and finding their function.
Year: 2005 PMID: 15914666 PMCID: PMC1140077 DOI: 10.1093/nar/gki553
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Number of quadruplex sequences occurring in human genomic DNA
| Dataset | Number of quadruplexes | Number of unique quadruplexes | Number of unique loop sequences (number observed/number possible) |
|---|---|---|---|
| Un-restricted dataset | 5 713 900 | 3 166 800 | 20 492/21 844 |
| Arbitrary dataset | 375 157 | 226 157 | 10 551/12 289 |
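
The "number possible" figure for the un-restricted dataset, 21 844, is what one gets by counting every A/C/G/T string of length 1 to 7. The sketch below checks that arithmetic. The loop-length range of 1 to 7 is inferred from that total and from the loop lengths seen in the tables, not quoted from this excerpt, and the function name is illustrative. The denominator given for the arbitrary dataset (12 289) is not reproduced by this count; the rules behind it are not stated here.

```python
def count_possible_loops(min_len: int = 1, max_len: int = 7, alphabet_size: int = 4) -> int:
    """Count all distinct strings over the alphabet with length in [min_len, max_len]."""
    return sum(alphabet_size ** k for k in range(min_len, max_len + 1))


possible = count_possible_loops()   # 4 + 16 + 64 + ... + 16384
observed = 20492                    # un-restricted dataset, from the table above

print(possible)                     # 21844
print(f"{observed / possible:.1%} of the possible loop sequences were observed")
```
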
Figure 1. Ways in which quadruplex-fold ambiguity can occur. (a) Shaded regions represent the guanines contributing to the G-quartets and unshaded regions the loops. Regions of high guanine density tend to produce more quadruplex hits, which in some cases leads to many hits for a single region of DNA. (b) Overlapping quadruplexes. In the first (un-restricted) dataset, the sequence shown would produce five possible quadruplex folds, whereas in the second (arbitrary) dataset it would be counted as only two distinct quadruplexes.
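
The difference between counting every possible fold and counting only non-overlapping hits can be illustrated with a small enumeration. This is a sketch under stated assumptions, not the search procedure used for either dataset: the G-tract length is fixed at 3, loops are taken to be 1 to 7 bases long, overlaps are resolved by a simple greedy left-to-right pass, and the function names and the toy sequence are invented for illustration.

```python
from itertools import product


def enumerate_folds(seq, tract=3, min_loop=1, max_loop=7):
    """List (start, end, loops) for every placement of four G-tracts in seq."""
    run = "G" * tract
    folds = []
    for start in range(len(seq)):
        if not seq.startswith(run, start):
            continue
        for lengths in product(range(min_loop, max_loop + 1), repeat=3):
            pos = start + tract              # first base after the current G-tract
            loops = []
            for length in lengths:
                if not seq.startswith(run, pos + length):
                    break                    # no G-tract closes this loop
                loops.append(seq[pos:pos + length])
                pos += length + tract
            else:
                folds.append((start, pos, tuple(loops)))
    return folds


def pick_non_overlapping(folds):
    """Greedy left-to-right selection of folds that share no bases."""
    chosen, last_end = [], 0
    for fold in sorted(folds):
        if fold[0] >= last_end:
            chosen.append(fold)
            last_end = fold[1]
    return chosen


seq = "GGGTGGGTGGGTGGGTGGG"   # toy input: five G-tracts separated by single T bases
folds = enumerate_folds(seq)
print(len(folds))                        # 5 candidate folds
print(len(pick_non_overlapping(folds)))  # 1 fold once overlaps are excluded
```

A single G-rich stretch therefore contributes several hits when every fold is counted and far fewer when overlapping hits are collapsed, which is the ambiguity the figure describes.
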
Figure 2. Consensus sequences for (a) CCTGTCA and (b) CCTGTT sequence types. Diagrams were generated with the program MakeLogo (33). A total of 1956 sequences were used to find the consensus sequence for the CCTGTCA type and 2361 sequences for the CCTGTT type. The height of each letter is proportional to the number of times each base appeared in that position.
Table 2. Top 20 loop sequences by ratio of maximum to minimum population across loop positions, for the arbitrary dataset
| Rank | Sequence | Ratio of max and min populations | Population in loop a | Population in loop b | Population in loop c |
|---|---|---|---|---|---|
| 1 | CCTGTCA | 309 | 1239 | 5 | 4 |
| 2 | CCTGTT | 140 | 1266 | 18 | 9 |
| 3 | CCTGTC | 139 | 836 | 8 | 6 |
| 4 | CCTGTTA | 90 | 90 | 1 | 0 |
| 5 | ATCTCCA | 74 | 1 | 5 | 74 |
| 6 | TGGTCTT | 58 | 3 | 1 | 58 |
| 7 | CCTATCA | 53 | 53 | 1 | 0 |
| 8 | TCTGTCA | 51 | 51 | 3 | 0 |
| 9 | TAGCACA | 42 | 0 | 5 | 42 |
| 10 | CCTATC | 38 | 38 | 1 | 1 |
| 11 | CCTATT | 37 | 75 | 4 | 2 |
| 12 | CCTTTCA | 37 | 37 | 1 | 0 |
| 13 | CTTGGC | 36 | 13 | 16 | 471 |
| 14 | TAGCATT | 34 | 1 | 0 | 34 |
| 15 | CCTGTCC | 30 | 30 | 0 | 3 |
| 16 | CCTGTTT | 29 | 58 | 4 | 2 |
| 17 | CTTGTCA | 29 | 58 | 4 | 2 |
| 18 | CCTGTGA | 28 | 28 | 8 | 1 |
| 19 | CCAGTC | 28 | 28 | 1 | 3 |
| 20 | ACCTGTC | 27 | 27 | 1 | 2 |
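
The ratio column in the table above is consistent with dividing the largest loop-position count by the smallest and rounding down, with a minimum of zero treated as one. That rule is inferred from the tabulated rows (for example CCTGTTA, with counts 90, 1 and 0 and a ratio of 90), not from a formula stated in this excerpt. A minimal sketch with an illustrative function name:

```python
def max_min_ratio(loop_a: int, loop_b: int, loop_c: int) -> int:
    """Largest loop-position count divided by the smallest, rounded down.

    A minimum of zero is clamped to one (an inference from the table rows).
    """
    counts = (loop_a, loop_b, loop_c)
    return max(counts) // max(min(counts), 1)


rows = {
    "CCTGTCA": (1239, 5, 4),
    "CCTGTT": (1266, 18, 9),
    "CCTGTTA": (90, 1, 0),
    "CTTGGC": (13, 16, 471),
}
for sequence, counts in rows.items():
    print(sequence, max_min_ratio(*counts))   # 309, 140, 90, 36
```
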
Table 3. Top 20 loop sequences by probability, for the arbitrary dataset. Sequences 5, 6, 7 and 10 also feature in Table 2.
| Rank | Sequence | −Log probability | Population in loop a | Population in loop b | Population in loop c |
|---|---|---|---|---|---|
| 1 | T | 3277 | 53 234 | 37 657 | 30 515 |
| 2 | A | 2873 | 51 361 | 63 872 | 78 523 |
| 3 | AGGT | 2413 | 1516 | 6448 | 1470 |
| 4 | G | 1319 | 7183 | 8375 | 14 065 |
| 5 | CCTGTCA | 1313 | 1239 | 5 | 4 |
| 6 | CCTGTT | 1275 | 1266 | 18 | 9 |
| 7 | CCTGTC | 855 | 836 | 8 | 6 |
| 8 | TT | 494 | 7437 | 5530 | 4122 |
| 9 | TC | 458 | 3181 | 1774 | 1283 |
| 10 | CTTGGC | 421 | 13 | 16 | 471 |
| 11 | TCTGA | 412 | 737 | 115 | 85 |
| 12 | AGGA | 405 | 1932 | 3559 | 3972 |
| 13 | CTA | 316 | 769 | 2068 | 1481 |
| 14 | AGT | 295 | 2767 | 4447 | 2682 |
| 15 | TGGA | 287 | 2573 | 1379 | 1282 |
| 16 | CAA | 175 | 1035 | 1928 | 1876 |
| 17 | ACTCA | 175 | 428 | 79 | 108 |
| 18 | AGC | 173 | 973 | 1826 | 1042 |
| 19 | ACTT | 173 | 674 | 225 | 223 |
| 20 | AAAT | 152 | 324 | 299 | 781 |
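
This excerpt does not say how the "−Log probability" column was computed. One plausible way to score positional bias is the negative natural log of the multinomial probability of the three observed counts, under the null hypothesis that a loop sequence is equally likely to fall in position a, b or c. The sketch below implements that score. The equal-odds null, the log base, and the function name are all assumptions, so it should be read as an illustration of the idea and not as the statistic behind the table. For CCTGTCA it gives about 1315, close to but not equal to the tabulated 1313.

```python
from math import lgamma, log


def neg_log_multinomial(counts, probs=None):
    """Return -ln of the multinomial point probability of `counts`.

    `probs` defaults to equal odds for every loop position (an assumption).
    """
    n = sum(counts)
    if probs is None:
        probs = [1.0 / len(counts)] * len(counts)
    # ln of n! / (c1! * c2! * c3!), computed via log-gamma to avoid overflow
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_p = log_coeff + sum(c * log(p) for c, p in zip(counts, probs) if c > 0)
    return -log_p


print(round(neg_log_multinomial((1239, 5, 4))))      # strongly biased towards loop a
print(round(neg_log_multinomial((416, 416, 416))))   # evenly spread, much lower score
```
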
Table 4. Top 20 loop sequences by ratio of maximum to minimum population across loop positions, for the un-restricted dataset
| Rank | Sequence | Ratio of max and min populations | Population in loop a | Population in loop b | Population in loop c |
|---|---|---|---|---|---|
| 1 | TAGCATT | 1058 | 1 | 0 | 1058 |
| 2 | CCTGTTG | 990 | 10 897 | 79 | 11 |
| 3 | CCTGTCG | 949 | 7592 | 40 | 8 |
| 4 | CCTGTCA | 714 | 12 138 | 39 | 17 |
| 5 | CCTATCA | 467 | 467 | 2 | 0 |
| 6 | CCTATCG | 352 | 352 | 2 | 0 |
| 7 | GCCTATT | 336 | 336 | 1 | 3 |
| 8 | CCTGTT | 332 | 12 308 | 113 | 37 |
| 9 | CCTTTCA | 310 | 310 | 4 | 1 |
| 10 | GCCTGTT | 303 | 6373 | 61 | 21 |
| 11 | TCTGTCG | 287 | 287 | 0 | 3 |
| 12 | CCTATTG | 268 | 537 | 10 | 2 |
| 13 | CCTGTC | 267 | 8553 | 104 | 32 |
| 14 | CCTGTTA | 221 | 885 | 4 | 5 |
| 15 | CCTATC | 203 | 407 | 3 | 2 |
| 16 | GCCTATC | 203 | 203 | 2 | 1 |
| 17 | GACTCAA | 190 | 190 | 7 | 1 |
| 18 | GCCTGTC | 179 | 4679 | 89 | 26 |
| 19 | ACTGTCA | 173 | 173 | 0 | 10 |
| 20 | CCAGTTG | 165 | 165 | 0 | 2 |
Table 5. Top 20 loop sequences by probability, for the un-restricted dataset. Sequences 11, 12, 14 and 19 also feature in Table 4.
| Rank | Sequence | −Log probability | Population in loop a | Population in loop b | Population in loop c |
|---|---|---|---|---|---|
| 1 | GA | 63 611 | 117 903 | 163 870 | 340 624 |
| 2 | GGA | 51 459 | 34 048 | 67 293 | 165 738 |
| 3 | GGGA | 48 837 | 9892 | 31 567 | 102 345 |
| 4 | A | 38 358 | 273 627 | 300 495 | 492 842 |
| 5 | GTGGG | 25 655 | 55 719 | 11 578 | 8617 |
| 6 | TGGG | 24 126 | 62 418 | 15 377 | 12 614 |
| 7 | TGG | 22 161 | 101 363 | 41 802 | 32 928 |
| 8 | GTGG | 22 104 | 82 252 | 30 832 | 22 040 |
| 9 | TG | 17 189 | 153 386 | 86 943 | 72 114 |
| 10 | GTG | 16 479 | 114 504 | 61 257 | 46 660 |
| 11 | CCTGTCA | 13 009 | 12 138 | 39 | 17 |
| 12 | CCTGTT | 12 795 | 12 308 | 113 | 37 |
| 13 | GT | 12 062 | 143 082 | 88 734 | 75 429 |
| 14 | CCTGTTG | 11 518 | 10 897 | 79 | 11 |
| 15 | T | 10 975 | 220 140 | 154 559 | 136 752 |
| 16 | GGAGGG | 10 793 | 4373 | 25 445 | 5925 |
| 17 | GGGAGGG | 9174 | 2588 | 19 101 | 4022 |
| 18 | GGGAGG | 8778 | 4447 | 22 995 | 6125 |
| 19 | CCTGTC | 8775 | 8553 | 104 | 32 |
| 20 | GAGGG | 8557 | 8102 | 29 589 | 9460 |
Table 6. Most popular loop sequences for the arbitrary dataset
| Rank | Sequence | Population | Population in loop a | Population in loop b | Population in loop c |
|---|---|---|---|---|---|
| 1 | A | 193 756 | 51 361 | 63 872 | 78 523 |
| 2 | T | 121 406 | 53 234 | 37 657 | 30 515 |
| 3 | C | 44 020 | 14 983 | 14 907 | 14 130 |
| 4 | AA | 40 026 | 12 778 | 13 717 | 13 531 |
| 5 | CT | 32 472 | 11 637 | 10 554 | 10 281 |
| 6 | CA | 32 070 | 10 781 | 10 846 | 10 443 |
| 7 | G | 29 623 | 7183 | 8375 | 14 065 |
| 8 | AT | 19 957 | 6789 | 7242 | 5926 |
| 9 | AGA | 19 144 | 5377 | 6919 | 6848 |
| 10 | TT | 17 089 | 7437 | 5530 | 4122 |
| 11 | TA | 12 641 | 4744 | 4329 | 3568 |
| 12 | CC | 10 955 | 3646 | 3726 | 3583 |
| 13 | AGT | 9896 | 2767 | 4447 | 2682 |
| 14 | AGGA | 9463 | 1932 | 3559 | 3972 |
| 15 | AGGT | 9434 | 1516 | 6448 | 1470 |
| 16 | TGA | 9237 | 3006 | 2849 | 3382 |
| 17 | AAA | 7839 | 2393 | 2970 | 2476 |
| 18 | CCT | 7151 | 2540 | 2298 | 2313 |
| 19 | TGT | 6619 | 2530 | 2307 | 1782 |
| 20 | CCA | 6269 | 2105 | 2048 | 2116 |
Table 7. Quadruplex sequences containing CCTGT in the first loop
| Rank | Loop a | Loop b | Loop c | Length of G-run | Population |
|---|---|---|---|---|---|
| 1 | CCTGTCA | T | CTA | 3 | 39 |
| 2 | CCTGTT | T | CTA | 3 | 38 |
| 3 | CCTGTCA | T | CTA | 4 | 37 |
| 4 | CCTGTC | T | CTA | 3 | 35 |
| 5 | CCTGTCA | T | CT | 3 | 23 |
| 6 | CCTGTCA | T | CT | 4 | 22 |
| 7 | CCTGTCA | T | CAA | 3 | 21 |
| 8 | CCTGTC | T | CTA | 4 | 21 |
| 9 | CCTGTT | T | CAA | 3 | 20 |
| 10 | CCTGTT | T | CTA | 4 | 18 |
| 11 | CCTGTT | T | A | 3 | 18 |
| 12 | CCTGTC | T | CT | 3 | 18 |
| 13 | CCTGTCA | T | CAA | 4 | 16 |
| 14 | CCTGTC | T | CAA | 3 | 16 |
| 15 | CCTGTT | T | CT | 4 | 15 |
| 16 | CCTGTT | TT | CTA | 3 | 15 |
| 17 | CCTGTCA | TT | CTA | 3 | 13 |
| 18 | CCTGTC | TT | CAA | 3 | 12 |
| 19 | CCTGTT | A | T | 3 | 12 |
| 20 | CCTGTT | T | CT | 3 | 12 |
| 21 | CCTGTT | AT | CAA | 3 | 11 |
| 22 | CCTGTC | T | CT | 4 | 11 |
| 23 | CCTGTCA | TT | CTA | 4 | 11 |
| 24 | CCTGTCA | AT | CTA | 3 | 10 |
| 25 | CCTGTT | TT | CT | 3 | 10 |
| 26 | CCTGT | T | T | 3 | 10 |
| 27 | CCTGTCA | TGA | CTA | 4 | 10 |
| 28 | CCTGTT | T | T | 3 | 10 |
| 29 | CCTGTC | T | CAA | 4 | 10 |
| 30 | CCTGTCA | T | AGGCAA | 3 | 9 |
| 31 | CCTGTT | AT | CTA | 3 | 9 |
| 32 | CCTGTT | T | TGA | 3 | 9 |
| 33 | CCTGTCA | TGGA | CTA | 3 | 9 |
| 34 | CCTGTCA | TT | CAA | 3 | 9 |
| 35 | CCTGTT | G | T | 3 | 9 |
| 36 | CCTGTCA | T | ACTA | 4 | 9 |
| 37 | CCTGTT | T | CAA | 4 | 9 |
| 38 | CCTGTT | AGT | CTA | 3 | 8 |
| 39 | CCTGTC | AT | CAA | 3 | 8 |
| 40 | CCTGTCA | T | ACTA | 3 | 8 |
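
Each row of the table above can be read as a full sequence by interleaving four G-runs with the three loops. The sketch below does that for the top row. It assumes all four G-runs in a hit share the single length given in the "Length of G-run" column, which this excerpt does not state explicitly; the function name is illustrative.

```python
def build_quadruplex(loop_a: str, loop_b: str, loop_c: str, g_run: int) -> str:
    """Join four equal-length G-runs with the three loops, 5' to 3'."""
    tract = "G" * g_run
    return tract + loop_a + tract + loop_b + tract + loop_c + tract


# Row 1: loops CCTGTCA / T / CTA with G-runs of length 3
print(build_quadruplex("CCTGTCA", "T", "CTA", 3))   # GGGCCTGTCAGGGTGGGCTAGGG
```
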
Table 8. Sequence distribution by DNA function for the arbitrary dataset
| DNA region | All quadruplexes | CCTGTT quadruplexes | CCTGTCA quadruplexes |
|---|---|---|---|
| Intergenic regions | 223 321 (60%) | 1193 (76%) | 1490 (77%) |
| Within genes (plus strand) | 75 189 (20%) | 170 (11%) | 162 (8%) |
| Within genes (minus strand) | 76 647 (20%) | 212 (13%) | 290 (15%) |
| Of which within exons | 14 009 | 1 | 2 |
The numbers represent the number of quadruplex sequences occurring within the given type of DNA. Number totally within exons: 12 393.
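
The percentages in the table above follow from the intergenic, plus-strand and minus-strand counts in each column; for "All quadruplexes" those three counts sum to the 375 157 quadruplexes of the arbitrary dataset. The exon row is a subset of the within-gene counts and is left out of the totals. A minimal sketch, with illustrative dictionary keys and function name:

```python
def region_shares(counts: dict) -> dict:
    """Percentage of each region's count relative to the column total, rounded."""
    total = sum(counts.values())
    return {region: round(100 * n / total) for region, n in counts.items()}


all_quadruplexes = {"intergenic": 223321, "genes_plus": 75189, "genes_minus": 76647}
cctgtt = {"intergenic": 1193, "genes_plus": 170, "genes_minus": 212}
cctgtca = {"intergenic": 1490, "genes_plus": 162, "genes_minus": 290}

print(sum(all_quadruplexes.values()))   # 375157
print(region_shares(all_quadruplexes))  # 60 / 20 / 20
print(region_shares(cctgtt))            # 76 / 11 / 13
print(region_shares(cctgtca))           # 77 / 8 / 15
```
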