Regina Z Cer, Kevin H Bruce, Uma S Mudunuri, Ming Yi, Natalia Volfovsky, Brian T Luke, Albino Bacolla, Jack R Collins, Robert M Stephens.
Abstract
Although the capability of DNA to form a variety of non-canonical (non-B) structures has long been recognized, the overall significance of these alternate conformations in biology has only recently become accepted en masse. In order to provide access to genome-wide locations of these classes of predicted structures, we have developed non-B DB, a database integrating annotations and analysis of non-B DNA-forming sequence motifs. The database provides the most complete list of alternative DNA structure predictions available, including Z-DNA motifs, quadruplex-forming motifs, inverted repeats, mirror repeats and direct repeats and their associated subsets of cruciforms, triplex and slipped structures, respectively. The database also contains motifs predicted to form static DNA bends, short tandem repeats and homo(purine•pyrimidine) tracts that have been associated with disease. The database has been built using the latest releases of the human, chimp, dog, macaque and mouse genomes, so that the results can be compared directly with other data sources. In order to make the data interpretable in a genomic context, features such as genes, single-nucleotide polymorphisms and repetitive elements (SINE, LINE, etc.) have also been incorporated. The database is accessed through query pages that produce results with links to the UCSC browser and a GBrowse-based genomic viewer. It is freely accessible at http://nonb.abcc.ncifcrf.gov.
Year: 2010 PMID: 21097885 PMCID: PMC3013731 DOI: 10.1093/nar/gkq1170
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Criteria for predicting non-B DNA-forming motifs in non-B DB
| DNA feature | Search criteria | Subset of ‘DNA feature’ forming non-B DNA | Search criteria for ‘Subset of DNA feature’ |
|---|---|---|---|
| Inverted repeat | Repeat: 10–100 nt; Spacer: 0–100 nt | Cruciform motif | Repeat: 10–100 nt; Spacer: 0–3 nt |
| Mirror repeat | Repeat: 10–100 nt; Spacer: 0–100 nt | Triplex motif | Repeat: 10–100 R or Y nt; Spacer: 0–8 nt |
| Direct repeat | Repeat: 10–50 nt; Spacer: 0–5 nt | Slipped motif | Repeat: 10–50 nt; Spacer: 0 nt |
| Z-DNA repeat | ≥5 units of CG/TG or CG/CA repeats | Whole set | As per the whole set |
| G-quadruplex forming repeat | Four identical blocks of ( | Whole set | As per the whole set |
| A-phased repeat | ≥3 runs of A-tracts with 10-bp phasing | Whole set | As per the whole set |
Inverted repeat: a pair of DNA sequences, each 10–100 nt in length and separated by a spacer of 0–100 nt, whose sequence composition on the same strand of DNA is such that the bases of the first repeat, when read in the 5′→3′ orientation, are complementary to those of the second repeat read in the 3′→5′ orientation. The term ‘complementary’ refers to the Watson–Crick hydrogen bonding scheme, whereby A only pairs with T and C only pairs with G. Only perfect inverted repeats that conform to this Watson–Crick pairing scheme are considered.
Cruciform motif: the subset of inverted repeat sequences in which the ‘Spacer’ comprises 0–3 bases; due to their proximity, the two repeats of such a sequence may fold back and form intramolecular, antiparallel double helices stabilized by Watson–Crick hydrogen bonds, i.e. a cruciform structure (1,34).
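The inverted repeat and cruciform criteria above can be illustrated with a minimal Python sketch. It is not the search code used to build non-B DB; the function names are hypothetical, and the input is assumed to be already split into first repeat, spacer and second repeat.

```python
# Watson-Crick complement table (A<->T, C<->G), as in the definition above.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def classify_inverted_repeat(arm1, spacer, arm2):
    """Classify a (repeat, spacer, repeat) triple using the Table 1 criteria.

    Returns "cruciform", "inverted_repeat" or None. Only perfect repeats
    are accepted. Hypothetical helper written for illustration.
    """
    if not 10 <= len(arm1) <= 100 or len(spacer) > 100:
        return None
    # The second repeat must be the exact reverse complement of the first.
    if arm2.upper() != reverse_complement(arm1):
        return None
    return "cruciform" if len(spacer) <= 3 else "inverted_repeat"
```

For example, `classify_inverted_repeat("ACGTTGCAAG", "TT", "CTTGCAACGT")` falls in the cruciform subset because the spacer is only 2 nt long, whereas the same repeats separated by a 4-nt spacer count only as an inverted repeat.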
Mirror repeat: a pair of DNA sequences, each 10–100 nt in length and separated by a spacer of 0–100 nt, whose sequence composition on the same strand of DNA is such that the bases of the first repeat, when read in the 5′→3′ orientation, are identical to those of the second repeat read in the 3′→5′ orientation (palindrome); only perfectly matching repeats are included.
Triplex motif: the subset of mirror repeat sequences comprising only purines (R = A and G) [or pyrimidines (Y = C and T)] on the same strand of DNA, and which are separated by few (0–8) nt (‘Spacer’). These motifs are able to form various intramolecular three-stranded (triplex, H-DNA) isoforms stabilized by Hoogsteen hydrogen bonds (1,52,53). Only R•Y-containing mirror repeats that may yield A:A•T and G:G•C base triplets (colon indicates Hoogsteen hydrogen bonded bases; dot indicates Watson–Crick hydrogen bonded bases) for the R:R•Y type of intramolecular triplexes and T:A•T and C+:G•C triplets for the Y:R•Y type of intramolecular triplexes are included since these are considered the most stable triplet combinations.
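A similar sketch can express the mirror repeat and triplex criteria. The function names are hypothetical. Checking the purine or pyrimidine composition on the repeats only (not on the spacer) is an assumption of this example, as the definition above does not spell out how spacer bases are treated.

```python
PURINES = set("AG")      # R
PYRIMIDINES = set("CT")  # Y


def is_mirror_repeat(arm1, arm2):
    """True if arm2 is arm1 read backwards (perfect mirror repeat, 10-100 nt)."""
    return 10 <= len(arm1) <= 100 and arm2.upper() == arm1.upper()[::-1]


def is_triplex_candidate(arm1, spacer, arm2):
    """True if a mirror repeat is all-purine or all-pyrimidine with a 0-8 nt spacer.

    Assumption: composition is checked on the repeats only, not on the spacer.
    """
    if not is_mirror_repeat(arm1, arm2) or len(spacer) > 8:
        return False
    bases = set(arm1.upper())
    return bases <= PURINES or bases <= PYRIMIDINES
```

Under these assumptions, `is_triplex_candidate("AGGAGAAGGA", "TT", "AGGAAGAGGA")` is true, while a mirror repeat of mixed composition such as `"ACGTACGTAC"` / `"CATGCATGCA"` is a mirror repeat but not a triplex candidate.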
Direct repeat: two tracts of DNA, each comprising 10–50 nt and separated by 0–5 nt, having the same sequence composition.
Slipped motif: the subset of direct repeat sequences without a spacer (tandem repeats); when aligned in an out-of-register fashion, tandem repeats may give rise to single-stranded loops and/or hairpins (1).
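The slipped motif criterion (a direct repeat with no spacer) lends itself to a brute-force scan. The sketch below is a naive illustration with a hypothetical function name, not the algorithm used for the database; it reports every position at which a unit of 10–50 nt is immediately followed by an identical copy.

```python
def find_slipped_candidates(seq, min_len=10, max_len=50):
    """Return (start, unit_length) pairs where seq[start:start+k] is
    immediately followed by an identical copy (tandem repeat, spacer = 0).

    Naive scan over all unit lengths; adequate for short sequences only.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_len, max_len + 1):
        for i in range(len(seq) - 2 * k + 1):
            if seq[i:i + k] == seq[i + k:i + 2 * k]:
                hits.append((i, k))
    return hits
```

For instance, scanning `"TT" + "ACGGTCATGC" * 2 + "GG"` returns a single hit at position 2 with a unit length of 10.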
Z-DNA motif: five or more tandem repeats, each comprising an alternating pyrimidine–purine dinucleotide motif, in which the pattern YG is maintained on at least one of the DNA strands; examples include (CG•CG)6, (CA•TG)5 and [(TG)3(CG)4•(CG)4(CA)3]; these motifs may adopt the left-handed Z-DNA conformation (3,54).
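One way to read the Z-DNA criterion is as two regular expressions over the plus strand: units drawn from CG and TG (YG on the strand being read) or units drawn from CG and CA (YG on the complementary strand). The following sketch makes that assumption explicit; the pattern and function names are hypothetical.

```python
import re

# >= 5 tandem dinucleotide units keeping the YG pattern on the strand being read ...
YG_THIS_STRAND = re.compile(r"(?:CG|TG){5,}")
# ... or on the complementary strand (CA pairs with TG).
YG_OTHER_STRAND = re.compile(r"(?:CG|CA){5,}")


def find_zdna_candidates(seq):
    """Return sorted (start, end) spans matching either pattern.

    Pure CG runs match both patterns; the set removes the duplicate span.
    """
    seq = seq.upper()
    hits = set()
    for pattern in (YG_THIS_STRAND, YG_OTHER_STRAND):
        for m in pattern.finditer(seq):
            hits.add((m.start(), m.end()))
    return sorted(hits)
```

The three examples given in the definition, (CG)6, (CA)5 and (TG)3(CG)4, are each reported as one span by this sketch, whereas a run of only four CG units is not.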
G-quadruplex-forming repeat: four blocks, each containing the same number (n) of G bases (n can vary from 3 to 7), on the plus or minus strand, separated by 1–7 nt; this type of DNA sequence may adopt quadruplex structures (2); overlapping tracts of four G-blocks are also considered.
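The G-quadruplex criterion can be approximated with a back-referencing regular expression, sketched below. This is an illustration rather than the database's own search code. It assumes that loops may contain any base (including G), that C-blocks on the plus strand stand for G-blocks on the minus strand, and it uses a lookahead so that overlapping tracts are reported; all names are hypothetical.

```python
import re


def _g4_pattern(base):
    # Group 2 captures the first block (3-7 identical bases); the back-reference
    # \2 forces the other three blocks to contain the same number of bases.
    block = "(" + base + "{3,7})"
    loop = "[ACGT]{1,7}"
    body = block + (loop + r"\2") * 3
    # The zero-width lookahead lets matches overlap; group 1 holds the motif.
    return re.compile("(?=(" + body + "))")


PATTERNS = {"+": _g4_pattern("G"), "-": _g4_pattern("C")}


def find_g4_candidates(seq):
    """Return sorted (start, end, strand) tuples, at most one per start position."""
    seq = seq.upper()
    hits = []
    for strand, pattern in PATTERNS.items():
        for m in pattern.finditer(seq):
            hits.append((m.start(1), m.end(1), strand))
    return sorted(hits)
```

With these assumptions, `find_g4_candidates("TTGGGAGGGTGGGAGGGTT")` reports one plus-strand candidate spanning positions 2–17, and the reverse-complemented sequence yields the same span on the minus strand.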
A-phased repeat: three runs of A bases (A-tracts) in phase with the helical pitch of the DNA double-helix, i.e. 10 bp; an A-tract is defined as a set of A•T base-pairs without a TpA step (47,55–57); three or more tracts of A3–7, T3–7, AAATTT, AAATTTT and AAAATTT (in any combination) on the plus or minus strand, whose centers are separated by 10 bases, are considered; since A-tracts induce static bends in the DNA double helix, the overall DNA superhelix is expected to display either a left-handed or a right-handed writhe (47,55–57). As mentioned, none of the search criteria used herein allow for interruptions in the repeats, and no thermodynamic information was factored into the algorithms used.
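The A-phased repeat criterion can be sketched as a two-step procedure: locate candidate tracts, then chain tracts whose centers lie exactly 10 bases apart. The code below is a rough illustration under several assumptions not stated in the source: tracts are found left to right without overlap, the mixed tracts (AAATTT, AAATTTT, AAAATTT) take precedence over plain A or T runs, and no tolerance on the 10-base spacing is allowed. The names are hypothetical.

```python
import re

# Tract shapes listed in the definition; mixed A/T tracts are tried first so
# that "AAATTT" is not split into separate A and T runs (sketch assumption).
TRACT = re.compile(r"AAAATTT|AAATTTT|AAATTT|A{3,7}|T{3,7}")


def find_a_phased_repeats(seq, phase=10, min_tracts=3):
    """Return (start, end, n_tracts) for chains of >= min_tracts tracts
    whose centers are exactly `phase` bases apart."""
    seq = seq.upper()
    # Key each tract by twice its center (start + end - 1) to stay in integers.
    tracts = {m.start() + m.end() - 1: (m.start(), m.end())
              for m in TRACT.finditer(seq)}
    hits = []
    for c2 in sorted(tracts):
        if c2 - 2 * phase in tracts:
            continue  # already part of a chain that started upstream
        chain = [tracts[c2]]
        nxt = c2 + 2 * phase
        while nxt in tracts:
            chain.append(tracts[nxt])
            nxt += 2 * phase
        if len(chain) >= min_tracts:
            hits.append((chain[0][0], chain[-1][1], len(chain)))
    return hits
```

For a toy sequence with three A5 tracts starting at positions 0, 10 and 20, the sketch returns one chain of three tracts; with only two such tracts it returns nothing.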
Figure 1. Non-B DB user interface and query results page. (A) Three query modes are available at the Genomic Database Search Tools page of the non-B DB web interface: Search by Feature (shown), Search by Feature Attributes, and Search by Location (a feature browser). In this example, all non-B DNA motifs were queried by the gene symbol MYC. (B) Query results and links to PolyBrowse, UCSC Genome Browser, bioDBnet (51), etc. Users may download the results in GFF format and tab-delimited format. Direct access to the sequence and other annotation information for each of the features is available by clicking on the ‘A’ link (red box).
Table 2. Statistics for DNA repeats and non-B DNA-forming motifs in non-B DB
| DNA feature | Human 37 | Mouse 37 | Dog 2 | Chimp 2 | Macaque 1 |
|---|---|---|---|---|---|
| G-quadruplex forming repeat | 374 545 | 559 280 | 492 535 | 314 171 | 298 142 |
| Inverted repeat | 1 044 533 | 801 242 | 814 080 | 998 249 | 843 889 |
| Cruciform motif | 197 910 | 188 532 | 172 032 | 190 736 | 128 334 |
| Direct repeat | 871 045 | 1 593 107 | 968 955 | 787 335 | 765 798 |
| Slipped motif | 347 969 | 695 150 | 404 750 | 314 516 | 305 285 |
| Mirror repeat | 1 651 723 | 3 431 486 | 1 829 867 | 1 485 135 | 1 455 025 |
| Triplex motif | 179 623 | 618 928 | 336 642 | 105 640 | 140 580 |
| Z-DNA repeat | 294 320 | 690 276 | 261 012 | 278 928 | 280 982 |
| A-phased repeat | 1 130 731 | 909 653 | 1 241 082 | 1 085 591 | 1 098 030 |
For the current releases of the five mammalian genomes indicated, the motif searches were performed and the number of features for each class was counted. As defined in Table 1, the cruciform motifs represent a subset of the inverted repeat class, the slipped motifs represent a subset of the direct repeat class and the triplex motifs represent a subset of the mirror repeat class.
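Because each motif class is a subset of its parent repeat class, the counts in the table can be turned into subset fractions. The short sketch below does this for two of the human (build 37) rows; the variable and function names are hypothetical, and the counts are copied from the table above.

```python
# Human (build 37) counts taken from the statistics table above.
HUMAN_COUNTS = {
    "Inverted repeat": 1_044_533,
    "Cruciform motif": 197_910,
    "Direct repeat": 871_045,
    "Slipped motif": 347_969,
}


def subset_fraction(counts, subset, parent):
    """Fraction of the parent repeat class that falls in the motif subset."""
    return counts[subset] / counts[parent]


cruciform_share = subset_fraction(HUMAN_COUNTS, "Cruciform motif", "Inverted repeat")
slipped_share = subset_fraction(HUMAN_COUNTS, "Slipped motif", "Direct repeat")
print(f"Cruciform / inverted repeats: {cruciform_share:.1%}")
print(f"Slipped / direct repeats:     {slipped_share:.1%}")
```

By this calculation, roughly 19% of human inverted repeats meet the cruciform spacer criterion and roughly 40% of human direct repeats are spacer-free slipped motifs.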
Figure 2. (A) PolyBrowse view of G-quadruplex forming repeats in the MYC gene with species syntenic information. (B) Syntenic region of the human MYC gene in chimp. (C) Syntenic region of the human MYC gene in macaque. (D) Syntenic region of the human MYC gene in mouse. (E) Syntenic region of the human MYC gene in dog. Visualization of cross-species information in PolyBrowse: the genomic region shown in Figure 1 is displayed herein to illustrate the syntenic capability. In this case, the G-quadruplex forming repeat track shows the locations of these features in the region near the beginning of the MYC gene. The additional track illustrating the liftOver1k blocks is also shown. In that track, when the user moves the cursor over the objects, a popup window appears that contains links (A) to the syntenic locations in the other available species. (B–E) Results upon clicking on the links for chimp, macaque, mouse and dog, respectively. Currently, the liftOver feature only allows moving from the human to the other species but not among the non-human species.
Figure 3. (A) PolyBrowse view of some of the non-B DNA features in the human MYC gene. (B) G-Quadruplex polymorphism tracks. (C) UCSC view of tracks. Links to PolyBrowse and UCSC browsers: (A) display of PolyBrowse rendering the MYC gene and its promoter with some of the non-B DNA motif tracks, with the PuPy [poly(purine•pyrimidine)] and polymorphism tracks turned on. Several other tracks include features computed at the Advanced Biomedical Computing Center (ABCC), such as STRs, base composition, physical DNA characteristics and mapping, as well as NCBI-derived features, including genes, SNPs, cytogenetic markers, assembly information, RepeatMasker elements, etc. (data not shown). (B) Display of the MYC gene in PolyBrowse showing some of the polymorphism information in non-B DB. Below the gene and mRNA tracks, the searched motifs in the reference sequence are displayed (teal). Below them, the alignments from the trace archive are shown (blue). Some of the predicted motifs are found only in the trace sequences. An additional track shows specific G-quadruplex motifs in which all of the observed trace files contained mutations disrupting the G-quadruplex motif. (C) Display of the MYC gene at the UCSC human genome browser (GRCh37/hg19 assembly) as linked to non-B DB from the search shown in Figure 2B. Some of the non-B DNA motifs from non-B DB can be seen in the red rectangular box.