
Mapping the sequences of potential guanine quadruplex motifs.

Alan K Todd, Stephen Neidle.

Abstract

The knowledge that potential guanine quadruplex sequences (PQs) are non-randomly distributed in relation to genomic features is now well established. However, this is for a general potential quadruplex motif which is characterized by short runs of guanine separated by loop regions, regardless of the nature of the loop sequence. There have been no studies to date which map the distribution of PQs in terms of primary sequence or which categorize PQs. To this end, we have generated clusters of PQ sequence groups of various sizes and various degrees of similarity for the non-template strand of introns in the human genome. We started with 86 697 sequences, and successively merged them into groups based on sequence similarity, carrying out 66 clustering cycles before convergence. We have demonstrated here that by using complete linkage hierarchical agglomerative clustering such PQ sequence categorization can be achieved. Our results give an insight into sequence diversity and categories of PQ sequences which occur in human intronic regions. We also highlight a number of clusters for which interesting relationships among their members were immediately evident and other clusters whose members seem unrelated, illustrating, we believe, a distinct role for different sequence types.

Year:  2011        PMID: 21357607      PMCID: PMC3130275          DOI: 10.1093/nar/gkr104

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The occurrence of potential guanine quadruplex sequence motifs (PQs) within non-telomeric nucleic acids has been the subject of a number of studies (1–14) (for reviews, see refs 15 and 16) and several databases and web resources are available (17–21). Most of the emphasis of these surveys has been on examining the number of PQs and the genomic regions in which they occur. Several studies of individual and specific sequences at a small number of loci have been carried out. In particular, PQs associated with the promoter regions of the c-kit (22–25) and c-myc (26,27) genes have been examined in detail, as well as the 5′-untranslated region (UTR) in several other genes including N-ras (28) and zic-1 (29). Apart from our initial analysis describing loop sequences within PQ sequences (1), there has been no systematic classification of PQs in terms of their primary sequence. Crystallographic, nuclear magnetic resonance (NMR) and modelling studies have demonstrated that the topology of guanine quadruplexes is very dependent on their primary sequence, as found, for example, in various human telomeric sequences (30–33), and the two c-kit sequences (22–25). Biophysical studies of loop size (34–36) and analyses of the effects of sequence in single-base loops (37) also confirm this conclusion. From the outset of sequence-based studies into potential quadruplex sequences in non-telomeric nucleic acids, it has been clear that there are more sequences than can be experimentally studied, and to date only a very small fraction of the individual sequences have been examined, although there have been attempts to establish some more general rules governing the energetics of quadruplexes (38). Our initial survey of PQs in the human genome showed that there are 226 157 unique sequences that conform to our search criterion (1).
In the same study, we carried out a detailed examination of loop sequences and established that in terms of sequence space, the distribution of loop sequences is far from random, with some being very common and many others not appearing at all. However, examining loop sequences in this way is problematic since, in instances with variable numbers of guanines in the G-tracts and/or isolated guanines in loop sequences, it is not currently possible to determine which guanines are part of the loop and which are part of the G-quartet core, in the absence of relevant experimental data. When more than four G-tracts are present in a sequence, we have the additional problem of determining which of them would participate in a more stable quadruplex structure. We thus need a more practical and robust way of studying quadruplex sequences in detail than trying to derive information from loop sequences alone. In this study, we consider the sequences of potential quadruplex-forming regions as a whole rather than their component parts (G-tracts and loop regions) and describe a method for finding groups of similar sequences. This removes any need to make prior assumptions about topology. Finding many examples of a complex sequence is compelling evidence of positive selection. The possibility therefore exists that quadruplex structure is the reason for such selections. For the clusters which contain sequences that are proven to form G-quadruplex structures, there is also the possibility that similar sequences adopt similar folding topologies. We have used the non-template strand of introns in the human genome to develop our method and at the same time to produce new data on quadruplex sequences within introns.
Our goals are therefore to develop a method to find groups of similar short sequences, apply it to potential guanine quadruplex sequences and, subsequently, determine whether one can find correlations within these groups or something to link the genes in which the sequences occur. In addition, the application of this method can be seen as a hypothesis-generating exercise, as the potential guanine quadruplex clusters can be used as a starting point for further analyses. We have therefore chosen a number of clusters to illustrate that different types of correlation can be found in the clusters. Hierarchical agglomerative clustering is a method in which one starts with the individual data and merges the most similar (39). The resulting clusters are then successively merged until only one cluster remains. One ends up with a grouping of data in a dendrogram where, at successive levels, the cluster members are less similar. This is schematically represented in Figure 1. In order to cluster nucleic acid sequences in such a way, a similarity metric is needed, and for this, pair-wise sequence alignments were carried out and a similarity score was obtained. Once a similarity metric has been established, there are a number of ways to compare clusters. In this instance, the complete linkage method has been used, where the distance between two clusters is the score of the least-similar pair, i.e. the longest distance between any member of one cluster and any member of the other.
Figure 1.

The clustering process begins at the top, where the individual data are treated as clusters. The most similar data are merged, the similarity between the new clusters is derived and the most similar of those are merged until all of the data are in the same cluster.
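As an illustration of the complete linkage rule described in the Introduction, the following minimal Python sketch scores two clusters by their least-similar cross-cluster pair, and contrasts this with single linkage, which uses the most-similar pair. The function names, the sequence labels and the toy similarity values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from itertools import product

# Hypothetical pair-wise similarity scores between four labelled sequences.
TOY_SCORES = {
    frozenset(("s1", "s3")): 0.9,
    frozenset(("s1", "s4")): 0.6,
    frozenset(("s2", "s3")): 0.8,
    frozenset(("s2", "s4")): 0.7,
}


def toy_similarity(a, b):
    """Look up the (symmetric) toy similarity of two sequence labels."""
    return TOY_SCORES[frozenset((a, b))]


def complete_linkage_similarity(cluster_a, cluster_b, pair_similarity):
    """Complete linkage: the score of the least-similar cross-cluster pair."""
    return min(pair_similarity(a, b) for a, b in product(cluster_a, cluster_b))


def single_linkage_similarity(cluster_a, cluster_b, pair_similarity):
    """Single linkage: the score of the most-similar cross-cluster pair."""
    return max(pair_similarity(a, b) for a, b in product(cluster_a, cluster_b))


if __name__ == "__main__":
    left, right = ["s1", "s2"], ["s3", "s4"]
    print(complete_linkage_similarity(left, right, toy_similarity))
    print(single_linkage_similarity(left, right, toy_similarity))
```

Under complete linkage the two toy clusters would only merge once the threshold reaches their worst pair, which is why this rule resists the chaining that a best-pair rule allows.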


METHODS

All genomic data were taken from the ENSEMBL database (40) homo_sapiens_core_57_37b and the non-template strand sequences were extracted from the intronic regions for all genes with status ‘known’. The same method was used in earlier studies (1,5) to gather the G-rich sequences with the pattern: G_{3–5}L_{1–7}G_{3–5}L_{1–7}G_{3–5}L_{1–7}G_{3–5}, where G represents guanine bases and L represents any base including guanine. The ENSEMBL Perl API was used to extract the genomic regions of interest and in-house software written in C++ was used to extract the PQ regions. Regions that had more than four G-tracts were treated as a single sequence. A total of 101 926 potential quadruplex-forming regions were extracted; however, a number of identical sequences were identified in this set, giving 86 697 unique sequences. The sequences were then collated into a MySQL database, along with information about their genomic locations. Sequence alignments and clustering were carried out using in-house software written in C++. The Smith–Waterman method was used as described by Durbin et al. (41) to carry out the individual alignments. The scoring scheme is quite simple since all mismatches are considered equal: match = 1, mismatch = 0, gap = −0.5, edge-gap = 0. Edge-gaps are the over-hanging parts at the ends of the sequences, which arise from the fact that the sequences are often of different lengths; edge-gaps are therefore inevitable and are made less costly than gaps within the sequence. The scoring for the clustering was carried out in a different way from that of the pair-wise sequence alignments, since there is a different purpose for each of these. It was done by counting the number of gaps and mismatches and dividing by the number of potential matches. The maximum number of matches in any alignment is the length of the shortest sequence. Since gaps in the longer sequence lead to fewer matches, we only penalize gaps on the shorter sequence. The scoring scheme used is described in the following.
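Returning to the search pattern defined at the start of this section, the motif of four G-tracts of 3–5 guanines separated by three loops of 1–7 bases can be written as a regular expression. The sketch below is a minimal Python illustration, not the in-house C++ software used in the study; the names `PQ_PATTERN` and `find_pq_candidates` are hypothetical. It reports non-overlapping matches only and does not merge regions that contain more than four G-tracts into a single sequence, as was done in the study.

```python
import re

# Four G-tracts of 3-5 guanines separated by three loops of 1-7 bases.
# Loops may contain any base, including guanine.
PQ_PATTERN = re.compile(
    r"G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}"
)


def find_pq_candidates(sequence):
    """Return (0-based start, matched text) for non-overlapping PQ matches."""
    sequence = sequence.upper()
    return [(m.start(), m.group()) for m in PQ_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # A toy sequence with four GGG tracts and single-base loops.
    print(find_pq_candidates("TTGGGAGGGTGGGAGGGTT"))
    # A sequence with no G-tracts gives no candidates.
    print(find_pq_candidates("ACGTACGT"))
```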
Details of this scheme and our rationale behind it can be found in the Supplementary Data. To avoid confusion these are not being called ‘alignment scores’ because they were not obtained when the alignments were carried out, but rather they are ‘similarity scores’. This now provides a metric of sequence similarity which is independent of the lengths of the sequences involved. We also obtained more clear-cut results for the clustering by separating the two scoring schemes. The pair-wise sequence alignments were done to find the ‘best’ alignment between the sequences, and the scores for the clustering were calculated so that one pair-wise sequence alignment can be compared with another. It was necessary to score them independently of size since the pair-wise alignments can be of varying size. (Supplementary Figure S1 and Table S4 illustrate how the clustering was biased towards longer sequences being clustered first when using the alignment scores for the clustering from a subset of 1000 sequences chosen at random.) The scoring scheme that we developed was effectively our definition of sequence similarity and had to compare alignments of various lengths. There are many ways in which we could score our alignments, depending on how we define sequence similarity, which would possibly produce differing results, e.g. scoring mismatches higher than gaps might be sensible if we decided that guanine quadruplex loop length was more important to stability than base composition. However, we wish to assume as little as possible and so have kept the scoring scheme as simple as we can. We believe that the results from the method that we settled on indicate that it fits the purpose well. To compare whole clusters, complete-linkage (full-linkage) hierarchical agglomerative clustering was used.
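To make the pair-wise alignment scoring values stated in this section concrete (match = 1, mismatch = 0, gap = −0.5, edge-gap = 0), the following Python sketch computes the best alignment score of two sequences by dynamic programming with free end gaps. It is a simplified illustration under the assumption that edge-gaps are handled as cost-free overhangs at either end; it is not the authors' C++ implementation, and it does not compute the length-independent similarity score used for the clustering. The function name `overlap_alignment_score` is hypothetical.

```python
def overlap_alignment_score(a, b, match=1.0, mismatch=0.0, gap=-0.5):
    """Best pair-wise alignment score with cost-free overhangs (edge-gaps)."""
    cols = len(b) + 1
    # Row 0 is all zeros: leading overhang of b costs nothing.
    prev = [0.0] * cols
    best = 0.0
    for i in range(1, len(a) + 1):
        # Column 0 stays zero: leading overhang of a costs nothing.
        cur = [0.0] * cols
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
        # Alignment may end once b is used up (trailing overhang of a is free).
        best = max(best, cur[-1])
        prev = cur
    # Alignment may also end once a is used up (trailing overhang of b is free).
    return max(best, max(prev))


if __name__ == "__main__":
    print(overlap_alignment_score("GGGAGGG", "GGGAGGG"))  # identical sequences
    print(overlap_alignment_score("GGGTGGG", "GGGAGGG"))  # one mismatch
    print(overlap_alignment_score("GGGAGGG", "GGGGGG"))   # one internal gap
    print(overlap_alignment_score("GGG", "TTGGGTT"))      # overhangs only
```

Because overhangs are free while internal gaps cost 0.5, a short sequence embedded in a longer one scores as well as a full-length match of the same number of bases, which is one reason a separate length-normalized score is needed for clustering.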
When attempting to use single-linkage and mean-linkage hierarchical agglomerative clustering, it was found that the clusters were subject to an unacceptable amount of chaining, where unrelated sequences can end up belonging to the same cluster. The complete-linkage method was very computer-intensive since, to measure the similarity score between two clusters, every pair of sequences between the two clusters must be aligned. The similarity matrix was too large to be held in computer memory (∼28 Gb of data). It was found that it was faster to pre-compute all of the sequence alignments (3 758 141 556 alignments), calculate the comparison scores and store them on a hard drive, since looking up the scores from the hard drive was faster than carrying out the alignment and obtaining the similarity scores on the fly. The clustering process went as follows:
1. Set the similarity threshold to 1 and consider each sequence as a cluster.
2. Compare all clusters and, when a pair is found which has a score equal to or better than the similarity threshold, merge them together.
3. Repeat stage 2 until there are no longer any pairs of clusters at or above the similarity threshold.
4. Decrease the similarity threshold by 0.05 and go back to stage 2.
The process at stage 2 is traditionally performed by merging the best pair of clusters and re-calculating the similarity scores between the newly formed cluster and the remaining clusters. However, this would have taken an impossibly long time with such a large number of sequences, since cluster comparisons are very costly in terms of computer time. To expedite the process, a coarse-grained approach was adopted which merged many of the clusters in a single cycle and greatly reduced the number of cluster comparisons that were carried out. By decreasing the similarity threshold in increments of 0.05, the process was greatly speeded up; we suggest that the outcome was not significantly different from what would have happened if it were practical to cluster by re-calculating the score matrix after every merging. (A comparison of the performance of both methods can be found in Supplementary Figure S2.) The clusters formed at the last cycle before the similarity threshold was dropped are therefore of most interest.
The degree of similarity of the cluster members can be derived from the similarity threshold and hence is related to the cycle number. The earlier in the clustering, the more similar are the cluster members. Several prominent clusters were chosen and dendrograms drawn with software that was developed in-house in the Python programming language, using the Python Imaging Library and the aggdraw module. We also carried out multiple sequence alignments between the cluster members for the purposes of illustration using ClustalW (42). We used FuncAssociate 2.0 (43), which employs Fisher’s exact test to determine the probability that gene ontology (GO) (44) terms are over-represented [the null hypothesis is that any particular GO term appears in the test set (cluster) no more often than expected by chance]. Since it is not impossible to find false positives when looking for correlations in large sets of data, FuncAssociate calculates an adjusted P-value that includes an estimation of the probability of obtaining at least one false positive. For Cycles 5, 9, 13, 17, 20, 23, 26, 29, 32, 36, 39, 42, 45 and 48, the lists of genes belonging to each cluster whose sequences were associated with more than 10 different genes were sent to the FuncAssociate server. A P-value cut-off of 0.05 was used to determine which clusters were over-represented in any GO term.
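The threshold-stepping clustering procedure described earlier in this section can be sketched in a few lines of Python. This is a simplified illustration, not the in-house C++ software: it merges one qualifying pair at a time and rescans, whereas the study merged many clusters per cycle and read pre-computed scores from disk. The function names and the toy `positional_identity` similarity are hypothetical; `pair_count` simply reproduces the number of pair-wise alignments needed for 86 697 unique sequences.

```python
from itertools import combinations, product


def pair_count(n):
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def positional_identity(a, b):
    """Toy similarity for equal-length strings: fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def complete_linkage(c1, c2, similarity):
    """Cluster similarity is that of the least-similar cross-cluster pair."""
    return min(similarity(a, b) for a, b in product(c1, c2))


def threshold_clustering(items, similarity, step=0.05):
    """Return [(threshold, clusters)] after merging converges at each level."""
    clusters = [[item] for item in items]
    history = []
    for level in range(int(round(1.0 / step)) + 1):
        threshold = round(1.0 - level * step, 10)  # 1.0, 0.95, 0.90, ...
        merged = True
        while merged and len(clusters) > 1:
            merged = False
            for i, j in combinations(range(len(clusters)), 2):
                if complete_linkage(clusters[i], clusters[j], similarity) >= threshold:
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    merged = True
                    break  # rescan after every merge
        history.append((threshold, [sorted(c) for c in clusters]))
        if len(clusters) == 1:
            break
    return history


if __name__ == "__main__":
    print(pair_count(86697))
    for threshold, clusters in threshold_clustering(
        ["AAAA", "AAAT", "TTTT"], positional_identity
    ):
        print(threshold, clusters)
```

In the toy run, the two sequences that differ at one position merge when the threshold reaches 0.75, and the unrelated third sequence joins only when the threshold has fallen to 0.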

RESULTS

Cluster size distribution

Figure 2 shows how the number of clusters decreases at each cycle. The threshold level is also shown and it can be seen at which cycles the similarity threshold was decreased and how this affects the number of clusters. For example, the largest drop in the number of clusters occurs between Cycles 7 and 8, from 72 106 to 56 049 clusters. The next significant drop, between Cycles 11 and 12 (from 55 802 to 39 852 clusters), is almost as large. These are, therefore, the stages with the largest number of clusters merging and they coincide with the similarity threshold decreasing from 0.95 to 0.9 and from 0.9 to 0.85.
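The size of the two drops quoted above can be checked with simple arithmetic. The short Python sketch below uses only the four cluster counts given in the text; the dictionary and function names are hypothetical.

```python
# Cluster counts quoted in the text, keyed by cycle number.
CLUSTER_COUNTS = {7: 72106, 8: 56049, 11: 55802, 12: 39852}


def drop_between(first_cycle, second_cycle, counts=CLUSTER_COUNTS):
    """Reduction in the number of clusters between two cycles."""
    return counts[first_cycle] - counts[second_cycle]


if __name__ == "__main__":
    print(drop_between(7, 8))    # threshold 0.95 -> 0.9
    print(drop_between(11, 12))  # threshold 0.9 -> 0.85
    print(drop_between(8, 11))   # cycles at an unchanged threshold
```

The two threshold changes remove 16 057 and 15 950 clusters respectively, whereas only 247 clusters are lost between Cycles 8 and 11, when the threshold is unchanged.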
Figure 2.

The relationship between the similarity threshold and the number of clusters during the progression of the clustering process.

Figure 3 shows the cluster size distribution changing with each cluster cycle, for clusters containing between 1 and 400 sequences and, in the final column, for clusters larger than 400. The clusters were arranged in bins depending on the number of sequence members which they contained, starting with clusters with 0–10 members, then 10–20 members and so on until the clusters along the right-hand side with 400–86 696 sequence members. The heights represent the total number of sequences within the clusters in a particular bin and each coloured row represents a cluster cycle. It can be seen that, as expected, the sequences are initially distributed among the small clusters (0–10) and it is not until Cycle 5 that there are clusters with greater than 10 sequences in them. By Cycle 27 the number of clusters containing 0–10 sequences is dropping significantly and the sequences are distributed among larger clusters, and by Cycle 31 there are clusters which contain more than 400 sequences. As the process of merging clusters continues, the distribution moves to the right until by Cycle 56 there are no longer any clusters below 400 sequences and finally at Cycle 65 the clustering has converged into a single cluster.
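The binning used for this kind of size distribution can be sketched as follows. This minimal Python example sums the number of sequences, not the number of clusters, in each size bin, as described for the bar heights above. The handling of bin edges is an assumption (a cluster of exactly 10 sequences is placed in the 0–10 bin), and the function name and example sizes are hypothetical.

```python
from collections import Counter


def sequences_per_size_bin(cluster_sizes, bin_width=10, overflow_start=400):
    """Total number of sequences held by the clusters in each size bin."""
    totals = Counter()
    for size in cluster_sizes:
        if size > overflow_start:
            label = f">{overflow_start}"
        else:
            # Assumed edge rule: sizes 1-10 -> "0-10", 11-20 -> "10-20", ...
            low = ((size - 1) // bin_width) * bin_width
            label = f"{low}-{low + bin_width}"
        totals[label] += size
    return dict(totals)


if __name__ == "__main__":
    example_sizes = [1, 1, 3, 10, 11, 25, 400, 401]
    for label, total in sequences_per_size_bin(example_sizes).items():
        print(label, total)
```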
Figure 3.

The distribution of cluster members by cluster size and the progression of the clustering process.

There are a very large number of clusters and to carry out a detailed manual analysis of all of them would be unfeasible. We have therefore taken several clusters and highlighted some interesting features within them. Diagrams of these clusters show the multiple sequence alignments calculated using the program ClustalW next to the dendrograms generated from the clustering data. Many of the groupings in the dendrogram on the right can be correlated with features of the ClustalW-aligned sequences even though they were derived through different means. For example, in Figure 5a, sequences 8–10, which share very similar sequences over the first 17 bases, are grouped together much earlier in the clustering process than they are with the rest of the sequences in the cluster, which differ in this region.
Figure 5.

(a) Cycle 18 cluster 202 zinc finger type genes 1. (b) Cycle 21 cluster number 4672 zinc finger genes.

Figure 4 and Table 1 show Cycle 27 cluster number 4470, which contains a cluster of human telomere and human telomere-like sequences with the potential to form quadruplex structures. Azzalin et al. (45) and Schoeftner and Blasco (46) showed that telomeres are not transcriptionally silent and that the C-rich strand is transcribed more than the G-rich strand, resulting in r(UUAGGG) being more abundant than r(CCCUAA). These G-rich RNAs can interact with telomeric DNA and also with the telomerase RNA template and thus inhibit the catalytic action of the telomerase enzyme complex. They can also interact with other gene products such as that of SMG which are also involved in the maintenance of telomeres. The clusters that we have here are examples of an area where this new class of RNA could also be transcribed. Although these sequences are within introns, it is not inconceivable that they can exist alone or as part of smaller molecules after splicing and digestion. For example, it has been observed (47) that while in the quadruplex form, G-rich telomeric RNA is immune to digestion by T1 nuclease, which normally cleaves RNA after a single-stranded guanine residue. Further detail on these clusters is given in Supplementary Tables 1S and 2S. The occurrence of telomeric repeats in non-telomeric DNA has been observed previously, albeit not at the sequence level; for example, Meyne et al. (48) discussed their distribution in 100 vertebrate species.
Figure 4.

Telomeric-like quadruplex sequences.

Table 1.

Telomeric sequence clusters

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | BET1L | ENSG00000177951 | 46036519 | Intron 4–5
2 | ST8SIA1 | ENSG00000111728 | 793339 013 | Intron 4–5
  | MRVI1 | ENSG00000072952 | 28 38331 060 | Intron 1–2
3 | BET1L | ENSG00000177951 | 44086697 | Intron 4–5
4 | EHD4 | ENSG00000103966 | 27307842 | Intron 2–3
  | ARNT2 | ENSG00000172379 | 54956719 | Intron 3–4
  | ARFGAP3 | ENSG00000242247 | 556418 179 | Intron 1–2
5 | BET1L | ENSG00000177951 | 45126588 | Intron 4–5
  | BET1L | ENSG00000177951 | 47136387 | Intron 4–5
6 | BET1L | ENSG00000177951 | 48396279 | Intron 4–5
7 | NLGN4X | ENSG00000146938 | 39 96281 579 | Intron 2–3
8 | CBFA2T3 | ENSG00000129993 | 458762 | Intron 8–9
9 | RP11-40F6.1 | ENSG00000237523 | 87111055 | Intron 1–2
10 | RP11-416N4.2 | ENSG00000230506 | 17 7965299 | Intron 1–2
11 | AC004490.2 | ENSG00000234432 | 18 3438360 | Intron 1–2
12 | BET1L | ENSG00000177951 | 39167179 | Intron 4–5
13 | FAM157C | ENSG00000233013 | 57836648 | Intron 3–4
14 | ZNF275 | ENSG00000063587 | 329846 | Intron 3–4
15 | BET1L | ENSG00000177951 | 38307298 | Intron 4–5
16 | BET1L | ENSG00000177951 | 37537335 | Intron 4–5
17 | FAM157C | ENSG00000233013 | 56386804 | Intron 3–4
18 | CALN1 | ENSG00000183166 | 26 8607885 | Intron 1–2
19 | RPL23AP82 | ENSG00000184319 | 9672391 | Intron 3–4
  | RPL23AP7 | ENSG00000226019 | 9672391 | Intron 2–3
20 | RP11-218L14.1 | ENSG00000225393 | 87229458 | Intron 1–2
21 | ARHGEF3 | ENSG00000163947 | 946966 708 | Intron 2–3
22 | BET1L | ENSG00000177951 | 13 70111 400 | Intron 3–4
23 | KCNJ6 | ENSG00000157542 | 4819120 673 | Intron 2–3
  | RP1-207H1.1 | ENSG00000231150 | 80438579 | Intron 1–2
24 | CFDP1 | ENSG00000153774 | 60 85029 018 | Intron 5–6
25 | AL078621.1 | ENSG00000228003 | 96912 303 | Intron 2–3
26 | AC068541.3 | ENSG00000233897 | 51 142108 760 | Intron 3–4
27 | BET1L | ENSG00000177951 | 37157412 | Intron 4–5
28 | FAM157C | ENSG00000233013 | 55946887 | Intron 3–4
29 | SLC8A2 | ENSG00000118160 | 2410762 | Intron 6–7
Figure 5a and b and Tables 2 and 3 show clusters which are mainly composed of closely related zinc-finger genes. The members of the cluster in Figure 5a all belong to the same InterPro (49) families, IPR001909 Krueppel-associated box and IPR007087 Znf_C2H2. They occur at 13 different locations, with 10 unique sequences. The location of the sequences within the genes is similar for most of these genes, usually about 200–300 bases from the beginning of the first intron. The variable parts of the sequence tend to be the outer ‘loops’ while the central GGGAGGG core appears to be conserved. This core is also conserved in another very similar cluster shown in Figure 5b. The majority of those genes belong to the same InterPro families, IPR001909 Krueppel-associated box and IPR007087 Znf_C2H2. Sequence 7 is shared by two genes which overlap, AC010300.1 and ZNF91, with ZNF91 being contained entirely within an intron of AC010300.1. Almost all of these genes are found in the same area of chromosome 19; however, two genes are found in entirely different locations: ZNF107 is found on chromosome 7 and MAP1B is found on chromosome 5. Although MAP1B is an unrelated gene, its expression has been shown to be controlled by the zinc finger gene BCL11A, which also belongs to InterPro family IPR007087 Znf_C2H2 (50). When looking at the variable and conserved regions, we need to be aware that the search criterion may have an effect on what we see, i.e. we will always have conserved runs of guanines in the sequences. It may be more instructive to look at the conservation of the loop sequences; however, if the guanine runs are longer than three bases, there is scope for variability around the edges, under our search criterion.
The cluster in Figure 5a appears to be more variable in the region of the first loop, while the central ‘A’ loop is the same throughout and the third loop ‘TCAT’ has only one difference, a substitution of an adenine for a cytosine. Since the final G-runs are longer than three bases, we see two cases where a guanine is replaced, once by an adenine and once by a thymine.
Table 2.

Cycle 18 cluster 202 zinc finger type genes 1

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | ZNF844 | ENSG00000223547 | 3008830 | Intron 1–2
2 | ZNF491 | ENSG00000177599 | 2895499 | Intron 1–2
3 | ZNF833 | ENSG00000197332 | 2864031 | Intron 1–2
4 | ZNF709 | ENSG00000242852 | 24746 621 | Intron 1–2
  | ZNF564 | ENSG00000196826 | 37 83346 621 | Intron 1–2
5 | ZNF709 | ENSG00000242852 | 29 38017 489 | Intron 1–2
  | ZNF564 | ENSG00000196826 | 66 96617 489 | Intron 1–2
6 | ZNF69 | ENSG00000198429 | 27915 324 | Intron 1–2
7 | ZNF627 | ENSG00000198551 | 22816 643 | Intron 1–2
8 | ZNF791 | ENSG00000173875 | 21112 383 | Intron 1–2
9 | ZNF20 | ENSG00000132010 | 2903960 | Intron 1–2
  | ZNF625 | ENSG00000213297 | 2903960 | Intron 5–6
10 | ZNF44 | ENSG00000197857 | 649615 598 | Intron 4–5
Table 3.

Cycle 21 cluster number 4672 zinc finger genes

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | AC011477.1 | ENSG00000245381 | 31 17725 473 | Intron 2–3
2 | ZNF100 | ENSG00000197020 | 2791336 | Intron 1–2
  | ZNF681 | ENSG00000196172 | 2642906 | Intron 1–2
3 | ZNF431 | ENSG00000196705 | 2881051 | Intron 1–2
4 | ZNF493 | ENSG00000196268 | 2867549 | Intron 1–2
5 | ZNF492 | ENSG00000229676 | 29018 517 | Intron 1–2
6 | ZNF85 | ENSG00000105750 | 28210 313 | Intron 1–2
7 | AC010300.1 | ENSG00000235694 | 71 07370 836 | Intron 9–10
  | ZNF91 | ENSG00000167232 | 28720 248 | Intron 1–2
8 | ZNF254 | ENSG00000213096 | 27218 305 | Intron 1–2
9 | ZNF738 | ENSG00000172687 | 2892308 | Intron 1–2
10 | ZNF724P | ENSG00000196081 | 28717 634 | Intron 1–2
11 | MAP1B | ENSG00000131711 | 41 78826 124 | Intron 2–3
12 | ZNF588 | ENSG00000196247 | 32212 544 | Intron 1–2
Within Cycle 18, several clusters were found to be over-represented in the GO term GO:0003823 ‘antigen binding’. Cycle 18 cluster 13 461 (Figure 6 and Table 4) is one such cluster, which consists mainly of LIR genes (leucocyte immunoglobulin-like receptor). These genes are all found in the same genomic location: region 19q13.4. All but one of the genes in the cluster occur at this location; however, since some of the genes are overlapping, the total number of locations is 11. Cycle 18 cluster 448 (Figure 7 and Table 5) contains a number of very similar sequences which occur within two immunoglobulin genes, IGHA2 and IGHM. A third IGH gene, IGHV3-6, is a pseudogene; however, since certain pseudogenes may play an important role in regulation (51,52), this may still be a biologically relevant locus. Three other genes which appear in this cluster, TRIM29, ZNF831 and BRSK2, are unrelated to the immunoglobulins. A similar cluster, Cycle 18 cluster 1086 (Figure 8 and Table 6), contains the same immunoglobulin genes and similar sequence motifs. This also contains three non-immunoglobulin genes: KCNK2, SMAD1 and the same kinase gene as found in the aforementioned cluster, BRSK2. Closer examination of the regions in which these sequences occur in both the immunoglobulin genes and BRSK2 suggests that they are all part of a larger region of similarity. That we have closely related genes with similar sequences within their introns is perhaps no great surprise; however, the existence of similar sequences within the introns of unrelated genes is an unexpected observation.
Figure 6.

Cycle 18 cluster 1346. Cluster containing sequences that occur chiefly within leukocyte immunoglobulin-like receptor (LIR) genes.

Table 4.

Cycle 18 cluster 13 461. Cluster containing sequences which occur chiefly within leucocyte immunoglobulin-like receptor (LIR) genes

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | LILRA6 | ENSG00000244482 | 77147 | Intron 5–6
  | LILRB3 | ENSG00000204577 | 77147 | Intron 5–6
  | LILRB3 | ENSG00000204577 | 18 8991733 | Intron 7–8
2 | LILRA1 | ENSG00000104974 | 6173193 | Intron 5–6
  | LILRB1 | ENSG00000104972 | 21 90635 443 | Intron 2–3
  | LILRP2 | ENSG00000240258 | 84146 | Intron 3–4
  | AC006293.1 | ENSG00000170858 | 84146 | Intron 4–5
3 | LILRB1 | ENSG00000104972 | 77148 | Intron 5–6
4 | KCNH2 | ENSG00000055118 | 44630 | Intron 8–9
5 | LILRA2 | ENSG00000239998 | 84147 | Intron 6–7
  | LILRB1 | ENSG00000104972 | 152555 824 | Intron 2–3
6 | LILRA4 | ENSG00000239961 | 79147 | Intron 5–6
  | LILRA3 | ENSG00000170866 | 77146 | Intron 9–10
7 | AC011515.1 | ENSG00000225370 | 77153 | Intron 2–3
Figure 7.

Cycle 18 cluster 448. Cluster containing sequences that occur chiefly in immunoglobulin genes IGHA2 and IGHM.

Table 5.

Cycle 18 cluster 448. Cluster containing sequences which occur in immunoglobulin genes IGHA2 and IGHM

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | IGHA2 | ENSG00000211890 | 14591736 | Intron 1–2
2 | TRIM29 | ENSG00000137699 | 8326383 | Intron 1–2
3 | IGHA2 | ENSG00000211890 | 9792231 | Intron 1–2
4 | IGHA2 | ENSG00000211890 | 4722763 | Intron 1–2
5 | IGHA2 | ENSG00000211890 | 8442321 | Intron 1–2
6 | IGHA2 | ENSG00000211890 | 5492701 | Intron 1–2
7 | IGHA2 | ENSG00000211890 | 10842106 | Intron 1–2
8 | IGHA2 | ENSG00000211890 | 17491486 | Intron 1–2
9 | IGHV3-6 | ENSG00000233855 | 577386 881 | Intron 10–11
  | IGHM | ENSG00000211899 | 28722301 | Intron 1–2
10 | IGHV3-6 | ENSG00000233855 | 398288 648 | Intron 10–11
  | IGHM | ENSG00000211899 | 10814068 | Intron 1–2
11 | BRSK2 | ENSG00000174672 | 5173328 | Intron 12–13
12 | IGHV3-6 | ENSG00000233855 | 594386 716 | Intron 10–11
  | IGHV3-6 | ENSG00000233855 | 598386 676 | Intron 10–11
  | IGHM | ENSG00000211899 | 30422136 | Intron 1–2
  | IGHM | ENSG00000211899 | 30822096 | Intron 1–2
13 | ZNF831 | ENSG00000124203 | 1516430 732 | Intron 3–4
14 | IGHV3-6 | ENSG00000233855 | 483387 826 | Intron 10–11
  | IGHV3-6 | ENSG00000233855 | 488487 775 | Intron 10–11
  | IGHV3-6 | ENSG00000233855 | 527487385 | Intron 10–11
  | IGHM | ENSG00000211899 | 19323246 | Intron 1–2
  | IGHM | ENSG00000211899 | 19833195 | Intron 1–2
  | IGHM | ENSG00000211899 | 23732805 | Intron 1–2
Figure 8.

Cycle 18 cluster 1086. Cluster containing sequences that occur chiefly in immunoglobulin genes IGHA2 and IGHM.

Table 6.

Cycle 18 cluster 1086. Cluster containing sequences which occur chiefly in immunoglobulin genes IGHA2 and IGHM

Leaf no. | Gene | EnsemblID | From start / To end | Feature
1 | IGHV3-6 | ENSG00000233855 | 623786 417 | Intron 10–11
  | IGHM | ENSG00000211899 | 33361837 | Intron 1–2
2 | IGHA2 | ENSG00000211890 | 2392848 | Intron 1–2
3 | KCNK2 | ENSG00000082482 | 61 11319 276 | Intron 1–2
4 | IGHV3-6 | ENSG00000233855 | 424788 402 | Intron 10–11
  | IGHM | ENSG00000211899 | 13463822 | Intron 1–2
5 | BRSK2 | ENSG00000174672 | 3633468 | Intron 12–13
6 | IGHA2 | ENSG00000211890 | 16541591 | Intron 1–2
7 | IGHV3-6 | ENSG00000233855 | 405288 592 | Intron 10–11
  | IGHM | ENSG00000211899 | 11514012 | Intron 1–2
8 | IGHV3-6 | ENSG00000233855 | 419788 447 | Intron 10–11
  | IGHM | ENSG00000211899 | 12963867 | Intron 1–2
9 | SMAD1 | ENSG00000170365 | 10 59114 155 | Intron 2–3
Cycle 18 cluster 2 (Figure 9 and Table 7) contains 27 sequences; however, many occur in more than one locus and the sequences in the cluster appear 140 times. Sequences 1 and 5 are the most common, occurring 45 and 51 times, respectively. We used BioMart (53) in the ENSEMBL website to retrieve the InterPro (49) IDs for the genes involved (full details are given in the Supplementary Data). Of the 88 genes which had InterPro mappings, several were related; however, none occurred more than seven times and the genes are distributed over a range of gene families. Among them are kinases, zinc-finger genes, RAB/RAS genes, WD-40 domains and catenins. Many are known to be associated with signal-transduction pathways (RIN3, TBC1D19, RASGRF3, CDK14, ARHGAP6, CTNNA3 and CTNND2, to name a few) and many are involved in mitosis (CENPQ, PARD3B, ALMS1, SPTLC1, etc.). This cluster demonstrates that a significant number of genes, both related and unrelated, contain similar and often identical PQ sequences.
Figure 9.

Cycle 18 cluster 2. This shows sequences that occur in unrelated genes.

Table 7.

Cycle 18 cluster 2. Sequences which occur in unrelated genes. List of genes in which each sequence in this cluster occurs

Sequence | Genes
1 | CHST9 CTNND2 PDZD2 TBC1D19 PDE4D TRIM5 PLCB1 LRRC9 TMEM170B AP000705.4 LRRK2 SUMF1 NELL1 MEGF10 FBN2 AP003355.2 VPS13B AC090922.1 AC096733.1 RNF150 WDFY4 ALK SLC16A7 GLIPR1L2 SEMA3D EIF4G3 PFTK1 C2orf34 DLG2 F8 RP11-451L9.1 FRMD4B ZNF28 ZNF665 PDE4B AC010132.1 CTB-111H14.1 AC009264.1 COL24A1 RP11-457K10.1 DYTN KCNE4 AC007254.1 RAB3GAP2 RP11-479J7.2
2 | RIN3
3 | PTPRD
4 | NBEA FBXL5 KLF12 ADAMTS3 SLC24A3 RASGRF2 BICD1 BEND7 AP000235.2 LACTB2 FAM190A C11orf74 TBCK AMBRA1 PSMC1 TEX11 PPP2R2B KIAA2022 SPTLC1 MAGT1 CTNNA3 ODZ3 UQCRFS1 AC008413.1 MON2 C11orf80 CENPQ NRCAM TRIM77 AC003050.1 ATRNL1 FXYD6 RP11-702L6.4 SLC9A10 STXBP5L RP11-310E22.1 CASC2 AC005582.1 ST6GALNAC3 RP4-630C24.1 LRP1B GALNT13 SLC25A24 RP11-439L18.3 AC018359.2 PTH2R AC079613.1 AC093865.2 RP11-542C10.1 RP11-202K23.1 RP11-479J7.2
5 | RP11-735B13.1
6 | PK4P
7 | DLG2
8 | PTPRD
9 | AF127577.3 AGBL1 AFF2 CYP4B1 AC003090.1 FAM19A3 PARD3B ALMS1P
10 | HERC2
11 | SLC26A7
12 | NRSN1
13 | EFCAB5 TFAP2D
14 | NAV3
15 | XKR4
16 | BBOX1 ALOX5 AL592494.3
17 | C2orf34
18 | TRPC4
19 | THSD4
20 | ARHGAP6 ALMS1 RP11-615J4.4 AC009499.1 MRPL33
21 | ARL15 ACCN1 PDSS2 JAK1 PDE4B RP3-433F14.1
22 | RP4-781K5.2
23 | KIAA0146
24 | COL5A3
25 | PDE3B
26 | MBD5
27 | RP11-202P11.1
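
The occurrence counts quoted for this cluster can be checked directly against Table 7. The following is a minimal Python sketch, assuming that each gene listed against a sequence corresponds to one occurrence of that sequence; the names `TABLE_7`, `counts` and `shared` are illustrative and are not part of the original analysis.

```python
from collections import Counter

# Gene lists transcribed from Table 7 (sequence number -> genes containing it).
TABLE_7 = {
    1: """CHST9 CTNND2 PDZD2 TBC1D19 PDE4D TRIM5 PLCB1 LRRC9 TMEM170B AP000705.4
          LRRK2 SUMF1 NELL1 MEGF10 FBN2 AP003355.2 VPS13B AC090922.1 AC096733.1
          RNF150 WDFY4 ALK SLC16A7 GLIPR1L2 SEMA3D EIF4G3 PFTK1 C2orf34 DLG2 F8
          RP11-451L9.1 FRMD4B ZNF28 ZNF665 PDE4B AC010132.1 CTB-111H14.1
          AC009264.1 COL24A1 RP11-457K10.1 DYTN KCNE4 AC007254.1 RAB3GAP2
          RP11-479J7.2""",
    2: "RIN3",
    3: "PTPRD",
    4: """NBEA FBXL5 KLF12 ADAMTS3 SLC24A3 RASGRF2 BICD1 BEND7 AP000235.2 LACTB2
          FAM190A C11orf74 TBCK AMBRA1 PSMC1 TEX11 PPP2R2B KIAA2022 SPTLC1 MAGT1
          CTNNA3 ODZ3 UQCRFS1 AC008413.1 MON2 C11orf80 CENPQ NRCAM TRIM77 AC003050.1
          ATRNL1 FXYD6 RP11-702L6.4 SLC9A10 STXBP5L RP11-310E22.1 CASC2 AC005582.1
          ST6GALNAC3 RP4-630C24.1 LRP1B GALNT13 SLC25A24 RP11-439L18.3 AC018359.2
          PTH2R AC079613.1 AC093865.2 RP11-542C10.1 RP11-202K23.1 RP11-479J7.2""",
    5: "RP11-735B13.1",
    6: "PK4P",
    7: "DLG2",
    8: "PTPRD",
    9: "AF127577.3 AGBL1 AFF2 CYP4B1 AC003090.1 FAM19A3 PARD3B ALMS1P",
    10: "HERC2",
    11: "SLC26A7",
    12: "NRSN1",
    13: "EFCAB5 TFAP2D",
    14: "NAV3",
    15: "XKR4",
    16: "BBOX1 ALOX5 AL592494.3",
    17: "C2orf34",
    18: "TRPC4",
    19: "THSD4",
    20: "ARHGAP6 ALMS1 RP11-615J4.4 AC009499.1 MRPL33",
    21: "ARL15 ACCN1 PDSS2 JAK1 PDE4B RP3-433F14.1",
    22: "RP4-781K5.2",
    23: "KIAA0146",
    24: "COL5A3",
    25: "PDE3B",
    26: "MBD5",
    27: "RP11-202P11.1",
}

# Number of genes listed against each sequence (assumed: one occurrence per gene).
counts = {seq: len(genes.split()) for seq, genes in TABLE_7.items()}
total = sum(counts.values())

# The two sequences with the most listed genes.
top_two = sorted(counts, key=counts.get, reverse=True)[:2]

# Genes that are listed against more than one sequence of the cluster.
gene_hits = Counter(g for genes in TABLE_7.values() for g in set(genes.split()))
shared = sorted(g for g, n in gene_hits.items() if n > 1)

print("sequences:", len(TABLE_7), "total occurrences:", total)
print("most frequent sequences:", {s: counts[s] for s in top_two})
print("genes listed under more than one sequence:", shared)
```

On the table as transcribed, the tallies sum to the 140 occurrences quoted in the text, with the two largest counts (45 and 51) belonging to sequences 1 and 4.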
Using the FuncAssociate tool to characterize gene sets, we discovered that certain GO terms were over-represented in a number of clusters. The results are summarized in Table 8, which shows, for each cycle, the number of clusters whose sequences fell in more than 10 ENSEMBL genes, the number of these in which GO terms were over-represented, the sum of the number of GO terms found to be over-represented in each cluster and the percentage of the chosen clusters in which GO terms were over-represented. The percentage of clusters examined which contained over-represented GO terms did not vary dramatically for Cycles 5–39, where it began at 12% for Cycle 5 and remained for the most part between 9% and 10%. At Cycle 42 it began to rise: 17% for Cycle 42, 23% for Cycle 45 and 50% for Cycle 48. This approximately steady rate at Cycles 5–39 is probably because, while new clusters are being formed and genes associated with common GO terms come together, other clusters with over-represented GO terms are being ‘diluted’ and the significance of their over-represented GO terms is being reduced.
Table 8.

Number of clusters whose associated GO terms were found to be over-represented using FuncAssociate

Cycle number | Clusters with over-represented GO terms | Sum of GO terms over-represented in each cluster | Number of clusters occurring in >10 different genes | % Clusters with over-represented GO terms
5 | 16 | 94 | 133 | 12.030075188
9 | 22 | 120 | 219 | 10.0456621005
13 | 33 | 201 | 389 | 8.48329048843
17 | 65 | 291 | 642 | 10.1246105919
20 | 94 | 367 | 1048 | 8.96946564885
23 | 169 | 532 | 1768 | 9.55882352941
26 | 268 | 772 | 2921 | 9.17494008901
29 | 348 | 862 | 3620 | 9.61325966851
32 | 311 | 670 | 3167 | 9.82001894537
36 | 196 | 392 | 1824 | 10.7456140351
39 | 111 | 240 | 1060 | 10.4716981132
42 | 86 | 230 | 506 | 16.9960474308
45 | 54 | 207 | 238 | 22.6890756303
48 | 50 | 331 | 100 | 50.0
The raw cluster data will be available upon request from alan.todd@pharmacy.ac.uk and in the future from a webpage.
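
The final column of Table 8 is the number of clusters with over-represented GO terms divided by the number of clusters occurring in more than 10 different genes. The short Python sketch below recomputes it from the other columns, with the values transcribed from Table 8; the variable names are illustrative only.

```python
# Rows transcribed from Table 8:
# (cycle, clusters with over-represented GO terms,
#  sum of over-represented GO terms, clusters occurring in >10 different genes)
TABLE_8 = [
    (5, 16, 94, 133),
    (9, 22, 120, 219),
    (13, 33, 201, 389),
    (17, 65, 291, 642),
    (20, 94, 367, 1048),
    (23, 169, 532, 1768),
    (26, 268, 772, 2921),
    (29, 348, 862, 3620),
    (32, 311, 670, 3167),
    (36, 196, 392, 1824),
    (39, 111, 240, 1060),
    (42, 86, 230, 506),
    (45, 54, 207, 238),
    (48, 50, 331, 100),
]

# Percentage of examined clusters that carry over-represented GO terms.
percent = {
    cycle: 100.0 * with_go / examined
    for cycle, with_go, _go_term_sum, examined in TABLE_8
}

# Split into the early plateau (Cycles 5-39) and the late rise (Cycles 42-48).
plateau = [p for cycle, p in percent.items() if cycle <= 39]
late = [p for cycle, p in percent.items() if cycle >= 42]

for cycle, p in percent.items():
    print(f"Cycle {cycle:2d}: {p:5.1f}%")
print(f"Cycles 5-39 range: {min(plateau):.1f}% to {max(plateau):.1f}%")
print(f"Cycles 42-48 range: {min(late):.1f}% to {max(late):.1f}%")
```

Note that the denominator shrinks sharply in the last cycles (only 100 clusters at Cycle 48), so the late rise in percentage rests on far fewer clusters than the plateau values.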

DISCUSSION

Introns have often been assumed to be mutationally neutral. However, there is growing interest in blocks of intronic regions which are conserved across species and which have been suggested as candidate areas of cis-acting regulatory regions (54–56). Although we have only examined a single species in the present analysis, the same reasoning can be applied to paralogous regions as well as orthologues. Indeed, genes which are co-expressed and have a common regulatory mechanism do not necessarily have to be paralogues; the same cis-acting promoter binding motifs, for example, often exist upstream of unrelated genes. From a sequence conservation point of view, it is perhaps more remarkable to find large numbers of similar sequences in unrelated genes than in closely related ones. One could argue that it is possible for similar sequences in closely related genes to be simply passenger sequences which have not yet had time to diverge. In less closely related genes, it could even be argued that mutational cold spots (57) are responsible for some of the conserved sequence. Since we have identified clusters in genes whose members have a range of genetic distances, from the closely related zinc finger genes in Figure 5 to the unrelated genes in Figure 9, we feel confident in stating that selective pressure is likely to be responsible for many of the sequence clusters observed here. The range of types of cluster and sequence types suggests that they have many different biological roles. Eddy and Maizels (4) showed that there was a relationship between gene function and the number of PQ sequences found within those genes. By finding clusters which are over-represented in particular GO terms, we have shown that this type of relationship also applies at the sequence level and we can use the clusters to examine it further. By comparing the sequences in a multiple sequence alignment, we may see which elements are conserved and which are variable. 
If the sequence group forms a quadruplex structure then some of these conserved and variable regions may not be critical in quadruplex formation but may be critical bases for molecular recognition. In certain cases, this would be more useful than simply finding quadruplex-dependent positions. Whether one takes the abundance of similar PQs as evidence of selective pressure or not, the clustering data may still be exploited. For example, one of the key areas of G-quadruplex research currently focuses on developing ligands which block transcription by stabilizing a particular quadruplex sequence. It may be important to know how unique that sequence is in order to provide specificity.

Meaningfulness of clusters

Since the clusters were merged using the complete-linkage method, the similarity threshold will be the lowest score between any pair of sequences in a cluster. At Cycle 18 (where most of the examples are from), the similarity threshold was 0.8. For a comparison of sequences where the shortest sequence is around 24 bases long, similar to the majority of cases in the cluster in Figure 4, the worst alignments would have to contain, for example, three mismatches and two gaps, which would give a similarity score of 0.808. In practice, the majority of alignments in that cluster are much more similar and this generally appears to be the case. We have derived clusters of varying similarity and size, which raises the question of what represents a biologically relevant cluster. As the clustering progresses, less similar sequences are added to each cluster and at some stage members which do not have a similar biological role will be merged. The point at which this occurs is impossible to determine without knowledge of the role of these sequences or without experimental evidence. In the cases where we have discretely grouped clusters, rather than continuous merging through the clustering process, this should be less of a problem. We suggest that sequence types whose significance is determined in the future may have differing roles and so will require different degrees of similarity. Indeed, the cluster examples which we have presented were chosen because they represent a variety of different types of correlation: clusters which had a correlation with gene ontologies, those which correlated with protein families, clusters which belonged to disparate protein families and an example of a cluster which was found because of a particular interest (TERRA). The TERRA cluster is also an example which contains sequences that are known to form stable DNA and RNA quadruplexes. The clustering in this study was performed on introns of human genes. 
It is now possible to examine other regions of genomic DNA with this methodology and search for clusters in, for example, UTRs, promoter regions or exons. The sequences which were clustered here are those which we selected using our criterion of four runs of at least three guanines, separated by loop regions. However, guanine quadruplex structures may not necessarily be formed exclusively from this sequence type. Indeed, in light of a recent structural study by Kuryavyi and Patel (58), we feel that a clustering approach using a yet more general rule for which sequences can potentially form quadruplex structures will in due course bear fruit. That study is not the only one to report a G-quadruplex with a topology which involves more than a simple sequence containing G-tracts separated by loop sequences; see in particular the molecular structures of the sequence in the promoter region of the c-kit gene (22–25). Clustering methods can be applied to any group of sequences including, for example, those which follow a specific template and those which are generally G-rich.
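
The two ideas discussed in this section, selecting sequences by a G-tract template and merging them by complete linkage so that the threshold bounds the worst pair in every cluster, can be illustrated with a small Python sketch. Several things here are assumptions rather than the method used in this study: the loop length limit of 1–7 bases in `PQ_PATTERN`, the use of `difflib.SequenceMatcher.ratio()` as a stand-in for an alignment-based similarity score, the function names, and the toy sequences, which were invented for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Four runs of at least three guanines separated by loops.
# ASSUMPTION: loops of 1-7 bases; the text above does not state loop limits.
PQ_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def is_pq(seq):
    """True if the whole sequence fits the assumed PQ template."""
    return PQ_PATTERN.fullmatch(seq) is not None


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the score used in the paper."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_score(c1, c2):
    """Complete linkage: two clusters are only as similar as their worst pair."""
    return min(similarity(a, b) for a in c1 for b in c2)


def complete_linkage(seqs, threshold):
    """Repeatedly merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[s] for s in seqs]
    while len(clusters) > 1:
        best, i, j = max(
            (cluster_score(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    return clusters


# Invented toy sequences: three telomere-like, two with C loops, one non-PQ.
candidates = [
    "GGGTTAGGGTTAGGGTTAGGG",
    "GGGTTAGGGTTAGGGTTTGGG",
    "GGGTTAGGGTCAGGGTTAGGG",
    "GGGCGGGCGGGCGGG",
    "GGGCGGGCGGGAGGG",
    "ACGTACGTACGT",
]
pqs = [s for s in candidates if is_pq(s)]
clusters = complete_linkage(pqs, threshold=0.8)

for n, members in enumerate(clusters, start=1):
    worst = cluster_score(members, members)
    print(f"cluster {n}: {len(members)} sequences, lowest pairwise score {worst:.3f}")
```

Because merging stops as soon as the best available merge would bring any pair below the threshold, every cluster returned has a lowest pairwise score at or above 0.8, which is the property relied on in the argument above. The naive search over all cluster pairs is only practical for small inputs; a data set of tens of thousands of sequences would need a more efficient implementation.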

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

This work has been supported by a programme grant (No. C129/A4489) from Cancer Research UK (to S.N.). Funding for open access charges: CRUK grant.

Conflict of interest statement. None declared.
  55 in total

1.  Enrichment of G4 DNA motif in transcriptional regulatory region of chicken genome.

Authors:  Zhuo Du; Ping Kong; Yu Gao; Ning Li
Journal:  Biochem Biophys Res Commun       Date:  2007-01-25       Impact factor: 3.575

2.  Structure of an unprecedented G-quadruplex scaffold in the human c-kit promoter.

Authors:  Anh Tuân Phan; Vitaly Kuryavyi; Sarah Burge; Stephen Neidle; Dinshaw J Patel
Journal:  J Am Chem Soc       Date:  2007-03-16       Impact factor: 15.419

3.  Telomeric repeat containing RNA and RNA surveillance factors at mammalian chromosome ends.

Authors:  Claus M Azzalin; Patrick Reichenbach; Lela Khoriauli; Elena Giulotto; Joachim Lingner
Journal:  Science       Date:  2007-10-04       Impact factor: 47.728

4.  Bioinformatics approaches to quadruplex sequence location.

Authors:  Alan K Todd
Journal:  Methods       Date:  2007-12       Impact factor: 3.608

5.  Extensive selection for the enrichment of G4 DNA motifs in transcriptional regulatory regions of warm blooded animals.

Authors:  Yiqiang Zhao; Zhuo Du; Ning Li
Journal:  FEBS Lett       Date:  2007-04-18       Impact factor: 4.124

6.  Sequence effects of single base loops in intramolecular quadruplex DNA.

Authors:  Phillip A Rachwal; Tom Brown; Keith R Fox
Journal:  FEBS Lett       Date:  2007-03-28       Impact factor: 4.124

7.  An RNA G-quadruplex in the 5' UTR of the NRAS proto-oncogene modulates translation.

Authors:  Sunita Kumari; Anthony Bugaut; Julian L Huppert; Shankar Balasubramanian
Journal:  Nat Chem Biol       Date:  2007-02-25       Impact factor: 15.040

8.  New developments in the InterPro database.

Authors:  Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Virginie Buillard; Lorenzo Cerutti; Richard Copley; Emmanuel Courcelle; Ujjwal Das; Louise Daugherty; Mark Dibley; Robert Finn; Wolfgang Fleischmann; Julian Gough; Daniel Haft; Nicolas Hulo; Sarah Hunter; Daniel Kahn; Alexander Kanapin; Anish Kejariwal; Alberto Labarga; Petra S Langendijk-Genevaux; David Lonsdale; Rodrigo Lopez; Ivica Letunic; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Anastasia N Nikolskaya; Sandra Orchard; Christine Orengo; Robert Petryszak; Jeremy D Selengut; Christian J A Sigrist; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal:  Nucleic Acids Res       Date:  2007-01       Impact factor: 16.971

9.  Ensembl 2009.

Authors:  T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal:  Nucleic Acids Res       Date:  2008-11-25       Impact factor: 16.971

10.  Sequence occurrence and structural uniqueness of a G-quadruplex in the human c-kit promoter.

Authors:  Alan K Todd; Shozeb M Haider; Gary N Parkinson; Stephen Neidle
Journal:  Nucleic Acids Res       Date:  2007-08-24       Impact factor: 16.971

