A toolbox for predicting G-quadruplex formation and stability.

Han Min Wong, Oliver Stegle, Simon Rodgers, Julian Leon Huppert.

Abstract

G-quadruplexes are four-stranded nucleic acid structures formed around a core of guanines, arranged in squares with mutual hydrogen bonding. Many of these structures are highly thermally stable, especially in the presence of monovalent cations, such as those found under physiological conditions. Understanding of their physiological roles is expanding rapidly, and they have been implicated in regulating gene transcription and translation, among other functions. We have built a community-focused website to act as a repository for the information that is now being developed. At its core, this site has a detailed database (QuadDB) of predicted G-quadruplexes in the human and other genomes, together with the predictive algorithm used to identify them. We also provide a QuadPredict server, which predicts thermal stability and acts as a repository for experimental data from all researchers. There are also a number of other data sources with computational predictions. We anticipate that the wide availability of this information will be of use both to researchers already active in this exciting field and to those who wish to investigate a particular gene hypothesis.

Year: 2010 PMID: 20725630 PMCID: PMC2915886 DOI: 10.4061/2010/564946

Source DB: PubMed Journal: J Nucleic Acids ISSN: 2090-0201

1. Introduction

It was observed in 1910 [1] that a sufficiently high concentration of guanosine could form a gel, unlike the other nucleobases, and in 1962 [2] it was discovered that four guanosines can self-assemble to form a hydrogen-bonded square, with bonds between the N1–O6 and N2–N7 positions. This structure is known as a G-tetrad or G-quartet. As with any nucleobase, these structures have a strong propensity to stack on each other via π-π interactions, forming four-stranded helices called G-quadruplexes, with the phosphate backbone perpendicular to the plane of the G-quartets. The four strands may come from separate molecules, or from only two or one, with loops joining them together [3-7]. They form with great thermal stability [8] and have been found experimentally to form from genomic sequences in critical regions such as telomeres, gene promoters and UTRs [9, 10], and to have physiological effects in each of these regions. In telomeres, their formation reduces the activity of telomerase, the upregulation of which has been associated with 85% of cancers, and this has led to much pharmaceutical interest [11]. G-quadruplexes in gene promoters, such as those of the oncogenes c-myc and c-kit [12, 13], have been shown to control transcriptional activity in vitro, although interestingly their formation can either increase or decrease activity in different systems. It has been shown that G-quadruplex formation in the 5′ UTR can decrease translational activity [14], and there have been suggestions of other physiological effects. A wide variety of proteins have been found to interact specifically with them [15], and they have been shown experimentally to form in vivo [16-18]. G-quadruplexes have also been employed as biosensors (e.g., for thrombin [19]) and in other nanotechnological applications (e.g., [20, 21]). Some of these uses are reviewed in [5].
In parallel with this experimental work, computational techniques have been developed to predict which sequences will form G-quadruplexes [22-24]. A variety of algorithmic rules can be used for this purpose [25, 26], although some are more widely used and accepted than others. There is not sufficient evidence for any of them to be held as absolutely true, and it is only recently that any work has been done to predict the relative stabilities of possible G-quadruplex structures, rather than just whether they could form or not. Despite this limitation, computational methods have led to a number of discoveries, including the observation that G-quadruplexes are relatively rare in the human genome but more prevalent than expected in gene promoters [27]. Some of the computational discoveries have been recently reviewed [25, 26]. The field as a whole has grown very significantly in recent times, with a roughly exponential rise in publications (see Figure 1), including over 350 in 2009. A dedicated book has been produced [28], together with special issues of some journals focused on this topic, and some databases on particular aspects of G-quadruplexes. A few G-quadruplex-based drugs have also entered clinical trials. A series of International Conferences has been initiated, the first two hosted in Louisville, KY [29, 30]. At the first of these, it was suggested that a central and coherent website to store and provide data related to G-quadruplexes should be produced, and we volunteered to provide such a repository [29], hosted at the URL http://www.quadruplex.org/.

Figure 1

Annual publication rate in the G-quadruplex field in the past decade. Data obtained from the Web of Knowledge, searching for the term “G-quadruplex” or “G-tetraplex.” The solid line is an exponential fit to the data.

Here, we describe the features available at that website, in particular the core databases of predicted G-quadruplexes and a new tool to estimate the thermal stability of these structures computationally. We also describe the other online sources of predictive data for G-quadruplexes, so that researchers may choose the most appropriate tool for their work.

2. QuadDB—A Database of G-Quadruplex Predictions

The core quadruplex database (QuadDB, http://www.quadruplex.org/?view=quadbase) provides both static and searchable data for researchers on computationally predicted G-quadruplexes (Putative Quadruplex Sequences, PQS). These have been generated as previously described [22], using our favoured predictive algorithm, which identifies sequences on either strand of the form G3+ N1–7 G3+ N1–7 G3+ N1–7 G3+, that is, four runs of three or more guanines separated by three loops of one to seven nucleotides of any type. This has been shown experimentally to be a good predictor of in vitro G-quadruplex formation [31]. It aims to identify specific G-quadruplexes that may form, providing a hypothesis that can be tested in vitro using simple biophysical methods.
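The search rule above can be expressed as a regular expression. The following is a minimal sketch in Python, not the quadparser implementation itself: the function names are hypothetical, matches are reported as simple non-overlapping hits, and quadparser's own handling of overlapping candidates may differ. The reverse strand is covered by searching the forward strand for the equivalent C-rich motif.

```python
import re

# Four runs of >=3 guanines separated by three loops of 1-7 nt of any base.
G_PATTERN = re.compile(
    r"G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}[ACGTN]{1,7}G{3,}"
)
# A G-rich motif on the reverse strand appears as C-runs on the forward strand.
C_PATTERN = re.compile(
    r"C{3,}[ACGTN]{1,7}C{3,}[ACGTN]{1,7}C{3,}[ACGTN]{1,7}C{3,}"
)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_pqs(sequence):
    """Return sorted (start, end, strand, g_rich_sequence) tuples.

    Coordinates are 0-based, end-exclusive, on the forward strand
    (an assumption of this sketch, not a convention from the source).
    """
    seq = sequence.upper()
    hits = []
    for m in G_PATTERN.finditer(seq):
        hits.append((m.start(), m.end(), "+", m.group()))
    for m in C_PATTERN.finditer(seq):
        # Report the G-rich strand so both strands read the same way.
        hits.append((m.start(), m.end(), "-", reverse_complement(m.group())))
    return sorted(hits)
```

For example, `find_pqs` applied to four human telomeric repeats flanked by adenines reports one hit on the `+` strand, while a sequence with only three G-runs yields no hits.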

2.1. Quadparser

For any researcher interested in identifying PQS in specific sequences, we provide the quadparser program pre-compiled for MS Windows and Mac OS X with detailed instructions. The program is customisable, so that different patterns can be searched for. Different loop length constraints, G-tract lengths and so forth may all be set, so that the algorithm can be adjusted to fit with the particular context desired. Quadparser has a variety of output styles for different uses, and reads sequence data in FASTA format.
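To illustrate how adjustable G-tract and loop constraints combine with FASTA input, here is a hedged Python sketch. It does not reproduce quadparser's actual options or output styles; `build_pattern`, `read_fasta` and all parameter names are assumptions made for this example.

```python
import re


def build_pattern(min_tract=3, min_loop=1, max_loop=7, tracts=4, base="G"):
    """Compile a motif regex from adjustable constraints (hypothetical API)."""
    if min_tract < 1 or tracts < 2 or not (0 <= min_loop <= max_loop):
        raise ValueError("inconsistent motif parameters")
    tract = "%s{%d,}" % (base, min_tract)
    loop = "[ACGTUN]{%d,%d}" % (min_loop, max_loop)
    # One leading tract, then (loop + tract) repeated for the remaining tracts.
    return re.compile("%s(?:%s%s){%d}" % (tract, loop, tract, tracts - 1))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()
```

Because `read_fasta` accepts any iterable of lines, it can be given an open file handle directly. Relaxing `min_tract` to 2, for instance, allows two-quartet motifs to be found that the default pattern would skip.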

2.2. Data Search

The Data search section allows a researcher to identify any PQS in gene promoters (defined as the 1 kb region upstream of the transcription start site, TSS) or UTRs for their gene of interest. The genes may be identified by Ensembl ID, HGNC code or description. The output provides full details of the gene, including genomic parameters, and the location and sequence of PQS in the appropriate regions of every transcript of the gene. Links are also provided to Ensembl so the PQS may be seen in context. Figure 2 displays the output when searching the human genome for PQS in the promoter or UTRs of c-kit (HGNC nomenclature KIT). Currently, searches may be performed against the human, chimpanzee and mouse genomes.
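The promoter definition used here (1 kb upstream of the TSS) depends on the strand of the gene. A small Python sketch of that calculation follows; the 1-based inclusive coordinate convention and the function names are assumptions for illustration, not details of the website's implementation.

```python
def promoter_window(tss, strand, length=1000):
    """Return the (start, end) of the region upstream of a TSS.

    Coordinates are 1-based and inclusive on the forward strand (assumed).
    Returns None if no upstream sequence exists on the chromosome.
    """
    if strand == "+":
        start, end = tss - length, tss - 1
    elif strand == "-":
        # Upstream of a reverse-strand gene lies at higher coordinates.
        start, end = tss + 1, tss + length
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)  # clamp at the chromosome start
    if end < start:
        return None
    return start, end


def overlaps(pqs_start, pqs_end, window):
    """True if a PQS interval shares at least one base with the window."""
    if window is None:
        return False
    start, end = window
    return pqs_start <= end and pqs_end >= start
```

This sketch does not clamp at the chromosome end for reverse-strand genes, since that would require the chromosome length as an extra input.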

Figure 2

Sample output from the data search component of Quadbase. The search query is shown at the bottom of the figure.

2.3. Data Download

As a convenient alternative to gene-by-gene searches or using the quadparser program, we also provide a downloadable listing of every PQS identified in various genomes. We currently offer this data for human (builds 34, 35 and 36 for backward compatibility), chimpanzee (2.1), mouse (37), rat (3.4), dog (2), chicken (2), zebrafish (7), fruitfly (5.4), roundworm (180) and yeast (1.01) genomes. In each case the data provide genomic coordinates for each PQS, together with the strand, sequence and a unique identifier. Data may be downloaded for a whole genome or by chromosome.

3. Quadpredict—Predicting G-Quadruplex Stability

The thermal stability of G-quadruplexes varies with the concentration of monovalent cations, specifically Na+ and K+. However, even for fixed concentrations, the exact details of the sequence, and hence the structure formed, make a very large difference. G-quadruplexes can vary from those which are too unstable to form at 5°C to those which will resist temperatures above 95°C [31]. It is therefore necessary to predict not just which sequences can form G-quadruplexes at all, but also the stability with which they form. Experiments to measure thermal stability are relatively easy to perform, and have led to a series of studies of different aspects of the relationship between sequence and stability [31-34]. However, such measurements alone do not enable prediction for unmeasured sequences, forcing researchers to make informed guesses as to the stability of novel sequences. We recently developed [35] a Bayesian learning algorithm that is capable of making accurate predictions of thermal stability for new sequences, having been trained on a collection of measured sequences. Full details of the methodology and the parameters considered are available elsewhere [35]. We provide an interface to this system at http://www.quadruplex.org/?view=quadpredict, enabling researchers to make easy predictions of melting temperatures under various conditions for any desired sequence. Figure 3 gives an example of such predictions.

Figure 3

Sample predictions from QuadPredict, using Bayesian inference to calculate predicted melting temperatures together with predicted uncertainties.

One feature of the Bayesian inference we use is that, in addition to predictions of the melting temperature, we also provide uncertainties in the values for each sequence. In general, the uncertainty increases for sequences that are highly unlike those in our training set. This enables researchers to decide rationally how much faith to place in a particular prediction. We intend to develop the training data further, and have already employed a rational active learning protocol to collect more data and reduce the uncertainties below those originally presented. We will continue to do this, and also provide an opportunity for researchers to contribute their own data, so that the Bayesian inference can become increasingly accurate. We hope that depositing data publicly may become a standard requirement for publication of G-quadruplex thermal data. We allow researchers to discover whether particular sequences they are interested in are already in our database of measurements, with information about exactly how such an experiment was performed. We hope that these facilities will prove useful to all those working in this field. As well as those interested in biological aspects of G-quadruplexes, we feel this facility may be particularly helpful for those working in nanotechnology or materials science, providing them with a method of rationally selecting G-quadruplex-forming oligonucleotides.
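To show why a Bayesian predictor reports larger uncertainties for sequences unlike its training data, the sketch below uses a deliberately simple stand-in: Bayesian linear regression on a single numeric feature. This is not the QuadPredict method, whose model and parameters are described in [35]. The choice of feature (for example, total loop length), the noise level and the prior width are all assumptions made purely for illustration.

```python
import math


def fit_bayes_linear(xs, ys, noise_sd=2.0, prior_sd=100.0):
    """Posterior over (intercept, slope) for y = w0 + w1*x + Gaussian noise.

    Uses a zero-mean isotropic Gaussian prior on the weights. Returns the
    posterior mean, the posterior covariance entries and the noise precision.
    """
    alpha = 1.0 / prior_sd ** 2   # prior precision
    beta = 1.0 / noise_sd ** 2    # noise precision
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = alpha*I + beta*Phi^T*Phi for features (1, x).
    a11, a12, a22 = alpha + beta * n, beta * sx, alpha + beta * sxx
    det = a11 * a22 - a12 * a12
    # Posterior covariance S = A^-1 (closed form for a 2x2 matrix).
    s11, s12, s22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean m = beta * S * Phi^T * y.
    b1, b2 = beta * sy, beta * sxy
    mean = (s11 * b1 + s12 * b2, s12 * b1 + s22 * b2)
    return mean, (s11, s12, s22), beta


def predict(model, x):
    """Return (predictive mean, predictive standard deviation) at x."""
    (m0, m1), (s11, s12, s22), beta = model
    mean = m0 + m1 * x
    # Noise variance plus the parameter uncertainty projected onto (1, x).
    var = 1.0 / beta + s11 + 2.0 * s12 * x + s22 * x * x
    return mean, math.sqrt(var)
```

In this model the predictive standard deviation is smallest near the centre of the training inputs and grows as a query moves away from them, which mirrors the behaviour described above in a much simpler setting.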

4. Other G-Quadruplex Computational Tools

There are a number of other tools that may be used to predict the existence of G-quadruplexes in DNA, and links to these are provided from http://www.quadruplex.org/. Bagga and coworkers provide QGRS Mapper [36], which uses an algorithm similar to quadparser. It has different default parameters, in particular looking at sequences with fewer consecutive guanines and longer loops, but essentially looks for much the same sequences. Interestingly, it includes a scoring parameter for the different possible G-quadruplexes that can be formed. Although this is loosely based on empirical evidence, it is not clear how the “G-score” produced, which ranges up to a maximum of 105, relates to stability. To the best of our knowledge, no empirical tests of the validity of the G-score have been performed, even as a ranking list, but it is still a useful formulation of established rules of thumb. As well as QGRS Mapper, which also provides the facility to search by gene, they provide specialised databases, GRSDB2 and GRS_UTRdb [37], for searching pre-mRNAs and UTR sequences. Maiti and coworkers offer a site called Quadfinder [38], which implements essentially the same algorithm as quadparser. (At the time of writing it does not appear to be functioning.) At the same institute, Chowdhury and coworkers have a site called QuadBase [39], again using essentially the same algorithm. They focus on cross-species analysis, offering an ortholog analysis for finding conserved G-quadruplexes across either prokaryotes (ProQuad; see also [40]) or eukaryotes (EuQuad). It should be noted that the conservation required is by presence, and no sequence comparison is performed. Lastly in this category, the Greglist database of potential G-quadruplex regulated genes lists all human genes that have a G-quadruplex in the 1 kb region upstream of the transcription start site. The quadparser algorithm is used to predict these sequences [41].
A completely different approach to G-quadruplex prediction is taken by the Maizels lab [42, 43]. Whereas other methods aim to predict specific G-quadruplex sequences, largely driven by the desire of structural biologists to have structures to study, and by the desire of medicinal chemists to have a defined form to target, the G4 calculator from Eddy and Maizels accepts that many of these structures are highly polymorphic in vivo. As a result, it does not aim to predict individual structures but looks at the density of sequences likely to lead to G-quadruplex structures. Given that this is an entirely orthogonal approach, it is striking that in many cases, particularly when studying the gene functions that are likely to be regulated by G-quadruplexes, very similar conclusions arise from this approach as from the quadparser model. We strongly recommend that for any large-scale genomic studies, both approaches are used to corroborate the results found.
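A density-style measure of this kind can be sketched as a sliding-window count of guanine runs. The Python example below is only an illustration of the general idea, not the G4 calculator itself: the window size, step, run length and threshold are illustrative defaults chosen for this sketch, and the function names are hypothetical.

```python
import re


def count_runs(window, base, min_run=3):
    """Count non-overlapping runs of at least min_run copies of base."""
    return len(re.findall("%s{%d,}" % (base, min_run), window))


def g4_density(sequence, window=100, step=20, min_run=3, min_runs=4):
    """Fraction of sliding windows holding at least min_runs G-runs.

    C-runs are counted as well, so that G-rich stretches on the reverse
    strand contribute. All defaults are assumptions of this sketch.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    if len(seq) <= window:
        starts = [0]  # treat a short sequence as a single window
    else:
        starts = range(0, len(seq) - window + 1, step)
    hits = 0
    total = 0
    for s in starts:
        w = seq[s:s + window]
        total += 1
        runs = max(count_runs(w, "G", min_run), count_runs(w, "C", min_run))
        if runs >= min_runs:
            hits += 1
    return hits / total
```

Unlike a motif search, this measure places no constraint on loop lengths within a window, so it scores a region's overall potential instead of naming a specific candidate structure.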

5. Conclusions

Computational methods have been of great use in understanding the role that G-quadruplexes may play in biology, unveiling their function in gene promoters [27, 42] and in regulating translation [44]. They have also revealed that stable G-quadruplexes are generally located in nucleosome-free regions [45]. Stability predictions have been used to develop experimental methods to directly visualise G-quadruplexes using AFM [46]. We anticipate that greater availability of ever more reliable tools will both improve the quality of informatic research in this area and make it increasingly easy for experimentalists to access computational results.

43 in total

1. T Simonsson. G-quadruplex DNA structures--variations on a theme. Biol Chem. 2001-04. (Review)

2. D Sen; W Gilbert. Guanine quartet structures. Methods Enzymol. 1992. (Review)

3. Liana Oganesian; Tracy M Bryan. Physiological relevance of telomeric G-quadruplex formation: a potential drug target. Bioessays. 2007-02. (Review)

4. Julian Leon Huppert. Four-stranded nucleic acids: structure, function and targeting of G-quadruplexes. Chem Soc Rev. 2008-05-06. (Review)

5. Antonina Risitano; Keith R Fox. Influence of loop size on the stability of intramolecular DNA quadruplexes. Nucleic Acids Res. 2004-05-11.

6. Alan K Todd; Matthew Johnston; Stephen Neidle. Highly prevalent putative quadruplex sequence motifs in human DNA. Nucleic Acids Res. 2005-05-24.

7. Johanna Eddy; Nancy Maizels. Gene function correlates with potential for G4 DNA formation in the human genome. Nucleic Acids Res. 2006-08-10.

8. Oleg Kikin; Lawrence D'Antonio; Paramjeet S Bagga. QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences. Nucleic Acids Res. 2006-07-01.

9. Ren Zhang; Yan Lin; Chun-Ting Zhang. Greglist: a database listing potential G-quadruplex regulated genes. Nucleic Acids Res. 2007-10-04.

10. Julien Gros; Frédéric Rosu; Samir Amrane; Anne De Cian; Valérie Gabelica; Laurent Lacroix; Jean-Louis Mergny. Guanines are a quartet's best friend: impact of base substitutions on the kinetics and stability of tetramolecular quadruplexes. Nucleic Acids Res. 2007-04-22.
