GRSDB: a database of quadruplex forming G-rich sequences in alternatively processed mammalian pre-mRNA sequences.

Rumen Kostadinov, Nishtha Malhotra, Manuel Viotti, Robert Shine, Lawrence D'Antonio, Paramjeet Bagga.

Abstract

Guanine-rich nucleic acids are known to form highly stable G-quadruplex structures, also known as G-quartets. Recently, there has been a tremendous amount of interest in studying G-quadruplexes owing to the realization of their biological importance. G-rich sequences (GRSs) capable of forming G-quadruplexes are found in the vicinity of polyadenylation regions and are involved in regulating 3' end processing of mammalian pre-mRNAs. G-rich motifs are also known to play an important role in alternative, tissue-specific splicing by interacting with the hnRNP H protein subfamily. Whether the quadruplex structure directly plays a role in regulating RNA processing events requires further investigation. To date there has not been a comprehensive effort to study G-quadruplexes near RNA processing sites. We have applied a computational approach to map putative Quadruplex forming GRSs within the transcribed regions of a large number of alternatively processed human and mouse gene sequences that were obtained as fully annotated entries from GenBank and RefSeq. We have used the computed data to build the GRSDB database, which provides a unique avenue for studying G-quadruplexes in the context of RNA processing sites. The GRSDB website offers visual comparison of G-quadruplex distribution patterns among all the alternative RNA products of a gene with the help of dynamic graphics. At present, GRSDB contains data from 1310 human and mouse genes, of which 1188 are alternatively processed. It has a total of 379,223 predicted G-quadruplexes, of which 54,252 are near RNA processing sites. GRSDB is a good resource for researchers interested in investigating the functional relevance of G-quadruplexes, especially in the context of alternative RNA processing. It can be accessed at http://bioinformatics.ramapo.edu/grsdb/.

Year: 2006 PMID: 16381828 PMCID: PMC1347436 DOI: 10.1093/nar/gkj073

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Guanine-rich nucleic acids are known to form higher order structures. Their ability to form highly stable quadruplex structures was discovered more than four decades ago (1). The G-quadruplex structure, also known as a G-quartet, is composed of stacked G-tetrads, which are square co-planar arrays of four guanine bases each. Cyclic Hoogsteen hydrogen bonding between the four guanines within each tetrad renders a high level of stability to the quadruplex (Figure 1). Although structures with three or more G-tetrads are considered to be more stable, many nucleotide sequences are known to form quadruplexes with two G-tetrads (2,3). G-quadruplexes may be formed by repeated folding of a single nucleic acid molecule (unimolecular G-quadruplex) or by interaction of two or four strands. The former is more likely to be encountered in physiological conditions (4,5). (The present work focuses only on the unimolecular quadruplexes.) Formation of G-quadruplexes in vivo is facilitated by proteins (6). Some proteins are also implicated in resolving the G-quadruplex structure (7,8).

Figure 1

QGRS: 5′-GGGCAGGGCAGGUGGGA-3′. Predicted intramolecular G-quadruplex formed by a GRS.

G-quadruplex sequence motifs have been reported in telomeric, promoter and other regions of mammalian genomes. Formation of a G-quadruplex in the promoter region has been associated with transcription regulation of the c-myc oncogene and is being considered as a potential target for therapeutic purposes (9,10). Owing to the realization of their biological importance, there has recently been a tremendous amount of interest in studying G-quadruplexes, as is evident from a surge in the published literature [for reviews see (8,11)]. Although initially most of the studies focused on G-quadruplexes in DNA, lately there have been many efforts to study G-quadruplex forming RNA (12–16). In fact, G-rich sequences capable of forming G-quadruplexes in RNA have been implicated in a variety of important biological activities, such as mRNA turnover (6), Fragile X Mental Retardation Protein (FMRP) binding (14), translation initiation (15) as well as repression (16). We have previously shown that a conserved auxiliary G-rich sequence (GRS) found near the polyadenylation regions can mediate efficient 3′ end processing of mammalian pre-mRNAs (17,18) by interacting with DSEF1/hnRNP H/H′ protein (19). However, hnRNP F has been shown to be a negative regulator of 3′ end processing (20). Regulated polyadenylation is an important component of differential gene expression. More than 50% of human and 32% of mouse genes are known to have alternative polyadenylation (21). An interplay among GRS-binding proteins, hnRNP H/H′ and F, helps in regulating alternative polyadenylation of immunoglobulin pre-mRNA (20) which, combined with alternative splicing, plays an important role in mouse B lymphocyte development (22). In addition to differential gene expression, alternative splicing affects disease processes (23) and is a major source of protein diversity. More than two-thirds of human genes are thought to undergo alternative splicing (24).

Members of the hnRNP H protein subfamily, which bind G-rich motifs, are known to be involved in alternative, tissue-specific, regulated splicing events (25–27). GRS motifs that are present near splice sites act as splicing regulators by interacting with hnRNP H (28). For example, binding of hnRNP H and F to G-rich tracts near the 5′ splice site favors production of the alternative pro-apoptotic Bcl-xs product (29). The regulatory G-rich motifs may be capable of forming quadruplex structures. Whether quadruplex structure directly plays a role in regulating RNA processing events requires investigation. The majority of the mammalian poly(A) region GRS sequences that we had surveyed in our previous studies (18,19) are capable of forming unimolecular G-quadruplexes. Our preliminary analysis of ∼100 alternatively processed human transcripts has also revealed the presence of quadruplex forming sequences near alternative splice sites (30). However, a more detailed investigation into the distribution of G-quadruplex sequences near RNA processing sites requires a systematic large-scale analysis of mammalian genes. Although there have been two recent surveys of quadruplexes in the human genome (31,32), to date there has not been a comprehensive effort to study G-quadruplexes near RNA processing sites. We have used a computational approach (30) to map putative G-quadruplex forming sequences within the transcribed regions of a large number of alternatively processed human and mouse genes. The fully annotated genomic nucleotide sequences are obtained from NCBI-based GenBank and RefSeq for computational analysis. Based on our analysis of alternatively spliced and alternatively polyadenylated human and mouse genes, we have built the GRSDB database. GRSDB provides a unique avenue for studying G-quadruplex forming sequences in the context of RNA processing sites.

In addition to providing data on the composition and locations of mapped quadruplexes relative to the processing sites in the pre-mRNA sequence, GRSDB offers simultaneous visual comparison of G-quadruplex distribution patterns among all the alternative RNA products of individual genes with the help of dynamically generated graphics. Researchers interested in investigating the functional relevance of G-quadruplex structure, in particular its role in regulating gene expression by alternative processing, will find GRSDB to be of great value. It allows a comprehensive large-scale analysis as well as detailed studies of individual genes. GRSDB is also a good resource for performing large-scale analysis of G-quadruplex sequence composition, including the study of loops, in the transcribed regions.

METHODS

Quadruplex forming GRS

The basic unit of study in GRSDB is the putative G-quadruplex that we have called QGRS (Quadruplex forming GRS). These sequences follow the motif $G_{x}N_{y1}G_{x}N_{y2}G_{x}N_{y3}G_{x}$. Here $G_{x}$ refers to the group of guanines (which we will refer to as a G-group) that form a complex of x stacked G-tetrads. In the individual gene entries stored in GRSDB, x is generally 2, 3 or 4. The intervening arbitrary bases, $N_{y1}$, $N_{y2}$ and $N_{y3}$, are called gaps or loops. Two sequences are said to be overlapping if their positions in the nucleotide sequence overlap. The default action of GRSDB is to display non-overlapping sequences, but the user can display all QGRS.
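The motif described above (four G-groups of x guanines separated by three loops) can be sketched as a regular-expression search. This is only an illustration, not the QGRS-Mapper implementation: the function names, the loop length limit `max_loop`, the requirement that each loop has at least one base, and the rule for picking non-overlapping hits (keep the leftmost, skip anything that overlaps it) are all assumptions made here.

```python
import re


def find_qgrs(seq, x=3, max_loop=7):
    """Return (start, sequence) for every candidate QGRS with G-groups of size x.

    Assumptions (not taken from the paper): loops are 1..max_loop bases
    long, positions are 0-based, and one candidate is reported per start
    position (the one with the shortest leading loops).
    """
    g_group = "G{%d}" % x
    loop = "[ACGTU]{1,%d}?" % max_loop
    # A lookahead lets the search report candidates that overlap each other.
    pattern = re.compile(
        "(?=(" + g_group + loop + g_group + loop + g_group + loop + g_group + "))"
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


def non_overlapping(hits):
    """Keep the leftmost hit and drop later hits that overlap a kept one.

    This greedy rule is hypothetical; the text does not say how GRSDB
    chooses among overlapping QGRS.
    """
    kept = []
    last_end = -1
    for start, sub in sorted(hits):
        if start > last_end:
            kept.append((start, sub))
            last_end = start + len(sub) - 1
    return kept


if __name__ == "__main__":
    # Sequence from the Figure 1 caption, searched with two-guanine G-groups.
    hits = find_qgrs("GGGCAGGGCAGGUGGGA", x=2)
    print(hits)
    print(non_overlapping(hits))
```

Running the search once for each value of x (2, 3 and 4) and merging the results would cover the G-group sizes mentioned in the text.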

Structure of GRSDB

GRSDB is a relational database built using MySQL. The GRSDB website can be accessed at http://bioinformatics.ramapo.edu/grsdb/. This database primarily stores information about putative G-quadruplex sequences (QGRS) for genes that are alternatively processed (either alternatively spliced or alternatively polyadenylated). GRSDB is structured to facilitate queries about alternatively processed genes and to display information on the G-quadruplex sequences contained in the transcribed regions of the gene and their locations relative to RNA processing sites. Table 1 shows the types of objects found in the database.

Table 1

Structure of GRSDB

Data type	Comments
Gene	Includes gene name, gene location, whether or not gene is alternatively spliced or alternatively polyadenylated
Product	Includes RNA product name, and numbers of introns, exons and poly(A) signals
Intron/exon map	Maps start and stop positions of introns and exons
Poly(A) map	Maps start and stop position of poly(A) signal
QGRS	Includes actual sequences, their position and G-scores
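The object types in Table 1 map naturally onto relational tables. The sketch below shows one possible layout. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server, and every table and column name is a hypothetical choice made for illustration; the actual GRSDB schema is not given in the text.

```python
import sqlite3

# Hypothetical schema mirroring the five object types of Table 1.
SCHEMA = """
CREATE TABLE gene (
    gene_id            INTEGER PRIMARY KEY,
    name               TEXT NOT NULL,
    location           TEXT,
    alt_spliced        INTEGER NOT NULL,  -- 1 if alternatively spliced
    alt_polyadenylated INTEGER NOT NULL   -- 1 if alternatively polyadenylated
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
    name       TEXT NOT NULL,
    n_introns  INTEGER,
    n_exons    INTEGER,
    n_polya    INTEGER
);
CREATE TABLE exon_intron_map (
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    kind       TEXT NOT NULL,             -- 'exon' or 'intron'
    start_pos  INTEGER NOT NULL,
    stop_pos   INTEGER NOT NULL
);
CREATE TABLE polya_map (
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    start_pos  INTEGER NOT NULL,
    stop_pos   INTEGER NOT NULL
);
CREATE TABLE qgrs (
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    sequence   TEXT NOT NULL,
    position   INTEGER NOT NULL,
    g_score    INTEGER NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# MUCDHL is described in the text as both alternatively spliced and
# alternatively polyadenylated.
conn.execute(
    "INSERT INTO gene (name, alt_spliced, alt_polyadenylated) VALUES (?, ?, ?)",
    ("MUCDHL", 1, 1),
)

# Example query: genes that are alternatively spliced and polyadenylated.
rows = conn.execute(
    "SELECT name FROM gene WHERE alt_spliced = 1 AND alt_polyadenylated = 1"
).fetchall()
tables = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(rows, tables)
```

Keeping the intron/exon and poly(A) maps in their own tables, keyed by product, is what would let a query relate each QGRS position to the processing sites of every alternative product of a gene.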

GRSDB is populated using an auxiliary program, QGRS-Mapper, that is based on previously published methods (30) and was developed using BioPerl. Once appropriate genes have been identified, this program links to GenBank or RefSeq, downloads the corresponding genomic nucleotide sequence entry of the gene, and parses the entry for product, intron, exon, poly(A) and related information. The program then processes the nucleotide sequence to find all QGRS and map their location within the gene and their distance from relevant RNA processing sites.
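The last step of this pipeline, relating each QGRS to the RNA processing sites, can be sketched as follows. The paper elsewhere treats a QGRS as near a site when it lies within 120 nt of it; how the distance is measured (here, from the closest end of the QGRS) and all function names are assumptions of this sketch rather than details of QGRS-Mapper.

```python
def splice_sites(exons):
    """Derive splice-site positions from an ordered list of (start, stop) exons.

    The 5' splice sites are taken as the ends of all exons but the last,
    and the 3' splice sites as the starts of all exons but the first.
    """
    five_prime = [stop for _, stop in exons[:-1]]
    three_prime = [start for start, _ in exons[1:]]
    return five_prime, three_prime


def nearest_site_distance(qgrs_start, qgrs_end, sites):
    """Smallest distance in nt from a QGRS interval to any site (0 if it spans one)."""
    def distance(site):
        if qgrs_start <= site <= qgrs_end:
            return 0
        return min(abs(site - qgrs_start), abs(site - qgrs_end))

    return min(distance(site) for site in sites)


def is_near(qgrs_start, qgrs_end, sites, window=120):
    """True if the QGRS lies within `window` nt of at least one site."""
    return nearest_site_distance(qgrs_start, qgrs_end, sites) <= window


if __name__ == "__main__":
    # Made-up coordinates, used only to exercise the functions.
    donors, acceptors = splice_sites([(1, 100), (301, 450), (900, 1000)])
    print(donors, acceptors)
    print(is_near(460, 480, donors + acceptors))
```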

QGRS scoring

A scoring method is applied to each QGRS. The computed score, called a G-score (30), is formulated to reward sequences with smaller, more even gaps between the G-groups in addition to larger G-group size, thereby favoring the arrangement that is more likely to form a unimolecular complex. This choice of scoring system is in agreement with the existing literature on loop structures in G-quadruplexes (31–35). In particular, the data reported in those studies point to loop sizes that tend to be small and equal or nearly equal.
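The text gives the qualitative behaviour of the G-score but not its formula. The toy function below is therefore not the published G-score. It is a made-up score, with arbitrary weights, that has the same three properties: it grows with G-group size, shrinks as the loops get longer, and shrinks as the loops become more uneven.

```python
def toy_g_score(x, loops):
    """Illustrative score only; NOT the G-score of reference (30).

    x     -- number of guanines in each G-group
    loops -- lengths of the three loops, e.g. (2, 2, 2)
    """
    if len(loops) != 3:
        raise ValueError("a unimolecular QGRS has exactly three loops")
    size_reward = 10 * x                       # larger G-groups score higher
    length_penalty = sum(loops)                # longer loops score lower
    evenness_penalty = max(loops) - min(loops) # uneven loops score lower
    return size_reward - length_penalty - evenness_penalty


if __name__ == "__main__":
    for x, loops in [(3, (1, 1, 1)), (3, (3, 3, 3)), (3, (1, 2, 3)), (2, (1, 1, 1))]:
        print(x, loops, toy_g_score(x, loops))
```

Any monotone combination of these three terms would behave the same way; the weights used here carry no biological meaning.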

Interfaces

The data flow for GRSDB is summarized in Table 2. After the gene information is downloaded from NCBI, parsed, processed for QGRS, and scored, it is then uploaded into GRSDB. At this point the database is ready for user queries. There are three different interfaces provided for viewing database contents: the gene view, the data view and the graphical view.

Table 2

Data flow of GRSDB

Database users are given a variety of options in formulating a query, including searching for genes that are alternatively spliced or alternatively polyadenylated. Once a query has been entered, a table of all genes satisfying the query is displayed. Information for individual genes is displayed in a table, as shown in Figure 2 for the gene MUCDHL. This is what we call the gene view. One can see that MUCDHL is both alternatively spliced and alternatively polyadenylated.

Figure 2

MUCDHL—Gene View. The first table in this view gives basic information about the gene being analyzed, including the alternatively spliced or polyadenylated status of the gene. The second table provides information about the first alternatively spliced RNA product of the gene, including numbers of exons/introns/poly(A) signals. Also displayed are the total number of QGRS in the product and the number near RNA processing sites (i.e. within 120 nt of a site). Located below each product table are buttons allowing the user to view the QGRS information for that product in two ways: the data view and the graphic view, shown in Figures 3 and 4. Not shown are the second product table and the button placed at the bottom of the page allowing the user to analyze all RNA products simultaneously. To view the entire screen shot, see the Supplementary Data.

At this point the user can choose to analyze one of the products or all products simultaneously. There are two types of analysis possible, the data view or graphical view. Figure 3 represents the data view analysis for Product 1 of MUCDHL, showing all non-overlapping QGRS. The table shows the location of each QGRS, its distance from the nearest splice site, and its G-score.

Figure 3

MUCDHL—Data View. A table of the mapping data for the non-overlapping QGRS in the product (refer to the text for explanation of non-overlapping) is shown. The table provides the location of QGRS in exons, the distance from 3′ and 5′ splice sites, the actual sequence, and its G-score (refer to the text for explanation). Not shown are the tables for QGRS mapping data in introns and in poly(A) regions. To view the entire screen shot, see Supplementary Data.

Alternatively, the user can select the graphical view of any product (or again, all products together), which is shown in Figure 4. A visual model of the product is displayed, showing the location of exons, introns and untranscribed regions. Further, the location of each QGRS is indicated by a vertical line. The length of the line is proportional to the G-score for the sequence.

Figure 4

MUCDHL gene—Graphic View. A visual representation for RNA Product 1 of this gene. The upper graph shows the location of exons/introns in the product, along with a scale to locate their positions. The QGRS in the product are indicated by the vertical bars, whose length is proportional to the G-scores of the QGRS. The lower graph represents a zoom-in of the RNA product 1 displaying the QGRS at position 8310. The arrows at the bottom left are used to navigate the RNA product with the interactive-zoom tool. It is possible to visually compare both the alternatively spliced products, for example, to identify differential association of QGRS with alternative sites. See Supplementary Data for product comparison.

CONCLUSIONS

GRSDB provides curated information on composition and distribution of putative QGRSs in the transcribed regions of alternatively processed human and mouse genes. The data are based on the analysis of fully annotated GenBank/RefSeq human and mouse genomic nucleotide entries that exhibit alternative processing information. Although the NCBI databases contain a large number of mRNA sequence records, the number of genomic entries that provide the information needed for our studies is currently limited. At present, our database contains information obtained from 1310 human and mouse genes, of which 1188 are alternatively processed. A total of 30 584 introns and 33 816 exons were analyzed across 3231 RNA products. These products taken together contain a total of 379 223 putative G-quadruplexes, of which 54 252 are near RNA processing sites [within 120 nt of a splice site or a poly(A) signal]. Note that while GRSDB currently contains data only on human and mouse genes, our computational tools and the database are designed to include other organisms as well. GRSDB is continuously being updated with new data entries. The database is structured to facilitate a wide variety of queries and to map G-quadruplex sequences relative to the RNA processing sites in both data and graphic formats. The user-friendly interface allows comparisons of all the alternative RNA products of individual genes on the same screen. GRSDB is a good resource for researchers interested in investigating the functional relevance of G-quadruplexes, especially in the context of alternative RNA processing. We are using the database to conduct detailed bioinformatics studies on the distribution patterns of QGRS near RNA processing sites. In particular, we are investigating whether there is a correlation between the distribution pattern of QGRS and alternative processing. Our group is also studying the loop composition of these sequences.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

34 in total

1. Downstream sequence elements with different affinities for the hnRNP H/H' protein influence the processing efficiency of mammalian polyadenylation signals.

Authors: George K Arhin; Monika Boots; Paramjeet S Bagga; Christine Milcarek; Jeffrey Wilusz
Journal: Nucleic Acids Res Date: 2002-04-15 Impact factor: 16.971

2. SR proteins and hnRNP H regulate the splicing of the HIV-1 tev-specific exon 6D.

Authors: Massimo Caputi; Alan M Zahler
Journal: EMBO J Date: 2002-02-15 Impact factor: 11.598

Review 3. G-quadruplex DNA structures--variations on a theme.

Authors: T Simonsson
Journal: Biol Chem Date: 2001-04 Impact factor: 3.915

Review 4. Pre-mRNA splicing and human disease.

Authors: Nuno André Faustino; Thomas A Cooper
Journal: Genes Dev Date: 2003-02-15 Impact factor: 11.361

5. Small change in a G-rich sequence, a dramatic change in topology: new dimeric G-quadruplex folding motif with unique loop orientations.

Authors: Martin Crnugelj; Primoz Sket; Janez Plavec
Journal: J Am Chem Soc Date: 2003-07-02 Impact factor: 15.419

6. A dimeric RNA quadruplex architecture comprised of two G:G(:A):G:G(:A) hexads, G:G:G:G tetrads and UUUU loops.

Authors: Hui Liu; Akimasa Matsugami; Masato Katahira; Seiichi Uesugi
Journal: J Mol Biol Date: 2002-10-04 Impact factor: 5.469

7. B-cell and plasma-cell splicing differences: a potential role in regulated immunoglobulin RNA processing.

Authors: Shirley R Bruce; R W Cameron Dingle; Martha L Peterson
Journal: RNA Date: 2003-10 Impact factor: 4.942

8. In vitro generated antibodies specific for telomeric guanine-quadruplex DNA react with Stylonychia lemnae macronuclei.

Authors: C Schaffitzel; I Berger; J Postberg; J Hanes; H J Lipps; A Plückthun
Journal: Proc Natl Acad Sci U S A Date: 2001-07-03 Impact factor: 11.205

Review 9. Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures.

Authors: Margarita I Zarudnaya; Iryna M Kolomiets; Andriy L Potyahaylo; Dmytro M Hovorun
Journal: Nucleic Acids Res Date: 2003-03-01 Impact factor: 16.971

10. A single internal ribosome entry site containing a G quartet RNA structure drives fibroblast growth factor 2 gene expression at four alternative translation initiation codons.

Authors: Sophie Bonnal; Céline Schaeffer; Laurent Créancier; Simone Clamens; Hervé Moine; Anne-Catherine Prats; Stéphan Vagner
Journal: J Biol Chem Date: 2003-07-11 Impact factor: 5.157
