
The relationship of potential G-quadruplex sequences in cis-upstream regions of the human genome to SP1-binding elements.

Abstract

We have carried out a survey of potential quadruplex structure sequences (PQSS) that occur in the immediate upstream region (500 bp) of human genes. By examining their number and distribution we have established that there is a clear link between them and the occurrence of the SP1-binding element 'GGGCGG', such that a large number of upstream PQSS incorporate the SP1-binding element.


Substances:
Sp1 Transcription Factor

Year: 2008 PMID: 18353860 PMCID: PMC2377421 DOI: 10.1093/nar/gkn078

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Certain guanine-rich DNA sequences have the ability to form stable secondary structures, G-quadruplexes, comprising G-tetrad motifs that involve four guanines arranged in a planar array interacting via Hoogsteen hydrogen bonds (1,2). These G-tetrads are stable when a number are stacked on top of one another. Numerous topologies have been observed for G-quadruplexes (3) and a number of studies have mapped out potential quadruplex-forming sequences in genomic DNAs (4–11). The search criteria for most of these surveys have been sequences that contain four or more runs of G-tracts occurring close together on the same strand. Most of the observed topologies follow this pattern (12,13). These potential quadruplex structure sequences (PQSS) have been found to occur with elevated frequency in regions directly upstream of the transcription start site of genes in species as diverse as Escherichia coli (8) and humans (11). There are also a number of specific promoter regions for which there is biophysical and structural evidence for the formation of PQSSs (for example, 11–18), at least in vitro.
The zinc finger protein SP1 is a transcription factor that has been shown to bind to the upstream element sequence ‘GGGCGG’ (19–21). This element has been found in many different promoters, often with a copy number >1 (19,20) and often within the first 100 bp upstream of the transcription start site. The consensus sequence has been shown to be ‘GGGGCGGGGC’ (22,23). The fact that it is guanine-rich, with consecutive guanines, gives it the ability to participate in PQSSs, at least in principle. We show here that many of the PQSSs found in the regions directly upstream of the transcription start site actually contain the SP1 consensus sequence and that there is a correlation between the presence of SP1 elements and the presence of PQSSs in the upstream regions of genes.
Another common upstream promoter element ‘CCAAT’ occurs in a similarly large number of promoters but does not contribute as much guanine-richness as the SP1-binding element and can be employed as a useful comparative group.

METHODS

The MySQL tables for the Ensembl human core database v45 (24) were downloaded from the Ensembl web site (www.ensembl.org) and imported onto a local computer. The human genome was searched for PQSSs in the same way as described earlier (5) and the PQSS data were uploaded into MySQL tables. The search criterion was four runs of guanines, $G_{m}X_{n}G_{m}X_{o}G_{m}X_{p}G_{m}$, where m is between 3 and 5, and X is any combination of bases, with n, o and p between 1 and 7. Using Perl scripts and the Ensembl Perl API (25), a list of all genes with Ensembl status ‘known’ was compiled and the 500 bp upstream flanking sequences were extracted. These were searched for the sequences ‘GGGCGG’ and ‘CCAAT’ as well as their complementary sequences, and the genes were grouped into the following categories: (i) genes with ‘GGGCGG’ in the region 500 bp upstream of their transcription start site (500USR); (ii) genes with ‘CCAAT’ in their 500USR; (iii) genes with both in their 500USR; (iv) genes with ‘GGGCGG’ but not ‘CCAAT’ in their 500USR; (v) genes with ‘CCAAT’ but not ‘GGGCGG’ in their 500USR; (vi) genes with neither sequence in their 500USR. The MySQL table of PQSSs was used to count the PQSSs in the upstream regions of the genes in each category, and their distances (−1 bp to −500 bp) from the transcription start sites were noted. In order to represent these graphically, the positions of the quadruplex sequences were put into bins of 10 base pairs. When looking at the number of PQSSs we counted each contiguous potential quadruplex region as a single sequence element, i.e. if more than one PQSS overlapped, these were counted as one sequence element. For example the sequence GGGAGGCGGGCGGTGGGGGGGTGGGGGTGGGG, which is in the upstream region of the Ensembl gene ENSG00000204219 (HGNC symbol TCEA3), was counted as one sequence element. Even though it contains up to six runs of guanines, we assume that only one quadruplex structure at a time can be formed in this region.
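The search and counting rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Perl code used in the study: the function names are hypothetical, the regular expression assumes that each G-tract length may vary independently between 3 and 5, and only the strand passed in is scanned for PQSSs.

```python
import re

# Assumed reading of the criterion: four G-tracts of 3-5 guanines separated
# by three loops of 1-7 bases. The lookahead lets matches overlap.
PQSS = re.compile(
    r"(?=(G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}))"
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def pqss_elements(seq):
    """Return (start, end) spans, merging overlapping or touching PQSS
    matches so that each contiguous region counts as one element."""
    spans = sorted((m.start(1), m.end(1)) for m in PQSS.finditer(seq.upper()))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def has_motif(seq, motif):
    """True if the motif occurs on either strand of seq."""
    seq = seq.upper()
    return motif in seq or reverse_complement(motif) in seq


def classify(upstream):
    """Return (has_SP1, has_CCAAT) for an upstream flanking sequence."""
    return has_motif(upstream, "GGGCGG"), has_motif(upstream, "CCAAT")


tcea3 = "GGGAGGCGGGCGGTGGGGGGGTGGGGGTGGGG"
print(pqss_elements(tcea3))  # one merged element covering the whole sequence
print(classify(tcea3))
```

The six gene categories then follow from the two booleans returned by `classify`, and the number of sequence elements for a gene is the length of the list returned by `pqss_elements`.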
The numbers of cytosines and guanines in the upstream regions of all known genes were also derived for each of the bins. The cytosine and guanine data were normalized, as was the total quadruplex distribution, so that a comparison of the relative distributions could be made, since the absolute number of cytosines and guanines was much higher. Our database of PQSSs was also searched for PQSSs that incorporated the SP1 consensus sequence, so that the distributions of PQSSs containing the SP1 consensus sequence and those not containing it could be determined.
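
The binning and normalization steps can be illustrated as follows. This is a minimal sketch with hypothetical function names. The text does not state how the normalization was done, so dividing each bin by the total count (so that each distribution sums to 1) is an assumption.

```python
def bin_positions(distances, region=500, width=10):
    """Count upstream positions in bins of `width` bp.

    `distances` are positive offsets upstream of the transcription start
    site, so 1 means position -1 and 500 means position -500.
    """
    counts = [0] * (region // width)
    for d in distances:
        if 1 <= d <= region:
            counts[(d - 1) // width] += 1
    return counts


def bin_label(index, width=10):
    """Label a bin in the style used in the text, e.g. '-50 to -41'."""
    return f"-{(index + 1) * width} to -{index * width + 1}"


def normalize(counts):
    """Scale counts so they sum to 1 (assumed normalization)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


counts = bin_positions([1, 10, 11, 45, 500])
print(bin_label(4), counts[4])
print(normalize(counts)[:5])
```

Normalizing this way puts the PQSS, guanine and cytosine distributions on the same scale, which is what allows their shapes to be compared despite very different absolute counts.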

RESULTS AND DISCUSSION

Figure 1 shows the number of PQSS elements per gene for a number of different groupings of genes. Bars A and B show that PQSSs in upstream regions of genes that contain SP1 elements are much more common than PQSSs in upstream regions of genes without SP1 elements. In bars C and D we see that there is no such relationship in genes whose upstream regions contain CCAAT elements. If anything the reverse is true, and genes without CCAAT elements contain more PQSSs. This trend is repeated in bars E, F, G and H, where the groups in E and G contain SP1 elements and have many more PQSSs per gene than those in F and H. The trend of fewer PQSSs in the upstream regions with CCAAT is also apparent here. In bar I the number of PQSS elements per gene in the upstream regions of all genes is much lower than for genes that contain SP1 sequence elements and higher than for those that do not.

Figure 1.

PQSS sequence elements per gene for potential quadruplex sequences in upstream regions (A) containing SP1 elements, (B) not containing SP1 elements, (C) containing CCAAT elements, (D) without CCAAT elements, (E) containing CCAAT and SP1 elements, (F) containing CCAAT and no SP1 elements, (G) containing SP1 and no CCAAT elements, (H) containing neither SP1 nor CCAAT elements, (I) of all known Ensembl genes.

Figure 2 shows the distribution of the promoter-binding elements SP1 (GGGCGG) and CCAAT with distance from the transcription start site. The shape of this graph shows that the frequency of sequence elements rises steeply, reaching a peak in the −50 to −41 range for SP1 and in the −70 to −61 range for CCAAT, before falling off gradually.

Figure 2.

(A) SP1 sequence elements, which occur in upstream regions, (B) CCAAT sequence elements, which occur in upstream regions.

The distribution of PQSSs in the 500 bp upstream region of transcription start sites is very similar to that of the regulatory motifs in Figure 2. Figure 3A shows the distribution for PQSSs in the upstream region of all Ensembl genes with status ‘known’; the maximum peak is in the same region as the peak for the distribution of SP1-binding elements, −50 to −41 bases. A search for the SP1-binding site motif revealed that of the 22 633 known genes, just over half (52.5%) contained the motif in their upstream region (Table 1). However, in Figure 3B we can see that this set of genes accounts for the vast majority of PQSSs in upstream regions (86.6%). Not only is the absolute number of PQSSs different, but the distributions of PQSSs in the upstream regions of genes with and without the SP1 consensus sequence differ markedly, as seen in Figure 3B and C, respectively. The PQSSs in non-SP1 upstream regions have a much flatter distribution than those of the SP1 motif genes.

Figure 3.

Table 1.

Summary of quadruplex occurrences in upstream regions of the human genome

Quadruplex occurrences	Number of genes	Number of sequence elements
SP1 sequence elements which occur in upstream regions	11 872	29 991
PQSSs in upstream regions which contain SP1 sequence elements	5415	8660
Genes without SP1 consensus sequence in their upstream regions	10 761
PQSSs in upstream regions which do not contain SP1 sequence elements	1096	1335
PQSSs which incorporate SP1 sequence elements and are within upstream regions	3596	4721
PQSSs which do not incorporate SP1 sequence elements and are within upstream regions	4150	5274
CCAAT sequence elements which occur in upstream regions	10 963	17 574
PQSSs in upstream regions containing CCAAT sequence elements	2718	3948
Genes whose upstream regions contain no CCAAT sequence elements	11 670
PQSSs in upstream regions containing no CCAAT sequence elements	3794	6047
Genes whose upstream regions contain CCAAT and SP1	5451
PQSSs in upstream regions containing SP1 and CCAAT sequence elements	2236	3368
Genes whose upstream regions contain CCAAT and no SP1	5512
PQSSs in upstream regions containing CCAAT and no SP1 sequence elements	482	580
Genes whose upstream regions contain SP1 and no CCAAT	6421
PQSSs in upstream regions containing SP1 and no CCAAT elements	3180	5292
Genes whose upstream regions contain neither SP1 nor CCAAT	5249
PQSSs in upstream regions containing neither SP1 nor CCAAT elements	614	755
All genes	22 633
PQSSs in all upstream regions	6512	9995
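
The percentages quoted in the text, and per-gene values of the kind plotted in Figure 1, can be recomputed from Table 1. The sketch below assumes that "PQSS elements per gene" means the number of PQSS sequence elements in a group divided by the number of genes in that group. The dictionary keys and function names are illustrative.

```python
# (genes in group, PQSS sequence elements in their upstream regions), from Table 1
GROUPS = {
    "SP1": (11872, 8660),
    "no SP1": (10761, 1335),
    "CCAAT": (10963, 3948),
    "no CCAAT": (11670, 6047),
    "all genes": (22633, 9995),
}


def pqss_per_gene(group):
    """PQSS sequence elements per gene for one group (assumed definition)."""
    genes, elements = GROUPS[group]
    return elements / genes


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


for name in GROUPS:
    print(f"{name:10s} {pqss_per_gene(name):.2f} PQSS elements per gene")

print(percent(11872, 22633))  # known genes with an upstream SP1 element: 52.5
print(percent(8660, 9995))    # upstream PQSSs located in those genes: 86.6
print(percent(4721, 9995))    # upstream PQSSs incorporating the SP1 element: 47.2
```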

Potential quadruplex sequences (A) occurring in upstream regions, (B) in upstream regions which contain SP1 sequence elements, (C) in upstream regions which do not contain SP1 sequence elements, (D) which incorporate SP1 sequence elements and are within upstream regions, (E) which do not incorporate SP1 sequence elements and are within upstream regions.

Since the SP1 consensus sequence is guanine-rich and can be incorporated into PQSSs, we examined the number of PQSSs that contained the SP1 consensus sequence. These account for just under half the total PQSSs (47.2%). The distributions of PQSSs incorporating the SP1 sequence (Figure 3D) and of PQSSs without the SP1 consensus sequence (Figure 3E) are very different. Figure 3D shows a distribution similar to that of the SP1 motif, while that in Figure 3E is much flatter. For a random sequence the probability of finding a PQSS is related to its guanine content, so we looked at the guanine density within upstream regions. Figure 4 shows the normalized distributions of PQSSs (A), guanine bases (B) and cytosine bases (C). Both guanine and cytosine do indeed become more frequent closer to the transcription start site, although it is hard to say whether this is related to the PQSS distribution.

Figure 4.

Normalized distributions in the 500 bp upstream regions of Ensembl genes with status ‘known’ of (A) PQSSs, (B) guanines, (C) cytosines.

Figure 5 shows the distributions of PQSSs for genes that contain the regulatory element CCAAT in their upstream region (Figure 5B) and for genes that do not contain CCAAT in their upstream region (Figure 5C). The distribution of PQSSs in CCAAT-containing upstream regions, although having a maximum at the same place as the CCAAT elements themselves (151 elements in the −70 to −61 region), has virtually the same number (149) in the region of the peak of the SP1 consensus sequence distribution (−50 to −41 bases). The PQSSs that occur in genes without upstream CCAAT elements have a peak in the same region as the SP1 sequence elements.

Figure 5.

(A) CCAAT sequence elements occurring in upstream regions. (B) PQSSs in upstream regions containing CCAAT sequence elements. (C) PQSSs in upstream regions containing no CCAAT sequence elements.

Figure 6 shows the effect of the presence of SP1 and CCAAT sequence elements on the PQSS distribution, focusing on the distribution in additional groupings of genes. PQSSs upstream of genes with upstream SP1 elements but no upstream CCAAT elements have a distribution very similar to that of the SP1 element and to the PQSS distribution of all known Ensembl genes (Figure 6A). There are very few PQSSs in genes that contain upstream CCAAT sequences but no SP1 elements, and their distribution is very flat (Figure 6B). Genes with both upstream SP1 and CCAAT elements have many more PQSSs, but without such a distinct maximum (Figure 6C), and genes with neither upstream SP1 nor CCAAT elements have very few PQSSs and a rather flat distribution (Figure 6D).

Figure 6.

(A) PQSSs in upstream regions containing SP1 but no CCAAT sequence elements. (B) PQSSs in upstream regions containing CCAAT but no SP1 sequence elements. (C) PQSSs in upstream regions containing SP1 and CCAAT sequence elements. (D) PQSSs in upstream regions containing neither SP1 nor CCAAT sequence elements.

The distribution of PQSSs resembles that of regulatory motifs such as SP1 and CCAAT, although it would appear that upstream SP1 elements have a positive effect on the number of upstream PQSSs while the presence of CCAAT has a deleterious effect. Almost half of the total upstream PQSSs have the SP1 consensus sequence incorporated. That PQSSs are linked to SP1 sequence motifs in the upstream regions is perhaps unsurprising, but what is not necessarily so obvious is how dominant this effect is. It has been proposed that induction of quadruplex formation in promoter sequences by quadruplex-selective small molecules can be a viable therapeutic strategy (26). The present study provides support for this approach, and suggests that effort could be focused on those genes in which PQSSs are linked to SP1 sites, as is the case, for example, for the c-kit gene implicated in gastrointestinal cancers (15,27). A very recent report (28), in contrast, finds G-rich sequences in the first intron of many human genes, and considers that these are more likely to be PQSSs suitable for therapeutic intervention, in part because of the potential for structural polymorphism in the upstream sites. It is not clear that this would be a problem, since it is likely that small molecule binding would tend to drive the equilibrium towards discrete quadruplex species. In addition, some PQSS sites, such as those in the c-kit promoter (15,27), comprise isolated runs of just four G-tracts each, and are much less likely to participate in quadruplex polymorphism.
We also note that the presence of the zinc finger motif in SP1 may be significant in view of findings that the motif has been selected out from phage libraries to bind to quadruplex DNAs (29–31), and that transcription factors containing zinc fingers have been reported to bind to G-tract promoter sequences, notably the insulin promoter factor Pur-1/MAZ (32).

30 in total

1. Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription.

Authors: Adam Siddiqui-Jain; Cory L Grand; David J Bearss; Laurence H Hurley
Journal: Proc Natl Acad Sci U S A Date: 2002-08-23 Impact factor: 11.205

Review 2. The Ensembl core software libraries.

Authors: Arne Stabenau; Graham McVicker; Craig Melsopp; Glenn Proctor; Michele Clamp; Ewan Birney
Journal: Genome Res Date: 2004-05 Impact factor: 9.043

3. Target Detection Assay (TDA): a versatile procedure to determine DNA binding sites as demonstrated on SP1 protein.

Authors: H J Thiesen; C Bach
Journal: Nucleic Acids Res Date: 1990-06-11 Impact factor: 16.971

4. The promoter-specific transcription factor Sp1 binds to upstream sequences in the SV40 early promoter.

Authors: W S Dynan; R Tjian
Journal: Cell Date: 1983-11 Impact factor: 41.582

5. Control of eukaryotic messenger RNA synthesis by sequence-specific DNA-binding proteins.

Authors: W S Dynan; R Tjian
Journal: Nature Date: 1985 Aug 29-Sep 4 Impact factor: 49.962

6. Unusual DNA structure of the diabetes susceptibility locus IDDM2 and its effect on transcription by the insulin promoter factor Pur-1/MAZ.

Authors: A Lew; W J Rutter; G C Kennedy
Journal: Proc Natl Acad Sci U S A Date: 2000-11-07 Impact factor: 11.205

7. The repeated GC-rich motifs upstream from the TATA box are important elements of the SV40 early promoter.

Authors: R D Everett; D Baty; P Chambon
Journal: Nucleic Acids Res Date: 1983-04-25 Impact factor: 16.971

8. Formation of parallel four-stranded complexes by guanine-rich motifs in DNA and its implications for meiosis.

Authors: D Sen; W Gilbert
Journal: Nature Date: 1988-07-28 Impact factor: 49.962

9. Inhibition of human telomerase activity by an engineered zinc finger protein that binds G-quadruplexes.

Authors: Sachin D Patel; Mark Isalan; Gérald Gavory; Sylvain Ladame; Yen Choo; Shankar Balasubramanian
Journal: Biochemistry Date: 2004-10-26 Impact factor: 3.162

10. Conserved elements with potential to form polymorphic G-quadruplex structures in the first intron of human genes.

Authors: Johanna Eddy; Nancy Maizels
Journal: Nucleic Acids Res Date: 2008-01-10 Impact factor: 16.971
