An accurate method for locating genes under tumor suppressor p53 control that is based on a well-established mathematical theory and built using naturally occurring, experimentally proven p53 sites is essential in understanding the complete p53 network. We used a molecular information theory approach to create a flexible model for p53 binding. By searching around transcription start sites in human chromosomes 1 and 2, we predicted 16 novel p53 binding sites and experimentally demonstrated that 15 of the 16 (94%) sites were bound by p53. Some were also bound by the related proteins p63 and p73. Thirteen of the adjacent genes were controlled by at least one of the proteins. Eleven of the 16 sites (69%) had not been identified previously. This molecular information theory approach can be extended to any genetic system to predict new sites for DNA-binding proteins.
p53 is a transcription factor that acts as a tumor suppressor and modulates expression of genes related to the cell cycle, DNA repair, apoptosis and angiogenesis (1). More than half of the tumors in some cancer types have mutations in p53 (2–4). Whole genome scanning for transcription factor sites in conjunction with experimental confirmation can fill the gaps in our understanding of gene regulation networks (5,6).
El-Deiry et al. (7) proposed the p53 consensus sequence 5′-PuPuPuC(A/T)(T/A)GPyPyPy-[0-13 bp]-PuPuPuC(A/T)(T/A)GPyPyPy-3′, which consists of two decameric sequences separated by 0–13 bp. Microarray experiments have established that hundreds of genes are controlled by p53 (8–12). Attempts have been made to predict p53 binding sites using base frequency weight matrices (13–16) and hidden Markov models (17).
We present a p53 DNA binding model, based on Claude Shannon's information theory (18,19), which sharply distinguishes between specific and non-specific DNA-binding sites (20,21). This theory has been applied to genetic control systems including replication (22,23), transcription (5,24–26), splicing (27) and translation (28). It also has application beyond genetic control in characterizing molecular states and patterns in general (29,30). The information measure consistently accounts for sequence variability and conservation in universal units, bits of information (31). [A bit is a unit of information that allows one to distinguish between two states (19).] We analyzed binding sequences from two earlier studies (7,32) and from a collection of proven natural sites (i.e. naturally occurring experimentally confirmed sites).
To identify p53 response elements (REs), El-Deiry et al. selected more than 500 human genomic fragments and tested them for p53 protein binding in vitro (7). Twenty REs were identified.
Since p53 binds a decameric site as a dimer, both the sequences and their complementary strands for the two decamers of the p53 RE were used to construct an information theory model (Figure 1a). Because tetrameric p53 binds to two dimeric sites, a total of 80 sequences are available from the 20 REs.
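As a rough illustration of how such a model can be assembled, the Python sketch below pools each decamer with its complementary strand and computes the information at each position as 2 bits minus the observed uncertainty. The function names, the toy decamers and the approximate small-sample correction 3/(2 ln 2 · n) are assumptions made for this example; they are not taken from the Supplementary Data.

```python
import math
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def with_complements(decamers):
    """Pool every decamer with its complementary strand."""
    pooled = []
    for seq in decamers:
        pooled.append(seq)
        pooled.append(reverse_complement(seq))
    return pooled


def position_information(sites):
    """Information (bits) at each aligned position: 2 - (H + e(n)).

    H is the Shannon uncertainty of the observed base frequencies and
    e(n) = 3 / (2 ln 2 * n) is an approximate small-sample correction
    (an assumption of this sketch).
    """
    n = len(sites)
    correction = 3.0 / (2.0 * math.log(2) * n)
    info = []
    for column in zip(*sites):
        counts = Counter(column)
        uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - (uncertainty + correction))
    return info


# Hypothetical decamers, used only to show the call pattern.
toy_decamers = ["GGGCATGTCC", "AGACATGCCT", "GAACATGTCT"]
toy_sites = with_complements(toy_decamers)
toy_information = position_information(toy_sites)
total_information = sum(toy_information)  # area under the sequence logo
```

Applied to the 20 REs, two decamers per RE and two strands per decamer give the 80 pooled sequences mentioned above.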
Figure 1.
Decameric p53 models. The sequence logos (left) and individual information (Ri) distribution histograms (right) for individual binding sites come from (a) El-Deiry et al. (7), (b) Funk et al. (32) and (c) our collection of proven natural sites (Supplementary materials, Table 1). Rs is the total information (area of the sequence logo) and also the average of the Ri distribution. Error bars indicate the standard deviation of the information based on sample size. The peaks of sine waves (wavelength of 10.6 bp) above the logos represent the DNA major groove facing the protein, as determined by X-ray crystallography (48) and methylation interference (7). As with a number of other binding sites (22), the sequence logos of p53 follow the accessibility sine wave, especially the Funk data, which are remarkably close. On the distribution histograms a Gaussian curve with the same mean and standard deviation as the data is shown and a vertical line indicates 0 bits of information. (d) Histogram of the distance between natural decameric sites.
Another attempt to identify p53 REs used cyclic amplification and selection of targets (32). p53 was incubated with DNA containing degenerate bases, the p53–DNA complex was purified, the DNA was amplified by PCR and this cycle was repeated. A decameric model built from 17 sequenced DNAs is shown in Figure 1b.
METHODS
Methods are provided in the Supplementary Data.
RESULTS
To construct a natural model, we analyzed p53 decameric sites from 35 previously identified p53-controlled genes containing experimentally confirmed sites (Supplementary Materials, Table 1). Individual information was calculated by the method given previously (33), and sites with information less than zero were removed from the list because, according to the second law of thermodynamics, such sites cannot bind a protein specifically (30). We aligned the remaining 66 decameric sites and their complementary strands (34), created a sequence logo (27) and generated an individual information weight matrix with a range from −4 to +5 (33). We built a flexible p53 model containing two rigid decameric sites (Figure 1c) and a flexible spacer (Figure 1d) (25,28) (Supplementary Materials, Table 2). The natural model (Figure 1c) is more accurate than the consensus models of El-Deiry et al. (7) and Funk et al. (32) because counting matches to a consensus while ignoring base frequencies inappropriately treats all conserved bases as equally important for binding (35). In addition, those models were built from artificially selected sets of sites (Figure 1a and b), whereas we used only naturally occurring, experimentally confirmed sites. SELEX and other artificial selection methods can give inconsistent results (36).
The El-Deiry model (Figure 1a) is similar to the natural one (Figure 1c). We suggest that the Funk model (Figure 1b) has excess information because strong sites were selected. According to molecular information theory, Rs is the total information in the binding site model, while Rf is the information required to locate the sites in a genome (31). These two independently determined numbers are expected to converge during evolution (37). In the natural p53 model, accounting for the variable distance between the decamers reduces the total information of two decamers to Rs = 12.3 ± 3.1 bits (25,28). In comparison, El-Deiry et al. (7) found 18 REs in 530 DNA fragments ranging from 139 to 470 bp long, which requires Rf = 13.1 ± 0.8 bits. Thus, the information Rs in our binding site model is close to the information Rf needed to find the sites in the genome, as occurs for other genetic systems (31,37), suggesting that the natural model is reasonable.
The average information content of the flexible p53 model is 12.3 ± 3.1 bits and 50% of the distances between the natural p53 REs and their promoters are <300 bp, so we scanned genomic sequences around identified promoters (range −300 to +100) with the flexible p53 model, using a 12-bit cutoff. Each decameric site was required to contain at least 5 bits. We chose these parameters to identify the strongest sites. The sequences of human chromosomes 1 and 2 [ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ (38)] were scanned and 16 sites were found (Table 1; Supplementary Materials, Table 3).
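The Rf value quoted for the El-Deiry experiment can be checked with simple arithmetic. The sketch below assumes that Rf is the base-2 logarithm of the number of candidate positions per site, that every base in a fragment is a candidate position, and that the midpoint of the 139 to 470 bp range stands in for the unreported mean fragment length. These are assumptions of the example, not the authors' stated procedure.

```python
import math


def r_frequency(num_sites, num_positions):
    """Bits needed to pick num_sites out of num_positions candidates."""
    return math.log2(num_positions / num_sites)


fragments = 530            # DNA fragments tested (from the text)
sites = 18                 # REs found (from the text)
shortest, longest = 139, 470
midpoint = (shortest + longest) / 2   # assumed stand-in for the mean length

rf_mid = r_frequency(sites, fragments * midpoint)
rf_low = r_frequency(sites, fragments * shortest)
rf_high = r_frequency(sites, fragments * longest)

print(f"Rf at midpoint length: {rf_mid:.1f} bits")
print(f"Rf over the length range: {rf_low:.1f} to {rf_high:.1f} bits")
```

Under these assumptions the midpoint estimate is about 13.1 bits, in line with the reported Rf and close to Rs = 12.3 ± 3.1 bits.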
Table 1. Genes containing the predicted p53 response elements
Gene name    Information content (bits)    Description
H6PD         12.5                          Hexose-6-phosphate dehydrogenase
FLJ38753     12.0                          Hypothetical protein
LEPRE1       12.2                          Proteoglycan, potential growth suppressor
MGC955       12.2                          Hypothetical protein
RPS8         13.0                          Ribosomal protein S8
CLCA2        12.1                          Calcium-activated ion channel protein
KCNA2        14.2, 12.2                    Potassium channel protein
S100A6       12.1                          S100 calcium binding protein A6 (calcyclin)
RDH14        13.5                          Retinol dehydrogenase
DQX1         13.3                          DEAQ box polypeptide 1 (RNA-dependent ATPase)
VPS24        14.6                          Transmembrane protein sorting
PROM2        13.0                          Prominin 2
U5-200KD     12.4                          U5 snRNP-specific protein, RNA helicase
BZW1         13.3                          Basic leucine zipper protein
UGT1A6       13.9                          UDP glycosyltransferase
FLJ43374     12.6                          Hypothetical protein
The KCNA2 gene contains two p53 REs.
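The kind of scan that yields a list like Table 1 can be sketched as follows. The code scores two rigid decamers with an individual information weight matrix, subtracts a cost for the spacer length, and applies the cutoffs given in the text (at least 12 bits in total and at least 5 bits per decamer). The function names, the toy weight matrix and the spacer costs are hypothetical; the real matrix and spacer distribution are in the Supplementary Materials, and the sketch omits the second strand and the promoter window.

```python
def site_information(weights, seq):
    """Sum the per-position weights (bits) for one candidate decamer."""
    return sum(column[base] for column, base in zip(weights, seq))


def scan_flexible(seq, weights, gap_cost, total_cutoff=12.0, half_cutoff=5.0):
    """Return (start, gap, total_bits) for every two-part site above cutoff.

    weights  : list of dicts, one per decamer position, base -> bits
    gap_cost : dict, spacer length -> bits subtracted for that spacing
               (hypothetical values in this sketch)
    """
    width = len(weights)
    hits = []
    for start in range(len(seq) - 2 * width + 1):
        left = site_information(weights, seq[start:start + width])
        if left < half_cutoff:
            continue
        for gap, cost in gap_cost.items():
            second = start + width + gap
            if second + width > len(seq):
                continue
            right = site_information(weights, seq[second:second + width])
            if right < half_cutoff:
                continue
            total = left + right - cost
            if total >= total_cutoff:
                hits.append((start, gap, round(total, 2)))
    return hits


# Toy matrix: +1 bit for the base of a made-up decamer, -1 bit otherwise.
toy_decamer = "GGGCATGCCC"
toy_weights = [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in toy_decamer]
toy_gap_cost = {0: 1.0, 1: 2.0}   # hypothetical spacer costs in bits
toy_sequence = "TT" + toy_decamer + "A" + toy_decamer + "TT"

toy_hits = scan_flexible(toy_sequence, toy_weights, toy_gap_cost)
```

Keeping the per-decamer cutoff separate from the total cutoff prevents one very strong half-site from compensating for a half-site that carries almost no information.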
There are two more proteins in the p53 family, p63 and p73, which are involved in cell cycle arrest and apoptosis (39). These transcription factors have DNA-binding domains similar to that of p53 and therefore bind some of the p53 REs that activate transcription of p53-dependent genes. Because our initial data set may contain binding sites for these additional proteins, the sites predicted by the genomic scan were tested by electrophoretic mobility shift assays (EMSA) with the p53, p63 and p73 proteins (Figure 2). p53 bound all predicted sites except S100A6. p63 did not bind the KCNA2-2, S100A6 and BZW1 sites and bound the LEPRE1 and KCNA2-1 sites weakly; all other sites formed stable complexes with p63. p73 bound all sites except S100A6. Therefore, 15 of the 16 sites showed affinity for the p53, p63 or p73 proteins.
Figure 2.
Electrophoretic mobility shift assays (EMSA) with hairpin oligonucleotides containing predicted p53 binding sites (Supplementary materials, Table 6) using the p53, p63 and p73 proteins. The bottom bands are unbound oligonucleotides, and the top bands are protein–oligonucleotide complexes. Names of the predicted genes are marked on the top. The ‘PCNA’ oligonucleotide containing the p53 RE from the promoter of the PCNA gene (49) is a positive control. The KCNA gene contains three close decameric p53 sites. Oligonucleotide ‘KCNA-1’ contains sites 1 and 2, oligonucleotide ‘KCNA-2’ contains sites 2 and 3. The ‘Con’ oligonucleotide containing the consensus p53 binding site is a positive control. ‘Anti-con’ has no p53 binding sites and is a negative control.
We used a luciferase reporter assay to confirm the predicted sites in human cells. Promoters containing each site were cloned upstream of the luciferase gene, the plasmids were co-transfected with expression vectors for p53, p63, p73 and a negative control into HEK293 cells, and luciferase activity was measured (Figure 3). Seven genes (CLCA2, FLJ43374, UGT1A6, FLJ38753, KCNA2, PROM2 and H6PD) were activated >5-fold by at least one of the proteins. Five genes (RPS8, DQX1, VPS24, RDH14 and U5-200KD) showed 2- to 5-fold activation by at least one of the proteins. Two divergently expressed genes (MGC955 and LEPRE1), which contain a common p53 RE, showed ∼2-fold repression by p53 but not by p63 or p73. Two genes (S100A6 and BZW1) showed no regulation by p53, p63 or p73. Therefore, by the luciferase reporter assay, 14 genes out of the 16 were activated or repressed by at least one of the proteins.
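The grouping used above can be expressed as a small rule on the ratio of induced to non-induced signal. In the sketch below, the activation bands (>5-fold and 2- to 5-fold) follow the text, while the repression threshold of 0.5, the handling of exactly 2-fold and 5-fold ratios, and the function name are assumptions made for illustration.

```python
def classify_fold(induced, control):
    """Bin a reporter response by the induced / non-induced signal ratio."""
    ratio = induced / control
    if ratio > 5:
        return "activated >5-fold"
    if ratio >= 2:
        return "activated 2- to 5-fold"
    if ratio <= 0.5:          # assumed threshold for about 2-fold repression
        return "repressed"
    return "no clear regulation"


# Hypothetical luciferase readings in arbitrary units.
example = classify_fold(induced=300.0, control=100.0)
```

A gene counts as regulated if any of the three proteins places it outside the "no clear regulation" band.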
Figure 3.
Transcriptional regulation of genes containing the predicted binding sites by p53, p63 and p73. The charts show the ratio between induced and non-induced luciferase signals (top chart) and qPCR signals (bottom chart). Rectangles at the left side represent promoter induction fold for p53, p63 and p73 proteins. The white area in the rectangles means that the signals were undetectable. Error bars indicate the standard deviation of the luciferase signal from two experiments.
Total RNA from p53-, p63- and p73-transfected cells was isolated and analyzed by qPCR using primers for detection of the 16 predicted genes and the GAPDH and actin controls (Supplementary Materials, Table 4). The qPCR data (Supplementary Materials, Tables 5A–D) are summarized in Figure 3. Four genes (CLCA2, FLJ43374, KCNA2 and PROM2) were activated >5-fold. Three genes (FLJ38753, DQX1 and VPS24) showed 2- to 5-fold activation by at least one of the proteins. Five genes (RPS8, RDH14, U5-200KD, S100A6 and LEPRE1) showed no activation by p53, p63 or p73. Expression of the UGT1A6, MGC955 and BZW1 genes was undetectable by qPCR. Although H6PD appears to be repressed (Supplementary Materials, Table 5D), the computed errors are larger than the averages, so we did not report it in Figure 3. Therefore, by qPCR 7 out of the 16 genes showed regulation by the proteins.
Some of our findings are consistent with microarray data: CLCA2 is p53-, p63- and p73-inducible by 40-fold or more (40), VPS24 is p53-inducible by 2-fold (16), and DQX1 and PROM2 were upregulated in human mammary epithelial cells after expression of p53 (41). The S100A6 gene was not induced by p53 (12), which is consistent with our results. The other genes we found have not been analyzed or did not show induction or repression in microarray experiments. Our method located 11 previously unidentified REs.
The one exception, S100A6, was not bound by any of the proteins in vivo or in vitro. We suggest that this site has an unusual DNA structure that blocks binding.
We searched all transcription starts in the human Reference Sequence [GenBank accessions NC_000001 to NC_000024, 2006, build 36 version 2 (42)] using the same parameters as before, and identified 198 potentially controlled genes (Supplementary Materials, Table 8). Two REs were missing compared with our previous search. The transcription start of H6PD is now annotated to be 10121 bp upstream, leaving the p53 site in the middle of the gene just after the start of the second exon, yet it activates luciferase expression in vivo (Figure 3). Genetic controls inside genes are known; an example for p53 is in the LIF gene (43). However, there are other possible explanations for this activation. Therefore, only 13 of the 16 genes are clearly activated by p53. Likewise, in the new sequence the annotated transcription start for S100A6 is shifted 214 bp upstream (old: NT_086596 referring to the mRNA NM_014624.2; new: NC_000001.9 referring to NM_014624.3), placing the predicted p53 site outside the searched region (−300 to +100), which explains why it was not located in the new search. This location does not explain the lack of p53, p63 or p73 binding to the sequence.
DISCUSSION
We used a proven mathematical method, molecular information theory, to measure DNA binding site conservation in universal units, bits. Not all positions in a site are equally important for protein binding (35). Information content measures the conservation of bases and allows for comparison of different positions (Figure 1). Summing the information content of the bases across the site gives the total information content of the site, which is an important physiological parameter (31,37). In contrast, summing any other function gives inconsistent results, and the total site information cannot be calculated (33). Because of its mathematical consistency, the molecular information theory-based flexible p53 model gave accurate gene predictions.
p53, p63 and p73 are known to have different affinities to different sequences (41,44) and they can be both activators and repressors (45–47). This is consistent with our data. Our model, built from experimentally proven, naturally occurring sites, is a combined model for p53, p63 and p73 that does not distinguish between the proteins. In order to build models specific for each particular protein, one would have to have three sets of sites, one for each of p53, p63 and p73. In contrast, the model built using El-Deiry's set is a pure p53 model. The differences in conservation pattern between the models (Figure 1a versus 1c) may reflect the contribution of p63 and p73 proteins in our natural model. Even so, we used three different experimental methods with all three proteins to test the sites predicted using the natural model, and were able to show that all of the sites except one have in vitro or in vivo activity. Such a precise and reliable method for prediction of the p53 family response elements will allow further discovery of cancer-related genes.