Quantifying the binding energy in DNA-protein interactions is of critical importance to understand transcriptional regulation. Based on a simple computational model, this study describes a high-throughput percentage-of-binding strategy to measure the binding energy in DNA-protein interactions between the Shewanella oneidensis ArcA two-component transcription factor protein and a systematic set of mutants in an ArcA-P (phosphorylated ArcA) binding site. The binding energies corresponding to each of the 4 nt at each position in the 15-bp binding site were used to construct a position-specific energy matrix (PEM) that allowed a reliable prediction of ArcA-P binding sites not only in Shewanella but also in related bacterial genomes.
Quantifying the binding energy in DNA-protein interactions is of critical importance to understand transcriptional regulation. Based on a simple computational model, this study describes a high-throughput percentage-of-binding strategy to measure the binding energy in DNA-protein interactions between the Shewanella oneidensisArcA two-component transcription factor protein and a systematic set of mutants in an ArcA-P (phosphorylated ArcA) binding site. The binding energies corresponding to each of the 4 nt at each position in the 15-bp binding site were used to construct a position-specific energy matrix (PEM) that allowed a reliable prediction of ArcA-P binding sites not only in Shewanella but also in related bacterial genomes.
Transcription factor proteins can recognize and bind a collection of similar DNA sequences with various affinities (1). This degenerate binding ability renders cells capable of controlling thousands of genes with relatively few regulatory proteins (2). Degenerate binding, however, poses a significant challenge for understanding the mechanism of transcriptional regulation, especially in terms of identifying new binding sites (i.e. site discovery).During the past two decades, numerous computational site-discovery methods have been developed. However, it is still a challenge to predict transcription factor binding sites (3–5). One explanation for the difficulty is that computational predictions are usually based on sequence conservation of transcription factor binding sites rather than thermodynamic parameters that govern DNA–protein interactions. Experimental data, such as that obtained from footprinting assays and transcriptional profiling, can greatly increase the accuracy of computational predictions (6). Obtaining sufficient high quality experimental data, however, is a work-intensive task. For high-throughput experimental methods such as various ChIP-based approaches (7–10), a high-quality antibody and multiple experimental steps are usually necessary.Although the in vivo occupancy of cis-regulatory elements may be affected by many factors (11), the occurrence and strength of a DNA–protein interaction is ultimately determined by whether it is a thermodynamically favored reaction. Thus, measuring a DNA–protein binding constant and thereby the binding energy of an interaction represents a crucial step towards understanding transcriptional regulation. Although experimental approaches for measuring these thermodynamic parameters are well established, high-throughput methods have not yet been extensively developed. The few available medium- or high-throughput experimental methods, including SPR (surface plasmon resonance) (12), microwell-based assays (13), displacement of DNA binding dye (14), MITOMI (mechanically induced trapping of molecular interactions) (15,16) and competition assays (17–20) are limited by various factors, such as cost, sensitivity and special protein/DNA constructs. To date, the most commonly used experimental approach remains the time consuming curve-fitting method.The goal of this study was to develop an effective general approach for a rapid and accurate genome-scale prediction of transcription factor binding sites using ArcA as a model system. The ArcA transcription factor belongs to the canonical ArcA/B two-component system in which ArcB is a membrane associated histidine kinase and ArcA is a downstream transcription response regulator (21). As a major oxygen response regulator, the ArcA protein is well conserved in many Gram-negative bacteria including Shewanella oneidensis MR-1, which is a model organism for bioremediation studies (22). Recent studies, however, indicate the ArcA protein may regulate a different set of genes in S. oneidensis than those regulated in Escherichia coli (23,24). In order to determine the sequence requirements for ArcA-P binding, systematic mutagenesis of an ArcA-P binding site and subsequent quantitation of binding energy of each mutant was performed. Mathematical modeling indicated that, in principle, the binding energy in DNA–protein interactions can be determined using a simple percentage-of-binding approach instead of curve fitting. By applying this method to the traditional electrophoretic mobility shift assay (EMSA) and a recently developed protein binding microarray (PBM) technology (25), the DNA sequence requirements and associated binding energies for the ArcA-P protein were systematically determined by a simple one-step binding assay and this experimental information was used to construct a position-specific energy matrix (PEM) for genome-scale prediction of binding sites.
MATERIALS AND METHODS
Computational model of DNA–protein interactions
In a DNA–protein interaction: [L]+[P]↔[LP], where [L], [P] and [LP] represent the concentration of free (or unbound) DNA ligand, free protein and the DNA–protein complex, respectively. If it is assumed that the pressure and temperature do not change in the binding reaction, the Gibbs binding energy ΔG can be then determined by Equation (1) and the dissociation constant Kd by Equation (2) when the binding reaction reaches equilibrium, in which R, T, x and 1−x represent the gas constant, the absolute temperature, the fraction of DNA ligand bound to protein and the fraction of free DNA ligand, respectively (26,27). By combining equations, the ΔG can be calculated according to Equation (3).
Based on Equation (3), the binding energy of any DNA–protein interaction, such as the ΔG1 and ΔGRef reactions, can be determined using Equations (4) and (5), respectively. The relative binding energy (ΔΔG) between ΔG1 and ΔGRef reactions can then be calculated using Equation (6). If the free protein concentration is kept constant in these two binding reactions (i.e. [P]1 = [P]Ref), Equation (6) can be simplified as Equation (7), where ΔΔG is determined by the percentage of DNA ligand bound. In actual binding reactions, it is very difficult to keep the free protein concentration constant. It is possible, however, to achieve an approximately constant protein concentration by performing assays with a protein concentration that is much higher than the DNA ligand concentration.
EMSA with systematically mutated SO1661 promoters
The purification of His-tagged ShewanellaArcA and E. coli ArcB78-778 as well as the ArcA phosphorylation were performed as described previously (24,28,29). ArcA protein is labeled as ArcA-PC or ArcA-PE when phosphorylated by carbamoyl phosphate or E. coli ArcB78-778, respectively. In this study, a total of 46 oligonucleotides of 48 bp each were synthesized. Forty-five of these primers contained a single mutation in the 15-bp ArcA-P binding site region in the SO1661 promoter (Figure 1A and Table 1). Radiolabeled promoters (144 bp each) were generated from these 46 primers by PCR amplification with a common P33 5′-end labeled SO1661-328 primer (5'-CCACACCATACCGATAAAGAAGC). The interaction of each radiolabeled DNA with ArcA-P (phosphorylated ArcA) was then tested using EMSAs containing ∼100–250 fmol (∼10–20 nM) labeled probe and 1.5 μM ArcA-PE as previously described (24) in which the active amount of ArcA-P was estimated to be 100 nM (Supplementary Method 1). These binding assays were repeated three times and the fraction of promoter DNA bound to the ArcA-P protein was quantified by measuring the density of both shifted and nonshifted DNA bands using ImageQuant TL (GE Healthcare Life Sciences, Piscataway, NJ, USA).
Figure 1.
(A) SO1661 promoter DNA. The blue dotted square indicates the ArcA-P binding site as identified by DNA footprinting. The conserved 15-bp motif is shown as pink letters. (B) Representative EMSAs with SO1661 mutant promoters. The EMSAs were performed with 10–20 nM radiolabeled DNA ligand and 100 nM of the active form of ArcA-PE (A). DNA ligand binding reactions lacking ArcA-PE protein were analyzed as a negative control (B). Lane 1: SO1661wt promoter. Lanes 2–18: SO1661-1 to SO1661-17 promoters, respectively (Table 1). The blue and red dotted rectangles indicate the shifted DNA–protein complexes and free unshifted DNA ligands, respectively. (C) PBM results with ArcA-PC. The results of two replica PBM experiments are shown. Different colors represent relative binding strength, which is indicated by the white arrow. Red or blue indicates a strong or weak binding signal, respectively. White indicates the binding signal exceeds the upper detection limit of the method used in this study. Spot A1: SO1661wt promoter. Spots A2–D12: SO1661-1 to SO1661-47 promoters, respectively.
Table 1.
Oligonucleotide sequences and binding energies of SO1661wt and mutant promoter DNAs
The 15 bp ArcA-P binding motif is shown in bold letters and mutant nucleotides are indicated by red letters. The oligonucleotide sequences shown in this table are the complement (bottom strand) of the sequence shown in Figure 1A.
Oligonucleotide sequences and binding energies of SO1661wt and mutant promoter DNAsThe 15 bp ArcA-P binding motif is shown in bold letters and mutant nucleotides are indicated by red letters. The oligonucleotide sequences shown in this table are the complement (bottom strand) of the sequence shown in Figure 1A.
Protein binding microarray
For protein binding microarray studies, a series of 48-bp mutant promoters were constructed by synthesis of oligonucleotide pairs that were annealed rather than using PCR for second strand synthesis. The annealed promoter DNAs each contained a single mutation in the conserved 15-bp ArcA-P site (Figure 1A and Table 1). In addition, an amine group was also synthesized onto the 5′-end of one of the paired oligonucleotides. These promoter DNAs were printed and covalently immobilized onto Codelink activated slides in 50 mM PBS (pH 8.5) at 50 pmol/μl, (Amersham Biosciences, Piscataway, NJ, USA) according to the manufacturer's instructions.(A) SO1661 promoter DNA. The blue dotted square indicates the ArcA-P binding site as identified by DNA footprinting. The conserved 15-bp motif is shown as pink letters. (B) Representative EMSAs with SO1661 mutant promoters. The EMSAs were performed with 10–20 nM radiolabeled DNA ligand and 100 nM of the active form of ArcA-PE (A). DNA ligand binding reactions lacking ArcA-PE protein were analyzed as a negative control (B). Lane 1: SO1661wt promoter. Lanes 2–18: SO1661-1 to SO1661-17 promoters, respectively (Table 1). The blue and red dotted rectangles indicate the shifted DNA–protein complexes and free unshifted DNA ligands, respectively. (C) PBM results with ArcA-PC. The results of two replica PBM experiments are shown. Different colors represent relative binding strength, which is indicated by the white arrow. Red or blue indicates a strong or weak binding signal, respectively. White indicates the binding signal exceeds the upper detection limit of the method used in this study. Spot A1: SO1661wt promoter. Spots A2–D12: SO1661-1 to SO1661-47 promoters, respectively.The PBM experiment was performed by first incubating the microarray slides with blocking buffer (10 mM Tris/HCl, 150 mM NaCl, 5 mM MgCl2, 0.05% Tween-20, 5% milk, pH 7.4) for 30 min at room temperature, and then rinsing briefly with 1 × TBS (10 mM Tris/HCl, 150 mM NaCl, 5 mM MgCl2, pH 7.4). The on-slide DNA–protein interaction was initiated by covering the slides with 300 μl binding solution consisting of 100 mM Tris/HCl (pH 7.4), 20 mM KCl, 10 mM MgCl2, 2 mM DTT, 10% glycerol, 0.1 μg/μl poly(dI·dC) and 2 μM carbamyol phosphate phosphorylated ArcA protein [895 nM active protein concentration (Supplementary Method 1)]. The binding reaction was performed at room temperature for 1 h. After washing off unbound protein, the microarray slides were incubated with an anti-His-tag antibody for 1 h at room temperature. The anti-His-tag antibody was purchased from Qiagen, Valencia, CA, USA (Cat# 34660) and diluted to 1:1200 in blocking buffer. The unbound anti-His-tag antibody was then washed off and the slides were incubated with Cy5-conjugated secondary antibody for 1 h at room temperature. The secondary antibody was purchased from Chemicon, Temecula, CA, USA (Cat# AP160s) and diluted 1:1200 in blocking buffer. After washing away unbound secondary antibody, the slides were dried and quantified using a microarray scanner. To reduce errors in data analysis, the signal intensity of each spot was normalized to the average signal intensity of all the 48 spots. In total, 27 binding replicas were performed with the promoter DNAs.
Genome-scale ArcA-P binding site discovery method
In this study, the upstream intergenic DNA for each S. oneidensis MR-1 ORF (open reading frame) (www.ncbi.nlm.nih.gov) including the first 100 bp of coding sequence was first obtained. These DNA sequences were then scanned with a sliding 15-bp DNA motif window, and the scores of each motif were calculated based on the energy-based ArcA-P PEM (15 × 4) (Table 2) with the assumption that each base contributes independently to the total score of the 15-bp motif (1). These scores represent the predicted binding energies for each DNA site with ArcA-P. The motif with the lowest score (most favorable binding energy) was selected for each intergenic DNA region and ranked according to their ΔΔG scores (Supplementary Table 1). Using the same sliding window approach with the same energy matrix, potential ArcA-P binding sites were also predicted in the promoter regions (intergenic region + the first 100-bp coding sequence) of E. coli K12 MG1655 and Haemophilus influenzae Rd KW20 genomes (Supplementary Tables 2 and 3).
Table 2.
The position energy matrix (PEM) of Shewanella ArcA-P (unit: kcal/mol)
Position
A
C
G
T
1
1.79
2.74
0.00
1.35
2
1.79
2.40
2.89
0.00
3
0.98
2.34
0.92
0.00
4
0.00
2.65
1.17
2.09
5
0.00
0.10
0.93
0.60
6
0.00
0.12
0.23
0.47
7
0.04
0.23
0.11
0.00
8
0.00
0.38
0.70
0.07
9
0.00
0.58
0.58
0.28
10
0.16
0.77
0.70
0.00
11
1.37
1.10
1.42
0.00
12
2.93
3.35
0.00
3.24
13
2.42
2.10
2.40
0.00
14
0.49
1.98
0.59
0.00
15
0.00
2.20
1.18
1.53
The position energy matrix (PEM) of ShewanellaArcA-P (unit: kcal/mol)
RESULTS
Model for determining binding energies of mutant promoter DNA to ArcA-P
According to the computational model implemented in Equation (7) (see Materials and Methods section), the relative binding energy difference (ΔΔG) for wild-type versus a mutant DNA with respect to a DNA–protein interaction can be determined by performing two separate binding reactions and measuring the fraction of DNA ligand bound (percentage-of-binding) for each reaction at equilibrium. This model assumes that the concentration of free active protein ([P]) remains constant in both binding reactions. In an actual experiment, [P] varies to differing extents according to the binding conditions. Based on Equation (3) or (6), the overall ΔΔG is determined solely by two variable factors, Δln[(1−x)/x] and Δln[P]. This suggests that the error associated with using Equation (7) to determine ΔΔG can be estimated according to the relative weights of Δln[(1−x)/x] and Δln[P]. The ratio of Δln[(1−x)/x] versus Δln[P], is shown as the function f(x) in Equation (8), in which [P], [P]0 or [L]0 represent the concentration of free protein, the total input protein or total input DNA ligand, respectively. The plot generated using Equation (8) is shown in Supplementary Figure 1. The results indicate that the minimal value of f(x) is 9.9, 17.9 or 37.9, if [P]0 is 3, 5 or 10 times that of [L]0, respectively. Therefore, when [P]0 is 3, 5 or 10 times that of [L]0, the weight of Δln[P] in the estimation of total ΔΔG is less than 9.2%, 5.3% or 2.6%, respectively. These data suggest that ΔΔG can be determined accurately using Equation (7) with a ratio of [P]0/[L]0 above 5 (error < 5.3%).
Relative binding energies determined using comparative EMSAs (EMSA-ΔΔG)
The SO1661 promoter has been shown to be under the direct control of ArcA (24). Based on footprinting assays and mutational analyses (data not shown), the ArcA-P binding site within the SO1661 promoter was determined to be a 15-bp DNA motif (Figure 1A). To understand the role of ArcA in Shewanella, each position of the 15-bp ArcA-P binding motif within the SO1661 promoter was systematically mutated and the effect on the binding of ArcA-PE [ArcA protein phosphorylated by E. coliArcB protein is labeled ArcA-PE (see Materials and Methods section)] was examined by EMSAs (Figure 1B). In these EMSAs, the molar ratio of active ArcA-PE versus DNA ligand was kept above 5 (see Materials and methods section and Supplementary Method 1). The percentage of DNA bound by ArcA-PE [equal to x in Equation (7)] of wild-type and various mutant SO1661 promoters was determined by measuring the band intensity of shifted and nonshifted DNA. The fraction DNA bound data was then used to determine the relative binding energy of the mutant promoters (i. e. ΔGmu − ΔGwt which will be referred to as ΔΔG hereafter) using Equation (7). These EMSA-ΔΔG values (Table 1) represent the contribution relative to wild-type of each nucleotide at a given position within the 15-bp binding sequence to the total binding energy (ΔG). A PEM was generated by placing the ΔΔG values of the promoter DNA with each mutant nucleotide at the corresponding position within the 15-bp DNA motif (Table 2). The information in the matrix can be summarized as a sequence logo using the enoLOGOS program (30) (Figure 2). The results indicate that two repeating GTTA units are very important for binding ArcA-P (Figure 2). This pattern is similar in sequence but significantly different in position weighting to a consensus revealed by searching for a common motif in 11 E. coliArcA-P interacting promoters (31). The importance of both GTTA sites may be related to the fact that the active form of ArcA-P is a dimer (32).
Figure 2.
Sequence logo of Shewanella ArcA-P. Sequence logos generated by EMSAs with ArcA-PE (A) or by PBM with ArcA-PC (B). Both logos were created by enoLOGOS30 based on relative binding energies (ΔΔG) of the mutants.
Sequence logo of ShewanellaArcA-P. Sequence logos generated by EMSAs with ArcA-PE (A) or by PBM with ArcA-PC (B). Both logos were created by enoLOGOS30 based on relative binding energies (ΔΔG) of the mutants.
Relative binding energies determined using PBM assays (PBM-ΔΔG)
PBM technology is a recently developed high-throughput method to study DNA–protein interactions (25). To test if Equation (7) can also be used to determine ΔΔG in microarray-based DNA–protein interaction measurements, SO1661 promoter microarrays were generated using a series of synthesized 48-bp promoter DNAs (Table 1). The PBM binding reactions were performed with either ArcA-PC [ArcA protein phosphorylated by carbamyol phosphate is labeled ArcA-PC (see Materials and Methods section)] or ArcA-PE. However, the results with ArcA-PE were not as reproducible, possibly due to the low efficiency of E. coli ArcB78-778 in phosphorylating the ShewanellaArcA protein (data not shown) and therefore the ArcA-PE results were not used for further analysis. The PBM results indicated that ArcA-PC exhibited varied binding affinities with different mutant promoters consistent with the results from the EMSAs (Figure 1C), while the unphosphorylated ArcA protein did not exhibit detectable DNA binding activity (data not shown).The percentage of binding values for various mutant promoters in the PBM assays were determined indirectly by comparing their signal intensity relative to SO1661wt promoter DNA signal intensity and the percentage of SO1661wt DNA bound in an EMSA performed under identical conditions (Supplementary Method 2). The ΔΔG of different mutant promoters relative to wild-type was then calculated using Equation (7) (Table 1). With these PBM-ΔΔG scores, an energy-based sequence logo was created for ArcA-PC using enoLOGOS (Figure 2). The sequence logos generated using ΔΔG values determined by EMS and PBM assays are similar, suggesting the percentage of promoter binding approach can be used to estimate binding energies with either assay.
Binding energies determined using a curve-fitting method (Curve-ΔG)
In order to validate the binding energy values obtained by the percentage of binding approach, several mutant SO1661 promoter DNAs (48-bp synthesized DNAs) were selected and their binding constants (Kd) with the ArcA-P protein were determined by a competitive EMSA using a curve-fitting method (33). In these assays, the input ArcA-P was held constant and the binding of labeled promoters (radioligand) to ArcA-P was subjected to competition with various amounts of unlabeled promoter DNA (Supplementary Method 3). By fitting the EMS data to a sigmoidal dose response curve (Figure 3A and B), the IC50 value for SO1661wt was determined to be 234 ± 24 nM for ArcA-PC and 216 ± 21 nM for ArcA-PE, with corresponding Kd values of 154 ± 24 nM and 136 ± 21 nM, respectively (according to the formula IC50 = Kd + [radioligand]) (33) (Table 3). The Kd values of several selected mutant promoters including SO1661-15, SO1661-17, SO1661-19 and SO1661-20 were also determined using the same method with ArcA-PC (Table 3). The binding energy (Curve fitting-ΔG) of these promoters was then determined from the Kd values according to Equation (1) (Table 3).
Figure 3.
Competition binding curve of the SO1661wt promoter with (A) ArcA-PC and (B) ArcA-PE. The x-axis indicates the log concentration of the unlabeled probe in nanomolar and the y-axis indicates the percentage of binding relative to the maximal binding signal.
Table 3.
Binding affinities of selected promoter DNAs
Promoter
ΔΔG (kcal/mol)
Kd (nM)
ΔG (kcal/mol)
EMSA
PBM
Curve fitting
EMSA
PBM
Curve fitting
EMSA
PBM
SO1661wt
0.00
0.00
154 ± 24
–
–
−8.98 ± 0.08
–
–
SO1661-12
0.87
0.75
711 ± 111
573 ± 89
SO1661-13
0.51
0.41
376 ± 59
313 ± 49
SO1661-14
0.29
0.34
257 ± 40
280 ± 44
SO1661-15
0.55
0.77
390 ± 80
401 ± 63
591 ± 81
−8.45 ± 0.11
−8.43 ± 0.08
−8.21 ± 0.07
SO1661-17
0.37
−0.02
131 ± 39
291 ± 45
150 ± 23
−9.08 ± 0.13
−8.62 ± 0.08
−9.00 ± 0.08
SO1661-18
0.33
0.27
276 ± 43
248 ± 39
SO1661-19
−0.50
−0.13
59 ± 23
64 ± 10
123 ± 19
−9.53 ± 0.19
−9.49 ± 0.08
−9.11 ± 0.08
SO1661-20
0.39
0.32
69 ± 2
303 ± 47
269 ± 42
−9.45 ± 0.02
−8.60 ± 0.08
−8.66 ± 0.08
SO1661-21
−0.18
−0.16
113 ± 18
115 ± 18
SO1661-22
0.28
0.47
251 ± 39
347 ± 54
The data in this table were determined using the 48 bp synthesized promoters and ArcA-PC.
Competition binding curve of the SO1661wt promoter with (A) ArcA-PC and (B) ArcA-PE. The x-axis indicates the log concentration of the unlabeled probe in nanomolar and the y-axis indicates the percentage of binding relative to the maximal binding signal.Binding affinities of selected promoter DNAsThe data in this table were determined using the 48 bp synthesized promoters and ArcA-PC.
Comparison of the thermodynamic parameters (ΔΔG, ΔG or Kd) determined using EMSA, PBM and curve-fitting methods
In order to evaluate the relationship between the ΔΔG values determined by EMSA versus the PBM methods, the values obtained for different mutant promoters were compared by linear regression (Figure 4, solid line). The resulting Pearson correlation coefficient is 0.97 (Figure 4), indicating the results are in close agreement in terms of the trend of the ΔΔG values, i.e. the methods give very similar results on the relative importance of a position. Consistent with this finding, the sequence logos generated using the EMSA and PBM data are also very similar (Figure 2). However, the EMSA-ΔΔG values are usually higher than the corresponding PBM-ΔΔG values as indicated by the position of the data points in Figure 4 relative to the dotted line that depicts a perfect correlation between the absolute values of ΔΔG. A possible explanation is that the ratio of [P]0/[L]0 in the EMSAs was relatively low (5–10 : 1) and the larger ΔΔG values may be a result of neglecting the contribution of ΔlnP. In addition, the EMSA values in Table 1 were determined using ArcAE and the PBM values were determined using ArcAC. To test these possibilities, several mutant promoters (48-bp synthesized DNAs) were selected (Table 3) and their interaction with ArcA-PC was determined using the same comparative EMSA but at an increased ratio of [P]0/[L]0 (∼100:1), and the resulting EMSA-ΔΔG values agree more closely with the corresponding PBM-ΔΔG values (Table 3).
Figure 4.
Correlation of relative binding energy (ΔΔG). The x- and y-axis represent the ΔΔG values derived from PBM (with ArcA-PC) and EMS (with ArcA-PE) assays, respectively. The R2-value is 0.94 (solid line). The dotted line represents the perfect correlation line for absolute values of ΔΔG.
Correlation of relative binding energy (ΔΔG). The x- and y-axis represent the ΔΔG values derived from PBM (with ArcA-PC) and EMS (with ArcA-PE) assays, respectively. The R2-value is 0.94 (solid line). The dotted line represents the perfect correlation line for absolute values of ΔΔG.As stated earlier, the binding energy (Curve-ΔG) of the SO1661-16, SO1661-17, SO1661-19 and SO1661-20 mutant DNAs was determined by a curve-fitting method. As a comparison, the EMSA-ΔΔG and PBM-ΔΔG of these four mutant promoters was converted into binding energy (EMSA- or PBM-ΔG) according to Equation (6) (ΔΔG = ΔGmu − ΔGwt) (Table 3). The average difference between the Curve-ΔG, EMSA-ΔG and PBM-ΔG of the four selected promoters is <5.7% (Table 3). Since ΔG is relatively insensitive to changes in binding affinity, these ΔG scores were then converted into binding constants (Kd) using Equation (1) (ΔG = RT • lnKd). The results show that the average difference of these Kd scores determined by different methods is within a 2.6-fold range (Table 3). Considering that it is not uncommon to observe 2- to 3-fold variations when determining Kd even by curve-fitting methods (1), these results suggest that ΔG and Kd can be reliably obtained using the comparative EMS and PBM assays described in this study.
Genome-scale prediction of ArcA-P binding sites
The EMSA results indicated that any mutant binding site with a ΔΔG score above 1.77 kcal/mol interacted with ArcA-PE weakly (Table 1 and Figure 1B). This energy score, which yields an estimated binding constant of ∼1 μM, was used as the cut-off ΔΔG value to predict the ArcA-P binding sites with a PEM based on the values in Table 2. In total, 45 ArcA-P binding sites with a ΔΔG score below 1.77 kcal/mol were identified within the Shewanellla genome (Supplementary Table 1). Because a single binding site can be situated between divergently transcribed genes, there are 61 genes directly associated with the 45 binding sites. In a previous study, several promoters with varying binding energies were selected and their interactions with the ArcA-P protein were examined by EMSAs (24). Of the 14 promoters that contain a binding site with predicted ΔΔG values ranging from 0.64 to 1.77 kcal/mol, 13 exhibited clear binding with ArcA-P when tested by in vitro EMSAs with radiolabeled PCR products (24). The promoter (SO3659) that bound weakly has a relatively high ΔΔG value (1.60 kcal/mol). Of the nine promoters that contain sites with predicted ΔΔG values above 3.09 kcal/mol, none exhibited noticeable binding with ArcA-P (24). These results suggest that the predicted binding energies are strongly associated with the ability to interact with ArcA-P. To date, microarray gene expression data is available for the genes encoded at 43 of the 45 predicted ArcA-P sites corresponding to 61 potentially regulated genes. Of these 43 sites, 27 (63%) encode genes exhibiting >2-fold regulation and 35 (81%) exhibit >1.5-fold regulation by ArcA (Supplementary Table 1). Because the microarray study only examined transcriptional regulation at a given condition and a given time, the accuracy of the site-discovery approach described in this study is likely >81% with regard to in vivo gene regulation. Thus, the number of false positive predictions appears to be low. False negatives, however, may be higher since a total of ∼300 genes were identified as under ArcA regulation (>2-fold regulation) in the microarray study (24).The ShewanellaArcA protein shares high sequence identity with ArcA from several other Gram-negative bacteria, such as E. coli (81%) and H. influenzae (79%). The role of ArcA has been extensively studied in E. coli, thus providing an ideal system to examine the accuracy of the site-discovery approach described in this study. For this purpose, a genome-scale prediction ArcA-P binding sites in E. coli was performed using the same ArcA-PE-derived PEM as that used above for S. oneidensis (Table 2). Using the same 1.77 kcal/mol energy threshold, a total of 57 putative ArcA-P binding sites were identified including seven of the nine canonical ArcA-P regulated promoters (21,31) (Supplementary Table 2). Among the 57 predicted sites, 27 (47%) have been reported to encode genes exhibiting >2-fold regulation by ArcA (Supplementary Table 2). This number could increase as additional gene regulatory data becomes available. To date, footprinting assay results have been published for a total of 15 E. coliArcA-P interacting DNAs including 14 promoters (Supplementary Table 4). For these ArcA-P footprinted DNAs, a total of 27 ArcA-P binding sites (up to two sites per footprinted DNA) were predicted using the ShewanellaArcA-PE PEM with no preset threshold (Table 2), among which 24 sites are located exactly within the ArcA-P footprinting regions and three are within a region that was not examined by footprinting or any other assays (Supplementary Table 4). For the 14 E. coliArcA-P footprinted promoters, eight contain a site with a predicted ΔΔG value below 1.82 kcal/mol and these promoters are all strongly regulated by ArcA-P (>5-fold regulation), including the lctPRD promoter which contains an ArcA-P site with the most favorable predicted ΔΔG score and which also exhibits the most significant level of regulation by ArcA-P (∼90- to 100-fold) (34). For the six promoters containing sites with predicted ΔΔG values >2.95 kcal/mol, five are weakly regulated by ArcA-P (<3-fold regulation). Taken together, these results indicate a strong correlation of predicted ΔΔG scores with the strength of ArcA-P binding and regulation.Genome-scale predictions of H. influenzaeArcA-P binding sites were also performed using the ArcA-PE PEM (Table 2). By using the 1.77 kcal/mol cut-off ΔΔG score used for the S. oneidensis and E. coli predictions, a total of 22 ArcA-P target binding sites were identified (Supplementary Table 3). The 22 binding sites is a somewhat smaller number of predictions than those for the S. oneidensis (45) and E. coli (57) genomes, but it is consistent with a recent microarray study where only 23 genes exhibited >2-fold regulation by ArcA in H. influenzae (35). Among these 23 genes identified in the microarray study, 12 were predicted to contain an ArcA-P binding site using the 1.77 kcal/mol threshold. Interestingly, the eight predicted ArcA-P binding sites with the most favorable energy scores (ΔΔG ≤ 1.13 kcal/mol) were all within promoter regions for genes exhibiting >2-fold regulation by ArcA-P in the microarray study (35) (Supplementary Table 3). These results, in addition to the S. oneidensis and E. coli results, suggest that false positive predictions are rare among the binding sites with favorable energy scores.
DISCUSSION
In this study, a simple model was used to examine binding energy in DNA–protein interactions using electrophoretic gel shift and PBM assays. With this approach, the importance of each position within the ArcA-P binding site was quantitatively established by characterizing the interaction between ShewanellaArcA-P and a series of mutant promoter DNAs, whereby each position in the binding site was systematically mutated to all possible single nucleotide changes. The results of the fine mapping were used to create a PEM that was used for a genome-scale prediction of 45 ArcA-P sites in Shewanella. A further examination suggests that this prediction is >81% consistent with in vivo gene regulation according to microarray studies and >92% (13/14) accurate in terms of published in vitro gel shift validation binding assays (24). In addition, this study predicted 27 ArcA-P sites for 15 published E. coliArcA-P footprinted DNAs, and 24 of them were found exactly within the footprinting protected regions and the other three sites fall into the regions that were not examined by footprinting assays (Supplementary Table 4). This is the first report showing that footprinting protected regions can be effectively predicted by starting from a single known transcription factor binding site. Finally, the predicted H. influenzaeArcA-P sites correlate well with in vivo regulation determined by a microarray analysis in that the eight predicted binding sites with the most favorable ΔΔG scores all exhibit ArcA dependent gene regulation (Supplementary Table 3) (35).As indicated earlier, the available validation data suggest the identification of binding sites using binding energies is highly accurate in terms of very few false positives but that false negatives clearly occur. There are several possible explanations for false negative predictions. One obvious contributor to false negatives is the ΔΔG threshold chosen for the scores obtained from the genome scan. False negative predictions may also occur due to cooperative protein binding to multiple weak binding sites present in a promoter region. It has been shown that ArcA protein multimerizes upon phosphorylation and that the multimeric protein can bind to multiple sites within a promoter region (36,37).The one-step percentage-of-binding strategy described in this study provides a rapid approach to examine binding energy in DNA–protein interactions via systematic mutation of the DNA binding site. Since most cis-regulatory sites are ∼6–12 bp long (38), the one-step EMSA described here provides an efficient means of generating a PEM for genome-scale site discovery. Compared with other site-discovery approaches, the method described in this study requires little previously known experimental data (only a single known binding site is necessary). Compared with the few available high-throughput methods (12–20) to measure DNA–protein binding energies, the percentage-of-binding approach represents a simple yet effective method. In addition, the application of percentage-of-binding strategy to microarray-based DNA–protein interactions could result in a low cost and high throughput genome-scale site-discovery approach for many other transcription factors.
Authors: B Ren; F Robert; J J Wyrick; O Aparicio; E G Jennings; I Simon; J Zeitlinger; J Schreiber; N Hannett; E Kanin; T L Volkert; C J Wilson; S P Bell; R A Young Journal: Science Date: 2000-12-22 Impact factor: 47.728
Authors: Michael F Berger; Anthony A Philippakis; Aaron M Qureshi; Fangxue S He; Preston W Estep; Martha L Bulyk Journal: Nat Biotechnol Date: 2006-09-24 Impact factor: 54.908
Authors: David D Pollock; A P Jason de Koning; Hyunmin Kim; Todd A Castoe; Mair E A Churchill; Katerina J Kechris Journal: PLoS One Date: 2011-11-01 Impact factor: 3.240