From sequence to structure and back again: approaches for predicting protein-DNA binding.

Abstract

Gene regulation in higher organisms is achieved by a complex network of transcription factors (TFs). Modulating gene expression and exploring gene function are major aims in molecular biology. Furthermore, the identification of putative target genes for a certain TF serves as a powerful tool for specific targeting of rational drugs.

Detecting the short and variable transcription factor binding sites (TFBSs) in genomic DNA is an intriguing challenge for computational and structural biologists. Fast and reliable computational methods for predicting TFBSs on a whole-genome scale offer several advantages compared to the current experimental methods, which are rather laborious and slow. Two main approaches are being explored: advanced sequence-based algorithms and structure-based methods.

The aim of this review is to outline the computational and experimental methods currently being applied in the field of protein-DNA interactions. With a focus on the former, the current state of the art in modeling these interactions is discussed. Surveying sequence-based and structure-based methods for predicting TFBSs, we conclude that in order to achieve a sound and specific method applicable to genomic sequences, it is desirable and important to bring these two approaches together.

Year: 2004 PMID: 15202939 PMCID: PMC441406 DOI: 10.1186/1477-5956-2-3

Source DB: PubMed Journal: Proteome Sci ISSN: 1477-5956 Impact factor: 2.480

Introduction

A complex network of gene regulatory signals allows each cell in both single- and multicellular organisms to respond flexibly to environmental factors. In 1967, Ptashne realized that gene expression is regulated by protein switches that bind to target sequences in the DNA [1]. Understanding the mechanisms underlying sequence-specific binding of proteins to DNA, and the resulting gene expression, holds great promise for targeting numerous diseases through rational drug development [2]. The sequencing of whole genomes, together with experimental studies of the control of gene expression, has revealed some fundamental mechanisms. Each gene is regulated by at least one, but often multiple, transcription factors (TFs). The TFs bind to specific transcription factor binding sites (TFBSs) within the regulatory regions (promoters) of the genes. The functional arrangement, i.e. the presence, combination, and order of the TFBSs in a regulatory region, forms promoter modules [3] that control the spatial and temporal expression of genes [4]. The analysis of individual TFBSs can provide important clues in deducing regulatory networks in a cell and the functional context of specific genes.

Over time, several experimental methods have been developed for studying TFBSs. In vitro analysis is complicated by two facts typical of TFs: they usually bind to multiple target sequences with varying affinity, and they often regulate multiple genes. In silico analysis is not straightforward either, but presents a necessary extension to current in vitro methods. The main obstacles are that TFBSs are often located in non-coding DNA, degenerate in their sequence, and relatively short (5–12 nucleotides). Searching for such low-information content sites within huge amounts of genomic DNA using computational methods typically yields a large number of randomly occurring false positive sites. Reducing the number of these false positives has been the goal of many efforts. Currently, the most successful sequence-based algorithms are context-sensitive and account for the presence of other TFBSs [5], positioning relative to the transcription start site (TSS) [6], and evolutionary conservation of functional regulatory elements [7].

Seen from a structural point of view, the recognition of a nucleotide sequence by a DNA-binding protein is determined by the interactions between the DNA base pair (bp) edges and the amino acid side chains. Structure-based methods use either statistical information obtained from structural data, or models representing the steric and chemical complementarity, to evaluate the affinity of a protein-DNA complex [8]. Research during the past decades has focused on understanding the mechanisms underlying protein-DNA interactions and on expressing these using general sets of rules. First attempts to define such a recognition code arose in 1976 through the work of Seeman and Rosenberg [9], who identified a specific pattern of hydrogen bond (H-bond) acceptors and donors on the DNA bp edges. More detailed studies of protein-DNA structural complexes soon concluded that the interactions could not be explained by a simple one-to-one correspondence [10,11]. However, specific amino acid-base preferences do exist [12,13], which comes as no surprise given their chemical and structural characteristics.

Current sequence-based algorithms and structure-based models will benefit from mutual integration when the primary aim is to develop fast and reliable prediction methods for TFBSs and an understanding of how DNA recognition is facilitated. Experimental techniques for studying protein-DNA interactions and the physical characteristics of such interactions will be explained in the first two sections. In the final section, accurate computational modeling of the binding sites of regulatory proteins will be discussed in the light of experimental and theoretical implications.
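
The false positive problem described above for short, low-information sites can be made concrete with a rough back-of-the-envelope calculation. The sketch below is only an illustration under strong assumptions that are not part of the review: equal base frequencies, independent positions, and an exact match to one fixed sequence. The function name and the genome size used in the loop are likewise illustrative.

```python
def expected_random_hits(genome_length: int, site_length: int,
                         both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed k-mer in random DNA.

    Assumes equal base frequencies (0.25 each) and independent positions,
    which real genomes do not satisfy; treat the result as an order of
    magnitude only.
    """
    positions = genome_length - site_length + 1
    per_position_probability = 0.25 ** site_length
    strands = 2 if both_strands else 1
    return positions * per_position_probability * strands


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed order of magnitude for a large genome
    for k in (5, 8, 12):    # site lengths within the 5-12 nt range quoted above
        print(k, expected_random_hits(genome, k))
```

Because a degenerate site matches several sequences rather than one, the expected number of chance hits for a real TFBS would be higher than this exact-match estimate.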

Experimental methods

In order to analyze differences and commonalities in how binding takes place, examples of binding sites are required. Experimental methods used in the determination of binding sites for transcription factors are important for creating a sound description of each TFBS. There are several methods available for producing interaction data. Nitrocellulose-binding assay [14], electrophoretic mobility shift assay (EMSA) [15], enzyme-linked immunosorbent assay (ELISA) [16], DNase I footprinting [17], DNA-protein crosslinking (DPC) [18], and reporter constructs [19] are examples of in vitro techniques that are used for determining DNA binding sites and analyzing the difference in binding specificity for different protein-DNA complexes. They are all currently in use, but suffer from major drawbacks: they are not suited for high-throughput experiments, and information on optimal vs. suboptimal protein binding sites is lost. Chromatin immunoprecipitation (ChIP) is a recent microarray-based assay developed for genome-wide determination of protein binding sites on DNA [20].

Systematic evolution of ligands by exponential enrichment (SELEX) [21] and Phage Display (PD) [22] represent another type of experiment and offer a high-throughput possibility to select high-affinity binders (DNA and protein targets, respectively). Both SELEX and PD suffer from the same drawback: the multitude of sequences obtained from these experiments are all good binders, but it is hard to say anything about their relative affinities. The assumption that the best binders occur more frequently, for purely statistical reasons, is commonly adopted. The differences between individual mutants have to be measured one at a time by other, more laborious methods (discussed above). In 1999, Bulyk et al. presented dsDNA microarrays for exploring sequence-specific protein-DNA binding [23]. The major advantage over the methods discussed above is that this is a high-throughput method that yields data with associated relative binding affinities, which is of high importance in protein-DNA interaction studies.

Finally, X-ray crystallographic and NMR spectroscopic data provide a basis for studying the structural details of protein-DNA interactions. Protein-DNA complexes have successfully been co-crystallized [24], and the data have been deposited in the Protein Data Bank (PDB) and the Nucleic Acid Database (NDB). Each complex is a 3D representation of all intermolecular interactions participating in protein-DNA recognition; however, the experiments are very time-consuming.
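
The frequency assumption mentioned above for SELEX and phage display data (better binders occur more often among the selected sequences) is what allows a set of selected sites to be summarized as per-position base counts. The following is a minimal sketch of that bookkeeping, assuming the sites are already aligned and of equal length. The function names and the example sequences are hypothetical and are not experimental data from any of the cited studies.

```python
from collections import Counter

BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of equally long, pre-aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column; missing bases count as 0.
    columns = [Counter(site[i] for site in sites) for i in range(length)]
    return {base: [column[base] for column in columns] for base in BASES}


def frequency_matrix(sites):
    """Convert per-position counts into relative frequencies."""
    counts = count_matrix(sites)
    n_sites = len(sites)
    return {base: [c / n_sites for c in row] for base, row in counts.items()}


if __name__ == "__main__":
    # Hypothetical selected sequences, invented purely for illustration.
    selected = ["TGCGTGGGCG", "TGCGTAGGCG", "AGCGTGGGAG", "TGCGGGGGCG"]
    for base, row in frequency_matrix(selected).items():
        print(base, row)
```

Such counts say nothing about measured affinities; they only encode how often each base was recovered, which is exactly the limitation noted in the text.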

Characteristics of protein-DNA interactions

Double-stranded DNA forms the famous double helix [25], where pairs of complementary bases on opposing strands are stabilized by intermolecular H-bonds. The chemical composition of the DNA sugar-phosphate backbone is independent of the bp sequence and thus not involved in the specificity of sequence recognition. Only the edges of the bp are exposed in the grooves of the helical DNA, where they form a pattern of H-bond acceptors and donors [9] that can be recognized by the amino acid side chains (see Figure 1 for an illustration). Specific recognition of DNA has to rely on the interactions with these exposed patches. TFs typically contain a DNA-binding domain and one or multiple interaction domains that bind to other TFs. It is common to group the TFs into families according to the structure of these DNA-binding domains [26], where each family employs a different mechanism for recognizing the DNA sequence of the target site [12].

Figure 1

Characteristics of C-G and T-A base pairs. Intermolecular H-bonds (dotted lines) in the C-G and T-A bp stabilize the DNA double helix. The bp edges form a pattern of H-bond acceptors and donors that can be recognized by amino acid side chains of proteins. This pattern is unique for each bp (C-G, G-C, T-A, and A-T) in the major groove (up), whereas it is only possible to distinguish a C-G bp (top) from a T-A bp (bottom) in the minor groove (down) [9]. H-bond acceptors and donors are indicated by outward and inward pointing arrows, respectively. The letter M is the methyl group of the base T and H is a ring hydrogen donor. The chemical composition of the DNA sugar-phosphate backbone (not shown) is constant and independent of the bp sequence.

The energetics and mode of protein-DNA interactions differ from those of protein-protein interactions. The main differences are that the protein-DNA interfaces are much more polar, have many more intermolecular H-bonds, and have a higher abundance of buried water molecules [27,28]. The most important biochemical interactions in protein-DNA complexes are van der Waals contacts, H-bonds, and water-mediated contacts [29]. About two-thirds of all contacts are non-specific and made with the sugar-phosphate backbone of the DNA, leaving one-third of all interactions for the specificity [30]. Non-specific interactions (protein-DNA backbone) are extremely important for the overall stability of the complex, and are mainly mediated through van der Waals contacts. About two-thirds of the specific interactions (protein-DNA base edges) involve complex H-bond patterns [29]. The distribution of H-bonds clearly demonstrates particular amino acid-base preferences, but no generalizable code can be deduced [13]. It is important to note that each amino acid can interact with more than one bp simultaneously, and several different amino acids can interact with the same bp. Interdependence between both bases and amino acids is an important feature of the interaction scheme. Very specific contact patterns can be achieved in this way and enable subtle but crucial differences in binding affinities [31].

Water molecules act as contact mediators and space fillers at the protein-DNA interface and play a key role in complex formation. As suggested in [32], an atomic description of water molecules at the interface is required for a complete formulation of protein-DNA interactions. Important water bridges can be identified in crystal structures or by molecular modeling [33].

The helical DNA structure is often distorted when bound to a protein [34,35]. Enforced bending of the DNA strand occurs through kinks at the base steps, leading to unstacking and unwinding of the helix. Several types of structural changes have been detected, including shift, slide, twist, rise, roll, and tilt [36]. The stiffness of the DNA helix is determined by the background bp composition [37], i.e. C-G bp are more rigid since they have one additional H-bond compared to A-T bp. The side chains of the protein are flexible and can rearrange upon complex formation in order to achieve complementarity.

Computational methods

Computational approaches present an attractive solution for modeling and discovering TFBSs on a genomic scale. Several different computational approaches for predicting TFBSs have been explored, which has led to considerable progress during recent years. The main approaches are sequence-based and structure-based: sequence-based methods consider only the primary structure of DNA, whereas structure-based methods aim at describing the physical and chemical complementarity between a TF and its binding site. We will now briefly discuss some selected sequence-based and structure-based computational methods for predicting TFBSs.

Experimentally verified binding sites can be used for constructing a consensus sequence motif of the binding site of a TF. A consensus sequence can be obtained from a multiple alignment of known binding sites [38], and can be used for scanning genomic sequences in the search for TFBSs [39,40]. However, methods using scoring matrices for describing the binding sites [41,42] offer great advantages over consensus sequence methods. Position specific scoring matrices (PSSMs) are based on experimentally verified binding sites and represent the relative distribution and conservation of all nucleotides in the binding site. PSSMs exist for almost all types of TFBSs [43] and are widely used for predicting binding sites [41]. For an excellent review on PSSMs, see [44]. Table 1 illustrates a consensus sequence and a PSSM for an example TFBS. Sequence logos can be used for graphically describing the PSSMs [45]. The main advantage of PSSMs is that a quantitative measure can be obtained rather than the yes/no type of answer obtained from consensus models. Accounting for interdependence [46] between bases in the TFBS is not trivial; thus, treating the binding energy contribution of each position in the binding site as independent (the "independent binding hypothesis") is a frequently adopted approximation [47]. However, some improvement in performance has been achieved using higher order PSSM models [48,49].
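
The difference between a consensus match and a PSSM score can be sketched with the EGR-1 values listed in Table 1. The example below is one possible formulation, not the scoring scheme of any particular tool cited here: it assumes a uniform background of 0.25 per base, a pseudocount of 1, and log2 odds, and it sums the per-position scores, which is the independent binding hypothesis in code. All function names are illustrative.

```python
import math

# Column values copied from Table 1 (each column sums to 100).
TABLE1 = {
    "A": [5, 7, 0, 2, 0, 31, 0, 0, 13, 0],
    "C": [3, 0, 98, 0, 2, 0, 0, 0, 76, 0],
    "G": [5, 93, 0, 98, 14, 69, 100, 100, 0, 100],
    "T": [87, 0, 2, 0, 84, 0, 0, 0, 11, 0],
}
CONSENSUS = "TGCGTGGGCG"


def log_odds(matrix, background=0.25, pseudocount=1.0):
    """Turn column counts into log2 odds against an assumed background."""
    width = len(matrix["A"])
    pssm = []
    for i in range(width):
        total = sum(matrix[b][i] for b in "ACGT") + 4 * pseudocount
        pssm.append({
            b: math.log2(((matrix[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pssm


def score(pssm, site):
    """Sum of independent per-position contributions."""
    return sum(column[base] for column, base in zip(pssm, site))


def scan(pssm, sequence):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [(i, score(pssm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


def consensus_match(site):
    """Yes/no answer of a strict consensus model."""
    return site == CONSENSUS


if __name__ == "__main__":
    pssm = log_odds(TABLE1)
    for site in (CONSENSUS, "TGCGTAGGCG", "AAAAAAAAAA"):
        print(site, consensus_match(site), round(score(pssm, site), 2))
```

A site with A instead of G at position 6 fails the strict consensus test but still receives a fairly high PSSM score, because A is observed at that position in Table 1. A higher order model would replace the per-position columns with scores for pairs of positions.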

Table 1

CONSENSUS		T	G	C	G	T	G	G	G	C	G
POSITION		1	2	3	4	5	6	7	8	9	10

SCORING MATRIX	A	5	7	0	2	0	31	0	0	13	0
	C	3	0	98	0	2	0	0	0	76	0
	G	5	93	0	98	14	69	100	100	0	100
	T	87	0	2	0	84	0	0	0	11	0

Representation of an example TFBS. Two sequence-based representations of the same TFBS: a consensus sequence and a position specific scoring matrix (PSSM). The example used here is the binding site of the early growth response protein 1 (EGR-1, Zif268), which is a zinc finger protein.

False positive hits are detected with high frequency [50] when using consensus sequences or PSSMs for scanning genomes for putative binding sites. Bringing genetic context into the models has improved the specificity of the prediction methods. Limiting the search to predicted promoter regions [6,51], combining a set of functionally related TFs [4], and searching for their co-abundance have increased the specificity significantly [40]. The inclusion of spacing rules between the TFs [52], limitations on the number of each contributing TF [53], and combinatorial aspects of TF positioning [54,55] have further reduced the number of false hits.

Several TFs bind their target sequences as homo- or heterodimers, leading to co-occurring binding sites. The number of nucleotides in the gap between the two half-sites may vary, even for the same TF binding to two different sites [56]. Accounting for varying half-site spacing in computational search algorithms is not trivial, but nevertheless essential. Synergy, or cooperative binding, is another reason for co-occurring motifs. By definition, classical cooperative binding is when protein-protein interactions lead to a more efficient control of the promoter. Biological experiments have shown that synergistic activation can also occur when two regulatory proteins have no physical contact [57]. Computer simulations indicate that this might be an effect of the first-bound protein changing the tension in the DNA strand [58]. Several computational methods for predicting TFBSs have been developed that take such putative synergy effects into account. BioProspector [59] and Co-Bind [5] are examples of methods that can be used for discovering co-occurring motifs.
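
The varying half-site spacing discussed above can be handled, in its simplest form, by searching for both half-sites and accepting any pair whose gap falls within an allowed range. The sketch below is not the algorithm of BioProspector or Co-Bind; it uses exact string matching for brevity, whereas a practical method would score each half-site with a PSSM. The function names and the half-site sequences in the example are hypothetical.

```python
def find_all(sequence, motif):
    """Start positions of every (possibly overlapping) exact match."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)
    return hits


def paired_half_sites(sequence, left, right, min_gap, max_gap):
    """Find left/right half-site pairs separated by min_gap..max_gap bases.

    Returns a list of (left_start, right_start, gap) tuples.
    """
    right_starts = set(find_all(sequence, right))
    pairs = []
    for left_start in find_all(sequence, left):
        left_end = left_start + len(left)
        for gap in range(min_gap, max_gap + 1):
            right_start = left_end + gap
            if right_start in right_starts:
                pairs.append((left_start, right_start, gap))
    return pairs


if __name__ == "__main__":
    # Hypothetical half-sites and sequence, chosen only to show the mechanics.
    sequence = "CCAGGTCANNNTGACCTCC"
    print(paired_half_sites(sequence, "AGGTCA", "TGACCT", min_gap=0, max_gap=5))
```

Widening the allowed gap range increases sensitivity to sites with unusual spacing, but it also raises the number of chance pairings, so the range is a trade-off rather than a free parameter.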
Computational de novo discovery of overrepresented motifs has been used for finding putative and functionally related TFBSs. Detecting short and degenerate binding sites in genomic sequences is a very hard task. Limiting the search to promoters and conserved non-coding regions where TFBSs are enriched [60] has improved the performance. Gibbs Sampling [61], ANN-Spec [62], and LOGOS [63] are examples of algorithms that have proven helpful in detecting TFBSs [64,65]. Further improvement has been made by assuming that co-expressed genes are co-regulated [66], at least to some extent. Inferring co-expression in order to detect overrepresented motifs in regulatory sequences is a frequently adopted strategy [67,68]. Phylogenetic footprinting is a computational method commonly applied as a filter for pointing towards conserved, possibly functional regions of non-coding regulatory sequences [7,69]. Several successful examples have been reported [70,71], and the computational methods have been reviewed in [72,73].

Alongside the increasing number of genomic sequences, the amount of structural information on protein-DNA complexes has been increasing rapidly. Careful structural analyses of protein-DNA complexes obtained from PDB and NDB have identified the characteristics of such interactions [13,27,29]. Examination of the relationship between amino acid sequence conservation and role in DNA sequence recognition in protein-DNA complexes has revealed a strong correlation across all protein structural families [74,75]. Structure-based models offer promising extensions to the sequence-based models. These provide a way to qualitatively analyze DNA deformation, cooperativity, and other structural properties of protein-DNA interactions. There are two main categories of structure-based approaches. The first is based on statistical potentials and the second on potentials obtained from molecular mechanics simulations.
Statistical potentials are derived from systematic analysis of structural protein-DNA complexes. Pairwise potentials are extracted from distributions of Cα atoms around DNA bases of known protein-DNA complexes, which reflect the statistical occurrence of specific interactions. They have proven to be sufficiently sensitive to evaluate the affinities of sequences obtained in a combinatorial fashion by threading them onto the fold of the original complex [8,76].

Computer simulations have been used to derive free-energy interaction maps between pairs of bases and amino acids [77,78], which can be used for prediction of TFBSs in a fashion similar to that described above. In order to fully address the structural flexibility of both protein and DNA, as well as interaction redundancy, intensive computation is needed. Observing processes during appropriate simulation periods and accounting for whole-system interactions are the two main limiting factors. Despite the required computing power, free energies have been analyzed in larger biological systems; see [79] for a review.

Encoding the structural properties of specific DNA sequences and using these in combination with sequence-based methods can improve the specificity of the predictions [28,80]. The direct interactions between amino acids and DNA bases are mainly specific hydrogen bonds, which are fairly well understood. The non-specific interactions, constituting the majority of all interactions involved, are less well understood; nevertheless, indications exist that they will provide important clues in understanding the complete picture of protein-DNA recognition. Structure-based approaches for modeling protein-DNA interactions are expensive in terms of computing power; however, they provide valuable insights into the physical interactions at an atomic level.
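
The idea behind the statistical potentials described above can be sketched as a log-odds conversion of observed contact frequencies into pseudo-energies. This is a deliberately simplified illustration and not the method of the cited works: it ignores the spatial distributions of Cα atoms around the bases and only counts which amino acid contacts which base, using an inverse-Boltzmann style formula with an assumed pseudocount. The function names and the contact counts in the example are invented for illustration.

```python
import math
from collections import Counter


def statistical_potential(contacts, pseudocount=1.0):
    """Pseudo-energy -ln(P_observed / P_expected) per (amino acid, base) pair.

    `contacts` is an iterable of (amino_acid, base) tuples collected from
    known complexes. P_expected assumes the two partners pair at random.
    """
    contacts = list(contacts)
    pair_counts = Counter(contacts)
    aa_counts = Counter(aa for aa, _ in contacts)
    base_counts = Counter(base for _, base in contacts)
    n = len(contacts)
    n_pair_types = len(aa_counts) * len(base_counts)
    potential = {}
    for aa in aa_counts:
        for base in base_counts:
            observed = (pair_counts[(aa, base)] + pseudocount) / (
                n + pseudocount * n_pair_types)
            expected = (aa_counts[aa] / n) * (base_counts[base] / n)
            potential[(aa, base)] = -math.log(observed / expected)
    return potential


def threading_score(potential, contact_list):
    """Sum pairwise pseudo-energies for the contacts of one threaded sequence."""
    return sum(potential[pair] for pair in contact_list)


if __name__ == "__main__":
    # Made-up contact counts, for illustration only.
    toy_contacts = ([("ARG", "G")] * 6 + [("ARG", "A")] * 1 +
                    [("ASN", "A")] * 5 + [("ASN", "G")] * 2)
    potential = statistical_potential(toy_contacts)
    for pair, energy in sorted(potential.items()):
        print(pair, round(energy, 3))
```

Pairs that occur more often than expected by chance receive negative values, so a candidate DNA sequence threaded onto a known complex scores lower (more favorable) when it places frequently observed amino acid-base pairs in contact.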

Conclusion

Protein-DNA interactions have been under intense research during recent years, which has resulted in numerous valuable findings as well as computational methods for the prediction of TFBSs. While sequence-based methods are amenable to analyses on a whole-genome scale, the computational costs of structure-based methods are currently still prohibitively high. The required computation time ranges up to several days for a single protein-DNA complex, due to the complexity of the interactions. At the same time, structure-based methods provide deep insights into the mechanisms and features of the protein-DNA interaction. These insights allow us to validate, or falsify, some of the assumptions and approximations underlying some of the sequence-based methods. Sequence-based algorithms provide a fast and flexible system for analyzing and reducing the search space in genomic sequences, whereas computationally intensive structure-based approaches can then be used in a final step, with the specificity needed for a final evaluation of the predicted binding sites. We hence observe both a need and a recent tendency to use structure-based methods for validation of sequence-based methods. We conclude that advanced sequence-based methods and detailed structure-based methods will make a strong combination in the search for putative binding sites for regulatory proteins in genomic sequences.

References (10 of 77 listed)

1. Quantifying DNA-protein interactions by double-stranded DNA arrays.

Authors: M L Bulyk; E Gentalen; D J Lockhart; G M Church
Journal: Nat Biotechnol Date: 1999-06 Impact factor: 54.908

2. The TRANSFAC system on gene expression regulation.

Authors: E Wingender; X Chen; E Fricke; R Geffers; R Hehl; I Liebich; M Krull; V Matys; H Michael; R Ohnhäuser; M Prüss; F Schacherer; S Thiele; S Urbach
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

3. Genome-wide location and function of DNA binding proteins.

Authors: B Ren; F Robert; J J Wyrick; O Aparicio; E G Jennings; I Simon; J Zeitlinger; J Schreiber; N Hannett; E Kanin; T L Volkert; C J Wilson; S P Bell; R A Young
Journal: Science Date: 2000-12-22 Impact factor: 47.728

4. DNA binding sites: representation and discovery. (Review)

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

5. Regulatory element detection using correlation with expression.

Authors: H J Bussemaker; H Li; E D Siggia
Journal: Nat Genet Date: 2001-02 Impact factor: 38.330

6. NMR and molecular dynamics studies of the hydration of a zinc finger-DNA complex.

Authors: V Tsui; I Radhakrishnan; P E Wright; D A Case
Journal: J Mol Biol Date: 2000-10-06 Impact factor: 5.469

7. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity.

Authors: C T Workman; G D Stormo
Journal: Pac Symp Biocomput Date: 2000

8. The structure of DNA.

Authors: J D Watson; F H Crick
Journal: Cold Spring Harb Symp Quant Biol Date: 1953

9. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes.

Authors: L McCue; W Thompson; C Carmack; M P Ryan; J S Liu; V Derbyshire; C E Lawrence
Journal: Nucleic Acids Res Date: 2001-02-01 Impact factor: 16.971

10. Specific binding of the lambda phage repressor to lambda DNA.

Authors: M Ptashne
Journal: Nature Date: 1967-04-15 Impact factor: 49.962
