
3D-footprint: a database for the structural analysis of protein-DNA complexes.

Abstract

3D-footprint is a living database, updated and curated on a weekly basis, which provides estimates of binding specificity for all protein-DNA complexes available at the Protein Data Bank. The web interface allows the user to: (i) browse DNA-binding proteins by keyword; (ii) find proteins that recognize a similar DNA motif and (iii) BLAST similar DNA-binding proteins, highlighting interface residues in the resulting alignments. Each complex in the database is dissected to draw interface graphs and footprint logos, and two complementary algorithms are employed to characterize binding specificity. Moreover, oligonucleotide sequences extracted from literature abstracts are reported in order to show the range of variant sites bound by each protein and other related proteins. Benchmark experiments, including comparisons with expert-curated databases RegulonDB and TRANSFAC, support the quality of structure-based estimates of specificity. The relevant content of the database is available for download as flat files and it is also possible to use the 3D-footprint pipeline to analyze protein coordinates input by the user. 3D-footprint is available at http://floresta.eead.csic.es/3dfootprint with demo buttons and a comprehensive tutorial that illustrates the main uses of this resource.


Year: 2009 PMID: 19767616 PMCID: PMC2808867 DOI: 10.1093/nar/gkp781

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

DNA footprinting is a well-proven experimental methodology used to probe the specificity of proteins that bind DNA. The method takes its name from the footprint that bound proteins leave on electrophoresis gels after DNA digestion (1). In the same way, the structures of protein–DNA complexes, contributed by researchers worldwide and stored in the Protein Data Bank (PDB) (2), can be seen as molecular footprints at the atomic scale. The increasing collection of such complexes has encouraged computational approaches that take atomic coordinates as input to analyze sequence recognition from different perspectives (3–8). Two of the most recent approaches, the cumulative contact method (6)—which assumes that the consensus DNA sequence is captured in the experimental structure—and DNAPROT (7)—that samples oligonucleotides considering both direct and indirect readout mechanisms—are taken here to construct 3D-footprint, a weekly updated database that contains structure-based footprints and specificity estimates for all complexes available at the PDB. 3D-footprint entries are annotated according to the SCOP (9) and Pfam (10) databases, and their binding interfaces are clustered in terms of structural similarity. In addition, each complex in the database is dissected in order to draw interface graphs and footprint diagrams, a novel graphical representation that summarizes contacts across the double-stranded DNA segment at the interface. Oligonucleotides extracted from related scientific papers are also reported in order to show the range of variant sites bound by each protein and other related proteins. Furthermore, entries in the database are linked to external resources that provide valuable related information: NDB (11), PDBSum (12), ProNuc (13), NPIDB (14), BIPA (15) and the Protein Mutant Database (16). 
The web interface of the database, also mirrored at Universidad Nacional Autónoma de México, offers the following: (i) an up-to-date repository of complexes, classified as non-redundant, multimeric and redundant; (ii) interface graphs that plot indirect readout bases and the atomic interactions responsible for direct readout; (iii) footprint diagrams with bases depicted as circles of diameter proportional to the number of contacts observed at the interface, with DNA strands plotted separately; (iv) structure-based position weight matrices (PWMs) that can be used to scan genomic sequences via RSAT (17); (v) interface dendrograms that summarize the similarities between related DNA-binding proteins; (vi) a keyword search tool that allows browsing the repertoire of protein–DNA complexes; (vii) a protein sequence search engine that returns BLAST-like alignments which highlight interface residues; (viii) a motif search facility which ranks 3D-footprint proteins that recognize DNA sequences similar to those input by the user; (ix) an interactive footprint pipeline for the analysis of complexes provided by the user; (x) a miner form that retrieves DNA motifs related to a search term from PubMed abstracts and (xi) a repository of flat files with all relevant data in the database available for download.
The database is automatically updated by identifying new structures reported in the PDB; however, new complexes are manually curated before being added to the collection in order to spot and fix unusual problems with the coordinates (such as broken or reversed DNA strands) and to assign adequate protein names and synonyms that will drive the search for related motifs in scientific papers. PubMed abstracts frequently contain binding sites and consensus sequences that enrich those inferred from the molecular coordinates. Benchmark experiments have been carried out in order to validate 3D-footprint specificity measurements, including a comparison with two expert-curated databases, RegulonDB (18) and TRANSFAC (19). The tests confirm that structure-based estimations of binding specificity correlate well with those derived from consensus alignments produced by biocurators. In addition, it is found that DNA-binding superfamilies display different specificities, in agreement with measurements obtained from TRANSFAC motifs. With respect to other databases, 3D-footprint occupies a unique position as, to my knowledge, it is currently the only up-to-date database providing structurally derived DNA motifs, in the form of PWMs, and sequence alignments that highlight interface residues responsible for DNA recognition. Moreover, the web interface allows custom analyses of protein–DNA complexes provided by the user.

DATA, METHODS AND IMPLEMENTATION

3D-footprint consists of several modules, which are built with custom-made programs, mostly written in Perl—making use of the CPAN module DB_File—and C++, plus third-party software. Figures B1 and B6 of the ‘Benchmark’ section at http://floresta.eead.csic.es/3dfootprint/benchmark.html explain how these modules integrate with each other; they are now described in more detail.

A non-redundant set of protein–DNA complexes

The PDB database is mirrored once a week, including the 95% non-redundant list of protein chains. For each cluster in the list, the best resolution chain with most protein–DNA contacts is taken, and any other chains are regarded as redundant. Protein chains docking the same DNA molecule are taken as components of multimeric complexes, which are also reported as they usually capture the most biologically relevant molecular docks. Further redundancy filters are applied in order to derive hydrogen bond and hydrophobic atomic preference matrices that are available in the download area of the web site, as published earlier (7). Note that no distinction is made between transcription factors and any other DNA-binding proteins. As of 19 July 2009, the database contains 500 non-redundant, 677 multimeric and 1045 redundant complexes.
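The selection of one representative chain per cluster can be illustrated with a minimal Python sketch. It is not the database's actual code (which is written in Perl and C++). The `Chain` record, its field names and the tie-breaking order (resolution first, then number of contacts) are assumptions made for illustration, since the text does not state how the two criteria are combined.

```python
from dataclasses import dataclass

@dataclass
class Chain:
    chain_id: str      # hypothetical identifier, e.g. "c1_A"
    cluster: int       # cluster number taken from the 95% non-redundant list
    resolution: float  # in Angstroms; lower values mean better resolution
    contacts: int      # number of protein-DNA contacts observed

def pick_representatives(chains):
    """Return a dict cluster -> representative chain.

    Assumed ordering: best (lowest) resolution first, ties broken by the
    largest number of protein-DNA contacts. All other chains in a cluster
    would be regarded as redundant.
    """
    best = {}
    for c in chains:
        current = best.get(c.cluster)
        key = (c.resolution, -c.contacts)
        if current is None or key < (current.resolution, -current.contacts):
            best[c.cluster] = c
    return best

def redundant_chains(chains):
    """List the identifiers of chains not chosen as representatives."""
    reps = {c.chain_id for c in pick_representatives(chains).values()}
    return [c.chain_id for c in chains if c.chain_id not in reps]
```

With this ordering, a 1.8 Å chain with 12 contacts would be preferred over a 1.8 Å chain with 7 contacts and over a 2.5 Å chain with any number of contacts.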

Clusters of interfaces

A cluster is defined as a set of DNA-binding proteins with significant structural similarity. Every monomer in the repository is structurally compared to the non-redundant set of complexes and all significant hits, with MAMMOTH (20) ln(E-value) < –7, are stacked in order to compile a multiple alignment or matrix of equivalent interface residues and bases. For alignment display purposes, only one base is linked to each interface residue, overlooking the cases in which a single residue contacts several bases. Expectation values among members of a cluster are converted to distances that support an unrooted dendrogram calculated with PHYLIP (21), as depicted in Figure 1E.

Figure 1.

(A–E) Typical content of 3D-footprint entries, illustrated with dimeric complex 1llm_CD, a Zif23-GCN4 chimera (32), and with non-redundant monomeric complex 1a0a_B, positive regulatory protein PHO4 (33). (A) An interface graph dissecting atomic contacts and nucleotides at the interface responsible for specific DNA discrimination, where solid bases indicate indirect readout mechanisms. (B) Footprint logo diagram of 1llm_CD containing four central base-pairs subject to indirect readout. (C) Sequence logo and structure-based PWM obtained by averaging contact and readout PWMs for complex 1llm_CD. The calculated information content places this complex in the dark gray region of the boxplot in panel G (see below). Note that the link underneath exports this PWM to an RSAT form where the user can scan genomes or DNA fragments for occurrences of this motif. (D) Examples of literature-extracted DNA sequences associated with the term ‘GCN4’ and their E-values, corresponding to non-redundant entry 1llm_D. (E) Dendrogram of similar interfaces for entry 1a0a_B, where the distance tree is based on the estimated structural similarity between binding domains, and interface residues (those with 4.5Å heavy atom contacts with nitrogen bases) are aligned, coloring their nucleotide partners. (F) Querying 3D-footprint with the protein sequence of Zea mays transcription factor PTZm00668.1 (34). Note that all six interface residues are covered in the alignment, but only three are conserved. (G) Scale of specificity observed for SCOP superfamilies in the database, computed over the parenthesized number of non-redundant complexes, after excluding superfamilies with fewer than seven complexes. An up-to-date scale is available at http://floresta.eead.csic.es/3dfootprint/stats.html.

Interface graphs and footprint logo diagrams

A detailed view of the interface is provided for every complex in the form of a graph that shows the DNA molecule encoded in the PDB coordinates and a list of labeled atomic contacts that confer sequence discrimination. Contacts can be of three types: hydrogen bonds, water-mediated hydrogen bonds and hydrophobic interactions. Each contact is assigned a distance-corrected log-odds score, extracted from the preference matrices mentioned earlier, that evaluates its statistical significance. Some side-chain contacts cannot be classified automatically; these are still reported, but without a score. The graph itself is produced using the Graphviz library at http://www.graphviz.org and a modified version of HBPLUS (22) that handles hydrophobic contacts. The 3DNA package (23) is employed for labeling indirect readout bases, shown as solid hexagons (circles in footprint logos), as previously explained in detail (7). Interface bases also display the number of side-chain heavy atom contacts within 4.5Å; the diameter of nitrogen bases in footprint diagrams is proportional to this number. Sample interface and footprint diagrams are displayed in Figure 1A and B.
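As an illustration of the 4.5Å contact tally, the following Python sketch counts, for each base, the side-chain heavy atoms lying within the cutoff of any atom of that base. The function name, the input layout and the choice to count atoms rather than atom pairs are assumptions for illustration only; the actual pipeline relies on HBPLUS and custom code.

```python
import math

CUTOFF = 4.5  # Angstroms, the distance threshold quoted in the text

def count_base_contacts(side_chain_atoms, base_atoms, cutoff=CUTOFF):
    """Tally side-chain heavy atoms close to each nitrogen base.

    side_chain_atoms: list of (x, y, z) coordinates of protein side-chain
                      heavy atoms (hypothetical input layout).
    base_atoms:       dict mapping a base label to a list of (x, y, z)
                      coordinates of its heavy atoms.
    Returns a dict base label -> number of side-chain atoms within cutoff.
    """
    counts = {}
    for base, atoms in base_atoms.items():
        counts[base] = sum(
            1
            for p in side_chain_atoms
            if any(math.dist(p, q) <= cutoff for q in atoms)
        )
    return counts
```

The resulting counts are the kind of per-base numbers that would set circle diameters in a footprint diagram.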

PWMs (3D-footprints)

It is possible to convert the tally of side-chain contacts for every base in the DNA molecule into a column of a PWM, under the reasonable assumption that the coordinates file actually harbors the consensus sequence. This conversion follows the formulation of Morozov and Siggia (6), in which bases with high contact numbers are more conserved in the consensus, as stated in previous reports (24). These are named contact PWMs, in contrast with readout PWMs, which are derived from the scores of the interface contacts that result when all 4^N nucleotide sequences of length N are threaded into the backbone of the DNA molecule in the complex. These latter PWMs consider both direct and indirect readout mechanisms and are calculated using the DNAPROT protocol (7). When reliable readout matrices are available (those derived from interfaces with at least five atomic interactions, see the ‘Benchmark’ section), a mean matrix is obtained by averaging both contact and readout PWMs. Otherwise, only contact PWMs are considered. In either case, 3D-footprints capture the binding specificity of proteins in the database and can be used to scan genomic sequences. However, not all proteins are equally specific. To guide the user, the information content (IC) of each footprint is reported as calculated by RSAT (17). IC has previously been shown to be a good approximation of binding specificity within bacterial regulatory networks (25). Sequence logos calculated with WEBLOGO (26) are more intuitive representations of PWMs, as they actually plot the IC of each motif; they are calculated by taking the best B sequences as scored by the reference PWM. B is set by default to 50, but can in some cases be a smaller number b, if the (b + 1)-th site has a score that is worse than the worst single nucleotide mutation. A structure-based PWM and its corresponding sequence logo, overlaid, are shown in Figure 1C. Note that the IC of this example PWM corresponds to a highly specific binding class, as shown in Figure 1G.
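The averaging rule and the IC measure can be sketched in Python as follows. This is a simplified illustration, not the 3D-footprint or RSAT implementation: PWMs are assumed to be lists of columns holding A, C, G, T probabilities, the IC formula assumes a uniform background and no pseudocounts (RSAT may use different priors), and the function names are hypothetical.

```python
import math

def mean_pwm(contact, readout, n_interactions, min_interactions=5):
    """Average contact and readout PWMs when the readout PWM is reliable.

    Following the text, readout PWMs are only considered for interfaces
    with at least five atomic interactions; otherwise the contact PWM is
    returned unchanged.
    """
    if readout is None or n_interactions < min_interactions:
        return contact
    return [
        [(c + r) / 2.0 for c, r in zip(c_col, r_col)]
        for c_col, r_col in zip(contact, readout)
    ]

def information_content(pwm):
    """Total IC in bits, assuming a uniform background (0.25 per base).

    Each column contributes 2 + sum(p * log2(p)); zero probabilities
    contribute nothing.
    """
    total = 0.0
    for col in pwm:
        total += 2.0 + sum(p * math.log2(p) for p in col if p > 0)
    return total
```

Under these assumptions a fully conserved column contributes 2 bits and a column with four equiprobable bases contributes 0 bits, so highly specific proteins yield footprints with larger IC values.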

Motif search engine

A library of structure-based PWMs, or 3D-footprints, is maintained, encompassing both monomeric and multimeric protein–DNA complexes. This library can be searched by taking an input motif provided by the user, running STAMP (27) with appropriate background score distributions for local (JaspRand_PCC_SWU) or global (JaspRand_SSD_NW_go1000_ge1000) alignments. If the input data is originally an oligonucleotide or consensus sequence, it is first converted to a PWM based on the minimum number of sites required to convert all sequence letters to integer values.
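One possible reading of the consensus-to-PWM conversion is sketched below in Python: each IUPAC letter is expanded to the bases it stands for, and the number of sites is chosen as the least common multiple of the degeneracies, so that every cell of the count matrix is an integer. This interpretation, and the function name, are assumptions; the server's own conversion may differ in detail.

```python
from functools import reduce
from math import gcd

# Standard IUPAC nucleotide codes and the bases they represent
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_counts(consensus):
    """Convert a degenerate consensus into an integer count matrix.

    Returns (n_sites, matrix), where matrix has one [A, C, G, T] row per
    position and every row sums to n_sites.
    """
    seq = consensus.upper()
    degeneracies = [len(IUPAC[ch]) for ch in seq]
    # Smallest number of sites giving integer counts in every column
    n_sites = reduce(lambda a, b: a * b // gcd(a, b), degeneracies, 1)
    matrix = []
    for ch in seq:
        allowed = IUPAC[ch]
        share = n_sites // len(allowed)
        matrix.append([share if base in allowed else 0 for base in "ACGT"])
    return n_sites, matrix
```

For instance, a consensus containing only unambiguous bases and N needs four sites, whereas one mixing two-fold and three-fold degenerate letters needs six.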

Protein search engine

The collection of protein sequences of non-redundant monomeric complexes can be searched taking a protein sequence as input. The search engine is powered by PSI-BLAST (28) and returns similar DNA-binding proteins evaluated in terms of E-value and also in terms of interface identity, which scores the proportion of residues in contact with DNA nitrogen bases that are conserved, as defined in a previous paper (29).
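Interface identity and interface coverage can be illustrated with a short Python sketch operating on a pairwise alignment. The function signature is hypothetical, and it is assumed here that both percentages are computed over the full set of interface residues of the database protein; the precise definition is given in reference (29).

```python
def interface_identity(query_aln, hit_aln, interface_positions, hit_start=0):
    """Return (%IID, %coverage) for a pairwise alignment.

    query_aln, hit_aln:  aligned sequences of equal length, '-' for gaps.
    interface_positions: 0-based positions, in the ungapped database (hit)
                         sequence, of residues contacting nitrogen bases.
    hit_start:           0-based position of the first aligned hit residue.
    """
    interface = set(interface_positions)
    if not interface:
        return 0.0, 0.0
    covered = identical = 0
    pos = hit_start - 1
    for q, h in zip(query_aln, hit_aln):
        if h == "-":
            continue  # insertion in the query: no hit residue consumed
        pos += 1
        if pos in interface and q != "-":
            covered += 1
            if q == h:
                identical += 1
    total = len(interface)
    return 100.0 * identical / total, 100.0 * covered / total
```

In the situation described for Figure 1F, where all six interface residues are covered but only three are conserved, this sketch would return 50% identity and 100% coverage.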

Consensus miner

3D-footprint includes a text-mining facility designed to extract oligonucleotide sequences, presumably consensus sequences, from the set of PubMed abstracts related to some input term, usually a protein name. This search engine builds on the Entrez eUtils toolbox at http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html and uses keywords to guide abstract retrieval. The set of keywords includes words found to be over-represented in papers reporting binding sites curated in RegulonDB 6.0 (18) and TRANSFAC 9.3 (19) and also terms identified by A.Santos-Zavaleta, curator of the RegulonDB team (see http://floresta.eead.csic.es/3dfootprint/miner.html). Non-redundant entries of the database include a table with similar DNA sequences reported in the literature, with E-values associated to STAMP local alignments, as shown in Figure 1D. The terms associated to non-redundant entries, which drive the PubMed search, are extracted from the corresponding title in the PDB file by running the GAPSCORE tagger (30) and are then manually curated when entries are added to the database for the first time. As of 19 July 2009, the database contains 184 PubMed reports linked to non-redundant entries.
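The extraction step of such a miner can be approximated with a regular expression, as in the Python sketch below. The pattern, the minimum length of six letters and the requirement of at least four unambiguous bases are illustrative assumptions, not the rules used by the 3D-footprint miner, and abstract retrieval through Entrez eUtils is deliberately left out.

```python
import re

# Assumed pattern: runs of six or more upper-case IUPAC nucleotide letters
# that are not embedded in a longer word.
OLIGO = re.compile(r"(?<![A-Za-z])[ACGTRYSWKMBDHVN]{6,}(?![A-Za-z])")

def extract_oligos(text, min_unambiguous=4):
    """Return candidate oligonucleotide strings found in an abstract.

    Candidates with fewer than min_unambiguous A/C/G/T letters are
    discarded to reduce false positives such as acronyms.
    """
    hits = []
    for match in OLIGO.finditer(text):
        candidate = match.group(0)
        if sum(candidate.count(base) for base in "ACGT") >= min_unambiguous:
            hits.append(candidate)
    return hits
```

A pattern this simple will still pick up some upper-case words that happen to use only nucleotide letters, which is one reason why keyword-guided retrieval and manual curation remain useful.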

WEB INTERFACE: EXAMPLES OF USE

This section presents a few examples of how the database can help with queries related to protein–DNA recognition. Typically, the result of a query will be a list of entries in the database, each containing several diagrams and reports as summarized in Figure 1. The possible queries and the interpretation of results are further explained in the online tutorial at http://floresta.eead.csic.es/3dfootprint/tutorial.html. The tutorial also includes sample Perl code to query the web services engine.

Text search

The user can type a single term, such as a name, a PDB identifier, a superfamily or a species that is related to a protein of interest and the server will return a list of links to relevant entries. Each entry contains a summary of the protein–DNA complex including the elements explained above, which will include a structure-based PWM if evidence for specific DNA binding is found.

Motif search

Another way of interrogating 3D-footprint is by asking: is there a protein that binds a DNA motif or sequence similar to this? For this purpose the user can simply input an oligonucleotide sequence, in which degenerate bases are allowed, or alternatively a PWM in CONSENSUS or TRANSFAC format. By default, searches run local motif alignments, but it is often useful to try global alignments, particularly when searching for spaced short repeats, such as the CGGNNNNNNNNNNNCGG yeast Gal4 motif. The expectation value cutoff for motif similarity can be changed by the user.

Protein sequence search

The third basic query to 3D-footprint can be done by pasting a protein sequence. This will trigger a search for similar proteins in the database, which are likely to bind similar DNA motifs provided that they conserve the residues at the interface (6,29). A sample alignment is included in Figure 1F. The server returns a list of complexes ranked by overall similarity, reported as a BLAST E-value. The percent interface identity (%IID) and the percentage of interface coverage are also shown to assist in the interpretation of the alignments, which are printed in a BLAST-like format. The motifs of matched DNA-binding proteins are only displayed when %IID is at least 50%. If any similar complexes are found, the server provides a link that exports the input protein sequence to the TFmodeller web server (31), in case the user wishes to build a comparative model of the input protein docked to a DNA molecule.

Analyzing a protein–DNA complex provided by the user (interactive footprint)

The pipeline of analysis is also available for interactive use. The required input is a file containing the coordinates of a protein–DNA complex in PDB format. For the pipeline to work correctly, protein and DNA atoms must have different chain identifiers. This anonymous form can be used to analyze experimentally determined complexes or models generated by computational approaches. There is an option, set by default, to sample interface side-chain rotamers during the analysis, as this was found to be important in previous experiments (7). The returned results include interface diagrams and estimated binding preferences in PWM and sequence logo format.
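Before submitting a file, the chain identifier requirement can be checked locally with a few lines of Python, as sketched below. The sketch reads the fixed-width ATOM records of the PDB format (residue name in columns 18 to 20, chain identifier in column 22). The set of residue names treated as DNA is an assumption covering current (DA, DC, DG, DT) and older single-letter naming, and the function names are hypothetical; this check is not part of the 3D-footprint server.

```python
# Residue names treated as DNA (assumption: current and legacy PDB naming)
DNA_RESNAMES = {"DA", "DC", "DG", "DT", "A", "C", "G", "T"}

def chain_types(pdb_text):
    """Map each chain identifier to the molecule types found in it."""
    kinds = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        resname = line[17:20].strip()  # columns 18-20
        chain = line[21]               # column 22
        kind = "dna" if resname in DNA_RESNAMES else "protein"
        kinds.setdefault(chain, set()).add(kind)
    return kinds

def has_separate_chains(pdb_text):
    """True if protein and DNA are both present and never share a chain."""
    kinds = chain_types(pdb_text)
    if not kinds:
        return False
    found = set().union(*kinds.values())
    unmixed = all(len(k) == 1 for k in kinds.values())
    return unmixed and found == {"protein", "dna"}
```

Files in which a chain mixes amino acid and nucleotide residues would fail this check and would need their chain identifiers edited before analysis.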

Mining DNA motifs in PubMed abstracts

A consensus miner is bundled in the website, where the user can paste a protein name to launch a PubMed search that will return a list of short DNA sequences related to the input term. As explained in the ‘Data, methods and implementation’ section, this search is driven by keywords known to be associated with binding sites in prokaryotes and eukaryotes. Results are tabulated and each oligonucleotide reported has pointers to the original literature sources.

Downloading data from 3D-footprint

Most data in 3D-footprint can be obtained at the download area, which is updated every week. The collection of structure-based PWMs is distributed in CONSENSUS and TRANSFAC format, for use with motif scanning software. The set of protein sequences for the non-redundant set of monomeric complexes is also available, including a list of interface residues in the FASTA header. The set of atomic interaction matrices, used to evaluate interface contacts, and the list of complexes used to derive them, are also available. The ‘Statistics’ section at http://floresta.eead.csic.es/3dfootprint/stats.html includes a report, concerning the specificity of 3D-footprints, which is also updated weekly.

BENCHMARK

The ‘Benchmark’ and ‘Stats’ sections of the website describe experiments that evaluate the quality of structure-based estimations of binding specificity; this section makes references to figures included there due to space restrictions.

Comparing contact and readout PWMs

3D-footprint uses two different algorithms in order to infer binding preferences from sets of atomic coordinates, but how do they compare to each other? Figure B4, panel A, of the ‘Benchmark’ section indicates that contact and readout PWMs, inferred with the two algorithms, become more similar as the number of atomic interactions at the interface increases, as measured with STAMP local alignments. Figure B4, panel B, provides further information, as it shows that the IC of readout PWMs follows the same trend. Taken together, these results suggest that the DNAPROT method is more sensitive to the quality of input data, as it requires a minimum number of atomic interactions in order to produce PWMs with high similarity to contact PWMs, which are assumed to describe cognate sites. As a consequence, a minimum of five atomic interactions was required in order to further consider readout PWMs, which according to the regression lines is equivalent to a −log(E-value) of 4.3 and an information content of 3.3 bits. In addition, it was necessary to evaluate the specificity estimations of 3D-footprint by comparison with external datasets of known quality. To this end, the set of 22 transcription factors in 3D-footprint for which PWMs were available from RegulonDB (seven proteins from Escherichia coli) and TRANSFAC (15 eukaryotic proteins) was taken to plot Figure B5, panel A, which unveils a strong correlation between both measurements. Figure B5, panel B, shows that the mean IC values for transcription factor superfamilies in 3D-footprint match those calculated from collections of TRANSFAC motifs, although it also hints that two superfamilies (homeodomains and zinc fingers) are often over-estimated. These data support the scale of specificity presented in Figure 1G, which includes additional DNA-binding protein superfamilies, such as restriction and homing endonucleases.
Furthermore, Figure S3 in the ‘Stats’ section provides one more independent evaluation of the quality of 3D-footprint, based on the set of entries for which a PubMed report is available. The histogram shows the E-values obtained after comparing 3D-footprints and literature motifs by means of local STAMP alignments, demonstrating that structure-based PWMs are indeed significantly similar to the motifs published in the literature for those proteins.

CONCLUSIONS

Existing databases (13–15) have already demonstrated that structures deposited at the PDB contain a wealth of information that can be exploited in order to study the mechanisms of DNA binding in biological systems. Moreover, benchmark experiments presented in this and earlier (25) work suggest that atomic coordinates can be effectively used to compute binding specificity and justify the main mission of 3D-footprint: to enrich the repertoire of known regulatory elements with those embedded in the atomic descriptions of protein–DNA complexes.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Consejo Superior de Investigaciones Científicas [grant number 200720I038]. Funding to pay the Open Access publication charges for this article was provided by [grant A06] from Gobierno de Aragón to the research group of J.M. Lasa in 2009.

Conflict of interest statement. None declared.

33 in total

1. Structure-based prediction of DNA target sites by regulatory proteins.

Authors: H Kono; A Sarai
Journal: Proteins Date: 1999-04-01

2. The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids.

Authors: H M Berman; W K Olson; D L Beveridge; J Westbrook; A Gelbin; T Demeny; S H Hsieh; A R Srinivasan; B Schneider
Journal: Biophys J Date: 1992-09 Impact factor: 4.033

3. WebLogo: a sequence logo generator.

Authors: Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

4. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity.

Authors: D J Galas; A Schmitz
Journal: Nucleic Acids Res Date: 1978-09 Impact factor: 16.971

5. The Protein Mutant Database.

Authors: T Kawabata; M Ota; K Nishikawa
Journal: Nucleic Acids Res Date: 1999-01-01 Impact factor: 16.971

Review 6. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

7. Crystal structure of PHO4 bHLH domain-DNA complex: flanking base recognition.

Authors: T Shimizu; A Toumoto; K Ihara; M Shimizu; Y Kyogoku; N Ogawa; Y Oshima; T Hakoshima
Journal: EMBO J Date: 1997-08-01 Impact factor: 11.598

8. SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Authors: A G Murzin; S E Brenner; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1995-04-07 Impact factor: 5.469

9. Satisfying hydrogen bonding potential in proteins.

Authors: I K McDonald; J M Thornton
Journal: J Mol Biol Date: 1994-05-20 Impact factor: 5.469

10. Protein-DNA binding specificity predictions with structural models.

Authors: Alexandre V Morozov; James J Havranek; David Baker; Eric D Siggia
Journal: Nucleic Acids Res Date: 2005-10-24 Impact factor: 16.971
