Kevin A Snyder, Howard J Feldman, Michel Dumontier, John J Salama, Christopher W V Hogue.
Abstract
BACKGROUND: Accurate small molecule binding site information for a protein can facilitate studies in drug docking, drug discovery and function prediction, but small molecule binding site annotation of protein sequences is sparse. The Small Molecule Interaction Database (SMID), a database of protein domain-small molecule interactions, was created using structural data from the Protein Data Bank (PDB). More importantly, it provides a means to predict small molecule binding sites on proteins with a known or unknown structure and, unlike prior approaches, removes large numbers of false positive hits arising from transitive alignment errors, non-biologically significant small molecules and crystallographic conditions that overpredict ion binding sites.
DESCRIPTION: Using a set of co-crystallized protein-small molecule structures as a starting point, SMID interactions were generated by identifying protein domains that bind to small molecules, using NCBI's Reverse Position Specific BLAST (RPS-BLAST) algorithm. SMID records are available for viewing at http://smid.blueprint.org. The SMID-BLAST tool provides accurate transitive annotation of small molecule binding sites for proteins not found in the PDB. Given a protein sequence, SMID-BLAST identifies domains using RPS-BLAST and then lists potential small molecule ligands based on SMID records, as well as their aligned binding sites. A heuristic ligand score is calculated from E-value, ligand residue identity and domain entropy to assign a level of confidence to the hits found. SMID-BLAST predictions were validated against a set of 793 experimental small molecule interactions from the PDB; 472 (60%) of the predicted interactions identically matched the experimental small molecule and, of these, 344 had greater than 80% of the binding site residues correctly identified. Further, we estimate that 45% of the predictions that were not observed in the PDB validation set may be true positives.
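The validation figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the three counts stated there (793, 472 and 344); the variable names are illustrative and are not taken from SMID.

```python
# Counts taken from the abstract; variable names are illustrative only.
validation_set = 793   # experimental small molecule interactions from the PDB
matched_ligand = 472   # predictions that identically matched the experimental ligand
high_coverage = 344    # matched predictions with >80% of site residues correct

ligand_match_rate = matched_ligand / validation_set
high_coverage_of_matched = high_coverage / matched_ligand
high_coverage_overall = high_coverage / validation_set

print(f"ligand matched:                  {ligand_match_rate:.1%}")
print(f"of matched, >80% residues right: {high_coverage_of_matched:.1%}")
print(f"of all, >80% residues right:     {high_coverage_overall:.1%}")
```

The first ratio is about 59.5%, which the abstract rounds to 60%. The 344 well-covered predictions amount to roughly 73% of the matched predictions and roughly 43% of the whole validation set.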
Year: 2006 PMID: 16545112 PMCID: PMC1435939 DOI: 10.1186/1471-2105-7-152
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. SMID record as viewed from the SMID web interface. This record was derived from PDB entry 1HG1, which shows an interaction between an Asparaginase domain (residues 15–322 of chain A, identified by RPS-BLAST with an E-value of 1.34e-103) and D-Aspartate. The GI for 1HG1 chain A is 15825850. For the SMID record shown, seven of the eight residues of the binding site are located within the Asparaginase domain.
Figure 2. SMID small molecule information page, as viewed from the SMID web interface. The small molecule page shown here indicates that 8 SMID records involve the molecule D-Aspartate.
Figure 3. A CDD domain family multiple alignment. All sequences from a CDD domain family are listed, including the consensus. In addition, the sequence for the PDB protein from which the SMID interaction was derived is included, with its PDB code highlighted in red. Lowercase residues do not align with the consensus and represent insertions or deletions relative to the consensus. Small molecule binding site residues are mapped to the domain family sequences from the parent PDB sequence using the following colour-coding scheme: red for conserved residues, blue for similar residues and yellow for non-conserved residues. In cases where a binding site aligns to a gap in the consensus, conservation cannot be measured and thus no coloured residue is displayed. Note that some binding site residues may be highlighted in addition to those associated with the parent PDB sequence if there are redundant interactions from other PDB files with a similar binding site. This alignment has been truncated for clarity.
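The colour-coding rule in the Figure 3 caption can be sketched as a small function. This is only an illustration: the caption does not say how "similar" residues are defined, so the similarity groups below are a hypothetical choice, and `site_colour` and `SIMILAR_GROUPS` are invented names rather than part of SMID.

```python
# Hypothetical similarity groups; the source does not define "similar" residues.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # hydroxyl-containing
    set("NQ"),     # amides
    set("DE"),     # acidic
    set("KRH"),    # basic
]


def site_colour(residue, consensus):
    """Colour for one binding site residue, or None when aligned to a gap.

    Both arguments are single uppercase one-letter codes; '-' marks a gap
    in the consensus.
    """
    if consensus == "-":
        return None  # conservation cannot be measured against a gap
    if residue == consensus:
        return "red"  # conserved
    if any(residue in g and consensus in g for g in SIMILAR_GROUPS):
        return "blue"  # similar (under the assumed grouping)
    return "yellow"  # non-conserved


for pair in [("D", "D"), ("D", "E"), ("D", "K"), ("D", "-")]:
    print(pair, site_colour(*pair))
```

Swapping in a substitution-matrix threshold for the grouping would change only the "blue" branch.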
Figure 4. A 3-D SMID interaction. The X-ray crystallographic structure of Erwinia chrysanthemi L-Asparaginase associating with D-Aspartate (PDB ID: 1HG1), as viewed by Cn3D. The structure was annotated by SMID to highlight the domain residues (purple), domain residues contacting the D-Aspartate molecule (green) and the non-domain residues (grey). The D-Aspartate small molecule ligand is shown in space-fill format. The sequence/alignment viewer provides sequences for all chains found in the PDB record. For the sequence involved in the small molecule interaction, residues are colour-coded using the same scheme seen in the structural model.
Figure 5. SMID-BLAST validation final ligand score distributions. a) Distribution of predictions in the validation set as a function of final ligand score. The solid line represents the percentage of correct predictions, while the dotted line represents predictions that were not observed in the PDB validation set; the latter comprise both false positives and true positives that simply have not been observed yet. For example, 12% of correct predictions had a final ligand score below 100, while 21% of unvalidated predictions had a final ligand score below 100. The dashed line represents an estimate of the distribution of final ligand scores for false positives, as outlined in the text. b) Coverage as a function of final ligand score for the predictions that were observed in the PDB validation set. Coverage is defined as the percentage of true binding site residues that were included in the predicted binding site.
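The coverage measure defined in the Figure 5 caption translates directly into a set operation. The sketch below follows that definition; the function name and the residue numbers in the example are made up for illustration.

```python
from typing import Set


def coverage(true_site: Set[int], predicted_site: Set[int]) -> float:
    """Percent of true binding site residues present in the predicted site."""
    if not true_site:
        raise ValueError("true binding site is empty")
    return 100.0 * len(true_site & predicted_site) / len(true_site)


# Made-up residue numbers, for illustration only.
true_site = {11, 18, 23, 24, 90}
predicted_site = {11, 18, 24, 63, 90, 101}

# 4 of the 5 true residues are recovered, so coverage is 80.0.
print(coverage(true_site, predicted_site))
```

Residues that are predicted but not in the true site (63 and 101 here) do not lower coverage, so the measure says nothing about over-prediction.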
Figure 6. Selected chemical structures. Chemical structures of selected SMID-BLAST small molecule hits from query proteins MiaB, Phosphoglycerate Mutase, TrpRS and TyrRS.
SMID-BLAST hits for Burkholderia pseudomallei K96243 tRNA thiotransferase. For clarity, only small molecule hits with a final ligand score above the cutoff value of 50 are included. Molecule 3-letter names were obtained from PDBSum [71].
| Predicted binding site residues | Final ligand score |
| 157,159,161,163–165,167–168,200–202,207–208,242–244,281 | 181.641 |
| 151,153,163–166,199–202,205–208,240–244,267,269,281,288,307,309–310,338–341,350 | 108.434 |
Figure 7. Binding sites predicted by SMID-BLAST. a) Shown is a comparative model of the predicted Elp3 domain of MiaB. The iron-sulfur cluster (orange) and SAM (CPK stick model) have had their co-ordinates transferred from the modelling template, PDB 1OLT chain A, to illustrate how they might bind. The predicted Fe-S binding site residues are indicated in red, the predicted SAM binding residues are shown in purple, and the three cysteine residues which interact with the Fe-S cluster are indicated in yellow. A mixture of red and purple was used for residues common to both binding sites. b) Structural alignment of PDB 1RII chain A (phosphoglycerate mutase from M. tuberculosis, blue) and 2BIF chain A (6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase from Rattus norvegicus, yellow). The small molecules from 2BIF are also shown along with their PDB short labels. Purple molecules associate with the N-terminal domain of 2BIF chain A, while blue molecules associate with the C-terminal domain. Note that BOG was part of the crystallization buffer in this example. Structures were aligned with Swiss PDBViewer.
SMID-BLAST hits for Mycobacterium tuberculosis CDC1551 Phosphoglycerate Mutase. For clarity, only small molecule hits with a final ligand score above the cutoff value of 50 are included. Molecule 3-letter names were obtained from PDBSum [71].
| Predicted binding site residues | Final ligand score |
| 11–12,14–15,17–18,22–24,93,101,117–118,206 | 897.76 |
| 11,18,23–25,90–91,93,101,117–118,185 | 804.199 |
| 11–15,18–19,24,63,90,183,209 | 799.119 |
| 11,18,22–25,63,90,93,101,117 | 690.171 |
| 11–12,18,23,63,90,153,183–184,188 | 555.168 |
| 11–12,18,24,63,90,183–185 | 528.282 |
| 11,18,93,117–118,185,206 | 437.423 |
| 12,14–15,18–19,24 | 257.242 |
| 23–24,90,101,115,119,153,184,188 | 134.283 |
| 23,90–91,101,112,115,119,124,153,184–185,188 | 124.075 |
| 101,115,119,153,184–185,188 | 62.337 |
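The residue lists in these tables use a compact notation that mixes single positions with en-dash ranges. A minimal sketch of expanding that notation and applying the score cutoff of 50 mentioned in the captions is given below. The helper `parse_residues` is a hypothetical name, and the two rows are copied from the Phosphoglycerate Mutase table above.

```python
import re
from typing import List


def parse_residues(spec: str) -> List[int]:
    """Expand a list such as '12,14–15,24' into individual residue numbers."""
    residues: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Accept the en dash used in the tables as well as a plain hyphen.
        bounds = re.split(r"[–-]", part)
        start, end = int(bounds[0]), int(bounds[-1])
        residues.extend(range(start, end + 1))
    return residues


CUTOFF = 50.0  # final ligand score cutoff quoted in the table captions

rows = [
    ("12,14–15,18–19,24", 257.242),
    ("101,115,119,153,184–185,188", 62.337),
]

for spec, score in rows:
    if score > CUTOFF:
        print(score, parse_residues(spec))
```

Once expanded, residue lists from different hits can be compared as sets, for example to see which positions several predicted ligands share.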
Top scoring SMID-BLAST ligand hits for TrpRS and TyrRS across a wide range of organisms. 82 TrpRS sequences and 83 TyrRS sequences were employed. The expected best ligand is shown in bold.
| Ligand to TrpRS | Hits | Ligand to TyrRS | Hits |
| Top scoring ligand, ignoring ATP and Mg++ | | | |
| | 80 | | 37 |
| Tyrosinal | 2 | SB-239629 | 18 |
| | | Tryptophanyl Adenylate | 13 |
| | | SB-243545 | 9 |
| | | SB-284485 | 5 |
| | | D-tyrosine | 1 |
| Top scoring ligand, ignoring ATP, Mg++ and binding site occupancy | | | |
| | 36 | SB-239629 | 50 |
| L-tryptophan | 22 | D-tyrosine | 31 |
| D-tyrosine | 10 | | 1 |
| SB-239629 | 7 | Tyrosinal | 1 |
| Tyrosinal | 6 | | |
| L-tryptophanamide | 1 | | |
Figure 8. An overview of how protein-small molecule interactions are identified from the PDB and utilized to generate SMID records. The process of 'Interaction Tagging' identifies protein-small molecule interactions that involve i) single atom contacts, ii) an unknown protein sequence, iii) a biologically irrelevant small molecule, or iv) false contacts with biologically relevant ions using a Support Vector Machine. See text for details.
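The four tagging criteria in the Figure 8 caption can be pictured as a set of independent checks on each interaction. The sketch below is a rough illustration under stated assumptions, not the SMID implementation: the `Interaction` fields are hypothetical, the Support Vector Machine verdict is assumed to have been computed elsewhere and passed in as a boolean, and reading "single atom contacts" as exactly one contacting atom is an interpretation. The caption does not say what happens to tagged interactions, so the function only reports which tags apply.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    # All fields are hypothetical stand-ins for data derived from a PDB entry.
    contact_atoms: int           # ligand atoms in contact with the protein
    sequence_known: bool         # False when the protein sequence is unknown
    ligand_relevant: bool        # False for biologically irrelevant molecules
    svm_false_ion_contact: bool  # assumed precomputed SVM verdict for ions


def tags(interaction: Interaction) -> List[str]:
    """Return the tags that apply to an interaction; empty means untagged."""
    found: List[str] = []
    if interaction.contact_atoms == 1:
        found.append("single atom contact")
    if not interaction.sequence_known:
        found.append("unknown protein sequence")
    if not interaction.ligand_relevant:
        found.append("biologically irrelevant small molecule")
    if interaction.svm_false_ion_contact:
        found.append("false ion contact")
    return found


print(tags(Interaction(1, True, False, False)))
print(tags(Interaction(12, True, True, False)))
```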