James B Dunbar, Richard D Smith, Chao-Yie Yang, Peter Man-Un Ung, Katrina W Lexa, Nickolay A Khazanov, Jeanne A Stuckey, Shaomeng Wang, Heather A Carlson.
Abstract
A major goal in drug design is the improvement of computational methods for docking and scoring. The Community Structure Activity Resource (CSAR) aims to collect available data from industry and academia that may be used for this purpose (www.csardock.org). CSAR is also charged with organizing community-wide exercises based on the collected data. The first of these exercises aimed to gauge the overall state of docking and scoring, using a large and diverse data set of protein-ligand complexes. Participants were asked to calculate the affinity of the complexes as provided and then to recalculate with changes that might improve their specific method. This first data set was selected from existing PDB entries that had binding data (Kd or Ki) in Binding MOAD, augmented with entries from PDBbind. The final data set contains 343 diverse protein-ligand complexes and spans 14 pKd units. Sixteen proteins have three or more complexes in the data set, from which a user could start an inspection of congeneric series. Inherent experimental error limits the possible correlation between scores and measured affinity: Pearson R is limited to ~0.91 (Pearson R² ~0.83) when fitting to the data set without overparameterizing, and to ~0.83 (Pearson R² ~0.70) when scoring the data set with a method trained on outside data [corrected]. The details of how the data set was initially selected, and the process by which it matured to better fit the needs of the community, are presented. Many groups generously participated in improving the data set, which underscores the value of a supportive, collaborative effort in moving our field forward.
Year: 2011 PMID: 21728306 PMCID: PMC3180202 DOI: 10.1021/ci200082t
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
The Evolution of Sets of Protein–Ligand Complexes with Affinities
| database | year | no. complexes |
|---|---|---|
| Score1 | 1994 | 54 |
| Jain | 1996 | 34 |
| VALIDATE | 1996 | 65 |
| ChemScore | 1997 | 112 |
| Score2 | 1998 | 94 |
| SCORE | 1998 | 181 |
| PMF | 1999 | 225 |
| BLEEP | 1999 | 90 |
| DrugScore | 2000 | 83 |
| LPDB | 2001 | 195 |
| SMoG2001 | 2002 | 119 |
| HINT | 2002 | 53 |
| X-Score | 2002 | 230 |
| PLD | 2003 | 485 |
| AffinDB | 2006 | 474 |
| PDBbind – refined set | 2007 | 1300 |
| MOAD with affinities | 2008 | 2948 |
Figure 1. Flowchart of how the data set was curated.
Figure 2. Distribution analysis of the calculated physical properties of the data sets.
Congeneric Series Present in the Data Set
| protein | count | ligand series type | PDB IDs |
|---|---|---|---|
| HIV-1 protease (L63P) | 11 | hydroxyethylamine | 1ec0 |
| HIV-1 protease (WT) | 11 | hydroxyethylamine | |
| tyrosine-protein phosphatase | 8 | thiophene-dicarboxylic acid | |
| coagulation factor X | 6 | pyrrolidin-2-one | 2cji |
| carbonic anhydrase 2 | 6 | sulfamide analogues | |
| tRNA guanine transglycosylase | 5 | quinazolin-4-one | |
| HMGCoA reductase | 5 | atorvastatin analogues | |
| acetylcholinesterase | 4 | huperzine/hupyridone | |
| estrogen receptor alpha | 4 | phenol | |
| urokinase | 4 | benzamidine analogues | 1gj8 |
| β-1,4-xylanase | 3 | aza-sugar analogues | |
| glutamate [NMDA] receptor ζ1 | 3 | small ring amino carboxylic acid | |
| lectin | 3 | saccharides | |
| membrane lipoprotein tmpC | 3 | purine nucleoside analogues | |
| retropepsin | 3 | macrocyclic peptidomimetic | |
| transporter (LeuT) | 3 | small amino acid | |
Figure 3. The addition of random error with standard deviations of 0.5 log K (top) or 1.0 log K (bottom) does not significantly degrade the “signal to noise” in the CSAR-NRC data set. Only 10 of the 100 randomly generated sets are shown for clarity, and a line with a slope of 1.0 is given as a guideline in all the graphs. (Left) Correlations based on the model that the reported affinities are ideal (y-axis) and that random, normally distributed error can be added to generate possible measurements from another lab (x-axis). (Right) Correlations based on the model that both the reported value and another measured value could have the same random error. These plots also approximate the variation between scores and measured affinity values.
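The two error models in the Figure 3 caption can be illustrated with a short simulation. The following is a minimal sketch, not the authors' procedure: the affinities are synthetic placeholders (343 values drawn from a normal distribution with a hypothetical mean of 6.0 and standard deviation of 2.0 pK units), and the names `pearson_r`, `simulate_r`, and `placeholder_pk` are invented for illustration. Only the number of complexes (343), the number of random sets (100), and the error standard deviations (0.5 and 1.0 log K) come from the text.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_r(reference_pk, sigma, n_sets=100, both_noisy=False, seed=0):
    """Return Pearson R for n_sets noisy copies of reference_pk.

    both_noisy=False: reference values are treated as ideal (y) and
    normally distributed error is added to one copy (x).
    both_noisy=True: independent error of the same size is added to
    both x and y.
    """
    rng = random.Random(seed)
    r_values = []
    for _ in range(n_sets):
        x = [v + rng.gauss(0.0, sigma) for v in reference_pk]
        if both_noisy:
            y = [v + rng.gauss(0.0, sigma) for v in reference_pk]
        else:
            y = list(reference_pk)
        r_values.append(pearson_r(x, y))
    return r_values


# Hypothetical placeholder affinities; NOT the CSAR-NRC data.
_placeholder_rng = random.Random(42)
placeholder_pk = [_placeholder_rng.gauss(6.0, 2.0) for _ in range(343)]

for sigma in (0.5, 1.0):
    for both_noisy in (False, True):
        r_values = simulate_r(placeholder_pk, sigma, both_noisy=both_noisy)
        model = "error on both axes" if both_noisy else "error on one axis"
        mean_r = sum(r_values) / len(r_values)
        print(f"sigma={sigma:.1f}, {model}: mean R = {mean_r:.3f}")
```

Replacing `placeholder_pk` with the actual affinities of a data set would show how much correlation survives a given level of measurement error. Adding error to both axes lowers R more than adding it to one axis, which is the qualitative difference between the left and right panels.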