Gábor Iván, Zoltán Szabadka, Vince Grolmusz.
Abstract
The Protein Data Bank (PDB) today contains descriptions of more than 45,000 three-dimensional protein and nucleic-acid structures. Because the PDB began as a computer-readable depository of crystallographic data complementing printed articles, the proper interpretation of the content of individual PDB files still frequently requires detailed information found in the citing publication. Consequently, fully automatic processing of the whole PDB is a very hard task. We first cleaned and restructured the PDB data, then analyzed the residue composition of the binding sites in the whole PDB for frequency and for hidden association rules. The main results of the paper are: (i) the cleaning and repairing algorithm; (ii) redundancy elimination from the data; and (iii) the application of association rule mining to the cleaned, non-redundant data set. We found numerous significant relations in the residue composition of ligand binding sites on protein surfaces, summarized in two figures. Association-rule mining, one of the classical data-mining methods for exploring implication rules, is capable of finding previously unknown residue-set preferences of bound ligands on protein surfaces. Since protein-ligand binding is a key step in enzymatic mechanisms and in drug discovery, the preferences uncovered in this study of more than 19,500 binding sites may help in identifying new binding protein-ligand pairs.
Keywords: association rules; binding site; functions; protein; structural data
Year: 2007 PMID: 18305831 PMCID: PMC2241929 DOI: 10.6026/97320630002216
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. Association rules, Set 1: The figure was created by deleting all X → GLY association rules for clarity and including those rules whose support is at least 7.15% and whose confidence is at least 0.5 and for which, moreover, at least one of the following conditions holds: (a) the confidence is at least 0.8, (b) the lift is at least 1.8, (c) the lift is at most 0.97, or (d) the support is at least 24%. The color and width of the arrows correspond to the lift; the color of the residue sets corresponds to the support, as shown in the figure legend. Four areas are identifiable in the figure: the lower half shows the rules with large lift; the upper left corner, the rules with high confidence (with one exception); the upper middle part, the rules with lift lower than 0.97; and the upper right corner, the rules with high support. Note that these rules form almost disjoint classes.
Figure 2. Association rules, Set 2: The figure was created by deleting all X → GLY association rules for clarity and including only those rules whose support is at least 7.15%, whose confidence is at least 0.55, and whose lift is at least 1.7.
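The selection criteria stated in the two figure captions can be illustrated with a short sketch. It is not the authors' implementation. It assumes the standard textbook definitions of support, confidence and lift, treats each binding site as a set of residue names, and reads "X → GLY" as any rule whose consequent is exactly GLY. The names `Rule`, `score_rule`, `keep_for_figure_1` and `keep_for_figure_2`, as well as the toy binding sites, are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float      # fraction of sites containing antecedent and consequent
    confidence: float   # support(A and C) / support(A)
    lift: float         # confidence / support(C)


def score_rule(sites: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> Rule:
    """Score the rule antecedent -> consequent over a list of residue sets.

    Assumption: standard definitions of support, confidence and lift.
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(sites)
    n_a = sum(1 for s in sites if a <= s)
    n_c = sum(1 for s in sites if c <= s)
    n_ac = sum(1 for s in sites if (a | c) <= s)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return Rule(a, c, support, confidence, lift)


GLY = frozenset({"GLY"})


def keep_for_figure_1(rule: Rule) -> bool:
    """Thresholds quoted in the Figure 1 caption (7.15% written as 0.0715)."""
    if rule.consequent == GLY:  # assumed reading of "X -> GLY"
        return False
    if rule.support < 0.0715 or rule.confidence < 0.5:
        return False
    return (rule.confidence >= 0.8 or rule.lift >= 1.8
            or rule.lift <= 0.97 or rule.support >= 0.24)


def keep_for_figure_2(rule: Rule) -> bool:
    """Thresholds quoted in the Figure 2 caption."""
    if rule.consequent == GLY:  # assumed reading of "X -> GLY"
        return False
    return (rule.support >= 0.0715 and rule.confidence >= 0.55
            and rule.lift >= 1.7)


# Toy binding sites, invented for illustration only.
toy_sites = [frozenset(s) for s in (
    {"HIS", "ASP", "GLY"}, {"HIS", "ASP"}, {"HIS", "SER"}, {"GLY", "SER"},
)]
his_to_asp = score_rule(toy_sites, {"HIS"}, {"ASP"})
print(his_to_asp, keep_for_figure_1(his_to_asp), keep_for_figure_2(his_to_asp))
```

In this toy data the rule HIS → ASP passes the Figure 1 filter through condition (d), high support, but fails the Figure 2 filter because its lift is below 1.7. This shows how the two figures can select different subsets of the same rule set.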