Akira R Kinjo, Haruki Nakamura.
Abstract
A method to search for local structural similarities in proteins at atomic resolution is presented. It is demonstrated that a huge amount of structural data can be handled within a reasonable CPU time by using a conventional relational database management system with appropriate indexing of geometric data. This method, which we call geometric indexing, can enumerate ligand binding sites that are structurally similar to sub-structures of a query protein among more than 160,000 possible candidates within a few hours of CPU time on an ordinary desktop computer. After a set of high-scoring ligand binding sites is detected by the geometric indexing search, structural alignments at atomic resolution are constructed by iteratively applying the Hungarian algorithm, and the statistical significance of the final score is estimated from an empirical model based on a gamma distribution. Applications of this method to several protein structures clearly show that significant similarities can be detected between local structures of non-homologous as well as homologous proteins.
Keywords: Hungarian algorithm; geometric indexing; ligand binding sites; relational database; structural alignment
Year: 2007 PMID: 27857569 PMCID: PMC5036654 DOI: 10.2142/biophysics.3.75
Source DB: PubMed Journal: Biophysics (Nagoya-shi) ISSN: 1349-2942
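The alignment step named in the abstract relies on the Hungarian algorithm, which solves the assignment problem: pair each query atom with a distinct template atom so that the total cost is minimal. The following is a minimal sketch of one assignment pass, not the authors' implementation. The function names `hungarian` and `match_atoms` are hypothetical, the squared-distance cost is an assumption, and the iterative re-superposition and scoring described in the abstract are not reproduced here.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Textbook O(n^2 * m) potential-based formulation.
    Returns (assignment, total) where assignment[i] is the column of row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)        # row potentials (1-based)
    v = [0.0] * (m + 1)        # column potentials (1-based)
    p = [0] * (m + 1)          # p[j] = row matched to column j, 0 = free
    way = [0] * (m + 1)        # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:     # reached a free column: path is complete
                break
        while j0:              # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


def match_atoms(query, template):
    """Pair query atoms with template atoms by squared distance (assumed cost)."""
    cost = [[sum((a - b) ** 2 for a, b in zip(q, t)) for t in template]
            for q in query]
    assignment, _ = hungarian(cost)
    return assignment
```

In an iterative scheme of the kind the abstract describes, a pass like `match_atoms` would alternate with re-superposing the two atom sets on the current pairs until the pairing stops changing.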
Figure 1. Overview of the method. The left part (“Compiling database”) illustrates the pre-processing step. The right part (“Searching”) shows the search step for a given protein structure as a query.
Figure 2. Local coordinate system defined by a refset (tetrahedron).
Table 1. Definition of the table for ligand binding sites
| SQL | Description |
| --- | --- |
| CREATE TABLE lbsmldb ( | |
| lbsml_id INTEGER PRIMARY KEY, | unique identifier |
| lbsml TEXT, | file name |
| pdbx TEXT, | PDB’s description of the protein |
| ligand TEXT, | PDB’s annotation of the ligand |
| natoms INTEGER ); | the number of protein atoms in contact with the ligand |
Table 2. Definition of the refset table
| SQL | Description |
| --- | --- |
| CREATE TABLE refsetdb ( | |
| lbsml_id INTEGER, | reference to “lbsmldb” (Table 1) |
| irs INTEGER, | reference set identifier |
| PRIMARY KEY (lbsml_id, irs), | the pair of lbsml_id and irs makes the primary key of the refset |
| tetra TEXT, | tetrahedron type |
| tvol DOUBLE PRECISION, | volume of the tetrahedron |
| td01 DOUBLE PRECISION, | “tdij” denotes the length of the edge between vertices i and j of the tetrahedron (a tetrahedron consists of four atoms, indexed i, j = 0, 1, 2, 3) |
| td02 DOUBLE PRECISION, | (see td01) |
| td03 DOUBLE PRECISION, | (see td01) |
| td12 DOUBLE PRECISION, | (see td01) |
| td23 DOUBLE PRECISION, | (see td01) |
| td31 DOUBLE PRECISION, | (see td01) |
| atype_id INTEGER[], | types of the atoms spanned by the refset (encoded as integers) |
| xco DOUBLE PRECISION[], | local coordinates of the atoms spanned by the refset |
| yco DOUBLE PRECISION[], | (see xco) |
| zco DOUBLE PRECISION[] | (see xco) |
| ); | |
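The geometric columns of the refset table (tvol and the six tdij edge lengths) can be derived directly from the four vertex coordinates of a tetrahedron. The sketch below shows one way to compute them. The function name `refset_keys` is hypothetical, and taking the unsigned volume is an assumption, since the source does not say whether the stored volume carries a sign.

```python
import math


def refset_keys(p0, p1, p2, p3):
    """Return tvol and td01..td31 for a tetrahedron with vertices p0..p3.

    Each vertex is an (x, y, z) tuple. The volume is unsigned (assumption).
    """
    pts = (p0, p1, p2, p3)

    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    # Edge vectors from vertex 0; the volume is |det[a, b, c]| / 6.
    a, b, c = sub(p1, p0), sub(p2, p0), sub(p3, p0)
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    keys = {"tvol": abs(det) / 6.0}

    # Same edge order and naming as the table columns.
    for i, j in ((0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 1)):
        keys[f"td{i}{j}"] = math.dist(pts[i], pts[j])
    return keys
```

For the unit tetrahedron with vertices at the origin and at unit distance along each axis, this gives a volume of 1/6, three edges of length 1, and three of length √2.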
Table 3. Pseudo SQL expression for local structure search
| SELECT atype_id, xco, yco, zco, lbsml_id, irs FROM refsetdb |
| WHERE tetra = ‘t’ |
| AND tvol BETWEEN v−Δ_v AND v+Δ_v |
| AND td01 BETWEEN d_01−Δ_d AND d_01+Δ_d |
| AND td02 BETWEEN d_02−Δ_d AND d_02+Δ_d |
| AND td03 BETWEEN d_03−Δ_d AND d_03+Δ_d |
| AND td12 BETWEEN d_12−Δ_d AND d_12+Δ_d |
| AND td23 BETWEEN d_23−Δ_d AND d_23+Δ_d |
| AND td31 BETWEEN d_31−Δ_d AND d_31+Δ_d |
The table refsetdb is defined in Table 2. t, v, and d_ij are the type, volume, and edge lengths of a refset of the query. Δ_v and Δ_d are predefined constants for the similarity thresholds on volume and edge length, respectively. Expressions such as “v−Δ_v” are given as constants in the actual code. We set Δ_v = 1 Å³ and Δ_d = 2 Å.
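The range query above can be exercised end to end with a small in-memory database. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses Python's `sqlite3` in place of the original relational database system (so `REAL` replaces `DOUBLE PRECISION` and the array columns are omitted), the index name `refset_geom`, the function names, and the demo rows are all hypothetical, and the default thresholds follow the values quoted in the text (1 Å³ and 2 Å).

```python
import sqlite3

EDGES = ("td01", "td02", "td03", "td12", "td23", "td31")


def build_demo_db():
    """Create an in-memory refsetdb with a few made-up rows."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE refsetdb (
            lbsml_id INTEGER, irs INTEGER, tetra TEXT, tvol REAL,
            td01 REAL, td02 REAL, td03 REAL, td12 REAL, td23 REAL, td31 REAL,
            PRIMARY KEY (lbsml_id, irs))""")
    # Composite B-tree index: equality on tetra, then a range scan on tvol.
    con.execute("CREATE INDEX refset_geom ON refsetdb (tetra, tvol)")
    rows = [
        (1, 0, "A", 10.0, 4.0, 4.1, 3.9, 5.0, 5.2, 4.8),
        (1, 1, "A", 10.5, 4.5, 4.0, 4.2, 5.5, 5.0, 5.1),
        (2, 0, "A", 14.0, 4.0, 4.1, 3.9, 5.0, 5.2, 4.8),  # volume out of range
        (3, 0, "B", 10.0, 4.0, 4.1, 3.9, 5.0, 5.2, 4.8),  # different type
        (4, 0, "A", 10.2, 7.0, 4.1, 3.9, 5.0, 5.2, 4.8),  # td01 out of range
    ]
    con.executemany("INSERT INTO refsetdb VALUES (?,?,?,?,?,?,?,?,?,?)", rows)
    return con


def find_similar_refsets(con, t, v, d, dv=1.0, dd=2.0):
    """Return (lbsml_id, irs) of refsets within the thresholds of the query.

    t: tetrahedron type, v: volume, d: six edge lengths in EDGES order.
    """
    sql = ("SELECT lbsml_id, irs FROM refsetdb "
           "WHERE tetra = ? AND tvol BETWEEN ? AND ?")
    params = [t, v - dv, v + dv]
    for name, length in zip(EDGES, d):
        # Column names come from the fixed EDGES tuple, never from user input.
        sql += f" AND {name} BETWEEN ? AND ?"
        params += [length - dd, length + dd]
    sql += " ORDER BY lbsml_id, irs"
    return con.execute(sql, params).fetchall()
```

Binding the bounds as parameters plays the role of the precomputed constants mentioned in the text: the database only ever sees fixed numeric ranges, which an index on the geometric columns can serve without evaluating arithmetic per row.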
Figure 3. Comparison of GI score and IR score. Each point represents a template included in the top 50,000 hits for the query (PDB ID: 101m). The regression line is also shown. The correlation coefficient between the scores is 0.87.
Figure 4. Distribution of IR scores of randomly selected templates. The red bars indicate the histogram of IR scores of randomly selected templates obtained for the query 101m. The green line is the probability density function (PDF) of the gamma distribution GAM(α, β) with the parameters α=1.32 and β=1.75 calculated from the mean and variance of the scores. The blue line is the PDF of the type 2 extreme value distribution with the parameters determined to best fit the histogram.
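Estimating gamma parameters from the mean and variance, as the Figure 4 caption describes, is the method of moments. A minimal sketch follows. It assumes the shape-scale parameterization (mean = αβ, variance = αβ²), which the caption does not state explicitly. Under the rate parameterization β would instead be mean/variance. The function names are hypothetical.

```python
import math


def gamma_from_moments(mean, variance):
    """Method-of-moments estimates (shape, scale), assuming mean = shape*scale
    and variance = shape*scale**2."""
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


def gamma_pdf(x, shape, scale):
    """Gamma density for x > 0; returns 0.0 for x <= 0 as a simplification."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))
```

Under this assumed parameterization, the quoted values α=1.32 and β=1.75 would correspond to a sample mean of αβ and a sample variance of αβ², and feeding those two moments back into `gamma_from_moments` recovers the same pair.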
Figure 5. Scatter plot of the IR scores and coordinate RMS deviations resulting from a search with the PDB entry 101m. The regions enclosed by the circles marked with M and G contain mostly myoglobins and other globins, respectively.
Figure 6. Optimal superpositions of the query 1svn on templates. The wire-frame model in the CPK color scheme is the query protein 1svn. The template atoms are colored in green. Aligned atoms are shown in the ball-and-stick model. The ligand of the template is the ball-and-stick model in magenta. A: Peptide-binding site of subtilisin DY (PDB ID: 1bh6 [21]). B: Peptide-binding site of γ-chymotrypsin (PDB ID: 7gch [24]); the labeled Ser, His, Asp are the aligned catalytic triad. The figures were created by using the PDBjViewer [25].
Figure 7. Optimal superpositions of the ATP-binding sites of the query cAMP-dependent protein kinase (cAPK; PDB ID: 1atp [26]) on templates. A: The template is the ATP-binding site of casein kinase-1 (PDB ID: 1csn [29]) from Schizosaccharomyces pombe. B: The template is the ATP-binding site of glutathione synthetase (PDB ID: 1m0w [30]) from Saccharomyces cerevisiae. The color scheme is the same as in Fig. 6. The ligand of 1atp is also shown in the stick model with the CPK colors.
Figure 8. Optimal superpositions of the NAD-binding sites of the query alcohol dehydrogenase (PDB ID: 1het) [31] on templates. A: The template is the NAD-binding site of urocanase protein (PDB ID: 1x87; Tereshko et al., unpublished) from Bacillus stearothermophilus. B: The template is the FAD-binding site of p-hydroxybenzoate hydroxylase (PDB ID: 1iuv [32]) from Pseudomonas aeruginosa. The color scheme is the same as in Fig. 6. The ligand of 1het is also shown in the stick model with the CPK colors.