Jerome P Nilmeier, Daniel A Kirshner, Sergio E Wong, Felice C Lightstone.
Abstract
We present an enzyme protein function identification algorithm, Catalytic Site Identification (CatSId), based on identification of catalytic residues. The method is optimized for highly accurate template identification across a diverse template library and is also very efficient with regard to the time and scalability of comparisons. The algorithm matches three-dimensional residue arrangements in a query protein to a library of manually annotated catalytic residues, the Catalytic Site Atlas (CSA). Two main processes are involved. The first process is a rapid protein-to-template matching algorithm that scales quadratically with target protein size and linearly with template size. The second process incorporates a number of physical descriptors, including binding site predictions, in a logistic scoring procedure to re-score matches found in Process 1. This approach shows very good performance overall, with a receiver operating characteristic (ROC) area under the curve (AUC) of 0.971 for the training set evaluated. The procedure is able to process cofactors, ions, nonstandard residues, and point substitutions for residues and ions in a robust and integrated fashion. Sites with only two critical (catalytic) residues are challenging cases, resulting in AUCs of 0.9411 and 0.5413 for the training and test sets, respectively. The remaining sites show excellent performance, with AUCs greater than 0.90 for both the training and test data on templates with more than two critical (catalytic) residues. The procedure has considerable promise for larger scale searches.
Year: 2013 PMID: 23675414 PMCID: PMC3651201 DOI: 10.1371/journal.pone.0062535
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flowchart for catalytic site search process.
The main processes that were developed are highlighted in yellow. The structure (target PDB) is the input to the process. The initial template search stage is a fast search procedure that produces a set of site candidates. From this set, additional descriptors are calculated, including alignment binding site properties, which enhance the prediction quality through a logistic regression procedure.
Figure 2. Diagram of Catalytic Site Atlas example entries and EC number lookup.
Each site has a list of critical residues associated with it. Two entries associated with protein 132l are listed. The literature-based entry is shown in bold and is used to populate the template catalog for the present study. The EC number is listed in a separate table and is associated with the literature entry PDB ID. The PSIBLAST entry shown is 1 of 45 PSIBLAST entries associated with 132l.
Figure 3. Distribution of template sizes for reannotated CSA.
Figure 4. Example template distance matrix construction from PDB structure 1aa6, site 1 (E.C. 3.40.50.720).
a) The CSA entry has a corresponding EC number, as well as a list of residues that comprise the site. Each residue has a centroid associated with it, which is shown in parentheses and represented as a ball in both a) and b). Cofactors and ions have a specific centroid assigned to them, while standard and nonstandard residues use the Cα atom as the centroid. b) A distance matrix with 28 elements is constructed from the resulting site coordinates and stored for rapid comparison to a target structure.
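As a rough illustration of the construction described in this caption, the sketch below computes the unique pairwise distances between site centroids. A 28-element matrix is what the unique pairs of eight centroids would give (8 × 7 / 2 = 28), so eight sites are assumed here. The function name `template_distances` and all coordinates are hypothetical; they are not the real 1aa6 values.

```python
import math
from itertools import combinations

def template_distances(centroids):
    """Return the unique pairwise distances (upper triangle) between site centroids."""
    return {
        (i, j): math.dist(centroids[i], centroids[j])
        for i, j in combinations(range(len(centroids)), 2)
    }

# Hypothetical centroid coordinates (angstroms) for an 8-member site.
# Illustrative values only, not the real 1aa6 coordinates.
centroids = [
    (0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 1.0), (6.0, 0.0, 8.0),
    (2.0, 2.0, 2.0), (5.0, 5.0, 5.0), (0.0, 7.0, 1.0), (4.0, 0.0, 3.0),
]

dists = template_distances(centroids)
print(len(dists))      # number of unique site pairs
print(dists[(0, 1)])   # distance between the first two centroids
```

Storing only the upper triangle is one plausible way to keep the template compact for repeated comparisons; the source does not state which storage layout is actually used.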
Figure 5. Template-specific substitution matrix example.
a) List of observed substitutions for the 1alk LIT entry family. All allowed substitutions are stored in a table (supplementary material). b) The family-specific substitution matrix, constructed from the entry in a), that is used as an input to the template matching procedure.
Figure 6. Summary of substitutions found in CSA dataset.
a) The lower diagonal binary matrix indicates substitutions found in the CSA matrix (gray if any substitutions are observed, white if none are observed), and the upper diagonal is the binary form of BLOSUM 62, with gray indicating values greater than −1 and white otherwise. b) The lower diagonal matrix is identical to the CSA binary matrix in a), for reference. The upper diagonal matrix shows differences between the substitution matrices (gray: no difference, black: CSA only, white: BLOSUM 62 only).
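The comparison in this caption can be sketched as follows: a score matrix is binarized at the stated threshold (values greater than −1 count as substitutions) and each residue pair is then classified by which binary matrix marks it. The function `compare_binary`, the four-residue alphabet, the score values, and the CSA pairs are all illustrative assumptions, not data from the study.

```python
def compare_binary(csa_pairs, scores, residues, threshold=-1):
    """Classify each residue pair by which binary matrix marks it as a substitution."""
    result = {}
    for i, a in enumerate(residues):
        for b in residues[i + 1:]:
            pair = frozenset((a, b))
            in_csa = pair in csa_pairs
            # Binarize the score matrix: a pair counts only if its score > threshold.
            in_scores = scores.get(pair, threshold) > threshold
            if in_csa == in_scores:
                result[(a, b)] = "same"
            elif in_csa:
                result[(a, b)] = "CSA only"
            else:
                result[(a, b)] = "BLOSUM only"
    return result

residues = ["D", "E", "H", "K"]
# Hypothetical observed substitutions (stand-in for the CSA binary matrix).
csa_pairs = {frozenset(("D", "E")), frozenset(("H", "K"))}
# Illustrative BLOSUM-style scores; treat the values as assumptions.
scores = {
    frozenset(("D", "E")): 2, frozenset(("D", "H")): -1, frozenset(("D", "K")): -1,
    frozenset(("E", "H")): 0, frozenset(("E", "K")): 1, frozenset(("H", "K")): -1,
}

results = compare_binary(csa_pairs, scores, residues)
for pair, label in results.items():
    print(pair, label)
```

In the figure's color scheme, "same" would correspond to gray, "CSA only" to black, and "BLOSUM only" to white.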
Figure 7. Distribution of number of substitutions per template.
A total of 302 family members (out of 967 total) display at least one substitution.
Figure 8. Diagram for template search procedure.
The template distance matrix is shown for reference. The structure contains a supermatrix of distances, and the template search procedure searches for the sequence whose distance matrix best corresponds to the template. This is a variant of the Ullmann subgraph isomorphism problem.
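To make the matching problem concrete, here is a deliberately naive brute-force sketch: it tries every ordered choice of target sites and keeps the one whose pairwise distances deviate least from the template. This is not the fast search described in the paper, which is reported to scale quadratically with target size; it only illustrates what "best corresponds" could mean. The function `match_template`, the tolerance `tol`, and all coordinates are assumptions, and residue identity and substitution checks are omitted.

```python
import math
from itertools import permutations

def match_template(template, target, tol=0.5):
    """Brute-force search for target sites whose pairwise distances match the template."""
    k = len(template)
    best, best_err = None, float("inf")
    for combo in permutations(range(len(target)), k):
        # Worst-case deviation between template and candidate distances.
        err = 0.0
        for i in range(k):
            for j in range(i + 1, k):
                d_template = math.dist(template[i], template[j])
                d_target = math.dist(target[combo[i]], target[combo[j]])
                err = max(err, abs(d_template - d_target))
        if err <= tol and err < best_err:
            best, best_err = combo, err
    return best, best_err

# Hypothetical 3-site template with pairwise distances 3, 4, and 5.
template = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
# Hypothetical target: sites 1-3 are a translated copy of the template.
target = [(10.0, 10.0, 10.0), (1.0, 1.0, 1.0), (4.0, 1.0, 1.0),
          (1.0, 5.0, 1.0), (20.0, 0.0, 0.0)]

best, err = match_template(template, target)
print(best, err)
```

Because only distances are compared, the match is invariant to rotation and translation of the target, which is the main appeal of working with distance matrices rather than raw coordinates.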
Figure 9. Typical calculation times for subgraph matching calculation.
a) Data plot shown for Pmax = 100. Timings are reported for pre-constructed distance matrices of protein 1eus for comparison with 1980 templates from the author-curated CSA. n is the number of sites in the distance supermatrix constructed for each template comparison. b) Polynomial coefficients of fits to data for Pmax = 100 and 50, with correlation coefficients.
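A polynomial fit of timing against n can be sketched with an ordinary least-squares solve. The caption does not state the polynomial degree, so a quadratic is assumed here. The function `polyfit_quadratic` and the data points are synthetic (generated from a known quadratic) and are not the timings from the figure.

```python
def polyfit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Power sums S[k] = sum(x^k) and right-hand sides T[k] = sum(y * x^k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (c, b, a).
    M = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    # Back substitution.
    sol = [0.0] * 3
    for i in (2, 1, 0):
        sol[i] = (M[i][3] - sum(M[i][j] * sol[j] for j in range(i + 1, 3))) / M[i][i]
    c, b, a = sol
    return a, b, c

# Synthetic data: n in hundreds of sites, time in arbitrary units,
# generated from t = 2*n^2 + 3*n + 5 (an assumption for illustration).
ns = [1.0, 2.0, 3.0, 4.0, 5.0]
times = [2 * n ** 2 + 3 * n + 5 for n in ns]

a, b, c = polyfit_quadratic(ns, times)
print(a, b, c)
```

Rescaling n (here to hundreds of sites) keeps the normal equations well conditioned; with raw site counts, a library routine based on a more stable factorization would be a safer choice.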
Initial parameters are from a preliminary fit as described in Appendix S3 in File S1.
| | Estimated coefficients by regression | | | |
| Descriptor | Initial | Minimal | Intermediate | Final |
| | 17.74 | −3.09 | −18.86 | −4.8 |
| | −2.1 | 1.35 | – | |
| 2-residue templates | – | – | −0.60 | 0.21 |
| 3-residue templates | – | – | −0.92 | −0.91 |
| 4–7-residue templates | – | – | −0.45 | −0.01 |
| | 0.94 | – | – | – |
| | −18.18 | −0.84 | 19.15 | −1.73 |
| | – | – | −2.764 | – |
| | −3.94 | – | – | |
| 2-residue templates | – | – | 0.49 | 0.55 |
| 3-residue templates | – | – | 3.15 | 5.33 |
| 4–7-residue templates | – | – | 0.50 | 1.84 |
| | −0.15 | – | – | |
| 1–3-residue templates | – | – | – | |
| 4–7-residue templates | – | – | −44.89 | 18.61 |
| | – | – | – | |
| 2-residue templates | – | – | −3.80 | 2.83 |
| 3-residue templates | – | – | −11.47 | −12.63 |
| 4–7-residue templates | – | – | 14.81 | 3.18 |
The minimal parameter set is plotted as (f,Ca) in Figure 7. Initial and intermediate parameter sets are used for preliminary rankings, as described in the text. Final parameters are used for the remaining data analyses.
Constrained to be equal across 2, 3, and 4–7-residue templates.
Computed as d
Computed as $(d+0.1)^{-1}$.
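The logistic re-scoring step can be sketched with the standard logistic model, combined with the offset inverse-distance transform mentioned in the footnote (read here as the reciprocal of d + 0.1). The function names, the descriptor values, and the coefficients below are hypothetical; they are not taken from the coefficient table, whose descriptor labels are not available in this record.

```python
import math

def inverse_distance(d, offset=0.1):
    """Transform a distance-like descriptor as 1 / (d + offset)."""
    return 1.0 / (d + offset)

def logistic_score(intercept, coefficients, descriptors):
    """Standard logistic model: 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, descriptors))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical descriptors for one candidate match: a transformed
# distance-like quantity and a second generic descriptor.
descriptors = [inverse_distance(0.4), 1.0]
# Hypothetical coefficients chosen so the linear term cancels to zero.
score = logistic_score(0.0, [1.0, -2.0], descriptors)
print(score)
```

The small offset keeps the transform finite when d approaches zero, which is presumably why the footnote adds 0.1 before inverting.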
List of PDB codes used in test and training sets.
| Training Set |
| 1b7g,1bwk,1cla,1cy1,1d6n,1dbz,1dv7,1f4c,1g02,1ge6,1gin,1gzg,1hqd,1hzz,1i2o,1igw,1jiu,1jp7,1kgq,1krc,1l5w,1lmz,1mj5,1nwc,1ojp,1p5g,1q2e,1rrj,1rsm,1rwp,1s70,1t2a,1t3z,1t4d,1ucl,1w23,1w3y,1wkl,1wo8,1wow,1wyi,1yai,1ykn,1ytn,1z83,1z8x,2aad,2bcd,2be7,2brv,2g22,2hb1,2ido,2j4s,2ori,2p9e,2qll,2qmo,2v6s,2vel,2vf5,2vmn,3c80,3cn9,3cuf,3d4z,3dt2,3dzc,3eju,3fpd,3gxf,3hh4,3i4c, 3i9l,4fua |
| Test Set |
| 1ado,1ajb,1blm,1cib,1d4e,1e2r,1ep9,1eus,1f3x,1f49,1fdv,1g1y,1g87,1ggf,1i45,1ib4,1iu8,1jol,1k3t,1kak,1kg4,1khn,1kvy,1l7a,1nto,1nwr,1p07,1rry,1ru1,1tsl,1wdd,1xpt,1xv8,1xww,1yja,2a3t,2ayl,2ayo,2c0h,2cba,2cnh,2ewn,2ez9,2fbp,2fpt,2nu8,2nze,2o3q,2otc,2pov,2ppy,2qd4,2qu9,2veg,2wfp,2whr,2zj3,2zyd,3bbf,3c52,3czn,3dhe,3ehb,3fgd,3gtd,3it1,3pf |
Figure 10. Comparison of performance on training data using minimal (naïve) descriptor set and final parameter set.
The curve in gray is generated using descriptors resulting from analysis of Process 1 output only. The final regression contains the full descriptor set as described in the text.
Figure 11. Performance of training dataset.
a) ROC plots. b) True Positive Rate (TPR) vs. logistic score (threshold). c) Matthews Correlation Coefficient vs. logistic score.
Figure 12. Performance of test dataset.
a) ROC plots. b) True Positive Rate (TPR) vs. logistic score (threshold). c) Matthews Correlation Coefficient vs. logistic score.
Area under ROC curves (AUC) for templates of differing sizes, as well as the full dataset.
| Template size | AUC (train) | AUC (test) |
| 2 residues | 0.9411 | 0.5413 |
| | 0.9821 | 0.9040 |
| | 0.9932 | 0.9935 |
| | 0.9622 | 0.9369 |
| All | 0.9714 | 0.7989 |
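For readers who want to reproduce metrics of the kind reported here, the sketch below computes an ROC AUC using the rank interpretation (the probability that a randomly chosen true site outscores a randomly chosen false one) and a Matthews correlation coefficient at a given logistic-score threshold. The scores and labels are made-up toy values, and the function names are illustrative.

```python
import math

def roc_auc(scores, labels):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(scores, labels, threshold):
    """Matthews correlation coefficient when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy logistic scores and true labels (1 = correct site match); assumptions only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

auc = roc_auc(scores, labels)
mcc_half = mcc(scores, labels, threshold=0.5)
print(auc, mcc_half)
```

The pairwise AUC formula is quadratic in the number of matches, which is fine for small sets; for large result lists a sort-based computation would be preferable.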