Reecha Nepal1, Joanna Spencer2, Guneet Bhogal3, Amulya Nedunuri4, Thomas Poelman5, Thejas Kamath6, Edwin Chung3, Katherine Kantardjieff7, Andrea Gottlieb2, Brooke Lustig1.
Abstract
A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20- and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating buried or accessible solvent, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods using the standard RSA threshold criteria 20 and 25% as solvent accessible. When an additional non-homology descriptor describing Lobanov-Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved with 25% threshold accuracies of 76.12 and 74.79% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.
Keywords: education; relative solvent accessibility prediction; teaching
Year: 2015 PMID: 26664348 PMCID: PMC4665666 DOI: 10.1107/S1600576715018531
Source DB: PubMed Journal: J Appl Crystallogr ISSN: 0021-8898 Impact factor: 3.304
Figure 1. Flowchart of key inputs and outputs.
Non-optimum homology subsets for test set proteins
| Manesh-215 | CASP8 | CASP9 |
|---|---|---|
| 1axna | 3d3oa | 3mqza |
| 1bhmb | 3d5pa | 3n53a |
| 1ceoa | 3dewa | 3n6za |
| 1cnva | 3df8a | 3na2a |
| 1esca | 3dm3a | 3ngwa |
| 1exnb | 3doua | 3ni8a |
| 1hlba | | 3njaa |
| 1kpta | | 3nkga |
| 1udii | | 3nrga |
| 1vcaa | | 3nrva |
| 1wbaa | | 3nwza |
| 2ccya | | 3nyma |
| 2scpa | | |
| 2sila | | |
Figure 2. Linear and logistic regression fits for query residues valine (V) and aspartate (D) from the 18-protein transient-binding subset. Here, the least-squares fit corresponds to the NACCESS RSA values regressed on E6 and amino acid type (AA). For illustrative purposes, only two amino acid types are shown. Valine (top) and aspartate (bottom) include 177 and 172 residues, respectively. Both least-squares fits have a slope (E6) of 10.56, but their intercepts differ: 13.83 and 45.17, respectively. The residues correctly classified by the logistic model (E6+AA) are shown in red (127 for V, 148 for D). Note that 76.49% (linear) and 75.64% (logistic) of all 2786 residues are classified correctly. Here, a 20% threshold was applied to both the observed and predicted RSA values to create classifications. Moreover, the results were validated by evaluating the fitted model on a 13-protein subset (2049 residues) of the Manesh-215 test set consisting of transient-binding proteins (Pettit et al., 2007). There we observe slightly higher accuracies of 76.34% (linear) and 77.27% (logistic).
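The linear fit quoted in the Figure 2 caption can be turned into a binary prediction with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the assignment of the two intercepts to valine and aspartate follows the order in which the caption lists them, and treating an RSA exactly at the threshold as accessible is an assumption, since the caption does not say how ties are handled.

```python
# Least-squares parameters quoted in the Figure 2 caption (RSA in percent).
SLOPE_E6 = 10.56
INTERCEPTS = {"V": 13.83, "D": 45.17}  # assumed order: valine, then aspartate
THRESHOLD = 20.0  # percent RSA, the threshold used in Figure 2


def predict_rsa(aa: str, e6: float) -> float:
    """Predicted RSA (%) from the linear fit of RSA on E6 and amino acid type."""
    return INTERCEPTS[aa] + SLOPE_E6 * e6


def is_accessible(rsa: float, threshold: float = THRESHOLD) -> bool:
    """Assumption: a residue at or above the threshold counts as accessible."""
    return rsa >= threshold


def crossover_e6(aa: str, threshold: float = THRESHOLD) -> float:
    """E6 value at which the predicted RSA reaches the threshold."""
    return (threshold - INTERCEPTS[aa]) / SLOPE_E6


def accuracy(observed_rsa, predicted_rsa, threshold: float = THRESHOLD) -> float:
    """Percent of residues whose observed and predicted classes agree."""
    pairs = list(zip(observed_rsa, predicted_rsa))
    hits = sum(
        is_accessible(obs, threshold) == is_accessible(pred, threshold)
        for obs, pred in pairs
    )
    return 100.0 * hits / len(pairs)


if __name__ == "__main__":
    for aa in ("V", "D"):
        rsa_at_zero = predict_rsa(aa, 0.0)
        print(aa, rsa_at_zero, is_accessible(rsa_at_zero), round(crossover_e6(aa), 3))
```

Under these assumptions, a fully conserved valine (E6 = 0) is predicted buried and only crosses the 20% threshold once E6 exceeds roughly 0.58, whereas the aspartate intercept already lies above the threshold, so the linear fit predicts aspartate as accessible for any non-negative entropy.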
Classic model fit (E20+E6+FSR+FSHP+AA) including parameter estimates, corresponding standard errors, and z and p values based on the Wald test
Descriptors included are the sequence entropies E20 and E6, the amino acid qualitative predictor (AA) with 20 classes, and FSHP and FSR, which indicate the fraction of optimum homology residues that are strongly hydrophobic (V, I, L, F, M, Y, W) and small (A, G), respectively. The standard 1363-based learning set and a 20% threshold were utilized.†
| Variables | β | Standard error | exp(β) | z | p |
|---|---|---|---|---|---|
| Intercept | −0.528 | 0.031 | 0.590 | −17.247 | <0.001 |
| E20 | 0.342 | 0.012 | 1.407 | 29.162 | <0.001 |
| E6 | 0.862 | 0.017 | 2.369 | 51.156 | <0.001 |
| FSR | −0.922 | 0.031 | 0.398 | −29.690 | <0.001 |
| FSHP | −1.646 | 0.030 | 0.193 | −54.898 | <0.001 |
| ALA | −0.267 | 0.034 | 0.766 | −7.861 | <0.001 |
| ARG | 0.765 | 0.034 | 2.149 | 22.363 | <0.001 |
| ASN | 0.358 | 0.035 | 1.430 | 10.236 | <0.001 |
| ASP | 0.774 | 0.034 | 2.168 | 22.670 | <0.001 |
| CYS | −1.543 | 0.052 | 0.214 | −29.405 | <0.001 |
| GLN | 0.366 | 0.036 | 1.442 | 10.234 | <0.001 |
| GLU | 0.985 | 0.034 | 2.677 | 29.047 | <0.001 |
| GLY | 0.829 | 0.038 | 2.292 | 21.588 | <0.001 |
| HIS | −0.114 | 0.038 | 0.893 | −3.003 | 0.003 |
| ILE | −0.036 | 0.027 | 0.965 | −1.353 | 0.176 |
| LEU | 0.202 | 0.023 | 1.224 | 8.767 | <0.001 |
| LYS | 1.509 | 0.036 | 4.522 | 41.384 | <0.001 |
| MET | 0.269 | 0.036 | 1.308 | 7.424 | <0.001 |
| PHE | 0.040 | 0.030 | 1.041 | 1.359 | 0.174 |
| PRO | 0.449 | 0.034 | 1.567 | 13.072 | <0.001 |
| SER | −0.166 | 0.032 | 0.847 | −5.113 | <0.001 |
| THR | −0.168 | 0.032 | 0.845 | −5.291 | <0.001 |
| TRP | 0.567 | 0.041 | 1.763 | 13.784 | <0.001 |
| TYR | 0.690 | 0.029 | 1.995 | 24.195 | <0.001 |
† Note that descriptor values for nine PDB chains (1G291, 1L2WA, 1MUWA, 1W85I, 1XC3B, 1XVHA, 2I6CA, 2PI2E) from the original 1363 set are insufficient and are here considered null.
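To make the parameter table concrete, the following sketch evaluates the classic model (E20+E6+FSR+FSHP+AA) for a single residue using the tabulated β values. Several points are assumptions rather than statements from the source: valine has no row in the table and is therefore treated as the reference class with a coefficient of zero; the response is taken to be coded 1 for accessible, which the signs of the coefficients suggest; and the descriptor values in the usage example are hypothetical, since the scaling of the inputs is not given in this section.

```python
import math

# Coefficients (beta) copied from the classic model table, 20% threshold.
COEF = {"Intercept": -0.528, "E20": 0.342, "E6": 0.862, "FSR": -0.922, "FSHP": -1.646}
AA_COEF = {
    "ALA": -0.267, "ARG": 0.765, "ASN": 0.358, "ASP": 0.774, "CYS": -1.543,
    "GLN": 0.366, "GLU": 0.985, "GLY": 0.829, "HIS": -0.114, "ILE": -0.036,
    "LEU": 0.202, "LYS": 1.509, "MET": 0.269, "PHE": 0.040, "PRO": 0.449,
    "SER": -0.166, "THR": -0.168, "TRP": 0.567, "TYR": 0.690,
    "VAL": 0.0,  # assumption: VAL is the reference class (no row in the table)
}


def linear_predictor(e20: float, e6: float, fsr: float, fshp: float, aa: str) -> float:
    """Log-odds for one residue under the classic model."""
    return (
        COEF["Intercept"]
        + COEF["E20"] * e20
        + COEF["E6"] * e6
        + COEF["FSR"] * fsr
        + COEF["FSHP"] * fshp
        + AA_COEF[aa]
    )


def p_accessible(e20: float, e6: float, fsr: float, fshp: float, aa: str) -> float:
    """Probability of the response coded 1 (assumed here to mean accessible)."""
    eta = linear_predictor(e20, e6, fsr, fshp, aa)
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    # Hypothetical descriptor values, chosen only to illustrate the call.
    for aa in ("LYS", "CYS"):
        print(aa, round(p_accessible(e20=1.0, e6=0.5, fsr=0.1, fshp=0.2, aa=aa), 3))
```

The exp(β) column of the table is the odds ratio implied by each coefficient, so `math.exp(AA_COEF["LYS"])` should agree with the tabulated 4.522 to rounding. With all other descriptors held equal, the large positive lysine coefficient and the large negative cysteine coefficient push the predicted probability in opposite directions.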
Selected logistic model accuracies for test sets based on X-ray crystal structures
For comparison, accuracies are shown for models built using both 20 and 25% relative solvent accessibility threshold values. The standard 1363-based learning set was utilized for model fitting.
| | 1363 training/Manesh-215 test | 1363 training/CASP(8+9) test |
|---|---|---|
| Model | Optimum homology | Optimum homology |
| Threshold | 25% | 25% |
| E20 | 66.10 | 64.81 |
| E6 | 69.40 | 68.06 |
| FSHP | 65.61 | 66.14 |
| AA | 69.62 | 68.36 |
| E6+AA | 74.79 | 73.51 |
| Classic | 75.56 | 74.32 |
Optimum homology Manesh-215 subset (47 609 residues).
Optimum homology CASP(8+9) subset (41 967 residues).
Non-optimum homology Manesh-215 subset (3113 residues).
Non-optimum homology CASP(8+9) subset (2832 residues).
Note that the other two models with AA and a single quantitative descriptor, E20+AA and FSHP+AA, are not reported because they have less predictive accuracy than E6+AA.
E20+E6+FSR+FSHP+AA saturated model, as shown in Table 2.
Selected logistic model accuracies for test sets based on X-ray crystal structures
LGDP and AQN are included as additional descriptors. We list prediction accuracies for oligomers and non-oligomers together. The standard 1363-based learning set was utilized for model fitting. The difference in accuracy for oligomers minus non-oligomers is coded M if the difference is <−0.5, P if it is >0.5 and O otherwise. The change in total accuracy for oligomers with likely interfacial residues removed is coded in the same way: M if the difference is <−0.5, P if it is >0.5 and O otherwise.
| | 1363 training/Manesh-215 test | | 1363 training/CASP(8+9) test | |
|---|---|---|---|---|
| | Optimum homology | | Optimum homology | |
| Model | Total Acc 25% | ΔAcc (Olig-NonOlig) // ΔAcc (Olig w/o interface) 25% | Total Acc 25% | ΔAcc (Olig-NonOlig) // ΔAcc (Olig w/o interface) 25% |
| E6+FSR+FSHP+AA | 75.23 | M | 74.12 | M |
| E6+AA | 74.79 | O | 73.50 | O |
| LGDP | 60.03 | P | 56.99 | P |
| AQN( | 55.97 | P | 52.45 | P |
| AQN( | 55.46 | P | 51.89 | P |
| LGDP+E6+AA | 75.74 | O | 73.92 | O |
| LGDP+E6+FSR+FSHP+AA | 76.05 | O | 74.56 | M |
| LGDP+AA+AQN | 71.29 | P | 69.75 | P |
| Comprehensive model | 76.41 | M | 75.01 | O |
| All proteins | 76.11 | O | 74.79 | M |
Optimum homology Manesh-215 subset for oligomers (21 513 residues; 16 283 residues non-interfacial) and non-oligomers (26 096 residues); alignment with LGDP values truncated 132 of 146 residues for 8ATCB, and one residue each for 1CHMA and 1TYSA.
Optimum homology CASP(8+9) subset for oligomers (24 176 residues; 18 573 residues non-interfacial) and non-oligomers (17 791 residues).
Non-homology descriptor model that, when evaluated on non-optimum homology Manesh-215, gives percent accuracies of 73.12 (25% threshold) and 71.49 (20% threshold) for oligomers (919 residues; 787 residues non-interfacial); 70.95 (25% threshold) and 71.75 (20% threshold) for non-oligomers (2194 residues).
Non-homology descriptor model that, when evaluated on non-optimum homology CASP(8+9), gives percent accuracies of 72.26 (25% threshold) and 71.83 (20% threshold) for oligomers (2080 residues; 1393 residues non-interfacial); 69.55 (25% threshold) and 70.88 (20% threshold) for non-oligomers (752 residues).
E6+FSR+FSHP+AA+LGDP+AQN model.
Residue-weighted accuracies: comprehensive model for optimum homology proteins and non-homology descriptor model for non-optimum homology proteins.
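The residue-weighted accuracy in the "All proteins" row can be checked from the residue counts and accuracies given in the table and its footnotes. The sketch below is one reading of those footnotes, not the authors' code: it combines the comprehensive model on the optimum homology subset with the non-homology descriptor model on the non-optimum homology oligomers and non-oligomers, all at the 25% threshold. The function name `weighted_accuracy` is hypothetical.

```python
def weighted_accuracy(groups):
    """Residue-weighted mean of percent accuracies.

    groups: iterable of (residue_count, percent_accuracy) pairs.
    """
    groups = list(groups)
    total_residues = sum(count for count, _ in groups)
    return sum(count * acc for count, acc in groups) / total_residues


# (residues, accuracy at 25% threshold), taken from the table and footnotes.
manesh_215 = [
    (47609, 76.41),  # optimum homology, comprehensive model
    (919, 73.12),    # non-optimum homology oligomers, non-homology model
    (2194, 70.95),   # non-optimum homology non-oligomers, non-homology model
]
casp_8_9 = [
    (41967, 75.01),  # optimum homology, comprehensive model
    (2080, 72.26),   # non-optimum homology oligomers, non-homology model
    (752, 69.55),    # non-optimum homology non-oligomers, non-homology model
]

if __name__ == "__main__":
    print(round(weighted_accuracy(manesh_215), 2))
    print(round(weighted_accuracy(casp_8_9), 2))
```

With these inputs the weighted means round to 76.11 and 74.79, which match the "All proteins" entries for the Manesh-215 and CASP(8+9) test sets.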