Buzhong Zhang, Linqing Li, Qiang Lü.
Abstract
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step toward understanding its structure and function. In this work, we present a deep learning method for predicting residue solvent accessibility, based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, we propose a merging operator that combines the bidirectional information from the hidden nodes to form the outputs. Three types of merging operators were used in our improved model, with a long short-term memory network serving as the hidden computing node. The training dataset was constructed from 7361 proteins extracted from the PISCES server using a sequence identity cut-off of 25%. Sequence-derived features, including the position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding, were used to represent each residue. With this method, continuous relative solvent-accessible area values were predicted and then transformed into binary states using predefined thresholds. Our experimental results showed that the method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
Keywords: bidirectional recurrent network; merging operator; sequence profile; solvent-accessibility prediction
Year: 2018 PMID: 29799510 PMCID: PMC6023031 DOI: 10.3390/biom8020033
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
The recent developments, in chronological order, for predicting the values of rASA.
| Work | Algorithm | Description of Features | MAE (%) |
|---|---|---|---|
| Ahmad, 2003 | Neural network | Amino acid composition | 18.8 |
| Wang, 2005 | Multiple linear regression | PSSM | 16.2 |
| Garg, 2005 | Neural network | PSSM, secondary structure | 16.6 |
| Nguyen, 2006 | Two-stage SVR | PSSM | 15.7 |
| Wang, 2007 | Support vector machine | PSSM | 15.1 |
| Dor, 2007 | Neural network | PSSM, physical properties | 14.3 |
| Chang, 2008 | Support vector regression | Enhances PSSM-based features | 14.8 |
| Faraggi, 2009 | Neural networks | PSSM, physical properties, secondary structure | 11.1 |
| Meshkin, 2009 | Pace regression | PSSM | 13.4 |
| Joo, 2012 | k-nearest neighbor | PSSM | 14.8 |
| Kashefi, 2013 | SVR and scatter search methods | PSSM, qualitative physicochemical features | 12.31 |
| Zhang, 2015 | Weighted sliding window | PSSM, secondary structure, native disorder, physicochemical propensities, sequence-based features | 14 |
| Fan, 2016 | Gradient boosted regression trees | PSSM, secondary structure, native disorder, conservation score, side-chain environment | 9.4 |
The MAEs reported in this table were evaluated on a different dataset.
Performance comparison in predicting relative solvent-accessible areas (the best results are shown in bold).
| Method | CB502 MAE (%) | CB502 PCC | Manesh215 MAE (%) | Manesh215 PCC |
|---|---|---|---|---|
| SARpred | 17.4 | 0.6 | 16.6 | 0.61 |
| SVR | 14.8 | 0.68 | 14.2 | 0.69 |
| Real-SPINE | 14.5 | 0.68 | 13.8 | 0.7 |
| NetSurfP | 14.3 | 0.71 | 13.6 | 0.7 |
| PredRSA | 9.4 | 0.73 | 9.0 | 0.75 |
| SDBRNN | **8.8** | **0.748** | **8.2** | **0.78** |
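The two metrics reported above, mean absolute error (MAE) and Pearson's correlation coefficient (PCC), can be computed as in the following minimal sketch. It assumes rASA values are stored as fractions in [0, 1] and that MAE is reported as a percentage; the function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def mean_absolute_error(true_rasa, pred_rasa):
    """Mean absolute difference between true and predicted rASA values."""
    if len(true_rasa) != len(pred_rasa) or not true_rasa:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_rasa, pred_rasa)) / len(true_rasa)


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rASA values for four residues (illustration only).
true_rasa = [0.10, 0.45, 0.80, 0.25]
pred_rasa = [0.15, 0.40, 0.70, 0.30]

mae_percent = 100 * mean_absolute_error(true_rasa, pred_rasa)
pcc = pearson_correlation(true_rasa, pred_rasa)
print(f"MAE = {mae_percent:.2f}%, PCC = {pcc:.3f}")
```

Whether the reported figures are averaged over all residues or per protein is not stated in this record, so the pooling used here is an assumption.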
Figure 1. Predicted MAE for individual sequences from the Manesh215 dataset. The protein sequences are ordered by sequence length.
Two-state (binary) prediction accuracy (%) of our method and other reported methods at different rASA thresholds on the Manesh215 dataset.
| Method | 5% | 10% | 20% | 25% | 30% | 40% | 50% |
|---|---|---|---|---|---|---|---|
| SARpred | 74.9 | 77.2 | 77.7 | - | 77.8 | 78.1 | 80.5 |
| PR | 76.8 | 74.8 | 75.3 | 76.7 | 77.7 | 79.8 | 86.3 |
| SVR | 80.9 | 80.1 | 78.7 | - | - | - | 80.8 |
| SS-SVM | 79.2 | 78.2 | 77.6 | 77.6 | 77.5 | 79.7 | 86.5 |
| Two-stage SVR | 81.1 | 78.7 | 77.6 | 77.3 | - | - | 79.5 |
| PredRSA | 80 | 81.6 | 80.9 | 81.1 | 82.2 | 87.1 | |
| SDBRNN | | | | | | | 93 |
Figure 2. Predicted accuracy on individual sequences from the Manesh215 dataset. The rASA threshold is 25%. Protein sequences are ordered by sequence length.
Accuracy (%) performance comparison in binary classification prediction (the best results are shown in bold).
| Threshold (%) | Manesh215 PredRSA | Manesh215 SDBRNN | CB502 PredRSA | CB502 SDBRNN | CASP10 PredRSA | CASP10 SDBRNN |
|---|---|---|---|---|---|---|
| 5 | 80.1 | | 77.9 | | 78.5 | |
| 10 | 81.7 | | 79 | | 79.1 | |
| 20 | 81 | | 80.5 | | 78.3 | |
| 25 | 81.2 | | 81 | | 79.7 | |
| 30 | 82.4 | | 82.1 | | 80.5 | |
| 40 | 87.1 | | 86.8 | | 85 | |
| 50 | | 93 | | 92.4 | 91.2 | |
Matthews’ correlation coefficient performance comparison in binary classification prediction.
| Threshold (%) | Manesh215 PredRSA | Manesh215 SDBRNN | CB502 PredRSA | CB502 SDBRNN | CASP10 PredRSA | CASP10 SDBRNN |
|---|---|---|---|---|---|---|
| 5 | 0.54 | | 0.5 | | 0.48 | |
| 10 | 0.63 | | 0.58 | | 0.57 | |
| 20 | 0.61 | | 0.6 | 0.6 | 0.56 | |
| 25 | 0.58 | | 0.57 | | 0.56 | |
| 30 | 0.54 | | 0.52 | | 0.51 | |
| 40 | 0.42 | | 0.39 | | 0.4 | |
| 50 | 0.25 | | 0.23 | | 0.3 | |
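The binary evaluation behind the accuracy and Matthews' correlation coefficient tables can be illustrated with a short sketch: continuous rASA values are converted to two states at a threshold, and the two metrics are computed from the resulting confusion counts. The convention that a residue is "exposed" when its rASA is at or above the threshold, the function names, and the sample values are assumptions made for illustration.

```python
import math


def to_binary_states(rasa_values, threshold):
    """Map rASA fractions to states: 1 = exposed, 0 = buried.

    Assumption: a residue counts as exposed when rASA >= threshold.
    """
    return [1 if value >= threshold else 0 for value in rasa_values]


def confusion_counts(true_states, pred_states):
    """Return (TP, TN, FP, FN) with 'exposed' as the positive class."""
    tp = sum(1 for t, p in zip(true_states, pred_states) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true_states, pred_states) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true_states, pred_states) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_states, pred_states) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of residues assigned the correct state."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews' correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical true and predicted rASA fractions for six residues.
true_rasa = [0.05, 0.30, 0.60, 0.10, 0.45, 0.20]
pred_rasa = [0.10, 0.35, 0.50, 0.30, 0.20, 0.15]

threshold = 0.25  # the 25% threshold used in the tables
true_states = to_binary_states(true_rasa, threshold)
pred_states = to_binary_states(pred_rasa, threshold)
counts = confusion_counts(true_states, pred_states)
acc = accuracy(*counts)
mcc = matthews_corrcoef(*counts)
print(f"accuracy = {100 * acc:.1f}%, MCC = {mcc:.2f}")
```

Repeating the calculation for each threshold (5% to 50%) would give one accuracy and one MCC value per threshold, matching the layout of the tables.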
Figure 3. Comparison of true mean values and predicted mean values for 20 types of amino acids using the CB502 dataset.
Different combinations of sequence-derived features for SDBRNN predictors on an independent test set (TS261).
| Feature | MAE (%) | PCC |
|---|---|---|
| PSSM | 9.33 | 0.732 |
| PSSM + SC | 9.03 | 0.749 |
| PSSM + SC + CS | 9.00 | 0.750 |
| PSSM + SC + CS + PP | 8.95 | 0.750 |
| PSSM + SC + CS + PP + PC | | |
PSSM: position specific scoring matrix, 20 dimensions; SC: protein sequence coding, 22 dimensions; CS: residue conservation score, 1 dimension; PP: physical properties, 7 dimensions; PC: physicochemical characteristics, 3 dimensions.
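The feature dimensions listed above add up to 53 values per residue when all five groups are used. The sketch below shows one way the groups could be assembled into a single per-residue vector; simple concatenation, the ordering of the groups, and all names are assumptions, since the record does not describe how the inputs are combined.

```python
# Feature group dimensions as listed in the table footnote.
FEATURE_DIMS = {"PSSM": 20, "SC": 22, "CS": 1, "PP": 7, "PC": 3}


def build_residue_vector(features, groups):
    """Concatenate the selected feature groups for one residue.

    `features` maps a group name to its list of values; `groups` gives the
    (assumed) concatenation order.
    """
    vector = []
    for name in groups:
        values = features[name]
        if len(values) != FEATURE_DIMS[name]:
            raise ValueError(f"{name} must have {FEATURE_DIMS[name]} values")
        vector.extend(values)
    return vector


# Placeholder features for a single residue (all zeros, illustration only).
placeholder = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}

# The feature combinations compared in the table, from PSSM alone to all five.
combinations = [
    ["PSSM"],
    ["PSSM", "SC"],
    ["PSSM", "SC", "CS"],
    ["PSSM", "SC", "CS", "PP"],
    ["PSSM", "SC", "CS", "PP", "PC"],
]
lengths = [len(build_residue_vector(placeholder, combo)) for combo in combinations]
print(lengths)
```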
Comparison of different LSTM models for relative solvent-accessible area prediction. MAE is given as a percentage (%).
| Method | CB502 MAE (%) | CB502 PCC | Manesh215 MAE (%) | Manesh215 PCC | CASP10 MAE (%) | CASP10 PCC | TS261 MAE (%) | TS261 PCC |
|---|---|---|---|---|---|---|---|---|
| LSTM | 9.8 | 0.694 | 9.4 | 0.722 | 10.0 | 0.698 | 10.0 | 0.695 |
| BLSTM_C | 9.0 | 0.74 | 8.44 | 0.772 | 9.33 | 0.734 | 9.0 | 0.748 |
| BLSTM_S | 8.93 | 0.744 | 8.33 | 0.775 | 9.26 | 0.739 | 8.96 | 0.747 |
| SDBRNN | | | | | | | | |
Figure 4. The merging operator. To retain more long-range information from the sequence, the merging operator is applied when the past (forward) and future (backward) computing information are merged.
Figure 5. SDBRNN architecture. The “concat” merging operator is used in the first BRNN layer. The “sum” and “weighted sum” operators are used in the second and third layers. Two multilayer perceptron networks are connected to the BRNN.
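The three merging operators named in the captions can be sketched on plain vectors as follows. This is only an illustration of the general idea of combining a forward and a backward hidden state at one sequence position; the scalar weights in the weighted sum and all function names are assumptions, and the paper's exact formulation may differ.

```python
def merge_concat(forward, backward):
    """Concatenate the forward and backward hidden states (output size doubles)."""
    return list(forward) + list(backward)


def merge_sum(forward, backward):
    """Element-wise sum of the two hidden states (output size unchanged)."""
    if len(forward) != len(backward):
        raise ValueError("hidden states must have the same size")
    return [f + b for f, b in zip(forward, backward)]


def merge_weighted_sum(forward, backward, w_forward, w_backward):
    """Element-wise weighted sum; scalar weights are an assumption here."""
    if len(forward) != len(backward):
        raise ValueError("hidden states must have the same size")
    return [w_forward * f + w_backward * b for f, b in zip(forward, backward)]


# Hypothetical hidden states at one residue position.
h_forward = [1.0, 2.0]
h_backward = [0.5, -1.0]

concat_out = merge_concat(h_forward, h_backward)
sum_out = merge_sum(h_forward, h_backward)
weighted_out = merge_weighted_sum(h_forward, h_backward, 0.75, 0.25)
print(concat_out, sum_out, weighted_out)
```

Concatenation keeps both directions separate at the cost of a wider output, while the two sum variants keep the layer width fixed, which may be why they suit the deeper stacked layers.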