Fereshteh Mataeimoghadam, M A Hakim Newton, Abdollah Dehzangi, Abdul Karim, B Jayaram, Shoba Ranganathan, Abdul Sattar.
Abstract
Protein structure prediction is a grand challenge. Prediction of protein structures via backbone dihedral angle representations has recently made significant progress, helped by the ongoing surge of deep neural network (DNN) research in general. However, we observe that in protein backbone angle prediction research there is an overall trend to employ increasingly complex neural networks and to feed them more and more features. While more features might add predictive power, we argue that redundant features can instead clutter the problem, and that the more complex networks may then merely counterbalance the resulting noise. From artificial intelligence and machine learning perspectives, problem representations and solution approaches interact and thus affect performance. We also argue that comparatively simpler predictors can be reconstructed more easily than more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) that trains simpler DNN models to enhance protein backbone angle prediction. We then show empirically that SAP can significantly outperform existing state-of-the-art methods on well-known benchmark datasets: for some types of angles, the differences are 6-8 degrees in terms of mean absolute error (MAE). The SAP program, along with its data, is available from https://gitlab.com/mahnewton/sap.
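Because the abstract reports results as MAE over angles, a small worked definition may help. The sketch below assumes the convention, common in backbone angle prediction, that the absolute error between two angles in degrees is taken around the circle (so 175 and -175 differ by 10, not 350). The function names and toy values are illustrative and are not taken from the SAP code or data.

```python
from typing import Sequence


def angular_abs_error(predicted: float, actual: float) -> float:
    """Absolute error between two angles in degrees, respecting periodicity.

    Assumption: errors wrap around the circle, so the result lies in [0, 180].
    """
    diff = abs(predicted - actual) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """MAE over paired sequences of angles in degrees."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(angular_abs_error(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)


# Toy values for illustration only (not from the paper's datasets).
toy_predicted = [-60.0, 175.0, 120.0]
toy_actual = [-57.0, -175.0, 130.0]
toy_mae = mean_absolute_error(toy_predicted, toy_actual)  # errors are 3, 10, 10
```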
Year: 2020 PMID: 33173130 PMCID: PMC7655839 DOI: 10.1038/s41598-020-76317-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Backbone angles of a protein structure.
Figure 2. Sliding window of size 5: two residues on each side of a given residue.
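The sliding window in Figure 2 can be sketched as follows. This is a minimal illustration, assuming that each residue carries a fixed-length feature vector and that window positions falling outside the chain are zero-padded; the padding scheme and the name `window_features` are assumptions, not details confirmed by the source.

```python
from typing import List, Sequence


def window_features(per_residue: Sequence[Sequence[float]],
                    window_size: int = 5) -> List[List[float]]:
    """Concatenate each residue's features with those of its neighbours.

    A window of size 5 covers two residues on each side of the centre.
    Assumption: positions outside the chain contribute zero vectors.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    if len(per_residue) == 0:
        return []
    half = window_size // 2
    padding = [0.0] * len(per_residue[0])
    rows: List[List[float]] = []
    for centre in range(len(per_residue)):
        row: List[float] = []
        for pos in range(centre - half, centre + half + 1):
            inside = 0 <= pos < len(per_residue)
            row.extend(per_residue[pos] if inside else padding)
        rows.append(row)
    return rows


# Toy chain of four residues with two features each (illustrative values).
toy_chain = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3], [4.0, 0.4]]
toy_windows = window_features(toy_chain, window_size=5)
```

Each output row has `window_size` times the per-residue feature count, which is why the input width of the network depends on both the feature set and the window size.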
Figure 3. The fully connected deep neural network used in our method. It has three hidden layers with 150 neurons each. The numbers of inputs and outputs vary with the combination of features used (e.g. PSSM plus SS, and combinations of 7PCP and ASA) and with the representation of the output angles (Direct Angles vs Trigonometric Ratios).
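The architecture in Figure 3 can be sketched as a plain forward pass. Only the three hidden layers of 150 neurons come from the caption; the ReLU activation, the weight initialisation, the input width of 115, and the reading of "trigonometric ratios" as one sine and one cosine per angle (8 outputs for four angles) are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

HIDDEN_LAYERS = 3    # from the Figure 3 caption
HIDDEN_UNITS = 150   # from the Figure 3 caption

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def output_size(use_trig_ratios: bool, n_angles: int = 4) -> int:
    """Assumption: ratios means (sin, cos) per angle; direct means one value."""
    return 2 * n_angles if use_trig_ratios else n_angles


def init_layers(n_inputs: int, n_outputs: int, seed: int = 0) -> List[Layer]:
    """Randomly initialised fully connected layers (illustrative scheme)."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [n_outputs]
    layers: List[Layer] = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = (1.0 / fan_in) ** 0.5
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers: Sequence[Layer], inputs: Sequence[float]) -> List[float]:
    """Forward pass; ReLU on hidden layers (assumed), linear output layer."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = pre if is_output else [max(0.0, v) for v in pre]
    return activations


N_INPUTS = 115  # hypothetical input width, not taken from the paper
layers = init_layers(N_INPUTS, output_size(use_trig_ratios=True))
outputs = forward(layers, [0.5] * N_INPUTS)
```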
Numbers of proteins and residues in training, validation, and testing datasets.
| Datasets | Training | Validation | Testing | Total |
|---|---|---|---|---|
| Proteins | 6721 | 667 | 1206 | 8594 |
| Residues | 1,670,605 | 165,530 | 282,461 | 2,118,596 |
27 testing proteins are in TEST2018 and 1179 are in TEST2016.
Performances of SPIDER2, SPOT-1D, SAP, and OPUS-TASS on our testing dataset and its subsets TEST2016 and TEST2018. The emboldened values are the winning numbers for the corresponding types of angles and datasets. OPUS-TASS does not predict θ and τ angles, while the other three methods predict all four types of angles (φ, ψ, θ, and τ).
Results below are as we run all of the systems on our datasets.

| Dataset | Proteins | Residues | Method | φ | ψ | θ | τ |
|---|---|---|---|---|---|---|---|
| TEST2016 | 1179 | 278553 | SPIDER2 | 18.93 | 30.14 | 8.15 | 32.13 |
| | | | SPOT-1D | 16.23 | 23.23 | 6.77 | 24.58 |
| | | | OPUS-TASS | 15.75 | 22.41 | – | – |
| | | | SAP | | | | |
| TEST2018 | 27 | 3908 | SPIDER2 | 18.51 | 28.78 | 7.80 | 30.35 |
| | | | SPOT-1D | 16.07 | 22.66 | 6.51 | 23.54 |
| | | | OPUS-TASS | 15.62 | 21.96 | – | – |
| | | | SAP | | | | |
| Testing | 1206 | 282461 | SPIDER2 | 18.92 | 30.12 | 8.15 | 32.11 |
| | | | SPOT-1D | 16.23 | 23.22 | 6.77 | 24.57 |
| | | | OPUS-TASS | 15.74 | 22.41 | – | – |
| | | | SAP | **15.65** | **18.59** | **6.07** | **21.03** |
Performance of SAP settings on 1206 testing proteins. In the table, column ASA denotes whether accessible surface area is used (Yes/No), column 7PCP denotes whether 7 physicochemical properties are used (Yes/No), column OR denotes whether the output representation is direct angles (D) or trigonometric ratios (R), column NM denotes whether the normalisation method for input feature encoding is [0,1] range based (R) or Z-score based (Z), and column WS denotes the best size of the sliding window. Note that the emboldened cells denote the best performance for each combination of ASA and 7PCP, while the boxed plus emboldened cells in each respective column denote the best performance among all SAP settings.
| ASA | 7PCP | OR | NM | φ WS | φ Test | φ Valid | ψ WS | ψ Test | ψ Valid | θ WS | θ Test | θ Valid | τ WS | τ Test | τ Valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | N | D | R | 5 | 17.14 | 17.66 | 5 | 20.19 | 20.48 | 5 | 6.40 | 6.49 | 5 | 22.74 | 23.02 |
| | | | Z | 5 | | | 5 | | | 5 | | | 5 | 22.44 | 22.68 |
| | | R | R | 5 | 18.05 | 18.58 | 5 | 21.58 | 21.97 | 5 | 6.70 | 6.81 | 5 | 24.44 | 24.71 |
| Z | 9 | 17.07 | 17.57 | 5 | 20.14 | 20.43 | 9 | 6.48 | 5 | ||||||
| Y | N | D | R | 9 | | | 9 | | | 9 | | | 9 | 21.83 | 21.35 |
| | | | Z | 5 | 16.44 | 16.55 | 5 | 19.36 | 19.56 | 5 | 6.23 | 6.18 | 9 | 21.52 | 21.63 |
| | | R | R | 13 | 17.04 | 17.54 | 13 | 19.15 | 19.20 | 13 | 6.38 | 6.18 | 13 | 22.39 | 22.06 |
| | | | Z | 9 | 16.41 | 16.78 | 9 | 19.26 | 19.94 | 9 | 6.17 | 6.33 | 9 | | |
| N | Y | D | R | 5 | | | 5 | | | 5 | | | 5 | | |
| | | | Z | 5 | 16.42 | 16.84 | 5 | 19.59 | 19.84 | 5 | 6.32 | 6.41 | 5 | 22.23 | 22.47 |
| | | R | R | 9 | 17.49 | 17.95 | 9 | 21.51 | 21.88 | 9 | 6.68 | 6.79 | 9 | 24.34 | 24.57 |
| | | | Z | 13 | 16.31 | 16.66 | 13 | 19.68 | 19.87 | 13 | 6.30 | 6.37 | 13 | 22.08 | 22.23 |
| Y | Y | D | R | 9 | | | 9 | | | 5 | | | 5 | 21.49 | 21.89 |
| Z | 9 | 15.87 | 16.55 | 5 | 18.91 | 5 | 6.16 | 6.19 | 5 | 21.71 | |||||
| | | R | R | 9 | 16.15 | 16.85 | 9 | 19.30 | 19.91 | 9 | 6.23 | 6.20 | 9 | 21.63 | 21.71 |
| Z | 9 | 16.70 | 16.51 | 9 | 18.86 | 18.91 | 9 | 6.17 | 6.20 | 9 | 21.74 | ||||
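The encoding options named in the caption (NM as range or Z-score normalisation, OR as direct angles or trigonometric ratios) can be sketched as below. This is a minimal illustration under stated assumptions: normalisation is applied per feature column, the Z-score uses the population standard deviation, and "trigonometric ratios" means a (sin, cos) pair recovered with `atan2`. None of these details are confirmed by the source, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def range_normalise(values: Sequence[float]) -> List[float]:
    """Scale one feature column into the [0, 1] range (NM = R)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore_normalise(values: Sequence[float]) -> List[float]:
    """Centre and scale one feature column to zero mean, unit SD (NM = Z)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def angle_to_ratios(angle_deg: float) -> Tuple[float, float]:
    """Encode an angle in degrees as (sin, cos), assumed form of OR = R."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)


def ratios_to_angle(sin_value: float, cos_value: float) -> float:
    """Decode (sin, cos) back to degrees in the range (-180, 180]."""
    return math.degrees(math.atan2(sin_value, cos_value))


# Toy values for illustration only.
toy_column = [2.0, 4.0, 6.0]
scaled = range_normalise(toy_column)
standardised = zscore_normalise(toy_column)
round_trip = ratios_to_angle(*angle_to_ratios(-120.0))
```

One reason to consider the ratio encoding is that sine and cosine are continuous across the ±180 boundary, whereas a direct angle target jumps there.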
Performance of the best SAP setting when the numbers of hidden layers in the DNNs are varied.
| Hidden layers | φ Test | φ Valid | ψ Test | ψ Valid | θ Test | θ Valid | τ Test | τ Valid |
|---|---|---|---|---|---|---|---|---|
| 2 | 21.20 | |||||||
| 3 | ||||||||
| 4 | 15.72 | 16.12 | 18.71 | 18.91 | 6.11 | 6.21 | 21.18 | 21.33 |
Average performance of the best setting of SAP after 10-fold cross validation is performed.
| Dataset | Measure | φ | ψ | θ | τ |
|---|---|---|---|---|---|
| Validation | MAE | 16.04 | 18.80 | 6.16 | 21.18 |
| Testing | MAE | 15.65 | 18.59 | 6.07 | 21.03 |
| 10-Fold | MAE | 16.14 | 18.82 | 6.33 | 21.31 |
| 10-Fold | SDMAE | 0.24 | 0.09 | 0.08 | 0.21 |
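The 10-fold rows above report a mean MAE and its standard deviation across folds. A minimal sketch of that bookkeeping follows, assuming proteins are shuffled and dealt into folds of near-equal size and that the SD is the sample standard deviation; the source does not state either detail, and the per-fold numbers below are toy values, not results from the paper.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle item indices and deal them into k folds of near-equal size."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def summarise(fold_maes: Sequence[float]) -> Tuple[float, float]:
    """Mean MAE across folds and its sample standard deviation (assumed)."""
    return mean(fold_maes), stdev(fold_maes)


# Toy example: 25 items in 10 folds, and three made-up per-fold MAE values.
toy_folds = k_fold_indices(25, k=10)
toy_mean, toy_sd = summarise([16.0, 16.2, 16.4])
```

In each round one fold would serve as the held-out set and the remaining folds as training data, with the MAE of each round collected before summarising.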
Performances of SPIDER2, SPOT-1D, OPUS-TASS, and SAP on filtered PDB150 and CAMEO93 proteins. The emboldened values are the winning numbers for the corresponding types of angles and datasets. OPUS-TASS does not predict θ and τ angles, while the other three methods predict all four types of angles (φ, ψ, θ, and τ).
Results below are as we run all of the systems on our datasets.

| Dataset | Proteins | Residues | Method | φ | ψ | θ | τ |
|---|---|---|---|---|---|---|---|
| PDB150 | 71 | 11547 | SPIDER2 | 20.98 | 32.32 | 8.39 | 53.46 |
| | | | SPOT-1D | | | | 52.58 |
| | | | SAP | 19.29 | 26.37 | 7.20 | |
| CAMEO93 | 55 | 13872 | SPIDER2 | 20.05 | 31.80 | 8.34 | 33.83 |
| | | | OPUS-TASS | | | – | – |
| | | | SAP | 20.24 | 31.02 | | |
Performance of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 when our testing proteins are grouped based on their lengths. In the table, the ΔMAE of a system (e.g. OPUS-TASS, SPOT-1D or SPIDER2) is its MAE minus the MAE of SAP. As such, the greater the value of ΔMAE, the worse the performance of the system relative to SAP.
| Length | Count | φ SAP MAE | φ OPUS-TASS ΔMAE | φ SPOT-1D ΔMAE | φ SPIDER2 ΔMAE | ψ SAP MAE | ψ OPUS-TASS ΔMAE | ψ SPOT-1D ΔMAE | ψ SPIDER2 ΔMAE | θ SAP MAE | θ SPOT-1D ΔMAE | θ SPIDER2 ΔMAE | τ SAP MAE | τ SPOT-1D ΔMAE | τ SPIDER2 ΔMAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 001–100 | 210 | 14.46 | + 0.11 | + 0.57 | + 3.03 | 17.88 | + 3.15 | + 3.71 | + 9.32 | 5.63 | + 0.53 | + 1.82 | 18.95 | + 2.68 | + 8.98 |
| 101–200 | 381 | 15.37 | + 0.02 | + 0.46 | + 3.08 | 18.40 | + 3.27 | + 3.93 | + 10.35 | 6.10 | + 0.55 | + 1.93 | 20.79 | + 2.63 | + 9.79 |
| 201–300 | 264 | 15.24 | + 0.25 | + 0.61 | + 3.17 | 18.02 | + 3.93 | + 4.66 | + 11.14 | 5.96 | + 0.71 | + 1.99 | 20.38 | + 3.50 | + 10.74 |
| 301–400 | 180 | 15.76 | − 0.29 | + 0.30 | + 3.42 | 18.58 | + 3.06 | + 4.09 | + 11.70 | 6.12 | + 0.59 | + 2.10 | 21.43 | + 2.96 | + 11.23 |
| 401–500 | 102 | 16.06 | + 0.34 | + 0.87 | + 3.53 | 18.98 | + 4.76 | + 5.37 | + 12.91 | 6.09 | + 0.86 | + 2.32 | 21.49 | + 4.28 | + 12.36 |
| 501–800 | 69 | 16.52 | + 0.25 | + 0.81 | + 3.29 | 19.64 | + 4.75 | + 5.89 | + 12.77 | 6.29 | + 0.89 | + 2.20 | 22.04 | + 5.22 | + 12.44 |
| Overall | 1206 | 15.65 | + 0.09 | + 0.58 | + 3.27 | 18.59 | + 3.82 | + 4.63 | + 11.53 | 6.07 | + 0.70 | + 2.08 | 21.03 | + 3.54 | + 11.08 |
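Since the other systems are reported as differences from SAP, their absolute MAE values can be recovered by adding each difference to the SAP value in the same row. The short sketch below does this for the Overall row, using only numbers from the table; the recovered values agree with the Testing rows of the earlier comparison of SPIDER2, SPOT-1D, and OPUS-TASS. The dictionary layout and the ASCII angle names are illustrative choices.

```python
from typing import Dict

# Values copied from the "Overall" row of the table above.
SAP_OVERALL: Dict[str, float] = {
    "phi": 15.65, "psi": 18.59, "theta": 6.07, "tau": 21.03,
}
DELTA_OVERALL: Dict[str, Dict[str, float]] = {
    "OPUS-TASS": {"phi": 0.09, "psi": 3.82},  # does not predict theta and tau
    "SPOT-1D": {"phi": 0.58, "psi": 4.63, "theta": 0.70, "tau": 3.54},
    "SPIDER2": {"phi": 3.27, "psi": 11.53, "theta": 2.08, "tau": 11.08},
}


def absolute_mae(sap: Dict[str, float],
                 deltas: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Recover each system's MAE as SAP's MAE plus the reported difference."""
    return {
        method: {angle: round(sap[angle] + diff, 2)
                 for angle, diff in per_angle.items()}
        for method, per_angle in deltas.items()
    }


recovered = absolute_mae(SAP_OVERALL, DELTA_OVERALL)
# recovered["SPOT-1D"] -> {"phi": 16.23, "psi": 23.22, "theta": 6.77, "tau": 24.57}
```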
Residue distribution over the testing proteins when residues are grouped by their (Left) SS and (Right) AA types. Also shown, on the left, are the typical ranges suggested for the torsion angles φ and ψ for the various secondary structures [41].
| SS | Residues | Percentage | φ range (degrees) | ψ range (degrees) |
|---|---|---|---|---|
| B | 2955 | 1.05 | [− 130, − 110] | [110, 130] |
| C | 56,250 | 19.91 | [− 180, + 180] | [− 180, + 180] |
| E | 61,041 | 21.61 | [− 130, − 110] | [110, 130] |
| G | 10,581 | 3.75 | [− 59, − 39] | [− 36, − 16] |
| H | 96,993 | 34.34 | [− 67, − 47] | [− 57, − 37] |
| I | 47 | 0.02 | [− 67, − 47] | [− 80, − 60] |
| S | 22,984 | 8.14 | [− 180, + 180] | [− 180, + 180] |
| T | 31,610 | 11.19 | [− 180, + 180] | [− 180, + 180] |
| Total | 282,461 | 100.00 | | |
Figure 4. Performance of SAP, OPUS-TASS, SPOT-1D, and SPIDER2 on the testing proteins when residues are grouped based (Top Four) on their SS types and (Bottom Four) on their AA types. In the charts, the y-axis shows MAE values and the x-axis shows SS or AA types. The dashed horizontal line in each chart shows the overall MAE value for SAP.
Figure 5. Distributions of actual angles of testing proteins and predictions of SAP, OPUS-TASS, SPOT-1D, and SPIDER2.
Figure 6. RMSD values for SAP, SPOT-1D, and OPUS-TASS on TEST2018 proteins.
Figure 7. Percentages of proteins (y-axis) that have a given percentage of residues (x-axis) with absolute error (AE) at most a given threshold T, where T is 6 or 18, denoted by T6 and T18 respectively. The lower the threshold, the better the prediction quality.
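The quantity plotted in Figure 7 can be sketched as a two-step count: first the percentage of residues in each protein whose AE is within the threshold, then the percentage of proteins that reach a given residue percentage. The sketch assumes "have a given percentage of residues" means at least that percentage, and it takes per-residue absolute errors as already computed; the function names and toy errors are illustrative, not taken from the paper.

```python
from typing import Sequence


def residues_within(abs_errors: Sequence[float], threshold: float) -> float:
    """Percentage of residues in one protein with AE at most the threshold."""
    if len(abs_errors) == 0:
        raise ValueError("a protein needs at least one residue")
    hits = sum(1 for error in abs_errors if error <= threshold)
    return 100.0 * hits / len(abs_errors)


def proteins_reaching(per_protein_errors: Sequence[Sequence[float]],
                      threshold: float,
                      min_residue_pct: float) -> float:
    """Percentage of proteins where at least min_residue_pct of residues
    have AE at most the threshold (the 'at least' reading is an assumption)."""
    if len(per_protein_errors) == 0:
        raise ValueError("need at least one protein")
    reached = sum(1 for errors in per_protein_errors
                  if residues_within(errors, threshold) >= min_residue_pct)
    return 100.0 * reached / len(per_protein_errors)


# Toy absolute errors in degrees for three proteins (illustrative only).
toy_errors = [
    [2.0, 5.0, 20.0, 7.0],
    [1.0, 3.0, 4.0, 30.0],
    [10.0, 19.0, 25.0, 40.0],
]
t6_at_50 = proteins_reaching(toy_errors, threshold=6.0, min_residue_pct=50.0)
t18_at_75 = proteins_reaching(toy_errors, threshold=18.0, min_residue_pct=75.0)
```

Sweeping `min_residue_pct` from 0 to 100 for a fixed threshold would trace one curve of the kind described in the caption.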