M A Hakim Newton, Fereshteh Mataeimoghadam, Rianon Zaman, Abdul Sattar.
Abstract
MOTIVATION: Protein backbone angle prediction has become significantly more accurate with the development of deep learning methods. Usually, the same deep learning model is used to make predictions for all residues, regardless of the secondary structure category they belong to. In this paper, we propose to train a separate deep learning model for each category of secondary structure. Machine learning methods strive to achieve generality over the training examples and consequently lose accuracy. In this work, we explicitly exploit classification knowledge to restrict generalisation to a specific class of training examples. This compensates for the accuracy lost to generalisation by exploiting specialisation knowledge in an informed way.
Keywords: Deep learning; Dihedral angle prediction; Protein structure prediction
Year: 2022 PMID: 34983370 PMCID: PMC8728911 DOI: 10.1186/s12859-021-04525-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
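The specialisation idea in the abstract, one predictor per secondary-structure class, can be pictured as a routing step placed in front of the class-specific models. The following is a minimal sketch under stated assumptions: the function name `predict_by_ss_class`, the single-letter class labels, and the constant stand-in predictors are all hypothetical and are not the authors' implementation or trained networks.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A predictor maps one residue's feature vector to a tuple of angles in degrees.
Predictor = Callable[[Sequence[float]], Tuple[float, ...]]


def predict_by_ss_class(
    features: List[Sequence[float]],
    ss_classes: List[str],
    models: Dict[str, Predictor],
) -> List[Tuple[float, ...]]:
    """Send each residue to the model that was trained for its SS class."""
    if len(features) != len(ss_classes):
        raise ValueError("one SS class label is needed per residue")
    return [models[ss](x) for x, ss in zip(features, ss_classes)]


# Stand-in predictors returning illustrative constants (hypothetical values,
# used only so that the routing can be run end to end).
toy_models: Dict[str, Predictor] = {
    "H": lambda x: (-60.0, -45.0),
    "E": lambda x: (-120.0, 130.0),
    "C": lambda x: (-70.0, 140.0),
}

angles = predict_by_ss_class([[0.1], [0.2], [0.3]], ["H", "E", "C"], toy_models)
```

Each residue only ever sees the model for its own class, which is what restricts generalisation to that class.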
Fig. 1 Encoding of 8-state SS predictions by SSPro8 using one-hot vectors. Exactly one bit in each bit string of length 8 has 1 in it and the other 7 bits are 0s
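The one-hot encoding described in the caption can be sketched as follows. The state alphabet and its ordering (`HGIEBTSC`) are assumptions based on the usual 8-state DSSP letters; the ordering used in the figure itself may differ.

```python
from typing import List

# Assumed 8-state alphabet and ordering (hypothetical; see the lead-in).
SS8_STATES = "HGIEBTSC"


def one_hot_ss8(state: str) -> List[int]:
    """Return a length-8 bit vector with a single 1 at the state's index."""
    if len(state) != 1 or state not in SS8_STATES:
        raise ValueError(f"unknown 8-state SS label: {state!r}")
    return [1 if s == state else 0 for s in SS8_STATES]


def encode_sequence(ss_string: str) -> List[List[int]]:
    """Encode a per-residue SS string as a list of one-hot vectors."""
    return [one_hot_ss8(s) for s in ss_string]


encoded = encode_sequence("HHEC")
```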
Numbers of proteins and residues in training, validation, and testing datasets. The testing set comprises 1205, 61, 55 proteins from SPOT-1D, PDB150, CAMEO93 test sets respectively
| Datasets | Training | Validation | Testing | Total |
|---|---|---|---|---|
| Proteins | 6721 | 667 | 1321 | 8709 |
| Residues | 1,670,605 | 165,530 | 306,608 | 2,142,743 |
Distribution of protein residues over 3-state secondary structure types
| SS type | Training residues | Training percent | Validation residues | Validation percent | Testing residues | Testing percent |
|---|---|---|---|---|---|---|
| Helix | 637,203 | 38.14 | 61,814 | 37.34 | 116,691 | 38.06 |
| Sheet | 383,140 | 22.93 | 38,401 | 23.20 | 68,975 | 22.50 |
| Coil | 650,262 | 38.92 | 65,315 | 39.46 | 120,942 | 39.45 |
| Helix | 637,996 | 38.19 | 61,925 | 37.41 | 116,568 | 38.02 |
| Sheet | 384,423 | 23.01 | 38,723 | 23.39 | 69,303 | 22.60 |
| Coil | 648,186 | 38.80 | 64,882 | 39.20 | 120,737 | 39.38 |
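The percentage columns follow directly from the residue counts. As a small worked check, the sketch below recomputes the testing-set percentages from the first Helix, Sheet, and Coil rows of the table; the function name is hypothetical.

```python
from typing import Dict


def class_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-class residue counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Testing-set counts taken from the first three rows of the table above.
testing_counts = {"Helix": 116_691, "Sheet": 68_975, "Coil": 120_942}
testing_percent = class_percentages(testing_counts)
```

The three counts sum to 306,608, the number of testing residues given in the dataset table.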
Fig. 2 Validation and testing performances in MAE (y-axis) of total 16 SAP4SS settings (x-axis) for each type of backbone angles and for each type of 3-state SS classes
The best SAP4SS settings for SS types and angle types
| SS type | Angle types | Best setting |
|---|---|---|
| Helix | | YYR9 |
| Sheet | | YYR9 |
| | | YYZ9 |
| Coil | | YYR5 |
| | | YYR9 |
MAE values for each angle, as predicted by each method, for residues of each actual 3-state SS type in our 1321 testing proteins
| SS-type | Residues | Method | | | | |
|---|---|---|---|---|---|---|
| Helix | 116,691 | SPOT-1D | 7.51 | 11.77 | 3.36 | 11.99 |
| | | OPUS-TASS | 7.10 | 11.02 | | |
| | | SAP | 6.36 | 8.14 | 2.60 | 9.00 |
| | | SAP4SS | | | | |
| | | | 0.79% | 0.74% | 2.77% | 3.09% |
| Sheet | 68,975 | SPOT-1D | 16.43 | 17.85 | 8.19 | 23.47 |
| | | OPUS-TASS | | 17.29 | | |
| | | SAP | 17.22 | 16.68 | 8.32 | 24.44 |
| | | SAP4SS | 16.48 | | | |
| | | | −3.34% | 6.31% | 3.02% | 3.94% |
| Coil | 120,942 | SPOT-1D | 24.85 | 37.63 | 9.28 | 40.27 |
| | | OPUS-TASS | 24.33 | 36.78 | | |
| | | SAP | 24.65 | 31.99 | 8.48 | 35.03 |
| | | SAP4SS | | | | |
| | | | 0.62% | 2.20% | 1.31% | 2.76% |
The emboldened values are the best values over the methods compared
MAE values for angles as predicted by various methods for 306,608 residues in 1321 test proteins
| SS-type | Residues | Method | | | | |
|---|---|---|---|---|---|---|
| All three types | 306,608 | SPOT-1D | 16.30 | 23.25 | 6.76 | 25.56 |
| | | OPUS-TASS | 15.83 | 22.49 | | |
| | | SAP | 15.96 | 19.39 | 6.19 | 22.60 |
| | | SAP4SS | **15.59** | **18.87** | **6.03** | **21.71** |
| | | | 1.54% | 2.76% | 2.65% | 4.10% |
The emboldened values are the best values over the methods compared. Residues are not categorised using SS types
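Because backbone angles are periodic, an absolute error between a predicted and an actual angle is usually taken along the shorter arc of the circle before averaging. The sketch below shows that common convention; it is an assumption that the MAE values in these tables were computed this way, and the function names are hypothetical.

```python
from typing import Sequence


def angular_abs_error(predicted: float, actual: float) -> float:
    """Absolute difference in degrees, measured along the shorter arc."""
    diff = abs(predicted - actual) % 360.0
    return min(diff, 360.0 - diff)


def angular_mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of the per-residue angular absolute errors."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [angular_abs_error(p, a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# 179 vs -179 degrees are 2 degrees apart on the circle, not 358.
example_mae = angular_mae([179.0, 10.0], [-179.0, 350.0])
```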
Spearman rank correlation coefficients for the association between actual angles and angles predicted by various methods
| Method | | | | |
|---|---|---|---|---|
| SPOT-1D | 0.722 | 0.758 | 0.779 | 0.515 |
| OPUS-TASS | 0.741 | 0.767 | | |
| SAP | 0.731 | 0.773 | 0.810 | 0.518 |
| SAP4SS | | | | |
The emboldened values are the best values as the higher the coefficients the better the correlation
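A Spearman rank correlation coefficient is the Pearson correlation of the ranks of the two series, with tied values sharing their average rank. The sketch below implements that definition with the standard library only; how the angles were preprocessed before ranking is not stated here, so the toy inputs are purely illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```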
Fig. 3 95% confidence intervals of AE values (y-axis) for various methods (x-axis)
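One common way to obtain a 95% confidence interval for a mean absolute error is the normal approximation, mean ± 1.96 standard errors. The caption does not say how the intervals in the figure were computed, so the sketch below is only an illustration of that approximation with hypothetical names and toy data.

```python
import math
import statistics
from typing import List, Tuple


def mean_confidence_interval(
    abs_errors: List[float], z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation interval for the mean of the absolute errors."""
    if len(abs_errors) < 2:
        raise ValueError("at least two values are needed")
    mean = statistics.mean(abs_errors)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(abs_errors) / math.sqrt(len(abs_errors))
    return mean - z * sem, mean + z * sem


toy_errors = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = mean_confidence_interval(toy_errors)
```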
Performance of SAP4SS, SAP, OPUS-TASS (in columns OTASS), and SPOT-1D (in columns SPOT) when our testing proteins are grouped based on their lengths
| Length | Count | SAP4SS MAE | SAP % | OTASS % | SPOT % | SAP4SS MAE | SAP % | OTASS % | SPOT % | SAP4SS MAE | SAP % | SPOT % | SAP4SS MAE | SAP % | SPOT % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 001–100 | 232 | 14.60 | + 2.05 | + 1.03 | + 3.77 | 18.47 | + 1.08 | + 15.54 | + 17.16 | 5.63 | + 1.95 | + 9.59 | 20.39 | + 3.68 | + 13.68 |
| 101–200 | 424 | 15.32 | + 2.48 | + 1.17 | + 3.98 | 18.78 | + 2.50 | + 15.65 | + 19.33 | 6.05 | + 2.64 | + 9.75 | 22.01 | + 4.27 | + 14.08 |
| 201–300 | 294 | 15.24 | + 2.49 | + 2.23 | + 4.66 | 18.41 | + 2.66 | + 19.77 | + 23.19 | 5.95 | + 2.35 | + 11.93 | 22.00 | + 3.27 | + 16.55 |
| 301–400 | 190 | 15.60 | + 2.88 | − 0.19 | + 3.53 | 18.58 | + 3.39 | + 16.85 | + 22.50 | 6.04 | + 2.98 | + 11.42 | 21.23 | + 5.37 | + 17.57 |
| 401–500 | 103 | 15.80 | + 2.09 | + 3.92 | + 7.28 | 18.57 | + 2.85 | + 27.95 | + 31.07 | 5.96 | + 2.52 | + 16.61 | 20.75 | + 4.14 | + 24.24 |
| 501–800 | 78 | 16.64 | + 1.98 | + 1.08 | + 4.21 | 20.59 | + 2.77 | + 19.14 | + 23.75 | 6.34 | + 2.68 | + 12.93 | 22.96 | + 3.48 | + 19.99 |
| Overall | 1321 | 15.59 | + 2.37 | + 1.54 | + 4.55 | 18.87 | + 2.76 | + 19.18 | + 23.21 | 6.03 | + 2.65 | + 12.11 | 21.71 | + 4.10 | + 17.73 |
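In the table, % of a method (e.g. SAP, OPUS-TASS, or SPOT-1D) is computed as $100 \times (\mathrm{MAE}_{\text{method}} - \mathrm{MAE}_{\text{SAP4SS}}) / \mathrm{MAE}_{\text{SAP4SS}}$. The greater this percentage, the worse the performance of the method w.r.t. SAP4SS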
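The percentage columns can be checked against the MAE values reported in the earlier all-residue table. The sketch below assumes the percentage is the MAE difference relative to the SAP4SS MAE, which reproduces the first angle's entries in the Overall row; the function name is hypothetical.

```python
def percent_worse_than_reference(mae_method: float, mae_reference: float) -> float:
    """Relative MAE increase of a method over the reference, in percent."""
    return 100.0 * (mae_method - mae_reference) / mae_reference


# First-angle MAE values: SAP4SS from the Overall row, the others from the
# all-residue MAE table.
sap4ss = 15.59
overall = {
    "SAP": percent_worse_than_reference(15.96, sap4ss),
    "OPUS-TASS": percent_worse_than_reference(15.83, sap4ss),
    "SPOT-1D": percent_worse_than_reference(16.30, sap4ss),
}
```

Rounded to two decimals, these give 2.37, 1.54, and 4.55, matching the Overall row.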
Fig. 4 Percentages of proteins (y-axis) having certain percentages of angles (x-axis) with AE within threshold 6 (T6) and 12 (T12) degrees. The lower the threshold, the better the results
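The quantities plotted in this figure can be sketched in two steps: for each protein, the percentage of angles whose AE is within a threshold, and then the percentage of proteins that reach a given level. Treating "within" as less than or equal to the threshold, and counting proteins that reach at least the given level, are both assumptions; the function names and toy data are hypothetical.

```python
from typing import List


def percent_within(abs_errors: List[float], threshold: float) -> float:
    """Percentage of a protein's angles with AE <= threshold (degrees)."""
    if not abs_errors:
        raise ValueError("a protein needs at least one angle")
    hits = sum(1 for e in abs_errors if e <= threshold)
    return 100.0 * hits / len(abs_errors)


def percent_of_proteins_reaching(
    per_protein_errors: List[List[float]], threshold: float, min_percent: float
) -> float:
    """Percentage of proteins with at least min_percent of angles within threshold."""
    reached = sum(
        1
        for errors in per_protein_errors
        if percent_within(errors, threshold) >= min_percent
    )
    return 100.0 * reached / len(per_protein_errors)


# Two toy proteins with four angle errors each.
toy_proteins = [[1.0, 5.0, 7.0, 13.0], [2.0, 3.0, 4.0, 20.0]]
t6 = percent_of_proteins_reaching(toy_proteins, threshold=6.0, min_percent=60.0)
t12 = percent_of_proteins_reaching(toy_proteins, threshold=12.0, min_percent=60.0)
```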
Fig. 5 RMSD values (y-axis) of protein structures (x-axis) generated using angles predicted by various methods. In the charts, proteins are sorted along the x-axis by length