Jaspreet Singh, Kuldip Paliwal, Thomas Litfin, Jaswinder Singh, Yaoqi Zhou.
Abstract
Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that SPOT-1D-LM, a method that combines traditional one-hot encoding with embeddings from two different language models (ProtTrans and ESM-1b) as input, yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers, for all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, its performance is comparable to that of profile-based methods for proteins with homologous sequences. For example, the accuracies of three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% by SPOT-1D-LM, compared to 74.3% and 73.4% by the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% by the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020), SS3 accuracy is 80.41% by SPOT-1D-LM, which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its fast performance. Moreover, high-accuracy prediction of both secondary and tertiary structural properties such as backbone angles and solvent accessibility without sequence alignment suggests that highly accurate prediction of protein structures may be made without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
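The input construction described in the abstract (one-hot encoding joined with per-residue embeddings from two language models) can be sketched as a simple feature concatenation. This is a minimal illustration, not the authors' implementation: the function names `one_hot` and `build_input` are hypothetical, the embedding widths (1024 for ProtTrans, 1280 for ESM-1b) are assumptions based on commonly used checkpoints of those models, and random numbers stand in for real embeddings.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

PROTTRANS_DIM = 1024  # assumed embedding width, not stated in the abstract
ESM1B_DIM = 1280      # assumed embedding width, not stated in the abstract


def one_hot(sequence):
    """Return one 20-element row per residue; unknown residues stay all-zero."""
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows


def build_input(sequence, prottrans_emb, esm1b_emb):
    """Concatenate the three per-residue feature vectors along the feature axis."""
    if not (len(sequence) == len(prottrans_emb) == len(esm1b_emb)):
        raise ValueError("each embedding needs exactly one row per residue")
    return [
        oh + list(pt) + list(es)
        for oh, pt, es in zip(one_hot(sequence), prottrans_emb, esm1b_emb)
    ]


# Random placeholders stand in for real language-model embeddings.
random.seed(0)
seq = "MKTAYIAKQR"
prottrans = [[random.gauss(0, 1) for _ in range(PROTTRANS_DIM)] for _ in seq]
esm1b = [[random.gauss(0, 1) for _ in range(ESM1B_DIM)] for _ in seq]

features = build_input(seq, prottrans, esm1b)
print(len(features), len(features[0]))  # 10 2324
```

Under these assumed widths, each residue is described by 20 + 1024 + 1280 = 2324 features, and no sequence alignment is needed to produce them.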
Year: 2022 PMID: 35534620 PMCID: PMC9085874 DOI: 10.1038/s41598-022-11684-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Performance in secondary structure prediction using different input features, as labelled, for three different model architectures on three test sets (TEST2018, TEST2020, and Neff1-2020).
Individual model performance compared with the ensemble performance on the TEST2018 and TEST2020 sets for prediction of secondary structure in three (SS3) and eight (SS8) states, solvent accessibility (ASA), half-sphere-exposure-up (HSE-u), half-sphere-exposure-down (HSE-d), contact number (CN), and backbone angles (ψ, φ, θ, and τ). Performance measures are accuracy for SS3 and SS8, correlation coefficient for ASA, HSE-u, HSE-d, and CN, and mean absolute error for the angles.
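| Model | TEST2018 SS3 | TEST2018 SS8 | TEST2018 ASA | TEST2018 HSE-u | TEST2018 HSE-d | TEST2018 CN | TEST2018 ψ | TEST2018 φ | TEST2018 θ | TEST2018 τ | TEST2020 SS3 | TEST2020 SS8 | TEST2020 ASA | TEST2020 HSE-u | TEST2020 HSE-d | TEST2020 CN | TEST2020 ψ | TEST2020 φ | TEST2020 θ | TEST2020 τ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|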
| 2-Layer-LSTM | 86.50 | 76.07 | 0.804 | 0.745 | 0.755 | 0.788 | 23.964 | 16.139 | 6.540 | 24.821 | 79.57 | 66.44 | 0.708 | 0.516 | 0.591 | 0.612 | 36.792 | 20.671 | 8.738 | 36.149 |
| Multi-Scale ResNet | 86.33 | 75.57 | 0.799 | 0.748 | 0.748 | 0.786 | 24.285 | 16.216 | 6.577 | 25.114 | 79.59 | 66.32 | 0.700 | 0.510 | 0.588 | 0.607 | 37.040 | 20.879 | 8.793 | 36.193 |
| Multi-Scale ResNet LSTM | 86.49 | 75.85 | 0.799 | 0.749 | 0.748 | 0.778 | 24.396 | 16.426 | 6.617 | 25.280 | 79.48 | 66.33 | 0.702 | 0.512 | 0.584 | 0.606 | 36.877 | 20.849 | 8.725 | 36.118 |
| Ensemble (This work) | 86.74 | 76.47 | 0.814 | 0.759 | 0.761 | 0.690 | 23.748 | 15.990 | 6.461 | 24.600 | 79.82 | 66.68 | 0.731 | 0.522 | 0.597 | 0.623 | 36.574 | 20.672 | 8.674 | 35.795 |
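The table above compares three individual networks with their ensemble. The record does not say how the ensemble is formed, so the following is only a sketch under the assumption that per-residue class probabilities from the individual models are averaged and the most probable state is then taken. The function names and the toy probabilities are hypothetical.

```python
def ensemble_average(per_model_probs):
    """Average class probabilities across models.

    per_model_probs[m][i][c] is the probability that model m assigns
    to class c at residue i. Returns averaged[i][c].
    """
    n_models = len(per_model_probs)
    averaged = []
    # zip(*...) groups the outputs of all models for one residue at a time.
    for residue_preds in zip(*per_model_probs):
        averaged.append([sum(vals) / n_models for vals in zip(*residue_preds)])
    return averaged


def predict_states(probs, labels="HEC"):
    """Pick the most probable state (helix, strand, coil) for each residue."""
    return "".join(
        labels[max(range(len(p)), key=p.__getitem__)] for p in probs
    )


# Toy outputs from three hypothetical models for a two-residue protein.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.4, 0.3, 0.3]]
model_c = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

avg = ensemble_average([model_a, model_b, model_c])
print(predict_states(avg))  # HC
```

In this toy case the three models disagree on the second residue (E, H, and C respectively), and averaging settles on the state with the highest mean probability.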
Figure 2. Comparison of the accuracy of three-state (SS3) secondary structure prediction by SPOT-1D-LM (this work) with single-sequence methods (SPIDER3-Single, ProteinUnet, and SPOT-1D-Single) and sequence-profile-based methods (SPOT-1D and NetSurfP-2.0) on six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM, and CASP14-FM).
Comparison of the performance of SPOT-1D-LM with single-sequence-based methods (SPIDER3-Single, ProteinUnet, and SPOT-1D-Single) and sequence-profile-based methods (SPOT-1D and NetSurfP-2.0) in the prediction of secondary structure in three (SS3) and eight (SS8) states, solvent accessibility (ASA), half-sphere-exposure-up (HSE-u), half-sphere-exposure-down (HSE-d), contact number (CN), and backbone angles (ψ, φ, θ, and τ) for TEST2018. Performance measures are accuracy for SS3 and SS8, correlation coefficient for ASA, HSE-u, HSE-d, and CN, and mean absolute error for the angles.
| Model | SS3 | SS8 | ASA | HSE-u | HSE-d | CN | ψ | φ | θ | τ |
|---|---|---|---|---|---|---|---|---|---|---|
| SPIDER3-Single | 72.57 | 59.81 | 0.647 | 0.523 | 0.487 | 0.547 | 43.05 | 23.78 | 11.07 | 45.38 |
| ProteinUnet | 72.57 | 60.30 | 0.620 | 0.537 | 0.510 | 0.545 | 42.93 | 23.42 | 10.28 | 44.94 |
| SPOT-1D-Single | 74.28 | 72.17 | 0.665 | 0.573 | 0.563 | 0.585 | 40.58 | 22.16 | 9.35 | 42.32 |
| NetSurfP-2.0 (profile) | 85.35 | 73.48 | 0.783 | – | – | – | 26.63 | 17.90 | – | – |
| SPOT-1D (profile) | 86.18 | 75.41 | 0.787 | 0.732 | 0.737 | 0.777 | 24.87 | 16.88 | 6.91 | 25.94 |
| SPOT-1D-LM (This work) | 86.74 | 76.47 | 0.814 | 0.759 | 0.761 | 0.690 | 23.74 | 15.99 | 6.46 | 24.60 |
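The three kinds of performance measure named in the caption (accuracy, correlation coefficient, mean absolute error) can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code. It assumes the correlation is the Pearson coefficient and that angle errors are wrapped to account for periodicity, so that 175° and -175° count as 10° apart; neither choice is stated in this record.

```python
import math


def accuracy(pred, true):
    """Percentage of residues whose predicted state matches the true state."""
    if len(pred) != len(true):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(p == t for p, t in zip(pred, true)) / len(true)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def angle_mae(pred_deg, true_deg):
    """Mean absolute error in degrees, taking the shorter way around the circle."""
    total = 0.0
    for p, t in zip(pred_deg, true_deg):
        diff = abs(p - t) % 360.0
        total += min(diff, 360.0 - diff)
    return total / len(true_deg)


print(accuracy("HHEEC", "HHECC"))                   # 80.0
print(angle_mae([175.0, -60.0], [-175.0, -50.0]))   # 10.0
```

Without the wrap-around step, the first angle pair in the example would contribute an error of 350° instead of 10°, which would distort the mean for angles near ±180°.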
Comparison of the performance of SPOT-1D-LM with single-sequence-based methods (SPIDER3-Single, ProteinUnet, and SPOT-1D-Single) and sequence-profile-based methods (SPOT-1D and NetSurfP-2.0) in the prediction of secondary structure in three (SS3) and eight (SS8) states, solvent accessibility (ASA), half-sphere-exposure-up (HSE-u), half-sphere-exposure-down (HSE-d), contact number (CN), and backbone angles (ψ, φ, θ, and τ) for TEST2020. Performance measures are accuracy for SS3 and SS8, correlation coefficient for ASA, HSE-u, HSE-d, and CN, and mean absolute error for the angles.
| Model | SS3 | SS8 | ASA | HSE-u | HSE-d | CN | ψ | φ | θ | τ |
|---|---|---|---|---|---|---|---|---|---|---|
| SPIDER3-Single | 71.31 | 57.57 | 0.596 | 0.358 | 0.417 | 0.434 | 45.64 | 23.48 | 11.52 | 46.04 |
| ProteinUnet | 72.20 | 58.71 | 0.555 | 0.366 | 0.426 | 0.441 | 44.87 | 23.19 | 10.49 | 44.95 |
| SPOT-1D-Single | 73.80 | 60.35 | 0.621 | 0.400 | 0.478 | 0.485 | 44.25 | 22.92 | 9.88 | 43.67 |
| NetSurfP-2.0 (profile) | 79.42 | 66.36 | 0.702 | – | – | – | 35.07 | 20.70 | – | – |
| SPOT-1D (profile) | 80.52 | 67.76 | 0.691 | 0.516 | 0.594 | 0.600 | 34.46 | 20.33 | 8.50 | 33.64 |
| SPOT-1D-LM (This work) | 79.82 | 66.68 | 0.731 | 0.522 | 0.597 | 0.704 | 36.57 | 20.67 | 8.67 | 35.80 |
Figure 3. As in Fig. 2 but for prediction of a tertiary structural property (solvent accessibility).
Figure 4. Overview of the model architecture.