Magnus Haraldson Høie1, Erik Nicolas Kiehl1, Bent Petersen2,3, Morten Nielsen1, Ole Winther4,5,6, Henrik Nielsen1, Jeppe Hallgren7, Paolo Marcatili1.
Abstract
Recent advances in machine learning and natural language processing have profoundly improved our ability to accurately predict protein structures and their functions. While these improvements are significantly impacting biology and biotechnology at large, such methods place high demands on computing power and runtime, which hampers their applicability to large datasets. Here, we present NetSurfP-3.0, a tool for predicting solvent accessibility, secondary structure, structural disorder and backbone dihedral angles for each residue of an amino acid sequence. This NetSurfP update exploits recent advances in pre-trained protein language models to reduce the runtime of its predecessor by two orders of magnitude, while displaying similar prediction performance. We assessed the accuracy of NetSurfP-3.0 on several independent test datasets and found it to consistently produce state-of-the-art predictions for each of its output features, with a runtime up to 600 times faster than that of the most commonly available methods performing the same tasks. The tool is freely available as a web server with a user-friendly interface to navigate the results, as well as a standalone downloadable package.
Year: 2022 PMID: 35648435 PMCID: PMC9252760 DOI: 10.1093/nar/gkac439
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. Runtime analysis in seconds for submissions of 1 to 1000 sequences for NetSurfP-3.0, SPOT-1D-LM and NetSurfP-2.0. We note a 5.3× speed-up for NetSurfP-3.0 versus its closest competitor SPOT-1D-LM, and a 665× speed-up versus the older NetSurfP-2.0, when processing 1000 sequences. The runtime for a single AlphaFold2 model is shown with a dashed line. We performed all benchmarks on the same AWS EC2 G4 instance, with a 16 GB NVIDIA T4 GPU (see Methods).
Model performance when applying one-hot encodings or ESM1b embeddings to predict protein local structure
| Test dataset | Model | RSA ↑ (PCC) | ASA ↑ (PCC) | Q8 ↑ (ACC) | Q3 ↑ (ACC) | Disorder ↑ (MCC) | Disorder ↓ (FNR) | Phi ↓ (MAE) | Psi ↓ (MAE) |
|---|---|---|---|---|---|---|---|---|---|
| CB513 | NSP One-Hot | 0.628 | 0.669 | 0.573 | 0.719 | - | - | 25.80 | 46.16 |
| CB513 | NetSurfP-2.0 | 0.791 | 0.804 | | 0.845 | - | - | 20.35 | |
| CB513 | NetSurfP-3.0 | | | 0.711 | | - | - | | 29.25 |
| TS115 | NSP One-Hot | 0.633 | 0.679 | 0.628 | 0.746 | 0.561 | | 22.60 | 41.40 |
| TS115 | NetSurfP-2.0 | 0.771 | 0.793 | 0.740 | 0.849 | 0.624 | 0.013 | 17.40 | 26.80 |
| TS115 | NetSurfP-3.0 | | | | | | 0.015 | | |
| CASP12 | NSP One-Hot | 0.570 | 0.608 | 0.576 | 0.704 | 0.573 | | 26.30 | 46.70 |
| CASP12 | NetSurfP-2.0 | | | | | | 0.015 | | |
| CASP12 | NetSurfP-3.0 | 0.707 | 0.722 | 0.669 | 0.791 | 0.621 | 0.024 | 21.25 | 33.92 |
Assessment of NetSurfP-3.0, NetSurfP-2.0, and NetSurfP with one-hot encoding (NSP One-Hot) on the CB513, TS115 and CASP12 datasets. Each column reports an output variable with the corresponding metric: Pearson correlation coefficient (PCC), accuracy (ACC), Matthews correlation coefficient (MCC), false positive rate (FPR) and mean absolute error (MAE). Up- and down-facing arrows indicate metrics for which an improvement is represented by higher or lower values, respectively. For each dataset, the values corresponding to the best performances are in bold.
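The metrics named in this caption can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code behind the table: the function names are hypothetical, disorder labels are assumed to be encoded as 0/1, and the dihedral-angle error is assumed to be measured on the circle (so that 170° and -170° differ by 20°), which the caption does not state.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def accuracy(true, pred):
    """Fraction of residues whose predicted class matches the true class (ACC)."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def confusion(true, pred):
    """Counts (tp, tn, fp, fn) for binary labels encoded as 0/1 (assumed encoding)."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(true, pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(true, pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def false_positive_rate(true, pred):
    """FPR = FP / (FP + TN)."""
    _, tn, fp, _ = confusion(true, pred)
    return fp / (fp + tn)

def angular_mae(true_deg, pred_deg):
    """Mean absolute error in degrees, using the shorter arc on the circle (assumption)."""
    total = 0.0
    for t, p in zip(true_deg, pred_deg):
        diff = abs(t - p) % 360.0
        total += min(diff, 360.0 - diff)
    return total / len(true_deg)
```

For example, `accuracy("HHEC", "HEEC")` compares two per-residue secondary structure strings position by position, and `angular_mae([170.0], [-170.0])` treats the two angles as 20° apart rather than 340°.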
Performance benchmark on CASP14_fm test set proteins
| Model | Q3 ↑ (ACC) | Q8 ↑ (ACC) | RSA ↑ (PCC) | Phi ↓ (MAE) | Psi ↓ (MAE) |
|---|---|---|---|---|---|
| NetSurfP-3.0 | 0.601 | 0.607 | 0.599 | | 42.02 |
| NetSurfP-2.0 | 0.581 | 0.618 | | 52.73 | |
| SPOT-1D | 0.576 | 0.572 | 0.556 | 71.56 | 84.92 |
| SPOT-1D-LM | | | 0.601 | 66.64 | 81.72 |
Performance of NetSurfP-3.0, NetSurfP-2.0, SPOT-1D-Single (SPOT-1D) and SPOT-1D-LM on the CASP14_FM dataset. Each column represents a different output variable, with the associated metric. Up- and down-facing arrows indicate metrics for which an improvement is represented by higher or lower values, respectively. The values corresponding to the best performances are in bold.
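Q3 and Q8 accuracy both score per-residue secondary structure, over three and eight classes respectively. The sketch below shows one common way to collapse an eight-state DSSP string into three states before scoring. The mapping (H, G, I to helix; E, B to strand; T, S, C to coil) is a widely used convention and is an assumption here, not something the table states; `Q8_TO_Q3`, `q8_to_q3` and `q_accuracy` are hypothetical names.

```python
# Assumed eight-to-three state reduction (a common DSSP convention).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helix classes
    "E": "E", "B": "E",             # strand classes
    "T": "C", "S": "C", "C": "C",   # turn, bend and coil
}

def q8_to_q3(q8):
    """Collapse an eight-state string into a three-state string."""
    return "".join(Q8_TO_Q3[state] for state in q8)

def q_accuracy(true, pred):
    """Per-residue accuracy between two equal-length state strings."""
    if len(true) != len(pred):
        raise ValueError("sequences must have the same length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)

# Usage: score the same prediction at both resolutions.
true_q8 = "HHGGEEBTSC"
pred_q8 = "HHHHEEETTC"
q8_score = q_accuracy(true_q8, pred_q8)
q3_score = q_accuracy(q8_to_q3(true_q8), q8_to_q3(pred_q8))
```

Under this reduction, a prediction that is correct at eight states is also correct at three, so a single set of predictions scored this way cannot have a Q3 below its Q8.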
Figure 2. The web interface of NetSurfP-3.0. A graphical representation displays Q3 secondary structure, RSA, and disorder predictions together with the sequence. By hovering over a residue, the user can visualize all predictions. Through the Export menus it is possible to download individual predictions in different file formats, as well as the complete results as a compressed archive.