| Literature DB >> 36163232 |
Henriette Capel, Robin Weiler, Maurits Dijkstra, Reinier Vleugels, Peter Bloem, K Anton Feenstra.
Abstract
Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets from https://github.com/ibivu/protein-glue.
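The abstract names masked symbol prediction as one of the two pre-training tasks. A minimal sketch of BERT-style masking on a protein sequence is shown below; the masking fraction (15%), the `<mask>` token, and the function names are assumptions for illustration, not the paper's actual implementation.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues
MASK = "<mask>"

def mask_sequence(seq, mask_frac=0.15, seed=0):
    """BERT-style masking sketch: hide a fraction of residues behind a
    mask token; the model is trained to recover the original symbols.
    The 15% fraction is an assumption borrowed from BERT, not a value
    taken from the paper."""
    rng = random.Random(seed)
    tokens = list(seq)
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]  # ground-truth symbol to predict
        tokens[p] = MASK
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The loss for this pre-training task is then computed only over the masked positions, comparing the model's predicted symbol distribution against the stored targets.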
Year: 2022 PMID: 36163232 PMCID: PMC9512797 DOI: 10.1038/s41598-022-19608-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Overview from literature of the best performing protein representation models on the secondary structure prediction task for 8 classes (SS8), using the CB513 dataset.
| Model name | CB513 SS8 accuracy | Model size | Pre-training db size |
|---|---|---|---|
| TAPE Transformer | 59 | 38M | 32M |
| ProtBert-BFD | 70 | 420M | 2 122M |
| ProtTrans-T5 | 71.4 ± 0.2 | 3 000M | 2 122M |
| ESM-1b | 71.6 ± 0.1 | 650M | 250M |
| ESM-MSA-1 | 72.9 ± 0.2 | 100M | 26M |
| NetSurfP-2.0 (mmseqs) | 72.3 | | |
| RaptorX | 70.6 | | |
We gathered results from three articles that report model performance on the CB513 dataset. The performance metric shown here is percent SS8 accuracy as reported. The NetSurfP-2.0 and RaptorX models from Klausen et al.[1] are included as examples of state-of-the-art performance; neither is based on a protein representation model. The other methods are all representation models, and only the MSA Transformer (ESM-MSA-1) achieves higher accuracy than NetSurfP-2.0. The number of parameters (model size) and the number of sequences in the pre-training dataset are also reported. Rao et al.[47] create one MSA per sequence as input for ESM-MSA-1, which means pre-training was performed on the same number of sequences as MSAs.
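The SS8 accuracy reported in the table is simply the fraction of residues whose predicted 8-class secondary structure label matches the reference, expressed as a percentage. A minimal sketch (function name and string label encoding are assumptions):

```python
def ss8_accuracy(pred, true):
    """Per-residue accuracy over 8-class (SS8) secondary structure
    labels, e.g. the DSSP classes H, G, I, E, B, T, S, C. Sequences
    are given as strings of single-character class labels."""
    assert len(pred) == len(true), "sequences must align residue-wise"
    correct = sum(p == t for p, t in zip(pred, true))
    return 100.0 * correct / len(true)

ss8_accuracy("HHHEECCT", "HHHEECCC")  # 7 of 8 residues match -> 87.5
```

Dataset-level figures such as the 72.9 ± 0.2 reported for ESM-MSA-1 aggregate this per-residue score over all CB513 proteins.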
Overview of the number of protein sequences in the training, validation and test set of the seven downstream protein structural prediction tasks.
| Dataset | SS3 | SS8 | BUR | ASA | PPI | EPI | HPR |
|---|---|---|---|---|---|---|---|
| Training set | 8 803 | 8 803 | 8 803 | 8 803 | 324 | 179 | 2 949 |
| Validation set | 1 102 | 1 102 | 1 102 | 1 102 | 81 | 45 | 738 |
| Test set | 1 102 | 1 102 | 1 102 | 1 102 | 137 | 56 | 1 230 |
The secondary structure prediction tasks in three (SS3) and eight (SS8) classes, buried residue prediction (BUR), and absolute solvent accessibility prediction (ASA) are based on the same dataset. In addition, we included prediction tasks for protein-protein interaction interfaces (PPI), epitope interfaces (EPI), and hydrophobic patches (HPR).
Figure 1: Improved performance on downstream tasks with pre-trained models. Prediction performance on the benchmark test set, as introduced in the section ProteinGLUE benchmark tasks, for the medium model without pre-training (pink), the medium model pre-trained for 2 000 000 steps (dark red), the base model without pre-training (grey), and the base model pre-trained for 1 000 000 steps (dark blue). The yellow lines indicate the performance of a random or majority-class baseline. Each model was trained ten times with its selected set of hyperparameters, after which the mean performance and standard error were determined.
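The majority-class baseline referred to in the caption predicts the single most frequent training label for every residue in the test set. A minimal sketch, assuming labels are given as strings of single-character classes (the function name and encoding are illustrative, not from the paper):

```python
from collections import Counter

def majority_class_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test residue
    and return the resulting percent accuracy. Sketch of the kind of
    majority-class baseline shown as yellow lines in Fig. 1."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(t == majority for t in test_labels)
    return 100.0 * correct / len(test_labels)

majority_class_baseline("CCCHHE", "CHCE")  # majority is 'C' -> 50.0
```

Such a baseline is informative for imbalanced per-residue tasks (e.g. interface prediction, where most residues are non-interface), since a model must beat it to demonstrate it has learned anything beyond label frequencies.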
Figure 2: Performance on downstream tasks monitored during pre-training. Prediction performance on the benchmark test set, as introduced in the section ProteinGLUE benchmark tasks, (a) for a medium-sized model and (b) for a base-sized model, trained without pre-training (pink/grey) and pre-trained for 350 000 (light red/blue), 700 000 (red/blue), and 1 000 000 steps (dark red/blue). Further details as in Fig. 1.
Figure 3: Performance on downstream tasks is generally stable between test and validation sets. Prediction performance on the benchmark validation and test sets for both the converged medium and base models. Further details as in Fig. 1.