Nils Strodthoff, Patrick Wagner, Markus Wenzel, Wojciech Samek.
Abstract
MOTIVATION: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific scoring matrices from expensive database searches. We argue that this level of performance can be reached or even surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step.
Year: 2020 PMID: 31913448 PMCID: PMC7178389 DOI: 10.1093/bioinformatics/btaa003
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic illustration of the training procedure, here for the amino acid sequence MSLR…RI. A dedicated token marks the beginning of the sequence. The red arrows show the context available to the forward language model when predicting the next character (S) from an input of length 2, i.e. the start token followed by M. For fine-tuning on the downstream classification tasks, all embedding weights and LSTM weights are initialized from the same set of weights obtained during language model pre-training. This contrasts with the use of pre-trained embeddings, where only the embedding weights are initialized in a structured way before the downstream fine-tuning step.
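The next-character objective described in the caption can be illustrated with a short sketch. It only builds the (context, target) pairs a language model would be trained on; it is not the authors' training code. The token name `<BOS>` and the helper `next_token_pairs` are assumptions made for illustration, as is the choice to model the backward direction as the same objective applied to the reversed sequence.

```python
BOS = "<BOS>"  # assumed name for the beginning-of-sequence token


def next_token_pairs(sequence, backward=False):
    """Return (context, target) pairs for next-character language modeling.

    Each context is the list of tokens seen so far, starting with BOS;
    the target is the character that follows it.
    """
    residues = list(sequence)
    if backward:
        # Assumption: a backward model reads the sequence right to left.
        residues.reverse()
    tokens = [BOS] + residues
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# First four residues of the example sequence from the caption.
pairs = next_token_pairs("MSLR")
for context, target in pairs:
    print(context, "->", target)
```

The second pair has a context of length 2 (the start token followed by M) and the target S, which corresponds to the situation highlighted by the red arrows.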
EC classification accuracy on the custom EC40 and EC50 datasets
| | Method | EC40, level 0 | EC40, level 1 | EC40, level 2 | EC50, level 0 | EC50, level 1 | EC50, level 2 |
|---|---|---|---|---|---|---|---|
| Baseline | Seq; non-red. | 0.83 | 0.38 | 0.25 | 0.88 | 0.71 | 0.70 |
| Baseline | Seq | 0.84 | 0.61 | 0.47 | 0.92 | 0.80 | 0.79 |
| Baseline | Seq+PSSM; non-red.; clean | 0.91 | 0.84 | 0.72 | 0.95 | 0.94 | 0.91 |
| Baseline | Seq+PSSM; non-red.; leak. | | 0.85 | 0.71 | 0.95 | 0.95 | 0.92 |
| | Fwd; pretr.; non-red. | 0.82 | 0.79 | 0.71 | 0.93 | 0.94 | 0.92 |
| | Fwd; from scratch | 0.87 | 0.79 | 0.74 | 0.94 | 0.94 | 0.92 |
| | Fwd; pretr. | 0.89 | 0.84 | 0.83 | 0.95 | 0.96 | 0.94 |
| | Bwd; pretr. | 0.90 | 0.85 | 0.81 | 0.95 | 0.96 | 0.94 |
| | Fwd+bwd; pretr. | 0.91 | | | | | |
Note: The best-performing classifiers are marked in bold face.
Fwd/bwd, training in forward/backward direction; seq, raw sequence as input; non-red, training on non-redundant sequences, i.e. representatives only; pretr., using language model pre-training; leak., leakage PSSM features computed on the full dataset.
EC classification accuracy on the published DEEPre and ECPred datasets, compared with literature results from DEEPre (Li et al.) and ECPred (Dalkiran et al.), disregarding models that rely on Pfam features
| | Method | DEEPre (acc.), level 0 | DEEPre (acc.), level 1 | DEEPre (acc.), level 2 | ECPred (mean F1), level 0 | ECPred (mean F1), level 1 |
|---|---|---|---|---|---|---|
| | | — | — | — | 0.96 | |
| | | 0.88 | 0.82 | 0.43 | — | — |
| Baseline | | | | 0.59 | 0.97 | 0.94 |
| | Fwd; pretr. | 0.86 | 0.81 | 0.75 | 0.95 | 0.93 |
| | Bwd; pretr. | 0.86 | 0.83 | 0.73 | 0.97 | 0.93 |
| | Fwd+bwd; pretr. | 0.87 | | | 0.97 | 0.94 |
| | Fwd; pretr.; red. | — | — | — | 0.97 | 0.95 |
| | Bwd; pretr.; red. | — | — | — | 0.97 | 0.95 |
| | Fwd+bwd; pretr.; red. | — | — | — | | 0.95 |
Note: Results on the DEEPre dataset were evaluated using 5-fold cross-validation.
Fwd/bwd, training in forward/backward direction; seq, raw sequence as input; pretr., using language model pre-training.
Results established in this work.
Fig. 2. Dependence of the EC classification accuracy (Level 1; EC50 dataset) on the size of the training dataset. UDSMProt outperforms the baseline model even in the small-dataset regime, which is particularly important for practical applications.
GO prediction performance on a dataset based on a time-based split as in (Kulmanov and Hoehndorf, 2019; You et al.), in comparison to literature results collected by DeepGOPlus (Kulmanov and Hoehndorf, 2019)
| | Methods | Fmax, MFO | Fmax, BPO | Fmax, CCO | Smin, MFO | Smin, BPO | Smin, CCO | AUPR, MFO | AUPR, BPO | AUPR, CCO |
|---|---|---|---|---|---|---|---|---|---|---|
| Single | | 0.306 | 0.318 | 0.605 | 12.105 | 38.890 | 9.646 | 0.150 | 0.219 | 0.512 |
| Single | | | | 0.621 | | | 7.997 | 0.362 | 0.240 | 0.363 |
| Single | | 0.449 | 0.398 | 0.667 | 10.722 | 35.085 | | 0.409 | 0.328 | 0.696 |
| Single | | 0.409 | 0.383 | 0.663 | 11.296 | 36.451 | 8.642 | 0.350 | 0.316 | 0.688 |
| Ensemble | | | 0.441 | 0.694 | 5.240 | 17.713 | | | 0.336 | |
| Ensemble | | 0.580 | 0.370 | 0.687 | | | 5.518 | 0.546 | 0.225 | 0.700 |
| Ensemble | | 0.585 | 0.474 | | 8.824 | 33.576 | 7.693 | 0.536 | 0.407 | 0.726 |
| | Fwd; from scratch | 0.418 | 0.303 | 0.655 | 14.906 | 47.208 | 12.929 | 0.304 | 0.284 | 0.612 |
| | Fwd; pretr. | 0.465 | 0.404 | 0.683 | 10.578 | 36.667 | 8.210 | 0.406 | 0.345 | 0.695 |
| | Bwd; pretr. | 0.465 | 0.403 | 0.664 | 10.802 | 36.361 | 8.210 | 0.414 | 0.348 | 0.685 |
| | Fwd+bwd; pretr. | 0.481 | 0.411 | | 10.505 | 36.147 | 8.244 | | | |
| | Fwd+bwd; pretr. + | 0.582 | | 0.697 | 8.787 | 33.615 | 7.618 | 0.548 | | 0.728 |
Note: Best overall results (highest Fmax and AUPR; lowest Smin) are marked in bold face and best single-model results are underlined.
Fwd/bwd, training in forward/backward direction; pretr., using language model pre-training.
Results established in this work.
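The threshold-based F-measure commonly reported for GO prediction (Fmax in CAFA-style evaluations) can be sketched as follows. This is a minimal sketch of the usual protein-centric definition, not the evaluation code behind the table above. The function name `f_max`, the toy proteins and the term labels are hypothetical, and the propagation of annotations up the ontology that a real evaluation performs is omitted.

```python
def f_max(predictions, annotations, thresholds=None):
    """Protein-centric maximum F-measure over score thresholds.

    predictions: dict protein -> dict term -> score in [0, 1]
    annotations: dict protein -> non-empty set of true terms
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, truth in annotations.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            # Recall is averaged over all annotated proteins.
            recalls.append(hits / len(truth))
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical toy data: two proteins, three placeholder terms.
predictions = {
    "P1": {"term_a": 0.9, "term_b": 0.4},
    "P2": {"term_a": 0.6, "term_c": 0.8},
}
annotations = {"P1": {"term_a", "term_b"}, "P2": {"term_c"}}
score = f_max(predictions, annotations)
print(round(score, 3))
```

Because precision and recall are averaged over different sets of proteins, a model that predicts confidently for only a few proteins can reach high precision while still being penalized through recall.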
Remote homology and fold detection performance on the SCOP 1.67 benchmark dataset, compared with literature results from GPkernel (Håndstad et al.), LSTM_protein (Hochreiter et al.) and ProDec-BLSTM (Li et al.)
| Methods | Superfamily level | | Fold level | |
|---|---|---|---|---|
| | 0.902 | 0.591 | 0.844 | 0.514 |
| | 0.942 | 0.773 | 0.821 | 0.571 |
| | 0.969 | 0.849 | — | — |
| Fwd; from scratch | 0.706 | 0.552 | 0.734 | 0.653 |
| Fwd; pretr. | 0.957 | 0.880 | 0.834 | 0.734 |
| Bwd; pretr. | 0.969 | 0.912 | 0.839 | 0.757 |
| Fwd+bwd; pretr. | | | | |
Fwd/bwd, training in forward/backward direction; pretr., using language model pre-training. The best-performing classifiers are marked in bold face.
Results established in this work.
Fig. 3. Attribution map for the class EC3 for UniProt accession Q60452 based on integrated gradients. The heatmap shows high relevance on the ‘DEAH’ box (positions 234–237).
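The integrated gradients method named in the caption can be illustrated on a toy model. The sketch below is not the attribution pipeline used for the figure: the scorer `model`, its weights and the all-zero baseline are assumptions chosen so that the gradient is available in closed form. It approximates the path integral with a midpoint Riemann sum and checks that the attributions add up to the difference between the model output at the input and at the baseline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy differentiable scorer: sigmoid of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def gradient(x, w):
    """Analytic gradient of model(x, w) with respect to x."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            totals[i] += g
    # Scale the averaged gradient by the input difference per feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


w = [2.0, -1.0, 0.5]         # hypothetical weights
x = [1.0, 1.0, 0.0]          # input to explain
baseline = [0.0, 0.0, 0.0]   # assumed all-zero reference input
attributions = integrated_gradients(x, baseline, w)
gap = model(x, w) - model(baseline, w)
print(attributions, sum(attributions), gap)
```

For a sequence model, the same procedure would typically be applied to the embedded input, with the per-dimension attributions summed for each position to obtain one relevance value per residue.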