Tim Kucera, Matteo Togninalli, Laetitia Meng-Papaxanthos.
Abstract
MOTIVATION: Protein design has become increasingly important for medical and biotechnological applications. Because of the complex mechanisms underlying protein formation, the creation of a novel protein requires tedious and time-consuming computational or experimental protocols. At the same time, machine learning has enabled the solving of complex problems by leveraging large amounts of available data, more recently with great improvements in the domain of generative modeling. Yet, generative models have mainly been applied to specific sub-problems of protein design.
Year: 2022 PMID: 35639661 PMCID: PMC9237736 DOI: 10.1093/bioinformatics/btac353
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Architecture of ProteoGAN after extensive hyperparameter optimization. We varied the number of layers, conditioning mechanism(s), number of projections, convolutional filters and label embeddings, among others.
Classification results of SVMs trained on different embeddings, for the structural classes of the CATH database (balanced accuracy), and for 50 functional classes of the GO ( score)
| Embedding | C | A | T | H | GO |
|---|---|---|---|---|---|
| Spectrum | 65 ± 3 | 54 ± 4 | 57 ± 2 | 47 ± 3 | 58 ± 1 |
| ProFET | 53 ± 2 | 38 ± 4 | 48 ± 2 | 28 ± 3 | 52 ± 1 |
| UniRep | 72 ± 4 | 73 ± 6 | 68 ± 2 | 58 ± 3 | 71 ± 1 |
| ESM | 77 ± 3 | 91 ± 1 | 86 ± 1 | 79 ± 2 | 80 ± 1 |
Note: All values in percent.
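As a reading aid for the CATH columns, balanced accuracy is the mean of the per-class recalls, so a rare structural class counts as much as a common one. The following is a minimal sketch in plain Python; the function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example (invented labels): a classifier that always predicts "alpha".
y_true = ["alpha"] * 8 + ["beta"] * 2
y_pred = ["alpha"] * 10
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the plain accuracy is 0.8 while the balanced accuracy is 0.5, which is why the balanced variant is the more informative score on imbalanced class distributions.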
Evaluation of ProteoGAN and various baselines with MMD, MRR and diversity metrics based on the Spectrum kernel embedding (the results of other embeddings can be found in Supplementary Tables S6–S8)
| Model | MMD↓ | Gauss. MMD↓ | MRR↑ | Gauss. MRR↑ | ΔEntropy | ΔDistance |
|---|---|---|---|---|---|---|
| Positive Control | 0.011 ± 0.000 | 0.010 ± 0.000 | 0.893 ± 0.016 | 0.966 ± 0.018 | 0.002 ± 0.006 | −0.000 ± 0.001 |
| Negative Control | 1.016 ± 0.000 | 0.935 ± 0.000 | 0.090 ± 0.000 | 0.099 ± 0.001 | 0.728 ± 0.006 | 1.843 ± 0.001 |
| ProteoGAN | | | | | | |
| Predictorguided | | | 0.114 ± 0.007 | 0.136 ± 0.016 | | |
| Non-Hierarchical | 0.337 ± 0.118 | 0.242 ± 0.096 | 0.306 ± 0.034 | 0.406 ± 0.039 | −0.352 ± 0.178 | 0.290 ± 0.171 |
| ProGen | 0.048 | 0.030 | | | | 0.037 |
| CVAE | 0.232 ± 0.078 | 0.148 ± 0.058 | 0.301 ± 0.053 | 0.424 ± 0.083 | 0.247 ± 0.027 | 0.145 ± 0.085 |
| OpC-ngram | 0.056 ± 0.001 | 0.034 ± 0.001 | | 0.505 ± 0.034 | 0.208 ± 0.006 | −0.050 ± 0.002 |
| OpC-HMM | 0.170 ± 0.003 | 0.108 ± 0.002 | 0.095 ± 0.001 | 0.143 ± 0.002 | −0.579 ± 0.014 | 0.199 ± 0.004 |
| OpL-GAN | | | | | | |
| OpL-ngram | | | | | | |
| OpL-HMM | 0.195 ± 0.002 | 0.126 ± 0.002 | 0.100 ± 0.003 | 0.147 ± 0.002 | −0.654 ± 0.015 | 0.244 ± 0.004 |
| ProteoGAN (100 labels) | 0.036 | 0.024 | 0.585 | 0.736 | −0.026 | 0.019 |
| ProteoGAN (200 labels) | 0.162 | 0.112 | 0.374 | 0.524 | 0.104 | 0.051 |
Note: An arrow indicates that lower (↓) or higher (↑) is better. The positive control is a sample of real sequences and simulates a perfect model; the negative control is a sample that simulates the worst possible model for each metric (constant sequence for MMD, randomized labels for MRR, repeated sequences for diversity measures). Best results in bold, second best underlined. Values are means and standard deviations over five data splits. Due to the computational effort, OpL-GAN and ProGen were only trained on one split.
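To make the MMD column more concrete, the sketch below computes a maximum mean discrepancy between real and generated sequences under a Spectrum (k-mer count) embedding. With a linear kernel, the biased MMD estimate reduces to the Euclidean distance between the two mean embeddings. The choice of k = 3, the unit-length normalization, the function names and the toy sequences are assumptions made for illustration; the paper's exact settings may differ.

```python
from collections import Counter
from math import sqrt


def spectrum_embedding(seq, k=3):
    """Unit-length k-mer count vector of a sequence, stored as a sparse dict."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = sqrt(sum(c * c for c in counts.values())) or 1.0
    return {kmer: c / norm for kmer, c in counts.items()}


def mean_embedding(seqs, k=3):
    """Average the embeddings of a set of sequences."""
    mean = Counter()
    for seq in seqs:
        for kmer, value in spectrum_embedding(seq, k).items():
            mean[kmer] += value / len(seqs)
    return mean


def mmd(real, generated, k=3):
    """Linear-kernel MMD: distance between the two mean embeddings."""
    mu_r, mu_g = mean_embedding(real, k), mean_embedding(generated, k)
    kmers = set(mu_r) | set(mu_g)
    return sqrt(sum((mu_r.get(x, 0.0) - mu_g.get(x, 0.0)) ** 2 for x in kmers))


# Toy sequences (invented): a "perfect" sample and a constant-sequence sample.
real = ["MKVLAAGIV", "MKVLSAGIV"]
constant = ["GGGGGGGGG", "GGGGGGGGG"]
mmd_perfect = mmd(real, real)
mmd_constant = mmd(real, constant)
```

In this toy setting, comparing the real sample with itself gives an MMD of zero, while the constant-sequence sample gives a value above one, which mirrors the role of the positive and negative controls in the table.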
Fig. 2. Mean rank of each individual label in Spectrum MRR over five data splits. The structure represents the relations of the 50 labels of interest in the GO DAG. Lower rank is better. 27 of 50 labels were on average ranked first or second. The worst targeted label is ‘kinase activity’.
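The per-label ranks shown in this figure can be turned into a mean reciprocal rank (MRR) as sketched below. The sketch assumes that, for each target label, the sequences generated for that label are compared with the real sequences of every label by some distance (for instance an MMD), and that the target label's rank is its position when labels are sorted by increasing distance. The distance values, the labels other than ‘kinase activity’ and the function names are invented, and the sketch ignores any special handling of related labels in the GO hierarchy.

```python
def target_ranks(distance):
    """distance[target][label]: distance between sequences generated for
    `target` and real sequences annotated with `label` (lower is closer)."""
    ranks = {}
    for target, row in distance.items():
        # Rank 1 means the target label's real sequences are the closest.
        ranks[target] = 1 + sum(d < row[target] for d in row.values())
    return ranks


def mean_reciprocal_rank(distance):
    """Average of 1/rank over all target labels (1.0 is a perfect score)."""
    ranks = target_ranks(distance)
    return sum(1.0 / r for r in ranks.values()) / len(ranks)


# Toy distances (invented) for three labels.
distance = {
    "kinase activity": {"kinase activity": 0.30, "DNA binding": 0.20, "transport": 0.25},
    "DNA binding": {"kinase activity": 0.40, "DNA binding": 0.10, "transport": 0.30},
    "transport": {"kinase activity": 0.20, "DNA binding": 0.30, "transport": 0.25},
}
ranks = target_ranks(distance)
mrr = mean_reciprocal_rank(distance)
```

With these toy numbers the targets are ranked third, first and second respectively, so the MRR is (1/3 + 1 + 1/2) / 3.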
Fig. 3. Top-10 accuracy (in %) with the Spectrum embedding for OOD-capable models. The boxplots cover the 5 OOD sets, A–E; the bar represents the average. We add a random baseline for comparison, where generated sequences are sampled uniformly at random from the training set. Complementary results for the other embeddings can be found in Supplementary Figure S18.