Julian Nazet, Elmar Lang, Rainer Merkl.
Abstract
Rational protein design aims at the targeted modification of existing proteins. To reach this goal, software suites like Rosetta propose sequences that introduce the desired properties. Challenging design problems require that a protein be represented by a structural ensemble. For this purpose, Rosetta multi-state design (MSD) protocols have been developed in which each state represents one protein conformation. The computational demands of MSD protocols are high, because for each candidate sequence a costly three-dimensional (3D) model has to be created and assessed for all states. Each of these scores contributes one data point to a complex, design-specific energy landscape. As neural networks (NNs) have proved well suited to learning such solution spaces, we integrated one into the framework Rosetta:MSF, in place of the genetic algorithm used so far, with the aim of reducing computational costs. Like its predecessor, Rosetta:MSF:NN administers a set of candidate sequences and their scores and scans sequence space iteratively. During each iteration, the union of all candidate sequences and their Rosetta scores is used to re-train NNs that possess a design-specific architecture. The enormous speed of the NNs allows an extensive assessment of alternative sequences, which are ranked by the scores predicted by the NN. Costly 3D models are computed only for a small fraction of the best-scoring sequences; these and the corresponding 3D-based scores replace half of the candidate sequences during each iteration. The analysis of two sets of candidate sequences generated for a specific design problem by means of a genetic algorithm confirmed that the NN predicted 3D-based scores quite well; the Pearson correlation coefficient was at least 0.95. Applying Rosetta:MSF:NN:enzdes to a benchmark consisting of 16 ligand-binding problems showed that this protocol converges ten times faster than the genetic algorithm and finds sequences with comparable scores.
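The iteration scheme described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the Rosetta:MSF:NN implementation: `expensive_score` is a made-up stand-in for a costly 3D-based score (lower is better), the surrogate is a crude additive per-position model rather than a neural network, and the names `fit_surrogate`, `mutate` and `surrogate_guided_search` are hypothetical.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def expensive_score(seq, target):
    # Toy stand-in for a costly 3D-based score (lower is better):
    # the number of positions that differ from a hidden target.
    return sum(a != b for a, b in zip(seq, target))

def fit_surrogate(scored):
    # Crude additive surrogate (NOT a neural network): the mean observed
    # score for each (position, residue) pair seen so far.
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, score in scored.items():
        for pos, aa in enumerate(seq):
            sums[pos, aa] += score
            counts[pos, aa] += 1
    overall = sum(scored.values()) / len(scored)

    def predict(seq):
        terms = [
            sums[pos, aa] / counts[pos, aa] if counts[pos, aa] else overall
            for pos, aa in enumerate(seq)
        ]
        return sum(terms) / len(terms)

    return predict

def mutate(seq, rng):
    # Replace one randomly chosen position with a random residue.
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]

def surrogate_guided_search(target, pool_size=20, n_iter=10,
                            n_proposals=500, seed=0):
    rng = random.Random(seed)
    pool = {}
    while len(pool) < pool_size:
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in target)
        pool[seq] = expensive_score(seq, target)
    archive = dict(pool)              # every sequence scored so far
    history = [min(pool.values())]    # best pool score per iteration
    for _ in range(n_iter):
        predict = fit_surrogate(archive)   # re-train on all scored data
        parents = sorted(pool)
        proposals = {mutate(rng.choice(parents), rng)
                     for _ in range(n_proposals)}
        unseen = sorted(p for p in proposals if p not in archive)
        ranked = sorted(unseen, key=predict)   # cheap ranking by surrogate
        # Costly scoring only for the best-ranked fraction.
        new = {s: expensive_score(s, target)
               for s in ranked[: pool_size // 2]}
        archive.update(new)
        # Keep the best of the old pool and fill up with the new sequences.
        survivors = sorted(pool, key=pool.get)[: pool_size - len(new)]
        pool = {s: pool[s] for s in survivors}
        pool.update(new)
        history.append(min(pool.values()))
    return history

history = surrogate_guided_search("ACDEFGHIKLMN")
```

Because the best half of the pool is always retained in this sketch, the best score in `history` can never get worse from one iteration to the next; how quickly it improves depends entirely on the toy score and surrogate chosen here.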
Year: 2021 PMID: 34437621 PMCID: PMC8389498 DOI: 10.1371/journal.pone.0256691
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Performance comparison of the GA- and NN-based protocols.
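| PDB-ID | n | ss | NAAGA | NAANN | First NN iteration reaching RSGA(OPT100) | First native-like iteration (GA) | First native-like iteration (NN) |
|---|---|---|---|---|---|---|---|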
| 1fzq | 20 | 29 | 0.59 | 0.96 | 8 | 32 | 4 |
| 1hsl | 19 | 42 | 0.73 | 0.96 | 9 | 16 | 3 |
| 1j6z | 27 | 45 | 0.75 | 0.95 | 9 | 31 | 5 |
| 1n4h | 25 | 51 | 0.62 | 0.95 | 7 | 19 | 3 |
| 1nq7 | 28 | 57 | 0.62 | 0.95 | 9 | 24 | 4 |
| 1opb | 22 | 50 | 0.57 | 0.93 | 10 | 21 | 4 |
| 1pot | 19 | 41 | 0.67 | 0.94 | 7 | 4 | 2 |
| 1urg | 19 | 40 | 0.56 | 0.96 | 7 | 24 | 3 |
| 2b3b | 17 | 46 | 0.68 | 0.97 | 6 | 25 | 3 |
| 2dri | 19 | 42 | 0.67 | 0.97 | 6 | 29 | 4 |
| 2ifb | 22 | 54 | 0.63 | 0.95 | 8 | 18 | 3 |
| 2q2y | 23 | 34 | 0.67 | 0.96 | 8 | 47 | 5 |
| 2qo4 | 22 | 40 | 0.58 | 0.96 | 7 | 25 | 4 |
| 2rct | 22 | 51 | 0.57 | 0.94 | 7 | 17 | 4 |
| 2rde | 20 | 35 | 0.61 | 0.93 | 7 | 19 | 4 |
| 2uyi | 23 | 35 | 0.61 | 0.96 | 7 | 61 | 5 |
All values were determined for each of the k = 16 proteins (indicated by their PDB-ID) from the MD_EnzBench set. n is the number of design shell residues and ss the number of second shell residues. The NAAGA and NAANN values specify the area above the plots of the corresponding RS values for the GA- and the NN-based protocol, respectively. The sixth column gives the number of the first NN iteration whose RSNN(OPT) value reached the RSGA(OPT100) value that served as a reference. The last two columns give, for the GA- and the NN-based protocols, the number of the first iteration that generated a sequence set with native-like Rosetta energies.
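As a worked summary of the table above, the per-protein values can be averaged with a few lines of Python. The numbers are copied from the table; the tuple layout and the name `summarize` are choices made for this sketch only.

```python
from statistics import mean

# (PDB-ID, NAA GA, NAA NN, first NN iteration reaching the GA reference,
#  first native-like iteration GA, first native-like iteration NN)
ROWS = [
    ("1fzq", 0.59, 0.96, 8, 32, 4),
    ("1hsl", 0.73, 0.96, 9, 16, 3),
    ("1j6z", 0.75, 0.95, 9, 31, 5),
    ("1n4h", 0.62, 0.95, 7, 19, 3),
    ("1nq7", 0.62, 0.95, 9, 24, 4),
    ("1opb", 0.57, 0.93, 10, 21, 4),
    ("1pot", 0.67, 0.94, 7, 4, 2),
    ("1urg", 0.56, 0.96, 7, 24, 3),
    ("2b3b", 0.68, 0.97, 6, 25, 3),
    ("2dri", 0.67, 0.97, 6, 29, 4),
    ("2ifb", 0.63, 0.95, 8, 18, 3),
    ("2q2y", 0.67, 0.96, 8, 47, 5),
    ("2qo4", 0.58, 0.96, 7, 25, 4),
    ("2rct", 0.57, 0.94, 7, 17, 4),
    ("2rde", 0.61, 0.93, 7, 19, 4),
    ("2uyi", 0.61, 0.96, 7, 61, 5),
]

def summarize(rows):
    # Column-wise means over all proteins in the benchmark.
    return {
        "mean_naa_ga": mean(r[1] for r in rows),
        "mean_naa_nn": mean(r[2] for r in rows),
        "mean_first_nn_reaching_ref": mean(r[3] for r in rows),
        "mean_first_native_ga": mean(r[4] for r in rows),
        "mean_first_native_nn": mean(r[5] for r in rows),
    }

summary = summarize(ROWS)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

On these 16 rows, native-like sequence sets appear on average at iteration 25.75 for the GA-based protocol and at iteration 3.75 for the NN-based protocol.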
Performance of FastDesign and a protocol based on four iterations of NN training.
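| Sequence set | 2dri RSFD | 2dri frequency distance | 2ifb RSFD | 2ifb frequency distance | 1opb RSFD | 1opb frequency distance | 2rct RSFD | 2rct frequency distance |
|---|---|---|---|---|---|---|---|---|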
| FD1000 | -929.3 | - | -407.6 | - | -438.0 | - | -459.0 | - |
| FDbest250 | -934.0 | - | -412.7 | - | -442.7 | - | -465.3 | - |
| NN(FD1000) | -880.0 | 0.50 | -386.0 | 0.49 | -391.1 | 0.46 | -435.4 | 0.51 |
| Initial training set | -929.8 | 0.03 | -407.7 | 0.04 | -438.0 | 0.04 | -459.1 | 0.05 |
| FDNN_1 | -889.4 | 0.51 | -370.4 | 0.43 | -404.1 | 0.48 | -413.7 | 0.52 |
| FDNN_2 | -897.8 | 0.38 | -382.4 | 0.37 | -385.8 | 0.55 | -432.6 | 0.34 |
| FDNN_3 | -924.8 | 0.38 | -400.0 | 0.43 | -437.5 | 0.38 | -448.5 | 0.39 |
| FDNN_4 | -933.2 | 0.50 | -410.8 | 0.56 | -442.2 | 0.50 | -457.7 | 0.41 |
The RSFD columns list mean FastDesign scores, and the column following each RSFD column lists frequency distances, for designs with the four proteins from repr_prot. Rows FD1000 and FDbest250 give the values for all 1000 and for the best 250 FastDesign sequences, respectively. The NN(FD1000) values resulted from an NN training with the FD1000 sequences. The row between NN(FD1000) and FDNN_1 represents the initial training set of an iterative training that generated the outputs FDNN_1 to FDNN_4.
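To make the trend across the four training iterations easier to read, the following sketch computes how far each FDNN_1 to FDNN_4 mean score is from the FDbest250 mean score of the same protein. The scores are copied from the table; the helper name `gaps_to_reference` and the rounding to one decimal are choices made for this illustration.

```python
# Mean scores copied from the table (lower is better).
FD_BEST250 = {"2dri": -934.0, "2ifb": -412.7, "1opb": -442.7, "2rct": -465.3}
FDNN = {
    "2dri": [-889.4, -897.8, -924.8, -933.2],
    "2ifb": [-370.4, -382.4, -400.0, -410.8],
    "1opb": [-404.1, -385.8, -437.5, -442.2],
    "2rct": [-413.7, -432.6, -448.5, -457.7],
}

def gaps_to_reference(scores, reference):
    # Positive gap: the NN-guided set scores worse than the reference.
    return [round(score - reference, 1) for score in scores]

gaps = {pdb: gaps_to_reference(FDNN[pdb], FD_BEST250[pdb]) for pdb in FDNN}
for pdb, values in gaps.items():
    print(pdb, values)
```

After the fourth iteration the gap is below 2 score units for 2dri, 2ifb and 1opb and 7.6 units for 2rct; for 1opb the second iteration is worse than the first, so the improvement is not strictly monotonic.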
Comparison of sequence heterogeneity in candidate sequences added to OPT during r = 100 iterations.
| Mutations | 2dri GA | 2dri NN | 2ifb GA | 2ifb NN | 1opb GA | 1opb NN | 2rct GA | 2rct NN |
|---|---|---|---|---|---|---|---|---|
| 1 | 11551 | 10606 | 11551 | 9562 | 11551 | 9019 | 11551 | 9226 |
| 2 | | 708 | | 1527 | | 1738 | | 1456 |
| 3 | | 123 | | 259 | | 297 | | 327 |
| 4 | | 71 | | 145 | | 98 | | 134 |
| 5 | | 38 | | 65 | | 84 | | 99 |
| 6 | | 44 | | 37 | | 73 | | 110 |
| 7 | | 32 | | 29 | | 99 | | 109 |
| 8 | | 31 | | 20 | | 126 | | 62 |
| 9 | | 28 | | 19 | | 57 | | 40 |
| 10 | | 11 | | 19 | | 28 | | 24 |
| 11 | | 14 | | 6 | | 17 | | 31 |
| 12 | | 3 | | 7 | | 22 | | 16 |
| 13 | | 2 | | 5 | | 10 | | 23 |
| 14 | | 4 | | 3 | | 17 | | 26 |
| 15 | | 1 | | 1 | | 21 | | 14 |
| 16 | | | | 3 | | 11 | | 12 |
| 17 | | | | 2 | | 11 | | 8 |
| 18 | | | | | | 3 | | 2 |
The table lists the numbers of candidate sequences, grouped according to their difference (number of mutations) from the most similar sequence in OPT. For this analysis, r = 100 iterations of the designs for the four proteins from repr_prot were analyzed.
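The grouping described in this caption can be sketched as follows: for each newly added sequence, count the mismatches to every sequence in OPT, keep the smallest count, and tally the results. The function names and the short example sequences are hypothetical and serve only to illustrate the counting; sequences are assumed to have equal length.

```python
from collections import Counter

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def mutation_histogram(new_sequences, opt):
    # For each new sequence, take the distance to its most similar
    # sequence in OPT, then count how often each distance occurs.
    return Counter(
        min(hamming(seq, ref) for ref in opt) for seq in new_sequences
    )

# Hypothetical toy data, not taken from the designs in the table.
opt = ["ACDE", "ACDF"]
added = ["ACDG", "AGGG", "TCDE"]
histogram = mutation_histogram(added, opt)
print(sorted(histogram.items()))
```

Each key of the resulting counter corresponds to one row of the table (number of mutations) and each value to the number of sequences in that row.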
Comparison of sequence heterogeneity in candidate sequences added to OPT during the first 10 iterations.
| Mutations | 2dri GA | 2dri NN | 2ifb GA | 2ifb NN | 1opb GA | 1opb NN | 2rct GA | 2rct NN |
|---|---|---|---|---|---|---|---|---|
| 1 | 1067 | 237 | 1066 | 160 | 1058 | 77 | 1059 | 116 |
| 2 | | 442 | | 472 | | 408 | | 373 |
| 3 | | 113 | | 149 | | 162 | | 140 |
| 4 | | 71 | | 83 | | 72 | | 65 |
| 5 | | 38 | | 56 | | 55 | | 44 |
| 6 | | 44 | | 37 | | 47 | | 42 |
| 7 | | 32 | | 29 | | 42 | | 55 |
| 8 | | 31 | | 20 | | 36 | | 41 |
| 9 | | 28 | | 19 | | 32 | | 39 |
| 10 | | 11 | | 19 | | 28 | | 24 |
| 11 | | 14 | | 6 | | 17 | | 31 |
| 12 | | 3 | | 7 | | 22 | | 16 |
| 13 | | 2 | | 5 | | 10 | | 23 |
| 14 | | 4 | | 3 | | 17 | | 26 |
| 15 | | 1 | | 1 | | 21 | | 14 |
| 16 | | | | 3 | | 11 | | 12 |
| 17 | | | | 2 | | 11 | | 8 |
| 18 | | | | | | 3 | | 2 |
The table lists the numbers of candidate sequences, grouped according to their difference (number of mutations) from the most similar sequence in OPT. For this analysis, the first 10 iterations of the designs for the four proteins from repr_prot were analyzed.
Generation-specific PCC values determined for the first ten iterations of four designs.
| Generation | 2dri | 2ifb | 1opb | 2rct |
|---|---|---|---|---|
| 1 | 0.53 | 0.54 | 0.48 | 0.42 |
| 2 | 0.84 | 0.87 | 0.70 | 0.69 |
| 3 | 0.90 | 0.92 | 0.78 | 0.74 |
| 4 | 0.95 | 0.94 | 0.82 | 0.82 |
| 5 | 0.97 | 0.96 | 0.88 | 0.89 |
| 6 | 0.98 | 0.97 | 0.91 | 0.92 |
| 7 | 0.98 | 0.97 | 0.92 | 0.94 |
| 8 | 0.98 | 0.98 | 0.93 | 0.94 |
| 9 | 0.98 | 0.98 | 0.95 | 0.96 |
| 10 | 0.98 | 0.98 | 0.95 | 0.96 |
The table lists Pearson correlation coefficient (PCC) values for the first 10 iterations of the designs with the proteins from repr_prot and Rosetta:MSF:NN:enzdes.
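For reference, the Pearson correlation coefficient between NN-predicted and 3D-based scores can be computed as in the minimal sketch below. It uses the standard textbook formula; the function name `pearson` and the example numbers are hypothetical and are not values from the designs.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient of two equally long value lists.
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores: predicted by a surrogate vs. derived from 3D models.
predicted = [-410.0, -405.5, -398.0, -390.2]
observed = [-412.1, -404.0, -399.5, -388.7]
pcc = pearson(predicted, observed)
print(f"PCC = {pcc:.3f}")
```

A value close to 1 means that the predicted scores rank and scale the sequences almost like the 3D-based scores, which is the sense in which the PCC values in the table increase over the generations.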