Yun Xu, Changyu Hu, Yang Dai, Jie Liang.
Abstract
The construction of fitness landscapes has broad implications for understanding molecular evolution, cellular epigenetic states, and protein structures. We studied the problem of constructing the fitness landscape of inverse protein folding, or protein design, with the aim of generating amino acid sequences that fold into an a priori determined structural fold, which would enable the engineering of novel or enhanced biochemistry. For this task, an effective fitness function should allow identification of correct sequences that would fold into the desired structure. In this study, we showed that a nonlinear fitness function for protein design can be constructed using a rectangular kernel with a basis set of proteins and decoys chosen a priori. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3,200 non-protein decoys via a finite Newton method. A blind test of a simplified version of the fitness function for sequence design was carried out to discriminate simultaneously 428 native sequences not homologous to any training proteins from 11 million challenging protein-like decoys. This simplified function correctly classified 408 native sequences (20 misclassifications, 95% correct rate), which outperforms several other statistical linear scoring functions and optimized linear functions. Our results further suggested that for the task of global sequence design of 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with a carefully chosen basis set of just about 3,680 proteins and decoys, and we showed in addition that the overall landscape is not overly sensitive to the specific choice of this set. Our results can be generalized to construct other types of fitness landscapes.
Year: 2014 PMID: 25110986 PMCID: PMC4128808 DOI: 10.1371/journal.pone.0104403
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Decoy generation by gapless threading.
Sequence decoys can be generated by threading the sequence of a larger protein to the structure of an unrelated smaller protein.
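The threading step described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `gapless_threading_decoys` and the toy sequence are hypothetical, and the smaller structure is represented solely by its number of residue positions (`n_sites`).

```python
from typing import List


def gapless_threading_decoys(long_seq: str, n_sites: int) -> List[str]:
    """Return every contiguous window of n_sites residues from long_seq.

    Each window is one sequence decoy that could be mounted, without gaps,
    on a structure that has n_sites residue positions (hypothetical helper).
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    # A sequence shorter than the structure yields no decoys (empty range).
    return [long_seq[i:i + n_sites] for i in range(len(long_seq) - n_sites + 1)]


# Toy example: a 10-residue sequence threaded onto a 7-residue structure.
decoys = gapless_threading_decoys("MKTAYIAKQR", 7)
print(len(decoys))  # 4
print(decoys[0])    # MKTAYIA
```

A sequence of length L threaded onto a structure of length n gives L - n + 1 decoys, which is how a modest number of proteins can produce decoy sets in the millions.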
The number of misclassifications compared with other methods.
| Method | Training set (800 / 36 M) | Training set (440 / 14 M) | Test set (428 / 11 M) | Test set (201 / 3 M) |
| Nonlinear function | 4 / 988 | NA | 20 / 218 | NA |
| Tobi | NA | 192 / 39,583 | NA | 44 / 53,137 |
| Bastolla | NA | 134 / 47,750 | NA | 58 / 29,309 |
| Miyazawa & Jernigan | NA | 173 / 229,549 | NA | 87 / 80,716 |
The number of misclassifications using the simplified nonlinear fitness function, the optimal linear scoring functions as reported in [26], [28], and the Miyazawa-Jernigan statistical potential [34], for both native proteins and decoys (separated by “/”) in the test set and the training set. The simplified nonlinear function is formed using a basis set of 3,680 (480 native + 3,200 decoy) contact vectors derived using Strategy 2.
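A fitness function formed from a basis set of contact vectors can be sketched as a kernel expansion: a weighted sum of kernel values between a query contact vector and each basis vector, plus a bias. The sketch below is an assumption-laden illustration, not the authors' implementation. The Gaussian kernel, the `gamma` value, the weights, the bias, and the two-dimensional toy vectors are all hypothetical, and the training of the weights (done with a finite Newton method in the source) is not shown. The kernel matrix is "rectangular" here in the sense that it has one row per data point and one column per basis vector.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def gaussian_kernel(u: Vector, v: Vector, gamma: float) -> float:
    """Gaussian kernel between two contact vectors (assumed kernel form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def rectangular_kernel_matrix(data: Sequence[Vector], basis: Sequence[Vector],
                              gamma: float) -> List[List[float]]:
    """N x m matrix: rows are data points, columns are basis vectors."""
    return [[gaussian_kernel(x, b, gamma) for b in basis] for x in data]


def fitness(c: Vector, basis: Sequence[Vector], weights: Sequence[float],
            bias: float, gamma: float) -> float:
    """Kernel expansion over the basis set; weights and bias are hypothetical."""
    return bias + sum(w * gaussian_kernel(c, b, gamma)
                      for w, b in zip(weights, basis))


# Toy basis of two vectors; the source uses 3,680 contact vectors.
basis = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, -1.0]
data = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

K = rectangular_kernel_matrix(data, basis, gamma=0.5)
print(len(K), len(K[0]))  # 3 2
print(fitness([0.5, 0.5], basis, weights, bias=0.0, gamma=0.5))  # 0.0
```

Because the number of columns is fixed by the basis set, the cost of evaluating the function on a new sequence depends on the basis size rather than on the millions of training decoys. Which side of zero counts as native is a convention this sketch leaves open.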
Effects of the size of basis set on performance of discrimination using Strategy 1.
| Select decoys rate | Iteration | Training set (800 / 36 M): Native / Decoy | Training set: Score | Test set (428 / 11 M): Native / Decoy | Test set: Score |
| 0% | 4 | 21 / 1,374 | 0.958 | 26 / 387 | 0.931 |
| 2% | 5 | 19 / 1,029 | 0.964 | 27 / 219 | 0.933 |
| 5% | 5 | 17 / 1,303 | 0.963 | 21 / 317 | 0.944 |
| 8% | 5 | 13 / 1,246 | 0.969 | 23 / 274 | 0.941 |
| 10% | 5 | 14 / 922 | 0.972 | 24 / 216 | 0.940 |
| 20% | 6 | 16 / 902 | 0.969 | 28 / 250 | 0.930 |
| 30% | 6 | 10 / 1,037 | 0.975 | 29 / 304 | 0.926 |
| 40% | 10 | 16 / 812 | 0.970 | 27 / 199 | 0.933 |
| 50% | 10 | 13 / 1,112 | 0.971 | 25 / 269 | 0.936 |
| 60% | 12 | 15 / 802 | 0.972 | 27 / 237 | 0.932 |
| 70% | 9 | 13 / 947 | 0.973 | 24 / 256 | 0.939 |
| 80% | 8 | 11 / 1,078 | 0.973 | 28 / 278 | 0.929 |
| 90% | 9 | 12 / 690 | 0.977 | 27 / 170 | 0.934 |
| 100% | 5 | 5 / 2,681 | 0.962 | 24 / 609 | 0.931 |
The numbers of misclassifications of native proteins and decoys (separated by “/”) in both the training set and the test set are listed, with the native protein selection rate fixed at 60%. Misclassifications as well as the scores in the two tests using different numbers of native proteins and decoys are listed (see text for details).
Effect of the size of the pre-selected dataset using Strategy 2.
| Pre-selected native proteins (top) | Pre-selected decoys (top) | Iteration | Training set (800 / 36 M): Native / Decoy | Training set: Score | Test set (428 / 11 M): Native / Decoy | Test set: Score |
| 0% | 1 | 6 | 8 / 1,010 | 0.978 | 25 / 212 | 0.938 |
| 2% | 1 | 5 | 5 / 1,079 | 0.981 | 24 / 266 | 0.939 |
| 5% | 1 | 5 | 5 / 1,038 | 0.981 | 24 / 247 | 0.939 |
| 8% | 1 | 5 | 5 / 1,093 | 0.981 | 24 / 249 | 0.939 |
| 10% | 1 | 5 | 5 / 997 | 0.982 | 24 / 242 | 0.939 |
| 20% | 1 | 6 | 9 / 625 | 0.981 | 26 / 174 | 0.936 |
| 30% | 1 | 6 | 9 / 689 | 0.980 | 24 / 211 | 0.940 |
| 40% | 1 | 6 | 8 / 869 | 0.980 | 25 / 218 | 0.937 |
| 50% | 1 | 5 | 4 / 988 | 0.983 | 20 / 218 | 0.949 |
| 60% | 1 | 5 | 6 / 1,039 | 0.980 | 24 / 280 | 0.938 |
| 10% | 1 | 5 | 5 / 997 | 0.982 | 24 / 242 | 0.939 |
| 10% | 2 | 5 | 6 / 1,270 | 0.977 | 22 / 372 | 0.941 |
| 10% | 3 | 7 | 9 / 934 | 0.978 | 22 / 247 | 0.944 |
| 10% | 4 | 5 | 5 / 1,071 | 0.981 | 24 / 210 | 0.944 |
Test results using Strategy 2 with different sizes of the pre-selected native protein set, ranging from 0% to 60% while the pre-selected decoys are fixed at the top 1 level, and with different pre-selected decoy levels, ranging from the top 1 to the top 4 while the pre-selected native proteins are fixed at 10%. Misclassifications as well as the scores in the two tests using different numbers of native proteins and decoys are listed (see text for details).
Figure 2. Discriminating a different decoy set using the nonlinear fitness function.
Sequence decoys in this set are generated by swapping residues at different positions. (A) The length distribution of the 1,227 native proteins in the set. (B) The relationship between the number of swaps and the percentage of misclassified decoys, grouped by protein length (binned with a width of 50 residues) and shown as different curves. (C) The relationship between the sequence identity (binned with a width of 0.1) and the percentage of misclassification, grouped by protein length and shown as different curves. The fitness function was derived using Strategy 2, with the top 50% pre-selected native proteins and the top 1 pre-selected decoys. (D) Misclassified sequence decoys have overall lower DFIRE energy values than correctly classified sequence decoys and are therefore more native-like. The x-axis is the net DFIRE energy difference between decoys and native proteins, and the y-axis is the number of decoys at each net DFIRE energy difference. The solid black line represents decoys misclassified by our fitness function and the dashed red line represents decoys correctly classified by our fitness function.
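Swap-based decoys and the sequence identity used to group them in panel (C) can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the choice of uniformly random position pairs, and the toy sequence are hypothetical, since the caption does not say how swap positions were chosen.

```python
import random
from typing import List


def swap_decoy(seq: str, n_swaps: int, rng: random.Random) -> str:
    """Return a decoy made by swapping residues at n_swaps random position pairs."""
    if len(seq) < 2:
        raise ValueError("need at least two residues to swap")
    residues: List[str] = list(seq)
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their residues.
        i, j = rng.sample(range(len(residues)), 2)
        residues[i], residues[j] = residues[j], residues[i]
    return "".join(residues)


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues in two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


native = "MKTAYIAKQRQISFVKSHFSRQ"  # toy sequence, not from the data set
decoy = swap_decoy(native, n_swaps=5, rng=random.Random(0))
print(sorted(decoy) == sorted(native))  # True: composition is preserved
print(sequence_identity(native, decoy))
```

Swapping preserves amino acid composition, so decoys of this kind differ from the native sequence only in residue order. More swaps generally lower the identity to the native sequence, although a swap of two identical residues, or a swap that undoes an earlier one, leaves it unchanged.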
20 native proteins in the test set are misclassified using Strategy 2.
| Molecular name | Classification | Ligand(s) | PDBID | Chain | Fitness value |
| Catalase | Oxidoreductase | 1 HEM and 3 SO4 | 1gwe | A | 0.1085 |
| Streptavidin | Biotin binding | 1 BTN and 2 GOL | 2f01 | A | 0.1407 |
| Acutohaemolysin | Toxin | 2 IPA | 1mc2 | A | 0.1728 |
| Endonuclease I | Hydrolase | 1 Mg and 2 Cl | 2pu3 | A | 0.1900 |
| Cytochrome c, putative | Electron transport | 2 SO4 | 2czs | A | 0.2664 |
| Cytochrome F | Electron transport | 1 HEME C | 1e2w | A | 0.6023 |
| Bowman-Birk type trypsin inhibitor | Hydrolase inhibitor | None | 2fj8 | A | 0.8463 |
| Uncharacterized protein with ferredoxin-like fold | Structural genomics, unknown function | 1 Unknown ligand | 3e8o | A | 1.1592 |
| General secretion pathway protein G | Protein transport | 1 Zn | 1t92 | A | 1.3175 |
| ARF GTPase-activating protein git1 | Signaling protein | None | 2w6a | A | 1.6581 |
| Cystatin B | Protein binding | None | 2oct | A | 1.8043 |
| SNAP-25A | Transport protein | None | 1n7s | D | 1.9074 |
| Lin2189 protein | Structural genomics, unknown function | 2 GOL | 3b49 | A | 2.0142 |
| Fibritin | Chaperone | None | 2ibl | A | 2.1211 |
| Oxalate oxidase 1 | Oxidoreductase | 1 Mn, 1 GLV | 2et1 | A | 2.9975 |
| Alpha-2-macroglobulin receptor-associated protein | Lipid transport, endocytosis, chaperone | 2 Ca, 1 Na and 3 MPD | 2fcw | B | 3.5660 |
| Recombination endonuclease VII | Plasma protein | 1 Zn and 7 SO4 | 1e7l | A | 3.7397 |
| Hypothetical protein YDCE | Isomerase | 1 BEZ | 1gyx | A | 4.2697 |
| Syntaxin 1a | Transport protein | None | 1n7s | B | 5.0204 |
| Bacteriophage T4 short tail fibre | Structural protein | 1 CIT, 2 SO4 | 1ocy | A | 8.0264 |
The number of ligands bound to each protein is listed. The molecules are sorted by fitness value. 14 of them have ligand(s) bound to the protein. 4 of them have contacts due to inter-chain interactions. The fitness function definitively failed for only 3 proteins. For the remaining 17 proteins, the contacts of organic compounds and metal ions with the protein, and inter-chain interactions, may provide additional stability beyond the intra-residue interactions captured in the descriptors.