Jingxue Wang, Huali Cao, John Z H Zhang, Yifei Qi.
Abstract
Computational protein design has a wide variety of applications. Despite its remarkable success, designing a protein for a given structure and function remains a challenging task. Meanwhile, the number of solved protein structures is increasing rapidly while the number of unique protein folds has plateaued, suggesting that more structural information is accumulating for each fold. Deep-learning neural networks are a powerful method for learning from such large data sets and have shown superior performance in many machine learning fields. In this study, we applied a deep-learning neural network to computational protein design, predicting the probability of each of the 20 natural amino acids at each residue position in a protein. A large set of protein structures was collected and a multi-layer neural network was constructed. A number of structural properties were extracted as input features, and the best network achieved an accuracy of 38.3%. Using the network output as residue-type restraints improves the average sequence identity when designing three natural proteins with Rosetta. Moreover, the predictions from our network show ~3% higher sequence identity than a previous method. Results from this study may benefit further development of computational protein design methods.
Year: 2018 PMID: 29679026 PMCID: PMC5910428 DOI: 10.1038/s41598-018-24760-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Architecture of the neural networks. (A) The residue probability network, (B) the weight network, and (C) the full network. The residue probability and weight networks are used as subnetworks that share the same set of network parameters for different inputs. Each input consists of the features from the target residue and one of its neighbor residues.
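The parameter sharing described in this caption can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the caption: each subnetwork is reduced to a single dense layer, the weight subnetwork output is squashed with a sigmoid, and the per-neighbor outputs are merged by a weighted sum followed by a softmax. The feature and neighbor counts are hypothetical, and the random parameters stand in for trained ones.

```python
import math
import random

N_AA = 20  # number of natural amino acid types


def dense(x, w, b):
    """Fully connected layer: returns w @ x + b for list-based vectors."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a stand-in for trained parameters."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def full_network(pair_features, prob_params, weight_params):
    """Apply the two shared subnetworks to every (target, neighbor) input.

    ASSUMPTION: outputs are merged by a weighted sum and a softmax; the
    caption does not say how the subnetwork outputs are combined.
    """
    total = [0.0] * N_AA
    for x in pair_features:
        # Same prob_params and weight_params for every input: shared parameters.
        scores = dense(x, *prob_params)
        weight = 1.0 / (1.0 + math.exp(-dense(x, *weight_params)[0]))
        total = [t + weight * s for t, s in zip(total, scores)]
    return softmax(total)


rng = random.Random(0)
n_features, n_neighbors = 8, 15  # hypothetical sizes
prob_params = init_layer(n_features, N_AA, rng)
weight_params = init_layer(n_features, 1, rng)
pairs = [[rng.random() for _ in range(n_features)] for _ in range(n_neighbors)]
probs = full_network(pairs, prob_params, weight_params)
print(len(probs), round(sum(probs), 6))
```

The point of the sketch is only that one set of parameters per subnetwork is reused for every neighbor, so the parameter count does not grow with the number of neighbors.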
Accuracy from five-fold cross-validation of the neural network on different datasets with different number of neighbor residues.
| Identity cutoff | | | | | |
|---|---|---|---|---|---|
| 30% | 0.329 | | 0.333 | 0.331 | 0.321 |
| 50% | 0.353 | | 0.358 | 0.359 | 0.342 |
| 90% | 0.367 | | 0.382 | 0.379 | 0.352 |
*Numbers in parentheses are standard deviations.
Figure 2. Recall and precision of different amino acids of the network trained on the SI90N15 dataset. Recall is the percent of native residues that are correctly predicted (recovered), and precision is the percent of predictions that are correct.
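The two definitions in this caption can be made concrete with a short sketch. The function name and the toy sequences are illustrative assumptions; only the definitions of recall and precision come from the caption.

```python
from collections import Counter


def per_residue_recall_precision(native, predicted):
    """Return {amino_acid: (recall, precision)} for two aligned sequences.

    recall    = correct predictions of X / native occurrences of X
    precision = correct predictions of X / predicted occurrences of X
    A value is None when its denominator is zero.
    """
    if len(native) != len(predicted):
        raise ValueError("sequences must have the same length")
    native_counts = Counter(native)
    predicted_counts = Counter(predicted)
    correct = Counter(n for n, p in zip(native, predicted) if n == p)
    stats = {}
    for aa in sorted(set(native) | set(predicted)):
        recall = correct[aa] / native_counts[aa] if native_counts[aa] else None
        precision = correct[aa] / predicted_counts[aa] if predicted_counts[aa] else None
        stats[aa] = (recall, precision)
    return stats


# Toy example (hypothetical sequences, not data from the study).
native = "AAGGLL"
predicted = "AGGGLA"
stats = per_residue_recall_precision(native, predicted)
for aa, (recall, precision) in stats.items():
    print(aa, recall, precision)
```

In the toy example glycine is over-predicted, so its recall is 1.0 while its precision drops to 2/3.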
Figure 3. Probability of each amino acid being predicted as 20 amino acids.
Figure 4. Top-K accuracy of the neural network trained on the SI90N15 dataset.
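Top-K accuracy is commonly computed as the fraction of positions whose native residue is among the K highest-probability predictions. The sketch below assumes that definition. The function name and the toy probabilities (three residue types instead of 20) are illustrative, not taken from the study.

```python
def top_k_accuracy(prob_rows, native_labels, k):
    """Fraction of positions whose native residue is in the top-k predictions.

    prob_rows: one dict per position mapping residue type -> probability.
    """
    hits = 0
    for probs, native in zip(prob_rows, native_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if native in ranked[:k]:
            hits += 1
    return hits / len(native_labels)


# Toy predictions for four positions (hypothetical values).
predictions = [
    {"A": 0.6, "G": 0.3, "L": 0.1},
    {"A": 0.2, "G": 0.5, "L": 0.3},
    {"A": 0.1, "G": 0.2, "L": 0.7},
    {"A": 0.5, "G": 0.1, "L": 0.4},
]
natives = ["A", "L", "L", "G"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(predictions, natives, k))
```

Top-1 accuracy is the ordinary accuracy, and the value can only stay the same or rise as K grows.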
Figure 5. Structures of the proteins used in protein design with residue-type restraints.
Average sequence identity of Rosetta fixed-backbone design on three proteins with/without residue-type restraints.
| Protein | No-restrain* | Top 1 | Top 3* | Top 5* | Top 10* |
|---|---|---|---|---|---|
| 2B8I | 0.276 ± 0.033 | 0.337 | 0.306 ± 0.017 (0.558) | 0.293 ± 0.037 (0.883) | |
| 1HOE | 0.408 ± 0.026 | 0.338 | 0.441 ± 0.018 (0.689) | 0.416 ± 0.028 (0.851) | |
| 2IGD | 0.409 ± 0.034 | | 0.473 ± 0.023 (0.705) | 0.401 ± 0.028 (0.754) | 0.408 ± 0.032 (0.967) |
*Sequence identities are presented as average ± standard deviation from 500 designs. Numbers in parentheses are maximal possible identities given the residue-type restraints.
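The quantities in this footnote can be sketched as follows, assuming sequence identity means the fraction of matching positions between a design and the native sequence. The function names, sequences, and allowed-residue sets are hypothetical. The sketch does not show how restraints are passed to Rosetta.

```python
from statistics import mean, stdev


def sequence_identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_possible_identity(native, allowed):
    """Upper bound on identity: positions where the native residue is allowed."""
    return sum(aa in options for aa, options in zip(native, allowed)) / len(native)


# Toy data (hypothetical): residue types permitted at each of five positions,
# e.g. the top-K predictions, and three designs that respect those restraints.
native = "MKVLA"
allowed = [{"M", "A"}, {"K", "R"}, {"V"}, {"I"}, {"A", "G"}]
designs = ["MKVIA", "MRVIG", "AKVIA"]

identities = [sequence_identity(native, d) for d in designs]
avg, sd = mean(identities), stdev(identities)  # sample standard deviation
upper = max_possible_identity(native, allowed)
print(f"{avg:.3f} ± {sd:.3f} ({upper:.3f})")
```

Because the native leucine at position 4 is excluded by the restraints, no design can exceed an identity of 0.8 here. The parenthesised values in the table play the same role.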
Average sequence identity of SPIN and our network on 50 test proteins.
| | Top 1 | Top 2 | Top 3 | Top 5 | Top 10 |
|---|---|---|---|---|---|
| SPIN | 0.302 | 0.453 | 0.552 | 0.677 | 0.868 |
| This study* | 0.330 | 0.487 | 0.585 | 0.717 | 0.896 |
*Numbers in parentheses are standard deviations from 5 networks trained on the same dataset with different random number seeds.
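As a quick check on the size of the gap between the two rows, the differences can be computed directly from the tabulated values. The variable names are illustrative; the numbers are copied from the table above.

```python
top_k = [1, 2, 3, 5, 10]
spin = [0.302, 0.453, 0.552, 0.677, 0.868]
this_study = [0.330, 0.487, 0.585, 0.717, 0.896]

# Absolute gain in average sequence identity at each K, rounded to 3 decimals.
gains = [round(ours - theirs, 3) for ours, theirs in zip(this_study, spin)]
mean_gain = sum(gains) / len(gains)

for k, gain in zip(top_k, gains):
    print(f"Top {k}: +{gain:.3f}")
print(f"mean gain: {mean_gain:.4f}")
```

The gains range from 0.028 to 0.040 in absolute identity, or roughly three percentage points at every K.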