Haoran Zhang, Zhetao Zheng, Liangzhen Dong, Ningning Shi, Yuelin Yang, Hongmin Chen, Yuxuan Shen, Qing Xia.
Abstract
The unnatural amino acid (UAA) incorporation technique through genetic code expansion has been used extensively in protein engineering over the last two decades. Mutations into UAAs offer more dimensions for tuning protein structures and functions. However, the huge library of available UAAs and the varied circumstances of mutation sites on different proteins call for rational UAA incorporation guided by artificial intelligence. Here we collected existing experimental evidence of UAA-incorporated proteins from the literature and established a database of known UAA substitution sites. Through program design and machine learning on this database, we showed that UAA incorporation into proteins is predictable from the observed evolutionary, steric and physicochemical factors. Based on the predicted probability of successful UAA substitution, we tested the model performance using literature-reported and freshly designed experimental evidence, and demonstrated its potential for screening UAA-incorporated proteins. This work expands structure-based computational biology and virtual screening to UAA-incorporated proteins, and offers a useful tool to automate the rational design of proteins with any UAA.
Keywords: Genetic code expansion; Machine learning; Protein design; Unnatural amino acid incorporation; Virtual screening
Year: 2022 PMID: 36147660 PMCID: PMC9472073 DOI: 10.1016/j.csbj.2022.08.063
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1 Architecture of this study. (a) Schematic illustration of a typical UAA substitution and the related influencing factors. (b) Flow chart of preparing the database and designing the RPDUAA program.
The variables or features used for machine learning in this study.
| k1 | entropy | the | xml file | evolutionary tolerance |
| k2 | rasa | the | cif file | steric effects |
| k3 | hsebup | the | cif file | steric effects |
| k4 | hsebdn | the | cif file | steric effects |
| k5 | sstype | the | cif file | steric effects |
| k6 | uaasimi | the | canvas | physicochemical changes |
| k7 | △AlogP | the difference between UAA and NAA on | canvas | physicochemical changes |
| k8 | △Estate | the difference between UAA and NAA on | canvas | physicochemical changes |
| k9 | △PSA | the difference between UAA and NAA on | canvas | physicochemical changes |
| k10 | △Polar | the difference between UAA and NAA on | canvas | physicochemical changes |
| k11 | △HBA | the difference between UAA and NAA on the number of | canvas | physicochemical changes |
| k12 | △HBD | the difference between UAA and NAA on the number of | canvas | physicochemical changes |
| k13 | △RB | the difference between UAA and NAA on the number of | canvas | physicochemical changes |
| k14 | △MW | the relative difference between UAA and NAA on | canvas | physicochemical changes |
| k15 | cdtype | the | other | other factors |
| k16 | prooflv | the | other | other factors |
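The difference features in the table (k7 to k14) compare a UAA with the natural amino acid (NAA) it replaces. A minimal sketch of that idea follows, assuming the descriptors have already been computed elsewhere. The function name, the dictionary keys, the numeric values and the formula used for the "relative difference" are all illustrative assumptions; the source does not give the exact definitions.

```python
def delta_features(uaa, naa, absolute_keys, relative_keys):
    """Return per-descriptor differences between a UAA and the NAA it replaces."""
    deltas = {}
    for key in absolute_keys:
        # Plain difference, e.g. for counts such as HBD or RB.
        deltas["d" + key] = uaa[key] - naa[key]
    for key in relative_keys:
        # Assumed definition of "relative difference": (UAA - NAA) / NAA.
        deltas["d" + key] = (uaa[key] - naa[key]) / naa[key]
    return deltas


# Hypothetical descriptor values, for illustration only.
naa = {"HBD": 2, "RB": 3, "MW": 100.0}
uaa = {"HBD": 3, "RB": 7, "MW": 150.0}

features = delta_features(uaa, naa, absolute_keys=["HBD", "RB"], relative_keys=["MW"])
print(features)
```

Keeping the absolute and relative keys as explicit arguments makes it easy to change which descriptors are normalized once the real definitions are known.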
Fig. 2 Performance of the prediction model. Principal component analyses of observed features (left panels), scatter plots of predicted probability (middle panels, where the error bars denote the 25%, 50% and 75% quartiles and the P values are from the Mann-Whitney test) and receiver operating characteristic curves (right panels) are shown for (a) the whole database, (b) the non-predicted PDB subset, and (c) the balanced subset.
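The ROC area summarized in such panels has a direct pairwise reading: it is the fraction of (successful, failed) record pairs in which the successful record receives the higher predicted probability, which equals the Mann-Whitney U statistic divided by the number of such pairs. The sketch below illustrates that relationship with hypothetical scores and labels; it is not the RPDUAA implementation.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # wins is the Mann-Whitney U statistic for the positive group.
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and outcomes (1 = successful substitution).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(round(auc, 3))
```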
Fig. 3 Stability of the prediction model. (a) Performance in a typical round of holdout validation. Graph definitions, error bars, and the statistical test are the same as previously described. (b) Changes of the optimal cutoff, accuracy and ROC area in 100 rounds of holdout validations using the whole database. (c) Changes of the optimal cutoff, accuracy and ROC area in 100 rounds of resampling validations using balanced subsets.
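Repeated holdout validation of the kind summarized in panel (b) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the data are synthetic, the test fraction of 0.3 is arbitrary, and the "optimal cutoff" is taken to be the cutoff that maximizes accuracy on the training split, which the source does not specify.

```python
import random


def accuracy(scores, labels, cutoff):
    """Fraction of records whose thresholded score matches the label."""
    hits = sum((s >= cutoff) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def best_cutoff(scores, labels):
    """Try every observed score as a cutoff; keep the most accurate one."""
    return max(sorted(set(scores)), key=lambda c: accuracy(scores, labels, c))


def holdout_rounds(scores, labels, rounds=100, test_fraction=0.3, seed=0):
    """Repeat a random train/test split and record (cutoff, test accuracy)."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    n_test = max(1, int(len(idx) * test_fraction))
    results = []
    for _ in range(rounds):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        cutoff = best_cutoff([scores[i] for i in train], [labels[i] for i in train])
        acc = accuracy([scores[i] for i in test], [labels[i] for i in test], cutoff)
        results.append((cutoff, acc))
    return results


# Synthetic scores: positives centered higher than negatives (illustrative only).
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
scores = [min(1.0, max(0.0, data_rng.gauss(0.65 if y else 0.35, 0.15))) for y in labels]

results = holdout_rounds(scores, labels)
mean_acc = sum(acc for _, acc in results) / len(results)
print(f"mean holdout accuracy over {len(results)} rounds: {mean_acc:.3f}")
```

Tracking the cutoff and accuracy per round, rather than only their means, is what allows their spread across rounds to be plotted.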
Fig. 4 Time-split validation of the prediction model. (a) Train-test split at the time point 2018-1-1 for the whole database. (b) Train-test split at the time point 2019-1-1 for the whole database. (c) Train-test split at the time point 2019-1-1 for a balanced subset. Records before the split time point were used to train the model, while records after that time point were used to test the model and plot the graphs. Graph definitions, error bars, and the statistical test are the same as previously described.
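The time-split scheme in this caption amounts to partitioning records by date. A minimal sketch follows; the record structure and identifiers are hypothetical, and placing a record dated exactly on the split point into the test set is an assumption, since the caption only says "before" and "after".

```python
from datetime import date


def time_split(records, split_date):
    """Records dated before split_date go to training; the rest go to testing."""
    train = [r for r in records if r["date"] < split_date]
    # Assumption: a record dated exactly on the split point is treated as test data.
    test = [r for r in records if r["date"] >= split_date]
    return train, test


# Hypothetical records, for illustration only.
records = [
    {"id": "rec1", "date": date(2016, 5, 1)},
    {"id": "rec2", "date": date(2018, 12, 31)},
    {"id": "rec3", "date": date(2019, 1, 1)},
    {"id": "rec4", "date": date(2021, 3, 15)},
]

train, test = time_split(records, date(2019, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])
```

Unlike a random holdout, this split never lets the model train on records published after the ones it is tested on.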
Fig. 5 Experimental validation of the prediction model. (a) Composition of the laboratory database of freshly designed UAA-incorporated proteins. (b) Structure of the ATOH1 protein predicted by AlphaFold2. (c) Chemical formula of NAEK. (d) Full probability matrix for all UAAs substituting any site of the ATOH1 protein. The probability was predicted by the RPDUAA program using the whole database. (e) Western blot images of NAEK-incorporated ATOH1. NAEK-dependent ATOH1 expression was taken as an indicator of successful UAA substitution at a site. (f) Fluorescence images of a downstream mCherry reporter confirming NAEK incorporation in ATOH1 in HEK293T cells. Scale bar, 100 μm. (g) Mass spectrometry of fragments with NAEK incorporated at the V185 site of ATOH1. (h) Correlation plot between the predicted probability and UAA incorporation efficiency for the 10 sites of ATOH1. (i) Performance of the prediction model when trained with the whole database from the literature and tested with the freshly prepared laboratory records. Graph definitions, error bars, and the statistical test are the same as previously described.
Fig. 6 Prediction of the UAA incorporation efficiency. (a) Subsets with coarse or exact yields in the database and prediction models in the RPDUAA program. (b) Performance of the model using the subset of coarse yield. (c) Performance of the model using the subset of exact yield. The solid line represents the linear regression between the predicted and detected UAA incorporation efficiencies, and the Pearson’s correlation coefficient R is reported.
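The two quantities named in this caption, the least-squares regression line and Pearson's R, can be computed from paired predicted and detected efficiencies as sketched below. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical predicted vs. detected incorporation efficiencies.
predicted = [0.1, 0.2, 0.3, 0.4, 0.5]
detected = [0.2, 0.1, 0.4, 0.3, 0.5]

r = pearson_r(predicted, detected)
slope, intercept = linear_fit(predicted, detected)
print(f"R = {r:.2f}, fit: y = {slope:.2f} * x + {intercept:.2f}")
```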