Claire N Bedbrook, Kevin K Yang, Austin J Rice, Viviana Gradinaru, Frances H Arnold.
Abstract
There is growing interest in studying and engineering integral membrane proteins (MPs) that play key roles in sensing and regulating cellular response to diverse external signals. An MP must be expressed, correctly inserted and folded in a lipid bilayer, and trafficked to the proper cellular location in order to function. The sequence and structural determinants of these processes are complex and highly constrained. Here we describe a predictive, machine-learning approach that captures this complexity to facilitate successful MP engineering and design. Machine learning on carefully chosen training sequences made by structure-guided SCHEMA recombination has enabled us to accurately predict the rare sequences in a diverse library of channelrhodopsins (ChRs) that express and localize to the plasma membrane of mammalian cells. These light-gated channel proteins of microbial origin are of interest for neuroscience applications, where expression and localization to the plasma membrane are prerequisites for function. We trained Gaussian process (GP) classification and regression models with expression and localization data from 218 ChR chimeras chosen from a 118,098-variant library designed by SCHEMA recombination of three parent ChRs. We use these GP models to identify ChRs that express and localize well and show that our models can elucidate sequence and structure elements important for these processes. We also used the predictive models to convert a naturally occurring ChR incapable of mammalian localization into one that localizes well.
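As an illustration of how a recombination library of this size can arise and how its members might be represented for a model, the sketch below enumerates chimeras block by block and one-hot encodes them. The block count (`N_BLOCKS = 10`) and the number of block designs (`N_DESIGNS = 2`) are assumptions, chosen because 2 × 3^10 = 118,098 matches the stated library size; the abstract itself gives only the total and the three parents. The parent labels `A`, `B`, `C` and all function names are hypothetical.

```python
from itertools import product

PARENTS = ("A", "B", "C")  # placeholder labels for the three parent ChRs
N_BLOCKS = 10              # assumption: sequence blocks per chimera
N_DESIGNS = 2              # assumption: number of distinct block designs


def library_size(n_parents, n_blocks, n_designs):
    """Number of chimeras when every block is inherited from any one parent."""
    return n_designs * n_parents ** n_blocks


def one_hot(chimera):
    """Flatten a chimera (one parent label per block) into a 0/1 feature list."""
    return [1 if parent == p else 0 for parent in chimera for p in PARENTS]


def hamming(a, b):
    """Number of blocks at which two chimeras inherit from different parents."""
    return sum(x != y for x, y in zip(a, b))


# Enumerate every chimera for a single block design.
chimeras = list(product(PARENTS, repeat=N_BLOCKS))
```

Under these assumptions, `library_size(3, 10, 2)` reproduces the 118,098 figure, and each chimera maps to a 30-element feature vector with exactly one active entry per block.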
Year: 2017 PMID: 29059183 PMCID: PMC5695628 DOI: 10.1371/journal.pcbi.1005786
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 2. GP binary classification models for expression and localization.
Plots of predicted probability vs measured properties are divided into ‘high’ performers (white background) and ‘low’ performers (gray background) for each property (expression and localization). (A) & (D) Predicted probability vs measured properties for the training set (gray points) and the exploration set (cyan points). Predictions for the training and exploration sets were made using leave-one-out (LOO) cross-validation. (B) & (E) Predicted probabilities vs measured properties for the verification set. Predictions for the verification set were made by a model trained on the training and exploration sets. (C) & (F) Predicted probability of ‘high’ expression and localization for all chimeras in the recombination library (118,098 chimeras), made by models trained on the data from the training and exploration sets. The gray line shows all chimeras in the library, the gray points indicate the training set, the cyan points indicate the exploration set, the purple points indicate the verification set, and the yellow points indicate the parents. (A-C) show expression and (D-F) show localization. For all plots, the measured property is plotted on a log2 scale.
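The LOO cross-validation mentioned in the caption can be sketched as follows: each chimera is held out in turn, a model is fit on the rest, and the held-out chimera is predicted. This sketch uses a zero-mean GP regression posterior mean as a stand-in, not the GP classification model from the figure, and the kernel (an exponential of the block-wise Hamming distance), the noise level, and the toy data are all assumptions made for illustration.

```python
import math


def kernel(a, b, length_scale=2.0):
    """Exponential kernel on the Hamming distance between block sequences."""
    d = sum(x != y for x, y in zip(a, b))
    return math.exp(-d / length_scale)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_mean(train_x, train_y, test_x, noise=0.1):
    """Posterior mean of a zero-mean GP regression at a single test point."""
    n = len(train_x)
    K = [[kernel(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, train_y)
    return sum(kernel(test_x, xi) * a for xi, a in zip(train_x, alpha))


def loo_predictions(xs, ys):
    """Predict each point from a model fit on all the other points."""
    preds = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        preds.append(gp_mean(rest_x, rest_y, xs[i]))
    return preds


# Hypothetical toy data: five 4-block chimeras and made-up log2 measurements.
toy_x = [("A", "A", "A", "A"), ("A", "B", "A", "C"), ("B", "B", "C", "C"),
         ("C", "A", "B", "B"), ("C", "C", "C", "A")]
toy_y = [2.0, 1.2, -0.5, 0.3, -1.0]
toy_preds = loo_predictions(toy_x, toy_y)
```

Comparing `toy_preds` with `toy_y` gives the kind of predicted-vs-measured view the panels describe; a classification model would instead report a probability of ‘high’ performance for each held-out chimera.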
Comparison of size, diversity, and localization properties of the training set and subsequent sets of chimeras chosen by models in the iterative steps of model development.
| Size | Diversity | Good localization* | Localization |
|---|---|---|---|
| 3 | 0 | 100% | 5.6 ± 3.0 |
| 112 | 15 ± 9 | 33% | 3.2 ± 3.4 |
| 103 | 73 ± 21 | 12% | 1.5 ± 2.5 |
| 16 | 69 ± 12 | 50% | 4.8 ± 4.7 |
| 4 | 29 ± 17 | 100% | 8.0 ± 1.6 |
| 7 | 67 ± 12 | 0% | 0.89 ± 0.73 |
| 4 | 43 ± 6 | 100% | 14 ± 3.5 |
* ‘Good localization’ is localization at or above that of the lowest-performing parent, CheRiff.
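The ‘good localization’ criterion in the footnote amounts to a threshold at the lowest parent value. A minimal sketch of how such a percentage could be computed is given below; the function name, the placeholder parent labels `parent_2` and `parent_3`, and every number are hypothetical and are not taken from the table.

```python
def fraction_good(values, parent_values):
    """Fraction of measurements at or above the lowest-performing parent."""
    threshold = min(parent_values.values())
    good = sum(v >= threshold for v in values)
    return good / len(values)


# Hypothetical localization measurements; CheRiff is set as the lowest parent.
example_parents = {"CheRiff": 2.0, "parent_2": 6.0, "parent_3": 8.0}
example_set = [1.0, 3.0, 5.0, 0.5]
example_fraction = fraction_good(example_set, example_parents)
```

Here two of the four made-up values meet the CheRiff threshold, so the set would be reported as 50% good localization.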