Frédéric Cadet1, Nicolas Fontaine2, Iyanar Vetrivel2, Matthieu Ng Fuk Chong2, Olivier Savriama2, Xavier Cadet2, Philippe Charton2.
Abstract
BACKGROUND: Connecting the dots between a protein's sequence and its function is of fundamental interest to protein engineers. In silico methods are useful in this quest, especially when structural information is not available. In this study, we propose a mutant library screening tool called iSAR (innovative Sequence Activity Relationship) that relies on the physicochemical properties of amino acids, digital signal processing, and partial least squares regression to uncover these sequence-function correlations.
Keywords: Directed evolution; Protein sequence activity relationship; Protein spectrum; Rational screening; Statistical modelling
Year: 2018 PMID: 30326841 PMCID: PMC6191906 DOI: 10.1186/s12859-018-2407-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Principles of statistical methods used to model structure or sequence to activity relationship (Damborský and Brezovsky [41] reproduced with permissions). a Schema illustrating the principles behind Quantitative Structure to Function Relationship method whereby numerical descriptors derived from structure are regressed on the activity data (yellow column). b Principles behind Protein Sequence to Activity Relationship methods whereby numerical descriptors derived from sequence are regressed on the activity data
Characteristics of the experimental datasets. n is the number of mutated positions and k is the number of residues at each position
| Dataset | Size of dataset | n | k | Theoretical size of sequence space | Length of protein sequence |
|---|---|---|---|---|---|
| Cyt P450 | 242 | 8 | 3 | 6561 | 464–466 |
| GLP-2 | 31 | 31 | 2 | 2.147 billion | 33 |
| Enterotoxin | 12 | 40 | 2 | 1099.5 billion | 233 |
| TNF | 21 | 17 | [2, 7, 4, 6, 2, 9, 9, 9, 9, 9, 2, 2, 2, 2, 6, 8, 7] | 213.3 billion | 157 |
The theoretical size of sequence space S is calculated as the product of the k values over all n mutated positions
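The product rule in the footnote can be checked against the table with a short calculation. The sketch below is only an illustration: the function name `sequence_space_size` is hypothetical, and the k values are those listed in the table (a single k repeated n times where k is constant across positions).

```python
from math import prod

def sequence_space_size(k_values):
    """Return S, the product of the number of residues allowed at each mutated position."""
    return prod(k_values)

# k values per mutated position, taken from the table above
datasets = {
    "Cyt P450": [3] * 8,        # n = 8, k = 3
    "GLP-2": [2] * 31,          # n = 31, k = 2
    "Enterotoxin": [2] * 40,    # n = 40, k = 2
    "TNF": [2, 7, 4, 6, 2, 9, 9, 9, 9, 9, 2, 2, 2, 2, 6, 8, 7],  # n = 17
}

for name, k_values in datasets.items():
    size = sequence_space_size(k_values)
    print(f"{name}: n = {len(k_values)}, S = {size:,}")
```

For constant k the product reduces to k^n, which is why the Cyt P450 library gives 3^8 = 6561.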
Fig. 2 Fourier spectra of a protein sequence and a single point variant of the same protein. Shown are the Fourier spectra of wild type GLP-1 peptide (in blue) and of its E3A variant (in red). The spectra are obtained after numerically encoding the amino acid sequence using one index from AAindex database and their processing using Fast Fourier Transform (FFT) technique (see Methods section for details). A single point mutation impacts the whole spectrum. In the iSAR methodology, the variations caused by the mutation in the spectra of variants are correlated with variations observed in their corresponding biological activity using the PLS regression technique (see Fig. 3)
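The claim that a single point mutation impacts the whole spectrum can be illustrated with a minimal sketch. Several assumptions apply: the peptide `MKEALVGTFS` is a made-up toy sequence (not GLP-1), the Kyte-Doolittle hydropathy scale is used as one example of an AAindex-style index (the index used for the figure is not stated here), and a direct discrete Fourier transform is written out in place of an FFT library call (both yield the same coefficients). The helper names are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an example index
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(sequence, index=HYDROPATHY):
    """Turn an amino acid string into a numerical signal."""
    return [index[residue] for residue in sequence]

def spectrum(signal):
    """Complex DFT coefficients of the signal (what an FFT would compute)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

wild_type = "MKEALVGTFS"                     # hypothetical toy peptide
variant = wild_type[:2] + "A" + wild_type[3:]  # E3A substitution

wt_spec = spectrum(encode(wild_type))
var_spec = spectrum(encode(variant))

# Distance between the two complex coefficients at each frequency
shifts = [abs(a - b) for a, b in zip(wt_spec, var_spec)]
for k, shift in enumerate(shifts):
    print(f"frequency {k}: |wild type| = {abs(wt_spec[k]):.2f}, "
          f"|variant| = {abs(var_spec[k]):.2f}, coefficient shift = {shift:.2f}")
```

Because the two encoded signals differ at exactly one position, their difference is a single impulse, and the transform of an impulse has the same magnitude at every frequency. Every complex coefficient therefore moves by the same amount (here the index difference between A and E), which is the sense in which one substitution touches the entire spectrum.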
Fig. 3 General scheme for the iSAR methodology described in this paper. “Multivariate Analysis” and “Classification (for rational screening)” on the right part of the figure are optional
Summary of the different R2 and RMSE values obtained through predictions for the full set of protein sequences and after an 80/20 splitting in order to generate a training set and a validation set
| Set | Partition | cvR2 | cvRMSE |
|---|---|---|---|
| Cyt P450 (thermostability) | Full set (10-fold CV) | 0.96 | 1.19 |
| | Train set (80%) (10-fold CV) | 0.93 | 1.33 |
| | Validation set (20%) | 0.92 | 1.72 |
| Enterotoxins (thermostability) | Full set (LOOCV) | 0.95 | 1.58 |
| | Train set (80%) (LOOCV) | 0.85 | 2.58 |
| | Validation set (20%) | 0.99 | 0.59 |
| TNF (relative binding affinities) | Full set (LOOCV) | 0.85 | 0.31 |
| | Train set (80%) (LOOCV) | 0.86 | 0.33 |
| | Validation set (20%) | 0.92 | 0.20 |
| GLP-2 Potency (fold-increase in cAMP) | Full set (LOOCV) | 0.42 | 2.05 |
| | Train set (80%) (LOOCV) | 0.75 | 1.39 |
| | Validation set (20%) | 0.71 | 1.44 |
For the full set and the train set (80%), cvR2 and cvRMSE (RMSE is in the same units as the activity) were evaluated using leave-one-out cross-validation (LOOCV) or a 10-fold cross-validation scheme
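How cross-validated metrics of this kind are typically computed can be sketched as follows. This is not the paper's pipeline: a one-variable least-squares line stands in for PLS regression, the data are invented, and cvR2 is taken as 1 - PRESS/SS_tot, which is one common definition (the exact formula used for the table is not given here). All function names are hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_metrics(xs, ys):
    """Leave-one-out cross-validation: each sample is predicted by a model
    trained on all the other samples. Returns (cvR2, cvRMSE)."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    # PRESS: sum of squared errors of the held-out predictions
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - press / ss_tot, sqrt(press / len(ys))

# Invented descriptor/activity pairs, for illustration only
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

cv_r2, cv_rmse = loocv_metrics(descriptor, activity)
print(f"cvR2 = {cv_r2:.3f}, cvRMSE = {cv_rmse:.3f}")
```

A 10-fold scheme follows the same pattern, except that roughly one tenth of the samples is held out per iteration instead of a single sample. Because cvRMSE is a root mean squared error on the activity values, it carries the units of the activity, as the footnote notes.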
Fig. 4 Evaluation of iSAR for modelling the thermostability of cytochrome P450 variants. Shown are the measured against predicted thermostability values (melting temperature in °C) assessed under the 10-fold cross-validation scheme for the full set of 242 variants (+), for a training set composed of 80% of the dataset (○) and for a validation set comprising 20% of the variants (□)