Guangyue Li, Youcai Qin, Nicolas T Fontaine, Matthieu Ng Fuk Chong, Miguel A Maria-Solano, Ferran Feixas, Xavier F Cadet, Rudy Pandjaitan, Marc Garcia-Borràs, Frederic Cadet, Manfred T Reetz.
Abstract
Machine learning (ML) has pervaded most areas of protein engineering, including stability and stereoselectivity. Using limonene epoxide hydrolase as the model enzyme and innov'SAR, an ML platform that incorporates digital signal processing, we achieved high protein robustness that can resist unfolding with concomitant detrimental aggregation. The Fourier transform (FT) allows us to take into account the order of the protein sequence and the nonlinear interactions between positions, and thus to grasp epistatic phenomena. The innov'SAR approach is interpolative and extrapolative, and makes outside-the-box predictions not found in other state-of-the-art ML or deep learning approaches. Equally significant is the finding that our approach to ML in the present context, flanked by advanced molecular dynamics simulations, uncovers the connection between epistatic mutational interactions and protein robustness.
Keywords: artificial intelligence; epistasis; epoxide hydrolase; innov'SAR; machine learning; molecular dynamics simulations
Year: 2020 PMID: 33094545 PMCID: PMC7984044 DOI: 10.1002/cbic.202000612
Source DB: PubMed Journal: Chembiochem ISSN: 1439-4227 Impact factor: 3.164
Figure 1. Workflow of the modelling process. A) A protein sequence is encoded in two steps: i) numerical encoding based on an index from the AAindex database; ii) a fast Fourier transform (FFT) converts the encoded sequence into a protein spectrum. Each numerical encoding from an index gives a unique protein spectrum. Here, three specific encodings give three specific protein spectra. Each protein spectrum is an elementary numerical sequence available for modelling with innov'SAR. B) Construction of a numerical extended sequence (Ext_SEQ) by concatenating the elementary numerical sequences. C) The different phases of innov'SAR. The encoding phase transforms the primary sequences of the initial dataset into protein spectra. The modelling phase uses the protein spectra and protein thermostability as a learning dataset to construct a regression model. Here, for the modelling of the epoxide hydrolase LEH, the model is built with partial least‐squares regression. The predictive phase then uses the regression model and the protein spectra of new variants to predict their thermostability.
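The three phases in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the innov'SAR implementation: the Kyte-Doolittle hydropathy scale stands in for "an index of the AAindex database", `TOY_INDEX` is an invented second index, the amplitude of the real FFT is taken as the protein spectrum, the sequences and stability values are placeholders (not LEH data), and the regression is a plain single-response PLS fitted with the NIPALS algorithm.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale (AAindex entry KYTJ820101), used as an example index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Hypothetical second index, invented here only to show concatenation.
TOY_INDEX = {aa: float(i) for i, aa in enumerate(sorted(KYTE_DOOLITTLE))}


def protein_spectrum(seq, index):
    """Encode a sequence with one index and return its FFT amplitude spectrum."""
    encoded = np.array([index[aa] for aa in seq], dtype=float)
    return np.abs(np.fft.rfft(encoded))


def extended_sequence(seq, indices):
    """Concatenate the spectra from several encodings (Ext_SEQ in the caption)."""
    return np.concatenate([protein_spectrum(seq, idx) for idx in indices])


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS regression with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # no covariance left to model
            break
        w = w / norm
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean


# Placeholder learning set: made-up 12-residue sequences and stability values.
train_seqs = ["MSPKTVGCLVED", "MSAKTVGSLVED", "MSPATVGCLIED",
              "MSAATVGSLIEE", "MSPKTLGCLVDD"]
y = np.array([40.0, 45.0, 50.0, 44.0, 52.0])
indices = [KYTE_DOOLITTLE, TOY_INDEX]

# Encoding phase, modelling phase, predictive phase.
X = np.array([extended_sequence(s, indices) for s in train_seqs])
coef, x_mean, y_mean = pls1_fit(X, y, n_components=2)
x_new = extended_sequence("MSAKTVGCLVED", indices)
pred = float((x_new - x_mean) @ coef + y_mean)
print(f"predicted stability of the new variant: {pred:.2f}")
```

Because every sequence is mapped to a spectrum of the same length, variants with any number of substitutions share one feature space, which is what lets a single regression model score unseen combinations.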
Scheme 1. The proposed catalytic mechanism of limonene epoxide hydrolase (LEH).
Figure 2. Ranking of the predicted thermostability (T50 [°C]) of LEH variants. 11 943 936 variants were considered (for convenience, only a subsample is presented on this plot). Predictions within the thermostability range of the learning dataset (38 to 63 °C) are shown in green. The blue diamond shows the thermostability of the best point from the learning dataset. Predictions for the 86 090 variants with a thermostability equal to or greater than that of this best point are shown in red.
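The ranking shown in Figure 2 amounts to sorting the predicted values and flagging those at or above the best point of the learning dataset. A minimal sketch follows, assuming the predictions are already available as an array. The function name and the six prediction values are placeholders invented for illustration; only the threshold of 63 °C comes from the caption.

```python
import numpy as np


def rank_predictions(predicted, best_training_value):
    """Sort predictions from high to low and flag those >= the best training point."""
    predicted = np.asarray(predicted, dtype=float)
    order = np.argsort(predicted)[::-1]      # variant indices, best first
    ranked = predicted[order]
    above = ranked >= best_training_value    # the "red" region of the plot
    return order, ranked, above


# Placeholder predictions for six hypothetical variants (not values from the study).
predicted_t50 = [40.0, 65.0, 55.0, 63.0, 70.0, 50.0]
order, ranked, above = rank_predictions(predicted_t50, best_training_value=63.0)
print(order.tolist())     # [4, 1, 3, 2, 5, 0]
print(int(above.sum()))   # 3
```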
The determined Tm and Tagg values of WT LEH and LEH variants based on collected label‐free fluorescence, DSF and SLS.

| Sample | Mutations | Predicted thermostability [°C] | Expl. Tm [°C] | Expl. Tagg [°C] |
|---|---|---|---|---|
| WT LEH | | 41 | 46.38±0.01 | 42.6±0.06 |
| LEH‐1 | S15P/A19K/T85V/G89C/S91C/L114V/E124D | 69.14 | 65.95±0.36 | 53.4±0.02 |
| LEH‐2 | S15P/A19K/T85V/G89C/S91C/L114V/I116V/E124D | 69.04 | 60.09±0.02 | 56.6±0.19 |
| LEH‐3 | I5C/S15P/A19K/L74F/T85V/G89C/S91C/L114V/I116V/E124D | 66.21 | 73.62±0.33 | 63.3±0.00 |
| LEH‐4 | S15P/A19K/M78F/I80V/T85V/G89C/S91C/N92K/L114V/I116V/F139V/L147F | 66.13 | 61.31±0.33 | 49.3±0.10 |
| LEH‐5 | S15P/A19K/M78F/I80V/T85V/G89C/S91C/Y96F/L114V/I116V/E124D/F139V/L147F | 66.07 | 71.30±0.33 | 62.6±0.07 |
| LEH‐F1b | I5C/S15P/A19K/T76K/E84C/T85V/G89C/S91C/N92K/Y96F/E124D | 63 | 63.12±0.46 | 55.6±0.21 |

Activity and stereoselectivity of WT LEH and LEH mutants based on catalytic conversion of substrate 1 monitored by GC.

| Enzyme | Relative activity[a] | | Preferred enantiomer |
|---|---|---|---|
| WT LEH | 100 | 1.4 | ( |
| LEH‐1 | 585.8 | 3.7 | ( |
| LEH‐2 | 298.4 | 1.3 | ( |
| LEH‐3 | 4.7 | 34.5 | ( |
| LEH‐4 | 11.8 | 24.9 | ( |
| LEH‐5 | 26.8 | 25.8 | ( |
| LEH‐F1b | 77.2 | 3.6 | ( |

[a] The relative activity was determined based on the conversion rate, and the conversion rate of WT LEH was defined as 100 %.
Figure 3. A) Root-mean-square fluctuations (RMSFs) of all residues computed from the aMD simulations at four different temperatures (300 K in gray, 323 K in blue, 343 K in green and 363 K in red) for the WT, LEH‐1 and LEH‐5 enzyme variants. The LEH‐1 mutations are marked as inverted triangles, the extra mutations introduced in LEH‐5 as stars, and the catalytic residues as pink diamonds. The unfolding hotspots are also highlighted. B) Representation of the WT flexibility computed at 343 K by means of RMSF. The main hotspots are coloured (N loop in red, C loop‐H4 in red, β1‐loop A‐β2 in blue and loop B in green). C) Enzyme sequence showing the unfolding hotspots, the LEH‐1 and LEH‐5 mutations and the positions of the catalytic residues.
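For readers unfamiliar with the quantity plotted in Figure 3, the RMSF of a residue is the square root of its time-averaged squared displacement from its mean position. The sketch below computes it with NumPy on a synthetic trajectory. It assumes the frames have already been aligned to a common reference (one position per residue, for example the Cα atom), and the trajectory is random placeholder data, not output from the aMD simulations.

```python
import numpy as np


def rmsf(traj):
    """Per-residue RMSF from an aligned trajectory of shape (n_frames, n_residues, 3)."""
    mean_pos = traj.mean(axis=0)                       # average structure
    sq_disp = ((traj - mean_pos) ** 2).sum(axis=2)     # squared displacement per frame
    return np.sqrt(sq_disp.mean(axis=0))               # time average, then square root


# Synthetic placeholder trajectory: five residues with different fluctuation scales.
rng = np.random.default_rng(0)
sigma = np.array([0.1, 0.1, 0.5, 0.1, 1.0])
traj = rng.normal(size=(500, 5, 3)) * sigma[None, :, None]

values = rmsf(traj)
print(np.round(values, 2))
print("most flexible residue index:", int(values.argmax()))
```

In a plot like panel A, peaks in such a profile mark the flexible regions, and comparing profiles across temperatures shows which regions loosen first.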
Figure 4. A) 3D structure of the monomeric form of the LEH enzyme. The LEH‐1 mutations are highlighted by cyan spheres, the extra LEH‐5 mutations are shown as ochre spheres, and the catalytic residues are shown as violet sticks. The unfolding hotspot regions are coloured as in Figure 3. B) Zoom view of the epistatic interactions of ML‐designed mutations observed in LEH‐5 from the MD simulations. C) Dynamical cross‐correlation analysis of the LEH‐5 variant. The suboptimal paths that connect the network of residues are represented in blue, and the residues involved in the paths are highlighted as small grey spheres. LEH‐1 mutations are highlighted as turquoise spheres; the LEH‐5 mutations are shown as ochre spheres. The terminal N loop is displayed in red and the terminal C loop in orange.
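The dynamical cross‐correlation analysis in panel C rests on a residue-by-residue correlation matrix: for residues i and j, the time-averaged dot product of their displacements from the mean position, divided by the product of their root-mean-square displacements. A minimal NumPy sketch is given below, assuming an aligned trajectory with one position per residue. The trajectory is synthetic placeholder data, and the suboptimal-path search mentioned in the caption is not reproduced here.

```python
import numpy as np


def dccm(traj):
    """Dynamical cross-correlation matrix from an aligned (n_frames, n_residues, 3) trajectory."""
    disp = traj - traj.mean(axis=0)                         # displacements from the mean
    cov = np.einsum("tik,tjk->ij", disp, disp) / len(traj)  # <dr_i . dr_j>
    norm = np.sqrt(np.diag(cov))                            # sqrt(<dr_i . dr_i>)
    return cov / np.outer(norm, norm)                       # values in [-1, 1]


# Synthetic placeholder trajectory with four residues:
# residue 1 moves with residue 0, residue 2 moves against it, residue 3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 3))
independent = rng.normal(size=(1000, 3))
traj = np.stack([base, base, -base, independent], axis=1)

corr = dccm(traj)
print(np.round(corr, 2))
```

Values near +1 indicate residues that move together, values near -1 indicate anticorrelated motion, and values near 0 indicate uncoupled motion. Network analyses of the kind shown in panel C typically use such a matrix to weight the edges between residues.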