Jun Liao, Manfred K Warmuth, Sridhar Govindarajan, Jon E Ness, Rebecca P Wang, Claes Gustafsson, Jeremy Minshull.
Abstract
BACKGROUND: Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput physical or computational tests that can accurately predict protein activity under conditions relevant to its final application. Here we describe a new synthetic biology approach to protein engineering that avoids these limitations by combining high throughput gene synthesis with machine learning-based design algorithms.
Year: 2007 PMID: 17386103 PMCID: PMC1847811 DOI: 10.1186/1472-6750-7-16
Source DB: PubMed Journal: BMC Biotechnol ISSN: 1472-6750 Impact factor: 2.563
Figure 1. Flowchart of protein engineering design and testing process. The figure shows an overview of the experimental flow described in this work. Details are provided for each step in the indicated section of Results and Discussion.
Amino acid substitutions selected for modification of proteinase K.

| Substitution | Effect | Selection criteria and references |
|---|---|---|
| N95C | Lethal | Literature report: disulphide bond between 95C and 299C reported to stabilize subtilisin BPN' (S3C and Q206C in subtilisin)[34,35]. |
| P97S | Lethal | Literature report: P to S reported to increase stability in subtilisin BPN' (P5S in subtilisin)[34,35]. |
| S107D | Negative | Homolog sequence alignment analysis: D present at this position in 2/42 Group B homologs. |
| S123A | Positive | Thermostable homolog sequence alignment analysis: residue found in 8/11 Group C homologs and 6/42 Group B homologs. |
| I132V | Positive | Thermostable homolog sequence alignment analysis: residue found in 10/11 Group C homologs, 1/6 Group A homologs and 13/42 Group B homologs. Also a favorable change according to Dayhoff substitution matrix[36]. |
| E138A | Lethal | Literature report: acidic residue to A reported to increase stability in subtilisin BPN' (D41A in subtilisin) [34,35]. |
| M145F | Negative | Literature report: M to F reported to increase stability in subtilisin BPN' (M50F in subtilisin) [34,35]. |
| Y151A | Strong positive | Thermostable homolog sequence alignment analysis: residue found in close thermostable homolog gi|131084 and 2/42 Group B homologs. |
| V167I | Negative | Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue found in 1/6 Group A homologs and 27/42 Group B homologs. |
| L180I | Positive | Thermostable homolog sequence alignment analysis: residue found in 10/11 Group C homologs, 1/6 Group A homologs and 10/42 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. |
| Y194S | Negative | Random mutation obtained during synthesis of wt proteinase K. |
| A199S | Negative | Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue found in 1/6 Group A homologs and 9/42 Group B homologs. |
| K208H | Positive | PCA identification of amino acids responsible for clustering of thermophilic sequences gi|4092486; gi|56160990; gi|114081 within Group A and B homologs [37]. |
| A236V | Lethal | PCA identification of amino acids responsible for clustering of thermophilic sequences gi|4092486; gi|56160990; gi|114081 within Group A and B homologs [37]. |
| R237N | Negative | Thermostable homolog sequence alignment analysis: residue found in 9/11 Group C homologs, 1/6 Group A homologs and 1/42 Group B homologs. |
| P265S | Negative | Structural considerations: literature report: P5S reported to increase stability in subtilisin BPN' (P5S in subtilisin) [34,35]. 265S found at this position in proteinase K closest homolog (gi|131084). |
| V267I | Positive | Substitution matrix-derived change: favorable change according to Dayhoff substitution matrix [36]. Residue found in 1/6 Group A homologs and 1/41 Group B homologs. |
| S273T | Positive | Thermostable homolog sequence alignment analysis: residue found in 11/11 Group C homologs, 1/6 Group A homologs and 29/41 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. |
| G293A | Strong positive | Thermostable homolog sequence alignment analysis: residue found in 11/11 Group C homologs, 1/6 Group A homologs and 38/41 Group B homologs. |
| L299C | Lethal | Disulphide bond between 95C and 299C reported to stabilize serine proteases [34,35]. |
| I310K | Negative | Literature report: K substitution at this position reported to increase stability by adding hydrogen bonding in subtilisin BPN' (Y217K in subtilisin) [34,35]. |
| K332R | Positive | Thermostable homolog sequence alignment analysis: residue found in 8/11 Group C homologs and 1/6 Group A homologs. Also a favorable change according to Dayhoff substitution matrix [36]. |
| S337N | Positive | Thermostable homolog sequence alignment analysis: residue found in 8/11 Group C homologs, 1/6 Group A homologs and 2/41 Group B homologs. Also a favorable change according to Dayhoff substitution matrix [36]. |
| P355S | Negative | Structural considerations: literature report: P5S reported to increase stability in subtilisin BPN' (P5S in subtilisin) [34,35]. 355S found at this position in proteinase K closest homolog (gi|131084). |
Selection criteria and references are shown for 24 amino acid substitutions within proteinase K. Group A, wild type plus 5 closest homologs (>90% identity); Group B, 42 homologs (30–90% identity); Group C, 11 thermostable homologs. The effect of each substitution is also shown. Lethal: no active variant contained this substitution. Negative: the substitution was not selected by any of the third-round design methods. Positive: the substitution was selected by at least one third-round design method and was present in at least one third-round variant with activity > 3× wild type. Strong positive: the substitution was selected by all third-round design methods and is present in the most active variants.
Figure 2. Three cycles of proteinase K variant design and testing. Mean activity measurements of the 3 sets of proteinase K variants are shown. Set 1 (diamonds) is the initial set of 59 variants. Set 2 (squares, 20 variants) was designed using the activities of Set 1. Set 3 (triangles, 16 variants) was designed based on sets 1 and 2. Activities towards N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide were measured at 37°C following a 5 minute heat treatment of the enzyme at 68°C. Activities are expressed relative to the mean activity of 2 replicates of the wild-type proteinase K.
Vector weights calculated for amino acid substitutions.

| Substitution | M | σ | M | σ | M | σ | M | σ |
|---|---|---|---|---|---|---|---|---|
| S107D | -0.03 | 0.13 | 0.00 | 0.02 | -0.70 | 0.26 | -0.16 | 0.13 |
| S123A | -1.00 | 0.13 | -0.41 | 0.35 | -1.42 | 0.23 | -0.93 | 0.14 |
| I132V | 0.04 | 0.44 | 0.04 | 0.55 | 0.32 | 0.76 | -0.34 | 0.29 |
| M145F | -1.46 | 0.19 | -2.27 | 0.49 | -1.98 | 0.32 | -1.58 | 0.20 |
| Y151A | 1.18 | 0.23 | 0.91 | 0.23 | 1.66 | 0.37 | 0.91 | 0.15 |
| V167I | -0.97 | 0.13 | -1.09 | 0.15 | -1.10 | 0.17 | -0.79 | 0.14 |
| L180I | -0.23 | 0.15 | -0.05 | 0.10 | -0.35 | 0.19 | -0.36 | 0.13 |
| Y194S | 0.27 | 0.20 | 0.00 | 0.01 | 0.94 | 0.73 | 0.01 | 0.14 |
| A199S | -1.16 | 0.39 | -1.09 | 0.46 | -2.66 | 0.98 | -0.86 | 0.21 |
| K208H | 0.28 | 0.15 | 0.07 | 0.12 | 0.52 | 0.18 | 0.36 | 0.17 |
| R237N | -0.93 | 0.09 | -0.91 | 0.13 | -1.21 | 0.12 | -0.86 | 0.15 |
| V267I | -0.48 | 0.11 | -0.32 | 0.14 | -0.68 | 0.13 | -0.16 | 0.12 |
| S273T | 0.12 | 0.14 | 0.01 | 0.06 | 0.28 | 0.19 | -0.05 | 0.17 |
| G293A | 1.95 | 0.13 | 2.24 | 0.14 | 2.10 | 0.17 | 1.70 | 0.13 |
| K332R | 0.07 | 0.13 | -0.01 | 0.05 | 0.02 | 0.14 | 0.09 | 0.15 |
| S337N | -0.02 | 0.14 | 0.03 | 0.09 | -0.20 | 0.15 | 0.03 | 0.14 |
| P355S | -1.08 | 0.12 | -1.20 | 0.15 | -1.25 | 0.13 | -1.10 | 0.15 |
| S107D | -0.01 | 0.20 | -0.02 | 0.21 | -0.35 | 0.24 | -0.35 | 0.24 |
| S123A | -0.41 | 0.43 | -0.40 | 0.41 | 0.52 | 0.93 | 0.52 | 0.93 |
| I132V | 0.22 | 0.69 | 0.18 | 0.56 | 2.61 | 0.91 | 2.61 | 0.91 |
| M145F | -2.39 | 0.53 | -2.39 | 0.53 | -5.33 | 1.05 | -5.33 | 1.05 |
| Y151A | 0.82 | 0.24 | 0.82 | 0.25 | 0.64 | 0.24 | 0.64 | 0.24 |
| V167I | -0.99 | 0.24 | -0.98 | 0.23 | -1.63 | 0.24 | -1.63 | 0.24 |
| L180I | -0.20 | 0.16 | -0.19 | 0.16 | 0.60 | 0.23 | 0.60 | 0.23 |
| Y194S | -0.02 | 0.08 | 0.21 | 0.38 | 4.59 | 2.10 | 4.59 | 2.10 |
| A199S | -0.49 | 0.33 | -0.73 | 0.54 | -4.92 | 2.03 | -4.92 | 2.03 |
| K208H | 0.16 | 0.18 | 0.14 | 0.18 | 0.01 | 0.13 | 0.01 | 0.13 |
| R237N | -0.96 | 0.29 | -0.96 | 0.29 | -1.59 | 0.57 | -1.59 | 0.57 |
| V267I | -0.41 | 0.23 | -0.41 | 0.23 | -1.33 | 0.14 | -1.33 | 0.14 |
| S273T | 0.19 | 0.33 | 0.15 | 0.28 | 0.96 | 0.58 | 0.96 | 0.58 |
| G293A | 2.18 | 0.25 | 2.20 | 0.25 | 3.20 | 0.14 | 3.20 | 0.14 |
| K332R | 0.12 | 0.19 | 0.14 | 0.21 | -0.33 | 0.13 | -0.33 | 0.13 |
| S337N | 0.27 | 0.28 | 0.26 | 0.28 | 0.34 | 0.59 | 0.34 | 0.59 |
| P355S | -1.34 | 0.35 | -1.35 | 0.34 | -1.95 | 0.57 | -1.95 | 0.57 |
Mean (M) and standard deviation (σ) values are shown for the 19 substitutions for which weights were calculated using machine learning. The values were calculated from 1000 subsamples of the variants with measurable activity from sets 1 and 2, where 5 variant sequences were randomly omitted from each subsample.
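The subsampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a small ridge-regularized least-squares fit (`fit_ridge`, a hypothetical helper) stands in for the machine learning algorithms used in the paper, each variant is encoded as a 0/1 vector marking which substitutions it carries, and the toy data at the end are synthetic.

```python
import random
import statistics


def fit_ridge(X, y, lam=1e-3):
    """Stand-in regressor (assumption): solve (X^T X + lam*I) w = X^T y
    by Gauss-Jordan elimination with partial pivoting."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def subsample_weights(X, y, n_subsamples=1000, n_omit=5, seed=0):
    """Refit on subsamples that each omit n_omit random variants;
    return the per-substitution mean and standard deviation of the weights."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    runs = []
    for _ in range(n_subsamples):
        keep = rng.sample(indices, len(indices) - n_omit)
        runs.append(fit_ridge([X[i] for i in keep], [y[i] for i in keep]))
    per_weight = list(zip(*runs))
    means = [statistics.mean(w) for w in per_weight]
    sds = [statistics.stdev(w) for w in per_weight]
    return means, sds


# Synthetic toy data (not from the paper): 32 variants, 3 substitutions,
# true weights 2, -1 and 0 plus a small deterministic perturbation.
patterns = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
X = patterns * 4
y = [2.0 * x[0] - 1.0 * x[1] + 0.02 * ((i * 7) % 5 - 2)
     for i, x in enumerate(X)]
means, sds = subsample_weights(X, y, n_subsamples=200)
```

The standard deviation across subsamples gives a rough measure of how sensitive each weight is to the particular variants in the training set, which is what the σ columns in the table report.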
Figure 3. Substitution weight mean and standard deviation values produced by the MR algorithm. We created 1000 subsamples of the training set (the sequences and non-zero activities of variants from sets 1 and 2) by leaving out 5 randomly selected variants from each subsample. A: The MR (matching loss) algorithm was used to calculate substitution weights for each subsample. The mean values from the 1000 subsamples are indicated by horizontal notches. Error bars represent one standard deviation of the 1000 calculated substitution weights. Substitutions are indicated below the graph with the number of occurrences in the training set in parentheses. Each substitution is described by a single weight. Variant 3–4 was designed to include all substitutions with positive mean weight that occur at least 3 times in the training set (red and blue circles). Note that substitution Y194S (green circle) was not selected since it occurred fewer than 3 times in the training set. Variant 3–9 included all substitutions that occurred at least 3 times and whose mean weight was at least one standard deviation above zero (red circles only). Substitution weights calculated from the entire dataset instead of the mean of 1000 subsamples are shown as purple circles. B: The MR algorithm was used to calculate substitution weights as in A, except that models were tested by expanding each pair in turn into 4 terms and selecting the pair that most improved the model. In this example each substitution is described by a single weight except for the 3 pairs (132,208), (337,355), (267,293), which are modeled by 4 weights each. Red circles indicate the substitutions selected to design variant 3–14. Note that substitution combination I132V 208K was not selected since it occurred fewer than 3 times in the training set.
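One way to read "expanding a pair into 4 terms" in panel B is as a one-hot encoding of the four joint states of two positions (wild type or substituted at each). The sketch below illustrates that reading only; it is an assumption about the encoding rather than the authors' code, and the function names `expand_pair` and `encode` are hypothetical.

```python
def expand_pair(has_a, has_b):
    """One-hot encode the joint state of two positions.

    Order of the 4 terms: (wt, wt), (wt, sub), (sub, wt), (sub, sub).
    """
    return [int(has_a == a and has_b == b)
            for a in (False, True) for b in (False, True)]


def encode(variant, singles, pairs):
    """Build a feature vector: one 0/1 term per independent substitution,
    then four terms per interacting pair."""
    features = [int(s in variant) for s in singles]
    for a, b in pairs:
        features.extend(expand_pair(a in variant, b in variant))
    return features


# Illustrative variant carrying I132V and G293A but keeping wild-type K208.
singles = ["Y151A", "G293A"]
pairs = [("I132V", "K208H")]
vector = encode({"I132V", "G293A"}, singles, pairs)
# vector == [0, 1, 0, 0, 1, 0]: the (sub, wt) term of the pair is set,
# which corresponds to a combination such as "I132V 208K".
```

Under this encoding each of the four combinations receives its own weight, so a pair is no longer forced to contribute additively.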
Figure 4. Activities of variants designed using substitution weights. Activities towards N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide were measured at 37°C following a 5 minute heat treatment of the enzyme at 68°C. Activities are expressed relative to the mean activity of duplicates of wild-type proteinase K. Error bars represent one standard deviation of the activity measurements. Variants are grouped according to the machine learning algorithm used to calculate substitution weights (indicated below each group), and are compared with the best variants from the initial design set (variants 1–40 and 1–50, black bars, on the left). The first design (yellow bars, design method G in Additional file 2) of each group belongs to set 2. We included a substitution in the design if it occurred at least three times in the training set and its mean weight was at least one standard deviation above zero. All remaining designs in each group belong to set 3. The second in each group (green bars, design method J in Additional file 2) includes substitutions occurring at least three times and whose mean weights were merely positive (e.g. Figure 3A, red and blue circles). The third in each group (red bars, design method K in Additional file 2) contained all substitutions occurring at least three times and whose mean weight was at least one standard deviation above zero (e.g. Figure 3A, red circles). Note that this third design in each group is always better than the second. The last variant(s) in each group (blue bars, design method L in Additional file 2) were designed by modeling interdependent substitutions (e.g. Figure 3B, red circles).
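The two selection rules described in this caption (mean weight merely positive, or mean weight at least one standard deviation above zero, each with a minimum of three occurrences) can be written as a short filter. This is a minimal sketch: the function name and data layout are hypothetical, the means and standard deviations are copied from the first pair of columns of the weights table (whose method is not identified in this record), and the occurrence counts are invented for illustration.

```python
def select_substitutions(stats, min_count=3, sd_margin=False):
    """Return substitutions that occur at least min_count times and whose
    mean weight is positive (default) or at least one SD above zero."""
    selected = []
    for name, (mean, sd, count) in stats.items():
        if count < min_count:
            continue  # too few occurrences in the training set
        passes = mean >= sd if sd_margin else mean > 0
        if passes:
            selected.append(name)
    return selected


# name: (mean weight, standard deviation, occurrences)
# Occurrence counts are hypothetical.
stats = {
    "G293A": (1.95, 0.13, 10),
    "Y151A": (1.18, 0.23, 6),
    "K208H": (0.28, 0.15, 5),
    "S273T": (0.12, 0.14, 4),
    "Y194S": (0.27, 0.20, 2),
    "M145F": (-1.46, 0.19, 7),
}

lenient = select_substitutions(stats)                 # positive mean only
strict = select_substitutions(stats, sd_margin=True)  # mean >= one SD
```

With these illustrative inputs the stricter rule drops S273T, whose mean weight is positive but smaller than its standard deviation, and both rules drop Y194S because of its low count.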
Figure 5. Machine learning design compared with random choices and "expert" designs. The distributions of activities of 4 sets of variants designed using different methods are shown. Set A (white bars, variants 1–2, 1–6, 1–12, 1–13 and 1–34 to 1–49, total of 20 variants) contains arbitrarily selected combinations of 3, 5 or 6 substitutions. Set B (light shading, variants 1–50 to 1–59, total of 10 variants) was designed by manual analysis of the sequence and activity data from variants 1–1 through 1–49. Set C (dark shading, variants 2–1 to 2–20, total of 20 variants) was designed using machine learning algorithms based on the data from variants 1–1 through 1–59. Set D (black fill, variants 3–1 to 3–16, total of 16 variants) was designed using machine learning algorithms based on the data from variants 1–1 through 1–59 and 2–1 through 2–20.
Figure 6. Increases in proteinase K activity with and without heating. Proteinase K variants were tested from triplicate independent cultures for activity after heating at 68°C for different times: unheated (circles), 2.5 minutes (squares), 5 minutes (crosses), 7.5 minutes (triangles), 10 minutes (diamonds) and 15 minutes (open squares). A: absorbance at 405 nm of substrate incubated with wild type proteinase K, B: absorbance at 405 nm of substrate incubated with variant 3–9.
Figure 7. Changes in activity and half-life in designed protein variants. Activity (unheated) and half-life were calculated for 13 protein variants and wild-type proteinase K. The activity without heating was calculated from the initial slopes of the A405 measurements without heating (white bars), examples shown in Figure 6. The half-life at 68°C (shaded bars) was calculated using the initial slopes after different heating times and fitting to an exponential curve. Error bars represent one standard deviation of the experimental measurements. The wild-type values are shown on the left. The substitutions of each variant are given in the column below the variant name. Only 10 of the 19 positions are shown. In the remaining 9 positions, all variants contained amino acids from the wild-type sequence.
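The half-life calculation can be illustrated with a simple exponential decay model, slope(t) = A0 · exp(−k·t), for which the half-life is ln 2 / k. The sketch below estimates k by a least-squares line through log(slope) versus heating time; the caption does not state how the fit was performed, so the log-linear fit is an assumption, and the slope values are synthetic.

```python
import math


def half_life(times, slopes):
    """Estimate half-life from initial slopes measured after different
    heating times, assuming slope(t) = A0 * exp(-k * t).

    k is taken from a least-squares line through log(slope) vs. time.
    """
    logs = [math.log(s) for s in slopes]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    sxx = sum((t - t_mean) ** 2 for t in times)
    k = -sxy / sxx  # decay rate constant
    return math.log(2) / k


# Heating times (minutes) as listed for Figure 6; slopes are synthetic,
# generated for an enzyme whose true half-life is 5 minutes.
times = [0.0, 2.5, 5.0, 7.5, 10.0, 15.0]
slopes = [0.5 ** (t / 5.0) for t in times]
estimate = half_life(times, slopes)
```

A log-linear fit weights low-activity points more heavily than a direct nonlinear fit would, so with noisy measurements at long heating times the two approaches can give somewhat different estimates.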