Alexander A Vinogradov¹, Jun Shi Chang¹, Hiroyasu Onaka²,³, Yuki Goto¹, Hiroaki Suga¹
1. Department of Chemistry, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.
2. Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan.
3. Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan.
Abstract
Promiscuous post-translational modification (PTM) enzymes often display nonobvious substrate preferences by acting on diverse yet well-defined sets of peptides and/or proteins. Understanding of substrate fitness landscapes for PTM enzymes is important in many areas of contemporary science, including natural product biosynthesis, molecular biology, and biotechnology. Here, we report an integrated platform for accurate profiling of substrate preferences for PTM enzymes. The platform features (i) a combination of mRNA display with next-generation sequencing as an ultrahigh throughput technique for data acquisition and (ii) deep learning for data analysis. The high accuracy (>0.99 in each of two studies) of the resulting deep learning models enables comprehensive analysis of enzymatic substrate preferences. The models can quantify fitness across sequence space, map modification sites, and identify important amino acids in the substrate. To benchmark the platform, we performed profiling of a Ser dehydratase (LazBF) and a Cys/Ser cyclodehydratase (LazDEF), two enzymes from the lactazole biosynthesis pathway. In both studies, our results point to complex enzymatic preferences, which, particularly for LazBF, cannot be reduced to a set of simple rules. The ability of the constructed models to dissect such complexity suggests that the developed platform can facilitate a wider study of PTM enzymes.
Enzymes which perform post-translational modification (PTM) of
peptides and proteins often display nontrivial preferences by acting
on a wide range of substrates. The nuanced and, in many cases, poorly
understood nature of substrate recognition and engagement by PTM enzymes
has come under scrutiny in several contexts. For example, during the
biosynthesis of ribosomally synthesized and post-translationally modified
peptides (RiPPs),[1,2] notably, lanthipeptides[3,4] and cyanobactins,[5] a single set of PTM
enzymes can modify disparate substrates to assemble multiple natural
products.[6,7] Catalytic promiscuity of RiPP biosynthetic
enzymes suggests numerous bioengineering applications,[8−11] and accordingly, much effort has been dedicated to understanding
the molecular basis for such behavior.[12−17] Likewise, in human biology, dense PTM networks controlled by hundreds
of promiscuous enzymes orchestrate virtually every aspect of cell
behavior, and thus, investigating how PTM enzymes discriminate their
substrates is integral to forming a holistic appreciation of cellular
processes.[18−20]

Substrate specificity profiling studies are
a natural first step
when studying catalysis by promiscuous PTM enzymes. Numerous approaches
developed over the years[21−25] enable streamlined analysis of substrate fitness landscapes, but
every method comes with its own limitations. Platforms based on the
screening of synthetic peptide microarrays[26−30] and saturation mutagenesis approaches[31−33] have relatively low throughput and can suffer from limited generalizability
and accuracy. For example, microarray-derived substrate preferences
of sirtuin lysine deacetylases mismatch known cellular substrates.[34] In vivo library construction methods, particularly
yeast and phage display, offer a much higher throughput (up to ∼10⁹ peptides for testing compared to 10³–10⁴ for microarrays), but developing experimental schemes for
phage/yeast display can be technically difficult, and these approaches
to date have mainly focused on studying proteases.[35−37]

Inference
from large amounts of data is another challenge common
for high-throughput methods. The de facto standard approach is the
computation of position-wise amino acid enrichment scores (usually
displayed as WebLogo sequence alignment plots),[38] which overcompresses the available information and inevitably
loses nuance. Machine learning/deep learning methods represent
a promising alternative to the purely statistical treatment of data.
Deep learning has in recent years proven its ability to make meaningful
generalizations in a variety of complex tasks, but it requires large
amounts of clean, curated data to train and evaluate the models.[39,40] To date, the substrate profiling studies which utilized deep learning
were either data-limited, due to their reliance on peptide microarrays
for data acquisition,[41] or used heterogeneous
data sets,[42−45] which have led to models with modest predictive power.

Messenger
RNA (mRNA) display-based enzyme-profiling strategies[46,47] have recently gained traction as a viable alternative to the established
methods. As a fully in vitro platform, mRNA display can access combinatorial
libraries of vast diversity (>10¹² unique peptides).[48,49] The technique also allows for elaborate manipulation of the libraries
(extensive genetic code reprogramming, affinity purification, chemical
labeling, etc.) and therefore supports the development of workflows
inaccessible to in vivo methods. Still, mRNA display approaches have
thus far revolved around single-point saturation mutagenesis[46,47] and, as such, have typically profiled only hundreds of enzyme substrates
at once, not taking full advantage of the available library diversity.

Here, we report the development of a general platform for assaying
substrate fitness landscapes of PTM enzymes (Figure ). Our approach integrates mRNA display as
the data-generating engine with deep learning workflows to analyze
and learn from the resulting data. Using two RiPP biosynthetic enzymes
catalyzing distinct reactions, we demonstrate that mRNA display-based
substrate selections can provide large amounts of clean, labeled data
to train supervised deep learning classifiers of enzymatic substrate
preferences. The resulting models accurately predict substrate fitness
from a primary sequence and generalize well across the peptide sequence
space. The models have a degree of interpretability that allows for
mapping of modification sites and identification of important residues
in the substrate. Altogether, we believe that the described pipeline
is a powerful tool for studying the dynamics of PTM enzyme/substrate
interactions.
Figure 1
An overview of the workflow for the profiling of LazBF
substrate
preferences. (a) Chemical reaction catalyzed by LazBF. (b) Schematic
overview of mRNA display-based selection/antiselection setups. For
the full protocol, see Supporting Information 2.3. Ⓟ refers to the puromycin linker used to display
the peptides onto cognate mRNAs. Both selection and antiselection
assays can be repeated several times to produce libraries of progressively
increasing (or decreasing) substrate fitness. (c) Schematic overview
of the data analysis pipeline. NGS selection and antiselection data
sets are parsed, preprocessed, and labeled. Peptides are represented
as positionally encoded matrices of ECFPs, and a supervised CNN classifier
is trained on the resulting data to produce models of LazBF substrate
preferences. For a complete description of the data analysis pipeline,
see Supporting Information 2.5.
Results and Discussion
Development of the mRNA Display Scheme for LazBF Profiling
For this study, we focused on PTM enzymes
participating in the
biosynthesis of lactazole A,[50] a natural
product belonging to the thiopeptide family of RiPPs.[51] The compound is a promising bioengineering target because
its five biosynthetic enzymes (LazBCDEF) can convert a wide variety
of sequence-randomized precursor peptides to lactazole-like thiopeptides.[47,52] LazBF, a split Ser dehydratase homologous to class I lanthipeptide
synthetases (Figure a),[3] plays a central role during lactazole
biosynthesis because its operation to install four dehydroalanine
(Dha) residues into precursor peptide LazA requires precise timing
and selectivity.[53] Mechanistically, LazBF
operates via a two-step process akin to class I Ser/Thr dehydratases.[54,55] The glutamylation domain (LazB) binds the N-terminal leader peptide
(LP) region of LazA (LazALP) and promotes Ser glutamylation
in the downstream core peptide (CP) section using Glu-tRNAGlu as the acyl donor. In the second step, the elimination domain (LazF)
catalyzes a retro-Michael reaction in the Ser(OGlu) intermediate to
yield the Dha-containing product.[56] Although
preliminary enzyme characterization indicated that LazBF prefers a
Trp residue in position +1 relative to the modification site, the
enzyme also displayed more elaborate preferences which eluded generalization.[47,53] Here, we sought to develop an mRNA display/deep learning-based platform
for comprehensive profiling of LazBF substrate fitness landscapes.

We envisioned training a supervised learning classifier that could
predict the fitness of LazBF substrates from their primary sequence.
To that end, the acquisition of two mRNA display data sets (one corresponding
to substrates of high fitness and another for peptides which are not
dehydrated by LazBF) was necessary. We anticipated that the treatment
of a diverse library of mRNA-displayed peptides with LazBF would dehydrate
some, but not all, library members (Figure b). The modified peptides, i.e., those containing
a Dha residue, are reactive toward thiols[57] and can be selectively conjugated to a biotinylated probe (biotin-PEG2-SH; Figure S1d). The labeling
reaction enables the separation of modified and unmodified substrates
using a streptavidin (SAv) pulldown, which selectively isolates biotinylated
products. The subsequent PCR amplification of either the SAv-bound
or unbound fraction recovers DNA libraries encoding peptides of increased
or decreased fitness, respectively. Iterative repetition of this process
should produce increasingly enriched peptide populations. During a
“selection”, SAv-bound DNA is amplified to enrich for
substrates of high fitness, while an “antiselection”
recovers the unbound DNA fraction to generate a data set of poor substrates.

To establish the assay, we designed an mRNA library encoding peptides
bearing the LazALP sequence followed by a randomized CP,
an HA tag for affinity purification, and a flexible C-terminal linker
(library 5S5; Figure ). Every CP contained a potentially modifiable Ser residue flanked
by five random amino acids on either side (theoretical diversity:
1 × 10¹³ sequences) to establish substrate recognition
requirements around the dehydration site. Our preliminary experiments
indicated that library 5S5 contained substrates of differential fitness.
First, we selected three such peptides (bAP1–3, in order of
decreasing fitness; Figure S1) to establish
the experimental conditions. The treatment of the peptides expressed
by the flexible in vitro translation (FIT) system[58] with 2 μM LazBF, 20 μM tRNAGlu,
and 1 μM GluRS for 2 h led to the quantitative dehydration of
bAP1, partial modification of bAP2, and virtually no reaction for
bAP3 (Figure S1a–c). Further incubation
of the reaction products with 5 mM biotin-PEG2-SH at pH
8.5 on ice for 17 h resulted in selective and nearly quantitative
biotinylation of Dha-containing peptides, indicating the feasibility
of the envisioned experimental scheme (Figure S1e).
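The quoted theoretical diversity of library 5S5 can be checked with a short calculation. The sketch below assumes that each of the ten random positions can hold any of the 20 proteinogenic amino acids with equal probability, which ignores the codon bias of a real library; the names `AMINO_ACIDS`, `theoretical_diversity`, and `random_core_peptide` are illustrative and do not come from the original work.

```python
import random

# Assumption: 20 proteinogenic amino acids, all allowed at every random position.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FLANK = 5  # random residues on each side of the fixed Ser


def theoretical_diversity(flank: int = FLANK) -> int:
    """Number of distinct core peptides with `flank` random residues per side."""
    return len(AMINO_ACIDS) ** (2 * flank)


def random_core_peptide(rng: random.Random, flank: int = FLANK) -> str:
    """Draw one core peptide: `flank` random residues, a fixed Ser, `flank` more."""
    left = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
    right = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
    return left + "S" + right


if __name__ == "__main__":
    # 20**10 is roughly 1 x 10^13, matching the figure quoted in the text.
    print(f"{theoretical_diversity():.2e}")
    print(random_core_peptide(random.Random(0)))
```

Because the random positions may themselves contain Ser, a sampled core peptide can carry more than one potentially modifiable residue.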
Figure 2
mRNA display profiling of LazBF leads to enriched peptide
populations
suitable for deep learning applications. (a, b) Summary of the selection
(a) and antiselection (b) experiments. Plotted are respective DNA
recovery and enrichment values measured by qPCR after every round
of mRNA display. (c) Data set convergence at the amino acid level
as measured by log₂Y* scores. Amino acid aa in position pos is enriched in the selection data
set compared to the antiselection one if log₂Y*(aa, pos) is greater than 0. See also the definitions in the figure header
and Supporting Information 2.1; c(aa, pos) is the number of NGS reads with
amino acid aa in position pos in
a data set. (d) CNN classifier accuracy as a function of the number
of mRNA display rounds. The models were trained on 4.75 × 10⁵ samples from the respective data sets, using 0.25 ×
10⁵ unseen samples for validation. Multiple rounds of mRNA
display lead to cleaner data sets and, hence, more accurate models.
(e) CNN classifier accuracy as a function of the training data set
size. The models were trained on round 6 data. Model accuracy scales
with the size of the training data set. (f) Validation of model predictions
against experimental data. 65 validation peptides (bVP1–65;
all encoded in library 5S5; see also Table S4) were expressed by the FIT system and treated with LazBF/GluRS/tRNAGlu for 2 h. Reaction outcomes were analyzed by LC-MS as described
in Supporting Information 2.8. Model predictions
showed good agreement with the experiment.

Next, we tested whether these substrates could be differentiated
in an mRNA display format. Peptide-mRNA/DNA chimeras prepared following
the standard techniques (Supporting Information 2.2 and 2.3)[49] were modified under
the aforementioned conditions, and, following SAv pulldown, the amount
of DNA in either bound or unbound fraction was quantified by qPCR.
A large difference in DNA recovery between bAP1 and bAP3 was observed
(r; defined as the ratio of DNA in the bound over
the unbound fractions; r(bAP1) = 4.6 vs r(bAP3) = 0.008; Figure S1f) but only when intermediate HA-affinity purification and acetone
precipitation steps (aimed to eliminate unreacted biotin-PEG2-SH and mRNAs that failed to display peptides) were included (data
not shown). Enrichment, defined as the ratio of DNA recovery in the
experiment over the negative control (an analogous assay where LazB
is omitted from the enzyme mix), also pointed to the enzyme-dependent
DNA recovery in the bound fraction (enrichment(AP1) = 223
vs enrichment(AP3) = 1.2; Figure S1f). Combined, these data indicate that the developed mRNA display
pipeline can discriminate LazBF substrates based on their fitness.

With these protocols, we performed six rounds of selection and
antiselection for library 5S5 following the established conditions,
except, starting with round 4 of the selection experiment, the LazBF
incubation time was shortened to 15 min to adjust selection pressure.
The enrichment values increased between rounds during the selection
experiment (Figure a), suggesting that the substrate populations of progressively higher
fitness were obtained. In contrast, for antiselection, after the initial
decrease in round 2, DNA recovery and enrichment remained relatively
constant (Figure b).
Next-generation sequencing (NGS) of the resulting DNA libraries revealed
that, even after six rounds of selection/antiselection, the substrate
populations remained highly diverse and had no convergence at the
peptide level, which stands in contrast to traditional affinity-based
mRNA display selection workflows (Figure S2). To analyze convergence at the amino acid level, we computed Y*(aa, pos) scores as a measure of enrichment for amino acid aa in position pos in the selection data
set compared to the antiselection one (Figure c). Thus, amino acid aa in
position pos appears to be favored by the enzyme
if its log transformed Y* score, log₂Y*, is greater than
zero, and disfavored when log₂Y* < 0. This analysis
recapitulated our previous[47,53] findings: for example,
Trp in position 7, i.e., position +1 to the fixed Ser residue, had
the highest Y* score (log₂Y* = 2.53), whereas Asp and Glu,
which are known to be disfavored by LazBF,[47,53] had a negative log₂Y* in every position. Overall, the
amino acids around the designed modification site (positions 5 and
7) were subject to a stronger discrimination than those further away
from Ser6 (compare position-wise variances of log₂Y* scores; Figure c). For any library
member, a statistical fitness score, S, can be computed
as the sum of log₂Y* for every amino acid in the variable
region. We found that representing peptides in the S-space is an effective
way to perform data set-wide analysis of substrate populations. For
example, consistent with the qPCR results, this analysis revealed
(Figure S3) that the selection generated
a highly enriched substrate subpopulation (1.7σ higher than
the naïve library), whereas the antiselection did not because
the antiselection peptides resembled the naïve library members
(Δ0.5σ). Altogether, these data suggest that the assay
produced enriched yet highly diverse substrate populations suitable
for further analysis.
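The position-wise enrichment analysis described above can be sketched in a few lines. The exact definition of Y* is given in the Supporting Information and is not reproduced in this text, so the code below assumes a simple form: the ratio of the frequency of an amino acid at a position in the selection data set to its frequency in the antiselection data set, with a pseudocount (an added assumption) to avoid division by zero. The function names are illustrative.

```python
import math
from collections import Counter


def position_counts(peptides):
    """Return one Counter per position with amino acid counts c(aa, pos).

    Assumes all peptides have the same length.
    """
    counts = [Counter() for _ in range(len(peptides[0]))]
    for pep in peptides:
        for pos, aa in enumerate(pep):
            counts[pos][aa] += 1
    return counts


def log2_y_star(selection, antiselection, pseudocount=1.0):
    """Assumed form: log2 of the selection/antiselection frequency ratio."""
    sel = position_counts(selection)
    anti = position_counts(antiselection)
    n_sel, n_anti = len(selection), len(antiselection)
    scores = []
    for pos in range(len(sel)):
        alphabet = set(sel[pos]) | set(anti[pos])
        scores.append({
            aa: math.log2(((sel[pos][aa] + pseudocount) / n_sel)
                          / ((anti[pos][aa] + pseudocount) / n_anti))
            for aa in alphabet
        })
    return scores


def s_score(peptide, scores, variable_positions):
    """Statistical fitness score S: sum of log2 Y* over the variable positions."""
    return sum(scores[pos].get(peptide[pos], 0.0) for pos in variable_positions)
```

In this sketch an amino acid never observed at a position contributes 0 to S, which is a simplification; with deep NGS data sets such cases should be rare.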
Development and Validation of Deep Learning Models
Next, we turned to the development of a deep learning
workflow (Figure c).
We sought a scalable
and generalizable pipeline to build interpretable models which can
facilitate downstream enzymatic studies. After considerable experimentation,
we opted for a straightforward data preprocessing routine: NGS data
were in silico translated, denoised, and demultiplexed, after which
the resulting peptide data sets were labeled (Supporting Information 2.5; Table S3). All selection and antiselection peptides received a label of “1”
and “0”, respectively. A number of more sophisticated
workflows, which included data preclustering, outlier detection, or
quantification of relative fitness from NGS read counts, were rejected
as they consistently led to models of a poorer performance. Peptide
sequences were represented as matrices of positionally encoded amino
acid-wise extended connectivity fingerprints (ECFPs; Supporting Information 2.5, Figures S4 and S5),[59] a technique that has been recently applied to
train models which take peptide sequences as input data.[60,61] ECFP representations are built directly from the chemical structures
of constituent amino acids, and thus, they bypass the limitation of
many popular approaches based on biophysical descriptors,[62,63] which are typically limited to 20 proteinogenic amino acids. At
the same time, ECFP representations are more interpretable than one-hot
encoding and related techniques. A deep convolutional neural network
(CNN; 2.5 million parameters; Supporting Information 2.5 and Figure S6) was selected as the model architecture,
primarily due to its fast training. However, we note that neither
the choice of the model architecture (also tested were recurrent networks,
transformers, and fully connected networks) nor the nature of peptide
representation was particularly critical from the accuracy perspective.

With these methods, we turned to benchmarking the overall workflow.
First, we ascertained whether multiple rounds of mRNA display were
important by training CNN models on NGS data for each selection/antiselection
round using 4.75 × 10⁵ and 0.25 × 10⁵ samples for training and validation, respectively (Figure d). Model accuracy increased
from 0.823 for round 1 data to 0.992 for round 6, indicating that
multiple rounds of mRNA display can furnish progressively cleaner
data sets for deep learning. The amount of training data also proved
important. Although reasonable models could be trained on as few as
10² peptides from the round 6 data set (Figure e), the log–log plot
of the accuracy versus the number of training samples was nearly linear,
with model accuracy reaching up to 0.997 when 10⁷ peptides
were used for training. Notably, saturation in method performance
was not observed in either experiment, which suggests that running
more rounds of mRNA display and/or increasing the sequencing depth
could further improve the accuracy of the method. The latter approach
might be particularly straightforward because the throughput of contemporary
NGS instruments reaches 10¹⁰ reads/day.[64] We also benchmarked our workflow against several traditional
machine learning methods (k-nearest neighbors, adaptive/gradient
boosting, logistic regression, and random forest classifiers) and
found that deep neural networks were consistently superior (Figure S7a).

The experiments above evaluated
model performance in simple classification
tasks where a model is tasked with assigning library 5S5 peptides
as belonging to either the selection or antiselection data sets, with
NGS data used as the ground truth. In the final experiment, we evaluated
whether the models could make more biochemically meaningful predictions,
i.e., whether they generalize beyond NGS data and agree with experimentally
determined substrate fitness values. To this end, we semirandomly
selected 65 library 5S5 members to ensure a fair test of the model
performance (“validation peptide set”, bVP1–65;
see Supporting Information 2.6 for sequence
choices and Table S4) and experimentally
investigated their dehydration by LazBF in batch format. The peptides
expressed by the FIT system were incubated with LazBF/tRNAGlu/GluRS for 2 h under the same conditions as for the mRNA display
pipeline. Reaction outcomes were quantified by LC-MS and summarized
as modification efficiency values (see Supporting Information 2.8 for details). The model training pipeline was
modified to exclude all validation peptide sequences from the training
data set using a Hamming distance ≤ 2 as the cutoff value.
Overall, we found that the model predictions tracked the experimental
values (Pearson correlation coefficient, ρP = 0.968; Figure f), indicating that
despite being trained as a classifier, the model could also quantify
substrate fitness with the mean prediction error of 0.08 ± 0.09
(±σ). The ability to quantify substrate fitness was in
line with the model’s performance on NGS data sets; that is,
the quantification accuracy depended on the amount of training data
and the number of mRNA display rounds (Figure S8a,b). The model excelled at identifying high fitness substrates,
whereas underprediction of reaction yields for moderately poor peptides
(those with modification efficiencies of ∼0.05–0.15)
was the most common source of error.

Altogether, these data
demonstrate that the developed mRNA display/deep
learning platform can produce accurate models capable of extrapolating
substrate fitness across the peptide sequence space. In the subsequent
series of experiments, we deployed the model to understand the high-level
features of the LazBF substrate space.
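The exclusion of validation peptides from the training data, described above, can be illustrated as follows. This is a minimal sketch under one reading of the text: a training peptide is dropped if it lies within a Hamming distance of 2 or less from any validation peptide. The function names are hypothetical, and the original pipeline may implement the filter differently.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def exclude_near_validation(training, validation, cutoff=2):
    """Keep only training peptides farther than `cutoff` from every validation peptide."""
    return [
        pep for pep in training
        if all(hamming(pep, val) > cutoff for val in validation)
    ]
```

With 65 validation peptides the pairwise comparison is linear in the size of the training set, so a plain loop of this kind should remain practical even for millions of sequences.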
Model-Guided Population-Level Analysis of LazBF Substrates
In striking contrast to the
performance of deep learning models,
mRNA display-based statistical metrics such as the S score bore close to no predictive power for the validation peptide
set (Figure a). To
see whether this is generally true for LazBF substrates, we generated
5 × 10⁶ random peptides from library 5S5 in silico
and estimated their fitness using the model. The analysis of the distribution
of model predictions in the S-space demonstrated that statistical
enrichments could confidently point to a small fraction of poor substrates
(S < −5), but for high fitness peptides,
the uncertainty of the prediction was too high to be practically useful
(Figure b). For example,
an average peptide with S = 2.5 had a predicted modification
efficiency of 0.49 ± 0.45 (±σ). Representing the outcomes
of high-throughput enzyme-profiling experiments as positional amino
acid preferences is a common practice. Our results (see also the data
for LazDEF below) suggest that at least for some RiPP enzymes such
a practice should be exercised with caution, although it remains to
be established how general this phenomenon is.
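The comparison of model predictions with S scores amounts to binning peptides by S and summarizing the predictions in each bin. The sketch below shows one way to compute such binned statistics; the bin width and the function name are assumptions made for illustration, and the S scores and predictions are taken as precomputed inputs.

```python
import math
from statistics import mean, pstdev


def binned_stats(s_scores, predictions, bin_width=1.0):
    """Group predictions by S-score bin; return {bin_start: (mean, std, count)}."""
    bins = {}
    for s, pred in zip(s_scores, predictions):
        # Lower edge of the bin that contains s.
        key = math.floor(s / bin_width) * bin_width
        bins.setdefault(key, []).append(pred)
    return {
        key: (mean(vals), pstdev(vals), len(vals))
        for key, vals in sorted(bins.items())
    }
```

A bin whose standard deviation is close to its mean, as in the S = 2.5 example quoted above (0.49 ± 0.45), indicates that the S score carries little information about fitness in that region.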
Figure 3
Model enables high-level
analysis of LazBF substrate fitness landscapes.
(a) Experimentally measured modification efficiencies of validation
peptides (bVP1–65; Table S4) as a function of their S scores. S scores cannot be used to reliably
predict fitness of bVP peptides. (b) Distribution of model predictions
in the S-space. Substrate fitness of 5 × 10⁶ random 5S5 peptides was evaluated with the model. Plotted
are binned statistics of model predictions in the S-space. The overall
distribution of the peptides in the same space is displayed for reference.
The analysis reveals that at best S scores can be
reliably used as antideterminants of substrate fitness (when S < −5). (c) Pairwise epistasis between variable
positions in the CP of 5S5 peptides. The model was utilized to compute
abs(epi) scores using predictions for 5 × 10⁶ sequences from (b). The resulting values can be used to estimate
how strongly amino acids in the substrate affect each other’s
fitness. Higher values correspond to stronger second-order effects.
See Supporting Information 2.1 for computation
details. (d) Analysis of epistatic interactions in bVP33. Average
model calls were computed for 2 × 10⁴ partially random
in silico generated peptides in each case; “x” denotes any amino acid except Ser. (e) Visualization of
all pairwise epistatic interactions in bVP33. Strong epistasis inside
the His4-Pro5-Ser6-Arg7-Trp8 motif contributes to the high fitness
of the peptide.

Statistically, poor performance
of S scores in
predicting substrate fitness points to prevalent higher-order effects;
i.e., the fitness of an amino acid in a given position strongly depends
on the rest of the substrate sequence and should not be treated as
an independent variable. To quantify some of these effects, we employed
the model to compute pairwise epistasis between substrate amino acids
in various positions and summarized the results as epi score values (for definition, see Supporting Information 2.1). A positive epi score corresponds
to a synergistic effect between amino acids aa1 and aa2 in positions pos1 and pos2, respectively. Conversely, a negative epi score
indicates that on average a substrate containing aa1 and aa2 in positions pos1 and pos2 has a lower-than-expected fitness if statistical independence
of amino acids was assumed (see Figure S9 for examples). Averaging of absolute epi scores
over aa1 and aa2 can be utilized
to estimate how strongly pos1 and pos2 influence each other. This analysis showed (Figure c) that amino acids around the modification
site (positions 4, 5, 7, and 8) have stronger pairwise epistasis than
those distal from Ser6, although a number of long-range interactions
were still noticeable (for example, position 1 to position 7; epi = 0.21). Overall, such second-order effects dominated
the substrate fitness landscape for LazBF, which explains why simple
amino acid enrichment metrics had limited predictive power. For instance,
validation peptide bVP33 underwent near quantitative dehydration by
LazBF (0.97) due to the presence of the His4-Pro5-Ser6-Arg7-Trp8 motif.
Multiple pairwise epistatic interactions within the motif (Figure d,e) facilitated
substrate fitness despite the low statistical fitness score (S = 0.04), and no single amino acid was solely responsible
for the high modification efficiency.
The diversity and abundance
of epistatic interactions in LazBF
substrates suggest that the enzyme likely makes extensive but transient
contacts with the substrate’s CP during the two-step catalysis.
Despite the variety of high fitness substrates, LazBF is less promiscuous
than it might appear, as only 4% of library 5S5 peptides were predicted
to undergo efficient dehydration (Figure d).
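The pairwise analysis described above can be illustrated with a minimal sketch. The actual epi score is defined in Supporting Information 2.1, which is not reproduced here, so the code below assumes a simple additive null model: the expected score of a pair is the sum of the two single-residue mean scores minus the overall mean. The function names and the toy data are hypothetical, and positions are 0-indexed.

```python
from collections import defaultdict
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def pairwise_epi(peptides, scores):
    """Observed minus expected mean score for every residue pair.

    ASSUMPTION: 'expected' comes from an additive independence model,
    which may differ from the definition used in the paper.
    """
    overall = mean(scores)
    single = defaultdict(list)   # (pos, aa) -> scores of peptides carrying it
    double = defaultdict(list)   # (pos1, aa1, pos2, aa2) -> scores
    for pep, score in zip(peptides, scores):
        for pos, aa in enumerate(pep):
            single[(pos, aa)].append(score)
        for (p1, a1), (p2, a2) in combinations(enumerate(pep), 2):
            double[(p1, a1, p2, a2)].append(score)
    epi = {}
    for (p1, a1, p2, a2), vals in double.items():
        expected = mean(single[(p1, a1)]) + mean(single[(p2, a2)]) - overall
        epi[(p1, a1, p2, a2)] = mean(vals) - expected
    return epi


def positional_coupling(epi):
    """Average |epi| over all amino acid pairs for each pair of positions."""
    acc = defaultdict(list)
    for (p1, _a1, p2, _a2), value in epi.items():
        acc[(p1, p2)].append(abs(value))
    return {pair: mean(vals) for pair, vals in acc.items()}


# Toy two-residue "peptides": fitness is high only for matching residues,
# so neither position is informative on its own.
toy_peptides = ["AA", "AB", "BA", "BB"]
toy_scores = [1.0, 0.0, 0.0, 1.0]
toy_epi = pairwise_epi(toy_peptides, toy_scores)
toy_coupling = positional_coupling(toy_epi)
```

In this toy case every single-residue mean equals the overall mean, so enrichment-style metrics carry no signal, while the pairwise terms recover the full pattern. This mirrors, in miniature, why second-order effects can limit the predictive power of per-position statistics.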
Model-Guided Peptide-Level Analysis of LazBF Substrates
Integrated gradients (IGs) are a popular method
for interpreting
predictions of deep learning models.[65] As
an attribution technique, IGs seek to understand how individual input
features affect a particular prediction by the model. Because in our
case peptides are represented as a matrix of ECFPs, IGs can be projected
onto the chemical structures of input sequences to visualize model
attributions at an atomic resolution. We found this technique insightful
in two ways. First, IG attributions facilitated the assignment of
PTM sites. For several validation set peptides containing multiple
Ser residues in the CP region, the treatment with LazBF yielded chromatographically
homogeneous singly dehydrated species (see bVP17, 25, 37, and 51 in Figures S10a, S11a, 4a, and S12a, respectively),
pointing to selective modification of one Ser residue. For bVP37,
the model attributed its high score prediction (0.985) primarily to
Ser10 (Figure b),
while Ser6 was deemed less important, suggesting that the dehydration
occurred at the former residue. Tandem mass-spectrometry (MS/MS) of
dehydrated bVP37 unambiguously located the modification site to Ser10
(Figure c), and similar
analysis performed for bVP17, 25, and 51 confirmed that IG attributions
can be utilized to predict modification sites (Figures S10, S11, and S12). Second, the technique could also
be leveraged to dissect the contributions of individual amino acids
toward the overall substrate fitness. For several validation set peptides,
amino acid-wise IG attributions designated a single amino acid, often
far from the modification site (Figures d and S13), as
the major reason for a poor dehydration efficiency. Indeed, single-point
mutations at specified amino acids improved the experimentally observed
substrate fitness in every case.
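To make the attribution idea concrete, the following is a minimal sketch of integrated gradients on a toy logistic model. It is not the CNN or the ECFP representation used in this study; the weights, the binary input vector, and the all-zero baseline are assumptions chosen for illustration. The path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, bias):
    """Toy stand-in for a classifier: logistic regression on a feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def model_grad(x, w, bias):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, bias, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a(x - x')) da."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, w, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]


# Hypothetical binary features (loosely analogous to fingerprint bits).
weights = [2.0, -1.0, 0.5, -3.0]
bias = 0.1
x = [1.0, 0.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)
```

A useful sanity check is the completeness property: the attributions sum to the difference between the model output for the input and for the baseline, so each value can be read as a share of the prediction. Features that are identical in the input and the baseline receive zero attribution.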
Figure 4
Model-guided dissection of the substrate
preferences of LazBF.
(a) LC-MS analysis of bVP37 dehydration by LazBF [a broad extracted
ion chromatogram (brEIC) and a composite MS spectrum integrated
over substrate-derived peaks showing the overall product distribution;
see Supporting Information 2.8 for LC-MS
details]. (b) Atom- and bond-wise accumulated IG attributions for
bVP37. The model suggests that Ser10 is the primary determinant of
the high modification efficiency. (c) A zoomed-in section of a charge-deconvoluted
CID fragmentation spectrum for singly dehydrated bVP37; y-ion assignments and neutral molecule losses are omitted for clarity.
The spectrum allows unambiguous assignment of the dehydration site
to Ser10, consistent with the model’s suggestion. See Figures S10–12 for more examples. (d)
Amino acid-wise IGs provide an intuition for relative amino acid contributions
to the total substrate fitness. Experimentally measured increase in
modification efficiency for three single-point mutants of bVP32, 36,
and 58 underscores the model’s ability to identify amino acids
critical for LazBF-mediated dehydration. See Figure S13 for more examples. (e) Substrate space traversal study
for bVP29 (see also the accompanying text). The model was employed
to find a sequence of bVP29 mutants which alter the substrate fitness
at each step. The route identified by the model was validated experimentally.
Collectively, this study points to the complex and unintuitive substrate
preferences of LazBF.
The model, together with the aforementioned techniques, enabled
a detailed evaluation of LazBF’s catalytic promiscuity. Ultimately,
we found that the complexity of the substrate landscape, as hinted
at by the analysis of epistatic interactions, precludes reasonable
simplifications to a set of straightforward rules. To illustrate the
intricacy of LazBF substrate preferences, we performed a sequence
space traversal study for one validation set peptide, bVP29. Specifically,
we utilized the model to find a chain of mutations which drastically
alter substrate fitness at each step (Figure e). The model pointed to numerous inconspicuous
amino acid replacements distal from the modification site which either
abrogated (for example, L2A mutation in bVP29.4) or restored (A3R
in bVP29.7a) LazBF-mediated dehydration at Ser6. Altogether, we found
that (i) the presence of an aromatic amino acid next to the modification
site or elsewhere in the CP is not absolutely required for modification
(bVP29.7b and bVP29.9b); (ii) even though in general negatively charged
Glu/Asp in the CP region strongly decrease substrate fitness, some
peptides instead rely on the presence of Glu for dehydration (see
E1L mutation in bVP29.7b and the corresponding IG attributions); and
(iii) analogous mutations in homologous substrates (G4A for bVP29.8a
and bVP29.8b) can lead to opposite outcomes.
Given the uncovered
complexity of LazBF preferences, and hence
the infeasibility of manual annotations of substrate fitness for the
enzyme, we argue that the models constructed with our platform represent
a powerful tool to facilitate the study of promiscuous lanthipeptide
and thiopeptide dehydratases. Our results demonstrate that the substrate
preferences for LazBF, as obfuscated as they might be, are discernible,
and with enough training data, deep learning can furnish models which
are both accurate and generalizable.
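The sequence space traversal for bVP29 can be sketched as a greedy search over single-point mutants. The search procedure actually used is not described in this section, so everything below is an assumption: the greedy rule (pick the unvisited mutant with the largest absolute change in predicted score), the `min_delta` threshold, and the `toy_predict` function, which merely stands in for a trained model.

```python
def single_point_mutants(peptide, alphabet, fixed_positions):
    """Yield every single-residue variant, leaving fixed positions untouched."""
    for pos, original in enumerate(peptide):
        if pos in fixed_positions:
            continue
        for aa in alphabet:
            if aa != original:
                yield peptide[:pos] + aa + peptide[pos + 1:]


def traverse(start, predict, alphabet, fixed_positions, n_steps, min_delta=0.5):
    """Greedy chain of mutants in which each step changes the predicted score
    by at least min_delta. Returns a list of (sequence, score) pairs."""
    path = [(start, predict(start))]
    visited = {start}
    for _ in range(n_steps):
        current, current_score = path[-1]
        candidates = [
            m for m in single_point_mutants(current, alphabet, fixed_positions)
            if m not in visited
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda m: abs(predict(m) - current_score))
        best_score = predict(best)
        if abs(best_score - current_score) < min_delta:
            break  # no remaining mutant changes the prediction enough
        path.append((best, best_score))
        visited.add(best)
    return path


def toy_predict(peptide):
    """HYPOTHETICAL scorer: 'modified' only if Arg is present and Glu is absent."""
    return 1.0 if "R" in peptide and "E" not in peptide else 0.0


# Position 2 holds the Ser to be modified and is kept fixed (0-indexed).
route = traverse("AESA", toy_predict, "AERSW", {2}, n_steps=3)
```

With a real classifier, `predict` would wrap the model call, and each step of the route would then be a candidate for experimental validation.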
LazDEF Profiling
In the final series of experiments,
we explored how well the developed platform can be expanded to other
PTMs. We chose LazDEF, another component of the lactazole biosynthesis
pathway, as the model for this study. LazDE is a YcaO family enzyme[66] which cyclodehydrates Cys and Ser residues in
LazACP during lactazole biosynthesis (Figure a) to yield thiazolines and
oxazolines, respectively.[52] The dehydrogenase
domain of LazF further aromatizes these structures to azoles via FMN-dependent
dehydrogenation.[52] As with LazBF, LazDEF
is known to process non-native substrates, but the extent of such
promiscuity has not been fully elucidated.[47,53]
Figure 5
Substrate
specificity profiling for LazDEF. (a) Chemical reactions
catalyzed by LazDEF. (b) Design of the LazDEF substrate library, library
6C6. (c) Summary of the selection and antiselection experiments. Plotted
are respective DNA recovery and enrichment values measured by qPCR
after every round of mRNA display. (d) CNN classifier accuracy as
a function of training data set size. The models were trained on round
5 data. (e) Validation of model predictions against experimental data.
A total of 64 validation peptides (dVP1–64; Table S5) were expressed by the FIT system and treated with
LazDEF for 5 h. Reaction outcomes were analyzed by LC-MS as described
in Supporting Information 2.8. Model predictions
show good agreement with the experiment. (f) Pairwise epistasis between
variable positions in the CP of 6C6 peptides. The model was utilized
to compute abs(epi) scores using predictions for
5 × 10^6 sequences from panel (h). The resulting values
can be used to estimate how strongly amino acids in the substrate
affect each other’s fitness. Higher values correspond to stronger
second-order effects. Compared to the results for LazBF, LazDEF substrates
are characterized by weaker pairwise epistatic interactions, which
aids in explaining the results in panels (g) and (h). See Supporting Information 2.1 for computation details.
(g) Experimentally measured modification efficiencies of validation
peptides as a function of their S scores. Compared
to the LazBF results (Figure a), the S scores for LazDEF substrates prove
more informative. (h) Distribution of model predictions in the S-space. Substrate fitness of 5 × 10^6 random
6C6 peptides was evaluated with the model. Plotted are binned statistics
of model predictions in the S-space. The overall
distribution of the peptides in the same space is displayed for reference.
In the interval [−3, 2], which accounts for 46% of the total
peptide space, S scores are an unreliable metric
of substrate fitness.
To profile the enzyme,
we designed mRNA display library 6C6 (Figure b), where the CP
region contained a fixed Cys residue flanked by six random amino acids
on either side. To discriminate LazDEF-modified products (i.e., peptides
containing a thiazoline/thiazole residue) from unmodified substrates
(peptides bearing Cys6), we opted to use iodoacetamide-based chemistry
to selectively biotinylate the latter (Figure S14). Thus, the selection protocol was modified to collect
and amplify the unbound DNA fraction, while the antiselection amplified
SAv pulldown products. In total, we performed five rounds of selection
and antiselection (Figure c). Consistent with the LazBF study, the selection recovery
and enrichment values increased from round to round except for round
4, when the selection stringency was adjusted, whereas antiselection
statistics hovered around the same values. Likewise, the resulting
sequence populations had strong enrichments at the amino acid level
(Figure S15) but did not converge at the
peptide level (normalized Shannon entropy, Hselection = 0.992), which provided an ample amount of training
data for deep learning. Training a CNN classifier on round 5 data
led to accurate models, where—similar to the LazBF experiments—the
model accuracy was proportional to the number of training samples,
reaching up to 0.993 for 8 × 10^6 input peptides (Figures d and S8c), and deep learning-based classifiers also
outperformed traditional machine learning methods (Figure S7b). Validation of model predictions against experimentally
measured modification efficiency values for 64 peptides confirmed
the ability of the model to generalize beyond NGS data sets (Figure e, Table S5). The LazDEF model predictions were in good agreement
with the experimental values [ρP = 0.980; μ(abs(Δ))
= 0.04 ± 0.08 (±σ)], indicating that the model might
be leveraged for quantitative estimation of LazDEF substrate fitness.
Taken together, these results show the flexibility of the developed
mRNA display platform and its ability to profile PTM enzymes catalyzing
diverse chemical reactions.
In contrast to the similar mRNA
display outcomes, LazDEF and LazBF
had divergent substrate fitness landscapes. The difference mainly
manifested in lower pairwise positional epistasis (compare Figure f vs Figure c) and, by extension, higher
predictive power of statistical fitness scores for LazDEF (Figure g,h). Compared to
LazBF, analysis of the substrate preferences for LazDEF through the
prism of log2Y* values was more meaningful. Consistent
with the prior studies,[47,53] LazDEF primarily relied
on amino acids in positions −1 and +1 to discriminate its substrates,
preferring small (Gly/Ser/Ala) amino acids on either side of the modification
site and strongly disfavoring Asp/Glu anywhere in the CP (Figure S15a). However, we note that even in this
case, S scores could not reliably predict the fitness
of nearly half of the total substrate space: 46% of all library 6C6
substrates had S scores between −3 and 2 where
statistical predictions can be inaccurate (Figure h). Accordingly, numerous exceptions to the
aforementioned substrate preferences were apparent (Table S5). For example, LazDEF readily modified validation
peptide dVP31 (cyclodehydration efficiency: 0.89) which contained
Asp adjacent to the modification site (Figure S16).
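The summary statistics quoted in this section (normalized Shannon entropy of the selected population, Pearson correlation, and the mean absolute deviation between predicted and measured efficiencies) can be computed along the following lines. This is a sketch under stated assumptions: the entropy is normalized by the logarithm of the number of reads, and the spread is a population standard deviation; the exact conventions used in the study are not given in this section. All function names are illustrative.

```python
import math
from collections import Counter


def normalized_shannon_entropy(sequences):
    """Shannon entropy of the read population divided by log(number of reads).

    ASSUMPTION: with this normalization a population of all-unique reads
    scores 1 and a fully converged population scores 0.
    """
    n = len(sequences)
    if n < 2:
        return 0.0
    counts = Counter(sequences)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


def abs_error_summary(predicted, observed):
    """Mean and population standard deviation of |predicted - observed|."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    mu = sum(errors) / len(errors)
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / len(errors))
    return mu, sigma


# Toy values for illustration only; these are not data from the study.
toy_reads = ["ACD", "ACD", "EFG", "HIK"]
toy_entropy = normalized_shannon_entropy(toy_reads)
toy_mu, toy_sigma = abs_error_summary([0.5, 0.9], [0.5, 0.7])
```

An entropy close to 1 under this normalization indicates that almost every read is a distinct peptide, which is the situation that yields a large and diverse training set.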
Conclusions
Our study demonstrates
that mRNA display can produce large amounts
of clean, labeled data for supervised deep learning applications.
The platform relies on a differential chemical reactivity of enzyme
substrates and their reaction products. An mRNA display scheme can
be constructed so long as either species can be chemoselectively biotinylated
(in this study, reaction products for LazBF and unreactive substrates
for LazDEF). We believe that the plethora of contemporary bioconjugation
techniques[67,68] will aid the development of analogous
workflows for PTM enzymes catalyzing diverse chemical reactions.
Further, we found that highly accurate models of enzymatic substrate
preferences of two PTM enzymes catalyzing different reactions can
be constructed using a unified pipeline. The resulting classifier
models could be employed for quantitative assessment of reaction yields,
prediction of modification sites, and analysis of the influence of
individual amino acids on the overall substrate fitness. The deep
learning workflow proved superior to traditional machine learning
methods and to statistical enrichment metrics, commonly used for analysis
of high-throughput enzyme-profiling data. Combined, these advances
have allowed us to dissect the catalytic preferences of a Ser dehydratase
and a YcaO cyclodehydratase, which uncovered unusually complex substrate
fitness landscapes in both cases. We believe that the LazBF and LazDEF
models will facilitate lactazole bioengineering and, more generally,
that the developed platform will foster the study of catalysis by
promiscuous PTM enzymes.