Jared A Delmar1, Jihong Wang1, Seo Woo Choi2, Jason A Martins2, John P Mikhail2. 1. Analytical Sciences, Biopharmaceutical Development, AstraZeneca, One MedImmune Way, Gaithersburg, MD 20878, USA. 2. David H. Koch School of Chemical Engineering Practice, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Abstract
The spontaneous conversion of asparagine residues to aspartic acid or iso-aspartic acid, via deamidation, is a major pathway of protein degradation and is often seriously disruptive to biological systems. Deamidation has been shown to negatively affect both in vitro stability and in vivo biological function of diverse classes of proteins. During protein therapeutics development, deamidation liabilities that are overlooked necessitate expensive and time-consuming remediation strategies, sometimes leading to termination of the project. In this paper, we apply machine learning to a large (n = 776) liquid chromatography-tandem mass spectrometry (LC-MS/MS) dataset of monoclonal antibody peptides to create computational models for the post-translational modification asparagine deamidation, using the random decision forest method. We show that our categorical model predicts antibody deamidation with nearly 5% increased accuracy and 0.2 MCC over the best currently available models. Surprisingly, our model also paces or outperforms advanced and conventional models on an independent non-antibody dataset. In addition to deamidation probability, we are able to accurately predict deamidation rate (R2 = 0.963 and Q2 = 0.822), a capability with no peer in current models. This method should enable significant improvement in protein candidate selection, especially in biopharmaceutical development, and can be applied with similar accuracy to enzymes, monoclonal antibodies, next-generation formats, vaccine component antigens, and gene therapy vectors such as adeno-associated virus.
The spontaneous conversion of asparagine residues to aspartic acid or iso-aspartic acid, via deamidation, is a major pathway of protein degradation and is often seriously disruptive to biological systems. Deamidation has been shown to negatively affect both in vitro stability and in vivo biological function of diverse classes of proteins. During protein therapeutics development, deamidation liabilities that are overlooked necessitate expensive and time-consuming remediation strategies, sometimes leading to termination of the project. In this paper, we apply machine learning to a large (n = 776) liquid chromatography-tandem mass spectrometry (LC-MS/MS) dataset of monoclonal antibody peptides to create computational models for the post-translational modification asparagine deamidation, using the random decision forest method. We show that our categorical model predicts antibody deamidation with nearly 5% increased accuracy and 0.2 MCC over the best currently available models. Surprisingly, our model also paces or outperforms advanced and conventional models on an independent non-antibody dataset. In addition to deamidation probability, we are able to accurately predict deamidation rate (R2 = 0.963 and Q2 = 0.822), a capability with no peer in current models. This method should enable significant improvement in protein candidate selection, especially in biopharmaceutical development, and can be applied with similar accuracy to enzymes, monoclonal antibodies, next-generation formats, vaccine component antigens, and gene therapy vectors such as adeno-associated virus.
Therapeutic proteins are an important and growing class of drugs that includes peptides, such as insulin; cytokines, like erythropoietin; monoclonal antibodies (mAbs), which are among the most successful cancer therapies; next-generation formats, such as antibody-drug conjugates, bispecific antibodies, and fusion proteins; as well as vaccine components and gene therapy vectors. While small molecules comprise the largest class of new drug approvals, nearly 30% of US Food and Drug Administration (FDA) approvals in 2018 were protein based, up from 26% in 2017. As of the writing of this paper, half of new drugs approved by the FDA in 2019 represent biologics.Therapeutic proteins offer new mechanisms of action, higher target specificity, lower toxicity, and longer-acting pharmacokinetics, compared to small molecule drugs.2, 3, 4 However, the development of therapeutic proteins poses additional challenges. Not only must the drug be effective, but it must also be “developable,” a concept that encompasses many characteristics including high yield and homogeneity from cell culture, high purity drug substance after purification processing, low viscosity and high stability at the high concentrations necessary for drug product, high stability at in vitro long-term storage conditions and in vivo after administration, high target specificity, and, for antibodies, unimpaired neonatal Fc receptor (FcRn) binding., Nearly all of the factors that make a protein drug developable are derived from the amino acid sequence, including site-specific post-translational modifications (PTMs).In particular, the spontaneous non-enzymatic conversion of asparagine to aspartic acid or iso-aspartic acid via deamidation is a major pathway of protein degradation and is often seriously disruptive to biological systems.7, 8, 9 Deamidation has been shown to negatively affect both in vitro stability and in vivo biological function of diverse classes of proteins. Deamidation has been reported as a critical quality attribute in many monoclonal antibodies due to its impact on biological activity.10, 11, 12, 13 In one humanized monoclonal immunoglobulin G1 (IgG1) antibody drug, an asparagine in the heavy-chain complementarity determining region 2 (CDR2) loop was found to deamidate in vivo, which greatly decreased the drug’s efficacy. In another case, heavy-chain CDR deamidation resulted in an almost complete loss of potency and binding activity of a therapeutic monoclonal antibody. In adeno-associated virus, an emerging new vector for gene therapy, extensive capsid deamidation has been observed that impacts transduction and correlates to a loss of vector activity. Deamidation of asparagine residues can also significantly affect immunogenicity and efficacy of protein-based vaccines. Specifically, progress to develop next-generation anthrax vaccines has been halted by vaccine instability resulting from asparagine deamidation in anthrax protective antigen.17, 18, 19, 20, 21 Even in the nontherapeutic enzyme glucoamylase, used commercially to produce sweeteners and ethanol, asparagine deamidation causes a decrease in enzyme activity and change in thermodynamic stability.22, 23, 24Prediction of deamidation liability as early as possible in protein drug development is important because many more candidate drugs are proposed than can be tested. For example, typical antibody generation results in hundreds of candidates, which far exceeds the capacity of a drug development organization., Development of a therapeutic protein is so costly in both money and time that, after an initial assessment for screening, only a single candidate is moved forward in most cases.,, Sequence liabilities that are not dealt with as early as possible necessitate more expensive and time-consuming remediation strategies later in development and could lead to termination of the project.Computational tools already exist to facilitate drug candidate screening by the identification of sequence liabilities.,28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45 In the case of asparagine deamidation,,40, 41, 42, 43, 44, 45 currently available tools suffer from several limitations: they provide only a binary (yes, high risk to deamidate; or no, low risk to deamidate) prediction,,,, require an experimental crystal structure,42, 43, 44, 45 or are applicable only to antibody asparagines., All offer no,, or low accuracy,43, 44, 45 predictions of deamidation rate. Oversimplified models tend to overestimate the number of deamidation sites greatly, which leads to over-engineering and rejecting good drug candidates too early in development. On the other hand, these models may also overlook asparagines for which deamidation is rarely observed (such as NK or NW sites), which can lead to costly and ineffective drug development.In this paper, we apply machine learning to a large (n = 776) liquid chromatography-tandem mass spectrometry (LC-MS/MS) dataset of monoclonal antibody peptides to create accurate random decision forest models for the PTM asparagine deamidation. We show that our categorical model predicts antibody deamidation likelihood with nearly 5% increased accuracy and 0.2 MCC over the best currently available models. Surprisingly, our model also paces or outperforms advanced and conventional models on an independent non-antibody dataset, including enzyme, antigen, and viral capsid deamidation sites. In addition to deamidation probability, we are able to accurately predict deamidation rate (R2 = 0.963 and Q2 = 0.822), a capability with no peer in current models. We provide evidence that our method can be applied with equal accuracy to predict the likelihood and rate of site-specific asparagine deamidation in any protein of interest.
Results and Discussion
Feature Selection
Spontaneous deamidation of asparagine to aspartic acid or iso-aspartic acid proceeds by one of two reaction mechanisms (Figure 1). At neutral to basic pH, the most common route is by a nucleophilic attack of the asparagine side chain by the backbone nitrogen of the following (N+1) residue, forming the cyclic succinimide intermediate. Hydrolysis at one of two carbonyls of the succinimide intermediate results in either aspartic acid or iso-aspartic acid. Below pH 5, direct hydrolysis of the asparagine side chain amide to aspartic acid is the dominant reaction.,
Figure 1
Asparagine Deamidation Reaction
(A and B) Spontaneous degradation of asparagine can occur by (A) direct hydrolysis of the side chain to aspartic acid or (B) via a succinimide intermediate, produced by a nucleophilic attack of the side chain carbonyl by the following (N+1) residue backbone nitrogen, producing either iso-aspartic acid or aspartic acid. Residues are rendered as sticks with Asn, Asp, and iso-Asp, and succinimide carbons colored gray, green, and cyan, respectively (O, red; N, blue).
Asparagine Deamidation Reaction(A and B) Spontaneous degradation of asparagine can occur by (A) direct hydrolysis of the side chain to aspartic acid or (B) via a succinimide intermediate, produced by a nucleophilic attack of the side chain carbonyl by the following (N+1) residue backbone nitrogen, producing either iso-aspartic acid or aspartic acid. Residues are rendered as sticks with Asn, Asp, and iso-Asp, and succinimidecarbons colored gray, green, and cyan, respectively (O, red; N, blue).Both mechanisms have been shown to rely on both the primary and three-dimensional (3D) structure, with the residue immediately following the asparagine residue (N+1) having the largest effect.,,47, 48, 49, 50, 51 The amino side residue preceding the asparagine (N−1) was shown to have little to no effect on deamidation rate., Steric hindrance, conformational space, and electrostatic effects introduced by the N+1 residue may all affect the ability of the side chain and/or backbone to align and form the cyclic intermediate. As both reaction mechanisms require hydrolysis to form the final aspartic acid or iso-aspartic acid product, availability of water molecules, or a proton donor, and solvent exposure may directly influence the rate of deamidation. Finally, hydrogen bonding to the side chain or backbone may stabilize asparagine and prevent degradation to aspartic acid.Taken together, these observations compiled from literature informed 12 total parameters for asparagine deamidation likelihood (Table 1), which our machine-learning models would use to predict deamidation. The N+1 residue was taken into account as both a categorical variable and as the experimental half-life of a synthetic pentapeptide (pentapeptide half-life, pphl) containing the same N−1 and N+1 sequence, measured by Robinson et al. Half-lives were not reported by Robinson et al. for pentapeptides with asparagine in the N+1 position, likely because it is difficult to distinguish between deamidation in the N and N+1 position in this case. Thus, when N+1 = N, we used an average pphl of 5.7 days.
Table 1
Predictors for Asparagine Deamidation Machine Learning Model
Structural / Chemical Category
Parameter
Primary sequence
pentapeptide deamidation half-life (pphl, days)
categorical N+1 residue
Backbone orientation
backbone dihedral Phi (φ,°)
backbone dihedral Psi (Ψ,°)
nucleophilic C-N attack distance (Å)
Side-chain orientation
side-chain dihedral chi1 (χ1,°)
side-chain torsion chi2 (χ2,°)
Solvent accessibility
percent solvent accessibility (PSA, %)
solvent accessible surface area (SASA, Å2)
Hydrogen bonding
hydrogen bonds to side chain (#)
Asn local secondary structure (Sheet)
Asn local secondary structure (Loop)
Machine-learning parameter (for regression model only)
categorical model probability output (%)
12 total parameters were used to inform the categorical machine-learning model to predict deamidation likelihood, comprising 6 general categories. For the regression model to prediction deamidation rate, the output of the categorical model was included as an additional predictor.
Predictors for Asparagine Deamidation Machine Learning Model12 total parameters were used to inform the categorical machine-learning model to predict deamidation likelihood, comprising 6 general categories. For the regression model to prediction deamidation rate, the output of the categorical model was included as an additional predictor.The ability of the side chain and backbone to align to form the cyclic succinimide intermediate was taken into account by the backbone dihedral angles (phi and psi), asparagine side-chain dihedrals (chi1 and chi2), and distance between the side chain carbonyl and backbone nitrogen (nucleophilic attack distance). Solvent accessibility was expressed as a percent of the total residue area (percent solvent accessibility [PSA]), as well as area in Å2 (solvent accessible surface area [SASA]). Hydrogen bonding to the asparagine side chain was predicted if a potential donor-acceptor pair was found within 3 Å and counted as the total number of predicted bonds (up to 4). Hydrogen bonding to the backbone was accounted for by secondary structure (either sheet or loop). We do not need a third variable for helical secondary structure, because if the local secondary structure is neither a sheet nor loop, an α helix can be assumed.Because the predictive tool we developed is most valuable during early development or candidate selection of a deamidation-liable protein, when little to no experimental data is available, we relied only on the primary sequence of the proteins to train our models. Two of the deamidation predictors we chose (N+1 residue and pphl) could be gleaned from the primary sequence directly. For the 9 parameters that could not, the structure of each protein was generated by homology modeling and predictors were extracted from the predicted 3D structure. To predict deamidate rate, we used the same 12 predictors for the regression model, with an additional parameter for the output of the first classification model, representing the predicted likelihood of deamidation.
Training and Validation Dataset Construction
It is expected that at least 80% of the effort involved in training a machine-learning model is dedicated to data processing and cleaning. Machine learning requires a large amount of data with a high degree of standardization. Thus, we performed a large side-by-side forced degradation study for 34 in-house IgG molecules, comprising a total of 608 asparagine sites that include 304 unique sites. We accelerated deamidation by incubating each molecule at 40°C and pH 8.0 for up to 4 weeks and measured the deamidation abundance at 0-, 2-, and 4-week time points for each site by LC-MS/MS tryptic peptide mapping. The deamidation half-life (t1/2) of each asparagine was calculated by least-squares fit of the data to an exponential decay function (Figure S1).This experimental t1/2. was used to train the regression model for half-life prediction. For training a classification machine-learning model to predict the probability of observing deamidation, we set the threshold at 1.0% deamidation per month or t1/2 ≈276 weeks. For example, if we observed only a 0.9% increase in deamidation after 4 weeks under stress conditions, we would train the classification model that this site did not deamidate (class “no”). If we measured a 1.1% increase in deamidation after 4 weeks at stress conditions for an asparagine site, this site was reported as class “yes,” or deamidated.Each asparagine located in the variable region of the antibodies in our forced degradation study was included in the training set. Because our training set included six antibody heavy-chain formats53, 54, 55, 56 and two light-chain formats (Figure S2), we also included one copy of each unique constant region asparagine in the training data.Initial models trained on our in-house dataset made accurate predictions overall (data not shown). However, they performed relatively poorly on asparagines located in IgG complementarity determining regions (CDRs)—probably because, out of the 304 unique asparagines in our initial dataset, only 25 unique CDR asparagines were found to deamidate. To give our models more examples of CDR deamidation to learn from, we expanded the training dataset to include 33 additional clinical-stage IgG1s (and 39 additional CDR deamidation sites) with deamidation data published by Lu et al. (Table S3). The stress condition used by Lu et al. (40°C and pH 8.5) was similar to our own, and their data was incorporated into the training sets for both our categorical and regression models. Only the final models that including training data from Lu et al. are evaluated here.To validate our models, we constructed an independent validation dataset with data from 12 additional in-house IgG molecules that were not contained in the training dataset, stressed at identical conditions to the training set molecules. The observed deamidation frequency in the validation set was consistent with that of the training set (Figure S3). Predictors and LC-MS/MS data for these molecules were added to an independent validation set. This independent validation dataset was supplemented with non-antibody data from both the validation set published by Jia et al., containing Aspergillus awamori glucoamylase,22, 23, 24 anthrax antigen,17, 18, 19, and human angionenin RNase, and recent capsid viral protein 3 (VP3) deamidation data published by Giles et al. from adeno-associated virus 8 (AAV8), an emerging vector for gene therapy (Table S3).
Machine-Learning Models for Predicting Deamidation Likelihood and Rate
Both the classification model and regression model were random forest models built in RStudio using the randomForest and caret libraries. The number of trees and number of parameters tried at each split were optimized by hand to minimize the out-of-bag error estimate. Because the output of the classification model is a probability that an asparagine belongs to class “yes,” or will deamidate, the probability threshold at which we interpret the prediction as “yes” or “no” was also optimized after model building to maximize the accuracy.Statistics for the fit to the training set were calculated for both the classification and regression models. Notably, the classification model was able to achieve 100% accuracy on the training set, using 12 parameters to determine whether each of 776 asparagines would deamidate with no mistakes made. The regression model was able to predict t1/2 for the 137 deamidated asparagines, 88 of which are unique, in the training set with an R2 of 0.963. The regression model used the same 12 predictors as the classification model, as well as the prediction output from the classification model, for a total of 13 parameters (Table 1).The top two predictors of deamidation liability, measured by the mean decrease in out-of-bag accuracy when that parameter is excluded from the categorical model, were the N+1 categorical variable and the pphl (Figure 2A). This is consistent with the literature and it is well accepted that the N+1 residue has the greatest effect on the deamidation liability of all studied parameters.,,47, 48, 49, 50, 51 Even a conventional one-parameter method using only the N+1 residue is competitive with advanced techniques (Tables 11 and 16). The next three most important parameters were related to the backbone alignment (psi and phi dihedral angles and nucleophilic attack distance), followed by solvent accessibility (SASA and PSA), side-chain alignment (chi1 and chi2 dihedral angles), and hydrogen bonding (side-chain hydrogen bonds and secondary structure). Similarly, Jia et al. found that tracking hydrogen bonding, secondary structure in particular, did not improve their asparagine deamidation prediction.
Figure 2
Categorical and Regression Models Predictor Ranking
(A) Importance of each parameter in the categorical model for predicting deamidation probability was measured by the mean decrease in out-of-bag accuracy when that parameter was excluded from the model. (B) Importance of each parameter in the regression model for predicting deamidation half-life was measured by the mean increase in the out-of-bag percent mean squared error (MSE) when that parameter was excluded from the model.
Table 11
Statistical Comparison of Predictions Made by Our Categorical Model and Other Models on the Independent Non-mAb Validation Subset
Statistic
Categorical Model
NG/NN/NS
Lorenzo et al.41
Robinson et al.43
Jia et al.42
Accuracy
93.8%
88.8%
93.8%
77.5%
95.0%
MCC
0.686
0.619
0.686
0.459
0.687
Precision
60.0%
43.8%
60.0%
28.0%
71.4%
Recall
85.7%
100.0%
85.7%
100.0%
71.4%
Specificity
94.5%
87.7%
94.5%
75.3%
97.3%
Negative predictive value
98.6%
100.0%
98.6%
100.0%
97.3%
Miss rate
14.3%
0.0%
14.3%
0.0%
28.6%
Fallout
40.0%
56.3%
40.0%
72.0%
28.6%
False discovery rate
5.5%
12.3%
5.5%
24.7%
2.7%
False omission rate
1.4%
0.0%
1.4%
0.0%
2.7%
Statistics were calculated for predictions made by all models on the non-mAb validation subset, which was nearly identical to the validation set used by Jia et al.
Table 16
Statistical Comparison of Predictions Made by Our Categorical Model and Other Models on the Independent mAb-Only Validation Subset
Statistic
Categorical Model
NG/NN/NS
Yan et al.40
Lorenzo et al.41
Accuracy
95.6%
86.8%
83.8%
91.2%
MCC
0.796
0.651
0.388
0.616
Precision
100.0%
50.0%
41.7%
66.7%
Recall
66.7%
100.0%
55.6%
66.7%
Specificity
100.0%
84.7%
88.1%
94.9%
Negative predictive value
95.2%
100.0%
92.9%
94.9%
Miss rate
33.3%
0.0%
44.4%
33.3%
Fallout
0.0%
50.0%
58.3%
33.3%
False discovery rate
0.0%
15.3%
11.9%
5.1%
False omission rate
4.8%
0.0%
7.1%
5.1%
Categorical and Regression Models Predictor Ranking(A) Importance of each parameter in the categorical model for predicting deamidation probability was measured by the mean decrease in out-of-bag accuracy when that parameter was excluded from the model. (B) Importance of each parameter in the regression model for predicting deamidation half-life was measured by the mean increase in the out-of-bag percent mean squared error (MSE) when that parameter was excluded from the model.To predict t1/2, the side-chain orientation was among the most important variables, measured by the increase in the percentage of mean squared error when that parameter is excluded from the regression model (chi2 and chi1 were 1st and 5th most important, respectively). Chi2 is the closest angle to the carbonyl involved in succinimide formation and Jia et al. also observed that chi2 was more important than chi1 in their model. Similar to the categorical model, the N+1 variable was among the best predictors, followed by the backbone dihedral angle phi (Figure 2B). Interestingly, pphl, the second-most important predictor for the categorical model, was only the 9th most important predictor for the regression model. It is possible that deamidation rate is strongly influenced by structural effects absent from the Robinson et al. pentapeptides. Indeed, we have observed in both our training and validation sets that t1/2 defies the hierarchy set forth by pphl. As in the categorical model, parameters that captured hydrogen bonding to the side chain and secondary structure were the least useful in predicting deamidation rate.To test whether the models were capable of accurately predicting deamidation likelihood and rate, both classification and regression models were applied to the validation set comprised of 12 unseen antibodies held from the training set and 4 non-mAb proteins (Tables 2, 3, 4, and 5; Figure 3). Surprisingly, the categorical model performed with high accuracy for both mAb and non-mAb proteins, predicting the validation set with more than 93% accuracy and a Matthews correlation coefficient (MCC) of 0.691 even though it had never seen a non-mAb deamidation site before (Table 5). Because deamidation data for the non-mAbs came from literature and were either not quantitative or not collected under the same conditions as our in-house data, we did not validate the regression model on these molecules. On the validation set molecules with in-house LC-MS/MS data, the regression model was successful, predicting t1/2 with Q2 = 0.822 (Figure 3B).
Table 2
Confusion Matrix for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Training Dataset
Prediction →
Positive
Negative
Experiment ↓
Positive
137
0
Negative
0
639
Table 3
Statistics for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Training Dataset
Statistic
Categorical Model
Accuracy
100.0%
MCC
1.000
Precision
100.0%
Recall
100.0%
Specificity
0.0%
Negative predictive value
0.0%
Miss rate
0.0%
Fallout
0.0%
False discovery rate
100.0%
False omission rate
100.0%
Table 4
Confusion Matrix for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Independent Validation Dataset
Prediction →
Positive
Negative
Experiment ↓
Positive
17
9
Negative
4
165
Table 5
Statistics for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Independent Validation Set
Statistic
Categorical Model
Accuracy
93.3%
MCC
0.691
Precision
81.0%
Recall
65.4%
Specificity
97.6%
Negative predictive value
94.8%
Miss rate
34.6%
Fallout
19.0%
False discovery rate
2.4%
False omission rate
5.2%
Notably, on the independent validation set containing non-antibody proteins our model was able to achieve 93.3% accuracy and a Matthews correlation coefficient (MCC) of 0.691.
Figure 3
Regression Machine Learning Model for Predicting Deamidation Rate
(A) Predicted half-life (t1/2, weeks) was plotted versus the experimental measured t1/2 for the training dataset. Individual asparagines are plotted as black circles for and the solid black line indicates where predicted t1/2 = experimental t1/2. Our regression model predicted the training set with R2 = 0.963. (B) Predicted half-life (t1/2, weeks) by our regression model (blue circles), the Robinson et al. model (green squares), and the Lorenzo et al. model (orange triangles) were plotted versus experimentally measured t1/2 (weeks) for the validation set. The solid black line indicates where predicted t1/2 = experimental t1/2. While our regression model predicted the independent validation set with Q2 = 0.822, both the predictions by Robinson et al. and Lorenzo et al. resulted in Q2 < 0.
Confusion Matrix for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Training DatasetStatistics for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Training DatasetConfusion Matrix for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Independent Validation DatasetStatistics for Predictions Made by the Categorical Machine Learning Model for Predicting Deamidation Liability on the Independent Validation SetNotably, on the independent validation set containing non-antibody proteins our model was able to achieve 93.3% accuracy and a Matthews correlation coefficient (MCC) of 0.691.Regression Machine Learning Model for Predicting Deamidation Rate(A) Predicted half-life (t1/2, weeks) was plotted versus the experimental measured t1/2 for the training dataset. Individual asparagines are plotted as black circles for and the solid black line indicates where predicted t1/2 = experimental t1/2. Our regression model predicted the training set with R2 = 0.963. (B) Predicted half-life (t1/2, weeks) by our regression model (blue circles), the Robinson et al. model (green squares), and the Lorenzo et al. model (orange triangles) were plotted versus experimentally measured t1/2 (weeks) for the validation set. The solid black line indicates where predicted t1/2 = experimental t1/2. While our regression model predicted the independent validation set with Q2 = 0.822, both the predictions by Robinson et al. and Lorenzo et al. resulted in Q2 < 0.Out of the 195 total asparagines in our validation set, the categorical model made 13 mistakes. It tended toward the conservative side, underpredicting deamidation in 9 cases and only overpredicting 4 asparagines that were not experimentally observed to deamidate. Interestingly, 5 of the 9 underpredicted sites (almost 40% of our total error) came from one molecule: the AAV8 capsid protein VP3. Further, the 5 sites that our model mispredicted were significantly less deamidated than the other 5 deamidation sites observed in VP3, all of which our model correctly predicted (Tables 17 and 20). It was shown by Giles et al. that the 5 low abundance deamidation sites did not respond significantly to incubation at 70°C or pH 10 or changes to the purification process. Thus, it is possible that deamidation would remain unchanged in our milder forced degradation conditions (40°C and pH 8), and we should instead consider these sites as non-liable. Nevertheless, our model outperformed both conventional (Table 18) and advanced (Table 19) predictions of deamidation for AAV8 capsid protein VP3 asparagines.
Table 17
Confusion Matrix for Predictions Made by our Categorical Model on the AAV8 Capsid Protein VP3
Prediction →
Positive
Negative
Experiment ↓
Positive
5
5
Negative
0
37
Table 20
Comparison of Predictions Made by our Categorical Model and the NG/NN/NS Model on Selected Residues of the AAV8 Capsid Protein
Residue
N+1
Avg % Deamidation Giles et al.16
Categorical Model
NG/NN/NS
Lorenzo et al.41
N254
N
9%
no
yes
yes
N255
H
ND
no
no
yes
N263
G
99%
yes
yes
yes
N304
N
ND
no
yes
yes
N305
N
8%
no
yes
yes
N337
N
ND
No
yes
yes
N384
N
ND
no
yes
yes
N385
G
88%
yes
yes
yes
N410
N
3%
no
yes
yes
N459
T
7%
no
no
no
N498
N
ND
no
yes
yes
N499
N
17%
yes
yes
yes
N500
S
ND
no
yes
yes
N514
G
84%
yes
yes
yes
N517
S
4%
no
yes
yes
N540
G
79%
yes
yes
yes
N599
S
ND
no
yes
yes
N611
R
ND
no
no
no
N670
S
ND
no
yes
yes
N693
S
ND
no
yes
yes
Individual residue level predictions are shown for each model on a subset of residues from the AAV8 protein. The 5 mispredicted sites by our model were significantly less deamidated than the other 5 deamidation sites observed in the AAV8 capsid, measured by Giles et al.
Table 18
Confusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the AAV8 Capsid Protein VP3
Prediction →
Positive
Negative
Experiment ↓
Positive
9
1
Negative
8
29
Table 19
Confusion Matrix for Predictions Made by the Lorenzo et al. Model on the AAV8 Capsid Protein VP3
Prediction →
Positive
Negative
Experiment ↓
Positive
9
1
Negative
9
28
Taken together, the high accuracy at which our models were able to predict deamidation in the diverse proteins in our validation set indicates that these models may be generally applied to predict deamidation in any protein of interest.
Comparison with Advanced and Conventional Models
To evaluate the relative performance of our models, we have applied to our validation set as many currently available predictions of deamidation from the literature as possible. These advanced tools include another machine learning model by Jia et al., a tree-based approach by Yan et al., and empirical calculations by Robinson et al. and Lorenzo et al.. In addition, we compared all of these approaches to a conventional one-parameter method based on the primary sequence alone. For the conventional method (named NG/NN/NS here), if an asparagine is followed by glycine, asparagine, or serine (N+1 = G, N, or S), then it is considered as liable to deamidate. All deamidation sites in our validation set were NG, NN, or NS motifs, so the data was particularly amenable to and not biased against this conventional method. Statistical comparison of the conventional model and our categorical model are shown in Table S1. Conversely, our training set contained many non-NG, -NN, and -NS sites (Figure S3) and was poorly predicted by the conventional method (Table S2).Because the methods by Robinson et al. and Jia et al. require crystal structures, these models could be applied to only a subset of our validation set, which is nearly identical to the validation set used by Jia et al. for their method and comprised of only non-mAbs. We have removed N395 of A. awamori glucoamylase (PDB: 3GLY) from this validation subset as Chen et al. showed that this asparagine is N-glycosylated. Of note, all sites with N+1 = N and N+1 = Q are also missing from the non-mAb validation subset and the model by Jia et al. does not provide predictions for them. Finally, two deamidation sites in anthrax protective antigen (N466 and N537) were corrected to match the observations of Verma et al. The individual confusion matrices are shown in Tables 6, 7, 8, 9, and 10, and a statistical comparison of our method performance on this validation subset to the conventional method and those of Lorenzo et al., Robinson et al., and Jia et al. is shown in Table 11. Jia et al. had the highest accuracy on this non-mAb validation set, with one less mistake than our model or that of Lorenzo et al. However, their model also performed last in several categories. Our model and that of Lorenzo et al. had the second-best overall accuracy of 93.8%, were not the worst performers in any category, and had nearly identical MCC to that of the Jia et al. model (Table 11).
Table 6
Confusion Matrix for Predictions Made by our Categorical Model on the Non-mAb Independent Validation Subset
Prediction →
Positive
Negative
Experiment ↓
Positive
6
1
Negative
4
69
Table 7
Confusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the Non-mAb Independent Validation Subset
Prediction →
Positive
Negative
Experiment ↓
Positive
7
0
Negative
9
64
Table 8
Confusion Matrix for Predictions Made by the Lorenzo et al. Model on the Non-mAb Independent Validation Subset
Prediction →
Positive
Negative
Experiment ↓
Positive
6
1
Negative
4
69
Table 9
Confusion Matrix for Predictions Made by the Robinson et al. Model on the Non-mAb Independent Validation Subset
Prediction →
Positive
Negative
Experiment ↓
Positive
7
0
Negative
18
55
Table 10
Confusion Matrix for Predictions Made by the Jia et al. Model on the Non-mAb Independent Validation Subset
Prediction →
Positive
Negative
Experiment ↓
Positive
5
2
Negative
2
71
Confusion Matrix for Predictions Made by our Categorical Model on the Non-mAb Independent Validation SubsetConfusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the Non-mAb Independent Validation SubsetConfusion Matrix for Predictions Made by the Lorenzo et al. Model on the Non-mAb Independent Validation SubsetConfusion Matrix for Predictions Made by the Robinson et al. Model on the Non-mAb Independent Validation SubsetConfusion Matrix for Predictions Made by the Jia et al. Model on the Non-mAb Independent Validation SubsetStatistical Comparison of Predictions Made by Our Categorical Model and Other Models on the Independent Non-mAb Validation SubsetStatistics were calculated for predictions made by all models on the non-mAb validation subset, which was nearly identical to the validation set used by Jia et al.The tree-based model proposed by Yan et al. is only applicable to IgG mAbs. Thus, in order to compare our methods, we created another subset of our validation set including only IgG mAbs. On this mAb-only independent validation subset, predictions made by our model (Table 12) were compared against those made by the models of Yan et al. (Table 13), Lorenzo et al. (Table 14), and the conventional one-parameter NG/NN/NS model (Table 15). Again, our categorical model was not the worst performer in any statistic and had the best MCC (0.796) and accuracy (95.6%) at predicting the mAb-only dataset (Table 16).
Table 12
Confusion Matrix for Predictions Made by Our Categorical Model on the mAb-Only Independent Validation Set
Prediction →
Positive
Negative
Experiment ↓
Positive
6
3
Negative
0
59
Table 13
Confusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the mAb-Only Independent Validation Set
Prediction →
Positive
Negative
Experiment ↓
Positive
9
0
Negative
9
50
Table 14
Confusion Matrix for Predictions Made by the Yan et al. Model on the mAb-Only Independent Validation Set
Prediction →
Positive
Negative
Experiment ↓
Positive
5
4
Negative
7
52
Table 15
Confusion Matrix for Predictions Made by the Lorenzo et al. Model on the mAb-Only Independent Validation Set
Prediction →
Positive
Negative
Experiment ↓
Positive
6
3
Negative
3
56
Confusion Matrix for Predictions Made by Our Categorical Model on the mAb-Only Independent Validation SetConfusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the mAb-Only Independent Validation SetConfusion Matrix for Predictions Made by the Yan et al. Model on the mAb-Only Independent Validation SetConfusion Matrix for Predictions Made by the Lorenzo et al. Model on the mAb-Only Independent Validation SetStatistical Comparison of Predictions Made by Our Categorical Model and Other Models on the Independent mAb-Only Validation SubsetUnfortunately, we were not able to make a significant comparison to the tree-based method proposed by Sydow et al. Their method is restricted to only the CDR of antibodies.As of the writing of this paper, we were only able to find two methods in the literature for the prediction of deamidation half-life: by Robinson et al. and Lorenzo et al. Deamidation half-life is both temperature and pH dependent and each model is specific to one condition. Specifically, the Robinson et al. model predicts t1/2 for proteins at 37°C and pH 7.4 and Lorenzo et al. at slightly basic pH and up to 40°C. In our experience, these conditions are similar enough to our own (40°C and pH 8.0) to make a direct comparison.Out of the 26 unique deamidation sites in our validation set, only 7 had available t1/2 measurements, all of which were calculated from LC-MS/MS data collected in-house. Thus, we applied each model to these 7 sites for comparison of predictive accuracy. Both the Robinson et al. and Lorenzo et al. models predicted values of t1/2 that disagreed enough with the experimental values to produce a Q2 < 0. Our model achieved a Q2 of 0.822 (Figure 3B). Both the Robinson et al. and Lorenzo et al. methods rely heavily on the pphl as the basis for their half-life prediction, while our model ranked pphl as one of the least useful parameters for prediction t1/2 (Figure 2B), which might help to explain the discrepancy in results. While the Lorenzo et al. model tended to underpredict deamidation rate in this independent validation set, both the Lorenzo et al. and NG/NN/NS models overpredicted the number of liable sites in the AAV8 capsid protein VP3 (Tables 17, 18, 19, 20).Confusion Matrix for Predictions Made by our Categorical Model on the AAV8 Capsid Protein VP3Confusion Matrix for Predictions Made by the Conventional NG/NN/NS Model on the AAV8 Capsid Protein VP3Confusion Matrix for Predictions Made by the Lorenzo et al. Model on the AAV8 Capsid Protein VP3Comparison of Predictions Made by our Categorical Model and the NG/NN/NS Model on Selected Residues of the AAV8 Capsid ProteinIndividual residue level predictions are shown for each model on a subset of residues from the AAV8 protein. The 5 mispredicted sites by our model were significantly less deamidated than the other 5 deamidation sites observed in the AAV8 capsid, measured by Giles et al.
Conclusions
We have constructed both a categorical model for predicting whether or not an asparagine is liable for deamidation, and a regression model for determining the rate at which a predicted site deamidates. Both outperform or pace currently available models based on predictions made on independent validation sets.Although both models were trained on only mAb deamidation data, we found that they applied with similar accuracy to non-mAb molecules in our validation set, including enzyme, antigen, and viral capsid deamidation sites. In contrast to other methods, ours do not require crystallographic 3D coordinates and are not protein class specific. Rather, the structural information used by our models to predict deamidation is drawn from homology models. Thus, they are applicable to any protein for which a similar protein’s structure is available in the PDB.It is our hope that with more data and increasingly accurate and interpretable models, a fundamental understanding of protein degradation, including deamidation, will be attained, leading to more and better protein-based therapies.
Materials and Methods
3D Model Building and Parameter Extraction
For AstraZeneca in-house molecules, full-length homology models were built using Schrödinger BioLuminate. Briefly, the most similar crystal structure from the PDB, by sequence, was first identified by basic local alignment search tool (BLAST). This structure and an in-house constant region template were used as scaffolds for the full-length structure. The Protein Preparation Wizard tool was used to add hydrogens, assign bond orders, remove solvent molecules, optimize H-bond assignments, and perform restrained energy minimization. Molecules from the study by Lu et al. were modeled similarly; however, only the Fv structure was generated. Predictors of asparagine deamidation were extracted from the 3D homology models within Schrödinger via python script.
Generation of Deamidated IgGs
For IgG deamidation data generated in-house, samples at 10 mg/mL in 50 mM Tris pH 8.0 were incubated at 40°C for 2-week and 4-week time points. Reactants were stored at −80°C prior to analysis by LC-MS/MS.
LC-MS/MS Tryptic Peptide Mapping
20 μL samples at 5 μg/μL were denatured by adding 200 μL of 8 M guanidine, 130 mM Tris, 1 mM EDTA, pH 7.6 denaturing buffer. The samples were then reduced by the addition of 2 μL of 500 mM dithiothreitol. After incubation at 37°C for 30 min, samples were alkylated by the addition of 5 μL of 500 mM iodoacetamide and incubated at ambient temperature for 30 min in the dark. The reduced and alkylated samples were buffer exchanged into a solution containing 2 M urea and 100 mM Tris at pH 8.0 using an Amicon spin filter (EMD Millipore, Billerica, MA, USA; molecular weight cut-off of 10 kDa); 5 μg of trypsin was then added to the spin filter and incubated at 37°C for 4 h. The digested samples were collected from the spin filters, and the digestion was quenched with trifluoroacetic acid.Peptides produced by enzymatic digestion were eluted on an Acquity Ultra Performance liquid chromatography system (Waters, Milford, MA, USA) equipped with an ethylene bridged hybrid C18 reversed-phase column (1.7 μm, 2.1 mm, 150 mm) using a gradient of 0%–60% acetonitrile at a flow rate of 0.2 mL/min (total elution time of 76 min). Peptides separated on the column were identified by a UV detector and analyzed using an Orbitrap Velos Pro mass spectrometer (Thermo Fisher Scientific). Peak identification was based on both the exact monoisotopic mass and the tandem mass spectrum of the target ion. Deamidation quantitation was based on peak areas from the extracted ion chromatography of corresponding ions.In most cases in our collected deamidation data for the training and validation sets, sequencing information by MS/MS could distinguish between deamidation on neighboring asparagines in the same tryptic peptide. However, for two NN sites in the validation set, MS/MS data could not distinguish between the N and N+1 residues. Thus, in these cases, the t1/2 was a combined measurement for both sites in the peptide. Half-life predictions made by the regression model for these two sites were also combined prior to analysis.
Random Forest Machine Learning Model Construction
Both the classification model and regression model were random forest models built in RStudio using the randomForest and caret libraries. The number of trees and number of parameters tried at each split were optimized by manually tuned to minimize the out-of-bag error estimate.For the classification model, 500 trees were generated with 3 variables tried at each split, producing an out-of-bag error estimate of 4.25% on the training set. The probability threshold at which we interpret the prediction as “yes” or “no” was also optimized to 53% after model building. Confusion matrices and variable importance plots were generated using caret and random Forest libraries, respectively.The regression model was trained only on the subset of training data containing deamidation sites quantified by LC-MS/MS, including our in-house data and that of Lu et al. In this case, 500 trees were generated with 4 variables tried at each split. The out-of-bag predictions explained 63.5% of the variance of the training set. R2 and Q2 were calculated and variable importance plots were generated using caret and randomForest libraries, respectively.
Author Contributions
J.A.D. and J.W. conceptualized and designed the experiments and curated and interpreted the data; J.A.D., S.W.C., J.A.M., and J.P.M. performed the experiments and analyzed the data; J.A.D. wrote the manuscript.
Conflicts of Interest
This work was supported by the global biologics R&D arm of AstraZeneca. Authors J.A.D. and J.W. are employees of AstraZeneca and have stock and/or stock interests or options in AstraZeneca.
Authors: R J Harris; B Kabakoff; F D Macchi; F J Shen; M Kwong; J D Andya; S J Shire; N Bjork; K Totpal; A B Chen Journal: J Chromatogr B Biomed Sci Appl Date: 2001-03-10
Authors: Neeraj J Agrawal; Andrew Dykstra; Jane Yang; Hai Yue; Xichdao Nguyen; Carl Kolvenbach; Nicolas Angell Journal: J Pharm Sci Date: 2018-01-08 Impact factor: 3.534
Authors: Alexander W Golinski; Katelynn M Mischler; Sidharth Laxminarayan; Nicole L Neurock; Matthew Fossing; Hannah Pichman; Stefano Martiniani; Benjamin J Hackel Journal: Proc Natl Acad Sci U S A Date: 2021-06-08 Impact factor: 11.205