Pin-Kuang Lai, Austin Gallegos, Neil Mody, Hasige A Sathish, Bernhardt L Trout.
Abstract
Machine learning has recently been used to predict therapeutic antibody aggregation rates and viscosity at high concentrations (150 mg/mL). These studies focused on commercially available antibodies, which may have been optimized for stability. In this study, we measured accelerated aggregation rates at 45°C and viscosity at 150 mg/mL for 20 preclinical and clinical-stage antibodies. Features obtained from molecular dynamics simulations of the full-length antibody and from sequences were used to construct machine learning models. We found that a k-nearest neighbors regression model with two features, the spatial positive charge map on CDRH2 and the solvent-accessible surface area of hydrophobic residues on the variable fragment, gave the best performance for predicting antibody aggregation rates (r = 0.89). For viscosity classification, the most accurate model was a logistic regression model with two features, the spatial negative charge map on the heavy chain variable region and the spatial negative charge map on the light chain variable region. The accuracy and the area under the precision-recall curve of this classification model in validation tests were 0.86 and 0.70, respectively. In addition, we combined our data with data from another 27 commercial mAbs to develop a viscosity predictive model. The best model was a logistic regression model with two features, the number of hydrophobic residues on the light chain variable region and the net charge on the light chain variable region. The accuracy and the area under the precision-recall curve of this classification model were 0.85 and 0.6, respectively. The aggregation rate and viscosity models can be used to predict antibody stability to facilitate pharmaceutical development.
Keywords: Machine learning; antibody aggregation; antibody viscosity; developability; molecular dynamics simulations
Year: 2022 PMID: 35075980 PMCID: PMC8794240 DOI: 10.1080/19420862.2022.2026208
Source DB: PubMed Journal: MAbs ISSN: 1942-0862 Impact factor: 5.857
Figure 1. Aggregation rates of all 20 mAbs studied in this work.
List of mAb properties and domains for feature selection of antibody aggregation rate. The CDR definitions are based on Chothia numbering. The feature properties are obtained as dynamic averages over MD trajectories. In total, there are 35 features for selection.
| Feature list (mAb properties (5) x domains (7) = 35) | | | |
|---|---|---|---|
| mAb properties | Description | Domains | Description |
| Solvent accessible surface area of hydrophobic residues (SASA_phobic) | Calculated by VMD | CDRH1 | H26-H32 |
| Solvent accessible surface area of hydrophilic residues (SASA_philic) | Calculated by VMD | CDRH2 | H52-H56 |
| Spatial aggregation propensity (SAP) | In-house program | CDRH3 | H95-H102 |
| Spatial negative charge map (SCM_neg) | In-house program | CDRL1 | L24-L34 |
| Spatial positive charge map (SCM_pos) | In-house program | CDRL2 | L50-L56 |
| | | CDRL3 | L89-L97 |
| | | Fv | H1-H113 + |
Mean squared error (MSE) of the top five one-feature and two-feature combinations of the linear regression, support vector regression (SVR) and k-nearest neighbors regression (KNN) models for predicting aggregation rates. There are 20 mAbs in this study. The MSE values are averaged over 100 randomly generated fourfold cross-validation sets.
| Model | One-feature | MSE | Two-feature (feature 1) | Two-feature (feature 2) | MSE |
|---|---|---|---|---|---|
| Linear | SCM_neg_H2 | 5.04 | SCM_neg_H2 | SASA_phobic_H3 | 4.81 |
| Linear | SAP_pos_H1 | 5.31 | SCM_neg_H2 | SASA_philic_L3 | 4.97 |
| Linear | SASA_phobic_H3 | 5.49 | SAP_pos_L1 | SCM_neg_H2 | 5.08 |
| Linear | SCM_neg_H1 | 5.66 | SCM_neg_H1 | SASA_phobic_H3 | 5.19 |
| Linear | SASA_philic_L3 | 5.70 | SCM_neg_H2 | SCM_pos_L1 | 5.23 |
| SVR | SCM_pos_H2 | 4.96 | SCM_pos_H2 | SASA_phobic_Fv | 4.12 |
| SVR | SCM_neg_H2 | 5.14 | SAP_pos_L1 | SCM_pos_H2 | 4.68 |
| SVR | SCM_pos_L3 | 5.43 | SAP_pos_L1 | SCM_neg_H2 | 4.89 |
| SVR | SASA_phobic_Fv | 5.44 | SAP_pos_Fv | SASA_phobic_Fv | 4.90 |
| SVR | SAP_pos_L1 | 5.46 | SCM_pos_H2 | SCM_pos_L3 | 4.90 |
| KNN | SCM_pos_H2 | 4.35 | SCM_pos_H2 | SASA_phobic_Fv | 3.37 |
| KNN | SCM_pos_L3 | 4.97 | SAP_pos_L1 | SCM_pos_H2 | 3.80 |
| KNN | SCM_neg_H1 | 5.35 | SCM_neg_H1 | SCM_pos_H2 | 3.97 |
| KNN | SCM_pos_H1 | 5.59 | SCM_pos_H2 | SASA_philic_L3 | 4.21 |
| KNN | SAP_pos_Fv | 5.65 | SCM_pos_L3 | SASA_philic_L1 | 4.73 |
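The evaluation protocol behind this table (a two-feature KNN regressor scored by MSE averaged over 100 random fourfold splits) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the number of neighbors (`k=3`), the Euclidean distance on unscaled features, and the synthetic data are all assumptions, since the record does not give the measured features or the hyperparameters.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    """Predict y for x as the mean target of the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def repeated_kfold_mse(X, y, k_neighbors=3, n_folds=4, n_repeats=100, seed=0):
    """Average held-out MSE over n_repeats random n_folds-fold splits."""
    rng = random.Random(seed)
    n = len(X)
    mses = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        sq_err = 0.0
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            tX = [X[i] for i in train]
            ty = [y[i] for i in train]
            for i in test:
                sq_err += (knn_predict(tX, ty, X[i], k_neighbors) - y[i]) ** 2
        mses.append(sq_err / n)  # every sample is held out exactly once
    return sum(mses) / len(mses)

# Synthetic stand-in for 20 mAbs with two features (NOT the study's data).
data_rng = random.Random(1)
X = [(data_rng.uniform(0, 1), data_rng.uniform(0, 1)) for _ in range(20)]
y = [3.0 * a + 2.0 * b + data_rng.gauss(0, 0.1) for a, b in X]

mse = repeated_kfold_mse(X, y)
print(f"mean fourfold CV MSE over 100 repeats: {mse:.3f}")
```

Repeating the random split many times reduces the dependence of the score on any one partition, which matters with only 20 samples. In practice the features would likely need to be standardized before computing distances.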
Figure 2. Correlation coefficients for the best two-feature linear, support vector regression (SVR) and k-nearest neighbors (KNN) regression models trained using all 20 data points and leave-one-out cross-validation (LOOCV). The features for the linear regression model are SCM_neg_H2 and SASA_phobic_H3. The features for the SVR and KNN models are both SCM_pos_H2 and SASA_phobic_Fv.
Bootstrapping of the best two-feature combinations for the linear, SVR and KNN regression models. In bootstrapping, the 20 data points from the original dataset were randomly sampled with replacement. The regression models were generated 100 times, and the average values of the correlation coefficient (r) and the RMSE, together with their standard deviations, were calculated.
| Model | Feature 1 | Feature 2 | r | RMSE |
|---|---|---|---|---|
| Linear | SCM_neg_H2 | SASA_phobic_H3 | 0.56 ± 0.12 | 1.72 ± 0.42 |
| SVR | SCM_pos_H2 | SASA_phobic_Fv | 0.87 ± 0.07 | 1.52 ± 0.29 |
| KNN | SCM_pos_H2 | SASA_phobic_Fv | 0.90 ± 0.07 | 0.89 ± 0.22 |
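The bootstrap procedure described in the caption can be sketched as follows. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for the study's models, each model is fitted and scored on the same resample (the record does not say how the resampled models were scored), and the data are synthetic.

```python
import math
import random
import statistics

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return math.sqrt(statistics.fmean([(p - o) ** 2 for p, o in zip(pred, obs)]))

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def bootstrap_scores(xs, ys, n_boot=100, seed=0):
    """Mean and standard deviation of r and RMSE over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    rs, errs = [], []
    for _ in range(n_boot):
        pick = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in pick]
        by = [ys[i] for i in pick]
        slope, intercept = fit_line(bx, by)
        pred = [slope * x + intercept for x in bx]
        rs.append(pearson_r(pred, by))
        errs.append(rmse(pred, by))
    return (statistics.fmean(rs), statistics.stdev(rs),
            statistics.fmean(errs), statistics.stdev(errs))

# Synthetic stand-in for 20 measurements (NOT the study's data).
data_rng = random.Random(2)
xs = [data_rng.uniform(0, 10) for _ in range(20)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

r_mean, r_sd, rmse_mean, rmse_sd = bootstrap_scores(xs, ys)
print(f"r = {r_mean:.2f} ± {r_sd:.2f}, RMSE = {rmse_mean:.2f} ± {rmse_sd:.2f}")
```

The spread across resamples gives a rough sense of how sensitive each score is to the particular 20 antibodies in the dataset.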
Figure 3. Viscosity at 150 mg/mL at pH 6.0 in histidine buffer of all 20 mAbs studied in this work. The red dashed line indicates the low/high viscosity cutoff (30 cP). The histogram shows the experimental viscosity at 150 mg/mL of the 20 mAbs. The viscosities of mAb10, mAb12, mAb13, mAb14, mAb16 and mAb20 are above the high-viscosity threshold of 30 cP.
Figure 4. Relationship between viscosity at 150 mg/mL and the diffusion interaction coefficient (kD) for the 20 mAbs in this study. Open circles show viscosity on the y-axis against kD on the x-axis. Five high-viscosity mAbs have kD values < −5 mL/g (mAb10, mAb12, mAb13, mAb16 and mAb20).
Viscosity classification accuracy (ACC) of the 20 mAbs in this study using the SCM score and the machine learning (ML) model from a previous work. Predicted and experimental high viscosity are shaded in gray. High viscosity is predicted by the SCM score when SCM_neg_Fv > 1000 and by the ML model when 12 < mAb_chg < 32 and HVI > 17.3; experimental high viscosity is defined as Vis_exp > 30 cP. Correct predictions are labeled as 1, and wrong predictions are labeled as 0.
| mAb | SCM_neg_Fv | mAb_chg | HVI | Vis_exp at 150 mg/mL (cP) | SCM_pred | ML_pred |
|---|---|---|---|---|---|---|
| mAb1 | 772.7 | 28 | 14.60 | 16.64 | 1 | 1 |
| mAb2 | 1214.6 | 20 | 15.56 | 6.49 | 0 | 1 |
| mAb3 | 869 | 28 | 17.02 | 7.30 | 1 | 1 |
| mAb4 | 870 | 32 | 23.38 | 9.72 | 1 | 1 |
| mAb5 | 1055 | 24 | 19.74 | 7.03 | 0 | 0 |
| mAb6 | 507.6 | 28 | 14.98 | 10.41 | 1 | 1 |
| mAb7 | 1010.2 | 26 | 16.09 | 23.33 | 0 | 1 |
| mAb8 | 2156.2 | 4 | 12.45 | 16.07 | 0 | 1 |
| mAb9 | 808.1 | 12 | 10.04 | 6.23 | 1 | 1 |
| mAb10 | 667.5 | 24 | 16.74 | 227.54 | 0 | 0 |
| mAb11 | 987.2 | 24 | 18.26 | 25.95 | 1 | 0 |
| mAb12 | 767 | 30 | 13.79 | 108.25 | 0 | 0 |
| mAb13 | 1089.1 | 22 | 20.18 | 93.00 | 1 | 1 |
| mAb14 | 993.9 | 24 | 17.11 | 102.46 | 0 | 0 |
| mAb15 | 993.6 | 26 | 20.6 | 21.26 | 1 | 0 |
| mAb16 | 1151.8 | 18 | 16.45 | 115.60 | 1 | 0 |
| mAb17 | 763 | 32 | 22.52 | 13.14 | 1 | 1 |
| mAb18 | 886.9 | 24 | 12.72 | 13.63 | 1 | 1 |
| mAb19 | 1294.7 | 26 | 22.47 | 7.80 | 0 | 0 |
| mAb20 | 1292.7 | 20 | 16.59 | 48.86 | 1 | 0 |
| ACC (%) | | | | | 60 | 55 |
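The correctness flags and accuracies in this table can be recomputed from the thresholds in the caption. The sketch below reads the caption as follows, which is an interpretation rather than something the record states outright: the SCM rule predicts high viscosity when SCM_neg_Fv > 1000, the ML rule predicts high viscosity when 12 < mAb_chg < 32 and HVI > 17.3, and the experimental label is Vis_exp > 30 cP. The function names are hypothetical.

```python
# (mAb, SCM_neg_Fv, mAb_chg, HVI, Vis_exp in cP, SCM_pred flag, ML_pred flag),
# copied from the table above.
ROWS = [
    ("mAb1", 772.7, 28, 14.60, 16.64, 1, 1),
    ("mAb2", 1214.6, 20, 15.56, 6.49, 0, 1),
    ("mAb3", 869.0, 28, 17.02, 7.30, 1, 1),
    ("mAb4", 870.0, 32, 23.38, 9.72, 1, 1),
    ("mAb5", 1055.0, 24, 19.74, 7.03, 0, 0),
    ("mAb6", 507.6, 28, 14.98, 10.41, 1, 1),
    ("mAb7", 1010.2, 26, 16.09, 23.33, 0, 1),
    ("mAb8", 2156.2, 4, 12.45, 16.07, 0, 1),
    ("mAb9", 808.1, 12, 10.04, 6.23, 1, 1),
    ("mAb10", 667.5, 24, 16.74, 227.54, 0, 0),
    ("mAb11", 987.2, 24, 18.26, 25.95, 1, 0),
    ("mAb12", 767.0, 30, 13.79, 108.25, 0, 0),
    ("mAb13", 1089.1, 22, 20.18, 93.00, 1, 1),
    ("mAb14", 993.9, 24, 17.11, 102.46, 0, 0),
    ("mAb15", 993.6, 26, 20.6, 21.26, 1, 0),
    ("mAb16", 1151.8, 18, 16.45, 115.60, 1, 0),
    ("mAb17", 763.0, 32, 22.52, 13.14, 1, 1),
    ("mAb18", 886.9, 24, 12.72, 13.63, 1, 1),
    ("mAb19", 1294.7, 26, 22.47, 7.80, 0, 0),
    ("mAb20", 1292.7, 20, 16.59, 48.86, 1, 0),
]

def scm_high(scm_neg_fv):
    """SCM rule: predicted high viscosity when SCM_neg_Fv exceeds 1000."""
    return scm_neg_fv > 1000

def ml_high(mab_chg, hvi):
    """ML rule as read from the caption: 12 < mAb_chg < 32 and HVI > 17.3."""
    return 12 < mab_chg < 32 and hvi > 17.3

def exp_high(vis_cp):
    """Experimental label: high viscosity above 30 cP."""
    return vis_cp > 30

# 1 if the prediction matches the experimental label, else 0.
scm_flags = [int(scm_high(r[1]) == exp_high(r[4])) for r in ROWS]
ml_flags = [int(ml_high(r[2], r[3]) == exp_high(r[4])) for r in ROWS]

scm_acc = 100 * sum(scm_flags) / len(ROWS)
ml_acc = 100 * sum(ml_flags) / len(ROWS)
print(f"SCM ACC = {scm_acc:.0f}%, ML ACC = {ml_acc:.0f}%")  # 60% and 55%
```

Under this reading the recomputed flags match the SCM_pred and ML_pred columns row by row, which supports the interpretation of the thresholds.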
List of mAb properties and domains for feature selection of antibody viscosity. The structural features (SAP, SCM_pos and SCM_neg) are obtained as dynamic averages over MD trajectories. Other features are extracted from antibody sequences. Charge symmetry parameters are calculated for the Fv and mAb domains (2). The high viscosity index is calculated for the Fv domain (1). The remaining properties are calculated for the VH, VL, Fv and mAb domains (8 x 4 = 32). In total, there are 35 features for selection.
| Feature list | | | |
|---|---|---|---|
| mAb properties | Description | Domains | Description |
| Number of hydrophobic residues (N_phobic) | A,F,I,L,M,P,V,W | VH | H1-H113 |
| Number of hydrophilic residues (N_philic) | S,T,N,Q,Y,K,R,H,D,E | VL | L1-L107 |
| Number of positive residues (N_pos) | K,R,H | Fv | H1-H113 + |
| Number of negative residues (N_neg) | D,E | mAb | Full length |
| Net charges | Calculated by PROPKA3 | | |
| Charge symmetric parameter (CSP) | Product of heavy and light chain charge | | |
| Spatial aggregation propensity (SAP) | In-house program | | |
| Spatial positive charge map (SCM_pos) | In-house program | | |
| Spatial negative charge map (SCM_neg) | In-house program | | |
| High viscosity index (HVI) | In-house program | | |
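The four residue-count features can be computed directly from a sequence using the residue classes listed in the table. The sketch below is illustrative only: the function name and the example sequence fragment are hypothetical, and the structure-based features (SAP, SCM), the PROPKA3 net charges and HVI are not reproduced here.

```python
# Residue classes as listed in the feature table.
HYDROPHOBIC = set("AFILMPVW")
HYDROPHILIC = set("STNQYKRHDE")
POSITIVE = set("KRH")
NEGATIVE = set("DE")

def count_residue_features(sequence):
    """Count residues of each class in a one-letter amino acid sequence."""
    seq = sequence.upper()
    return {
        "N_phobic": sum(aa in HYDROPHOBIC for aa in seq),
        "N_philic": sum(aa in HYDROPHILIC for aa in seq),
        "N_pos": sum(aa in POSITIVE for aa in seq),
        "N_neg": sum(aa in NEGATIVE for aa in seq),
    }

# Hypothetical 25-residue fragment used only to exercise the function.
example = "EVQLVESGGGLVQPGGSLRLSCAAS"
features = count_residue_features(example)
print(features)
# {'N_phobic': 10, 'N_philic': 9, 'N_pos': 1, 'N_neg': 2}
```

Glycine and cysteine appear in neither class in the table, so N_phobic and N_philic need not sum to the sequence length. To obtain a per-domain feature such as N_phobic_VL, the function would be applied to the residues of that domain only.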
Accuracy (ACC) and area under the precision-recall curve (AUPRC) of the top five one-feature and two-feature combinations of the logistic regression (LR), support vector machine (SVM), k-nearest neighbors (KNN) and decision tree (DT) models for classifying low/high viscosity. There are 20 mAbs in this study. The ACC and AUPRC values are averaged over 100 randomly generated fourfold cross-validation sets. The baseline ACC is 0.70 and the baseline AUPRC is 0.30.
| Model | One-feature | ACC | AUPRC | Two-feature (feature 1) | Two-feature (feature 2) | ACC | AUPRC |
|---|---|---|---|---|---|---|---|
| LR | N_neg_VH | 0.79 | 0.57 | SCM_neg_VH | SCM_neg_VL | 0.86 | 0.70 |
| LR | SCM_neg_VL | 0.77 | 0.54 | N_neg_VH | SCM_neg_VL | 0.84 | 0.68 |
| LR | net charges_VH | 0.78 | 0.53 | N_neg_VH | net charges_VL | 0.83 | 0.67 |
| LR | N_neg_VL | 0.77 | 0.51 | SCM_neg_VL | SCM_pos_VH | 0.83 | 0.66 |
| LR | net charges_VL | 0.74 | 0.48 | net charges_VH | net charges_VL | 0.81 | 0.65 |
| SVM | N_neg_VH | 0.76 | 0.47 | N_philic_VH | SAP_pos_VL | 0.82 | 0.64 |
| SVM | net charges_VH | 0.74 | 0.46 | N_philic_Fv | SAP_pos_VL | 0.82 | 0.63 |
| SVM | SCM_neg_VL | 0.72 | 0.45 | N_philic_Fv | N_neg_VH | 0.82 | 0.60 |
| SVM | mAbCSP | 0.74 | 0.37 | N_phobic_VL | N_neg_VH | 0.82 | 0.60 |
| SVM | N_neg_VL | 0.70 | 0.34 | N_philic_VH | N_neg_VH | 0.81 | 0.58 |
| KNN | HVI | 0.76 | 0.59 | N_pos_VL | N_neg_VH | 0.83 | 0.66 |
| KNN | SAP_pos_VL | 0.82 | 0.65 | N_philic_Fv | FvCSP | 0.82 | 0.64 |
| KNN | SCM_neg_VL | 0.74 | 0.52 | N_pos_VL | net charges_VH | 0.83 | 0.64 |
| KNN | net charges_VH | 0.78 | 0.51 | SCM_neg_VH | SCM_neg_VL | 0.82 | 0.62 |
| KNN | N_neg_VH | 0.78 | 0.5 | N_philic_VH | FvCSP | 0.76 | 0.62 |
| DT | SAP_pos_VL | 0.73 | 0.52 | N_phobic_VL | SAP_pos_VL | 0.84 | 0.74 |
| DT | net charges_VH | 0.77 | 0.51 | N_neg_Fv | SCM_pos_VL | 0.78 | 0.60 |
| DT | N_neg_mAb | 0.79 | 0.51 | N_neg_mAb | net charges_VL | 0.77 | 0.60 |
| DT | SCM_pos_VL | 0.75 | 0.49 | N_neg_mAb | SCM_neg_VL | 0.76 | 0.58 |
| DT | N_neg_VH | 0.76 | 0.49 | N_phobic_VL | net charges_VH | 0.79 | 0.56 |
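The baseline values quoted for the 20-mAb set (ACC 0.70, AUPRC 0.30) are consistent with one common convention: the baseline accuracy is that of always predicting the majority class, and the baseline AUPRC is the fraction of positive (high-viscosity) samples. The record does not define its baselines, so this convention is an assumption; the sketch below applies it to the experimental viscosities listed earlier with the 30 cP cutoff.

```python
# Experimental viscosities at 150 mg/mL in cP for mAb1..mAb20, as listed in
# the viscosity classification table.
VIS_EXP = [16.64, 6.49, 7.30, 9.72, 7.03, 10.41, 23.33, 16.07, 6.23, 227.54,
           25.95, 108.25, 93.00, 102.46, 21.26, 115.60, 13.14, 13.63, 7.80, 48.86]

HIGH_VISCOSITY_CUTOFF_CP = 30

def baselines(values, cutoff):
    """Return (majority-class accuracy, positive fraction) for a binary split."""
    n = len(values)
    n_high = sum(v > cutoff for v in values)
    n_low = n - n_high
    baseline_acc = max(n_high, n_low) / n   # always predict the larger class
    baseline_auprc = n_high / n             # precision of a no-skill classifier
    return baseline_acc, baseline_auprc

base_acc, base_auprc = baselines(VIS_EXP, HIGH_VISCOSITY_CUTOFF_CP)
print(f"baseline ACC = {base_acc:.2f}, baseline AUPRC = {base_auprc:.2f}")
# baseline ACC = 0.70, baseline AUPRC = 0.30
```

With 6 of the 20 mAbs above the cutoff, any model is worth considering only if it clearly exceeds these two numbers, which is why AUPRC is reported alongside accuracy for this imbalanced dataset.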
Accuracy (ACC) and area under the precision-recall curve (AUPRC) of the top five one-feature and two-feature combinations of the logistic regression (LR), support vector machine (SVM), k-nearest neighbors (KNN) and decision tree (DT) models for classifying low/high viscosity. There are 20 mAbs in this study plus 27 mAbs from the literature. The ACC and AUPRC values are averaged over 100 randomly generated fourfold cross-validation sets. The baseline ACC is 0.74 and the baseline AUPRC is 0.26.
| Model | One-feature | ACC | AUPRC | Two-feature (feature 1) | Two-feature (feature 2) | ACC | AUPRC |
|---|---|---|---|---|---|---|---|
| LR | mAbCSP | 0.81 | 0.49 | N_phobic_VL | net charges_VL | 0.85 | 0.60 |
| LR | net charges_VL | 0.76 | 0.39 | N_phobic_VL | mAbCSP | 0.85 | 0.58 |
| LR | N_neg_VL | 0.77 | 0.37 | net charges_VL | HVI | 0.84 | 0.56 |
| LR | FvCSP | 0.76 | 0.36 | N_phobic_Fv | net charges_VL | 0.84 | 0.56 |
| LR | N_pos_VL | 0.75 | 0.35 | N_phobic_mAb | net charges_VL | 0.83 | 0.55 |
| SVM | mAbCSP | 0.81 | 0.47 | N_phobic_VL | net charges_VL | 0.83 | 0.53 |
| SVM | net charges_mAb | 0.77 | 0.37 | N_philic_mAb | mAbCSP | 0.83 | 0.51 |
| SVM | net charges_VL | 0.76 | 0.37 | net charges_mAb | mAbCSP | 0.83 | 0.50 |
| SVM | N_pos_VL | 0.73 | 0.29 | N_neg_VH | net charges_mAb | 0.82 | 0.49 |
| SVM | net charges_VH | 0.75 | 0.28 | net charges_VL | net charges_mAb | 0.82 | 0.49 |
| KNN | net charges_mAb | 0.78 | 0.47 | N_neg_Fv | net charges_VL | 0.85 | 0.57 |
| KNN | N_phobic_VH | 0.77 | 0.42 | net charges_VL | net charges_mAb | 0.82 | 0.53 |
| KNN | net charges_VL | 0.78 | 0.42 | net charges_VH | net charges_mAb | 0.82 | 0.53 |
| KNN | mAbCSP | 0.76 | 0.41 | N_philic_VL | net charges_VL | 0.82 | 0.53 |
| KNN | SAP_pos_VL | 0.73 | 0.39 | mAbCSP | HVI | 0.80 | 0.53 |
| DT | mAbCSP | 0.81 | 0.47 | N_phobic_VL | net charges_VL | 0.85 | 0.57 |
| DT | SAP_pos_mAb | 0.75 | 0.41 | net charges_VL | net charges_mAb | 0.84 | 0.56 |
| DT | net charges_mAb | 0.75 | 0.40 | N_philic_VL | net charges_VL | 0.84 | 0.54 |
| DT | net charges_VL | 0.76 | 0.39 | SAP_pos_mAb | FvCSP | 0.78 | 0.48 |
| DT | net charges_VH | 0.76 | 0.35 | SCM_pos_VL | mAbCSP | 0.80 | 0.48 |