SSH: A Tool for Predicting Hydrophobic Interaction of Monoclonal Antibodies Using Sequences.

Anthony Mackitz Dzisoo¹, Juanjuan Kang¹, Pengcheng Yao¹, Benjamin Klugah-Brown², Birga Anteneh Mengesha¹, Jian Huang¹.

Abstract

Therapeutic antibodies are among the most important products of the pharmaceutical industry. They are widely used to treat autoimmune diseases, cancer, inflammation, and infectious diseases. Their development, however, is often stalled, or made slower and more expensive, by hydrophobicity problems. Hydrophobic interactions can affect half-life, drug administration, and immunogenicity at all stages of antibody drug development. Some of the most widely accepted technologies for determining the hydrophobic interactions of antibodies are standup monolayer adsorption chromatography (SMAC), salt-gradient affinity-capture self-interaction nanoparticle spectroscopy (SGAC-SINS), and hydrophobic interaction chromatography (HIC). However, measuring SMAC, SGAC-SINS, and HIC for hundreds of antibody drug candidates is time-consuming and costly. To save time and money, we developed a predictor called SSH, which predicts the hydrophobic interactions of monoclonal antibodies (mAbs) from the antibody sequence alone. In the leave-one-out crossvalidation, SSH achieved 91.226% accuracy, 96.396% sensitivity (recall), 84.196% specificity, 87.754% precision, a Matthews correlation coefficient (MCC) of 0.828, an f-score of 0.919, and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.961.

Substances:
Antibodies, Monoclonal

Year: 2020 PMID: 32596302 PMCID: PMC7288208 DOI: 10.1155/2020/3508107

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Therapeutic antibodies are one of the developing areas of the pharmaceutical industry. Antibody drugs have been used to treat autoimmune diseases, cancer, inflammation, and infectious diseases. However, developing antibody candidates into therapeutic drugs is an expensive and risky process. Many monoclonal antibody (mAb) candidates have failed because of problems such as poor manufacturability, low stability and solubility, high viscosity, hydrophobicity, and aggregation propensity [1, 2]. Although these problems have various causes, hydrophobic interactions between antibodies, or between antibodies and container materials, have been shown to be the predominant one. Currently available wet lab methods for measuring the hydrophobic interactions of monoclonal antibodies include standup monolayer adsorption chromatography (SMAC), hydrophobic interaction chromatography (HIC), and affinity-capture self-interaction nanoparticle spectroscopy (AC-SINS). SMAC is used to assess the colloidal stability of antibodies under different buffer conditions. Antibodies with colloidal instability may be more likely to have nonspecific interactions, and hydrophobic interactions have been suggested to be the main mode of problematic interactions [3]. HIC is used to evaluate the solubility, viscosity, and serum clearance of antibodies, which are mainly influenced by the hydrophobicity of mAbs [4]. AC-SINS is widely used to detect antibody self-association [5, 6]. Although many physicochemical factors are involved in protein self-association, the presence of hydrophobic moieties on the protein surface is often the primary driver [7]. These methods offer a high-throughput solution to developability screening in early-stage antibody drug discovery. However, experimentally screening a large number of candidates is still expensive and time-consuming, so computational screening tools are urgently needed.
Computational methods, especially data mining and machine learning techniques, have been widely used in various aspects of biomedical studies [8-17]. The field of antibody drug development is no exception. There have been attempts to predict the viscosity, developability, crossinteraction, or self-interaction of antibodies [18-20]. There are also quite a few existing methods for predicting the hydrophobicity of proteins, including mAbs [21-23]. These methods are mostly based on the three-dimensional structures of proteins. A recent paper by Jain et al. describes a machine learning model for predicting delayed retention of antibodies in HIC from the sequence [24]. However, no web service is available for this model, nor for the other published methods for antibody hydrophobicity prediction. In our previous experience with predicting the crossinteraction or self-interaction of antibodies, combining data from different but relevant experimental assays gave better results than relying on data from a single assay. In this study, we combine data from SMAC, SGAC-SINS, and HIC, which are closely related to the hydrophobicity of antibodies, build a model using machine learning, and construct a web server called SSH. It predicts the hydrophobic interactions of antibodies based on their sequences alone. The server is freely available at http://i.uestc.edu.cn/eli/cgi-bin/ssh.pl. We believe it can benefit the antibody drug screening community by saving time, money, and resources.

2. Results

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold is varied, and the area under this curve (AUC) summarizes how well a binary classifier can predict new data. Because AUC combines sensitivity and specificity across all thresholds, it measures the overall performance of the model, that is, how well the model predicts both its negative and its positive data. Most binary classification studies use AUC to show how skewed the classification is toward specificity or sensitivity [25]. ROC analysis helps to illustrate how well a model performs on a dataset independent of the prediction threshold [26, 27]. As shown in Figure 1 and Table 1, our models achieved AUCs of 0.952, 0.967, 0.965, and 0.961 for SSH1, SSH2, SSH3, and SSH, respectively, which further indicates good performance of the classifiers.
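
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise calculation; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the SSH server.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels (1 = flagged antibody) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))
```

In this toy case one positive (0.4) is outscored by one negative (0.5), so 8 of the 9 positive-negative pairs are ordered correctly.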

Figure 1

ROC and AUC of our model from the leave-one-out crossvalidation.

Table 1

Statistical results of the SSH.

Metric	SSH1	SSH2	SSH3	SSH
Recall/sensitivity	97.297%	94.595%	97.297%	96.396%
Specificity	83.871%	87.097%	81.300%	84.073%
Accuracy	91.177%	92.647%	89.855%	91.226%
BAC	0.906	0.908	0.893	0.902
AUC	0.952	0.967	0.965	0.961
MCC	0.827	0.855	0.803	0.828

In addition, the sensitivity, or true positive rate (TPR), and the specificity, or true negative rate (TNR), show what fraction of the positive and negative data, respectively, were predicted correctly in the leave-one-out crossvalidation. Our ensemble model SSH correctly predicted 96.396% of the positive data and 84.073% of the negative data, as shown in Table 1. As shown in Figure 2, the heat map of the f-scores of the 8000 tripeptides for the 3 models, SSH1, SSH2, and SSH3, shows which tripeptides or amino acids contribute more to the predictive results.

Figure 2

Heat map of the 131 observations in the leave-one-out crossvalidation.

To determine which amino acids were the most predictive and the most important to model construction, we calculated the f-scores of the tripeptides. The amino acid frequency among the 30 tripeptides with the best f-scores is shown in Figure 3; tyrosine occurs most often and is the most important in model construction. Figure 4 shows the 30 tripeptides with the best f-scores.
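
The amino acid frequency described here amounts to counting residues across the top-ranked tripeptides. A minimal sketch follows; the tripeptides in `top_tripeptides` are hypothetical placeholders, not the 30 tripeptides of Figure 4, and `residue_frequency` is an illustrative name.

```python
from collections import Counter


def residue_frequency(tripeptides):
    """Count how often each amino acid occurs across a list of tripeptides."""
    counts = Counter()
    for tri in tripeptides:
        counts.update(tri)  # a string updates the counter one residue at a time
    return counts


# Hypothetical top-ranked tripeptides; NOT the ones reported in Figure 4.
top_tripeptides = ["YYG", "SYT", "GTY", "TSG"]
freq = residue_frequency(top_tripeptides)
print(freq.most_common())
```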

Figure 3

Amino acid frequency from the 30 best tripeptides' f-scores.

Figure 4

30 tripeptides with the best f-scores.

3. Discussion

In this study, machine learning methods were employed to predict the hydrophobic interactions of antibodies. Improper hydrophobic interactions can cause many problems in antibody drug development. The datasets were constructed according to the values of three biophysical assays. Our model SSH was trained with TPC and, in the leave-one-out crossvalidation, achieved an accuracy of 91.226%, with 96.396% sensitivity (recall), 84.100% specificity, 87.754% precision, 0.828 MCC, 0.919 f-score, and 0.961 AUC. This work makes it possible to accurately predict flags in antibodies caused by hydrophobic interactions and will help ease development and subsequent drug manufacturing. In our analysis, tyrosine, serine, threonine, and glycine are the four amino acids with the best f-scores, that is, the most predictive amino acids; tyrosine residues are abundant in the active sites of antibodies [28, 29]. The 96.396% sensitivity shows the ability of our model to correctly identify antibodies with “flags,” and the 84.10% specificity shows its ability to correctly identify antibodies without “flags.” The AUC of 0.961 and MCC of 0.828 show that our model predicts both the negative and the positive data well. To determine the prediction result of SSH, a voting method is applied to the probability values (p) of the three models SSH1, SSH2, and SSH3. SSH outputs a probability for each input antibody; the higher the probability, the more likely the antibody is to have hydrophobicity problems. Users can also set the threshold between 0 and 1, with a higher threshold meaning a stricter criterion. In summary, the predictor improves our understanding of how problems in antibodies can be detected to reduce cost and time, and the work shows that antibody drug candidates can be virtually screened on a large scale at an early stage of development.
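
The text does not spell out the exact voting rule, so the sketch below makes an explicit assumption: the three sub-model probabilities are averaged (soft voting) and the result is compared with a user-chosen threshold. The function names and the probabilities are hypothetical.

```python
def ensemble_probability(probs):
    """Average the per-model probabilities (soft voting; an assumed rule)."""
    if not probs:
        raise ValueError("need at least one model probability")
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return sum(probs) / len(probs)


def predict_flag(probs, threshold=0.5):
    """Return True when the combined probability reaches the threshold."""
    return ensemble_probability(probs) >= threshold


# Hypothetical outputs of SSH1, SSH2, and SSH3 for one antibody.
probs = [0.82, 0.64, 0.71]
print(ensemble_probability(probs))
print(predict_flag(probs, threshold=0.5))
print(predict_flag(probs, threshold=0.8))
```

With these made-up values the combined probability is about 0.72, so the antibody is reported as problematic at a threshold of 0.5 but not at the stricter threshold of 0.8.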

4. Dataset and Methods

4.1. Dataset

The antibody dataset was downloaded from the supplementary materials of the article published by Jain et al. [30]. The dataset includes 48 approved antibodies and 89 antibodies in phase 2 and phase 3 clinical trials; 6 entries were excluded due to conflicting sequences, and the remaining 131 antibodies were used to develop SSH. As in Jain et al., the 10% threshold was employed to determine whether an antibody has 1 or more “flags” (problems) according to the 3 assays, i.e., SMAC, SGAC-SINS, and HIC [30]. An antibody receives a flag for each assay whose value falls within the worst 10% for that assay; an antibody with no assay value in the worst 10% is deemed to have no flag. Of the 131 antibodies, 94 have no flag, 25 have exactly one flag, 8 have exactly two flags, and 4 have exactly three flags, as shown in Figure 5. The antibodies with no flags were used as the negative dataset, and the 37 antibodies with at least one flag were used as the positive dataset. Because the negative entries outnumber the positive ones, the datasets are not balanced. To solve this problem, we split the negative dataset randomly into three subsets with 31, 31, and 32 antibodies, respectively. Each subset was paired with the positive dataset, and the 3 resulting models were called SSH1, SSH2, and SSH3. An ensemble method then combines the 3 models into SSH by voting.
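
The balancing step can be sketched as follows. The identifiers are placeholders for the 94 unflagged antibodies and the 37 flagged ones (25 + 8 + 4), and the seed and function names are assumptions for illustration; the original random split is not reproduced here.

```python
import random


def split_negatives(negatives, sizes=(31, 31, 32), seed=0):
    """Shuffle the negative set and cut it into disjoint subsets of the given sizes."""
    if sum(sizes) != len(negatives):
        raise ValueError("subset sizes must add up to the number of negatives")
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(shuffled[start:start + size])
        start += size
    return subsets


def build_training_sets(positives, negatives, sizes=(31, 31, 32), seed=0):
    """Pair every negative subset with the full positive set."""
    return [(list(positives), subset)
            for subset in split_negatives(negatives, sizes, seed)]


# Placeholder identifiers standing in for the real antibodies.
positives = [f"pos_{i}" for i in range(37)]
negatives = [f"neg_{i}" for i in range(94)]
training_sets = build_training_sets(positives, negatives)
for name, (pos, neg) in zip(("SSH1", "SSH2", "SSH3"), training_sets):
    print(name, len(pos), len(neg))
```

Each sub-model then sees a roughly balanced training set (37 positives against 31 or 32 negatives), while every negative antibody is still used exactly once across the three models.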

Figure 5

Number of antibodies per flag of 131 antibodies.

4.2. Features and Feature Selection

The tripeptide composition (TPC) is widely used to convert sequences to vectors because it reflects both the sequence order and the overall amino acid composition. TPC gives better predictive results than single amino acid and dipeptide compositions [19, 31]. The TPC is extracted as

$$\mathrm{TPC}(i) = \frac{x(i)}{\sum_{j=1}^{8000} x(j)},$$

where i indexes one of the 8000 tripeptides and x(i) denotes the number of occurrences of tripeptide i in the sequence. The best features were then selected from the 8000 TPC features using fselect.py in LIBSVM, which uses f-scores to obtain the optimal features; given two sets of real numbers, the f-score technique measures the discrimination between the two sets [32]. Finally, 313, 315, and 315 features were used to build models SSH1, SSH2, and SSH3, respectively.
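
A minimal sketch of TPC extraction and per-feature f-score ranking is given below. It assumes that TPC is the count of each tripeptide divided by the total number of tripeptides in the sequence, and it uses the widely used f-score definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). The function names and the toy inputs are illustrative; this is not the LIBSVM script itself.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 20^3 = 8000
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return the 8000-dimensional vector of normalized tripeptide frequencies."""
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    for k in range(len(sequence) - 2):
        idx = INDEX.get(sequence[k:k + 3].upper())
        if idx is not None:  # skip windows containing non-standard residues
            counts[idx] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid tripeptide")
    return [c / total for c in counts]


def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the positive and negative sets."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two samples")
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (sum((v - mean_pos) ** 2 for v in pos_values) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg_values) / (n_neg - 1))
    # Returning 0.0 for zero within-class variance is a simplification.
    return between / within if within > 0 else 0.0


# Toy fragment, not a full antibody sequence.
vec = tripeptide_composition("EVQLVESGGG")
print(len(vec), vec[INDEX["EVQ"]])
# Hypothetical values of one feature in the positive and negative sets.
print(f_score([0.9, 1.1, 1.0], [0.1, 0.2, 0.0]))
```

Ranking all 8000 features by this score and keeping the top few hundred would mirror the selection step described above, under the stated assumptions.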

4.3. The Threshold Method

The threshold method generates the indexes used to group antibodies into the negative and positive datasets, as shown in Table 2. The 10% threshold is calculated from N, the number of antibodies; X(i), the ith antibody's assay value; and y = 10% × N.

Table 2

Threshold values of 3 assays [30].

Assays	Threshold values	Units (flags)
Standup monolayer adsorption chromatography (SMAC)	12.8	Retention time (min) (>)
Salt-gradient affinity-capture self-interaction nanoparticle spectroscopy (SGAC-SINS)	370	Salt concentration (mM) (<)
Hydrophobic interaction chromatography (HIC)	11.7	Retention time (min) (>)
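
Applying the thresholds in Table 2 can be sketched as below. The threshold values and flag directions come from the table; the function names and the two sets of assay readings are hypothetical.

```python
# Thresholds and flag directions from Table 2.
THRESHOLDS = {
    "SMAC": (12.8, "greater"),      # retention time (min), flagged when above
    "SGAC-SINS": (370.0, "less"),   # salt concentration (mM), flagged when below
    "HIC": (11.7, "greater"),       # retention time (min), flagged when above
}


def count_flags(assay_values):
    """Count how many assay values fall on the flagged side of their threshold."""
    flags = 0
    for assay, (threshold, direction) in THRESHOLDS.items():
        value = assay_values[assay]
        if direction == "greater" and value > threshold:
            flags += 1
        elif direction == "less" and value < threshold:
            flags += 1
    return flags


def label(assay_values):
    """Positive (1) if the antibody carries at least one flag, otherwise negative (0)."""
    return 1 if count_flags(assay_values) >= 1 else 0


# Hypothetical assay readings for two antibodies.
clean = {"SMAC": 8.7, "SGAC-SINS": 900.0, "HIC": 9.5}
sticky = {"SMAC": 13.4, "SGAC-SINS": 300.0, "HIC": 10.2}
print(count_flags(clean), label(clean))
print(count_flags(sticky), label(sticky))
```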

4.4. Support Vector Machine (SVM)

The support vector machine (SVM) classifies data by finding the best hyperplane separating two classes of data points, that is, the hyperplane with the largest margin between the two classes. The margin is the maximal width of the slab parallel to the hyperplane that contains no interior data points. SVM is an effective supervised machine learning method, based on statistical learning theory, for classifying binary and multiclass data, and it has been widely used in bioinformatics. We employed LIBSVM [33] with the “RBF” kernel and the leave-one-out crossvalidation to develop SSH, using the parameters C = 2, 128, and 512 and g = 0.0078125, 0.0001220703125, and 0.0001220703125 for SSH1, SSH2, and SSH3, respectively.
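
For reference, the RBF kernel is K(x, z) = exp(-g ||x - z||^2). The sketch below evaluates it with the reported g values and shows that the reported C and g are exact powers of two, which is consistent with (though not proof of) a log2 grid search. The dictionary layout, function name, and feature vectors are illustrative assumptions.

```python
import math

# (C, g) pairs reported for the three sub-models.
PARAMS = {
    "SSH1": (2.0, 0.0078125),
    "SSH2": (128.0, 0.0001220703125),
    "SSH3": (512.0, 0.0001220703125),
}


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Express each reported parameter as a power of two.
for name, (c, gamma) in PARAMS.items():
    print(name, "C = 2^%d" % round(math.log2(c)), "g = 2^%d" % round(math.log2(gamma)))

# Two toy feature vectors (hypothetical values).
x = [0.10, 0.00, 0.05]
z = [0.00, 0.10, 0.05]
print(rbf_kernel(x, z, PARAMS["SSH1"][1]))

# With the LIBSVM command-line tools, the SSH1 settings would correspond to
# something like `svm-train -t 2 -c 2 -g 0.0078125 <training_file>`; how the
# authors actually invoked LIBSVM is not stated, so treat this as an assumption.
```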

4.5. Performance Evaluation of SSH

To measure the performance of SSH, the leave-one-out crossvalidation was used with the following measures: sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), accuracy (ACC), and AUC. Precision is the proportion of predicted positive cases that are correct. Accuracy alone is not a sufficient measure of a model, however; the MCC should also be reported to evaluate the prediction performance of the developed tool (Equation (6)). MCC is another measure used in machine learning for judging the quality of binary classifications and is often considered one of the most robust measures for class prediction methods.

$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{6}$$

where TP is true positive, FN is false negative, TN is true negative, and FP is false positive. Figure 6 shows the benchmark of SSH; the 10% threshold method is used to label the negative and positive data.
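
The measures named above can be computed from the four confusion counts gathered in a leave-one-out loop, as in the following minimal sketch. The classifier plugged in is a deliberately simple stand-in (a cut halfway between the class means of one feature), not an SVM, and the data are hypothetical. BAC is taken to be balanced accuracy, (SN + SP) / 2, and the f-score here is the harmonic mean of precision and recall; both readings are assumptions, since neither is defined in the text.

```python
import math


def _ratio(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp, fn, tn, fp):
    """Compute the evaluation measures from the four confusion counts."""
    sn = _ratio(tp, tp + fn)
    sp = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    return {
        "SN": sn,
        "SP": sp,
        "ACC": _ratio(tp + tn, tp + fn + tn + fp),
        "precision": precision,
        "BAC": (sn + sp) / 2,
        "f_score": _ratio(2 * precision * sn, precision + sn),
        "MCC": _ratio(tp * tn - fp * fn,
                      math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


def leave_one_out(samples, labels, fit, predict):
    """Hold out each sample once, train on the rest, and tally (TP, FN, TN, FP)."""
    tp = fn = tn = fp = 0
    for i in range(len(samples)):
        model = fit(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        pred = predict(model, samples[i])
        if labels[i] == 1:
            tp, fn = (tp + 1, fn) if pred == 1 else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if pred == 0 else (tn, fp + 1)
    return tp, fn, tn, fp


def fit_midpoint(xs, ys):
    """Stand-in model: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(cut, x):
    """Predict positive when the feature value lies above the cut."""
    return 1 if x > cut else 0


# Hypothetical one-feature dataset (1 = flagged antibody).
samples = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
counts = leave_one_out(samples, labels, fit_midpoint, predict_midpoint)
metrics = confusion_metrics(*counts)
print(counts)
print(metrics)
```

Swapping `fit_midpoint` and `predict_midpoint` for calls into an SVM library would give the same evaluation loop around the actual classifier.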

Figure 6

Benchmark of SSH.

30 in total

1. Prediction of protein retention times in hydrophobic interaction chromatography by robust statistical characterization of their atomic-level surface properties.

Authors: Alexander T Hanke; Marieke E Klijn; Peter D E M Verhaert; Luuk A M van der Wielen; Marcel Ottens; Michel H M Eppink; Emile J A X van de Sandt
Journal: Biotechnol Prog Date: 2016-01-20

Review 2. Methods of calculating protein hydrophobicity and their application in developing correlations to predict hydrophobic interaction chromatography retention.

Authors: Andrea Mahn; M Elena Lienqueo; J Cristian Salgado
Journal: J Chromatogr A Date: 2008-12-06 Impact factor: 4.759

3. Tyrosine plays a dominant functional role in the paratope of a synthetic antibody derived from a four amino acid code.

Authors: Frederic A Fellouse; Pierre A Barthelemy; Robert F Kelley; Sachdev S Sidhu
Journal: J Mol Biol Date: 2005-12-19 Impact factor: 5.469

4. MM-6mAPred: identifying DNA N6-methyladenine sites based on Markov model.

Authors: Cong Pian; Guangle Zhang; Fei Li; Xiaodan Fan
Journal: Bioinformatics Date: 2020-01-15 Impact factor: 6.937

5. Biophysical properties of the clinical-stage antibody landscape.

Authors: Tushar Jain; Tingwan Sun; Stéphanie Durand; Amy Hall; Nga Rewa Houston; Juergen H Nett; Beth Sharkey; Beata Bobrowicz; Isabelle Caffry; Yao Yu; Yuan Cao; Heather Lynaugh; Michael Brown; Hemanta Baruah; Laura T Gray; Eric M Krauland; Yingda Xu; Maximiliano Vásquez; K Dane Wittrup
Journal: Proc Natl Acad Sci U S A Date: 2017-01-17 Impact factor: 11.205

6. Models for Antibody Behavior in Hydrophobic Interaction Chromatography and in Self-Association.

Authors: Max Hebditch; Aisling Roche; Robin A Curtis; Jim Warwicker
Journal: J Pharm Sci Date: 2018-11-23 Impact factor: 3.534

7. RIscoper: a tool for RNA-RNA interaction extraction from the literature.

Authors: Yang Zhang; Tianyuan Liu; Liqun Chen; Jinxurong Yang; Jiayi Yin; Yuncong Zhang; Zhixi Yun; Hao Xu; Lin Ning; Fengbiao Guo; Yongshuai Jiang; Hao Lin; Dong Wang; Yan Huang; Jian Huang
Journal: Bioinformatics Date: 2019-09-01 Impact factor: 6.937

8. PSBinder: A Web Service for Predicting Polystyrene Surface-Binding Peptides.

Authors: Ning Li; Juanjuan Kang; Lixu Jiang; Bifang He; Hao Lin; Jian Huang
Journal: Biomed Res Int Date: 2017-12-27 Impact factor: 3.411

9. PhD7Faster 2.0: predicting clones propagating faster from the Ph.D.-7 phage display library by coupling PseAAC and tripeptide composition.

Authors: Bifang He; Heng Chen; Jian Huang
Journal: PeerJ Date: 2019-06-17 Impact factor: 2.984