Comprehensive Evaluation and Comparison of Machine Learning Methods in QSAR Modeling of Antioxidant Tripeptides.

Zhenjiao Du1, Donghai Wang2, Yonghui Li1.   

Abstract

Due to their multiple beneficial effects, antioxidant peptides have attracted increasing interest. Currently, the screening and identification of bioactive peptides, including antioxidative peptides, by wet-chemistry methods are time-consuming and rely heavily on advanced instruments and trained personnel. Quantitative structure-activity relationship (QSAR) analysis, as an in silico method, can be more efficient and cost-effective. However, the performance of QSAR models for antioxidant peptides has remained poor because few model development approaches have been attempted. The objective of this study was to compare popular machine learning methods for antioxidant activity modeling and screening of tripeptides and to identify the critical amino acid features that determine the antioxidant activity. A total of 553 numerical indices of amino acids were adopted to characterize 130 tripeptides with known antioxidant activity from the published literature, and then 7 feature selection strategies plus pairwise correlation were used to screen the most important indices for antioxidant activity and model building. Fourteen machine learning methods were used to build models based on each of the feature selection strategies. Among the 98 models, non-linear regression methods tended to perform better, and the best model, with an R2Test of 0.847 and an RMSETest of 0.627 for tripeptide antioxidants, was obtained by combining random forest for feature selection and tree-based extreme gradient boosting regression for model development. Based on the predicted antioxidant values of 7870 unknown tripeptides, the tripeptides with potentially high antioxidant activity all have a tyrosine, tryptophan, or cysteine residue at the C-terminal position. Furthermore, the predicted antioxidant activity of six synthesized tripeptides was confirmed through experimental determination, and for the first time, the cysteine or tyrosine residue at the C-terminus was found to be critical to the antioxidant activity based on both QSAR models and experimental observations.
© 2022 The Authors. Published by American Chemical Society.

Year:  2022        PMID: 35910147      PMCID: PMC9330208          DOI: 10.1021/acsomega.2c03062

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

Antioxidants are useful in reducing and preventing the harmful effects of in vivo free radicals, which induce cardiovascular diseases, cancers, and aging-related disorders, by donating electrons to neutralize them.[1−3] Due to their multiple benefits, food protein-derived antioxidative peptides have gained increasing attention from today’s consumers and researchers.[4−6] Various in vitro antioxidant assays have been developed to evaluate antioxidant capacity; they fall broadly into two categories, namely, single-electron-transfer (SET) reaction-based and hydrogen-atom-transfer (HAT) reaction-based assays.[7] In vitro assays based on the SET reaction are generally preferred due to their convenience and accuracy.[8] Conventional ways of screening peptides with high antioxidant activity are based on sequential and rigorous wet-chemistry steps, such as enzymatic hydrolysis and/or microbial fermentation to release or produce peptides, in vitro antioxidant assays to determine antioxidant activity, and advanced chromatography and spectroscopy (e.g., high-performance liquid chromatography-mass spectrometry) to purify and identify potential peptides.[9] There are also studies that directly synthesized multiple peptides for screening on the basis of theoretical knowledge (e.g., literature information on antioxidative peptides and the important amino acids in peptides that contribute to antioxidant activity).[10−14] Up to now, some high-activity peptides have been found, such as Cys-Gln-Cys and Pro-His-His.[14,15] However, these conventional wet-chemistry methods for the preparation, fractionation, purification, and identification or synthesis of antioxidative peptides and for screening potentially high-activity peptides are time-consuming and rely heavily on advanced instruments and trained personnel.[2,5,9]

Quantitative structure–activity relationship (QSAR) analysis is a computational modeling method for revealing relationships between the chemical structures of molecules and their bioactivity.[16] In QSAR analysis, peptides are encoded by a series of numerical values, including properties of the amino acid residues comprising the peptides (hydrophobicity, polarity, topological information, etc.) and properties of the entire peptides (electronegativity, sequence information, solubility, molecular weight, topological information, etc.).[17,18] Then, feature selection and modeling methods are combined to connect the structure information and bioactivity.[17,19] More than 80 amino acid descriptors (AADs) extracted from properties of amino acids by principal component analysis (PCA) have been proposed to characterize peptide structures and encode the peptides.[20−23] However, directly using these AADs usually led to undesirable model performance since most of them were not intended for antioxidant activity modeling (e.g., the T-scale was developed for angiotensin-converting enzyme inhibitory activity).[8,11,19,22,24−30] Machine learning methods have been successfully applied for feature selection and model development in QSAR studies on peptide bioactivity (e.g., angiotensin-converting enzyme inhibitory activity).[8,17,22,23,31−33] A total of 566 numerical indices of amino acids, covering physicochemical and biochemical properties of amino acids and pairs of amino acids, are available in the AAIndex database.[18] This makes it possible to use feature selection to find the individual variables that are important for bioactivity prediction, in contrast to AADs from PCA, in which each principal component is a combination of many original variables.
In addition, a growing number of studies on antioxidant peptides has allowed the compilation of data sets on their structures and activities.[8,12,14,15] Most previous studies on antioxidant peptides still focused on linear regression models, which limits model fitting to some extent because of the synergistic effects among the residues in peptides.[8,11,24,25,34,35] Data set division was another issue in most studies: the samples were sorted in descending or ascending order of activity, and training and test data sets were evenly selected from the sorted sequence (e.g., the first three samples for the training data set and the following one for the test data set). This overly even data set division strategy would undermine model robustness, since the bias in the test data set could lead to poor model performance in cross validation compared with that on the test data set.[8,11,14,32] Previous studies reported that tripeptides exhibit higher antioxidant activity and better bioavailability than other oligopeptides and have diverse structural variations.[14,15] The objective of this study was to compare popular machine learning methods for antioxidant activity modeling and screening of tripeptides and to identify the critical amino acid features that determine the antioxidant activity (Figure 1). A total of 130 tripeptides with Trolox-equivalent antioxidant capacity (TEAC) values (SET reaction-based) were manually collected from published studies for QSAR model development. Further, 553 numerical indices were first screened by pairwise correlation, followed by comparative evaluation using 7 different feature selection strategies. A description of the important feature variables among the 553 numerical indices was developed.
A total of 14 different advanced regression methods, including both linear and non-linear methods, were used to develop models based on the extracted important variables, and the best model was used to predict tripeptides with high antioxidant potential for future study. The performance of the 14 regression methods was compared and discussed. Six tripeptides were synthesized and characterized for antioxidant activity to further evaluate the model performance in practical applications. The generalizability of these models was further tested by repeating the random data set splitting 20 times and by introducing leave-one-group-out cross validation. This study provides a useful approach for screening the key factors that influence the antioxidant activity of tripeptides and a guideline for the future application of various machine learning methods in QSAR modeling.
Figure 1

Flowchart for QSAR modeling and validation of antioxidant tripeptides.


Materials and Methods

Data Set Collection

A total of 566 numerical indices of amino acids were collected using Beautiful Soup (4.5.3) from the AAIndex,[18,36] and a detailed definition and description of each index is available online (https://www.genome.jp/aaindex/). The indices with missing values for amino acids were deleted, resulting in a total of 553 remaining indices (Table S1). Next, 130 antioxidant tripeptides were manually collected from BIOPEP-UWM,[37] and their activities, analyzed by the 2,2′-azino-bis(3-ethylbenzothiazoline-6-sulfonic acid) (ABTS) radical scavenging activity assay, were obtained from the published literature and expressed as TEAC values (μmol TE/μmol peptide) (Table 1).[8,12,38] The tripeptides with no antioxidant activity (i.e., 0 μmol TE/μmol peptide) were all retained in the data set for further QSAR model development.
Table 1

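Sequences and TEAC Values (μmol TE/μmol Peptide) of the Tripeptide Data Set from the Literature.

no. | sequence | activity | no. | sequence | activity | no. | sequence | activity
1 | LHA | 0 | 47 | PHN | 0.24 | 93 | PWR | 0.822
2 | LHD | 0 | 48 | LWF | 0.25 | 94 | PWI | 0.832
3 | LHE | 0 | 49 | PWD | 0.262 | 95 | RWG | 0.842
4 | LHF | 0 | 50 | LVG | 0.266 | 96 | LWN | 0.866
5 | LHG | 0 | 51 | PHH | 0.266 | 97 | LWR | 0.869
6 | LHH | 0 | 52 | PWE | 0.339 | 98 | PWL | 0.88
7 | LHQ | 0 | 53 | PHI | 0.344 | 99 | PWT | 0.9
8 | PHA | 0 | 54 | PHQ | 0.348 | 100 | PWN | 0.943
9 | PHD | 0 | 55 | GHG | 0.365 | 101 | RWH | 0.995
10 | PHE | 0 | 56 | LWD | 0.402 | 102 | RWQ | 0.995
11 | PHF | 0 | 57 | LWG | 0.406 | 103 | KHP | 1.143
12 | PHM | 0 | 58 | RHS | 0.409 | 104 | GVR | 1.157
13 | RHA | 0 | 59 | PWA | 0.414 | 105 | ECG | 1.413
14 | RHD | 0 | 60 | GHP | 0.426 | 106 | PHW | 1.768
15 | RHE | 0 | 61 | PWS | 0.44 | 107 | PWW | 1.774
16 | RHH | 0 | 62 | PWV | 0.457 | 108 | RWW | 1.837
17 | RHK | 0 | 63 | RWD | 0.485 | 109 | LHW | 1.84
18 | RHQ | 0 | 64 | LWM | 0.49 | 110 | LWW | 1.931
19 | RHT | 0 | 65 | PHG | 0.496 | 111 | WPL | 1.972
20 | PHT | 0.028 | 66 | RWA | 0.497 | 112 | VPW | 1.972
21 | LHM | 0.031 | 67 | PWM | 0.498 | 113 | RHW | 2.203
22 | LHN | 0.046 | 68 | LWV | 0.499 | 114 | LWY | 2.332
23 | GVT | 0.047 | 69 | RWV | 0.51 | 115 | RWY | 2.334
24 | PHS | 0.058 | 70 | LWL | 0.515 | 116 | RHY | 2.464
25 | KHR | 0.067 | 71 | LWQ | 0.519 | 117 | PHY | 2.707
26 | GHT | 0.079 | 72 | LWS | 0.522 | 118 | LHY | 2.753
27 | LWH | 0.098 | 73 | LWA | 0.594 | 119 | PWY | 2.785
28 | LHK | 0.108 | 74 | RWS | 0.6 | 120 | GVW | 4.365
29 | LHR | 0.108 | 75 | RHF | 0.6 | 121 | GKW | 4.687
30 | LHT | 0.108 | 76 | LWT | 0.627 | 122 | GHW | 4.745
31 | LHV | 0.108 | 77 | LWI | 0.628 | 123 | QVW | 5.161
32 | RHR | 0.118 | 78 | LWK | 0.629 | 124 | KVW | 5.218
33 | PHK | 0.176 | 79 | PWH | 0.632 | 125 | NKW | 5.349
34 | LHL | 0.186 | 80 | PWK | 0.634 | 126 | NHW | 5.368
35 | RHI | 0.189 | 81 | PWQ | 0.637 | 127 | QHW | 5.524
36 | PHV | 0.198 | 82 | RWR | 0.651 | 128 | KHW | 5.566
37 | PWF | 0.202 | 83 | RWT | 0.651 | 129 | PYW | 5.683
38 | PWG | 0.203 | 84 | RWE | 0.663 | 130 | YHW | 6.169
39 | RHG | 0.203 | 85 | LHS | 0.68 | | |
40 | RHL | 0.206 | 86 | RWF | 0.689 | | |
41 | RHM | 0.207 | 87 | RWL | 0.689 | | |
42 | RHN | 0.208 | 88 | RWI | 0.702 | | |
43 | PHR | 0.211 | 89 | RWM | 0.702 | | |
44 | RHV | 0.212 | 90 | RWN | 0.702 | | |
45 | LHI | 0.217 | 91 | RWK | 0.753 | | |
46 | PHL | 0.238 | 92 | LWE | 0.777 | | |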

Data Processing

Pre-processing of Numerical Indices of Amino Acids

The pairwise correlation method was used to pre-screen collinearity of the 553 numerical indices (Table S2 and Figure S1). If the absolute value of Pearson’s correlation coefficient between 2 indices was greater than 0.95, one of them was removed randomly due to the strong correlation.[39] The remaining numerical indices were standardized for further feature selection.
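The pre-screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the helper names (`pearson`, `drop_correlated`, `standardize`) and the toy values are made up, the random choice of which index to remove is emulated by visiting the indices in shuffled order, and z-score scaling is assumed for the standardization.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)  # assumes neither index is constant

def drop_correlated(indices, threshold=0.95, seed=0):
    """Keep an index only if |r| <= threshold against every index kept so far.

    Visiting the indices in shuffled order makes the removed member of each
    highly correlated pair a random choice (an assumption of this sketch).
    """
    names = list(indices)
    random.Random(seed).shuffle(names)
    kept = []
    for name in names:
        if all(abs(pearson(indices[name], indices[k])) <= threshold for k in kept):
            kept.append(name)
    return sorted(kept)

def standardize(values):
    """Scale one index to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Toy data: three made-up indices over five residues (not AAindex values).
toy = {
    "idx_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "idx_b": [2.0, 4.0, 6.0, 8.0, 10.1],   # almost a multiple of idx_a
    "idx_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(toy)
scaled = {name: standardize(toy[name]) for name in kept}
```

In the toy data, `idx_a` and `idx_b` are nearly collinear, so exactly one of them survives, while `idx_c` is always kept.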

Tripeptide Encoding and Feature Selection

The pre-screened numerical indices of amino acids were used to encode tripeptides. Briefly, if “n” numerical indices were selected after pre-processing, each amino acid was encoded as a “1 × n” vector (so a tripeptide was encoded as a “3 × n” matrix). The matrix was then flattened into a “1 × 3n” vector, in which elements 1 to n belong to the N-terminal residue, elements n + 1 to 2n to the central amino acid, and elements 2n + 1 to 3n to the C-terminal residue.[8] After encoding, each tripeptide was represented by 3n variables, which were further screened by feature selection methods to identify the key variables for antioxidant activity prediction as amino acid descriptors. Six representative feature selection methods were evaluated, namely, linear regression-based recursive feature elimination (RFE), named RFE-LR;[40] support vector machine regression (SVR)-based RFE, named RFE-SVR;[40] random forest regression (RFR)-based RFE, named RFE-RFR;[40] feature coefficient (FC) based on lasso regression, named FC-LR;[40] feature importance based on RFR, named FI-RFR;[40] and feature importance based on extreme gradient boosting (XGB) regression, named FI-XGB.[41] The detailed mathematical methodologies of these selection methods are available in the scikit-learn documentation (https://scikit-learn.org/). After the feature selection, all the encoded tripeptides in the entire data set were transformed into the new feature-encoded version as the X-matrix (variables) for further model development, with the tripeptide activity values as responses (Y-vector). Furthermore, the encoded tripeptides without feature selection were also directly used as the X-matrix (variables) for model development and compared with the models developed by feature selection methods.
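As an illustration of the encoding scheme described above, the following minimal sketch concatenates the per-residue index values in N-terminal, central, C-terminal order. The function name, the index names, and the numbers in `toy_table` are hypothetical placeholders rather than AAindex entries.

```python
def encode_tripeptide(sequence, index_table, index_names):
    """Return the 1 x 3n vector for a tripeptide.

    index_table maps an index name to a dict of per-residue values;
    index_names fixes the order of the n indices within each position.
    """
    if len(sequence) != 3:
        raise ValueError("expected a tripeptide (three residues)")
    vector = []
    for residue in sequence:  # N-terminal, central, C-terminal
        vector.extend(index_table[name][residue] for name in index_names)
    return vector

# Hypothetical values for two indices and three residues (n = 2).
toy_table = {
    "idx_1": {"G": 0.1, "H": 0.5, "W": 0.9},
    "idx_2": {"G": -1.0, "H": 0.0, "W": 1.0},
}
vector = encode_tripeptide("GHW", toy_table, ["idx_1", "idx_2"])
```

With n = 2, the tripeptide GHW becomes a vector of 3n = 6 values: the first two describe G (N-terminal), the next two H (central), and the last two W (C-terminal). Because each variable is tied to a position, a feature selection method can keep an index for one position and discard it for another.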

QSAR Model Development

Data Set Division

In total, 130 samples were used for model development, cross validation, and model evaluation. The transformed X-matrix and Y-vector were shuffled and randomly split into a training data set and a test data set at a ratio of 3:1. The training data set comprised 98 samples, which were used to build the models. The remaining 32 samples were used as the test data set to evaluate the performance of the models. Leave-one-out cross-validation (LOOCV) was used to split validation samples from the training data set.
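The split and the LOOCV folds can be sketched with the standard library alone. This is a sketch under assumptions: the function names and the seed are illustrative, and the training size is passed explicitly as 98 because a 3:1 ratio of 130 samples does not divide evenly.

```python
import random

def split_indices(n_samples, n_train, seed=0):
    """Shuffle sample indices and cut them into training and test parts."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return order[:n_train], order[n_train:]

def leave_one_out(train_idx):
    """Yield (fit_indices, held_out_index) pairs, one per training sample."""
    for held_out in train_idx:
        yield [i for i in train_idx if i != held_out], held_out

train_idx, test_idx = split_indices(130, 98)
folds = list(leave_one_out(train_idx))
```

Each of the 98 folds fits on 97 training samples and validates on the one left out; the 32 test samples are never touched during cross validation.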

Regression Models

Fourteen popular regression models available through scikit-learn (https://scikit-learn.org/) and XGBoost (https://xgboost.readthedocs.io/en/stable/) were comparatively evaluated, namely, tree-based XGBoost regression (tree-XGB),[41] linear-based XGBoost regression (linear-XGB),[41] random forest regression (RFR), gradient boosting decision tree regression (GBDT),[42] decision tree-based bagging regression (Bagging),[43] multi-layer perceptron regression (MLP),[44] nearest neighbor regression (KNN),[40] radial basis function kernel-based support vector machine regression (rbf-SVR),[40] linear kernel-based support vector machine regression (linear-SVR),[40] linear regression with L1 regularization (Lasso),[45] linear regression with L2 regularization (Ridge),[40] linear regression by minimizing a regularized empirical loss with stochastic gradient descent (SGD),[40] ridge regression with kernel trick (KernelRidge),[46] and Huber regression (Huber).[47]

QSAR Model Building and Optimization

The model building was conducted using Python 3.8.8 on a computer running macOS Monterey 12.0.1 (Intel Core i5 CPU, 2.3 GHz). Models were imported from the scikit-learn and XGBoost packages.[40,41] Because of the small data set size, LOOCV was used to avoid overfitting and to tune the hyperparameters.[8,14,33] For each method, the hyperparameters with the best LOOCV performance were used in the final model, which was then evaluated with the test data set.
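How LOOCV can drive hyperparameter selection is sketched below for a hand-written k-nearest-neighbor regressor. The regressor, the candidate grid, and the toy data are assumptions chosen to keep the example dependency-free; the study itself used the scikit-learn and XGBoost implementations, and the text does not list the grids that were searched.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Mean response of the k nearest training samples (Euclidean distance)."""
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in ranked[:k]) / k

def loocv_rmse(X, y, k):
    """RMSE over all leave-one-out predictions for a given k."""
    squared_errors = []
    for i in range(len(X)):
        fit_X = X[:i] + X[i + 1:]   # every sample except the i-th
        fit_y = y[:i] + y[i + 1:]
        prediction = knn_predict(fit_X, fit_y, X[i], k)
        squared_errors.append((prediction - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def tune_k(X, y, candidates):
    """Pick the candidate k with the lowest LOOCV RMSE."""
    return min(candidates, key=lambda k: loocv_rmse(X, y, k))

# Toy one-feature data with a linear response (hypothetical values).
toy_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
toy_y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
best_k = tune_k(toy_X, toy_y, [1, 2, 3])
```

The selected value would then be used to refit the model on the whole training data set before the single evaluation on the test data set.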

Model Performance Evaluation

The coefficient of determination (R2) and the root mean square error (RMSE) were used to evaluate the model performance. R2 and RMSE from the training data set, LOOCV, and test data set were named R2Train and RMSETrain, R2CV and RMSECV, and R2Test and RMSETest, respectively. To further evaluate the model generalizability, the developed models with the tuned hyperparameters were rebuilt 20 times with different random data set splits and evaluated using R2 and RMSE from the training data set, LOOCV, leave-one-group-out cross validation (LOGOCV), and the test data set. The results of this extra evaluation are available in Table S3.
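For reference, the two metrics can be written out directly. The sketch below uses the usual definitions, R2 = 1 − SS_res/SS_tot and RMSE = sqrt(mean squared error); whether the study computed them through a library routine is not stated, so the function names here are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]
perfect = r2(observed, [1.0, 2.0, 3.0])      # predictions equal observations
mean_only = r2(observed, [2.0, 2.0, 2.0])    # always predicting the mean
reversed_fit = r2(observed, [3.0, 2.0, 1.0]) # worse than predicting the mean
```

Under this definition R2 is not bounded below by zero: a model that predicts worse than the mean of the observations yields a negative value, which is how negative R2CV entries in a results table should be read.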

Prediction of Unpublished Tripeptides with Antioxidant Activity from the Models

A data set containing 7870 potential tripeptides was built, and the published 130 tripeptides used for model building and validation were not included. After obtaining the model with the best performance, the 7870 tripeptides were encoded by the selected features and used for the antioxidant activity prediction based on the selected model.
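The count of 7870 is consistent with enumerating every sequence of three standard amino acids (20 × 20 × 20 = 8000) and removing the 130 published tripeptides. The sketch below assumes that this is how the candidate pool was built; the function name is illustrative, and `known` holds only three of the published sequences for brevity.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues

def candidate_tripeptides(known):
    """All tripeptides over the 20 standard residues that are not in `known`."""
    every_tripeptide = ("".join(p) for p in product(AMINO_ACIDS, repeat=3))
    return [t for t in every_tripeptide if t not in known]

known = {"LHA", "PHN", "YHW"}  # a small subset of the published data set
candidates = candidate_tripeptides(known)
```

Passing all 130 distinct published sequences instead of this subset would leave 8000 − 130 = 7870 candidates.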

Model Application for Antioxidant Tripeptide Selection and Tripeptide Synthesis

The prediction results for the antioxidant activity of the 7870 unknown tripeptides showed that tyrosine, tryptophan, and cysteine at the C-terminal position were favorable to the antioxidant capacity. To cover a diverse range of peptides, some unfavorable residues were also selected when designing tripeptide sequences for model validation. Six tripeptides, namely, QAY, PHC, YPQ, VYV, GPE, and YSQ, were synthesized by Genscript Corp (Piscataway, NJ, USA) or purchased from Sigma-Aldrich (St. Louis, MO, USA). The purity of the tripeptides was above 95%, and the sequences were validated by liquid chromatography–mass spectrometry.

Characterization of Antioxidant Activity of the Synthesized Tripeptides

The ABTS radical scavenging capacity assay was based on the method described by Phongthai et al. and Chen et al.[8,48] with a few modifications. Briefly, the stock solution was prepared by mixing 7.4 mM ABTS and 2.6 mM potassium persulfate in deionized (DI) water and incubating at room temperature for 12 h. The working solution was made by diluting the stock solution with DI water until the absorbance of the ABTS•+ solution at 734 nm was 0.70 ± 0.02. Then, 150 μL of ABTS•+ solution was mixed with 50 μL of tripeptide solution (20 μM) and incubated for 30 min at 30 °C, after which the absorbance was measured at 734 nm using a Biotek Synergy H1 Hybrid Microplate Reader (Winooski, VT, USA). Trolox (TE) was used as a standard antioxidant, and results were expressed as μmol TE/μmol peptide. All the chemicals and reagents used were of analytical grade and purchased from Sigma-Aldrich (St. Louis, MO, USA).

Results

Model Development Based on Variables Selected by FI-XGB

Five variables were selected by FI-XGB with a feature importance threshold of 0.01 (Table 2) and then used to encode the 130 tripeptides as the X-matrix (i.e., 130 × 5). Based on the variable importance results, C-terminal residues contributed the most to the antioxidant activity (Y-vector), while the central amino acids contributed the least to the activity. Among the 14 QSAR models (Table 3), tree-XGB achieved the best performance, with an R2Test and RMSETest of 0.814 and 0.692, respectively. The next best model was based on RFR (R2Test = 0.807 and RMSETest = 0.698), while the R2Test values of the remaining models were all below 0.8, which was less desirable. In general, the non-linear regression methods, including GBDT, MLP, Bagging, KNN, and rbf-SVR, achieved better performance than the linear regression methods, such as linear-XGB, linear-SVR, Lasso, Ridge, SGD, and Huber.
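The positional comparison can be checked by summing the FI-XGB importances reported in Table 2 for each residue position. The snippet below is a worked tally under two assumptions: the threshold is treated as inclusive (importance ≥ 0.01), and the function name is illustrative.

```python
from collections import defaultdict

# (AAindex accession number, amino acid position, variable importance) from Table 2, FI-XGB.
fi_xgb = [
    ("BURA740101", "N-terminal", 0.0199),
    ("CHOP780215", "N-terminal", 0.161),
    ("BEGF750102", "central", 0.036),
    ("KANM800103", "C-terminal", 0.0138),
    ("LIFS790103", "C-terminal", 0.7049),
]

def importance_by_position(rows, threshold=0.01):
    """Sum the importances of the retained variables for each position."""
    totals = defaultdict(float)
    for _, position, importance in rows:
        if importance >= threshold:  # inclusive threshold is an assumption
            totals[position] += importance
    return dict(totals)

totals = importance_by_position(fi_xgb)
```

The sums are about 0.72 for the C-terminal position, 0.18 for the N-terminal position, and 0.04 for the central position, which matches the ordering described in the text.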
Table 2

Amino Acid Positions, Variable Importance, and Description of the Selected Variables from Different Feature Selection Strategies (a)

AAindex accession number | amino acid position | variable importance | description | note

selected variables by FI-XGB
BURA740101 | N-terminal | 0.0199 | normalized frequency of the alpha-helix |
CHOP780215 | N-terminal | 0.161 | frequency of the 4th residue in turn | A
BEGF750102 | central | 0.036 | conformational parameter of the beta-structure |
KANM800103 | C-terminal | 0.0138 | average relative probability of the inner helix |
LIFS790103 | C-terminal | 0.7049 | conformational preference for antiparallel beta-strands | B

selected variables by FI-RFR
PALJ810113 | N-terminal | 0.025 | normalized frequency of turn in the all-alpha class |
ONEK900102 | N-terminal | 0.0108 | helix formation parameters (delta delta G) |
FUKS010101 | N-terminal | 0.015 | surface composition of amino acids in intracellular proteins of thermophiles (percent) |
JOND750102 | C-terminal | 0.0171 | pK (-COOH) |
LIFS790103 | C-terminal | 0.0518 | conformational preference for antiparallel beta-strands | B
MCMT640101 | C-terminal | 0.0286 | refractivity |
NAKH920102 | C-terminal | 0.0688 | AA composition of CYT2 of single-spanning proteins |
OOBM850102 | C-terminal | 0.037 | optimized propensity to form reverse turn | C
WEBA780101 | C-terminal | 0.0371 | RF value in high-salt chromatography | D
VINM940102 | C-terminal | 0.051 | normalized flexibility parameters (B-values) for each residue surrounded by none rigid neighbors |
PARS000101 | C-terminal | 0.0367 | p-Values of mesophilic proteins based on the distributions of B values | N
PARS000102 | C-terminal | 0.0768 | p-Values of thermophilic proteins based on the distributions of B values | K
FODM020101 | C-terminal | 0.0416 | free energy change of epsilon(i) to alpha(Rh) | E
MITS020101 | C-terminal | 0.1532 | amphiphilicity index | F
DIGM050101 | C-terminal | 0.0563 | hydrostatic pressure asymmetry index, PAI | G

selected variables by FC-LR
MAXF760103 | N-terminal | 0.025 | normalized frequency of zeta R |
NAKH900102 | N-terminal | 0.0371 | SD of AA composition of total proteins |
QIAN880114 | N-terminal | 0.051 | weights for beta-sheet at the window position of -6 |
KHAG800101 | central | 0.0367 | the Kerr-constant increments |
CHOP780215 | C-terminal | 0.0768 | frequency of the 4th residue in turn | A
OOBM850102 | C-terminal | 0.0416 | optimized propensity to form reverse turn | C
WEBA780101 | C-terminal | 0.0153 | RF value in high-salt chromatography | D
MITS020101 | C-terminal | 0.0563 | amphiphilicity index | F

selected variables by RFE-LR
WERD780102 | N-terminal | 0.3136 | free energy change of epsilon(i) to epsilon(ex) |
AURR980107 | N-terminal | 0.9357 | normalized positional residue frequency at helix termini N2 |
AURR980111 | N-terminal | 2.0325 | normalized positional residue frequency at helix termini C5 | H
AURR980116 | N-terminal | 1.5792 | normalized positional residue frequency at helix termini Cc |
CEDJ970105 | N-terminal | 0.2991 | composition of amino acids in nuclear proteins (percent) | I
KARS160120 | N-terminal | 0.5644 | weighted minimum eigenvalue based on the atomic numbers |
CHOC760104 | central | 0.8074 | proportion of residues 100% buried |
GEIM800110 | central | 0.6058 | aperiodic indices for beta-proteins | J
QIAN880136 | central | 0.9265 | weights for coil at the window position of 3 |
KARS160113 | central | 0.7146 | weighted domination number using the atomic number |
CHOP780215 | C-terminal | 0.8292 | frequency of the 4th residue in turn | A
GEIM800110 | C-terminal | 0.0872 | aperiodic indices for beta-proteins | J
HUTJ700101 | C-terminal | 0.271 | heat capacity |
HUTJ700103 | C-terminal | 0.3975 | entropy of formation |
KARP850102 | C-terminal | 0.6556 | flexibility parameter for one rigid neighbor |
NAKH900110 | C-terminal | 1.1114 | normalized composition of membrane proteins |
WILM950102 | C-terminal | 0.66043 | hydrophobicity coefficient in RP-HPLC, C8 with 0.1%TFA/MeCN/H2O |

selected variables by RFE-SVR
CHAM820102 | N-terminal | 0.303 | free energy of solution in water, kcal/mole |
NAKH920101 | N-terminal | 0.4642 | AA composition of CYT of single-spanning proteins |
RICJ880114 | N-terminal | 0.1628 | relative preference value at C1 |
PARS000102 | N-terminal | 0.2303 | p-Values of thermophilic proteins based on the distributions of B values | K
CEDJ970105 | N-terminal | 0.1578 | composition of amino acids in nuclear proteins (percent) | I
GEOR030105 | N-terminal | 0.0656 | linker propensity from small data set (linker length is less than six residues) | L
GEIM800106 | central | 0.2531 | beta-strand indices for beta-proteins |
NAKH900108 | central | 0.1859 | normalized composition from fungi and plant |
PALJ810116 | central | 0.1345 | normalized frequency of turn in alpha/beta class | M
GEOR030105 | central | 0.1183 | linker propensity from small data set (linker length is less than six residues) | L
CHOP780215 | C-terminal | 0.1984 | frequency of the 4th residue in turn | A
OOBM850102 | C-terminal | 0.3586 | optimized propensity to form reverse turn | C
PALJ810116 | C-terminal | 0.1983 | normalized frequency of turn in alpha/beta class | M
WERD780104 | C-terminal | 0.2361 | free energy change of epsilon(i) to alpha (Rh) |
PARS000101 | C-terminal | 0.3056 | p-Values of mesophilic proteins based on the distributions of B values | N
MITS020101 | C-terminal | 0.0605 | amphiphilicity index | F
DIGM050101 | C-terminal | 0.1109 | hydrostatic pressure asymmetry index, PAI | G

selected variables by RFE-RFR
CHOP780215 | N-terminal | 0.0336 | frequency of the 4th residue in turn | A
ISOY800108 | N-terminal | 0.0294 | normalized relative frequency of coil |
MAXF760104 | N-terminal | 0.0341 | normalized frequency of left-handed alpha-helix |
GEOR030105 | N-terminal | 0.0486 | linker propensity from small data set (linker length is less than six residues) | L
KARS160122 | N-terminal | 0.0362 | weighted second smallest eigenvalue of the weighted Laplacian matrix |
QIAN880127 | central | 0.0362 | weights for coil at the window position of -6 |
AURR980111 | central | 0.0291 | normalized positional residue frequency at helix termini C5 | H
LIFS790103 | C-terminal | 0.1127 | conformational preference for antiparallel beta-strands | B
MCMT640101 | C-terminal | 0.0969 | refractivity |
OOBM850102 | C-terminal | 0.0462 | optimized propensity to form reverse turn | C
WEBA780101 | C-terminal | 0.0245 | RF value in high-salt chromatography | D
PARS000102 | C-terminal | 0.0745 | p-Values of thermophilic proteins based on the distributions of B values | K
FODM020101 | C-terminal | 0.1246 | free energy change of epsilon(i) to alpha(Rh) | E
MITS020101 | C-terminal | 0.2131 | amphiphilicity index | F
DIGM050101 | C-terminal | 0.0603 | hydrostatic pressure asymmetry index, PAI | G

(a) Note: Detailed information on these selected variables is available at https://www.genome.jp/aaindex/. The same capital letter in the last column indicates the same amino acid feature.

Table 3

Performance of 14 QSAR Models Based on the Different Feature Selection Strategies (a)

Columns R2Train, RMSETrain, R2CV, and RMSECV refer to the training data set; R2Test and RMSETest refer to the test data set.

model | R2Train | RMSETrain | R2CV | RMSECV | R2Test | RMSETest | note

QSAR models based on FI-XGB
tree-XGB | 0.955 | 0.295 | 0.911 | 0.416 | 0.814 | 0.692 | ***
linear-XGB | 0.566 | 0.921 | 0.478 | 1.01 | 0.558 | 1.067 |
RFR | 0.956 | 0.295 | 0.924 | 0.386 | 0.807 | 0.698 | **
GBDT | 0.976 | 0.219 | 0.904 | 0.434 | 0.78 | 0.752 | *
Bagging | 0.974 | 0.226 | 0.904 | 0.434 | 0.769 | 0.77 |
MLP | 0.961 | 0.276 | 0.847 | 0.548 | 0.77 | 0.769 |
KNN | 0.84 | 0.559 | 0.598 | 0.887 | 0.555 | 1.069 |
rbf-SVR | 0.965 | 0.263 | 0.831 | 0.574 | 0.726 | 0.84 |
linear-SVR | 0.387 | 1.095 | 0.355 | 1.123 | 0.345 | 1.298 |
Lasso | 0.575 | 0.912 | 0.473 | 1.015 | 0.59 | 1.027 |
Ridge | 0.566 | 0.921 | 0.478 | 1.01 | 0.557 | 1.068 |
SGD | 0.516 | 0.973 | 0.424 | 1.062 | 0.49 | 1.146 |
KernelRidge | 0.074 | 1.346 | -0.073 | 1.448 | 0.206 | 1.429 |
Huber | 0.567 | 0.92 | 0.478 | 1.01 | 0.559 | 1.064 |

QSAR models based on FI-RFR
tree-XGB | 0.954 | 0.3 | 0.872 | 0.5 | 0.847 | 0.627 | ***
linear-XGB | 0.789 | 0.643 | 0.722 | 0.738 | 0.681 | 0.906 |
RFR | 0.928 | 0.375 | 0.842 | 0.556 | 0.854 | 0.613 | **
GBDT | 0.978 | 0.207 | 0.866 | 0.512 | 0.781 | 0.75 |
Bagging | 0.962 | 0.274 | 0.833 | 0.571 | 0.822 | 0.677 |
MLP | 0.976 | 0.219 | 0.82 | 0.592 | 0.773 | 0.764 |
KNN | 0.933 | 0.362 | 0.832 | 0.573 | 0.814 | 0.691 |
rbf-SVR | 0.954 | 0.3 | 0.832 | 0.574 | 0.844 | 0.632 | *
linear-SVR | 0.78 | 0.655 | 0.709 | 0.755 | 0.623 | 0.984 |
Lasso | 0.796 | 0.632 | 0.714 | 0.748 | 0.685 | 0.901 |
Ridge | 0.779 | 0.657 | 0.721 | 0.739 | 0.679 | 0.909 |
SGD | 0.789 | 0.642 | 0.724 | 0.735 | 0.674 | 0.916 |
KernelRidge | 0.279 | 1.187 | -0.118 | 1.479 | 0.295 | 1.346 |
Huber | 0.792 | 0.637 | 0.719 | 0.742 | 0.682 | 0.904 |

QSAR models based on FC-LR
tree-XGB | 0.977 | 0.214 | 0.883 | 0.477 | 0.707 | 0.868 |
linear-XGB | 0.827 | 0.582 | 0.775 | 0.663 | 0.783 | 0.748 | **
RFR | 0.983 | 0.182 | 0.923 | 0.389 | 0.652 | 0.946 |
GBDT | 0.991 | 0.134 | 0.928 | 0.377 | 0.626 | 0.981 |
Bagging | 0.989 | 0.145 | 0.92 | 0.396 | 0.681 | 0.905 |
MLP | 0.975 | 0.221 | 0.771 | 0.669 | 0.763 | 0.781 |
KNN | 0.94 | 0.342 | 0.835 | 0.568 | 0.813 | 0.693 | ***
rbf-SVR | 0.988 | 0.155 | 0.741 | 0.711 | 0.716 | 0.855 |
linear-SVR | 0.815 | 0.602 | 0.756 | 0.691 | 0.782 | 0.75 |
Lasso | 0.817 | 0.598 | 0.759 | 0.686 | 0.739 | 0.819 |
Ridge | 0.821 | 0.592 | 0.779 | 0.658 | 0.777 | 0.757 |
SGD | 0.826 | 0.584 | 0.771 | 0.669 | 0.785 | 0.743 |
KernelRidge | 0.319 | 1.155 | 0.034 | 1.375 | 0.37 | 1.272 |
Huber | 0.829 | 0.579 | 0.759 | 0.687 | 0.786 | 0.741 | *

QSAR models based on RFE-LR
tree-XGB | 0.951 | 0.31 | 0.801 | 0.624 | 0.773 | 0.764 |
linear-XGB | 0.849 | 0.542 | 0.752 | 0.697 | 0.78 | 0.752 |
RFR | 0.939 | 0.345 | 0.793 | 0.636 | 0.737 | 0.823 |
GBDT | 0.986 | 0.164 | 0.821 | 0.592 | 0.8 | 0.718 | *
Bagging | 0.976 | 0.217 | 0.815 | 0.601 | 0.766 | 0.775 |
MLP | 0.979 | 0.202 | 0.868 | 0.509 | 0.824 | 0.672 | ***
KNN | 0.859 | 0.526 | 0.749 | 0.701 | 0.627 | 0.98 |
rbf-SVR | 0.993 | 0.118 | 0.774 | 0.666 | 0.569 | 1.053 |
linear-SVR | 0.887 | 0.47 | 0.781 | 0.654 | 0.634 | 0.97 |
Lasso | 0.89 | 0.464 | 0.774 | 0.664 | 0.653 | 0.945 |
Ridge | 0.908 | 0.425 | 0.814 | 0.602 | 0.77 | 0.769 |
SGD | 0.787 | 0.645 | 0.684 | 0.786 | 0.768 | 0.773 |
KernelRidge | 0.24 | 1.219 | 0.004 | 1.395 | 0.328 | 1.315 |
Huber | 0.915 | 0.407 | 0.831 | 0.575 | 0.819 | 0.681 | **

QSAR models based on RFE-SVR
tree-XGB | 0.945 | 0.329 | 0.893 | 0.457 | 0.772 | 0.766 |
linear-XGB | 0.844 | 0.553 | 0.756 | 0.691 | 0.759 | 0.787 |
RFR | 0.955 | 0.295 | 0.939 | 0.346 | 0.758 | 0.788 |
GBDT | 0.982 | 0.187 | 0.891 | 0.462 | 0.811 | 0.696 |
Bagging | 0.992 | 0.126 | 0.947 | 0.321 | 0.778 | 0.756 |
MLP | 0.979 | 0.202 | 0.882 | 0.48 | 0.846 | 0.628 | ***
KNN | 0.924 | 0.386 | 0.897 | 0.449 | 0.839 | 0.643 | *
rbf-SVR | 0.996 | 0.095 | 0.835 | 0.568 | 0.666 | 0.927 |
linear-SVR | 0.922 | 0.39 | 0.829 | 0.579 | 0.809 | 0.701 |
Lasso | 0.844 | 0.552 | 0.756 | 0.691 | 0.759 | 0.787 |
Ridge | 0.859 | 0.525 | 0.806 | 0.616 | 0.83 | 0.662 |
SGD | 0.916 | 0.405 | 0.834 | 0.569 | 0.886 | 0.541 |
KernelRidge | 0.329 | 1.145 | 0.156 | 1.285 | 0.448 | 1.191 |
Huber | 0.926 | 0.381 | 0.84 | 0.559 | 0.84 | 0.642 | **

QSAR models based on RFE-RFR
tree-XGB | 0.978 | 0.205 | 0.931 | 0.367 | 0.828 | 0.665 | ***
linear-XGB | 0.852 | 0.539 | 0.786 | 0.647 | 0.704 | 0.872 |
RFR | 0.976 | 0.219 | 0.937 | 0.349 | 0.808 | 0.703 |
GBDT | 0.989 | 0.145 | 0.935 | 0.358 | 0.815 | 0.689 | *
Bagging | 0.992 | 0.122 | 0.939 | 0.345 | 0.799 | 0.719 |
MLP | 0.98 | 0.197 | 0.89 | 0.465 | 0.817 | 0.686 | **
KNN | 0.966 | 0.259 | 0.915 | 0.409 | 0.791 | 0.734 |
rbf-SVR | 0.996 | 0.089 | 0.924 | 0.385 | 0.801 | 0.716 |
linear-SVR | 0.852 | 0.539 | 0.761 | 0.684 | 0.699 | 0.88 |
Lasso | 0.861 | 0.521 | 0.778 | 0.66 | 0.706 | 0.869 |
Ridge | 0.867 | 0.51 | 0.783 | 0.651 | 0.702 | 0.875 |
SGD | 0.863 | 0.518 | 0.787 | 0.647 | 0.703 | 0.874 |
KernelRidge | 0.382 | 1.099 | 0.105 | 1.324 | 0.144 | 1.484 |
Huber | 0.86 | 0.523 | 0.788 | 0.643 | 0.706 | 0.869 |

QSAR models without feature selection
tree-XGB | 0.987 | 0.161 | 0.860 | 0.523 | 0.705 | 0.87 |
linear-XGB | 0.927 | 0.378 | 0.786 | 0.647 | 0.746 | 0.807 | *
RFR | 0.946 | 0.324 | 0.892 | 0.459 | 0.749 | 0.803 | ***
GBDT | 0.992 | 0.126 | 0.898 | 0.447 | 0.744 | 0.811 | **
Bagging | 0.929 | 0.374 | 0.419 | 1.066 | 0.404 | 1.238 |
MLP | 0.773 | 0.666 | 0.621 | 0.861 | 0.619 | 0.99 |
KNN | 0.996 | 0.089 | 0.752 | 0.697 | 0.628 | 0.978 |
rbf-SVR | 0.926 | 0.381 | 0.709 | 0.754 | 0.734 | 0.827 |
linear-SVR | 0.893 | 0.457 | 0.765 | 0.678 | 0.731 | 0.831 |
Lasso | 0.936 | 0.355 | 0.758 | 0.688 | 0.744 | 0.811 |
Ridge | 0.463 | 1.025 | 0.160 | 1.281 | 0.263 | 1.377 |
SGD | 0.411 | 1.073 | -0.081 | 1.454 | 0.073 | 1.544 |
KernelRidge | 0.936 | 0.355 | 0.757 | 0.69 | 0.743 | 0.812 |
Huber | 0.981 | 0.195 | 0.858 | 0.526 | 0.743 | 0.812 |

(a) Note: Detailed descriptions of these models are available at https://scikit-learn.org/stable/ and https://xgboost.readthedocs.io/en/stable/. (*) More stars in the last column indicate better performance among the models built with the same feature selection method.


Model Development Based on Variables Selected by FI-RFR

Fifteen variables were selected by FI-RFR with a threshold (feature importance = 0.01) (Table 2) and then used to encode the 130 tripeptides as the X-matrix (130 × 15). Based on the variable importance, C-terminal residues again contributed the most to the antioxidant activity (Y-vector), while the central amino acids contributed little according to these selected variables. Among the 14 QSAR models (Table 3), tree-XGB achieved the best overall performance (R2Test = 0.847 and RMSETest = 0.627), followed by RFR, rbf-SVR, Bagging, and KNN, although RFR and rbf-SVR performed less well in LOOCV than on the test data set. Among the remaining models, for which R2Test was below 0.8, the non-linear regression methods (GBDT and MLP) still performed better than the linear regression methods.

Model Development Based on Variables Selected by FC-LR

Eight variables were selected by FC-LR with a threshold (feature coefficient = 0.01) (Table 2) and then used to encode the 130 tripeptides as the X-matrix (130 × 8). Based on the variable importance, C-terminal residues contributed the most to the antioxidant activity (Y-vector), while the central amino acids contributed the least to the activity. Model performances of the 14 different regression methods are shown in Table 3. KNN achieved the best performance on the test data set (R2Test = 0.813 and RMSETest = 0.693), while the R2Test values of the remaining models were all less than 0.8 (Table 3). Among the remaining models, linear regression methods (linear-XGB, linear-SVR, Lasso, Ridge, SGD, and Huber) achieved better performance than the non-linear regression methods.

Model Development Based on Variables Selected by RFE-LR

Recursive feature elimination (RFE) eliminates the variable with the lowest feature importance or feature coefficient in each iteration, and the procedure is repeated recursively on the pruned data set until the desired number of features is reached. Seventeen variables were selected by RFE-LR (Table 2) and then used to encode the 130 tripeptides as the X-matrix (130 × 17). Based on the variable importance, N-terminal residues contributed the most to the antioxidant activity (Y-vector), while the central amino acids contributed the least to the activity. Model performances of the 14 different regression methods are shown in Table 3. MLP achieved the best performance on the test data set (R2Test = 0.824 and RMSETest = 0.672), followed by Huber (R2Test = 0.819 and RMSETest = 0.681). GBDT also provided a good result, with an R2Test of 0.8. With RFE-LR, the linear methods linear-XGB, Ridge, and SGD achieved performance competitive with that of non-linear regression methods such as KNN, rbf-SVR, and RFR.
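The elimination loop can be sketched with NumPy and ordinary least squares. This is a simplified stand-in for illustration, not the code used in the study (scikit-learn provides this procedure as the `RFE` class in `sklearn.feature_selection`); the function name and the synthetic data are assumptions, and the features are assumed to be on a comparable scale so that coefficient magnitudes can be ranked.

```python
import numpy as np

def rfe_linear(X, y, n_features_to_select):
    """Recursive feature elimination with ordinary least squares.

    Each round refits a linear model on the remaining columns and drops the
    column whose coefficient has the smallest absolute value.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Append a column of ones so the last coefficient is the intercept.
        design = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[:-1])))  # ignore the intercept
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=60)
selected = rfe_linear(X, y, n_features_to_select=2)
```

Refitting after every removal is what distinguishes RFE from a single-pass importance threshold: a variable that looks weak only because it is correlated with another one gets re-evaluated once its partner has been removed.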

Model Development Based on Variables Selected by RFE-SVR

Seventeen variables were selected by RFE-SVR (Table ) and then used to encode the 130 tripeptides as the X-matrix (130 × 17). Based on the variable importance, C-terminal and N-terminal residues contributed almost equally to the antioxidant activity (Y-vector), while the central amino acids contributed less. Model performances of the 14 different regression methods are shown in Table . SGD, a linear regression method, achieved the best performance on the test data set (R2Test = 0.886 and RMSEtest = 0.541), although its performance in LOOCV was lower. MLP and Huber were the next best models, with R2Test larger than 0.84. KNN and linear-SVR also performed well. With the variables selected by RFE-SVR, there was no obvious difference between the linear and non-linear regression methods.

Model Development Based on Variables Selected by RFE-RFR

Fifteen variables were selected by RFE-RFR (Table ) and then used to encode the 130 tripeptides as the X-matrix (130 × 15). Based on the variable importance, C-terminal residues contributed the most to the antioxidant activity (Y-vector), while the central amino acids contributed the least. Model performances of the 14 different regression methods are shown in Table . Tree-XGB achieved the best performance, with R2Test and RMSEtest of 0.828 and 0.665, respectively. Among the remaining models, the non-linear regression methods, even the weakest of them (KNN), outperformed the linear regression methods.

Model Development without Feature Selection

A total of 1026 variables were used to encode the 130 tripeptides as the X-matrix (130 × 1026). Model performances of the 14 different regression methods are shown in Table . RFR achieved the best model performance, with R2Test and RMSEtest of 0.749 and 0.803, respectively. Substantial overfitting was observed in the MLP model, where R2Train was 0.929 but R2Test was only 0.404.

Based on the optimal values of R2cv and R2test, tree-XGB based on FI-RFR was used to predict the antioxidant activity of the 7870 unpublished tripeptides (Table S4). A total of 178 tripeptides with a C-terminal tyrosine were predicted to possess the highest antioxidant activity, 6.1672 μmol TE/μmol peptide, followed by 167 tripeptides with a C-terminal tryptophan (6.1147 μmol TE/μmol peptide). Tripeptides with a C-terminal cysteine were predicted to have an antioxidant activity of 6.0230 μmol TE/μmol peptide. Among the remaining tripeptides, no amino acid residue was clearly preferred at any specific position.
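
The figure of 7870 is consistent with the 20 × 20 × 20 = 8000 possible tripeptides of the standard amino acids minus the 130 with published activity. The sketch below enumerates that sequence space and groups it by C-terminal residue. It assumes the 130 known tripeptides are distinct and composed of the 20 standard residues; the names `AMINO_ACIDS` and `by_c_terminal` are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

# Every possible tripeptide as a three-letter string.
all_tripeptides = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]

N_KNOWN = 130  # tripeptides with published activity, per the text
n_unknown = len(all_tripeptides) - N_KNOWN


def by_c_terminal(peptides):
    """Count peptides per C-terminal (last) residue."""
    return Counter(p[-1] for p in peptides)


counts = by_c_terminal(all_tripeptides)
```

Each C-terminal residue accounts for 400 sequences in the full space, so the 178 top-ranked C-terminal tyrosine tripeptides are a subset of the tripeptides ending in tyrosine rather than all of them.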

Application of the QSAR Model in Synthetic Tripeptide Activity Prediction

The experimental antioxidant activity of the synthetic tripeptides is summarized with their corresponding predicted activity in Table . QAY was predicted to be the most powerful antioxidant peptide (6.167 μmol TE/μmol peptide), while its observed activity ranked second (4.270 μmol TE/μmol peptide) among the six synthesized tripeptides. PHC was also predicted to exhibit strong antioxidant activity (6.023 μmol TE/μmol peptide), and its observed activity (5.013 μmol TE/μmol peptide) was even stronger than that of QAY. Overall, the QSAR model was useful for selecting tripeptides with potentially high antioxidant activity, although it overestimated the antioxidant activity relative to the experimental results.
Table 4

Antioxidant Activity of Synthesized Tripeptides.

| Synthetic tripeptide | Observed activity (μmol TE/μmol peptide) | Predicted activity (μmol TE/μmol peptide) |
| --- | --- | --- |
| QAY | 4.270 ± 0.124 | 6.167 |
| PHC | 5.013 ± 0.184 | 6.023 |
| YSQ | 3.736 ± 0.024 | 5.696 |
| YPQ | 3.028 ± 0.173 | 5.696 |
| VYV | 3.601 ± 0.039 | 4.837 |
| GPE | 0.598 ± 0.099 | 2.741 |
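
The overestimation noted above can be quantified directly from the Table 4 values. The sketch below computes the per-peptide difference, the mean bias, and the RMSE between predicted and observed activity. The mean observed values are used as given, the ± spread is ignored, and the variable names are hypothetical.

```python
import math

# (peptide, observed mean, predicted), in μmol TE/μmol peptide, from Table 4.
TABLE_4 = [
    ("QAY", 4.270, 6.167),
    ("PHC", 5.013, 6.023),
    ("YSQ", 3.736, 5.696),
    ("YPQ", 3.028, 5.696),
    ("VYV", 3.601, 4.837),
    ("GPE", 0.598, 2.741),
]

# Positive difference means the model predicted more activity than observed.
diffs = {name: pred - obs for name, obs, pred in TABLE_4}

mean_bias = sum(diffs.values()) / len(diffs)
rmse = math.sqrt(sum(d ** 2 for d in diffs.values()) / len(diffs))

print(f"mean overestimation: {mean_bias:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Every difference is positive, so the overestimation is systematic across the six peptides rather than driven by a single outlier.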

Discussion

Various numerical indices were screened and selected by the six different feature selection methods. Based on the variable importance values, almost all the feature selection methods showed that the C-terminal residues played the most important role in antioxidant activity, while the central amino acid contributed the least. This is partly consistent with previous wet-chemistry and QSAR studies, which did not compare the N- and C-terminals.[12,24,31] Previous studies were confined to amino acid physicochemical properties (about 195 indices) or to AADs, and so could not take full advantage of all the amino acid indices to identify those most representative for characterizing tripeptides.[8,11,32] Although some of the selected features, especially the non-physicochemical ones (e.g., LIFS790103, “Conformational preference for antiparallel beta-strands”), may be difficult to interpret, the selected features are more targeted and less redundant.[8] For AADs derived from PCA, each principal component combines multiple original properties, and several principal components are usually used in model development. Such components can only be roughly interpreted (e.g., the first component was related to hydrophobicity), which impedes further explanation of feature importance and hinders the use of these features in peptide design and modification.[22,27] Even though some of the features selected here are difficult to explain, each is determined by a standard protocol, which should make them easier to apply in the structural design and modification of bioactive peptides.

Among the selected features, some, such as CHOP780215, LIFS790103, OOBM850102, and WEBA780101, were selected multiple times by different feature selection strategies to characterize C-terminal residues, which indicates their importance in antioxidant activity prediction. CHOP780215 was not only selected for encoding C-terminal residues by FC-LR, RFE-LR, and RFE-SVR but also selected by RFE-RFR and FI-XGB to characterize N-terminal residues. Some features were used to encode more than one position: GEOR030105 and PALJ810116, selected by RFE-SVR, encoded both the central and N-terminal residues, and GEIM800110, selected by RFE-LR, encoded both the central and C-terminal residues. This implies that some amino acid features can contribute to antioxidant activity at any position, even though their importance varies with position. The theoretical conclusion derived from the selected features was also supported by the study of Uno et al.[31]

For the models without feature selection, inferior performance was observed, as shown in Table . The poor performance was mainly due to the high dimensionality of the features combined with the small sample size. Feature selection therefore improved model performance significantly because many irrelevant features were eliminated.[17] Across the 14 regression methods, non-linear regression methods overall achieved better model performance with the 6 feature selection methods, which indicated non-linearity in antioxidant activity prediction. This also explains the poor model performance in most previous studies, which were based on linear regression methods.[11,31,32] In addition, some studies subjectively removed the non-active tripeptides from the data set in order to improve the model fit, which resulted in misleading models.[8,32] Further, improper division of the data into training and test sets increased the bias in the model and undermined its robustness,[8] while model evaluation without a test data set was incomplete because performance in cross-validation cannot represent the real performance of the model on unknown data.[32] In this study, these biases were all avoided, and the performance was greatly improved compared with the most recent studies on tripeptides.[8,32] In fact, during the preliminary study we also developed models using the biased data processing from the literature, and R2test could be larger than 0.9559, which further indicated the bias in the previous studies.

Abnormal behavior was observed in some models (e.g., rbf-SVR regression based on FI-RFR), where performance in LOOCV was poorer than that on the test data set. This implied overfitting of these models, and the same situation is difficult to avoid in bioactivity prediction since LOOCV is an optimistic cross-validation method.[8] n-fold cross-validation was not used in our study, primarily because of the relatively small data set. To further evaluate the generalizability of these models, we introduced a more challenging cross-validation (LOGOCV) and 20 rounds of random data set splitting for model evaluation (Table S3). The overfitted strategies suffered more in the generalizability evaluation. It can also be seen that XGBoost regression combined with random forest regression for feature selection was the most powerful and robust strategy for antioxidant activity prediction.

In the prediction of tripeptides with potentially high antioxidant activity, the 525 unpublished tripeptides with activity higher than 6 μmol TE/μmol peptide all had a tyrosine, tryptophan, or cysteine residue at the C-terminal position, which is consistent with previous studies.[12,24,31] Compared with previous studies, our model clearly specifies the tripeptides with the most promising antioxidant activity. The preferable attributes of strong antioxidant tripeptides inferred from model development were supported by the measured antioxidant activity of the synthetic tripeptides. Tripeptides with a tyrosine or cysteine residue at the C-terminal exhibited higher antioxidant activity than those with a tyrosine residue at the N-terminal, a position that also showed a lower contribution to antioxidant activity in the feature importance analysis. In addition, the model successfully predicted that the tripeptide PHC, with a cysteine residue at the C-terminal, had strong antioxidant activity (Table ), which had not been reported previously. No tripeptide with tyrosine at the N-terminal showed high antioxidant activity (e.g., above 4 μmol TE/μmol peptide). Our results supported the hypothesis underlying the model development, namely that these amino acid indices can represent the residues in a tripeptide for predicting the activity of unknown antioxidant tripeptides. Some deviation between the observed and predicted activity was inevitable, but it is acceptable overall.[8,31]
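
The two cross-validation schemes mentioned in the Discussion differ only in how the held-out set is chosen, which the sketch below makes concrete. It is an illustration under stated assumptions: how groups were defined for LOGOCV is not given in this section, so grouping by C-terminal residue is purely hypothetical, as are the function names.

```python
def leave_one_out(n_samples):
    """Yield (train_idx, test_idx) with exactly one held-out sample per fold."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) with one whole group held out per fold."""
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield train, test


peptides = ["QAY", "PHC", "YSQ", "YPQ", "VYV", "GPE"]

# Hypothetical grouping: peptides sharing a C-terminal residue form one group.
groups = [p[-1] for p in peptides]

loo_folds = list(leave_one_out(len(peptides)))
logo_folds = list(leave_one_group_out(groups))
```

Because LOGOCV removes every member of a group at once, the model never sees a close relative of the held-out peptides during training, which is why it is the more demanding test of generalizability.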

Conclusions

In this study, we collected the latest 553 amino acid numerical indices and 130 published tripeptides with available TEAC values for QSAR analysis. Seven feature selection strategies and 14 regression methods were combined to build QSAR models and to comprehensively evaluate the performance of machine learning methods in predicting antioxidant tripeptides. The results showed that C-terminal residues played a more important role in antioxidant activity and that non-linear regression methods were more suitable for QSAR studies on antioxidant activity. The best model, which combined FI-RFR for feature selection with tree-based XGB for model building, was used to predict the antioxidant activities of the 7870 unknown tripeptides; the high-activity tripeptides have a tyrosine, tryptophan, or cysteine residue at the C-terminal position. Furthermore, 6 unpublished tripeptides were synthesized and characterized to evaluate the practical application of the best model. The predicted activity reflected the rank of the potential activity of these tripeptides and their approximate activity, although the activity was overestimated. This study also demonstrates for the first time, through both in silico and wet-chemistry experiments, that cysteine and tyrosine residues at the C-terminal are strongly associated with antioxidant activity in tripeptides. In addition, this study provides a critical reference for antioxidant tripeptide screening and a useful model development template for future QSAR studies on bioactive peptides.

Data and Software Availability

All the used data and software are clearly described in the Materials and Methods section.

1.  AAindex: amino acid index database.

Authors:  S Kawashima; M Kanehisa
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids.

Authors:  M Sandberg; L Eriksson; J Jonsson; M Sjöström; S Wold
Journal:  J Med Chem       Date:  1998-07-02       Impact factor: 7.446

3.  Purification and characterization of antioxidant peptides from enzymatically hydrolyzed chicken egg white.

Authors:  Chamila Nimalaratne; Nandika Bandara; Jianping Wu
Journal:  Food Chem       Date:  2015-05-06       Impact factor: 7.514

4.  Systematic Comparison and Comprehensive Evaluation of 80 Amino Acid Descriptors in Peptide QSAR Modeling.

Authors:  Peng Zhou; Qian Liu; Ting Wu; Qingqing Miao; Shuyong Shang; Heyi Wang; Zheng Chen; Shaozhou Wang; Heyan Wang
Journal:  J Chem Inf Model       Date:  2021-03-12       Impact factor: 4.956

5.  Quantitative analysis of the relationship between structure and antioxidant activity of tripeptides.

Authors:  Shinya Uno; Daisuke Kodama; Hiroko Yukawa; Hiroyuki Shidara; Miki Akamatsu
Journal:  J Pept Sci       Date:  2020-01-12       Impact factor: 1.905

6.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

7.  Anti-hypertensive Peptide Predictor: A Machine Learning-Empowered Web Server for Prediction of Food-Derived Peptides with Potential Angiotensin-Converting Enzyme-I Inhibitory Activity.

Authors:  Gazal Kalyan; Vivek Junghare; Mohammad Farhan Khan; Shivam Pal; Sourya Bhattacharya; Snigdha Guha; Kaustav Majumder; Sohom Chakrabarty; Saugata Hazra
Journal:  J Agric Food Chem       Date:  2021-12-02       Impact factor: 5.279

8.  ST-scale as a novel amino acid descriptor and its application in QSAM of peptides and analogues.

Authors:  Li Yang; Mao Shu; Kaiwang Ma; Hu Mei; Yongjun Jiang; Zhiliang Li
Journal:  Amino Acids       Date:  2009-04-17       Impact factor: 3.520

9.  Characterization of structure-antioxidant activity relationship of peptides in free radical systems using QSAR models: key sequence positions and their amino acid properties.

Authors:  Yao-Wang Li; Bo Li
Journal:  J Theor Biol       Date:  2012-11-02       Impact factor: 2.691

10.  QSAR Study on Antioxidant Tripeptides and the Antioxidant Activity of the Designed Tripeptides in Free Radical Systems.

Authors:  Nan Chen; Ji Chen; Bo Yao; Zhengguo Li
Journal:  Molecules       Date:  2018-06-10       Impact factor: 4.411
