Quantitative Structure-Activity Relationship Study of Antioxidant Tripeptides Based on Model Population Analysis.

Baichuan Deng, Hongrong Long, Tianyue Tang, Xiaojun Ni, Jialuo Chen, Guangming Yang, Fan Zhang, Ruihua Cao, Dongsheng Cao, Maomao Zeng, Lunzhao Yi.

Abstract

Due to their beneficial effects on human health, antioxidant peptides have attracted much attention from researchers. However, the structure-activity relationships of antioxidant peptides are not yet fully understood. In this paper, quantitative structure-activity relationship (QSAR) models were built on two datasets, i.e., the ferric thiocyanate (FTC) dataset and the ferric-reducing antioxidant power (FRAP) dataset, containing 214 and 172 unique antioxidant tripeptides, respectively. Sixteen amino acid descriptors were used, and model population analysis (MPA) was then applied to improve the prediction performance of the QSAR models. The results showed that, by applying MPA, the cross-validated coefficient of determination (Q²) increased from 0.6170 to 0.7471 for the FTC dataset and from 0.4878 to 0.6088 for the FRAP dataset. These findings indicate that the integration of different amino acid descriptors provides additional information for model building and that MPA can efficiently extract this information for better prediction performance.

Keywords:  QSAR; amino acid descriptors; antioxidant tripeptides; model population analysis; quantitative structure-activity relationship

Year:  2019        PMID: 30823542      PMCID: PMC6413046          DOI: 10.3390/ijms20040995

Source DB:  PubMed          Journal:  Int J Mol Sci        ISSN: 1422-0067            Impact factor:   5.923


1. Introduction

Bioactive peptides, usually containing 2–20 amino acid residues, are typically derived from the enzymatic hydrolysis of proteins [1]. They are inactive within the sequence of the parent protein, but they can exert various physiological functions after release. Antioxidant peptides are one of the most important groups of bioactive peptides; they can prevent oxidative stress and contribute notably to human health [2]. Antioxidant peptides have been isolated and purified from sources such as cereals, milk, meat, and fish [3]. The methods used to assess the antioxidant capacities of peptides include the Trolox equivalent antioxidant capacity (TEAC), the ferric ion reducing antioxidant power (FRAP), the 2,2-diphenyl-1-picrylhydrazyl radical-scavenging capacity (DPPH), the oxygen radical absorbance capacity (ORAC), the total radical trapping antioxidant parameter (TRAP), etc. [4]. However, it is impractical to test every peptide experimentally, given the large number of theoretically possible peptides, i.e., 400 dipeptides, 8000 tripeptides, 160,000 tetrapeptides, etc. The activities of peptides are determined by their amino acid compositions, sequences, and structures. Quantitative structure-activity relationship (QSAR) modeling, a well-recognized tool for estimating chemical activities, has been widely applied to the prediction of bioactive peptides [5]. QSAR models have been successfully built for ACE-inhibitory peptides [6], antimicrobial peptides [7], antioxidant peptides [8,9,10], antitumor peptides [11], bitter peptides [12], etc. QSAR studies of antioxidant peptides have mainly focused on di- and tripeptides, because they can be absorbed intact from the intestinal lumen into the bloodstream and then produce biological effects at the tissue level [13]. When compared to dipeptides, tripeptides were reported to exhibit higher levels of antioxidant activity [14].
Besides, tripeptides have much greater structural diversity than dipeptides, which is a useful property for developing multifunctional food additives [15]. Although plenty of QSAR models have been built for antioxidant peptides, their prediction performance needs further improvement, and the relationship between peptide structure and antioxidant activity is still unclear. This may be due to the limitations of the model building methods. Model population analysis (MPA) provides a new strategy for model building, which uses multiple models instead of a single model to improve prediction ability and interpretability [16,17]. Previous studies showed that the performance of regression models could be improved by applying the MPA strategy [6,18]. In this study, we built QSAR models based on two antioxidant tripeptide datasets. The first dataset contains 214 artificially designed tripeptides and the second contains 172 β-lactoglobulin-derived tripeptides, representing designed and food-originated tripeptides, respectively. Sixteen amino acid descriptors were used to capture comprehensive information about the peptides. The MPA strategy was applied to extract useful information from the data and to optimize the models. The aim of this study is not to build a new set of descriptors, but to integrate different descriptors under the framework of MPA for better QSAR model performance on antioxidant tripeptide data. The improved method for QSAR modelling will help in discovering new antioxidant tripeptides for future drugs or food additives.

2. Results

2.1. FTC Dataset

The results of the QSAR models on the FTC dataset are displayed in Table 1. Before outlier elimination, the largest Q² value of 0.4901 was obtained with the VSW descriptor. After outlier elimination, the HSEHPCSV descriptor showed the largest Q² value of 0.6170 among the 16 amino acid descriptors. Integrating the 16 descriptors improved the model performance (Q² = 0.6818). Finally, the prediction performance was further improved (Q² = 0.7471) after variable selection using the BOSS method.
Table 1

Comparisons among different quantitative structure-activity relationship (QSAR) models on the ferric thiocyanate (FTC) dataset a.

Descriptors | Q² (before) | R² (before) | optPC (before) | Q² (after) | R² (after) | optPC (after) | Outlier (sample no.)
HSEHPCSV | 0.3861 | 0.5781 | 4 | 0.6170* | 0.7338 | 20 | 183, 182, 181, 134
ST-scale | 0.4268 | 0.5733 | 12 | 0.5993 | 0.6844 | 13 | 183, 182, 181, 134
HESH | 0.4091 | 0.5366 | 2 | 0.5968 | 0.7047 | 10 | 183, 181, 182, 134, 129
VSW | 0.4901* | 0.5771 | 3 | 0.5925 | 0.6768 | 5 | 181, 183, 182, 134, 151
G-scale | 0.4516 | 0.5527 | 6 | 0.5843 | 0.6574 | 9 | 181, 183, 182, 134, 118
FASGAI | 0.4814 | 0.5457 | 5 | 0.5544 | 0.6130 | 6 | 129, 181, 128
DPPS | 0.4740 | 0.5637 | 7 | 0.5379 | 0.6278 | 8 | 181, 182, 183, 134
E-scale | 0.4956 | 0.5451 | 4 | 0.5144 | 0.5582 | 4 | 181, 182, 183, 112
5Z-scale | 0.3903 | 0.4626 | 12 | 0.3974 | 0.4653 | 9 | 181, 182, 183, 172
VHSE | 0.4265 | 0.5432 | 12 | 0.3974 | 0.5148 |  | 181, 182, 183, 172
T-scale | 0.3280 | 0.4215 | 9 | 0.3728 | 0.4362 | 9 | 181, 182, 183
V-scale | 0.3371 | 0.3785 | 5 | 0.3070 | 0.3458 | 6 | 181, 183, 182
Z-scale | 0.2814 | 0.3398 | 4 | 0.2678 | 0.3415 | 4 | 181
ISA-ECI | 0.1493 | 0.1916 | 6 | 0.1572 | 0.1836 | 6 | 183, 182, 181
MS-WHIM1 | 0.0736 | 0.1488 | 3 | 0.1036 | 0.1678 | 3 | 181, 183, 182
MS-WHIM2 | 0.0775 | 0.1445 | 3 | 0.0882 | 0.1617 | 3 | 181, 182, 183
Integrated descriptors | 0.4811 | 0.5843 | 3 | 0.6818 | 0.7964 | 8 | 181, 183, 182, 134, 151, 153, 188
BOSS |  |  |  | 0.7471 ± 0.0032* | 0.7931 ± 0.0062 | 9.72 ± 3.2199 | 

a "Before" and "after" refer to outlier elimination; R² is the coefficient of determination; Q² is the cross-validated R²; optPC is the optimal number of principal components for the PLS regression model; the results of BOSS are shown as mean value ± standard deviation over 100 runs; the top ranked Q² scores are marked with an asterisk.
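
The quantities defined in the footnote (R², Q², optPC) can be illustrated with a short sketch. It is not the authors' MATLAB implementation: the NIPALS-style PLS routine, the 10-fold split, the function names, and the synthetic data are all assumptions made for illustration, and the paper does not state which cross-validation scheme was used.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS (NIPALS); X is auto-scaled and y mean-centered."""
    x_mean, x_std, y_mean = X.mean(axis=0), X.std(axis=0, ddof=1), y.mean()
    x_std = np.where(x_std == 0, 1.0, x_std)   # guard against constant columns
    E, f = (X - x_mean) / x_std, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w = w / np.linalg.norm(w)              # weight vector
        t = E @ w                              # scores
        p = E.T @ t / (t @ t)                  # X loadings
        c = (f @ t) / (t @ t)                  # y loading
        E, f = E - np.outer(t, p), f - c * t   # deflation
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    coef = W @ np.linalg.solve(P.T @ W, np.array(q))
    return coef, x_mean, x_std, y_mean

def pls1_predict(model, X):
    coef, x_mean, x_std, y_mean = model
    return ((X - x_mean) / x_std) @ coef + y_mean

def r2_fit(X, y, n_comp):
    """R2: coefficient of determination of the fitted model."""
    resid = y - pls1_predict(pls1_fit(X, y, n_comp), X)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def q2_cv(X, y, n_comp, n_folds=10, seed=0):
    """Q2: cross-validated R2, i.e. 1 - PRESS / total sum of squares."""
    rng = np.random.default_rng(seed)
    press = 0.0
    for test in np.array_split(rng.permutation(len(y)), n_folds):
        train = np.setdiff1d(np.arange(len(y)), test)
        model = pls1_fit(X[train], y[train], n_comp)
        press += np.sum((y[test] - pls1_predict(model, X[test])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def select_opt_pc(X, y, max_comp):
    """optPC: the number of components with the largest Q2."""
    scores = {a: q2_cv(X, y, a) for a in range(1, max_comp + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Synthetic data for illustration only (not the FTC or FRAP data)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 9))
y = X @ rng.normal(size=9) + 0.1 * rng.normal(size=60)
opt_pc, q2 = select_opt_pc(X, y, max_comp=6)
r2 = r2_fit(X, y, opt_pc)
```

Because Q² is computed from predictions on held-out samples, it is a more conservative measure than R², which is why the tables report both.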

In this study, an MPA-based outlier elimination procedure [19] was carried out to remove outliers one by one (Figure 1). For the integrated data, sample nos. 181, 183, 182, 134, 151, 153, and 188 were removed in sequence. After outlier removal, all of the remaining samples were within the range defined by the three-sigma rule (Figure 1H, dashed line).
Figure 1

The process of model population analysis (MPA)-based outlier elimination on the FTC dataset of integrated descriptors. The dashed line is defined as the boundary for outliers, which is mean ± 3× standard deviation of prediction errors. (A) No outlier was eliminated, (B) sample No. 181 was eliminated, (C) sample No. 183 was eliminated, (D) sample No. 182 was eliminated, (E) sample No. 134 was eliminated, (F) sample No. 151 was eliminated, (G) sample No. 153 was eliminated, and (H) sample No. 188 was eliminated and all of the outliers were removed.
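
The one-by-one elimination shown in Figure 1 can be sketched as follows, using the resampling settings given in Section 4.3 (sub-datasets of 80% of the samples, a PLS model per sub-dataset, and a boundary of mean ± 3 × standard deviation of the mean prediction errors). This is a minimal sketch rather than the published method [19]: it assumes that errors are recorded for the samples left out of each sub-dataset, that sampling is without replacement, and that the sample furthest from the mean is removed first. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS (NIPALS); X is auto-scaled and y mean-centered."""
    x_mean, x_std, y_mean = X.mean(axis=0), X.std(axis=0, ddof=1), y.mean()
    x_std = np.where(x_std == 0, 1.0, x_std)
    E, f = (X - x_mean) / x_std, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w = w / np.linalg.norm(w)
        t = E @ w
        p, c = E.T @ t / (t @ t), (f @ t) / (t @ t)
        E, f = E - np.outer(t, p), f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q)), x_mean, x_std, y_mean

def pls1_predict(model, X):
    coef, x_mean, x_std, y_mean = model
    return ((X - x_mean) / x_std) @ coef + y_mean

def mean_prediction_errors(X, y, n_models=1000, frac=0.8, n_comp=3, seed=0):
    """Mean error of each sample over the sub-models that left it out."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(frac * n))
    err_sum, err_cnt = np.zeros(n), np.zeros(n)
    for _ in range(n_models):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        model = pls1_fit(X[train], y[train], n_comp)
        err_sum[test] += y[test] - pls1_predict(model, X[test])
        err_cnt[test] += 1
    return err_sum / np.maximum(err_cnt, 1)

def eliminate_outliers(X, y, max_removals=20, **kwargs):
    """Remove the worst sample until all lie within mean +/- 3 SD."""
    keep, removed = np.arange(len(y)), []
    for _ in range(max_removals):
        e = mean_prediction_errors(X[keep], y[keep], **kwargs)
        dev = np.abs(e - e.mean())
        if not np.any(dev > 3 * e.std(ddof=1)):
            break
        worst = int(np.argmax(dev))
        removed.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    return removed, keep

# Synthetic data with one corrupted response (index 7), for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = X @ np.array([1.5, -1.0, 0.8, 0.0, 0.5, -0.7]) + 0.1 * rng.normal(size=60)
y[7] += 8.0
removed, kept = eliminate_outliers(X, y, n_models=300, n_comp=3)
```

The demo uses 300 sub-models to keep the run short; the default of 1000 matches the number stated in Section 4.3.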

Figure 2 shows the variables selected by the BOSS method in 100 runs. Variables that are selected more frequently have higher importance. The top 11 variables (frequency > 75), in descending order, were as follows: C-VSW-5 = N-G-7 > C-ST-3 > M-ST-7 > N-DPPS-8 > C-HESH-2 > N-FASGAI-5 > M-G-6 > N-VSW-3 > C-VHSE-6 > C-HSEHPCSV-9, which are marked in Figure 2. All the top 11 variables originated from the best-performing amino acid descriptors, i.e., HSEHPCSV, ST-scale, HESH, G-scale, FASGAI, and DPPS (Table 1). This shows that the final model combines the merits of the best-performing models constructed from single amino acid descriptors.
Figure 2

Frequency of variables selected by the bootstrapping soft shrinkage (BOSS) method on the FTC dataset in 100 runs. The higher frequency denotes higher variable importance. The top 11 variables with frequency larger than 75 were marked in the figure.
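
The frequency count behind Figure 2 amounts to tallying how many of the 100 runs selected each variable and keeping those above the threshold of 75. A minimal sketch is given below; the function names are hypothetical and the toy selections are made up for illustration, not taken from the paper's results.

```python
from collections import Counter

def selection_frequency(runs):
    """Count in how many runs each variable was selected."""
    counts = Counter()
    for selected in runs:
        counts.update(set(selected))   # count a variable at most once per run
    return counts

def frequent_variables(counts, threshold=75):
    """Variables selected in more than `threshold` runs, most frequent first."""
    hits = [(name, c) for name, c in counts.items() if c > threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))

# Toy input: 100 hypothetical runs with made-up variable names
toy_runs = [["var_a", "var_b"]] * 80 + [["var_a", "var_c"]] * 20
counts = selection_frequency(toy_runs)
top = frequent_variables(counts)
```

In this toy input, `var_a` appears in all 100 runs and `var_b` in 80, so both pass the threshold, while `var_c` (20 runs) does not.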

2.2. FRAP Dataset

The results of the QSAR models on the FRAP dataset are displayed in Table 2. Before logarithmic transformation of the response vector Y, the largest Q² value of 0.1408 was obtained with the 5Z-scale descriptor. The low Q² value indicated that the tripeptide structures and their antioxidant activities evaluated by the FRAP assay did not share a linear relationship. After logarithmic transformation, the VHSE descriptor showed the largest Q² value of 0.4878. Integrating the 16 descriptors increased the Q² value slightly to 0.4953. The prediction performance of the model was further improved after variable selection using the BOSS method (Q² = 0.6088). This indicated that a linear relationship between the structures and the activities could be built after the logarithmic transformation of Y, and that the MPA strategy was efficient in improving the model.
Table 2

Comparisons among different QSAR models on the FRAP dataset a.

Descriptors | Q² (before) | R² (before) | optPC (before) | Q² (after) | R² (after) | optPC (after)
VHSE | 0.0042 | 0.2655 | 3 | 0.4878* | 0.6122 | 6
5Z-scale | 0.1408* | 0.3177 | 2 | 0.4809 | 0.5568 | 3
DPPS | 0.0059 | 0.2290 | 3 | 0.4147 | 0.5463 | 4
ST-scale | 0.0263 | 0.3220 | 8 | 0.3968 | 0.5410 | 9
FASGAI | 0.0470 | 0.2753 | 2 | 0.3735 | 0.5006 | 4
E-scale | 0.0560 | 0.2521 | 1 | 0.3714 | 0.4734 | 5
HESH | 0.0444 | 0.2818 | 10 | 0.3668 | 0.5290 | 3
HSEHPCSV | 0.0259 | 0.2475 | 7 | 0.3624 | 0.4952 | 3
G-scale | 0.1066 | 0.2334 | 5 | 0.2836 | 0.3850 | 1
VSW | 0.0130 | 0.3071 | 1 | 0.2382 | 0.4361 | 2
MS-WHIM2 | 0.0342 | 0.0370 | 3 | 0.1728 | 0.2594 | 3
MS-WHIM1 | 0.0452 | 0.0329 | 9 | 0.1207 | 0.1941 | 4
T-scale | 0.0682 | 0.0706 | 2 | 0.0750 | 0.2129 | 10
V-scale | 0.0293 | 0.0748 | 4 | 0.0699 | 0.1495 | 1
Z-scale | 0.0052 | 0.1445 | 1 | 0.0301 | 0.1456 | 6
ISA-ECI | 0.0242 | 0.0141 | 1 | 0.0071 | 0.0411 | 1
Integrated descriptors | 0.1069 | 0.4212 | 3 | 0.4953 | 0.6423 | 3
BOSS |  |  |  | 0.6088 ± 0.0041* | 0.6655 ± 0.0094 | 3.5100 ± 2.5086

a "Before" and "after" refer to logarithmic transformation of the response; R² is the coefficient of determination; Q² is the cross-validated R²; optPC is the optimal number of principal components for the PLS regression model; the results of BOSS are shown as mean value ± standard deviation over 100 runs; the top ranked Q² scores are marked with an asterisk.

Similarly, an MPA-based outlier elimination procedure was carried out on the FRAP dataset. No outlying sample was detected, since all of the samples fell within the range defined by the three-sigma rule (Figure 3A, dashed line). The important variables selected by BOSS are displayed in Figure 3B. The six most important variables (frequency > 75) are C-Z5-5, M-Z5-5, N-VSW-9, N-VHSE-8, N-ST-3, and C-VSW-2. Most of the important variables originated from three well-performing descriptors, i.e., VHSE, 5Z-scale, and ST-scale. However, some variables were still selected from a poorly performing descriptor, namely VSW. This suggests that descriptors with poor performance also contain useful information for model building.
Figure 3

The result of QSAR model building on the FRAP dataset. (A) The result of MPA-based outlier detection on the FRAP dataset of integrated descriptors. No outlier was detected. (B) Frequency of variables selected by the BOSS method on the FRAP dataset in 100 runs. The higher frequency denotes higher variable importance. The six top variables with frequency larger than 75 are marked in the figure.

3. Discussion

3.1. Comparison with the Reported Models

For the FTC dataset, our method showed higher prediction accuracy (Q² = 0.7471) than the previous report (Q² = 0.6310) [20]. Note that 41 samples were eliminated as outliers in the previous study, while only seven outliers were eliminated in this study. Our model was therefore built on a much larger and more representative set of samples. This shows that our method yields a model with higher prediction performance and a relatively larger applicability domain. Similarly, for the FRAP dataset, our method showed higher prediction accuracy (Q² = 0.6088) than the previous report (Q² = 0.5410) [21]. It should be noted that, in the previous study, five samples with the highest activities and 14 inactive samples were removed, while in our study only the inactive samples were removed. Thus, our model showed improved prediction accuracy and an enlarged applicability domain.

3.2. Relationship between Antioxidant Activities and Peptide Structures

Previous studies showed that the N-terminal and C-terminal amino acids are important for antioxidant activity [20]. Our results agree with these findings: most of the important variables selected by BOSS originated from the N-terminus or C-terminus (Figure 2 and Figure 3B). In addition, studies showed that tripeptides containing Cys (C), Trp (W), and Tyr (Y) residues exhibited strong antioxidant activities [8,10]. Consistent with this, the tripeptides YHY and LTC have the highest antioxidant activities in the FTC and FRAP datasets, respectively. On the FTC dataset, a linear relationship between antioxidant activities and peptide structures was constructed. However, on the FRAP dataset, the relationship could only be built between the log-transformed activities and the structural properties. This indicates that the antioxidant activities and peptide structures in the FRAP dataset exhibit a non-linear relationship, and that data transformation is crucial before model building on this kind of data. The different performance on the two datasets may be attributed to the structural diversity of the peptides. In the FTC dataset, all tripeptides contain either a His or a Tyr residue and therefore have similar structures, while the structural diversity in the FRAP dataset is much larger.

3.3. The Integration of Amino Acid Descriptors

A number of amino acid descriptors have been developed and applied in QSAR studies of bioactive peptides. Each descriptor has its merits and demerits. Our study shows that no single descriptor is optimal. Instead, descriptor performance is data dependent: a descriptor that performs well on one dataset may perform poorly on another. This makes it difficult for researchers to select descriptors. By integrating different descriptors, each one can contribute particular information to the model, creating a new possibility for further improvement. The next question is then how to efficiently extract information from the different descriptors while removing the redundancy in the data. Model population analysis (MPA) may provide a solution. It uses multiple models instead of a single model for prediction. Each sub-model contains a random combination of different descriptors. Through statistical analysis of the sub-model outcomes, the informative variables are extracted from the descriptors and an optimized descriptor combination is obtained [22]. Finally, the optimized model performs better than any of the single-descriptor models, as shown in Table 1 and Table 2. To summarize, the aim of this study is not to build a new set of descriptors, but to provide a general framework for integrating different descriptors. The framework can take in any newly developed descriptor and be fitted to different datasets. The more diverse the integrated descriptors are, the better the model is expected to perform.

4. Materials and Methods

4.1. Data Collection

4.1.1. Ferric Thiocyanate (FTC) Dataset

A dataset of 214 antioxidant tripeptides that contain either a His or a Tyr residue was obtained from the published literature [20,23]. All of the tripeptides were chemically synthesized using solid-phase Fmoc chemistry and their antioxidant activities were measured by the FTC method [23]. Test samples (500 μg) in 0.5 mL of deionized water were mixed with linoleic acid emulsion (1.0 mL, 50 mM) and phosphate buffer (1.0 mL, 0.1 M) in glass test tubes (5 mL). The tubes were sealed with silicon rubber caps and then kept at 60 °C in the dark. Aliquots (50 μL) of the reaction mixture were taken out at different intervals during incubation. The degree of oxidation was measured by sequentially adding ethanol (2.35 mL, 75%), ammonium thiocyanate (50 μL, 30%), and ferrous chloride (50 μL, 20 mM in 3.5% HCl). After the mixture had stood for 3 min, the absorbance of the solution was measured at 500 nm with a Jasco model Ubest 30 spectrophotometer (Tokyo, Japan). A control containing the same contents as the test sample, but without the peptide, was also run. The number of days taken to reach an absorbance of 0.3 was defined as the induction period. The relative activities were calculated by dividing the induction period of the test samples by that of the control (Table 3). All of the experiments were carried out in triplicate and averaged.
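Table 3

Sequences and antioxidant activities of tripeptides on the ferric thiocyanate (FTC) dataset a.

No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity
1 | LHA | 3.918 | 37 | PHA | 5.793 | 73 | RHA | 5.205 | 109 | DHH | 0.9045 | 145 | HHH | 0.0635 | 181 | YHY | 9.886
2 | LHD | 3.593 | 38 | PHD | 4.622 | 74 | RHD | 3.304 | 110 | EHH | 0.9045 | 146 | HHK | 0.0635 | 182 | YKY | 9.886
3 | LHE | 6.136 | 39 | PHE | 6.152 | 75 | RHE | 5.096 | 111 | HHH | 0.0000 | 147 | HHR | 0.0635 | 183 | YRY | 9.886
4 | LHF | 3.628 | 40 | PHF | 3.916 | 76 | RHF | 3.300 | 112 | KHH | 0.0000 | 148 | HHA | 0.0680 | 184 | YAY | 3.607
5 | LHG | 6.697 | 41 | PHG | 5.197 | 77 | RHG | 5.725 | 113 | AHH | 2.020 | 149 | HHI | 0.0680 | 185 | YIY | 3.607
6 | LHH | 4.836 | 42 | PHH | 6.051 | 78 | RHH | 3.296 | 114 | IHH | 2.020 | 150 | HHL | 0.0680 | 186 | YLY | 3.607
7 | LHI | 6.531 | 43 | PHI | 4.916 | 79 | RHI | 4.806 | 115 | FHH | 1.803 | 151 | HHF | 3.612 | 187 | YFY | 2.233
8 | LHK | 4.225 | 44 | PHK | 3.426 | 80 | RHK | 2.694 | 116 | WHH | 1.803 | 152 | HHW | 3.612 | 188 | YWY | 2.233
9 | LHL | 5.920 | 45 | PHL | 5.311 | 81 | RHL | 3.501 | 117 | YHH | 1.803 | 153 | HHY | 3.612 | 189 | YYY | 2.233
10 | LHM | 4.504 | 46 | PHM | 3.714 | 82 | RHM | 3.218 | 118 | GHH | 1.089 | 154 | HHG | 0.3170 | 190 | YGY | 3.366
11 | LHN | 5.148 | 47 | PHN | 6.061 | 83 | RHN | 5.713 | 119 | NHH | 1.089 | 155 | HHN | 0.3170 | 191 | YNY | 3.366
12 | LHQ | 4.136 | 48 | PHQ | 3.718 | 84 | RHQ | 3.108 | 120 | QHH | 1.089 | 156 | HHQ | 0.3170 | 192 | YQY | 3.366
13 | LHR | 5.184 | 49 | PHR | 4.751 | 85 | RHR | 4.302 | 121 | MHH | 2.015 | 157 | HHM | 0.0817 | 193 | YMY | 1.780
14 | LHS | 4.293 | 50 | PHS | 4.042 | 86 | RHS | 3.386 | 122 | SHH | 1.320 | 158 | HHS | 0.0862 | 194 | YSY | 3.447
15 | LHT | 5.584 | 51 | PHT | 6.247 | 87 | RHT | 5.987 | 123 | THH | 1.320 | 159 | HHT | 0.0862 | 195 | YTY | 3.447
16 | LHV | 3.481 | 52 | PHV | 3.335 | 88 | RHV | 3.206 | 124 | CHH | 0.9369 | 160 | HHC | 0.1277 | 196 | YCY | 3.087
17 | LHW | 6.791 | 53 | PHW | 6.535 | 89 | RHW | 5.878 | 125 | HDH | 1.477 | 161 | DYY | 3.417 | 197 | YYD | 4.116
18 | LHY | 4.203 | 54 | PHY | 4.227 | 90 | RHY | 3.378 | 126 | HEH | 1.477 | 162 | EYY | 3.417 | 198 | YYE | 4.116
19 | LWA | 1.192 | 55 | PWA | 1.396 | 91 | RWA | 1.212 | 127 | HHH | 0.0441 | 163 | HYY | 2.257 | 199 | YYH | 5.303
20 | LWD | 1.717 | 56 | PWD | 1.096 | 92 | RWD | 0.9091 | 128 | HKH | 0.0441 | 164 | KYY | 2.257 | 200 | YYK | 5.303
21 | LWE | 1.717 | 57 | PWE | 1.096 | 93 | RWE | 1.091 | 129 | HRH | 0.0441 | 165 | RYY | 2.257 | 201 | YYR | 5.303
22 | LWF | 1.414 | 58 | PWF | 0.9192 | 94 | RWF | 0.9091 | 130 | HAH | 0.9518 | 166 | AYY | 3.071 | 202 | YYA | 3.344
23 | LWG | 1.313 | 59 | PWG | 2.687 | 95 | RWG | 1.717 | 131 | HIH | 0.9518 | 167 | IYY | 3.071 | 203 | YYI | 3.344
24 | LWH | 3.212 | 60 | PWH | 1.184 | 96 | RWH | 1.091 | 132 | HLH | 0.9518 | 168 | LYY | 3.071 | 204 | YYL | 3.344
25 | LWI | 1.111 | 61 | PWI | 1.396 | 97 | RWI | 1.232 | 133 | HFH | 2.026 | 169 | FYY | 1.911 | 205 | YYF | 4.050
26 | LWK | 1.899 | 62 | PWK | 0.4066 | 98 | RWK | 0.6061 | 134 | HWH | 2.026 | 170 | WYY | 1.911 | 206 | YYW | 4.050
27 | LWL | 0.6060 | 63 | PWL | 1.096 | 99 | RWL | 3.212 | 135 | HYH | 2.026 | 171 | YYY | 1.911 | 207 | YYY | 4.050
28 | LWM | 1.394 | 64 | PWM | 0.7955 | 100 | RWM | 0.7273 | 136 | HGH | 0.8318 | 172 | GYY | 5.071 | 208 | YYG | 2.996
29 | LWN | 1.313 | 65 | PWN | 2.104 | 101 | RWN | 2.404 | 137 | HNH | 0.8318 | 173 | NYY | 5.071 | 209 | YYN | 2.996
30 | LWQ | 2.505 | 66 | PWQ | 1.202 | 102 | RWQ | 0.6061 | 138 | HQH | 0.8318 | 174 | QYY | 5.071 | 210 | YYQ | 2.996
31 | LWR | 2.909 | 67 | PWR | 2.705 | 103 | RWR | 2.384 | 139 | HMH | 0.8734 | 175 | MYY | 1.991 | 211 | YYM | 2.103
32 | LWS | 2.020 | 68 | PWS | 1.096 | 104 | RWS | 0.8081 | 140 | HSH | 0.7304 | 176 | SYY | 3.070 | 212 | YYS | 3.983
33 | LWT | 2.020 | 69 | PWT | 2.598 | 105 | RWT | 3.818 | 141 | HTH | 0.7304 | 177 | TYY | 3.070 | 213 | YYT | 3.983
34 | LWV | 1.616 | 70 | PWV | 1.008 | 106 | RWV | 0.6061 | 142 | HCH | 0.9747 | 178 | CYY | 0.4699 | 214 | YYC | 0.6369
35 | LWW | 3.515 | 71 | PWW | 2.899 | 107 | RWW | 2.707 | 143 | HHD | 0.1877 | 179 | YDY | 3.047
36 | LWY | 2.222 | 72 | PWY | 1.114 | 108 | RWY | 0.8081 | 144 | HHE | 0.1877 | 180 | YEY | 3.047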

a The data containing 214 antioxidant tripeptides was collected from the literature of Saito et al. [23] and Li et al. [20]. Antioxidant activities of tripeptides were measured by the FTC method and were relative values by adjusting the control to 1.0.

4.1.2. Ferric-reducing Antioxidant Power (FRAP) Dataset

A dataset of 172 antioxidant tripeptides was derived from β-lactoglobulin by collecting all possible tripeptides from its amino acid sequence [21]. All of the tripeptides were chemically synthesized using solid-phase Fmoc chemistry and their antioxidant activities were evaluated using the FRAP assay [24]. Ten microliters of 100 mmol/mL tripeptide solution were incubated at 37 °C with 100 μL of FRAP reagent, containing 10 mmol/L of 2,4,6-tripyridyl-s-triazine and 20 mmol/L of FeCl3. The absorbance was read at a wavelength of 570 nm using a microplate reader (Model 680, Bio-Rad, Hercules, CA, USA) after a 10 min reaction. Aqueous Fe2+ solutions at concentrations ranging from 10 to 1000 μmol/L were used to produce a calibration curve. The results were expressed as micromoles of Fe2+ equivalents per mole of the sample, based on the standard curve. All of the experiments were carried out in triplicate and then averaged. The activities were logarithmically transformed prior to modeling, and 14 inactive peptides (activity = 0) were removed (Table 4). The measured activities before logarithmic transformation are displayed in Table S1.
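Table 4

Sequences and activities of tripeptides on the ferric ion reducing antioxidant power (FRAP) dataset a.

No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity | No. | Sequence | Activity
1 | LTC | 2.83 | 30 | LPM | 1.04 | 59 | YKK | 0.25 | 88 | NGE | −0.30 | 117 | ELK | −0.66 | 146 | KIP | −1.15
2 | CQC | 2.53 | 31 | TDY | 1.01 | 60 | AQA | 0.22 | 89 | QSA | −0.33 | 118 | PEQ | −0.66 | 147 | LLD | −1.22
3 | GTW | 2.52 | 32 | QCH | 1.00 | 61 | LRV | 0.20 | 90 | DAQ | −0.34 | 119 | IDA | −0.68 | 148 | DLE | −1.22
4 | LFC | 2.07 | 33 | TWY | 0.96 | 62 | PTP | 0.18 | 91 | ENS | −0.34 | 120 | LLA | −0.70 | 149 | PEV | −1.22
5 | CLV | 2.06 | 34 | RVY | 0.95 | 63 | ALN | 0.18 | 92 | ENG | −0.37 | 121 | ALA | −0.72 | 150 | LKP | −1.40
6 | QKW | 2.03 | 35 | KWE | 0.90 | 64 | LEI | 0.16 | 93 | NSA | −0.37 | 122 | GLD | −0.72 | 151 | ALE | −1.52
7 | CME | 1.99 | 36 | CLL | 0.89 | 65 | LVR | 0.13 | 94 | EKT | −0.38 | 123 | DIS | −0.72 | 152 | TQL | −1.52
8 | YLL | 1.91 | 37 | LAM | 0.85 | 66 | HIR | 0.12 | 95 | EQS | −0.38 | 124 | PEG | −0.72 | 153 | LEE | −1.52
9 | QCL | 1.69 | 38 | YSL | 0.81 | 67 | KKI | 0.11 | 96 | AMA | −0.41 | 125 | LDI | −0.74 | 154 | LEK | −1.70
10 | LAC | 1.69 | 39 | MKG | 0.80 | 68 | SFN | 0.07 | 97 | KID | −0.41 | 126 | AEP | −0.74 | 155 | DAL | −2.00
11 | GEC | 1.64 | 40 | QTM | 0.80 | 69 | SLL | 0.06 | 98 | GAQ | −0.43 | 127 | ALI | −0.77 | 156 | EVD | −2.00
12 | EQC | 1.52 | 41 | LAL | 0.76 | 70 | PAV | 0.04 | 99 | PLR | −0.44 | 128 | LDA | −0.77 | 157 | VDD | −2.00
13 | FCM | 1.51 | 42 | QAL | 0.73 | 71 | RLS | 0.04 | 100 | ILL | −0.46 | 129 | VFK | −0.77 | 158 | DEA | −2.00
14 | CHI | 1.45 | 43 | MEN | 0.73 | 72 | AGT | 0.04 | 101 | VRT | −0.46 | 130 | ALK | −0.77 | 159 | ALT | -
15 | ACQ | 1.38 | 44 | MKC | 0.72 | 73 | LLF | 0.02 | 102 | IAE | −0.49 | 131 | AQK | −0.82 | 160 | KGL | -
16 | EEL | 1.33 | 45 | LSF | 0.69 | 74 | PMH | 0.00 | 103 | QSL | −0.49 | 132 | IIA | −0.82 | 161 | IQK | -
17 | WEN | 1.31 | 46 | TCG | 0.67 | 75 | EEQ | −0.01 | 104 | KTK | −0.51 | 133 | LIV | −0.85 | 162 | QKV | -
18 | VYV | 1.19 | 47 | SLA | 0.65 | 76 | LVL | −0.02 | 105 | ASD | −0.52 | 134 | EGD | −0.85 | 163 | GDL | -
19 | MHI | 1.16 | 48 | TMK | 0.64 | 77 | QLE | −0.05 | 106 | APL | −0.52 | 135 | QKK | −0.85 | 164 | EIL | -
20 | CAQ | 1.12 | 49 | LDT | 0.62 | 78 | FDK | −0.07 | 107 | AQS | −0.57 | 136 | IPA | −0.85 | 165 | KII | -
21 | WYS | 1.12 | 50 | EKF | 0.54 | 79 | LLL | −0.08 | 108 | ENK | −0.57 | 137 | SDI | −0.89 | 166 | NKV | -
22 | KYL | 1.08 | 51 | VLV | 0.53 | 80 | SAP | −0.08 | 109 | TPE | −0.59 | 138 | VEE | −0.89 | 167 | DTD | -
23 | CGA | 1.08 | 52 | MAA | 0.44 | 81 | LLQ | −0.12 | 110 | RTP | −0.59 | 139 | DDE | −0.89 | 168 | EPE | -
24 | KKY | 1.08 | 53 | PTQ | 0.44 | 82 | NPT | −0.17 | 111 | VLD | −0.62 | 140 | KVL | −0.92 | 169 | EAL | -
25 | NEN | 1.08 | 54 | VAG | 0.41 | 83 | FNP | −0.20 | 112 | IRL | −0.62 | 141 | KFD | −0.92 | 170 | DKA | -
26 | ECA | 1.07 | 55 | ALP | 0.37 | 84 | LNE | −0.24 | 113 | AAS | −0.64 | 142 | IVT | −0.96 | 171 | KAL | -
27 | DYK | 1.06 | 56 | AVF | 0.36 | 85 | SAE | −0.26 | 114 | LQK | −0.64 | 143 | VTQ | −0.96 | 172 | LKA | -
28 | KCL | 1.05 | 57 | KVA | 0.31 | 86 | KPT | −0.28 | 115 | FKI | −0.64 | 144 | AEK | −0.96
29 | YVE | 1.05 | 58 | TQT | 0.26 | 87 | DIQ | −0.30 | 116 | ISL | −0.66 | 145 | TKI | −1.10

a The data containing 172 antioxidant tripeptides was collected from the literature of Tian et al. [21]. Antioxidant activities of tripeptides were measured by the FRAP assay and were logarithmically transformed. Fourteen inactive peptides (nos. 159–172, activity shown as "-") were removed before model building.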
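
The preprocessing described for this dataset (dropping inactive peptides, then log-transforming the remaining activities) can be sketched in a few lines. The base of the logarithm is not stated in the text; base 10 is assumed here because the lowest entries in Table 4 (−2.00, −1.70, −1.52) match the base-10 logarithms of 0.01, 0.02, and 0.03. The function name and the raw values below are hypothetical; the measured values are in Table S1.

```python
import math

def log_transform(activities):
    """Drop inactive peptides (activity == 0) and take log10 of the rest."""
    return {seq: round(math.log10(a), 2)
            for seq, a in activities.items() if a > 0}

# Hypothetical raw activities, for illustration only
raw = {"AAA": 0.01, "GGG": 0.0, "LLL": 100.0}
transformed = log_transform(raw)
```

Peptides with zero activity have to be removed first because the logarithm of zero is undefined, which is why only the active peptides enter the model.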

The two datasets are representative of artificially designed and food protein-derived tripeptides, respectively. Both datasets have been used for building QSAR models before. Thus, they are suitable for model comparison.

4.2. Data Processing

The tripeptide sequences were transformed into X-matrices using each of the 16 amino acid descriptors, while the dependent variable Y-vectors represent the relative activities of the peptides. These descriptors include Z-scale, 5Z-scale, DPPS, MS-WHIM (used as two descriptors, MS-WHIM1 and MS-WHIM2), ISA-ECI, VHSE, FASGAI, VSW, T-scale, ST-scale, E-scale, V-scale, G-scale, HESH, and HSEHPCSV, as shown in Table 5. They are the most frequently used amino acid descriptors in QSAR studies of bioactive peptides. The peptide structure is characterized by describing each amino acid within the sequence. For example, the Z-scale descriptor, containing three parameters (Z1, Z2, and Z3), generates nine variables (3 parameters × 3 amino acids) for a tripeptide. To clearly label each variable, we used a unified naming rule. The amino acid at the N-terminus was designated as N, the C-terminal amino acid was designated as C, and the middle amino acid was designated as M. Thus, the nine variables generated by the Z-scale descriptor were labeled as N-Z-1, N-Z-2, N-Z-3, M-Z-1, M-Z-2, M-Z-3, C-Z-1, C-Z-2, and C-Z-3, respectively. The 16 descriptors were integrated to build an X-matrix containing 306 variables (V1-V306), with the following correspondence: Z-scale (V1-V9), 5Z-scale (V10-V24), DPPS (V25-V54), MS-WHIM1 (V55-V63), MS-WHIM2 (V64-V72), ISA-ECI (V73-V78), VHSE (V79-V102), FASGAI (V103-V120), VSW (V121-V147), E-scale (V148-V162), T-scale (V163-V177), ST-scale (V178-V201), V-scale (V202-V210), G-scale (V211-V234), HESH (V235-V270), and HSEHPCSV (V271-V306), respectively.
Table 5

Parameters of 16 amino acid descriptors.

Descriptor | No. of Physicochemical Property | No. of Extracted Variable | Scope of Variable
Z-scale [25] | 29 | 3 | Electronic property, steric property and hydrophobic property
5Z-scale [26] | 26 | 5 | Electronic property, steric property and hydrophobic property
DPPS [27] | 119 | 10 | Electronic property, steric property, hydrophobic property and hydrogen bond
MS-WHIM [28] | 36 | 3 | Surface charge distribution, size and charge over shape dependence
ISA-ECI [29] | / | 2 | Isotropic surface area and electronic charge index
VHSE [30] | 50 | 8 | Electronic property, steric property and hydrophobic property
FASGAI [31] | 335 | 6 | Hydrophobic property, alpha and turn property, bulky property, electronic property, compositional characteristics, local flexibility
VSW [32] | 99 | 9 | Molecular size, shape, symmetry and atom distribution
T-scale [33] | 67 | 5 | Topological property
ST-scale [34] | 827 | 8 | Molecular constitutional, topological, geometrical, hydrophobic, electronic and steric property
E-scale [35] | 237 | 5 | Hydrophobic property, size, preferences for amino acids to occur in α-helices, number of degenerate triplet codons and the frequency of occurrence of amino acid residues in β-strands
V-scale [36] | / | 3 | van der Waals volume, net charge index and hydrophobic parameter of side chains
G-scale [37] | 457 | 8 | Electronic property, steric property and hydrophobic property
HESH [38] | 171 | 12 | Electronic property, steric property, hydrophobic property and hydrogen bond
HSEHPCSV [39] | 95 | 12 | Hydrophobic, steric, electronic properties and hydrogen bond
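
The variable layout described in Section 4.2 can be reproduced from the per-descriptor parameter counts in Table 5. The sketch below generates the position-descriptor-index labels and the V1-V306 block ranges. The short labels Z, Z5, DPPS, VHSE, FASGAI, VSW, ST, G, HESH, and HSEHPCSV follow the variable names quoted in the Results; the labels used for MS-WHIM1, MS-WHIM2, ISA-ECI, E, T, and V, and the function names, are assumptions.

```python
# (label, parameters per residue), in the order of the integrated X-matrix
DESCRIPTORS = [
    ("Z", 3), ("Z5", 5), ("DPPS", 10), ("MS-WHIM1", 3), ("MS-WHIM2", 3),
    ("ISA-ECI", 2), ("VHSE", 8), ("FASGAI", 6), ("VSW", 9), ("E", 5),
    ("T", 5), ("ST", 8), ("V", 3), ("G", 8), ("HESH", 12), ("HSEHPCSV", 12),
]
POSITIONS = ("N", "M", "C")  # N-terminus, middle, C-terminus

def variable_names():
    """Labels such as 'N-Z-1', ordered descriptor by descriptor."""
    names = []
    for label, n_params in DESCRIPTORS:
        for pos in POSITIONS:
            for k in range(1, n_params + 1):
                names.append(f"{pos}-{label}-{k}")
    return names

def block_ranges():
    """1-based (first, last) variable index of each descriptor block."""
    ranges, start = {}, 1
    for label, n_params in DESCRIPTORS:
        end = start + len(POSITIONS) * n_params - 1
        ranges[label] = (start, end)
        start = end + 1
    return ranges

names = variable_names()
ranges = block_ranges()
```

With these counts the total comes to 306 variables, and, for example, the VSW block spans V121-V147 and the HSEHPCSV block spans V271-V306, matching the correspondence given in the text.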

4.3. QSAR Model Building

Partial least squares (PLS) regression [40] was used to build the connection between the peptide structure descriptions (variables, X-matrices) and the relative activities (responses, Y-vectors). It was implemented using MATLAB software (Version R2015a, The MathWorks, Inc., Natick, MA, USA). All of the variables were auto-scaled to unit variance and all of the responses were mean-centered prior to model building. The models were validated using cross-validation, and the optimal number of PLS components was chosen based on a statistic called Q², which is the cross-validated R² and reflects the predictive ability of the model. R² is the coefficient of determination, providing an estimate of the model fit.

MPA was applied to optimize the model through outlier elimination and variable selection. It is a framework for model building that utilizes multiple models instead of a single model to construct results [16,17]. Generally, it works as follows: (1) a random resampling procedure is applied to obtain sub-datasets; (2) sub-models are built based on the sub-datasets; and (3) a statistical analysis is used to extract useful information from the outcomes of the sub-models.

The MPA-based outlier detection method [19] was applied to remove the outlying samples from the measured data. To begin with, 1000 sub-datasets were generated by randomly selecting 80% of the samples in the sample space. Subsequently, for each sub-dataset, a PLS regression model was built and the prediction error for each sample was recorded. The mean of the prediction errors was used as the basis for outlier detection and a three-sigma rule was applied to define the boundary, as reported previously [6].

The bootstrapping soft shrinkage (BOSS) method [18], which is also based on the idea of MPA, was applied to select informative variables from the pool of descriptors. Firstly, 1000 sub-datasets were obtained using bootstrap resampling in the variable space. Afterwards, 1000 PLS models were built based on the sub-datasets and the regression coefficients were extracted. In the next step, weighted bootstrap resampling was used to regenerate the sub-datasets and to rebuild the sub-models. The resampling procedure was repeated until all of the uninformative variables were eliminated.
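
The BOSS steps described above (bootstrap resampling in the variable space, PLS sub-models, coefficient-based weights, weighted bootstrap resampling, repetition) can be sketched as follows. This is a simplified illustration rather than the published BOSS algorithm [18] or the authors' MATLAB code. The fraction of best sub-models used to update the weights (`top_frac`), the 5-fold cross-validation, the rule for shrinking the number of draws, the stopping rule, and all function names are assumptions, and the demo uses 200 sub-models per iteration instead of 1000 to keep the run short.

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """PLS1 (NIPALS) coefficients for already scaled X and centered y."""
    E, f, W, P, q = X.copy(), y.copy(), [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w = w / np.linalg.norm(w)
        t = E @ w
        p, c = E.T @ t / (t @ t), (f @ t) / (t @ t)
        E, f = E - np.outer(t, p), f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def rmsecv(X, y, n_comp, folds):
    """Root-mean-square error of cross-validation over the given folds."""
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        b = pls1_coef(X[train] - mu_x, y[train] - mu_y, n_comp)
        press += np.sum((y[test] - ((X[test] - mu_x) @ b + mu_y)) ** 2)
    return np.sqrt(press / len(y))

def boss_sketch(X, y, n_models=200, n_comp=3, top_frac=0.1, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), n_folds)
    weights, n_draw = np.full(p, 1.0 / p), p
    best_rmse, best_vars = np.inf, np.arange(p)
    while True:
        results = []
        for _ in range(n_models):
            # (weighted) bootstrap in variable space; duplicates are dropped
            idx = np.unique(rng.choice(p, size=n_draw, replace=True, p=weights))
            a = min(n_comp, len(idx))
            results.append((rmsecv(X[:, idx], y, a, folds), idx, a))
        results.sort(key=lambda r: r[0])
        if results[0][0] < best_rmse:
            best_rmse, best_vars = results[0][0], results[0][1]
        if n_draw == 1:
            break
        # new weights: summed normalized |coefficients| of the best sub-models
        new_w = np.zeros(p)
        for _, idx, a in results[: max(1, int(top_frac * n_models))]:
            Xs = X[:, idx] - X[:, idx].mean(axis=0)
            b = np.abs(pls1_coef(Xs, y - y.mean(), a))
            new_w[idx] += b / b.sum()
        weights = new_w / new_w.sum()
        mean_size = np.mean([len(idx) for _, idx, _ in results])
        n_draw = max(1, min(n_draw - 1, int(round(mean_size))))
    return best_vars, best_rmse

# Synthetic data: only the first three of twelve variables are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # auto-scale
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=80)
selected, score = boss_sketch(X, y)
```

The shrinkage is "soft" in the sense that uninformative variables are not deleted outright; their sampling weights decrease from one iteration to the next, so they appear in fewer and fewer sub-models.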

5. Conclusions

In this study, we constructed QSAR models on two datasets of antioxidant tripeptides, i.e., the FTC dataset and the FRAP dataset. After integrating 16 amino acid descriptors and applying the MPA strategy for model building, the Q² values increased from 0.6170 to 0.7471 and from 0.4878 to 0.6088, respectively. The results show that the MPA framework is powerful in QSAR model building on antioxidant tripeptide data. The framework can also be applied to investigate the structure-activity relationships of other types of bioactive peptides and to integrate a wider range of molecular descriptors.