Abstract
New quantitative structure-activity relationship (QSAR) models for bitter peptides were built with integrated amino acid descriptors. The datasets contained 48 dipeptides, 52 tripeptides and 23 tetrapeptides with their reported bitter taste thresholds. The independent variables were drawn from 14 amino acid descriptor sets. A bootstrapping soft shrinkage (BOSS) approach was used for variable selection. The importance of each variable was evaluated by both its selection frequency and its standardized regression coefficient. The models for di-, tri- and tetrapeptides had R² and Q² of 0.950 ± 0.002 and 0.941 ± 0.001; 0.770 ± 0.006 and 0.742 ± 0.004; and 0.972 ± 0.002 and 0.956 ± 0.002, respectively. The hydrophobic C-terminal amino acid was the key determinant of bitterness in dipeptides, followed by the contribution of bulky hydrophobic N-terminal amino acids. For tripeptides, the hydrophobicity of the C-terminal amino acid and the electronic properties of the amino acid at the second position were important. For tetrapeptides, bulky hydrophobic amino acids at the N-terminus, the hydrophobicity and partial specific volume of the amino acid at the second position, and the electronic properties of the amino acids at the remaining two positions were critical. In summary, this study not only constructs reliable models for predicting bitterness in different groups of peptides, but also improves understanding of their structure-bitterness relationships and provides insights for future studies.
Keywords: QSAR; amino acid descriptors; bitter; peptides
Year: 2019 PMID: 31387305 PMCID: PMC6696392 DOI: 10.3390/molecules24152846
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Statistical parameters of quantitative structure–activity relationship (QSAR) models for di-, tri- and tetrapeptides using integrated descriptor sets.
| BOSS a | Variable Number | Name of Group | A b | R² b | Q² b | RMSECV b | RMSE b |
|---|---|---|---|---|---|---|---|
| No | 174 | Dipeptides | 4.000 | 0.948 | 0.874 | 0.222 | 0.142 |
| Yes | 174 | Dipeptides | 2.000 ± 0.604 | 0.950 ± 0.002 | 0.941 ± 0.001 | 0.152 ± 0.001 | 0.139 ± 0.002 |
| No | 261 | Tripeptides | 3.000 | 0.760 | 0.521 | 0.407 | 0.289 |
| Yes | 261 | Tripeptides | 2.000 ± 0.450 | 0.770 ± 0.006 | 0.742 ± 0.004 | 0.299 ± 0.002 | 0.282 ± 0.004 |
| No | 348 | Tetrapeptides | 6.000 | 0.965 | 0.682 | 0.429 | 0.143 |
| Yes | 348 | Tetrapeptides | 6.000 ± 1.222 | 0.972 ± 0.002 | 0.956 ± 0.002 | 0.160 ± 0.004 | 0.127 ± 0.004 |
a ‘Yes/No’ indicates that the model was built with/without the BOSS (bootstrapping soft shrinkage) variable selection process, respectively; b A: the number of principal components in the PLS regression; R²: the coefficient of determination; Q²: the cross-validated R²; RMSECV: the root mean square error of cross-validation; RMSE: the root mean square error.
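The four fit statistics defined in the footnote can be illustrated with a minimal sketch. It assumes the common definitions (R² and Q² as one minus the ratio of residual to total sum of squares, and both error terms as root mean squared residuals over all samples); the source does not spell out its exact formulas. The toy values and the function names `r_squared` and `rmse` are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Return 1 - SS_res / SS_tot, with SS_tot taken around the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean squared difference between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical toy data, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]    # predictions from a model fitted on all samples
cv_pred = [1.3, 1.7, 3.4, 3.5]   # prediction for each sample while it was held out

r2 = r_squared(observed, fitted)     # goodness of fit
q2 = r_squared(observed, cv_pred)    # same formula applied to cross-validated predictions
rmse_fit = rmse(observed, fitted)
rmsecv = rmse(observed, cv_pred)

print(f"R2={r2:.3f} Q2={q2:.3f} RMSE={rmse_fit:.3f} RMSECV={rmsecv:.3f}")
```

Because the held-out predictions are usually worse than the fitted ones, Q² tends to fall below R² and RMSECV above RMSE, which is the pattern seen in every row of the table.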
Figure 1. Observed vs. predicted bitter activities of di- (a), tri- (b) and tetrapeptides (c). The x-axis represents the observed sensory values from literature. The y-axis represents the corresponding predicted values derived from the model having the lowest root mean square error of cross-validation (RMSECV) obtained by 100 bootstrapping soft shrinkage (BOSS) runs.
Statistical parameters of QSAR models for dipeptides built with a single amino acid descriptor set, compared with models built with the integrated descriptor sets.
| Descriptor | Variable Number | A a | R² a | Q² a | RMSECV a | RMSE a |
|---|---|---|---|---|---|---|
| 3z-scale | 6 | 3 | 0.838 | 0.792 | 0.284 | 0.251 |
| 5z-scale | 10 | 5 | 0.916 | 0.869 | 0.225 | 0.180 |
| DPPS | 20 | 5 | 0.934 | 0.849 | 0.242 | 0.160 |
| MS-WHIM | 6 | 4 | 0.757 | 0.686 | 0.349 | 0.307 |
| ISA-ECI | 4 | 2 | 0.845 | 0.808 | 0.273 | 0.245 |
| VHSE | 16 | 7 | 0.943 | 0.894 | 0.202 | 0.149 |
| FASGAI | 12 | 9 | 0.921 | 0.814 | 0.269 | 0.175 |
| VSW | 18 | 4 | 0.911 | 0.773 | 0.297 | 0.185 |
| T-scale | 10 | 6 | 0.900 | 0.830 | 0.257 | 0.197 |
| ST-scale | 16 | 10 | 0.913 | 0.655 | 0.366 | 0.184 |
| E-scale | 10 | 9 | 0.940 | 0.865 | 0.229 | 0.152 |
| V | 6 | 5 | 0.904 | 0.863 | 0.231 | 0.193 |
| G-scale | 16 | 9 | 0.937 | 0.855 | 0.238 | 0.157 |
| HESH | 24 | 4 | 0.942 | 0.881 | 0.215 | 0.150 |
| ID b | 174 | 4 | 0.948 | 0.874 | 0.222 | 0.142 |
| ID + BOSS1 c | 174 | 2.000 ± 0.604 | 0.950 ± 0.002 | 0.941 ± 0.001 | 0.152 ± 0.001 | 0.139 ± 0.002 |
| ID + BOSS2 d | 174 | 2 | 0.952 | 0.943 | 0.148 | 0.137 |
a A: the number of principal components in the PLS regression; R²: the coefficient of determination; Q²: the cross-validated R²; RMSECV: the root mean square error of cross-validation; RMSE: the root mean square error. b ID: integrated descriptor sets, i.e., the combination of all 14 descriptor sets. c ID + BOSS1: integrated descriptor sets with the BOSS (bootstrapping soft shrinkage) variable selection process; average statistical parameters of 100 runs. d ID + BOSS2: integrated descriptor sets with the BOSS variable selection process; statistical parameters of the model with the lowest RMSECV.
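The variable counts in this table can be cross-checked with a short sketch. It assumes that each descriptor set contributes the same number of values for every residue position and that the integrated set simply concatenates all 14 sets; the dictionary and function names are hypothetical.

```python
# Variable numbers reported for the single-descriptor dipeptide models
# (two residues per peptide).
dipeptide_variables = {
    "3z-scale": 6, "5z-scale": 10, "DPPS": 20, "MS-WHIM": 6, "ISA-ECI": 4,
    "VHSE": 16, "FASGAI": 12, "VSW": 18, "T-scale": 10, "ST-scale": 16,
    "E-scale": 10, "V": 6, "G-scale": 16, "HESH": 24,
}

# Assumption: descriptors per residue = dipeptide variable number / 2.
per_residue = {name: count // 2 for name, count in dipeptide_variables.items()}
descriptors_per_residue = sum(per_residue.values())

def integrated_variable_count(peptide_length):
    """Variables in the integrated set for a peptide of the given length."""
    return peptide_length * descriptors_per_residue

for length in (2, 3, 4):
    print(length, integrated_variable_count(length))
# prints:
# 2 174
# 3 261
# 4 348
```

Under these assumptions each residue carries 87 descriptor values, which reproduces the 174 variables listed for the integrated dipeptide models.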
Statistical parameters of QSAR models for tripeptides built with a single amino acid descriptor set, compared with models built with the integrated descriptor sets.
| Descriptor | Variable Number | A a | R² a | Q² a | RMSECV a | RMSE a |
|---|---|---|---|---|---|---|
| 3z-scale | 9 | 1 | 0.503 | 0.385 | 0.462 | 0.415 |
| 5z-scale | 15 | 2 | 0.669 | 0.526 | 0.405 | 0.339 |
| DPPS | 30 | 5 | 0.722 | 0.444 | 0.439 | 0.310 |
| MS-WHIM | 9 | 1 | 0.592 | 0.445 | 0.439 | 0.376 |
| ISA-ECI | 6 | 1 | 0.525 | 0.357 | 0.472 | 0.406 |
| VHSE | 24 | 3 | 0.689 | 0.439 | 0.441 | 0.329 |
| FASGAI | 18 | 5 | 0.770 | 0.572 | 0.385 | 0.282 |
| VSW | 27 | 5 | 0.789 | 0.504 | 0.415 | 0.270 |
| T-scale | 15 | 1 | 0.629 | 0.375 | 0.465 | 0.359 |
| ST-scale | 24 | 1 | 0.638 | 0.548 | 0.396 | 0.354 |
| E-scale | 15 | 2 | 0.678 | 0.532 | 0.403 | 0.334 |
| V | 9 | 2 | 0.560 | 0.432 | 0.444 | 0.390 |
| G-scale | 24 | 6 | 0.745 | 0.533 | 0.402 | 0.298 |
| HESH | 36 | 1 | 0.669 | 0.520 | 0.408 | 0.339 |
| ID b | 261 | 3 | 0.760 | 0.521 | 0.407 | 0.289 |
| ID + BOSS1 c | 261 | 2.000 ± 0.450 | 0.770 ± 0.006 | 0.742 ± 0.004 | 0.299 ± 0.002 | 0.282 ± 0.004 |
| ID + BOSS2 d | 261 | 1 | 0.773 | 0.751 | 0.294 | 0.280 |
a A: the number of principal components in the PLS regression; R²: the coefficient of determination; Q²: the cross-validated R²; RMSECV: the root mean square error of cross-validation; RMSE: the root mean square error. b ID: integrated descriptor sets, i.e., the combination of all 14 descriptor sets. c ID + BOSS1: integrated descriptor sets with the BOSS (bootstrapping soft shrinkage) variable selection process; average statistical parameters of 100 runs. d ID + BOSS2: integrated descriptor sets with the BOSS variable selection process; statistical parameters of the model with the lowest RMSECV.
Statistical parameters of QSAR models for tetrapeptides built with a single amino acid descriptor set, compared with models built with the integrated descriptor sets.
| Descriptor | Variable Number | A a | R² a | Q² a | RMSECV a | RMSE a |
|---|---|---|---|---|---|---|
| 3z-scale | 12 | 2 | 0.822 | 0.490 | 0.544 | 0.322 |
| 5z-scale | 20 | 6 | 0.938 | 0.533 | 0.521 | 0.189 |
| DPPS | 40 | 8 | 0.968 | 0.676 | 0.433 | 0.136 |
| MS-WHIM | 12 | 3 | 0.813 | 0.349 | 0.615 | 0.330 |
| ISA-ECI | 8 | 3 | 0.717 | 0.017 | 0.755 | 0.406 |
| VHSE | 32 | 4 | 0.922 | 0.694 | 0.421 | 0.213 |
| FASGAI | 24 | 3 | 0.907 | 0.714 | 0.408 | 0.233 |
| VSW | 36 | 6 | 0.969 | 0.512 | 0.532 | 0.135 |
| T-scale | 20 | 1 | 0.624 | 0.452 | 0.564 | 0.467 |
| ST-scale | 32 | 1 | 0.642 | 0.155 | 0.700 | 0.456 |
| E-scale | 20 | 5 | 0.948 | 0.557 | 0.507 | 0.173 |
| V | 12 | 2 | 0.794 | 0.525 | 0.525 | 0.345 |
| G-scale | 32 | 4 | 0.879 | 0.620 | 0.469 | 0.265 |
| HESH | 48 | 4 | 0.934 | 0.703 | 0.415 | 0.195 |
| ID b | 348 | 6 | 0.965 | 0.682 | 0.429 | 0.143 |
| ID + BOSS1 c | 348 | 6.000 ± 1.222 | 0.972 ± 0.002 | 0.956 ± 0.002 | 0.160 ± 0.004 | 0.127 ± 0.004 |
| ID + BOSS2 d | 348 | 6 | 0.973 | 0.956 | 0.160 | 0.123 |
a A: the number of principal components in the PLS regression; R²: the coefficient of determination; Q²: the cross-validated R²; RMSECV: the root mean square error of cross-validation; RMSE: the root mean square error. b ID: integrated descriptor sets, i.e., the combination of all 14 descriptor sets. c ID + BOSS1: integrated descriptor sets with the BOSS (bootstrapping soft shrinkage) variable selection process; average statistical parameters of 100 runs. d ID + BOSS2: integrated descriptor sets with the BOSS variable selection process; statistical parameters of the model with the lowest RMSECV.
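The difference between the two BOSS rows (an average over repeated runs versus the single run with the lowest RMSECV) can be sketched as follows. The run values are hypothetical and only five runs are shown where the study used 100; the use of the sample standard deviation for the ± term is also an assumption, since the source does not say which estimator was used.

```python
import statistics

# Hypothetical per-run results, not the study's actual values.
runs = [
    {"run": 1, "q2": 0.941, "rmsecv": 0.152},
    {"run": 2, "q2": 0.942, "rmsecv": 0.151},
    {"run": 3, "q2": 0.940, "rmsecv": 0.153},
    {"run": 4, "q2": 0.943, "rmsecv": 0.148},
    {"run": 5, "q2": 0.941, "rmsecv": 0.152},
]

def mean_and_sd(runs, key):
    """Average and sample standard deviation of one statistic across runs."""
    values = [r[key] for r in runs]
    return statistics.mean(values), statistics.stdev(values)

# "BOSS1"-style summary: mean ± SD over all runs.
q2_mean, q2_sd = mean_and_sd(runs, "q2")
rmsecv_mean, rmsecv_sd = mean_and_sd(runs, "rmsecv")

# "BOSS2"-style summary: the single run with the lowest RMSECV.
best_run = min(runs, key=lambda r: r["rmsecv"])

print(f"Q2 = {q2_mean:.3f} ± {q2_sd:.3f}")
print(f"RMSECV = {rmsecv_mean:.3f} ± {rmsecv_sd:.3f}")
print("best run:", best_run["run"])
```

Reporting both summaries shows how stable the selection is across runs (the ± term) as well as the best model that was actually obtained.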
Figure 2. Variable importance of QSAR models for dipeptides. (a) Selection frequency of each variable over 100 BOSS runs; (b) standardized regression coefficient of each variable in the model with the smallest RMSECV from 100 BOSS runs.
Figure 3. Variable importance of QSAR models for tripeptides. (a) Selection frequency of each variable over 100 BOSS runs; (b) standardized regression coefficient of each variable in the model with the smallest RMSECV from 100 BOSS runs.
Figure 4. Variable importance of QSAR models for tetrapeptides. (a) Selection frequency of each variable over 100 BOSS runs; (b) standardized regression coefficient of each variable in the model with the smallest RMSECV from 100 BOSS runs.
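The two importance measures named in these captions can be illustrated with a minimal sketch. It assumes that the selection frequency is the fraction of runs in which a variable was retained, and that a standardized coefficient is the raw coefficient rescaled by sd(x)/sd(y), which is the common textbook definition; the source does not state its exact formulas. All data and function names are hypothetical.

```python
from collections import Counter
import statistics

def selection_frequency(selected_per_run, n_variables):
    """Fraction of runs in which each variable index was retained."""
    counts = Counter(v for run in selected_per_run for v in run)
    n_runs = len(selected_per_run)
    return [counts[v] / n_runs for v in range(n_variables)]

def standardized_coefficient(coef, x_values, y_values):
    """Rescale a raw regression coefficient by sd(x) / sd(y)."""
    return coef * statistics.stdev(x_values) / statistics.stdev(y_values)

# Hypothetical: indices of the variables kept in each of four runs.
selected_per_run = [{0, 3, 5}, {0, 5}, {0, 2, 5}, {0, 3}]
freq = selection_frequency(selected_per_run, n_variables=6)
print(freq)  # [1.0, 0.0, 0.25, 0.5, 0.0, 0.75]

# Hypothetical single-variable example where y = 2 * x exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
b_std = standardized_coefficient(2.0, x, y)
print(round(b_std, 6))  # 1.0
```

A variable that is selected in most runs and also has a large standardized coefficient in the best model is a stronger candidate for interpretation than one that scores highly on only one of the two measures.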