Qian Liu1, Jing Lin1, Li Wen1, Shaozhou Wang1, Peng Zhou1, Li Mei2, Shuyong Shang3.
Abstract
Protein-protein association in cellular signaling networks (CSNs) often takes the form of a weak, transient, and reversible domain-peptide interaction (DPI), in which a flexible peptide segment on the surface of one protein is recognized and bound by a rigid peptide-recognition domain from another. Reliable modeling and accurate prediction of DPI binding affinities would help to ascertain the diverse biological events involved in CSNs and benefit our understanding of the biological implications underlying DPIs. Traditionally, the peptide quantitative structure-activity relationship (pQSAR) has been widely used to model and predict the biological activity of oligopeptides; it employs amino acid descriptors (AADs) to characterize peptide structures at the sequence level and then statistically correlates the resulting descriptor vector with observed activity data via regression. However, pQSAR has not yet been widely applied to the direct binding of peptide ligands to their protein receptors on a large scale. In this work, we attempted to clarify whether the pQSAR methodology can work effectively for modeling and predicting DPI affinities in a high-throughput manner. Over twenty thousand short linear motif (SLiM)-containing peptide segments involved in SH3, PDZ and 14-3-3 domain-mediated CSNs were compiled to define a comprehensive sequence-based data set of DPI affinities, which were represented by the Boehringer light units (BLUs) derived from previous arbitrary light intensity assays following SPOT peptide synthesis. Four sophisticated machine learning methods (MLMs) were then used to perform pQSAR modeling on the set described with different AADs, systematically creating a variety of linear and nonlinear predictors, which were then verified by rigorous statistical tests.
It is revealed that genome-wide DPI events can only be modeled qualitatively or semiquantitatively with the traditional pQSAR strategy, owing to the intrinsic disorder of peptide conformation and the potential interplay between different peptide residues. In addition, the arbitrary BLUs used to characterize DPI affinity values were measured via an indirect approach, which may not be very reliable and may involve strong noise, thus leading to a considerable bias in the modeling. $R_{prd}^{2} = 0.7$ can be considered the upper limit of the external generalization ability of the pQSAR methodology working on large-scale DPI affinity data.
Keywords: amino acid descriptor; computational peptidology; domain-peptide interaction; machine learning; peptide quantitative structure-activity relationship; statistical modeling
Year: 2022 PMID: 35096016 PMCID: PMC8795790 DOI: 10.3389/fgene.2021.800857
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
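The sequence-level characterization described in the abstract (each residue replaced by its amino acid descriptor, and the per-residue values concatenated into one vector for regression) can be sketched as follows. This is a minimal illustration only: the two-component descriptor values in `TOY_AAD` are made up for the example and are not the VHSE, MolSurf, ST_scale or VSGETAWAY values used in the study, and the function name `encode_peptide` is hypothetical.

```python
# Hypothetical two-component descriptors per residue.
# These numbers are invented for illustration; they are NOT real AAD values.
TOY_AAD = {
    "A": (0.1, -0.5),
    "P": (-0.3, 0.8),
    "R": (1.2, 0.4),
    "L": (0.9, -1.1),
}


def encode_peptide(sequence, table=TOY_AAD):
    """Concatenate per-residue descriptors into one flat feature vector."""
    vector = []
    for residue in sequence:
        if residue not in table:
            raise KeyError(f"no descriptor for residue {residue!r}")
        vector.extend(table[residue])
    return vector


# A 4-residue peptide with 2 descriptors per residue gives 8 features.
features = encode_peptide("APRL")
print(len(features), features)
```

Vectors built this way for peptides of equal length would form the rows of the descriptor matrix that a regression model is then fitted against.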
Four MLMs used in this study.
| MLM | Type | Variable standardization | Model parameter | Parameter optimization |
|---|---|---|---|---|
| PLS | Linear | Autoscaling | NLV: number of latent variables | Increase of cumulative cross-validation |
| SVM | Nonlinear | [–1, +1] scaling | ε: ε-insensitive loss function | Systematic grid search for minimizing cross-validation RMSEcv |
| | | | C: penalty factor | |
| | | | σ²: kernel radial | |
| RF | Nonlinear | [–1, +1] scaling | ntree: number of trees | Systematic grid search for minimizing cross-validation RMSEcv |
| | | | mtry: size of descriptor subset | |
| GP | Linear/nonlinear | Autoscaling | Θ: hyperparameter set | Automatic determination |
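The two variable standardizations and the grid-search idea listed in the table can be sketched in plain Python. This is an assumption-laden illustration, not the authors' implementation: a simple k-nearest-neighbour regressor stands in for SVM/RF so that the example stays self-contained, the tuned parameter is `k` rather than ε, C, σ², ntree or mtry, and all function names are hypothetical.

```python
import math


def autoscale(column):
    """Autoscaling: zero mean, unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    if sd == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - mean) / sd for v in column]


def range_scale(column):
    """Linear scaling of a descriptor column onto [-1, +1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [2 * (v - lo) / (hi - lo) - 1 for v in column]


def knn_predict(train_X, train_y, x, k):
    """Stand-in model: mean response of the k nearest training samples."""
    nearest = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse_cv(X, y, k, n_folds=5):
    """Cross-validated RMSE using interleaved folds (sample i goes to fold i % n_folds)."""
    n = len(X)
    squared_error = 0.0
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        tr_X = [X[i] for i in train]
        tr_y = [y[i] for i in train]
        for i in test:
            squared_error += (knn_predict(tr_X, tr_y, X[i], k) - y[i]) ** 2
    return math.sqrt(squared_error / n)


def grid_search(X, y, grid, n_folds=5):
    """Evaluate every candidate value and keep the one with the lowest RMSEcv."""
    scores = {k: rmse_cv(X, y, k, n_folds) for k in grid}
    best = min(scores, key=scores.get)
    return best, scores
```

With several parameters, as for SVM or RF in the table, the same loop would run over every combination of candidate values instead of a single list.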
Summary of 21,704 SLiM-containing peptide samples binding to ten SH3, seven PDZ and one 14-3-3 domains.
| Panel | Domain | Parent protein | Domain number | Species | Peptide number |
|---|---|---|---|---|---|
| 1 | SH3 | Amphiphysin | 1/1 | Human | 884 |
| 2 | SH3 | Amphiphysin | 1/1 | Yeast | 2032 |
| 3 | SH3 | Boi1 | 1/1 | Yeast | 1336 |
| 4 | SH3 | Boi2 | 1/1 | Yeast | 1312 |
| 5 | SH3 | Endophilin | 1/1 | Yeast | 1998 |
| 6 | SH3 | Myosin5 | 1/1 | Yeast | 1139 |
| 7 | SH3 | Rvs167 | 1/1 | Yeast | 1369 |
| 8 | SH3 | Sho1 | 1/1 | Yeast | 1015 |
| 9 | SH3 | Yfr024 | 1/1 | Yeast | 1282 |
| 10 | SH3 | Yhr016c | 1/1 | Yeast | 1348 |
| 11 | PDZ | CALP | 1/1 | Human | 80 |
| 12 | PDZ | NHERF1 | 1/2 | Human | 77 |
| 13 | PDZ | NHERF1 | 2/2 | Human | 80 |
| 14 | PDZ | NHERF2 | 1/2 | Human | 80 |
| 15 | PDZ | NHERF2 | 2/2 | Human | 80 |
| 16 | PDZ | SYNA1 | 1/1 | Human | 56 |
| 17 | PDZ | PSD95 | 1/1 | Human | 6068 |
| 18 | 14-3-3 | 14-3-3 | 1/1 | Yeast | 1163 |
Comparison of different MLMs on different DPI samples.
| MLM | DPI | $R_{fit}^{2}$ (training set) | RMSEfit (training set) | $R_{cv}^{2}$ (training set) | RMSEcv (training set) | $R_{prd}^{2}$ (test set) | RMSEprd (test set) |
|---|---|---|---|---|---|---|---|
| PLS | SH3 | 0.8641 | 0.4765 | 0.8335 | 0.5275 | 0.3072 | 0.5851 |
| PLS | PDZ | 0.9312 | 0.1062 | 0.1077 | 0.3823 | 0.2263 | 0.3276 |
| PLS | 14-3-3 | 0.4344 | 0.7048 | 0.3341 | 0.7687 | 0.3625 | 0.7446 |
| GP | SH3 | 0.8668 | 0.4719 | 0.8349 | 0.5252 | 0.3147 | 0.5808 |
| GP | PDZ | 0.6984 | 0.2223 | 0.1953 | 0.3631 | 0.3391 | 0.3028 |
| GP | 14-3-3 | 0.4334 | 0.7091 | 0.3548 | 0.7566 | 0.3669 | 0.7420 |
| RF | SH3 | 0.9470 | 0.2975 | 0.2074 | 1.1509 | 0.4973 | 0.4987 |
| RF | PDZ | 0.8191 | 0.1722 | 0.4005 | 0.3134 | 0.3824 | 0.2928 |
| RF | 14-3-3 | 0.8116 | 0.4088 | 0.2562 | 0.8124 | 0.3456 | 0.7715 |
| SVM | SH3 | 0.8772 | 0.4530 | 0.8352 | 0.5248 | 0.3091 | 0.5843 |
| SVM | PDZ | 0.7242 | 0.2126 | 0.1880 | 0.3647 | 0.2689 | 0.3594 |
| SVM | 14-3-3 | 0.5211 | 0.6519 | 0.3614 | 0.7527 | 0.3886 | 0.7279 |
| LibSVM | SH3 | 0.7008 | 0.2971 | 0.6817 | 0.3144 | 0.4254 | 0.3693 |
| LibSVM | PDZ | 0.8778 | 0.0813 | 0.1295 | 0.1528 | 0.2766 | 0.1189 |
| LibSVM | 14-3-3 | 0.4003 | 0.6085 | 0.3097 | 0.6702 | 0.3025 | 0.6342 |
The VHSE descriptor was used to characterize peptide sequences.
Human amphiphysin SH3 (1/1), human SYNA1 PDZ (1/1) and yeast 14-3-3 (1/1) were selected for case analysis.
$R_{fit}^{2}$, $R_{cv}^{2}$ and $R_{prd}^{2}$ are the determination coefficients of internal fitting on the training set, internal cross-validation on the training set, and external blind prediction on the test set, respectively.
RMSEfit, RMSEcv and RMSEprd are the root-mean-square errors of internal fitting on the training set, internal cross-validation on the training set, and external blind prediction on the test set, respectively.
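The two kinds of statistics defined in these footnotes can be computed as in the sketch below. It assumes the common definition of the determination coefficient, 1 − SS_res/SS_tot with the mean taken over the evaluated set; the record does not state which variant the authors used (for external prediction, some studies use the training-set mean or the squared correlation instead). The function names are hypothetical.

```python
import math


def rmse(y_obs, y_calc):
    """Root-mean-square error between observed and calculated values."""
    n = len(y_obs)
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(y_obs, y_calc)) / n)


def r_squared(y_obs, y_calc):
    """Determination coefficient, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - c) ** 2 for o, c in zip(y_obs, y_calc))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot


# Which statistic results (fit, cv or prd) depends only on where y_calc comes from:
# fitted values on the training set, cross-validated predictions on the training
# set, or blind predictions on the test set.
observed = [1.0, 2.0, 3.0, 4.0]
calculated = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, calculated), rmse(observed, calculated))
```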
FIGURE 1. Scatter plots of fitted/predicted against experimental LogBLU values over 884 human amphiphysin SH3 (1/1)-binding peptides with MolSurf characterization and using different MLMs: (A,B) PLSR; (C,D) GP; (E,F) RF; (G,H) SVM; and (I,J) LibSVM.
FIGURE 2. Comparison between the external predictive powers ($R_{prd}^{2}$) of SVM-based pQSAR modeling on different DPI sample panels with ZP-explore and LibSVM.
FIGURE 3. Scatter plots of calculated against experimental LogBLU values over 1193 human amphiphysin SH3 (1/1)-binding peptides (A-D), 56 human SYNA1 PDZ (1/1)-binding peptides (E-H) and 2025 yeast endophilin SH3 (1/1)-binding peptides (I-L) with GP modeling and using different AADs.
FIGURE 4. Comparison between the external predictive powers ($R_{prd}^{2}$) of PLS-/GP-based pQSAR modeling on different DPI sample panels with MolSurf, ST_scale, VHSE and VSGETAWAY.
Change in external predictive power ($R_{prd}^{2}$) of PLS/GP/RF/SVM/LibSVM-based pQSAR modeling on the human PSD95 PDZ (1/1) panel characterized by MolSurf with different sample sizes of subset-1000, subset-3000 and fullset-6068.
| Size | PLS | GP | RF | SVM | LibSVM |
|---|---|---|---|---|---|
| Subset-1000 | 0.014 | 0.080 | 0.362 | 0.071 | 0.084 |
| Subset-3000 | 0.087 | 0.088 | 0.485 | 0.119 | 0.117 |
| Fullset-6068 | 0.149 | 0.198 | 0.421 | 0.204 | 0.147 |