| Literature DB >> 35559171 |
Abstract
There has been growing interest in using peptides for the controlled synthesis of nanomaterials. Peptides play a crucial role not only in regulating the nanostructure formation process but also in influencing the resulting properties of the nanomaterials. Leveraging machine learning (ML) in the biomimetic workflow is anticipated to accelerate peptide discovery, make the process more resource-efficient, and unravel associations among attributes that may be useful in peptide design. In this study, a binary ML classifier was formulated and then trained and tested on 1720 peptide examples. The support vector machine classifier uses Kidera factors to categorize peptides into one of two groups based on their binding ability. The classifier exhibits satisfactory performance, as demonstrated by various performance metrics. In addition, key variables with a substantial impact on the model were identified, such as peptide hydrophobicity. Because these trends were derived from a large and diverse dataset, the insights drawn from the data are expected to be generalizable and robust. Thus, the presented ML model is an important step toward rational and predictive peptide design.
Entities:
Year: 2022 PMID: 35559171 PMCID: PMC9089360 DOI: 10.1021/acsomega.2c00640
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
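The pipeline described in the abstract (encode each peptide by its Kidera factors, then classify with an RBF-kernel support vector machine) can be sketched as follows. This is not the authors' code: the Kidera values shown are illustrative placeholders for three residues (the published 20-residue Kidera table should be used in practice), the binding labels are invented, and scikit-learn's `gamma` serves as the kernel-width parameter reported as sigma in the paper.

```python
# Minimal sketch (not the authors' pipeline): classify peptides from
# averaged Kidera factors (KF1..KF10) with an RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

# Placeholder 10-dimensional Kidera-factor vectors for three residues.
# Illustrative values only; substitute the published Kidera table.
KIDERA = {
    "A": [-1.6, -1.7, -1.0, -0.3, -0.9, -0.8, -0.2, -0.1, 0.2, -0.5],
    "G": [ 1.5, -2.0, -0.2, -0.2,  0.1, -0.1,  1.3,  2.4, -1.7,  0.5],
    "K": [-0.3,  0.8, -0.2,  1.7,  1.5, -1.6,  1.2, -0.1, -0.5,  0.6],
}

def encode(peptide: str) -> np.ndarray:
    """Represent a peptide as the mean of its residues' Kidera factors."""
    return np.mean([KIDERA[aa] for aa in peptide], axis=0)

# Invented binding labels (1 = binder, 0 = non-binder), for illustration.
X = np.array([encode(p) for p in ["AKG", "GGA", "KKA", "AAG"]])
y = np.array([1, 0, 1, 0])

# gamma corresponds to the kernel width tuned as sigma in the paper.
clf = SVC(kernel="rbf", C=1.0, gamma=0.108)
clf.fit(X, y)
print(clf.predict([encode("KAG")]))
```

In practice the feature matrix would hold the 1720 peptides of the study's dataset rather than toy sequences.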
Figure 1. Training performance represented by classification accuracy. Accuracy scores were obtained from training on n = 1291 examples, followed by 10-fold cross-validation. The darker the color, the higher the accuracy of the model.
Figure 2. Training performance represented by kappa values. Kappa scores were obtained from training on n = 1291 examples, followed by 10-fold cross-validation. The darker the color, the higher the kappa score of the model.
Variable Importance Scores
| Variable | Importance score |
|---|---|
| KF4 (hydrophobicity) | 1.416 |
| KF2 (side chain size) | 1.279 |
| KF3 (extended structure preference) | 1.274 |
| KF9 (pK-C) | 1.252 |
| KF7 (flat extended preference) | 1.102 |
| KF5 (double-bend preference) | 1.084 |
| KF10 (surrounding hydrophobicity) | 1.080 |
| KF6 (partial specific volume) | 1.070 |
| KF8 (occurrence in alpha region) | 1.056 |
| KF1 (helix/bend preference) | 1.027 |
The higher the score, the greater the impact of the specific variable on the classification model. The importance score is derived from the increase in classification error when that variable is removed.
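A drop-column procedure matching the table's footnote, refitting the model without each Kidera factor and measuring the resulting loss in cross-validated accuracy, can be sketched as follows. The data here are synthetic stand-ins, not the study's peptide dataset.

```python
# Drop-column importance sketch: importance of feature i is the drop in
# cross-validated accuracy when feature i is removed. (Synthetic data.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

baseline = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
for i in range(X.shape[1]):
    reduced = np.delete(X, i, axis=1)  # refit without feature i
    acc = cross_val_score(SVC(kernel="rbf"), reduced, y, cv=5).mean()
    print(f"KF{i + 1}: importance = {baseline - acc:+.3f}")
```

A positive score means the model loses accuracy without that feature, mirroring how KF4 (hydrophobicity) tops the table above.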
Figure 3. Hyperparameter tuning, in which the training classification accuracy is monitored while the sigma and C values are varied. The inset graph shows the low-sigma region where the highest classification accuracy was achieved.
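The tuning in Figure 3 amounts to a grid search over C and the RBF kernel width, scored by cross-validated accuracy. A minimal sketch with scikit-learn, using synthetic data and a small illustrative grid that includes the paper's optimum (C = 1, sigma = 0.108, expressed here as `gamma`):

```python
# Grid-search sketch of the Figure 3 tuning: vary C and the RBF width,
# track 10-fold cross-validated accuracy. (Synthetic data; illustrative grid.)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.25, 0.5, 1, 2, 4], "gamma": [0.01, 0.05, 0.108, 0.5, 1.0]},
    cv=10,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```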
Performance of the Optimized Model (SVM with a RBF Kernel, C = 1, and Sigma = 0.108) on the Test Set, as Demonstrated by the Confusion Matrix
| | class A | class B |
|---|---|---|
| class A | 182 | 52 |
| class B | 33 | 162 |
Other performance metrics that are derived from the confusion matrix include: accuracy = 0.802, F1 = 0.811, recall = 0.847, sensitivity = 0.847, specificity = 0.757, precision = 0.778, and kappa = 0.604.
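The reported metrics follow directly from the confusion matrix; working backward from the values, the rows are the predicted classes, the columns the actual classes, and class A is the positive class. The arithmetic can be checked as:

```python
# Reproduce the test-set metrics from the confusion matrix
# (rows = predicted, columns = actual; class A = positive).
tp, fp = 182, 52   # predicted A: actually A / actually B
fn, tn = 33, 162   # predicted B: actually A / actually B
n = tp + fp + fn + tn

accuracy = (tp + tn) / n                             # 0.802
recall = sensitivity = tp / (tp + fn)                # 0.847
specificity = tn / (tn + fp)                         # 0.757
precision = tp / (tp + fp)                           # 0.778
f1 = 2 * precision * recall / (precision + recall)   # 0.811

# Cohen's kappa: observed agreement corrected for chance agreement.
p_o = accuracy
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_o - p_e) / (1 - p_e)                      # 0.604

print(round(accuracy, 3), round(f1, 3), round(kappa, 3))
```

The same formulas reproduce the external-validation metrics below from its confusion matrix.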
External Validation of the Optimized Model (SVM with a RBF Kernel, C = 1, and Sigma = 0.108) As Demonstrated by the Confusion Matrix
| | class A | class B |
|---|---|---|
| class A | 7 | 1 |
| class B | 9 | 20 |
The dataset used in the external validation is available in the Supporting Information (Table S2). Other performance metrics that are derived from the confusion matrix include: accuracy = 0.73, F1 = 0.583, recall = 0.438, sensitivity = 0.438, specificity = 0.952, precision = 0.875, kappa = 0.415.