William S Sanders, C Ian Johnston, Susan M Bridges, Shane C Burgess, Kenneth O Willeford.
Abstract
Cell penetrating peptides (CPPs) are peptides that can traverse cell membranes to enter cells. Once inside the cell, different CPPs can localize to different cellular components and perform different roles. Some generate pore-forming complexes that destroy cells, while others localize to various organelles. Using machine learning methods to predict potential new CPPs will enable more rapid screening for applications such as drug delivery. We have investigated how the composition of training datasets influences the ability to classify peptides as cell penetrating using support vector machines (SVMs). We identified 111 known CPPs and 34 known non-penetrating peptides from the literature and commercial vendors and used several approaches to build training datasets for the classifiers. Features were calculated from the datasets using a set of basic biochemical properties combined with features from the literature determined to be relevant in the prediction of CPPs. Our results using different training datasets confirm the importance of a balanced training set with approximately equal numbers of positive and negative examples. The SVM-based classifiers have greater classification accuracy than previously reported methods for the prediction of CPPs, and because they use primary biochemical properties of the peptides as features, these classifiers provide insight into the properties needed for cell penetration. To confirm our SVM classifications, a subset of peptides classified as either penetrating or non-penetrating was selected for synthesis and experimental validation. All of the synthesized peptides predicted to be CPPs were shown to be penetrating.
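The balancing idea in the abstract (111 positives against 34 negatives) can be illustrated with a minimal sketch that resamples the smaller class with replacement until both classes are the same size. The function name `balance_by_oversampling`, the placeholder identifiers, and the choice to keep every original example and top up with resampled copies are assumptions made for illustration; the abstract does not specify these details.

```python
import random

def balance_by_oversampling(positives, negatives, seed=0):
    """Return both classes at equal size by resampling the smaller one.

    Assumption: every original example is kept, and extra copies are drawn
    from the smaller class with replacement.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    if len(pos) < len(neg):
        pos += rng.choices(pos, k=len(neg) - len(pos))
    elif len(neg) < len(pos):
        neg += rng.choices(neg, k=len(pos) - len(neg))
    return pos, neg

# Placeholder identifiers standing in for the 111 CPPs and 34 non-CPPs.
cpps = [f"cpp_{i}" for i in range(111)]
non_cpps = [f"non_cpp_{i}" for i in range(34)]

bal_pos, bal_neg = balance_by_oversampling(cpps, non_cpps)
print(len(bal_pos), len(bal_neg))  # 111 111
```

Fixing the seed keeps the resampled set reproducible, which matters when several classifiers are compared on the same balanced data.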
Year: 2011 PMID: 21779156 PMCID: PMC3136433 DOI: 10.1371/journal.pcbi.1002101
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Confusion matrices for datasets generated using different approaches.
| Dataset | Actual class | Classified as Non-CPP | Classified as CPP |
| Dataset 1 – Unbalanced (total examples 145) | Non-CPP | 0 | 34 |
| | CPP | 1 | 110 |
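As a worked check, the overall and per-class accuracies reported for the unbalanced classifier follow directly from this confusion matrix. The sketch below is illustrative only; the dictionary layout and function names are assumptions.

```python
# Rows are the actual class, columns the predicted class (unbalanced dataset).
confusion = {
    "Non-CPP": {"Non-CPP": 0, "CPP": 34},
    "CPP": {"Non-CPP": 1, "CPP": 110},
}

def overall_accuracy(cm):
    """Fraction of all examples that fall on the diagonal."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

def class_accuracy(cm, label):
    """Fraction of examples of one actual class that were classified correctly."""
    return cm[label][label] / sum(cm[label].values())

print(f"{overall_accuracy(confusion):.2%}")           # 75.86%
print(f"{class_accuracy(confusion, 'CPP'):.2%}")      # 99.10%
print(f"{class_accuracy(confusion, 'Non-CPP'):.2%}")  # 0.00%
```

The high overall accuracy hides the fact that no non-penetrating peptide is recognized, which is the behavior the balanced training sets are meant to correct.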
Classifier performance with different training regimes: performance from ten-fold cross-validation with training data sets.
| Metric | Unbalanced | Balanced with random negatives | Balanced with biological negatives | Balanced by sampling from known negatives | Balanced by sampling from known positives |
| Accuracy | 75.86% | 95.94% | 94.14% | 88.73% | 78.82% |
| True Positive Rate | 0.759 | 0.959 | 0.941 | 0.887 | 0.7883 |
| False Positive Rate | 0.768 | 0.041 | 0.059 | 0.113 | 0.2117 |
| ROC | 0.495 | 0.959 | 0.941 | 0.887 | 0.7883 |
* These values represent the averages for 10 datasets.
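The true and false positive rates in the Unbalanced column are consistent with class-weighted averages computed from the unbalanced confusion matrix (0 and 34 for actual non-CPPs, 1 and 110 for actual CPPs). That weighting is an assumption inferred from the numbers, not something the text states; the sketch below shows the arithmetic, and the function name is hypothetical.

```python
# Rows are the actual class, columns the predicted class (unbalanced dataset).
confusion = {
    "Non-CPP": {"Non-CPP": 0, "CPP": 34},
    "CPP": {"Non-CPP": 1, "CPP": 110},
}

def weighted_rates(cm):
    """Class-weighted true and false positive rates.

    Each class is treated as the 'positive' class in turn, and its rates
    are weighted by that class's share of all examples.
    """
    labels = list(cm)
    total = sum(sum(cm[a].values()) for a in labels)
    weighted_tpr = 0.0
    weighted_fpr = 0.0
    for c in labels:
        support = sum(cm[c].values())                      # actual members of c
        true_pos = cm[c][c]
        false_pos = sum(cm[a][c] for a in labels if a != c)
        others = total - support                           # actual non-members of c
        weight = support / total
        weighted_tpr += weight * (true_pos / support)
        weighted_fpr += weight * (false_pos / others)
    return weighted_tpr, weighted_fpr

tpr, fpr = weighted_rates(confusion)
print(round(tpr, 3), round(fpr, 3))  # 0.759 0.768
```

Under this weighting the true positive rate equals the overall accuracy, which explains why those two rows of the table carry the same values.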
Classifier performance of each classifier with original dataset.
| Metric | Unbalanced | Balanced with random negatives | Balanced with biological negatives | Balanced by sampling from known negatives |
| Accuracy | 75.86% | 80.69% | 79.31% | 91.70% |
| True Positive Rate | 0.759 | 0.807 | 0.793 | 0.917 |
| False Positive Rate | 0.768 | 0.508 | 0.553 | 0.127 |
| ROC | 0.495 | 0.649 | 0.620 | 0.895 |
Comparison of SVM based CPP classifiers to previously published methods.
| Metric | Hällbrink-2005 | Hansen-2008 | Dobchev-2010 | Unbalanced | Distribution-based | Biologically-based | Balanced by sampling Non-CPPs |
| Overall Accuracy | 77.27% | 67.44% | 83.16% | 75.86% | 80.69% | 79.31% | 91.72% |
| CPP Accuracy | 88.46% | 80.30% | 92.21% | 99.10% | 94.59% | 94.59% | 93.69% |
| Non-CPP Accuracy | 35.71% | 25.00% | 54.17% | 0.00% | 35.29% | 29.41% | 85.29% |
Features selected for datasets generated using approaches 1–4.
| Dataset 1(Balanced with random negative examples) | Dataset 2(Balanced with biological peptides assumed to be negative) | Dataset 3(Unbalanced dataset) | Dataset 4(Balanced by random sampling of known negatives with replacement) |
| Net Charge | Net Charge | Net Charge | Negative Charge |
| Positive Charge | Isoelectric Point | Positive Charge | Isoelectric Point |
| Number of serines (S) | Molecular Weight | Number of alanines (A) | Number of glycines (G) |
| Number of aspartates (D) | Hydropathicity | Number of arginines (R) | Number of alanines (A) |
| Percent valine (V) | Number of valines (V) | Percent arginines (R) | Number of tryptophans (W) |
| Percent proline (P) | Number of lysines (K) | Net Donated Hydrogen Bonds | Number of asparagines (N) |
| Percent phenylalanine (F) | Number of arginines (R) | Number of lysines (K) | |
| Percent threonine (T) | Percent glycine (G) | Number of histidines (H) | |
| Percent asparagine (N) | Percent methionine (M) | Number of aspartates (D) | |
| Percent tyrosine (Y) | Percent tyrosine (Y) | Percent phenylalanine (F) | |
| Percent cysteine (C) | Percent cysteine (C) | Percent tryptophan (W) | |
| Percent arginine (R) | Percent aspartate (D) | Percent arginine (R) | |
| Percent histidine (H) | Percent negative | Percent histidine (H) | |
| Percent aspartate (D) | Water Octanol Partition Coefficient | Percent Hydrophobic | |
| Percent negative | Net Donated Hydrogen Bonds | Percent negative | |
| Steric Bulk | Percent Helix | Hydrophobicity | |
| Net Donated Hydrogen Bonds | Percent Coil | Water Octanol Partition Coefficient | |
| Percent Helix | | | |
| Percent Coil | | | |
Features selected for ten datasets generated using approach 5.
| Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 | Dataset 7 | Dataset 8 | Dataset 9 | Dataset 10 |
| Number (V) | Length | Number (R) | Net Charge | Net Charge | Percent (T) | Net Charge | Positive Charge | Number (W) | Positive Charge |
| Percent (R) | Net Charge | Percent (W) | Negative Charge | Percent (I) | Percent (Y) | Positive Charge | Number (G) | Number (T) | Percent (I) |
| Number (V) | Percent positive | Number (I) | Hydrophobicity | Net Donated Hydrogen Bonds | Percent (I) | Number (S) | Number (R) | Amphiphilicity | |
| Number (C) | Amphiphilicity | Number (H) | Net Donated Hydrogen Bonds | Percent Sheet | Percent (W) | Percent (F) | Percent (S) | ||
| Percent (H) | Percent Helix | Percent (F) | Percent Hydrophobic | Percent (R) | Percent (T) | ||||
| Net Donated Hydrogen Bonds | Net Donated Hydrogen Bonds | Percent (H) | |||||||
| Amphiphilicity |
Balanced subsets of CPPs sampled with replacement combined with known-CPP analogs.
Figure 1. Cellular internalization microscopy array of FITC-labeled peptides.
Figure 2. Quantitative uptake analysis.
Known cell-penetrating peptides from the literature and commercial vendors.
| Cell-penetrating peptide | Reference |
| AAVALLPAVLLALLAKNNLKDCGLF | |
| AAVALLPAVLLALLAKNNLKECGLY | |
| AAVALLPAVLLALLAPVQRKQKLMP | |
| AAVALLPAVLLALLAVTDQLGEDFFAVDLEAFLQEFGLLPEKE | |
| AAVLLPVLLAAP | |
| AGYLLGKINLKALAALAKKIL | |
| AGYLLGKLKALAALAKKIL | |
| AHALCLTERQIKIWFQNRRMKWKKEN | |
| AHALCPPERQIKIWFQNRRMKWKKEN | |
| ALWKTLLKKVLKA | |
| AYALCLTERQIKIWFANRRMKWKKEN | |
| CGPGSDDEAAADAQHAAPPKKKRKVGY | |
| CNGRC | |
| CNGRCG | |
| CNGRCGGKKLKLLKLL | |
| CNGRCGGKLAKLAKLAKLAK | |
| CNGRCGGLVTT | |
| GAARVTSWLGRQLRIAGKRLEGRSK | |
| GALFLGFLGAAGSTMGAWSQPKSKRKV | |
| GGRQIKIWFQNRRMKWKK | |
| GIGKFLHSAKKWGKAFVGQIMNC | |
| GLAFLGFLGAAGSTMGAWSQPKSKRKV | |
| GRKKRRQ | |
| GRKKRRQRRPPQC | |
| GRKKRRQRRRC | |
| GRKKRRQRRRPPC | |
| GRKKRRQRRRPQ | |
| GRQLRIAGKRLEGRSK | |
| GWTLNPAGYLLGKINLKALAALAKKIL | |
| GWTLNPPGYLLGKINLKALAALAKKIL | |
| GWTLNSAGYLLGKINLKALAALAKKIL | |
| GWTLNSAGYLLGKINLKALAALAKKLL | |
| GWTLNSAGYLLGKLKALAALAKKIL | |
| GWTLNSKINLKALAALAKKIL | |
| INLKALAALAKKIL | |
| IWFQNRRMKWKK | |
| KALAALLKKWAKLLAALK | |
| KALAKALAKLWKALAKAA | |
| KALKKLLAKWAAAKALL | |
| KCRKKKRRQRRRKKLSECLKRIGDELDS | |
| KCRKKKRRQRRRKKPVVHLTLRQAGDDFSR | |
| KETWWETWWTEWSQPKKKRKV | |
| KETWWETWWTEWSQPKKRKV | |
| KFHTFPQTAIGVGAP | |
| KITLKLAIKAWKLALKAA | |
| KIWFQNRRMKWKK | |
| KLAAALLKKWKKLAAALL | |
| KLALKALKALKAALKLA | |
| KLALKLALKALKAALK | |
| KLALKLALKALQAALQLA | |
| KLALKLALKAWKAALKLA | |
| KLALQLALQALQAALQLA | |
| KMTRAQRRAAARRNRWTAR | |
| KRPAATKKAGQAKKKKL | |
| LGTYTQDFNKFHTFPQTAIGVGAP | |
| LIRLWSHLIHIWFQNRRLKWKKK | |
| LKTLATALTKLAKTLTTL | |
| LKTLTETLKELTKTLTEL | |
| LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTESC | |
| LLIILRARIRKQAHAHSK | |
| LLIILRRPIRKQAHAHSK | |
| LLIILRRRIRKQAHAHSA | |
| LLIILRRRIRKQAHAHSK | |
| LNSAGYLLGKINLKALAALAKKIL | |
| LNSAGYLLGKLKALAALAKIL | |
| MANLGYWLLALFVTMWTDVGLCKKRPKP | |
| MDAQTRRRERRAEKQAQWKAAN | |
| MGLGLHLLVLAAALQGAKKKRKV | |
| MPKKKPTPIQLNP | |
| MVKSKIGSWILVLFVAMWSDVGLCKKRPKP | |
| MVTVLFRRLRIRRACGPPRVRV | |
| NAKTRRHERRRKLAIER | |
| PKKKRKV | |
| PKKKRKVALWKTLLKKVLKA | |
| PMLKE | |
| QLALQLALQALQAALQLA | |
| RGGRLSSYSRRRFSTSTGR | |
| RGGRLSYSRRRFSTSTGR | |
| RGGRLSYSRRRFSTSTGRA | |
| RKKRRQRRR | |
| RKSSKPIMEKRRRAR | |
| RQARRNRRRALWKTLLKKVLKA | |
| RQGAARVTSWLGRQLRIAGKRLEGR | |
| RQGAARVTSWLGRQLRIAGKRLEGRSK | |
| RQIKIWFPNRRMKWKK | |
| RQIKIWFQNMRRKWKK | |
| RQIKIWFQNRRMKWKK | |
| RQIKIWFQNRRMKWKKLRKKKKKH | |
| RQIRIWFQNRRMRWRR | |
| RQPKIWFPNRRMPWKK | |
| RRLSSYSSRRRF | |
| RRMKWKK | |
| RRRRRRRRR | |
| RRWRRWWRRWWRRWRR | |
| RVIRVWFQNKRCKDKK | |
| RVTSWLGRQLRIAGKRLEGRSK | |
| SWLGRQLRIAGKRLEGRSK | |
| TAKTRYKARRAELIAERR | |
| TRQARRNRRRWRERQR | |
| TRRNKRNRIQEQLNRK | |
| TRSSRAGLQFPVGRVHRLLRK | |
| TRSSRAGLQWPVGRVHRLLRKGGC | |
| VPALR | |
| VPMLK | |
| VPTLK | |
| VQAILRRNWNQYKIQ | |
| VRLPPPVRLPPPVRLPPP | |
| WFQNRRMKWKK | |
| YGRKKRRQRRR | |
| YGRKKRRQRRRGTSSSSDELSWIIELLEK | |
| YGRKKRRQRRRSVYDFFVWL | |
Known non-penetrating cell-penetrating peptide analogs and peptide hormones.
| Non-cell-penetrating peptide | Reference |
| AGCKNFFWKTFTSC | |
| AHALCLTERQIKSNRRMKWKKEN | |
| CYFQNCPRG | |
| DFDMLRCMLGRVYRPCWQV | |
| EILLPNNYNAYESYKYPGMFIALSK | |
| FITKALGISYGRKKRRQC | |
| FVPIFTHSELQKIREKERNKGQ | |
| GRKKRRQPPQC | |
| GWTLNSAGYLLGKFLPLILRKIVTAL | |
| GWTLNSAGYLLGKINLKAPAALAKKIL | |
| GWTLNSAGYLLGPHAI | |
| GWTNLSAGYLLGPPPGFSPFR | |
| HDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGR | |
| IAARIKLRSRQHIKLRHL | |
| ILRRRIRKQAHAHSK | |
| KIWFQNRRMK | |
| KKKQYTSIHHGVVEVD | |
| KKLSECLKRIGDELDS | |
| KLALKALKAALKLA | |
| KLALKLALKALKAA | |
| LLGKINLKALAALAKKIL | |
| LLKTTALLKTTALLKTTA | |
| LLKTTELLKTTELLKTTE | |
| LNSAGYLLGKALAALAKKIL | |
| LNSAGYLLGKLKALAALAK | |
| LRKKKKKH | |
| PVVHLTLRQAGDDFSR | |
| QNLGNQWAVGHLM | |
| RPPGFSPFR | |
| RQIKIFFQNRRMKFKK | |
| RQIKIWFQNRRM | |
| RQIKIWFQNRRMKWK | |
| TERQIKIWFQNRRMK | |
| WSYGLRPG | |
A list of initial features used for classifier construction.
| Feature | Reference |
| Length of peptide | |
| Net charge of peptide | |
| Positive charge | |
| Negative charge | |
| Isoelectric point (pI) | |
| Molecular weight | |
| Hydropathicity | |
| Number of each amino acid (20 features) | |
| Percent composition of each amino acid (20 features) | |
| Percent polar amino acids | |
| Percent positive amino acids | |
| Percent negative amino acids | |
| Percent hydrophobic amino acids | |
| Hydrophobicity | |
| Lipophilicity | |
| Amphiphilicity | |
| Water-Octanol Partition Coefficient | |
| Steric Bulk | |
| Side chain bulk | |
| Net donated hydrogen bonds | |
| Percent α helix | |
| Percent random coil | |
| Percent β sheet | |
Peptides synthesized for experimental validation of the classifier.
| Name | Role | Sequence (N to C) |
| HIV-TAT | Control(+) | YGRKKRRQRRR-NH2 |
| Antennapedia | Control(+) | RQIKIWFQNRRMKWKK-NH2 |
| Pep-1 | Control(+) | KETWWETWWTEWSQPKKKRKV-NH2 |
| negative-1 | Control(-) | TCSSNCQTCPCSSNNCQ-NH2 |
| negative-2 | Control(-) | GLALLGIAVAILVVL-NH2 |
| negative-3 | Control(-) | PGNIQMMSVVSMSMTITN-NH2 |
| peptide-1 | Predicted CPP | FKIYDKKVRTRVVKH-NH2 |
| peptide-2 | Predicted CPP | RASKRDGSWVKKLHRILE-NH2 |
| peptide-3 | Predicted CPP | KGTYKKKLMRIPLKGT-NH2 |
| peptide-4 | Predicted CPP | LYKKGPAKKGRPPLRGWFH-NH2 |
| peptide-5 | Predicted Non-CPP | FFSLPPVTQDWNSD-NH2 |
| peptide-6 | Predicted Non-CPP | HSPIIPLGTRFVCHGVT-NH2 |
| TP13 | Known Non-CPP (CPP Analog) | LNSAGYLLGKALAALAKKIL-NH2 |
* negative-2 could not be synthesized to the desired purity because of insolubility.