Sudheer Gupta, Pallavi Kapoor, Kumardeep Chaudhary, Ankur Gautam, Rahul Kumar, Gajendra P S Raghava.
Abstract
BACKGROUND: Over the past few decades, scientific research has focused on developing peptide/protein-based therapies to treat various diseases. With several advantages over small molecules, including high specificity, high penetration, and ease of manufacturing, peptides have emerged as promising therapeutic molecules against many diseases. However, one of the bottlenecks in peptide/protein-based therapy is toxicity. Therefore, in the present study, we developed in silico models for predicting the toxicity of peptides and proteins. DESCRIPTION: We obtained toxic peptides having 35 or fewer residues from various databases for developing prediction models. Non-toxic or random peptides were obtained from SwissProt and TrEMBL. We observed that certain residues, such as Cys, His, Asn, and Pro, are abundant as well as preferred at various positions in toxic peptides. We developed models based on machine learning techniques and a quantitative matrix, using various properties of peptides, for predicting peptide toxicity. The dipeptide-based model achieved an accuracy of 94.50% with an MCC of 0.88. In addition, various motifs were extracted from the toxic peptides, and this information was combined with the dipeptide-based model to develop a hybrid model. To check for over-optimization of the best model, which is based on dipeptide composition, we evaluated its performance on independent datasets and achieved an accuracy of around 90%. Based on the above study, a web server, ToxinPred, has been developed, which would be helpful in predicting (i) toxicity or non-toxicity of peptides, (ii) minimum mutations in peptides for increasing or decreasing their toxicity, and (iii) toxic regions in proteins.
Year: 2013 PMID: 24058508 PMCID: PMC3772798 DOI: 10.1371/journal.pone.0073957
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of dataset creation.
Figure 2. Comparison of average amino acid composition between various classes of therapeutic peptides.
Figure 3. Comparison of average amino acid composition of preferred residues between toxic and non-toxic peptides.
Figure 4. Sequence logos of (A) the first ten residues of the N-terminus and (B) the last ten residues of the C-terminus of toxic peptides, where the size of each residue is proportional to its propensity (main dataset).
The performance of SVM-based models developed on the main dataset using various types of composition, such as residue, dipeptide, and terminal-residue composition.
| Features | Parameters | Threshold | Sensitivity | Specificity | Accuracy | MCC | AUC |
| | t:2 g:0.005 c:5 j:1 | −0.4 | 92.91 | 94.43 | 93.92 | 0.87 | 0.97 |
| | t:2 g:0.001 c:0.5 j:3 | −0.3 | 83.74 | 83.67 | 83.69 | 0.65 | 0.88 |
| | t:2 g:0.005 c:10 j:3 | −0.3 | 88.66 | 91.73 | 90.69 | 0.80 | 0.94 |
| | t:2 g:0.001 c:0.5 j:4 | −0.1 | 81.76 | 81.47 | 81.55 | 0.59 | 0.88 |
| | t:2 g:0.005 c:1 j:3 | −0.4 | 90.41 | 86.94 | 88.11 | 0.75 | 0.94 |
| | t:2 g:0.001 c:5 j:1 | −0.4 | 93.80 | 94.85 | 94.50 | 0.88 | 0.98 |
AAC, amino acid composition; DPC, dipeptide composition; C5AAC, amino acid composition of last five C-terminal residues; C10AAC, amino acid composition of last ten C-terminal residues; N5AAC, amino acid composition of first five N-terminal residues; N10AAC, amino acid composition of first ten N-terminal residues; MCC, Matthews correlation coefficient; AUC, area under the curve.
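The composition features named in the footnote above can be illustrated with a minimal Python sketch. It assumes the usual definitions: AAC is the percentage of each of the 20 standard residues, DPC is the percentage of each of the 400 ordered residue pairs among overlapping dipeptides, and the terminal variants apply AAC to the first or last k residues. The function names and the percentage normalization are assumptions for illustration, not the authors' code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the percentage of each standard residue in seq (20 features)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: 100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the percentage of each ordered residue pair (400 features)."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping dipeptides
    if total < 1:
        raise ValueError("sequence needs at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return {pair: 100.0 * c / total for pair, c in counts.items()}


def terminal_composition(seq, k, end="N"):
    """AAC of the first k (end='N') or last k (end='C') residues."""
    if k < 1 or len(seq) < k:
        raise ValueError("k must be between 1 and the sequence length")
    window = seq[:k] if end == "N" else seq[-k:]
    return amino_acid_composition(window)
```

For example, `terminal_composition(peptide, 5, end="C")` would correspond to the C5AAC feature under these assumptions, and the resulting 20- or 400-value vectors are what a classifier such as an SVM would take as input.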
The performance of binary profile-based models developed on the main dataset.
| Feature | Parameters | Threshold | Sensitivity | Specificity | Accuracy | MCC | AUC |
| CT5 | t:2 g:0.5 c:1 j:1 | −0.5 | 84.67 | 86.8 | 86.09 | 0.70 | 0.90 |
| CT10 | t:2 g:0.1 c:5 j:1 | −0.3 | 91.50 | 91.81 | 91.70 | 0.82 | 0.96 |
| NT5 | t:2 g:0.5 c:5 j:2 | −0.4 | 84.32 | 87.79 | 86.78 | 0.70 | 0.91 |
| NT10 | t:2 g:0.1 c:5 j:5 | −0.3 | 91.13 | 91.89 | 91.63 | 0.82 | 0.96 |
MCC, Matthews correlation coefficient; AUC, area under the curve.
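A binary profile is commonly built by one-hot encoding each residue of a fixed-length terminal window, so that a window of k residues yields a vector of 20 × k values. The sketch below assumes that convention and assumes that CT5/CT10 and NT5/NT10 denote the last or first 5 or 10 residues; the function name `binary_profile` is hypothetical and this section does not confirm the exact encoding used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def binary_profile(seq, k, end="N"):
    """One-hot encode the first k (end='N') or last k (end='C') residues.

    Returns a flat list of 20 * k zeros and ones. Non-standard residues
    are encoded as an all-zero block (an assumption of this sketch).
    """
    seq = seq.upper()
    if k < 1 or len(seq) < k:
        raise ValueError("k must be between 1 and the sequence length")
    window = seq[:k] if end == "N" else seq[-k:]
    vector = []
    for residue in window:
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            one_hot[index] = 1
        vector.extend(one_hot)
    return vector
```

Unlike composition features, this encoding preserves residue order within the window, which is why it requires peptides of at least k residues.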
The performance of motif-based model developed on the main dataset.
| E-value | PCP | %Coverage |
| | 40.56 | 93.54 |
| | 48.27 | 90.08 |
| | 59.11 | 86.28 |
| | 69.31 | 82.31 |
| | 78.07 | 78.29 |
| | 85.16 | 74.83 |
| | 89.15 | 71.77 |
| | 92.12 | 68.25 |
| | 93.40 | 64.17 |
PCP, probability of correct prediction.
The performance of the model developed using motifs and dipeptide composition on the main dataset.
| E-value | Sensitivity | Specificity | Accuracy | MCC | AUC |
| | 99.39 | 97.91 | 98.41 | 0.96 | 0.99 |
| | 98.89 | 97.91 | 98.24 | 0.96 | 0.99 |
| | 98.39 | 97.91 | 98.07 | 0.96 | 0.99 |
| | 97.78 | 97.91 | 97.87 | 0.95 | 0.99 |
| | 97.17 | 97.91 | 97.67 | 0.95 | 0.99 |
| | 96.84 | 97.91 | 97.55 | 0.95 | 0.99 |
| | 96.62 | 97.91 | 97.48 | 0.94 | 0.99 |
| | 96.29 | 97.91 | 97.37 | 0.94 | 0.99 |
| | 95.84 | 97.91 | 97.22 | 0.94 | 0.99 |
MCC, Matthews correlation coefficient; AUC, area under the curve.
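The threshold-dependent columns in these tables (sensitivity, specificity, accuracy, MCC) follow from the four confusion-matrix counts. The sketch below uses the standard definitions, with toxic peptides as the positive class; the formulas are not spelled out in this section, and the function name and the example counts in the usage note are illustrative only.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy (all in %) and MCC.

    tp/fn count toxic peptides predicted toxic/non-toxic;
    tn/fp count non-toxic peptides predicted non-toxic/toxic.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

With made-up counts `tp=90, tn=80, fp=20, fn=10`, this gives 90% sensitivity, 80% specificity, 85% accuracy, and an MCC of about 0.70. AUC is not covered here because it is computed across all thresholds rather than at a single one.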
The performance of the quantitative matrix-based method on various datasets.
| Matrix | Threshold | Sensitivity | Specificity | Accuracy | MCC | AUC |
| | 20 | 80.46 | 92.09 | 88.00 | 0.73 | 0.92 |
| | 20 | 75.43 | 98.98 | 95.81 | 0.81 | 0.97 |
| | 5 | 74.10 | 98.07 | 89.65 | 0.77 | 0.95 |
| | 5 | 73.29 | 99.28 | 95.78 | 0.81 | 0.98 |
MCC, Matthews correlation coefficient; AUC, area under the curve.
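A quantitative matrix is typically a position-specific table of residue weights: a peptide is scored by summing the weight of the residue found at each position, and the score is compared with a threshold such as those in the table above. The sketch below assumes that scheme. The matrix values are invented toy numbers, the function names are hypothetical, and the way the actual matrix was derived is not described in this section.

```python
def qm_score(seq, matrix):
    """Sum position-specific residue weights over the scored positions.

    matrix is a list with one dict per position, mapping residue -> weight.
    Residues absent from a position's dict contribute 0 (an assumption),
    and positions beyond the matrix length are ignored.
    """
    seq = seq.upper()
    return sum(
        matrix[i].get(residue, 0.0)
        for i, residue in enumerate(seq[:len(matrix)])
    )


def predict_toxic(seq, matrix, threshold):
    """Label a peptide toxic when its matrix score reaches the threshold."""
    return qm_score(seq, matrix) >= threshold


# Toy three-position matrix with invented weights, for illustration only.
TOY_MATRIX = [
    {"C": 12.0, "A": -3.0},
    {"C": 10.0, "G": -5.0},
    {"H": 4.0},
]
```

With the toy matrix and a threshold of 20, `predict_toxic("CCH", TOY_MATRIX, 20)` scores 26 and returns `True`, while `"AGH"` scores −4 and returns `False`. Raising the threshold trades sensitivity for specificity, which is the pattern the table columns report.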
Figure 5. Maximum and minimum scoring residues at every position, as observed in the quantitative matrix (main dataset).
Figure 6. ROC curves of support vector machine models based on (A) amino acid composition, (B) dipeptide composition, and (C) hybrid approach.
Figure 7. Schematic representation of the ToxinPred web server.