| Literature DB >> 31492094 |
Chuanyan Wu1,2, Rui Gao3, Yusen Zhang4, Yang De Marinis2.
Abstract
Background: In the search for therapeutic peptides for disease treatment, many efforts have been made to identify functional peptides from large peptide sequence databases. In this paper, we propose an effective computational model that uses deep learning and word2vec to predict therapeutic peptides (PTPD).
Results: Representation vectors of all k-mers were obtained through word2vec based on k-mer co-occurrence information. The original peptide sequences were then divided into k-mers using the windowing method, and each sequence was mapped to the input layer via the embedding vectors obtained by word2vec. Three types of filters in the convolutional layers, together with dropout and max-pooling operations, were applied to construct feature maps. These feature maps were concatenated into a fully connected dense layer, and rectified linear units (ReLU) and dropout operations were included to avoid over-fitting. The classification probabilities were generated by a sigmoid function. PTPD was then validated on two datasets: an independent anticancer peptide dataset and a virulent protein dataset, on which it achieved accuracies of 96% and 94%, respectively.
Conclusions: PTPD identifies novel therapeutic peptides efficiently and is suitable as a useful tool for therapeutic peptide design.
Keywords: Deep learning; Therapeutic peptide; Word2vec
Year: 2019 PMID: 31492094 PMCID: PMC6728961 DOI: 10.1186/s12859-019-3006-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
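The abstract's windowing step (splitting each peptide into overlapping k-mers before embedding them with word2vec) can be sketched in plain Python. This is a minimal illustration, not the authors' code; the choice of k = 3 and a stride of 1 are assumptions.

```python
def kmers(sequence, k=3, stride=1):
    """Split a peptide sequence into overlapping k-mers with a sliding window."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# Example: a short (hypothetical) peptide split into 3-mers
print(kmers("ALWKTML"))  # ['ALW', 'LWK', 'WKT', 'KTM', 'TML']
```

Each resulting k-mer is then treated as a "word" whose embedding vector is learned by word2vec from co-occurrence across the corpus of peptide sequences.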
Fig. 1 Flowchart of PTPD
Fig. 2 Skip-gram model structure
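The skip-gram model of Fig. 2 is trained on (center, context) pairs drawn from a window around each token. Generating those pairs from a k-mer list can be sketched as follows; the window size is an illustrative assumption, not a value from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# k-mers from a short (hypothetical) peptide, window of 1
print(skipgram_pairs(["ALW", "LWK", "WKT"], window=1))
# [('ALW', 'LWK'), ('LWK', 'ALW'), ('LWK', 'WKT'), ('WKT', 'LWK')]
```

The model then learns to predict the context k-mer from the center k-mer, and the learned input weights serve as the k-mer embedding vectors.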
Performance of PTPD on the ACP dataset
| Dataset | Sn(%) | Sp(%) | Acc(%) | MCC | AUC |
|---|---|---|---|---|---|
| ACP main dataset | 99.90 | 86.60 | 98.50 | 0.92 | 0.99 |
| ACP alternative dataset | 96.20 | 86.70 | 94.80 | 0.80 | 0.97 |
| ACP balanced dataset 1 | 100 | 86.20 | 93.10 | 0.87 | 0.99 |
| ACP balanced dataset 2 | 94.20 | 86.20 | 90.20 | 0.81 | 0.97 |
| HC dataset | 100 | 83.00 | 94.00 | 0.87 | 0.99 |
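The reported metrics (Sn, Sp, Acc, MCC) follow the standard confusion-matrix definitions; a small helper makes the relationship explicit. The counts in the example are illustrative, not taken from the paper.

```python
import math

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthew's correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Illustrative counts for a 200-sample test set
sn, sp, acc, mcc = metrics(tp=95, tn=90, fp=10, fn=5)
print(sn, sp, acc, mcc)
```

AUC, the remaining column, is computed from the ranking of predicted probabilities (area under the ROC curve) rather than from a single confusion matrix.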
Performance of PTPD on the virulent protein dataset
| Dataset | Sn(%) | Sp(%) | Acc(%) | MCC | AUC |
|---|---|---|---|---|---|
| SPAAN adhesins dataset | 95.60 | 73.30 | 88.20 | 0.70 | 0.94 |
| Neurotoxins dataset | 98.00 | 94.00 | 96.00 | 0.92 | 0.93 |
Comparison of PTPD with state-of-the-art methods on the HC dataset
| Method | Sn(%) | Sp(%) | Acc(%) | MCC | AUC |
|---|---|---|---|---|---|
| PTPD | 100 | 83.00 | 94.00 | 0.87 | 0.99 |
| mACPpred | 97.00 | 77.00 | 85.00 | 0.72 | 0.96 |
| MLACP (SVM) | 85.00 | 91.00 | 90.00 | 0.73 | 0.95 |
| MLACP (RF) | 98.00 | 98.00 | 98.00 | 0.95 | 1.00 |
| AntiCP (Model 1) | 98.00 | 5.00 | 40.00 | 0.06 | 0.75 |
| AntiCP (Model 2) | 82.00 | 90.00 | 87.00 | 0.72 | 0.95 |
Fig. 3 Comparison of different methods on the HC dataset. a Sn, Sp and Acc of different methods. b MCC and AUC of different methods. Sn: the sensitivity; Sp: the specificity; Acc: the prediction accuracy; MCC: Matthew's correlation coefficient; AUC: the area under the receiver-operating characteristic curve
Comparison of PTPD with state-of-the-art methods on the Neurotoxins dataset
| Method | Sn(%) | Sp(%) | Acc(%) | MCC | AUC |
|---|---|---|---|---|---|
| PTPD | 98.00 | 94.00 | 96.00 | 0.92 | 0.93 |
| q-FP | 99.03 | 98.00 | 98.40 | 0.94 | 1.00 |
| VirulentPred | 96.00 | 16.00 | 56.00 | - | - |
| NTX-pred (FNN) | 89.65 | 78.78 | 84.19 | 0.69 | - |
| NTX-pred (RNN) | 89.12 | 96.35 | 92.75 | 0.86 | - |
| NTX-pred (SVM) | 96.32 | 97.22 | 97.72 | 0.94 | - |
| AS | 92.00 | 100 | 96.00 | 0.92 | 0.99 |
| 2Gram | 100 | 90.91 | 95.00 | 0.91 | 1.00 |
Fig. 4 Comparison of different methods on the neurotoxin virulent proteins dataset. a Sn, Sp and Acc of different methods. b MCC and AUC of different methods. Sn: the sensitivity; Sp: the specificity; Acc: the prediction accuracy; MCC: Matthew's correlation coefficient; AUC: the area under the receiver-operating characteristic curve
Fig. 5 Performances under different learning rates: a accuracy under different learning rates; b loss under different learning rates
Parameter setting
| Parameters | Value |
|---|---|
| Number of kernels | 150, 150, 150 |
| Filter size | 3, 4, 5 |
| | 100 |
| Batch size | 100 |
| Epoch | 20 |
| Learning rate | 0.0001 |
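Under the parameter setting above (150 kernels for each of the filter sizes 3, 4 and 5, followed by max-pooling over time), the concatenated feature vector fed to the dense layer has a fixed length regardless of sequence length. A quick dimension check, sketched from the abstract's description rather than the authors' code (the padded sequence length of 50 is an illustrative assumption):

```python
def conv_output_len(seq_len, filter_size, stride=1):
    """Number of positions produced by a 'valid' 1-D convolution."""
    return (seq_len - filter_size) // stride + 1

filter_sizes = [3, 4, 5]
n_kernels = 150
seq_len = 50  # illustrative padded k-mer sequence length

for h in filter_sizes:
    print(f"filter size {h}: {conv_output_len(seq_len, h)} positions x {n_kernels} maps")

# Max-pooling over time keeps one value per feature map, so the
# concatenated vector length is independent of seq_len:
print(n_kernels * len(filter_sizes))  # 450
```

This is why the three filter branches can be concatenated into a single fully connected layer even though their convolution outputs have different lengths.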