Abstract
One of the emerging approaches to combating the SARS-CoV-2 virus is to design accurate and efficient drugs, such as inhibitors against the viral protease, to stop the viral spread. In addition to laboratory investigation of the viral protease, which is fundamental, in silico research on the viral protease, such as protease cleavage site prediction, is critically important and urgent. However, this problem has yet to be addressed. This article investigates the problem for the first time using pattern recognition approaches. It shows that pattern recognition approaches incorporating a kernel function specially tailored to amino acids perform outstandingly in both the accuracy of cleavage site prediction and the discovery of prototype cleavage peptides.
Keywords: SARS-CoV-2 main protease; kernel function; machine learning; pattern recognition; viral cleavage site
Year: 2021 PMID: 34739145 PMCID: PMC8661936 DOI: 10.1002/prot.26274
Source DB: PubMed Journal: Proteins ISSN: 0887-3585
The cleaved peptides, the proteins that contain them, and the cleavage sites
| Peptides | Protein and sites |
|---|---|
| | R1AB_SARS2#3263,R1A_SARS2#3263,R1AB_SARS#3240,R1A_SARS#3240 |
| | R1AB_SARS2#3569,R1A_SARS2#3569 |
| | R1AB_SARS2#3859,R1A_SARS2#3859,R1AB_SARS#3836,R1A_SARS#3836,R1AB_BC279#3842,R1AB_BCRP3#3834,R1A_BC279#3842 |
| | R1AB_SARS2#3942,R1A_SARS2#3942,R1AB_SARS#3919,R1A_SARS#3919,R1AB_BC279#3925,R1AB_BCRP3#3917,R1A_BC279#3925 |
| | R1AB_SARS2#4140,R1A_SARS2#4140,R1AB_SARS#4117,R1A_SARS#4117,R1AB_BC279#4123,R1AB_BCRP3#4115,R1A_BC279#4123 |
| | R1AB_SARS2#4253,R1A_SARS2#4253,R1AB_SARS#4230,R1A_SARS#4230,R1AB_BC279#4236,R1AB_BCRP3#4228,R1A_BC279#4236 |
| | R1AB_SARS2#4392,R1A_SARS2#4392 |
| | R1AB_SARS2#5324,R1AB_SARS#5301,R1AB_BC279#5307,R1AB_BCRP3#5299 |
| | R1AB_SARS2#5925,R1AB_SARS#5902,R1AB_BC279#5908,R1AB_BCRP3#5900 |
| | R1AB_SARS2#6452,R1AB_SARS#6429,R1AB_CVMA5#6503,R1AB_CVMJH#6507,R1AB_CVM2#6451,R1AB_BC279#6435,R1AB_BCRP3#6427 |
| | R1AB_SARS2#6798 |
| | R1AB_SARS#3546,R1A_SARS#3546,R1AB_BC279#3552,R1A_BC279#3552 |
| | R1AB_SARS#4369,R1A_SARS#4369 |
| | R1AB_SARS#6775,R1AB_BC279#6781,R1AB_BCRP3#6773 |
| | R1AB_CVMA5#3333,R1AB_CVBQ#3246,R1AB_CVBLU#3246,R1AB_CVMJH#3336,R1AB_CVM2#3279,R1A_CVMA5#3333,R1A_CVMJH#3336,R1A_CVHOC#3246,R1A_CVHN5#3284,R1A_CVHN2#3304,R1A_CVHN1#3334,R1A_CVBM#3246 |
| | R1AB_CVMA5#3635,R1AB_CVMJH#3639,R1AB_CVM2#3582,R1A_CVMA5#3635,R1A_CVMJH#3639 |
| | R1AB_CVMA5#3921,R1AB_CVMJH#3927,R1AB_CVM2#3869,R1A_CVMA5#3921,R1A_CVMJH#3927 |
| | R1AB_CVMA5#4013,R1AB_CVMJH#4019,R1AB_CVM2#3961,R1A_CVMA5#4013,R1A_CVMJH#4019,R1A_CVHN5#3966,R1A_CVHN2#3986,R1A_CVHN1#4016 |
| | R1AB_CVMA5#4207,R1AB_CVMJH#4213,R1A_CVMA5#4207,R1A_CVMJH#4213 |
| | R1AB_CVMA5#4317,R1AB_CVBQ#4232,R1AB_CVBLU#4232,R1AB_CVMJH#4323,R1AB_CVM2#4265,R1A_CVMA5#4317,R1A_CVMJH#4323,R1A_CVHOC#4232,R1A_CVBM#4232 |
| | R1AB_CVMA5#4454,R1AB_CVMJH#4460,R1AB_CVM2#4402,R1A_CVMA5#4454,R1A_CVMJH#4460 |
| | R1AB_CVMA5#5382 |
| | R1AB_CVMA5#5982,R1AB_CVMJH#5988,R1AB_CVM2#5930 |
| | R1AB_CVMA5#6877,R1AB_CVMJH#6881,R1AB_CVM2#6825 |
| | R1AB_CVPPU#2878,R1A_CVPPU#2878 |
| | R1AB_CVPPU#3180,R1A_CVPPU#3180 |
| | R1AB_CVPPU#3474,R1A_CVPPU#3474 |
| | R1AB_CVPPU#3557,R1A_CVPPU#3557 |
| | R1AB_CVPPU#3752,R1A_CVPPU#3752 |
| | R1AB_CVPPU#3863,R1A_CVPPU#3863 |
| | R1AB_CVPPU#3998,R1A_CVPPU#3998 |
| | R1AB_CVPPU#4927 |
| | R1AB_CVPPU#5526 |
| | R1AB_CVPPU#6045 |
| | R1AB_CVPPU#6384 |
| | R1AB_CVH22#2965,R1A_CVH22#2965 |
| | R1AB_CVH22#3267,R1A_CVH22#3267 |
| | R1AB_CVH22#3546,R1A_CVH22#3546 |
| | R1AB_CVH22#3629,R1A_CVH22#3629 |
| | R1AB_CVH22#3824,R1A_CVH22#3824,R1A_CVHNL#3799 |
| | R1AB_CVH22#3933,R1A_CVH22#3933,R1A_BC512#3976,R1A_PEDV7#3965 |
| | R1AB_CVH22#4068,R1A_CVH22#4068 |
| | R1AB_CVH22#4995 |
| | R1AB_CVH22#5592 |
| | R1AB_CVH22#6110 |
| | R1AB_CVH22#6458 |
| | R1AB_CVBQ#3549,R1AB_CVBLU#3549,R1A_CVHOC#3549,R1A_CVBM#3549 |
| | R1AB_CVBQ#3836,R1AB_CVBLU#3836,R1A_CVHOC#3836,R1A_CVBM#3836 |
| | R1AB_CVBQ#3925,R1AB_CVBLU#3925,R1A_CVHOC#3925,R1A_CVBM#3925 |
| | R1AB_CVBQ#4122,R1AB_CVBLU#4122,R1A_CVHOC#4122 |
| | R1AB_CVBQ#4369,R1AB_CVBLU#4369,R1A_CVHOC#4369,R1A_CVBM#4369 |
| | R1AB_CVBQ#5297,R1AB_CVBLU#5297,R1AB_CVMJH#5388,R1AB_CVM2#5330 |
| | R1AB_CVBQ#5900,R1AB_CVBLU#5900 |
| | R1AB_CVBQ#6421,R1AB_CVBLU#6421 |
| | R1AB_CVBQ#6795,R1AB_CVBLU#6795 |
| | R1AB_CVM2#4155 |
| | R1AB_BC279#4375,R1AB_BCRP3#4367,R1A_BC279#4375 |
| | R1AB_BCRP3#3544 |
| | R1AB_IBVM#2781 |
| | R1AB_IBVM#3088,R1AB_IBVBC#3086 |
| | R1AB_IBVM#3381 |
| | R1AB_IBVM#3464,R1AB_IBVBC#3462 |
| | R1AB_IBVM#3674 |
| | R1AB_IBVM#3785,R1AB_IBVBC#3783 |
| | R1AB_IBVM#3930 |
| | R1AB_IBVM#4870,R1AB_IBVBC#4868 |
| | R1AB_IBVM#5470 |
| | R1AB_IBVM#5991,R1AB_IBVBC#5989 |
| | R1AB_IBVM#6329,R1AB_IBVBC#6327 |
| | R1AB_IBVBC#2779 |
| | R1AB_IBVBC#3379 |
| | R1AB_IBVBC#3672 |
| | R1AB_IBVBC#3928 |
| | R1AB_IBVBC#5468 |
| | R1A_CVHN5#3587,R1A_CVHN2#3607,R1A_CVHN1#3637 |
| | R1A_CVHN5#3874,R1A_CVHN2#3894,R1A_CVHN1#3924 |
| | R1A_CVHN5#4160,R1A_CVHN2#4180,R1A_CVHN1#4210 |
| | R1A_CVHN5#4270,R1A_CVHN2#4290,R1A_CVHN1#4320 |
| | R1A_CVHN5#4407,R1A_CVHN1#4457 |
| | R1A_CVHN2#4427 |
| | R1A_BCHK9#3103 |
| | R1A_BCHK9#3409 |
| | R1A_BCHK9#3699 |
| | R1A_BCHK9#3782 |
| | R1A_BCHK9#3982 |
| | R1A_BCHK9#4094 |
| | R1A_BCHK9#4233 |
| | R1A_CVHNL#2939 |
| | R1A_CVHNL#3242 |
| | R1A_CVHNL#3521 |
| | R1A_CVHNL#3604 |
| | R1A_CVHNL#3908 |
| | R1A_CVHNL#4043 |
| | R1A_BCHK4#3291,R1A_BC133#3298,R1A_BCHK5#3338 |
| | R1A_BCHK4#3597,R1A_BC133#3604,R1A_BCHK5#3644 |
| | R1A_BCHK4#3889,R1A_BC133#3896 |
| | R1A_BCHK4#3972,R1A_BC133#3979 |
| | R1A_BCHK4#4171,R1A_BC133#4178 |
| | R1A_BCHK4#4281,R1A_BC133#4288 |
| | R1A_BCHK4#4420,R1A_BC133#4427 |
| | R1A_CVBM#4122 |
| | R1A_BCHK5#3936 |
| | R1A_BCHK5#4019 |
| | R1A_BCHK5#4218 |
| | R1A_BCHK5#4328 |
| | R1A_BCHK5#4467 |
| | R1A_BC512#3012,R1A_PEDV7#2997 |
| | R1A_BC512#3314 |
| | R1A_BC512#3590,R1A_PEDV7#3579 |
| | R1A_BC512#3673 |
| | R1A_BC512#3868 |
| | R1A_BC512#4111 |
| | R1A_PEDV7#3299 |
| | R1A_PEDV7#3662 |
| | R1A_PEDV7#3857 |
| | R1A_PEDV7#4100 |
Note: The # character separates a protein name from a cleavage site. Multiple protein sequences may contain an identical peptide. For instance, the peptide SAVLQSGFRK was found in four protein sequences (R1AB_SARS2, R1A_SARS2, R1AB_SARS, R1A_SARS).
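The `protein#site` notation described in the note can be read programmatically. The following is a minimal sketch, assuming each table cell is a comma-separated list of such entries; the helper name `parse_sites` is hypothetical and not part of the original study.

```python
from collections import defaultdict


def parse_sites(cell: str) -> list[tuple[str, int]]:
    """Split a 'Protein and sites' cell into (protein, site) pairs.

    Hypothetical helper: assumes entries are comma-separated and that
    '#' separates the protein name from the cleavage site position.
    """
    pairs = []
    for entry in cell.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas or trailing whitespace
        protein, site = entry.split("#")
        pairs.append((protein, int(site)))
    return pairs


# First row of the table: one peptide shared by four protein sequences.
cell = "R1AB_SARS2#3263,R1A_SARS2#3263,R1AB_SARS#3240,R1A_SARS#3240"
pairs = parse_sites(cell)

# Group the cleavage sites by protein for later lookup.
sites_by_protein = defaultdict(list)
for protein, site in pairs:
    sites_by_protein[protein].append(site)

print(pairs)
```

Grouping by protein makes it straightforward to collect all cleavage sites of one sequence, such as R1AB_SARS2, across the rows of the table.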
FIGURE 1. The sequence logo of 116 cleaved peptides, where the integers from 1 to 9 represent the residues, in order. Note that the glutamine (Q) has been omitted.
The peptide data used for this study
| | 9‐mer raw | 9‐mer reduced | 5‐mer reduced |
|---|---|---|---|
| Cleaved | 273 | 116 | 87 |
| Non‐cleaved | 273 | 259 | 256 |
| Blind | 5071 | 2360 | 2061 |
Note: “Raw” stands for the number of all the peptides and “Reduced” stands for the number of non‐redundant peptides.
FIGURE 2. A naïve description of the kernel function approach. The left panel shows the original data space coordinated by x and y, in which four data points are labeled by A, B, α, and β. A and B belong to one class while α and β belong to the other class. They are nonlinearly separable because it is impossible to separate these two classes using one straight line. Suppose α and β are selected as the kernels. The distances between four data points and two kernels are calculated. The right panel shows the distribution of four data points based on four sets of distances using the kernel function. In this new space, two coordinates are no longer x and y, but α and β. It can be seen that this new space of four data points becomes linearly separable.
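The idea in this caption can be reproduced numerically. The sketch below is only an illustration: the coordinates of A, B, α, and β are hypothetical (an XOR-like layout chosen so that no straight line separates the classes), the kernel function is taken to be plain Euclidean distance as the caption suggests, and the threshold of 1.7 is hand-picked.

```python
import math

# Hypothetical coordinates in the original (x, y) space.
points = {
    "A": (0.0, 0.0),
    "B": (1.0, 1.0),
    "alpha": (0.0, 1.0),
    "beta": (1.0, 0.0),
}
labels = {"A": 0, "B": 0, "alpha": 1, "beta": 1}

# alpha and beta are selected as the kernels.
kernels = [points["alpha"], points["beta"]]


def kernel_features(p, kernels):
    """Map a point to its distances from each kernel."""
    return tuple(math.dist(p, k) for k in kernels)


# New space: coordinates are (distance to alpha, distance to beta).
mapped = {name: kernel_features(p, kernels) for name, p in points.items()}


def linear_rule(features, threshold=1.7):
    """A single straight line in the new space: d_alpha + d_beta = threshold."""
    return 1 if features[0] + features[1] < threshold else 0


predictions = {name: linear_rule(f) for name, f in mapped.items()}

for name, f in mapped.items():
    print(name, f, predictions[name])
```

In the mapped space A and B both land on (1, 1), while α and β land near the axes, so one straight line separates the two classes.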
FIGURE 3. The bio‐SOM map of 225 neurons constructed for the 9‐mer peptides data. “N” stands for the uncleavable peptides and “C” stands for the cleaved peptides. One circle stands for one neuron or one cell. The printed letter in a cell, which is either N or C, stands for a peptide, which has been mapped to the cell. For instance, two cleaved peptides were mapped to the first cell at the bottom row while one uncleavable peptide was mapped to the third cell at the bottom row. These two cells were pure for one class. However, the second cell at the top row contained two cleaved peptides and one uncleavable peptide. Thus, this cell was not pure for one class.
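The notion of a pure cell can be made concrete with a small calculation. The sketch below uses only the three cells named in the caption; the cell keys and the purity measure (fraction of the majority class in a cell) are assumptions made for illustration, not definitions taken from the study.

```python
from collections import Counter

# Labels of the peptides mapped to the three cells described in the caption.
# The keys are hypothetical names for those cells.
cells = {
    "bottom-1": ["C", "C"],    # two cleaved peptides
    "bottom-3": ["N"],         # one uncleavable peptide
    "top-2": ["C", "C", "N"],  # mixed cell
}


def purity(cell_labels):
    """Fraction of peptides in a cell that belong to its majority class."""
    counts = Counter(cell_labels)
    return max(counts.values()) / len(cell_labels)


purities = {name: purity(cell_labels) for name, cell_labels in cells.items()}
pure_cells = [name for name, p in purities.items() if p == 1.0]

print(purities)
print(pure_cells)
```

A cell with purity 1.0 holds peptides of a single class; the mixed cell at the top row has purity 2/3 under this measure.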
The model performance. “Type I” stands for the Type I error rate
| Models | 9‐mer AUC | 9‐mer MCC | 9‐mer Type I | 9‐mer Total | 5‐mer AUC | 5‐mer MCC | 5‐mer Type I | 5‐mer Total |
|---|---|---|---|---|---|---|---|---|
| NO + C5.0 | 0.9636 | 0.8311 | 6% | 92.27% | 0.9458 | 0.6525 | 5% | 93.00% |
| NO + FOREST | 0.9940 | 0.9385 | 5% | | 0.9890 | 0.9070 | 6% | |
| BIN + MLP | 0.9696 | 0.8775 | 4% | | 0.9804 | 0.8654 | 4% | |
| BIN + SVM | 0.9691 | 0.8370 | 8% | 89.87% | 0.9651 | 0.7743 | 12% | 88.92% |
| BIN + RVM | 0.9963 | 0.9501 | 1% | | 0.9809 | 0.8829 | 3% | |
| DES + Linear | 0.9798 | 0.8436 | 9% | 93.33% | 0.9626 | 0.7623 | 6% | 90.38% |
| DES + C5.0 | 0.9527 | 0.8462 | 5% | 93.07% | 0.9605 | 0.7960 | 4% | 93.00% |
| DES + FOREST | 0.9897 | 0.9137 | 3% | | 0.9863 | 0.8942 | 3% | |
| DES + MLP | 0.9658 | 0.8202 | 5% | 90.93% | 0.9639 | 0.7965 | 6% | 90.96% |
| DES + SVM | 0.9827 | 0.8561 | 6% | 93.60% | 0.9700 | 0.8055 | 4% | 92.13% |
| DES + RVM | 0.9663 | 0.8215 | 6% | 92.27% | 0.9559 | 0.7861 | 6% | 91.84% |
| PSE + Linear | 0.9428 | 0.7382 | 5% | 88.53% | 0.9487 | 0.7349 | 5% | 88.05% |
| PSE + C5.0 | 0.8672 | 0.6333 | 5% | 83.20% | 0.9070 | 0.7821 | 7% | 90.96% |
| PSE + FOREST | 0.9773 | 0.8319 | 3% | 92.27% | 0.9735 | 0.8286 | 4% | 92.13% |
| PSE + MLP | 0.9467 | 0.7608 | 6% | 89.60% | 0.9520 | 0.7302 | 5% | 89.50% |
| PSE + SVM | 0.9727 | 0.8339 | 3% | 91.20% | 0.9519 | 0.7760 | 3% | 87.46% |
| PSE + RVM | 0.9472 | 0.7608 | 6% | 88.27% | 0.9475 | 0.7395 | 6% | 88.63% |
| bio‐Bayesian | 0.9745 | 0.8419 | 3% | 93.33% | 0.9624 | 0.7474 | 3% | 90.67% |
| bio‐C5.0 | 0.9357 | 0.7805 | | 90.93% | 0.9394 | 0.7708 | | 84.62% |
| bio‐FOREST | 0.9889 | 0.9380 | | | 0.9835 | 0.8970 | | |
| bio‐MLP | 0.9843 | 0.9041 | | | 0.9648 | 0.8605 | | |
| bio‐SVM | 1.0000 | 0.9938 | | | 0.9999 | 0.9770 | | |
| bio‐RVM | 0.9834 | 0.8932 | | | 0.9792 | 0.8790 | | |
Note: “NO” means no encoding process was used. “BIN” stands for the binary‐encoded data. “DES” stands for the descriptor‐encoded data. “PSE” stands for the profile‐encoded data. The percentages in bold were greater than the 94.13% of the bio‐SOM model.
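For readers unfamiliar with the column headings, the sketch below computes MCC, the Type I error rate, and total accuracy from a confusion matrix. It assumes the standard definitions (Type I error rate as the false positive rate, “Total” as overall accuracy); the source does not spell these out, and the counts are hypothetical rather than taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def type_i_error_rate(tn, fp):
    """False positive rate: non-cleaved peptides predicted as cleaved."""
    return fp / (fp + tn)


def total_accuracy(tp, tn, fp, fn):
    """Fraction of all peptides classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only.
tp, tn, fp, fn = 90, 95, 5, 10

m = mcc(tp, tn, fp, fn)
t1 = type_i_error_rate(tn, fp)
acc = total_accuracy(tp, tn, fp, fn)

print(f"MCC={m:.4f}  Type I={t1:.0%}  Total={acc:.2%}")
```

AUC is not included because it requires ranked prediction scores rather than a single confusion matrix.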
FIGURE 4. The prediction spectra of four bio‐kernel models (bio‐FOREST, bio‐MLP, bio‐SVM, and bio‐RVM) for the protein R1AB_SARS2. The protein had seven main protease cleavage sites. The heights of the bars stand for the predicted values which have been normalized between 0 and 1. The bars with the dots on the top stand for the true cleavage sites.
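A prediction spectrum of this kind can be sketched as a sliding window over the protein followed by a rescaling of the model outputs. Everything specific below is an assumption: the toy sequence, the raw scores standing in for a trained model, and the use of min-max scaling (the caption states only that values were normalized between 0 and 1).

```python
def sliding_windows(sequence, width=9):
    """Return (start_index, peptide) for every window of the given width."""
    return [(i, sequence[i:i + width]) for i in range(len(sequence) - width + 1)]


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # flat spectrum: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy sequence, not a real protein.
sequence = "AASAVLQSGFRKAA"
windows = sliding_windows(sequence, width=9)

# Hypothetical raw outputs of a trained model, one per window.
raw_scores = [-1.0, 0.5, 3.0, 0.5, -0.5, -1.0]
spectrum = min_max_normalize(raw_scores)

for (start, peptide), height in zip(windows, spectrum):
    print(start, peptide, height)
```

Each bar in the figure would correspond to one window position, with its height given by the normalized score.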
The performance comparison between the full models and the parsimonious models
| Algorithm | Parsimonious AUC | Parsimonious MCC | Parsimonious Type I | Parsimonious Total | Full AUC | Full MCC | Full Type I | Full Total |
|---|---|---|---|---|---|---|---|---|
| bio‐FOREST | 0.9744 | 0.8581 | | 93.60% | 0.9889 | 0.9380 | | |
| bio‐MLP | 0.9487 | 0.8046 | 2% | 89.33% | 0.9843 | 0.9041 | | |
| bio‐SVM | 0.9900 | 0.8775 | 2% | | 1.0000 | 0.9938 | | |
| bio‐RVM | 0.9652 | 0.8204 | 2% | 92.27% | 0.9834 | 0.8932 | | |
Note: The values in bold stand for the best models.