Zhenfeng Li, Lun Hu, Zehai Tang, Cheng Zhao.
Abstract
Understanding the substrate specificity of HIV-1 protease plays an essential role in the prevention of HIV infection. A variety of computational models have therefore been developed to predict the substrate sites cleaved by HIV-1 protease. Most of them follow a supervised learning scheme, building classifiers that treat experimentally verified cleavable sites as positive samples and unknown sites as negative samples. However, the negative set may contain noise, because some unknown sites are in fact cleavable (false negatives). The resulting bias in the predictions makes these classifiers less accurate than they could be. In this work, unknown substrate sites are regarded as unlabeled samples instead of negative ones. We propose a novel positive-unlabeled learning algorithm, PU-HIV, for the effective prediction of HIV-1 protease cleavage sites. The features used by PU-HIV encode substrate sequences from different perspectives, including amino acid identities, coevolutionary patterns, and chemical properties. By adjusting the weights of the errors generated by positive and unlabeled samples, a biased support vector machine classifier is built to complete the prediction task. In benchmarking experiments using cross-validation and independent tests, PU-HIV outperformed state-of-the-art prediction models in terms of AUC, PR-AUC, and F-measure. With PU-HIV, it is therefore possible to identify previously unknown but physiologically existing substrate sites that can be cleaved by HIV-1 protease, providing valuable insights into the design of novel HIV-1 protease inhibitors for HIV treatment.
Keywords: HIV-1 protease; biased SVM; cleavage site prediction; positive-unlabeled learning; substrate specificity
Year: 2021 PMID: 33868387 PMCID: PMC8044780 DOI: 10.3389/fgene.2021.658078
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
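The abstract describes a biased SVM that weights the errors on positive and unlabeled samples differently. A minimal sketch of such an objective for a linear model is given below. The function name, the parameters `c_pos` and `c_unl`, and the toy data are assumptions made for illustration; they are not taken from the PU-HIV implementation, which may use a kernel and a different solver.

```python
def biased_hinge_objective(w, b, X, y, c_pos, c_unl):
    """Weighted soft-margin objective for a linear classifier.

    y uses +1 for positive samples and -1 for unlabeled samples.
    c_pos and c_unl are hypothetical names for the two error weights.
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slack = max(0.0, 1.0 - yi * score)       # hinge loss of this sample
        weight = c_pos if yi == 1 else c_unl     # class-dependent penalty
        loss += weight * slack
    return regularizer + loss


# Toy data (made up): one positive and one unlabeled sample at the same point.
X = [[0.5], [0.5]]
y = [1, -1]
print(biased_hinge_objective([1.0], 0.0, X, y, c_pos=10.0, c_unl=1.0))  # 7.0
```

In positive-unlabeled settings the positive weight is usually chosen larger than the unlabeled weight, since an unlabeled sample scored as positive may simply be an undiscovered cleavage site.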
Figure 1. The workflow of PU-HIV.
Detailed descriptions of five datasets.
| Dataset | References | Total | P | U |
|---|---|---|---|---|
| 301Dataset | Chou ( | 301 | 62 | 239 |
| 746Dataset | You et al. ( | 746 | 401 | 345 |
| 1625Dataset | Kontijevskis et al. ( | 1,625 | 374 | 1,251 |
| impensDataset | Rögnvaldsson et al. ( | 947 | 149 | 798 |
| schillingDataset | Rögnvaldsson et al. ( | 3,272 | 434 | 2,838 |
The References column gives the original source of each dataset. The Total column gives the number of octapeptides in the dataset. The P and U columns give the sizes of the positive and unlabeled sets, respectively.
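As a quick consistency check on the dataset table, the sketch below verifies that P and U add up to Total and computes the share of positive octapeptides in each dataset. The numbers are copied from the table; the dictionary layout and the function name are assumptions for illustration.

```python
# (Total, P, U) per dataset, copied from the table above.
DATASET_SIZES = {
    "301Dataset": (301, 62, 239),
    "746Dataset": (746, 401, 345),
    "1625Dataset": (1625, 374, 1251),
    "impensDataset": (947, 149, 798),
    "schillingDataset": (3272, 434, 2838),
}


def positive_fraction(total, positives, unlabeled):
    """Return P / Total after checking that P + U equals Total."""
    if positives + unlabeled != total:
        raise ValueError("P and U do not add up to Total")
    return positives / total


for name, sizes in DATASET_SIZES.items():
    print(f"{name}: {positive_fraction(*sizes):.3f}")
```

The share of positives differs considerably between datasets, which is worth keeping in mind when comparing threshold-dependent scores across them.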
The chemical classes to which the 20 amino acids belong.
| Chemical class | Amino acids |
|---|---|
| Sulfur-containing | C, M |
| Aliphatic 1 | A, G, P |
| Aliphatic 2 | I, L, V |
| Acidic | D, E |
| Basic | H, K, R |
| Aromatic | F, W, Y |
| Amide | N, Q |
| Small hydroxy | S, T |
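The class table above can be turned into a feature vector in several ways. The sketch below shows one plausible option, a one-hot encoding of the chemical class at each of the eight positions of an octapeptide. This particular encoding, the function name, and the example peptide are assumptions for illustration; the exact chemical-property encoding used by PU-HIV is not given in this extract and may differ.

```python
# Class membership copied from the table above.
CHEMICAL_CLASSES = {
    "Sulfur-containing": "CM",
    "Aliphatic 1": "AGP",
    "Aliphatic 2": "ILV",
    "Acidic": "DE",
    "Basic": "HKR",
    "Aromatic": "FWY",
    "Amide": "NQ",
    "Small hydroxy": "ST",
}
CLASS_NAMES = list(CHEMICAL_CLASSES)
AA_TO_CLASS = {
    aa: idx
    for idx, name in enumerate(CLASS_NAMES)
    for aa in CHEMICAL_CLASSES[name]
}


def encode_chemical_classes(octapeptide):
    """One-hot encode the chemical class of each residue (8 x 8 = 64 values)."""
    if len(octapeptide) != 8:
        raise ValueError("expected an octapeptide (8 residues)")
    vector = []
    for aa in octapeptide.upper():
        one_hot = [0] * len(CLASS_NAMES)
        one_hot[AA_TO_CLASS[aa]] = 1  # KeyError for non-standard residues
        vector.extend(one_hot)
    return vector


# Illustrative octapeptide only; not taken from the datasets above.
features = encode_chemical_classes("SQNYPIVQ")
print(len(features))  # 64
```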
Experimental results of 10-fold cross-validation (CV).
| 301Dataset | PU-HIV | 0.96 | 0.87 | 0.76 | ||
| PU-HIV with standard SVM | 0.96 | 0.88 | 0.86 | 0.77 | ||
| EvoCleave | 0.91 | 0.81 | 0.37 | 0.94 | 0.53 | |
| Rögnvaldsson et al. ( | 0.93 | 0.86 | 0.85 | 0.74 | 0.79 | |
| PROSPERous | 0.94 | 0.45 | 0.21 | 1 | 0.34 | |
| HIVcleave | 0.61 | 1 | 0.55 | 0.71 | ||
| iProt-Sub | 0.78 | 0.53 | 0.63 | 0.32 | 0.43 | |
| DeepCleave | 0.45 | 0.2 | 0.13 | 0.19 | 0.16 | |
| 746Dataset | PU-HIV | 0.91 | 0.91 | |||
| PU-HIV with standard SVM | 0.94 | 0.93 | 0.89 | 0.87 | 0.88 | |
| EvoCleave | 0.93 | 0.92 | 0.9 | 0.8 | 0.85 | |
| Rögnvaldsson et al. ( | 0.92 | 0.91 | 0.85 | 0.9 | 0.87 | |
| PROSPERous | 0.84 | 0.53 | 0.54 | 1 | 0.7 | |
| HIVcleave | 0.74 | 0.81 | 0.92 | 0.7 | 0.8 | |
| iProt-Sub | 0.7 | 0.71 | 0.71 | 0.25 | 0.37 | |
| DeepCleave | 0.44 | 0.49 | 0.41 | 0.14 | 0.21 | |
| 1625Dataset | PU-HIV | 0.9 | 0.9 | |||
| PU-HIV with standard SVM | 0.94 | 0.89 | 0.86 | 0.88 | ||
| EvoCleave | 0.93 | 0.84 | 0.85 | 0.74 | 0.8 | |
| Rögnvaldsson et al. ( | 0.97 | 0.9 | 0.85 | 0.8 | 0.83 | |
| PROSPERous | 0.82 | 0.33 | 0.23 | 1 | 0.38 | |
| HIVcleave | 0.73 | 0.61 | 0.69 | 0.67 | 0.68 | |
| iProt-Sub | 0.68 | 0.41 | 0.41 | 0.26 | 0.32 | |
| DeepCleave | 0.46 | 0.21 | 0.13 | 0.14 | 0.13 | |
| impensDataset | PU-HIV | 0.73 | 0.65 | |||
| PU-HIV with standard SVM | 0.74 | 0.71 | 0.62 | 0.67 | ||
| EvoCleave | 0.88 | 0.64 | 0.77 | 0.42 | 0.54 | |
| Rögnvaldsson et al. ( | 0.9 | 0.7 | 0.69 | 0.62 | 0.65 | |
| PROSPERous | 0.83 | 0.17 | 0.16 | 1 | 0.27 | |
| HIVcleave | 0.56 | 0.29 | 0.29 | 0.45 | 0.35 | |
| iProt-Sub | 0.72 | 0.36 | 0.43 | 0.34 | 0.38 | |
| DeepCleave | 0.45 | 0.14 | 0.14 | 0.34 | 0.2 | |
| schillingDataset | PU-HIV | 0.73 | 0.67 | |||
| PU-HIV with standard SVM | 0.92 | 0.7 | 0.62 | 0.68 | 0.65 | |
| EvoCleave | 0.78 | 0.36 | 0.5 | 0.2 | 0.28 | |
| Rögnvaldsson et al. ( | 0.93 | 0.68 | 0.66 | 0.66 | 0.66 | |
| PROSPERous | 0.88 | 0.15 | 0.14 | 0.95 | 0.24 | |
| HIVcleave | 0.59 | 0.34 | 0.31 | 0.41 | 0.35 | |
| iProt-Sub | 0.75 | 0.37 | 0.39 | 0.34 | 0.37 | |
| DeepCleave | 0.52 | 0.13 | 0.13 | 0.43 | 0.2 | |
*For each dataset, the best results are bolded.
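The column headers of the table above did not survive in this extract. In rows with five values, the last value matches the harmonic mean of the two before it, which suggests that the final three columns are precision, recall, and F-measure; this reading is an assumption. A minimal sketch of the check:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows from the 301Dataset block, read as (precision, recall) under the
# assumption stated above.
print(round(f_measure(0.37, 0.94), 2))  # EvoCleave row: 0.53
print(round(f_measure(0.85, 0.74), 2))  # Rögnvaldsson et al. row: 0.79
```

Because the table values are rounded to two decimals, recomputed scores can differ from the printed ones by about 0.01 in some rows.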
Experimental results of cross-dataset tests.
| 301Dataset | 746Dataset | 0.94 | 0.94 | 0.87 |
| 1625Dataset | 0.93 | 0.78 | 0.76 | |
| impensDataset | 0.81 | 0.55 | 0.54 | |
| schillingDataset | 0.84 | 0.44 | 0.49 | |
| 746Dataset | 301Dataset | 0.99 | 0.98 | 0.98 |
| 1625Dataset | 0.99 | 0.97 | 0.9 | |
| impensDataset | 0.84 | 0.63 | 0.6 | |
| schillingDataset | 0.89 | 0.56 | 0.56 | |
| 1625Dataset | 301Dataset | 0.99 | 0.98 | 0.97 |
| 746Dataset | 0.98 | 0.98 | 0.96 | |
| impensDataset | 0.82 | 0.59 | 0.5 | |
| schillingDataset | 0.88 | 0.54 | 0.44 | |
| impensDataset | 301Dataset | 0.94 | 0.8 | 0.7 |
| 746Dataset | 0.89 | 0.9 | 0.75 | |
| 1625Dataset | 0.89 | 0.71 | 0.63 | |
| schillingDataset | 0.94 | 0.71 | 0.66 | |
| schillingDataset | 301Dataset | 0.96 | 0.88 | 0.77 |
| 746Dataset | 0.93 | 0.94 | 0.87 | |
| 1625Dataset | 0.94 | 0.8 | 0.72 | |
| impensDataset | 0.9 | 0.75 | 0.64 |
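The 20 rows above correspond to every ordered pair of distinct datasets. The sketch below enumerates those pairs and runs a generic train-then-evaluate loop over them. It assumes that the first column names the training dataset and the second the test dataset, which this extract does not state; `train_fn` and `eval_fn` are hypothetical callbacks standing in for model training and scoring.

```python
from itertools import permutations

DATASETS = [
    "301Dataset",
    "746Dataset",
    "1625Dataset",
    "impensDataset",
    "schillingDataset",
]


def cross_dataset_pairs(names):
    """All ordered (first, second) pairs of distinct datasets."""
    return list(permutations(names, 2))


def run_cross_dataset(names, train_fn, eval_fn):
    """Train once per dataset, then evaluate on every other dataset."""
    models = {}
    results = {}
    for first, second in cross_dataset_pairs(names):
        if first not in models:
            models[first] = train_fn(first)   # hypothetical training callback
        results[(first, second)] = eval_fn(models[first], second)
    return results


print(len(cross_dataset_pairs(DATASETS)))  # 20
```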
Experimental results of feature analysis.
| AAI | 0.94 | 0.82 | 0.77 |
| CheP | 0.91 | 0.76 | 0.7 |
| CoP | 0.82 | 0.63 | 0.54 |
| AAI + CheP | 0.94 | 0.85 | 0.79 |
| AAI + CoP | 0.94 | 0.83 | 0.78 |
| CheP + CoP | 0.91 | 0.79 | 0.71 |
| AAI + CheP + CoP | 0.95 | 0.86 | 0.8 |
Three types of features, amino acid identities (AAI), chemical properties (CheP), and coevolutionary patterns (CoP), are used in different combinations to construct feature vectors, and cross-validation is then performed on the five independent datasets. Each reported result is the average over the five datasets.
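The seven rows of the feature analysis are the non-empty subsets of the three feature types. A minimal sketch of how such an ablation could be organized is shown below: enumerate the subsets, concatenate the chosen feature blocks into one vector, and average a score over datasets. The helper names and the toy feature blocks are assumptions for illustration, not the authors' code.

```python
from itertools import combinations

FEATURE_TYPES = ["AAI", "CheP", "CoP"]


def feature_combinations(types):
    """All non-empty subsets, ordered by size as in the table above."""
    return [
        combo
        for size in range(1, len(types) + 1)
        for combo in combinations(types, size)
    ]


def build_vector(blocks, combo):
    """Concatenate the per-type feature blocks selected by combo."""
    vector = []
    for name in combo:
        vector.extend(blocks[name])
    return vector


def mean(values):
    """Plain average, e.g. of one score over the five datasets."""
    return sum(values) / len(values)


for combo in feature_combinations(FEATURE_TYPES):
    print(" + ".join(combo))
```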