Onkar Singh, Emily Chia-Yu Su.
Abstract
BACKGROUND: The human immunodeficiency virus type 1 (HIV-1) aspartic protease is an important enzyme because of its essential role in viral development; HIV-1 is the causative agent of acquired immune deficiency syndrome (AIDS), one of the deadliest known diseases. Developing HIV-1 protease inhibitors, which can restrain HIV-1 replication and thereby counter AIDS, benefits from understanding the substrate specificity of the protease. However, experimental identification of HIV-1 protease cleavage sites is generally time-consuming and labor-intensive. Computational methods for predicting cleavage sites have therefore become highly desirable.
Keywords: Cleavage sites; HIV-1 protease; Machine learning; Physicochemical properties; Pseudo amino acid composition; Sequence features; Structural features
Year: 2016 PMID: 28155640 PMCID: PMC5259813 DOI: 10.1186/s12859-016-1337-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Four benchmark datasets for HIV-1 cleavage site prediction
| Datasets | Octamers | Cleavage sites | Non-cleavage sites |
|---|---|---|---|
| 746 | 746 | 401 | 345 |
| 1625 | 1625 | 374 | 1251 |
| Schilling | 3272 | 434 | 2838 |
| Impens | 947 | 149 | 798 |
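The class balance of these datasets matters when reading the accuracy figures reported later. The following minimal sketch, which uses only the counts from the table above, computes the fraction of cleavage sites and the accuracy a trivial majority-class predictor would reach on each full dataset. The names `DATASETS`, `cleavage_fraction`, and `majority_baseline` are hypothetical helpers introduced for illustration, and the validation and test splits used in the paper may have slightly different proportions.

```python
# Counts copied from the dataset table: (octamers, cleavage, non-cleavage).
DATASETS = {
    "746": (746, 401, 345),
    "1625": (1625, 374, 1251),
    "Schilling": (3272, 434, 2838),
    "Impens": (947, 149, 798),
}


def cleavage_fraction(name):
    """Fraction of octamers in the dataset that are cleavage sites."""
    total, cleaved, non_cleaved = DATASETS[name]
    # Sanity check: the two classes should add up to the octamer count.
    assert cleaved + non_cleaved == total
    return cleaved / total


def majority_baseline(name):
    """Accuracy of always predicting the larger class (whole dataset)."""
    total, cleaved, non_cleaved = DATASETS[name]
    return max(cleaved, non_cleaved) / total


if __name__ == "__main__":
    for name in DATASETS:
        print(
            f"{name}: {cleavage_fraction(name):.1%} cleavage sites, "
            f"majority baseline {majority_baseline(name):.1%}"
        )
```

On the Schilling and Impens datasets the majority class is large, so an accuracy in the mid-80s can coincide with an AUC of 0.500; AUC is the more informative column for those rows.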
Fig. 1 System architecture of the proposed method
Predictive performance of sequence features for the four benchmark datasets
| Features | DT Accuracy (%) | DT AUC | LR Accuracy (%) | LR AUC | ANN Accuracy (%) | ANN AUC |
|---|---|---|---|---|---|---|
| 746 Dataset | | | | | | |
| AAC | 83.7 | 0.897 | 86.4 | 0.938 | 81.0 | 0.935 |
| DipC | 75.6 | 0.793 | 86.4 | 0.865 | | 0.974 |
| PseAAC | 78.3 | 0.787 | 86.4 | 0.938 | 81.0 | 0.885 |
| Seq_All | 78.3 | 0.831 | 86.4 | 0.847 | | |
| 1625 Dataset | | | | | | |
| AAC | 91.4 | 0.908 | 84.1 | 0.904 | 91.4 | 0.952 |
| DipC | 92.6 | 0.861 | 96.3 | 0.972 | | |
| PseAAC | 90.2 | 0.822 | 87.8 | 0.921 | 87.8 | 0.945 |
| Seq_All | 92.6 | 0.882 | 96.3 | 0.958 | | 0.984 |
| Schilling Dataset | | | | | | |
| AAC | 87.7 | 0.664 | 86.5 | 0.856 | 88.9 | 0.858 |
| DipC | 87.7 | 0.526 | 87.1 | 0.806 | | 0.790 |
| PseAAC | 87.1 | 0.500 | 86.5 | | 88.3 | 0.858 |
| Seq_All | 87.7 | 0.611 | 87.7 | 0.802 | 87.1 | 0.821 |
| Impens Dataset | | | | | | |
| AAC | 85.1 | 0.500 | 80.8 | 0.857 | 89.3 | 0.886 |
| DipC | 85.1 | 0.500 | 82.9 | 0.579 | | |
| PseAAC | 87.2 | 0.721 | 78.7 | 0.814 | 87.2 | 0.868 |
| Seq_All | 87.2 | 0.802 | 85.1 | 0.696 | 89.3 | 0.875 |
*The best accuracy and AUC in each dataset are underlined
Predictive performance of structural features for the four benchmark datasets
| Features | DT Accuracy (%) | DT AUC | LR Accuracy (%) | LR AUC | ANN Accuracy (%) | ANN AUC |
|---|---|---|---|---|---|---|
| 746 Dataset | | | | | | |
| SSE | 62.1 | 0.626 | 59.4 | 0.715 | 78.3 | 0.838 |
| SA | | 0.791 | 78.4 | 0.771 | 81.0 | 0.771 |
| Str_All | | 0.791 | 70.2 | 0.806 | 78.4 | |
| 1625 Dataset | | | | | | |
| SSE | 81.7 | 0.756 | 76.8 | 0.673 | 85.3 | 0.742 |
| SA | 91.4 | 0.920 | 89.0 | 0.961 | | |
| Str_All | 91.5 | 0.920 | 85.4 | 0.936 | 89.0 | 0.935 |
| Schilling Dataset | | | | | | |
| SSE | 87.1 | 0.500 | 88.3 | 0.775 | 88.3 | 0.800 |
| SA | | 0.788 | 84.0 | 0.828 | 87.1 | 0.840 |
| Str_All | | 0.788 | 83.4 | 0.824 | 85.8 | |
| Impens Dataset | | | | | | |
| SSE | 85.1 | 0.500 | 85.1 | 0.729 | 87.2 | 0.761 |
| SA | 89.3 | 0.736 | 89.3 | 0.918 | | |
| Str_All | 87.2 | 0.571 | 89.3 | 0.857 | 89.3 | 0.914 |
*The best accuracy and AUC in each dataset are underlined
Predictive performance of physicochemical property features for the four benchmark datasets
| Features | DT Accuracy (%) | DT AUC | LR Accuracy (%) | LR AUC | ANN Accuracy (%) | ANN AUC |
|---|---|---|---|---|---|---|
| 746 Dataset | | | | | | |
| Hydrophobicity | 75.6 | 0.735 | 83.7 | 0.956 | 89.1 | |
| Steric property | 89.1 | 0.929 | 86.4 | 0.941 | 81.0 | 0.932 |
| Polarizability | 81.0 | 0.815 | 83.7 | 0.953 | 83.7 | 0.947 |
| Isoelectric point | 81.0 | 0.865 | 86.4 | 0.953 | 83.7 | 0.953 |
| Polarity | 83.7 | 0.838 | 83.7 | 0.912 | 86.4 | 0.909 |
| Volume | 83.7 | 0.838 | 54.0 | 0.500 | 54.0 | 0.500 |
| Phy_All | 84.9 | 0.882 | 93.6 | 0.885 | | 0.953 |
| 1625 Dataset | | | | | | |
| Hydrophobicity | 87.8 | 0.849 | 84.1 | 0.896 | 86.5 | 0.874 |
| Steric property | 91.4 | 0.897 | 85.3 | 0.896 | 91.4 | 0.934 |
| Polarizability | 93.9 | 0.914 | 87.8 | 0.936 | | 0.957 |
| Isoelectric point | 91.4 | 0.918 | 82.9 | 0.914 | 93.9 | 0.968 |
| Polarity | 86.5 | 0.847 | 87.8 | 0.904 | 89.0 | 0.919 |
| Volume | 92.6 | 0.896 | 89.0 | 0.933 | 93.9 | |
| Phy_All | 92.7 | 0.882 | 92.7 | 0.921 | 92.7 | 0.944 |
| Schilling Dataset | | | | | | |
| Hydrophobicity | 87.7 | 0.708 | 89.5 | 0.862 | 89.5 | 0.863 |
| Steric property | 88.3 | 0.721 | 86.5 | 0.837 | 88.3 | 0.843 |
| Polarizability | 89.5 | 0.683 | 89.5 | 0.854 | | 0.853 |
| Isoelectric point | 88.3 | 0.733 | 87.7 | 0.858 | 89.5 | 0.860 |
| Polarity | 87.1 | 0.500 | 87.1 | 0.860 | 88.3 | 0.865 |
| Volume | 88.9 | 0.622 | 88.3 | 0.847 | 88.3 | 0.810 |
| Phy_All | 88.9 | 0.593 | 89.5 | | 85.2 | 0.863 |
| Impens Dataset | | | | | | |
| Hydrophobicity | 85.1 | 0.500 | 80.8 | 0.686 | 87.2 | 0.886 |
| Steric property | 89.3 | 0.845 | 82.9 | 0.825 | 89.3 | 0.893 |
| Polarizability | 85.1 | 0.500 | 85.1 | 0.864 | 89.3 | 0.943 |
| Isoelectric point | 85.1 | 0.500 | 78.7 | 0.850 | | |
| Polarity | 85.1 | 0.500 | 85.1 | 0.743 | 82.9 | 0.682 |
| Volume | 85.1 | 0.500 | 85.1 | 0.736 | 80.8 | 0.500 |
| Phy_All | 91.5 | 0.839 | 82.9 | 0.796 | 87.2 | 0.839 |
*The best accuracy and AUC in each dataset are underlined
Predictive performance of hybrid features for the four benchmark datasets
| Features | DT Accuracy (%) | DT AUC | LR Accuracy (%) | LR AUC | ANN Accuracy (%) | ANN AUC |
|---|---|---|---|---|---|---|
| 746 Dataset | | | | | | |
| Seq + Str | 78.3 | 0.788 | 91.8 | 0.982 | 94.5 | |
| Seq + Phy | 83.7 | 0.838 | 86.4 | 0.968 | 94.5 | 0.976 |
| Phy + Str | 83.7 | 0.810 | 78.3 | 0.860 | 91.8 | 0.982 |
| Seq + Str + Phy | 75.6 | 0.841 | | 0.991 | | 0.988 |
| 1625 Dataset | | | | | | |
| Seq + Str | 89.0 | 0.910 | 96.3 | 0.980 | 95.1 | |
| Seq + Phy | 89.0 | 0.785 | 97.5 | 0.958 | | 0.990 |
| Phy + Str | 91.4 | 0.940 | 86.5 | 0.810 | 93.9 | 0.985 |
| Seq + Str + Phy | 91.4 | 0.956 | 95.1 | 0.980 | 97.5 | 0.990 |
| Schilling Dataset | | | | | | |
| Seq + Str | 90.8 | 0.845 | 86.5 | 0.865 | | 0.873 |
| Seq + Phy | 87.1 | 0.500 | 90.8 | 0.837 | 88.9 | 0.825 |
| Phy + Str | 85.1 | 0.500 | 80.8 | 0.603 | 80.8 | 0.596 |
| Seq + Str + Phy | 88.9 | 0.810 | 89.5 | 0.826 | 91.4 | |
| Impens Dataset | | | | | | |
| Seq + Str | 89.3 | 0.682 | 89.3 | 0.918 | | 0.918 |
| Seq + Phy | 91.4 | 0.839 | 87.2 | 0.889 | 91.4 | 0.896 |
| Phy + Str | 85.1 | 0.500 | 82.9 | 0.889 | | |
| Seq + Str + Phy | 87.2 | 0.675 | 87.2 | 0.889 | 89.3 | 0.850 |
*The best accuracy and AUC in each dataset are underlined
Predictive performance of the selected features and machine learning algorithms on the validation and test sets
| Datasets | Features | Algorithm | Sensitivity (%) | Specificity (%) | Accuracy (%) | AUC |
|---|---|---|---|---|---|---|
| 746 | Seq + Str | ANN | 100.0 (100.0)* | 88.2 (94.4) | 94.5 (97.4) | 0.994 (0.995) |
| 1625 | Seq + Str | ANN | 94.7 (89.4) | 95.2 (96.8) | 95.1 (95.1) | 0.992 (0.994) |
| Schilling | Seq + Str + Phy | ANN | 57.1 (27.3) | 96.5 (95.8) | 91.4 (86.6) | 0.895 (0.815) |
| Impens | SA | ANN | 71.4 (44.4) | 100.0 (89.8) | 95.7 (80.0) | 0.950 (0.816) |
*The predictive performance on the test set is shown in parentheses
Fig. 2 Decision tree of the 746 dataset based on Seq + Str features
Interpretable biological features selected by the decision tree model based on Seq + Str features for the 746 dataset
| Rank | Variable | Description | Importance |
|---|---|---|---|
| 1 | RSA_4 | RSA at the 4th position of an octamer | 1.0000 |
| 2 | MT | DipC of methionine & threonine | 0.7855 |
| 3 | C | AAC of cysteine | 0.6060 |
| 4 | RSA_5 | RSA at the 5th position of an octamer | 0.5238 |
| 5 | VH | DipC of valine & histidine | 0.4059 |
| 6 | AE | DipC of alanine & glutamic acid | 0.3769 |
| 7 | FL | DipC of phenylalanine & leucine | 0.3268 |
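Several of the variables ranked above are composition features of the octamer (AAC for single residues, DipC for residue pairs). As a rough illustration of what such features measure, the sketch below computes both for one peptide. It is not the authors' implementation: normalizing by the number of residues (AAC) or adjacent pairs (DipC), treating pairs as ordered, and the function names `aac` and `dipc` are all assumptions made for this example, and the octamer `SQNYPIVQ` is an illustrative input chosen for the sketch.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(peptide):
    """Amino acid composition: frequency of each residue in the peptide.

    Assumption: counts are normalized by peptide length.
    """
    counts = Counter(peptide)
    return {aa: counts[aa] / len(peptide) for aa in AMINO_ACIDS}


def dipc(peptide):
    """Dipeptide composition: frequency of each ordered adjacent pair.

    Assumption: counts are normalized by the number of adjacent pairs
    (length - 1), giving a 400-dimensional feature vector.
    """
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    counts = Counter(pairs)
    return {
        a + b: counts[a + b] / len(pairs)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }


if __name__ == "__main__":
    octamer = "SQNYPIVQ"  # illustrative octamer, chosen for this sketch
    print("AAC of cysteine (C):", aac(octamer)["C"])
    print("AAC of glutamine (Q):", aac(octamer)["Q"])
    print("DipC of MT:", dipc(octamer)["MT"])
    print("DipC of SQ:", dipc(octamer)["SQ"])
```

Under these assumptions, a variable such as `C` in the table corresponds to `aac(octamer)["C"]`, and `MT` to `dipc(octamer)["MT"]`; the RSA variables come from structure prediction and are not covered by this sketch.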