Haiting Chai, Quan Gu, Joseph Hughes, David L Robertson.
Abstract
Human immunodeficiency virus type 1 (HIV-1) continues to be a major cause of disease and premature death. As with all viruses, HIV-1 exploits a host cell to replicate. Improving our understanding of the molecular interactions between virus and human host proteins is crucial for a mechanistic understanding of virus biology, infection and host antiviral activities. This knowledge will potentially permit the identification of host molecules for targeting by drugs with antiviral properties. Here, we propose a data-driven approach for the analysis and prediction of the HIV-1 interacting proteins (VIPs) with a focus on the directionality of the interaction: host-dependency versus antiviral factors. Using support vector machine learning models and features encompassing genetic, proteomic and network properties, our results reveal some significant differences between the VIPs and non-HIV-1 interacting human proteins (non-VIPs). As assessed by comparison with the HIV-1 infection pathway data in the Reactome database (sensitivity > 90%, threshold = 0.5), we demonstrate these models have good generalization properties. We find that the 'direction' of the HIV-1-host molecular interactions is also predictable due to different characteristics of 'forward'/pro-viral versus 'backward'/pro-host proteins. Additionally, we infer the previously unknown direction of the interactions between HIV-1 and 1351 human host proteins. A web server for performing predictions is available at http://hivpre.cvr.gla.ac.uk/.
Year: 2022 PMID: 35134057 PMCID: PMC8856524 DOI: 10.1371/journal.pcbi.1009720
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
Breakdown of VIP and non-VIP datasets used.
| Dataset | Positives | Negatives |
|---|---|---|
| Main dataset S1 | 2881 VIPs | 7261 non-VIPs |
| Training S1’ | 2304 VIPs | 2304 non-VIPs |
| Independent testing S1” | 577 VIPs | 4957 non-VIPs |
| Main dataset S2 | 188 backward VIPs | 1007 forward VIPs |
| Training S2’ | 150 backward VIPs | 150 forward VIPs |
| Independent testing S2” | 38 backward VIPs | 857 forward VIPs |
| Reference dataset S3 | 335 bidirectional VIPs | |
| Blind testing dataset S4 | 1351 undefined VIPs | |
| Testing dataset S5 | 234 VIPs | |
| Testing dataset S6 | 356 VIPs | |
(a) Datasets S1 and S2 were constructed for the prediction of VIPs and of their directionality in the HIV-1-host PPIs, respectively. For each, 80% of the positives and an equal number of negatives were randomly selected for training, while the remaining proteins were used for independent testing. Dataset S3 was constructed for the prediction of ‘bidirectional’ VIPs, while S4 was constructed for the prediction of putative forward, backward or bidirectional VIPs. Testing datasets S5 and S6 were retrieved from two resources with high experimental confidence: the HIV-1 infection pathway in Reactome [60] (https://reactome.org/PathwayBrowser/#/R-HSA-162906) and the viral host-dependency epistasis map linked to HIV function [61]. The lists of proteins sampled for training and independent testing are provided in .
Abbreviations: HIV-1, human immunodeficiency virus type 1; VIPs, HIV-1 interacting human proteins; non-VIPs, non-HIV-1 interacting human proteins.
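The sampling scheme described in the footnote above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_balanced`, the seed, and the placeholder protein identifiers are all hypothetical, and rounding the 80% share down is an assumption chosen because it reproduces the counts in the table (2304/577 for S1 and 150/38 for S2).

```python
import random


def split_balanced(positives, negatives, train_fraction=0.8, seed=0):
    """Hold out test data and build a balanced training set by undersampling.

    Returns (train_pos, train_neg, test_pos, test_neg). The training set takes
    train_fraction of the positives and the same number of negatives; every
    remaining protein goes to the independent test set.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    pos = list(positives)
    neg = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Assumption: the 80% share is rounded down (int(2304.8) == 2304).
    n_train = int(len(pos) * train_fraction)
    if n_train > len(neg):
        raise ValueError("not enough negatives to match the training positives")
    return pos[:n_train], neg[:n_train], pos[n_train:], neg[n_train:]


# Placeholder identifiers stand in for real protein accessions.
vips = [f"VIP_{i}" for i in range(2881)]
non_vips = [f"nonVIP_{i}" for i in range(7261)]
train_pos, train_neg, test_pos, test_neg = split_balanced(vips, non_vips)
print(len(train_pos), len(train_neg), len(test_pos), len(test_neg))
```

Because only the negatives are undersampled for training, the independent test sets stay heavily imbalanced (577 versus 4957 for S1”), which matters when reading precision on those sets.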
The performance of different feature sets on the training datasets over five-fold cross-validation.
| Dataset | Algorithm | Features | Features number | Threshold | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| S1’ | SVM | Genetic sequences | 107 | 0.51 | 0.613 | 0.700 | 0.656 | 0.314 | 0.7118 |
| | SVM | Proteomic sequences | 128 | 0.51 | 0.595 | 0.649 | 0.622 | 0.244 | 0.6641 |
| | SVM | Annotations | 292 | 0.57 | 0.663 | 0.806 | 0.735 | 0.475 | 0.8090 |
| | SVM | Interaction profiles | 10 | 0.52 | 0.611 | 0.777 | 0.694 | 0.394 | 0.7487 |
| | SVM | Combination | 537 | 0.56 | 0.690 | 0.817 | 0.754 | 0.512 | 0.8324 |
| | KNN | Combination | 537 | 0.35~0.39 | 0.766 | 0.633 | 0.699 | 0.402 | 0.7772 |
| | DT | Partial | 278 | N/A | 0.633 | 0.642 | 0.637 | 0.275 | N/A |
| | RF | Random | Random | 0.44~0.52 | 0.733±0.035 | 0.752±0.030 | 0.742±0.004 | 0.486±0.009 | 0.8157±0.0031 |
| | SVM | Top-ranked 33 | 33 | 0.54 | 0.645 | 0.718 | 0.681 | 0.363 | 0.7468 |
| | SVM | Top-ranked 193 | 193 | 0.48 | 0.748 | 0.751 | 0.750 | 0.499 | 0.8261 |
| | KNN | Optimum | 441 | 0.43~0.48 | 0.689 | 0.720 | 0.705 | 0.410 | 0.7734 |
| | SVM | Optimum | 441 | 0.52 | 0.727 | 0.787 | 0.757 | 0.514 | 0.8344 |
| S2’ | SVM | Genetic sequences | 107 | N/A (f) | N/A (f) | N/A (f) | N/A (f) | N/A (f) | N/A (f) |
| | SVM | Proteomic sequences | 164 | 0.40 | 0.860 | 0.633 | 0.747 | 0.507 | 0.8023 |
| | SVM | Annotations | 292 | 0.46 | 0.767 | 0.520 | 0.643 | 0.296 | 0.6786 |
| | SVM | Interaction profiles | 21 | 0.51 | 0.740 | 0.633 | 0.687 | 0.375 | 0.7108 |
| | SVM | Combination | 584 | 0.46 | 0.807 | 0.553 | 0.680 | 0.372 | 0.7383 |
| | KNN | Combination | 584 | 0.50~0.54 | 0.400 | 0.833 | 0.617 | 0.259 | 0.6501 |
| | DT | Partial | 66 | N/A | 0.673 | 0.660 | 0.667 | 0.333 | N/A |
| | RF | Random | Random | 0.38~0.58 | 0.706±0.134 | 0.710±0.167 | 0.708±0.030 | 0.432±0.045 | 0.7609±0.0270 |
| | KNN | Optimum | 129 | 0.27~0.36 | 0.487 | 0.873 | 0.680 | 0.390 | 0.7509 |
| | SVM | Optimum | 129 | 0.44 | 0.853 | 0.680 | 0.767 | 0.542 | 0.8260 |
(a) Datasets S1’ and S2’ were balanced training datasets constructed via an undersampling strategy [70] from datasets S1 and S2, respectively. Compositions of these two datasets are provided in .
(b) The k-value for KNN was set to the square root of the number of training samples in the five-fold cross-validation.
(c) The DT algorithm selected 278 and 66 features from the original feature sets for the two modelling tasks.
(d) The RF algorithm used 50 randomly grown trees, and the modelling and validation procedures were repeated 10 times.
(e) The threshold was set by maximizing the value of MCC.
(f) ‘N/A’ is reported where the prediction quality of the generated classifier was worse than a random guess.
Abbreviations: SVM, support vector machine; KNN, k-nearest neighbors; DT, decision tree; RF, random forest; MCC, Matthews Correlation Coefficient; AUC, area under the receiver operating characteristic curve.
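The footnotes state that thresholds were chosen by maximizing MCC. The sketch below shows one way such a scan could work on predicted scores; it is an illustration under stated assumptions, not the authors' procedure. The helper names, the 0.01 grid, the `score >= threshold` decision rule, and the toy scores are all hypothetical.

```python
import math


def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN, predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(scores, labels):
    """Scan thresholds 0.01..0.99 and return the first one with maximal MCC."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: mcc(*confusion(scores, labels, t)))


# Toy scores and labels (1 = positive class), invented for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
t = best_threshold(toy_scores, toy_labels)
tp, fp, tn, fn = confusion(toy_scores, toy_labels, t)
print(t, tp / (tp + fn), tn / (tn + fp), mcc(tp, fp, tn, fn))
```

When several thresholds tie on MCC, this sketch keeps the lowest one. The KNN and RF rows report a threshold range rather than a single value, which may reflect such ties or variation across repeated runs.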
The performance of features with different categories on the testing datasets.
| Dataset | Model | Feature source | Threshold | Sensitivity | Specificity | Accuracy | Precision | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| S1” | PreVIP-33 | Annotation | 0.73 | 0.409 | 0.881 | 0.832 | 0.285 | 0.248 | 0.7323 |
| | PreVIP-193 | Multiple | 0.82 | 0.347 | 0.959 | 0.895 | 0.495 | 0.359 | 0.8034 |
| | PreVIP-441 | Multiple | 0.73 | 0.492 | 0.911 | 0.867 | 0.391 | 0.365 | 0.8079 |
| S2” | PreDIR-164 | Proteomic sequence | 0.53 | 0.658 | 0.762 | 0.758 | 0.109 | 0.194 | 0.7110 |
| | PreDIR-129 | Multiple | 0.70 | 0.474 | 0.873 | 0.856 | 0.142 | 0.200 | 0.7057 |
| S5 | PreVIP-193 | Multiple | 0.82 | 0.577 | | | | | |
| | PreVIP-193 | Multiple | 0.50 | 0.906 | | | | | |
| | PreVIP-441 | Multiple | 0.73 | 0.701 | | | | | |
| | PreVIP-441 | Multiple | 0.50 | 0.910 | | | | | |
| S6 | PreVIP-193 | Multiple | 0.82 | 0.416 | | | | | |
| | PreVIP-193 | Multiple | 0.50 | 0.806 | | | | | |
| | PreVIP-441 | Multiple | 0.73 | 0.596 | | | | | |
| | PreVIP-441 | Multiple | 0.50 | 0.817 | | | | | |
(a) Thresholds on S1” and S2” were set by maximizing the value of MCC. On testing datasets S5 and S6, two thresholds (0.82 and 0.73) were set according to the best performance of PreVIP-193 and PreVIP-441, respectively, on testing dataset S1”. In addition, a neutral threshold (0.5) was added for crude assessment. Only sensitivity is reported for S5 and S6 because these datasets contain VIPs only.
(b) Prediction results on testing datasets S5 and S6 are provided in .
Abbreviations: MCC, Matthews Correlation Coefficient; AUC, area under the receiver operating characteristic curve; HIV-1, human immunodeficiency virus type 1; VIPs, HIV-1 interacting human proteins; PreVIP-33, machine learning model generated from training dataset S1’ with the top 33 features for the VIP prediction task; PreVIP-193, machine learning model generated from training dataset S1’ with the top 193 features for the VIP prediction task; PreVIP-441, machine learning model generated from training dataset S1’ with the optimum 441 features for the VIP prediction task; PreDIR-164, machine learning model generated from training dataset S2’ with 164 proteome-based features for the directionality prediction task; PreDIR-129, machine learning model generated from training dataset S2’ with the optimum 129 features for the directionality prediction task.
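The low precision on the independent test sets follows from their class imbalance, and the reported figures can be cross-checked from sensitivity, specificity and the class sizes alone. The sketch below rebuilds an approximate confusion matrix from those values. The function name is hypothetical and the counts are rounded estimates, since the source reports rates rather than raw counts. Treating backward VIPs as the positive class for S2” follows the Positives column of the dataset table.

```python
import math


def metrics_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Estimate accuracy, precision and MCC from reported rates and class sizes."""
    tp = round(sensitivity * n_pos)  # estimated true positives
    fn = n_pos - tp
    tn = round(specificity * n_neg)  # estimated true negatives
    fp = n_neg - tn
    accuracy = (tp + tn) / (n_pos + n_neg)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": accuracy, "precision": precision, "mcc": mcc}


# PreVIP-193 on S1” (577 VIPs, 4957 non-VIPs), rates taken from the table.
previp_193 = metrics_from_rates(0.347, 0.959, n_pos=577, n_neg=4957)
# PreDIR-164 on S2” (38 backward VIPs, 857 forward VIPs).
predir_164 = metrics_from_rates(0.658, 0.762, n_pos=38, n_neg=857)

for name, m in [("PreVIP-193", previp_193), ("PreDIR-164", predir_164)]:
    print(name, m["tp"], m["fp"],
          round(m["accuracy"], 3), round(m["precision"], 3), round(m["mcc"], 3))
```

For PreVIP-193 the estimate is about 200 true positives against about 203 false positives, so a specificity of 0.959 still yields a precision near 0.5 because negatives outnumber positives by more than eight to one. The recomputed accuracy, precision and MCC agree with the tabulated values to within rounding.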