Xiang Chen, Zhi-Xin Wang, Xian-Ming Pan.
Abstract
Human immunodeficiency virus 1 (HIV-1) co-receptor usage, called tropism, is associated with disease progression towards AIDS. Furthermore, drugs targeting the co-receptors CCR5 or CXCR4, both recently developed and still in development, open new avenues for HIV-1 therapy. Thus, knowledge of tropism is critical for diagnosis and for prescribing a regimen. To improve tropism prediction accuracy, we developed two novel methods: XGBpred, based on extreme gradient boosting, and HMMpred, based on a hidden Markov model. Both XGBpred and HMMpred achieved higher specificities (72.56% and 72.09%) than the state-of-the-art methods Geno2pheno (61.6%) and G2p_str (68.60%) in a 10-fold cross-validation test at the same sensitivity of 93.73%. Moreover, XGBpred outperformed HMMpred (AUCs of 0.9483 and 0.9464 versus 0.8829 and 0.8774) on the Hivcopred and Newdb (created in this work) datasets, in which a larger proportion of the X4-using samples are hard-to-predict dual-tropic samples. Therefore, we recommend XGBpred for tropism prediction. The two methods and datasets are available via http://spg.med.tsinghua.edu.cn:23334/XGBpred/. In addition, our models identified positions 5, 11, 13, 18, 22, 24, and 25 as correlated with HIV-1 tropism.
Year: 2019 PMID: 31292462 PMCID: PMC6620319 DOI: 10.1038/s41598-019-46420-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Distribution of the six major subtypes in the Newdb dataset.
| Subtype | Number (Rᵃ, Xᵇ, Dᶜ) | Percentage |
|---|---|---|
| B | 1503 (1209, 93, 201) | 50.13% |
| C | 511 (460, 26, 25) | 17.04% |
| D | 233 (120, 52, 61) | 7.77% |
| 01_AE | 213 (149, 45, 19) | 7.10% |
| A | 155 (140, 5, 10) | 5.17% |
| 02_AG | 124 (50, 3, 71) | 4.14% |

Notes: ᵃ The number of R5-tropic sequences. ᵇ The number of X4-tropic sequences. ᶜ The number of dual-tropic sequences.
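The Percentage column appears to be computed against the full Newdb size (2998 sequences, the Sum given for Newdb in the tropism distribution table) and not against the six-subtype total. The short check below is a sketch under that assumption; the names `SUBTYPES`, `NEWDB_TOTAL`, and `subtype_shares` are illustrative and not taken from the paper.

```python
# Counts copied from the subtype table: (R5, X4, dual) per subtype.
SUBTYPES = {
    "B": (1209, 93, 201),
    "C": (460, 26, 25),
    "D": (120, 52, 61),
    "01_AE": (149, 45, 19),
    "A": (140, 5, 10),
    "02_AG": (50, 3, 71),
}

# Assumption: percentages are relative to the whole Newdb dataset.
NEWDB_TOTAL = 2998


def subtype_shares(subtypes, total):
    """Return {subtype: (count, percent_of_total)}, percent rounded to 2 places."""
    return {
        name: (sum(counts), round(100 * sum(counts) / total, 2))
        for name, counts in subtypes.items()
    }


shares = subtype_shares(SUBTYPES, NEWDB_TOTAL)
for name, (count, pct) in shares.items():
    print(f"{name:6s} {count:5d} {pct:6.2f}%")

# Sequences belonging to the six major subtypes combined.
major_total = sum(count for count, _ in shares.values())
print("six major subtypes:", major_total, "of", NEWDB_TOTAL)
```

Under this assumption the per-subtype counts and percentages match the table, and the six subtypes together account for 2739 of the 2998 sequences.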
Distribution of tropisms in the different datasets.
| Dataset | R5 | X4-using: X4 | X4-using: Dual | Sum |
|---|---|---|---|---|
| Newdb | 2335 | 245 | 418 | 2998 |
| G2p_str | 973 | 94 | 121 | 1188 |
| Hivcopredᵃ | 1768 | 246 | 321 | 2335 |
| CM | 2354 | 277 | 48 | 2679 |
| WebPSSM | 228ᵇ (47ᶜ) | 51ᵇ (24ᶜ), not split into X4 and dual | | 279ᵇ (71ᶜ) |

Notes: ᵃ After removing 31 duplicated sequences from the original Hivcopred dataset that were labeled as both R5-tropic and X4-using. ᵇ Training set. ᶜ Validation set.
Figure 1. Performance of the XGBpred and HMMpred methods on the Newdb dataset. (A) ROC curves on the Newdb dataset in the same 10-fold cross-validation test. The legend lists AUCs and the specificities at a sensitivity of 91.78%, which is plotted as the dashed black line. (B) Distribution of V3 loop sequence scores calculated by XGBpred and HMMpred on the Newdb dataset. The score distribution of the R5-tropic sequences is shown in blue, that of X4 in carmine, and that of dual in yellow. (C) ROC curves of XGBpred and HMMpred for the six major subtypes. The legend lists AUCs and mAPs.
Performance of the XGBpred and HMMpred methods on the different datasets.
| Dataset | Method | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|
| Newdb | XGBpred | 84.62% | 90.19% | 0.7310 | 0.9465 |
| Newdb | HMMpred | 70.59% | 87.09% | 0.6247 | 0.8774 |
| G2p_str | Geno2pheno | 61.6% | - | - | 0.860 |
| G2p_str | G2p_str | 68.6% | - | - | 0.892 |
| G2p_str | XGBpred | 72.56% | 89.90% | 0.6605 | 0.8952 |
| G2p_str | HMMpred | 72.09% | 89.81% | 0.6570 | 0.9002 |
| Hivcopred | Hivcopred | 81.44% | 87.07% | 0.67 | 0.904 |
| Hivcopred | XGBpred | 87.13% | 88.52% | 0.7154 | 0.9483 |
| Hivcopred | HMMpred | 71.08% | 84.63% | 0.5899 | 0.8829 |
| CM | CM | 92.92% | 95.21% | 0.885 | 0.97 |
| CM | XGBpred | 93.85% | 95.33% | 0.8106 | 0.9809 |
| CM | HMMpred | 89.54% | 94.81% | 0.7826 | 0.9635 |
| WebPSSM | WebPSSM | 83.3% | - | - | 0.881 |
| WebPSSM | XGBpred | 83.33% | 83.10% | 0.6419 | 0.9043 |
| WebPSSM | HMMpred | 75.00% | 80.28% | 0.5693 | 0.8678 |
Performance of XGBpred and HMMpred on the Newdb, G2p_str, Hivcopred, CM and WebPSSM datasets at the sensitivities of 91.78%, 93.73%, 89.99%, 95.54% and 82.98%, respectively.
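The accuracy and MCC columns can be cross-checked against the sensitivity, the specificity, and the class sizes. The sketch below assumes that R5 is treated as the positive class, so that sensitivity is the fraction of R5 sequences predicted as R5 and specificity is the fraction of X4-using sequences predicted as X4-using. This convention is an inference from the reported numbers, not something stated in this excerpt, and the function name `metrics_from_rates` is illustrative.

```python
import math


def metrics_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Rebuild a confusion matrix from rates and class sizes, then
    compute accuracy and the Matthews correlation coefficient (MCC)."""
    tp = round(sensitivity * n_pos)   # positives predicted correctly
    tn = round(specificity * n_neg)   # negatives predicted correctly
    fn = n_pos - tp
    fp = n_neg - tn
    accuracy = (tp + tn) / (n_pos + n_neg)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp,
            "accuracy": accuracy, "mcc": mcc}


# XGBpred on Newdb: sensitivity 91.78%, specificity 84.62%.
# Assumption: positives are the 2335 R5 sequences, negatives are the
# 245 X4 + 418 dual = 663 X4-using sequences.
newdb_xgb = metrics_from_rates(0.9178, 0.8462, n_pos=2335, n_neg=245 + 418)
print(f"accuracy = {newdb_xgb['accuracy']:.4f}, MCC = {newdb_xgb['mcc']:.4f}")
```

With these inputs the reconstructed accuracy and MCC agree with the reported 90.19% and 0.7310 to rounding, which supports the assumed class convention.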
Figure 2. Distribution of feature importance scores. The top 30 most important features identified by XGBpred on the Newdb, G2p_str, Hivcopred and CM datasets. S# denotes the alignment score at position #, and R1R2 represents a dipeptide.
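To make the R1R2 notation concrete, the following minimal sketch counts adjacent residue pairs in an amino-acid sequence. This excerpt does not say how the authors encode dipeptide features (counts, presence flags, or gapped pairs), so the counting scheme, the function name `dipeptide_counts`, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter


def dipeptide_counts(seq):
    """Count adjacent residue pairs (R1R2) after removing alignment gaps."""
    seq = seq.upper().replace("-", "")
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


# Illustrative 35-residue V3-like sequence (hypothetical input).
v3 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
counts = dipeptide_counts(v3)

# Show the dipeptides that occur more than once in this sequence.
for pair, n in sorted(counts.items()):
    if n > 1:
        print(pair, n)
```

A sequence of length n yields n - 1 adjacent pairs, so each feature vector built this way would have at most 400 distinct dipeptide entries for the 20 standard amino acids.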