Minjuan Shi1, Jianyan Lin2, Wudi Wei3, Yaqin Qin2, Sirun Meng2, Xiaoyu Chen2, Yueqi Li3, Rongfeng Chen3, Zongxiang Yuan1, Yingmei Qin2, Jiegang Huang1, Bingyu Liang1, Yanyan Liao3, Li Ye1,3, Hao Liang1,3, Zhiman Xie2, Junjun Jiang1,3.
Abstract
OBJECTIVE: Talaromycosis is a serious regional disease endemic in Southeast Asia. In China, Talaromyces marneffei (T. marneffei) infection is mainly concentrated in the southern region, especially in Guangxi, and causes considerable in-hospital mortality in HIV-infected individuals. Currently, the factors that influence in-hospital death of HIV/AIDS patients with T. marneffei infection are not completely clear. Existing machine learning techniques can be used to develop a predictive model that identifies the prognostic factors relevant to death, which appears to be essential to reducing in-hospital mortality.
Year: 2022 PMID: 35507586 PMCID: PMC9067679 DOI: 10.1371/journal.pntd.0010388
Source DB: PubMed Journal: PLoS Negl Trop Dis ISSN: 1935-2727
Fig 1. Workflow for machine learning.
Information such as clinical complications/coinfections and laboratory measures of HIV/AIDS patients with talaromycosis was collected. Different machine-learning methods were evaluated after feature selection to establish the best clinical outcome prediction model.
Fig 2. Change in mortality among HIV/AIDS patients with T. marneffei infection at the Fourth People’s Hospital of Nanning, Guangxi, from 2012 to 2019.
Fig 3. Feature engineering for filtering machine learning predictive model variables.
(A) Percentage of deaths among all patients with different clinical complications/coinfections; χ2 p < 0.05 for all variables. (B) Violin diagram comparing the levels of laboratory measures between the two groups, with p < 0.001 for all items. (C) Spearman’s rank correlation coefficient analysis for 39 laboratory measures. (D) Radar plot of the five most important predictors of death in the XGBoost model. Abbreviations: CD3, CD3+ T-cell count; CD4/CD8, CD4/CD8 ratio; CD8, CD8+ T-cell count; LDL, low-density lipoprotein cholesterol; Ca, calcium; HDL, high-density lipoprotein cholesterol; CREA, creatinine; AST, aspartate aminotransferase; UA, uric acid; LDH, lactate dehydrogenase; Ccr, endogenous creatinine clearance rate; Glu, glucose; CHOL, total cholesterol; TBIL, total bilirubin; AST/ALT, AST/ALT ratio; BUN/CREA, BUN/CREA ratio; BUN, urea nitrogen; K, potassium; IBIL, indirect bilirubin; P, phosphorus; Cl, chlorine; Na, sodium; STY, osmolarity; HCO3, carbonate; Cys-C, serum cystatin C; AG, anion gap; DBIL, direct bilirubin; TBA, total bile acid; Hb, hemoglobin; PLT, platelet; MONO%, monocyte ratio; RDW-CV, red blood cell distribution width; HCT, hematocrit; EOS, eosinophil; EOS%, eosinophil ratio; PDW, platelet distribution width; PCT, platelet distributing width; CRP, C-reactive protein; hsCRP, high-sensitivity C-reactive protein.
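Panel C relies on Spearman's rank correlation, which is the Pearson correlation computed on ranks rather than raw values. The following is a minimal sketch of that calculation, assuming tied values receive the average of their ranks; the function names and the two toy series are illustrative only and are not data from the study.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy, made-up measurements for two laboratory variables (not study data).
measure_a = [20, 35, 50, 80, 120]
measure_b = [150, 210, 260, 400, 900]
print(spearman(measure_a, measure_b))
```

Because only the ordering matters, a strongly skewed laboratory measure can still show a high correlation with another measure as long as the two rise and fall together.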
Fig 4. Performance evaluation of four machine learning models.
(A-B) Receiver operating characteristic curves of the models: AUCs for death in the training (70%) set (A) and the testing (30%) set (B). (C-D) Confusion matrices for the training set (C) and the testing set (D); “0”: “Survival”, “1”: “Death”. (E-F) Precision-recall (PR) curves for death in the training set (E) and the testing set (F). AUC = area under the receiver operating characteristic curve. Precision = true positive/(true positive + false positive); recall = true positive/(true positive + false negative).
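The precision and recall definitions above, together with the other count-based columns of the performance table, can all be derived from a single 2x2 confusion matrix. The sketch below shows one way to do that. The four counts are an assumption: they are not reported in this record and were back-calculated so that they match the KNN testing row of the table to four decimals. The function name is hypothetical.

```python
from math import sqrt

def binary_metrics(tp, fn, fp, tn):
    """Compute count-based metrics for a binary classifier (1 = death)."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient, robust to class imbalance.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# ASSUMED counts, inferred from the reported KNN testing metrics;
# the source does not list the confusion-matrix cells in this record.
knn_test = binary_metrics(tp=41, fn=27, fp=21, tn=490)
for name, value in knn_test.items():
    print(f"{name}: {value:.4f}")
```

With deaths making up a small share of the testing set, accuracy stays high even when sensitivity is modest, which is why the table also reports F1 and MCC. The sketch does not guard against empty classes, so a zero denominator would raise an error.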
The effectiveness of the four machine learning predictive models.
| Classifiers | Datasets | Accuracy | Error | Sensitivity | Specificity | Precision | F1_score | mAP | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| KNN | Training | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| KNN | Testing | 0.9171 | 0.0829 | 0.6029 | 0.9589 | 0.6613 | 0.6308 | 0.5955 | 0.5850 | 0.8514 |
| Logistic | Training | 0.9088 | 0.0912 | 0.4734 | 0.9793 | 0.7876 | 0.5914 | 0.6604 | 0.5659 | 0.7264 |
| Logistic | Testing | 0.9326 | 0.0674 | 0.6324 | 0.9726 | 0.7544 | 0.6880 | 0.7060 | 0.6538 | 0.8025 |
| SVM | Training | 0.9755 | 0.0245 | 0.8245 | 1.0000 | 1.0000 | 0.9038 | 0.9763 | 0.8954 | 0.9122 |
| SVM | Testing | 0.8912 | 0.1088 | 0.4706 | 0.9472 | 0.5424 | 0.5039 | 0.5185 | 0.4446 | 0.7089 |
| XGBoost | Training | | | | | | | | | |
| XGBoost | Testing | | | | | | | | | |
Fig 5. The effect of the 15 top-ranked features on the outcome.
Each row represents a feature, and the horizontal axis shows the SHAP value. Each point represents one sample. SHAP values to the right of zero indicate a positive contribution to the predicted outcome, and values to the left indicate a negative contribution. Point color encodes the feature's own value: the redder the point, the larger the feature value; the bluer the point, the smaller the feature value.
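To make the per-sample SHAP values in this plot more concrete, the sketch below computes exact Shapley values for one sample by averaging each feature's marginal contribution over all feature orderings. It is an illustration under stated assumptions, not the study's method: the toy scoring function, its coefficients, the baseline, and the sample are all invented, and tree models such as XGBoost are normally explained with a specialized tree algorithm rather than this brute-force enumeration.

```python
from itertools import permutations

def exact_shapley(model, x, baseline):
    """Exact Shapley values for one sample.

    Features absent from a coalition are set to their baseline value.
    The cost grows factorially, so this is only usable for a few features.
    """
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]            # reveal feature i
            updated = model(current)
            contrib[i] += updated - previous
            previous = updated
    return [c / len(orderings) for c in contrib]

def toy_score(features):
    """Hypothetical scoring function with made-up coefficients."""
    a, b, c = features
    return 0.004 * a - 0.010 * b + 0.002 * a * c

baseline = [40.0, 120.0, 1.0]   # assumed reference values
sample = [120.0, 90.0, 3.0]     # assumed single patient-like sample

values = exact_shapley(toy_score, sample, baseline)
for idx, v in enumerate(values):
    print(f"feature {idx}: {v:+.4f}")
```

The contributions always sum to the difference between the sample's score and the baseline score, which is why each point's horizontal position in a SHAP plot can be read as that feature's share of the prediction for that sample.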
Fig 6. Analysis of clinical complications/coinfections and laboratory results of misclassified cases.
(A-B) Percentage of deaths among all patients with different clinical complications/coinfections in the training dataset (A) and the testing dataset (B). (C-D) Violin diagram comparing the levels of laboratory measures between the four groups. Survival: patients correctly classified as alive; Death: patients correctly classified as dead; FN: patients whose actual outcome was death but who were classified as alive; FP: patients whose actual outcome was survival but who were classified as dead.