Magnus G Ahlström, Andreas Ronit, Lars Haukali Omland, Søren Vedel, Niels Obel.
Abstract
BACKGROUND: Late HIV diagnosis is detrimental both to the individual and to society. Strategies to improve early diagnosis of HIV must be a key health care priority. We examined whether nationwide electronic registry data could be used to predict HIV status using machine learning algorithms.
Keywords: Big data; HIV diagnosis; HIV prevention; Machine learning; Registry data
Year: 2019 PMID: 31891137 PMCID: PMC6933258 DOI: 10.1016/j.eclinm.2019.10.016
Source DB: PubMed Journal: EClinicalMedicine ISSN: 2589-5370
Clinical characteristics of the training and validation cohort.
| | Training cohort | Validation cohort |
|---|---|---|
| Total number of individuals | 2,972,264 | 1,411,914 |
| HIV+, n (%) | 3,063 (0·1%) | 1,287 (0·1%) |
| Male, n (%) | 1,470,420 (49%) | 652,912 (46%) |
| 15–25, n (%) | 383,521 (13%) | 149,525 (11%) |
| 25–35, n (%) | 333,662 (11%) | 232,098 (16%) |
| 35–45, n (%) | 439,223 (15%) | 168,687 (12%) |
| 45–55, n (%) | 408,877 (14%) | 203,669 (14%) |
| 55+, n (%) | 1,406,981 (47%) | 657,935 (47%) |
| Denmark, n (%) | 2,623,960 (88%) | 1,288,389 (91%) |
| Scandinavia, n (%) | 34,075 (1%) | 13,315 (1%) |
| Other Countries, n (%) | 280,223 (9%) | 100,519 (7%) |
| Unknown, n (%) | 34,006 (1%) | 9,691 (1%) |
| Primary school, n (%) | 929,426 (31%) | 403,871 (29%) |
| High school, n (%) | 160,838 (5%) | 92,277 (7%) |
| Vocational internships and main course, n (%) | 829,897 (28%) | 417,074 (30%) |
| Short higher education, n (%) | 89,474 (3%) | 44,397 (3%) |
| Middle length higher education, n (%) | 322,035 (11%) | 165,348 (12%) |
| Bachelor and long higher education, n (%) | 192,771 (6%) | 95,981 (7%) |
| PhD Programs, n (%) | 10,187 (<1%) | 4,256 (<1%) |
| Unknown, n (%) | 437,636 (15%) | 188,710 (13%) |
| Married, n (%) | 1,331,231 (45%) | 622,066 (44%) |
| Divorced, n (%) | 313,083 (11%) | 154,192 (11%) |
| Widowhood, n (%) | 364,476 (12%) | 163,582 (12%) |
| Registered partnership, n (%) | 4,280 (<1%) | 1,863 (<1%) |
| Cancelled registered partnership, n (%) | 1,322 (<1%) | 625 (<1%) |
| Longest living of two partners, n (%) | 330 (<1%) | 125 (<1%) |
| Unknown, n (%) | 85,289 (3%) | 33,911 (2%) |
| Self-employed, n (%) | 93,678 (3%) | 41,650 (3%) |
| Employed spouse, n (%) | 3,273 (<1%) | 1,654 (<1%) |
| Wage earner with own business, n (%) | 39,663 (1%) | 18,179 (1%) |
| Wage earner without business, n (%) | 1,174,574 (40%) | 610,584 (43%) |
| Wage earner with support, n (%) | 6,074 (<1%) | 2,989 (<1%) |
| Senior citizen with own business, n (%) | 21,963 (1%) | 9,666 (1%) |
| Senior citizen, n (%) | 1,059,706 (36%) | 499,156 (35%) |
| Others, n (%) | 524,385 (18%) | 211,455 (15%) |
| Unknown, n (%) | 48,948 (2%) | 16,581 (1%) |
| Capital region, n (%) | 761,306 (26%) | 363,318 (26%) |
| Large city region, n (%) | 352,530 (12%) | 174,090 (12%) |
| Hinterland region, n (%) | 467,034 (16%) | 219,237 (16%) |
| Provincial region, n (%) | 657,852 (22%) | 311,164 (22%) |
| Rural region, n (%) | 648,253 (22%) | 311,194 (22%) |
| Unknown, n (%) | 85,289 (3%) | 32,911 (2%) |
Fig. 1. Training and validation sample ROC curves. The figures show the performance in the training sample (a) and the validation sample (b) of the three different models based on the best-performing algorithm (GLMridge). Each point on the graphs represents the sensitivity and specificity at a particular cut-off of the risk score, which is calculated from the parameters obtained by fitting the different models.
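How each point on such a curve arises can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `roc_points`, the toy scores and labels, and the convention that a person tests positive when the risk score is at or above the cut-off are all assumptions made for the example.

```python
def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for every distinct cut-off.

    Assumption for this sketch: a person is test-positive when
    score >= cutoff, and labels are 1 (HIV+) or 0 (HIV-).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        # Count true and false positives at this cut-off.
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, tp / positives, 1 - fp / negatives))
    return points


# Hypothetical toy data: six people, two of them HIV+.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0, 0]

for cutoff, sens, spec in roc_points(toy_scores, toy_labels):
    print(f"cut-off {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cut-off moves along the curve towards higher sensitivity and lower specificity, which is the trade-off the tables below quantify.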
Confusion matrices and performance characteristics (GLMridge algorithm).
| Age, sex and STIs | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 1,172 | 797,138 | 798,310 | Sensitivity | 0·911 (0·894 - 0·926) | |
| Test- | 115 | 613,489 | 613,604 | Specificity | 0·435 (0·434 - 0·436) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0015 (0·0014 - 0·0016) | |
| NPV | 0·9998 (0·9998 - 0·9998) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence and main source of income | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 1,108 | 487,880 | 488,988 | Sensitivity | 0·861 (0·841 - 0·879) | |
| Test- | 179 | 922,747 | 922,926 | Specificity | 0·654 (0·653 - 0·655) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0023 (0·0021 - 0·0024) | |
| NPV | 0·9998 (0·9998 - 0·9998) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence, main source of income and medical history | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 1,110 | 424,654 | 425,764 | Sensitivity | 0·862 (0·842 - 0·881) | |
| Test- | 177 | 985,973 | 986,150 | Specificity | 0·699 (0·698 - 0·700) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0026 (0·0025 - 0·0028) | |
| NPV | 0·9998 (0·9998 - 0·9998) | |||||
| Age, sex and STIs | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 35 | 1,667 | 1,702 | Sensitivity | 0·027 (0·019 - 0·038) | |
| Test- | 1,252 | 1,408,960 | 1,410,212 | Specificity | 0·999 (0·999 - 0·999) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0206 (0·0144 - 0·0285) | |
| NPV | 0·9992 (0·9991 - 0·9992) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence and main source of income | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 77 | 1,042 | 1,119 | Sensitivity | 0·060 (0·048 - 0·074) | |
| Test- | 1,210 | 1,409,585 | 1,410,795 | Specificity | 0·999 (0·999 - 0·999) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0688 (0·0547 - 0·0853) | |
| NPV | 0·9992 (0·9991 - 0·9992) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence, main source of income and medical history | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 104 | 1,156 | 1,260 | Sensitivity | 0·081 (0·067 - 0·097) | |
| Test- | 1,183 | 1,409,471 | 1,410,654 | Specificity | 0·999 (0·999 - 0·999) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0825 (0·0679 - 0·0991) | |
| NPV | 0·9992 (0·9991 - 0·9992) | |||||
| Age, sex and STIs | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 952 | 424,326 | 425,278 | Sensitivity | 0·740 (0·715 - 0·763) | |
| Test- | 335 | 986,301 | 986,636 | Specificity | 0·699 (0·698 - 0·700) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0022 (0·0021 - 0·0024) | |
| NPV | 0·9997 (0·9997 - 0·9998) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence and main source of income | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 919 | 246,857 | 247,776 | Sensitivity | 0·714 (0·689 - 0·739) | |
| Test- | 368 | 1,163,770 | 1,164,138 | Specificity | 0·825 (0·824 - 0·826) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0037 (0·0035 - 0·004) | |
| NPV | 0·9997 (0·9996 - 0·9997) | |||||
| Age, sex, origin of birth, educational attainment, marital status, place of residence, main source of income and medical history | ||||||
| HIV+ | HIV- | Total | ||||
| Test+ | 981 | 211,973 | 212,954 | Sensitivity | 0·762 (0·738 - 0·785) | |
| Test- | 306 | 1,198,654 | 1,198,960 | Specificity | 0·850 (0·849 - 0·850) | |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0046 (0·0043 - 0·0049) | |
| NPV | 0·9997 (0·9997 - 0·9998) |
The tables show confusion matrices and the observed sensitivities, specificities, positive predictive values (PPVs) and negative predictive values (NPVs) of the best-performing GLMridge model. The cut-off for each model is the risk score that yields the desired value in the training data, so the sensitivities and specificities observed in the validation set may differ slightly from the target.
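The four performance measures follow directly from the cell counts of each confusion matrix. The sketch below recomputes the point estimates for one matrix from the tables above (GLMridge, full model including medical history: 1,110 true positives, 424,654 false positives, 177 false negatives, 985,973 true negatives). The class name `ConfusionMatrix` is an assumption for illustration, and the confidence intervals are not reproduced because the record does not state how they were calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # Test+ and HIV+
    fp: int  # Test+ and HIV-
    fn: int  # Test- and HIV+
    tn: int  # Test- and HIV-

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Counts taken from the GLMridge full-model matrix in the table above.
ridge_full = ConfusionMatrix(tp=1_110, fp=424_654, fn=177, tn=985_973)

print(f"Sensitivity {ridge_full.sensitivity:.3f}")
print(f"Specificity {ridge_full.specificity:.3f}")
print(f"PPV         {ridge_full.ppv:.4f}")
print(f"NPV         {ridge_full.npv:.4f}")
```

The very low PPV despite reasonable sensitivity and specificity reflects the low prevalence in the validation cohort (1,287 HIV+ among 1,411,914 individuals).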
Confusion matrices and performance characteristics of different algorithms.
| 3a: simple logistic regression algorithm sensitivity of 0·90 (high coverage - screening) | |||||
| HIV+ | HIV- | Total | |||
| Test+ | 1,107 | 424,879 | 425,986 | Sensitivity | 0·860 (0·840 - 0·879) |
| Test- | 180 | 985,748 | 985,928 | Specificity | 0·699 (0·698 - 0·700) |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0026 (0·0024 - 0·0027) |
| NPV | 0·9998 (0·9998 - 0·9998) | ||||
| 3b: Random forest algorithm sensitivity of 0·90 (high coverage - screening) | |||||
| HIV+ | HIV- | Total | |||
| Test+ | 483 | 68,763 | 69,246 | Sensitivity | 0·375 (0·349 - 0·402) |
| Test- | 804 | 1,341,864 | 1,342,668 | Specificity | 0·954 (0·953 - 0·954) |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0070 (0·0063 - 0·0067) |
| NPV | 0·9994 (0·9994 - 0·9995) | ||||
| 3c: Lasso regression algorithm sensitivity of 0·90 (high coverage - screening) | |||||
| HIV+ | HIV- | Total | |||
| Test+ | 1,120 | 439,787 | 440,907 | Sensitivity | 0·870 (0·851 - 0·888) |
| Test- | 167 | 970,840 | 971,007 | Specificity | 0·688 (0·687 - 0·689) |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0025 (0·0023 - 0·0027) |
| NPV | 0·9998 (0·9998 - 0·9999) | ||||
| 3d: Ridge regression algorithm sensitivity of 0·90 (high coverage - screening) | |||||
| HIV+ | HIV- | Total | |||
| Test+ | 1,110 | 424,654 | 425,764 | Sensitivity | 0·862 (0·842 - 0·881) |
| Test- | 177 | 985,973 | 986,150 | Specificity | 0·699 (0·698 - 0·700) |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0026 (0·0025 - 0·0028) |
| NPV | 0·9998 (0·9998 - 0·9998) | ||||
| 3e: Elastic net penalty regression (α = 0.992) algorithm with Synthetic minority oversampling (SMOTE) and sensitivity of 0·90 (high coverage - screening) | |||||
| HIV+ | HIV- | Total | |||
| Test+ | 723 | 124,770 | 125,493 | Sensitivity | 0·562 (0·534 - 0·589) |
| Test- | 564 | 1,285,857 | 1,286,421 | Specificity | 0·912 (0·911 - 0·912) |
| Total | 1,287 | 1,410,627 | 1,411,914 | PPV | 0·0058 (0·0054 - 0·0062) |
| NPV | 0·9996 (0·9995 - 0·9996) |
The tables show confusion matrices and the observed sensitivities, specificities, positive predictive values (PPVs) and negative predictive values (NPVs) of five different algorithms. The cut-off for each model is the risk score that yields the desired value in the training data, so the sensitivities and specificities observed in the validation set may differ slightly from the target.
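The calibration step described in the note can be illustrated as follows: pick the cut-off on the training data that reaches a target sensitivity, then apply the same cut-off unchanged to the validation data. This is a minimal sketch under stated assumptions, not the authors' procedure: the helper names, the toy scores, and the rule "test-positive when score >= cut-off" are hypothetical.

```python
import math


def cutoff_for_sensitivity(train_scores, train_labels, target):
    """Highest cut-off whose training sensitivity is at least `target`.

    Assumption for this sketch: test-positive means score >= cutoff.
    """
    pos_scores = sorted(
        (s for s, y in zip(train_scores, train_labels) if y == 1),
        reverse=True,
    )
    # Number of HIV+ individuals that must score at or above the cut-off.
    needed = math.ceil(target * len(pos_scores))
    return pos_scores[needed - 1]


def sensitivity_at(scores, labels, cutoff):
    """Share of HIV+ individuals with a score at or above the cut-off."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= cutoff for s in pos) / len(pos)


# Hypothetical toy data: four HIV+ and four HIV- individuals per sample.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
train_scores = [0.8, 0.6, 0.5, 0.2, 0.7, 0.3, 0.1, 0.05]
valid_scores = [0.7, 0.55, 0.4, 0.3, 0.6, 0.2, 0.1, 0.05]

cutoff = cutoff_for_sensitivity(train_scores, toy_labels, target=0.75)
print("cut-off chosen on training data:", cutoff)
print("training sensitivity:", sensitivity_at(train_scores, toy_labels, cutoff))
print("validation sensitivity:", sensitivity_at(valid_scores, toy_labels, cutoff))
```

In the toy data the cut-off meets the target in the training sample but not in the validation sample, which is the same kind of gap seen between the nominal sensitivity in the table headings and the observed validation values.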
Fig. 2. Training and validation sample ROC curves using different machine learning algorithms. The figure shows ROC curves for the training data (2a) and validation data (2b) for simple logistic regression, the best-performing random forest algorithm, the best-performing GLMridge algorithm, the best-performing GLMLasso algorithm, and the best-performing logistic regression with elastic net penalty, with the synthetic minority oversampling technique (SMOTE) applied prior to analysis.