Theresa Roland, Carl Böck, Thomas Tschoellitsch, Alexander Maletzky, Sepp Hochreiter, Jens Meier, Günter Klambauer.
Abstract
Many previous studies claim to have developed machine learning models that diagnose COVID-19 from blood tests. However, we hypothesize that changes in the underlying distribution of the data, so-called domain shifts, affect predictive performance and reliability and are a reason why such machine learning models fail in clinical application. Domain shifts can be caused, e.g., by changes in disease prevalence (spreading or tested population), by refined RT-PCR testing procedures (way of taking samples, laboratory procedures), or by virus mutations. Therefore, machine learning models for diagnosing COVID-19 or other diseases may not be reliable and may degrade in performance over time. We investigate whether domain shifts are present in COVID-19 datasets and how they affect machine learning methods. We further set out to estimate the mortality risk based on routinely acquired blood tests in a hospital setting throughout pandemics and under domain shifts. We reveal domain shifts by evaluating the models on a large-scale dataset with different assessment strategies, such as temporal validation. We present the novel finding that domain shifts strongly affect machine learning models for COVID-19 diagnosis and deteriorate their predictive performance and credibility. Therefore, frequent re-training and re-assessment are indispensable for robust models enabling clinical utility.
Keywords: Blood test; COVID-19; Domain shift; Machine learning
Year: 2022 PMID: 35348909 PMCID: PMC8960704 DOI: 10.1007/s10916-022-01807-1
Source DB: PubMed Journal: J Med Syst ISSN: 0148-5598 Impact factor: 4.920
Fig. 1 Examples of temporal domain shifts in COVID-19 datasets, which might diminish the ML model’s predictive performance over time. COVID-19 numbers in Austria over time, illustrating factors causing a temporal domain shift. The numbers are sketched according to data from the Austrian BMSGPK [58]
Dataset with summary of patient characteristics
| Cohort | N cases^a | N positives | N negatives | Age (mean ± sd) | Sex (f/m), (f%) | Adm. type (i/o), (i%)^b |
|---|---|---|---|---|---|---|
| Full dataset | 79 884 | 1037 | 79 053 | 53.4 ± 25.3 | 41 589/38 295 (52.1%) | 50 727/29 157 (63.5%) |
| 2019 cohort | 70 870 | - | 70 870 | 52.8 ± 25.1 | 36 934/33 936 (52.1%) | 42 791/28 079 (60.4%) |
| 2020 cohort | 9014 | 1037 | 8183 | 58.0 ± 26.4 | 4655/4359 (51.6%) | 7936/1078 (88.0%) |
| Negatives cohort | 79 053 | - | 79 053 | 53.3 ± 25.4 | 41 213/37 840 (52.1%) | 50 020/29 033 (63.3%) |
| Positives cohort | 1037 | 1037 | - | 64.3 ± 20.2 | 455/582 (43.9%) | 908/129 (87.6%) |
| Survivors | 919 | 919 | - | 62.7 ± 20.5 | 417/502 (45.4%) | 790/129 (86.0%) |
| Deceased | 118 | 118 | - | 76.6 ± 11.8 | 38/80 (32.2%) | 118/0 (100%) |
| March–October 2020 (training and validation cohort for prospective assessment) | 6504 | 291 | 6277 | 57.0 ± 27.3 | 3416/3088 (52.5%) | 5720/784 (87.9%) |
| November–December 2020 (test cohort for prospective assessment) | 2636 | 785 | 1982 | 60.8 ± 24.1 | 1293/1343 (49.1%) | 2335/301 (88.6%) |
^a Multiple samples can be obtained from one case. Therefore, one case can be contained in both the positives and the negatives cohort due to a change in the COVID-19 diagnosis, e.g., the patient might have been infected during the hospital stay, or the patient’s coronavirus load might have decreased, yielding a negative test result
^b Adm. type: Admission type, i: inpatient, o: outpatient
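The prospective assessment in the last two table rows corresponds to a temporal split: samples before a cutoff date are used for training and validation, later samples for testing. A minimal sketch of such a split follows, assuming each sample is a dictionary with a `date` field. The function name `temporal_split`, the field names, and the toy samples are hypothetical and not taken from the paper.

```python
from datetime import date


def temporal_split(samples, cutoff):
    """Split samples by date: strictly before `cutoff` goes to training,
    `cutoff` and later goes to testing (assumed convention)."""
    train = [s for s in samples if s["date"] < cutoff]
    test = [s for s in samples if s["date"] >= cutoff]
    return train, test


# Hypothetical toy samples; label 1 = RT-PCR positive, 0 = negative.
samples = [
    {"date": date(2020, 3, 15), "label": 0},
    {"date": date(2020, 10, 31), "label": 1},
    {"date": date(2020, 11, 1), "label": 1},
    {"date": date(2020, 12, 24), "label": 0},
]

# March-October 2020 for training/validation, November-December 2020 for testing.
train, test = temporal_split(samples, cutoff=date(2020, 11, 1))
```

Unlike a random split, no sample in the test set predates any training sample, so the test performance reflects how the model behaves on data collected after it was trained.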
Fig. 2 Large-scale COVID-19 dataset. a: Block diagram of the dataset structure. The blood tests from 2019 (blood tests 2019) are all negatives and are pre-processed to form the 2019 cohort. The COVID-19 RT-PCR test results and the blood tests are merged to form the 2020 cohort. The negatives cohort consists of the 2019 cohort (pre-pandemic samples) and the negative samples of the 2020 cohort. The positively tested cases (positives cohort) are further divided into survivors and deceased. Note that one case can be in both the negatives and the positives cohort due to a change in COVID-19 status. Multiple samples are obtained from one case if RT-PCR and blood tests are measured repeatedly. b: Aggregation of the blood tests for the COVID-19 tested patients. The blood tests from the last 48 h before the COVID-19 test are merged into one sample. If a feature is measured multiple times, the most recent value is used. Patient-specific data, namely age, sex and hospital admission type, are added to the sample
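The aggregation described in panel b can be sketched as follows. This is an illustrative sketch only: the function name `aggregate_blood_tests`, the tuple layout of the measurements, the inclusive window boundaries, and the example values are assumptions, not the authors' implementation.

```python
from datetime import datetime, timedelta


def aggregate_blood_tests(measurements, pcr_time, patient, window_hours=48):
    """Merge blood test measurements taken within `window_hours` before the
    RT-PCR test into one sample, keeping the most recent value per feature.

    measurements: iterable of (timestamp, feature_name, value) tuples.
    patient: dict with patient-specific data (age, sex, admission type).
    """
    window_start = pcr_time - timedelta(hours=window_hours)
    sample = {}
    latest_seen = {}  # feature name -> timestamp of the value currently kept
    for timestamp, feature, value in measurements:
        # Assumption: both window boundaries are inclusive.
        if not (window_start <= timestamp <= pcr_time):
            continue
        if feature not in latest_seen or timestamp > latest_seen[feature]:
            latest_seen[feature] = timestamp
            sample[feature] = value
    sample.update(patient)  # add age, sex and admission type
    return sample


# Hypothetical example values.
pcr_time = datetime(2020, 11, 10, 12, 0)
measurements = [
    (datetime(2020, 11, 8, 9, 0), "MCV", 88.0),   # 51 h before the test: outside the window
    (datetime(2020, 11, 9, 8, 0), "MCV", 90.0),
    (datetime(2020, 11, 10, 7, 0), "MCV", 91.5),  # most recent MCV value
    (datetime(2020, 11, 9, 8, 0), "MCH", 29.0),
]
patient = {"age": 64, "sex": "m", "admission_type": "inpatient"}
sample = aggregate_blood_tests(measurements, pcr_time, patient)
```

Features that were not measured inside the window are simply absent from the resulting sample; how such missing values are handled downstream is not covered by this sketch.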
Fig. 3 Statistics of blood test features of the positives cohort. The change in the statistics over time indicates a change in the underlying distribution and the presence of domain shifts. Abbreviations: mean cell hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV)
Fig. 4 Comparison of expected and actual performance. a: The actual model performance is calculated for each month from June to December 2020, and the expected model performance is calculated on the respective previous month. The ROC AUCs of two consecutive months, which correspond to expected and actual performance, are compared. b: The expected and actual performance with 95% confidence intervals. The expected and actual ROC AUC differ significantly in December, and the PR AUC differs significantly in November and December, showing the effect of the domain shifts on model credibility. Note that the PR AUC is sensitive to changes in prevalence
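The month-by-month comparison in panel a can be illustrated with a small sketch. It assumes that model scores are already available for every month and that they come from a model fixed before the two months being compared; the caption does not specify how the model is trained for each comparison. The helper names `roc_auc` and `expected_vs_actual` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive is scored higher
    than a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_vs_actual(monthly):
    """monthly: chronologically ordered list of (month, labels, scores).

    For each month after the first, pair the ROC AUC of the previous month
    (expected performance) with the ROC AUC of that month (actual)."""
    aucs = [(month, roc_auc(labels, scores)) for month, labels, scores in monthly]
    return [
        (month, expected, actual)
        for (_, expected), (month, actual) in zip(aucs, aucs[1:])
    ]


# Hypothetical toy scores whose ranking quality degrades over time.
monthly = [
    ("2020-10", [1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    ("2020-11", [1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1]),
    ("2020-12", [1, 1, 0, 0], [0.6, 0.2, 0.5, 0.3]),
]
comparison = expected_vs_actual(monthly)
```

A persistent gap between the expected and the actual value is the signal of interest; the sketch omits the 95% confidence intervals shown in panel b, which would be needed to judge whether a gap is significant.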
Threshold-independent performance metrics for COVID-19 diagnosis prediction (experiments (i)-(iii)). The mean and the standard deviation (±) of the ROC AUC and PR AUC over the five random seeds are listed. Note that the PR AUC depends on the class prior, which changes with the different assessment strategies. E.g., the class prior in the test set of experiment (iii) is higher because disease prevalence in the evaluation months November and December is higher. The performance estimates of a random estimator (RE) and the best feature (BF) are listed for comparison. The highest performance metrics per experiment are printed in bold
| Model | ROC AUC (i) | PR AUC (i) | ROC AUC (ii) | PR AUC (ii) | ROC AUC (iii) | PR AUC (iii) |
|---|---|---|---|---|---|---|
| RE | 0.5000 ± 0.0000 | 0.0124 ± 0.0000 | 0.5000 ± 0.0000 | 0.0822 ± 0.0000 | 0.5000 ± 0.0000 | 0.3162 ± 0.0000 |
| BF | 0.6745 ± 0.0000 | 0.0221 ± 0.0000 | 0.6774 ± 0.0000 | 0.3141 ± 0.0000 | 0.6623 ± 0.0000 | 0.5716 ± 0.0000 |
| SNN | 0.9567 ± 0.0025 | 0.4349 ± 0.0306 | 0.8998 ± 0.0044 | 0.5577 ± 0.0074 | 0.7836 ± 0.0053 | 0.6620 ± 0.0082 |
| KNN | 0.9071 ± 0.0000 | 0.3137 ± 0.0000 | 0.8432 ± 0.0000 | 0.4486 ± 0.0000 | 0.7209 ± 0.0000 | 0.5712 ± 0.0000 |
| LR | 0.9600 ± 0.0008 | 0.4126 ± 0.0145 | 0.8878 ± 0.0022 | 0.4770 ± 0.0086 | 0.7732 ± 0.0008 | 0.6467 ± 0.0059 |
| SVM | 0.9611 ± 0.0000 | 0.4268 ± 0.0000 | 0.9045 ± 0.0000 | 0.5573 ± 0.0000 | 0.7759 ± 0.0000 | 0.6387 ± 0.0000 |
| RF | | 0.5231 ± 0.0106 | 0.9138 ± 0.0025 | 0.5761 ± 0.0100 | 0.7957 ± 0.0025 | 0.6626 ± 0.0049 |
| XGB | 0.9629 ± 0.0000 | | | | | |
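The caption's note that the PR AUC depends on the class prior can be illustrated with an uninformative estimator that assigns the same score to every sample: its precision equals the fraction of positives at the only available threshold. The sketch below uses average precision as a step-wise PR AUC estimate. The function `average_precision` is a hypothetical helper written for this illustration, and the label vectors are toy data, not the study's test sets.

```python
def average_precision(labels, scores):
    """Average precision: sum over thresholds of (recall gain) * precision.
    Samples with tied scores share a single threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, true_pos, i = 0.0, 0, 0
    while i < len(ranked):
        j, tp_before = i, true_pos
        # Consume every sample that shares the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        precision = true_pos / j  # j samples are predicted positive so far
        ap += (true_pos - tp_before) / n_pos * precision
        i = j
    return ap


# Same uninformative estimator, two different class priors (toy data).
high_prevalence = [1] + [0] * 3   # 25% positives
low_prevalence = [1] + [0] * 99   # 1% positives
ap_high = average_precision(high_prevalence, [0.5] * 4)
ap_low = average_precision(low_prevalence, [0.5] * 100)
```

This is consistent with the RE row of the table, where the ROC AUC stays at 0.5 in all three experiments while the PR AUC changes with the fraction of positives in each test set.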
Threshold-independent performance metrics for mortality prediction (experiments (iv)-(v)). The mean and the standard deviation (±) of the ROC AUC and PR AUC over the five random seeds are listed. Note that the PR AUC depends on the class prior, which changes with the different assessment strategies. The highest performance metrics per experiment are printed in bold
| Model | ROC AUC (iv) | PR AUC (iv) | ROC AUC (v) | PR AUC (v) |
|---|---|---|---|---|
| RE | 0.5000 ± 0.0000 | 0.1592 ± 0.0351 | 0.5000 ± 0.0000 | 0.1320 ± 0.0000 |
| BF | 0.7599 ± 0.0748 | 0.4320 ± 0.1021 | 0.7483 ± 0.0000 | 0.3938 ± 0.0000 |
| SNN | 0.8656 ± 0.0356 | 0.5866 ± 0.1196 | 0.8478 ± 0.0053 | 0.4917 ± 0.0110 |
| KNN | 0.8207 ± 0.0550 | 0.5527 ± 0.1137 | 0.8272 ± 0.0000 | 0.4669 ± 0.0000 |
| LR | 0.8613 ± 0.0351 | 0.5555 ± 0.1281 | 0.8388 ± 0.0088 | 0.4784 ± 0.0173 |
| SVM | 0.8587 ± 0.0306 | 0.5679 ± 0.1010 | 0.8271 ± 0.0000 | 0.4185 ± 0.0001 |
| RF | | | | |
| XGB | 0.8501 ± 0.0210 | 0.5196 ± 0.1005 | 0.8038 ± 0.0000 | 0.4334 ± 0.0013 |