A Benchmark for Data Imputation Methods
Sebastian Jäger, Arndt Allhorn, Felix Bießmann.
Abstract
With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). Also for machine learning (ML) applications, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with respect to the imputation quality and the impact the imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers guide their choice of data preprocessing methods for automated data quality improvement.
Keywords: MAR; MCAR; MNAR; benchmark; data cleaning; data quality; imputation; missing data
Year: 2021 PMID: 34308343 PMCID: PMC8297389 DOI: 10.3389/fdata.2021.693674
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
An overview of related benchmarks. In contrast to our benchmark, all other studies focus on specific aspects such as downstream tasks or missingness conditions. Most importantly, no other paper systematically compares imputation methods trained on both complete and incomplete datasets. Abbreviations: # denotes a count, B baselines, Imp imputation quality, Down impact on the downstream task, Comp complete data, Incomp incomplete data.
| Study | # Datasets/tasks | # B | Pattern | Fraction | Imp | Down | Comp | Incomp |
|---|---|---|---|---|---|---|---|---|
|  | 2 binary classification | 6 | MCAR, MAR | 0%, 10%, 20%, 30%, 40% | No | Yes |  |  |
|  | 5 datasets | 7 |  | 10%, 20%, 30%, 40%, 50% | Yes | No |  |  |
|  | 13 binary classification | 7 |  | 1%– | No | Yes | No | Yes |
|  | 10 classification, 3 regression | 11 | MNAR | 25%, 50%, 75% | Yes | Yes |  |  |
|  | 84 datasets (classification and regression) | 5 | MCAR, MNAR | 10%, 20%, 30%, 40%, 50% | Yes | Yes |  |  |
| Ours | 21 regression, 31 binary classification, 17 multiclass classification | 6 | MCAR, MAR, MNAR | 1%, 10%, 30%, 50% | Yes | Yes | Yes | Yes |
Authors use incomplete datasets and, therefore, do not know the missingness pattern
For a subset of the experiments only, i.e., not systematically.
Applying the MCAR condition to the Height column discards five out of ten values, independently of the height values.
| Height | Height (MCAR) |
|---|---|
| 179.0 | ? |
| 192.0 | ? |
| 189.0 | 189.0 |
| 156.0 | 156.0 |
| 175.0 | ? |
| 170.0 | 170.0 |
| 181.0 | ? |
| 197.0 | ? |
| 156.0 | 156.0 |
| 160.0 | 160.0 |
In the MAR condition, height values are discarded depending on the values of another column, here Gender. All discarded height values correspond to rows in which Gender is M.
| Height | Gender | Height (MAR) |
|---|---|---|
| 200.0 | M | ? |
| 191.0 | M | ? |
| 198.0 | F | 198.0 |
| 155.0 | M | ? |
| 206.0 | M | ? |
| 152.0 | F | 152.0 |
| 175.0 | F | 175.0 |
| 159.0 | M | ? |
| 153.0 | F | 153.0 |
| 209.0 | M | 209.0 |
In the MNAR condition, height values are discarded depending on the height values themselves. All discarded values correspond to small heights.
| Height | Height (MNAR) |
|---|---|
| 154.0 | ? |
| 181.0 | 181.0 |
| 207.0 | 207.0 |
| 194.0 | 194.0 |
| 153.0 | ? |
| 156.0 | ? |
| 198.0 | 198.0 |
| 185.0 | 185.0 |
| 155.0 | ? |
| 164.0 | ? |
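To make the three missingness conditions concrete, the following sketch reproduces the idea of the tables above with pandas and NumPy. It is an illustrative toy, not the benchmark's own implementation; the column names (`height`, `gender`) and helper variables are chosen only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

df = pd.DataFrame({
    "height": rng.normal(175, 15, size=10).round(1),
    "gender": rng.choice(["M", "F"], size=10),
})
n_missing = 5  # discard five of the ten height values, as in the tables above

# MCAR: rows to discard are drawn uniformly at random,
# independent of any observed or unobserved values.
mcar_rows = rng.choice(df.index, size=n_missing, replace=False)
df["height_mcar"] = df["height"].mask(df.index.isin(mcar_rows))

# MAR: the chance of discarding height depends on another observed
# column (gender); here only rows with gender == "M" can be affected.
male_rows = df.index[df["gender"] == "M"]
mar_rows = rng.choice(male_rows, size=min(n_missing, len(male_rows)), replace=False)
df["height_mar"] = df["height"].mask(df.index.isin(mar_rows))

# MNAR: the chance of discarding height depends on the height value
# itself; here the smallest values are discarded.
mnar_rows = df["height"].nsmallest(n_missing).index
df["height_mnar"] = df["height"].mask(df.index.isin(mnar_rows))

print(df)
```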
An overview of all imputation methods and the hyperparameters we optimized. Mean/mode imputation does not have any hyperparameters, and discriminative DL is optimized using autokeras, which is why we do not explicitly define hyperparameter grids for these two methods.
| Imputation method | Hyperparameter | Values | Grid size |
|---|---|---|---|
| Mean/mode | — | — | — |
| k-NN |  | (1, 3, 5) | 3 |
| Random forest |  | (10, 50, 100) | 3 |
| Discriminative DL | — | — | — |
| VAE |  | (0, 1, 2) | 3 |
| GAIN |  | (1, 10) | 16 |
|  |  | (0.7, 0.9) |  |
|  |  | (0.0001, 0.0005) |  |
|  |  | (0.00001, 0.00005) |  |
Optimized using autokeras, see Section 3.4.4.
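A hyperparameter grid like the one above can be searched with a few lines of Python. The sketch below mirrors the grid sizes from the table using scikit-learn imputers as stand-ins; the parameter names (`n_neighbors`, `n_estimators`) follow scikit-learn conventions and are assumptions rather than the paper's notation, and the VAE and GAIN grids would be handled analogously with their own implementations.

```python
from itertools import product

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Candidate imputers and their grids, mirroring the grid sizes above.
# The parameter names are scikit-learn's, not necessarily the paper's.
imputer_grids = {
    "mean/mode": (lambda **p: SimpleImputer(strategy="mean", **p), {}),
    "k-NN": (lambda **p: KNNImputer(**p), {"n_neighbors": [1, 3, 5]}),
    "random forest": (
        lambda **p: IterativeImputer(estimator=RandomForestRegressor(**p)),
        {"n_estimators": [10, 50, 100]},
    ),
}

def grid_candidates(grid):
    """Yield every hyperparameter combination; the grid size is the product of the value counts."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

for name, (make_imputer, grid) in imputer_grids.items():
    for params in grid_candidates(grid):
        imputer = make_imputer(**params)
        print(name, params, type(imputer).__name__)
        # fit the candidate on training data, score its imputation quality,
        # and keep the best-performing configuration per method
```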
Overview of our experimental settings. We focus on covering an extensive range of the dimensions described in Section 2. We run one experiment for every combination of these parameters and repeat each experiment three times, reporting the mean imputation/downstream score.
| Parameter | Values |
|---|---|
| Datasets | 69 |
| Imputation methods | Mean/mode, k-NN, random forest, discriminative DL, VAE, GAIN |
| Missingness patterns | MCAR, MAR, MNAR |
| Missingness fractions | 1%, 10%, 30%, 50% |
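Assuming one experiment per combination of the parameters listed above, the experimental conditions can be enumerated with `itertools.product`. The dataset names below are placeholders, and the printed totals are simply the product of the table's dimensions, not figures taken from the paper.

```python
from itertools import product

datasets = [f"dataset_{i}" for i in range(69)]  # placeholder names for the 69 datasets
imputation_methods = ["mean/mode", "k-NN", "random forest",
                      "discriminative DL", "VAE", "GAIN"]
patterns = ["MCAR", "MAR", "MNAR"]
fractions = [0.01, 0.10, 0.30, 0.50]
n_repetitions = 3

conditions = list(product(datasets, imputation_methods, patterns, fractions))
print(len(conditions))                   # parameter combinations from the table above
print(len(conditions) * n_repetitions)   # total runs, since each combination is repeated three times
```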
FIGURE 1 Imputation ranks of the imputation methods trained on complete data. Ranks are computed for each experimental condition characterized by the dataset, missingness pattern, and missingness ratio. Since we compare six imputation methods, the possible imputation ranks range between 1 and 6. In most conditions, random forest, k-NN, and discriminative DL perform best. Generative deep learning methods tend to perform worst. In the most challenging MNAR condition, mean/mode imputation achieves competitive results.
FIGURE 2 Imputation ranks of the imputation methods trained on incomplete data. Ranks are computed for each experimental condition characterized by the dataset, missingness pattern, and missingness ratio. Since we compare six imputation methods, the possible imputation ranks range between 1 and 6. Similar to training on fully observed data, random forest, k-NN, and discriminative DL perform better than generative deep learning methods in most settings. In the MNAR conditions, the imputation quality of all approaches degrades in favor of mean/mode imputation, which outperforms the other methods for some missingness fractions.
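The per-condition ranking described in Figures 1 and 2 can be reproduced from a results table with a grouped rank, for example with pandas. The sketch below uses illustrative column names and toy scores; a lower-is-better error (RMSE) is assumed, which need not match the paper's exact metric.

```python
import pandas as pd

# Toy results: one row per (dataset, pattern, fraction, method) with an
# imputation error (RMSE assumed here, lower is better). Names are illustrative.
results = pd.DataFrame({
    "dataset":  ["d1"] * 3 + ["d2"] * 3,
    "pattern":  ["MCAR"] * 6,
    "fraction": [0.1] * 6,
    "method":   ["mean/mode", "k-NN", "random forest"] * 2,
    "rmse":     [0.9, 0.5, 0.4, 1.1, 0.7, 0.6],
})

# Rank methods within each experimental condition (dataset x pattern x fraction);
# rank 1 is best, and with all six methods the ranks would range from 1 to 6.
results["rank"] = (
    results.groupby(["dataset", "pattern", "fraction"])["rmse"]
           .rank(method="min")
)
print(results)
```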
FIGURE 3 Does imputation on incomplete test data improve the predictive performance of a downstream ML model? We plot the improvement of the downstream ML model after imputation with imputation models trained on fully observed data. The downstream performance is compared to the performance obtained on incomplete test data, normalized by the ML model performance on fully observed test data. Overall, the classical ML methods and discriminative DL perform best, in some conditions achieving relative improvements of 10% and more.
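One plausible reading of this normalization is sketched below; the exact formula is not reproduced in this record, so treat the function as an assumption rather than the paper's definition.

```python
def relative_improvement(score_imputed, score_incomplete, score_complete):
    """Downstream improvement after imputation, relative to incomplete test data
    and normalized by the score on fully observed test data (assumed reading)."""
    return (score_imputed - score_incomplete) / score_complete

# Example: imputation recovers part of the accuracy lost to missing values.
print(relative_improvement(0.82, 0.75, 0.85))  # ~0.08, i.e., roughly an 8% improvement
```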
FIGURE 4 Impact on the downstream task of the six imputation methods trained on incomplete data. In regression tasks, no considerable improvements are achieved, and in some cases imputation even degrades the downstream ML model's performance. In classification tasks, in contrast, we observe slightly positive effects in some settings, but negative effects predominate in the harder settings.
Training and inference durations for each imputation method in seconds. We use wall-clock run time to measure the durations for training (including hyperparameter optimization) and inference on all datasets with the MCAR missingness pattern and all fractions shown in Table 6. Because training and inference durations depend heavily on the dataset size, we first compute the mean duration and the relative standard deviation for each imputation method on every dataset, and then average these per-dataset values across datasets, reporting them as mean duration and relative standard deviation, separately for training and inference.
| Imputation method | Training mean duration (s) | Training relative SD | Inference mean duration (s) | Inference relative SD |
|---|---|---|---|---|
| Mean/mode | 0.005 | 0.550 | 0.029 | 0.171 |
| k-NN | 41.204 | 0.254 | 7.018 | 0.602 |
| Random forest | 226.077 | 0.119 | 24.048 | 0.236 |
| Discriminative DL | 6,275.019 | 0.405 | 440.389 | 0.211 |
| VAE | 71.095 | 0.099 | 11.215 | 0.085 |
| GAIN | 878.058 | 0.312 | 137.966 | 0.083 |
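Wall-clock durations and relative standard deviations of this kind can be collected with `time.perf_counter`, for example as in the sketch below. The helper names are hypothetical, and the population standard deviation is assumed for the relative SD.

```python
import time
import numpy as np

def timed(fn, *args, **kwargs):
    """Return a function's result together with its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def mean_and_relative_sd(durations):
    """Mean duration and relative standard deviation (population SD divided by the mean)."""
    durations = np.asarray(durations, dtype=float)
    return durations.mean(), durations.std() / durations.mean()

# Example with a trivial stand-in for an imputer's fit/transform call.
_, duration = timed(sorted, range(1_000_000))
print(duration)
print(mean_and_relative_sd([duration, duration * 1.2, duration * 0.8]))
```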