Burim Ramosaj, Justus Tulowietzki, Markus Pauly.
Abstract
Missing covariates in regression or classification problems can prohibit the direct use of advanced tools for further analysis. Recent research shows an increasing trend towards the use of modern Machine-Learning algorithms for imputation, motivated by their favorable prediction accuracy in a range of learning problems. In this work, we analyze through simulation the interaction between imputation accuracy and prediction accuracy in regression learning problems with missing covariates when Machine-Learning-based methods are used for both imputation and prediction. We see that even a slight decrease in imputation accuracy can seriously affect the prediction accuracy. In addition, we explore imputation performance when using statistical inference procedures in prediction settings, such as the coverage rates of (valid) prediction intervals. Our analysis is based on empirical datasets provided by the UCI Machine Learning Repository and an extensive simulation study.
Keywords: bagging; boosting; imputation accuracy; missing covariates; prediction accuracy; prediction intervals
Year: 2022 PMID: 35327897 PMCID: PMC8947649 DOI: 10.3390/e24030386
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
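The study design described in the abstract (delete covariate values, impute them, then measure both imputation and prediction accuracy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy linear data, the missing-completely-at-random deletion, mean imputation and least squares as simple stand-ins for the Machine-Learning-based imputation and prediction methods, and the particular normalization of the imputation error. None of the function names come from the paper.

```python
import numpy as np

def make_data(n, rng):
    """Toy linear regression data with correlated covariates (illustrative assumption)."""
    cov = np.array([[1.0, 0.6, 0.3],
                    [0.6, 1.0, 0.6],
                    [0.3, 0.6, 1.0]])
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
    return X, y

def delete_at_random(X, rate, rng):
    """Set each covariate entry to NaN independently with probability `rate`."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def impute_mean(X_miss):
    """Stand-in imputer: fill each missing entry with the observed column mean."""
    X_imp = X_miss.copy()
    col_means = np.nanmean(X_miss, axis=0)
    rows, cols = np.where(np.isnan(X_miss))
    X_imp[rows, cols] = col_means[cols]
    return X_imp

def imputation_error(X_true, X_imp, mask):
    """RMSE on the deleted entries, scaled by the std of their true values."""
    diff = X_true[mask] - X_imp[mask]
    return float(np.sqrt(np.mean(diff ** 2)) / np.std(X_true[mask]))

def cv_mse(X, y, k, rng):
    """k-fold cross-validated mean squared prediction error of least squares."""
    idx = rng.permutation(len(y))
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(len(test)), X[test]])
        errors.append(np.mean((y[test] - B @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X, y = make_data(500, rng)
print("fully observed, CV error:", cv_mse(X, y, 5, rng))
for rate in (0.1, 0.3, 0.5):
    X_miss = delete_at_random(X, rate, rng)
    mask = np.isnan(X_miss)
    X_imp = impute_mean(X_miss)
    print(rate, imputation_error(X, X_imp, mask), cv_mse(X_imp, y, 5, rng))
```

Swapping `impute_mean` for a stronger imputer while keeping the rest fixed is one way to examine how changes in imputation accuracy carry over to the cross-validated prediction error.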
Figure 1. Imputation accuracy when using the Random-Forest method for predicting scaled sound pressure in the Airfoil dataset under various missing rates. The dotted lines are empirical Monte-Carlo confidence intervals, while the solid lines are Monte-Carlo means of the imputation accuracy measure.
Figure 2. Prediction accuracy when using the Random-Forest method for predicting scaled sound pressure in the Airfoil dataset under various missing rates. The prediction accuracy measure is estimated by five-fold cross-validation on the imputed dataset. The dotted lines around the solid curves are empirical Monte-Carlo intervals, and the solid lines are Monte-Carlo means of the cross-validated measure. The horizontal dotted red line is the cross-validated measure of the Random Forest fitted to the Airfoil dataset without any missing values.
Figure 3. Prediction accuracy when using the XGBoost method for predicting scaled sound pressure in the Airfoil dataset under various missing rates. The prediction accuracy measure is estimated by five-fold cross-validation on the imputed dataset. The dotted lines around the solid curves are empirical Monte-Carlo intervals, and the solid lines are Monte-Carlo means of the cross-validated measure. The horizontal dotted red line is the cross-validated measure of the XGBoost method fitted to the Airfoil dataset without any missing values.
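The Monte-Carlo means and empirical intervals shown in Figures 1 to 3 can be computed along the following lines. This is a hedged sketch: it assumes the intervals are empirical quantiles of the per-run values, and it uses a toy statistic (the RMSE of mean imputation on one standard normal covariate) in place of the measures from the paper. The names `one_run` and `monte_carlo_summary` are hypothetical.

```python
import numpy as np

def one_run(rate, rng, n=200):
    """One Monte-Carlo run: delete a fraction `rate` of a toy covariate,
    impute with the observed mean, and return the RMSE on the deleted entries."""
    x = rng.normal(size=n)
    n_missing = max(1, int(round(rate * n)))
    missing = rng.choice(n, size=n_missing, replace=False)
    observed_mean = np.delete(x, missing).mean()
    return float(np.sqrt(np.mean((x[missing] - observed_mean) ** 2)))

def monte_carlo_summary(rate, n_rep=500, level=0.95, seed=0):
    """Monte-Carlo mean and empirical (quantile-based) interval over `n_rep` runs."""
    rng = np.random.default_rng(seed)
    values = np.array([one_run(rate, rng) for _ in range(n_rep)])
    alpha = 1.0 - level
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lower), float(upper)

for rate in (0.1, 0.3, 0.5):
    mean, lower, upper = monte_carlo_summary(rate)
    print(f"missing rate {rate}: mean={mean:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

Plotting `mean` as a solid line and `lower`/`upper` as dotted lines against the missing rate gives a curve of the same form as in the figures.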
Figure 4. Boxplots of prediction coverage rates under the linear model. The variation is over the different covariance structures of the features. Each row corresponds to one missing rate and each column to one type of prediction interval. The three colors (red, green and blue) correspond to the three sample sizes.
Figure 5. Boxplots of prediction coverage rates under the trigonometric model. The variation is over the different covariance structures of the features. Each row corresponds to one missing rate and each column to one type of prediction interval. The three colors (red, green and blue) correspond to the three sample sizes.
Figure 6. Boxplots of prediction interval lengths under the linear model. The variation is over the different covariance structures of the features. Each row corresponds to one missing rate and each column to one type of prediction interval. The three colors (red, green and blue) correspond to the three sample sizes.
Figure 7. Boxplots of prediction interval lengths under the trigonometric model. The variation is over the different covariance structures of the features. Each row corresponds to one missing rate and each column to one type of prediction interval. The three colors (red, green and blue) correspond to the three sample sizes.
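Coverage rate and interval length, the two quantities summarized in Figures 4 to 7, can be estimated by simulation as sketched below. The prediction interval used here is a split-conformal interval around a least-squares fit, swapped in as a simple stand-in; it is not one of the intervals studied in the paper. The toy linear model, the mean imputation and the function names are likewise assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def split_conformal(X_fit, y_fit, X_cal, y_cal, X_new, alpha=0.05):
    """Point prediction +/- a quantile of absolute calibration residuals."""
    beta = fit_ols(X_fit, y_fit)
    scores = np.sort(np.abs(y_cal - predict_ols(beta, X_cal)))
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    q = scores[k - 1] if k <= len(scores) else np.inf
    center = predict_ols(beta, X_new)
    return center - q, center + q

def coverage_and_length(lower, upper, y_new):
    """Share of new responses inside the interval, and mean interval length."""
    covered = (lower <= y_new) & (y_new <= upper)
    return float(covered.mean()), float(np.mean(upper - lower))

def delete_and_mean_impute(X, rate, rng, n_fit):
    """Delete entries at random, then fill them with the column means
    of the observed entries in the first `n_fit` rows."""
    mask = rng.random(X.shape) < rate
    X_miss = np.where(mask, np.nan, X)
    col_means = np.nanmean(X_miss[:n_fit], axis=0)
    return np.where(mask, col_means, X)

def experiment(rate, n=3000, alpha=0.05, seed=0):
    """Simulated coverage and length for one missing rate (toy linear model)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
    a, b = n // 3, 2 * (n // 3)          # fit / calibration / new data split
    X = delete_and_mean_impute(X, rate, rng, n_fit=a)
    lower, upper = split_conformal(X[:a], y[:a], X[a:b], y[a:b], X[b:], alpha)
    return coverage_and_length(lower, upper, y[b:])

for rate in (0.0, 0.1, 0.3):
    coverage, length = experiment(rate)
    print(f"missing rate {rate}: coverage={coverage:.3f}, length={length:.3f}")
```

In this sketch the interval can keep its nominal coverage under imputation while becoming longer, which is why coverage and length are reported side by side.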
Summary statistics of the Real Estate Dataset.
| Variable | Scale of Measurement | Range | Mean/Median | Variance/IQR |
|---|---|---|---|---|
| Transaction Date | ordinal | between 2012 & 2013 | | |
| House Price per m² | continuous | | | |
| House Age | continuous | | | |
| Distance to the nearest MRT station | continuous | | | 1,592,921/ |
| Coordinate (latitude) | continuous | | | |
| Coordinate (longitude) | continuous | | | |
Summary statistics of the Airfoil Dataset.
| Variable | Scale of Measurement | Range | Mean/Median | Variance/IQR |
|---|---|---|---|---|
| Scaled Sound Pressure | continuous | | | |
| Frequency | discrete (ordinal) | | | |
| Angle of Attack | discrete (ordinal) | | | |
| Chord length | discrete (ordinal) | | | |
| Free-stream velocity | discrete (ordinal) | | | |
| Suction side displacement thickness | continuous | | | |
Summary statistics of the Power Plant Dataset.
| Variable | Scale of Measurement | Range | Mean/Median | Variance/IQR |
|---|---|---|---|---|
| Electric Energy Output | continuous | | | |
| Temperature | continuous | | | |
| Exhaust Vacuum | continuous | | | |
| Ambient Pressure | continuous | | | |
| Relative Humidity | continuous | | | |
Summary statistics of the Concrete Dataset.
| Variable | Scale of Measurement | Range | Mean/Median | Variance/IQR |
|---|---|---|---|---|
| Compressive Strength | continuous | | | |
| Cement Component | continuous | | | |
| Blast Furnace Slag Component | continuous | | | |
| Fly Ash Component | continuous | | | |
| Water Component | continuous | | | 456/ |
| Superplasticizer | continuous | | | |
| Coarse Aggregate Component | continuous | | | |
| Fine Aggregate Component | continuous | | | |
| Age in Days | continuous | | | |
Summary statistics of the QSAR Dataset, after removing nominal variables. These are H-050, nN and C-040.
| Variable (Molecular Description) | Scale of Measurement | Range | Mean/Median | Variance/IQR |
|---|---|---|---|---|
| LC50 | continuous | | | |
| TPSA(Tot) | continuous | | | |
| SAacc | continuous | | | |
| MLOGP | continuous | | | |
| RDCHI | continuous | | | |
| GATS1p | continuous | | | |
Monte-Carlo mean of the imputation accuracy measure for the Airfoil Dataset, summarizing the same information as in Figure 1.
| Imputation Method | | | | | | | |
|---|---|---|---|---|---|---|---|
Monte-Carlo mean of the prediction accuracy measure for the Airfoil Dataset, summarizing the same information as in Figure 2, using the Random Forest prediction method.
| Prediction Method | | | | | | | |
|---|---|---|---|---|---|---|---|
| Fully observed | | | | | | | |
Monte-Carlo mean of the prediction accuracy measure for the Airfoil Dataset, summarizing the same information as in Figure 3, using the XGBoost prediction method.
| Prediction Method | | | | | | | |
|---|---|---|---|---|---|---|---|
| Fully observed | | | | | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
| Fully observed | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
| none | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction coverage rates, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
| none | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the linear model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
| none | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |
Triples of simulated prediction interval lengths, averaged over the five covariance structures, for the trigonometric model at a given missing rate. Each triple covers the three sample sizes at a fixed significance level.
| Imputation Method | | | |
|---|---|---|---|
| missForest | | | |
| mice_pmm | | | |
| mice_norm | | | |
| mice_rf | | | |
| gbm | | | |
| xgboost | | | |