Leah H Rubin, Katie Witkiewitz, Justin St Andre, Steve Reilly.
Abstract
Missing data are a major problem in the behavioral neurosciences, particularly when data collection is costly. Researchers often exclude cases with missing data, which can bias estimates and reduce power. A case can be retained by estimating its missing data point, but a naïve missing data method can distort estimates and lead to incorrect conclusions. New approaches for handling missing data have been developed, but these techniques are not typically included in undergraduate research methods texts. Missing data techniques would be a useful topic for teaching research methods and for helping students with their research projects. This paper aimed to illustrate that estimating missing data is often more efficacious than complete case analysis, otherwise known as listwise deletion. Longitudinal data were obtained from an experiment examining the effects of an anorectic drug on food consumption in a small sample (n=17) of rats. The complete dataset was degraded by removing a percentage of data points (1%, 2%, 3%, 4%, 5%, or 10%), yielding six datasets with missing data. Four missing data techniques (listwise deletion, mean substitution, regression, and expectation-maximization [EM]) were applied to all six datasets, so that each technique was applied to the same missing data points. P-values, effect sizes, and Bayes factors were computed. Results demonstrated that listwise deletion was the least effective method. EM and regression imputation were the preferred methods when more than 5% of the data were missing. Based on these findings, it is recommended that researchers avoid listwise deletion and consider alternative missing data techniques.
Keywords: expectation maximization; imputation; listwise deletion; mean substitution; missing data; regression
Year: 2007 PMID: 23493038 PMCID: PMC3592650
Source DB: PubMed Journal: J Undergrad Neurosci Educ ISSN: 1544-2896
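Three of the four techniques named in the abstract can be illustrated with a minimal Python sketch. The toy data, the function names, and the use of a single predictor column for the regression step are all assumptions made for illustration; they are not the study's data or its software. EM is omitted because it needs an iterative model fit that does not fit in a short example.

```python
from statistics import mean

# Hypothetical toy data (NOT from the study): rows are subjects,
# columns are two measurement days; None marks a missing value.
data = [
    [10.0, 12.0],
    [20.0, 22.0],
    [30.0, None],
    [40.0, 42.0],
]

def listwise_deletion(rows):
    """Drop every row that contains at least one missing value."""
    return [row[:] for row in rows if None not in row]

def mean_substitution(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def regression_imputation(rows, target, predictor):
    """Impute missing `target` values from a least-squares line fitted on complete cases."""
    complete = [r for r in rows if r[target] is not None and r[predictor] is not None]
    xs = [r[predictor] for r in complete]
    ys = [r[target] for r in complete]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    imputed = []
    for r in rows:
        r = r[:]  # copy so the input is left unchanged
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
        imputed.append(r)
    return imputed

if __name__ == "__main__":
    print(listwise_deletion(data))            # subject 3 is dropped entirely
    print(mean_substitution(data))            # gap filled with the column mean
    print(regression_imputation(data, 1, 0))  # gap predicted from column 0
```

In this toy example listwise deletion discards a whole subject because of one missing cell, while the two imputation methods keep all four subjects.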
Means and standard deviations for pellet consumption for rats as a function of condition for the complete dataset.
| Day | Saline: Mean | Saline: SD | Anorectic Drug: Mean | Anorectic Drug: SD |
|---|---|---|---|---|
| 7 | 575.63 | 74.84 | 515.22 | 85.20 |
| 8 | 547.00 | 81.20 | 510.89 | 45.45 |
| 9 | 570.50 | 68.80 | 465.33 | 141.42 |
| 10 | 552.00 | 75.61 | 476.56 | 137.82 |
| 11 | 549.25 | 77.39 | 497.33 | 76.57 |
| 12 | 570.88 | 57.87 | 496.67 | 90.63 |
| 13 | 569.13 | 63.08 | 500.67 | 98.01 |
| 14 | 543.13 | 73.46 | 538.22 | 86.25 |
| 15 | 562.00 | 77.12 | 599.67 | 77.72 |
Residual means and variances for pellet consumption averaged across days 7–15 for each percentage of missing data and each missing data technique. Residual means were computed by subtracting the estimated overall mean from the actual overall mean, and residual variances by subtracting the estimated overall variance from the actual variance, separately for the saline and anorectic drug conditions.
| Condition | Missing Data (%) | Listwise Deletion: Mean | Listwise Deletion: Variance | Mean Substitution: Mean | Mean Substitution: Variance | Regression: Mean | Regression: Variance | EM: Mean | EM: Variance |
|---|---|---|---|---|---|---|---|---|---|
| Saline | 1 | −0.60 | 65.41 | −1.63 | −377.01 | −1.77 | −334.24 | −1.63 | −350.51 |
| Saline | 2 | −0.01 | −25.62 | −2.76 | −392.02 | −2.68 | −376.19 | −2.64 | −372.80 |
| Saline | 3 | −1.73 | −31.11 | −3.26 | −515.05 | −3.11 | −493.54 | −3.11 | −482.15 |
| Saline | 4 | −1.37 | 44.31 | −1.67 | 56.35 | −1.70 | 57.36 | −1.43 | 47.10 |
| Saline | 5 | −0.52 | 241.01 | −2.00 | −246.02 | −1.89 | −238.15 | −1.84 | −223.44 |
| Saline | 10 | 1.38 | −83.77 | −2.01 | −597.20 | −2.47 | −546.39 | −1.71 | −585.06 |
| Anorectic Drug | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Anorectic Drug | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Anorectic Drug | 3 | −1.51 | 160.22 | 0.44 | −215.53 | 0.44 | −207.46 | −0.03 | −228.31 |
| Anorectic Drug | 4 | −3.04 | 238.74 | −1.13 | −156.90 | −1.22 | −147.22 | −0.73 | −179.00 |
| Anorectic Drug | 5 | 0.65 | 150.09 | 3.28 | −374.22 | 3.22 | −357.33 | 3.19 | −365.51 |
| Anorectic Drug | 10 | −5.04 | 1020.10 | 0.79 | 410.38 | 0.41 | 404.99 | 0.47 | 387.87 |
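The residual computation described in the caption of this table can be sketched as follows. This is a minimal illustration that follows the sign convention as the caption states it (actual minus estimated). The function name `residuals`, the toy numbers, and the choice of the sample variance (n − 1 denominator) are assumptions; the source does not say which variance estimator was used.

```python
from statistics import mean, variance

def residuals(actual, estimated):
    """Return (residual mean, residual variance), each computed as actual minus estimated.

    Assumes the sample variance (n - 1 denominator); the source does not specify this.
    """
    residual_mean = mean(actual) - mean(estimated)
    residual_variance = variance(actual) - variance(estimated)
    return residual_mean, residual_variance

# Hypothetical values (NOT from the study): a complete dataset and a version
# in which two observations were removed and then filled in by some technique.
actual = [2.0, 4.0, 6.0, 8.0]
estimated = [2.0, 5.0, 5.0, 8.0]

if __name__ == "__main__":
    r_mean, r_var = residuals(actual, estimated)
    print(r_mean, r_var)
```

A residual of zero means the technique reproduced the complete-data statistic exactly, which is the case for every technique in the rows of the table that contain only zeros.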
Figure 1. P-values plotted as a function of percent of missing data and missing data technique. The black horizontal lines mark the p-values for the complete dataset. P-values falling below the black line could constitute a Type I error, and p-values falling above the black line could constitute a Type II error.
Figure 2. Effect sizes (η²) plotted as a function of percent of missing data and missing data technique. The black horizontal lines mark the effect size (η²) for the complete dataset.
Figure 3. Lower bound of the Bayes factor plotted as a function of percent of missing data and missing data technique. The black horizontal lines mark the lower bound of the Bayes factor for the complete dataset.
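For readers unfamiliar with the effect size plotted in Figure 2, the sketch below computes the classical η² for a simple one-way, between-groups layout as the between-groups sum of squares divided by the total sum of squares. This is an assumption about the simplest form of the statistic: the source does not state the ANOVA model used for its longitudinal design, and the function name and toy groups are hypothetical.

```python
from statistics import mean

def eta_squared(groups):
    """Classical eta-squared for a one-way layout: SS_between / SS_total."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-groups sum of squares: group size times squared deviation of the group mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares: squared deviation of every observation from the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total

# Hypothetical groups (NOT from the study).
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

if __name__ == "__main__":
    print(eta_squared(toy_groups))
```

Running the same function on a complete dataset and on each imputed version gives the kind of comparison against a complete-data reference line that Figure 2 describes.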