Abstract
The presence of outliers can be very problematic in data analysis, leading statisticians to develop a wide variety of methods for identifying them in both the univariate and multivariate contexts. In the case of the latter, perhaps the most popular approach has been Mahalanobis distance, where large values suggest an observation that is unusual relative to the center of the data. However, researchers have identified problems with the application of this metric such that its utility may be limited in some situations. As a consequence, other methods for detecting outlying observations have been developed and studied. However, a number of these approaches, while apparently robust and useful, have not made their way into general practice in the social sciences. Thus, the goal of this study was to describe some of these methods and demonstrate them using a well-known dataset from a popular multivariate textbook widely used in the social sciences. Results demonstrated that the methods do indeed result in datasets with very different distributional characteristics. These results are discussed in light of how they might be used by researchers and practitioners.
Keywords: Mahalanobis distance; minimum covariance determinant; minimum generalized variance; minimum volume ellipsoid; outliers; projection
Year: 2012 PMID: 22783214 PMCID: PMC3389806 DOI: 10.3389/fpsyg.2012.00211
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Summary of outlier detection methods.
| Method | Equation | Reference | Strengths | Weaknesses |
|---|---|---|---|---|
| Mahalanobis distance | D² = (x − x̄)′S⁻¹(x − x̄) | Mahalanobis (1936) | Intuitively easy to understand; easy to calculate; familiar to other researchers | Sensitive to outliers; assumes data are continuous |
| MVE | Identify the subset of the data contained within the ellipsoid of minimum volume | Rousseeuw and Leroy (1987) | Yields a mean with the maximum possible breakdown point | May remove as much as 50% of the sample |
| MCD | Identify the subset of the data that minimizes the determinant of the covariance matrix | Rousseeuw and van Driessen (1999) | Yields a mean with the maximum possible breakdown point | May remove as much as 50% of the sample |
| MGV | Calculate a MAD-based version of the generalized variance | Wilcox (2005) | Typically removes fewer observations than either MVE or MCD | Generally does not have as high a breakdown point as MVE or MCD |
| P1 | Identify the multivariate center of the data using MCD or MVE and then determine each observation's relative distance from this center (depth); use the MGV criteria based on this depth to identify outliers | Donoho and Gasko (1992) | Approximates an affine equivariant outlier detection method; may not exclude as many cases as MVE or MCD | Will not typically lead to a mean with the maximum possible breakdown point |
| P2 | Identify all possible lines between all pairs of observations in order to determine the depth of each point | Donoho and Gasko (1992) | Some evidence that this method is more accurate than P1 in terms of identifying outliers | Extensive computational time, particularly for large datasets |
| P3 | Same approach as P1, except that a different criterion is used to identify outliers | Donoho and Gasko (1992) | May yield a mean with a higher breakdown point than the other projection methods | Will likely exclude more observations as outliers than the other projection approaches |
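To make the first row of the table concrete, a minimal sketch of Mahalanobis-distance outlier flagging follows. The chi-square cutoff (with df equal to the number of variables) and the synthetic data are common illustrative choices assumed here, not procedures taken from this article:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag rows of X whose squared Mahalanobis distance from the
    sample mean exceeds the chi-square critical value (df = n columns)."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    # Note: the classical covariance is itself inflated by outliers,
    # which is the "sensitive to outliers" weakness listed above.
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = X - center
    # Row-wise quadratic form: D^2 = (x - x̄)' S^{-1} (x - x̄)
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return d2, d2 > cutoff

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [10.0, 10.0, 10.0]      # plant one gross multivariate outlier
d2, flags = mahalanobis_outliers(X)
```

The planted point is flagged because its squared distance far exceeds the critical value; with many extreme points, however, masking can leave outliers undetected, which motivates the robust alternatives in the rows above.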
Figure 1. Scatterplot of observations identified as outliers based on the MVE method.
Figure 2. Scatterplot of observations identified as outliers based on the MCD method.
Figure 3. Scatterplot of observations identified as outliers based on the MGV method.
Figure 4. Boxplots.
Descriptive statistics for the full dataset and for the samples retained by each outlier detection method (MD = Mahalanobis distance).

| Variable | Full | MD | MCD | MVE | MGV | P1 | P2 | P3 |
|---|---|---|---|---|---|---|---|---|
| Mean | | | | | | | | |
| TIMEDRS | 7.90 | 6.67 | 2.45 | 3.37 | 7.64 | 5.27 | 5.17 | 5.27 |
| ATTDRUG | 7.69 | 7.67 | 7.69 | 7.68 | 7.68 | 7.65 | 7.64 | 7.65 |
| ATTHOUSE | 23.53 | 23.56 | 23.36 | 23.57 | 23.50 | 23.54 | 23.54 | 23.54 |
| TIMEDRS | 4.00 | 4.00 | 2.00 | 3.00 | 2.00 | 2.00 | 2.00 | 2.00 |
| ATTDRUG | 8.00 | 8.00 | 8.00 | 8.00 | 7.00 | 7.00 | 7.00 | 7.00 |
| ATTHOUSE | 24.00 | 24.00 | 23.00 | 23.00 | 21.00 | 21.00 | 21.00 | 21.00 |
| TIMEDRS | 2.00 | 2.00 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 |
| ATTDRUG | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 |
| ATTHOUSE | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 |
| TIMEDRS | 10.00 | 9.00 | 4.00 | 5.00 | 9.50 | 7.00 | 7.00 | 7.00 |
| ATTDRUG | 9.00 | 8.25 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 |
| ATTHOUSE | 27.00 | 26.25 | 26.00 | 26.00 | 26.50 | 26.00 | 26.75 | 26.00 |
| Standard deviation | | | | | | | | |
| TIMEDRS | 10.95 | 7.35 | 1.59 | 2.22 | 10.23 | 4.72 | 4.58 | 4.72 |
| ATTDRUG | 1.16 | 1.16 | 0.89 | 0.81 | 1.15 | 1.56 | 1.52 | 1.56 |
| ATTHOUSE | 4.48 | 4.22 | 3.67 | 3.40 | 4.46 | 4.25 | 4.26 | 4.25 |
| Skewness | | | | | | | | |
| TIMEDRS | 3.23 | 2.07 | 0.23 | 0.46 | 3.15 | 1.12 | 1.08 | 1.12 |
| ATTDRUG | −0.12 | −0.11 | −0.16 | 0.01 | −0.12 | −0.09 | −0.09 | −0.09 |
| ATTHOUSE | −0.45 | −0.06 | −0.03 | 0.06 | −0.46 | −0.03 | −0.03 | −0.03 |
| Kurtosis | | | | | | | | |
| TIMEDRS | 15.88 | 7.92 | 2.32 | 2.49 | 15.77 | 3.42 | 3.29 | 3.42 |
| ATTDRUG | 2.53 | 2.51 | 2.43 | 2.31 | 2.54 | 2.51 | 2.51 | 2.51 |
| ATTHOUSE | 4.50 | 2.71 | 2.16 | 2.17 | 4.54 | 2.69 | 2.68 | 2.69 |
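The pattern in this table — outlier removal shrinking the mean, standard deviation, skewness, and kurtosis of the heavily skewed TIMEDRS variable — can be reproduced on synthetic data. A minimal sketch, in which lognormal data and a simple univariate 5% upper trim are illustrative assumptions standing in for the study's multivariate removal procedures:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
# A lognormal variable mimics a visit-count-like measure: long right tail.
x = rng.lognormal(mean=1.5, sigma=0.9, size=465)

# Drop the most extreme 5% of cases, as an aggressive method might.
trimmed = x[x < np.quantile(x, 0.95)]

for label, v in [("full", x), ("trimmed", trimmed)]:
    print(f"{label}: mean={v.mean():.2f} sd={v.std(ddof=1):.2f} "
          f"skew={skew(v):.2f} kurt={kurtosis(v):.2f}")
```

All four statistics drop after trimming, mirroring the Full-versus-MVE/MCD contrast for TIMEDRS above.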
Correlations among the variables for the full dataset and for each outlier detection method (MD = Mahalanobis distance).

| Sample | Variable | TIMEDRS | ATTDRUG | ATTHOUSE |
|---|---|---|---|---|
| Full | TIMEDRS | 1.00 | 0.10 | 0.13 |
| | ATTDRUG | 0.10 | 1.00 | 0.03 |
| | ATTHOUSE | 0.13 | 0.03 | 1.00 |
| MD | TIMEDRS | 1.00 | 0.07 | 0.08 |
| | ATTDRUG | 0.07 | 1.00 | 0.02 |
| | ATTHOUSE | 0.08 | 0.02 | 1.00 |
| MCD | TIMEDRS | 1.00 | 0.25 | 0.19 |
| | ATTDRUG | 0.25 | 1.00 | 0.26 |
| | ATTHOUSE | 0.19 | 0.26 | 1.00 |
| MVE | TIMEDRS | 1.00 | 0.33 | 0.05 |
| | ATTDRUG | 0.33 | 1.00 | 0.32 |
| | ATTHOUSE | 0.05 | 0.32 | 1.00 |
| MGV | TIMEDRS | 1.00 | 0.07 | 0.10 |
| | ATTDRUG | 0.07 | 1.00 | 0.02 |
| | ATTHOUSE | 0.10 | 0.02 | 1.00 |
| P1 | TIMEDRS | 1.00 | 0.06 | 0.10 |
| | ATTDRUG | 0.06 | 1.00 | 0.03 |
| | ATTHOUSE | 0.10 | 0.03 | 1.00 |
| P2 | TIMEDRS | 1.00 | 0.04 | 0.11 |
| | ATTDRUG | 0.04 | 1.00 | 0.03 |
| | ATTHOUSE | 0.11 | 0.03 | 1.00 |
| P3 | TIMEDRS | 1.00 | 0.06 | 0.10 |
| | ATTDRUG | 0.06 | 1.00 | 0.03 |
| | ATTHOUSE | 0.10 | 0.03 | 1.00 |
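The contrast between the full-data correlations and the larger MVE/MCD values is consistent with a general fact: a few discordant multivariate outliers can attenuate Pearson's r, and removing them can raise it. A small sketch with planted synthetic data (illustrative only, not the study's dataset):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
# Two modestly correlated variables, like those in the example data.
z = rng.normal(size=n)
x = z + rng.normal(scale=2.0, size=n)
y = z + rng.normal(scale=2.0, size=n)

X = np.column_stack([x, y])
X[0] = [15.0, -15.0]    # one discordant outlier working against the trend

r_with = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
r_without = np.corrcoef(X[1:, 0], X[1:, 1])[0, 1]
print(f"r with outlier: {r_with:.2f}, r without: {r_without:.2f}")
```

Dropping the single discordant case restores a noticeably stronger correlation, which is the direction of change seen here when MVE or MCD removes outlying observations.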
Summary of results for outlier detection methods.
| Method | Outliers removed | Impact on distributions | Impact on correlations | Comments |
|---|---|---|---|---|
| Mahalanobis distance | 13 | Reduced skewness and kurtosis when compared to the full dataset, but did not fully eliminate them. Reduced variation in TIMEDRS | Correlations comparable to the full dataset | Resulted in a sample with somewhat less skewed and kurtotic variables, though they remained clearly non-normal. The correlations among the variables remained low, as with the full dataset |
| MVE | 230 | Largely eliminated skewness and greatly lowered kurtosis in TIMEDRS. Also reduced kurtosis in ATTHOUSE when compared to the full data. Greatly lowered both the mean and standard deviation of TIMEDRS | Resulted in markedly higher correlations for two pairs of variables than seen with any other method except MCD | Reduced the sample size substantially, but also yielded variables with distributional characteristics much more favorable for common statistical analyses, i.e., very little skewness or kurtosis. In addition, correlation coefficients were generally larger than for the other methods, suggesting greater linearity in the relationships among the variables |
| MCD | 230 | Very similar pattern to that displayed by MVE | Yielded relatively higher correlation values than any other method except MVE, and no very low values | Provided a sample with very similar characteristics to that of MVE |
| MGV | 2 | Yielded distributional results very similar to those of the full dataset | Correlation structure very similar to that of the full dataset | Identified very few outliers, leading to a sample that did not differ meaningfully from the original |
| P1 | 40 | Resulted in lower mean, standard deviation, skewness, and kurtosis values for TIMEDRS when compared to the full data | Correlation results very comparable to those of the full dataset | Appears to find a "middle ground" between MVE/MCD and the less aggressive methods |
| P2 | 43 | Very similar results to P1 | Very similar results to P1 | Provided a sample yielding essentially the same results as P1 |
| P3 | 40 | Identical results to P1 | Identical results to P1 | In this case, resulted in an identical sample to that of P1 |