Sang Kyu Kwak, Jong Hae Kim.
Abstract
Missing values and outliers are frequently encountered while collecting data. Missing values reduce the data available for analysis, compromising the statistical power of the study and, eventually, the reliability of its results. They can also introduce significant bias into the results and degrade the efficiency of the data. Outliers significantly affect the estimation of statistics (e.g., the average and standard deviation of a sample), resulting in overestimated or underestimated values. The results of data analysis therefore depend considerably on how missing values and outliers are processed. This review discusses the types of missing values, ways of identifying outliers, and ways of dealing with both.
Keywords: Bias; Data collection; Data interpretation; Statistics
Year: 2017 PMID: 28794835 PMCID: PMC5548942 DOI: 10.4097/kjae.2017.70.4.407
Source DB: PubMed Journal: Korean J Anesthesiol ISSN: 2005-6419
Types of Missing Values
| Types of missing values | Description | Possible causes |
|---|---|---|
| Missing completely at random | Missing data occur completely at random without being influenced by other data. | Consent withdrawal, omission of major exams, death, discontinued follow-up and serious adverse reactions. |
| Missing at random | Missing data occur at a specific time point in conjunction with participant dissatisfaction with study outcomes and ongoing participation. | Refusal to continue measurements. |
| Not missing at random | Missing data occur when a patient who is not satisfied with study outcomes performs the required measurements on his own, before the scheduled measurement. | If a patient finds the results of self-measurement dissatisfactory in addition to dissatisfaction related to the study, the patient may refuse further measurements. |
Fig. 1. Boxplot with outliers. The upper fence lies above the 75th percentile (3rd quartile), and the lower fence lies below the 25th percentile (1st quartile), each by 1.5 times the difference between the 3rd and 1st quartiles. An outlier is defined as a value above the upper fence or below the lower fence.
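The fence rule in the caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function names `boxplot_fences` and `find_outliers` are hypothetical, and the choice of the "inclusive" quartile convention is an assumption. Quartile conventions differ between software packages, and on small samples a different convention can move the fences enough to change which values are flagged.

```python
import statistics


def boxplot_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
    # method="inclusive" is an assumed convention, not one the source specifies.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # difference between the 3rd and 1st quartiles
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values lying below the lower fence or above the upper fence."""
    lower, upper = boxplot_fences(values, k)
    return [v for v in values if v < lower or v > upper]


sample = [1, 2, 3, 4, 9]
lower_fence, upper_fence = boxplot_fences(sample)
outliers = find_outliers(sample)
```

With this sample and the assumed convention, the quartiles are 2 and 4, so the fences fall at -1 and 7 and the value 9 is flagged.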
Examples of Missing Value and Outlier
| No. | Missing value: raw data | Missing value: complete case | Missing value: imputation with the mean value | Outlier: raw data | Outlier: complete case | Outlier: winsorization with the maximum value |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 5 | .* | -† | 2.5‡ | 9§ | -† | 4∥ |
N: sample size, NA: not applicable. *Missing value, †Discarded value, ‡Imputed mean value, §Outlier, ∥Winsorized maximum value.
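The three treatments named in the table headings (complete case, imputation with the mean value, and winsorization with the maximum value) can be sketched on the same five observations. This is a minimal illustration under stated assumptions: a missing value is represented as `None`, the helper names are hypothetical, and the outlier test (`v > 7`) is an assumed threshold chosen for this example rather than a value given in the source.

```python
import statistics


def complete_case(values):
    """Discard missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    mean = statistics.mean(complete_case(values))
    return [mean if v is None else v for v in values]


def winsorize_max(values, is_outlier):
    """Replace each flagged value with the largest non-flagged value."""
    ceiling = max(v for v in values if not is_outlier(v))
    return [ceiling if is_outlier(v) else v for v in values]


with_missing = [1, 2, 3, 4, None]
with_outlier = [1, 2, 3, 4, 9]

print(complete_case(with_missing))   # [1, 2, 3, 4]
print(impute_mean(with_missing))     # [1, 2, 3, 4, 2.5]

# Assumed threshold: anything above 7 is treated as an outlier.
print(winsorize_max(with_outlier, lambda v: v > 7))  # [1, 2, 3, 4, 4]
```

Complete-case analysis shrinks the sample from five to four, whereas mean imputation and winsorization keep all five observations at the cost of replacing one of them with a derived value.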