Kelly M Sunderland, Derek Beaton, Julia Fraser, Donna Kwan, Paula M McLaughlin, Manuel Montero-Odasso, Alicia J Peltsch, Frederico Pieruccini-Faria, Demetrios J Sahlas, Richard H Swartz, Stephen C Strother, Malcolm A Binns.
Abstract
BACKGROUND: Large and complex studies are now routine, and quality assurance and quality control (QC) procedures ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization for other large-scale projects to follow.
Keywords: Minimum covariance determinant; Multivariate outliers; Principal component analysis; Quality control; Visualization
Year: 2019 PMID: 31092212 PMCID: PMC6521365 DOI: 10.1186/s12874-019-0737-5
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
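The abstract's central claim, that an observation can look ordinary on every variable taken alone yet be clearly odd when two variables are considered together, can be illustrated with a small sketch. This is not the authors' procedure: for brevity it uses classical z-scores and the classical (non-robust) Mahalanobis distance rather than the minimum covariance determinant or robust PCA, and the toy data, function names, and cutoffs are all assumptions made for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    m = mean(xs)
    return math.sqrt(sum((v - m) ** 2 for v in xs) / (len(xs) - 1))

def univariate_flags(xs, cutoff=3.0):
    """Flag values whose absolute z-score exceeds the cutoff (assumed: 3)."""
    m, s = mean(xs), sample_sd(xs)
    return [abs((v - m) / s) > cutoff for v in xs]

def mahalanobis_sq(x, y):
    """Squared classical Mahalanobis distance of each (x, y) pair."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    dists = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # d' S^{-1} d written out for the 2x2 case
        dists.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists

# Hypothetical toy data: two strongly related scores for 20 participants,
# plus one participant whose pair (5, 15) breaks the joint pattern while
# each value stays well inside its variable's observed range.
x = [float(i) for i in range(1, 21)] + [5.0]
y = [i + (0.5 if i % 2 else -0.5) for i in range(1, 21)] + [15.0]

# For 2 degrees of freedom the chi-square quantile has a closed form:
# q(p) = -2 * ln(1 - p). The 0.975 level is an assumed, conventional choice.
cutoff = -2 * math.log(1 - 0.975)

uni_x = univariate_flags(x)
uni_y = univariate_flags(y)
multi = [d > cutoff for d in mahalanobis_sq(x, y)]
flagged = [i for i, f in enumerate(multi) if f]
print("univariate flags on x:", sum(uni_x), "on y:", sum(uni_y))
print("multivariate flags at index:", flagged)
```

In this toy example neither univariate check flags anyone, while the bivariate distance singles out the last participant. A robust estimator such as MCD would replace the classical mean and covariance here, so that the outlier itself does not distort the estimates used to detect it.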
Preliminary summary demographics for the ONDRI VCI cohort
| | Neuropsychology | Gait |
|---|---|---|
| Sample size | 161 | 148 |
| Age in years, mean (sd) | 68.72 (7.42) | 68.61 (7.39) |
| Education in years, mean (sd) | 14.61 (2.92) | 14.65 (2.98) |
| Number of Males/Females | 110 / 51 | 104 / 44 |
Fig. 1. The data quality evaluation process steps, represented as a process that may loop. The dashed line separates steps that are performed by the platform team from the biostatistics team, while the dashed arrow indicates that the process will not always return to Step 1
Comparison of outlying participants and errors identified by the multivariate outlier detection approaches during the first iteration of the data evaluation process: first, between MCD and RPCA directly, combining results with and without adjustment; then, between the adjusted and unadjusted results within each multivariate method. For each set of results, the total number of outliers/errors by each approach is reported (MCD vs. RPCA; adjusted vs. unadjusted), as well as the number that overlapped between the two approaches
| | Neuropsychology | | | Gait | | |
|---|---|---|---|---|---|---|
| Summary (Adj. & Unadj. are combined) | MCD & RPCA | MCD | RPCA | MCD & RPCA | MCD | RPCA |
| Outlying Participants | 11 | 26 | 29 | 19 | 29 | 33 |
| Number of Errors | 6 | 8 | 6 | 3 | 5 | 3 |
| Individual Results | Adj. & Unadj. | Adj. | Unadj. | Adj. & Unadj. | Adj. | Unadj. |
| MCD | ||||||
| Outlying Participants | 18 | 22 | 22 | 24 | 25 | 28 |
| Number of Errors | 8 | 8 | 8 | 5 | 5 | 5 |
| RPCA | ||||||
| Outlying Participants | 16 | 22 | 23 | 19 | 26 | 26 |
| Number of Errors | 4 | 4 | 6 | 3 | 3 | 3 |
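The counts in this comparison (a total per approach plus the number shared by both) amount to set sizes and a set intersection over participant identifiers. A minimal sketch of that bookkeeping follows; the function name and the participant IDs are hypothetical and do not come from the study.

```python
def compare_flagged(flagged_a, flagged_b):
    """Summarize two collections of flagged participant IDs.

    Returns the total per approach, the overlap, and the counts unique
    to each approach.
    """
    a, b = set(flagged_a), set(flagged_b)
    return {
        "a_total": len(a),
        "b_total": len(b),
        "both": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }

# Hypothetical IDs flagged by two multivariate approaches.
mcd_flagged = ["P003", "P017", "P042", "P108"]
rpca_flagged = ["P017", "P042", "P077"]

summary = compare_flagged(mcd_flagged, rpca_flagged)
print(summary)
```

Keeping the IDs rather than only the counts makes it straightforward to pass the participants flagged by either approach back for verification.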
Comparison of outlying participants and errors identified by the multivariate and univariate outlier detection approaches in the first iteration of the data evaluation process, regardless of specific method and whether covariate adjustment was applied
| | Neuropsychology | | | Gait | | |
|---|---|---|---|---|---|---|
| | Multi. & Uni. | Multi. | Uni. | Multi. & Uni. | Multi. | Uni. |
| Outlying Participants | 44 | 44 | 133 | 25 | 43 | 42 |
| Number of Errors | 3<sup>a</sup> | 8 | 3<sup>b</sup> | 3 | 5 | 3<sup>b</sup> |
<sup>a</sup> All outlying participants identified by multivariate methods were also identified by univariate methods. However, not all univariate methods identified the participant as an outlier on the variable with the error.
<sup>b</sup> Outlying participants identified by univariate methods only were not verified.
Fig. 2. Boxplots for neuropsychology variables on which an error was identified with the multivariate data quality evaluation process. All data were adjusted for age, sex, and years of education, and normalized to have zero mean and unit standard deviation. The range of typical values identified by the univariate MCD is represented by curly brackets. Values at which an error was identified with the data quality evaluation process are represented by crossed circles. BNT = Boston Naming Test. DS = Digit Span assessment. JLO = Judgement of Line Orientation. RAVLT = Rey Auditory Verbal Learning Test. Stroop = Colour-Word Interference
Fig. 3. Boxplots for gait variables identified as primary contributing variables and on which an error was identified with the multivariate data quality evaluation process. All data were adjusted for age, sex, and years of education, and normalized to have zero mean and unit standard deviation. The range of typical values identified by the univariate MCD is represented by curly brackets. Values at which an error was identified with the data quality evaluation process are represented by crossed circles. As previously noted, errors identified in the gait dataset affected multiple variables, so two variables are included per error
Fig. 4. Observed data for two measures of the Rey Auditory Verbal Learning Test (RAVLT). All data were adjusted for age, sex, and years of education, and normalized to have zero mean and unit standard deviation. The outlier is represented by a crossed circle
Fig. 5. Observed data for two measures of the Boston Naming Test (BNT). All data were adjusted for age, sex, and years of education, and normalized to have zero mean and unit standard deviation. The outlier is represented by a crossed circle
Fig. 6. Observed data for a measure from each of the Boston Naming Test (BNT) and the Visual Object and Space Perception battery (VOSP). All data were adjusted for age, sex, and years of education, and normalized to have zero mean and unit standard deviation. The outlier is represented by a crossed circle