Jörn Lötsch
Abstract
The measurement of concentrations of drugs and endogenous substances is widely used in basic and clinical pharmacology research and in service tasks. Using data science-derived visualizations of laboratory data, a real-life example demonstrates that basic statistical exploration of laboratory assay results, or the commonly advised standard visual methods of data inspection, may fail to detect systematic laboratory errors. For example, data pathologies such as an assay run that always returns the same value in all probes may pass undetected by standard data quality checks. The use of several data visualizations that emphasize different views of the data may enhance the detection of systematic laboratory errors. A dotplot of the single data in the order of assay is proposed; it provides an overview of the data range, outliers, and a particular type of systematic error in which similar values are wrongly measured in all probes.
Keywords: R programming language; data quality check; data science
Year: 2017 PMID: 29226627 PMCID: PMC5723702 DOI: 10.1002/prp2.369
Source DB: PubMed Journal: Pharmacol Res Perspect ISSN: 2052-1707
Figure 1. Dotplot of plasma concentrations of three different biochemical markers (arbitrarily named “Lab1”, “Lab2” and “Lab3”). The dots display the single data, sorted in the order of consecutive assay (upper row). Two different clinical phenotypes are included, with a distribution of n = 100/100. In the parameter “Lab2”, a short temporal window (red ellipse) was detected during which all measured concentrations wrongly had the same numerical value. In “Lab3”, all measurements were zero except for one assay day, during which highly variable results were produced. Detecting these errors became impossible when the temporal succession of the assays was destroyed (bottom row).
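The effect described in this caption can be sketched numerically. The following Python example is only an illustration under stated assumptions: the article itself uses R, the data are synthetic rather than the study data, and the helper `longest_constant_run` and the position and length of the faulty window are hypothetical. It shows that a run of identical values is easy to find while the assay order is intact and disappears once the order is shuffled.

```python
import random
from itertools import groupby


def longest_constant_run(values):
    """Return (length, start_index, value) of the longest run of identical consecutive values."""
    best = (0, 0, None)
    index = 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        if length > best[0]:
            best = (length, index, value)
        index += length
    return best


# Synthetic concentrations in assay order (NOT the study data).
rng = random.Random(42)
in_order = [round(rng.uniform(0.2, 12.0), 2) for _ in range(200)]

# Hypothetical faulty window: 15 consecutive probes report the same value.
in_order[80:95] = [3.33] * 15

# Destroying the temporal succession, as in the bottom row of the figure.
shuffled = in_order[:]
rng.shuffle(shuffled)

run_in_order = longest_constant_run(in_order)
run_shuffled = longest_constant_run(shuffled)

print("longest constant run, assay order:", run_in_order)
print("longest constant run, shuffled   :", run_shuffled)
```

A run-length check of this kind is a numerical companion to the dotplot, not a replacement for it: it only finds exactly repeated values and would miss a window of merely similar values.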
Descriptive statistical analysis of the three laboratory parameters, which originate from an actual scientific project but are here arbitrarily named “Lab1”, “Lab2” and “Lab3”
| Parameters | Lab1 | Lab2 | Lab3 |
|---|---|---|---|
| N | 200 | 200 | 200 |
| Mean | 3.22 | 3.95 | 0.1 |
| Standard deviation | 2.19 | 2.43 | 0.33 |
| Median | 3.06 | 3.33 | 0 |
| Trimmed mean | 3.04 | 3.7 | 0.01 |
| Median absolute deviation | 2.5 | 2.25 | 0 |
| Minimum | 0 | 0.16 | 0 |
| Maximum | 9.55 | 12.64 | 2.61 |
| Range | 9.55 | 12.48 | 2.61 |
| Skewness | 0.62 | 0.91 | 4.72 |
| Kurtosis | −0.34 | 0.32 | 25.02 |
| Standard error | 0.15 | 0.17 | 0.02 |
The calculations were made using the “describe” command of the R library “psych” (Revelle W, Northwestern University, Evanston, Illinois, https://CRAN.R-project.org/package=psych) in the R software package (version 3.4.1 for Linux; http://CRAN.R-project.org/) [5].
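For readers without R, the rows of the table can be approximated with the Python standard library. This is a sketch, not the “psych” implementation: the function name `describe` is reused only for readability, and the choices of a 10% trimmed mean, a median absolute deviation scaled by 1.4826, and moment-based skewness and kurtosis divided by powers of the sample standard deviation are assumptions about the defaults of that command.

```python
import math
import statistics


def describe(values, trim=0.1):
    """Approximate the table rows for one parameter (assumed defaults, see lead-in)."""
    data = sorted(values)
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1); must be non-zero
    median = statistics.median(data)

    # Trimmed mean: drop floor(n * trim) observations from each end.
    k = math.floor(n * trim)
    trimmed = statistics.fmean(data[k:n - k])

    # Median absolute deviation, scaled for consistency with the normal SD.
    mad = 1.4826 * statistics.median([abs(x - median) for x in data])

    # Central moments for skewness and excess kurtosis.
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n

    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "median": median,
        "trimmed": trimmed,
        "mad": mad,
        "min": data[0],
        "max": data[-1],
        "range": data[-1] - data[0],
        "skew": m3 / sd ** 3,
        "kurtosis": m4 / sd ** 4 - 3,
        "se": sd / math.sqrt(n),
    }


summary = describe([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name:>8}: {value:.4g}")
```

None of these statistics depends on the order of the observations, which is why a table of this kind cannot reveal an error that is confined to a temporal window of the assay.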
Figure 2. Graphical presentation of plasma concentrations of three different biochemical markers (arbitrarily named “Lab1”, “Lab2” and “Lab3”). Left panel: Bar plot with means and standard deviations (error bars). Right panel: Boxplots overlaid with the original data observations. Quartiles and medians (solid horizontal line within the box) were used to construct a “box and whisker” plot. The whiskers add 1.5 times the interquartile range (IQR) to the 75th percentile or subtract 1.5 times the IQR from the 25th percentile and are expected to include 99.3% of the data if normally distributed.
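The 99.3% figure in this caption follows directly from the whisker rule and can be checked with a few lines of Python. The helper `tukey_fences` is a hypothetical name, and its use of the "inclusive" quantile method is an assumption that may differ slightly from the quartile definition used for the published boxplots.

```python
from statistics import NormalDist, quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) whisker limits: quartiles -/+ k times the IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Theoretical coverage of the whisker limits for a standard normal distribution.
std_normal = NormalDist()
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
coverage = std_normal.cdf(upper) - std_normal.cdf(lower)
print(f"whisker limits: {lower:.3f} to {upper:.3f}, coverage: {coverage:.1%}")

# Whisker limits for a small illustrative sample (not the study data).
sample_fences = tukey_fences(list(range(1, 10)))
print("sample fences:", sample_fences)
```

Because the limits depend only on the quartiles, a window of identical values that lies inside the box leaves them almost unchanged, which is consistent with the boxplot failing to flag the “Lab2” error.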
Figure 3. Plot of the single data and their distribution for plasma concentrations of three different biochemical markers (arbitrarily named “Lab1”, “Lab2” and “Lab3”). Top: Matrix heatmap showing the data color‐coded from yellow to red, with red indicating higher values. Bottom: Probability density function (PDF) estimated by means of the Pareto density estimation (PDE [3]; black lines), overlaid on a standard histogram of the data.