Sebastian Malkusch, Lisa Hahnefeld, Robert Gurke, Jörn Lötsch.
Abstract
The evaluation of pharmacological data using machine learning requires high data quality. Therefore, data preprocessing, that is, cleaning analytical laboratory errors, replacing missing values or outliers, and transforming data adequately before actual data analysis, is crucial. Because current tools available for this purpose often require programming skills, preprocessing tools with graphical user interfaces that can be used interactively are needed. In collaboration between data scientists and experts in bioanalytical diagnostics, a graphical software package for data preprocessing called pguIMP is proposed, which contains a fixed sequence of preprocessing steps to enable reproducible interactive data preprocessing. As an R-based package, it also allows direct integration into this data science environment without requiring any programming knowledge. The implementation of contemporary data processing methods, including machine-learning-based imputation techniques, ensures the generation of corrected and cleaned bioanalytical data sets that preserve data structures such as clusters better than is possible with classical methods. This was evaluated on bioanalytical data sets from lipidomics and drug research using k-nearest-neighbors-based imputation followed by k-means clustering and density-based spatial clustering of applications with noise. The R package provides a Shiny-based web interface designed to be easy to use for non-data analysis experts. It is demonstrated that the spectrum of methods provided is suitable as a standard pipeline for preprocessing bioanalytical data in biomedical research domains. The R package pguIMP is freely available at the comprehensive R archive network (https://cran.r-project.org/web/packages/pguIMP/index.html).Entities:
Year: 2021 PMID: 34598320 PMCID: PMC8592507 DOI: 10.1002/psp4.12704
Source DB: PubMed Journal: CPT Pharmacometrics Syst Pharmacol ISSN: 2163-8306
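The abstract's central claim is that instance-based imputation (k-nearest neighbors) preserves cluster structure better than replacing gaps with a global summary value. A minimal sketch of that idea, using scikit-learn as a stand-in for the R package's own routines (pguIMP itself is an R/Shiny package; none of the function names below are its API):

```python
# Sketch: kNN imputation followed by k-means on synthetic "analyte" data,
# mirroring the evaluation described in the abstract (scikit-learn stand-in).
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Two well-separated synthetic clusters (e.g., two treatment groups).
X = np.vstack([rng.normal(0.0, 0.5, (50, 3)),
               rng.normal(5.0, 0.5, (50, 3))])

# Knock out ~10% of values completely at random; keep one column fully
# observed so every instance still has well-defined neighbors.
mask = rng.random(X.shape) < 0.10
mask[:, 0] = False
X_miss = np.where(mask, np.nan, X)

# kNN imputation fills each gap from similar instances, which tends to
# preserve the cluster structure better than a global mean would.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_miss)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_imp)
print(np.isnan(X_imp).sum())   # 0 -- no missing values remain
```

With the clusters this far apart, k-means recovers the original grouping from the imputed data; mean imputation would instead pull the filled-in values toward the grand mean between the clusters.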
FIGURE 1 (a) Flowchart of the data engineering pipeline as it is used in the pguIMP package. The sequence of the individual processes is predefined. The user can choose from different algorithms under each subprocess and adjust the respective process parameters. The user can return to all subprocesses and change algorithms or optimize their parameters if the validation results of the pipeline created are not satisfactory. The result of such an iterative optimization routine is an individual, problem-specific preprocessing pipeline that prepares the data set for the following chemometric analyses. (b) Screenshot of the graphical user interface of pguIMP. (1) The navigation menu under which the individual preprocessing steps are listed. In the example shown, the Transform process is selected. (2) The user can select the parameters for the respective analysis. In the case presented, the user wants to log-transform the lipid mediator C16 sphinganine toward normality. (3) The user runs the preprocessing step using the parameters chosen in (2). (4) After the preprocessing step has been performed, a graphical validation of the process is shown. In the particular case, the deviation of the transformed lipid mediator distribution from a normal distribution is depicted via an overlay (upper left) of the transformed lipid mediator distribution (bar diagram) and the normal distribution (line plot), the residuals between the two distributions (lower left), a quantile–quantile plot (upper right), and the residual distribution (lower right). (LOQ, limit of quantification)
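The transformation step validated graphically in Figure 1b can be illustrated numerically: a right-skewed variable is log-transformed and its closeness to normality compared before and after. Here the Shapiro–Wilk test stands in for pguIMP's graphical overlay/QQ validation; the simulated "lipid mediator" data are purely illustrative:

```python
# Sketch: check normality before and after a log transformation,
# as a numerical analogue of the graphical validation in Figure 1b.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
conc = rng.lognormal(mean=1.0, sigma=0.6, size=200)   # skewed raw data

w_raw, p_raw = stats.shapiro(conc)           # normality clearly rejected
w_log, p_log = stats.shapiro(np.log(conc))   # log scale is close to normal

print(f"raw p={p_raw:.2e}, log p={p_log:.3f}")
```

A low p-value on the raw scale and a much higher one after the transformation correspond to the improved overlay and QQ plot the user inspects in the interface.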
FIGURE 2 Inference about the origin of missing values. Missing values have been simulated either (a) completely at random (CAR) or (b) not at random (NAR). For each variable, the data set was divided into two groups. The first group comprises the instances that were observed in the respective variable (Obs). The second group comprises the instances that were missing in the respective variable (Miss). The value distributions of the two groups were plotted for the remaining variables. This procedure was repeated row-wise for all variables, resulting in a distribution matrix. (c) The probability density function of the sum of significantly different groups per distribution matrix throughout 100 experiments. Significance was tested using the Kruskal–Wallis test with α = 0.05. (C6G, codeine-6-glucuronide; COD, codeine; M3G, morphine-3-glucuronide; M6G, morphine-6-glucuronide; MOR, morphine)
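The Figure 2 procedure can be sketched for a single variable pair: split the data by whether one variable is observed or missing, then compare another (correlated) variable's distribution between the two groups with a Kruskal–Wallis test. The simulated data and cutoffs below are illustrative assumptions, not the paper's actual data:

```python
# Sketch: detect not-at-random (NAR) vs completely-at-random (CAR)
# missingness by comparing observed-vs-missing groups (Figure 2 idea).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 300
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 0.5, n)        # a correlated second analyte

# NAR: values of x above its 70th percentile are missing.
miss = x > np.quantile(x, 0.7)
h_nar, p_nar = stats.kruskal(y[~miss], y[miss])

# CAR: drop the same number of x values purely by lottery.
miss_car = rng.permutation(miss)
h_car, p_car = stats.kruskal(y[~miss_car], y[miss_car])

print(f"NAR p={p_nar:.2e}, CAR p={p_car:.3f}")
```

Under NAR the two groups differ strongly in `y` (tiny p-value), whereas under CAR no systematic difference is expected; counting such significant differences across all variable pairs yields the distribution matrices of Figure 2.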
FIGURE 3 Errors of various imputation methods. Missing values have been simulated either (a) completely at random (CAR) or (b) not at random (NAR) and were subsequently substituted either by the variable mean or median value. Alternatively, the substitution values were machine learned from the remaining variables using the classification and regression tree (CART), k-nearest neighbors (knn), or predictive mean matching (pmm) algorithm. The error is calculated as the root mean squared percentage error (RMSPE).
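The RMSPE comparison of Figure 3 can be reproduced in miniature: simulate CAR gaps in a known data set, impute with both a summary value and kNN, and score each against the held-out truth. The data, missingness rate, and helper `rmspe` below are illustrative assumptions (scikit-learn stand-ins for the package's R methods):

```python
# Sketch: RMSPE of mean vs kNN imputation on simulated CAR gaps (Figure 3).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(7)
# Two concentration regimes, so a global mean is a poor fill-in value.
X = np.vstack([rng.normal(2.0, 0.3, (60, 4)),
               rng.normal(8.0, 0.3, (60, 4))])

mask = rng.random(X.shape) < 0.15     # ~15% missing, completely at random
mask[:, 0] = False                    # keep one column fully observed
X_miss = np.where(mask, np.nan, X)

def rmspe(truth, imputed, m):
    """Root mean squared percentage error over the masked entries."""
    err = (imputed[m] - truth[m]) / truth[m]
    return float(np.sqrt(np.mean(err ** 2)))

X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)

print(rmspe(X, X_mean, mask), rmspe(X, X_knn, mask))
```

Because the column mean falls between the two regimes, its RMSPE is large, while kNN borrows values from same-regime neighbors and scores markedly lower, matching the ranking reported in Figure 3.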
FIGURE 4 Graphical validation of the effect of data preprocessing on unsupervised cluster analysis using factorial instance plots on the principal component map. For all experiments, data preprocessing incorporated data transformation (Ln) and normalization (minimum–maximum). Outliers are defined variable-wise by using Grubbs' test for outliers with α = 0.05. Variable values deviating from normality were identified in five instances (1, 8, 10, 12, 17). Further preprocessing incorporated three different methods of outlier handling: outliers were left untouched (Row 1; a–e), variable values in outlier instances were replaced by the respective variable median (Row 2; f–j), and variable values in outlier instances were imputed based on the remaining instances via k-nearest neighbors (Row 3; k–o). The cluster separation as proposed by various unsupervised cluster analysis methods trained on the first two principal components of the preprocessed data is shown column-wise. (a, f, k) Black polygons visualize the cluster separation according to the original labeling of the data set. (b, g, l) Black polygons visualize the cluster separation following k-means clustering. (c, h, m) Dendrogram according to the ordering of points to identify the clustering structure (OPTICS). (d, i, n) Reachability plot according to OPTICS. (e, j, o) Black polygons visualize the cluster separation following density-based spatial clustering of applications with noise (DBSCAN) as extracted from the OPTICS analysis by applying a distance threshold (dashed line in c, h, m and d, i, n). The color code visualizes the true cluster separation as proposed by the original data labeling (Treatment A, blue; Treatment B, green; control, magenta). The numbers represent the instances of the data. Gray numbers indicate instances with regular variable values, black numbers indicate outlier instances. (ID, instance identification label; PC1, principal component 1; PC2, principal component 2)
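The outlier definition used in Figure 4, a variable-wise Grubbs' test, can be sketched as a one-pass two-sided test: flag the single most extreme value if its studentized deviation exceeds the t-based critical value. The function below is an illustrative implementation of the standard Grubbs formula, not pguIMP's own code; the flagged values would then be handed to median replacement or kNN imputation as in the figure:

```python
# Sketch: one-pass two-sided Grubbs' test for a single outlier (Figure 4).
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return the index of the most extreme value if it is a Grubbs
    outlier at level alpha, else None (standard t-based critical value)."""
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    g = z.max()
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return int(z.argmax()) if g > g_crit else None

rng = np.random.default_rng(3)
v = rng.normal(10, 1, 20)
v[4] = 25.0                      # plant one gross analytical error

print(grubbs_outlier(v))         # 4 -- the planted value is flagged
```

In practice the test is applied per variable, and iteratively if more than one outlier is suspected; Figure 4 then compares leaving the flagged values untouched, median-replacing them, or re-estimating them by kNN.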