Kieu Trinh Do, Simone Wahl, Johannes Raffler, Sophie Molnos, Michael Laimighofer, Jerzy Adamski, Karsten Suhre, Konstantin Strauch, Annette Peters, Christian Gieger, Claudia Langenberg, Isobel D Stewart, Fabian J Theis, Harald Grallert, Gabi Kastenmüller, Jan Krumsiek.
Abstract
BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, the various sources of missing values, and the strategies for handling them, have received little systematic assessment. Missing values can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or at random, for instance as a consequence of sample preparation.
Keywords: Batch effects; K-nearest neighbor; Limit of detection; MICE; Mass spectrometry; Missing values imputation; Untargeted metabolomics
Year: 2018 PMID: 30830398 PMCID: PMC6153696 DOI: 10.1007/s11306-018-1420-2
Source DB: PubMed Journal: Metabolomics ISSN: 1573-3882 Impact factor: 4.290
Fig. 1 Flow chart of the study design. Pre-processed KORA F4 metabolomics data were used to analyze patterns of missing values in the dataset. Possible underlying mechanisms were inferred and implemented in a simulation framework to generate data resembling the observed patterns. Based on these simulated data, imputation methods with different characteristics were applied and evaluated. Finally, the same imputation approaches were evaluated using KORA F4 metabolomics and genomics data.
Fig. 2 Overall amounts of missing data and LOD effects. a, b The overall fraction of missing values across metabolites and observations, respectively. c, d Scatter plots and boxplots of selected metabolite pairs to illustrate missing data due to LOD and non-LOD effects, respectively. Blue: observed concentrations. Red: observed values of the auxiliary metabolite in observations with missing values of the investigated metabolite. Note that red data points are not part of the x-axis but were plotted in the same scatterplot for clarity. corr, correlation; p, p-value of correlation; = p-value of Wilcoxon–Mann–Whitney test
Fig. 3 Run day-dependent effects on missing data. a Normalized amount of missing values per run day in each platform (LC/MS+, LC/MS−, GC/MS). For a given metabolite and run day, the normalized amount of missing data was calculated as the number of missing values for that metabolite on that run day, divided by the total number of observations for that run day, and then divided by the median amount of missing data of that metabolite over all run days. Thus, a normalized run day-missingness of 1 corresponds to the typical (median) run day-missingness for a given metabolite. Pearson correlation coefficients were calculated across all pairs of platforms. b Standard deviation of missing values across run days, depending on the total amount of missing data for each platform. Each dot in the plot shows the total proportion of missing values and the run day variation for one metabolite. c, d The distribution of the total amount of missing values is shown for a metabolite with moderate (ursodeoxycholate) and high (gamma-glutamylisoleucine) standard deviation.
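The normalization described in the Fig. 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalized_runday_missingness`, the use of `None` to mark a missing value, and the toy data are all hypothetical and not taken from the study. The case of a zero median (more than half of the run days without missing values) is not addressed in the caption; returning NaN here is an assumption.

```python
from statistics import median


def normalized_runday_missingness(values, run_days):
    """Per-run-day missing fraction for one metabolite, divided by its median.

    values   -- measurements for one metabolite, None marks a missing value
    run_days -- run-day label for each measurement (same length as values)
    """
    totals, missing = {}, {}
    for value, day in zip(values, run_days):
        totals[day] = totals.get(day, 0) + 1
        missing[day] = missing.get(day, 0) + (value is None)

    # Fraction of missing values on each run day.
    fraction = {day: missing[day] / totals[day] for day in totals}

    # Normalize by the median fraction over all run days.
    med = median(fraction.values())
    if med == 0:
        # Assumption: the ratio is undefined when the median is zero.
        return {day: float("nan") for day in fraction}
    return {day: frac / med for day, frac in fraction.items()}


# Hypothetical toy data: three run days with 1, 2, and 3 of 4 values missing.
values = [1.2, None, 0.9, 1.1,
          None, None, 1.0, 1.3,
          None, None, None, 0.8]
run_days = ["d1"] * 4 + ["d2"] * 4 + ["d3"] * 4
normalized = normalized_runday_missingness(values, run_days)
```

In this toy example the per-day fractions are 0.25, 0.5, and 0.75, so the median day `d2` gets a normalized value of 1, while `d1` and `d3` fall below and above it.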
Fig. 4 Run day-dependent LOD. a Histogram of Pearson correlation coefficients of the percent of missing values and run day means. b Scatterplot of run day mean versus percent missing values, with 7-methylxanthine as an example of a negative correlation. c Run day distributions of 7-methylxanthine before run day normalization.
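The quantity summarized in Fig. 4a, the Pearson correlation between the percent of missing values and the run day mean of one metabolite, might be computed roughly as follows. The helper names, the `None` convention for missing values, the choice to skip run days without any observed value, and the toy data are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def runday_mean_vs_missing(values, run_days):
    """Correlate percent missing per run day with the run day mean."""
    by_day = {}
    for value, day in zip(values, run_days):
        by_day.setdefault(day, []).append(value)

    means, pct_missing = [], []
    for observations in by_day.values():
        observed = [v for v in observations if v is not None]
        if not observed:
            continue  # assumption: skip days with no observed value (no mean)
        means.append(sum(observed) / len(observed))
        pct_missing.append(100.0 * (len(observations) - len(observed)) / len(observations))
    return pearson(pct_missing, means)


# Hypothetical toy data: lower run day means coincide with more missing values.
values = [1.0, None, None, 1.0,
          2.0, 2.0, None, 2.0,
          3.0, 3.0, 3.0, 3.0]
run_days = ["d1"] * 4 + ["d2"] * 4 + ["d3"] * 4
r = runday_mean_vs_missing(values, run_days)
```

A strongly negative coefficient, as in this constructed example, is the pattern one would expect if values on low-intensity run days fall below a detection limit more often.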
Fig. 5 Mechanisms of missing data and imputation approaches used in the simulation study. a–e Mechanisms of missing values used in the simulation study, based on evidence from real metabolomics data. f Venn diagram of imputation methods showing different characteristics. Note that the figure contains complete case analysis (CCA), which is not an imputation method, and is noted in brackets. CCA and mean were placed outside the Venn diagram, as they do not comprise any of the four characteristics. LOD limit of detection
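The caption does not spell out the individual mechanisms in panels a–e, so the following Python sketch only illustrates the two broad classes named in the abstract: systematic missingness from a limit of detection and random missingness. The function names, the fixed (non-probabilistic) LOD threshold, and the parameter values are all assumptions for illustration; this is not the simulation framework used in the study.

```python
import random


def censor_below_lod(values, lod):
    """Systematic missingness: values below the detection limit become None."""
    return [v if v >= lod else None for v in values]


def drop_at_random(values, prob, rng):
    """Non-systematic missingness: each value is dropped with probability prob."""
    return [None if rng.random() < prob else v for v in values]


# Hypothetical example: 200 simulated concentrations from a normal distribution.
rng = random.Random(0)
concentrations = [rng.gauss(10.0, 2.0) for _ in range(200)]

lod_missing = censor_below_lod(concentrations, lod=8.0)
random_missing = drop_at_random(concentrations, prob=0.2, rng=rng)
```

The key difference is that under the LOD mechanism the missing entries are exactly the lowest values, whereas under random dropping the missing entries are unrelated to concentration.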
Fig. 6 Simulation results for Pearson, partial correlation, and logistic regression analysis. Performance of imputation approaches in data scenarios where (a) both variables followed a run day-specific probabilistic LOD mechanism, (b) both variables showed non-systematic patterns of missing data, and (c) one variable followed a run day-specific probabilistic LOD mechanism and the other showed non-systematic patterns of missing data. Type 1 error and power reflect the false positive and true positive rate of hypothesis testing, respectively. Note that power = 1 − type 2 error rate. For readability, only KNN-based imputation methods with K = 3, 10, and 20 were included; KNN imputation with K = 1 and 5 can be found in File S5.
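The relationship between type 1 error, power, and type 2 error stated in the Fig. 6 caption can be made concrete with a small Python sketch. It assumes that each simulation run yields one p-value and that a test is called significant below a threshold `alpha`; the value 0.05 and the p-value lists are hypothetical and not taken from the study.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulation runs in which the null hypothesis is rejected."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values from runs simulated WITHOUT a true association.
p_null = [0.50, 0.03, 0.72, 0.21, 0.64, 0.88, 0.09, 0.35, 0.47, 0.95]
# Hypothetical p-values from runs simulated WITH a true association.
p_alt = [0.001, 0.04, 0.20, 0.01, 0.03, 0.002, 0.30, 0.049, 0.02, 0.004]

type1_error = rejection_rate(p_null)  # false positive rate
power = rejection_rate(p_alt)         # true positive rate
type2_error = 1.0 - power             # missed true associations
```

An imputation method that inflates the first rate produces spurious findings, while one that lowers the second loses real associations, which is why both are reported side by side.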
Fig. 7 Evaluation of imputation approaches on real data. a Pathway-based modularity for each imputation strategy. Modularity was calculated based on pathways. Vertical lines represent bootstrap-based confidence intervals (1000 resamples). b The ability to gain statistical power and to preserve real metabolite-SNP associations after imputation. Circle color represents the ability of imputation methods to preserve effect sizes, with red and blue indicating possible overestimation and underestimation, respectively, and yellow corresponding to cases with good preservation of the association. Circle size depicts the gain in statistical power after imputation: the larger the circle, the higher the statistical power gain after imputation compared to CCA. Squares correspond to cases where no statistical power was gained. For readability, only KNN-based imputation methods with K = 3, 10, and 20 were included; KNN imputation with K = 1 and 5 can be found in File S6 and Table S8.
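A bootstrap confidence interval with 1000 resamples, as mentioned for Fig. 7a, could be obtained along the following lines. The caption does not say which bootstrap interval was used, so the percentile interval below is an assumption, as are the function name, the 95% level, and the use of the mean as a stand-in statistic (the study's statistic is pathway-based modularity, which is not reproduced here).

```python
import random


def bootstrap_percentile_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample the observations with replacement and recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower_idx = max(0, round((alpha / 2) * n_boot))
    upper_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return estimates[lower_idx], estimates[upper_idx]


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical data standing in for per-sample contributions to a score.
scores = [0.42, 0.38, 0.51, 0.47, 0.35, 0.44, 0.49, 0.40, 0.46, 0.43,
          0.39, 0.52, 0.41, 0.45, 0.37, 0.48, 0.50, 0.36, 0.44, 0.42]
ci_low, ci_high = bootstrap_percentile_ci(scores, mean)
```

Fixing the seed makes the interval reproducible; in practice the resampling unit (samples, metabolites, or pathways) determines what the interval means and should match the study design.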