Joseph Antonelli1,2, Brian L Claggett2, Mir Henglin2, Andy Kim2,3, Gavin Ovsak2, Nicole Kim2, Katherine Deng2, Kevin Rao2, Octavia Tyagi2, Jeramie D Watrous4, Kim A Lagerborg3, Pavel V Hushcha2, Olga V Demler5, Samia Mora2,5, Teemu J Niiranen6,7, Alexandre C Pereira8, Mohit Jain9, Susan Cheng10,11,12.
Abstract
High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation, and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for greater standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.
Keywords: high-dimensional data; large-scale metabolomics; statistical methods
Year: 2019 PMID: 31336989 PMCID: PMC6680705 DOI: 10.3390/metabo9070143
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Statistical considerations for human metabolomics data.
| Consideration | Notes and Examples |
|---|---|
| Missingness | Patterns of missing values tend to be non-random and are even sometimes predictable. For example, missing values may often but not always be more frequent for metabolites that are intrinsically low in abundance when measured from a given tissue type. Missingness may be due to biological and/or technical reasons. |
| Data distributions | Many but not all metabolites tend to demonstrate right-skewed distributions in most types of human studies (e.g., healthy controls or disease-specific referral samples). Certain metabolites will display a substantial proportion of zero values that may be considered true zero values based on biology (an issue to be considered along with but distinguished from missingness). |
| Intercorrelations | Intercorrelations between metabolites may well reflect clustering of small molecules by known or (mostly) unknown biological pathways. Intercorrelations will vary widely depending on a given exposure or background, chronic disease status, and other yet unidentified factors. Intercorrelations will also vary depending on the underlying mass spectrometry (MS) method used to create a given dataset (i.e., nontargeted vs. targeted, and the specific technical methods used). |
| Time-dependence | Whereas a portion of the human metabolome changes dynamically in response to acute perturbation or stress, many other metabolites display variation only over several days to weeks in response to subacute perturbations; other portions of the metabolome may yet exhibit relatively little change over time, except in response to major chronic exposures. |
| Confounding factors | Metabolite values will vary in response to factors that are measurable as well as factors that are not easily measurable for a given study, such as acute and chronic dietary patterns, microbiota, and environmental exposures. |
Statistical analysis methods for outcomes analyses of human metabolomics data.
| Method | Univariate or Multivariate | Handling Binary Outcome | Handling Continuous Outcome | | Metabolite Selection | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| Multiple tests (e.g., univariate linear regression) with Bonferroni correction | Univariate | Yes | Yes | Yes | Yes | Simple, easy to use and interpret results | Very conservative and does not account for intercorrelation |
| Multiple tests with false discovery rate (FDR) | Univariate | Yes | Yes | Yes | Yes | Simple, easy to use, less conservative than Bonferroni correction | Does not account for intercorrelation among features |
| Principal component analysis (PCA) | Multivariate | Yes | Yes | No | No | Effective for variable reduction | No intrinsic clarity on how to select or rank variables |
| Sparse partial least squares (SPLS) | Multivariate | No | Yes | No | Yes | Can quickly find a subset of variables that predicts the outcome well | Multiple tuning parameters must be chosen via cross-validation |
| Linear discriminant analysis (LDA) | Multivariate | Yes | No | No | Yes | Simple and works for categorical outcomes | Cannot handle large numbers of features |
| Least absolute shrinkage and selection operator (LASSO) | Multivariate | Yes | Yes | No | Yes | Can quickly find a subset of variables that predicts the outcome well | May not perform well for metabolite selection when the features are highly correlated |
| Random forests and other machine learning approaches | Multivariate | Yes | Yes | No | No | Can find complex relationships between variables | Less efficient than linear methods when the underlying relationships are truly linear |
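To make the contrast between the first two rows of the table concrete, the following is a minimal sketch of how Bonferroni and false discovery rate adjustments might be applied to a vector of per-metabolite p-values. The FDR procedure shown is Benjamini-Hochberg, which is an assumption here since the table does not name a specific FDR method; the function names and the example p-values are hypothetical and for illustration only.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (one common FDR procedure)."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per metabolite, from univariate regressions.
example_pvals = [0.001, 0.008, 0.02, 0.03, 0.6]
alpha = 0.05
n_bonferroni = sum(p < alpha for p in bonferroni(example_pvals))
n_fdr = sum(p < alpha for p in benjamini_hochberg(example_pvals))
print("Selected by Bonferroni:", n_bonferroni)
print("Selected by Benjamini-Hochberg:", n_fdr)
```

With these illustrative inputs the FDR adjustment retains more metabolites than Bonferroni, consistent with the table's description of it as less conservative. Neither adjustment uses the correlation between metabolites.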
Figure 1. Metabolite data transformation and centering. A frequently used approach for managing metabolite data collected in a large human cohort study involves log-transforming each metabolite measure and centering the data on the plate median to account for batch-to-batch variation. Interestingly, variable transformation can reveal multimodal distributions.
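The transformation described in this caption can be sketched as follows for a single metabolite. This is an illustration under stated assumptions rather than the authors' pipeline: the natural logarithm is assumed (the caption does not specify a base), all values are assumed to be strictly positive with zeros and missing values handled beforehand, and the function name, plate labels, and intensities are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def log_transform_and_center(values, plates):
    """Log-transform one metabolite and subtract each plate's median.

    values: strictly positive raw intensities, one per sample (assumption).
    plates: plate (batch) label for each sample, in the same order.
    """
    logged = [math.log(v) for v in values]

    # Group the log-transformed values by plate.
    by_plate = defaultdict(list)
    for x, plate in zip(logged, plates):
        by_plate[plate].append(x)

    # Center every sample on the median of its own plate.
    plate_median = {plate: median(xs) for plate, xs in by_plate.items()}
    return [x - plate_median[plate] for x, plate in zip(logged, plates)]


# Hypothetical data: plate B reads about tenfold higher than plate A.
raw = [100, 200, 400, 1000, 2000, 4000]
plate_labels = ["A", "A", "A", "B", "B", "B"]
centered = log_transform_and_center(raw, plate_labels)
print([round(x, 3) for x in centered])
```

After centering, the two plates share a median of zero on the log scale, so the tenfold plate-level offset in the raw values no longer separates the samples.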
Figure 2. Actual and simulated metabolomics data. Previously analyzed data, or prior detailed knowledge of the structure of metabolomics data collected from an existing large epidemiologic cohort study (a), can be used to construct simulated data that mimic the data structure observed from real measures (b). These simulated data can be used to estimate statistical power, based on one or more methods of analyses, for planning the design of a future study.
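A simulation-based power estimate of the kind this caption describes might look like the sketch below. It is deliberately simplified and is not the authors' simulation: it generates a single standardized, log-scale metabolite from a normal distribution instead of reproducing the structure of a real cohort, assumes a linear effect `beta` on a continuous outcome, tests the association with a Fisher z approximation, and applies a Bonferroni threshold for a hypothetical number of tests. All names and parameter values are assumptions.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def estimate_power(n_subjects, beta, n_tests, n_sims=500, alpha=0.05, seed=0):
    """Fraction of simulated studies in which the metabolite is detected."""
    rng = random.Random(seed)
    threshold = alpha / n_tests  # Bonferroni-corrected significance level
    z_crit = NormalDist().inv_cdf(1 - threshold / 2)  # two-sided cutoff
    hits = 0
    for _ in range(n_sims):
        # Simulated standardized metabolite (log scale) and outcome.
        x = [rng.gauss(0, 1) for _ in range(n_subjects)]
        y = [beta * xi + rng.gauss(0, 1) for xi in x]
        # Fisher z statistic for the correlation between x and y.
        z = math.atanh(pearson_r(x, y)) * math.sqrt(n_subjects - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits / n_sims


# Hypothetical planning question: 1000 metabolites tested, modest effect.
for n in (100, 500):
    print(n, estimate_power(n_subjects=n, beta=0.3, n_tests=1000, n_sims=200))
```

In practice the simulated metabolites would be drawn to match the distributions and intercorrelations seen in existing data, and the same loop could be repeated with other analysis methods to compare their power.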
Figure 3. Using multiple statistical methods to evaluate results in a real-life application involving analyses of large cohort metabolite data. We related a panel of bioactive lipid molecule metabolites (i.e., eicosanoids) to putative derivative substrates (i.e., eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA)), which were considered in all analyses as the outcomes of clinical relevance and interest. We used multiple different statistical methods and compared results. Metabolites are denoted by mass-to-charge (m/z) ratio and retention time (rt, in minutes) using the m/z_rt convention, and are listed in rank order for each outcome (EPA or DHA) according to performance metrics provided by each model.