David A Hughes1,2, Kurt Taylor1,2, Nancy McBride1,2,3, Matthew A Lee1,2, Dan Mason4, Deborah A Lawlor1,2,3, Nicholas J Timpson1,2, Laura J Corbin1,2.
Abstract
MOTIVATION: Metabolomics is an increasingly common part of health research and there is a need for pre-analytical data processing. Researchers typically need to characterise the data and to exclude errors within the context of the intended analysis. While some pre-processing steps are common, there is currently a lack of standardization and reporting transparency for these procedures.
Year: 2022 PMID: 35134881 PMCID: PMC8963298 DOI: 10.1093/bioinformatics/btac059
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Brief description of the metaboprep pipeline. The six primary steps of the pipeline are outlined along the top. The column on the left outlines the steps for generating summary statistics, whilst the column on the right outlines the steps taken for sample and metabolite filtering. Abbreviations: ‘dme’, derived measures excluded; SD, standard deviations; ‘X’, a threshold variable defined by the user in the pipeline parameter file; PCs, principal components.
Summary statistics for the initial, raw (prefiltered) BiB_MS-1 and ALSPAC_F24 datasets
| | BiB_MS-1 | ALSPAC_F24 |
|---|---|---|
| Platform | Metabolon | Nightingale Health |
| No. of samples | 1000 | 3361 |
| No. of metabolites | 1369 | 225 |
| % sample missingness (min, median, max) | 11.85, 18.45, 26.81 | 0.00, 0.00, 12.24 |
| TSA at complete metabolites (min, median, max) | 1.85, 2.35, 2.98 (×10³) | 3.99, 4.31, 4.75 (×10³) |
| Count of outlying data points per sample (min, median, max) | 0, 5, 105 | 0, 0, 48 |
| % metabolite missingness (min, median, max) | 0, 2.6, 100 | 0.00, 0.00, 1.71 |
| Count of outlying data points per metabolite (min, median, max) | 0, 2, 99 | 0, 2, 344 |
| % with | 15.49 | 42.22 |
| % whose | 9.2 | 44.89 |
| No. of representative metabolites | 512 | 24 |
Note: The table provides details on the platform, sample size, sample and metabolite missingness, TSA for samples, outlier counts, the percentage of metabolites that may be considered normally distributed, and an estimate of the number of representative metabolites in the dataset.
Calculated after the exclusion of derived variables in the Nightingale Health dataset and of xenobiotics in the Metabolon dataset.
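The per-sample summaries reported in the table can be illustrated with a minimal sketch. This is not the metaboprep implementation: the function names are hypothetical, missing values are represented as `None`, and TSA is assumed here to be the per-sample sum of abundance values restricted to metabolites with no missing data.

```python
from statistics import median


def sample_missingness_pct(matrix):
    """Percent of missing (None) values in each sample (row)."""
    return [100.0 * sum(v is None for v in row) / len(row) for row in matrix]


def complete_feature_indices(matrix):
    """Column indices that have no missing value in any sample."""
    n_cols = len(matrix[0])
    return [j for j in range(n_cols) if all(row[j] is not None for row in matrix)]


def tsa_at_complete_features(matrix):
    """Per-sample sum of abundances over complete features only (assumed TSA definition)."""
    cols = complete_feature_indices(matrix)
    return [sum(row[j] for j in cols) for row in matrix]


def min_median_max(values):
    """The (min, median, max) triple used in the summary table."""
    return (min(values), median(values), max(values))


# Toy data: 3 samples (rows) x 4 metabolites (columns); None marks a missing value.
toy = [
    [1.0, 2.0, None, 4.0],
    [2.0, None, None, 1.0],
    [3.0, 1.0, 5.0, 2.0],
]
summary = min_median_max(sample_missingness_pct(toy))
```

Restricting the sum to complete metabolites keeps the totals comparable across samples, since no sample is penalised for values that are simply absent.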
Results of sample and metabolite filtering based on default exclusion thresholds
| Filtering step | Exclusion threshold | BiB_MS-1 | ALSPAC_F24 |
|---|---|---|---|
| Raw dataset (prefiltering) | | 1000 samples | 3361 samples |
| | | 1369 metabolites | 225 metabolites |
| 1. Extreme sample missingness | ≥80% | 0 | 0 |
| 2. Extreme metabolite missingness | ≥80% | 96 | 0 |
| 3. Sample missingness | ≥20% | 3 | 0 |
| 4. Metabolite missingness | ≥20% | 236 | 0 |
| 5. Sample TSA | >5SD | 1 | 3 |
| 7. PCA outliers | >5SD | 11 | 0 |
| Final dataset (post-filtering) | | 985 samples | 3358 samples |
| | | 1037 metabolites | 225 metabolites |
Note: PCA, principal component analysis; SD, standard deviations.
Calculated after excluding metabolites in the xenobiotic class from Metabolon data and derived measures from Nightingale Health data.
User-defined threshold. Rows in blue are sample filtering steps.
Derived from complete metabolites only.
Excluding metabolites with >20% missingness.
Using the representative metabolites only and excluding on the number of PCs determined by the acceleration factor with a minimum of two PCs.
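The missingness and TSA filtering steps in the table can be sketched as follows. This is an illustrative approximation under stated assumptions, not the pipeline's own code: the helper names are hypothetical, missing values are `None`, missingness is recomputed after each exclusion, and the ">5SD" TSA rule is read as a two-sided distance from the mean TSA.

```python
from statistics import mean, stdev


def row_missing(matrix):
    """Fraction of missing values per sample (row)."""
    return [sum(v is None for v in row) / len(row) for row in matrix]


def col_missing(matrix):
    """Fraction of missing values per metabolite (column)."""
    n = len(matrix)
    return [sum(row[j] is None for row in matrix) / n for j in range(len(matrix[0]))]


def keep_rows(matrix, keep):
    return [row for row, k in zip(matrix, keep) if k]


def keep_cols(matrix, keep):
    return [[v for v, k in zip(row, keep) if k] for row in matrix]


def filter_missingness(matrix, extreme=0.8, threshold=0.2):
    """Steps 1-4: exclude at the extreme cutoff first, then at the stricter one."""
    for cutoff in (extreme, threshold):
        matrix = keep_rows(matrix, [m < cutoff for m in row_missing(matrix)])
        matrix = keep_cols(matrix, [m < cutoff for m in col_missing(matrix)])
    return matrix


def filter_tsa(matrix, n_sd=5.0):
    """Step 5: drop samples whose TSA lies more than n_sd SDs from the mean (assumed two-sided)."""
    complete = [j for j in range(len(matrix[0]))
                if all(row[j] is not None for row in matrix)]
    tsa = [sum(row[j] for j in complete) for row in matrix]
    mu, sd = mean(tsa), stdev(tsa)
    return keep_rows(matrix, [abs(t - mu) <= n_sd * sd for t in tsa])
```

Running the extreme (80%) pass before the 20% pass means that a few almost-empty samples or metabolites do not inflate the missingness of everything else before the stricter threshold is applied.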
Fig. 2. Summary figure found in each HTML report for the filtered dataset. This summary figure for the BiB dataset has seven panels. (1) The distribution of sample missingness. (2) The distribution of feature missingness. (3) The distribution of TSA, at complete features only. (4) A hierarchical clustering dendrogram based on absolute Spearman rho distances (1-rho) and cut at a tree cut height (red horizontal line) defined by the user. Blue branches on the dendrogram denote the ‘representative’ features used in the PCA. (5) A table of the number of metabolites used at each step of the dendrogram and PCA. (6) A scree plot of the variance explained by each PC, with vertical lines marking the number of PCs estimated to be informative by the acceleration factor of Cattell’s scree test (red, n = 2) and by parallel analysis (green, n = 49). (7) A PC plot of the top two PCs for each sample. The number of metabolites used in the analysis is again indicated in the title of the PC plot. Individuals in the PC plot were clustered into four k-means (k) clusters, using data from the top two PCs. The k-means clustering and colour coding serve only to help visualize the major axes of variation in the sample population(s).
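The acceleration factor mentioned in panel (6) can be illustrated with a small sketch. It assumes one common formulation, in which the acceleration factor is the second difference of the ordered eigenvalue curve and the number of informative PCs is the count of components preceding the point of maximum acceleration. Whether the pipeline computes it in exactly this way is an assumption, and the function names are hypothetical. The default minimum of two PCs follows the table footnote.

```python
def acceleration_factors(eigenvalues):
    """Second difference of the scree curve at each interior eigenvalue."""
    e = eigenvalues
    return [e[i + 1] - 2 * e[i] + e[i - 1] for i in range(1, len(e) - 1)]


def n_pcs_by_acceleration(eigenvalues, minimum=2):
    """Number of PCs preceding the sharpest elbow, with a lower bound."""
    af = acceleration_factors(eigenvalues)
    # af[k] belongs to the eigenvalue at 0-based index k + 1, so the number
    # of components that precede the elbow is also k + 1.
    n_before_elbow = af.index(max(af)) + 1
    return max(n_before_elbow, minimum)


# Eigenvalues sorted in decreasing order; the curve flattens after the third.
example = n_pcs_by_acceleration([9, 8, 7, 1, 1])
```

Because this criterion looks only for the single sharpest bend in the scree curve, it tends to return far fewer PCs than parallel analysis, consistent with the n = 2 versus n = 49 contrast reported in the caption.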