J L Min, G Hemani, G Davey Smith, C Relton, M Suderman.
Abstract
Motivation: DNA methylation datasets are growing ever larger in both sample size and genome coverage. Novel computational solutions are required to handle these data efficiently.
Year: 2018 PMID: 29931280 PMCID: PMC6247925 DOI: 10.1093/bioinformatics/bty476
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The workflow of meffil
Comparison between software packages on a server with 16 available processors
| | meffil | minfi | meffil | bigMelon | bigMelon | DiMmeR |
|---|---|---|---|---|---|---|
| Number of samples | 1000 | 1000 | 5469 | 5469 | 5469 | 5469 |
| Normalization method | FN | FN | FN | Dasen | Dasen | QN |
| Platform | R | R | R | R | R | Java |
| Size of summary (Gb) | 0.2 | | 0.8 | | | |
| Memory (Gb) | 3/5 | 15 | 3/67 | 57 | 12 | 4.4 |
| Time (min) | 16 | 54 | 180 | 350 | 450 | 82 |
| Size of output (Gb) | 3.5 | 2.8 | 17 | 90 | 90 | |
First bigMelon column: bigMelon applied with chunksize set to 500.
Second bigMelon column: bigMelon applied with chunksize set to 100.
Only meffil generates a summary object.
If the output from meffil is a matrix in R, then memory use peaks at 67 Gb. If the output is saved to a ‘gdsfmt’ (Zheng et al., 2017) file, as bigMelon does, then memory use peaks at 3 Gb. The running time is the same for both options.
DiMmeR does not save output until after a permutation-based EWAS is run. We terminated the analysis after normalization, so the output size was not determined.
Fig. 2. Effect of adjusting for ‘slide’ or ‘plate’ as a random effect. True positive rates (TPRs) are consistently higher in a downstream EWAS when variation due to ‘slide’ effects in ARIES (a) and ‘plate’ effects in GOYA (b) is removed using random effects models. Random effects models were applied either to probe quantiles along with control variation in FN (‘FN+re’) or during the EWAS (‘FN+ewas.re’). TPRs were estimated by comparison to associations from a large meta-analysis (Joubert et al.).
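As a rough illustration of how a TPR can be estimated against a reference meta-analysis, the sketch below treats the reference associations as the set of true positives and counts how many of them the EWAS recovers. This is a simplifying assumption and not necessarily the exact procedure used in the paper; the function names and the CpG identifiers are hypothetical.

```python
def true_positive_rate(ewas_hits, reference_hits):
    """Fraction of reference associations that the EWAS also reports.

    Assumption: every site in reference_hits is a genuine association.
    """
    reference = set(reference_hits)
    if not reference:
        raise ValueError("reference_hits must not be empty")
    return len(reference & set(ewas_hits)) / len(reference)


def false_positive_rate(ewas_hits, reference_hits, tested_sites):
    """Fraction of tested sites outside the reference set that the EWAS reports."""
    negatives = set(tested_sites) - set(reference_hits)
    if not negatives:
        raise ValueError("no tested sites fall outside the reference set")
    return len(negatives & set(ewas_hits)) / len(negatives)


# Made-up CpG identifiers, for illustration only.
tested = [f"cg{i:04d}" for i in range(1, 11)]          # cg0001 .. cg0010
reference = ["cg0001", "cg0002", "cg0003", "cg0004"]   # "known" associations
hits = ["cg0001", "cg0002", "cg0009"]                  # sites the EWAS reports

tpr = true_positive_rate(hits, reference)
fpr = false_positive_rate(hits, reference, tested)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Comparing these two rates across normalization settings (for example with and without a random effect for slide or plate) gives the kind of contrast the figure summarizes.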
Fig. 3. Parameter selection for FN. The main parameter for FN is the number of principal components of control variation with which to normalize probe quantiles. Screeplots (a, b, d, e) show the metric used by meffil for choosing the optimal number of principal components in ARIES (a, b, c) and GOYA (d, e, f): the amount of probe quantile variation left unexplained by the principal components under 10-fold cross-validation. The explained variation is mainly due to technical variance, as the control probes should not be correlated with biological signal (Supplementary Material). Screeplots (a, d) show the variation without regressing out random effects, whereas plots (b) and (e) show the variation after regressing out slide (b) or plate (e) as a random effect. Plots (c) and (f) compare true and false positive rates in a downstream EWAS of prenatal smoking in ARIES (c) and GOYA (f) after normalizing with different numbers of principal components and regressing out slide or plate as a random effect. TPRs were estimated by comparison to associations from a large meta-analysis (Joubert et al.).
Fig. 4. Meta-analysis with normalized data. Data can be normalized using meffil as illustrated in (a) by generating QC objects for each dataset, sending them to a normalization server for normalization and then sending them back to each dataset to complete normalization of each sample. (b) The heterogeneity statistic τ² is shown for CpG sites in the meta-analyses of age performed with and without normalizing the seven datasets together prior to meta-analysis. The top plot shows heterogeneity when ISVA is used to generate surrogate covariates and the bottom plot when SVA is used instead. CpG sites shown in the plot are those identified as associated with age in the EWAS of the combined dataset: 2486 associations for ISVA and 7697 for SVA. The dark diagonal line shows y = x and the grey line shows the regression line.
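For readers unfamiliar with the heterogeneity statistic, the sketch below computes τ² for a single CpG site from per-dataset effect estimates and standard errors. It uses the DerSimonian-Laird moment estimator, which is an assumption: the caption does not say which estimator was used. The function name and the input values are hypothetical.

```python
def dersimonian_laird_tau2(effects, std_errors):
    """Between-study variance (tau^2) by the DerSimonian-Laird moment estimator.

    effects:    per-dataset effect estimates for one CpG site
    std_errors: their standard errors (same order, all positive)
    """
    k = len(effects)
    if k != len(std_errors) or k < 2:
        raise ValueError("need matching inputs from at least two datasets")

    # Inverse-variance weights and the fixed-effect pooled estimate.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))

    # Scaling constant; tau^2 is truncated at zero when Q < k - 1.
    c = total_w - sum(w * w for w in weights) / total_w
    return max(0.0, (q - (k - 1)) / c)


# Made-up estimates for one CpG site across seven datasets.
effects = [0.012, 0.015, 0.009, 0.020, 0.011, 0.014, 0.017]
std_errors = [0.004, 0.005, 0.003, 0.006, 0.004, 0.005, 0.004]
print(dersimonian_laird_tau2(effects, std_errors))
```

A τ² of zero means the spread of the per-dataset estimates is no larger than their standard errors would predict; larger values indicate disagreement between datasets.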