
Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi.

Jean-Philippe Fortin, Timothy J Triche, Kasper D Hansen.

Abstract

Summary: The minfi package is widely used for analyzing Illumina DNA methylation array data. Here we describe modifications to the minfi package required to support the HumanMethylationEPIC ('EPIC') array from Illumina. We discuss methods for the joint analysis and normalization of data from the HumanMethylation450 ('450k') and EPIC platforms. We introduce the single-sample Noob (ssNoob) method, a normalization procedure suitable for incremental preprocessing of individual methylation arrays, and conclude that this method should be used when integrating data from multiple generations of Infinium methylation arrays. We show how to use reference 450k datasets to estimate cell type composition of samples on EPIC arrays. The cumulative effect of these updates is to ensure that minfi provides the tools to best integrate existing and forthcoming Illumina methylation array data.

Availability and Implementation: The minfi package version 1.19.12 or higher is available for all platforms from the Bioconductor project.

Contact: khansen@jhsph.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2017        PMID: 28035024      PMCID: PMC5408810          DOI: 10.1093/bioinformatics/btw691

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The Illumina HumanMethylation450 (‘450k’) array is a widely used platform for assaying DNA methylation in a large number of samples (Bibikova ), and has been the platform of choice for epigenome-wide association studies and large-scale cancer projects. In 2015, Illumina released its next generation methylation array, the HumanMethylationEPIC (‘EPIC’) array (Moran ), with almost twice the number of CpG loci. This increased resolution, coupled with greatly expanded coverage of regulatory elements, makes the EPIC array an attractive platform for large-scale profiling of DNA methylation. The minfi package in R/Bioconductor (Gentleman ; Huber ) is a widely used software package for analyzing data from the Illumina HumanMethylation450 array (Aryee ). In addition to the analysis methods provided in the package, it exposes a flexible framework for handling DNA methylation data.

2 Methods and results

We have extended the minfi package to support EPIC arrays. This includes functionality to (i) convert an EPIC array to a virtual 450k array for joint normalization and processing of data from both platforms, and (ii) estimate cell type proportions for EPIC samples using external reference data from the 450k array. In addition, we present a new single-sample normalization method (ssNoob) for methylation arrays. Concurrently, we have extended the shinyMethyl package (Fortin ) for interactive QC of Illumina methylation arrays.

Following the release of the EPIC chip, Illumina quickly released multiple versions of the manifest file describing the array design, as well as of the DMAP files used by the scanner. As a consequence, multiple types of IDAT files containing the raw data can be encountered in the wild. Addressing this has required more robust parsing code in minfi. We therefore strongly recommend that users analyzing EPIC arrays keep minfi and its associated annotation packages up to date.

Most loci on the 450k array (93.3%) are also present on the EPIC array, where they are measured using the same probes and chemistry. This makes it possible to combine data from both arrays. The lowest level at which the data can be combined is the probe level. We have implemented this in the function combineArrays, which outputs an object with a reduced number of probes that behaves as either a 450k or an EPIC array, as chosen by the user; we call this a virtual array. We also support combining the two array types at the CpG locus level, after the creation of the methylation and unmethylation channels.
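The idea behind a virtual array can be illustrated with a minimal Python sketch. It is not the combineArrays implementation in minfi; it only shows the underlying operation of restricting samples from two platforms to the probes they share. The function name, the data layout and the probe IDs and intensities are hypothetical and chosen for illustration.

```python
def combine_to_virtual_array(array_450k, array_epic):
    """Keep only the probes measured on both platforms.

    Each input maps sample name -> {probe ID -> intensity}. The result
    holds every sample from both inputs, restricted to the shared probes.
    """
    probes_450k = set.intersection(*(set(p) for p in array_450k.values()))
    probes_epic = set.intersection(*(set(p) for p in array_epic.values()))
    shared = sorted(probes_450k & probes_epic)

    combined = {}
    for name, probes in {**array_450k, **array_epic}.items():
        combined[name] = {probe: probes[probe] for probe in shared}
    return shared, combined


# Hypothetical toy data: probe IDs and intensities are made up.
array_450k = {"s450_1": {"cg01": 1200, "cg02": 300, "cg03": 950}}
array_epic = {"sEPIC_1": {"cg01": 1100, "cg03": 1010, "cg04": 640}}

shared, virtual = combine_to_virtual_array(array_450k, array_epic)
print(shared)  # ['cg01', 'cg03']
```

In this toy case the probe present only on the 450k sample (cg02) and the probe present only on the EPIC sample (cg04) are both dropped, which mirrors the reduced number of probes on a virtual array.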

2.1 Single sample normalization with ssNoob

Single sample normalization is of great potential benefit to users, particularly for analyzing large datasets that arrive in batches, because new data can be processed separately from, and independently of, previously processed data. We adapted the Noob method (Triche ) into a single sample normalization method by removing the need for a reference sample in the dye bias equalization step. We call the method ‘ssNoob’; details of the algorithm are provided in the Supplementary Methods. We note that on the Beta value scale, there is no difference between the values returned by Noob and ssNoob (Supplementary Methods). Differences are confined to the methylated and unmethylated signals.

ssNoob reduces technical variation. We assessed how well the different preprocessing methods reduce technical variation among three technical replicates of the cell line GM12878 assayed on the EPIC array. The methods compared were preprocessing as Illumina, SWAN normalization (Maksimovic ), stratified quantile normalization (Aryee ), ssNoob (Triche ), functional normalization (Fortin ) and no normalization. We calculated the variance of the Beta values across the three technical replicates at each CpG, stratified by probe design type. Boxplots of the distribution of these variances are shown in Figure 1a. The results show that the relative performance of the different preprocessing methods on the EPIC array is similar to what we previously observed on the 450k array. We caution that we also previously found that a reduction in technical variation is not always associated with improved replication between studies (Fortin ).
Fig. 1.

(a) Distribution of the variance between technical replicates assayed on the EPIC array, preprocessed using various methods. (b) The median distance between LCLs measured on the EPIC array and a number of different samples (261 LCLs in grey, 20 PBMC in blue and 58 ENCODE cell lines in red). All samples (both EPIC and 450k) were combined into a virtual array prior to normalization
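The variance comparison summarized in Figure 1a can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: Beta values are computed as M / (M + U + offset) with an offset of 0, the sample variance is used, and all signal values, probe IDs and design-type labels are made up.

```python
from collections import defaultdict
from statistics import median, variance


def beta_value(meth, unmeth, offset=0.0):
    """Beta value from methylated (M) and unmethylated (U) signals."""
    return meth / (meth + unmeth + offset)


def replicate_variances(betas_by_probe, probe_type):
    """Sample variance across replicates at each probe, grouped by design type."""
    grouped = defaultdict(list)
    for probe, betas in betas_by_probe.items():
        grouped[probe_type[probe]].append(variance(betas))
    return dict(grouped)


# Hypothetical (M, U) signals for three technical replicates per probe.
signals = {
    "cg01": [(900, 100), (880, 120), (910, 90)],
    "cg02": [(200, 800), (250, 750), (150, 850)],
    "cg03": [(500, 500), (500, 500), (500, 500)],
}
probe_type = {"cg01": "I", "cg02": "II", "cg03": "II"}

betas = {p: [beta_value(m, u) for m, u in reps] for p, reps in signals.items()}
by_type = replicate_variances(betas, probe_type)

# One summary per design type; a boxplot would show the full distribution.
for design, values in sorted(by_type.items()):
    print(design, median(values))
```

Repeating this for the output of each preprocessing method gives one distribution of variances per method and design type, which is the quantity compared across methods.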

ssNoob improves classification across array types. We assessed the performance of the above normalization methods when 450k and EPIC data are first combined at the probe level and then normalized together. We compared the three EPIC technical replicates to a set of 450k arrays collated from publicly available data (Supplementary Table S1). This set consists of 261 lymphoblastoid cell lines (LCLs), the same cell type as GM12878, along with 20 peripheral blood mononuclear cell (PBMC) samples and 58 other samples from ENCODE. We computed the median distance between data from the EPIC array and all of the 450k data after normalization. A useful normalization strategy will result in the LCLs drawing closer to each other while moving further from the other cell types. We used the distance as a metric for predicting whether or not a 450k sample is an LCL sample, and displayed prediction performance as a ROC curve (Supplementary Fig. S1). While all methods predicted well, ssNoob, functional normalization and quantile normalization achieved perfect prediction performance. We then investigated whether or not the methods can separate the PBMC samples from the ENCODE samples (Fig. 1b, Supplementary Fig. S3), and observed that here ssNoob performed best, followed by functional normalization and quantile normalization. We repeated the same assessments when normalizing EPIC samples separately from 450k samples and combining the data after normalization (Supplementary Figs S1–S3). Here quantile normalization performed worse, as expected.
As ssNoob is a single-sample procedure, it is not affected by whether samples are combined or not prior to normalization. Based on this assessment, and on the performance of Noob in existing benchmarks, we conclude that ssNoob is the best performing method for joint normalization of data from the EPIC and 450k arrays. We caution that this evaluation is based on a small number of EPIC samples and should therefore be considered preliminary.
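The distance-based assessment can be illustrated with a small Python sketch. Several details are assumptions, since the main text does not specify them: Euclidean distance between Beta value vectors is used, each 450k sample is scored by its median distance to the EPIC replicates, and the area under the ROC curve is computed by pairwise comparison. All sample names and Beta values are hypothetical.

```python
from math import dist
from statistics import median


def median_distance(sample, references):
    """Median Euclidean distance from one sample to a set of reference samples."""
    return median(dist(sample, ref) for ref in references)


def auc_smaller_is_positive(scores, is_positive):
    """Area under the ROC curve when smaller scores indicate the positive class.

    Equals the fraction of (positive, negative) pairs in which the positive
    sample has the smaller score, with ties counted as one half.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical Beta value vectors over three shared probes.
epic_replicates = [(0.90, 0.10, 0.50), (0.88, 0.12, 0.52), (0.91, 0.09, 0.49)]
samples_450k = {
    "LCL_a": (0.89, 0.11, 0.50),
    "LCL_b": (0.87, 0.13, 0.53),
    "PBMC_a": (0.60, 0.40, 0.50),
    "ENCODE_a": (0.20, 0.80, 0.30),
}

names = list(samples_450k)
scores = [median_distance(samples_450k[n], epic_replicates) for n in names]
is_lcl = [n.startswith("LCL") for n in names]
print(auc_smaller_is_positive(scores, is_lcl))
```

An area of 1 corresponds to the perfect prediction performance described above: every LCL sample is closer to the EPIC replicates than every non-LCL sample.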

2.2 Estimating cell-type composition for EPIC arrays using 450k reference data

Several methods have been proposed to estimate cell-type proportions from reference datasets made of sorted samples (Houseman ; Jaffe and Irizarry, 2014), and several reference datasets exist for the 450k array (Bakulski ; Guintivano ; Reinius ). We adapted the function estimateCellCounts to estimate cell type proportions of EPIC samples using 450k reference datasets. Briefly, the EPIC dataset is converted into a virtual 450k dataset and cell type proportions are estimated using probes common to both arrays. To evaluate how removing the roughly 7% of 450k probes that are absent from the EPIC array affects cell-type composition estimation, we estimated whole-blood cell-type proportions for the 20 PBMC samples before and after removing the probes that differ between the 450k and EPIC arrays. The two sets of estimates agree closely: for each cell type, the correlation of the cell type proportions between the two sets of data is higher than 0.99 (Supplementary Fig. S4). Reference datasets are also available for cord blood and brain.
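The agreement check described above can be sketched as a per-cell-type Pearson correlation between two sets of estimated proportions. The sketch does not perform the estimation itself; it assumes the proportions have already been computed once with all 450k probes and once with only the probes shared with the EPIC array. The cell-type labels and all numbers are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical estimated proportions for four samples and two cell types.
full_probe_set = {
    "CD4T": [0.20, 0.25, 0.18, 0.30],
    "Mono": [0.08, 0.10, 0.07, 0.12],
}
shared_probes_only = {
    "CD4T": [0.21, 0.25, 0.17, 0.31],
    "Mono": [0.08, 0.11, 0.07, 0.12],
}

correlations = {
    cell_type: pearson(full_probe_set[cell_type], shared_probes_only[cell_type])
    for cell_type in full_probe_set
}
for cell_type, r in correlations.items():
    print(cell_type, round(r, 3))
```

A correlation close to 1 for every cell type indicates that dropping the probes missing from the EPIC array changes the estimated proportions very little.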

2.3 Summary of the functionality in minfi

Most functionality in minfi supports all generations of Illumina Infinium HumanMethylation arrays: 27k, 450k and EPIC. This includes the different preprocessing and normalization functions, as well as the differential analysis tools: dmpFinder for differentially methylated positions (DMPs), bumphunter for differentially methylated regions (DMRs) and blockFinder for differentially methylated blocks (DMBs). We have also adapted the recent function compartments (Fortin and Hansen, 2015), which estimates A/B compartments as revealed by Hi-C data, to the EPIC array. The main functions in minfi are listed in Table 1.
Table 1. Main functions in the minfi package

Function           | Description                          | Platforms
Data acquisition
read.metharray     | Read IDAT files into R               | 27k, 450k, EPIC
convertArray       | Cast an array platform into another  | 27k, 450k, EPIC
combineArrays      | Combine data from different platforms | 27k, 450k, EPIC
Quality control
getSex             | Estimation of sample sex             | 27k, 450k, EPIC
getQC              | Estimation of sample-specific QC     | 27k, 450k, EPIC
qcReport           | Produces a PDF QC report             | 27k, 450k, EPIC
Preprocessing
preprocessRaw      | No normalization                     | 27k, 450k, EPIC
preprocessQuantile | (Stratified) quantile normalization  | 27k, 450k, EPIC
preprocessIllumina | Genome Studio normalization          | 27k, 450k, EPIC
preprocessSWAN     | SWAN normalization                   | 450k, EPIC
preprocessNoob     | Background and dye bias correction   | 27k, 450k, EPIC
preprocessFunnorm  | Functional normalization             | 450k, EPIC
Differential analysis
dmpFinder          | Estimation of DMPs                   | 27k, 450k, EPIC
bumphunter         | Estimation of DMRs                   | 27k, 450k, EPIC
blockFinder        | Estimation of DMBs                   | 450k, EPIC
Other useful functions
compartments       | Estimation of A/B compartments       | 450k, EPIC
estimateCellCounts | Estimation of cell-type proportions  | 27k, 450k, EPIC
addSnpInfo         | Intersect probes with dbSNP          | 27k, 450k, EPIC
  15 in total

Review 1.  Orchestrating high-throughput genomic analysis with Bioconductor.

Authors:  Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal:  Nat Methods       Date:  2015-02       Impact factor: 28.547

2.  Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.

Authors:  Martin J Aryee; Andrew E Jaffe; Hector Corrada-Bravo; Christine Ladd-Acosta; Andrew P Feinberg; Kasper D Hansen; Rafael A Irizarry
Journal:  Bioinformatics       Date:  2014-01-28       Impact factor: 6.937

3.  DNA methylation of cord blood cell types: Applications for mixed cell birth studies.

Authors:  Kelly M Bakulski; Jason I Feinberg; Shan V Andrews; Jack Yang; Shannon Brown; Stephanie L McKenney; Frank Witter; Jeremy Walston; Andrew P Feinberg; M Daniele Fallin
Journal:  Epigenetics       Date:  2016-03-28       Impact factor: 4.528

4.  SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips.

Authors:  Jovana Maksimovic; Lavinia Gordon; Alicia Oshlack
Journal:  Genome Biol       Date:  2012-06-15       Impact factor: 13.583

5.  Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data.

Authors:  Jean-Philippe Fortin; Kasper D Hansen
Journal:  Genome Biol       Date:  2015-08-28       Impact factor: 13.583

6.  DNA methylation arrays as surrogate measures of cell mixture distribution.

Authors:  Eugene Andres Houseman; William P Accomando; Devin C Koestler; Brock C Christensen; Carmen J Marsit; Heather H Nelson; John K Wiencke; Karl T Kelsey
Journal:  BMC Bioinformatics       Date:  2012-05-08       Impact factor: 3.169

7.  Low-level processing of Illumina Infinium DNA Methylation BeadArrays.

Authors:  Timothy J Triche; Daniel J Weisenberger; David Van Den Berg; Peter W Laird; Kimberly D Siegmund
Journal:  Nucleic Acids Res       Date:  2013-03-09       Impact factor: 16.971

8.  A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression.

Authors:  Jerry Guintivano; Martin J Aryee; Zachary A Kaminsky
Journal:  Epigenetics       Date:  2013-02-20       Impact factor: 4.528

9.  Accounting for cellular heterogeneity is critical in epigenome-wide association studies.

Authors:  Andrew E Jaffe; Rafael A Irizarry
Journal:  Genome Biol       Date:  2014-02-04       Impact factor: 13.583

10.  Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences.

Authors:  Sebastian Moran; Carles Arribas; Manel Esteller
Journal:  Epigenomics       Date:  2015-12-17       Impact factor: 4.778
