Michelle L Wright, Mikhail G Dozmorov, Aaron R Wolen, Colleen Jackson-Cook, Angela R Starkweather, Debra E Lyon, Timothy P York.
Abstract
The need for research investigating DNA methylation (DNAm) in clinical studies has increased, leading to the evolution of new analytic methods to improve accuracy and reproducibility of the interpretation of results from these studies. The purpose of this article is to provide clinical researchers with a summary of the major data processing steps routinely applied in clinical studies investigating genome-wide DNAm using the Illumina HumanMethylation 450K BeadChip. In most studies, the primary goal of employing DNAm analysis is to identify differential methylation at CpG sites among phenotypic groups. Experimental design considerations are crucial at the onset to minimize bias from factors related to sample processing and avoid confounding experimental variables with non-biological batch effects. Although there are currently no de facto standard methods for analyzing these data, we review the major steps in processing DNAm data recommended by several research studies. We describe several variations available for clinical researchers to process, analyze, and interpret DNAm data. These insights are applicable to most types of genome-wide DNAm array platforms and will be applicable for the next generation of DNAm array technologies (e.g., the 850K array). Selection of the DNAm analytic pipeline followed by investigators should be guided by the research question and supported by recently published methods.
Keywords: DNA methylation; Epigenomics; Microarray analysis
Year: 2016 PMID: 27127542 PMCID: PMC4848848 DOI: 10.1186/s13148-016-0212-7
Source DB: PubMed Journal: Clin Epigenetics ISSN: 1868-7075 Impact factor: 6.551
Fig. 1 a Density of DNAm intensity by probe type. Infinium I and II assays display different β-value distributions (0 indicating unmethylated sites, 1 indicating fully methylated sites), which may lead to results that contain an over-representation of type I probes due to the larger variance of type II assays. This figure shows the distribution of β-values obtained from a single peripheral blood specimen collected for women diagnosed with breast cancer. Differences between probe types, visible at the ends of the distributions (type I probes: red dotted line; type II probes: blue dotted line), are adjusted using normalization procedures, which attempt to harmonize the distributions of the two probe types. b Density of DNAm intensity by experimental group. The quality of the data for each specimen can be readily visualized using a density plot, which allows distributions to be compared between, for instance, cases and controls in order to identify specimens whose distributions deviate from the rest; such a deviation may indicate that the specimen results are of poor quality.
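
The β-values plotted in Fig. 1 are derived from methylated and unmethylated signal intensities. The following is a minimal sketch of that calculation, not the authors' code. It assumes the commonly used offset of 100 in the denominator to stabilize low-intensity probes, and it adds the logit-scale M-value, which the text above does not discuss, purely for comparison. The function names and the intensity values are hypothetical.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Return beta = M / (M + U + offset).

    meth and unmeth are non-negative signal intensities. The offset of 100
    is an assumed, commonly used constant; it is not taken from the text.
    """
    return meth / (meth + unmeth + offset)


def m_value(beta, eps=1e-6):
    """Return log2(beta / (1 - beta)), clamping beta away from 0 and 1."""
    beta = min(max(beta, eps), 1.0 - eps)
    return math.log2(beta / (1.0 - beta))


# Hypothetical intensities for one mostly methylated probe
b = beta_value(meth=4500.0, unmeth=400.0)
print(round(b, 3), round(m_value(b), 3))
```

A β-value near 0 corresponds to an unmethylated site and a value near 1 to a fully methylated site, matching the scale described in the caption.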
Major steps in the 450K array analysis pipeline
| Analysis | Rationale |
|---|---|
| Sample filtering | Experimental samples are compared to control probes present within the array technology to identify samples that fail to adequately detect DNAm. Samples with poor detection may be inaccurate, due to poor sample quality, and thus might be considered for exclusion from the dataset. |
| Probe filtering | Raw data must pass initial quality and data screening. Probes that fail to meet preset detection values, or that otherwise fail, are removed from analysis because they are unreliable (see text). For example, some probes may cross-hybridize or overlap with SNPs, which could confound results. Study aims should be considered when determining which probes to remove. |
| Within-array normalization | This step removes “background” noise and corrects for technical dye-based (red/green), intensity, and probe type (I/II) differences within the array technology. |
| Batch effects | This step assesses and accounts for variation that is caused not by biological differences but by external factors (e.g., samples are processed on different days or at different facilities). |
| Cell composition | Whole blood contains multiple cell types with potentially different DNAm profiles. As different samples may contain varying proportions of cell types, statistical methods have been developed to estimate and correct for this cellular heterogeneity. |
| Differential DNAm positions and regions | Currently, many analytic pipelines assess for DNAm differences in both specific positions and broader regions. DNAm positions interrogated on the array are not evenly distributed, and both differentially methylated positions and regions may yield clinically meaningful results. |
| Biological and clinical interpretation | Various approaches may be necessary for accurate interpretation of differential methylation between groups. Tools for functional and regulatory enrichment analyses are available. Manual exploration of the literature and validation in a second cohort or by another method (e.g., bisulfite sequencing) remain viable options for interpretation. |
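
The sample and probe filtering steps in the table can be sketched with detection p-values. This is an illustrative outline only, not a prescribed method. The 0.01 p-value cutoff and the maximum fraction of failed probes per sample are assumptions chosen for the example, and the probe and sample identifiers are hypothetical.

```python
def filter_by_detection(detp, sample_ids, p_cut=0.01, max_failed_frac=0.05):
    """Filter samples, then probes, using detection p-values.

    detp maps probe_id -> list of detection p-values, one per sample,
    in the same order as sample_ids. Both thresholds are assumptions.
    """
    n_probes = len(detp)

    # Sample filtering: count, per sample, how many probes fail detection.
    failed = [0] * len(sample_ids)
    for pvals in detp.values():
        for j, p in enumerate(pvals):
            if p > p_cut:
                failed[j] += 1
    keep_idx = [j for j, n in enumerate(failed)
                if n / n_probes <= max_failed_frac]

    # Probe filtering: keep probes detected in every retained sample.
    keep_probes = [pid for pid, pvals in detp.items()
                   if all(pvals[j] <= p_cut for j in keep_idx)]
    return [sample_ids[j] for j in keep_idx], keep_probes


# Hypothetical detection p-values for four probes across three samples
detp = {
    "cg_a": [0.001, 0.002, 0.5],
    "cg_b": [0.0001, 0.003, 0.2],
    "cg_c": [0.001, 0.04, 0.001],
    "cg_d": [0.002, 0.001, 0.003],
}
samples, probes = filter_by_detection(
    detp, ["S1", "S2", "S3"], max_failed_frac=0.25)
print(samples, probes)
```

Filtering samples before probes means that one poor-quality specimen does not cause otherwise reliable probes to be discarded. Probes that cross-hybridize or overlap SNPs would need a separate annotation-based filter, which this sketch does not cover.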
Fig. 2 Visualization of DMR and DMP results overlapping genomic annotations. This example figure, created from 450K data for peripheral blood specimens collected from women diagnosed with breast cancer, demonstrates how both experimental results and predicted functional elements can be viewed as individual tracks along a set of genomic coordinates (x-axis) spanning a gene (e.g., the FKBP5 gene, bottom track). The top track displays the statistical model coefficient from a univariate test to identify individual DMPs, and the plot rug (along the x-axis) indicates significant findings (black ticks). The identified DMRs (second track) correspond to clusters of DMP results with similar coefficient values and, in this example, overlap CpG islands (third track) and predicted promoter regions (fourth track). These regions also correspond to other publicly curated annotations that, for instance, can indicate enrichment for different chromatin states (fifth track).
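
The idea that DMRs correspond to clusters of nearby DMPs with similar coefficients can be illustrated with a deliberately simplified grouping rule. This is not the method used to produce Fig. 2, nor any published DMR algorithm. It assumes that "similar" means the same coefficient sign, and the gap and minimum-size thresholds, the function name, and the positions are all hypothetical.

```python
def cluster_dmps(dmps, max_gap=500, min_cpgs=3):
    """Group significant DMPs on one chromosome into candidate regions.

    dmps is a list of (position, coefficient) pairs. Consecutive DMPs are
    merged when they lie within max_gap bases of each other and share the
    same coefficient sign. Returns (start, end, n_cpgs) tuples.
    """
    regions, current = [], []
    for pos, coef in sorted(dmps):
        if current:
            prev_pos, prev_coef = current[-1]
            too_far = pos - prev_pos > max_gap
            sign_flip = (coef > 0) != (prev_coef > 0)
            if too_far or sign_flip:
                # Close the running cluster, keeping it only if large enough.
                if len(current) >= min_cpgs:
                    regions.append(
                        (current[0][0], current[-1][0], len(current)))
                current = []
        current.append((pos, coef))
    if len(current) >= min_cpgs:
        regions.append((current[0][0], current[-1][0], len(current)))
    return regions


# Hypothetical significant DMPs: (genomic position, model coefficient)
dmps = [(100, 0.10), (300, 0.20), (450, 0.15),
        (5000, -0.10), (5100, -0.20), (9000, 0.30)]
print(cluster_dmps(dmps))
```

Only the first three positions form a region here; the later DMPs are either too few or too isolated. Dedicated DMR tools typically also smooth the statistics and assess region-level significance, which this sketch omits.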