DNA methylation in blood - potential to provide new insights into cell biology.

Donia Macartney-Coxson¹, Alanna M Cameron², Jane Clapham¹, Miles C Benton¹.

Abstract

Epigenetics plays a fundamental role in cellular development and differentiation; epigenetic mechanisms, such as DNA methylation, are involved in gene regulation and the exquisite nuance of expression changes seen in the journey from pluripotency to final differentiation. Thus, DNA methylation as a marker of cell identity has the potential to reveal new insights into cell biology. We mined publicly available DNA methylation data with a machine-learning approach to identify differentially methylated loci between different white blood cell types. We then interrogated the DNA methylation and mRNA expression of candidate loci in CD4+, CD8+, CD14+, CD19+ and CD56+ fractions from 12 additional, independent healthy individuals (6 male, 6 female). 'Classic' immune cell markers such as CD8 and CD19 showed expected methylation/expression associations, fitting with the established dogma that hypermethylation is associated with the repression of gene expression. We also observed large differential methylation at loci which are not established immune cell markers; some of these loci showed inverse correlations between methylation and mRNA expression (such as PARK2, DCP2). We validated these observations further in publicly available DNA methylation and RNA sequencing datasets. Our results highlight the value of mining publicly available data, the utility of DNA methylation as a discriminatory marker and the potential value of DNA methylation to provide additional insights into cell biology and developmental processes.

Year: 2020 PMID: 33147241 PMCID: PMC7641429 DOI: 10.1371/journal.pone.0241367

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Epigenetics refers to the heritable, but reversible, regulation of various genomic functions, including gene expression. It provides mechanisms whereby an organism can dynamically respond to changes in its environment and “reset” gene expression accordingly [1]. Furthermore, these mechanisms play a critical role in development and cell lineage specificity [2, 3], as highlighted recently when epigenomic profiling revealed a linear differentiation model for memory T-cells [4]. One such epigenetic mechanism is DNA methylation. Methylation of the cytosine nucleotide within CpG dinucleotides in DNA is well documented in humans [5, 6]. DNA methylation can be developmentally ‘hard-wired’ (as in the case of imprinting [7]), underpin cell identity (i.e. cell markers of differentiation [8, 6]) or dynamic and change in response to environmental factors [9]. Therefore, the investigation of an individual’s methylation pattern can reveal a lifetime record of environmental exposures as well as potential disease specific marks [10, 11]. It is well established that epigenetics contributes significantly to the developmental fate of cells and tissues [8]. For instance, epigenetic mechanisms contribute to the differentiation of hematopoietic stem cells from bone marrow [12, 13]. Importantly, DNA methylation appears to play a crucial role at specific stages along the separation of blood cell lineages (myeloid, lymphoid) and contributes to the establishment and functionality of the final differentiated cell type [14]. Epigenetic marks, including DNA methylation, are increasingly recognised as potential discriminators of cell type [15]. 
This attribute has been utilised by a number of researchers to develop methods which correct for and/or deconvolute the variability introduced by cell mixtures in DNA methylation studies, particularly in blood samples [16-20]. A notable example, the so-called Houseman algorithm (Houseman 2012), has been incorporated into standard bioinformatic pipelines for DNA methylation arrays, including the R minfi package [21]. This behaviour of DNA methylation as a marker also suggests that such 'marks' could reveal new aspects of biology; for instance, they may highlight previously unrecognised immune cell populations. DNA methylation as an epigenetic mark is easily quantified and evaluated from blood. Many studies using Illumina array technology have made their data publicly available, providing an excellent resource for hypothesis generation and testing in silico prior to wet-lab experimentation. We hypothesised that, because of its role in differentiation and development, new biological insights could be revealed by looking at loci that discriminate between immune cell types; the potential utility of these loci in cell discrimination might be previously unrecognised and/or could be harnessed to sort and/or identify potential new cell sub-types. Therefore, we performed an in silico discovery experiment using data from a study which examined the DNA methylation profile of human white blood cell populations [22]. Reinius et al. investigated DNA methylation in: T cells (CD8+, CD4+); B cells (CD19+); natural killer cells (NK cells; CD56+); monocytes (CD14+); granulocytes (Gran; both CD16+ and Siglec-8+ cells); neutrophils (Neu, CD16+), and eosinophils (Eos, Siglec-8+). The Reinius study was one of the first to illustrate the potential power of DNA methylation as a biomarker, and its role in cell lineage identity, with the authors profiling DNA methylation in six healthy males and identifying discriminatory DNA loci in “classic” immune cell marker loci.
Here, we use a machine learning approach which, as anticipated, identifies discriminatory DNA methylation marks in ‘classic’ immune cell markers, but also highlights significant differential methylation in “non-classic” immune markers, and genes for which a role in immune function is yet to be reported. We interrogate this further in an independent cohort and publicly available data at both the DNA methylation and gene expression level.

Material and methods

Discovery analysis

DNA methylation analysis

The Reinius data were downloaded using the R package MARMAL-AID [23]. All applicable sample information is available at the GEO page (GSE35069, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35069). Raw intensity data (Illumina 450K idats) were loaded into R [24] using the Bioconductor minfi package [21]. Background correction and control normalisation were implemented in minfi. Probes were classed as failed if the intensity for both the methylated and unmethylated probes was <1,000. Any probe which failed in at least one sample was removed from the entire dataset. We also removed all previously identified cross-reactive probes [25], and 33,457 probes which we previously identified as aligning to the human genome more than once [26]. All analyses were performed on beta values, calculated as the intensity of the methylated channel divided by the total intensity plus an offset of 100, i.e. methylated / (methylated + unmethylated + 100). Elastic-net penalised regression (ridge regression mixed with lasso), as implemented in the R package glmnet [27], was used to explore methylation association between each of the cell-types (CD8+, CD4+, CD19+, CD14+, CD56+, Neutrophils, Eosinophils, Granulocytes, as well as combinations of cell populations, PBMC and whole blood). The number of variables (~450,000 CpG sites, Illumina 450K platform) far outweighs the number of cell-types; as such, it is accepted that conventional statistical analysis procedures that test each CpG within an independent regression model suffer from multiple testing burden and reduced statistical power. To overcome this issue we chose to use the penalised regression procedures of glmnet, which test all markers simultaneously, i.e. in a single regression model. Glmnet was specifically designed to overcome issues of large variable number (k) and small sample size (n) and has been successfully applied to several genome-wide association studies of SNPs [28-30] and, recently, methylation [31].
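The probe filter and beta-value calculation described above can be illustrated with a short Python sketch. This is not the authors' R/minfi code: the function names and the input layout (a dictionary mapping each probe to per-sample intensity pairs) are assumptions made for the example, and only the threshold of 1,000 and the offset of 100 come from the text.

```python
from typing import Dict, List, Tuple

OFFSET = 100.0           # offset added to the total intensity (from the text)
FAIL_THRESHOLD = 1000.0  # a probe fails when both channels fall below this


def beta_value(meth: float, unmeth: float, offset: float = OFFSET) -> float:
    """Beta = methylated / (methylated + unmethylated + offset)."""
    return meth / (meth + unmeth + offset)


def probe_failed(meth: float, unmeth: float,
                 threshold: float = FAIL_THRESHOLD) -> bool:
    """A probe fails in a sample if both intensities are below the threshold."""
    return meth < threshold and unmeth < threshold


def filter_and_convert(
    intensities: Dict[str, List[Tuple[float, float]]]
) -> Dict[str, List[float]]:
    """Drop any probe that fails in at least one sample, then compute betas.

    `intensities` maps a probe ID to one (methylated, unmethylated) pair
    per sample; this layout is hypothetical.
    """
    betas: Dict[str, List[float]] = {}
    for probe, samples in intensities.items():
        if any(probe_failed(m, u) for m, u in samples):
            continue  # removed from the entire dataset
        betas[probe] = [beta_value(m, u) for m, u in samples]
    return betas
```

The offset keeps the ratio stable when both intensities are low, which is why it sits in the denominator only.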
We have previously developed and reported on this method in detail to identify aging associated DNA methylation loci [26]. Briefly, glmnet fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elastic-net penalty at a grid of values for the regularization parameter λ. The elastic-net penalty is controlled by α, and bridges the gap between lasso (α = 1, the default) and ridge (α = 0). The ridge penalty shrinks the coefficients of correlated predictors towards each other, while the lasso tends to pick one of them and discard the others. The elastic-net penalty mixes these two; if predictors are correlated in groups, an α = 0.5 tends to select the groups in or out together. We selected an α at the lower end of the range (0.05) to shift the elastic-net model towards ridge regression, allowing us to retain more related features (CpG sites which share variance). For the glmnet modelling we used cross-validation to determine the optimal value of the regularization parameter λ, using both the minimum mean squared error (MSE) and the minimum MSE plus one standard error (1SE) of the minimum MSE. The optimal λ values were then used for predictor variable selection. The FIt-SNE software with its associated R wrapper function was used for t-SNE analysis [32].
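To make the roles of α and λ concrete, the sketch below evaluates the elastic-net penalty in the form used by glmnet, λ[(1 − α)/2 · Σβ² + α · Σ|β|], and picks two candidate λ values from cross-validation output. It is an illustrative Python sketch, not the authors' R code. The function names are hypothetical, and reading "minimum MSE + 1SE" as the largest λ whose cross-validated MSE lies within one standard error of the minimum is an assumption.

```python
from typing import List, Sequence, Tuple


def elastic_net_penalty(coefs: Sequence[float], lam: float, alpha: float) -> float:
    """lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|)).

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty.
    """
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


def select_lambdas(lambdas: List[float], cv_mse: List[float],
                   cv_se: List[float]) -> Tuple[float, float]:
    """Return (lambda at minimum CV MSE, largest lambda within 1 SE of it)."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mse[i])
    cutoff = cv_mse[i_min] + cv_se[i_min]
    lam_1se = max(l for l, m in zip(lambdas, cv_mse) if m <= cutoff)
    return lambdas[i_min], lam_1se
```

With α = 0.05 the squared (ridge) term dominates the penalty, which is consistent with the stated aim of retaining groups of correlated CpG sites rather than a single representative.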

Pathway enrichment

Functional enrichment was performed on each set of CpG sites identified for each cell type using the ToppFun function of the ToppGene Suite webserver (https://toppgene.cchmc.org/). Bonferroni correction was applied in the reporting of all pathway results (adjusted P<0.05).
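For readers unfamiliar with the correction, a Bonferroni adjustment multiplies each P value by the number of tests and caps the result at 1. The Python sketch below is a minimal illustration with a hypothetical function name; it does not reproduce the ToppFun implementation.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted P values: min(1, p * number_of_tests)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

A pathway would then be reported when its adjusted value falls below 0.05.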

Validation analyses

Samples

Ethical approval was obtained from, and all experimental protocols were approved by, the Health and Disability Ethics Committee NZ (HDEC, 15/NTB/153). All methods were carried out in accordance with relevant guidelines and regulations. Written, informed consent was obtained from all participants, all of whom were over 18 years of age at the time of collection. Blood from 12 self-reported healthy individuals (n = 6 male, n = 6 female) aged 26–31 years inclusive was collected into sterile K2 EDTA vacutainers (BD Biosciences), and the buffy coat isolated.

Cell sorting–FACS

Peripheral blood mononuclear cells (PBMCs) were Fc receptor blocked, labelled with fluorescent antibodies specific for: CD3 (OKT3), CD4 (OKT4), CD8 (HIT8a), CD14 (HCD14), CD19 (HIB19) and CD56 (HCD56; all antibodies were from Biolegend) and dead cells were identified by DAPI exclusion. CD4+, CD8+, CD14+, CD19+ and CD56+ fractions were collected (Influx cell sorter, BD Biosciences) directly into ice-cold FACS buffer, immediately frozen on dry ice and stored at –80°C.

DNA and RNA extraction

Both nucleic acids were extracted simultaneously from snap frozen cells using a Qiagen AllPrep DNA/RNA kit as per the manufacturer's protocol. High quality genomic DNA and RNA were obtained, with RNA RIN ≥ 7.5. Sufficient quality and quantity of DNA and RNA were obtained to facilitate targeted DNA methylation and mRNA expression profiling for CD4+, CD8+, CD19+, CD14+ and CD56+ cell sorted samples.

Targeted DNA methylation analysis

Pyrosequencing was designed and performed by EpigenDX (USA), who were provided with the Illumina probe information (Table 1).

Table 1

Annotation, methylation status and TaqMan probe information for the 11 selected CpG sites.

IlmnID	CellType	Cell meth (mean)	Other meth (mean)	Absolute Difference	Percent Difference	CHR	Position	Gene Symbol	Feature	TaqMan Probe
cg24462702	CD4	0.13	0.82	0.69	69.07	X	135730445	CD40LG	1stExon	Hs00163934_m1
cg10837404	CD4	0.34	0.88	0.54	54.14	5	112356289	DCP2	3’UTR	Hs00400339_m1
cg02665297	CD19	0.08	0.95	0.87	86.91	7	5270984	WIPI2	3’UTR	Hs01093807_m1
cg21596498	CD19	0.12	0.92	0.8	80.33	19	42618407	POU2F2	Body	Hs00922179_m1
cg27565966	CD19	0.12	0.87	0.75	74.94	16	28943198	CD19	TSS200	Hs01047412_g1
cg25939861	CD8	0.14	0.81	0.67	67.35	2	87020937	CD8A	5’UTR	Hs01555594_g1
cg11067179	CD8	0.41	0.84	0.43	42.85	11	66083541	CD248	1stExon	Hs00535586_s1
cg23244761	CD14	0.14	0.93	0.79	78.76	6	161796850	PARK2	Body	Hs01038322_m1
cg16636767	CD14	0.21	0.89	0.69	68.55	11	13694647	FAR1	5’UTR	Hs00386153_m1
cg13617280	CD56	0.25	0.88	0.63	63.42	12	129299462	SLC15A4; MGC16384	Body; TSS200	Hs00377326_m1
cg13995453	CD56	0.43	0.88	0.45	45.19	12	9759653	KLRB1	Body	Hs00174469_m1

Targeted gene expression analysis

150 ng total RNA was reverse transcribed using VILO Superscript (Thermo Fisher). QRTPCR was performed in triplicate on 7 ng cDNA using TaqMan Gene Expression assays (CD40LG Hs00163934_m1, DCP2 Hs00400339_m1, WIPI2 Hs01093807_m1, POU2F2 Hs00922179_m1, CD19 Hs01047412_g1, CD8A Hs01555594_g1, CD248 Hs00535586_s1, PARK2 Hs01038322_m1, FAR1 Hs00386153_m1, SLC15A4 Hs01547421_m1, KLRB1 Hs00174469_m1). Gene expression was normalised against the non-variable endogenous control genes GAPDH and GUSB using the ΔCt method (Ct of the candidate gene minus the mean Ct of the controls).
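The ΔCt normalisation can be written out as a small Python sketch. The function name and input layout are hypothetical, and averaging the technical triplicates before subtraction is an assumption; the text specifies only that the mean Ct of the control genes is subtracted from the candidate Ct.

```python
from statistics import mean
from typing import Dict, Sequence


def delta_ct(candidate_cts: Sequence[float],
             control_cts: Dict[str, Sequence[float]]) -> float:
    """Delta Ct = Ct(candidate) - mean Ct(control genes).

    `candidate_cts` holds the replicate Ct values for one gene in one sample;
    `control_cts` maps each control gene (e.g. GAPDH, GUSB) to its replicates.
    """
    candidate = mean(candidate_cts)
    control_means = [mean(values) for values in control_cts.values()]
    return candidate - mean(control_means)
```

Because Ct falls as template abundance rises, a lower ΔCt corresponds to higher expression relative to the controls.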

Statistics

All analyses were performed in R 3.5.2. Differential methylation and expression analyses were performed using Student's t-test with default settings. P values were adjusted using the Benjamini-Hochberg method.
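As a sketch of the adjustment step, the Python function below implements the standard Benjamini-Hochberg procedure: each P value is multiplied by the number of tests divided by its ascending rank, and monotonicity is then enforced from the largest P value downwards. It is an illustration with a hypothetical function name, not the authors' R code.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Visit P values from largest to smallest so a running minimum
    # can enforce monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Unlike a Bonferroni correction, this controls the false discovery rate rather than the family-wise error rate, so it is less conservative when many comparisons are made.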

Data

All raw and processed data are accessible via GitHub (https://github.com/sirselim/immunecell_methylation_paper_data; DOI: https://doi.org/10.5281/zenodo.3366393). A GitHub repository and related site have been made available to explore t-SNE results interactively (https://sirselim.github.io/tSNE_plotting/).

Results

Discovery—DNA methylation discriminatory markers for immune cells

We identified DNA methylation at 1173 CpG sites (S1 Table) which clearly differentiated specific immune cell populations using publicly available data from whole blood [16]; hierarchical clustering and t-SNE analyses provide a visual presentation and highlight that these markers cluster the cell populations in a biologically meaningful way (Fig 1). Pathway analyses of the genes to which these 1173 CpG sites mapped strongly supported their discriminatory nature and, as expected, showed enrichment for immune cell biological function: CD56 (> 79 genes), CD4 (> 68 genes), CD8 (> 34 genes), CD14 (> 69 genes) and CD19 (> 194 genes). Furthermore, these results suggest that discriminatory CpG marker loci may map to genes with a hitherto unrecognised role in immune cell discrimination and/or function.

Fig 1

Demonstration of immune cell population discrimination using sets of identified epigenetic markers (CpGs).

A) Hierarchical clustering of all 1173 identified probes demonstrates perfect separation of cellular populations. B) Plot of t-SNE dimensions derived from all methylation sites for all 60 samples. Points on the plot represent individual samples. C) 2D t-SNE plot of the 1173 methylation markers selected via the glmnet method. Points on the plot represent individual CpG sites. An interactive version of this panel is available (https://sirselim.github.io/tSNE_plotting/).

The robust differentiation between cell types was explained by non-overlapping sets of CpGs specific for each cell population: CD8+ (n = 70); CD4+ (n = 96); CD19+ (n = 347); CD56+ (n = 112); CD14+ (n = 126); Granulocytes (n = 128); Neutrophils (n = 128), and Eosinophils (n = 166). The majority of these sites were relatively hypo-methylated in the cell type of discrimination and hyper-methylated in all other cell populations analysed. The proportion of hypomethylated/total non-overlapping discriminatory CpGs [for a given cell type] was: CD8+ (46/70, 65.7%), CD4+ (71/96, 74%), CD19+ (344/347, 99%), CD56+ (111/112, 99%), CD14+ (126/126, 100%), Granulocytes (94/128, 73.4%), Eosinophils (165/166, 99%) with Neutrophils being the exception (33/128, 24.2%). Interestingly, the majority of CpG marker sites identified (~95% of CpGs) mapped to annotated gene loci, with many in regions involved in regulating mRNA expression (e.g. promoters). For each cell type marker the proportion of CpG sites mapping to annotated loci was: CD8+ (62/70); CD4+ (78/96); CD19+ (255/347); CD56+ (99/112); CD14+ (82/126); Granulocytes (102/128); Eosinophils (136/166), and Neutrophils (108/128). For individual marker information including annotation see https://github.com/sirselim/immunecell_methylation_paper_data [DOI: https://doi.org/10.5281/zenodo.3366393].

The largest DNA methylation difference observed was 87%, between CD19+ cells and all others. This 87% difference was observed in two genes, WIPI2 and CARS2; while WIPI2 has a reported role in the immune system [33], to the best of our knowledge no such function has been reported for CARS2 to date. Ranked by the largest change in methylation, the top five CpG sites mapping to annotated loci for each cell type were:

CD19+: 87% (WIPI2, CARS2), 83% (RERE), 82% (LOC100129637), 80% (POU2F2)
CD4+: 69% (CD40LG), 56% (PUM1), 54% (DCP2, BAG3), 48% (SF1)
CD8+: 67% (CD8A), 64% (CD8A), 51% (CD8B), 49% (CD8B, CD8A)
CD56+: 63% (SLC15A4), 52% (RASA3), 48% (MAD1L1), 45% (KLRB1/CD161), 43% (KLRB1/CD161)
CD14+: 79% (PARK2), 70% (CENPA, PARK2), 69% (KIAA0146, FAR1)
Eosinophils: 73% (FAM65B), 72% (KIAA0317, APLP2), 70% (MEF2A, CCDC88A)
Granulocytes: 60% (VPS53, PCYOX1), 59% (ARG1), 58% (CSGALNACT1), 56% (SH3PXD2B)
Neutrophils: 14% (CUL9), 12% (LASP1), 7% (GFI1), 6% (LRFN1, NFAT5)

Validation in independent samples

In order to validate our observations from the in silico discovery experiment we selected 11 differentially methylated loci (Table 1) for analysis in 12 independent samples from self-reported healthy individuals (n = 6 female, n = 6 male) with an age range of 26–31 years inclusive. This sample size is equivalent per sex to that of the Reinius data [22] used in the discovery analysis. Our validation concentrated on cell sorted populations for CD19+, CD4+, CD8+, CD56+ and CD14+, from which it was possible to collect enough cells for simultaneous extraction of DNA and RNA of sufficient quantity and quality. Ten loci were selected for validation, two for each cell type (Table 1). The most differentially methylated site for each cell type (CD19+, CD4+, CD8+, CD56+, CD14+) was selected (WIPI2, CD40LG, CD8A, SLC15A4, PARK2, respectively). A second site from the top 5 (see above) was selected for each of CD19+, CD4+, CD56+ and CD14+ (POU2F2, DCP2, KLRB1, FAR1). For CD8+ all sites in the top 5 mapped to this marker; we therefore selected the sixth-ranked locus, which mapped to CD248 (43% difference in methylation). In addition, CD19 (ranked 27th in terms of % differential methylation [75%] of annotated loci) was included as a control.

DNA methylation

The eleven candidate loci were assayed by pyrosequencing in the 12 samples from the validation cohort. We observed strong agreement with the expected discriminatory patterns of DNA methylation for all loci examined (Figs 2 and 3). S2 Table presents pair-wise Student's t-test statistics for the DNA methylation data.

Fig 2

Heatmap representation of DNA methylation and gene expression data for all 11 genes investigated.

Expression and methylation measures were split into quartiles and their levels coloured accordingly.

Fig 3

Boxplots illustrating DNA methylation and gene expression levels for all 11 genes investigated.

Methylation and gene expression data for a given gene are in adjacent boxplots.

RNA expression

Given the role that DNA methylation plays in regulation of gene expression we also explored the mRNA levels of the 11 candidate loci. We investigated gene expression by QRTPCR in the 12 independent validation samples. A clear differentiation between immune cell types at the gene expression level was observed for PARK2, POU2F2, DCP2, CD248, CD8A, SLC15A4, CD40LG and CD19 but not for FAR1, WIPI2 or KLRB1 (Figs 2 and 3). S3 Table presents pair-wise Student's t-test statistics for the gene expression data.

Validation in publicly available data

In order to further investigate the panel of 1173 CpG sites identified in our initial analysis we interrogated their methylation in three publicly available data sets: one (GSE82084) using the Illumina 450K platform (as per the Reinius data used in our discovery analysis) and two (GSE103541, GSE110554) using the more recent Illumina EPIC platform. Of the 1173 CpG sites, 1025 were present on both platforms. The two EPIC studies performed DNA methylation analysis of cell sorted immune cell populations from adults [17], whereas the 450K study looked at DNA methylation in cord blood from term and preterm newborns [34]. Fig 4 presents a 2D tSNE plot of all 1025 CpG sites which clearly shows separation of immune cell populations. It is interesting to note that the T cells of neonates (orange/red triangles) sit between the CD4+ and CD8+ T cells, consistent with an undifferentiated state. We also observed independent clustering of nucleated red blood cells from the same preterm newborns cohort, despite the fact that this cell-type was not in our training set.

Fig 4

tSNE plot of sorted-cells from 211 samples based on 1025 CpG sites overlapping between the three publicly available datasets.

Points represent individual samples.

To further explore expression of the 11 genes we selected for our validation of DNA methylation and RNA expression on independent samples we interrogated a publicly available RNAseq dataset, GSE107011 [35]. We extracted data for the same five cell populations (CD4+, CD8+, CD19+, CD56+ and CD14+) and then extracted expression data (TPM, transcripts per million) for each of the 11 genes. Fig 5 presents the data and illustrates the ability of mRNA expression from these 11 genes to clearly differentiate cell types. It is interesting to note that effector memory/terminal effector CD8+ cells (purple circles) are distinguished from central memory/naive CD8+ cells (purple squares), and that Th17 CD4+ cells (yellow diamonds) separate from other CD4+ cells (yellow circles). Interactive versions of all figures are available online (https://sirselim.github.io/tSNE_plotting/).

Discussion

DNA methylation is exquisitely placed to reflect a cell’s differentiation trajectory. Using publicly available data and a machine learning approach, we identified 1173 unique CpG sites at which DNA methylation discriminated CD8+, CD4+, CD19+, CD56+, and CD14+ cell populations as well as granulocytes, neutrophils, and eosinophils. We validated DNA methylation in two discriminatory CpG loci for each of CD8+, CD4+, CD19+, CD56+, and CD14+ in 12 independent samples. The majority of the 1173 discriminatory CpG sites mapped to annotated loci, and gene regulatory regions in particular. This suggests that, as expected, DNA methylation is playing a key role in immune cell differentiation and cell-type identification. An important implication of this is that DNA methylation can potentially be harnessed to reveal previously unidentified aspects of biology, such as immune cell sub-populations. A good example of this is the transcription factor FOXP3, which plays a key role in the development and function of Treg cells [36]; originally FOXP3 expression was used to identify Treg cells until it was deemed insufficient for the robust identification of suppressive Treg cells [37, 38]. However, recent work has reported that hypomethylated CpG sites in four regions of FOXP3, CAMTA1 and FUT7 can be used to distinguish subsets of Tregs from non-regulatory CD4+ T cells [39]. These findings strongly support our view that DNA methylation, including potentially loci identified in the current study, could be used to inform similar experiments and reveal other drivers of specific immune cell subtypes.

Furthermore, large differences in DNA methylation were observed, and validated, at CpG loci in genes which, while their potential role in immune cell biology has been reported, have not previously been recognised as differentiators of immune cell type, such as WIPI2 [33] for CD19+, SLC15A4 [40-42] for CD56+ and PARK2 [43, 44] for CD14+ cells.
We also identified POU2F2/OCT2, for which a role as a B-cell differentiator was recently reported [14]. In addition, significant, cell type specific changes in DNA methylation were observed, and validated, in genes which, to the best of our knowledge, have no previously reported role in immune biology (FAR1, CARS2). Taken together this highlights the significant potential of such analyses to uncover new facets of cell biology and immunology. Many additional loci from our in silico analyses showed large differences in DNA methylation, and these warrant further investigation with respect to their roles in immune cell function.

To further explore the potential relationships between our selected cell methylation markers we used t-SNE, a statistical method that attempts to identify higher dimensional relationships between data points and assign a faithful representation of those points in lower dimensional space (usually 2D) [45]. As a method t-SNE has been widely adopted in single cell sequencing experiments to identify clusters of cell populations [46, 47]. The resultant t-SNE analysis of the selected 1173 markers (Fig 1) clearly demonstrates distinct groupings of CpG sites into respective cell populations. There is a small degree of non-specific clustering of CpG sites. This could well be due to higher order background ‘signal’, or it could potentially be pointing towards underlying biological relationships that have yet to be established.

The potential of this approach is highlighted by our analyses of publicly available data and the 1025/1173 'candidate CpG sites' which overlapped between the 450K and EPIC Illumina bead platforms. Fig 4 illustrates how well these 1025 CpG sites performed in additional, independent data. Furthermore, our initial analyses focused on samples from adults and as such we could not comment on their performance in neonates.
However, one of the three public datasets we explored was from a study investigating DNA methylation in cord blood from term and preterm newborns. This clearly shows separation of immune cell sub-types isolated from neonates with the CpG markers we identified. It also suggests that such 'biomarkers' can potentially identify additional aspects of cell identity; for instance, the T cells of neonates (orange/red triangles) sit between the mature CD4+ and CD8+ T cells consistent with an undifferentiated state. Furthermore, we also observed independent clustering of nucleated red blood cells from the same preterm newborns cohort. We believe these observations support the tantalising possibility that DNA methylation can be harnessed to reveal new aspects of cell biology including the identification of currently unrecognised/undistinguishable immune cell sub-types. mRNA expression analysis of the genes to which the 11 validated DNA methylation discriminatory loci mapped also revealed discrimination at the mRNA level for CD248, and CD8A (CD8+), POU2F2 and CD19 (CD19+), PARK2 (CD14+), DCP2 (CD14+), SLC15A4 (CD56+), and CD40LG (CD4+). There were three genes (FAR1, WIPI2, KLRB1) for which this was not observed. One potential explanation is the presence of multiple isoforms per gene, such that the primer/probe combination for the QRTPCR analysis did not target the correct isoform. This possibility warrants further investigation especially given the increasing body of evidence that DNA methylation is an important modulator of alternative splicing [48-50]. We also investigated the expression of the 11 genes in a publicly available RNAseq dataset from immune cell sorted populations and saw a clear separation of the cell types with these 11 transcripts (Fig 5). In addition, the t-SNE analysis hints at the power of these 11 transcripts to provide a more nuanced separation of cell types. 
For example, we observed distinct separation of CD8 T-cells into two clusters of sub-populations (Terminal Effector/Effector Memory and Central Memory/Naive). Similar clustering is seen within CD4 T-helper cells, with Th17 cells clustering apart from other T-helper sub-types. We also see sub-type clustering within CD14 monocytes, with three distinct clusters: non-classical; intermediate and classical (see zoomed in section Fig 5). Therefore, as seen for the DNA methylation analysis in public data the marker loci appear to be able to provide a greater level of distinction than they were initially selected for. This speaks to the role of epigenetics in 'hard-wiring' cell lineage and regulating gene expression, and highlights the exciting possibility that DNA methylation could be explored to uncover previously unrecognised/identified immune cell sub-types.

Fig 5

tSNE plot of RNA expression in publicly available data for sorted-cells for the 11 genes highlighted in this study (RNAseq data from GSE107011).

Points represent individual samples: CD8+ central memory/naive (purple squares), CD8+ effector memory/terminal effector (purple circles), CD4 Th17 cells (yellow diamonds), other CD4 T-helper cells (yellow circles), and CD14 monocytes: classical (blue circles), intermediate (blue triangles), non-classical (blue crosses). Interactive versions of all figures are available online (https://sirselim.github.io/tSNE_plotting/).

Here we have further interrogated 11/1173 CpG sites identified in our initial discovery analysis of cell sorted immune cell populations from six healthy adult males, validating our observations in an independent cohort and in publicly available datasets (including both males and females). The public data also included a cohort of neonates, demonstrating that the candidate loci held up in newborn samples too. We have not investigated whether differences are observed in individuals of varying ethnicity, although this would be an interesting avenue for further investigation. We have only looked at 11 loci; we believe that further investigation of the remaining sites with respect to their biological significance will likely reveal additional insights.

Conclusion

In summary, this study highlights the value of mining publicly available data, the utility of DNA methylation as a discriminatory marker, the potential value of DNA methylation to provide additional insights into immune cell biology and the tantalising possibility that DNA methylation can be harnessed to reveal new aspects of cell biology including the identification of currently unrecognised/undistinguishable immune cell sub-types.

S1 Table. Annotation and other key information for all 1173 methylation CpG sites which were identified as being suitable cell-type markers. (CSV)

S2 Table. Differential methylation statistics for pair-wise comparisons between cell populations of 11 CpG markers. (CSV)

S3 Table. Differential expression statistics for pair-wise comparisons between cell populations of 11 gene transcripts. (CSV)

6 Jul 2020

PONE-D-20-13841
DNA methylation in blood - potential to provide new insights into cell biology
PLOS ONE

Dear Dr. Benton,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 20 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future.
For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Osman El-Maarri, Ph.D Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please include your tables as part of your main manuscript and remove the individual files. Please note that supplementary tables (should remain/ be uploaded) as separate "supporting information" files 3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: . 4. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide and URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. 
Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Authors have gathered lot of data and have analyzed them rigorously but there are some aspects completly missing from the manuscript. 1. While gathering data for methylation, authors did not look for the expression data for the same cell types in publicly available databases 2. Author discussed about the statistical methods to analyze methylation data but no information was given for the batch effect correction, normalization and other methods for data processing. Also no information was found about the data downloaded from MARMAL-AID for example number of samples, age, population, gender. From t-SNE plot it seems many samples but no information was provided about these samples 3. The writing logical organization of this paper is poor and difficult to understand Reviewer #2: In the manuscript entitled “DNA methylation in blood-potential to provide new insights into cell biology”, Macartney-Coxson et al., describe the results of a study aimed at identifying CpG-specific markers for different white blood cell types by mining data from a single publicly available data set consisting of Illumina HumanMethylation450-derived cell-specific DNA methylation signatures profiled across different leukocyte subtypes. The top two CpGs identified for five leukocyte subtypes were carried forward and validated via pyrosequencing in an independent data set. Moreover, gene expression was profiled in the subset of genes containing the aforementioned CpGs the same set of samples that underwent pyrosequencing, and used for comparisons of the expression profile across the five cell types. 
While the identification of lineage-specific DNA methylation markers is an area of great interest to the epigenetics research community given the now well-recognized potential for confounding by cell heterogeneity in EWAS of whole-blood or PBMCs and the fact that reference-based methodologies for deconvolution rely on such markers to enable accurate/reliable deconvolution estimates, the contributions of the research reported in this manuscript are incremental in comparison to what has already been published in this area. Despite this, the manuscript is well written and for the most part, easy to follow. Specific comments and suggestions for improvement are detailed below: Major comments: 1. Above all, it is not clear what the unique contributions of this study are above and beyond work that has already been published in this area dating back to the 2012 publication of the Houseman method (PMID: 22568884) for mixture deconvolution; a method which relies on cell specific markers across leukocyte cell types, and the original Reinius paper (the discovery data set in this examination). Moreover, there are several key references (PMID: 29843789; PMID: 26956433; PMID: 27529193, to name a few) concerning the identification of white-blood-cell lineage markers that are missing and should be cited in the manuscript. 2. At the very least, the authors should consider using the GEO data set (GSE110555) as an additional validation data set. This GEO series consists of Illumina HumanMethylationEPIC data on isolated leukocyte subtypes profiled in different, healthy, non-diseased adults. While the array technology differs from the Reinius data set (450K), 90% of the CpGs on the 450K array are also contained on the EPIC array, so the vast majority of the ~1,000 cell specific markers identified in the discovery data set should be amenable to validation in an independent data set. 
The above mentioned data set is but one of several different publicly available data sets with leukocyte-specific methylation data profiled using either the 450K or EPIC data sets. Along these lines, it would be interesting to see if the markers identified here (identified in adult samples) hold up when examined in cord-blood derived leukocytes. GSE82084 was identified from a cursory search of GEO and contains 450K data on T cells, granulocytes, and monocytes isolated from cord-blood. 3. There is insufficient detail provided regarding the glmnet model used to identify cell-specific markers. What was the assumed response (and assumed distribution) in the fit of such models, what was the assumed random-effect, were the CpGs filtered in advance of fitting the glmnet model (e.g., removal/exclusion of sex-linked loci, cross-reactive probes, SNP associated CpGs, etc.), how were the tuning parameters for the elastic net model selected? 4. While a link is provided with the names/identities/annotations for the cell-specific markers identified in the discovery data set, in the opinion of this reviewer, such data should really be included as a supplementary table. 5. No limitations of the study are provided in the discussion. I would especially like for the authors to address the issue of the exclusive use of adult samples in their analysis and that such markers might not hold up in newborns or children. What is the race/ethnicity of the samples used in the discovery and validation data set and might that influence the results? 6. Were any formal statistical tests performed for the 11 CpGs in the validation samples? The results (Heatmap and Boxplot) are purely descriptive. From the looks of it, it seems that the results would be statistically significant, however a formal statistical method should be applied in this circumstance. 
On a related note, Figure S1, in my opinion, is more informative than the Heatmap used in Figure 2, as it conveys both central tendency and variability in the methylation/expression levels across the validation samples. The latter is not reflected in the Heatmap. As such, the authors should consider including Figure S1 as a Figure in the main manuscript. 7. It is mentioned numerous times in the manuscript that the analysis is “unsupervised”? How so? The analysis to identify cell-specific markers must be supervised in the sense that the identity of the cell type is used in the model as an independent variable. 8. It is mentioned in the conclusion that the results of this study can be “harnessed to reveal new aspects of cell biology included the identification of unrecognized/undistinguishable cell types”. How so? ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. 
Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

7 Sep 2020

Please note: for clarity reviewer comments are in italics, our responses are in normal text and additional text added to the manuscript as part of the revision is in red.

Reviewer #1: Authors have gathered lot of data and have analyzed them rigorously but there are some aspects completely missing from the manuscript.

1. While gathering data for methylation, authors did not look for the expression data for the same cell types in publicly available databases

We selected 11 candidate loci for validation in an independent cohort of samples. DNA and RNA were extracted from the same sample simultaneously using the Qiagen AllPrep kit. DNA methylation was assayed using pyrosequencing and gene expression using real-time PCR. The results of the gene expression analyses are presented in Figure 2 in the original manuscript. In light of the reviewer's comment we have also explored expression of these 11 genes in a publicly available RNAseq dataset (GSE107011: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107011). A new section has been added to the manuscript to reflect this, and a new figure (Figure 5) presents a t-SNE analysis of the gene expression data which demonstrates clear separation of cell-sorted immune cells using these 11 genes.

Validation in publicly available data

RNA expression

To further explore expression of the 11 genes we selected for our validation of DNA methylation and RNA expression on independent samples, we interrogated a publicly available RNAseq dataset, GSE107011 (Xu 2019). We extracted data for the same five cell populations (CD4+, CD8+, CD19+, CD56+ and CD14+) and then extracted expression data (TPM, transcripts per million) for each of the 11 genes.
Figure 5 presents the data and illustrates the ability of mRNA expression from these 11 genes to clearly differentiate cell types. It is interesting to note that effector memory/terminal effector CD8+ cells (purple circles) are distinguished from central memory/naive CD8+ cells (purple squares), and that Th17 CD4+ cells (yellow diamonds) separate from other CD4+ cells (yellow circles). Interactive versions of all figures are available online (https://sirselim.github.io/tSNE_plotting/).

2. Author discussed about the statistical methods to analyze methylation data but no information was given for the batch effect correction, normalization and other methods for data processing. Also no information was found about the data downloaded from MARMAL-AID for example number of samples, age, population, gender. From t-SNE plot it seems many samples but no information was provided about these samples

We thank the reviewer for the opportunity to provide more clarity. We have updated the methods section to better reflect the processing from ‘raw’ idat files through to analysis of methylation beta values. MARMAL-AID was the R package used to download the Reinius data; all applicable sample information is available at the GEO page (GSE35069, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35069).

The Reinius data was downloaded using the R package MARMAL-AID. All applicable sample information is available at the GEO page (GSE35069, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35069). Raw intensity data (Illumina 450K idats) were loaded into R (R Core Team 2017) using the Bioconductor minfi package (Aryee 2014). Background correction and control normalisation were implemented in minfi. Probes were classed as failed if the intensity for both the methylated and unmethylated probes was <1,000. Any probe which failed in at least one sample was removed from the entire dataset.
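The probe-failure rule just described can be sketched in a few lines. This is a minimal illustration in plain Python, not the R/minfi code the authors used. The function names, the dictionary layout and the toy probe IDs are assumptions made for the example; only the 1,000 intensity threshold and the "failed in at least one sample" rule come from the text.

```python
from typing import Dict, List, Set, Tuple

# Intensity cut-off quoted in the text; everything else here is illustrative.
FAIL_THRESHOLD = 1000.0

# Hypothetical layout: probe ID -> one intensity value per sample.
Intensities = Dict[str, List[float]]


def failed_probes(meth: Intensities, unmeth: Intensities,
                  threshold: float = FAIL_THRESHOLD) -> Set[str]:
    """Return probes whose methylated AND unmethylated intensities are both
    below the threshold in at least one sample."""
    failed = set()
    for probe, m_values in meth.items():
        u_values = unmeth[probe]
        if any(m < threshold and u < threshold
               for m, u in zip(m_values, u_values)):
            failed.add(probe)
    return failed


def drop_failed(meth: Intensities, unmeth: Intensities,
                threshold: float = FAIL_THRESHOLD
                ) -> Tuple[Intensities, Intensities]:
    """Remove every failed probe from both channels, across all samples."""
    bad = failed_probes(meth, unmeth, threshold)
    kept_meth = {p: v for p, v in meth.items() if p not in bad}
    kept_unmeth = {p: v for p, v in unmeth.items() if p not in bad}
    return kept_meth, kept_unmeth


# Toy data: three hypothetical probes measured in two samples.
meth = {"probe_a": [5200.0, 4800.0],
        "probe_b": [300.0, 6100.0],
        "probe_c": [150.0, 7000.0]}
unmeth = {"probe_a": [900.0, 1100.0],
          "probe_b": [450.0, 2000.0],
          "probe_c": [3000.0, 800.0]}

kept_meth, kept_unmeth = drop_failed(meth, unmeth)
print(sorted(kept_meth))  # probe_b is dropped: both channels < 1,000 in sample 1
```

Note that a probe with only one weak channel in a given sample (probe_c here) is retained, because the rule requires both channels to fall below the threshold in the same sample.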
We also removed all previously identified cross-reactive probes (Chen 2013), and 33 457 probes which we previously identified as aligning to the human genome more than once (Benton 2017). All analyses were performed on beta values, calculated as the intensity of the methylated channel divided by the total intensity plus an offset (methylated / (methylated + unmethylated + 100)).

With respect to the t-SNE plot for the 1173 CpG sites identified in our discovery analysis (Figure 1C), the points on the plot represent CpG sites, not samples. We have made sure that the legend for each figure which presents a t-SNE plot clearly states what the points represent (sample or CpG site as appropriate). This section is highlighted in the track changes version for each relevant figure legend.

3. The writing logical organization of this paper is poor and difficult to understand

We respectfully disagree with the reviewer but are more than happy to take editorial direction from the PLOS ONE team. We also note the comment from reviewer #2 that “…the manuscript is well written and for the most part, easy to follow”.

Reviewer #2: In the manuscript entitled “DNA methylation in blood-potential to provide new insights into cell biology”, Macartney-Coxson et al., describe the results of a study aimed at identifying CpG-specific markers for different white blood cell types by mining data from a single publicly available data set consisting of Illumina HumanMethylation450-derived cell-specific DNA methylation signatures profiled across different leukocyte subtypes. The top two CpGs identified for five leukocyte subtypes were carried forward and validated via pyrosequencing in an independent data set. Moreover, gene expression was profiled in the subset of genes containing the aforementioned CpGs the same set of samples that underwent pyrosequencing, and used for comparisons of the expression profile across the five cell types.
While the identification of lineage-specific DNA methylation markers is an area of great interest to the epigenetics research community given the now well-recognized potential for confounding by cell heterogeneity in EWAS of whole-blood or PBMCs and the fact that reference-based methodologies for deconvolution rely on such markers to enable accurate/reliable deconvolution estimates, the contributions of the research reported in this manuscript are incremental in comparison to what has already been published in this area. Despite this, the manuscript is well written and for the most part, easy to follow. Specific comments and suggestions for improvement are detailed below: We thank the reviewer for their very constructive comments and believe that the additional analyses, clarification and discussion have significantly improved the manuscript. Major comments: 1. Above all, it is not clear what the unique contributions of this study are above and beyond work that has already been published in this area dating back to the 2012 publication of the Houseman method (PMID: 22568884) for mixture deconvolution; a method which relies on cell specific markers across leukocyte cell types, and the original Reinius paper (the discovery data set in this examination). Moreover, there are several key references (PMID: 29843789; PMID: 26956433; PMID: 27529193, to name a few) concerning the identification of white-blood-cell lineage markers that are missing and should be cited in the manuscript. We thank the reviewer for this comment as it highlighted our need to clarify that our manuscript is not about immune cell deconvolution per se (as PMID: 26956433, 29843789, 27529193 [and many others] as indicated by the reviewer). We have modified the manuscript to provide additional clarity and acknowledge the plethora of papers exploring cell composition deconvolution as below, and that PMID. 
Additional text in introduction: Epigenetic marks, including DNA methylation, are increasingly recognised as potential discriminators of cell type (Salas 2018). This attribute has been utilised by a number of researchers to develop methods which correct for and/or deconvolute the variability introduced by cell mixtures in DNA methylation studies, particularly in blood samples (Houseman 2012, Salas 2018a, Kim 2016, Houseman 2016, Decamps 2020); a notable example - the so-called Houseman algorithm (Houseman 2012) - has been incorporated in to standard bioinformatic pipelines, including the R minfi package (Aryee et al 2014), for DNA methylation arrays. This behaviour of DNA methylation as a marker also suggests the possibility of such 'marks' revealing new aspects of biology - for instance it may highlight previously unrecognised immune cell populations. 2. At the very least, the authors should consider using the GEO data set (GSE110555) as an additional validation data set. This GEO series consists of Illumina HumanMethylationEPIC data on isolated leukocyte subtypes profiled in different, healthy, non-diseased adults. While the array technology differs from the Reinius data set (450K), 90% of the CpGs on the 450K array are also contained on the EPIC array, so the vast majority of the ~1,000 cell specific markers identified in the discovery data set should be amenable to validation in an independent data set. The above mentioned data set is but one of several different publicly available data sets with leukocyte-specific methylation data profiled using either the 450K or EPIC data sets. Along these lines, it would be interesting to see if the markers identified here (identified in adult samples) hold up when examined in cord-blood derived leukocytes. GSE82084 was identified from a cursory search of GEO and contains 450K data on T cells, granulocytes, and monocytes isolated from cord-blood. We are very grateful to the reviewer for this really constructive comment. 
We have now analysed 3 publicly available DNA methylation data sets: two which used the Illumina EPIC arrays (one suggested by the reviewer), and the data from cord-blood derived leukocytes (also highlighted by the reviewer), which used the earlier Illumina 450K arrays. This analysis has provided an additional, strong validation of the panel of CpGs identified by our initial analyses. We have modified the manuscript accordingly.

Validation in publicly available data

DNA methylation

In order to further investigate the panel of 1173 CpG sites identified in our initial analysis we interrogated their methylation in 3 publicly available data sets: one, GSE82084, using the Illumina 450K platform (as per the Reinius data used in our discovery analysis) and two (GSE103541, GSE110554) using the more recent Illumina EPIC platform. Of the 1173 CpG sites, 1025 were present on both platforms. The two EPIC studies performed DNA methylation analysis of cell sorted immune cell populations from adults (Salas 2018), whereas the 450K study looked at DNA methylation in cord blood from term and preterm newborns (de 2017). Figure 4 presents a 2D tSNE plot of all 1025 CpG sites which clearly shows separation of immune cell populations. It is interesting to note that the T cells of neonates (orange/red triangles) sit between the CD4+ and CD8+ T cells, consistent with an undifferentiated state.

Figure 4. tSNE plot of sorted cells from 211 samples based on 1025 CpG sites overlapping between the three publicly available datasets. Points represent individual samples.

3. There is insufficient detail provided regarding the glmnet model used to identify cell-specific markers.
What was the assumed response (and assumed distribution) in the fit of such models, what was the assumed random-effect, were the CpGs filtered in advance of fitting the glmnet model (e.g., removal/exclusion of sex-linked loci, cross-reactive probes, SNP associated CpGs, etc.), how were the tuning parameters for the elastic net model selected?

We have provided more detailed information in the methods section addressing these comments.

Raw intensity data (Illumina 450K idats) were loaded into R (R Core Team 2017) using the Bioconductor minfi package (Aryee 2014). Background correction and control normalisation were implemented in minfi. Probes were classed as failed if the intensity for both the methylated and unmethylated probes was <1,000. Any probe which failed in at least one sample was removed from the entire dataset. We also removed all previously identified cross-reactive probes (Chen 2013), and 33 457 probes which we previously identified as aligning to the human genome more than once (Benton 2017). All analyses were performed on beta values, calculated as the intensity of the methylated channel divided by the total intensity plus an offset (methylated / (methylated + unmethylated + 100)).

Briefly, glmnet fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elastic-net penalty at a grid of values for the regularization parameter λ. The elastic-net penalty is controlled by α, and bridges the gap between lasso (α=1, the default) and ridge (α=0). The ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others. The elastic-net penalty mixes these two; if predictors are correlated in groups, an α=0.5 tends to select the groups in or out together.
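To make the role of α concrete, the sketch below evaluates the elastic-net penalty term in the form given in the glmnet documentation, λ[(1 − α)/2 · ‖β‖₂² + α‖β‖₁]. It is a plain-Python illustration, not the authors' R code; the function name and the toy coefficient vector are assumptions made for the example.

```python
import math
from typing import Sequence


def elastic_net_penalty(coefs: Sequence[float], lam: float, alpha: float) -> float:
    """Elastic-net penalty: lam * ((1 - alpha) / 2 * ||b||_2^2 + alpha * ||b||_1).

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)      # lasso part: sum of absolute values
    l2_sq = sum(b * b for b in coefs)    # ridge part: sum of squares
    return lam * (0.5 * (1.0 - alpha) * l2_sq + alpha * l1)


# Toy coefficients (hypothetical) and an arbitrary penalty strength.
beta = [1.0, -2.0]
lam = 0.5

lasso = elastic_net_penalty(beta, lam, alpha=1.0)       # pure L1
ridge = elastic_net_penalty(beta, lam, alpha=0.0)       # pure L2
near_ridge = elastic_net_penalty(beta, lam, alpha=0.05) # mostly L2, a little L1

print(lasso, ridge, near_ridge)
```

With a small α the penalty is dominated by the squared term, so correlated predictors tend to be shrunk together rather than reduced to a single representative.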
We selected an alpha at the lower end of the range (0.05) to shift the elastic-net model towards ridge regression, allowing us to retain more related features (CpG sites which share variance). For the glmnet modelling we used cross-validation to determine the optimal value of the regularization parameter λ, taking both the value giving the minimum mean squared error (MSE) and the value at minimum MSE + 1SE of the minimum MSE. The optimal λ values were then used for predictor variable selection.

4. While a link is provided with the names/identities/annotations for the cell-specific markers identified in the discovery data set, in the opinion of this reviewer, such data should really be included as a supplementary table.

In line with the reviewer's comment we have provided this information as Supplementary information, in addition to the link to the GitHub repository where this, and further, information is provided.

Supplemental Table S1: Annotation and other key information for all 1173 methylation CpG sites which were identified as being suitable cell-type markers.

5. No limitations of the study are provided in the discussion. I would especially like for the authors to address the issue of the exclusive use of adult samples in their analysis and that such markers might not hold up in newborns or children. What is the race/ethnicity of the samples used in the discovery and validation data set and might that influence the results?

We thank the reviewer for this insight. In light of our re-analysis including the public data (see point 2 above), we demonstrate that in this initial exploration the selected markers perform equally well across both adult and neonatal samples. We feel that the comment about ethnicity is outside the scope of this research and think that this would be an excellent example of a future refinement to this approach. We have modified the manuscript accordingly.
Discussion: Figure 4 illustrates how well these 1025 CpG sites performed in additional, independent data. Furthermore, our initial analyses focused on samples from adults and as such we could not comment on their performance in neonates. However, one of the three public datasets we explored was from a study investigating DNA methylation in cord blood from term and preterm newborns. This clearly shows separation of immune cell sub-types isolated from neonates with the CpG markers we identified. …... Here we have further interrogated 11/1173 CpG sites identified in our initial discovery analysis of cell sorted immune cell populations from six healthy adult males - validating our observations in an independent cohort, and publicly available datasets (including both males and females). The public data also included a cohort of neonates demonstrating that the candidate loci held up in newborn samples too. We have not investigated whether differences are observed in individuals of varying ethnicity, although this would be an interesting avenue for further investigation. We have only looked at 11 loci; we believe that further investigation of the remaining sites with respect to their biological significance will likely reveal additional insights. 6. Were any formal statistical tests performed for the 11 CpGs in the validation samples? The results (Heatmap and Boxplot) are purely descriptive. From the looks of it, it seems that the results would be statistically significant, however a formal statistical method should be applied in this circumstance. On a related note, Figure S1, in my opinion, is more informative than the Heatmap used in Figure 2, as it conveys both central tendency and variability in the methylation/expression levels across the validation samples. The latter is not reflected in the Heatmap. As such, the authors should consider including Figure S1 as a Figure in the main manuscript. Once again we thank the reviewer for this comment and suggestion. 
We have now included the boxplots within the body of the manuscript (Figure 3). In addition, we have added text to describe statistical analyses of the DNA methylation and RNA expression data for the 11 candidate loci, and provide the results of these as Supplementary tables.

Added to methods: Differential methylation and expression analyses were performed in R using the default Student's t-test. P values were adjusted using the Benjamini-Hochberg method.

Added to results: Figure 3. Boxplots illustrating DNA methylation and gene expression levels for all 11 genes investigated. Methylation and gene expression data for a given gene are in adjacent boxplots.

The eleven candidate loci were assayed by pyrosequencing in the 12 samples from the validation cohort. We observed a strong agreement with the expected discriminatory patterns of DNA methylation for all loci examined (Figures 2 and 3). Supplementary Table S2 presents pair-wise Student's t-test statistics for the DNA methylation data.

Given the role that DNA methylation plays in regulation of gene expression we also explored the mRNA levels of the 11 candidate loci. We investigated gene expression by qRT-PCR in the 12 independent validation samples. A clear differentiation between immune cell types at the gene expression level was observed for PARK2, POU2F2, DCP2, CD248, CD8A, SLC15A4, CD40LG and CD19 but not for FAR1, WIPI2 and KLRB1 (Figures 2 and 3). Supplementary Table S3 presents pair-wise Student's t-test statistics for the gene expression data.

Added to Supplementary Information

Supplemental Table S2: Differential methylation statistics for pair-wise comparisons between cell populations of 11 CpG markers.

Supplemental Table S3: Differential expression statistics for pair-wise comparisons between cell populations of 11 gene transcripts.

7. It is mentioned numerous times in the manuscript that the analysis is “unsupervised”? How so?
The analysis to identify cell-specific markers must be supervised in the sense that the identity of the cell type is used in the model as an independent variable.

We take the reviewer's point and, to avoid confusion, have removed the term ‘unsupervised’ from the manuscript and replaced it with “a machine learning approach”.

8. It is mentioned in the conclusion that the results of this study can be “harnessed to reveal new aspects of cell biology included the identification of unrecognized/undistinguishable cell types”. How so?

We have modified the manuscript to highlight the potential power of DNA methylation and the aspects of our study which we believe suggest this possibility. We note that our exploration of publicly available data to further explore our DNA methylation and gene expression results (as suggested by both reviewers - many thanks) has improved our ability to provide this.

Introduction: This attribute has been utilised by a number of researchers to develop methods which correct for and/or deconvolute the variability introduced by cell mixtures in DNA methylation studies, particularly in blood samples (Houseman 2012, Salas 2018a, Kim 2016, Houseman 2016, Decamps 2020); a notable example - the so-called Houseman algorithm (Houseman 2012) - has been incorporated into standard bioinformatic pipelines, including the R minfi package (Aryee 2014), for DNA methylation arrays. This behaviour of DNA methylation as a marker also suggests the possibility of such 'marks' revealing new aspects of biology - for instance it may highlight previously unrecognised immune cell populations.

Discussion: The potential of this approach is highlighted by our analyses of publicly available data and the 1025/1173 'candidate CpG sites' which overlapped between the 450K and EPIC Illumina bead platforms. Figure 4 illustrates how well these 1025 CpG sites performed in additional, independent data.
Furthermore, our initial analyses focused on samples from adults and as such we could not comment on their performance in neonates. However, one of the three public datasets we explored was from a study investigating DNA methylation in cord blood from term and preterm newborns. This clearly shows separation of immune cell sub-types isolated from neonates with the CpG markers we identified. It also suggests that such 'biomarkers' can potentially identify additional aspects of cell identity; for instance, the T cells of neonates (orange/red triangles) sit between the mature CD4+ and CD8+ T cells, consistent with an undifferentiated state. Furthermore, we also observed independent clustering of nucleated red blood cells from the same preterm newborn cohort. We believe these observations support the tantalising possibility that DNA methylation can be harnessed to reveal new aspects of cell biology including the identification of currently unrecognised/undistinguishable immune cell sub-types.

We also investigated the expression of the 11 genes in a publicly available RNAseq dataset from immune cell sorted populations and saw a clear separation of the cell types with these 11 transcripts (Figure 5). In addition, the t-SNE analysis hints at the power of these 11 transcripts to provide a more nuanced separation of cell types. For example, we observed distinct separation of CD8 T-cells into two clusters of sub-populations (Terminal Effector/Effector Memory and Central Memory/Naive). Similar clustering is seen within CD4 T-helper cells, with Th17 cells clustering apart from other T-helper sub-types. We also see sub-type clustering within CD14 monocytes, with three distinct clusters: non-classical, intermediate and classical (see zoomed in section Figure 5). Therefore, as seen for the DNA methylation analysis in public data, the marker loci appear to be able to provide a greater level of distinction than they were initially selected for.
This speaks to the role of epigenetics in 'hard-wiring' cell lineage and regulating gene expression, and highlights the exciting possibility that DNA methylation could be explored to uncover previously unrecognised/identified immune cell sub-types.

Submitted filename: Rebutal_document_Final.pdf

14 Oct 2020

DNA methylation in blood - potential to provide new insights into cell biology

PONE-D-20-13841R1

Dear Dr. Benton,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Osman El-Maarri, Ph.D
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The authors have addressed all my previous comments and have added the necessary information. Overall, the manuscript communicates the work of the authors. I recommend accepting this paper.
Reviewer #2: The authors have adequately addressed the reviewer's comments and have incorporated the suggestions from the reviewers into their revised manuscript.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No

26 Oct 2020

PONE-D-20-13841R1

DNA methylation in blood - potential to provide new insights into cell biology

Dear Dr. Benton:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Priv.-Doz. Dr. Osman El-Maarri
Academic Editor
PLOS ONE
