Brenna A LaBarre, Alexander Goncearenco, Hanna M Petrykowska, Weerachai Jaratlerdsiri, M S Riana Bornman, Vanessa M Hayes, Laura Elnitski.
Abstract
BACKGROUND: Current array-based methods for measuring DNA methylation rely on sodium bisulfite conversion to differentiate between methylated and unmethylated cytosine bases in DNA. In the absence of genotype data, this process can make the data ambiguous to interpret when a sample has a polymorphism at a methylation probe site. A common way to minimize this problem is to exclude such potentially problematic sites, with some methods removing as much as 60% of array probes from consideration before data analysis.
Keywords: Bisulfite sequencing; CTCF sites; Data analysis; Enhancers; Illumina methylation array; Methylation probes; Polymorphisms; Single nucleotide polymorphisms (SNPs)
Year: 2019 PMID: 31861999 PMCID: PMC6923858 DOI: 10.1186/s13072-019-0321-6
Source DB: PubMed Journal: Epigenetics Chromatin ISSN: 1756-8935 Impact factor: 4.954
Fig. 1 Venn diagram of known polymorphisms recommended for removal in published literature. (a) For Illumina 450K data, four publications [5–8] agree on the identification of 289,952 polymorphic positions, but do not agree on an additional 38,407 positions. (b) For Illumina Epic data, two publications [9, 10] agree on the identification of 346,681 polymorphic positions but differ on another 42,369 positions. Data obtained via the methylcheck Python package
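The agreement counts in Fig. 1 amount to a set intersection across the published probe lists; the remaining positions are those flagged by at least one list but not by all of them. A minimal Python sketch of that bookkeeping follows. The helper name `split_agreement` and the probe IDs are made up for illustration and are not part of methylcheck or any cited tool.

```python
from typing import Iterable, Set, Tuple


def split_agreement(lists: Iterable[Iterable[str]]) -> Tuple[Set[str], Set[str]]:
    """Split flagged probes into those on every list and those on only some."""
    sets = [set(items) for items in lists]
    if not sets:
        return set(), set()
    shared = set.intersection(*sets)      # flagged by every publication
    disputed = set.union(*sets) - shared  # flagged by at least one, but not all
    return shared, disputed


# Toy probe IDs for illustration only; these are not real exclusion lists.
publication_lists = [
    {"cg0001", "cg0002", "cg0003"},
    {"cg0001", "cg0002", "cg0004"},
    {"cg0001", "cg0002"},
]
shared, disputed = split_agreement(publication_lists)
print(sorted(shared), sorted(disputed))
```

With real exclusion lists, the sizes of `shared` and `disputed` would correspond to the agreed and non-agreed counts reported in the caption, assuming the caption counts positions in this way.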
Fig. 2 (a) MethylToSNP can be integrated into existing methylation array processing pipelines, where the source data can originate from remote data sources, such as GEO, or from local files. MethylToSNP requires preprocessed data and will merge its SNP predictions with existing SNP annotations if available. For this reason, we recommend using it in conjunction with the Bioconductor minfi package. (b) Schematic representation of the MethylToSNP workflow. Given a minfi object or a plain matrix, MethylToSNP extracts beta-values and processes one ‘cg’ probe at a time. For each probe, methylation values from different samples are clustered to find gaps between clusters, with optional outlier exclusion (see “Methods”). The gaps are tested against the thresholds passed as the program’s parameters. Predicted SNPs are then reported along with the existing SNP annotation when available. Reliability scores are calculated to emphasize detection of patterns consistent with meC > T transitions
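The per-probe step in panel (b) can be illustrated with a simplified gap test on sorted beta-values. The sketch below is an assumption-laden illustration, not the MethylToSNP implementation: the function name `three_tier_split` and the single `min_gap` threshold are hypothetical, and outlier exclusion and reliability scoring are omitted.

```python
from typing import List, Optional, Sequence, Tuple

Tiers = Tuple[List[float], List[float], List[float]]


def three_tier_split(betas: Sequence[float], min_gap: float = 0.1) -> Optional[Tiers]:
    """Split one probe's beta-values into three tiers at the two widest gaps.

    Returns None when either of the two widest gaps is narrower than
    min_gap (a hypothetical threshold, not a MethylToSNP default).
    """
    values = sorted(betas)
    if len(values) < 3:
        return None
    # Gap between each pair of neighbouring sorted values, with its position.
    gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
    widest_two = sorted(gaps, reverse=True)[:2]
    if any(gap < min_gap for gap, _ in widest_two):
        return None
    first_cut, second_cut = sorted(index for _, index in widest_two)
    return (
        values[: first_cut + 1],
        values[first_cut + 1 : second_cut + 1],
        values[second_cut + 1 :],
    )


# Toy beta-values: one SNP-like probe and one probe without clear gaps.
snp_like = [0.95, 0.91, 0.47, 0.52, 0.05, 0.08, 0.93]
unimodal = [0.80, 0.82, 0.85, 0.88, 0.90]
print(three_tier_split(snp_like))
print(three_tier_split(unimodal))
```

Choosing cut points per probe, rather than fixed global cutoffs, mirrors the caption's point that each probe is processed on its own.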
Fig. 3 SNP presence distributes the methylation data into three tiers. All examples are illustrated using 95 southern African samples. (a) DNA methylation beta-values at a single SNP site plotted across all samples. (b) Data from 40 randomly selected sites with SNP-like three-tier patterns plotted together across all samples to illustrate the need for variability in cutoff values used at each SNP position. (c) The same data as in (b), shown separately for each array probe (i.e., genome locus). Probe-wise thresholds allow separation of data into three tiers
Fig. 4 Plot of beta-values at probe site cg21226234 in Yoruba in Ibadan, Nigeria (YRI) Illumina 27K samples detected by MethylToSNP. The site received a high reliability score of 0.958. The probe at this site corresponds to a known SNP, rs775651175
Variants predicted by applying MethylToSNP to the YRI and CEU HapMap methylation datasets
| Set | SNP predictions in dataset (#) | Overlap with dbSNP 142 (#) | Overlap with dbSNP 146 (#) | Potential novel SNPs (#) (a) | Total array sites removed by dbSNP 142 filtering (#) (b, c) | Total array sites removed by dbSNP 146 filtering (#) (b, c) | Median reliability score of predicted SNPs | Novel SNPs with reliability scores ≥ 0.5 (#) |
|---|---|---|---|---|---|---|---|---|
| YRI | 37 | 3 | 9 | 28 | 3563 | 5677 | 0.01875 | 6 |
| CEU | 283 | 37 | 57 | 226 | 3329 | 5629 | 0.011 | 31 |
Default threshold values used
(a) After filtering for known SNPs
(b) Direct overlap of C in CpG position
(c) After removing MethylToSNP
Sequencing results of three sites in the YRI predictions
| Probe of interest (genomic position) | MethylToSNP reliability score from 70 samples | Example genome | Average methylation level from array data | Predicted genotype (bisulfite-treated) | Average methylation level from bisulfite sequencing | Sequenced genotype |
|---|---|---|---|---|---|---|
| chr19:13068298 | 0.042 | NA19131 | 0.703 | C/C | 0.732 | C/C |
| | | NA18506 | 0.440 | C/T | 0.435 | C/C |
| | | NA18912 | 0.127 | T/T | 0.001 | C/C |
| chr14:105992620 | 0.034 | NA19172 | 0.830 | C/C | 0.913 | C/C |
| | | NA18503 | 0.405 | C/T | 0.446 | C/C |
| | | NA19222 | 0.113 | T/T | 0.020 | C/C |
| chr19:53757911 | 0.070 | NA18861 | 0.752 | C/C | 0.801 | C/C |
| | | NA19161 | 0.510 | C/T | 0.500 | C/C |
| | | NA18505 | 0.228 | T/T | 0.181 | C/C |
MethylToSNP predictions in the southern African data set
| Description | Probes tested (#) | SNP predictions in dataset (#) | Overlap with dbSNP 142 (#) | Overlap with dbSNP 146 (#) | Potential novel SNPs (#) | Sites lost by filtering all dbSNP 142 positions (#) (b) | Sites lost by filtering all dbSNP 146 positions (#) (b) | Median reliability score of predicted SNPs | Novel SNPs with reliability scores ≥ 0.5 (#) |
|---|---|---|---|---|---|---|---|---|---|
| All probe sites | 473,767 | 2296 | 1249 | 1402 | 894 | 101,558 | 144,569 | 0.979 | 827 |
| Differential methylation (c) | 12,613 | 23 | 19 | 19 | 4 | 2143 | 3081 | 0.395 | 2 |
| Top 5% differential methylation (a, c) | 400 | 1 | 0 | 0 | 1 | 0 | 48 | 0.779 | 1 |
(a) After filtering for known SNPs
(b) Direct overlap of C in CpG position
(c) Manuscript under revision
Sequence data verifies the presence of a SNP identified using MethylToSNP
| Subject | Methylation beta-value at cg00117311 | Putative alleles | Allele frequencies at rs78210031 |
|---|---|---|---|
| KB1 | 0.458 | CT | 0.5 C/0.5 T |
| TK1 | 0.542 | CT | 0.5 C/0.5 T |
| NB1 | 0.954 | CC | C |
| MD8 | 0.949 | CC | C |
Sequence data were obtained from the Penn State Genome Browser for four KhoeSan samples
Sequence data verifies the presence of SNPs identified using MethylToSNP
| cg ID (a) | Chromosome | CpG position (hg38) | SNP coordinate within 10 bp | rs ID | Information |
|---|---|---|---|---|---|
| cg00786635 | chr1 | 25,267,710–25,267,711 | 25,267,707 | rs145726224 | Common in African populations including southern Africans |
| | | | 25,267,714 | N/A | Rare, found in one southern African |
| cg07482220 | chr6 | 32,178,742–32,178,743 | 32,178,737 | rs112124640 | Rare, identified in 3 southern Africans |
| cg18976974 | chr8 | 102,978,096–102,978,097 | 102,978,097 | N/A | Rare, only in one genome |
| cg10633981 | chr11 | 16,758,221–16,758,222 | N/A | N/A | No WGS data for genome of interest |
(a) cg ID: identifier from Illumina array annotations
Comparison of MethylToSNP calls and gap hunting calls
| Data set | MethylToSNP predictions | Gap hunting 3-groups predictions | Overlap of MethylToSNP with 3-groups predictions (%) (a) | Overlap of MethylToSNP with all gap hunting groups (%) (b) |
|---|---|---|---|---|
| 27K YRI and CEU | 371 | 8486 | 47 | 100 |
| 27K CEU | 283 | 8416 | 44 | 100 |
| 27K YRI | 37 | 3409 | 73 | 97 |
YRI: Yoruba in Ibadan, Nigeria population; CEU: CEU HapMap population
(a) Feature results found in gap hunting 3-group results
(b) Feature results found in any of 9 gap hunting groups
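The overlap percentages in this comparison can be read as the share of MethylToSNP predictions that also appear among the gap hunting calls. A minimal Python sketch of that calculation follows, assuming the denominator is the number of MethylToSNP predictions; the helper `percent_overlap` and the probe IDs are hypothetical and not taken from either tool.

```python
from typing import Iterable


def percent_overlap(predictions: Iterable[str], reference: Iterable[str]) -> float:
    """Percentage of `predictions` that are also present in `reference`."""
    predicted = set(predictions)
    if not predicted:
        raise ValueError("no predictions to compare")
    return 100.0 * len(predicted & set(reference)) / len(predicted)


# Toy probe IDs for illustration only; not real calls from either method.
methyl_to_snp_calls = ["cg0001", "cg0002", "cg0003", "cg0004"]
gap_hunting_calls = ["cg0002", "cg0003", "cg0004", "cg0005", "cg0006"]
print(percent_overlap(methyl_to_snp_calls, gap_hunting_calls))  # 75.0
```

Because the measure is asymmetric, a large gap hunting call set can contain nearly all MethylToSNP predictions while most gap hunting calls have no MethylToSNP counterpart.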
Reliability scores from the simulated data sets
| Data set | Mean reliability score | Median reliability score |
|---|---|---|
| Set frequency | 0.568 | 0.553 |
| Uniform frequency | 0.501 | 0.500 |