Samantha L Wilson1, Shu Yi Shen1, Lauren Harmon2, Justin M Burgener1,3, Tim Triche2, Scott V Bratman1,3, Daniel D De Carvalho1,3, Michael M Hoffman1,3,4,5.
Abstract
Cell-free methylated DNA immunoprecipitation sequencing (cfMeDIP-seq) identifies genomic regions with DNA methylation, using a protocol adapted to work with low-input DNA samples and with cell-free DNA (cfDNA). We developed a set of synthetic spike-in DNA controls for cfMeDIP-seq to provide a simple and inexpensive reference for quantitative normalization. We designed 54 DNA fragments with combinations of methylation status (methylated and unmethylated), fragment length (80 bp, 160 bp, 320 bp), G + C content (35%, 50%, 65%), and fraction of CpG dinucleotides within the fragment (1/80 bp, 1/40 bp, 1/20 bp). Using 0.01 ng of spike-in controls enables training a generalized linear model that absolutely quantifies methylated cfDNA in MeDIP-seq experiments. It mitigates batch effects and corrects for biases in enrichment due to known biophysical properties of DNA fragments and other technical biases.
Keywords: DNA methylation; absolute quantification; batch effects; cell-free DNA; cell-free methylated DNA immunoprecipitation; cfDNA; cfMeDIP; early detection of cancer; liquid biopsy; minimally invasive testing; reference standards; spike-in controls
Year: 2022 PMID: 36160046 PMCID: PMC9499995 DOI: 10.1016/j.crmeth.2022.100294
Source DB: PubMed Journal: Cell Rep Methods ISSN: 2667-2375
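The factorial design described in the abstract can be enumerated in a few lines. This is a minimal sketch, not the authors' design code: the field names are hypothetical, and treating the 54 fragments as the full grid of the listed levels is an assumption (the product 2 × 3 × 3 × 3 does equal 54). The expected CpG count per fragment is derived here as length divided by CpG spacing, which is also an assumption.

```python
from itertools import product

# Levels listed in the abstract.
METHYLATION = ("methylated", "unmethylated")
LENGTHS_BP = (80, 160, 320)
GC_PERCENT = (35, 50, 65)
CPG_SPACING_BP = (80, 40, 20)  # one CpG per this many bp (1/80, 1/40, 1/20)


def design_grid():
    """Return one dict per combination of the four design factors."""
    grid = []
    for meth, length, gc, spacing in product(
        METHYLATION, LENGTHS_BP, GC_PERCENT, CPG_SPACING_BP
    ):
        grid.append(
            {
                "methylation": meth,
                "length_bp": length,
                "gc_percent": gc,
                "cpg_spacing_bp": spacing,
                # Assumption: expected CpG count = length / spacing.
                "expected_cpg_count": length // spacing,
            }
        )
    return grid


grid = design_grid()
print(len(grid))  # 54
```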
Figure 1. Experimental design using synthetic spike-in control DNA
(A) Technical assessment of the spike-in controls with cfDNA mimic. (Left) Assessment of technical bias in solely the spike-in controls. (Right) Optimization of the synthetic DNA amount using sheared HCT116 cfDNA mimic.
(B) Clinical evaluation of acute myeloid leukemia (AML) patient samples with spike-in controls.
Figure 2. Assessing biases in fragment length, G + C content, and CpG fraction
(A–C) (A) Input spike-in control DNA without cfMeDIP-seq, (B) output spike-in control DNA after cfMeDIP-seq, and (C) 0.01 ng spike-in control DNA added to HCT116 replicates.
Blue, methylated fragments; gray, unmethylated fragments. Circle, sample 1; triangle, sample 2. Solid line, mean of the two samples. Columns marked with numerals 1 and 2 represent alternative sets of fragments with identical properties but different sequences.
See also Table S2.
Range of reads sequenced for alternative synthetic spike-in fragments of length 320 bp with the same G + C contents and CpG fractions
| Experiment | G + C content (%) | CpG fraction | Within-alternative reads | Between-alternatives reads |
|---|---|---|---|---|
| 10 ng of synthetic DNA | 35 | 1/80 | 0–343 | 343 |
| 10 ng of synthetic DNA | 50 | 1/40 | 298–445 | 1,161 |
| 10 ng of synthetic DNA | 50 | 1/20 | 1,759–17,472 | 36,642 |
| HCT116 + 0.01 ng of synthetic DNA | 35 | 1/80 | 13–219 | 219 |
| HCT116 + 0.01 ng of synthetic DNA | 50 | 1/40 | 160–240 | 646 |
| HCT116 + 0.01 ng of synthetic DNA | 50 | 1/20 | 987–1,320 | 8,483 |
Within-alternative reads: minimum and maximum range between the individual replicates of two alternatives.
Between-alternatives reads: range between the means of two alternatives.
Figure 3. Two-dimensional histograms of the number of reads found in 300 bp windows
(A and B) Binned by molar amount and either (A) standard deviation of molar amount or (B) Umap k100 multi-read mappability.
Histograms only include windows that do not overlap with UCSC simple repeats and the ENCODE blacklist, and regions with Umap k100 multi-read mappability scores . Asterisks indicate 11 outlier genomic windows chosen for further examination.
11 genomic windows of length 300 bp with high predicted molar amount
| Chr | Start | End | Amount | Gene | Repeat element | Repeat family | Repeat name |
|---|---|---|---|---|---|---|---|
| 19 | 308,701 | 309,000 | 0.86 pmol | MIER2 | – | – | – |
| 19 | 343,501 | 343,800 | 0.74 pmol | MIER2 | – | – | – |
| 19 | 613,201 | 613,500 | 0.73 pmol | HCN2 | – | – | – |
| 19 | 651,601 | 651,900 | 0.70 pmol | RNF126 | – | – | – |
| 6 | 495,301 | 495,600 | 0.31 pmol | EXOC2 | SINE, LINE | Alu, L1 | AluYk2, L1ME2z |
| 6 | 426,601 | 426,900 | 0.28 pmol | – | LTR | ERV1 | LTR12C |
| 19 | 759,601 | 759,900 | 0.27 pmol | MISP | SINE | Alu | AluY |
| 6 | 4,048,201 | 4,048,500 | 0.26 pmol | PRPF4B | SINE | Alu | AluY |
| 19 | 671,401 | 671,700 | 0.25 pmol | – | SINE, SINE, SINE | Alu, Alu, Alu | AluSx4, AluYa8, MIRb |
| 19 | 958,501 | 958,800 | 0.25 pmol | ARID3A | SINE | Alu | AluY |
| 19 | 858,001 | 858,300 | 0.25 pmol | – | SINE | Alu | AluY |
Rows are sorted by decreasing molar amount.
Chr: chromosome.
Start, End: GRCh38/hg38 genomic positions, 1-start, fully closed.
Gene: symbols of GENCODE version 33 basic gene set genes (Frankish et al., 2019) that overlap our 300 bp genomic windows.
Repeat element, Repeat family, Repeat name: elements, families, and names of RepeatMasker (Smit et al., 2010) version 3.0 repeats that overlap our 300 bp genomic windows.
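The "1-start, fully closed" convention used in the table differs from the 0-start, half-open convention of BED files. The sketch below shows how a position maps to its 300 bp window and how a window converts to BED coordinates. It assumes windows tile each chromosome starting at position 1, which is consistent with every start coordinate in the table but is not stated explicitly; the function names and the `chr19` chromosome label are hypothetical.

```python
WINDOW_BP = 300  # window length used in the table


def window_for_position(pos_1based):
    """Return the (start, end) of the window containing a 1-based position.

    Coordinates are 1-start, fully closed, as in the table.
    Assumption: windows tile the chromosome starting at position 1.
    """
    index = (pos_1based - 1) // WINDOW_BP
    start = index * WINDOW_BP + 1
    end = start + WINDOW_BP - 1
    return start, end


def to_bed(chrom, start_1based, end_closed):
    """Convert 1-start, fully closed to 0-start, half-open (BED)."""
    return chrom, start_1based - 1, end_closed


print(window_for_position(308_850))        # (308701, 309000)
print(to_bed("chr19", 308_701, 309_000))   # ('chr19', 308700, 309000)
```

Only the start coordinate shifts in the conversion; the closed end of a 1-based interval equals the open end of the corresponding 0-based interval.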
Figure 4. Correlation of two measurements of fragment methylation by cfMeDIP and EPIC array M-value for 300 bp genomic windows
(A, C, E, and G) Molar amount calculated from HCT116 samples correlated to EPIC array M-values.
(B, D, F, and H) Read counts calculated from the same samples, ignoring the spike-in controls.
(A and B) 37,714 windows with ≥3 CpG probes represented on the EPIC array.
(C and D) 7,975 windows with ≥5 CpG probes represented on the EPIC array.
(E and F) 2,066 windows with ≥7 CpG probes represented on the EPIC array.
(G and H) 158 windows with ≥10 CpG probes represented on the EPIC array.
Solid black line, linear model of best fit; dashed red line, loess (Cleveland 1979) local regression.
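For readers unfamiliar with M-values, the standard relationship between a methylation beta value (fraction methylated, between 0 and 1) and its M-value is a base-2 logit. The sketch below illustrates that relationship only; how the M-values in this figure were computed is not described here, and the clamping constant `eps` is an assumption added to avoid division by zero.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value: log2(beta / (1 - beta)).

    eps clamps beta away from 0 and 1 (an assumption, not from the paper).
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


for beta in (0.2, 0.5, 0.8):
    print(beta, round(beta_to_m(beta), 3))
```

A beta value of 0.5 corresponds to an M-value of 0, and the transform is symmetric around that point.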
Figure 5. Mean absolute error between known molar amount and predicted molar amount in test data consisting of held-out spike-ins not used for training
For each number of spike-in fragments between 6 and 25 inclusive, we randomly selected that number of spike-ins as training data, repeating the selection 100 times. We used the remaining spike-ins as test data. Each point shows the mean absolute error over all the test spike-ins for that iteration. The vertical limits of the plot include at least 98/100 iterations in every case. We denote outliers for 6 or 7 training spike-ins with a cross at the top of the plot, labeled with the mean absolute error for that case.
Red line denotes median mean absolute error.
See also Table S6.
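The repeated random subsampling described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's analysis: a simple least-squares line on read counts stands in for the generalized linear model, the toy data are randomly generated rather than taken from the study, and the total of 27 spike-ins and all function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def subsample_mae(reads, amounts, n_train, n_iter=100, seed=0):
    """Return one held-out mean absolute error per random train/test split."""
    rng = random.Random(seed)
    indices = list(range(len(reads)))
    maes = []
    for _ in range(n_iter):
        train = set(rng.sample(indices, n_train))
        test = [i for i in indices if i not in train]
        slope, intercept = fit_line(
            [reads[i] for i in train], [amounts[i] for i in train]
        )
        errors = [abs(amounts[i] - (slope * reads[i] + intercept)) for i in test]
        maes.append(statistics.fmean(errors))
    return maes


# Toy data (synthetic, for illustration only).
toy_rng = random.Random(1)
toy_reads = [toy_rng.randint(50, 5000) for _ in range(27)]
toy_amounts = [0.0002 * r + toy_rng.gauss(0, 0.01) for r in toy_reads]

for n_train in range(6, 26):
    maes = subsample_mae(toy_reads, toy_amounts, n_train)
    print(n_train, round(statistics.median(maes), 4))
```

The median printed for each training size corresponds to the quantity the red line traces in the figure.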
Figure 6. Principal component analyses of cfMeDIP results normalized through four different strategies, and associations with experimental variables
(Left) Proportion of the variance explained by each principal component. (Right) Association between known variables, both technical and clinical, and each principal component. Cohen’s d is an effect size that measures the standardized difference between the means of the groups defined by each variable. ∗∗∗p < 0.001.
(A) Raw read counts without any normalization.
(B) Read counts normalized using QSEA.
(C) Data normalized using spike-in controls.
(D) Data normalized using spike-in controls and removing regions in UCSC simple repeats, in the ENCODE blacklist, and with Umap k100 multi-read mappability scores ≤0.5.
See also Tables S3, S4, and S5.
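As a reminder of what the effect size reported in this figure measures, the sketch below computes Cohen's d with a pooled standard deviation for two groups. The key resources table lists the compute.es R package; the exact estimator and grouping the authors used are not stated here, so this textbook form and the example numbers (hypothetical principal component scores for two batches) are assumptions.

```python
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd


# Hypothetical principal component scores for samples from two batches.
batch_1 = [2.0, 4.0, 6.0]
batch_2 = [1.0, 3.0, 5.0]
print(cohens_d(batch_1, batch_2))
```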
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| 5-methylcytosine, clone 33D3 | Diagenode, Denville, NJ, USA | Cat# C1500081 |
| AML sample 6 | Leukemia Tissue Bank, Princess Margaret Cancer Centre, Toronto, ON, Canada | 150279 |
| AML sample 7 | Leukemia Tissue Bank, Princess Margaret Cancer Centre, Toronto, ON, Canada | 151050 |
| AML sample 8 | Leukemia Tissue Bank, Princess Margaret Cancer Centre, Toronto, ON, Canada | 160537 |
| AML sample 9 | Leukemia Tissue Bank, Princess Margaret Cancer Centre, Toronto, ON, Canada | 160326 |
| AML sample 10 | Leukemia Tissue Bank, Princess Margaret Cancer Centre, Toronto, ON, Canada | 150197 |
| High-Fidelity 2X Master Mix | New England Biolabs, Ipswich, MA, USA | Cat# M0492L |
| QIAquick PCR Purification Kit | Qiagen, Hilden, Germany | Cat# 28104 |
| | Thermo Fisher Scientific, Waltham, MA, USA | Cat# EM0821 |
| MinElute PCR Purification Kit | Qiagen, Hilden, Germany | Cat# 28004 |
| HpyCH4IV | New England Biolabs, Ipswich, MA, USA | Cat# R0619S |
| HpaII | New England Biolabs, Ipswich, MA, USA | Cat# R171S |
| AfeI | New England Biolabs, Ipswich, MA, USA | Cat# R0652S |
| Mycoplasma-free HCT116 genomic DNA | American Type Culture Collection, Manassas, VA, USA | |
| Synthetic spike-in control oligonucleotides, see | This paper | N/A |
| Ultramer DNA Oligonucleotides | IDT, Coralville, IA, USA | N/A |
| gBlocks Gene Fragments | IDT, Coralville, IA, USA | N/A |
| xGen Stubby Adapter and UDI primer pairs | IDT, Coralville, IA, USA | Cat# 10005924 |
| Raw cell line data | This paper | GEO: |
| Raw AML patient data | This paper | EGA: |
| Processed AML patient data | This paper | GEO: |
| Human reference genome GRCh38/hg38 | Genome Reference Consortium | RefSeq: |
| GENCODE v.33 | | |
| RepeatMasker v.3.0 | Institute for Systems Biology, Seattle, WA, USA | |
| GenRGenS | | |
| RNAStructure v.6.2 | | |
| Fastp v.0.11.5 | | |
| Bowtie2 v.2.3.5 | | |
| UMI-tools v.1.0.0 | | |
| Samtools v.0.10.2 | | |
| R v.3.4.1 | | |
| Bedtools v.2.29.2 | | |
| Sesame v.1.8.2 | | |
| QSEA v.1.16.0 | | |
| compute.es v.0.2.5 | Comprehensive R Archive Network (CRAN) | |
| spiky | This paper | |
| All code pertaining to this paper | This paper | |