
Computational challenges in detection of cancer using cell-free DNA methylation.

Madhu Sharma¹, Rohit Kumar Verma¹, Sunil Kumar², Vibhor Kumar¹.

Abstract

Cell-free DNA (cfDNA) methylation profiling is considered a promising and potentially reliable liquid biopsy approach for studying disease progression and for developing consistent diagnostic and prognostic biomarkers. Several different mechanisms are responsible for the release of cfDNA into blood plasma; hence, it can provide information about dynamic changes in the human body. Because cfDNA is fragmented, present at low concentration, and subject to high background noise, its analysis for routine cancer diagnosis faces several challenges. These challenges in analyzing the cfDNA methylation profile are further aggravated by heterogeneity, biomarker sensitivity, platform biases, and batch effects. This review delineates the origin of cfDNA methylation, its profiling, and the associated computational problems in analysis for diagnosis. We also consider the multi-marker approach for handling cancer heterogeneity and explore the utility of markers for 5hmC-based cfDNA methylation patterns. Further, we provide a critical overview of deconvolution and machine learning methods for cfDNA methylation analysis. Our review of current methods reveals the potential for further improvement in analysis strategies for detecting early cancer using cfDNA methylation.


Keywords: Cancer heterogeneity; Cell free DNA; Computation; DMP, Differentially methylated base position; DMR, Differentially methylated regions; Diagnosis; HELP-seq, HpaII-tiny fragment Enrichment by Ligation-mediated PCR sequencing; MBD-seq, Methyl-CpG Binding Domain Protein Capture Sequencing; MCTA-seq, Methylated CpG tandems amplification and sequencing; MSCC, Methylation Sensitive Cut Counting; MSRE, methylation sensitive restriction enzymes; MeDIP-seq, Methylated DNA Immunoprecipitation Sequencing; RRBS, Reduced-Representation Bisulfite Sequencing; WGBS, Whole Genome Bisulfite Sequencing; cfDNA, cell free DNA; ctDNA, circulating tumor DNA; dPCR, digital polymerase chain reaction; ddMCP, droplet digital methylation-specific PCR; ddPCR, droplet digital polymerase chain reaction; scCGI, methylated CGIs at single cell level

Year: 2021 PMID: 34976309 PMCID: PMC8669313 DOI: 10.1016/j.csbj.2021.12.001

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Traditional clinical diagnostic methods such as bone marrow or tissue biopsies are invasive and prone to sampling bias; consequently, researchers are looking for alternative molecular biomarkers. In recent years, liquid biopsy-based diagnostic techniques have gained importance because they are safer and faster than tissue-based studies [1]. One such liquid biopsy-derived method uses cancer traces obtained from cell-free DNA (cfDNA). These tumor-derived fragments are called circulating tumor DNA (ctDNA) and have shown potential in cancer diagnosis and prognosis [2]. The hematopoietic system is the major origin of cfDNA in healthy subjects, while in patients (e.g., with cancer), the affected cells and tissues contribute more to it. The plasma of a healthy individual contains 0–100 ng/ml of cfDNA, while in late-stage cancer patients it can reach 1000 ng/ml [3]. Following the discovery of cfDNA in 1948 in autoimmune diseases, its applications have been extended to the diagnosis of many types of abnormalities. These applications include identification of fetal chromosomal abnormalities (NIPT), detection of early graft rejection, and detection and monitoring of cancer [4]. Besides genetic alterations, epigenetic changes in cfDNA have also been found to be useful as diagnostic biomarkers in different types of cancer [5], [6]. One of the most robust epigenetic markers is DNA methylation, in which DNA methyltransferases (DNMTs) add a methyl group to the fifth carbon of cytosine [6]. Unmethylated CpGs are concentrated in gene promoter regions (CpG islands), while 70–80% of CpGs genome-wide are methylated in somatic cells. One application of cfDNA methylation patterns has been the identification of tissue of origin [4].
Moreover, various research findings show that DNA methylation-based biomarkers are more consistent than those based on mutational profiles [7], [8]. The EGFR Mutation Test v2 (Roche Molecular Diagnostics) for lung cancer and Epi proColon (Epigenomics AG) for colorectal cancer are examples of FDA-approved cfDNA-based tests [9]. A few large-scale prospective clinical trials are underway for the early detection of multiple types of cancer. Such multi-center trials include CCGA (Circulating Cell-free Genome Atlas), STRIVE, SUMMIT, and PATHFINDER by GRAIL Inc. [10]. An early report from these large-scale studies indicates low sensitivity in the detection of stage-I (18%) and stage-II (43%) cancer at a specificity of 99.3%, i.e., a false-positive rate of 0.7% [10]. Such low sensitivity for early cancer detection highlights the importance of reviewing the various steps involved in cfDNA methylation analysis. There have been a few reviews on profiling and analysis of 5mC-based DNA methylation patterns in cfDNA [5], [6], [11]. Each review has its own unique emphasis in terms of target disease, description of experimental protocols, and analysis procedures. In our review, besides exploring the origin of cfDNA methylation and the techniques for its analysis, we highlight the usability of markers and their sensitivity in light of the heterogeneity found in tumors. We also add a new dimension by discussing the sensitivity of 5hmC-based cfDNA methylation patterns for liquid biopsy. Finally, we highlight the benefits and limitations of deconvolution and machine learning methods for analyzing cfDNA methylation profiles.

Understanding cfDNA sources and features

Despite the extensive literature on cfDNA, its actual molecular origin is still poorly understood. Recent research has shown that multiple mechanisms underlie the release of cfDNA into the blood, such as apoptosis, necrosis, pyroptosis, autophagy, NETosis, and erythroblast enucleation, as well as the release of cell-free mitochondrial DNA (cf-mtDNA) [12], [13]. Several lines of evidence also suggest a role for cellular secretion in the release of cfDNA. The length of such secreted cfDNA fragments lies in the range of 1000–3000 bp, in contrast to fragments generated via apoptosis (90 bp to 166 bp) [14]. Moreover, cfDNA in the blood can be present in naked form (unbound DNA), circulate as a complex bound to nucleosomes, membrane fragments, or virtosomes, or be encased inside extracellular vesicles (EVs) such as exosomes, microvesicles, and apoptotic bodies [15]. Disease diagnosis can be based on signals derived from the cfDNA fragmentation pattern, nucleosome positioning, binding of transcription factors, transcription start site regions, and cfDNA end positions, as well as peripheral cellular alterations. Inherent properties of the information derived from cfDNA, such as sensitivity, noise, and DNA fragment length, affect pattern inference in the downstream computational analysis [16]. Also, in the case of cancer, tumor cells are not the only producers of cfDNA; non-cancerous cells also play an essential role in its release. The cfDNA released from non-cancerous cells distorts the signal from cancerous cells, and as a result the data become noisier and more heterogeneous [17]. Among the other factors that influence cfDNA levels, its clearance rate from plasma also plays a vital role in its detection [18].

Computational problems associated with different cfDNA methylation profiling techniques

To tackle the computational challenges associated with cancer detection using cfDNA methylation, it is crucial to understand the different techniques used to profile it. Based on the mechanism used to differentiate methylated from unmethylated cytosine, the experimental assays for studying cfDNA methylation fall into three major types: restriction enzyme-based, bisulfite conversion-based, and enrichment/immunoprecipitation-based [Fig. 1]. In addition, there are many assay-specific pipelines for computational analysis of cfDNA methylation data [19]. While bisulfite conversion-based methods are currently more common, the choice of method should be based on the proposed hypothesis, the required resolution, the cost, and the nature of the experiment [20].

Fig. 1

An overview of techniques for profiling DNA methylation that are also useful for profiling cfDNA. The triangular and circular symbols indicate further details of the different methods. The abbreviations for the different methods are expanded as follows: HELP: HpaII-tiny fragment enrichment by ligation-mediated PCR, CHARM: comprehensive high-throughput arrays for relative methylation, cfNOMe: cell-free DNA-based Nucleosome Occupancy and Methylation profiling, MSCC: methyl-sensitive cut counting, qPCR: Quantitative polymerase chain reaction, TAPS: TET-assisted pyridine borane sequencing, MRE-Seq: methylation restriction enzyme sequencing, RSMA: methylation-sensitive restriction enzyme-based assay, DMH: differential methylation hybridization, ddPCR: droplet digital PCR, EM-Seq: Enzymatic Methyl-seq, MeDip: methylation DNA immunoprecipitation sequencing, MIRA: methylated CpG island recovery assay, mDIP: methylated DNA immunoprecipitation, oxBs-seq: oxidative bisulphite sequencing, WGBS: whole-genome bisulphite sequencing, RRBS: reduced representation bisulphite sequencing, BC-Seq: bisulphite conversion followed by capture and sequencing, BiMP: bisulphite methylation profiling, BSPP: bisulphite padlock probe, TAB-seq: TET-assisted bisulphite sequencing.

Restriction enzyme based methods

The use of restriction enzymes has been a classical approach for profiling methylation patterns in cfDNA. Restriction enzymes cleave DNA strands at sites bearing a particular nucleotide sequence; however, the presence of a methyl group may prevent digestion. Broadly, two categories of enzymes are used here: methylation-sensitive restriction enzymes (MSRE) such as HpaII, McrBC, AciI, and Hin6I, which can cleave only unmethylated regions, and methylation-insensitive enzymes (e.g., MspI, ApeKI, and TaqI), which cut DNA sequences regardless of the methylation status of the recognition site [14]. There are a few variations of the basic MSRE technique for genome-wide identification of non-methylated regions, such as HELP-seq (HpaII-tiny fragment Enrichment by Ligation-mediated PCR sequencing), MSCC (Methylation Sensitive Cut Counting), Methyl-seq, and scCGI (methylated CGIs at single-cell level) [5]. However, the computational difficulty lies in distinguishing true from false negatives, because reads are lost through enzymatic digestion. Alternatively, analysis can be done using single-tube enzymatic methods such as DARE (DNA Analysis by Restriction Enzymes), where both methylated and unmethylated DNA can be quantified in the same sample [21]. Moreover, MSRE sequencing provides low methylome coverage because of the limited number of CpG-containing cleavage sites, and it is also possible that some of the restriction enzymes might have been destroyed, leading to the non-trivial problem of identifying true negatives during computational analysis [22]. Besides, since the MRE-seq approach is relatively uncommon and most tools cannot adequately extract the total reads mapping to a given recognition site, there is a gap in modern computational pipelines for studying MRE-seq-generated DNA methylation data [20], [23].
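
To make the distinction between methylation-sensitive and methylation-insensitive digestion concrete, the following is a minimal sketch of an in-silico digest at CCGG sites, the recognition sequence shared by HpaII and MspI. The function names, the toy sequence, and the `methylated_sites` input are hypothetical and chosen for illustration only; the sketch ignores the reverse strand, partial digestion, and read loss.

```python
import re

SITE = "CCGG"  # recognition sequence shared by HpaII and MspI


def find_sites(seq, site=SITE):
    """Return 0-based start positions of every occurrence of the site."""
    return [m.start() for m in re.finditer(f"(?={site})", seq.upper())]


def digest(seq, methylated_sites=(), sensitive=True):
    """Cut after the first C of each site (C^CGG) and return the fragments.

    methylated_sites: start positions of sites whose CpG is methylated
                      (hypothetical input for this sketch).
    sensitive=True  : HpaII-like behaviour, methylated sites are not cut.
    sensitive=False : MspI-like behaviour, every site is cut.
    """
    blocked = set(methylated_sites) if sensitive else set()
    cuts = [p + 1 for p in find_sites(seq) if p not in blocked]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


toy = "AACCGGTTTTCCGGAA"
print(digest(toy, sensitive=False))        # MspI-like: both sites cut
print(digest(toy, methylated_sites=[10]))  # HpaII-like: second site protected
```

Comparing the fragment sets from the two modes shows which sites were protected by methylation, which is the signal that MSRE-based assays rely on.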

Bisulfite based conversion methods

Since 1992, the application of bisulfite treatment has been a significant milestone in analyzing DNA methylation status. In this approach, all unmethylated cytosines are converted to uracil on reaction with bisulfite, while methylated cytosines remain unchanged. Consequently, comparing the sequence before and after bisulfite treatment gives an estimate of DNA methylation [24]. Bisulfite conversion is the foundation of many techniques, such as WGBS, RRBS, MCTA-seq, targeted bisulfite sequencing, methylation arrays, and MSP. Whole Genome Bisulfite Sequencing (WGBS) is currently the most comprehensive technique for identifying genome-wide DNA methylation patterns [25]. However, because the whole genome is targeted, the cost of this approach becomes extremely high [26], [27]. In contrast, RRBS (Reduced-Representation Bisulfite Sequencing) offers a balance between sequencing cost, genomic fold coverage, and the number of CpG sites measured. However, the applicability of RRBS to highly fragmented DNA is yet to be determined [28]. MCTA-Seq (Methylated CpG tandems amplification and sequencing) is a very sensitive technology used to detect hypermethylated sites in cfDNA in conditions such as HCC and cirrhosis [29], [30]. One of its drawbacks is that it recognizes only CpG tandem regions, which means it may overlook methylation at other sites. For routine diagnostics and target validation, TBS (Targeted Bisulfite Sequencing) has become a well-known approach to methylation profiling. It allows analysis of specific DNA locations while retaining single-CpG resolution, and it needs less DNA than the WGBS approach. The bisulfite conversion step reduces sequence complexity and introduces non-complementarity and asymmetric alignments, which makes the processing of bisulfite sequencing data difficult [20].
To cope with the reduced sequence complexity and allow adaptation of conventional alignment algorithms, many tools for bisulfite sequencing data have been developed [Table 1]. Another non-trivial computational challenge with bisulfite-based DNA methylation profiling is finding DMRs (differentially methylated regions). The DNA fragments interrogated with bisulfite conversion-based methods are mostly small and have few cytosine positions; therefore, calling statistically significant DMRs is more challenging than detecting DMPs (differentially methylated base positions) [31]. A recent study by Erger et al. presented an assay named cfNOMe, which uses enzymatic cytosine conversion as a substitute for bisulfite-based conversion to reduce the degradation loss and GC bias caused by the latter. Computational analysis of the cfNOMe profile also helps in calculating the nucleosome occupancy pattern at tissue-specific regulatory sites, making it a more efficient and comprehensive method for studying the epigenetic landscape of cfDNA [32].
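
As an illustration of how bisulfite reads translate into methylation estimates, the following is a minimal sketch that counts C (read as methylated) and T (read as unmethylated, i.e., converted) at CpG cytosines of a reference and reports the fraction methylated. It assumes ungapped, forward-strand reads with known start positions and complete conversion; the function name and the toy data are hypothetical and do not represent any specific tool listed in Table 1 or Table 2.

```python
def methylation_levels(reference, reads):
    """Estimate per-CpG methylation as C / (C + T) from bisulfite reads.

    reference: forward-strand reference sequence.
    reads    : list of (start, sequence) tuples, already aligned without gaps
               (an assumption of this sketch).
    Returns {CpG position: methylated fraction} for covered CpGs.
    """
    ref = reference.upper()
    counts = {}  # CpG cytosine position -> [methylated, unmethylated]
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if ref[pos:pos + 2] != "CG":
                continue  # only cytosines in a CpG context are scored
            tally = counts.setdefault(pos, [0, 0])
            if base == "C":      # cytosine survived conversion: methylated
                tally[0] += 1
            elif base == "T":    # cytosine converted: unmethylated
                tally[1] += 1
    return {
        pos: m / (m + u)
        for pos, (m, u) in sorted(counts.items())
        if m + u > 0
    }


reference = "ACGTTCGA"  # CpGs at positions 1 and 5
reads = [(0, "ACGTTTGA"), (0, "ATGTTTGA"), (4, "TCGA")]
print(methylation_levels(reference, reads))
```

With only a handful of reads per site, as is typical for cfDNA, these per-CpG fractions are noisy, which is one reason region-level (DMR) statistics are hard to establish.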

Table 1

Read alignment and Data visualization Tools.

S.No	Tools	Advantages	Disadvantages	References
1	BatMeth2	Indel-sensitive mapping	Removes some parts of reads (soft-clipping)	[129]
2	BSMAP	Good performance and flexibility due to seeding and hashing	Can detect indels with length less than 3 nucleotides only	[130]
3	Bismark	Flexible, easy to use and interpret	Increased run time	[131]
4	BS-Seeker2	Supports both local and gapped alignments	Local alignment leads to longer CPU times	[132]
5	BWA-meth	Directly usable output, lower storage requirements	Does not facilitate data visualization, only supports 3-letter alignment mode	[133]
6	BSmooth	Ability to handle low coverage experimental data	Assumes methylation profiles to be smooth, not able to detect single CpG sites	[134]
7	MethylCoder	Allows fast and sensitive mapping in both color and nucleotide space	Uses only short read aligners	[135]
8	Segemehl	Efficiently handles 3’ and 5’ contaminants along with mismatches and indels	Large memory requirements	[136]
9	GSNAP	SNP tolerant alignment, splicing and multiple mismatches can be detected	Might be slow for long positions	[137]
10	BRAT-BW	Runs faster on longer reads	Allows at most one mismatch in user defined reads	[138]
11	ERNE-BS5	Analysis of methylation pattern at repeats, skillfully handles multiple mapping reads	Chances of false positives are higher	[139]
12	GEM3	Exhaustive search model, fast, scalable, and gapped matches can also be found	some pruning methods are sensitive to mismatches	[140]
13	Last	High sensitivity and speed	Requires removal of poor quality bases	[141]
14	Msuite	Supports bisulfite-free techniques, 4-letter mode of alignment, and is computationally less expensive	Analysis on irregular CpG sites needs additional validation	[142]
15	TAMeBS	Filters ambiguous read alignments and reduces bias in context of methylated cytosines	Memory requirements and running time are high	[143]


Enrichment/immuno-precipitation based methods

The basic strategy behind enrichment-based methods is the use of anti-methylcytosine antibodies to extract methylated regions from the genome [33]. Methylated DNA Immunoprecipitation Sequencing (MeDIP-seq) and Methyl-CpG Binding Domain Protein Capture Sequencing (MBD-seq) are examples of techniques derived from affinity enrichment-based array analysis. MeDIP uses antibodies directed against mC and mCG to extract methylated DNA fragments and has been used in several settings such as trisomy detection, cancer, and cardiology [34], [35]. High-quality methylomes can be obtained by combining MeDIP with NGS, which provides 1 to 300 bp resolution at costs comparable to other enrichment techniques [36]. MBD-seq, on the other hand, uses the methyl-CpG binding domain (MBD) protein, coupled to magnetic beads, to pull down methylated DNA fragments. One study reports that MBD-seq can outperform MeDIP-seq in the proportion of CGIs identified [37]. Enrichment-based methods are cost-effective and have high discrimination power owing to protein-binding specificity. However, MBD-seq is mainly sensitive to highly methylated regions with high CpG density. This property of enrichment-based methods creates the computational challenge of correctly identifying differential methylation at sites with high tissue specificity but low CpG density. These methods also have low resolution in comparison to bisulfite-based methods, and the estimated confidence score is strongly influenced by the depth of sequencing [36]. Besides, some of the tools for enrichment-based data, such as Batman and MEDIPS [Table 2], require the user to perform quality control and read mapping beforehand, which is time-consuming and computationally demanding [38], [39]. In addition, computational analysis of enrichment-based DNA methylation profiles in early-stage cancer becomes difficult when the fraction of cfDNA from non-hematopoietic cells is very small.

Table 2

DNA Methylation Calling Software.

Applicability	Tool	Advantages	Disadvantages	Statistical model	Reference
MeDIP-seq	Batman	High resolution and cost-effective whole genome methylome can be obtained	Time-consuming to run even with multiple processors	Bayesian model	[38]
	MEDME	Provides both relative as well as absolute methylation levels, Can also be used for microarray designs of different platforms	Poor resolution in comparison to bisulfite based methods	Logistic model	[144]
	MEDIPS	More user friendly, cost and time effective	Difficult to detect methylation based on single end short reads	T-test, Wilcoxon test	[39]
	MeDUSA	Complete analysis of MeDIP-seq data from quality control to DMR calling	Approach employed is less efficient in terms of time and computation	Fisher’s exact test	[145], [146]
MBD-seq	MethylAction	Applicable on larger study designs (four group comparisons), detects DMR’s through bootstrapping	Chances of type one error	Negative binomial and ANODEV (Analysis of Deviance)	[147]
Bisulfite-based	RnBeads	High computational efficiency and cross platform analysis	Limited genome annotation packages	Bayes framework and Bartlett test	[148]
	DMRcate	Easy integration with other bioconductor tools, de novo based method	Make use of 450 k array only	F statistics	[149]
	DMRcaller	Detects DMRs in both CpG and non-CpG contexts	Sensitivity and specificity depends on window sizes, based on assumptions	Fisher’s exact test, Z test, Beta regression	[150]
	methylKit	Includes clustering functions along with DMRs visualisation	Limited by the memory of computer	Logistic regression and Fisher’s exact test	[151]
	MethylSig	Incorporates local information for estimating biological variation	Difficulty in handling heterogeneous data	Beta binomial model	[152]
	DSS	Capacity to handle multi factorial experimentation and data without biological replicates	Not suitable for paired design and longitudinal data type	Beta binomial distribution	[153]
MRE-seq	msgbsR	Removes fallacious mapped reads, explores differential methylation	Requires pre-processed raw data	Negative binomial model	[154]
5-hydroxymethylation	BiQ HiMod	user-friendly GUI, locus based methylation analysis and comprehensive analysis pipeline	pre-processed FASTA files are needed	Multiple statistical models	[155]


5-hydroxymethylation profiling

DNA demethylation by ten-eleven translocation (TET) enzymes proceeds through oxidation of 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), and further to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) [40], [41]. Studies show the emerging role of 5hmC as a prominent epigenetic marker, and it has been found to be associated with tumor progression. It is also enriched in enhancers and promoters, and changes in 5hmC level are linked to changes in gene expression levels [42], [43], [44]. A variety of 5-hydroxymethylation profiling techniques have been developed, such as 5hmC-Seal [45], hmC-CATCH [46], oxBS-seq [47], TAB-seq [48], and hMeDIP-seq [49]. The main weakness of 5hmC detection is the low frequency of the mark, which makes it more challenging to detect than 5mC. Also, 5hmC-derived protocols have low resolution (100–300 bp), are biased towards hypermethylated regions, and require relatively large DNA input. Hence, for early-stage cancer detection, where the contribution from non-blood sources of cfDNA is small, 5hmC enrichment-based methylation profiles might suffer from low sensitivity at the relevant sites. Bergamaschi et al. suggest that, to avoid model-based discrepancy, 5hmC-based molecular classifiers for cancer should be interpreted in an integrative manner by combining demographic and disease comorbidity knowledge with tumor histology and pathology [50].

Computational issues related to cfDNA methylation detection techniques

After processing of samples according to different protocols for isolation or enrichment of methylated cfDNA, several detection techniques could be used to measure their quantity. However, each detection technique has its own analytical issues as discussed below:

Polymerase chain reaction based methods

Due to the low concentration of methylated DNA of non-hematopoietic origin in plasma cfDNA, digital polymerase chain reaction (dPCR) is preferred over traditional PCR for cfDNA detection. Digital PCR has been shown to have a limit of detection 10³–10⁴-fold lower than that of the traditional version [51]. Digital PCR includes systems such as BEAMing (beads, emulsions, amplification, and magnetics) and droplet digital PCR (ddPCR). BEAMing was one of the first approaches for quantitatively detecting cfDNA and possesses great sensitivity and specificity. However, its workflow is complex, it requires oligonucleotides for each location, and it is costly for typical clinical work [52], [53]. ddPCR is based on water–oil emulsion droplets and has several applications, including identification of the tissue of origin [54], cancer detection [55], and diagnosis of infectious diseases [56]. ddPCR, with multiplex quantification, is one of the most frequently used techniques these days. Various automated algorithms have been developed for ddPCR data analysis, namely ‘definetherain’ [57], ‘ddpcRquant’ [58], ‘ddpcr’ [59], ‘twoddpcr’ [60], ‘ddPCRclust’ [61], and ‘ddPCRmulti’ [62]. According to Dobnik et al. [61], the data analysis of such multiplex assays becomes difficult and noisy because of the many possible target combinations and the cross-hybridization of probes in a single droplet. Brink et al. [61] report that, in the case of partially degraded DNA, multiplexing can also result in the disappearance and overlap of higher-order clusters. Alternatively, methylation-specific PCR (MSP) can be used to amplify the DNA of interest with methylation-specific PCR primer sets. MSP requires a small quantity of DNA and is sensitive to as little as 0.1% methylated alleles of a given CpG island. The MSP technique has been used to identify hypermethylated promoter regions associated with tumor suppressor genes.
With significant improvements in droplet digital PCR (ddPCR), droplet digital methylation-specific PCR (ddMCP) tools have also been established for early detection of cancer using cfDNA [63]. As methylation-specific PCR is qualitative, its sensitivity can only be tested via the ratio of methylated to unmethylated DNA. Such tests show a lack of agreement between dilution ratio and band intensity, with many scenarios exhibiting quite similar bands despite differing levels of DNA methylation [64]. MethyLight, MethylQuant, and HeavyMethyl are quantitative versions of MSP with enhanced performance in quantifying DNA methylation. As these methods can investigate the methylation level of only one or two CpGs, some sites remain unexplored, providing limited data for computational algorithms and downstream analysis [65]. Real-time PCR is an affordable and rapid method for nucleic acid amplification, and several different methods have been developed based on this technique. For instance, Allele-Specific amplification (AS-PCR), Peptide Nucleic Acid-Locked Nucleic Acid (PNA-LNA) PCR clamp, co-amplification at lower denaturation temperature (COLD-PCR), and Allele-Specific Non-Extendable Primer Blocker PCR (AS-NEPB-PCR) are some of the techniques that evolved from the real-time PCR approach. The main advantage of this method is that there is no need for post-PCR steps; hence the chances of cross-contamination are reduced, which is beneficial for diagnostic purposes [66]. Besides, MethyLight can be used with real-time PCR as a quantitative assay in which relative fluorescence units (RFUs) represent the methylation percentage. However, it cannot correctly analyze a heterogeneous sample, because the primers are designed to detect only specific, fully methylated patterns [67].
Although real-time PCR is among the most effective methods, the quality of its results can vary because of insufficient quality control steps, inappropriate use of reference genes and data normalization methods, and batch effects [68], [69]. In addition, for data normalization, the choice of reference genes, their stability, and their amplification efficiency also play a significant role during data analysis. Kuang et al. demonstrated that the use of unstable reference genes can create variation in the final output and proposed cDNA as an alternative for normalizing data [70]. Reference genes can be evaluated by applying statistical tests to Cq values or with the help of analytical methods such as NormFinder [71], BestKeeper [72], GeNorm [73], and RefFinder [74].
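
For the digital PCR methods discussed in this section, quantification conventionally rests on a Poisson correction: if a fraction p of partitions is positive, the mean number of target copies per partition is estimated as -ln(1 - p). The following is a minimal sketch of that calculation and of a methylated fraction derived from two channels. The function names and the droplet counts are hypothetical, and the sketch assumes equal partition volumes and independent loading of methylated and unmethylated targets; it is not the algorithm of any of the ddPCR packages cited above.

```python
import math


def copies_per_partition(positive, total):
    """Poisson-corrected mean copies per partition: -ln(1 - positive/total)."""
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total (saturated runs cannot be quantified)")
    return -math.log(1 - positive / total)


def methylated_fraction(pos_methylated, pos_unmethylated, total):
    """Fraction of methylated targets from two detection channels.

    Assumes both channels were read on the same `total` accepted partitions.
    """
    lam_m = copies_per_partition(pos_methylated, total)
    lam_u = copies_per_partition(pos_unmethylated, total)
    if lam_m + lam_u == 0:
        raise ValueError("no positive partitions in either channel")
    return lam_m / (lam_m + lam_u)


# Hypothetical run: 15000 accepted droplets, 30 methylated-positive,
# 4500 unmethylated-positive.
print(copies_per_partition(30, 15000))
print(methylated_fraction(30, 4500, 15000))
```

The correction matters most when many partitions are positive; at the low target abundance typical of tumor-derived cfDNA, the dominant uncertainty is instead the small absolute number of positive droplets.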

Next-generation sequencing

Although multiple studies have reported highly sensitive detection of ctDNA at different stages using ddPCR or BEAMing, the limited clinical applicability of PCR has led to the development of other assays based on next-generation sequencing (NGS) [75]. NGS has emerged as an excellent technique for high-throughput DNA sequencing and has revolutionized the analysis of clinical samples [3]. This technology has become a powerful tool for identifying biomarkers owing to its high sensitivity, specificity, and scalability. Since the single-base resolution of NGS allows accurate mapping of disease-specific regions, it has been applied to genome-wide profiling of plasma from various cancers [76], [77], [78]. The sensitivity and specificity of NGS analysis depend upon the type of platform used, such as deep sequencing, Tam-seq, Safe-SEQs, CAPP-Seq, MCTA-Seq, and FASTSeqS [79]. A study by Liang et al. demonstrated that a combination of deep methylation sequencing with machine learning can identify cancer more efficiently than ultradeep sequencing [80]. However, despite its appreciable performance, the random error rate of NGS technology, between 0.1% and 1%, creates a challenge for reliable detection of methylation and mutation profiles of non-hematopoietic origin in plasma cfDNA [81]. Moreover, repetitive sequences and indels (insertions and deletions) can contribute to sequence misalignment, influencing variant analysis. Data processing also relies on several other parameters, such as variant filtering, the nature of the NGS technology, VAFs (variant allelic frequencies), sequencing quality, and the bioinformatics pipeline. Hence, the routine clinical application of NGS workflows needs special precautions to ensure its reliability, especially in the case of dispersed, fragmented ctDNA within a background of normal cfDNA [82].
The large and complex NGS datasets obtained from repeated experiments create additional challenges for statisticians, who must decide assay-specific lower limits of detection in the absence of a standard pipeline. A further challenge is building a classification model for a dataset with many features and few samples without overfitting or bias [79].
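
To illustrate why a random error rate of 0.1% to 1% complicates detection of rare signals, the following is a minimal sketch that treats erroneous reads at one position as binomially distributed and asks how many supporting reads are needed before sequencing error alone becomes an unlikely explanation. The binomial model, the independence of errors, the function names, and the threshold `alpha` are assumptions made for illustration; real pipelines use more elaborate, position-specific error models.

```python
from math import comb


def prob_at_least(k, depth, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate)."""
    below = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1 - below


def min_supporting_reads(depth, error_rate, alpha=0.01):
    """Smallest read count k such that P(X >= k) <= alpha under errors alone."""
    for k in range(depth + 1):
        if prob_at_least(k, depth, error_rate) <= alpha:
            return k
    return depth + 1  # no count at this depth reaches the threshold


# Hypothetical comparison at 1000x depth for the two ends of the error range.
for rate in (0.001, 0.01):
    print(rate, min_supporting_reads(1000, rate))
```

Under these assumptions, a tenfold higher error rate raises the number of supporting reads required at a fixed depth, which directly limits the lowest tumor fraction that can be called with confidence.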

Methylation array

Before NGS became popular, HM450k (Illumina Infinium HumanMethylation450 BeadChip) was the preferred choice of investigators for studying cancer methylomes. The 450k array contains pre-designed probes for methylation sites that cover 96% of CpG islands, and the 850K array adds CpG sites in enhancer regions. Currently, plenty of HM450k datasets are available from The Cancer Genome Atlas (TCGA) [83] and Gene Expression Omnibus (GEO) [84], and they are being used for discovery and validation of biomarkers as well as for deconvolution-based analysis of cfDNA tissue of origin [4]. The main limitation of array-based methods is their inadequate genome-wide coverage, so that some essential methylated regions are missed [85]. In addition, the cost of the technique depends strongly on the amount of input data and the genome coverage, as well as on the assay expertise required for the experiment and the subsequent downstream computational analysis [86]. Other concerns associated with methylation arrays include excessive false positives, quality control of probes and samples, spurious cross-hybridization of probes, rescaling of probes, platform-specific background correction, and data normalization to reduce technical, experimental, and systematic variation [87]. Methylation arrays are also susceptible to experimental conditions and laboratory environments, leading to batch effects in data from different studies. Many batch correction algorithms can reduce the effect of known confounding factors, but since the true source of confounding is often unknown, even this task becomes non-trivial during statistical modelling of array-based cfDNA methylation profiles.
Moreover, several studies report that there exists a high correlation of methylation levels among the adjacent CpG loci; consequently, statistical analysis of array-based data with the notion of independence among each CpG methylation may be misleading [88].
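
The dependence among neighbouring CpG loci can be checked directly before per-probe tests are applied. The following is a minimal sketch that computes the Pearson correlation of methylation values between adjacent probes across samples. The probe names, coordinates, values, and the `max_gap` cutoff are hypothetical, and all probes are assumed to lie on the same chromosome.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def adjacent_correlations(values_by_probe, positions, max_gap=1000):
    """Correlate each probe with its next neighbour if within max_gap bp.

    values_by_probe: probe -> methylation values across the same samples.
    positions      : probe -> genomic coordinate on one chromosome.
    """
    ordered = sorted(positions, key=positions.get)
    results = []
    for left, right in zip(ordered, ordered[1:]):
        if positions[right] - positions[left] <= max_gap:
            r = pearson(values_by_probe[left], values_by_probe[right])
            results.append((left, right, r))
    return results


# Toy data: three hypothetical probes measured in three samples.
values = {"p1": [0.1, 0.5, 0.9], "p2": [0.2, 0.6, 1.0], "p3": [0.9, 0.1, 0.4]}
coords = {"p1": 100, "p2": 150, "p3": 5000}
print(adjacent_correlations(values, coords))
```

Where such correlations are high, region-level or smoothing-based models are a more defensible choice than tests that treat every CpG as an independent observation.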

Computational difficulties in cfDNA methylation data analysis

The basic workflow for computational analysis of cfDNA methylation data includes (i) read pre-processing and quality assessment, (ii) alignment and visualization, and (iii) statistical analysis and interpretation. Pre-processing ensures that the raw data are well structured and free of obvious bias. Different programs based on various algorithms have been developed to perform quality analysis, such as FastQC, NGS QC, QC-Chain, and ClinQC [89], [90]. Once the raw data have been assessed, low-quality bases and adapters can be removed by programs such as Trim Galore. Wild-card and three-letter approaches are the two types of algorithms used to align bisulfite sequencing data to the reference genome. While the wild-card algorithm (e.g., GSNAP, BSMAP) allows both Cs and Ts in reads to map to Cs in the reference genome, the three-letter algorithm (e.g., Bismark, BS-Seeker2, BRAT-BW) converts all Cs in the reference and the reads into Ts so that standard alignment tools can be applied [Table 1]. To inspect the global distribution of methylation profiles, data can be visualized with tools such as UCSC Genome Browser [91], DNMIVD [92], Methylation plotter [93], Integrative Genomics Viewer (IGV) [94], and Web Service for Bisulfite Sequencing Data Analysis (WBSA) [95]. For restriction enzyme-based and affinity enrichment-based methods (MRE-seq, MeDIP-seq), a relative read count is estimated. For bisulfite sequencing (WGBS and RRBS), however, the methylation level at individual cytosine residues is estimated. Many recent DNA methylation calling programs (e.g., RnBeads, MeDUSA, MEDME, Batman) use different statistical models to quantify DNA methylation [Table 2]. However, sequencing depth, which depends on the assay used, is a critical factor to consider before choosing among them.
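
The difference between the two alignment strategies described above can be sketched on a toy sequence. The code below is a minimal illustration using exact, ungapped matching on the forward strand only; it is not how any of the aligners in Table 1 is implemented, and the function names and sequences are hypothetical.

```python
def c_to_t(seq):
    """In-silico full conversion used by three-letter aligners (C -> T)."""
    return seq.upper().replace("C", "T")


def three_letter_hits(read, reference):
    """Start positions where the converted read matches the converted reference."""
    converted_read, converted_ref = c_to_t(read), c_to_t(reference)
    hits = []
    i = converted_ref.find(converted_read)
    while i != -1:
        hits.append(i)
        i = converted_ref.find(converted_read, i + 1)
    return hits


def wildcard_hits(read, reference):
    """Start positions where the read matches, letting a read T pair with a reference C."""
    read, reference = read.upper(), reference.upper()

    def compatible(read_base, ref_base):
        return read_base == ref_base or (read_base == "T" and ref_base == "C")

    return [
        i
        for i in range(len(reference) - len(read) + 1)
        if all(compatible(b, reference[i + j]) for j, b in enumerate(read))
    ]


reference = "AACGTCGGA"
read = "CGTTG"  # from position 2: first C methylated (kept), second converted to T
print(reference.find(read))            # a naive exact search misses the read
print(three_letter_hits(read, reference))
print(wildcard_hits(read, reference))
```

The three-letter scheme discards the C/T distinction during the search, so methylation has to be called afterwards from the original read, whereas the wild-card scheme keeps remaining Cs informative during matching.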

Tumour heterogeneity and dependency on markers

Inter- and intra-tumor heterogeneity, which arises from the morphological, genetic, epigenetic, and phenotypic diversity of cell populations, has been recognized for decades. Cellular heterogeneity is now among the primary causes of disease resistance and targeted therapy failure [96]. While studies based on whole-cell populations may represent the dynamics of the majority of cells, they may mask the role of critical sub-populations and hence the fundamental biology behind them. Such cellular heterogeneity also poses tough challenges for the diagnosis and treatment of disease in studies based on population-averaged measurements [97]. While tissue biopsies may capture only a part of this heterogeneity, liquid biopsies are more useful in such a scenario [98]. Tumor heterogeneity is also one of the leading causes of therapeutic resistance, treatment failure, and poor survival of cancer patients. Cancer diagnostics often depend on the presence of specific biomarkers. However, due to the dynamic nature of tumor cells, the predicted biomarkers are found at non-uniform levels, which impedes treatment of the disease [99]. The literature reports multiple cancers in which druggable targets are non-homogeneous, including gastric adenocarcinoma, lung adenocarcinoma, breast cancer, and melanoma. Consequently, applying biomarker-based targeted therapies to heterogeneous neoplasms leads to recurrence in the long run [100]. Many computational pipelines and algorithms are being developed to estimate cellular heterogeneity as a pre-processing step so that more meaningful insights can be obtained [101], [102], [103]. To analyze the consistency of some known literature-based cfDNA methylation biomarkers, we checked their expression in a set of 848 TCGA samples consisting of 96 normal and 752 breast cancer patients. 
The heterogeneity among the biomarkers was found to be large enough to hamper diagnostics and therapeutics. Besides the heterogeneity of the markers used for disease detection, confounding factors could be another source of variation. The box plot also shows that a single marker-based approach does not provide an acceptable level of sensitivity when applied to a classification model of 192 TCGA 450k methylation samples (96 normal, 96 breast cancer patients) [Fig. 2] (see supplementary material). Given the small amount of cfDNA produced, a single marker may not have enough power to distinguish the cancerous state from the non-cancerous state. However, the sensitivity can be increased by using a set of multiple markers.
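The sensitivity and false positive rate used in the single-marker evaluation above can be computed as in the following Python sketch. It is only an illustration and does not reproduce the LDA analysis of the TCGA data. For a single marker, a linear discriminant reduces to a cut-off on that marker's value, so a fixed threshold is used as a stand-in. The beta values and the threshold are hypothetical.

```python
def sensitivity_and_fpr(cancer_betas, normal_betas, threshold):
    """Call a sample cancer-positive when the marker's beta value exceeds the threshold.

    Returns (sensitivity, false positive rate).
    """
    true_positives = sum(beta > threshold for beta in cancer_betas)
    false_positives = sum(beta > threshold for beta in normal_betas)
    return true_positives / len(cancer_betas), false_positives / len(normal_betas)


# Hypothetical beta values for one hypermethylated marker. Because of tumor
# heterogeneity, only some of the cancer samples carry the methylation change.
cancer = [0.82, 0.15, 0.64, 0.22, 0.71, 0.18, 0.12, 0.58]
normal = [0.10, 0.14, 0.21, 0.09, 0.17, 0.25, 0.12, 0.11]

sensitivity, fpr = sensitivity_and_fpr(cancer, normal, threshold=0.3)
print(f"sensitivity = {sensitivity:.2f}, FPR = {fpr:.2f}")
```

With these made-up values the marker is specific but misses every cancer sample in which it is not hypermethylated, which is the pattern the box plot is meant to convey.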

Fig. 2

Performance of cfDNA methylation-based markers on TCGA data. The Illumina 450k methylation data-set for bulk tissue (breast cancer) was retrieved from the TCGA (The Cancer Genome Atlas) database and processed for manually curated literature-based markers. (a) Box plot of FPR (false positive rate) and sensitivity showing the performance of single markers for sample class prediction (breast cancer vs normal). Values for sensitivity and FPR were obtained by fitting LDA (Linear Discriminant Analysis) to the TCGA samples for one marker at a time and are presented as a box plot. The plot shows that a single marker-based approach for disease detection delivers low sensitivity. (b) Heatmap showing heterogeneity among biomarkers for the same cancer type. Marker-based normalised beta scores for all the TCGA observations were visualised as a heatmap for differential analysis of cancerous and non-cancerous observations. This figure demonstrates that such a level of heterogeneity among biomarkers can be one of the factors influencing disease diagnostics and therapeutics.

Multi-marker based detection: opportunities and obstacles

Although aberrant cfDNA methylation in cancer has been known for more than a decade, it has not yet been fully established as a diagnostic tool in clinical practice. A significant drawback of conventional biomarkers is that, most of the time, their utility is limited to metastatic and late-stage cancer [63]. Barault et al. showed that individual biomarkers have a relatively low prevalence in patients, which can be increased if they are used in combination [104]. Although each of these markers may be informative alone, a multiparametric approach could improve the power to discriminate between cancer patients and healthy individuals. Mouliere et al. studied the use of a multi-marker approach (Intplex) for cfDNA in colorectal cancer and found it to be sensitive, specific, and easy to implement. It was also shown to be suitable for repeated examination, which makes follow-up studies easier in the context of personalized medicine [105]. However, multi-marker panels have some weaknesses. The performance of markers varies with the population, test data, experimental assay, and analysis of the results; for these reasons, clinicians have less confidence in such biomarker panels. Also, studies that aim to prove the robustness of cfDNA markers are often retrospective and have inadequate sample size and statistical power. To avoid such shortcomings, comprehensive studies should follow the standard guidelines for reporting diagnostic accuracy [106].
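The trade-off discussed above, in which combining markers raises prevalence among patients but can also raise false positives, can be made concrete with a small Python sketch. The marker names, the binary calls, and the "positive if any marker is positive" rule are all hypothetical assumptions chosen for illustration; published panels may combine markers differently.

```python
def prevalence(calls):
    """Fraction of individuals that are positive for one marker."""
    return sum(calls) / len(calls)


def panel_positive_rate(marker_calls):
    """Fraction of individuals positive for at least one marker in the panel."""
    per_individual = zip(*marker_calls.values())
    flags = [any(individual) for individual in per_individual]
    return sum(flags) / len(flags)


# Hypothetical binary calls (1 = marker methylated) for six patients.
patients = {
    "marker_A": [1, 0, 0, 1, 0, 0],
    "marker_B": [0, 1, 0, 0, 0, 1],
    "marker_C": [1, 0, 1, 0, 0, 0],
}
# Hypothetical calls for four healthy controls.
controls = {
    "marker_A": [0, 0, 0, 1],
    "marker_B": [0, 0, 0, 0],
    "marker_C": [0, 1, 0, 0],
}

single_marker_prevalence = {name: prevalence(c) for name, c in patients.items()}
panel_sensitivity = panel_positive_rate(patients)
panel_fpr = panel_positive_rate(controls)
print(single_marker_prevalence, panel_sensitivity, panel_fpr)
```

In this toy example each marker alone is present in a third of the patients, while the panel covers five of six. The false positive rate of the panel, however, is higher than that of any single marker, which is one reason panel performance needs to be validated on independent cohorts.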

5hmC based detection: success and limitations

The human genome contains a large number of 5-hydroxymethylcytosine (5hmC) modifications, the oxidized form of 5-methylcytosine (5mC), which have been proposed as ideal markers of the chromatin activation state. Like 5mC, 5hmC modifications have been reported as crucial factors for understanding different types of cancer pathology and tissue-specific origin [45]. However, in contrast to 5mC, 5hmC-based profiles have been shown to be more stable and robust, which provides better specificity in distinguishing cancer patients from normal individuals. Besides, while 5mC is believed to have a repressive effect, 5hmC has a permissive effect on gene expression [107]. Also, since enhancers, promoters, and other regulatory elements are enriched with 5hmC, it is expected to correlate better with cellular gene expression [108]. 5hmC has recently been linked to many biological processes and disorders, including brain development, malignant melanoma, breast cancer, bladder cancer, and non-small cell lung cancer [108], [109], [110]. However, in comparison to the extensive cfDNA research on 5mC, 5hmC has yet to be thoroughly investigated in the realm of cancer diagnosis. Given the minute amount of cell-free DNA and the low abundance of 5hmC (10- to 100-fold less than 5mC), obtaining noise-free signals and the lack of highly sensitive DNA sequencers for 5hmC are among the challenges researchers face when using 5hmC as an epigenetic biomarker [107]. To evaluate the possibility of using markers for the 5hmC profile of cfDNA, we performed an analysis using data published by Song et al. for mostly advanced-stage cancer. In their study, Song et al. analysed read-counts on a large number of genes and did not report any classification based on a smaller number of markers. Therefore, we evaluated classification using the 5hmC profile of cfDNA with a reduced number of genomic loci as markers. 
Our results revealed that classification accuracy decreases with fewer markers, but it remained sufficient to group samples of similar phenotype together. Our analysis used the top 50 marker locations, ranked by the feature importance obtained from random forest-based classification on gene and CpG island read-counts (see supplementary material). Using the top 50 markers, it was possible to achieve good separability among different phenotypes in the 2D embedding plot (see Fig. 3). Applying density-based clustering (see supplementary material) to the 2D embedding of the top 50 markers yielded a clustering purity above 0.70 in terms of the NMI (Normalized Mutual Information) score (see Fig. 3). Thus, using 5hmC profiles at selected markers for detection could be feasible to some extent for advanced-stage cancer. As Song et al. generated 5hmC profiles from the cfDNA of patients with mid- or late-stage cancer, the sensitivity of 5hmC for detecting early cancer remains an open problem.
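For readers unfamiliar with the NMI score used above to summarise clustering purity, the following Python sketch computes it from two label vectors. The normalisation by the arithmetic mean of the two entropies is an assumption (other normalisations exist, and the one used in the analysis is not stated here), and the phenotype and cluster labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((count / n) * log(count / n) for count in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same samples."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )


def nmi(labels_a, labels_b):
    """Mutual information normalised by the mean of the two entropies (assumed choice)."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings put every sample in a single group
    return mutual_information(labels_a, labels_b) / ((h_a + h_b) / 2.0)


# Hypothetical phenotypes and cluster assignments for eight samples.
phenotype = ["liver", "liver", "liver", "lung", "lung", "lung", "healthy", "healthy"]
cluster = [0, 0, 1, 1, 1, 1, 2, 2]
score = nmi(phenotype, cluster)
print(f"NMI = {score:.2f}")
```

The score is 1 when the clusters reproduce the phenotypes exactly (up to relabeling) and 0 when cluster membership carries no information about phenotype.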

Fig. 3

Visualisation of the low-dimensional embedding of 5hmC profiles of cell-free DNA from patients with different types of cancer. The 2D embedding (using tSNE) of the 5hmC profile is shown for samples using read-counts on either genes or CpG islands. The embedding results are shown for read-counts on all genes or on only 50 selected genes. Similarly, embedding results using all CpG islands or only 50 selected CpG islands are also shown. The 5hmC profiles used here were published by Song et al. 2017. Using a large number of genomic loci (all genes or all CpG islands) can provide good separability among samples according to the type of cancer. With the 5hmC profile, the top 50 marker CpG islands provide slightly better separability among different pathological conditions than the top 50 genes. The purity of clustering after embedding using the top 50 markers (genes or CpG islands) is also shown in terms of Normalized Mutual Information (NMI).

Deconvolution: pros and cons

Considering the high level of heterogeneity among tissues, reports suggest the use of tissue-specific biomarkers. For plasma DNA-based testing as well, tissue-specific markers are found to be more consistent [111]. One of the methods commonly used to map the tissue of origin of a tumor from cfDNA is deconvolution, which recovers the original signals from a mixture of signals. Deconvolution algorithms are of two kinds: reference-based and reference-free. Reference-based deconvolution algorithms are supervised methods that utilize cell-type-specific differentially methylated regions (DMRs). Reference-free algorithms, on the other hand, do not need cell-type-specific DMRs as a reference but estimate cellular proportions using unsupervised deconvolution approaches [112]. One of the earliest and most widely used reference-based algorithms is constrained projection (CP), also known as quadratic programming (QP), which operates through least-squares minimization. Reference-free approaches include frameworks such as removing unwanted variation (RUV) and non-negative matrix factorization (NMF) [113]. Recently, many more reference-based (EpiDISH, CIBERSORT) and reference-free (CellMix, CDSeq, TOAST, RefFreeEWAS, EWASher, SVA) approaches for cfDNA deconvolution have emerged [114], [115], [116], [117], [118], [119], [120]. Studies show that incorporating tissue proportion factors increases disease prediction accuracy and yields more interpretable biological output. According to Moss et al., using only a defined set of significant CpG sites in deconvolution gives greater resolution and less noise than using the entire methylome, even with a low amount of DNA [4]. Most reference-based deconvolution methods suffer from two main limitations. First, they often need a prior guess about the organ from which DNA could be found in plasma. 
With a correct guess of the organ, however, the estimated proportions contributed by different cell types are reasonably satisfactory. The second limitation of reference-based deconvolution is the difference in technical batch effects between the reference cell methylome profiles and the cfDNA methylation profile. In actual practice, the prediction of cellular proportions can be further complicated by biological or technical artifacts. Hence, there is a need for computational methods that can accurately project the information into a lower-dimensional space without being influenced by a reference methylation panel [1], [111]. To analyse the data separability of reference-free deconvolution methods, we applied three commonly used approaches, RefFreeEWAS [119], ReFACTor [121], and SVA [122], to 450k methylation profiles from prostate cancer and normal samples of TCGA (100 samples) and cfDNA (28 samples). In the current study, a comparison of the deconvolution techniques on 100 randomly selected CpG sites showed that the performance of a specific approach depends partially on the dataset itself; for example, in TCGA samples, RefFreeEWAS separated the classes better than the other methods, whereas in the cfDNA dataset RefFreeEWAS and ReFACTor showed similar separation [Fig. 4] (see supplementary material). Other limitations include batch effects, small datasets, and unaccounted covariates related to CpG island methylation.
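The reference-based, least-squares formulation described above (CP/QP) can be sketched in a few lines of Python. This is not the implementation of any named tool. A simple projected gradient descent is used in place of a quadratic programming solver, and the reference matrix, cell-type order, and mixing proportions are hypothetical. The sketch minimises the squared error between the observed cfDNA methylation values and the reference-weighted mixture, subject to the proportions being non-negative and summing to one.

```python
def project_to_simplex(values):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1}."""
    ordered = sorted(values, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in values]


def deconvolve(reference, mixture, n_iter=2000):
    """Estimate cell-type proportions by constrained least squares.

    reference: rows are marker CpGs, columns are cell types (beta values).
    mixture:   observed beta value of each marker in the cfDNA sample.
    """
    n_types = len(reference[0])
    # Conservative step size: the squared Frobenius norm bounds the largest
    # eigenvalue of reference^T reference.
    step = 1.0 / (2.0 * sum(x * x for row in reference for x in row))
    proportions = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(r * p for r, p in zip(row, proportions)) - observed
            for row, observed in zip(reference, mixture)
        ]
        gradient = [
            2.0 * sum(row[j] * res for row, res in zip(reference, residual))
            for j in range(n_types)
        ]
        proportions = project_to_simplex(
            [p - step * g for p, g in zip(proportions, gradient)]
        )
    return proportions


# Hypothetical reference methylomes for three cell types at six marker CpGs
# (columns: blood cells, hepatocytes, tumor tissue).
reference = [
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.2, 0.8],
]
true_proportions = [0.7, 0.2, 0.1]
# Noise-free synthetic cfDNA profile generated from the reference.
mixture = [sum(r * p for r, p in zip(row, true_proportions)) for row in reference]

estimate = deconvolve(reference, mixture)
```

On this noise-free synthetic mixture the estimate converges to the proportions used to generate it. With real data, batch differences between the reference methylomes and the cfDNA profile would bias the estimate, which is the second limitation noted above.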

Fig. 4

Applying different deconvolution techniques on the DNA methylation profiles of cancer and normal samples. Reference-free deconvolution methods such as RefFreeEWAS, ReFACTor and SVA were applied to DNA methylation profiles and projected as tSNE coordinates to analyze sample separability. (a) Here DNA methylation profiles available in the TCGA portal for solid tissue from prostate cancer were used. (b) Deconvolution methods were applied to DNA-methylation profiles of cell-free DNA (cfDNA) extracted from the plasma of individuals with normal and prostate cancer pathotypes from CFEA. The comparative analysis is based on 100 randomly selected CpG sites of the samples.

Machine learning based approaches: strengths and weaknesses

With computational advancements in the field of liquid biopsy, the role of machine learning in diagnostics and therapeutics seems quite promising. Recently, a few studies have applied machine learning approaches to cfDNA methylation analysis [123], [124], [1], [125], [126]. Machine learning techniques can be applied using whole-genome features or selected marker scores, with or without deconvolution. For example, Shu et al. used MeDIP-seq profiles and first identified the top 300 DMRs between patients and non-patients before applying a binomial generalized linear model [123]. On the other hand, Feng et al. applied machine learning in three scenarios: 1) using markers only, 2) after NMF-based reference-free deconvolution, and 3) after reference-based tissue proportion estimation using QP. With WGBS profiles from cfDNA (liver cancer and normal), Feng et al. achieved higher accuracy by training the machine learning model after reference-based proportion estimation (accuracy = 0.79) than after reference-free deconvolution (accuracy = 0.7) or with the marker signal used directly (accuracy = 0.75) [1]. It is not trivial to judge the usefulness of high classification accuracies reported by previous studies on smaller data sets. Given a large data set, machine learning algorithms may learn disease-related patterns directly from a patient’s whole-genome or targeted-site (multi-marker) signal. For cfDNA methylation-based prediction, however, machine learning techniques have their own limitations, such as the requirement of a large number of training samples, bias in classification due to imbalance in the training data-set, and batch effects [11]. Especially for cfDNA methylation data-sets, where the relevant signal is overwhelmed by the epigenetic signature of blood cells, suppressing the batch effect for correct prediction in the target sample is very challenging. 
This is reflected in the performance of the classifier developed by the CCGA consortium [127] for detecting 50 types of cancer, which used large training (1654 cancer + 1375 normal) and validation (703 cancer + 605 normal) sets. With such a large training set, the classifier used by the CCGA consortium could achieve an average sensitivity of 44.2% for cancer stages I, II and III [127]. Even for 12 predefined high-signal cancer types, the CCGA consortium could achieve a sensitivity of only 39% for stage I samples. Such results highlight the limitation caused by the low concentration of cfDNA of non-hematopoietic origin and by heterogeneity among patients [127].
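The effect of a low tumor-derived cfDNA fraction on detection can be illustrated with a simple binomial calculation in Python. This is a sketch under strong assumptions and not a model used by any of the cited studies. Each fragment covering a marker is assumed to be tumor-derived independently with probability equal to the tumor fraction, and sequencing and conversion errors are ignored. The depth, tumor fractions, and minimum read count are hypothetical.

```python
from math import comb


def detection_probability(depth, tumor_fraction, min_reads):
    """P(at least min_reads of `depth` fragments at a marker are tumor-derived).

    Assumes the number of tumor-derived fragments is Binomial(depth, tumor_fraction).
    """
    p_fewer = sum(
        comb(depth, k) * tumor_fraction**k * (1.0 - tumor_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical tumor fractions, from very early to more advanced disease,
# at a hypothetical depth of 1000 fragments and a 3-read calling rule.
for fraction in (0.0001, 0.001, 0.01):
    probability = detection_probability(depth=1000, tumor_fraction=fraction, min_reads=3)
    print(f"tumor fraction {fraction:.4f}: detection probability {probability:.4f}")
```

Under these assumptions the detection probability at a single marker falls sharply as the tumor fraction decreases, which is consistent with the lower sensitivity reported for early-stage samples and is one motivation for aggregating evidence over many markers.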

Discussion

Here we have described the strengths and weaknesses of several procedures involved in detecting cancer using cfDNA methylation. By analyzing existing DNA methylation profiles from tumor samples and cfDNA, we showed the limitations of using individual markers due to cancer heterogeneity. However, there is yet another kind of bias that adds to the computational challenge. Biases specific to each method of detecting DNA methylation reduce the significance of specific markers. For example, many markers detected using the HM450k methylation array might be completely undetectable by RRBS-based cfDNA methylation profiling. Therefore, despite the availability of a few data-sets of cfDNA methylation profiles from cancer patients, it is not trivial to finalize markers for any cancer type that could be used globally with multiple cfDNA methylation profiling techniques. In other fields of genomics, such as single-cell expression profile analysis, there have been a few attempts to perform integrative analysis irrespective of the bias of the platform and protocol used. However, such attempts have rarely been made to solve the computational problem of integrative analysis of cfDNA methylation profiles. The reason could be that single-cell expression profiles are not mixtures of unknown cell types, whereas cfDNA methylation profiles carry mixed signals from several cell types. The approach used by different clinical trials, in which machine-learning models are trained on one data-set and validated on another, is often called transfer learning. There has been substantial development in making transfer learning more adaptive [128] to new data-sets to avoid the batch effect. However, adaptive transfer learning often needs a small number of samples from the target data to adjust itself. There could also be day-to-day variation in the profiling of cfDNA methylation, even for the same patient. 
Hence, it remains to be seen how adaptive transfer learning can be used to identify the tissue of origin using cfDNA methylation, irrespective of batch effects and variation in signal dilution by blood cells. Even though a few clinical trials have reported good accuracy for detecting late-stage cancer, detection of early-stage cancer is still a challenge [30], [125], [63]. The low accuracy of early cancer detection reduces the utility of liquid biopsy, as advanced-stage tumors are often non-treatable. Hence, there is still a demand for novel computational approaches to improve early-stage cancer detection using cfDNA methylation profiles.

Funding information

This work was supported by the Department of Biotechnology and the Indian Council of Medical Research (ICMR).

Availability of data and materials

The datasets used for analysis in the current study can be found at The Cancer Genome Atlas (TCGA) https://portal.gdc.cancer.gov/ and Cell Free Epigenome Atlas (CFEA) http://www.bio-data.cn/CFEA/ repositories.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
