MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment.

Shuying Sun, Aaron Noviski, Xiaoqing Yu.

Abstract

BACKGROUND: DNA methylation is an epigenetic event that adds a methyl group to the 5-carbon of cytosine. This epigenetic modification can significantly affect gene expression in both normal and diseased cells. Hence, it is important to study methylation signals at the single cytosine site level, which is now possible using the bisulfite conversion technique (i.e., converting unmethylated Cs to Us, which are then read as Ts after PCR amplification) and next generation sequencing (NGS) technologies. Despite the advances of NGS technologies, certain quality issues remain. Some of the more prevalent ones involve low per-base sequencing quality at the 3' end, PCR amplification bias, and incomplete bisulfite conversion. Therefore, it is important to conduct quality assessment before downstream analysis. To the best of our knowledge, no existing software package can generally assess the quality of methylation sequencing data generated by different bisulfite-treatment protocols.
RESULTS: To conduct the quality assessment of bisulfite methylation sequencing data, we have developed a pipeline named MethyQA. MethyQA combines currently available open-source software packages with our own custom programs written in Perl and R. The pipeline can provide quality assessment results for tens of millions of reads in under an hour. The novelty of our pipeline lies in its examination of bisulfite conversion rates and of the DNA sequence structure of regions that have different conversion rates or coverage.
CONCLUSIONS: MethyQA is a new software package that provides users with a unique insight into the methylation sequencing data they are researching. It allows the users to determine the quality of their data and better prepares them to address the research questions that lie ahead. Due to the speed and efficiency at which MethyQA operates, it will become an important tool for studies dealing with bisulfite methylation sequencing data.

Year: 2013 PMID: 23968174 PMCID: PMC3765750 DOI: 10.1186/1471-2105-14-259

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

In a mammalian genome, DNA methylation is an epigenetic event that adds a methyl group (CH3) to the 5-carbon of cytosines that are followed by guanines (i.e., CpG sites, where “p” stands for phosphate). This epigenetic modification plays an important role in cancerous cells. In fact, DNA methylation is one of the most common molecular changes in several cancers, including breast, ovarian, and colon cancers [1-15]. DNA methylation can silence important tumor suppressor genes such as p16, ER, and PR [9]. It often occurs at an early stage of tumor development and can be easily detected in a small amount of DNA [16,17]. Thus it can be used as an early marker of cancerous cells. Furthermore, because methylation is reversible (i.e., it can be removed by demethylation), it is a possible target for therapeutic demethylating drugs. For these reasons, identifying cancer methylation patterns has become an extremely important topic in cancer epidemiology. There are different types of cancer methylation patterns, such as differential methylation and partial methylation, which play significant roles in tumor development and growth [18-20]. To identify these patterns, it is critically important to obtain methylation signals at the single CpG site level. The bisulfite-treatment technique (i.e., converting unmethylated C to T), combined with advanced high-throughput sequencing technologies, now makes this possible. Over the last several years, a few leading research groups have successfully generated bisulfite-treated methylation sequencing data [21-27].

These data are extremely large. For example, the methylation sequencing data of one study may occupy gigabytes or even terabytes of hard-drive space, depending on the coverage, the size of the sequenced regions, and the number of samples. Such large sequencing datasets have different quality issues, and they are challenging to preprocess and analyze. For example, in some experiments the 3’ ends of reads have dramatically low quality, some reads have many Ns at the 5’ and 3’ ends, some k-mer sequences are unexpectedly over-represented, and some datasets have a large number of duplicated reads. Although several tools have been successfully developed to align bisulfite-treated reads and call methylation signals [21-23,28-32], few packages have been developed for the quality assessment of bisulfite sequencing, except the recent SAAP-RRBS pipeline [33]. SAAP-RRBS consists of four modules: reads assessment and clean-up, alignment, CpG site methylation extraction, and annotation for CpG sites. It is a useful tool designed for the Reduced Representation Bisulfite Sequencing (RRBS) protocol [34], but not for whole genome sequencing or any other bisulfite-treated protocol. Although in theory the workflow can be easily extended to analyze whole genome sequencing data, in practice this can be challenging due to the alignment speed. Furthermore, it cannot compare the DNA sequence structure of different regions, as our new program does. Therefore, there is still a need for a quality assessment tool for bisulfite-treated methylation sequencing data.

Bisulfite-treated DNA methylation sequencing has its own characteristics that may lead to different quality issues. For example, bisulfite treatment damages DNA, resulting in fragmentation of long molecules [35]. Furthermore, bisulfite treatment may not be complete, and incomplete bisulfite conversion will affect methylation signal/ratio estimates. In addition, methylation in mammalian DNA generally occurs at CpG sites, which are often found in CpG islands, that is, regions with high GC content that are likely to be repetitive. The high GC content and the repetitive regions tend to affect DNA sequencing, and after sequencing the distribution of A, C, G, and T in a bisulfite-treated genome (or target regions) is dramatically shifted because unmethylated C is converted to T. Any or all of these factors may affect the sequencing quality and results. It is critical to develop an efficient quality assessment package for bisulfite sequencing data generated by different protocols to assist the accurate identification of methylation patterns. To meet this urgent need, we have developed a pipeline that incorporates both the currently available quality assessment programs and our new program with novel features.
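To make the base-composition shift concrete, the following minimal Python sketch simulates idealized (complete) bisulfite conversion on a toy forward-strand sequence. The function name, the toy sequence, and the choice of which CpG cytosines are methylated are all hypothetical illustrations and are not part of MethyQA.

```python
from collections import Counter

def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Return the read-out of `seq` after idealized bisulfite treatment and PCR.

    Every unmethylated C is read as T; a C whose 0-based index appears in
    `methylated_positions` is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)

original = "ACGTCCGATCCA"
# Toy assumption: the two CpG cytosines (indices 1 and 5) are methylated.
converted = bisulfite_convert(original, methylated_positions={1, 5})

# Base counts before and after conversion show the shift from C to T.
before, after = Counter(original), Counter(converted)
print(converted, dict(before), dict(after))
```

In this toy case only the two protected CpG cytosines remain as C, which is why base-composition diagnostics from general-purpose tools look unusual on bisulfite data.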

Implementation

The workflow of our pipeline

The workflow of our pipeline (see Figure 1) is explained below; Steps 4 and 5 are our new features.

Figure 1

The workflow of our pipeline MethyQA.

Step 1: Assess sequencing qualities using FastQC [36]. FastQC assesses sequencing qualities, sequence content, GC content, per-base N content, duplication levels, and so on. Though FastQC is not designed for bisulfite-treated methylation sequencing data (for example, it cannot assess bisulfite conversion rates), it still produces very informative diagnostic plots.

Step 2: Trim sequencing data. Quite often, sequencing quality is very low at the 3’ end in Illumina data. Untrimmed reads with low-quality ends may fail to align. It is therefore necessary to include a trimming step, and trimming off low-quality bases can improve NGS alignment results [37]. In our pipeline, two trimming options are provided: dynamic trimming (i.e., trimming based on quality scores using the trim function of the BRAT package [30]) and trimming off a fixed number of bases. In addition, adaptor trimming [38,39] is also included as an option.

Step 3: Align sequencing data using BRAT and obtain methylation ratios at all cytosine sites. After trimming, BRAT [30] is used as the default alignment tool. After alignment, the pipeline generates the methylation ratio file using the ACGT-count function of the BRAT package. The output files are methylation ratios of all cytosines on both forward and reverse strands for each sample. We choose BRAT as the default alignment tool for the following reasons. BRAT is an efficient tool for mapping FASTQ format short reads by building a hash table for the reference genome. It is a very user-friendly software package and produces comparable results. Compared to other alignment tools [29,31,32,40-43], BRAT has several superior features. First, it uses relatively less memory [43]. Second, it does not limit read length. Third, it can align both single-end and paired-end reads. Fourth, it can account for overlapping paired-end reads. Fifth, it can check DNA strands. Sixth, it provides a function to convert alignment output to SAM format. Finally, its ACGT-count function is very convenient in that it produces the methylation ratios for all cytosines in a genome, which reduces the time and effort users need to parse the large alignment output files.

Step 4: Examine bisulfite rates using nonCGc sites. For mammalian cells, the nonCGc sites (i.e., the cytosines that are not in a CpG pair) are highly likely to be unmethylated, so we choose to examine bisulfite conversion rates using these nonCGc sites. In this step, our pipeline can examine bisulfite rates at both the chromosome level and the target-region level (if target regions are provided). For the chromosome level analysis, our pipeline studies the distribution of all nonCGc sites using histograms and summary tables. For the target region analysis, instead of studying the bisulfite conversion of each individual nonCGc site, we summarize all the nonCGc sites within a target region. In particular, our pipeline calculates the mean and median bisulfite rates of all nonCGc sites with coverage for each region. It then generates summary tables and plots histograms for the statistical summary of all target regions.

If a dataset has very high bisulfite conversion rates (i.e., > 0.999) as shown in the summary tables and plots, the user can proceed to downstream analysis. However, if the results of this step show that the dataset has very low bisulfite conversion rates, the user should proceed with caution. For example, if there is a large percentage (e.g., >30%) of nonCGc sites with very low bisulfite conversion rates, the users may have to further investigate their sequencing experiments to understand the bisulfite conversion problem, or even discard the data. If there is a small proportion (e.g., <5%) of nonCGc sites with low bisulfite conversion rates, the user may split all nonCGc sites into two groups: (A) nonCGc sites with high bisulfite rates (e.g., >= 0.99) and (B) nonCGc sites with low bisulfite rates (e.g., < 0.99). The user may then use only the CpG sites near the nonCGc sites in group (A) for downstream analysis.

Step 5: Compare sequence structures of different regions. It is important to be aware that many factors can affect the quality of sequencing, and genomic regions may respond differently to these factors. For example, some regions have low bisulfite conversion, while other regions do not; some regions have low coverage, while other regions have high coverage. It is unclear how these differences are related to DNA sequence structure (e.g., GC content and repetitive regions). In order to interpret a sequencing experiment, it is necessary to know which regions have high or low coverage. In this step, our pipeline takes user-provided target regions as an input file. The target regions can be a list of genes with start and end positions, a list of chromosome regions, or a list of CpG islands in which the user is interested. The regions with high and low metrics (i.e., coverage and bisulfite conversion) are defined as follows, where, for a given target region, N is the number of nonCGc sites and n is the number of nonCGc sites with coverage:

1) High bisulfite conversion region: if the median bisulfite conversion rate of all nonCGc sites in a target region is greater than or equal to B, this region is selected as a high bisulfite conversion region.
2) Low bisulfite conversion region: if the median bisulfite conversion rate of all nonCGc sites in a target region is less than or equal to b, this region is selected as a low bisulfite conversion region.
3) High coverage region: if n/N >= L, the region is selected as a high coverage region.
4) Low coverage region: if n/N < l, the region is selected as a low coverage region.
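The four region definitions above can be sketched in Python as follows. This is an illustrative reading of the definitions, not the authors' Perl/R implementation; the function name and return structure are hypothetical, and the default cutoffs are the defaults listed for the -B, -b, -L, and -l options in Table 2.

```python
from statistics import median

def classify_region(covered_rates, n_sites, B=0.99, b=0.6, L=0.5, l=0.1):
    """Label one target region by bisulfite conversion and by coverage.

    covered_rates: conversion rates of the nonCGc sites that have coverage
                   (so n = len(covered_rates)).
    n_sites:       N, the total number of nonCGc sites in the region.
    Returns a dict whose values are "high", "low", or None (neither).
    """
    n = len(covered_rates)
    labels = {"bisulfite": None, "coverage": None}

    if n > 0:
        med = median(covered_rates)
        if med >= B:
            labels["bisulfite"] = "high"
        elif med <= b:
            labels["bisulfite"] = "low"

    if n_sites > 0:
        fraction = n / n_sites
        if fraction >= L:
            labels["coverage"] = "high"
        elif fraction < l:
            labels["coverage"] = "low"

    return labels

well_covered = classify_region([1.0, 0.995, 0.99], n_sites=4)
poorly_covered = classify_region([0.5, 0.6], n_sites=40)
in_between = classify_region([0.9], n_sites=5)
```

Regions whose median rate falls strictly between b and B, or whose covered fraction falls in [l, L), receive no label under these definitions, which is why the sketch returns None for them.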
For the high and low metric (i.e., coverage and bisulfite conversion) regions defined above, we recommend that users first check the number of target regions in each group. If only a small number of regions (e.g., fewer than 10 target regions, or less than 0.5% of the total target regions) have low metric status, there is probably no serious coverage or bisulfite conversion issue. In that case the sample is probably very well sequenced, and it is not necessary to compare the DNA sequence structure of high and low metric regions. If a large number of regions have low metric status, we recommend that users investigate further.

To investigate whether coverage differences and bisulfite conversion problems are due to DNA sequence structure, our pipeline produces the regions with low or high metrics as defined above, and then compares the DNA sequence structure of the different regions. In particular, our pipeline generates plots for the percentage of A, C, G, T, C+G, CGc, nonCGc, and repetitive bases (i.e., “%low_count” provided by the UCSC genome browser) for these different regions. Generally speaking, if the coverage differences (or bisulfite conversion problems) are not associated with DNA sequence structure, we will not see any dramatic differences when comparing the percentage of A, C, G, T, C+G, CGc, nonCGc, and repetitive bases for high and low coverage regions (or high and low bisulfite conversion regions). However, dramatic differences in the comparison plots may provide some insight into the sequencing experiments. For example, if the high coverage regions tend to have much lower percentages of GC content (or nonCGc) and higher percentages of As or Ts, while low coverage regions tend to have the reverse pattern, this may indicate a bisulfite conversion problem. A likely explanation is that bisulfite conversion may damage DNA fragments, leaving them broken and unable to be sequenced. In addition, if the high and low coverage regions correspond to low and high “%low_count” (i.e., repetitive regions) respectively, this may indicate that the repetitive regions are not well sequenced. In the user manual (see the Additional file 1), we have provided different examples to illustrate our pipeline in more detail.

The above are the five steps of the complete pipeline MethyQA. If users are familiar with alignment and have obtained the methylation ratios using either the BRAT ACGT-count program or some other alignment tool, they can skip Steps 1 to 3 and use only the partial pipeline provided in our package (named partial.MethyQA) to carry out the quality assessment in Steps 4 and 5. The BRAT methylation ratio output contains the following basic and standard information for each cytosine site: chromosome, position, cytosine type (i.e., CG, CHH, and CHG), total coverage, and methylation ratio. If users have used another bisulfite sequencing alignment tool, then as long as its output contains the above basic information, the output can be easily converted to the BRAT methylation ratio output format by switching the order of columns and then passed to our partial.MethyQA pipeline.
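As a rough illustration of the target-region summary in Step 4, the sketch below computes the mean and median bisulfite conversion rate of covered nonCGc sites per region. It assumes each site is available as a tuple (chromosome, position, cytosine type, total coverage, methylation ratio), mirroring the fields listed above; the tuple order, the inclusive region boundaries, the function name, and the toy data are assumptions for illustration only, and the conversion rate is taken as 1 minus the methylation ratio.

```python
from statistics import mean, median

NON_CPG_TYPES = ("CHH", "CHG")  # cytosine types that are not in a CpG pair

def region_conversion_summary(sites, regions):
    """Summarize nonCGc bisulfite conversion rates for each target region.

    sites:   iterable of (chrom, pos, ctype, coverage, methylation_ratio)
    regions: iterable of (chrom, start, end), boundaries treated as inclusive
    """
    sites = list(sites)
    summary = {}
    for chrom, start, end in regions:
        in_region = [s for s in sites
                     if s[0] == chrom and start <= s[1] <= end
                     and s[2] in NON_CPG_TYPES]
        # Conversion rate = 1 - methylation ratio, for covered sites only.
        rates = [1.0 - s[4] for s in in_region if s[3] > 0]
        summary[(chrom, start, end)] = {
            "N": len(in_region),   # all nonCGc sites in the region
            "n": len(rates),       # nonCGc sites with coverage
            "mean": mean(rates) if rates else None,
            "median": median(rates) if rates else None,
        }
    return summary

toy_sites = [
    ("chr1", 100, "CHH", 10, 0.0),
    ("chr1", 105, "CG", 8, 0.9),    # CpG site, ignored for conversion rates
    ("chr1", 110, "CHG", 4, 0.25),
    ("chr1", 120, "CHH", 0, 0.0),   # no coverage
    ("chr1", 500, "CHH", 5, 0.5),
]
toy_regions = [("chr1", 90, 130), ("chr1", 400, 600)]
result = region_conversion_summary(toy_sites, toy_regions)
```

Keeping both N and n in the output means the same per-region record can also be used to compute the covered fraction n/N.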

Input and output

Our pipeline uses the raw FASTQ file as input in Step 1 and Step 2. In Steps 3, 4 and 5, the input files are the output files from the previous step. If the user is interested in studying specific target regions in Steps 4 and 5, a target file with three columns including chromosome, start and end positions for each region is required. As for the output, see Table 1 for a list of the main output files in each step of the MethyQA pipeline. In addition, the output files for Steps 1, 2 and 3 are well described in the FastQC and BRAT documentation files and details can be found there. More details about the input and output files are provided in the MethyQA user manual (see the Additional file 1).
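A minimal Python sketch of reading such a three-column target file is shown below. The whitespace delimiter, the skipping of blank lines and lines starting with "#", and the function name are assumptions made for illustration; the MethyQA user manual defines the actual file format.

```python
import io

def read_targets(handle):
    """Parse target regions as (chromosome, start, end) tuples.

    Assumes one region per line with whitespace-separated columns.
    """
    targets = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start > end:
            raise ValueError(f"line {line_no}: start {start} is after end {end}")
        targets.append((chrom, start, end))
    return targets

# Toy in-memory file with two hypothetical regions on chr1.
example = io.StringIO("chr1\t1000\t1200\nchr1\t5000\t5180\n")
targets = read_targets(example)
```

Validating the column count and coordinate order up front gives a clear error before any long-running step starts.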

Table 1

The main output files in each step of the MethyQA pipeline

File name	Pipeline step	Description
SampleName_fastqc, SampleName_fastqc.zip	Step 1	One folder and one zip file that save the output of quality assessment using FastQC.
fastx.trim.fastq, cutadapt.trim.fastq	Step 2	Fastx or cutadapt output (one line per read) if adapter trimming is used.
*_reads1.txt	Step 2	BRAT trimming output (one line per read) if dynamic trimming using BRAT trim is done.
fixedTrim_BRATout	Step 2	Fixed length trimming output (one line per read) if “fixed-length” trimming is used.
alignment.brat	Step 3	BRAT alignment output (one line per read).
*_forw.txt	Step 3	BRAT ACGT-count (i.e., methylation ratio) output file (one line per cytosine position).
*chrN.summary.table.txt	Step 4	Chromosome level summary table for bisulfite conversion rates.
*chrN.BS.ps	Step 4	Chromosome level plot for bisulfite conversion rates.
*chrN.target.summary.table.txt	Step 4	Target region level summary table for the mean and median of bisulfite conversion rates.
*chrN.mean.median.ps	Step 4	Target region level plot for the mean and median of bisulfite conversion rates.
*chrN.seq.bisulfite.boxplot.ps	Step 5	Plots for comparing the DNA sequence structure for regions with high and low bisulfite conversion rates.
*chrN.highBS.seq, *chrN.lowBS.seq, *chrN.highBS.target, *chrN.lowBS.target	Step 5	Target regions with high or low bisulfite conversion rates (seq files include all basic DNA sequence statistics, and target files include the summary of nonCGc bisulfite conversion rates).
*chrN.seq.coverage.boxplot.ps	Step 5	Plots for comparing the DNA sequence structure for regions with high and low sequencing coverage.
*chrN.highCoverage.seq, *chrN.lowCoverage.seq, *chrN.highCoverage.target, *chrN.lowCoverage.target	Step 5	Target regions with high or low sequencing coverage (seq files include all basic DNA sequence statistics, and target files include the summary of nonCGc bisulfite conversion rates).

“*” means the prefix provided by the user while running MethyQA. In some file names, “chrN” means a specific chromosome that the user investigates.

Usage and command options

Our pipeline is written in Perl and R. It can be run as shown below under a Linux or Unix environment. The usage of the complete pipeline MethyQA is:

perl MethyQA.pl -i -t -c -p -d -R -r [OPTIONS]

The command options of MethyQA are explained in Table 2.

Table 2

The command options of MethyQA

[-i <file>]	FASTQ input file
[-t <file>]	Target input file (i.e., a list of target regions specified for analysis); use “F” to skip target analysis
[-d <dir>]	Path to MethyQA directory (e.g., /home/user/downloads/MethyQA/)
[-c <string>]	Chromosome number (e.g., chr1, chr2, chr17, chrX, chrY, etc.)
[-p <string>]	Prefix (i.e., the prefix written to the output file names)
[-R <dir>]	Reference directory (i.e., the directory with the genome reference files)
[-r <file>]	Reference name (i.e., the file name of the reference that the user will use)
[-f <string>]	FASTQ format (i.e., “sanger” or “illumina”)
[-a <string>]	Adapter trimming. (1) “no”: no adapter trimming (default); (2) “fastx”: fastx adapter trimming; (3) “cutadapt”: cutadapt adapter trimming. If cutadapt is set, the “-Y” option should be specified in the command line
[-A <string>]	Adapter sequence (the default is the Illumina adapter sequence: AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG)
[-T <string>]	Quality trim flag. (1) “no”: no quality trimming; (2) “brat”: brat dynamic trimming (default); (3) “fix”: fixed quality trimming
[-N <int>]	For fixed quality trimming (users specify the number of bases to be trimmed at the 5' end, default is 5)
[-n <int>]	For fixed quality trimming (users specify the number of bases to be trimmed at the 3' end, default is 10)
[-B <real>]	Cutoff value for selecting high bisulfite conversion regions (Range: [0, 1], default B=0.99)
[-b <real>]	Cutoff value for selecting low bisulfite conversion regions (Range: [0, 1], default b=0.6)
[-L <real>]	Cutoff value for selecting high coverage region (Range: [0, 1], default L=0.5)
[-l <real>]	Cutoff value for selecting low coverage region (Range: [0, 1], default l=0.1)
[-u <logic>]	Bisulfite flag (it is an option to initiate boxplot of high vs. low bisulfite rates, either ‘TRUE’ (default) or ‘FALSE’)
[-v <logic>]	Coverage flag (it is an option to initiate boxplot of high vs. low coverage, either ‘TRUE’ (default) or ‘FALSE’)
[-Y <string>]	Path to python when running cutadapt (e.g., python, python2.6, /home/bin/python)
[-Q <string>]	Path to FastQC (e.g., /home/appl/apps/bin/fastqc, default is to use the one compiled in MethyQA pipeline)
[-M <string>]	Path to BRAT trim function (e.g., /home/appl/apps/bin/trim.v1.2.4, default is to use the one compiled in MethyQA pipeline)
[-K <string>]	Path to BRAT-large function (e.g., /home/appl/apps/bin/brat-large.v1.2.4, default is to use the one compiled in MethyQA pipeline)
[-J <string>]	Path to BRAT ACGT-count function (e.g., /home/appl/apps/bin/acgt-count.v1.2.4, default is to use the one compiled in MethyQA pipeline)
[-X <string>]	Path to fastx function (e.g., home/appl/apps/bin/fastx, default is to use the one compiled in MethyQA pipeline)
[-C <string>]	Path to cutadapt function (e.g., /home/appl/apps/bin/cutadapt, default is to use the one compiled in MethyQA pipeline)

The usage of the partial pipeline partial.MethyQA is:

perl partial.MethyQA.pl -i -t -c -p -d -R [OPTIONS]

The command options of partial.MethyQA are similar to those of the complete pipeline MethyQA, and more details about these options are provided in the user manual (see the Additional file 1).
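For users who drive the complete pipeline from a script, the sketch below assembles the MethyQA command line as an argument list in Python. Only flags listed in Table 2 are used; the helper name, all file paths, and the assumption that MethyQA.pl is invoked from the directory containing it are hypothetical, and the sketch only builds the command without running it.

```python
def build_methyqa_command(fastq, target, chrom, prefix, methyqa_dir,
                          ref_dir, ref_name, extra=None):
    """Build the argument list for the complete MethyQA pipeline.

    extra: optional dict mapping additional Table 2 flags (e.g. "-T", "-n")
           to their values.
    """
    cmd = ["perl", "MethyQA.pl",
           "-i", fastq, "-t", target, "-c", chrom, "-p", prefix,
           "-d", methyqa_dir, "-R", ref_dir, "-r", ref_name]
    for flag, value in (extra or {}).items():
        # The four region cutoffs are documented with range [0, 1].
        if flag in ("-B", "-b", "-L", "-l") and not 0.0 <= float(value) <= 1.0:
            raise ValueError(f"{flag} must lie in [0, 1], got {value}")
        cmd += [flag, str(value)]
    return cmd

# All paths and names below are placeholders, not real files.
cmd = build_methyqa_command(
    fastq="sample.fastq", target="F", chrom="chr1", prefix="sample",
    methyqa_dir="/home/user/downloads/MethyQA/",
    ref_dir="/home/user/reference/", ref_name="chr1.fa",
    extra={"-T": "fix", "-n": 10},
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.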

Results

We demonstrate the use of MethyQA using a publicly available bisulfite-treated methylation sequencing dataset for the cell line MCF10A [26]. Because the first three steps are conducted using available software packages, we mainly show the results of Steps 4 and 5. The reads in this dataset have low quality at the 3’ end. After trimming, about 1.5 million reads (2.5% of the total) that were discarded when aligning the raw data are successfully aligned (using the reference genome hg18). Thus, we use the alignment results obtained with low quality bases trimmed. Figure 2A shows the bisulfite conversion rate (i.e., 1 - methylation ratio) of nonCGc sites in chr1. In this figure all data points are around 1, that is, the bisulfite conversion rate is very high and there is no evidence of incomplete conversion. In addition to the graphical summary, our pipeline also provides a summary table for chromosome level analysis (see Table 3). Table 3 shows that the total number of nonCGc sites on chr1 (TNCGC) is 44683043, and 622926 of them (i.e., 1.394%) have at least 1X coverage. The bisulfite conversion rates of more than 75% of the nonCGc sites are 100%. Together with Figure 2A, these results show that this dataset has a very high bisulfite conversion rate. If a dataset has low bisulfite conversion rates, the histogram will look very different, that is, there will be data points with values much less than 1. In the user manual (see the Additional file 1), we provide different examples of datasets with and without problems.
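The chromosome-level summary reported in Table 3 can be approximated with the following Python sketch. It assumes each nonCGc site is given as a (coverage, methylation ratio) pair and uses linear interpolation between order statistics for the quartiles; the input layout, the quartile method, the function name, and the toy data are illustrative assumptions and may differ from what the MethyQA R scripts do.

```python
from statistics import mean, median, quantiles

def chromosome_summary(sites):
    """Summarize bisulfite conversion rates over the nonCGc sites of a chromosome.

    sites: iterable of (coverage, methylation_ratio), one pair per nonCGc site.
    """
    sites = list(sites)
    # Conversion rate = 1 - methylation ratio, for sites with coverage only.
    rates = sorted(1.0 - ratio for coverage, ratio in sites if coverage > 0)
    if len(rates) < 2:
        raise ValueError("need at least two covered nonCGc sites")
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    return {
        "TNCGC": len(sites),                          # all nonCGc sites
        "TNCGCwC": len(rates),                        # nonCGc sites with coverage
        "Percent": 100.0 * len(rates) / len(sites),
        "Min": rates[0],
        "Q1": q1,
        "Median": median(rates),
        "Mean": mean(rates),
        "Q3": q3,
        "Max": rates[-1],
    }

# Toy data: eight nonCGc sites, three of them without coverage.
toy = [(12, 0.0), (3, 0.0), (0, 0.0), (7, 0.5),
       (0, 0.0), (5, 0.0), (9, 1.0), (0, 0.0)]
summary = chromosome_summary(toy)
```

A well-converted dataset would show quartiles at or very near 1, as in Table 3, whereas the toy data here deliberately include poorly converted sites.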

Figure 2

Plots of bisulfite conversion rate (A) and nonCGc content (B). Plot A is the histogram of the bisulfite conversion rates of nonCGc sites on chr1. Plot B is the nonCGc content in the high and low coverage regions.

Table 3

Example of bisulfite-rate summary at the chromosome level

chr	TNCGC	TNCGCwC	Percent	Min	25th percentile	Median	Mean	75th percentile	Max
chr1	44683043	622926	1.394%	0	1	1	0.9961	1	1

TNCGC means the “total number of nonCGc sites”. TNCGCwC means the “total number of nonCGc sites with coverage”. “Percent” means the percent of nonCGc sites with coverage. The last 6 columns are a 6-number summary (minimum, 25th percentile, median, mean, 75th percentile, and maximum) for the bisulfite rates of nonCGc sites on chr1.

Figure 2B compares the percentage of nonCGc sites for regions with high and low coverage. This figure shows that low coverage regions tend to have higher nonCGc content than high coverage regions. In addition to comparing the nonCGc proportions, our pipeline can compare the DNA sequence structures of high or low coverage (or bisulfite conversion rate) regions in more detail, as explained in Step 5 of our pipeline. For example, we may compare the DNA sequence structure of high coverage and low coverage target regions (see Figure 3). In Figure 3, we use the genomic regions obtained based on the RRBS protocol as target regions because this MCF10A sample is sequenced using the RRBS method. In particular, we use the chromosome regions (or intervals) bounded by MspI (C^CGG) sites that are 40 to 220 bases apart. Figure 3 compares the percentages of A, C, G, T, GC content (i.e., C+G), CGc, nonCGc, and repetitive bases in high coverage regions with those in low coverage regions. From this figure, we see no obvious difference between high and low coverage regions, because this sample is well sequenced and there is no obvious sequencing problem. However, for datasets with known or unknown library preparation or sequencing problems, the DNA sequence structure plots generated in Step 5 will show obvious patterns. For example, some data will show high coverage corresponding to dramatically high or low percentages of A, GC, or nonCGc content, and so on. More information about other examples and our pipeline can be found in the user manual (see the Additional file 1).

Figure 3

Plots comparing regions with high and low coverage. The comparison is based on the percentages of A, C, G, T, GC content (i.e., C+G), CGc, nonCGc, and repetitive bases.
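The per-region sequence statistics compared in this figure can be sketched in Python as follows. The sketch works on the forward strand only, takes the region length as the denominator for every percentage, and treats lowercase (soft-masked) bases as repetitive, as in UCSC genome sequence files; these choices, the function name, and the toy sequence are assumptions for illustration rather than the exact MethyQA calculation.

```python
def sequence_structure(seq):
    """Return base-composition percentages for one region's DNA sequence."""
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    upper = seq.upper()
    counts = {base: upper.count(base) for base in "ACGT"}
    cg_c = upper.count("CG")            # cytosines followed by G (CGc)
    non_cg_c = counts["C"] - cg_c       # all other cytosines (nonCGc)
    repeats = sum(1 for ch in seq if ch.islower())  # soft-masked bases

    def pct(k):
        return 100.0 * k / length

    return {
        "A": pct(counts["A"]), "C": pct(counts["C"]),
        "G": pct(counts["G"]), "T": pct(counts["T"]),
        "C+G": pct(counts["C"] + counts["G"]),
        "CGc": pct(cg_c), "nonCGc": pct(non_cg_c),
        "repeat": pct(repeats),
    }

# Toy 12-base region with a lowercase (repeat-masked) stretch in the middle.
stats = sequence_structure("ACGTacgtCCGG")
```

Computing these percentages for each high and low metric region and plotting them side by side, for example as boxplots, gives the kind of comparison the figure describes.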

Discussion

Our pipeline has a few limitations. First, for some non-mammalian genomes (e.g., plants), nonCGc sites are methylated. Our pipeline is therefore not suitable for checking bisulfite-treated methylation sequencing data generated from these genomes. For these genomes, the investigator may use positive and negative controls (e.g., regions or sites known to be methylated or unmethylated) and study the methylation ratios of these known regions. Second, our pipeline is mainly developed for FASTQ format sequencing data generated using the Illumina analyzer. Sequencing data that are not in the FASTQ file format first need to be converted to a FASTQ file in order to use our MethyQA program. Despite these limitations, the Perl and R scripts provided by our group can be used to conduct further analysis with pre-obtained methylation ratios.

Our pipeline has the following advantages. First, because our pipeline is not designed for data generated by a specific protocol, it is suitable for performing quality assessment of bisulfite sequencing data generated by different protocols. Second, the user can conduct the quality assessment not only at the individual chromosome level, but also at a user-provided target-region level. Users interested in whole genome sequencing or in checking bisulfite conversion rates can use the chromosome level analysis. In contrast, users mainly interested in certain types of regions (e.g., CpG islands, promoter regions, or candidate genes) can use the target-region analysis feature, which allows them to focus on specific regions of interest.

Conclusions

Pipelines for the quality assessment of bisulfite-treated methylation sequencing data are highly needed. MethyQA is a new tool that can fill this need. It can process large amounts of raw and aligned methylation sequencing data efficiently. It generates both diagnostic graphs and tables to examine sequencing quality, providing useful information for medical researchers and analysts.

Availability and requirements

Project name: MethyQA
Downloading software (pipeline): http://hal.case.edu/~sun/MethyQA.v2.zip
Operating system(s): Linux/Unix
Programming language: Perl (v5.8 or later), R (v2.13 or later) and Python (v2.6 or later)
Other requirements: None
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SS and AN wrote the Perl and R scripts. XY provided the original alignment scripts and helped revise the final scripts. All three authors were involved in writing the manuscript and approved the final document.

Additional file 1

MethyQA User Manual.

40 in total

1. Breast cancer DNA methylation profiles in cancer cells and tumor stroma: association with HER-2/neu status in primary breast cancer.

Authors: Heidi Fiegl; Simone Millinger; Georg Goebel; Elisabeth Müller-Holzner; Christian Marth; Peter W Laird; Martin Widschwendter
Journal: Cancer Res Date: 2006-01-01 Impact factor: 12.701

2. Circulating methylated DNA: a new generation of tumor markers.

Authors: Martin Widschwendter; Usha Menon
Journal: Clin Cancer Res Date: 2006-12-15 Impact factor: 12.531

3. BRAT-BW: efficient and accurate mapping of bisulfite-treated reads.

Authors: Elena Y Harris; Nadia Ponts; Karine G Le Roch; Stefano Lonardi
Journal: Bioinformatics Date: 2012-05-03 Impact factor: 6.937

4. CpG island methylator phenotype in colorectal cancer.

Authors: M Toyota; N Ahuja; M Ohe-Toyota; J G Herman; S B Baylin; J P Issa
Journal: Proc Natl Acad Sci U S A Date: 1999-07-20 Impact factor: 11.205

5. Aberrant DNA methylation in ovarian cancer: is there an epigenetic predisposition to drug response?

Authors: Susan H Wei; Robert Brown; Tim H-M Huang
Journal: Ann N Y Acad Sci Date: 2003-03 Impact factor: 5.691

6. Methylation profiling of CpG islands in human breast cancer cells.

Authors: T H Huang; M R Perry; D E Laux
Journal: Hum Mol Genet Date: 1999-03 Impact factor: 6.150

7. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications.

Authors: Felix Krueger; Simon R Andrews
Journal: Bioinformatics Date: 2011-04-14 Impact factor: 6.937

8. Gamma-Normal-Gamma mixture model for detecting differentially methylated loci in three breast cancer cell lines.

Authors: Abbas Khalili; Dustin Potter; Pearlly Yan; Lang Li; Joe Gray; Tim Huang; Shili Lin
Journal: Cancer Inform Date: 2007-02-07

9. BSMAP: whole genome bisulfite sequence MAPping program.

Authors: Yuanxin Xi; Wei Li
Journal: BMC Bioinformatics Date: 2009-07-27 Impact factor: 3.169

10. BRAT: bisulfite-treated reads analysis tool.

Authors: Elena Y Harris; Nadia Ponts; Aleksandr Levchuk; Karine Le Roch; Stefano Lonardi
Journal: Bioinformatics Date: 2009-12-22 Impact factor: 6.937
