Shuying Sun, Aaron Noviski, Xiaoqing Yu.
Abstract
BACKGROUND: DNA methylation is an epigenetic event that adds a methyl group to the 5-carbon position of cytosine. This epigenetic modification can significantly affect gene expression in both normal and diseased cells. Hence, it is important to study methylation signals at the single-cytosine level, which is now possible using the bisulfite conversion technique (i.e., converting unmethylated Cs to Us, which are then read as Ts after PCR amplification) together with next-generation sequencing (NGS) technologies. Despite the advances in NGS technologies, certain quality issues remain. Some of the more prevalent ones involve low per-base sequencing quality at the 3' end, PCR amplification bias, and bisulfite conversion rates. Therefore, it is important to conduct quality assessment before downstream analysis. To the best of our knowledge, no existing software package can generally assess the quality of methylation sequencing data generated with different bisulfite-treatment protocols.
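The bisulfite conversion described in the abstract can be illustrated with a small in-silico sketch. This is not part of MethyQA; the function name `bisulfite_convert` and the `methylated_positions` argument are hypothetical, and the sketch assumes complete conversion on a single strand.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Return the sequence expected after bisulfite treatment and PCR.

    Assumption: conversion is complete, so every unmethylated C is read
    as T, while a methylated C (0-based index in `methylated_positions`)
    is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated C -> U -> T after PCR
        else:
            out.append(base)  # methylated C and all other bases are unchanged
    return "".join(out)


original = "ACGTCCGA"
converted = bisulfite_convert(original, methylated_positions={1})
print(original, "->", converted)
```

Comparing the converted read with the reference at each cytosine is what allows methylation to be called at single-site resolution.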
Year: 2013 PMID: 23968174 PMCID: PMC3765750 DOI: 10.1186/1471-2105-14-259
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The workflow of our pipeline MethyQA.
The main output files in each step of the MethyQA pipeline
| Output file | Step | Description |
| SampleName_fastqc, SampleName_fastqc.zip | Step 1 | One folder and one zip file that save the output of quality assessment using fastqc. |
| fastx.trim.fastq, cutadapt.trim.fastq | Step 2 | Fastx or cutadapt output (one line per read) if adapter trimming is used. |
| *_reads1.txt | Step 2 | BRAT trimming output (one line per read) if dynamic trimming using BRAT trim is done. |
| fixedTrim_BRATout | Step 2 | Fixed length trimming output (one line per read) if “fixed-length” trimming is used. |
| alignment.brat | Step 3 | BRAT alignment output (one line per read). |
| *_forw.txt | Step 3 | BRAT ACGT-count (i.e., methylation ratio) output file (one line per cytosine position). |
| *chrN.summary.table.txt | Step 4 | Chromosome level summary table for bisulfite conversion rates. |
| *chrN.BS.ps | Step 4 | Chromosome level plot for bisulfite conversion rates. |
| *chrN.target.summary.table.txt | Step 4 | Target region level summary table for the mean and median of bisulfite conversion rates. |
| *chrN.mean.median.ps | Step 4 | Target region level plot for the mean and median of bisulfite conversion rates. |
| *chrN.seq.bisulfite.boxplot.ps | Step 5 | Plots for comparing the DNA sequence structure for regions with high and low bisulfite conversion rates. |
| *chrN.highBS.seq, *chrN.lowBS.seq, *chrN.highBS.target, *chrN.lowBS.target | Step 5 | Target regions with high or low bisulfite conversion rates (*seq files include all basic DNA sequence statistics, and *target files include the summary of nonCGc bisulfite conversion rates). |
| *chrN.seq.coverage.boxplot.ps | Step 5 | Plots for comparing the DNA sequence structure for regions with high and low sequencing coverage. |
| *chrN.highCoverage.seq, *chrN.lowCoverage.seq, *chrN.highCoverage.target, *chrN.lowCoverage.target | Step 5 | Target regions with high or low sequencing coverage (*seq files include all basic DNA sequence statistics, and *target files include the summary of nonCGc bisulfite conversion rates). |
“*” means the prefix provided by the user while running MethyQA. In some file names, “chrN” means a specific chromosome that the user investigates.
The command options of MethyQA
| Option | Description |
| [-i <file>] | FASTQ input file |
| [-t <file>] | Target input file (i.e., a list of target regions specified for analysis). Use “F” to skip target analysis |
| [-d <dir>] | Path to MethyQA directory (e.g., /home/user/downloads/MethyQA/) |
| [-c <string>] | Chromosome number (e.g., chr1, chr2, chr17, chrX, chrY, etc.) |
| [-p <string>] | Prefix (i.e., the prefix written to the output file names) |
| [-R <dir>] | Reference directory (i.e., the directory with the genome reference files) |
| [-r <file>] | Reference name (i.e., the file name of the reference that the user will use) |
| [-f <string>] | FASTQ format (i.e., “sanger” or “illumina”) |
| [-a <string>] | Adapter trimming. (1) “no”: no adapter trimming (default); (2) “fastx”: fastx adapter trimming; (3) “cutadapt”: cutadapt adapter trimming. If cutadapt is set, the “-Y” option should be specified in the command line |
| [-A <string>] | Adapter sequence (The default is Illumina adapter sequence: |
| [-T <string>] | Quality trim flag. (1) “no”: no quality trimming; (2) “brat”: brat dynamic trimming (default); (3) “fix”: fixed quality trimming |
| [-N <int>] | For fixed quality trimming (users specify the number of bases to be trimmed at the 5' end, default is 5) |
| [-n <int>] | For fixed quality trimming (users specify the number of bases to be trimmed at the 3' end, default is 10) |
| [-B <real>] | Cutoff value for selecting high bisulfite conversion regions (Range: [0, 1], default B=0.99) |
| [-b <real>] | Cutoff value for selecting low bisulfite conversion regions (Range: [0, 1], default b=0.6) |
| [-L <real>] | Cutoff value for selecting high coverage region (Range: [0, 1], default L=0.5) |
| [-l <real>] | Cutoff value for selecting low coverage region (Range: [0, 1], default l=0.1) |
| [-u <logic>] | Bisulfite flag (it is an option to initiate boxplot of high vs. low bisulfite rates, either ‘TRUE’ (default) or ‘FALSE’) |
| [-v <logic>] | Coverage flag (it is an option to initiate boxplot of high vs. low coverage, either ‘TRUE’ (default) or ‘FALSE’) |
| [-Y <string>] | Path to python when running cutadapt (e.g., python, python2.6, /home/bin/python) |
| [-Q <string>] | Path to FastQC (e.g., /home/appl/apps/bin/fastqc, default is to use the one compiled in the MethyQA pipeline) |
| [-M <string>] | Path to BRAT trim function (e.g., /home/appl/apps/bin/trim.v1.2.4, default is to use the one compiled in the MethyQA pipeline) |
| [-K <string>] | Path to BRAT-large function (e.g., /home/appl/apps/bin/brat-large.v1.2.4, default is to use the one compiled in the MethyQA pipeline) |
| [-J <string>] | Path to BRAT ACGT-count function (e.g., /home/appl/apps/bin/acgt-count.v1.2.4, default is to use the one compiled in the MethyQA pipeline) |
| [-X <string>] | Path to fastx function (e.g., /home/appl/apps/bin/fastx, default is to use the one compiled in the MethyQA pipeline) |
| [-C <string>] | Path to cutadapt function (e.g., /home/appl/apps/bin/cutadapt, default is to use the one compiled in the MethyQA pipeline) |
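The constraints in the option table (allowed values, the [0, 1] cutoff ranges, and the requirement to pass “-Y” with cutadapt) can be checked before a run. The following is a minimal sketch, not part of MethyQA: the helper `build_methyqa_args` is hypothetical, and the launcher path is a placeholder because the name of the MethyQA entry script is not given here. Only the flags listed in the table are used, and the command is assembled but not executed.

```python
import shlex


def build_methyqa_args(launcher, fastq, methyqa_dir, chrom, prefix,
                       ref_dir, ref_name, fastq_format, target="F",
                       adapter="no", python_path=None, quality_trim="brat",
                       high_bs=0.99, low_bs=0.6, high_cov=0.5, low_cov=0.1):
    """Validate options against the table and return an argument list.

    `launcher` is a placeholder for however MethyQA is started (assumption).
    """
    if fastq_format not in ("sanger", "illumina"):
        raise ValueError("-f must be 'sanger' or 'illumina'")
    if adapter not in ("no", "fastx", "cutadapt"):
        raise ValueError("-a must be 'no', 'fastx', or 'cutadapt'")
    if adapter == "cutadapt" and python_path is None:
        raise ValueError("-Y (path to python) is required with cutadapt")
    if quality_trim not in ("no", "brat", "fix"):
        raise ValueError("-T must be 'no', 'brat', or 'fix'")
    cutoffs = (("-B", high_bs), ("-b", low_bs), ("-L", high_cov), ("-l", low_cov))
    for flag, value in cutoffs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{flag} must lie in [0, 1], got {value}")

    args = [launcher, "-i", fastq, "-t", target, "-d", methyqa_dir,
            "-c", chrom, "-p", prefix, "-R", ref_dir, "-r", ref_name,
            "-f", fastq_format, "-a", adapter, "-T", quality_trim]
    for flag, value in cutoffs:
        args += [flag, str(value)]
    if python_path is not None:
        args += ["-Y", python_path]
    return args


# All paths and names below are illustrative placeholders.
example = build_methyqa_args(
    "path/to/methyqa_launcher", "sample.fastq",
    "/home/user/downloads/MethyQA/", "chr1", "sample1",
    "/home/user/reference/", "chr1.fa", "sanger",
    adapter="cutadapt", python_path="python")
print(shlex.join(example))
```

Returning a list rather than a single string keeps the arguments safe to hand to a process launcher without shell quoting issues.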
Figure 2. Plots of bisulfite conversion rate (A) and nonCGc content (B). Plot A is the histogram of the bisulfite conversion rates of nonCGc sites on chr1. Plot B is the nonCGc content in the high and low coverage regions.
Example of bisulfite-rate summary at the chromosome level
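| Chromosome | TNCGC | TNCGCwc | Percent | Minimum | 25th percentile | Median | Mean | 75th percentile | Maximum |
| Chr1 | 44683043 | 622926 | 1.394% | 0 | 1 | 1 | 0.9961 | 1 | 1 |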
TNCGC is the total number of nonCGc sites, TNCGCwc is the total number of nonCGc sites with coverage, and “Percent” is the percentage of nonCGc sites with coverage. The last six columns give a six-number summary (minimum, 25th percentile, median, mean, 75th percentile, and maximum) of the bisulfite rates of nonCGc sites on chr1.
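The columns of this summary can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the per-site rates are toy values rather than real data, and the quantile method (linear interpolation, `method="inclusive"`) is an assumption, since the method MethyQA uses is not stated here.

```python
import statistics


def percent_with_coverage(tncgc, tncgc_wc):
    """Percent of nonCGc sites that are covered by at least one read."""
    return 100.0 * tncgc_wc / tncgc


def six_number_summary(rates):
    """Return (min, 25th percentile, median, mean, 75th percentile, max)."""
    q1, median, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return (min(rates), q1, median, statistics.mean(rates), q3, max(rates))


# Values from the chr1 row of the table.
percent = percent_with_coverage(44683043, 622926)
print(f"Percent: {percent:.3f}%")

# Toy per-site bisulfite rates (made up for illustration only).
toy_rates = [0.0, 1.0, 1.0, 1.0, 1.0]
summary = six_number_summary(toy_rates)
print(summary)
```

With most sites fully converted, the quartiles and median sit at 1 while the mean falls slightly below it, which is the same pattern as in the chr1 row.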
Figure 3. Plots comparing regions with high and low coverage. The comparison is based on the percentages of A, C, G, T, GC content (i.e., C+G), CGc, nonCGc, and repetitive bases.
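The sequence statistics compared in Figure 3 can be sketched for a single region as follows. This is a rough illustration rather than MethyQA's implementation: the function `sequence_stats` is hypothetical, only cytosines on the given strand are counted, and repetitive bases are assumed to be soft-masked (lowercase) in the reference sequence.

```python
def sequence_stats(seq):
    """Percentages of A, C, G, T, GC, CGc, nonCGc, and repeats in `seq`.

    Assumptions: CGc = cytosines directly followed by G on this strand;
    nonCGc = all other cytosines; lowercase letters mark repeats.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    upper = seq.upper()
    counts = {base: upper.count(base) for base in "ACGT"}
    cg_c = upper.count("CG")          # each CG dinucleotide holds one CG cytosine
    non_cg_c = counts["C"] - cg_c     # remaining cytosines are non-CG
    repeats = sum(1 for ch in seq if ch.islower())

    def pct(x):
        return 100.0 * x / n

    return {
        "A": pct(counts["A"]), "C": pct(counts["C"]),
        "G": pct(counts["G"]), "T": pct(counts["T"]),
        "GC": pct(counts["G"] + counts["C"]),
        "CGc": pct(cg_c), "nonCGc": pct(non_cg_c),
        "repeats": pct(repeats),
    }


stats = sequence_stats("ACGTccgaTC")
for name, value in stats.items():
    print(f"{name}: {value:.1f}%")
```

Computing these values per target region and grouping the regions by coverage (or by bisulfite conversion rate) yields the inputs for boxplots of the kind shown in the figure.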