Literature DB >> 24497972

WBSA: web service for bisulfite sequencing data analysis.

Fang Liang1, Bixia Tang1, Yanqing Wang1, Jianfeng Wang1, Caixia Yu1, Xu Chen1, Junwei Zhu1, Jiangwei Yan1, Wenming Zhao1, Rujiao Li1.   

Abstract

Whole-Genome Bisulfite Sequencing (WGBS) and genome-wide Reduced Representation Bisulfite Sequencing (RRBS) are widely used to study DNA methylation. However, data analysis is complicated, lengthy, and hampered by a lack of seamless analytical pipelines. To address these issues, we developed a convenient, stable, and efficient web service called Web Service for Bisulfite Sequencing Data Analysis (WBSA) to analyze bisulfate sequencing data. WBSA focuses on not only CpG methylation, which is the most common biochemical modification in eukaryotic DNA, but also non-CG methylation, which have been observed in plants, iPS cells, oocytes, neurons and stem cells of human. WBSA comprises three main modules as follows: WGBS data analysis, RRBS data analysis, and differentially methylated region (DMR) identification. The WGBS and RRBS modules execute read mapping, methylation site identification, annotation, and advanced analysis, whereas the DMR module identifies actual DMRs and annotates their correlations to genes. WBSA can be accessed and used without charge either online or local version. WBSA also includes the executables of the Portable Batch System (PBS) and standalone versions that can be downloaded from the website together with the installation instructions. WBSA is available at no charge for academic users at http://wbsa.big.ac.cn.

Entities:  

Mesh:

Substances:

Year:  2014        PMID: 24497972      PMCID: PMC3907392          DOI: 10.1371/journal.pone.0086707

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

DNA methylation plays an important role in cell differentiation, X chromosome inactivation, genomic imprinting through regulation of transcription, chromatin structure, chromosome stability, and tumorigenesis [1], [2], [3], [4], [5], [6]. DNA methylation research has been accelerated by the development of next-generation sequencing technology, and it is now the focus of research groups with diverse interests [5], [7], [8], [9], [10], [11], [12]. There are four mainstream sequencing-based methods for DNA methylation profiling: two utilize enrichment of methylated DNA (Methylated DNA Binding Domain sequencing, or MBD-seq [7] and Methylated DNA Immunoprecipitation sequencing, or MeDIP-seq [8], [9]), and the other two utilize bisulfite conversion (MethylC-seq or WGBS [10] and Reduced Representation Bisulfite Sequencing or RRBS [11]). Reacting DNA with bisulfite converts cytosine residues to uracil residues but does not alter 5-methylcytosine residues, which makes it possible to distinguish methylated from unmethylated cytosine residues. Because bisulfite sequencing determines single base changes, its resolution is greater than those methods that utilize DNA enriched in methylated regions [12]. WGBS and RRBS are widely used in biological research [5], [12]. The read alignment of bisulfite sequencing data differs from that generated using non-bisulfite sequencing due to the change of C to T residues. Therefore, alignment tools such as BSMAP [13], BS SEEKER [14], RMAP [15], and Bismark [16] were developed to address this issue. Further, common alignment tools such as BWA [17] and Bowtie [18], [19] also align bisulfite sequencing reads to a reference sequence after the reads and the reference sequences have been converted [10]. Other tools are available for analyzing bisulfite sequencing data, such as CyMATE [20], CpG PatternFinder [21], GBSA [22], COHCAP [23], methylKit [24], and BSmooth [25]. Their applications are limited because their analytical pipelines either require aligned reads as input or only support single-end alignments. Moreover, these tools only identify methylated cytosines or only analyze methylated CpG islands to search for correlations between methylation and gene expression. SAAP-RRBS [26] and RRBS-Analyser [27] integrated BSMAP as an alignment tool and acted as streamlined analysis and annotation pipelines. However, these approaches are designed only for RRBS data with limited annotations. Certain tools identify differentially methylated regions [24], [25], [27], [28], but most do not focus on analysis of non-CGs (Table 1–3).
Table 1

Comparison of WBSA's WGBS module with six pipelines.

FunctionsWBSA-WGBSCyMATECpG PatternFinderGBSACOHCAPmethylKitBSmooth
Read Quality analysisYNNNNNY
Filter adaptor & low qualityYNNNNNN
Computation of conversion rateYNYNNNN
AlignmentYNNNNNYa
Focus on non-CGsYYNYNNN
Methylation levelYYYYYYY
Methylation distributionYYYNNNN
Relationship of methylation and CpG islandsYNNYYYN
Gene annotationYNNYNYN
Functional analysis of genes with high or low methylationYNNNNNN
Sequence preferenceYNNNNNN
Correlation between methylation and gene expressionYNNNYNN
Online versionYYYYYYY
Standalone versionYNNNNNN
PBS versionYNNNNNN

only support single-end data.

Table 3

Comparison of WBSA's DMR module with four pipelines.

FunctionsWBSA-DMRRRBS-analyserBSmoothmethylKitQDMR
Focus on CGsYYYYY
Focus on non-CGsYYNNN
Correlation between DMR and genesYYNYY
The functional analysis of correlative genesYNNNY
More than one method of DMR identificationYNNNN
Online versionYYYYY
Standalone versionYNNNN
PBS versionYNNNN
only support single-end data. We describe here WBSA, which provides a user-friendly and novel web service for analyzing bisulfite sequencing data. WBSA focuses on the analysis of CpG as well as CHG and CHH (H = A, T or C), and therefore aids DNA methylation research related to animals or plants. Researchers can use WBSA to perform comprehensive analyses of WGBS or RRBS data and can identify DMRs in different contexts. Moreover, WBSA is efficient and fast. For example, the overall process starting from the analysis of the quality of reads to annotating methylation only takes 5.5 hours for 2.8 GB (15.6M reads) sample data with WGBS. WBSA comprises six modules: Home, WGBS, RRBS, DMR, Documents, and Downloads and provides the executables for downloads and local installation.

Methods

1) Analysis and annotation of bisulfite sequencing data

WBSA is a web service for analyzing bisulfite sequencing data and uses Perl and C to compile all program scripts. WBSA comprises three main (WGBS, RRBS and DMR) and three basic (Home, Documents and Downloads) modules. The main modules analyze and annotate the bisulfite sequencing data. The flowcharts shown in Figure 1 depict the process of data analysis. Detailed methods are provided below.
Figure 1

Flowchart of data analysis.

a. Flowchart of data analysis for WGBS and RRBS. WGBS and RRBS include four parts as follows: pre-processing of reads and the reference sequence, mapping to the reference genome, mC identification, and methylation annotation. The sequencing reads, reference sequences, and the lambda sequence should be used as input data, and all the results can be previewed and downloaded. b. Flowchart of DMR identification. The DMR analysis module includes DMR identification and annotation.

Flowchart of data analysis.

a. Flowchart of data analysis for WGBS and RRBS. WGBS and RRBS include four parts as follows: pre-processing of reads and the reference sequence, mapping to the reference genome, mC identification, and methylation annotation. The sequencing reads, reference sequences, and the lambda sequence should be used as input data, and all the results can be previewed and downloaded. b. Flowchart of DMR identification. The DMR analysis module includes DMR identification and annotation. Data uploading and analyzing read quality: WBSA provides two modes for uploading sequencing data: HTTP (for files <2 GB) and FTP (no size limit). We integrated FastQC software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to analyze the quality of bisulfite sequencing data. WBSA generates figures and tables that show the results of the analysis of quality distribution, nucleotide distribution, GC content distribution, and overrepresented sequences as well as other information. The user can filter adaptor sequences and low quality bases from two ends of a sequence if the base quality value is less than the value of the threshold (optional parameter) for WGBS and RRBS data. If the read length is less than the minimum read length (predefined value) after filtering bases, the reads will be discarded. Sequence alignments: For bisulfite sequencing reads, cytosines in T-rich reads are replaced with thymines, while guanines in A-rich read are replaced with adenines. The position of the replaced cytosines or guanines will be marked if the quality value is larger than Q (a predefined value). WBSA prepares the reference sequence and simultaneously converts it twice as follows: (1) cytosines are replaced with thymines, and (2) guanines are replaced with adenines. BWA [17] is used to align processed reads according to the converted reference sequence. The default mapping parameters can be changed by the user. If an unmethylated DNA sequence Lambda named “chrLam” is used and uploaded, WBSA can integrate the Lambda sequence within the reference sequence. The Lambda genome is included in the reference sequence as an extra chromosome so that reads originating from the unmethylated control DNA can be aligned. The sodium bisulfite non-conversion rate is calculated as the percentage of cytosines sequenced at cytosine reference positions in the Lambda genome. WBSA can process single-end and paired-end data for WGBS, but only processes single-end data for RRBS, because the restriction endonuclease digestion fragments are likely to be shorter (40–220 bp). Therefore, single-end sequencing is more practical to perform than paired-end sequencing. WBSA discards four types of reads that map to the reference as follows: (1) reads mapped to multiple positions; (2) reads mapped to the wrong strands (T-rich reads mapped to Crick-strand Cs converted to Ts or to Watson-strand Gs converted to ‘A’s, A-rich reads mapped to Watson-strand Cs converted to Ts or to Crick-strand Gs converted to ‘A’s). WBSA only supports analysis of methylC-seq data, which is strand-specific; (3) T-rich reads where a C maps to T in the reference sequence, or A-rich reads where a G maps to an A in the reference sequence; and (4) duplicated reads generated by the use of PCR (optional parameter). Identification of methylation sites: For each reference cytosine, WBSA uses the binomial distribution B(n, p) to identify the methylation site, using a 0.01 false discovery rate (FDR) corrected P-value [10], where the probability p in the binomial distribution B(n, p) is estimated from the number of cytosines sequenced in reference sequence cytosine positions in the unmethylated Lambda sequence (referred to as the error rate: non-conversion plus sequencing error frequency) if the Lambda sequence is uploaded by the user; otherwise, the probability p must be provided by the user. For each reference cytosine, the trial number (n) is the read depth, and the cytosine is noted as methylated if the number of sequenced cytosines (m) follows the following formula as below:Further, the RRBS module eliminates the impact on mC identification because of double strand DNA repair and conversion into blunt ends at the terminus of a sequence. Annotation by WGBS and RRBS: WBSA provides a wide variety of annotations and analyses for WGBS and RRBS. WBSA first evaluates the abundance of methylated cytosines in the genome and shows the distribution of methylation in different regions (upstream, first exon, first intron, internal exons, internal introns, last exon, downstream) of genes and in repeats. WBSA next performs a statistical analysis of the number and percentage of methylated CpG islands in different functional genic regions (promoter, gene body, downstream, and intergenic). A methylated CpG island is defined as a sequence of 200-plus base pairs with a G+C content of greater than 50%, the observed/expected C frequency of greater than 0.6 and a methylation level of greater than 70%. The third is the functional clustering analysis of genes with high and low levels of methylation. Functional gene clustering is implemented using three steps: (1) methylation level of each gene is counted; (2) genes with high (>70%) and low (<30%) levels of methylation are annotated and functionally classified according to Gene Ontology (GO) terms, respectively; (3) the numbers of genes with the high and low levels of methylation are counted, and histograms are generated (horizontal axis and vertical axes represent the functional class and gene number, respectively). Fourth, a red graph shows the distribution of methylation levels in transposable elements (TE). Fifth, the sequence preference for mCG, mCHG, and mCHH are analyzed using WEBLOGO software [29]. Sixth, the correlation between gene expression and methylation levels is analyzed, and this analysis consists of four steps as follows: (1) uploaded genes are sorted according to the expression values; (2) sorted genes are divided equally into five groups, such that the first group contains genes with the lowest expression values; (3) each gene body or promoter region is divided equally into 20 bins, and the average relative methylation level of each bin for genes in every group is calculated; (4) two-dimensional curves are generated (horizontal axis, gene body or promoter region; vertical axis, average relative methylation level), showing the relative levels of mCG, mCHG, and mCHH contexts in the promoter regions and gene bodies for WGBS and the CG context for the RRBS promoter regions. Identification of differentially methylated regions: WBSA includes an independent module for DMR identification (Figure 1b) and provides the static window and dynamic window methods. The static window method is used to identify DMRs in strings of CN, CG, and CH (N = A, T, C or G, H = A, C or T). This approach fixes the window length and the number of adjacent windows. The Wilcoxon test is used if both samples have sufficient coverage in these windows and the methylation level of one sample is greater, at least 0.2 (delta methylation level), than that of the other. The test window moves one mC for each step. The p-value, minimum sequence coverage rate and delta methylation level can be adjusted according to user's expectations. Whether using FDR correction is determined by users. The dynamic window method is used to identify DMRs in strings of CN and CG. The Wilcoxon test is used in a window with fixed numbers of CNs or CGs if the coverage of both samples is sufficient and the methylation level of one sample is greater, at least 0.2 (delta methylation level), than that of the other. First, the window moves towards the 3′-direction one step-size at a time and repeats the Wilcoxon test until the p-value is not significant or until the end of the sequence is reached. The same process is repeated in the original fixed window in the 5′-direction. The window size, step size, coverage, delta methylation level and p-value can be adjusted in accordance with different expectations. There are three sample data analysis pipelines showing all corresponding analyses on the WGBS, RRBS, and DMR pages, and users can access the corresponding links to analyze these sample datasets or check the results using WBSA. A detailed description of how to use WBSA is shown on the Documents page. Links for all the software related to the service and instruction manuals, which supports the PBS and a standalone versions, are provided on the Downloads page.

2) System architecture

WBSA consists of a web application and computational modules, which we take as front-end and back-end programs, respectively [30]. The front-end program was developed using Java Server Page (JSP) technology, which is used to process the submitted task request and shows the results. Three popular frameworks, Struts (http://struts.apache.org), Spring (http://www.springsource.org), and MyBatis (http://mybatis.github.io), are simultaneously used to enhance the flexibility and extendibility of the web application. The Mysql database is used to store information, which will be used by the front-end and back-end programs. The back-end program was mainly developed using Perl to process the data. A Java package encloses all the Perl scripts, monitors task status, and executes the Perl script. An XML file is used to exchange information between front-end and back-end programs. The system architecture and workflow are shown in Figure 2.
Figure 2

WBSA system architecture and workflow.

When the user chooses one analysis module from the web page such as WGBS, the user must input several parameters according to the instructions provided. The web process, which was developed using a Struts and Spring framework, will then proceed with the user's request and generate an XML file to store the parameters provided by the user. At the same time, it will insert a record into the database to identify the new job. The workflow monitor process BIG Workflow will continually monitor the job's status from the database and will activate the data processing procedure if a new job is found. When the user previews the result on the web page, the web process will indicate the status of the job and show the appropriate results to the user.

WBSA system architecture and workflow.

When the user chooses one analysis module from the web page such as WGBS, the user must input several parameters according to the instructions provided. The web process, which was developed using a Struts and Spring framework, will then proceed with the user's request and generate an XML file to store the parameters provided by the user. At the same time, it will insert a record into the database to identify the new job. The workflow monitor process BIG Workflow will continually monitor the job's status from the database and will activate the data processing procedure if a new job is found. When the user previews the result on the web page, the web process will indicate the status of the job and show the appropriate results to the user. The executable package includes PBS and standalone versions, which are available for download from our website and can be installed locally on a Linux-64 system. The executables include a Jar package and several configuration files. Users only need to write a configuration file and then start the Jar package to activate the task. Some essential tools described on the download page require installation to enable the system to run.

3) Simulation of bisulfite sequencing data

To simulate WGBS data, we first used DWGSIM software (version 0.1.7) (https://github.com/nh13/DWGSIM) to simulate NGS paired-end 80-bp reads with a 0.5% base-error rate on a whole-genome scale for two model organisms (zebrafish and rice). And the simulated data includes 5% random reads. The simulation parameters are -e 0.005 (base-error rate of the first read sequences) -E 0.005 (base-error rate of the second read sequences) -C 10 (10-fold coverage) -1 80 (length of the first read sequences) -2 80 (length of the second read sequences) -y 0.05 (5% random reads). In the second step, Cs are randomly converted to Ts for the first-read sequences of paired-end reads and Gs to ‘A’s for the second-read sequences of paired-end reads. The numbers of simulated reads include 89,278,622 and 24,677,386 pairs, respectively, and represent 10-fold coverage of the zebrafish and rice genomes. The numbers of random DNA sequences were 4,492,050 and 1,235,216 pairs, respectively. We trimmed 10 and 20 bases from the ends of simulated reads and generated 70 and 60 bp long reads. To simulate RRBS data, first we scanned either the human (hg19) or mouse (mm9) genome and marked the positions of CCGGs for the Watson and Crick strands, and the distance between adjacent CCGGs should be ≥40 bp and ≤220 bp. Then we extracted at random 36-bp sequences that start with CGG (starting with CCGG and removing the first C). Next, we introduced randomly 0.5% incorrect bases into these 36-bp fragments and then imported 5% random DNA sequences. In the final step, we converted at random Cs to Ts in each read. The total numbers of simulated reads of human and mouse were 17,087,814 and 7,463,343, and the numbers of random DNA sequences were 854,403 and 373,182 reads, respectively.

Results and Discussion

1) Evaluation of the mapping efficiency and accuracy of WBSA

Mapping reads to a reference genome is an important step for the analysis of bisulfite sequencing. We therefore compared WBSA with the two most popular mapping software packages, Bismark and BSMAP. The comparison includes the following variables: sequencing types (paired-end and single-end), read length (80, 70, 60, and 36 bp), data types (simulated data and actual data), and library types (WGBS and RRBS data). We simulated paired-end reads with different lengths of zebrafish and rice genomes for WGBS and single-end reads of human and mouse genomes for RRBS (simulation methods are described in the Methods section). We used three methods (WBSA, BSMAP and Bismark) to align simulated and actual sequencing reads to their corresponding genomes. The results show that WBSA performed as effectively as BSMAP and Bismark. In contrast, WBSA mapping was more accurate and faster. The detailed results are presented in Table 4–6.
Table 4

Comparison of mapping times and accuracies among WBSA, BSMAP, and Bismark for simulated WGBS data.

Read length (bp)SpeciesSoftwareAlignment ParametersMapping Time (hours)RAM (Gb)Mapped ReadsCorrectly Mapped ReadsFalse PositiveFalse Negative
Num. (pairs)%Num. (pairs)%Num. (pairs)%Num. (pairs)%
80ZebrafishBismark (v0.8.1)-q –phred33-quals -n 3 -l 1647.80∼5.578,801,15088.2677,891,34687.25005,985,4226.70
BSMAP (v2.74)-s 16 -v 3 -p 1 -r 1 -R -u7.60∼4.384,439,55694.5870,308,94078.7500347,0160.39
WBSA-n 3 -l 16 -k 324.07∼4.384,776,39494.9680,698,42190.390010,1780.01
RiceBismark (v0.8.1)-q –phred33-quals -n 3 -l 1612.57∼1.52157094687.412126609686.180018712247.58
BSMAP (v2.74)-s 16 -v 3 -p 1 -r 1 -R -u1.18∼1.72341661194.892023590382.0000255590.10
WBSA-n 3 -l 16 -k 35.47∼1.22344216294.992328912494.370080
70ZebrafishBismark (v0.8.1)-q –phred33-quals -n 3 -l 1640.37∼4.37816039787.557706746786.320066261757.42
BSMAP (v2.74)-s 16 -v 3 -p 1 -r 1 -R -u11.45∼4.38438310194.527279000381.53004034710.45
WBSA-n 3 -l 16 -k 325.45∼4.38478656794.978469766294.870050
RiceBismark (v0.8.1)-q –phred33-quals -n 3 -l 1610.72∼1.22139036686.682103406185.240020518048.31
BSMAP (v2.74)-s 16 -v 3 -p 1 -r 1 -R -u1.03∼1.82342266594.921976019680.0700195050.08
WBSA-n 3 -l 16 -k 34.92∼1.22344216694.992312139593.690040
60ZebrafishBismark (v0.8.1)-q –phred33-quals -n 2 -l 1439.77∼5.17732501486.617600050885.130074615588.36
BSMAP (v2.74)-s 14 -v 2 -p 1 -r 1 -R -u8.05∼4.38424237794.367001729978.43005442280.61
WBSA-n 2 -l 14 -k 215.93∼4.38478657194.978406806194.160010
RiceBismark (v0.8.1)-q –phred33-quals -n 2 -l 149.53∼1.52115877285.742074198884.050022833989.25
BSMAP (v2.74)-s 14 -v 2 -p 1 -r 1 -R -u0.77∼1.72341252894.871916176577.6500296420.12
WBSA-n 2 -l 14 -k 23.94∼1.12344216894.992291045592.840020
Table 6

Comparison of mapping times and accuracies between WBSA, BSMAP, and Bismark for actual bisulfite sequencing data.

Data typeSpeciesSoftwareAlignment ParametersMapping Time (hours)RAM (Gb)Mapped ReadsUniquely Mapped Reads
Num.%Num.%
WGBSHumanBismark(v0.8.1)-q –phred33-quals -n 3 -l 16303.9∼10.616684983737.3315396981434.45
BSMAP(v2.74)-s 16 -v 3 -p 1 -r 1 -R -u42.73∼8.023813405453.2822093879349.43
WBSA-n 3 -l 16 -k 3113.20∼9.224083482553.8822219883249.71
RRBSMouseBismark(v0.8.1)-q –phred33-quals -n 2 -l 1422.65∼9.11760996385.301289316562.45
BSMAP(v2.74)-s 14 -v 2 -p 1 -r 1 -R -u3.93∼6.81248936260.50913779144.26
WBSA-n 2 -l 14 -k 25.14∼8.01325066864.19953382946.18
For mapping simulated WGBS paired-end data with different lengths, the three mapping methods had a false-positive rate of zero. BSMAP ran the fastest, followed by WBSA, and Bismark. However, WBSA produced the highest mapped rates, the correctly mapped rates, and the lowest false negative rates. The correctly mapped rate is the ratio of the correctly mapped simulated reads to the total simulated reads, and the false negative rate is the ratio of the simulated unmapped, nonrandom reads to total simulated reads. There was little difference in memory use among the methods (Table 4). For mapping simulated RRBS single-end data, memory use, mapping times, mapped rates, correctly mapped rates, false negative rates, false positive rates of the WBSA and BSMAP methods were similar. Each out-performed Bismark (Table 5).
Table 5

Comparison of mapping times and accuracies between WBSA, BSMAP, and Bismark for simulated RRBS data.

SpeciesSoftwareAlignment ParametersMapping Time (hours)RAM (Gb)Mapped ReadsCorrectly Mapped ReadsFalse PositiveFalse Negative
Num.%Num.%Num.%Num.%
HumanBismark (v0.8.1)-q –phred33-quals -n 2 -l 145.54∼10.51093092967.631084935967.137950530327731.04
BSMAP (v2.74)-s 14 -v 2 -p 1 -r 1 -R -u1.22∼7.51616177294.581248908873.09230716620.42
WBSA-n 2 -l 14 -k 21.42∼6.31622838994.971230237972.00264052860.03
MouseBismark (v0.8.1)-q –phred33-quals -n 2 -l 141.52∼7.1509959968.3506563367.872060.06199076826.67
BSMAP (v2.74)-s 14 -v 2 -p 1 -r 1 -R -u0.28∼6.8705410294.52560332875.0850360640.48
WBSA-n 2 -l 14 -k 20.63∼6.1708767594.97559494174.97510.0125370.03
We downloaded the actual WGBS data for human (SRX006782, 447M reads) and actual RRBS data for mouse (SRR001697, 21M reads) from the website of the United States National Center for Biotechnology Information (NCBI) to compare the mapped rates and uniquely mapped rates of WBSA with BSMAP and Bismark. The results show that mapped rates or uniquely mapped rates of WBSA were superior to that of BSMAP. The uniquely mapped rates of Bismark were the highest for the mouse dataset, but both rates were not comparable with those of WBSA or BSMAP for the human dataset. The mapping time and memory use for WBSA fell between those of BSMAP and Bismark (Table 6). Considering all of the above results, we conclude that the WBSA mapping method was more accurate and efficient than the other two methods.

2) Evaluation of the accuracy of WBSA analysis

To estimate the accuracies of the identification of methylation sites and the advanced analysis results generated by WBSA, we downloaded the published embryonic stem cell dataset from the NCBI website (SRA accessions SRX006239–41, 1.12 G reads). The data are derived from the report of Lister et al. [10], who presented the first genome-wide, single-base resolution maps of methylated cytosines in a mammalian genome from human embryonic stem cells and fetal fibroblasts. The entire analysis took about approximately five days, reads of three libraries were pre-processed as the same time first, then they were mapped simultaneously to the reference sequence, finally the combined data were further analyzed sequentially. We found that our annotation results were consistent with those of Lister et al. [10]. For example, the bisulfite conversion rate for WBSA and Lister et al. were 99.7% and 99.6%, respectively. This small difference may be accounted for by more extensive filtering by WBSA. For instance, post-analysis by WBSA filtered out the following: T-rich reads that mapped Cs to Ts in the reference genome; A-rich reads that mapped Gs to ‘A’s in the reference genome; T-rich reads that mapped to Crick strands of Cs that were converted to Ts or Watson strand Gs that were converted to ‘A’s, and A-rich reads that mapped to Watson strand Cs converted to Ts, or Crick strand Gs converted to ‘A’s. For the identified mCs, non-CGs accounted for approximately 25% of all mCs, and the number of mCHHs was the lowest, which is consistent with the published data (Figure 3a). We also observed that the distribution of mC for all chromosomes was almost the same shape as that published by Lister et al. (Figure 3b, Figure S1). Further, we did not detect local sequence enrichment for mCGs, but did find a preference for TA dinucleotides upstream of non-CG methylated regions. The base following a non-CG methylcytosine was most commonly an A, and a T was also observed frequently. This is the same as the preference in the paper (Figure 3c). The distribution of methylation levels shows that most of the CGs is highly methylated, consistent with results of Lister at al. (Figure 3d).
Figure 3

The performance of WBSA compared with a published study.

a. The percentage of methylcytosine identified in each sequence context. b. The methylcytosine density in Chr1. Each dot indicates the methylation density in a 10-kb window. c. Logo plots of sequences proximal to sites of DNA methylation in each sequence context. Logos are presented for all methylcytosines. Three or four bases flanking each methylcytosine context were analyzed to show the local sequence preference. d. Distribution of the methylation level in the CG context. The vertical axis indicates the fraction of methylated CGs for a corresponding methylation level (horizontal-axis) where the methylation level is defined as the mCG∶CG ratio at each reference cytosine in the CG context (at least 10× coverage is required).

The performance of WBSA compared with a published study.

a. The percentage of methylcytosine identified in each sequence context. b. The methylcytosine density in Chr1. Each dot indicates the methylation density in a 10-kb window. c. Logo plots of sequences proximal to sites of DNA methylation in each sequence context. Logos are presented for all methylcytosines. Three or four bases flanking each methylcytosine context were analyzed to show the local sequence preference. d. Distribution of the methylation level in the CG context. The vertical axis indicates the fraction of methylated CGs for a corresponding methylation level (horizontal-axis) where the methylation level is defined as the mCGCG ratio at each reference cytosine in the CG context (at least 10× coverage is required).

Conclusions

WBSA is an interactive web-based service that was designed for researchers who may not necessarily be familiar with post-analysis of bisulfite sequencing data or for those lacking local computing resources. WBSA is a free, accurate, comprehensive, and user-friendly tool for analyzing bisulfite sequencing data that integrates read-quality analysis, read preprocessing, read mapping, mC identification, and annotation analysis. WBSA focuses on CG and non-CG methylation, and can be applied to DNA methylation research for animal and plant genomes. WBSA is a highly automated package that can be run in a local cluster environment or on a standalone server. The methylcytosine density in all chromosomes. (TIF) Click here for additional data file.
Table 2

Comparison of WBSA's RRBS module with three pipelines.

FunctionsWBSA-RRBSSAAP-RRBSRRBS-analysermethylKit
Read-quality analysisYYYN
Filter adaptor & low qualityYYYN
Computation of conversion rateYNYN
AlignmentYYYN
Focus on non-CGsYNYN
Methylation levelYYYY
Methylation distributionYNYN
Relationship of methylation and CpG islandsYNNY
Gene annotationYYNY
Functional analysis of genes with high or low methylationYNNN
Sequence preferenceYNNN
Correlation between methylation and gene expressionYNNN
Online versionYYNY
Standalone versionYNYN
PBS versionYNNN
  30 in total

1.  WebLogo: a sequence logo generator.

Authors:  Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal:  Genome Res       Date:  2004-06       Impact factor: 9.043

2.  wapRNA: a web-based application for the processing of RNA sequences.

Authors:  Wenming Zhao; Wanfei Liu; Dongmei Tian; Bixia Tang; Yanqing Wang; Caixia Yu; Rujiao Li; Yunchao Ling; Jiayan Wu; Shuhui Song; Songnian Hu
Journal:  Bioinformatics       Date:  2011-09-06       Impact factor: 6.937

Review 3.  The epigenomics of cancer.

Authors:  Peter A Jones; Stephen B Baylin
Journal:  Cell       Date:  2007-02-23       Impact factor: 41.582

Review 4.  Cancer epigenomics: DNA methylomes and histone-modification maps.

Authors:  Manel Esteller
Journal:  Nat Rev Genet       Date:  2007-03-06       Impact factor: 53.242

5.  SAAP-RRBS: streamlined analysis and annotation pipeline for reduced representation bisulfite sequencing.

Authors:  Zhifu Sun; Saurabh Baheti; Sumit Middha; Rahul Kanwar; Yuji Zhang; Xing Li; Andreas S Beutler; Eric Klee; Yan W Asmann; E Aubrey Thompson; Jean-Pierre A Kocher
Journal:  Bioinformatics       Date:  2012-06-10       Impact factor: 6.937

6.  Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications.

Authors:  Felix Krueger; Simon R Andrews
Journal:  Bioinformatics       Date:  2011-04-14       Impact factor: 6.937

7.  Chromosome-wide DNA methylation analysis predicts human tissue-specific X inactivation.

Authors:  Allison M Cotton; Lucia Lam; Joslynn G Affleck; Ian M Wilson; Maria S Peñaherrera; Deborah E McFadden; Michael S Kobor; Wan L Lam; Wendy P Robinson; Carolyn J Brown
Journal:  Hum Genet       Date:  2011-05-20       Impact factor: 4.132

8.  QDMR: a quantitative method for identification of differentially methylated regions by entropy.

Authors:  Yan Zhang; Hongbo Liu; Jie Lv; Xue Xiao; Jiang Zhu; Xiaojuan Liu; Jianzhong Su; Xia Li; Qiong Wu; Fang Wang; Ying Cui
Journal:  Nucleic Acids Res       Date:  2011-02-08       Impact factor: 16.971

9.  COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis.

Authors:  Charles D Warden; Heehyoung Lee; Joshua D Tompkins; Xiaojin Li; Charles Wang; Arthur D Riggs; Hua Yu; Richard Jove; Yate-Ching Yuan
Journal:  Nucleic Acids Res       Date:  2013-04-17       Impact factor: 16.971

10.  GBSA: a comprehensive software for analysing whole genome bisulfite sequencing data.

Authors:  Touati Benoukraf; Sarawut Wongphayak; Luqman Hakim Abdul Hadi; Mengchu Wu; Richie Soong
Journal:  Nucleic Acids Res       Date:  2012-12-24       Impact factor: 16.971

View more
  10 in total

1.  MethBank 3.0: a database of DNA methylomes across a variety of species.

Authors:  Rujiao Li; Fang Liang; Mengwei Li; Dong Zou; Shixiang Sun; Yongbing Zhao; Wenming Zhao; Yiming Bao; Jingfa Xiao; Zhang Zhang
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

2.  Inferring and modeling inheritance of differentially methylated changes across multiple generations.

Authors:  Pascal Belleau; Astrid Deschênes; Marie-Pier Scott-Boyer; Romain Lambrot; Mathieu Dalvai; Sarah Kimmins; Janice Bailey; Arnaud Droit
Journal:  Nucleic Acids Res       Date:  2018-08-21       Impact factor: 16.971

Review 3.  DNA methylation analysis in plants: review of computational tools and future perspectives.

Authors:  Jimmy Omony; Thomas Nussbaumer; Ruben Gutzat
Journal:  Brief Bioinform       Date:  2020-05-21       Impact factor: 11.622

Review 4.  Profiling genome-wide DNA methylation.

Authors:  Wai-Shin Yong; Fei-Man Hsu; Pao-Yang Chen
Journal:  Epigenetics Chromatin       Date:  2016-06-29       Impact factor: 4.954

5.  TEA: the epigenome platform for Arabidopsis methylome study.

Authors:  Sheng-Yao Su; Shu-Hwa Chen; I-Hsuan Lu; Yih-Shien Chiang; Yu-Bin Wang; Pao-Yang Chen; Chung-Yen Lin
Journal:  BMC Genomics       Date:  2016-12-22       Impact factor: 3.969

6.  Crosstalk between epigenetic silencing and infection by tobacco rattle virus in Arabidopsis.

Authors:  Laura Diezma-Navas; Ana Pérez-González; Haydeé Artaza; Lola Alonso; Elena Caro; César Llave; Virginia Ruiz-Ferrer
Journal:  Mol Plant Pathol       Date:  2019-07-05       Impact factor: 5.663

Review 7.  Cell-Free DNA Methylation Profiling Analysis-Technologies and Bioinformatics.

Authors:  Jinyong Huang; Liang Wang
Journal:  Cancers (Basel)       Date:  2019-11-06       Impact factor: 6.639

Review 8.  Computational challenges in detection of cancer using cell-free DNA methylation.

Authors:  Madhu Sharma; Rohit Kumar Verma; Sunil Kumar; Vibhor Kumar
Journal:  Comput Struct Biotechnol J       Date:  2021-12-07       Impact factor: 7.271

9.  methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data.

Authors:  Kamal Kishore; Stefano de Pretis; Ryan Lister; Marco J Morelli; Valerio Bianchi; Bruno Amati; Joseph R Ecker; Mattia Pelizzola
Journal:  BMC Bioinformatics       Date:  2015-09-29       Impact factor: 3.169

10.  Genome-wide DNA methylome variation in two genetically distinct chicken lines using MethylC-seq.

Authors:  Jinxiu Li; Rujiao Li; Ying Wang; Xiaoxiang Hu; Yiqiang Zhao; Li Li; Chungang Feng; Xiaorong Gu; Fang Liang; Susan J Lamont; Songnian Hu; Huaijun Zhou; Ning Li
Journal:  BMC Genomics       Date:  2015-10-23       Impact factor: 3.969

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.