UNLABELLED: Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good quality and analyzable data is still lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting and visualization. This package facilitates a rapid transition from sequencing reads to a fully annotated CpG methylation report to biological interpretation. AVAILABILITY AND IMPLEMENTATION: SAAP-RRBS is freely available to non-commercial users at the web site http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.
UNLABELLED: Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good quality and analyzable data is still lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting and visualization. This package facilitates a rapid transition from sequencing reads to a fully annotated CpG methylation report to biological interpretation. AVAILABILITY AND IMPLEMENTATION:SAAP-RRBS is freely available to non-commercial users at the web site http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.
DNA methylation is one of the major epigenetic control mechanisms for gene regulation, cellular differentiation, embryogenesis, X chromosome inactivation, genomic imprinting and tumorigenesis (Laird, 2003). Reduced representation bisulfite sequencing (RRBS) has been demonstrated to be a cost-effective approach to analyze DNA methylation pattern in CpG dense regions of the genome (Bock ; Harris ; Wang ). RRBS data analysis is challenging due to the bisulfite-mediated C to T conversion. Several specific RRBS sequence read alignment programs have been developed, which include BSMAP/RRBSMAP (Xi and Li, 2009; Xi ), BISMARK (Krueger and Andrews, 2011), BSSEEKER (Chen ) and Illumina's bisulfite sequencing software (http://www.illumina.com/applications/epigenetics.ilmn). Each aligner uses a different strategy for bisulfite-converted read mapping. BISMARK and BSSEEKER convert reference genome (C to T for + and G to A for - strand) and reads (C to T) for alignment. Illumina's pipeline does not align to the whole genome; rather, digitally digested MspI fragments are generated as multi-fasta files, which then undergo C/T and G/A conversion. BSMAP/RRBSMAP does not need the reference genome preparation step but everything is handled internally using bitwise masking. In addition to alignment, other steps are equally important to get high-quality and analyzable data for downstream analyses. The MspI digestion, bisulfite treatment and subsequent size selection tend to generate a higher portion of low-quality or adapter-contaminated reads; accurate extraction of CpG methylation status from alignment is not implemented in alignment programs or with limited function and more importantly, CpG methylation is highly context specific in the genome and it would be impossible to interpret the data without annotation. Herein, we present a comprehensive workflow that combines all these steps into a single application that can be run on a high-performance computing cluster or on an individual workstation. The workflow accommodates aligned BAM files from different aligners or conduct annotation alone. Moreover, the workflow can be easily extended to analyze whole genome bisulfite sequencing.
2 METHODS
2.1 Architecture and implementation
The SAAP-RRBS consists of four main modules: (1) sequence read assessment and clean-up; (2) alignment to reference genome; (3) methylation status extraction and (4) CpG reporting and annotation (Supplementary Fig. S1). These modules use a suite of public and in-house developed tools using Perl, Python, Java and C.
2.2 Sequence read quality assessment and trimming
SAAP-RRBS first removes low-quality bases and adapter sequences from either end of a read (Martin, 2011). Resulting reads that are shorter than a specified length are discarded to reduce non-unique mapping. The FastQC tool is incorporated to report summarized metrics on the quality of the reads.
2.3 Alignment to reference genome
BSMAP version 2.43 is integrated into the pipeline as the default aligner. For RRBS, -D C-CGG option is turned on for RRBSMAP alignment mode. Note that RRBSMAP was integrated into BSMAP since version 2.0. Early versions of BSMAP are very slow although alignment accuracy is comparable to other tools (Chatterjee ; Chen ; Krueger ). The introduction of RRBSMAP has significantly increased performance as the program restricts alignment to genomic locations with MspI cut sites (CCGG) (Xi ). We have compared RRBSMAP with three other alignment programs using simulated and real data, and the results are presented in Supplementary Tables S1 and S2 and Figure S2. RRBSMAP shows the highest mapping rate and accuracy, with the similar speed as BSSEEKER. The estimated methylation ratio from RRBSMAP is highly correlated with that from other aligners (R2 > 0.99). In addition, RRBSMAP does not need to prepare reference genome and can be used with both single- and pair-end sequencing data.
2.4 Methylation status extraction
The BAM file is parsed to extract methylation status of C in CpG context. For reads mapped to the forward strand, the total numbers of C and T at the C position of a CpG in the reference are counted and a methylation ratio at that position is calculated by dividing the total number of C by the total number of C plus T (Cs/(Cs + Ts)). For reads mapped to the reverse strand, the numbers of G and A at the G position (complementary C at the reverse strand) of the CpG are counted to obtain the complementary methylation ratio (Gs/(Gs + As)).
2.5 CpG annotation
The annotation module provides each CpG with rich genomic contextual information, including whether it is in exons, within a gene, in a CpG island and distance to transcription start site (TSS). The annotation module also searches the dbSNP database and marks the C position with known single nucleotide polymorphism (SNP) and alternative allele. To speed up the process, annotations for all CpGs in the genome are pre-computed in the initial setup.
2.6 Final report and data visualization
The pipeline generates two main reports for end users: (1) a summary report for all samples in a run, including QC metrics, summary statistics for each sample and links to more detailed reports and (2) a methylation data report with annotations for each sample. This report contains methylation data for CpGs with coverage ≥10× and base quality score ≥20. Each reported CpG site is dynamically linked to a local instance of the Integrative Genomics Viewer (IGV) for read- and base-level data visualization.
3 RESULTS AND CONCLUSIONS
3.1 Performance and benchmark measurements
SAAP-RRBS is most suitable for a cluster environment where multiple samples can be run in parallel. It takes ∼ 4–6 h to complete the pipeline for a sample (MCF7 breast cancer cell line) with ∼ 50 millions of reads in a Linux platform with CPU at 2.67 GHz, 8 cores and >100 GB RAM. The resulting methylation ratio is highly correlated with the results from BISMARK, BSSEEKER and the Illumina pipeline (R2 > 0.99) as well as from Illumina methylation27k microarray (R2 > 0.9; Supplementary Fig. S2). Some well-known methylation patterns in the cell line are also observed from the data such as promoter hypermethylation of WT1, HOXA5, IRX1 and PAX7 (Supplementary Fig. S3).
3.2 Summary report
The summary report contains a link to the FastQC report and summary statistics such as total reads passing filters, mapped reads, mapping rate, bisulfite conversion rate, captured Cs in CpG context, Cs with strand-specific coverage ≥10× and base quality ≥20 and CpGs that overlap with dbSNP, CpGs within CpG islands and CpGs with genes (Supplementary Fig. S4).
3.3 Annotated report
The final report contains strand-specific CpG methylation and detailed annotations for each C position. The number of methylated Cs, total number of Cs and methylation ratio are reported, along with information of the CpG relative to genomic features such as nearby gene, distance to TSS, overlapping SNP, alternative allele and a CpG island. In the report, users can click a hyperlink for a CpG site to open IGV and view the aligned reads and base level information (Supplementary Fig. S5).RRBS has become a popular and efficient way to profile genome-wide methylation patterns for novel discoveries in biomedical research. However, the high-throughput and complex data need a fast and convenient bioinformatics tool to process the data into interpretable formats. We have developed a comprehensive pipeline that integrates necessary analyses into a single package. The pipeline is modular, which allows users to use their own alignment tool. Users can also provide pre-computed methylation data with chromosome coordinates and run the workflow in an annotation-only module. The pipeline is highly automated and can be run in a single machine or within a cluster environment where many samples can be processed in parallel for increased performance.
Authors: Christoph Bock; Eleni M Tomazou; Arie B Brinkman; Fabian Müller; Femke Simmer; Hongcang Gu; Natalie Jäger; Andreas Gnirke; Hendrik G Stunnenberg; Alexander Meissner Journal: Nat Biotechnol Date: 2010-09-19 Impact factor: 54.908
Authors: R Alan Harris; Ting Wang; Cristian Coarfa; Raman P Nagarajan; Chibo Hong; Sara L Downey; Brett E Johnson; Shaun D Fouse; Allen Delaney; Yongjun Zhao; Adam Olshen; Tracy Ballinger; Xin Zhou; Kevin J Forsberg; Junchen Gu; Lorigail Echipare; Henriette O'Geen; Ryan Lister; Mattia Pelizzola; Yuanxin Xi; Charles B Epstein; Bradley E Bernstein; R David Hawkins; Bing Ren; Wen-Yu Chung; Hongcang Gu; Christoph Bock; Andreas Gnirke; Michael Q Zhang; David Haussler; Joseph R Ecker; Wei Li; Peggy J Farnham; Robert A Waterland; Alexander Meissner; Marco A Marra; Martin Hirst; Aleksandar Milosavljevic; Joseph F Costello Journal: Nat Biotechnol Date: 2010-09-19 Impact factor: 54.908
Authors: Bradley W Anderson; Yun-Suhk Suh; Boram Choi; Hyuk-Joon Lee; Tracy C Yab; William R Taylor; Brian A Dukek; Calise K Berger; Xiaoming Cao; Patrick H Foote; Mary E Devens; Lisa A Boardman; John B Kisiel; Douglas W Mahoney; Seth W Slettedahl; Hatim T Allawi; Graham P Lidgard; Thomas C Smyrk; Han-Kwang Yang; David A Ahlquist Journal: Clin Cancer Res Date: 2018-05-29 Impact factor: 12.531
Authors: Carolina A Bonin; Eric A Lewallen; Saurabh Baheti; Elizabeth W Bradley; Michael J Stuart; Daniel J Berry; Andre J van Wijnen; Jennifer J Westendorf Journal: Gene Date: 2015-10-17 Impact factor: 3.688
Authors: Matthew M Roforth; Joshua N Farr; Koji Fujita; Louise K McCready; Elizabeth J Atkinson; Terry M Therneau; Julie M Cunningham; Matthew T Drake; David G Monroe; Sundeep Khosla Journal: Bone Date: 2015-03-27 Impact factor: 4.398
Authors: Ryan A Hlady; Aishwarya Sathyanarayan; Joyce J Thompson; Dan Zhou; Qunfeng Wu; Kien Pham; Jeong-Heon Lee; Chen Liu; Keith D Robertson Journal: Hepatology Date: 2019-01-05 Impact factor: 17.425
Authors: Dan Zhou; Ryan A Hlady; Marissa J Schafer; Thomas A White; Chen Liu; Jeong-Hyeon Choi; Jordan D Miller; Lewis R Roberts; Nathan K LeBrasseur; Keith D Robertson Journal: Epigenetics Date: 2016-11-18 Impact factor: 4.528
Authors: Samantha E Day; Richard L Coletta; Joon Young Kim; Luis A Garcia; Latoya E Campbell; Tonya R Benjamin; Lori R Roust; Elena A De Filippis; Lawrence J Mandarino; Dawn K Coletta Journal: Epigenetics Date: 2017-01-20 Impact factor: 4.528
Authors: John B Kisiel; Massimo Raimondo; William R Taylor; Tracy C Yab; Douglas W Mahoney; Zhifu Sun; Sumit Middha; Saurabh Baheti; Hongzhi Zou; Thomas C Smyrk; Lisa A Boardman; Gloria M Petersen; David A Ahlquist Journal: Clin Cancer Res Date: 2015-05-28 Impact factor: 12.531