Literature DB >> 22689387

SAAP-RRBS: streamlined analysis and annotation pipeline for reduced representation bisulfite sequencing.

Zhifu Sun1, Saurabh Baheti, Sumit Middha, Rahul Kanwar, Yuji Zhang, Xing Li, Andreas S Beutler, Eric Klee, Yan W Asmann, E Aubrey Thompson, Jean-Pierre A Kocher.   

Abstract

UNLABELLED: Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good quality and analyzable data is still lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting and visualization. This package facilitates a rapid transition from sequencing reads to a fully annotated CpG methylation report to biological interpretation.
AVAILABILITY AND IMPLEMENTATION: SAAP-RRBS is freely available to non-commercial users at the web site http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.

Entities:  

Mesh:

Substances:

Year:  2012        PMID: 22689387      PMCID: PMC3413387          DOI: 10.1093/bioinformatics/bts337

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

DNA methylation is one of the major epigenetic control mechanisms for gene regulation, cellular differentiation, embryogenesis, X chromosome inactivation, genomic imprinting and tumorigenesis (Laird, 2003). Reduced representation bisulfite sequencing (RRBS) has been demonstrated to be a cost-effective approach to analyze DNA methylation pattern in CpG dense regions of the genome (Bock ; Harris ; Wang ). RRBS data analysis is challenging due to the bisulfite-mediated C to T conversion. Several specific RRBS sequence read alignment programs have been developed, which include BSMAP/RRBSMAP (Xi and Li, 2009; Xi ), BISMARK (Krueger and Andrews, 2011), BSSEEKER (Chen ) and Illumina's bisulfite sequencing software (http://www.illumina.com/applications/epigenetics.ilmn). Each aligner uses a different strategy for bisulfite-converted read mapping. BISMARK and BSSEEKER convert reference genome (C to T for + and G to A for - strand) and reads (C to T) for alignment. Illumina's pipeline does not align to the whole genome; rather, digitally digested MspI fragments are generated as multi-fasta files, which then undergo C/T and G/A conversion. BSMAP/RRBSMAP does not need the reference genome preparation step but everything is handled internally using bitwise masking. In addition to alignment, other steps are equally important to get high-quality and analyzable data for downstream analyses. The MspI digestion, bisulfite treatment and subsequent size selection tend to generate a higher portion of low-quality or adapter-contaminated reads; accurate extraction of CpG methylation status from alignment is not implemented in alignment programs or with limited function and more importantly, CpG methylation is highly context specific in the genome and it would be impossible to interpret the data without annotation. Herein, we present a comprehensive workflow that combines all these steps into a single application that can be run on a high-performance computing cluster or on an individual workstation. The workflow accommodates aligned BAM files from different aligners or conduct annotation alone. Moreover, the workflow can be easily extended to analyze whole genome bisulfite sequencing.

2 METHODS

2.1 Architecture and implementation

The SAAP-RRBS consists of four main modules: (1) sequence read assessment and clean-up; (2) alignment to reference genome; (3) methylation status extraction and (4) CpG reporting and annotation (Supplementary Fig. S1). These modules use a suite of public and in-house developed tools using Perl, Python, Java and C.

2.2 Sequence read quality assessment and trimming

SAAP-RRBS first removes low-quality bases and adapter sequences from either end of a read (Martin, 2011). Resulting reads that are shorter than a specified length are discarded to reduce non-unique mapping. The FastQC tool is incorporated to report summarized metrics on the quality of the reads.

2.3 Alignment to reference genome

BSMAP version 2.43 is integrated into the pipeline as the default aligner. For RRBS, -D C-CGG option is turned on for RRBSMAP alignment mode. Note that RRBSMAP was integrated into BSMAP since version 2.0. Early versions of BSMAP are very slow although alignment accuracy is comparable to other tools (Chatterjee ; Chen ; Krueger ). The introduction of RRBSMAP has significantly increased performance as the program restricts alignment to genomic locations with MspI cut sites (CCGG) (Xi ). We have compared RRBSMAP with three other alignment programs using simulated and real data, and the results are presented in Supplementary Tables S1 and S2 and Figure S2. RRBSMAP shows the highest mapping rate and accuracy, with the similar speed as BSSEEKER. The estimated methylation ratio from RRBSMAP is highly correlated with that from other aligners (R2 > 0.99). In addition, RRBSMAP does not need to prepare reference genome and can be used with both single- and pair-end sequencing data.

2.4 Methylation status extraction

The BAM file is parsed to extract methylation status of C in CpG context. For reads mapped to the forward strand, the total numbers of C and T at the C position of a CpG in the reference are counted and a methylation ratio at that position is calculated by dividing the total number of C by the total number of C plus T (Cs/(Cs + Ts)). For reads mapped to the reverse strand, the numbers of G and A at the G position (complementary C at the reverse strand) of the CpG are counted to obtain the complementary methylation ratio (Gs/(Gs + As)).

2.5 CpG annotation

The annotation module provides each CpG with rich genomic contextual information, including whether it is in exons, within a gene, in a CpG island and distance to transcription start site (TSS). The annotation module also searches the dbSNP database and marks the C position with known single nucleotide polymorphism (SNP) and alternative allele. To speed up the process, annotations for all CpGs in the genome are pre-computed in the initial setup.

2.6 Final report and data visualization

The pipeline generates two main reports for end users: (1) a summary report for all samples in a run, including QC metrics, summary statistics for each sample and links to more detailed reports and (2) a methylation data report with annotations for each sample. This report contains methylation data for CpGs with coverage ≥10× and base quality score ≥20. Each reported CpG site is dynamically linked to a local instance of the Integrative Genomics Viewer (IGV) for read- and base-level data visualization.

3 RESULTS AND CONCLUSIONS

3.1 Performance and benchmark measurements

SAAP-RRBS is most suitable for a cluster environment where multiple samples can be run in parallel. It takes ∼ 4–6 h to complete the pipeline for a sample (MCF7 breast cancer cell line) with ∼ 50 millions of reads in a Linux platform with CPU at 2.67 GHz, 8 cores and >100 GB RAM. The resulting methylation ratio is highly correlated with the results from BISMARK, BSSEEKER and the Illumina pipeline (R2 > 0.99) as well as from Illumina methylation27k microarray (R2 > 0.9; Supplementary Fig. S2). Some well-known methylation patterns in the cell line are also observed from the data such as promoter hypermethylation of WT1, HOXA5, IRX1 and PAX7 (Supplementary Fig. S3).

3.2 Summary report

The summary report contains a link to the FastQC report and summary statistics such as total reads passing filters, mapped reads, mapping rate, bisulfite conversion rate, captured Cs in CpG context, Cs with strand-specific coverage ≥10× and base quality ≥20 and CpGs that overlap with dbSNP, CpGs within CpG islands and CpGs with genes (Supplementary Fig. S4).

3.3 Annotated report

The final report contains strand-specific CpG methylation and detailed annotations for each C position. The number of methylated Cs, total number of Cs and methylation ratio are reported, along with information of the CpG relative to genomic features such as nearby gene, distance to TSS, overlapping SNP, alternative allele and a CpG island. In the report, users can click a hyperlink for a CpG site to open IGV and view the aligned reads and base level information (Supplementary Fig. S5). RRBS has become a popular and efficient way to profile genome-wide methylation patterns for novel discoveries in biomedical research. However, the high-throughput and complex data need a fast and convenient bioinformatics tool to process the data into interpretable formats. We have developed a comprehensive pipeline that integrates necessary analyses into a single package. The pipeline is modular, which allows users to use their own alignment tool. Users can also provide pre-computed methylation data with chromosome coordinates and run the workflow in an annotation-only module. The pipeline is highly automated and can be run in a single machine or within a cluster environment where many samples can be processed in parallel for increased performance.
  10 in total

1.  RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing.

Authors:  Yuanxin Xi; Christoph Bock; Fabian Müller; Deqiang Sun; Alexander Meissner; Wei Li
Journal:  Bioinformatics       Date:  2011-12-06       Impact factor: 6.937

2.  Systematic assessment of reduced representation bisulfite sequencing to human blood samples: A promising method for large-sample-scale epigenomic studies.

Authors:  Li Wang; Jihua Sun; Honglong Wu; Siyang Liu; Junwen Wang; Boxin Wu; Shujia Huang; Ning Li; Jun Wang; Xiuqing Zhang
Journal:  J Biotechnol       Date:  2011-07-06       Impact factor: 3.307

Review 3.  The power and the promise of DNA methylation markers.

Authors:  Peter W Laird
Journal:  Nat Rev Cancer       Date:  2003-04       Impact factor: 60.716

Review 4.  DNA methylome analysis using short bisulfite sequencing data.

Authors:  Felix Krueger; Benjamin Kreck; Andre Franke; Simon R Andrews
Journal:  Nat Methods       Date:  2012-01-30       Impact factor: 28.547

5.  BS Seeker: precise mapping for bisulfite sequencing.

Authors:  Pao-Yang Chen; Shawn J Cokus; Matteo Pellegrini
Journal:  BMC Bioinformatics       Date:  2010-04-23       Impact factor: 3.169

6.  Quantitative comparison of genome-wide DNA methylation mapping technologies.

Authors:  Christoph Bock; Eleni M Tomazou; Arie B Brinkman; Fabian Müller; Femke Simmer; Hongcang Gu; Natalie Jäger; Andreas Gnirke; Hendrik G Stunnenberg; Alexander Meissner
Journal:  Nat Biotechnol       Date:  2010-09-19       Impact factor: 54.908

7.  Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications.

Authors:  R Alan Harris; Ting Wang; Cristian Coarfa; Raman P Nagarajan; Chibo Hong; Sara L Downey; Brett E Johnson; Shaun D Fouse; Allen Delaney; Yongjun Zhao; Adam Olshen; Tracy Ballinger; Xin Zhou; Kevin J Forsberg; Junchen Gu; Lorigail Echipare; Henriette O'Geen; Ryan Lister; Mattia Pelizzola; Yuanxin Xi; Charles B Epstein; Bradley E Bernstein; R David Hawkins; Bing Ren; Wen-Yu Chung; Hongcang Gu; Christoph Bock; Andreas Gnirke; Michael Q Zhang; David Haussler; Joseph R Ecker; Wei Li; Peggy J Farnham; Robert A Waterland; Alexander Meissner; Marco A Marra; Martin Hirst; Aleksandar Milosavljevic; Joseph F Costello
Journal:  Nat Biotechnol       Date:  2010-09-19       Impact factor: 54.908

8.  Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications.

Authors:  Felix Krueger; Simon R Andrews
Journal:  Bioinformatics       Date:  2011-04-14       Impact factor: 6.937

9.  Comparison of alignment software for genome-wide bisulphite sequence data.

Authors:  Aniruddha Chatterjee; Peter A Stockwell; Euan J Rodger; Ian M Morison
Journal:  Nucleic Acids Res       Date:  2012-02-16       Impact factor: 16.971

10.  BSMAP: whole genome bisulfite sequence MAPping program.

Authors:  Yuanxin Xi; Wei Li
Journal:  BMC Bioinformatics       Date:  2009-07-27       Impact factor: 3.169

  10 in total
  32 in total

1.  DNA methylation patterns in human iPSC-derived sensory neuronal differentiation.

Authors:  Soneela Ankam; Amandine Rovini; Saurabh Baheti; Ron Hrstka; Yanhong Wu; Kiley Schmidt; Hailong Wang; Nicolas Madigan; Lena-Sophie Koenig; Kimberly Stelzig; Zachary Resch; Christopher J Klein; Zhifu Sun; Nathan P Staff
Journal:  Epigenetics       Date:  2019-06-06       Impact factor: 4.528

2.  Detection of Gastric Cancer with Novel Methylated DNA Markers: Discovery, Tissue Validation, and Pilot Testing in Plasma.

Authors:  Bradley W Anderson; Yun-Suhk Suh; Boram Choi; Hyuk-Joon Lee; Tracy C Yab; William R Taylor; Brian A Dukek; Calise K Berger; Xiaoming Cao; Patrick H Foote; Mary E Devens; Lisa A Boardman; John B Kisiel; Douglas W Mahoney; Seth W Slettedahl; Hatim T Allawi; Graham P Lidgard; Thomas C Smyrk; Han-Kwang Yang; David A Ahlquist
Journal:  Clin Cancer Res       Date:  2018-05-29       Impact factor: 12.531

3.  Identification of differentially methylated regions in new genes associated with knee osteoarthritis.

Authors:  Carolina A Bonin; Eric A Lewallen; Saurabh Baheti; Elizabeth W Bradley; Michael J Stuart; Daniel J Berry; Andre J van Wijnen; Jennifer J Westendorf
Journal:  Gene       Date:  2015-10-17       Impact factor: 3.688

4.  Global transcriptional profiling using RNA sequencing and DNA methylation patterns in highly enriched mesenchymal cells from young versus elderly women.

Authors:  Matthew M Roforth; Joshua N Farr; Koji Fujita; Louise K McCready; Elizabeth J Atkinson; Terry M Therneau; Julie M Cunningham; Matthew T Drake; David G Monroe; Sundeep Khosla
Journal:  Bone       Date:  2015-03-27       Impact factor: 4.398

5.  Integrating the Epigenome to Identify Drivers of Hepatocellular Carcinoma.

Authors:  Ryan A Hlady; Aishwarya Sathyanarayan; Joyce J Thompson; Dan Zhou; Qunfeng Wu; Kien Pham; Jeong-Heon Lee; Chen Liu; Keith D Robertson
Journal:  Hepatology       Date:  2019-01-05       Impact factor: 17.425

6.  High fat diet and exercise lead to a disrupted and pathogenic DNA methylome in mouse liver.

Authors:  Dan Zhou; Ryan A Hlady; Marissa J Schafer; Thomas A White; Chen Liu; Jeong-Hyeon Choi; Jordan D Miller; Lewis R Roberts; Nathan K LeBrasseur; Keith D Robertson
Journal:  Epigenetics       Date:  2016-11-18       Impact factor: 4.528

7.  Genome-Wide Epigenetic Studies in Human Disease: A Primer on -Omic Technologies.

Authors:  Huihuang Yan; Shulan Tian; Susan L Slager; Zhifu Sun; Tamas Ordog
Journal:  Am J Epidemiol       Date:  2015-12-30       Impact factor: 4.897

8.  Potential epigenetic biomarkers of obesity-related insulin resistance in human whole-blood.

Authors:  Samantha E Day; Richard L Coletta; Joon Young Kim; Luis A Garcia; Latoya E Campbell; Tonya R Benjamin; Lori R Roust; Elena A De Filippis; Lawrence J Mandarino; Dawn K Coletta
Journal:  Epigenetics       Date:  2017-01-20       Impact factor: 4.528

9.  New DNA Methylation Markers for Pancreatic Cancer: Discovery, Tissue Validation, and Pilot Testing in Pancreatic Juice.

Authors:  John B Kisiel; Massimo Raimondo; William R Taylor; Tracy C Yab; Douglas W Mahoney; Zhifu Sun; Sumit Middha; Saurabh Baheti; Hongzhi Zou; Thomas C Smyrk; Lisa A Boardman; Gloria M Petersen; David A Ahlquist
Journal:  Clin Cancer Res       Date:  2015-05-28       Impact factor: 12.531

10.  Aberrant signature methylome by DNMT1 hot spot mutation in hereditary sensory and autonomic neuropathy 1E.

Authors:  Zhifu Sun; Yanhong Wu; Tamas Ordog; Saurabh Baheti; Jinfu Nie; Xiaohui Duan; Kaori Hojo; Jean-Pierre Kocher; Peter J Dyck; Christopher J Klein
Journal:  Epigenetics       Date:  2014-07-07       Impact factor: 4.528

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.