BS-SNPer: SNP calling in bisulfite-seq data.

Shengjie Gao, Dan Zou, Likai Mao, Huayu Liu, Pengfei Song, Youguo Chen, Shancen Zhao, Changduo Gao, Xiangchun Li, Zhibo Gao, Xiaodong Fang, Huanming Yang, Torben F Ørntoft, Karina D Sørensen, Lars Bolund.

Abstract

Sodium bisulfite conversion followed by sequencing (BS-Seq, such as whole genome bisulfite sequencing or reduced representation bisulfite sequencing) has become popular for studying human epigenetic profiles. Identifying single nucleotide polymorphisms (SNPs) is important for quantifying methylation levels and for studying allele-specific epigenetic events such as imprinting. However, SNP calling in such data is complex and time consuming. Here, we present an ultrafast and memory-efficient package named BS-SNPer for the exploration of SNP sites from BS-Seq data. Compared with Bis-SNP, a popular BS-Seq specific SNP caller, BS-SNPer is over 100 times faster and uses less memory. BS-SNPer also offers higher sensitivity and specificity compared with existing methods.
AVAILABILITY AND IMPLEMENTATION: BS-SNPer is written in C++ and Perl, and is freely available at https://github.com/hellbelly/BS-Snper.
© The Author 2015. Published by Oxford University Press. All rights reserved.

Year:  2015        PMID: 26319221      PMCID: PMC4673977          DOI: 10.1093/bioinformatics/btv507

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Sodium bisulfite conversion followed by sequencing (BS-Seq) has become popular for studying human epigenetic profiles (Laird, 2010). Whole genome bisulfite sequencing detects 90% of methylation events and SNPs in the nuclear DNA (Lister et al., 2009). However, its cost remains too high for most laboratories. Reduced representation bisulfite sequencing (RRBS) is a more efficient and economical way to monitor genome-wide promoter and CpG island methylation status (Laird, 2010), and it has become a popular method for investigating epigenetic changes in clinical tissue samples (Berman et al., 2011; Hansen et al.). SNP identification is important for identifying allele-specific epigenetic events such as gamete-specific and genetic imprinting (Reik and Walter, 2001; Wilkins, 2005).
However, SNP calling from BS-Seq data has been shown to be complicated. One problem is that reads from the two genomic strands are not complementary at methylated loci. Two methods are widely used to solve this problem. One is to align the reads in a three-letter space; the other is a wildcard algorithm that accounts for the C/T conversions. Many software packages have been developed based on these two methods. Bismark (Krueger and Andrews, 2011), MethylCoder (Pedersen et al., 2011), BRAT-BW (Harris et al., 2012) and BS Seeker (Chen et al., 2010) are based on the Burrows–Wheeler transform. Bismark and BS Seeker use Bowtie (Langmead and Salzberg, 2012) to align the reads in a three-letter space. BSMAP (Xi and Li, 2009) uses a wildcard algorithm. Another problem is that true C/T SNPs in samples cannot be distinguished from C/T substitutions caused by bisulfite conversion and thus could be misidentified as unmethylated Cs (Liu et al.). Given that over two-thirds of all SNPs occur in CpG context (Tomso and Bell, 2003), sequence variation needs to be addressed as an important error source.
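The two alignment strategies mentioned above can be illustrated with a minimal Python sketch. It is not taken from Bismark, BS Seeker or BSMAP; the function names `to_three_letter` and `wildcard_match` are hypothetical, and only Watson-strand style C-to-T conversion (or the mirrored G-to-A case) is modelled.

```python
def to_three_letter(seq, strand="W"):
    """Collapse a sequence into a three-letter alphabet.

    Watson-strand reads lose the C/T distinction (C -> T); Crick-strand
    reads, expressed in Watson coordinates, lose G/A (G -> A).
    Applying the same conversion to read and reference lets an ordinary
    aligner compare them.
    """
    if strand == "W":
        return seq.replace("C", "T")
    return seq.replace("G", "A")


def wildcard_match(read, ref):
    """Wildcard comparison for a Watson-strand read.

    A T in the read may match either T or C in the reference, because
    the T could be a bisulfite-converted C. A C in the read only
    matches C in the reference.
    """
    if len(read) != len(ref):
        return False
    return all(r == g or (r == "T" and g == "C") for r, g in zip(read, ref))


if __name__ == "__main__":
    reference = "ACGTC"
    read = "ATGTT"  # both Cs converted to T
    print(to_three_letter(read) == to_three_letter(reference))
    print(wildcard_match(read, reference))
```

In the three-letter approach information is discarded on both sides before alignment, whereas the wildcard approach keeps the reference intact and relaxes the match rule only in the direction that bisulfite conversion can produce.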
After alignment, the real methylation status can be recovered, because an apparent C/T substitution on the Watson strand and an apparent G/A substitution on the Crick strand should be interpreted as conversion of an unmethylated cytosine rather than as a sequence variant (see Supplementary Fig. S1). Thus, the Watson and Crick strands should be treated independently to reduce such errors. To our knowledge, Bis-SNP is a popular SNP calling tool for BS-Seq data. Another BS-Seq specific SNP caller, MethylExtract, is slightly faster than Bis-SNP (Barturen et al.). However, Bis-SNP performs better in sensitivity and accuracy. Here, we present BS-SNPer, a program for BS-Seq variation detection from alignments in standard BAM/SAM format (Li et al.). We implemented a novel algorithm (called the ‘dynamic matrix algorithm’, see text below) and approximate Bayesian modeling to improve the performance. Using published RRBS data, BS-SNPer showed higher specificity and sensitivity with a lower memory requirement and was over 100 times faster than Bis-SNP.
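The reason for treating the two strands independently can be sketched as follows. This is a simplified illustration of the general principle, not BS-SNPer's actual rule set: at a reference C, T reads on the Watson strand are ambiguous, so only Crick-strand reads (which show G or A at that position in Watson coordinates) are used to decide. The function name, the dictionary layout and the reuse of 0.1 as a threshold are assumptions made for the example.

```python
def classify_reference_c(watson_counts, crick_counts, min_alt_fraction=0.1):
    """Classify a site whose reference base is C (Watson coordinates).

    watson_counts / crick_counts map an observed base to a read count,
    both expressed in Watson-strand coordinates. Watson-strand T reads
    are ignored because they may come from bisulfite conversion.
    """
    informative = crick_counts.get("G", 0) + crick_counts.get("A", 0)
    if informative == 0:
        return "undetermined"  # no Crick-strand evidence at this site
    alt_fraction = crick_counts.get("A", 0) / informative
    if alt_fraction >= min_alt_fraction:
        return "candidate C>T SNP"
    return "no SNP (Watson T reads attributed to conversion)"


if __name__ == "__main__":
    # Unmethylated C: Watson reads show T, Crick reads still show G.
    print(classify_reference_c({"T": 12}, {"G": 10}))
    # Heterozygous C>T SNP: Crick reads show a mixture of G and A.
    print(classify_reference_c({"T": 9, "C": 3}, {"G": 6, "A": 5}))
```

The mirrored case (reference G, ambiguous A reads on the Crick strand, Watson reads informative) follows by swapping the roles of the strands.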

2 Methods

Two steps are implemented to obtain the final SNP set. In the first step, a candidate SNP set is obtained from alignments, usually in the BAM/SAM format, using a novel method, the ‘dynamic matrix algorithm’. In the second step, the candidate set is converted to the final SNP set using a Bayesian model that considers alignment quality and read support.

Step 1. Dynamic matrix algorithm

Alignments are filtered based on sequencing quality, mapping quality and mismatch rates. Mutations are removed if their frequencies are lower than a certain threshold (default 0.1). The formulae to calculate the frequencies are listed in Supplementary Table S1. For the remaining candidate SNP set, positions, reference bases, the numbers of supporting reads and average sequencing quality for all four bases in both Watson and Crick strands are recorded. In order to improve memory and computation efficiency, these data are dynamically allocated and freed for each chromosome. For each position in a chromosome, the data are stored in the form of a vector; thus the data of a chromosome are stored in a matrix. The content and size of the matrix change with the chromosomes. The method is thus called ‘dynamic matrix algorithm’.
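The per-chromosome storage scheme described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the C++/Perl implementation: the names `column`, `candidate_sites` and `call_all` are hypothetical, observations are assumed to be pre-filtered `(position, strand, base, quality)` tuples, and the naive non-reference fraction stands in for the strand-aware frequency formulae of Supplementary Table S1, which are not reproduced here.

```python
BASES = "ACGT"
STRANDS = "WC"  # W = Watson, C = Crick


def column(strand, base):
    """Index of a (strand, base) pair in the 8-element position vector."""
    return STRANDS.index(strand) * 4 + BASES.index(base)


def candidate_sites(reference, observations, min_freq=0.1):
    """Build the matrix for ONE chromosome and return candidate sites.

    Each position owns a vector of 8 read counts and 8 quality sums
    (4 bases x 2 strands); the list of vectors is the chromosome matrix.
    """
    counts = [[0] * 8 for _ in reference]
    qual_sums = [[0] * 8 for _ in reference]
    for pos, strand, base, qual in observations:
        col = column(strand, base)
        counts[pos][col] += 1
        qual_sums[pos][col] += qual

    sites = []
    for pos, ref_base in enumerate(reference):
        depth = sum(counts[pos])
        if depth == 0 or ref_base not in BASES:
            continue
        ref_reads = (counts[pos][column("W", ref_base)]
                     + counts[pos][column("C", ref_base)])
        # Placeholder frequency: fraction of reads not showing the reference base.
        if (depth - ref_reads) / depth >= min_freq:
            mean_qual = [q / c if c else 0.0
                         for q, c in zip(qual_sums[pos], counts[pos])]
            sites.append((pos, ref_base, counts[pos], mean_qual))
    return sites


def call_all(chromosomes, min_freq=0.1):
    """Process chromosomes one at a time so only one matrix is alive at once."""
    for name, (reference, observations) in chromosomes.items():
        yield name, candidate_sites(reference, observations, min_freq)
```

Because each matrix is created inside `candidate_sites` and goes out of scope when the function returns, peak memory in this sketch is bounded by the largest chromosome rather than by the whole genome, which is the point of allocating and freeing per chromosome.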

Step 2. Approximate Bayesian modeling

In brief, the Bayesian inference of each genotype is based on its posterior distribution, P(G|D), obtained using Bayes’ formula. The posterior distribution is built upon two components: the prior distribution of each genotype, P(G), and the likelihood, P(D|G), which is the probability of observing reads D given genotype G. For the prior P(G), we referred to the model of SOAPsnp (Li et al., 2009). We observed that, when sequencing depth was higher than 10, the choice of the prior had little effect. The likelihood, which reflects the error rates arising from multiple sources, is calculated as $P(D \mid g = G) = \prod_{i=1}^{n} P(d_i \mid g = G)$ for independent read observations $d_i$, where i = 1, 2, ..., n and n is the number of reads. We used the average error rate instead of a full evaluation of the error rate, which increases the modeling speed. The genotype with the largest posterior probability is reported as the final call. See Supplementary Text for more details.
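A minimal sketch of this kind of genotype inference is given below. It is not BS-SNPer's model: the prior is uniform over the ten diploid genotypes rather than the SOAPsnp prior, the per-read likelihood is a generic substitution-error model driven by a single average error rate `err` (assumed to satisfy 0 < err < 1), and bisulfite-specific strand handling is left out. All function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def read_likelihood(base, genotype, err):
    """P(d_i | G): each allele is sampled with probability 1/2; a read
    shows the sampled allele with probability 1 - err and any one of the
    three other bases with probability err / 3."""
    return sum(0.5 * ((1.0 - err) if base == allele else err / 3.0)
               for allele in genotype)


def genotype_posteriors(counts, err, priors=None):
    """Posterior P(G | D) from base counts, using an average error rate.

    The product over reads is accumulated in log space, then normalised.
    """
    if priors is None:
        priors = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}  # assumed uniform
    log_post = {}
    for g in GENOTYPES:
        log_lik = sum(n * math.log(read_likelihood(b, g, err))
                      for b, n in counts.items() if n)
        log_post[g] = math.log(priors[g]) + log_lik
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(counts, err):
    """Return the genotype with the largest posterior probability."""
    post = genotype_posteriors(counts, err)
    return max(post, key=post.get)
```

Using one average error rate means the likelihood depends only on the base counts, so each site costs a handful of logarithms regardless of depth; this is one plausible reading of why averaging speeds up the modeling.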

3 Results

Our previous RRBS data (Huang et al.) were used to test the performance of BS-SNPer and to compare it with Bis-SNP. The data comprise one para-normal sample (below called ‘Normal’; normal tissue adjacent to cancer tissue at a distance of at least 5 cm) and three cancer samples, i.e. primary renal cell carcinoma (pRCC), local invasion of the vena cava (IVC) and distant metastasis to the brain (MB) tissues from a patient with metastatic clear cell renal cell carcinoma. Whole exome sequencing data from the same work (Huang et al.) were employed to assess the performance of SNP calling. All reads of the four samples were aligned to the GRCh37 assembly (hg19) of the human genome using the program BSMAP with options ‘-z 64 -p 12 -s 16 -v 10 -q 2’. The alignments were then fed into BS-SNPer, MethylExtract and Bis-SNP under the same conditions. We evaluated the algorithms on a system equipped with a six-core Intel Xeon E5650 2.66 GHz processor and 32 GB memory. The system runs on 64-bit Red Hat Enterprise Linux 4.1.
Across the four samples, the minimal running times of MethylExtract and Bis-SNP were 6 and 20 h, respectively, whereas BS-SNPer took only around 10 min per sample. Compared with Bis-SNP, the increase in speed was more than 100 times in all four cases (Normal, pRCC, local IVC and distant MB) of our published data (Huang et al.). The increase in speed was not at the cost of memory or accuracy. The maximal memory usage of BS-SNPer was 8 GB, whereas MethylExtract and Bis-SNP used 10 GB for all cases, respectively. BS-SNPer also showed higher sensitivity and specificity for all four samples (Supplementary Table S2). For example, in Normal tissue, 2730 SNPs were detected by BS-SNPer, among which 2335 were validated by exome data [false positive rate (FPR) 14.47%]. Bis-SNP detected 3483 SNPs, of which 2011 were validated (FPR 42.26%); the FPR of MethylExtract was 39.93%. As exome sequencing detected 2873 SNPs for this sample, the false negative rate of BS-SNPer was 18.73%, whereas those of Bis-SNP and MethylExtract were 30% and 57.01%, respectively (Supplementary Table S2).
In conclusion, based on a dynamic matrix algorithm and a Bayesian statistical framework, we present BS-SNPer, an SNP calling tool for BS-Seq data. BS-SNPer provides high performance in terms of speed, memory usage, accuracy and sensitivity.
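The error rates quoted for the Normal sample in Section 3 can be re-derived from the reported counts. The short Python check below assumes, consistently with the published percentages, that the false positive rate is the share of called SNPs not validated by exome data and that the false negative rate is the share of exome SNPs not recovered; the function names are illustrative only.

```python
def false_positive_rate(called, validated):
    """Share of called SNPs that were not confirmed by exome data."""
    return (called - validated) / called


def false_negative_rate(exome_total, validated):
    """Share of exome-detected SNPs that the caller did not recover."""
    return (exome_total - validated) / exome_total


if __name__ == "__main__":
    exome_total = 2873  # SNPs detected by exome sequencing (Normal sample)
    for tool, called, validated in [("BS-SNPer", 2730, 2335),
                                    ("Bis-SNP", 3483, 2011)]:
        fpr = 100 * false_positive_rate(called, validated)
        fnr = 100 * false_negative_rate(exome_total, validated)
        print(f"{tool}: FPR {fpr:.2f}%, FNR {fnr:.2f}%")
    # BS-SNPer: FPR 14.47%, FNR 18.73%
    # Bis-SNP: FPR 42.26%, FNR 30.00%
```

Note that the quantity called FPR here is computed relative to the number of calls, which many authors would label a false discovery rate.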
References (10 of 18 shown)

1. W Reik; J Walter. Genomic imprinting: parental influence on the genome. Nat Rev Genet, 2001-01.
2. Jon F Wilkins. Genomic imprinting and methylation: epigenetic canalization and conflict. Trends Genet, 2005-06.
3. Ruiqiang Li; Yingrui Li; Xiaodong Fang; Huanming Yang; Jian Wang; Karsten Kristiansen; Jun Wang. SNP detection for massively parallel whole-genome resequencing. Genome Res, 2009-05-06.
4. Elena Y Harris; Nadia Ponts; Karine G Le Roch; Stefano Lonardi. BRAT-BW: efficient and accurate mapping of bisulfite-treated reads. Bioinformatics, 2012-05-03.
5. Peter W Laird. Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet, 2010-03.
6. Brent Pedersen; Tzung-Fu Hsieh; Christian Ibarra; Robert L Fischer. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics, 2011-06-30.
7. Benjamin P Berman; Daniel J Weisenberger; Joseph F Aman; Toshinori Hinoue; Zachary Ramjan; Yaping Liu; Houtan Noushmehr; Christopher P E Lange; Cornelis M van Dijk; Rob A E M Tollenaar; David Van Den Berg; Peter W Laird. Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains. Nat Genet, 2011-11-27.
8. Pao-Yang Chen; Shawn J Cokus; Matteo Pellegrini. BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics, 2010-04-23.
9. Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 2009-10-14.
10. Yuanxin Xi; Wei Li. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 2009-07-27.