
BamBam: genome sequence analysis tools for biologists.

Justin T Page, Zachary S Liechty, Mark D Huynh, Joshua A Udall.

Abstract

BACKGROUND: Massive computational power is needed to analyze the genomic data produced by next-generation sequencing, but extensive computational experience and specific knowledge of algorithms should not be necessary to run genomic analyses or interpret their results.
FINDINGS: We present BamBam, a package of tools for genome sequence analysis. BamBam contains tools that facilitate summarizing data from BAM alignment files and identifying features such as SNPs, indels, and haplotypes represented in those alignments.
CONCLUSIONS: BamBam provides a powerful and convenient framework to analyze genome sequence data contained in BAM files.

Year: 2014 PMID: 25421351 PMCID: PMC4258253 DOI: 10.1186/1756-0500-7-829

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Findings

Massive amounts of data are involved in genome sequence research, requiring researchers to use supercomputing clusters and complex algorithms to analyze their sequence data. Genomic analyses frequently include next-generation sequencing to produce millions of short reads, followed by aligning of reads to a reference genome sequence with software like GSNAP and Bowtie 2 [1, 2]. These programs generate SAM files, the accepted standard for storing short read alignment data, which are subsequently compressed to BAM format via SAMtools [3]. The BAM files must then be analyzed and compared to produce meaningful results. Here we expand on the body of tools for analyzing and comparing BAM files. We present BamBam, a package of bioinformatics tools to carry out a variety of genomic analyses on BAM files (Table 1). The included tools perform such tasks as counting the number of reads mapped to each gene in a genome (as for gene expression analyses), identifying SNPs (Single Nucleotide Polymorphisms) and CNVs (Copy Number Variants), and extracting consensus sequences. The purpose of BamBam is to provide a consistent framework to perform common tasks, without requiring extensive knowledge of computation or algorithms to select or interpret appropriate parameters.

Table 1

The core independent tools of BamBam

Section	Tool	Purpose
Single nucleotide polymorphisms	InterSnp	Call SNPs between two or more samples
	Pebbles	Impute genotypes in output from InterSnp
	HapHunt	Phase haplotypes with K-means
Copy number variants	GapFall	Identify deletions between two samples
	Eflen	Identify covered regions
	HMMph	Call copy number variants with HMM
Bisulfite-sequence analysis	MetHead	Summarize base pair methylation in bisulfite-sequence data
GeneVisitor	Bam2Consensus	Generate consensus sequences from one or more samples
	Bam2Fastq	Extract mapped and unmapped reads from BAM files
	Counter	Summarize read coverage of sequences or regions
	SubBam	Extract subset of mapped reads
Allopolyploid analysis	PolyCat	Categorize reads by genome based on similarity to parents
Scripts	Various	Various

The BamBam package includes several independent programs, briefly described below. Brief tests were carried out to compare InterSnp, GapFall, and HapHunt with similar tools (Additional file 1). The latest version of PolyCat is also included [4]. The README in the download package provides example commands for various common analyses, including phylogeny inference, molecular evolution estimation, methylation analysis, and differential expression analysis. A usage guide (see Additional file 2) provides a more detailed walkthrough of some workflows.

Single nucleotide polymorphisms

InterSnp calls SNPs between samples, each represented by a separate BAM file. It examines each position in the genome and assigns a consensus allele to each site for each sample. A SNP is called whenever two samples differ at the same position. The output is a table in which each row gives the sequence name, the position, and the genotype of every sample at one polymorphic site; it can be readily processed by common command-line programs or scripts to calculate statistics or produce marker data for other programs.

Pebbles imputes genotypes using the K-nearest neighbor algorithm [5, 6]. For each unknown genotype, Pebbles finds the samples that are most similar at nearby loci, then assigns a genotype to the unknown locus based on the weighted contributions of those neighbors. Pebbles operates on InterSnp output (a table of genotypes) and produces a file of the same form.

HapHunt uses K-means clustering to solve the haplotype-phasing problem, which consists of identifying all haplotypes in a sampled individual or population. Many programs have attempted to solve the haplotype-phasing and closely related haplotype-assembly problems using a variety of strategies, including Max-Cut, hidden Markov models, and dynamic programming [7-9]. The K-means clustering algorithm (Figure 1) is an unsupervised machine learning algorithm and is closely related mathematically to Principal Component Analysis [10, 11].
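The SNP-calling and imputation steps described above for InterSnp and Pebbles can be illustrated with a minimal Python sketch. It is not BamBam's implementation: the function names, the input layout (a string of observed bases per site, a list of calls per sample), and the parameters `min_depth`, `k`, and `window` are all assumptions made for illustration.

```python
from collections import Counter

def consensus_allele(bases, min_depth=2):
    """Most common base observed at one site, or None if coverage is too low."""
    if len(bases) < min_depth:
        return None
    return Counter(bases).most_common(1)[0][0]

def call_snps(pileups, min_depth=2):
    """pileups: {sample: {(seq, pos): "ACCA..."}} (hypothetical layout).
    Returns one row per site at which at least two samples differ."""
    sites = sorted({site for cols in pileups.values() for site in cols})
    rows = []
    for site in sites:
        genotypes = {name: consensus_allele(cols.get(site, ""), min_depth)
                     for name, cols in pileups.items()}
        if len({g for g in genotypes.values() if g is not None}) > 1:
            rows.append((site, genotypes))
    return rows

def impute_knn(genotypes, k=2, window=5):
    """genotypes: {sample: [allele or None, ...]} over the same ordered sites.
    Fills each None from the k samples most similar at nearby sites,
    weighting each neighbor's vote by its similarity."""
    imputed = {name: list(calls) for name, calls in genotypes.items()}
    for name, calls in genotypes.items():
        for i, call in enumerate(calls):
            if call is not None:
                continue
            lo, hi = max(0, i - window), min(len(calls), i + window + 1)
            nearby = [j for j in range(lo, hi) if j != i]
            scored = []
            for other, other_calls in genotypes.items():
                if other == name or other_calls[i] is None:
                    continue
                shared = [j for j in nearby
                          if calls[j] is not None and other_calls[j] is not None]
                if shared:
                    same = sum(calls[j] == other_calls[j] for j in shared)
                    scored.append((same / len(shared), other))
            votes = Counter()
            for similarity, other in sorted(scored, reverse=True)[:k]:
                votes[genotypes[other][i]] += similarity
            if votes and max(votes.values()) > 0:
                imputed[name][i] = votes.most_common(1)[0][0]
    return imputed
```

Similarity here is simply the fraction of matching calls within the window; a real tool may weight loci by distance or handle heterozygous calls, which this sketch ignores.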

Figure 1

K-means clustering algorithm. An example 2-cluster run is shown, with the clusters distinguished by color and the current cluster seeds marked by a starburst. In the first round, each point is assigned to its closest seed, and a new seed is chosen for each cluster based on the average of all points in that cluster. As a result, the blue cluster seed moves to the right side. In the second round, both cluster seeds drift to their correct locations, resulting in a proper division. Note that, after two rounds, the clusters have reached a steady state and would not change with further iterations.

HapHunt first selects K reads distant from one another to serve as haplotype seeds. It assigns each other read to the haplotype with the closest consensus sequence. Then it recalculates the consensus sequences based on the reads in each haplotype and repeats the process of assigning each read to the haplotype with the closest consensus sequence. It repeats this process a given number of times, calculating a score at the end of each round based on the difference between the smallest interhaplotype distance and the greatest intrahaplotype distance. This score favors clusterings in which haplotypes are individually compact and most distinct from one another. The score can optionally be scaled by the average size ratio for each pair of haplotypes, favoring clusterings that are more evenly divided. The consensus sequences of the final haplotypes are printed as an aligned FASTA file for each sequence in the original reference.
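The seeding, assignment, consensus, and scoring loop described for HapHunt can be sketched as follows. This is an illustrative reading of the text and not HapHunt's code: reads are assumed to be pre-aligned strings of equal length with "-" marking uncovered sites, distance is a plain mismatch count, intrahaplotype distance is taken as read-to-consensus distance, and all function names are hypothetical. It requires k of at least 2.

```python
from collections import Counter

def distance(a, b):
    """Mismatches between two aligned strings, ignoring '-' (no coverage)."""
    return sum(x != y for x, y in zip(a, b) if x != "-" and y != "-")

def consensus(reads, length):
    """Column-wise majority base over the reads assigned to one haplotype."""
    cols = []
    for i in range(length):
        counts = Counter(r[i] for r in reads if r[i] != "-")
        cols.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(cols)

def pick_seeds(reads, k):
    """Start from the first read, then repeatedly add the read farthest
    from every seed chosen so far."""
    seeds = [reads[0]]
    while len(seeds) < k:
        seeds.append(max(reads, key=lambda r: min(distance(r, s) for s in seeds)))
    return seeds

def phase_reads(reads, k=2, rounds=5):
    """Return (best score, haplotype consensus sequences) over all rounds."""
    length = len(reads[0])
    haplotypes = pick_seeds(reads, k)
    best = None
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for read in reads:
            nearest = min(range(k), key=lambda h: distance(read, haplotypes[h]))
            clusters[nearest].append(read)
        # Keep the old consensus for a haplotype that attracted no reads.
        haplotypes = [consensus(c, length) if c else haplotypes[h]
                      for h, c in enumerate(clusters)]
        inter = min(distance(haplotypes[a], haplotypes[b])
                    for a in range(k) for b in range(a + 1, k))
        intra = max(distance(r, haplotypes[h])
                    for h, c in enumerate(clusters) for r in c)
        score = inter - intra
        if best is None or score > best[0]:
            best = (score, haplotypes)
    return best
```

The optional scaling by haplotype size ratio mentioned in the text is left out to keep the sketch short.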

Copy number variants

GapFall identifies large deletions between samples based on read coverage. It searches the genome for extended regions that have high coverage in one sample but no coverage in the other. A large region with no coverage could indicate a physical deletion (for genomic samples) or a deactivated gene (for RNA-seq). These putative deletions are reported as an annotation file that can be visualized with a genome browser such as IGV [12].

Eflen identifies and extracts regions in a BAM file that are covered by at least a user-specified number of reads and outputs those regions as a GFF file. Provided with multiple BAMs, Eflen will identify regions that are covered in at least a user-specified fraction of those BAMs. This tool can be especially useful for analyzing GBS or RNA-seq data.

HMMph identifies CNVs between samples based on read coverage. BAM files must be provided for a control and for the sample of interest. The coverage ratio between those two BAM files is normalized by the total read coverage. Then the copy number of each locus in a sliding window is modeled based on a Poisson distribution in an untrained Hidden Markov Model [13, 14].
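The coverage comparison that GapFall is described as performing can be illustrated with a short sketch. It assumes per-base coverage has already been extracted into two equal-length lists; the function name and the thresholds `min_length` and `min_cov` are hypothetical and do not correspond to GapFall's actual options.

```python
def find_deletions(cov_a, cov_b, min_length=3, min_cov=5):
    """Return half-open (start, end) intervals in which sample A is covered
    by at least min_cov reads at every base while sample B has no coverage."""
    regions = []
    start = None
    for pos, (a, b) in enumerate(zip(cov_a, cov_b)):
        if a >= min_cov and b == 0:
            if start is None:
                start = pos          # a candidate deletion begins here
        else:
            if start is not None and pos - start >= min_length:
                regions.append((start, pos))
            start = None
    end = min(len(cov_a), len(cov_b))
    if start is not None and end - start >= min_length:
        regions.append((start, end))  # run extends to the end of the sequence
    return regions
```

For example, `find_deletions([9] * 8, [4, 0, 0, 0, 5, 0, 0, 6])` reports the three-base gap at positions 1 to 3 and ignores the two-base gap, which falls below `min_length`.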

Bisulfite-sequence analysis

Bisulfite treatment converts unmethylated cytosines to thymines. MetHead summarizes methylation at all cytosine positions in the genome, based on BAM files of mapped bisulfite-treated reads. It totals the number of mapped cytosines and thymines at each position (indicating methylated and unmethylated states, respectively), then performs a one-tailed binomial test for the methylation of that site.

Different protocols are used for bisulfite treatment. If PCR is not performed after bisulfite treatment but before sequencing, then only 2 possibilities exist: conversions on the forward and reverse strand. But if PCR is performed, 4 possibilities exist (Figure 2). To properly count the number of cytosines and thymines in the 4-possibility protocol, the origin of the pre-PCR DNA fragment must be inferred. MetHead determines this, if necessary, by counting the number of C→T conversions and G→A conversions (the latter indicative of a conversion on the reverse strand). It generates a BAM file with the orientation of each read matching its origin strand. That BAM can then be analyzed as if it were data produced by the 2-possibility protocol. Note that, in the produced BAM, the orientation of reads is not based on the direction in which the read was sequenced. Instead, the orientation of the read indicates the type of conversion caused by bisulfite treatment: C→T or G→A.
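A minimal sketch of the two computations described above, the per-site one-tailed binomial test and the inference of a read's origin strand, is given below. The null hypothesis used here (an unmethylated site whose cytosines arise only from failed conversion at an assumed rate `non_conversion_rate`) and the significance level `alpha` are assumptions for illustration; the source does not state the parameters MetHead uses, and all function names are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def methylation_call(c_count, t_count, non_conversion_rate=0.01, alpha=0.05):
    """One-tailed test: are there more cytosines at this site than failed
    bisulfite conversion alone would explain?"""
    n = c_count + t_count
    if n == 0:
        return None                      # no coverage, no call
    p_value = binomial_tail(c_count, n, non_conversion_rate)
    return {"fraction": c_count / n,
            "p_value": p_value,
            "methylated": p_value < alpha}

def infer_origin(ref, read):
    """Guess the origin strand of a read aligned base-for-base to ref by
    comparing the number of C->T and G->A mismatches."""
    ct = sum(r == "C" and q == "T" for r, q in zip(ref, read))
    ga = sum(r == "G" and q == "A" for r, q in zip(ref, read))
    return "forward" if ct >= ga else "reverse"
```

With these assumptions, a site with 8 cytosines and 2 thymines is called methylated, while a site with 0 cytosines and 10 thymines is not.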

Figure 2

Bisulfite treatment. The effects of bisulfite treatment on DNA are shown. An “m” superscript indicates a methylated cytosine. The orientation of each strand is indicated by “<<” and “>>”. Bisulfite treatment converts unmethylated cytosines into uracils and, ultimately, thymines. After PCR, however, a given fragment may have C→T conversions or G→A conversions, depending on its orientation relative to its origin fragment.

GeneVisitor

It is often useful to be able to compute on specific genomic intervals, such as genes. GeneVisitor provides a quick and easy way to do this, using an annotation file (GFF or BED format) to call a function on each indicated region of the genome. This class can be used by C++ programmers to run custom functions. In addition, pre-built tools utilize GeneVisitor without the need for programming.

Bam2Consensus converts one or more BAM files into a series of FASTA-formatted consensus sequences. If desired, multiple sequences (essentially unphased haplotypes) can be produced per BAM file, facilitating analyses of heterozygosity, nucleotide diversity, and molecular evolution. Suppose you have several BAM files representing different accessions of a species, all mapped to a common genome reference sequence. With a single command, Bam2Consensus can produce an aligned FASTA file for each gene, each containing the consensus sequences for each accession.

Bam2Fastq extracts mapped or unmapped reads from a BAM file, or from select regions of the BAM file.

Counter summarizes the number of reads mapped to each annotated region in one or more BAM files. RPKM (Reads Per Kilobase per Million mapped reads) normalization can be applied if desired. The output of Counter is a table of features and read counts, ready to be imported into edgeR for differential expression analysis [15].

SubBam extracts a subset of a BAM file. It can optionally modify the BAM file, changing the coordinates of mapped reads to match a new reference that is a subset of the original reference. Suppose you have WGS reads mapped to a reference sequence and are interested in several loci. SubBam can produce BAMs that only contain the loci of interest, with a coordinate system corresponding to the position in the locus, rather than in the genome as a whole.
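The per-region counting and RPKM normalization described for Counter follow a standard formula: RPKM = reads in region × 10^9 / (region length in bp × total mapped reads). The sketch below applies it to a toy input; the tuple layouts for regions and reads and the function names are assumptions for illustration, and the total is taken as the number of reads supplied.

```python
def count_reads(regions, reads):
    """regions: (name, seq, start, end) with 1-based inclusive coordinates,
    as in GFF. reads: (seq, pos) of each mapped read."""
    counts = {name: 0 for name, _seq, _start, _end in regions}
    for seq, pos in reads:
        for name, r_seq, start, end in regions:
            if seq == r_seq and start <= pos <= end:
                counts[name] += 1
    return counts

def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

def rpkm_table(regions, reads):
    """RPKM for every annotated region."""
    counts = count_reads(regions, reads)
    total = len(reads)
    return {name: rpkm(counts[name], end - start + 1, total)
            for name, _seq, start, end in regions}
```

As a check on the formula, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads gives an RPKM of 25.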

Allopolyploid analysis

The latest version of PolyCat is included in BamBam. PolyCat uses an index of known homoeo-SNPs (polymorphisms that distinguish the genomes of an allopolyploid) to identify the source genome for each read in a library, which cannot be distinguished through typical next-generation sequencing protocols [4].

The MultiIndex class is used by PolyCat and MetHead, and can be used to make novel tools in C++. It provides quick random access to hundreds of millions of individual base positions scattered across a genome sequence. Each sequence in the reference is indexed with a linked list, with an index of landmark nodes spaced along the sequence at a resolution specified by the user (Figure 3).

Figure 3

Multi-level index. The multi-level index provides random access to large numbers of individual base positions across a genome. Each sequence (green) is indexed by a linked-list (blue), and that index is indexed by a set of landmark nodes (red) to provide rapid access to any location.
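The two-level idea behind the multi-level index, a sorted linked list of positions plus landmark nodes that let a lookup skip most of the list, can be sketched in Python for a single reference sequence. The class and method names are hypothetical and the real MultiIndex is a C++ class whose interface is not shown in the source; this sketch only illustrates the lookup strategy.

```python
class Node:
    """One indexed base position and the value stored for it."""
    def __init__(self, pos, value):
        self.pos = pos
        self.value = value
        self.next = None

class LandmarkIndex:
    def __init__(self, entries, resolution=1000):
        """entries: iterable of (position, value) for one sequence.
        resolution: spacing of landmark nodes, in bases."""
        self.resolution = resolution
        self.landmarks = {}   # bucket number -> first node inside that bucket
        self.head = None
        tail = None
        for pos, value in sorted(entries, key=lambda e: e[0]):
            node = Node(pos, value)
            if tail is None:
                self.head = node
            else:
                tail.next = node
            tail = node
            self.landmarks.setdefault(pos // resolution, node)

    def get(self, pos):
        """Jump to the landmark for pos, then walk the list forward."""
        node = self.landmarks.get(pos // self.resolution)
        while node is not None and node.pos < pos:
            node = node.next
        if node is not None and node.pos == pos:
            return node.value
        return None
```

A smaller `resolution` means more landmarks and shorter walks at the cost of memory; a genome-wide index would hold one such structure per reference sequence.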

Scripts

In addition to the core tools mentioned above, BamBam includes many Perl scripts, many of which use BioPerl modules [16]. Script functions include calculation of nucleotide diversity (π) and molecular evolution rates (Ka and Ks), paralog identification, differential expression with edgeR [15], summarization of results from MetHead, and summarization of genotype tables produced by InterSnp and Pebbles.

Conclusions

The BamBam tools form a simple interface between the researching biologist and the wealth of data contained in next-generation sequence alignments. They provide a means to efficiently identify interesting genomic features and summarize data, facilitating many next-generation sequence analysis experiments. BamBam is freely available under the MIT license at http://sourceforge.net/projects/bambam/. It depends on both SAMtools and BAMtools [3, 17].

Availability and requirements

Project Name: BamBam
Project Home Page: http://sourceforge.net/projects/bambam/
Operating System: Unix
Dependencies: SamTools, BamTools, BioPerl
Programming Language: C++ and Perl
License: MIT

Authors’ information

JP has a B.S. in Computer Science and is currently a graduate student in Biology, focusing on developing tools for polyploid genome analysis. ZL is an undergraduate student in the Udall lab. MH is a graduate student in the Udall lab. JU is an associate professor at Brigham Young University and is the academic advisor of JP, ZL, and MH.

Additional file 1: Supplementary Material. Figure S1 Read alignment of cotton A-genome and D-genome reads to a common reference, rendered in IGV. Highlights indicate differences compared to the reference, so highlights in the upper sequence (A-genome) and the lack of those highlights in the lower sequence (D-genome) indicate SNPs between the two genomes. In this region, InterSnp identified 17 SNPs but SAMtools failed to identify any. Figure S2 Haplotypes identified by SAMtools and HapHunt, compared to the known haplotype. Figure S3 Phylogenetic tree. This neighbor-joining tree was built with neighbor based on SNPs identified by InterSnp. Then Geneious was used to render the actual tree. Table S1 The number of deletions identified in each accession (row), along with the percentage of those deletions that were shared with other members of the same species and with the entire group of samples. (DOCX 207 KB)

Additional file 2: BamBam User Guide [15, 18]. (DOCX 18 KB)

14 in total

1. Missing value estimation methods for DNA microarrays.

Authors: O Troyanskaya; M Cantor; G Sherlock; P Brown; T Hastie; R Tibshirani; D Botstein; R B Altman
Journal: Bioinformatics Date: 2001-06 Impact factor: 6.937

2. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

3. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals.

Authors: Brian L Browning; Sharon R Browning
Journal: Am J Hum Genet Date: 2009-02-05 Impact factor: 11.025

4. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem.

Authors: Vikas Bansal; Vineet Bafna
Journal: Bioinformatics Date: 2008-08-15 Impact factor: 6.937

5. BamTools: a C++ API and toolkit for analyzing and managing BAM files.

Authors: Derek W Barnett; Erik K Garrison; Aaron R Quinlan; Michael P Strömberg; Gabor T Marth
Journal: Bioinformatics Date: 2011-04-14 Impact factor: 6.937

6. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. Fast and SNP-tolerant detection of complex variants and splicing in short reads.

Authors: Thomas D Wu; Serban Nacu
Journal: Bioinformatics Date: 2010-02-10 Impact factor: 6.937

9. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

10. Imputation of unordered markers and the impact on genomic selection accuracy.

Authors: Jessica E Rutkoski; Jesse Poland; Jean-Luc Jannink; Mark E Sorrells
Journal: G3 (Bethesda) Date: 2013-03-01 Impact factor: 3.154
