
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

Matti Niemenmaa, Aleksi Kallio, André Schumacher, Petri Klemelä, Eija Korpelainen, Keijo Heljanko.

Abstract

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.

Year: 2012 PMID: 22302568 PMCID: PMC3307120 DOI: 10.1093/bioinformatics/bts054

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Next-generation sequencing (NGS) technologies provide unprecedented opportunities for life science research. In order to exploit this potential to its full extent, new computational approaches are needed for the efficient processing of large datasets. Nearly all NGS applications rely on sequence alignment as the first analysis step. The alignment data is commonly stored in the standardized, compact and indexed BAM (Binary Alignment/Map) format (Li et al., 2009), which is then used for further analysis such as SNP genotyping, peak calling or detecting differential gene expression. As data sizes increase more rapidly than processing power and disk-read speed, many of these bioinformatics tasks have been ported to utilize the map-reduce distributed processing framework (Taylor, 2010). Existing solutions, such as GATK (McKenna et al., 2010), SeqWare Query Engine (O'Connor et al., 2010) and Seal (Pireddu et al., 2011), provide useful parts for NGS data analysis pipelines. However, they do not allow efficient parallel access to BAM files.

Map-reduce is a distributed computing paradigm that has been designed for processing collections of relatively independent data items, and is therefore well suited for sequencing reads (Dean and Ghemawat, 2008). It divides data between processing nodes by splitting the files into chunks, which are then processed separately. The user has to write map and reduce functions, where the map function does the actual processing of a chunk, and the reduce function combines partial results. The most popular open source implementation of map-reduce is Apache Hadoop (White, 2009).

BAM files are conceptually a good fit for map-reduce style chunk processing, but their low-level structure hinders adoption. Typically map-reduce jobs process data chunks in line-based text format, where identifying entries is simple as line boundaries are denoted by newline characters. Detecting entry boundaries and accessing the binary content of (compressed) BAM files, however, is non-trivial. On the other hand, using plain Hadoop with text-based SAM files results in several times greater disk and network loads. Text formats also complicate the pipeline as data is typically stored in BAM files.

We developed the Hadoop-BAM Java library to act as an integration layer between analysis applications and BAM files stored in the Hadoop Distributed File System (HDFS).
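The division of labour between map and reduce functions described above can be illustrated with a minimal in-process sketch. It is not Hadoop code and does not use the Hadoop-BAM API: the helper names (`split_into_chunks`, `map_chunk`, `reduce_pairs`) and the sample reads are hypothetical, and text-based SAM lines are used only because they are easy to show inline.

```python
from collections import defaultdict
from itertools import chain

def split_into_chunks(lines, chunk_size):
    """Split the input into fixed-size chunks, as a framework would split a file."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_chunk(chunk):
    """Map step: emit a (reference name, 1) pair for each alignment in one chunk."""
    pairs = []
    for line in chunk:
        if line.startswith("@"):       # skip SAM header lines
            continue
        fields = line.split("\t")
        pairs.append((fields[2], 1))   # third SAM column is the reference name (RNAME)
    return pairs

def reduce_pairs(pairs):
    """Reduce step: combine the partial results by summing the counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical sample input: one header line and three aligned reads.
sam_lines = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr2\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tGGCA\tIIII",
]

# Each chunk is mapped independently; the reducer sees all emitted pairs.
chunks = split_into_chunks(sam_lines, 2)
counts = reduce_pairs(chain.from_iterable(map_chunk(c) for c in chunks))
print(counts)
```

In a real cluster the chunks would be processed on different nodes and the framework would group the emitted pairs by key before reduction; the sketch only shows the two user-written roles.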

2 METHODS

Hadoop-BAM solves the issues related to BAM splitting and presents a convenient API for implementing map and reduce functions for Hadoop. The library supports two modes of access to BAM files. The first mode relies on a precomputed index that maps byte offsets to BAM records and thus allows random access, which is required because Hadoop may split the BAM data at arbitrary byte positions. The second mode does not use an index and instead relies on a two-level detection routine. The higher level locates boundaries between compressed blocks via BGZF magic numbers, while the lower level detects BAM block boundaries via redundancies in the BAM file format. For details we refer to the Supplementary Material. The library exposes a Picard-compatible Java API to programmers. Hence, Hadoop code can be written without considering the issues of BGZF compression, block boundary detection, BAM record boundary detection, or parsing of raw binary data. Tools built on the Picard API can be easily converted to support large-scale distributed computing with Hadoop-BAM.
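The higher level of the index-free mode, locating compressed block boundaries from an arbitrary byte offset, can be sketched as follows. This is an illustration under stated assumptions and not the library's actual routine, which is described in the Supplementary Material: the function names are hypothetical, only the common BGZF layout with a single 6-byte `BC` extra subfield is handled, and a candidate is accepted when the block that follows it also starts with a plausible header.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA

def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block: a gzip member carrying a 'BC' extra subfield."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    block_size = 12 + 6 + len(cdata) + 8             # header + extra + data + trailer
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 255, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, block_size - 1)     # SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + cdata + trailer

def block_size_at(data: bytes, offset: int):
    """Return the total block size if a plausible BGZF header starts here, else None."""
    if offset + 18 > len(data) or data[offset:offset + 4] != BGZF_MAGIC:
        return None
    (xlen,) = struct.unpack_from("<H", data, offset + 10)
    if xlen != 6 or data[offset + 12:offset + 14] != b"BC":
        return None                                  # simplification: one subfield only
    slen, bsize = struct.unpack_from("<HH", data, offset + 14)
    return bsize + 1 if slen == 2 else None

def find_next_block(data: bytes, start: int):
    """Scan forward from an arbitrary offset to the next verified block start."""
    pos = data.find(BGZF_MAGIC, start)
    while pos != -1:
        size = block_size_at(data, pos)
        if size is not None:
            following = pos + size
            # Guard against magic bytes occurring by chance inside compressed data.
            if following == len(data) or block_size_at(data, following) is not None:
                return pos
        pos = data.find(BGZF_MAGIC, pos + 1)
    return None

# Hypothetical payloads standing in for runs of BAM records.
blocks = [make_bgzf_block(p) for p in (b"first block", b"second", b"third")]
data = b"".join(blocks)

# A split that begins in the middle of the first block resumes at the second one.
resume_at = find_next_block(data, 5)
print(resume_at, len(blocks[0]))
```

The lower level, detecting record boundaries inside the decompressed data, would need an analogous plausibility check on the fields of a BAM record and is not shown here.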

2.1 Evaluation

To demonstrate the library, we use it for calculating coverage summaries for the Chipster genome browser. Chipster is a biologist-friendly analysis software for high-throughput data (Kallio et al., 2011), and its genome browser allows users to zoom smoothly from whole chromosome to nucleotide level. Good interactive performance with large BAM files is achieved by precomputing summary files, which are used to create zoomed out views (Fig. 1).

Fig. 1.

Chipster genome browser using preprocessed data to show an interactive high level overview of coverage profile.

Implementing summarizing is simple, because Hadoop-BAM allows developers to treat BAM files as Hadoop input/output formats, which includes the provision of a custom partitioner for the input data. The library further extends Hadoop to offer SAMRecord from the Picard toolbox as a map-reduce value type. In essence, the task is to extract the genomic coordinates from the given BAM file, sort the resulting records first by their center point, and for each consecutive group of records of size at most N, output a summarized record containing mean position and group size.

The tool was implemented on top of Hadoop version 0.20.2, which was the latest stable version at the time of writing. Intermediate data was compressed via hadoop-lzo. For benchmarking, we relied on a test cluster with 112 nodes, each of which has two six-core AMD Opteron 2435 CPUs with a clock speed of 2.6 GHz, 250 GB of local disk space and an InfiniBand interconnect. A 50 GB BAM file containing whole-genome sequencing data from the 1000 Genomes Project was summarized into groups of size 2^k for k ∈ {1, 2, …, 16} during a single map-reduce run.

Total execution time is already well under an hour with eight worker nodes, which is very reasonable for a 50 GB dataset. As shown in Fig. 2, the map-reduce job scales well up to about eight worker nodes, after which scaling worsens. This also has a significant effect on the total time: starting at the four-worker mark, the job actually takes less time than the file transfers. As the import and export of data requires much time, we conclude that when designing Hadoop-based pipelines, one should avoid moving data in and out of Hadoop between analysis steps. Performance is also bound by the interconnect network.

This result indicates that BAM, as a binary and compressed format, is suitable for large-scale NGS data analysis in the cloud. Using SAM or another text format would greatly reduce performance, as there would be far more data to transfer. All in all, compact formats are good not only for storage, but also for distributed processing with map-reduce.
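The summarizing task described above (sort by center point, then emit mean position and group size for each consecutive group of at most N records) can be sketched on a single machine as follows. This is a minimal illustration, not the tool's implementation: the function names and sample coordinates are hypothetical, alignments are reduced to (start, end) pairs, "mean position" is assumed to be the mean of the center points, and the group sizes are assumed to be powers of two indexed by k.

```python
from statistics import mean

def center(start, end):
    """Center point of an alignment given its start and end coordinates."""
    return (start + end) / 2

def summarize(intervals, group_size):
    """Sort by center point, then emit (mean position, group size) per group."""
    centers = sorted(center(s, e) for s, e in intervals)
    summary = []
    for i in range(0, len(centers), group_size):
        group = centers[i:i + group_size]   # the last group may be smaller
        summary.append((mean(group), len(group)))
    return summary

# Hypothetical alignments on one chromosome, as (start, end) coordinates.
intervals = [(100, 110), (10, 20), (50, 60), (200, 220), (30, 40)]

# One summary level per group size 2**k, here for k = 1, 2, 3 only.
levels = {k: summarize(intervals, 2 ** k) for k in range(1, 4)}
for k, records in levels.items():
    print(k, records)
```

In the distributed setting the sort would be carried out by the framework between the map and reduce phases, with the partitioner deciding which reducer receives which range of positions.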

Fig. 2.

Mean speedups for summarizing a 50 GB BAM file with Hadoop, using heuristic splitting. Due to the cluster usage policy, the maximum number of parallel worker nodes was restricted to 15.

3 DISCUSSION

To conclude, we presented how the combination of a compact data format such as BAM and the powerful distributed framework Hadoop can be used to efficiently process large NGS datasets. The Hadoop-BAM library provides an easy-to-use interface for their integration by resolving the incompatibilities between these two technologies. We predict that similar integration efforts will become common when cloud computing is taken into wider use in NGS data analysis. While our use case consisted of coverage calculations, it is important to note that Hadoop-BAM can be used for virtually any analysis task based on BAM files, ranging from variant detection to peak calling. In order to make Hadoop-BAM more accessible, we are currently evaluating simpler and higher-level Hadoop-based query languages for working with BAM files. Examples include Apache Pig (Olston et al.) and Hive (Thusoo et al.). We have also developed a command line interface and are extending it to provide Samtools-like functionality.

Funding: Cloud Software Program funded by Finnish Funding Agency for Technology and Innovation Tekes; Academy of Finland (#139402).

Conflict of Interest: none declared.

REFERENCES

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

3. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.

Authors: Ronald C Taylor
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

4. SeqWare Query Engine: storing and searching sequence data in the cloud.

Authors: Brian D O'Connor; Barry Merriman; Stanley F Nelson
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

5. SEAL: a distributed short read mapping and duplicate removal tool.

Authors: Luca Pireddu; Simone Leo; Gianluigi Zanetti
Journal: Bioinformatics Date: 2011-06-22 Impact factor: 6.937

6. Chipster: user-friendly analysis software for microarray and other high-throughput data.

Authors: M Aleksi Kallio; Jarno T Tuimala; Taavi Hupponen; Petri Klemelä; Massimiliano Gentile; Ilari Scheinin; Mikko Koski; Janne Käki; Eija I Korpelainen
Journal: BMC Genomics Date: 2011-10-14 Impact factor: 3.969
