Matti Niemenmaa, Aleksi Kallio, André Schumacher, Petri Klemelä, Eija Korpelainen, Keijo Heljanko.
Abstract
Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article, we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and that one should avoid moving data in and out of Hadoop between analysis steps.
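To make the idea of map and reduce functions operating on aligned records concrete, the following is a minimal single-process sketch of a binned coverage summary. It does not use Hadoop or the Hadoop-BAM API (which is Java) and is not the authors' implementation: the `Read` tuple, `BIN_SIZE`, `map_read`, `reduce_bin` and `summarize` are all hypothetical names chosen for illustration, and each read is assumed to cover one contiguous stretch of the reference.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical stand-in for an aligned BAM record."""
    ref: str     # reference sequence name
    start: int   # 0-based leftmost aligned position
    length: int  # number of reference bases covered (assumed contiguous)


BIN_SIZE = 100  # hypothetical summary resolution, in bases

Key = Tuple[str, int]  # (reference name, bin index)


def map_read(read: Read, bin_size: int = BIN_SIZE) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((ref, bin), covered_bases) for every bin the read overlaps."""
    end = read.start + read.length
    pos = read.start
    while pos < end:
        bin_index = pos // bin_size
        bin_end = (bin_index + 1) * bin_size
        covered = min(end, bin_end) - pos
        yield (read.ref, bin_index), covered
        pos += covered


def reduce_bin(key: Key, values: Iterable[int], bin_size: int = BIN_SIZE) -> Tuple[Key, float]:
    """Reduce step: turn the covered-base counts of one bin into mean coverage."""
    return key, sum(values) / bin_size


def summarize(reads: Iterable[Read], bin_size: int = BIN_SIZE) -> Dict[Key, float]:
    """Run map, group by key (standing in for the shuffle phase), then reduce."""
    grouped = defaultdict(list)
    for read in reads:
        for key, value in map_read(read, bin_size):
            grouped[key].append(value)
    return dict(reduce_bin(k, v, bin_size) for k, v in sorted(grouped.items()))
```

In a distributed setting, the grouping done inside `summarize` would be handled by the framework between the map and reduce phases, so only the two per-record and per-key functions would need to be supplied.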
Year: 2012 PMID: 22302568 PMCID: PMC3307120 DOI: 10.1093/bioinformatics/bts054
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Chipster genome browser using preprocessed data to show an interactive high-level overview of the coverage profile.
Fig. 2. Mean speedups for summarizing a 50 GB BAM file with Hadoop, using heuristic splitting. Due to the cluster usage policy, the maximum number of parallel worker nodes was restricted to 15.