Literature DB >> 24149054

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

André Schumacher¹, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti, Keijo Heljanko.

Abstract

SUMMARY: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
AVAILABILITY AND IMPLEMENTATION: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

Entities: Species

Mesh：

Year: 2013 PMID： 24149054 PMCID： PMC3866557 DOI： 10.1093/bioinformatics/btt601

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Novel computational approaches are required to cope with the increasing data volumes of large-scale sequencing projects, as the growth in processing power and storage access speed is unable to keep pace with them (Marx, 2013; Stein, 2010). Several innovative tools and technologies have been proposed to tackle these challenges. Some are based on MapReduce, which is a distributed computing paradigm that is based on the idea of splitting input data into chunks, which can be processed largely independently (via a Map function). Subresults can later be merged after grouping-related subresults (by a Reduce function). MapReduce permits automatic parallelization and scalable data distribution across many computers. The most popular implementation available as open-source software is Apache Hadoop, which also comes with its own distributed file system. The validity of Hadoop as a data processing platform is demonstrated by the level of adoption in major data-intensive companies, e.g. Twitter, Facebook and Amazon. Motivated by the potential scalability and throughput offered by Hadoop, there are an increasing number of Hadoop-based tools for processing sequencing data (Taylor, 2010), ranging from quality control (Robinson ) and alignment (Langmead ; Pireddu ) to SNP calling (Langmead ), variant annotation (O’Connor ) and structural variant detection (Whelan ), including general purpose workflow management (Schönherr ). Note the recent publication of independent and complimentary work in (Nordberg ). Although Hadoop does simplify writing scalable distributed software, it does not make it trivial. Such a task still requires specialized skills and a significant amount of work, particularly if the solution involves sequences of MapReduce jobs. This effort can be reduced significantly by using high-level tools such as Apache Pig, which implements an SQL-like scripting language that is automatically translated into a sequence of MapReduce jobs. Given its flexibility and simplicity for developing data processing pipelines, it is not surprising that a large fraction of computing jobs in contemporary Hadoop deployments originate from Apache Pig or similar high-level tools (Chen ). SeqPig brings the benefits of Apache Pig to sequencing data analysis. It allows users to integrate their own analysis components with existing MapReduce programs to create full NGS pipelines based on Hadoop.

2 METHODS

SeqPig extends Pig with a number of features and functionalities specialized for processing sequencing data. Specifically, it provides (i) data input and output components, (ii) functions to access fields and transform data and (iii) a collection of scripts for frequent tasks (e.g. pile-up, QC statistics). Apache Pig provides an extension mechanism through the definition of new library functions, implemented in one of several supported programming languages (Java, Python, Ruby, JavaScript); these functions can then be called from Pig scripts. SeqPig uses this feature to augment the set of operators provided by plain Pig with a number of custom sequencing-specific functions. SeqPig supports ad hoc (scripted and interactive) distributed manipulation and analysis of large sequencing datasets so that processing speed scales with the number of available computing nodes. It provides import and export functions for file formats commonly used for sequencing data: Fastq, Qseq, FASTA, SAM and BAM. These components, implemented with the help of Hadoop-BAM (Niemenmaa ), allow the user to load and export sequencing data in the Pig environment. All available fields, such as BAM/SAM optional read attributes, for example, can then be accessed and modified from within Pig. SeqPig also includes functions to access SAM flags, split reads by base (for computing base-level statistics), reverse-complement reads, calculate read reference positions in a mapping (for pileups, extracting SNP positions) and more. It comes packaged with scripts that calculate various statistics and manipulations on read data, which also serve as examples. The growing library of functions and scripts is documented in the SeqPig manual. Contributions from the community are welcome and encouraged. Figures 1 and 2 show script examples. For a more detailed list of features and more examples please see the project Web site: http://seqpig.sourceforge.net/.

Fig. 1.

Converting Qseq into Fastq; the dataset is simply read and then written using the appropriate load/store functions

Fig. 2.

Script that generates a histogram for GC content of reads. The script loads a Fastq file, splits each read into separate bases and for each read coordinate filters only bases that are either G or C. The filtered bases are then counted and counts are grouped. Finally, the script prints records which contain the GC count and the count of reads that have the given GC count

Converting Qseq into Fastq; the dataset is simply read and then written using the appropriate load/store functions Script that generates a histogram for GC content of reads. The script loads a Fastq file, splits each read into separate bases and for each read coordinate filters only bases that are either G or C. The filtered bases are then counted and counts are grouped. Finally, the script prints records which contain the GC count and the count of reads that have the given GC count To evaluate SeqPig, we implemented a script that calculates most of the read quality statistics that are collected by the popular FastQC tool (Andrews, 2010). The script is included with the examples (fast_fastqc.pig). We ran a set of experiments, which measured the speed-up gained by using SeqPig on Hadoop clusters of different sizes compared with using a single-node FastQC run. We used a set of Illumina reads as input (read length: 101 bases; file size: 61.4 GB; format: Fastq). Software versions were as follows: FastQC 0.10.1; Hadoop 1.0.4; Pig 0.11.1. All tests were run on nodes equipped with dual quad-core Intel Xeon CPUs @ 2.83 GHz, 16 GB of RAM and one 250 GB SATA disk available to Hadoop. Nodes are connected via Gigabit Ethernet. FastQC read its data from a high-performance shared parallel file system by DDN. SeqPig used the Hadoop file system, which uses each node’s local disk drive. We first ran five different SeqPig read statistics for a different number of computing nodes: the sample distribution of (i) the average base quality of the reads; (ii) the length of reads; (iii) bases by position inside the reads; and (iv) the GC contents of reads. Finally, we combined them into a single script. Each of the executions results in a single MapReduce job and thus a single scan through the data. All runs were repeated three times and averaged (deviation from average <7%). From Figure 3 one can see that it is possible to achieve a significant speed-up by exploiting the parallelism in read and base statistics computation using Hadoop. Further, the total runtime of the script that computes all statistics is mostly determined by the slowest of the individual ones, as the complete script is compiled into a single map-only job. A different observation is that for most of the statistics computed, we are able to achieve a close to linear speed-up compared with FastQC until 48 nodes. We assume that the leveling off is due to the Hadoop job overhead eventually dominating over speed-up due to parallelization, depending on input file size.

Fig. 3.

Results of an experiment with an input file of 61.4 GB and a different number of Hadoop worker nodes. The script does not currently implement all FastQC statistics (we expect the missing ones to scale similarly in SeqPig), whereas the per-cycle quality distribution is not computed by FastQC SeqPig enables simple and scalable manipulation and analysis of sequencing data on the Hadoop platform. At CRS4 SeqPig is already used routinely for several steps in the production workflow; in addition, it has been successfully used for ad hoc investigations into data quality issues, comparison of alignment tools and reformatting and packaging data. We have also tested SeqPig on Amazon’s Elastic MapReduce service, where users may rent computing time on the cloud to run their SeqPig scripts and even share their S3 storage buckets with other cloud-enabled software. Instructions are provided in the Supplementary Material.

10 in total

1. Biology: The big challenges of big data.

Authors: Vivien Marx
Journal: Nature Date: 2013-06-13 Impact factor: 49.962

2. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data.

Authors: Henrik Nordberg; Karan Bhatia; Kai Wang; Zhong Wang
Journal: Bioinformatics Date: 2013-09-10 Impact factor: 6.937

3. The case for cloud computing in genome informatics.

Authors: Lincoln D Stein
Journal: Genome Biol Date: 2010-05-05 Impact factor: 13.583

4. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.

Authors: Ronald C Taylor
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

5. SeqWare Query Engine: storing and searching sequence data in the cloud.

Authors: Brian D O'Connor; Barry Merriman; Stanley F Nelson
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

6. SEAL: a distributed short read mapping and duplicate removal tool.

Authors: Luca Pireddu; Simone Leo; Gianluigi Zanetti
Journal: Bioinformatics Date: 2011-06-22 Impact factor: 6.937

7. SAMQA: error classification and validation of high-throughput sequenced read data.

Authors: Thomas Robinson; Sarah Killcoyne; Ryan Bressler; John Boyle
Journal: BMC Genomics Date: 2011-08-18 Impact factor: 3.969

8. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

Authors: Matti Niemenmaa; Aleksi Kallio; André Schumacher; Petri Klemelä; Eija Korpelainen; Keijo Heljanko
Journal: Bioinformatics Date: 2012-02-02 Impact factor: 6.937

9. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds.

Authors: Sebastian Schönherr; Lukas Forer; Hansi Weißensteiner; Florian Kronenberg; Günther Specht; Anita Kloss-Brandstätter
Journal: BMC Bioinformatics Date: 2012-08-13 Impact factor: 3.169

10. Searching for SNPs with cloud computing.

Authors: Ben Langmead; Michael C Schatz; Jimmy Lin; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-11-20 Impact factor: 13.583

10 in total

15 in total

1. SeqHBase: a big data toolset for family based sequencing data analysis.

Authors: Min He; Thomas N Person; Scott J Hebbring; Ethan Heinzen; Zhan Ye; Steven J Schrodi; Elizabeth W McPherson; Simon M Lin; Peggy L Peissig; Murray H Brilliant; Jason O'Rawe; Reid J Robison; Gholson J Lyon; Kai Wang
Journal: J Med Genet Date: 2015-01-13 Impact factor: 6.318

Review 2. Next generation distributed computing for cancer research.

Authors: Pankaj Agarwal; Kouros Owzar
Journal: Cancer Inform Date: 2015-04-27

3. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data.

Authors: Alexey Siretskiy; Tore Sundqvist; Mikhail Voznesenskiy; Ola Spjuth
Journal: Gigascience Date: 2015-06-04 Impact factor: 6.524

Review 4. Single-cell Transcriptome Study as Big Data.

Authors: Pingjian Yu; Wei Lin
Journal: Genomics Proteomics Bioinformatics Date: 2016-02-11 Impact factor: 7.691

5. START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries.

Authors: Xinjie Zhu; Qiang Zhang; Eric Dun Ho; Ken Hung-On Yu; Chris Liu; Tim H Huang; Alfred Sze-Lok Cheng; Ben Kao; Eric Lo; Kevin Y Yip
Journal: BMC Genomics Date: 2017-09-22 Impact factor: 3.969

6. Benchmarking distributed data warehouse solutions for storing genomic variant information.

Authors: Marek S Wiewiórka; Dawid P Wysakowicz; Michal J Okoniewski; Tomasz Gambin
Journal: Database (Oxford) Date: 2017-01-01 Impact factor: 3.451

7. Experiences with workflows for automating data-intensive bioinformatics.

Authors: Ola Spjuth; Erik Bongcam-Rudloff; Guillermo Carrasco Hernández; Lukas Forer; Mario Giovacchini; Roman Valls Guimera; Aleksi Kallio; Eija Korpelainen; Maciej M Kańduła; Milko Krachunov; David P Kreil; Ognyan Kulev; Paweł P Łabaj; Samuel Lampa; Luca Pireddu; Sebastian Schönherr; Alexey Siretskiy; Dimitar Vassilev
Journal: Biol Direct Date: 2015-08-19 Impact factor: 4.540

8. CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.

Authors: Wei-Chun Chung; Chien-Chih Chen; Jan-Ming Ho; Chung-Yen Lin; Wen-Lian Hsu; Yu-Chun Wang; D T Lee; Feipei Lai; Chih-Wei Huang; Yu-Jung Chang
Journal: PLoS One Date: 2014-06-04 Impact factor: 3.240

9. Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.

Authors: Aaron Y Lee; Cecilia S Lee; Russell N Van Gelder
Journal: BMC Bioinformatics Date: 2016-07-28 Impact factor: 3.169

Review 10. Big Data Application in Biomedical Research and Health Care: A Literature Review.

Authors: Jake Luo; Min Wu; Deepika Gopukumar; Yiqing Zhao
Journal: Biomed Inform Insights Date: 2016-01-19