Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Xiaobo Sun¹, Jingjing Gao², Peng Jin³, Celeste Eng⁴, Esteban G Burchard⁴, Terri H Beaty⁵, Ingo Ruczinski⁶, Rasika A Mathias⁷, Kathleen Barnes⁸, Fusheng Wang⁹, Zhaohui S Qin²,¹⁰.

Abstract

Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient when processing large numbers of files because of excessive computation time and the input/output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED file or a single VCF file, benchmarking them against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or show much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
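To make the operation concrete, the following is a minimal single-machine sketch of a multiway sorted merge, the kind of traditional baseline the abstract refers to. It is not the authors' Hadoop, HBase, or Spark schema. The names `tag_records` and `merge_sorted_variants`, the in-memory record layout, and the use of integer chromosome codes (so that tuples sort in genomic order) are all assumptions made for illustration.

```python
import heapq
from itertools import groupby


def tag_records(records, subject_idx):
    """Attach the subject index to each (chrom, pos, genotype) record."""
    for chrom, pos, genotype in records:
        yield (chrom, pos, subject_idx, genotype)


def merge_sorted_variants(files, missing="./."):
    """Multiway-merge per-subject records into one row per genomic location.

    `files` is a hypothetical stand-in for parsed VCF files: a list with one
    entry per subject, each a list of (chrom, pos, genotype) tuples that is
    already sorted by (chrom, pos). Chromosomes are assumed to be integers.
    """
    streams = [tag_records(recs, idx) for idx, recs in enumerate(files)]
    # heapq.merge lazily merges already-sorted inputs; tuples compare by
    # (chrom, pos, subject_idx, genotype), so output is in genomic order.
    merged = heapq.merge(*streams)

    rows = []
    for (chrom, pos), group in groupby(merged, key=lambda r: (r[0], r[1])):
        # One genotype column per subject; fill gaps with a missing call.
        genotypes = [missing] * len(files)
        for _, _, subject_idx, genotype in group:
            genotypes[subject_idx] = genotype
        rows.append((chrom, pos, genotypes))
    return rows


if __name__ == "__main__":
    subject_a = [(1, 100, "0/1"), (1, 300, "1/1")]
    subject_b = [(1, 100, "0/0"), (2, 50, "0/1")]
    for row in merge_sorted_variants([subject_a, subject_b]):
        print(row)
```

In this sketch a single process holds one open stream per input, which hints at why the single-machine approach struggles as the number of files grows; the divide-and-conquer schemas described in the abstract instead split the job into subtasks that run in parallel.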

Year: 2018 PMID: 29762754 PMCID: PMC6007233 DOI: 10.1093/gigascience/giy052

Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
