Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Xiaobo Sun, Jingjing Gao, Peng Jin, Celeste Eng, Esteban G Burchard, Terri H Beaty, Ingo Ruczinski, Rasika A Mathias, Kathleen Barnes, Fusheng Wang, Zhaohui S Qin.

Abstract

Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient when processing large numbers of files because of excessive computation time and input/output (I/O) bottlenecks. Distributed systems, and more recently cloud-based systems, offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt a divide-and-conquer strategy that splits the merging job into sequential phases/stages consisting of subtasks that are processed in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED file or a single VCF file, and benchmark them against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or offer much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
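As a concrete illustration of the sorted-merging operation the abstract describes, the following is a minimal single-machine sketch in Python. It is not the authors' Hadoop, HBase, or Spark schema. The record layout `(chrom, pos, genotype)`, the `CHROM_ORDER` table, the function names, and the choice to partition by chromosome are all assumptions made for illustration. Real VCF merging must also reconcile REF/ALT alleles and header metadata, which this sketch ignores.

```python
import heapq
from itertools import groupby

# Assumed chromosome ranking; a real VCF header defines its own contig order.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25})


def tag(sample_idx, records):
    """Turn (chrom, pos, genotype) into a tuple that sorts by genomic location."""
    for chrom, pos, genotype in records:
        yield (CHROM_ORDER[chrom], pos, sample_idx, chrom, genotype)


def sorted_merge(samples, missing="./."):
    """Multiway-merge per-sample records that are already sorted by location.

    Returns one row per locus: (chrom, pos, [genotype of each sample]).
    """
    streams = [tag(i, recs) for i, recs in enumerate(samples)]
    merged = heapq.merge(*streams)  # lazy k-way merge of sorted streams
    rows = []
    for (_, pos), group in groupby(merged, key=lambda t: (t[0], t[1])):
        genotypes = [missing] * len(samples)
        chrom = None
        for _, _, idx, chrom, genotype in group:
            genotypes[idx] = genotype
        rows.append((chrom, pos, genotypes))
    return rows


def merge_by_partition(samples):
    """Divide and conquer: merge each chromosome independently, then
    concatenate the partial results in chromosome order. On a cluster,
    each partition could be handled by a separate worker."""
    chroms = sorted({c for recs in samples for c, _, _ in recs},
                    key=CHROM_ORDER.get)
    merged = []
    for chrom in chroms:
        partition = [[r for r in recs if r[0] == chrom] for recs in samples]
        merged.extend(sorted_merge(partition))
    return merged


# Two toy "samples", each sorted by (chromosome, position).
sample_a = [("1", 100, "0/1"), ("1", 250, "1/1"), ("2", 50, "0/0")]
sample_b = [("1", 100, "0/0"), ("2", 50, "0/1"), ("2", 75, "1/1")]

for row in merge_by_partition([sample_a, sample_b]):
    print(row)
```

Because the partitions cover disjoint genomic regions, their outputs can be concatenated without a further sort, which is the property that makes this kind of job amenable to parallel execution.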

Year: 2018 | PMID: 29762754 | PMCID: PMC6007233 | DOI: 10.1093/gigascience/giy052

Source DB: PubMed | Journal: Gigascience | ISSN: 2047-217X | Impact factor: 6.524


