SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision.

Marek S Wiewiórka, Antonio Messina, Alicja Pacholewska, Sergio Maffioletti, Piotr Gawrysiak, Michał J Okoniewski.

Abstract

Many time-consuming analyses of next-generation sequencing data can be addressed with modern cloud computing. Apache Hadoop-based solutions have become popular in genomics because of their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying. The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them interactively. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running analyses of sequencing datasets. Tests of SparkSeq also show that caching and the HDFS block size can be tuned for optimal performance on multiple worker nodes.
AVAILABILITY AND IMPLEMENTATION: Available under open source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
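
The abstract describes MapReduce-style analysis with nucleotide precision. As a rough illustration only, the sketch below computes per-base coverage in plain Python using the same map-then-reduce-by-key pattern. It is not SparkSeq's API and does not use Spark: the `Read` tuple, `map_read_to_bases` and `base_coverage` are hypothetical names introduced here, and the reads are made-up toy data.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical simplified alignment: (chromosome, 1-based start, length).
Read = Tuple[str, int, int]


def map_read_to_bases(read: Read) -> Iterator[Tuple[Tuple[str, int], int]]:
    """Map step: emit ((chromosome, position), 1) for every base a read covers."""
    chrom, start, length = read
    for pos in range(start, start + length):
        yield (chrom, pos), 1


def base_coverage(reads: Iterable[Read]) -> Counter:
    """Reduce step: sum the emitted counts per (chromosome, position) key."""
    coverage: Counter = Counter()
    for read in reads:
        for key, count in map_read_to_bases(read):
            coverage[key] += count
    return coverage


# Toy input, assumed for illustration only.
reads = [("chr1", 100, 5), ("chr1", 103, 4), ("chr2", 10, 2)]
coverage = base_coverage(reads)
```

In a distributed framework the map step would run in parallel over partitions of the reads and the per-key sums would be combined across worker nodes; keeping the result in memory is what makes repeated interactive queries against it cheap.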

Year:  2014        PMID: 24845651     DOI: 10.1093/bioinformatics/btu343

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  18 in total

1.  Synonymous variants that disrupt messenger RNA structure are significantly constrained in the human population.

Authors:  Jeffrey B S Gaither; Grant E Lammi; James L Li; David M Gordon; Harkness C Kuck; Benjamin J Kelly; James R Fitch; Peter White
Journal:  Gigascience       Date:  2021-04-05       Impact factor: 6.524

2.  Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Authors:  Xiaobo Sun; Jingjing Gao; Peng Jin; Celeste Eng; Esteban G Burchard; Terri H Beaty; Ingo Ruczinski; Rasika A Mathias; Kathleen Barnes; Fusheng Wang; Zhaohui S Qin
Journal:  Gigascience       Date:  2018-06-01       Impact factor: 6.524

3.  Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application.

Authors:  Gaye Lightbody; Valeriia Haberland; Fiona Browne; Laura Taggart; Huiru Zheng; Eileen Parkes; Jaine K Blayney
Journal:  Brief Bioinform       Date:  2019-09-27       Impact factor: 11.622

4.  Engineering bioinformatics: building reliability, performance and productivity into bioinformatics software.

Authors:  Brendan Lawlor; Paul Walsh
Journal:  Bioengineered       Date:  2015-05-21       Impact factor: 3.269

5.  VariantSpark: population scale clustering of genotype information.

Authors:  Aidan R O'Brien; Neil F W Saunders; Yi Guo; Fabian A Buske; Rodney J Scott; Denis C Bauer
Journal:  BMC Genomics       Date:  2015-12-10       Impact factor: 3.969

6.  A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data.

Authors:  Alexey Siretskiy; Tore Sundqvist; Mikhail Voznesenskiy; Ola Spjuth
Journal:  Gigascience       Date:  2015-06-04       Impact factor: 6.524

7.  Single-cell Transcriptome Study as Big Data. (Review)

Authors:  Pingjian Yu; Wei Lin
Journal:  Genomics Proteomics Bioinformatics       Date:  2016-02-11       Impact factor: 7.691

8.  START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries.

Authors:  Xinjie Zhu; Qiang Zhang; Eric Dun Ho; Ken Hung-On Yu; Chris Liu; Tim H Huang; Alfred Sze-Lok Cheng; Ben Kao; Eric Lo; Kevin Y Yip
Journal:  BMC Genomics       Date:  2017-09-22       Impact factor: 3.969

9.  Benchmarking distributed data warehouse solutions for storing genomic variant information.

Authors:  Marek S Wiewiórka; Dawid P Wysakowicz; Michal J Okoniewski; Tomasz Gambin
Journal:  Database (Oxford)       Date:  2017-01-01       Impact factor: 3.451

10.  Big Data Application in Biomedical Research and Health Care: A Literature Review. (Review)

Authors:  Jake Luo; Min Wu; Deepika Gopukumar; Yiqing Zhao
Journal:  Biomed Inform Insights       Date:  2016-01-19