MaRe: Processing Big Data with application containers on Apache Spark.

Marco Capuccini, Martin Dahlö, Salman Toor, Ola Spjuth.

Abstract

BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing.
RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are, respectively, the MapReduce framework and the container engine with the largest open source communities; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.
CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
© The Author(s) 2020. Published by Oxford University Press.

Keywords:  Apache Spark; Big Data; MapReduce; application containers; workflows

Year: 2020; PMID: 32369166; PMCID: PMC7199472; DOI: 10.1093/gigascience/giaa042

Source DB: PubMed; Journal: Gigascience; ISSN: 2047-217X; Impact factor: 6.524


