MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes.

Wei Zhou, Ruilin Li, Shuo Yuan, ChangChun Liu, Shaowen Yao, Jing Luo, Beifang Niu.

Abstract

Summary: With the advent of next-generation sequencing, traditional bioinformatics tools are challenged by massive raw metagenomic datasets. One bottleneck in metagenomic studies is the lack of data analysis tools suited to large-scale and cloud computing. In this paper, we propose a Spark-based tool, called MetaSpark, to recruit metagenomic reads to reference genomes. MetaSpark benefits from Spark's resilient distributed dataset (RDD), which lets it cache datasets in memory across cluster nodes and scale well as datasets grow. MetaSpark recruited significantly more reads than earlier metagenomic recruitment programs such as SOAP2, BWA and LAST, and increased recruited reads by ∼4% compared with FR-HIT on a test with 1 million reads and 0.75 GB of reference sequences. Different test cases demonstrate MetaSpark's scalability and overall high performance.

Availability: https://github.com/zhouweiyg/metaspark.

Contact: bniu@sccas.cn, jingluo@ynu.edu.cn.

Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
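
For readers unfamiliar with read recruitment, the following is a minimal sketch of the general seed-and-verify idea: index the k-mers of a reference, look up each read's k-mers as seeds, and keep the ungapped placement whose identity meets a threshold. It is an illustrative toy under stated assumptions, not MetaSpark's actual algorithm or API. The function names, the toy sequences, and the parameters `k` and `min_identity` are all hypothetical, and it runs on a single machine in plain Python rather than on Spark.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def recruit_read(read, reference, index, k, min_identity):
    """Return (ref_start, identity) for the best ungapped placement, or None."""
    best = None
    tried = set()  # candidate start positions already verified
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in tried:
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            matches = sum(a == b for a, b in zip(read, window))
            identity = matches / len(read)
            if identity >= min_identity and (best is None or identity > best[1]):
                best = (start, identity)
    return best


def recruit_reads(reads, reference, k=4, min_identity=0.9):
    """Recruit every read that has a placement at or above min_identity."""
    index = build_kmer_index(reference, k)
    hits = {}
    for name, seq in reads.items():
        hit = recruit_read(seq, reference, index, k, min_identity)
        if hit is not None:
            hits[name] = hit
    return hits


# Toy data (hypothetical): one exact read, one read with a single
# mismatch, and one read that shares no seed with the reference.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
reads = {
    "exact": "TAGCCGATAG",
    "one_mismatch": "GCTTATCGGA",
    "unrelated": "GGGGGGGGGG",
}
hits = recruit_reads(reads, reference)
print(hits)
```

In a distributed setting, the per-read function is the natural unit to parallelize: the reads would be partitioned across cluster nodes and each partition recruited independently, which is the kind of workload an in-memory RDD is designed for.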

Year:  2017        PMID: 28065898     DOI: 10.1093/bioinformatics/btw750

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  6 in total

1.  Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files.

Authors:  Xiaobo Sun; Jingjing Gao; Peng Jin; Celeste Eng; Esteban G Burchard; Terri H Beaty; Ingo Ruczinski; Rasika A Mathias; Kathleen Barnes; Fusheng Wang; Zhaohui S Qin
Journal:  Gigascience       Date:  2018-06-01       Impact factor: 6.524

2.  Analyzing large scale genomic data on the cloud with Sparkhit.

Authors:  Liren Huang; Jan Krüger; Alexander Sczyrba
Journal:  Bioinformatics       Date:  2018-05-01       Impact factor: 6.937

3.  Large scale microbiome profiling in the cloud.

Authors:  Camilo Valdes; Vitalii Stebliankin; Giri Narasimhan
Journal:  Bioinformatics       Date:  2019-07-15       Impact factor: 6.937

4.  Computational Strategies for Scalable Genomics Analysis.

Authors:  Lizhen Shi; Zhong Wang
Journal:  Genes (Basel)       Date:  2019-12-06       Impact factor: 4.096

5.  Bioinformatics applications on Apache Spark. (Review)

Authors:  Runxin Guo; Yi Zhao; Quan Zou; Xiaodong Fang; Shaoliang Peng
Journal:  Gigascience       Date:  2018-08-01       Impact factor: 6.524

6.  BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data.

Authors:  Jinxiang Chen; Fuyi Li; Miao Wang; Junlong Li; Tatiana T Marquez-Lago; André Leier; Jerico Revote; Shuqin Li; Quanzhong Liu; Jiangning Song
Journal:  Front Big Data       Date:  2022-01-18