Lizhen Shi1, Xiandong Meng2,3, Elizabeth Tseng4, Michael Mascagni1, Zhong Wang2,3,5. 1. Department of Computer Science, School of Computer Science, Florida State University, Tallahassee, FL, USA. 2. US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA. 3. Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 4. Pacific Biosciences Inc, Menlo Park, CA, USA. 5. School of Natural Sciences, University of California at Merced, Merced, CA, USA.
Abstract
MOTIVATION: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembling these datasets requires trade-offs between scalability and accuracy: current assembly methods optimized for scalability often sacrifice accuracy, and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

RESULTS: Here we describe a scalable, Apache Spark-based sequence clustering application, SparkReadClust (SpaRC), that partitions reads according to their molecule of origin to enable downstream assembly optimization. SpaRC achieves high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies, and scales near-linearly with input data size and the number of compute nodes. SpaRC runs without modification on both cloud-computing and HPC environments, delivering similar performance in each. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective platform with rapid development/deployment cycles for similar large-scale sequence data analysis problems.

AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/berkeleylab/jgi-sparc. Published by Oxford University Press 2018.
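The abstract does not detail SpaRC's distributed algorithm, but the core idea it describes, grouping reads that likely originate from the same molecule so each cluster can be assembled independently, can be illustrated with a minimal single-machine sketch: reads sharing at least one k-mer are merged into one cluster via union-find. Everything here (the function names, the choice of k, the toy reads) is an illustrative assumption, not SpaRC's actual implementation, which runs on Apache Spark over billions of reads.

```python
def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def cluster_reads(reads, k=5):
    """Group reads sharing at least one k-mer into clusters.

    Returns a list of clusters, each a sorted list of read indices.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Find the cluster representative, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen = {}  # k-mer -> first read index that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in seen:
                union(seen[km], idx)  # shared k-mer: merge clusters
            else:
                seen[km] = idx

    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(sorted(c) for c in clusters.values())

if __name__ == "__main__":
    reads = [
        "ACGTACGTAC",  # overlaps read 1
        "CGTACGTACG",  # overlaps read 0
        "TTTTGGGGCC",  # unrelated
    ]
    print(cluster_reads(reads, k=5))  # -> [[0, 1], [2]]
```

In a distributed setting this k-mer-to-reads mapping would be expressed as a shuffle (e.g. a group-by-key over k-mers) rather than an in-memory dictionary, which is what makes the approach amenable to Spark.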