Large-scale parallel genome assembler over cloud computing environment.

Arghya Kusum Das¹, Praveen Kumar Koppa¹, Sayan Goswami¹, Richard Platania¹, Seung-Jong Park¹.

Abstract

The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g., MapReduce) is relatively new. Consequently, both developing scalable data-intensive bioinformatics applications with this model and understanding the hardware environment these applications need for good performance require further research. In this paper, we present a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability, with competitive assembly quality, compared to contemporary parallel assemblers (e.g., ABySS and Contrail) on a traditional HPC cluster. Moreover, we show that the performance of GiGA improves significantly on an SSD-based private cloud infrastructure compared with a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of a traditional HPC cluster.
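
The de Bruijn graph approach named in the abstract can be illustrated with a small single-process sketch. This is not GiGA's code and does not use Hadoop or Giraph: the function names (`map_read`, `build_de_bruijn`, `compact_paths`) and the k-mer length are hypothetical choices made for illustration. It only shows the general idea of emitting k-mer edges per read (a map-like step), grouping them by node (a reduce-like step), and merging unambiguous chains into contigs (the kind of graph traversal a graph-processing framework would distribute).

```python
from collections import defaultdict


def map_read(read, k):
    """Map-like step: for each k-mer in a read, emit an edge between
    its (k-1)-mer prefix and its (k-1)-mer suffix."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        yield kmer[:-1], kmer[1:]


def build_de_bruijn(reads, k):
    """Reduce-like step: group emitted edges by their source node."""
    graph = defaultdict(set)
    for read in reads:
        for src, dst in map_read(read, k):
            graph[src].add(dst)
    return graph


def compact_paths(graph):
    """Merge chains of nodes with exactly one incoming and one outgoing
    edge into contigs. Isolated cycles made only of such nodes are not
    reported by this simplified version."""
    indeg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indeg[dst] += 1

    def is_simple(node):
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in list(graph):
        if is_simple(node):
            continue  # contigs start only at branching or source nodes
        for nxt in sorted(graph[node]):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return sorted(contigs)


if __name__ == "__main__":
    reads = ["ACGTT", "CGTTA"]  # toy, error-free reads
    print(compact_paths(build_de_bruijn(reads, k=3)))
```

On real data an assembler also has to handle sequencing errors, reverse complements, and graph simplification (tips, bubbles), all of which this sketch leaves out.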

Keywords: Big data genome assembly; Giraph; Hadoop; cloud computing; solid state drive (SSD); traditional HPC cluster

Year: 2017 PMID: 28610458 DOI: 10.1142/S0219720017400030

Source DB: PubMed Journal: J Bioinform Comput Biol ISSN: 0219-7200 Impact factor: 1.122

Cited

1 in total

1. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads.

Authors: Arghya Kusum Das; Sayan Goswami; Kisung Lee; Seung-Jong Park
Journal: BMC Genomics Date: 2019-12-20 Impact factor: 3.969
