Liren Huang, Jan Krüger, Alexander Sczyrba.
Abstract
Motivation: The increasing amount of next-generation sequencing data poses a fundamental challenge to large-scale genomic analytics. Existing tools use different distributed computational platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently, and they incur heavy run-time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform.
Year: 2018 PMID: 29253074 PMCID: PMC5925781 DOI: 10.1093/bioinformatics/btx808
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Features of different cloud-based bioinformatics tools
| Tools | Platform | Implementation | Application | Methods and tools |
|---|---|---|---|---|
| Cloudburst | Hadoop | Native Java re-implementation | Mapping | Seed and extend |
| Crossbow | Hadoop | Invoking external utilities | Mapping and genotyping | Bowtie and SOAPsnp |
| Halvade | Hadoop | Invoking external utilities | Mapping | BWA and GATK |
| Myrna | Hadoop | Invoking external utilities | Mapping and gene expression profiling | Bowtie |
| CS-BWAMEM | Spark | Native Scala re-implementation | Mapping | Burrows-Wheeler transform |
| SparkSW | Spark | Native Scala re-implementation | Mapping | Smith-Waterman |
| SparkBWA | Spark | Invoking external utilities | Mapping | BWA |
| MetaSpark | Spark | Native Scala re-implementation | Fragment recruitment | Seed and extend |
| Sparkhit | Spark | Both native and external | Fragment recruitment, mapping | Seed and extend and a collection of tools |
Fig. 1. A distributed computational framework for large-scale genomic analysis. (A) Architecture of a Spark cluster deployed on the Amazon cloud. The yellow boxes represent Amazon EC2 instances that are virtualized into Spark master/worker nodes. (B) Distributed implementation of Sparkhit-recruiter. The reference index, illustrated as a blue dashed box, is built on a driver node and broadcast to each worker node. Sequencing reads, illustrated as red dashes, are loaded into an RDD and queried against the broadcast reference index in parallel as a ‘Map’ step. A ‘Reduce’ step follows to summarize the mapping result. (C) Pipeline of Sparkhit-recruiter. The reference genome, illustrated as a bold blue dash, is extracted and built into a K-mer hash table. The sequencing read, illustrated as a bold red dash, is searched against the reference hash table for exact matches. A smaller K-mer is used to apply the q-gram filter. (D) Pipeline of Sparkhit-mapper. It is similar to (C), but uses the pigeonhole principle. (E, H) Using external tools and Docker containers for different analyses. Genomic data is loaded into an RDD and distributed across worker nodes. Each partition of the RDD is sent to external tools to be processed independently. (F) Different modules of the machine learning library. Colored dots denote vectors of either genotypes or gene expressions. (G) Parallel decompression. A Bzip2 file is split into blocks and stored on three worker nodes. Each block is decompressed independently. (Color version of this figure is available at Bioinformatics online.)
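The seed lookup described in panels (B) and (C) can be illustrated with a minimal, single-machine sketch in plain Python. It is not Sparkhit's code and does not use Spark: the function names (`build_kmer_index`, `seed_hits`, `recruit`) are hypothetical, a list comprehension stands in for the distributed ‘Map’ step, and a `Counter` stands in for the ‘Reduce’ step. The q-gram filter and the extension step are omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def seed_hits(read: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Return candidate reference start positions implied by exact k-mer matches."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            start = ref_pos - offset  # where the read would begin on the reference
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


def recruit(reads: List[str], reference: str, k: int) -> Tuple[list, Counter]:
    """Hypothetical helper: query all reads against one shared index."""
    # The index is built once; on a cluster this is the object that
    # would be shared with every worker.
    index = build_kmer_index(reference, k)
    # "Map": each read is queried independently of the others.
    mapped = [(read, seed_hits(read, index, k)) for read in reads]
    # "Reduce": summarize how many reads found at least one seed.
    summary = Counter("recruited" if hits else "unmapped" for _, hits in mapped)
    return mapped, summary
```

For example, `recruit(["GTTAGC", "TTTTTT"], "ACGTACGTTAGC", 4)` reports the first read as a candidate at reference position 6 and leaves the second read unmapped. Because each read is processed independently against a read-only index, the ‘Map’ step needs no coordination between workers.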
Fig. 2. Performance benchmarks for Sparkhit. (A–D) Run time comparisons between different aligners. The comparisons were carried out across different sizes of input fastq files, different sizes of reference genomes and different numbers of worker nodes. (E) Run time performance of Sparkhit-recruiter for recruiting 100–1000 GB sequencing data to a 72 MB reference genome on a 30-node Spark cluster deployed on the Amazon EC2 cloud. Each node has 32 vCPUs. (F) Scaling performance of Sparkhit-recruiter. As the number of worker nodes increases, the mean speedups are measured by comparing the run times to the run time on 10 worker nodes. We recruited 1.3 TB fastq files (Data-1) to a 72 MB reference genome (Ref-2) on the same cluster as in (E). (G) Run time comparisons between Crossbow and Sparkhit for preprocessing 338 TB compressed fastq files on 50 and 100 worker nodes. (H) Comparing the recruited number of reads between Crossbow and Sparkhit-recruiter when mapping 1.3 TB fastq files to a 72 MB reference genome. (I–J) Run times of the machine learning library on a private cluster and the Amazon EC2 cloud. All computations were performed on a 200 GB VCF file cached in memory. (K) Run times for different iterations of K-means clustering. We ran the iterations on the same VCF file as in (I–J), with and without data caching. (L–M) Sensitivity and accuracy comparisons between mapping tools.
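The speedup measure in panel (F), run time on 10 worker nodes divided by run time on a larger cluster, can be written out as a short Python sketch. The function names are hypothetical, and the run times used in the example that follows are placeholders chosen for easy arithmetic, not measurements from the paper. The efficiency helper is an added convenience that compares each speedup with the ideal linear value.

```python
from typing import Dict


def speedups(run_times: Dict[int, float], baseline_nodes: int = 10) -> Dict[int, float]:
    """Speedup of each cluster size relative to the baseline cluster size.

    run_times maps a number of worker nodes to a run time (any consistent unit).
    """
    base = run_times[baseline_nodes]
    return {nodes: base / t for nodes, t in sorted(run_times.items())}


def scaling_efficiency(run_times: Dict[int, float], baseline_nodes: int = 10) -> Dict[int, float]:
    """Observed speedup divided by the ideal (linear) speedup, nodes / baseline_nodes."""
    return {
        nodes: s / (nodes / baseline_nodes)
        for nodes, s in speedups(run_times, baseline_nodes).items()
    }
```

With placeholder run times of `{10: 120.0, 20: 60.0, 30: 48.0}`, `speedups` returns 1.0, 2.0 and 2.5, and `scaling_efficiency` shows the 30-node run at roughly 0.83 of linear scaling.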
Fig. 4. Large-scale genomic data analyses on the cloud with Sparkhit. (A) Run time comparison between three auto-scaling tools for deploying a Spark cluster on the Amazon EC2 cloud. Durations include the time spent pending approval of the EC2 spot request and waiting for an SSH connection to each EC2 instance. EMR, Amazon Elastic MapReduce service. (B) Run times for processing all WGS data from the Human Microbiome Project. Mapping was carried out using Sparkhit-recruiter, while profiling was carried out using Sparkhit-invoked Kraken. (C) Run times for processing 15 TB BAM files of the 3000 Rice Genome Project. We uploaded the variant calling result to Amazon S3. (D) Run times for processing 5.6 TB compressed sequencing data. Mapping was carried out using the Sparkhit-invoked BWA aligner. We uploaded the SAM files to Amazon S3. (E) Run times for processing 3.2 TB RNA-seq data. Gene expression profiling was carried out using Sparkhit-invoked Kallisto. (F) Fast access to genomic data on public repositories. Datasets of the Human Microbiome Project, the 3000 Rice Genome Project and the 1000 Genomes Project are hosted in different regions on Amazon S3, whereas the RNA-seq data of a prostate cancer transcriptomic study is stored on the ENA FTP server.
Fig. 3. Comparisons between Sparkhit-recruiter and MetaSpark on metagenomic fragment recruitment. (A) Run times for recruiting simulated sequencing reads to 72 MB and 142 MB reference genomes. All tests were carried out on Spark clusters with 10, 20 and 30 worker nodes. Each worker node has 16 vCPUs. Run times are presented on a base-2 logarithmic scale. (B) Numbers of recruited reads when recruiting 6 million simulated reads to the 72 MB reference genome and 1 million simulated reads to the 142 MB reference genome.