Runxin Guo1, Yi Zhao2, Quan Zou3, Xiaodong Fang4, Shaoliang Peng1,5.
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset (RDD) abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop for in-memory access and 10 times faster for disk access. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey provide a comprehensive guideline for bioinformatics researchers applying Spark in their own fields.
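The RDD abstraction mentioned above can be illustrated with a toy, pure-Python model (a conceptual sketch only, not Spark's actual implementation): transformations are recorded lazily as a lineage, and an action replays that lineage over the source data, which is also how Spark can recompute lost partitions instead of replicating the data.

```python
# Toy model of Spark's RDD lineage idea (NOT Spark itself): transformations
# are recorded lazily, and the data can always be recomputed from the
# lineage, which is how Spark recovers lost partitions without replication.

class ToyRDD:
    def __init__(self, source, ops=None):
        self.source = source     # the original data
        self.ops = ops or []     # recorded lineage of transformations

    def map(self, f):
        # Lazy: nothing is computed yet, only the lineage grows.
        return ToyRDD(self.source, self.ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self.source, self.ops + [("filter", p)])

    def collect(self):
        # An action: replay the recorded lineage over the source data.
        data = list(self.source)
        for kind, f in self.ops:
            if kind == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

reads = ToyRDD(["ACGT", "GGC", "TTAGG"])
lengths = reads.map(len).filter(lambda n: n > 3)
print(lengths.collect())  # [4, 5]
```

In real Spark the same chain would be written against a distributed RDD (e.g. `sc.parallelize(...)`), with partitions spread over the cluster.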
Year: 2018 PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Bioinformatics tools and algorithms based on Apache Spark
| Name | Function | Features | Pros/Cons | Reference |
|---|---|---|---|---|
| SparkSW | Alignment | Consists of three phases: data preprocessing, SW as map tasks, and top-K records as reduce tasks | Load-balanced and scalable, but does not report mapping locations or the traceback of the optimal alignment | |
| DSA | Alignment and mapping | Leverages a data-parallel strategy based on SIMD instructions | Up to 201 times faster than SparkSW, with almost linear speedup as the number of cluster nodes increases | |
| CloudSW | Alignment and mapping | Leverages SIMD instructions and provides API services in the cloud | Up to 3.29 times faster than DSA and 621 times faster than SparkSW; high scalability and efficiency | |
| SparkBWA | Alignment and mapping | Consists of three main stages: RDD creation, map, and reduce phases; employs two independent software layers | For shorter reads, averages 1.9x and 1.4x faster than SEAL and pBWA, respectively; for longer reads, averages 1.4x faster than BigBWA and Halvade, but requires the data to be available in HDFS | |
| StreamBWA | Alignment and mapping | Input data are streamed into the cluster directly from a compressed file | ∼2x faster than the nonstreaming approach and 5x faster than SparkBWA | |
| PASTASpark | Alignment and mapping | Employs an in-memory RDD of key-value pairs to parallelize the MSA calculation phase | Up to 10x faster than single-threaded PASTA; ensures scalability and fault tolerance | |
| PPCAS | Alignment and mapping | Based on the MapReduce processing paradigm in Spark | Performs better even on a single node and shows almost linear speedup with increasing numbers of nodes | |
| SparkBLAST | Alignment and mapping | Utilizes | Outperforms CloudBLAST in terms of speed, scalability, and efficiency | |
| MetaSpark | Alignment and mapping | Consists of five steps: constructing | Recruits significantly more reads than SOAP2, BWA, and LAST, and more reads by ∼4 than FR-HIT; shows good scalability and overall high performance | |
| Spaler | Assembly | Employs Spark's GraphX API; consists of two main parts: de Bruijn graph construction and contig generation | Shows better scalability and achieves comparable or better assembly quality than ABySS, Ray, and SWAP-Assembler | |
| SA-BR-Spark | Assembly | Uses the strategy of finding the source of reads; based on the Spark platform | Shows superior computational speed compared with SA-BR-MR | |
| HiGene | Sequence analysis | Puts forward a dynamic computing-resource scheduler and an efficient way of mitigating data skew | Reduces total running time from days to nearly an hour; 2x faster than Halvade | |
| GATK-Spark | Sequence analysis | Takes full account of compute and workload characteristics | Achieves more than 37 times speedup | |
| SparkSeq | Sequence analysis | Builds and runs genomic analysis pipelines interactively using Spark | 8.4-9.15 times faster than SeqPig; accelerates data querying up to 110 times and reduces memory consumption by a factor of 13 | |
| CloudPhylo | Phylogeny | Evenly distributes the entire workload among worker nodes | Shows good scalability and high efficiency; the Spark version outperforms the Hadoop version | |
| S-CHEMO | Drug discovery | Intermediate data are consumed immediately on the producing nodes, saving time and bandwidth | Shows almost linear speedup on up to eight nodes compared with the original pipeline | |
| Falco | Single-cell RNA sequencing | Consists of a splitting step, an optional preprocessing step, and the main analysis step | At least 2.6x faster than a highly optimized single-node analysis; running time decreases as nodes are added | |
| VariantSpark | Variant association and population genetics studies | Parallelizes population-scale tasks based on Spark and the associated MLlib | 80% faster than ADAM, a Hadoop/Mahout implementation, and ADMIXTURE; more than 90% faster than R and Python implementations | |
| SEQSpark | Variant association and population genetics studies | Splits large-scale datasets into many small blocks to perform rare-variant association analyses | Consistently faster than Variant Association Tools and PLINK/SEQ; in some cases running time is reduced to 1% | |
| BioSpark | Data-parallel analysis on large, numerical datasets | Consists of a set of Java, C++, and Python libraries; abstractions for parallel analysis of standard data types; some APIs; and file conversion tools | Convenient, scalable, and useful; has domain-specific features for biological applications | |
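Several tools in the table (SparkSW most explicitly) follow the same pattern: score sequences in parallel map tasks, then keep only the top-K records in a reduce step. The plain-Python sketch below illustrates that pattern with a minimal Smith-Waterman local-alignment scorer; the reference, reads, and scoring parameters are invented for the example, and a real tool would distribute the map phase over a cluster and use tuned scoring matrices.

```python
# Minimal illustration of the "SW as map tasks, top-K as reduce tasks"
# pattern used by SparkSW-style tools. The scorer is a small local-alignment
# (Smith-Waterman) score with simple match/mismatch/gap costs.
import heapq

def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: scores never drop below zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

reference = "ACGTACGTGG"
reads = {"r1": "ACGT", "r2": "TTTT", "r3": "CGTACG"}

# "Map" phase: one (score, read_id) record per read.
scored = [(sw_score(seq, reference), rid) for rid, seq in reads.items()]

# "Reduce" phase: keep only the top-K records.
top2 = heapq.nlargest(2, scored)
print(top2)  # [(12, 'r3'), (8, 'r1')]
```

In Spark the map phase would run as tasks over an RDD of reads, and the top-K reduction would use something like `takeOrdered`.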
Figure 1:The cluster architecture of Spark.
Figure 2:Examples of narrow and wide dependencies. Each box is an RDD, where the partition is shown as a shaded rectangle.
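The distinction Figure 2 draws can be made concrete with a toy, plain-Python model (not Spark itself): a narrow dependency lets each output partition be computed from a single parent partition, while a wide dependency such as groupByKey needs data from every parent partition, forcing a shuffle.

```python
# Toy model of narrow vs. wide dependencies (plain Python, not Spark).
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow (map-like): each partition is transformed independently,
# so no data moves between partitions.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (groupByKey-like): every record is re-routed by hash(key), so each
# output partition depends on *all* input partitions - a shuffle.
n_out = 2
shuffled = [defaultdict(list) for _ in range(n_out)]
for part in mapped:
    for k, v in part:
        shuffled[hash(k) % n_out][k].append(v)

grouped = [dict(d) for d in shuffled]
print(grouped)
```

Which output partition each key lands in depends on the hash, but every key ends up in exactly one partition with all of its values, which is the guarantee a shuffle provides.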
Figure 3:An example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles and are black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, the output RDD of stage 1 is already in memory, so we run stage 2 and then stage 3.
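The pipelining of narrow transformations inside a stage, as described in Figure 3, can be sketched in plain Python (a conceptual model, not Spark's scheduler): within one stage, the chained map and filter operators are fused into a single pass over each partition instead of materializing an intermediate dataset per operator; a wide dependency would end the stage and trigger a shuffle.

```python
# Sketch of stage pipelining: all narrow transformations in a stage are
# applied record by record in one pass over the partition.
def pipeline(part, transforms):
    out = []
    for record in part:              # single pass over the partition
        keep = True
        for kind, f in transforms:
            if kind == "map":
                record = f(record)
            elif kind == "filter" and not f(record):
                keep = False         # record is dropped; stop the chain
                break
        if keep:
            out.append(record)
    return out

# One stage made of three fused narrow transformations.
stage = [
    ("map", lambda x: x + 1),
    ("filter", lambda x: x % 2 == 0),
    ("map", lambda x: x * x),
]
print(pipeline([1, 2, 3, 4], stage))  # [4, 16]
```

Because no operator needs data from another partition, the whole chain runs locally on each worker, which is why Spark only inserts stage boundaries at wide dependencies.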