Hamid Mushtaq, Nauman Ahmed, Zaid Al-Ars.
Abstract
The rapidly falling cost of next-generation sequencing (NGS) has increased interest in using NGS data to diagnose genetic diseases. However, NGS technology typically generates on the order of hundreds of gigabytes of data per experiment, so the analysis requires efficient and scalable programs. This paper presents SparkGA2, a memory-efficient, production-quality framework for high-performance DNA analysis in the cloud, which scales with the available computational resources as the number of nodes increases. Our framework uses Apache Spark's ability to cache data in memory to speed up processing, while also allowing the user to run the framework on systems with less memory at the cost of slightly lower performance. To manage the memory footprint, we compress intermediate data on the fly, which reduces memory requirements by up to 3x. Our framework also streams input data gradually while processing is taking place. Together, these features make our framework faster than other state-of-the-art approaches while allowing users to adapt it to clusters with less memory. Compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. With the streaming solution, where data pre-processing is taken into account, SparkGA2 is 51% faster on a 6-node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.
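The idea of compressing intermediate data on the fly can be illustrated with a minimal Python sketch. The abstract does not say which codec or data layout SparkGA2 uses, so everything here is an assumption: `zlib` stands in for the codec, and `compress_chunk`, `decompress_chunk` and `make_records` are hypothetical helpers operating on made-up SAM-like text lines.

```python
import zlib


def compress_chunk(records, level=1):
    """Join text records and deflate them into one in-memory blob.

    A low compression level is used on the assumption that speed matters
    more than ratio for short-lived intermediate data.
    """
    raw = "\n".join(records).encode("utf-8")
    return zlib.compress(raw, level)


def decompress_chunk(blob):
    """Inverse of compress_chunk: inflate the blob and split it into records."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []


def make_records(n):
    """Hypothetical SAM-like lines, for illustration only (not real data)."""
    return ["read%d\tchr1\t%d\t%s" % (i, 1000 + i, "ACGT" * 25) for i in range(n)]


records = make_records(1000)
blob = compress_chunk(records)

# Size of the uncompressed payload: record bytes plus newline separators.
raw_bytes = sum(len(r) for r in records) + len(records) - 1
ratio = raw_bytes / len(blob)

if __name__ == "__main__":
    print(f"raw: {raw_bytes} bytes, compressed: {len(blob)} bytes, ratio: {ratio:.1f}x")
```

The ratio obtained on this synthetic input says nothing about the 3x figure reported for real sequencing data; the sketch only shows the compress-on-write, decompress-on-read pattern.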
Year: 2019 PMID: 31805063 PMCID: PMC6894754 DOI: 10.1371/journal.pone.0224784
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of tools used in GATK best-practices pipeline and SparkGA2.
| Step | GATK | SparkGA2 |
|---|---|---|
| | BWA mem | BWA mem |
| | Picard | Picard’s Java library |
| | Picard | Sorting in Scala |
| | Picard | Picard |
| | GATK | GATK |
| | GATK | GATK |
| | GATK | GATK |
Fig 1. Data flow of SparkGA2.
Fig 2. Mapping output in SparkGA2.
Fig 3. Regions and tradeoff.
Runtime of SparkGA2 on Microsoft cloud.
| Step | 4 nodes (min) | 5 nodes (min) | 6 nodes (min) |
|---|---|---|---|
| Step 1 | 292 | 222 | 192 |
| Step 2 | 79 | 66 | 48 |
| Step 3 | 160 | 129 | 104 |
| Total | 531 | 417 | 344 |
Runtime of SparkGA2 with NA12878 on the SURFsara cluster.
| Step | 6 nodes (min) | 24 nodes (min) | 48 nodes (min) | 67 nodes (min) |
|---|---|---|---|---|
| Step 1 | 331.5 | 72.5 | 38.5 | 29 |
| Step 2 | 67.5 | 15.5 | 11 | 10 |
| Step 3 | 422.5 | 101 | 62 | 47 |
| Total | 821.5 | 189 | 111.5 | 87 |
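How well the runtime scales with cluster size can be read off the last row of the table above. The sketch below is an illustration, not part of the original analysis: it computes speedup and parallel efficiency relative to the 6-node run, and the function name `scaling` and the choice of 6 nodes as baseline are assumptions made here.

```python
# Total runtimes in minutes per cluster size, copied from the last row
# of the table above (NA12878 on the SURFsara cluster).
TOTAL_MINUTES = {6: 821.5, 24: 189.0, 48: 111.5, 67: 87.0}


def scaling(totals, base_nodes=6):
    """Return {nodes: (speedup, efficiency)} relative to the base cluster.

    speedup    = base runtime / runtime on this cluster
    efficiency = speedup / (nodes / base_nodes), i.e. 1.0 is linear scaling
    """
    base_time = totals[base_nodes]
    result = {}
    for nodes, minutes in sorted(totals.items()):
        speedup = base_time / minutes
        ideal = nodes / base_nodes
        result[nodes] = (speedup, speedup / ideal)
    return result


if __name__ == "__main__":
    for nodes, (speedup, efficiency) in scaling(TOTAL_MINUTES).items():
        print(f"{nodes:>2} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With these totals, going from 6 to 67 nodes gives a speedup of about 9.4x for 11.2x as many nodes, an efficiency of roughly 0.85.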
Comparison of SparkGA2 (SGA2) vs SparkGA (SGA) on Microsoft cloud with 4, 5 and 6 nodes.
| Step | 4 nodes: SGA (min) | 4 nodes: SGA2 (min) | 5 nodes: SGA (min) | 5 nodes: SGA2 (min) | 6 nodes: SGA (min) | 6 nodes: SGA2 (min) |
|---|---|---|---|---|---|---|
| Step 1 | 347 | 292 | 261 | 222 | 212 | 192 |
| Step 2 | 76 | 79 | 60 | 66 | 51 | 48 |
| Step 3 | 155 | 160 | 124 | 129 | 113 | 104 |
| Total | 578 | 531 | 445 | 417 | 376 | 344 |
Fig 4. Performance comparison with SparkGA on the Microsoft cluster.
Comparison of SparkGA2 vs SparkGA on SURFsara cluster with 6, 24, 48 and 67 nodes, for NA12878.
| Step | 6 nodes: SGA (min) | 6 nodes: SGA2 (min) | 24 nodes: SGA (min) | 24 nodes: SGA2 (min) | 48 nodes: SGA (min) | 48 nodes: SGA2 (min) | 67 nodes: SGA (min) | 67 nodes: SGA2 (min) |
|---|---|---|---|---|---|---|---|---|
| Step 1 | 316 | 331.5 | 85 | 72.5 | 46 | 38.5 | 37.5 | 30 |
| Step 2 | 95 | 67.5 | 29 | 15.5 | 17 | 11 | 13 | 10 |
| Step 3 | 419 | 422.5 | 117.5 | 101 | 70.5 | 62 | 55.5 | 47 |
| Total | 830 | 821.5 | 231.5 | 189 | 133.5 | 111.5 | 106 | 87 |
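The 22% figure quoted in the abstract for 67 nodes can be checked against the 67-node values in the table above. The sketch below assumes that improvement means how much faster SparkGA2 is relative to its own runtime (SGA time divided by SGA2 time, minus one); that definition is not stated in this record, but it reproduces the abstract's number. The helper name `improvement_pct` is hypothetical.

```python
# Minutes per step on 67 nodes (NA12878), copied from the table above.
SGA = {"Step 1": 37.5, "Step 2": 13.0, "Step 3": 55.5}
SGA2 = {"Step 1": 30.0, "Step 2": 10.0, "Step 3": 47.0}


def improvement_pct(old_minutes, new_minutes):
    """Assumed definition: percentage by which the new run is faster,
    measured relative to the new runtime."""
    return (old_minutes / new_minutes - 1.0) * 100.0


# Improvement of each step, and of the summed pipeline runtime.
per_step = {step: improvement_pct(SGA[step], SGA2[step]) for step in SGA}
overall = improvement_pct(sum(SGA.values()), sum(SGA2.values()))

if __name__ == "__main__":
    for step, pct in per_step.items():
        print(f"{step}: {pct:.1f}% faster")
    print(f"Overall: {overall:.1f}% faster")
```

Under this definition the overall improvement is about 21.8%, which rounds to the reported 22%; Step 2 shows the largest relative gain and Step 3 the smallest.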
Fig 5. Performance comparison with SparkGA on the SURFsara cluster.
Comparison of SparkGA2 vs SparkGA on SURFsara cluster with 67 nodes, with different benchmarks.
| Step | ERR194147: SGA (min) | ERR194147: SGA2 (min) | ERR194160: SGA (min) | ERR194160: SGA2 (min) | NA12878: SGA (min) | NA12878: SGA2 (min) |
|---|---|---|---|---|---|---|
| Step 1 | 41 | 35 | 43.5 | 34 | 37.5 | 30 |
| Step 2 | 14.5 | 10 | 15 | 10 | 13 | 10 |
| Step 3 | 57.5 | 58.5 | 54 | 56 | 55.5 | 47 |
| Total | 113 | 103.5 | 112.5 | 100 | 106 | 87 |
Fig 6. Profile of a node with SparkGA on the 6-node Microsoft cloud cluster.
Fig 7. Profile of a node with SparkGA2 on the 6-node Microsoft cloud cluster.
Runtime in minutes using the streaming approach on Microsoft cloud with 6 nodes.
| Step | SparkGA (min) | SparkGA2 (min) |
|---|---|---|
| Chunking | 143.5 | - |
| Step 1 | 212 | 192 |
| Step 2 | 51 | 48 |
| Step 3 | 113 | 104 |
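The effect of counting pre-processing can be worked out from the table above. The sketch below reads the "-" in the SparkGA2 column as meaning that no separate chunking stage is timed (an interpretation made here, since the input is streamed), and it measures the gain as the ratio of end-to-end times minus one. The names `faster_pct`, `steps_only` and `end_to_end` are hypothetical.

```python
# Minutes on the 6-node Microsoft cloud cluster, copied from the table above.
SPARKGA_CHUNKING = 143.5
SPARKGA_STEPS = [212, 51, 113]
SPARKGA2_STEPS = [192, 48, 104]  # "-" for chunking is read as no separate stage


def faster_pct(baseline_minutes, candidate_minutes):
    """Percentage by which the candidate is faster, relative to its own runtime."""
    return (baseline_minutes / candidate_minutes - 1.0) * 100.0


# Comparing only the three pipeline steps.
steps_only = faster_pct(sum(SPARKGA_STEPS), sum(SPARKGA2_STEPS))

# Adding SparkGA's chunking time to its total before comparing.
end_to_end = faster_pct(SPARKGA_CHUNKING + sum(SPARKGA_STEPS), sum(SPARKGA2_STEPS))

if __name__ == "__main__":
    print(f"Steps only: {steps_only:.1f}% faster")
    print(f"Including chunking: {end_to_end:.1f}% faster")
```

The two results, about 9.3% and 51.0%, match the 9% and 51% quoted in the abstract for the 6-node cluster, which suggests the interpretation of "-" and of the improvement metric is at least consistent with the reported figures.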
Maximum memory consumed by number of regions.
| 122 GB | 69 GB | 37 GB | 29 GB | 20 GB | 20 GB |
Runtime in minutes for Step 2 on Microsoft cloud with 6 nodes, with different number of regions.
| 48 min | 51 min | 56 min |