Xi Yang, Chengkun Wu, Kai Lu, Lin Fang, Yong Zhang, Shengkang Li, Guixin Guo, YunFei Du.
Abstract
Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing already plays an active part in big data processing with the help of big data frameworks such as Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, to enable big data applications to run on Tianhe-2 via a single command or a shell script. Orion supports multiple users, and each user can launch multiple tasks. It minimizes the effort needed to initiate big data applications on the Tianhe-2 supercomputer through automated configuration. Orion follows the "allocate-when-needed" paradigm, which avoids idle occupation of computational resources. We tested the utility and performance of Orion using a big genomic dataset and achieved satisfactory performance on Tianhe-2 with very few modifications to existing applications implemented in Hadoop/Spark. In summary, Orion provides a practical and economical interface for big data processing on Tianhe-2.
Keywords: Hadoop; Spark; Tianhe-2; big data; genomics big data
Year: 2017 PMID: 29194413 PMCID: PMC6149962 DOI: 10.3390/molecules22122116
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The parallel processing mode for big data analytics.
Hardware and software settings of the Hadoop test. BGI: Beijing Genomics Institute; HDFS: Hadoop Distributed File System.
| Test Scenario Name | Hardware Setting | Analysis Setting |
|---|---|---|
| BGI | A Hadoop cluster with 18 nodes. Each node is equipped with 24 cores and 32 GB RAM. The storage uses HDFS, with a total capacity of 12 TB. | 1 master + 17 slaves. Each core is allocated 3 GB RAM, and a maximum of 10 cores were used on each node. |
| Orion-1 | A Hadoop cluster initiated and maintained by Orion on Tianhe-2 with 18 nodes. Each node is equipped with 24 cores and 64 GB RAM. The Tianhe-2 parallel filesystem is used directly. | 1 master + 17 slaves. Each core is allocated 8 GB of RAM, and a maximum of 6 cores were used on each node. |
| Orion-2 | The same hardware setting as Orion-1. | 1 master + 17 slaves. Each core is allocated 3 GB of RAM, and a maximum of 16 cores were used on each node. |
Performance comparison of four components of SOAPGaea.
| SOAPGaea Components | BGI | Orion-1 | Orion-2 |
|---|---|---|---|
| FASTQ Filtering | 24 m 43 s | 12 m 50 s | 10 m 23 s |
| Read Alignment | 1 h 35 m 56 s | 48 m 48 s | 49 m 49 s |
| Duplication Removal | 28 m 21 s | 15 m 38 s | 9 m 43 s |
| Quality Control | 1 h 30 m 2 s | 1 h 39 m | 46 m 38 s |
| Total processing time | 3 h 59 m 2 s | 2 h 56 m 16 s | 1 h 56 m 33 s |
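As a quick sanity check, the end-to-end speedups implied by the totals above can be recomputed directly from the table. The script below is an illustration only (the three totals are transcribed from the table, not remeasured), and is not part of Orion:

```shell
#!/bin/sh
# Recompute the SOAPGaea total processing times (in seconds) from the table
# and derive the speedup of each Orion scenario over the BGI baseline.
bgi=$((3*3600 + 59*60 + 2))      # 3 h 59 m 2 s
orion1=$((2*3600 + 56*60 + 16))  # 2 h 56 m 16 s
orion2=$((1*3600 + 56*60 + 33))  # 1 h 56 m 33 s
awk -v a="$bgi" -v b="$orion1" 'BEGIN { printf "Orion-1 speedup: %.2fx\n", a/b }'
awk -v a="$bgi" -v b="$orion2" 'BEGIN { printf "Orion-2 speedup: %.2fx\n", a/b }'
```

This works out to roughly 1.36x for Orion-1 and 2.05x for Orion-2 over the BGI baseline, consistent with the per-component improvements in the table.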
Hardware and software settings of the Spark test.
| Test Scenario Name | Hardware Setting | Analysis Setting |
|---|---|---|
| BGI | A Spark cluster with 18 nodes. Each node is equipped with 24 cores and 32 GB RAM. The storage uses HDFS, with a total capacity of 12 TB. | 1 master + 17 slaves. Each core is allocated 3 GB RAM, and a maximum of 10 cores were used on each node. |
| Orion-A | A Spark cluster initiated and maintained by Orion on Tianhe-2 with 100 nodes. Each node is equipped with 24 cores and 64 GB RAM. The Tianhe-2 parallel filesystem is used directly. | 1 master + 99 slaves. All 24 cores were used on each node, with a maximum total of 44 GB RAM per node. |
| Orion-B | A Spark cluster initiated and maintained by Orion on Tianhe-2 with 250 nodes. Each node is equipped with 24 cores and 64 GB RAM. The Tianhe-2 parallel filesystem is used directly. | 1 master + 249 slaves. All 24 cores were used on each node, with a maximum total of 44 GB RAM per node. |
Performance decomposition for GaeaDuplicate Spark in different settings.
| GaeaDuplicate_Spark | Read In | Compute | Write Out | Total |
|---|---|---|---|---|
| BGI | 17 m | 1.1 h | 40 m | 2 h |
| Orion-A | 25 m | 14 m | 40 m | 1.3 h |
| Orion-B | 32 m | 6 m | 25 m | 1.1 h |
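One way to read the decomposition above: as the node count grows from 18 (BGI) to 250 (Orion-B), the compute phase shrinks sharply while the read/write phases do not, so I/O comes to dominate the total. A small illustrative script (times transcribed from the table, in minutes) makes the I/O share explicit:

```shell
#!/bin/sh
# For each scenario: name, read-in, compute, write-out (minutes, from the table;
# BGI's 1.1 h compute phase is taken as 66 min).
for row in "BGI 17 66 40" "Orion-A 25 14 40" "Orion-B 32 6 25"; do
    set -- $row
    total=$(( $2 + $3 + $4 ))
    io=$(( $2 + $4 ))
    awk -v n="$1" -v io="$io" -v t="$total" \
        'BEGIN { printf "%s: I/O is %.0f%% of %d min\n", n, 100*io/t, t }'
done
```

The I/O share rises from roughly 46% of the total for BGI to about 90% for Orion-B, suggesting that at larger scales the shared filesystem, rather than computation, bounds further speedup.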
Figure 2. An overview of the Orion architecture for big data analytics.
Figure 3. The initial interface of the installation directory.
Figure 4. An example job script.
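The figure itself is not reproduced here, but based on the abstract's description (a single command or shell script, with Orion allocating nodes on demand and releasing them afterwards), a job script might be sketched as follows. Every command name, flag, and path below is a hypothetical illustration, not Orion's documented interface:

```shell
#!/bin/bash
# Hypothetical sketch of an Orion job script ("allocate-when-needed"):
# start a framework cluster, run the unmodified application, release the nodes.

orion start --framework hadoop --nodes 18   # hypothetical: spin up Hadoop on Tianhe-2
hadoop jar soapgaea.jar FastqFilter \
    /path/to/input /path/to/output          # run an existing Hadoop app unchanged
orion stop                                  # hypothetical: return nodes to the scheduler
```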