Haohuan Fu, Lin Gan, Chao Yang, Wei Xue, Lanning Wang, Xinliang Wang, Xiaomeng Huang, Guangwen Yang.
Abstract
The scientific demand for more accurate modeling of the climate system calls for more computing power to support higher resolutions, the inclusion of more component models, more complicated physics schemes, and larger ensembles. As recent improvements in computing power mostly come from the increasing number of nodes in a system and the integration of heterogeneous accelerators, how to scale the computing problems onto more nodes and various kinds of accelerators has become a challenge for model development. This paper describes our efforts on developing a highly scalable framework for performing global atmospheric modeling on heterogeneous supercomputers equipped with various accelerators, such as GPU (Graphics Processing Unit), MIC (Many Integrated Core), and FPGA (Field-Programmable Gate Array) cards. We propose a generalized partition scheme of the problem domain, so as to keep a balanced utilization of both CPU resources and accelerator resources. With optimizations of both computing and memory access patterns, we achieve a speedup of around 8 to 20 times when comparing one hybrid GPU or MIC node with one 12-core CPU node. Using customized FPGA-based data-flow engines, we see the potential to gain a further 5 to 8 times improvement in performance. On heterogeneous supercomputers such as Tianhe-1A and Tianhe-2, our framework achieves nearly ideal linear scaling efficiency, with sustained double-precision performances of 581 Tflops on Tianhe-1A (using 3750 nodes) and 3.74 Pflops on Tianhe-2 (using 8644 nodes). Our study also provides an evaluation of the programming paradigms of various accelerator architectures (GPU, MIC, FPGA) for global atmospheric simulation, to form a picture of both the potential performance benefits and the programming effort involved.
Year: 2017 PMID: 28282428 PMCID: PMC5345762 DOI: 10.1371/journal.pone.0172583
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Existing efforts on utilizing accelerators in climate models.
| Original unit | Acceleration unit | Original performance | Acceleration performance | Speedup |
|---|---|---|---|---|
| 1 CPU (2.4 GHz Dual-Core Opteron) | 1.35 GHz Quadro 5600 | 0.33 GFlops | 15.8 GFlops | 48× |
| 1 Core (3 GHz Quad-Core Xeon 5400) | 1.46 GHz GTX 280 | 21.0944 s | 2.475 s | 8.5× |
| 1 Core (2.6 GHz Intel Core 2 Duo) | 0.65 GHz GTX 9800 GX2 | 0.782 GFlops | 10.93 GFlops | 14× |
| 1 CPU (3.2 GHz Intel Core i7 970) | 1.215 GHz GTX 590 | 2530 ms | 36.4 ms | 70× |
| 1 Core (3.3 GHz Intel Core i5 3550) | 1.05 GHz GTX 605 | 1.74 s | 12.4 ms | 140× |
| 2 Microprocessors (2.8 GHz Pentium 4) | 2 FPGAs (100 MHz SRC-6) | 31.25 ms | 129 ms | 0.24× |
| 1 Core (Intel Harpertown) | 1.46 GHz GTX 280 | 8.814 s | 0.263 s | 33.6× |
| 1 Core (Intel Xeon 5500) | 1.3 GHz Tesla C1060 | 805 ms | 83 ms | 9.7× |
| 1 CPU (2.6 GHz AMD Opteron 2435) | 1.15 GHz Fermi C2050 | 1 GFlops | 6.1 GFlops | 6.1× |
| 1 Core (2.93 GHz Intel Core i7-940) | 1.4 GHz GTX 480 | 6620 ms | 118.1 ms | 56× |
| 1280 Cores (2.9 GHz Intel Westmere-EP) | 320 (CPU core + 1.1 GHz M2050 GPU) | 780 GFlops | 2.3 TFlops | 3× |
| 1 Core (2.4 GHz AMD Opteron) | 1 GPU (Tesla S1070, 1.44 GHz) | 0.355 GFlops | 28.4 GFlops | 80× |
| 408 Cores (Intel E5-2670, 2.6 GHz) | 4 GPUs (K20X, 2.6 GHz) | 27 s | 27 s | 1× † |
| 12 Cores (2.7 GHz Intel Xeon X5650) | 6 FPGAs (150 MHz Virtex-6 SX475T) | 2235 s | 30 s | 74× |
Fig 1. The cubed-sphere mesh (left) and its six patches as the computational domain (right).
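For concreteness, here is a minimal C sketch of how six such patches can be held in memory. The patch size, halo width, and field name are illustrative assumptions, not the paper's data structures:

```c
/* Minimal sketch: six logically rectangular patches tile the
 * cubed sphere, each holding a 2-D cell array padded by a halo.
 * Sizes and the field name are illustrative assumptions. */
#define NPATCH 6     /* one patch per cube face */
#define N      128   /* cells per patch edge (assumed) */
#define HALO   2     /* halo width, enough for a 13-point stencil */

typedef struct {
    double h[N + 2 * HALO][N + 2 * HALO];  /* e.g. fluid height field */
} Patch;

static Patch cube[NPATCH];
```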
Fig 2. State reconstruction in cell (i, j): values in the four adjacent cells are needed (left). The 13-point upwind stencil (right): values of the twelve adjacent cells at different positions are needed to compute the value of the solid point in the center.
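As a minimal sketch of what such a stencil sweep looks like in C, assuming a center-plus-twelve-neighbours offset layout reaching two cells in each direction; the paper's actual upwind offsets and coefficients are direction-dependent and are not reproduced here:

```c
/* Sketch of a 13-point 2-D stencil sweep: the center cell plus
 * twelve neighbours within a distance of two cells. The offset
 * layout and the single coefficient array are assumptions. */
#define NX   64
#define NY   64
#define HALO 2   /* the stencil reaches 2 cells in each direction */

static const int OFF[13][2] = {
    { 0, 0},
    { 1, 0}, {-1, 0}, { 2, 0}, {-2, 0},
    { 0, 1}, { 0,-1}, { 0, 2}, { 0,-2},
    { 1, 1}, { 1,-1}, {-1, 1}, {-1,-1}
};

void stencil_step(const double in[NY][NX], double out[NY][NX],
                  const double coef[13])
{
    for (int j = HALO; j < NY - HALO; ++j)
        for (int i = HALO; i < NX - HALO; ++i) {
            double acc = 0.0;
            for (int k = 0; k < 13; ++k)
                acc += coef[k] * in[j + OFF[k][1]][i + OFF[k][0]];
            out[j][i] = acc;
        }
}
```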
Fig 3. The three-layer partition scheme.
Fig 4. Our flexible partition scheme.
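A flexible partition of this kind typically gives each device a share of the patch proportional to its measured throughput, so that CPU and accelerator finish at roughly the same time. A minimal sketch of that proportional split; the throughput numbers reuse the performance tables below, while the row count and variable names are hypothetical:

```c
/* Minimal sketch (illustrative, not the paper's code): split the
 * rows of one patch between CPU and accelerator in proportion to
 * their throughput, so both sides finish together. */
#include <stdio.h>

int main(void) {
    double cpu_gflops = 20.7;   /* 12-core CPU node (single-node table) */
    double acc_gflops = 140.0;  /* one Fermi C2050 GPU (single-chip table) */
    int n_rows        = 768;    /* rows in one patch; assumed value */

    /* accelerator share proportional to its throughput */
    double ratio = acc_gflops / (cpu_gflops + acc_gflops);
    int acc_rows = (int)(ratio * n_rows + 0.5);
    int cpu_rows = n_rows - acc_rows;

    printf("accelerator rows: %d, CPU rows: %d\n", acc_rows, cpu_rows);
    return 0;
}
```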
Fig 5. Workflow of the hybrid partition.
Using n + 1 accelerators to process the inner part, while the CPU processes the interpolation, the halo computation, and the extra outer part. C2A and A2C refer to data exchanges between the CPU and the accelerators. Part of the CPU time shown in the figure can be overlapped with the accelerator time.
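In code form, one time step of this workflow might be organized as below. Every function is a hypothetical stand-in, not the paper's API; the sketch only makes the overlap structure of the caption explicit:

```c
/* Sketch of one time step under the hybrid partition of Fig 5.
 * The accelerator works on the inner part while the CPU handles
 * the interpolation, halo, and outer work, so the two largely
 * overlap. All functions are hypothetical stand-ins. */

void copy_inner_to_accelerator(void) {}   /* C2A transfer */
void launch_inner_kernel_async(void) {}   /* accelerator computes inner part */
void cpu_interpolate_patch_edges(void) {} /* CPU: interpolation */
void cpu_compute_halo_and_outer(void) {}  /* CPU: halo + extra outer part */
void wait_for_accelerator(void) {}        /* synchronization point */
void copy_inner_from_accelerator(void) {} /* A2C transfer */

void time_step(void)
{
    copy_inner_to_accelerator();      /* C2A */
    launch_inner_kernel_async();      /* accelerator runs asynchronously */

    /* CPU work overlapped with the accelerator: */
    cpu_interpolate_patch_edges();
    cpu_compute_halo_and_outer();

    wait_for_accelerator();
    copy_inner_from_accelerator();    /* A2C */
}
```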
Fig 6. The "pipe-flow" communication scheme with four steps.
Arrows indicate the directions in which data flows among the six patches. At each step, each patch has exactly one incoming transfer (MPI receive) and one outgoing transfer (MPI send), leading to a systematically better load balance.
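A minimal MPI sketch of the scheme. The per-step neighbour tables and the buffer length are assumptions standing in for the arrows of Fig 6; only the one-receive/one-send structure per step is taken from the caption:

```c
/* Sketch of the four-step "pipe-flow" halo exchange. At every step
 * each patch posts exactly one receive and one send. */
#include <mpi.h>

#define NSTEPS   4
#define HALO_LEN 1024  /* halo buffer length (assumed) */

void pipe_flow_exchange(const int send_to[NSTEPS],
                        const int recv_from[NSTEPS],
                        double send_buf[NSTEPS][HALO_LEN],
                        double recv_buf[NSTEPS][HALO_LEN])
{
    for (int step = 0; step < NSTEPS; ++step) {
        MPI_Request reqs[2];
        /* exactly one receive and one send per patch per step */
        MPI_Irecv(recv_buf[step], HALO_LEN, MPI_DOUBLE,
                  recv_from[step], step, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf[step], HALO_LEN, MPI_DOUBLE,
                  send_to[step], step, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
}
```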
Fig 7. The hardware architecture of the multi-core CPU, MIC, many-core GPU, and reconfigurable FPGA.
Fig 8. Weak scaling performance on Tianhe-1A and Tianhe-2.
Fig 9. Strong scaling efficiency on Tianhe-1A and Tianhe-2.
Fig 10. Surface level distribution of the atmosphere at day 15 on Tianhe-1A, based on the model test [34].
The conical mountain is outlined by the dotted circle.
Fig 11. Surface level distribution of the atmosphere at day 15 on Tianhe-2, based on the model test [34].
Single-chip performance comparison.
| chip | performance (GFlops) | speedup |
|---|---|---|
| X5670 CPU (6 cores) | 12 | 1 |
| E5 2692 CPU (12 cores) | 24 | 2× |
| Fermi C2050 GPU | 140 | 12× |
| Xeon Phi 31S1P MIC | 200 | 17× |
| Virtex-6 FPGA (mixed-precision) | 900 | 75× |
Single-node performance comparison.
| Configuration | performance (GFlops) | speedup |
|---|---|---|
| Tianhe-1A | | |
| CPU (12 cores) | 20.7 | 1 |
| CPU + GPU | 158 | 7.6× |
| Tianhe-2 | | |
| CPU (24 cores) | 46 | 2.2× |
| CPU + MIC | 240 | 11.6× |
| CPU + 2 MICs | 346 | 16.7× |
| CPU + 3 MICs | 400 | 19.3× |
| Maxeler | | |
| CPU (12 cores) | 20.7 | 1 |
| CPU + 3 DFEs (double precision) | 14 | 0.7× |
| CPU + DFE (mixed precision) | 956 | 46× |
| CPU + 4 DFEs (mixed precision) | 3156 | 152× |
Power efficiency of different supercomputer nodes.
| Node | power (watt) | efficiency (GFlops/watt) | relative power efficiency |
|---|---|---|---|
| Tianhe-1A node | 360 | 0.6 | 1.2× |
| Tianhe-2 node | 815 | 0.49 | 1 |
| Maxeler node | 514 | 6.1 | 12.4× |
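As a quick sanity check of the efficiency column (efficiency = sustained performance divided by node power, with performance numbers taken from the single-node table above):

```c
/* Quick arithmetic check of the efficiency column:
 * efficiency = sustained GFlops / node power (W). */
#include <stdio.h>

int main(void) {
    /* Tianhe-2 node: CPU + 3 MICs */
    printf("Tianhe-2: %.2f GFlops/W\n", 400.0 / 815.0);   /* ~0.49 */
    /* Maxeler node: CPU + 4 mixed-precision DFEs */
    printf("Maxeler:  %.2f GFlops/W\n", 3156.0 / 514.0);  /* ~6.14 */
    /* relative efficiency vs. the Tianhe-2 baseline */
    printf("Maxeler vs Tianhe-2: %.1fx\n",
           (3156.0 / 514.0) / (400.0 / 815.0));           /* ~12.5x */
    return 0;
}
```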
The number of code lines on different platforms.
| platform | CPU | GPU | MIC | FPGA |
|---|---|---|---|---|
| original version | 969 | − | − | − |
| naive porting | − | +96 | +10 | +644 |
| optimized porting | − | +90 | +80 | +115 |
| porting difficulty | − | medium | low | high |