J Borgdorff1, M Ben Belgacem2, C Bona-Casas3, L Fazendeiro4, D Groen5, O Hoenen6, A Mizeranschi7, J L Suter5, D Coster6, P V Coveney5, W Dubitzky7, A G Hoekstra8, P Strand4, B Chopard2.
Abstract
Multiscale simulations model phenomena across natural scales using monolithic or component-based code, running on local or distributed resources. In this work, we investigate the performance of distributed multiscale computing of component-based models, guided by six multiscale applications with different characteristics and from several disciplines. Three modes of distributed multiscale computing are identified: supplementing local dependencies with large-scale resources, load distribution over multiple resources, and load balancing of small- and large-scale resources. We find that the first mode has the apparent benefit of increasing simulation speed, and the second mode can increase simulation speed if local resources are limited. Depending on resource reservation and model coupling topology, the third mode may result in a reduction of resource consumption.
Keywords: distributed multiscale computing; multiscale simulation; performance
Year: 2014 PMID: 24982258 PMCID: PMC4084531 DOI: 10.1098/rsta.2013.0407
Source DB: PubMed Journal: Philos Trans A Math Phys Eng Sci ISSN: 1364-503X Impact factor: 4.226
Figure 1. Scenarios using component-based modelling or distributed computing. (a) A monolithic model incorporating all codes A, B, C into a single code base. (b) The model is decomposed into submodels and the codes are separated by function, also separating the runtime dependencies per submodel. (c) How the components could be distributed to increase resource effectiveness. (Online version in colour.)
Performance measures of tied multiscale models TTE and HemeLB. Owing to the supercomputer policy restricting connections, the distributed communication speed of TTE could not be experimentally verified. Distributed communication time is estimated as c_distr ≈ 5 s, based on network speeds from Germany to Japan (with a latency of up to 0.5 s and a throughput of at least 20 MB s⁻¹).
| simulation | local cores | local time | distributed cores | distributed time | speed-up | resources used |
|---|---|---|---|---|---|---|
| TTE | 128 | 397 | 16+512 | 98 | 4.0 | 1.0 |
| | | | 16+1024 | 56 | 7.1 | 1.2 |
| | 256 | 201 | 16+512 | 98 | 2.0 | 1.0 |
| | | | 16+1024 | 56 | 3.6 | 1.1 |
| HemeLB | 4 | 14 481 | 4+512 | 298 | 48.6 | 2.7 |
| | | | 4+2048 | 157 | 92.2 | 5.6 |
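The last two columns of this table can be recomputed from the timing columns. The following is a minimal sketch, assuming that speed-up is the local time divided by the distributed time, and that "resources used" is the ratio of cores × time between the distributed and the local run. This record does not state those definitions; the assumption is inferred from the numbers and reproduces the TTE and HemeLB rows to within about 0.1. The function names are illustrative, and the HemeLB local time is read as 14 481 (digits grouped with a space).

```python
def speed_up(t_local, t_distr):
    """Assumed definition: local run time divided by distributed run time."""
    return t_local / t_distr


def relative_resources(p_local, t_local, p_distr, t_distr):
    """Assumed definition: (cores x time) of the distributed run
    relative to (cores x time) of the local run."""
    return (p_distr * t_distr) / (p_local * t_local)


# Rows taken from the table: (label, local cores, local time,
# distributed cores, distributed time).
rows = [
    ("TTE 128 vs 16+512", 128, 397, 16 + 512, 98),
    ("TTE 128 vs 16+1024", 128, 397, 16 + 1024, 56),
    ("HemeLB 4 vs 4+512", 4, 14481, 4 + 512, 298),
    ("HemeLB 4 vs 4+2048", 4, 14481, 4 + 2048, 157),
]

for label, p_loc, t_loc, p_dis, t_dis in rows:
    s = speed_up(t_loc, t_dis)
    r = relative_resources(p_loc, t_loc, p_dis, t_dis)
    print(f"{label}: speed-up {s:.2f}, resources used {r:.2f}")
```

Under this reading, a "resources used" value above 1 means the distributed run consumed more core time than the local run, even where it finished sooner. The same product of total cores and run time need not hold for models whose submodels do not occupy all reserved cores for the whole run.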
Resources used for performance measurements in §4. The total number of cores is listed in the right-most column, although in practice only a fraction of them can be used in a single reservation.
| resource | location | type | CPU architecture | cores |
|---|---|---|---|---|
| Mavrino | London, UK | cluster | Intel Xeon X3353 | 64 |
| Gordias | Geneva, Switzerland | cluster | Intel Xeon E5530 | 224 |
| Gateway | Munich, Germany | cluster | Intel Xeon E5-2670 | 256 |
| Scylla | Geneva, Switzerland | cluster | Intel Xeon Westmere | 368 |
| Inula | Poznań, Poland | cluster | AMD Opteron 6234 | 1600+ |
| Reef | Poznań, Poland | cluster | Intel Xeon E5530 | 2300+ |
| Zeus | Krakow, Poland | HPC | Intel Xeon L/X/E 56XX | 12 000+ |
| Cartesius | Amsterdam, The Netherlands | HPC | Intel Xeon E5-2695 v2 | 12 500+ |
| Helios | Aomori, Japan | HPC | Intel Xeon E5-2680 | 70 000+ |
| HECToR | Edinburgh, UK | HPC | AMD Opteron Interlagos | 90 000+ |
| SuperMUC | Munich, Germany | HPC | Intel Xeon E5-2680 8C | 150 000+ |
Performance measures of scalable multiscale models Canals and MultiGrain. The Canals simulation is performed on the Gordias cluster and the Scylla cluster, with T_local taken as the average of the T_local values of Gordias and Scylla (the value for Gordias alone is given in parentheses). It is compared with running the same two submodels at lower core counts (on 50+50 cores) and with running a single monolithic model with the same total problem size (on 100 cores). The time listed for Canals is the time per iteration. The time listed for MultiGrain is the average over 10 simulations and includes the standard error of the mean, which arises from the stochastic optimization method used. MultiGrain combines a node of the Zeus cluster and one from the Inula cluster.
| simulation | local cores | local time | distributed cores | distributed time | speed-up | resources used |
|---|---|---|---|---|---|---|
| Canals (low resolution) | 50+50 | 0.015 | 100+100 | 0.023 | 0.63 | 3.2 |
| | 100 | 0.011 (0.011) | 100+100 | 0.023 | 0.47 (0.47) | 4.2 (4.3) |
| Canals (high resolution) | 50+50 | 0.99 | 100+100 | 0.71 | 1.4 | 1.4 |
| | 100 | 1.77 (1.307) | 100+100 | 0.71 | 1.8 (2.5) | 1.1 (0.80) |
| MultiGrain | 7 | 27±7 | 7+4 | 20±3 | 1.4 | 1.1 |
| MultiGrain | 11 | 43±16 | 11+8 | 36±10 | 1.2 | 1.5 |
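Because the MultiGrain times carry a standard error, the derived speed-up is itself uncertain. The table reports no uncertainty on the speed-up, so the following is only an illustrative sketch: it assumes speed-up is the local time divided by the distributed time, that the two errors are independent, and that first-order (relative errors in quadrature) propagation is adequate. The function name is hypothetical.

```python
import math


def ratio_with_error(a, sigma_a, b, sigma_b):
    """Return a/b and its first-order propagated uncertainty,
    assuming a and b have independent errors sigma_a and sigma_b."""
    ratio = a / b
    relative = math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return ratio, ratio * relative


# MultiGrain rows from the table: local time ± error, distributed time ± error.
for label, a, sa, b, sb in [
    ("MultiGrain 7 vs 7+4", 27, 7, 20, 3),
    ("MultiGrain 11 vs 11+8", 43, 16, 36, 10),
]:
    s, ds = ratio_with_error(a, sa, b, sb)
    print(f"{label}: speed-up {s:.2f} ± {ds:.2f}")
```

Under these assumptions the propagated uncertainty is comparable in size to the distance of either speed-up from 1, so the MultiGrain speed-ups are best read as indicative rather than precise.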
Performance measures of skewed multiscale models Nano and ISR3D. The time listed for ISR3D is the time per iteration. The last two rows concern a previous version of ISR3D; it was executed on Huygens and Zeus. The current version was executed on Cartesius and Reef.
| simulation | local cores | local time | distributed cores | distributed time | speed-up | resources used |
|---|---|---|---|---|---|---|
| Nano | 128 | 9.8×10^5 | 64+128+1024 | 5.7×10^5 | 1.73 | 0.88 |
| Nano | 1024 | 5.7×10^5 | 64+128+1024 | 5.7×10^5 | 1.0 | 0.19 |
| Nano | 2048 | 5.4×10^5 | 64+128+2048 | 5.4×10^5 | 1.0 | 0.11 |
| ISR3D | 144 | 281 | 144+8 | 283 | 0.99 | 1.06 |
| ISR3D versus alt. | | | 144+8 | 531 | 0.53 | 1.00 |
| ISR3D/old | 32 | 1813 | 32+4 | 1532 | 1.18 | 0.95 |
| ISR3D/old versus alt. | | | 32+4 | 1804 | 1.00 | 0.56 |