| Literature DB >> 35923230 |
Kristian Gregorius Hustad1, Xing Cai1,2.
Abstract
A central component in simulating cardiac electrophysiology is the numerical solution of nonlinear ordinary differential equations, also called cardiac ionic cell models, that describe cross-cell-membrane ion transport. Biophysically detailed cell models often require a considerable amount of computation, including calls to special mathematical functions. This paper systematically studies how to efficiently use modern multicore CPUs for this costly computational task. We start by investigating the code restructurings needed to effectively enable compiler-supported SIMD vectorisation, which is the most important performance booster in this context. It is found that suitable OpenMP directives are sufficient for achieving both vectorisation and parallelisation. We then continue with an evaluation of the performance optimisation technique of using lookup tables. Due to increased challenges for automated vectorisation, the obtainable benefits of lookup tables are dependent on the hardware platforms chosen. Throughout the study, we report detailed time measurements obtained on Intel Xeon, Xeon Phi, AMD Epyc and two ARM processors including Fujitsu A64FX, while attention is also paid to the impact of SIMD vectorisation and lookup tables on the computational accuracy. As a realistic example, the benefits of performance enhancement are demonstrated by a 109-run ensemble on the Oakforest-PACS system, where code restructurings and SIMD vectorisation yield an 84% reduction in computing time, corresponding to 63,270 node hours.Entities:
Keywords: SIMD vectorisation; cardiac electrophysiogy; ionic cell models; lookup tables (LUTs); multicore CPUs
Year: 2022 PMID: 35923230 PMCID: PMC9342677 DOI: 10.3389/fphys.2022.904648
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.755
Cell models used in the numerical experiments of this paper. The “FLOPs” column lists the number of floating-point operations required to compute a single time step for a naïve implementation using the Forward Euler scheme. Section 3.2 describes how performance counters were used to obtain the operation counts.
| Model | Name | State variables | FLOPs | References |
|---|---|---|---|---|
| ten Tusscher–Panfilov (2006) | TP06 | 19 | 1500 |
|
| Jæger–Tveito (2021) | JT21 | 25 | 1322 |
|
| Grandi–Pasqualini–Bers (2010) | GPB | 39 | 2149 |
|
Hardware specifications (compute node level) of the five target platforms. ISA is an abbreviation of “instruction set architecture”.
| Name | CPU | ISA | SIMD width (bits) | Memory | Peak memory bandwidth |
|---|---|---|---|---|---|
| Oakforest | 1 × Intel Xeon Phi 7250 | x86-64 | 512 | 16 GiB MCDRAM + 96 GiB DDR4 | MCDRAM: |
| Peak performance: 3 TFLOPS | |||||
| Oakbridge | 2 × Intel Xeon Platinum 8280 | x86-64 | 512 | 192 GiB DDR4 | 281 GB/s |
| Peak performance: 4.8 TFLOPS | |||||
| Wisteria | 1 × Fujitsu A64FX | ARM v8.2-A | 512 | 32 GiB HBM | 1024 GB/s |
| Peak performance: 3.4 TFLOPS | |||||
| Milan | 2 × AMD EPYC 7763 | x86-64 | 256 | 2 TiB DDR4 | 410 GB/s |
| Peak performance: 5.0 TFLOPS | |||||
| ThunderX2 | 2 × Cavium ThunderX2 CN9980 | ARM v8.1-A | 128 | 1 TiB DDR4 | 341 GB/s |
| Peak performance: 1.0 TFLOPS |
Compiler flags used to enable auto-vectorisation.
| Compiler | Version | System | Flags |
|---|---|---|---|
| ARMClang | 21.0 | ThunderX2 |
|
| Fujitsu | 4.7.0 | Wisteria |
|
| GCC | 11.1.0 | Milan |
|
| Intel | 19.1.3.304 | Oakbridge/ Oakforest |
|
Single-threaded performance of scalar and vectorised math library calls. The units for the Scalar and SIMD columns is millions of function evaluations per second.
| System | Function | Scalar | SIMD | Speedup |
|---|---|---|---|---|
| Oakbridge | exp | 261.0 | 760.6 | 2.91 |
| Oakbridge | expm1 | 167.4 | 648.8 | 3.88 |
| Oakbridge | log | 202.4 | 663.6 | 3.28 |
| Oakbridge | pow | 83.4 | 427.1 | 5.12 |
| Oakforest | exp | 43.9 | 260.0 | 5.92 |
| Oakforest | expm1 | 23.7 | 223.4 | 9.44 |
| Oakforest | log | 38.5 | 247.1 | 6.41 |
| Oakforest | pow | 13.9 | 103.9 | 7.50 |
| Milan | exp | 91.7 | 743.6 | 8.11 |
| Milan | expm1 | 151.8 | 138.3 | 0.91 |
| Milan | log | 73.9 | 583.6 | 7.89 |
| Milan | pow | 26.8 | 159.7 | 5.96 |
| Wisteria | exp | 85.6 | 633.1 | 7.40 |
| Wisteria | expm1 | 25.3 | 24.3 | 0.96 |
| Wisteria | log | 79.1 | 534.7 | 6.76 |
| Wisteria | pow | 16.4 | 113.4 | 6.92 |
| ThunderX2 | exp | 88.4 | 123.0 | 1.39 |
| ThunderX2 | expm1 | 45.6 | 27.1 | 0.59 |
| ThunderX2 | log | 64.8 | 102.9 | 1.59 |
| ThunderX2 | pow | 25.6 | 22.6 | 0.88 |
Maximum error of vectorised math library calls when evaluating input values in the prescribed ranges. The error is reported in units of least precision (ULPs).
| Function (value range) System | exp (−700, 700) | expm1 (−700, 700) | log (10–300, 10,
| pow (—30, 30) × (—30, 30) |
|---|---|---|---|---|
| Oakbridge | 2.623 | 2.753 | 1.496 | 0.998 |
| Oakforest | 1.471 | 2.008 | 1.276 | 1.035 |
| Milan | 2.623 | 0.735 | 1.343 | 0.998 |
| Wisteria | 1.923 | 0.753 | 1.343 | 1.62×1013 |
| ThunderX2 | 2.313 | 0.992 | 1.883 | 0.998 |
Single-threaded performance of naïve and auto-vectorised implementations. The FE scheme is used; C = 11688851 cells. “SoA” refers to the “struct of arrays” memory layout discussed in Section 2.3.2.
| System | Model | Throughput | Speedup | |||
|---|---|---|---|---|---|---|
| Naïve | SoA | SIMD |
|
| ||
| Oakbridge | TP06 | 2.917 | 3.039 | 15.070 | 5.2 | 5.0 |
| Oakbridge | JT21 | 3.622 | 3.559 | 17.970 | 5.0 | 5.0 |
| Oakbridge | GPB | 2.039 | 2.047 | 8.204 | 4.0 | 4.0 |
| Oakforest | TP06 | 0.499 | 0.525 | 4.081 | 8.2 | 7.8 |
| Oakforest | JT21 | 0.645 | 0.651 | 5.116 | 7.9 | 7.9 |
| Oakforest | GPB | 0.351 | 0.406 | 3.526 | 10.0 | 8.7 |
| Milan | TP06 | 1.186 | 1.336 | 5.910 | 5.0 | 4.4 |
| Milan | JT21 | 1.590 | 1.624 | 6.712 | 4.2 | 4.1 |
| Milan | GPB | 1.094 | 1.103 | 4.073 | 3.7 | 3.7 |
| Wisteria | TP06 | 0.578 | 0.694 | 6.604 | 11.4 | 9.5 |
| Wisteria | JT21 | 0.532 | 0.871 | 7.649 | 14.4 | 8.8 |
| Wisteria | GPB | 0.413 | 0.475 | 4.211 | 10.2 | 8.9 |
| ThunderX2 | TP06 | 0.911 | 0.951 | 1.509 | 1.7 | 1.6 |
| ThunderX2 | JT21 | 1.629 | 1.080 | 1.849 | 1.1 | 1.7 |
| ThunderX2 | GPB | 0.747 | 0.787 | 1.179 | 1.6 | 1.5 |
Multi-threaded performance of naïve and auto-vectorised implementations; The FE scheme is used; C = 11688851 cells. “SoA” refers to the “struct of arrays” memory layout discussed in Section 2.3.2.
| System | Model | Throughput | Speedup | |||
|---|---|---|---|---|---|---|
| Naïve | SoA | SIMD |
|
| ||
| Oakbridge | TP06 | 127.8 | 134.0 | 481.7 | 3.8 | 3.6 |
| Oakbridge | JT21 | 148.8 | 147.2 | 444.7 | 3.0 | 3.0 |
| Oakbridge | GPB | 87.4 | 91.1 | 239.4 | 2.7 | 2.6 |
| Oakforest | TP06 | 55.7 | 59.8 | 398.7 | 7.2 | 6.7 |
| Oakforest | JT21 | 68.3 | 66.1 | 376.6 | 5.5 | 5.7 |
| Oakforest | GPB | 36.4 | 39.5 | 234.2 | 6.4 | 5.9 |
| Milan | TP06 | 199.4 | 219.8 | 920.8 | 4.6 | 4.2 |
| Milan | JT21 | 250.4 | 256.0 | 833.5 | 3.3 | 3.3 |
| Milan | GPB | 170.2 | 164.5 | 518.9 | 3.0 | 3.2 |
| Wisteria | TP06 | 27.8 | 33.2 | 296.6 | 10.7 | 8.9 |
| Wisteria | JT21 | 25.1 | 41.0 | 321.0 | 12.8 | 7.8 |
| Wisteria | GPB | 19.5 | 22.4 | 167.5 | 8.6 | 7.5 |
| ThunderX2 | TP06 | 94.1 | 97.6 | 137.1 | 1.5 | 1.4 |
| ThunderX2 | JT21 | 142.5 | 114.2 | 169.3 | 1.2 | 1.5 |
| ThunderX2 | GPB | 75.5 | 83.1 | 109.1 | 1.4 | 1.3 |
Multi-threaded performance of an ensemble simulation using the JT21 model; C = 11688851 cells. The most performant implementation for each system is in boldface.
| System | SIMD | Cell–Time | Time–Cell | Cell–Time–Cell | |||
|---|---|---|---|---|---|---|---|
| FE | GRL1 | FE | GRL1 | FE | GRL1 | ||
| Oakbridge | On |
|
| 154.3 | 132.2 | 375.2 | 257.5 |
| Oakbridge | Off | 150.5 | 90.0 | 85.3 | 57.8 | 110.1 | 71.1 |
| Oakforest | On |
|
| 108.2 | 89.1 | 114.1 | 94.7 |
| Oakforest | Off | 63.5 | 37.8 | 16.1 | 13.7 | 16.0 | 14.1 |
| Milan | On | 271.8 | 161.2 | 292.1 | 275.2 |
|
|
| Milan | Off | 269.2 | 158.7 | 151.2 | 105.9 | 167.4 | 116.8 |
| Wisteria | On | 41.4 | 19.4 | 104.5 | 70.3 |
|
|
| Wisteria | Off | 40.8 | 19.6 | 17.8 | 12.0 | 17.5 | 11.9 |
| ThunderX2 | On | 103.5 | 58.6 | 83.5 | 60.9 |
|
|
| ThunderX2 | Off | 111.4 | 62.6 | 64.1 | 40.6 | 79.6 | 47.9 |
RRMS error when solving the TP06 model using SIMD and vectorised math functions. For each ODE solver scheme a reference solution is computed on Milan using scalar math functions and with compiler optimisations disabled. The model is solved for 1 s with a time step Δt = 1 µs.
| System | FE | RL | GRL1 |
|---|---|---|---|
| Oakbridge (with expm1) | 2.11 × 10–16 | 1.34 × 10–16 | 9.47 × 10–17 |
| Oakbridge | 2.00 × 10–16 | 1.48 × 10–14 | 1.29 × 10–12 |
| Oakforest | 3.40 × 10–16 | 2.40 × 10–11 | 1.17 × 10–11 |
| Milan | 2.67 × 10–16 | 2.48 × 10–14 | 1.07 × 10–12 |
| Wisteria | 2.33 × 10–16 | 2.49 × 10–14 | 8.43 × 10–13 |
| ThunderX2 | 1.85 × 10–16 | 2.48 × 10–14 | 7.94 × 10–13 |
Multi-threaded performance of naïve, auto-vectorised and LUT implementations for the TP06 model in a simulation scenario. The Rush–Larsen scheme is used with Δt = 0.1 µs, C = 11688851 cells. “SoA” refers to the “struct of arrays” memory layout discussed in Section 2.3.2. The most performant implementation is in boldface.
| System | Throughput | Speedup | ||||||
|---|---|---|---|---|---|---|---|---|
| Naïve | SoA | SIMD | LUT | SIMD & LUT |
|
|
| |
| Oakbridge | 96.0 | 96.3 | 374.3 | 393.4 |
| 3.9 | 4.1 | 5.8 |
| Oakforest | 46.9 | 47.8 | 354.3 | 196.3 |
| 7.6 | 4.2 | 9.0 |
| Milan | 163.2 | 170.9 | 754.5 | 762.3 |
| 4.6 | 4.7 | 7.3 |
| Wisteria | 17.5 | 21.3 |
| 74.3 | 206.5 | 14.0 | 4.2 | 11.8 |
| ThunderX2 | 71.8 | 79.7 | 116.0 |
| 199.3 | 1.6 | 3.5 | 2.8 |
FIGURE 1Comparison of the TP06 model solved with and without the use of lookup tables. The RL scheme is used with Δt = 1 µs. In the upper plot, the two numerical solutions cannot be distinguished by eye. The RRMS error of the LUT solution compared with the non-LUT solution is 1.36 × 10–7.
FIGURE 2Traces of the transmembrane potential and calcium concentration for 1000 different sets of parameters in an ensemble simulation.