Andreas W Götz, Mark J Williamson, Dong Xu, Duncan Poole, Scott Le Grand, Ross C Walker.
Abstract
We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
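The difference between accumulating in single and in double precision can be illustrated with a small sketch. It is not the AMBER code: the helper `f32`, the function `accumulate`, and the random "force contributions" are hypothetical, and single precision is emulated in pure Python by rounding through a 4-byte float.

```python
import math
import random
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(contributions, single_precision_sum: bool) -> float:
    """Sum the contributions, rounding the running total to single precision
    after every addition (SPSP-like) or keeping it in double (SPDP-like)."""
    total = 0.0
    for c in contributions:
        total = total + c
        if single_precision_sum:
            total = f32(total)
    return total


random.seed(0)
# Hypothetical force contributions, each already rounded to single precision.
contributions = [f32(random.uniform(-1.0, 1.0)) for _ in range(20000)]

reference = math.fsum(contributions)  # correctly rounded double-precision sum
spsp_err = abs(accumulate(contributions, True) - reference)
spdp_err = abs(accumulate(contributions, False) - reference)
print(f"SPSP-like accumulation error: {spsp_err:.3e}")
print(f"SPDP-like accumulation error: {spdp_err:.3e}")
```

In this toy setting the individual terms carry the same single-precision rounding in both cases; only the rounding of the running sum differs, which is the part the SPDP model moves to double precision.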
Year: 2012 PMID: 22582031 PMCID: PMC3348677 DOI: 10.1021/ct200909j
Source DB: PubMed Journal: J Chem Theory Comput ISSN: 1549-9618 Impact factor: 6.006
Figure 1. Peak floating-point operations per second (Flop/s; left) and memory bandwidth (right) for Intel CPUs[26] and NVIDIA GPUs.[27]
Approximate Maximum Atom Counts That Can Be Treated with the GPU Implementation of GB Implicit Solvent Simulations in AMBER 11 Using the SPDP Precision Modelᵃ
| GPU card | GPU memory | simulation type | max atoms |
|---|---|---|---|
| GTX-295 | 895 MB | constant | 20 500 |
| | | constant | 19 200 |
| Tesla C1060 | 4.0 GB | constant | 46 350 |
| | | constant | 45 200 |
| Tesla C2050 | 3.0 GB | constant | 39 250 |
| | | constant | 38 100 |
| Tesla C2070 | 6.0 GB | constant | 54 000 |
| | | constant | 53 050 |
ᵃ Test systems are droplets of TIP3P water molecules. All simulations use SHAKE (AMBER input ntf=2, ntc=2); a time step of 2 fs; the Hawkins, Cramer, Truhlar GB model[37] (AMBER input igb=1); the default cutoff value of 25 Å for GB radii (AMBER input rgbmax=25); and temperature control with the Langevin thermostat (AMBER input ntt=3), if applicable. Error-correction code (ECC) was switched off on the Tesla cards.
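The input settings listed in the footnote can be collected into an AMBER `&cntrl` namelist. The sketch below only formats text. The flags `igb`, `rgbmax`, `ntf`, `ntc`, and `ntt` and the 2 fs time step come from the footnote (AMBER expects `dt` in ps); every other entry (`ntb`, `cut`, `temp0`, `gamma_ln`, `nstlim`) and the helper `cntrl_namelist` are assumptions added for illustration, not values reported by the authors.

```python
def cntrl_namelist(settings: dict) -> str:
    """Format a dict of flags as an AMBER-style &cntrl namelist."""
    body = ", ".join(f"{key}={value}" for key, value in settings.items())
    return f" &cntrl\n   {body}\n /\n"


gb_settings = {
    # Settings named in the table footnote.
    "igb": 1,        # Hawkins, Cramer, Truhlar GB model
    "rgbmax": 25.0,  # cutoff for GB radii, in angstroms
    "ntf": 2,        # SHAKE: omit force evaluation for bonds to hydrogen
    "ntc": 2,        # SHAKE: constrain bonds to hydrogen
    "dt": 0.002,     # 2 fs time step, expressed in ps
    "ntt": 3,        # Langevin thermostat (constant-temperature runs only)
    # Assumed values, not given in the source.
    "ntb": 0,        # no periodic box
    "cut": 9999.0,   # effectively no nonbonded cutoff
    "temp0": 300.0,  # target temperature in K
    "gamma_ln": 1.0, # Langevin collision frequency in 1/ps
    "nstlim": 1000,  # number of MD steps
}

print(cntrl_namelist(gb_settings))
```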
Figure 2. Schematic representation of the work-load distribution for the calculation of nonbond forces with N atoms. Each square represents the interactions between two atoms i and j for which the resulting forces need to be evaluated. These are grouped together in tiles of size W × W that are each assigned to an independent warp. Due to symmetry, only the blue diagonal tiles and the green off-diagonal tiles need to be considered for the calculation. For details, see the text.
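The tile decomposition in the caption can be sketched as a simple enumeration. This is an illustration of the bookkeeping only, assuming W = 32 (the warp size on NVIDIA GPUs); the function name `nonbond_tiles` is hypothetical and nothing here reflects how the GPU kernels schedule the work.

```python
def nonbond_tiles(n_atoms: int, tile_width: int = 32):
    """Return (diagonal, off_diagonal) lists of tile indices (ti, tj).

    Only tiles with tj >= ti are listed, because the interaction between
    atoms i and j is the same as between j and i.
    """
    n_tiles = -(-n_atoms // tile_width)  # ceiling division
    diagonal = [(t, t) for t in range(n_tiles)]
    off_diagonal = [
        (ti, tj) for ti in range(n_tiles) for tj in range(ti + 1, n_tiles)
    ]
    return diagonal, off_diagonal


# Apo-myoglobin test system from the timing tables (2492 atoms).
diag, off = nonbond_tiles(2492)
n_tiles = len(diag)
print(f"{n_tiles} tiles per side, {len(diag)} diagonal + {len(off)} off-diagonal")
print(f"tiles evaluated: {len(diag) + len(off)} of {n_tiles * n_tiles}")
```

With T tiles per side, symmetry reduces the work from T² tiles to T(T + 1)/2.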
Single GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version on One Node (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores) and the Serial GPU Version with the SPDP Precision Model on One Node (One Intel X5670 Core and One GPU)ᵃ
| CPU/GPU | TRPCage (304 atoms) | ubiquitin (1231 atoms) | apo-myoglobin (2492 atoms) | nucleosome (25 095 atoms) |
|---|---|---|---|---|
| GPU version | | | | |
| M2090 (6 GB) | 399.9 | 184.2 | 78.1 | 1.42 |
| C2070 (6 GB) | 364.1 | 157.2 | 64.3 | 1.09 |
| C1060 (4 GB) | 234.6 | 78.3 | 31.5 | 0.40 |
| GTX580 (1.5 GB, PNY XLR8) | 471.1 | 215.9 | 88.7 | – |
| CPU version | | | | |
| 32 × Opteron 6136 | 225.0 | 29.9 | 10.3 | 0.08 |
| 12 × X5670 | 247.1 | 19.8 | 6.6 | 0.07 |
ᵃ For details on the hardware and software stack, see the text. A dash indicates insufficient GPU memory for the simulation.
The CPU code requires >10 atoms per core, and thus the TRPCage simulation was run on 24 CPU cores.
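As a worked example of reading the table, the speedup of one GPU over one CPU node is the ratio of the two throughputs. The sketch below uses the M2090 and 12 × X5670 rows; the dictionary layout and variable names are illustrative choices.

```python
# Throughput in ns/day, copied from the M2090 and 12 x X5670 rows above.
throughput = {
    "M2090": {
        "TRPCage": 399.9, "ubiquitin": 184.2,
        "apo-myoglobin": 78.1, "nucleosome": 1.42,
    },
    "12 x X5670": {
        "TRPCage": 247.1, "ubiquitin": 19.8,
        "apo-myoglobin": 6.6, "nucleosome": 0.07,
    },
}

speedup = {
    system: throughput["M2090"][system] / throughput["12 x X5670"][system]
    for system in throughput["M2090"]
}

for system, factor in speedup.items():
    print(f"{system:>14}: {factor:5.1f}x")
```

The ratio grows with system size, from under 2× for TRPCage to roughly 20× for the nucleosome.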
Multi-GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores on Each Node) and the Parallel GPU Version with the SPDP Precision Model (One Intel X5670 Core and One GPU Per Node)ᵃ
| CPU/GPU | apo-myoglobin (2,492 atoms) | nucleosome (25,095 atoms) |
|---|---|---|
| GPU version | | |
| 8 × M2090 | 135.1 | 3.95 |
| 4 × M2090 | 115.0 | 2.71 |
| 2 × M2090 | 93.1 | 1.80 |
| 1 × M2090 | 78.1 | 1.42 |
| CPU version | | |
| 2048 × Opteron 6136 | – | 0.53 |
| 1024 × Opteron 6136 | – | 0.78 |
| 512 × Opteron 6136 | – | 0.65 |
| 256 × Opteron 6136 | – | 0.55 |
| 128 × Opteron 6136 | 29.8 | 0.31 |
| 64 × Opteron 6136 | 18.3 | 0.17 |
| 32 × Opteron 6136 | 10.3 | 0.08 |
| 12 × X5670 | 6.6 | 0.07 |
ᵃ For details on the hardware and software stack, see the text. A dash indicates lower speed than with fewer nodes.
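The multi-GPU rows can be turned into a parallel efficiency, defined here as the speedup over one GPU divided by the number of GPUs. The definition and the function name `parallel_efficiency` are choices made for this sketch; the numbers are the nucleosome M2090 timings from the table.

```python
# Nucleosome throughput (ns/day) by number of M2090 GPUs, from the table above.
nucleosome_ns_per_day = {1: 1.42, 2: 1.80, 4: 2.71, 8: 3.95}


def parallel_efficiency(throughput_by_gpus: dict) -> dict:
    """Return speedup over the single-GPU run divided by the GPU count."""
    base = throughput_by_gpus[1]
    return {n: (rate / base) / n for n, rate in throughput_by_gpus.items()}


efficiency = parallel_efficiency(nucleosome_ns_per_day)
for n, eff in sorted(efficiency.items()):
    print(f"{n} GPU(s): speedup {eff * n:.2f}x, efficiency {eff:.0%}")
```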
Throughput Timings (ns/day) for AMBER GB Simulations of Apo-Myoglobin (2,492 atoms) with a Time Step of 2 fs Using the Serial GPU Version with Different Precision Modelsᵃ
| precision model | SPSP | SPDP | DPDP |
|---|---|---|---|
| M2090 (6 GB) | 92.7 | 78.1 | 25.8 |
| C2070 (6 GB) | 73.7 | 64.3 | 20.5 |
| C1060 (4 GB) | 41.2 | 31.5 | 5.4 |
| GTX580 (1.5 GB, PNY XLR8) | 111.4 | 88.7 | 16.0 |
ᵃ For details on the hardware and software stack, see the text.
Deviations of Forces (in kcal/(mol Å)) of the AMBER PMEMD GPU Implementation Using Different Precision Models As Compared to Reference Values Obtained with the CPU Implementation
| precision model | TRPCage (304 atoms) | ubiquitin (1231 atoms) | apo-myoglobin (2492 atoms) | nucleosome (25 095 atoms) |
|---|---|---|---|---|
| max deviation | | | | |
| SPSP | 3.0 × 10⁻³ | 4.8 × 10⁻³ | 4.2 × 10⁻³ | 2.7 × 10⁻² |
| SPDP | 5.6 × 10⁻⁵ | 3.7 × 10⁻⁴ | 1.6 × 10⁻⁴ | 1.1 × 10⁻³ |
| DPDP | 1.1 × 10⁻⁸ | 7.3 × 10⁻⁸ | 3.4 × 10⁻⁸ | 8.0 × 10⁻⁸ |
| RMS deviation | | | | |
| SPSP | 5.0 × 10⁻⁴ | 6.1 × 10⁻⁴ | 4.1 × 10⁻⁴ | 1.5 × 10⁻³ |
| SPDP | 7.0 × 10⁻⁶ | 1.5 × 10⁻⁵ | 8.1 × 10⁻⁶ | 3.0 × 10⁻⁵ |
| DPDP | 1.5 × 10⁻⁹ | 3.6 × 10⁻⁹ | 2.6 × 10⁻⁹ | 3.2 × 10⁻⁹ |
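A maximum and an RMS force deviation of the kind tabulated above could be computed as in the sketch below. It assumes the statistics are taken over all Cartesian force components; the source does not state whether components or force-vector norms were used, and the function name and the tiny two-atom data set are hypothetical.

```python
import math


def force_deviations(forces, reference):
    """Return (max_abs_deviation, rms_deviation) over all force components.

    `forces` and `reference` are sequences of (fx, fy, fz) tuples in the
    same atom order and the same units, e.g. kcal/(mol A).
    """
    diffs = [
        f - r
        for atom_f, atom_r in zip(forces, reference)
        for f, r in zip(atom_f, atom_r)
    ]
    max_dev = max(abs(d) for d in diffs)
    rms_dev = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return max_dev, rms_dev


# Hypothetical two-atom example: one component differs by 0.5.
reference_forces = [(1.0, 0.0, -2.0), (0.5, 0.5, 0.5)]
test_forces = [(1.0, 0.0, -1.5), (0.5, 0.5, 0.5)]

max_dev, rms_dev = force_deviations(test_forces, reference_forces)
print(f"max deviation: {max_dev:.3f}, RMS deviation: {rms_dev:.3f}")
```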
Energy Drifts Per Degree of Freedom (kT/ns/dof) from Simulations of 100 ns (TRPCage), 50 ns (Ubiquitin), and 20 ns (Apo-Myoglobin)ᵃ
| time step | 0.5 fs | 1.0 fs | 2.0 fs |
|---|---|---|---|
| TRPCage (304 atoms) | | | |
| CPU | 0.000006 | 0.000066 | 0.000355 |
| GPU (DPDP) | 0.000012 | 0.000082 | 0.000382 |
| GPU (SPDP) | 0.000003 | 0.000070 | 0.000222 |
| GPU (SPSP) | 0.000184 | 0.000252 | – |
| ubiquitin (1231 atoms) | | | |
| CPU | 0.000004 | 0.000011 | -0.000216 |
| GPU (DPDP) | 0.000001 | 0.000006 | -0.000247 |
| GPU (SPDP) | 0.000003 | 0.000030 | -0.000165 |
| GPU (SPSP) | 0.001065 | 0.000305 | – |
| apo-myoglobin (2492 atoms) | | | |
| CPU | 0.000012 | 0.000094 | 0.000416 |
| GPU (DPDP) | -0.000004 | 0.000117 | 0.000290 |
| GPU (SPDP) | 0.000019 | 0.000185 | 0.000139 |
| GPU (SPSP) | 0.002230 | 0.000442* | – |
ᵃ The SHAKE algorithm to constrain bond lengths to hydrogen atoms was used for a time step of 2.0 fs; no constraints were used for smaller time steps. A dash indicates that the system heated up so strongly during the simulation that reporting an energy drift is meaningless. An asterisk indicates that the energy drift increases dramatically for longer time scales.
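One plausible way to obtain a drift in units of kT/ns/dof is to fit a straight line to the total energy along the trajectory and normalize the slope. The source does not say how the drifts were computed, so the least-squares fit, the 300 K temperature, the choice of 3N degrees of freedom, and the synthetic energy series below are all assumptions of this sketch.

```python
K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def energy_drift_per_dof(times_ns, energies_kcal, n_dof, temperature_k=300.0):
    """Least-squares slope of E(t), expressed in kT per ns per degree of freedom."""
    n = len(times_ns)
    t_mean = sum(times_ns) / n
    e_mean = sum(energies_kcal) / n
    covariance = sum(
        (t - t_mean) * (e - e_mean) for t, e in zip(times_ns, energies_kcal)
    )
    variance = sum((t - t_mean) ** 2 for t in times_ns)
    slope = covariance / variance  # kcal/mol per ns
    return slope / (K_B * temperature_k) / n_dof


# Synthetic 100 ns trajectory drifting by 0.1 kcal/mol per ns (made-up data).
times = [float(t) for t in range(101)]
energies = [-100.0 + 0.1 * t for t in times]
n_dof = 3 * 304  # assumed: 3N for an unconstrained 304-atom system

drift = energy_drift_per_dof(times, energies, n_dof)
print(f"drift: {drift:.6f} kT/ns/dof")
```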
Figure 3. Total energy (kcal/mol) along constant energy trajectories using a time step of 0.5 fs without constraints. Shown are results for TRPCage (left) and ubiquitin (right) for different precision models. The insets show the first nanosecond of each trajectory.
Figure 4. Root-mean-square deviations (RMSDs) of the Cα backbone carbon atoms of ubiquitin (excluding the flexible tail, residues 71–76) with respect to the crystal structure for 50 independent trajectories as obtained with the CPU implementation and the GPU implementation of PMEMD using different precision models.
Figure 5. Root-mean-square fluctuations (RMSFs) of the Cα backbone carbon atoms of ubiquitin residues 71–76 with respect to the crystal structure for 50 independent trajectories of 100 ns length as obtained with the CPU implementation and the GPU implementation of PMEMD using different precision models.
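The two structural measures in Figures 4 and 5 can be sketched as follows, under the simplifying assumption that every frame has already been superimposed on the reference structure (no fitting step is included here). The function names and the two-atom toy trajectory are hypothetical, and the exact procedure used for the figures is not given in this record.

```python
import math


def rmsd(frame, reference):
    """RMSD over atoms between one frame and the reference coordinates."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(pos, ref_pos))
        for pos, ref_pos in zip(frame, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmsf(trajectory, reference):
    """Per-atom RMS fluctuation about the reference positions over all frames."""
    n_frames = len(trajectory)
    result = []
    for i, ref_pos in enumerate(reference):
        squared_sum = sum(
            sum((a - b) ** 2 for a, b in zip(frame[i], ref_pos))
            for frame in trajectory
        )
        result.append(math.sqrt(squared_sum / n_frames))
    return result


# Toy data: two atoms, two frames; atom 1 moves by 2 A in the second frame.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
]

per_frame_rmsd = [rmsd(frame, reference) for frame in trajectory]
per_atom_rmsf = rmsf(trajectory, reference)
print("RMSD per frame:", per_frame_rmsd)
print("RMSF per atom:", per_atom_rmsf)
```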