Anu George1, Sandip Mondal2, Madhura Purnaprajna3, Prashanth Athri1. 1. Department of Computer Science and Engineering, Amrita School of Engineering, Bengaluru 560035, Amrita Vishwa Vidyapeetham, India. 2. CubeBio AI, Bengaluru 560100, India. 3. Department of Computer Science and Engineering, PES University, Bengaluru 560085, India.
Abstract
Molecular dynamics (MD) simulations probe the conformational repertoire of macromolecular systems using Newtonian dynamic equations. The time scales of MD simulations allow the exploration of biologically relevant phenomena and can elucidate spatial and temporal properties of the building blocks of life, such as deoxyribonucleic acid (DNA) and protein, across microsecond (μs) time scales using femtosecond (fs) time steps. A principal bottleneck toward extending MD calculations to larger time scales is the long-range electrostatic force measuring component of the naive nonbonded force computation algorithm, which scales with a complexity of (N, number of atoms). In this review, we present various methods to determine electrostatic interactions in often-used open-source MD packages as well as the implementation details that facilitate acceleration of the electrostatic interaction calculation.
Molecular dynamics (MD) simulations probe the conformational repertoire of macromolecular systems using Newtonian dynamic equations. The time scales of MD simulations allow the exploration of biologically relevant phenomena and can elucidate spatial and temporal properties of the building blocks of life, such as deoxyribonucleic acid (DNA) and protein, across microsecond (μs) time scales using femtosecond (fs) time steps. A principal bottleneck toward extending MD calculations to larger time scales is the long-range electrostatic force component of the naive nonbonded force computation algorithm, which scales with a complexity of O(N^2) (N, number of atoms). In this review, we present various methods to determine electrostatic interactions in often-used open-source MD packages as well as the implementation details that facilitate acceleration of the electrostatic interaction calculation.
Molecular dynamics (MD)
simulations[1−5] are widely used to investigate spatial and temporal fluctuations
of biomolecules. MD simulations of processes such as protein–drug
interactions[6−9] and folding/unfolding of proteins in explicit solvent[10−14] require micro- to millisecond length simulations to provide the
chemical and biological inferences that researchers commonly probe.
The number of iterations or time steps can range from tens to hundreds
of millions. This high sampling requirement is attributed to the high-frequency
motion of bond-stretch variation.[15] Bond
constraint algorithms such as SHAKE[16] and
atomic mass scaling of the atoms in the system[17] have been used to decrease the sampling rate. Nonetheless,
the time steps in these simulations are in the range of a couple of
femtoseconds. The compute-intensive stages involved in each time step
(summarized below), compounded by the high sampling rate requirement,
result in simulations that extend from hours to months (wall time).[18]

In each iteration or time step of the
MD simulations, the force
on the system is calculated as the sum of bonded and nonbonded components.[1] Hence, the choice of the energy function is an
important decision toward providing a stable MD simulation package.
In general, most MD software packages (see Section ) use an additive form of individual functions
mapped to specific empirically calculated parameters to express a
particular form of interaction that contributes to the instantaneous
total potential energy of the system. A defined combination of such
functions and their associated parameters is known as a force field.
The force field used in most MD packages is shown in eq 1. It is the sum of intramolecular (V_intra) and intermolecular (V_inter) contributions. The V_intra term comprises eqs 1a through 1d, and the functions represent bond (V_bond), angle (V_angle), and dihedral or torsional angle (V_tors) contributions, with V_{1,5} being the remaining contribution. V_inter is the sum of long-range electrostatic (V_el) and short-range van der Waals (V_vdw) interactions (eqs 1e and 1f). Thus, the total potential energy can be written as the sum of these contributions. Here, k_b, k_θ, k_φ, and k_UB are the bond, angle, dihedral angle, and Urey–Bradley force constants, respectively. (b – b_0), (θ – θ_0), and (ω – ω_0) are the differences between the instantaneous values for the specific atoms and the reference (equilibrium) values of the bond, angle, and improper angles, respectively. Dihedral angles are represented as a combination of cosine functions, and in eq 1c, n is the multiplicity and ϕ_0 the phase shift. The ∑ notation indicates that the summation is carried out over every distinct pair of i and j without counting any pair twice. q_i is the partial charge on the ith atom, and ϵ_0 is the absolute dielectric permittivity of the vacuum. ϵ_ij and σ_ij are the Lennard-Jones parameters for the van der Waals interaction between two interacting sites, i and j, while r_ij = ||r_i – r_j|| is the distance between those sites and L(r_i, r_j) = 1/||r_i – r_j||.
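For reference, the two nonbonded terms defined above (eqs 1e and 1f) take the familiar forms reconstructed below from these variable definitions; the Lennard-Jones prefactor convention varies between force fields, so the 4ϵ_ij form is shown only as an illustration:

V_{el} = \sum_{i<j} \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}}

V_{vdw} = \sum_{i<j} 4 \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]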
The bonded forces determine the molecular conformation. They account for the potential energy contribution from the covalent bonds, angles, and torsion angles, and their influence drops off steeply after a reasonably chosen cutoff radius.[2] The van der Waals potential between two atoms also rapidly approaches zero since it is inversely proportional to r^6, where r is the distance between the two atoms.[19] The electrostatic force between charged atoms, however, does not cease to exist even when the distance between them is considerable.[20,21] Due to this, the cutoff distance for electrostatic calculations is much larger than that for van der Waals interactions in nonbonded force calculations. Further, the computational complexity of the naive pairwise evaluation grows quadratically with the number of atoms, N. Thus, the electrostatic forces in the nonbonded force calculations are the primary bottleneck in MD simulations.[19,22]

In this review, the implementation of electrostatic potential
calculation
algorithms (eq 1) in
the most widely used simulation packages, namely, AMBER,[23] NAMD,[24] GROMACS,[25] LAMMPS,[26] DESMOND,[18] and ACEMD,[27] is explored.
These MD packages are able to achieve very long time-scale MD simulations, in the micro- to millisecond range, through the use of GPUs (graphics processing units).

The organization of the paper is shown in Figure 1, and the objectives are the following:
(i) summarize the available methods of long-range electrostatic or Coulomb force calculations with respect to accuracy, complexity, and scalability (Section );
(ii) review the software and GPU-enabled technologies currently used by popular MD packages to accelerate electrostatic calculations (Section ).

Figure 1
Different electrostatic force calculations (EFCs) are explored in Section of the review. In Section , different techniques for the acceleration of the EFC by the GPU are studied in MD packages along with solo GPU implementations of the algorithms. The yellow-colored box encloses the different electrostatic force calculation algorithms, whereas the gray box encloses the MD packages and solo algorithmic implementations. The color code used for each implementation is the same as the fill color of the box of the corresponding algorithm name.

Algorithms used to reduce this complexity can be
categorized as
Ewald-based and non-Ewald-based calculations. In algorithms that use
the Ewald-based method, the slowly converging force (given in eq 1f) is split into a real-space contribution and a reciprocal-space contribution, along with a self-term that is associated with error correction. This splitting into components makes the computation much faster.[28−30] The non-Ewald algorithms, such as the Wolf method, reaction field method, preaveraging method, zero-dipole summation methods, etc., are based on the assumption that the effect of long-range interactions is negligible beyond the cutoff radius for an electrostatically neutral system.[30] The Ewald-based algorithms are the particle mesh Ewald (PME)[31] (reviewed in Section ) and the particle–particle/particle-mesh (P3M)[31,32] methods (Section ). The other
variants of electrostatic force calculation algorithms discussed in
this review are the direct Coulomb summation (DCS)[33] (Section ), multilevel summation method (MSM)[34] (Section ),
and fast multipole method (FMM)[35,36] (Section ).
Different Methods of Coulombic Interaction Force
Calculations
Electrostatic interactions are one of the main
forces that contribute
to determining the structure of biomolecules.[37] They play a significant role in modeling the dynamics of charged
molecules of proteins, DNA, and membranes in explicit solvent simulations.[38,39]

The main challenges in electrostatic potential computation
(eq ) are the slow
convergence
of the electrostatic potential of atoms with distance and the need
to avoid artifacts introduced in the potential calculation. These
artifacts are due to the boundary conditions as the surface molecules
experience different forces from other molecules present.[29,40] While the primary objective of this study is to provide a survey
of methods implemented to address the slow convergence issue, it should
be noted that periodic boundary conditions (PBCs) are used to address
surface effects. The central cell containing the simulated system
is replicated in three dimensions to form an infinite lattice. During
the course of the simulation, when a molecule moves in the central
original box, its periodic image in each of the adjacent boxes moves
in exactly the same way. Therefore, whenever a molecule departs the
central box, one of its images enters the opposite face. There are
no walls at the central box boundary and no surface molecules in the
PBC scheme, and hence errors introduced due to boundary surface effects
are negligible.

The convergence bottleneck exists because the electrostatic potential does not decay to zero even at very large distances,[41,42] and long-range methods are used to address this. The effect of electrostatic interactions over long ranges is captured by the application of periodic boundary conditions (PBCs). The PBC method considers the interactions
between direct particles, which are contained in the central simulation
box, as well as their mirror images in the replicated boxes.[2,43] This yields a more accurate value for the potential and, in turn,
for the forces acting on the atoms. Hence, the simulation of biomolecules
using long-range methods results in a more accurate in silico reproduction of the experimental results.[44] However, the computation-intensive nature of these methods becomes the bottleneck when scaling to larger time scales.[45,46]

There are several long-range-based algorithms available to
handle
Coulombic interactions. Among them, the most common algorithms used
for Coulombic computation in explicit solvent MD are
(i) direct Coulomb summation (DCS)[33] (Section );
(ii) multilevel summation method (MSM)[34] (Section );
(iii) particle mesh Ewald (PME)[31] (Section );
(iv) particle–particle/particle-mesh (P3M)[31,32] (Section );
(v) fast multipole method (FMM)[35,36] (Section ).
These algorithms are explained and compared in terms
of their accuracy, computational complexity, and scalability; the algorithmic variations of the Coulombic force calculations result in differences in all three. The accuracy is measured
with respect to the root-mean-square (rms) error in terms of force.
The complexity depends on the number of particles in the system simulated
and the methods used, such as fast Fourier transform (FFT), etc. The
scalability of the above long-range-based algorithms is determined
by the volume of intercommunication between different functional modules
in the algorithm. The scalability increases as the amount of communication
in the algorithm decreases. Highly scalable algorithms provide increased
performance by utilizing more computational resources. The accuracy,
computational complexity, and achievable scalability are the factors
that determine the suitability of an algorithm for use in molecular
dynamics (MD) simulation.
Direct Coulomb Summation
Direct Coulomb summation is a lattice-based method in which the coordinates of atoms and their partial charges are taken into account for the computation of the electrostatic potential.[33] A rectangular lattice is defined around the atoms with a fixed lattice spacing in all three dimensions. The electrostatic potential contribution, V(i), at each point i in the lattice is calculated as

V(i) = \sum_j \frac{q_j}{4 \pi \epsilon_0 \, \epsilon(r_{ij}) \, r_{ij}}

where q_j is the charge of atom j; ϵ_0 is the absolute dielectric permittivity of the vacuum; ϵ(r_{ij}) is the dielectric constant that varies with the distance between the lattice point i and the atomic charge q_j;[33] and r_{ij} is the pairwise distance between the atom j and the lattice point i, as shown in Figure 2.
Figure 2
Electrostatic potential
calculation in direct Coulomb summation.[33]
The potential map that is calculated by using the
rectangular lattice
can be subdivided into planes and rows. The components of the distance calculation (the z- and y-coordinates) for an atom to points on the same row and plane are
constant. Hence, calculation of the potential energy contribution
of an atom to all the points in a row of the lattice can reuse the
distance component.[47] This reuse of the
distance component achieves increased performance as memory accesses
and computations are reduced for the computation of the potential
of each atom.[48]
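The following minimal NumPy sketch illustrates this scheme; it is not taken from any package discussed here, the Coulomb constant and 4πϵ_0 are folded into a single prefactor, and ϵ(r) is set to 1 for simplicity. The y- and z-distance components are computed once per atom and reused across the corresponding rows and planes of the lattice:

import numpy as np

def direct_coulomb_map(coords, charges, grid_shape, spacing, coulomb_const=1.0):
    # Direct Coulomb summation of the potential map on a rectangular lattice.
    # coords: (N, 3) atom positions; charges: (N,) partial charges.
    nx, ny, nz = grid_shape
    xs = np.arange(nx) * spacing          # lattice coordinates, fixed spacing in all dimensions
    ys = np.arange(ny) * spacing
    zs = np.arange(nz) * spacing
    potential = np.zeros(grid_shape)
    for q, (ax, ay, az) in zip(charges, coords):
        dx2 = (xs - ax) ** 2              # varies along a row
        dy2 = (ys - ay) ** 2              # constant for a given row, reused
        dz2 = (zs - az) ** 2              # constant for a given plane, reused
        r = np.sqrt(dx2[:, None, None] + dy2[None, :, None] + dz2[None, None, :])
        potential += coulomb_const * q / r   # assumes no lattice point coincides with an atom
    return potential

# Two opposite charges mapped onto an 8 x 8 x 8 lattice with 2 Å spacing.
coords = np.array([[3.0, 3.0, 3.0], [9.0, 9.0, 9.0]])
charges = np.array([+1.0, -1.0])
print(direct_coulomb_map(coords, charges, (8, 8, 8), spacing=2.0).shape)

The cost of the loop is proportional to the number of atoms times the number of lattice points, which is the complexity discussed below.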
Accuracy
This method does not involve any approximations
and hence is used as the benchmark to compare the force calculation
methods for accuracy.[47]
Complexity
The complexity of the method for a system
composed of N atoms and Q points
in the lattice is O(N·Q).
Scalability
The scalability of the algorithm is nonlinear
due to the quadratic complexity of the summation algorithm.
Multilevel Summation Method
The high computational cost of the direct Coulomb summation method makes it an unsuitable choice for large molecular systems with more than
a few hundred atoms. Multilevel summation is one of the fastest potential
approximation methods that can be used for periodic, semiperiodic,
and nonperiodic systems.[34,49−51] The steps involved in the multilevel summation algorithm are shown
in Figure .
Figure 3
Workflow involved
in the multilevel summation algorithm. Adapted
with permission from ref (52). Copyright 2014 American Institute of Physics.
In this method, the Coulombic potential (given in eq 1f) for the atomic
partial charges
on the lattice defined around the biomolecule is calculated by hierarchical
interpolation. Hierarchical interpolation and restriction are the
two actions performed to estimate the charge at each lattice point
at different levels of the grid. The charge at each grid point is
due to all the atomic charges in the biomolecule. The potential of
the system is split into two parts. The first component is the short-range
part that disappears after the cutoff distance a.
The long-range part is v_long. It is obtained by excluding the force involved in the interaction of the atoms with themselves in the mirror images (v_ex), as shown in eq .

The long-range term is mapped to a grid where the lattice spacing doubles for every successive level. This allows the use of different time steps for the calculation of each term in the potential and hence helps in improving performance. The term L(r_i, r_j) in V_el (given in eq 1f) is split into the sum of short-range and long-range interactions. The short-range interaction L_0 is truncated at distance a. The slowly varying long-range part is split into L_1, L_2, L_3, ···, L_{M−1}, L_M, which are clipped off at distances 2a, 4a, 8a, ···, 2^{M−1}a, ∞, where L_0(r, r′) and L_k(r, r′) are given as in ref (53). Here r, r′ are the position vectors of the ith and jth atoms; γ(R) = 1/R for R ≥ 1 is a softening function; and R in the softening function is defined per level k, where k = 0 to M and m = 1, 2, 3, ..., M – 1 is the level of the grid.

The interpolation of the forces from the grids of spacing h, 2h, 4h, 8h, ···, 2^{M−1}h is done
using the interpolation operator I. I is calculated
in terms of the interaction between the grid point r = (x, y, z) and the corresponding nodal
basis function ϕ,[34] which is expressed in terms of the single-dimensional basis function Φ, where m = 1, 2, ..., M and p, n are the indices of the grid points.

The continuous differentiability of Φ and the continuity of γ ensure the conservation of energy, which is critical in long simulations. The interaction kernel L(r_i, r_j) is approximated efficiently by nesting between levels of interpolation.[53]

The interpolation function I (eq ) can be further divided into computations that assign
charges to the grid points. Then the potential at these grid points
is calculated based on the charge assigned. Finally, the long-range
contribution to the potential is interpolated from the lattice points
on the finest grid (the smallest spacing between the lattice points).
The local support for Φ is provided by interpolation with a
piecewise polynomial of degree q with stencil size q + 1.[54] These steps are computed
by finding the intermediate charge q and the electrostatic potential e as described in ref (33).

The negative gradient of the potential function (given in eq ) provides the force exerted on the particles due to the electrostatic interaction. It is then used along with the bonded and van der Waals forces to find the displacement of the particles in the system in one time step by solving Newton's equations of motion.

The appropriate choice of spacing of the finest
grid (h) and the short-range cutoff distance (a) determines the accuracy of the force computation. An h-spacing of 2 Å, which is close to the interatomic
spacing, is used along with a cutoff radius a between
8 Å and 12 Å. The choice of these values for grid spacing
and cutoff radius results in a relative error of less than 1%.[49] Accuracy can be further improved by the use
of higher orders of interpolation. Nevertheless, as stable dynamics
are achieved at a much lower accuracy in MSM, lower interpolation
orders can be used for simulations.[33]

The computational complexity at each grid point is a constant, and the number of grid points decreases by a factor of 1/8 at each level. The complexity of the algorithm is O(N + Q), where Q is the number of lattice points and N is the number of atoms.

The dense, localized convolution calculations
involved and the close neighbor communications in MSM facilitate high
scalability and, thus, maximum utilization of parallel-intensive computation
platforms.[33]
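The kernel splitting at the heart of MSM can be illustrated with a short sketch; this is a schematic example only, and the particular even-powered Taylor softening function used here (γ(R) = 15/8 − 5R²/4 + 3R⁴/8 for R < 1, and 1/R otherwise) is one common choice rather than the form used by any specific package:

import numpy as np

def gamma(rho):
    # Smooth softening of 1/rho for rho < 1; exact 1/rho beyond.
    rho = np.asarray(rho, dtype=float)
    smooth = 15.0 / 8.0 - 5.0 / 4.0 * rho**2 + 3.0 / 8.0 * rho**4
    return np.where(rho < 1.0, smooth, 1.0 / rho)

def split_kernel(r, a):
    # Split 1/r into a short-range part (identically zero for r >= a)
    # and a slowly varying long-range part handled on the lattice hierarchy.
    long_part = gamma(r / a) / a
    short_part = 1.0 / r - long_part
    return short_part, long_part

r = np.linspace(0.5, 20.0, 200)
short_part, long_part = split_kernel(r, a=8.0)
print(np.allclose(short_part + long_part, 1.0 / r))   # True: the split is exact
print(np.max(np.abs(short_part[r >= 8.0])))           # 0.0: short-range part vanishes beyond a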
Particle Mesh Ewald (PME)
PME is
based on the Ewald scheme of potential calculations.[31] The application of periodic boundary conditions (PBCs)
is used in PME to overcome the boundary surface effect. The use of
PBCs results in certain artifacts which are dealt with in the computation
of Coulombic potential. In the calculation of Coulombic potential,
the correlation between the particles in the central cell and the
replicated cells must remain the same. Care should also be taken that there is no correlation between the original particle contained in the central cell and the mirror image of that particle
contained in the replicated cell.[56] These
factors are accounted for in the PME method by calculating the potential
energy as the sum of the direct space sum (v_d), the reciprocal space sum (v_r) calculated as shown in Figure 4, and the error-correcting term or self-term (v_c). The direct space sum (v_d) calculates the potential due to interactions between charges within the cutoff radius. The reciprocal space sum, v_r, includes the interaction between charges in the central unit cell and the replicated cells, and v_c is subtracted from the sum of v_d and v_r to remove the interaction of charges with themselves in the replicated cells.[57] Here N is the number of particles in the simulated system; n is the coordinate vector of the unit simulation cell, and * denotes that the sum runs over all values of n, excluding the terms with i = j when n = 0; q_i and q_j are the ith and jth charges between which the interaction is being calculated; erfc(x) = 1 – erf(x), where erf(x) is the error function and erfc(x) is the complementary error function; α is the Ewald parameter; r_n = |r_i – r_j + n|; V is the volume of the unit cell; and m is the reciprocal lattice vector.[58]

V_Ewald converts the slowly converging direct potential energy (given in eq 1f) into
a fast converging series.[57] The Ewald parameter
α allows the balancing of computation between the direct and
reciprocal space with no change in the accuracy of the potential calculation.
Figure 4
Long-range
part of the electrostatic potential calculation in the
particle mesh Ewald method.[55] Adapted with
permission from ref (55). Copyright 2014 Marc Snir.
The calculation of the reciprocal space sum (v_r) is done by approximating the atom charges
on a three-dimensional
grid. This approximation is made by interpolation using cardinal B-splines.[58] The reciprocal space sum v_r given in eq can be rewritten in terms of the structure factor S(m),[59] where Q is the 3D grid of dimensions K_1, K_2, and K_3 onto which the charges are interpolated and F(Q) is the discrete Fourier transform of Q.

The main parameters that determine the accuracy
of the method are the Ewald splitting parameter (α), the value
of the cutoff radius within which the short-range potential (v_d) is calculated, the order of the charge assignment
function, and the spacing of the 3D grid. It is an optimization problem
to find the optimal value of these parameters that attains the desired
accuracy. The charge approximation error decreases with an increase
in the order of the B-spline used.[60,61]

When α is chosen to be sufficiently large, the interaction in direct space beyond a certain cutoff radius becomes negligible, and its complexity reduces to O(N). The use of the fast Fourier transform (FFT) to compute the reciprocal space energy and force reduces the complexity of the long-range part to O(N log N).[32]

The scalability is constrained by the method
of implementation of 3D-FFT used in the reciprocal space calculation.
The scalability of PME can be improved by increasing the load of the
direct space and thus decreasing the FFT computation to be done in
the reciprocal space. This can be achieved by increasing the 3D-grid
space.[62,63]
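A minimal sketch of the Ewald splitting that PME builds on is shown below, in reduced units with the Coulomb constant set to 1; it is an illustration rather than the implementation used in any of the packages above, and the reciprocal-space sum (evaluated in production codes via B-spline charge spreading and FFTs) is omitted:

import numpy as np
from scipy.special import erfc

def ewald_direct_and_self(coords, charges, box, alpha, r_cut):
    # Direct-space (erfc-screened) energy plus the self-energy correction term.
    # Minimum-image convention is used for a cubic periodic box of edge `box`.
    n = len(charges)
    e_direct = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[i] - coords[j]
            d -= box * np.round(d / box)          # minimum image
            r = np.linalg.norm(d)
            if r < r_cut:
                e_direct += charges[i] * charges[j] * erfc(alpha * r) / r
    # Self term removes the spurious interaction of each screening Gaussian with its own charge.
    e_self = -alpha / np.sqrt(np.pi) * np.sum(charges**2)
    return e_direct, e_self

coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
charges = np.array([+1.0, -1.0])
print(ewald_direct_and_self(coords, charges, box=10.0, alpha=0.35, r_cut=4.5))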
Particle–Particle/Particle Mesh Method
(P3M)
P3M is also based on the Ewald
summation method.[31] It gives the most accurate
potential among the different methods developed based on Ewald, and
it is also the most flexible.[32] In this
method, similar to the Ewald method, the electrostatic potential is
divided into short-range and long-range parts as given in Figure 5.
Figure 5
Potential calculation
in the particle–particle/particle-mesh
algorithm by dividing into two parts, namely, particle–particle
for the short-range part and particle mesh for the long-range part.
For the short range, a 3D chaining mesh is built around the system
such that the sides of the cell HC1 and HC2 are
greater than the cutoff radius R. The cutoff radius R is taken as 3 to 4 times the sides H1, H2 of the potential mesh.[32]
The contributions of the short-range and long-range
parts are weighted
by the Ewald parameter (α). The division into short-range and
long-range in P3M is achieved by deducting a shielding
charge distribution of identical magnitude from the actual point charge.
The short-range potential is calculated as the direct summation of
the interaction potential between the combination formed by the point
charge and shield charge distribution over the adjacent neighbors.[64] The long-range part involves the effects due
to the shielding charges.

The S2 function proposed by Hockney
and Eastwood for standard charge
distribution[65] is given in ref (65), where a adjusts the S2 distribution and r is the position vector of an atomic charge.

The short-range interaction potential between two particles with charge distribution γ(r) is expressed with a constant C that depends on the range of values of ζ, as given in Toukmaji and Board.[57]

The accurate potential
due to interaction between the point charges
is found by adding potential due to interaction between the charge
distributions (long-range part) and the short-range interaction potential.[64] The long-range interaction potential at each
grid point k is found by applying the inverse fast
Fourier transform (FFT) on ψ̂(k), which
is calculated aswhere k is the position vector
of each grid point; ρ̂(k) is obtained
by the FFT of the gridded charge distribution; and Ĝ(k) is the optimal influence function.[66]Ĝ(k) is calculated once at the start of the simulation based on the
charge shape, system size, and interpolation function.[64] The force at each grid point due to the potential
of the nearest-neighboring grid point is calculated by applying the
finite difference operators. Improved accuracy is obtained by considering
not only the potential due to the nearest neighbors but also the next
nearest neighbors in chain. Thus, a linear combination of finite difference
operators is applied to calculate the potential at each grid point.
Unlike analytical differentiation, this method conserves momentum
and thus avoids errors such as incorrect particle drifts.[65] Fourier transformation is done once to obtain
in real space the force exerted on each particle of the system and,
in turn, the velocity and thus the displacement in each time step.

The complexity of the algorithm is O(N log N), which is similar to the other mesh-based Ewald methods.[67]

It is the most accurate among the Ewald-based
methods. The accuracy of the algorithm depends on the use of a refined
mesh or the use of higher orders of interpolation, which are computationally
expensive.[32] Another method called “interlacing”
is used, in which a second grid is introduced at half a grid spacing,
and the same calculation as on the initial grid is performed. The
potential and force obtained on both the grids are averaged to get
more accurate values.[68] Thus, interlacing
facilitates the use of wider spaced grids, and hence performance is
not hindered.

Similar to PME, the scalability depends
on the workload distribution between the short-range and long-range
parts. More long-range load results in more FFT calculations involving
all-to-all communication and hence less scalability. The algorithmic
implementation of 3D-FFT plays a very significant role.[69,70]
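The mesh part of the calculation can be sketched in a few lines of NumPy; this toy version uses nearest-grid-point charge assignment and the plain continuum Green's function 4π/k² in place of the optimal influence function Ĝ(k), so it shows only the ψ̂(k) = Ĝ(k)ρ̂(k) structure of the method:

import numpy as np

def mesh_potential(coords, charges, box, n_grid):
    # Long-range mesh potential via FFT: psi_hat(k) = G_hat(k) * rho_hat(k).
    h = box / n_grid
    rho = np.zeros((n_grid, n_grid, n_grid))
    for q, pos in zip(charges, coords):
        idx = tuple((np.round(pos / h).astype(int)) % n_grid)   # nearest-grid-point assignment
        rho[idx] += q / h**3
    rho_hat = np.fft.fftn(rho)
    k = 2.0 * np.pi * np.fft.fftfreq(n_grid, d=h)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    g_hat = np.zeros_like(k2)
    g_hat[k2 > 0] = 4.0 * np.pi / k2[k2 > 0]     # continuum Green's function; the k = 0 term is dropped
    return np.real(np.fft.ifftn(g_hat * rho_hat))

coords = np.array([[2.0, 2.0, 2.0], [6.0, 6.0, 6.0]])
charges = np.array([+1.0, -1.0])
print(mesh_potential(coords, charges, box=8.0, n_grid=16).shape)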
Fast Multipole Method (FMM)
The FMM
is one of the top 10 algorithms of the 20th century[36] developed by Greengard and Rokhlin for the fast
calculation of electrostatic potential.[35] In this method, the potential calculation is divided into interactions
between neighboring or direct particles and those between distant
particles as shown in Figure 6. The distant particle interaction is calculated by multipole
expansions.[71]
Figure 6
In the fast multipole
method, the interactions on a particle at
the particle–particle level are calculated between all the
particles that are located in the same and surrounding cells. The
interactions with particles from the larger cells are approximated.
Adapted with permission from ref (72). Copyright 2006 Taylor and Francis Group LLC
(Books) US.
The infinite multipole expansion gives the potential V(r) at point P, which
is at a distance
of r from the center of a collection of point charges
enclosed in a sphere of radius a, as shown in eq , where y_l^m(θ,ϕ) is the spherical harmonic potential;[73] θ and ϕ are the zenith and azimuth angles of the atom at point P; and M_l^m is the lth multipole moment at level m.

The simulation box is hierarchically
decomposed into a tree structure
of smaller subunits. The interaction effect of all the particles in
a subunit on a distant particle is calculated by truncating the multipole expansion to a number of terms set by the required accuracy ϵ.
The effect of larger groups of particles is obtained by hierarchically
combining these multipole expansions. Potential accumulation due to
the interaction of all subunits is obtained by the hierarchical combination
of the Taylor expansions.[74] The distant
forces thus calculated are added to the direct forces to obtain the
potential of each particle.[75]
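The underlying idea, replacing a distant cluster of charges by a short expansion about its center, can be seen in the toy sketch below; it keeps only the monopole and dipole terms and uses Cartesian rather than spherical-harmonic moments, so it illustrates the principle rather than the machinery of a production FMM:

import numpy as np

rng = np.random.default_rng(0)
cluster = rng.uniform(-1.0, 1.0, size=(50, 3))   # point charges confined to a small region
charges = rng.uniform(0.1, 1.0, size=50)
p = np.array([40.0, 0.0, 0.0])                   # distant evaluation point

# Exact potential at p by direct summation over the cluster.
v_exact = np.sum(charges / np.linalg.norm(p - cluster, axis=1))

# Truncated multipole expansion about the cluster center (monopole + dipole).
center = cluster.mean(axis=0)
r_vec = p - center
r = np.linalg.norm(r_vec)
monopole = charges.sum()
dipole = np.sum(charges[:, None] * (cluster - center), axis=0)
v_approx = monopole / r + np.dot(dipole, r_vec) / r**3

print(v_exact, v_approx)   # the two agree closely because r is much larger than the cluster size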
The long-range electrostatic interaction of systems having a nonuniform distribution of particles can be computed using the FMM. It can also be applied to both periodic and vacuum systems. The FMM is utilized for QC calculations of large systems where scaling is considered an essential criterion along with accuracy.[76] However, for QM/MM calculations, PME is considered more efficient than the FMM. The greater code complexity of the FMM and its weaker energy conservation compared to mesh-based Ewald methods have led to the predominant use of mesh-based Ewald methods in the molecular dynamics packages of the current era.

The method is based on two parameters, mainly the
depth or levels of the FMM tree and the length/number of terms in
the multipole expansion. Based on the user-provided energy error threshold,
the optimal set of these parameters is obtained by running the error control and energy minimization scheme.[77,78]

The FMM has a complexity of O(N) and hence scales linearly with system size.[79,80]

It is one of the most scalable electrostatic algorithms due to its linear complexity. The communication complexity is O(log P), where P is the number of processes, owing to its hierarchical nature.[81] The distribution of particles among processes is performed using the z-space-filling curve, which reduces the communication required between the processes and, in turn, increases the algorithm's scalability.

The summary of comparisons of the different algorithms used for electrostatic force calculation, discussed in Section , is given in Table 1 in terms of accuracy, complexity, scalability, and the boundary conditions required for the respective algorithms.
Table 1
Summary of Comparisons between Different
Electrostatic Force Calculation Algorithms (EFC Alg.)a
EFC Alg.   complexity           accuracy              scalability        boundary conditions
DCS        O(N·Q)(60)           Highest[47]           Low[47]            NPC, PC
MSM        O(N + Q)(33)         Medium[33]            High[33]           PC, SPC, NPC
P3M        O(N log N)(67)       High[32]              Medium[69,70]      PC
PME        O(N log N)(32)       Medium[60,61]         High[62,63]        PC
FMM        O(N)(79,80)          Medium[77,78,82]      High[81]           NPC, PC
a NPC stands for nonperiodic conditions, PC for periodic conditions, and SPC for semiperiodic conditions.
Electrostatic Force Calculation on GPUs by Different
MD Software
The data-parallel, compute-intensive nature of
MD force calculation
stages makes graphics processing units (GPUs) a very suitable choice
for accelerators.[83,84] The use of GPUs with multicore
CPU provides the researchers with increased computation capability
on workstations and laptops. Thus, GPU augments multicore CPU performance.
GPUs are equipped with floating-point performance of over tens of
teraflops, high-bandwidth memory, and hardware multithreading. These
GPU features help to accelerate the electrostatic force calculation
stage of MD. The design and implementation of the electrostatic force
calculation algorithm such that it scales to hundreds of tightly coupled
processors is the key requirement for harnessing the power of GPU computing.

The following sections explore the different techniques used by GPU-accelerated MD production packages to speed up the electrostatic force calculation. They provide researchers and developers working on MD performance with information on the currently employed strategies for accelerating electrostatic force calculation algorithms.
AMBER MD
Assisted model building
with energy refinement (AMBER) is a set of force fields developed
to simulate the dynamics of biomolecules,[85,86] and AMBER MD[23] is an MD package that
utilizes the AMBER force fields for simulation. The nonbonded force
calculations in AMBER involve the van der Waals (vdW) interactions
and the Coulomb interactions.[87] For the
van der Waals (vdW) interactions, the cutoff distance is used as it
decays to zero after a certain distance, whereas the Coulomb forces decay very slowly and, if ignored beyond the cutoff, can lead to abrupt dynamics due to the lack of energy conservation. The use of
periodic boundary conditions (PBCs) avoids artifacts due to boundary
conditions. Hence the unit cell, which contains the system to be simulated,
is replicated in all three dimensions.[88]

The Coulomb force calculation in AMBER is done using PME (Section ), a lattice sum method that takes into consideration the force contribution beyond the cutoff. In AMBER, for explicit solvent
MD,[89] the long-range electrostatic force
is calculated fully on GPU using this algorithm. In the reciprocal
space, 3D fast Fourier transform (FFT) of charge from real to complex
is performed using the CUFFT library[90] of
Nvidia. This calculation in AMBER implementation happens with one
thread per particle.
Techniques for Acceleration
Arithmetic Precision
The single-precision fixed-point
(SPFP) integer arithmetic combines single precision with a 64-bit
fixed-point arithmetic. The use of SPFP in the place of a double-precision
floating point for PME calculation helps significantly to boost the
performance with not much loss in accuracy.[91] This use of SPFP also helps fully utilize the future and present
GPU architectures without incurring a performance degradation in the
old GPUs. The current version of AMBER provides three precision models:
the SPFP, which is the default model; the SPDP, which uses double-precision floating-point variables for the three bonded-term (bond, angle, and dihedral) calculations and for the accumulation of the potential, and single-precision floating point for all others; and the DPDP, which uses double precision for the entire GPU code.[91,92]
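The effect of fixed-point accumulation can be sketched as follows; the 2**40 scale factor is an assumption made for this illustration, and the actual AMBER kernels perform the equivalent accumulation with 64-bit integer atomic additions on the GPU:

FORCE_SCALE = 2 ** 40   # assumed fixed-point scale factor, for illustration only

def to_fixed(x):
    # Convert a floating-point force contribution to a 64-bit-style fixed-point integer.
    return int(round(x * FORCE_SCALE))

def accumulate(contributions):
    # Integer additions are exact and associative, so the sum is independent of ordering.
    total = 0
    for c in contributions:
        total += to_fixed(c)
    return total / FORCE_SCALE

forces = [1e-3, -2.5e-7, 3.14159, -1e-3, 2.5e-7]
print(accumulate(forces))                  # deterministic result
print(accumulate(list(reversed(forces))))  # identical, unlike naive float summation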
Neighbor-List Creation
The neighbor-list of atoms within
the cutoff distance is obtained using two sorting mechanisms.[93] This list is used to perform the direct sum
on the GPU. The atoms are first sorted into boxes with dimensions
at least equal to the cutoff radius, and each atom is assigned a box
ID. Then, in the next step, to improve the spatial locality of each
atom in the box, a Hilbert curve[94] is implemented
as shown in Figure 7. The box ID and the Hilbert curve coordinates are then used to sort the atoms. The neighbor list for each atom is built from the sorted atoms by considering all the adjacent boxes within the extended cutoff (cutoff radius + buffer); i.e., all the atoms within the extended cutoff (compared as R²) are taken as neighbors. A proposed optimization uses a bit-mask on the Hilbert ID to exclude certain atoms from the R² calculation. Once the neighbor list is finalized,
the vdW and electrostatic interactions are calculated by skipping
the 1–2, 1–3, and 1–4 interactions that are listed
in the topology file.[95]
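A simplified serial version of this two-stage construction, box binning followed by a squared-distance test against the extended cutoff, is sketched below; the Hilbert-curve sort, the bit-mask optimization, and the 1–2/1–3/1–4 exclusions of the actual AMBER code are omitted:

import numpy as np
from collections import defaultdict
from itertools import product

def build_neighbor_list(coords, box, r_cut, buffer):
    # Bin atoms into boxes with edges >= the extended cutoff, then collect neighbors.
    r_ext = r_cut + buffer
    n_bins = max(1, int(box // r_ext))            # bin edge length >= extended cutoff
    bin_size = box / n_bins
    bins = defaultdict(list)
    for i, pos in enumerate(coords):
        bins[tuple((pos // bin_size).astype(int) % n_bins)].append(i)

    neighbors = {i: [] for i in range(len(coords))}
    for (bx, by, bz), members in bins.items():
        # Only the 27 adjacent bins (with periodic wrap-around) need to be searched.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            key = ((bx + dx) % n_bins, (by + dy) % n_bins, (bz + dz) % n_bins)
            for j in bins.get(key, []):
                for i in members:
                    if i < j:
                        d = coords[i] - coords[j]
                        d -= box * np.round(d / box)       # minimum image
                        if np.dot(d, d) < r_ext ** 2:      # compare squared distances
                            neighbors[i].append(j)
    return neighbors

coords = np.random.default_rng(1).uniform(0.0, 30.0, size=(100, 3))
nl = build_neighbor_list(coords, box=30.0, r_cut=8.0, buffer=2.0)
print(sum(len(v) for v in nl.values()))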
Figure 7
Neighbor-list is prepared
by the use of a Hilbert curve ID along
with a Box ID. Adapted with permission from ref (93). Copyright 2013 American
Chemical Society.
GPU Cluster
Parallel simulations across multiple GPUs
were done using the message passing interface (MPI) protocol by replicating
the data structures to all the GPUs.[93] Hence,
the memory usage on each GPU is approximately identical to that of a single-GPU implementation; this can be improved by better communication and peer-to-peer functionality for direct memory copies.[96]

In AMBER18,[97] additional features have been added to improve the acceleration
of PME calculation. AMBER18 has achieved for dihydrofolate reductase
(DHFR) a simulation speed of 657 ns/day compared to 588 ns/day in
Amber16 (4 fs time step, constant energy). This acceleration has been
achieved by faster PME real space and reciprocal space calculation
by the innovative spline tabulation look-up and particle mapping kernels
and optimizing the memory access of bonded and nonbonded terms.
NAMD with PME
The smooth particle
mesh Ewald (SPME) algorithm is used in NAMD[24,98] for electrostatic force calculations, which is a variation of PME.
In this, the charge-spreading action is performed by the B-spline
functions. This interpolation scheme is a continuous function at the
grid points (n). As the points are all distinct,
their derivatives are all continuous up to a certain degree (n – 1).[99] This property
of B-splines helps conserve energy as the derivatives of the potential
yield continuous forces. It also helps reduce the number of FFTs to
half compared to the PME method. In SPME, potentials are approximated,
but the forces are the exact derivatives. Integration of the forces is achieved by multiple time stepping, where the nonbonded forces are calculated the least frequently. The neighbor list for short-range
forces is obtained by dividing the system spatially into patches.
Within each patch, the atoms are sorted into groups of 32 using the
orthogonal recursive bisection method.[100] A compute object in NAMD consists of 32 × 32 tiles. It is further
divided into tile lists that can be run on any thread block. This
division into tile lists provides more flexibility for load balancing
between the warps in a thread block.[24]
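The charge-spreading step itself amounts to evaluating cardinal B-splines at the fractional offsets between a charge and its nearby grid points. A one-dimensional sketch using the standard B-spline recursion is shown below; it is independent of the NAMD code and uses order-4 splines purely as an example:

import numpy as np

def bspline(u, order):
    # Cardinal B-spline M_n(u) from the standard recursion; nonzero only for 0 <= u <= n.
    if order == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    return (u / (order - 1)) * bspline(u, order - 1) + \
           ((order - u) / (order - 1)) * bspline(u - 1.0, order - 1)

def spread_charge_1d(x, q, n_grid, h, order=4):
    # Spread one charge at position x onto a periodic 1D grid of spacing h.
    grid = np.zeros(n_grid)
    u = x / h
    base = int(np.floor(u)) - order + 1       # leftmost grid point touched by the spline
    for k in range(order):
        idx = (base + k) % n_grid
        grid[idx] += q * bspline(u - (base + k), order)
    return grid

g = spread_charge_1d(x=3.3, q=1.0, n_grid=16, h=1.0, order=4)
print(g.sum())   # ~1.0: the B-spline weights sum to the full charge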
FFT Parallel Implementation
The 3D FFT is implemented
for multi-GPU by using a pencil, slab, or box decomposition based
on the number of GPUs as shown in Figure . Each decomposition is assigned to a single
GPU. For pencil decomposition, 1D FFT is performed; 2D FFT is performed
for slab; and 3D FFT is performed for box-level decompositions, respectively,
using the cuFFT library.[90] For 1D FFT,
transposing of the grid is done to get a contiguous memory location
for the next FFT direction.[101] The parallel
implementation of FFTs requires numerous messages to be passed between
MPI nodes. Nevertheless, the MPI architecture adhered to by NAMD makes
it possible to interleave the FFT calculations with the dominant short-range
nonbonded interactions.[102]
Figure 8
Different types of decomposition
for parallelization of FFT (a)
slab decomposition and (b) pencil decomposition. The all-to-all communication
is minimum in slab decomposition. The number of slabs is based on
the number of processes in the CPU/GPU node.[103]
Newton’s Third Law
According to Newton's third law, when the interaction forces between atoms i and j are calculated, if a force f is exerted on i by j, then the force exerted on j by atom i will be –f. This law makes it unnecessary to calculate the force on atom j again. Race conditions, which arise when all the threads in a warp try to write to the same memory location of atom j, can be avoided by looping through the tile in diagonal lines.[104] The use of shared memory and its synchronization
overheads for shifting atom coordinates, charges, and computed forces
are avoided by the use of shfl instruction, which
makes use of GPU registers.[105]
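The diagonal traversal can be sketched for a single tile as follows; this is a scalar Python illustration of the indexing pattern only, with a toy pair force, and on the GPU the inner loop over t corresponds to the threads of a warp executing in parallel, each writing to a distinct j at every step:

import numpy as np

TILE = 8   # warp/tile width (32 in NAMD; kept small here for readability)

def tile_forces(xi, xj):
    # Accumulate pair forces for an i-tile against a j-tile using diagonal looping.
    fi = np.zeros_like(xi)
    fj = np.zeros_like(xj)
    for k in range(TILE):                 # diagonal index
        for t in range(TILE):             # on a GPU: all threads of the warp in parallel
            j = (t + k) % TILE            # each "thread" t touches a distinct j this step
            d = xi[t] - xj[j]
            f = d / (np.dot(d, d) + 1e-6) ** 1.5   # toy pair force for illustration
            fi[t] += f                    # Newton's third law: equal and opposite forces
            fj[j] -= f
    return fi, fj

rng = np.random.default_rng(2)
fi, fj = tile_forces(rng.normal(size=(TILE, 3)), rng.normal(size=(TILE, 3)) + 5.0)
print(fi.sum(axis=0) + fj.sum(axis=0))   # ~0: the total force on the two tiles cancels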
Charm++
This is a programming system and run-time library
used by NAMD.[106] It provides a message-driven
programming framework and processor virtualization capability. In
message-driven programming, a method is invoked on an object only
when a message arrives for it, and thus it helps to avoid communication
latency.[107] Processor virtualization allows
algorithm writing for a maximum number of parallel objects and dynamic
distribution among the available processors at a run time. The performance
achievable is decided by the nonbonded calculation, which scales as N·R³·ρ, where N is the number
of atoms, R the cutoff distance for short-range calculations,
and ρ = N/V the number of atoms per unit volume V.[24]
NAMD with MSM
The multilevel summation
method (MSM, Section ) is used as an alternative method for electrostatic force
calculation in NAMD[53,108] as it supports periodic, semiperiodic,
and nonperiodic biomolecular simulations. The GPU accelerates this
compute-bound algorithm. High throughput is achieved by harnessing
the power of thousands of multithreaded cores in the GPU. The communication
structure of the MSM algorithm guarantees parallel scalability, and
the slowly varying potential allows multiple time steppings, which
helps to improve performance.[109]

The short-range part of the system potential V_msm (the first term in eq ) becomes more compute-intensive as the grids become finer.
The most time-consuming computations involved in the algorithm are
the lattice-based cutoff part, v_long, and the short-range part of V_msm. Hence,
these calculations are accelerated by the GPU.
Small-Bin-Based Geometric Hashing
This approach is
used to accelerate the short-range part of the MSM. In this, the CPU
performs the geometric hashing of the atoms into small bins. This
separation allows the calculation of interactions to be limited to
only the neighboring bins. The CPU regularizes the grid of bins, avoiding special boundary handling, by adding empty padding bins. Then, these bins are copied to the
global memory of the GPU. A subcube of lattice points is assigned
to a thread block. The thread block calculates the potential of each
subcube of lattice points by iteratively copying the neighboring atom
bins to the shared memory. The offsets of the neighboring bins are
read from the constant memory by using their thread block index. The
load balancing between the CPU and GPU for computation of atom interactions
was found to be ideal for a bin size of 4 Å.[101]
Sliding Window Approach
The potential at each point
in the lattice is the sum of surrounding charges that are weighted
by their distance to the respective point. The potential due to interaction
between points on a lattice is calculated by the lattice cutoff part v_long (eq ). At each level, k, the bounding sphere
which has a cutoff radius of 2a has the same number of atoms as levels k + 1 and k – 1. The weights at the boundary
of a sphere can be calculated theoretically, as the distances between
points on a uniform lattice are the same.Each thread block
is assigned a subcube of size 64 lattice points, and each thread is
assigned a point. The constant memory is used to store the precalculated
weights, and the shared memory is used to store the surrounding neighboring
cubes of charges. To achieve constant memory access at register speed, all the threads in a warp have to read the same weight. The
sliding window approach is used to achieve this. The shared memory
is loaded with eight subcubes containing 64 charges each in this approach.
A sliding window of size 4³ is shifted four times in the
three dimensions, namely, x, y,
and z, within a triply nested loop. In each sliding
window position, every thread applies a single weight value to its
corresponding charge to obtain the updated potential. This sliding
window loop is embedded in another triple-nested loop that iterates
over the neighborhood cubes loaded in the shared memory. For the correctness
of the sliding window approach, the eight subcubes should be padded
by a layer of subcubes with no charged points in them. For coalesced
reads from global memory, the lattice points in each subcube are stored
in x,y,z-ordering,
and the subcubes are stored in row–column–depth order.[101]
GROMACS with PME
GROMACS is one of
the most widely used open-access MD packages. It employs the heterogeneous
platform of multicore CPUs and GPUs for maximum performance, throughput,[25,110] etc. The electrostatic force calculation is performed using PME
(Section ). The
highly parallelized environment of GROMACS divides the computation
among ensembles, domains, and multiple SIMD cores.[111] This decomposition of workloads, with SIMD, OpenMP, and MPI acting on each domain, helps maximize balanced hardware resource
utilization. The different methods implemented for PME acceleration
in GROMACS are explored.
Eighth Shell Domain Decomposition
This method[112] helps to significantly reduce the communication
involved for force calculation between the domains. Charge groups
(grouping of partially charged atoms into a single neutrally charged
group) are used as the basic element in the decomposition as it reduces
the number of calculations involved. In the eighth-shell decomposition,
the system is split into independent subsystems along with dynamic
load balancing, and these subsystems are then mapped into separate
MPI ranks (while retaining the “locality of reference”
within each domain). Multiple OpenMP threads are contained in each
MPI rank. The best performance is achieved when the product of the number of MPI ranks and the number of OpenMP threads is equal to the number of cores on a node, and the threads are pinned to the cores.
Each of the subsystems is described as a triclinic cell, which helps
decrease the volume of solvent compared to a rectangular cell and
thus helps reduce computation and increase performance by 40%. The
apparent difference in the workload between the protein and solvent/water
domains is addressed by dynamic load balancing.[113] For the electrostatic interactions, 3D domain decomposition
is performed for the direct space, and 2D pencil decomposition is
done for the reciprocal space.[25]
Dedicated Nodes
The long-range part of the electrostatic
interaction is allocated to separate dedicated MPI nodes[113,115] as in Figure 9. The
division of the available nodes is such that 3–4 nodes are
dedicated to a direct space map, and a single node is dedicated to
reciprocal space. The PME nodes are kept to a minimum as the message
communications required for 3D FFT calculations scale as a square
of the number of nodes involved. The 3D FFT is performed using 2D
pencil decomposition.[116] This mode of allocation
helps to automatically determine the division of the workload into
short-range and long-range force calculations among the nodes available
in the simulating platform.
Figure 9
Scaling with domain decomposition is achieved
with the multiprogram
multidata approach in which particle–particle (PP) ranks to
PME ranks are allocated in the ratio of 3:1 for rectangular simulation
boxes and 2:1 for rhombic dodecahedra to minimize all-to-all communication.
The arrows represent the flow of communication. Each PP rank communicates
with the PME rank.[114]
The FMM has also been implemented in GROMACS on
GPU.[82] The FMM is faster when the simulation
is performed
on a large number of particles, large simulation boxes, inhomogeneous
charge distributions, and high parallelization.[116,117] The multipole-to-local calculation is the most compute-intensive
part of the FMM. Its complexity depends on the multipole order (refer
to Section ).
The ScaFaCoS FMM[67] is used in GROMACS,
which is fully parallelized using CUDA.
LAMMPS
The large-scale atomic/molecular
massively parallel simulator (LAMMPS)[26] is a highly parallelized MD package. The parallelization is achieved
by spatial decomposition of the system domain into multiple subdomains.
Domain decomposition is performed dynamically for load balancing,
as shown in Figure 10. Each of the subdomains is allocated to separate nodes. The communication
between the nodes is based on the message passing interface (MPI)
protocol. In LAMMPS, the long-range electrostatic potential calculation
is done by the P3M algorithm. The acceleration of the P3M algorithm is achieved by its implementation on a distributed
system of GPU clusters connected by InfiniBand[69] and the algorithmic methods proposed for PME.[101]
Figure 10
Domain decomposition for a system with uniform
density into an
equally sized box provides an equal load to all the allocated cores,
but for a nonuniform density system, an equally sized block gives
different load to the cores as shown in (a). LAMMPS performs domain
decomposition as shown in (b) for load balancing.[118]
Charge Assignment Acceleration
The charge assignment
function used is a spline of order P. The algorithm’s
accuracy depends on P and the spacing of the grid, h. The values used for P and h determine the error due to the discretization of the charge. The
parameters P (default value of 5), h, and α are fixed for a user-specified value of the root-mean-square
(RMS) discretization error.[32]

For parallelization, a 3D grid is virtually overlaid on the simulated system, and the system is divided into subdomains. The subdomains
are allocated to each of the processors in the cluster. The processor
performs the grid charge allocation for the charges that belong to
its subdomain. All the grid points (P3) within a P-stencil width centered on the grid point closest to
the charge are affected by the charge allocation. Since this charge
allocation involves ghost grid points that do not belong to the respective
processor, it must be communicated to the neighboring processors.
3D FFT Acceleration
Once the charge allocation has
been performed, parallel 3D FFT is performed. The forward 3D FFT is
carried out in three communication stages. The 3D grid is partitioned
into 2D columns in x-dimension, then 2D-x into 2D-y, and finally 2D-y into
2D-z. On the final output of the grid partitioning,
2D-z, convolution is performed. The inverse FFT is
performed using the same three communication stages in the reverse
order. The gradient for force calculation is done using the ik-differentiation[119] in the
reciprocal space. The electric field vector of each particle is obtained
by interpolation from P3 grid points.
It also involves the ghost grid points. This information is obtained
by six-way communication with the neighboring processors.[69]

The large volume of communication exchanged
between processors compared to computational intensity is the bottleneck
in the scalability of parallel 3D FFT and, hence, the scalability
of particle-mesh methods.
ACEMD
ACEMD[27] is MD software designed to scale across multiple GPUs to achieve microsecond-long simulations in an economical and effective manner.
The ideal system for ACEMD is a single node attached to multiple GPUs
via PCIe expansion slots. The GPUs compute different force terms based
on a simple force–decomposition scheme.[120] Dynamic load balancing is performed to distribute the particles
for force calculation among the GPUs. The host processor then sums
up the force terms computed by the GPUs. This total force matrix is
transferred back to all the GPUs for complete system integration.
For electrostatic force calculations, the PME[60] algorithm is implemented in a scalable parallel manner by dedicating
a specific number of GPUs for it.
Cell-List Construction
The system is decomposed into
cells for dynamic load balancing. Then, based on the cell-list scheme,[101,122] the particles are first binned based on their coordinates. For an
MD simulation that uses a cutoff radius, R = 12 Å,
approximately 22 atoms can be accommodated in bins with size R/2. Thus, the number of atoms in a bin (cell) is comparable
to the warp size in a GPU of 32 threads. Each cell is assumed to have
a maximum of 64 atoms to overcome the density fluctuations that arise
during the simulation. For PME, the accuracy required for MD is provided by using a cutoff radius R = 9 Å. This, in turn,
results in each cell being populated with at most 32 atoms. Each thread
launched by the cell-list construction kernel processes the construction
of the cell list for one particle at a time for all atoms in that
cell. Atomic memory operations are used to allow concurrent access to the cell-list array.[123]
Optimized Memory Access
For the nonbonded force computation,
a thread block is allocated a single cell as shown in Figure . All the atom coordinates
within a radius R of the current cell are loaded
into the shared memory. The forces are computed by looping over the
particles loaded in the shared memory. The cost of global memory access
restricts the use of Newton’s third law in this GPU implementation,
and hence the reciprocal force is not stored. The texture memory helps
by providing radial components of the force functions (Coulomb and
van der Waals) from a lookup table. The lookup table provides the
linearly interpolated values of these functions. The error due to
the use of the lookup table is acceptable for MD as it is consistently less than 10^–4.[18]
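The lookup-table idea can be sketched on the host side as follows; the radial function, table resolution, and indexing by r are assumptions made for this illustration, whereas ACEMD performs the equivalent interpolation in hardware through the texture units:

import numpy as np

R_CUT = 12.0
N_TABLE = 4096

# Tabulate a radial force function F(r) = 1/r^2 (a toy stand-in for the Coulomb/van der Waals kernels).
r_table = np.linspace(0.5, R_CUT, N_TABLE)
f_table = 1.0 / r_table**2

def force_lookup(r):
    # Linearly interpolated table lookup, mimicking what the GPU texture unit provides.
    x = (r - r_table[0]) / (r_table[-1] - r_table[0]) * (N_TABLE - 1)
    i = int(np.clip(np.floor(x), 0, N_TABLE - 2))
    frac = x - i
    return (1.0 - frac) * f_table[i] + frac * f_table[i + 1]

r = 7.321
print(force_lookup(r), 1.0 / r**2)    # interpolation error stays small for a dense table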
Figure 11
Optimized
memory access with each cell allocated to a thread block
in the GPU.[121]
DESMOND
DESMOND[18] is an open-source MD program developed by D. E. Shaw Research.
DESMOND achieves high performance through well-defined implementation
techniques and novel distributed memory parallel algorithms that accelerate
parallel MD simulations significantly. These include parallel decomposition
methods, message-passing techniques, novel communication primitives,
and short-vector SIMD capabilities. The reduction of conflicts in
updating the nonbonded forces has provided the maximum performance
improvement.[22]
Midpoint Method
The system that is to be simulated
is partitioned into a mesh of rectangular parallelepipeds or boxes.
To parallelize the range-limited electrostatic interactions within
a cutoff radius R, similar to the spatial decomposition
strategy,[124] each process updates particles
in one box of the grid. This significantly reduces the communication bandwidth required by the process, and the bandwidth decreases further if the cutoff radius is decreased.[112,125−128]

The midpoint method[126] further
reduces the communication bandwidth. This method, shown in Figure 12, calculates the
interaction between two particles in a box if the midpoint of the
segment connecting the two particles falls in that respective box.
The data of particles that reside within the R/2
distance of the home box have to be imported for the interaction calculation; this region is called the import region in the midpoint method. The communication
bandwidth requirement for this parallelization method depends on the
size of this import region. The midpoint method has the advantage
that the same data communicated for pairwise nonbonded interactions
can be reused for bonded forces, requiring local communication. The
particle simulation parallelization engine, a custom portable library,
helps with the efficient communication and computation in the midpoint
method. The library manages data order in memory, and the messages
to be sent to other nodes are assembled in the order in which it is
to be sent.
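A minimal host-side sketch of the midpoint ownership rule is given below; the uniform box decomposition, wrapped coordinates, and function names are assumptions made for illustration rather than DESMOND's implementation.

#include <cmath>

struct Vec3 { double x, y, z; };

// Index of the uniform box (side L, n boxes per dimension) containing point p;
// coordinates are assumed to be wrapped into [0, n*L).
static int boxOf(Vec3 p, double L, int n) {
    int bx = (int)std::floor(p.x / L) % n;
    int by = (int)std::floor(p.y / L) % n;
    int bz = (int)std::floor(p.z / L) % n;
    return (bz * n + by) * n + bx;
}

// Midpoint rule: the interaction between a and b (assumed to lie within the
// cutoff) is computed by the process owning the box that contains their
// midpoint, so each box only needs to import particles within R/2 of its border.
// Minimum-image handling of the midpoint is omitted for brevity.
static int owningBox(Vec3 a, Vec3 b, double L, int n) {
    Vec3 mid = { 0.5 * (a.x + b.x), 0.5 * (a.y + b.y), 0.5 * (a.z + b.z) };
    return boxOf(mid, L, n);
}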
Figure 12
In the midpoint method, if the distance between two particles
is
less than or equal to the cutoff distance, then the midpoint of the
line joining the two particles is calculated, and the computation
of the interaction for the particles involved is done in the box in
which the midpoint falls. In the above figure, for the particles b
and c, the midpoint falls in box 1, and hence the computation for
this interaction is done in box 1.[18] Reprinted
in part with permission from ref (18). Copyright 2013 IEEE.
Task Decomposition
The particle-based approach is used
for task decomposition.[22] The near-pairs,
which are pairs of atoms that come within the cutoff radius, are stored
in a nonredundant linear list. In this decomposition, each GPU thread
is dedicated to one particle, and it performs the force calculation
for all pairs involving that particle. For force calculation, the
“full-shell” mechanism is used, where the force for
each particle is calculated by its respective thread. Hence, the same
pair interaction is calculated twice.[129] Conflicts between different threads updating the same particle's memory are thus removed at the cost of redundant storage and computation, and this mechanism also leads to better load balancing.
Charge spreading and force interpolation involved in PME and the k-space Gaussian split
Ewald method (k-GSE)[130] require no further
communication when this decomposition method is implemented. Force
contribution to all the particles in the import region must be exported
after computing all the home–box interactions. Once the force
on each particle has been calculated, the atom position update may move particles between processors, so particle migration has to be handled. However, if the import region is enlarged slightly, this migration needs to be performed only every few time steps, because the position change of a particle in a single time step is much smaller than R.[125] This reduces the number of communication steps required, at the cost of a slight increase in bandwidth for the extra particle information exchanged during the particle import and force export stages. The volume of the import region is the measure of communication bandwidth for this parallelization strategy.[18]
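The following CUDA sketch illustrates the full-shell idea: one thread per particle evaluates all of that particle's near-pairs, so each pair interaction is computed twice but every thread writes only to its own force entry and no conflicts arise. For clarity, the pair list here is stored redundantly per particle; the layout and names are assumptions, not DESMOND's data structures.

// Full-shell sketch: thread i owns particle i and evaluates every near-pair
// (i, j); the (j, i) copy is recomputed by thread j, so no two threads ever
// write to the same force entry. pairStart has nAtoms + 1 entries.
__global__ void fullShellForces(const float4 *pos,
                                const int *pairStart,  // prefix offsets into pairList
                                const int *pairList,   // partner indices per particle
                                int nAtoms,
                                float4 *force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 pi = pos[i];
    float3 fi = make_float3(0.f, 0.f, 0.f);

    for (int k = pairStart[i]; k < pairStart[i + 1]; ++k) {
        float4 pj = pos[pairList[k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float inv_r = rsqrtf(r2);
        float coef  = pi.w * pj.w * inv_r * inv_r * inv_r;  // Coulomb only, for brevity
        fi.x += coef * dx;  fi.y += coef * dy;  fi.z += coef * dz;
    }
    force[i] = make_float4(fi.x, fi.y, fi.z, 0.f);           // exclusive write: no conflicts
}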
Staged Communication
During the import and export,
staged communication[124,131] is used. This reduces the number
of messages sent by each box from 26 (the number of neighbors of a home box) to 6, one in each cardinal direction (+x, −x, +y, −y, +z, −z), carried out in three stages.
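The message-folding idea behind staged communication can be illustrated with a small serial simulation of the pattern (not DESMOND's communication code): in the x stage each box exchanges with its ±x neighbors, and in the y and z stages it forwards everything received so far, so six messages per box reach all 26 neighbors.

#include <cassert>
#include <set>
#include <vector>

// Serial illustration of staged halo exchange on an n*n*n periodic box grid.
// known[b] is the set of box ids whose data box b has received so far.
int main() {
    const int n = 4;
    auto id = [&](int x, int y, int z) {
        auto wrap = [&](int v) { return (v % n + n) % n; };
        return (wrap(z) * n + wrap(y)) * n + wrap(x);
    };

    std::vector<std::set<int>> known(n * n * n);
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < n; ++y)
            for (int z = 0; z < n; ++z)
                known[id(x, y, z)].insert(id(x, y, z));   // each box starts with its own data

    int dirs[3][3] = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };  // stages: x, then y, then z
    for (auto &d : dirs) {
        std::vector<std::set<int>> next = known;
        for (int x = 0; x < n; ++x)
            for (int y = 0; y < n; ++y)
                for (int z = 0; z < n; ++z) {
                    // Two messages per stage (+d and -d); everything learned in the
                    // earlier stages is forwarded along with the box's own data.
                    for (int s : known[id(x, y, z)]) {
                        next[id(x + d[0], y + d[1], z + d[2])].insert(s);
                        next[id(x - d[0], y - d[1], z - d[2])].insert(s);
                    }
                }
        known = next;
    }

    // After three stages every box holds data from its full 3x3x3 neighborhood:
    // 27 sources reached while sending only 6 messages per box instead of 26.
    for (const auto &k : known) assert(k.size() == 27);
    return 0;
}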
3D FFT Parallelization
Then, the 3D FFT is performed
on the mesh. For parallelization of the 3D FFT used in PME, a 3D mesh
of size N × N × N (with N a power of 2) is superimposed on the simulated system.[18,127] The points on this mesh are spread across p × p × p boxes, each of which holds an n × n × n submesh (so N = p·n). 1D FFTs are calculated successively in each of the three dimensions to evaluate the 3D FFT. The n·n columns along the x-dimension that pass through a box contain N·n·n points in total and require n·n 1D transforms, with the data for each transform distributed across the p processes along that dimension. The 1D FFT in the x-dimension is calculated in log2(N) butterfly stages, of which log2(p) stages require data exchange between processes; in each such stage a process communicates with exactly one other process. In DESMOND, the forward transforms are performed by decimation in frequency. Three sets of 1D transforms are performed, one for each of the three dimensions, and the transformed data end up in bit-reversed order. Then, multiplication by the optimal influence function is done.
An inverse transformation
brings back the initial spatial distribution using decimation in time.
Based on this FFT parallelization, when FFT is performed using 512
processes (8 × 8 × 8), the FFT calculation requires nine
messages to be sent by each process (three in each of the three directions).
The number of messages and their latency are the primary determinants of the FFT latency in the case of small meshes. However, for large meshes, the message bandwidth also requires significant consideration. For intranode parallelization, pthreads are used to divide the work assigned to a process among multiple threads and processor cores. Normalized representations are used for atom position
coordinates.[101]
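As a single-node reference for the transform described above, the sketch below implements a radix-2 decimation-in-frequency (DIF) FFT: log2(N) butterfly stages that leave the result in bit-reversed order, which a decimation-in-time inverse pass can consume directly. It is a serial illustration only; DESMOND's distributed, per-dimension implementation is considerably more involved.

#include <cmath>
#include <complex>
#include <vector>

// Radix-2 decimation-in-frequency FFT; the length N must be a power of 2.
// The result is left in bit-reversed order, as noted in the text.
void fft_dif(std::vector<std::complex<double>> &a) {
    const std::size_t N = a.size();
    const double pi = std::acos(-1.0);
    for (std::size_t len = N; len >= 2; len >>= 1) {          // log2(N) stages
        std::size_t half = len >> 1;
        for (std::size_t start = 0; start < N; start += len) {
            for (std::size_t j = 0; j < half; ++j) {
                std::complex<double> w = std::polar(1.0, -2.0 * pi * (double)j / (double)len);
                std::complex<double> u = a[start + j];
                std::complex<double> v = a[start + j + half];
                a[start + j]        = u + v;                   // butterfly
                a[start + j + half] = (u - v) * w;
            }
        }
    }
}

A decimation-in-time inverse pass (conjugated twiddle factors, stages run from length 2 up to N, and a final scaling by 1/N) accepts this bit-reversed layout directly and restores the natural order, mirroring the forward and inverse pairing described above.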
SIMD Programming Interface
The SIMD programming interface
was developed so that short-vector SIMD instructions can be used without hand-coding them, which simplifies the programmer's work while accelerating the computations.[18] These SIMD extensions simultaneously perform four 32-bit single-precision floating-point operations on 16-byte-aligned memory. Computing in single precision has the added advantage of reducing the memory and communication bandwidth required.
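The kind of operation such an interface wraps can be illustrated with plain SSE intrinsics on 16-byte-aligned data (a generic example, not DESMOND's actual interface):

#include <xmmintrin.h>   // SSE: 4-wide single-precision vectors

int main() {
    alignas(16) float x[4]   = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float y[4]   = {0.5f, 0.5f, 0.5f, 0.5f};
    alignas(16) float out[4];

    __m128 vx = _mm_load_ps(x);                        // 16-byte-aligned load
    __m128 vy = _mm_load_ps(y);
    __m128 r  = _mm_add_ps(_mm_mul_ps(vx, vx), vy);    // x*x + y on four lanes at once
    _mm_store_ps(out, r);                              // 16-byte-aligned store
    return 0;
}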
Bitwise Time Reversibility
An ideal MD simulation should
be reversible in time. However, the round-off errors and lack of associativity
in floating-point arithmetic lead to the inability to reverse a simulation.
DESMOND can achieve accurate bitwise time reversibility for simulations
that do not use constraints.[18] It uses fixed-point arithmetic for updating particle positions, which overcomes the round-off error, and the nonassociativity of floating-point arithmetic is dealt with by maintaining a fixed ordering of computations and particles. This DESMOND feature helps to achieve an energy drift of less than 1 K per microsecond with single-precision calculations.
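A toy sketch of why fixed-point position updates restore reversibility is shown below: the displacement is quantized once to an integer number of fixed-point units, so the update is an exact integer addition that can be undone bit for bit. The scale factor and names are illustrative assumptions; DESMOND's actual scheme differs in its details.

#include <cassert>
#include <cmath>
#include <cstdint>

// Positions stored as 64-bit fixed-point integers in assumed units of 2^-32
// box lengths. The displacement is rounded once to an integer increment, so
// the update is an exact integer addition that can be undone exactly.
static const double SCALE = 4294967296.0;   // 2^32

int64_t toFixed(double x) { return (int64_t)std::llround(x * SCALE); }

int main() {
    int64_t pos = toFixed(0.3141592653589793);   // current position (box units)
    double  v = 1.7e-3, dt = 2.5e-6;              // velocity and time step (illustrative)

    int64_t dq = (int64_t)std::llround(v * dt * SCALE);   // displacement quantized once
    int64_t forward  = pos + dq;                           // integer add: no round-off
    int64_t reversed = forward - dq;                       // exact inverse step

    assert(reversed == pos);                               // bitwise time reversibility
    return 0;
}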
The position coordinates are normalized to the respective box, which helps achieve extra precision. Storing the different data structures at the appropriate level of the memory hierarchy also plays a significant role: the position, force, and energy information is stored in shared memory, while the force tables, which hold the intrinsic properties of the force fields, are stored in constant memory because they do not change. The controlling conditions are modified to avoid or reduce thread divergence, and branch predictions are made. In addition to these optimizations, streaming SIMD extension (SSE) instructions[132] help to speed up the computations further, by up to 3×.
Comparison of Electrostatic Algorithms and Their Implementations in MD Packages Based on Their Advantages and Disadvantages
In MD simulation, the main criteria to keep in mind are that the simulation's accuracy meets the threshold set by the user and that the computational cost of achieving that accuracy (simulation time, efficient utilization of resources, etc.) remains acceptable. The hyperparameters of the algorithm are set based on these criteria.
Ewald-based methods such as PME and P3M rely on fast Fourier transforms (FFTs). The use of FFTs reduces the quadratic O(N²) computational complexity of electrostatic algorithms to O(N log N) with comparable accuracy for
force calculations. The inherently periodic systems, such as crystals,
are particularly suited for FFT-based methods. The use of FFTs is
made possible in PME and P3M by applying periodic boundary
conditions. The periodic boundary conditions consider the dynamics
that occur at the boundary of the simulated system, but the applicability
of periodicity to inherently nonperiodic systems such as solutions
is questionable. It has been observed that introducing artificial periodicity for solvated biomolecules perturbs the conformational equilibrium, causing reduced fluctuations and an artificial stabilization of the most compact state.[133] The effect is more pronounced when the solvent has a low dielectric permittivity, when the solute carries an overall charge, and so on.[134]
The Ewald summation method requires that the system's
simulation box is neutral. The simulation of non-neutral systems is
facilitated by introducing a uniform background charge distribution
that neutralizes the simulation box[135] or
by adding explicit counterions. The use of background charges may
result in the rise of artifacts, as the spatial distribution of counterions
in real systems is different, particularly in systems with an inhomogeneous
dielectric constant. The uniform background charge distribution in
inhomogeneous systems artificially changes the potential of mean force.
This artifact can be overcome by the use of explicit ions to neutralize
the simulation box. The effect on the energy and pressure introduced
by the background charge in homogeneous systems can be corrected on
the fly or a posteriori.[136]
The global transpose of data required by FFTs becomes a great challenge when computing three-dimensional FFTs with the PME or P3M algorithms on large clusters of CPUs or GPUs. The communication cost of the parallel FFT has been found to grow with the number of processes p, whereas in hierarchical clustering methods such as the FMM the interprocess communication cost scales as log(p).[137] This reduced
communication cost is achieved by the memory access pattern in the
FMM, which is more localized due to the use of approximations based
on spatial scale separation. The need for an efficient and scalable
algorithm to run on the future exascale computer has brought the FMM
into the research forefront. The memory becomes a constraint in the
simulation box size scalability in the case of PME, whereas in the
FMM, only the number of particles is the limit. The performance of
the FMM becomes very high when there is sufficient parallelism to
exploit. The FMM achieves the same accuracy as PME when a multipole
order of 7 is used with a tree depth of 3, while the standard parameters (cutoff radius of 1.0 nm, grid spacing of 0.12 nm, and a B-spline interpolation order of 4) are used for PME. The energy
drift occurs in the FMM due to the octree discretization. This discretization
also results in small force discontinuities over time. This can be
compensated in the FMM by increasing the multipole order to more than
8 or regularization for lower orders. In GROMACS, energy drift is
significantly reduced for both PME and FMM implementation using a
large Verlet buffer and double precision calculation. Further, it
is found that, for a biomolecular system with 50 000 atoms, the performance of the FMM is only about one-third that of PME in GROMACS. Thus, the FMM is recommended for systems with large dimensions and nonuniform particle distributions, such as aerosols or multidroplet systems of around 10 000 atoms, where the large memory requirement of the FFT grid makes PME unsuitable.[82]
MSM provides better scalability
compared to PME when the number
of processors is large. This is because the FFTs, which require global communication, are avoided at the lower levels of MSM: the long-range part is approximated on a hierarchy of grids, and an FFT is needed, if at all, only at the coarsest (highest) level; the method is also not iterative and uses only near-neighbor point-to-point communication. The computational time taken by MSM can be reduced further by choosing an optimal grid spacing h (see Section ). The optimal grid spacing can be obtained using an equation given by Moore and Crozier[138] in Section B as the starting point to decide the number of levels and, in turn, the number of grid points. Fine-grained control of accuracy in MSM is not possible because the grid spacing can only be changed by factors of two in each direction, that is, by a factor of eight when all three dimensions are changed simultaneously. For example,
a system ApoA1 (92K atoms) was simulated on Cray XE6 nodes (32 cores
per node) by varying the number of nodes in which PME showed higher
performance when the number of cores was less than 256. The performance
crossover between MSM and PME occurs between 256 and 512 cores. The performance of PME reaches a plateau at 1024 cores, while MSM continues to scale up to 1536 cores and achieves a performance of 20 ns per day. MSM produced accuracy comparable to PME in the calculation of water properties such as the density, diffusion constant, and dielectric constant; still, the numerical accuracy of PME remains considerably higher than that of MSM. The optimal choice of parameters such as the grid size, ratio of
the cutoff to grid size, and interpolation order is important to achieve
low root-mean-square error, which evaluates the accuracy. The MSM
method greatly benefits molecular simulations involving nonperiodic
and semiperiodic boundary conditions. The property of MSM to simulate
in semiperiodic boundary conditions can be applied in the simulations
of the cell membranes. It can also be used to simulate nanopores used
as sensors and situated in a graphene sheet, which acts like a cell
membrane. MSM is implemented in the molecular dynamics package NAMD
as explained in Section .[53] MSM was also found to be faster
than FMM for a given accuracy requirement because the forces in MSM are smooth, unlike in FMM.[34,141] MSM was found to be slower than SPME in NAMD for a water system simulated on a single processor,[63] but MSM running on a GPU obtains a significant speedup due to its highly scalable nature.[142]
For DESMOND,[143] the simulations run
with the u-series algorithm produced a performance of 1100.7 ns/day
on one V100 with a 2.5 fs time step, whereas Amber18[144] achieved a performance of 779.46 ns/day and 411.90 ns/day
with 4 and 2 fs time steps, respectively, for the DHFR system with
23 000 atoms. ACEMD provides 937.5 ns/day on a single V100 GPU for DHFR (23K atoms). For STMV (satellite tobacco mosaic virus) with 1 067 095 atoms, ACEMD[145] produced 28 ns/day, DESMOND provided a performance of 34.1 ns/day on a single V100, and GROMACS 2020[146] provides 43.1 ns/day on four V100 GPUs. On the Cellulose system with 408 609 atoms, GROMACS 2020 achieves 147 ns/day on four V100 SXM GPUs, and Amber18 provides 53.39 ns/day on one V100. In all of these MD packages except DESMOND, the PME algorithm is used.
The main advantages and disadvantages of electrostatic algorithms
implemented in MD packages are summarized in Table 2.
Table 2
Comparison of the Advantages and Disadvantages of Electrostatic Algorithms Implemented in MD Packages

PME
Advantages:
1. The performance in GROMACS reaches on the order of 1000 steps per second.
2. Rapid convergence of the energy with high accuracy, using mixed precision and a highly optimized GPU implementation in GROMACS (refer to Section 3.4 for the optimization strategies).
Disadvantages:
1. Memory becomes a constraint when large simulation boxes are used, due to the memory access pattern of the FFT.
2. The FFT requires all-to-all communication and hence becomes a bottleneck for scalability; PME scaling reaches a flat level at 1024 processes.[139,140]

MSM
Advantages:
1. Based on spatial scales, it provides the most suitable memory access patterns.
2. Highly scalable and hence well suited for large-scale simulations running on GPUs and/or clusters.
3. The 3D convolutions performed in MSM are readily vectorized, while FFTs are not; this is advantageous for GPU implementations of MSM and on CPUs using technologies such as AVX2 (Intel advanced vector extensions).
4. Readily applicable to periodic, nonperiodic, and semiperiodic systems.
Disadvantages:
1. In a single-CPU implementation, MSM achieves only 41% and 47% of the performance of PME for orders 4 and 6, respectively. Increasing the number of levels in MSM results in a moderate penalty compared with the efficiency of the FFTW library.
2. When a dynamically varying simulation box is used, the efficiency of the real-space grid calculation becomes an issue, as it depends on the spatial extent of the convolution kernel stencil.

FMM
Advantages:
1. FMM performs well for simulations with a large number of particles, large simulation boxes, inhomogeneous charge distributions, and high degrees of parallelization.[116,139]
2. FMM scales linearly up to ≈270 000 000 and ≈160 000 000 particles in single and double precision, respectively.
3. FMM is limited not by the simulation box size but only by the number of particles, as the FFT-grid memory constraint does not apply to the FMM implementation.
Disadvantages:
1. Truncation of the multipole expansion at a finite order p (small values) results in deviations of the computed forces and energies; a multipole order of p ⩾ 12 achieves the highest numerical accuracy in single-precision FMM computations.[82]
2. FMM achieves only one-third of the performance of PME on a single GPU for a periodic salt water solution.
3. Applying FMM to noncubic simulation boxes increases the complexity of the algorithm.
Acceleration of Electrostatic Potential Algorithms
Not Part of an MD Package
DCS Acceleration
The direct Coulomb
summation (DCS) algorithm for direct Coulomb potential calculation
(Section ) on
the lattice is an ideal algorithm for acceleration using GPUs. This
is due to the high degree of data parallelism, intensive arithmetic
operations, and simple data structures in DCS, which are highly suitable
for the GPU architecture.[33] The GPU implementation
of DCS on an NVIDIA GeForce 8800 GTX can achieve a speedup of more than 16× compared with an optimized CPU version running on an Intel Core 2 quad-core 2.66 GHz processor. This GPU performance can be further enhanced to 40× using various algorithmic and architectural improvements.[47]
Grid Partitioning
The potential map grid of the simulated
system is divided into planes, and each of the planes is mapped to
the grid on a GPU. Different planes are distributed among multiple
GPUs for parallel computation. Each of the 2D planes is further divided
and assigned to a thread block on the GPU. The number of threads in
a block is dependent on the shared memory, the registers, and resources
that each thread consumes.[33]
Overlapping Arithmetic Operations and Memory Accesses
For the distance calculation, some coordinate components are constant for the atoms in an individual plane and can therefore be reused to find the potential contribution at different points of that plane. When the potential contributions along a row of lattice points in the x-direction are computed, the sum of the y and z distance components is calculated once per atom and reused for the remaining lattice points in that row. The increased ratio of arithmetic operations to memory references thereby masks the latency of the memory accesses. As the number of lattice points to which an atom's contribution is computed per thread increases, the load on constant memory is reduced, but more registers are needed to store the partial results; hence, the number of registers available per thread limits how many points can be evaluated for an atom at the same time.
This bottleneck is mitigated by using shared memory in addition to the registers to store the partial results. The use of shared memory and registers helps increase the number of lattice points to which an atom's contribution is found, but it also increases the computational tile size and, in turn, the size of the thread block. Performance degrades when the potential map of the simulated system is not divisible by the tile size, since padding has to be added; this additional padding results in unnecessary arithmetic operations and a loss of performance for smaller potential maps.[33,47]
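The following schematic CUDA kernel illustrates the optimization just described: atom data are held in constant memory, and each thread evaluates several x-adjacent lattice points of one z-plane so that the y,z part of the squared distance is computed once per atom and reused. The unrolling factor, tile shape, and names are illustrative and do not reproduce the published kernel.

#define UNROLL 4                      // lattice points per thread along x (assumed)
#define MAX_ATOMS_CONST 4000          // atoms resident in constant memory per pass

__constant__ float4 atomInfo[MAX_ATOMS_CONST];   // x, y, z, charge (set with cudaMemcpyToSymbol)

// Direct Coulomb summation on one z-plane of the potential lattice.
__global__ void dcsPlane(float *potential, int nx, int ny,
                         float spacing, float planeZ, int nAtoms)
{
    int ix = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
    int iy =  blockIdx.y * blockDim.y + threadIdx.y;
    if (iy >= ny || ix >= nx) return;

    float y = iy * spacing;
    float acc[UNROLL] = {0.f, 0.f, 0.f, 0.f};

    for (int a = 0; a < nAtoms; ++a) {
        float4 atom = atomInfo[a];
        float dy = y - atom.y;
        float dz = planeZ - atom.z;
        float dydz2 = dy * dy + dz * dz;        // computed once per atom, reused below
        for (int k = 0; k < UNROLL; ++k) {
            float dx = (ix + k) * spacing - atom.x;
            acc[k] += atom.w * rsqrtf(dx * dx + dydz2);
        }
    }
    for (int k = 0; k < UNROLL; ++k)
        if (ix + k < nx) potential[iy * nx + ix + k] += acc[k];
}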
Concurrent Memory Subsystem Usage
To provide data to the vast number of arithmetic units available in the GPU, the constant, shared, and global memory subsystems are accessed simultaneously and independently.[33]
FMM Acceleration on GPU
The fast
multipole method (FMM) is a very suitable algorithm (Section ) for highly parallel exascale
systems available today due to its linear complexity and high arithmetic
intensity.[147] It is scalable to thousands
of GPUs, making petascale computing possible. In this algorithm, the
system simulated is divided into subdomains that map to the branches
of a tree. The FMM algorithm consists of six main kernels, and each
of the kernels is evaluated on the GPU.
Techniques of Acceleration
Orthogonal Recursive Multisection
Workload balancing
is one of the most important requirements to achieve maximum performance
on a highly parallel platform like the GPU. The decomposition of the
simulated molecule is achieved by the orthogonal recursive multisection,
which is a variant of the orthogonal recursive bisection method.[148] It forms a binary tree-like structure with
an equal number of particles in each bisection. This formation of
the binary tree-like structure is done by finding the nth element in a group of N particles; this nth-element selection allows the domain to be divided into a number of subdomains that need not be a power of 2.[149] Rectangular-shaped subdomains called cells
are formed by this method.
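A simplified host-side sketch of the partitioning idea is given below: std::nth_element places the particle at the desired split position along an axis, so the domain can be divided into parts whose counts need not be powers of two, and recursion yields the binary-tree-like decomposition into roughly equally populated rectangular cells. The axis-cycling rule and names are assumptions for illustration.

#include <algorithm>
#include <vector>

struct Particle { float r[3]; };   // x, y, z

// Recursively split particles[lo, hi) into `parts` groups of nearly equal size,
// cycling the split axis so that roughly rectangular cells are produced.
void multisection(std::vector<Particle> &p, int lo, int hi, int parts, int axis)
{
    if (parts <= 1 || hi - lo <= 1) return;

    int leftParts  = parts / 2;            // the two sides need not be powers of two
    int rightParts = parts - leftParts;
    int split = lo + (int)((long long)(hi - lo) * leftParts / parts);

    // nth_element places the split-th particle so the counts on both sides are balanced.
    std::nth_element(p.begin() + lo, p.begin() + split, p.begin() + hi,
                     [axis](const Particle &a, const Particle &b) {
                         return a.r[axis] < b.r[axis];
                     });

    int nextAxis = (axis + 1) % 3;
    multisection(p, lo, split, leftParts, nextAxis);
    multisection(p, split, hi, rightParts, nextAxis);
}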
Dual-Tree Traversal
The cell–cell interactions
are obtained by the dual-tree traversal method, which allows the use
of rectangular-shaped subdomains.[150] All
the particles in a subdomain or cell are assigned to a process. The
information on the assigned particles is stored in the memory of that
process to reduce the memory requirement. Only the particle data and
multipole moments of particles near each subdomain border are to be
communicated to the neighboring processes. This tree traversal thus
reduces the amount of communication involved.
Local Essential Tree
The particle decomposition forms
an oct-tree structure called the global tree. Portions of the global
tree, that is, subdomains, are assigned to each process. However, each process also requires information from the borders of the other subdomains, and these data are communicated between the processes. Each process then constructs a local essential tree, a subset of the global tree, using the information received from the other subdomains. The multipole acceptance criterion (MAC)[81] determines which particle data are to be transferred from the target cell: based on the MAC, a radius R between the center of mass of the source cell and the border of the target cell residing on a remote process is determined.
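The role of the MAC in the traversal can be sketched as follows: well-separated cell pairs are handled by a multipole (cell-cell) evaluation, pairs of leaves fall back to direct particle sums, and otherwise the larger cell is opened. The opening-angle form of the criterion and all names here are assumptions standing in for the specific MAC used in the reviewed work.

#include <cmath>
#include <cstdio>
#include <vector>

struct Cell {
    float cx, cy, cz;          // center (or center of mass)
    float radius;              // bounding radius of the cell
    std::vector<int> children; // child cell indices; empty -> leaf
};

// Opening-angle multipole acceptance criterion (illustrative form).
bool mac(const Cell &src, const Cell &tgt, float theta) {
    float dx = src.cx - tgt.cx, dy = src.cy - tgt.cy, dz = src.cz - tgt.cz;
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return (src.radius + tgt.radius) < theta * dist;   // accept multipole if well separated
}

void traverse(const std::vector<Cell> &cells, int s, int t, float theta) {
    const Cell &src = cells[s], &tgt = cells[t];
    if (mac(src, tgt, theta)) {
        std::printf("M2L: source cell %d -> target cell %d\n", s, t);    // cell-cell interaction
    } else if (src.children.empty() && tgt.children.empty()) {
        std::printf("P2P: direct sum between leaf cells %d and %d\n", s, t);
    } else if (!src.children.empty() &&
               (tgt.children.empty() || src.radius >= tgt.radius)) {
        for (int c : src.children) traverse(cells, c, t, theta);         // open the larger cell
    } else {
        for (int c : tgt.children) traverse(cells, s, c, theta);
    }
}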
Batch Evaluation
The most compute-intensive kernels are the particle-to-particle (P2P) and multipole-to-local (M2L) kernels.[81] Batch evaluation of the two kernels is performed to accelerate the computations, as there is no dependency between them. The source cell is loaded into shared memory, and the target cell is assigned to a thread block; each thread in the block computes one coefficient of the multipole-to-local expansion.
The different acceleration techniques used for electrostatic force calculations on GPUs are summarized in Table 3.
Table 3
Different Methods of Electrostatic Force Calculation (EFC) Acceleration Used by MD Packages and by Standalone Algorithm Implementations (DCS and FMM) on the GPU for Improved Performance

MD package: AMBER[23]; EFC algorithm: PME; methods of acceleration: 1. Mixed-precision arithmetic (SPFP). 2. Neighbor-list creation. 3. GPU-cluster acceleration.
MD package: NAMD[24,98]; EFC algorithm: S-PME; methods of acceleration: 1. Parallel 3D FFT implementation. 2. Charm++ programming model.
MD package: NAMD[53,108]; EFC algorithm: MSM; methods of acceleration: 1. Small-bin-based geometric hashing. 2. Sliding-window approach.
MD package: GROMACS[25,110]; EFC algorithm: PME; methods of acceleration: 1. Eighth-shell domain decomposition. 2. Dedicated nodes. 3. Offload model.
MD package: LAMMPS[26]; EFC algorithm: P3M; methods of acceleration: 1. Charge-assignment acceleration. 2. 3D FFT acceleration.
MD package: ACEMD[27]; EFC algorithm: PME; methods of acceleration: 1. Cell-list construction. 2. Optimized memory access.
MD package: DESMOND[18]; EFC algorithm: PME; methods of acceleration: 1. Midpoint method. 2. Task decomposition. 3. 3D FFT parallelization.
Standalone implementation (refs 33, 47); EFC algorithm: DCS; methods of acceleration: 1. Grid partitioning. 2. Overlapping arithmetic and memory operations.
Standalone implementation (ref 147); EFC algorithm: FMM; methods of acceleration: 1. Orthogonal recursive multisection. 2. Dual-tree traversal.
Discussion
This paper compares the
accuracy, complexity, and scalability of
the electrostatic force calculation algorithms (Section ) along with the software and hardware techniques
used for the algorithms’ acceleration on high-performance platforms
such as GPUs in the above sections. Accuracy and scalability are the
two factors that decide the choice of an electrostatic force calculation
algorithm in an MD package. The use of double-precision floating points
for arithmetic operations provides very high accuracy but degrades
the performance. While the individual discussions of the methods and parallelization techniques are intended only as succinct introductions, they should suffice for readers to contrast the approaches using the comparison features provided and to make the right choice for their application or area of study.
The performance of specialized supercomputers, such as Anton 2, is based on customized algorithmic implementations and specialized hardware. They provide bleeding-edge performance but remain out of reach for the bulk of researchers; for example, Anton 2 contains 33 792 processor cores and achieves a performance of 85 μs per day for dihydrofolate reductase (DHFR).[151] For a typical researcher, the practical option is instead to optimize the electrostatic algorithms further to maximize the utilization of high-performance computing platforms.
To our knowledge, the particle mesh Ewald (PME)[8]
algorithm is the most popular one used in molecular dynamics simulations of biomolecules (see ref (152) for benchmarks). The PME implementation on
GPU used in GROMACS is one of the fastest. This is due to the efficient
utilization of parallelization features available in the GPU and the
algorithmic optimizations. Nevertheless, when more than one GPU is
to be used in a cluster, the all-to-all communication required in
PME becomes a bottleneck. This is because the number of messages exchanged
in this communication step scales quadratically with the number of
nodes in the cluster.[82] A hybrid PME offload option is also being considered, in which the communication-intensive 3D FFT is offloaded back to the CPU. This will become a viable choice on next-generation computers with high-bandwidth CPU–GPU interconnects, which will allow the grid transfer to be overlapped with computation.[153] With the emergence of exascale supercomputers, it is vital to consider an algorithm's scalability to achieve extreme acceleration, and these machines have led to further research on highly scalable algorithms such as the FMM.
The
FMM implementation on GPU can achieve the accuracy of PME with
high multipole order. However, it has been found experimentally that
the performance of the FMM on GPU for protein systems is only one-third
of the performance of PME on GPU. For large sparse systems, the FMM
outperforms PME, and it can also handle open boundaries.[82] FMM and MSM are highly scalable but have not
been adopted by the popular MD packages due to the increased implementation
complexity. If these limitations can be overcome, these algorithms could be widely adopted; hence, this remains an exciting and open area of exploration and opportunity.
Conclusion
This paper surveys the different
methods of electrostatic force
calculation and their advantages and disadvantages. Different algorithms
are weighed against each other based on the ease of programming, computational complexity,
accuracy, and simulation stability. The electrostatic force, which
arises from the nonbonded interactions, is one of the major performance bottlenecks in MD simulations of the dynamics of biological systems such as proteins, DNA, and membranes in explicit solvent.
It has been found that efficiency in terms of time and memory can only be achieved by the interplay of software and hardware optimizations.
The electrostatic force calculation is now moving toward a hybrid
GPU environment, as discussed in the research papers reviewed here.
The differences in acceleration arise from the different data management strategies used in the algorithms and from the hardware features of the platforms on which they are implemented.
The paper reviews how the
electrostatic force calculation is accelerated
by the use of high-performance platforms such as GPUs to achieve long
time-scale simulations. Mesh-based methods such as PME, P3M, and MSM are implemented in the most widely used MD packages and in special-purpose supercomputers for MD. These platforms can harness the intensive parallelism inherent in the algorithms through the fine-grained parallelism offered by GPU hardware. The ability to overlap computation with communication has
been achieved due to customized designs. The review shows that the
future of efficient acceleration of MD algorithms depends on the development
or adaptability of scientific algorithms that can harness the modern
high-performance computing platforms.