Literature DB >> 35353508

GROMACS in the Cloud: A Global Supercomputer to Speed Up Alchemical Drug Design.

Carsten Kutzner¹, Christian Kniep², Austin Cherian³, Ludvig Nordstrom⁴, Helmut Grubmüller¹, Bert L de Groot⁵, Vytautas Gapsys⁵.

Abstract

We assess costs and efficiency of state-of-the-art high-performance cloud computing and compare the results to traditional on-premises compute clusters. Our use case is atomistic simulations carried out with the GROMACS molecular dynamics (MD) toolkit with a particular focus on alchemical protein-ligand binding free energy calculations. We set up a compute cluster in the Amazon Web Services (AWS) cloud that incorporates various different instances with Intel, AMD, and ARM CPUs, some with GPU acceleration. Using representative biomolecular simulation systems, we benchmark how GROMACS performs on individual instances and across multiple instances. Thereby we assess which instances deliver the highest performance and which are the most cost-efficient ones for our use case. We find that, in terms of total costs, including hardware, personnel, room, energy, and cooling, producing MD trajectories in the cloud can be about as cost-efficient as an on-premises cluster given that optimal cloud instances are chosen. Further, we find that high-throughput ligand-screening can be accelerated dramatically by using global cloud resources. For a ligand screening study consisting of 19 872 independent simulations or ∼200 μs of combined simulation trajectory, we made use of diverse hardware available in the cloud at the time of the study. The computations scaled-up to reach peak performance using more than 4 000 instances, 140 000 cores, and 3 000 GPUs simultaneously. Our simulation ensemble finished in about 2 days in the cloud, while weeks would be required to complete the task on a typical on-premises cluster consisting of several hundred nodes.

Entities: Chemical

Mesh：

Substances：
Ligands

Year: 2022 PMID： 35353508 PMCID： PMC9006219 DOI： 10.1021/acs.jcim.2c00044

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Over the past decades, molecular dynamics (MD) simulations have become a standard tool to study biomolecules in atomic detail. In the field of rational drug design, MD can greatly reduce costs by transferring parts of the laboratory workflow to the computer. In the early stage of drug discovery, large libraries of small molecules with the potential to bind to the target protein (the “hits”) are identified and subsequently modified and optimized to ultimately uncover more potent “lead” candidates. In silico approaches allow reducing the number of small molecule compounds from tens of thousands down to a few hundred entering preclinical studies. Naturally, it is a combination of all the pharmacokinetic and pharmacodynamic features that defines whether a candidate molecule can be evolved into a useful drug. Molecular dynamics-based computational drug development concentrates mainly on the particular question of how well a specific ligand binds to a target. While calculations of absolute protein–ligand binding affinity are feasible, they also present numerous technical challenges.[1,2] Evaluation of the relative binding affinities, however, is much more tractable and in recent years has been well-established in the field of computational chemistry.[3−6] In the latter approach, MD-based so-called alchemical calculations allow obtaining differences in binding free energy between two ligands. Such calculations require performing transformations between the two ligands for their protein-bound and for their unbound solvated state. Carrying out multiple transformations allows sorting the whole collection of ligands by their binding affinity to the target. Different approaches can be used to carry out the transformations, but they all involve a λ parameter that interpolates between the ligands. An automated workflow for binding affinity calculations has recently been developed,[5] based on the open-source software packages pmx[7] and GROMACS.[8] Despite continuous advances in hardware and software, carrying out MD simulations remains computationally challenging. A typical MD project can easily occupy a modern compute cluster for days or even months until a sufficient amount of simulation trajectory is produced. Where now does a researcher get the required compute time? Established providers are the compute centers of universities and research institutes, national supercomputing centers, and local clusters, each with particular advantages and disadvantages with respect to how easily resources can be accessed, how much and how quickly they are available, what the costs are, etc. During the past decade, cloud computing[9,10] has developed into a new, alternative option to obtain compute time for scientific applications. Whereas the first systems of cloud computing reach back into the mid-1990s,[11] since about 2007, it is being increasingly used for scientific workloads.[12−16] Cloud computing providers like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform can serve both HPC and HTC demands as they nowadays offer virtually limitless compute power, plus the possibility to efficiently parallelize individual simulations over multiple instances (compute nodes in the cloud) connected by a high-performance network. One of the main promises of cloud-based computing is its ability to easily scale-up when the resources for computation are required. This way the user has access to an HPC/HTC facility which can flexibly adjust to the particular needs at a given time. Consequently, the usage of such cloud compute clusters can be fine-tuned to optimize costs or minimize the time-to-solution. The Folding@home project, initiated in 2000, was the first to leverage globally distributed compute resources for MD on a large scale.[17] Reports of cloud infrastructure in the narrow sense being used for MD date back to 2012. Wong et al. developed a VMD[18] plugin for the NAMD[19] simulation software that simplifies running simulations in the AWS Elastic Compute Cloud (EC2).[20] They carried out a simulation of a one million atom large biomolecular system on a total of 64 CPU cores spread over eight EC2 instances. Van Dijk et al. implemented a web portal to execute large-scale parametric MD studies with GROMACS[8] on European grid resources.[21] In 2014, Król et al. performed an ensemble simulation of 240 replicas of a several hundred atom large evaporating nanodroplet on 40 EC2 single-core instances.[22] Kohlhoff et al. demonstrated that long simulation time scales for biomolecules can be accessed with the help of cloud computing. They simulated two milliseconds of dynamics of a major drug target on the Google Exacycle platform.[23] Singharoy et al. described tools to easily perform MD-based flexible fitting of proteins into cryo-EM maps in the AWS cloud.[24] The concept of making cloud-based workflows for MD readily available to the scientist is also pursued by the following projects. The AceCloud[25] on-demand service facilitates running large ensembles of MD simulations in the AWS cloud; it works with the ACEMD,[26] GROMACS, NAMD, and Amber[27] simulation packages. QwikMD[28] is a user-friendly general MD program integrated into VMD and NAMD that runs on supercomputers or in the AWS cloud. HTMD[29] is a python-based extensible toolkit for setting up, running, and analyzing MD simulations that also comes with an AWS interface. A purely web-based application that facilitates setting up MD simulations and running them in the cloud is described by Nicolas-Barreales et al.[30] The Copernicus[31] scientific computing platform can be used to carry out large sets of MD simulations on supercomputers and cloud instances. In a hands-on fashion, Kohlhoff describes how to perform GROMACS simulations on Google’s cloud platform using Docker containers.[32] Arantes et al. propose a Jupyter-notebook based, user-friendly solution to perform different kinds of MD-related workflows at no cost using the Google Colab services, which is especially useful for teaching purposes.[33] Cloud computing is also increasingly being adopted to aid drug discovery. In their 2013 article,[34] Ebejer et al. review the use of cloud resources for protein folding and virtual screening and highlight the potential for future large-scale, data-intensive molecular modeling applications. They point out a virtual screening study of 21 million compounds that has been carried out in 2012 on a cloud-based cluster with 50 000 cores using Schrödinger’s docking software Glide.[35] D’Agostino et al. discuss the economic benefits of moving in silico drug discovery workflows to the cloud.[36] A recent virtual screening study of 1 billion compounds against proteins involved in SARS-CoV-2 infection was carried out on Google’s cloud services.[37] Cloud computing has been compared to traditional on-premises clusters for exemplary scientific workflows;[38,39] however, we are unaware of a quantitative study to date for the field of MD simulation. Therefore, here we assess the costs and performance of cloud computing for carrying out biomolecular simulations. We use GROMACS as the simulation engine and AWS as the provider of cloud infrastructure for this case study, for the following reasons: GROMACS is open source and freely available to anyone, and it is one of the fastest MD codes available.[40] AWS is one of the largest providers of cloud infrastructure, on par with Microsoft and Google.[10] First, we measured the GROMACS performance using established benchmarks[41] on a broad range of available instance types (with and without GPUs) and also across multiple instances. The simulation performance-to-instance price ratio allows optimizing for a minimal time-to-solution or minimal project costs. The benchmark results and the instance costs allowed us to compare the costs of carrying out simulations in the cloud to those for operating an in-house cluster. Second, we ask how much high-throughput ligand screening can be accelerated in the cloud. To address this question, we used globally available compute capacity to carry out a large protein–ligand binding affinity study at highest possible throughput.

General Background

Cloud Computing

The large cloud providers offer a wide range of instance types, with and without GPUs, optionally with extra memory or HPC network, targeted toward different application areas. Because of the sheer scale of options, cloud resources can appear overwhelming at first, especially compared to an on-premises HPC cluster with preinstalled software. To help setting up typical workflows and to illustrate best practices, plenty of online content is available such as manuals, tutorials, workshops and discussion forums. In addition, an experienced technical staff is available to directly help with specific issues—a support that this project has greatly benefited from. The compute unit that is rented out to customers is called instance. It may be a whole node with multiple cores and GPU(s), just a part of a node, or even just a single core. Large nodes that are rented out as several smaller instances are shared between different customers. However, each customer is restricted to her instance (her part of the node) exclusively, and her processes cannot spill over into the compute cores, memory, or network bandwidth allocated to other instances on the node. AWS instances come with a certain number of virtual CPUs (vCPUs) which translate to hardware threads. Renting two vCPUs on a modern AMD or Intel-based instance is equivalent to getting one physical core exclusively on that machine. Although the actual exact location of allocated compute instances remains opaque to the user, the region she chooses encompasses a group of geographically close data centers. Costs usually vary by region, depending on supply and demand, as well as energy costs, and specific services or cutting edge processor features may be available only in some of the regions. For the case of AWS, each region consists of multiple, isolated, and physically separate availability zones (AZs) within a geographic area. An AZ is a group of one or more data centers with independent redundant power supply and network connectivity. In 2021, AWS had 85 AZs in 26 regions. There are different payment models that can be chosen from. On-demand payment is most flexible, as one can rent an instance at any time and give it back when it is not needed any more. One pays only for the time that the instance is needed. One can also get reserved instances at a 50–70% discount if one books these instances for one to three years, but then one has to pay regardless if one can make use of them. Preemptible or Spot instances tap into the pool of currently unused compute capacity and are available at discount rates of up to 90% compared to on-demand, though pricing varies across AZs and over time. However, a Spot instance can be claimed back at any time by Amazon EC2 with a 2 min warning.

Using Hardware Efficiently with GROMACS

Key to optimal simulation performance is understanding how GROMACS makes use of the available hardware. GROMACS combines several parallelization techniques, among them MPI and OpenMP parallelism, GPU offloading, and separable ranks to evaluate long-range electrostatics. With domain decomposition (DD), the simulation system is divided into n × n × n domains, each of which is operated on by one MPI rank.[40] During the simulation, dynamic load balancing (DLB) adjusts the size of the domains such that any uneven computational load between the MPI ranks is minimized. Each MPI rank can further have multiple OpenMP threads. Best performance is usually achieved when the product of MPI ranks and OpenMP threads equals the number of cores (or hardware threads) on a node or instance and when all threads are properly pinned to cores. Though leaving some cores idle may in rare cases make sense, oversubscription will lead to significant performance degradation. When distributing a simulation system over an increasing number of MPI ranks in a strong scaling scenario, at some point the time spent for communication between the ranks limits further speedup. Usually the bottleneck is in the long-range contribution to the electrostatic forces which are calculated with the particle mesh Ewald (PME) method.[42] Parallel PME requires all-to-all communication between the participating ranks, leading to r2 MPI messages being sent on r MPI ranks.[40] This communication bottleneck can be alleviated by assigning a subset of MPI ranks to exclusively evaluate the long-range PME part. As typically only a quarter up to a third of all ranks need to be allocated for long-range electrostatics, the communication bottleneck is greatly reduced, yielding better performance and scalability. GROMACS can offload various types of computationally demanding interactions onto the GPU.[41,43,44] One of the largest performance benefits stems from offloading the short-range part of the nonbonded interactions (Coulomb and van der Waals). In parallel, each MPI rank can offload its local domain’s interactions to a GPU. The PME long-range part can be offloaded as well; however, this computation still cannot be distributed onto multiple GPUs. Additionally, bonded interactions and for suitable parameter settings the integration and constraint calculations can be offloaded. The relative GPU-to-CPU compute power on a node determines how many interaction types can be offloaded for optimal performance. Ideally, CPU and GPU finish their force calculation at about the same time in the MD time step so that no time is lost waiting. Earlier studies showed that both the GROMACS performance as well as the performance-to-price (P/P) ratio, i.e., how much MD trajectory is produced per invested €, can vastly differ for different hardware.[41,45] Nodes with GPUs provide the highest single-node GROMACS performance. At the same time, P/P skyrockets when consumer GPUs are used instead of professional GPUs (e.g., NVIDIA GeForce RTX instead of Tesla GPUs). The P/P ratio of consumer GPU nodes is typically at least a factor of 3 higher than that of CPU nodes or nodes with professional GPUs. Pronounced variations in GROMACS performance and cost-efficiency are therefore expected between the different instance types on AWS. Benchmarks allow picking instance types optimal for MD simulation.

Obtaining Relative Binding Free Energies from MD Simulations

To evaluate relative binding affinities in a chemical library of interest, ligands are connected into a network (graph) and a number of pairwise calculations is performed, eventually allowing the sorting of the molecules according to their binding free energy. It is a usual practice to repeat calculations several times for each ligand pair to obtain reliable uncertainty estimates.[46−48] Various methods for the alchemical calculations have been developed. For example, the commercially available Schrödinger software uses a free energy perturbation-based approach,[49] whereas the open source workflow used here[5,7] is based on thermodynamic integration (TI)[50] using a nonequilibrium transformation protocol.[51] Both approaches yield similarly accurate relative binding free energies at similar computational effort.[5] The nonequilibrium TI approach requires equilibrated ensembles of the physical end states for the solvated protein with ligand, one for ligand A and one for ligand B, as well as two equilibrated ensembles of ligand A and ligand B in solution. From the equilibrated ensembles, many short “fast growth” TI simulations are spawned during which ligand A is transformed into ligand B and vice versa using a λ-dependent Hamiltonian. The free energy difference is then derived from the overlap of the forward (A → B) and reverse (B → A) work distributions using estimators based on the Crooks fluctuation theorem.[52]

Methods

We will first describe the setup of the cloud-based HPC clusters that we used to derive the GROMACS performance on a range of available instance types and provide some details about the benchmark input systems and on how the benchmarks were carried out. Then we will outline our setup to distribute a large ensemble of free energy calculations on globally available compute resources.

Cloud-Based HPC Cluster and Software Setup

The benchmark simulations were carried out on AWS compute clusters in the North Virginia region set up with the ParallelCluster[53] open source cluster management tool. Each cluster consists of a master instance of the same architecture as the nodes (x86 or ARM). The master fires up and closes down the node instances as needed and operates the queueing system (SLURM).[54] For the x86 cluster, we used ParallelCluster v. 2.10.0 on a c5.2xlarge master; for the ARM cluster, we used v. 2.9.1 on a m6g.medium master instance. For brevity, we will from now on refer to c5.2xlarge instances as c5.2xl and also abbreviate all other *xlarge instances accordingly. All instances use Amazon Linux 2 as operating system; for technical specifications of the instances, see Table . Whereas all instances can communicate via TCP (see last columns in Table for the network bandwidth), some of them have an elastic fabric adapter (EFA). EFA enables HPC scalability across instances by a higher throughput compared to TCP and a lower and more consistent latency.

Table 1

Technical Specifications of AWS Instances Used in This Study and GROMACS Compilation Optionsa

							network
instance type	CPU model	HT or vCPUs	clock (GHz)	used SIMD instructions	NVIDIA GPUs	MPI lib	(Gbps)	EFA
c5.24xl	Intel 8275CL	96	3.0	AVX_512		i	25
c5.18xl	Intel 8124M	72	3.0	AVX_512		i	25
c5n.18xl	Intel 8124M	72	3.0	AVX_512		i	100	√
c5.12xl	Intel 8275CL	48	3.0	AVX_512		i	12
c5.9xl	Intel 8124M	36	3.0	AVX_512		i	10
c5.4xl	Intel 8275CL	16	3.0	AVX_512		i	≤10
c5.2xl	Intel 8275CL	8	3.0	AVX_512		i	≤10
c5.xl	Intel 8275CL	4	3.0	AVX_512		i	≤10
c5.large	Intel 8124M	2	3.0	AVX_512		i	≤10
c5a.24xl	AMD EPYC 7R32	96	3.3	AVX2_128		i	20
c5a.16xl	AMD EPYC 7R32	64	3.3	AVX2_128		i	20
c5a.12xl	AMD EPYC 7R32	48	3.3	AVX2_128		i	12
c5a.8xl	AMD EPYC 7R32	32	3.3	AVX2_128		i	10
c5a.4xl	AMD EPYC 7R32	16	3.3	AVX2_128		i	≤10
c5a.2xl	AMD EPYC 7R32	8	3.3	AVX2_128		i	≤10
c5a.xl	AMD EPYC 7R32	4	3.3	AVX2_128		i	≤10
c5a.large	AMD EPYC 7R32	2	3.3	AVX2_128		i	≤10
hpc6a.48xl	AMD EPYC 7R13	96	2.65	AVX2_128		t	100	√
c6g.16xl	ARM Graviton2	64	2.3	NEON_ASIMD		t	25
c6g.12xl	ARM Graviton2	48	2.3	NEON_ASIMD		t	20
c6g.8xl	ARM Graviton2	32	2.3	NEON_ASIMD		t	≤10
c6g.4xl	ARM Graviton2	16	2.3	NEON_ASIMD		t	≤10
c6g.2xl	ARM Graviton2	8	2.3	NEON_ASIMD		t	≤10
c6g.xl	ARM Graviton2	4	2.3	NEON_ASIMD		t	≤10
c6i.32xl	Intel 8375C	128	2.9	AVX_512		i	50	√
m6i.32xl	Intel 8375C	128	2.9	AVX_512		t	50	√
m5n.24xl	Intel 8259CL	96	2.5	AVX_512		i	100	√
m5zn.12xl	Intel 8252C	48	3.8	AVX_512		t	100	√
m5zn.2xl	Intel 8252C	8	3.8	AVX_512		t	≤25
p3.2xl	Intel E5-2686v4	8	2.3	AVX2_256	V100	t	≤10
p3.8xl	Intel E5-2686v4	32	2.3	AVX2_256	V100 × 4	t	10
p3.16xl	Intel E5-2686v4	64	2.3	AVX2_256	V100 × 8	t	25
p3dn.24xl	Intel 8175M	96	2.5	AVX2_256	V100 × 8	t	100	√
p4d.24xl	Intel 8275CL	96	3.0	AVX2_256	A100 × 8	i	400	√
g3s.xl	Intel E5-2686v4	4	2.3	AVX2_256	M60	i	10
g3.4xl	Intel E5-2686v4	16	2.3	AVX2_256	M60	i	≤10
g4dn.xl	Intel 8259CL	4	2.5	AVX_512	T4	i	≤10
g4dn.2xl	Intel 8259CL	8	2.5	AVX_512	T4	i	≤25
g4dn.4xl	Intel 8259CL	16	2.5	AVX_512	T4	i	≤10
g4dn.8xl	Intel 8259CL	32	2.5	AVX_512	T4	i	50
g4dn.12xl	Intel 8259CL	48	2.5	AVX_512	T4	i	50
g4dn.16xl	Intel 8259CL	64	2.5	AVX_512	T4	i	50
g4dn.12xl	Intel 8259CL	48	2.5	AVX_512	T4 × 4	i	50
g5.xl	AMD EPYC 7R32	4	3.3	AVX2_128	A10G	t	≤10
g5.2xl	AMD EPYC 7R32	8	3.3	AVX2_128	A10G	t	≤10
g5.4xl	AMD EPYC 7R32	16	3.3	AVX2_128	A10G	t	≤25
g5.8xl	AMD EPYC 7R32	32	3.3	AVX2_128	A10G	t	25

i, using Intel MPI 2019; t, using GROMACS’ built-in thread-MPI library. EFA (elastic fabric adapter) signals whether an HPC network is available.

i, using Intel MPI 2019; t, using GROMACS’ built-in thread-MPI library. EFA (elastic fabric adapter) signals whether an HPC network is available. Different versions of GROMACS (2020.2 and 2021.1, with and without MPI) were installed with the Spack[55] 0.15.4 package manager. GROMACS was built in mixed precision with GCC 7.3.1, FFTW 3.3.8, hwloc 1.11, and either Intel MPI 2019 or its built-in thread-MPI library (as listed in Table ). GPU versions used CUDA 10.2 on g instances and CUDA 11.1 on p instances. Benchmarks on m6i.32xl instances were done using ICC 2021.2 and Intel MKL. The multi-instance scaling benchmarks on m5n.24xl and m5zn.12xl instances used a GROMACS executable built with Intel MPI + ICC 2021.2 and Intel MKL. A workshop to reproduce a (slightly updated) setup is available on the web,[56] whereas general advice on how to use AWS services can be found in this book.[57]

Description of the MD Benchmark Systems

To determine the GROMACS performance on various instance types, we used seven simulation systems (Table ). MEM, RIB, and PEP are typical MD systems differing in size and composition, where no special functionality like external forces or free energy (FE) is required. MEM is an aquaporin tetramer embedded in a lipid membrane surrounded by water and ions in a simulation box of 10.8 × 10.2 × 9.6 nm3 size.[58] RIB contains an E. coli ribosome in a box of size (31.2 nm)3 with water and ions.[59] The (50 nm)3 large PEP system was used to study peptide aggregation;[60] it contains 250 steric zipper peptides in solution. MEM, RIB, and PEP were used in previous performance studies,[41,45,61] allowing the comparison of cloud instances to a variety of other already benchmarked hardware.

Table 2

Benchmark Systems: Specifications of the MD Input Systems That Are Used for Benchmarks in This Studya

benchmark acronym	no. of atoms	Δt (fs)	r_c (nm)	grid sp. (nm)	no. of FE atoms
PEP[61]	12 495 503	2	1.2	0.160	0
RIB[59]	2 136 412	4	1.0	0.135	0
MEM[58]	81 743	2	1.0	0.12	0
SHP-2 protein + ligand	107 330	2	1.1	0.12	53
c-Met protein + ligand	67 291	2	1.1	0.12	61
HIF-2α protein + ligand	35 546	2	1.1	0.12	35
c-Met ligand in water	6 443	2	1.1	0.12	61

The FE column lists the number of perturbed atoms for this benchmark (note that this number will vary for different ligands considered in the physical end states); Δt is integration time step, rc cutoff radius, and grid sp. the spacing of the PME grid. Benchmark input .tpr files can be downloaded from https://www.mpinat.mpg.de/grubmueller/bench. c-Met, HIF-2α, and SHP-2 are representative systems from the large binding affinity ensemble assembled by Schindler et al.[4] These systems run special FE kernels for all λ-dependent interactions, i.e., those involving a transformation between atomic properties. As the FE kernels are slower than the normal kernels and because of a larger cutoff, finer PME grid, and the need to calculate two PME grids (one for each of the physical states), even at equal atom count a FE simulation will be slower than a plain MD system. We therefore carried out separate benchmarks for the FE systems, chosen such that predicting the performance of all ensemble members listed in Tables and 4 is easily possible: A small, medium, and large protein plus ligand system to cover the whole range of sizes for the protein systems (35 k–110 k atoms) and one ligand-in-water system representative for all 9936 ligand-in-water simulations.

Table 3

Systems Considered for the First Binding Affinity Studya

	size (atoms)
system	protein+ligand	ligand	no. of ligands	no. of edges	no. of jobs
CDK8	109 807	5789	33	54	972
SHP-2	107 330	6231	26	56	1008
PFKFB3	96 049	6570	40	67	1206
Eg5	79 653	6116	28	65	1170
c-Met	67 291	6443	24	57	1026
SYK	66 184	5963	44	101	1818
TNKS2	52 251	6012	27	60	1080
HIF-2α	35 546	4959	42	92	1656
total					2 × 9936

Table 4

Systems Considered for the Second Binding Affinity Studya

	size (atoms)
system	protein+ligand	ligand	no. of ligands	no. of edges	no. of jobs
CDK2	106 910	4993	16	25	150
P38	80 777	6750	34	56	336
ROS1	73 957	8434	28	63	378
Bace	73 330	5914	36	58	348
JNK1	72 959	5956	21	31	186
Bace (Hunt)	72 036	5773	32	60	360
Bace (p2)	71 671	6687	12	26	156
PTP1B	70 020	8753	23	49	294
PDE2	63 943	5504	21	34	204
TYK2	62 292	5956	16	24	144
PDE10	56 616	7655	35	62	372
thrombin	49 312	6025	11	16	96
galectin	35 635	9576	8	7	42
MCL1	32 745	5435	42	71	426
total					2 × 3492

Same as in Table , but considering 14 protein–ligand complexes in one MD force field (OpenFF v2.0.0). The systems were collected from public sources for the previous free energy calculation studies.[5,66] The total number of jobs is 3 × 2 = 6× the number of edges for the protein+ligand plus an equal number for the ligand systems.

For each of eight considered protein–ligand complexes (from the study[4]), two sets of simulations are performed: protein+ligand for the solvated protein–ligand complex and ligand for the solvated ligand alone. An edge is referred to as the transformation of one ligand A to another ligand B. As we probe three independent replicas for each system in forward and backward simulation direction, and three small molecule force fields (GAFF[62] v2.11, CGenFF[63,64] v3.0.1, and OpenFF[65] v2.0.0), the total number of jobs is 3 × 2 × 3 = 18× the number of edges for the protein+ligand plus an equal number for the ligand systems. Same as in Table , but considering 14 protein–ligand complexes in one MD force field (OpenFF v2.0.0). The systems were collected from public sources for the previous free energy calculation studies.[5,66] The total number of jobs is 3 × 2 = 6× the number of edges for the protein+ligand plus an equal number for the ligand systems. In total, 2 × 9 936 = 19 872 independent jobs were run for the binding affinity study (Table ) by which 1656 free energy differences (ΔΔG values) were determined. Each job first simulated six nanoseconds at equilibrium (for the starting state, i.e., A or B), followed by 80 nonequilibrium transformations from the start to the end state (A → B, or B → A), as mentioned in section . The 80 individual transformations were started from different, equidistant positions of the equilibrium trajectory and were each 50 ps long. In total, 10 ns of trajectory was generated per job.

Benchmarking Procedure

MEM and RIB Plain MD Systems

MEM and RIB benchmarks were run for 20 k steps on single instances and for 40 k–50 k steps when scaling across multiple instances or multiple GPUs using GROMACS 2020. Because of effects of load balancing, PME grid versus cutoff scaling and memory allocations (compare section ) the first few thousand steps in a GROMACS simulation are typically slower than average and were therefore excluded from the benchmarks, which are intended to be proxies for the long-term performance. To make use of all CPU cores of an instance, the product of ranks × threads was set to the number of physical cores or to the number of available hardware threads. We benchmarked various combinations of ranks × threads and additionally checked whether separate PME ranks improve performance. Pinning of threads to cores was enabled, and no checkpoint files were written during the benchmark runs. On GPU instances we used one rank per GPU and offloaded all short-range nonbonbed interactions to the GPU(s). For improved performance, also the long-range PME contribution was offloaded to a GPU, except for some GPU instances with many cores, where it turned out to be faster to evaluate the long-range PME contribution on the CPU. For scaling benchmarks across two or more GPU instances, the long-range PME contribution was run on the CPU part, as only there can it be parallelized. The timings (in simulated nanoseconds per day) reported for MEM and RIB (Tables –9) are averages over two runs. The parallel efficiency on n instances E reported in Tables , 8, and 9 is computed as the performance P on n instances divided by n times the performance on a single instance:The performance-to-price ratios (ns/$) in the MEM and RIB tables are calculated from Amazon EC2 on-demand prices for Linux instance in the US East (N. Virginia) region (https://aws.amazon.com/ec2/pricing/on-demand/), except for hpc6a instances, which are not yet available in N. Virginia. Therefore, for those instances, the prices in the US East (Ohio) region were used.

Table 5

GROMACS 2020 Performance on Selected CPU Instancesa

ns/d values list MEM and RIB performances, and (ns/$) columns show performance to price. Values are color-coded for a quick visual orientation: Grays for low performances, blue towards higher values. For the performance-to-price ratios, reds indicate sub-average ratios, yellows average, and greens above-average ratios.

Table 9

Scaling Across Multiple GPU Instancesa

instances	total cores	ranks × threads	MEM (ns/d)	E_MEM	RIB (ns/d)	E_RIB
1	32	1 × 32/8 × 4	98.1	1.00	7.48	1.00
2	64	8 × 8/32 × 2	76.0	0.39	13.27	0.89
4	128	8 × 16/32 × 4	73.2	0.19	19.50	0.65
8	256	32 × 8			24.39	0.41
16	512	64 × 8			28.38	0.24
32	1024	32 × 32			21.47	0.09

As in Table , but for g4dn.16xl instances with hyperthreading off.

Table 7

Scaling Across Multiple CPU Instancesa

instances	total vCPUs	ranks × threads	PME ranks	MEM (ns/d)	E_MEM	RIB (ns/d)	E_RIB
1	96	48 × 2/96 × 1	0/24	127.5	1.00	10.81	1.00
2	192	48 × 4/192 × 1	12/48	158.8	0.62	20.35	0.94
4	384	48 × 8/384 × 1	12/96	201.1	0.39	37.63	0.87
8	768	384 × 2	96	182.9	0.18	59.93	0.69
16	1536	128 × 12/384 × 4	32/96	151.9	0.07	87.21	0.50
32	3072	384 × 8/768 × 4	96/192	144.3	0.04	115.49	0.33

GROMACS 2020 performances for MEM and RIB over multiple hpc6a instances. The third column lists the optimal decomposition into MPI ranks and OpenMP threads, and the fourth column lists the optimal number of separate PME ranks; the left entry is for MEM, the right entry for RIB if they differ.

Table 8

Scaling Across Multiple GPU Instancesa

instances	total cores	ranks × threads	MEM (ns/d)	E_MEM	RIB (ns/d)	E_RIB
1	16	1 × 16/16 × 1	95.3	95.3	5.15	1.00
2	32	4 × 8	65.0	65.0	8.49	0.82
4	64	8 × 8/32 × 2	73.1	73.1	15.80	0.77
8	128	32 × 4/64 × 2	63.7	63.7	21.25	0.52
16	256	32 × 8			25.86	0.31
32	512	32 × 16			22.78	0.14

As in Table , but for g4dn.8xl instances with hyperthreading off.

Free Energy Systems Used for the Binding Affinity Study

Each job of the binding affinity ensemble run (Table ) consists of two parts: first, a 6 ns equilibrium simulation; second, 80 nonequilibrium transformations of 50 ps length each. The first (equilibration) part was benchmarked as described above for MEM and RIB, using 10 k total steps with timings from the first half discarded. In cases where PME grid versus cutoff tuning took more than 5 k time steps, 20 k time steps were used in total. For the binding affinity runs we did not check whether separate PME ranks improve the performance. The timings reported in Tables and 11 resulted from individual runs of the equilibration part. Here, we ran on individual instances only, no scaling across multiple instances was attempted. Though in most cases we tested various combinations of splitting a given number of total cores Nc into ranks and threads Nc = Nranks × Nthreads, we do not report all results in Tables and 11 to keep them reasonably concise. Instead, we report a consensus for the combination Nranks × Nthreads that yielded best results across the free energy benchmark systems.

Table 10

Free Energy Benchmarks on CPU Instancesa

Performance (ns/d) and performance-to-price ratios (ns/$) for GROMACS 2020 on various Intel (c5 and m5zn), AMD (c5a), and ARM (c6g) CPU instances. Color coding as in Table .

Table 11

Free Energy Benchmarks on GPU Instancesa

As in Table , but now for GROMACS 2021 using one GPU per simulation. The single-GPU performance on p4d.24xl was derived by running 8 identical benchmarks, each using one GPU and 1/8th of the hardware threads, in a multi-simulation.

The second (transformation) part was benchmarked by timing one of the 50 ps (25 k steps) long transformation runs. No time steps were discarded from the measurements, as an initial below-average performance will occur in each of the 80 short transformation runs and thus should be included when predicting the performance of the whole transformation part. The total costs per free energy difference have been derived by combining the equilibration and transformation phase timings of the protein–ligand complex and the ligand alone in water. Six runs were performed per free energy difference for the protein–ligand complex (3 replicas × 2 directions) plus additional six for the solvated ligand. All runs for the solvated ligand were performed on c5.2xl instances. As the 12 independent parts ran in parallel, the time-to-solution is given by the runtime of the longest individual part. That would usually be the protein–ligand complex, but in rare cases the ligand in water would have a longer runtime, as it ran on comparatively slow c5.2xl instances. Prices for AWS instances in the US East (N. Virginia) region as of May 2021 were used.

Setup of Globally Distributed Compute Resources

The allocation of cloud-based compute resources (e.g., via ParallelCluster or AWS Batch[67]) is normally confined to a specific geographic region (there are currently 26 in AWS). Whereas stacks of small to medium jobs can be conveniently executed using just a single region, a global setup is better suited when a substantial amount of core hours is needed: The pool of available instances is much larger for all regions combined compared to just a single region. This allows, for example, to start more instances at the same time or to pick only the subset of instances with the highest performance-to-price ratio. To benefit from global compute resources, we used AWS HyperBatch as a means to provide a single entry point for jobs scheduled to AWS Batch queues across regions. The technical setup used for the binding affinity study is sketched in Figure . For easier overview, the main compute setup is shown in the middle, whereas input and output of data is gathered in the left, blue column and monitoring functionality about the status of jobs and instances in the right, green column. In a nutshell, AWS HyperBatch provides cross-regional serverless job scheduling and resource orchestration using DynamoDB, Lambda functions, Step Functions, AWS Kinesis Data Streams, the Simple Queue Service (SQS), and the Amazon API Gateway.[57]

Figure 1

HyperBatch-based setup distributes all 19 872 GROMACS jobs globally. An illustrative lifetime of a job follows the steps ①···⑧ and is described in section of the text.

HyperBatch-based setup distributes all 19 872 GROMACS jobs globally. An illustrative lifetime of a job follows the steps ①···⑧ and is described in section of the text. For the binding affinity ensemble, we used Spot instances because they are much cheaper than on-demand. The downside of a Spot instance is that it can be terminated at any time, which can happen if the pool of free Spot instances shrinks over time and more on-demand capacity is requested in a region. To minimize instance termination, we requested a number of instances in each region proportional to the Spot pool size of that region. We introduced additional flexibility by requesting instances with all possible vCPU counts and fitting several jobs on them. A single 96 vCPU c5.24xl instance could then, for example, end up running one 48 vCPU job plus six 8 vCPU jobs at the same time. To better understand the whole setup, let us look at the encircled digits (red) in Figure and follow the lifetime of one of the 19 872 jobs from the binding affinity ensemble. ① We submit an example job from a Cloud9 terminal to the HyperBatch entry point. The job definition file specifies how many vCPUs to allocate, whether to request a GPU, and which subfolders to use in the S3 input and output buckets for job I/O. HyperBatch distributes jobs across regions according to region weights reflecting the compute capacity of the regions, e.g., using the weights (6, 6, 3, 1, 1, 4) for the regions (us-east-1,us-east-2,us-west-2,ap-southeast-1,ap-northeast-2,eu-west-1) for GPU jobs. Our example job gets distributed to eu-west-1 (blue, ②), where it is relayed to a Batch instance ③ with sufficient free resources (vCPUs, GPUs). The instance loads the correct Docker image from AWS Elastic Container Registry (ECR) with preinstalled software for the current architecture ④, e.g., pmx and the GROMACS executable with the SIMD level matching the CPU (see Figure for the definition of the Docker file). The actual simulations are handled by the Perl script shown in Figures and 4. This script is designed to deal with sudden interrupts that are possible with Spot instances. Accordingly, output data and checkpoints are saved in regular intervals to S3 storage. To start a simulation, step ⑤ loads the input files from S3 (line 13 in Figure ). Step ⑥ loads potentially existing output data from S3 (line 18 in the listing); this is the case when the job was running earlier already but was interrupted before it finished. Depending on whether prior output data is present, the job is either continued or started from scratch. Generally, the MD job consists of two parts: (i) the production of an equilibrium trajectory (lines 25–56 in the listing) and (ii) the 80 individual transformations (lines 67–86). Part i is executed in eight chunks (lines 28 and 29) so that upon instance termination only a small part needs to be recomputed, as ⑦ each chunk’s data is transferred to S3. If an instance terminates during one of the 80 transformations, the job is continued from the start of that transformation, as a completed transformation ⑦ is immediately saved to S3. At last, pmx integrates and saves the work values that are later used for free energy estimation (lines 88–95). Instance termination ⑧ at any time will trigger a Lambda function that resubmits the job again to HyperBatch. The current state of each job can be checked in a DynamoDB table (Figure , right). Additional configuration using Amazon Elasticsearch allows globally monitoring the whole simulation ensemble in a Kibana[68] dashboard that shows the total number of running instances, the instance types by region, and more.

Figure 2

Example of a Docker file for a GPU image. From the Docker files multiple Docker container images are compiled (one for each architecture) that are loaded from the Amazon ECR by the instances.

Figure 3

Perl script used to launch each of the 19 872 jobs (first part).

Figure 4

Perl script used to launch each of the 19 872 jobs (cont’d).

Example of a Docker file for a GPU image. From the Docker files multiple Docker container images are compiled (one for each architecture) that are loaded from the Amazon ECR by the instances. Perl script used to launch each of the 19 872 jobs (first part). Perl script used to launch each of the 19 872 jobs (cont’d).

Results and Discussion

We present our results in four parts: (i) performance, scaling, and cost efficiency in terms of performance-to-price (P/P) ratios for the standard MD input systems such as MEM and RIB on CPU and GPU instances; (ii) a cost comparison of cloud computing versus buying and operating an own cluster; (iii) as a prerequisite for the binding affinity study, the results of the free energy benchmarks (SHP-2, c-Met, and HIF-2α) on various instance types, including the resulting performance-to-price ratios; (iv) the performance and the costs of the binding affinity studies on global cloud resources.

Which Instances Are Optimal for GROMACS?

Tables –9 show the benchmark results for various instance types. For CPU instances, Table lists MEM and RIB performances in gray and blue colors, and the resulting P/P ratios from greens over yellows to reds, corresponding to high, medium, and low cost efficiency. Table shows the same for instances with up to 8 GPUs. As the mapping of colors to values depends on the smallest and largest observed values, it differs between MEM and RIB, but is the same across all tables. As a result, greens will always refer to good choices in terms of P/P ratio. For several of the instances, various ranks × threads decompositions are listed; “PME ranks” indicates if and how many MPI ranks were set aside for PME.

Table 6

GROMACS 2020 Performance on Individual Instances with GPUsa

As in Table , but on instances with up to eight GPUs. PME long-range interactions were offloaded to a GPU in all cases, except *, where they were evaluated on the CPU.

Performance on Individual Instances with CPUs

The highest performances were measured on hpc6a.48xl, c6i.32xl, and m6i.32xl instances, which is expected as with 96–128 vCPUs they offer the largest number of cores (see also Table ). Performance-wise, they are followed by 96 vCPU c5.24xl and m5n.24xl instances. In terms of cost-efficiency, hpc6a instances are the clear favorites among the CPU instances.

Performance on Single Instances with GPUs

From the GPU instances (Table ), the g5 with 8–32 vCPUs reach or even surpass the performance of the hpc6a.48xl CPU instance, albeit with a significantly (1.25–2.4×) better cost-efficiency. In fact, the single-GPU g4dn’s with 4–16 vCPUs and g5’s with 4–32 vCPUs exhibit the best cost-efficiency of all instances for the MEM and RIB benchmarks. Perhaps unsurprisingly, the highest single-instance performances of this whole study have been measured on instances with eight GPUs. With the exception of the (comparatively cheap) quadruple-GPU g4dn.12xl instances, however, the P/P ratio plunges when distributing a simulation across multiple GPUs on an instance. In those cases, GROMACS uses both domain decomposition via MPI ranks as well as OpenMP parallelization, with added overheads of both approaches. Additionally, as the PME long-range contribution can not (yet) be distributed to multiple GPUs, it is offloaded to a single GPU, while the other GPUs share the remaining calculations of the nonbonded interactions. All imbalances in computational load between the GPUs or between the CPU and GPU part translate into a loss in efficiency and thus in a reduced cost-efficiency. For single-GPU simulations, GROMACS has a performance sweet spot. Here, domain decomposition is usually not needed nor invoked, and all nonbonded interactions including PME can be offloaded to a single GPU, leading to considerably less imbalance than in the multi-GPU scenario. To use instances with N GPUs more efficiently, one can run N simulations simultaneously on them via GROMACS’ built-in -multidir functionality, thus essentially gaining the efficiency of the single-GPU case. This is demonstrated in Table for the p4d.24xl and the g4dn.12xl instances. The p4d.24xl line in the table shows the results for parallelizing a single simulation across the whole instance, whereas p4d.24xl/8 shows what happens when eight simulations run concurrently. Here, the produced amount of trajectory and thus also the cost-efficiency, is about four times as high. For the g4dn.12xl/4 vs g4dn.12xl instance, running four concurrent simulations instead of one simulation translates into about a factor of 2 higher cost-efficiency. As in Table , but on instances with up to eight GPUs. PME long-range interactions were offloaded to a GPU in all cases, except *, where they were evaluated on the CPU.

Scaling Across Multiple Instances

For selected instance types, we also determined how much performance can be gained on multiple instances. For this we have selected instance types that (i) exhibit above average P/P ratios for the single-instance benchmarks and (ii) have a network speed of at least 50 Gigabit/s. Tables , 8, and 9 summarize the results for scaling across 1–32 CPU and GPU instances. For the 81 k atom MEM system, the maximal performance is reached on 4 hpc6a instances, however at a parallel efficiency of less than 40%, whereas for the g4dn’s, the highest performance is recorded on individual instances. In contrast, the large RIB system shows a decent scaling behavior. On hpc6a, the single-instance performance of 10.8 ns/d can be increased to about 60 ns/d at a parallel efficiency of 69% on eight instances. On 32 instances, with 115 ns/d, the single-instance performance is increased 11-fold. Whereas the RIB system continues to scale beyond 8 hpc6a instances, the g4dn’s never reach 30 ns/d. The difference in scaling efficiency between CPU and GPU instances is mainly determined by the network speed for the internode communication. As the hpc6a instances have a much better interconnect than g4dn (see Table ), the scaling is more efficient for the CPU nodes. The hpc6a instances, however, never reach the scaling performance of an on-premises dedicated HPC cluster. There, as shown in Figure 7 of ref (43), the same benchmark systems exhibit peak performances of 303 ns/d for MEM and 204 ns/d for RIB. GROMACS 2020 performances for MEM and RIB over multiple hpc6a instances. The third column lists the optimal decomposition into MPI ranks and OpenMP threads, and the fourth column lists the optimal number of separate PME ranks; the left entry is for MEM, the right entry for RIB if they differ. As in Table , but for g4dn.8xl instances with hyperthreading off. As in Table , but for g4dn.16xl instances with hyperthreading off. Figure summarizes all benchmark results and interrelates them to uncover which instances are optimal in terms of both performance and cost-efficiency. The symbols show benchmark performances (at optimal parallelization settings) on various instances as a function of the on-demand hourly instance costs. The inclined gray lines are isolines of equal P/P ratio with the most cost-efficient configurations toward the upper left. Moving from one isoline to the neighboring one toward the top left improves the P/P ratio by a factor of 2. Symbols connected by a line denote the strong scaling behavior across multiple identical instances, with a single instance at the left end of the curve, followed by 2, 4, 8, and so on, instances. A scaling curve that runs parallel to the cost-efficiency isolines would indicate optimal scaling, i.e., a parallel efficiency of E = 1. Figure allows a series of observations. (i) In terms of cost-efficiency, the optimal instances for GROMACS are the single-GPU g4dn’s with 4, 8, and 16 vCPUs (green symbols toward the left) and g5’s (brown symbols) whose P/P ratio is at least a factor of 2 higher than most of the other instance types. (ii) Perhaps unsurprisingly, the highest MEM and RIB performances on individual instances are reached with p3 and p4d instances hosting eight GPUs connected via PCI Express (red and purple symbols). (iii) For larger systems (RIB and PEP), the highest absolute performances are reached by scaling across multiple c6i.32xl or hpc6a.48xl instances, with the hpc6a’s showing by far the best cost-efficiency. (iv) The performance of small systems like MEM cannot be significantly improved by scaling across many instances. (v) Choosing one of the many possible instances for an MD project essentially boils down to pinning down a point along the connecting line between best cost-efficiency and highest performance, trading off HTC and HPC computing. Let us follow this special line for the example of the RIB benchmark. It starts at optimal cost-efficiency with the single-GPU g5.2xl instances (left, brown stars). For higher performances, one would pick g5.4xl and then g5.8xl instances, however at the cost of losing 10%–25% in P/P ratio. For higher performances (again, at reduced cost-efficiency), the scientist would then continue with scaling over hpc6a instances (blue) which exhibit the best P/P ratios toward growing performances. There is generally no reason to pick instances within the area below the described line as here one simply gets lower GROMACS performance for the same price. For example, for the price of a g3.4xl instance (violet, bottom left), one could instead choose a g5.xl or g5.2xl that exhibits two times the RIB performance.

Figure 5

Performance, costs, and cost-efficiency for GROMACS simulations on various AWS instance types. GROMACS 2020 performance as a function of the on-demand instance costs ($/h) for the MEM (circles), RIB (stars), and PEP (triangles) benchmark on CPU (open symbols) and GPU instances (filled symbols). Separate symbols indicate single-instances; connected symbols show the parallel scaling across multiple instances.

Cost Comparison: Cloud vs On-Premises Cluster

Whether it is more cost-efficient to run simulations on a cloud-based cluster depends of course almost completely on the specific use case, i.e., how big the cluster will be, what software will run on it, and whether there are enough jobs at all times to keep the cluster busy as opposed to bursts of compute demand with idle time in between. Therefore, no generalizable results or guidance can be provided here. We do think, however, that rough estimates of respective costs and comparison to a typical local compute cluster at a research institution will provide useful information and guidelines in particular for new groups in the field who need to set up computer resources. To this aim, we will estimate and compare the total costs of producing one microsecond of trajectory for the RIB benchmark with GROMACS. The hardware for an own cluster can be aggressively tuned toward cost-efficiency for simulations with GROMACS. Combining inexpensive processors with consumer GPUs yields the best performance-to-price ratios.[41] For instance, 1 U (rack unit) nodes with an Intel E5-2630v4 processor plus an NVIDIA GeForce RTX 2080 GPU were offered for under 2000 € net at the time, including three years of warranty. Investment costs for the racks, cooling system, and infrastructure needed to operate the cluster are estimated to about 500 € per U of rack space over the lifetime of the racks. For a lifetime of 5 years, that adds 100 € per U per year. For technical staff to operate, repair, and look after a 500 node cluster, we assume 100 000 € per year, which adds 200 € to the operating costs for each node per year. A suitable room (60–100 m2 for about 500 U of hardware with appropriate infrastructure and the possibility to install heavy apparatus) adds about 30 000 € to the yearly costs (60 € per node), depending on the location. For cluster management software we assume 40 € per node per year. Bar A of the top panel of Figure shows a breakdown of the total costs for our optimized consumer GPU node. Bar D illustrates how the costs grow when using the same hardware as in bar A, but now with a professional GPU (e.g., an NVIDIA Quadro P6000) instead of a consumer GPU (which leads to considerably higher fixed costs) and in a larger chassis that takes 4 U rack space (which lead to significantly increased recurring costs for room and rack space). Thus, densely packed hardware helps to reduce costs. The lower panel of Figure shows the resulting costs per microsecond of RIB trajectory for the nodes from the upper panel. g5.2xl instances offer both a high absolute performance as well as a good performance-to-price ratio for GROMACS (Figure ), which would therefore be a good pick for production runs. However, on-demand g5.2xl instances would yield RIB trajectory costs as high as 3200 €/μs. To reduce costs, one would reserve an instance for one or three years, and for maximal savings one can pay up front (bars C). g5.2xl Spot instances are nearly as cost-efficient as consumer-GPU nodes tailored toward GROMACS (bars A and B in the lower panel). They are more cost-efficient than buying a node with a professional GPU (bar D). In summary, with careful selection of cloud resources and payment options, there is not much difference in cost today compared to on-premises computing.

Figure 6

Costs and cost-efficiency of a compute node in an owned cluster compared to a cloud instance with similar GROMACS performance over 3 years. Top panel: Violet bars show costs of AWS g5.2xl instances (producing 7.55 ns/d of RIB trajectory), which offer one of the highest performance-to-price ratios for GROMACS (compare Figure ), in individual blocks of one year. Bar A shows the fixed costs for buying a consumer GPU node tailored to GROMACS within the thick black line (broken down into individual hardware components) plus the yearly recurring costs (mainly energy) for three years. This node (E5-2630v4 CPU plus RTX 2080 GPU) produces 5.9 ns/d of RIB trajectory.[41] Bar B shows the average costs using an AWS Spot instance. Bar C shows the costs when reserving the AWS instance and paying up front. Bar D is the same as bar A, but using a 4 U node with a professional GPU (e.g., Quadro P6000). Bars A–D in the lower panel show the resulting RIB trajectory costs for the nodes shown in the top panel, for a service life of three years.

GROMACS Performance for Free Energy Calculations

Turning on FE perturbations reduces the GROMACS performance, because an additional PME grid is evaluated, and because interactions involving perturbed atoms run through kernels that are not as optimized as the standard kernels. How much the performance differs with and without FE depends on how big the fraction of perturbed atoms is and on the parameters chosen for FE. For those reasons we cannot use the MEM and RIB benchmarks to predict the performance of the systems used in our high-throughput ligand screening study. Instead, we carry out new benchmarks for four representative FE systems (Table ) chosen from the whole binding affinity ensemble (Table ). The performances for these systems, which are a small ligand-in-water system (from the c-Met data set) plus three protein–ligand complexes of different size (HIF-2α, c-Met, and SHP-2) are shown in Table for CPU instances for various decompositions into MPI ranks and OpenMP threads. For convenient navigation and the latest updates of the benchmark data we also provide access to this information via web interface: http://pmx.mpibpc.mpg.de/aws.pl. The table shows the general trend of small instances exhibiting higher P/P ratios, but there are no pronounced differences between the architectures. The highest performances are observed on the 96 vCPU Intel instances. Performance (ns/d) and performance-to-price ratios (ns/$) for GROMACS 2020 on various Intel (c5 and m5zn), AMD (c5a), and ARM (c6g) CPU instances. Color coding as in Table . Up to version 2020, with perturbed charges it was not possible to offload the PME grid calculations to the GPU. This has changed from version 2021 on, leading to considerably enhanced performance (more than a factor of 2) on GPU instances in our cases (Figure ). Therefore, we used GROMACS 2021 for all binding affinity simulations. The benchmark results for the four representative FE systems on GPU instances are assembled in Table .

Figure 7

Performance improvements of GROMACS 2021 for free energy calculations on GPUs. For three different MD systems (colors) with free energy perturbation turned on, the bars compare GROMACS 2021 and 2020 performances on a p3.2xl instance. As in Table , but now for GROMACS 2021 using one GPU per simulation. The single-GPU performance on p4d.24xl was derived by running 8 identical benchmarks, each using one GPU and 1/8th of the hardware threads, in a multi-simulation. Whereas the performances of the 32 and 64 vCPU g4dn instances are comparable to or higher than that of the best performing CPU instances (i.e., c6g.12xl for the ligand in water and c5.24xl for the protein–ligand complexes), the smaller g4dn instances with ≤16 vCPUs still offer high performance but at exceptionally high P/P ratios: about two times higher than on CPU instances. On the instances with ≥32 vCPUs it is beneficial for performance to just use half the number of vCPUs for OpenMP threads, as the reduction of values over too many threads can deteriorate performance otherwise. The recently introduced g5 instances even surpass the g4dn performance at a similarly good price–performance ratio. Regarding cost-efficiency, any of the c5, c5a, or c6g instances with ≤8 vCPUs has a high P/P ratio for the small ligand-in-water systems, whereas single-GPU g4dn instances with ≤16 vCPUs or g5’s with 8–32 vCPUs are undefeated for the larger protein–ligand systems.

Costs and Time-to-Solution per FE Calculation

The numbers in Tables and 11 are for the equilibration phase of the FE calculation (see section ). We do not list the benchmark results of the transformation phase separately, but included them in the estimate of the total cost of computing one FE difference, as described in Methods. Figure shows the time-to-solution and the costs per FE difference that result when using different Spot instance types. The figure does not include hpc6a instances because currently hpc6a is available only on-demand. With three replicas and two directions, the total costs for one FE difference is 6× the time needed for the protein–ligand part, plus the (small) costs of the ligand in water.

Figure 8

Costs and time needed to compute one FE difference. Diamonds show the costs to compute one FE difference (using Spot pricing) versus the time-to-solution for various instance types (colors) for the c-Met system. In addition to c-Met, HIF-2α is shown at the lower left end of each colored line, and SHP-2 at the upper right end. The gray square shows costs and timings for a consumer GPU node specifically tuned for GROMACS simulations, as discussed in section and shown in Figure A. Spot instance costs are just about a third of the on-demand costs (not shown), although Spot prices vary slightly among the regions and change over time. We therefore used Spot instances for our binding affinity studies, even though these may be terminated at any time should there be demand for that instance type in the given region. As can be seen in Figure , on CPU instances the time-to-solution generally shrinks with the number of vCPUs (as expected) while the costs grow. Using g5.2xl, g5.4xl, g4dn.xl, g4dn.2xl, or g4dn.4xl GPU instances, any FE calculation is completed within 15 h for less than $20 for all systems (green quadrant). Other single-GPU instances like g4dn.8xl, g5.8xl, and g5.xl are somewhat less cost-efficient but still better than the remaining instance types. The white quadrant on top of the green quadrant accommodates multiple instance types on which an FE value can be computed in less that 15 h, albeit at a markedly higher cost than on g4dn instances.

High-Throughput Ligand Screening in the Global Cloud

Study 1: Focus on Time-to-Solution

Our first screening study consisted of 19 872 Batch jobs to compute 1 656 free energy differences (200 μs of trajectory in total) for the ensemble shown in Table . With this study we evaluate the suitability of cloud computing for large-scale computational drug design scans that have been traditionally performed on on-premises clusters where such a scan would typically take several weeks to complete. As we aimed to minimize the time-to-solution, from all available instance types we selected only instances that would need no more than nine hours for any job. The g4dn.2xl, g4dn.4xl, and g4dn.8xl meet that criterion at the lowest costs. (At the time of our screening studies the g5 GPU instances were not yet available.) However, relying on just three instance types is risky if one wants to minimize time-to-solution. g4dn instances are not very abundant in the AWS cloud, and if they happen to be in high demand at the time of our screening study, we might not get many of them. Therefore, we added other instance types that meet our nine hour criterion but that are almost always available: c5.24xl and c5.18xl as well as the similar c5d.24xl and c5d.18xl. We ran the small systems of ligand in water on eight c5 vCPUs, where they would complete in about five hours at a price of less than $2 and high cost-efficiency (see c-Met ligand column in Table ). To draw from a larger pool of instances we allowed for c5 instances of various size and just requested that they offer at least eight vCPUs (see also Figure ). Larger instances accept multiple jobs until they do not have enough free vCPUs left.

Figure 9

Usage of global compute resources for the first ligand screening study aimed to optimize time-to-solution. Colors show the various instances that were in use globally during the 3 days of the ensemble run. We submitted the first 9936 jobs (the large protein systems) in a first wave on a Monday at around 5 p.m., and the second 9936 jobs (the small ligand systems) were submitted in a second wave 24 h later. Figure shows the number of instances that were in use during our first screening study color-coded by instance type. Figure provides further details of the run. As can be seen from the top and middle panels, we acquired about 140 000 vCPUs within the first 30 min and about 3000 GPUs within the first two hours of the run, distributed globally over six regions.

Figure 10

Usage of global compute resources for the first ligand screening study aimed to optimize time-to-solution. Compute resources (split into regions) allocated for the ensemble run over time. Top, vCPUs; middle, GPU instances; bottom, number of instances. Each wave finished in about 1 day, and we speculate that also the whole 19 872 jobs would have finished within 24 h if submitted simultaneously. As GPU instance availability is essentially independent of the CPU instance availability, the GPU jobs from the first wave (greens in Figure ) can completely overlap with the CPU jobs of the second wave. At the same time, after the peak of the second wave (Tue 17 h–23 h), there should be more than enough c5 Spot capacity to accommodate the CPU jobs of the first wave. Unfortunately, there was a glitch in the initial version of our setup that prevented finished instances to terminate properly. For that reason, the actual costs of the first run summed up to $56 per FE difference, although, when counting productive time only, they reduce to $40 per FE difference. This is in the expected range (see Figure ), given the mix of instances that were in use. The overall costs almost entirely resulted from the hourly charges of the EC2 compute instances, whereas data transfer to and from the S3 buckets accounted for less than 0.5% of the whole costs. We observed 3070 restarts due to Spot instance terminations during this study; that is, restarts comprised about 15% of the overall job number. In the worst case, if an instance terminates just before writing a checkpoint, 1 ns of simulation time is lost. As we simulate 10 ns per job, an upper bound for the loss due to Spot instance terminations is about 1.5% of the total compute time. Spot instances get a signal two minutes before shutdown; thus, to completely avoid loss of compute time in the future, that signal could be used to trigger checkpoint writing in an improved simulation protocol. In addition to the performance and price evaluation, we have also validated correctness of the calculations against the previous computations.[69] We used Student’s t test to compare free energy values calculated in the current work to those reported previously, ensuring that the results showed no statistically significant differences.

Study 2: Focus on Cost-Efficiency

Our second screening study aimed to further reduce the price tag by incorporating only the most cost-efficient instance types for the run. The second study used a slightly different and smaller data set (see Table ) that required 6984 jobs to be run for 582 FE differences, or 70 μs of trajectory in total. The setup was as in the first study; however, we further excluded instances with low cost-efficiency. Most notably, we ran all the protein systems on cost-efficient GPU instances. The acquired resources over time for the second study are shown in Figure .

Figure 11

Usage of global compute resources for the second ligand screening study aimed at optimizing cost-efficiency. Top, allocated instance types over time; bottom, GPU instances allocated in the different regions. The vCPU usage peaked at slightly above 35 000 vCPUs at two hours into the second run (not shown), with on average 500 GPU instances in use over 26 h. About six hours after the submission of the ensemble the small ligand-in-water systems were finished (blue and orange areas in Figure , top). As our benchmarks on c5.2xl estimated a runtime of about five hours per system, we conclude that there were enough c5 Spot instances available to run each of the 3492 ligand jobs concurrently. GPU instances are however running over a time span of about 30 h altogether, as apparently not enough g4dn Spot capacity was available to run all 3492 GPU jobs concurrently. From the lower panel of Figure we see that at the time of submission, there was only g4dn capacity available in four regions, whereas the Ireland (blue) and North Virginia (yellow) regions provided g4dn instances only after several hours into the run. The large differences across regions underline that a multiregion approach is necessary for decent job throughput when limiting oneself to only a few instance types. The resulting costs of our second study were about $16 per FE difference and thus only about a third of what was achieved in the first study and in line with what is expected from the benchmarks on g4dn instances (Figure ). Both high-throughput ligand screening studies illustrate the flexibility of cloud computing for MD-based investigations: AWS can be used to scale up the computations to the extent of a large HPC facility but can also be used in a cost-efficient manner akin to a smaller in-house cluster. When aiming to minimize the time-to-solution, the 19 872 calculation jobs were finished in ∼2 days. This compares well to the timing in the recent report, where the Tier 2 Max Planck Supercomputer Raven (interim version, 480 Intel Xeon Cascade Lake-AP nodes with 96 cores, 192 threads) performed calculations of the same data set in ∼3 days.[69] The cost-efficient usage of the cloud resources allowed reaching the cost of $16 for a free energy calculation. Cost-efficiency could be further optimized by also running the ligand-in-water simulations on the g4dn GPU instances (instead of using c5 instances), which would result in a cost of $14 per free energy difference, although g4dn capacity may then not be sufficient to run all simulations at once. In comparison to a GROMACS optimized in-house cluster of Intel E5-2630v4 10-core nodes and NVIDIA RTX 2080 GPU, this cost would be ∼$8.5, in agreement with the estimates of relative costs for a compute node analyzed in Figure .

Summarizing Discussion

Cloud computing has the potential to transform large-scale simulation projects. To date, computationally intensive projects, when assigned to busy on-premises clusters with limited computing capacity, may need weeks or months to be completed. In the cloud, though, the required processing power can be distributed among numerous compute centers around the globe. With the removal of the bottleneck of limited computing capacity, jobs that are independent of each other can run simultaneously in a high-throughput manner, thus reducing the time-to-solution to the runtime of the longest simulation of the ensemble. Such use cases that require very high peak performance over a short period of time can easily be met by cloud computing, while installing and operating a sufficiently large local cluster would be neither cost-effective nor feasible. For the use case of MD-based high-throughput ligand screening we established a HyperBatch-based workflow that allows completing large-scale projects that would run for weeks on an on-premises cluster within 48 h or less in the cloud. Shortly after submitting 19 872 jobs, we acquired about 140 000 compute cores and 3000 GPUs in multiple regions around the globe. We demonstrated that the costs associated with such projects can be reduced about 9-fold compared to a naïve implementation: A job checkpoint-restart mechanism allowed using Spot instances instead of on-demand instances, which accounts for a 3-fold reduced price. Putting the benchmarked application performance in relation to the instance costs allowed selecting from a huge variety of available instance types the most cost-efficient ones only, thereby reducing the price tag by another factor of 3, albeit at the cost of a longer time-to-solution. Whereas HyperBatch is geared toward speeding up HTC projects, we also investigated HPC strong scaling scenarios with a cloud-based HPC cluster. Cluster installation via ParallelCluster and software installation via Spack provided a straightforward and reproducible setup. Because of the possibility to automatically scale up (and down) the number of cluster nodes depending on whether there are jobs in the queue, there is virtually no waiting time for jobs in the queue. The breadth of readily available hardware that includes several architectures (Intel, AMD, ARM) in various sizes (regarding core count), accelerators like GPUs, and high-performance network if wanted, allows always picking the optimal hardware for the job at hand, in terms of a short time-to-solution or a high cost-efficiency. For GROMACS, we found that g4dn and g5 GPU instances offer the highest performance-to-price ratio, followed by hpc6a CPU instances, whereas instances with the fastest interconnect (c6i.32xl, hpc6a, and c5n.18xl) showed the best parallel scaling on up to 64 instances using 8192 vCPUs altogether for the largest benchmark system. So how did, overall, cloud computing compare to a local cluster for our realistic test cases? For many cases, the extra flexibility offered by the cloud will certainly come at a cost higher than that of a local compute cluster. However, as our study shows, by aggressively tuning both alternatives toward cost-efficiency we are approaching a break even point, and the costs of cloud computing and on-premises computing become similar. In fact, an on-premises consumer-GPU cluster tailored toward GROMACS produces an MD trajectory at about 0.85× the costs of Spot GPU instances with similar performance. We note that this outcome is also due to the fact that very specialized single software, GROMACS, was used; in contrast, if a wider variety of software has to run, the nodes can probably not be tuned that much and therefore will be less cost-efficient for a particular application. Just the use of a professional GPU instead of a consumer GPU will result in trajectory costs significantly higher than what can be achieved on an optimal Spot instance.

Conclusions

Cloud computing has traditionally been much more expensive than an on-premises cluster for the case in which continuous long-term computer performance is required. Here we have shown that this has changed, at least for the specialized, yet highly important application of drug design. We are now at a break even point, where the costs are the same, maintaining the great benefit of cloud computing to offer enormous flexibility and, if required, extremely short production times. We consider this a critical milestone for MD-based high-throughput computational drug design.

Data and Software Availability

The input files for the benchmarks can be downloaded from https://www.mpinat.mpg.de/grubmueller/bench. A guide to build GROMACS on AWS is available here: https://gromacs-on-pcluster.workshop.aws. Free energy calculation benchmarks are available at http://pmx.mpibpc.mpg.de/aws.pl.

41 in total

1. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences.

Authors: G E Crooks
Journal: Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics Date: 1999-09

2. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy.

Authors: Richard A Friesner; Jay L Banks; Robert B Murphy; Thomas A Halgren; Jasna J Klicic; Daniel T Mainz; Matthew P Repasky; Eric H Knoll; Mee Shelley; Jason K Perry; David E Shaw; Perry Francis; Peter S Shenkin
Journal: J Med Chem Date: 2004-03-25 Impact factor: 7.446

3. VMD: visual molecular dynamics.

Authors: W Humphrey; A Dalke; K Schulten
Journal: J Mol Graph Date: 1996-02

4. Google-Accelerated Biomolecular Simulations.

Authors: Kai J Kohlhoff
Journal: Methods Mol Biol Date: 2019

5. Predicting Activity Cliffs with Free-Energy Perturbation.

Authors: Laura Pérez-Benito; Nil Casajuana-Martin; Mireia Jiménez-Rosés; Herman van Vlijmen; Gary Tresadern
Journal: J Chem Theory Comput Date: 2019-03-01 Impact factor: 6.006

6. Large-Scale Assessment of Binding Free Energy Calculations in Active Drug Discovery Projects.

Authors: Christina E M Schindler; Hannah Baumann; Andreas Blum; Dietrich Böse; Hans-Peter Buchstaller; Lars Burgdorf; Daniel Cappel; Eugene Chekler; Paul Czodrowski; Dieter Dorsch; Merveille K I Eguida; Bruce Follows; Thomas Fuchß; Ulrich Grädler; Jakub Gunera; Theresa Johnson; Catherine Jorand Lebrun; Srinivasa Karra; Markus Klein; Tim Knehans; Lisa Koetzner; Mireille Krier; Matthias Leiendecker; Birgitta Leuthner; Liwei Li; Igor Mochalkin; Djordje Musil; Constantin Neagu; Friedrich Rippmann; Kai Schiemann; Robert Schulz; Thomas Steinbrecher; Eva-Maria Tanzer; Andrea Unzue Lopez; Ariele Viacava Follis; Ansgar Wegener; Daniel Kuhn
Journal: J Chem Inf Model Date: 2020-09-03 Impact factor: 4.956

7. Development and Benchmarking of Open Force Field v1.0.0-the Parsley Small-Molecule Force Field.

Authors: Yudong Qiu; Daniel G A Smith; Simon Boothroyd; Hyesu Jang; David F Hahn; Jeffrey Wagner; Caitlin C Bannan; Trevor Gokey; Victoria T Lim; Chaya D Stern; Andrea Rizzi; Bryon Tjanaka; Gary Tresadern; Xavier Lucas; Michael R Shirts; Michael K Gilson; John D Chodera; Christopher I Bayly; David L Mobley; Lee-Ping Wang
Journal: J Chem Theory Comput Date: 2021-09-22 Impact factor: 6.578