Literature DB >> 35935573

Pre-exascale HPC approaches for molecular dynamics simulations. Covid-19 research: A use case.

Miłosz Wieczór1,2, Vito Genna1, Juan Aranda1, Rosa M Badia3, Josep Lluís Gelpí3,4, Vytautas Gapsys5, Bert L de Groot5, Erik Lindahl6,7, Martí Municoy8, Adam Hospital1, Modesto Orozco1,4.   

Abstract

Exascale computing has been a dream for ages and is close to becoming a reality that will impact how molecular simulations are performed, as well as the quantity and quality of the information derived from them. We review how the biomolecular simulations field is anticipating these new architectures, with emphasis on recent work from groups in the BioExcel Center of Excellence for High Performance Computing. We exemplify the power of these simulation strategies with the work done by the HPC simulation community to fight the COVID-19 pandemic. This article is categorized under: Data Science > Computer Algorithms and Programming; Data Science > Databases and Expert Systems; Molecular and Statistical Mechanics > Molecular Dynamics and Monte-Carlo Methods.
© 2022 Wiley Periodicals LLC.


Keywords:  BioExcel; COVID19; exascale; molecular dynamics

Year:  2022        PMID: 35935573      PMCID: PMC9347456          DOI: 10.1002/wcms.1622

Source DB:  PubMed          Journal:  Wiley Interdiscip Rev Comput Mol Sci        ISSN: 1759-0884


NEW COMPUTER ARCHITECTURES

Biomolecular simulations have been linked to computer power since their origin in the 1960s. Thus, access to the Golem computer explained (in part) the leadership of the Weizmann Institute in the development of the first force-field based methods oriented towards the study of biomacromolecules, and a large computer placed at the CECAM headquarters allowed the first molecular dynamics (MD) simulation of a protein by McCammon, Gelin, and Karplus. Since the 1970s, continuous improvements in computers have allowed molecular simulations, and in particular MD, to become established in the structural biology and biophysics worlds. The increase in computer power seemed unstoppable for decades but, unfortunately, chip technology has reached a plateau and we cannot expect much more powerful processors to emerge from current technology. Thus, while waiting for quantum computers, hardware developers have tried to maintain the increase in processing power through three different strategies: (i) building special-purpose processors, (ii) taking advantage of secondary processors specialized in basic numerical operations (such as graphics processing units, GPUs), and (iii) increasing the number of cores per machine. MD-specific computers such as Anton were a major promise for the MD field when they first appeared in 2008, but 13 years later their global impact has been small. On the other hand, GPUs have revolutionized the field, giving small groups access to significant computing power. Yet no single GPU can perform massive MD simulations, which therefore require combining a large number of CPUs (and possibly GPUs). Thus, most massive MD simulations are performed in High-Performance Computing (HPC) environments, especially in gigantic supercomputers created by the aggregation of a massive number of CPUs or CPU/GPU nodes. Two excellent examples of Tier-0 supercomputers are the Japanese Fugaku and the American Summit. RIKEN's Fugaku has 7.6 million general-purpose A64FX 48C cores and provides 0.5 ExaFlops of peak performance, while Summit has a hybrid CPU/GPU architecture based on Power9 processors and NVIDIA Tesla V100 cards, with 2.4 million cores and 0.2 ExaFlops of peak performance. Both are amazing pieces of engineering and are approaching the mythical ExaFlop barrier, but we cannot ignore that Fugaku is only four times faster than the top-of-the-line computer of five years earlier (the Chinese Sunway) and has an electrical consumption (close to 30 MW excluding cooling) that raises doubts about the sustainability of a much bigger machine. In summary, the biosimulation community has largely benefitted from decades of continuous increase in computer power, but we cannot expect much faster processors in the coming years. Most likely, increases in aggregated computer power will come from massively parallel architectures. Our community, and in particular the MD field, should redefine its objectives and tools with realistic expectations of what the new high-performance computers will look like.

TAKING ADVANTAGE OF HIGH PERFORMANCE COMPUTING

The quality of an MD simulation depends mainly on three factors: (i) the accuracy of the Hamiltonian representing molecular interactions, (ii) the similarity between the simulated system and the physical reality, and (iii) the exhaustiveness of the sampling of the conformational space. Most MD simulation engines use the same type of classical Hamiltonian defined in the 1960s by Lifson and coworkers, which has been extensively refined, reaching good agreement with experimental measurements for some macromolecular systems when sufficient convergence is achieved. This means that, in practice, the quality of MD calculations mainly depends on the ability of the simulated model to represent reality, and on the completeness of the sampling of the relevant configurations. As some biologically relevant systems may contain millions of atoms, a tendency in the MD field in the last decade has been to simulate very big systems, which are expected to be more realistic and better suited to supercomputer architectures. Unfortunately, the larger the system, the longer the simulation generally needs to be, and the more expensive the force evaluation becomes. In brief, the reliable treatment of very large systems is limited to a few computational groups with access to massive computer resources, while in everyday practice simulated system sizes range from 10⁴ to 10⁵ atoms. The challenge is how to efficiently use supercomputers on biological problems involving such “small” systems.

Strategies for parallelization of a single calculation

Molecular dynamics simulation approaches are based on the idea of sampling phase space, and as such the efficiency of the method ultimately comes down to the performance of individual simulations. The total throughput of methods relying, for example, on ensembles is multiplied by the speedup due to parallelization of individual simulations, which in turn is multiplied by speedups from hardware acceleration and algorithmic advances. Several modern MD implementations such as Amber, NAMD, Desmond, and GROMACS have been accelerated either for GPU hardware—using the CUDA, OpenCL or SYCL APIs—or for the single-instruction multiple-data (SIMD) units available on modern CPUs. However, both the hardware and software landscapes have become extremely diverse, and while almost all codes can be compiled on arbitrary systems, it cannot be taken for granted that a code is able to make good use of the hardware available on a particular system. For large projects, it has become an important decision to select a suitable combination of HPC hardware along with optimally accelerated simulation software already at the planning stage. GPU accelerators have been a revolution for MD simulations, just as for many other scientific fields, largely thanks to new algorithms enabling parallelization of the force evaluation in each timestep over ~10,000 functional units in each GPU, although this is opaque to the user. On the other end of the spectrum, algorithms to explicitly parallelize simulations by communicating between multiple nodes have evolved to use advanced load balancing and domain decomposition techniques that restrict communication to neighboring nodes and minimize it, for example, by allowing an interaction between atoms present on two specific nodes to be evaluated on a third (“neutral-territory” methods). While benchmarks occasionally use gigantic systems to showcase codes, biomolecular applications like those in COVID-19 research are typically focused on achieving sampling, for example, for a protein, by first reducing the system size as much as possible and then parallelizing the simulation of this given system over as many nodes as possible—so-called “strong scaling”, which is considerably more difficult. The algorithms above have enabled truly impressive strong scaling, where simulations have been able to make use of 10,000 cores or more, resulting in long time scale simulations of individual SARS-CoV-2 proteins, sampling of their binding site dynamics and conformational changes, modeling of SARS-CoV-2 proteins whose structure is not yet known, and multiscale simulations, for example, of the SARS-CoV-2 spike protein. One of the currently most critical bottlenecks is that the particle-mesh Ewald summation algorithms used for long-range electrostatics involve global 3D fast Fourier transforms, where the number of communication messages scales as the square of the number of nodes. Fortunately, there are several promising new algorithms in development based, for example, on multipole expansion methods or multi-level summation. However, since molecular dynamics is inherently an iterative stepwise method, high-end parallelism is subject to hard limits of communication bandwidth and latency—it will not help to add more CPU resources to a simulation waiting for communication. These problems are even more severe when compounded with accelerators.
First, some GPU codes do not even support multi-node parallelism, since the random-access nature of indexing atoms to be communicated is not a great fit for accelerator hardware. Second, multi-node GPU runs introduce another layer of communication complexity and latency, as data must first be sent between GPUs in each node and between CPU and GPU before it can be communicated to other nodes. Third, since the force calculation can be an order of magnitude faster on GPUs, the relative impact of network bandwidth and latency also becomes tenfold higher, and thus dominates at much smaller node counts compared with CPU-only simulations. As an example, for GROMACS the current effective limit to strong scaling is somewhere around 40 atoms per CPU core, while it is typically not meaningful to go below ~3,000 atoms per GPU, and other major codes have qualitatively similar behavior. Notably, this means the relative strong scaling is typically considerably worse for GPU-accelerated simulations, and somewhat paradoxically they might not improve the absolute performance of the simulations in the limit of high-end parallelism, but they make it possible to reach the same performance with order-of-magnitude smaller hardware resources. These freed-up resources can then be utilized much more efficiently to parallelize the solution to the scientific problem through combination with ensembles of many accelerated and parallel simulations.
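These rules of thumb translate directly into planning decisions. Below is a minimal back-of-the-envelope sketch, assuming the ~40 atoms per CPU core and ~3,000 atoms per GPU figures quoted above; real limits depend on the code, force field, and interconnect.

```python
# Rough planning helper based on the strong-scaling rules of thumb quoted
# above for GROMACS-like codes (~40 atoms/core, ~3,000 atoms/GPU). Purely
# illustrative; actual limits depend on code, interconnect, and system.

def max_useful_resources(n_atoms, atoms_per_core=40, atoms_per_gpu=3000):
    """Return a rough upper bound on the CPU cores and GPUs worth assigning
    to a single simulation of n_atoms atoms before communication dominates."""
    return max(1, n_atoms // atoms_per_core), max(1, n_atoms // atoms_per_gpu)

for n_atoms in (100_000, 1_000_000):
    cores, gpus = max_useful_resources(n_atoms)
    print(f"{n_atoms:>9,} atoms: up to ~{cores:,} CPU cores or ~{gpus:,} GPUs")
```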

Ensemble simulations

A direct alternative that circumvents the technical shortcomings of direct parallelism is the use of ensembles obtained from independent or loosely coupled simulations. The use of replicas is now well established and considered part of the “good practices” in the field, especially when one is concerned with simulations of rare events. As an example, in our 2010 study of spontaneous DNA hairpin folding, we were lucky to have found a fast “downhill” folding pathway in our first trajectory (1 μs long). However, to obtain good statistics, we had to collect more than 20 such trajectories, and yet four of them never reached the folded state (after 4 μs), highlighting the variability in the outcomes of individual simulations. In fact, owing to the chaotic nature of MD simulations, even initially correlated replicas rapidly lose the memory of their common initial state and can thus be analyzed independently to obtain statistical bounds on a given MD-derived quantity. Even more interestingly, they can also be combined to create meta-trajectories that provide ensembles much richer than expected from the length of individual replicas. In this spirit, it was shown that the multiple replica approach can be used not only to describe equilibrium properties, but also nonequilibrium transitions, even when their characteristic timescales are much longer than those sampled by the individual replicas. Along the same lines, Pande, Noé, and others have demonstrated that short replicas can be combined into a Markov State Model (MSM) to trace complex kinetic pathways from the transition probabilities between a discrete number of states in the system, even when no single trajectory samples transitions between all states. In recent years, many generalizations of this scheme have been developed, including history-augmented MSMs that go beyond the simple theory of memoryless Markov processes, or the transition-based reweighting analysis method (TRAM) that combines data from multiple ensembles, including explicitly biased ones. Indeed, although an unbiased multiple-replica approach is often the most easily interpretable, the use of biasing schemes usually offers better sampling of the configurational space. This is especially important for the exploration of slow conformational transitions that are unlikely to happen within currently accessible simulation times. An example of this approach is Hamiltonian or temperature replica exchange, where either the Hamiltonian or the kinetic energy is perturbed across replicas to improve sampling efficiency by pushing the systems out of metastable states with long residence times. An equally popular intermediate approach, Replica Exchange with Solute Tempering (REST), combines these two ideas to perturb the effective temperature of a subsystem of interest. In Free Energy Perturbation (FEP) or Thermodynamic Integration (TI), the Hamiltonian is perturbed in discrete steps (FEP, discrete TI) or smoothly (slow growth TI, non-equilibrium TI) from one state to the other, allowing for the calculation of the associated work—which, depending on the Hamiltonian at hand, can be coupled to a conformational transition or an alchemical change. In umbrella sampling or metadynamics, additional terms are added to the energy function to favor (in a guided or unguided manner) the exploration of a selected collective variable, often in conjunction with the replica exchange scheme.
In the string method, trajectories are pinned to points forming a path in a low-dimensional subspace, and the displacements tangent and perpendicular to the path are used to iteratively refine this path, until a lowest-energy pathway is found between two initially defined states. Subtler biasing techniques imply the use of information-based biasing, as in Maxwell-Demon dynamics, where after a certain time only trajectories approaching a given target are extended or cloned, or adaptive approaches (very popular in metadynamics or MSMs), which launch new trajectories from regions of the configurational space that have been undersampled in the original ensemble. As a notable example, the Weighted Ensemble (WE) method provides a rigorous scheme for evolving large batches of short trajectories to obtain kinetic and thermodynamic data without the application of external forces. More recently, machine learning approaches hold promise to enhance sampling efficiency by learning a refined biasing scheme from the collected trajectories, by re-optimizing the explored reaction coordinate on-the-fly, by learning an approximate structural ensemble from which new uncorrelated seeds can be drawn, or by simplifying the structural description of the system without compromising its physical accuracy. Note that while unbiased replica approaches rely on completely independent trajectories, biased multiple replica approaches usually imply a loose coupling between trajectories, as decisions on a given trajectory are made periodically based on the results obtained in the others. The multiple replica approach fits perfectly into massively parallel computers: each simulation is independent (or nearly independent) of the others, and since the CPU time required to compute the forces hardly depends on the specific coordinates, a good processor load balance can be expected. In fact, unbiased multiple replica approaches can even be run on distributed platforms, where volunteers provide access to their personal computers to run independent simulations, which are then integrated to obtain a representation of the system. As the first such distributed setup to achieve ExaFlop performance in 2020, the Folding@home project is a paradigm for this type of initiative; it has advanced the study of many complex biophysical processes and provided excellent results in SARS-CoV-2 research (see below). When choosing a method, one should always consider the availability of resources, the existence of “natural” reaction coordinates, the expected complexity of the free energy landscape, the natural timescale of the process in question, and the desired levels of detail and accuracy in the final model. Beyond that, several desirable properties should characterize a well-scaling ensemble approach fitted to the Exascale era: (i) direct parallelism should ideally be replaced with “trivial” parallelism, meaning that ensembles should be generated by a very large number of short, independent trajectory segments; (ii) the ideal method should be able to keep the number of parallel runs constant to make resource allocation efficient and predictable; (iii) explicit communication between runs should be avoided or minimized, and ideally performed asynchronously, as not all computing units perform equally even within a single machine; (iv) disk space requirements should be minimized, since thousands of individual runs can easily overwhelm even purpose-built filesystems; (v) efficient GPU implementations should be considered alongside more trivially parallelized CPU ones to ensure compatibility across all types of HPC resources; and (vi) popular, out-of-the-box parallelization schemes should be used to ensure interoperability across the increasingly diverse computational cluster architectures.
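To make the ensemble idea concrete, the sketch below shows how many short, independent replicas can be pooled into a simple MSM once their frames have been assigned to discrete states. It is a toy illustration in plain NumPy, assuming pre-discretized trajectories; production analyses use dedicated packages (e.g., PyEMMA or deeptime) with reversibility constraints and proper validation.

```python
# Toy illustration of pooling many short replicas into a Markov state model:
# count transitions at a fixed lag time, normalize rows into probabilities,
# and extract the stationary distribution. Assumes trajectories are already
# discretized into integer state labels.
import numpy as np

def build_msm(discrete_trajs, n_states, lag=10):
    counts = np.zeros((n_states, n_states))
    for traj in discrete_trajs:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1          # pool counts over all replicas
    T = (counts + 1e-8) / (counts + 1e-8).sum(axis=1, keepdims=True)
    evals, evecs = np.linalg.eig(T.T)  # left eigenvector with eigenvalue ~1
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return T, pi / pi.sum()

# 50 short "replicas" hopping between 3 states (random toy data).
rng = np.random.default_rng(0)
trajs = [rng.integers(0, 3, size=200) for _ in range(50)]
T, pi = build_msm(trajs, n_states=3)
print("stationary distribution:", np.round(pi, 3))
```

Note that no single toy replica needs to visit all states: only the pooled transition counts matter, which is precisely what makes the scheme attractive for massively parallel resources.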

Non‐equilibrium simulations

Biased replica approaches are typically implemented in the context of equilibrium simulations. For example, in metadynamics, umbrella sampling, or FEP/TI, we expect that the trajectory collected at a given point of the transition (geometrical or alchemical) samples the neighboring points along the reaction coordinate well, so that the reversible work associated with this microscopic change will not change upon extending the trajectory, which often requires long simulations. Similarly, producing a well-converged MSM requires that the transition probabilities between any two states reach local equilibrium, that is, the trajectory samples an equilibrium situation, which requires long aggregated trajectory times. The need to run a limited number of resource-intensive replicas is undesirable for HPC, as it limits the number of processors that can be allocated to a specific scientific problem. A potential solution arises from the use of non-equilibrium methods, which allow us to obtain transition free energies by combining a very large number of “irreversible works” derived from short non-equilibrium simulations launched from equilibrated start and end states. Non-equilibrium methods have a solid foundation in the Jarzynski equality and its later extension, the Crooks fluctuation theorem. They provide access to equilibrium properties, for example, free energy differences, and are efficient in an HPC environment, as they allow for a distributed use of processors working independently. Within the BioExcel consortium, de Groot and coworkers have pushed the limits of these non-equilibrium techniques in the context of alchemical perturbation, developing a full methodology that goes from defining an optimum pathway for the transition to collecting converged free energy results for the alchemical transition. This methodology is proving very successful in exploring relative binding free energies of related drugs or the impact of mutations on protein interactions, and has been extensively used in the context of COVID-19 research (see below). Note that the technique can also be adapted to biasing schemes related to geometrical rather than Hamiltonian changes, even though initial attempts exposed a number of specific challenges that need to be addressed if a general workflow is to be established.
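As a minimal numerical sketch of this idea (not the pmx implementation itself), the snippet below recovers a free energy difference from batches of forward and reverse work values using the Jarzynski estimator and the Bennett acceptance ratio (BAR) form of the Crooks relation. Work values are assumed to be expressed in kT units, and production tools add proper convergence checks and error analysis.

```python
# Minimal sketch of estimating a free energy difference from many short
# non-equilibrium transitions. Work values are in kT units; forward = A->B,
# reverse = B->A. Illustration only, not the pmx implementation.
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit  # numerically stable 1/(1+exp(-x))

def jarzynski(w_forward):
    """Jarzynski estimator: dG = -ln < exp(-W) > over forward works."""
    return -np.log(np.mean(np.exp(-np.asarray(w_forward))))

def bar(w_forward, w_reverse):
    """Bennett acceptance ratio: solve the self-consistent equation that
    balances forward and reverse work distributions (Crooks theorem)."""
    wf, wr = np.asarray(w_forward), np.asarray(w_reverse)
    M = np.log(len(wf) / len(wr))

    def residual(dG):
        return np.sum(expit(dG - M - wf)) - np.sum(expit(M - wr - dG))

    return brentq(residual, -100.0, 100.0)

# Toy Gaussian work distributions consistent with Crooks for a true dG of 2 kT
# and a dissipation of 2 kT.
rng = np.random.default_rng(1)
wf = rng.normal(4.0, 2.0, size=5000)   # forward works: mean = dG + dissipation
wr = rng.normal(0.0, 2.0, size=5000)   # reverse works: mean = -dG + dissipation
print(f"Jarzynski: {jarzynski(wf):.2f} kT   BAR: {bar(wf, wr):.2f} kT   (true: 2 kT)")
```

Because each work value comes from an independent short simulation, the estimate improves simply by adding more transitions, which is exactly the property that makes these estimators a good match for massively parallel hardware.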

Multi‐simulation projects and workflows

As already discussed, a typical research project cannot be solved by running a single simulation, but requires combining many of them. For example, to determine the structural role of sequence variants on a short loop of a protein, at least 20ⁿ (with n equal to the length of the loop) independent simulations need to be run. The Folding@home strategy is well suited for this type of problem, as it involves completely independent and asynchronous simulations. There are, however, other cases that require combining different types of simulations that must be arranged using flexible workflows. For example, before running a free energy calculation on the impact of polymorphisms on the binding of a large series of drugs to a given target, many calculations need to be done: 3D models and topologies of the drugs need to be generated, force-field parameters have to be determined, and equilibrated structures need to be obtained. In parallel, 3D models of the target protein and its mutants have to be built, optimized, thermalized and equilibrated through MD simulations of both apo and holo forms of the protein. Finally, all these processes must be repeated for hundreds of drugs and sequence variants to reach biologically relevant conclusions, and afterwards some analytics over all the results may be needed. In this context, the cost of running ultra-efficient free energy calculations might have little impact on the global project, as the limiting step can be the manual setup of chains of calculations. Thus, in order to be efficient, manual interventions need to be reduced to a minimum by using suitable applications that automate all the preparation steps. Furthermore, the execution of the different programs has to be coordinated by a workflow manager to guarantee no lag time between calculations. Workflows have been very popular in bioinformatics, but their adoption in the biomolecular simulation field has been slower, mainly due to the powerful computational resources needed and the large amounts of data generated. Recent advances in information technology, including faster network bandwidths, optimized and distributed file systems, and pre-exascale HPC supercomputers, have opened the door to biomolecular simulation execution pipelines. The combination of automated workflows with new optimized codes and HPC supercomputers allows an efficient usage of the available resources and the implementation of large studies through task parallelization. We anticipate that workflows will be at the core of biosimulation projects in the near future. Execution of workflows is typically controlled by Workflow Management Systems (WFMS): software that regulates the execution of a defined sequence of tasks, arranges them in the most efficient way, manages intermediate data, and monitors the whole execution. The list of available WFMS is huge and still grows every year (see the list of workflow managers compiled and regularly updated by the Common Workflow Language (CWL) team at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems). Some of these tools also allow for workflow setup by means of Graphical User Interfaces (GUIs). Very popular in the bioinformatics field, these drag-and-drop interfaces (KNIME, Pipeline-Pilot, Taverna, Galaxy, Unipro UGENE, Kepler) allow for very easy workflow development based on interconnected nodes (building blocks).
The chemoinformatics and MD fields have only recently developed GUI-based WFMS, with some implementations available in Galaxy, KNIME, Kepler, and Pipeline-Pilot, the most popular being integrated into commercial software. Although most of these WFMS can be used to launch jobs on HPC clusters through integrated remote execution functionalities, they were not designed to run massive HPC jobs. For this reason, the biomolecular simulation field should rely on a different type of WFMS, adapted to work with HPC infrastructure and eventually reaching the Exascale regime. The BioExcel CoE brought together developers of some of the most important HPC WFMS around the world in a dedicated workshop (2018), with the goal of identifying specific software gaps and opportunities for improved workflow practices. Representatives of the most popular WFMS in the field were present at the workshop: Parsl, AdaptiveMD, PyCOMPSs, BioSimSpace, LongBow, FireWorks, Swift, ExTASY, and RADICAL. Although sharing many of the most important features (automation, data control flow, task management, dependency graphs), each WFMS has its own design goal, and adopting one over another requires a thorough analysis of the needs of the particular research project against the WFMS features. Some of the differentiating elements that were identified are the possibility to run adaptive algorithms (the ability to change behavior at runtime), fault tolerance (the ability to distinguish between critical and non-critical task failures), flexibility (the ability to extend or reduce the number of resources used at runtime), and provenance (the ability to store input metadata for reproducibility purposes). Some of these tools have been designed to cover very specific applications, whereas others are able to handle broader, more generic workflows. Finally, the installation and configuration overhead is a critical point that should be carefully considered before choosing a WFMS. HPC-focused WFMS have demonstrated their power in biomolecular research projects, and molecular simulation pipelines are now regularly used to gain insights into the binding free energy and the residence time of ligands in drug development studies. They paved the way for progress on the classical protein-folding problem before AlphaFold2 appeared, and have been extensively used in recent studies helping the fight against the SARS-CoV-2 virus (see below). However, despite their obvious impact, HPC workflows are still not widely adopted by the biosimulation community, mainly due to the difficulty of integrating them with the most common simulation tools. One step towards such integration is the BioExcel Building Blocks (BioBB) library: a collection of portable wrappers on top of common biomolecular simulation tools. The library is designed to (i) increase the interoperability between the tools wrapped, (ii) facilitate the implementation of biomolecular simulation workflows, and (iii) increase the reusability and reproducibility of the generated workflows. The library is being developed following the FAIR principles and best practices for research software development. The result is a collection of building block modules, classified according to the functionalities offered (e.g., MD, analysis, chemistry). The BioBB library is compatible with different WFMS, including both GUI-based (Galaxy, KNIME) and HPC-focused (PyCOMPSs) ones. Installation of BioBB workflows controlled by the PyCOMPSs WFMS can be easily done using BioConda packages.
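To give a flavor of the building-block interface, the snippet below shows the uniform call pattern of a BioBB wrapper (explicit input/output file paths plus a properties dictionary), here for a residue mutation step. The module path and property names follow the biobb_model documentation as we recall it and should be checked against the current release; the file names are hypothetical.

```python
# Illustrative BioBB call: every building block takes explicit input/output
# file paths plus a 'properties' dictionary, which makes the blocks easy to
# chain inside any WFMS. Verify module path and property names against the
# current BioBB documentation; file names here are hypothetical.
from biobb_model.model.mutate import mutate

mutate(
    input_pdb_path="ace2_wildtype.pdb",        # hypothetical starting structure
    output_pdb_path="ace2_K31R.pdb",           # hypothetical mutated output
    properties={"mutation_list": "A:Lys31Arg"},
)
```

Because every block exposes the same file-in/file-out signature, the same step can be dropped unchanged into a Galaxy node, a KNIME node, or a PyCOMPSs task.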
The community has no doubt that biomolecular simulation workflows will have a clear impact on the future grand challenges in science, and that they will be essential tools for the efficient exploitation of the future Exascale supercomputer generation. However, important aspects linked to pipeline execution still need further research and discussion to reach a community agreement. Data provenance (the historical record of the data), workflow metadata (important information about the input, output and intermediate data), reproducibility (the possibility to generate the same results using a particular pipeline), reusability (the possibility to use a particular pipeline on different resources), and FAIR computational workflows are all crucial points whose treatment is still under debate.

THE PROBLEM OF DATA

As computer power increases, the management of data is becoming the Achilles heel of molecular simulation. As described in a recent review, 1 μs of a medium-sized protein system (10⁵ atoms including solvent) generates around 5 PB of data, and even if only 1‰ of the coordinates are stored, a single trajectory still requires 5 TB of disk space. The common practice in the field has been to store trajectories locally and delete them after some period, which means that very valuable (and expensive to obtain) information is lost. FAIR (Findable, Accessible, Interoperable and Reusable) requirements for scientific data call for the storage of plain trajectories and associated simulation metadata. This generates a formidable problem that requires at least: (i) efficient metadata to specify the purpose and conditions of the simulation, (ii) efficient ways to store the raw data, (iii) efficient transfer mechanisms, and (iv) flexible analysis portals including virtual research environments (VRE) for remote programmatic access to the data. The BioExcel CoE and the ELIXIR European Initiative have defined a prototype of EDAM ontologies for the metadata of MD simulations. This should be extended and refined to be adopted by already existing or new simulation databases. Some recent databases such as BigNASim already implement ontology-based metadata that helps classify trajectories, thereby favoring meta-analysis and the detection of artefactual simulations. Defining what should be stored from a trajectory is a major decision, which is often guided by traditions maintained since times when the throughput of HPC systems was a fraction of the current one. Significant work has been done by communities such as ABC to define the maximum acceptable frequency for storing data and the amount of solvent that needs to be maintained for reanalysis. In parallel, software developers are creating compression approaches and importing methods used to compress other types of massive data. Again, the community should reach a consensus on the type of compression strategy and the amount of data to be stored, otherwise the FAIR requirements will not be fulfilled. Despite clever approaches to improve transfer efficiency (GridFTP, Aspera, and others), current network limitations severely hamper the transfer of raw trajectories. Streaming methods that allow basic remote analysis before downloading a trajectory will be useful to avoid unnecessary downloads. However, it seems that the most sensible approach is to reduce the movement of MD files, with big data centers placed physically close to the computing power that generates the data, mimicking the current situation in other fields such as bioinformatics or genomics. The first MD databases appeared around 10–15 years ago and aimed to cover all representative structures in the Protein Data Bank: Dynameomics, MoDEL, and Dynasome are examples of this family of databases. More recent efforts have focused on specific families of proteins of special importance, such as GPCRmd, which includes trajectories of G-protein coupled receptors, MemProtMD, which contains simulations of 3,500 membrane proteins in realistic membrane environments, or TMB-iBIOMES, which includes hundreds of nucleosome trajectories.
A few databases focus on small molecules with potential biomedical or biomolecular impact; examples are Cyclo-lib, which is focused on cyclodextrins, and BCE, which contains classical trajectories and quantum structural data on small drugs. Finally, BigNASim is, to our knowledge, the only database focused on nucleic acid simulations, and contains hundreds of simulations including the latest ones obtained by the ABC consortium. Recent repositories and associated databases developed by the BioExcel CoE to support COVID-19 research are discussed below. Most of these databases are coupled to flexible graphical web interfaces allowing access to key analyses of the trajectories. However, to be really useful, these user interfaces should be flexible and interactive, something made possible by recent database architectures. For example, in the BioExcel-CV19 web server, an associated REST API allows the user to extract slices of the trajectory as well as fragments of the molecule, which are then shown by the NGL visualizer. Using these tools, the user can focus their analysis on configurations fulfilling certain criteria, or perform meta-analyses combining parts of different trajectories. The next step towards gaining interactivity for analysis tools coupled to databases is probably the generalization of virtual research environments (VRE), infrastructures that allow users to deploy and share their software to perform customized analyses on databases. These infrastructures make it possible to perform specific analyses on the computers hosting the databases, downloading only the analysis results instead of the original trajectories. Entering the Exascale era, we must not forget the Exabyte problem that will be created by the future generation of ExaFlop computers. Otherwise, massive amounts of resources will be wasted without providing useful information for the community.
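The storage figures quoted at the start of this section can be reproduced in a couple of lines under one plausible set of assumptions (every 1-fs step kept, double precision, positions and velocities, no compression); different but equally reasonable assumptions shift the result by small factors.

```python
# Back-of-the-envelope reconstruction of the ~5 PB / ~5 TB estimate quoted
# above for 1 microsecond of a 1e5-atom system, assuming every 1-fs step is
# stored in double precision with positions and velocities, uncompressed.
n_atoms = 1e5
bytes_per_atom_per_frame = 8 * 3 * 2       # 8-byte floats, xyz, positions + velocities
n_frames = 1e-6 / 1e-15                    # 1 us of 1-fs steps = 1e9 frames
total = n_atoms * bytes_per_atom_per_frame * n_frames
print(f"full trajectory : {total / 1e15:.1f} PB")          # ~4.8 PB
print(f"1 permille kept : {total * 1e-3 / 1e12:.1f} TB")   # ~4.8 TB
```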

COVID‐19: A USE CASE FOR HPC‐MD SIMULATIONS

SARS-CoV-2, an enveloped, positive-strand RNA coronavirus, was first identified in the city of Wuhan in December 2019 and rapidly spread worldwide due to a high inter-human transmission rate and a substantial percentage of asymptomatic infections. As of December 2021, almost 300 million cases and more than 5 million deaths have been registered worldwide, with dramatic implications for the global economy and human well-being. Since its emergence in late 2019, both the SARS-CoV-2 virus and the host response—starting at the main receptor, the angiotensin-converting enzyme 2 (ACE2)—have been extensively studied to understand the disease etiopathogenesis, the structure of the functional viral proteins, and the viral-host interaction pattern, so as to guide a rapid development of both direct-acting antiviral and host-directed therapeutic agents. In this context, the theoreticians' community, assisted by the constantly increasing number of X-ray and high-resolution cryogenic electron microscopy (Cryo-EM) structures available, reacted promptly, and a massive worldwide computational endeavor has been undertaken to unravel the molecular details of the virus' functioning at a multiscale level, from force-field based all-atom simulations to hybrid quantum-mechanics/molecular-mechanics approaches. Here, we summarize selected studies that have provided key information on the functioning of the virus that could only be obtained via HPC simulations.

RNA‐dependent RNA polymerase

The RNA-dependent RNA polymerase (RdRp) is the core protein of the replication-transcription complex, the machinery in charge of the transcription and replication of SARS-CoV-2's viral genome and subgenomic mRNAs. The RdRp holoenzyme is composed of four subunits: one core nsp12 and three cofactors, two nsp8s and one nsp7. In addition to RdRp, other important proteins are involved in the replication complex: a helicase (nsp13), an exonuclease that proofreads the newly synthesized RNA (nsp14), mRNA capping enzymes (nsp14, nsp16), and a specific endoribonuclease (nsp15). Despite the great effort invested in vaccinating the global population, the emergence of immune escape variants has partially decreased the protection conferred by the vaccines. Moreover, the rapid spread of variants and their possible resistance to current vaccines still make antiviral drugs important means to alleviate the effects of the pandemic and react swiftly to its future outbursts. Thus, due to its central role in the viral life cycle, the RdRp has been an attractive target for antiviral drugs. Among them, Remdesivir has gained much attention, being the first small-molecule drug approved by the FDA for the treatment of COVID-19, in October 2020 (followed only by a Nirmatrelvir/Ritonavir combination in late 2021). Remdesivir is an adenine nucleotide analog that acts by inhibiting RdRp through a delayed chain termination mechanism. Other inhibition strategies have also been deployed, targeting protein partners involved in the replication machinery or hacking the virus' replication through lethal mutagenesis. Although RdRp has been primarily investigated using experimental techniques, theoretical computations have helped understand different aspects of its inner workings. In order to better understand SARS-CoV-2's RdRp, Shaw and co-workers made use of cryo-EM, RNA-protein cross-linking and unbiased molecular dynamics simulations to unveil RdRp's backtracking mechanism. In this study, three structures in which the backtracked RNA was accommodated by the nucleotide-triphosphate (NTP) entry channel were resolved, a mechanism also found in cellular RNA polymerases. Starting from the cryo-EM structure that featured two nsp8 cofactors, a nsp7 cofactor, two nsp13 helicase cofactors, an RNA template and a product RNA strand with one mismatched nucleotide at its 3′ end, three independent 5-μs unbiased molecular dynamics simulations were carried out. Although the backtracking process was not captured in the simulations, these trajectories revealed the spontaneous fraying of the 3′ terminal mismatched nucleotide into the NTP entry channel. This positioning would disfavor RNA translocation and thus the elongation process. Numerous experimental and computational studies have focused on understanding Remdesivir's inhibitory mechanism inside RdRp. However, Remdesivir's mode of action is still debated. A theoretical study investigated the possible mechanism of action at the molecular level with a total of 34 μs of simulation. Absolute free energy calculations were performed to obtain binding free energies of Remdesivir and two other nucleotide analogues inside RdRp, followed by independent short molecular dynamics simulations (100 ns) for each of the systems where Remdesivir had been incorporated in the nascent RNA strand. Their results support inhibition via delayed chain termination or, alternatively, via destabilization of the protein complex.
In another recent study, equilibrium MD totaling 9 μs of simulation, as well as non-equilibrium free energy calculations, were employed to obtain the free energies of binding of ATP and Remdesivir-TP (RTP) to the human and viral RNA polymerases. The molecular mechanism of action of Remdesivir was also addressed. The study showed that while SARS-CoV-2 RdRp displays a binding preference for RTP over ATP, human RNA Pol II displays the opposite trend. Moreover, ~100 ns of hybrid QM/MM MD simulations making use of the string method were performed to decipher the molecular mechanism and reaction free energy profiles of nucleotide activation and incorporation inside the RdRp. These results indicate that RdRp uses a self-activated mechanism to achieve nucleotide incorporation, as observed in other polymerases. This highly processive polymerase was found to follow a two-metal-ion mechanism of nucleotide incorporation. While Remdesivir is observed to be incorporated more slowly than its natural counterpart ATP, this difference is too small to account for its biological activity. Finally, free energy calculations unveil a stabilizing effect that occurs as Remdesivir advances along RdRp's exit channel during elongation, which would stall RNA polymerization and explain its antiviral inhibitory effect.

Main protease

Proteolytic cleavage of SARS-CoV-2's virally expressed polyproteins pp1a and pp1ab into the functional proteins is carried out by two viral proteases: the main protease Mpro (nsp5), which cuts 11 sites, and the papain-like protease PLpro (nsp3), which cuts three sites. This is a crucial step that enables the formation of the replication machinery in charge of transcription and replication of the viral genome inside the host cell. In addition, viral proteases can act on cellular proteins to help the virus escape the immune response. Thus, many efforts have been directed towards the inhibition of viral proteases as an effective antiviral strategy. Moreover, due to their dissimilarity to human proteases, Mpro and PLpro are promising drug targets. Mpro and PLpro are cysteine proteases composed of three and four domains, respectively. While Mpro is active as a dimer, PLpro acts as a monomer, and as in other cysteine proteases, their active sites are formed by a Cys-His catalytic dyad. Joint crystallographic and molecular dynamics studies have analyzed the effects and mode of binding/action of substrates or promising inhibitors inside the Mpro/PLpro enzymes. The work by Hummer and co-workers studied SARS-CoV-2 PLpro by means of biochemical, structural and functional analysis. Through multi-μs MD simulations they analyzed the interaction between PLpro and two different protein substrates. In addition, 1-μs-long MD simulations were obtained for both SARS-CoV and SARS-CoV-2 PLpro in complex with GRL-0617, a non-covalent inhibitor of SARS-CoV PLpro that was shown to also act against SARS-CoV-2 PLpro. Kovalevsky et al. crystallized Mpro at room temperature and performed 1 μs molecular dynamics simulations to analyze the flexibility of some of the protein's regions. Experimental and theoretical approaches were also combined to study Mpro inhibition by myricetin and its derivatives. Two independent ~1.5-μs-long Gaussian accelerated MD simulations explained myricetin's dynamic interactions inside Mpro's active site. In addition to the drug-design efforts, quantum mechanics (QM) calculations were performed on a small active-site QM-cluster model, including implicit solvation. Reaction pathways and transition states were characterized for myricetin and two other derivatives, explaining the formation of a covalent bond. The calculated activation free energies were obtained through thermal corrections. Alternative approaches combining MD meta-trajectories, docking experiments and machine learning strategies have been used to explore millions to billions of chemical compounds as potential binders of the protease. These massive efforts, which fit into the paradigm of distributed computing, allowed the community to develop chemical moieties that could then undergo lead optimization protocols. Additionally, fully theoretical studies have helped understand features not accessible to experiments. Atomistic simulations unveiled the binding preferences of the promising inhibitor ebselen to Mpro through 6 μs of cumulative MD calculations. In another computational study, important insights into structure–function relationships of Mpro were obtained through continuous constant-pH MD simulations coupled to replica-exchange enhanced sampling. The study revealed the impact of the protonation of His172, which causes a conformational change of Mpro resulting in its deactivation. Based on their results, the authors provide a set of guidelines to design new inhibitors that could bind to key Mpro residues.
Tuñón and co-workers made use of extensive QM/MM MD simulations to characterize the free energy profiles of the reaction mechanism of Mpro with a peptidic substrate. Exploration of the reaction free energy landscape was performed at the DFT(B3LYP)/MM level making use of the string and umbrella sampling methods, with a total of 2.4 ns of cumulative sampling. They propose an Mpro reaction mechanism consisting of three steps with four transition states, with deacylation as the rate-limiting step of the overall reaction. The study unveils the most important residues involved in catalysis, which can help guide drug design.

Spike: The crucial protein for infection

One of the most widely investigated SARS-CoV-2 proteins is the viral Spike protein, a transmembrane homotrimeric class-I fusion glycoprotein that exists in a metastable prefusion architecture and comprises two functional subunits, one for binding to the host cell receptor ACE2 (S1 subunit) and one for fusion of the viral and host cell membranes (S2 subunit). The Spike protein of SARS-CoV is cleaved by a host-cell protease, the transmembrane protease/serine subfamily member 2 (TMPRSS2), an alveolar cell serine protease preferentially expressed on epithelial cells of the respiratory tract. TMPRSS2-mediated cleavage of the SARS-CoV Spike protein is preparatory to host ACE2 binding, membrane fusion and cell entry, a highly concerted and regulated stepwise mechanism. Particular attention has been paid to the highly antigenic S1-based receptor-binding domains (RBDs), which, by undergoing substantial hinge-like conformational rearrangements, transiently expose (up conformation) or hide (closed conformation) the interface area required for ACE2 receptor binding and the subsequent shedding of S1, refolding of S2 for membrane fusion, and virus internalization. Many aspects of Spike have been subjected to massive simulations. Among them, glycosylation of both the viral protein and the host receptor has been quite a debated topic among theoretical chemists and biochemists. Indeed, it is known that the extracellular portion of ACE2 contains at least seven N-glycosylated sites (i.e., N53, N90, N103, N322, N432, N546, N690) and several O-glycosylation sites (e.g., T730), with the latter shown to be crucial in the process of protein–protein recognition and binding. To shed light on the role of glycans in mediating virus internalization and spread, Hummer and coworkers have recently performed extensive multi-replica MD simulations of the fully glycosylated human ACE2 receptor, both bound to the RBD of SARS-CoV-2 Spike and in its unbound state. Their results have shown that glycosylation at position N90 in human ACE2 structurally hampers virus binding. This explains, at the atomistic level, the experimental data on the augmented susceptibility to SARS-CoV-2 infection when N90 glycosylation is lost. On the other hand, the N322 glycan aids infectivity by hiding a key cryptic epitope known to be the target of the CR3022 neutralizing antibody. In their work, Hummer and collaborators used a classical multi-replica approach covering the 11 different glycosylated sites of ACE2. For each of the 11 setups, they carried out three independent MD simulation runs [(1 × 1 μs and 2 × 480 ns) × 11], for a total of ~22 μs of trajectory for a system of almost 1 million atoms. The trajectories thus obtained were analyzed to dissect protein–protein interactions starting from the cryo-EM native contacts, which were identified according to the approach of Best et al. Glycosylation was also intensively analyzed by Amaro and co-workers, who built a full-length model of the glycosylated SARS-CoV-2 S protein, in both the open and closed states, and performed multiple microsecond-long, all-atom molecular dynamics simulations. Their efforts produced key all-atom structural and dynamical insights into the roles of glycans and their implications for (i) the protein geometry, (ii) the global plasticity of Spike, and (iii) the shielding role against antibody epitope recognition.
In general, the numerous glycans found on the whole Spike protein play diverse roles and contribute differently to the global and local conformational rearrangements in a time-dependent manner. Indeed, the authors observed that the N165 and N234 glycosylation sites play a major role in shifting the equilibrium ensemble from the closed to the open RBD conformation. The simulated models pinpoint the N165 and N234 positions as major modulators of RBD conformational plasticity, with the N234 glycan promoting RBD exposure and its concurrent stabilization. To provide further support for their findings, they performed additional sets of extensive MD simulations on the N165A/N234A mutant of the SARS-CoV-2 Spike protein, which corroborated the importance of these glycosylated positions in the infective dynamics of the protein. This result was also confirmed by biolayer interferometry experiments. From the technical point of view, this work required an enormous computational effort, as the full-length structures were embedded into an equilibrated all-atom membrane bilayer with a composition mimicking the one where the virus buds. Subsequently, explicit water molecules and ions were added, resulting in two final systems of ∼1.7 million atoms each, for which multiple replicas were simulated, collecting more than 15 μs of overall sampling time. Massively parallel computers were used for this study. Spike conformational transitions and the functional characterization of emerging SARS-CoV-2 variants have been additional questions targeted by the scientific community. As anticipated above, the trimeric Spike protein can adopt different macrostates, largely determined by the combination of the individual RBD arrangements. Since RBD exposure (i.e., the closed-to-open transition) is a prerequisite for ACE2 binding and virus internalization, and considering that the vast majority of evolution-selected mutations fall within the RBD (contributing to the emergence of more infective viral strains), dissecting the RBD's opening/closing mechanism and energetics at the molecular level is of utmost importance. In this regard, S. Gnanakaran and co-workers have characterized the flexibility and conformational transitions of the Spike protein in a work where MD simulations using an HPC resource provided a unique and detailed perspective that other standard biophysical approaches would not have produced. In particular, they analyzed the effect of the D614G mutation on the Spike protein, discovering that the mutation heavily affects the interaction pattern of the RBD, especially when it adopts the open-state conformation. Briefly, they found three contributions to the structural rearrangements associated with the D614G mutation: (i) variation in the inter-protomer contacts, (ii) alteration in the correlations between the single RBDs, and (iii) variation in the specific inter-protomer hydrogen bond pattern. As a consequence, the infection-capable one-up conformation of the RBD is favored when the D614G mutation occurs. Based on these results, the authors suggest that these three effects can lead to an increase in infectivity through an increase in the one-up population of the G-form. In this work, mainly unbiased MD simulations were performed, with five replicas for each of the systems analyzed, generating overall ~20 μs of trajectories for ~1 million atoms simulated in explicit solvent.
In a similar vein, Ray, Le and Andricioaei studied the long-range allosteric communication within the Spike, using an enforced closed-to-open transition to study dynamical correlations between individual residues. By calculating correlation scores, mutual information-based cross-correlation metrics and connectivity graph networks from a set of over 40 short trajectories, they traced allosteric pathways governing the global motions of the RBD to identify residues that, upon mutation, would most likely disrupt interdomain communication. Their analysis was able to recover the established effect of the D614G substitution on Spike opening, as well as provide a plausible explanation for the biological effects of other observed mutations such as A570D or P681H. However, the authors note that their model does not yet have predictive power, that is, it does not yet allow new mutations to be scored by their potential threat to human health. Nevertheless, the effect of natural mutations of SARS-CoV-2 on Spike structure, conformation, and antigenicity has recently been a leading topic due to the rapidly growing number of virus variants with increased transmissibility, higher disease severity, resistance to neutralizing antibodies elicited by current vaccines or by previous infection, reduced efficacy of treatments, or failure of diagnostic methods. For this reason, these variants have been named Variants of Concern (VoCs). Gobeil et al. investigated the Spike VoC involved in transmission between minks and humans, as well as the B.1.1.7 (alpha variant), B.1.351 (beta variant), and P.1 (gamma variant) Spike variants. All variants showed a markedly increased ACE2 binding affinity accompanied by a higher propensity of the RBDs to stably adopt the functionally active up states. While adaptation to mink resulted in Spike destabilization, the B.1.1.7 (UK variant) Spike balanced stabilizing and destabilizing mutations. A local destabilizing effect of the RBD E484K mutation was implicated in the resistance of the B.1.1.28/P.1 (Brazil variant) and B.1.351 (South Africa variant) lineages to neutralizing antibodies. The authors also revealed allosteric effects of mutations and mechanistic differences that may govern zoonotic jumps or escape from antibody neutralization. Finally, extensive Adaptive Sampling Molecular Dynamics simulations (ASMD) were performed by De Fabritiis and co-workers to explore under-sampled regions of the conformational space of Spike, hence overcoming energetic barriers separating free energy wells. In ASMD, simulations are launched in sequential batches called epochs, each utilizing knowledge of the conformational space obtained from all previous epochs. To generate an ensemble of RBD tip conformations required to start the adaptive sampling routine, the authors performed 100 distinct 50-ns-long simulations in the NVT ensemble with randomized initial velocities for each of the WT and mutant systems. Each iteration consisted of 50 to 100 short independent simulations, conformations from which were selected to further sample the most under-sampled regions of conformational space. Time-lagged independent component analysis (TICA) was performed to identify slowly changing molecular order parameters, and subsequently Markov state models were built. A total of 29 adaptive iterations were performed, yielding total simulation times of 274.8 and 256.8 μs for the WT and mutant systems.
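The respawning logic at the heart of such adaptive schemes can be illustrated with a short sketch: after each epoch, the visited microstates are counted and the next batch of simulations is seeded from the least explored ones. This is only a schematic illustration of the control flow; the actual protocol described above selects states using MSM statistics in a TICA-reduced space.

```python
# Schematic sketch of the adaptive-sampling respawning step: count how often
# each discretized conformational state was visited in previous epochs and
# seed the next batch of short simulations from the least sampled states.
# Real protocols use MSM-based criteria in a TICA-reduced space.
import numpy as np

def pick_seeds(discrete_trajs, n_states, n_new_sims, rng):
    visits = np.zeros(n_states, dtype=int)
    for traj in discrete_trajs:
        visits += np.bincount(traj, minlength=n_states)
    rare_states = np.argsort(visits)[:n_new_sims]      # least visited first
    seeds = []
    for state in rare_states:
        # Pick one stored frame belonging to this state as a restart point.
        frames = [(t, f) for t, traj in enumerate(discrete_trajs)
                  for f in np.flatnonzero(traj == state)]
        if frames:
            seeds.append(frames[rng.integers(len(frames))])
    return seeds  # (trajectory index, frame index) pairs to restart from

rng = np.random.default_rng(2)
epoch = [rng.integers(0, 20, size=500) for _ in range(10)]  # toy discretized epoch
print(pick_seeds(epoch, n_states=20, n_new_sims=5, rng=rng))
```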
At the extreme of scaling out, the Folding@home distributed computing project involved 1 million people in creating a distributed exascale computer with the aim of simulating ~0.1 s of dynamics of the entire viral proteome. The brilliant idea of Gregory R. Bowman and collaborators permitted the observation of a dramatic opening of the apo Spike complex, far beyond that seen experimentally, explaining and predicting the existence of “cryptic” epitopes. The project has also highlighted large conformational changes across the proteome, which reveal over 50 “cryptic” pockets that expand targeting options for the design of antivirals, thus paving the way towards dozens of alternative drug discovery projects. Important observations were also made on other aspects of Spike dynamics. Thanks to the extensive sampling, the authors successfully captured this rare opening event for both the glycosylated and unglycosylated protein and found that glycosylation only slightly increases the population of the open state, the effect being smaller than that produced by genetic variation in the protein. The authors also found that opening occurs only for a single RBD at a time and that the scale of Spike opening is often substantially larger than has been observed in experimental snapshots in the absence of binding partners. An enormous number of computing nodes worldwide contributed, and the authors measured an impressive Folding@home peak performance of 1.01 exaflops. This was achieved at a point when ~280,000 GPUs and 4.8 million CPU cores were performing simulations simultaneously; a performance 5-fold greater than the peak performance of the world's fastest traditional supercomputer at that time, Summit. All the simulations were performed with the GROMACS package, and the immense computational power of peers distributed around the globe permitted 36 distinct proteins associated with the SARS-CoV-2 virus to be investigated intensively and with the richest sampling ever obtained. The endeavor has produced (so far) an outstanding total simulated time of ~115 ms (>0.1 s).

BioExcel workflows

The combination of BioExcel Building Blocks and the PyCOMPSs workflow manager illustrates how workflows can help reach the High Throughput (HT) regime on HPC infrastructures, running hundreds of calculations using hundreds of cores and thus exploiting thousands of cores in a single job. The possibility of running these huge executions will help the efficient usage of large supercomputers such as the ones that will become available in the coming years. Workflows built using these tools have been used in some of the COVID-19 related research presented in this review. One example is a pipeline for the calculation of binding free energy differences upon protein residue mutations (Figure 1). The relative changes in binding free energies (ΔΔGs) are computed using a so-called alchemical fast-growth thermodynamic integration (TI) method that does not require a quasi-equilibrium simulation during the alchemical transition. The workflow starts from two independent equilibrium simulations: WT and mutant. These simulations need to sample the end-state ensembles sufficiently, as the free energy accuracy will depend on the sampling convergence. From the generated trajectories, snapshots are selected to start fast (picoseconds-long) transitions driving the system in the forward (WT to mutant) and reverse (mutant to WT) directions. The work values required to perform these transitions are collected, and the Crooks fluctuation theorem is used to calculate the free energy difference between the two states.
FIGURE 1

Left: Schematic representation of the nonequilibrium alchemical free energy calculations (taken from Aldeghi, 2019). Right: Alchemical free energy calculations workflow implemented with BioExcel building blocks.

The whole pipeline was implemented with the BioExcel Building Blocks library, wrapping the GROMACS and pmx biomolecular tools: pmx is used to generate hybrid structures and topologies for the mutated residues, and GROMACS to perform the molecular dynamics simulations. The PyCOMPSs workflow manager automatically distributes and parallelizes the large number of transitions between the wild-type and mutant states of the protein. As all these fast transitions are completely independent calculations, they can in principle all be run at the same time. The flexibility added by PyCOMPSs allows a variable number of HPC cores to be used depending on their availability (see the example in Figure 2). Here, a particular alchemical free energy calculation was performed on a 218,179-atom system, running a total of 1000 short TI transitions on 32 nodes (1536 cores) of the MareNostrum supercomputer in a single job. This particular execution took 4 h, using 100% of the available CPUs during most of that time. The number of nodes/cores used in the pipeline is completely flexible, as PyCOMPSs takes care of distributing the independent simulations across the available cores (see examples of pre-exascale workflows built within the BioExcel CoE, including the one described in this section, in the BioExcel GitHub repository: https://github.com/bioexcel/biobb_hpc_workflows). A minimal sketch of how such independent transitions can be dispatched with PyCOMPSs is shown below.
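To illustrate how the embarrassingly parallel structure of the transitions maps onto PyCOMPSs, the sketch below declares each short transition as an independent task; `run_gromacs_ti` is a hypothetical placeholder standing in for the real biobb/GROMACS invocation and does not reflect the actual workflow code.

```python
# Illustrative sketch (not the published workflow code): each picosecond-long
# transition is declared as an independent PyCOMPSs task, so the runtime can
# schedule all of them concurrently across the nodes allocated to the job.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

def run_gromacs_ti(snapshot_index, direction):
    """Placeholder: run one fast alchemical transition and return its work."""
    return 0.0  # a real implementation would invoke GROMACS via the biobb
                # wrappers and parse the resulting dH/dl output

@task(returns=1)
def run_one_transition(snapshot_index, direction):
    # An independent unit of work: no task depends on another, so hundreds of
    # them can run at the same time on however many cores are available.
    return run_gromacs_ti(snapshot_index, direction)

# For example, 500 forward (WT -> mutant) and 500 reverse (mutant -> WT) runs
futures_f = [run_one_transition(i, "forward") for i in range(500)]
futures_r = [run_one_transition(i, "reverse") for i in range(500)]
work_f = compss_wait_on(futures_f)   # gather forward work values
work_r = compss_wait_on(futures_r)   # gather reverse work values
```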
FIGURE 2

High parallelization reached thanks to the PyCOMPSs workflow manager. Example of a single job using 32 nodes (1536 cores) of the MareNostrum supercomputer at the BSC. Each line shows the CPU usage (in %) of a single MareNostrum node.

These workflows were heavily used to understand the effect of mutations, in both virus and host, on Spike recognition by ACE2, showing, for example, that most human ACE2 polymorphisms have a negligible effect on the binding affinity of the RBD, but also that some rare polymorphisms would protect a small fraction of the population from virus infection (see Figure 3).
FIGURE 3

Impact of human polymorphisms on RBD-hACE2 binding free energy, computed with BioExcel workflows including GROMACS and pmx. Human ACE2 protein mutations with higher frequency in the population are shown. Positive ΔΔG values (in kcal/mol) indicate that a mutation reduces binding affinity.

Similarly, Gapsys and de Groot explored the effect of ACE2 mutations on ACE2-RBD binding using a two-level screening strategy (Figure 4). First, the residues at the ACE2-RBD interface were exhaustively scanned with the Rosetta Flex ddG protocol. This approach provides a good compromise between accuracy and computational efficiency, allowing changes in binding affinity to be predicted for a large number of mutations. In the second step, a selection of mutations based on the predictions from the first step was subjected to the pmx/GROMACS alchemical free energy calculation protocol; these calculations are computationally demanding but offer high prediction accuracy. In addition to estimating changes in ACE2-RBD binding affinity, relative changes in ACE2 stability upon amino acid mutation were evaluated as well. Altogether, this multi-step strategy identifies the residue positions in ACE2, and the mutations at these positions, that would have the largest impact on binding to the viral RBD. The strategy can easily be adjusted to probe RBD mutations and their effect on binding affinity, thus identifying viral variants of higher potency. Furthermore, by replacing ACE2 as the RBD binding partner with an antibody or peptide, such a mutation scan can be used to design novel therapeutics by enhancing their binding affinity to the viral RBD. A toy sketch of this two-level funnel is given below.
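As a purely conceptual illustration of the two-level funnel, the toy sketch below scores all interface mutations with a cheap estimator and forwards only the top candidates to the expensive calculation; `cheap_ddg_estimate`, `alchemical_ddg`, and the residue list are hypothetical placeholders, not the published protocol.

```python
# Toy sketch of the two-level screening funnel (conceptual only): score every
# interface mutation with a cheap estimator, then pass only the most promising
# candidates to the expensive alchemical calculation. The two functions stand
# in for Rosetta Flex ddG and the pmx/GROMACS protocol, respectively.

def cheap_ddg_estimate(mutation):
    return 0.0  # placeholder: fast interface-scan score (kcal/mol)

def alchemical_ddg(mutation):
    return 0.0  # placeholder: accurate MD-based free energy calculation

interface_positions = ["K31", "E35", "D38", "K353"]   # example hACE2 residues
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
mutations = [f"{pos}{aa}" for pos in interface_positions for aa in amino_acids]

# Level 1: exhaustive but cheap scan of all interface mutations
ranked = sorted(mutations, key=cheap_ddg_estimate)

# Level 2: refine only the top candidates with the demanding method
candidates = ranked[:20]
refined = {mutation: alchemical_ddg(mutation) for mutation in candidates}
print(refined)
```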
FIGURE 4

A multi-step strategy to predict the effect of ACE2 mutations on ACE2-RBD binding affinity. In the first step, the ACE2 residues at the protein binding interface are scanned by means of the computationally efficient Rosetta Flex ddG protocol. In the second step, a selection of the mutations is probed by the computationally more demanding MD-based free energy calculations using the pmx and GROMACS software packages. The calculations allow evaluating the effect of mutations on protein binding affinity, as well as on the stability of the ACE2 protein. The strategy can be further adapted to predict the effects of virus mutations or to design high-affinity antibody or peptide therapeutics.


BioExcel COVID‐19 hub: Shaping the future of open data

The entire scientific community reacted to the COVID-19 pandemic by redirecting efforts towards the study of SARS-CoV-2 infection and its mechanisms of action. The biomolecular simulation field has contributed since day one with an incredible number of MD simulations, which in turn produced an enormous amount of data distributed across the different groups generating them. The BioExcel CoE, in collaboration with the Molecular Sciences Software Institute (MolSSI), developed the COVID-19 Molecular Structure and Therapeutics Hub (https://covid.bioexcel.eu/), a community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations related to COVID-19 computational research. The repository was designed from scratch to share data (including MD data) from the scientific community, making them completely open in order to better tackle the COVID-19 global emergency. The Hub has become a reference repository, with huge amounts of useful information gathered in a single portal. Renowned groups in the field (e.g., D. E. Shaw Research, RIKEN, Folding@home) have contributed to the repository, adding up to milliseconds of trajectory data. The stored trajectories cover different structures involved in virus infection and the viral life cycle, such as those presented earlier in this review (Spike, protease, RNA polymerase). Today the Hub is an essential repository for the field, a central point to find and download useful research data. The BioExcel-CV19 database and associated web server (Figure 5) expand the power of the Hub by including interactive graphical representations of the trajectories and of the analyses performed on them. The main objective of the BioExcel-CV19 project is to give scientists interested in COVID-19 research an interactive, graphical way to inspect key structural and dynamic features stemming from the MD simulations. As these features vary depending on the structure analyzed, specific analyses were performed, uploaded to the database, and represented in the web portal; these analyses and key features were collected through direct interaction with the authors of the simulations. As an example, trajectories of the viral RBD-hACE2 complex include interface observables (e.g., residue distances, hydrogen bonds), allowing their behavior along the simulation to be easily analyzed (see Figure 5).
FIGURE 5

Screenshots from the BioExcel‐CV19 web server. (a) Trajectory representation (berzosertib, an FDA approved drug molecule binding to the ectodomain of human ACE2); (b) analysis of the ligand pocket flexibility; (c) Analysis of electrostatic and Van der Waals interactions between the ligand and the ACE2 receptor; (d) analysis of hydrogen bonding between the ligand and the ACE2 receptor; (e) electrostatic potential surface in the ligand‐binding pocket; (f) insight on the main interaction between the ligand and the ACE2 receptor involving an aspartic acid. URL: https://bioexcel‐cv19.bsc.es/#/browse/MCV1900072

All the analyses integrated in the web portal are fully interactive. Whenever possible, a direct link from the analysis to the 3D representation is offered, using the NGL viewer tool. The fast extraction of a particular snapshot from a whole trajectory is made possible by the MongoDB NoSQL database backend powering the server. The complete set of trajectories (3D coordinates for every atom and every frame of the simulation) and analyses is stored in this distributed database and efficiently retrieved on the fly for any web portal request. The entire pipeline is automated, and new COVID-19-related trajectories are currently being processed. The set of analyses will be continuously extended according to the suggestions of the authors of the MD simulations. A minimal illustration of this per-frame retrieval pattern is sketched below.
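As an illustration of the on-the-fly retrieval pattern, the sketch below assumes a hypothetical per-frame document schema in MongoDB; the connection string, collection layout, and field names are assumptions and do not describe the actual BioExcel-CV19 backend.

```python
# Minimal sketch (hypothetical schema, not the actual BioExcel-CV19 backend):
# storing one document per frame lets a single snapshot be fetched without
# reading the whole trajectory, which keeps interactive requests fast.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection
frames = client["trajectory_db"]["frames"]          # hypothetical collection

# An index on (trajectory, frame) makes any snapshot retrievable directly
frames.create_index([("trajectory", 1), ("frame", 1)], unique=True)

def get_snapshot(trajectory_id, frame_number):
    """Return the coordinates of a single frame, fetched on the fly."""
    doc = frames.find_one({"trajectory": trajectory_id, "frame": frame_number})
    return None if doc is None else doc["coordinates"]  # list of [x, y, z]

# Hypothetical usage: frame 5000 of an accession such as the one in Figure 5
snapshot = get_snapshot("MCV1900072", 5000)
```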

THE ROAD AHEAD

MD and, more generally, biomolecular simulation tools are no longer marginal techniques used by a small set of groups to confirm already known facts. Rather, they are used by a huge community to help understand the mechanisms of life, make predictions, and guide experiments. The COVID-19 pandemic has highlighted the incredible power of MD simulations, either alone or combined with experimental techniques, to reveal atomistic details of biologically relevant mechanisms. These research projects have also shown how large the computational requirements of state-of-the-art MD simulations are. It is not only a question of access to massive computer platforms, but also of strategies to use them efficiently. While the next-generation supercomputers will be designed by hardware specialists, and future biological questions will come out of experimental labs, the biosimulation community has to provide a robust interface between computer science and biology. Exascale supercomputing will soon be a reality, while exascale distributed platforms are already here. If we learn from the past, computer scientists will soon dream of YottaFlop computers. The computational science community should move past the futile debate between capacity and capability computing; our community should strive to make the best of whatever new computer technology emerges and to fully exploit the data derived from simulations.

AUTHOR CONTRIBUTIONS

Miłosz Wieczór: Writing – original draft (equal); writing – review and editing (equal). Vito Genna: Writing – original draft (equal); writing – review and editing (equal). Juan Aranda: Writing – original draft (equal); writing – review and editing (equal). Rosa M. Badia: Writing – review and editing (supporting). Josep Lluís Gelpí: Writing – review and editing (supporting). Vytautas Gapsys: Writing – original draft (supporting); writing – review and editing (supporting). Bert de Groot: Conceptualization (supporting); writing – original draft (supporting); writing – review and editing (supporting). Erik Lindahl: Writing – original draft (supporting); writing – review and editing (supporting). Martí Municoy: Formal analysis (supporting). Adam Hospital: Conceptualization (lead); writing – original draft (lead); writing – review and editing (lead). Modesto Orozco: Conceptualization (lead); writing – original draft (lead); writing – review and editing (lead).

RELATED WIREs ARTICLE

Surviving the Deluge of Biosimulation Data