Literature DB >> 33326239

Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19.

A Acharya¹, R Agarwal^2,3, M B Baker⁴, J Baudry⁵, D Bhowmik⁶, S Boehm⁴, K G Byler⁵, S Y Chen⁷, L Coates⁸, C J Cooper^2,3, O Demerdash⁹, I Daidone¹⁰, J D Eblen^2,11, S Ellingson¹², S Forli¹³, J Glaser¹⁴, J C Gumbart¹, J Gunnels¹⁵, O Hernandez⁴, S Irle^6,16,17, D W Kneller⁸, A Kovalevsky⁸, J Larkin¹⁸, T J Lawrence⁹, S LeGrand¹⁸, S-H Liu^2,11, J C Mitchell⁹, G Park⁷, J M Parks^2,3, A Pavlova¹, L Petridis^2,11, D Poole¹⁸, L Pouchard⁷, A Ramanathan¹⁹, D M Rogers¹⁴, D Santos-Martins¹³, A Scheinberg²⁰, A Sedova⁹, Y Shen^2,3, J C Smith^2,11, M D Smith^2,11, C Soto⁷, A Tsaris¹⁴, M Thavappiragasam⁹, A F Tillack¹³, J V Vermaas¹⁴, V Q Vuong^6,16,17, J Yin¹⁴, S Yoo⁷, M Zahran²¹, L Zanetti-Polzi²².

Abstract

We present a supercomputer-driven pipeline for in silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking. Ensemble docking makes use of MD results by docking compound databases into representative protein binding-site conformations, thus taking into account the dynamic properties of the binding sites. We also describe preliminary results obtained for 24 systems involving eight proteins of the proteome of SARS-CoV-2. The MD involves temperature replica exchange enhanced sampling, making use of massively parallel supercomputing to quickly sample the configurational space of protein drug targets. Using the Summit supercomputer at the Oak Ridge National Laboratory, more than 1 ms of enhanced sampling MD can be generated per day. We have ensemble docked repurposing databases to 10 configurations of each of the 24 SARS-CoV-2 systems using AutoDock Vina. Comparison to experiment demonstrates remarkably high hit rates for the top scoring tranches of compounds identified by our ensemble approach. We also demonstrate that, using Autodock-GPU on Summit, it is possible to perform exhaustive docking of one billion compounds in under 24 h. Finally, we discuss preliminary results and planned improvements to the pipeline, including the use of quantum mechanical (QM), machine learning, and artificial intelligence (AI) methods to cluster MD trajectories and rescore docking poses.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2020 PMID： 33326239 PMCID： PMC7754786 DOI： 10.1021/acs.jcim.0c01010

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

The need for rapid time-to-solution in drug discovery has become accentuated by the Covid-19 pandemic. Given the urgency of the pandemic, in parallel with vaccine development, both repurposing established antivirals and discovering new drugs are needed. Among the present candidates, apart from antibody treatments the antiviral remdesivir, a nucleoside analog that acts by interfering with RNA synthesis, it is active against SARS-CoV-2,[1−5] shortening recovery from Covid-19 in hospitalized patients.[6] Other promising candidates, such as dexamethasone, may modulate the host response.[7] When surveying clinical trials for Covid-19, one is struck by the number of trials that are not based on knowledge of the drug interacting with a known target. There are several examples. As one illustration, baloxavir is a specific inhibitor of the cap-snatching endonuclease of influenza virus, which is a member of the PD-(D/E)xK two-metal nuclease superfamily.[8] Coronaviruses have an endonuclease but of a completely different fold (NendoU) and different active site residues.[9] NendoU also oligomerizes into a hexamer. Although it would seem unlikely that baloxavir would bind to NendoU, has nevertheless been in clinical trials. Similarly, lopinavir and ritonavir are also undergoing testing, even though they target proteases of the unrelated HIV.[4] Although in principle, enzyme active sites with similar chemical functions may bind similar ligands, the steric and physicochemical substrates of drug–protein binding are nuanced. The frequent trial of drugs specific for targets known to be absent in SARS-CoV-2 seems to us to be symptomatic of a lack of precision in combating this pandemic. For further information on other ongoing trials on small molecule drugs, biologics, passive immunization with antibodies, and vaccines, the reader is referred to the comprehensive review by Liu et al.[10] The aforementioned challenges, taken together, demonstrate that there exists an unequivocal need for de novo drug discovery campaigns as well as repurposing studies.[11] There are a number of events in the SARS-CoV-2 viral replication cycle[12] to target for antiviral therapeutic development; from viral entry to membrane fusion, travel to the host endoplasmic reticulum where translation of the viral genome occurs, to formation of the viral replication complex and formation from host membranes of double-membrane vesicles (DMVs),[13,14] the passage of the replicon through the Golgi and the release of the virion from the cell. Each of these steps involves key viral proteins and occurs in a different compartment of the host cell. For example, the binding of the virus to the ACE2 receptor involves the receptor-binding domain (RBD) of the virus S (spike) protein, prefusion cleavage involves the binding of host TMPRSS and furin proteases to the S1/S2 dibasic domain[15] formation of the replication complex and the DMVs involves the nonstructural proteins (NSPs), and the N protein is required for packaging of the viral genome into the newly assembled virion.[16] The replication complex is made up of 15 mature NSPs, which are encoded by orf1ab and orf1a genes as the pp1ab and pp1a polyproteins.[17] Currently, many efforts are targeting the main protease, MPro,[18] which is required for cleavage of the large viral polypeptide into its functional proteins, the RNA-dependent RNA polymerase (RdRp)[19] responsible for the production of new viral RNA, and some efforts target prevention of S cleavage.[20] In addition, viral proteins also function to impede the host’s defense mechanisms: both proteases have been shown to inhibit the human immune response by interacting with immune proteins in SARS-CoV.[21] It is, therefore, important to understand regions on these proteins that act as binding sites for both substrates (as in the case of the proteases) and for protein–protein interaction. In previous work, very early in the pandemic, we combined restrained temperature replica-exchange molecular dynamics (restrained T-REMD) simulations with virtual high-throughput screening in a supercomputer-based ensemble docking campaign to identify well-characterized drugs, metabolites, or natural products that bind to either the S-protein: ACE2 receptor interface or the RBD of the S-protein.[22] From this ensemble docking campaign, we provided a ranking of the predicted binding affinities of over 8000 drugs, metabolites, and natural products (and their isomers) with regards to the SARS-CoV-2 S-protein and the S-protein: ACE2 receptor complex. The ranked list has been incorporated into experimental testing using a high throughput screen that was implemented in the SARS-CoV outbreak, and new compounds will be added as discovered. Three of the top compounds, hypericin (a component of St. John’s Wort), imatinib, and quercetin, identified in the initial S-Protein: ACE2 receptor screen are now in clinical trials. Here, we report on an optimized supercomputing pipeline for early stage drug discovery together with results on 24 systems involving 8 SARS-CoV-2 proteins. The computational approach mimics what happens in nature, using “structure-based” drug discovery. Generally speaking, the availability of many experimental protein structures combined with massive increases in computational power and methodological advances have led to a resurgence of computational studies in which trial compounds are docked into binding sites in three-dimensional models of the protein targets and then ranked according to their strength of binding. Computational docking has been particularly useful in early stages of molecular discovery in order to identify initial hits to be prioritized for experimental validation. Early docking studies were performed with static target crystal structures and rigid ligands. These were quite successful in some cases, such as in the discovery of antivirals for HIV and influenza.[23,24] Unfortunately, though, at that time, structures for few targets existed, and the process was relatively inefficient: calculations were relatively inaccurate, and computers could dock only ∼100 compounds in a reasonable time frame. Since the 1990s, the power of supercomputers has increased by a factor of a million or so. Rigid docking of over a billion compounds has been performed in a few days. Thus, virtual high-throughput screening has outperformed equivalent experimental high-throughput screening and has been shown to rapidly identify very tightly binding compounds.[25] Strictly rigid docking does not often take place in protein: ligand interactions, as both ligands and proteins, undergo thermally driven internal motions, which lead to fluctuating binding site conformations.[26] Therefore, a particularly important development has been the recognition that incorporating target flexibility into drug discovery protocols can improve the drug discovery process.[27] Ensemble docking uses different conformations of the protein targets of interest, and combinatorially performs the docking of databases of compounds against each of the protein target conformations. This process models the conformational selection binding mechanism, as opposed to a more limited induced-fit mechanism. The method requires the generation of an ensemble of protein conformations to be used in the docking calculations. Ensemble docking of small probe molecules for flexible pharmacophore modeling was introduced in 1999. It was shown that consensus pharmacophore models, based on multiple MD structures or on multiple crystallographic structures, were more successful than models based on single conformations in yielding successful predictions of binding.[28] In our own laboratories, ensemble docking has produced experimentally validated hits against each of the 16 protein targets presented to us over the past few years. Our groups have increasingly used an ensemble approach to perform docking.[22,29−46] In addition, we have shown that the clustering of protein target MD trajectories usually brings a large improvement in the quality of ensemble docking compared to what is obtained using single structure docking.[47] Ensemble generation using MD and docking both require significant computational power: to perform MD simulations of sufficient duration and to dock large databases of compounds. This combinatorially large computational time requirement essentially limits this approach to high-performance computing (HPC) architectures for large database screenings, even when only a subset of protein conformations is used in docking, for example, following clustering of the target’s MD configurations. HPC involves the use of specialized, large supercomputing systems to perform large calculations that are parallelized over many compute nodes, each consisting of dozens of cores. Traditionally, the use of a high-speed interconnect allows rapid communication between separate compute units and clever parallelization schemes to enable rapid calculations on problems too large to fit on a single compute unit. These schemes have historically involved specialized programs that focus efforts to optimize communication overlap. The use of graphics processing units (GPUs) has helped to accelerate many types of calculations. The Summit supercomputer, housed at the Oak Ridge Leadership Computing Facility (OLCF), is currently the fastest supercomputer in the United States. Summit is an IBM AC922 system consisting of 4608 large nodes, each with six NVIDIA Volta V100 GPUs per node. Each node also contains two POWER9 CPU sockets for a total of 42 cores per node. The GROMACS MD program[48−50] is able to make use of all aspects of the Summit supercomputer’s HPC utilities, including the GPUs and the interconnect, providing for both strong and weak scaling, which dramatically decreases time per MD step and increases the size of the system that can be simulated efficiently. The temperature replica exchange molecular dynamics (T-REMD) routine[51−54] which was chosen here for the MD calculations (see below) uses the interconnect not only to allow for parallelization of a single simulated biomolecule, but also to communicate between separate replicas of the system, each carried out at a different temperature, and performs exchanges between replicas to accelerate the conformational sampling of the biomolecule.[55] Protein–ligand docking has hitherto not been considered a traditional HPC task, as each docking calculation is short and does not require multiple nodes to complete. In fact, many docking programs can run on a single CPU core. However, the number of cores on a supercomputer or cluster can provide a resource to perform many simultaneous docking calculations that greatly decrease the time-to-solution for screening a large data set of ligands. Cloud and distributed computing resources also provide this type of completely parallel solution for high-throughput screening.[56,57] The use of GPUs has recently been made possible for the widely used program AutoDock4[58,59] resulting in the program AutoDock-GPU, which provides up to 50× speedup over AutoDock4 (available at https://github.com/ccsb-scripts/AutoDock-GPU).[60−64] Thus, the use of leadership HPC facilities for ensemble docking can provide the ability to screen billions of ligands to a full set of conformations generated with HPC-based MD simulations. Quantum mechanical refinement of classical docking ranking based on fragment molecular orbital (FMO) techniques also naturally benefits greatly from massively parallel supercomputer capabilities.[65] In this work, we describe our efforts establishing a supercomputing-based pipeline for ensemble docking and preliminary results on its application to discovering therapeutics that target viral proteins of SARS-CoV-2. The pipeline and results presented here represent our contribution to date to the work of the USA HPC Covid-19 Consortium that was created on March 29th, 2020. We describe the choice of eight targets and the preparation of protein models from experimental data. We report on T-REMD simulations performed for the targets totaling about half a millisecond of simulation time. We have docked repurposing databases to ten configurations of each protein simulated using the popular docking program Autodock Vina. We also describe efforts deploying Autodock-GPU[60−62] at scale on Summit that demonstrate the docking of over a billion compounds in 24 h with full structural optimization of the ligand. Future developments involving the use of AI and quantum chemistry in rescoring and clustering are also outlined. The pipeline described here can also be used in future work to target human proteins[66,67] known to interact with viral proteins, or in disease-causing responses in Covid-19 and more generally in computational structure-based drug discovery.

Methods

Computational methods in drug discovery narrow a vast chemical search space to a tractable set of compounds suitable for experimental testing. Experimental work can involve a variety of tests, including live virus testing as well as target engagement studies, and will not be considered further here. Rather, we discuss the procedures of structural modeling, MD simulation, and docking.

Choice of Target Proteins and Generation of Structural Models from Experimental Work

Multiple groups have been using structural biology techniques, including X-ray crystallography, small-angle scattering (SAS), and cryogenic electron microscopy (cryoEM), to investigate the structure of proteins and protein complexes from SARS-CoV-2.[68−71] However, obtaining a structure from the Protein Data Bank (PDB) or perhaps a revised model from another resource is only the starting point. Often structures obtained from the PDB do not fully resolve all residues, and a determination must be made whether and how to model them. Also, as structural models are rapidly being released to aid in the fight against COVID-19, the potential inclusion of a few structural errors is an unfortunate reality. In particular, the identification of metal cations in protein structures requires careful thought and examination of its local coordination environment. Even with perfectly assigned and complete experimental structures, it may not be enough to consider viral protein targets as chemically invariant structures for modeling and binding calculations. Large differences in pH in various cell compartments as the virus travels through the host cell[72−74] can qualitatively change the protein’s structure and function. Differences in pH also affect the protonation states of the proteins and the small molecules being tested as drugs, altering drug binding preferences. Finally, the oligomerization states of the target proteins are important to consider as the interactions between protein monomers may influence the shapes of the active sites. Another important factor to consider when performing in silico screens using ensemble docking is the ability to construct a useful model of a particular protein for MD simulation. For instance, certain metal-containing regions of a protein may not have an existing classical mechanics model (force field parameters), or existing models may be inadequate. In addition, highly charged, disordered, and ion-dependent biomolecules have been known to have less accurate force fields and may perform poorly in an MD simulation.[75−78] Proteins chosen for ensemble docking in this study were those that had a crystal structure available with a reasonable resolution, were amenable to accurate simulation with classical MD force fields, and were also known to be important for viral pathogenicity based on either recent studies or those on SARS-CoV. The 24 systems studied comprise nine protein domains. Two of these, RBD of the S (spike) protein and the N-terminal region of the N (nucleocapsid) protein, are domains in structural proteins found attached/within the virion. The N protein is used for packaging the viral genome and is essential for the assembly of the virion.[79] The remaining seven domains come from nonstructural proteins (NSPs) 3, 5, 9, 10, 15, and 16, which form the replication complex and are involved in a number of key tasks leading to the creation of new virus particles.[80] Two domains come from NSP3,[81] the ADP ribose phosphatase (ADRP, also known as macro- or “X”) domain, and the papain-like protease domain (PLPro).[82] The ADRP seems to be involved in ADP-ribosylation, which is used in cell signaling and thus may act to inhibit the host immune response.[83] PLPro cleaves regions of the polyprotein to release nonstructural proteins and also is involved in the mechanisms the virus uses to counteract the host immune response, for instance by interaction with host immune proteins[80,84] NSP5 is the main protease (MPro), which self-cleaves and also cleaves other regions along the polyprotein, releasing essential proteins to perform their tasks in the assembly of the replication complex, and is also involved in interacting with and preventing the action of host immune proteins.[84,85] The exact function of NSP9 is unclear, but has been found in SARS-CoV to be required for viral replication and has been shown to bind to RNA oligonucleotides,[86] NSP15 is an endoribonuclease specific for uridine whose exact function is also unknown, but has been implicated in interfering with host immune response both through direct interaction and by cleaving viral RNA to prevent detection by the host.[87] NSP16 is thought to be a methyltransferase that requires NSP10 as a cofactor, and acts to disguise viral mRNA from the host immune response by adding a methylation onto the RNA cap which host cells use to mark RNA as belonging to “self” versus “pathogen”.[88−90] The explosion in research and literature fueled by the Covid-19 pandemic, together with the need for searching through related literature on other coronaviruses, has created a challenge for researchers needing to understand the structural details and cellular contexts of the SARS-CoV-2 proteins. To help navigate this challenging landscape, we have been developing new tools based on natural language processing and machine learning for enabling a more robust search for specific questions required for our modeling, simulation, and ligand docking work[91−93] featuring targeted filtering and exploiting external resources (e.g., Wikidata, ChEBI, PubChem) to expand our search capability. For example, after we generate a set of related keywords, the service will screen for the terms referring to a chemical substance and fetch the chemical information (e.g., SMILES string) from the PubChem automatically. In addition, using this keyword search enables the ontologies (e.g., Wikidata, ChEBI) to be used to link related chemicals and their properties for document annotations in query results. The main data resource of the system is a collection of scientific papers, which are collated from major publications. The full-text article access and download from the publishers’ archives are performed under the publishers’ agreements, and the internal article corpus in our system is updated on a weekly basis. To provide a diverse survey of the conformational ensembles of the SARS CoV-2 viral proteome, we performed T-REMD simulations of 24 different model systems listed in Table . Additionally, Supporting Information (SI) Table S1 is also provided, which summarizes the PDB entry simulated, complete protonation state choices (where applicable), and the number of replicas used for the T-REMD.

Table 1

Model Systems Simulated

Protein/System Notes
S (Spike) Protein Receptor Binding Domain (RBD)/“Apo” (PDB:6W41)	S Protein RBD/Complexed with ACE2 (PDB:6W41)	MPro/monomer, CHARMM-GUI default protonation (PDB:6Y2E)
MPro/dimer, CHARMM-GUI default protonation (PDB:6Y2E)	MPro/dimer, “charged” protonation variant (PDB:6WQF)	MPro monomer/HIE41 protonation variant (PDB:6WQF)
MPro dimer/HIE protonation variant (PDB:6WQF)	MPro monomer/HID41 protonation variant (PDB:6WQF)	MPro dimer/HID41 protonation variant (PDB:6WQF)
NSP15 (endoribonuclease)/hexamer (PDB:6VWW)	NSP15 (Endoribonuclease)/monomer (PDB:6VWW)	NSP10:NSP16 Complex (Methyltransferase) (PDB:6W4H)
NSP10/monomer (PDB:6W4H)	NSP16/monomer (PDB:6W4H)	N (nucleocapsid) N-terminus phosphoprotein/monomer (PDB:6M3M)
N (nucleocapsid) N-terminus phosphoprotein/tetramer (PDB:6M3M)	N (nucleocapsid) N-terminus phosphoprotein/tetramer complexed with Zn (PDB:6YVO)	N (nucleocapsid) N-terminus phosphoprotein/monomer alternate crystal structure (PDB:6YVO)
NSP9/monomer (PDB:6W4B)	NSP9/dimer (PDB:6W4B)	NSP3 ADP ribose phosphatase/asymmetric unit (PDB:6W02)
PLPro/monomer “charged” protonation variant (PDB:6W9C)	PLPro/monomer “neutral” variant (PDB:6WRH)	NSP3 ADP ribose phosphatase (PDB:6W02)

Simulation Model Preparation

To engage in the use of MD for the rapid generation of conformational ensembles for drug discovery one requires that the input for MD be generated in a semiautomated fashion by which the atomic coordinates, obtained from experimental or in silico protein structure prediction methods, can be quickly processed into MD input files. To facilitate this semiautomated approach, CHARMM-GUI was used for most model building.[94] The general system building method used here involves the direct retrieval of structures from the PDB and processing to model missing residues, assign protonation states, add disulfide bonds (where noted in the PDB annotation), add glycosylation (where resolved in the crystal structures of the S-protein receptor binding domain and ACE2), neutralize the charge of the system (using Na+ and Cl– ions), and solvate (with TIP3P water). Many proteins have coordinated ions that serve structural roles, such as the Zn2+ cations in Nsp10, or catalytic roles. Thus, the treatment of Zn-complexes in fixed-charged classical MD force-fields is a challenge, and for some systems, it may result in the failure to maintain Zn-protein coordination[95−97] and when found necessary (as noted below and also summarized in SI Table S1) an explicit bond representation was used. All of these considerations mandate an abundance of care when preparing a biologically accurate model for simulation. Below we discuss considerations taken into account when modeling some of the proteins simulated.

S (Spike) Protein (PDB: 6W41)

Presently available structures of the S protein have nine gaps totaling approximately 150 residues, in addition to over 20 and over 100 missing residues at the N- and C-termini, respectively. Current models also lack post-translational modifications, including glycosylation and formation of disulfide bonds. The S protein is heavily glycosylated, with roughly 20% of its mass in glycan chains, yet at most, a few mono or disaccharides are present in the structure. In our preliminary study,[22] we made use of a homology model of the entire spike with restraints applied such that only the human ACE2-Spike interface was unconstrained. Here, using crystal structures of the ACE2-S protein complex, simulations of the receptor-binding domain of the S Protein (Spike) both in complex with the human ACE2 receptor and on its own (referred to within the text as the “Apo” RBD) were performed. The viral spike receptor-binding domain was chosen to provide insight into the details of the initial viral-host recognition process. Glycosylation resolved from crystallographic imaging was used, and an annotated disulfide bond was also included.

Main Protease. (PDB: 6Y2E and 6WQF)

The main protease, MPro, is an attractive target for the development of antiviral drugs. There is compelling evidence that the enzymatically active species is the dimeric assembly of MPro. A dimer is observed in most crystal structures of CoV MPro, as well as in solution at sufficiently high concentrations. In addition, a linear increase in the enzyme activity at increasing concentration suggests catalytic incompetence of the monomer.[98] Therefore, the full dimer was considered in the present MD simulations for MPro, using as starting coordinates the apo-homodimer in the crystal structures 6Y2E and 6WQF.[99,100] The crystal structures show that SARS-CoV-2 MPro, similarly to other MPros,[85,99,101,102] is composed of three domains: Domains I (residues 8–101) and II (residues 102–184) are arranged in an antiparallel β-barrel structure, and domain III (residues 201–303) contains five α-helices arranged in a globular cluster. Domain III is a specific feature of CoV MPro proteins and was suggested to be essential in the proteolytic activity by keeping domain II and the long loop connecting domains II and III (residues 185–200) in the proper orientation, and/or by orienting the N-terminal residues that are essential for the dimerization.[101] Dimerization occurs through interactions between the helical domains of the two monomers and through hydrogen bonding interactions between the N-terminal residues of one monomer and key residues in the other monomer. In particular, the salt bridge between the N-terminal Ser1 of one monomer and Glu166 of the other monomer has been suggested to be essential to maintaining the catalytically competent conformation.[101,103] The substrate-binding site is located in a cleft between domains I and II and contains a highly conserved catalytic dyad formed by Cys145 and His41. Comparison among the two apo crystal structures and the crystal structure obtained in the presence of an inhibitor reveals[85] only minor structural differences in the position of a few side-chains and no relevant changes in the substrate-binding site, except for the rotation of the side chain of Met165, which is in the proximity of His41. Although the catalytic mechanism is not fully understood, there is a general agreement in considering that the proteolytic activity of CoV MPros is initiated by activation of the enzyme through a proton transfer reaction in the catalytic dyad, leading to a charge-separated state with a highly reactive thiolate. It has also been suggested that such a proton transfer reaction is induced by the presence of the native substrate.[104,105] Therefore, in the present MD simulations of the apoenzyme, Cys145, and His41 were simulated in their neutral state, with His41 protonated at Nδ (i.e., HSD). This choice is based on the observation that the His41-Nε appears to be the best candidate as proton acceptor from Cys145 because in the crystal structures the His41-Nε is closer than the His41-Nδ to the Cys145-S and the His41-Nδ is already involved in a hydrogen bond to a highly conserved water molecule, which is considered the third element of the catalytic site. A recent QM/MM study[106] also supports this proton transfer mode and the role of water in catalysis. However, the ε-nitrogen protonation state for His41 (HIE) cannot be decisively ruled out, and MD simulations were performed also considering this alternative, although less probable, protonation state. The protonation states of two additional His residues, namely His163 and His172, were also highlighted as being crucial for the enzymatic activity of CoV MPro. In particular, the doubly protonated (cationic) state of His163 at pH 6.0 was suggested to modulate relevant conformational variations involving Glu166, Phe140, and His172, leading to a catalytically inactive conformation.[101] At higher pH values, both His163 and His172 should be uncharged, and, on the basis of the hydrogen-bonding pattern that can be inferred from the crystal structures, the HSE protonation state was used for both His163 and His172 in the present MD simulations. All other His residues were also simulated in their neutral state, assigning the Nδ or Nε protonation state on the basis of their chemical environment and hydrogen-bonding patterns. The selected protonation states are as follows: HSD64, HSD80, HSE164, HSE246.

PLPro (PDB: 6W9C and 6WRH)

For PLPro, the Zn cation was coordinated to C189, C192, C224, and C226. Similar to MPro, the protonation states of the His residues in PLPro were not readily available. Here we pursued two potential protonation state varients, a charged variant and a neutral variant. For the charged variant the protonation states were obtained with the use of the PropKa 3.0 server assuming pH ∼5, corresponding to its presumed cellular (lysosome) environment,[107] with 6W9C being assigned to pH 5 based on its physiological role in acidic environments. Protonation states for the neutral state were manually assigned using 6WRH as the original coordinate file, with C111S mutation reversed. For both variants, Zn-coordination during temperature-replica exchange was enforced by topological patches applied with CHARMM-compatible tools. TopoGromacs[108] was used to convert the system and associated force field to GROMACS format.

NSP15 (PDB: 6VWW)

Large (His) tags present during the recombinant expression processes to purify NSP15 for crystallization; however, these tags were not removed before crystallization. Prior to simulating the monomeric and hexameric forms, the artifactual His tags were removed from NSP15 using MOE2019 and subjected to a “quick prep” with the prepare protein module of MOE to resolve potential issues in the resulting structure. The resulting truncated PDBs were then uploaded to CHARMM-GUI for neutralization and solvation.

NSP10 (PDB: 6W4H)

For NSP10, both in its monomeric and a complexed form with one Zn cation liganded to C4370, C4373, C4381, and C4383, while the other bound Zn was liganded to C4327, C4330, H4336, and C4343. For 6VYO, H59 and H145 were liganded.

Molecular Mechanics and Molecular Dynamics System Preparation (Force Fields, Counterions, Energy Minimization, and Equilibration)

All simulations were performed using the GROMACS[48,109] software suite, and the CHARMM36m force field,[110] which is the default of choice using the CHARMM-GUI. For all systems, the protein was solvated in water-boxes with edge-distances of 1 nm, and only neutralizing Na+ and Cl– ions were used. Short-range interactions were treated with a smooth force-switch cutoff of 1.2 nm, and long-range electrostatics were treated using the particle-mesh Ewald (PME) formalism, as implemented in GROMACS.[111] To facilitate the use of a 2 fs MD time step, all covalent bonds to hydrogen were restrained with the LINCS algorithm[112] in all simulations. Following system preparation, all solvated models generated were subject to steepest-descent energy minimization with a stopping condition of either reaching the force-convergence criteria of 1000 kJ mol–1 nm–1 or a maximum of 5000 iterations. Energy minimization was performed primarily to remove potential clashes between the solvent, ions, and the protein (or protein complex) of interest. Postclash removal minimization, short (250 ps) NPT relaxation simulations (with default positions restrains generated from CHARMM-GUI) were performed to relax the simulation box dimensions for each replica (at different temperatures) independently (see T-REMD Protocol). For these relaxation simulations, the Berendsen baro/thermostats[113] (as implemented in GROMACS) and an integration time step of 1 fs were used.

T-REMD Protocol

MD simulations provide a means to study the conformational dynamics of proteins. However, frequently MD becomes ‘trapped,’ resulting in the need for many long simulations to effectively sample a protein’s conformational landscape.[114] To overcome this sampling challenge, enhanced sampling techniques can be used. For the present work, temperature replica-exchange molecular dynamics (T-REMD) was employed, whereby multiple copies of a target system are simulated simultaneously with each copy (replica) at a different temperature, with periodic coordinate swapping (performed in such a manner as that preserves detailed balance) between the copies.[52−54] By running at multiple temperatures, with exchanges, the dynamics of the system avoids “kinetic traps” and provides a robust sampling of the protein free-energy landscape, and thus the protein conformational diversity.[115] T-REMD was chosen for several reasons: it guarantees an increase in sampling efficiency over straightforward MD,[116] it does not require the assignment of reaction coordinates (or collective-variables) a priori to accelerate conformational sampling it does not require direct modification to the system Hamiltonian.[117] T-REMD simulations for each system were performed with the GROMACS simulation suite. A limited temperature range of 310 K to ∼350 K was chosen to maintain physiological configuration space. For each protein system, the number of replicas and temperatures for each replica was chosen using the temperature predictor server by Patriksson and van der Spoel[118] with a target exchange probability set at 0.2 though the actual exchange was found to be ∼0.3 for all systems. All simulations were performed for a total of 750 ns per replicate. After relaxation, production T-REMD simulations were performed with a frame saving rate of 10 ps and an integration time step of 2 fs. Production simulations were performed, similar to the relaxation simulations, in the NPT ensemble. Unlike the relaxation simulations, the V-rescale (Bussi) thermostat[119] and the Parrinello–Rahman barostat,[120,121] were used. Regardless of the temperature window, the target pressure for each replicate was set to 1 bar.

Trajectory Analysis

For all systems, the measures of the gyration tensor (from which shape anisotropies are derived), solvent-accessible surface area (SASA), and pairwise simulation frame versus simulation frame RMSD matrices, and RMSD based clusters, were obtained using a combination in-house VMD[122] scripts, NumPy, and SciPy.[123] The RMSD clustering specifically only considered the lowest temperature replica (310 K), and rapidly generated the pairwise RMSD matrices using the QCP algorithm.[124] Clustering was performed using hierarchical clustering with a complete linkage method, as implemented within SciPy. For generality in evaluating structural diversity, clustering was initially performed based on the RMSD of all heavy protein atoms, and where additional diversity of active sites was of interest for subsequent docking, a second round of clustering was performed based on binding site residues and protein–protein interfaces. VMD atom selections for the docking specific clustering are summarized in SI Table S2. Although T-REMD is an efficient simulation method, and the 310 K data do correspond to a formal statistical mechanical ensemble generated at this temperature, as with other enhanced sampling methods the risk is always present that the enhancement of the sampling takes the system to regions of configurational space beyond that that would be significantly sampled by the protein physiologically; for example, to partially or wholly unfolded states. We, therefore, take care to identify these and to not perform docking screens on such configurations.

Docking

Two different docking databases were used. A smaller database of potential ligands was built merging together the content of the SWEETLEADS[125,126] repurposing database SuperDRUG2,[127,128] and the NCI-diversity database,[129] yielding 13 757 unique compounds. This database has been ensemble docked to all systems, as listed in Table , with noted targeted binding sites. This database was docked using local HPC clusters using Autodock Vina.

Table 2

List of Proteins and Binding Sites Used for “Smaller Database” Docking. PPI Refers to a Protein–protein Interfacea

receptor/binding site	receptor/binding site
MPro monomer/catalytic pocket	NSP15 monomer/catalytic pocket
MPro dimer/PPI	NSP15 dimer/PPI
NSP9 dimer/FTMap sites	NSP10 monomer/PPI to NSP16
nucleocapsid phosphoprotein/RNA binding site	NSP16 monomer/PPI to NSP10
nucleocapsid phosphoprotein/PPI	NSP10:NSP16/PPI
nucleocapsid tetramer/FTMap sites	NSP3 ADRP domain (asymmetric unit, dimer)/active site
NSP3 ADRP domain (monomer) active site	NSP9 monomer/PPI

In some cases, FTMap was used to identify potential binding sites (see SI Table S2).

Supercomputing docking runs were performed involving billion-plus compound screens of the Enamine database using an accelerated version of Autodock: Autodock-GPU. To date, these runs have been performed on two crystal structures of MPro. In some cases, FTMap was used to identify potential binding sites (see SI Table S2).

Smaller Database Docking

Data and Protocols

Docking to the target structures obtained from the MD simulations (as listed in Table ) was performed on various HPC clusters using Vina MPI[130] and MOE. Two sets of structures were used in the ensemble docking. In the first series of docking calculations, only the first 100 ns of the T-REMD trajectories were used, and the results of the docking simulations were passed on to collaborators for experimental testing. In the second series of docking, as the MD trajectories were expanded beyond their initial first 100 ns, the clustering was performed on the entire 750 ns trajectories, as described in the results section below. For the VinaMPI[130] calculations, the “Exhaustiveness” parameter was kept at its default value of 10. Databases of potential ligands were built merging together the content of the SWEETLEADS,[125,126] SuperDRUG2,[127,128] and NCI-diversity databases,[129] yielding 13 757 unique compounds. Using the program MOE, compounds with more than 49 rotatable bonds were deleted from the database, and only one stereoisomer was included for each compound. Very low molecular weight (<58) compounds (single atoms, ions, very small functional groups). The resulting database included ∼9K unique molecules. The compounds were protonated at pH 7 and energy-minimized using the MOE software to obtain low energy 3D structures. The compounds were saved on disk in sdf format and then converted to PDBQT format using AutoDock Tools.[131,132] The ligands were docked to 10 clusters per receptor, each cluster corresponding to a different configuration of the binding pocket. The clusters corresponding to the first 100 ns of the MD simulations have been uploaded on the publicly available structure repository https://cmbcovid19.flywheelsites.com/data/ additional data, including the complete trajectories from the 750 ns T-REMD simulations is forthcoming. The residues used to determine the clusters fall into one of three categories: the protein active site, residues at the protein–protein interfaces (for complexes), and all the protein non-hydrogen atoms. Tables and list the receptors and binding sites we have screened so far.

Binding Sites for Docking

In general, we have two classes of potential binding sites: (1) catalytic pocket or substrate-binding site and (2) PPI. The first aims at identifying potential competitive inhibitors of the viral proteins, and the second aims at finding compounds potentially disrupting a viral protein–protein complex. Binding site definition requires manual intervention and cannot be easily automated. Examples of definitions are listed below for three viral proteins. In the main protease dimer (PDB: 6WQF), the docking box contains catalytic sites of chain A and PPI residues. The docking box was constructed to align with the peptide-binding groove on either side of the catalytic dyad of chain A, which extends outward to include the S3, S2, S1, S1′, and S2’ substrate binding pockets. In the NSP10-NSP16 complex (PDB: 6W4H), the S-adenosyl methionine (SAM) binding site Asp6928 in NSP16 was considered.[89] In addition, PPI residues such as Tyr4349, Val4295 to Leu4298 in NSP10, and Gln6885 in NSP16 were included. Tyr4349 and Gln6885 interact with each other in SARS-CoV virus,[89] and Val4295 to Leu4298 are hot spot residues in the SARS-CoV-2 virus computationally predicted using the crystal structure along with the KFC2 method,[133] which is based on a machine learning predictive model (https://kfc.mitchell-lab.org). Hot spot residues are the fraction of PPI residues that account significantly for the overall protein–protein binding affinity, and they are typically determined experimentally using alanine scanning mutagenesis.[134] In the N-terminal domain of nucleocapsid protein tetramer (PDB: 6VYO), three critical RNA-binding residues on the beta-sheet core were included in docking: Arg88, Arg92, and Arg107[71,135].[136]

Billion-Compound Supercomputer Docking with Enamine Real Database

A major aim of this exercise was to see whether it would be possible to dock a billion compounds with full ligand optimization on the OLCF Summit supercomputer in 24 h of wall-clock time. To perform efficient ensemble docking, we modified AutoDock-GPU,[60,62] to enable it to run at peak efficiency on the Summit system. For compatibility, OpenCL kernels were rewritten in CUDA, and file input and output were streamlined to enable it to keep up with the GPU’s speed. These modifications, together with the size of the Summit supercomputer, indeed allow over 1 billion compounds to be docked within 24 h. This capability will enable giga-compound docking for a number of proteins in the viral proteome and beyond. We performed initial docking tests using this framework on NSP15 (NendoU) and the main protease (MPro). For NSP15 we used a 9000 compound data set composed of the SWEETLEADS[125] database with additional ligands, and also a trimmed version of this data set containing only ligands containing less than 11 rotatable bonds, consisting of about 5000 ligands. For tests with MPro, we used a 90 000 ligand subset of the Enamine REAL database.[137] All ligands were prepared with AutoDockTools,[132,138] and the receptor grids were generated with the program autogrid with a grid spacing of 0.375 Å. We tested a set of search box sizes: 40, 25, 20, and 15 Å3, and different settings for the number of runs, nruns, which defines how many separate instances of the genetic algorithm are executed. For the trimmed data set, we also performed docking with AutoDock Vina with exhaustiveness of 10 to compare results. These results provided us with the confidence to dock over 1 billion compounds from the Enamine real database to two different MPro crystal structures, 5R84 and 6WQF,[100] with a search space 25 Å large on each side, centered on the active site. The analysis of this data set is ongoing. Due to the documented inaccuracies of force field-based scoring functions in the task of screening and affinity prediction of compounds,[139] rescoring of at least 1% of the billion compounds is being performed using the accurate, yet highly computationally efficient machine learning-based rescoring method known as RF-Score-3.[140] Also, at least 50% of those compounds rescored with RF-Score-3 will be further filtered using recently developed rescoring described below in Future Directions and Preliminary Results from New Methodologies, subsection Protein–Ligand Rescoring Using Machine Learning.

Sequence Analysis and Mutational Entropy Calculations

We performed an analysis of available sequences of the SAR-CoV-2 virus to look for numbers of mutations and map these locations on the proteins we were using as drug discovery targets. All complete, high-coverage genomes labeled as human host SARS-CoV-2 were downloaded from GISIAID[141,142] on May 5, 2020, yielding a total of 16 252 genomes. Sequences were filtered to remove any genomes with greater than 3% ambiguous (N) nucleotides or were less than 29,000 nucleotides in length, resulting in 14 284 genomes. Multiple sequence alignment of the 14 284 genomes was performed using MAFFT[143] v.7.464 with the --addfragments method using NC_045512.2 (EPI_ISL_402125) as the reference genome and removing insertions relative to the reference. Mature protein-coding sequences for each protein were extracted from the alignment using coordinates from the reference genome and translated using FAST[144] v1.6, with protein sequences containing internal stop codons discarded from further analyses. Shannon entropy[145] was calculated for every column of each protein alignment using a custom script, disregarding ambiguous and gap characters using a custom script. Additionally, the frequency and types of substitutions with respect to the reference were recorded. For visualization of the mutation entropy per residue of the proteins studied in this paper, entropy values were color-coded in protein PDB structures. Known SARS-CoV and SARS-CoV-2 structures were downloaded from the Protein Data Bank, their sequences were aligned with the SARS-CoV-2 reference genome (NC_045512.2) using BLASTP, and the calculated entropy of the sequences was embedded in the PDB file in the place of the B factor column using a custom Python script.

Preliminary QM Refinement Protocol

Along with ML-based approaches, quantum mechanics (QM)-based refinement of classical docking results is being developed here as a tool to narrow down the list of promising inhibitor candidates.[146] Until recently, the inclusion of QM electronic structure in high-throughput drug screening was deemed computationally intractable due to the enormous computational resources required even for density functional theory (DFT) calculations. The poor scaling of most quantum chemical methods further exacerbates the situation. A viable emergent alternative is the recently developed linear-scaling version of an approximate, yet remarkably accurate DFT method called “fragment molecular orbital density-functional tight-binding (FMO–DFTB).[147] This method is implemented in the widely-utilized GAMESS quantum chemistry code.[148] We here report preliminary calculations of FMO–DFTB with the so-called polarizable continuum model (PCM) of the solvent[149] for quantum mechanics-based evaluation of potential COVID-19 spike protein inhibitor drugs identified by re-clustering and re-docking to an extended simulation of the S protein, similar to the initial work by Smith & Smith.[22] For the PCM calculations, the cavity was calculated using simplified united atomic (SUAHF) radii[150] which are available for all the chemical elements contained in all ligand compounds. Because the binding side is widely open, the dielectric constant of water ε = 78.39 was used. In addition, to improve the accuracy in describing non-covalent interactions, the D3 dispersion correction was employed. To obtain the refined binding energy of a given candidate, its unbound geometry, the unbound protein, and its corresponding complex were optimized using FMO–DFTB/PCM. While the unbound ligands were completely optimized, only selective residues in the binding pocket of the unbound protein, and in the protein–ligand complexes were locally optimized. The QM-refined binding energy is defined here as the difference between FMO–DFTB/PCM total energy of the complex and the sum of the total energy of unbound protein plus total energy of unbound ligand. In preliminary work, the QM-refinement was carried out for the Vina top-10 best binders of each spike protein cluster. In total, 15 spike protein clusters were investigated, and the binding energies of 150 protein-ligands complexes were refined.

Results

We present here preliminary results obtained for members of the SARS-CoV-2 proteome. Naturally, ongoing refinements of the results are continually being undertaken, and the results are incomplete. However, they give a snapshot report on the state of delivery of the pipeline. At the moment of submission, 24 T-REMD simulations have been performed on nine members of the proteome, in various oligomerization and protonation states, for a total of 0.612 ms of MD aggregated over all replicas and ∼17.25 μs aggregated overall lowest temperature windows. At present ∼2.07 M physical docking calculations have been performed with the smaller database and on Summit 2.4 billion docking calculations with the Enamine REAL database. The preliminary results presented are general trends observed in the MD and docking runs and do not describe details of the candidate compounds or dynamical properties of individual proteins, which will be reserved for future publications. Results of MD and docking are available at the Web site https://coronavirus-hpc.ornl.gov and will be updated as new simulations and docking results become available.

T-REMD Scaling Performance

Figure shows the performance per replica on Summit of T-REMD simulations for the majority of the simulations performed in this work using GROMACS version 2020.1. A few simulations were run with GROMACS 2018 and/or with different scheduling parameters and achieved only 20–50% of the performance shown above and are not included in the figure. We found that performance was maximized when running all bonded and nonbonded calculations on the GPUs (interatomic and both particle-mesh Ewald and pairwise Lennard-Jones). With the noted choices, performance saturates at around 100 ns/day for 250 000 atoms and above, even if more nodes are allocated per replica, for two reasons. First, the GPU-based fast Fourier transform is limited to a single GPU, and communication latencies between nodes slow down the calculation. However, throughput around 100 ns/day can still be achieved for simulations above 250,000 atoms if nodes are increased proportionately to system size.

Figure 1

Simulation throughput per replica. Each point represents the performance achieved by replica-exchange MD simulations on a single protein/water system. Run parameters were one replica per node (each node has six GPUs), using between 24 and 40 replicas in a given system.

T-REMD: Conformational Sampling of SARS-CoV-2 Proteins

T-REMD simulations were performed with the number of replicas ranging from 20 to 60 for 750 ns each for an aggregate sampling of over 0.6 ms (Table and SI Table S1). Given the scaling data noted above, for the total 816 replicas simulated, the calculations (if performed simultaneously) used the equivalent of ∼18% of the entire Summit supercomputer for ∼3 days. The performance, if the entire machine were used to simulate all of the different protein systems at the same time, would thus scales up to ∼1 ms/day. For all systems, the replicate temperatures range from 310 K to ∼350 K, and the average exchange probabilities were near 0.3. From the simulations, structural diversity was quantified by calculating, when a binding site is known, the gyration tensor of the binding site residues, the solvent-accessible surface area (SASA) of the binding sites, and the construction of pairwise snapshot-snapshot root-mean-squared deviation (RMSD) matrices for the target temperature replica, that is, the replica with the temperature set to 310 K (see Methods for calculation details). Additionally, using the gyration tensor, the shape anisotropy of the pockets was also obtained. Linkage-based RMSD clustering, using the pairwise RMSD matrices was performed to gauge the overall structural diversity of the proteins. Figure provides example conformations and calculated quantities for one example target, the neutral variant of the PL-Protease (PLPro). Similar plots for the other simulated systems are provided as Supporting Information and on https://coronavirus-hpc.ornl.gov.

Figure 2

Configurational variability of PLPro (PDB: 6WRH) with neutral HIS protonation states. (A) Overlay of 26 RMSD aligned structures from the lowest temperature replicate spanning the 750 ns of sampling. (B) Population distribution for shape anisotropy (κ) and solvent accessible surface area (SASA), with redder colors indicating greater occupancy of these kappa-SASA combinations. The distributions are also reflected by one-dimensional histograms above and to the right of the plot, and black dots within the population distribution, which represent position information for 10% of the total snapshots considered. (C) Pairwise RMSD clustering for the lowest temperature replica, with the snapshots ordered according to their cluster. The clusters in this instance were defined using a cutoff of half the maximum RMSD observed within the simulation and are labeled according to color with a color-bar for reference located above the plot. (D) Pairwise RMSD distribution across all snapshots. (E) Population statistics for the clusters introduced in (C). From Figure A (and subfigure A of the SI Figures S1–S23), it is clear that the simulations generate a diverse ensemble of states with varied loop structures. For the case of the neutral variant PL-Protease, Figure C,E indicate the existence of a number of dominant conformational states. Figure D,C further suggest that, although six dominant states exist, these states could be grouped into two “superstates”, which may indicate a switching like behavior or the potential existence of a “hinge”. Finally, subfigure B shows a significant amount of sampling of rod-like geometries (anisotropies near 1); however, there are states that have a correlated reduction in SASA and shape anisotropy, which would correspond with a nearly continuous transition between rod-like structures and spheroid-like structures.

Figure 3

Configurational variability of the PLPro (PDB: 6WRH) active site region generally bounded by the black dashed lines and the next step in analysis after Figure . Each of the differently colored aligned protein models represents the center of a populous cluster, as defined by active site conformation RMSD. Residues such as R164, E165, Y266, Q267, and F302 vary in conformation substantially and highlight the conformational variation within the ensemble created through T-REMD. For clearer visualization, only residues 91 and onward for PLPro are shown, as this selection was used for active site alignment. Within the VMD[122] rendering, side chains are displayed without their hydrogens. The general conformation variation highlighted by Figure and SI Figures S1–S22, to some degree, masks the conformational variation within binding sites; however, when for docking to the individual binding sites, clusters within the T-REMD trajectory are identified and demonstrate significant variability within the active site region (Figure ). While not specifically active site residues, residue variability at the tip of the loop centered on Y266 and the charged residue pair R164-E165 near the active site imply that accounting for the protein conformational ensemble is essential. Otherwise, the docking calculations would be strongly biased by the rotameric states present in the single static structure used in typical single-structure docking calculations.

Figure 4

Distribution of the number of identical compounds being found in n-number of target top 500-compounds selection out of 9014 compounds.

Smaller Database Docking

A preliminary analysis was performed of general trends seen in docking the smaller database to the 24 SARS-CoV-2 protein systems. For each protein target, all the docking results from each of the 10 cluster configurations were combined, and the top 500 scoring compounds extracted. The selectivity of the compounds for any given target varies considerably (Table ) with the number of compounds present on any two different top 500 lists as low as 132 or as high as 283. In comparison, from two random selections of 500 items out of 9014 items (see SI Figure S24, 5% percentile = 19 compounds, 95% percentile = 36 compounds), 27 identical compounds would be expected on average. Thus, the high number of identical top-scoring compounds observed between any two targets indicates a nonrandom selection of these duplicate compounds.

Table 3

Number of Duplicate Compounds Found in the top 500 lists in specific pairs of proteinsa

In grey/diagonal: the number of compounds unique to the corresponding target/site in the respective top 500 lists.

In grey/diagonal: the number of compounds unique to the corresponding target/site in the respective top 500 lists. For any particular target, the number of nonduplicate compounds is relatively low, ranging from 8 to 50 (Table ). The majority the compounds selected bind to a single target (Figure ). However, of all the compounds that are found in the top 500 lists, over half are calculated to bind to 3 or more targets. Molecular weight was found to be only weakly related to the number of protein targets a compound is calculated to bind to (see SI Figures S25 and S26). Therefore, the overlap in the top-scoring compounds is not an artifact of the size of the ligand. In the absence of a systematic experimental assay on each of these compounds, it is difficult to assess the significance of the overlap in the top 500 compounds. It is important to note that the “top compounds” are assembled based on relative docking scores between compounds against the same target, and not based on their absolute docking scores, which could artifactually inflate the number of duplicates. A high number of duplicates between the lists obtained on two different proteins could also indicate a computational bias of some compounds based on other criteria than their good fit to the targets. On the other hand, such high numbers could correctly identify promiscuous binding sites that do not display marked structural specificity and hence could be indeed targeted by similar compounds. It is outside of the scope of the present work to assertively differentiate between these two possibilities. However, the number of duplicates varies greatly across several pairs of targets, which renders unlikely a systematic bias in the docking (because of, say, molecular weight or other ligand properties independent of the target’s binding site). Only ∼55% of the top 500 compounds were the same in the docking results from the 100 and 750 ns clusters. Thus, extending the T-REMD simulation time by a factor of 7.5 nearly doubled the chemical diversity. Future analysis will be needed to indicate if the compounds that are identical in both sets of docking calculations are promiscuous compounds that would bind to many protein structures or if many of the clusters from the MD trajectories end up being selected by the same compounds.

Comparison of Docking with Experimental Screening Results

The chemical databases used in the ensemble docking contained compounds from a variety of sources (i.e., SweetLeads, NCI, and Enamine). In a separate experimental screening campaign, 2900 chemicals have been tested by the National Institutes of Health, National Center for Advancing Translational Sciences (NCATS) and listed on https://opendata.ncats.nih.gov/covid19/databrowser (accessed 2020/11/02). Many of these chemicals are included in our docking databases. Therefore, the computational predictions from docking were compared to positives experimentally identified by by NCATS. NCATS report results for spike protein (Spike-ACE2 protein–protein interaction (AlphaLISA) and MPro (3CL Enzymatic Activity) assays, which are equivalent to a subset of the docking calculations described here. We determined how many of the experimentally tested NCATS compounds were in our smaller docking database and identified how many of the experimental positives were in the top ranked lists for each protein (Table ). We also report the corresponding percentage of true positives, that is, how many of the top computational-scoring compounds were identified by NCATS as active as a percentage of how many of the top computational-scoring compounds were experimentally tested by NCATS.

Table 4

Number of Top-Scoring Computationally Predicted Compoundsa, Corresponding NCATS-Tested Compounds As a Subset from First Column, Percentage of Strong and Strong+Moderately Active Compounds for the Spike Protein (Top) and MPro (Bottom) Targets

no. of top compounds (docking)a	no. of corresponding compounds tested (NCATS)	percentage of NCATS actives (strong)	percentage of NCATS actives (strong+moderate)
Spike
673	235	14.0%	53.2%
420	158	17.1%	57.0%
292	108	20.4%	61.1%
149	55	25.5%	67.3%
81	27	33.3%	77.8%
17	4	100.0%	100.0%
MPro
968	359	-	7.0%
648	221	-	6.8%
459	156	-	7.7%
248	86	-	9.3%
136	45	-	8.9%
32	7	-	14.3%

Top compounds from docking were obtained from the top 500, 300, 200, 100, and 50 ranked lists that correspond to each of the spike and MPro targets. For both systems multiple docking runs were considered and only unique compounds are reported. We found that computational prediction is systematically enriched compared to a random selection of compounds. The experimental hit rate for NCATS compounds active in the spike protein is 6.1% for “strong actives” (NCATS definition) and 28.3% for strong and moderately active compounds (the value of 28.3% being unusually high). In contrast, the computational enrichment is between about twice to four times as high (Table and Figure ). Out of the 673 unique compounds in the union of our top 500 lists for each spike protein simulation variant, 235 have been tested experimentally by NCATS. Of these 235 compounds, 33 (14%) are experimentally strong actives and 125 (53%) are strongly or moderately active. Narrowing the ranked lists from docking to the top 10 resulted in 17 unique compounds for which 4 have experimental activity (0.14% of the total NCATS screen). Interestingly, all four of the experimentally tested compounds (i.e., 100% of the tested compounds in our top 10 lists) are strongly active.

Figure 5

Comparison of S-protein (Spike) true-positive rates for strong-actives. Plot shows percentage of experimental NCATS positives in top computational-predicted chemicals as solid line. Dashed-line represents constant NCATS positive rate for comparison. For the MPro assay, NCATS identifies only 1 strongly active out of 2675 compounds in the approved drugs collection that were tested experimentally. Therefore, we considered the overlap in our database with both the strong and moderately active compounds. Although the enrichment rates obtained through computational docking was not as high as the rates obtained for the spike protein, the computationally obtained MPro enrichments ranged from 7% to 14% and are still systematically better than the rates obtained exeprientally (5.7%). Interestingly, for both targets, as the number of ranked compounds considered is decreased, the computational enrichment improves. As expected, the very top screening compounds with the best docking score are often the most likely to have experimental activity and the further we go down the ranked list the lower the computational enrichment. Thus, docking performs better as a tool for identifying a small number of active compounds in a very small subset of a database, rather than a tool to identify many active compounds in a large subset of a database. This result is important when considering the prioritization of compounds from the billion-plus compound screen described below. A threshold of 10% or even 5% of a billion compounds database would still likely be too large a number to screen experimentally, but less than 1% would be more amenable to experimental validation. The present results suggest that the best enrichments are indeed obtained for a small or very small subset of the chemical databases used in docking.

Billion-Compound Supercomputing Screens

We found that for the ligands with fewer numbers of rotatable bonds, such as found in the Enamine data set, a docking calculation using 20 repeated runs could be performed in 0.5–2.5 s when using the Summit GPU (Figure ). The same set of ligands docked with Vina on Summit’s CPUs showed a large spread of timings, with some ligands requiring nearly five minutes to complete (Figure ). In practice, this means that with GPU-enabled docking, it is feasible to flexibly dock a billion compounds in about a day on modern supercomputers, whereas with Vina, a similar calculation would require a multiyear effort on a university cluster. We confirmed that for ligands with less than 11 rotations, the Solis-Wets algorithm in AD-GPU provides equivalent results to the new ADADELTA algorithm.[61] For the trimmed data set, the top 5% of scores obtained with AD-GPU using the Solis-Wets algorithm formed an intersection with the top 5% of scores from Vina consisting of 18% of each top 5% set. Analysis of the full billion ligand sets is currently ongoing.

Figure 6

General benchmarking of Autodock-GPU and Autodock Vina performance against subset of Enamine database.

Mutation Analysis

The mutation frequency of the proteins simulated in this study is generally low. It should be noted, however, that it is as yet early in the history of SAR-CoV-2, and thus increased relative variability of residues along the proteome may indicate the propensity of those residues for future mutations. We did find higher variability, given by the entropy values, in other SARS-CoV-2 proteins not included in this study; in particular, the Spike mutation D614G noted in other reports continues to be seen with high frequency since being described in a recent preprint that performed an analysis on GISAID through April 13.[151] We counted 9107 D614G mutations (up from 3577 found April 13) and calculated entropy of 0.94 for this residue. The NSP12 RdRp protein also shows a large mutation entropy at residue 323, with a mutation entropy of 0.95. This residue, P, has mutated to L 9078 times (and F 3 times). Note that not every protein was represented in all sequences used for entropy calculations. Other regions of the genome with higher entropy values (greater than 0.5) are residues 203 and 204 of orf9 (entropy 0.70 and 0.69, respectively), residue 85 of NSP2 (0.74), 37 of NSP6 (0.57), 57 of orf3a (0.81), and 84 of orf8 (0.56). The highest entropy found among the structures in this study was in the main protease, with an entropy of 0.13 for residue 15, a glycine. We found 261 G15S mutations and one G15D mutation in our data set. The MPro also has a number of other residues with relatively high entropies, including residue 90, with entropy 0.07 and 117 K90R mutations, and 266 with entropy 0.04 and 64 A266 V mutations. After this, the next highest entropy was 0.06 for the N-terminal region of the N protein and also for NSP15. These are displayed in Figure . An entropy of 0.04 was also found for domain X of NSP3 (glycine 76). A lower mutation entropy was found in PLPro, compared to MPro, with the highest value being 0.03. These mutations are important to consider when choosing targets for drug discovery, in that a protein that seems to be more rapidly mutating could potentially lead to an ineffective therapeutic if mutations alter the shape of the drug-binding site. In the case of MPro, the highest entropy mutations were not found in the active site; however, it is possible that they may still affect its conformation indirectly. The reduced mutation entropy for PLPro may indicate that an effort to target a protease could meet with fewer mutation-related problems if targeting PLPro rather than MPro.

Figure 7

Example mutational entropy analysis. Residues are colored by entropy, with redder colors corresponding to greater entropy.

Future Directions and Preliminary Results from New Methodologies

As emphasized above, this article is a progress report on an ongoing project. The development of the pipeline is continuing with advances being made in several directions. Notably, we are incorporating artificial intelligence and machine learning into rescoring ligand ranking and clustering the MD trajectories. Further, we are developing methods to rescore docking using quantum chemical approaches. Although these developments have not been incorporated into the pipeline at the time of writing and were not applied to generate the results described above, we report on progress with them here.

Clustering MD Trajectories Using Deep Learning and AI

The deluge of data generated from simulations such as the T-REMD runs reported here can make traditional approaches of machine learning and clustering approaches (based on measures of similarity in the RMSD-space, or other metrics) quite challenging. Often, practical aspects of computing dictate the use of subsample tracts of the MD data itself or use of prior knowledge about these data sets (e.g., knowing that the ligand binds only in a certain orientation) to filter such data sets. Deep learning techniques can be particularly valuable in “sifting” through large data sets and can be powerful for clustering T-REMD simulations. We are investigating the use of a variational autoencoder with convolutional filters (CVAE), previously developed to cluster protein folding trajectories,[152,153] to cluster the T-REMD simulations of NSP15. As shown in Figure we find that the latent dimensions learned from the simulations indeed cluster the simulation data into a small number of conformational states. These states correspond to transitions observed in the simulations, as seen from various measured observables from the data such as the binding site RMSD, SASA, and the radius of gyration.

Figure 8

Deep learning clusters T-REMD simulations of the NSP15 hexameric complex into conformational states that are potentially relevant for docking studies. (A) A 3D-representation of the CVAE learned from the T-REMD simulations shows the presence of multiple conformational states. Each conformation from the simulation is painted using the RMSD to the starting structure and shows the presence of distinct directions in the conformational landscape where low- and high-RMSD structures are distributed. To understand this representation better, we use an at-stochastic neighbor embedding (t-SNE) algorithm to embed the data into a low-dimensional space, where we can clearly visualize how the conformational landscape is organized. In this two-dimensional space, we visualize various observables from the simulations, including (B) RMSD to the native structure, (C) SASA, and (D) radius of gyration. In each of these cases, we can observe the presence of at least three dominant substates with distinct structural characteristics, which can be further used for docking simulations. The outcomes from the clustering provide insights into aspects of how the T-REMD simulations have sampled the conformational landscape - for example, in the case of this protein, as observed in Panel C, there is only one conformational state which has sampled a large SASA, indicating a potentially open state (which has only a minor change in the overall RMSD, Panel B). This information can be particularly helpful for selecting conformations and identifying metastable states for docking simulations.[154]

Protein–ligand Rescoring Using Machine-Learning

The computational identification of drug compounds, and small molecules in general, that bind to a protein consists of three distinct tasks: (1) identifying a putative conformation of the protein–ligand complex (the docking problem); (2) given a docked conformation, determining whether or not the ligand is a true binder (the screening problem); and (3) is determined to be a true binder, ascertaining a relative, or better yet, absolute binding affinity (the affinity, or scoring, problem). In principle, one could perform the screening and affinity prediction problems using molecular dynamics techniques such as free energy perturbation, thermodynamic integration, or more approximate methods such as MM/PB(GB)SA (molecular mechanics/Poisson-Boltzmann[generalized Born]-surface area). However, this is computationally intractable for large numbers of compounds, even with supercomputers, and the accuracy can often be poor. Furthermore, these rigorous first-principles-based methods assume a putative binding site, and cannot be applied to cases where the binding site is unknown. While the score or energy given by computational docking programs such as AutoDock Vina is reasonably well-suited for docking pose prediction, improvements are possible on the screening and affinity problems, and for this, we use here machine learning. There is an ongoing need for the development of computationally tractable models that can be easily validated on benchmark docking data sets. To this end, accurate, physics-based, machine-learned models for the docking and affinity have been trained using the PDBbind database, a data set consisting of experimentally determined protein–ligand complex structures with accurate experimental binding affinities.[155−158] On an independent data set, the CASF-2013 benchmark,[158,160] affinity prediction (random forest-based) models achieve, at best, a Pearson correlation (R2) of 0.86, and docking pose prediction classifiers achieve an area under the curve of the receiver operator characteristic (AUC of ROC) of 0.91 using support vector machines (Demerdash et al., in review). While the random forest model trained on unnormalized features achieved the best R2 at 0.86, a range of additional models (trained with random forest, gradient boosted trees, or support-vector machines using normalized or unnormalized features) achieved R2 of 0.81–0.85. Regarding the docking pose prediction, the model used here achieves greater enrichment for native-like structures (78%) than AutoDock Vina (63%) (Demerdash et al., in preparation). A model dedicated to virtual screening as a first step in triaging candidate molecules was developed. Once again, as with affinity and pose prediction, this model requires docked structures as input. This model is trained to discriminate between active and inactive compounds, and affinity ranking is performed as a second step only on the true active compounds. To this end, a support-vector machine-based model using the Data set of Useful Decoys-Enhanced, a database of 102 proteins with experimentally verified active and inactive compounds, has been trained.[161,162] Preliminary performance on an independent validation set is encouraging, achieving AUC of ROC of 0.80 and recall of 0.76; that is, 76% of experimentally validated true positives were predicted positively by the model. This model is currently being subjected to further optimized, primarily through the calculation of additional physicochemical descriptors (features) and the optimization of hyperparameters. Due to the urgent nature of the Covid-19 drug discovery campaign, computational expediency precluded calculating features on all docked structures for a given compound and, in turn, precluded running the docking pose classifier on the output of AutoDock Vina. Therefore, we relied on AutoDock Vina’s ranking and not the machine-learned docking pose classifier, thereby reducing the number of feature calculations that must be performed and increasing the throughput. (Parallelization efforts and code optimization are underway, so that feature calculation on all docked poses and subsequent application of the docking pose classifier becomes less onerous.) The virtual screening model was applied to these top-scoring structures from AutoDock Vina, generating a “binder” vs “non-binder” classification. Subsequently, affinity prediction models were applied to just those complex structures classified as “binder.” The affinity prediction was performed using the range of high-performing models on docked structures corresponding to each MD cluster representative used in ensemble docking (see: Methods). This results in affinity predictions on typically 10 cluster representatives, each with affinity predictions from 5 machine-learned models (1 SVM, 2 boosted tree, and two random forest approaches), resulting in 50 “cases”. For each case, the top-500 ligands in terms of predicted affinity, were obtained. Molecules that appeared in the top 500 in at least 25 of the 50 cases were deemed hits and are presently undergoing experimental testing.

QM Analysis of S-Protein Docking Results

In a preliminary evaluation of the accuracy of FMO–DFTB/PCM in describing the interactions between ligands and the S-protein, we compare FMO–DFTB/PCM pair interaction energy (PIE) to that of the higher-level, but more expensive FMO-MP2/PCM method. The PIEs were calculated for ligands binding to the S-protein in the binding pocket. Figure shows that FMO–DFTB/PCM interaction energies agree very well with high-level ab initio FMO-MP2/PCM data with the R correlation coefficient; in this case, it is 0.984. The high correlation between FMO–DFTB/PCM PIE and FMO-MP2/PCM PIE indicates that FMO–DFTB/PCM may be a fast and reliable QM-based method for interaction energy calculations.

Figure 9

Pair interaction energy (PIE) decomposition analysis for FMO–DFTB/PCM plotted against FMO-MP2/3-21G/PCM data.

Pair interaction energy (PIE) decomposition analysis for FMO–DFTB/PCM plotted against FMO-MP2/3-21G/PCM data. An updated preliminary homology model and T-REMD simulation similar to that reported by Smith & Smith of the S-protein RBD and redocking to new clusters from followed by a re-evaluation of the top scoring 150 complexes with FMO–DFTB/PCM was performed as follows: 15 protein conformations and their ten strongest binding ligands predicted by AutoDock Vina were selected, and geometry optimizations were performed at the FMO–DFTB3-D3(BJ)/3ob/PCM level of theory. The binding energies of the top-3 best binders ranked by FMO–DFTB/PCM are listed in Table . According to our preliminary results, although FMO–DFTB/PCM agrees well with Vina in categorizing the strong binders, with all Vina, top-10 ligands have considerably stronger FMO–DFTB/PCM binding energies (ΔEbind < −12 kcal/mol) and the QM-based ranking is significantly different from the Vina ranking.

Table 5

Binding Energy of the top-3 Best Ranked by FMO–DFTB/PCM and Their Binding Free Energy Predicted by Autodock Vina

ligand SWEETLEAD ID	protein cluster ID	FMO–DFTB/PCM ΔE^bind(kcal/mol)	Vina ΔG^bind(kcal/mol)
4752	7	–67.75	–5.40
7055	11	–66.78	–7.60
4698	12	–66.41	–7.60

It is important to note that the current FMO–DFTB/PCM energy, which is based on solvent-corrected binding interaction energies is not the binding free energy. Various additional contributions to the binding free energy can be separately evaluated, and work is underway in this regard. For example, an entropic contribution can be estimated from vibrational frequencies once the requisite Hessian matrix is available.

Conclusions

The present manuscript reports on the establishment of a supercomputer-based virtual high-throughput screening ensemble-docking pipeline that takes into account the dynamic properties of protein targets, as well as preliminary results on simulations and docking screens to a number of protein targets from SARS-CoV-2. The speed at which structural data have been derived experimentally for the SARS-CoV-2 proteome means that several of the simulations reported above were “out of date” almost immediately. By this, we mean that the simulations were performed using models derived from experiments that had been superseded by higher-resolution or more complete data. Examples of these are the S-protein, MPro, N Protein, and NSP9. Clearly, as information on structures increases in quality, simulations will be further repeated. Furthermore, the complexity of the structural models derived is expected to increase. For example, models of the S protein interacting with the viral envelope or extending up to the complete virion can be envisaged and, in principle at least, incorporated into drug screening protocols. The present results provide comprehensive simulation models for eight of the viral proteins in 24 molecular systems. T-REMD is well suited for massively parallel supercomputing because many replicas are run simultaneously, and they need to communicate with each other. In the present tests, 350 ns/day/replica was obtained for the smallest (NSP3 phophatase/ADRP) system, and this, therefore, scales up to about 1.5 ms/day of aggregate MD time, given the hypothetical situation that one had about 100 different proteins to run of roughly the same size. For bigger systems, with 105 atoms, the throughput is lower, about 1.0 ms/day. Nevertheless, it is clear that extensive simulation data can be obtained on many proteins with a short time-to-solution on this machine. As one possible future direction, one might envisage running T-REMD on the 44 drug targets that have been suggested as a minimal screen for the toxicity effects in human drug trials.[163] The ensemble docking performed so far mostly involves repurposing databases and therefore is limited to about 10k compounds. Many of these compounds are predicted to be quite promiscuous in binding to the targets. Two of the compounds identified in the top 1% of our preliminary S-protein screen have been reported to be in two registered clinical trials (quercetin and hypericin). Further, several compounds from the screens reported above show activity in reducing live viral infectivity: these results will be reported elsewhere. The docking results using the smaller database were not run on Summit, because of the fact that for Summit code running on GPUs is preferred. However, as COVID-19 therapeutic research moves beyond repurposing to the discovery of novel compounds, there is a need to quickly screen many more compounds. Therefore, we have installed Autodock-GPU and demonstrated that it is capable of screening 1 billion compounds on Summit in 12 h when scaled to the whole machine. Although several other groups have reported billion-compound screens, these have been using AI approaches or rigid docking without pose optimization.[164,165] The present billion-compound screen calculations, therefore, represent a potential supercomputer-driven paradigm shift in computational drug discovery and can be envisaged to be performed on dozens of proteins in a single day when the exascale era of supercomputing arrives, as planned for 2021.

118 in total

1. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.

Authors: Berk Hess; Carsten Kutzner; David van der Spoel; Erik Lindahl
Journal: J Chem Theory Comput Date: 2008-03 Impact factor: 6.006

2. GROMACS: fast, flexible, and free.

Authors: David Van Der Spoel; Erik Lindahl; Berk Hess; Gerrit Groenhof; Alan E Mark; Herman J C Berendsen
Journal: J Comput Chem Date: 2005-12 Impact factor: 3.376

3. A temperature predictor for parallel tempering simulations.

Authors: Alexandra Patriksson; David van der Spoel
Journal: Phys Chem Chem Phys Date: 2008-02-25 Impact factor: 3.676

4. Recent developments in the general atomic and molecular electronic structure system.

Authors: Giuseppe M J Barca; Colleen Bertoni; Laura Carrington; Dipayan Datta; Nuwan De Silva; J Emiliano Deustua; Dmitri G Fedorov; Jeffrey R Gour; Anastasia O Gunina; Emilie Guidez; Taylor Harville; Stephan Irle; Joe Ivanic; Karol Kowalski; Sarom S Leang; Hui Li; Wei Li; Jesse J Lutz; Ilias Magoulas; Joani Mato; Vladimir Mironov; Hiroya Nakata; Buu Q Pham; Piotr Piecuch; David Poole; Spencer R Pruitt; Alistair P Rendell; Luke B Roskop; Klaus Ruedenberg; Tosaporn Sattasathuchana; Michael W Schmidt; Jun Shen; Lyudmila Slipchenko; Masha Sosonkina; Vaibhav Sundriyal; Ananta Tiwari; Jorge L Galvez Vallejo; Bryce Westheimer; Marta Włoch; Peng Xu; Federico Zahariev; Mark S Gordon
Journal: J Chem Phys Date: 2020-04-21 Impact factor: 3.488

5. Rapid Identification of Potential Inhibitors of SARS-CoV-2 Main Protease by Deep Docking of 1.3 Billion Compounds.

Authors: Anh-Tien Ton; Francesco Gentile; Michael Hsing; Fuqiang Ban; Artem Cherkasov
Journal: Mol Inform Date: 2020-03-23 Impact factor: 4.050

6. A computationally identified compound antagonizes excess FGF-23 signaling in renal tubules and a mouse model of hypophosphatemia.

Authors: Zhousheng Xiao; Demian Riccardi; Hector A Velazquez; Ai L Chin; Charles R Yates; Jesse D Carrick; Jeremy C Smith; Jerome Baudry; L Darryl Quarles
Journal: Sci Signal Date: 2016-11-22 Impact factor: 8.192

Review 7. Biogenesis and dynamics of the coronavirus replicative structures.

Authors: Marne C Hagemeijer; Peter J M Rottier; Cornelis A M de Haan
Journal: Viruses Date: 2012-11-21 Impact factor: 5.048

8. Structure of Main Protease from Human Coronavirus NL63: Insights for Wide Spectrum Anti-Coronavirus Drug Design.

Authors: Fenghua Wang; Cheng Chen; Wenjie Tan; Kailin Yang; Haitao Yang
Journal: Sci Rep Date: 2016-03-07 Impact factor: 4.379

9. Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2.

Authors: Youngchang Kim; Robert Jedrzejczak; Natalia I Maltseva; Mateusz Wilamowski; Michael Endres; Adam Godzik; Karolina Michalska; Andrzej Joachimiak
Journal: Protein Sci Date: 2020-05-02 Impact factor: 6.993

Review 10. Evolution of In Silico Strategies for Protein-Protein Interaction Drug Discovery.

Authors: Stephani Joy Y Macalino; Shaherin Basith; Nina Abigail B Clavio; Hyerim Chang; Soosung Kang; Sun Choi
Journal: Molecules Date: 2018-08-06 Impact factor: 4.411

27 in total

1. Efficient Hit-to-Lead Searching of Kinase Inhibitor Chemical Space via Computational Fragment Merging.

Authors: Grigorii V Andrianov; Wern Juin Gabriel Ong; Ilya Serebriiskii; John Karanicolas
Journal: J Chem Inf Model Date: 2021-11-11 Impact factor: 4.956

2. Biomolecular modeling thrives in the age of technology.

Authors: Tamar Schlick; Stephanie Portillo-Ledesma
Journal: Nat Comput Sci Date: 2021-05-20

3. Supercomputing Pipelines Search for Therapeutics Against COVID-19.

Authors: Josh Vincent Vermaas; Ada Sedova; Matthew B Baker; Swen Boehm; David M Rogers; Jeff Larkin; Jens Glaser; Micholas D Smith; Oscar Hernandez; Jeremy C Smith
Journal: Comput Sci Eng Date: 2020-11-06 Impact factor: 2.152

4. Essential Dynamics Ensemble Docking for Structure-Based GPCR Drug Discovery.

Authors: Kyle McKay; Nicholas B Hamilton; Jacob M Remington; Severin T Schneebeli; Jianing Li
Journal: Front Mol Biosci Date: 2022-06-29

5. In Silico Screening and Testing of FDA-Approved Small Molecules to Block SARS-CoV-2 Entry to the Host Cell by Inhibiting Spike Protein Cleavage.

Authors: E Sila Ozdemir; Hillary H Le; Adem Yildirim; Srivathsan V Ranganathan
Journal: Viruses Date: 2022-05-24 Impact factor: 5.818

Review 6. Biflavonoid as potential 3-chymotrypsin-like protease (3CLpro) inhibitor of SARS-Coronavirus.

Authors: Yustina Hartini; Bakti Saputra; Bryan Wahono; Zerlinda Auw; Friska Indayani; Lintang Adelya; Gabriel Namba; Maywan Hariono
Journal: Results Chem Date: 2020-12-25

7. Molecular Simulations suggest Vitamins, Retinoids and Steroids as Ligands of the Free Fatty Acid Pocket of the SARS-CoV-2 Spike Protein*.

Authors: Deborah K Shoemark; Charlotte K Colenso; Christine Toelzer; Kapil Gupta; Richard B Sessions; Andrew D Davidson; Imre Berger; Christiane Schaffitzel; James Spencer; Adrian J Mulholland
Journal: Angew Chem Int Ed Engl Date: 2021-02-22 Impact factor: 16.823

8. A COVID-19 Drug Repurposing Strategy through Quantitative Homological Similarities Using a Topological Data Analysis-Based Framework.

Authors: Raul Pérez-Moraga; Jaume Forés-Martos; Beatriz Suay-García; Jean-Louis Duval; Antonio Falcó; Joan Climent
Journal: Pharmaceutics Date: 2021-04-02 Impact factor: 6.321

9. Accelerating high-throughput virtual screening through molecular pool-based active learning.

Authors: David E Graff; Eugene I Shakhnovich; Connor W Coley
Journal: Chem Sci Date: 2021-04-29 Impact factor: 9.825

10. Binding of SARS-CoV Covalent Non-Covalent Inhibitors to the SARS-CoV-2 Papain-Like Protease and Ovarian Tumor Domain Deubiquitinases.

Authors: Dakshinamurthy Sivakumar; Matthias Stein
Journal: Biomolecules Date: 2021-05-28