Terra Sztain1, Rommie Amaro1, J Andrew McCammon1,2. 1. Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, California 92093, United States. 2. Department of Pharmacology, University of California, San Diego, La Jolla, California 92093, United States.
Abstract
The SARS-CoV-2 pandemic has rapidly spread across the globe, posing an urgent health concern. Many quests to computationally identify treatments against the virus rely on in silico small molecule docking to experimentally determined structures of viral proteins. One limit to these approaches is that protein dynamics are often unaccounted for, leading to overlooking transient, druggable conformational states. Using Gaussian accelerated molecular dynamics to enhance sampling of conformational space, we identified cryptic pockets within the SARS-CoV-2 main protease, including some within regions far from the active site. These simulations sampled comparable dynamics and pocket volumes to conventional brute force simulations carried out on two orders of magnitude greater timescales.
The SARS-CoV-2 pandemic has rapidly spread across the globe, posing an urgent health concern. Many quests to computationally identify treatments against the virus rely on in silico small molecule docking to experimentally determined structures of viral proteins. One limit to these approaches is that protein dynamics are often unaccounted for, leading to overlooking transient, druggable conformational states. Using Gaussian accelerated molecular dynamics to enhance sampling of conformational space, we identified cryptic pockets within the SARS-CoV-2 main protease, including some within regions far from the active site. These simulations sampled comparable dynamics and pocket volumes to conventional brute force simulations carried out on two orders of magnitude greater timescales.
In December 2019, the World Health Organization learned of a novel coronavirus that has
since rapidly spread, leading to a global pandemic. The SARS-CoV-2 virus, causing the
disease named COVID-19, has exceeded 100 million cases worldwide, taking over 2 million
lives as of January 30, 2021.[1] The genetic and structural similarity of
the SARS-CoV-2 virus to the agents of the severe acute respiratory syndrome coronavirus
(SARS-CoV) epidemic in 2003 and Middle East respiratory syndrome coronavirus (MERS-CoV)
epidemic in 2012 provide a basis of information for understanding and ultimately treating or
preventing this disease; however, no highly effective antiviral drug exists for any
human-infecting coronavirus.Proteases are responsible for activating viral proteins for particle assembly and have
proved successful targets for antiviral agents; most notable are the protease inhibitors
used to treat HIV and Hepatitis C.[2,3] The main protease of SARS-CoV-2, called Mpro or
3CLpro encoded by the nsp5 gene,[4] was the first SARS-CoV-2
protein deposited to the protein databank (PDB) on January 26th 2020.[5]
This structure was crystalized with a covalent inhibitor (N3) identified from computer-aided
drug design and validated biochemically. Currently, over one hundred Mpro
structures exist in the PDB and massive efforts to discover a successful inhibitor are
underway.[5−8]Mpro is highly similar to 3CL proteases from other coronaviruses in sequence,
structure, and function.[9,10] First, Mpro autocatalytically cleaves itself from the
SARS-CoV-2 polypeptide and then forms a homodimer to subsequently cleave at 11 distinct
sites, while the papain-like protease, PLpro, cleaves at three sites, allowing
the viral proteins to fold and perform their functions.[11−13] Proteolytic cleavage is catalyzed by residues H41 and C145 of the
catalytic cysteine dyad. These active site residues are located at the N-terminal globular
domain of Mpro, which contains an anti-parallel beta barrel reminiscent of
trypsin-like serine proteases, while the C-terminal domain consists of five alpha
helices.[13] Upon dimerization, Mpro resembles the shape of a
heart symbol (Figure A).
Figure 1
Pocket analysis of GaMD simulations of Mpro. (A) Active site, dimer
interface, and distal site definitions for pocket calculations. The inclusion sphere was
centered on the center of mass (COM) of the top and bottom protein domains for active
and distal sites with 12 and 10 Å radii, respectively. A 10 Å-radius sphere
was used for the dimer interface calculation. (B–D) Histograms showing the
distribution of pocket volumes calculated from the aggregate 6 μs GaMD
simulations.
Pocket analysis of GaMD simulations of Mpro. (A) Active site, dimer
interface, and distal site definitions for pocket calculations. The inclusion sphere was
centered on the center of mass (COM) of the top and bottom protein domains for active
and distal sites with 12 and 10 Å radii, respectively. A 10 Å-radius sphere
was used for the dimer interface calculation. (B–D) Histograms showing the
distribution of pocket volumes calculated from the aggregate 6 μs GaMD
simulations.Many efforts to identify inhibitors involve high throughput virtual screening, which
exploits the power of computational docking to screen millions of molecules in
silico to narrow down a few “hits” for lead optimization.
Incorporation of molecular dynamics (MD) has significantly improved the ability to identify
promising protein inhibitors.[14] Docking to protein ensembles obtained
from MD simulations is often employed to consider multiple target states that remain elusive
in static crystal structures.[15] Conventional MD simulations are however
limited in the amount of conformational space that can be sampled due to the amount of time
required to traverse energy barriers between stable conformational states.[16] In the present study, we used the enhanced sampling technique, Gaussian
accelerated MD (GaMD),[17] to overcome such barriers. GaMD adds a harmonic
boost to the potential energy up to a threshold, effectively “filling in the
wells” creating a smoother potential energy surface.[17,18] This method allowed us to extensively sample
conformations of the SARS-CoV-2 main protease at minimal computational expense and provide
detailed characterization of several potentially druggable pockets, which can serve as a
basis for identifying an Mpro inhibitor for COVID-19 treatment.
Results
GaMD simulations were performed on the first published Mpro structure PDB
6LU7.[5] Six
systems were simulated, including monomer and dimer simulation in the presence and absence
of the co-crystalized N3 inhibitor either covalently or non-covalently bound (Figure S1). Five independent simulations of 200 ns for an aggregate of 1
μs were carried out for each system. Three regions were defined to investigate
potentially druggable pockets: the active site, which lies in the N-terminal top lobe of the
heart-shaped protease; the C-terminal region, as a potential allosteric site; and the dimer
interface region (Figure A).
Characterization of Mpro Pockets
A variety of pocket volumes were sampled between 16 and 277 Å3 for the
active site pocket, starting from 170 Å3 in the crystal structure. The
interface spanned from 36 to 429 Å3 starting from 209 Å3
and the distal site from 0 to 78 Å3 starting from 18 Å3
(Figure B–D). Simulations with the N3
ligand present sampled greater mean volumes for the active site and distal region than the
apo simulations without N3; however, the average dimer interface region was greater in the
apo simulation (Figure S2). Covalently bound N3 sampled greater mean volumes for the active
site and dimer interface than non-covalent, which sampled greater mean volumes for the
distal site (Figure S3). In a few simulations, the non-covalent ligand escaped the active
site pocket (Figure S4). In all simulations, the ligand re-arranges compared to the
crystal structure. The non-covalent ligands quickly decrease in total number of native
contacts and increase in non-native contacts. The covalent ligands slightly decrease in
native contacts but have a similar increase in non-native contacts. (Figures S5 and S6). These simulations revealed significant loop dynamics
(Figure A–C) leading to the distinct
pocket conformations (Figure D–F and
Movie S1) with varying active site water occupancy (Figure S7).
Figure 2
Cryptic pockets identified from GaMD simulations. (A–C) Overlay of several
frames highlighting loop dynamics leading to various pocket conformations for each
site. (D–F) Overlay of several pockets shown outlined in mesh, colored in
varying shades of pink for the active site, blue for the distal site, and orange for
the interface. (G–I) Computational solvent mapping of “hot spots”
to pockets at each site.
Cryptic pockets identified from GaMD simulations. (A–C) Overlay of several
frames highlighting loop dynamics leading to various pocket conformations for each
site. (D–F) Overlay of several pockets shown outlined in mesh, colored in
varying shades of pink for the active site, blue for the distal site, and orange for
the interface. (G–I) Computational solvent mapping of “hot spots”
to pockets at each site.Several metrics were used to assess the properties of these pockets. The PockDrug[19] webserver predicts druggability probability based on a variety of factors
including geometry, hydrophobicity, and aromaticity, among others. The FTMap[20] computational solvent mapping webserver helps determine “hot
spots” and can aid in fragment-based drug design. For each site, the sampled pocket
volumes were clustered into 10 populations, and a frame corresponding to that volume was
analyzed. The most populated volumes for the active site, dimer interface, and distal site
were 114, 220, and 10 Å3, respectively. The 10 active site pockets gave
predicted druggabilities between 0.18 and 0.86 and solvents mapped to seven of the pockets
(Figure S8). The dimer interface pockets gave higher druggability prediction
values, between 0.5 and 0.93, with solvent mapping to all 10 structures (Figure S9). The distal site pockets gave the highest predicted druggability,
between 0.37 and 1.00 with six out of ten pockets giving predicted druggability values
greater than 0.90. Solvents mapped to seven out of the ten pockets (Figure S10). Notably, none of the solvent mapping to any of the sites showed
overlapping results, indicating that each structure may provide distinct opportunities for
designing inhibitors. The pocket volume, predicted druggability, and solvent mapping were
not directly correlated, as expected, since each calculation considers different metrics.
Overall, these analyses indicate that the simulations reveal a diverse set of
conformations that are likely to serve as viable targets for in silico
drug development.Dimer association is necessary for catalytic activity of the protease;[5] therefore, it is reasonable to assume that binding of a molecule, which prevents this
association, would inactivate the protease. The significance of the distal regions is less
well understood, though computational investigations support a potential allosteric
role.[21,22] To study
this further, we examined the correlated motions of each residue throughout the
simulations. In the monomer simulation, the active site region was positively correlated
to the distal site region and negatively correlated to the region that connects the two,
whereas the inverse was observed in the dimer simulation (Figure and Figures S11 and S12). These correlations resulted in primary motions,
detected by principal component analysis, comprising hinge-like in the monomer versus
twisting in the dimer (Figure S13). These motions were similar in both the apo and N3 bound
simulations. In the dimer simulations, the active site of chain A was slightly correlated
to the active site of chain B, and the distal regions were highly anticorrelated, twisting
in opposite directions. (Figure S13). Regardless of positive or negative correlation values, the
distal and active sites show a dynamic interdependence on each other in both the apo and
dimer forms.
Figure 3
Correlated motions of each residue. (A) Correlation matrix calculated from aggregate
1 μs simulation of apo Mpro. Calculation from the monomer simulation
is shown as an inset in the upper right corner of the calculation from the dimer
simulation. (B) Structural representation of correlated motions, with each residue
colored based on correlation to the catalytic histidine H41.
Correlated motions of each residue. (A) Correlation matrix calculated from aggregate
1 μs simulation of apo Mpro. Calculation from the monomer simulation
is shown as an inset in the upper right corner of the calculation from the dimer
simulation. (B) Structural representation of correlated motions, with each residue
colored based on correlation to the catalytic histidine H41.
GaMD Enhancement of Mpro Conformational Space
Previous benchmarks have demonstrated that GaMD can enhance sampling by multiple orders
of magnitude.[23] To test the sampling enhancement compared to
conventional MD simulations, we compared the 1 μs aggregate GaMD simulation from
five replicates of 200 ns to the 10 μs simulations carried out at Riken[24] and 100 μs simulations D.E. Shaw Research (DESRES).[25] We additionally performed and compared those to 1 μs of GaMD
continuous simulation and compared 200 ns cMD to 200 ns GaMD. The ranges of pocket volumes
for each site were highly similar, as detailed in Figure S14. Even the short conventional simulations sampled comparable
pocket volumes, indicating that this metric may not be the most effective measure of
conformational sampling. Principal component analysis (PCA) is frequently used as a metric
for conformational diversity as it decreases the dimensionality to maximize the dataset
variance.[26,27] PCA
was indeed more consistent with expectations where increasing simulation time corresponded
to larger regions of space covered by plotting the first two principal components (Figure ).
Figure 4
Conformational sampling of GaMD and conventional MD. Each plot shows the first and
second principal components, projected from a mass weighted matrix of backbone atom
positions of the apo dimer, colored by probability density. (A) 200 ns of GaMD
simulation. (B) 1 μs of GaMD aggregated from five independent 200 μs
simulation. (C) 1 μs of continuous GaMD simulation. (D) 200 ns of cMD. (E) 10
μs of cMD performed by Riken. (F) 100 μs of cMD performed by D.E. Shaw
Research (DESRES). The area of a 2D histogram of the PC1 vs PC2 datapoints from each
simulation is labeled as a percentage relative to the 100 μs DE. Shaw Research
(DESRES) conventional MD simulation.
Conformational sampling of GaMD and conventional MD. Each plot shows the first and
second principal components, projected from a mass weighted matrix of backbone atom
positions of the apo dimer, colored by probability density. (A) 200 ns of GaMD
simulation. (B) 1 μs of GaMD aggregated from five independent 200 μs
simulation. (C) 1 μs of continuous GaMD simulation. (D) 200 ns of cMD. (E) 10
μs of cMD performed by Riken. (F) 100 μs of cMD performed by D.E. Shaw
Research (DESRES). The area of a 2D histogram of the PC1 vs PC2 datapoints from each
simulation is labeled as a percentage relative to the 100 μs DE. Shaw Research
(DESRES) conventional MD simulation.While the GaMD simulations run continuously for 1 μs only covered 54% of the
conformational space covered by the 100 μs DESRES, the 1 μs aggregate GaMD
from five simulations run in parallel covered 80% of the DESRES space. The 10 μs
Riken simulations covered 94% of that space (Figure ). It is worth noting that each of the simulations with the most extensive
sampling (DESRES, Riken, GaMD 5 × 200 ns) reached areas in the PC1-PC2 subspace that
neither of the other two reached. This just confirms that none of these simulations
achieve comprehensive sampling, which, in the ultimate case, would include extensively
unfolded conformations. Each of these simulations reveals cryptic binding sites not seen
in shorter, conventional MD simulations, let alone in the experimental structures, and
some sites are seen in one but not the other simulations.Though perhaps not intuitive at first glance, it is reasonable that many simulations in
parallel would have a greater chance of sampling diverse space than one long simulation,
which may settle into a favorable low energy conformation, even given the boost added by
the GaMD, which is restricted by a threshold.[17] Representative
structures from dense populations of the PCA plots were further analyzed structurally and
through computational solvent mapping (Figure S15). Additionally, RMSD and RMSF comparison revealed that the
majority of the conformational deviations occur within the helices and loops lining the
active site pocket (Figures S16,S17, and S21–S23), which was also visualized in the
representative structures from populations determined from PCA (Figure S15). Taken together, the range of pocket volumes, structural
conformations, and dynamics were similar between short GaMD simulations and brute force
conventional simulations. The plethora of available structures from all of these studies
is expected to be useful for in silico inhibitor development for
targeting Mpro.
Conclusions
GaMD simulations of the SARS-CoV-2 main protease, Mpro, revealed cryptic pockets
not detectable from the crystal structure. In addition to characterizing the vast
conformational landscape of the active site and dimer interface, a distal region was
explored as a potential allosteric site. GaMD sampled conformational space more efficiently
than brute force simulations, with 1 μs simulation data providing comparable results
to 10 μs simulations carried out by Riken[24] and 100 μs
simulations carried out by DE Shaw Research,[25] offering a more widely
accessible simulation strategy. These structures, along with the millisecond simulations
from Folding@home,[28] can serve as a basis for in silico
docking to identify Mpro inhibitors for COVID-19 treatment.
Methods
Simulation Preparation
The first crystal structure of SARS-CoV-2 main protease, PDB ID 6LU7,[5] was used as a
starting point for all simulations. Protonation states of titratable residues were
determined using the H++ webserver[29] with 0.15 M salinity, 10 internal
dielectric constant, 80 external dielectric constant, and pH 7.4. Histidines 64, 80, and
164 were thus named HID to indicate delta nitrogen protonation before protonation of
entire protein using tleap from AmberTools 18.[30] Six systems were
simulated as shown in Figure S1. For apo simulations, the inhibitor was deleted from the PDB file.
For dimer simulations, the PyMOL[31] symexp command was used to generate
initial coordinates for the second chain based on symmetry. The ff14SB[32] forcefield was used for proteogenic residues. For simulations with the N3 bound, the
antechamber package of AmberTools 18 was used to generate GAFF[33]
forcefield parameters for N3. Gaussian 09 was used to calculate partial charges of atoms
according to the RESP[34] method with the HF/6-31G* level of theory. For
covalent simulations, the cysteine 145 was treated as an unnatural amino acid and capped
with N-methyl and C-acetyl groups for charge calculations. A TIP3P isometric water box was
added with at least 10 Å buffer between solute and edges of the box. Enough Na+ was
added to neutralize the system, and then Na+ and Cl– ions were added to a final
concentration of 150 mM. Minimization was carried out in two steps. First, the solute was
restrained with a 500 kcal/mol restraint force to minimize the solvent for 10,000 cycles
followed by unrestrained minimization for 300,000 cycles. Next, the system was heated to
310 K over 350 ps using an isothermal-isovolumetric (NVT) ensemble followed by
isothermal-isobaric (NPT) equilibration for 1 ns.
Gaussian Accelerated Molecular Dynamics
Gaussian accelerated molecular dynamics (GaMD)[17] pmemd.cuda
implementation of Amber 18 was used to generate five independent trajectories of 200 ns,
an aggregate of 1 μs for each system. The dual boost method was employed, adding a
biasing force to both the total and dihedral potential energy. The threshold energy was
set to the upper bound. GaMD production of equilibrated systems was carried out in five
stages. First, 200 ps of preparatory conventional MD simulation was carried out, without
any statistics collected. Second, 1 ns of conventional MD was carried out to collect
potential statistics Vmax, Vmin,
Vavg, and σV. Next, 800 ps of GaMD
was carried out with a boost potential applied with fixed parameters. Then, 50 ns of GaMD
was carried out with updated boost parameters, and finally, 150 ns of GaMD was carried out
with fixed boost parameters. All production simulations were carried out using NVT, with
periodic boundary conditions and 2 fs timesteps. The SHAKE[35] algorithm
was used to restrain nonpolar hydrogen bonds and TIP3P water molecules. The Particle Mesh
Ewald[36] method was used for electrostatic interactions with a 10
Å cutoff for nonbonded interactions. A Langevin thermostat was used for temperature
regulation with a collision frequency of 5 ps–1.
Pocket Analysis
For each system, the aggregated simulation was clustered based on the RMSD from the first
frame using the cpptraj[37] hierarchical algorithm to obtain 100 distinct
structures for pocket calculation. Three regions were defined for individual calculations:
the active site pocket, distal site pocket, and dimer interface. Pocket volume was
calculated using POVME 2[38] with a 1 Å grid spacing. The point
inclusion sphere was determined based on the average center of mass of residues
7–198 with a 12 Å radius for the active site pocket, the average center of
mass of residues 198–306 with a 10 Å radius for the distal site pocket, and
the average center of mass between residues within 3.5 Å of the other dimer with a 10
Å radius for the dimer interface. The pocket volumes were chosen to be underestimated
as opposed to overestimated and were calculated with a minimal radius to avoid extensively
calculating known solvent exposed regions. As shown in the FTMap solvent screening
(Figures S10 and S15), fragments bound to various regions around the distal
site crevices. Increasing the radius of the pocket calculation space would incorporate
solvent exposed regions from many directions that would not be useful and would involve
other regions of the protein. Due to this volume dependence on chosen radius, these
volumes are intended as an initial comparative metrics for identifying structures to
analyze further, rather than concrete final values. For analysis of water occupancy, the
GIST (Grid Inhomogeneous Solvation Theory Method)[39] function of cpptraj
was used for the active site region, and the density of oxygen centers map was used to
visualize results in VMD.[40]To quantify druggability, we have used the PockDrug[19] webserver, which
predicts druggability probability based on a variety of factors including geometry,
hydrophobicity, and aromaticity, among others. Additionally, we have used the FTMap[20] computational solvent mapping webserver to determine druggable “hot
spots.” Pocket volumes calculated by POVME[38] from each of the
three regions were clustered into 10 populations, and representative frames for each were
selected for analysis. The fpocket estimation method was used for PockDrug, and the
druggability probability of the pocket that aligned best to the POVME[38]
calculation sphere was reported.
Principal Component Analysis
Principal motions were calculated using cpptraj, as described previously,[41] from projection of displacement vectors of each of the backbone atoms onto
a diagonalized mass-weighted covariance matrix after rms fitting of every atom except
protons to the first frame. Residue correlation matrices were also generated this way. For
the graphical abstract, reweighting of the potential energy was carried out on one 200 ns
simulation of the one N3 non-covalently bound dimer using the PyReweighting
toolkit.[42] Reweighting was carried out using the Maclaurin series
expansion to the 10th order with a bin size of 6 and maximum energy of 100 based on the
first two principal components. One conventional 200 ns MD simulation was performed for
comparison and was outlined in the figure. For the conformational sampling comparison
shown in Figure , the dimer–apo system
was used for consistency with Riken and DESRES simulations. The colors of the histograms
are normalized so the same number of datapoints (frames) is considered. Therefore, 40,000
frames were used for all of the in house simulations, and 50,000 frames were used for the
Riken and DESRES simulations. The color bars were set to maximum values of 0.0008 and
0.0010, respectively. This results in densely localized (yellow color) in the short
simulations versus less density (blue color) and more spread out in the longer and
enhanced simulations. Visualization of the first normal mode was carried out using the
Normal Mode Wizard plugin of VMD.
Data and Software Availability
In accordance with the community principles around open sharing of COVID19 simulation
data,[43] all simulation input files and data are available at https://covid19.molssi.org through the NSF MolSSI
COVID19 Molecular Structure and Therapeutics Hub.
Authors: Radwa E Mahgoub; Feda E Mohamed; Lara Alzyoud; Bassam R Ali; Juliana Ferreira; Wael M Rabeh; Shaikha S AlNeyadi; Noor Atatreh; Mohammad A Ghattas Journal: Molecules Date: 2022-10-09 Impact factor: 4.927