Thomas J Cross1, Gemma R Takahashi2, Elizabeth M Diessner1,3, Marquise G Crosby2, Vesta Farahmand1, Shannon Zhuang1, Carter T Butts3,4, Rachel W Martin1,2. 1. Department of Chemistry, University of California, Irvine, California 92697-2025, United States. 2. Department of Molecular Biology and Biochemistry, University of California, Irvine, California 92697-3900, United States. 3. California Institute for Telecommunications and Information Technology, University of California, Irvine, California 92697-3900, United States. 4. Departments of Sociology, Statistics, Computer Science, and Electrical Engineering and Computer Science, University of California, Irvine, California 92697-3900, United States.
Abstract
The SARS-CoV-2 main protease (Mpro) is essential to viral replication and cleaves highly specific substrate sequences, making it an obvious target for inhibitor design. However, as for any virus, SARS-CoV-2 is subject to constant neutral drift and selection pressure, with new Mpro mutations arising over time. Identification and structural characterization of Mpro variants is thus critical for robust inhibitor design. Here we report sequence analysis, structure predictions, and molecular modeling for seventy-nine Mpro variants, constituting all clinically observed mutations in this protein as of April 29, 2020. Residue substitution is widely distributed, with some tendency toward larger and more hydrophobic residues. Modeling and protein structure network analysis suggest differences in cohesion and active site flexibility, revealing patterns in viral evolution that have relevance for drug discovery.
The SARS-CoV-2 main protease (Mpro) is essential to viral replication and cleaves highly specific substrate sequences, making it an obvious target for inhibitor design. However, as for any virus, SARS-CoV-2 is subject to constant neutral drift and selection pressure, with new Mpro mutations arising over time. Identification and structural characterization of Mpro variants is thus critical for robust inhibitor design. Here we report sequence analysis, structure predictions, and molecular modeling for seventy-nine Mpro variants, constituting all clinically observed mutations in this protein as of April 29, 2020. Residue substitution is widely distributed, with some tendency toward larger and more hydrophobic residues. Modeling and protein structure network analysis suggest differences in cohesion and active site flexibility, revealing patterns in viral evolution that have relevance for drug discovery.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in late 2019[1] and rapidly spread worldwide, causing an ongoing pandemic. Although the
sequence of its RNA genome is highly similar to that of SARS-CoV-1, SARS-CoV-2 is believed
to have arisen independently from a bat coronavirus,[2] to which it shares
96% similarity.[3] The emerging SARS-CoV-2 subsequently gained a modified
spike protein due to recombination in an intermediate host, possibly the
pangolin,[4,5] followed
by purifying selection for binding to the humanACE2 protein.[6] No
therapeutic agents able to reduce SARS-CoV-2mortality in clinical settings are yet known,
although extensive efforts are underway to discover new drugs or repurpose existing ones to
inhibit key viral proteins. Here we focus on the main protease
(M), which plays a critical role in viral replication. Like
other betacoronaviruses, SARS-CoV-2 is a positive-sense RNA virus that expresses all of its
proteins as a single polypeptide chain, which is cleaved by M
and the papain-like protease PL to yield the mature
proteins.[7]Inhibiting this key enzyme would prevent viral replication, reducing viral load and thus
symptom intensity. A similar approach was instrumental in making HIV a manageable
disease.[8−10] However, the proteins in
question differ markedly, rendering HIV protease inhibitors ineffective against SARS-CoV-2;
indeed, a standard HIV protease inhibitor combination did not prove effective against
COVID-19 in a recent clinical trial.[11] Specifically, HIV protease is an
aspartic protease, whereas M is a 3CL cysteine protease. The
3CL cysteine proteases are characterized by a chymotrypsin-like fold and a
cysteine-histidine catalytic dyad in the active site, implying both different structures and
distinct chemical mechanisms. Although the general strategy of seeking protease inhibitors
is hence viable for both SARS-CoV-2 and HIV, drug development for the former depends on
characterizing this novel enzyme.The M sequence and three-dimensional structure are highly
conserved among coronaviruses, with a characteristic three-domain fold.[12]
Domains I and II make up the catalytic region, which has a chymotrypsin-like structure,
while domain III is an α-helical domain that is primarily responsible for
dimerization.[13,14]
Although a mix of monomers and dimers is found in solution,[15] studies of
the SARS-CoV-1 M have shown that the dimer is the functional
state in the mature protein, whereas the monomer has dramatically reduced trans-enzymatic
activity against the substrate.[13] Removing key residues involved in
dimerization by mutation or truncation of the C-terminus results in lower enzymatic activity
concomitant with an increased population of monomers.[16] The N-finger
(residues 1–7) is not required for dimerization[17] but is necessary
for activity, apparently due to highly specific interactions between the N-terminus of the
inactive monomer and domains II and III of the active one.[18−20] In the G11A variant, the N-finger is unable to occupy the correct
binding site, leading to dissociation of the dimer, loss of activity, and collapse of the
oxyanion hole that is an essential part of the substrate-binding pocket.[21] The global efforts to document clinically isolated SARS-CoV-2 sequences since the
beginning of the pandemic provides an opportunity to examine the impact of mutations on the
protein structure and dynamics, and as more sequences are collected going forward, to
discover which residues and functional features are absolutely conserved.Molecular modeling is an important tool for guiding inhibitor discovery, making it possible
to evaluate large numbers of candidate drugs in silico to select
experimental targets; however, standard approaches screen against only one version of the
protein, typically the reference or wild-type (WT) sequence. In a host population, mutations
accumulate with each viral passage, generating a mutational landscape rather than a single
protein. The design of robust inhibitors that can protect against the multiple strains
encountered in clinical settings requires the characterization of this sequence space and
the populations of conformations it engenders. Furthermore, effective and rapid response to
future emerging coronavirus diseases requires both in silico screening and
experimental testing of antiviral agents and a validated library of relatively general
inhibitors that can be used as a basis for the development of specialized therapeutics.
Central to the success of that effort will be developing an understanding of structural and
functional variation in SARS-CoV-2 proteins, particularly as mutations accumulate and new
strains emerge.In general, SARS-CoV-2 is mutating more slowly than would be expected under neutral drift,
suggesting that most of the genome is subject to purifying selection,[22]
although this does not rule out adaptive mutations in specific genes or other sequence
regions.[23] Especially with a newly emerged zoonotic disease, there is
no reason to assume that any viral protein is currently at a global optimum with respect to
function, and function-enhancing variants may appear at any time. For example, mutation of
the SARS-CoV-2spike protein is a topic of intense current interest, as its D614G variant
appears to confer enhanced infectivity,[24] has spread rapidly through the
global population,[25] and correlates with a higher mortality rate.[26] M appears to be relatively tolerant of
mutations near the active site,[27] underscoring the importance of mapping
the mutational landscape of active variants so that inhibitors whose binding depends on
highly specific interactions with mutation-prone residues can be eliminated early in the
screening process. Here we characterize all 79 known variants of
M as of 29 April, 2020, and analyze trends in amino acid
substitutions and the resulting structural changes for both monomers and dimers, using
molecular modeling and moiety-level protein structure network analysis. Our analysis shows a
trend toward substitution for larger and more hydrophobic residues versus the WT protein.
Analysis of active site networks (ASNs) from M variants
suggests differences in active site flexibility and cohesion that may serve to guide the
design of robust, mutation-resistant inhibitors. Intervariant differences are also observed
among the dimer interfaces, which is another potential inhibitor target.
Materials and Methods
Sequence Analysis and Clustering
SARS-CoV-2 genome sequences were found by searching the GISAID (https://www.gisaid.org/)[28]
EpiCoV database on May 3, 2020, using the host keyword “human” and a
submission cutoff date of April 29, 2020, yielding a total of 15 432 SARS-CoV-2
genomes. Genomes outside the range of ±3% reference (RefSeq: NC 045512.2) length
(29 006 bp–30 800 bp inclusive) or ≥1% N content were removed,
leaving 10 644 “high-quality” sequences. Open reading frames in these
high-quality full genomes were compared with a reference M
nucleotide sequence (WT, RefSeq, NC 045512.2; loc, 10 055–10 972) to
extract M sequences of at least 80% similarity using a
script written in Python v3.7.0.[29] Genomes with gaps or ambiguous
nucleotides (e.g., N, S, D, per International Union of Pure and Applied Chemistry (IUPAC)
nomenclature[30]) in the M sequence were
excluded from this data set, leaving a total of 10 578 sequences from high-quality
genomes.Nucleotide sequences were converted into amino acid sequences and screened for
nonsynonymous mutations against the WT M using code written
in Wolfram Mathematica 12.1,[31] yielding 511 nonsynonymous mutations in
M, 77 of which were unique. A single unique
M variant, found in an April 24, 2020 data set, but no
longer available in the GISAID database, was also used in our analyses.
M variants were used in phylogenetic analyses along
with reference human, bat, and pangolin viruses. Full genome alignments were performed
using MUSCLE (v3.8.1551, max 8 iterations, enable find diagonals)[32] on
the complete set of nonsynonymous M mutants as well as
reference WT, bat, and pangolin sequences. Trees were generated in MEGA X,[33] using the Neighbor-Joining method;[34] a bootstrap
test[35] of 1000 replicates was performed, and distances were
calculated using the Maximum Composite Likelihood model.[35] In all, 515
full genomes were used in phylogenetic analyses; 78 unique
M mutants and a reference WT sequence (79 total) were
used for molecular modeling.
Molecular Modeling of Wild-Type and Variant Protein Structures
Initial conditions for the WT trajectories used here are based on the PDB structure
6Y2E,[36]
representing a mature (i.e., cleaved pro-sequence) protein. For monomer trajectories, the
A chain of 6Y2E was employed. Initial variant monomer and dimer structures were,
respectively, predicted using MODELLER 9.23,[37] using the 6Y2E structure
as a template; three rounds of annealing and MD refinement were performed using the
“slow” optimization level for each (final objective function values are
provided in Table S7). Initial structures were then processed to correct protonation
states to reflect their predicted cellular environment (with protonation states predicted
using PROPKA 3.1[38]). Each corrected model structure was then minimized
and equilibrated in explicit solvent; simulations were performed using NAMD[39] with the CHARMM36 force field[40] in TIP3P water[41] at 310 K under periodic boundary conditions (with a 10 Å margin water
box). Solvated protein models were energy-minimized for 10 000 iterations before
being simulated for 0.5 ns, followed by a water box size adjustment, after which a 10 ns
trajectory was simulated with conformations being sampled every 20 ps; an
NpT ensemble was used, with temperature controlled via Langevin
dynamics with a damping coefficient of 1/ps and Nosé–Hoover Langevin piston
pressure control set to 1 atm.[42,43] Final conformations from each trajectory were used to generate figures
in the main text and Supporting Information.
Network Analysis
A protein structure network (PSN) was calculated for each modeled conformation of each
variant via scripts employing the statnet,[44−46] Rpdb,[47] and bio3d[48] libraries
for R.[49] Vertices were defined using the method used in ref (50), where each node represents a chemical moiety, with
edges being defined by interatomic contacts. Specifically, two nodes i
and j are considered adjacent if i contains atom
g and j contains atom h such that the
g and h distance is less than 1.1 times the sum of
their respective van der Waals radii (using values from ref (51)). The node definitions are illustrated in Figure S1A, and a small-moiety PSN of this type for WT
M is shown in Figure S1B. Active site networks (ASNs) were constructed from each PSN as
described in ref (52). Briefly, all vertices
belonging to the catalytic Cys and His residues were identified, along with all vertices
adjacent to these vertices within the PSN. The ASN was then defined as the subgraph of the
corresponding PSN induced by this combined vertex set (Figure S1C). In the case of dimer models, one ASN was constructed for each
chain.To assess overall cohesion, degree k-core values[53]
were calculated for each vertex in each PSN, and the average core number was computed for
the entire protein and for the vertices in each domain, respectively. All calculations
were performed using the sna library[46] for R. For dimer PSNs,
interfacial moieties were identified by selecting all vertices adjacent to at least one
vertex in the opposing chain, and average core numbers were computed for these interfacial
vertices to assess cohesion in the dimerization interface; raw counts of edges spanning
the two chains (interfacial tie volumes) were also calculated. For each vertex associated
with a moiety in the active site, three measures identified as associated with active site
constraint by ref (52) were computed: the degree,
or number of ties to other vertices; the triangle degree, or number of triangles
(3-cliques) to which the vertex belongs; and core number, or number of the highest degree
k-core[54] to which the vertex belongs. Physically,
these, respectively, indicate the total number of contacts associated with the chemical
group (potentially impeding its motion), the number of truss-like, triangular structures
in which the group is embedded (again, restricting mobility), and the extent of local
cohesion around the chemical group, which is found to distinguish “tighter”
and “looser” packing regimes.[55] To summarize the impact
of each measure over the active site as a whole, values were averaged across active site
vertices. As an additional constraint measure, the number of paths between each pair of
active site vertices through neighboring (i.e., nonactive site) vertices was computed, and
the log of the minimum of this value over the set of active site vertex pairs was employed
as a measure of site cohesion. Intuitively, high values of site cohesion indicate that all
active site chemical groups are connected by a large number of indirect contacts, while
low values suggest that at least one pair of active site moieties has few local pathways
holding them together. These four indices (mean active site degree, mean active site
triangle degree, mean active site core number, and site cohesion) were used to produce an
omnibus index of site constraint via principal component analysis (PCA) of the
standardized network measures over all modeled conformations, as described in ref (52). This first principal component (the constraint
score) accounted for approximately 71% of the variance in all four measures in the case of
monomer ASNs, and the ratio of its associated eigenvalue to the next largest was
approximately 4.7 (confirming the dominance of the principal eigenvector). This process
was repeated for the dimer ASNs, resulting in a constraint score vector accounting for
approximately 69% of the total index variance, with an eigenvalue ratio of approximately
4.3.
Comparing Mean Cohesion and Constraint Scores Across Variants
Because cohesion and constraint scores are heavily autocorrelated within trajectories,
we employ a parametric bootstrap strategy to obtain autocorrelation-corrected standard
errors and confidence intervals.[56] For each time series of scores for
each trajectory, an autoregressive (AR) model with AIC-selected order was fit, and the
estimated series mean was obtained (estimation performed by maximum likelihood
estimation using the ar function in R[49]). The whitened residuals from
the time series model were then used to construct 5000 parametric bootstrap replicate
series, which were then refit to obtain bootstrap replicate means. Mean estimates from
the bootstrap replicates were used to construct 95% bootstrap confidence intervals and
standard errors for the series mean. This procedure was applied to the MD trajectory for
each variant. For cohesion scores, mean and bootstrap standard errors are provided for
the full protein and each domain in Table S3.
Kernel PCA of Active Site Networks
To identify key features differentiating active site conformations observed throughout
the entire sample of trajectories, we employ a kernelized principal component analysis
(kernel PCA[57]) of a stratified sample of monomer and dimer ASNs from
all M variants. The ASN sample consists of 25 evenly
spaced conformations from each trajectory (monomer, dimer A chain, and dimer B chain),
for a total of 5925 networks. For comparative purposes, each ASN was mapped to the set
of all unique vertices appearing in any ASN in the sample, and the upper triangle of the
associated adjacency matrix was vectorized, yielding a binary vector of fixed dimension
encoding each network. Analysis of the vectorized ASNs was performed using a disjunctive
normal form (DNF) kernel,[58] defined
bywhere
x and y are respective ASN vectors, and ϵ is a
free parameter controlling regularization. In a graph-theoretic context, the feature
space of the DNF kernel consists of the set of all labeled subgraphs of the inputs;
thus, kernel PCA on this space identifies combinations of labeled subgraphs that
efficiently discriminate among ASNs. Hyperparameter tuning was performed via grid
search, with the performance assessed by reconstruction of a held-out sample of 100
randomly selected ASNs, following training on a separate sample of 1000 ASNs.
Reconstruction was performed using the preimage method from ref (59) with a local sample of 10 neighbors; the optimal
hyperparameter obtained was ϵ = 0.0739. Following hyperparameter tuning, principal
component scores were calculated for use in subsequent analysis. Analyses were performed
using R[49] with portions implemented using Rcpp.[60]
Results and Discussion
Mutations in M Are Geographically Distributed
From the GISAID (https://www.gisaid.org/)[28] EpiCoV database (through April 29,
2020), 78 unique nonsynonymous mutations to M were found in
addition to the WT sequence, including 73 single point variants and 5 double variants. For
genome sequences containing these M variants, full genome
alignments were performed using MUSCLE,[32] and neighbor-joining trees
were generated using MEGA X.[33] Overall, the variation in SARS-CoV-2
sequences observed so far is relatively low, with mutation hotspots not evenly distributed
throughout the genome, but localized to specific sequence regions.[61]
Because M is critical for viral replication, mutations that
have a large deleterious effect on virus replication are unlikely to be observed in
clinical isolates; all M variants investigated here are
therefore assumed to be enzymatically competent. In general, codon usage and amino acid
frequency in viruses of eukaryotes are essentially identical to those of their eukaryotic
hosts, reflecting the viruses’ use of the host translation machinery.[62]The known mutations in M are summarized in Figure . The tree was generated based on overall
genome similarity; however, only sequences containing at least one nonsynonymous mutation
in M were included in the analysis, along with the WT human
sequence and two nonhuman reference sequences. The accession numbers and geographical
sources are listed in Table S1. The solid arcs around the outside of the diagram indicate
M mutations; color coding corresponds to the
geographical source. Several mutations appear to have arisen more than once in the
virus’s evolutionary history so far. Notably, K90R variants appear in multiple
distantly related subtrees; five of these unique evolutionary events can be verified in
Nextstrain’s SARS-CoV-2 phylogenetic tree.[63] Further, L89F,
P108S, and N274D arise at least twice in both trees.
Figure 1
Optimal tree generated using 512 full mutant genomes and three reference genomes:
human wild-type (WT),[64] bat,[3] and
pangolin.[65] Only topology is shown; branch lengths are not to
scale (average branch length = 1.432161 × 10–4 base
substitutions per site). Each continuous arc corresponds to a variant label; these
represent only adjacent branches with the same mutation in
M and do not necessarily indicate shared ancestry.
Branches and arcs from human clinical samples are color coded by location, which
includes the following subregions: Africa, light blue (Democratic Republic of the
Congo); Asia, green (Beijing, Fujian, Malaysia, Shanghai, Vietnam, and Wuhan);
Australia, gold; Central Eurasia, pink (Georgia, Jordan, Russia, and Turkey); Europe,
red (Belgium, Denmark, England, Finland, France, Germany, Iceland, Luxembourg,
Netherlands, Scotland, Spain, Sweden, Switzerland, and Wales); North America, purple
(Costa Rica and United States of America); South America, yellow (Argentina and
Brazil). Subtrees that contained identical subregions and mutations have been
condensed into a single branch; all subtrees and their constituent accessions can be
found in Table S1.
Optimal tree generated using 512 full mutant genomes and three reference genomes:
human wild-type (WT),[64] bat,[3] and
pangolin.[65] Only topology is shown; branch lengths are not to
scale (average branch length = 1.432161 × 10–4 base
substitutions per site). Each continuous arc corresponds to a variant label; these
represent only adjacent branches with the same mutation in
M and do not necessarily indicate shared ancestry.
Branches and arcs from human clinical samples are color coded by location, which
includes the following subregions: Africa, light blue (Democratic Republic of the
Congo); Asia, green (Beijing, Fujian, Malaysia, Shanghai, Vietnam, and Wuhan);
Australia, gold; Central Eurasia, pink (Georgia, Jordan, Russia, and Turkey); Europe,
red (Belgium, Denmark, England, Finland, France, Germany, Iceland, Luxembourg,
Netherlands, Scotland, Spain, Sweden, Switzerland, and Wales); North America, purple
(Costa Rica and United States of America); South America, yellow (Argentina and
Brazil). Subtrees that contained identical subregions and mutations have been
condensed into a single branch; all subtrees and their constituent accessions can be
found in Table S1.These phylogenetic comparisons appear to support a multiple event hypothesis but are
subject to errors resulting from the sparsity of testing. The repeated occurrence of the
same mutation in seemingly unrelated subtrees may be due to missing data that would show
their evolutionary connectedness. The average branch length of Figure
, which shows only topology, is 1.432161 ×
10–4 base substitutions per site (including those from the bat[3] and pangolin[65]); 32.2% of the 1028 branches have, to ten
significant figures, 0 base substitutions per site. For a genome of roughly 30 000
base pairs, this amounts to an average of only 4 substitutions per branch. All of these
unique mutants therefore effectively belong to the same strain, making them difficult to
place in an evolutionary context. For more diverged mutants, unfortunately placed
ambiguous nucleotides[30] could push them from one subtree to another.
With the exception of five double variants, the sequences in Figure arise from single point mutations. Whether and how
M mutations have affected viral fitness is not yet
known, but at least three mutants have remained in the population long enough to
accumulate another mutation: L220F to A191V/L220F, G15S to G15S/D48E and G15S/V35L, and
K90R to V77A/K90R. It is worth noting, that although a single variant A191V exists, the
A191V/L220F double variant likely stemmed from an L220F ancestor due to its shared lineage
with L220F single variants. A fifth double variant, A193T/R279C, was found but did not
stem from any single mutation in our data set; its origins remain unclear.Although a mutation’s prevalence and evolution in a population may be interpreted
as a sign of robust viral function, the opposite does not necessarily indicate reduced
infectivity. Testing rates, social behavior, and time of first infection in each region
are all factors that contribute to the spread of the disease and the availability of
sequencing data. For instance, a large number of K90R mutants were collected in Iceland,
where the number of tests per 1000 people is nearly twice as many as the next leading
country’s and more than seven times as many as the United States’ (Iceland,
141.75; USA, 18.21, as of April 29, 2020).[66] Consequently, further
investigation is needed to determine whether M mutations
affect viral fitness on a global scale. As such, without greater divergence and more
sequences, it is difficult to tell if the presence of an M
mutation in unrelated subtrees is evidence of multiple evolutionary events, or an artifact
of sparse testing. Despite these caveats, it is clear that these 78 variants are
functional enough to infect people within their local populations. In the structural
analyses that follow, we focus on the differences in protein properties of the clinically
observed M variants relative to WT.Because only sequences harboring M mutations were retained
for analysis, certain geographical areas appear to be underrepresented. It is likely that
the strains that had spread to underrepresented regions prior to our data collection
simply did not have M mutations. Different regions tend to
be dominated by different mutants, a feature that might be explained by the timing at
which these mutations arose or arrived. For instance, 83 of the 100
M mutants from Iceland were K90R, and most stemmed from
a single shared ancestor (see Supporting Information). Further, it is likely that heterogeneity in
sequencing rates have resulted in a less-than-complete data set. As of April 29th, the
only North American, South American, and African M mutants
reported in the GISAID database that passed our filtering parameters were from Costa Rica
and the USA, Argentina and Brazil, and the DRC respectively. This does not necessarily
indicate a lack of M mutations in other subregions and may
instead reflect differences in sequencing rates.It is worth noting that some nodes in expanded Figure (see Supporting Information) exhibit bootstrap values less than 10 (i.e., their
branch relationships were shown in fewer than 10% of replicates). The majority of these
extremely low values can be found on large subtrees that share an
M mutation, like D248E or K90R, many of which likely
spread from a common source. Their branches can often be interchanged with no reduction in
accuracy due to high sequence identity. Low bootstrap values here primarily reflect
genotype similarity within subtrees rather than low accuracy. This is a known property of
the Felsenstein procedure,[35] which bootstraps loci instead of
sequences; low bootstrap values indicate groupings that are sensitively dependent upon
differences in small numbers of loci but do not necessarily reflect uncertainty due to
sequence sampling.
M Mutations to Date Suggest Selection for Larger, More
Massive, and More Hydrophobic Residues
To reveal the global pattern of substitutions, we visualize mutations in
M independent of sequence position or location in the
three-dimensional structure, by a network where the nodes, or vertices, are amino acid
types and the edges (represented by arrows pointing in the direction of substitution) are
directional indicators of how often one amino acid was observed to substitute for another
(Figure ). The weights of the edges indicate
the frequency of the mutation across known M variants,
while node color reflects residue hydrophobicity on the scale of Kyte and
Doolittle.[67] The most obvious trend observed in the pattern of
mutation so far is the preferential substitution of larger, more hydrophobic amino acids
in place of smaller, less hydrophobic ones. The pattern is consistent with increased
incidence of amino acid types that are more likely to be present in folded domains rather
than those found more often in less-structured linker regions.[68]
Figure 2
Amino acid substitutions in SARS-CoV-2 M observed up
to April 29, 2020. Arrows indicate the direction of substitution: an arrow from
i to j indicates at least one clinically observed
substitution of residue type i to residue type j;
heavier lines indicate larger numbers of observed substitutions. Color indicates
hydrophobicity, using the scale of Kyte and Doolittle.[67] In
general, substitution has been toward larger and more hydrophobic residues.
Amino acid substitutions in SARS-CoV-2 M observed up
to April 29, 2020. Arrows indicate the direction of substitution: an arrow from
i to j indicates at least one clinically observed
substitution of residue type i to residue type j;
heavier lines indicate larger numbers of observed substitutions. Color indicates
hydrophobicity, using the scale of Kyte and Doolittle.[67] In
general, substitution has been toward larger and more hydrophobic residues.In particular, it is notable that alanine has very few incoming ties and a large number
of outgoing ties, mostly to valine, which has a larger and more hydrophobic side chain.
Alanine is at the same time one of the most common amino acids and one of those with the
most variable prevalence in the human genome.[69] Similarly, observed
ties to isoleucine are mostly incoming from smaller residues; leucine likewise has more
incoming than outgoing ties, with the bulk of its outgoing ties going to phenylalanine,
which is also large and hydrophobic. However, aromatic residues per se do not appear to be
selected. For example, tyrosine has mostly outgoing ties. Also notable are the selection
away from the secondary structure breakers proline and glycine, both of which have only
outgoing ties, and the propensity for lysine to be replaced by arginine even though both
side chains are positively charged. Arginine is both larger and capable of making more and
stronger hydrogen bonds, as well as cation-π interactions not available to lysine,
leading to its known overrepresentation in interdomain and intermonomer
interfaces.[70−73]The mean differences in side chain properties for observed
M mutations are summarized in Table
. As observed in the network representation (Figure ), on average, the mutated residues are larger and more
hydrophobic than the ones they replace. Although substituted residues are on average
larger and more massive, we do not see strong evidence favoring bulky residues over more
compact ones, independent of mass: residue bulk (measured as volume/mass) for substituted
residues did not differ significantly from WT (mean difference = 0.02
Å3/Da, t = 1.87, p = 0.0650). The
variant sequences are not significantly different from WT in charge or aromatic
content.
Table 1
Mean Differences in Side Chain Properties for Substituted Residues versus WT
(N = 83; Substitutions from Double Mutants Considered
Separately)a
mean difference
std error
t value
p value
polar (1 = true)
0.08
0.07
1.22
0.2251
hydrophobicity
1.03
0.30
3.47
0.0008d
charge
–0.05
0.04
–1.27
0.2078
aromatic (1 = true)
0.07
0.04
1.62
0.1093
mass (Da)
9.97
3.49
2.85
0.0055c
volume (Å3)
11.58
3.65
3.17
0.0021c
bulk (Å3/Da)
0.02
0.01
1.87
0.0650
On average, substituted residues are significantly more hydrophobic, more massive,
and larger in volume than those they replace (all p values for
two-tailed t-tests versus no difference).
P < 0.05
P < 0.01
P < 0.001
On average, substituted residues are significantly more hydrophobic, more massive,
and larger in volume than those they replace (all p values for
two-tailed t-tests versus no difference).P < 0.05P < 0.01P < 0.001
M Mutations Occur Throughout the Protein
In order to provide context for the discussion of conformational changes and the
locations of mutation sites, a ribbon diagram of the SARS-CoV-2
M monomer is shown in Figure A. The three domains, the active site residues, and other
structural features are labeled. The dimer is shown in Figure B; the dashed box shows the approximate location of the interchain
salt bridge shown in Figure C. In SARS-CoV-1
M, this salt bridge between Arg4 of Chain A and Glu290
of Chain B is an important component of the dimer interface,[74] and
substitution of either or both of these residues with alanine resulted in shifting the
equilibrium from dimer to monomer, although dimerization was not abolished
completely.[14] Panels D and E show two different views of the dimer
interface, with key residues labeled. In addition to important interfacial residues, we
also highlight key residues near the oxyanion hole, a structural feature found in many
serine and cysteine proteases that stabilizes negatively charged, tetrahedral transition
states.[75]
Figure 3
(A) Ribbon diagram for the M monomer, with key
structural features labeled, including the three domains, the N- and C-termini, and
the active site residues, H41 and C145. (B) Ribbon diagram for the
M dimer, with separate chains shown in light gray
and black. The N- and C-termini, both of which mediate interactions at the dimer
interface, are labeled for each monomer. The dashed square shows the approximate
location of the salt bridge shown in panel C. (C) A salt bridge between Arg4 of one
monomer and Glu290 of the other is shown. This salt bridge is hypothesized to be
important for stabilizing the dimer interface and maintaining the active conformation.
(D, E) The dimer interface is shown from two different angles, with key residues
labeled.
(A) Ribbon diagram for the M monomer, with key
structural features labeled, including the three domains, the N- and C-termini, and
the active site residues, H41 and C145. (B) Ribbon diagram for the
M dimer, with separate chains shown in light gray
and black. The N- and C-termini, both of which mediate interactions at the dimer
interface, are labeled for each monomer. The dashed square shows the approximate
location of the salt bridge shown in panel C. (C) A salt bridge between Arg4 of one
monomer and Glu290 of the other is shown. This salt bridge is hypothesized to be
important for stabilizing the dimer interface and maintaining the active conformation.
(D, E) The dimer interface is shown from two different angles, with key residues
labeled.In SARS-CoV-2 M, several interactions at or near the dimer
interface play a role in keeping the oxyanion hole in the active conformation. An
intrachain hydrogen bond between the side chain NH2 of Arg298 and the backbone
carbonyl of Met6 appears to help hold the N-terminus (also called the N-finger) in place
to make the appropriate contacts with the other chain.[76] In the S1
specificity pocket, a ring-stacking interaction between the side chains of His163 and
Phe140 is stabilized by a network of hydrogen bonds comprising residues from both
monomers. Hydrogen bonding interactions between Glu166 and Ser139 of Chain A with the
N-terminus of Chain B (Ser1) are also important for stabilizing the dimer interface, along
with interactions between the Ser10 residues of both chains. Mutating the adjacent Gly11
to Ala completely abolishes dimerization due to a dramatic conformational change to the
N-finger that causes disruption of the interface.[21] Although the
mutational studies discussed here were performed using SARS-CoV-1
M, many of the key residues are also found in
SARS-CoV-2 M, and we hypothesize that they play similar
roles.Figure A shows the
M dimer with mutation sites indicated by spheres. The
mutations observed so far are relatively evenly distributed throughout the protein.
Notably, mutations are tolerated close to the active site residues, near the oxyanion
hole, and at the termini, both of which are involved in modulating the dimer
conformation.[17] Two different views of the dimer interface are shown
in Figure B, with interfacial residues shown as
space-filling models and with mutated residues highlighted in light blue (Chain A) or dark
blue (Chain B). Despite the importance of the dimer interface, several mutations are
observed in this region: M6L, A7V, A116V, S121L, C300S, S301L, and G302C. Particularly
notable are M6L and A7V, which are part of the N-finger that participates in interactions
with the C-terminal domain of the second monomer. In both cases, the hydrophobic character
of the residue is preserved, with a slight increase in the size of the residue. With the
exception of A116V, the other mutations represent qualitative changes in side chain
properties. These changes in chemical properties are apparently well tolerated, as all of
these mutations were found in clinical isolates.
Figure 4
(A) Ribbon diagram for the M dimer, with separate
chains shown in light gray/light blue and purple/dark blue. The locations of all
observed mutations are indicated by spheres centered on the α-carbons. (B) Two
different views of the dimer are shown, with all residues at the dimer interface
rendered as space-filling models, providing a visualization of the size of the dimer
interface and its spatial relationship to the active site residues (blue, His41;
yellow, Cys145) Residues having at least one mutation in this data set are highlighted
in light blue (Chain A) or dark blue (Chain B.).
(A) Ribbon diagram for the M dimer, with separate
chains shown in light gray/light blue and purple/dark blue. The locations of all
observed mutations are indicated by spheres centered on the α-carbons. (B) Two
different views of the dimer are shown, with all residues at the dimer interface
rendered as space-filling models, providing a visualization of the size of the dimer
interface and its spatial relationship to the active site residues (blue, His41;
yellow, Cys145) Residues having at least one mutation in this data set are highlighted
in light blue (Chain A) or dark blue (Chain B.).
Molecular Modeling Suggests Regionally Specific Differences in
M Variant Structure
For WT M and each variant, molecular models of the monomer
and dimer structures were constructed using MODELLER 9.23,[37] based on
PDB structure 6Y2E,[36] followed by annealing, correction of protonation states, and all-atom
molecular dynamics simulation in explicit solvent (see Materials and
Methods). Examples of representative models are shown in Figure S2, intramolecular contacts for residue 225 in WT and the T225I
variant are shown in Figure S3 and listed in Table S2, and selected double mutants are shown in Figure S4. The positions of all observed K to R and R to C mutations, which
are commonly observed in interfaces, are shown in Figure S5. In M, two such mutations, K236R
and R279C, are located in domain III, near the dimer interface. The positions of all
mutated residues are mapped onto the WT structure in Figure S6, with color coding indicating the change (if any) in side chain
chemical properties. We do not observe gross differences in structure or dynamics across
variants, as expected given that all variants were found in clinical isolates and are
therefore necessarily functional; mutations leading to radically altered or misfolded
structures would likely be strongly selected against. However, analysis of MD trajectories
does suggest more subtle differences across variants, providing insight into
function-preserving changes.To assess the overall degree to which local structure is conserved across
M variants, we compute the cross-variant variance in
average ϕ,ψ backbone torsion angles by residue within free monomers. In order
to control for overall flexibility, we normalize this by the estimated variance in torsion
angles within each trajectory. For arbitrary angle α at
residue i, this leads to the local variation
indexwhere
α is the vector of angles of type
α over the trajectory of variant j
with corresponding angular mean , is the vector of such means across variants, VarB is the
“between variant” angular variance in mean angles, and VarW is
the “within variant” angular variance in
α. Intuitively, high values of
v(α) indicate relatively large
between-variant variation in α relative to angular
variation seen within the trajectories themselves. For
v(ϕ) and
v(ψ), such values correspond to
systematic changes in local conformation associated with M
mutations. By turns, low values of
v(ϕ) and
v(ψ) indicate residues whose local
structure does not vary meaningfully across variants. It should be noted that such regions
can be either flexible or rigid.Figure shows the mean local variation indices
for ϕ,ψ by the residue for the 79 M variants,
indicated by color on the structure of M WT. (Separate
values for ϕ and ψ are shown in Figure S7.) It is immediately noteworthy that, with the minor exception of
two small loop regions around N277 and F223 (respectively), Domain III shows little
systematic variation across variants. The β-sheet-rich structure around the active
site is also relatively well-conserved. By contrast, we see relatively high levels of
between-variant difference in the interdomain region involving the termini (residues G2-A7
and S301–F305) and the double loop “active site gateway” region
involving (respectively) L50-Y54 and D187-A191. The former is potentially significant in
influencing large-scale flexibility (possibly relevant to dimerization), whereas the
latter is of obvious relevance to substrate processing and specificity. This motivates a
more detailed examination of variation in the active site, to which we return below.
Figure 5
Local variation indices for M monomer backbone torsion
angles (front/back views). Blue residues show higher levels of cross-variant
ϕ,ψ differences relative to baseline variation; red residues show little
evidence of structural difference across variants. Domain III is substantially
conserved, while greater change is seen in the interdomain regions and loop regions
adjacent to the active site.
Local variation indices for M monomer backbone torsion
angles (front/back views). Blue residues show higher levels of cross-variant
ϕ,ψ differences relative to baseline variation; red residues show little
evidence of structural difference across variants. Domain III is substantially
conserved, while greater change is seen in the interdomain regions and loop regions
adjacent to the active site.The relatively high levels of conformational variation in the interdomain regions suggest
functionally relevant differences in global cohesion across variants. To assess this, we
employ protein structure networks (PSNs), which are well-suited for assessing the
looseness or cohesiveness of contacts among chemical groups.For all M variants, moiety-level PSNs were constructed for
each frame within each variant trajectory, using the definitions found in ref (50) (Figure S1) The assessment
of global cohesion was performed by computing the mean degree k-core
number for all moieties in each structure; to allow comparison of global cohesion within
domains, we also compute mean core numbers within each of the three domains. The mean core
number can be considered an index of structural cohesion, with higher values indicating
greater numbers of redundant contacts among chemical groups.[55] To
account for within-trajectory autocorrelation in comparing mean core numbers,
autocorrelation-corrected parametric bootstrap confidence intervals and standard errors
were employed.Figure shows global and domain-specific
cohesion levels (i.e., mean core numbers) for all variant monomers, sorted in descending
order of mean cohesion. Means and standard errors for each variant can be found in
Table S3. As suggested from the torsion angle analysis, cohesion differs
significantly among variants, both globally and within domains. On average, the majority
of variants are estimated to be less cohesively structured than WT, with the exception of
Domain III (in which WT does not differ significantly from the mean). It is possible that
these differences indicate selection for more globally flexible structures (again, with
the exception of Domain III). Whether or not this is the case, however, it appears clear
that less cohesive structures are not strongly selected against. Such flexibility may
affect dimerization kinetics, which is relevant to protease function and the development
of robust dimerization inhibitors.
Figure 6
Mean core numbers for M monomer PSNs, by variant
(ordering is by mean value in each panel). Points indicate trajectory means, with
segments showing autocorrelation-corrected 95% bootstrap confidence intervals;
red/blue intervals have t values versus WT (green) of at least
±2, indicating significant variation in structural cohesion across variants.
Overall, the majority of variants are less cohesive than WT globally and in Domains I
and II, while Domain III cohesion in WT is typical of the variant set.
Mean core numbers for M monomer PSNs, by variant
(ordering is by mean value in each panel). Points indicate trajectory means, with
segments showing autocorrelation-corrected 95% bootstrap confidence intervals;
red/blue intervals have t values versus WT (green) of at least
±2, indicating significant variation in structural cohesion across variants.
Overall, the majority of variants are less cohesive than WT globally and in Domains I
and II, while Domain III cohesion in WT is typical of the variant set.While cohesion and flexibility within M monomers may
provide clues to how mutations may impact dimerization, examination of the interfacial
region within M dimers suggests indications of the impact
of mutations on the behavior and stability of the dimers themselves. Figure A shows mean cohesion scores (i.e., mean degree
k-core numbers) for interfacial moieties in dimer trajectories for all
M variants, together with autocorrelation-corrected 95%
bootstrap confidence intervals; values are also provided in Table S4. (We here define a moiety to be interfacial if it is adjacent to at
least one chemical group in the opposite chain (Figure C). This is assessed on a frame-by-frame basis, allowing us to account for
changes in contact patterns within trajectories.) Although there is some variation in
interfacial cohesion, relatively few trajectories differ significantly from the mean
cohesion level, and very few are two standard errors away from wild-type (12 being more
cohesive and 1 being less cohesive). This is consistent with the hypothesis that
interfacial cohesion is strongly conserved in M, suggesting
that a looser or more dynamic dimer core may impede function. (The fact that, of those
variants differing significantly from WT, all but one show greater cohesion is also
consistent with this determination, albeit not determinative.)
Figure 7
(A) Mean core number for moieties at the dimer interface across all variants, with
95% autocorelation-corected bootstrap confidence intervals. Higher values indicate a
higher core number (greater cohesion); variants indicated in red/blue have significant
variation in core number relative to WT (green). The vertical dotted line indicates
grand mean. (B) Mean tie volume across the dimer interface, ordered per panel A.
Variants indicated in red/blue have significantly lower/higher tie volumes across the
interface relative to WT (green). (C) Illustrative moiety-level PSN for the WT
M dimer. Moieties associated with chains A and B
are indicated in gray and dark purple, respectively. Moieties making up the
interfacial residues for chains A and B are indicated in light gray and light purple,
respectively. Also shown are moieties associated with the active site His (blue) and
Cys (yellow.) (D) Induced subgraph of the PSN in panel C based on interfacial
moieties. Interfacial tie volume is the count of ties from A (gray) to B (purple)
interfacial nodes, while embeddedness in locally cohesive structures contributes to
the mean k-core number.
(A) Mean core number for moieties at the dimer interface across all variants, with
95% autocorelation-corected bootstrap confidence intervals. Higher values indicate a
higher core number (greater cohesion); variants indicated in red/blue have significant
variation in core number relative to WT (green). The vertical dotted line indicates
grand mean. (B) Mean tie volume across the dimer interface, ordered per panel A.
Variants indicated in red/blue have significantly lower/higher tie volumes across the
interface relative to WT (green). (C) Illustrative moiety-level PSN for the WT
M dimer. Moieties associated with chains A and B
are indicated in gray and dark purple, respectively. Moieties making up the
interfacial residues for chains A and B are indicated in light gray and light purple,
respectively. Also shown are moieties associated with the active site His (blue) and
Cys (yellow.) (D) Induced subgraph of the PSN in panel C based on interfacial
moieties. Interfacial tie volume is the count of ties from A (gray) to B (purple)
interfacial nodes, while embeddedness in locally cohesive structures contributes to
the mean k-core number.For a different look at interaction across the interfacial region, we also consider the
total number of contacts between the A and B chains over time (the tie volume across the
dimer interface, e.g., the interfacial ties in Figure D). Figure B shows mean dimer
interface tie volumes and associated confidence intervals for the
M variants, with variants sorted in the order of the
mean core number to facilitate comparison (panel A). Whereas the mean core number provides
a measurement of overall structural cohesion at the interface (including both interactions
within and between chains), the tie volume directly measures the number of cross-chain
contacts and is thus a potential proxy for interaction strength. As with cohesion, tie
volume is fairly well-conserved, and few trajectories show significant deviation from
wild-type (with seven trajectories having higher tie volume than WT, and two lower).
Interestingly, interfacial tie volume is not strongly related to interfacial cohesion,
indicating that the tightness or looseness of the structure around the interface is not
simply a function of the raw number of interfacial contacts. Both, however, suggest a
relatively high level of conservation in interfacial structure in the
M dimer across variants observed to date. While the
first version of this work, which addressed moiety-level networks of
M monomers[77] was under review, a
preprint[78] was released, showing similar results using residue-level
network analysis on MD trajectories of M dimers, and
subsequently published.[79] Although different methodology is used, the
conclusions are in broad agreement, with significant flexibility observed in the N-finger
region that is important for dimerization, as well as the loops near the active site. The
authors also describe key low-frequency motions of the dimer that may be impacted by
mutation and identify a pocket near the dimer interface that could serve as a potential
target for allosteric inhibitors, underscoring the importance of mutational analysis as a
part of effective inhibitor design. Moiety-level network analysis of
M dimers was then added to this paper in response to
reviewers’ comments, again finding substantial agreement with the residue-level
results.[78]
Active Site Networks Suggest Potential Activity Differences across
M Variants
The observation of structural variation in loop regions associated with the binding
pocket motivates closer examination of variation in the M
active site. To this end, subgraphs of the full protein structure networks comprising
moieties belonging to the active site residues and their neighbors were constructed to
produce active site networks (ASNs)[52] for all conformations. A
protein’s ASN describes physical interactions among active site moieties and other
groups that are immediately adjacent in the 3D structure, irrespective of their positions
in the amino acid sequence. Per ref (52), we
compute for each ASN a constraint score, a general measure of active site flexibility that
is associated with substrate specificity. The constraint score is the first principal
component of a set of several network metrics (see Materials and
Methods), with higher values indicating a greater tendency for the catalytic
residues to be constrained by cohesive contacts with other residues, and lower values
indicating fewer such constraints. Examples of ASNs corresponding to the maximum, minimum,
and mean observed constraint values over all observed M
monomer conformations are shown in Figure . For
the dimer structures, we repeat this process for the active site in each chain, yielding
two constraint values per variant; these scores, along with corresponding maximum, mean,
and minimum constraint ASNs, are shown in Figure . Values for both monomers and dimers are also provided in Tables S5 and S6.
Figure 8
(A) Mean monomer active site constraint scores and 95% autocorrelation-corrected
parametric bootstrap confidence intervals, by variant. Higher values indicate greater
constraints on active site residues; red/blue intervals have t values
versus WT (green) of at least ±2, indicating significant variation in average
constraint across variants. (B) Minimum, (C) mean, and (D) maximum constraint ASNs
over all frames. Low constraint conformations are characterized by no shared partners
between the catalytic residues (colored nodes), while highly constrained conformations
show cohesively reinforced contacts between them. (E, F, G) Active site residues and
the surrounding residues making up the ASN for the proteins described in panels B, C,
and D, respectively. (H, I, J) Full protein models for the same examples, with the
active site regions indicated by white boxes.
Figure 9
(A) Mean active site constraint scores and 95% autocorrelation-corrected parametric
bootstrap confidence intervals for dimer A and B chain active sites, by variant.
Red/blue intervals have t values versus WT (green) of at least
±2, indicating significant variation in average constraint across variants
(comparisons made within the chain). (B) Minimum, (C) mean, and (D) maximum constraint
ASNs over all structures. (E, F, G) Active site residues and the surrounding residues
making up the ASN for the proteins described in panels B, C, and D, respectively. (H,
I, J) Full protein models for the same examples, with the active site regions
indicated by white boxes.
(A) Mean monomer active site constraint scores and 95% autocorrelation-corrected
parametric bootstrap confidence intervals, by variant. Higher values indicate greater
constraints on active site residues; red/blue intervals have t values
versus WT (green) of at least ±2, indicating significant variation in average
constraint across variants. (B) Minimum, (C) mean, and (D) maximum constraint ASNs
over all frames. Low constraint conformations are characterized by no shared partners
between the catalytic residues (colored nodes), while highly constrained conformations
show cohesively reinforced contacts between them. (E, F, G) Active site residues and
the surrounding residues making up the ASN for the proteins described in panels B, C,
and D, respectively. (H, I, J) Full protein models for the same examples, with the
active site regions indicated by white boxes.(A) Mean active site constraint scores and 95% autocorrelation-corrected parametric
bootstrap confidence intervals for dimer A and B chain active sites, by variant.
Red/blue intervals have t values versus WT (green) of at least
±2, indicating significant variation in average constraint across variants
(comparisons made within the chain). (B) Minimum, (C) mean, and (D) maximum constraint
ASNs over all structures. (E, F, G) Active site residues and the surrounding residues
making up the ASN for the proteins described in panels B, C, and D, respectively. (H,
I, J) Full protein models for the same examples, with the active site regions
indicated by white boxes.Examination of the mean constraint scores for each variant trajectory suggests potential
activity differences across M variants. Figure A shows mean constraint scores for each variant
monomer, with autocorrelation-corrected parametric bootstrap confidence intervals. Of the
79 trajectories examined, 22 (28%) were significantly below the grand mean (dotted
vertical line) and 28 (35%) were significantly above it; similarly, when directly compared
to WT, 12 variants were observed to be significantly less constrained, while 17 were
significantly more constrained (i.e., bootstrap t-scores less than
−2 or greater than 2, respectively). A total of 43 out of 79
M sequences (55%) showed nominally higher levels of
mean constraint than WT (discounting significance), suggesting a lack of uniform selection
pressure for active sites that are more or less constrained than WT (the fraction greater
does not differ significantly from random deviation, p = 0.16, exact
binomial test). Thus, although we do not see evidence here of systematic selection for net
changes in active site constraint in the monomeric state, we do see evidence that variants
differ from each other and from WT in their average active site properties. Turning to the
dimer trajectories, Figure A shows mean
constraint scores for A and B chain active sites, with autocorrelation-corrected
parametric bootstrap confidence intervals for each variant. Consistent with the hypothesis
that dimerization can result in asymmetric modification of active site
conformations,[20] we frequently observe large differences in the
active site constraint between chains within the same variant; indeed, the
root-mean-square difference in mean constraints within the variant is significantly larger
than would be expected from the variation across sites alone (p = 0.048,
permutation test), being approximately 1.65 times the standard deviation in mean
constraint across variants. That said, such differentiation is not always present, with
many trajectories (including WT) showing similar levels of constraint for both active
sites, and it may not be necessary for function. (We observe that symmetric active site
conformations were obtained in the WT dimer structure of Zhang et al.,[36] for instance.) Unlike the monomeric case, we see a greater trend here toward reduction
in dimer active constraint scores versus WT, with 76% of sequences showing constraint
levels lower than WT (p < 0.0001, exact binomial test). It is possible
that this reflects a higher level of selection for looser active sites in the dimeric
state per se. These differences should be considered when designing inhibitors that are
tolerant of mutational change in M over time. In
particular, it is clear that the population of extant M
variants already possesses some phenotypic diversity in active site flexibility,
potentially facilitating its ability to evolve around some types of inhibitors. The
ability of M dimers to sustain active
sites with different levels of flexibility may also complicate inhibitor design and
suggests the importance of inhibitors that are robust to variation in active site
constraint.
Overall Variation in M Active Site
Conformations Is Related to Cys/His Contact
Given the diversity of active site conformations across monomeric/dimeric states,
variants, and chains, it is useful to seek specific features that can be used to
characterize those conformations. To examine this, we employ a kernel principal component
analysis (kPCA) of a stratified sample of ASNs from the set of monomer and dimer
trajectories, taking 25 evenly spaced frames from each trajectory (including both A and B
ASNs in the dimeric case) for a total of 5925 conformations from the entire data set. The
feature space for our kernel is the set of all labeled subgraphs on the set of potential
ASNs, allowing us to detect specific patterns of contacts among moieties that characterize
subpopulations of conformations. As shown in Figure , the combined set of active site conformations falls into two fairly distinct
clusters obliquely aligned with the first two principal components. These respective
clusters correspond to a single feature, specifically the presence or absence of at least
one contact between the catalytic cysteine and histidine residues; the cluster aligned
with the first principal component consists of “open” conformations lacking
such a contact, while the cluster aligned with the second principal component consists of
“closed” conformations in which the contact is present.
Figure 10
Kernel PCA solution for M active site
networks. Conformations fall into two clusters, corresponding to whether the Cys-His
tie is present (“closed,” circles) or absent (“open,”
triangles). The position within each cluster is strongly associated with mean degree
(see point color) and secondarily with network size (gradient shown via contour
lines). Inset networks illustrate centroids of each cluster; note the presence of the
Cys/His interaction (highlighted) in the selected A226 V chain B conformation, which
is absent in the selected G278R chain A conformation.
Kernel PCA solution for M active site
networks. Conformations fall into two clusters, corresponding to whether the Cys-His
tie is present (“closed,” circles) or absent (“open,”
triangles). The position within each cluster is strongly associated with mean degree
(see point color) and secondarily with network size (gradient shown via contour
lines). Inset networks illustrate centroids of each cluster; note the presence of the
Cys/His interaction (highlighted) in the selected A226 V chain B conformation, which
is absent in the selected G278R chain A conformation.Both clusters have an ellipsoidal form, with distance along the ellipsoids away from the
origin corresponding to the average number of contacts per chemical group (mean degree).
The cluster ellipsoids are also slightly oblique to one another, an orientation that
arises from their joint respective correlation with the total number of moieties
interacting with the active site residues (ASN size); this is shown in Figure via the gradient of the size distribution, indicated
as contour lines. These three properties (Cys/His contact, mean degree, ASN size) thus
provide a parsimonious description of the most important axes of variation in active site
structure and may be useful as targets for experimental investigation (e.g., via NMR).
Interestingly, we observe very little association between conformation and monomer/dimer
status (R2 < 0.6% on the first PC, and no significant
difference on the second-highest R2 in the first 10 dimensions
of 1.4%), indicating that the full range of active site conformations is observed in both
monomer and dimer structures and at similar frequencies. Conformation is more strongly
associated with the variant, the latter accounting for approximately 14% of the variance
in the first PC and 7% in the second; adding interaction effects with chain and
monomer/dimer status increases this by 30% and 16%, respectively, suggesting that
M mutations tend to affect monomer and
dimer structures differently. While the mutations observed to date do not radically alter
active site conformation (this being a priori unlikely for functional mutations in any
event), they do appear to exert a nontrivial influence on the distribution of
conformational states.
Conclusion
For clinically relevant variants of M, the observed
variation so far is toward larger and more hydrophobic amino acids, leading to reduced
structural cohesion on average in the variants relative to wild-type. Although mutations
occur throughout the protein, the structural effects of those mutations appear to be far
more localized, impacting primarily the protein’s interdomain interfaces and several
key loop regions near the substrate-binding pocket. These mutations appear to result in
systematic variations in both global flexibility and in the extent to which the catalytic
residues of the active site are constrained (a factor previously found to be related to
substrate specificity in related systems). Our results suggest that
M may be currently subject to selection for enhanced
global flexibility and that currently circulating M variants
represent a reservoir of phenotypic diversity in active site structure and dynamics that
could facilitate an evolutionary response to certain classes of protease inhibitors. These
findings are relevant to dimerization kinetics, substrate capture, and the development of
resistance to inhibitors, as well as our understanding of SARS-CoV-2 more generally.On a more methodological note, these results underscore the potential of comparative
in silico studies to rapidly probe structural and functional consequences
of genotypic variation in emerging diseases. Advances in both GPU-enabled hardware and
molecular dynamics have made high-volume simulation studies feasible over short time
horizons, giving us a powerful tool for selection of experimental targets. At the same time,
comparative analysis of large volumes of trajectory data created by such simulation studies
remains a challenge. Here, we have used both network analytic and machine learning
techniques to identify potentially important sources of variation across trajectories.
Although network analytic ideas have been used to study protein structures at least since
1993,[80] systematic use of combined network analytic and MD trajectories
for comparative analysis of protein variants is more recent.[81−84] It is hoped that
applications such as this one will inspire further development of this promising
approach.
Authors: Bing Bai; Elena Arutyunova; Muhammad Bashir Khan; Jimmy Lu; Michael A Joyce; Holly A Saffran; Justin A Shields; Appan Srinivas Kandadai; Alexandr Belovodskiy; Mostofa Hena; Wayne Vuong; Tess Lamer; Howard S Young; John C Vederas; D Lorne Tyrrell; M Joanne Lemieux; James A Nieman Journal: RSC Med Chem Date: 2021-08-20
Authors: Nadide Altincekic; Sophie Marianne Korn; Nusrat Shahin Qureshi; Marie Dujardin; Martí Ninot-Pedrosa; Rupert Abele; Marie Jose Abi Saad; Caterina Alfano; Fabio C L Almeida; Islam Alshamleh; Gisele Cardoso de Amorim; Thomas K Anderson; Cristiane D Anobom; Chelsea Anorma; Jasleen Kaur Bains; Adriaan Bax; Martin Blackledge; Julius Blechar; Anja Böckmann; Louis Brigandat; Anna Bula; Matthias Bütikofer; Aldo R Camacho-Zarco; Teresa Carlomagno; Icaro Putinhon Caruso; Betül Ceylan; Apirat Chaikuad; Feixia Chu; Laura Cole; Marquise G Crosby; Vanessa de Jesus; Karthikeyan Dhamotharan; Isabella C Felli; Jan Ferner; Yanick Fleischmann; Marie-Laure Fogeron; Nikolaos K Fourkiotis; Christin Fuks; Boris Fürtig; Angelo Gallo; Santosh L Gande; Juan Atilio Gerez; Dhiman Ghosh; Francisco Gomes-Neto; Oksana Gorbatyuk; Serafima Guseva; Carolin Hacker; Sabine Häfner; Bing Hao; Bruno Hargittay; K Henzler-Wildman; Jeffrey C Hoch; Katharina F Hohmann; Marie T Hutchison; Kristaps Jaudzems; Katarina Jović; Janina Kaderli; Gints Kalniņš; Iveta Kaņepe; Robert N Kirchdoerfer; John Kirkpatrick; Stefan Knapp; Robin Krishnathas; Felicitas Kutz; Susanne Zur Lage; Roderick Lambertz; Andras Lang; Douglas Laurents; Lauriane Lecoq; Verena Linhard; Frank Löhr; Anas Malki; Luiza Mamigonian Bessa; Rachel W Martin; Tobias Matzel; Damien Maurin; Seth W McNutt; Nathane Cunha Mebus-Antunes; Beat H Meier; Nathalie Meiser; Miguel Mompeán; Elisa Monaca; Roland Montserret; Laura Mariño Perez; Celine Moser; Claudia Muhle-Goll; Thais Cristtina Neves-Martins; Xiamonin Ni; Brenna Norton-Baker; Roberta Pierattelli; Letizia Pontoriero; Yulia Pustovalova; Oliver Ohlenschläger; Julien Orts; Andrea T Da Poian; Dennis J Pyper; Christian Richter; Roland Riek; Chad M Rienstra; Angus Robertson; Anderson S Pinheiro; Raffaele Sabbatella; Nicola Salvi; Krishna Saxena; Linda Schulte; Marco Schiavina; Harald Schwalbe; Mara Silber; Marcius da Silva Almeida; Marc A Sprague-Piercy; Georgios A Spyroulias; Sridhar Sreeramulu; Jan-Niklas Tants; Kaspars Tārs; Felix Torres; Sabrina Töws; Miguel Á Treviño; Sven Trucks; Aikaterini C Tsika; Krisztina Varga; Ying Wang; Marco E Weber; Julia E Weigand; Christoph Wiedemann; Julia Wirmer-Bartoschek; Maria Alexandra Wirtz Martin; Johannes Zehnder; Martin Hengesbach; Andreas Schlundt Journal: Front Mol Biosci Date: 2021-05-10