Given the demonstrated utility of coarse-grained modeling and simulations approaches in studying protein structure and dynamics, developing methods that allow experimental observables to be directly recovered from coarse-grained models is of great importance. In this work, we develop one such method that enables protein backbone chemical shifts (1HN, 1Hα, 13Cα, 13C, 13Cβ, and 15N) to be predicted from Cα coordinates. We show that our Cα-based method, LARMORCα, predicts backbone chemical shifts with comparable accuracy to some all-atom approaches. More importantly, we demonstrate that LARMORCα predicted chemical shifts are able to resolve native structure from decoy pools that contain both native and non-native models, and so it is sensitive to protein structure. As an application, we use LARMORCα to characterize the transient state of the fast-folding protein gpW using recently published NMR relaxation dispersion derived backbone chemical shifts. The model we obtain is consistent with the previously proposed model based on independent analysis of the chemical shift dispersion pattern of the transient state. We anticipate that LARMORCα will find utility as a tool that enables important protein conformational substates to be identified by “parsing” trajectories and ensembles generated using coarse-grained modeling and simulations.
Characterizing the
folding/unfolding pathway of proteins remains
an outstanding and significant challenge in structural biology. Though
the emphasis has been on characterizing the native state structure
of proteins, new experimental techniques are now being developed that
enable transiently populated intermediates along the protein folding
pathway to be characterized at the atomic scale.[1−6] Identifying such folding intermediates has always been viewed as
an important task, but now as these and other non-native states have
been implicated in several diseases,[7] developing
approaches that enable the “complete” folding pathway
to be characterized is of even greater importance.

In principle,
classical molecular dynamics (MD) simulations, which
can generate full atomic trajectories of a protein by propagating
Newton’s equations of motion, can be used to characterize
its folding pathway(s). However, rigorous MD simulations are computationally
expensive, making it difficult to simulate protein folding; typical
simulations are on the order of nanoseconds to microseconds, whereas
proteins (with the exception of fast-folders) fold on a time scale
of milliseconds and beyond. Though recent advances in computer hardware,
software, and methodology[8−11] now allow the long time scale dynamics of some proteins
to be studied,[12] these approaches still
require significant computational resources. Thus, there remains a
keen need for approaches that allow the folding pathway(s) of proteins
to be characterized using readily available computational resources.

Coarse-grained molecular simulations, in which the full atomic
system is reduced to a smaller less complex system of interacting
“coarse-grained” particles, have been used to overcome
the “mismatch” between simulation and biological time
scales by sacrificing resolution for enhanced sampling efficiency.
Remarkably, despite their simplicity, coarse-grained modeling and
simulation approaches have been used to provide significant insights
into protein functioning.[13−16]

Of considerable interest is the use of coarse-graining
within a
“multiscale” approach, in which coarse-grained simulations
are used to rapidly and exhaustively sample the conformational space
of a target protein, and then “selected” conformers
from the coarse-grained simulations are used to “seed”
more rigorous all-atom simulations. One approach to identifying relevant
“seed” conformations is to use advanced clustering[17−19] and other data-reduction techniques.[20] Alternatively, experimental data can be used to select relevant
“seed” structures by constructing ensembles that are
consistent with the ensemble averaged data.[21−25] A prerequisite for such an “experimentally-augmented”
identification of relevant conformational substates is the ability
to calculate experimental observables from structural models, and,
in the context of coarse-grained modeling, this typically requires
mapping reduced models back to their all-atom representations.[26−31] Unfortunately, in addition to suffering from issues of nonuniqueness,
this mapping incurs an additional computational cost; typically coarse-grained
approaches generate on the order of 10^6 conformers, so this
additional cost can be significant. Techniques are therefore needed
that maximize the structural information that can be directly extracted
from coarse-grained models and thus obviate the need for all-atom
reconstruction of an entire trajectory or ensemble generated using
coarse-grained simulations.

NMR relaxation dispersion (NMR-RD)
experiments have recently garnered
significant attention because they allow NMR observables of low-populated
intermediates to be detected. These observables can be then used to
“unveil” the structure of these previously “invisible”
states. Using such an approach, NMR-RD derived chemical shifts, which
are exquisitely sensitive to protein structure, have now been used
to structurally characterize the folding intermediates of several
proteins.[32−34] Incorporating NMR-RD derived chemical shifts into
the analysis of coarse-grained simulations would allow relevant intermediate
states that are sampled along the folding/unfolding pathways to be
identified. As a first step toward being able to use NMR chemical
shifts to “parse” trajectories or ensembles generated
using coarse-grained modeling and simulations, we introduce LARMORCα, a prediction method that allows protein backbone
(1HN, 1Hα, 13Cα, 13C, 13Cβ, and 15N) chemical shifts
to be predicted based only on Cα-based atomic
coordinates. In what follows we (i) describe the
model used to generate LARMORCα; (ii) assess the accuracy of LARMORCα; (iii) assess the sensitivity of LARMORCα predicted chemical
shifts to protein structure; and (iv) use coarse-grained
simulations and LARMORCα to characterize the transient
state of gpW, a small fast-folding protein.
Methods
Training and Testing Set
LARMORCα predictors
were trained and tested using the RefDB database.[35] Briefly, the data set contains proteins for which both
high-resolution X-ray structures and NMR chemical shifts are available
in the Protein Data Bank (PDB: http://www.pdb.org) and
Biological Magnetic Resonance Bank (BMRB: http://www.bmrb.wisc.edu/), respectively. The training and testing sets used here consisted
of 196 and 61 proteins, respectively (Table S1).
Cα-Cα Distance-Based Structure Features
The distance-based structural features used to predict backbone chemical
shifts from Cα coordinates were identical to those recently
used by the program PCASSO to assign protein secondary structure from
Cα coordinates.[36] Briefly, for a
given residue, i, two sets of 43 features are calculated,
one from the Cα coordinates and one from the pseudocenter coordinates
(see Table S2 and also ref 36 for a list of all features).
The pseudocenter for the ith residue is
defined as the center of geometry of Cα(i) and Cα(i+1). The feature vector, V(i), for the ith residue is made up of the features from the ith, (i–1)th, and (i+1)th residues,
which results in a total of 258 feature elements (Table S2).
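The geometry described above can be illustrated with a minimal sketch. It computes pseudocenters and stacks per-residue features from neighboring residues into a single vector. The per-residue features used here are random placeholders (the real ones are those listed in Table S2 and ref 36), the function names are hypothetical, and the zero-padding at the chain termini and the (i–1, i, i+1) ordering are assumptions rather than details taken from the paper. The value of 86 features per residue follows from 258/3.

```python
import numpy as np

def pseudocenters(ca):
    """Midpoint between Calpha(i) and Calpha(i+1); returns shape (n-1, 3)."""
    ca = np.asarray(ca, dtype=float)
    return 0.5 * (ca[:-1] + ca[1:])

def stack_neighbors(per_residue):
    """Concatenate the features of residues i-1, i, and i+1 for every residue i.

    per_residue has shape (n, k); the result has shape (n, 3 * k).
    Missing neighbors at the termini are zero-padded (an assumption).
    """
    f = np.asarray(per_residue, dtype=float)
    pad = np.zeros((1, f.shape[1]))
    prev = np.vstack([pad, f[:-1]])   # features of residue i-1
    nxt = np.vstack([f[1:], pad])     # features of residue i+1
    return np.hstack([prev, f, nxt])

# Toy chain: 5 Calpha atoms spaced 3.8 A apart along the x axis.
ca = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
pc = pseudocenters(ca)

# Placeholder per-residue features (86 values each), NOT the real feature set.
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(len(ca), 86))
V = stack_neighbors(per_residue)
print(pc.shape, V.shape)
```

Each row of `V` would then play the role of the feature vector V(i) passed to the predictors.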
Generating LARMORCα Using Randomized Decision Trees
For all proteins in the training set, the Cα-Cα
distance-based structural features described above were extracted
from the X-ray structures and combined with their corresponding chemical
shift data. The resulting data set was used as input to build a set
of models to separately predict 1HN, 1Hα, 13Cα, 13C, 13Cβ, or 15N backbone chemical shifts. Specifically, for each backbone
nucleus type, the random forest machine learning technique, implemented
in the RandomForest module in the Scikit-learn Python package,[37] was used to build a predictive model. Each random
forest predictor consisted of 50 randomized decision trees, and the
maximum depth was set to 25. Each node in a given tree was split using
the best splitting variable from a subset of 16 randomly chosen feature
variables. The minimum number of samples required for splitting an
internal node and the minimum number of samples required in a leaf
node were both set to 5. Default values were used for all other parameters.
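As a hedged illustration of the settings quoted above, the following sketch configures one scikit-learn random forest per backbone nucleus. It assumes the regression variant (`RandomForestRegressor`), since chemical shifts are continuous, and it uses synthetic data in place of the RefDB-derived features and shifts. The nucleus labels, the `build_predictor` helper, and the fixed random seed are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_predictor(random_state=0):
    """Random forest configured with the hyperparameters quoted in the text."""
    return RandomForestRegressor(
        n_estimators=50,        # 50 randomized decision trees
        max_depth=25,           # maximum tree depth
        max_features=16,        # 16 randomly chosen features per split
        min_samples_split=5,    # minimum samples to split an internal node
        min_samples_leaf=5,     # minimum samples in a leaf node
        random_state=random_state,
    )

# Synthetic stand-in data: 300 residues x 258 feature elements.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 258))
y = 56.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# One independent model per nucleus type (same toy targets reused here).
nuclei = ["HN", "HA", "CA", "C", "CB", "N"]
models = {name: build_predictor().fit(X, y) for name in nuclei}
pred = models["CA"].predict(X[:5])
print(pred.shape)
```

In practice each model would be fit to the measured shifts of its own nucleus type rather than to a shared toy target.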
Molecular Dynamics (MD) Simulations
Coarse-grained
decoy pools were generated for four arbitrarily chosen proteins in
the testing set (PDB ID 1LM4: chain B,[38] 2C5L: chain C,[39] 1DYT: chain A,[40] and 1H4A: chain X[41]) using Go̅ model MD simulations. The native
contacts used to define the Go̅ models were derived from the
coordinates in the corresponding PDB entries using the MMTSB Go̅ Model
Server (https://mmtsb.org/webservices/gomodel.html).[42,43] All simulations were carried out using the CHARMM MD simulation package.[44] Trajectories of 7.5 μs each were propagated
using Langevin dynamics at 300 K with a friction coefficient of 0.10
ps⁻¹. All bonds were constrained using SHAKE,[45] and nonbond interactions were truncated at 25
Å with a smooth switching function between 21 and 23 Å.
Go̅ model MD simulations of the gpW protein were carried out
using the same procedure described above. In this case, the native
contacts used to define the Go̅ model were derived from the
gpW NMR structure (PDB ID: 2L6Q; model 1).[46,47] For all the proteins,
simulations at 300 K enabled both native and non-native conformations
to be sampled.
Chemical Shift Analysis
For each of the four test systems,
backbone chemical shifts were predicted from the resulting trajectory
using LARMORCα, and then the weighted root-mean-squared error
(wRMSE) between the predicted and measured chemical shifts
was calculated for each conformer along the trajectory. The wRMSE
is a weighted root-mean-square of the differences between δpred and δmeas, where δpred, δmeas, and w are the predicted
chemical shift, the measured chemical shift, and the weighting factor,
respectively, for a given nucleus, i. The summation in the wRMSE
runs over the set of N chemical shifts. The weighting factors (w) were used to account for the differential accuracy
of the predictors. Specifically, w is defined in terms of R and MAE, where R and MAE are the estimated
Pearson
correlation coefficient and estimated mean-absolute-error, respectively,
between the measured and LARMORCα predicted chemical
shifts for the nucleus type associated with i. The
weighting factor also scales the contribution to the overall error
such that nuclei with different dispersion ranges can contribute equally
to the wRMSE.

In addition to extracting the
model with the lowest wRMSE and then comparing to
the native structure, for each of the four systems receiver-operator-characteristic
(ROC) analysis was carried out. First, the fraction of native contacts, Q, was calculated for each conformer along each trajectory.
Conformers along the trajectories were then designated as native if Q > 0.90 and non-native otherwise. ROC curves were plotted
for each test case, and the area-under-curve (AUC) was determined.
Here the AUC, which ranges between 0 and 1, was used as a measure
of the resolving power of the LARMORCα predicted
chemical shifts. An AUC approaching 1 indicated that the models with
the lowest error wRMSE corresponded to the native
conformer and thus the wRMSE was effective at resolving
native and non-native conformers, whereas an AUC of 0.5 indicated
that the use of wRMSE to distinguish native from
non-native conformers was no better than a random designation. In
addition to carrying out ROC analysis using the total wRMSE, parallel analyses were carried out using only 1HN, 1Hα, 13Cα, 13C, 13Cβ, or 15N chemical shifts.

For the gpW protein,
the wRMSE between LARMORCα predicted
chemical shifts and measured chemical shifts
corresponding to (i) the native state and (ii) the transient state were determined.[46] The conformation closest to the average structure of the
10 models exhibiting the lowest wRMSE was then extracted
and considered representative of the state.
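The scoring and ROC analysis described in this section can be sketched as follows. The exact placement of the weight inside the wRMSE is an assumption (here each squared difference is multiplied by its weight before averaging), and the weights are passed in by the caller because the expression relating w to R and MAE is not reproduced here. The AUC is computed with the standard rank-based (Mann-Whitney) identity, and the function names are hypothetical.

```python
import numpy as np

def wrmse(pred, meas, weights):
    """Weighted RMSE, assumed form: sqrt(mean(w_i * (pred_i - meas_i)**2))."""
    pred, meas, weights = (np.asarray(a, dtype=float) for a in (pred, meas, weights))
    return float(np.sqrt(np.mean(weights * (pred - meas) ** 2)))

def auc_low_score_is_native(scores, is_native):
    """ROC AUC when LOW scores (errors) should indicate native conformers.

    Equals the probability that a randomly chosen native conformer has a
    lower score than a randomly chosen non-native one (ties count one half).
    Uses an all-pairs comparison, so it is only suited to small pools.
    """
    scores = np.asarray(scores, dtype=float)
    is_native = np.asarray(is_native, dtype=bool)
    nat, non = scores[is_native], scores[~is_native]
    if nat.size == 0 or non.size == 0:
        raise ValueError("need at least one native and one non-native conformer")
    lower = (nat[:, None] < non[None, :]).sum()
    ties = (nat[:, None] == non[None, :]).sum()
    return float((lower + 0.5 * ties) / (nat.size * non.size))

# Toy decoy pool: fraction of native contacts and chemical shift error.
q = np.array([0.98, 0.95, 0.60, 0.40, 0.92, 0.30])
err = np.array([0.8, 1.0, 2.5, 3.0, 1.2, 2.8])
native = q > 0.90                      # native if Q > 0.90, as in the text
auc = auc_low_score_is_native(err, native)
print(auc)
```

For pools of the size used here, a sort-based AUC (or a library routine) would be preferable to the all-pairs comparison, which scales with the product of the two class sizes.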
Results and Discussion
The prospect of predicting backbone chemical shifts directly from
Cα atomic coordinates opens up the possibility of utilizing
chemical shifts to parse trajectories of Cα-based coarse-grained
simulations and so identify intermediate states along the folding
pathway of proteins. However, relying only on Cα
atomic coordinates reduces the information content in the models and
thus places an inherent limit on how accurately chemical shifts can
be predicted. In what follows, we first examine the accuracy with
which LARMORCα predicts backbone chemical shifts
and then compare it to all-atom prediction methods. We then assess
whether, given its current accuracy, LARMORCα predicted
backbone chemical shifts are likely to be of utility in resolving
protein structure. The latter is essential because it is sensitivity
to protein structure rather than absolute prediction accuracy that
will be most important when utilizing chemical shifts to study the
folding pathway of proteins.
LARMORCα Backbone Chemical Shift Prediction Accuracy Is Comparable to Some All-Atom-Based Approaches
We began our analysis by determining the accuracy with which LARMORCα predicts protein backbone chemical shifts for the
proteins in the testing set (Table S1).
The accuracy of the predictions for 1HN, 1Hα, 13Cα, 13C, 13Cβ, and 15N nuclei was quantified by computing the root-mean-square-error
(RMSE), mean-absolute-error (MAE), and the Pearson correlation coefficient
(R) between LARMORCα predicted chemical
shifts and measured chemical shifts. The results are summarized in
Table 1. For 1HN, 1Hα, 13Cα, 13C, 13Cβ, and 15N the RMSE, MAE and R, calculated over all
corresponding chemical shifts in the testing set, were 0.54, 0.35,
1.21, 1.79, 4.18, and 3.32 ppm, 0.39, 0.25, 0.90, 1.03, 1.55, and
2.45 ppm, and 0.67, 0.77, 0.82, 0.93, 0.94, and 0.81, respectively.
Table 1
Backbone Chemical Shifts Prediction Accuracy (a)

nucleus   RMSE (ppm)    MAE (ppm)     R             no. of shifts   % prediction outliers
Hα        0.54 (0.44)   0.39 (0.35)   0.67 (0.75)   5776            2.98
HN        0.35 (0.28)   0.25 (0.22)   0.77 (0.83)   8346            3.34
Cα        1.21 (1.06)   0.90 (0.83)   0.82 (0.86)   8856            2.31
C         1.79 (0.99)   1.03 (0.76)   0.93 (0.98)   7218            5.96
Cβ        4.18 (1.06)   1.55 (0.81)   0.94 (1.00)   7322            7.39
N         3.32 (2.88)   2.45 (2.25)   0.81 (0.85)   8125            2.43

(a) The root-mean-square-error (RMSE),
mean-absolute-error (MAE), and Pearson correlation coefficient (R) between LARMORCα-predicted and experimental
chemical shifts. For each nucleus, the RMSE, MAE, and R statistics were calculated using all chemical shifts in the testing
set. In parentheses are the statistics obtained when prediction outliers
are excluded. Also listed for each nucleus is the number of chemical
shifts over which the statistics were computed and the percentage
of prediction outliers identified. A prediction outlier is identified
as having an error that exceeds the median error by more than three
standard deviations (i.e., the 3-sigma rule).
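The statistics and the outlier rule described in this footnote can be sketched as follows. This is one possible reading of the rule, assuming that "error" means the absolute prediction error and that the standard deviation is taken over those absolute errors; the function names and the toy data are hypothetical.

```python
import numpy as np

def accuracy_stats(pred, meas):
    """Return (RMSE, MAE, Pearson R) between predicted and measured shifts."""
    pred = np.asarray(pred, dtype=float)
    meas = np.asarray(meas, dtype=float)
    err = pred - meas
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r = float(np.corrcoef(pred, meas)[0, 1])
    return rmse, mae, r

def outlier_mask(pred, meas, n_sigma=3.0):
    """Flag shifts whose absolute error exceeds median + n_sigma * std."""
    abs_err = np.abs(np.asarray(pred, dtype=float) - np.asarray(meas, dtype=float))
    return abs_err > np.median(abs_err) + n_sigma * np.std(abs_err)

# Toy data: 20 shifts predicted 0.1 ppm too high, one with a 10 ppm error.
meas = np.arange(20.0) + 50.0
pred = meas + 0.1
pred[7] = meas[7] + 10.0

mask = outlier_mask(pred, meas)
rmse_all, mae_all, r_all = accuracy_stats(pred, meas)
rmse_clean, mae_clean, r_clean = accuracy_stats(pred[~mask], meas[~mask])
print(int(mask.sum()), rmse_all > rmse_clean)
```

A single large error inflates the RMSE much more than the MAE, which is why a large gap between the two is a useful signal that outliers are present.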
The large discrepancy between the RMSE and MAE is
indicative of
the presence of a small set of large prediction outliers. To confirm
this, outlier analysis was carried out for each backbone nucleus.
Specifically, we identified possible prediction outliers using the
3-sigma rule, i.e. a prediction outlier was identified as one that
had an error that exceeded the median error by more than three standard
deviations. When the prediction outliers (on average
∼4.0% of the total testing set for each nucleus) were excluded, the
RMSE decreased to 0.44, 0.28, 1.06, 0.99, 1.06, and 2.88 ppm,
the MAE decreased to 0.35, 0.22, 0.83, 0.76, 0.81, and 2.25 ppm, and R increased to 0.75, 0.83, 0.86, 0.98, 0.99, and 0.85 for the 1HN, 1Hα, 13Cα, 13C, 13Cβ, and 15N nuclei, respectively.

Consistent with our expectation, backbone chemical shifts predicted
using LARMORCα were generally less accurate than
those calculated using all-atom methods. For example, SHIFTX2[48] and SPARTA+,[49] which
are currently the “gold-standard” for empirical structure-based
protein chemical shift prediction, exhibited significantly lower RMSE
over the testing set (Table S3) and had
mean Rs of 0.98 and 0.92, respectively, compared
with 0.82 for LARMORCα (Table
S4). A similar picture emerges when comparing LARMORCα to CamShift;[50] the mean R for CamShift was 0.89 (Tables S3 and S4). However, LARMORCα predicts backbone chemical
shifts with an accuracy comparable to PROSHIFT[51] and SHIFTS;[52] the mean R for PROSHIFT and SHIFTS were 0.86 and 0.81, respectively,
compared to LARMORCα’s 0.82. When prediction
outliers were accounted for, the overall prediction accuracy of LARMORCα was on par with CamShift (Tables S3 and S4). Together these results show that, although
LARMORCα generally predicts backbone chemical shifts
less accurately than all-atom methods, the drop-off in accuracy
relative to methods other than SHIFTX2 and SPARTA+ is not severe,
even though LARMORCα predicts backbone chemical shifts based only on
Cα coordinates.
Sensitivity to Structure Allows LARMORCα To Distinguish Native and Non-Native States
Next, we examined
whether chemical shifts predicted by LARMORCα were
sensitive to protein structure by assessing their ability to resolve
native structure from decoy conformational pools that contained both
native and non-native conformers. If sensitive to protein structure,
the native-like models in the decoy pool should exhibit the lowest
error between LARMORCα predicted chemical shifts
and measured chemical shifts and vice versa.

To test this, decoy pools for 4 arbitrarily chosen proteins in the
testing set were generated using Go̅ model MD simulations. The final
pools contained a total of 100,000 conformers. As shown in Figure 1, the decoy pools generally contained a mixture
of native and non-native conformers. For each protein, LARMORCα was used to predict backbone chemical shifts for every
conformer in the decoy pool, and then the corresponding wRMSE was computed. The fraction of native contacts (Q) was also determined for every conformer in the decoy pool. Receiver-operator-characteristic
(ROC) analysis was then carried out to assess the extent to which
native-like conformers (Q > 0.90) could be resolved
from non-native conformers (Q ≤ 0.90).
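The native/non-native labeling used in the ROC analysis depends on the fraction of native contacts, Q. A minimal sketch of one common way to compute Q from Cα coordinates is shown below. The contact criterion (a native contact counts as formed when its Cα-Cα distance is within 1.2 times the native distance) and the function name are assumptions made for illustration; the definition actually used in the simulations may differ.

```python
import numpy as np

def fraction_native_contacts(ca, contacts, native_dist, tol=1.2):
    """Fraction of native contacts formed in one conformer.

    ca          : (n_res, 3) Calpha coordinates of the conformer
    contacts    : (n_contacts, 2) residue index pairs defining native contacts
    native_dist : (n_contacts,) Calpha-Calpha distances in the native structure
    tol         : a contact is formed if d <= tol * native distance (assumed)
    """
    ca = np.asarray(ca, dtype=float)
    i, j = np.asarray(contacts, dtype=int).T
    d = np.linalg.norm(ca[i] - ca[j], axis=1)
    return float(np.mean(d <= tol * np.asarray(native_dist, dtype=float)))

# Toy "native" structure: four Calpha atoms on a 3.8 A square.
native_ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0],
                      [3.8, 3.8, 0.0], [0.0, 3.8, 0.0]])
contacts = [(0, 2), (0, 3), (1, 3)]
i, j = np.asarray(contacts).T
native_dist = np.linalg.norm(native_ca[i] - native_ca[j], axis=1)

q_native = fraction_native_contacts(native_ca, contacts, native_dist)
q_expanded = fraction_native_contacts(2.0 * native_ca, contacts, native_dist)
is_native = q_native > 0.90   # threshold used in the text
print(q_native, q_expanded, is_native)
```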
Figure 1
Sensitivity
of LARMORCα chemical shifts to protein
structure (I). Shown are the results of using LARMORCα predicted chemical shifts to resolve native conformers from decoy
pools generated using Cα-Go̅ model MD simulations. Results
are shown for four arbitrarily chosen proteins in the testing set:
PDB IDs (A) 1LM4, (B) 2C5L,
(C) 1DYT, and
(D) 1H4A, respectively.
Shown for each protein are plots of the distribution of the fraction
of native contacts (Q) in the decoy pool and the ROC curves (right).
The plots characterize the degree to which the wRMSE
between measured and LARMORCα predicted chemical
shifts can distinguish native from non-native conformers in the decoy
pools. In addition to ROC curves obtained using the total chemical
shift error (black), separate ROC curves are shown when using only 1HN (red), 1Hα (orange), 13Cα
(green), 13C (purple), 13Cβ (cyan), or 15N (blue) chemical shifts. The AUC values associated with
each ROC curve are shown in boxes.
With the exception of 2C5L, the AUC determined from ROC curves
(when using all available backbone chemical shift data) were all ≥0.95;
the AUC for 2C5L was ∼0.70 (Figure 1). Similar results were obtained if 1HN, 1Hα, 13Cα, 13C, 13Cβ, or 15N chemical shifts were used separately; with the exception
of Cα and N nuclei for 1LM4 and 1HN, 1Hα, 13Cα, 13C, and 15N nuclei for 2C5L, the AUC were all ≥0.88 (Figure 1). Encouragingly, for all four proteins, when using
all available backbone chemical shifts, the models with the lowest wRMSE had Q ≥ 0.97 (Figure 2).
Figure 2
Comparison between X-ray structures and models with the
lowest
chemical shift error. Side-by-side comparison of the X-ray structure
(left) and the models with the lowest total error
(wRMSE) between experimental and LARMORCα-predicted chemical shifts (right) for proteins
corresponding to PDB IDs (A) 1LM4, (B) 2C5L, (C) 1DYT,
and (D) 1H4A. In each panel, the fraction of native contacts (Q) is indicated
below the model with the lowest chemical shift error.
As a further test of its sensitivity to structure,
we examined
whether LARMORCα predicted chemical shifts could
be used to resolve the difference between conformational substates
of the phage T4 lysozyme (T4L). The free-energy landscape of a mutant
T4L, L99A, has been recently studied using NMR-RD experiments, allowing
chemical shifts to be obtained for a transient low-populated (∼3%)
conformational substate.[33] Using a mutate-to-trap
approach, chemical shifts were also obtained for a triple mutant (L99A-G113A-R119P
T4L) that was purported to “resemble” the transient
state. The structures of the transient state of L99A T4L and of the triple T4L mutant
were determined using CS-Rosetta, which confirmed that the structure
of the transient state closely resembles that of the triple T4L mutant.
The RMSD between the transient state of the L99A mutant and the triple
T4L mutant was ∼0.8 Å, whereas the RMSDs of the transient state of the
single mutant and of the triple mutant relative to the highly populated state of
L99A T4L were ∼2.5 and 2.3 Å, respectively (Figure 3).
Figure 3
Sensitivity of LARMORCα chemical shifts
to protein
structure (II): Structures of three conformational substates of T4L:
(A) native L99A T4L, (B) the transiently populated intermediate of
L99A T4L, and (C) the L99A-G113A-R119P T4L triple mutant, respectively.
The region in the transient intermediate of L99A and the triple mutant
that differs significantly from native L99A T4L is circled (yellow dotted). LARMORCα backbone chemical
shifts were predicted from the Cα coordinates taken from the
solved structure of each of these three substates and then compared
to NMR-RD-derived chemical shifts of the native L99A T4L, the transient
intermediate state of L99A T4L, and the triple T4L mutant, respectively.
For each structure, the wRMSE relative to the native
L99A T4L, the transient intermediate state of L99A T4L, and the triple
T4L mutant are shown in black, red, and blue, respectively (boxes) and the lowest is highlighted (bold and underlined). Also, for each structure, the
structural RMSD relative to the native L99A T4L, the transient intermediate
state of L99A T4L, and the L99A-G113A-R119P T4L mutant structure are
shown in black, red, and blue, respectively (boxes). Here, the wRMSE and RMSDs were calculated for
residues 100–120 and 132–146.
To test whether LARMORCα could resolve the
small
structural differences between these three states (namely, the highly
and transiently populated states of L99A T4L and the conformation
of the triple mutant), LARMORCα was used to predict
backbone chemical shifts from the solved structures of each species.
For each species, we computed the wRMSE between the
predicted and experimental chemical shifts; the wRMSEs were computed using data for residues 100–120 and 132–146
as these were the only residues that exhibited significant changes
in chemical shifts between the different states of T4L. We expect
that the structures with the lowest wRMSE should
match the system associated with reference (experimental) chemical
shifts. As shown in Figure 3, this was indeed
the case. The L99A T4L structure exhibited the lowest wRMSE relative to the chemical shifts computed for the highly populated
state, the transient state L99A T4L structure showed the lowest wRMSE relative to the transient-state chemical shifts, and
the triple mutant structure displayed the lowest wRMSE relative to the mutant chemical shifts. These results indicate
that LARMORCα was able to resolve the small structural
difference between conformational substates of T4L.

Although
LARMORCα was able to resolve the “correct”
structure based upon the chemical shifts, the errors for the L99A
transient state and the triple mutant were higher than the error for
the L99A T4L. The higher errors for the transient states suggest that
models for these states can be refined even further. Indeed, during
the CS-Rosetta protocol used to generate these models, it was assumed,
based upon chemical shifts dispersion patterns, that only residues
100–120 and 132–146 were significantly different between
the L99A transient state and the triple mutant. Thus, during refinement
only atoms in these residues were allowed to deviate from the native
L99A T4L structure.

Together these results indicate that backbone
chemical shifts predicted
by LARMORCα are sufficiently sensitive to protein
structure to allow chemical shifts to be used in resolving native
from non-native structure. Even small structural differences between
similar conformational substates can be detected. As such, NMR chemical
shifts should be useful in “parsing” trajectories and
ensembles generated using coarse-grained simulations to identify physically
relevant conformational substates along the folding pathway of proteins.
Analysis Using LARMORCα Indicates That the Transient State of gpW Is Locally Unfolded
Recently, NMR
relaxation dispersion (RD) experiments were used to study the free-energy
landscape of gpW, a 62-residue α+β fast-folding protein
(see Figure 4). NMR RD experiments allowed
chemical shifts to be obtained for both the native-state and a low-populated
transient state.[46] Analysis of the chemical
shift dispersion pattern of the transient state revealed that the
helices remained intact, whereas the beta-strand region was unfolded.[46] In principle, combining LARMORCα with coarse-grained simulations should allow structures consistent
with the chemical shifts of the transient state to be identified.
Thus, we used LARMORCα to probe the folding pathway
of gpW during Go̅ model MD simulations.
Figure 4
Resolving native and
transient states along the folding pathway
of the fast-folding protein gpW using LARMORCα. The
folding pathway of gpW was studied using Cα-based Go̅-model
MD simulations. Shown are cartoon representations comparing (A) the
solved native-state structure of the gpW and the representative models
of (B) the native and (C) the transient states selected from the Cα-trajectory
using LARMORCα. Representative models were selected
by comparing LARMORCα-predicted chemical shifts to
recently reported NMR-RD-derived backbone chemical shifts for the
native and the transient intermediate states.[46] The models in (B) and (C) correspond to the two models that were
closest to the average structure of the 10 models that exhibited the
lowest error (wRMSE) between LARMORCα-predicted and the measured chemical shifts of the native state and
the transient state, respectively.
The representative model based on the native state chemical
shifts
was found to have an α+β topology (Figure 4B), indicating that LARMORCα was able to resolve the native structure from the ensemble of structures
generated during the Go̅ model simulations. In contrast to the
representative model of the native state, the representative model
of the transient state exhibited an unfolded beta-region (Figure 4C). These results agree well with the analysis of
Kay and co-workers,[46] and they serve to
further confirm that LARMORCα can be used to efficiently
parse coarse-grained trajectories and ensembles to identify important
conformational substates (i.e., both native and intermediate states).

Though in the current study we focused on using LARMORCα to incorporate backbone chemical shifts
into the analysis of coarse-grained MD simulations, LARMORCα is also well suited for incorporation into most protein structure
prediction methods, where it can enable backbone chemical
shifts to actively guide conformational sampling. Additionally, LARMORCα can be used to parse large all-atom trajectories
and ensembles to identify a smaller subset of relevant conformational
states. In the spirit of “multi-scale analysis”,[36] more accurate and complete chemical shift prediction
(i.e., prediction of both backbone and side-chain chemical shifts)
can then be carried out for the smaller subset using all-atom prediction
approaches.
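The two-stage idea above (a fast Cα-based pass to shortlist frames, followed by a slower all-atom pass on the shortlist only) can be sketched generically. The function `two_stage_select` and its callables `coarse_error` and `fine_error` are hypothetical placeholders: the first stands in for a wRMSE computed from Cα-based predicted shifts, the second for an error computed with any all-atom predictor. Neither corresponds to a specific API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")

def two_stage_select(
    frames: Sequence[Frame],
    coarse_error: Callable[[Frame], float],
    fine_error: Callable[[Frame], float],
    keep: int,
) -> List[Tuple[int, float]]:
    """Rank all frames with the cheap error, then rescore only the
    `keep` best frames with the expensive error.

    Returns (frame_index, fine_error) pairs sorted by fine error.
    """
    # Stage 1: the cheap score is evaluated on every frame.
    ranked = sorted(range(len(frames)), key=lambda i: coarse_error(frames[i]))
    shortlist = ranked[:keep]
    # Stage 2: the expensive score is evaluated on the shortlist only.
    rescored = [(i, fine_error(frames[i])) for i in shortlist]
    return sorted(rescored, key=lambda pair: pair[1])
```

The design choice is that the expensive predictor is called exactly `keep` times regardless of trajectory length, so the overall cost is dominated by the fast pass.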
Conclusion
In summary, we have developed
LARMORCα, a Cα-based
approach that enables the prediction of backbone chemical shifts from
coarse-grained models of proteins. We show that, in addition to predicting
chemical shifts with accuracy comparable to some all-atom approaches,
LARMORCα is capable of resolving native structures from decoy pools.
This sensitivity to protein structure enables LARMORCα to identify conformational substates from coarse-grained simulations
that are consistent with available NMR chemical shifts. An exciting
application of the method is to identify “invisible”
intermediate substates using chemical shifts obtained from NMR relaxation
dispersion experiments, as was demonstrated here for the gpW fast-folding
protein. Structural information on transiently populated intermediates
afforded by the combination of coarse-grained simulation and LARMORCα has the potential to offer functional insights into
the mechanisms of protein folding, misfolding, and aggregation, and
their role in folding-related diseases.[7] Beyond coarse-grained simulations, LARMORCα could
be used to quickly parse all-atom MD trajectories and also be incorporated
into existing structure prediction methods. To facilitate its use,
the source code for LARMORCα is made freely available
at https://github.com/atfrank/LARMORCA.