Michael S Jones1, Brennan Ashwood2, Andrei Tokmakoff2, Andrew L Ferguson1. 1. Pritzker School of Molecular Engineering, The University of Chicago, 5640 South Ellis Avenue, Chicago, Illinois 60637, United States. 2. Department of Chemistry, Institute for Biophysical Dynamics, and James Franck Institute, The University of Chicago, 929 East 57th Street, Chicago, Illinois 60637, United States.
Abstract
A robust understanding of the sequence-dependent thermodynamics of DNA hybridization has enabled rapid advances in DNA nanotechnology. A fundamental understanding of the sequence-dependent kinetics and mechanisms of hybridization and dehybridization remains comparatively underdeveloped. In this work, we establish new understanding of the sequence-dependent hybridization/dehybridization kinetics and mechanism within a family of self-complementary pairs of 10-mer DNA oligomers by integrating coarse-grained molecular simulation, machine learning of the slow dynamical modes, data-driven inference of long-time kinetic models, and experimental temperature-jump infrared spectroscopy. For a repetitive ATATATATAT sequence, we resolve a rugged dynamical landscape comprising multiple metastable states, numerous competing hybridization/dehybridization pathways, and a spectrum of dynamical relaxations. Introduction of a G:C pair at the terminus (GATATATATC) or center (ATATGCATAT) of the sequence reduces the ruggedness of the dynamics landscape by eliminating a number of metastable states and reducing the number of competing dynamical pathways. Only by introducing a G:C pair midway between the terminus and the center to maximally disrupt the repetitive nature of the sequence (ATGATATCAT) do we recover a canonical "all-or-nothing" two-state model of hybridization/dehybridization with no intermediate metastable states. Our results establish new understanding of the dynamical richness of sequence-dependent kinetics and mechanisms of DNA hybridization/dehybridization by furnishing quantitative and predictive kinetic models of the dynamical transition network between metastable states, present a molecular basis with which to understand experimental temperature jump data, and furnish foundational design rules by which to rationally engineer the kinetics and pathways of DNA association and dissociation for DNA nanotechnology applications.
A robust understanding of the sequence-dependent thermodynamics of DNA hybridization has enabled rapid advances in DNA nanotechnology. A fundamental understanding of the sequence-dependent kinetics and mechanisms of hybridization and dehybridization remains comparatively underdeveloped. In this work, we establish new understanding of the sequence-dependent hybridization/dehybridization kinetics and mechanism within a family of self-complementary pairs of 10-mer DNA oligomers by integrating coarse-grained molecular simulation, machine learning of the slow dynamical modes, data-driven inference of long-time kinetic models, and experimental temperature-jump infrared spectroscopy. For a repetitive ATATATATAT sequence, we resolve a rugged dynamical landscape comprising multiple metastable states, numerous competing hybridization/dehybridization pathways, and a spectrum of dynamical relaxations. Introduction of a G:C pair at the terminus (GATATATATC) or center (ATATGCATAT) of the sequence reduces the ruggedness of the dynamics landscape by eliminating a number of metastable states and reducing the number of competing dynamical pathways. Only by introducing a G:C pair midway between the terminus and the center to maximally disrupt the repetitive nature of the sequence (ATGATATCAT) do we recover a canonical "all-or-nothing" two-state model of hybridization/dehybridization with no intermediate metastable states. Our results establish new understanding of the dynamical richness of sequence-dependent kinetics and mechanisms of DNA hybridization/dehybridization by furnishing quantitative and predictive kinetic models of the dynamical transition network between metastable states, present a molecular basis with which to understand experimental temperature jump data, and furnish foundational design rules by which to rationally engineer the kinetics and pathways of DNA association and dissociation for DNA nanotechnology applications.
Over
the last couple of decades, DNA has proven to be much more
than a vessel for genetic information. From sensing to computing to
directed self-assembly, the programmable and predictable nature of
DNA has unlocked numerous unforeseen nanotechnology applications.[1−4] Recently, single molecule localization techniques have exploited
the rapid and transient binding of short DNA oligomers in order to
achieve super-resolution microscopy and optical multiplexing.[5−7] Predictive understanding of the sequence-dependent thermodynamics
of DNA hybridization/dehybridization—the assembly/disassembly
of a DNA duplex from two single strands—has underpinned the
rational design of DNA oligomer sequences for nanotechnology applications,
where sequence-dependent nearest-neighbor models can accurately account
for mismatched pairs, dangling ends, and other non-native bonding
effects.[8,9] Secondary DNA structures such as hairpins
and G-quadruplexes have also been studied in depth and leveraged for
nanotechnology applications.[10−12] Predictive models of the dynamical,
as opposed to purely thermodynamical, behaviors of DNA have become
increasingly important in developing technologies such as DNA-PAINT
(DNA Points Accumulation for Imaging in Nanoscale Topography), but
these technologies have outpaced our fundamental understanding of
the dynamics themselves.[13−15] Many experimental and computational
studies have investigated DNA dynamical phenomena over picosecond
to millisecond time scales.[16−20] Kinetic models have been developed for particular DNA processes
such as toehold exchanges and optical barcoding[21,22] and supervised machine learning techniques have been combined with
experimental measurements to predict the on/off rates as a function
of sequence.[6,23,24] A comprehensive understanding of the full dynamical landscape of
hybridization/dehybridization accounting for the sequence-dependent
metastable states and association/dissociation pathways remains lacking
and fundamental questions remain unresolved. For example, it remains
unclear the extent to which hybridization of short DNA oligomers largely
proceeds in a conventionally assumed “all-or-nothing”
fashion or if long-lived metastable states facilitate the transition.[17,19,25−28] Out-of-register “shifted”
base-paired structures[17,18,25,29−31] and frayed structures[32−35] stand as candidates for metastable states with the potential to
mediate substantial deviations from all-or-nothing behavior, but the
degree to which these states are kinetically relevant is difficult
to determine experimentally and is likely to be highly sequence-dependent.
The development of predictive models and design rules with which to
engineer DNA strands with tailored hybridization/dehybridization kinetics
and pathways is vital to advancing rational design of DNA strands
for nanotechnology applications and is also of importance in understanding
fundamental biological processes such as transcription and gene regulation.Our understanding of hybridization dynamics is built upon decades
of experiments—such as temperature-jump, salt-jump, pH-jump,
and other perturbative methods—that rapidly stimulate DNA and
monitor relaxation to a new equilibrium.[28,36−42] More recently, single-molecule diffusion and tethered multi-fluorophore
assays have facilitated analyses under equilibrium conditions, but
these results can be hampered by slow data collection rates and fluorescent
tags effects on strand dynamics, particularly for shorter oligomers.[23,43−45] A number of computational modeling approaches have
also been employed to provide molecular-level resolution of hybridization.
Simplified lattice models can recapitulate the essential aspects of
the hybridization pathways but lack the realism of continuous space
representations.[25,46] The long time scales associated
with hybridization/dehybridization events place them outside the reach
of unbiased all-atom molecular dynamics simulations,[18] but they can be observed by employing enhanced sampling
techniques[18,30,47−53] or by using elevated temperature or denaturing solvent concentrations
to induce one-way dissociation events.[54,55] The effect
of the applied bias upon the thermodynamics can be rigorously corrected
for using standard reweighting techniques.[56−59] Rigorous elimination of the bias
in the kinetics is critical for the construction of robust and reliable
kinetic models that reflect the true system dynamics and are uncontaminated
by any residual effects of the biasing potentials used to induce good
sampling. A number of approaches to correct the kinetics are also
available, including Girsanov reweighting, transition-based reweighting
analysis (TRAM), dynamic histogram analysis method (DHAM), and their
derivatives.[60−66] The application of these methods under the conditions of high bias
necessary for good sampling can, however, present challenges for numerical
convergence. A number of coarse-grained DNA force fields have been
developed that enable direct observation of these events over microsecond
time scales via unbiased coarse-grained molecular dynamics simulations,[29,30,49,67,68] which, up to an acceleration factor associated
with the smoothing of the underlying free energy landscape inherent
to the coarse-graining procedure, can preserve a faithful model of
the unbiased dynamics and associated pathways. These models have previously
been used to study biological phenomena such as nucleosome dynamics[69,70] and transcription factor binding[71,72] as well as
nanotechnology applications such as strand displacement[73,74] and DNA origami.[75,76] In this study, we choose to employ
a coarse-grained model for DNA[49] that is
sufficiently inexpensive to enable the collection of sufficient volumes
of unbiased simulation trajectories and adequately sample configurational
space that we do not need to appeal to biasing strategies to enhance
convergence nor apply any post hoc corrections to
the thermodynamics or kinetics.In this work, we study a family
of self-complementary pairs of
10-mer DNA oligomers using coarse-grained molecular simulation, machine
learning of the slow dynamical modes, and data-driven inference of
long-time kinetic models to establish new understanding of the influence
of sequence upon hybridization/dehybridization kinetics and mechanisms.
This family—5′-ATATATATAT-3′ (AT-all), 5′-GATATATATC-3′
(GC-end), 5′-ATATGCATAT-3′ (GC-core), and 5′-ATGATATCAT-3′
(GC-mix)—was designed to probe the influence of the placement
of two G:C base pairs within an otherwise repetitive A:T sequence
and has been the subject of our prior experimental investigations
using temperature-jump infrared spectroscopy and simple lattice models.[19] We validate the new computational models of
hybridization/dehybridization dynamics developed in this work against
new experimental data and reinterpret our prior experimental observations
in light of the new computational understanding. Consistent with previous
studies,[18,25,30] we find the
degree of repetitiveness in the sequence—and therefore the
kinetic accessibility and thermodynamic stability of out-of-register
shifted states—leads to richer dynamics populated by a diversity
of long-lived metastable states. Our data-driven modeling and analysis
rigorously quantifies these behaviors and furnishes accurate predictive
models of the hybridization/dehybridization rates, dynamical pathways,
and metastable states. Specifically, we demonstrate that disrupting
repetitive stretches of A:T bases by placement of interrupting G:C
base pairs enables us to tune the landscape from rich six-state to
simple two-state “all-or-nothing” behavior, and the
specific location of the interrupting pair can be used to modulate
the stability of long-lived frayed states. Taken together, our analyses
establish new molecular-level understanding of the sequence-dependent
kinetics and pathways through quantitative predictive models for the
long-time system dynamics, resolution of the dynamical folding pathways
and metastable states, and elementary design rules with which to control
the dynamical behaviors of the system. We anticipate that this new
foundational understanding, and the extension of our approach to more
extensive families of DNA sequences, can guide the rational design
of DNA oligomers with tailored kinetic properties engineered for DNA
nanotechnology applications such as DNA-PAINT.[6,7]
Methods
Computational Methods
Molecular Dynamics Simulations
We performed molecular dynamics simulations of four 10-base self-complementary
double-stranded DNA sequences that we have previously studied by temperature-jump
infrared spectroscopy:[19] 5′-ATATATATAT-3′
(AT-all), 5′-GATATATATC-3′ (GC-end), 5′-ATATGCATAT-3′
(GC-core), and 5′-ATGATATCAT-3′ (GC-mix). We modeled
the DNA sequences using the coarse-grained 3-Site-Per-Nucleotide v2
(3SPN.2) model that uses three spherical beads to represent the phosphate,
deoxyribose sugar, and nitrogenous base of each nucleotide and employs
anisotropic interaction potentials to accurately treat intrastrand
base-stacking, interstrand cross-stacking, and base pairing.[49] The model was parametrized against experimental
data on bond lengths, bend angles, torsional angles, base step energies,
and base stacking free energies, and reliably reproduces the structure,
melting temperatures, persistence lengths, and sequence, salt, concentration,
and temperature effects on duplex formation.[49] The model enables access to millisecond time scales and has been
has widely adopted to study numerous phenomena including DNA packing
in viral capsids, protein–DNA binding, and nucleosome unwrapping.[69,77,78] Although the 3SPN.2 model was
not directly parametrized against dynamical experimental data, we
will show below that the predicted sequence-dependent kinetics and
relaxations are, within a corrective scaling factor, in good agreement
with observed experimental trends.All calculations were performed
using the LAMMPS simulation package (http://lammps.sandia.gov) in
accordance with best practices for the 3SPN.2 model.[79] A single pair of self-complementary sequences were placed
in a cubic periodic box with side length 7.8 nm corresponding to a
single-strand concentration of 7 mM. This concentration is only 3.5×
larger than the 2 mM concentration employed in our experimental analyses
(section ). Solvent
effects were modeled implicitly by employing Langevin dynamics[80,81] with an experimentally motivated per-site friction coefficient of
9.94 × 10–11 m2/s.[49,82] We specified a 240 mM implicit salt concentration and treated electrostatic
interactions using the Debye–Hückel with a 5 nm cutoff
radius.[83] Simulations were performed in
the NVT ensemble employing a Langevin thermostat.[84] With the exception of the simulation data reported in section where we draw
comparisons against experimental data collected over a range of temperatures,
each sequence was simulated at its respective melting temperature
rat 7 mM concentration as dictated by the 3SPN.2 model—AT-all:
309 K, GC-end: 317 K, GC-core: 324 K, GC-mix: 324 K—in order
to maximize the number of spontaneous transitions between dissociated
and hybridized states. Melting temperatures for each sequence were
determined empirically by comparing the ratio of hybridized to dissociated
populations over a 10 K temperature ramp centered on the nearest neighbor
(NN) model predicted melting temperature,[8,9] and
selecting the temperature at which the hybridized and dissociated
populations were approximately equal. The Langevin equations of motion
were integrated using the scheme of Bussi and Parrinello[81] with a 20 fs integration time step. We performed
40 independent simulations for each of the four sequences with half
of the runs initialized from the hybridized state and half from the
dissociated state. The initial hybridized state was defined based
on the crystal structure coordinates of Arnott et al.[85] The dissociated state was generated from the hybridized
state by displacing one strand away from the other by 1 nm in each
of the x, y, and z directions. The magnitude of this displacement did not affect the
results so long as all native Watson–Crick (WC) bonds were
completely broken. Initial bead velocities were assigned from a Maxwell–Boltzmann
distribution at the temperature of interest. Each simulation was conducted
for 26 μs and frames saved to disc every 100 ps. Each simulation
required ∼24 CPU-hours on 28×Intel E5-2680v4 CPU cores.
The first 1 μs of each run was discarded for equilibration providing
us with 40 × 25 μs = 1 ms of simulation data for each sequence,
during which time we observed 55–100 hybridization/dehybridization
events.
Markov State Model Construction
Markov state models (MSMs) are a powerful approach to infer long-time
kinetic models from short molecular simulation trajectories[86−90] that we employ in this work to construct high-resolution sequence-dependent
kinetic models of DNA hybridization and dissociation. In brief, a
MSM extracts from simulation trajectories an ensemble of long-lived
metastable macrostates, their equilibrium occupancy populations, and
the equilibrium probability fluxes between them. In this manner, they
provide an interpretable and predictive model of the system thermodynamics
and kinetics. MSMs have recently been implemented to study the hybridization
mechanism of one particular 14-mer DNA oligomer, but determining the
sequence-dependent kinetics and mechanisms was not the focus of this
study.[91] An energy disconnectivity graph-based
approach was used to interrogate the differences in hybridization
rates and mechanisms between GGGGGG and GCGCGC hexamers to reveal
strong deviations from “all-or-nothing” behaviors and
the importance of zippering and out-of-register diffusion mechanisms.[17] Kinetic models were constructed not from the
dynamical trajectories of the molecular model, but by estimating rate
constants between local minima using a transition state theory approximation.
A recent application of MSMs to the long-time dynamics of short RNA
oligonucleotides revealed stacking time scales to be highly sequence
dependent.[92] In this work, MSMs were constructed
for each of the four DNA sequences at their respective 3SPN.2 melting
temperatures at 7 mM concentration from the 40 × 25 μs
simulation trajectories following a four-step protocol detailed in
ref (93): (i) trajectory
featurization, (ii) dimensionality reduction, (iii)
microstate clustering and microstate transition matrix inference,
and (iv) macrostate clustering and macrostate transition matrix inference.
Calculations were performed using the PyEMMA software package.[94]
Featurization
Trajectories comprising
the Cartesian
coordinates of the DNA strands as a function of time were featurized
using the MDTraj Python libraries[95] to
represent the system in a manner that exposes the essential system
dynamics but eliminates trivial translation and rotational invariances.
We adopt intermolecular pairwise distances d(i,j) between the centers of mass of the
10 bases as a natural rototranslationally invariant featurization
that represents each system configuration as the 10 × 10 = 100-element
vector of interstrand pairwise distances. One additional symmetry
arises from the self-complementary nature of these sequences—the
sense and antisense strands in each pair are identical—such
that the representation of the system under our featurization should
remain unchanged upon inverting the arbitrary labeling of strand “1”
and strand “2”.[86] The 100-element
pairwise distance vector is not invariant to this permutation, but
can easily be made so via a simple symmetrization operation in which
each of the = 45 intermolecular pairwise
distances
are replaced by the mean of the two permutationally invariant distances.
Specifically, (d(i1,j2) = d(i2,j1)) ← 0.5(d(i1,j2) + d(i2,j1)), where i1 denotes the ith base on strand 1 and j2 the jth base on strand 2.[86] Finally,
we took the reciprocal of the permutationally symmetrized pairwise
distances to provide higher resolution and differentiation between
proximate strand configurations in the near hybridized state compared
to distantly separated dissociated strands. VAMP-2 scoring—calculation
of the sum of the squared estimated eigenvalues of the transfer operator—of
trajectories under a particular featurization provides a measure of
the kinetic variance carried by that featurization.[96−99] Performing VAMP-2 scoring at
a lag time of τ = 1.2 ns and retaining the top five modes, reveals
that the reciprocal permutationally symmetrized pairwise distances
can carry up to twice the kinetic variance as the nonreciprocal distances,
suggesting that the higher resolution offered at close intermolecular
distances can indeed boost the dynamical representational power of
the model. Somewhat surprisingly, we observed that augmenting our
set of intermolecular distances between bases with intramolecular
distances between bases on the same strand led to no improvement of
the VAMP-2 score. This indicates that the kinetically relevant conformational
state of the two strands are adequately represented via the intermolecular
distances and leading us to employ only intermolecular distances within
our featurization.
Dimensionality Reduction
The featurized
trajectories
were then projected into a low-dimensional space in preparation for
microstate clustering. The standard approach to doing so is to employ
time-lagged independent components analysis (tICA) to learn a linear
projection into a low-dimensional embedding that maximally preserves
the kinetic variance in the data.[97,100,101] In this work, we instead employ state-free reversible
VAMPnets (SRVs) that can be conceived of as a nonlinear version of
tICA.[102] SRVs employ neural networks to
learn flexible nonlinear functions of the trajectory featurization
that better approximate the slow dynamical modes of the system and
have been shown to produce substantially higher resolution MSMs than
those developed using tICA.[93,102] SRV modes were learned
independently for each system to best approximate the slow collective
modes for that particular DNA sequence. SRVs were trained using the
SRV package we previously developed (https://github.com/hsidky/srv) employing the default network architecture of two hidden layers
each comprising 100 neurons and tanh activation functions, a learning
rate of 0.001, and a batch size of 50 000. We adopted a lag
time of τ = 1.2 ns as appropriately short time scale to resolve
the dynamical details of the hybridization/dehybridization dynamics.[103] As observed by Husic and Pande, the lag time
cannot be treated as a hyperparameter to be optimized via the VAMP-2
score, but must be selected as a physically motivated choice designed
to expose the dynamical motions relevant at a particular time and
length scale of interest.[104] As we shall
show, this choice of lag time leads to high-resolution Markovian macrostate
MSMs. We guarded against overfitting using 5-fold cross-validation
in which we constructed five random partitions of the 40 independent
simulation trajectories into a training set of 20 trajectories and
a validation set of 20 trajectories.[93] We
observed plateau of the validation loss and no evidence of overfitting
after 20 epochs of training requiring approximately 22 GPU-minutes
on a single NVIDIA Tesla K80 GPU card. A VAMP-2 scoring of the cumulative
kinetic variance explained as a function of number of SRV collective
modes also showed no evidence of overfitting—as would be evinced
by separation of the training and validation VAMP-2 scores[93]—and exhibited a knee for each of the
four DNA sequences after the fifth, fourth, third, and second slow
modes for AT-all, GC-end, GC-core, and GC-mix, respectively (Figure
S1 in the Supporting Information), and
motivating the construction of 5D, 4D, 3D, and 2D embeddings, respectively.
Microstate Clustering
The SRV projections of the 10
million frames recorded over the course of the 1 ms molecular simulation
trajectories collected for each DNA sequence were then clustered into
microstates using k-means clustering. The VAMP-2 score of the microstate
transition matrix constructed for each sequence at the selected τ
= 1.2 ns was insensitive to the choice of the number of microstates
over the range 100–1000, motivating our selection of 200 microstate
clusters for each system.
Macrostate Clustering
The 200 microstates
comprising
each system were coarsened into our terminal macrostate MSM. Since
we are interested in the construction of models of the equilibrium
kinetics, we explicitly enforce detailed balance in the construction
of the MSM.[90] The microstate transition
matrix for each system was computed at a range of lag times τ
and then diagonalized to recover the corresponding eigenvalues λ and associated implied time scales t = −τ/ln|λ|.[90] The implied
time scale plots for the four DNA sequences are presented in Figure S2. We observe rapid convergence of the
implied time scales t with lag time τ for all systems, motivating the construction
of high resolution macrostate MSMs at a lag time τ = 1.2 ns.
For this choice of lag time, we recover 5, 4, 2, and 1 implied time
scales for the AT-all, GC-end, GC-core, and GC-mix systems, respectively.
The identification of (i – 1) implied time
scales implies the presence of (i – 1) slow
modes and motivates the coarsening of the system into i macrostates. We estimate these i macrostates by
applying PCCA+ spectral clustering to the leading (i – 1) eigenvectors of the microstate transition matrix.[105−107] We then estimate the corresponding 6, 5, 3, and 2 macrostate transition
matrices P for the AT-all, GC-end, GC-core, and GC-mix
systems, respectively, by projecting the molecular simulation trajectories
into these discrete macrostates. These macrostate MSMs constitute
our terminal kinetic models. We validate the Markovian nature of the
four MSMs by subjecting them to the Chapman–Kolmogorov (CK)
test.[60,90,108] This test
asserts that the transition matrix for a Markovian (i.e., memoryless)
MSM constructed at a lag time τ should satisfy the condition P(kτ) = P(τ), which states that k successive
applications of the transition matrix constructed at a lag time τ
should be equivalent to a single application of the transition matrix
constructed at a lag time kτ. We present in Figure S3 the CK tests for each DNA sequence
to demonstrate that the τ = 1.2 ns models perform very well
in predicting transition probabilities out to kτ
= 7.2 ns, validating the Markovian nature and kinetic validity of
the four models.
Experimental Methods
Sample Preparation
Each DNA oligonucleotide
was purchased from Integrated DNA Technologies (IDT) at desalt grade
purity. Oligonucleotides were purified with 3 kDa cutoff centrifugal
filters (Amicon). All labile protons were exchanged in deuterium oxide
(D2O, Cambridge Isotopes, 99.9%). Oligonucleotides were
prepared at a total strand concentration of 2 mM in 50 mM pD 7.2 sodium
phosphate buffer with 240 mM NaCl and 18 mM MgCl2. Prior
to each measurement, DNA solutions were placed in a water bath at
90 °C for 3 min and then cooled to room temperature under ambient
conditions.
T-Jump IR Spectroscopy
The details
of the technique and processing used to acquire temperature-jump infrared
(T-jump IR) data have been described previously.[109−111] Briefly, heating was initiated through optical excitation of the
O–D stretch overtone transition of D2O. The 1.98
μm pulses (5 ns, 20 mJ, 20 Hz) used for heating were generated
from the frequency-doubled output of a Nd:YAG laser sent through an
optical parametric oscillator. Nonlinear IR spectra are collected
from 5 ns to 50 ms delays after the T-jump with a synchronized 1 kHz
spectrometer. T-jump heterodyne-detected vibrational echo (t-HDVE)
IR spectra were acquired with Fourier transform spectral interferometry,[110] where the delay between the local oscillator
(LO) and DVE signal was scanned in 5 fs steps between (−10)
and 10 fs. t-HDVE spectra were acquired with a parallel pulse polarization
scheme and presented as a dispersed pump–probe (t-DPP) spectrum.
t-DPP data are reported as the difference spectra relative to the
initial temperature.The sample was placed between two 1 mm
thick CaF2 windows separated by a 50 μm Teflon spacer
enclosed in a brass jacket. The initial temperature of the sample
was set using a recirculating chiller connected to the brass sample
jacket. The T-jump temperature change (ΔT)
was set to 14–16 °C for all measurements and monitored
using the change in transmission of the D2O bend–libration
combination band measured in the LO beam. The temperature change was
quantified by comparing the change in transmission of the LO beam
with a FTIR temperature series of D2O.
Determination of Fast and Slow Dissociation
Rates
To determine observed rates from the T-jump data, the
time-domain t-HDVE data was inverse Laplace transformed into the rate
domain using a maximum entropy approach (MEM-iLT).[112] Observed rates λfast and λslow were computed from the amplitude-weighted mean rate across detected
IR frequency, as previously described.[28] The fast response kdfast is defined as this amplitude-weighted
mean rate, whereas the dissociation rate constant kdslow was extracted
from the observed rate of the slow response λslow using a two-state model for self-complementary oligomers,[113]where [S] is the concentration of
single-strand oligomer
at the final temperature of the T-jump, and kaslow is the association
rate constant. In practice, eq is recast in terms of the dissociation equilibrium constant Kd to solve for kdslow and kaslow as a function of Kd and [S],FTIR temperature series were measured
for each sequence as reported previously,[19] and the second SVD component along temperature was fit to a two-state
model to determine the fraction of intact duplex θ as a function
of temperature.[114]Kd and [S] were then determined from θ at the final temperature
of each T-jump measurement,where ctot is
the total oligonucleotide concentration, which was 2 mM for all measurements.
Results and Discussion
Sequence-Dependent
Coarse-Grained Kinetics
Recapitulate T-Jump IR Measurements
We first sought to demonstrate
that the coarse-grained 3SPN.2 model accurately recapitulates the
experimentally observed kinetics of DNA oligomer hybridization/dissociation
by validating our computational predictions against temperature-jump
infrared (T-jump IR) experiments. We conducted T-jump IR experiments
for each DNA sequence as a function of temperature and extracted the
“slow” kdslow and “fast” kdfast rates, corresponding to processes proceeding on 10–30 μs
and 70–100 ns time scales, respectively. The slow response
has previously been attributed to duplex dissociation on microsecond
time scales induced by the rapid heating of the initially hybridized
duplex.[19,28] The fast response has been assigned to terminal
base pair fraying,[19,28] with this process corresponding
to a relatively complex dynamical process that can span time scales
from picoseconds to microseconds.[33−35,115] All-atom simulations suggest that frayed ends can assume misaligned
WC bonds, base-sugar hydrogen bonds, and terminal stacked conformations.[32,116]To computationally mimic the T-jump process in our 3SPN.2
simulations, we conducted 1 μs simulations of an initially hybridized
DNA duplex over a range of temperatures and monitored its structural
relaxation. We performed 120 independent simulations for each DNA
sequence at each temperature, and from these extracted computational
estimates of kdslow and kdfast (Figure S4). First, we tracked the slow response corresponding to duplex
dissociation in our simulations by compiling the distribution of times
at which both of the central base pairs first separate to a distance
of 2.0 nm starting from an initial fully hybridized duplex. This cutoff
was selected as the distance beyond which the strands are effectively
non-interacting and defines the dissociated state. We extracted our
computational estimate of kdslow by fitting a decaying exponential
to the fraction of hybridized sequences as a function of time fhybridized(t) = exp(−kdslowt). We verified that our cutoff was sufficiently
large by observing that our calculated values for kdslow changed
by an average of only 7% by adopting a 1.3 nm cutoff.Second,
we tracked the fast response corresponding to terminal
base pair fraying by compiling the distribution of times at which
either of the terminal base pairs first separated to a distance of
1.3 nm, corresponding to a complete breakage of the WC interaction.
We extracted our computational estimate of kdfast through a decaying
exponential fit to the fraction of unfrayed sequences as a function
of time funfrayed(t)
= exp(−kdfastt).We present in Figure a comparison of kdslow and kdfast estimated by computation and
experiment. Although the 3SPN.2 model was not directly fitted against
kinetic data,[49] its predictions of sequence-dependent
T-jump relaxation rates are, within a systematic scaling factor in
time and systematic shift in temperature, in good agreement with observed
experimental trends. It is well-known that the smoothing of the underlying
free energy landscape induced by coarse-graining artificially accelerates
the kinetics of coarse-grained molecular simulations and that different
degrees of freedom may be accelerated by different factors.[117−119] We find that the simulated slow responses corresponding to center-of-mass
translation of the strands during dissociation of the duplex is ∼10×
accelerated relative to experiment, whereas the fast responses corresponding
to fraying of the terminal bases is ∼120× accelerated.
We apply these sequence-independent scaling factors to our reported
computational values in Figure . Although the 3SPN.2 model reproduces melting temperatures
relatively well, we observed a systematic 4 K under-prediction relative
to experiment and so we apply a universal (+4) K corrective temperature
shift to our computational results. We note that our simulations are
expected to slightly over-predict the melting temperature since they
are conducted at 3.5× higher concentration relative to experiment,
and the concentration-adjusted corrective shift accounting for the
approximations inherent in the 3SPN.2 model would be slightly larger.
Owczarzy et al. present an analytical prescription to apply concentration
corrections to the melting temperature using knowledge of the enthalpy
changes associated with duplex formation and helix nucleation.[120] In the absence of these quantities, one could
instead empirically estimate concentration corrected melting temperatures
by conducting a suite of simulations over a range of temperatures
and assume ideal molecular behavior to apply concentration corrections
to the observed hybridized fractions.[121] (As demonstrated by Ouldridge et al., finite-size corrections do
not affect the melting temperature for homodimers.[121]) Furthermore, we note that these empirical calibration
factors to the 3SPN.2 predictions are applied only for the purposes
of making an experimental comparison, but acknowledge that there are
uncertainties introduced by assuming that the equilibrium dynamics
at fixed temperature can be compared directly to relaxation kinetics
following a 15 °C T-jump. The computational time scales and melting
temperatures reported in the remainder of the paper are not corrected
by these calibration corrections since we are only interested in the
relative trends in the behaviors of the four sequences.
Figure 1
Experimental
measurements and computational predictions of slow
and fast at T-jump IR responses. Results are reported in terms of
the final T-jump temperature. (a) The experimental and simulated slow
rate constants kdslow corresponding to duplex dissociation over
long time scales. (b) The experimental and simulated fast rate constant kdfast corresponding to terminal base-pair fraying on short time scales.
The simulation results are corrected by a sequence-independent scaling
factor that corrects for a 10× acceleration of the slow dissociation
dynamics and 120× acceleration of the fast fraying dynamics.
The simulated temperature in all cases is subjected to a (+4) K corrective
calibration to account for an observed systematic under-prediction
of the melting temperature by the 3SPN.2 model
Experimental
measurements and computational predictions of slow
and fast at T-jump IR responses. Results are reported in terms of
the final T-jump temperature. (a) The experimental and simulated slow
rate constants kdslow corresponding to duplex dissociation over
long time scales. (b) The experimental and simulated fast rate constant kdfast corresponding to terminal base-pair fraying on short time scales.
The simulation results are corrected by a sequence-independent scaling
factor that corrects for a 10× acceleration of the slow dissociation
dynamics and 120× acceleration of the fast fraying dynamics.
The simulated temperature in all cases is subjected to a (+4) K corrective
calibration to account for an observed systematic under-prediction
of the melting temperature by the 3SPN.2 modelIn Figure a we
observe an exponential increase of kdslow with temperature,
as expected from the large enthalpic barrier to duplex dissociation.[20,38,40] Under the time and temperature
calibration corrections, we see generally very good agreement between
the computational and experimental curves. Of the four sequences,
GC-core shows the largest discrepancy between computation and experiment,
although the general exponential trend is preserved. This may be a
result of its high propensity to fray (cf. section ). In Figure b we expose a largely linear dependence of kdfast upon temperature for AT-all, GC-end, and GC-mix compared to an exponential
dependence for GC-end. These trends can be understood in light of
the comparatively larger enthalpic barrier for dissociation of the
terminal G:C base pair in GC-end compared to that for the A:T terminal
pair in the other three sequences. Again we see good agreement between
the scaled computational predictions and the experimental T-jump measurements.
The favorable comparison of computation and experiment provides support
for the capacity of the 3SPN.2 model to reliably reproduce sequence-dependent
trends in the slow and fast kinetics of the four DNA oligomers, and
lends confidence in the use of these data for the parametrization
of Markov state models of the long-time dynamics of each sequence.
We observe that the dynamical fingerprinting approach presents an
elegant means to compare experimental relaxation data directly against
a Markov state model extracted from simulation data in terms of a
“fingerprint” of peaks with amplitudes and time scales
related to the relaxation of particular system observables.[122] These techniques have been previously employed
to in applications to base stacking of DNA dinucleoside monophosphates[123] and RNA.[92] We explored
the use of this approach to validate the MD simulation data but encountered
challenges in resolving fast dynamical motions below the lag time
of the fitted MSM and the extremely high computational cost of collecting
sufficient simulation data to fit MSMs at each temperature for which
experimental data was collected. Accordingly, we instead elected to
perform a direct comparison between the experimental data and MD trajectories
to validate the simulations themselves, then proceed to train MSMs
over these data and conduct analysis and experimental tests of the
MSM predictions.
SRV-MSMs for Each Sequence
We then
proceeded to construct Markov state models (MSMs) from 1 ms of aggregated
simulation trajectories for each of the four sequences at their respective
melting temperatures to generate sequence-dependent kinetic models.
MSMs define the long-lived metastable macrostates of the system, their
equilibrium occupancies, and the equilibrium transition probabilities
between them. As such, they are extremely valuable in providing both
a quantitative predictive model and a physically comprehensible mechanistic
understanding of the long-time dynamical evolution of the system between
an ensemble of metastable macrostates. We present in Figure the inferred MSMs for each
of the four 10-base DNA sequences. Across all four sequences we identify
a totality of seven metastable macrostates corresponding to the fully
hybridized state (H) in which all native base pairings are intact,
four shifted states in which the strands are translated out-of-register
by two or four bases in the 5′ (5S2, 5S4) or 3′ (3S2,
3S4) direction, a frayed state (F4)—unique to GC-core—in
which four terminal A:T base pairs are unbound, and the fully dissociated
state (D). We present these seven macrostates in Figure a along with schematic and
cartoon renderings of representative microstates contained within
each of these macrostates. The representative microstates emblematic
of each macrostate were selected randomly from those possessing high
(>99%) macrostate membership probabilities. Since the microstate
ensemble
exhibits a range of conformations, however, it is useful to characterize
the degree of structural heterogeneity to determine the degree to
which these microstates are emblematic of the distribution. In Figure S5 we present a projection of our macrostates
into two physical order parameters—the degree of 3′
shift and degree of 5′ shift—to provide an interpretable
embedding of the macrostates that exposes base pairing patterns and
structural heterogeneity within the microstate ensemble. In all cases,
we find these microstate distributions to be relatively narrowly focused
within the low-free energy core of each macrostate such that a single
archetypal microstate is indeed a good representative for the ensemble
and an accurate representation of the base pairing pattern of the
macrostate. In Figure b we present the occupancy probabilities of each state at thermodynamic
equilibrium. By virtue of the fact that each sequence is simulated
at its corresponding melting temperature (Tm), the probability of the dissociated state (D)
is, by construction, approximately equal to the sum of the probabilities
over the remaining six states (H, 5S2, 3S2, 5S4, 3S4, F4). In Figure c we present a visualization
of the macrostate MSMs for each of the four sequences showing the
connectivity between the identified macrostates. The macrostates are
represented as orange circles in proportion to their equilibrium probabilities
and the gray arrows indicate the probability of hopping from one macrostate
to another under one time step of the kinetic model. The fluxes between
the macrostates provide a wealth of high-level, interpretable information
on the sequence-dependent metastable states and hybridization/dehybridization
pathways. Immediately, we identify that the AT-all sequence possesses
a rich and complex dynamical landscape comprising six metastable states
whereas at the other end of the spectrum GC-mix exhibits far simpler
two-state “all-or-nothing” behavior. In Figure d we present the so-called
implied time scales of each MSM. These time scales correspond to the
relaxation times of the DNA dimer among its constituent metastable
macrostates. The leading implied time scale for each system corresponds
to the characteristic time scale for hybridization/dehybridization.
Since each system is simulated at the same 7 mM concentration and
at its respective melting temperature, it is not surprising that the
leading time scale is approximately equal for all four systems and
corresponds to the characteristic time scale for hybridization/dehybridization.
The spectrum of higher order time scales corresponds to increasingly
quicker relaxations between the metastable macrostates within the
kinetic model and resolve the interesting sequence-dependent differences
in the hybridization/dehybridization kinetic pathways. The total number
of implied time scales is typically one fewer than the number of metastable
macrostates and the existence of large implied time scales is indicative
of slowly relaxing kinetic processes. The dense spectrum of slow implied
time scales for AT-all is indicative of its relatively complex kinetic
landscape whereas that for the two-state GC-mix comprises only a single
time scale corresponding to hybridization/dehybridization. We now
proceed to analyze in detail the sequence-dependent thermodynamics,
kinetic, and mechanisms exposed by the four MSMs.
Figure 2
Thermodynamic and kinetic
predictions of the sequence-specific
MSMs fitted at the sequence melting temperatures AT-all, 309 K; GC-end,
317 K; GC-core, 324 K; and GC-mix, 324 K. (a) Macrostates: Schematic
representation of the seven metastable macrostates occupied by one
or more of the four sequences: fully hybridized state (H), two-base
out-of-register shifted states in the 5′ (5S2) or 3′
(3S2) direction, four-base out-of-register shifted states in the 5′
(5S4) or 3′ (3S4) direction, and the fully dissociated state
(D). The line drawings represent the 10-base self-complementary sequences,
where red-to-blue contacts indicates (possible) WC base pairing and
black indicates an unbound bases. Adjacent to each line drawing we
provide representative molecular structures corresponding to that
macrostate. (b) Thermodynamics: Histograms reporting the number of
the 107 total frames within the 1 ms of simulation trajectories
observed to occupy each of the seven macrostates, corresponding to
our numerical estimates of the equilibrium occupancy probabilities.
Uncertainties are calculated across 100 MSMs using a Bayesian MSM
estimation are reported for each bar and are very small compared to
the total counts. Values are reported on a log y-axis to make the small populations of the shifted and frayed states
visible. (c) Kinetic networks: MSMs illustrating the kinetic network
for each sequence. The orange circles correspond to the macrostates
occupied by each sequence and are labeled by the macrostate codes
reported in panel a. The area of the circles is proportional to the
logarithm of the equilibrium occupancy populations reported in panel
b. Molecular renderings of an illustrative snapshot from the coarse-grained
molecular simulations are provided next to each macrostate. The gray
arrows between macrostates indicate the presence of a probability
flux between this pair of states at equilibrium and the arrow thickness
is proportional to the flux. (To avoid congesting the diagram, arrows
are not reported for probability fluxes lower than 3 × 10–6.) The numerical value overlaid on each arrow reports
the conditional probability that a system occupying the macrostate
at the start of the arrow at time t will transition
to the macrostate at the end of the arrow by time (t + τ), where τ = 1.2 ns is the lag time corresponding
to a single time step of the MSM. Large orange circles correspond
to thermodynamically favorable states and large gray arrows correspond
to kinetically favorable transitions. (d) Kinetic time scales. Distribution
of MSM implied time scales for each sequence. The leading implied
time scale corresponds to the characteristic time scale for hybridization/dehybridization
and is approximately equal for all systems since simulations were
conducted at the same concentration and at the respective melting
temperatures. The higher order implied time scales correspond to a
spectrum of kinetic relaxations between the constituent macrostates
in the MSM corresponding to shifted and/or frayed states.
Thermodynamic and kinetic
predictions of the sequence-specific
MSMs fitted at the sequence melting temperatures AT-all, 309 K; GC-end,
317 K; GC-core, 324 K; and GC-mix, 324 K. (a) Macrostates: Schematic
representation of the seven metastable macrostates occupied by one
or more of the four sequences: fully hybridized state (H), two-base
out-of-register shifted states in the 5′ (5S2) or 3′
(3S2) direction, four-base out-of-register shifted states in the 5′
(5S4) or 3′ (3S4) direction, and the fully dissociated state
(D). The line drawings represent the 10-base self-complementary sequences,
where red-to-blue contacts indicates (possible) WC base pairing and
black indicates an unbound bases. Adjacent to each line drawing we
provide representative molecular structures corresponding to that
macrostate. (b) Thermodynamics: Histograms reporting the number of
the 107 total frames within the 1 ms of simulation trajectories
observed to occupy each of the seven macrostates, corresponding to
our numerical estimates of the equilibrium occupancy probabilities.
Uncertainties are calculated across 100 MSMs using a Bayesian MSM
estimation are reported for each bar and are very small compared to
the total counts. Values are reported on a log y-axis to make the small populations of the shifted and frayed states
visible. (c) Kinetic networks: MSMs illustrating the kinetic network
for each sequence. The orange circles correspond to the macrostates
occupied by each sequence and are labeled by the macrostate codes
reported in panel a. The area of the circles is proportional to the
logarithm of the equilibrium occupancy populations reported in panel
b. Molecular renderings of an illustrative snapshot from the coarse-grained
molecular simulations are provided next to each macrostate. The gray
arrows between macrostates indicate the presence of a probability
flux between this pair of states at equilibrium and the arrow thickness
is proportional to the flux. (To avoid congesting the diagram, arrows
are not reported for probability fluxes lower than 3 × 10–6.) The numerical value overlaid on each arrow reports
the conditional probability that a system occupying the macrostate
at the start of the arrow at time t will transition
to the macrostate at the end of the arrow by time (t + τ), where τ = 1.2 ns is the lag time corresponding
to a single time step of the MSM. Large orange circles correspond
to thermodynamically favorable states and large gray arrows correspond
to kinetically favorable transitions. (d) Kinetic time scales. Distribution
of MSM implied time scales for each sequence. The leading implied
time scale corresponds to the characteristic time scale for hybridization/dehybridization
and is approximately equal for all systems since simulations were
conducted at the same concentration and at the respective melting
temperatures. The higher order implied time scales correspond to a
spectrum of kinetic relaxations between the constituent macrostates
in the MSM corresponding to shifted and/or frayed states.
Comparison of MSM Thermodynamic Predictions
with NN Model
We first compare the thermodynamic predictions
for the equilibrium macrostate probabilities made by the MSM models
fitted at the sequence 3SPN.2 melting temperatures to those of the
nearest neighbor (NN) model as a popular empirical model of DNA hybridization
thermodynamics. The NN model predicts the free energy of duplex formation
as a sum over helix initiation terms and the hybridization free energies
of nearest neighbor pairs of bases that account for both the specific
WC pairings and the modulating effects of the local (i.e., nearest
neighbor) environment.[8,9] The parameters of the NN model
were estimated by regressing over 108 experimental measurements to
furnish a predictive model for the free energy of association as a
function of DNA sequence and explicitly account for stacking contributions
of native pairs, internal mismatches, and dangling ends. We apply
the NN model to predict the free energy FNN of each of the macrostates occupied by each of the four sequences.
A full accounting of our application of the NN model is provided in
Figure S6 and supporting text within the Supporting Information. The free energy of each macrostate is related
to its equilibrium occupancy probability P via the
statistical mechanical relationship F = −kBT ln P + C, where T is temperature, kB is Boltzmann’s constant, and C is an additive constant reflecting our ignorance of the absolute
scale of free energies. We use this relationship to convert the equilibrium
occupancy probabilities predicted by our MSM and reported in Figure b into free energies FMSM. The unknown additive constants preclude
comparisons of absolute free energies between the MSM and NN model,
but it is legitimate to compare relative free energies between pairs
of macrostates since the additive constant cancels in taking differences.
As such, we arbitrarily set the additive constant C in both the MSM and NN model such that the hybridized state H defines
a zero free energy reference state and we report the stability of
all macrostates relative to the hybridized state as ΔFNN = FNN – FHNN for the NN model and ΔFMSM = FMSM – FHMSM for the MSM.As illustrated in Figure , we see that the MSM tends to predict higher free energies
for all macrostates relative to the H state compared to the NN model.
Although the trends are in qualitative agreement, what is the root
of the quantitative discrepancy of the MSM and NN models in the predicted
relative stabilities? First, the MSM is constructed bottom-up from
molecularly detailed 3SPN.2 simulations whereas the NN model is fitted
top-down by regression against experimental data. There are approximations
inherent in the 3SPN.2 model, not least of which is the coarse-grained
representation that integrates over atomic degrees of freedom, and
in the NN model that was fitted to limited experimental data assuming
a low-order expansion in terms of nearest neighbor additive contributions.
Second, although 3SPN.2 is expected to capture some dangling end stabilization
effects through base stacking and cross-stacking interactions, the
model was not parametrized to fully capture the enthalpic contributions
of the interaction of unbound bases with terminal base pairs, whereas
this term is explicitly included within the NN model. Third, a well-known
deficiency of the NN model is the absence of any treatment of inert
tails—free bases that extend beyond the dangling end tend to
destabilize the duplex.[124]
Figure 3
Comparison of the macrostate
free energy predictions of the MSMs
and nearest neighbor (NN) thermodynamic model at the sequence 3SPN.2
melting temperatures.[8,9] (a) Free energies of each macrostate
relative to the hybridized state ΔF = F – FH. We define the
hybridized state H to possess a free energy of zero and take care
to only compare relative free energies (i.e., ΔF) between the MSM and NN model. (b) Discrepancy between the macrostate
relative free energy predictions ΔΔF =
ΔFMSM – ΔFNN of the MSM relative to the NN model. The MSM tends
to predict higher relative free energies (i.e., lower occupancy probabilities)
relative to the hybridized state H compared to the NN model.
Comparison of the macrostate
free energy predictions of the MSMs
and nearest neighbor (NN) thermodynamic model at the sequence 3SPN.2
melting temperatures.[8,9] (a) Free energies of each macrostate
relative to the hybridized state ΔF = F – FH. We define the
hybridized state H to possess a free energy of zero and take care
to only compare relative free energies (i.e., ΔF) between the MSM and NN model. (b) Discrepancy between the macrostate
relative free energy predictions ΔΔF =
ΔFMSM – ΔFNN of the MSM relative to the NN model. The MSM tends
to predict higher relative free energies (i.e., lower occupancy probabilities)
relative to the hybridized state H compared to the NN model.We can further explore the role of inert tails
upon macrostate
stability by analyzing the AT-all and GC-end sequences that occupy
out-of-register shifted macrostates 5S2 and 3S2 comprising a one-base
inert tail and 5S4 and 3S4 comprising a three-base inert tail (cf. Figure a). For both sequences,
our simulations show that 5S2 is both the most stable of shifted states
relative to H (Figure a) and has the smallest discrepancy (∼4 kJ/mol AT-all, ∼10
kJ/mol GC-end) compared to NN predictions (Figure b). The 3S2 and 5S4 states have nearly the
same deviation from NN predictions (∼7 kJ/mol AT-all, ∼
15 kJ/mol GC-end), indicating that longer tails in the 5′ direction
are as destabilizing as shorter tails in the 3′ direction.
It is known from differential scanning calorimetry studies[125] that 3′ inert tails are more destabilizing
than 5′ tails, with the differential behavior attributed to
a combination of 5′ tails preferentially stacking on the core
duplex and 3′ tails perturbing the duplex structure.[124−126] Both the NN and MSM predictions for AT-all are consistent with this
trend (i.e., F3S2 > F5S2 and F3S4 > F5S4). For GC-end, the MSM and NN models both
predict the
out-of-register shifted states to be less stable relative to the hybridized
state than the corresponding predictions for AT-all. This is in line
with expectations since the terminal G:C pairs in GC-end decrease
by two the number of available WC pairings in out-of-register shifted
states compared to AT-all. The GC-end NN predictions run contrary
to the expectation that the 3′ inert tails should be more destabilizing
than the 5′ tails, whereas the MSM predictions are consistent
with this trend. Indeed, the MSM model for GC-end does not identify
the 3S4 macrostate as a metastable conformation for the duplex.In sum, the qualitative trends in the macrostate thermodynamic
stabilities are in good agreement between the MSM and NN models, but
show quantitative discrepancies for macrostates possessing inert tails.
In these instances the MSM predicts these macrostates to be less stable
relative to the hybridized state compared to the NN model predictions
by 4.0–14.5 kJ/mol. The MSM predictions are also consistent
with the experimental expectation that 3′ inert tails should
be more destabilizing than the 5′ tails, whereas the NN predictions
can be in conflict with this trend.
Out-of-Register
States Facilitate Hybridization
and Dissociation Dynamics (AT-All, GC-End)
In addition to
thermodynamic stabilities, the macrostate MSM also furnishes quantitative
and interpretable predictions of hybridization and dehybridization
pathways and mechanisms at the sequence melting temperatures. It should
be noted that because we are using a reversible MSM framework, detailed
balance is enforced by construction. We now proceed to analyze these
predictions for each of the four sequences and illuminate the relationship
between sequence and dynamics. Two of our sequences, AT-all and GC-end,
support out-of-register metastable states, and we commence our analysis
with the role of these shifted states.AT-all possesses the
richest and most complex MSM of the four sequences by virtue of its
repetitive nature, comprising a hybridized state (H), dissociated
state (D), and four out-of-register shifted states (5S2, 3S2, 5S4,
3S4) (Figure c). Analysis
of the MSM transition probabilities reveal a critical role of the
shifted states in mediating hybridization and dehybridization. Commencing
from the dissociated state D, we observe approximately equal probabilities
for transitions to each of the other five states, such that a transition
to one of the out-of-register shifted states 5S2, 3S2, 5S4, or 3S4
is approximately 2.2 times more likely than a direct transition to
the hybridized state H. Commencing from the hybridized state H, however,
a direct transition to the dissociated state is approximately 1.2
times more likely than a transition to one of the two-base shifted
states 5S2 or 3S2. Once in one of the four shifted states, the 5′
vs 3′ overhang and degree of shifting play an important role
in determining whether the duplex will transition to more shifted
states, more aligned states, or completely dissociate. Transitions
from more shifted states toward more aligned states (i.e., 5S4 →
5S2, 5S2 → H, 3S4 → 3S2, 3S2 → H) are approximately
an order of magnitude more probable than the reverse transitions from
more aligned states to more shifted states. The largest single transition
probability from the four shifted states 5S2, 3S2, 5S4, and 3S4 is,
however, back to the dissociated state D. Consistent with the higher
destabilizing effect of 3′ inert tails relative to 5′
tails,[124−126] the 3S4 → D transition probability
is twice as large as the 5S4 → D, and the 3S2 → D is
four times larger than the 5S2 → D. The transition probability
from the 5S2 and 3S2 states back to the dissociated state D is equal
to or greater than the transition probability to the hybridized state.
A transition path theory analysis of the MSM reveals that 33% of productive
hybridization trajectories D --> H (where the dashed arrow indicates
the combination of both direct and indirect pathways) and dehybridization
trajectories H --> D proceed through one or more out-of-register
shifted
states. Among these out-of-register hybridization pathways, the D
--> 5S2 --> H transition is predicted to occur 57% of the time.
A
mean first passage time (MFPT) analysis returns a MFPT for D -->
H
of 3.0 μs and for H --> D of 2.5 μs. As expected by
the
fact that the calculations are performed at the melting temperature,
the MFPTs are approximately equal.GC-end comprises the next
most complex MSM. The introduction of
the G:C pairs at the termini of the strands maximally preserves the
repetitive tract of A:T base pairings such that the GC-end MSM possesses
all of the same macrostates in its dynamical landscape with the exception
of the 3S4 state (Figure c). As discussed in section , the 3S4 state is rendered unstable within
the lag time of our MSM due to the presence of the destabilizing 3′
inert tail and only four WC base pairings compared to six in the case
of AT-all. Analysis of the transition probabilities reveal significant
differences compared to those in the AT-all kinetic network. Commencing
from the dissociated state D, we observe a similar transition probability
to the 5S4 state as for AT-all, but once in the 5S4 state there are
no significant transition probabilities to any other state except
back to D. As such, the 5S4 state acts as a kinetic trap rather than
as an intermediate to hybridization. The D → H and H →
D transition probabilities are commensurate with those for AT-all.
However, the D → 5S2 and D → 3S2 transition probabilities
are half or less of those in AT-all, and the reverse transitions are
an order of magnitude larger. This may be attributed to the reduced
thermodynamic stability of the 5S2 and 3S2 states in GC-end that comprise
only six WC pairs compared to eight in AT-all (cf. Figure ). The 5S2 → H and 3S2
→ H transition probabilities are more than an order of magnitude
larger than in AT-all, which may again be attributed to the lower
thermodynamic stability of the two shifted states relative to the
hybridized state H. Again, the transition probabilities out of the
3S2 state to D or H are comparatively higher than those out of the
5S2 state, consistent with the increased destabilizing effect of 3′
inert tails.[124−126] Commencing from the hybridized state H,
a direct transition to the dissociated state is approximately seven
times more likely than a transition to one of the two-base shifted
states 5S2 or 3S2. A transition path theory analysis of the MSM reveals
that only 7% of hybridization events D --> H and dehybridization
events
H --> D proceed through one or more out-of-register shifted states.
The significantly reduced role for out-of-register shifted states
in mediating the hybridization and dissociation pathways for GC-end
relative to AT-all is consistent with the reduced thermodynamic stability
of these states due to the elimination of possible out-of-register
WC base pairing for the terminal G:C pairs and therefore a reduced
accessibility of these states in the GC-end kinetic network. We compute
a MFPT for D --> H of 1.6 μs and for H --> D of 2.1 μs,
which are again approximately equal.The out-of-register kinetic
landscape that defines AT-all and GC-end
hybridization have been explored by a number of previous computational
studies. Simulations have identified internal displacement mechanisms
capable of correcting base pair alignment in 3SPN.2[18] as well as in the coarse-grained oxDNA[30] and BioModi[67] models. In all
cases, these mechanisms were shown to be crucial components of the
hybridization pathway for homogeneous and repetitive sequences. Xiao
et al. performed an all-atom energy landscape-based analysis of 5′-GGGGGG-3′
and 5′-GCGCGC-3′ hexamers.[17] Out-of-register states for 5′-GCGCGC-3′ hexamers were
identified as deep kinetic traps along the hybridization pathway and
“slithering” through these states did not provide a
significant hybridization pathway compared to an alternative “zippering”
mechanism. (In contrast, slithering through out-of-register shifted
states and zippering served as two parallel pathways for hybridization
of 5′-GGGGGG-3′.) This stands in contrast to our results
for our AT-all (5′-ATATATATAT-3′) sequence, in which
out-of-register states participated in 33% of productive hybridization
events. It is conceivable that the stronger hydrogen bonding in G:C
WC pairs relative to A:T pairs may render out-of-register shifted
states less favorable to hybridization by suppressing fluctuation-driven
rearrangements,[127,128] but additional studies would
be required to reconcile these observations.
Central
GC Placement Induces Long-Lived Frayed
States (GC-Core)
The GC-core MSM represents a departure from
the relatively rich and complex kinetic networks dominated by out-of-register
shifted states to a much simpler one dominated by fraying (Figure c). The MSM contains
only three states—hybridized H, dehybridized D, and frayed
F4. The F4 state is unique to GC-core and contains up to six WC pairs—the
two central G:C core pairs and as many as four A:T pairs on one side
or other of the core, while the other run of four A:T pairs remains
free. (As expected by symmetry, the particular AT run that is free
occurs with equal probability on either side of the core.) Although
partially frayed states containing less than four free A:T bases on
either end of the duplex are common, these tend to interconvert faster
than the lag time and are not registered as metastable within our
MSM. Our model reveals the absence of any direct hybridization or
dehybridization transitions between the H and D states, with all pathways
passing through the frayed state F4. Previous studies would suggest
that hybridization of this sequence should proceed via a zippering
mechanism, wherein upon formation of the strong central G:C WC base
pairings the duplex helix should rapidly assemble in a middle-out
fashion.[16,30] Our results are partially consistent with
this expectation, but reveal the frayed state F4 to be unexpectedly
metastable, serving as a long-lived state with an mean lifetime of
1.8 ns. The stability of the state is attributable to the enthalpic
stabilization offered from up to six WC pairs and the entropic stabilization
associated with the configurational entropy of the two free AT-tails.Analysis of the transition probabilities show that commencing from
the F4 state, progression to the hybridized state F4 → H is
25 times more likely than dissociation F4 → D. Thus, once a
D → F4 transition has occurred, a F4 → H transition
will likely proceed; concomitantly, H → F4 events tend to fall
back to the H state and are less likely to proceed to complete dissociation
D. We noted in section that fraying dynamics in the 3SPN.2 model appear to be significantly
accelerated relative to center-of-mass translation, and it is conceivable
that this may lead to elevated sampling of the F4 state within the
computational model relative to experiment and the induction of more
frequent dissociation. Moreover, since GC-core is the sequence most
prone to fraying, this effect could be the root of the relatively
poorer agreement of the kdslow response for GC-core compared to
the other sequences due to an artificially elevated computational
prediction of this rate (Figure a). Our model predicts a MFPT for D → F4 →
H of 3.4 μs and for H → F4 → D of 2.9 μs,
which are again approximately equal.Lattice models have previously
identified frayed states as putative
intermediates in DNA hybridization/dehybridization.[19,25,46] Araque et al. studied a 5′-ATGCGCAT-3′
octomer using a lattice model and identified a symmetrically A:T frayed
state as a crucial part of the duplex transition path.[25] We previously studied the four sequences that
are the subject of the present work using T-jump IR and 2D IR spectroscopy
and identified GC-core as possessing the highest deviation from two-state
behavior during dissociation when neglecting out-of-register contributions.[19] This result was interpreted to arise from loss
of A:T contacts and fraying around the central G:C core, and this
hypothesis was supported by lattice model calculations that predicted
the GC-core conformational ensemble to possess substantially more
frayed configurations than the other three sequences.[46] Follow-up T-jump measurements and Smoluchowski simulations
on model 1D free energy landscapes showed that AT termini fraying
was an effectively barrierless process characterized by rapid interconversion
between all accessible frayed states.[28] These prior results are consistent with the present findings that
expose the GC-core sequence to be the only sequence that occupies
the F4 frayed state and therefore the only one possessing a metastable
frayed state on time scales exceeding the τ = 1.2 ns lag time
of our MSMs.
Disruption of Repetitive
AT Tracts Promotes
Two-State “All-or-Nothing” Kinetics (GC-Mix)
The GC-mix sequence is the only one of the four sequences studied
that exhibits simple two-state “all-or-nothing” behavior.[17,19,25,26] This characterization is intended to reflect the mechanistic observation
that GC-mix MSM comprises just two states, the hybridized H and dehybridized
D (Figure c), indicating
that association and dissociation of the strands proceeds directly
without passing through any metastable intermediate states resolvable
under the τ = 1.2 ns lag time of our MSM. Indeed, the sequence’s
propensity to fray in simulation and the role of termini fraying in
dissociation experiments[19] indicate that
there remain deviations from “all-or-nothing” behavior
which are not fully captured within the MSM lag time. (We note that
“all-or-nothing” can also be used, somewhat ambiguously,
to refer to a two-state thermodynamic model where the hybridized and
dehybridized are separated by a large free energy barrier but that
the transitions between them may proceed through a network of metastable
states.) The two-state behavior appears to arise as a consequence
of the placement of the G:C pair that maximally disrupts the repetitive
AT tract within the decamer and destabilizing either out-of-register
shifted states or frayed states. We note that we do observe substantial
transient fraying of the terminal two-base AT tails within our dynamical
simulations, but these frayed states are not sufficiently thermodynamically
stable to produce a metastable macrostate within the resulting MSM.
This stands in contrast to the metastable F4 state populated by GC-core.
Our MSM predicts a MFPT for D → H of 2.9 μs and for H
→ D of 2.3 μs, which are again nearly equal.Given
the very simple two-state “all-or-nothing” behavior
of GC-mix and the absence of any intermediate metastable states, we
sought to interrogate our simulation trajectory data to elucidate
the hybridization and dehybridization mechanisms. To do so, we followed
all 10 intermolecular distances between native WC base pairs and tracked
their evolution through a number of hybridization and dehybridization
events. We present one representative example of each event in Figure and four more in Figure S7. During the hybridization process,
we observe a global decrease in all 10 distances as the strands approach
one another and the formation of key native WC contacts immediately
prior to duplex formation: specifically, one of the G:C WC pairs and
at least one neighboring A:T pair or 2–3 central A:T pairs.
This behavior is consistent with a “nucleation–zippering”
mechanism as has been reported in previous studies.[16,20,37,52] In dehybridization,
we observe fraying of one half of the duplex with the strands remaining
associated until the loss of the final A:T pairs and ultimately the
last G:C pair. Qualitatively, we observed some short-lived states
composed of two to four native WC base pair contacts immediately before
full dissociation occurs, but, in contrast to the F4 state we observe
in GC-core, these conformations do not constitute a metastable state
within our MSM nor do they tend to reform intact duplexes. These dissociation
dynamics are consistent with a “fraying–peeling”
dehybridization mechanism.[32,54,55] We observe that the principle of microscopic reversibility for a
molecular system at thermodynamic equilibrium imposes the conditions
of detailed balance and symmetry of the classical equations of motion
under time reversibility.[129] A consequence
of these considerations is that if our system is indeed at equilibrium,
then the “fraying–peeling” dehybridization mechanism
can be considered a reversal of the “nucleation–zippering”
hybridization pathway since the time-reversed simulation trajectories
represent an equally valid sampling of the system dynamics. This is
indeed borne out by our mechanistic observations for GC-mix wherein
the early stages of “nucleation–zippering” proceed
by the formation of one G:C contact followed by one or more additional
A:T pairs and the late stages by the formation of all remaining WC
pairs, which we compare with the early stages “fraying–peeling”
wherein one half of the duplex frays and the late stages wherein dissociation
finally completes by the dissolution of the last few A:T contacts
and the final G:C pair.
Figure 4
GC-mix hybridizes by nucleation–zippering
and dehybridizes
by fraying–peeling. Tracking of the 10 intermolecular distances
between native WC base pairs over the course of an (a) hybridization
event and (b) dehybridization event. Symmetrically permutable distances
(e.g., A1:T10 and T10:A1) are reflected across the x-axis to avoid congestion
in the plot. Circles superposed on the x-axis indicate
the instantaneous MSM state assignment as dissociated D (blue) or
hybridized H (orange). Hybridization tends to occur by a nucleation–zippering
mechanism, wherein a native G:C pair and adjacent A:T pair or 2–3
central A:T pairs first form prior to rapid formation of the duplex.
Dehybridization tends to occur by a fraying–peeling mechanism
wherein fraying of the two-base AT-tails on one or both sides of the
duplex precedes dissociation of the central native base pairs and
complete dissolution of the duplex. Four additional hybridization
events and four additional dehybridization events are presented in Figure S7.
GC-mix hybridizes by nucleation–zippering
and dehybridizes
by fraying–peeling. Tracking of the 10 intermolecular distances
between native WC base pairs over the course of an (a) hybridization
event and (b) dehybridization event. Symmetrically permutable distances
(e.g., A1:T10 and T10:A1) are reflected across the x-axis to avoid congestion
in the plot. Circles superposed on the x-axis indicate
the instantaneous MSM state assignment as dissociated D (blue) or
hybridized H (orange). Hybridization tends to occur by a nucleation–zippering
mechanism, wherein a native G:C pair and adjacent A:T pair or 2–3
central A:T pairs first form prior to rapid formation of the duplex.
Dehybridization tends to occur by a fraying–peeling mechanism
wherein fraying of the two-base AT-tails on one or both sides of the
duplex precedes dissociation of the central native base pairs and
complete dissolution of the duplex. Four additional hybridization
events and four additional dehybridization events are presented in Figure S7.
Long-Lived Metastable Shifted States Predicted
by the MSM Are Resolved by T-Jump IR
Finally, we sought to
validate the predictions of our sequence-dependent MSMs against experimental
T-jump IR spectroscopy. T-jump IR measurements commence from a low
temperature, apply a step jump in temperature, and track the relaxation
of the system to the dehybridized state. We hypothesized that the
influence of the out-of-register shifted states present in the AT-all
and GC-end sequences upon the system relaxation kinetics should manifest
in the slow and/or fast responses measured by T-jump IR. As discussed
in section , the
slow IR response is largely attributed to dissociation events and
the fast to terminal base fraying. With regards to the slow response,
our MFPT analyses of our MSMs predict out-of-register shifting events
(i.e., H → 3S2, 5S2, 3S4, 5S4) to proceed on microsecond time
scales, which are commensurate with the 1.4–2.9 μs time
scales for dehybridization (i.e., H → D) for each of the four
sequences. As such, we anticipate that the dynamical relaxations associated
with out-of-register shifted states proceed on similar time scales
to, and may not be distinguishable from, the relaxation to the dehybridized
state. Nevertheless, the presence of these out-of-register shifted
states in the low-temperature equilibrium ensemble prior to the T-jump
step may be observable via their influence on the fast T-jump IR response
attributable to fraying. Specifically, we hypothesize that the dangling
ends and inert tails present in the out-of-register shifted states
should promote a broader fraying response over the course of the relaxation
that is distinct from that of in-register fraying. This heterogeneity
of configurations should lead to heterogeneous dynamics, manifested
in the observation of a more stretched relaxation over experimental
time scales of 70–100 ns. Analysis of the MSM equilibrium distributions
(Figure b) reveals
10.0% of the equilibrium ensemble to reside in out-of-register shifted
states 3S2, 5S2, 3S4, and 5S4 for AT-all, compared to just 0.23% for
GC-end, and 0% for GC-core and GC-mix. It is our conjecture that a
substantial population of out-of-register shifted states in the pre-T-jump
AT-all ensemble should be distinguishable from the GC-end, GC-core,
and GC-mix as an elongation of the fast relaxation response associated
with terminal base fraying.We present in Figure our T-jump IR t-HDVE difference spectra
and corresponding normalized time traces at 1600 and 1660 cm–1. The signal at 1600 cm–1 corresponds to changes
in A and T ring vibrations while the signal at 1660 cm–1 contains contributions from G and T carbonyl vibrations. Each time
trace was fitted to the sum of a stretched exponential and two exponentials S(t) = A exp(−(t/τfast)β) + B exp(−t/τslow) + C exp(−t/τcool). The stretched exponential describes the
relaxation process from 5 ns to 1 μs, and the two exponentials
describe the signal increase from 1 to 320 μs and signal decay
from rehybridization induced by thermal relaxation back to the initial
temperature. The fitting parameters A, B, and C correspond to the relative amplitudes of
the three kinetic responses and the stretch factor βfast to the heterogeneity of fraying dynamics at the fast time scale.
A testable prediction of our hypothesis is that the long-lived out-of-register
shifted states in AT-all should result in a significantly smaller
value for the fitted βfast parameter (i.e., a more
stretched response) relative to those for GC-end, GC-core, and GC-mix.
This hypothesis was supported by the experimental time series at both
1600 cm–1, where βfastAT-all = 0.3 compared to βfastGC-end = 0.6, βfastGC-core = 0.7, and βfastGC-mix = 0.6, and 1660 cm–1, where βfastAT-all = 0.4 compared to βfastGC-end = 0.7, βfastGC-core = 0.6, and βfastGC-mix = 0.6. This result supports our hypothesis and
validates a testable prediction of our sequence-dependent MSMs.
Figure 5
T-Jump IR responses
reflect sequence-dependent conformational heterogeneity.
The final temperature for each measurement is AT-all, 320 K; GC-end,
319 K; GC-core, 327 K; and GC-mix, 326 K. (a) Mid-IR t-HDVE difference
spectra for GC-core at time delays from 5 ns to 560 μs. Normalized
time traces for each sequence are shown at (b) 1600 cm–1 and (c) 1660 cm–1. The signal at 1600 cm–1 corresponds to changes in A and T ring vibrations while the signal
at 1660 cm–1 contains contributions from G and T
carbonyl vibrations. Each time trace is fit to the sum of a stretched
exponential with two exponentials (solid lines): S(t) = A exp(−(t/τfast)β) + B exp(−t/τslow) + C exp(−t/τcool). The stretched exponential describes the
process from 5 ns to 1 μs, and the two exponentials describe
the signal increase from 1 to 320 μs and signal decay from rehybridization
induced by thermal relaxation back to the initial temperature. The
stretch factor βfast for the fits at 1600 and 1660
cm–1 are reported directly on the plots in panels
b and c.
T-Jump IR responses
reflect sequence-dependent conformational heterogeneity.
The final temperature for each measurement is AT-all, 320 K; GC-end,
319 K; GC-core, 327 K; and GC-mix, 326 K. (a) Mid-IR t-HDVE difference
spectra for GC-core at time delays from 5 ns to 560 μs. Normalized
time traces for each sequence are shown at (b) 1600 cm–1 and (c) 1660 cm–1. The signal at 1600 cm–1 corresponds to changes in A and T ring vibrations while the signal
at 1660 cm–1 contains contributions from G and T
carbonyl vibrations. Each time trace is fit to the sum of a stretched
exponential with two exponentials (solid lines): S(t) = A exp(−(t/τfast)β) + B exp(−t/τslow) + C exp(−t/τcool). The stretched exponential describes the
process from 5 ns to 1 μs, and the two exponentials describe
the signal increase from 1 to 320 μs and signal decay from rehybridization
induced by thermal relaxation back to the initial temperature. The
stretch factor βfast for the fits at 1600 and 1660
cm–1 are reported directly on the plots in panels
b and c.
Conclusions
We have conducted an integrated computational and experimental
study of the sequence-dependent kinetic mechanisms for the hybridization
and dehybridization dynamics of a family of four self-complementary
10-mer DNA oligomers: ATATATATAT (AT-all), GATATATATC (GC-end), ATATGCATAT
(GC-core), and ATGATATCAT (GC-mix). We conducted 1 ms of unbiased
coarse-grained molecular dynamics simulations at the melting temperature
of each sequence and employed deep learning techniques to construct
high-resolution Markov state models as predictive and interpretable
models of the sequence dependent dynamics. T-jump IR spectroscopy
was used to calibrate the kinetic time scales of the coarse-grained
molecular model and validate the kinetic prediction of the Markov
state models that the AT-all sequence should possess long-lived out-of-register
shifted states that are detectable within T-jump IR t-HDVE time traces.
Our results reveal that the specific placement of interrupting G:C
pairs within an otherwise repetitive AT sequence can have a profound
impact on the kinetic pathways and mechanisms for association and
dissociation of the DNA duplex. In particular, we found AT-all to
possess the richest and most complex kinetic landscape of the four
sequences that is dominated by out-of-register shifted states that
participate in 33% of complete hybridization events—pathways
leading from the dissociated state to full duplex formation—and
dehybridization events—pathways from the complete duplex to
full dissociation. Introduction of the G:C pairs at the end of the
strand maintains an eight-base-pair repetitive AT tract and the GC-end
kinetic landscape possess all but one of the same out-of-register
shifted states as AT-all. Destabilization of the GC-end shifted states
relative to AT-all, however, results in a far more limited participation
of these states with only 7% of GC-end hybridization and dehybridization
events passing through one or more shifted states. Placing the G:C
pairs in the center of the strand maintains two four-base AT tracts
on either side of the core and results in qualitatively different
kinetic behaviors for GC-core. In this case, no metastable out-of-register
shifted states are registered by our model with the hybridization
and dehybridization pathways all passing through a strongly metastable
frayed state in which one or other of the four-base AT-tracts is unbound
to produce two free AT-tails. Finally, placing the G:C bases between
the center and end of the strand to maximally disrupt the repetitive
AT tracts results in no metastable out-of-register or frayed states
for GC-mix and results in simple two-state “all-or-nothing”
hybridization/dehybridization behavior. Analysis of the specific pathways
reveals hybridization to largely proceed by a nucleation-zippering
mechanism and dehybridization to proceed by a fraying-peeling mechanism.The ordering of the computationally predicted kinetic landscapes
from most to least complex—AT-all > GC-end > GC-core
> GC-mix—is
largely dictated by sequence repetitiveness, specifically the number
of consecutive AT motifs. We note that this ordering differs from
our previously reported ordering in terms of deviation from two-state
behavior of GC-core > GC-mix > AT-all > GC-end.[19,28] We can understand these two apparently discrepant orderings by understanding
that the latter was deduced based on experimental analyses and lattice
models that did not account for out-of-register states and focused
largely on fraying behaviors. Indeed, under the assumption that fraying
is the dominant kinetic process relative to out-of-register shifting,
we can harmonize the predictions of the present work with our prior
work by eliminating all out-of-register shifted states in our fitted
MSMs (Figure c), in
which case we find GC-core to contain an F4 frayed intermediate and
the remaining sequences to all have simple two-state dynamics such
that the predicted ordering is GC-core > GC-mix ≈ AT-all
≈
GC-end.In sum, our results demonstrate the profound effect
of sequence
upon the kinetic landscapes, metastable states, and hybridization/dehybridization
mechanisms of short DNA oligomers. Our analysis of this small family
of sequences expose preliminary design principles for the (meta)stability
of out-of-register and frayed states but we anticipate much greater
richness in the landscapes will emerge with studies of longer and
more diverse sequences. Going forward, we will extend this work to
discern more general trends in sequence-dependent hybridization/dehybridization
for a wider range of oligomer sequences and motivate strategies for
experimental comparisons. We also suggest that the MSM approach followed
in this work, possibly coupled with biased sampling and reweighting
techniques,[94,130] may be well-suited to expose
changes in the hybridization/dehybridization mechanism and kinetics
as a function of temperature. We anticipate that these insights may
provide foundational design rules by which to improve understanding
of in vivo hybridization processes and rationally
engineer optimized sequences for DNA nanotechnology applications such
as DNA-PAINT[7] and DNA barcoding.[6]