Javier Jarillo (1), Borja Ibarra (2), Francisco Javier Cao-García (2, 3)
(1) University of Namur, Institute of Life-Earth-Environment, Namur Center for Complex Systems, Rue de Bruxelles 61, 5000 Namur, Belgium.
(2) Instituto Madrileño de Estudios Avanzados en Nanociencia, IMDEA Nanociencia, C/ Faraday 9, 28049 Madrid, Spain.
(3) Departamento de Estructura de la Materia, Física Térmica y Electrónica, Universidad Complutense de Madrid, Pza. de Ciencias, 1, 28040 Madrid, Spain.
Abstract
DNA replication is a key biochemical process of the cell cycle. In recent years, analysis of in vitro single-molecule DNA replication events has provided new information that cannot be obtained from ensemble studies. Here, we introduce crucial techniques for the proper analysis and modelling of in vitro single-molecule manipulation data on DNA replication. Specifically, we review some of the main methods to analyze and model the real-time kinetics of the two main molecular motors of the replisome: DNA polymerase and DNA helicase. Our goal is to facilitate access to and understanding of these techniques, to promote their use in the study of DNA replication at the single-molecule level. A proper analysis of single-molecule data is crucial to obtain a detailed picture of, among others, the kinetic rates, equilibrium constants and conformational changes of the system under study. The techniques presented here have been used, or can be adapted, to study the operation of other proteins involved in nucleic acid metabolism.
DNA is the biological polymer carrying the genetic instructions for life. DNA replication (or duplication) is essential for biological inheritance: it ensures that upon cell division the two new daughter cells contain the same genetic information as the parent cell [10]. DNA is made up of a double helix of two complementary strands that are replicated synchronously, in a process known as semiconservative replication [63]. This process implies that each strand of the parental DNA molecule serves as a template to produce its complementary counterpart. A complex and highly dynamic protein machinery, referred to as the replisome, is in charge of the robust and accurate DNA replication needed for cell survival [6], [39] (Fig. 1). The core components of prokaryotic and eukaryotic replisomes are the replicative DNA polymerases, the helicase-primase and the single-stranded DNA binding proteins (SSBs). Depending on the organism, a variety of other proteins associate transiently and coordinate their activities with the core elements to carry out DNA replication [10], [34]. Replicative DNA polymerases synthesize the new complementary strand of DNA by the stepwise addition of the corresponding complementary nucleotides (dNTPs) [94]. These enzymes are designed to maintain low mutation rates, so they incorporate a wrong nucleotide only rarely. This fidelity is further enhanced by a factor of $10^2$–$10^3$ by their associated exonuclease activities, which hydrolyze the mismatched nucleotide from the 3′ end of the growing strand [53], [5]. In addition, many replicative DNA polymerases present an intrinsic strand displacement activity (the ability to displace downstream DNA encountered during synthesis, without the help of a helicase) [15]. Replicative helicases form ring-shaped structures that encircle one (or two) of the DNA strand(s) and use the chemical energy of NTPs to unwind the DNA fork in coordination with the DNA polymerase [83], [62], [82] (Fig. 1).
The SSB proteins bind with high affinity to single-stranded DNA and constitute the nucleoprotein complex upon which the other replisome components work [93], [31]. Replicative DNA polymerases and helicases work as molecular motors that harness chemical and thermal energies to generate unidirectional mechanical motion. All molecular motors operate at energies comparable to those of the thermal fluctuations [50], [13]. Therefore, they experience continuous agitation by random Brownian motion, which is eventually reflected in fluctuations in their real-time kinetics. Biological molecular motors have evolved to take advantage of these Brownian fluctuations, which they couple with chemical potentials (dNTPs or NTPs) to achieve unidirectional rather than random motion and high rates; for example, some DNA polymerases can synthesize DNA at a rate of 500–1000 nt/s [87].
Fig. 1
Simplified view of the core components of the mitochondrial DNA replisome. The helicase opens the DNA fork, separating the two strands. The leading strand is replicated directly by the DNA polymerase, and the lagging strand is initially bound by SSB proteins for later replication. In other replisomes, several primase subunits usually associate with the helicase to prime replication of the lagging strand, which is replicated in the opposite direction in the form of short Okazaki fragments (not shown).
In the last two decades, the advent of in vitro single-molecule techniques has allowed researchers to explore, for the first time, the molecular mechanisms that govern the operation of many proteins involved in DNA replication, including DNA polymerases and helicases [64], [99], [26], [68], [11]. In particular, in vitro single-molecule manipulation techniques, such as optical and magnetic tweezers, have provided mechanistic information about the inner workings of these systems that cannot be obtained with ensemble techniques. Briefly, single-molecule detection opens the possibility of following the activity of individual biological molecular motors, such as DNA polymerases and helicases, in real time. In this way, it is possible to detect and quantify transient features of the reaction, such as rare events and heterogeneous behaviour [67], [30], [65]. This possibility is instrumental in unveiling the complex dynamics of these biological motors. The position of the biological motor is measured indirectly through the position of a microsphere (polystyrene bead) linked to the motor or to its substrate (e.g., DNA), with a resolution of 1–10 nm. The beads are manipulated by the trapping or manipulation field (optical or magnetic), which makes it possible to measure and apply controlled mechanical forces on the order of piconewtons (pN), Fig. 2A and D.
Thus, in vitro single-molecule manipulation techniques allow measuring the tiny mechanical forces generated during the course of a reaction and applying controlled mechanical forces directly on it. Note that mechanical force is a byproduct of the reaction of molecular motors. The goal of measuring and applying external forces to these systems is to determine the magnitude of the rates and free energies of the mechanical steps of their reaction. In this way, a detailed picture of the mechanical coordinate, and its relation with the chemical coordinate of the reaction (mechano-chemistry) can be obtained [50].
Fig. 2
A) Example of a primer extension (p.e.) DNA replication experiment with optical tweezers. The two ends of a dsDNA molecule containing a ssDNA gap in the middle are attached between two micron-sized beads: one (grey sphere) held by the optical trap (highly focused laser, red) and the other (blue sphere) held by suction on top of a micropipette. As replication proceeds, ssDNA is converted into the more rigid dsDNA, changing the distance between the beads. The change in distance is registered and later processed to obtain the polymerase trajectory (template position versus time). B) Representative replication traces showing transient pause events (red) interspersed with active replication events (black). The inset shows the velocity histogram. C) Pause length frequency distributions. Depending on the experimental conditions, the pause length frequency distribution can be compatible with a single (red line) or a double exponential distribution (green line). Other distributions are also possible. D) Diagram of a magnetic tweezers experiment measuring the DNA unwinding activity of a single replicative helicase. One of the ends of the dsDNA (bearing a helicase loading site) is attached to a glass surface and the other end to a super-paramagnetic microsphere manipulated by the magnetic field. E) Representative unwinding trace showing the binning in displacement and illustrating the determination of first passage times. F) First-passage time (FPT) distribution. Experimental data (blue circles) are fitted by a model with pauses and forward and backward stepping (blue line). Predicted FPT distributions without pauses (orange) and with only forward stepping (green) are shown for comparison. G) Experimental configuration to measure the replication of ssDNA covered with SSB with optical tweezers. When the replication velocity is slow, direct identification of pauses and of the maximum replication velocity is not possible, either from the traces or from the velocity histogram. H) Identification of peak velocities with prominence greater than $P$ (which avoids the selection of secondary peaks). Traces are averaged over a time window $\Delta t$; the instantaneous velocity is then plotted versus time to identify the velocity peaks. I) The mean of the velocity peaks is computed for each prominence $P$ and for different time windows $\Delta t$. For an intermediate value of the prominence, a clear plateau is present in the plot, whose value gives the maximum velocity $v_{\max}$. (Panel A in this Figure is adapted from Ref. [43]; Panels B and C are from Ref. [69]; Panels D, E and F are from Ref. [11]; Panels G, H and I are adapted from Ref. [17].) (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The spatial resolution of single-molecule manipulation techniques is limited by drift and thermal noise [35], [21], [56], [37], [81]. Thermal fluctuations affect both the instrument and the sample and set a fundamental limit to the resolution of a given experiment. Because molecular motors operate at energy levels comparable to those of thermal motion, single-molecule manipulation data often present low signal-to-noise ratios [13]. In addition, in the case of DNA replication studies, the flexibility of the DNA substrate increases the noise level of the data significantly [35], [14].

Low signal-to-noise ratio data require special data analysis techniques, which typically accumulate evidence (locally or globally) to obtain accurate and unbiased kinetic information. The optimal extraction of the kinetic information from the stochastic operation of individual DNA-based molecular motors is still an open theoretical challenge. Here, we aim to provide a perspective on some relevant techniques for the analysis (Section 2) and modelling (Section 3) of DNA replication and DNA unwinding activities measured with in vitro single-molecule manipulation techniques.
Data analysis techniques extract the main phenomenological information from the individual trajectories. Models predict relations between the observations; comparing them with the observed relations enables the identification of the underlying processes. Thus, models provide insight into the mechanisms and improve the predictability of the phenomena. Here, we review fundamental models of DNA unwinding and replication (in different conditions and in the presence or absence of ligands), as well as model selection criteria. We center our description on methods used to analyze and model in vitro single-molecule manipulation data on DNA replication. Related data analysis and modelling techniques have been used to study the real-time kinetics of other nucleic acid-based molecular motors at the single-molecule level, such as RNA polymerases [33], [32], [25] and ribosomes [103], [23]. Data analysis and modelling of stochastic molecular motor trajectories have both benefited from and contributed to developments in statistical physics [84]. This fruitful interaction will continue (Section 4).
Data analysis
Single-molecule manipulation techniques provide a way to measure properties of individual molecules (e.g., position, orientation, end-to-end extension), which can be used as reaction coordinates to follow the evolution of the molecule along a reaction pathway in real time. Using the DNA extension as a reaction coordinate, these techniques allow monitoring the activity of replicative DNA polymerases as they convert single-stranded DNA (ssDNA) to double-stranded DNA (dsDNA), Fig. 2A [57], [104], [43], [58], [78], and of replicative DNA helicases as they unwind dsDNA to ssDNA, Fig. 2D [47], [55], [60], [88], [95], [89], [96], [11], [48].

For example, in optical and magnetic tweezers experiments such as those shown in Fig. 2A and D, the experiment provides the measured DNA distance $x(t)$ as a function of time at the given tension $f$. The force-extension curves of ssDNA, $x_{ss}(f)$, and of dsDNA, $x_{ds}(f)$, expressed as extension per nucleotide or base pair, provide the information required to convert the changes in distance between the beads to replicated nucleotides [14]. The conversion factor depends on the experimental configuration. With the experimental configuration shown in Fig. 2A, the number of nucleotides replicated is given by $N(t) = [x(t) - x(t_0)] / [x_{ds}(f) - x_{ss}(f)]$, as each replication step involves the conversion of one ssDNA nucleotide to its dsDNA configuration ($t_0$ is the initial time of primer extension DNA replication). In turn, the number of nucleotides unwound by a single helicase using the experimental configuration shown in Fig. 2D is given by $N(t) = [x(t) - x(t_0)] / [x_{ss}(f) - x_{ds}(f)]$, as it involves the transformation of one base pair of dsDNA into one nucleotide of ssDNA along the pulling coordinate ($t_0$ is the initial time of DNA unwinding). Other experimental configurations involve other conversions. For example, if the ssDNA is covered by SSB protein, the force-extension curve of the ssDNA-SSB complex should be characterized and used instead of the ssDNA force-extension curve $x_{ss}(f)$. See Fig. 2G [17].

The first and most direct information we can obtain from a replication trajectory is the polymerase mean velocity $\langle v \rangle$ (usually expressed in nucleotides (nt) per unit of time, e.g., nt/s) and the processivity $N_p$ (the number of nucleotides replicated before detachment). The mean velocity can also be restated as the mean residence time per nucleotide, $\tau = 1/\langle v \rangle$ (units of time per nucleotide). The detachment time is then given by the product $N_p \tau$, and the detachment rate by $1/(N_p \tau)$.

Additionally, DNA replication trajectories present pauses, or transient inactive states, that alternate with active replication events (see Fig. 2B). However, when the polymerization rate is low, these two states cannot be easily disentangled due to the noise in the trajectory. The main source of experimental noise is the high flexibility of the ssDNA polymer at the low tensions relevant to the study of DNA unwinding and DNA replication (<10 pN). We briefly describe here the main techniques to separate the pause and active state contributions in the trajectory.

When pauses can be identified directly from the trajectory plot (as in Fig. 2B), we can obtain information on the frequency and duration of pauses. Pause identification is performed by direct statistical methods, such as those cited in Section 2.1. When transitions from the pause to the active state are difficult to identify on a trajectory, one of the pause identification methods, the velocity histogram, can still provide an estimate of the fraction of time spent in the pause and in the active state, as described in Section 2.2. A typical velocity histogram shows two peaks (inset of Fig. 2B): one for the pause state (centered at a velocity close to zero) and the other for the active state (centered at a velocity greater than zero). The latter peak corresponds to the maximum velocity, or velocity without pauses. However, when the signal-to-noise ratio decreases, the velocity histogram cannot disentangle the pause and active state contributions.
In these cases, the prominence method makes it possible to estimate the active or maximum velocity [17]. The idea of the prominence method is to identify a characteristic velocity that is higher than the mean velocity and relatively independent of the time interval used to compute the velocity. This method is described in Section 2.4. Section 2.3 describes an alternative to the velocity histogram, the first passage time distribution, based on binning in position and building a histogram of first passage times. Section 2.5 comments on the Bayesian approach to data analysis, which is more model-dependent.
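The conversion from bead distance to replicated nucleotides and the summary quantities introduced at the start of this section can be sketched in Python as follows. This is a minimal illustration, not code from the cited works: the function names, the example extensions per nucleotide used in the usage note, and the assumption that the trajectory ends at detachment are all choices made here for illustration.

```python
def nucleotides_replicated(distances, x_ss, x_ds):
    """Convert bead-to-bead distances (nm) into replicated nucleotides.

    x_ss and x_ds are the extensions per nucleotide / base pair (nm) of
    ssDNA and dsDNA at the working tension, read from their
    force-extension curves (assumed to be known by the caller).
    """
    step = x_ds - x_ss  # distance change per replicated nucleotide
    if step == 0:
        raise ValueError("ssDNA and dsDNA extensions coincide at this tension")
    x0 = distances[0]
    return [(x - x0) / step for x in distances]


def trajectory_summary(times, nucleotides):
    """Mean velocity, residence time per nucleotide and detachment rate.

    Assumes (for illustration) that the trajectory starts at the onset of
    replication and ends when the polymerase detaches.
    """
    processivity = nucleotides[-1] - nucleotides[0]   # nt
    duration = times[-1] - times[0]                   # s
    mean_velocity = processivity / duration           # nt/s
    residence_time = 1.0 / mean_velocity              # s/nt
    detachment_time = processivity * residence_time   # s
    return {
        "processivity": processivity,
        "mean_velocity": mean_velocity,
        "residence_time": residence_time,
        "detachment_rate": 1.0 / detachment_time,
    }
```

For instance, with hypothetical extensions of 0.20 nm per ssDNA nucleotide and 0.30 nm per dsDNA base pair, a 20 nm increase in distance over 2 s corresponds to about 200 nt replicated at a mean velocity of about 100 nt/s. For the unwinding configuration of Fig. 2D, the two extensions would be swapped in the denominator.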
Direct identification of pauses
In high signal-to-noise ratio trajectories (such as those shown in Fig. 2B), pauses can be identified using several methods, such as plateau identification techniques, step-fitting algorithms or velocity threshold algorithms [16]. The signal-to-noise ratio can sometimes be increased by averaging over a sliding time window [98], [90], [1], [16]. Time averaging helps to reduce the high-frequency noise at the expense of a reduction in time and position resolution.

Plateau identification techniques identify pauses directly as the constant-position intervals in the trajectory (see Fig. 2B). During a pause, the next position (or a mean of the next positions) is not significantly different from the mean of the previous positions [16]. This observation provides a pause identification method that compares the mean and standard deviation of the previous points with those of the next points, which allows the data processing to be automated. Additionally, step-fitting algorithms seek to fit the trajectory with the optimal number of steps [52], [92], when the resolution is high enough for pause identification.

Velocity threshold algorithms [42], [16] identify pauses as sections of the trajectory where the local velocity is below a certain value. One effective way to select an appropriate velocity threshold is to build a velocity histogram [43], [69]. For high signal-to-noise trajectories (as in Fig. 2B), the velocity histogram presents two clear peaks (as in the inset of Fig. 2B): one centered around zero velocity, identified as the pause state contribution, and another centered around a nonzero velocity, identified as the active state contribution. When the system is in a pause state, the average velocity is zero (with deviations due to the noise in the trajectory, which set the width of the peak). In contrast, in the active state, there is a finite mean velocity due to the stepping of the motor.
(The deviations from the mean active velocity are due to the stochastic nature of the stepping process and to the noise in the trajectory.) The valley of the velocity histogram between the two peaks provides the threshold velocity, which is then used to identify the two states along the trajectory (active and pause states). Additionally, peaks at negative velocities may appear when exonuclease events during DNA synthesis are significant [58], [78].

Pause identification allows computing other magnitudes that characterize the trajectory. In particular, the fraction of time the polymerase (or helicase) is in the active state is called the moving probability, $p_m$. We can also determine the active and pause state contributions to the mean residence time per nucleotide $\tau$. The active time per nucleotide is given by $\tau_a = p_m \tau$, and the pause time per nucleotide by $\tau_p = (1 - p_m)\tau$. We can also define the replication velocity, or maximum replication velocity, $v_{\max} = 1/\tau_a$, as the average replication velocity in the active state.

Pause identification provides a great deal of information about pause behavior and its tension dependence. Calculation of the pause length frequency distribution reveals whether one or more characteristic pause durations are present in the trajectory (Fig. 2B, C). First, we define the points of the pause duration binning, $\{t_i\}$. Then, we compute the pause length frequency distribution for a trajectory (or set of trajectories) as

$$F(\bar{t}_i) = \frac{n_i}{T_{act}\,(t_{i+1} - t_i)}, \qquad (1)$$

where $n_i$ is the number of pauses of duration between $t_i$ and $t_{i+1}$ and $T_{act}$ is the total time spent in the active state. Per unit of pause duration, it provides the frequency of entering a pause of duration between $t_i$ and $t_{i+1}$ when the polymerase is in its active state. ($\bar{t}_i$ is estimated as the mean duration of the pauses of duration between $t_i$ and $t_{i+1}$.) This procedure also allows a non-uniform binning in the pause duration, which is required when there are two or more characteristic pause durations of different orders of magnitude. For example, in Fig. 2C a smaller bin size is used for small pause durations (reflecting short pauses in the trajectories of Fig. 2B), while a larger bin size is used for large pause durations (reflecting long pauses in the trajectories of Fig. 2B). This non-uniform binning makes it possible to resolve the distribution adequately for short pauses, and to have enough counts for a reliable value in each bin for long pauses (Fig. 2C). Pause length frequency distributions are also called distributions of pause durations, waiting-time distributions and dwell-time distributions [54].

The pause length frequency distribution, Eq. (1), fits to the sum of one or several exponentials, depending on the number of characteristic pause durations present in the distribution [44]. For example, when two characteristic pause durations are present (as in Fig. 2C), the pause length distribution fits to

$$F(t) = \frac{f_1}{\tau_1} e^{-t/\tau_1} + \frac{f_2}{\tau_2} e^{-t/\tau_2},$$

where $f_i$ is the pause frequency of pause $i$ (i.e., the frequency of entering a pause state of type $i$), and $\tau_i$ is the characteristic pause duration, or average pause length, of pause $i$. These fit parameters are related to the entry and exit rates to (and between) pauses, as discussed below in Section 3.1. When the pause duration distribution is plotted in log-scale (as in Fig. 2C), the number of characteristic pause durations becomes apparent as the number of nearly straight sections in the plot (provided the distribution is well resolved and the characteristic pause durations are different enough). When the number of characteristic pause durations to use in the fit is unclear, one can resort to likelihood-based model comparison techniques such as the Akaike information criterion (AIC) [3], [12]. Additionally, the average pause length and the pause frequency can depend on the force applied to the DNA or on other relevant conditions of the DNA replication experiment (such as whether the base pair at the fork is GC or AT [69]).
The dependence of pause frequency and duration on force informs about the nature of the pause state (as discussed below in Section 3.1).

The magnitudes introduced up to this point constitute a complete phenomenological description of the observed replication velocities and pauses. This level of detail in the pause description is not always possible. In the next subsections, we address how to deal with noisier trajectories, where a detailed pause identification and description is not possible.
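As a rough sketch of two of the steps described in this subsection, the snippet below identifies pauses with a velocity threshold and then builds a pause length frequency distribution on non-uniform bins. The function names are hypothetical, and the normalization (counts divided by the total active time and by the bin width) is an assumption about how the distribution is defined, chosen so that bins of different width are comparable.

```python
def find_pauses(times, positions, window, v_threshold):
    """Return (start, end) times of pauses found with a velocity threshold.

    The local velocity is computed over `window` samples; consecutive
    intervals with velocity below `v_threshold` are merged into one pause.
    """
    pauses = []
    start = end = None
    for i in range(len(times) - window):
        dt = times[i + window] - times[i]
        v = (positions[i + window] - positions[i]) / dt
        if v < v_threshold:
            if start is None:
                start = times[i]
            end = times[i + window]
        elif start is not None:
            pauses.append((start, end))
            start = None
    if start is not None:
        pauses.append((start, end))
    return pauses


def pause_length_frequency(durations, bin_edges, active_time):
    """Pause length frequency distribution on (possibly non-uniform) bins.

    Returns one (mean_duration, frequency_density) pair per non-empty bin,
    with frequency_density = count / (active_time * bin_width).
    """
    result = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [d for d in durations if lo <= d < hi]
        if in_bin:
            mean_duration = sum(in_bin) / len(in_bin)
            density = len(in_bin) / (active_time * (hi - lo))
            result.append((mean_duration, density))
    return result
```

The pause durations are obtained as `end - start` for each pair returned by `find_pauses`, and the active time as the trajectory duration minus the summed pause durations. The resulting points can then be fitted to one or several exponentials with any least-squares routine.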
Velocity histogram
When the direct identification of pauses is not possible, a histogram of the instantaneous velocities can still provide information on the fraction of time that the system is in pause [43]. Typical velocity histograms present two peaks: one centered around zero velocity, the pause peak, and another centered around a nonzero velocity, the active peak. See, for example, the inset of Fig. 2B.

When these two peaks are well resolved, we can identify the counts of the pause states and those of the active states. In this case, individual pause identification is also possible, as each velocity count corresponds to a time in the trajectory (see previous subsection). When the two peaks cannot be well resolved, we can resort to methods that help to obtain a better resolved velocity histogram. One method is to average the data over a sliding time window, which averages out the (high-frequency) experimental noise. It is important to optimize the width of the averaging time window: if it is too wide, we lose relevant information, while if it is too narrow, the experimental noise is still present. Another method is to compute the instantaneous velocity over a larger time window, which increases the signal-to-noise ratio of the velocity. Note that if the velocity time window is too wide, it will average out pause and active state velocities, while if it is too narrow, it will not change the signal-to-noise ratio of the velocity significantly. The averaging time window and the velocity time window are optimized to maximize the resolution of the two peaks in the velocity histogram.

Even after optimizing the resolution of the two peaks, an overlap between them may remain; see, for example, the inset of Fig. 2B. In the overlap region, it is unclear which counts correspond to active states and which to pause states.
If the minimum between the two peaks is low, the two peaks are well resolved and the location of the minimum can be used as a pause identification criterion along the trajectory (as described previously in Section 2.1). If the minimum is high, pause identification is not possible, but we can still estimate the fraction of time in each state from the velocity histogram. The most elementary method is to find the location of the minimum between the peaks and assign the counts below it to the pause state and the counts above it to the active state. A more elaborate method is to fit the velocity histogram to the sum of two Gaussian functions. The areas under the two Gaussians are proportional to the pause and active state probabilities, respectively. The quotient between the pause area (or counts), $A_p$, and the active state area (or counts), $A_a$, gives the ratio between the pause and the active time per nucleotide, $\tau_p/\tau_a = A_p/A_a$. With this information we can compute the moving probability, $p_m = A_a/(A_a + A_p)$, and also the (maximum) replication velocity, $v_{\max} = \langle v \rangle / p_m$.
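The two-Gaussian decomposition can be illustrated with the sketch below. Note that it swaps in a different fitting technique from the one described above: instead of a least-squares fit to the binned histogram, it fits a two-component Gaussian mixture directly to the velocities with the expectation-maximization (EM) algorithm, whose component weights play the role of the normalized areas. All names, including the `simulate_velocities` helper that generates synthetic data, are hypothetical.

```python
import math
import random


def simulate_velocities(n_pause, n_active, v_active, noise, seed=0):
    """Synthetic velocities: a pause peak at 0 and an active peak at v_active."""
    rng = random.Random(seed)
    return ([rng.gauss(0.0, noise) for _ in range(n_pause)]
            + [rng.gauss(v_active, noise) for _ in range(n_active)])


def fit_two_gaussians(velocities, n_iter=100):
    """EM fit of a two-component Gaussian mixture.

    Returns [(weight, mean, std), (weight, mean, std)] ordered by mean, so
    the first entry is the pause-like peak and the second the active one.
    """
    lo, hi = min(velocities), max(velocities)
    spread = (hi - lo) or 1.0
    means = [lo + 0.25 * spread, lo + 0.75 * spread]
    stds = [spread / 4.0, spread / 4.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each velocity
        resp = []
        for v in velocities:
            dens = [w / (s * math.sqrt(2.0 * math.pi))
                    * math.exp(-0.5 * ((v - m) / s) ** 2)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means and standard deviations
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(velocities)
            means[k] = sum(r[k] * v for r, v in zip(resp, velocities)) / nk
            var = sum(r[k] * (v - means[k]) ** 2
                      for r, v in zip(resp, velocities)) / nk
            stds[k] = max(math.sqrt(var), 1e-9)
    order = sorted(range(2), key=lambda k: means[k])
    return [(weights[k], means[k], stds[k]) for k in order]


def moving_probability(pause, active):
    """p_m = A_a / (A_a + A_p), using the mixture weights as areas."""
    return active[0] / (active[0] + pause[0])
```

Given the fitted components, the maximum replication velocity follows either from the mean of the active component or from the mean velocity divided by the moving probability. The sketch assumes that both peaks are populated; it is not meant to be robust to strongly overlapping or single-peaked histograms.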
First passage time distribution
The velocity histogram method takes equal time bins on the trajectory and obtains the different displacements (in nucleotides), and therefore velocities, along the trajectory. Instead, the first passage time distribution method bins in displacement (Fig. 2E), giving, for a fixed displacement, the different first passage times along the trajectory. The observed first passage time distribution is then analyzed to extract the replication rate, the number of pauses and their characteristics [22], [27], [11].

The theoretical first passage time (FPT) distribution for a single step of a molecular motor stepping forward at a rate $k$ is given by an exponential distribution,

$$\psi_1(t) = k\, e^{-kt}.$$

It gives the probability density of a single step occurring at a time $t$ (after the previous one). However, in practice, the position noise is frequently higher than the single-step length. Hence, the displacement bins should be large enough (larger than the typical position noise) to prevent noise from dominating the first passage time distribution. For a binning of $n$ steps, the FPT distribution is given by the probability density that $n$ forward steps require a time $t$,

$$\psi_n(t) = \frac{k^n\, t^{n-1}}{(n-1)!}\, e^{-kt},$$

which is a gamma distribution [86], [27]. When the molecular motor additionally takes backsteps (backward stepping rate $k_b > 0$), the FPT distribution becomes wider. (In this latter case, its analytic expression can be stated in terms of modified Bessel functions of the first kind [22], [11].) Entering pause states increases the time spent to advance the displacement bin length, increasing the probability of long first passage times. Thus, FPT distributions with large tails at long first passage times reflect the presence of one or several pauses (or backstepping). See Fig. 2D, E, F. Each interval of the FPT distribution can be approximated by a single contribution (forward stepping, pause), provided the characteristic times (pause, forward stepping) are well separated.
This approach has allowed the identification of pause states in the operation of a DNA helicase and an RNA polymerase [27], [11].

Both the FPT distribution and the velocity histogram methods resort to local information in the trajectory after a binning (in space or time). These methods are limited to cases where the time scales are well separated. The prominence method and the Bayesian methods explained below have proven to be more effective in cases where pause identification is difficult. Their strength is that they use information on the sequence of events, in addition to the (averaged or binned) local information.
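The first passage time analysis can be sketched as follows: one function extracts empirical first passage times from a trajectory binned in displacement, and another evaluates the gamma (Erlang) distribution expected for forward-only stepping. The function names and the fixed-grid binning are illustrative choices, and the backstepping and pause contributions discussed above are not included.

```python
import math


def first_passage_times(times, positions, bin_size):
    """Times needed to first advance each successive `bin_size` displacement.

    Uses a fixed grid of targets starting at the initial position. If one
    sample crosses several targets at once, the extra crossings are
    recorded with zero duration (a time-resolution artefact).
    """
    fpts = []
    t_start = times[0]
    target = positions[0] + bin_size
    for t, x in zip(times, positions):
        while x >= target:
            fpts.append(t - t_start)
            t_start = t
            target += bin_size
    return fpts


def erlang_fpt_pdf(t, n, k):
    """FPT density for n forward-only steps at rate k (gamma distribution)."""
    if t < 0:
        return 0.0
    return k ** n * t ** (n - 1) * math.exp(-k * t) / math.factorial(n - 1)
```

For forward-only stepping, the mean first passage time over $n$ steps is $n/k$, so a heavy tail in the empirical histogram relative to `erlang_fpt_pdf` points to pauses or backsteps.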
Prominence method
We can resort to the prominence method [17] when it is not possible to resolve the two peaks in the velocity histogram (e.g. Fig. 2G). The prominence method aims to obtain the (maximum) replication velocity from the information contained in the trajectory. The idea is that, after noise removal, the more prominent peaks in the velocity time series correspond to the (maximum) replication velocity (Fig. 2H). Prominence is a term from topography and mountaineering, used to identify the main peaks in a mountain ridge. The prominence of a peak is the difference between its height and the lowest closed contour line that encircles it without containing any higher peak. Here, in the velocity vs. time plot, the prominence of a velocity peak is the difference in height between the peak and the highest valley separating it from a higher peak.

The procedure is the following. First, the velocity time series is computed from the trajectory (position vs. time) using a time window of length Δt. Second, the peaks of prominence P or higher are selected in the velocity time series (Fig. 2H). Third, the mean of the selected peaks is computed and plotted, for each prominence P, as a function of the time window Δt. The maximum velocity appears as the height of a plateau in this plot for the appropriate prominence (Fig. 2I). Neither the adequate time window nor the adequate prominence needs to be known a priori: a wide range is taken for both, and the prominence method itself provides the adequate range. 
Finally, the plots of the mean of the velocity peaks as a function of the velocity time window, for different prominences P, reveal a plateau that is both flatter and longer for the appropriate value of P. The height of the plateau for this optimal value of P gives a good estimate of the (maximal) replication velocity, $v_{max}$.

Large time windows lead to a velocity plot with a high signal-to-noise ratio. However, if the time window is too large, it averages over active replication periods and paused periods, leading to velocities below the maximum replication velocity. This is reflected in each of the plots of the mean of the velocity peaks as a function of the velocity time window (Fig. 2I). For small time windows, the plot is dominated by large velocities due to the noise in the trajectory, giving large values of the mean peak velocity. For intermediate values of the time window, the influence of noise decreases and the plot presents a plateau. The plateau is clearer for the appropriate prominence, and its height gives the replication velocity in the active state. For large values of the time window, similar to or larger than the characteristic time in the active replication state, the time window averages over active and pause periods, and the mean peak velocity tends to the mean replication velocity.

Dividing the mean replication velocity of the trajectories, $\langle v \rangle$, by the replication velocity obtained for the active state, $v_{max}$, gives the moving probability, $p_{mov} = \langle v \rangle / v_{max}$. The active time per nucleotide can then be obtained as $1/v_{max}$, and the pause time per nucleotide as $1/\langle v \rangle - 1/v_{max}$. A complete description of the prominence method can be found in the Supplemental Information of Ref. [17].
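The steps of the prominence method can be sketched as follows. This is a minimal illustration and not the implementation of Ref. [17]: it assumes uniformly sampled trajectories, only considers strict local maxima, and uses hypothetical function names. The prominence of each peak is computed against the higher of the two neighbouring valleys, following the definition given above.

```python
def velocity_series(times, positions, window):
    """Velocity over a sliding window of `window` samples (uniform sampling assumed)."""
    dt = times[window] - times[0]
    return [(positions[i + window] - positions[i]) / dt
            for i in range(len(positions) - window)]


def peak_prominences(v):
    """Return (index, prominence) for every strict local maximum of the list v."""
    peaks = []
    for i in range(1, len(v) - 1):
        if not (v[i] > v[i - 1] and v[i] > v[i + 1]):
            continue
        # Walk left and right until a higher sample (or the edge) is met,
        # tracking the lowest point found on each side.
        left_min, j = v[i], i - 1
        while j >= 0 and v[j] <= v[i]:
            left_min = min(left_min, v[j])
            j -= 1
        right_min, j = v[i], i + 1
        while j < len(v) and v[j] <= v[i]:
            right_min = min(right_min, v[j])
            j += 1
        # The higher of the two valleys sets the reference level.
        peaks.append((i, v[i] - max(left_min, right_min)))
    return peaks


def mean_prominent_peak(v, min_prominence):
    """Mean height of the peaks whose prominence is at least min_prominence."""
    heights = [v[i] for i, p in peak_prominences(v) if p >= min_prominence]
    return sum(heights) / len(heights) if heights else None


def plateau_scan(times, positions, windows, prominences):
    """Mean prominent-peak velocity for each (prominence, window) pair."""
    return {(p, w): mean_prominent_peak(velocity_series(times, positions, w), p)
            for p in prominences for w in windows}
```

Plotting the values returned by `plateau_scan` against the window length, one curve per prominence, would give the kind of plot in which the plateau is sought. Library routines for peak prominence exist and may be preferable for large data sets.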
Bayesian methods
Another class of methods to analyze the trajectories are the Bayesian methods [24]. Bayesian methods fit a model of the system and of the experimental device to the observed data. The idea is to find the most probable values of the model parameters given the observed data. Through Bayes' theorem, this is linked to the question of which values of the model parameters maximize the probability of observing the data that were in fact observed. Most of the models considered belong to the class known as hidden Markov models (HMM), as they assume an underlying Markovian behavior of the system. A system is said to be Markovian when its next state depends only on its present state and is independent of its previous history. (The models described in the next section belong to this class.)

The characteristics of the experimental device are also included in the model, for example through an experimental noise parameter affecting the observed data. This noise parameter can also be fitted, and the result should be consistent with the expected accuracy of the experimental device.

Bayesian methods in combination with HMM have been successfully used, for example, in the study of gene transcription [97], after one of their early prominent uses, speech recognition [85]. They also have a wide range of potential applications in single-molecule data analysis [28], [74], [29]. These methods are very powerful for identifying the best parameter values, and even the best model among a set of models [24], when combined with a model comparison criterion such as AIC [3], [12]. They also provide means to include a priori information (obtained in previous complementary experiments). It might be argued that their drawback is that they require a concrete model (or set of models) to proceed with the analysis. However, this is also true (to a certain extent) for the more frequentist data analysis presented in the previous subsections. 
In the previous subsections we assumed a pause state and an active state characterized by different velocities, and the presence of transitions between them. However, we did not have to assume a priori whether there was one or several pause states. In this sense, the approach of the previous subsections can be considered a more model-independent analysis.

In the next section, Section 3, we present models that link the observed replication velocities and pause characteristics with the underlying processes. These models allow us to identify the underlying processes and to gain a deeper understanding of DNA unwinding and replication.
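As an illustration of how a hidden Markov model can assign pause and active states to a trajectory, the sketch below applies the Viterbi algorithm to a two-state model. It is a simplified example under stated assumptions, not the method of the cited references: the displacement per frame is assumed Gaussian around zero (pause) or around the active velocity times the frame duration (active), the transition probabilities are taken as known, and all names and numerical values are hypothetical. In a full Bayesian analysis, these parameters would themselves be inferred from the data.

```python
import math


def gaussian_logpdf(x, mean, sigma):
    """Logarithm of the normal density with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi_two_state(displacements, v_active, dt, sigma, p_stay_active, p_stay_pause):
    """Most probable state sequence (0 = pause, 1 = active) for per-frame displacements."""
    means = [0.0, v_active * dt]
    log_trans = [
        [math.log(p_stay_pause), math.log(1 - p_stay_pause)],    # from pause
        [math.log(1 - p_stay_active), math.log(p_stay_active)],  # from active
    ]
    # Uniform prior over the initial state.
    score = [math.log(0.5) + gaussian_logpdf(displacements[0], m, sigma) for m in means]
    back = []
    for d in displacements[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[r] + log_trans[r][s] for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + gaussian_logpdf(d, means[s], sigma))
        score = new_score
        back.append(pointers)
    # Backtrack from the most probable final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Illustrative data (assumed): displacement in nucleotides per 0.1 s frame.
displacements = [1.0, 1.1, 0.9, 0.0, 0.1, -0.1, 1.0, 0.95]
path = viterbi_two_state(displacements, v_active=10.0, dt=0.1, sigma=0.2,
                         p_stay_active=0.9, p_stay_pause=0.9)
```

The noise parameter `sigma` plays the role of the experimental noise parameter mentioned above; its fitted value should be compared with the known accuracy of the instrument.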
DNA replication models
Models allow the identification of the underlying processes in a phenomenon, increasing its predictability. Assuming a process, a model predicts relations and dependencies in the observations. When we fit a model to a set of experimental data, we check whether the assumed process is compatible with the relations and dependencies between the observations.

The fit of the model to the data can be performed by minimizing the mean squared differences between the observed data and the values predicted by the model. This minimization sets the optimal values of the model parameters that describe this set of data. A deficient fit indicates that the model makes wrong or incomplete assumptions about the possible underlying processes. Models with the same number of free parameters can be compared directly through the minimized mean squared differences. When comparing models with different numbers of free parameters, we must use model comparison criteria (based on information theory or Bayesian statistics), such as the Akaike Information Criterion (AIC) [3], [12]. These criteria compensate for the fact that models with more parameters generally fit better, which leads to overfitting. It is easy to find models with n parameters that fit a set of n data points, but such a fit is only a reparameterization of the data. This overfitting situation does not provide information on whether the underlying process assumed in the model is compatible with the data. In practice, to prevent overfitting, we should keep the number of parameters well below the number of data points. We should also keep track of the uncertainties in the data and in the fitted parameters. A large uncertainty in a fitted parameter is a sign that the data have a low capability to determine this parameter. In other words, the model poses questions that cannot be answered with the information contained in the data. 
Thus, a large uncertainty in a parameter indicates that we should either use a simpler model (with fewer ingredients and parameters) or perform complementary experiments (which provide more information on the process related to this parameter).
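The comparison of models with different numbers of parameters can be sketched with a small example. It uses the least-squares form of the AIC, n ln(RSS/n) + 2k, which holds up to an additive constant when the errors are assumed Gaussian with equal variance; the two toy models (a constant and a straight line), the synthetic data and the function names are assumptions made for illustration only.

```python
import math


def aic_least_squares(residuals, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params


def fit_constant(xs, ys):
    """One-parameter model y = c; returns (c, residuals)."""
    c = sum(ys) / len(ys)
    return c, [y - c for y in ys]


def fit_line(xs, ys):
    """Two-parameter model y = a + b x; returns (a, b, residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b, [y - (a + b * x) for x, y in zip(xs, ys)]


# Synthetic data (assumed): a line with a small alternating perturbation.
xs = list(range(10))
ys = [1.0 + 2.0 * x + (0.1 if x % 2 == 0 else -0.1) for x in xs]
_, res_const = fit_constant(xs, ys)
_, slope, res_line = fit_line(xs, ys)
aic_const = aic_least_squares(res_const, n_params=1)
aic_line = aic_least_squares(res_line, n_params=2)  # lower AIC is preferred
```

Here the extra parameter of the line is justified because it lowers the AIC; an additional parameter that reduced the residuals only marginally would instead be penalized by the 2k term.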
Pause modelling
The key ingredient in the modelling of pauses is the number of pause types. Pause identification was described in Section 2.1, where the number of pause types was identified as the number of characteristic pause lengths observed, i.e., the number of exponentials with different exponents needed to fit the pause length frequency distribution. The number of pause types identified restricts the reasonable pause models, but not uniquely, as we mention below for the two-pause case.

For the cases with one pause type (Fig. 3A), the measured pause length frequency distribution fits to a single exponential. For this model, the fitted parameters give the entry rate to and the exit rate from the pause state.
Fig. 3
Kinetic models for pause and active states of DNA polymerases. Each rate parameter denotes the transition rate from one state to another. A) Top: Model with a unique pause state. Bottom: Schematic representation of the effect of an external force on the free energy landscape projected along the displacement coordinate in the direction of the force. The free energy reduction is given by the work done by the force, evaluated at the activation state and at the final state. B) Cyclic model with two pause states. In this model, direct transitions between pause states are allowed. C) Linear model with two different pause states. In this model, it is not possible to go directly from one pause state to the other without passing through the active state. D) Model of polymerization-exonucleolysis transitions mediated by a pause state. DNA tension induces entrance into exonucleolysis through the pause intermediate [43], [41]. (Panels A, B and C adapted from Refs. [69], [73].)
When two pause types are identified, the pause length frequency distribution fits to a sum of two exponentials. A possible model is the linear two-pause model represented in Fig. 3C. For this model, the fitted parameters give the entry rates to and the exit rates from each of the two pause states. However, the same pause length frequency distribution is also compatible with the more general cyclic two-pause model represented in Fig. 3B. See Refs. [44], [73]. This cyclic pause model contains the linear two-pause model as the particular case where the transitions between pauses have negligible rates. The general cyclic model has the drawback that it has more free parameters (six transition rates) than those provided by the pause length frequency distribution (two pause entrance rates and two characteristic pause durations). However, the positivity of all six rates gives maximum and minimum values compatible with the four fitted parameters, as shown in Ref. [73]. 
Nevertheless, using a model with more parameters than those directly measured seems reasonable only if there is a biological motivation. In this case, the model calls for complementary experiments or information to help reduce the uncertainty in the transition rates of the cyclic model.

The characteristic entry and exit rates can also depend on the mechanical tension applied to the system and/or on the DNA sequence, as previously stated in Section 2.1 for the fitted parameters of the pause length frequency distribution. The force dependencies of the entry and exit rates of pause states reveal the magnitude of the conformational changes (along the force direction) to the activation state, which governs the entry to and the exit from the pause state (see bottom of Fig. 3A). This distance can help to reveal which process may lead to the pause [69]. The force dependency of the entry and exit rates of a pause state is given by Eq. (5), in which the distance to the activation state parameterizes the different effect of the force on the entry and exit rates. The work done by the force along this distance gives the magnitude of the change of the effective barrier to the activation state of the process. Its ratio to the characteristic energy of the thermal fluctuations determines the magnitude of the increase or decrease of the process rates, as shown in the expressions of Eq. (5). Fitting these expressions to the observed force dependence of the rates provides the rates at zero force and the conformational change distances.

The equilibrium constant of the process can be defined by Eq. (6). This gives a complete characterization of the entry to and the exit from the pause, and a clue to the possible conformational changes associated with this pause through the value of the associated distance. This simplified description of the effect of force on process rates is frequently used in biophysical studies. For a description of the simplifications involved and for more exact descriptions, see Refs. 
[51], [44], [101].

More advanced multistate models and their kinetics are described in Refs. [50], [18], [66]. Some of these models have been applied to interpret the force-velocity dependencies of replicative DNA polymerases switching between polymerization and exonucleolysis (Fig. 3D) [58], [41], [78]. Many replicative DNA polymerases present two main active sites: the polymerization (Pol) and the exonucleolysis (Exo) active sites. The Pol site catalyzes 5′-3′ DNA synthesis by the stepwise addition of the complementary nucleotide (dNTP) onto the terminal 3′ end of the nascent DNA strand (primer), while the Exo site hydrolyzes mismatched nucleotides from the primer strand in the 3′ to 5′ direction, increasing the fidelity of the copy. The Exo site is separated by 40–60 Å [102], [7] from the Pol site and only binds single-stranded DNA (ssDNA). Therefore, the primer transfer reaction implies substantial conformational changes and may involve intermediate states. A finely tuned coordination between the polymerization and exonucleolysis reactions is essential for the integrity of the genome. Modeling of the pause kinetics of replicative DNA polymerases and of their dependence on the mechanical tension applied to the DNA template has provided insight into the dynamics of the Pol-Exo transfer (or proofreading mechanism) of DNA polymerases. As described below, mechanical tension applied to the DNA template decreases the polymerization rate until stalling (Fig. 4A). Interestingly, upon stalling, a further increase of tension induces processive exonuclease activity in several DNA polymerases [104], [43], [78], suggesting that tension can be used as a variable to study the Pol-Exo equilibrium. Modeling of the effect of tension on the moving and pause states of the phage Phi29 and T7 DNA polymerases has shown that the primer transfer reaction between the two active sites is not a one-step process. 
In the case of Phi29 DNA polymerase, the primer transfer reaction is intramolecular and implies at least two intermediate states, one of which may work as a fidelity checkpoint [43]. In the case of T7 DNA polymerase, the primer transfer reaction is intermolecular: following DNA polymerase dissociation, the primer binds to the Exo active site of a new DNA polymerase [41]. In summary, the ability to separate transient inactive states (pauses) from active states and to analyze their corresponding force dependencies has been instrumental in determining the intermediates of the proofreading reaction and in measuring directly the kinetic rates, equilibrium constants, and conformational changes associated with their interconversion.
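The force dependence of the pause entry and exit rates discussed in this subsection is often written in a Bell-like exponential form, in which the rate at zero force is multiplied by the exponential of the work done by the force over the distance to the activation state, divided by the thermal energy. The sketch below assumes this form together with a sign convention in which a positive distance means that force accelerates the process; both are assumptions of the example and may differ from the exact expressions of Eq. (5). The ratio of entry to exit rate is used as one common definition of an equilibrium constant. Function names and numerical values are hypothetical.

```python
import math

KBT = 4.11  # thermal energy in pN nm near room temperature (assumed value)


def rate_under_force(k0, distance, force, kbt=KBT):
    """Bell-like rate: zero-force rate k0 modified by the work force * distance."""
    return k0 * math.exp(force * distance / kbt)


def fit_bell(forces, rates, kbt=KBT):
    """Least-squares line through ln(rate) vs force; returns (k0, distance)."""
    n = len(forces)
    logs = [math.log(r) for r in rates]
    mf, ml = sum(forces) / n, sum(logs) / n
    slope = (sum((f - mf) * (l - ml) for f, l in zip(forces, logs))
             / sum((f - mf) ** 2 for f in forces))
    return math.exp(ml - slope * mf), slope * kbt


def equilibrium_constant(k_in, k_out):
    """Ratio of entry to exit rate (one common convention, assumed here)."""
    return k_in / k_out


# Synthetic check (assumed values): k0 = 2 per second, distance = 0.5 nm.
forces = [0.0, 2.0, 4.0, 6.0, 8.0]
rates = [rate_under_force(2.0, 0.5, f) for f in forces]
k0_fit, d_fit = fit_bell(forces, rates)
```

With experimental rates, the uncertainty of the fitted distance should be propagated from the uncertainties of the measured rates, since a poorly constrained distance carries little information on the conformational change.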
Fig. 4
Comparison between the effects of mechanical tension on the DNA (A and B) and of mechanical load applied to the DNA polymerase (C-E). A) Effect of mechanical tension on the primer extension replication rate of the mitochondrial DNA polymerase in the absence (blue) and presence (red) of the mitochondrial SSB (mtSSBWT). Dots represent experimental data, and lines the best fits of the theoretical models: Eq. (8) for ssDNA and Eq. (11) for SSB-covered ssDNA. B) Comparison of the polymerase-SSB coupling behavior at different tensions. The energy landscapes (left) between the coupled state A and the uncoupled state B, for low force (solid line) and for medium force (dashed line), show how force destabilizes the polymerase-SSB coupling. The diagrams (right) represent the reduction of the polymerase-SSB coupling due to force. [This decoupling effect is modeled by Eq. (10).] C) Diagram of a primer extension experiment applying opposing (top) or aiding (bottom) force on a DNA polymerase. D) Effect of load on the maximum replication rate at saturating dNTP. E) Ratio of the apparent nucleotide binding constant to the maximum replication rate, KM/Vmax (Michaelis-Menten parameters of the reaction), as a function of the force acting on the polymerase. (Panels A and B from Ref. [17]; Panels C, D and E from Ref. [70].) (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Primer extension DNA replication models
Primer extension replication is the replication of ssDNA to give dsDNA (see Fig. 2A). In vitro single-molecule manipulation experiments with replicative DNA polymerases have shown that the average primer extension rate depends strongly on mechanical force applied either to the DNA template (Fig. 2A, D) or directly to the DNA polymerase (Fig. 4C).

When mechanical tension is applied to the DNA, the average replication rate of many DNA polymerases initially increases with tension, reaching a maximum rate at 6 pN. Above this value of tension, the replication rate decreases gradually until stalling (Fig. 4A). Originally, the so-called Global Model [57], [104] was proposed to explain the tension dependence as due to the activation enthalpy of converting bases from single- to double-stranded DNA, which imposes Eq. (7), whose prefactor is the replication rate at zero force. The exponential argument accounts for the energy contribution involved in changing the length of n nucleotides from the ssDNA length per nucleotide to the dsDNA length per nucleotide at a given DNA tension.

This model, Eq. (7), explains well the decay of the replication rate with tension at high tension values, with n close to 1. This value of n indicates that only one template base is converted from ssDNA to dsDNA, in accordance with strong evidence from structural, bulk and single-molecule experiments [75], [76], [4], [41]. However, Eq. (7) can only explain the entire force–velocity plot (including the data at low tension) with a value of n larger than one
[57], [104]. Since only one nucleotide is added per polymerase step (n = 1), this model implies that bases have to be reverted to the ss geometry after the activation state. These results are not supported by previous structural and bulk kinetic studies. Alternative models that only involve the two neighboring DNA segments have been proposed (the Local Model and its variations: the Restricted-Cone Local Model and the Minimalist Two Segment Model) [36], [4], [79]. These models explained well the tension dependence of the polymerization rate of some DNA polymerases, considering n = 1. However, the models relied on several assumptions about the DNA-polymerase interactions and the nature of the rate-limiting step, which need further experimental validation (see the Supplementary Information of Ref. [17] for a more detailed discussion).

Recently, Ref. [17] proposed that the initial increase of the DNA replication rate with tension is caused by the mechanical disruption of the self-binding energies (secondary structure) of the ssDNA template [9]. Template secondary structures are known to hinder or slow down the advance of DNA polymerases [49], [38], [46]. This leads to Eq. (8), whose prefactor would be the replication rate at zero force in the absence of secondary structure. The first term in the exponential argument accounts for the energy contribution involved in changing the length of n nucleotides (for DNA replication, n = 1) from the ssDNA length per nucleotide to the dsDNA length per nucleotide at a given DNA tension. This first term dominates the decay of the replication rate at higher forces (Fig. 4A, blue points and line). The second term accounts for the contribution of the secondary structure, which slows down replication at low forces, as shown in Fig. 4A. It contains the fraction of ssDNA template bases forming secondary structure, which decreases with increasing force (Ref. [9] describes how to determine this fraction experimentally from ssDNA force-extension curves). 
Each of the bases forming secondary structure imposes an average effective energy barrier to the advance of the polymerase. Fig. 4A (blue line) shows the fit of Eq. (8) to the force dependence of the replication rate of the mitochondrial DNA polymerase, Pol γ. Future experiments will clarify whether this model, Eq. (8), gives good fits for other replicative DNA polymerases.

In vitro single-molecule manipulation experiments showed that, in contrast to the effect of tension on the DNA, mechanical load applied directly to the DNA polymerase and opposing its direction of movement decreases the average DNA replication rate monotonically (Fig. 4C, D) [70]. Mechanical load interferes with the translocation step of the polymerase, which rapidly becomes the rate-limiting step of the reaction upon application of force, explaining the marked effect of force on the replication velocity [50]. Modeling of the combined effects of load and dNTP concentration on the maximum replication rate (at saturating dNTP concentration) and on the apparent nucleotide binding constant of the reaction (Vmax and KM, respectively; Fig. 4D, E) provided a detailed picture of the coupling between the mechanical and chemical steps of the nucleotide incorporation reaction [70].

The model that explained the data well considered that mechanical translocation is independent of chemistry and, therefore, that the only force-dependent rates of the reaction are the forward and backward translocation rates. According to this model, the kinetic expressions for the Michaelis-Menten parameters can be written as the sum of a force-independent and a force-dependent term, Eq. (9), where the coefficients a, b, r, and s are related to the rates of several steps of the nucleotide incorporation cycle, such as the catalytic rate, the dNTP binding rate, and the forward and backward translocation rates. 
On the other hand, the two distance parameters of the model are the characteristic distances from the pre- and post-translocation positions to the transition state. Fits of the data with this model yielded the values of several of the main rates and force dependencies of the nucleotide incorporation cycle. In summary, modeling of the data revealed that chemical catalysis and mechanical translocation are not directly coupled. Instead, upon chemical catalysis, mechanical translocation of the enzyme occurs by thermal diffusion. This diffusion is biased towards the post-translocation state by binding of the next complementary nucleotide (dNTP) to the polymerization active site [70].
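The Michaelis-Menten parameters used in this analysis can be illustrated with a short sketch. It assumes the standard Michaelis-Menten rate law and estimates Vmax and KM at a single force from a double-reciprocal (Lineweaver-Burk) line, whose slope is the ratio KM/Vmax plotted in Fig. 4E. This is only one simple way to obtain the parameters and is not claimed to be the procedure of Ref. [70]; the function names and the numerical values are hypothetical, and the force-dependent expressions of the model are not reproduced here.

```python
def michaelis_menten(s, vmax, km):
    """Replication rate at nucleotide concentration s for given Vmax and KM."""
    return vmax * s / (km + s)


def lineweaver_burk_fit(concentrations, velocities):
    """Fit 1/v = (KM/Vmax)(1/s) + 1/Vmax; returns (vmax, km, km_over_vmax)."""
    xs = [1.0 / s for s in concentrations]
    ys = [1.0 / v for v in velocities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    vmax = 1.0 / intercept
    return vmax, slope * vmax, slope


# Synthetic, noise-free data (assumed): Vmax = 100 nt/s, KM = 5 micromolar.
concentrations = [1.0, 2.0, 5.0, 10.0, 50.0]
velocities = [michaelis_menten(s, 100.0, 5.0) for s in concentrations]
vmax_fit, km_fit, ratio_fit = lineweaver_burk_fit(concentrations, velocities)
```

With noisy data, the double-reciprocal transformation gives excessive weight to the lowest concentrations, so a direct nonlinear fit of the rate law is usually preferable. Repeating the estimate at each applied force would give the force dependence of Vmax and of KM/Vmax.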
Effects of DNA ligands on primer extension replication
The presence of ligands bound to ssDNA, such as single-stranded DNA binding proteins (SSB), favors DNA replication by suppressing the formation of secondary structure, but at the same time the ligands can also act as a barrier to the access of the polymerase to the ssDNA template. However, some polymerase-SSB pairs are found to interact in such a way that the barrier imposed by the SSB on ssDNA replication is negligible [17]. This collaborative interaction is force sensitive and is inhibited at different force values depending on the polymerase-SSB pair.

The probability of finding the polymerase-SSB pair in the collaborative state can be parameterized as a transition between two states (collaborative and non-collaborative), Eq. (10), whose two parameters are the coupling energy of the pair and the characteristic length of the conformational change that inhibits the formation of the polymerase-SSB collaborative pair.

Thus, in this case, the primer extension replication rate is given by Eq. (11). The free parameter representing the mean number of nucleotides to release from SSB per step, and the two free parameters of the coupling probability (coupling energy and characteristic length), are fixed by fits to the experimental data on ssDNA replication in the presence of SSB. The replication rate at zero force in the absence of secondary structure is taken from the previous fit of Eq. (8) to the experimental data on ssDNA replication in the absence of SSB. The length per nucleotide of the ssDNA in the presence of SSB at a given tension is determined beforehand by force-extension experiments. The Gibbs energy to release a nucleotide from SSB at a given tension is determined by comparing the integrals above the ssDNA-SSB and the naked ssDNA force-extension curves. A study of different polymerase-SSB pairs [17] found similar values of the characteristic length, pointing to a similar conformational change, but different pairing energies, indicating a different stability of the collaborative state. Fig. 4A (red line) shows the fit of Eq. 
(11) to the force dependence of the replication rate of the polymerase Pol γ replicating an ssDNA covered by mtSSB.

Overall, modelling of the effect of mechanical tension on the DNA replication rate in the presence of ligands (SSBs) revealed that the elimination of template secondary structure by SSB binding promotes the maximum replication rate of DNA polymerases. However, for this stimulation to occur, functional interactions between the DNA polymerase and the SSB are required. These interactions, i.e., electrostatic repulsion, decrease the energy barrier of ssDNA unwrapping from the SSB and facilitate its release from the template without compromising the replication rate of the DNA polymerase.
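The force-sensitive polymerase-SSB coupling described in this subsection can be illustrated with a generic two-state Boltzmann expression, in which the probability of the collaborative state decreases as the work done by the force over the characteristic length approaches the coupling energy. The functional form below is an assumption made for illustration and is not claimed to reproduce Eq. (10) term by term; the function name, the units (energy in units of the thermal energy, force in pN, length in nm) and the numerical values are hypothetical.

```python
import math


def coupling_probability(force, energy_kbt, distance, kbt=4.11):
    """Two-state probability of the collaborative state at a given force.

    energy_kbt: coupling energy in units of the thermal energy (assumed).
    distance: characteristic length of the decoupling change, in nm (assumed).
    kbt: thermal energy in pN nm near room temperature (assumed value).
    """
    return 1.0 / (1.0 + math.exp(force * distance / kbt - energy_kbt))


# Illustrative values (assumed): coupling energy of 3 kBT, length of 2 nm.
p_low = coupling_probability(0.0, 3.0, 2.0)    # mostly coupled at zero force
p_mid = coupling_probability(5.0, 3.0, 2.0)
p_high = coupling_probability(20.0, 3.0, 2.0)  # mostly uncoupled at high force
```

In this form, the force at which the two states are equally populated is the coupling energy divided by the characteristic length, so pairs with similar lengths but different coupling energies would decouple at different forces.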
Strand displacement DNA replication and DNA unwinding models
Some replicative DNA polymerases carry out strand displacement DNA synthesis, that is, they are able to displace the downstream dsDNA encountered during replication (Fig. 5A, B). Current strand displacement replication models mainly describe how the stability of the dsDNA fork ahead of the polymerase slows down the maximum replication rate of the enzyme (the rate in the absence of a fork) [8], [47]. The main idea is that replication cannot proceed if the next base pair is closed, which, at tension f, happens with a probability that depends on the template position of the polymerase and on the number of base pairs of the fork opened ahead. Averaging over the complete template gives Eq. (12).
Fig. 5
A) Diagram of a strand displacement DNA replication experiment with optical tweezers. (Left) The two ends of a DNA hairpin are attached between two micron-sized beads (grey and blue spheres), one held by the optical trap and the other held by suction on top of a micropipette. Double parallel lines represent double-stranded DNA (dsDNA). (Right) Under strand displacement conditions (s.d.), the DNA polymerase (purple triangle) opens the DNA fork, replicates one strand (blue line), and displaces the other (red line). B) Scheme of the polymerase (purple triangle) dynamics during strand displacement DNA synthesis. The variables of the scheme are the length in nucleotides of the DNA template, the number of nucleotides replicated, the number of base pairs opened between the polymerase and the DNA fork, and the number M of base pairs that are destabilized by the polymerase (purple circle). All variables used in the strand displacement replication model are described in Section 3.3. (Adapted from [69].) C) DNA polymerases with high fork destabilization energies would present s.d. rates similar to those found during primer extension (red line). On the contrary, DNA polymerases with low fork destabilization energies present lower ratios between the s.d. and primer extension rates, with stronger force dependencies (green dash-dotted line). (One model parameter takes the same value for all lines.) D) Variation of the force-dependent ratio with the interaction range M. Higher M values yield stronger force dependencies. Different values of M can fit the same set of data with different interaction intensities. (Each line in this panel corresponds to a different pair of values of M and interaction intensity.) E) Helicases are classified as active or passive according to their ability to destabilize the fork, parameterized by the interaction intensity. They are optimally active when the interaction intensity is of the order of the highest base pair binding energy. The coordinated operation of a polymerase and a helicase can increase the effective interaction intensity. 
F) Helicases with step sizes larger than one base pair require the simultaneous opening of several base pairs, implying a stronger tension dependence. (Curves are shown for active and passive helicases.) (Panels E and F are from [47].) (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
A) Diagram of a strand displacement DNA replication experiment with optical tweezers. (Left) The two ends of a DNA hairpin are attached between two micron-sized beads (grey and blue spheres) one held by the optical trap and the other held by suction on top of a micropipette. Double parallel lines represent double stranded DNA (dsDNA). (Right) During strand displacement conditions (s.d.), the DNA polymerase (purple triangle) opens the DNA fork, replicates one strand (blue line), and displaces the other (red line). B) Scheme for the polymerase (purple triangle) dynamics during strand displacement DNA synthesis. denotes the length in nucleotides of the DNA template, the number of nucleotides replicated, the number of base pairs opened between the polymerase and the DNA fork, while stands from the number of base pairs that are destabilized by the polymerase (purple circle). All variables used in the strand displacement replication model are described in Section 3.3. (Adapted from [69].) C) DNA polymerases with high fork destabilization energies, , would present s.d. rates similar to those found during primer extension (red line). On the contrary, DNA polymerases with low present lower ratios with stronger force dependencies (green dashed dotted line). (For all lines .) D) Variation of the force dependent ratio with the interaction range M. Higher M values yield stronger force dependencies. Different values of M can fit the same set of data with different interaction intensities . (Values of the lines in this panel are: for , for for for .) E) Helicases are classified as active or passive according to their ability to destabilize the fork, parameterized by the interaction intensity . They are optimally active when is of the order of the higher base pair binding energy . The coordinate operation of a polymerase and a helicase can increase the effective interaction intensity . 
F) Helicases with step sizes larger than one require the simultaneous opening of several base pairs, implying a stronger tension dependence. (Panels E and F are from [47].) (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
This probability is given by a balance of the Gibbs energy contributions involved in the fork opening. The Gibbs energy required to open a given number of base pairs ahead of a given template position, at a given force, is the sum of three terms. The first term accounts for the stability of the base pairs ahead [105]. The second term accounts for the destabilization contributed by the tension, which is computed from the ssDNA elasticity. The third term accounts for the interaction energy between the polymerase and the dsDNA fork. The polymerase is assumed to destabilize the M closest base pairs ahead by the same amount each. The model therefore has two free parameters: the interaction energy per base pair, which parameterizes the activity of the polymerase (Fig. 5C) [8], [60], [71], and M, which parameterizes the range of fork destabilization. Large values of M should be interpreted with caution, since they might be induced by the simplifying assumption that the destabilization energy is the same for all the next M base pairs. (Assuming a smaller destabilization energy for the more distant base pairs seems reasonable, but it raises the question of how fast this decrease is.) See Fig. 5C, D, and E for further insight into how the values of these parameters affect the replication ratio.
Note that some effects that are detrimental to primer extension replication might not be present in strand displacement replication. For example, the formation of secondary structure is prevented by the presence of the fork, and we expect this effect to be absent from the primer extension velocity used to compute the strand displacement replication velocity in Eq. (12). This description of the Betterton and Jülicher model adapted for DNA polymerases, Eqs. (12), (13), and (14), considers that replication occurs one nucleotide at a time and assumes that there is no significant back-stepping (i.e., due to exonucleolysis), Fig. 5B. How to include these additional effects (and others) is discussed in Refs. [47], [60].
Fits of the force and sequence dependencies of the strand displacement rates of T4 and Phi29 DNA polymerases with this model revealed the interaction energy of each polymerase with the fork [59], [69].
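As a rough numerical illustration of the energy balance described above, the following Python sketch computes a Boltzmann-weighted probability of finding at least one base pair open ahead of the enzyme. It is a minimal sketch under stated assumptions, not the fitting procedure of Refs. [59], [69]: the sequence is taken as homogeneous, the ssDNA elasticity is approximated by a freely-jointed chain, each opened base pair is assumed to release two nucleotides of ssDNA under tension, and all names and default values (`g_bp`, `g_int`, `M`, `m_max`, the contour and Kuhn lengths) are hypothetical illustrative choices.

```python
import math

KT = 4.11  # thermal energy in pN*nm at about 25 C (assumed temperature)


def ssdna_stretch_energy(force, contour=0.57, kuhn=1.5):
    """Free energy (in kT) gained per ssDNA nucleotide stretched at `force` (pN).

    Freely-jointed chain assumption; `contour` (nm per nucleotide) and `kuhn`
    (nm) are illustrative values, not taken from the text.
    """
    if force <= 0:
        return 0.0
    u = force * kuhn / KT
    return (contour / kuhn) * math.log(math.sinh(u) / u)


def opening_energy(m, force, g_bp, g_int, M):
    """Gibbs energy (in kT) to open m base pairs ahead of the enzyme."""
    stability = m * g_bp                           # base-pair stability term
    tension = 2 * m * ssdna_stretch_energy(force)  # two nucleotides freed per base pair (assumed)
    interaction = min(m, M) * g_int                # only the M closest pairs are destabilized
    return stability - tension - interaction


def open_probability(force, g_bp=2.5, g_int=1.0, M=2, m_max=50):
    """Boltzmann probability that at least one base pair ahead is open."""
    weights = [math.exp(-opening_energy(m, force, g_bp, g_int, M))
               for m in range(m_max + 1)]
    return sum(weights[1:]) / sum(weights)


# Opening probability at a few forces (pN), hypothetical parameter values.
for f in (0.0, 5.0, 10.0):
    print(f, round(open_probability(f), 3))
```

With these illustrative values, the opening probability grows with the applied force and with `g_int`. The truncation at `m_max` is only meaningful below the mechanical unzipping force, where the statistical weights decay with the number of open base pairs.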
Interpretation of the single-molecule data together with biochemical and structural information on polymerase-DNA complexes suggested that the ability of DNA polymerases to unwind DNA during replication depends on two competing processes. On the one hand, binding and bending of the template strand by the DNA polymerase generates mechanical stress at the fork junction, which forces the separation of the dsDNA strands. On the other hand, the complementarity between the template and the displaced strands generates a regression pressure on the enzyme that competes for template binding, which prevents further polymerization and shifts the equilibrium towards the exonuclease conformation.
Similarly, single-molecule manipulation experiments have shown that the DNA unwinding rate of replicative DNA helicases is strongly affected by the stability of the fork [47], [55], [60]. The effect of fork stability on the unwinding rate can be explained using the Betterton and Jülicher model described above. In fact, this model was originally developed to quantify the effect of fork stability on the unwinding rate of helicases [8], [47]. The same schemes as in Fig. 5A and B can be used for helicases, taking into account that the helicase opens the fork but does not convert ssDNA into dsDNA. Thus, the ratio between the DNA unwinding and translocation rates is obtained by making the corresponding substitution in Eq. (12). Quantification of the DNA destabilization energy of helicases using this model is not straightforward because of the uncertainty in the helicase step size and the significant probability of backsliding events; helicases can show substantial backstepping. The effect of varying the step size is shown in Fig. 5F: increasing the step size leads to a stronger force dependence, as does an increase of the interaction range M (Fig. 5D). (See Ref. [60] for a more detailed discussion.) In any case, the strong dependence of the average unwinding rate of replicative helicases on DNA sequence and mechanical tension (fork stability) suggests that, when working in isolation, these enzymes present weak DNA destabilization energies. Interestingly, helicases may need the assistance of other partner proteins at the fork [59].
Ligands attached to dsDNA or to ssDNA can act as inhibitors or activators of strand displacement DNA replication or DNA unwinding. On the inhibitory side, ligands attached to dsDNA might act as a barrier, stabilizing it and preventing fork opening. In this case, their effects must be accounted for in the base pair stability term of the model presented in Subsection 3.3. Ligands might also act as activators of strand displacement replication by lowering the effective base pair binding energy. Ligands attached to the lagging ssDNA (such as SSBs) can help fork opening, or simply inhibit rezipping, effectively increasing the active opening of the fork. This inhibitory or activating role of the ligand can be modulated by force through a force-dependent transition, as in primer extension, Eq. (10).
Conclusions
Dynamic biological processes such as DNA replication and DNA unwinding are inherently stochastic. Single-molecule manipulation and detection techniques allow researchers to follow the progress of an individual molecule and to measure the instantaneous rates and their fluctuations (such as pauses). These fluctuations can provide detailed information about the underlying kinetic and mechano-chemical cycle that governs the behavior of the motor under study. Extracting this information from experiments requires the ability to identify the pause states and to quantify them properly.
The data analysis methods presented here approach the analysis of individual replication trajectories obtained with single-molecule manipulation methods in different ways. The direct pause identification approach is based on a local analysis of the trajectory, comparing the positions in one time interval with the positions in the following time interval to identify steps and plateaus. The velocity histogram instead takes a more global approach, building the histogram of local velocities over the whole trajectory. In this method, the resolution of the two velocity peaks of the whole trajectory allows us to optimize the local averaging time window (which filters high-frequency noise) and the local velocity time window. Similar comments apply to the first passage time distribution analysis, which uses displacement windows instead. The prominence method combines the local selection of the velocity peaks by their prominence with the invariance properties of the velocity peak mean along the whole trajectory. These methods, which combine local and global approaches, extract reliable information from noisy trajectories. The key is a proper accumulation of small pieces of local evidence. Their philosophy is therefore similar to that of Bayesian methods, but from a more phenomenological or (to a certain extent) model-independent approach.
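The velocity histogram idea summarized above can be illustrated with a short Python sketch. It is a simplified sketch, not the implementation used in the works reviewed here: the function names, the running-mean filter, and the window sizes are hypothetical choices, and the trajectory is assumed to be sampled at a constant time step `dt`.

```python
import math


def moving_average(positions, avg_window):
    """Running mean over `avg_window` consecutive samples (filters fast noise)."""
    n = len(positions) - avg_window + 1
    return [sum(positions[i:i + avg_window]) / avg_window for i in range(n)]


def local_velocities(positions, dt, avg_window, vel_window):
    """Velocity of the smoothed trace over a lag of `vel_window` samples."""
    smooth = moving_average(positions, avg_window)
    return [
        (smooth[i + vel_window] - smooth[i]) / (vel_window * dt)
        for i in range(len(smooth) - vel_window)
    ]


def velocity_histogram(velocities, bin_width):
    """Map the lower edge of each velocity bin to its number of counts."""
    counts = {}
    for v in velocities:
        edge = math.floor(v / bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


# Synthetic trace: a pause of 30 samples, then steady advance at 2 nt per sample.
trace = [0.0] * 30 + [2.0 * k for k in range(1, 31)]
velocities = local_velocities(trace, dt=1.0, avg_window=3, vel_window=3)
hist = velocity_histogram(velocities, bin_width=0.5)
```

On this noiseless synthetic trace the histogram shows one peak at zero velocity (the pause) and one at the advance velocity. With real, noisy data the two window sizes would have to be tuned so that these two peaks remain resolved.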
We think there is still room for improving trajectory data analysis with new methods based on phenomenological approaches. For example, new methods may arise from the mathematical study of the combination of quite general molecular motor models with the Bayesian approach. These mathematical developments could yield improved estimators to extract the relevant biological magnitudes from the trajectories (such as the maximum replication velocity).
Models provide hypotheses about possible mechanisms for the DNA replication process and about the effects of tension, coordination with helicases, or the role of ligands such as SSBs. Fitting the models to the experimental data allows us to check these hypotheses and see whether observations are compatible with the proposed mechanism. New models allow us to explore new potential mechanisms or further details of the processes.
Further progress in the theory of binding of ligands to long polymers is required to complete the understanding of the observed multimode SSB binding to DNA [45], [72], [77] and of the mutual interaction between SSB binding and DNA replication [72], [17]. Although the basis for computing the equilibrium coverage of ligands bound to a long polymer was established by McGhee and von Hippel [61], there is still a need to develop a complete theory for the mechanics, kinetics, and thermodynamics of these systems. Mechanical models and simple kinetic and thermodynamic models have already been developed for one and two modes of binding [45]. However, recent results show that an accurate description of the binding process (at medium or high coverages) requires a detailed count of the binding possibilities [100].
The analysis and modeling techniques described in this review have proven useful for interpreting the activities of replisome components working in isolation (or in pairs) and, importantly, have set the stage for the future analysis of replication traces of fully reconstituted replisomes. Analysis and modeling of single-molecule manipulation data will evolve hand in hand with the development of new methods that enable increased resolution, multiplexing, or the simultaneous measurement of multiple variables of the system [40], [20], [2]. The methods described in this review have been used or could be adapted to study RNA polymerases [1], [33], [32], [19], [25], [91] and other molecular motors [98], [90].
Author contributions
FJCG wrote the first draft. JJ prepared the figures. BI wrote the first part of the introduction and made general improvements and suggestions. All authors made relevant suggestions to improve the paper text and figures.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.