The great advances of sequencing technologies allow the in vivo measurement of nuclear processes-such as DNA repair after UV exposure-over entire cell populations. However, data sets usually contain only a few samples over several hours, missing possibly important information in between time points. We developed a data-driven approach to analyse CPD repair kinetics over time in Saccharomyces cerevisiae. In contrast to other studies that consider sequencing signals as an average behaviour, we understand them as the superposition of signals from independent cells. By motivating repair as a stochastic process, we derive a minimal model for which the parameters can be conveniently estimated. We correlate repair parameters to a variety of genomic features that are assumed to influence repair, including transcription rate and nucleosome density. The clearest link was found for the transcription unit length, which has been unreported for budding yeast to our knowledge. The framework hence allows a comprehensive analysis of nuclear processes on a population scale.
The great advances of sequencing technologies allow the in vivo measurement of nuclear processes-such as DNA repair after UV exposure-over entire cell populations. However, data sets usually contain only a few samples over several hours, missing possibly important information in between time points. We developed a data-driven approach to analyse CPD repair kinetics over time in Saccharomyces cerevisiae. In contrast to other studies that consider sequencing signals as an average behaviour, we understand them as the superposition of signals from independent cells. By motivating repair as a stochastic process, we derive a minimal model for which the parameters can be conveniently estimated. We correlate repair parameters to a variety of genomic features that are assumed to influence repair, including transcription rate and nucleosome density. The clearest link was found for the transcription unit length, which has been unreported for budding yeast to our knowledge. The framework hence allows a comprehensive analysis of nuclear processes on a population scale.
This is a PLOS Computational Biology Methods paper.
Introduction
As DNA represents the hereditary unit of life, maintaining its integrity is vital for every organism’s survival. A large variety of different genotoxic factors have the potential to damage the molecular structure of DNA. Among others, it has been shown that UV light induces Cyclobutane Pyrimidine Dimers (CPDs). Nucleotide Excision Repair (NER) is an evolutionarily conserved repair mechanism in Saccharomyces cerevisiae that can remove a broad range of damage, including CPDs [1]. NER is conventionally divided into two subpathways. The first recognition mechanism is named Global-Genome Repair (GGR) and can be observed along the entire genome. DNA damage is recognised directly by protein association. There is evidence that protein loading is promoted through interactions with chromatin remodellers that change the nucleosome density or distribution [2, 3]. The second pathway is restricted to actively transcribed regions; hence the name Transcription-Coupled Repair (TCR). Expressed genes exhibit quicker repair than silent downstream regions [4]. This promoted the assumption that TCR is more efficient than GGR, although constrained to the transcribed strand (TS) [5, 6]. TCR is initiated by lesion-blocked RNA polymerase II (Pol II) which cannot continue elongation [7]. Thus, a potential link of TCR to transcription rate has been indicated by several studies [8, 9]. After recognition, TCR and GGR use the same incision and nucleotide replacement mechanism. DNA is incised to either side of the lesion leaving an approximately 30-nucleotide gap, which is subsequently replaced and ligated (for a comprehensive description and analysis, see the review by [10]).Our understanding of such processes in living cells has been largely enhanced by Next Generation Sequencing (NGS). It allows the identification of enriched loci of a selected property on a genome-wide scale. Among others, it has been applied to investigate the CPD repair mechanisms in vivo through analysing temporal changes of the damage distribution. [8] obtained high-resolution CPD-seq data that are often used as a benchmark reference (see for example [9]). Their analysis indicates that single nucleosomes and DNA-bound transcription factors have an impact on the CPD formation. Moreover, they point out that repair is seemingly influenced by the CPD position with respect to the nucleosomal dyad as well as the transcription rate of genes. Another major contribution has been done by [9]. Their protocol for eXcision Repair sequencing (XR-seq) revealed strong TCR at early time points which is followed by repair in non-transcribed regions. Furthermore, [11] and [12] utilised CPD data to compute repair rates in different areas, which indicated that the process is highly organised into genomic regions. By using GGR-deficient strains, they show that repair is changing globally when the subpathway is repressed. This is compared to the distribution of repair proteins and histone modifications.Unfortunately, due to costs and constraints in the experimental protocol, NGS data sets contain barely more than a few time points over several hours. Consequently, previous studies could only derive limited conclusions, e.g. the absolute change at different loci. We argue that such an analysis ignores valuable information about the transitional process from one time point to another. Furthermore, it should be emphasised that sequencing signals are commonly understood as representing an average cell. We advocate an interpretation where the data is explained as the product of many independent cells. Thus, the repair dynamics are driven by non-interfering stochastic processes. Without assuming any specific molecular mechanism, we hypothesise that they are composed of two independent random variables namely accessibility to the lesion governing repair times and Brownian motions of proteins through the nucleus to find their target. It has been shown by several studies that proteins exhibit a range of different movements in the nucleus [13-18]. Diffusion has also been investigated and modelled in context of DNA repair for the Rad4-Rad23 complex [19]. Although protein movements have been used to understand specifics of NER kinetics, a framework to quantitatively describe population-based sequencing data is still lacking.Our approach and main results can be summarised as follows. The sparse temporal resolution of NGS data sets makes it necessary to incorporate precise assumptions about the nature of the process in order to recover missing information. Here, we present a computational framework to analyse DNA repair kinetics. We derive a function to study CPD removal as a Poisson point process of independent cells. Since we do not impose any molecular mechanism, we obtain a simple and minimal representation. The parameters can be derived using the well-studied physical model for phase transitions, which is described in detail by Kolmogorov [20], Johnson and Mehl, [21], and Avrami [22-24] (KJMA model). It can be conveniently transformed to a linear regression problem and is therefore executable on almost any ordinary computer. A consequence that is implied by our repair model is that the observed change of CPDs is non-constant over time. To our understanding, this has not been explicitly incorporated in the analysis of NGS data. The model validity can be verified with independently probed XR-seq data [9]. We are able to recover particular aspects of the NER kinetics despite our broadly applicable assumptions. We ultimately use the framework to predict correlations with other nuclear processes. It is able to establish interrelationships that are supported by other studies such as nucleosome density [8, 12] and transcription rate [8]. It is most surprising, however, that we find the strongest correlation with transcription unit (TU) length, which is a new finding for budding yeast to our knowledge. Interestingly, our model allows also an alternative understanding of the data in which repair positions grow as patterns in a population. Although the analysis has been demonstrated for DNA repair, it can be applied to any process that can be modelled by an irreversible binary state transition. The source code is available on GitHub: https://github.com/leoTiez/jmak [25].
Results
Modelling DNA repair
In a single cell, CPD damage describes the mispairing of two adjacent pyrimidine nucleobases. Instead of establishing hydrogen bonds to the opposite strand, they cause two consecutive nucleotides to bind to each other. Consequently, there can be maximally one lesion per position. This results in a zero-one (i.e. damaged-repaired) state space per position and per cell. During ongoing repair, lesions are removed, and positions change subsequently their state to repaired. It can be assumed that this process is stochastic and involves to some extent unpredictable noise. If we could repeatedly and independently measure the repair times for a single position in a single cell, we could distribute the measurements over a timeline (Fig 1B). This type of data can be investigated by a Poisson point process, which allows the derivation of a predictive function that expresses the probability of repair over time. If we would temporally discretise the data over larger bins, all repair time points within a bin were aggregated together. In the following, we assume that this is given by the change of the CPD-seq signal, as the amplitude decrease at an arbitrary position must explain the number of cells that have repaired their lesions. Consequently, the data represent the process over the entire cell population. We conjecture that the dynamics are independent between cells. The CPD-seq data can be therefore alternatively interpreted as a two-dimensional grid: one axis representing the cells and the other the nucleotide positions (Fig 1C). We understand NGS signals not as representing an average cell but the mutual effect of multiple independent cells.
Fig 1
Schematic representation of the Process.
(A) The repair proteins’ search process (purple with arrow) to the lesion (red) can be understood as a Brownian motion. Repair happens after association with some delay, which we assume to be another random variable (waiting time). (B) If we could repeatedly measure the repair times for one position in a single cell individually, we could distribute the measurements over a timeline. This can be analysed by a stochastic point process. Binning the timeline should correspond to the observable change of CPDs in a given window. (C) We presume that this stochastic repair process happens independently in each cell. Therefore, CPD-seq data can be understood as an accumulation of several experiments. This allows an understanding of the signal as a two-dimensional grid. Since there can be only one lesion per position per cell, we understand the amplitude of the signal as a surrogate for the number of cells that contain a lesion at this locus (marked in green).
Schematic representation of the Process.
(A) The repair proteins’ search process (purple with arrow) to the lesion (red) can be understood as a Brownian motion. Repair happens after association with some delay, which we assume to be another random variable (waiting time). (B) If we could repeatedly measure the repair times for one position in a single cell individually, we could distribute the measurements over a timeline. This can be analysed by a stochastic point process. Binning the timeline should correspond to the observable change of CPDs in a given window. (C) We presume that this stochastic repair process happens independently in each cell. Therefore, CPD-seq data can be understood as an accumulation of several experiments. This allows an understanding of the signal as a two-dimensional grid. Since there can be only one lesion per position per cell, we understand the amplitude of the signal as a surrogate for the number of cells that contain a lesion at this locus (marked in green).It is clear that the number of cells with a lesion at a given position is discrete. Therefore, the number of repair events in a given time window can be studied with a Poisson process. It expresses the probability for a number of events N(t) = n in a given time t and takes the form , where λ(t) denotes the rate of events as a function of time. It is commonly assumed that CPD repair is irreversible if the irradiation source is removed. It follows that the probability that repair has happened (p) at time t can be expressed by the cumulative distribution function for the Poisson process, namelyTo find a functional representation of λ(t), we conjecture that repair proteins move through random Brownian motions (diffusion) to the repair sites, subsequently associate to the DNA and remove the lesion. The entirety of this mechanism can be understood as a mixture of two random processes: diffusion and waiting/repair time (Fig 1A). We surmise that the waiting time is determined by the accessibility to the lesion. The mutual effect of repair proteins removing DNA lesions becomes observable through the decrease of the CPD-seq signal. We thus assume that this change ΔC during time Δt is proportional to the searched volume by these proteins (which is related to the mean squared displacement) and the average of the expected repair time
b denotes a scaling factor that accounts for the diffusion constant and . with the the anomalous coefficient is the dominating term with the highest order. If , the process is called subdiffusive; if , the movement exhibits a superdiffusive behaviour. It is clear that the integral over Eq 2 also follows a power-law, i.e. ∫λ(t)dt ∝ bt+ O(t). Substituting in Eq 1 results in . When setting and assuming that only a fraction θ ∈ [0, 1] of cells have the ability to repair their lesions in a given time, we obtain
τ is the characteristic time until repair can be observed. An equation with a similar form to describe the phase transition in solids was derived independently by [20, 21], and [22-24] (KJMA model). As Eq 3 can be converted to a linear regression problem (Eq 5), it can be straightforwardly applied to find the necessary parameters. More interestingly, the KJMA model allows an alternative understanding of the data and the process, which is explained in Discussion.
Applying the model to the data
We formulated three expectations in order to prove the model validity. Firstly, we required that the estimated repair dynamics need to be in line with independently probed data; secondly, we thought it to be indispensable to recover NER-specific features that were not implicitly incorporated into the model; and lastly, Eq 3 needs to make verifiable predictions about other factors that influence repair. CPD data for the parameter estimation were taken from [8] (0, 20, 60, and 120 minutes after irradiation) and divided into different segments where λ(t) was assumed to be spatially constant. We distinguish between TCR regions, which are the TS of genes that presumably exhibit TCR; the NTS of TCR regions; and non-TCR areas, which are composed of transcripts where the effect of TCR is not evident and intergenic regions (see S1 Appendix as well as S1 and S2 Figs). Moreover, TS and NTS of TCR regions were equally divided into start, centre, and end. Subsequently, CPD data was converted to represent repair using Eq 7 (in the following also called repair data). An example of the predicted repair dynamics is given in S3 Fig. We also compared the presented results with the analysis of a more traditional segmentation into TS and NTS of all genes as well as intergenic regions (S3 Appendix).XR sequencing provides a snapshot of currently ongoing repair in the cell culture, and it therefore represents an independent angle on CPD removal. It should correspond to the derivative of Eq 3 (given in Eq 6). XR-seq signals were taken from [9] (5, 20, and 60 minutes after irradiation) and segmented as for the CPD-seq data. We assumed that a surrogate for ongoing repair can be additionally derived from the CPD-seq data themselves by calculating the damage decrease per time (S4 Appendix Eq 4). This was used as a baseline value for the correlation between model and data. As we assumed a non-linear interrelationship, we used the distance correlation (DC) as a correlation measurement (S4 Appendix). Strikingly, the predicted repair rates correlate clearly better (DC = 0.441) than the actual data (DC = 0.209) (compare Fig 2A with 2B; see S2 Table). Moreover, the model predictions align fairly well with the XR data (exemplified in S4 Fig). We hence surmise that Eq 3 is in agreement with independently probed data.
Fig 2
Comparing XR-seq data with model predictions.
The values at 5 minutes are given in blue, 20 minutes are coloured yellow, and 60 minutes are green. The plots show that the distance correlation between prediction and XR-seq data is even higher than for the CPD-seq data. (A) Predicted repair rates with respect to the XR-seq data exhibit a considerably strong correlation (DC = 0.441). Predictions are given as the square root of the model prediction. This reduces the effect of increasing variance with larger derivatives. (B) The repair rates derive from the data as a function of XR-seq values shows a weaker correlation (DC = 0.209).
Comparing XR-seq data with model predictions.
The values at 5 minutes are given in blue, 20 minutes are coloured yellow, and 60 minutes are green. The plots show that the distance correlation between prediction and XR-seq data is even higher than for the CPD-seq data. (A) Predicted repair rates with respect to the XR-seq data exhibit a considerably strong correlation (DC = 0.441). Predictions are given as the square root of the model prediction. This reduces the effect of increasing variance with larger derivatives. (B) The repair rates derive from the data as a function of XR-seq values shows a weaker correlation (DC = 0.209).Despite the fact that we model time-dependent repair, we nevertheless do not incorporate two (potentially competing) repair mechanisms, i.e. TCR and GGR. It is commonly presumed that CPD removal through TCR is quicker than by GGR [4, 26]. Moreover, as GGR acts genome-wide, they are spatially non-exclusive for genes. Indeed, we can recover the cumulative effect when averaging the repair evolution for a group of segments, e.g. the start of TCR regions. The beginning of TCR areas is almost solely repaired by a single mechanism (Fig 3A). The contribution of this pathway is decreasing as a function of distance from the transcription start site (TSS). Instead, a later acting mechanism becomes increasingly observable (Fig 3B and 3C). Lastly, lesions in non-TCR regions are only detected by the late-acting process (Fig 3D). This is even the case despite the fact that non-TCR areas also include transcripts. We deduce that these two distinct pathways show the effect of TCR and GGR along the gene. We were therefore able to separate the effect of two distinct NER processes without involving any particular mechanism. Parameter distributions (i.e. m, τ, and θ) that create these repair kinetics are exemplified in Fig 4. Surprisingly, the NTS possesses different dynamics whilst not exhibiting any difference between start, centre, and end (S5 Fig). The average repair fraction is much lower than for all other areas (θ ≈ 0.6 instead of ≈0.8). Moreover, we observe a subtle early increase of the derivative, indicating a larger presence of early repair in comparison to non-TCR regions. It is difficult to analyse this trend without additional experiments. It could simply be the impact of neighbouring overlapping regions. However, these results could equally point to different repair dynamics on the NTS.
Fig 3
Collective behaviour of genomic regions can recover mutual effect of TCR and GGR.
Dashed lines give the mean whereas the shaded areas show the standard deviation. Blue and orange represent the repair fraction and the repair rate (derivative of the repair fraction), respectively. (A) The start of TCR regions is repaired early after irradiation, demonstrating the effect of TCR. (B) At the centre of TCR areas, we can observe the mutual effect of TCR (first peak) and GGR (second peak). (C) GGR’s contribution increases whilst the impact of TCR becomes less important towards the end of the gene. (D) Non-TCR regions are solely repaired by GGR. Therefore, repair is expected at later time points during the process.
Fig 4
The parameter distribution is coloured with respect to different genomic properties.
The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame. (A) and (B) are coloured with respect to NET-seq data (pink/low to green/high) for the centre of TS and NTS; (C) and (D) indicate the nucleosome density (turquoise/low to brown/high) for non-TCR regions and the NTS centre; (E) and (F) show the distribution with respect to the TU length (red/small to blue/big) for the centre of TS and NTS.
Collective behaviour of genomic regions can recover mutual effect of TCR and GGR.
Dashed lines give the mean whereas the shaded areas show the standard deviation. Blue and orange represent the repair fraction and the repair rate (derivative of the repair fraction), respectively. (A) The start of TCR regions is repaired early after irradiation, demonstrating the effect of TCR. (B) At the centre of TCR areas, we can observe the mutual effect of TCR (first peak) and GGR (second peak). (C) GGR’s contribution increases whilst the impact of TCR becomes less important towards the end of the gene. (D) Non-TCR regions are solely repaired by GGR. Therefore, repair is expected at later time points during the process.
The parameter distribution is coloured with respect to different genomic properties.
The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame. (A) and (B) are coloured with respect to NET-seq data (pink/low to green/high) for the centre of TS and NTS; (C) and (D) indicate the nucleosome density (turquoise/low to brown/high) for non-TCR regions and the NTS centre; (E) and (F) show the distribution with respect to the TU length (red/small to blue/big) for the centre of TS and NTS.Lastly, we extended the analysis to make predictions about influencing factors for CPD repair in vivo. Previous studies published various measurements of different nuclear properties that could possibly interact with lesion removal dynamics. To assess the predictive power of our model, we opted to analyse a link to transcription rate [8, 9] and nucleosome density as representing chromatin structure [8, 12, 27]. We also investigated a link to TU length and the relative distance to centromeres and telomeres as possible unreported affecting parameters. We used the NET-seq signal produced by [28] as a surrogate for transcription rate without UV irradiation. The TU length was measured by [29]. Nucleosome data after UV treatment were acquired by [12]. We excluded regions outside a reasonable parameter range from the subsequent analysis (see Methods and materials and S2 Appendix; the number of removed regions is given in S1 Table). Correlations with the model parameters were verified with a significance test, during which we compared a binary classification model with the performance of a random model (see Methods and materials and S5 Appendix for more information; the working of the classifier is explained in S6 Fig). Interrelationships to other sequencing data are elaborated and discussed in S6 Appendix. All data distributions are given in S7–S12 Figs. The results differed considerably depending on the genomic area and context (S3 Table). TCR has been repeatedly investigated with respect to transcription rate, and it is surmised to be positively correlated. Hence, higher transcription yields quicker repair, whereas low transcription results in slower lesion removal [8, 9]. Our model confirms these findings (Fig 4A and 4B). Non-TCR segments and the start of the TS seem to be starkly influenced by the nucleosome density, whereas all other areas do not show a strong correlation (Fig 4C and 4D). The clearest results, however, were obtained by the TU length. Both, TS (Fig 4E) and NTS (Fig 4F) are clearly influenced. The TU length is therefore likely to be contributing to the lesion removal dynamics. This is an unreported finding for budding yeast to our knowledge. The developed quantitative framework has hence the potential to identify established as well as new interrelationships. Importantly, a correlation with the distance to telomeres and centromeres (Eq 9) did not indicate a significant link. This shows that the applied method is selective for certain correlations (S12 Fig).
Discussion
The few time points of NGS data sets require a temporal model to recover missing information between data samples. In this work, we developed a computational approach to describe the DNA repair kinetics on a population scale. We recover region-specific properties based on the genome-wide distribution of DNA damages. We assume a mixture of two stochastic processes (diffusion and lesion accessibility) that collectively explain the change of CPD data over time. Parameters of the derived equation can be estimated with the KJMA model, that is conveniently converted to a linear regression problem. This allowed the analysis of the temporal process as a whole rather than only comparing single time points. Importantly, it points out that the signal changes non-linearly over time. This is expected from a biological point of view, as TCR and GGR are commonly seen as acting within different time scales. However, it should be emphasised that it has not been incorporated in the analysis of temporal changes in sequencing data to our knowledge. The model therefore accounts specifically for dynamics on a population scale. Moreover, the derivative (Eq 6) provides key information about active ongoing repair. It thus permits linking CPD-seq data—showing the DNA damage distribution over the genome—and XR-seq data of excised DNA fragments generated by repair. This provides strong support for the validity of the model. Even though Eq 3 represents repair only with one mechanism per region, the combined effect of TCR shortly after UV treatment and GGR at later time points can be recovered when considering the average over several areas. The model can be readily used to uncover interrelationships between repair parameters and genomic contexts. Our outcomes are consistent with known influencing factors such as transcription rate and nucleosome density. Remarkably, the clearest link was established between the repair dynamics within genes and their length. To our knowledge, this is an unreported finding for budding yeast. In the following sections, we discuss the relevance of our approach and results within the context of previous publications.
Applying the CPD repair model
Several studies proposed temporal models for UV-induced lesion repair on different levels of detail. [30] represented NER kinetics in human cells using a Markov-Chain Monte-Carlo approach. It explains the removal of 6–4 photoproducts on a single-cell scale through the random and reversible assembly of repair complexes. A similar model was proposed by [31]. Interestingly, though, they derive very different conclusions, as they suggested that random or pre-assembly of repair proteins is unfavourable. Despite a great level of detail of both models, they are incapable to make region specific predictions. Moreover, as both models are based on microscopy data, they do not explain temporal changes in genome-wide sequencing data on a population scale. A Monte Carlo approach to explain the damage distribution and subsequent repair induced by ionised irradiation in a single cell was proposed by [32]. It also incorporates the collective effect of NER and base excision repair (BER), therefore accounting for potentially competing mechanisms. It should be stressed that the lesion type is considerably different. Again, the predictions are location unspecific. [33] provided a different angle by presenting a protein-protein interaction landscape of NER components in yeast. Predictions about the repair efficiency in different regions were not established. To our knowledge, our model is the first that accounts for region specific changes in population-based data.As there are only three time points, a data-driven machine learning model is prone to overfitting. In order to find a reasonable representation of the data, we incorporate explicit suppositions to derive Eq 3. To be precise, we presume that repair times follow a Poisson point process, and the non-constant rate λ(t) can be described by a mixture of protein diffusion and repair times. This restricts the trajectory to an S-shape. Our understanding of the process makes two independence assumptions. Firstly, there is non-interfering DNA repair between cells; and secondly, protein diffusion is independent of lesion accessibility or repair time.It is important that the points in a Poisson process are (sufficiently) independent from each other [34]. The repair times that are distributed along the timeline (e.g. as in Fig 1B) should hence symbolise values of independent stochastic variables. We find the presumption of non-interfering repair between cells trivial. Saccharomyces cerivisiae are single-cell organisms and should thus react independently to DNA damage. Moreover, yeast cultures were grown in rich medium after UV treatment, precluding any limitations for growth [8]. We conjecture the independence presumption to be reasonable.The Poisson process is governed by the rate parameter λ(t). It is presumed that nearby genomic positions posses similar rates. Therefore, λ(t) should have a slow spatial variation, and a segmentation of the CPD-seq data into similar behaving areas is possible. A similar binning approach was applied by [8]. The simplest model sets λ to a constant and therefore independent of time. In such a setup, we would expect to observe the largest signal change right after irradiation which subsequently slows down (S13A Fig). This could indeed resemble the beginning of the TS of TCR regions (Fig 3A). However, the majority of areas exhibits the strongest repair rates between 20 and 60 minutes. We conclude that a non-constant λ(t) provides a broader applicability for explaining CPD repair.It remains to find a description of the function λ(t). Repair happens through the collective working of several proteins, which need to move through the environment and find their target. Nuclear diffusion has been studied and modelled in detail in different contexts such as chromatin [15-17] and protein movements [13, 35], including the repair protein Rad4 [19]. It is clear that more DNA-protein interactions are possible during longer time windows, since proteins have more time to travel longer distances to reach their target. The distance is denoted by the random variable R which has the expected squared displacement with diffusion constant [14, 16, 17, 19, 35]. Consequently, the searched volume is , where k′ is a scaling constant. We couple the Brownian motion through space with an independent random variable X which symbolises repair time or accessibility to the lesion. We define . As R and X are independent, we can write , with . Substituting in Eq 1 results in Eq 3. We find it remarkable that β (representing a mixture of diffusion constant and expected waiting time) tends to be inversely proportional to m which is linked to the abnormality coefficient. Despite quick diffusion (m being large), we observe slow repair rates (β being low). We interpret this phenomenon as the repair time/accessibility X dominating the process, making diffusion negligible. This argument can be equally applied when diffusion is seemingly very slow.The simplest representation of an irreversible transition between two states is given by an S-shaped function which contains at most one inflection point. It is important to emphasise that this function is not necessarily inversely symmetric around this point, meaning that the left side can be differently shaped than the right side. These requirements are fulfilled by the KJMA model, which is therefore a sensible choice (see S7 Appendix and S13 Fig).
Analysing genomic properties which influence repair kinetics
TCR has been identified as a rapid repair pathway on the TS. Intergenic regions and the NTS exhibit significantly slower lesion removal, which was demonstrated on the genomic scale in yeast and human cells [8, 9, 27, 36, 37]. It remains an unsolved quest to find an interrelationship between TCR efficiency and transcription rate. Whilst the two parameters are indeed assumed to be correlated [8, 9], some studies point out that TCR repairs CPDs efficiently at nearly all genes including those with a low transcription [27]. An in-depth analysis is still missing, and there is no clear consensus on how transcription rate is affecting repair. In this work, we compared the model predictions to gene expression. Our analysis clearly shows a significant correlation on the TS and is therefore supporting the common assumption (S7 Fig).As repair proteins need to recognise and repair lesions on the DNA, it is conjectured that chromatin organisation can significantly modulate the efficiency of CPD repair [8, 12, 27]. However, previous studies were mostly scrutinising the positioning of damage at nucleosomes. CPD removal was shown to be less efficient at the dyad of strongly positioned nucleosomes in yeast [8]. Moreover, GGR on the NTS was asymmetrically inhibited in yeast and human cells with respect to the position within the nucleosome [27]. Even though nucleosome occupancy after UV treatment was already previously probed, the potential relationship of these data with CPD repair was not directly addressed [12]. Our results demonstrate a significant correlation between repair and nucleosome density in non-TCR regions (Fig 4C). We also discovered a clear influence on the beginning of TCR areas (S8 Fig).Unexpectedly, our outcomes show a strong correlation between TU length and repair. Differences in transcription shutdown and restart after UV treatment relative to gene size were previously reported for human cells [38]. Both transcription regulation and efficient repair are necessary to orchestrate an effective cellular response to UV light. The restart of transcription to pre-irradiation levels is an important step at the final stages. However, a direct evaluation of lesion removal with respect to gene size was not performed. To our knowledge, this is a new finding for CPD repair in yeast. Due to our data pre-processing (Eq 7), we can rule out that the result derives only from the fact that larger areas have a greater potential to include more damage. This is true due to two reasons. Firstly, we normalised the CPD value in each bin (e.g. beginning of the TS) by the number of pyrimidine dimers in the sequence as described in [8]. Secondly, and more importantly, we want to point out that the quotient in Eq 7 lets any length dependence and normalisation of the binned data vanish. Therefore, the values become automatically comparable due to the design of Eq 7. It should be mentioned though that the regions of interest can become rather small when segmenting the gene into subareas. Influence or noise from neighbouring areas cannot be excluded. However, due to the fact that the same result can be obtained with a different segmentation (S3 Appendix and S9 Fig), we presume that it represents a genuine feature of the CPD removal mechanisms in yeast cells.Lastly, we investigated a potential link to the distance relative to the centromere and telomere depending on which was closer to the region of interest (S12 Fig). A link to repair has not been proposed to our knowledge, which made it an interesting property to produce verifiable model predictions.In conclusion, our work opens interesting perspectives for future research on DNA repair mechanisms and influencing genomic factors. New experimental data with increased temporal resolution will help to refine the model and analysis. The approach can be similarly used for other organisms including human cells. Moreover, it can be readily applied to sequencing data of any nuclear process that can be represented as a two-state system, and it is not restricted to repair.
Introducing the repair space—An alternative understanding of the data
We have discussed the model in detail with respect to a stochastic point process. We want to provide an additional interpretation that is motivated by the physical implications of the KJMA model. Next to assuming independence between cells, we conjecture in the following also independent repair within each cell (S8 Appendix). Moreover, we presume that CPD data were converted by Eq 7.Considering the two-dimensional grid in Fig 1C, the independence assumptions above permit us to re-order repair positions to patterns. Nevertheless, we restrict the re-grouping to stay within areas of interest which are assumed to behave homogeneously. The growth of patterns in the virtual repair space reminds strongly of the phase transition in solids which is described by the KJMA model.The creation of these repair patterns can be described by the nucleation rate n (which we link conceptually to the expected waiting/repair time) and a growth speed G (which is in this analogy linked to the diffusion process). In the following, we assume G to be constant in all directions. The transformed volume within Δt starting from a single nucleation site is therefore
where σ denotes a parameter that describes the shape of the expanding pattern (which would become part of β). Interestingly, the parameters obtain a slightly different meaning from this point of view. m is the Avrami exponent and characterises the geometry of the area covered by repaired positions after their aggregation. For example, if m − 1 = 2, the area corresponds to regular disks in a two-dimensional space. Irregular forms can be expressed with non-integer values [39]. Nevertheless, a direct comparison of a physical shape with a virtual pattern might be difficult to imagine. We therefore advocate another interpretation. m can be understood to express time dependence of the repair process (compare with S14 Fig). A similar notion has been also proposed in the physical context [40]. We believe that such an understanding could possibly permit the inclusion of independent results from the realm of physics.
Methods and materials
Parameter estimation and derivative
Eq 3 explains CPD repair as an S-shaped transformation over time. It should be noted that the process has a defined starting point at t = 0. By applying the natural logarithm on both sides twice, we obtainNote that the expression is now continuous over ln t. Given the data points for repair and by assuming a value for θ, the parameters m and 1/τ can be found by solving the linear regression problem defined in Eq 5 (compare with bottom plots in S14 Fig). θ was determined through a systematic parameter search. We started with 0.5 as minimal value for transcribed/TCR regions and 0.4 for all other. We increased it thereafter by Δθ = 0.01, until we reached θ = 1.0. We chose the θ-value that can best describe the repair data together with the corresponding parameters m and 1/τ. This is determined by maximising the adjusted R2. It represents the variance in the data that can be explained through the model and can be interpreted as goodness of fit. The derivative of Eq 3 is given by
Data processing
All experimental data that were analysed in this study comes from public databases (see overview in Table 1). CPD-seq data was taken from [8]. It contains two time courses with samples taken at t1 ∈ {0, 60}min and t2 ∈ {0, 20, 120}min, respectively. The location of transcribed areas was taken from [29]. Data signals were partitioned into different segments, i.e. the TS and NTS of TCR regions as well as non-TCR areas. For the latter, we combined both strands to one group. Consequently, the linear regression problem in non-TCR regions was required to find the best representation for both strands.
Table 1
Overview over the data sets that were used in this study.
Property
Strain
Data type
UV Dose
Reference
CPD
BY4741 (WT)
CPD-seq
125 J/m2
[8]
CPD repair
Y452 (WT w.r.t. repair)
XR-seq
120 J/m2
[9]
Abf1
BY4742 (WT)
ChIP-seq
100 J/m2 (0min)
[12]
H2A.Z
BY4742 (WT)
ChIP-seq
100 J/m2 (0min)
[12]
Nucleosome distr.
BY4742 (WT)
MNase-seq
100 J/m2 (0min)
[12]
Transcription rate
YSC001
NET-seq
-
[28]
CPD-seq fragments were normalised by the number of available pyrimidine dimers, as explained in the supplementary material of [27]. The damage distribution was subsequently transformed into repair in area a through
where N denotes the size of a, is the normalised CPD signal at time t and locus i in area a, and t ∈ {20, 60, 120}min. a is any of the previously described regions (e.g. the start of the NTS). We additionally take it for granted that no new CPD lesions can be induced during repair. Hence, data points were enforced to be greater than or equal to zero and monotonously increasing as a function of time. The rectification is defined by
where t ∈ t = (20, 60, 120).All other sequencing data that were used for the correlation analysis were averaged over the size of the area of interest. Start, centre, and end of TS and NTS were linked to the same value to smooth out the potential influence of noise. For example, all subregions of a TS were associated to the same transcription rate. Moreover, both strands were compared to the same data, e.g. the TS and the NTS were related to the same nucleosome density. We noticed that the NET-seq signal amplitude decreases as a function of distance from the TSS (S15 Fig). This could possibly induce a TU length-specific bias that is not removed by taking the average over the TU length. We could verify, however, that the NET-seq data strongly correlates with independently probed Pol2 ChIP-seq data [41] (S16 and S17 Figs). We therefore assume that it reasonably represents transcription rate, whilst allowing a direct comparison to the results obtained by [9] (S9 Appendix).With the exception of nucleosome density, all biological data values possess a biased distribution. They strongly peak around a low value but contain large positive tails. To remove a potential bias introduced by outliers, we limited our analysis to the lower 95th percentile. As this procedure was applied to all data (except nucleosome density), we did not introduce a bias towards a certain model. Rather, we improved comparability. The only exception is the MNase-seq signal, as it is approximately normally distributed. We consider that trimming could introduce a bias rather than removing one.The relative distance to centromeres (c′) or telomeres (t′) was measured as follows. Denoting the gene position by x, we can defineWe divide only by half of the length since the maximal distance (d = 1) to both centromere and telomere should be the middle between them two.
Correlating repair dynamics to genomic contexts
Areas with parameter values outside a reasonable range were excluded from the subsequent analysis. We restricted m ∈ [0.5, 5] and τ ∈ [20, 200]. θ was constrained through the parameter search. Motivation and consequences are discussed in S2 Appendix. The repair parameters were investigated in context of other biological data. We opted for a nonparametric classification k-Nearest Neighbour (kNN) approach. We grouped biological data into high (class c = 1) and low values (c = 0), such that both classes contained the same number of samples to remove the difficulty posed by the biased data distribution. To train the machine learning model, the input values x = (m, β, θ) where normalised so that every dimension was normally distributed with zero-mean and a standard deviation of one. We compared the results for several kNN models with k ∈ {5, 10, 20, 50, 100} to remove any model specific bias. A trained model compares an unknown input to the k closest values of a known data set {X, c} to predict class . We opted for the Euclidean distance as similarity measurement. Here, the i-th row of X is x = (m, β, θ), and c is the associated class in c. is determined by a majority vote. For example, if more than 50% of the k neighbouring values are classified as group c = 1, then is predicted to be group 1 as well. kNN is categorised as nonparametric model which permits the comparison of different results. The performance was measured through calculating the prediction errorThis was compared to a random baseline model, for which classes were randomly shuffled to a given parameter triple (m, β, θ) during training. Data {X, c} were arbitrarily partitioned into learning and testing data sets. Every experiment was independently repeated 100 times to reduce the effect of any potential bias. We consider an interrelationship to be important if the prediction error of the true function is significantly lower than the error of the random model (p < 0.00001% of a one-sided t-test). Moreover, we require that 90% of the prediction errors are below E < 0.5, which is the expected outcome of an unbiased coin-flipping experiment. This significance must be found in three out of the five evaluated k to indicate an interrelationship.
Determining TCR Regions.
(PDF)Click here for additional data file.
Discussing the Effect of Data Transformation and Selection.
(PDF)Click here for additional data file.
Dividing the Data into Genes versus Intergenic Regions.
(PDF)Click here for additional data file.
Explaining the Correlation Between XR-seq Data and Repair Rate.
(PDF)Click here for additional data file.
Discussing the kNN Approach.
(PDF)Click here for additional data file.
Analysing Repair Kinetics in Context of Abf1 and H2A.Z Distribution.
(PDF)Click here for additional data file.
Comparing the KJMA Model With Other Approximations.
(PDF)Click here for additional data file.
Discussing the Model in Context of the Physical KJMA Model.
(PDF)Click here for additional data file.
Investigating a Link Between Transcription Rate and TU length.
(PDF)Click here for additional data file.
The number of models per region, before and after applying the requirements for parameter ranges.
IGR abbreviates intergenic regions.(PDF)Click here for additional data file.
The DC between XR-seq and repair predictions / data for different experimental configurations.
(PDF)Click here for additional data file.
Number of non-random interrelationships between model parameters and sequencing data over k.
The table gives the number of kNN models that could find a correlation between model parameters and genomic context. k ∈ {5, 10, 20, 50, 100}. We defined a link to be significant if at least three out of five k find a non-random interrelationship.—means that data was not used in the given configuration. NET denotes NET-seq data, ND is nucleosome density, and meres give the relative distance to centromeres or telomeres. Suffixes S, C, and E denote start, centre and end of an area. NTCR are non-TCR areas. IGR are intergenic/non-transcribed regions.(PDF)Click here for additional data file.
Scheme of the segmentation setup.
The circles represent the number of cells with ongoing repair in the region. The arrows indicate the region and direction of transcription. The results in the paper follow the TCR setup. Here, only the first gene is considered as TCR area which shows more efficient repair than intergenic regions within the first 20 minutes after UV irradiation. All other parts are labelled as non-TCR region. Therefore, it spans from the end of the first gene to the end of the second. The gene configuration (S3 Appendix) partitions the genome into the traditional notion of transcribed and intergenic regions. The TU positions were determined by [29].(TIF)Click here for additional data file.
Relative repair distribution over genomic areas.
Relative repair in non-transcribed regions is chiefly lower than 20% within the first 20 minutes (88.95%). Genic areas with stronger repair dynamics are thus likely supported by TCR. For all other transcripts, we cannot exclude the possibility that they are exclusively repaired by GGR.(TIF)Click here for additional data file.
Example for model predictions.
Data points are given by solid dots. The blue dashed line represents the repair fraction predicted by the model (left axis). The orange dashed line shows the derivative (right axis). (A) The SNF6 gene can be well approximated. (B) However, GEM1 exhibits no repair within the first 20 minutes, which results in a switch-like behaviour. (C) This is better understood when showing the data points after transformation according to Eq 5. A linear regression is difficult since they do not align.(TIF)Click here for additional data file.
Example for model prediction and XR-seq data over time.
XR-seq data (points) and the predicted repair rate (dashed lines) are exemplified for genes BDH1 (orange) and BDH2 (blue). When re-scaling XR data and repair rate prediction between 0 and 1, both follow clearly similar trends. BDH2 has its largest XR-seq value at 20 minutes post-irradiation, whereas BDH1 shows biggest repair rates after 5 minutes. This is indeed captured by the model.(TIF)Click here for additional data file.
The evolution of repair along the NTS.
The average repair evolution for the NTS (blue dashed line) shows a much lower repair fraction (≈0.6) than the other areas. Moreover, the repair rate (orange dashed line) indicates repair at early time points. This could be caused by possible overlapping transcripts or by antisense-transcription-coupled repair. Shaded areas show the standard deviation. The repair trajectory is the same for (A) the beginning, (B) the centre, and (C) the end of the NTS.(TIF)Click here for additional data file.
Example of the learnt function between model parameters and genomic context.
(A) The learnt parameter distribution and the associated class for the true model after applying a principle component transformation. The x and y-axis give the first and second principle component, respectively. Red represent large genes, whereas blue shows low values. (B) The error distribution for the predictions follows the expected outline given by the learnt function in (A). The blue and red circles give values that were classified as short but were actually large and vice versa, respectively. White points are correctly classified. The right bar shows the error distribution along the colour axis, i.e. over estimated, correctly classified, and underestimated values from top to bottom. The lower histogram shows the distribution of overall correctly and incorrectly classified values. (C, D) The learnt parameter map and the error distribution of the random model.(TIF)Click here for additional data file.
Model parameters with respect to transcription rate.
Our results for the transcription rate support the hypothesis that it influences repair on the TS. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Model parameters with respect to nucleosome density.
Nucleosome density is seemingly influencing repair in non-transcribed/non-TCR regions as well as the beginning of the TCR TS and the TS in the gene setup. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Model parameters with respect to size.
The size is clearly influencing repair for both, TS and NTS in the TCR and gene configuration. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Model parameters with respect to Abf1.
The results for Abf1 are more ambiguous. Although we can find a significant correlation to non-TCR regions as expected in the TCR setup, the picture is less clear for the gene configuration. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Model parameters with respect to H2A.Z.
Similar to Abf1, the correlations with H2A.Z do not allow a straightforward interpretation. Whilst repair in all areas in the TCR configuration is seemingly linked to H2A.Z, this tends to be restricted to the TS and NTS in the gene setup. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Model parameters with respect to cetromeres and telomeres.
With the exception of the TS in the gene setup, the distance to telomeres or centromeres (shortened with meres) does not affect repair dynamics. The x and y-axis give the values of m and θ, respectively. The size of the circles show 1/τ: the larger the circle, the shorter the characteristic time. Significant interrelationships are marked with a red frame.(TIF)Click here for additional data file.
Comparison of the KJMA model with other functional descriptions.
(A) A homogeneous Poisson repair process with λ(t) = c has the strongest change in the beginning which subsequently flattens out. In most investigated regions, such a behaviour is not observed. (B) We compared the performance of different models to describe the data, which is exemplified for gene LDB16 (YCL005W). Black dots represent the repair data (converted CPD-seq data, see Eq 7), whereas the best fit of each model is given in dashed lines. (C) We applied the mean-squared error (MSE, S7 Appendix Eq 1) to compare the performance of the models with respect to the data. The KJMA model and the Hill equation perform undoubtedly better than simpler models like linear or logistic regression. Nevertheless, the Hill equation describes the data slightly yet significantly better. It should be emphasised though that the performance difference is marginal. Width of the shaded areas represents the number of genes that yielded the corresponding error, which is mirrored at the vertical line. The centre horizontal line with the corresponding numbers give the error median. The top and bottom horizontal lines show maximum and minimum, respectively. (D) Despite the fact that non-TCR regions are not expected to show an observable impact by TCR, the Hill equation indicates two mechanisms that act at different time points. Dashed lines give the mean whereas the shaded areas show the standard deviation. Together with the fact that there is no straightforward interpretation of the Hill equation in context of repair evolution, we conclude that Eq 3 is a sensible choice.(TIF)Click here for additional data file.
Example of the KJMA model.
The KJMA model includes two governing parameters (as the original model does not involve θ), which are exemplified in (A) for m and (B) for τ. Eq 3 can be conviniently converted to a linear regression problem which is shown for the parameter settings of (A) in (C) and for the parameters of (B) in (D).(TIF)Click here for additional data file.
Example of sequencing data representing transcription rate.
The example of the NET-seq signal in comparison to the Pol2 ChIP-seq data probed by [41] shows that Pol2 exhibits a constant augmentation of the signal amplitude at transcribed regions, whereas NET-seq data decrease as a function of distance from the TSS. The Pol2 data is coloured in green, whereas NET-seq is given in blue (light blue represents the Watson, and dark blue is the Crick strand). The example is given for chromosome II around CHS2 and CHS3.(TIF)Click here for additional data file.
Correlation between size and transcription rate.
The plots show the two-dimensional histogram distribution of TU length and different measurements of transcription rate. The number of genes per bin is given through the colour intensities and white numbers. The DC (S4 Appendix Eq 3) per measurement is given in the title. Size and transcription rate data was divided into 50 equally sized bins, which is given by the x and y axis. We use the 95th percentile to remove strong outliers. (A) The histogram distribution of NET-seq transcription with respect to size reveals that smaller genes tend to have higher transcription rates than larger genes. (B) This link is weakened when considering Pol2 ChIP-seq data.(TIF)Click here for additional data file.
Correlation between NET-seq and Pol2 ChIP-seq data.
The plot shows the two-dimensional histogram distribution of NET-seq and Pol2 data. The number of genes per bin is given through the colour intensities and white numbers. Size and transcription rate data was divided into 50 equally sized bins, which is given by the x and y axis. NET-seq data and Pol2 ChIP-seq signal are strongly related (DC = 0.75, S4 Appendix Eq 3). As we consider only two groups of genes with respect to transcription, i.e. genes with a low (blue) or high transcription rate (red), we can confirm that the majority of regions fall into the same category.(TIF)Click here for additional data file.
Model predictions with respect to XR-seq data in the gene setup.
(A) The model predictions in the gene configuration are less correlated with the XR-seq data than in the TCR setup (DC = 0.241, S4 Appendix Eq 3). (B) When correlating the relative repair rate and the XR-seq data in the gene setup, the DC is as low as for the model predictions (DC = 0.231). Therefore, we assume that the weak linkage is due to the data segmentation.(TIF)Click here for additional data file.25 May 2022Dear Soutourina,Thank you very much for submitting your manuscript "A Quantitative Modelling Approach for DNA Repair on a Population Scale" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.As you will see, both reviewers have considerable criticism regarding the application of the KJMA model to DNA repair. While the first reviewer believes that this application is flawed, the second reviewer also has significant concerns. Hence, why the assumptions of the KJMA model are fullfilled in this case, and why it can be applied to this biological question should be justufied more thoroughly. Reviewer 2 also recommends putting less emphasis on the KJMA model and checking if an alternative model yielding a S-shaped response might also be considered. Hence we urge you to seriously consider these concerns and see if and how you can address them. Please also consider the other comments, for example regarding the small number of timepoints in your application.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Carl Herrmann, Ph.D.Associate EditorPLOS Computational BiologyIlya IoshikhesDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: I wonder if there is a flaw in the analogy between the physical role of the KJMA model and how it is being superimposed on the biological pattern of repaired sites. I find the argument that underlies figure 1 to be problematic: the assumption that the repair is independent between cells and between sites within cells, which is required to gather the repaired sites into the contiguous block 'pattern', needs much more discussion and evidence. But even then, it is apparent that the spatial metric properties of the equation in the physical application of this equation means that in the domain for which it was derived, the KJMA equation makes a strong use of the fact that one can't freely permute the spatial coordinates arbitrarily. The paper as it stands doesn't really resolve this conflict - it's possible that a major re-phrasing of the paper so that the KJMA model is entirely motivated by the biological setting would resolve this (it's unnecessary to mention the physicists' use of the equation in anything but the most fleeting way, if there is an inherently biological motivation for it), but the situation where phase regions can only grow on their boundary, for instance, doesn't have a clear translation to me when phrasing in terms of repair sites: after the 'pattern' is formed as in figure 1, why are some sites 'special' (appearing on the boundary) whilst others aren't.It is entirely possible that the KJMA model merely describes a stochastic point-process of repair, which seems a much simpler and more intuitive explanation, and that the time-profile of fraction of repaired sites is purely an emergent phenomenon: early on the proportion of repaired is small because of the lack of accumulated time in which to undergo repair relative to the number of damaged sites, and that as time progresses the accumulation increases but then starts to tail off as the number of available unrepaired damage sites becomes sparser. There is nothing inherently two-dimensional about this point-process approach, and if it arrives at a similar or identical functional form for the time dependence of the repair rate, then the rather forced analogy could be removed.The use of KNN in the correlation of parameter estimates to biological features seems overly and unncessarily complex - a data-dependent binary split of the data (it's not clear if this is done on the parameter estimates or the biological features) and then doing a t-test seems prone to measurement error bias - a non-parametric test of association between the two continuous variables would seem more straightforward, or a spline regression if a non-linear association was suspected.I found the impact of the concerns about the applicability of the KJMA model overshadowed my reading of the rest of the paper. The testing of association between biological features and repair rates would have been more convincing if it had been done directly (perhaps empirically, even a seriation or simple PCA decomposition of the signal clustering being graphically annotated with the biological feature) in addition to going via the correlation with KJMA model parameters.Thank you for supplying the github code - I would recommend a DOI service so that a more permanent record can be associated with the code.Reviewer #2: The authors consider a quantitative treatment of the dynamics of DNA repair via an application of the KJMA model. Parameters of the model are tested for correlation with various biological quantities such as transcription, finding a novel connection between repair dynamics and gene size.Some aspects of the approach and results are, to the best of my knowledge, novel. Moreover, I believe that the findings presented will be of interest to the field. However, I do have some concerns regarding interpretation and presentation of the work.Major Concerns:1. Connection to the KJMA modelI actually find the mapping of DNA repair to the KJMA model a bit of a stretch. The authors assume (reasonably) independence of repair within and between cells. They also claim that this allows DNA repair to be modelled using the KJMA model as new regions of repair can be rearranged to give a growing pattern on a two-dimensional grid. However, it seems that independence really means that DNA repair should be modelled as independent nucleation events that do not grow and much of the connection to KJMA model would seem to be lost. The authors should either strengthen the justification for the use of the KJMA model or remove some of the emphasis on its use from the manuscript. If the assumption is that nucleation is time dependent such that an S-shaped curve is expected for the progression of repair, then, in the absence of any further information, any simple model that produces an S-shaped response should be applicable. It might be reasonable to suggest employing the KJMA model as it is well known and produces an appropriately shaped response curve, or perhaps the authors can show that the KJMA model is capable of treating the case of time-dependent nucleation without growth.2. Small number of time pointsThe authors do acknowledge that an important future step is to apply the modelling approach to data with greater time resolution. Here, they apply a three-parameter model to a time course with only four points. Arguably, they have a sizeable data set in that the data is genome wide, covering many regions of repair which can be partitioned into distinct types. However, this is quite a low number of points for each individual fit. I would recommend that the authors provide some additional analysis on model selection, ideally showing that the S-shape curve provides a significantly better fit to the data than the simplest alternative (which would be a linear model). Alternatively, the correlation results could be compared to a similar approach with a linear model, ideally showing that the approach with an S-shape curve is more powerful in revealing links between the chosen biological quantities and the dynamics of repair.3. Potential problem with the fitting algorithmThe authors claim that for data showing no (or negative) repair levels at 20 minutes obtaining good fits is difficult (Figure S2B and associated discussion). I can estimate what seems to be a superior fit to YAL048C with values of theta = 0.7, m = 6, tau = 45 (all within the allowed ranges). The authors should provide more reasoning as to why the apparently poor fit in Figure S2B was chosen by the algorithm.4. Overall presentationThe manuscript was quite difficult to follow in places, often putting heavy emphasis on minor details and giving limited room to what I considered to be important points. I recommend some streamlining of the manuscript to improve readability.Minor Concerns:5. The authors state that TCR is not expected to function on the non-transcribed strand. Has any potential repair coupled to non-coding antisense transcription been considered here?6. Figure 2 – the top and bottom panels appear to have been switched compared with what is described in the main text and the figure caption.7. Some of the description in the main text is more suited for the methods section and some of the description in the methods section might be more suited for the main text.8. Figure 6 is missing a caption title (all other main-text figures have this).9. Figure 4 – the legend spots were too small for me to easily distinguish between the colours.10. Figure 3 – the colour scheme is confusing: I don’t think blue and orange should be reused for the lines at the bottom as they appear to represent something different there.11. Figure 3 – I think it would be helpful to also describe additional classifications in this schematic, including TS, NTS, beginning, centre and end.12. Figure 5 – the meaning of the lines and shaded areas should be in the figure caption. How the averages were taken should also be described in sufficient detail somewhere in the manuscript.13. Figure 5 (and connected discussion) – Is there some analysis that could be done in order to separate out early- and late- repair subpopulations (assuming that they are indeed what is driving the double-peak behaviour in the derivative).14. References appear not to have been ordered.15. Line 180 – “transcripts that exhibit strong repair …” – should this be transcribed regions as the transcripts themselves are not repaired?16. Many of the supplementary figure captions are missing descriptions (for example, the meaning of the size of the circle points in the plots).**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: Yes: Gavin KellyReviewer #2: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols21 Jun 2022Submitted filename: Zeitler_2022_Response to Reviewers_200622.pdfClick here for additional data file.27 Jul 2022Dear Soutourina,Thank you very much for submitting your manuscript "A Quantitative Modelling Approach for DNA Repair on a Population Scale" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.Please carefully consider the recommendations of the reviewer in your revised manuscript, especially regarding the labelling of some figures.Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Carl Herrmann, Ph.D.Associate EditorPLOS Computational BiologyIlya IoshikhesDeputy EditorPLOS Computational Biology***********************A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:[LINK]Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #2: The authors have undertaken a substantial rewrite of many sections of the main text and additional work to address the reviewers’ concerns. I now find the paper to be much improved, with the main message being much clearer. I also now believe I have a firmer grasp on the authors’ original intent with regards to the abstract connection with the KJMA model.Recasting the direct connection to the biological phenomenon of DNA repair as a Poisson point process as the first reviewer suggested (essentially what I was trying to describe as time-dependent nucleation without spreading, using the terminology of the KJMA model) and then connecting this to the KJMA model makes things much more believable from a biological perspective.I was pleased to see that the authors had done additional analysis to compare the overall quality of fit between a number of simple candidate models to show that their approach is significantly better than the simplest models. When combined with the clearer explanation of the connection to the biology, this sufficiently justifies the use of the KJMA model in my opinion.I now understand why the seemingly poor fit to YAL048C was chosen by the algorithm. I still think there might be issues with fitting equally to a data point which has been set to (or close to) zero due to the measurement being negative (an impossibility in the model). Such points could have larger errors on them than positive measurements. Perhaps a fitting algorithm that takes differences in errors into account would improve matters and allow more genes to be considered for further analysis. I appreciate that this would entail significant extra effort and be beyond the scope of this manuscript, but it is something that could potentially be covered in future work.There are a few minor issues that I think should be addressed before publication.1. I spotted a few possible typos when reading the revised manuscript:a. Line 6: ‘evolutionary’ to ‘evolutionarily’?b. Line 98: ‘lager’ to ‘larger’ ?c. Line 152: ‘region’ to ‘regions’ ?2. S3 Fig: Caption seems to be missing completely.3. S4 Fig/Caption: The examples that are being used should be stated/described.4. S13 Fig Caption:a. For panel B, for absolute clarity, that this is a single-gene example should be stated and the black dots need to be described as the experimental data points.b. For panel C, again for absolute clarity, MSE should be defined and the meaning of the lines/bars and shaded regions in the plots should be described.5. S16 Fig/Caption:a. The kind of correlation calculated should be stated/described somewhereb. For clarity and ease of reading, the meaning of the numbers should be stated/described (also applies to S17 Fig Caption).c. The units of the axes should be described (also applies to S17 Fig)**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #2: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #2: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocolsReferences:Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.9 Aug 2022Submitted filename: Zeitler_ResponseToReviewers_Revision2.pdfClick here for additional data file.12 Aug 2022Dear Soutourina,We are pleased to inform you that your manuscript 'A Quantitative Modelling Approach for DNA Repair on a Population Scale' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Carl Herrmann, Ph.D.Associate EditorPLOS Computational BiologyIlya IoshikhesDeputy EditorPLOS Computational Biology***********************************************************5 Sep 2022PCOMPBIOL-D-22-00499R2A Quantitative Modelling Approach for DNA Repair on a Population ScaleDear Dr Soutourina,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Katalin SzaboPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: Patrick van Eijk; Shuvro Prokash Nandi; Shirong Yu; Mark Bennett; Matthew Leadbitter; Yumin Teng; Simon H Reed Journal: Genome Res Date: 2018-12-14 Impact factor: 9.043