
Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics.

Gennady Gorin^1, Mengyu Wang^2,3, Ido Golding^2,3, Heng Xu^4,5.

Abstract

Recent advances in single-molecule fluorescent imaging have enabled quantitative measurements of transcription at a single gene copy, yet an accurate understanding of transcriptional kinetics is still lacking due to the difficulty of solving detailed biophysical models. Here we introduce a stochastic simulation and statistical inference platform for modeling detailed transcriptional kinetics in prokaryotic systems, which has not been solved analytically. The model includes stochastic two-state gene activation, mRNA synthesis initiation and stepwise elongation, release to the cytoplasm, and stepwise co-transcriptional degradation. Using the Gillespie algorithm, the platform simulates nascent and mature mRNA kinetics of a single gene copy and predicts fluorescent signals measurable by time-lapse single-cell mRNA imaging, for different experimental conditions. To approach the inverse problem of estimating the kinetic parameters of the model from experimental data, we develop a heuristic optimization method based on the genetic algorithm and the empirical distribution of mRNA generated by simulation. As a demonstration, we show that the optimization algorithm can successfully recover the transcriptional kinetics of simulated and experimental gene expression data. The platform is available as a MATLAB software package at https://data.caltech.edu/records/1287.

Year: 2020 PMID: 32214380 PMCID: PMC7098607 DOI: 10.1371/journal.pone.0230736

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Transcription has been the focus of intensive study due to its cornerstone role in cell activity regulation. Recent advances in fluorescent imaging have enabled mRNA detection at single-molecule resolution in individual cells, in both live and fixed samples [1,2]. Spatial analysis of mRNA signals allows the identification [3,4] and quantification [5] of nascent (actively transcribed) mRNA, which offers a direct window into the kinetics of gene transcription, with minimal interference from downstream effects [5], at the level of a single gene copy [6]. Converting high-resolution experimental data into theoretical understanding of transcription requires simultaneous modeling of both nascent and mature species of mRNA. Particularly, since at any given moment an mRNA molecule may be in a partially transcribed and/or degraded state, a good model should be able to capture the submolecular features of mRNA. However, current computational models of transcription present challenges for integration with the new wealth of microscopy data. Most models do not distinguish between nascent and mature mRNA or model the transcript length [7-11]. As recently noted [5], several mechanistic models do describe the elongation of nascent mRNA, but do not consider the mature mRNA population and require additional processing for comparison to microscopy data [4,12-14]. Further, studies using these models tend to predict low-order statistics [7,13], which paint a limited picture at biologically low molecule numbers [4,15]. Recent methods based on directly solving the chemical master equation (CME), using the finite state projection (FSP) algorithm, yield distributions of the number of molecules [5,15,16]. However, integrating the discrete CME with submolecular features of mRNA is nontrivial, and has only recently been accomplished on a model with a deterministic elongation process [5]. 
A stochastic stepwise model of transcription, more faithful to the mechanistic details, is not currently tractable using FSP [5] due to exponential growth in the size of the state space with increasing resolution. Here we present a stochastic simulation platform that aims to capture the complexities of RNA processing. The platform consists of a submolecular implementation of the Gillespie algorithm [17], simulating the gene switching, transcription, and degradation expected in a prokaryotic system. Transcription and degradation occur in a stochastic fashion, where the initiation and individual steps of elongation are Poisson processes. The algorithm outputs time-dependent fluorescent probe signals, calculated from the overlap of intact RNA and probe-covered regions. The probe signals are provided as cell-specific readouts and as aggregated histograms, mimicking live-cell (MS2) and fixed-cell (smFISH) fluorescence data, respectively [1,2]. Using a GUI, a user can input simulation parameters and examine time-dependent statistics, as well as animate the instantaneous molecule states. We use the platform to approach the inverse problem of biological parameter estimation. A recent investigation demonstrated that entire distributions are required to reliably estimate parameter values from single-cell mRNA data [15]. To perform parameter estimation based on these empirical distributions, we implement a heuristic approach based on iteratively minimizing mean squared errors and Wasserstein distances of different observables [18]. This approach represents a novel method of estimating plausible regions for multiple parameters using time-series data with multiple observables, without making assumptions regarding the functional form of the distributions. Thus, the platform provides a flexible simulation environment to implement reaction mechanisms as well as a search algorithm designed to directly test those mechanisms’ parameters against experimental data. 
The GUI and search algorithm are available at https://data.caltech.edu/records/1287.

Results

Model and simulation platform

Our platform models a common formalism for the mRNA transcription process [5,7], with a series of stochastic reactions, including promoter turn-on and turn-off, transcription initiation, elongation, RNase (ribonuclease) binding, and degradation (S1 Table of S1 File). Specifically, promoter activity is represented as a two-state switch. In the active (“on”) state, transcription can be initiated. The nascent mRNA strand elongates from the 5′ to the 3′ end in a series of discrete steps. Upon reaching the end of the template gene, the mature mRNA molecule is released from the gene. Regardless of RNA maturity, RNase can bind to the 5′ end of the mRNA, causing the strand to begin stepwise degradation at an average rate assumed to be identical to the elongation speed [19]. The process is depicted in Fig 1A. The physiology of the transcribed gene is parametrized by the turn-on rate k_on, the turn-off rate k_off, the transcription initiation rate k_ini, the degradation initiation rate k_deg, the elongation speed v, and the gene length L. The experimental parameters include the timespan of the experiment T, as well as the probe span vector (P5, P3), which defines the 5′ and 3′ limits of probe coverage with respect to the length of the gene [5], as shown in Fig 1A.
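The reaction scheme above can be illustrated with a minimal Gillespie-style sketch in Python. This is not the authors' MATLAB implementation: the function name `simulate_cell`, the state layout, and the conversion of the elongation speed into a per-step rate (`v * n_steps / L`) are assumptions made for illustration, and the actual propensities are those given in S1 File. All rates are taken in min^-1 and v in nt per minute.

```python
import random

def simulate_cell(k_on, k_off, k_ini, k_deg, v, L, n_steps, T, seed=None):
    """Simulate one gene copy from an 'off' gene with no mRNA up to time T.

    Assumes k_on > 0 and k_off > 0, so the total propensity is never zero.
    """
    rng = random.Random(seed)
    step_rate = v * n_steps / L   # assumed rate of one elongation/degradation step
    t, gene_on = 0.0, False
    rnas = []                     # each entry: [five, three, nascent, degrading]
    while True:
        elongating = [r for r in rnas if r[2]]
        bindable = [r for r in rnas if not r[3]]
        # RNase cannot move past the 3' end of the transcript
        degrading = [r for r in rnas if r[3] and r[0] < r[1]]
        props = [
            0.0 if gene_on else k_on,       # promoter turn-on
            k_off if gene_on else 0.0,      # promoter turn-off
            k_ini if gene_on else 0.0,      # transcription initiation
            step_rate * len(elongating),    # one elongation step
            k_deg * len(bindable),          # RNase binding at the 5' end
            step_rate * len(degrading),     # one degradation step
        ]
        t += rng.expovariate(sum(props))    # exponential waiting time
        if t > T:
            return gene_on, rnas
        event = rng.choices(range(6), weights=props)[0]
        if event == 0:
            gene_on = True
        elif event == 1:
            gene_on = False
        elif event == 2:
            rnas.append([0, 0, True, False])
        elif event == 3:
            r = rng.choice(elongating)
            r[1] += 1
            if r[1] == n_steps:             # transcript completed and released
                r[2] = False
        elif event == 4:
            rng.choice(bindable)[3] = True
        else:
            r = rng.choice(degrading)
            r[0] += 1
            if r[0] == n_steps:             # fully degraded: remove the molecule
                rnas = [x for x in rnas if x is not r]

# Illustrative parameter values only (not taken from the figures).
gene_on, rnas = simulate_cell(k_on=3, k_off=10, k_ini=100, k_deg=0.5,
                              v=41.5 * 60, L=5300, n_steps=15, T=10, seed=1)
```

Each molecule is tracked by the step indices of its 5′ and 3′ ends plus two flags, so nascent, mature, and partially degraded transcripts coexist in the same list.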

Fig 1

Model and simulation platform.


A: Model schematic and probe parameterization (gold: probe coverage, P3: 3′-most edge of the probe, P5: 5′-most edge of the probe). B: Time-dependent molecule-level visualizations available through the GUI. Trajectory generated using k = 100 min⁻¹, k = 3 min⁻¹, k = 10 min⁻¹, k = 0.5 min⁻¹, v = 41.5 nt s⁻¹, T = 10 min, L = 5300 nt, 241 steps of elongation to complete transcription (dark line: intact RNA stretches, light line: degraded RNA stretches, pink circle: RNase molecule). C: Single-cell trajectory with simulated nascent and mature fluorescent signals. Parameters same as in B (red: total signal, blue: nascent signal, green: mature signal, shaded regions: times displayed in B).

The platform performs stochastic simulation of the model using the Gillespie algorithm [17,20], then estimates the fluorescence of each mRNA molecule from the size of its region targeted by fluorescent probes. Specifically, we simulate the production and degradation of each mRNA molecule in the cell, whose status can be defined by four variables, i.e. two integers that define the 5′- and 3′-most nucleotides of the transcript and two Boolean variables that define whether the mRNA is polymerase-bound (nascent) and/or RNase-bound (degrading). The gene state (on or off) is defined by a single Boolean variable. To convert the simulated mRNA molecule ensemble (Fig 1B) to the experimentally observed fluorescent signal, we calculate the overlap between the intact RNA and the probe coverage (single realization shown in Fig 1C); the probe readout is rescaled to molecule number using the fluorescence of a single intact molecule [16]. The resolution of the simulation is determined by the number of cells and the number of steps taken to fully elongate or degrade each molecule. Model simulation is implemented in MATLAB 2018a [21].
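The overlap calculation described above can be sketched as follows. The helper names `probe_signal` and `cell_signals` are hypothetical, positions are taken in nucleotides, and the normalization by the full probe span (so that one intact molecule contributes a signal of 1) is an assumption consistent with the rescaling to molecule number.

```python
def probe_signal(five, three, p5, p3):
    """Fraction of the probe region [p5, p3] covered by the intact span [five, three]."""
    overlap = min(three, p3) - max(five, p5)
    return max(overlap, 0) / (p3 - p5)

def cell_signals(rnas, p5, p3):
    """Sum per-molecule signals into (nascent, mature, total) for one cell.

    Each molecule is a tuple (five, three, nascent), with positions in nt.
    """
    nascent = sum(probe_signal(f, t, p5, p3) for f, t, is_nascent in rnas if is_nascent)
    mature = sum(probe_signal(f, t, p5, p3) for f, t, is_nascent in rnas if not is_nascent)
    return nascent, mature, nascent + mature

# One half-degraded mature transcript and one short nascent transcript.
example = [(2000, 5300, False), (0, 500, True)]
signals = cell_signals(example, p5=1000, p3=3000)
```

In this example the mature molecule retains half of the probe region and the nascent molecule has not yet reached it, so only the mature signal is nonzero.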
A simple graphical user interface (GUI), provided as a MATLAB app at https://data.caltech.edu/records/1287, runs the simulation for a user-defined parameter set defining the physical parameters and simulation precision. Upon completion, the GUI outputs the time-dependent mean probe signal (in units of molecule number), Fano factor, and instantaneous nascent and total mRNA probe signal histograms, all calculated over the cell population. The mRNA nucleotide spans are used to visualize and animate the transcriptional activity taking place at an individual gene copy (analogous to Fig 1B and 1C; example visualization given in S1 Movie). Our software allows direct simulation of complex experimental designs. For instance, to mimic the commonly-used induction experiment (e.g. the addition of isopropyl β-d-1-thiogalactopyranoside, an inducer of the lac promoter, to E. coli cells [6]), the simulation starts with no mRNA and undergoes a step increase in the gene turn-on rate. Similarly, to mimic a repression experiment (e.g. the addition of 2-nitrophenyl-β-d-fucoside to E. coli), the system starts with a steady-state population of mRNA and undergoes a step decrease in gene turn-on rate [22]. For physiologically plausible transcription in short, infrequent bursts [23], the decrease in k can also model repression by a step decrease in initiation [6] caused by the addition of rifampicin [24].
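As a small illustration of the population statistics reported by the GUI, the sketch below computes the mean probe signal and Fano factor (variance divided by mean) over a list of per-cell signals at one time point. The function name is hypothetical, and the use of the population variance rather than the sample variance is an assumption.

```python
from statistics import fmean, pvariance

def mean_and_fano(signals):
    """Return (mean, Fano factor) of per-cell probe signals at one time point."""
    mean = fmean(signals)
    if mean == 0:
        return mean, float("nan")   # Fano factor is undefined for an empty signal
    return mean, pvariance(signals, mean) / mean

mean, fano = mean_and_fano([0, 1, 2, 3, 4])
```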

Parameter estimation

Given single-cell time-series fluorescence data that describes nascent and mature mRNA, we seek to estimate the underlying model parameters. We approach this inverse problem by simulating mRNA number distributions for the experimentally available timepoints, evaluating an error metric that maps the divergence between the target distribution and each trial distribution to a single number, then minimizing this error by using it as an objective function. Since metrics based on noisy empirical stochastic distributions do not meet the smoothness assumptions of gradient-based optimization methods [25], we select a genetic algorithm for optimization. We use the MATLAB implementation of the genetic algorithm [21,26] to sample and evolve points in a parameter space spanning several orders of magnitude for each variable. Consistent with previous investigations, we use a logarithmic parameter search space [15]. Each trial parameter vector {k_on, k_off, k_ini, k_deg, v} is evaluated using an ensemble of hundreds to thousands of simulated cells. Due to the high computational load (millions of cell trajectories) of a single search, we vectorize the computation and parallelize it across processors on the Amazon Web Services (AWS) cloud [27]. Since cells are independent, the algorithm scales well by parallelization across multiple processors. At the end of the simulation, the parallelized cell ensemble is reassembled into a single population and the statistics defining the error are computed locally, as shown in Fig 2A. To speed up convergence to consistent parameter sets, our heuristic method uses a variable objective function, with five distinct stages that use different error metrics. Details of the metrics are provided in Methods.
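To make the idea of a gradient-free search over a logarithmic parameter space concrete, here is a minimal sketch in Python. It is a simple truncation-selection evolutionary loop, not MATLAB's genetic algorithm implementation, and the function name, population size, mutation width, and stand-in objective are all assumptions. In practice the objective would run the stochastic simulation for each trial vector and return an error metric.

```python
import math
import random

def log_space_search(objective, bounds, pop_size=200, keep=0.1,
                     generations=30, sigma=0.2, seed=0):
    """Evolve a population of log10-parameter vectors toward low objective values.

    bounds: list of (low, high) limits in log10 units, one pair per parameter.
    Returns the final population sorted from best to worst.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    n_elite = max(1, int(keep * pop_size))
    for _ in range(generations):
        pop.sort(key=objective)
        elite = pop[:n_elite]               # keep the best-scoring fraction
        children = []
        while len(elite) + len(children) < pop_size:
            parent = rng.choice(elite)
            # Gaussian mutation in log space, clipped to the search bounds
            children.append([min(hi, max(lo, x + rng.gauss(0.0, sigma)))
                             for x, (lo, hi) in zip(parent, bounds)])
        pop = elite + children
    pop.sort(key=objective)
    return pop

# Stand-in objective: squared distance to a known target in log10 space.
target = [math.log10(3.0), math.log10(10.0)]
objective = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
best = log_space_search(objective, bounds=[(-2.0, 3.0), (-2.0, 3.0)])[0]
```

Sampling uniformly in log10 units gives each order of magnitude equal weight, which is the practical reason for searching in a logarithmic space when rates may span several decades.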

Fig 2

Parameter estimation process and performance.


A: Parallelized calculation of the search objective function for a set of trial parameters (ΔMean: mean squared error, ΔCDF: Wasserstein distance, Objective: error function value). B: Convergence of the genetic algorithm at the end of each stage of the search (red: ground truth target, gray: population of parameter estimates). C: Final trial parameter population from B (red: ground truth target, histogram: estimate population, gray line: mean estimate, gray region: one-sigma region of estimates). D: Evolution of parameter estimates throughout the search process (red: ground truth target, gray line: mean estimate, gray region: one-sigma region of estimates). E: Comparison of mean probe signal between target and fit (circles: target data, dotted line: mean parameter estimate, shaded region around dotted line: signal spanned by fifty estimates sampled from the one-sigma region). Colors as in Fig 1. F: Comparison of copy-number distributions between target and fit (shaded gray regions: target histogram, colored lines: histogram generated from mean parameter estimate, top row/blue: nascent mRNA distribution, bottom row/red: total mRNA distribution). G: Comparison of mean probe signal between target and fit in turn-off cross-validation experiment. Convention as given for E. H: Estimation of modulated parameters. Top trial modulates k, bottom trial modulates k. All other parameters are constant but unknown to the search algorithm and are fit independently (red: ground truth target, gray dots and error bars: mean estimate and one-sigma region of three replicates).

To test the algorithm’s ability to recover known parameters, we generated synthetic data for the turn-on experiment using the following ground truth parameters: k = 95 min⁻¹, k = 1 min⁻¹, k = 10 min⁻¹, k = 0.5 min⁻¹, v = 41.5 nt s⁻¹, T = 15 min, L = 5300 nt, 10,000 cells, and 15 steps of elongation to complete transcription.
The procedure used to convert these rates into reaction propensities is described in the S1 File. Relatively coarse simulation quality was used as a proof of concept. The simulations were parallelized across 90 AWS processors. The process of parameter identification is visualized in Fig 2B. We found that the one-sigma interval around the mean estimate included the ground truth parameters (Fig 2C). The convergence of k, k, and v throughout the search is relatively well-behaved and close to monotonic; however, k and k are far more challenging to estimate (Fig 2D). We compare the mean signals of nascent and total RNA simulated using the one-sigma estimate interval (Fig 2E), as well as the corresponding distributions simulated using the mean estimate (Fig 2F), to the synthetic ground truth data. Comparison at both levels demonstrates convergence. To cross-validate the search, we compare repression simulations generated from the ground truth and estimated parameters. The nascent and total means are consistent (Fig 2G). To test the robustness of the fitting algorithm, we apply the search procedure to the turn-on data generated using a range of k and k values, mimicking the regulatory parameter modulation hypothesized to occur in vivo [28]. The results suggest consistent performance throughout the parameter space, although identifiability of high k is poor (Fig 2H). Encouragingly, all one-sigma intervals include the ground truth parameters. For additional validation, we ran the search algorithm using synthetic data generated from random parameter vectors, as well as experimental data from a recent study [6]. These procedures are described in the Further Validation section of S1 File. We found that the fits successfully reproduced time-dependent distributions of probe signals. However, agreement between the inferred parameters and ground truth (or, for experimental data, FSP estimates) was not guaranteed, especially for k and k. 
As in Fig 2, these gaps in performance appear to correspond to non-uniqueness in mapping from the parameter domain to the observable domain [29], and inability of the genetic algorithm to report degenerate results. We suggest that this degeneracy is best identified by running the search algorithm multiple times and examining the resulting distribution of point estimates from the centers of the search populations. We take this approach in Fig 2H.
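The replicate-based check for degeneracy can be sketched as below: point estimates of one parameter from several independent searches are summarized by their center and one-sigma interval. Computing these statistics in log10 space is an assumption made here to match the logarithmic search space, and the function name is hypothetical.

```python
import math
from statistics import fmean, stdev

def summarize_replicates(estimates):
    """Return (center, (low, high)) for replicate estimates of one parameter.

    Statistics are computed on log10 values and converted back to linear units.
    """
    logs = [math.log10(x) for x in estimates]
    mu, sd = fmean(logs), stdev(logs)
    return 10 ** mu, (10 ** (mu - sd), 10 ** (mu + sd))

# Three replicate searches that disagree by orders of magnitude for one rate.
center, interval = summarize_replicates([1.0, 10.0, 100.0])
```

A wide interval relative to the search bounds, as in this example, would point to a parameter that the observables constrain poorly.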

Methods

The Gillespie algorithm is adapted from the original description [17] and implemented in the MATLAB programming language [21]. To account for submolecular degrees of freedom, the simulation uses multiple data structures to describe the system state. Specifically, one multidimensional dynamic array holds the 5′ and 3′ indices of each mRNA (transcript span), another identifies whether it is being transcribed at a particular gene locus or free in the cytoplasm (RNA polymerase attachment), and a third tracks whether it is being degraded (RNase attachment). Smaller, static arrays track the system time, gene state, and number of mRNA and bound RNase molecules. Each reaction either increments or flips Boolean values in the appropriate state arrays. State variables and reactions are outlined in detail in S1 File; the reaction propensity calculations are given in S1 Table of S1 File.

To perform parameter estimation on turn-on synthetic data, we use a heuristic iterative method based on the genetic algorithm [30]. We alternate between optimizing mean signals and entire distributions. The error metric for the mean signal is the mean squared error. Due to the limited support of empirical distributions, the commonplace minimization of Kullback–Leibler divergence between target and test distributions [31] is inappropriate for comparing distributions [25]. Instead, we use the absolute difference between the target and test cumulative distribution functions (CDFs), which tends to be more robust to noise and sparsity [25]; this metric is commonly known as the Wasserstein or earth mover’s distance [18]. We aggregate the Wasserstein distances of different time points by weighting them with a uniform or exponential function of time, as described in S1 File. Empirically, the parameter identifiability is far from uniform throughout the simulated time-series, and different metrics provide sensitivity to different parameters.
Further, it is computationally prohibitive to simulate entire trajectories at the beginning of the parameter search, when the relevant region of the five-dimensional search space is not yet known. Therefore, we take an ad hoc iterative approach, which incrementally narrows the region of parameters consistent with the observed signals. This heuristic approach is chosen for computational convenience and is not guaranteed to find the global parameter optimum. The parameter domain is shown in Fig 2C. We initialize the search using a uniform distribution over the full parameter domain. The first stage identifies the parameter space consistent with the distributions of nascent signals observed throughout the first few time points of the experiment, essentially acting as an order-of-magnitude filter and eliminating computationally expensive edge regions with extremely high or low transcription. This stage uses a population of 5,000 parameter sets and keeps only the top 10% of variants; based on Fig 2B, it identifies k and a degenerate line containing consistent values of k and . The second stage attempts to truncate this space to parameter values consistent with the mean level of total RNA for the entire time series, and identifies tighter bounds for k and . This and all following stages use populations of 500 trial parameters. The third stage refines the estimate to parameter values consistent with the steady-state distribution of total mRNA, and yields tighter bounds for k and k. The fourth stage uses information from the mean level of nascent RNA for the entire time series, and improves bounds for v. Finally, the fifth stage refines the bounds for k and k by performing a high-precision optimization using the metric used in stage 1. Consistency between different error metrics is enforced by penalizing the objective function for deviating beyond a given radius from the previous stage’s parameter region, as described in S1 File.
More detailed data regarding each stage’s penalization and precision are provided in S2 Table of S1 File.
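The error metrics and the stage-to-stage penalty described in this section can be sketched in Python as follows. The CDF-based distance and the mean squared error follow the definitions in the text; the normalized weighted sum over time points and the linear form of the penalty are assumptions, since the exact expressions are given only in S1 File, and all function names are hypothetical.

```python
import math
from itertools import accumulate

def mse(target_means, trial_means):
    """Mean squared error between two mean-signal time series."""
    return sum((a - b) ** 2 for a, b in zip(target_means, trial_means)) / len(target_means)

def wasserstein_1d(counts_a, counts_b):
    """Sum of |CDF_a - CDF_b| for two histograms on the same unit-width bins."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    cdf_a = accumulate(c / total_a for c in counts_a)
    cdf_b = accumulate(c / total_b for c in counts_b)
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))

def weighted_distance(distances, weights=None):
    """Combine per-time-point distances; uniform weights unless given (assumed form)."""
    if weights is None:
        weights = [1.0] * len(distances)
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)

def penalized(objective, center, radius, weight=10.0):
    """Wrap an objective so that points farther than `radius` (in log space)
    from the previous stage's region center incur an extra cost (assumed form)."""
    def wrapped(log_params):
        excess = max(0.0, math.dist(log_params, center) - radius)
        return objective(log_params) + weight * excess
    return wrapped

# All mass at 0 copies versus all mass at 2 copies: the distance is 2 bins.
d = wasserstein_1d([1, 0, 0], [0, 0, 1])
```

Unlike the Kullback–Leibler divergence, this distance stays finite when one histogram has empty bins where the other does not, which is why it suits sparse empirical distributions.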

Discussion

Above we describe a new platform for simulating mRNA transcription and degradation on a submolecular level, available at https://data.caltech.edu/records/1287. Its output is directly comparable to single-cell data of nascent and mature mRNA. The output of each simulation is the empirical distribution of signals for each cell at each time point. Therefore, the platform can simulate both live-cell measurements (which identify cell-specific signals over time) and fixed-cell measurements (which yield population statistics) [1,2]. As the platform is based on the stochastic simulation algorithm, it is relatively straightforward to modify the model to incorporate new reactions, chemical species, regulatory pathways, and labeling schema. The software includes single-cell and statistical visualization tools to facilitate general-purpose use without coding. For resource-intensive parameter space exploration, we suggest heuristics to accelerate convergence. The method demonstrates that parameter estimation from a time series of multiple observables is tractable by heuristic likelihood-free methods. The validation we perform suggests that, by using simulations to generate empirical distributions, this approach is more effective at fitting experimental signals than traditional methods when no closed-form solutions or approximations are available; further, the visualization capabilities would be useful for the qualitative description and understanding of such complex systems. Our platform allows numerical solution of a detailed transcription model for both nascent and mature mRNA species, whose CME may not be solved exactly. However, since the approach is simulation-based, the steady state of the system needs to be computed asymptotically from a non-steady state, which may be time-consuming. Specifically, simulating and fitting the steady-state and turn-off experiments may be computationally prohibitive if the scales of kinetic rates are substantially different.
Alternatively, it may be possible to use analytical solutions [32,33] to approximate an equilibrium distribution; however, this approach is challenging to generalize and the resulting simulation would no longer be exact. The parameter identification process may be facilitated by parameter constraints from analytical solutions. For example, if the steady-state solution for the total mean is known, the and k parameters can be fixed for the optimization procedure, reducing the parameter estimation to the simpler problem of optimization in three-dimensional space of k, k, and v, as shown in S2 Movie. On the other hand, we suggest that five-parameter inference entirely from moments is infeasible at this time. Typically, fitting n parameters requires n moments. For the current system, signal expectations can be computed [6], but expressions for the higher moments are unknown. Even if they were available, the choice of error model for these higher moments is far from clear, especially in the physiologically important regime of low copy numbers. Furthermore, we anticipate that the value of this heuristic method rests in applications to models with ad hoc mechanisms whose physics are challenging to approach analytically. Even without moment-based analytical constraints, it is possible to use physical considerations to guide the development of optimization metrics. For example, in a Bayesian framework, the Fisher information of the mean total probe signal is high with respect to k, but low with respect to v. As shown in Fig 2B, stage 2, which optimizes the total mean probe signal, provides a tight bound on k but not v; conversely, stage 4, which optimizes the mean nascent probe signal, yields a tight bound on v. For more complex models, exploratory analysis is necessary to determine the coupling between observables and parameters, but the provided heuristics and physical expectations provide a starting point. 
The parameter estimation procedure only uses time-dependent histograms: the platform can generate live- and fixed-cell data, but only attempts to fit fixed-cell data. These biochemical distinctions induce methodological differences for parameter inference. Fixed-cell measurements are necessarily destructive, and kinetics may only be inferred from distribution-level data. In contrast, live-cell signals contain additional information regarding the temporal correlation of a given cell. In the current study, we focus on fitting distribution data for two reasons. Firstly, inference from ensembles can be directly implemented using a variety of divergence metrics that make minimal assumptions regarding the form of the data [25]. On the other hand, inference from time-series requires error models for transitions between observed states, which are generally intractable [34]. Secondly, fixed-cell measurements are amenable to high-throughput experiments, can be scaled to the entire transcriptome via multiplexing [35], produce better signal/noise behavior, and do not require genetic modification [36], contributing to their greater popularity [36]. Therefore, we have optimized the parameter estimation method for the most likely current use case of inference from fixed-cell experiments. Recent advances in live-cell labeling techniques do suggest that the method may become more practical and popular in the future [37,38]. To anticipate this, we propose several approaches to live-cell data inference, motivated by previous efforts. If the dataset is large enough, the fixed-cell procedure may be sufficient, discarding the temporal correlation information altogether [25]. Alternatively, it is possible to iterate through the data points of a time-series, generating an ensemble of transitions, estimating the likelihood of the observed transition based on a kernel, and optimizing the likelihood by varying model parameters. 
This approach has been useful for relatively small datasets [34,39,40]. However, its application to multimodal time-series is potentially problematic due to the assumption of smoothness, the complexity of developing robust adaptive kernels, and the well-documented problems accompanying kernel density estimation of multivariate data [41]. Further, it presents computational challenges: the different increments are ostensibly independent due to the Markov property, but the non-unique mapping from the underlying Markov states to the observed probe data prevents the independent initialization of each increment. This feature makes it infeasible to parallelize the estimation of transition probabilities over non-overlapping increments. Several recent publications perform likelihood-based inference on hidden Markov models [37,38]. However, rigorously recasting these methods into the context of likelihood-free simulation is challenging, as is their extension to multimodal data. We suggest that the algorithm described in the Methods section can be extended to treat time-series data. Such an algorithm may iterate over a single time-series to incrementally shrink to a consistent parameter region. The selection of the region is based on a non-parametric error metric between the target fluorescence and the ensemble distribution for each trial parameter at the end of each interval. Conceptually, this process iteratively identifies parameter values by optimizing for observed transitions, analogously to previous work [40]. Afterward, independent searches over multiple traces may be aggregated to find a single plausible region. Given the computational expense of current HMM-based methods [38], an adaptive simulation-based approach may present a viable alternative. 
Our platform models the activity of individual gene loci in non-compartmentalized prokaryotic cells with the assumption that transcription follows a two-state random telegraph model with time-homogeneous rate parameters, and elongation and degradation are described by multistep Poisson processes. These assumptions may be violated in the following ways: The description of a eukaryotic system may be of interest. The implementation of eukaryotic transcription would require making significant changes to the reaction schema, such as disabling the degradation of nuclear mRNA and adding a kinetic model of a transport process after the release of the newly transcribed mRNA. Multiple gene copies may be present in a cell [6]. It is straightforward to extend the current model to account for this physiology. For example, S3 Movie shows the correlated dynamics at two gene copies, which may only turn on when an underlying Boolean cell state is on. The two-state switching of gene activation/inactivation may be an over-simplified picture of gene activity. In reality, an N-state model may be more accurate [15,42,43]. To consider this effect, our simulation-based framework can be easily extended to include more gene states rather than a single Boolean state. The transcription elongation rate may not be constant, whether due to sequence dependence [44] or polymerase congestion [13,14,45]. The implementation of these rules is challenging using the CME framework. Our simulation-based platform can incorporate sequence-dependent rates by adjusting rates based on the current 3′ nucleotide position, and congestion by testing for collisions between polymerases based on a pre-set exclusion radius. An example of a simulation with hard-sphere exclusion is shown in S4 Movie. 
RNA degradation may in reality be more complex than modeled here, with ribonuclease fluctuations [46], multi-step degradation [47], sequence-dependent degradation [48,49], and transcription-coupled degradation [50] potentially yielding deviations from simple Poisson process degradation. Our simulation-based platform can address these effects analogously to elongation. Moreover, transcription is, in general, non-stationary due to cell cycle effects [6,16]. Hence, synchronization of data from different cells is important for accurate inference. This may be achieved experimentally by monitoring cues of mitotic state, such as DNA signal or cell shape [6,16].

S1 File. Details of the implementation of the algorithm, description of the graphical user interface, and the results of further validation of the search procedure.

(DOCX)

S1 Movie. Visualization of transcription dynamics at a single gene copy.

(MP4)

S2 Movie. Multi-stage genetic algorithm search over a three-dimensional parameter space.

(MP4)

S3 Movie. Visualization of transcription dynamics at two correlated gene copies.

(MP4)

S4 Movie. Visualization of transcription dynamics at a single gene copy with hard-sphere exclusion.

(MP4)

31 Dec 2019

PONE-D-19-31981

Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics

PLOS ONE

Dear Dr. Xu,

Thank you for submitting your manuscript to PLOS ONE. The paper was sent to two reviewers, who both appreciate the work but raised minor points that I would ask you to address prior to publication. You can find the reviewers' comments at the bottom of this message.

We would appreciate receiving your revised manuscript by Feb 14 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available.
The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Jordi Garcia-Ojalvo Academic Editor PLOS ONE Journal Requirements: 1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ 4. We noted in your submission details that a portion of your manuscript may have been presented or published elsewhere: The manuscript has been released on bioRxiv: https://www.biorxiv.org/content/10.1101/825869v1. The preprint has been uploaded as part of the submission. 
Please clarify whether this [conference proceeding or publication] was peer-reviewed and formally published. If this work was previously peer-reviewed and published, in the cover letter please provide the reason that this work does not constitute dual publication and should be included in the current manuscript. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: N/A Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In this work the authors describe a Matlab platform to perform simulations and rate inference of prokaryotic transcription processes. The stochastic simulations account for promoter on-off switching, initiation, elongation and degradation. An iterative optimization procedure based on a genetic algorithm is implemented to infer the underlying parameters. The manuscript is very clear and everything seems technically correct. I would only draw attention to the fact that the authors only show the performance of the inference method on a single set of synthetic parameter values with variations on either kon or koff (Fig.2 B-H). I think showing the capability to infer parameters for other sets of parameters, and with modulation of other rates, would much strengthen the work and make it more useful to the community. Similarly, assessing the performance on (published) experimental data and comparing the recovered parameters to those obtained by current techniques based on the random telegraph model would be a relevant contribution. Reviewer #2: This paper outlines a software toolbox in MATLAB to simulate stochastic dynamics of transcription in prokaryotes. The paper then uses these simulations to infer transcriptional parameters from fluoroscent RNA probe data. This paper is predicated on the idea that the entire distribution of measurements is important in fitting to a transcriptional model. 
There are many sources of stochasticity in transcription, even in prokaryotes - promoter state switching, RNA polymerase activity, mRNA degradation, .. in addition to stochasticity due to the readout process by probe hybridization. This paper does two things - it models all these stochastic aspects as part of a "forward" model, producing putative live-cell and fixed-cell FISH data. The paper then uses the results of this forward model to solve the inverse problem by optimization (i.e., minimizing the output of the forward model and observed experimental data). Their approach to the inverse problem does not assume functional forms for the distributions, which is nice. I recommend the paper for publication. I ask the authors to clarify the following points to improve the readability of the paper: Populations vs single cell data - the paper mentions that it concerns itself with both kinds of data. However, the figures and other parts of the text (e.g., early parts of the Discussion) only talk about population-level data. Can the tools described here fit distributions of trajectories (as opposed to distributions at each moment in time)? What kinds of deviations from the model do you think are most likely during real transcription? E.g., if we don’t find a good fit, do I blame sequence dependence of your rate constants or non-stationarity or something else? Even a short summary of results from the literature on common deviations from the 4 parameter model would be useful here. The authors simulate millions of cells using Amazon Web Services (AWS) cloud. Do they find that the resulting distributions generally tend to approximated by simple ones common to molecular reactions? If so, can we get by by estimating, e.g. means and variances? In Fig 2B, why does stage 1 already have a population that covers the target parameters? Is stage 1 shown after some amount of search? 
If so, it'd be nice to see the initial conditions for the search, to make sure that wasn't chosen to be particularly favorable. The convergence in Fig 2B appears to go through several "relaxation modes". At first, there is a quick collapse to a pancake in a particular direction (compare Stage 1 to Stage 2), which then shrinks more slowly. What is the meaning of these 'slow' relaxation directions during the search? There seem to be statements relevant to this collapse in the Methods section (something about a degenerate line for k_{obs}) but it was too cryptic for me to understand. The authors should clarify whether this collapse is an informed choice put in by hand for this particular dataset or if the genetic algorithm naturally collapses the cloud of parameters in this manner. If the former is the case, what guiding principles can an end-user use to figure out which lines to collapse to?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

11 Feb 2020

For detailed responses, please check the attached response letter.

Responses to Reviewer #1

“In this work the authors describe a Matlab platform to perform simulations and rate inference of prokaryotic transcription processes. The stochastic simulations account for promoter on-off switching, initiation, elongation and degradation. An iterative optimization procedure based on a genetic algorithm is implemented to infer the underlying parameters. The manuscript is very clear and everything seems technically correct. I would only draw attention to the fact that the authors only show the performance of the inference method on a single set of synthetic parameter values with variations on either kon or koff (Fig.2 B-H). I think showing the capability to infer parameters for other sets of parameters, and with modulation of other rates, would much strengthen the work and make it more useful to the community.”

The reviewer is correct that the performance of our inference method needs to be demonstrated using additional sets of synthetic parameters and with variations of rates other than k_on and k_off. In the previous version of the manuscript, we only fit synthetic data with variations of these parameters due to previous evidence for their role in the modulation of gene expression (Sanchez and Golding 2013). We have now generated more synthetic data with randomly varied parameters (k_on, k_off, k_ini, k_d, v_el). By applying the inference algorithm, we find that the algorithm can infer parameters and reproduce low- and high-order statistics. We show these results in Fig S2, reproduced below.
However, even for high-quality synthetic data, convergence to the true underlying parameters is not guaranteed, as many sets of parameters can yield the same observables. This behavior reflects the degeneracy of the mapping from the parameter domain to the observable domain, and the inability of the genetic algorithm to report degenerate results from a single search. We suggest that this degeneracy is most practically identified by running the search algorithm multiple times and examining the distribution of the resulting point estimates. High variance in mean estimates over several searches suggests intrinsic non-identifiability, as seen in Fig 2H for k_ini and k_off. We describe the above results in the Results section of the main text as well as in the Further Validation section of the Supplementary Information. The relevant text is reproduced below:

In the Results section:

For additional validation, we ran the search algorithm using synthetic data generated from random parameter vectors, as well as experimental data from a recent study (6). These procedures are described in the Further Validation section of Supplementary Information. We found that the fits successfully reproduced time-dependent distributions of probe signals. However, agreement between the inferred parameters and ground truth (or, for experimental data, FSP estimates) was not guaranteed, especially for k_ini and k_off. As in Fig 2, these gaps in performance appear to correspond to non-uniqueness in mapping from the parameter domain to the observable domain (29), and inability of the genetic algorithm to report degenerate results. We suggest that this degeneracy is best identified by running the search algorithm multiple times and examining the resulting distribution of point estimates from the centers of the search populations. We take this approach in Fig 2H.
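The repeated-search check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the platform's MATLAB code: the function names, the log10 scale, and the 0.3 threshold are hypothetical choices, and the point estimates are made-up numbers rather than results from the paper.

```python
import math
import statistics


def log10_spread(estimates, names):
    """Standard deviation of log10(point estimate) for each parameter.

    estimates: one point estimate per independent search, each a sequence
               of positive values ordered like `names`.
    """
    spread = {}
    for j, name in enumerate(names):
        logs = [math.log10(run[j]) for run in estimates]
        spread[name] = statistics.stdev(logs)
    return spread


def flag_poorly_identified(estimates, names, threshold=0.3):
    """Names of parameters whose spread across searches exceeds `threshold`.

    The threshold (in decades) is an arbitrary illustrative cutoff.
    """
    spread = log10_spread(estimates, names)
    return [name for name in names if spread[name] > threshold]


# Made-up point estimates from four hypothetical searches (NOT paper results).
# k_off and k_ini vary together while k_on and k_d stay nearly fixed.
names = ["k_on", "k_off", "k_ini", "k_d"]
runs = [
    [0.10, 0.5, 2.0, 0.20],
    [0.11, 2.0, 8.0, 0.21],
    [0.09, 5.0, 20.0, 0.19],
    [0.10, 1.0, 4.0, 0.20],
]
flagged = flag_poorly_identified(runs, names)
print(flagged)
```

Working in log space keeps the comparison fair for rates that differ by orders of magnitude; any other scale-free measure of spread, such as a coefficient of variation, would serve the same purpose.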
In the Further Validation section of the Supplementary Information: The search algorithm successfully fits the copy-number distributions and mean probe traces. However, its performance in discovering the ground-truth parameter value is fairly poor, especially for k_off and k_ini. This discrepancy essentially speaks to the pervasive degeneracies between the parameter space and the observable space – each observable corresponds to an equivalence class of possible underlying parameter values (1–4). Figure S2: Validation performance using synthetic data from random parameters. Left column: Comparison of mean probe signal between target and fit (circles: target data, dotted line: mean parameter estimate, shaded region around dotted line: signal spanned by fifty estimates sampled from the one-sigma region). Colors and abscissa as in Fig 2E. Central column: Comparison of copy-number distributions between target and fit (shaded gray regions: target histogram, colored lines: histogram generated from mean parameter estimate, top row/blue: nascent mRNA distribution, bottom row/red: total mRNA distribution). Timepoint values as in Fig 2F. Right column: Final trial parameter population (red: ground truth target, histogram: estimate population, gray line: mean estimate, gray region: one-sigma region of estimates). Variables and limits as in Fig 2C. “Similarly, assessing the performance on (published) experimental data and comparing the recovered parameters to those obtained by current techniques based on the random telegraph model would be a relevant contribution.” Following the reviewer’s suggestion, we have demonstrated the performance of our inference method using previously published single-cell experimental data of E. coli transcription (Wang et al. 2019) in the new version of the manuscript. The results are shown in Fig S3, reproduced below. 
The kinetic parameters of these data were originally extracted using the finite state projection (FSP) technique based on the random telegraph model. The physical model extends previous descriptions (Skinner et al. 2016; Xu et al. 2016; Munsky et al. 2018) by modeling co-transcriptional degradation of mRNA. However, as we state in the Introduction section, solving the chemical master equation (CME) with the full complexity of stochastic stepwise elongation using FSP is technically challenging due to exponential growth in the state space size with increasing resolution. Hence, the FSP method originally used to fit these experimental data relied on a simplified model with a deterministic elongation process (Xu et al. 2016; Wang et al. 2019). In contrast, the inference platform presented in this paper is based on empirical distributions solved from a stochastic simulation of molecular reactions. Therefore, the method can more easily capture the complexities of mRNA production (including stochastic stepwise elongation) and processing. By applying the inference algorithm to the experimental data, we find that, as in the synthetic case, the fits successfully reproduced the target distributions, with performance qualitatively similar to FSP fits in (Wang et al. 2019). However, the parameter values do not agree with those previously derived for FSP. As above, we contend that this occurs due to an intrinsic lack of identifiability. We describe the above results in the Further Validation section of the Supplementary Information, and the relevant text is reproduced below: We further analyzed an experimental dataset, previously reported in (5) as a start-up experiment with E. coli grown in a glycerol medium, using the settings given in S3 Table, 100 cells used in stage 1, 10 steps of elongation, and 100 parameter sets. Apart from the pervasive biases at the zero bin, the fits reproduce the target distributions. 
Qualitatively, the performance is similar to the fits using the FSP algorithm (cf. (5), Supplementary Figure 21). However, the parameter values based on inference from FSP (with additional zero-inflation) are inconsistent with those derived here. This result suggests that the parameter values may not be identified unambiguously, but either method can provide plausible values. A natural next step is the development of extensions to the genetic algorithm to report degenerate results without imposed degeneracy-breaking through recombination. Figure S3: Validation performance using experimental data. Left column: Comparison of mean probe signal between target and fit (circles: experimental data, dotted line: mean parameter estimate, shaded region around dotted line: signal spanned by fifty estimates sampled from the one-sigma region). Colors and abscissa as in Fig 2E. Central column: Comparison of copy-number distributions between target and fit (shaded gray regions: experimental histogram, colored lines: histogram generated from mean parameter estimate, top row/blue: nascent mRNA distribution, bottom row/red: total mRNA distribution, numbers: minutes since IPTG addition). Right column: Final trial parameter population (green: FSP estimate, histogram: estimate population, gray line: mean estimate, gray region: one-sigma region of estimates). Variables and limits as in Fig 2C. Responses to Reviewer #2 “This paper outlines a software toolbox in MATLAB to simulate stochastic dynamics of transcription in prokaryotes. The paper then uses these simulations to infer transcriptional parameters from fluoroscent RNA probe data. This paper is predicated on the idea that the entire distribution of measurements is important in fitting to a transcriptional model. There are many sources of stochasticity in transcription, even in prokaryotes - promoter state switching, RNA polymerase activity, mRNA degradation, .. 
in addition to stochasticity due to the readout process by probe hybridization. This paper does two things - it models all these stochastic aspects as part of a "forward" model, producing putative live-cell and fixed-cell FISH data. The paper then uses the results of this forward model to solve the inverse problem by optimization (i.e., minimizing the output of the forward model and observed experimental data). Their approach to the inverse problem does not assume functional forms for the distributions, which is nice. I recommend the paper for publication. I ask the authors to clarify the following points to improve the readability of the paper: Populations vs single cell data - the paper mentions that it concerns itself with both kinds of data. However, the figures and other parts of the text (e.g., early parts of the Discussion) only talk about population-level data. Can the tools described here fit distributions of trajectories (as opposed to distributions at each moment in time)?”

We thank the reviewer for pointing out the two strategies of measuring and inferring single-cell transcriptional kinetics, i.e., the population statistics-based fixed-cell measurement/inference and single trajectory-based live-cell measurement/inference. The former strategy relies on measuring multiple ensembles of snapshot data from fixed cells at different time points and inferring the transcriptional kinetics of a single cell from time-dependent population statistics (Skinner et al. 2016; Munsky et al. 2018; Wang et al. 2019), while the latter strategy relies on tracking the mRNA signal of live cells over time and directly inferring transcriptional kinetics from single-cell mRNA trajectories (Golding et al. 2005; Larson et al. 2011; Garcia et al. 2013). Experimentally, the former strategy may be achieved using either smFISH (Femino et al. 1998; Raj et al. 2008) or single-cell RNA-seq (Erhard et al. 2019).
Both of these technologies can be applied directly to biological samples without the need for gene modification, and are scalable to the entire transcriptome. In contrast, the latter strategy requires fluorescently labeling mRNA molecules in live cells, which typically relies on genetic modification of the original biological system. Due to the lower signal-to-noise ratio of live imaging data, the technical challenge of applying genetic modification to an arbitrary sample, the possible perturbation of gene activity induced by genetic modification, and the incompatibility with high-throughput measurements, directly measuring the mRNA signal from individual live cells has been less popular than measuring the mRNA signal from a population of fixed cells in previous studies (Specht et al. 2017; George et al. 2018). However, recent advances in live-cell RNA labeling techniques have demonstrated their effectiveness in multiple biological systems (Corrigan et al. 2016; Specht et al. 2017; George et al. 2018; Lammers et al. 2020) and may become more popular in the future. To infer transcriptional kinetics, the fixed-cell strategy relies on fitting the time-dependent population statistics from several ensembles of snapshot data. With a large cell population, dense time sampling, and a detailed biochemical model, the optimization of distribution divergence can provide a good estimation of transcriptional kinetics (Munsky et al. 2018). Conversely, in the live-cell strategy, the single-cell trajectory data potentially provide additional information about the temporal correlation of transcriptional signals (Larson et al. 2011; Desponds et al. 2016). Hence, with the same amount of data, directly fitting the single-cell trajectory may be ideally more effective and accurate. Practically, achieving the single-cell trajectory fitting is still technically challenging, with few examples in the scientific literature. Specifically, (Tian et al. 
2007) fit a single time-series by beginning simulations at each time point, simulating until the next time point, and estimating transition probabilities using a normal likelihood kernel. (Golightly and Wilkinson 2011) fit a single time-series by a Markov Chain Monte Carlo method that used a normal error model to calculate acceptance probabilities. (Daigle et al. 2012) fit a single time-series by simulating ensembles and iteratively selecting parameters that gave trajectories close to observations. (Desponds et al. 2016) used autocorrelation analysis to analyze an occupancy model. (Corrigan et al. 2016) and (Lammers et al. 2020) used likelihood-based hidden Markov models (HMMs) to estimate transition probabilities, and pooled multiple traces by assuming statistical independence.

The simulation platform that we presented in this manuscript can generate synthetic data for both types of strategies. However, considering that the population statistics-based fixed-cell experiments are more prevalent in the current literature, we only attempted to infer transcriptional kinetics from time-dependent mRNA distributions (as shown in the Results part). To perform the single-cell trajectory-based inference on our non-parametric, non-Bayesian platform, the following challenges need to be considered:

The simulations may not be initialized with an arbitrary observable, because there exists a large equivalence class of underlying system states that can yield a given probe observation at a particular precision.

The observations of autocorrelation are challenging to convert to biophysically interpretable parameters, especially out of steady state.

Although Bayesian HMM procedures are promising, rigorously recasting them into the context of likelihood-free simulation is problematic. This approach requires either assuming an error model or calculating likelihoods from a kernel. The choice of kernel is unclear, and potentially fraught with challenges for multimodal data.
As a conceptual inspiration from Daigle et al. (Daigle et al. 2012), we suggest that our simulation-based platform may be used for single-cell trajectory fitting in the following way. For a small live-cell dataset of N cells and n timepoints, it is possible to initialize N genetic algorithm searches that iterate over n stages to incrementally shrink the plausible parameter space to values consistent with the time-series, as described in Methods. Afterward, the independent searches may be combined to find a single common plausible parameter region. Qualitatively, at each step, this approach finds a region with a high probability of achieving a transition between values at two time points, then conditions on it for the consequent transition. The recombination is analogous to the pooling of multiple traces. The implementation of this algorithm is outside of the scope of the manuscript. We consider the derivation and implementation of time-series fitting methods a valuable direction for future versions of the platform. In particular, we anticipate it may bring computational advantages over current methods. For example, the recent publication (Lammers et al. 2020) mentions that the HMM analysis of 25 multi-trace datasets takes approximately two hours on 24 CPU cores. Seven out of the ten searches in Fig S2 took under 25 minutes on 8 CPU cores; the others took multiple hours, but this computational load may be mitigated using adaptive methods, as well as parallelization across more cores. Therefore, considering the computational expense of the HMM framework, we anticipate a simulation-based approach presents a viable alternative. On the whole, we summarize this part in the Discussion section and the relevant text is reproduced below: The parameter estimation procedure only uses time-dependent histograms: the platform can generate live- and fixed-cell data, but only attempts to fit fixed-cell data. 
These biochemical distinctions induce methodological differences for parameter inference. Fixed-cell measurements are necessarily destructive, and kinetics may only be inferred from distribution-level data. In contrast, live-cell signals contain additional information regarding the temporal correlation of a given cell. In the current study, we focus on fitting distribution data for two reasons. Firstly, inference from ensembles can be directly implemented using a variety of divergence metrics that make minimal assumptions regarding the form of the data (25). On the other hand, inference from time-series requires error models for transitions between observed states, which are generally intractable (34). Secondly, fixed-cell measurements are amenable to high-throughput experiments, can be scaled to the entire transcriptome via multiplexing (35), produce better signal/noise behavior, and do not require genetic modification (36), contributing to their greater popularity (36). Therefore, we have optimized the parameter estimation method for the most likely current use case of inference from fixed-cell experiments. Recent advances in live-cell labeling techniques do suggest that the method may become more practical and popular in the future (37,38). To anticipate this, we propose several approaches to live-cell data inference, motivated by previous efforts. If the dataset is large enough, the fixed-cell procedure may be sufficient, discarding the temporal correlation information altogether (25). Alternatively, it is possible to iterate through the data points of a time-series, generating an ensemble of transitions, estimating the likelihood of the observed transition based on a kernel, and optimizing the likelihood by varying model parameters. This approach has been useful for relatively small datasets (34,39,40). 
However, its application to multimodal time-series is potentially problematic due to the assumption of smoothness, the complexity of developing robust adaptive kernels, and the well-documented problems accompanying kernel density estimation of multivariate data (41). Further, it presents computational challenges: the different increments are ostensibly independent due to the Markov property, but the non-unique mapping from the underlying Markov states to the observed probe data prevents the independent initialization of each increment. This feature makes it infeasible to parallelize the estimation of transition probabilities over non-overlapping increments. Several recent publications perform likelihood-based inference on hidden Markov models (37,38). However, rigorously recasting these methods into the context of likelihood-free simulation is challenging, as is their extension to multimodal data. We suggest that the algorithm described in the Methods section can be extended to treat time-series data. Such an algorithm may iterate over a single time-series to incrementally shrink to a consistent parameter region. The selection of the region is based on a non-parametric error metric between the target fluorescence and the ensemble distribution for each trial parameter at the end of each interval. Conceptually, this process iteratively identifies parameter values by optimizing for observed transitions, analogously to previous work (40). Afterward, independent searches over multiple traces may be aggregated to find a single plausible region. Given the computational expense of current HMM-based methods (38), an adaptive simulation-based approach may present a viable alternative. “What kinds of deviations from the model do you think are most likely during real transcription? E.g., if we don’t find a good fit, do I blame sequence dependence of your rate constants or non-stationarity or something else? 
Even a short summary of results from the literature on common deviations from the 4 parameter model would be useful here.”

The reviewer is correct that multiple factors may affect the accuracy of our model when applied to real transcription. We now summarize these factors in the Discussion section and the relevant text is reproduced below:

Our platform models the activity of individual gene loci in non-compartmentalized prokaryotic cells with the assumption that transcription follows a two-state random telegraph model with time-homogeneous rate parameters, and elongation and degradation are described by multistep Poisson processes. These assumptions may be violated in the following ways:

The description of a eukaryotic system may be of interest. The implementation of eukaryotic transcription would require making significant changes to the reaction schema, such as disabling the degradation of nuclear mRNA and adding a kinetic model of a transport process after the release of the newly transcribed mRNA.

Multiple gene copies may be present in a cell (6). It is straightforward to extend the current model to account for this physiology. For example, S3 Movie shows the correlated dynamics at two gene copies, which may only turn on when an underlying Boolean cell state is on.

The two-state switching of gene activation/inactivation may be an over-simplified picture of gene activity. In reality, an N-state model may be more accurate (15,42,43). To consider this effect, our simulation-based framework can be easily extended to include more gene states rather than a single Boolean state.

The transcription elongation rate may not be constant, whether due to sequence dependence (44) or polymerase congestion (13,14,45). The implementation of these rules is challenging using the CME framework.
Our simulation-based platform can incorporate sequence-dependent rates by adjusting rates based on the current 3' nucleotide position, and congestion by testing for collisions between polymerases based on a pre-set exclusion radius. An example of a simulation with hard-sphere exclusion is shown in S4 Movie. RNA degradation may in reality be more complex than modeled here, with ribonuclease fluctuations (46), multi-step degradation (47), sequence-dependent degradation (48,49), and transcription-coupled degradation (50) potentially yielding deviations from simple Poisson process degradation. Our simulation-based platform can address these effects analogously to elongation. Moreover, transcription is, in general, non-stationary due to cell cycle effects (6,16). Hence, synchronization of data from different cells is important for accurate inference. This may be achieved experimentally by monitoring cues of mitotic state, such as DNA signal or cell shape (6,16). “The authors simulate millions of cells using Amazon Web Services (AWS) cloud. Do they find that the resulting distributions generally tend to approximated by simple ones common to molecular reactions? If so, can we get by by estimating, e.g. means and variances?” Based on previous studies and observations of our simulation, the shapes of mRNA distribution of random telegraph models may be classified into several groups (monomodal, bimodal, etc. (Munsky et al. 2012; Xu et al. 2016)). Yet the distributions don't tend to approach elementary function forms in any practical way once the submolecular probe features are incorporated (Xu et al. 2016). Hence, a quantitative approximation of distributions using simple function forms requires further validation in theory, which, to date, is still lacking. Particularly, the reviewer is correct that moments of mRNA distributions, such as means and variances, are functions of kinetic parameters. 
For example, in our platform, the mean of the observed probe signal is analytically calculated from the kinetic parameters (i.e., via quadrature rather than via FSP). However, estimating transcriptional kinetics from these moments (or other similar quantities) may not be easier than fitting the entire distribution for the following reasons: Each moment corresponds to an equivalence class of parameters, i.e., many combinations of kinetic parameters can give rise to the same moment. To narrow the parameter space, multiple moments need to be considered simultaneously. Specifically, our model requires fitting five moments in order to estimate the model’s five free parameters. However, we are unaware of any easily tractable expressions for the higher moments of a multi-state gene with arbitrary fluorescent probe coverage. Deriving the mathematical expressions of these moments may not be simpler than fitting the entire distribution. Even if the expressions of higher moments can be derived, optimizing the estimates would require an error model, which is not available in the regime of low copy numbers. In comparison, we limit this issue in our inference algorithm with search stages that use the entire distribution. We summarize the above points in the Discussion section, and the relevant text is reproduced below: On the other hand, we suggest that five-parameter inference entirely from moments is infeasible at this time. Typically, fitting n parameters requires n moments. For the current system, signal expectations can be computed (6), but expressions for the higher moments are unknown. Even if they were available, the choice of error model for these higher moments is far from clear, especially in the physiologically important regime of low copy numbers. Furthermore, we anticipate that the value of this heuristic method rests in applications to models with ad hoc mechanisms whose physics are challenging to approach analytically. 
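The equivalence-class argument above can be illustrated with a minimal sketch. It uses the textbook two-state telegraph model with instantaneous mRNA production, which is a deliberate simplification and not the platform's full model (no elongation, no probe coverage). The function names, the symbols k_on, k_off and k_ini, and all parameter values are assumptions chosen for illustration. The steady-state mean and Fano factor are the standard closed-form results for this simplified model.

```python
def telegraph_mean(k_on, k_off, k_ini, k_d):
    """Steady-state mean mRNA count of the simplified two-state model."""
    return k_on / (k_on + k_off) * k_ini / k_d


def telegraph_fano(k_on, k_off, k_ini, k_d):
    """Steady-state Fano factor (variance / mean) of the same model."""
    switching = k_on + k_off
    return 1.0 + k_ini * k_off / (switching * (switching + k_d))


# Two hypothetical parameter sets (arbitrary units of 1/time).
# "bursty": gene rarely on, fast initiation; "steady": gene often on, slow initiation.
bursty = dict(k_on=0.1, k_off=0.9, k_ini=10.0, k_d=1.0)
steady = dict(k_on=1.0, k_off=1.0, k_ini=2.0, k_d=1.0)

for name, params in (("bursty", bursty), ("steady", steady)):
    # Both sets share the same mean but differ in the second moment.
    print(name, telegraph_mean(**params), telegraph_fano(**params))
```

Under these assumed values both parameter sets give a mean of one molecule per cell, while the Fano factors differ (5.5 versus about 1.33), so the mean alone cannot tell them apart. Each additional moment removes some of this degeneracy, which is why the number of moments needed grows with the number of free parameters.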
“In Fig 2B, why does stage 1 already have a population that covers the target parameters? Is stage 1 shown after some amount of search? If so, it'd be nice to see the initial conditions for the search, to make sure that wasn't chosen to be particularly favorable.”

We appreciate this feedback and opportunity to clarify the procedure. The initial condition for the search is drawn from a log-uniform distribution across the entire search space. We describe this point in the Methods section as follows:

The parameter domain is shown in Figure 2C. We initialize the search using a uniform distribution over the full parameter domain.

The distribution shown for Stage 1 is the population at the end of that stage. We describe this point in the caption to Figure 2 as follows:

B: Convergence of the genetic algorithm at the end of each stage of the search (red: ground truth target, gray: population of parameter estimates).

“The convergence in Fig 2B appears to go through several "relaxation modes". At first, there is a quick collapse to a pancake in a particular direction (compare Stage 1 to Stage 2), which then shrinks more slowly. What is the meaning of these 'slow' relaxation directions during the search? There seem to be statements relevant to this collapse in the Methods section (something about a degenerate line for k_{obs}) but it was too cryptic for me to understand. The authors should clarify whether this collapse is an informed choice put in by hand for this particular dataset or if the genetic algorithm naturally collapses the cloud of parameters in this manner. If the former is the case, what guiding principles can an end-user use to figure out which lines to collapse to?”

We thank the reviewer for raising the question about different relaxation modes. The "slow" relaxation directions correspond to observables that are weak functions of the variable under examination, while the "fast" ones correspond to observables that are strong functions of the variable. As a simple example, the mean total amount of mRNA is a strong function of the degradation rate. The simplest ODE model for the total amount of RNA is $dT/dt = k_i^{\mathrm{obs}} - k_d T$. The turn-on initial condition $T(t=0) = 0$ yields the solution $T(t) = \frac{k_i^{\mathrm{obs}}}{k_d}\left(1 - e^{-k_d t}\right)$. Stage 2 of the search fits the mean of the total amount of RNA and yields sharp estimates for $k_i^{\mathrm{obs}}/k_d$ and $k_d$. Analogously, stage 4, which fits the mean nascent mRNA signal, yields sharp estimates for $v_{el}$. This idea is virtually identical to the Fisher information in Bayesian parametric statistics. However, it is challenging to formalize this in a simulation-based, likelihood-free context, so we suggest using this analogy with caution.

The order of collapse is user-determined through the order of optimization stages; however, for each stage, the direction of collapse is guided by the information content with respect to each parameter. For relatively simple models, basic physical insight is sufficient to draw connections between observables and parameters, e.g., via simplified ODE representations that abstract away gene dynamics and submolecular features. For more complex models, exploratory analysis is necessary.

We summarize the above points in the Discussion section, and the relevant text is reproduced below:

Even without moment-based analytical constraints, it is possible to use physical considerations to guide the development of optimization metrics. For example, in a Bayesian framework, the Fisher information of the mean total probe signal is high with respect to $k_d$, but low with respect to $v_{el}$. As shown in Fig 2B, stage 2, which optimizes the total mean probe signal, provides a tight bound on $k_d$ but not $v_{el}$; conversely, stage 4, which optimizes the mean nascent probe signal, yields a tight bound on $v_{el}$.
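As a rough illustration of the simple ODE quoted above, the sketch below checks the closed-form turn-on solution against a forward-Euler integration and estimates how strongly the mean total mRNA depends on the degradation rate. The function names, parameter values and units are hypothetical, and the finite-difference sensitivity is only an informal stand-in for the information-content argument, not the platform's search metric.

```python
import math


def total_mrna_mean(t, k_ini_obs, k_d):
    """Closed-form solution of dT/dt = k_ini_obs - k_d*T with T(0) = 0."""
    return k_ini_obs / k_d * (1.0 - math.exp(-k_d * t))


def integrate_euler(t_end, k_ini_obs, k_d, n_steps=100_000):
    """Forward-Euler integration of the same ODE, used only as a cross-check."""
    dt = t_end / n_steps
    total = 0.0
    for _ in range(n_steps):
        total += dt * (k_ini_obs - k_d * total)
    return total


def relative_sensitivity(t, k_ini_obs, k_d, rel_step=1e-6):
    """Central finite-difference estimate of d ln T / d ln k_d at time t."""
    up = total_mrna_mean(t, k_ini_obs, k_d * (1.0 + rel_step))
    down = total_mrna_mean(t, k_ini_obs, k_d * (1.0 - rel_step))
    return (up - down) / (2.0 * rel_step * total_mrna_mean(t, k_ini_obs, k_d))


# Hypothetical values for illustration only (rates in 1/min, time in min).
K_INI_OBS, K_D, T_END = 2.0, 0.2, 30.0

analytic = total_mrna_mean(T_END, K_INI_OBS, K_D)
numeric = integrate_euler(T_END, K_INI_OBS, K_D)
sens = relative_sensitivity(T_END, K_INI_OBS, K_D)
print(analytic, numeric, sens)
```

Near steady state the relative sensitivity to the degradation rate approaches -1, meaning a 1% change in the rate shifts the mean by about 1%. The elongation rate does not appear in this ODE at all, so the same observable carries no information about it, which matches the fast and slow collapse directions described in the response.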
For more complex models, exploratory analysis is necessary to determine the coupling between observables and parameters, but the provided heuristics and physical expectations provide a starting point.

References

Corrigan AM, Tunnacliffe E, Cannon D, Chubb JR. A continuum model of transcriptional bursting. eLife. 2016 Feb 20;5:e13051.
Daigle BJ, Roh MK, Petzold LR, Niemi J. Accelerated maximum likelihood parameter estimation for stochastic biochemical systems. BMC Bioinformatics. 2012 Dec;13(1):68.
Desponds J, Tran H, Ferraro T, Lucas T, Perez Romero C, Guillou A, et al. Precision of Readout at the hunchback Gene: Analyzing Short Transcription Time Traces in Living Fly Embryos. PLoS Comput Biol. 2016 Dec 12;12(12):e1005256.
Erhard F, Baptista MAP, Krammer T, Hennig T, Lange M, Arampatzi P, et al. scSLAM-seq reveals core features of transcription dynamics in single cells. Nature. 2019 Jul;571(7765):419–23.
Femino AM, Fay FS, Fogarty K, Singer RH. Visualization of Single RNA Transcripts in Situ. 1998;280:7.
Garcia HG, Tikhonov M, Lin A, Gregor T. Quantitative Imaging of Transcription in Living Drosophila Embryos Links Polymerase Activity to Patterning. Current Biology. 2013 Nov;23(21):2140–5.
George L, Indig FE, Abdelmohsen K, Gorospe M. Intracellular RNA-tracking methods. Open Biol. 2018 Oct;8(10):180104.
Golding I, Paulsson J, Zawilski SM, Cox EC. Real-Time Kinetics of Gene Activity in Individual Bacteria. Cell. 2005 Dec;123(6):1025–36.
Golightly A, Wilkinson DJ. Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus. 2011 Dec 6;1(6):807–20.
Lammers NC, Galstyan V, Reimer A, Medin SA, Wiggins CH, Garcia HG. Multimodal transcriptional control of pattern formation in embryonic development. Proc Natl Acad Sci USA. 2020;117(2):836–47.
Larson DR, Zenklusen D, Wu B, Chao JA, Singer RH. Real-Time Observation of Transcription Initiation and Elongation on an Endogenous Yeast Gene. Science. 2011 Apr 22;332(6028):475–8.
Munsky B, Li G, Fox ZR, Shepherd DP, Neuert G. Distribution shapes govern the discovery of predictive models for gene regulation. Proc Natl Acad Sci USA. 2018;115(29):7533–8.
Munsky B, Neuert G, van Oudenaarden A. Using Gene Expression Noise to Understand Gene Regulation. Science. 2012;336(6078):183–7.
Raj A, van den Bogaard P, Rifkin SA, van Oudenaarden A, Tyagi S. Imaging individual mRNA molecules using multiple singly labeled probes. Nat Methods. 2008 Oct;5(10):877–9.
Sanchez A, Golding I. Genetic Determinants and Cellular Constraints in Noisy Gene Expression. Science. 2013 Dec 6;342(6163):1188–93.
Skinner SO, Xu H, Nagarkar-Jaiswal S, Freire PR, Zwaka TP, Golding I. Single-cell analysis of transcription kinetics across the cell cycle. eLife. 2016 Jan 29;5:e12175.
Specht EA, Braselmann E, Palmer AE. A Critical and Comparative Review of Fluorescent Tools for Live-Cell Imaging. Annu Rev Physiol. 2017 Feb 10;79(1):93–117.
Tian T, Xu S, Gao J, Burrage K. Simulated maximum likelihood method for estimating kinetic rates in gene expression. Bioinformatics. 2007 Jan 1;23(1):84–91.
Wang M, Zhang J, Xu H, Golding I. Measuring transcription at a single gene copy reveals hidden drivers of bacterial individuality. Nat Microbiol. 2019 Sep 16;4:2118–27.
Xu H, Skinner SO, Sokac AM, Golding I. Stochastic Kinetics of Nascent RNA. Phys Rev Lett. 2016;117(12):128101.

Submitted filename: Review1_Response letter_final_02112020.docx

9 Mar 2020

Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics

PONE-D-19-31981R1

Dear Dr. Xu,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Jordi Garcia-Ojalvo
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)
Reviewer #2: The authors have sufficiently addressed the issues I raised. The most pressing issue I raised was showing the performance on additional synthetic data, instead of relying on one particular set. I appreciate the authors doing so and plainly reporting the degeneracy in going from ground truth parameters to observables. They explicitly clarify that their algorithm should be run multiple times to understand such degeneracies. They have also added applications to other real datasets.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Rosa Martinez-Corral
Reviewer #2: No

13 Mar 2020

PONE-D-19-31981R1

Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics

Dear Dr. Xu:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff
on behalf of
Dr. Jordi Garcia-Ojalvo
Academic Editor
PLOS ONE

