Literature DB >> 33802879

Critical Comparison of MaxCal and Other Stochastic Modeling Approaches in Analysis of Gene Networks.

Taylor Firman¹, Jonathan Huihui², Austin R Clark¹, Kingshuk Ghosh^1,2.

Abstract

Learning the underlying details of a gene network with feedback is critical in designing new synthetic circuits. Yet, quantitative characterization of these circuits remains limited. This is due to the fact that experiments can only measure partial information from which the details of the circuit must be inferred. One potentially useful avenue is to harness hidden information from single-cell stochastic gene expression time trajectories measured for long periods of time-recorded at frequent intervals-over multiple cells. This raises the feasibility vs. accuracy dilemma while deciding between different models of mining these stochastic trajectories. We demonstrate that inference based on the Maximum Caliber (MaxCal) principle is the method of choice by critically evaluating its computational efficiency and accuracy against two other typical modeling approaches: (i) a detailed model (DM) with explicit consideration of multiple molecules including protein-promoter interaction, and (ii) a coarse-grain model (CGM) using Hill type functions to model feedback. MaxCal provides a reasonably accurate model while being significantly more computationally efficient than DM and CGM. Furthermore, MaxCal requires minimal assumptions since it is a top-down approach and allows systematic model improvement by including constraints of higher order, in contrast to traditional bottom-up approaches that require more parameters or ad hoc assumptions. Thus, based on efficiency, accuracy, and ability to build minimal models, we propose MaxCal as a superior alternative to traditional approaches (DM, CGM) when inferring underlying details of gene circuits with feedback from limited data.

Entities: Chemical Disease Species

Keywords: Maximum Caliber; gene network; inference

Year: 2021 PMID： 33802879 PMCID： PMC8002683 DOI： 10.3390/e23030357

Source DB: PubMed Journal: Entropy (Basel) ISSN： 1099-4300 Impact factor: 2.524

1. Introduction

A common goal in synthetic biology of single cells is to design and introduce small circuits to optimize certain behaviors [1,2,3,4,5,6,7,8,9,10]. However, this requires quantitative knowledge about the details of these circuits and their subsequent optimization. Circuit characteristics are often studied by single-cell measurements using fluorescent markers on few proteins [11]. These traditional measurements—with the exception of a few emerging technologies that simultaneously monitor transcription and translation [12,13]—cannot visualize many other underlying species that are involved in dictating gene expression dynamics. For example, nucleic acids, protein-complexes, protein-nucleic acid complexes remain invisible. So, how do we extract information from such limited data? One possible solution is to track the stochastic gene expression trajectories of the fluorescently tagged proteins and use the noisy time series data to learn about the underlying invisible network [14,15,16,17,18,19]. This data mining scheme requires building stochastic models that are typically bottom-up and can be divided in two major categories. The first approach builds detailed models (DM) based on a pre-specified reaction network that include tagged proteins and other molecules such as promoter, protein-promoter complexes, etc. [20,21]. These detailed reaction schemes are usually simulated using Gillespie algorithm [22,23] to generate noisy trajectories and analyze circuit properties. However, the same DMs can be used for inference as well. The detailed reactions—written as chemical master Equation (CME) [24] and solved using Finite State Projection method [25,26]—can yield probabilities of numbers of molecules as a function of rate parameters. Consequently, parameter inferences can be done by trying to match the measured noisy trajectory. The second approach builds coarse-grain models (CGM) using a smaller set of reactions where effective reaction rates are implemented by assuming an ad hoc mechanism that accounts for invisible species. CGMs can be made arbitrarily complex with additional fit parameters but we consider only the minimal CGMs due to their computational efficiency and an alternate to DM. Although CGMs are typically used to describe average protein numbers [27,28], CMEs can be constructed for CGMs [29,30,31,32] and solved via FSP to describe probability distributions. Both models (CGM and DM) have merits and demerits over each other. DM’s are by definition more detailed and are expected to be accurate but require invoking multiple molecular species that are not always visible. Consequently, DM’s involve multiple parameters and a much bigger phase space. Thus, accuracy comes at the expense of computational cost. Minimal CGM’s on the other hand minimize computational burden by using effective rates based on pre-specified mechanisms. As a result, these models can be less accurate but efficient. The accuracy vs. efficiency dilemma raises two critical questions: how costly is a DM and how inaccurate is a CGM? Are CGM’s accurate enough or do we need to build a third class of models that are still feasible and yet more accurate than CGM? We have recently introduced the principle of Maximum Caliber (MaxCal) to model and infer underlying details of gene networks using stochastic gene expression data [32,33,34,35,36]. The adoption of MaxCal-based inference for small gene networks provides a third approach that resolves the accuracy vs. efficiency dilemma. Maximum Caliber (MaxCal) is analogous to the Maximum Entropy (MaxEnt) principle but applied to the distribution of paths/trajectories instead of states [37,38]. Similar to MaxEnt, MaxCal maximizes path entropy (caliber) subject to constraints. In the application to gene networks, MaxCal starts with a minimal set of constraints on proteins whose numbers are followed by fluorescent tags. This makes MaxCal directly amenable to model measured time series data of gene expression while not invoking additional mechanisms or auxiliary species [32]. Consequently, MaxCal avoids ad hoc assumptions on mechanisms inherent in CGM while bypassing the challenge of increased phase space inherent in DM. In addition, the top-down nature of MaxCal allows systematic model building via the imposition of constraints of a higher order on top of the minimal constraints it starts with—allowing the user to improve accuracy as needed. This is in contrast to bottom-up approaches of DM and CGM, which work on the premise that a given model must be pre-specified based on pre-imposed assumptions. Since each mechanism/model can be very different from each other, it is difficult to systematically add higher order terms in a given model to increase model complexity. Instead, a new model must be generated every time and the process of testing must start all over. In this work we show how MaxCal provides a model that is minimal and more accurate than CGM while also being as efficient (or more) as CGM. We establish this with two basic genetic circuits: Single Gene Auto-activation (SGAA) [34] and Toggle Switch (TS) where two genes mutually repress each other [1,21,33]. We first generate synthetic data to be used as input by simulating the respective reaction networks with details of protein and nucleic acid dynamics. We subject three different models (DM, CGM and MaxCal) to infer parameters from this synthetic data for which actual model parameters are known. This serves as the benchmark for accuracy between the three models of DM, CGM, and MaxCal. By noting the computation time needed, we can assess accuracy and efficiency of all three models. The quantitive comparison shows that for small circuits used by synthetic biologists, MaxCal is efficient and reasonably accurate.

2. Materials and Methods

In this section, we describe the methods involved in generating synthetic data of stochastic gene expression using known parameters that serve as a benchmark to test the accuracy of different models. We also describe different methods of analysis to infer model parameters from this stochastic data.

2.1. Single Gene Auto-Activation (SGAA) Circuit

We first demonstrate MaxCal’s performance on a single gene auto activation circuit (SGAA). This circuit consists of a single protein that enhances its own production and has been studied extensively in different biological contexts [30,31,39,40,41,42,43,44].

2.1.1. Generating Synthetic Data

We first describe the model introduced by Elston et al. [20] to generate synthetic data mimicking experiment. where a is mRNA transcribed from gene with rate g, and gets degraded with rate d. Protein A is translated from a with basal rate p and degraded with rate r. Two A molecules complex to form a dimer with rate , and dissociate back to monomers with rate . The dimer can bind and unbind to the promoter with rate and , respectively. In the bound form, the promoter gets converted to the activated state producing mRNA at an enhanced rate much greater than g. Thus, A promotes self-activation. Synthetic input trajectories were generated using a Gillespie algorithm to simulate Equation (1) with , , , , , , , , . These parameters and the reaction scheme (1) were chosen for three primary reasons: First, dimers are typical stoichiometry for regulatory proteins binding to the promoter site [20]. Next, the parameters produce noisy switch like trajectory and were used earlier to test MaxCal’s inferential power [35]. Third, the rate parameter values and resulting switching times (few hours) are typical in realistic circuits [5]. Finally, to further mimic experiment, ten replicates of one hundred trajectories—each trajectory seven days long—were considered.

2.1.2. The Detailed Model for Inference

Next, we describe the three models (DM, CGM, and MaxCal) used to infer the underlying details of the circuit from the synthetic data generated. We start with the DM model first. The detailed model (DM) of SGAA is built by considering the promoter in its basal state (), activated state (), and the protein A. The reaction scheme is given by The effective protein production rate in the non-activated and activated forms are and , respectively. Upon defining the reaction network by Equation (2), the system can be described by a chemical master Equation (CME) to compute probability of observing number of A proteins at time t. The CME is given by where, and 0 denote the basal and the activated state () of the promoter, respectively. The CME can be expressed in the matrix form as where, states (i) are given by different values of and and their time dependent probability is . The transition matrix W is a function of rate parameters and can be constructed from Equation (3). Using Finite State Projection (FSP) [25], the state space can be bounded and the time evolution of probability state vectors is given by where, are final and initial time points, respectively. Thus, Equation (5) yields the probability of being in some state i at time from an initial condition at time . These conditional probabilities are used to compute the likelihood of producing the synthetic trajectory as a function of the rate parameters (see Equation (15)). The likelihood is then maximized to determine the optimum values of the parameters, thus inferring details of the model. It is important to note that DM ignores mRNA and dimer dynamics and hence is less detailed than the model (Equation (1)) created to generate the synthetic data. The choice of this reduced model is motivated to lower computational cost while constructing a model that is sufficiently fine-grained compared to CGM and MaxCal, outlined next.

2.1.3. CGM Model for Inference

The coarse-grain model (CGM) uses an approach where mass-action (MA) models are combined with CME [29,30,31,32]. Specifically, there is only one reaction describing the time evolution of protein number A given by Equation (6). This is akin to a mass-action approach where coupling with other species to capture feedback is modeled in an indirect manner by assuming an ad hoc functional form for the effective rate X. Specifically, X is given by a Hill type function invoking a cooperative model with defined in Equation (6). The parameter is the basal rate in the absence of any positive feedback and the second term in X monotonically increases with A approaching an asymptotic value of , successfully capturing auto-activation. In this model, the rate in the activated state is and K is a parameter. Protein A is degraded with rate . While traditional MA models use similar functional forms to describe the time evolution of average protein number, adopting Equation (6) within Chemical Master Equation (CME) framework allows stochastic modeling. Thus, the combined approach of CME and MA (CME + MA) computes the probability () of protein number () at a given time t. Specifically, the time evolution equation of corresponding to the reaction scheme in Equation (6) is given by where and . Using the matrix formulation, similar to DM, the likelihood of input trajectories can be calculated as a function of . Maximization of the likelihood infers model parameters , , K, and .

2.1.4. MaxCal Model for Inference

MaxCal modeling of SGAA and other circuits have been described in our earlier work [32,34,35]. Here we give a brief outline of the method for SGAA. Maximum Caliber maximizes the path entropy subject to constraints. We define two random variables, and , to define microscopic trajectories between a time interval t and . The probabilities of these paths are denoted as . Variable tracks the production of proteins—in the time interval between t and —ranging between 0 and M (predefined maximum) while is the number of proteins from the previous step that have not undergone degradation, i.e., . The first term in C (in Equation (8)) is the path entropy and the remaining three terms denote three constraints on average production, average degradation, and a correlation between protein production and the proteins present. The constraints are imposed by Lagrange multipliers , and , respectively. Maximizing the caliber subject to the three constraints yield the probability of these micro trajectories as

2.2. Two-Gene Toggle Switch (TS) Circuit

Next we consider a two gene mutually repressing circuit called the Toggle Switch (TS), first designed by Collins and colleagues [1].

2.2.1. Generating Synthetic Data

Synthetic data mimicking experimental time trace was created by using the reaction scheme described in reference [21], detailed below: where mRNA a and b are transcribed from gene (different loci) at rate , that in turn produce proteins A and B, respectively, at rate . mRNAs and proteins are degraded at rate and , respectively. Mutual repression is modeled by binding of either protein (A or B) to the promoter site of at rate , altering the promoter state to different expression levels (for A) and (for B). In the state , mRNAs of B are produced at a rate much less than , while state produces mRNAs of A at that same slower rate , implementing negative feedback. Unbinding of proteins convert the inactivated promoter states (, ) back to the basal state . For simplicity and as proof of concept, we assume a symmetric model in A and B. The rate values (, , , , , , ) were chosen following our earlier work [35]. These values create biologically relevant switching time scales [5] between two states (state 1: low A, high B; state 2: high A, low B) while maintaining typical synthesis and degradation rates [27]. Next, the Gillespie algorithm [22] is used to simulate ten replicates of one hundred noisy trajectories (each seven days long) of protein levels that exhibit bistability and serve as our synthetic data to benchmark models.

2.2.2. Detailed Model

To infer network details, we first consider the detailed model (DM) built on a reaction network similar to Equation (10) but neglecting mRNAs. The exact reaction scheme used in DM for TS is described as where the symbols above and below arrows indicate respective reaction rates that will be inferred from the synthetic trajectory data generated from Equation (10). As before, CME arising from this reaction network architecture can be used to compute the probability of observing and number of protein molecules at time given there were and number of A and B protein molecules at time t. These probabilities are used to compute the likelihood of a given trajectory (see Equation (17)), which can then be maximized to infer rates for this system.

2.2.3. Coarse Grain Model

The coarse grain model (CGM) for TS does not explicitly model protein-promoter complexation unlike DM. Feedback is modeled by an effective production rate of A and B, given by X and Y, respectively, defined as The specific functional form for X ensures that large B inhibits production of A. Similarly, Y ensures that A inhibits production of B. Proteins A and B degrade with rate . As before, the reaction network is used to construct the corresponding CME from which the likelihood is computed as a function of model parameters , , K, and . The cooperativity parameter n was assumed to be 2, typically assumed in MA models of TS to produce bistability [27]. The inferred values of these parameters are obtained by maximizing the likelihood.

2.2.4. MaxCal

Two variables, and , were introduced to model SGAA (in Section 2.1.4) to track production of A and number of A proteins not degraded in a time interval t and . The same variables were used for protein A in TS. Additionally, variables and were introduced to track the same but for protein B. The micro trajectories between time t and are labeled by stochastic variables , , , and , and the corresponding probability is denoted as . The caliber (C) is defined as the path entropy (given by the first term on the right hand side of Equation (13)) along with four constraints. First, the average of and are imposed as constraints with Lagrange multiplier to model protein production. Next, the average of and are constrained with Lagrange multiplier capturing information about degradation. Third, correlations between production and number of existing proteins (of the same protein type) are imposed by constraining average of and by Lagrange multiplier (fourth term in the right hand side of Equation (13)). This term captures any positive feedback that may not be known about a priori (see reference [35] for more). Finally, the average of and are constrained by Lagrange multiplier to model cross-correlation (negative feedback) between A and B. Maximizing the caliber subject to these constraints gives the path probability in terms of the Lagrange multipliers as

2.3. Calculation of Trajectory Likelihood

Parameters of DM, CGM, and MaxCal models are determined by maximizing the likelihood () of the synthetic trajectory—mimicking experimental data—recording fluctuating numbers of protein with time.

2.3.1. Calculation of Trajectory Likelihood for SGAA

For SGAA, the likelihood of the trajectory is calculated as where T is the total snapshots recorded in experiment or synthetic data, is rounded to the nearest integer and is the total number of transition from state i to j over m frames. States are given by number of proteins. Typically m is chosen in the same range as the average residence time (in frames) in the high and low state. Multiple frames are combined to avoid any spurious jumps in the number of proteins present in the trajectory, not allowed in the MaxCal formulation. Thus, calculating likelihood for transitions over m frames allows MaxCal to select reasonable values for the Lagrange multipliers (see reference [34] for further details). The individual transition probability of going from state i to state j in one time step in MaxCal is computed as The transition probability over m frames—needed for likelihood calculation—can be obtained by arranging single time step transition in a matrix (with as matrix elements) and raising the matrix to the power. FSP formalism—originally devised by Munsky and colleagues [25]—has been used to introduce a sink state and carry out this calculation. Supplemental material in reference [36] provides a general description on how to use FSP formalism for MaxCal, including the calculation of transition probabilities for m steps. For DM and CGM models, the calculation of require matrix exponentiation following Equation (5) with .

2.3.2. Calculation of Trajectory Likelihood for TS

For TS, the likelihood of the trajectory is calculated as where T is total time frames noted in data, is rounded to the nearest integer, is the total number of simultaneous transitions from i to j states (in number of protein A) and k to l (in number of protein B) over m frames, and , denote the number of A, B proteins at time t (corresponding to frame where n is an integer). The most likely values of , , , , and M are determined by maximizing the likelihood. As before, transitions between m frames were used to avoid large spurious jumps in protein number in one step in the raw data that cannot be easily modeled in MaxCal. The transition probabilities between two consecutive frames in MaxCal is defined as, with defined in Equation (14). The transition probability between m frames is then calculated by raising the two frame transition probability matrix to the power, similar to the procedure described for SGAA. Similar to SGAA, calculation of the transition probability within DM and CGM formalism require matrix exponentiation in contrast to matrix multiplication needed for MaxCal.

3. Results

3.1. Comparison of Three Models for SGAA

Quantitative comparison of three SGAA models was carried out in terms of their accuracy and efficiency. Accuracy is determined by comparing inferred values of three observables () against the gold standard used to generate the synthetic data (see Table 1). The true basal production rate and activated production rate are obtained from the parameters in reaction 1 by , . The degradation rate remains the same, . For DM, these rates can be extracted as , , , where , , and are defined in reaction 2. The effective rates for CGM (described by Equation (6)) are calculated as , , and . The effective rates using MaxCal are calculated using where denotes the average of an observable for a given , is the peak position of the protein number distribution in the low state, is the peak in the high state, and is the probability distribution of N proteins at relative equilibrium calculated using FSP [25] (see reference [35] for details). Comparing reported values in Table 1, we conclude that both DM and MaxCal models infer these underlying details reasonably well. CGM however infers a value of almost twice the true value.

Table 1

Comparison of accuracy between three models for SGAA. The first row reports the values of three known (“True”) rates for effective production () in the basal state, production () in the activated state, and protein degradation () used to generate the synthetic data. The inferred rates using three models, DM (second row), CGM (third row), and MaxCal (fourth row), are compared against each other and the “True” rates indicating that CGM is less accurate than DM and MaxCal. Error bars for rates were obtained by using inference on ten replicates of the input trajectory data.

Method	peff (s−1)	peff* (s−1)	reff (s−1)
True	5.0×10−3	50×10−3	1.0×10−3
DM	(4.2±0.5)×10−3	(42±5)×10−3	(0.8±0.1)×10−3
CGM	(4.1±0.4)×10−3	(91±8.9)×10−3	(1.3±0.1)×10−3
MaxCal	(5.6±0.2)×10−3	(43±2.1)×10−3	(0.96±0.1)×10−3

Next, we provide a comparison of the computational efficiency between the three models by tracking the typical time needed to complete the basic unit of operation invoked in the calculation of the likelihood function. For a given model, we measured the time taken for each likelihood calculation during the entire process of inference. The averages and standard deviations of these times are noted in column 4 in Table 2. For MaxCal, the basic operation is raising the transition matrix to the power, in contrast to the matrix exponentiation required for DM and CGM (see Equation (5)). We note that both CGM and MaxCal have identical matrix dimensions (column 3 in Table 2), constrained by the maximum number of proteins allowed ( reported in column 2 of Table 2). The maximum number of proteins used for FSP calculation was chosen to be significantly higher than the maximum protein number seen in the input trajectory. The total state space dimension is equal to with the additional state (or “sink” state) accounting for all states with protein numbers greater than . Despite identical matrix size, a typical step in CGM is slower than MaxCal because matrix exponentiation is slower than matrix multiplication performed m times in succession. We notice basic step calculation in DM is significantly slower than MaxCal. This is primarily due to two reasons. First, DM has almost four times larger of a matrix size—compared to MaxCal—due to explicit consideration of the promoter state (basal and activated) in combination with different protein numbers defining the state space. Next, DM requires matrix exponentiation, a much more computationally expensive operation compared to matrix multiplication. We conclude that CGM is less accurate than MaxCal and MaxCal is the method of choice over DM given its efficiency. However, for SGAA, the true power of MaxCal is not reflected since the basic step/operation is in milliseconds for DM and the cumulative time needed for inference (fifth column) is in seconds, feasible on traditional hardware. Nevertheless, the comparison conceptually shows MaxCal’s ability to balance efficiency and accuracy for SGAA, in conjunction with the bigger circuit that we discuss next.

Table 2

Comparison of efficiency between three models for SGAA. Second column reports the maximum number of proteins used in FSP, third column shows the overall matrix dimension, fourth column reports the average time taken (using a CPU platform) for the basic matrix operation needed for a likelihood calculation, and fifth column reports the total time taken during the entire process of likelihood maximization to infer model parameters for the three different models (noted in the first column).

Method	Max N	Matrix Size	Unit Operation Time (ms)	Total Time (ms)
DM	92	185×185	10±0.4	1263±69
CGM	92	93×93	3.0±0.2	813±190
MaxCal	92	93×93	0.6±0.5	47±13

3.2. Comparison of Three Models in TS

Next, we provide a quantitative comparison of three models for TS. As before, accuracy is determined by comparing inferred values of the three effective rates, , , and against the gold standard used to generate the synthetic data (see Table 3). The true basal production rate , repressed production rate are obtained from the parameters in reaction (10) defined as and . The degradation rate remains the same, . Within the framework of DM, these rates are extracted as , , and , where , , and are defined in reaction (11). The effective rates for CGM (described by Equation (12)) are calculated as , , and . The effective rates from MaxCal are calculated as where is the average of a quantity of interest given that there are i and j number of proteins initially present of type A and B, respectively, and are the peaks of the protein number distribution in the basal (unrepressed) and repressed state, respectively, and is the relative equilibrium probability (calculated using FSP [25]) of having and proteins. The details of these definitions and results can be found in our earlier work [35]. Comparing the extracted rates against the gold standard, we find DM and MaxCal reliably infer effective rates, similar to their performance with SGAA. CGM however infers rates that differ from the “True” rates by almost an order of magnitude or even more ( for example).

Table 3

Comparison of accuracy between three models for TS. The first row reports the values of three known (“True”) rates for effective production () in the basal state, production () in the repressed state, and protein degradation () used to generate the synthetic data. The inferred rates using the three models, DM (second row), CGM (third row), and MaxCal (fourth row), are compared against each other and the “True” rates indicate CGM is less accurate than DM and MaxCal. Error bars for rates were obtained by using inference on ten replicates of the input trajectory data.

Method	peff (s−1)	peff* (s−1)	reff (s−1)
True	20×10−3	0.1×10−3	1.0×10−3
DM	20.7×10−3±5.2×10−5	0.1×10−3±2.6×10−5	1.0×10−3±0.2×10−3
CGM	170×10−3±30×10−3	2.3×10−7±0.7×10−7	7.0×10−3±2.0×10−3
MaxCal	14.6×10−3±0.9×10−3	0.16×10−3±0.02×10−3	0.7×10−3±3×10−5

Next, we provide a comprehensive comparison of the three models in terms of their computational efficiency (see Table 4). Similar to SGAA, DM is significantly slower than MaxCal and CGM primarily due to large matrix size. We have chosen resulting in = 13,925 states, considering the two states of the promoter and the sink state included in the state space of DM. The enormous dimensional explosion in DM compounded with matrix exponentiation significantly slows down the basic operation to minutes. The estimates of basic steps are obtained using a GPU platform for feasibility reasons. CPU-based computing—used in SGAA–was abandoned as it would have taken an unreasonably long time to complete DM inference. We modified the existing SciPy libraries for calculating the matrix exponential term (needed for basic operation in DM and CGM) using CuPy. To calculate matrix power needed for MaxCal, we used existing CuPy libraries. The timing of the basic step was evaluated on two Nvidia Tesla P100 cards using CUDA version 8.0.

Table 4

Comparison of efficiency between three models for TS. Second column reports the maximum number of proteins used in FSP, third column shows the overall matrix dimension, and fourth column reports the average time taken (using GPU platform) for the basic matrix operation needed for a likelihood calculation and fifth column reports the average of total time taken during the entire process of likelihood maximization to infer model parameters for the three different models (noted in the first column).

Method	Max N	Matrix Size	Unit Operation Time (s)	Total Time (s)
DM	59	13,925 × 13,925	223±110	106,878 ± 19,765
CGM	59	3482×3482	5.8±0.4	14,124 ± 12,493
MaxCal	59	3482×3482	0.13±0.01	851±60

CGM and MaxCal are however much faster than DM. This is primarily because of significantly smaller variable space that CGM and MaxCal uses. Furthermore, we notice MaxCal is even faster compared to CGM due to MaxCal’s reliance on matrix multiplication in contrast to matrix exponentiation. The differences in basic operation (column four in Table 4) translate to marked differences in the total time taken for the entire process of likelihood maximization (reported in column five in Table 4) to infer model parameters. Combining these findings we conclude for TS—similar to SGAA—MaxCal is the preferred method of inference due to its efficiency and ability to reliably infer underlying model parameters. For TS, unlike SGAA, the gain from MaxCal is readily appreciated when using standard GPU hardware with only few nodes. The example of TS considered above illustrates the typical challenge of inferring network details using detailed models (DM) where multiple genes and species are involved. Genetic circuits larger than TS, i.e., involving more than two expressed proteins, will face even more combinatorial challenges when using models like DM to account for proteins and their promoters. This combinatorial explosion of the state space combined with the need of matrix exponentiation will render detailed models unrealistic for the purposes of inference. Although this challenge can be somewhat mitigated in CGM reducing the state space, CGMs too will face computational challenge, due to their reliance on matrix exponential that will tend to be slower for larger state space (due to multiple genes). Alternate approaches similar to MaxCal should be used where basic operations in likelihood calculation are less burdensome keeping overall computational cost manageable.

4. Discussion

Quantitative determination of parameters in a gene network is critical to designing new circuits for synthetic biology applications. However, determining these parameters is challenging due to limited information available on few expressed proteins, much less than the actual number of species involved in such feedback networks. A powerful approach is to mine information-rich stochastic trajectories of protein numbers to infer network details. Three primary modeling schemes were considered to harness information from these noisy trajectories for two specific gene networks: Single Gene Auto Activation (SGAA) circuit and Toggle Switch (TS). The first inferential approach used a detailed modeling (DM) scheme where proteins and their interaction with corresponding genes were explicitly modeled. The second approach employed a coarse grain model (CGM) where Hill type functional forms were invoked to describe feedback, circumventing the need to explicitly model promoter and protein-promoter complexes. The third scheme used the principle of Maximum Caliber (MaxCal) which relies on the maximization of path entropy subject to constraints of protein production, degradation, and feedback. MaxCal—similar to CGM—also avoids explicit consideration of additional molecular species and provides a stochastic description of protein expression trajectories, suitable to analyze noisy gene expression data typically measured in experiments. Using synthetic data generated from a known model, we show that DM accurately infers network details. However, DM is computationally challenging for networks with multiple proteins, even as few as two. The prohibitive computational cost of DM is due to its large state space, resulting in the exponentiation of high dimensional matrices. CGM operates on a lower dimensional state space and hence is more efficient than DM, but still suffers from computational cost for larger circuits due to matrix exponentiation. Moreover, minimal CGMs are less accurate in inferring underlying model details. MaxCal offers the much needed framework both in terms of accuracy and feasibility. In addition, MaxCal provides a systematic way to improve and select models with the same variables but including different correlations as additional constraints. However, adding more constraints will increase the number of parameters (Lagrange multipliers) over which the likelihood function needs to be maximized, slowing down the overall inference process. Nevertheless, MaxCal enjoys the computational efficiency of the basic step—to be iterated multiple times during likelihood maximization—due to relatively smaller matrix size and matrix multiplication operation. The ability to systematically improve models by adding higher order correlation—due to MaxCal’s inherent top-down nature [32,37]—is another advantage of MaxCal. In contrast, traditional approaches are bottom-up and require first imposing a reaction diagram (such as in DM) or a mechanism (functional forms in CGM). Even with small changes, different reaction networks or functional forms need to be incorporated into the model, restricting systematic model building in a perturbative manner. Furthermore, MaxCal’s reliance on production and degradation variables separately allows the calculation of an effective feedback parameter useful for network characterization, design, and evolution [35,36]. These advantages strongly favor adoption of MaxCal to infer parameters in gene networks from noisy time series protein expression data. However, MaxCal’s performance depends on the availability of basic information. For example, MaxCal model built on only two genes cannot produce oscillation seen in three gene repressilator circuit. MaxCal requires minimal knowledge about the existence of all three genes to correctly model repressilator [36]. Another common challenge with inference is experiments typically measure fluorescence and not protein numbers. Furthermore, fluorescence per protein is not fixed, but random. Direct application of MaxCal on raw experimental data requires experimentally determining the distribution of fluorescence per protein. In the absence of such information, we created synthetic data mimicking noisy fluorescence trajectories and demonstrated MaxCal’s ability to decouple fluorescence noise from gene expression noise and infer underlying circuit details [34,35,36]. It is important to note, MaxCal—in spite of its relative efficiency over DM and CGM— will face increased computational challenge with circuits having multiple different genes due to increased dimensionality of matrices. With the emergence of new mass-spectrometry tools measuring multiple proteins, we are likely to face such data deluge requiring sophisticated tools of inference. Building efficient and reliable models like MaxCal and adoption of GPU platform for additional computational acceleration, as done here, are necessary steps to this direction.

38 in total

1. Stochasticity in transcriptional regulation: origins, consequences, and mathematical representations.

Authors: T B Kepler; T C Elston
Journal: Biophys J Date: 2001-12 Impact factor: 4.033

2. Moment-based inference predicts bimodality in transient gene expression.

Authors: Christoph Zechner; Jakob Ruess; Peter Krenn; Serge Pelet; Matthias Peter; John Lygeros; Heinz Koeppl
Journal: Proc Natl Acad Sci U S A Date: 2012-05-07 Impact factor: 11.205

3. Designing experiments to understand the variability in biochemical reaction networks.

Authors: Jakob Ruess; Andreas Milias-Argeitis; John Lygeros
Journal: J R Soc Interface Date: 2013-08-28 Impact factor: 4.118

Review 4. Build to understand: synthetic approaches to biology.

Authors: Le-Zhi Wang; Fuqing Wu; Kevin Flores; Ying-Cheng Lai; Xiao Wang
Journal: Integr Biol (Camb) Date: 2015-12-21 Impact factor: 2.192

5. Modeling stochastic dynamics in biochemical systems with feedback using maximum caliber.

Authors: S Pressé; K Ghosh; K A Dill
Journal: J Phys Chem B Date: 2011-04-27 Impact factor: 2.991

6. Building Predictive Models of Genetic Circuits Using the Principle of Maximum Caliber.

Authors: Taylor Firman; Gábor Balázsi; Kingshuk Ghosh
Journal: Biophys J Date: 2017-11-07 Impact factor: 4.033

Review 7. Synthetic biology: understanding biological design from synthetic circuits.

Authors: Shankar Mukherji; Alexander van Oudenaarden
Journal: Nat Rev Genet Date: 2009-11-10 Impact factor: 53.242

Review 8. Applications of synthetic gene networks.

Authors: Fuqing Wu; Xiao Wang
Journal: Sci Prog Date: 2015 Impact factor: 2.774

9. Gene networks of fully connected triads with complete auto-activation enable multistability and stepwise stochastic transitions.

Authors: Philippe C Faucon; Keith Pardee; Roshan M Kumar; Hu Li; Yuin-Han Loh; Xiao Wang
Journal: PLoS One Date: 2014-07-24 Impact factor: 3.240

10. Real-time quantification of protein expression at the single-cell level via dynamic protein synthesis translocation reporters.

Authors: Delphine Aymoz; Victoria Wosika; Eric Durandau; Serge Pelet
Journal: Nat Commun Date: 2016-04-21 Impact factor: 14.919