Bram Thijssen1, Tjeerd M H Dijkstra2,3, Tom Heskes4, Lodewyk F A Wessels1,5. 1. Computational Cancer Biology, Division of Molecular Carcinogenesis, Netherlands Cancer Institute, 1066 CX, Amsterdam, The Netherlands. 2. Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany. 3. Centre for Integrative Neuroscience, University Clinic Tübingen, Tübingen 72076, Germany. 4. Institute for Computing and Information Sciences, Radboud University Nijmegen, Nijmegen 6500 GL, The Netherlands. 5. Faculty of EEMCS, Delft University of Technology, Delft 2628 CD, The Netherlands.
Abstract
Motivation: Computational models in biology are frequently underdetermined, due to limits in our capacity to measure biological systems. In particular, mechanistic models often contain parameters whose values are not constrained by a single type of measurement. It may be possible to achieve better model determination by combining the information contained in different types of measurements. Bayesian statistics provides a convenient framework for this, allowing a quantification of the reduction in uncertainty with each additional measurement type. We wished to explore whether such integration is feasible and whether it can allow computational models to be more accurately determined. Results: We created an ordinary differential equation model of cell cycle regulation in budding yeast and integrated data from 13 different studies covering different experimental techniques. We found that for some parameters, a single type of measurement, relative time course mRNA expression, is sufficient to constrain them. Other parameters, however, were only constrained when two types of measurements were combined, namely relative time course and absolute transcript concentration. Comparing the estimates to measurements from three additional, independent studies, we found that the degradation and transcription rates indeed matched the model predictions in order of magnitude. The predicted translation rate was incorrect however, thus revealing a deficiency in the model. Since this parameter was not constrained by any of the measurement types separately, it was only possible to falsify the model when integrating multiple types of measurements. In conclusion, this study shows that integrating multiple measurement types can allow models to be more accurately determined. Availability and implementation: The models and files required for running the inference are included in the Supplementary information. Contact: l.wessels@nki.nl. Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Computational models in biology are frequently underdetermined, due to limits in our capacity to measure biological systems. In particular, mechanistic models often contain parameters whose values are not constrained by a single type of measurement. It may be possible to achieve better model determination by combining the information contained in different types of measurements. Bayesian statistics provides a convenient framework for this, allowing a quantification of the reduction in uncertainty with each additional measurement type. We wished to explore whether such integration is feasible and whether it can allow computational models to be more accurately determined. Results: We created an ordinary differential equation model of cell cycle regulation in budding yeast and integrated data from 13 different studies covering different experimental techniques. We found that for some parameters, a single type of measurement, relative time course mRNA expression, is sufficient to constrain them. Other parameters, however, were only constrained when two types of measurements were combined, namely relative time course and absolute transcript concentration. Comparing the estimates to measurements from three additional, independent studies, we found that the degradation and transcription rates indeed matched the model predictions in order of magnitude. The predicted translation rate was incorrect however, thus revealing a deficiency in the model. Since this parameter was not constrained by any of the measurement types separately, it was only possible to falsify the model when integrating multiple types of measurements. In conclusion, this study shows that integrating multiple measurement types can allow models to be more accurately determined. Availability and implementation: The models and files required for running the inference are included in the Supplementary information. Contact: l.wessels@nki.nl. Supplementary information: Supplementary data are available at Bioinformatics online.
Computational models in biology are frequently underdetermined (Gutenkunst ), which can limit their
usefulness. This underdetermination is a result of our limited capacity to measure
biological systems. A dynamic model of an intracellular regulatory network, for example,
might contain several proteins of interest that carry out important functions in the system.
We would then ideally like to know the concentrations of all these proteins in their various
states and complexes, inside the cell, over time. But such direct measurements are currently
not possible. Instead we are limited to indirect measurements such as relative protein
levels compared to a control, reporter-based measurements, or averages over populations of
cells. A compounding difficulty is that the measurements are often relatively noisy. It is
thus challenging to accurately determine the unknown parameters of computational models of
biological systems.Intuitively, one would expect that multiple types of measurements, obtained using different
experimental techniques, provide more information than a single type of measurement. The
combined information would then be more likely to constrain the parameters in a
computational model compared to using only a single measurement type. However, this need not
be the case; perhaps one particular dataset, such as the most detailed measurements, already
contains all relevant information, making additional datasets irrelevant.The quantification of how much information a dataset brings to the parameter estimates can
be achieved using Bayesian statistics (Vyshemirsky
and Girolami, 2008; Wilkinson, 2007).
For all unknowns in a model, a probability distribution is specified which quantifies the
uncertainty in the parameters. This probability distribution can then be updated based on
each of the different datasets, using Bayes’ theorem. This provides a convenient framework
for the integration of multiple datasets, as it allows a detailed comparison of the amount
of information that can be extracted from each of the datasets.Bayesian statistics has been applied to mechanistic computational models in biology in
various settings and model types, including regulatory network models based on ordinary
differential equations (ODEs) (Eydgahi ; Hug ; Toni and Stumpf,
2010; Vyshemirsky and Girolami, 2008;
Xu ). These
applications have so far been limited to the use of a single dataset consisting of one type
of measurement. It is thus unclear whether integration of multiple data types within the
Bayesian formalism is feasible in practice and whether it is beneficial for achieving more
accurate parameter estimates. The purpose of this study was to test the feasibility of this
type of data integration and to explore whether multiple data types can indeed provide more
accurate parameter estimates.We tested this approach using a model of a well-studied system, cell cycle regulation in
budding yeast. Figure 1 shows the concept of
data integration we used: various measurements are included as prior information,
subsequently two types of data are incorporated during the inference, and finally the
obtained parameter estimates are compared to measurements of these rates from independent
studies.
Fig. 1.
Outline of the approach of data integration using Bayesian statistics. Several initial
datasets are assimilated into a prior probability distribution for all parameters in the
model. Subsequently, multiple datasets are integrated to update the prior and obtain a
posterior probability distribution for all parameters. Finally, this posterior
probability distribution is compared to validation data
Outline of the approach of data integration using Bayesian statistics. Several initial
datasets are assimilated into a prior probability distribution for all parameters in the
model. Subsequently, multiple datasets are integrated to update the prior and obtain a
posterior probability distribution for all parameters. Finally, this posterior
probability distribution is compared to validation data
2 Approach and results
2.1 Constructing an initial model
We will use cell cycle regulation in budding yeast as test case, as this system is well
studied and there is a host of data available. A central event in cell cycle regulation is
the cyclic expression of the cyclin proteins (Morgan, 2007). We wished to model the cyclic expression pattern of four cyclins
in particular: the G1-phase cyclin CLN3, the
G1/S-transition cyclin CLN2, the S-phase cyclin
CLB5 and the M-phase cyclin CLB2 (Fig. 2A).
Fig. 2.
Cyclins and model overview. (A) The expression patterns of the four
cyclins included in the model. The measurements are from Spellman . The approximate
cell cycle phase is indicated at the bottom. (B) Initial structure of the
model in Systems Biology Graphical Notation
Cyclins and model overview. (A) The expression patterns of the four
cyclins included in the model. The measurements are from Spellman . The approximate
cell cycle phase is indicated at the bottom. (B) Initial structure of the
model in Systems Biology Graphical NotationAlthough many models have been constructed of this system, for example (Chen ; Tyson, 1991), we wished to obtain a simple,
sparse model that is sufficient for explaining the cyclic expression of the cyclins. To
this end we created a simple model that might be able to do this. The structure of this
initial model is shown in Figure 2B and the
reasoning behind it is as follows. Since the expression of the cyclins oscillate at the
transcriptional level, we need to include the transcription factors that are responsible
for regulating the transcription of the cyclins in the model. Thus, based on the overview
of the cell cycle provided by Morgan (2007),
and especially Figure 3-34 therein, we included the three transcription factor complexes
SBF, Mcm-Fkh and Swi/Snf. Each of these complexes is represented by one of their subunits:
SBF is represented by the regulatory subunit SWI4, Mcm1-Fkh is
represented by the coactivator NDD1 and Swi/Snf is represented by the
subunit SWI5. We chose these subunits because they are regulatory factors
and are transcriptionally oscillating (Santos
). As most data are available at the mRNA level,
we explicitly included the mRNA transcripts as well as the proteins as species in the
model. The dynamics are modeled by including rates for transcription, translation and
degradation of both mRNA and protein. To keep the model manageable, we did not explicitly
include processes such as post-translational modifications, complex formation and
intracellular localization. While these processes are also clearly important for cell
cycle regulation, the goal is not to create a detailed model but rather a simple model
that is sufficient for explaining the cyclic expression of the cyclins. For the same
reason, the model contains fewer signaling events than the more comprehensive model of
Chen . The
starting model described here will likely require improvement, which we consider below.
Starting from a simple model allows us to find a balance between model complexity and data
fit. The resulting model can then be used for testing the integration of multiple
datasets.Creating a model that can fit the time course mRNA data. On the left the model
structure is indicated in Systems Biology Graphical Notation, with the simplest model
at the top. The changes with respect to the previous model are highlighted in red. On
the right the mRNA time course measurement data of the cyclins is shown, overlaid with
the posterior predictive of the mean of the data. The thick red line indicates the
median and the shaded red area indicates the 90% confidence interval. Above each graph
the median R2 is shown and the 90% confidence interval is
given in brackets (Color version of this figure is available at
Bioinformatics online.)Measurements and posterior predictive with the two types of data separately and
together for the CLN3 transcript. The top row shows the absolute concentrations and
the bottom row shows the expression relative to the time average (by log ratio). In
the left column, only time course data is used, in the middle column only steady-state
data is used and in the right column both types of data are used. The measurements are
shown in black: horizontal line for absolute steady-state data (note that each line is
only a single measurement value), and connected dots for relative time course data
(here each dot is a measurement value). The thick red line indicates the median of the
posterior predictive and the shaded red area indicates the 90% confidence interval.
Grey lines indicate individual trajectories for 100 posterior parameter samples (Color
version of this figure is available at Bioinformatics online.)(A) Posterior 90% confidence intervals for several model parameters, as
inferred using the absolute steady-state data, relative time course data, the time
course data together with absolute transcript or protein data, or all data. The
confidence intervals and posterior probability densities for all parameters are given
in Supplementary Figures S1 and
S2. (B) Size of the 90% confidence intervals of all parameters
on log10 scale. For the prior, the full range is given. Values are given in Supplementary Table S1Comparison of parameter estimates with validation data. (A) Posterior
probability distributions of the parameters, as inferred using the absolute
steady-state data, relative time course data or both. The validation measurements are
marked with a black cross below the probability distributions. (B)
Trajectories for the mRNA expression levels of the transcription factor subunit SWI4
and its target genes CLN2, YOX1 and
HCM1Another important modeling choice is that we specified the model entirely in physical
units of concentration (micromolars) and time (seconds), rather than using dimensionless
parameters and abundances. The physical units allow a comparison of the parameter
estimates to measurements from independent studies. The model is specified in terms of
ODEs, and the rate equations are based on mass action kinetics with the addition of a
non-linear term for modeling inhibitory effects. The model is described in more detail in
Section 3, and SBML versions of all models are included in Supplementary File S1.
2.2 Constructing priors from several datasets
The Bayesian analysis required us to specify prior probabilities for the unknown
parameters in the model. For each of the parameters, we specified priors based either on
biochemical limits or on published datasets providing information for a parameter. The
prior probability distributions and how they were established are described in more detail
in the Supplementary Methods.
All datasets used throughout this study are listed in Table 1.
Table 1.
All datasets used in this study
Measurement
Experimental technique
Used as
Reference
Protein concentration
2D-gel electrophoresis
Prior
Futcher et al.,
1999
mRNA concentration
Hybridization kinetics
Prior
Hereford and Rosbash, 1977
Cell size
Electrical conductivity
Conversion
Tyson et al.,
1979
Transcript elongation rate
ChIP
Prior
Mason and Struhl, 2005
RNA polymerase footprint
Nuclease digestion
Prior
Selby et al.,
1997
Peptide elongation rate
Radioactive labeling
Prior
Boehlke and Friesen, 1975; Waldron et al.,
1974
Ribosome footprint
Nuclease digestion
Prior
Wolin and Walter, 1988
mRNA time course (relative)
Microarray
Inference
Granovskaia et al.,
2010; Pramila et
al., 2006; Spellman
et al., 2003
mRNA concentration
SAGE
Inference
Velculescu et al.,
1997
mRNA concentration
Microarray
Inference
Holstege et al.,
1998
Protein concentration
TAP tag; western blot
Inference
Ghaemmaghami et al.,
2003
Protein concentration
GFP tag; flow cytometry
Inference
Newman et al.,
2006
Protein concentration
2D-HPLC; MS/MS
Inference
Lu et al.,
2007
mRNA degradation rate
Microarray
Validation
Wang et al.,
2002
Transcription rate
GRO; ChIP-chip
Validation
Pelechano et al.,
2011
Translation rate
Polysome profiling
Validation
Arava et al.,
2003
All datasets used in this study
2.3 Fitting time course mRNA measurement data
As we wished to obtain a model for the cyclic expression of the cyclins, we first turned
to measurements of mRNA gene expression over time (Granovskaia ; Pramila ; Spellman ) and tested whether the
model can fit these datasets.A complication with these datasets is that the measurements were taken under different
growth conditions, with different synchronization methods and with slightly different
yeast strains, resulting in different doubling times for the cells, ranging from 60 to
100 min. To make the datasets compatible, we used the time-normalized data provided by
Cyclebase (Santos ), and scaled the times back to an 80-min cell cycle, which is a typical
doubling time for yeast cells in rich yeast extract peptone dextrose (YEPD) medium (Tyson ).We fitted the model to these three gene expression datasets simultaneously. The
measurements are all made on synchronized cells relative to unsynchronized controls, and
the likelihood function was specified such that it reflects this. Specifically, the
likelihood of the observed values was centered on the log ratio of the modeled transcript
concentration divided by the average modeled concentration over time (see Section 3). We
expected that the model would not exactly match the measurement data, and so a
t-distribution was used as error model, such that occasional outlying
measurements with respect to the model are not penalized too heavily.The posterior probabilities were calculated using Markov chain Monte Carlo (MCMC)
sampling. The relatively large number of dimensions, with the prior in each dimension
spanning many orders of magnitude, makes this a challenging inference task. To be able to
run the inference in reasonable time, the Bayesian inference software package BCM was used
(Thijssen ).
The posterior probability distribution contained sub-optimal modes, we therefore used
parallel tempering (Geyer, 1991) to have a
means of escaping these. MCMC sampling relies on a proposal distribution; a distribution
from which new candidate parameter values are drawn. For the efficiency of the sampler it
is important that the proposal distribution reflects the shape and scale of the (unknown)
posterior distribution. We therefore used automated blocking (Turek ) and adaptively scaled the
proposal distributions (see Section 3). Traces and autocorrelation plots for the
convergence analysis of all models are included in Supplementary File S2.To test the goodness of fit, we first used graphical posterior predictive checking. The
posterior predictive distribution describes a new, predicted dataset given the fitted
model. Overlaying this posterior predictive distribution on the observed measurements
provides a convenient way of identifying which data can and cannot be explained by the
model. Figure 3 (top row) shows the posterior
predictive distribution of the mean of the relative transcript levels in the fitted model
overlaid on the observed measurements. It is immediately clear that the model cannot
adequately explain the expression patterns of the four cyclins. The model can only fit the
first peak of CLN3 expression, but not the subsequent oscillations or the
activation of the other cyclins.
Fig. 3.
Creating a model that can fit the time course mRNA data. On the left the model
structure is indicated in Systems Biology Graphical Notation, with the simplest model
at the top. The changes with respect to the previous model are highlighted in red. On
the right the mRNA time course measurement data of the cyclins is shown, overlaid with
the posterior predictive of the mean of the data. The thick red line indicates the
median and the shaded red area indicates the 90% confidence interval. Above each graph
the median R2 is shown and the 90% confidence interval is
given in brackets (Color version of this figure is available at
Bioinformatics online.)
To further quantify the goodness of fit, coefficients of determination
(R2) were calculated for the four cyclins (Fig. 3). We compared these values to the
R2 of a spline fit to the data. The spline fit gives a
reference R2 for the optimal fit that can be achieved. The
median R2 for the model fits range from 0.07 to 0.19, whereas
a spline fit gives R2 values ranging from 0.46 to 0.72, again
showing that the initial model is insufficient to explain the expression patterns of the
cyclins.
2.4 Iterative model refinement to create a well-fitting model for the time course
mRNA measurement data
As the simplest model could not adequately fit the transcription data, it was necessary
to expand the model with additional explanatory factors. We thus searched the literature
to find important mechanisms that were missing from the model. For each addition, we
re-fitted the model to the data and compared the posterior predictive distributions and
R2 values for expression of the four cyclins. Note that we
could not use the marginal likelihood for model selection here, because when we added
additional species to the model we also included the expression data for those new species
in the likelihood function. This affects the marginal likelihood; the marginal likelihood
of two differing sets of data cannot be compared to each other.The first addition to the model which we considered was the transcription factor
HCM1. There is a significant delay between the transcriptional peak of
SWI4 and NDD1, especially compared to the peaks of
CLN2 and CLB5, which occur more rapidly after the
expression of SWI4 (see Supplementary File S2 for the trajectories of all species). The
transcription factor HCM1 has been found to be an important part of the
transcriptional cell cycle regulation system (Pramila ), and the inclusion of this factor could
introduce the necessary delay in the model. As shown in Figure 3 (second row), the addition of HCM1
indeed improved the fit of the model, particularly for the induction of the expression of
CLN2 and CLB5 after SWI4 expression.
However, the model was still not able to explain the oscillatory aspect of the expression
of the four cyclins.The lack of oscillatory behavior of the model suggested that a feedback loop might be
required. We therefore considered the addition of the inhibitory transcription factor
YOX1 (Pramila ). This transcription factor provides a negative feedback
loop from SWI4 back to both SWI4 and
CLN3. As shown in Figure 3
(third row), with this addition the model could indeed recapitulate the oscillatory aspect
of the expression pattern of the four cyclins.As the magnitude of the oscillations in CLB2 was still greater in the
data than could be explained by the model, we considered the addition of another
regulatory mechanism, namely the degradation of NDD1 by the anaphase
promoting complex (Sajman ). This complex is normally active, unless it is inactivated by
CLN2 (Morgan, 2007).
Thus, NDD1 would be actively degraded until CLN2 signals
the start of S-phase. As shown in Figure 3
(bottom row), with the addition of this mechanism the model can indeed better explain the
expression pattern of the NDD1-target CLB2.With these additions to the model, the expression patterns of the four cyclins are
adequately explained (R2 > 0.3 for all cyclins, and at
least 65% of the R2 achieved with a spline fit). Although
further additions can be considered, we wished to keep the model as small as possible
while achieving a reasonable fit. This was mainly done to keep the computational
requirements manageable—to generate 1000 posterior samples for the fourth extended model
required approximately 60 h. The structure of the resulting model is similar to the
Boolean network model of Orlando in terms of the transcriptional regulatory network.
2.5 Simultaneous fitting of time course and steady-state measurement data
Now that the model is able to explain the relative time course measurements adequately,
we can start including additional datasets to test whether the parameters of the model can
be more tightly constrained with the integration of additional data. We turned to absolute
measurements of the mRNA (Holstege ; Velculescu
) and protein (Ghaemmaghami ; Lu ; Newman )
concentrations of the species in the model. These measurements were done at steady-state
growth conditions in non-synchronized cells. We incorporated this in the likelihood by
taking the time average of the modeled trajectories, and setting this time average as the
modeled value of the steady-state data, where the time average was taken over two cell
cycles.The addition of absolute concentration data to relative time series data may seem
trivial, and it could potentially also be achieved by transforming the kinetic parameters
and concentrations accordingly. However, keeping the model specified in physical
dimensions (micromolars and seconds) is natural, and more importantly, it allows for a
direct comparison of the kinetic rates with measurements of these rates later on.Figure 4 shows the posterior trajectories of
the transcripts of one of the cyclins, CLN3, after fitting the relative
time course data alone, the absolute steady-state data alone or all data together
(trajectories for all other species in the model are included in Supplementary File S2). Several
observations can be made. First, it is apparent that the absolute concentrations can be
quite high when only time course data is used. When the steady-state data is included
however, the concentrations are constrained to be much lower. Second, when only
steady-state data is used, the model displays various behaviors including stable
expression, decay and oscillations (see the individual trajectories depicted in grey)—each
of these behaviors would be consistent with the given average data over a period of two
cell cycles. With all measurements types included, the model displays the correct
oscillations at concentrations consistent with the steady-state data. Finally, the fit to
the relative time course data is not compromised by the inclusion of the absolute
steady-state data, and vice versa. The model is thus able to fit both types of data at the
same time, and no modifications need to be made to the model structure to accommodate the
steady-state data.
Fig. 4.
Measurements and posterior predictive with the two types of data separately and
together for the CLN3 transcript. The top row shows the absolute concentrations and
the bottom row shows the expression relative to the time average (by log ratio). In
the left column, only time course data is used, in the middle column only steady-state
data is used and in the right column both types of data are used. The measurements are
shown in black: horizontal line for absolute steady-state data (note that each line is
only a single measurement value), and connected dots for relative time course data
(here each dot is a measurement value). The thick red line indicates the median of the
posterior predictive and the shaded red area indicates the 90% confidence interval.
Grey lines indicate individual trajectories for 100 posterior parameter samples (Color
version of this figure is available at Bioinformatics online.)
Figure 5A shows the 90% posterior confidence
intervals for several parameters in the model, for the relative time course data alone,
the absolute steady-state data alone or all data together (confidence intervals and
density plots for all parameters are included in Supplementary Figs S1 and S2). For several parameters, each data type
separately provides some information, but the inclusion of the two types of data together
provides significantly tighter confidence intervals, for example for the translation rate.
There are also parameters that can already be inferred from the time course data alone;
that is, for these parameters the addition of the steady-state data does not reduce the
confidence intervals, such as the degradation rate of CLN3. In many
cases, the steady-state data by itself provides little information for constraining the
parameters, which is not surprising for a dynamic model. However, the addition of the
steady-state data to the time course data does reduce the uncertainty compared to the time
course data alone. Examples of this are the degradation rate of SWI4 or
the transcription rate of CLN2.
Fig. 5.
(A) Posterior 90% confidence intervals for several model parameters, as
inferred using the absolute steady-state data, relative time course data, the time
course data together with absolute transcript or protein data, or all data. The
confidence intervals and posterior probability densities for all parameters are given
in Supplementary Figures S1 and
S2. (B) Size of the 90% confidence intervals of all parameters
on log10 scale. For the prior, the full range is given. Values are given in Supplementary Table S1
In general across all parameters, combining multiple types of measurements reduces the
uncertainty in parameter estimates (Fig. 5B).
With all data types included, 45 out of 54 parameters have 90% confidence intervals of
less than half of the prior range, whereas the steady-state data by itself constrains only
1 parameter to this extent and the time course data 14 parameters. Comparing the added
value of the absolute protein and transcript concentrations, we note that it is mainly the
transcript concentrations that reduces the uncertainty (columns 4 and 5 in Fig. 5B and see Supplementary Table S1 for the
values). Nevertheless, adding the protein concentration data to the time course and
transcript concentration data still further reduces the uncertainty for several
parameters.
2.6 Comparison of parameter estimates with rate measurement data
To test whether the obtained parameter estimates are accurate, we compared them to
measurements from three additional, independent datasets. In particular, the mRNA
degradation rate (Wang ), transcription rates (Pelechano
) and translation rates (Arava ) have been measured for
budding yeast. Figure 6A shows the measured
values of the parameters compared to the posterior probability distributions of the
parameter estimates.
Fig. 6.
Comparison of parameter estimates with validation data. (A) Posterior
probability distributions of the parameters, as inferred using the absolute
steady-state data, relative time course data or both. The validation measurements are
marked with a black cross below the probability distributions. (B)
Trajectories for the mRNA expression levels of the transcription factor subunit SWI4
and its target genes CLN2, YOX1 and
HCM1
For the mRNA degradation rate, the measurements are in close agreement with the predicted
rates (Fig. 6A, left panel). We assumed a
common rate parameter for all species, while the measurements were done for each gene
separately, and there is indeed some variability between the measurements for the genes
that were included in the model. Nevertheless, the measured degradation rates of all genes
are within the same order of magnitude as the estimated average degradation rate (the
difference between the measurements and the maximum a posteriori estimate
on log10 scale is <0.5), so the scale of the average mRNA degradation rate
was predicted accurately.For the transcription rates, these rates in the model are split into two parts: basal
transcription and transcription factor induced transcription. The rate measurements are
population averages, and as each cell would be in a different stage of the cell cycle,
they will be expressing different levels of the transcription factors. To be able to
compare the measurements of the transcription rates to the model’s estimated rates, it is
necessary to calculate the total, average transcription rate. This was obtained by
averaging the transcription rate of each gene over time. This rate thus includes the
effect of the time-varying expression of the transcription factors. When only time course
data was used, the transcription rates were not constrained, but they do have non-zero
probability at the measured values. However, when all data are included, the estimated
transcription rates closely match the measured values for most genes (Fig. 6A, middle panels; seven of the eight measured values lies
within the 90% confidence interval, and the remaining gene is at least within the same
order of magnitude). Thus, for the transcription rates, the addition of the absolute
concentrations to the relative dynamic data constrained the parameter estimates to values
close to or matching the measurements of these rates.The measured translation rates have been estimated from ribosome densities using polysome
profiling, whereby a processing speed of 10 amino acids/s was assumed (Arava ). Note that
the authors mention that their estimates should be used with caution. Nevertheless,
assuming the estimates from Arava et al. are accurate, then our model
estimate using all inference data are two orders of magnitude too high (Fig. 6A, right panel). The model estimate is
indeed quite high at around 1 protein/transcript/s. While this is feasible given the prior
information, it would require that the transcripts are always essentially fully packed
with ribosomes.To find the reason for this high translation rate estimate, we investigated the
trajectories of the transcription factors and their target genes. If we compare the mRNA
expression trajectories for the transcription factor SWI4 and its targets
CLN2, HCM1 and YOX1 (see Fig. 6B), it makes sense that the model requires a
high translation rate. The peaks in transcription of the target genes follow very closely
after the peak in transcription of the transcription factor, especially in the first cell
cycle. Given that this process of rapid induction of transcription in the model has to
occur through the translation of the transcription factor, then there are two ways in
which the model might fit the data: either the translation rate must be high, or the
concentration of the transcript of the transcription factor must be high. When using only
the relative data, it is not possible to distinguish between these scenarios; indeed in
this case the translation rate is not constrained: the 90% confidence interval of the
translation rate spans almost three orders of magnitude (Fig. 5). However, when including both relative and absolute data,
the inference can make use of the information that the concentration of the transcription
factor is low. It can thus be inferred that the translation rate must be high, given this
model.It is known that other mechanisms are at play here as well, such as the regulation of
SWI4 and the SBF transcription factor complex through phosphorylation
by different cyclin/CDKs (Siegmund and Nasmyth,
1996). Indeed it has been shown that induction of G1-phase transcripts
can occur in the absence of protein translation (Marini and Reed, 1992). It is likely that a model with additional layers of
SWI4 regulation would be able to fit all data with lower translation
rates. Unfortunately we were not able to expand the model with such additional effects, as
the parameter inference for these expanded models would involve a prohibitive amount of
computation time. Regardless, these results show that the model can be identified as being
incomplete by using the inference of parameters from multiple datasets. This model
deficiency could not be deduced from any of the datasets alone.
3 Materials and methods
3.1 Model equations
The computational model consists of two types of species: the proteins and the mRNA
transcripts. The rate equations for these species are based on mass action kinetics, with
the addition of a non-linear term for modeling inhibitory effects. For transcripts, the
rate equation contains three terms: one for transcription, one for inhibition of
transcription and one for degradation. The transcription rate is proportional to the
concentration of the activating transcription factor for that gene. This transcription can
be inhibited by an inhibitory transcription factor. Each transcript has exactly one
activating transcription factor and at most one inhibitory transcription factor. For
proteins, the rate equation also contains three terms: one for translation, one for
degradation and one for inhibition of degradation. The translation rate is proportional to
the concentration of the transcript for that protein, and the degradation rate is
proportional to the concentration of the protein itself. See the Supplementary Methods for a more
detailed description and the equations.
3.2 Prior distributions
For all parameters, we used uniform priors on a log10 scale. A log scale was
chosen as we were interested in the orders of magnitude of the parameters rather than
their precise values. The limits of the uniform distributions were chosen based on various
data points and biochemical limits as described in the Supplementary Methods.
3.3 Likelihood
Firstly, the time average of the concentration of a transcript was calculated by
averaging over two full cell cycles. Then, for relative time course data measured using
synchronized cells relative to unsynchronized cells, we modeled the relative value by
dividing the modeled concentration by the time average and taking the log. As error model
we used a t-distribution with three degrees of freedom, as a means of
robust inference (Gelman ). This distribution was centered on the log ratio of the relative
expression. For the absolute concentration data, the time average value is
log10 transformed, and again a t-distribution is used as
error model. As for the prior, the likelihood is specified on a log scale as it is
sufficient if the model captures the right order of magnitude. See the Supplementary Methods for the
equations.
3.4 Model inference
The posterior probability distributions were calculated using parallel-tempered MCMC
(Geyer, 1991), using the Bayesian
inference software package BCM (Thijssen ). For the initial model, we used 32 parallel chains with the
temperatures of the chains distributed quadratically. The burn-in period was set to 1.25
million samples followed by a sampling period of 5 million posterior samples, which were
subsampled at 1 in 2500. At each step, a random choice was made between updating each
chain with five Metropolis–Hastings steps and swapping a random adjacent pair of chains.
The probability of selecting a swap step was set to 0.99. For the proposal distribution in
the Metropolis–Hastings steps, the parameters were blocked automatically (Turek ) and we used
a multivariate normal distribution for each block of parameters. The proposal covariance
matrix for each block was set to the empirical covariance of the preceding samples and
adaptively scaled to obtain an acceptance rate of 0.23 within each block. These settings
produced sufficiently uncorrelated posterior samples (see Supplementary File S2 for traces and
autocorrelation plots) and were sufficient to achieve at least 100 round trips from prior
to posterior. The sampling period and subsampling was doubled for model 3 and quadrupled
for model 4, such that the resulting posterior samples were still sufficiently
uncorrelated and at least 100 round trips from prior to posterior were achieved.All files required for running the inference in BCM, including the prior and likelihood
specification, the models in SBML/CellDesigner format, as well as a NetCDF archive
containing the pre-processed data, are included in Supplementary File S1.
3.5 Model checking
The model fit was investigated using the posterior predictive distribution and
coefficients of determination. The posterior predictive distribution is the probability
distribution of a new set of data, given the model and the observed data. This
distribution was approximated using the posterior Monte Carlo samples. The coefficients of
determination for the time course data were calculated relative to a null model, which has
a separate mean for each experiment. A reference R2 was
calculated by fitting a cubic spline to the data with the smoothing parameter selected
through cross-validation. See the Supplementary Methods for details and equations.
4 Discussion
Model determination is an important aspect of computational modeling. Models in systems
biology are frequently underdetermined, and as a result it is often not possible to confirm
or falsify a particular model. There is thus a need for methods to determine models more
accurately. With the increasing amount of data available for many biological systems, the
use of multiple datasets to constrain the parameters from different angles is a promising
avenue. Bayesian statistics provides a coherent and convenient framework to accomplish this.
Here, we have shown that it is feasible to integrate diverse datasets during the Bayesian
inference of parameters of an ODE-based model. The process as described here may be useful
as a general recipe for integrating diverse measurement types also in other settings. More
importantly, we have shown that this integration of diverse data types can provide tighter
posterior estimates, at least in obtaining the right order of magnitude, thus achieving more
accurate model determination. We noticed that even when a single dataset, taken by itself,
provides little information, it can still significantly improve parameter estimates when
used in conjunction with other datasets.There are several challenges when using this type of data integration based on model
simulation and Bayesian statistics. The biggest challenge is the scaling of the
computational demands with respect to the size of the model. This is due to two reasons.
First, the simulation of a computational model typically does not scale well with model size
(cubically in the case of direct, implicit ODE solvers). Second, the parameter inference is
increasingly challenging when the number of parameters increases. Although in theory Monte
Carlo methods scale independently of the dimensionality, this requires that the samples are
concentrated in regions of high posterior probability. The efficiency of generating a good
set of samples critically depends on the proposal distribution that is used. Given the
complex shape of the posterior probability distributions of biological computational models,
in particular the presence of multiple modes and ridges (Girolami, 2008; Hug
), proposal distributions typically become much less
efficient with higher dimensionality. Both of these computational challenges apply more
generally to any approach using model simulation and global parameter inference. Increases
in computational capabilities, more efficient simulation methods and sampling or
optimization methods tailored for the inference of biological computational models may allow
larger models in the future.For budding yeast, and their cell cycle regulation in particular, many more measurements
have been performed, such as mRNA quantification by qPCR (Miura ) and RNA sequencing (Nagalakshmi ),
protein-level time course data by mass spectrometry (Flory ) and GFP-tagged time lapse microscopy (Ball ). In principle,
these data can be integrated with the same approach as was done for the data in this study,
and it would be interesting to explore the contributions and concordance of these
measurements. A challenge for further integration of time course data is the synchronization
of the timing, which is not straightforward when using different experimental setups. This
synchronization can also directly affect kinetic rates, for example the alignment of
transcript and protein time course data can directly influence the estimated translation
rate.To be able to compare the contribution of the different datatypes, it is necessary to
quantify the uncertainty in the parameter estimates, which was achieved here using Bayesian
statistics. The quantification of uncertainty has previously been achieved with different
approaches as well (reviewed by Vanlier ), including using the profile likelihood (Raue ) and through
bootstrapping (Brännmark ). The incorporation of multiple datasets in the likelihood function can in
principle be translated to these formalisms as well. A unique advantage of the Bayesian
approach is the ability to explicitly include data as prior information, which we have
utilized to incorporate various datasets. Profile likelihoods may be computationally more
efficient to calculate than posterior probabilities, although the calculations still involve
the most challenging aspect, namely global optimization. The profile likelihood is limited
in that it provides uncertainty estimates for each parameter separately rather than for all
parameters jointly.In conclusion, we have shown that diverse types of measurements can be successfully
integrated during the inference of parameters of ODE systems using Bayesian statistics. This
integration provided more tightly constrained parameter estimates, thereby achieving a more
accurate model determination.
Funding
This work was performed within the Cancer Genomics Netherlands Program supported by the
Gravitation program of the Netherlands Organization for Scientific Research (NWO). This work
was sponsored by NWO Physical Sciences for the use of supercomputer facilities. This work
was further supported by CogIMon H2020 ICT-644727 to T.D from the European Commission.Conflict of Interest: none declared.Click here for additional data file.
Authors: Yoav Arava; Yulei Wang; John D Storey; Chih Long Liu; Patrick O Brown; Daniel Herschlag Journal: Proc Natl Acad Sci U S A Date: 2003-03-26 Impact factor: 11.205
Authors: F C Holstege; E G Jennings; J J Wyrick; T I Lee; C J Hengartner; M R Green; T R Golub; E S Lander; R A Young Journal: Cell Date: 1998-11-25 Impact factor: 41.582
Authors: Mark R Flory; Hookeun Lee; Richard Bonneau; Parag Mallick; Kyle Serikawa; David R Morris; Ruedi Aebersold Journal: Proteomics Date: 2006-12 Impact factor: 3.984
Authors: Marina V Granovskaia; Lars J Jensen; Matthew E Ritchie; Joern Toedling; Ye Ning; Peer Bork; Wolfgang Huber; Lars M Steinmetz Journal: Genome Biol Date: 2010-03-01 Impact factor: 13.583
Authors: P T Spellman; G Sherlock; M Q Zhang; V R Iyer; K Anders; M B Eisen; P O Brown; D Botstein; B Futcher Journal: Mol Biol Cell Date: 1998-12 Impact factor: 4.138
Authors: Tian-Rui Xu; Vladislav Vyshemirsky; Amélie Gormand; Alex von Kriegsheim; Mark Girolami; George S Baillie; Dominic Ketley; Allan J Dunlop; Graeme Milligan; Miles D Houslay; Walter Kolch Journal: Sci Signal Date: 2010-03-16 Impact factor: 8.192