- Research article
- Open Access
Statistical model comparison applied to common network motifs
- Núria Domedel-Puig^{1}Email author,
- Iosifina Pournara^{2} and
- Lorenz Wernisch^{3}
https://doi.org/10.1186/1752-0509-4-18
© Domedel-Puig et al; licensee BioMed Central Ltd. 2010
- Received: 22 May 2009
- Accepted: 3 March 2010
- Published: 3 March 2010
Abstract
Background
Network motifs are small modules that show interesting functional and dynamic properties, and are believed to be the building blocks of complex cellular processes. However, the mechanistic details of such modules are often unknown: there is uncertainty about the motif architecture as well as the functional form and parameter values when converted to ordinary differential equations (ODEs). This translates into a number of candidate models being compatible with the system under study. A variety of statistical methods exist for ranking models including maximum likelihood-based and Bayesian methods. Our objective is to show how such methods can be applied in a typical systems biology setting.
Results
We focus on four commonly occurring network motif structures and show that it is possible to differentiate between them using simulated data and any of the model comparison methods tested. We expand one of the motifs, the feed forward (FF) motif, for several possible parameterizations and apply model selection on simulated data. We then use experimental data on three biosynthetic pathways in Escherichia coli to formally assess how current knowledge matches the time series available. Our analysis confirms two of them as FF motifs. Only an expanded set of FF motif parameterisations using time delays is able to fit the third pathway, indicating that the true mechanism might be more complex in this case.
Conclusions
Maximum likelihood as well as Bayesian model comparison methods are suitable for selecting a plausible motif model among a set of candidate models. Our work shows that it is practical to apply model comparison to test ideas about underlying mechanisms of biological pathways in a formal and quantitative way.
Keywords
- Markov Chain Monte Carlo
- Deviance Information Criterion
- Model Evidence
- Feed Forward Model
- Bayesian Model Comparison
Background
Cellular processes are very complex, but it seems that such processes can often be broken down into a small number of reoccuring patterns of interconnections known as network motifs [1, 2]. Interestingly, some motifs are known to display specific dynamic functional roles [3, 4]. Motif dynamics can now be assessed in a precise manner thanks to the emergence of new experimental techniques that allow generating high quality time series data with a high temporal sampling rate [5–9]. However, studying biological systems in general involves two steps: first, the components of the network need to be identified, and then the type of relationships between them established. Different methods exist for doing so. While some have focused on deriving pairs of possible interacting molecules from existing databases [1], others have tried to reconstruct networks from scratch integrating different sources of both static and dynamic data [10]. In fact, automatic identification of interactions has been the aim of reverse engineering for many years [11]. In general, multiple hypotheses about the architecture of a network are easily generated, and assessing their validity is often difficult.
Existing knowledge can be used in the model identification process. If certain aspects of the networks are already known, the space of possible models is smaller and reverse engineering amounts to model comparison in this case. Several techniques exist for comparing the plausibility of different candidate models, given observations. On the one hand, judgement is sometimes based solely on visual inspection of predicted vs. observed data [12]. On the other hand, formal frequentist methods such as likelihood ratio tests (LRT) and bootstrapping have been used for comparing models [13]. Until recently, Bayesian model comparison approaches such as Bayes factors have largely been ignored in analysing the identifiability of biological systems from time series data [14]. However, Bayesian approaches to problem solving have recently gained in popularity due to their inherent control of complexity and the ease with which any prior knowledge about the system under study can be incorported (for an introduction to Bayesian modelling see [15]). This prior information is derived either through literature surveys or through experimental observations.
In this paper, we wish to show how formal statistical model comparison methods can be applied to a typical systems biology scenario. We analyze whether the dynamic fingerprint of a number of commonly occurring motifs allows discrimination of the underlying network structure based on simulated data. These are the single input motif (SIM), regulatory chain (RC), feedforward (FF) and feedback (FB) motifs. Any of these motif architectures can be parameterized differently when converted to ordinary differential equations resulting in different dynamics of the system. This has been reported in the case of the FF motif [5–7], which appears in the arabinose, flagella, and galactose pathways in E. coli. The original work on these systems explored the differences between each FF type and a control non-FF architecture in a statistically informal manner [5–7]. Here we take the analysis one step further and explore whether statistical model comparison is also able to distinguish between FF parameterizations.
Before we apply model comparison to specific dynamic models derived from biological network motifs (Section Results), we provide some background on statistical model comparison in the next section (Section Methods). The spirit of this work is to demonstrate in a didactic way how different statistical model comparison tools perform on a class of dynamic models of interest in the systems biology community. We have therefore restricted the type and number of methods and models to those most easily implementable and accessible to the non-expert reader. In addition, we provide an introductory exercise with a simple one-equation model in [Additional file 1]. The statistical methods presented have important limitations, which we also briefly discuss.
Methods
A model M describes the deterministic dynamic behaviour of variables z(t) = (z_{1}(t), ..., z_{ K }(t))^{ T }by a system of differential equations dz_{ k }/dt = f_{ k }(z(t), t, θ), using parameters θ. The parameters may include initial values. Measurements y_{ ki } of each z_{ k } are taken at time points t_{ i }, i = 1, ..., N. For notational simplicity, we assume all variables are measured at all the time points, but the following discussion is easily extended to more general cases. An additive Gaussian measurement error ϵ ~ N (0, σ^{2}) affects the observations of the variables. For simplicity we assume here that the error distributions of the variables are independent and constant over time. They are also characterised by some variance σ^{2} which we consider to be part of the parameters.
where p_{ N }(y | , σ^{2}) is a Gaussian with mean and variance σ^{2}. Estimates _{ ki } are obtained by where are the solutions of differential equation system z_{ k }(t) = f_{ k }(z(t), t, ).
Model comparison methods involve two steps: first the model parameters need to be inferred from the available data, and then the adequacy of each calibrated model needs to be assessed. The classical and Bayesian approaches to each step are described next.
Bayesian inference
The posterior distribution of the parameters θ, given the data Y, is proportional to the likelihood P (Y | θ, M) of the parameters times the parameter prior P (θ | M), normalised by the likelihood or evidence P (Y | M) for model M. Samples from the posterior distributions of model parameters are routinely obtained by Markov Chain Monte Carlo (MCMC) methods [16]. Descriptions of the posterior distribution in terms of mean, median or variance are easily obtained from such samples. In the following, estimates of the 95% credible interval (CI), for example, are obtained by the cutoffs for the lowest 2.5% and highest 2.5% of samples.
where h(θ | M) is an arbitrary probability density function over parameters θ. The choice of a suitable function h(θ_{ s }| M) is crucial. If set to the prior we obtain the harmonic mean estimator which can perform very poorly due to its high variance. The variance problem is mitigated when a distribution is chosen which is close to the posterior distribution. Stability of the estimator is, for example, achieved by setting h(θ | M) to a multivariate Gaussian fitted to the sampled points θ_{ s }, or to a multivariate t-distribution as used here. The reciprocal importance sampler was shown to perform well [18] when compared to other model comparison methods such as reversible jump MCMC or simple information criteria like the BIC.
As we show in the simulations below the reciprocal importance sampler is suitable for the comparatively simple models which we investigate in this study. For more complex models, particularly with many modes, more complex model comparison algorithms might be required (see, for example, [14]). Such algorithms are harder to implement and run. The strategy we are suggesting here is to test via simulations whether a particular type of models is amenable to simple model selection procedures based on MCMC samples and only if this is not the case to develop more advanced methods.
Bayesian model comparison
where p(M_{ i }) is the prior probability of the model, and p(Y | M_{ i }) is the model evidence which we estimate here by equation 3.
for samples θ_{ s }from the posterior (for example, by MCMC simulation) and (y, M) = 1/N Σ_{ s }D(y, θ_{ s }, M) is the average deviance and = 1/N Σ_{ s }θ_{ s }is the vector of posterior parameter means. The lower the DIC the better the model. However, we find that the DIC calculations are often unstable resulting in completely unrealistic values, which might be less relied upon.
Once the probability of the model is known, we can select the most probable model from a set of competitive models using the Bayes factor BF = p(Y | M_{ i })/p(Y | M_{ j }). That is, the Bayes factor measures the extent by which the data increase the odds of M_{ i }to M_{ j }. Standard cutoffs for interpreting the significance of BFs, like [21], then allow interpretation of the result. Basically, a BF above 1 provides weak, above 3 substantial, above 10 decisive, and above 100 overwhelming evidence for model M_{ i }over M_{ j }.
The effective degree of freedom measures how many and by how much parameters are constrained by the data. On one hand, each parameter contributes close to one degree if the width of its posterior is small compared to the width of its prior. On the other hand, a parameter contributes very little to the overall effective degrees of freedom if it is not well constrained by the data and the width of its posterior hardly differs from the width of its prior. Consequently, the Bayes factor or effective degrees of freedom cannot eliminate or penalise spurious parameters which are ill determined by the data. It might be dubious to invoke Occam's principle once more (after it has already been incorporated in the model evidence) to decide between models and only additional data should be used for final clarification. For purely pragmatic reasons of convenience one might still accept the model with formally fewer parameters even though it has the same model evidence and same effective degrees of freedom as a more complex one.
Frequentist inference
Model selection as hypothesis testing
where p_{ i }is the number of parameters in model i and θ_{ML} is the value of θ that maximises the likelihood in 1. Standard tables exist for assessing the significance of AIC values [22].
Implementation
Five parallel MCMC chains were run for each model. Each chain consisted of 40,000 iterations (20,000 iterations in the simple one-equation model shown in [Additional file 1]). The first 20,000 samples (10,000 samples in the simple one-equation model) of each chain were discarded and then one every 10 iterations was stored. Convergence was assessed by applying the statistic described in [16] to the five parallel runs. This statistic was below 1.05 in all estimations, except otherwise stated, indicating good convergence behaviour. Evaluation of one model takes a few minutes on a standard desktop machine. Monitoring and analysis of MCSim runs was performed with the statistical R software [24]. Maximum likelihood values were obtained by taking the highest value from the MCMC runs.
Results
Here we present the model comparison results for different motif architectures (section Common network motifs), and for different parameterizations of the same FF motif (section Variants of feed forward motif). In each case, the models are introduced first, and the statistical comparisons described secondly in terms of Bayesian model evidence, DIC, effective degrees of freedom, and the maximum likelihood value from the MCMC chains. Statistical comparisons are here supported with the use of simulated data. This step is fundamental since it allows us to evaluate the performance of each approach before applying it to experimental data.
Common network motifs
Models
Analysis
Time series data were simulated from each of the models in tables 1 and [Additional file 1: supplementary table S4]. Parameter values were S = {0, 1, 2, 0} at times t = {0, 2, 6, 10}, and β_{ y }= α_{ y }= 1. Following [28], the coefficient and activation threshold in Hill functions were set to h = 2 and θ = 0.5, respectively. These are not experimentally determined parameter values. They are values required to yield biologically plausible solutions, a strategy also applied in [4]. Finally, 30 equally spaced data points were sampled from each time course, and a Gaussian error term with mean 0 and standard deviation 0.05 was added to simulate measurement errors.
The choice of parameter priors is critical in Bayesian model comparison. To make comparison as fair as possible, the same distribution was chosen for all rate constants in the simulations below, namely a log-normal distribution with mean 0 and standard deviation 1 in log-space. Finally, for the prior on the noise variance σ^{2} an inverse Gamma distribution with shape a = 0.5 and scale b = 0.05 was chosen (a fairly broad and heavy-tailed prior, the mean does not exist and the mode is at 0.1/3).
Model comparison results from simple network motifs.
data source | measure | SIM | RC | FF | FB |
---|---|---|---|---|---|
SIM | log p(Y | M_{ i }) | 89.3 | 73.74 | 89.6 | 49.03 |
DIC | -198.1 | -192.5 | -198.8 | -177.51 | |
pD | 3.93 | 2.94 | 4.01 | 4.34 | |
102.99 | 102.44 | 103.45 | 100.70 | ||
AIC | -197.98 | -196.88 | -196.9 | -191.4 | |
RC | log p(Y | M_{ i }) | 29.21 | 87.61 | 73.58 | 55.38 |
DIC | -86.17 | -194.60 | -187.13 | -175.21 | |
pD | 4.08 | 3.92 | 4.53 | 4.66 | |
47.18 | 101.22 | 100.46 | 97.62 | ||
AIC | -86.36 | -194.44 | -190.92 | -185.24 | |
FF | log p(Y | M_{ i }) | 80.20 | 57.60 | 93.43 | 22.95 |
DIC | -184.7 | -153.1 | -208.8 | -131.53 | |
pD | 4.06 | 3.92 | 4.81 | 5.01 | |
96.42 | 81.03 | 109.17 | 77.64 | ||
AIC | -184.84 | -154.06 | -208.34 | -145.28 | |
FB | log p(Y | M_{ i }) | -17.60 | -13.93 | -39.68 | 79.07 |
DIC | 2351.3 | 2718.1 | 2375.8 | -181.37 | |
pD | 4.04 | 3.66 | 4.61 | 4.98 | |
-1171.59 | -1355.33 | -1176.64 | 95.62 | ||
AIC | 2351.2 | 2718.66 | 2363.26 | -181.24 |
Bayesian model evidence values clearly favour the true model when data is generated from the RC motif (second row in table 2). Note that the maximum likelihood also favours the correct option, although here the relative difference between RC and FF is smaller than the differences in model evidence. When data from the more complex models FF and FB are used (third and fourth rows), the models that achieve the best fit are the true models FF and FB respectively, no matter the statistical measure chosen. Note that the effective degrees of freedom are close to the correct number of five for the FF data as opposed to four for the SIM data when fitting the FF model.
We performed the same analysis assuming the motif models could be defined with cooperative production terms (see [Additional file 1: supplementary table S4]). Bayesian model evidence always favours the correct model ([Additional file 1: supplementary table S5]). The results from of the AIC are less conclusive, which favours the wrong model (FB) in at least one case (RC).
Variants of feed forward motif
Models
where g describes the function (or gate type) that integrates the signals from S and y on the promoter of z [28]. For an AND gate integrating two activating signals, we have g = f^{+}(S, θ_{ Sz }, h)f^{+}(y, θ_{ yz }, h), while for an activating and a repressing signal, we have g = f^{+}(S, θ_{ Sz }, h)f^{-}(y, θ_{ yz }, h).
FF behaviours have been experimentally explored for the coherent (AND gate), coherent (OR gate) and incoherent (AND gate) subtypes in the arabinose, flagella, and galactose systems of E. coli respectively [5–7]. Interestingly, the time series data available allows following the FF component z (figure 2) as the system is switched ON and OFF, in comparison to a non-FF control system. Below we present a description of each control and FF instance, and a formal statistical analysis of the data.
Arabinose system
where AraC is abbreviated as AC, and the product of gene araBAD is abbreviated as AB. Since experiments were performed under saturating arabinose levels, all the AraC protein produced is assumed to be active. Experiments also revealed that the basal rates of AC and AB production, B_{ ac }and B_{ ab }, are small compared to the levels reached upon activation [5], thus they are set to zero in the parameter inference exercise. The ON step consists of a constant signal, CRP = 1, and initial conditions AC(0) = 0 and AB(0) = 0, while the OFF step is defined as CRP = 0, and AC(0) = 1, AB(0) = 1.
Control model for arabinose system
This control model is set up as follows: an ON step consists of CRP = 1, lactose = 1, LZ(0) = 0, and an OFF step consists of CRP = 0, lactose = 1, LZ(0) = 1. As before, B_{ lz }was assumed to be 0. In terms of the target component z, the only difference between this control model and a coherent FF motif is the fact that the two z inputs are independent.
Flagella system
The flagella biosynthesis network of E. coli is the system that governs the swimming capabilities of the bacteria, in such a manner that it moves away from its current location when growth conditions become mildly unfavourable. As illustrated in figure 3b, the genes that produce the flagella motor are regulated by a FF motif (see [6] and references therein). The components of this motif are the master regulator FlhDC (S), the intermediate regulator FliA (y), and a target operon (z) composed of a series of genes, fliLMNOPQR (hereafter abbreviated fliL). The two regulators are activators, thus forming a coherent FF, that converge upon fliL with OR input logic [30].
where FlhDC is abbreviated as FD, FliA as FA, and the product of gene fliL as FL. Under this model, an ON step is simulated by FD = 1, FA(0) = 0, and FL(0) = 0, while an OFF step is set up as FD = 0, FA(0) = 1, and FL(0) = 1.
Control model for flagella system
The initialization conditions are FD = 1, FL(0) = 0 for the ON step, and FD = 0 and FL(0) = 1 for the OFF step.
Additional flagella models
In this type of models, delay differential equation models, the concentration of FA used in the equation at time t is the one which existed earlier at time t - τ, where τ is an additional parameter that needs to be estimated. Intuitively, while the sign-sensitive delay behaviour is well explained by the type of logic gate used, only the incorporation of a time delay on y could explain the observed increase in the concentration of z upon an OFF signal step (see time series in [6]). This model uses the same initialisation conditions described for model FF.C1.OR.1.
Galactose system
In an incoherent FF motif, the two regulation paths flowing from the master regulator display opposite signs. In particular, in a type 1 incoherent FF motif, the branches from S to y and S to z are activating, but the intermediate branch from y to z is repressing (figure 2b). In [7] the dynamics for a FF.I1.AND motif were studied in E. coli cells in vivo in the network of genes in charge of galactose (gal) metabolism (figure 3c). Similarly to the ara system, the gal system is activated in the absence of glucose. Under these conditions, CRP simultaneously activates the galETK operon and the galS gene. GalS is a repressor of galETK that unbinds from the target promoter upon galactose binding. Thus, in contrast to the ara system, an ON step here consists in sensing the no-glucose signal cAMP in absence of the alternative sugar galactose. In the presence of galactose, the system becomes a coherent FF motif, reaching the maximal rate of expression of the galETK operon. According to [7], the biological explanation behind this counter-intuitive effect may be that cells prepare to use galactose as soon as they run out of glucose. The galactose catabolism machinery is produced at medium levels, allowing fast use of this alternative carbon source as soon as it becomes available, in which case the gal system is maximally activated.
where GalS is abbreviated as GS, and the product of gene galETK as GE. An ON step is set up by CRP = 1, GS(0) = 0, GE(0) = 0, while no OFF step was simulated because it was not provided in the original paper [7]
Control model for galactose system
where LI refers to the repressor LacI. The simulation conditions are CRP = 1, LI = 1, LZ(0) = 0.
Analysis
Here we first explore the different FF motif subtypes using simulated data generated from equations 15, 18 and 24 under the same conditions used in the experiments: the same signal function, number of observed variables and dataset size. The latter implies that the number of data points for each of the FF systems is: 60 for the arabinose system (30 in ON, 30 in OFF step), 37 for the flagella system (20 in ON, 17 in OFF step), and 15 for the galactose system (only ON step). Note that only one variable -z- is monitored here, and y is set as a hidden variable. The parameters were set to the same values as indicated in the theoretical work by [28], and parameter priors were the same as indicated in section Analysis.
MCMC model comparison results for the artificial FF datasets.
data source | measure | CONTROL | FF.C1.AND (Eqns 15) | FF.C1.OR.1 (Eqns 18) | FF.I1.AND (Eqns 24) |
---|---|---|---|---|---|
(Eqn. 17) | |||||
FF.C1.AND (Eqns 15) | log p(Y | M_{ i }) | 85.98 | 108.94 | 106.51 | 86.53 |
DIC | -217.98 | -388.91 | -292.81 | -205.64 | |
log p(Y | θ_{ ML }, M_{ i }) | 111.22 | 165.18 | 164.34 | 111.15 | |
AIC | -208.44 | -316.36 | -314.68 | -208.3 | |
FF.C1.OR.1 (Eqns 18) | log p(Y | M_{ i }) | (Eqn. 20) 40.12 | 30.90 | 48.09 | 42.55 |
DIC | -108.96 | -102.47 | -136.65 | -117.22 | |
log p(Y | θ_{ ML }, M_{ i }) | 62.73 | 60.06 | 86.94 | 69.97 | |
AIC | -111.46 | -106.12 | -159.88 | -125.94 | |
FF.I1.AND (Eqns 24) | log p(Y | M_{ i }) | (Eqn. 26) 12.52 | 8.28 | 13.78 | 14.02 |
DIC | -38.33 | -35.75 | -36.75 | -41.63 | |
log p(Y | θ_{ ML }, M_{ i }) | 26.66 | 25.69 | 30.09 | 30.86 | |
AIC | -39.32 | -37.38 | -46.18 | -47.72 |
MCMC model comparison results for the experimental FF datasets.
data source | measure | CONTROL | FF.C1.AND | FF.C1.OR.1 | FF.I1.AND |
---|---|---|---|---|---|
ara system | log p(Y | M_{ i }) | 51.22 | 74.99 | 72.31 | 51.88 |
DIC | -158.69 | -435.02 | -445.81 | -275.31 | |
log p(Y | θ_{ ML }, M_{ i }) | 83.99 | 140.05 | 147.06 | 83.99 | |
AIC | -157.98 | -264.10 | -278.12 | -151.98 | |
flagella system | log p(Y | M_{ i }) | -16.22 | -264.64 | -73.81 | -∞ |
DIC | -54.99 | -41275.70 | -2296.40 | 5878.25 | |
log p(Y | θ_{ ML }, M_{ i }) | 31.02 | -15.28 | -4.72 | -256.50 | |
AIC | -52.04 | 46.56 | 25.44 | 529.20 | |
gal system | log p(Y | M_{ i }) | -8.71 | -13.29 | -4.49 | -0.42 |
DIC | 10.53 | 9.88 | -3.53 | -20.47 | |
log p(Y | θ_{ ML }, M_{ i }) | -1.79 | -1.13 | 11.57 | 12.53 | |
AIC | 13.58 | 18.26 | -7.14 | -9.04 |
Assessment of additional flagella models for the flagella dataset [6], with MCMC.
Measure | CONTROL | FF.C1.OR.1 | FF.C1.OR.2 | FF.C1.OR.3 |
---|---|---|---|---|
log p(Y | M_{ i }) | -16.22 | -73.81 | -179.11 | -33.81 |
DIC | -54.99 | -2296.40 | 23.59 | -212.89 |
log p(Y | θ_{ ML }, M_{ i }) | 31.02 | -4.72 | 30.38 | 35.09 |
AIC | -52.04 | 23.44 | -46.76 | -54.18 |
Finally, in the case of data from the gal system, all approaches agree that the data are drawn from a FF.I1.AND model (the Bayes factor of FF.I1.AND to model FF.C1.OR.1 is 58.55, which can be interpreted as decisive evidence in favour of the incoherent FF motif according to [21]), which agrees with current knowledge [7].
Discussion
When building a model for a biological system, one has to decide whether to use parameter values from the literature or estimate them from the data. While the first option may be very useful, one has to bear in mind the limitations that extrapolating parameter values from other systems and experimental conditions have [31]. Thus, when data is available for the system under study, parameter inference becomes an interesting strategy. However, knowledge about biological systems is often sparse, and thus more than one model structure is often compatible with the system of interest. That is, more than one model structure could potentially be calibrated. In order to explore which model is preferable given the available data, a natural step after parameter inference consists in formally assessing the validity of all possible -now parameterized- candidate models. Despite its importance, this step is often overlooked.
Here we have provided a short overview of the Bayesian and frequentist approaches to model comparison. Then, we have applied the model comparison techniques to two cases. First, we have investigated the identifiability of a series of transcription regulation motif architectures (SIM, RC, FF and FB). The objective was to find out if one could infer the correct underlying model structure given time series data from each of these motifs (tables 2 and [Additional file 1: supplementary table S5]). For the case of the nested models, SIM and FF, when the data were generated by the SIM model, models SIM and FF have about equal model evidence. This is expected since the FF model can mimick the SIM model and the additional parameter can not be estimated from the data. From a pragmatic point of view one might accept the SIM model in this case. Note that AIC failed to identify the correct model in this case. For all the other datasets, model evidence, DIC and AIC favoured the correct model.
Secondly, it is known that the same model architecture can give rise to different dynamics depending on the particular model parameterization. To explore this issue, we have focused on the FF motif, an architecture for which extensive experimental caracterization has been carried out during the past years [5–7], following the description of different FF subtypes (figure 2) with different dynamic properties [28]. In [5–7] it is shown that the predicted behaviour of FFs is indeed observed in vivo. That is, their functionality is conserved even when they are embedded in large genetic systems. To address the question whether the biological signals are strong enough for the specific type of model parameterization to be identified from experimental data, we have analyzed such data under a series of candidate FF subtypes.
The motivation behind the original papers [5–7] was to compare each FF subtype to its non-FF control in a qualitative manner. We have formalized this comparison and have taken the analysis one step further discriminating each FF subtype from the others. Analysis of artificial data indicates that the experiments should be informative enough for the coherent FF subtypes to be differentiated from their controls, but also from each other. Identification of the incoherent subtype seems to be harder than the rest given the available data (table 3). Comparison of each FF model with its corresponding control model given the experimental datasets (table 4) shows that the Bayesian framework agrees with the conclusions that Alon and colleagues derived from visual inspection of the plots in all but the flagella network case [5–7].
While the cis-regulatory functions involved in the flagella gene network are known [30], no mathematical model is given in [6] to describe the experimental data corresponding to this system. Flagella system models including time delay effects have been defined here based on current biological knowledge. Bayesian model evidence still points towards the simpler control model without an additional FF branch as the best explanation for the data, but the AIC indicates that a delay model might be plausible. The importance of a delay element hints towards involvement of further unobserved components in the motif.
It could be argued that the body of information assumed available to generate the dataset used is so large that no model uncertainties remain. We wish to stress that embarking on a model comparison exercise is a way to make sure that all relevant mechanisms have been accounted for. Therefore, model comparison strategies should be regarded as complementary to and dependent on experimental work, rather than as standalone techniques.
Conclusions
We have given an overview of model comparison methods suitable for selecting a plausible network motif structure among a set of candidate models for time series data on gene regulation. We show that it is practical to apply maximum likelihood as well as Bayesian model comparison procedures to test ideas about underlying mechanisms of biological pathways in a formal and quantitative way.
Declarations
Acknowledgements
This work was mainly supported by The Wellcome Trust's Functional Genomics Integrated Thematic Programme, and primarily carried out in the School of Crystallography at Birkbeck College (London, UK). The authors wish to acknowledge the Biostatistics Unit at the MRC (Cambridge, UK) for covering the publication costs. NDP acknowledges current funding by the Instituto de Salud Carlos III (REEM project RD07_0060_0017), the Ministerio de Ciencia e Innovación of Spain (FIS2009-13360), and the Generalitat de Catalunya (AGAUR 2009SGR1168).
Authors’ Affiliations
References
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804. 10.1126/science.1075090View ArticlePubMedGoogle Scholar
- Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics. 2002, 31: 64-68. 10.1038/ng881View ArticlePubMedGoogle Scholar
- Alon U: An introduction to Systems Biology. Design principles of biological circuits. 2007, Chapman & Hall/CRC Mathematical & Computational BiologyGoogle Scholar
- Tyson JJ, Chen KC, Novak B: Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Current Opinion in Cell Biology. 2003, 15 (2): 221-231. 10.1016/S0955-0674(03)00017-6View ArticlePubMedGoogle Scholar
- Mangan S, Zaslaver A, Alon U: The Coherent Feedforward Loop Serves as a Sign-sensitive Delay Element in Transcription Networks. Journal of Molecular Biology. 2003, 334 (2): 197-204. 10.1016/j.jmb.2003.09.049View ArticlePubMedGoogle Scholar
- Kalir S, Mangan S, Alon U: A coherent feed-forward loop with a SUM input function prolongs flagella expression in Escherichia coli. Molecular Systems Biology. 2005, 1: 2005.0006Google Scholar
- Mangan S, Itzkovitz S, Zaslaver A, Alon U: The Incoherent Feed-forward Loop Accelerates the Response-time of the gal System of Escherichia coli. Journal of Molecular Biology. 2006, 356 (5): 1073-1081. 10.1016/j.jmb.2005.12.003View ArticlePubMedGoogle Scholar
- Zaslaver A, Bren A, Ronen M, Itzkovitz S, Kikoin I, Shavit S, Liebermeister W, Surette M, Alon U: A comprehensive library of fluorescent transcriptional reporters for Escherichia coli. Nature Methods. 2006, 3 (8): 623-628. 10.1038/nmeth895View ArticlePubMedGoogle Scholar
- Cohen AA, Geva-Zatorsky N, Eden E, Frenkel-Morgenstern M, Issaeva I, Sigal A, Milo R, Cohen-Saidon C, Liron Y, Kam Z, Cohen L, Danon T, Perzov N, Alon U: Dynamic proteomics of individual cancer cells in response to a drug. Science. 2008, 322 (5907): 1511-1516. 10.1126/science.1160165View ArticlePubMedGoogle Scholar
- Segal E, Barash Y, Simon I, Friedman N, Koller D: From promoter sequence to expression: a probabilistic framework. Proceedings of the Sixth Annual International Conference on Computational Molecular Biology (RECOMB). 2002Google Scholar
- Brazhnik P, de la Fuente A, Mendes P: Gene networks: how to put the function in genomics. TRENDS in Biotechnology. 2002, 20 (11): 467-472. 10.1016/S0167-7799(02)02053-XView ArticlePubMedGoogle Scholar
- Cavalier G, Anastassiou D: Phenotype analysis using network motifs derived from changes in regulatory network dynamics. Proteins: Structure, Function and Bioinformatics. 2005, 60 (3): 525-546. 10.1002/prot.20538.View ArticleGoogle Scholar
- Swameye I, Müller TG, Timmer J, Sandra O, Klingmüller U: Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by databased modeling. PNAS. 2003, 100 (3): 1028-1033. 10.1073/pnas.0237333100PubMed CentralView ArticlePubMedGoogle Scholar
- Vyshemirsky V, Girolami MA: Bayesian ranking of biochemical system models. Bioinformatics. 2008, 24 (6): 833-839. 10.1093/bioinformatics/btm607View ArticlePubMedGoogle Scholar
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian data analysis. 2004, Chapman & Hall/CRC Texts in Statistical ScienceGoogle Scholar
- Gilks WR, Richardson S, Spiegelhalter DJ: Markov Chain Monte Carlo in Practice. 1996, Chapman & Hall/CRC Interdisciplinary Statistics SeriesGoogle Scholar
- Gelfand AE, Dey DK: Bayesian model choice: asymptotic and exact calculations. Journal of the Royal Statistics Society Series B (Methodological). 1994, 56 (3): 501-514.Google Scholar
- Miazhynskaia T, Dorffner G: A comparison of Bayesian model selection based on MCMC with an application to GARCH-type models. Statistical Papers. 2006, 47: 525-549. 10.1007/s00362-006-0305-z.View ArticleGoogle Scholar
- Spiegelhalter DJ, Best NG, Carlin BP, Linde van der A: Bayesian measures of model complexity and fit. Journal of the Royal Statistics Society Series B. 2002, 64 (4): 583-639. 10.1111/1467-9868.00353.View ArticleGoogle Scholar
- Ripley BD: Pattern Recognition and Neural Networks. 1996, Cambridge University PressView ArticleGoogle Scholar
- Kass RE, Raftery AE: Bayes Factors. Journal of the American Statistical Association. 1995, 90 (430): 773-795. 10.2307/2291091.View ArticleGoogle Scholar
- Burnham KP, Anderson DR: Model Selection and Multimodel Inference: a Practical Information-Theoretic Approach. 2002, New York: Springer-Verlag, 2Google Scholar
- Bois FY, Maszle DR: MCSim: A Monte Carlo Simulation Program. Journal of Statistical Software. 1997, 2 (9): 1-60.View ArticleGoogle Scholar
- The R Project for Statistical Computing. http://www.R-project.org
- Ronen M, Rosenberg R, Shraiman BI, Alon U: Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proceedings of the National Academy of Sciences of the USA. 2002, 99 (16): 10555-10560. 10.1073/pnas.152046799PubMed CentralView ArticlePubMedGoogle Scholar
- Alon U: Network motifs: theory and experimental approaches. Nature Reviews Genetics. 2007, 8 (6): 450-461. 10.1038/nrg2102View ArticlePubMedGoogle Scholar
- Batchelor E, Loewer A, Lahav G: The ups and downs of p53: understanding protein dynamics in single cells. Nature Reviews Cancer. 2009, 9 (5): 371-377. 10.1038/nrc2604PubMed CentralView ArticlePubMedGoogle Scholar
- Mangan S, Alon U: Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences of the USA. 2003, 100 (21): 11980-11985. 10.1073/pnas.2133841100PubMed CentralView ArticlePubMedGoogle Scholar
- Ma HW, Kumar B, Ditges U, Gunzer F, Buer J, Zeng AP: An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Research. 2004, 32 (22): 6643-6649. 10.1093/nar/gkh1009PubMed CentralView ArticlePubMedGoogle Scholar
- Kalir S, Alon U: Using a quantitative blueprint to reprogram the dynamics of the flagella gene network. Cell. 2004, 117 (6): 713-720. 10.1016/j.cell.2004.05.010View ArticlePubMedGoogle Scholar
- Omlin M, Reichert P: A comparison of techniques for the estimation of model prediction uncertainty. Ecological Modelling. 1999, 115: 45-59. 10.1016/S0304-3800(98)00174-4.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.