Predicting network modules of cell cycle regulators using relative protein abundance statistics
 Cihan Oguz^{1}Email authorView ORCID ID profile,
 Layne T. Watson^{2, 3, 4},
 William T. Baumann^{5} and
 John J. Tyson^{1}
DOI: 10.1186/s1291801704091
© The Author(s) 2017
Received: 23 September 2016
Accepted: 17 February 2017
Published: 28 February 2017
Abstract
Background
Parameter estimation in systems biology is typically done by enforcing experimental observations through an objective function as the parameter space of a model is explored by numerical simulations. Past studies have shown that one usually finds a set of “feasible” parameter vectors that fit the available experimental data equally well, and that these alternative vectors can make different predictions under novel experimental conditions. In this study, we characterize the feasible region of a complex model of the budding yeast cell cycle under a large set of discrete experimental constraints in order to test whether the statistical features of relative protein abundance predictions are influenced by the topology of the cell cycle regulatory network.
Results
Using differential evolution, we generate an ensemble of feasible parameter vectors that reproduce the phenotypes (viable or inviable) of wildtype yeast cells and 110 mutant strains. We use this ensemble to predict the phenotypes of 129 mutant strains for which experimental data is not available. We identify 86 novel mutants that are predicted to be viable and then rank the cell cycle proteins in terms of their contributions to cumulative variability of relative protein abundance predictions. Proteins involved in “regulation of cell size” and “regulation of G1/S transition” contribute most to predictive variability, whereas proteins involved in “positive regulation of transcription involved in exit from mitosis,” “mitotic spindle assembly checkpoint” and “negative regulation of cyclindependent protein kinase by cyclin degradation” contribute the least. These results suggest that the statistics of these predictions may be generating patterns specific to individual network modules (START, S/G2/M, and EXIT). To test this hypothesis, we develop random forest models for predicting the network modules of cell cycle regulators using relative abundance statistics as model inputs. Predictive performance is assessed by the areas under receiver operating characteristics curves (AUC). Our models generate an AUC range of 0.830.87 as opposed to randomized models with AUC values around 0.50.
Conclusions
By using differential evolution and random forest modeling, we show that the model prediction statistics generate distinct network modulespecific patterns within the cell cycle network.
Keywords
Parameter optimization Differential evolution Ensemble modeling Machine learning Random forests Budding yeast Cell cycle Systems biologyBackground
In systems biology research, mathematical models of sufficient predictive power allow researchers to interrogate biological systems under a wide variety of experimental conditions that may be difficult to achieve in the laboratory. Such insilico experiments may lead to discoveries that affect life in important ways, for example, in understanding the molecular basis of certain diseases and in designing drugs for their treatment [1, 2]. What makes a model reliably predictive? Before using a model for predictive purposes, it is essential to show that the model is capable of reproducing major known experimental trends. In other words, incorporation of experimental data into a model by parameter optimization is a critical first step. Due to limitations in direct experimental measurements of kinetic parameters, a common approach is to estimate all unknown model parameters by minimizing the difference between model simulations and experimental data [3]. This approach often generates a set of parameter vectors with equivalent (or comparable) performance. Such parametric uncertainty can be used to advantage by extracting information about critical and dispensable parts of a model using global sensitivity analysis or identifying the most informative future experiments. This information can be used to constrain the model’s parameters [4] or to refine the model’s structure [5].
Creating an ensemble of parameter vectors with similar (or identical) performance (with respect to a known set of experimental observations) is especially useful when one would like to predict the potential outcome(s) of novel experimental designs. We refer the reader to [6] for a comprehensive survey of experimental design studies (with an emphasis on objective function formulations) from several fields including systems biology. More recent work in the area of experimental design within systems biology includes a study that compares the performances of several alternative methods with and without a predetermined network topology [7] and a novel framework for model selection implemented for both stochastic and deterministic models [8].
In the literature, “ensemble modeling” is a common term used to describe studies of multiple models [5, 9, 10] or a single model with multiple parameter vectors [11]. Here, we focus on the latter case with a complex model of the budding yeast cell cycle (more than 100 model parameters). Of special interest to us is parameter space exploration with a discontinuous objective function that is the sum of many discrete constraints. Recent work in ensemble modeling includes using simulated annealing with a multi objective function to extract robust and fragile model features [12], implementation of Metropolis Monte Carlo and multi ellipsoidal sampling [11], exploration of parameter space by adaptive sparse grids with control objectives [13], and identifying model fragilities with random walks [14]. More recently, ensembles of parameter vectors were generated to understand parameter adaptations underlying phenotypic transitions [15] with an application in pharmacological intervention [16]. In [17], Rumschinski introduced a setbased framework for detecting incorrect model hypotheses and refining parameter estimates with the help of infeasibility certificates and a bisection algorithm that identifies parts of parameter spaces consistent with incomplete and noisy experimental data. This approach was illustrated using two simple models with four species and 3–5 parameters. More recently, RodriguezFernandez et al. implemented a mixedinteger nonlinear programming (MINLP) formulation to simultaneously perform model selection and parameter estimation using in silico generated data of homeostasis in E. coli [18]. For this biological system, the authors identified the best model among 1700 nested models in a computationally efficient manner rather than fully analysing each candidate model separately. Starting with 21 model parameters, the resulting solution showed that parameters were precisely estimated, while identifiability issues and scalability to models of larger complexity were mentioned as limitations of this model identification approach [18].
A common element in these ensemble modeling studies is the use of timeseries data for optimizing parameters and for exploring the parameter space for alternative “feasible” vectors that provide acceptable fits to the data. Here, we use an ensemble modeling methodology for complex models when the constraining data are not quantitative timeseries of model variables (which are often unavailable in experimental studies of cell physiology) but discrete qualitative observations (in our case, the observed phenotypes of many different yeast strains carrying mutations of cell cycle genes). In addition, the model we consider is much more complex, with many more adjustable parameters and much more experimental data, than the models studied in the work cited above.
Ensemble modeling with qualitative constraints has recently been explored by Pargett et al. [19], who combined “optimal scaling” and gradientbased multiobjective optimization for incorporating a heterogeneous set of experimental constraints into ODE models of stem cell regulation in Drosophila. Starting from a core model with 10 states and 18 unknown parameters, the authors generated several additional models by considering alternative connections between components of the regulatory network. Following the parameter optimization step, experimental design was implemented (based on ranking the predictive variances of measurements) in order to decrease the uncertainty of model parameter values and model structure. Each candidate model was represented by ensembles of optimal parameter vectors and Pareto optimality was used for comparing model performance and for identifying informative experiments.
In [20], temporal logics (typically used with discrete models) was implemented to express the dynamical features of a continuous (ODEbased) model of an enzymatic reaction network involved in cancer. Furthermore, global robustness and sensitivity analysis was used for identifying the boundaries between distinct regions of the model’s parameter space (producing different states such as stable steady states and oscillations) and for generating several novel biological insights regarding system’s dynamics [20].
For a recent review on the use of qualitative data for estimating the parameters of continuous models, we refer the reader to [21]. This review covers the application of alternative data normalization techniques depending on the nature of the experimental data at hand (qualitative vs. quantitative), formulation of multiobjective optimization using heterogeneous experimental data sets, and Pareto optimality based analysis of tradeoffs between such multiple objectives.
The proposed approach in this paper extends our recent work on parameter optimization of a complex model of the budding yeast cell cycle [22]. Starting from an ensemble of optimally performing parameter vectors, we propose several ways to explore the parameter space for more such vectors. In this search, our aim is to find parameter vectors with diverse predictions (i.e., an extended range of predictions for the phenotypes of novel genetic strains). We demonstrate that differential evolution (DE) [23], which is a metaheuristic method, can effectively find feasible parameter vectors with extended predictive ranges provided an additional feasibility criterion (in addition to the criterion of optimal model performance) is enforced so that the search does not get stuck in a small region of parameter space. We show how DE can be forced to widen the range of predictions during the search for optimal parameter vectors.
The application of DE in similar contexts include [24] in which DE is hybridized with Kalman Filter for improving the parameter estimation accuracy compared to pure DE and genetic algorithm (GA) based approaches. In [24], simple models of glycolysis and the cell cycle, with artificially generated noisy time series data, are used to demonstrate the improved performance of the hybrid approach. More recently, the 18 parameters of an ODEbased dynamic model of endocytosis are optimized with several metaheuristic methods including DE under different observability settings (complete vs. incomplete observability of system variables), multiple levels of measurement noise, and with real and artificially generated time series data [25]. In this study, DE turned out to be the best performer in terms of estimation accuracy and convergence speed while practical parameter identifiability problems suggested the need for additional experimental data to further constrain the model’s parameters. Recent studies on the use of metaheuristic methods in a wide range of science and enginnering applications are surveyed in [26] with more than 200 references (including the applications of several DE variants). An earlier review paper focuses on the application of metaheuristic methods to systems biology problems [27] including experimental design [28–30] and parameter identifiability [31–33].
Methods
Problem formulation
The cell cycle is the ordered sequence of events that govern cell growth, replication of the cell’s genome, and division into two daughter cells that are capable of repeating this cycle in successive generations [34, 35]. The four phases of the cell cycle are DNA synthesis (S phase) and mitosis (M phase) separated by two gaps (G1 and G2). G1, S, G2 and M phases progress sequentially in a repeated manner, which is crucial to maintaining a constant number of chromosomes per cell after each cycle of DNA replication and cell division. Furthermore, the duration of a single cell cycle (i.e., from birth to division) has to be balanced (on average) with the time needed for doubling the amounts of all other cellular components. If this condition is not met (i.e., the mass doubling time is substantially different from the cell cycle time), then average cell size becomes progressively smaller or larger leading to cell death. In addition, a number of “checkpoints” prevent G1SG2M progression in cases such as DNA damage or improper alignment of replicated chromosomes on the mitotic spindle. All of these features of cell cycle progression are controlled by the periodic activation of cyclindependent kinases (CDKs) [34]. Since the fundamental molecular mechanisms governing the activation of CDKs are similar among all eukaryotes, an improved understanding of cell cycle controls has potential benefits far beyond the intrinsic challenge of unraveling this complex molecular control system.

START module: START (or G1/S transition) is an event in G1 phase when a new round of DNA synthesis and mitosis are committed by a cell. The most critical step of the START transition is the translocation of Whi5, a stoichiometric inhibitor of SBF and MBF (transcription factors of Cln2 and Clb5 synthesis modeled as a single variable named SBF), from nucleus to cytoplasm. In early G1, SBF is not active since it is inhibited by Whi5. As the cells grow, Cln3 and Bck2 concentrations rise enough to phosphorylate Whi5 (inhibitor of SBF), and as a result SBF becomes active, promoting Cln2 and Cln5 synthesis. Increasing concentrations of Cln2, Cln3, and Clb5 support progression of bud emergence.

S/G2/M module: Increasing Cln2 concentration following the START transition leads to phosphorylation and degradation of CKI. As a result of this, Clb5 is released. The active form of Clb5 promotes DNA synthesis, further inhibiting CKI through phosphorylation. Cln2 and Clb5 inhibit Cdh1 (responsible from Clb2 degradation) and Clb2 concentration increases resulting in the activation of Mcm1 (transcription factor of Clb2), and further Clb2 accumulation. By phosphorylating and inactivating SBF, Clb2 also halts the synthesis of Cln2 and Clb5 and the cells get ready for mitotic exit. Activation of APC by Clb2 and the cooperation of APC with Cdc20 are some of the key steps required for metaphaseanaphase transition and mitotic cyclin degradation. For Clb2 and Clb5 to be degraded, APC has to be phosphorylated and spindle assembly checkpoint needs to be released. Both of these processes are driven by Clb2.

EXIT module: Activation of Cdc14 is the most critical event in the EXIT module since it is essential for exit from mitosis and return to G1 state. Cdc14 dephosphorylates several proteins previously phosphorylated by CDKs in S/G2/M, thereby leading to the activation of Cdh1 and CKI, as well as the repression of Clb2 and Clb5. Two pathways, namely FEAR (Cdc fourteen early anaphase release) and MEN (mitotic exit network), are involved in the activation of Cdc14. The release of Esp1 from Pds1 (through Cdc20 activity) in the FEAR pathway leads to chromatid separation and phosphorylation of Net1. As a result, Cdc14 is released from Net1:Cdc14 complex and free Cdc14 drives exit from mitosis. In order for budding yeast cells to return to G1 state by the robust phosphorylation of Net1, the FEAR pathway is supported by the MEN pathway through the activation of Cdc15 and Tem1 that form a complex (MEN). This results in the full release of Cdc14, activation of Cdh1, complete degradation of Clb2, as well as the stabilization of CKI and a fully restored G1 phase.
In [22], starting from an initial parameter vector that captured 72 of the 119 experimental phenotypes in the Training Set, we improved the number of captured phenotypes to 111. In the process, the optimization algorithm produced more than 3000 parameter vectors that captured the same 111 phenotypes of the Training Set. We call this collection an ensemble of “feasible” parameter vectors. (The ranges of model parameter values in this ensemble are given in Additional file 1: Tables S3 and S4.) In this paper, our goal is to extend the ensemble of feasible parameter vectors identified by [22] to maximize the range of model predictions for a specific group of novel mutant strains (the Prediction Set). These mutants were not included in the Training Set because their phenotypes have not yet been characterized experimentally.
Phosphorylation and dephosphorylation reactions that induce synthetic lethality upon their elimination
Eliminated reaction  Single mutation strains that are viable before and 

(rate constant)  inviable after setting the rate constant to zero 
Whi5 phosphorylation by Bck2 (k p _{ i5k2})  cln3 Δ, Multicopy BCK2, cdh1 Δ, sic1 Δ, swi5 Δ, 
CLB5db Δ, net1ts, GALCLB2, APCA  
CKI phosphorylation by Cln2 (e _{ k i,n2})  bck2 Δ, GALSIC1, net1ts, APCA 
CKI phosphorylation by Clb2 (e _{ k i,b2})  GALCLN3, cdh1 Δ, GALCLB5, CLB1 clb2 Δ 
CKI dephosphorylation by Cdc14 (k d p _{ k i,14})  bck2 Δ, cdh1 Δ, GALCLB2, APCA 
Whi5 phosphorylation by Cln3 (k p _{ i5n3})  bck2 Δ, cdh1 Δ, APCA 
SBF phosphorylation by Clb2 (k p _{ b f b2})  cdh1 Δ, CLB5db Δ, APCA 
Whi5 phosphorylation by Cln2 (k p _{ i5n2})  bck2 Δ, APCA 
Whi5 dephosphorylation by Cdc14 (k d p _{ i514})  APCA 
Net1 dephosphorylation by PPX (k d p _{ n e t,p x })  Multicopy CDC15 
The range of model predictions
where \( {p}^{(i)}_{j}\in \left \{ {0, 1, 2}\right \}\) characterizes the phenotype for the jth novel genetic strain for the ith parameter vector and l is the total number of novel strains. Phenotype values are set according to the following rules. If, during the simulation of a novel strain, cell size exceeds 25 (arbitrary units) at any time, then the strain’s phenotype is inviable (\({p}^{(i)}_{j}=2\)). On the other hand, if cell size at last division is within 5% of the cell sizes at the two previous divisions, then the phenotype is viable (\({p}^{(i)}_{j}=1\)). Finally, if the model generates cycles of multiple periodicity and cell size at division oscillates between values that differ by more than 5%, then the phenotype is “multiply periodic” (\({p}^{(i)}_{j}=0\)). The number S(P) of unique rows in P is defined as the range of the prediction vectors in P. As we explore different schemes for computing prediction matrices, we compute S values for the ensembles created by these schemes. For each ensemble generation scheme, the sampling efficiency (e _{ S }) is computed as S/n _{ tot }, where n _{ tot } is the total number of samples taken from the parameter space. This measure allows us to compare different ensemble generation schemes based on the ranges of phenotypic predictions they produce.
Based on our previous study [22], which demonstrated that DE is an effective tool for exploring the parameter space of our high dimensional model given a discrete multi objective function (i.e, the number of phenotypes in the Training Set captured by the model), we continue using DE, this time for identifying the range of model predictions. While searching for an implementation of DE to meet this objective with efficient sampling, we encounter technical limitations with the standard implementation of DE that is typically used for parameter optimization, and we surmount these limitations by (i) improving the selection of the ensemble that serves as the starting point of DE, and (ii) adding new constraints to DE that force the method to search for feasible parameter vectors expanding the range of model predictions.
Differential Evolution
Let E ^{ D } denote real Ddimensional Euclidean space, and let x=(x _{1}, …, x _{ D })∈E ^{ D } be a vector of parameter values. The vector x includes both the 126 kinetic constants in the model and the 26 ODE initial conditions (D=152). For each vector x∈E ^{ D } proposed by the optimization algorithm, we calculate the phenotype \( {p}^{(i)}_{j}\in \left \{ {0, 1, 2}\right \}\) (for the jth strain for the ith parameter vector) for each of the 119 yeast strains in the Training Set. The objective function O(x) is an integervalued function that counts the number of phenotypes in the Training Set that are correctly captured by the model, given the parameter values in the vector x.
In DE, parameter vectors are propagated from generation to generation by processes of mutation, crossover, and selection. Each generation (indexed by t=0,1,…) consists of N parameter vectors x ^{(t,i)}. Hence, the real number \({x}^{(t,i)}_{j}\) is the value of the jth parameter in the ith parent in the tth generation. Let u ^{(t,i)} be the trial parameter vector born from the ith parent in the tth generation, whose components are constructed in two steps called “mutation” and “crossover”. Then, given the parent parameter vector x ^{(t,i)} and trial parameter vector u ^{(t,i)}, a decision is made as to which one is propagated to generation t+1.
 1.Mutation. First, for each i, 1≤i≤N, we create a “mutant” vector$$ v^{(t,i)}=x^{(t,i)}+F\cdot d^{(t,i)}=x^{(t,i)}+F\cdot \left(x^{(t,i')}x^{(t,i^{\prime\prime})}\right) $$(3)
by perturbing a parental parameter vector x ^{(t,i)}, where the perturbation vector d ^{(t,i)} is the difference between the parameter vectors of two distinct additional parents i ^{′} and i ^{″} chosen at random from the tth generation of parents, and 0<F<1 (F=0.1 in this study).
 2.Crossover. For each i (1≤i≤N) and j (1≤j≤D), and uniform [ 0,1] random variables U _{ i,j }, define the offspring by$$\begin{array}{@{}rcl@{}} {u}^{{(t,i)}_{j}}= \left\{ \begin{array}{rl} {v}^{(t,i)}_{j} &, {0} \leq {U}_{i,j} \leq {C}, \\ {x}^{(t,i)}_{j} &\text{, otherwise.} \end{array} \right. \end{array} $$(4)
We choose the “crossover probability” C=0.5 so that neither parental values nor mutant values are given an advantage during the crossover step.
 3.Selection. The next generation parent x ^{(t+1,i)} is either the parent x ^{(t,i)} or the trial vector u ^{(t,i)}. As DE explores the parameter space under different settings in this study, depending on the settings of the particular DE run, we impose three distinct feasibility criteria for selection, which are described below.

Feasibility Criterion 1 (F C _{1}): Trial vector u ^{(t,i)} satisfies F C _{1} if the model it defines captures the 111 phenotypes listed in Additional file 1: Table S7 out of the 119 phenotypes in the Training Set. F C _{1} is always enforced by DE for creating Ensembles 1 through 16 in Table 2. For each ensemble generation scheme, the efficiency of sampling in terms of identifying parameter vectors that satisfy F C _{1} (\(\phantom {\dot {i}\!}e_{{FC}_{1}}\)) is computed as \(n_{{FC}_{1}}/n_{tot}\phantom {\dot {i}\!}\), where \(\phantom {\dot {i}\!}n_{{FC}_{1}}\) is the number of parameter vectors that satisfy F C _{1} and n _{ tot } is the total number of samples taken from the parameter space.Table 2
Ensembles of feasible vectors generated with different schemes
Ensemble
Scheme
Ensemble
Selection of
Feasibility criteria
# generations
#
S
#
#
size
the initial DE
used in selection
per DE run
DE runs
(Range of
population from
step of DE
predictions)
Ensemble 1
1
3146

Parameter vectors
Ensemble generated

30
satisfy F C _{1}
in optimization [22]
2
1
243

Parameter vectors
Ensemble extracted

51
satisfy F C _{1}
from 50,000
LHS samples
3
2
7143
Randomly selected
F C _{1}
400
1
6
parameter vectors
4
3
1893
V _{ max }(10)
F C _{1}
400
1
41
5
4
1594
V _{ max }(10)
F C _{1} and F C _{2}
1600
1
64
6
4
1326
V _{ max }(10)
F C _{1} and F C _{2}
1600
1
69
7
5
3405
V _{ max }(123)
F C _{1} and F C _{2}
1600
1
94
8
5
3753
V _{ max }(123)
F C _{1} and F C _{2}
1600
1
80
9
6
2207
S _{ max } & V _{ max }(123)
F C _{1} and F C _{2}
1600
1
117
10
6
1842
S _{ max } & V _{ max }(123)
F C _{1} and F C _{2}
1600
1
95
11
7
3704
S _{ max } & V _{ max }(123)
F C _{1}, F C _{2}, and F C _{3}
1600
1
112
12
7
3481
S _{ max } & V _{ max }(123)
F C _{1}, F C _{2}, and F C _{3}
1600
1
133
13
8
4280
S _{ max } & V _{ max }(123)
F C _{1} and F C _{3}
1600
1
313
14
8
4550
S _{ max } & V _{ max }(123)
F C _{1} and F C _{3}
1600
1
367
15
7
15520
S _{ max } & V _{ max }(123)
F C _{1}, F C _{2}, and F C _{3}
2200
4
293
16
8
15050
S _{ max } & V _{ max }(123)
F C _{1} and F C _{3}
2200
4
671

Feasibility Criterion 2 (F C _{2}): F C _{2} requires that trial vector u ^{(t,i)} can only replace parent vector x ^{(t,i)} if u ^{(t,i)} leads to an expansion in the feasible region’s estimated volume. For this, we compute the estimated volumes of two Ensembles X _{1} and X _{2}. The first ensemble X _{1} consists of all the parent vectors of the current tth generation of DE (all satisfying F C _{1}) including x ^{(t,i)}. This ensemble excludes u ^{(t,i)} since it is not a parent vector. The second ensemble X _{2} includes u ^{(t,i)} in addition to all the parent vectors excluding x ^{(t,i)}. F C _{2} dictates that the trial vector u ^{(t,i)} can only replace x ^{(t,i)} if the estimated volume of the second ensemble is greater than the estimated volume of the first one (V(X _{2})>V(X _{1})). (We describe our approach for estimating the volume spanned by an ensemble of parameter vectors in Section 1 of the Additional file 2: Supplementary Text.) With ensemble creation Schemes 4 to 7 in Table 2, DE enforces F C _{2} together with F C _{1} so that a trial vector replaces the corresponding parent if and only if the trial vector that reproduces the 111 target phenotypes of the Training Set, and leads to an expansion in the feasible region’s estimated volume.

Feasibility Criterion 3 (F C _{3}): F C _{3} requires that trial vector u ^{(t,i)} can only replace parent vector x ^{(t,i)} if u ^{(t,i)} yields a prediction vector for the 129 mutant strains of the Prediction Set that has not been derived from any parent vector up through the tth generation of DE. In other words, if a trial vector u ^{(t,i)} satisfies F C _{1}, u ^{(t,i)} replaces its parent x ^{(t,i)} if and only if the prediction vector \(\hat {\mathbf {p}}\) generated by u ^{(t,i)} is not among the rows of the prediction matrix generated by all the parent vectors up through the point of generation of u ^{(t,i)}. For creating Ensembles 11, 12, and 15 in Table 2, DE enforced all three criteria so that a trial vector replaces the corresponding parent if and only if the trial vector defines a model that captures the 111 target phenotypes of the Training Set, leads to an expansion in the feasible region’s estimated volume, and produces a new phenotypic prediction vector for the 129 novel mutants in the Prediction Set. Ensembles 13, 14, and 16 are created by enforcing only first and the third criteria.

Results and discussion
Exploring the parameter space with Latin hypercube sampling
Our starting ensemble in this study is derived from the 3415 feasible parameter vectors identified in [22]. The size of this ensemble is reduced by 8% since only 3146 of these vectors are F C _{1}feasible when truncated to 32bit IEEE single precision. (We are eliminating parameter vectors that are very sensitive with respect to F C _{1}.) We call this collection of vectors “Ensemble 1”. (Throughout this paper, parameter vectors are considered feasible only if their truncated 32bit values are also feasible.) Applying Ensemble 1 to the Prediction Set, we generate 30 unique prediction vectors.
We explore this initial feasible region by Latin hypercube sampling (LHS). The bounds of the hypercube are formed by the minimum and maximum values of each parameter from Ensemble 1. 50,000 samples are generated as described in Section 2 of the Additional file 2: Supplementary Text. Out of these sample vectors, only 243 (0.5% of the total) are F C _{1}feasible. These feasible vectors form Ensemble 2, which produces 51 unique prediction vectors; a 70% improvement (51/30) in the total range of predictions (previously defined as the number of unique prediction vectors).
Exploring the parameter space with DE
The results of LHS point out the possibility of finding feasible parameter vectors with a wider range of model predictions compared to those of Ensemble 1. We next investigate the possibility of using DE to identify alternative feasible ensembles with wider prediction ranges.
First, we created an initial random selection of 19 parameter vectors from Ensemble 1. (The population size of 19 is dictated by computational limitations imposed by the complexity of the model and the size of the Training Set [22]). Starting from this initial population of parameter vectors, DE explores the parameter space with mutation, crossover, and selection operations (described in Methods). (Rather than maximizing the total number of captured phenotypes by the model as we did previously [22], we only look for parameter vectors that capture the the 111 phenotypes listed in Additional file 1: Table S7 while missing the remaining eight phenotypes (Additional file 1: Table S8). Such vectors are feasible according to F C _{1} as described earlier). In 400 generations, DE generates 7143 vectors (Ensemble 3 in Table 2) whose truncated 32bit values satisfy F C _{1}. Despite its large size, Ensemble 3 yields only six unique prediction vectors for the 129 strains in the Prediction Set.
Why did DE perform so poorly compared to LHS even though, in our previous study, it was superior to random sampling in optimizing model performance (capturing phenotypes in the Training Set)? The answer comes from a comparison of the volumes the parameter space that are spanned by Ensembles 2 and 3. Ensemble 3 has an estimated volume that is 83 orders of magnitude smaller than that of Ensemble 2. In other words, DE zooms into a much smaller region of parameter space than LHS.
The ten most critical model parameters
Parameter name 

Total amount of Cdc14 
SPN synthesis rate 
Total amount of Esp1 
Total amount of Net1 
Degradation rate of Cdc20 
PPX inactivation by Esp1 
Efficiency of Cdc14Net1 complex (RENT) formation 
Time scale for protein activation 
Net1 phosphorylation by Clb2 
Total amount of Mcm1 
Therefore, to prevent this drop in dynamic volume, we introduce a new constraint (F C _{2}) as described in the Methods section. To enforce F C _{2}, the estimated volumes of two distinct ensembles are computed every time a new trial vector that satisfies F C _{1} is found. The first ensemble includes all parameter vectors satisfying F C _{1} until that point of DE, except the newest trial vector generated. Hence, this ensemble includes the trial vector’s competitor: the parent vector. The second ensemble is generated by including the trial vector instead of the parent vector, with the remaining members being identical to those of the first ensemble. If the estimated volume of the second ensemble is greater than that of the first one, the trial vector replaces the parent vector in the next generation of DE in the search for feasible vectors. Otherwise, the parent vector is not replaced, but the trial vector is recorded since it satisfies F C _{1} and its predictions for the phenotypes of the Prediction Set are evaluated after DE is complete. Succinctly, F C _{2} allows a trial vector to replace a parent only if it leads to an expansion of the feasible region. As shown in Fig. 3 (green and red lines), this new feasibility criterion prevents the volume of the feasible region from shrinking as the generations pass (two independent DE runs). Additional file 3: Figure S1 (blue line) and Additional file 1: Table S9 show that without this volume maximization strategy, the ranges of nearly all parameters are diminished after 400 generations. On the other hand, with estimated volume maximization, the majority of the parameters have about a 10% variation after 400 generations (green line in Additional file 3: Figure S1). These parameter ranges are calculated by dividing the maximum parameter values by the minimum parameter values among all parent vectors at the 400^{ th } generation. Due to this improvement in the parameter ranges, we allow DE to explore for an additional 1200 generations. Additional file 3: Figure S1 (red line) shows that about 10% range for most parameters is still preserved among the parent vectors in the 1600^{ th } generation. We perform two realizations of DE with this approach for 1600 generations, thereby creating two more ensembles (Ensembles 5 and 6 in Table 2). These ensembles generate 64 and 69 unique phenotypic prediction vectors, respectively.
A further improvement comes from selecting the initial DE population to maximize the volume spanned by the vectors with respect to all 123 kinetic parameters rather than just the 10 most critical parameters. (Note that the kinetic parameters k s _{ n2}, f, and MDT have fixed values in Additional file 1: Table S3.) Two independent DE realizations for 1600 generations produce an average of 87 unique phenotypic prediction vectors (Ensembles 7 and 8 in Table 2) further increasing the range of predictions compared to those of Ensembles 5 and 6. This is also a significant improvement over LHS (51 unique prediction vectors) even though DE required about 30,000 samples (1600 generations × 19 vectors) to identify ∼70% wider (87/50) range of predictions compared to 50,000 LH samples. Having gotten DE to a point where it is more efficient than random sampling in terms of exploring the feasible region, we next seek ways to improve the performance of DE even further.
Increasing the phenotypic diversity of the initial population of DE
As we previously stated, 3146 feasible parameter vectors from the initial DE optimization run on the Training Set [22] (Ensemble 1) generate 30 unique phenotypic prediction vectors for the Prediction Set. Interestingly, 97% of these vectors generate only five of the total 30 prediction vectors as shown in Additional file 3: Figure S2. Due to this, the initial population of parameter vectors used in the last two DE runs (Scheme 5 in Table 2) produces a total of four unique prediction vectors all of which are in this set of five dominant prediction vectors. In other words, the diversity of the initial population in terms of phenotypic predictions is very low, only 13% (4 /30) of the diversity is utilized. Therefore, to increase this diversity, we select the initial population of feasible parameter vectors such that each one generates a different prediction vector (a total of 19) for the 129 strains in the Prediction Set. While using this initial selection scheme, we also maximize the estimated volume spanned by the selected vectors (Scheme 6 in Table 2). The details of this diversification procedure are in the Additional file 2: Supplementary Text (Section 5). This strategy further expands the range of predictions, with two independent runs (each for 1600 generations) increasing the average number of unique prediction vectors from 87 to 106 (Table 2). Thus, improved predictive diversity among the parent parameter vectors in the initial population results in feasible vectors (generated during DE) that are predictively more diverse.
Enforcing an increased range of predictions during DE
In order to explore the phenotypic prediction space of the model further, we enforce a third criterion during DE. With this new criterion, a parent parameter vector is only replaced by a trial vector if the trial vector generates a new prediction vector, one not heretofore generated by any feasible parameter vector during this DE run. (For the descriptions of parent and trial parameter vectors, refer to Methods section.) In other words, with this modification, the trial parameter vector has to satisfy three constraints to replace the parent vector. It should reproduce 111 phenotypes in Additional file 1: Table S7 (F C _{1}), increase the estimated volume of the feasible region upon replacing the parent vector (F C _{2}), and generate a new prediction vector (F C _{3}). Two independent realizations with this new scheme (for 1600 generations) increase the average number of unique predictions from 106 to 122.5 (average of Ensembles 11 and 12 in Table 2). We note that since the occurrence of a trial vector that satisfies the first two criteria is not very frequent (less than 10% among the samples generated by DE), simulating the 129 mutant strains of the Prediction Set onthefly (during DE) adds negligible computational time compared to the time required to run DE for 1600 generations with the 119 phenotypes in the Training Set.
Comparison of the two most efficient ensemble generation schemes
In order to compare the performances of Schemes 7 and 8 more thoroughly, we perform four DE runs with each scheme (2200 generations per run). As shown in Table 2, Scheme 8 produces 671 unique prediction vectors (from 15050 feasible parameter vectors in Ensemble 16), whereas the number of unique prediction vectors is 293 for Scheme 7 (from 15520 feasible parameter vectors in Ensemble 15), reiterating our previously stated conclusion that Scheme 8 is more efficient in exploring the phenotypic prediction space. Lower performance of Scheme 7 suggests that maximizing the feasible estimated volume during DE (through F C _{2}) may have no benefit.
The ten most fragile phenotypes
Phenotype #  Phenotype name  

61  CLB2db Δ multicopy SIC1 (Viable)  
18  cln1 Δ cln2 Δ cdh1 Δ (Viable)  
63  CLB2db Δ clb5 Δ clb6 Δ in galactose (Viable)  
105  cdc15 Δ net1ts cdh1 Δ (Viable)  
56  GALCLB2 cdh1 Δ (Inviable)  
20  cln1 Δ cln2 Δ cdh1 Δ GALCLN2 (Viable)  
59  CLB2db Δ in galactose (Inviable)  
77  APCA (Viable)  
78  APCA sic1 Δ (Viable)  
73  CLB5db Δ pds1 Δ (Viable) 
\({\hat {R}_{i}}=900\) is the highest possible robustness score for a feasible parameter vector that satisfies F C _{1} prior to perturbations. One way to compare different ensembles in terms of robustness is to compare the distributions of \({\hat {R}_{i}}\).
From these results, we conclude that by forcing the DE search to expand the range of predictions, one can explore the prediction space effectively as demonstrated by Scheme 8’s superior predictive diversity over its alternatives (Table 2). However, by forcing DE to maximize the feasible region’s estimated volume, it is possible to improve the robustness of the model in reproducing experimentally verified phenotypes, but at the expense of predictive diversity. Therefore, one should select the appropriate scheme for parameter space exploration, depending on one’s preference between higher robustness (Scheme 7) or diversity of model predictions (Scheme 8). Higher robustness against parametric perturbations may be enforced in some cases. For instance, one may need to modify the values of parameters in feasible vectors in order to capture additional experimental constraints while still capturing the original data [42], and this would favor the selection of Scheme 7 over Scheme 8.
Relative protein abundance predictions
Up to this point, we have only considered the phenotypic prediction range for the 129 mutant strains in the Prediction Set. Next, we consider predictions of relative protein abundances. In simulations, the time average concentration of a protein represents the model’s prediction for that protein’s abundance in an asynchronous population of budding yeast cells. For theoretical and experimental reasons, it is better to focus on relative protein abundances, i.e., the ratio of the abundance of one protein with respect to another. Relative abundances of proteins are typically measured by Western Blotting [43] or mass spectrometry [44]. Relative abundance measurements have been useful in estimating the parameters of systems biology models in the past [45, 46].
The ten novel phenotypes with highest predictive variance
Mutant #  Mutation 1  Mutation 2  Mutation 3 

90  e _{ k i,n2}=0  k p _{ b f b2}=0  k d p _{ i514}=0 
21  e _{ k i,n2}=0  k p _{ b f b2}=0   
57  k p _{ i5k2}=0  e _{ k i,b2}=0  k d p _{ i514}=0 
11  k p _{ i5k2}=0  e _{ k i,b2}=0   
85  e _{ k i,n2}=0  k p _{ i5n3}=0  k p _{ b f b2}=0 
58  k p _{ i5k2}=0  e _{ k i,b2}=0  k d p _{ n e t,p x }=0 
81  e _{ k i,n2}=0  k d p _{ k i,14}=0  k p _{ b f b2}=0 
128  k p _{ b f b2}=0  k d p _{ i514}=0  k d p _{ n e t,p x }=0 
42  k p _{ b f b2}=0  k d p _{ n e t,p x }=0   
106  e _{ k i,b2}=0  k p _{ b f b2}=0  k d p _{ n e t,p x }=0 
Similar approaches have been used in two previous model driven experimental design studies [47, 48]. In [47], Dong et al. presented an experimental design process called “Computing Life” and illustrated it for the biological clock of Neurospora crassa. At each experimental design cycle, the authors chose the Maximally Informative Next Experiment (MINE) from a large set of potential network models and microarray experiments using a criterion that enforced maximal independence between observables. This analysis identified several genes (from a total of 11,000 genes) under the direct control of a key clock oscillator and also discovered a link between this clock and ribosome biogenesis. In [48], Donahue et al. implemented a sparse grid approximation using polynomials to explore their objective function (based on time series data) in order to discriminate simultaneously between uncertainties in model structure and in parameter values (without an initially determined feasible region). One disadvantage of the sparse grid search is the required smoothness of the objective function, whereas typically rugged objective function landscapes [49] are observed for large and nonlinear network models. This is especially the case in our study where many discrete experimental constraints determine the feasibility of model parameter vectors. For detailed theoretical discussions regarding the use of prediction variability statistics in modelbased experimental design, we refer the reader to two excellent reviews [4, 6].
Ranking cell cycle proteins and biological processes in terms of prediction variability
In order to study the potential relationships between the variabilities of relative abundance predictions linked to individual cell cycle proteins and the topology of the cell cycle network, we first identified the total variability associated with each of the 26 proteins. To this end, for each protein, we computed the sum of the CV values for each protein abundance ratio with that particular protein in its numerator. We refer to this sum as the “variability score” of the protein. (We have verified that our ranking of proteins based on their variability scores does not depend on whether we use the protein in the numerator or denominator in the summation process (data not shown)).
Cell cycle regulators ordered in terms of their variability scores (decreasing from top to bottom)
Rank  Regulator  Variability score  Percentile  Category  Network module  Class label 

1  Cdc20AAPC  231.4  100  High variability  S/G2/M  2 
2  Cln3  208.3  94  High variability  START  1 
3  BUD  195.71  91  High variability     
4  APCP  191.2  87  High variability  S/G2/M  2 
5  WHI5dep  160.63  82  High variability  START  1 
6  SBFdep  156.26  79  High variability  START  1 
7  Polo _{ A }  154.48  75  High variability  EXIT  3 
8  Cln2  152.8  71  High variability  START  1 
9  Tem1  147.02  67  High variability  EXIT  3 
10  ORI  146.3  64  High variability     
11  Clb5 _{ T }  139.93  60  High variability  S/G2/M  2 
12  CKI _{ T }  129.58  56  High variability  S/G2/M  2 
13  Cdh1 _{ A }  124.2  52  High variability  S/G2/M  2 
14  SPN  122.95  48  Low variability     
15  Polo _{ T }  122.16  44  Low variability  EXIT  3 
16  Bck2  120.28  40  Low variability  START  1 
17  Pds1 _{ T }  119.01  37  Low variability  EXIT  3 
18  Clb2 _{ T }  118.21  33  Low variability  S/G2/M  2 
19  PPX  117.53  29  Low variability  EXIT  3 
20  Cdc20AAPCP  115.58  25  Low variability  S/G2/M  2 
21  V (Mass)  114.18  21  Low variability     
22  CKI _{ P }  111.99  17  Low variability  S/G2/M  2 
23  Cdc15  109.39  13  Low variability  EXIT  3 
24  Net1dep  109.19  10  Low variability  EXIT  3 
25  CDC20 _{ T }  109.03  6  Low variability  S/G2/M  2 
26  Swi5 _{ T }  107.48  1  Low variability  S/G2/M  2 
Ten regulators in the S/G2/M module, on the other hand, were evenly distributed among both categories. Two regulators in this module, namely Cdc20AAPC and Cdc20AAPCP had strikingly different predictive variability scores. Cdc20AAPC complex has the highest score of 231.4), whereas Cdc20AAPCP complex was ranked 20^{ th } with a score of 115.58. These two complexes are responsible from the degradation of Clb5, Clb2, and Pds1 through ubiquitinmediated proteolysis [51]. Interestingly, Cdc20AAPCP is 9.3, 3.8, and 6.5–fold more potent than Cdc20AAPC (based on the average parameter values in Ensemble 16) in terms of degrading Clb5, Clb2, and Pds1, respectively. Hence, a potent (or critical) regulator turned out to have less predictive variability compared to a weaker regulator in our model once again pointing to a potential relationship between the cell cycle network and the variability scores of individual model variables (more critical variables have less predictive variance). After ranking cell cycle proteins by denominatorbased formation of relative abundancenetwork module pairs (i.e., each relative abundance is matched to the module of the protein in its denominator), we computed the Pearson correlation coefficient between the vectors formed from the order of the two rankings (numeratorbased vs. denominatorbased) as 0.99. Hence, the ranking of cell cycle proteins was independent of the way network modules were assigned to relative abundance values.
Biological processes ordered in terms of predictive variability which decreases from top to bottom
Rank  Biological process  Variability  Percentile 

score  
1  Regulation of cell size  164.38 ±42.18  100 
2  Negative regulation of transcription involved in  160.63 ±0.00  95 
G1/S transition of mitotic cell cycle  
3  Regulation of transcription involved in  157.89 ±71.29  91 
G1/S transition of mitotic cell cycle  
4  Positive regulation of transcription involved in  156.26 ±0.00  85 
G1/S transition of mitotic cell cycle  
5  Positive regulation of transcription from RNA polymerase II promoter  156.26 ±0.00  85 
6  Regulation of cyclindependent protein kinase activity  154.81 ±38.41  81 
7  Mitotic spindle orientation checkpoint  147.02 ±0.00  71 
8  Exit from mitosis  147.02 ±0.00  71 
9  Establishment of mitotic spindle localization  147.02 ±0.00  71 
10  Regulation of mitotic spindle assembly  139.93 ±0.00  64 
11  Positive regulation of DNA replication  139.93 ±0.00  64 
12  G1/S transition of mitotic cell cycle  130.11 ±13.89  60 
13  Positive regulation of spindle pole body separation  129.07 ±15.36  54 
14  G2/M transition of mitotic cell cycle  129.07 ±15.36  54 
15  Positive regulation of protein ubiquitination  124.20 ±0.00  47 
16  Negative regulation of spindle pole body separation  124.20 ±0.00  47 
17  Regulation of cell cycle  120.28 ±0.00  40 
18  Positive regulation of gene expression  120.28 ±0.00  40 
19  Mitotic sister chromatid segregation  119.01 ±0.00  36 
20  Regulation of mitotic spindle elongation  118.21 ±0.00  30 
21  Negative regulation of protein dephosphorylation  118.21 ±0.00  30 
22  Positive regulation of mitotic metaphase/anaphase transition  116.62 ±10.73  23 
23  Activation of APCCdc20 complex activity  116.62 ±10.73  23 
24  Protein phosphorylation  109.39 ±0.00  16 
25  Mitotic cytokinesis  109.39 ±0.00  16 
26  Regulation of exit from mitosis  109.29 ±0.14  12 
27  Negative regulation of cyclindependent protein kinase by cyclin degradation  109.03 ±0.00  6 
28  Mitotic spindle assembly checkpoint  109.03 ±0.00  6 
29  Positive regulation of transcription involved in exit from mitosis  107.48 ±0.00  1 
Based on Table 7, the biological processes associated with the smallest predictive variability values (1 ^{ s t }12^{ th } percentile range) were identified as the positive regulation of transcription involved in exit from mitosis (and also its simpler form “regulation of exit from mitosis”), mitotic spindle assembly checkpoint, and negative regulation of cyclindependent protein kinase by cyclin degradation. These processes are associated with Swi5 (the transcription factor for CKI), Net1 (stoichiometric inhibitor of Cdc14), Cdc15 (responsible for Net1 phosphorylation), and Cdc20 (required for Clb5, Clb2, and Pds1 degradation) that all play critical roles for mitotic exit which is the cell cycle network module with least predictive variability as we previously stated.
The findings that we summarize in this section, when taken together, suggest that the statistics generated from the model predictions are influenced by the topology of the cell cycle network and that these statistics may also be generating distinct patterns that are specific to individual network modules. In order to test this hypothesis, we next implemented the “random forest” classification method and developed statistical models to predict the network modules in which individual cell cycle regulators operate (i.e., biological functions of these regulators) using model prediction statistics.
Predicting biological functions (or network modules) of cell cycle regulators using relative abundance statistics
In order to predict the biological functions (or network modules) of cell cycle regulators using relative abundance statistics, we implemented the random forest classification method using the Statistics and Machine Learning Toolbox^{TM} of Matlab^{®;} [57]. For each relative abundance (a total of 47850 relative abundances with finite CV values), four features were used for predicting the network modules of individual cell cycle proteins, namely the mean, standard deviation, and CV values of the particular relative abundance and the IDnumber of the viable novel mutant (of the 129 strains in the Predictive Set plus the wild type strain) that is simulated to generate the relative abundance prediction. The true class of each relative abundance was identified as the network module to which the protein in the numerator belonged. (We later tested if the predictive accuracy significantly changed when the denominator was taken as the reference point for identifying the true class labels and found out that our predictive ability was not dependent on this choice.) Predictive accuracy is computed by generating receiver operating characteristic (ROC) curves (true positive rate vs. the false positive rate obtained using several classifier output thresholds) and quantifying the areas under these curves (AUC) for each network module as the positive class vs. the negative class generated by combining the remaining two modules (i.e., START module vs. S/G2/M and EXIT modules, S/G2/M module vs. START and EXIT modules, and EXIT module vs. START and S/G2/M modules). We performed 100 runs (per set of features or model inputs) and reported the average AUC and its pvalue based on a Ztest with respect to a random model with two classes (i.e., AUC=0.5) [58], an approach commonly taken for computing the statistical significance of AUC in ROC based predictive modeling studies. When the pvalue computed from the AUC is less than 0.05, the predictive performance measured by the AUC value is deemed statistically significant. We also generated randomized models by permuting the class labels (or network modules) attached to each relative abundance in 100 independent realizations. The pvalues associated with the predictive performances of these randomized models were expected to be higher than 0.05 in order to verify the statistical significance achieved by the nonrandomized models trained and tested by the true network modules associated with all the relative abundances.
Per decision tree, approximately 64% of the samples are retained to be used for model training, whereas the remaining samples are used for model testing. These test samples are referred to as “outofbag” (OOB) samples, whereas the training samples are expanded by bootstrapping [59] (or sampling with replacement) up to the sample size of the original data [60] prior to model training. Classification of the test samples are based on the complete ensemble of trees (a total of 100 trees) with a voting scheme. For example, a test sample (i.e., the protein in the numerator of a relative abundance) is predicted to be in the “START” module if the number of trees that predict this outcome is higher than the ones that predict the protein’s network module as “S/G2/M” or “EXIT”.
Predictive performances of the random forest models developed using relative abundance statistics along with the pvalues corresponding to mean AUC values in 100 independent realizations (STD corresponds to standard deviation)
Positive class  AUC (Mean ±STD)  pvalue  AUC (Mean ±STD)  pvalue 

with randomized modules  with randomized modules  
START  0.8667 ±0.0004  <1.0E 15  0.4996 ±0.0046  0.55 
S/G2/M  0.8326 ±0.0005  <1.0E15  0.5003 ±0.0038  0.46 
EXIT  0.8366 ±0.0005  <1.0E15  0.5008 ±0.0038  0.40 
Recent studies have indicated that abundances of proteins are regulated in a biological functiondependent manner [61–63]. For example, in general, production and degradation rates of regulatory proteins are trained by evolution to quickly respond to certain stimuli, whereas proteins produced by housekeeping genes and structural proteins that are critical for the integrity of an organism are relatively more stable [61]. Furthermore, it is now also clear that protein abundance signatures are shaped not only by transcriptional and posttranscriptional regulation [64] but also by translation and posttranslational regulation, which play prominent roles in determining both dynamic and steadystate behaviours of protein abundances [61, 62, 65]. The cell cycle model used in our study takes into account all of these individual modes of regulation and successfully predicts the network modules of individual cell cycle regulators (related to their biological functions) from model prediction statistics. This outcome demonstrates the critical importance of developing comprehensive and accurate models of important biological processes (such as cell cycle control) for correctly predicting various dynamic and steadystate behaviours shaped by a complex interplay between several modes of regulation. Generating correct predictions despite such complexity holds the key to elucidating critical components and their interactions in complex biological networks in a contextdependent manner.
Conclusions
Previously [22], we demonstrated a practical approach for fitting a complex dynamical model of the budding yeast cell cycle [40, 41] to a large set of qualitative experimental observations (viability/inviability of mutant strains of yeast). Taking a further step in this work, we characterize the feasible region of this model in order to test whether the statistical features of relative protein abundance predictions are influenced by the topology of the cell cycle regulatory network.
Using differential evolution (DE), we generate an ensemble of feasible parameter vectors that reproduce the phenotypes (viable or inviable) of wildtype yeast cells and 110 mutant strains (we call these 111 strains the Training Set). We use this ensemble to predict the phenotypes of 129 mutants (the Prediction Set) for which experimental data is not available. We identify 86 novel mutants that are predicted to be viable and then rank the cell cycle proteins in terms of their contributions to cumulative variability of relative protein abundance predictions. Of the three modules in the cell cycle control system (START, S/G2/M, and EXIT), the EXIT module (the most fragile module identified in [22]) has the least predictive variability, whereas the START module has the highest predictive variability. When we compile all of the gene ontology based biological processes associated with the cell cycle proteins in the model, we identify that the proteins involved in “regulation of cell size” and “regulation of G1/S transition” contribute most to predictive variability, whereas proteins involved in “positive regulation of transcription involved in exit from mitosis”, “mitotic spindle assembly checkpoint”, and “negative regulation of cyclindependent protein kinase by cyclin degradation” contribute the least. These results suggest that the statistics of these predictions may be generating patterns specific to individual network modules (START, S/G2/M, and EXIT). To test this hypothesis, we develop random forest models for predicting the network modules of cell cycle regulators using relative abundance statistics as model inputs. Predictive performance is assessed by the areas under receiver operating characteristics curves (AUC). Our models generate an AUC range of 0.830.87 as opposed to randomized models with AUC values around 0.50. By using differential evolution and random forest modeling, we show that the model prediction statistics generate distinct network modulespecific patterns within the cell cycle network.
Abbreviations
 AUC:

Area under the curve
 CV:

Coefficient of variation
 DE:

Differential evolution
 LHS:

Latin hypercube sampling
 ODE:

Ordinary differential equation
 WT:

Wild type
Declarations
Acknowledgements
Cihan Oguz performed this work when he was employed at Virginia Tech. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, the Department of Health and Human Services, or the United States government.
Funding
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01 GM07898907 to JJT and WTB. We are grateful to the Advanced Research Computing Lab at Virginia Tech for computing resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, the Department of Health and Human Services, or the United States government.
Availability of data and materials
The prediction ranges for all ensembles can be reproduced using Additional file 4 (C++ subroutine that implements the model along with a Matlab script), and Additional files 5, 6, 7, 8 and 9 (Ensembles 1 through 16). DE is implemented using the Matlab code available at http://www1.icsi.berkeley.edu/~storn/code.html.
Authors’ contributions
Conceived the study: CO. Developed the methodology, performed the simulations and analysis: CO. Wrote the paper: CO, WTB, LTW, JJT. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Butcher EC, Berg EL, Kunkel EJ. Systems biology in drug discovery. Nat Biotechnol. 2004; 22(10):1253–9.PubMedView ArticleGoogle Scholar
 Nelander S, Wang W, Nilsson B, She QB, Pratilas C, Rosen N, Gennemark P, Sander C. Models from experiments: combinatorial drug perturbations of cancer cells. Mol Syst Biol. 2008; 4(216):1–11.Google Scholar
 Gutenkunst RN, Waterfall JJ, Casey FP, Brown KS, Myers CR, Sethna JP. Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol. 2007; 3(10):189.View ArticleGoogle Scholar
 Kreutz C, Timmer J. Systems biology: experimental design. FEBS J. 2009; 276(4):923–42.PubMedView ArticleGoogle Scholar
 Kuepfer L, Peter M, Sauer U, Stelling J. Ensemble modeling for analysis of cell signaling dynamics. Nat Biotechnol. 2007; 25(9):1001–6.PubMedView ArticleGoogle Scholar
 Franceschini G, Macchietto S. Modelbased design of experiments for parameter precision: State of the art. Chem Eng Sci. 2008; 63(19):4846–72.View ArticleGoogle Scholar
 Meyer P, Cokelaer T, Chandran D, Kim KH, Loh PR, Tucker G, Lipson M, Berger B, Kreutz C, Raue A, et al.Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach. BMC Syst Biol. 2014; 8(1):13.PubMedPubMed CentralView ArticleGoogle Scholar
 Silk D, Kirk PD, Barnes CP, Toni T, Stumpf MP. Model selection in systems biology depends on experimental design. PLoS Comput Biol. 2014; 10(6):1003650.View ArticleGoogle Scholar
 Schaber J, Baltanas R, Bush A, Klipp E, ColmanLerner A. Modelling reveals novel roles of two parallel signalling pathways and homeostatic feedbacks in yeast. Mol Syst Biol. 2012; 8(622):1–17.Google Scholar
 Tran LM, Rizk ML, Liao JC. Ensemble modeling of metabolic networks. Biophys J. 2008; 95(12):5606–17.PubMedPubMed CentralView ArticleGoogle Scholar
 Jia G, Stephanopoulos G, Gunawan R. Ensemble kinetic modeling of metabolic networks from dynamic metabolic profiles. Metabolites. 2012; 2(4):891–912.PubMedPubMed CentralView ArticleGoogle Scholar
 Song SO, Chakrabarti A, Varner JD. Ensembles of signal transduction models using Pareto optimal ensemble techniques (POETs). Biotechnol J. 2010; 5(7):768–80.PubMedPubMed CentralView ArticleGoogle Scholar
 Noble SL, Buzzard GT, Rundell AE. Feasible parameter space characterization with adaptive sparse grids for nonlinear systems biology models. In: American Control Conference (ACC), 2011. New York: IEEE: 2011. p. 2909–14.Google Scholar
 Dayarian A, Chaves M, Sontag ED, Sengupta AM. Shape, size, and robustness: feasible regions in the parameter space of biochemical networks. PLoS Comput Biol. 2009; 5(1):1000256.View ArticleGoogle Scholar
 Tiemann C, Vanlier J, Hilbers P, van Riel N. Parameter adaptations during phenotype transitions in progressive diseases. BMC Syst Biol. 2011; 5(1):174.PubMedPubMed CentralView ArticleGoogle Scholar
 Tiemann CA, Vanlier J, Oosterveer MH, Groen AK, Hilbers PA, van Riel NA. Parameter trajectory analysis to identify treatment effects of pharmacological interventions. PLoS Comput Biol. 2013; 9(8):1003166.View ArticleGoogle Scholar
 Rumschinski P, Borchers S, Bosio S, Weismantel R, Findeisen R. Setbase dynamical parameter estimation and model invalidation for biochemical reaction networks. BMC Syst Biol. 2010; 4(1):69.PubMedPubMed CentralView ArticleGoogle Scholar
 RodriguezFernandez M, Rehberg M, Kremling A, Banga JR. Simultaneous model discrimination and parameter estimation in dynamic models of cellular systems. BMC Syst Biol. 2013; 7(1):76.PubMedPubMed CentralView ArticleGoogle Scholar
 Pargett M, Rundell AE, Buzzard GT, Umulis DM. Modelbased analysis for qualitative data: an application in drosophila germline stem cell regulation. PLoS Comput Biol. 2014; 10(3):1003498.View ArticleGoogle Scholar
 Donzé A, Fanchon E, Gattepaille LM, Maler O, Tracqui P. Robustness analysis and behavior discrimination in enzymatic reaction networks. PloS ONE. 2011; 6(9):24246.View ArticleGoogle Scholar
 Pargett M, Umulis DM. Quantitative model analysis with diverse biological data: applications in developmental pattern formation. Methods. 2013; 62(1):56–67.PubMedView ArticleGoogle Scholar
 Oguz C, Laomettachit T, Chen KC, Watson LT, Baumann WT, Tyson JJ. Optimization and model reduction in the high dimensional parameter space of a budding yeast cell cycle model. BMC Syst Biol. 2013; 7(1):53.PubMedPubMed CentralView ArticleGoogle Scholar
 Price KV, Storn RM, Lampinen JA. Differential Evolution: A Practical Approach to Global Optimization. Natural Computing Series. Berlin: Springer; 2005.Google Scholar
 Chong CK, Mohamad MS, Deris S, Shamsir MS, Choon YW, Chai LE. Improved differential evolution algorithm for parameter estimation to improve the production of biochemical pathway. Intl J Interactive Multimedia Artif Intell. 2012; 1(5):22–9.View ArticleGoogle Scholar
 Tashkova K, Korošec P, Šilc J, Todorovski L, Džeroski S. Parameter estimation with bioinspired metaheuristic optimization: modeling the dynamics of endocytosis. BMC Syst Biol. 2011; 5(1):159.PubMedPubMed CentralView ArticleGoogle Scholar
 Mahdavi S, Shiri ME, Rahnamayan S. Metaheuristics in largescale global continues optimization: A survey. Inf Sci. 2015; 295:407–28.View ArticleGoogle Scholar
 Sun J, Garibaldi JM, Hodgman C. Parameter estimation using metaheuristics in systems biology: a comprehensive review. Comput Biol Bioinformatics IEEE/ACM Trans. 2012; 9(1):185–202.View ArticleGoogle Scholar
 Banga JR, Versyck KJ, Van Impe JF. Computation of optimal identification experiments for nonlinear dynamic process models: a stochastic global optimization approach. Ind Eng Chem Res. 2002; 41(10):2425–30.View ArticleGoogle Scholar
 RodriguezFernandez M, Mendes P, Banga JR. A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems. 2006; 83(2):248–65.PubMedView ArticleGoogle Scholar
 BalsaCanto E, Alonso AA, Banga JR. Computational procedures for optimal experimental design in biological systems. IET Syst Biol. 2008; 2(4):163–72.PubMedView ArticleGoogle Scholar
 Ashyraliyev M, Jaeger J, Blom JG. Parameter estimation and determinability analysis applied to drosophila gap gene circuits. BMC Syst Biol. 2008; 2(1):83.PubMedPubMed CentralView ArticleGoogle Scholar
 Audoly S, Bellu G, D’Angio L, Saccomani MP, Cobelli C. Global identifiability of nonlinear models of biological systems. Biomed Eng IEEE Trans. 2001; 48(1):55–65.View ArticleGoogle Scholar
 Zak DE, Gonye GE, Schwaber JS, Doyle FJ. Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: insights from an identifiability analysis of an in silico network. Genome Res. 2003; 13(11):2396–405.PubMedPubMed CentralView ArticleGoogle Scholar
 Morgan DO. The Cell Cycle: Principles of Control. London: New Science Press; 2007.Google Scholar
 Mitchison JM. The Biology of the Cell Cycle. London: Cambridge University Press; 1971.Google Scholar
 Chen KC, CsikaszNagy A, Gyorffy B, Val J, Novak B, Tyson JJ. Kinetic analysis of a molecular model of the budding yeast cell cycle. Mol Biol Cell. 2000; 11(1):369–91.PubMedPubMed CentralView ArticleGoogle Scholar
 Chen KC, Calzone L, CsikaszNagy A, Cross FR, Novak B, Tyson JJ. Integrative analysis of cell cycle control in budding yeast,. Mol Biol Cell. 2004; 15(8):3841–62. doi:10.1091/mbc.E03110794.PubMedPubMed CentralView ArticleGoogle Scholar
 Singhania R, Sramkoski RM, Jacobberger JW, Tyson JJ. A hybrid model of mammalian cell cycle regulation. PLoS Comput Biol. 2011; 7(2):1001077.View ArticleGoogle Scholar
 Kraikivski P, Chen KC, Laomettachit T, Murali T, Tyson JJ. From start to finish: computational analysis of cell cycle control in budding yeast. npj Syst Biol Appl. 2015; 1:15016.View ArticleGoogle Scholar
 Laomettachit T. Mathematical modeling approaches for dynamical analysis of protein regulatory networks with applications to the budding yeast cell cycle and the circadian rhythm in cyanobacteria. PhD thesis, Virginia Institute of Technology. 2011. http://scholar.lib.vt.edu/theses/available/etd11072011021528/.
 Laomettachit T, Chen KC, Baumann WT, Tyson JJ. A model of yeast cellcycle regulation based on a standard component modeling strategy for protein regulatory networks. PloS ONE. 2016; 11(5):0153738.View ArticleGoogle Scholar
 Donahue MM, Buzzard GT, Rundell AE. Robust parameter identification with adaptive sparse gridbased optimization for nonlinear systems biology models. In: American Control Conference, 2009. ACC’09. New York: IEEE: 2009. p. 5055–060.Google Scholar
 Taylor SC, Berkelman T, Yadav G, Hammond M. A defined methodology for reliable quantification of western blot data. Mol Biotechnol. 2013; 55(3):217–26.PubMedPubMed CentralView ArticleGoogle Scholar
 Oda Y, Huang K, Cross FR, Cowburn D, Chait BT. Accurate quantitation of protein expression and sitespecific phosphorylation. Proc Natl Acad Sci. 1999; 96(12):6591–6.PubMedPubMed CentralView ArticleGoogle Scholar
 Bucher J, Riedmaier S, Schnabel A, Marcus K, Vacun G, Weiss T, Thasler W, Nüssler A, Zanger U, Reuss M. A systems biology approach to dynamic modeling and intersubject variability of statin pharmacokinetics in human hepatocytes. BMC Syst Biol. 2011; 5(1):66.PubMedPubMed CentralView ArticleGoogle Scholar
 Shankaran H, Zhang Y, Tan Y, Resat H. Modelbased analysis of HER activation in cells coexpressing EGFR, HER2 and HER3. PLoS Comput Biol. 2013; 9(8):1003201.View ArticleGoogle Scholar
 Dong W, Tang X, Yu Y, Nilsen R, Kim R, Griffith J, Arnold J, Schüttler HB. Systems biology of the clock in neurospora crassa. PloS ONE. 2008; 3(8):3105.View ArticleGoogle Scholar
 Donahue M, Buzzard G, Rundell A. Experiment design through dynamical characterisation of nonlinear systems biology models utilising sparse grids. IET Syst Biol. 2010; 4(4):249–62.PubMedView ArticleGoogle Scholar
 Lucia A, DiMaggio PA, Depa P. Funneling algorithms for multiscale optimization on rugged terrains. Ind Eng Chem Res. 2004; 43(14):3770–81.View ArticleGoogle Scholar
 Moriya H, ShimizuYoshida Y, Kitano H. In Vivo Robustness Analysis of Cell Division Cycle Genes in Saccharomyces cerevisiae. PLOS Genet. 2010; 6(4). doi:10.1371/journal.pgen.002011.
 Shirayama M, Tóth A, Gálová M, Nasmyth K. Apccdc20 promotes exit from mitosis by destroying the anaphase inhibitor pds1 and cyclin clb5. Nature. 1999; 402(6758):203–7.PubMedView ArticleGoogle Scholar
 Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, IsselTarver L, Schroeder M, Sherlock G, et al.Saccharomyces genome database (sgd) provides secondary gene annotation using the gene ontology (go). Nucleic Acids Res. 2002; 30(1):69–72.PubMedPubMed CentralView ArticleGoogle Scholar
 Turner JJ, Ewald JC, Skotheim JM. Cell size control in yeast. Curr Biol. 2012; 22(9):350–9.View ArticleGoogle Scholar
 Johnston G, Pringle J, Hartwell LH. Coordination of growth with cell division in the yeast saccharomyces cerevisiae. Experimental Cell Res. 1977; 105(1):79–98.View ArticleGoogle Scholar
 Di Talia S, Wang H, Skotheim JM, Rosebrock AP, Futcher B, Cross FR. Daughterspecific transcription factors regulate cell size control in budding yeast. PLoS Biol. 2009; 7(10):1000221.View ArticleGoogle Scholar
 Di Talia S, Skotheim JM, Bean JM, Siggia ED, Cross FR. The effects of molecular noise and size control on variability in the budding yeast cell cycle. Nature. 2007; 448(7156):947–51.PubMedView ArticleGoogle Scholar
 MATLAB. Version 8.1 (R2013a). Natick: The MathWorks Inc.; 2013.Google Scholar
 Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982; 143(1):29–36.PubMedView ArticleGoogle Scholar
 Efron B. Bootstrap methods: another look at the jackknife. Annals Stat. 1979; 7(1):1–26.View ArticleGoogle Scholar
 Dasgupta A, Sun YV, König IR, BaileyWilson JE, Malley JD. Brief review of regressionbased and machine learning methods in genetic epidemiology: the genetic analysis workshop 17 experience. Genet Epidemiol. 2011; 35(S1):5–11.View ArticleGoogle Scholar
 Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet. 2012; 13(4):227–32.PubMedPubMed CentralGoogle Scholar
 Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M. Global quantification of mammalian gene expression control. Nature. 2011; 473(7347):337–42.PubMedView ArticleGoogle Scholar
 Vogel C, de Sousa Abreu R, Ko D, Le SY, Shapiro BA, Burns SC, Sandhu D, Boutz DR, Marcotte EM, Penalva LO. Sequence signatures and mrna concentration can explain twothirds of protein abundance variation in a human cell line. Mol Syst Biol. 2010; 6(1):400.PubMedPubMed CentralGoogle Scholar
 Plotkin JB. Transcriptional regulation is only half the story. Mol Syst Biol. 2010; 6(1):406.PubMedPubMed CentralGoogle Scholar
 Maier T, Schmidt A, Güell M, Kühner S, Gavin AC, Aebersold R, Serrano L. Quantification of mrna and protein and integration with protein turnover in a bacterium. Mol Syst Biol. 2011; 7(1):511.PubMedPubMed CentralView ArticleGoogle Scholar