Probabilistic strain optimization under constraint uncertainty
© Yousofshahi et al.; licensee BioMed Central Ltd. 2013
Received: 24 August 2012
Accepted: 8 March 2013
Published: 29 March 2013
Skip to main content
© Yousofshahi et al.; licensee BioMed Central Ltd. 2013
Received: 24 August 2012
Accepted: 8 March 2013
Published: 29 March 2013
An important step in strain optimization is to identify reactions whose activities should be modified to achieve the desired cellular objective. Preferably, these reactions are identified systematically, as the number of possible combinations of reaction modifications could be very large. Over the last several years, a number of computational methods have been described for identifying combinations of reaction modifications. However, none of these methods explicitly address uncertainties in implementing the reaction activity modifications. In this work, we model the uncertainties as probability distributions in the flux carrying capacities of reactions. Based on this model, we develop an optimization method that identifies reactions for flux capacity modifications to predict outcomes with high statistical likelihood.
We compare three optimization methods that select an intervention set comprising up- or down-regulation of reaction flux capacity: CCOpt (Chance constrained optimization), DetOpt (Deterministic optimization), and MCOpt (Monte Carlo-based optimization). We evaluate the methods using a Monte Carlo simulation-based method, MCEval (Monte Carlo Evaluations). We present two case studies analyzing a CHO cell and an adipocyte model. The flux capacity distributions required for our methods were estimated from maximal reaction velocities or elementary mode analysis. The intervention set selected by CCOpt consistently outperforms the intervention set selected by DetOpt in terms of tolerance to flux capacity variations. MCEval shows that the optimal flux predicted based on the CCOpt intervention set is more likely to be obtained, in a probabilistic sense, than the flux predicted by DetOpt. The intervention sets identified by CCOpt and MCOpt were similar; however, the exhaustive sampling required by MCOpt incurred significantly greater computational cost.
Maximizing tolerance to variable engineering outcomes (in modifying enzyme activities) can identify intervention sets that statistically improve the desired cellular objective.
In recent years, increasingly sophisticated computational methods have been developed to identify optimal genetic modifications to achieve a desired metabolic engineering objective. The problem of identifying optimal genetic modifications can be expressed in terms of operating state variables such as reaction flux, and control (decision) variables such as the presence or absence of gene expression. The optimal design “tunes” these variables such that the solution meets the engineering objective while satisfying several constraints reflecting physicochemical considerations, experimental observations and assumptions about the physiology of the cell or organism. Due to biological variability [1, 2], stochastic effects associated with gene expression, and imprecision in engineering implementation, it is questionable that enzyme levels can be precisely tuned to exactly match the target values calculated using computational design tools. More likely, the target enzyme levels, and thus the corresponding reaction flux capacities, can only be achieved with a finite degree of uncertainty. Addressing uncertainty at the design stage is a challenging issue that has become increasingly important not only for engineering biological systems, but also man-made systems such as electronic devices. Indeed, the past decade has witnessed a paradigm shift in design of electronics and computational design tools, where all modern electronic circuits are now designed to maximize tolerance to manufacturing and operational variations or to include tuning circuitry for post-manufacturing re-calibration. As metabolic engineering efforts progress from proof-of-principle to scaled-up manufacturing, computational methods to effectively address biological and engineering uncertainties at the design stage will become increasingly important in ensuring the identification of the most robustly optimal gene modifications.
The uncertainty in achieving targeted enzyme values suggests that the enzyme levels, and hence the corresponding flux carrying capacities (bounds), could be considered statistical distributions rather than fixed value parameters. In this statistical interpretation, a flux constraint in a conventional deterministic optimization problem represents the most conservative point in the flux capacity distribution, since a deterministic problem enforces all constraints with zero uncertainty. Although the deterministic approach affords relatively straightforward problem formulation and is most commonly practiced [3–5], this approach might lead to choosing an intervention set that may be far from optimal in a statistical sense. Alternatively, a sampling-based optimization approach (e.g. Monte Carlo sampling ), with the obvious caveat of being computationally intensive, probabilistically explores a possible space of enzyme activities, i.e. flux capacity distributions, and solves for an optimal intervention set for each sampled instance of flux capacities. Repeated sampling produces multiple intervention sets and a corresponding distribution of objective function values. Another alternative for incorporating uncertainties in an optimization problem is chance-constrained programming (CCP), which selects an optimal solution with a user-defined degree of probabilistic confidence in meeting constraints. Chance-constrained programming was first introduced in  to solve the problem of temporal planning when uncertainty is present. Since then, CCP has been utilized in numerous applications, including circuit sizing , soil conservation , ground water management , energy management , and molecular property optimization .
Current strain optimization methods generally seek to identify combinations of gene-level modifications that will result in an improvement of the desired cellular objective. These modifications are commonly gene deletions, but may be also up- or down-regulations of gene expression. A notable example of a computational method to identify gene knockouts is OptKnock . This method uses bi-level programming to identify gene deletions that satisfy the coupled objectives of metabolite overproduction and biomass formation. Another gene deletion strategy is Genetic Design through Local Search (GLDS) , which employs a heuristic and flux balance analysis (FBA) to iteratively find sets of zero flux reactions (corresponding to gene deletions) that would result in the maximization of the target reaction flux. Other, related methods for large-scale problems involve metaheuristic approaches to iteratively improve a candidate set of gene deletions by generating and selecting variants of the candidate set via assessment of the objective function. An example of this approach is OptGene, which uses an evolutionary algorithm to improve the set of gene deletions with respect to an objective function .
Optimization methods have also been described to identify targets for gene expression modification. OptReg  is a constraint-based method that uses bi-level programming to determine which sets of genes should be amplified or down-regulated to satisfy a coupled pair of engineering and cellular objectives. Another class of computational strain design methods utilizes elementary mode (EM) analysis. One recent example is Computational Approach for Strain Optimization aiming at high Productivity (CASOP), which ranks reactions based on their contributions to the yield of desired product . Another example is Flux Design, which selects reactions for up-regulation or deletion based on their correlation with the objective flux computed from EMs that contribute to the target product [15, 16]. Despite increasing sophistication, these and other current computational strain design methods implicitly assume that reaction flux changes can be implemented precisely, and thus do not consider uncertainties as part of the problem formulation.
In this paper, we investigate three computational methods to address uncertainty in strain optimization. Specifically, we compare two probabilistic methods, CCP based optimization (CCOpt) and sampling based optimization (MCOpt), against deterministic optimization (DetOpt). The performance of each method is tested on two metabolic models for which enzyme level changes and corresponding flux capacity distributions are estimated either from kinetic parameters or steady-state flux data. The performance of the solutions, i.e. predicted target fluxes and corresponding intervention sets, is evaluated using Monte Carlo simulations (MCEval) designed to simulate the variable outcomes resulting from experimental implementation of the modifications specified by the optimization solutions.
where and denote the inverse CDFs of and respectively, which can be numerically calculated if needed.
The main objective of the problem is to maximize the target reaction flux v target . It is expected that the optimal value of v target will increase monotonically with L, the number of allowed interventions (enzyme up/down-regulation operations). On the other hand, the engineering cost is also expected to increase with the number of interventions. Therefore, the objective function in (6) also includes the term , which imposes a small penalty α for each added intervention, and balances the optimal flux of the target reaction against the number of required interventions. Constraint (7) represents the steady state assumption that the rate of production of each intracellular metabolite is equal to its rate of consumption. Constraint (8) guarantees a minimal growth rate equaling at least 1% of the theoretical maximum of the wild-type (unmodified) organism. A minimal growth rate constraint is required to guarantee that the cell remains viable. This parameter can be adjusted by the user based on the metabolic model, available data and expectations for cell viability, which does not alter the optimization algorithm. To maximize the growth rate while simultaneously maximizing a certain target metabolite, a bi-level optimization with two objectives (maximizing biomass and a target flux) can be applied in place of the constraints (6) and (8). However, linear bi-level programs are NP-hard  and there are no efficient algorithms to solve large-scale problems . Constraint (9) sets the upper bound flux capacity for each reaction j. Constraint (10) sets the lower bound flux for each reaction j to SSL j (an observed reference state lower bound, if the observation data is available) or zero, based on the value of the binary variable . Constraint (11) sets an upper bound on the number of allowed interventions. Inequality (12) ensures that enzyme manipulations are exclusive, i.e. a reaction can be either up- or down-regulated in a solution, but not both. Similarly, constraint (13) guarantees that the forward and backward directions of a reversible reaction are not both up- and down-regulated at the same time. Constraint (14) specifies that the decision variables and can only be 0 or 1.
The deterministic formulation (DetOpt) can be derived from the CCP formulation by setting ε = 0 in (9), i.e. v j is strictly less than all possible values the random variables or can take.
where and are the randomly drawn set of flux capacities. Each MC sample, i.e. set of randomly drawn flux capacities, defines an instance of an optimization problem. The solution to this optimization problem is a set of interventions and a corresponding optimal flux value for the target reaction. Repeating the process (sampling and optimization) many times, we obtain a distribution of optimal target flux values.
Traditionally, a gene up/down-regulation operation has been modeled as a deterministic event leading to a fold-change in the level of the corresponding enzyme, and hence a fold-change in the flux capacity of the reaction catalyzed by the enzyme. Here, we model enzyme level modification as an uncertain event using a probability distribution. We assume a normal distribution  with an average fold-change of μ = 6 following gene up-regulation and a spread of δ = 6σ = 8, where σ denotes the standard deviation. The average fold-change value reflects experimental data reported in gene over-expression studies involving mammalian cells, specifically adipocytes . We note that the average fold-change value is a user-specified parameter that can be adjusted to reflect different cell types and experimental data, and thus does not lead to loss of generality. The spread δ is chosen so that μ - δ/2 > 1, which ensures that the flux capacity after up-regulating the enzyme level is higher than the unmodified state. A decrease in enzyme level, and hence reaction flux capacity, is modeled by a normal distribution N d (μ, σ 2) with an average fold-change of μ = 0.5 and a spread of δ = 1.
Based on the probabilistic interpretation of fold-changes in enzyme levels resulting from gene modifications, we also estimate the resulting reaction flux capacities as probability distributions. We use two different estimation methods depending on whether the model is kinetic or stoichiometric. In the case of a kinetic model, a fold-change in enzyme level is assumed to directly correlate with a fold-change in the maximal reaction velocity (v j,max). Here, the maximal reaction velocity has the same units as reaction flux. Therefore, flux capacity distributions were calculated by simply multiplying the enzyme fold-change distributions with v j,max. In the case of a stoichiometric model, the distributions of flux capacities are approximated using enzyme control flux (ECF) analysis . Briefly, ECF analysis calculates the effect of enzyme level changes on flux distribution based on elementary mode analysis  and a power law model for the relationship between reaction flux and enzyme activity. Typically, the ECF problem is underdetermined, and the solution is obtained as a range of minimal and maximal flux for each reaction. We use the maximal flux value as the corresponding reaction flux capacity. The maximal flux values, calculated using sample points from the distributions of enzyme level modifications (N u (μ, σ 2) and N d (μ, σ 2)), form a capacity distribution.
In the FBA problem, the flux capacity constraints are drawn from the capacity distributions ( and in equation (19)) if the corresponding reaction (enzyme) belongs to the optimized set of interventions. Otherwise, the capacity constraints are set to maximal steady state value (SSU j ) calculated for the unmodified reference state. Similar to MCOpt, MCEval repeatedly solves a series of optimization problems to generate a distribution of optimal target flux values. Unlike MCOpt, MCEval does not seek to identify an intervention set reflecting decisions on enzyme activity modification. Rather, each instance of MCEval simply solves for the optimal flux and the corresponding flux distribution based on capacity constraints specified by the CCOpt, DetOpt, or MCOpt solution that is to be evaluated.
To assess the benefits and limitations of the optimization methods, we compare their performance using test cases involving both a kinetic and a stoichiometric model. The kinetic model describes the metabolism of Chinese hamster ovary (CHO) cells in fed-batch culture . The stoichiometric model describes the metabolism of adipocytes undergoing differentiation and growth .
The CHO cell model comprises 24 metabolites and 47 irreversible reactions. The kinetic parameters of the model were previously estimated by fitting the model equations to experimentally obtained metabolite time course data . These parameters are used to estimate the effects of enzyme activity increases and decreases on the corresponding reaction flux capacity distributions. The flux capacity distributions for the adipocyte model are estimated from steady state metabolic flux data obtained in previous studies . Additional details of the model including reaction definitions are provided as Additional file 1. The test objective is the synthesis of a recombinant protein product, a therapeutic antibody.
Compared to CCOpt, DetOpt predicts smaller maximal antibody synthesis rates (~1000 nmol/10 6 cells/day) due to the conservative choice of reaction flux capacities. The maximal synthesis rate predicted using CCOpt is more than twice the flux predicted by DetOpt (~2200 nmol/10 6 cells/day). The intervention set identified by DetOpt consists of only a single reaction even when the maximal number allowed interventions is raised, indicating that the deterministic method does not fully utilize the degree of freedom available in the problem.
In the case of L = 4, the aggregate effect of uncertainties in flux capacities is to result in a normally distributed target flux. However, this is not the case for L < 4, where the dominant target flux values generated by MCOpt distribute narrowly with nearly zero spread. Moreover, the mean target flux values rise only incrementally from L = 1 to 3, suggesting that the probabilistic outcomes accumulate at the lower bound of the probable range due to one or more bottlenecks in the network that are not relieved until all 4 reaction flux capacity modifications are introduced.
Reaction 17 is a part of the TCA cycle. Reactions 24 and 26 are palmitate biosynthesis and tripalmitoylglycerol biosynthesis, respectively. All three reactions directly impact synthesis of TG, which is formed from esterification of palmitate with glycerol phosphate, with the latter derived from glycerone phosphate. Previous reports , including our own work , have shown that the addition of long-chain fatty acids stimulates cellular TG accumulation. At first glance, the intervention targets selected by CCOpt appear trivially intuitive. However, other, equally intuitive alternatives also exist, which were not selected. For example, another intuitive intervention to increase net TG accumulation is to down-regulate lipolysis (reaction 27). This intervention was not selected, because the reference (unmodified) state lower bound for reaction 27 is already zero, and a further reduction would have no impact on TG production. In this regard, the optimization results depend not only on the model, but also on the observed reference state.
Our optimization problems (CCOpt, MCOpt and DetOpt) are formulated as mixed integer linear programming (MILP). A MILP problem requires a subset of variables to take on integer values, while the other variables can take on non-integer values. This problem is NP-hard , and thus it is unlikely that there exists an efficient (polynomial-time in the size of the model) algorithm to obtain a globally optimal solution. In the present study, we implemented our optimization methods (CCOpt, MCOpt and DetOpt) using the GNU Linear Programming Kit (GLPK)  in MATLAB. The runtime of our computational experiments solving the MILP problems was on the order of a few seconds on a Core i5 2.53 GHz CPU.
In addition to the scalability issue inherent to MILP problems, another computational challenge lies in estimating the flux capacity distributions. For the stoichiometric model of this study, we used enzyme control flux analysis (ECF)  to obtain these distributions. The ECF method in turn relies on elementary mode (EM) analysis, which can be applied to metabolic models comprising < ~100 reactions, but remains intractable for genome-scale models. An alternative strategy is to model the fold-change in flux capacity, i.e. enzyme activity, resulting from a gene expression modification using a probability distribution, e.g. a normal distribution. This strategy requires knowledge of maximal enzyme velocities (vmax). If these parameters are not known, they may be estimated from FBA, which has been demonstrated on genome-scale models.
These types of limitations, while not trivial, are comparable to other computational strain design methods. For example, bi-level optimization, used in OptKnock , is also NP-hard , and thus can be intractable for large-scale problems. As an NP-hard problem, the runtime grows exponentially with the number of allowed reaction modifications . Methods that rely on EM analysis [14–16, 36] face a similar limitation as our capacity estimation problem, as the analysis is generally only practical for small to mid-scale models. Methods based on local search  or metaheuristics [13, 37] are computationally less prohibitive than MILP, and likely offer the best alternative for large-scale problems. On the other hand, these methods cannot guarantee global solution optimality, and may arrive at solutions that are far from exact.
This study investigates three distinct ways of capturing uncertainty about parameter values when formulating an optimization problem with the objective of identifying targets for enzyme activity adjustments that maximize the production of a desired molecule. The three approaches are chance-constrained programming (CCOpt), Monte Carlo sampling-based solution of the uncertain problem (MCOpt), and deterministic optimization based on worst-case assumptions (DetOpt). Evaluation of the approaches for two test cases (CHO cell and adipocyte models) using Monte Carlo simulations (MCEval) shows that a more sophisticated probabilistic approach such as CCOpt has several advantages compared to a conservative conventional approach like DetOpt. Chance-constrained programming explores a larger portion of the solution space and is able to find a more diverse set of options. Additionally, CCOpt consistently outperforms DetOpt in terms of predicting the more likely maximum of the objective function value. Comparisons of the intervention sets from CCOpt and DetOpt using MCEval shows that the maximal fluxes predicted by CCOpt was always in the probable (5th-95th percentile) range calculated by MCEval, whereas the maximal fluxes predicted by DetOpt typically lies outside of this range. When compared to the sampling-based optimization approach (MCOpt), CCOpt consistently finds the solution most frequently selected by MCOpt, but at a fraction of the computational cost (seconds vs. days).
The CCOpt formulation can be readily extended to capture other types of uncertainties, such as biological variability in measured data and cell transfection efficiency, making CCOpt an effective technique for probabilistic strain optimization.
Monte Carlo-based optimization
Monte Carlo Evaluations
Cumulative distribution functions
Enzyme control flux
Flux balance analysis
Chinese hamster ovary
GNU Linear Programming Kit
Mixed integer linear programming.
This work was supported by the National Science Foundation under Grant no. 0829899 and the Wittich Family Foundation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.