Validation of a constraint-based model of Pichia pastoris metabolism under data scarcity

Background Constraint-based models enable structured cellular representations in which intracellular kinetics are circumvented. These models, combined with experimental data, are useful analytical tools to estimate the state exhibited (the phenotype) by the cells at given pseudo-steady conditions. Results In this contribution, a simplified constraint-based stoichiometric model of the metabolism of the yeast Pichia pastoris, a workhorse for heterologous protein expression, is validated against several experimental available datasets. Firstly, maximum theoretical growth yields are calculated and compared to the experimental ones. Secondly, possibility theory is applied to quantify the consistency between model and measurements. Finally, the biomass growth rate is excluded from the datasets and its prediction used to exemplify the capability of the model to calculate non-measured fluxes. Conclusions This contribution shows how a small-sized network can be assessed following a rational, quantitative procedure even when measurements are scarce and imprecise. This approach is particularly useful in lacking data scenarios.


Background
The collection of biochemical reactions involved in the metabolism of a cell can be assembled in networks in order to carry out studies under a system-level approach [1]. Such analysis have been done with large, even genome-scale, reconstructions of well-characterised organisms such as Escherichia coli, Saccharomyces cerevisiae, Pseudomonas putida [2][3][4], and also with simpler networks that consider only a few key metabolites [5][6][7].
Given a metabolic network, a matrix equation can be used in order to describe the mass balances around the nodes, the m internal metabolites: in which c is a vector of metabolite concentrations and v is the vector of reaction rates, or fluxes, representing the mass flow through each of the n reactions in the network [8].
In order to avoid reaction kinetics, still rarely known, the internal metabolites are often assumed not to accumulate and thus (1) turns into a system of linear equations. Then, other constraints can be imposed; for instance, it is common to consider particular enzyme kinetics [9], thermodynamics [2,10], or the irreversibility of certain reactions using inequalities. In this way, a constraint-based model can be assembled [11,12].
By combination of this model and a set of measurable fluxes, the remaining ones can be estimated performing a metabolic flux analysis (MFA) [13]. It is even possible to incorporate intracellular measurements from stable isotope tracer experiments to apply 13C-MFA [14,15]. Unfortunately, these data are not available in most cases. Indeed, scarcity of measurements often results in practice in underdetermined systems, and therefore traditional MFA cannot be performed. In this context, a constraint-based approach that attempts to provide a range of candidate flux states instead of predicting the actual one with precision [11,16] can be of use. In any case, MFA can only be performed using reasonably small networks with favourable structures: otherwise its under-determinacy can be neither removed, even when tracer experiments are available, nor reduced enough to get valuable estimates with a constraint-based approach.
Besides, these medium-sized networks are derived from the known biochemical reactions involved in the metabolism of a cell, and rely necessarily on reductionist hypothesis, being their validation often insufficient. They are seldom validated against datasets different from the one of interest, which is thus inconveniently used both to validate the model and to perform the MFA analysis. Herein we discuss a procedure seeking for further validation of these networks.
The methylotrophic yeast Pichia pastoris is worldwide recognized as a reference platform for the expression of recombinant proteins in eukaryotes, due to the possibility to grow cultures to very high cell densities, its ability to produce post-translational modifications, and the good protein yield/cost ratio. Heterologous genes are cloned under P. pastoris strong and tightly regulated alcohol oxidase promoter, and thus expressed when the cells grow on methanol as sole or combined carbon source.
The optimization of recombinant protein expression in P. pastoris has been usually addressed heuristically. Only a few publications [17][18][19] describe rational, model-based optimisation and control of Pichia growth and protein production. Among these, semi-structured, metabolism-based models representing intracellular behaviour are particularly rare [20,21].
In the following sections, a constraint-based model of P. pastoris will be described and validated against the available experimental data. Then, its ability to predict non-measured fluxes will be illustrated by estimating the biomass growth rate. The potential use of the model for the estimation of intracellular fluxes will also be discussed. In summary, this work applies a systematic, yet simple, procedure to provide further validation for a small-sized model of P. pastoris, using only data from extracellular measurements.

Constraint-based model
A constraint-based model, assuming that internal metabolites are at steady-state and considering the irreversibility of some reactions, can be described with a set of model constraints (  ) as follows: Where v is the vector of reaction rates, or fluxes, representing the mass flow through each of the n reactions in the network, N is the stoichiometric matrix, and D is a diagonal matrix with D ii = 1 if the flux i is irreversible (otherwise 0).
The constraints in (2) define a space of feasible steady-state flux distributions, or flux states, which ideally comprises every theoretically possible phenotype: only flux vectors v that fulfill (2) are considered valid cellular states.

Consistency analysis
The simplest consistency analysis could be performed checking that the flux states shown by cells fulfill the constraints imposed by the model. However, this simple approach would be impractical because measurements are imprecise and do not exactly satisfy the constraints. Such difficulty is overcome by taking into account uncertainty, as follows: where e m represents the error or deviation between the actual fluxes v m and the measured values w m .
Model and measurements can be consistent if there is a vector v fulfilling (2) and (3) for "reasonably small" deviations e m . Otherwise, we will conclude that model and measurements are inconsistent. An easy way to evaluate consistency is to find the flux vector v fulfilling Where it is assumed that e m are distributed normally with zero mean value and have a variance-covariance matrix F. If only linear equality constraints are considered in  , the residual is a stochastic variable following a c 2 -distribution, and therefore a c 2 -test can be used to detect and evaluate the inconsistency. The c 2 -test is based upon statistical hypothesis testing to determine if the deviation is within expected experimental error [8]. However, we want to consider inequality constraints in (2), and therefore the c 2 -test cannot be used because its assumptions are not fulfilled ( does not follows a c 2 -distribution anymore). Yet, the residual provides at least a rough indication of consistency.

Consistency analysis: Possibilistic MFA
The consistency analysis can also be formulated as a possibilistic constraint satisfaction problem, as it has been recently proposed in [16]. The basic idea is that a flux vector fulfilling the model constraints (2) and compatible with the measurements will be considered "possible", otherwise "impossible". This can be refined to cope with measurements errors by introducing the notion of "degree of possibility".
We introduce a set of measurements constraints (  ) considering imprecision, as in (3), but substituting e m by two pairs of non-negative decision variables (non-negative variables are chosen to formulate the calculations as linear programming problems [16]): These decision variables ε 1 , μ 1 , ε 2 and μ 2 relax the basic assertion w m = v m , conforming a set of possibility distributions in (w m , v m ) associated to some cost index J. Among different possible choices, a simple -yet sensible-one is the linear cost index: with a≥0 and b≥0, which are row vectors of measurement reliability coefficients.
The possibility π of each solution δ of (2) and (5), which corresponds to a particular flux vector v, is given by the value of the cost index: The interpretation of (5-7) may be: "w m = v m is fully possible; the more w m differs from v m , the less possible such situation is". See the article for further technical details [16].
Defining two pairs of decision variables, there is more flexibility to represent the measurements in possibilistic terms: the user can assign the bounds ε 2 max and μ 2 max and the weights a and b. This way, each measurement is represented by a distribution of possibility (see examples in [16]). The bounds ε 2 max and μ 2 max define an interval of fully possible values (possibility π = 1). For instance, the user can choose a band of 10% around the measured value. The values a and b define the decreasing possibility to assign to values out of this interval (details below).
At this point, the maximum possibility (minimumcost) flux vector v mp corresponding to a given set of measurements is obtained solving a linear programming (LP) problem: The possibility of the most possible solution being, This degree of possibility provides an indication of the consistency between model (  ) and measurements (  ): a possibility equal to one must be interpreted as complete agreement between the model and the original measurements; lower values of possibility imply that certain error in the measurements is needed to find a flux vector fulfilling the model constraints.

Possibilistic estimation of non-measured fluxes
Possibilistic MFA also enables estimating the metabolic fluxes based on the model and the available measurements. The simplest point-wise estimate is the minimum-cost flux vector resulting from (7), which contains the most possible value for each flux. However, a pointwise estimate is limited when multiple combinations might be reasonably possible. In this situation, a possibilistic interval estimate is a better choice.
The interval of values with conditional possibility higher than for a given variable, The upper bound v i M , would be obtained by replacing minimum by maximum. Possibilistic intervals have a similar interpretation to "confidence intervals" ("credible intervals") in Bayesian statistics, and provide concise but rich flux estimates. Please refer to the above-mentioned article for details on the possibilistic framework [16].

Results and Discussion
Metabolic Network of P. pastoris The metabolic network presented in Figure 1 is based on the stoichiometric model defined in [22] for P. pastoris growth on glucose, which has been extended with reactions representing methanol and glycerol metabolism. This is a simplified representation whose objective is not to accurately describe the full biochemistry of the yeast but to generate a model in which to apply methodologies of interest aimed to process analysis, monitoring and control.
The main catabolic pathways of the yeast P. pastoris (Embden-Meyerhoff-Parnas pathway, citric acid cycle, pentose phosphate and fermentative pathways) are represented for growth on the substrates mainly used for its culture: glucose, glycerol and methanol. In this case, a mean biomass equation derived from the macromolecular composition of the yeast is used to summarize the anabolic pathways according to [22]. Key metabolites such as NAD, NADP, AcCoA, oxalacetate and pyruvate are considered in distinct cytosolic and mitochondrial pools. Several alternative biomass equations corresponding to Saccharomyces cerevisiae models coming from the literature [4,23,24] were also tested (data not shown) as detailed in the following sections, and found to provide similar results. However, it would be useful to evaluate the sensitivity with particularized P. pastoris biomass compositions, if available.
The model contains 45 compounds and 44 metabolic reactions. The balanced growth condition can be applied to 36 internal metabolites, resulting in a 36 × 44 stoichiometric matrix with 8 degrees of freedom (the matrix and the list of reactions is given in the additional file 1). As in [22], irreversibility is assumed for all reactions except for {2-8; 15; 22-27; 29; 34}, and reaction 41 in order to account for glycerol uptake, resulting in the constraintbased model of the form (1), which is used hereinafter.

Elementary mode analysis
Elementary mode analysis provides a way to systematically identify a set of relevant pathways of a metabolic network [25][26][27]. The elementary modes (EM) are the simplest (steady-state) flux distribution that cells can show, whereas the remaining feasible states can be seen as its aggregated action (without cancelations of reversible fluxes). Moreover, the fact that they comprise all the simple pathways in the network, the functional states or non-decomposable vectors, makes it possible to investigate the infinite behaviours that cells can show by simply inspecting them. They have been used, for instance, to analyse pathways considering optimality [25,28], determine minimal medium requirements [12], and infer viability of mutants [29]. The 98 elementary modes for the described network were obtained using Metatool [30]. They are given in the additional file 2. The set of EMs can be classified as shown in Figure 2 depending first on its ability to produce biomass, and second on the carbon source used: glucose, methanol or glycerol. There are 17 EMs that do not result in biomass production, whereas 9 generate ethanol. No ethanol is produced in single substrate EMs when growing.
The carbon yields for biomass obtained for each EM as shown in Table 1. The maximum yield is 4.93 Cmol dcw/Cmol in presence of glucose. Glucose is the most efficient substrate for growth also in combination with glycerol or methanol.
Methanol is the worst biomass yielding substrate. This is also illustrated in Figure 3. In the following sections 11 different datasets compiled from the literature ( Table  2) are used to determine whether the simplified model described above is coherent with experimental data.

Validation: experimental and theoretical yields
As a first validation, we checked that the experimental growth yields did not exceed the maximum theoretical ones given by the model (which were obtained by inspection of the elementary modes on each category). For instance, the theoretical yield for growth on glucose is 4.93, whereas the experimental one is 3.98 (Cmmol DW/mmol). The maximum yield on glycerol and methanol is 2.25, and the experimental ones at different ratios of glycerol and methanol range between 1.31 and 0.63. It also seems that the experimental yields decrease for combinations of substrates with lower theoretical yields.
Thus, no experimental yield violates the maximum theoretical ones (the contrary would indicate errors in the model because theoretical yields were obtained from it). However, the experimental yields tend to be lower than theoretical ones. There are several reasons for this deviation: (a) the model does not consider restrictions on energy cofactors, such as ATP, nor the resources devoted to recombinant protein production, (b) the EM analysis does not take into account the ratio between the different substrates in mixed cases, and (c) even if optimal pathways exist, the actual behaviour of cells does not necessarily makes use of them in terms of growth [25].

Validation: model and data consistency analysis
The datasets in Table 2 were also used to check that the experimental measurements, which reflect the metabolic state of cells, are feasible states according to the model. Two different analysis of consistency were performed: one based on minimized, variance-weighted sum of squared residuals ( ) and another one based on the Blue denotes substances being consumed by the EM, and red those being produced (the darker, the higher stoichiometric coefficient). Arrows highlighted those EMs with the maximum theoretical yield (in terms of growth) for each type. In all weighted least squares problems, a standard deviation of 10% is assigned to each measurement of the set trying to capture their uncertainty. The variance-covariance matrix F in (4) is defined accordingly.
In the Possibilistic MFA problems, the uncertainty of the measurements was represented as follows: (a) Full possibility (π = 1) is assigned to values near the measured ones, less than ± 5% deviation, to account for random errors. (b) A decreasing possibility is assigned to larger deviations so that values with a deviation equal to ± 20% have a possibility of π = 0.1 (those values with a deviation of ± 9.5% will have possibility of π = 0.5).
The results for each dataset are shown in Table 2, where the values for and π(v mp ) are given. The last column provides another indicator of consistency: the degree of measurements uncertainty needed to find a flux vector in full agreement with the model constraints (π = 1). All the computations were performed with MATLAB (MathWorks Inc., 2003), and YALMIP toolbox [31] was used to conduct Possibilistic MFA. The consistency between model and experimental measurements is very high, but for a small set. In these cases, the inconsistency pinpoints especial characteristics of these sets of data, as explained below.
The dataset D1, which corresponds to Pichia growing on glucose, shows very good agreement. The measured data has full possibility (π = 1), meaning that there is a flux vector compatible with model and measurements. In fact, as shown in the last column, a band of 1% around the measured values is sufficient to enclose this flux vector. Notice also that the residual is very low.
Datasets A1 and A2, which correspond to cultures growing totally or mainly on glycerol and producing a small amount of protein, also show a good agreement. The discrepancy between measurements and model is larger for A3 (π = 0.25), but still a band of 10% of deviation around measurements encloses a flux vector compatible with the model. Dataset A3 corresponds to a culture growing mainly on methanol, but supplemented on glycerol, and producing larger amounts of protein.
The discrepancy is larger for A4, which corresponds to a scenario with high protein productivity.  Similar results are obtained with cultures at a higher growth rate (datasets B1-B3), B1 and B2 are highly consistent, while protein producing B3 shows similar behaviour to A3-A4. This suggests the existence of nonmodelled phenomena, probably related with protein production. The agreement is quite good for the three datasets C1-C3, but the increase of the discrepancy along with higher protein expression is also noticeable.
Finally, we used two batteries of random datasets to assess whether the model is indeed able to reject flux distribution that do not correspond to actual states of P. pastoris cultures. These datasets were defined taking random combinations of values for each flux within predefined bounds (see Table 2). Most of these random scenarios were highly inconsistent with the model (possibilities lower than 0.1 in 99% and 95% of the datasets, for each battery).
In summary, the constraint-based model shows acceptable agreement with the experimental data reported by different groups for P. pastoris cultures, and at the same time, rejects artificially generated invalid datasets. The scenarios with lower agreement pinpoint unmodelled phenomena, possibly related to protein expression.

Using the model to predict growth
Possibilistic MFA can now be applied to the constraint based model and the available measurements in order to estimate the biomass growth rate for each of the previous datasets. Details of this estimation can be found in the methods section. PMFA is applied to the datasets shown above excluding the measured value of the growth rate (which is used to validate the estimation). Results are depicted in Figure 4.
The estimated growth rate is found to be in very good agreement with the measured one for the vast majority of the analysed scenarios (D1, A1, A3, A4, B1, B2, B3, C1 and C2), which correspond to cultures at different growth rates, using different substrates, and coming from three independent literature references. For two other scenarios (A2 and C3), the most possible estimate is still accurate.
The fact that, although limited, the model has predictive capacity provides further validation for this constraint-based representation. This conclusion is strengthened if we consider that the growth rate is highly interconnected along the whole network, since the biomass equation takes into account several metabolic precursors, and thus accurate correspondence between substrate uptake, respiratory fluxes and growth cannot be inferred in a straight-forward way from the network.
Using the model to estimate the whole flux distribution Once the model has been validated, possibilistic MFA could be used to estimate all the non-measured fluxes, either intracellular or extracellular, as done with the growth rate in the previous section. For illustration purpose, the flux distributions for each scenario are given in the additional file 3.
Notice that these estimations cannot be done by means of traditional MFA because the measurements would be insufficient to get a determined system.  Solà et al. [21]; C: Jungo et al. [19]. Citrate and Pyruvate are assumed not to be produced nor consumed except for dataset D1 in which citrate is consumed at 0.007 Cmol·kg -1· h -1 . **Minimized sum of squared residuals ( ), possibility of the most possible flux vector (P) and degree of measurements uncertainty to P = 1.
The network has 8 degrees of freedom (44 fluxes and 36 linear equations) and there are 9 measured fluxes. However, these measurements introduce only 7 independent additional linear constraints, so the system remains under-determined with 1 degree of freedom [32]. Possibilistic MFA is able to get an estimate thanks to the irreversibility constraints (other approaches considering these could also provide an estimate). Possibilistic estimates of fluxes of particular interest are also useful to perform a comparative analysis between the different scenarios and datasets. For instance, the estimates for three relevant groups of fluxes, which represent splitting nodes within the network, are depicted in Figure 5: -Fluxes v 2 , v 3 and v 4 belonging to the glycolysis pathway, are positive as expected in cultures grown in glucose, and appear inverted in glycerol and/or methanol fed cultures.
-Fluxes v 21 , v 22 and v 23 represent the isomerization of R5P into Ru5P and Xu5P. Note how v 23 inverts its direction at growing methanol fluxes, as increased methanol consumption demands higher amounts of Xu5P thus requiring more R5P precursor.
-Fluxes v 32 , v 33 and v 34 represent the branchpoint related to methanol usage, that is, how this flux is split between direct oxidation and catabolic pathways. High methanol fluxes are necessarily conducted via CO 2 generation and thus flux v 34 becomes distinct from zero in A4, B4, C2 and C3 scenarios.
In this way, these results further validate the predictive capability of the model.

Conclusions
The consistency of a constraint-based model of Pichia pastoris has been validated in several experimental scenarios resulting in good agreement between estimations and measurements. In addition, the predictive capacity of the model for cell growth rate, an attractive target for industrial fermentation monitoring and control, has been verified. Interestingly, the accuracy of predictions worsens for higher protein producing scenarios, showing how the model, derived for a wildtype strain, is increasingly less precise as wider resources are devoted to recombinant protein generation.
It must be highlighted that the model has been strictly constructed upon first-principles and sensible hypothesis. At this point, the model can be curated, extended, and its parameters tuned in order to improve the consistency with the investigated scenarios. Particularly, energy requirements, strongly related to protein expression, are not yet considered within the model and future work will address this issue. This contribution shows how a small-sized network can in general be assessed following a rational, quantitative procedure even when measurements are scarce. Possibilistic MFA becomes a useful tool to systematize this procedure. This approach enables validation considering the stoichiometric balances and also reactions reversibilities, and accounting for measurements imprecision. The use of Possibilistic MFA also makes it possible to predict non-measured fluxes without removing the network under-determinancy. There is, however, a challenge when validating networks with higher number of degrees of freedom because there may be many flux vectors compatible with the (few) available measurements. It is expected that the datasets will be highly