Identifying model error in metabolic flux analysis – a generalized least squares approach

Background The estimation of intracellular flux through traditional metabolic flux analysis (MFA) using an overdetermined system of equations is a well established practice in metabolic engineering. Despite the continued evolution of the methodology since its introduction, there has been little focus on validation and identification of poor model fit outside of identifying “gross measurement error”. The growing complexity of metabolic models, which are increasingly generated from genome-level data, has necessitated robust validation that can directly assess model fit. Results In this work, MFA calculation is framed as a generalized least squares (GLS) problem, highlighting the applicability of the common t-test for model validation. To differentiate between measurement and model error, we simulate ideal flux profiles directly from the model, perturb them with estimated measurement error, and compare their validation to real data. Application of this strategy to an established Chinese Hamster Ovary (CHO) cell model shows how fluxes validated by traditional means may be largely non-significant due to a lack of model fit. With further simulation, we explore how t-test significance relates to calculation error and show that fluxes found to be non-significant have 2-4 fold larger error (if measurement uncertainty is in the 5–10 % range). Conclusions The proposed validation method goes beyond traditional detection of “gross measurement error” to identify lack of fit between model and data. Although the focus of this work is on t-test validation and traditional MFA, the presented framework is readily applicable to other regression analysis methods and MFA formulations. Electronic supplementary material The online version of this article (doi:10.1186/s12918-016-0335-7) contains supplementary material, which is available to authorized users.


Additional file 1 -Extended theoretical principles
To make the proposed protocol as accessible as possible, the theoretical section has been extended with a number extra details as well as a simplified example model.
Basic principles of MFA [1] The basis of MFA stems from a mass balance on each intracellular compound (Equation 1).
rate of change = transport in − transport out The right hand side of the equation can be separated into a net rate of formation and a dilution term that results from increasing cellular volume through cell division (Equation 8), where dC dt is a column vector of metabolite rates of change, r is a column vector of net fluxes, and µC is the dilution term. r combines transport fluxes and consumption/production from reactions (which include biomass production). Due to high turnover rate of intracellular metabolite pools, both dC dt and µC are generally assumed to be negligible in relation to large flux magnitudes (termed "pseudo-steady state"). The result is that the sum of transport and reaction fluxes must be equal to zero.
The sum of transport and reaction fluxes can be expressed as a product of a stoichiometry matrix S and a column vector of fluxes v (Equation 4).
Each row of S corresponds to a metabolite, with the columns encoding the stoichiometry of reactions included in the model. As an example, a simple set of reactions is presented in Box 1, with the corresponding stoichiometry matrix defined in [1] Mathematical description of MFA is largely based on [1].
Box 1: Simple set of reactions defining a metabolic model.
Letters A, B, C, D, and E designate row names and correspond to balances on metabolites A, B, C, D, and E. C1, C2, C3, O1, O2, and O3 designate column names and refer to observed and calculated reactions. S can be partitioned into observed S o and unknown (or calculated S c ) components, where the observed fluxes tend to be metabolite uptake or secretion rates. For the reactions in Box 1, Combining Equations (3), (4), and (6), Since S o , S c and v o are all known, v c can then be calculated. Going back to the example, assume that the uptakes of A and B have been observed as 2.1 and 0.9, while the secretion of E was found to be 1.2: The calculation and validation of v c can be approached in multiple ways.

Traditional calculation
Assuming that sufficient fluxes can be observed, Equation (8) can be used to solve for the value of v c through linear algebra: To isolate for v c , S c must be invertible. If S c has more rows than columns (and all rows are independent), then multiply both sides by S T c . Since S T c S c is square, it's possible to calculate v c using a matrix inverse.
For the example model: If v o is known exactly, then Equation (14) can be substituted back into Equation (8) to define a redundency matrix R, which represents the flux balances around each metabolite as a function of the observed fluxes alone.
For the example model: Given that Rv o = 0, each row of R represents a relationship that must be satsfied by v O1 , v O2 , and v O3 . For example the balance on A will only hold if Whereas v o is a set of ideal observations that satisfy the balance perfectly, fluxes observed in practice (v o ) are subject to experimental error and will result in some of the flux balances failing to close: where ε is the vector of residual fluxes associated with the flux balances around each of the metabolites (the length of ε is equal to the number of metabolites in the model). For the example model: By consideringv o as the sum of hypothetically true values (v o ) and a vector of observation errors (δ), it's possible to describe ε as a function of observation error alone. Starting from Equation 21: Given that Rv o = 0 by definition: Relationships among the reactions in the metabolic model will cause linear dependencies in the redundancy matrix (R). In the simple reaction network of A → B → C, for example, a discrepancy in the observed transport fluxes of A and C will cause the same residual fluxes around A, B, and C, stemming from three linearly dependent rows in R. The degree of freedom that can be used to assess whether a significant observation error has been made is equal to the number of independent rows (rank) of R. For this reason, the redundancy matrix is reduced to only the linearly independent rows (R ), resulting in a corresponding change to ε (ε ). For the example model, R has a rank of two, so R can be taken as two linearly independent rows of R: The vector of residuals ε is similarly reduced to ε : For the example model: The test for "gross measurement error" consists of determining whether the residual fluxes (ε ) are normally distributed. If this is the case, then it can be concluded that the residuals are due to random noise in the observations. If the residuals are not normally distributed, at least one of the observations must contain a significant (gross) error. The sum of squares of standard normally distributed (Z) random variables (x i ) is known to be chi-squared (χ 2 ) distributed. Expressed mathematically: The sum of squares can also be expressed in linear algebra as: From this relationship, the "gross measurement error" test is a χ 2 test on the sum of squares of (row-reduced) residual fluxes, with the ε terms standard normal variance of 1. The χ 2 statistic, h, is estimated as follows: Cov(ε ) is the variance-covariance matrix of ε can be calculated using (26): The transition from Equation (39) to Equation (40) can be shown from the definition of a variance-covariance matrix: Since E(RX) = RE(X), Cov(δ) is a covariance matrix of observed flux values that can be calculated directly through replication or estimated from prior information. For the example problem, v o was taken to be [2.1, 0.9, 1.2] T . Following the estimation of a covariance matrix around the observations, the χ 2 would be carried out as follows: Since 2.5714 is smaller than the critical value of χ 2 0.05 = 5.991 for two degrees of freedom, no gross errors can be identified in the observations.
The suggested procedure is to use the χ 2 test to identify whether any significant errors are present among the observations. If no gross errors are identified then all observed values must contain only random noise, which can be "balanced" by finding an estimate for the observed values that minimizes the sum of squared errors for the observations (v o ) given by: For the example problem, the corrected observations arev o = [2.0571, 1.0286, 1.0286] T .

Generalized least squares calculation
Following the generalized least squares framework, the residual term ε is not assumed to be the result of measurement error alone and is introduced earlier in the formulation.
Since S o and v o are known, S o v o is a column vector, and Equation (54) can be rearranged to look more like linear regression, If the elements of ε are independently and identically distributed, thenv c can be estimated through ordinary regression (note that residuals don't have to be normally distributed for the estimate to be a best linear unbiased estimator).
Going back to the example model in Box 1, With this formulation, ε represents the deviation between observed and calculated fluxes that may be the result of either measurement error or lack of model fit.
However, the assumption of ε being independently and identically distributed is unlikely to be true. Balances around large magnitude fluxes are likely to have higher associated variabilities. Mathematically, it is necessary to have Cov(ε) = σ 2 with a variance-covariance matrix Cov(ε) = Iσ 2 , whereas real balances may have unequal variances and non-zero covariance terms, equivalent to Cov(ε) = V σ 2 .
If Equation (55) is scaled by matrix A, then the variance-covariance of the scaled error term becomes where A must be chosen such that AV A T = I to meet the required conditions of linear regression. This is true if A is taken to be the inverse of the matrix square root of V , i.e., A = P −1 and P P = V . Finding the square root of a matrix P , such that P P = V can be performed by matrix diagonalization of V . To prove this, start from the matrix P and assume that that there exists a diagonal matrix D P and matrix Γ, for which: Since P P = V and the square of a diagonal matrix is that same matrix with all diagonal entries squared, then D P D P can be redefined as diagonal matrix D V : P can thus be calculated by diagonalizing V and taking the square root of its diagonal matrix, which is equal to the square root of the individual diagonal values. Since V is symmetric (Cov(X 1 , X 2 ) = Cov(X 2 , X 1 )), Therefore, P must also be symmetric: Combining P P = V and P = P T : = σ 2 P −1 P P P −1 (78) as required. Scaling Equation (55) by P −1 : where P −1 ε now satisfies the assumptions of linear regression. Formally, this is equivalent to generalized least squares (GLS) regression, however, incorporating P −1 directly into each term allows the use of all ordinary least squares techniques: The calculation of P −1 requires the estimation Cov(ε) from the variance of observed fluxes. Calculating the covariance-variance matrix of both sides of Equation (55): Since Cov(ε) = σ 2 V for any value of σ, σ is set to 1 so that V = Cov(ε). In practice, Cov(v o ) need only capture the relative magnitudes of observed flux variances asσ is estimated during regression. Balances around molecular species that do not include an observed flux v o will have a row of zeros in Cov(ε), which prevents the calculation of a matrix inverse (required to get P −1 ). Although this mathematically equates to a variance of zero for those balance, a better interpretation is that there is an unknown variance around the "observation" of no net flux. The simplest solution is to add a small non-zero value to each diagonal entry of Cov(ε), representing the confidence of the calculated fluxes being fully balanced. If there is more uncertainty around some balances than others, this information could be encoded in the magnitude of the added variance. P can then be calculated via a matrix square root of Cov(ε). Since a variance (covariance) matrix is positive semi-definite, P is known to be unique. Coming back to the example problem, assume that Cov(v o ) has been defined as follows: Then, Rows corresponding to balances on C and D do not include any observed fluxes, and are therefore all zero. The variance on the balance can be set to 0.0001 arbitrarily to reflect the relative confidence of the balance being closed: Since V is a diagonal matrix, P −1 can be calculated as the reciprocal square root of each element.
Performing the least squares calculation as before: Whereas calculated fluxesv c are commonly estimated using a very similar "weighted" least squares approach, the use of validation methods that are part of the regression framework have yet to be explored. The common χ 2 test can still be used to detect gross measurement errors in estimated residuals (ε), however, the validation of a regression model also requires the use of t-tests to ensure the significance of calculated fluxes. Confidence and prediction intervals are also highly relevant to MFA. The calculation of a t-statistic follows from normal regression: The estimated standard deviation of ε (orσ) is calculated as follows: where:ε The critical t-value for a two-tailed test with 1 degree of freedom and 5% significance level is 12.71, making all three calculated fluxes statistically significant. The identification of non-significant flux may be interpreted in two ways. The measurement error around observed fluxes may be too high to allow robust flux calculation. In that case, non-significant fluxes should be treated as having a flux of zero and excluded from the model or further analysis. Alternatively, non-significance may be the result of excess variability from a lack of fit between the model and observed data, requiring model correction. To distinguish between these cases, it is necessary to separate model error from measurement uncertainty. One way to accomplish this is to reduce measurement uncertainty through added replication, however, the required effort can make this approach practically infeasible. Another solution is to simulate a set of feasible fluxes directly from the stoichiometric model (and therefore free of model error) for comparison to the observed data.
The simulation of feasible fluxes can be simplified by eliminating flux equality constraints expressed by the stoichiometry matrix. Essentially, only n c − n b fluxes have to specified in order to generate all the other values. More formally, the relationships between the fluxes can be succinctly summarized through the nullspace (or kernel) of S, which describes all flux balance conservations in the model. This makes it possible to calculate all fluxes from a smaller set of variables referred to as the basis. Unlike fluxes, which must satisfy constraints imposed by Sv = 0, the basis can take any arbitrary value to generate fluxes that satisfy all required constraints.
where v i is an observed flux, K i is the corresponding row of K, and a is a scaling constant that can be set to t α/2,df to specify a confidence interval around v i . As the basis solution space is only constrained by inequalities, it is readily amenable to stochastic sampling. All values of v that satisfy Equation (110) represent feasible fluxes that would perfectly satisfy the stoichiometric model while remaining within measurement uncertainty of real observations. If the resulting space is infeasible, then the observed data does not fit the specified model. Otherwise, a random sample of feasible fluxes can be taken for comparison to observed results. If the addition of measurement error to simulated fluxes results in less uncertainty than from observed results, then model error is to blame.
The example model can be solved by specifying only 1 flux value, meaning that K is a column vector.
Based on K, a flux profile is consistent with the model as long as the flux of compound A (v O1 ) is two times greater than all other fluxes. From the estimated variance-covariance matrix, the standard deviations of the observed fluxes are [0.1, 0.1, 0.2] T . Assuming that these estimates came from a large number of samples, approximate 95% confidence intervals can be constructed by multiplying the standard deviation values by 2. With only a single basis value b, the constraints can be combined to limit b between 0.95 and 1.1. Random values can then be generated from a uniform distribution. For basis vectors with more than one value, independent limits on values of b can not be isolated and the solution space must be sampled directly. Once basis vectors are generated, flux profiles can be calculated from Kb = v and subject to measurement error.