Growth maximization for ME models
A ME model describes a cell’s metabolic and macromolecular state as a vector of n fluxes, \(v\in \mathbb {R}^{n}\) (in mmol per gram dry weight per hour, mmol/gDW/h), through biochemical reactions among m components (i.e., small molecules and macromolecules) [2]. To compute the state that maximizes the growth rate μ (in \(\mathrm{h}^{-1}\)), one solves the following optimization problem (1) [2, 22]:
$$ \begin{aligned} \max_{\mu, v} \quad & \mu \\ \mathrm{subject\ to} \quad & S(\mu) v = 0 \\ & l(\mu) \leq v \leq u(\mu), \end{aligned} $$
(1)
where \(S(\mu)\in \mathbb {R}^{m\times n}\) is the stoichiometric matrix, and \(l(\mu)\in \mathbb {R}^{n}\), \(u(\mu) \in \mathbb {R}^{n}\) are the lower and upper flux bounds. These three parameters are functions of μ, for example, due to the hyperbolic relation between growth rate and translation rate, macromolecule dilution, etc. (see [2] for a complete description of these relations). For any fixed μ, the constraints S(μ)v=0 and the bounds become linear, so Problem (1) reduces to a linear program. Because the objective is to maximize μ subject to these μ-dependent constraints, a global optimum is found efficiently by bisecting on μ, or by using augmented Lagrangian methods [22]. We note that similar optimization problems have also been solved in the context of metabolism and protein expression networks using the Resource Balance Analysis modeling framework (see SI Text E1 in [23] and [24]).
To solve (1), we used the solveME Python module [22]. Specifically, we used bisection (binary search) as in [11] to maximize the growth rate to six decimal places. SolveME uses the 128-bit (quad-precision) linear program (LP) and nonlinear program (NLP) solver Quad MINOS 5.6 (qMINOS) [25, 26]. All qMINOS runs were performed with feasibility and optimality tolerances of \(10^{-20}\). These tight tolerances were used to capture solutions involving fluxes as small as \(10^{-16}\) mmol/gDW/h and were made possible by the quad-precision capabilities of qMINOS. The models used for this study are available in the GitHub repositories COBRAme (version 0.0.9) https://github.com/SBRG/ecoli_me_testing
and ECOLIme (version 0.0.9) https://github.com/SBRG/ecolime. The COBRAme software [2] was used for building and developing the ME model.
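For intuition, the bisection on μ can be summarized in a short sketch. This is a minimal illustration, assuming a hypothetical is_feasible_at(model, mu) callback that solves the LP obtained by fixing μ in Problem (1) and reports feasibility; the actual implementation uses solveME and qMINOS as described above.

```python
def bisect_growth_rate(model, is_feasible_at, mu_lo=0.0, mu_hi=2.0, tol=1e-6):
    """Maximize growth rate by bisection (binary search) on mu.

    For fixed mu, Problem (1) is an LP, and its feasibility is monotone
    in mu, so binary search converges to the maximal feasible growth rate.
    `is_feasible_at` is a hypothetical callback (stand-in for a solveME call).
    """
    while mu_hi - mu_lo > tol:  # tol=1e-6 gives six decimal places
        mu = 0.5 * (mu_lo + mu_hi)
        if is_feasible_at(model, mu):
            mu_lo = mu  # feasible: the optimum lies at or above mu
        else:
            mu_hi = mu  # infeasible: the optimum lies below mu
    return mu_lo
```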
Dynamic simulation using ME models
Our DynamicME implementation extends dynamic FBA (dFBA) [15, 16], which was developed for metabolic models (M-models). We tested two implementations of the DynamicME method: one that does not account for dynamic proteome constraints (protein “inertia”) and one that does.
The first implementation assumes that the protein abundances can be adjusted freely between time steps. Also, the uptake rate of a substrate was not made a function of its extracellular concentration. Instead, flux bounds were set to zero if the substrate was depleted (zero concentration), or to a finite value otherwise. Consequently, we did not need to perform a ME-model simulation at every time step. Instead, once exchange (i.e., uptake and secretion) fluxes were computed by the ME-model, the same fluxes were used to compute the extracellular metabolite concentration profile over subsequent time steps. At each time step, DynamicME checked whether a substrate became depleted (fully consumed) or newly available, e.g., by feeding for a fed-batch process or secretion of re-consumable metabolites. If so, a new ME computation was performed with the updated exchange flux bounds. Here, the ME model is capable of selecting the optimal set of metabolites to take up from the medium. The exchange fluxes and growth rate were then updated according to the new optimal solution. These updated values were used to compute biomass and metabolite concentrations. This procedure was repeated until the batch time was reached.
In this first implementation, one can still account for concentration-dependent uptake rates or different feed schedules by performing ME-model simulations at every time step. Furthermore, if additional mechanisms such as growth inhibition by substrates or products are modeled, ME-model simulations should be performed at every time step. The procedure for simulating a batch culture using DynamicME is described in Procedure DynamicME and shown schematically in Fig. 2.
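For concreteness, the core loop of this first implementation can be sketched as follows. This is a simplified outline, not the exact Procedure DynamicME; solve_me is a stand-in for a ME-model computation that returns the growth rate and exchange fluxes given the currently available substrates.

```python
import numpy as np

def dynamic_me(solve_me, c0, X0, dt, t_batch):
    """Simplified DynamicME batch loop (no protein inertia).

    c0: initial extracellular concentrations, X0: biomass concentration,
    dt: time step (h), t_batch: batch time (h). `solve_me` maps the set
    of available substrates to (mu, exchange fluxes in mmol/gDW/h);
    uptake fluxes are negative, secretion fluxes positive.
    """
    c, X, t = dict(c0), X0, 0.0
    available = {m for m, ci in c.items() if ci > 0}
    mu, v_ex = solve_me(available)
    while t < t_batch:
        for m in c:  # forward-Euler update of extracellular concentrations
            c[m] = max(c[m] + v_ex.get(m, 0.0) * X * dt, 0.0)
        X *= float(np.exp(mu * dt))  # biomass grows exponentially over dt
        t += dt
        now_available = {m for m, ci in c.items() if ci > 0}
        if now_available != available:  # depletion, feeding, or secretion
            available = now_available
            mu, v_ex = solve_me(available)  # re-solve only on such events
    return c, X
```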
DynamicME with protein inertia constraints
The second implementation accounts for protein abundance at the previous time step (i.e., protein “inertia”). This implementation requires modifying the ME model formulation. Thus, at each time step, we solve the following optimization problem (2):
$$ \begin{aligned} \max_{\mu, v,p, \delta} \quad & \mu \\ \mathrm{s.t.} \quad & S(\mu) v = 0 \\ & v^{\text{form}}_{i} - \mu p_{i} = \delta_{i}, \ \forall i \in {Complex} \\ & \sum_{j\in {CAT}(i)} \frac{v_{ij}}{k^{\text{eff}}_{ij}} \leq p_{i}, \ \forall i \in {Complex} \\ & p_{i} = p^{0}_{i} + \delta_{i} H \\ & l(\mu) \leq v \leq u(\mu) \\ & v_{i}^{\text{form}}\geq 0, \ \forall i \in {Complex} \\ & p_{i} \geq 0, \ \forall i \in {Complex}, \end{aligned} $$
(2)
where μ is the growth rate, \(v\in \mathbb {R}^{n}\) the vector of fluxes (metabolic and expression processes), \(v_{i}^{\text {form}} \geq 0\) the flux of the protein complex formation reaction for complex i (ComplexFormation reactions in the underlying ME model [2], which convert protein subunits into a complex according to defined complex stoichiometry), \(p \in \mathbb {R}^{k}\) (with \(p \geq 0\)) the vector of protein complex concentrations, \(\delta \in \mathbb {R}^{k}\) the vector of protein complex concentration differences, \(p^{0} \in \mathbb {R}^{k}\) the vector of protein complex concentrations at the previous time step, \(S\in \mathbb {R}^{m\times n}\) the stoichiometric matrix that constrains the m components (metabolites and macromolecules), and l, u the lower and upper flux bounds. H is the time horizon (in hours), which determines the anticipated time window over which the proteome is re-allocated. The value of H need not equal the simulation time step when (2) is solved inside the procedure InertiaDynamicME; in this study, we used H=2 hours. Complex is the set of protein complexes that are dynamically constrained, and CAT(i) is the set of reactions catalyzed by protein complex (enzyme) i. Note that the original ME network reconstruction accounts for mass balance of all metabolites and macromolecules, including protein complexes. Because we now allow accumulation (δi>0) or depletion (δi<0) of complexes, we remove from S(μ)v=0 the mass balance constraints on protein complexes in the set Complex and instead use the constraints \(v_{i}^{\text {form}} - \mu p_{i} = \delta _{i}, \ \forall i \in {Complex}\).
In the protein inertia implementation, we solve Problem (2) at every iteration, rather than re-solving only when environmental conditions change. Re-solving at every iteration is necessary because the intracellular protein abundances now potentially change at every iteration (i.e., whenever δi≠0); therefore, the metabolic fluxes are also subject to change at every iteration, whether or not extracellular conditions change. The inertia-constrained DynamicME procedure is described in Procedure InertiaDynamicME.
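A schematic of this loop is sketched below, assuming a hypothetical solve_inertia_me(p, H) wrapper that solves Problem (2) given the previous complex concentrations and returns the growth rate, the concentration changes δ, and the exchange fluxes.

```python
import numpy as np

def inertia_dynamic_me(solve_inertia_me, p0, c0, X0, dt, t_batch, H=2.0):
    """Sketch of InertiaDynamicME: Problem (2) is re-solved at every step.

    p0: protein complex concentrations from the previous step; H: time
    horizon (h) used inside Problem (2), which need not equal dt.
    """
    p, c, X, t = dict(p0), dict(c0), X0, 0.0
    while t < t_batch:
        mu, delta, v_ex = solve_inertia_me(p, H)  # re-solve every iteration
        for i in p:  # protein inertia: complexes accumulate or deplete
            p[i] = max(p[i] + delta[i] * dt, 0.0)
        for m in c:
            c[m] = max(c[m] + v_ex.get(m, 0.0) * X * dt, 0.0)
        X *= float(np.exp(mu * dt))
        t += dt
    return p, c, X
```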
Variable time step procedure (MinTimeStep)
At every time step, we compute the concentration for the next time step based on the current concentration ci of each extracellular metabolite i. If the updated concentration would become negative, i.e., the time step was too large, we compute a new time step according to the formula \(\Delta t_{\text{new},M} = \min \{ c_{i}/(-v_{i} X) : i = 1,\ldots,p \}\), where the minimum runs over consumed metabolites (vi<0), p is the number of extracellular metabolites whose concentrations are simulated, and X is the biomass concentration.
Similarly, we also compute a new time step if an intracellular protein concentration would fall below zero with the current time step. In this case, we use the formula \(\Delta t_{\text{new},P} = \min \{ p_{i}/(-\delta_{i}) : i = 1,\ldots,k \}\), where the minimum runs over depleting complexes (δi<0). The time step is finally computed as \(\Delta t = \min \{ \Delta t_{0}, \Delta t_{\text{new},M}, \Delta t_{\text{new},P} \}\), where Δt0 is the nominal time step. If Δt differs from Δt0, we re-solve the optimization problem using the updated time step. This way, we ensure that changes to the environment or to intracellular concentrations are accounted for in the simulation, regardless of the initial time step.
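A direct translation of these formulas into a helper function (a sketch with illustrative names):

```python
def min_time_step(dt0, c, v_ex, X, p, delta):
    """Shrink the time step so that no concentration crosses zero.

    c, v_ex: extracellular concentrations and exchange fluxes;
    p, delta: protein complex concentrations and their rates of change;
    X: biomass concentration; dt0: the nominal time step.
    Only consumed metabolites (v < 0) and depleting complexes (delta < 0)
    can drive a concentration negative, so the minima run over those.
    """
    dt_M = min((c[m] / (-v_ex[m] * X) for m in c if v_ex.get(m, 0.0) < 0),
               default=float("inf"))
    dt_P = min((p[i] / (-delta[i]) for i in p if delta.get(i, 0.0) < 0),
               default=float("inf"))
    return min(dt0, dt_M, dt_P)
```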
Model calibration using literature data
A number of adaptive laboratory evolution (ALE) studies have demonstrated that the proteome of wild-type E. coli is not optimally allocated or efficient for every nutrient [27–29]. To accurately reflect this wild-type proteome state, we calibrated the model with respect to several known enzymatic features of wild-type E. coli. First, glycerol kinase is known to be significantly less efficient in the wild type than in ALE endpoints [30]. Second, ALE on lactate minimal medium revealed multiple limitations in lactate utilization and in enzymes near the phosphoenolpyruvate (PEP) node [27]. Third, respiration is known to have a higher proteomic cost than fermentation, leading to acetate overflow [31]. Based on these observations, we calibrated the effective rate constants (keff) (see section below for details). For example, we imposed a realistic turnover rate for isocitrate dehydrogenase based on literature data, effectively increasing the proteomic cost of respiration. All calibrated parameters are listed in Additional file 1: Table S1 along with their original and adjusted values. Also, the oxygen uptake rate was constrained to −20 mmol/gDW/h to reflect transport limitations not captured by the proteome cost model.
Sensitivity analysis of dynamic simulations
In the ME model, effective rate constants (keff) relate metabolic flux v to enzyme concentration e by the relationship v=keff·e [11]. Precise estimates of these parameters are not available for many reactions and enzymes; therefore, an important step in ME model-based studies has been to assess the sensitivity of predictions to these uncertainties [32]. In this study, we investigated the sensitivity of DynamicME predictions to uncertainties in keff. We perturbed keff values between 0.1 and 10 times their nominal values. To avoid exploring the full parameter space, which comprises thousands of keff values, we chose relevant pathways and perturbed only the reactions in these pathways (Additional file 1: Table S1). We generated 200 random samples. Perturbed ME models with a good fit to the measured metabolite concentration profiles were treated as an ensemble. The exact determination of ensembles is described below.
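The sampling step can be reproduced in a few lines. This sketch assumes multipliers drawn log-uniformly on [0.1, 10] (an assumption; the text above fixes only the 0.1x–10x range) and hypothetical reaction identifiers.

```python
import numpy as np

def perturb_keffs(keff_nominal, n_samples=200, seed=0):
    """Draw random keff samples between 0.1x and 10x the nominal values.

    keff_nominal: dict mapping reaction IDs (restricted to the chosen
    pathways) to nominal keff values. Multipliers are sampled
    log-uniformly, so 0.1x and 10x are symmetric on a log scale.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        factors = 10.0 ** rng.uniform(-1.0, 1.0, size=len(keff_nominal))
        samples.append({rxn: k * f for (rxn, k), f
                        in zip(keff_nominal.items(), factors)})
    return samples
```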
Archetypal analysis and ensemble of models
Archetypal analysis [33–35] is a dimension-reduction method in which any data point is approximated as a convex combination of the computed archetypes; in turn, each archetype is a convex combination of the data points [33]. Each archetype lies on the convex hull of the data and represents a “pure” phenotype. In our study, we performed archetypal analysis on randomly perturbed samples of model-predicted time-course metabolite concentration profiles (Fig. 3). Thus, each archetype represents a distinct phenotype with a particular substrate utilization hierarchy.
To prepare the data for archetypal analysis, timepoints and metabolites were collapsed into a single feature dimension, resulting in a 2D matrix X of features (timepoints-and-metabolites) × samples. Archetypal analysis then approximates X as X≈ZA, where Z is the matrix of archetypes and A is a matrix of coefficients with the constraints Aij≥0 and \(\sum _{j=1}^{p} A_{ij}=1\) for p archetypes (Fig. 3d). Thus, X is approximated as a convex combination of archetypes. The matrix of archetypes Z is constrained as Z=XB, with the coefficient matrix Bij≥0, \(\sum _{j=1}^{p} B_{ij}=1\) for p archetypes; therefore, the archetypes are constrained to be convex combinations of the data points X.
Once archetypes were determined, the proteome and exchange flux dynamic profiles were also mapped to the archetypes using B. The best number of archetypes was chosen using the elbow method from a scree plot [36] (Additional file 2: Figure S1). Archetypal analysis was performed using the spams Python module [35].
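A minimal sketch of this analysis with the spams module follows; the archetypalAnalysis call and its returnAB option are taken from the SPAMS documentation and should be verified against the installed version.

```python
import numpy as np
import spams  # SPAMS toolbox [35]; verify the API against your version

# Illustrative data: rows are features (timepoint-metabolite pairs),
# columns are the perturbed-model samples. SPAMS expects Fortran-ordered
# float64 arrays.
rng = np.random.default_rng(0)
X = np.asfortranarray(rng.random((60, 200)))

n_archetypes = 4  # chosen from the scree plot via the elbow method
# With returnAB=True, the coefficient matrices are returned as well,
# so that X ~ Z A and Z = X B (A, B are sparse in SPAMS's convention).
Z, A, B = spams.archetypalAnalysis(X, p=n_archetypes, returnAB=True)

A = np.asarray(A.todense())
# Columns of A are nonnegative and sum to 1: each sample is a convex
# combination of archetypes (and each archetype of data points, via B).
assert np.all(A >= -1e-8) and np.allclose(A.sum(axis=0), 1.0, atol=1e-6)
rel_err = np.linalg.norm(X - Z @ A) / np.linalg.norm(X)
```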
Optimal parameter estimation via metaheuristic optimization
We developed an optimization-based procedure to match time-course concentration profiles by estimating keff values. For optimization, we used a gradient-free metaheuristic method (list-based threshold accepting) [37] because of its efficiency and flexibility. We developed a parallel implementation of this optimization method for increased efficiency (Additional file 3: Figure S2). The implementation allows each parallel node (CPU thread) to choose between following its local search trajectory or restarting the search from the current best solution. This parallel communication was implemented using MPI (Message Passing Interface) via the mpi4py Python module [38]. The objective function was the sum of squared errors between measured and predicted extracellular metabolite concentration profiles.
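For reference, a serial sketch of list-based threshold accepting is shown below (simplified from [37]; the parallel implementation adds the MPI restart-from-best logic described above). Here objective would be the sum of squared errors and perturb a random move in keff space; both names are illustrative.

```python
def threshold_accepting(objective, perturb, x0, n_thresholds=20, max_iter=1000):
    """Serial list-based threshold accepting (simplified sketch).

    A list of thresholds is seeded from random trial moves; a worse
    candidate is accepted when its degradation is below the current
    maximum threshold, and each such acceptance replaces that maximum,
    so the threshold list tightens as the search proceeds.
    """
    x, fx = x0, objective(x0)
    thresholds = sorted(max(objective(perturb(x)) - fx, 0.0)
                        for _ in range(n_thresholds))
    best_x, best_f = x, fx
    for _ in range(max_iter):
        cand = perturb(x)
        delta = objective(cand) - fx
        if delta <= 0:  # improvement: always accept
            x, fx = cand, fx + delta
            if fx < best_f:
                best_x, best_f = x, fx
        elif delta < max(thresholds):  # controlled uphill move
            thresholds.remove(max(thresholds))
            thresholds.append(delta)
            x, fx = cand, fx + delta
    return best_x, best_f
```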
Model validation using time-series expression profiles
To validate proteome allocation predictions, we computed the time-lagged cross-correlation between simulated and measured time-course proteome profiles. Lagged cross-correlation measures the similarity between two time series when one lags the other and has been particularly useful for analyzing time-course expression profiles. For example, it was used to study regulatory interactions of galactose metabolism in E. coli [39]. To compute lagged cross-correlations, we used the R function ccf [40].
We obtained microarray hybridization intensity values over time points from Beg et al. [9] and log2-transformed these values for further analysis. The log2-transformed measurements were compared against simulated protein mass fractions. We define the mass fraction in two ways. First, for the ME model without protein inertia constraints, the protein mass fraction is \(f_{j} = v^{\text {trsl}}_{j} w_{j}/\sum _{j'\in {Prot}}\left (v^{\text {trsl}}_{j'} w_{j'}\right)\), where \(v_{j}^{\text {trsl}}\) is the translation rate of protein j, wj is its molecular weight, and Prot is the set of all proteins in the ME model. Second, for the ME model with protein inertia constraints, the mass fraction is \(f_{j} = p_{j} w_{j} / \sum _{j'\in {Prot}} (p_{j'} w_{j'})\), where pj is the enzyme concentration, which is a variable in this modified ME model.
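Both definitions are straightforward to evaluate from a solution; a brief sketch with illustrative names:

```python
def mass_fractions_from_translation(v_trsl, w):
    """Mass fractions for the model without inertia constraints.

    v_trsl: translation fluxes by protein ID; w: molecular weights.
    """
    total = sum(v_trsl[j] * w[j] for j in v_trsl)
    return {j: v_trsl[j] * w[j] / total for j in v_trsl}

def mass_fractions_from_concentration(p, w):
    """Mass fractions for the inertia model, using concentrations p."""
    total = sum(p[j] * w[j] for j in p)
    return {j: p[j] * w[j] / total for j in p}
```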
We first made the time intervals consistent between the measured and simulated expression profiles. To do so, we determined the smaller of the two profiles’ time intervals (measured or simulated) and linearly interpolated each profile onto a grid with this interval. In this study, the time interval was 0.1 h.
To determine the lagged cross-correlation for the entire simulated proteome, we iterated through each lag value, ranging from −1.7 to 1.7 h, and chose the lag corresponding to the highest median cross-correlation across all proteins.
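An equivalent computation can be sketched in Python (the study used R’s ccf [40]; this per-lag Pearson version approximates it and uses dummy data on the common 0.1 h grid):

```python
import numpy as np

def lagged_cross_correlation(x, y, max_lag):
    """Pearson correlation of y against x at each integer lag (in samples).

    Positive lag means y trails x; this approximates the per-lag output
    of R's ccf for one pair of series (ccf normalizes slightly differently).
    """
    corrs = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        corrs[lag] = float(np.corrcoef(a, b)[0, 1])
    return corrs

# Dummy profiles interpolated onto the common 0.1 h grid (as in the text).
t_grid = np.arange(0.0, 8.0, 0.1)
sim = np.interp(t_grid, [0.0, 4.0, 8.0], [0.0, 0.5, 1.0])
obs = np.interp(t_grid, [0.0, 4.0, 8.0], [0.1, 0.4, 1.1])
cc = lagged_cross_correlation(sim, obs, max_lag=17)  # +/- 1.7 h in 0.1 h steps
# Across the proteome, the lag with the highest median correlation over
# all proteins would be selected, as described above.
best_lag = max(cc, key=cc.get)
```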