Equilibrium model selection: dTTP induced R1 dimerization

Radivoyevitch, Tomas

doi:10.1186/1752-0509-2-15

Methodology article
Open access
Published: 04 February 2008

Tomas Radivoyevitch¹

BMC Systems Biology volume 2, Article number: 15 (2008)

Abstract

Background

Biochemical equilibria are usually modeled iteratively: given one or a few fitted models, if there is a lack of fit or overfitting, a new model with additional or fewer parameters is fitted, and the process is repeated. The problem with this approach is that different analysts can propose and select different models and thus extract different binding parameter estimates from the same data. An alternative is to first generate a comprehensive standardized list of plausible models, and then to fit them exhaustively or semi-exhaustively.

Results

A framework is presented in which equilibria are modeled as pairs (g, h), where g = 0 maps total reactant concentrations (system inputs) into free reactant concentrations (system states), which h then maps into expected values of measurements (system outputs). By letting dissociation constants K_d be either freely estimated, infinity, zero, or equal to other K_d, and by letting undamaged protein fractions be either freely estimated or 1, many g models are formed. A standard space of g models for ligand-induced protein dimerization equilibria is given. Coupled to an h model, the resulting (g, h) were fitted to dTTP induced R1 dimerization data (R1 is the large subunit of ribonucleotide reductase). Models with the fewest parameters were fitted first. Thereafter, upon fitting a batch, the next batch of models (with one more parameter) was fitted only if the current batch yielded a model that was better (based on the Akaike Information Criterion) than the best model in the previous batch (with one less parameter). Within batches, models were fitted in parallel. This semi-exhaustive approach yielded the same best models as an exhaustive model space fit, but in approximately one-fifth the time.

Conclusion

Comprehensive model space based biochemical equilibrium model selection methods are realizable. Their significance to systems biology as mappings of data into mathematical models warrants their development.

Background

Ribonucleotide reductase (RNR) has a small subunit R2 that exists almost exclusively as a dimer, and a large subunit R1 that dimerizes when dTTP, dGTP, dATP, or ATP binds to its specificity site, and hexamerizes when dATP or ATP binds to its activity site [1–6]. Thus, R1 is the backbone of a biochemical equilibrium network that contains a large number of R1 complexes. This network has more dissociation constants (K_d) than can be estimated from currently available data, so assumptions must be made to reduce the number of independent K_d. These assumptions come in two forms: those that state that, for the data at hand, a K_d is too large or too small to be distinguished from infinity or zero, respectively, and those that state that the data are too weak to rule out a null hypothesis of the form K_d = K'_d. Model parameters such as the fraction of R1 capable of forming dimers and hexamers, and the enzymatic activities of these R1 states, also come with plausible null hypotheses. In general, different null hypotheses define different models that yield different estimates of the freely estimated parameters. Unfortunately, as modelers traverse a path of reasonable hypotheses until they arrive at a model that provides both a good fit and K_d confidence interval limits that are not too wide, they often stop at different places, and thus report different K_d values. Such differences in extracted K_d estimates could be reduced if a systematic, reproducible approach to biochemical equilibrium model building were established. Progress toward this goal is described in this paper.

Results

Model

Consider a dataset composed of N steady state non-covalent binding equilibria indexed by n, in which J different complexes can potentially form from a protein R of known total concentration T_{n1} through interactions with itself and I - 1 other reactants (e.g. substrate, effectors and other proteins) of known total concentrations T_{ni} (1 < i ≤ I). Suppose W_{ij} copies of the i th reactant exist in the j th complex and that a particular R molecule is either undamaged with probability p, and thus capable of forming each of the plausible complexes, or damaged with probability 1 - p, and thus incapable of forming any complexes. Define T_n = (T_{n1}, T_{n2}, ..., T_{nI}), F_n = (F_{n1}, F_{n2}, ..., F_{nI}) as the corresponding free reactant concentrations, K = (K_1, K_2, ..., K_J) as the dissociation constants (of complexes to free reactants), y_n as the measurement(s) made at the n th steady state, and Z_n = (Z_{n1}, Z_{n2}, ..., Z_{nJ}) as the concentrations of complexes predicted by W, K and F_n to be

Z_{n j} = \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{K_{j}} .

(1)

The relationship between the system inputs (T_n), states (F_n) and outputs (y_n) is then modeled by I total concentration constraints

g(F_n, T_n, K, p) = 0

that must be solved for the I free reactant concentrations F_n at each n (1 ≤ n ≤ N) given the inputs T_n, and an output measurement model h that connects F_n to expected values of the outputs E(y_n)

y_n= h(F_n, K, p, L) + ε_n

where all of the h specific parameters (e.g. k_cat's and protein masses) are contained in the vector L and, if the y_n are vectors of measurements, the ε_n are vectors of zero mean noise, potentially correlated within steady states, but uncorrelated between steady states; only scalar y_n are considered hereafter. The model parameters K, p and L are not indexed by n because they are fitted jointly to the entire dataset, i.e. one set of estimates of these parameters describes all N steady states simultaneously as one (g, h) model of one underlying biochemical equilibrium network.

System models

The I equations of a system model g = 0 are

\begin{array}{rcl} g_{1} (F_{n}, T_{n}, K, p) & = & p T_{n 1} - F_{n 1} - \sum_{j = 1}^{J} W_{1 j} \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{K_{j}} = 0 \\ g_{i} (F_{n}, T_{n}, K) & = & T_{n i} - F_{n i} - \sum_{j = 1}^{J} W_{i j} \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{K_{j}} = 0 \quad (1 < i \leq I) \end{array}

(2)

where pT_{n1} is the total concentration of undamaged R and F_{n1} is the concentration of free R that is undamaged and thus capable of forming complexes. If all biologically plausible candidate complexes are present in these equations, the model will have as many K parameters as possible, and it will therefore be called a full model. A space of g = 0 models can then be generated from this full model through combinations of null hypothesis constraints on the parameters in (K, p).

Fitting a particular (g, h) to data (T, y) to estimate parameters in (K, p, L) demands many repeated solutions of g = 0. These equations must be solved efficiently to fit large model spaces and models with large numbers of parameters. The approach proposed here solves g = 0 by letting g be the right-hand side of a parent set of ordinary differential equations (ODEs) that achieves g = 0 at steady state. Specifically, the following ODEs were simulated to large τ to solve the polynomial system in Eqs. (2):

\begin{array}{rcl} \frac{d F_{n 1}}{d \tau} & = & p T_{n 1} - F_{n 1} - \sum_{j = 1}^{J} W_{1 j} \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{K_{j}} \\ \frac{d F_{n i}}{d \tau} & = & T_{n i} - F_{n i} - \sum_{j = 1}^{J} W_{i j} \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{K_{j}} \end{array}

(3)

where 1 < i ≤ I, n = 1...N and F_{ni}(0) = 0. Note that the initial conditions guarantee that the system derivatives are initially positive and thus that the system always starts in an acceptable direction; model parameters are constrained to positive values, expressed internally as e^c, where c is unconstrained during optimization.
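
The relaxation idea behind Eqs. (3) can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation distributed with the paper: the function names g_rhs and solve_g are hypothetical, a fixed-step Euler integrator stands in for lsoda with compiled C code purely to avoid dependencies, and the step size, number of steps and example values (a single RRtt complex with K = 17 μM³, [R_T] = 7.6 μM, [t_T] = 5 μM) were chosen only for illustration.

```r
# Right-hand side of Eqs. (3), i.e. the residuals g of Eqs. (2).
# W is an I x J matrix (reactants x complexes); row 1 is the protein R.
g_rhs <- function(Fr, Tn, K, W, p = 1) {
  # Eq. (1): Z_j = prod_i F_i^W_ij / K_j (K_j = Inf removes complex j)
  Z <- apply(W, 2, function(w) prod(Fr^w)) / K
  # Only the protein total is scaled by the undamaged fraction p
  Tn * c(p, rep(1, length(Tn) - 1)) - Fr - as.vector(W %*% Z)
}

# Integrate dF/dtau = g(F) from F(0) = 0 with fixed-step Euler
# (assumption: h is small relative to the fastest time scale).
solve_g <- function(Tn, K, W, p = 1, h = 0.005, nsteps = 4000) {
  Fr <- rep(0, length(Tn))
  for (s in seq_len(nsteps)) Fr <- Fr + h * g_rhs(Fr, Tn, K, W, p)
  Fr
}

W   <- matrix(c(2, 2), nrow = 2)            # RRtt holds 2 R and 2 t
Fss <- solve_g(Tn = c(7.6, 5), K = 17, W = W)
print(Fss)                                  # free [R] and [t]
print(g_rhs(Fss, c(7.6, 5), 17, W))         # residuals, close to zero
```

A fixed step is adequate only while it stays small relative to the fastest time scale of Eqs. (3); an adaptive stiff solver such as lsoda is the safer choice when many models with widely varying K values must be fitted.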

The system of polynomials in Eqs. (2) has been solved by others using other approaches. In one approach, the F_{ni} terms are pulled to the left-hand side and guesses are then iteratively entered into the right-hand side until the equations become self-consistent [7]. This approach has more recently been shown to fail in cases of oligomerization, and modifications of the approach have been suggested [8]. The difficulties of solving systems of arbitrary nonlinear algebraic equations in general have been described [9], and a common approach (e.g. used by fsolve in Matlab) has been to minimize the sum of squares g² using Levenberg-Marquardt or Gauss-Newton methods. Intuitively, methods that exploit the fact that the equations are strictly polynomials should outperform these general methods. Continuation homotopy is one such method [10]. In this method, polynomials are homogenized to a larger polynomial system with known solutions, and these solutions are then traced to the desired solutions as the homogenized polynomials are continuously morphed back to the original polynomial system. On a practical level, all complex initial solutions must be tracked to find the desired final solution that is strictly real and positive, and this makes the approach slower than the R [11] implementation of Eqs. (3) provided here, which finds only the positive real root and does so rapidly because it automatically generates and compiles C code (of Eqs. 3) that is then used with the dll/so option of the ODE solver lsoda available in R [11]. To glean some insight into why Eq. (3) works, note that the g_i (i.e. the right-hand sides) are all initially positive, and all are monotonically decreasing functions of increasing free concentrations. Free concentration differentials thus start positive and shrink toward zero as the free concentrations move out of their initial values at the origin and into the positive quadrant. When a component F_{ni} of the vector F_n crosses its steady state value, the corresponding g_i switches sign, since the g_i continue to decrease monotonically through zero, and F_{ni} is then driven back toward a smaller value, i.e. back toward the steady state value that it just crossed. This explains why the proposed algorithm is stable. Finally, an alternative approach to the problem is to solve g = 0 using full-blown kinetic equations with irrelevant time scales defined by k_on = 1 and k_off = K_d, but the number of ODEs then equals the number of complex species plus the number of reactants, rather than just the number of reactants as in Eqs. 3, and although each ODE is computationally simpler in this case, the savings per ODE do not offset the added cost of the additional ODEs. This added cost is expected to become substantial, if not prohibitive, in combinatorially complex scenarios wherein the number of complexes is very large relative to the number of reactants.

K hypotheses

In the g = 0 model in Eqs. (2), the elements of K are defined as

K_{j} = \frac{\prod_{i^{'} = 1}^{I} F_{n i^{'}}^{W_{i^{'} j}}}{Z_{n j}} .

(4)

This definition can differ by stoichiometric factors from K_d defined as k_off/k_on. For example, consider a system where R can bind a ligand t and R can also form dimers. Figure 1 shows the state transitions of this system from a state of i, j, k, l, m and n molecules of R, t, Rt, RR, RRt and RRtt, respectively, per unit volume, where the unit volume is small enough that any reactant can react equally well with any other reactant, yet large enough that these integers are approximately equal to themselves plus or minus one or two. If net fluxes between states are zero, the system is in equilibrium and the following definitions of K_d ≡ k_off/k_on arise

\begin{array}{rcl} k_{on.R0} \, i (i - 1) / 2 & = & k_{off.R0} \, (l + 1) \quad \Rightarrow \\ K_{d\_R\_R} & \equiv & k_{off.R0} / k_{on.R0} = \frac{i (i - 1)}{2 (l + 1)} \approx \frac{[R] [R]}{2 [R R]} \end{array}

(5)

\begin{array}{rcl} k_{on.R1} \, i k & = & k_{off.R1} \, (m + 1) \quad \Rightarrow \\ K_{d\_Rt\_R} & \equiv & k_{off.R1} / k_{on.R1} = \frac{i k}{m + 1} \approx \frac{[R t] [R]}{[R R t]} \end{array}

(6)

\begin{array}{rcl} k_{on.R2} \, k (k - 1) / 2 & = & k_{off.R2} \, (n + 1) \quad \Rightarrow \\ K_{d\_Rt\_Rt} & \equiv & k_{off.R2} / k_{on.R2} = \frac{k (k - 1)}{2 (n + 1)} \approx \frac{[R t] [R t]}{2 [R R t t]} \end{array}

(7)

\begin{array}{rcl} k_{on.t0} \, i j & = & k_{off.t0} \, (k + 1) \quad \Rightarrow \\ K_{d\_R\_t} & \equiv & k_{off.t0} / k_{on.t0} = \frac{i j}{k + 1} \approx \frac{[R] [t]}{[R t]} \end{array}

(8)

\begin{array}{rcl} k_{on.t1} \, j \, 2 l & = & k_{off.t1} \, (m + 1) \quad \Rightarrow \\ K_{d\_RR\_t} & \equiv & k_{off.t1} / k_{on.t1} = \frac{j \, 2 l}{m + 1} \approx \frac{2 [R R] [t]}{[R R t]} \end{array}

(9)

\begin{array}{rcl} k_{on.t2} \, j m & = & k_{off.t2} \, 2 (n + 1) \quad \Rightarrow \\ K_{d\_RRt\_t} & \equiv & k_{off.t2} / k_{on.t2} = \frac{j m}{2 (n + 1)} \approx \frac{[R R t] [t]}{2 [R R t t]} . \end{array}

(10)

In Eqs. 5 and 7, x(x - 1)/2 is the number of unique binary interactions of x molecules with themselves. The stoichiometric factor in Eq. (9) arises because RR has twice as many ways to gain a t as RRt has ways to lose a t, and in Eq. 10 it arises because RRtt has twice as many ways to lose a t as RRt has ways to gain a t. Eqs. 9 and 10 assume that RR and RRtt are symmetric dimers.

Regarding differences between the K_d in Eqs. (5–10) and the K_j in Eq. (4), the K_d always have units of concentration because they always correspond to two molecules binding together at one time, and the K_j have units of concentration raised to integer powers $\sum_{i^{'} = 1}^{I} W_{i^{'} j} - 1$ that can be greater than 1 (in such cases the K_j represent several sequential binding steps condensed into one, e.g. see Table 1). In general, the K_d are associated with grid-shaped equilibrium network graphs such as those shown in Figure 2, and the K_j are associated with spur-shaped equilibrium graphs such as those shown in Figure 3. Notationally, subscripts of the K_j will be distinguishably devoid of d's and underscores, e.g. $K_{R R t t} = \frac{{[R]}^{2} {[t]}^{2}}{[R R t t]}$ is the K_j of graph M in Figure 3.

Table 1 K_j assignment model definitions

In the graphs shown in Figure 2, it is plausible to conjecture a priori that any two or all three of K_{d_R_R}, K_{d_Rt_R} and K_{d_Rt_Rt} are equal, i.e. that the binding of t to R has no impact on R binding to itself. Similarly, it is plausible that any two or all three of K_{d_R_t}, K_{d_RR_t} and K_{d_RRt_t} are equal. These two sets of hypotheses are not independent, since K_d products of two paths between the same two nodes must be equal. For example, in Figure 2A, starting with free reactants, the two paths to RRt are

\begin{array}{rcl} K_{d\_R\_R\_t} & = & K_{d\_R\_R} K_{d\_RR\_t} = \frac{[R] [R]}{2 [R R]} \frac{2 [R R] [t]}{[R R t]} = \frac{{[R]}^{2} [t]}{[R R t]} \\ K_{d\_R\_R\_t} & = & K_{d\_R\_t} K_{d\_Rt\_R} = \frac{[R] [t]}{[R t]} \frac{[R t] [R]}{[R R t]} = \frac{{[R]}^{2} [t]}{[R R t]} \end{array}

(11)

and the two paths to RRtt are

\begin{array}{rcl} K_{d\_R\_R\_t\_t} & = & K_{d\_R\_R} K_{d\_RR\_t} K_{d\_RRt\_t} = \frac{[R] [R]}{2 [R R]} \frac{2 [R R] [t]}{[R R t]} \frac{[R R t] [t]}{2 [R R t t]} = \frac{{[R]}^{2} {[t]}^{2}}{2 [R R t t]} \\ K_{d\_R\_R\_t\_t} & = & K_{d\_R\_t}^{2} K_{d\_Rt\_Rt} = \frac{[R] [t]}{[R t]} \frac{[R] [t]}{[R t]} \frac{[R t] [R t]}{2 [R R t t]} = \frac{{[R]}^{2} {[t]}^{2}}{2 [R R t t]} . \end{array}

(12)

Similarly, the two paths from the node [Rt R t] to RRtt yield

\begin{array}{rcl} K_{d\_Rt\_R} K_{d\_RRt\_t} & = & \frac{[R t] [R]}{[R R t]} \frac{[R R t] [t]}{2 [R R t t]} = \frac{[R t] [R] [t]}{2 [R R t t]} \\ K_{d\_R\_t} K_{d\_Rt\_Rt} & = & \frac{[R] [t]}{[R t]} \frac{[R t] [R t]}{2 [R R t t]} = \frac{[R t] [R] [t]}{2 [R R t t]}, \end{array}

(13)

though these could have been obtained from (11) and (12). Based on Eqs. (11), either of K_{d_R_t} = K_{d_RR_t} and K_{d_R_R} = K_{d_Rt_R} implies the other, and based on Eqs. (13), either of K_{d_R_t} = K_{d_RRt_t} and K_{d_Rt_R} = K_{d_Rt_Rt} implies the other. Such constraints yield the K_d equality hypotheses shown in Fig. 2. This space of K_d equality models was generated from the fully constrained Model A by releasing pairs of R binding equality constraints and counterpart t binding constraints one at a time. When two R binding constraints are released, all three R binding constants become independent, and this leaves only one permissible t-binding constraint (Model E) or none (Model F). Models with one fewer node (G to N) are then considered; the two Rt nodes act as one. Models with two or more nodes removed do not allow K_d equality constraints, and in these cases K_j defined by Eq. 4 are adequate; such models are shown in Figure 3.

The Rt system full model special case of g = 0 in Eqs. (2), with T_n = ([R_T], [t_T]), F_n = ([R], [t]), Z_n = ([Rt], [RR], [RRt], [RRtt]), and thus

W = \begin{pmatrix} 1 & 2 & 2 & 2 \\ 1 & 0 & 1 & 2 \end{pmatrix},

(14)

is

\begin{array}{rcl} 0 & = & p [R_{T}] - [R] - \frac{[R] [t]}{K_{R t}} - 2 \frac{{[R]}^{2}}{K_{R R}} - 2 \frac{{[R]}^{2} [t]}{K_{R R t}} - 2 \frac{{[R]}^{2} {[t]}^{2}}{K_{R R t t}} \\ 0 & = & [t_{T}] - [t] - \frac{[R] [t]}{K_{R t}} - \frac{{[R]}^{2} [t]}{K_{R R t}} - 2 \frac{{[R]}^{2} {[t]}^{2}}{K_{R R t t}} . \end{array}

(15)

These g = 0 equations correspond to graph A in Figure 3. As K_j = ∞ assumptions are applied to these equations to remove specific terms one at a time, two at a time, and so on, corresponding nodes are removed from graph A to create graphs B to P and thus models that conjecture that the deleted nodes/complexes are not detectable above noise. Of these models, the J single edge models (L to O) can have additional K_j = 0 assumptions applied to them to generate J additional g models (Q to T), each alleging that the free concentration of the reactant that is not in excess (i.e. ligand or R) is indistinguishable from zero (i.e. at a level too low to be detected using the data at hand). In such models, K_j = 0 is handled either by approximating 0 by a small number (e.g. 0.0001; this option is readily automated, but pushing it too far causes numerical problems) or by replacing the equations with rules (e.g. if K_RRtt = 0 as in Model 3R, the rule would be: if [R_T] < [t_T], then [R] = 0 and [RRtt] = [R_T]/2; otherwise [R_T] ≥ [t_T] and thus [R] = [R_T] - [t_T] and [RRtt] = [t_T]/2; this option remains to be automated). In the end, a spur graph (e.g. 3A) with J edges generates 2^J models via K_j = ∞ assumptions and an additional J models via K_j = 0 assumptions, e.g. the 2⁴ + 4 = 20 models in Fig. 3. Considering that J is the number of complex species, which can be large, the number of g models generated can be huge.
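
The 2^J + J count can be reproduced by direct enumeration. The following base R sketch is illustrative only: the labels "free", "Inf" and "0" and the variable names are assumptions of this example and are not taken from the code distributed with the paper.

```r
complexes <- c("Rt", "RR", "RRt", "RRtt")
J <- length(complexes)

# Each complex is either kept (K_j freely estimated) or dropped (K_j = Inf),
# giving 2^J combinations; the all-"Inf" row is the model with no complexes.
grid <- expand.grid(rep(list(c("free", "Inf")), J), stringsAsFactors = FALSE)
names(grid) <- complexes

# Single-edge models (exactly one complex kept) each get one K_j = 0 variant.
zero <- grid[rowSums(grid == "free") == 1, ]
zero[zero == "free"] <- "0"
rownames(zero) <- NULL

space <- rbind(grid, zero)
print(nrow(space))   # 2^4 + 4 models for the Rt system
```

Because the table grows as 2^J, enumeration of this kind is practical only for small J; larger systems would need the space to be generated and pruned lazily.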

The models in Figs. 2 and 3 are characterized by their assignments to the four K_j parameters in Eq. 15 as shown in Table 1. This table defines a standard space of K hypothesis g models for ligand induced protein dimerization equilibria. As Models F, H, J, L and N in Fig. 2 do not have any K_d equality constraints, their data fitting capabilities are equal to those of Models A through E in Fig. 3, respectively. To see this, consider the first of the two rows labeled 3A,2F in Table 1. Eqs. (5) and (8) give K_RR = 2K_{d_R_R} and K_Rt = K_{d_R_t}, Eq. (11) gives K_RRt = K_{d_R_t}K_{d_Rt_R}, which can be adjusted independently by the factor K_{d_Rt_R}, and Eq. (12) gives K_RRtt = 2K_{d_R_t}²K_{d_Rt_Rt}, which can be adjusted independently by K_{d_Rt_Rt}. Thus, all four of the K_j parameters of 3A can be independently manipulated to arbitrary values by the four K_d parameters of 2F, and in this sense, the two models are equivalent. A major difference, however, is that 2F can be represented in more than one way. Indeed, two choices are given by the two 3A,2F rows in Table 1, and all of the graphs in Figure 2 can be parameterized as subsets of either the E-shaped or ⊓-shaped parameterization topologies given in these two full model rows.
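
The equivalence of the two parameterizations can be checked numerically. In the sketch below, the two mapping functions follow the K_d to K_j relations of Eqs. (5), (8), (11) and (12); the numerical K_d values are arbitrary, and the conversion from the E-shaped to the ⊓-shaped parameter set via the path constraints of Eqs. (11) and (13) is written out as an assumption of this example.

```r
# E-shaped parameters: (Kd_R_t, Kd_R_R, Kd_Rt_R, Kd_Rt_Rt)
Eshape <- function(x) c(x[1], 2 * x[2], x[1] * x[3], 2 * x[1]^2 * x[4])
# n-shaped parameters: (Kd_R_t, Kd_R_R, Kd_RR_t, Kd_RRt_t)
nshape <- function(x) c(x[1], 2 * x[2], x[2] * x[3], 2 * x[2] * x[3] * x[4])

kdE <- c(R_t = 30, R_R = 85, Rt_R = 2, Rt_Rt = 0.5)   # arbitrary values

# Path constraints: Kd_R_R * Kd_RR_t  = Kd_R_t * Kd_Rt_R   (Eq. 11)
#                   Kd_Rt_R * Kd_RRt_t = Kd_R_t * Kd_Rt_Rt  (Eq. 13)
kdN <- c(R_t   = kdE[["R_t"]],
         R_R   = kdE[["R_R"]],
         RR_t  = kdE[["R_t"]] * kdE[["Rt_R"]] / kdE[["R_R"]],
         RRt_t = kdE[["R_t"]] * kdE[["Rt_Rt"]] / kdE[["Rt_R"]])

KjE <- unname(Eshape(kdE))
KjN <- unname(nshape(kdN))
print(rbind(KjE, KjN))   # both rows give (K_Rt, K_RR, K_RRt, K_RRtt)
```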

The nine grid graphs in Fig. 2 that contain at least one K_d = K'_d constraint have |K_j| > |K_d|, where |K_x| is the number of freely estimated K_x parameters. Meanwhile, models that are equally well represented by both grid and spur graphs are characterized by |K_j| = |K_d|, which, in Fig. 3, is all of the graphs except I, J, M, N, R and S. These exceptions must use spur graphs to avoid non-identifiability problems, have |K_j| < |K_d|, include complexes without including required intermediates, and have K_d = ∞ in product expressions that remain finite (see Table 1 footnote). Such models are palatable only because they represent statistical null hypotheses rather than physical null hypotheses, i.e. K_d = ∞ is a claim that the true value of K_d is too large to estimate based on the data at hand, and not a claim that binding never occurs.

p hypotheses

The probability that an R molecule is undamaged can be hypothesized to be close enough to 1 that the data cannot discriminate it from being 1. If B different protein preparation batches (indexed by b) are used in the experiments, 2^B hypotheses exist. Hypotheses of the form p_b = p_b', i.e. that two batches are equivalent, can also be formulated. In the equations given above and in the data analysis given below, B = 1 is assumed.

Measurement models h

In pairs (g, h), the system of interest g is separated from the methods used to study it in h. h maps steady states F_n of g into expected values of measurements E(y_n). The first step in this, common to all h models, is to convert the F_n into complex concentration predictions Z_n using Eq. (1), i.e. using W and K. The second step is to form E(y_n) from F_n and Z_n and any other available information (e.g. L and p; note that T_n can be reconstructed from F_n and Z_n). This second step is different for different measurement types, as illustrated below for average protein mass, fraction of protein bound to a particular ligand, and average enzymatic activity of a distribution of enzyme states.

average mass

Suppose R is the only protein in the system, that ligand masses are too small to be detected relative to protein masses, and that average protein mass measurements are mass-weighted, e.g. as in dynamic light scattering data [1–3]. The second step of h for this type of measurement is then

E (y_{n}) = M_{1} \frac{[R] + [R_{T}] (1 - p_{b}) + \sum_{j = 1}^{J} Z_{n j} W_{1 j}^{2}}{[R_{T}]}

(16)

where M₁ is the mass of R monomer.
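
Eq. (16) translates directly into code. The sketch below is a hedged illustration: the function name h_mass is hypothetical, the monomer mass M1 = 90 (kDa) is an assumed placeholder value, and the free concentrations are supplied by hand rather than obtained by solving g = 0.

```r
# Mass-weighted average mass, Eq. (16). W is I x J with row 1 = protein R.
h_mass <- function(Fr, K, W, RT, p = 1, M1 = 90) {
  Z <- apply(W, 2, function(w) prod(Fr^w)) / K      # Eq. (1)
  # free monomer + damaged monomer + complexes weighted by W_1j^2
  M1 * (Fr[1] + RT * (1 - p) + sum(Z * W[1, ]^2)) / RT
}

W <- matrix(c(2, 2), nrow = 2)   # single complex RRtt

# No dimer (K = Inf): all of R_T = 4 is free monomer
m_mono <- h_mass(Fr = c(4, 2), K = Inf, W = W, RT = 4)
# [R] = 2, [t] = 2, K = 16 gives [RRtt] = 1, so R_T = 2 + 2 * 1 = 4
m_mix  <- h_mass(Fr = c(2, 2), K = 16, W = W, RT = 4)
print(c(monomer = m_mono, mixture = m_mix))
```

The squared stoichiometry reflects the mass weighting: a dimer contributes two monomers' worth of protein, each counted at twice the monomer mass.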

fraction bound

For fraction of protein bound to ligand data, suppose the ligand of interest is the i th reactant. The fraction of R bound to ligand is then

E (y_{n}) = (\sum_{j = 1}^{J} Z_{n j} W_{i j}) / [R_{T}] .

(17)

enzyme activity

If k_catj is the per-active-site enzymatic activity of the j th complex, the measured average activity of an ensemble of complexes is

E (y_{n}) = (\sum_{j = 1}^{J} k_{cat j} Z_{n j} W_{1 j}) / [R_{T}] .

(18)

It is assumed here that R provides all of the enzymatic activity and that it has only one active site.
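
Eqs. (17) and (18) can be sketched in the same way. All numerical values below (free concentrations, K_j and k_cat values) are arbitrary and chosen so that the arithmetic is easy to follow, and the function names frac_bound and activity are hypothetical.

```r
# Full Rt model: rows = reactants (R, t), columns = complexes
W <- rbind(R = c(1, 2, 2, 2), t = c(1, 0, 1, 2))
colnames(W) <- c("Rt", "RR", "RRt", "RRtt")

Fr <- c(R = 1, t = 2)                          # free concentrations (arbitrary)
K  <- c(Rt = 2, RR = 4, RRt = 8, RRtt = 16)    # K_j values (arbitrary)

Z  <- apply(W, 2, function(w) prod(Fr^w)) / K  # Eq. (1)
RT <- unname(Fr["R"] + sum(W["R", ] * Z))      # total R, taking p = 1

# Eq. (17): ligand i bound per total R
frac_bound <- function(Z, W, RT, i) sum(Z * W[i, ]) / RT
# Eq. (18): average activity, k_cat per active site of each complex
activity <- function(Z, W, RT, kcat) sum(kcat * Z * W[1, ]) / RT

fb  <- frac_bound(Z, W, RT, i = 2)
act <- activity(Z, W, RT, kcat = c(1, 0, 1, 1))
print(c(fraction_bound = fb, activity = act))
```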

h space

Enzyme activity differs from the other two measurement types in that its parameters can have many plausible null hypotheses: the k_catj can be equal to zero or to each other within groups defined in various ways. Thus, Eq. (18) can generate a space of h models. When such a space is multiplied into a g space, not all h models can be paired with any g, since, for example, if a K_j is infinity in a g model, the corresponding product complex concentration is zero, so a corresponding k_cat cannot be estimated. Thus, although to first order |(g, h)| = |g||h|, where |x| is the number of x models, this is actually an upper bound.

dTTP induced R1 dimerization data analysis

Let R be the R1 subunit of ribonucleotide reductase and let t be dTTP. Using h in Eq. (16), Scott et al. [1] fitted Model 2E with p = 1 to their dynamic light scattering data shown in Figure 4. Their final parameter estimates are shown as the initial parameter estimates in Table 2. That these estimates did not converge properly (the authors used a method similar to that of Storer and Cornish-Bowden [7] to solve their g = 0 equations) is evidenced by the poor fit of the solid curve in Figure 4 relative to its fully converged counterpart computed here using the g = 0 solver described above (Eq. 3; dotted curve). The consequences of this poor fit are seen to be substantial in Table 2, where many of the K_d estimates have initial values that differ from their final converged counterparts by an order of magnitude. The final K_d estimates are, however, very uncertain, with upper-to-lower 95% confidence interval (CI, see Methods) limit ratios of ~10⁶, i.e. Model 2E is overparameterized.

Table 2 Parameter estimates corresponding to Figure 4

Given knowledge that R has a binding site for t and that R can dimerize [12], the model space in Table 1, doubled by p free or fixed to 1 and coupled to h in Eq. 16, creates 58 (g, h) candidate models that were fitted to these data. The fitted models were ranked by the Akaike Information Criterion (AIC, see Methods) and the best model was 3Rp (p freely estimated) with K_RRtt = 0.0001 μM³, essentially fixed to zero (dashed straight lines in Figure 4; Table 2). This model represents a tight binding titration limit wherein free molecule annihilation (the initial linear ramp in Fig. 4) continues in a one-to-one fashion with increasing [dTTP_T] until [dTTP_T] equals [R1_T] = 7.6 μM, the plateau point beyond which all dimerizable R has dimerized. The second best model (dashed-dotted in Figure 4) was 3M (p fixed to 1) with K_RRtt freely estimated as 17 μM³. This second best model is the best model when recent gel filtration data [4] shown in Table 3 are also included in the analysis, see Table 4 (2E ranked 20th and 13th in Tables 2 and 4 in exhaustive model space fits and was not even fitted by the semi-exhaustive method described next).
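
The ramp-and-plateau shape of the tight binding limit follows from the K_RRtt = 0 rule combined with Eq. (16). The base R sketch below is illustrative only: the function names are hypothetical, M1 = 90 is an assumed placeholder monomer mass, and extending the rule to p < 1 by replacing [R_T] with p[R_T] is an assumption of this example rather than something stated in the text.

```r
# Tight-binding rule for the limit K_RRtt -> 0 (RRtt is the only complex)
tight_RRtt <- function(RT, tT, p = 1) {
  Rd <- p * RT                       # dimerizable R (assumption for p < 1)
  if (Rd < tT) {
    c(R = 0, RRtt = Rd / 2)          # ligand in excess: all Rd is in RRtt
  } else {
    c(R = Rd - tT, RRtt = tT / 2)    # R in excess: all ligand is in RRtt
  }
}

# Mass-weighted average mass, Eq. (16), with W_1j^2 = 4 for RRtt
mass_tight <- function(RT, tT, p = 1, M1 = 90) {
  s <- tight_RRtt(RT, tT, p)
  unname(M1 * (s["R"] + RT * (1 - p) + 4 * s["RRtt"]) / RT)
}

tT <- c(0, 3.8, 7.6, 15)             # total dTTP, in the same units as RT
m  <- sapply(tT, function(x) mass_tight(RT = 7.6, tT = x))
print(cbind(tT, mass = m))           # linear ramp, then plateau at 2 * M1
```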

Table 3 Rofougaran et al.'s R1 dimerization data

Table 4 Joint Data Analysis

Semi-exhaustive model selection

The semi-exhaustive model selection algorithm is: (1) create a list of all of the candidate models; (2) sort it according to the number of freely estimated parameters in each model; (3) fit all of the models with the fewest parameters; (4) fit all models with one additional parameter; and (5) repeat step 4 as long as the best model in the current batch has a better AIC than the best model in the previous batch. In the case of the Rt system, compared to exhaustive fits to the entire space of 58 (g, h) models, this algorithm stops before fitting the most time-consuming over-parameterized models (those with three or more parameters), yet it identifies the exact same top 13 (Table 2) and top 7 (Table 4) models. CPU times to compute Tables 2 and 4, expressed as exhaustive to semi-exhaustive ratios, averaged 4.7 (4.3/0.89, 5.8/1.25, in minutes/minutes) when using 4 CPUs and 5.9 (14.8/2.5, 20.3/3.5) when using 1 CPU. Rewritten, quad processor gains averaged 3.5 (14.8/4.3, 20.3/5.8) for exhaustive fits and 2.8 (2.5/0.89, 3.5/1.25) for semi-exhaustive fits, i.e. the semi-exhaustive approach loses some parallel processing efficiency because some CPUs become idle while the last models in a batch are fitted.
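
The batch logic of steps (1) to (5) can be sketched as follows. This is a toy under stated assumptions, not the fitMS implementation: the function semi_exhaustive and the model list are hypothetical, and the "fit" simply looks up a made-up AIC stored with each model so that the control flow can be followed without real data.

```r
semi_exhaustive <- function(models, fit_aic) {
  P <- vapply(models, function(m) m$P, numeric(1))
  out <- data.frame(name = character(0), P = numeric(0), AIC = numeric(0),
                    stringsAsFactors = FALSE)
  best_prev <- Inf
  for (p in sort(unique(P))) {                  # batches by parameter count
    batch <- models[P == p]
    aic <- vapply(batch, fit_aic, numeric(1))   # could be run in parallel
    out <- rbind(out, data.frame(
      name = vapply(batch, function(m) m$name, character(1)),
      P = p, AIC = aic, stringsAsFactors = FALSE))
    if (min(aic) >= best_prev) break            # no improvement: stop
    best_prev <- min(aic)
  }
  out[order(out$AIC), ]
}

# Toy model list with made-up AIC values (hypothetical)
models <- list(
  list(name = "m1a", P = 1, aic = 10),  list(name = "m1b", P = 1, aic = 12),
  list(name = "m2a", P = 2, aic = 8),   list(name = "m2b", P = 2, aic = 9),
  list(name = "m3a", P = 3, aic = 9.5), list(name = "m4a", P = 4, aic = 7))

res <- semi_exhaustive(models, function(m) m$aic)
print(res)   # m4a is never fitted because the P = 3 batch did not improve
```

The toy values are chosen to expose the trade-off: the search stops after the three-parameter batch, so a larger model is skipped even if it would have scored better, which is the price paid for the reduced fitting time.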

Implementation

R code is provided to ensure reproducibility of the results. It is also provided because it may be useful in other ligand induced protein dimerization data analyses. The following script illustrates its use.

setwd("/home/radivot/case/active/rnr/Rt/R")
load("RNR.RData") # load RNR adata
source("fRt.r") # function definitions
# the next line generates and compiles C code
g=mkgObj("Rt", c("Rt","RR","RRt","RRtt"))
RtData=adata[c("f1a01")] # Scott et al 2001 Rt data
# these map Kd into Kj as shown in Table 1
Eshape<-function(x)
  c(x[1], 2*x[2], x[1]*x[3], 2*x[1]^2*x[4])
nshape<-function(x)
  c(x[1], 2*x[2], x[2]*x[3], 2*x[2]*x[3]*x[4])
models=list(
  mkModelObj(RtData, g, "2E",
    Kdparams=c(R_t=30, R_R=85, RR_t=.55, RRt_t=.55),
    Keq=c(RRt_t="RR_t"), Kd2Kj=nshape),
  mkModelObj(RtData, g, "3Rp",
    Kjparams=c(Rt=Inf, RR=Inf, RRt=Inf, RRtt=0),
    pparams=c(pRT=1)),
  mkModelObj(RtData, g, "3M",
    Kjparams=c(Rt=Inf, RR=Inf, RRt=Inf, RRtt=1))
)
fitMS(models,"MS2")

In this script, load loads the RNR data provided in Additional File 1 and source reads in the function definitions provided in Additional File 2. The main function, fitMS, fits the model space (2E, 3Rp, 3M) and writes the results to html and LaTeX files. It can be passed options to specify the number of CPUs and the choice of semi-exhaustive or exhaustive fitting. A script that fits all 58 (g, h) models is provided as Additional File 3.

Discussion

The most common approach to modeling is to manually identify several plausible models, fit them all, and accept the best in the lot, e.g. [13, 14]. This approach works because human intuition carries external information that guides the choice of the initial lot. If the best model does not provide a good fit, or if it has parameters with very large confidence intervals, the lot can be augmented to include additional models with more or fewer parameters, respectively. The advantage of this approach is that only a handful of models needs to be fitted. The disadvantage is that different analysts can obtain different results. In general, a model/hypothesis (e.g. that the experimental data cannot discriminate some K_j from ∞ or zero, or that some K_d equal others) is rejected if it is not among the best models selected, and supported if it is. Although inferences made from any model, including the best models, are always conditional on the truth of the model's assumptions, the likelihood of this truth increases as the model withstands elimination. This statement is valid only to the extent that alternative hypotheses are represented in the model space. For example, if a K_d = K'_d model assumes symmetric oligomers (e.g. as in Eqs. 9 and 10) and the model space does not include counterpart models that assume asymmetric forms, the selection process can lend no additional support to the symmetry assumptions. On the other hand, if independent data support such symmetry assumptions, the use of a restricted model space may be acceptable. It is anticipated that large model spaces will generate many models that are roughly equally best. Overall inferences should then reflect an average of the inferences of the best models, perhaps weighted by some metric of closeness to the optimum. Methods of accomplishing this for (g, h) models are an important area of future work. Another important area is automated model space enumeration: although this can be readily achieved for K_d = ∞ or 0 spur graphs, it remains a challenge to achieve this for K_d = K'_d grid graphs.

Conclusion

The process of extracting K estimates from data is inseparable from the process of (g, h) model selection. This process requires clear statements of the model space explored, the criterion used to rank models, and the method used to search the space. If standards can be developed for these entities, analyst-to-analyst variations in inferences made from identical datasets could be reduced.

Methods

Data procurement

Plot Digitizer [15] was used to digitize the data of Scott et al. shown in Fig. 4. These data were originally given with model-dependent free concentrations on the x-axis. Such x values were converted to total concentrations using the model and parameter values given by Scott et al. [1]. The data in Table 3 are from Fig. 1 of [4]. They were kindly provided by Dr. Anders Hofer.

Model selection

With P equal to the number of freely estimated model parameters, N equal to the number of steady state data points, and SSE equal to the sum of squared errors of the fitted model, the Akaike Information Criterion [16] used here has the form AIC = 2P + N log(SSE/N) + $\frac{2 P (P + 1)}{N - P - 1}$ [17]. This explicit metric states how much goodness of fit (SSE) one is willing to sacrifice to gain the benefit of one fewer parameter. For a given model, P and N are fixed, so AIC minimization reduces to SSE minimization by least squares.
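
The AIC expression above is simple to compute once a model's SSE is known. In the sketch below, the function name aicc and the SSE values are hypothetical and serve only to show how the extra parameter is penalized.

```r
# Small-sample corrected AIC in the least-squares form given above
aicc <- function(sse, n, p) {
  2 * p + n * log(sse / n) + 2 * p * (p + 1) / (n - p - 1)
}

# Made-up example: 20 data points, a 1-parameter and a 2-parameter model
a1 <- aicc(sse = 4.0, n = 20, p = 1)
a2 <- aicc(sse = 3.8, n = 20, p = 2)
print(c(one_parameter = a1, two_parameters = a2))
# The smaller value is preferred: here the modest drop in SSE does not
# pay for the additional parameter.
```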

Parameter estimation

Best fitting SSEs were found by nonlinear least squares using the optim function in R [11] with the Nelder-Mead [18] option for P > 1, the BFGS option for P = 1, and the Hessian option set to TRUE (see Additional Files). Hessians of the SSEs evaluated at the optimum were divided by 2, inverted, and multiplied by the mean squared error, MSE = SSE/(N - P), to compute parameter estimate covariance matrices. From these, parameter estimate standard deviations were taken as the square roots of the main diagonal, and these were then multiplied by 1.96 to approximate 95% CIs. All parameters were estimated as e^c to constrain point estimates and CIs to positive values.
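
The covariance and interval calculation described above can be sketched in Python as follows. This is not the R code of the Additional Files: the function names and the numerical Hessian are hypothetical, and the sketch assumes that the Hessian is taken with respect to the log-scale parameters c and that the interval c ± 1.96 SD is exponentiated to give a positive interval for e^c.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def positive_intervals(c_hat, sse_hessian, sse, n_points):
    """Return (lower, estimate, upper) for each e^c, following the text."""
    p = len(c_hat)
    mse = sse / (n_points - p)                           # MSE = SSE/(N - P)
    halved = [[h / 2.0 for h in row] for row in sse_hessian]
    cov = [[mse * v for v in row] for row in invert(halved)]
    sds = [math.sqrt(cov[i][i]) for i in range(p)]       # sqrt of diagonal
    return [(math.exp(c - 1.96 * s), math.exp(c), math.exp(c + 1.96 * s))
            for c, s in zip(c_hat, sds)]


# Hypothetical optimum and SSE Hessian for a two-parameter model.
intervals = positive_intervals(c_hat=[0.0, math.log(2.0)],
                               sse_hessian=[[4.0, 0.0], [0.0, 16.0]],
                               sse=2.0, n_points=12)
```

The division by 2 reflects the fact that, near the optimum, the Hessian of the SSE is approximately twice the cross-product of the model Jacobian, whose inverse scaled by the MSE is the usual least squares covariance estimate.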

References

1. Scott CP, Kashlan OB, Lear JD, Cooperman BS: A quantitative model for allosteric control of purine reduction by murine ribonucleotide reductase. Biochemistry. 2001, 40 (6): 1651-61. 10.1021/bi002335z
2. Kashlan OB, Scott CP, Lear JD, Cooperman BS: A comprehensive model for the allosteric regulation of mammalian ribonucleotide reductase. Functional consequences of ATP- and dATP-induced oligomerization of the large subunit. Biochemistry. 2002, 41 (2): 462-74. 10.1021/bi011653a
3. Kashlan OB, Cooperman BS: Comprehensive model for allosteric regulation of mammalian ribonucleotide reductase: refinements and consequences. Biochemistry. 2003, 42 (6): 1696-706. 10.1021/bi020634d
4. Rofougaran R, Vodnala M, Hofer A: Enzymatically active mammalian ribonucleotide reductase exists primarily as an alpha6beta2 octamer. J Biol Chem. 2006, 281 (38): 27705-11. 10.1074/jbc.M605573200
5. Ingemarson R, Thelander L: A kinetic study on the influence of nucleoside triphosphate effectors on subunit interaction in mouse ribonucleotide reductase. Biochemistry. 1996, 35 (26): 8603-9. 10.1021/bi960184n
6. Wang J, Lohman GJ, Stubbe J: Enhanced subunit interactions with gemcitabine-5'-diphosphate inhibit ribonucleotide reductases. Proc Natl Acad Sci USA. 2007, 104 (36): 14324-9. 10.1073/pnas.0706803104
7. Storer AC, Cornish-Bowden A: Concentration of MgATP2- and other ions in solution. Calculation of the true concentrations of species present in mixtures of associating ions. Biochem J. 1976, 159: 1-5.
8. Kuzmic P: Fixed-point methods for computing the equilibrium composition of complex biochemical mixtures. Biochem J. 1998, 331 (Pt 2): 571-5.
9. Press WH: Numerical recipes in C: the art of scientific computing. 1988, Cambridge [Cambridgeshire]; New York: Cambridge University Press
10. Sommese A, Wampler C: The Numerical Solution of Systems of Polynomials Arising in Engineering and Science. 2005, Singapore: World Scientific Publishing Company
11. Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807
12. Reichard P: Interactions between deoxyribonucleotide and DNA synthesis. Annu Rev Biochem. 1988, 57: 349-74. 10.1146/annurev.bi.57.070188.002025
13. Schlee S, Carmillo P, Whitty A: Quantitative analysis of the activation mechanism of the multicomponent growth-factor receptor Ret. Nat Chem Biol. 2006, 2 (11): 636-44. 10.1038/nchembio823
14. Kuzmic P, Cregar L, Millis SZ, Goldman M: Mixed-type noncompetitive inhibition of anthrax lethal factor protease by aminoglycosides. Febs J. 2006, 273 (13): 3054-62. 10.1111/j.1742-4658.2006.05316.x
15. Plot Digitizer. http://plotdigitizer.sourceforge.net/
16. Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705
17. Burnham KP, Anderson D: Multimodel Inference: understanding AIC and BIC in Model Selection. Workshop on Model Selection, Amsterdam. 2004
18. Nelder J, Mead R: A simplex algorithm for function minimization. Computer Journal. 1965, 7: 308-313.

Acknowledgements

I thank Dr. Hofer for his data (Table 3) and the referees for their suggestions. This work was supported by the National Cancer Institute under grant number K25CA104791. It does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.

Author information

Authors and Affiliations

Department of Epidemiology and Biostatistics, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH, 44106, USA
Tomas Radivoyevitch

Corresponding author

Correspondence to Tomas Radivoyevitch.

Additional information

Authors' contributions

TR performed all of the work and wrote the manuscript.

Electronic supplementary material

Additional File 1: RNR.RData = Data file (RDAT 3 KB)

Additional File 2: fRt.r = R function definitions (R 17 KB)

Additional File 3: Rt.r = R script used (R 8 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Radivoyevitch, T. Equilibrium model selection: dTTP induced R1 dimerization. BMC Syst Biol 2, 15 (2008). https://doi.org/10.1186/1752-0509-2-15

Received: 27 July 2007
Accepted: 04 February 2008
Published: 04 February 2008
DOI: https://doi.org/10.1186/1752-0509-2-15
