### Deriving proper propensity functions for stochastic simulations from differential equation-based models

Assuming that the GMA model faithfully captures the average behaviour of a biochemical reaction system and recalling

, the expected metabolite numbers are defined as the expectation

where Φ is the system size as defined above.

To describe the reaction channel

*R*
_{
r
} stochastically, one needs the state update vector

**v**
_{
r
}and must characterize the quantity of molecules flowing through of reaction channel

*R*
_{
r
} during a small time interval. The key concept of this type of description is the propensity function

*α*
_{
r
}(

**x**), which is defined as

[

1]. Because of the probabilistic nature of the propensity function,

**X**(t) is no longer deterministic, and the result is instead stochastic and based on the transition probability

which follows the chemical master equation (CME)

Updating CME requires knowledge of every possible combination of all species counts within the population, which immediately implies that it can be solved analytically for only a few very simple systems and that numerical solutions are usually prohibitively expensive [24]. To address the inherent intractability of CME, Gillespie developed an algorithm, called the *Stochastic Simulation Algorithm* (SSA), to simulate CME models [1]. SSA is an exact procedure for numerically simulating the time evolution of a well-stirred reaction system. It is rigorously based on the same microphysical premise that underlies CME and gives a more realistic representation of a system's evolution than a deterministic reaction rate equation represented by ODEs. SSA requires knowledge of the propensity function, which however is truly available only for elementary reactions. These reactions include: 1) 0^{th} order reactions, exemplified with the generation of a molecule at a constant rate; 2) 1^{st} order monomolecular reactions, such as an elemental chemical conversion or decay of a single molecule; 3) 2^{nd} order bimolecular reactions, including reactive collisions between two molecules of the same or different species. The reactive collision of more than two molecules at exactly the same time is considered highly unlikely and modelled as two or more sequential bimolecular reactions.

For elementary reactions, the propensity function of reaction

*R*
_{
r
} is computed as the product of a stochastic rate constant

*c*
_{
r
} and the number

*h*
_{
r
} of distinct combinations of reactant molecules, i.e.

Here
, where *x*
_{
s
} is the sample value of random variable *X*
_{
s
}. The approximation is invoked when *x*
_{
s
} is large and (*x*
_{
s
} - 1), ..., (*x*
_{
s
} - v
_{
rs
} + 1) are approximately equal to *x*
_{
s
}.

In Gillespie's original formulation [1]*c*
_{
r
} is a constant that only depends on the physical properties of the reactant molecules and the temperature of the system, and *c*
_{
r
}
*dt* is the probability that a particular combination of reactant molecules will react within the next infinitesimally small time interval (*t*, *t* + *dt*). The constant *c*
_{
r
} can be calculated from the corresponding deterministic rate constants, if they are known.

Since the assumption of mass action kinetics is not valid generally, especially in spatially restricted environments and in situations dominated by macromolecular crowding, we address the broader scenario where c

_{
r
}is not a constant but a function of the reactant concentrations. Thus, we denote c

_{
r
}as a

*stochastic rate function*, while retaining the definition of

*h*
_{
r
} as above. Knowing that any positive-valued differentiable function can be approximated locally by a power-law function, we assume the functional form of the stochastic rate function as

Here,

*κ*
_{
r
} and

*ε*
_{
rs
} are constants that will be specified in the next section, and

*r* = 1, ...,

*N*
_{
r
}. Note that

*ε*
_{
rs
} are now real-valued. Once the stochastic rate function is determined (see below), the propensity function can be calculated as

In order to identify the functional expression for a stochastic rate function, and thus the propensity function, we consider the connection between the stochastic and the deterministic equation models. By multiplying CME with

**x** and summing over all

**x**, we obtain

Similarly, the expectation for any species

*X*
_{
s
}(

*t*) is given as

The details of these derivations are shown in Additional file 1.

We can use these results directly to compute the propensity function for a stochastic GMA model, assuming that its deterministic counterpart is well defined. Specifically, we start with the deterministic GMA equation for

*X*
_{
s
},

where

*v*
_{
rs
},

*k*
_{
r
} and

*f*
_{rs' }are again the stoichiometric coefficients, rate constants, and kinetic orders, respectively. By substituting

from Equation (

18) into this GMA model, we obtain a "particle-based" equation of the format

Elementary operations allow us to rewrite this equation as

where

. In this formulation, the differential operator is justified only when large numbers of molecules are involved. The assumption that the deterministic equations precisely capture the average behaviour of the biochemical reaction system directly equates the stochastic CME (25) to the deterministic equation based model (28)

Now we have two choices for approximating the expectation of the propensity function on left-hand side of equation (29):

1) adopt a zero-covariance assumption as was done in [

25], which implies ignoring random fluctuations within every species as well as their correlations. This assumption is only justified for some special cases such as monomolecular and bimolecular reactions under the thermodynamic limit (cf. [

4,

6]), but is not necessary valid in generality. Here the thermodynamic limit is defined as a finite concentration limit which the system reaches when both population and volume approach infinity. Under this assumption, the left hand side of (29) becomes

for every

*r* = 1, ...,

*N*
_{
r
}, and Equation (

24) yields

Here, the index *r*_0 is used to distinguish this 0-covariance propensity function from a second type of propensity in the next section.

With the zero-covariance assumption, one can substitute (32) back into the equation for the expectation for each species, which yields

for every *s* = 1, ..., *N*
_{
s
}.. Note that this result is exactly equivalent to the equation-based model (27).

Equation (

33) is based on assumption that both the fluctuations within species and their correlations are ignorable, which is not necessarily true in reality. If one uses it in simulations where the assumptions are not satisfied, it is possible that the means for the molecular species are significantly different from the corresponding equation-based model values. This discrepancy arises because the evolution of each species in the stochastic simulation is in truth affected by the covariance which is not necessarily zero, as it was assumed. This phenomenon was observed by Paulsson and collaborators [

26] and further discussed in different moment-based approaches [

6,

7]. To assess the applicability limit of the propensity defined by (32), we can apply approximation techniques as shown in eqns. (8)-(10) on the functional expression of

*α*
_{r_0 }and obtain mean and variance as

for every *s* = 1, ..., *N*
_{
s
}. These expressions demonstrate that even with large numbers of molecules the mean of CME does not always converge to the GMA model. Indeed, the convergence is only guaranteed in one of the following special situations: 1) the reaction is of 0^{th} order; 2) the reaction is a real value-order monomolecular reaction, with 1^{st} order reaction as a special case; 3) the covariance contribution in (34) is sufficiently small to be ignored for all participating reactant species of a particular reaction channel. Except for these three special situations, the covariance as shown in (34) significantly affects the mean dynamics. Therefore, stochastic simulations using zero-covariance propensity functions will in general yield means different from what the deterministic GMA model produces. How large these differences are cannot be said in generality. Under the assumption that the GMA model correctly captures the mean dynamics of every species, this conclusion means that *α*
_{r_0 }is not necessarily an accurate propensity function for stochastic simulations, and the direct conversion of the equation-based model into a propensity function must be considered with caution.

Moreover, there is no theoretical basis to assume that there are no fluctuations in the molecular species or that these are independent. Therefore, we need to consider the second treatment of the expectation of the propensity function and study the possible effects of a non-zero covariance.

2) We again assume that the GMA model is well defined, which implies that information regarding the species correlations and fluctuations has been captured in the parameters of the GMA model on the left hand size of Equations (7) and (28). To gain information regarding correlations, we use Taylor expansion to approximate the propensity function (see Additional file

1 for details):

After substitution of (37) in (29), one obtains

Given the state

**x** of the system at time

*t*, the stochastic rate function of reaction

*R*
_{
r
} is

Here it is important to understand that although the random variables {*X*
_{
s
}}_{s∈S}appear in the expression *c*
_{
r
}(**x**), *c*
_{
r
}(**x**) is not a function of random variables but a deterministic function. The reason is that the cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)] in the composition of *c*
_{
r
}(**x**), which as the numerical characteristic of the random variables {*X*
_{
s
}}_{s∈S}, is deterministic. Therefore, the stochastic rate function *c*
_{
r
}(**x**) is a well-justified deterministic function that is affected by both the state of the system
and cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)], the numerical characteristic of fluctuations in the random variables {*X*
_{
s
}}_{s∈S}.

Given the expression

*c*
_{
r
}(

**x**), the propensity function is

These results are based on the assumption that there are large numbers of molecules for all reactant species participating in reaction

*R*
_{
r
}. For simplicity of discussion, we define the

*propensity adjustment factor* (paf) of reaction

*R*
_{
r
} as

*paf* is a function of time

*t* and represents the contribution of the reactants to correlations among species in the calculation of the propensity function for reaction

*R*_{
r
}. We denote the propensity function in (39), which accounts for the contribution of the covariance, as

*α*_{r_cov}, in order to distinguish it from the propensity function

*α*_{r_0 }(32), which is based on the assumption of zero-covariance,

*i.e*.,

Remembering that cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)], which is a component in both the stochastic rate function *c*
_{
r
}(**x**) and now in the function *paf*(*t*), is a deterministic function rather than a function of random variables, *paf*(*t*) is a deterministic correction to the kinetic constant *k*
_{
r
} in the construction of *α*
_{r_cov }in (41), which corrects the stochastic simulation toward the correct average.

In contrast to the propensity function

*α*
_{r_0},

*α*
_{r_cov }leads to accurate stochastic simulations. To illustrate this difference, we analyze

as follows: We apply the approximation techniques in eqns. (9)-(11) in order to obtain the mean and variance of the propensity function

*α*
_{r_cov}:

By substituting (42) back into the derivation of CME (26), one obtains

for every *s* = 1, ..., *N*
_{
s
}, which is equivalent in approximation to the GMA model (28). In the other words, the mean of every molecular species obtained by using *α*
_{r_cov }in the CME derived equation (27) is approximately identical to the corresponding macroscopic variable in the GMA model.

**Calculation of** cov [log*X*_{
i
}(*t*), log*X*_{
j
}(*t*)**]**

When data in the form of multiple time series for all the reactants are available, it is possible to compute cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)**]** directly from these data. Once this covariance is known, the function *paf*, *α*
_{r_cov }and the mean dynamics can all be assessed. Alas, the availability of several time series data for all reactants under comparable conditions is rare, so that cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)**]** must be estimated in a different manner.

If one can validly assume that the covariance based on *α*
_{r_0 }does not differ significantly from the covariance based on *α*
_{r_cov}, one may calculate cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)**]** by one of following methods.

Method 1:

One uses *α*
_{r_0 }to generate multiple sets of time series data of all reactants and then computes cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)**]**.

Method 2:

First, cov [log

*X*
_{
i
}(

*t*), log

*X*
_{
j
}(

*t*)

**]** is expressed as a function of mean and covariance in one of the following ways; either as

The first functional expression of cov [log*X*
_{
i
}(*t*), log*X*
_{
j
}(*t*)**]** is achieved by Taylor approximation, whereas the second expression is obtained by the additional assumption that the concentrations (*X*
_{1}, ..., *X*
_{
s
}) are log-normally distributed [8, 23]. The consideration of a log-normal distribution is often justified by the fact that many biochemical data have indeed been observed to be log-normally distributed (*e.g*., [20–22]).

Second, one uses *α*
_{r_0 }to approximate the mean and covariance either by direct simulation, as shown in method 1, or by a moment-based approach, which is explained in Additional file 2, and which yields the differential equations

For convenience of computational implementation, the above equations can be written in matrix format

Here for *r* = 1, ..., *N*
_{
r
}, and *s*, *m*, *n* = 1, ..., *N*
_{
s
},
, (*V*)_{
rs
}= *v*
_{
rs
},
,
,
,
,
,
,
, and Λ is a diagonal matrix with
.