### Markovian regulatory networks

In a network with *n* genes, a set of binary variables **V**={*X*_{1},*X*_{2},…,*X*_{n}}, *X*_{i}∈{0,1}, determines the expression states of the genes. The vector of gene expression values at time *t*, **X**(*t*)=(*X*_{1}(*t*),…,*X*_{n}(*t*)), referred to as the gene activity profile (GAP), defines the network state at each time step. In a Markovian regulatory network, network dynamics involves a trajectory of states over time governed by the transition rule **X**(*t*+1)=*f*(**X**(*t*),*w*(*t*)), *t*≥0, where *w*(*t*)∈*Ξ* captures randomness in the system and \(f:\mathcal {S}\times \Xi \rightarrow \mathcal {S}\), \(\mathcal {S}=\left \{0,1,\dots,2^{n}-1\right \}\) being the set of corresponding decimal representations for the network states, is a mapping that characterizes the state transitions in the network. The sequence of states over time can be viewed as a Markov chain characterized by a *transition probability matrix* (TPM) \(\mathbf {P}=[P_{ij}]_{i,j=0}^{2^{n}-1}\), where *P*_{ij}=Pr[**X**(*t*+1)=*j*|**X**(*t*)=*i*], Pr[ ·] being the probability operator. An ergodic Markov chain is guaranteed to possess a unique steady-state distribution **π**, such that **π**^{T}=**π**^{T}**P**, *T* being the transpose operator.

Assume that the expression state of a gene is solely determined by its regulating genes. In other words, given the values of its regulating genes, the expression state of a gene is conditionally independent of those of the other genes. Let the vector of expression states of the regulating genes for *X*_{i} be denoted by \(\Gamma _{X_{i}}\), where the ordering in \(\Gamma _{X_{i}} \) is induced from the ordering *X*_{1},*X*_{2},…,*X*_{n}. In a binary setting, if *X*_{i} has *k*_{i} regulating genes, then \(\Gamma _{X_{i}}\) can take \(2^{k_{i}}\) possible vector values. To define the regulatory relationship between gene *X*_{i} and its regulating genes, we can construct a *conditional probability matrix* (CPM) **C**(*X*_{i}) of size \(2^{k_{i}}\times 2\), where each row of the matrix corresponds to a certain combination of gene expressions in \(\Gamma _{X_{i}}\) and the first and second columns correspond to the conditional probability of gene *X*_{i} being 0 and 1, respectively, i.e.,

$$\begin{array}{*{20}l} & C_{j,1}(X_{i})=\text{Pr}\left[X_{i}=0|\Gamma_{X_{i}}=j\right], \\ & C_{j,2}(X_{i})=\text{Pr}\left[X_{i}=1|\Gamma_{X_{i}}=j\right], \end{array} $$

(1)

where by \(\Gamma _{X_{i}}=j\) we mean that the equivalent decimal value of the vector of expression states for the regulating genes of *X*_{i} is *j*. A network of *n* genes with each gene having *k*_{i} regulating genes can be completely defined by *n* CPMs **C**(*X*_{i}), 1≤*i*≤*n*, each being of size \(2^{k_{i}}\times 2\). These matrices can be used to construct the transition probability matrix. Owing to the mutual conditional independence of all genes given the values of all regulating genes, the entry *P*_{ij} of the TPM can be found as

$$\begin{array}{*{20}l} P_{ij}& =\prod_{k=1}^{n}\text{Pr}\left[X_{k}=j_{k}\left|\bigcup\limits_{l=1}^{n}\right.\Gamma_{X_{l}}[i]\right] \\ & =\prod_{k=1}^{n}\text{Pr}\left[X_{k}=j_{k}\left|\Gamma_{X_{k}}\right.[i]\right] \notag \\ & =\prod_{k=1}^{n}C_{\Gamma_{X_{k}}[i],j_{k}+1}(X_{k}), \end{array} $$

(2)

where *j*_{k} is the binary value of the *k*-th gene in state *j* and \(\Gamma _{X_{k}}[i]\) is the vector of binary values of the regulating genes for *X*_{k} extracted from the representation of state *i*. For example, consider a 3-gene network, *n*=3, in which gene *X*_{1} (*k*=1 in (2)) is regulated by genes *X*_{2} and *X*_{3}. For this network, when computing *P*_{14} (*i*=1 and *j*=4 in (2)), *j*_{k}=1 (as *X*_{1}=1 for *j*=4) and \(\Gamma _{X_{1}}[i]=(0,1)\) (as (*X*_{2},*X*_{3})=(0,1) for *i*=1).
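To make the construction in (2) concrete, the following Python sketch builds the full TPM from per-gene CPMs for a hypothetical 3-gene network with the regulator structure of the example above (gene *X*_{1} regulated by *X*_{2} and *X*_{3}); all probability values are illustrative, not taken from the paper, and *X*_{1} is taken as the most significant bit of the state index:

```python
import numpy as np

def bits(state, n):
    """Binary values (X_1, ..., X_n) of a state; X_1 is the most significant bit."""
    return [(state >> (n - 1 - k)) & 1 for k in range(n)]

def build_tpm(n, regulators, cpms):
    """Construct the 2^n x 2^n TPM from per-gene CPMs via Eq. (2).

    regulators[k]: tuple of regulator indices (0-based) for gene k.
    cpms[k]: array of shape (2^{k_k}, 2); row j holds Pr[X_k = 0 | Gamma = j]
             and Pr[X_k = 1 | Gamma = j].
    """
    N = 2 ** n
    P = np.zeros((N, N))
    for i in range(N):
        b = bits(i, n)
        for j in range(N):
            jb = bits(j, n)
            p = 1.0
            for k in range(n):
                # decimal value of the regulator vector extracted from state i
                gamma = 0
                for r in regulators[k]:
                    gamma = (gamma << 1) | b[r]
                p *= cpms[k][gamma, jb[k]]
            P[i, j] = p
    return P

# Hypothetical network: X1 regulated by (X2, X3); X2 by X1; X3 by X1.
regulators = [(1, 2), (0,), (0,)]
cpms = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]),  # C(X1)
    np.array([[0.8, 0.2], [0.3, 0.7]]),                          # C(X2)
    np.array([[0.6, 0.4], [0.5, 0.5]]),                          # C(X3)
]
P = build_tpm(3, regulators, cpms)
```

Each row of `P` sums to one because each CPM row does; for instance, *P*_{14} is the product of the three conditional probabilities picked out in the worked example above.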

From a translational perspective, the states of a network can be partitioned into two sets: desirable states \(\mathcal {D}\), associated with healthy phenotypes, and undesirable states \(\mathcal {U}\), corresponding to pathological cell functions such as cancer. The goal of therapeutic interventions is to alter the dynamical behavior of the network so as to reduce the steady-state probability \(\pi _{\mathcal {U}}={\sum \nolimits }_{i\in \mathcal {U}}\pi _{i}\) of the network occupying the undesirable states. There are two different approaches to network intervention: structural interventions and dynamical interventions. In a structural intervention, the goal is to modify the dynamical behavior of the network via a one-time change in its underlying regulatory structure [9, 33–35]. Dynamical interventions are typically studied in the framework of Markov decision processes and are characterized by control policies. These interventions usually involve changing the expression of one or more genes, called *control genes*, and can be applied either over a finite-time horizon [36–38] or an infinite-time horizon [31, 39].

### Optimal dynamical control

Network interventions in this paper belong to the category of dynamical interventions. We assume that there is a control gene *g*∈**V** whose expression value is affected by a binary control input \(c\in \mathcal {C}\), \(\mathcal {C}=\{0,1\}\). The value of *g* is flipped when *c*=1 and not flipped when *c*=0. It is straightforward to extend the results to *m* control genes, where there are 2^{m} different control actions. Let \(\mathbf {P}(c)=\left [P_{ij}(c)\right ]_{i,j=0}^{2^{n}-1}\) denote the controlled TPM, i.e.,

$$\begin{array}{*{20}l} P_{ij}(c)=\text{Pr}\left[\mathbf{X}(t+1)=j|\mathbf{X}(t)=i,c(t)=c\right]. \end{array} $$

(3)

The controlled TPM can be found using the uncontrolled TPM **P** as

$$\begin{array}{*{20}l} P_{ij}(c)=\left\{ \begin{array}{l} P_{ij}\qquad \qquad \text{if}\,\,c=0 \\ P_{\tilde{i}j}\qquad \qquad \text{if}\,\,c=1 \\ \end{array} \right., \end{array} $$

(4)

where states \(\tilde {i}\) and *i* differ only in the value of gene *g*.
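A sketch of (4) in Python, assuming the same bit convention as above (*X*_{1} in the most significant bit; the function name and gene indexing are ours):

```python
import numpy as np

def controlled_tpm(P, n, g, c):
    """Controlled TPM of Eq. (4): for c = 0 the chain is unchanged; for c = 1,
    row i of P(c) is row i~ of P, where i~ differs from i only in the bit of
    control gene g (0-based index, X_1 taken as the most significant bit)."""
    if c == 0:
        return P.copy()
    mask = 1 << (n - 1 - g)                  # bit position of gene g
    flipped = np.arange(P.shape[0]) ^ mask   # i -> i~ for every state
    return P[flipped, :]
```

Applying the control thus only permutes the rows of **P**; the columns (next states) are untouched.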

The problem of optimal control can be modeled as an optimal stochastic control problem [39]. Let the cost function \(r(i,j,c):\mathcal {S}\times \mathcal {S}\times \mathcal {C}\rightarrow \mathbb {R}\) determine the immediate cost accrued when the network transitions from state *i* to state *j* under control action *c*. This cost reflects both the desirability of states and the cost of imposing control actions: usually, larger costs are assigned to undesirable states and to transitions in which a control action is applied. The cost function is assumed to be time-invariant, bounded, and nonnegative. We consider an infinite-horizon discounted cost approach as proposed in [39], in which a discount factor 0<*ζ*<1 is used to guarantee convergence [40]. Control actions are chosen over time according to a control policy *μ*=(*μ*_{1},*μ*_{2},…), \(\mu _{t}:\mathcal {S}\rightarrow \mathcal {C}\). In this setting, given a policy *μ* and an initial state *X*_{0}, the expected total cost is

$$\begin{array}{*{20}l} J_{\mu}(X_{0})={\lim}_{M\rightarrow \infty}\mathrm{E}\left[\sum_{t=0}^{M-1}\zeta^{t} r\left(\mathbf{X}(t),\mathbf{X}(t+1),\mu_{t}(\mathbf{X}(t))\right)\left|{\vphantom{\sum}} X_{0}\right.\right], \end{array} $$

(5)

where the expectation is taken relative to the probability measure over the space of state and control action trajectories. If *Π* denotes the space of all admissible policies, we seek an optimal control policy *μ*^{∗}(*X*_{0}) such that

$$\begin{array}{*{20}l} \mu^{\ast}(X_{0})=\underset{\mu\in\Pi}{\arg\min}\, J_{\mu}(X_{0})\qquad\qquad \forall X_{0}\in\mathcal{S}. \end{array} $$

(6)

The corresponding optimal expected total cost is denoted by *J*^{∗}(*X*_{0}). It has been shown that the optimal policy *μ*^{∗}(*X*_{0}) exists and can be found by solving *Bellman’s optimality equation* [40],

$$\begin{array}{*{20}l} J^{\ast}(i)\,=\,\underset{c\in\mathcal{C}}{\min}\left[\sum\limits_{j=0}^{2^{n}-1}P_{ij}(c)\left(r(i,j,c)+\zeta J^{\ast}(j)\right)\right]\quad\forall i\in\mathcal{S}. \end{array} $$

(7)

The optimal cost *J*^{∗}=(*J*^{∗}(0),…,*J*^{∗}(2^{n}−1)) is the unique solution of (7) among all bounded functions, and the control policy *μ*^{∗} that attains the minimum in (7) is stationary, i.e., *μ*^{∗}=(*μ*^{∗},*μ*^{∗},…) [40]. To find the fixed point of Bellman's optimality equation, and thereby the optimal control policy, dynamic programming algorithms can be used, including the value iteration algorithm, which iteratively refines an estimate of the cost function.
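As an illustration of value iteration applied to (7), the following minimal Python sketch solves a toy two-state example; the controlled TPMs and cost values are hypothetical, chosen only so that the optimal policy is easy to verify by hand:

```python
import numpy as np

def value_iteration(P_c, r, zeta, n_iter=500):
    """Value iteration for Bellman's optimality equation (7).

    P_c[c]: controlled TPM under action c (shape |C| x N x N);
    r[c, i, j]: immediate cost r(i, j, c); zeta: discount factor in (0, 1).
    Returns the optimal cost vector J* and a stationary policy mu*."""
    N = P_c.shape[1]
    J = np.zeros(N)
    for _ in range(n_iter):
        # Q[c, i] = sum_j P_ij(c) * (r(i, j, c) + zeta * J(j))
        Q = np.einsum('cij,cij->ci', P_c, r + zeta * J[None, None, :])
        J = Q.min(axis=0)
    return J, Q.argmin(axis=0)

# Toy example: state 1 is "undesirable"; action c = 1 forces a jump to state 0.
P_c = np.array([np.eye(2),                   # c = 0: the chain stays put
                [[1.0, 0.0], [1.0, 0.0]]])   # c = 1: always move to state 0
r = np.zeros((2, 2, 2))
r[:, :, 1] = 1.0     # landing in the undesirable state costs 1
r[1] += 0.1          # applying the control costs an extra 0.1
J, mu = value_iteration(P_c, r, zeta=0.9)
```

Here the optimal policy leaves state 0 alone and applies the control in state 1, paying 0.1 once to escape the recurring cost of staying undesirable.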

### MOCU-based optimal experimental design framework

In this section, we review the general framework of the experimental design method in [15], which is based on the concept of the mean objective cost of uncertainty (MOCU) [10]. Let *θ*=(*θ*_{1},*θ*_{2},…,*θ*_{T}) be composed of *T* uncertain parameters in a network model. The set of all possible realizations for *θ* is denoted by *Θ* and is called an *uncertainty class*. A prior distribution *f*(*θ*) is assigned to *θ*, which reflects the likelihood of each realization of *θ* being the true value.

For each possible intervention *ψ*∈*Ψ*, where *Ψ* is the class of interventions, and each model *θ* in the uncertainty class, an error function *ξ*_{θ}(*ψ*) determines the error of *ψ* when applied to the network model *θ*. The optimal intervention *ψ*(*θ*) has the lowest error relative to model *θ*, i.e., *ξ*_{θ}(*ψ*(*θ*))≤*ξ*_{θ}(*ψ*),∀*ψ*∈*Ψ*. When dealing with an uncertainty class *Θ*, the intrinsically Bayesian robust (IBR) intervention *ψ*_{IBR}(*Θ*) is defined as

$$\begin{array}{*{20}l} \psi_{\text{IBR}}(\Theta)=\underset{\psi\in\Psi}{\arg\min}\,\mathrm{E}_{\boldsymbol{\theta}}\left[\xi_{\boldsymbol{\theta}}(\psi)\right], \end{array} $$

(8)

where the expectation is taken relative to the prior distribution *f*(*θ*).

An IBR intervention is optimal on average rather than at each specific network model *θ*; therefore, relative to *θ* an objective cost of uncertainty (OCU) can be defined as

$$\begin{array}{*{20}l} \mathrm{U}_{\Psi,\xi}(\boldsymbol{\theta})=\xi_{\boldsymbol{\theta}}(\psi_{\text{IBR}}(\Theta))-\xi_{\boldsymbol{\theta}}(\psi(\boldsymbol{\theta})). \end{array} $$

(9)

Taking the expectation of U_{Ψ,ξ}(*θ*) relative to *f*(*θ*), we obtain the mean objective cost of uncertainty (MOCU):

$$\begin{array}{*{20}l} \mathrm{M}_{\Psi,\xi}(\Theta)&=\mathrm{E}_{\boldsymbol{\theta}}[\mathrm{U}_{\Psi,\xi}(\boldsymbol{\theta})] \\ &=\mathrm{E}_{\boldsymbol{\theta}}[\xi_{\boldsymbol{\theta}}(\psi_{\text{IBR}}(\Theta))-\xi_{\boldsymbol{\theta}}(\psi(\boldsymbol{\theta}))]. \end{array} $$

(10)

MOCU measures the model uncertainty in terms of the expected increased error due to applying an IBR intervention (the chosen intervention in the presence of uncertainty) instead of an optimal intervention (the chosen intervention in the absence of uncertainty). Uncertainty quantification based on MOCU can lay the groundwork for objective-based experimental design.
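The quantities in (8)-(10) can be computed directly once the error function is tabulated. The following sketch does so for a toy discrete uncertainty class; the prior and error values are invented purely for illustration:

```python
import numpy as np

# Toy setting: 3 candidate models, 2 interventions, error table xi[theta, psi].
prior = np.array([0.5, 0.3, 0.2])
xi = np.array([[0.10, 0.30],
               [0.40, 0.20],
               [0.50, 0.25]])

ibr = np.argmin(prior @ xi)            # Eq. (8): minimize the prior-averaged error
ocu = xi[:, ibr] - xi.min(axis=1)      # Eq. (9): per-model cost of uncertainty
mocu = prior @ ocu                     # Eq. (10): average OCU under the prior
```

Here intervention 1 is IBR (it is best on average), yet it is suboptimal on model 0, and MOCU is exactly the prior-weighted price of that mismatch.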

Assume that corresponding to each parameter *θ*_{i}, there is an experiment \(\mathcal {E}_{i}\) that results in the exact determination of *θ*_{i}. The goal of the experimental design is to find which experiment should be conducted first so that model uncertainty is reduced optimally. Focusing on experiment \(\mathcal {E}_{i}\) and parameter *θ*_{i}, consider the case that the outcome of experiment \(\mathcal {E}_{i}\) is \(\theta ^{\prime }_{i}\). Then the remaining MOCU given \(\theta _{i}=\theta ^{\prime }_{i}\) is defined as

$$\begin{array}{*{20}l} &\mathrm{M}_{\Psi,\xi}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right) \\ &\qquad=\mathrm{E}_{\boldsymbol{\theta}|\theta^{\prime}_{i}}\left[\xi_{\boldsymbol{\theta}}\left(\psi_{\text{IBR}}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)\right)-\xi_{\boldsymbol{\theta}}\left(\psi\left(\boldsymbol{\theta}|\theta_{i}=\theta^{\prime}_{i}\right)\right)\right], \end{array} $$

(11)

where the expectation is taken relative to the conditional distribution \(f\left (\boldsymbol {\theta }|\theta _{i}=\theta ^{\prime }_{i}\right)\); \(\Theta |\theta _{i}=\theta ^{\prime }_{i}\) is the reduced uncertainty class obtained after determining \(\theta _{i}=\theta ^{\prime }_{i}\), and the vector \(\boldsymbol {\theta }|\theta _{i}=\theta ^{\prime }_{i}\) is obtained from the vector *θ* by setting *θ*_{i} to \(\theta _{i}^{\prime }\). Taking the expectation of (11) relative to the marginal distribution \(f\left (\theta ^{\prime }_{i}\right)\), which is in fact the marginal distribution of the parameter *θ*_{i}, we obtain the expected remaining MOCU given that experiment \(\mathcal {E}_{i}\) is carried out (or, equivalently, that parameter *θ*_{i} is determined):

$$\begin{array}{*{20}l} {}&\mathrm{M}_{\Psi,\xi}(\Theta;\theta_{i}) \\ {}&=\mathrm{E}_{\theta^{\prime}_{i}}\left[\mathrm{M}_{\Psi,\xi}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)\right] \\ {}&=\mathrm{E}_{\theta_{i}^{\prime}}\left[\mathrm{E}_{\boldsymbol{\theta}|\theta^{\prime}_{i}}\left[\xi_{\boldsymbol{\theta}}\left(\psi_{\text{IBR}}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)\right)-\xi_{\boldsymbol{\theta}}\left(\psi\left(\boldsymbol{\theta}|\theta_{i}=\theta^{\prime}_{i}\right)\right)\right]\right]. \end{array} $$

(12)

M_{Ψ,ξ}(*Θ*;*θ*_{i}) measures the pertinent uncertainty expected to remain in the model after conducting experiment \(\mathcal {E}_{i}\). The experiment \(\mathcal {E}_{i^{\ast }}\) that attains the minimum value of the expected remaining MOCU is called the *optimal experiment* and is suggested as the first experiment [15]:

$$\begin{array}{*{20}l} i^{\ast}=\underset{i\in\{1,2,\dots,T\}}{\arg\min}\,\mathrm{M}_{\Psi,\xi}(\Theta;\theta_{i}). \end{array} $$

(13)

The parameter \(\theta _{i^{\ast }}\) corresponding to \(\mathcal {E}_{i^{\ast }}\) is called the *primary parameter*. Note that (13) can be further simplified through some mathematical manipulations and by removing expressions that do not depend on the optimization variable [20]:

$$\begin{array}{*{20}l} i^{\ast}=\underset{i\in\{1,2,\dots,T\}}{\arg\min}\,\mathrm{E}_{\theta_{i}^{\prime}}\left[\mathrm{E}_{\boldsymbol{\theta}|\theta^{\prime}_{i}}\left[\xi_{\boldsymbol{\theta}}\left(\psi_{\text{IBR}}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)\right)\right]\right]. \end{array} $$

(14)

A number of experimental design methods based on the MOCU framework have been proposed in the literature [15, 19, 20]. In all of these cases, the MOCU-based experimental design can reduce the number of needed experiments significantly in comparison to other selection policies such as entropy-based experimental design, pure exploitation, or random selection policy.

### Uncertainty in transition probability matrix

Assume that the regulatory information between a gene and its regulating genes is missing for a number of genes in the network; in other words, a number of rows in the *n* conditional probability matrices are unknown. We represent the unknown conditional probabilities by a set of random variables *θ*=(*θ*_{1},*θ*_{2},…,*θ*_{T}). Since each row of a CPM adds up to one, i.e., *C*_{j,1}(*X*_{i})+*C*_{j,2}(*X*_{i})=1, each unknown row has only one degree of freedom. The uncertainty in the CPMs eventually shows up in the corresponding TPM and can thereby affect the performance of the control policy. Therefore, it is of interest to reduce the uncertainty in the CPMs. We seek an experimental design method that efficiently guides us on which unknown conditional probability to determine first.

We need to assign prior distributions to the random variables representing the unknown conditional probabilities. Assigning accurate priors is highly challenging: a prior distribution must accurately describe the current state of knowledge regarding the unknown model. It is also desirable that the prior distribution and the posterior distribution, obtained by incorporating data into the prior, belong to the same family of distributions; such a prior is referred to as a conjugate prior. Using conjugate priors, we can easily update the priors to the posteriors, which facilitates computations in a Bayesian setting since it suffices to keep track of the hyperparameters of the distributions. With this in mind, we utilize the beta distribution as the prior distribution for each unknown parameter *θ*_{i}. For a random variable *θ*_{i}, the beta distribution Beta(*α*_{i},*β*_{i}) with hyperparameters *α*_{i} and *β*_{i} has the following form:

$$\begin{array}{*{20}l} \text{Beta}(\alpha_{i},\beta_{i})=\frac{\theta_{i}^{\alpha_{i}-1}(1-\theta_{i})^{\beta_{i}-1}}{B(\alpha_{i},\beta_{i})}, \end{array} $$

(15)

where *B*(*α*_{i},*β*_{i}) is the beta function. The expected value of *θ*_{i} ∼ Beta(*α*_{i},*β*_{i}) is \(\mathrm {E}[\theta _{i}]=\frac {\alpha _{i}}{\alpha _{i}+\beta _{i}}\). When *α*_{i} = *β*_{i}=1, the beta distribution becomes a uniform distribution over interval [0,1].

We assume that *θ*_{1},*θ*_{2},…,*θ*_{T} are independent and each parameter *θ*_{i}, 1≤*i*≤*T*, has a beta distribution Beta(*α*_{i},*β*_{i}); therefore, the prior distribution of *θ*={*θ*_{1},*θ*_{2},…,*θ*_{T}} is

$$\begin{array}{*{20}l} f(\boldsymbol{\theta})=\prod_{i=1}^{T}\text{Beta}(\alpha_{i},\beta_{i})\propto \prod_{i=1}^{T} \theta_{i}^{\alpha_{i}-1}(1-\theta_{i})^{\beta_{i}-1}. \end{array} $$

(16)

In addition to the set of CPMs containing unknown conditional probabilities, observations of the network dynamics may be available in the form of a trajectory \(\mathcal {X}_{L}=\{\mathbf {X}(0),\mathbf {X}(1),\dots,\mathbf {X}(L)\}\) of *L* consecutive state transitions. The state trajectory \(\mathcal {X}_{L}\) can be utilized as an additional source of information to update the initial beta distributions to posterior beta distributions. If *θ*_{i} represents the unknown conditional probability \(C_{j,1}(X_{i})=\text {Pr}\left [X_{i}(t+1)=0|\Gamma _{X_{i}}=j\right ]\), then given a state trajectory \(\mathcal {X}_{L}\), the posterior distribution \(f(\theta _{i}|\mathcal {X}_{L})\) is again a beta distribution with new hyperparameters

$$\begin{array}{*{20}l} &\alpha^{\prime}_{i}=\alpha_{i}+\sum\limits_{l=0}^{L-1}\mathbbm{1}\left[\Gamma_{X_{i}}[\mathbf{X}(l)]=j,X_{i}(l+1)=0\right] \end{array} $$

(17)

$$\begin{array}{*{20}l} &\beta^{\prime}_{i}=\beta_{i}+\sum\limits_{l=0}^{L-1}\mathbbm{1}\left[\Gamma_{X_{i}}[\mathbf{X}(l)]=j,X_{i}(l+1)=1\right], \end{array} $$

(18)

where *X*_{i}(*l*) denotes the value of gene *X*_{i} at the *l*-th state in the trajectory, and \(\mathbbm {1}[\!\cdot ]\) is the indicator function. In other words, those state transitions in which the event corresponding to the unknown conditional probability *θ*_{i} occurs can be used to update the information about that unknown probability. Note that \(\Gamma _{X_{i}}[\mathbf {X}(l)]=j\) implies that the equivalent decimal value of the gene expression vector for the regulating genes of *X*_{i} extracted from network state **X**(*l*) should be equal to *j*. The conditional expectation of *θ*_{i} given \(\mathcal {X}_{L}\) is

$$\begin{array}{*{20}l} \mathrm{E}[\!\theta_{i}|\mathcal{X}_{L}]=\frac{\alpha^{\prime}_{i}}{\alpha^{\prime}_{i}+\beta^{\prime}_{i}}. \end{array} $$

(19)

Since the uncertainty of an unknown conditional probability is governed by the corresponding hyperparameters *α*^{′} and *β*^{′}, and since observations can only increase the *α*'s and *β*'s according to (17) and (18), the availability of a state trajectory \(\mathcal {X}_{L}\) is equivalent to a lesser initial uncertainty, and hence a simpler experimental design problem.
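The updates (17)-(19) amount to counting matching transitions in the trajectory. A sketch with a hypothetical 2-gene trajectory (the bit convention, *X*_{1} in the most significant bit, and all trajectory values are illustrative):

```python
def update_beta(alpha, beta, traj, n, i, regulators_i, j):
    """Posterior hyperparameters (17)-(18) for theta = C_{j,1}(X_i),
    i.e., Pr[X_i = 0 | Gamma_{X_i} = j].

    traj: list of network states (decimal); i: gene index (0-based);
    regulators_i: regulator indices of X_i; j: decimal regulator configuration."""
    def bit(state, k):                     # value of gene k in a state, X_1 = MSB
        return (state >> (n - 1 - k)) & 1
    for prev, nxt in zip(traj[:-1], traj[1:]):
        gamma = 0
        for r in regulators_i:
            gamma = (gamma << 1) | bit(prev, r)
        if gamma == j:                     # the regulator configuration of interest
            if bit(nxt, i) == 0:
                alpha += 1                 # Eq. (17)
            else:
                beta += 1                  # Eq. (18)
    return alpha, beta

# Hypothetical trajectory for n = 2: X1 regulated by X2, unknown row j = 1.
a, b = update_beta(1, 1, [1, 0, 1, 3], n=2, i=0, regulators_i=(1,), j=1)
mean = a / (a + b)                         # posterior mean, Eq. (19)
```

Only the two transitions whose regulator configuration equals *j* update the hyperparameters; the rest of the trajectory leaves them unchanged.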

### Optimal experimental design for determining unknown conditional probabilities

Building on the general MOCU-based experimental design framework in (8)-(14), we propose an experimental design method when dynamical controls characterized by stationary control policies are concerned. A schematic diagram of the proposed experimental design framework is given in Fig. 1. We first assign beta distributions with initial hyperparameters (*α*_{i},*β*_{i}) to each unknown conditional probability. Then if a state trajectory \(\mathcal {X}_{L}\) is available as an additional source of knowledge, it is incorporated to update the initial hyperparameters to \(\left (\alpha ^{\prime }_{i},\beta ^{\prime }_{i}\right)\) according to (17) and (18). These updated hyperparameters characterize the uncertainty class for finding the best parameter to determine using the proposed MOCU-based framework. When the first experiment is chosen and carried out, its outcome (the true value for the chosen unknown conditional probability) is incorporated in the uncertainty class, leading to a reduced uncertainty class that contains fewer uncertain parameters. If operational resources allow more experiments, this new uncertainty class can be used to find the next parameter for determination (this process can be iterated). Otherwise, the experimental design step is finished and the reduced uncertainty class is used to derive the IBR control policy based on which control actions at each time step are applied to the underlying true network.

As (14) suggests, in order to implement the experimental design, we need to derive IBR interventions. Therefore, we first focus on explaining how an IBR control policy can be derived. Considering an uncertainty class *Θ* of TPMs, relative to an initial state *X*_{0}, the average expected total discounted cost across *Θ* for control policy *μ* = (*μ*_{1}, *μ*_{2}, …) is:

$$\begin{array}{*{20}l} &{}J^{\Theta}(\mu;X_{0})=\mathrm{E}_{\boldsymbol{\theta}}[J^{\boldsymbol{\theta}}(\mu;X_{0})] \\ &{}=\mathrm{E}_{\boldsymbol{\theta}}\left[{\lim}_{M\rightarrow \infty}\mathrm{E}\left[\sum_{t=0}^{M-1}\zeta^{t} r\left(\mathbf{X}(t),\mathbf{X}(t+1),\mu_{t}(\mathbf{X}(t))\right)\! \left|{\vphantom{\sum}} X_{0}\right.\right]\right] \\ &{}={\lim}_{M\rightarrow \infty}\sum_{t=0}^{M-1}\mathrm{E}^{\ast}_{\boldsymbol{\theta}}\left[\zeta^{t} r\left(\mathbf{X}(t),\mathbf{X}(t+1),\mu_{t}(\mathbf{X}(t))\right)\! \left|{\vphantom{\sum}} X_{0}\right.\right], \end{array} $$

(20)

where \(\mathrm {E}^{\ast }_{\boldsymbol {\theta }}[\cdot ]=\mathrm {E}_{\boldsymbol {\theta }}\left [\mathrm {E}[\cdot ]\right ]\) is the expectation over both within-model stochasticity and model uncertainty. For initial state *X*_{0}, the optimal average cost is defined as

$$\begin{array}{*{20}l} J^{\Theta}(X_{0})=\underset{\mu\in\Pi}{\min}\,J^{\Theta}(\mu;X_{0}), \end{array} $$

(21)

and the minimum is attained by the IBR control policy \(\mu ^{\Theta }(X_{0})=\left (\mu ^{\Theta }_{1}(X_{0}),\mu _{2}^{\Theta }(X_{0}),\dots \right)\).

This control problem can be transformed into a dynamic programming problem of the following form for each \(i\in \mathcal {S}\) and *t*≥0:

$$\begin{array}{*{20}l} J_{t}(i)&=\underset{c\in\mathcal{C}}{\min}\,\mathrm{E}_{\boldsymbol{\theta}}\left[\mathrm{E}\left[r(i,j,c)+\zeta J_{t+1}(j)\right]\right] \notag \\ &=\underset{c\in\mathcal{C}}{\min}\,\mathrm{E}_{\boldsymbol{\theta}}\left[\sum\limits_{j=0}^{2^{n}-1}P^{\boldsymbol{\theta}}_{ij}(c)\left(r(i,j,c)+\zeta J_{t+1}(j)\right)\right] \notag \\ &=\underset{c\in\mathcal{C}}{\min}\left[\sum\limits_{j=0}^{2^{n}-1}\mathrm{E}_{\boldsymbol{\theta}}\left[P^{\boldsymbol{\theta}}_{ij}(c)\right]\left(r(i,j,c)+\zeta J_{t+1}(j)\right)\right]. \end{array} $$

(22)

We call \(\mathrm {E}_{\boldsymbol {\theta }}\left [P^{\boldsymbol {\theta }}_{ij}(c)\right ]\) the *effective controlled transition probability matrix*. It is obtained similarly to *P*_{ij}(*c*) by plugging \(P_{ij}^{\Theta }=\mathrm {E}_{\boldsymbol {\theta }}\left [P_{ij}^{\boldsymbol {\theta }}\right ]\) into (4). The *effective transition probability matrix* (ETPM) \(\mathbf {P}^{\Theta }=\left [P_{ij}^{\Theta }\right ]_{i,j=0}^{2^{n}-1}\) is obtained as

$$\begin{array}{*{20}l} P_{ij}^{\Theta}=\mathrm{E}_{\boldsymbol{\theta}}\left[P_{ij}^{\boldsymbol{\theta}}\right]=\prod_{k=1}^{n}\mathrm{E}_{\boldsymbol{\theta}}\left[\text{Pr}_{\boldsymbol{\theta}}\left[X_{k}=j_{k}\left|\Gamma_{X_{k}}\right.[i]\right]\right], \end{array} $$

(23)

where Pr_{θ}[ ·] is the probability operator relative to *θ* and E_{θ}[ ·] is taken relative to the updated prior (posterior) distribution \(f(\boldsymbol {\theta }|\mathcal {X}_{L})\). The ETPM **P**^{Θ} represents the uncertainty class of TPMs and enables finding the IBR control policy *μ*^{Θ} for an uncertainty class of TPMs in the same way that the optimal control policy for a known TPM is found. Since the *θ*_{i}’s are independent, the expectation can be brought inside the product. Each conditional probability term in (23) is either known, whose known value is used for multiplication, or is unknown and corresponds to an unknown parameter *θ*_{i} whose expected value obtained according to (19) is used in the multiplication.
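In implementation terms, forming the ETPM only requires substituting the posterior mean (19) for each unknown CPM entry before building the TPM as in (2). A minimal sketch (the helper name and its arguments are ours, for illustration only):

```python
import numpy as np

def fill_unknown_rows(cpm, unknown):
    """Replace each unknown CPM row by its expectation under the updated beta
    distribution, per Eqs. (19) and (23).

    unknown: dict mapping row index -> (alpha', beta') hyperparameters."""
    cpm = cpm.copy()
    for row, (a, b) in unknown.items():
        m = a / (a + b)                  # E[theta] = E[Pr[X = 0 | Gamma = row]]
        cpm[row] = [m, 1.0 - m]
    return cpm

# e.g., row 2 of a 4-row CPM is unknown with posterior Beta(3, 1):
C = fill_unknown_rows(np.full((4, 2), 0.5), {2: (3.0, 1.0)})
```

The resulting expected CPMs can then be fed to the same TPM construction used for a fully known network, yielding **P**^{Θ}.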

The dynamic programming formulation in (22) is similar to the dynamic programming used for optimal control, except that the known TPM has been replaced by \(\mathrm {E}_{\boldsymbol {\theta }}\left [P^{\boldsymbol {\theta }}_{ij}(c)\right ]\); therefore, an approach similar to that used for solving the optimal control dynamic programming problem can be utilized here, with the distinction that all theorems are stated relative to the ETPM. Keeping this in mind, we define a mapping \(TJ:\mathcal {S}\rightarrow \mathbb {R}\) for a bounded function \(J:\mathcal {S}\rightarrow \mathbb {R}\) and \(\forall i\in \mathcal {S}\) as

$$\begin{array}{*{20}l} TJ(i)=\underset{c\in\mathcal{C}}{\min}\left[\sum_{j=0}^{2^{n}-1}\mathrm{E}_{\boldsymbol{\theta}}\left[P^{\boldsymbol{\theta}}_{ij}(c)\right]\left(r(i,j,c)+\zeta J(j)\right)\right]. \end{array} $$

(24)

The following three theorems, whose proofs are similar to those in [40] relative to a known TPM, lay out the theoretical foundation for finding the IBR control policy.

**Theorem 1**

*(Convergence of the algorithm)* Letting \(J:\mathcal {S}\rightarrow \mathbb {R}\) be a bounded function, for any \(i\in \mathcal {S}\), the optimal average cost function *J*^{Θ}(*i*) satisfies \(J^{\Theta }(i)={\lim }_{M\rightarrow \infty }T^{M}J(i)\).

**Theorem 2**

*(Bellman’s optimality equation)* The optimal average cost function *J*^{Θ} satisfies

$$\begin{array}{*{20}l} {}J^{\Theta}(i)=\underset{c\in\mathcal{C}}{\min}\left[\sum\limits_{j=0}^{2^{n}-1}\mathrm{E}_{\boldsymbol{\theta}}\left[ P^{\boldsymbol{\theta}}_{ij}(c)\right]\left(r(i,j,c)+\zeta J^{\Theta}(j)\right)\right]\,\forall i\in\mathcal{S}. \end{array} $$

(25)

**Theorem 3**

*(Necessary and sufficient condition)* A stationary policy *μ*^{Θ} is an IBR control policy if and only if for each \(i\in \mathcal {S}\), *μ*^{Θ}(*i*) attains the minimum in Bellman’s optimality equation.

Based on Theorem 1, *J*^{Θ} can be computed recursively using the value iteration algorithm, in the same way that this algorithm is used to find the optimal control policy for a known TPM. The converged cost satisfies Bellman's optimality equation (Theorem 2), and the corresponding policy, which attains the minimum in Bellman's optimality equation with the ordinary TPM replaced by the ETPM, is a stationary IBR control policy (Theorem 3).

The concept of effective quantities has also been used for deriving IBR operators in other problems: for example, in [7] effective noise statistics are used to derive the IBR Kalman filter, and in [8] an effective covariance matrix is used to achieve IBR signal compression.

To define the experimental design problem in the context of the framework laid out in (8)-(14), let the class of interventions *Ψ* be the set of all admissible control policies *Π*. Each *ψ*∈*Ψ* is characterized by a control policy *μ* and the cost of intervention is

$$\begin{array}{*{20}l} \xi_{\boldsymbol{\theta}}(\psi)=\mathrm{E}_{X_{0}}\left[J^{\boldsymbol{\theta}}_{\mu}(X_{0})\right], \end{array} $$

(26)

where \(J^{\boldsymbol {\theta }}_{\mu }(X_{0})\) is obtained according to (5) with E[ ·] relative to the probability measure defined by the TPM **P**^{θ}. To find a single value as the cost of a control policy for a specific TPM, in (26) we take the expectation over all possible initial network states *X*_{0}, assuming that the possible initial states are equally likely. Similarly, for the IBR intervention we find a single value as its average cost:

$$\begin{array}{*{20}l} \mathrm{E}_{\boldsymbol{\theta}}\left[\xi_{\boldsymbol{\theta}}\left(\mu^{\Theta}\right)\right]=\mathrm{E}_{X_{0}}\left[J^{\Theta}(X_{0})\right], \end{array} $$

(27)

where *J*^{Θ}(*X*_{0}) is obtained using (25). The definitions of cost and intervention in (26) and (27) set the stage for objective-based uncertainty quantification in the context of dynamical control according to (10). After defining MOCU, the MOCU-based experimental design framework can be used and the primary parameter \(\theta _{i^{\ast }}\) can be found by plugging (27) into (14):

$$\begin{array}{*{20}l} i^{\ast}&=\underset{i\in\{1,2,\dots,T\}}{\arg\min}\,\mathrm{E}_{\theta_{i}^{\prime}}\left[\mathrm{E}_{\boldsymbol{\theta}|\theta^{\prime}_{i}}\left[\xi_{\boldsymbol{\theta}}\left(\psi_{\text{IBR}}\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)\right)\right]\right] \notag \\ &=\underset{i\in\{1,2,\dots,T\}}{\arg\min}\,\mathrm{E}_{\theta_{i}^{\prime}}\left[\xi_{\left(\Theta|\theta_{i}=\theta^{\prime}_{i}\right)}\left(\mu^{\Theta|\theta_{i}=\theta^{\prime}_{i}}\right)\right] \notag \\ &=\underset{i\in\{1,2,\dots,T\}}{\arg\min}\,\mathrm{E}_{\theta_{i}^{\prime}}\left[\mathrm{E}_{X_{0}}\left[J^{\mathbf{P}^{\Theta|\theta_{i}=\theta^{\prime}_{i}}}_{\mu^{\Theta|\theta_{i}=\theta^{\prime}_{i}}}(X_{0})\right]\right], \end{array} $$

(28)

where the IBR control policy for the reduced uncertainty class \(\Theta |\left (\theta _{i}=\theta ^{\prime }_{i}\right)\) is found using the ETPM \(\mathbf {P}^{\Theta |\theta _{i}=\theta ^{\prime }_{i}}\) obtained relative to the conditional probability distribution \(f\left (\boldsymbol {\theta }|\theta _{i}=\theta ^{\prime }_{i}\right)\).

According to (28), to evaluate the determination of each unknown parameter *θ*_{i}, for each realization \(\theta ^{\prime }_{i}\) of *θ*_{i} we need to obtain the average cost of the IBR control policy \(\mu ^{\Theta |\theta _{i}=\theta ^{\prime }_{i}}\) across the reduced uncertainty class \(\Theta |\left (\theta _{i}=\theta ^{\prime }_{i}\right)\), and then average these costs relative to the marginal distribution of parameter *θ*_{i}. In practice, the expression in (28) is approximated via Monte Carlo simulation: we draw a number of samples from the marginal distribution of *θ*_{i} and approximate the expression being minimized in (28) by the average of the inner expectations computed for the generated samples. The steps required for obtaining the primary parameter \(\theta _{i^{\ast }}\) are summarized in Algorithm 1. The inputs to this algorithm are the *n* CPMs characterizing the GRN, the *T* unknown parameters *θ*_{i} corresponding to unknown conditional probabilities, the hyperparameters (*α*_{i},*β*_{i}) of the prior beta distributions, the state trajectory \(\mathcal {X}_{L}\), and *ζ*, *r*, and *I*, which denote the discount factor, the cost function, and the number of value iteration steps, respectively.
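The Monte Carlo structure of this selection step can be sketched as follows; the callable `ibr_cost` is a placeholder we introduce for the expensive inner computation (value iteration on the reduced ETPM), not part of the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_experiment(hyper, ibr_cost, n_samples=100):
    """Monte Carlo approximation of Eq. (28).

    hyper: list of (alpha, beta) hyperparameters, one per unknown parameter.
    ibr_cost(i, value): average cost E_{X_0}[J(X_0)] of the IBR policy for the
    uncertainty class reduced by fixing theta_i = value (user-supplied).
    Returns the index of the primary parameter."""
    scores = []
    for i, (a, b) in enumerate(hyper):
        samples = rng.beta(a, b, size=n_samples)  # draws from the marginal of theta_i
        scores.append(np.mean([ibr_cost(i, v) for v in samples]))
    return int(np.argmin(scores))

# e.g., with a dummy surrogate in which determining parameter 1 leaves a
# lower residual cost, so it should be chosen as the primary parameter:
primary = select_experiment([(1.0, 1.0), (2.0, 2.0)],
                            lambda i, v: 1.0 if i == 0 else 0.5)
```

In the actual algorithm, each call to `ibr_cost` runs value iteration with the ETPM of the reduced uncertainty class, which is what makes this step the computational bottleneck discussed next.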

Finding IBR control policies for an uncertainty class (like finding the optimal control policy for a known TPM [41]) is computationally expensive and the complexity grows exponentially with the number of genes. Therefore, most of the computational burden of the experimental design is in finding IBR control policies. To mitigate the complexity of the proposed method, in the next section we propose an approximate method for computing IBR control policies.

### Approximate experimental design based on MFPT

The mean first passage time (MFPT) from state *i* to state *j* is the expected number of time steps the network takes to first reach state *j* starting from state *i*.

For a Markovian GRN, once the sets of desirable states \(\mathcal {D}\) and undesirable states \(\mathcal {U}\) are determined, the TPM can be partitioned as

$$\begin{array}{*{20}l} \mathbf{P}= \left[\begin{array}{ll} \mathbf{P}_{\mathcal{D},\mathcal{D}} & \mathbf{P}_{\mathcal{D},\mathcal{U}}\\ \mathbf{P}_{\mathcal{U},\mathcal{D}} & \mathbf{P}_{\mathcal{U},\mathcal{U}} \end{array}\right], \end{array} $$

(29)

where \(\mathbf {P}_{\mathcal {S}_{1},\mathcal {S}_{2}}\) contains the transition probabilities from each state in set \(\mathcal {S}_{1}\) to the states in set \(\mathcal {S}_{2}\). The vectors \(\mathbf {K}_{\mathcal {D},\mathcal {U}}\) and \(\mathbf {K}_{\mathcal {U},\mathcal {D}}\) of MFPTs from each state in \(\mathcal {D}\) to \(\mathcal {U}\) and from each state in \(\mathcal {U}\) to \(\mathcal {D}\), respectively, can be computed as [30]

$$\begin{array}{*{20}l} &\mathbf{K}_{\mathcal{D},\mathcal{U}}=\mathbf{e}+\mathbf{P}_{\mathcal{D},\mathcal{D}}\,\mathbf{K}_{\mathcal{D},\mathcal{U}} \end{array} $$

(30)

$$\begin{array}{*{20}l} &\mathbf{K}_{\mathcal{U},\mathcal{D}}=\mathbf{e}+\mathbf{P}_{\mathcal{U},\mathcal{U}}\,\mathbf{K}_{\mathcal{U},\mathcal{D}}, \end{array} $$

(31)

where **e** is an all-unity column vector of the appropriate size. If *g* is the control gene and \(\tilde {X}^{g}\) denotes the state obtained by flipping the expression of *g* in state *X*, then the MFPT-based stationary control policy \(\mu _{\mathbf {P}}^{\text {MFPT}}:\mathcal {S}\rightarrow \mathcal {C}\) assigns the control action for each desirable state \(X \in \mathcal {D}\) as [31]

$$\begin{array}{*{20}l} \mu^{\text{MFPT}}_{\mathbf{P}}(X)=\left\{ \begin{array}{l} 1\qquad \text{if}\,\,\mathbf{K}_{\mathcal{D},\mathcal{U}}(\tilde{X}^{g})-\mathbf{K}_{\mathcal{D},\mathcal{U}}(X)>\Delta \\ 0 \qquad \text{otherwise} \\ \end{array} \right., \end{array} $$

(32)

and for each undesirable state \(X\in \mathcal {U}\) as

$$\begin{array}{*{20}l} {}\mu^{\text{MFPT}}_{\mathbf{P}}(X)=\left\{ \begin{array}{l} 1\qquad \text{if}\,\,\mathbf{K}_{\mathcal{U},\mathcal{D}}(X)-\mathbf{K}_{\mathcal{U},\mathcal{D}}(\tilde{X}^{g})>\Delta \\ 0\qquad \text{otherwise} \\ \end{array} \right., \end{array} $$

(33)

where *Δ* in (32) and (33) is a tuning parameter that should be adjusted based on the definition of the cost function *r*(*i*,*j*,*u*).
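The computations in (29)-(33) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: it rewrites (30)-(31) as the linear systems \((\mathbf{I}-\mathbf{P}_{\mathcal{D},\mathcal{D}})\mathbf{K}_{\mathcal{D},\mathcal{U}}=\mathbf{e}\) and \((\mathbf{I}-\mathbf{P}_{\mathcal{U},\mathcal{U}})\mathbf{K}_{\mathcal{U},\mathcal{D}}=\mathbf{e}\), assumes gene 1 maps to the most significant bit of the state code, and extends each MFPT vector by zero for states already inside the target set.

```python
import numpy as np

def mfpt_vectors(P, D, U):
    """Solve (30)-(31): (I - P_DD) K_DU = e and (I - P_UU) K_UD = e."""
    K_DU = np.linalg.solve(np.eye(len(D)) - P[np.ix_(D, D)], np.ones(len(D)))
    K_UD = np.linalg.solve(np.eye(len(U)) - P[np.ix_(U, U)], np.ones(len(U)))
    return K_DU, K_UD

def mfpt_policy(P, D, U, g, n, delta=0.0):
    """MFPT-based stationary policy (32)-(33) for control gene g (0-indexed)
    in an n-gene binary network whose states are decimal codes of GAPs."""
    K_DU, K_UD = mfpt_vectors(P, D, U)
    pos_D = {s: i for i, s in enumerate(D)}
    pos_U = {s: i for i, s in enumerate(U)}
    # MFPT to the opposite set; zero if the state is already in that set
    K_to_U = lambda x: K_DU[pos_D[x]] if x in pos_D else 0.0
    K_to_D = lambda x: K_UD[pos_U[x]] if x in pos_U else 0.0
    mu = np.zeros(2 ** n, dtype=int)
    for x in range(2 ** n):
        x_f = x ^ (1 << (n - 1 - g))          # flipped state (gene 1 = MSB)
        if x in pos_D:                        # rule (32)
            mu[x] = int(K_to_U(x_f) - K_to_U(x) > delta)
        else:                                 # rule (33)
            mu[x] = int(K_to_D(x) - K_to_D(x_f) > delta)
    return mu
```

Solving the two linear systems once per TPM gives the whole policy, which is the source of the approximation's computational savings over value iteration.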

In the spirit of the MFPT-based approximation of optimal control, we approximate the IBR control policy needed for experimental design via MFPT. Since the IBR control policy *μ*^{Θ} is in fact the optimal control policy relative to the ETPM, and since the MFPT-based policy approximates the optimal policy, we approximate *μ*^{Θ} by the MFPT-based control policy computed for the ETPM, denoted \(\mu ^{\text {MFPT}}_{\Theta }\), i.e.,

$$\begin{array}{*{20}l} \mu^{\Theta}\approx\mu^{\text{MFPT}}_{\Theta}=\mu^{\text{MFPT}}_{\mathbf{P}^{\Theta}}. \end{array} $$

(34)

\(\mu ^{\text {MFPT}}_{\mathbf {P}^{\Theta }}\) is obtained by applying (29)-(33) to the effective transition probability matrix **P**^{Θ}. When the IBR control policy is approximated via MFPT in experimental design, the average cost needed in (28) is computed via Monte-Carlo simulations (over different initial network states) as

$$ {\begin{aligned} &\mathrm{E}_{X_{0}}\left[J^{\mathbf{P}^{\Theta|\theta_{i}=\theta^{\prime}_{i}}}_{\mu^{\Theta|\theta_{i}=\theta^{\prime}_{i}}}(X_{0})\right]\\ &\qquad\approx\ \frac{1}{N}\sum_{n=1}^{N}\left\{{\lim}_{M\rightarrow\infty}\sum_{t=0}^{M-1}\zeta^{t} r^{(n)}\left(\mathbf{X}(t),\mathbf{X}(t+1),\mu^{\text{MFPT}}_{\mathbf{P}^{{\Theta|\theta_{i}=\theta^{\prime}_{i}}}}(\mathbf{X}(t))\right)\right\}, \end{aligned}} $$

(35)

where *N* is the total number of simulations and *r*^{(n)}(·) is the one-step cost incurred in the *n*-th simulation, so that the inner sum is the total discounted cost accrued along that trajectory. The pseudo-code for the approximate method is the same as Algorithm 1 except for steps 19 and 20: in step 19, we find \(\mu ^{\text {MFPT}}_{\Theta |\hat {\theta }_{i}}\) via the MFPT approach, and in step 20, we plug \(\mu ^{\text {MFPT}}_{\Theta |\hat {\theta }_{i}}\) into (35) to compute \(\eta (\hat {\theta }_{i})\).
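A truncated version of the estimator in (35) can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: `Pu[u]` is taken to be the controlled TPM under action *u* (a hypothetical representation, not the paper's data structure), and the infinite horizon is truncated at *M* steps, which introduces an error bounded by \(r_{\max}\,\zeta^{M}/(1-\zeta)\).

```python
import numpy as np

def mc_discounted_cost(Pu, mu, r, zeta=0.9, N=200, M=300, seed=0):
    """Monte-Carlo estimate of the average discounted cost in (35).

    Pu[u]    -- S x S TPM of the network under control action u (assumption)
    mu[x]    -- stationary policy: control action taken in state x
    r(x,y,u) -- one-step cost of the transition x -> y under action u
    """
    rng = np.random.default_rng(seed)
    S = Pu[0].shape[0]
    total = 0.0
    for _ in range(N):
        x = rng.integers(S)                 # random initial network state X_0
        disc = 1.0                          # current discount factor zeta**t
        for _ in range(M):
            u = mu[x]
            y = rng.choice(S, p=Pu[u][x])   # sample the next state
            total += disc * r(x, y, u)
            disc *= zeta
            x = y
    return total / N                        # average over the N trajectories
```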

Note that the MFPT approach is used here solely to select the experiment, i.e., to reduce the uncertainty class; once the uncertainty class has been reduced, the IBR control policy itself is still obtained by solving Bellman's equation using value iteration.
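Such a value-iteration solve can be sketched generically as follows. This is a standard discounted-cost value iteration over per-action TPMs, shown only as an illustration; the array layout (`Pu` for the action-indexed TPMs, `R` for expected one-step costs) is an assumption, not the authors' implementation.

```python
import numpy as np

def value_iteration(Pu, R, zeta=0.9, I=100):
    """Generic discounted-cost value iteration (requires I >= 1).

    Pu[u] -- S x S TPM under action u
    R[u]  -- length-S vector of expected one-step costs of action u
    Returns the value function J and a greedy stationary policy mu.
    """
    S = Pu[0].shape[0]
    J = np.zeros(S)                  # terminal cost initialized to zero
    for _ in range(I):
        # Q[u, x] = R[u, x] + zeta * sum_y Pu[u, x, y] * J[y]
        Q = R + zeta * (Pu @ J)
        J = Q.min(axis=0)            # Bellman backup: best action per state
    mu = Q.argmin(axis=0)            # greedy policy from the final Q-values
    return J, mu
```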

### Computational complexity analysis

The computationally demanding step in the proposed experimental design method is finding IBR control policies. If there are *T* different unknown parameters and we generate *M* different samples for the Monte-Carlo simulations for each one, then we need to find IBR control policies *T*×*M* times in our experimental design calculations. Since we use value iteration to solve Bellman’s equation, if we assume that the value iteration converges in *I* iterations, then we need to compute \((|\mathcal {S}|\times |\mathcal {C}|)^{I}\) terminal costs to obtain the control policy, \(|\mathcal {S}|\) being the number of network states and \(|\mathcal {C}|\) being the number of control actions. In this paper, we focus on binary networks and binary control actions, i.e., \(|\mathcal {S}|=2^{n}\), *n* being the number of genes, and \(|\mathcal {C}|=2\). Therefore, the order of complexity of experimental design based on the IBR control policy is \(\mathcal {O}\left (T\times M\times (2^{n+1})^{I}\right)\). The complexity grows exponentially with the number of genes and linearly with the number of unknown parameters.
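To make the growth rate concrete, the terminal-cost count \(T\times M\times (2^{n+1})^{I}\) can be tabulated directly; each additional gene multiplies the count by a factor of \(2^{I}\). (The example values of *T*, *M*, *n*, and *I* below are illustrative only.)

```python
def ibr_design_cost(T, M, n, I):
    """Terminal-cost count T * M * (2**(n+1))**I from the complexity analysis."""
    return T * M * (2 ** (n + 1)) ** I

# Adding one gene multiplies the count by 2**I; e.g. with I = 4 iterations,
# each extra gene makes the experimental design 16 times more expensive.
```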

The complexity of the approximate experimental design is much lower because MFPT involves no iterative calculation: it suffices to solve the two linear systems in (30) and (31), which requires two matrix inversions. Although applying MFPT to experimental design requires finding the average cost of the MFPT-based IBR control policy via Monte-Carlo simulations, this overhead is modest, and the overall complexity of the MFPT-based approach remains much lower than that of the optimal experimental design. Since the calculations for each unknown parameter and each realization of that parameter are independent, both proposed experimental design methods admit a parallel implementation.

In Fig. 2, we provide the run times required for finding the primary parameter among 5 unknown parameters for GRNs with different numbers of genes. The code is written in MATLAB and run in a parallel framework on a machine with an Intel^{®} quad-core 2.67 GHz CPU and 12 GB RAM. The number of iterations for value iteration is *I*=4. While the execution time grows exponentially with the number of genes and the runs become prohibitively time-consuming beyond six genes for the optimal experimental design, the MFPT-based approximate method can still be applied to networks of larger size.