In this section, several key concepts and definitions are introduced for solving the SIOOR problem, including Observation Strategy (OS), Cardinality of Observation Strategy [4], and Identifiability Gain (IG). The design of the dynamical programming algorithm is also described. In addition, we provide the necessary theoretical justification for the proposed method.
Problem formulation
A directed biological network can be denoted by G = (V, E), where V denotes the node set and E denotes the edge set. Let V
i
(i = 1, 2,…, n) denote the i-th node, and Y
i
denote the variable associated with V
i
. If Y
i
is a linear function of the remaining node variables, the corresponding SEM can be specified as follows,
$$ {Y}_i={\displaystyle \sum_{j\ne i}}{c}_{i j}{Y}_j+{\varepsilon}_i,\kern0.6em i, j=1,\cdots, n, $$
where c
ij
denotes the coefficient associated with the directed edge V
j
→ V
i
, and ε
i
denotes the disturbance error term that follows a certain distribution (Gaussian or non-Gaussian [40, 41]) with mean zero. For simplicity, all disturbance error terms are assumed to be independent. By definition, E specifies the structure of the coefficient matrix C = [c
ij
], i.e., c
ij
= 0 if no edge exists in E from V
j
to V
i
for i ≠ j. When a network structure contains one or more loops, G is a directed cyclic graph (DCG) and the corresponding SEM is called a non-recursive model; otherwise, G is a directed acyclic graph [42] and the corresponding SEM is called recursive. Although Drton’s condition [31] and Foygel’s half-trek criterion [32] are applicable to the identifiability analysis of non-recursive SEMs, the identifiability of parameters on a loop may be still inconclusive. Due to the lack of mature structural identifiability analysis techniques for examining every unknown parameter of a non-recursive SEM, this study focuses on recursive SEMs (i.e., DAGs) only.
Definition 1 (observation strategy). Given a graph G = (V, E), its observation strategy can be denoted by a binary vector \( O={\left({O}_{V_1},\cdots, {O}_{V_n}\right)}^T \), where \( {O}_{V_i}=1 \) if node V
i
is observed and \( {O}_{V_i}=0 \) if V
i
is unobserved.
Observation strategy is important to parameter identifiability. In general, for a given network structure, the more observed nodes an observation strategy contains, the more likely all model parameters are identifiable. However, more observed nodes are usually associated with a higher experiment cost, so it is also desirable to reduce any unnecessary cost. The goal of SIOOR is thus to improve a given observation strategy by observing a minimum number of originally unobserved nodes such that all nonzero parameters in C become identifiable. For this purpose, let P denote the vector of all nonzero parameters in C, and let D denote the vector of identifiability status of every element in P. That is, if P
i
is locally or globally identifiable (i.e., P
i
has a finite number of possible values or a unique value within the parameter space, see [18]), D
i
= 1; otherwise, D
i
=1. When all the parameters in a model are locally or globally identifiable, this model is called identifiable. Consequently, the SIOOR problem can be formulated as follows
$$ \underset{\mathrm{observed}\ {V}_i}{ \min }{\displaystyle \sum_{i=1}^n{O}_{V_{{}_i}}},\ \mathrm{subject}\ \mathrm{t}\mathrm{o}\;\mathbf{D}=\mathbf{1}, $$
(1)
where \( {\displaystyle \sum_{i=1}^n{O}_{V_{{}_i}}} \) is the total number of observed nodes in an observation strategy O, and 1 denotes a vector of ones. For clarification, we stress that the observation measurements are for the random variables associated with network nodes, and we assume (n – m) of them are observed in the original observation strategy, where n denotes the total number of nodes and 0 < m ≤ n.
The objective function above is minimized with respect to the originally unobserved nodes, subject to the constraint D = 1. During the minimization process, it needs to be repeatedly verified whether all parameters have become identifiable (i.e., D = 1). For this purpose, an efficient algorithm for structural identifiability analysis of SEMs is needed. Here we consider the identifiability matrix method proposed by Wang et al. [34]. Briefly, structural identifiability of parameters can be verified by examining the number of solutions to the symbolic polynomial identifiability equations generated by Wright’s path coefficient method [43, 44]. To avoid the expensive symbolic computation involved in reducing such identifiability equations, the identifiability matrix method proposes to derive binary matrices from symbolic polynomials and thus enable us to determine the number of solutions via several simple matrix operations. It is noteworthy that Wang’s work [34] does not explicitly handle colliders involving bidirectional arcs when generating identifiability equations with Wright’s method, however, the identifiability matrix method is still applicable here as we do not consider bidirectional arcs in DAGs.
Identifiability gain and must-be-observed nodes
The optimization problem in the previous section is combinatorial in nature. Therefore, if the number of the originally unobserved nodes (denoted by m) is not small, enumerating all the 2m different possible observation strategies over these nodes will be computationally expensive. We thus need an efficient algorithm such as dynamic programming to obtain the solutions. For this purpose, a few more definitions need to be introduced first.
Definition 2 (redundant identifiability equation). Given a set of identifiability equations, an identifiability equation IE(V
i
,V
j
) is redundant with respect to that set if it can be expressed as a linear combination of the equations in that set.
Definition 3 (cardinality of observation strategy). Given an observation strategy O for a network G, one symbolic polynomial identifiability equation can be generated for each pair of d-connected [28] observed nodes using, e.g., Wright’s path coefficient method. Then the total number of non-redundant identifiability equations is called the cardinality of O, denoted by f(O).
The Wright’s path coefficient method generates identifiability equations for recursive SEMs by calculating the covariance between two node variables, which is equal to the sum of the products of edge coefficients along each d-connected path, i.e., \( IE\left({V}_i,{V}_j\right):\ C o v\left({V}_i,{V}_j\right)={\displaystyle \sum_{pat{ h}_k}}{\displaystyle \prod_{e dg{ e}_l}}{\theta}_l \). After removing all redundant identifiability equations and redundant monomials, the identifiability result of each parameter can be determined by Theorem 1 in [34]. That is, if the number of non-redundant identifiability equations is less than the number of unknown parameters, then the parameters have an infinite number of possible values within the parameter space and are thus unidentifiable; otherwise, the parameters have a limited number of solutions or even a unique solution and are thus at least locally identifiable [45]. Let N
u
denote the total number of unknown parameters in P. For every parameter in P being locally or globally identifiable, the inequality f(O) ≥ N
u
should hold according to Theorem 1 in [34]. Therefore, the optimization problem can also be formulated as follows
$$ \underset{\mathrm{observed}\ {V}_i}{ \min }{\displaystyle \sum_{i=1}^n{O}_{V_{{}_i}}},\ \mathrm{subject}\ \mathrm{t}\mathrm{o}\; f(O)\ge {N}_u, $$
(2)
where the calculation of f(O) is a key challenge because it depends on specific network structure and observation strategy and thus has no closed-form solution. We thus introduce the following definition.
Definition 4 (identifiability gain). Given a network G = (V,E), let O
(k) and f(O
(k)) denote an observation strategy and its cardinality, respectively. Let V
i
be an unobserved node in O
(k), and only V
i
becomes observed in a new observation strategy O
(k + 1) with the observation statuses of other nodes remaining unchanged. Let f(O
(k + 1)) denote the cardinality of O
(k + 1). Then the identifiability gain of observing V
i
, denoted by g(V
i
, O
(k)), is calculated as g(V
i
, O
(k)) = f(O
(k + 1)) − f(O
(k)).
By definition, g(V
i
, O
(k)) is the difference in cardinality between two consecutive observation strategies O
(k) and O
(k + 1). That is, after V
i
becomes observed in O
(k + 1), we need to find out the number of newly added non-redundant identifiability equations. First, if another node V
j
(i ≠ j) is observed in both O
(k) and O
(k + 1) and there exists a Wright’s path [46] of length 1 connecting V
i
and V
j
, it can be shown that the newly added identifiability equation, denoted by IE(V
i
,V
j
), is non-redundant (see Lemma 1 and Additional file 1 for theoretical justification). However, if the length of every Wright’s path between V
i
and V
j
is greater than 1, the identifiability equation IE(V
i
,V
j
) is not always redundant, and it depends on both the node’s observation status and the structure of the network. Here we introduce the concept of detour-path before we further elucidate the redundancy issue. Consider a DAG G = (V,E) and two d-connected observed nodes V
i
and V
j
. Assume that there exists a Wright’s path P
ji
between V
i
and V
j
as well as an observed node V
k
(k ≠ i, j) on P
ji
, and the direction of P
ji
is from V
i
to V
k
and then to V
j
. Now let P
ki
and P
jk
denote the two segments of P
ji
, then P
ki
entering node V
k
has an arrow pointing into V
k
while P
jk
exiting node V
k
has an arrow pointing away from V
k
. However, if there exists another Wright’s path between V
k
and V
j
, denoted by \( {\tilde{P}}_{kj} \), which has no any other observed nodes besides V
k
and V
j
and has an arrow pointing into V
k
, then V
k
is a collider with respect to P
ki
and \( {\tilde{P}}_{kj} \). Thus, we call the Wright’s path segment P
jk
the detour-path, and call V
i
, V
j
and V
k
the upstream node, the downstream node, and the collider node of the detour-path P
jk
, respectively. By definition, a detour-path can have only one downstream node and one collider node but may have one or more upstream nodes. Moreover, multiple detour-paths can share the same upstream node, the same downstream node or the same collider node. Several examples are given in Fig. 1 to illustrate the concept of detour-path.
In addition, when an upstream node V
i
is shared by two or more detour-paths that have the same downstream node, V
i
is called a shared upstream node; otherwise, V
i
is called an exclusive upstream node. Note that a detour-path can have both exclusive and shared upstream nodes in the same time, and the collider node of one detour-path can be an upstream node of another detour-path. Consider two detour-paths that have no exclusive upstream nodes, if they share the same downstream node and at least one upstream node, or one upstream node of one detour path is the collider node of the other detour-path, then two detour-paths are intersecting. One can tell that if \( {P}_{j{ k}_1} \) intersects with \( {P}_{j{ k}_2} \) and \( {P}_{j{ k}_2} \) intersects with \( {P}_{j{ k}_3} \), then \( {P}_{j{ k}_1} \) also intersects with \( {P}_{j{ k}_3} \). Then we consider a downstream node V
j
, let S_IDP denote all the intersecting detour-paths, and let S_SUN denote all the shared upstream nodes of S_IDP. Similar to a single unknown parameter, the coefficient product \( W P={\displaystyle \prod_{e dg{ e}_l}}{\theta}_l \) of a Wright’s path P can be deemed as a single parameter and one can tell its structural identifiability based on identifiability equations. If a detour-path P has at least one exclusive upstream node, then the Wright’s coefficient WP of P is globally identifiable (see Lemma 2 and Additional file 1 for theoretical justification). Also, for a group of intersecting detour-paths, if the node number of S_SUN is equal to or greater than the number of intersecting detour-paths in S_IDP, then the Wright’s coefficient of each detour-path in S_IDP is globally identifiable (see Lemma 3 and Additional file 1 for theoretical justification).
Given a DAG G = (V,E), consider two observed nodes V
i
, V
j
and an unobserved node V
u
. V
u
may not be on any Wright’s paths between V
i
and V
j
. For this case,if only V
u
becomes observed in O
(k + 1), then for each observed node V
i
in O
(k), one can check whether the identifiability equation IE(V
i
,V
u
) is redundant according to Lemma 4 (see Additional file 1 for theoretical justification). That is, when none of the Wright’s paths between V
i
and V
u
contains detour-paths, IE(V
i
,V
u
) is redundant if and only if each Wright’s path between V
i
and V
u
passes at least one observed node other than V
i
and V
u
; otherwise, IE(V
i
,V
u
) is redundant if and only if the Wright’s coefficient of each detour-path between V
i
and V
u
is globally identifiable in O
(k) and each Wright’s path between V
i
and V
u
passes at least one observed node other than V
i
and V
u
. If V
u
is on a Wright’s path between V
i
and V
j
, and the sufficient and necessary condition for one of the identifiability equations IE(V
i
,V
u
) and IE(V
j
,V
u
) being redundant is similar to Lemma 4 and given in Lemma 5 (see Additional file 1 for theoretical justification). Note that it can be determined whether the Wright’s coefficient of a detour-path is globally identifiable according to Lemma 2 and Lemma 3.
Based on Lemma 4 and Lemma 5, we propose a novel graphic method to calculate the identifiability gain g(V
i
, O
(k)). Let des
i
denote the descendant node set of V
i
, anc
i
denote the ancestor node set of V
i
, rel
i
denote the set of nodes that are not included in des
i
or anc
i
. Moreover, let bound
i
⊂ anc
i
denote the boundary node set, in which every node has at least one outgoing edge to a node in rel
i
. Then we can calculate g(V
i
, O
(k)) by removing the following edges from the original graph G: i) all the incoming edges to the observed nodes that are not collider nodes of detour-paths in anc
i
; ii) all the outgoing edges from some observed nodes in des
i
and rel
i
and these observed nodes are not the collider nodes of the detour-paths whose Wright’s coefficients are unidentifiable in O
(k); and iii) all the outgoing edges from the observed nodes in bound
i
to nodes in rel
i
, and then we get a new graph denoted by G '. Let N
w
denote the total number of the observed nodes that are connected with V
i
via any Wright’s path in graph G '. Furthermore, one can tell from the edge-removal operation that there still exist some redundant identifiability equations in G ', because the following two types of redundancy cases have not been considered in the edge-removal operation: V
i
is the downstream node of an arbitrary detour-path, and V
i
is on a Wright’s path between two observed nodes in G '. Let N
r
denote the number of redundant identifiability equations in G '. According to the topological structure of G ' and the node’s observation status, we can obtain N
r
based on Lemma 4 and Lemma 5 (see the details in Implementation and Verification Section). It can be shown that the identifiability gain is g(V
i
, O
(k)) = N
w
− N
r
(see Theorem 1 and Additional file 1 for theoretical justification).
For a given DAG G and an observation strategy O
(k), different unobserved nodes may associate with different identifiability gains. Naturally, our strategy is to choose the unobserved node in O
(k) with the maximum identifiability gain if it becomes observed in O
(k + 1). However, we also recognize that, to assure that all model parameters are at least locally identifiable, certain nodes of a DAG must be observed if they are unobserved in an observation strategy (see Lemma 6 and Additional file 1 for theoretical justification). For convenience, we call such nodes the must-be-observed [14] nodes, and let \( {O}^{(0)_M} \) denote the observation strategy, in which only the MBO nodes are observed.
Lemma 1
Given a DAG G = (V,E), an observed node
V
i
, and an unobserved node
V
u
in
O
(k), if only
V
u
becomes observed in
O
(k + 1), the identifiability equation
IE(V
i
,V
u
) is non-redundant if there exists a Wright’s path of length 1 connecting
V
i
and
V
u
.
Lemma 2
If a detour-path
P
has one or more exclusive upstream node, the Wright’s coefficient
WP of P
is globally identifiable.
Lemma 3
For a group of intersecting detour-paths, if the number of the shared upstream nodes in
S_SUN
is equal to or greater than the number of intersecting detour-paths in
S_IDP, then the Wright’s coefficient of each detour-path in
S_IDP
is globally identifiable.
Lemma 4
Given a DAG G = (V,E), an observed node
V
i
, and an unobserved node
V
u
in O
(k), if only
V
u
becomes observed in
O
(k + 1), there exist two cases:
-
1)
each Wright’s path between V
i
and V
u
passes at least one observed node other than V
i
and V
u
when none of the Wright’s paths between V
i
and V
u
contains detour-paths;
-
2)
each Wright’s path between V
i
and V
u
passes at least one observed node other than V
i
andV
u
, and the Wright’s coefficient of each detour-path between V
i
and V
u
is globally identifiable in O
(k) when certain Wright’s paths between V
i
and V
u
contain detour-paths.
Then the identifiability equation IE(V
i
,V
u
) is redundant if and only if one of the above conditions holds.
Lemma 5
Given a DAG G = (V,E), two
d-connected observed nodes
V
i
and
V
j
, and an unobserved node
V
u
in O
(k), if V
u
is on a Wright’s path between V
i
and V
j
and only V
u
becomes observed in O
(k + 1), there exist two cases:
-
1)
each Wright’s path between V
i
and V
j
passes at least one observed node other than V
i
and V
j
when none of the Wright’s paths between V
i
and V
j
contains detour-paths;
-
2)
each Wright’s path between V
i
and V
j
passes at least one observed node other than V
i
and V
j
, and the Wright’s coefficient of each detour-path between V
i
and V
j
is globally identifiable in O
(k) when certain Wright’s paths between V
i
and V
j
contain detour-paths.
Then one of the two identifiability equations IE(V
i
,V
u
) and IE(V
j
,V
u
) is redundant if and only if one of the above conditions holds.
Theorem 1
Given a DAG G = (V, E) and an unobserved node V
i
in an observation strategy O, let G ' denote the sub-graph after the edge-removal operation. Then the identifiability gain is g(V
i
, O) = N
w
− N
r
, where N
w
denotes the total number of the observed nodes that are connected with V
i
via any Wright’s path in graph G ', and N
r
denotes the number of redundant identifiability equations in graph G '.
Lemma 6
For a given DAG G = (V, E), the following nodes must be observed to assure that all the parameters of the corresponding SEM are at least locally identifiable
-
1)
The nodes with an out-degree 0;
-
2)
The nodes with an out-degree 1;
-
3)
The nodes with an in-degree 0 and an out-degree less than 3.
Dynamic programming strategy
Let \( {O}^{(0)_G} \) denote a given observation strategy. If some of the MBO nodes are not observed in \( {O}^{(0)_G} \), \( {O}^{(0)_M} \) should be incorporated into \( {O}^{(0)_G} \) according to Lemma 6. Therefore, the initial observation strategy, denoted by O
(0), should always be \( {O}^{(0)}=\left({O}^{(0)_M}\Big|{O}^{(0)_G}\right) \), where the OR operator is an element-wise operation. For example, for a DAG with six nodes, if \( {O}^{(0)_M}={\left[\begin{array}{cccccc}\hfill 1\hfill & \hfill 0\hfill & \hfill 1\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill \end{array}\right]}^T \) and \( {O}^{(0)_G}={\left[\begin{array}{cccccc}\hfill 0\hfill & \hfill 1\hfill & \hfill 1\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill \end{array}\right]}^T \), then \( {O}^{(0)}={\left[\begin{array}{cccccc}\hfill 1\hfill & \hfill 1\hfill & \hfill 1\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill \end{array}\right]}^T \).
The dynamic programming strategy starts with the calculation of the cardinality of O
(0) (that is, f(O
(0))) based on Theorem 1. Specifically, let R be the number of observed nodes in O
(0), V
o_r
(r = 1, 2, ⋯, R) be the r-th observed node in O
(0), and O
(0){V
o_1, …, V
o_r
} be the observation strategy in which only the first r observed nodes in O
(0) are observed. Then \( f\left({O}^{(0)}\right)={\displaystyle \sum_{r=1}^{R-1} g\left({V}_{o\_\left( r+1\right)},{O}^{(0)}\left\{{V}_{o\_1},\dots, {V}_{o\_ r}\right\}\right)} \) can be calculated according to Theorem 1. Note that the order at which V
o_r
is selected into O
(0){V
o_1, …, V
o_R
} will not change the observation strategy (e.g., O
(0){V
o_1, V
o_2} = O
(0){V
o_2, V
o_1}) and thus have no effect on the value of f(O
(0)).
The second step of our dynamic programming strategy is to define stages and their associated states. Let S denote the number of unobserved nodes in O
(0), and let V
u_s
(s = 1, 2, ⋯, S) denote the s-th unobserved node in O
(0), then the dynamic programming procedure can be divided into S + 1 stages. For illustration purpose, we consider a simple example with 5 unobserved nodes, as shown in Fig. 2. The 0-th stage is actually the initialization step as described in the previous paragraph, and it has only one state, i.e., O
(0). At the first stage, there are S = 5 different states; that is, only one of the unobserved nodes {V
u_1, V
u_2, ⋯, V
u_5} in O
(0) will be selected to observe. At the second stage, since one of the five unobserved nodes has been selected at the previous stage, there are only four unobserved nodes for selection and thus four states exist (that is, {V
u_2, V
u_3, V
u_4, V
u_5}). Therefore, as shown in Fig. 2, except for stages 0 and 1, each subsequent stage has one less states than its previous stage; also, the upper triangular region (see the area above the labels of stages 1–5 in Fig. 2) is empty because the selection order of unobserved nodes does not affect the eventual observation strategy so the inclusion of such states in the upper triangular region is redundant. One can tell that the proposed stage and state definitions satisfy the optimality principle of dynamic programming [47,48,49].
The third step is to compute the state transition costs for searching the optimal state transition path(s). According to the definitions of stages and states, there may exist several different states at the s-th stage that can transit to the same state at the (s + 1)-th stage. For instance, four states V
u_1, V
u_2, V
u_3 and V
u_4 at the first stage can transit to V
u_5 at the second stage, as shown in Fig. 2. The state transition cost from state V
u_i
to state V
u_j
(i ≠ j) between two consecutive stages is just the identifiability gain g(V
u_j
, O
(k){…, V
u_i
, …}), where O
(k){…, V
u_i
, …} means that V
u_i
is observed in O
(k). Then the cardinality of an observation strategy can be computed by adding f(O
(0)) and all the state transition costs along the state transition path. Since the goal of the dynamic programming strategy is to search for the optimal observation strategies, when there exist multiple transition paths from state V
u_i
in O
(k) to state V
u_j
in O
(k + 1) (i ≠ j), the transition path associated with the maximum identifiability gain will be chosen; that is, \( f\left({O}_{V_{u\_ j}}^{\left( k+1\right)}\right)=\underset{V_{u\_ i}, i\ne j}{ \max}\left( g\left({V}_{u\_ j},{O}_{V_u{\_}_i}^{(k)}\right)+ f\left({O}_{V_{u\_ i}}^{(k)}\right)\right) \), where \( {O}_{V_{u\_ i}}^{(k)} \) is a convenient notation for O
(k){…, V
u_i
, …}.
The dynamic programming strategy above can be mathematically described in Eq. (3), and we have implemented this strategy in R (see the “Implementation and verification” Section),
$$ \left\{\begin{array}{l} f\left({O}_{V_{u\_ s}}^{(1)}\right)= f\left({O}^{(0)}\right)+ g\left({V}_{u\_ s},{O}^{(0)}\right),\kern0.75em s=1,2,\dots, S,\\ {} f\left({O}_{V_{u\_ j}}^{\left( k+1\right)}\right)=\underset{V_{u\_ i}, i\ne j}{ \max}\left( g\left({V}_{u\_ j},{O}_{V_{u\_ i}}^{(k)}\right)+ f\left({O}_{V_{u\_ i}}^{(k)}\right)\right),\kern0.75em k=1,2,\cdots\ \&\ k\le S-1.\end{array}\right. $$
(3)
It should be stressed that it is not necessary to finish all the S iterations as shown in Eq. (3). Once the cardinality \( f\left({O}_{V_{u\_ i}}^{(k)}\right) \) at the k-th stage becomes equal to or greater than the number of unknown parameters N
u
, the dynamic programming process will stop and we get the SIOOR strategies.