Parameter identifiability-based optimal observation remedy for biological networks

Background To systematically understand the interactions between numerous biological components, a variety of biological networks on different levels and scales have been constructed and made available in public databases or knowledge repositories. Graphical models such as structural equation models have long been used to describe biological networks for various quantitative analysis tasks, especially key biological parameter estimation. However, limited by resources or technical capacities, partial observation is a common problem in experimental observations of biological networks, and it thus becomes an important problem how to select unobserved nodes for additional measurements such that all unknown model parameters become identifiable. To the best knowledge of our authors, a solution to this problem does not exist until this study. Results The identifiability-based observation problem for biological networks is mathematically formulated for the first time based on linear recursive structural equation models, and then a dynamic programming strategy is developed to obtain the optimal observation strategies. The efficiency of the dynamic programming algorithm is achieved by avoiding both symbolic computation and matrix operations as used in other studies. We also provided necessary theoretical justifications to the proposed method. Finally, we verified the algorithm using synthetic network structures and illustrated the application of the proposed method in practice using a real biological network related to influenza A virus infection. Conclusions The proposed approach is the first solution to the structural identifiability-based optimal observation remedy problem. It is applicable to an arbitrary directed acyclic biological network (recursive SEMs) without bidirectional edges, and it is a computerizable method. Observation remedy is an important issue in experiment design for biological networks, and we believe that this study provides a solid basis for dealing with more challenging design issues (e.g., feedback loops, dynamic or nonlinear networks) in the future. We implemented our method in R, which is freely accessible at https://github.com/Hongyu-Miao/SIOOR. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0432-2) contains supplementary material, which is available to authorized users.


Background
The emergence of young research fields such as systems biology and network medicine [1,2] reflects some exciting changes in biomedical investigators' view of biology and practice. Particularly, it has been increasingly recognized that thinking in networks may lead to novel scientific insights and findings [3] that the traditional reductionism approaches cannot grant [4]. The recent development of experimental techniques (e.g., a variety of high-throughput omics approaches) also provides unprecedented opportunities for biomedical investigators to construct numerous biological networks at different levels and scales; for instance, proteinprotein interaction networks [5,6], gene regulatory networks [7][8][9][10], functional RNA networks [11][12][13], and metabolic networks [14,15] can be found in a number of databases or knowledge repositories nowadays [9,16,17]. All such previous efforts provide a solid basis for further advancing our understanding of biological systems and the associated outcomes qualitatively or quantitatively.
Graphical models have long been considered as a natural mathematical representation of biological network for various quantitative analysis tasks such as parameter inference [18][19][20][21]. Specifically, given a biological network structure and experimental observations of certain variables associated with network nodes, it is often of significant interest to determine the unknown coefficients associated with network edges. For instance, to understand the responses of a biological network (e.g., activation or inhibition) to different environmental signals (e.g., different signaling molecules or different doses of the same signaling molecule), edge coefficients are likely to vary under different conditions and thus need to be estimated under each condition for the same given network structure [18]. In such a scenario, although the structure of the corresponding graphical model is known and fixed, concerns about the accuracy and reliability of parameter estimates often raise due to, e.g., the existence of unobserved node variables (i.e., latent variables). In practice, latent variables are not uncommon due to various technical limitations, ethic issues, financial affordability, and so on [18,20]. Therefore, a natural question to ask is: what is the remedy that enables us to obtain reliable parameter estimates for a given graphical model structure with partially observed variables?
To the best knowledge of our authors, the aforementioned important question has rarely been tackled before in the context of quantifying unknown model parameters of biological networks; and in this study, we make the very first attempt to address this question from the structural identifiability point of view. By the definition in Miao et al. [18], an unknown model parameter is structurally identifiable if it can be uniquely determined for a given model structure under the assumptions that sample size is sufficiently large and data quality is not of concern. Of course, one can also take the effects of sample size and data noise into consideration and conduct the so-called practical identifiability analysis [18]; however, this is out of the scope of this study as practical identifiability analysis is not feasible at certain experimental design stage when real data are not available. On the contrary, structural identifiability analysis allows us to detect flaws in model structure and observation scheme before data collection, and thus should be investigated first. Our solution to the question mentioned at the end of the previous paragraph is thus a strategy that identifies a minimum number of unobserved nodes, for which the associated node variables should be observed in experiments such that all unknown parameters become structurally identifiable. This is a useful and cost-effective remedy if some of the model parameters are not identifiable given the original observation scheme, and we thus name it the structural identifiability-based optimal observation remedy (SIOOR).
Since biological networks can be represented by many different types of mathematical or statistical models, it is impossible to devise the SIOOR strategy for every different model type in one study. Therefore, we consider a linear structural equation model [22] here because it is a representative graphical model type and has been widely applied in various disciplines including systems biology [23][24][25][26][27]. A number of previous studies have investigated the parameter identifiability problem of SEMs, but the majority of these studies only derived theoretical criteria or conditions for identifiability verification, including Pearl's back door and front door criteria [28], Brito and Pearl's generalized instrumental variable criterion [29], Tian's accessory set approach [30]. Only a few studies proposed computerizable identifiability analysis approaches, including Drton's condition [31] and Foygel's half-trek criterion [32] (implemented in R package SEMID), Sullivant's computer algebra method and the more recent Wang's identifiability matrix method [33,34]. More importantly, all such criterions and methods assume that the observation strategy is given (i.e., it is pre-specified which variables are observed and which are not), and none of them considered the remedy strategy if a given observation strategy does not grant identifiability to all unknown model parameters. The focus of this study is thus to investigate how to choose a minimum number of nodes that are not observed in the original observation strategy for additional experimental measurements such that all unknown model parameters become identifiable. This study leads to a general and computerizable solution to the SIOOR problem for the first time.
More specifically, in the case that a given observation strategy of a biological network cannot grant identifiability of all unknown parameters in the corresponding SEM due to the existence of unobserved variables, we propose a dynamic programming (DP) approach to search for all possible SIOOR strategies. The proposed approach is a generic and computerizable method that can deal with recursive SEMs. It should be stressed that SIOOR strategy does not involve any power or sample size calculation and thus cannot be compared with the traditional experimental design approaches [35,36]. Also, it should be stressed that the observability problem in control theory is different from the SIOOR problem because the aim of observability analysis is to determine the internal states of a system from its external outputs [37]. For clarification purpose, we also compare Liu's graphic approach for observability analysis [38] with our SIOOR strategy in this study.
This article is organized as follows. In the Methods Section, the structural identifiability-based optimal observation remedy problem is mathematically formulated. We then propose a dynamic programming approach with theoretical justification to solve the problem for recursive SEMs. In the Results and Discussion Section, we describe our algorithm implementation and validate the proposed method using selected benchmark networks. Also, a real substructure from the influenza virus A [39] KEEG pathway is chosen as an example to illustrate the application of the proposed method in practice.

Methods
In this section, several key concepts and definitions are introduced for solving the SIOOR problem, including Observation Strategy (OS), Cardinality of Observation Strategy [4], and Identifiability Gain (IG). The design of the dynamical programming algorithm is also described. In addition, we provide the necessary theoretical justification for the proposed method.

Problem formulation
A directed biological network can be denoted by G = (V, E), where V denotes the node set and E denotes the edge set. Let V i (i = 1, 2,…, n) denote the i-th node, and Y i denote the variable associated with V i . If Y i is a linear function of the remaining node variables, the corresponding SEM can be specified as follows, where c ij denotes the coefficient associated with the directed edge V j → V i , and ε i denotes the disturbance error term that follows a certain distribution (Gaussian or non-Gaussian [40,41]) with mean zero. For simplicity, all disturbance error terms are assumed to be independent. By definition, E specifies the structure of the coefficient matrix C = [c ij ], i.e., c ij = 0 if no edge exists in E from V j to V i for i ≠ j. When a network structure contains one or more loops, G is a directed cyclic graph (DCG) and the corresponding SEM is called a nonrecursive model; otherwise, G is a directed acyclic graph [42] and the corresponding SEM is called recursive. Although Drton's condition [31] and Foygel's half-trek criterion [32] are applicable to the identifiability analysis of non-recursive SEMs, the identifiability of parameters on a loop may be still inconclusive. Due to the lack of mature structural identifiability analysis techniques for examining every unknown parameter of a non-recursive SEM, this study focuses on recursive SEMs (i.e., DAGs) only.
Definition 1 (observation strategy). Given a graph G = (V, E), its observation strategy can be denoted by a bin- Observation strategy is important to parameter identifiability. In general, for a given network structure, the more observed nodes an observation strategy contains, the more likely all model parameters are identifiable. However, more observed nodes are usually associated with a higher experiment cost, so it is also desirable to reduce any unnecessary cost. The goal of SIOOR is thus to improve a given observation strategy by observing a minimum number of originally unobserved nodes such that all nonzero parameters in C become identifiable. For this purpose, let P denote the vector of all nonzero parameters in C, and let D denote the vector of identifiability status of every element in P. That is, if P i is locally or globally identifiable (i.e., P i has a finite number of possible values or a unique value within the parameter space, see [18]), D i = 1; otherwise, D i =1. When all the parameters in a model are locally or globally identifiable, this model is called identifiable. Consequently, the SIOOR problem can be formulated as follows O V i is the total number of observed nodes in an observation strategy O, and 1 denotes a vector of ones. For clarification, we stress that the observation measurements are for the random variables associated with network nodes, and we assume (nm) of them are observed in the original observation strategy, where n denotes the total number of nodes and 0 < m ≤ n. The objective function above is minimized with respect to the originally unobserved nodes, subject to the constraint D = 1. During the minimization process, it needs to be repeatedly verified whether all parameters have become identifiable (i.e., D = 1). For this purpose, an efficient algorithm for structural identifiability analysis of SEMs is needed. Here we consider the identifiability matrix method proposed by Wang et al. [34]. Briefly, structural identifiability of parameters can be verified by examining the number of solutions to the symbolic polynomial identifiability equations generated by Wright's path coefficient method [43,44]. To avoid the expensive symbolic computation involved in reducing such identifiability equations, the identifiability matrix method proposes to derive binary matrices from symbolic polynomials and thus enable us to determine the number of solutions via several simple matrix operations. It is noteworthy that Wang's work [34] does not explicitly handle colliders involving bidirectional arcs when generating identifiability equations with Wright's method, however, the identifiability matrix method is still applicable here as we do not consider bidirectional arcs in DAGs.

Identifiability gain and must-be-observed nodes
The optimization problem in the previous section is combinatorial in nature. Therefore, if the number of the originally unobserved nodes (denoted by m) is not small, enumerating all the 2 m different possible observation strategies over these nodes will be computationally expensive. We thus need an efficient algorithm such as dynamic programming to obtain the solutions. For this purpose, a few more definitions need to be introduced first.
Definition 2 (redundant identifiability equation). Given a set of identifiability equations, an identifiability equation IE(V i , V j ) is redundant with respect to that set if it can be expressed as a linear combination of the equations in that set.
Definition 3 (cardinality of observation strategy). Given an observation strategy O for a network G, one symbolic polynomial identifiability equation can be generated for each pair of d-connected [28] observed nodes using, e.g., Wright's path coefficient method. Then the total number of non-redundant identifiability equations is called the cardinality of O, denoted by f(O).
The Wright's path coefficient method generates identifiability equations for recursive SEMs by calculating the covariance between two node variables, which is equal to the sum of the products of edge coefficients along each d-connected path, i.e., IE V i ; V j À Á : Cov After removing all redundant identifiability equations and redundant monomials, the identifiability result of each parameter can be determined by Theorem 1 in [34]. That is, if the number of non-redundant identifiability equations is less than the number of unknown parameters, then the parameters have an infinite number of possible values within the parameter space and are thus unidentifiable; otherwise, the parameters have a limited number of solutions or even a unique solution and are thus at least locally identifiable [45]. Let N u denote the total number of unknown parameters in P. For every parameter in P being locally or globally identifiable, the inequality f(O) ≥ N u should hold according to Theorem 1 in [34]. Therefore, the optimization problem can also be formulated as follows where the calculation of f(O) is a key challenge because it depends on specific network structure and observation strategy and thus has no closed-form solution. We thus introduce the following definition. Definition 4 (identifiability gain). Given a network G = (V,E), let O (k) and f(O (k) ) denote an observation strategy and its cardinality, respectively. Let V i be an unobserved node in O (k) , and only V i becomes observed in a new observation strategy O (k + 1) with the observation statuses of other nodes remaining unchanged. Let f(O (k + 1) ) denote the cardinality of O (k + 1) . Then the identifiability gain of observing V i , denoted by is the difference in cardinality between two consecutive observation strategies O (k) and O (k + 1) . That is, after V i becomes observed in O (k + 1) , we need to find out the number of newly added non-redundant identifiability equations. First, if another node V j (i ≠ j) is observed in both O (k) and O (k + 1) and there exists a Wright's path [46] of length 1 connecting V i and V j , it can be shown that the newly added identifiability equation, denoted by IE(V i , V j ), is non-redundant (see Lemma 1 and Additional file 1 for theoretical justification). However, if the length of every Wright's path between V i and V j is greater than 1, the identifiability equation IE(V i , V j ) is not always redundant, and it depends on both the node's observation status and the structure of the network. Here we introduce the concept of detour-path before we further elucidate the redundancy issue. Consider a DAG G = (V,E) and two dconnected observed nodes V i and V j . Assume that there exists a Wright's path P ji between V i and V j as well as an observed node V k (k ≠ i, j) on P ji , and the direction of P ji is from V i to V k and then to V j . Now let P ki and P jk denote the two segments of P ji , then P ki entering node V k has an arrow pointing into V k while P jk exiting node V k has an arrow pointing away from V k . However, if there exists another Wright's path between V k and V j , denoted byP kj , which has no any other observed nodes besides V k and V j and has an arrow pointing into V k , then V k is a collider with respect to P ki andP kj . Thus, we call the Wright's path segment P jk the detour-path, and call V i , V j and V k the upstream node, the downstream node, and the collider node of the detour-path P jk , respectively. By definition, a detour-path can have only one downstream node and one collider node but may have one or more upstream nodes. Moreover, multiple detourpaths can share the same upstream node, the same downstream node or the same collider node. Several examples are given in Fig. 1 to illustrate the concept of detour-path.
In addition, when an upstream node V i is shared by two or more detour-paths that have the same downstream node, V i is called a shared upstream node; otherwise, V i is called an exclusive upstream node. Note that a detour-path can have both exclusive and shared upstream nodes in the same time, and the collider node of one detour-path can be an upstream node of another detour-path. Consider two detour-paths that have no exclusive upstream nodes, if they share the same downstream node and at least one upstream node, or one upstream node of one detour path is the collider node of the other detour-path, then two detour-paths are intersecting. One can tell that if P jk 1 intersects with P jk 2 and P jk 2 intersects with P jk 3 , then P jk 1 also intersects with P jk 3 . Then we consider a downstream node V j , let S_IDP denote all the intersecting detour-paths, and let S_SUN denote all the shared upstream nodes of S_IDP. Similar to a single unknown parameter, the coefficient product W P ¼ Y edge l θ l of a Wright's path P can be deemed as a single parameter and one can tell its structural identifiability based on identifiability equations. If a detour-path P has at least one exclusive upstream node, then the Wright's coefficient WP of P is globally identifiable (see Lemma 2 and Additional file 1 for theoretical justification). Also, for a group of intersecting detourpaths, if the node number of S_SUN is equal to or greater than the number of intersecting detour-paths in S_IDP, then the Wright's coefficient of each detour-path in S_IDP is globally identifiable (see Lemma 3 and Additional file 1 for theoretical justification).
Given a DAG G = (V,E), consider two observed nodes V i , V j and an unobserved node V u . V u may not be on any Wright's paths between V i and V j . For this case,if only V u becomes observed in O (k + 1) , then for each observed node V i in O (k) , one can check whether the identifiability equation IE(V i , V u ) is redundant according to Lemma 4 (see Additional file 1 for theoretical justification). That is, when none of the Wright's paths between V i and V u contains detour-paths, IE(V i , V u ) is redundant if and only if each Wright's path between V i and V u passes at least one observed node other than V i and V u ; otherwise, IE(V i , V u ) is redundant if and only if the Wright's coefficient of each detour-path between V i and V u is globally identifiable in O (k) and each Wright's path between V i and V u passes at least one observed node other than V i and V u . If V u is on a Wright's path between V i and V j , and the sufficient and necessary condition for one of the identifiability equations IE(V i , V u ) and IE(V j , V u ) being redundant is similar to Lemma 4 and given in Lemma 5 (see Additional file 1 for theoretical justification). Note that it can be determined whether the Wright's coefficient of a detour-path is globally identifiable according to Lemma 2 and Lemma 3.
Based on Lemma 4 and Lemma 5, we propose a novel graphic method to calculate the identifiability gain g(V i , O (k) ). Let des i denote the descendant node set of V i , anc i denote the ancestor node set of V i , rel i denote the set of nodes that are not included in des i (a) (b) (d) (c) Fig. 1 Four examples for illustrating the detour-path concept, where observed nodes are colored green and unobserved nodes are colored grey. a A simple detour-path; b Two detour-paths P jk , P nk share the same collider node V k and upstream nodes V i , V o ; c Two detour-paths P jk , P nk share the same collider node V k , upstream nodes V i , V m and edge e jk ; d Two detour-paths P jk , P nk share the same upstream node V i and downstream node V j or anc i . Moreover, let bound i ⊂ anc i denote the boundary node set, in which every node has at least one outgoing edge to a node in rel i . Then we can calculate g(V i , O (k) ) by removing the following edges from the original graph G: i) all the incoming edges to the observed nodes that are not collider nodes of detourpaths in anc i ; ii) all the outgoing edges from some observed nodes in des i and rel i and these observed nodes are not the collider nodes of the detour-paths whose Wright's coefficients are unidentifiable in O (k) ; and iii) all the outgoing edges from the observed nodes in bound i to nodes in rel i , and then we get a new graph denoted by G '. Let N w denote the total number of the observed nodes that are connected with V i via any Wright's path in graph G '. Furthermore, one can tell from the edge-removal operation that there still exist some redundant identifiability equations in G ', because the following two types of redundancy cases have not been considered in the edge-removal operation: V i is the downstream node of an arbitrary detour-path, and V i is on a Wright's path between two observed nodes in G '. Let N r denote the number of redundant identifiability equations in G '. According to the topological structure of G ' and the node's observation status, we can obtain N r based on Lemma 4 and Lemma 5 (see the details in Implementation and Verification Section). It can be shown that the identifiability gain is g(V i , O (k) ) = N w − N r (see Theorem 1 and Additional file 1 for theoretical justification).
For a given DAG G and an observation strategy O (k) , different unobserved nodes may associate with different identifiability gains. Naturally, our strategy is to choose the unobserved node in O (k) with the maximum identifiability gain if it becomes observed in O (k + 1) . However, we also recognize that, to assure that all model parameters are at least locally identifiable, certain nodes of a DAG must be observed if they are unobserved in an observation strategy (see Lemma 6 and Additional file 1 for theoretical justification). For convenience, we call such nodes the must-be-observed [14] nodes, and let Lemma 2 If a detour-path P has one or more exclusive upstream node, the Wright's coefficient WP of P is globally identifiable.
Lemma 3 For a group of intersecting detour-paths, if the number of the shared upstream nodes in S_SUN is equal to or greater than the number of intersecting detour-paths in S_IDP, then the Wright's coefficient of each detour-path in S_IDP is globally identifiable.
Lemma 4 Given a DAG G = (V,E), an observed node V i , and an unobserved node V u in O (k) , if only V u becomes observed in O (k + 1) , there exist two cases: 1) each Wright's path between V i and V u passes at least one observed node other than V i and V u when none of the Wright's paths between V i and V u contains detour-paths; 2) each Wright's path between V i and V u passes at least one observed node other than V i andV u , and the Wright's coefficient of each detour-path between V i and V u is globally identifiable in O (k) when certain Wright's paths between V i and V u contain detour-paths.

Then the identifiability equation IE(V i , V u ) is redundant if and only if one of the above conditions holds.
Lemma 5 Given a DAG G = (V,E), two d-connected observed nodes V i and V j , and an unobserved node V u in O (k) , if V u is on a Wright's path between V i and V j and only V u becomes observed in O (k + 1) , there exist two cases: 1) each Wright's path between V i and V j passes at least one observed node other than V i and V j when none of the Wright's paths between V i and V j contains detour-paths; 2) each Wright's path between V i and V j passes at least one observed node other than V i and V j , and the Wright's coefficient of each detour-path between V i and V j is globally identifiable in O (k) when certain Wright's paths between V i and V j contain detourpaths.
Then one of the two identifiability equations IE(V i , V u ) and IE(V j , V u ) is redundant if and only if one of the above conditions holds.
Theorem 1 Given a DAG G = (V, E) and an unobserved node V i in an observation strategy O, let G ' denote the sub-graph after the edge-removal operation. Then the identifiability gain is g(V i , O) = N w − N r , where N w denotes the total number of the observed nodes that are connected with V i via any Wright's path in graph G ', and N r denotes the number of redundant identifiability equations in graph G '.
Lemma 6 For a given DAG G = (V, E), the following nodes must be observed to assure that all the parameters of the corresponding SEM are at least locally identifiable 1) The nodes with an out-degree 0; 2) The nodes with an out-degree 1; 3) The nodes with an in-degree 0 and an out-degree less than 3.

Dynamic programming strategy
can be calculated according to Theorem 1. Note that the order at which V o_r is selected into O (0) {V o_1 , …,V o_R } will not change the observation strategy (e.g., and thus have no effect on the value of f(O (0) ). The second step of our dynamic programming strategy is to define stages and their associated states. Let S denote the number of unobserved nodes in O (0) , and let V u_s (s = 1, 2, ⋯, S) denote the s-th unobserved node in O (0) , then the dynamic programming procedure can be divided into S + 1 stages. For illustration purpose, we consider a simple example with 5 unobserved nodes, as shown in Fig. 2. The 0-th stage is actually the initialization step as described in the previous paragraph, and it has only one state, i.e., O (0) . At the first stage, there are S = 5 different states; that is, only one of the unobserved nodes {V u_1 ,V u_2 , ⋯,V u_5 } in O (0) will be selected to observe. At the second stage, since one of the five unobserved nodes has been selected at the previous stage, there are only four unobserved nodes for selection and thus four states exist (that is, {V u_2 ,V u_3 , V u_4 ,V u_5 }). Therefore, as shown in Fig. 2, except for stages 0 and 1, each subsequent stage has one less states than its previous stage; also, the upper triangular region (see the area above the labels of stages 1-5 in Fig. 2) is empty because the selection order of unobserved nodes does not affect the eventual observation strategy so the inclusion of such states in the upper triangular region is redundant. One can tell that the proposed stage and state definitions satisfy the optimality principle of dynamic programming [47][48][49].
The third step is to compute the state transition costs for searching the optimal state transition path(s). According to the definitions of stages and states, there may exist several different states at the s-th stage that can transit to the same state at the (s + 1)-th stage. For instance, four states V u_1 , V u_2 , V u_3 and V u_4 at the first stage can transit to V u_5 at the second stage, as shown in Fig. 2. The state transition cost from state V u_i to state V u_j (i ≠ j) between two consecutive stages is just the identifiability gain g( Then the cardinality of an observation strategy can be computed by adding f(O (0) ) and all the state transition costs along the state transition path. Since the goal of the dynamic programming strategy is to search for the optimal observation strategies, when there exist multiple transition paths from state V u_i in O (k) to state V u_j in O (k + 1) (i ≠ j), the transition path associated with the maximum identifiability gain will be chosen; that is, The dynamic programming strategy above can be mathematically described in Eq. (3), and we have implemented this strategy in R (see the "Implementation and verification" Section), Fig. 2 Schematic illustration of the stages, states and state transition costs in the proposed dynamic programming strategy It should be stressed that it is not necessary to finish all the S iterations as shown in Eq. (3). Once the cardin- at the k-th stage becomes equal to or greater than the number of unknown parameters N u , the dynamic programming process will stop and we get the SIOOR strategies.

Results and Discussion
Overview of the framework Observation strategy design is an under-investigated problem for biological networks, despite the fact that a variety of biological networks have been actively constructed and used in numerous benchside or bedside studies. However, the existence of latent variables is a common problem due to cost, technical or other limitations, and has significantly hampered our capability to quantitatively investigate and understand such networks via, e.g., key network parameter estimation from experimental data. Identifiability analysis has long been recognized as a powerful tool to assure the accuracy and reliability of parameter estimation techniques; however, identifiability-based observation strategy design for biological networks turns out to be an unexplored field despite its substantial importance to biological network studies like structure identification.
To the best knowledge of our authors, this is the first study that tackles the problem of identifiability-based observation strategy design for biological networks described by linear SEMs. First, we introduce several new concepts such as cardinality of observation strategy and identifiability gain and mathematically formulate the identifiability-based optimal observation problem. Second, for a given network structure, the key idea is to turn a minimum number of unobserved nodes in the original observation strategy into observed such that the number of non-redundant identifiability equations becomes greater than or equal to the number of unknown model parameters (i.e., the whole system is at least locally identifiable). By counting the number of non-redundant identifiability equations, we avoid performing actual identifiability analysis on SEM and the proposed method is thus computationally efficient. Third, by defining the concepts of stage division and state transition, a dynamic programming strategy is proposed to solve the maximization problem without involving any time-consuming symbolic computation or matrix operations [33,34]. Fourth, an efficient computing algorithm is proposed to calculate the identifiability gain of each unobserved node in a given observation strategy. More specifically, the computing process is significantly simplified by counting the number of observed nodes that connect with the node of concern via Wright's paths after removing certain edges from the original graph.
It takes a non-constant time to compute the node identifiability gain in each iteration, and the algorithm complexity depends on the number of observed nodes. Furthermore, the number of iterations of the dynamic programming algorithm does not depend on the total number of nodes, but the number of unobserved nodes in the original observation strategy. Let S denote the number of unobserved nodes and T denote the number of observed nodes in the original observation strategy, then the computation complexity of the dynamic programing strategy is O(S 2 ⋅ T).

Implementation and verification
The flowchart of the proposed algorithm for searching the structural identifiability-based optimal observation remedy is shown in Fig. 3. We have implemented the dynamic programming algorithm in R, and all the source codes and examples are freely accessible at https:// github.com/Hongyu-Miao/SIOOR.
Here we describe several important technical details of the implementation. First, at the state transition step, there exist multiple transitions that produce the same f(O (k + 1) ), our current implementation chooses only one such transition to update the next-stage observation strategy. If needed, the R code can be slightly modified to enumerate all optimal observation strategies. Second, since the boundary node set is just a subset of the ancestor node set for a given node, the processing of the boundary nodes is incorporated into the processing of the ancestor nodes in the current implementation. In order to verify the implementation, synthetic DAGs can be generated for this purpose, like the two DAG examples in Fig. 4. The first DAG contains 8 nodes and 13 edges, and it has only a single input node and a single output node. Moreover, the first example considers a special initial observation strategy (i.e., all nodes are unobserved) to illustrate the capability of the proposed method to design optimal observation strategy from scratch. The second DAG has multiple input and output nodes, and it considers a more general situation, that is, there exists both observed and unobserved nodes in the initial observation strategy. We analyzed the two examples using the proposed algorithm, and used the identifiability matrix method [34] to verify that the obtained observation remedies do grant (local) identifiability to all model parameters.

Applications to real biological networks
Since it is impossible to cover all the biological networks in various databases and knowledge repositories [16,17] in one study, we choose the biological network associated with influenza A virus [39] as an application example for illustration purpose. IAV can infect birds as well as mammals including human, and it has been one of the major infectious pathogens that have caused millions of human deaths. It is thus of great scientific significance to systematically understand IAV infection and immune response mechanisms. Therefore, Matsuoka et al. [50] manually curated a comprehensive database, called FluMap, for depicting the influenza virus life cycle at the molecular level from over 500 previous publications. There are mainly five modules in FluMap: virus entry, virus replication and transcription, post-translational processing, transportation of virus proteins, and packaging and budding. Given the critical role of virus replication in influenza virus life cycle, numerous experimental studies (e.g., [42,[51][52][53]) have made attempts to understand virus replication mechanisms and their clinical implications. Thus, we choose to focus on the IAV replication module and analyze its observation strategy.
Since IAV replication involves many different biomolecules and complex interactions, it is usually infeasible to observe all such components and their interactions in one study. The question of concern here is how to choose a minimal number of nodes in Fig. 5 to observe such that all the model parameters become at least locally identifiable. Note that Fig. 5 is derived from Matsuoka's work [50], and consists of 22 nodes and 26 edges; for simplicity, the catalyzers and inhibitors in this network are treated as reactants.
A relevant concept, called observability, has been previously investigated by Liu et al. [38] for complex dynamic systems. Although observability analysis also deals with observation strategies considering the existence of latent variables, it is very different from identifiability analysis in two aspects: 1) the focus of observability analysis is not model parameters but how to infer the unobserved state variables from experimentally measured outputs of a system; 2) the graphical approach proposed by Liu et al. was developed for the so-called balance equations based on mass-action kinetics, the model structures of which are very different from static linear SEMs. However, it is of interest to compare the identifiability-based observation results with those of the observability-based method. For this purpose, we assume that all the nodes in Fig. 5 are initially unobserved. After applying the proposed dynamic programing method, we get the optimal observation strategy shown in Fig. 6(a) for achieving parameter identifiability. The optimal observation strategy produced by Liu's observability approach is shown in Fig. 6(b). According to Fig. 6(a) and (b), one can tell that the identifiability-based observation strategy contains 20 observed nodes and 2 unobserved nodes, while the observability-based strategy contains 3 observed nodes and 19 unobserved nodes. That is, for the IAV replication module, the system internal states can be inferred from a few observed output nodes if a balance equation model is used; however, it needs much more observed nodes to achieve parameter identifiability if a linear SEM is used. Such an observation is not only due to the different goals of observability and identifiability analyses, but also the differences in the underlying model structures used in observability or identifiability analyses.
Moreover, besides the nodes with an out-degree 0 or 1 as mentioned in Lemma 3, the identifiability-based observation strategy is also likely to select the nodes with a high out-degree as unobserved nodes; for instance, the two unobserved nodes viral_RNA and NP(ub) in Fig. 6(a) have the highest out-degrees 2 and 3, respectively. This is because, if an unobserved node has a high out-degree, this node is connected with many out-neighbor nodes; when its out-neighbor nodes are observed, there will exist multiple Wright's paths that connect such out-neighbor nodes and pass this unobserved node, and the corresponding identifiability equations thus contain the parameters associated with the out-edges of this unobserved node such that these parameters can be identifiable. Interestingly, the observability-based strategy tends to select the nodes with a low out-degree as observed nodes, for example, all the nodes with 0 out-degree are observed in Fig. 6(b). It is because the nodes with an out-degree 0 in a DAG are usually the final products of chemical reactions, instead of reactants, and thus the internal states associated with other nodes can be easily inferred based on the balance equations if all the final products of chemical reactions are measured.

Conclusions
In this study, we address an important problem for biological networks: the design of observation strategies for all edge coefficients being identifiable. Linear SEMs are used as the mathematical representation of biological networks, which allows us to formulate the problem as a constrained optimization problem. A dynamic programming strategy was then developed to solve the constrained optimization problem to obtain the optimal observation strategies at the cost of turning a minimal number of unobserved nodes into observed. The proposed solution is novel and efficient because it avoids both symbolic computation and matrix operations as used in other studies, and we provided necessary theoretical justifications for the proposed algorithm. As verified by multiple examples (synthetic or real networks), the proposed solution is generic and can be applied to an arbitrary DAG (recursive SEMs) without bidirectional edges.