Discovery of Boolean metabolic networks: integer linear programming based approach

Background Traditional drug discovery methods focused on the efficacy of drugs rather than their toxicity. However, toxicity and/or lack of efficacy are produced when unintended targets are affected in metabolic networks. Thus, identification of biological targets which can be manipulated to produce the desired effect with minimum side-effects has become an important and challenging topic. Efficient computational methods are required to identify the drug targets while incurring minimal side-effects. Results In this paper, we propose a graph-based computational damage model that summarizes the impact of enzymes on compounds in metabolic networks. An efficient method based on Integer Linear Programming formalism is then developed to identify the optimal enzyme-combination so as to minimize the side-effects. The identified target enzymes for known successful drugs are then verified by comparing the results with those in the existing literature. Conclusions Side-effects reduction plays a crucial role in the study of drug development. A graph-based computational damage model is proposed and the theoretical analysis states the captured problem is NP-completeness. The proposed approaches can therefore contribute to the discovery of drug targets. Our developed software is available at “http://hkumath.hku.hk/~wkc/APBC2018-metabolic-network.zip”.

of the deletion effect of enzymes using a branching process. Enumerating all the possible enzyme combination is infeasible since the number of combination increases exponentially with the number of enzymes. Lu et al. [13] developed an integer linear programming method for designing synthetic metabolic networks by considering minimum reaction insertion in a Boolean model. The proposed method can appropriately solve minimum reaction insertion in a Boolean network.
In pharmaceutics, the development of drugs mainly focus on target identification and lead inhibitor identification [14]. Furthermore, many researchers have done a lot of work on the efficacy of drugs regardless of their side-effects (toxicity). However, toxicity or lack of efficacy may occur when the compounds which are not the intended targets in the metabolic network. Thus how to effectively reduce the side-effect of drugs has become an important and challenging issue. The current works aim to identify the biological targets (could be enzyme or protein) for drugs which can be manipulated so as to produce the desired effect (i.e., curing a disease) with minimum disruptive side-effects [15,16]. Drug targets, in particular enzymes, are chosen to reduce abnormal metabolites by formulating an optimal combination problem in enzyme combination in metabolic networks [10,17]. It is known that the function of enzymes is to catalyze reactions and produces metabolites (compounds) in metabolic networks. Human diseases, especially metabolic diseases, may be directly caused by the accumulation of certain compounds due to the malfunctions of enzymes. We denote such compounds as target compounds while the others as non-target compounds. For example, the malfunction of enzyme phenylalanine hydroxylase results in accumulation of the amino acid phenylalanine which causes the disease phenylketonuria [18]. Therefore, there is a great need to identify a set of enzymes which are manipulated by drugs to prevent the buildup of target compounds with minimal damage. Here damage corresponds to the number of non-target compounds whose production are forced to stop by the inhibition of that enzyme (or enzyme set).
Then the problem becomes to identify the optimal enzyme set whose inhibition eliminates the target compounds and at the same time, incurs minimum damage based on the given metabolic network and a set of target compounds. We denote such problem as enzyme combination identification (ECI). Sridhar et al. [19] developed a branch-and-bound algorithm to dynamically explore the search space. Furthermore, two filtering strategies are proposed to prune the search space to guarantee an optimal solution. However, their algorithm is complicated and impractical for us to use it. In this paper, we develop an efficient method based on Integer Linear Programming [20][21][22] which has a wide application in solving NP-hard problems. Furthermore, all instances of the original problem are needed to convert into integer programming formalization so as to apply the existing free solver called CPLEX [23]. This paper has two main contributions. First, we formulate a new biological problem based on Boolean metabolic network to identify the drug target with minimum damage. Second, we propose an effective and efficient method, Integer Linear Programming, (needs O(m + n) variables) to solve the captured problem and it is easy to implement. By integrating the human metabolic networks, we have shown that the proposed approach can accurately identify the target enzyme set for known drugs in the literature. Furthermore, experiments have shown that our proposed method is extremely efficient which can effectively solve the problem in seconds.
The remainder of the paper is organized as follows. In "Problem formulation" section, we formulate the captured problem. "Methods" section presents the algorithm ILP-ECI. In "Results and discussions" section, we discuss the experimental results and give a theoretical illustration. Finally, conclusions are given in the last section.

Problem formulation
A metabolic network can be represented by a directed graph G = (V , A) which captures the interaction between reactions, compound, and enzymes. Let C, R, and E denote the set of compounds, reactions, and enzymes, respectively. Then V C , V R and V E denote the set of vertices from C, R, and E, where V C = {v c 1 , v c 2 , · · · , v c m } denotes a set of compound nodes, V R = {v r 1 , v r 2 , · · · , v r n } represents a set of reaction nodes and V E = {v e 1 , v e 2 , · · · , v e l } defines a set of enzyme nodes. The subscript m, n, l denote the number of compound nodes, reaction nodes and enzyme nodes, respectively.
Here {} is the empty set. Let V s ⊆ V C and V t ⊆ V C be sets of source nodes and target nodes, respectively. Source nodes are those nodes having no incoming edges and they are the seed compounds. A metabolic network can be defined as follows: holds and "∧" represents "AND" function; • Each source node has no incoming edges; • For nodes v / ∈ V s and v ∈ V C have at least one incoming edge.
The main problem Enzyme Combination Identification (ECI) in a Boolean network is first described with a simple example and then followed by its mathematical formalization.
A small hypothetical metabolic network is shown in Fig. 1, where circles, rectangles and triangles represent compounds, reactions and enzymes, respectively. For instance, R 1 is a reaction, its substrates are C 1 and E 2 and its product is C 2 . Here C 5 is the target compound in this figure which shall be stopped. To stop the production of C 5 , reaction R 2 must be prevented from taking place. One of the possible solution is to disrupt one of its catalyzing enzymes (E 1 for instance). Another is to stop the production of its reactant compounds (i.e., C 2 or C 3 ). If C 2 is stopped, then two possible ways can achieve this effect (i.e., enzyme E 2 or E 1 ). It is noted that C 2 , R 2 , C 7 and R 1 forms a cycle, the inhibition of E 1 will result in inactivating reaction R 2 which will further makes C 2 inproducible. The other way is inhibiting enzyme E 2 which makes R 1 inactive and further stop the production of compound C 2 . Therefore, the inhibition of enzyme E 1 or E 2 can result in stopping the production of the target compound.
We define the Enzyme Combination Identification as follows.
• Input: A metabolic network and a set of target compounds T(T ∈ C), C is the set of compounds. • Output: Find a set of enzymes X(X ∈ E) with minimum damage, whose inhibition stops the production of all the compounds in T.
For simplicity, we assume there are no external inputs to all reactions and all the input compounds related to reactions are shown in the network. We note that different compounds and enzymes may have different levels of importance in the metabolic networks. Here we assume that all the compounds and enzymes are of equal importance. We assign binary value (i.e., 0 or 1) to each node V in the Boolean model. "0" means that Fig. 1 A graph constructed for a hypothetical metabolic network the corresponding compound is not producible or the corresponding reaction is inactive while "1" means that the corresponding compound is producible and the corresponding reaction is active. For the related enzyme catalyzing the reaction, "1" means that it is active and "0" means that it is inhibited. Let G be such an assignment (G is a function from V to {0, 1}). For each node v ∈ V , we write v = 1 (resp., v = 0) if 1 (resp., 0) is assigned to v. Then G can be regarded as a valid assignment if the following conditions are satisfied: Condition (ii) implies that compound nodes correspond to "OR" nodes. Condition (iii) implies that reaction nodes correspond to "AND" nodes, which means that the output is forced to 0 when a node is inactivated. From Fig. 1, we can see that inhibition of E 1 results in the knock out of compounds C 4 , C7, C 8 and C 9 in addition to the target compound C 5 . Then we denote the number of non-target compounds knocked out as the damage, which is caused by the manipulating the enzyme set in the metabolic network. It can be seen that the damage of inhibiting E 2 is 2 (i.e., C 2 and C 4 ). Compound C 7 is still producible because it can be produced by R 3 even after the inhibition of E 2 . The damage effect of inhibition of E 1 is 4 (i.e., C 4 , C7, C 8 and C 9 ). Both E 1 and E 2 are potential drug targets since they can achieve the effect of disrupting the target compound C 5 . However, E 2 is a better drug target than E 1 owing to the fact that it causes less damage.

Methods
In this section, we introduce integer programming-based methods for ECI. Integer programming, in particular, Integer Linear Programming (ILP) is set to minimize (or maximize) a linear objective function under linear constraints with all the variables taking integer values. In the following, each variable takes including the binary value (i.e., 0 or 1), representing the Boolean values. We apply ILP to ECI since ILP is widely used for solving NP-hard problems.
The ILP formulation for the network in Fig. 1 is as follows: subject to TX + FX = 1for any X We denote the above formalization as ILP-ECI. Here all variables including the value of reaction compound and enzyme nodes take either 1 or 0. Thus, v r i can be either 0 or 1, and v c i and v e i also take 0 or 1. In this example, v r i = 0 (resp. v r i = 1) indicating that the value of reaction i takes 0 is represented by FRi = 1 (resp. TRi = 1) which implies that the reaction is inactivated (otherwise, the TRi = 1 implies the reaction is activated). Therefore, TRi = 0 (equivalent to FRi = 1) means the corresponding value for true reaction takes 0, which implies the reaction is inactive. And FRi = 0 (equivalent to TRi = 1) indicates the corresponding value for false reaction takes 0, which implies the reaction is active. Thus, TRi + FRi = 1 holds for any node i in the network. Similarly, TCi and FCi are used to represent the values of compound nodes. For instance, TC2 = 1 means that v c 2 = 1 and in other words, FC2 = 0 since TCi + FCi = 1. Furthermore, v r i corresponds to "AND" node which implies that if v e i = 0 will inactivate v r i .
The objective function (1) means that the damage should be minimized. FCi = 1 (or TCi = 0) means that a compound v c i is not producible. Equation (2) means that the target compound v c 5 should be 0 after the 0-1 assignment converges. Equation (3) represents the Boolean relation v r 1 = v c 1 ∧ v c 7 ∧ v e 2 . Note that the Boolean relations such as "∨" or "∧" cannot be used in ILP formulation, we need to convert them into linear equations and/or inequations. Actually, "∨" indicates "AND" function and "∧"represents "OR" function. Since x 1 = x 2 ∧ x 3 ∧ · · · ∧ x n can be represented by Thus Eq. (3) is obtained. Similarly, Eqs. (4)-(5) represent the constraints of v r 2 and v r 3 , respectively.
For a compound node with indegree is 1 which indicates the node has only one incoming edge, the value of the predecessor is just copied. For instance, since v c 2 has only one predecessor v r 1 , v c 2 is just copied from v r 1 as shown in Eq. (6). Similarly, v c 4 is just copied from v r 2 which is shown in Eq. (7).
However, for a compound node with indegree more than 1, it is necessary to convert the "∨" relation into linear equation or inequations. Equation (9) represents the Boolean relation v c 7 = v r 2 ∨v r 3 . Since x 1 = x 2 ∨x 3 ∨· · ·∨x n can be represented by 3 . Thus, Eq. (9) can be obtained.
Equation (13) means that "T" and "F" correspond to "true (1)" and "false (0)", respectively, and complement each other. X in Eq. (13) means any component or reaction in the metabolic network.
The above formalization can clearly solve ECI and obtain the correct solution {E 2 }. Besides, the number of variables is O(m + n) in the above formalization where m and n are the number of compounds and reactions, respectively.
It is noted that solving ILP is NP-complete, however, a problem that can be formalized as ILP is not always NPcomplete. Thus in the following, we prove that ECI is NPcomplete.

Theorem ECI is NP-complete problem with the maximum indegree and outdegree being bounded by 2.
Proof Obviously, the problem is in NP, it suffices to show that it is NP-hardness. The proof is by a polynomial time reduction from minimum edge cover (MEC), which is a problem for a given graph to find the minimum number of edges so that each node is incident to at least one of the selected edges. For instance, E 1 = {e 2 , e 3 , e 6 } is one of optimal solutions of MEC for graph shown in Fig. 2. Let G = (V , E) be an instance of MEC, where V = {v 1 , v 2 , · · · , v n } and E = {e 1 , e 2 , · · · , e m }. We then construct the corresponding ECI as below. The metabolite network G = (V c ∪ V r ∪ V e , E) is given by V c = {c 1 , c 2 , · · · , c m } ∪ c t , V r = {r 1 , r 2 , · · · , r n } , E = c i , r j |i = 1, · · · , m, j =1, · · · , n, if v j is an end point of e i .} ∪ r j , c t |j = 1, · · · , n It is noted that the minimum damage is determined uniquely by the inhibition of enzyme set. Furthermore, our objective is to minimize the "damage" (side-effects).
Then V e can be regarded as virtual nodes and denoted as an empty set in this case. The ECI problem can be converted into the problem of identifying the minimum set of non-target compounds. Thus the graph for MEC shown in Fig. 2 is converted into ECI shown in Fig. 3. It is clear that this conversion can be done in polynomial time. Then we show that MEC for G has a solution of size z if and only if ECI has a minimum damage of size z. To guarantee that the target compound c t is stopped (i.e., c t = 0), it implies that all r j (j = 1, · · · , n) takes the value 0. If G has an edge cover of size z, then it follows that the minimum number of c i taking 0 should be z. On the other hand, if the minimum damage for ECI is z, then each r j must be 0 so as to satisfy c t = 0, we have at least predecessor of each r j must be included in the minimum damage set. Since there is an edge between c i and r j if and only if v j is incident to

Results and discussions
In this section, we verify the biological validity of our proposed method by using known drugs. Besides, the performance of the ILP-ECI algorithm is evaluated by using the execution time which indicates the total time taken by the method. For the experimental data, we extracted the metabolic network information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [24]. KEGG is database which provides known drug molecules along the the enzymes they inhibit and their therapeutic category. Then we use drugs at this database as our benchmarks and we report two of them due to the space limitation. And the value in parenthesis that starts with letter "C", "D", and "E" (e.g., E1.13.11.34) is the unique identifier assigned to the corresponding compound, drug, and enzyme respectively in KEGG.
The drug we used in this paper is Benoxaprofen (D03080) which inhibits arachidonate 5-lipoxygenase (E1.13.11.34) [25]. This enzyme appears in several networks including arachidonic acid metabolism network (hsa00590). In Pharmacology, the inhibition of 5-lipoxygenase will decrease the biosynthesis of LTB4 (C02165), cysteinylcontaining leukotrienes LTC4 (C02166), LTD4 (C05951) and LTE4 (C05952) (see, for instance, Fig. 4) [26]. According to our graph model, the removal of E1.13.11.34 will eliminate four compounds (LTB4, LTC4, LTD4 and LTE4) in arachidonic acid metabolism network. Furthermore, these four compounds play an important role in the mechanisms of toxic brain damage in acute methanol poisoning in humans [27]. Thus, we chose them as the target compounds. Apart from that, inhibiting this enzyme also eliminates four more compounds (i.e., 5(S)-HPETE(C05356), 5-HETE(C04805), LTA4(C00909), and 20-OH-LTB4(C04853)) [28]. These compounds can be regarded as "damage" in our model. Furthermore, when applying ILP-ECI with LTB4, LTC4, LTD4 and LTE4 as the target compounds, we find LTA4H (E3.3.2.6) and LTC4 synthase (E4.4.1.20) as the optimal enzyme set since the inhibition of these two enzymes eliminates only one non-target compound namely 20-OH-LTB4 (C04853). Thus, it is shown that ILP-ECI potentially finds a better solution in this experiment than the existing drugs since the same target compounds are eliminated by the existing drug in addition to other four compounds. Indeed, recent research validated our model since the anti-inflammatory effect of the levels of LTA4H [29] and LTC4 [30] has been observed. The computational time in this experiment takes only 5.19 s.
Another experiment we conducted is the histidine metabolism network (hsa00340). The enzyme amine oxidase (E.1.4.3.4) appeared in hsa00340 can be inhibited by drug (Rasagiline (D02562)). According to our graph model, the removal of enzyme E.1.4.3.4 will result in eliminating two compounds Methylimidazoleacetic acid (C05828) and Methylimidazole acetaldehyde (C05827). It should be noted that the level of pros-methylimidazoleacetic acid is closely related to severity of Parkinson disease in patients [31,32]. Running ILP-ECI with Methylimidazoleacetic acid and Methylimidazole acetaldehyde as the target compounds finds amine oxidase as the optimal enzyme. It takes only 3.26 s to run ILP-ECI. This experiment verifies that Rasgiline targets the optimal enzyme. An important advantage of our Boolean model is its capability of detecting the lack of substrates where the connectivity-based methods fail to handle this. Another advantage of this model is its capability of handling branches and cycles in a pathway from the source compound to the target compound. However, there are still have some limitations in this method. One of the major drawbacks is the assumption that all the compounds and enzymes are of equal importance. However, different compounds and enzymes Fig. 4 Subgraph of arachidonic acid metabolism may have different levels of importance in the metabolic networks. Our future work will focus on developing other models which include the weight of different nodes.

Conclusions
In this paper, we formulate the optimal enzyme combination identification (ECI) problem as an optimization problem in Boolean metabolic networks. We have proven that ECI in the Boolean model is NP-complete and the target enzyme set is uniquely determined when the target compounds are given. Furthermore, considering that an exhaustive search cannot be used to solve ECI when the network is large, we developed an ILP-based algorithm for ECI. Considering that the computational time of IPbased method is exponential to the number of variables, to improve the scalability of the developed method, it is vital to reduce the number of variables appearing in IP formalization. And our proposed IP-based method needs O(m + n) variables.
The efficiency and effectiveness are validated by the computational experiments in which datasets were downloaded from the KEGG database. The results demonstrate that the proposed model can accurately identify the target enzymes for known successful drugs in the literature. Specifically, ILP-ECI has found a different enzyme set for the target compounds of Benoxaprofen which indicates that our method has a great potential to be better than Benoxaprofen. The reason is that the solution of our algorithm damages only one non-target compound while Benxaprofen damages 4 non-target compounds including the compound damaged by ILP-ECI's solution. Besides, ILP-ECI has found that the same target enzyme like Rasagiline when its target compounds are given in advance. It is to be noted that our problem needs only O(m + n) variables in the IP formalization. The experiments also show that ILP-ECI can solve the problem in a short time which confirms the efficiency of our algorithm.