### The problem of pathways and EFM extraction

Metabolic network models enable both full and context-specific analyses of the role played by any reaction or metabolite inside the cell. This is particularly valuable in disease research, where a specific piece of information must be extracted from the full network. Pathways and Elementary Flux Modes (EFMs) are types of sub-network whose analysis has been highlighted by numerous works [10, 11]. The main drawbacks of using EFMs are the high computational cost of enumerating them and, when only a subset of all possible EFMs in a GSMN is obtained, the uncertainty about whether that subset is biologically relevant enough. Several proposals exist to enumerate subsets of EFMs in GSMNs [12–15]. There is also a family of algorithms for context-specific metabolic network reconstruction that ensure the presence of a key set of reactions within the simplified resulting model.

There are two main groups of computational approaches to extract information from metabolic pathways: path-finding and stoichiometric [16]. The former treat the network as a directed graph and explore it [17–19]; their main disadvantage is that they ignore stoichiometric coefficients during the exploration process. The latter use the stoichiometric data in their calculations. Linear Programming and the Null-Space Algorithm [20] are some of the mathematical strategies applied to find pathways, mainly by solving the system of linear equations defined by the stoichiometric matrix.

### Genomic metabolic networks as a system of equations

Each metabolic reaction inside a cell can be represented by its corresponding stoichiometric equation. All the equations are arranged in a stoichiometric matrix where columns represent metabolic reactions and rows represent metabolites. The matrix values are the stoichiometric coefficients for the production or consumption of metabolites in each reaction.

Let *S* be a stoichiometric matrix (i.e., the matrix of coefficients of the biochemical reactions of the studied network). These coefficients represent the frequency at which reactions occur at steady state or, equivalently, the rate of metabolite production/consumption through the reactions. A feasible sequence of reactions occurring inside a cell can be summarised in a vector, called the flux rate vector, that contains the reaction rates, that is, the values of the variables of the system of equations represented by the *S* matrix. If *R* is the full set of metabolic reactions, the flux vector \(\vec {v}\) must fulfil the steady-state condition (Eq. 1) and the thermodynamic constraints (Eq. 2).

$$\begin{array}{*{20}l} S \cdot \vec{v} = \vec{0} \end{array} $$

(1)

$$\begin{array}{*{20}l} v_{r} \geq 0, \quad \forall{r}\in R \end{array} $$

(2)

Equation 1 (i.e., the steady-state condition) imposes a balance among all metabolites, keeping their concentrations constant. The thermodynamic constraint forces each irreversible reaction present in the solution to have a non-negative rate; this is a biological thermodynamic restriction. Many methods need to split reversible stoichiometric reactions into two irreversible reactions to enforce this restriction. This is a very common strategy, and it is required when linear programming is used as the mathematical tool.
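As an illustration of the splitting strategy, the sketch below (in Python with NumPy; the paper prescribes no language, and names such as `split_reversible` are our own) duplicates each reversible column of *S* with negated sign, so that every flux can then be constrained to be non-negative:

```python
import numpy as np

def split_reversible(S, reversible):
    """Split each reversible reaction (column) of S into a forward and a
    backward irreversible reaction, so all fluxes can be kept >= 0."""
    cols, names = [], []
    for j in range(S.shape[1]):
        cols.append(S[:, j])
        names.append(f"r{j}_fwd" if j in reversible else f"r{j}")
        if j in reversible:
            cols.append(-S[:, j])        # backward direction: negated column
            names.append(f"r{j}_bwd")
    return np.column_stack(cols), names

# Toy matrix: 2 metabolites x 2 reactions, with reaction 1 reversible
S = np.array([[1, -1],
              [0,  1]])
S_split, names = split_reversible(S, reversible={1})
```

After splitting, the backward column is the negation of the original reversible column, and the matrix gains one column per reversible reaction.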

Once the above equations are solved, at least two different solutions are obtained: the trivial one and another that represents the whole network. A pathway is just one of the solutions: a non-negative flux vector that verifies the steady-state condition. Note that any solution is associated with a subset of the full set of reactions, namely those with non-zero rate. Starting from the whole network and working iteratively, the goal is to find subsets of reactions, those associated with pathways, that correspond to solutions of the system. A pathway is thus a subset of the full set of reactions for which we can find a flux vector that solves the previous equations and whose support (its non-zero entries) corresponds exactly to the reactions in the pathway. If the metabolic network is represented as a graph, a pathway can be seen as a sub-graph.

\(\vec {v}\) is said to be the vectorised representation of an EFM if it is not decomposable (i.e., it cannot be written as a positive linear combination of flux rate vectors representing any other pathways in the network). It is well known that a pathway \(\overrightarrow {v}\) is an EFM if and only if there is no other pathway whose support is strictly included in that of \(\overrightarrow {v}\). Non-decomposability, also called *minimality*, is the condition that lets us transform the extraction method into an optimization problem instead of just a system of equations.

The biological relevance of an EFM is inherited from the uniqueness of the set of EFMs and its canonical quality as a set of vectors that can generate any pathway, even unobserved ones [11, 21].

### Linear programming

Linear programming (LP) is the most popular optimisation-based approach for EFM extraction. LP is widely used to reduce the complexity of combinatorial problems in systems biology by introducing optimization objectives that guide the search and constrain the space of solutions to a subset with a specific focus [22–24].

The existing literature describes how to formulate an EFM extraction problem as a linear program considering the constraints defined in the previous section [13] (Eq. 3).

$$\begin{array}{*{20}l} \text{Minimize} \quad & \sum\limits_{i=1}^{n} v_{i} \\ \text{subject to} \quad & S \cdot \vec{v} = \vec{0} \\ & v_{r_{i}} \geq 0 \quad \forall r_{i} \in R \end{array} $$

(3)

Once a linear program is obtained, it can be solved using, for instance, the Simplex Algorithm implemented in plenty of LP solvers.

A trivial solution for the linear program posed in Eq. 3 is \( v_{r_{i}} = 0 \quad \forall r_{i} \in R\), which provides no biological information. Therefore, we must modify the linear program by adding new constraints in order to obtain solutions other than the trivial one.
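For concreteness, the linear program of Eq. 3 can be handed to an off-the-shelf LP solver. The minimal sketch below uses `scipy.optimize.linprog` on a small hypothetical network (both the solver and the toy matrix are our own choices, not the paper's) and, as the text anticipates, the solver returns the trivial all-zero solution:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: => A (r2), A -> B (r0), B -> C (r1), C => (r3);
# rows are metabolites A, B, C.
S = np.array([[-1,  0,  1,  0],
              [ 1, -1,  0,  0],
              [ 0,  1,  0, -1]])
n = S.shape[1]

# Eq. 3: minimise sum(v) subject to S v = 0, v >= 0
res = linprog(c=np.ones(n), A_eq=S, b_eq=np.zeros(3),
              bounds=[(0, None)] * n, method="highs")
```

Since the zero vector is feasible and the objective is non-negative, the optimum is the trivial solution, which motivates the seed constraints discussed next.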

Additional constraints can be seen as sets of reactions forced to have positive fluxes (positive constraints) or, conversely, as sets of reactions forced to be inactive, meaning that their associated fluxes equal zero (negative constraints).

We define a *seed* as a constraint of this kind, in reference to the fact that a seed is the precursor of one solution. Seeds are needed to coerce the solvers into finding non-trivial, thermodynamically feasible solutions, and they can also be used to try to find solutions that differ from the previous ones. Existing heuristics that search for good seeds introduce jumps as a strategy to escape the attraction of local minima. This attraction is one of the causes of the repetitions in the solutions found by the LP solver.
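One simple way to encode a positive seed in an LP is to raise the lower bound of each seeded flux from 0 to a positive value. The sketch below (Python with `scipy.optimize.linprog`; the toy network and the bound-based encoding are our own illustrative assumptions) seeds a linear chain with one reaction, which forces the whole chain to carry flux:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: => A (r2), A -> B (r0), B -> C (r1), C => (r3);
# rows are metabolites A, B, C.
S = np.array([[-1,  0,  1,  0],
              [ 1, -1,  0,  0],
              [ 0,  1,  0, -1]])
n = S.shape[1]

# Positive seed {r0}: lower-bound that flux at 1 instead of 0,
# so the solver cannot return the trivial all-zero solution.
seed = {0}
bounds = [(1, None) if i in seed else (0, None) for i in range(n)]
res = linprog(np.ones(n), A_eq=S, b_eq=np.zeros(3),
              bounds=bounds, method="highs")
```

In this chain every metabolite balance couples consecutive fluxes, so seeding a single reaction activates all four with equal rate.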

### EFM extraction using LP

Observe that the objective function \(\text {Min}\sum \limits _{i=1}^{n} v_{i}\) was introduced in order to transform our system of equations into an optimization problem. Intuitively, this function reaches its minimum when the number of non-zero addends is minimal and so we expect to get solutions that are EFMs. But it is easy to modify the proposed function to obtain a whole set of functions with the same behaviour (we can use, for example, \(f(\overrightarrow {v})=\sum \limits _{i=1}^{n} \lambda _{i}v_{i}\) for any set of positive numbers *λ*_{i}).

A straightforward approach to the extraction of EFMs is to use a computer program based on linear programming. This program essentially consists of a number of iterations of the Simplex method. Algorithm 1 shows the typical structure of the program used. The main loop is iterated at least as many times as the number of solutions needed (*N*).

A seed is an input for the LP solver that is encoded in the constraints section of the problem. It can be said that a linear-programming-based extraction method is basically a strategy to produce relevant seeds.
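Algorithm 1 itself is not reproduced here; the loop it describes can be sketched as follows, assuming a hypothetical `seed_generator` callable and using `scipy.optimize.linprog` as the Simplex-style solver (both assumptions of ours, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def extract_pathways(S, seed_generator, N, tol=1e-9):
    """Sketch of the main loop: one LP solve per seed, collecting the
    support (set of active reactions) of each distinct solution."""
    n = S.shape[1]
    solutions = set()
    for _ in range(N):
        seed = seed_generator(n)                  # positive constraints only
        bounds = [(1, None) if i in seed else (0, None) for i in range(n)]
        res = linprog(np.ones(n), A_eq=S, b_eq=np.zeros(S.shape[0]),
                      bounds=bounds, method="highs")
        if res.status == 0:                       # skip infeasible seeds
            solutions.add(frozenset(np.flatnonzero(res.x > tol)))
    return solutions

# Linear toy network => A -> B => : every pathway uses all three reactions
S = np.array([[1, -1, 0],
              [0, 1, -1]])
sols = extract_pathways(S, lambda n: {0}, N=3)
```

With a constant seed the three iterations return the same support, illustrating why the seed-composition strategy, not the number of iterations, determines the diversity of the output.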

### Seed generation: infeasibilities

There are two different issues that can appear when using seeds. First, different seeds can produce exactly the same solution (as is well known). Hence, producing a large set of different seeds is not enough to obtain a large set of different pathways. It is not easy to choose seeds that yield new pathways, and the difficulty increases as more solutions are computed.

On the other hand, the chosen seed can lead to a problem that has no solution at all. That is, when a certain set of reactions is forced to be in the solution and another set is forbidden, contradictory conditions are sometimes imposed. Suppose, for example, that a reaction *r*_{1} produces a metabolite *m* that can only be eliminated by another reaction *r*_{2}. If *r*_{1} is forced to be part of the solution but *r*_{2} is not allowed, it is impossible to get a solution satisfying the steady-state condition (*m* would be produced but could not be eliminated). A seed is called infeasible if its associated linear programming problem has no solution.

To avoid infeasibilities, the set of seeds has to be restricted to positive ones, that is, seeds that only determine which reactions are required to be part of the solution, not which ones are forbidden.
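The *r*_{1}/*r*_{2} example above can be reproduced directly with an LP solver: forcing the producing reaction while fixing the consuming one at zero yields an infeasible program. A minimal sketch in Python using `scipy.optimize.linprog` (our choice of tooling; the two-metabolite matrix is invented for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Metabolite m is produced only by r1 and consumed only by r2.
# Rows: a source metabolite and m; columns: r0 (=> a), r1 (a -> m), r2 (m =>).
S = np.array([[1, -1,  0],
              [0,  1, -1]])

# Mixed seed: force r1 to be active (lower bound 1) and ban r2 (fixed at 0).
bounds = [(0, None), (1, None), (0, 0)]
res = linprog(np.ones(3), A_eq=S, b_eq=np.zeros(2),
              bounds=bounds, method="highs")
```

The solver reports infeasibility (`res.status == 2` in SciPy's convention), since *m* would accumulate and the steady-state balance cannot hold.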

### Seed generation: repetitions and representativeness

Another frequent problem of a set of solutions, after running a long extraction experiment, is the under- or over-representation of some parts of the metabolic network. This phenomenon is associated with some kind of affinity of the seed generator, or with a lack of dispersion or randomness in its design.

Each iteration requires its own seed, which is incorporated as part of the final computed solution. Building a seed consists of choosing which reactions are included in the constraint and, therefore, in the particular solution of the LP constrained with that seed. Defining constraints different from the previous ones does not guarantee different solutions. As the Simplex algorithm is deterministic, the method used to compose the seeds is critical to obtaining a significant set of solutions.

Since the seed induces the final solution, the corresponding seed generator is also responsible for the quality of the final set of solutions. It seems clear that the generation of a new seed should be influenced by the previous ones, so as to avoid obtaining the same EFM twice. For example, graph-affinity-based approaches use the graph adjacency of reactions to build seeds, which in some way ensures a kind of minimality by forcing the connectivity of the solution. The downside is that this adjacency can cause over-representation of a given subgraph to the detriment of the global picture. It is also worth noting that sometimes the same problems are due to the exhaustion of the method's ability to generate different seeds.

A completely random seed generator can be considered the simplest and fairest seed generator. It could be a module of any extraction method that produces constraints for a linear program at every iteration. This is a neutral way to generate seeds that frees us from recording previously produced seeds in order to avoid repeated ones. Algorithm 2 shows how it works.
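Algorithm 2 is not reproduced here; a purely random positive-seed generator can be sketched as below (Python, with hypothetical names of our own). Each call samples a fixed number of distinct reaction indices uniformly, with no memory of earlier seeds:

```python
import random

def random_seed(n_reactions, size, rng=random.Random(42)):
    """Purely random positive seed: sample `size` distinct reaction
    indices uniformly, keeping no record of previously produced seeds."""
    return set(rng.sample(range(n_reactions), size))

# Five independent seeds of three reactions for a 100-reaction network
seeds = [random_seed(100, 3) for _ in range(5)]
```

Fixing the RNG state (here seeded with 42) keeps experiments reproducible while preserving the generator's neutrality.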

### Frequencies

Pathway and EFM extraction methods incorporate strategies to avoid repeated solutions as much as possible. The frequency distribution of the reactions present in the solutions is a characteristic of each extraction method. Let *N* be the number of iterations performed in one optimization problem and *O*_{i} the number of occurrences of reaction *r*_{i} in the experiment. The frequency of occurrence of each *r*_{i} is \(F_{i} = \frac {O_{i}}{N}\). A typical frequency distribution resulting from one experiment is shown in Fig. 1.
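The frequencies \(F_{i} = O_{i}/N\) can be computed directly from the supports of the obtained solutions; a small sketch (a helper of our own, not the paper's code):

```python
from collections import Counter

def reaction_frequencies(solutions, n_reactions):
    """F_i = O_i / N: the fraction of the N solutions in which
    reaction i appears (each solution given as its support set)."""
    N = len(solutions)
    counts = Counter(i for support in solutions for i in support)
    return [counts[i] / N for i in range(n_reactions)]

# Three solutions (as supports) over a 4-reaction network
sols = [{0, 1}, {0, 2}, {0, 1, 3}]
F = reaction_frequencies(sols, 4)
```

Here reaction 0 appears in every solution (frequency 1), reaction 1 in two of three, and reactions 2 and 3 in one each.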

The frequency distribution is somehow conditioned by the extraction method, but other characteristics can influence it as well: for example, intrinsic properties of the biological network (its size or its coupling relations), or even the number of runs performed in an experiment.

It is reasonable to expect similarity between two sets of frequencies obtained in two separate experiments. So, if the frequencies are almost the same, we can expect that they express properties of the network and not of the chosen extraction method. This can open a new path to understanding why some reactions appear more often than others in the set of solutions.

### Penalizing the objective function

Depending on structural properties of the network or on the magnitude of the stoichiometric coefficients, solvers can tend to include the same reactions in the final solution in almost every case, which can lead to repeated solutions independently of the chosen seed.

In this work, a *penalization* scheme based on previously obtained results is proposed in order to increase the number of different solutions and/or to obtain them faster. Our aim is to force the solver to avoid (when possible) reactions that are present in previously computed solutions, in order to get new ones.

The proposed strategy consists of modifying the objective function with a set of positive *weights* {*w*_{i}}, formulating it as \({\sum \nolimits }_{i=1}^{n} w_{i}\cdot v_{i} \) instead of \({\sum \nolimits }_{i=1}^{n} v_{i} \). These weights may vary depending on the number of previously computed solutions containing each reaction, so that over-represented reactions get a higher weight. Since we are minimizing the function, the solver will tend to avoid reactions with higher weights whenever possible.

Observe that this strategy is not equivalent to the use of negative seeds. The latter can lead to infeasibilities, while the use of weights only discourages the use of reactions that have appeared before but does not ban them. Therefore, different sets of weights may provide different solutions even when the same seed is used.

It is interesting to study the influence of a set of weights on the solutions. To do so, the first step is to analyse whether the use of weights has a real impact on the set of solutions obtained. Once we are sure that our approach modifies the set of solutions, a second question is how to choose the weights in order to obtain sets of solutions with the desired properties: the number of repeated solutions should be as small as possible and the set should be representative (as many different solutions as possible).

To answer these questions, we have to compare the outcomes of different experiments by applying different statistical tools. In this work, we have used the Wilcoxon signed-rank test [25] and a test based on the *chi-square* statistic.

Initially, all the objective function weights equal 1 (i.e., Eq. 3). We aim to modify some of these weights to obtain different solutions from the LP. In particular, we increase the weights of the fluxes associated with the reactions that have higher frequencies in the set of obtained solutions. This may bias the solutions towards reactions not yet considered in the set of obtained solutions, thus increasing the diversity of the extracted EFMs.

We must avoid weights equal to zero, because that would imply not taking the corresponding reaction into account. Therefore, we choose a minimum weight equal to 1, common to all fluxes (reactions). We propose to use as the weight for a reaction *r*_{i} the number

$$ w_{i} = 1 + p \cdot F_{i} $$

where *p* is called the penalization of the method, and *F*_{i} is the frequency of appearance of the corresponding reaction in the previous set of solutions. Equation 4 shows the new linear program with the penalties included.

$$\begin{array}{*{20}l} \text{Minimize} \quad & \sum\limits_{i=1}^{n} w_{i} \cdot v_{i} \\ \text{subject to} \quad & S \cdot \vec{v} = \vec{0} \\ & v_{i} \geq 0 \qquad \qquad \;\; \forall r_{i} \in R \\ \text{where} \quad \quad \;\; & w_{i} = 1+ p \cdot F_{i} \quad \forall r_{i} \in R \end{array} $$

(4)

As can be deduced, the higher the frequency *F*_{i}, the higher *w*_{i} becomes and, above a threshold, the optimizer is induced to consider other, less penalized reactions (if the structural network properties allow it). The proposed modification of the objective function weights should result in differences between the frequencies of a non-penalized experiment (*F*_{i}) and those of a penalized one (*F*^{′}_{i}). Algorithm 3 extends the more general Algorithm 1 by adding the modification of the objective function and the computation of occurrences and frequencies for each reaction in the set of obtained solutions.
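To illustrate Eq. 4 in action, the sketch below (Python with `scipy.optimize.linprog`; the toy network, the frequencies, and the helper names are our own illustrative assumptions) penalizes a route that appeared in every previous solution, and the solver switches to an alternative, unpenalized route:

```python
import numpy as np
from scipy.optimize import linprog

def penalized_weights(F, p):
    """Eq. 4: w_i = 1 + p * F_i, so frequently seen reactions cost more."""
    return 1.0 + p * np.asarray(F)

# Toy network with two alternative routes from A to B:
# r0: => A, r1: A -> B, r2: A -> B (alternative), r3: B =>
S = np.array([[1, -1, -1,  0],
              [0,  1,  1, -1]])

# Suppose r0, r1, r3 appeared in every previous solution and r2 never did
F = [1.0, 1.0, 0.0, 1.0]
w = penalized_weights(F, p=10)            # weights [11, 11, 1, 11]

bounds = [(1, None), (0, None), (0, None), (0, None)]  # positive seed {r0}
res = linprog(w, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
```

With equal weights the solver could pick either route; the penalty makes the previously unused reaction r2 the cheaper one, so the new solution routes its flux there.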

### Comparing sets of solutions

Being able to compare different pathway extraction experiments is required in order to measure the effectiveness of our penalization strategy. Differences in the sets of obtained frequencies *F* between penalized and non-penalized experiments would indicate that the penalization approach has an impact on the obtained solutions.

Let *F* and *F*^{′} be the resulting frequencies of two different experiments; the question is how to measure the possible differences between them. In other words, our objective is to determine whether or not the two samples were selected from populations with the same distribution. Observe that we cannot compare these two samples with a standard *chi-square* test because the values *F*_{i} are not independent in general. As mentioned before, there are often structural dependencies between reactions and metabolites that can force two related frequencies to be the same. Thus, we have chosen another statistical test, the Wilcoxon signed-rank test [26].

To analyse the differences between *F* and *F*^{′} we can use the well-known statistic (denoted by \(\overline {\chi }^{2}\)) that comes from the *chi-square* test (Eq. 5).

$$ \overline{\chi}^{2} = \quad \sum\limits_{i=1}^{n} \frac{{\left({F'}_{i} - F_{i}\right)}^{2}}{F_{i}} $$

(5)

As in the usual *χ*^{2} test, \(\overline {\chi }^{2}\) provides a good measure of the differences between the values of *F* and *F*^{′}, even if we cannot assure that it follows the *chi-square* distribution. That is, greater values of our statistic mean a greater difference between the corresponding experiments, but we cannot use it to assign a probability with which to assert that this difference is statistically significant.
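Eq. 5 is straightforward to compute. The sketch below (a helper of our own) additionally skips terms with *F*_{i} = 0 to avoid division by zero, an implementation choice the text does not specify:

```python
def chi_bar_squared(F, F_prime, eps=1e-12):
    """Eq. 5: sum over reactions of (F'_i - F_i)^2 / F_i.
    Terms with F_i == 0 are skipped to avoid division by zero
    (our implementation choice, not specified in the text)."""
    return sum((fp - f) ** 2 / f
               for f, fp in zip(F, F_prime) if f > eps)

# Two frequency vectors over three reactions:
# only the last two reactions changed frequency between experiments
stat = chi_bar_squared([0.5, 0.25, 0.25], [0.5, 0.10, 0.40])
```

For these values the statistic is (0.15)^2/0.25 + (0.15)^2/0.25 = 0.18; larger values signal a larger difference between the two experiments.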

This test measures the differences between frequencies taking into account all the possible factors that could be causing them: the seed generator, the LP solver, the coupling relationships, the size of the GSMN, the number of iterations of the experiment, and the penalization. Fixing all the factors except the penalization and choosing a seed generator as neutral as possible, we can study the impact of the penalization on the difference between our non-weighted experiment and the weighted ones. Higher values of *p* are expected to yield higher values of the statistic, which would mean that the behaviour of our extraction method changes.