### Envirome-guided projection to latent pathways (PLP)

According to the definition of elementary flux modes, the universe of infinite flux solutions,

*r*, of a given metabolic network operating in steady state can be described as a non-negative weighted sum of a finite number of elementary flux modes,

*e*
_{i} (e.g. [

12,

14,

15]):

Elementary modes thus represent unique and non-decomposable flux solutions, from which all possible system solutions can be obtained through the proper determination of elementary modes weighting factors, *λ*
_{i}. Several methods were proposed to calculate weighting factors, *λ*
_{
i
}, from flux vectors, r. (e.g. [27–29]). As noted by Wiback et al. [27], the reconstruction of a particular flux vector according to Eq. (1) is not unique if the number of elementary modes is higher than the number of fluxes, which is generally the case. Unique solutions can however be obtained with additional assumptions about the biological system, such as the maximisation of number of elementary modes [28], or minimisation of the values of weighting factors to select the elementary modes that are closest to the actual biological state [29]. In the present study we also calculate a unique solution for the weighting factors *λ*
_{
i
}although under completely different criteria. We set the weighting factors *λ*
_{
i
}as linear functions of envirome variables, x(t), which represents additional measured information in our method that is not used in other methods. The objective function in our case consists of:

i) maximising correlation between *λ*
_{
i
}and envirome variables (or equivalently between r(t) and x(t)), and

ii) minimising redundancy, i.e. eliminating all elementary modes with weak correlations with the envirome.

The basic idea is that the genome sets the structure of the elementary flux modes, while the relative weights of elementary flux modes are to a large extent set by the environment.

According to the above criteria, the PLP algorithm was designed to maximise the covariance between an input data matrix, X = {x(t)}, of independent envirome observations, x(t), and an output response matrix, R = {r(t)}, of observed steady state reaction rates, r(t), under the constraint of a plausible set of elementary flux modes. This problem can be expressed mathematically by the following constrained linear program:

with X = {x(t)} a np × nx matrix of np independent observations of envirome vectors x(t) (dim(x) = nx), R = {r(t)} a np × nr matrix of np independent observations of reaction rates, r(t) (dim(r) = nr), E = {e_{i}} a nr × nem matrix of nem elementary flux modes, e_{i} (dim(e_{i}) = nr) (for the BHK metabolic network, E is the 24 × 251 matrix given in the Additional File 1), Λ = {λ(t)} a np × nem matrix of weighting vectors, λ(t), of elementary flux modes (dim(λ) = nem) and *C* a nem × nx matrix of regression coefficients.

Maximising covariance implies two concomitant goals, namely maximising variance and minimising redundancy. Unconstrained maximisation of covariance can be achieved by the very popular projection to latent structures method, also known as partial least squares (PLS). Since PLP is derived from PLS, in the lines below we first review the structure of PLS and then show how it can be modified to PLP.

PLS is a multivariate linear regression technique that maximises the covariance between an input matrix,

*X*, called the predictor, and an output response matrix, which here is the target rate data,

*R*. It differs from traditional multivariate linear regression in that it decomposes iteratively both the predictor and the response matrices into a reduced set of uncorrelated

*latent variables*, thereby eliminating redundancy in the input and output data sets. More specifically, the matrices

*X* and

*R* are decomposed into loadings matrices (

*W* and

*Q*), scores matrices (

*T* and

*U*) and residuals matrices (

*E*
_{
X
}and

*E*
_{
R
}):

The columns of the loadings matrices

*W* and

*Q* are the latent variables in which the input and output matrices are decomposed, respectively. Additionally, the outputs scores matrix U is linearly regressed against the inputs scores matrix T:

where

*B* holds the regression coefficients determined by minimising the residuals

*E*
_{
U
}. Finally, the explained variance in target data (which in our case is flux data, R) by the model is given by the following equation:

with indexes *i* and *j* denoting observation and flux indexes respectively. Solving system Eqs. (3a-c), which imply minimising residuals E_{X}. E_{R} and E_{U}, can be effectively performed by the very popular Non Iterative Partial Least Squares (NIPALS) algorithm. For more details about PLS and NIPALS see, for instance, Geladi and Kowalski [44].

PLP can be viewed as a constrained version of PLS by the inclusion of the elementary flux modes constraints. Given their equivalent mathematical structures, an interesting comparison may be made between projection of target reaction rate data, R, onto latent variables Q (performed in PLS according to Eq. (

3b)), and projection of flux vectors onto elementary modes weighting factors according to Eq. (

1). Indeed, elementary flux modes

*e*
_{
i
}can be viewed as PLS latent variables equivalents, while the elementary modes weighting factors

*λ*
_{
i
}can be interpreted as PLS latent variables score values. Following this analogy, a straightforward modification to Eq. (

3b) can be introduced

by substituting PLS loading matrix, Q, by the elementary modes matrix, E. This means that the latent structures *E* are no longer degrees of freedom in PLP as *Q* was in PLS. Instead, *E* represents a constraint to the relationship between X and R imposed by the metabolic network structure and ultimately by the genome of the cells.

An advantage of PLP over PLS is that both the latent variables and score values of the target matrix have in PLP a physical meaning. The latent variables are latent pathways while scores U are equivalent to Λ, i.e. they represent the relative weighting factors of latent pathways. For this reason, the algorithm is called projection to latent pathways. Moreover, the regression coefficients B can be used to deduce the functional enviromics matrix, FEM, as follows:

FEM is a nx × nem matrix comprising the regression coefficients of elementary flux modes against envirome components, thus providing information of how elementary flux modes are up- or down-regulated by envirome components.

The PLP algorithm was coded in MATLAB (Mathworks, Inc) as a modified version of the NIPALS algorithm {Geladi, 1986 #15}, wherein the loadings of the outputs are fixed *a priori* according to the elementary modes structure, *E*. The calculation of the elementary modes is not automatically integrated in PLP. For that we used the METATOOL 5.0 [13] as explained below.