Inferring biochemical reaction pathways: the case of the gemcitabine pharmacokinetics

Background The representation of a biochemical system as a network is the precursor of any mathematical model of the processes driving the dynamics of that system. Pharmacokinetics uses mathematical models to describe the interactions between drug, and drug metabolites and targets and through the simulation of these models predicts drug levels and/or dynamic behaviors of drug entities in the body. Therefore, the development of computational techniques for inferring the interaction network of the drug entities and its kinetic parameters from observational data is raising great interest in the scientific community of pharmacologists. In fact, the network inference is a set of mathematical procedures deducing the structure of a model from the experimental data associated to the nodes of the network of interactions. In this paper, we deal with the inference of a pharmacokinetic network from the concentrations of the drug and its metabolites observed at discrete time points. Results The method of network inference presented in this paper is inspired by the theory of time-lagged correlation inference with regard to the deduction of the interaction network, and on a maximum likelihood approach with regard to the estimation of the kinetic parameters of the network. Both network inference and parameter estimation have been designed specifically to identify systems of biotransformations, at the biochemical level, from noisy time-resolved experimental data. We use our inference method to deduce the metabolic pathway of the gemcitabine. The inputs to our inference algorithm are the experimental time series of the concentration of gemcitabine and its metabolites. The output is the set of reactions of the metabolic network of the gemcitabine. Conclusions Time-lagged correlation based inference pairs up to a probabilistic model of parameter inference from metabolites time series allows the identification of the microscopic pharmacokinetics and pharmacodynamics of a drug with a minimal a priori knowledge. In fact, the inference model presented in this paper is completely unsupervised. It takes as input the time series of the concetrations of the parent drug and its metabolites. The method, applied to the case study of the gemcitabine pharmacokinetics, shows good accuracy and sensitivity.

http://www.biomedcentral.com/1752-0509/6/51 biochemical interactions between the drug, or in some cases its metabolites, and their molecular targets. The drug metabolites and the drug targets are components of a network of interactions. This network is usually represented by a graph whose nodes are the system's components (drug, metabolites and targets) and the edges between nodes link two biochemically interacting components. The directions of the edges specify the directions of the chemical reactions and the sign of the edges denotes the inhibitory or activatory effect of one component on another.
To infer the directional signed edges of the network it is common to use experimental time series of the concentration of the components of the systems, i.e. of the nodes of the network we want to infer. The network inferred from the species dynamics will include edges between those nodes whose time series exhibit sufficient functional connectivity, typically defined as a measure of coupling exceeding a predetermined threshold [1]. The same time series used to infer the network can be used also to estimate the rate constants that parametrize the interactions between the biochemical species. The preliminary network can be fitted to the experimental time series to eventually detect those edges whose kinetics is null, and that can be removed from the network.
In accordance to several studies of the late 1990s that indicated that poor understanding of pharmacokinetics properties and toxicity were important causes of costly late-stage failures in drug development [2], it has become appreciated the ability to infer and calibrate pharmacokinetic models, whose usefulness is now widely recognized through all the process of drug design and clinical development [3].
Our study proposes a method to deduce both reactions pathways and kinetic parameters of biotransformations of drugs from time series data of measured concentrations of metabolites. Determining the pathways in chemical reaction networks from time series data has been an active area of research over a decade. A comprehensive review of available techniques can be found in [4][5][6]. Unlike in the most part of network inference techniques, in our approach a probabilistic model of parameter inference is part of the procedure of the identification of interconnection topology. The integration of parameters estimation into a method of topology identification results in a well performing network inference algorithm. The parameter estimation works as a sort of model discriminator selecting only the interaction responsible for the kinetic of the system. In this way, the whole procedure is able to recover a good approximation of the true network topology, especially for metabolic networks. In this study, we focus on gemcitabine metabolic pathway. Gemcitabine is an oncological drug used in various carcinomas: non-small cell lung cancer, pancreatic cancer, bladder cancer and breast cancer [7]. It is being investigated for use in esophageal cancer, and is used experimentally in lymphomas and various other tumor types. The gemcitabine reaction pathway has been recently investigated [7] and new observational data provided by the experiments of Veltkamp et al. [8] contribute to elucidate the cellular pharmacology of gemcitabine. Therefore, the gemcitabine pathway is a good case study for our inference algorithm.
The majority of the literature about application of inference techniques in pharmacokinetics and pharmacodynamics modelling concerns the parameter estimation, rather then the identification of the interaction network structure. Moreover, the majority of the methods of parameter estimation fits continuous pharmacokinetics models to discrete time measurements of the drug's kinetic and/or dynamic response. In pharmacokinetics, the first methods of inference were applied almost thirty years ago to the clinical problem of dose regimen [9]. They were mostly Bayesian methods.
More recently other parameter inference techniques and parameter fitting procedures have been developed. As a result, many software tools implementing them were developed and provided to the scientific community of modelers and pharmacologists (e.g. maximum likelihood inference techniques [10], parameter fitting by nonlinear mixed [11][12][13] effects are only two recent examples). Now, the application of inference and fitting procedures for parameter estimation has been extended to all phases of drug development [14], and it is no longer limited to the dose regime scheduling. In pharmacokinetics and pharmacodynamics, the parameter estimation techniques are applied to infer or fit two categories of parameters. The first category includes parameters for which a priori information is available, e. g. organ blood flows, organ volume fractions, receptor density, cellular signalling protein turnover, whereas the second category includes parameters with no a priori information, e. g. intrinsic clearance, transport and binding coefficients, or parameters representing specific properties of the drug such as partition coefficient.
The development of inference methods that deduce from time discretely observed concentrations of metabolites, both the network of the biotransformations and mechanisms of action as well as the kinetic parameters of these interactions is a new challenge of a new paradigm in drug design and discovery: the network pharmacology introduced by A. L. Hopkins [15]. Network pharmacology is grounded on network biology, that conceives a biological systems as a network whose nodes are the system's components and the edges between nodes indicate the occurrence of interactions between the nodes. In the network-centric view of a biological system, the time evolution of biological systems, entities and processes is the result of the occurrence of the http://www.biomedcentral.com/1752-0509/6/51 interactions among the system's components. The dynamics of the network is governed by the kinetic parameters of the interactions. In most of the cases these parameters are not directly measurable and a priori knowledge is unavailable. Often, subjective prior beliefs on the parameters are difficult to be converted into a mathematically formulated model and prior. All these considerations motivated our work and our attempt to develop inference methods which are robust against noise, efficient in computation and flexible enough to meet different constraints.
The network inference method we propose is inspired to the recent works of M. Samoilov et al. [16] and A. Arkin et al. [17] and has been tailored for pharamcokineticspharmacodynamic modelling. It takes as an input the time series data of the concentrations of the parent drug and its metabolites. A prediction of the reaction network is deduced from time-lagged correlation functions of two chemical species at a time, obtained from concentration measurements. These functions are converted into interspecies distances whose analysis and visualization on a spatial domain yields the reaction pathway of the reacting system. To calibrate the network, i.e. to estimate the parameters, we developed an innovative probabilistic model of inference of the rate coefficients [18]. The tool implementing the parameter inference of a model of biochemical network, is the software KInfer [18]. KInfer is a software (free for non-commercial purposes) available for download from the software web page [19]. The only inputs required by KInfer are the list of chemical equations, or alternatively a generalized mass action, and the experimentally measured time-series of the reagents that are known to be involved in the system. Principal features of the tool are: automatic generation of generalized mass action model from the chemical reactions involved in the system; automatic estimation of the initial guesses and bounds for the parameter values; estimate of the propagation of the experimental errors from the input data to the parameter estimates; estimation of the level of noise in the input data. No a priori knowledge about parameter distribution is required by KInfer. A schematic overview of the main steps of the network identification method presented in this paper is shown in Figure 1. All the steps will be described in detail in the rest of the paper.
The paper outlines as follows: (i) we first present the model of inference based on the estimation of the timelagged correlation functions; (ii) we review the parameter inference methods of KInfer; (iii) we describe the in vitro cytotoxicity, cellular uptake, efflux, biotransformation, and nucleic acid incorporation of the gemcitabine and present the experimental data used in this study; (iv) we show and discuss the inferred network of biotransformations and mechanisms of action of gemcitabine as obtained by our inference algorithm. Finally, (v) we end up the paper with some final conclusions and plans of future works.

A time-lagged correlation based network inference
The time-lagged correlation is a measure that is related to the Pearson correlation coefficient but that takes into account shifts in time, i.e. lags, between the expression of the causal effector and the target module. The key-idea of correlation-based methods for network inference is that for a data set comprising time series profiles of N species, X i (t) for i = 1, . . . , N, the correlation matrix of the N(N − 1)/2 independent pairwise correlation coefficients can be used to cluster the data set into groups of species within which correlations between species are high, when compared to pairwise correlations between different groups. These groupings can most easily be discerned by calculating a matrix of pairwise distances, d ij from the correlation matrix, whereby d ij = 0 for two species which are completely positively correlated and increases as the pairwise correlation coefficient decreases.
The distance matrix can subsequently be analyzed to find, and visualize, patterns of proximities between species. A large family of techniques for the analysis and visualization of proximity data coming from similarity/distance matrix is available in literature. Many of these techniques refer to methods of data clustering, and, more recently, to generalized methods of graph splatting such as layout algorithms and multidimensional scaling algorithms [20][21][22][23][24]. The purpose shared by all these techniques is to reveal and, possibly, visualize, patterns of similarities and topological structures underlying the data. In this study we use a multidimensional scaling method. Multidimensional scaling transforms a distance matrix into a set of coordinates such that the distances derived from these coordinates approximate as well as possible the original distances. This way, the multidimensional scaling techniques are used in information visualization for exploring similarities or dissimilarities in data. In fact, the visualization of the similarities/distances among data facilitates the interpretation and the analysis of complex network, and, for this reason, they are currently becoming popular in bioinformatics [17,25,26] in network inference [16,17] and network analysis problems. We will describe later in the paper the details of the multidimensional scaling we used for this study.
Very often in practical situations the influence of one species on another takes some finite amount of time to propagate through the network, and the ordering of responses to impulse stimuli reveals information about network connectivity. In particular, this will be evident in the time series if the time interval t between concentration measurements is smaller than the characteristic response timescales for the network. Two time series which have a low correlation coefficient may in fact be http://www.biomedcentral.com/1752-0509/6/51 strongly correlated if a time lag is allowed between the data points for the two species. Therefore, according to Arkin et al. [17] and Samoilov et al. [16], the covariance function between X i and X j is . . , m} (here,X i denotes the observed value of the concentration of species i-th); p(X i (t), X j (t + τ )) is the pair distribution function, corresponding to the density of points on a scatter plot of X i (t) and X j (t + τ ), and m is the number of measurements. The pair distribution function gives the density of points in the rectangle dX i × dX j on the plot (X i , X j ). τ is a delay (time lag) introduced to detect correlations that otherwise could be non-detectable. The delay τ can be estimated in the following way. Consider can be calculated with the Stineman procedure [27] on the curve interpolating experimental timeseries of species i at time point t k . We assume that and The values of τ in (2) range over the interval of the rate limitingness across the entire reaction network. Now, since generally the analytical expression of p(X i (t), X j (t + τ )) cannot be obtained, the calculation of the integral in Eq. (1) can be performed only switching from a continuous to a discrete domain, so that the integral can be approximated by a sum and, taking a time average over all of the measurements, we obtain a covariance matrix depending only on τ , as follows In order to estimate the pair distribution function p μν , Samoilov et al. [16] proposed to divide the space of the phase plane into rectangles of varying size so that the distribution of points is uniform in each rectangle. The algorithm developed by A. Fraser [28] is the most used http://www.biomedcentral.com/1752-0509/6/51 procedure to perform such a partition of the phase plane X i -X j . The pair distribution density can then be estimated as (6) where N μν is the number of points in the particular rectangle labeled μ, ν, N tot is the total number of points and A μν is the area of the rectangle. We propose here a different solution to the problem of the pair distribution function. Instead of diving the phase plane into rectangle of variable size, we propose a Voronoi tessellation of the space, following the results of the recent study of M. Browne [29] and Q. Du et al. [30]. This division of the space according to point proximity leads to region boundaries being straight lines, bisecting and running perpendicular to the line connecting the Delaunay neighbors. Boundary points are equidistant to exactly two sites, and vertices are equidistant to at least 3. Neighboring points are points whose associated Voronoi regions share a common boundary. Thus, Voronoi tessellation generates a clustering of the points in the phase plane that in a good approximation satisfies the requirement of homogeneity for the distribution of points inside a cell, and p μν in Eq. (4) can be calculated as follows where V μν is the Voronoi cell μν.
Once, the covariance matrix has been calculated, the time-lagged correlation matrix R(τ ) can be calculated according to the definition in Eq. (8) [16,17].
Finally, from the correlation matrix we calculate a distance matrix D whose elements are defined in Eq. (9). where is the maximum absolute value of the correlation between two species with a time lag τ . The distances are used to find the connections between the different species in the system. Namely, the distances measure the relatedness of the time series describing the time behaviour of the species; the more related the more likely that two species are connected by a single reaction. The greater the distance between two species the more likely that two species are connected by several intermediate reactions or they are not connected at all.
Inspired by [16,17,31,32], we analyzed the distance matrix elements with a multidimensional scaling algorithm. A multi-dimensional scaling algorithm starts with a matrix of species-species distances. Then it assigns a location, i.e. the spatial coordinates to each item (species) in a D-dimensional space, where D is specified a priori. We used the Kruskal-Shepard multidimensional scaling [20]. This scaling is defined in terms of minimization of a cost function called stress function which is a measure of lack of fit between distances d ij and distances ||x i − x j ||. The stress is a residual sum of squares: so that Kruskal-Shepard scaling is also known as leastsquares scaling [31]. This scaling seeks values of the coordinates x 1 , x 2 , . . . , x n ∈ R D to minimize S. For a given value of D the estimation of the coordinates is performed in such a way that the pairwise distances are preserved as well as possible. The choice of D is arbitrary in principle, but low in practice: D = 2; 3 are the most frequently used dimensions, for the simple reason that the points serve as easily visualized representors of the species. A downhill simplex (amoeba) algorithm is used to minimize S D [33,34]. In the application domain of this study, the downhill simplex method turned out to be more efficient that the gradient descendant minimization method that is usually exploited in the Kruskal-Shepard scaling algorithm.
In this study we reported the 2D and the 3D visualizations of the networks. Nevertheless, a set of Euclidean distances on p points can be represented exactly (i.e. with S D = 0) in at most p − 1 dimensions. An insufficient number of dimensions is not the only cause of non zero stress. It may be caused also by random measurements errors in the input data. In such cases, even if the "true" number of dimensions of the problem were known, this would not guarantee that the stress corresponding to that number of dimensions is null [35][36][37]. Unfortunately, in most datasets the true dimensionality of the problem is not known in advance, like in the case treated in this study. The commonly advocated procedure for determining the dimensionality is a heuristic one of seeking a sharp drop or "elbow" in the rate of decline of stress as dimensionality increases. In practice such elbows are rarely obvious. In fact is has often noted that the pattern of change of stress versus dimensionality, rather than having an elbow, is characterized by a smooth and gradual decline [38]. As reported by M. D. Lee in [38], although there are at least two variants [39,40] of the multidimensional scaling that attempt to determine automatically the number of dimensions of the spatial representation they derive, there is not a rigorous and principled basis for this determination.
However, it not necessary that a multidimensional scaling representation has zero stress in order to be informative and useful. A certain amount of distortion is tolerable http://www.biomedcentral.com/1752-0509/6/51 [37]. The amount of stress to tolerate can be derived from the accuracy of the input time series. The measurement error on the input data propagates to the distance matrix entries and, consequently, to the estimate of the stress function in the following way. For convenience, we introduce the following notation Then, we apply the rules of error propagation [41] to find the error on the covariance estimate given in Eq. 1.
The relative error on C i j(t, τ ) is where We assume that the estimate of the relative error of the density p ij (t, τ ) is such that (15) and the absolute error on the time average of C ij (t, τ ) is where m is the number of records in the time series. From the Eq. (8), the absolute error on the correlation coefficient r ij is Now, applying the rules of error propagation to the Eq. (9), we obtain that the absolute error on the distance d ij is where, as per the definition in Eq. (10) with δr ij (τ ) as in Eq. (17). Therefore, in a given dimensionality D we can consider tolerable a non-zero value of the stress function if This condition can be satisfied for more than one nonnull value of the dimensionality D. If the condition is satisfied for D = 2, 3, we will use these values as input to the multidimensional scaling procedure and the network can be visualized. In particular, D = 3 is selected if, moving from two to three dimensions, we have a significant reduction in stress, i.e. if where δS D is the error affecting the value of the stress function because of the error δd ij on the distance d ij . δS D is given as in the following If D ≤ 3, once the coordinates of the species in Ddimensional space have been calculated from the distance matrix, we fix a threshold on the multidimensional-scaling estimated distances to establish when two species interacts through a single reactions or not. The threshold is calculated from the histogram of the distances, and it is set equal to the average of the values on the x-axis corresponding to the absolute maxima of the histogram (see Figure 2). If D > 3, the multidimensional scaling procedure can be skipped and the original distance matrix can be directly analyzed and thresholded.
Then, we derive the undirected graph representing the network: species whose distance do not overcome the threshold undergo a biochemical interaction. In order to determine directions of the edge we need to infer a temporal ordering of the reaction events. Having knowledge of the temporal sequence of the reaction means knowing whether a perturbation of one species follows or proceeds that of another species. As suggested by Arkin et al. [17] the temporal ordering of variation in each of the variables can be assigned in the following way. If the time series for a given species has a maximum correlation at negative lags compared to a reference time series, then that species receives the input signals after the reference series, and vice versa. Similarly, if the two series are maximally correlated at zero lag but correlation tails to negative (positive) lags, variation in the given species closely follows (precedes) variation in the reference species. In this study, we do not consider negative lags, but without any loss of generality, we consider only positive lags and look for cause-effect relationships on a positive temporal scale.
In the next section we will describe the procedure of inference of the kinetic parameters of the interactions among the nodes of the network.

Inferring the kinetic parameters with KInfer
The tool we recently developed to infer the kinetic parameters is KInfer (Kinetics Inference). It takes as an input the experimental time series of each reactants in the systems and a set of chemical reactions that are supposed to occur among the species of that system. Therefore, we apply KInfer to the network model we inferred with the timelagged correlation based inference procedure described in the previous section.
We report here the key statements of the theory implemented by the tool and refer the reader to the references [18,42,43].
Consider N reactant species, with concentrations X 1 , X 2 , . . . , X N , that evolve according to a system of rate equations established by the generalized mass action law.
where θ i , i = 1, 2, . . . , N , is the vector of the rate coefficients, which are present in the expression of the function f i ; α w ∈ R, and N i is the number of parameter in the f i rate equation.
We wish to estimate the set of parameters = {θ i } (i =  1, 2, . . . , N), whose element θ i is the set of rate coefficients appearing in the rate equations of i-th species, therefore is the vector of concentrations of chemicals that are present in the expression of the function f i for the species i. We assume we have noisy observationsX i = X i + at times t 0 , . . . , t M , where ∼ N (0, σ 2 ) is a Gaussian noise term with mean zero and variance σ . With this choice we are assuming that the concentration measurements are not significantly affected by systematic errors, but by uncontrolled random errors and that an error is equally likely to occur in either positive or negative direction with respect to the symmetry axis of the distribution. We also assume a number M of concentration measurements for each considered species. Approximating the rate Equation (21) as a finite difference equation between the observation times, gives (22) where k = 1, . . . , M. In Eq. (22) the rate equation is viewed as a model of increments/decrements of reactant concentrations; i.e., given a value of the variables at time t k−1 , the model can be used to predict the value at the next time point t k . Increments/decrements between different time points are conditionally independent by the Markov nature of the model (22). Therefore, given the Gaussian http://www.biomedcentral.com/1752-0509/6/51 model for the noise, the true value of X i (t k ) is normally distributed around the observed valueX i (t k ), so that Therefore, the probability to observe a variation for the concentration of the i-th species between the time t k−1 and t k , given the parameter vector θ i is and where X (i) is the sample space of X (i) , and K i is the number of chemical species in the expression for f i .
While the increments/decrements are conditionally independent given the starting point X i (t k ), the random variables D i (t k ) are not independent of each other. Intuitively, if X i (t k ) happens to be below its expected value because of random fluctuations, then the following increment D i (t k+1 ) can be expected to be bigger as a result, while the previous one D i (t k ) will be smaller. A simple calculation allows us to obtain the covariance matrix of the vector of increments for the i-th species. This is a banded matrix C i ≡ C = Cov(D i ) with diagonal elements given by 2σ 2 and a non-zero band above and below the diagonal given by −σ 2 . All other entries of C are null.
The likelihood for the observed increments/decrements therefore will be where 1, 2, . . . , N), and m i (t k−1 ) ≡ E f i (X(t k−1 ), θ i ) . http://www.biomedcentral.com/1752-0509/6/51 Eq. (26) can be optimized w. r. t. the parameters = (θ 1 , θ 2 , . . . , θ N ) of the model to yield estimates of the parameters themselves and of the noise level. The inferred parameters are then used as multiplicative coefficients for each element of the correlation matrix, so that we obtain a weighted correlation matrix, reflecting the physical interaction in the network.
This procedure makes null the element of the correlation matrix if the rate constant multiplying this element is a null kinetic rate constant or if this rate constant is affected by a relative error of 50%.

Metabolism and mechanisms of action of gemcitabine
Gemcitabine (2 ,2 -difluorodeoxycytidine, dFdC) is an anticancer drug, which is effective against solid tumours, including non-small-cell lung cancer and pancreatic cancer. Gemcitabine is an anticancer nucleoside analog in which the hydrogen atoms on the 2 -carbon of deoxycytidine are replaced by fluorine atoms. As with fluorouracil and other analogues of pyrimidines, the triphosphate analogue of gemcitabine replaces one of the building blocks of nucleic acids, in this case cytidine, during DNA replication. The process arrests tumor growth, as only one additional nucleoside can be attached to the "faulty" nucleoside, resulting in termination of DNA replication and ultimately leading the cells to apoptosis.
In this study we refer to the recent model proposed by Veltkamp et al. [8]. The model is depicted in Figure 3. The gemcitabine is transported into cells by equilibrative and concentrative nucleoside transporters. Then, it is phosphorylated by deoxycytidine kinase (dCK) to become the gemicitabine mono-phosphate (dFdC-MP). dFdC-MP is phosphorylated to its active diphosphorylated (dFdC-DP) and triphosphorylated (dFdC-TP) forms with the intervention of nucleoside monophosphate kinase (NMPK) and nucleoside diphosphate kinase (NDPK), respectively. The triphosphate metabolite (dFdC-TP) competes with the natural nucleoside triphosphate for the incorporation into the DNA and blocks cells in the early DNA synthesis phase. Gemcitabine is also rapidly metabolized by cytidine deaminase to 2 ,2 -difluorodeoxyuridine (dFdU), which can be further phosphoriylated to its disphosphate (dFdU-DP) and triphosphate (dFdU-TP) whose activity has been recently associated with the cytotoxic effect of the drug [8].
In this study we used the time series data on the concentration of the gemcitabine and its metabolites published by Veltkamp et al. [8] and reported in Figure 4. The concentration of the following metabolites has been measured at four time points (0, 4, 12, and 24 hours): extracellular dFdC (dFdCout), intracellular dFdC, intracellular dFdC-MP, dFdC-DP, dFdC-TP, dFdU, dFdU-MP, dFdU-DP, and dFdU-TP.

Results
We first applied the algorithm of network inference to deduce some of the biotransformations of gemcitabine from the experimental time series of metabolite concentrations available in [8]. The algorithm can infer the reactions between the measured species, therefore, since the experiments in [8] measured the concentrations of dFd-Cout, dFdC, dFdCMP, dFdCDP, dFdCTP, dFdU, dFdUMP, dFdUDP, dFdUTP, the reactions we expect to infer are only those among these chemical species.
The Figures 5 (A) and 5 (B) show the undirected unsigned graph representing the network of interactions between the systems components, in a 2D and 3D space respectively The maximum value of the distance at which two species are still connected by an edge has been estimated to be 0.8. The interval ranged by τ has been calculated by the formulas (2) and (3) and is [0, 6] hours.
The analysis of the pairwise time-lagged correlation C ij (τ ) allows to deduce the directions of the edge of the graph. To facilitate the understanding of this analysis and its results, we visualized on a plot this correlation metric. Figures 6, 7, and 8 are heatmap representations of the behavior of the correlation C ij (τ ) for each possible pairwise combination of the biochemical species. In Figure 6, the reference species are dFdCout (A) and dFdC (B), i.e. the extracellular and the intracellular gemcitabine, respectively. In Figure 7, the reference species are the mono-(A), di-(B) and tri-phosphate (C) of dFdC. In Figure 8, the reference species are dFdU (A), dFdUMP (B) and dFdUDP (C), i.e. the intracellular 2 ,2 -difluorodeoxyuridine and its mono-and di-phosphate. A linear increment of the correlation is obtained in almost all the cases in the observed time interval. This behavior of the correlation indicates that most of the reactions are sequential and reversible. Species connected by a chain of sequential reversible reactions maintain a correlation significantly different from zero during all the observation time. Moreover, as we can see in Figures 4 (A) and 4 (B), these species appear to be connected by an edge if they do not directly interact. The linear increment of the correlation for almost all the couples of species makes not easy to dissect the chain of sequential reactions and distinguish between direct and indirect interactions. Table 1 reports the values of the lags corresponding to the maximum and minimum value of the correlation. In this table we see that the extra-cellular concentration of dFdC (dFdCout) is maximally anti-correlated to all the other species after 6 hours, whereas the minimal value of the correlation is obtained for τ = 0. This results is consistent with the behavior of the experimental time series in Figure 4, and is explained by the time resolution of the measurements. Although the concentration of dFdCout rapidly decreases within the first 5 hours, the increment of the intracellular dFdC slowly increases and reaches the maximum http://www.biomedcentral.com/1752-0509/6/51 around t = 12 hours. Therefore the correlation between dFdCout and dFdC and between dFdCout and the phosphorilated metabolites of dFdC and dFdU is detected only after 6 hours (see the first part of Table 1). Conversely, the correlation between dFdCout and dFdC, as well as the correlation between dFdCout the phosphorilated metabolites of dFdC and dFdU reaches a minimum at τ = 0 (see the first part of Table 1). Table 1 also shows that dFdC and its phosphorilated metabolites are maximally correlated at τ = 0. dFdC is maximally correlated to the dFdU metabolites at τ = 0 as well (see third part of Table 1). Similarly the correlation between the metabolites of dFdC     and dFdU has a maximum at τ = 0 (see parts 2, 3, 4, 5, 6, 7 of Table 1). Although these results are consistent with the dynamics measured in the experiments (see Figure 4), they do not reflect the expected kinetics in the first 4 hours. In fact, the kinetics of uptake of dFdC is expected to be faster than the one observed in the experiments. In particular, the rate of increment of the intracellular concentration of dFdC is expected to be proportional to the decrement of its extracellular concentration within the first 4 hours. We argued that the time resolution of measurements of the concentration of the metabolites does not allow to detect the dynamics of the concentration in the first 4 hours. As a consequence, the observed slow increment of dFdC in spite of the rapid decrement of its external concentration, and the observed peaks of the concentration of dFdC metabolites after 6 hours should be the reason why the maximum correlation between dFdCout and dFdC and its metabolites is reached at 6 hours. In order to demonstrate this guess, and to check the capability of our algorithm to correctly detect the connectivity between the biochemical species, we simulated a mass action model of the metabolism and mechanisms of action depicted by the cartoon in Figure 3. The aim of this experiment is to validate the capability of the inference procedure to correctly identify the simulated model, i.e. to detect all the interactions specified in the model. The model parameters have been obtained by fitting the model to the experimental time series of Veltkamp et al. [8]. The best-fitting parameters have been estimated with a downhill simplex technique [34] and provide the dynamics showed in Figure 9. Unlike the experimental observation, the simulation of time behaviour of the internal dFdC has a maximum within the first 4 hours (Figure 9 (A)). Although this maximum is expected, it has been not detected in the experiments because of the coarse-grain time resolution of the records.
We used the time series of dFdCout, dFdC, dFdCMP, dFdCDP, dFdCTP, dFdUout, dFdU, dFdUMP, dFdUDP, dFdUTP generated by the simulation of the model as input to the inference algorithm. The simulated time series contain 200 points regularly spaced in the interval [0,20] hours (see the curves in Figure 9). Unlike the set of real data, this set contains also a time course of the extracellular concentration of dFdU (dFdUout).
The undirected unsigned graph represented in two and three dimensions are shown in Figures 10 (A) and 10 (B), respectively. Similarly to Table 1, Tables 2 and 3 list the time lags corresponding to the minimum and the maximum of the correlation C ij (τ ), and Figures 11,12,and 13 show the heatmap representation of the behavior of the correlation as function of τ . A monotonic behavior of the correlation vs the time lag is observed for almost all the pairs of the species. This indicates again the presence of chains of reversible reactions involving the species. An elementary single reaction is inferred whenever the maximum correlation between the reference species and another potentially interacting species occurs at a time lag τ equal to the average value of the τ s at which a given species exhibits the maximum correlation with the other. In this experiment τ = 7.27.
For more clarity, we did not indicate the directions of the edges on the graph in Figures 10 (A) and 10 (B), but we show the directions of each inferred interaction in Table 4. The third column of the table tags with the letter "E" the expected reactions (i.e. the reactions depicted in the cartoon of Figure 3), and with the letter "U" those unexpected and incorrect, and with the letter "P" those reactions that are unexpected as direct reactions but that are plausible. In fact, two species could appear to be significantly The comparison of this set of reactions with those observed in experiments and depicted in Figure 3 reveals the presence of false positive in the inferred reactions set. The calibration of the inferred model with KInfer allowed to detect the null kinetics, i.e. those reactions whose rate constant k is null or it is affected by an error k equal or greater than the estimated value. The cases in which the ratio between the estimate of the parameter value and the estimate of its error is equal or greater than one, are interpreted as noise and not as a biochemical kinetics governing the time behavior of the species concentration. The calibration of this inferred model with KInfer gives as nonnull kinetics the reaction showed in Table 5. In this set of reactions, R3, R4, R10 and R15 are also included, although in these reactions the interaction between the reactants are mediated by intermediate reactions. Reaction R30 has been inferred, but it is not expected at all.
According to reaction R3, the decrement of dFdCout is directly proportional to the increment of dFdCMP via a single reaction consuming dFdCout and producing dFdCMP. However, the model includes also the uptake of external dFdC into the cell and its excretion, i.e. dFdCMP. The estimates of the rates obtained with KInfer for these reactions are: k 1 = 7.11, k 2 = 0.8, k 5 = 0.43 and k 6 = 1.1. k 1 is one order of magnitude greater than k 5 , and similarly k 6 is one order of magnitude greater than k 5 . As consequence, since (i) the uptake of dFdCout is much faster than the phosphorilation reaction, and (ii) the dephosphorilation of dFdC is much faster than the excretion from the cell [7,8], the correlation-based inference model detects a significant correlation between dFdCout and dFdCMP through the reactions R3 and R4. Nevertheless, the algorithm is able to detect the uptake and the excretion of the dFdC (Reactions R1 and R2).
Similarly, the algorithm infers also reaction R15: → dFdU. dFdCDP and dFdU are connected through a pathway depicted in Figure 14 (A) (dotted arrows). The variation of the concentration of dFdU is correlated to the variation of the concentration of dFdC and dFdUMP. In turn, the variation of the concentration of Finally, the correlation-basednetwork inference algorithm does not detect the reaction dFdCMP → dFdU that is reported in the cartoon of Figure 3, and later in the Table 6 (i.e reaction E13). Table 5 reports also Reaction R10: dFdUout k 10 → dFdC. According to the current knowledge, the interaction between dFdUout and dFdC is mediated by the pathway illustrated in Figure 14 (B). Therefore reaction R10 is not correct, because a decrement of the extra-cellular concentration of dFdU does not cause an increment of the Table 4 Set of reactions inferred by the time-lagged correlation inference method from synthetic time series data: we denote with the letter "E" the expected reactions, and with a letter "U" those unexpected and incorrect, and with the letter P those reactions that are unexpected but that can be easily explained and are plausible in terms of correlation; the reaction dFdCMP → dFdU (i.e. reaction E13 in Table 6 The third column of the table denote with the letter "E" the expected reactions, and with a letter "U" those unexpected and incorrect, and with the letter "P" those reactions that are unexpected but that can be easily explained and are plausible in terms of correlation. The reaction dFdCMP → dFdU (i.e. reaction E13 in Table 6) has been not detected. http://www.biomedcentral.com/1752-0509/6/51

Figure 14 (A) Pathway connecting dFdCDP and the intra-cellular concentration of dFdU (dotted arrows). (B)
The interaction between dFdUout and dFdC is mediated by the pathway illustrated in this figure. Therefore the reaction R10 (see Table 5) is a false positive inferred reaction.
intracellular concentration of dFdC. The opposite is plausible: a decrement of the intracellular dFdC is proportional to an increment of the intracellular dFdU, that can be released outside the cell. Finally, Table 5 reports reaction R30, that is not correct, because, as we can see in Figure 3, the metabolites of phosphorylated metabolites of dFdU are not converted in any metabolites of dFdC. The reactions R11, R19 and R20 tagged as "expected" by the correlation-based inference procedure and reported in Table 4 have been cut off by the network calibration procedure, and thus they do not appear in Table 5.
Summarizing, from the model in Figure 3 and the list in Table 6, we see that 19 reactions are expected. The network inference algorithm paired to the parameter inference algorithm of KInfer correctly inferred 12 reactions. It inferred two false positive, i.e. R10 and R30. Moreover, it missed Reactions R19 and R20: they are expected, but they have been not detected. We introduce two parameters to evaluate the performance of the procedure: the sensitivity and the accuracy. The sensitivity is defined by the ratio between the number of detected edges and the number of expected edges. The accuracy is defined as the ratio between the number of correctly detected edges and the number of detected edges. The number of expected edges is the numer of expected reactions, that is 19 (see Table 6). According to these definitions we have sensitivity = Nr. of detected edges(the "plausibles" included) Nr. expected edges Our procedure on the case study of gemcitabine metabolism outperforms the algorithm in [44] that has been designed for the determination of chemical reaction network interconnections from time series data too. However, we point out that the sensitivity and the accuracy defined in this way are global measures that average over the whole network giving two values as representative of the performance of the inference algorithm. A better assessment of the inference algorithm could be conducted if we disposed either of real biological but smallscale data. A comparison of these performances with the performances of other state-of-art tools is still in progress. Preliminary results of this analysis show us that the outcome of the network inference varies between tools and can be complementary. Moreover, each tool is tailored to a specific biological pathway and experimental set up. For this reason mainly, in the scientific community there is no general consensus that would have been reached declaring one network inference method as gold standard [45]. Table 5 The set of reactions for which the ratio between the estimate of the rate constant and the estimate of its error is equal or greater than one http://www.biomedcentral.com/1752-0509/6/51 Table 6 The set of expected real reactions (from the cartoon of Figure 3) Reaction

Conclusions
We presented a method for the prediction of the interactions within reaction networks from experimental time series of the concentration of the species composing the system. The overall aim of our investigation was to develop a procedure of system identification in which the parameter inference acts as tool of model refinement and elimination of false positive interconnections. Both the network topology inference and the parameter estimation have been designed to predict reaction pathways in chemical kinetic systems from noisy time resolved concentration measurements.
We developed a method that does not require any a priori information about the topology of the interconnections. It consists of two parts: the estimation of time-lagged correlation function between the components of the systems for the determination of direct edges between the components and the inference of the kinetic parameters to establish real dynamics and to cut "null dynamics" edges instead. The method has been developed to deducing the connectivities of chemical species, the reaction pathway, and the reaction mechanism of complex reaction systems. The procedure can process more than 3,000 reactions involving hundreds of species. In particular, our methodology coupling the connectivities inference to the parameter inference reveals to be suitable to infer reactions mechanisms of complex systems as drug metabolic network, where many feedback loops and reversible reactions are present. The data driven estimation of the time lag interval facilitates the use of the tool and minimize the a priori information to input. Good performances have been obtained on the test case of gemcitabine metabolism, that is supported by recent experimental studies. Finally, this work highlights five main aspects determining a sensitive and accurate inferability of a biochemical network with time-lagged correlation-based methods: (i) the availability of replicates of experiments, so that averaged time series can be used to soften the noise in the input data; (ii) the availability of data having a time resolution suitable to capture the time of onset of the correlation between the species and the number of chemical species that can be measured simultaneously; (iii) the availability of local measures on the level of individual edges and subnetworks, instead of having only global measures of time series averaging over the whole network dynamics [45]; (iv) the effects of measurement error, and (v) the presence of unmeasured species. These issues have to be addressed before adopting inference methods in the daily laboratory practice for any specific problem.