S-system representation of gene expression dynamics
Identification of the regulatory network from time series gene expression data first requires modeling the dynamic evolution of the individual genes constituting the network. Here we model gene dynamics as a set of coupled non-linear ode following the S-system formulation, which captures the non-linearity in gene expression profiles using a power-law kinetic representation.
For a system with N-genes, the S-system model can be represented using equation (1
is the concentration of the gene i, α and β represent the kinetic rate constants, g and h represent the kinetic orders for the production and degradation terms, respectively, and n is the total number of species in the system, in this case total number of genes in the network. In this work, we are using a modification of the above equation by assuming that species degradation follows a first order kinetics of the corresponding species and independent of other species (h
= 1 for i = j; 0 otherwise). While being relevant to biological systems , this assumption also reduces the unknown parameters from 2n(n + 1) to n(n + 2).
Network Identification Algorithm
Our network identification algorithm is primarily based on the hypothesis of sparsity of network connections governing biological systems. Hence our overall objective is to determine the sparsest network which can satisfactorily capture the observed network dynamics. Following this idea, the network identification problem is formulated as an optimization problem with the objective of promoting sparsity given the constraint of maximizing predictive capacity. Such problem definition results in a bi-level optimization problem, where the constraint itself is an unconstrained optimization problem. In the current formulation using S-system to model the gene expression level (equation (1
)), the kinetic orders (g
) are decomposed into two parts: binary part, λij
, which determines the existence of the connection; and continuous part, ρ
, representing the nature and strength of interaction for an existing connection. A value of 1 of the binary variable λij
would indicate the presence of the corresponding connection
, while value of 0 indicates its absence. These binary variables are optimized in the upper level which results in an integer programming problem. For each chosen network in the upper level, the connections are sent to the lower level, where corresponding ρ
are optimized to maximize network prediction and hence minimize deviation of the network predictions from the observables. The lower level essentially optimizes both strength (magnitude) and nature (sign) of the existing connections (ρij
, reactions orders) as well as the strengths of the production and degradation rate constants (α
respectively). Hence it results in a continuous non-linear programming problem where the objective is to minimize the deviation of the predicted profiles from experimental data in a least square sense. A constraint of tolerance (tol
) is imposed on this minimized error which defines the maximum allowable deviation in prediction. The mathematical formulation of the network identification problem in its entirety is shown in equation (2
= binary variable
experimental and predicted gene expression levels, respectively
= kinetic rates constants of ith gene's production and degradation, respectively
= kinetic orders of production and degradation, respectively
nstep = number of time points
n = number of genes constituting the network
m = number of experimental time points
In the above formulation ∑λ represents the total number of network connections, minimizing which will promote sparsity in the network. The upper level integer programming is solved using combinatorial optimization techniques since combinatorial approach is known to handle L
minimization problems more efficiently than approximation algorithms . Of them, evolutionary algorithms are particularly efficient in finding a good approximate solution for combinatorial problems . In this work, we have used genetic algorithm (GA) for solving the integer programming problem, while the lower level non-linear programming problem is solved using a standard least square optimization routine.
GA is typically designed to handle unconstrained optimization problems. One technique for constraint handling in GA is by penalty function, where the constraint is conditionally incorporated in the objective function. For conditions violating the constraint the objective function is penalized, and not so otherwise. In the current formulation the constraint is incorporated in the objective function using the following modification of the objective function:
A significant advantage of the bi-level formulation is that it allows optimum utilization of experimental data by sequentially reducing the number of unknown parameters in the lower level. In a conventional least-square parameter estimation problem, the connectivity is fixed and includes all possible network connections. Therefore, the size of the identifiable system is restricted, governed by the availability of experimental data points so that number of unknown parameters is less than the number of data points. For instance, a single level algorithm, using the above S-System formulation, would be restricted to less than m-3 genes. However, in the current bi-level formulation, this restriction is relaxed. Because the number of network connections are first reduced in the upper level, the number of genes to be analyzed is not so restricted, with the only constraint coming from the connectivity:
Hence the constraint is imposed on the maximum number of binary variables assigned in the upper level, but does not constrain the total size of the analyzed network. Moreover, our primary objective being sparsity of network connections, the formulation essentially tries to minimize the number of connections assigned to 1. Hence, except for the very initial phase of GA evolution, the constraint defined in equation (2) typically does not become active, and never so in the final optimal solution.
Identification of Robust Networks
Real world data typically contains noise due to experimental uncertainty and system stochasticity. Biological data are particularly notorious for its inherent heterogeneity and stochasticity . Hence it is important to explicitly account for data variability in order to increase confidence in the predicted network. In the presence of large experimental repeats it may be possible to determine robustness of identified network by repeatedly solving the network identification problem at each of the experimental data sets and analyzing the connections which are heavily repeated. However, drawing statistically significant inference would necessitate a large data set which is impractical and infeasible.
An alternative to actual experimental repeats is to use bootstrapping. The purpose of this statistical technique is to estimate the distribution of the estimator around the unknown true value θ. However, instead of achieving this with a large number of individual replicates, bootstrapping utilizes resampling of the data. In this way, a large number artificial data sets can be generated from a limited number of experimental repeats. For each bootstrap run, data samples are randomly chosen, with replacement, from the empirical distribution, with the size of each artificial set being the same as the experimental set (e.g. if the experimental set has 20 data points, so would the bootstrap set). For each bootstrap, the estimators (e.g. mean, variance, or, as in the case of the current work, regression parameters) are calculated, and with sufficient number of resampled data sets, relevant statistical information, including confidence intervals, can be estimated [26, 33].
In our algorithm, we are dealing with limited experimental data. Hence, following the above methodology, we generate a large artificial data set by repeated resampling of the limited experimental repeats. Once the bootstrapped samples are obtained, the network identification algorithm previously described is applied to all bootstrap data sets to identify a network corresponding to each. The network sets thus obtained is further analyzed to determine the frequency of occurrence of each connection in the entire set of identified networks. We hypothesize that frequent occurrence of network connections in the bootstrap samples indicate the insensitivity of the corresponding network to experimental noise, and hence claim that connection to be robust.
In order to quantify the quality of prediction of the proposed algorithm the measures of recall
are used, calculated as:
Where: TP (True Positive) denotes the number of connections correctly captured; FN (False Negative) denotes existing connections which are not captured in the identified network; and FP (False Positive) denotes connections which are incorrectly captured in the identified network. Following the above equation: a low value of recall would indicate a more conservative estimate which is unable to capture many of the existing connections; a low value of precision will indicate prediction of incorrect connections not appearing in the actual network; and a value of 1 will indicate perfect network identification. The flow diagram of the overall network identification algorithm is shown in Figure 1.