Inference of dynamical generegulatory networks based on timeresolved multistimuli multiexperiment data applying NetGenerator V2.0
 Michael Weber^{1},
 Sebastian G Henkel^{2}Email author,
 Sebastian Vlaic^{1},
 Reinhard Guthke^{1},
 Everardus J van Zoelen^{3} and
 Dominik Driesch^{2}
DOI: 10.1186/1752050971
© Weber et al.; licensee BioMed Central Ltd. 2013
Received: 6 August 2012
Accepted: 15 December 2012
Published: 2 January 2013
Abstract
Background
Inference of generegulatory networks (GRNs) is important for understanding behaviour and potential treatment of biological systems. Knowledge about GRNs gained from transcriptome analysis can be increased by multiple experiments and/or multiple stimuli. Since GRNs are complex and dynamical, appropriate methods and algorithms are needed for constructing models describing these dynamics. Algorithms based on heuristic approaches reduce the effort in parameter identification and computation time.
Results
The NetGenerator V2.0 algorithm, a heuristic for network inference, is proposed and described. It automatically generates a system of differential equations modelling structure and dynamics of the network based on timeresolved gene expression data. In contrast to a previous version, the inference considers multistimuli multiexperiment data and contains different methods for integrating prior knowledge. The resulting significant changes in the algorithmic procedures are explained in detail. NetGenerator is applied to relevant benchmark examples evaluating the inference for data from experiments with different stimuli. Also, the underlying GRN of chondrogenic differentiation, a realworld multistimulus problem, is inferred and analysed.
Conclusions
NetGenerator is able to determine the structure and parameters of GRNs and their dynamics. The new features of the algorithm extend the range of possible experimental setups, results and biological interpretations. Based upon benchmarks, the algorithm provides good results in terms of specificity, sensitivity, efficiency and model fit.
Keywords
Generegulatory networks Network inference Heuristic algorithm ODE NetGeneratorBackground
For the adaptation of biological systems towards external and environmental stimuli usually a complex interaction network of intracellular biochemical components is triggered. That includes changes in the gene expression at both the mRNA and protein level. Considering a certain stimulus as an input signal to the system and mRNA or protein levels as outputs, the connecting network may include interactions between signal transduction intermediates: transcription factors and target genes. Generally, the term “generegulatory network” (GRN) summarises genetic dependencies, which describe the influence of gene expression by transcriptional regulation,[1].
The inference (elucidation) of GRNs is important for understanding intracellular processes and for potential manipulation of the system either by specific gene mutations, knockdowns or by treatment of the cells with drugs, e.g. for medical purposes. Towards a full understanding in terms of a complete network, partial models of the network give intermediate results which help to refine the knowledge and to design new experiments. Still, many generegulated cellular functions, e.g. stem cell differentiation, depend on more than one stimulus and the crosstalk within the GRN, e.g.[2]. On the other hand, the stimuli might influence distinct components of a GRN. Such biologically relevant dependencies can be investigated by applying two or more stimuli and measuring the influenced genes. This approach can be called multistimuli experiment. If this is carried out in two or more separate experiments, one derives multistimuli multiexperiment data. Only algorithms with the ability to consider those data can infer such dependencies.
As shown in review articles, e.g.[1, 3, 4], there are different inference methods using various sources of information thus leading to different results. Amongst the typically mathematical models the application of differential equations describing timeresolved gene expression data (“time series”) has been proven successful. Unfortunately the potential complexity of the networks leads to a high number of structural connections and parameters in contrast to the comparably small number of available measurement data. Apart from the problem of identifiability, the number of possible parameter combinations is very large, thus resulting in high computational costs. Therefore, appropriate heuristic approaches can reduce this amount while providing comparably good inference results. NetGenerator is a heuristic algorithm, which considers time series data to automatically infer GRNs influenced by an external stimulus,[5] and[6]. The approach combines a structure (network topology) and parameter optimisation. The final result in form of a differential equations model can be simulated and displayed graphically. An earlier version with less functionality was applied successfully to biological problems, e.g. the regulatory network of iron acquisition in Candida albicans and the analysis of the Aspergillus fumigatus infection process,[7] and[8].
In the present article, we propose NetGenerator V2.0, an extended version of the algorithm which enables the use of multistimuli multiexperiment data, thus increasing the number of addressable biological questions. This causes significant changes in the algorithmic procedures, especially the processing of this kind of data as well as the structure and parameter optimisation. Also, some other updated features will be outlined, for example the different modes of prior knowledge integration, further knowledgebased procedures, options of graphical outputs, changed nonlinear modelling and reimplementation in the programming language / statistical computing environment R,[9]. Further, in comparison to the previous version, some of the algorithmic procedures will be explained in more detail, because they are important for understanding the overall method.
The successful application of the novel NetGenerator will be shown by inference of relevant multistimuli multiexperiment benchmark examples, namely systems with a different degree of crosstalk. Two aspects will be assessed: (i) reproduction of the benchmark systems (data and structure) and (ii) refinement / extension of a network structure by combination of different experimental data. Furthermore, the applicability of NetGenerator to a realworld problem is presented: after describing necessary data preprocessing steps, the underlying GRN of chondrogenic differentiation of human mesenchymal stem cells, a process influenced by the two stimuli TGFbeta1 and BMP2, is inferred.
Methods
In the following subsections the necessary background knowledge and methodology of the NetGenerator algorithm is described. In comparison to previous publications this includes new, updated and more detailed algorithmic procedures. First, the motivation and the goals are defined by considering the biological data. Necessary steps of data preprocessing are also explained within this subsection. Subsequently, ordinary differential equations and some of their properties are presented as a means for modelling the dynamics of gene regulatory networks. Then the heuristic approach of the algorithm is explained including the structure and parameter identification (here: optimisationbased determination). The next important topic will be the consideration of prior knowledge, followed by a subsection about the numerical simulation as well as the representation of modelling and graphical results. Finally, some important options and their influence to the algorithm are presented.
Time series data and preprocessing
Microarray preprocessing applies multiple procedures to remove nonbiological noise from the data and to estimate gene expression levels. Custom probesets, as assembled by[10], reduce the number of crosshybridising probes. This initial reduction accomplishes a onetoone correspondence between probeset and gene. Background correction, normalisation and summarisation are provided by the RMA package,[11], resulting in logarithmised gene expression estimates, which can be used for the next processing step.
Gene selection (“filtering”) is the important second step of processing, since reliable network inference is only feasible for a sufficient number of measurements per gene[1]. This number is often limited and therefore a selection of genes for modelling is inevitable. Candidate genes should show pronounced temporal dynamics and significant differences compared to the control group. Statistical methods for identification of differentially expressed genes are widely used for gene selection. We use the LIMMA tool, which can operate on time series data determining significance of gene expression changes over time[12]. The statistical test (moderated tstatistics) operates on contrast terms, defined by subtracting the control group at each time point. LIMMA returns a ranked table for all genes containing columns for gene name, foldchange and adjusted pvalues. Differentially expressed genes are selected by a combination of adjusted pvalue cutoff and foldchange criterion.
Time series standardisation is the last processing step including centering and scaling of each time series. The centering procedure subtracts the original initial value at the starting time point from all values such that the transformed time series starts from zero. In the subsequent scaling procedure each time series is divided by its maximum (absolute value) across all provided experimental data. This leads to genewise scaled data and gene expression time series varying within −1and 1. The resulting data provided to the NetGenerator algorithm are stored in${\underset{=}{X}}_{e}$ and${\underset{=}{U}}_{e}$, i.e. matrices for the time series (output) and stimuli (input) data, respectively, for all experiments e = 1,…,E. Therefore, the dimensions are${\underset{=}{X}}_{e}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}{T}_{e}\times N$ and${\underset{=}{U}}_{e}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}{T}_{e}\times M$ with T_{ e } being the number of experimental time points, N being the number of time series and M being the number of inputs. Furthermore, NetGenerator provides the option of introducing additional artificial data points by cubic spline interpolation.
GRNs considered as linear timeinvariant systems
with the vector of state variables$\underset{}{x}$ and their changes$\underset{}{\stackrel{\u0307}{x}}$ as a function$\underset{}{f}$ of state variables, input vector$\underset{}{u}$ and parameter vector$\underset{}{\theta}$. The state variables and inputs depend on time t, the independent variable, that is dropped in further equations. The description is valid for a certain time range starting at t_{0} from the initial conditions for the state variables${\underset{}{x}}_{0}$. If not stated otherwise each of the state variables corresponds to one specific output variable, i.e. one time series. The dimensions of the variables are$\underset{}{x}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}N\times 1$,$\underset{}{u}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}M\times 1$, and$\underset{}{\theta}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}P\times 1$, with N being the number of state variables, M the number of inputs and P the number of parameters.
with the system or interaction matrix$\underset{=}{A}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}N\times N$ and the input matrix$\underset{=}{B}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}N\times M$. Most important for the understanding of the biological systems properties and the heuristic approach of the NetGenerator algorithm is the system matrix$\underset{=}{A}$ and its elements a_{i,j}, i,j ∈ N, because they describe the dynamics and the coupling of state variables.
Under the assumption, that the behaviour of a GRN is described sufficiently by indirect transcriptional events and not by a conversion of material, activation (a_{i,j} > 0) or inhibition (a_{i,j} < 0) of state variable x_{ i } is not changing the value of the originating state variable x_{ j }.
Without any further assumptions all elements of$\underset{=}{A}$ and$\underset{=}{B}$, adding up to N^{2} + M·N, had to be determined. Typically, in GRNs there are far less connections than theoretically possible leading to a sparse matrix$\underset{=}{A}$. Regarding this property and avoiding problems occurring by the number of usually available measurement data (parameter identifiability, local or unique solutions, computational effort) the NetGenerator algorithm applies a heuristic approach as described in the next subsection.
Heuristic multistimuli multiexperiment approach
The novel NetGenerator algorithm is a heuristic multistimuli and multiexperiment approach. The heuristic is based on the observation that in GRNs the number of connections is much lower than all possible connections. Further, since the applied stimulus is the cause of the observed dynamical changes, the network can be considered as a hierarchical structure originating from the input. The NetGenerator algorithm implements both observations by an iterative development of the statespace system (2) by including coupled submodels for each time series based on a structure optimisation iteratively increasing the number of connections. Structural changes are taking place only if they result in a better adaptation of simulated to measured behaviour. The terms multistimuli and multiexperiment mean that the extended algorithm can handle more than one changed input and data of several experiments, respectively.
containing connections from state variables, N_{ i } being the indices of statestate connections including the selfregulatory term a_{i,i}x_{ i }, and connections from inputs with M_{ i } being the indices of inputstate connections for the considered state variable x_{ i }. That means that only the parameters of submodels have to be identified.
Since the algorithm aims at a low number of parameters, i.e. small N_{ i } ≤ N and M_{ i } ≤ M, the inner loop starts with basic models for the remaining time series containing only selfregulation, one input term as well as connections from “fix” prior knowledge if available, see respective subsection. Those basic structures can be extended by further incoming connections (“growing”) from already included submodels and further inputs. Every structural change requires a parameter identification of the active connections with respect to the considered time series, as will be explained later in the corresponding subsection. For every different set of parameters the resulting model needs to be simulated, that is the numerical solution of an initial value problem has to be found, as will be described later in another subsection.
The basic submodel which reproduces one of the remaining time series best, is chosen for further improvement, for details see Figure2 (B), and included into the model as a state variable. The most important structural improvements are

“Growing”: further connections added

“Higher Order”: increase submodel order

“Pruning”: connections removed
In the improvement step “growing” is not restricted to connections from time series that are already included in the model. For describing the influence of time series that have not yet been included as submodels, the corresponding measured and interpolated data are used as inputs. Those connections form global feedbacks in the final model.
In this way the dynamics of a certain submodel are described by an r th order integrator chain allowing for reproduction of processes with more complex time courses. It should be emphasised that by applying this approach the number of parameters is not increased but on the other hand the number of state variables becomes larger than the number of time series data. In that case only the state variable with the highest order in such a submodel is to be compared to time series data. Still, for the sake of simplicity all following algorithmic procedures are described for firstorder submodels.
describe forward, local feedback and global feedback connections. Elements below the main diagonal become forward connections, whereas the main diagonal elements$\frac{}{}{a}_{1,1},\dots ,{a}_{{N}_{s},{N}_{s}}$ describe local feedbacks or selfregulations, while the elements above the main diagonal represent global feedbacks. From a biological point of view the local feedbacks describe different mechanisms including not only feedback regulation, but also the important process of mRNAdegradation.
All the previously described structural procedures and the corresponding parameter identification are controlled by apriori defined settings and options of the algorithm. Some of them are balancing network complexity and error between measurement and simulation. For example, additional connections are rejected if they are not improving the objective function value to a significant extent while on the other hand connections are removed only if they are not worsening the result significantly. Further important options of the algorithm are explained in the respective subsection.
Parameter identification
with the weight matrix$\underset{=}{W}$ and the state variable derivatives${\underset{}{\stackrel{\u0307}{x}}}_{i,\mathrm{num}}$. The latter are calculated by numeric differentiation of the respective time series (output) data. This means the vector${\underset{}{\stackrel{\u0307}{x}}}_{i,\mathrm{num}}$ is not the vector of state variables but the vector of time points of the considered time series derivatives. Its length (T = K + K_{ interp }) equals the sum of the number of measurement time points and interpolation time points, as outlined in the subsection on data preprocessing. The reason for the use of interpolated data is the avoidance of overfitting. The different influence of measured and interpolated values is considered in the elements of the weight matrix$\underset{=}{W}$ possessing the dimensions T × T. Since the model must be valid for all E experiments, the respective input and time series data are concatenated, indeed resulting in$T=\sum {T}_{e}$, e = 1,…,E. This becomes possible because the regression approach implicitly assumes a “dynamic independence” of data points. The dimensions of the other variables are$\underset{=}{U}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}T\times \left{M}_{i}\right$ and$\underset{=}{X}\phantom{\rule{0.3em}{0ex}}:\phantom{\rule{0.3em}{0ex}}T\times \left{N}_{i}\right$, with the number of rows of each matrix also equalling to the total number of time points. Both dimensions of$\underset{=}{U}$ reflect necessary algorithmic changes due to the consideration of multiinput multiexperiment data in this NetGenerator version, because M_{ i } > 1 represents multiple inputs while the concatenated data of length T considers multiple experiments. For the sake of completeness it should be mentioned that higherorder submodels are initialised first by their firstorder equation and then adapted such that total time constant and static gain remain the same.
describing the deviation between measured x_{e,i}and simulated${\widehat{x}}_{e,i}$ time series at different time points t_{ k } depending on the parameter vector${\underset{}{\theta}}_{i}$. The minimisation following (9) is an optimisation problem of the least squares type featuring a double sum of experiments e = 1,…,E and time points k = 1,…,T_{e,i}. In contrast to the objective function applied in former NetGenerator versions, now E multiple experiments are considered. The simulated time series are compared to measured and also interpolated data weighed by different w(t_{ k }) avoiding overfitting. A further weighing based on properties of the data, like for example the maximal range, is not necessary since the described preprocessing normalises and scales the data. For the optimisation problem, the new NetGenerator implementation applies the “LBFGSB” algorithm,[14], of the optim Rfunction, which has the ability to solve bounded nonlinear optimisation problems.
Consideration of prior knowledge
For improving the results, prior knowledge about the network connections can be integrated into the network inference. This version of NetGenerator provides two modes for integration of prior knowledge about connections of stimuli on time series as well as between the time series: (i) “fix” and (ii) “flexible”. For both modes the knowledge can be provided in form of connection matrices${\underset{=}{A}}_{\text{fix\u2014flexible}}$ and${\underset{=}{B}}_{\text{fix\u2014flexible}}$ resembling the system matrix and input matrix, respectively, as well as additional matrices containing reliability scores of the connections. The connection matrices can contain singlevalued information about connection (1), no connection (0), activation (10) and inhibition (−10). Fix integration represents rigid model requirements that cannot be ignored by the heuristic. Therefore fix connections are always included in the model structure.
The term J_{i,output} corresponds to the previously in (9) defined evaluation of output deviation, while λ weighs the overall consideration of prior knowledge, s represent the score values of the respective prior knowledge and d describe the distances between the prior knowledge and the modelled structure (incoming connections) evaluated by comparison of signs. That means the resulting elements of$\underset{=}{A}$ and$\underset{=}{B}$ are converted into the described notation of 0, 1, 10, and −10, thus permitting a comparison with elements of flexible prior knowledge connection matrices. Here we consider two types of prior knowledge origin: (i) gene interactions automatically extracted from published literature and (ii) predicted transcription factor binding sites (TFBS) in the proximal promoter region of target genes.
For the extraction from published literature the software Pathway Studio V9 provides a gene relation database termed ResNet Mammalian, which has been compiled by automatic extraction of interactions from PubMed, as evaluated by[15]. As shown in the latter publication, gene relations derived from Pathway Studio V9 can be considered of high quality, since in general scientific literature is a reliable resource and the false positive rate is reported to be about 10 %.
Further, the tool matrixscan from the RSAT toolbox determines putative TFBS in the promoter regions of target genes, which might be involved in transcriptional regulation[16]. This approach requires known sequence motifs of the investigated transcription factors as well as promoter sequences. Sequence motifs are stored in form of position weight matrices (PWM), which describe relative nucleotide frequencies for each motif position, as can be obtained from the Transfac database (Version 2010)[17]. Gene promoter sequences are available from Ensembl using biomaRt,[18].
Additional prior knowledge about the regulatory potential of the individual genes can be obtained by examining the known molecular functions. For example, the interaction between genes coding for nonregulatory proteins, such as structural proteins, and target genes can be assigned “no connection”.
Simulation and graphical output
For every comparison of measurement and simulation as well as the generation of results the model equations (2) must be integrated. This corresponds to an initial value problem that is solved numerically. Since the recent implementation of the NetGenerator algorithm is in R, repeated operations of certain types take a long time. Therefore, the model itself is implemented in C, created iteratively and simulated applying the implicit method “impAdams” of the Rpackage deSolve,[19]. The necessary initial conditions${\underset{}{x}}_{0}=\underset{}{x}\left({t}_{0}\right)$ are either measurement data or extrapolated measurement data typically at t_{0} = 0 of the respective time scale.
The final result of the NetGenerator algorithm is a parametrised model of the considered GRN. Moreover, the new implementation of the algorithm contains important graphical output facilities which have been extended to meet the needs of displaying multiinput multiexperiment data as well as different results concerning prior knowledge. First, there is a graphical comparison of measurements and simulations, showing the single measured data points and the corresponding simulated trajectory. This can be done either by comparison of each component (gene) over all experiments or by displaying the data for each experiment independently. Second the resulting network structure can be displayed as a directed graph applying the language DOT and the software collection Graphviz,[20]. Nodes denote the biochemical components, e.g. genes, and edges display connections of either activation or inhibition. In case of applying prior knowledge (see respective subsection), a comparison between the inferred network and this knowledge is displayed with a colour code. Black edges denote inferred connections without prior knowledge, green edges present an agreement, red edges could either have a wrong sign (e.g. activation instead of inhibition) or be connections that do not comply with prior knowledge, while grey dashed edges stand for prior knowledge not reproduced in the inferred network.
Further settings and updated methods
The NetGenerator algorithm itself can be controlled by parameters (settings) and also contains further methods that will be summarised in the following. An important setting is the “allowedError” that controls the structure optimisation. If the objective function value of a certain submodel structure is worse than this value the model structure must be extended as described. Therefore smaller values of “allowedError” are indirectly leading to more complex structures. Further important settings are the maximal number of connections and submodel order.
Additional updated or new methods, not described extensively here, include nonlinear modelling and knowledgebased methods. The optional nonlinear modelling approach contains an additional sigmoid transformation of the linear combination described in this publication. This transformation has its biological background in the saturating behaviour of gene expression. The additional nonlinear parameters of each submodel are determined by the described nonlinear parameter identification, too. Amongst further knowledgebased methods, the most important presents the possibility of retrieving network information from databases and combining this information with the inferred model in a directed graph. In that way, the biological interpretation can be extended by introducing unmeasured components into the network structure.
Availability
The algorithm has been implemented as a package in the programming language / statistical computing environment R,[9]. It is available in form of a testing bundle containing both the algorithm and the examples at http://www.biocontroljena.com/NetGenerator/NetGeneratorBundle.zip
Results
Example networks
We applied the NetGenerator algorithm, which has been described extensively in the Methods section, to 3 benchmark examples and 1 realworld example to examine the performance of our approach. At first, we consider the three benchmark systems, their corresponding artificial data and inferred networks in order to test the reliability and performance of our algorithm. Particularly, we investigated whether network inference from multiple data sets, originating from different stimulation experiments, is beneficial. Finally, we applied NetGenerator to microarray time series data gained from human mesenchymal stem cells. We focussed on the modelling of gene regulation that occurs during in vitro stimulation of chondrogenic differentiation of these cells, with emphasis on the different effects triggered by multiple stimuli in the inferred model.
Benchmark examples
We constructed three fully parametrised benchmark systems based on linear timeinvariant descriptions, i.e. they are composed of differential equations representing the time series of genes and two external stimuli (u_{1} and u_{2}). The systems are characterised by a different degree of crosstalk between the components with respect to the external stimuli, that is “full crosstalk” (FCT): all components are influenced by all stimuli, “limited or low crosstalk” (LCT): some of the components are influenced by more than one stimulus, and “no crosstalk” (NCT): the stimuli influence distinct components resulting in separate networks. They also differ in the number of genes (FCT: 5, LCT: 4, NCT: 7) and the parameters. The artificial data were generated exhibiting characteristics of real microarray time series data, i.e. low number of time points (six), exponentially increasing time intervals, and additional normally distributed noise$\mathcal{N}(0,\phantom{\rule{0.3em}{0ex}}0.0{5}^{2})$. In summary, this procedure led to sample data sets containing matrices with number of rows equalling number of genes and six columns (time points).
For all three benchmark examples, we evaluated the inference by those statistical measures showing the reproduction of the system structure and time series by the model.
FCT scenarios and network inference evaluation. For FCT, artificial data generation and subsequent network inference was performed within three scenarios: (i) “S_{1}”: single experiment applying only u_{1}, (ii) “S_{2}”: single experiment applying only u_{2} and (iii) “M”: multiple experiment integrating experiments “S_{1}” and “S_{2}”. For the special case of FCT, the scenarios allowed us to directly compare the inference of multiple stimuli data sets with the inferences of single stimulus data sets.
Chondrogenesis model
Background of chondrogenic data. Human mesenchymal stem cells (hMSC) are multipotent adult stem cells that have the capacity to differentiate into a variety of cell types depending on the external stimulus,[2]. Regulation of lineagespecific genes is crucial in this temporal process,[21]. Transforming growth factor (TGF)beta1 is essential for induction of chondrocyte differentiation of hMSC, a process which is strongly enhanced by the additional presence of bone morphogenetic protein (BMP)2,[22] and[23]. In this section, we describe the complementary effects of TGFbeta1 and BMP2 by multistimuli multiexperiment inference applying the NetGenerator algorithm.
Microarray time series data. hMSC from bone marrow were commercially obtained (Lonza) and cultured as described in[2]. To induce chondrogenic differentiation trypsinised hMSC were pelleted and subsequently incubated in culture medium supplemented with 100 nM dexamethasone, 10 ng/mL TGFbeta1 and, if applicable, 50 ng/mL BMP2. Timedependent gene expression was studied under three experimental conditions: (i) following treatment with TGFbeta1 (“T”), (ii) following treatment with TGFbeta1 + BMP2 (“TB”) and (iii) untreated hMSC as a control. At 10 different time points (0, 3, 6, 12, 24, 48, 72, 128, 256, 384) h after addition of the stimuli, RNA was isolated from three technical replicates per time point and measured on Affymetrix HGU133a microarrays.
Preprocessing and filtering. Raw microarray data was preprocessed as described in the corresponding subsection. This included the use of custom chip definition files provided by[10] and application of the RMA method[11]. This procedure resulted in logarithmised gene expression estimates for 12 095genes.
Modelling a smallscale GRN using microarray data requires adequate filtering of genes. We tested all genes for differential expression, used functional annotation and expert knowledge. Differentially expressed genes were identified for both experiments (“T”, “TB”) by computing adjusted pvalues using LIMMA. All genes with an adjusted pvalue less than 10^{−10}and an absolute foldchange greater than 2 for any time point were considered significant. Using those criteria, 192 genes were found to be differentially expressed compared to control as well as between “T” and “TB”. Subsequently, we selected from this group 10 annotated transcription factors (GO:0003700, sequencespecific DNA binding transcription factor activity) and associated 5 of them (SOX9, MEF2C, MSX1, TRPS1, SATB2) with our investigated process (GO:0051216, cartilage development). Those genes may be involved in promoterdependent regulation, which is important for binding site predictions. Furthermore, we added COL2A1, ACAN, COL10A1, all three essential marker genes of chondrocyte differentiation, which encode essential structural proteins of the extracellular matrix.
Prior knowledge. Prior knowledge was taken into account as described in the corresponding subsection. Gene interactions were retrieved from the Pathway Studio ResNet Mammalian database. We obtained 6 genegene and 5 inputgene regulatory interactions. Genegene interactions were passed as flexible prior knowledge to NetGenerator. Inputgene interactions were not integrated. Additionally, potential gene interactions were determined by binding site predictions. For this purpose, we obtained PWMs for SOX9, MEF2C and MSX1 from the Transfac database and promoter sequences 1000 bp upstream from the transcription start site. Both PWMs and sequences were loaded into matrixscan from RSAT, which is performed with default options (weightscore >1, pvalue <10^{−4}) and organismspecific estimation of background nucleotide frequencies. The resulting significant binding sites have been listed in the table of Additional file3. The observed high significance of all matches minimises the risk obtaining similar results from random sequences.
For validation, we performed resampling which is based on random perturbation of time series data. A Gaussian noise component$\mathcal{N}(0,\phantom{\rule{0.3em}{0ex}}0.0{5}^{2})$ was added to the time series data which is used for subsequent model inference. Repeated performance (100×) led to a series of inference results as well as relative frequencies for each of the connections of the nominal model, i.e. the proportion of models containing that specific connection. Those frequencies imply a reliability ranking of all nominal connections. Most of the connections were inferred with high frequency, (76±24) %, see Figure8. Particularly, this applies to connections which reflect prior knowledge. Also, inferred connections which are associated with a predicted binding site (blue colour) were present in more than 50 % of the models.
Network interpretation. SOX9 exhibits a central role in this chondrogenic network and is activated by both TGFbeta1 and BMP2. This indicates a complementary effect of both stimuli on the expression of SOX9. Activated SOX9 drives the expression of its target genes COL2A1, ACAN and COL10A1[24–26]. This regulation marks the essential formation of cartilagespecific structural components of the extracellular matrix and the differentiation of hMSC towards chondrocytes. Beside this process, SOX9 also activates the repressor gene TRPS1 and vice versa. Regulatory interactions between both factors has not yet been addressed in the literature. Additionally, the SOX9 binding motif is present in the proximal promoter of the TRPS1 gene according to the prior knowledge. There is also a modelled effect from TRPS1 on MEF2C, which in turn activates COL10A1 and ACAN, but represses SOX9. This represents a negative global feedback from MEF2C on SOX9 in our model. MEF2C also represses the expression of MSX1, which is solely activated by BMP2 stimulus and activates COL2A1 according to the prior knowledge[27]. MSX1 also activates the SATB2 gene, which in turn activates MEF2C expression. Negative regulation of ACAN by TGFbeta1 is contrary to prior knowledge, as indicated by the red connection of the network graph. However, TGFbeta1 can also activate ACAN indirectly through SOX9,[24]. In summary, the central player SOX9 is influenced by both TGFbeta1 and BMP2. Essential structural proteins are not solely regulated by SOX9, but also by other transcription factors (MEFC, MSX1). Moreover, SOX9 and MSX1 are repressed by MEF2C through negative feedback that involves TRPS1 and SATB2.
Discussion
The NetGenerator algorithm for automatic network inference from multiinput multiexperiment time series data and prior knowledge, described in the methods section, will be classified and distinguished from other methods in the next subsection. Therefore, its properties will be reviewed and justified showing advantages and disadvantages to other approaches. Our discussion contains a wide spectrum of other methods, but will only go into detail for the ones closely related to NetGenerator. Also, further specifications of NetGenerator will be summarised without a detailed comparison to other methods.
Classification of the algorithm
Good review articles on methods for automatic inference of GRNs can be found in[1, 3, 4]. The different methods can be classified by the data type (static or dynamic), the mathematical approach (e.g. probabilistic vs. deterministic) and the result (e.g. undirected vs. directed graphs, algebraic correlation vs. dynamic models) whereby various combinations are possible. Mutual information methods (for a review of ARACNE, CLR and MRNET see[28]) are based on evaluating the statistical dependencies of large data sets resulting in undirected graphs. In comparison to NetGenerator they possess far different preconditions and purposes, for example they do not consider a concerted influence of the variables or the dynamics of the statespace concept, and therefore a more detailed comparison is set aside. Even though dynamical Boolean networks for generegulation, first proposed by[29], possess some similarities to discretevalued statespace models, their rulebased approaches typically lead to rather qualitative results (for an overview of recent methods see the aforementioned review articles).
Very often, like in case of the core elements of NetGenerator, GRNs are based on linear modelling, i.e. the behaviour of one variable depends on a linear combination of other variables. Still the method can be a combination of either probabilistic or deterministic approach as well as algebraic correlation modelling (equations system) or dynamic modelling (differential equations system). In the case of the probabilistic modelling which is especially covered by static and dynamic Bayesian networks (see aforementioned review articles) the inference is based on the application of probability distributions to describe the uncertainties or noise inherent in GRNs. Beside the differences in the mathematical approach, probabilistic modelling includes the determination of statistical parameters and therefore generally more data replicates are required in comparison to deterministic modelling approaches such as NetGenerator.
Deterministic linear modelling applied to automatic network inference,[30], can be distinguished into at least two types depending on the results: (i) algebraic equations systems, e.g.[31], and (ii) differential (difference) equations systems, e.g.[32]. Although they have different prepositions on the dynamics of time series data, both types can be solved by linear regression. Still, there is a disproportion between the number of free parameters and available measurement data on the one hand and the property of sparsity of GRNs on the other hand. For the former interpolated data points can be applied under the assumption that the influence of the chosen interpolation on the results can be neglected. For the reproduction of sparse networks the regression can be combined with model reduction, for example using the large group of LASSObased algorithms, see e.g.[33–36], on the basis of PCA (SVD),[37], or a combination of both,[38]. For further approaches, see the aforementioned review articles.
In contrast to all these methods, we propose the NetGenerator algorithm dealing with the problem of data number and sparsity in a different way. The algorithm is not inferring the network structure and parameters in one go. Instead we applied an heuristic approach of explicit structure optimisation, which iteratively generates a system of sparsely coupled submodels. In that way, the GRN property of possessing more or less hierarchical input to output structures is reproduced. Thus, only the parameters of submodels describing one time series have to be determined. A major drawback of regressionbased solutions of linear differential (difference) equations systems is the necessity of applying numerical derivatives of small sample size and noisy data, which have a strong influence on the resulting network and modelled dynamics. NetGenerator uses a different solution, whereby the regression just provides initial parameters for a nonlinear optimisation of an objective function of the least squares type. Overall, the final dynamic network can be obtained by a lower computational effort, because in comparison to the total number of parameters (N^{2} + M·N) in the model description (2) only a small number of parameters has to be determined.
Inference from multistimuli multiexperiment time series data
The concept of inferring from multiple data sets is also applied by[38], however on the basis of principal components of those data sets. The work of[39] provides a multiple methods framework to integrate distinct types of data like steadystate and time series data, focussing mainly on the combination of knockout and stimulation data.
The proposed NetGenerator V2.0 algorithm allows for integrating data sets of multiple experiments with multiple stimuli. In the inferred models, weighed input terms represent external stimuli and resulting GRNs represent the merged effects of the diverse experiments. Therefore, from a biological point of view, the algorithm is able to handle experiments which investigate the degree of crosstalk.
We applied and tested this feature for 3 benchmark examples and 1 realworld example, the gene regulation during chondrogenic differentiation. The evaluation of the benchmark examples’ results showed the power of the algorithm to infer the network structure and to reproduce the time series. Further, for a special system of “full crosstalk”, i.e. all components are influenced by all stimuli, we could show that the simultaneous utilisation of different data sets leads to higher model quality compared to modelling data sets individually. The reason for this effect is due to the different stimulation by another external input which alters the time series data qualitatively and quantitatively, something that could not be achieved by biological replicates of a single input experiment. This underlines the benefit of using our integrated approach. Further, the presented examples LCT and NCT are possible outcomes of GRN investigations. In the first case, there are two different types of genes: some are induced by one stimulus only and some are induced by multiple stimuli. The model inferred by NetGenerator contains both the separate and common structural elements. The special case of NCT occurs, if network parts are stimulated that are not connected at all. In summary, the extended NetGenerator takes advantage of multistimuli multiexperiment data by network refinement and extension.
We further inferred a twostimulus network for hMSC differentiating towards chondrocytes. This network model contains gene regulatory events following the stimulation with two distinct chondrogenic factors, therefore providing a view on how genes involved in differentiation might be controlled by external molecules. Applying a subsequent resampling gives further information about the connections of this inferred GRN: (i) the majority of connections, especially the ones of prior knowledge and predicted binding sites, occur with a high frequency which can be considered a measure of reliability and (ii) the ranking of the frequencies can be used in interpreting the results with regard to biological hypotheses. Overall, this shows the importance for an extension of NetGenerator to deal with multiple data sets.
Consideration of prior knowledge
The means to integrate prior knowledge (fix and flexible) into the network inference is a distinctive feature of the extended NetGenerator algorithm achieved by modifying the objective function. This feature can reduce the complexity of the structure optimization, although it strongly depends on the origin and quality of the given knowledge, see e.g.[7]. Using prior knowledge for network inference can also be found in several other algorithms, see[1, 3, 4].
For our example of chondrogenic differentiation, we exemplarily showed network inference using flexible prior knowledge about regulatory interactions extracted from a database (Pathway Studio). The graphical evaluation of the inferred network showed very good reproduction of the proposed prior knowledge. Further predicted connections could be associated with potential regulatory binding sites generated from sequence data (Transfac, Ensembl).
Further aspects
Apart from the linear modelling presented in detail, the ability of NetGenerator to infer a nonlinear model has been mentioned as a further option. The additional sigmoid function describing saturation in geneexpression has been proven successful before, e.g.[40–42]. Since the sigmoid transformation has also been used for neural network models, those inference methods are sometimes classified as such.
Besides the many advantages and possible application areas, there are minor restrictions of NetGenerator: it should be applied to preprocessed data without high correlations, it infers networks from measured time series data and due to the heuristic approach it cannot be proven that the global solution was found. The latter can be improved by decreasing the influence of noisy data using a bootstrap (resampling) approach, see chondrogenesis example and[1]. One feature which might be introduced in subsequent versions is the application of “interventional” multiexperiment data, i.e. data originating from perturbations within the system. This can be dealt with by applying either experimentwise prior knowledge or an additional module in the structure optimisation explicitly dealing with that kind of data.
Conclusions
We presented the novel NetGenerator algorithm for automatic inference of GRNs, which applies multistimuli multiexperiment time series data and biological prior knowledge resulting in dynamical models of differential equations systems. This heuristic approach combines network structure and parameter optimisation of coupled submodels and takes into account the biological properties of those networks: indirect transcriptional events for information propagation, limited number of connections and mostly hierarchical structures. The analysis of benchmark examples showed a good reproduction of the networks and emphasised the biological relevance of inferred networks with a different degree of crosstalk. The ability to infer a realworld example based on multistimuli multiexperiment data was shown by application of NetGenerator to a system of growth factorinduced chondrogenesis.
Declarations
Acknowledgements
We would like to thank all our LINCONET project partners of the ERASysBio+ initiative. Also, we kindly acknowledge the support of this work by the BMBF (German Federal Ministry of Education and Research) funding MW within this initiative (Fkz. 0315719). Further, we acknowledge the Virtual Liver Network initiative of the BMBF for granting support (Fkz. 0315760 and Fkz. 0315736).
Authors’ Affiliations
References
 Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R: Gene regulatory network inference: Data integration in dynamic models—A review. Biosystems. 2009, 96: 86103. 10.1016/j.biosystems.2008.12.004.PubMedView ArticleGoogle Scholar
 Piek E, Sleumer LS, van Someren EP, Heuver L, Haan JRd, Grijs Id, Gilissen C, Hendriks JM, van Ravesteinvan Os RI, Bauerschmidt S, Dechering KJ, van Zoelen EJ: Osteotranscriptomics of human mesenchymal stem cells: accelerated gene expression and osteoblast differentiation induced by vitamin D reveals cMYC as an enhancer of BMP2induced osteogenesis. Bone. 2010, 46 (3): 613627. 10.1016/j.bone.2009.10.024.PubMedView ArticleGoogle Scholar
 de Jong H: Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 2002, 9: 67103. 10.1089/10665270252833208.PubMedView ArticleGoogle Scholar
 Bansal M, Belcastro V, AmbesiImpiombato A, Di Bernardo D: How to infer gene networks from expression profiles. Mol Syst Biol. 2007, 3: 78PubMedPubMed CentralView ArticleGoogle Scholar
 Guthke R, Möller U, Hoffmann M, Thies F, Töpfer S: Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics. 2005, 21 (8): 16261634. 10.1093/bioinformatics/bti226.PubMedView ArticleGoogle Scholar
 Toepfer S, Guthke R, Driesch D, Woetzel D, Pfaff M: The NetGenerator algorithm: reconstruction of gene regulatory networks. Lecture Notes in Computer Science. Edited by: Tuyls K, Westra R, Saeys Y. 2007, Nowé A, Berlin and Heidelberg: Springer Berlin Heidelberg, 119130.Google Scholar
 Linde J, Wilson D, Hube B, Guthke R: Regulatory network modelling of iron acquisition by a fungal pathogen in contact with epithelial cells. BMC Syst Biol. 2010, 4: 14810.1186/175205094148.PubMedPubMed CentralView ArticleGoogle Scholar
 Albrecht D, Kniemeyer O, Mech F, Gunzer M, Brakhage A, Guthke R: On the way toward systems biology of Aspergillus fumigatus infection. Int J Med Microbiol. 2011, 301 (5): 453459. 10.1016/j.ijmm.2011.04.014.PubMedView ArticleGoogle Scholar
 R Development Core Team: R: A language and environment for statistical computing. 2008, R Foundation for Statistical Computing, Vienna, Austria, [http://www.Rproject.org]Google Scholar
 Ferrari F, Bortoluzzi S, Coppe A, Sirota A, Safran M, Shmoish M, Ferrari S, Lancet D, Danieli G, Bicciato S: Novel definition files for human GeneChips based on GeneAnnot. BMC Bioinformatics. 2007, 8: 44610.1186/147121058446.PubMedPubMed CentralView ArticleGoogle Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
 Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3PubMedGoogle Scholar
 Draper NR, Smith H: Applied regression analysis. 1998, A WileyInterscience publication, New, York: WileyView ArticleGoogle Scholar
 Byrd RH, Lu P, Nocedal J, Zhu C: A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput. 1995, 16 (5): 119010.1137/0916069.View ArticleGoogle Scholar
 Yuryev A, Mulyukov Z, Kotelnikova E, Maslov S, Egorov S, Nikitin A, Daraselia N, Mazo I: Automatic pathway building in biological association networks. BMC Bioinf. 2006, 7: 17110.1186/147121057171.View ArticleGoogle Scholar
 ThomasChollier M, Defrance M, MedinaRivera A, Sand O, Herrmann C, Thieffry D, van Helden J: RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res. 2011, 39 (Web Server issue): W86W91.PubMedPubMed CentralView ArticleGoogle Scholar
 Matys V, KelMargoulis OV, Fricke E, Liebich I, Land S, BarreDirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, LewickiPotapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34 (Database issue): D108D110.PubMedPubMed CentralView ArticleGoogle Scholar
 Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart – biological queries made easy. BMC Genomics. 2009, 10: 2210.1186/147121641022.PubMedPubMed CentralView ArticleGoogle Scholar
 Soetaert K, Petzoldt T, Setzer RW: Solving differential equations in R: package deSolve. J Stat Software. 2010, 33 (9): 125.View ArticleGoogle Scholar
 Gansner ER, North SC: An open graph visualization system and its applications to software engineering. Software–Practice and Experience. 1999, 00: 15.Google Scholar
 Hartmann C: Transcriptional networks controlling skeletal development. Curr Opin Genet Dev. 2009, 19 (5): 437443. 10.1016/j.gde.2009.09.001.PubMedView ArticleGoogle Scholar
 Jin EJ, Lee SY, Choi YA, Jung JC, Bang OS, Kang SS: BMP2enhanced chondrogenesis involves p38 MAPKmediated downregulation of Wnt7a pathway. Mol Cells. 2006, 22 (3): 353359.PubMedGoogle Scholar
 van der Kraan PM, Blaney Davidson EN, Blom A, van den Berg WB: TGFbeta signaling in chondrocyte terminal differentiation and osteoarthritis: modulation and integration of signaling pathways through receptorSmads. Osteoarthritis and Cartilage / OARS, Osteoarthritis Res Soc. 2009, 17 (12): 15391545. 10.1016/j.joca.2009.06.008.View ArticleGoogle Scholar
 Sekiya I, Tsuji K, Koopman P, Watanabe H, Yamada Y, Shinomiya K, Nifuji A, Noda M: SOX9 enhances aggrecan gene promoter/enhancer activity and is upregulated by retinoic acid in a cartilagederived cell line, TC6. J Biol Chem. 2000, 275 (15): 1073810744. 10.1074/jbc.275.15.10738.PubMedView ArticleGoogle Scholar
 Yamashita S, Andoh M, UenoKudoh H, Sato T, Miyaki S, Asahara H: Sox9 directly promotes Bapx1 gene expression to repress Runx2 in chondrocytes. Exp Cell Res. 2009, 315 (13): 22312240. 10.1016/j.yexcr.2009.03.008.PubMedPubMed CentralView ArticleGoogle Scholar
 Oh Cd, Maity SN, Lu JF, Zhang J, Liang S, Coustry F, Crombrugghe Bd, Yasuda H: Identification of SOX9 interaction sites in the genome of chondrocytes. PLoS ONE. 2010, 5 (4): e1011310.1371/journal.pone.0010113.PubMedPubMed CentralView ArticleGoogle Scholar
 Craft AM, Krisky DM, Wechuck JB, Lobenhofer EK, Jiang Y, Wolfe DP, Glorioso JC: Herpes simplex virusmediated expression of Pax3 and MyoD in embryoid bodies results in lineagerelated alterations in gene expression profiles. Stem Cells. 2008, 26 (12): 31193129. 10.1634/stemcells.20080417.PubMedView ArticleGoogle Scholar
 Meyer PE, Lafitte F, Bontempi G: minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinf. 2008, 9: 46110.1186/147121059461.View ArticleGoogle Scholar
 Kauffman S: Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol. 1969, 22 (3): 437467. 10.1016/00225193(69)900150.PubMedView ArticleGoogle Scholar
 D’haeseleer P, Wen X, Somogyi R, Fuhrman S: Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput. 1999, 4: 4152.Google Scholar
 Altwasser R, Linde J, Buyko E, Hahn U, Guthke R: Genomewide scalefree network inference for Candida albicans. Frontiers Microbiol. 2012, 3: 51View ArticleGoogle Scholar
 Gustafsson M, Hornquist M, Lombardi A: Constructing and analyzing a largescale genetogene regulatory network–Lassoconstrained inference and biological validation. IEEE/ACM Transac Comput Biol Bioinf. 2005, 2 (3): 254261. 10.1109/TCBB.2005.35.View ArticleGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc: Ser B (Stat Methodology). 1996, 58: 267288.Google Scholar
 van Someren E, Wessels L, Reinders M, Backer E: Regularization and noise injection for improving genetic network models. Computational and Statistical Approaches to Genomics. Edited by: Zhang W, Shmulevich I. 2002, NJ, USA: World Scientific Publishing Co., 211226.Google Scholar
 Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V: The Inferelator: an algorithm for learning parsimonious regulatory networks from systemsbiology data sets de novo. Genome Biol. 2006, 7 (5): R3610.1186/gb200675r36.PubMedPubMed CentralView ArticleGoogle Scholar
 Hecker M, Goertsches R, Engelmann R, Thiesen HJ, Guthke R: Integrative modeling of transcriptional regulation in response to antirheumatic therapy. BMC Bioinformatics. 2009, 10: 26210.1186/1471210510262.PubMedPubMed CentralView ArticleGoogle Scholar
 Bansal M, Gatta GD, Di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006, 22 (7): 815822. 10.1093/bioinformatics/btl003.PubMedView ArticleGoogle Scholar
 Wang Y, Joshi T, Zhang XS, Xu D, Chen L: Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics. 2006, 22 (19): 24132420. 10.1093/bioinformatics/btl396.PubMedView ArticleGoogle Scholar
 Gupta R, Stincone A, Antczak P, Durant S, Bicknell R, Bikfalvi A, Falciani F: A computational framework for gene regulatory network inference that combines multiple methods and datasets. BMC Syst Biol. 2011, 5: 5210.1186/17520509552.PubMedPubMed CentralView ArticleGoogle Scholar
 Weaver DC, Workman CT, Stormo GD: Modeling regulatory networks with weight matrices. Pac Symp Biocomput. 1999, 4: 112123.Google Scholar
 Wahde M, Hertz J: Coarsegrained reverse engineering of genetic regulatory networks. Biosystems. 2000, 55 (13): 129136. 10.1016/S03032647(99)000908.PubMedView ArticleGoogle Scholar
 Mjolsness E, Mann T, Castaño R, Wold B: From coexpression to coregulation: An approach to inferring transcriptional regulation among gene classes from largescale expression data. Advances in Neural Information Processing Systems, Volume 12. Edited by: Solla SA, Leen TK. 2000, Müller KR: MIT Press, 928934.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.