Genomic data assimilation using a higher moment filtering technique for restoration of gene regulatory networks
- Takanori Hasegawa^{1}Email author,
- Tomoya Mori^{1},
- Rui Yamaguchi^{2},
- Teppei Shimamura^{3},
- Satoru Miyano^{2},
- Seiya Imoto^{2} and
- Tatsuya Akutsu^{1}
https://doi.org/10.1186/s12918-015-0154-2
© Hasegawa et al.; licensee BioMed Central. 2015
Received: 24 September 2014
Accepted: 20 February 2015
Published: 13 March 2015
Abstract
Background
As a result of recent advances in biotechnology, many findings related to intracellular systems have been published, e.g., transcription factor (TF) information. Although we can reproduce biological systems by incorporating such findings and describing their dynamics as mathematical equations, simulation results can be inconsistent with data from biological observations if there are inaccurate or unknown parts in the constructed system. For the completion of such systems, relationships among genes have been inferred through several computational approaches, which typically apply several abstractions, e.g., linearization, to handle the heavy computational cost in evaluating biological systems. However, since these approximations can generate false regulations, computational methods that can infer regulatory relationships based on less abstract models incorporating existing knowledge have been strongly required.
Results
We propose a new data assimilation algorithm that utilizes a simple nonlinear regulatory model and a state space representation to infer gene regulatory networks (GRNs) using time-course observation data. For the estimation of the hidden state variables and the parameter values, we developed a novel method termed a higher moment ensemble particle filter (HMEnPF) that can retain first four moments of the conditional distributions through filtering steps. Starting from the original model, e.g., derived from the literature, the proposed algorithm can sequentially evaluate candidate models, which are generated by partially changing the current best model, to find the model that can best predict the data. For the performance evaluation, we generated six synthetic data based on two real biological networks and evaluated effectiveness of the proposed algorithm by improving the networks inferred by previous methods. We then applied time-course observation data of rat skeletal muscle stimulated with corticosteroid. Since a corticosteroid pharmacogenomic pathway, its kinetic/dynamics and TF candidate genes have been partially elucidated, we incorporated these findings and inferred an extended pathway of rat pharmacogenomics.
Conclusions
Through the simulation study, the proposed algorithm outperformed previous methods and successfully improved the regulatory structure inferred by the previous methods. Furthermore, the proposed algorithm could extend a corticosteroid related pathway, which has been partially elucidated, with incorporating several information sources.
Keywords
Background
Gene regulatory networks (GRNs) are fundamental for sustaining complex biological systems in cells. Although a comprehensive understanding of intracellular systems is still far from complete, many findings regarding intracellular systems have been published as a result of recent technological advances in biotechnology, e.g., microarray and Chip-Seq. By combining these findings, we can construct biological simulation models in which the dynamics of biomolecules are described by mathematical equations, e.g., the Michaelis-Menten model [1] and S-system [2]. However, simulation results may not match results from biological observations due to inaccurate or missing information about intracellular systems.
In oder to infer unknown parts of biological systems, there exist roughly two major approaches, i.e., simulation model-based and statistical approaches. In constructing biological simulation models, regulatory relationships among biomolecules are collected from the literature. To represent the regulatory systems, mathematical equations, often differential equations [1-4], are given to describe the dynamic behavior of the involved biomolecules. The parameter values of these simulation models have been estimated to be consistent with the data by some computational methodologies. Several methods have been proposed to infer regulatory structures [5,6], to reproduce the dynamic behavior of biological systems recorded in the literature [7-10], and to improve published pathways so that they are consistent with the data [11,12]. However, differential equation-based approaches are computationally intensive when updating parameter values and simulation results simultaneously. Therefore, they cannot be applied to more than several genes when much of the regulatory structure is unknown.
A statistical approach using more abstracted models, e.g., Bayesian networks [13-16] and the state space model [17-21], have been successfully applied to infer the structure of transcriptional regulation using data from biological observations. Whereas purely data-driven methods need to explore a large model space, some studies have further incorporated other information, e.g., literature-recorded pathways and TFs information [22-26]. In contrast, these approximations can generate false regulations; there is a trade-off relationship between accuracy and computational ease. To overcome the problem, methods to improve and deconvolve networks, which are inferred by some computational approaches, utilizing less abstract models to better predict the data have been also proposed recently [27-29]. In following the direction, we should apply models that can maximally emulate the nonlinear dynamics of gene regulatory networks and establish a method for estimating the parameter values that maximize the ability to predict the data.
For this purpose, we proposed a novel data assimilation algorithm utilizing a simple nonlinear model, termed the combinatorial transcription model [5,30], and a state space representation [31,32], to infer GRNs by restoring networks that is inferred by some GRNs inference methods or derived from the literature. Since the nonlinearity results in generating non-Gaussian conditional distributions of the hidden state variables, we applied the unscented Kalman filter (UKF) [33-35] that can efficiently calculate the approximated conditional distributions as Gaussian distributions [36]. However, UKF cannot satisfy the requirements for estimating accurate parameter values of the model; thus, the first four moments of the conditional distributions of the hidden states should be retained. To address this problem, we developed a novel method, termed a higher-moment ensemble particle filter (HMEnPF), which can retain the first two moments and the third and fourth central moments throughout the prediction, filtering, and smoothing steps. Starting from an original network, which is derived from the literature or some GRNs inference methods, the proposed algorithm using HMEnPF improves the network based on the nonlinear state space model. Furthermore, the combinatorial transcription model was extended so that the model can handle additional biomolecules such as drugs.
To show the effectiveness of the proposed algorithm, we first prepared synthetic time-course data and compared the proposed algorithm to GeneNet [37,38] based on an empirical graphical Gaussian model (GGM), G1DBN [39] based on dynamic Bayesian networks using first order conditional dependencies, and the previous algorithm using UKF only [36]. For this comparison, six synthetic data with 30 time-points were generated based on a WNT5A [40] and a yeast-cell-cycle network [41]. As an application example, we used the time-course microarray data after stimulating rat skeletal muscle with corticosteroid, which were downloaded from the GEO database (GSE490). For this experiment, we also utilized corticosteroid pharmacogenomics [42,43], a previously defined regulatory structure for rat skeletal muscle [44], TF information from ITFP (Integrated Transcription Factor Platform) [45] and the original network inferred by G1DBN. Consequently, we proposed candidate pathways for an extension of corticosteroid-related pathways.
Methods
State space representation of combinatorial transcription model
where f _{ i } is a function of the regulatory effect on the ith gene by other genes, x(t)=(x _{1}(t),…,x _{ p }(t))^{′}, \(\boldsymbol {\theta }_{f_{i}}\) is a vector of tuning parameters for f _{ i }, u _{ i } is a synthesis coefficient, d _{ i } is a degradation coefficient and v(t) is a random system noise. Here, (·)^{′} stands for transposition. f _{ i } is often considered as a hill function, such as the Michaelis-Menten model [1].
where x _{ i,t } is the amount of the ith gene at time t, a _{ i,j } is an individual effect by the jth gene on the ith gene, b _{ i,(j,k)} is a combinatorial effect of the jth and kth genes on the ith gene and \(\mathcal A_{i}\) is an active set of genes regulating the ith gene. Since this model is a simple extension of a linear model to express a combinatorial effect by two different genes, b _{ j,j } is not considered.
where x _{ t }=(x _{1,t },…,x _{ p,t })^{′}, A=(a _{1},…,a _{ p })^{′}∈R ^{ p×p } is a linear effect matrix, a _{ i }=(a _{ i,1},…,a _{ i,p })^{′} (i=1,…,p), \(B=(\boldsymbol {b}_{1}, \ldots, \boldsymbol {b}_{p})'\in R^{p \times p^{2}}\) is a combinatorial effect matrix, b _{ i }=(b _{ i,(1,1)},…,b _{ i,(1,p)},b _{ i,(2,1)},…,b _{ i,(p,p)})^{′} (i=1,…,p), vec is a transformation function \((R^{p \times p} \rightarrow R^{p^{2}})\), u=(u _{1},…,u _{ p })^{′}, and v _{ t }∼N(0,Q) and w _{ t }∼N(0,R) are system and observational noises with diagonal covariance matrices, respectively. We define an entire set of time points \(\mathcal T=\{1, \ldots, T\}\) and the observed time set \(\mathcal {T}_{\textit {obs}}\) (\(\mathcal {T}_{\textit {obs}}\subset \mathcal {T}\)), and consider \(\mathcal {T}_{\textit {obs}}=\mathcal {T}\) in the following for simplicity. Note that A and B should be sparse matrices, and we also consider an active set of elements \({\mathcal B_{i}}(i=1, \ldots, p)\), which are sets of non-zero columns in the ith row of B.
Incorporation of biomolecules affecting biological systems
A higher-moment ensemble particle filter
where \(\boldsymbol {x}_{t}^{(n)}\) is the nth sample from p(x _{ t }), N is the number of samples and δ is a Dirac delta function. A sample \(\boldsymbol {x}_{t}^{(n)}\) and a set of samples \(\{\boldsymbol {x}_{t}^{(n)}\}\) are called particle and ensemble, respectively. Previously, many types of ensemble approximation methods have been developed to obtain conditional distributions of the hidden state variables in nonlinear state space models, e.g., the ensemble Kalman filter (EnKF) [47] and the particle filter (PF) [48,49]. Here, our requirements for this study are the followings; (i) the number of particles is not reduced through filtering steps since p and the dimension of θ can be too high for the resampling procedure and (ii) the third and fourth moments of probability densities of the hidden states should be kept to optimize θ as explained in the next sub-section. To satisfy these requirements, we extended a method termed the ensemble particle filter (EnPF) [50,51], which can retain the first two moments through filtering steps, and developed a novel method termed a higher-moment ensemble particle filter (HM-EnPF) that can additionally retain third and fourth central moments without reducing particles. The procedure of the proposed method is explained below.
Prediction step
- 1.
Generate particles \(\boldsymbol {v}_{t}^{(n)}\sim N(0, Q)\) for n=1,…,N.
- 2.
Calculate \(\boldsymbol {x}_{t+1|t}^{(n)}\) by applying \(\boldsymbol {x}_{t|t}^{(n)}\) and \(\boldsymbol {v}_{t}^{(n)}\) to Eq. (5) for n=1,…,N.
Filtering step
In this step, we attempt to calculate p(x _{ t+1}|Y _{ t+1}) after obtaining p(x _{ t+1}|Y _{ t }) (t=0,…,T−1). This step consists of the following three sub-steps termed “Particle Filter Step”, “Ensemble Kalman Filter Step” and “Merging Step”.
- 1.“Particle Filter Step” is firstly executed to obtain \(\left \{\hat {\boldsymbol {x}}_{t|t}^{(n)}\right \}\) that is according to the theoretically exact conditional probability density p(x _{ t }|y _{ t }) as follows.
- (a)Resample \(\hat {\boldsymbol {x}}_{t|t}^{(n)}\) according to$$\begin{array}{*{20}l} p(\boldsymbol{x}_{t}|\boldsymbol{y}_{t}) &= \frac{1}{\sum_{\dot{n}=1}^{N}p\left(\boldsymbol{y}_{t}|\boldsymbol{x}_{t|t-1}^{(\dot{n})}\right)}\\ &~~~~~ \times \sum_{n=1}^{N}p\left(\boldsymbol{y}_{t}|\boldsymbol{x}_{t|t-1}^{(n)}\right)\delta\left(\boldsymbol{x}_{t}-\boldsymbol{x}_{t|t-1}^{(n)}\right). \end{array} $$(7)
- (b)
Calculate the first and second moments \(\mu _{t|t} = E\left [\left \{\hat {\boldsymbol {x}}_{t|t}^{(n)}\right \}\right ]\) and \(V_{t|t}=Var\left [\left \{\hat {\boldsymbol {x}}_{t|t}^{(n)}\right \}\right ]\), respectively.
- (c)Standardize \(\hat {\boldsymbol {x}}_{t|t}^{(n)}\) as$$\begin{array}{*{20}l} \hat{\boldsymbol{z}}_{t|t}^{(n)} = V_{t|t}^{-\frac{1}{2}} \cdot \left(\hat{\boldsymbol{x}}_{t|t}^{(n)} - \boldsymbol{\mu}_{t|t}\right). \end{array} $$(8)
- (d)
Calculate the third and fourth central moments \(\hat {\boldsymbol {m}}_{t|t}^{(3)}=E\left [\left \{\hat {\boldsymbol {z}}_{t|t}^{(n)}\right \}^{3}\right ]\) and \(\hat {\boldsymbol {m}}_{t|t}^{(4)} =E\left [\left \{\hat {\boldsymbol {z}}_{t|t}^{(n)}\right \}^{4}\right ]\), respectively.
- (a)
- 2.“Ensemble Kalman Filter Step” is secondly executed to obtain \(\left \{\tilde {\boldsymbol {x}}_{t|t}^{(n)}\right \}\) that is calculated under the Gaussian assumption with regard to p(x _{ t }|y _{ t }) as follows.
- (a)
Generate particles \(\boldsymbol {w}_{t}^{(n)}\sim N(0, R)\) for n=1,…,N.
- (b)Calculate Kalman gain$$\begin{array}{*{20}l} K_{t} = V_{t|t-1}\left(V_{t|t-1} + R_{t}\right)^{-1}, \end{array} $$(9)
where \(V_{t|t-1} = Var\left [\left \{\boldsymbol {x}_{t|t-1}^{(n)}\right \}\right ]\) and \(R_{t} = Var\left [\left \{\boldsymbol {w}_{t}^{(n)}\right \}\right ]\).
- (c)Calculate \(\tilde {\boldsymbol {x}}_{t|t}^{(n)}\) as$$\begin{array}{*{20}l} \tilde{\boldsymbol{x}}_{t|t}^{(n)} = \boldsymbol{x}_{t|t-1}^{(n)} + K_{t} \left(\boldsymbol{y}_{t} - \boldsymbol{x}_{t|t-1}^{(n)} + \boldsymbol{w}_{t}^{(n)}\right). \end{array} $$(10)
- (d)
Calculate the first and second moments \(\tilde {\mu }_{t|t} = E\left [\left \{\tilde {\boldsymbol {x}}_{t|t}^{(n)}\right \}\right ]\) and \(\tilde {V}_{t|t}=Var\left [\left \{\tilde {\boldsymbol {x}}_{t|t}^{(n)}\right \}\right ]\), respectively.
- (e)Standardize \(\tilde {\boldsymbol {x}}_{t|t}^{(n)}\) as$$\begin{array}{*{20}l} \tilde{\boldsymbol{z}}_{t|t}^{(n)} = \tilde{V}_{t|t}^{-\frac{1}{2}} \cdot \left(\tilde{\boldsymbol{x}}_{t|t}^{(n)}- \tilde{\boldsymbol{\mu}}_{t|t}\right). \end{array} $$(11)
- (f)
Calculate the third and fourth central moments \(\tilde {\boldsymbol {m}}_{t|t}^{(3)}=E\left [\left \{\tilde {\boldsymbol {z}}_{t|t}^{(n)}\right \}^{3}\right ]\) and \(\tilde {\boldsymbol {m}}_{t|t}^{(4)} =E\left [\left \{\tilde {\boldsymbol {z}}_{t|t}^{(n)}\right \}^{4}\right ]\), respectively.
- (a)
- 3.“Merging Step” is finally executed to obtain \(\left \{\boldsymbol {x}_{t|t}^{(n)}\right \}\) of which the first, second, third central and fourth central moments match to those of \(\left \{\hat {\boldsymbol {x}}_{t|t}^{(n)}\right \}\). Here, we consider a standardization function S(γ,α,β) that transforms a normal random vector γ into a normalized random vector x whose the third and fourth central moments are α and β, respectively. From a previous study [52], we have S(γ,α,β) and S _{ inv }(x,α,β) that transforms x to γ as explained in the Additional file 1. Then, we obtained \(\boldsymbol {x}_{t|t}^{(n)}\) as$$\begin{array}{*{20}l} \boldsymbol{x}_{t|t}^{(n)} &= \hat{V}_{t|t}^{\frac{1}{2}}S\left(\boldsymbol{z}_{t}^{(n)}, \hat{\boldsymbol{m}}_{t|t}^{(3)}, \hat{\boldsymbol{m}}_{t|t}^{(4)}\right) + \hat{\boldsymbol{\mu}}_{t|t}, \end{array} $$(12)$$\begin{array}{*{20}l} \boldsymbol{z}_{t}^{(n)} &= S_{inv}\left(\tilde{\boldsymbol{z}}_{t|t}^{(n)}, \tilde{\boldsymbol{m}}_{t|t}^{(3)}, \tilde{\boldsymbol{m}}_{t|t}^{(4)}\right). \end{array} $$(13)
Smoothing step
The smoothing step used for calculating x _{ t|s } (s>t) was essentially the same as the filtering step. The details of the smoothing step can be found in the Additional file 2.
Parameter estimation using EM-algorithm
where p(x _{0}) is a probability density of N-dimensional Gaussian distributions N(μ _{0},Σ _{0}), p(x _{ t }|x _{ t−1}) and p(y _{ t }|x _{ t }) can be probability densities of N-dimensional non-Gaussian distributions obtained by ensemble approximation.
is iteratively maximized with respect to θ until the convergence is attained. More details are included in the Additional file 3.
Network restoration algorithm
Results and discussion
Comparison using synthetic data
The comparison results using the original model given by G1DBN and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.01
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.689 | 0.573 | 0.625 | 17.2 | 7.8 | 52.2 | 12.8 |
UKF | 0.632 | 0.533 | 0.577 | 16.0 | 9.4 | 50.6 | 14.0 |
G1DBN | 0.487 | 0.453 | 0.466 | 13.6 | 15.0 | 45.0 | 16.4 |
The comparison results using the original model given by GeneNet and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.01
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.595 | 0.520 | 0.553 | 15.6 | 10.4 | 49.6 | 14.4 |
UKF | 0.573 | 0.493 | 0.529 | 14.8 | 10.6 | 49.4 | 15.2 |
GeneNet | 0.493 | 0.495 | 0.493 | 10.4 | 10.8 | 13.2 | 10.6 |
The comparison results using the original model given by G1DBN and the five time-courses generated from a yeast-cell cycle network with Gaussian system noise of mean 0 and variance 0.01
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.759 | 0.700 | 0.727 | 18.2 | 5.8 | 58.2 | 7.8 |
UKF | 0.707 | 0.692 | 0.698 | 18.0 | 7.6 | 56.4 | 8.0 |
G1DBN | 0.574 | 0.562 | 0.555 | 14.6 | 12.8 | 51.2 | 11.4 |
The comparison results using the original model given by GeneNet and the five time-courses generated from a yeast cell-cycle network with Gaussian system noise of mean 0 and variance 0.01
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.827 | 0.738 | 0.778 | 19.2 | 4.2 | 59.8 | 6.8 |
UKF | 0.703 | 0.708 | 0.704 | 18.4 | 8.0 | 56.0 | 7.6 |
GeneNet | 0.413 | 0.520 | 0.460 | 10.4 | 14.8 | 10.2 | 9.6 |
The comparison results using the original model given by G1DBN and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.05
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.646 | 0.580 | 0.609 | 17.4 | 9.4 | 50.6 | 12.6 |
UKF | 0.609 | 0.573 | 0.589 | 17.2 | 10.8 | 49.2 | 12.8 |
G1DBN | 0.490 | 0.460 | 0.468 | 13.8 | 14.6 | 45.4 | 16.2 |
The comparison results using the original model given by GeneNet and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.05
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.705 | 0.633 | 0.665 | 19.0 | 8.0 | 52.0 | 11.0 |
UKF | 0.649 | 0.567 | 0.604 | 17.0 | 9.2 | 50.8 | 13.0 |
GeneNet | 0.453 | 0.543 | 0.492 | 11.4 | 14.0 | 10.0 | 9.6 |
The comparison results using the original model given by G1DBN and the five time-courses generated from a yeast-cell cycle network with Gaussian system noise of mean 0 and variance 0.05
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.661 | 0.600 | 0.628 | 15.6 | 8.0 | 56.0 | 10.4 |
UKF | 0.573 | 0.538 | 0.553 | 14.0 | 10.4 | 53.6 | 12.0 |
G1DBN | 0.482 | 0.515 | 0.495 | 13.4 | 15.0 | 49.0 | 12.6 |
The comparison results using the original model given by GeneNet and the five time-courses generated from a yeast cell-cycle network with Gaussian system noise of mean 0 and variance 0.05
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.604 | 0.577 | 0.589 | 15.0 | 9.8 | 54.2 | 11.0 |
UKF | 0.578 | 0.562 | 0.568 | 14.6 | 11.0 | 53.0 | 11.4 |
GeneNet | 0.387 | 0.360 | 0.366 | 7.2 | 11.2 | 13.8 | 12.8 |
The comparison results using the original model given by G1DBN and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.1
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.689 | 0.633 | 0.659 | 19.0 | 8.8 | 51.2 | 11.0 |
UKF | 0.637 | 0.600 | 0.616 | 18.0 | 10.6 | 49.4 | 12.0 |
G1DBN | 0.590 | 0.513 | 0.548 | 15.4 | 11.0 | 49.0 | 14.6 |
The comparison results using the original model given by GeneNet and the five time-courses generated from WNTA5A network with Gaussian system noise of mean 0 and variance 0.1
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.671 | 0.607 | 0.635 | 18.2 | 9.0 | 51.0 | 11.8 |
UKF | 0.644 | 0.593 | 0.615 | 17.8 | 10.2 | 49.8 | 12.2 |
GeneNet | 0.503 | 0.590 | 0.542 | 12.4 | 12.2 | 11.8 | 8.6 |
The comparison results using the original model given by G1DBN and the five time-courses generated from a yeast-cell cycle network with Gaussian system noise of mean 0 and variance 0.1
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.721 | 0.731 | 0.725 | 19.0 | 7.4 | 56.6 | 7.0 |
UKF | 0.714 | 0.715 | 0.713 | 18.6 | 7.6 | 56.4 | 7.4 |
G1DBN | 0.611 | 0.585 | 0.591 | 15.2 | 10.2 | 53.8 | 10.8 |
The comparison results using the original model given by GeneNet and the five time-courses generated from a yeast cell-cycle network with Gaussian system noise of mean 0 and variance 0.1
PR | RR | F-measure | TP | FP | TN | FN | |
---|---|---|---|---|---|---|---|
HMEnPF | 0.724 | 0.746 | 0.735 | 19.4 | 7.4 | 56.6 | 6.6 |
UKF | 0.690 | 0.731 | 0.709 | 19.0 | 8.6 | 55.4 | 7.0 |
GeneNet | 0.407 | 0.460 | 0.427 | 9.2 | 13.6 | 11.4 | 10.8 |
The results indicate that the proposed algorithm using HMEnPF and only UKF could outperform G1DBN and GeneNet, and the proposed algorithm showed better performance than that of using UKF only. This concludes that retaining higher moment information can improve the accuracy of approximation and estimate correct parameter values. Additionally, we recognized that the performance of the proposed algorithm strongly depends on the accuracy of the original network. Thus, to obtain better results, we should carefully construct original networks or select inference methods for creating the original network. Note that the Jar file of the proposed algorithm is available at: http://sunflower.kuicr.kyoto-u.ac.jp/~t-hasegw/, and the synthetic data, the parameter values and the original networks are in the Additional file 5.
Inference using real data
As an application example, we analyzed microarray time-course gene expression data from rat skeletal muscle [42,43]. The microarray data were downloaded from the GEO database (GSE490). The time-course gene expression data was measured at 0, 0.25, 0.5, 0.75, 1, 2, 4, 5, 5.5, 7, 8, 12, 18, 30, 48, and 72 [h] (16 time points) after stimulation of corticosteroid, but we removed data at 48 and 72 [h] (steady state profiles) for computational efficiency. The data at time 0 represent controls (untreated). There were two, three, or four replicated observations for each time point.
Sets of pharmacogenomic genes handled in the real data experiment
Gene Set | Literature | TF | |
---|---|---|---|
candidate | |||
(i) | Trim63, Akt1, Akt2, Mstn, Irs1 | ∘/- | ∘ |
(ii) | Akt3, Anxa3, Bcat2, Bnip3, | ∘/- | - |
Foxo1, Igf1, Igf1r, Mtor | |||
Pik3c3, Pik3cb, Pik3cd, | |||
Pik3c2g, Rheb, Slc2a4 | |||
(iii) | Cebpb, Cebpd, Gpam, Srebf1 | -/ ∘ | ∘ |
Rxrg, Scarb1, Scd, Gpd2, | |||
Mapk6, Ace, Ptpn1 | |||
(iv) | Ptprf, Edn1, Agtr1a, Ppard, | -/ ∘ | - |
Hmgcs2, Serpine1 | |||
Il6r, Mapk14, Ucp3, Pdk4 |
Conclusions
In this paper, we developed a novel approach to restore original GRNs to be consistent with time-course mRNA expression data based on the combinatorial transcription model. Since we applied a state space representation with the nonlinear system equation in the context of data assimilation, the conditional distributions of the hidden variables can be non-Gaussian distributions. In contrast to the previous approaches using particle filter, Gaussian approximation and regression-based solutions, our proposed approach, HMEnPF, can retain the first, second, third central and fourth central moments through filtering steps to estimate near optimal parameter values by the EM-algorithm.
According to the comparison results using six synthetic data based on the real biological pathways, the proposed algorithm successfully explored better models than the previous methods, G1DBN and GeneNet, considering linear relevance. Moreover, the proposed algorithm using HMEnPF outperformed that of using UKF. This concludes that HMEnPF retaining parts of higher moment information can improve the accuracy of the estimation of the parameter values beyond unscented approximation (that cannot retain any moment through filtering steps based on Gaussian approximation). Through the experimental results, we also observed that the performance of the restoration algorithm strongly depends on the original network, which was prepared by literature information or some GRNs inference methods. Thus, one of significant points is to select methods to infer the original network. On the other hand, the proposed method has some limitations. For example, we require time-course data in which the number of time points should be more than 10 or so. Moreover, due to its heavy computational costs, the calculation for more than 20−30 genes without TF information can be infeasible.
As an application example, we prepared corticosteroid pharmacogenomic pathways in rat skeletal muscle that have been investigated and established a part of regulatory relationships and related genes. Additionally, the intracellular concentration of corticosteroid that directly/indirectly affects gene expression can be obtained by the previously developed differential equations and TF information for rat genes can also be utilized. In summary, we inferred the regulatory relationships among 40 genes and corticosteroid with fixing the established pathways and restricting that only TF candidates can regulate other genes. G1DBN was employed to construct the original model for the proposed algorithm. Consequently, several combinatorial regulations and regulations by corticosteroid were inferred by extending the original network. Since previous linear models may not be able to infer these regulations, the proposed algorithm can be valuable to restore inferred and literature-based networks to be consistent with the data.
Declarations
Acknowledgements
The super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo (http://sc.hgc.jp/shirokane.html).
This work was partly supported by Grant-in-Aid for JSPS Fellows Number 24-9639.
Authors’ Affiliations
References
- Savageau MA. Biochemical systems analysis: II. The steady-state solutions for an n-pool system using a power-law approximation. J Theor Biol. 1969; 25(3):370–9.View ArticlePubMedGoogle Scholar
- Savageau MA, Voit EO. Recasting nonlinear differential equations as s-systems: a canonical nonlinear form. Math Biosci. 1987; 87(1):83–115.View ArticleGoogle Scholar
- Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature. 2000; 403(6767):335–8.View ArticlePubMedGoogle Scholar
- de Jong H. Modeling and simulation of genetic regulatory systems: A literature review. J Comput Biol. 2002; 9(1):67–103.View ArticlePubMedGoogle Scholar
- Opper M, Sanguinetti G. Learning combinatorial transcriptional dynamics from gene expression data. Bioinformatics. 2010; 26(13):1623–9.View ArticlePubMedGoogle Scholar
- Henderson J, Michailidis G. Network reconstruction using nonparametric additive ode models. PLoS ONE. 2014; 9(4):94003.View ArticleGoogle Scholar
- Koh CHH, Nagasaki M, Saito A, Wong L, Miyano S. DA 1.0: parameter estimation of biological pathways using data assimilation approach. Bioinformatics. 2010; 26(14):1794–6.View ArticlePubMedGoogle Scholar
- Matsuno H, Nagasaki M, Miyano S. Hybrid petri net based modeling for biological pathway simulation. Nat Comput. 2011; 10:1099–120.View ArticleGoogle Scholar
- Ramsay JO, Hooker G, Campbell D, Cao J. Parameter estimation for differential equations: a generalized smoothing approach. J R Stat Soc: Ser B (Stat Methodology). 2007; 69(5):741–96.View ArticleGoogle Scholar
- Quach M, Brunel N, d’Alche-Buc F. Estimating parameters and hidden variables in non-linear state-space models based on odes for biological networks inference. Bioinformatics. 2007; 23(23):3209–16.View ArticlePubMedGoogle Scholar
- Hasegawa T, Yamaguchi R, Nagasaki M, Imoto S, Miyano S. Comprehensive pharmacogenomic pathway screening by data assimilation. In: Proceedings of the 7th International Conference on Bioinformatics Research and Applications. ISBRA’11. Berlin, Heidelberg: Springer: 2011. p. 160–171.Google Scholar
- Hasegawa T, Nagasaki M, Yamaguchi R, Imoto S, Miyano S. An efficient method of exploring simulation models by assimilating literature and biological observational data. Biosystems. 2014; 121(0):54–66.View ArticlePubMedGoogle Scholar
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007; 9(3):432–41.View ArticlePubMed CentralPubMedGoogle Scholar
- Kim S, Imoto S, Miyano S. Dynamic bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems. 2004; 75(1-3):57–65.View ArticlePubMedGoogle Scholar
- Young W, Raftery A, Yeung K. Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC Syst Biol. 2014; 8(1):47.View ArticlePubMed CentralPubMedGoogle Scholar
- Zacher B, Abnaof K, Gade S, Younesi E, Tresch A, Fröhlich H. Joint Bayesian inference of condition-specific miRNA and transcription factor activities from combined gene and microRNA expression data. Bioinformatics. 2012; 28(13):1714–20.View ArticlePubMedGoogle Scholar
- Barenco M, Tomescu D, Brewer D, Callard R, Stark J, Hubank M. Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biol. 2006; 7(3):25.View ArticleGoogle Scholar
- Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL. A bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics. 2005; 21:349–56.View ArticlePubMedGoogle Scholar
- Hasegawa T, Yamaguchi R, Nagasaki M, Miyano S, Imoto S. Inference of gene regulatory networks incorporating multi-source biological knowledge via a state space model with l1 regularization. PLoS ONE. 2014; 9(8):105942.View ArticleGoogle Scholar
- Hirose O, Yoshida R, Imoto S, Yamaguchi R, Higuchi T, Charnock-Jones DS, et al. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics. 2008; 24:932–42.View ArticlePubMedGoogle Scholar
- Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, et al. Modeling t-cell activation using gene expression profiling and state-space models. Bioinformatics. 2004; 20:1361–72.View ArticlePubMedGoogle Scholar
- Sabatti C, James GM. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics. 2006; 22(6):739–46.View ArticlePubMedGoogle Scholar
- Asif HMS, Sanguinetti G. Large-scale learning of combinatorial transcriptional dynamics from gene expression. Bioinformatics. 2011; 27(9):1277–83.View ArticlePubMedGoogle Scholar
- Eduati F, De Las Rivas J, Di Camillo B, Toffolo G, Saez-Rodriguez J. Integrating literature-constrained and data-driven inference of signalling networks. Bioinformatics. 2012; 28(18):2311–7.View ArticlePubMed CentralPubMedGoogle Scholar
- do Rego TG, Roider HG, de Carvalho FAT, Costa IG. Inferring epigenetic and transcriptional regulation during blood cell development with a mixture of sparse linear models. Bioinformatics. 2012; 28(18):2297–303.View ArticlePubMedGoogle Scholar
- Tian Y, Zhang B, Hoffman E, Clarke R, Zhang Z, Shih I-M, et al. Knowledge-fused differential dependency network models for detecting significant rewiring in biological networks. BMC Syst. Biol. 2014; 8(1):87.View ArticlePubMed CentralPubMedGoogle Scholar
- Barzel B, Barabási A-LL. Network link prediction by global silencing of indirect correlations. Nat Biotechnol. 2013; 31(8):720–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Feizi S, Marbach D, Medard M, Kellis M. Network deconvolution as a general method to distinguish direct dependencies in networks. Nat Biotechnol. 2013; 31(8):726–33.View ArticlePubMed CentralPubMedGoogle Scholar
- Nakajima N, Tamura T, Yamanishi Y, Horimoto K, Akutsu T. Network completion using dynamic programming and least-squares fitting. Sci World J. 2012; 2012:1–8.View ArticleGoogle Scholar
- Wang W, Cherry JM, Nochomovitz Y, Jolly E, Botstein D, Li H. Inference of combinatorial regulation in yeast transcriptional networks: A case study of sporulation. Proc Nat Acad Sci USA. 2005; 102(6):1998–2003.View ArticlePubMed CentralPubMedGoogle Scholar
- Kalman RE. A New Approach to Linear Filtering and Prediction Problems. Trans ASME - J Basic Eng. 1960; 82(Series D):35–45.View ArticleGoogle Scholar
- Shumway RH, Stoffer DS. An approach to time series smoothing and forecasting using the em algorithm. J Time Ser Anal. 1982; 3(4):253–64.View ArticleGoogle Scholar
- Julier SJ, Uhlmann JK. A new extension of the kalman filter to nonlinear systems. In: Proc. of AeroSense: The 11th Int. Symp. on Aerospace/Defense Sensing, Simulations and Controls: 1997. p. 182–193.Google Scholar
- Julier SJ, Uhlmann JK. Unscented filtering and nonlinear estimation. Proc IEEE. 2004; 92(3):401–22.View ArticleGoogle Scholar
- Chow S-M, Ferrer E, Nesselroade JR. An unscented kalman filter approach to the estimation of nonlinear dynamical systems models. Multivariate Behavioral Res. 2007; 42(2):283–321.View ArticleGoogle Scholar
- Hasegawa T, Mori T, Yamaguchi R, Imoto S, Miyano S, Akutsu T. An efficient data assimilation schema for restoration and extension of gene regulatory networks using time-course observation data. J Comput Biol. 2014; 21(11):785–98.View ArticlePubMedGoogle Scholar
- Schäfer J, Strimmer K. An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005; 21(6):754–64.View ArticlePubMedGoogle Scholar
- Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst Biol. 2007; 1(1):37.View ArticlePubMed CentralPubMedGoogle Scholar
- Lébre S. Inferring dynamic genetic networks with low order independencies. Stat App Genet Mol Biol. 2009; 8(1):1–38.Google Scholar
- Kim S, Li H, Dougherty ER, Cao N, Chen Y, Bittner M, et al. Can markov chain models mimic biological regulation. J Biol Syst. 2002; 10(4):337–28093357.View ArticleGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. Kegg for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012; 40(D1):109–14.View ArticleGoogle Scholar
- Almon RR, DuBois DC, Jin JY, Jusko WJ. Temporal profiling of the transcriptional basis for the development of corticosteroid-induced insulin resistance in rat muscle. J Endocrinol. 2005; 184(1):219–32.View ArticlePubMed CentralPubMedGoogle Scholar
- Yao Z, Hoffman EP, Ghimbovschi S, DuBois DC, Almon RR, Jusko WJ. Mathematical modeling of corticosteroid pharmacogenomics in rat muscle following acute and chronic methylprednisolone dosing. Mol Pharm. 2008; 5(2):328–39.View ArticlePubMed CentralPubMedGoogle Scholar
- Shimizu N, Yoshikawa N, Ito N, Maruyama T, Suzuki Y, Takeda S-I, et al. Crosstalk between Glucocorticoid Receptor and Nutritional Sensor mTOR in Skeletal Muscle. Cell Metab. 2011; 13(2):170–82.View ArticlePubMedGoogle Scholar
- Zheng G, Tu K, Yang Q, Xiong Y, Wei C, Xie L, et al. Itfp: an integrated platform of mammalian transcription factors. Bioinformatics. 2008; 24(20):2416–7.View ArticlePubMedGoogle Scholar
- Greenfield A, Hafemeister C, Bonneau R. Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics. 2013; 29(8):1060–7.View ArticlePubMed CentralPubMedGoogle Scholar
- Evensen G. Sequential data assimilation with a nonlinear quasi-geostrophic model using monte carlo methods to forecast error statistics. J Geophys Res. 1994; 99:10143–62.View ArticleGoogle Scholar
- Gordon NJ, Salmond DJ, Smith AFM. Novel approach to nonlinear/non-gaussian bayesian state estimation. IEEE Proc F, Radar Signal Process. 1993; 140(2):107–13.View ArticleGoogle Scholar
- Kitagawa G. Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Space Models. J Comput Graphical Stat. 1996; 5(1):1–25.Google Scholar
- Anderson LJ, Anderson LS. A monte carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Monthly Weather Rev. 1999; 127(12):2741–58.View ArticleGoogle Scholar
- Pham DT. Stochastic methods for sequential data assimilation in strongly nonlinear systems. Monthly Weather Rev. 2001; 129(5):1194–207.View ArticleGoogle Scholar
- Zhao Y, Lu Z. Fourth-moment standardization for structural reliability assessment. J Struct Eng. 2007; 133(7):916–24.View ArticleGoogle Scholar
- Foti D, Iuliano R, Chiefari E, Brunetti A. A nucleoprotein complex containing sp1, c/ebpb, and hmgi-y controls human insulin receptor gene transcription. Mol Cell Biol. 2003; 23(8):2720–32.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.