 Research
 Open Access
 Published:
Fusing gene expressions and transitive proteinprotein interactions for inference of gene regulatory networks
BMC Systems Biology volume 13, Article number: 37 (2019)
Abstract
Background
Systematic fusion of multiple data sources for Gene Regulatory Networks (GRN) inference remains a key challenge in systems biology. We incorporate information from proteinprotein interaction networks (PPIN) into the process of GRN inference from gene expression (GE) data. However, existing PPIN remain sparse and transitive protein interactions can help predict missing protein interactions. We therefore propose a systematic probabilistic framework on fusing GE data and transitive protein interaction data to coherently build GRN.
Results
We use a Gaussian Mixture Model (GMM) to softcluster GE data, allowing overlapping cluster memberships. Next, a heuristic method is proposed to extend sparse PPIN by incorporating transitive linkages. We then propose a novel way to score extended protein interactions by combining topological properties of PPIN and correlations of GE. Following this, GE data and extended PPIN are fused using a Gaussian Hidden Markov Model (GHMM) in order to identify gene regulatory pathways and refine interaction scores that are then used to constrain the GRN structure. We employ a Bayesian Gaussian Mixture (BGM) model to refine the GRN derived from GE data by using the structural priors derived from GHMM. Experiments on real yeast regulatory networks demonstrate both the feasibility of the extended PPIN in predicting transitive protein interactions and its effectiveness on improving the coverage and accuracy the proposed method of fusing PPIN and GE to build GRN.
Conclusion
The GE and PPIN fusion model outperforms both the stateoftheart single data source models (CLR, GENIE3, TIGRESS) as well as existing fusion models under various constraints.
Background
Gene regulations describe the interactions among genes during cellular activity. Through regulation, genes orchestrate the level of synthesized mRNA and thereby control the expression of other genes and the rates at which proteins are produced, eventually deciding the state of the cell. Gene expression (GE) microarrays provide quantitative or semiquantitative data on the cell state at a specific time and condition. By “reverseengineering” GE data, regulatory interactions among genes can be identified and gene regulatory network can be mapped using computational methods [1, 2].
Vast majority of functional analysis approaches to modelling microarray GE data assume that genes with similar expression profiles have similar cellular functions [3–5]. A molecular pathway is a set of genes that activate together to achieve a specific task and thus share similar expression profiles. In this paper, we use a datadriven method  the modelbased clustering  to model genes in distinct pathways. Specifically, we model each pathway as a Gaussian model as it allows modelling correlations among gene expressions in a datadriven way. It is better suited for situations where the prior knowledge of the regulatory pathways is unknown. In addition, because genes naturally participate in more than one regulatory pathway, softclustering is used to allowed so that genes can have memberships in multiple pathways. Hence, we adopt the Gaussian mixture model (GMM) on GE data so that different regulatory pathways can be identified.
The rationale behind clustering is that coexpressed genes, i.e., genes in the same cluster are more likely to be functionally related and belong to the same cluster. However, regulatory processes of the genes in a cluster could not necessarily be direct as it could refer to an indirect regulation via proteins, metabolites, or ncRNAs. In cases where two interaction partners are transcription factors or where two proteins are in the same complex, the interactions are direct. In order to identify indirect regulations in GRN, evidences from multiple data sources should be used. For example, medical literature, proteinprotein interaction (PPI) data, gene ontology, etc., have all been used to supplement wet lab data in the inference of GRN [6–9]. When more than one source are available, an essential step is to optimally combine evidences from multiple sources to derive a coherent GRN [10–12].
Since proteins are products of genes, protein interactions provide useful evidence for gene regulation. PPIN data have been fused with GE data for GRN inference in previous studies [13–17]. Most of these works considered only binary links of PPIN: if the link is consistent with the predicted edge from GE, the link is accepted as a true regulation. This approach throws away valuable information, so an accurate quantitative scoring scheme is needed to evaluate consistency between PPIN and gene regulation. On the other hand, existing PPINs are sparse and many real protein interactions are missing in current PPIN databases. Suggested by previous PPI prediction works [18, 19], there exist a large number of interactions between proteins in complexes, which have not yet been observed or recorded in current PPIN. We therefore propose a heuristic to quantitatively extend sparse PPIN by using transitive linkages. We then propose a novel way to score protein interactions by combining topological properties of extended PPIN and correlations of GE. Our experiments demonstrate that transitive protein interactions indeed play an important role in predicting protein interactions. We fuse the extended PPIN scores with GE data, using a Gaussian hidden Markov model (GHMM) to identify gene regulatory pathways, which are found to more consistent with PPIN than those produced by GMM. We further refine PPIN confidence scores by including gene interaction scores from GHMM, which makes the PPIN score more consistent with the existing GRN.
Since there exists no widely accepted model that universally fits GE data well [20–23], and different models capture different GE properties leading to different or complementary GRN structures [21], fusion of different models should lead to better GRN. The GHMM identifies regulatory pathways and obtain possible interacting genes by considering linear correlations between genes but misses conditional dependencies, i.e., nonlinear relations, among genes in the same regulatory pathway. The Bayesian network (BN) model is good at capturing these conditional dependences but suffers from poor computational efficiency. We thus propose a systematic probabilistic framework that fuse these two models and derive coherent GRN closer to biological reality. Specifically, our framework takes a coarsetofine approach: GHMM generates regulatory pathways (i.e., a coarse GRN having high coverage) and obtains refined interaction scores, both of which are then used to constrain the GRN structure of a BN model (i.e., the Bayesian Gaussian mixture model). This generates GRN that are of good coverage and high precision. Furthermore, GRN structural constrains help greatly reduce the search space for BGM model, thereby reducing the overall computational complexity. Figure 1 shows the flow chart of our GRN inference process.
Methods
SoftClustering of GE by Gaussian mixture model
Gaussian mixture model
Since gene expression measurement can be viewed as an expression of every gene over all the possible pathways, we employ a Gaussian mixture model (GMM) [24] to describe gene expression (GE) data. For simplicity, we assume that the expression of genes in a pathway follows a Gaussian model, and gene expression data are generated by a finite mixture of underlying probability distribution, that is, by multivariate normal distributions. The key difference between our work and existing GMM hard clustering [13, 24] of GE data lies in our assumption that each gene can participate in multiple pathways. That is, we allow softclustering membership for genes so that they can participate in multiple pathways. When one gene participates multiple pathways, it can have softclustering membership values for different pathways, representing its contributions of gene expressions in the Gaussian model in different pathways.
Given a gene expression dataset, let \(\mathcal {I}=\{i\}_{i=1}^{I}\) denote the set of genes, \(\mathcal {L}=\{l\}_{l=1}^{L}\) be the set of pathways, and \(\gamma _{i}\in \mathcal {L}\) denote the regulatory pathways in which gene i participates, i.e., the pathway assignment for gene i. The regulatory pathway assignments of gene set \(\mathcal {I}\) is denoted by the set \(\Gamma =\{\gamma _{i}\}_{i=1}^{I}\). The pathway assignment variable γ_{i} follows a multinomial distribution parameterized by vector \(\Theta =\{\theta _{l}\}_{l=1}^{L}\), with assignment probabilities p(γ_{i}=l)=θ_{l}, θ_{l}∈[0,1] where \(\sum _{l=1}^{L} \theta _{l}=1\). We assume that the set of weights \(\Theta =\{\theta _{l}\}_{l=1}^{L}\) follow a Dirichlet distribution p(Θ)=Dir(α_{1},…,α_{L}) with Dirichlet weights having uniform priors.
Given gene expression data \(\mathcal {X}=\{x_{i}\}_{i\in \mathcal {I}}\), each instance i has J continuousvalued attributes \(x_{i}=\{x_{ij}\}_{j=1}^{J}\), where x_{ij} represents the gene expression value measured for gene i in experiment j out of a total of J experiments. We represent each pathway with a Gaussian model wherein the tuple representing the gene expression levels of each experiment (conditioned on a pathway) is a multivariate sample. The probability of observing a given tuple of gene expression levels conditioned on a pathway is p(x_{i}γ_{i}=l)∼N(μ_{l},Σ_{l}), where μ_{l} is the mean vector and Σ_{l} is the J×J covariance matrix.
The pathway assignments are determined by \(p(\Gamma \mathcal {X})\propto p(\mathcal {X},\Gamma)\). Assuming independence between genes, the joint distribution over \(\mathcal {X}\) and pathway assignments Γ is
Learning parameters of the Gaussian mixture model
Expectation Maximization (EM) algorithm [25] is used to estimate the parameters of the joint likelihood in (1). The EM algorithm generates a sequence of parameter approximations that eventually maximizes the observed likelihood. The parameters are initialized to random values. The tth iteration comprises the following two steps.
Estep: Compute the probability of gene {i:i=1,…,I} conditioned on regulatory pathway {l:l=1,…,L} by using current parameter estimates as:
This requires I×L computations.
Mstep: Reestimate the GMM parameters \(\left \{\theta _{l}, \mu _{l}, \Sigma _{l}\right \}_{l=1}^{L}\) for all L pathways as:
Once EM procedure converges, we accumulate the joint probability of each pair of genes (i,i^{′}) over all the pathways. This gives the pairwise interaction probability of two genes \(G_{i,i^{\prime }}\), which measures how likely both genes belong to the same pathway (assuming that genes are mutually independent):
where τ_{il}=p(x_{i},γ_{i}=l) be the joint probability of gene i belonging to the lth regulatory pathway. That is, from genes’ pathway assignments matrix τ={τ_{il}}_{I×L}, we obtain the first version of GRN as a multiplication of τ and its transpose τ^{T}:
Note that \(G = \left \{G_{i,i^{\prime }}\right \}\) consists of the probabilities of gene pairs belonging to the same regulatory pathways. Each entry \(G_{i,i^{\prime }}\) denotes the interaction probability of gene i and i^{′}, estimated from GE data.
The number of mixture components L can be determined by the componentwise EM algorithm [25] automatically, instead of trying all possible L∈[L_{min},L_{max}] via the timeconsuming EM algorithm. The idea is to make use of the nonincreasing property of \(p(\mathcal {X},\Gamma \mathcal {M}_{L})\) with respect to L, implying that the minimum message length (MML) criterion \(\text {MML}(\mathcal {X},\Gamma,\mathcal {M}_{L})\) also decreases with L. Starting from L=L_{max}, we run the EM algorithm and compute the MML. At the next iteration, L is decremented by eliminating the smallest or empty components (setting the smallest θ_{l} as 0). The E and M steps are repeated to compute an updated MML value at each iteration until \(\text {MML}(\mathcal {X},\Gamma,\mathcal {M}_{L})\) converges. In this way, not all values of L need to be evaluated, saving unnecessary computations.
Structural inferences from PPIN
Implicit PPI derived from transitivity
Suggested by [18], there exist a large number of interactions between protein complex components that are not yet observed or recorded. Yamada et al. [19] showed that PPIN evolutionary properties, e.g., shortest path, clustering coefficient, give some clues for potential protein interactions. We therefore propose a heuristic to augment proteininteraction networks (PPIN) by assuming transitivity among known protein interactions. The rationale behind this heuristic is also the base for a lot of similaritybased approaches of predicting GRNs, e.g., the relevance networks based algorithms [26] assumed that “if A is similar to B, then A interacts with B”. Hence, “if A is similar to B, and B is similar to C, then A is similar to C, implying that A interacts with C.” As such, we infer implicit interactions based on the firstorder transitivity assumptions of protein interactions, i.e., if A interacts with B and B interacts with C, we infer that A interacts with C. Suthram et al. [27] designed methods to assign confidence scores of predicted potential protein interactions from multiple data sources, i.e., by combining GE data, literature, and PPIN data. Inspired by their work, we assign confidence scores of implicit proteinprotein interactions (PPI) based only on GE and PPIN data. We use a confidence scoring scheme similar to that of [27]. The novelty of our score lies in its combination of the shortest path score and the Markov clustering score (a graph property) instead of the clustering coefficient.
We extract the shortest path of length \(d_{ii^{\prime }}\) from protein/gene i to protein/gene i^{′} in protein interaction network P whose gene set is \(\mathcal {I}\) by using Dijkstra’s shortest path algorithm [28]. Let \(\mathcal {E}\) denote the set of shortest paths \(d_{ii^{\prime }}<\infty \). When the interaction dataset P is quite sparse, we choose a larger PPI dataset of the same species, whose gene set is a superset of \(\mathcal {I}\). Use of a larger PPI dataset incorporates additional biological hints for the analysis. By choosing the shortest path as a potential PPI, we obtain a higher confidence in deriving inferred implicit interactions. We also avoid the need to consider many paths by introducing all the transitive PPIs. For example, we can avoid the hassle of considering protein interactions in a clique.
Scoring the shortest path
Implicit interactions are designed to have lower scores compared to explicit interactions; scores are assigned inversely proportional to the number of intermediaries, i.e., implicit interactions with a larger number of intermediaries have proportionately lower scores. The path length \(d_{ii^{\prime }}\) indicates the strength of the implicit association between gene i and i^{′}. Let the confidence score c_{d}∈[0,1] be a nonincreasing number series over path length d, which quantifies the implicit protein interaction between i and i^{′}. Clearly, c_{1} should be 1 for 1hop connections. For d≥2, we set c_{d}=ζ^{d−1} where ζ is the probability of extending the path by one hop. We assume that each additional hop is independent of the previous hop.
Using the equivalence assumption, we deduce the closed form approximation for ζ as follows. Suppose there are paths of length d=2,…,∞ between proteins i and i^{′}, then there should be a direct interaction between protein i and i^{′}, i.e., \(\sum _{d=2}^{\infty } c_{d}=1\). Thus, the solution to this infinite summation \(\sum _{d=1}^{\infty }\zeta ^{d}=\frac {\zeta }{1\zeta }=1\) is ζ=0.5. We therefore assign a confidence score of c_{d}=0.5^{d−1} to each shortest path of length d. This score decreases exponentially with the number of hops, e.g., c_{2}=0.5,c_{3}=0.25,c_{4}=0.125, etc.
For a gene pair (i,i^{′}) in \(\mathcal {I}\) with a corresponding shortest transitive path length \(d_{ii^{\prime }}\), a confidence score of \(c_{d_{ii^{\prime }}}\) is assigned omitting \(\sum _{pl>d_{ii^{\prime }}}^{\infty } 0.5^{pl1}\). In fact, this confidence score is a conservative estimate of the real confidence score because it assumes that there exists path lengths of length d>pl. In practice, there may exist only one path between two genes. The shortfall is at most \(\sum _{pl=d>2}^{\infty } 0.5^{pl1}<0.5\). If there exist many paths of varying lengths (overwhelming evidence) between a genes pair, then the confidence score becomes an underestimate. On the other hand, if there exists only one path, which is also the shortest path, then the confidence score is an accurate depiction of the strength of regulation between two genes.
We use the method of [27] to estimate PPI scores by modeling protein interactions as a function of two random variables: (1) the implicit PPI confidence score based on the shortest path and (2) the Pearson correlation coefficient of expression measurements for the corresponding genes. Unlike the approach of [27], which learns the weight of each random variable from a training set of positive and negative examples, we simply use uniform weights. Our formulation involves only two variables: the implicit PPI confidence score and Pearson correlation coefficient derived from GE data. Let W with elements W_{ij} denote the graph with connection strengths computed from these two variables, then
where ρ_{ij} is the correlation of gene i and j in the gene expression data, and d_{ij} is the shortest path length between gene/protein i and j in the PPIN.
Until now, W collected the evidence from GE correlation and transitive proteinprotein interaction for predicting direct genegene interactions. In fact, there are other evidences for genegene interactions. Similar to [29] that used a random walk model to consider the incompleteness of current gene ontology (GO) or PPIN evidences, we propose a random walk model to allow collecting additional evidence in a random fashion for predicting genegene interactions.
Topological connectivity via random walk transitions
The random walk models including Markov clustering algorithm [30] and PageRank [31] have been successfully used to model the link structure of graphs. Likewise, we extract the topological structure of the extended PPIN by using a random walk model instead of modelling with a smallworld clustering coefficient [27].
Given an undirected graph W, random walk transition matrix T is defined as
where c∈(0.5,1) is the fusion parameter (a.k.a. the damping factor in PageRank, typically set to 0.85) that determines the probability of the next transition from one of the outgoing links versus the transition from going to any random link.
Enright et al. [30] has shown that transition matrix T converges quadratically to an equilibrium state representing the topological connectivity of the graph. The converged matrix, denoted by \(\hat {T}={\lim }_{k\rightarrow +\infty }T^{k}\), can be computed by Markov CLustering (MCL) algorithm [30]. In our context, the converged matrix of extended PPIN shows how likely protein pairs are related to one another.
Extended PPIN
Recall that the confidence score of how likely the predicted PPI occurs are inferred from the shortest path evidences from existing database, which can be treated as the confidence from biological knowledge. Here, the topological connectivities of the extended PPIN are inferred from the graph structure or the topological properties by random walk transitivity. We combine both of them to refine the confidence score of how likely all protein pairs will interact with each other in the PPIN as C=W+W∗T.
Combining W from (6) and the transition matrix from (7), the final confidence PPI scores are arrived as follows.
where W_{ij} denotes the confidence score derived from the PPIN database and GE correlations and \(\hat {T}_{kj}\) represents the converged random walk probability. Hence, \(W\times \hat {T}\) yields the confidence on the link structures of the original network. The summation term actually compensates for the confidence score W_{ij} that could have been underestimated in the previous step using the shortest path.
To summarize, score C_{ij} starts from the values estimated from PPIN and GE correlations and then updated with converged values of a random walk model. Thus, the confidence score takes into account topological properties of GRN. The extended PPIN is derived from thresholding the C scores.
Gaussian hidden Markov model
When genes collaborate to achieve a specific task, the corresponding protein products generally interact [13]. The PPI serve as valuable hints to underlying regulations among genes in GRN pathways. To fuse PPI and GE data, we treat gene expressions as observations, pathways as hidden states, and protein interactions as transitions in a Gaussian Hidden Markov Model (GHMM) [24]. In the GHMM, gene expression is the observed variable, the pathway it belongs to is the hidden state, given a hidden state (pathway), the expression of genes in the pathway follows a Gaussian model. The are genegene interactions are treated as transitions between their corresponding hidden states (pathways): if they are in the same pathway, the genepathway assignments are reinforced; otherwise, they are penalized. In this way, the GHMM combines the extended PPIN in the genepathway assignments.
Hidden Markov random field (HMRF) models assume that the conditional distribution of a variable obeys Markov property, i.e., the probability of a variable only depends on the neighbouring variables (see [32] for a complete description of HMRF). In the present context, HMRF graph is represented by a set of nodes where node i represents observation x_{i} with hidden variable γ_{i} and the neighborhood graph is represented by constraints C_{ij}, each of which indicates the edge weight between node i and node j. Specifically, the extended PPIN with confidence score C is used as constraints (prior knowledge) to the GHMM, i.e., they are considered neighbourhood structural priors of the corresponding genegene associations. Figure 2 shows a sample GHMM model with PPI priors.
The prior probability of a particular cluster assignment Γ follows a Gibbs distribution [33] as \(p(\Gamma C)=\frac {1}{Z}\exp \left (\sum _{i=1}^{I}\sum _{j\neq i}^{I} C_{ij}\delta (\gamma _{j}\neq \gamma _{i})\right)\) where δ(·) is the indicator function and \(Z=\sum _{\Gamma }p(\Gamma C)\) is the normalizing function, and C_{ij} is the PPIN confidence score between gene i and j. Exact inference of the posterior requires the complete evaluation of \(p(\Gamma \mathcal {X},C) \propto p(\mathcal {X},\Gamma)p(\Gamma C)\), where \(p(\mathcal {X},\Gamma)\) follows the Gaussian mixture model defined in (1). The posterior [34] probabilities for the multivariate Gaussian case are approximated as
where θ_{l} is the mixture weight and p(x_{i}μ_{l},Σ_{l}) is the probability density function of the component model.
Based on the above, we learn a GHMM by using componentwise EM algorithm of “SoftClustering of GE by Gaussian mixture model” section, from which we derive the GRN from the pathway assignments: G=τ×τ^{T}. The gene interaction score is then refined as R=0.5×G+0.5×C where C is the confidence score of PPIN in (8).
Bayesian Gaussian network model
The GHMM assumes that the genes in a pathway are mutually and conditionally independent and do not consider partial correlation of their mRNA expression levels. To model this intergene dependency in a pathway, we use a Bayesian network (BN), a directed acyclic graph with local conditional distributions. Here, nodes represent genes, proteins, and/or metabolites, while edges represent molecular interactions such as proteinDNA and proteinprotein interactions, including indirect relationships like those from inferred PPI.
BN are prone to overfitting of noisy or sparse training data such as gene expression data. Overfitting could lead to a vastly incorrect graph structure of GRN. Furthermore, since protein levels are unobservable from microarray data, vast majority of BN models of GRN proposed so far only include mRNA levels of genes as nodes but do not include protein levels. Thus, one way to reduce overfitting the BN is to incorporate the prior knowledge extracted from PPI. We thus propose a novel scoring scheme to feed implicit protein interactions (derived from GHMM) as structural priors into BN, thereby enhancing the robustness of predicted GRN.
Another major limitation of BN is its exponentially growing solution space with respect to the size of the network and thus approximate solutions such as those using Monte Carlo Markov Chain (MCMC) or genetic algorithms (GA) have been used. Such solutions are prone to errors and suboptimal. Since gene regulations exist among the genes in the same regulatory pathway, we constrain the parent set of each gene in the BN by first detecting the pathways, thereby greatly reducing the solution search space for subsequent computations. In a GRN, every gene appears in multiple pathways and therefore the relationship between genes in each pathway cannot be solved independently, i.e., by modelling each pathway as an independent BN. We thus use a Bayesian Gaussian Mixture (BGM) [35] that simply implements a BN with Gaussian Mixture observations. Here, observations for the Gaussian mixture reflect experimental observations, i.e., the experiments/observations are hardclustered by using Gaussian Mixture Models. We thus use a MCMC inference method to learn the BGM model with structural priors derived from R scores in “Gaussian hidden Markov model” section, as shown in Fig. 1.
The BN is defined by a graph \(\mathcal {G}\) with a family of conditional probability distributions and their parameters \({\mathcal {Q}=\{q_{i}\}}\), which together specify the joint distribution over the variables \(p\left (\mathcal {X}\mathcal {G},\mathcal {Q}\right)\). The joint distribution in a static BNs is factorized as \( p\left (\mathcal {X}\mathcal {G},\mathcal {Q}\right)=\prod _{i=1}^{I} p(x_{i}\pi _{i},q_{i})\) where each node x_{i} depends only on its parent nodes π_{i}. The parameter matrix \(\mathcal {Q}\) is comprised of I vectors where each vector q_{i} specifies a local probability distribution.
If we assume a Gaussian Bayesian Network (GBN), the parameter vector \(q_{i}=\left \{\mu _{\pi _{i}},\sigma _{\pi _{i}}\right \}\) consists of the mean and standard deviation of the local probability distributions \(p(x_{i}\pi _{i},q_{i})\sim N\left (\mu _{\pi _{i}},\sigma _{\pi _{i}}\right)\). If we assume conditional independence among the J experiments, then \(p\left (x_{i}\pi _{i},q_{i}\right)=\prod _{j=1}^{J} p\left (x_{ij}\pi _{i}=x_{\pi _{i},j},q_{i}\right)\) where \(p\left (x_{ij}\pi _{i}=x_{\pi _{i},j},q_{i}\right)\sim N\left (\mu _{\pi _{i},j},\sigma _{\pi _{i},j}^{2}\right)\). Assuming parameter independence, prior distribution \(p(\mathcal {Q}\mathcal {G})\) of the unknown parameters is expressed in terms of I local prior distributions: \(p(\mathcal {Q}\mathcal {G})=\prod _{i=1}^{I} p(q_{i}\mathcal {G})=\prod _{i=1}^{I} p(q_{i}\pi _{i})\). The marginal likelihood \(p(\mathcal {X}\mathcal {G})\) is thus the integral over the parameter space:
which can be rewritten as \(p(\mathcal {X}\mathcal {G})= \prod _{i=1}^{I} \Psi (x_{i},\pi _{i})\), where \(\Psi (x_{i},\pi _{i})=\int p(q_{i}\pi _{i})p(x_{i}\pi _{i},q_{i}) d q_{i}\).
GE data are generated under a variety of conditions that may include timeseries experiments, i.e., each experiment may be a continuation of preceding experiments. Thus, it is infeasible to assume independence among individual experiments. Taking a step back, we assume the experiments are generated from Gaussian mixtures and adopt a Bayesian Gaussian mixture model to represent GE data.
Suppose there are K mixtures {D^{(k)}} where each mixture D^{(k)}={x_{·,j}}_{I∗m} represents some m attributes of a gene set \(\mathcal {I}\), and {θ_{k}} mixture weights of experiments in the BGM, then we express
where mixture weights {θ_{k}} are estimated with the EM algorithm, assuming independence among genes. Theoretical considerations and more details specifying the GBN parameters estimation can be found in [36].
Constraining the set of parent candidates
The GHMM derived in “Gaussian hidden Markov model” section can provide some structural constraints to BN learning. First, parents of genes should hail from the same regulatory pathway, i.e., if gene i belongs to some regulatory pathways, then other genes in these regulatory pathways could be its potential parents. In this manner, we generate a more accurate parent candidates set for each gene in addition to using the relationship defined in GRN. This is a nice middle ground between an exhaustive search through all the genes for potential parents and the limited set of GRN derived parents.
With the refined confidence score R assigned to every gene pair, the parent candidate set is limited to only topk neighbours or topN gene pairs as determined experimentally. One simple way to determine k and N is to use the average number extracted from a real GRN having a comparable size.
Incorporating neighborhood confidence scores
From [33], the prior probability of a particular structure \(\mathcal {G}\) with constraints R={R_{ij}} follows a Gibbs distribution,
where \(Z=\sum _{\mathcal {G}}p(\mathcal {G}R)\) is a normalizing function.
By using Markov networks framework to represent the correlations between neighbouring links, joint probability of data \(\mathcal {X}\), given the BGM structure \(\mathcal {G}\) and constraints R, can be written as \(p(\mathcal {X},\mathcal {G} R)= p(\mathcal {X}\mathcal {G})p(\mathcal {G}R)\). That is,
Learning BGM with constraints via MCMC
Given GE data \(\mathcal {X}\), fusing association scores R computed from GHMM and fixed parameters \(\mathcal {Q}\), the structural posterior probability is written as
In the context of static BNs, different MCMC methods have been proposed for sampling Directed Acyclic Graphs (DAG) from the structural posterior distribution. We adopt the structural MCMC approach of [35] to sample our BGM structure \(\mathcal {G}\) from posterior distributions \(p(\mathcal {G}\mathcal {X}, \mathcal {Q}, R)\). The idea is to give preference to the structures of higher posterior probability. The details of MCMC method to learn DAG from BGM are described in the supplementary materials. With model averaging [37], i.e., we run MCMC structural learning for a maximum number of iterations and each predicted edge in the GRN is assigned a confidence score that is measured by the number of occurrences of predicted edges among the generated graphs.
Experiments and results
Datasets
Our methods were tested on a dataset consists of 25 yeast genes, similar to those used by similar studies on fusion of GE and PPIN [6, 16]. The only difference is that we added five more genes (CDC28, CLB6, CLN3, FUS3, FKH2) that are highly connected to the 25 genes, based on current biological databases and literature. The 30 genes network involves in the cellcycle regulation of yeast. A cell cycle is comprised of four phases: (i) Gap 1 (G1) phase  the checkpoint to ensure that the cell is ready for division, (ii) Synthesis (S) phase  involving DNA replication, (iii) Gap 2 (G2) phase  a checkpoint to ensure that the cell is ready to enter the next phase, and (iv) Mitotic (M) phase referring to cell division.
We collected data from several resources in order to construct a comprehensive target groundtruth network: (i) GeneNetWaver (http://wwwmgs.bionet.nsc.ru/mgs/gnw/genenet/) [38, 39] that includes 12873 transcriptional regulations among 4441 yeast genes where 62 transcriptional regulations are among our 30 cell cycle genes; (ii) 141 literaturereported regulatory relations among the 30 target genes manually collected from [40, 41], etc.; and (iii) Lee et al. [42] which proposed a integrated functional association score from mRNA expression data, PPIN, and literature mining edges of 5552 yeast genes. The linkage score showed good performance on independent benchmark datasets from KEGG (https://www.genome.jp/kegg/), STRING (https://stringdb.org/), Gene Ontology (http://www.geneontology.org/), and experimentallydetermined subcellular localization. We collected the “IntNet” from [42] with a high likelihood of regulation score above 0.5, which included coexpression regulations from 717 experiments for yeast (divided into 27 experimental categories), proteinprotein interaction experiments, and literature mining of edges. From this network, 166 linkages are among the 30 target genes. In total, the groundtruth network has 317 regulations among 30 cell cycle genes.
The GE data was obtained from [43] that contains 77 experiments collected over 8 yeast cell cycles by using four different synchronisation protocols. The PPI data for Saccharomyces Cerevisiae was downloaded from BioGRID (http://thebiogrid.org/), that contains 6263 proteins and 210,996 interactions.
For comparison, true positives (TP), false positives (FP), and false negatives (FN) of edges were computed by comparing the predicted pathways to the target ground truth network. Various performance metrics including Precision, Recall, and F1score were evaluated. To compare with stateoftheart methods of predicting GRN from GE data, such as CLR [26], GENIE3 [44], TIGRESS [45], we determined AUROC (The area under the receiver operating characteristic (ROC) curve) and AUPR (The area under the precisionrecall (PR) curve) scores as defined in [44]. For these metrics, the best results that are significantly different (pvalue < 0.05) from other methods are shown in bold. We implemented ttest to check whether the best result is significantly different from other methods; If the best result is not significantly different from the next best ones, we choose the best results as one group and the others as another group and then implemented the unpaired ttest for the two groups.
Feasibility of extending PPIN
In “Structural inferences from PPIN” section, we described a method to extend PPIN, using transitive protein interactions and assigned confidence scores C to the predicted protein interactions. We will demonstrate the feasibility of predicting protein interactions from this extended PPIN in this section.
To test how well extended PPIN (derived from C scores) recover an incomplete PPIN, we randomly selected 200 yeast genes and assign 3008 PPIs from BioGRID among them. We used 10fold crossvalidation on the PPINs of these 200 randomly selected genes, by randomly removing 10% edges in the target PPIN, deriving extended PPIN based on the remaining 90% edges, and then comparing the extended PPIN with the target PPIN on the missing 10% edges. We repeated 10fold crossvalidation experiments for 10 times to show the robustness of the method. The results demonstrated that the extended PPIN (when the cutoff threshold for C score is set as 0.5) effectively recover 346 out of the 383 removed edges on average. Further, our extended PPIN based only on 90% PPIN edges predicted the entire PPIN with an average F1score of 43%, indicating the effectiveness of the method of extending PPIN (C score) and in its utility in predicting the missing PPIN.
As BioGrid may contain functional linkages predicted as proteinprotein interactions, we also choose another PPIN dataset to show the reliability of the method of extending PPIN (C score) to predict real PPIs. The yeast PPIN data from http://interactome.dfci.harvard.edu/S_cerevisiae consists of the full set of physical interactions that occur in a physiologically relevant dynamic range between all its macromolecules, including proteinprotein, DNAprotein, and RNAprotein interactions. By combing all the physical interactions, cocomplex membership associations [46] and literaturecurated interactions, [47] collected in “Yeast Interactome Datasets” from which we get 11995 interactions of 2234 yeast proteins where only 100 PPIs are found among the same randomly selected 200 yeast genes. The experiments of 10fold crossvalidation on these 100 PPIs showed that our extended PPIN (when the cutoff threshold for C score is set as 0.55) effectively recovers the removed edges on average. It also shows that the extended PPIN based only on 90% PPIN edges predict the entire PPIN with an average F1score of 36.62% (average precision at 39.56% and average recall at 34.10%). This further validates the reliability of the method of extending PPIN (C score) to predict the real PPIs.
In the BGM with prior model, we added the prior information into the likelihood of gene expression data of the learned BN structure. If the GRN learned from GE data is consistent with prior information, the likelihood is reinforced; otherwise, the likelihood is penalized. Thus, the GRN from the fusion model achieves the maximum consistency between gene expression data and prior information. We choose the informative BioGRID PPINs as priors in our fusion model.
In order to show the effectiveness of predicting gene regulations from extended PPIN, we evaluated the extended PPIN alongside the original PPIN (denoted by “raw PPIN”). As mentioned in “Structural inferences from PPIN” section, we can extend PPIN either from a subnetwork or from the global network where the global network is the PPIN for the whole genome of the same organism and the subnetwork indicates the subnet PPIN defined by the genes in the target gene set. When the PPIN was extended from subnet information, we only consider transitive links among the target gene set. If the subnet PPIN is quite sparse, we choose the global network to make use of transitivity. However, extended PPIN from global information may introduce more noise especially when PPIN is spurious. We experimentally compared differences in the performance of extending PPIN from the subnetwork (i.e., 30gene PPIN for the benchmark) and the global network (i.e., the complete PPIN from 6263 yeast genes) on the 30gene benchmark network. Table 1 shows the performance with three PPINs: raw PPIN, extended PPIN with subnet information, and extended PPIN with global information on predicting the GRN of the 30 benchmark genes. The cutoff thresholds to generate GRN from C score for the extended PPINs were set as 0.5 for a better tradeoff between precision and recall as seen in Fig. 3.
Comparing two extended PPINs, we see that the extended PPIN with global information outperforms the prediction with only subnet information on recall, sacrificing some precision. This indicates that global information may introduce more incoherent genes to make prediction noisy. The observation that both local and global information achieve comparable performance shows that global information can complement subnet information from sparse PPIN data. Clearly, if not much protein interactions are lost in the target PPIN, extended PPIN with subnet information is a good choice to balance the precision and recall, as illustrated by the best F1 performance with extended PPIN from local information. We thus choose extended PPIN with subnet information for the 30gene benchmark network in subsequent experiments.
Comparing extended PPIN with the raw PPIN, we see that the extended PPIN with subnet information yielded a notable 20% improvement on recall compared to raw PPIN at a hefty decrease of 12% in precision. This shows the feasibility of deriving regulations by using the transitivity present in the PPIN. It is reasonable that raw PPIN predicts GRN with the highest precision, indicating how much protein interactions in the current PPIN database are consistent with gene regulation data. The extended PPINs provide better recall/coverage on predicting GRN even with the subnet information, giving the best F1score, indicating the incompleteness of the current PPIN database. The precision of extended PPIN can be improved by raising the C score threshold but the recall will then be decreased at the same time. Since our approach is a coarsetofine framework to predict GRN where the extended PPIN is used to limit candidate gene interaction set (coarse step), a higher coverage or recall is preferred. As such, we choose 0.5 as the cutoff threshold that gave us a reasonable recall. The high false positive rate from the extended PPIN can be reduced in the subsequent “refinement” step. As subsequent experiments demonstrate, our methods predict the final GRN with a high precision. Therefore, what follows next are demonstrated only with extended PPIN data.
Comparison of performances
We benchmarked the generated GRNs against the groundtruth networks. All the methods except the BGM generated networks with undirected edges. When comparing an undirected graph to a directed one, we considered an existence of a matched link regardless of its direction. For comparison, all the networks generated by the BGM were sampled and averaged for 10 models each of which was learned from 1000 iterations of MCMC structural moves. The performance of the BGMs were further improved by increasing the number of models by model averaging and the iterations of structural moves.
Since the generated graphs from C, R, and G assigned confidence scores for the edges, i.e., probabilities of edge existence, we chose a cutoff threshold and constructed the GRN, based on the scores: if the edge score is higher than this threshold, we confirm the existence of an edge; and otherwise, the edge was rejected. From experiments, as seen in Fig. 3, we found that the best cutoff thresholds (for C and R scores) that maximize the F1score (which is also the best tradeoff between precision and recall) of GRN tend to be clustered around 0.5. For GRN structure \(\mathcal {G}\) from clustering or the BGM, scores came from the probability of genes belonging to the same pathway or the model averaging where both reflected how likely genes interact. We therefore intuitively fixed 0.5 as the threshold to determine interaction edges. Thus, for consistency, we set the cutoff thresholds for all our graphs to be 0.5. The confidence scores were normalized between 0.0 and 1.0 before averaging. The thresholds can also be experimentally finetuned to further improve the performances of our methods. The vast majority of existing works on prediction of GRN choose the cutoff of the confidence score as a tradeoff between sensitivity and specificity of prediction [48].
Table 2 compares performances of building GRNs by various methods. The first four methods (CLR, GENIE3, TIGRESS, GMM) are recent stateoftheart methods of predicting GRN from GE. We use MATLAB implementations of CLR [26], GENIE3 [44], and TIGRESS [45]. The other three methods (GHMM, R scores from GHMM, BGM with R scores) are fusion methods that use both the GE and PPIN data as described in the “Methods” section. As seen, all three fusion methods significantly outperformed (5–20% better in all metrics) the methods using GE data only.
To show the effect of extending PPIN by incorporating transitive edges, we also compared three fusion models with and without extending the PPIN (i.e., by using only raw PPIN data). The three fusion methods using extended PPIN outperformed the methods using only raw PPIN. This demonstrates the effectiveness of using extended PPIN in GRN inference.
The GRNs obtained using GHMM and refined R scores are undirected graphs, indicating gene interactions but not specifically the directional regulations. To figure out the directional dependencies among genes, a directional model such as BGM is needed. The BGM model based on R scores as constraints in fact gave the overall best performance and thus deserves further investigation.
BGM with different priors
Using the same set of parent candidates and the BGM model, we evaluated whether the BGM using refined confidence scores R can hold its own when benchmarked against other PPIN fusion methods. Specifically, we compared its performance with two wellknown methods. Nariai et al. [14] tested if each added protein pair form a protein complex by considering the complex as a virtual node in the network. The method computes the principal component analysis (PCA) model of the protein pairs from original observations and checks if the protein complex contributes to a higher likelihood. If a higher likelihood is obtained, the complex (pair) is accepted and otherwise rejected. Imoto et al. [15] updated PPIN whenever the learned GRN produced different structure. Each inconsistent edge in the learned DAG is perturbed by either removing or reversing its direction. If a perturbation leads to a better likelihood, the perturbed PPIN is accepted otherwise it is rejected.
Both methods needed significantly more operations than the basic models and increased already substantial computational complexity. Further, they tend to converge to local optima given their greedy hillclimbing nature. Nariai’s method can only increase the protein interactions contributed by a protein complex if the PCA projection of the complex performs better than using raw data. Imoto’s method is overlyoptimistic as it assumes that the learned DAG is correct. However, the MCMC method is known to only simulate a DAG; it may accept a large number of incorrect edges in one iteration and could accept all of them as updated PPIN in the subsequent iteration. Hence, updating the PPIN based on a learned DAG is unreliable.
To show the utility of each component in the fusion model, we compared the performance of the BGM by using priors from the GMM only, the PPIN only or the GHMM, as shown in Table 3. For the priors from PPIN, we also tested two existing fusion methods with the same priors. We described different components of our method as follows: “BGM (GMM)” denotes the BGM with parental constraints from GMM and predicts GRN only from GE data without PPIN data; “BGM (C scores)” denotes the BGM constrained with C scores from extended PPIN; “Nariai et al. (GHMM)” and “Imoto et al. (GHMM)” use parental constraints from the GHMM with extended PPIN to learn GRNs in BGM with corresponding fusion method; “BGM (R scores)” uses the extended PPIN in the GHMM fusion model and then R constraints for the BGM. For comparison, we also included a BGM without any constraints as the baseline.
The results in Table 3 show that the priors from GHMM significantly improve TPs (highest recall) with comparable F1 scores (higher coverage at the slight expense of precision) compared to the methods using priors only from the PPIN. This indicates that GE clustering help to predict more TPs but at a lower precision. It also demonstrates that the use of the GRN structure from the GMM and extended PPIN in the prediction of GRN. These results further validate the utility of each component in the fusion model. Results in Table 3 show that with same structural constraints, the fusion BGM (with R scores) beats both the BGM proposed by Nariai et al. and by Imoto et al. This means that the R scores are effective measure of the reliability of PPIN edges. Table 3 also shows the runningtime of different algorithms implemented with MATLAB on a machine with an Intel Xeon E51620 3.6GHz CPU and 8GB RAM. As seen, the priors given by our confidence scores lead to computationally more efficient procedures.
Discussion & conclusion
We proposed an automated method to detect gene regulations by fusing GE data and PPI data. Gene expression data is first softclustered into regulatory pathways by a GMM where the number of regulatory pathways is automatically determined via a componentwise EM algorithm. Transitive protein interactions are derived using a novel confidence score and PPIN are refined and extended. The extended PPIN are then fused with the GMM derived from GE data, using a GHMM. This improves the biological relevance of the clustering results and also refines the confidence score of indirect gene regulations/protein interactions. Using refined PPI confidence scores, together with regulatory pathways obtained from the GHMM as structural constraints, a BGM model was used to capture direct gene regulations by using an effective and efficient MCMC procedure. Fusion of the BGM model with GHMM as structural priors generated more accurate GRN compared to those produced by the GHMM and the BGM models. We have experimentally shown that our procedures are more effective than two existing wellknown methods for fusing GE and PPIN data.
Furthermore, the fusion framework reduces the overall time complexity of building GRN with Bayesian networks. The GHMM uses a componentwise EM algorithm to softcluster genes into pathways at a complexity O(I^{2}×L^{2}) (where L denotes the number of components and I denotes the number of genes), indicating that the method scales well on large networks (large I). The BGM structural constraints, i.e., the parent of each node must itself be a node from the same regulatory pathway, help to significantly reduce the search space, i.e., from O(I^{I}) to approximately O((I/L)^{I}). In fact, for larger I, the reduction can even be more pronounced.
Our approach shows the importance of systematically fusing multiple sources of biological evidences in inferring useful and reliable GRN. In fact, our framework can be extended to fuse more than two data sources, such as gene ontologies, biological pathways (KEGG), transcription factors (known regulators), information mined from literature, and other multiomics data [4, 5, 10–12]. Our methods may be easily extended by further considering dependencies among the experiments [35] and are worth investigating further.
Abbreviations
 AUROC:

The area under the receiver operating characteristic (ROC) curve
 AUPR:

The area under the precisionrecall (PR) curve
 BGM:

Bayesian Gaussian mixture
 EM:

Expectation maximization
 GHMM:

Gaussian hidden Markov model
 GRN:

Gene regulatory networks
 HMRF:

Hidden Markov random field
 MCL:

Markov cLustering
 MML:

Minimum message length
 MCMC:

Monte Carlo Markov Chain
 PPIN:

Proteinprotein interaction networks
References
Liu F, Zhang SW, Guo WF, Wei ZG, Chen L. Inference of gene regulatory network based on local bayesian networks. PLoS Comput Biol. 2016; 12(8):e1005024.
Liu ZP. Reverse engineering of genomewide gene regulatory networks from gene expression data. Curr genomics. 2015; 16:3–22.
Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysis. BMC Bioinformatics. 2008; 9:559.
Chan TE, Stumpf MPH, Babtie AC. Gene regulatory network inference from singlecell data using multivariate information measures. Cell Syst. 2017; 5:251–67.
Aibar S, GonzálezBlas CB, Moerman T, HuynhThu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. Scenic: Singlecell regulatory network inference and clustering. Nat Methods. 2017; 14:1083–6.
Bernard A, Hartemink AJ. Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In: Pacific Symposium on Biocomputing.Hawaii:2005. p. 459–470.
Wu M, Chan C. Learning transcriptional regulation on a genome scale: a theoretical analysis based on gene expression data. Brief Bioinform. 2011; 13(2):150–61.
Wang L, Hou L, Qian M, Deng M. Integrating phosphorylation network with transcriptional network reveals novel functional relationships. PLoS ONE. 2012; 7(3):33160.
Cheng C, Alexander R, Min R, et al. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res. 2012; 22(9):1658–67.
Zarayeneh N, Ko E, Oh JH, Suh S, Liu C, Gao J, Kim D, Kang M. Integration of multiomics data for integrative gene regulatory network inference. Int J Data Min Bioinform. 2017; 185:223–39.
Banf M, Rhee SY. Enhancing gene regulatory network inference through data integration with markov random fields. Sci Rep. 2017; 7:1–13.
Yugi K, Kubota H, Hatano A, Kuroda S. Transomics: How to reconstruct biochemical networks across multiple ’omic’ layers. Trends Biotechnol. 2016; 34:276–90.
Segal E, Wang H, Koller D. Discovering molecular pathways from protein interaction and gene expression data. In: International Conference on Intelligent Systems for Molecular Biology.Brisbane:2003. p. 264–272.
Nariai N, Kim S, Imoto S. Using proteinprotein interactions for refining gene networks estimated from microarray data by bayesian networks. In: Pacific Symposium on Biocomputing.Hawaii:2004. p. 336–347.
Imoto S, Higuchi T, Goto T. Error tolerant model for incorporating biological knowledge with expression data in estimating gene networks. Stat Methodol. 2006; 3(1):1–16.
Werhli AV, Husmeier D. Gene regulatory network reconstruction by bayesian integration of prior knowledge and/or different experimental conditions. J Bioinforma Comput Biol. 2008; 6(3):543–72.
Li M, Wu X, Wang J, Pan Y, Towards the identification of protein complexes and functional modules by integrating ppi network and gene expression data. BMC Bioinformatics. 2012; 13:109.
Chen Y, Xu D. Global protein function annotation through mining genomescale data in yeast saccharomyces cerevisiae. Nucleic Acids Res. 2004; 32(21):6414–24.
Yamada T, Bork P. Evolution of biomolecular networks — lessons from metabolic and protein interactions. Nat Rev Mol Cell Biol. 2009; 10(11):791–803.
EmmertStreib F, Glazko GV, Altay G, Simoes RM. Statistical inference and reverse engineering of gene regulatory networks from observational expression data. Front Genet. 2012; 3:8.
Bansal M, Belcastro V, AmbesiImpiombato A. How to infer gene networks from expression profiles. Mol Syst Biol. 2007; 3:78.
Kuffner R, Petri T, Tavakkolkhah P, Windhager L, Zimmer R. Inferring gene regulatory networks by anova. Bioinformatics. 2012; 28(10):1376–82.
Zhang X, Zhao XM, He K, Lu L, Cao Y, Liu J, Hao JK, Liu ZP, Chen L. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2012; 28(1):98–104.
Vignes M, Forbes F. Gene clustering via integrated markov models combining individual and pairwise features. IEEE Trans Comput Biol Bioinforma. 2009; 6:260–70.
Figueiredo MAT, Jain AK. Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell. 2002; 24(3):381–96.
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Largescale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2010; 5(1):8.
Suthram S, Shlomi T, Ruppin E. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics. 2006; 7:360.
Johnson DB. A note on dijkstra’s shortest path algorithm. J Assoc Comput Mach (ACM). 1973; 20(3):385–8.
Yang H, Nepusz T, Paccanaro A. Improving go semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics. 2012; 28:1383–9.
Enright AJ, Dongen S, Ouzounis CA. An efficient algorithm for largescale detection of protein families. Nucleic Acids Res. 2002; 30(7):1575–84.
Page L, Brin S, Motwani R. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab. 1999.
Li SZ. Markov random field modeling in image analysis, 2nd edn. New York: Springer; 2001.
Chandgotia N. Generalisation of the HammersleyClifford theorem on bipartite graphs. Trans Amer Math Soc. 2017; 369:7107–7137.
Lange T, Law MH, Jain AK, Buhmann JM. Learning with constrained and unlabelled data. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Beijing: 2005. p. 731–8.
Grzegorczyk M, Husmeier D, Edwards KD. Modeling nonstationary gene regulatory processes with a nonhomogeneous bayesian network and the allocation sampler. Nucleic Acids Res. 2008; 24(18):2071–8.
Grzegorczyk M. An introduction to gaussian bayesian networks. Methods Mol Biol. 2010; 662:121–47.
Battle A, Jonikas M, Walter P. Automated identification of pathways from quantitative genetic interaction data. Mol Syst Biol. 2010; 6:379.
Ananko EA. Genenet: a database on structure and functional organisation of gene networks. Nucleic Acids Res. 2002; 30(1):398–401.
Schaffter T, Marbach D, Floreano D. Genenetweaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011; 27(16):2263–70.
Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, Young Ra. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001; 106(6):697–708.
Lee TI, Rinaldi NJ, Robert F, et al.Transcriptional regulatory networks in saccharomyces cerevisiae. Science. 2002; 298(5594):799–804.
Lee I, Date SV, Adai AT. A probabilistic functional network of yeast genes. Science. 2004; 306(5701):1555–8.
Spellman PT, Sherlock G, Zhang MQ. Comprehensive identification of cell cycle regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998; 9(12):3273–97.
HuynhThu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using treebased methods. PLoS ONE. 2010; 5:9.
Haury AC, Mordelet F, VeraLicona P, Vert JP. Tigress: Trustful inference of gene regulation using stability selection. BMC Syst Biol. 2012; 6:145.
Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FCP, Weissman JS, Krogan NJ. Toward a comprehensive atlas of the physical interactome of saccharomyces cerevisiae. Mol Cell Proteomics. 2007; 6(3):439–50.
Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon GC, Myers CL, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya OG, Ideker T, Dolinski K, Batada NN, Tyers M. Comprehensive curation and analysis of global interaction networks in saccharomyces cerevisiae. J Biol. 2006; 5(4):11.
Djebbari A, Quackenbush J. Seeded bayesian networks: Constructing genetic networks from microarray data. BMC Syst Biol. 2008; 2:57.
Acknowledgements
Not applicable.
Funding
Publication of this article was sponsored by Tier2 MOE2016T21029 grant by the Ministry of Education, Singapore.
Availability of data and materials
The MATLAB source code are available at https://gitlab.com/liuwt/fusionBGM.
About this supplement
This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume13supplement2.
Author information
Authors and Affiliations
Contributions
WL developed and implemented the method. WL and JCR conceptualised the methods and wrote the manuscript. Both authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liu, W., Rajapakse, J.C. Fusing gene expressions and transitive proteinprotein interactions for inference of gene regulatory networks. BMC Syst Biol 13 (Suppl 2), 37 (2019). https://doi.org/10.1186/s129180190695x
Published:
DOI: https://doi.org/10.1186/s129180190695x
Keywords
 Gene regulatory network (GRN)
 Gene expressions
 Gaussian mixture model (GMM)
 Proteinprotein interaction networks
 Transitive proteinprotein interactions