Signaling network prediction by the Ontology Fingerprint enhanced Bayesian network
© Qin et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Qin et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Despite large amounts of available genomic and proteomic data, predicting the structure and response of signaling networks is still a significant challenge. While statistical method such as Bayesian network has been explored to meet this challenge, employing existing biological knowledge for network prediction is difficult. The objective of this study is to develop a novel approach that integrates prior biological knowledge in the form of the Ontology Fingerprint to infer cell-type-specific signaling networks via data-driven Bayesian network learning; and to further use the trained model to predict cellular responses.
We applied our novel approach to address the Predictive Signaling Network Modeling challenge of the fourth (2009) Dialog for Reverse Engineering Assessment's and Methods (DREAM4) competition. The challenge results showed that our method accurately captured signal transduction of a network of protein kinases and phosphoproteins in that the predicted protein phosphorylation levels under all experimental conditions were highly correlated (R2 = 0.93) with the observed results. Based on the evaluation of the DREAM4 organizer, our team was ranked as one of the top five best performers in predicting network structure and protein phosphorylation activity under test conditions.
Bayesian network can be used to simulate the propagation of signals in cellular systems. Incorporating the Ontology Fingerprint as prior biological knowledge allows us to efficiently infer concise signaling network structure and to accurately predict cellular responses.
New proteomics techniques enabled large-scale experiments that monitor phosphorylation states of many proteins under different physiological stimuli and/or pharmacological treatments. Each measurement captures a static picture of how the cellular signaling network responses to the binding of a ligand to its receptor, but the interconnections among many different ligand-activated pathways are complex and dynamic. Thus, it is of biological importance to infer which signaling path is at work in response to a particular ligand and how pathways "cross-talk" to each other in a cell-type-specific manner, and eventually to develop computational models capable of predicting cellular responses under different stimuli.
One of the most common approaches to signaling network modeling is to represent the dynamic system as a set of ordinary differential equations (ODEs) using mass action kinetics, by which the concentration of species over time can be analyzed [1, 2]. Additionally, when spatial information is important for such modeling, a system of partial differential equations (PDEs) is considered to be more precise tool to model biochemical processes in both space and time dimension [3–5]. ODEs or PDEs mathematically represent signal transduction by introducing many parameters in the model, which becomes impractical for extremely large networks due to the increasing difficulty in parameter estimation . To approach large-scale signaling network modeling, several data-driven methods have emerged and applied to simulate signal transduction: constraint-based network analysis that allows reconstruction of large systems of biochemical reactions in analyzing genome-scale metabolic networks [6–8]; network component analysis (NCA) which incorporates prior knowledge of network topology to infer signaling pathways [9, 10]; partial least squares regression (PLSR) analysis to investigate complicated signaling networks by identifying optimal principle component-based dimensions from a proposed relationship [11–13]. Many of the approaches described above are deterministic models, which are not aimed at accommodating the noise inherent in biologically data . In contrast, Bayesian network analysis is an alternative probabilistic graphical approach to model signaling pathways [15, 16]. Bayesian network, which can explicitly handle the uncertainty of unobserved events [14, 17], provides a compact graphical representation of the joint probability distributions over all random variables, and has been used for reconstruction of signaling networks [18–24].
To assess the current state of the art network inference methods, Columbia University, the New York Academy of Sciences, and the IBM Computational Biology Center have been organizing the Dialogue for Reverse Engineering Assessments and Method (DREAM), an annual international competition to assess methods that infer network structures and predict cellular response to different combination of stimuli from actual experimental data . Challenge 3 of the 2009 DREAM4 competition (will be referred to as DREAM4 challenge) was titled Predictive Signaling Network Modeling and included two tasks. In the first part, a canonical protein phosphorylation network was provided. This network was constructed by combining pathways from different cell types reported in the current literature. The participants were also provided with a dataset of protein phosphorylation measurements collected from HepG2 hepatocellular carcinoma cells that were treated with various stimuli and inhibitors. The task was to induce a HepG2 cell specific protein phosphorylation pathway out of the canonical network and to build a predictive model of how the cell responds to these stimuli and inhibitors. The second part of the challenge was to use this induced pathway to predict the activities of the phosphoproteins under a new set of perturbations.
The provided canonical pathway consists of a union of the known signaling pathways responding to the following ligands TNFα, IL1α, IGF-1, and TGFα (see the Methods section for detailed description). The training data consisted of the activities of seven downstream phosphoproteins measured when cells were treated with four cytokine (and control) stimuli in various combinations with four inhibitors at 0, 30 minutes and 3 hours post-stimulation. The test data was generated similarly, but the cells were treated with different combination of stimuli and inhibitors [26–28].
Our approach to this challenge is to employ an enhanced Bayesian network to identify the most plausible HepG2 specific signaling network and to predict the cellular responses to new stimuli. Bayesian network is a directed acyclic graph (DAG) model representing the probabilistic relationships between a set of random variables . Given a signal transduction pathway such as the canonical network of DREAM4 challenge, a Bayesian network can represent the propagation of cellular signal for the biological network in such a way that the state of a downstream phosphoprotein is determined by the states of its upstream kinases, and their relationships can be quantified by conditional probabilities . We could then transform the task of inducing cell-type-specific network as a task to find a subnetwork within the canonical network that explains the observed data as well as possible--a data-driven structure search problem. It is well known that brute force exhaustive search of Bayesian network structure is intractable  although different heuristic algorithms exist to address the task. However, solely employing these heuristic algorithms in our setting would fail to utilize a wealth of biological knowledge regarding genes and proteins and their relationships. Ignoring this knowledge may result in a Bayesian network that captures the statistical relationships between the states of phosphoproteins perfectly but does not make any biological sense--a phenomenon referred to as equivalent classes of Bayesian networks in the machine learning field [30, 31]. In order to address this problem, we developed a Bayesian network searching algorithm that incorporates prior biological knowledge.
We recently developed the concept of the Ontology Fingerprint from biomedical literature and Gene Ontology (GO) . The Ontology Fingerprint for a gene or a phenotype is a set of GO terms overrepresented in the PubMed abstracts linked to the gene or phenotype, along with these terms' corresponding enrichment p-values . By comparing two genes' Ontology Fingerprints, we can assess their biological relevance quantitatively. Such relevance can be used to assess gene-gene connections for model selection in Bayesian network-based signaling network prediction. Incorporating this information accelerates the network search process and helps to identify biologically sound connections in predicting signaling networks, eventually leading to better models. We thus developed an enhanced Bayesian network method by incorporating the Ontology Fingerprint for model selection. This novel approach was used to predict a signaling network for the DREAM 4 challenge and performed very well, indicating ontology and prior biological knowledge can make a significant contribution to signaling network predictions.
In order to incorporate independent biological knowledge to learn the network structure, we evaluated the degree of biological relevance between genes by using the gene-gene similarity scores derived from their Ontology Fingerprints; the pairwise similarity scores among the 40 nodes were calculated. The detailed procedures of constructing Ontology Fingerprint were described in . Specifically, we downloaded and processed the June 13th, 2007 version of GO to extract GO terms and their descriptions. The 2007 version of PubMed abstracts in XML format was also downloaded and processed to extract the PubMed ID and the text of each abstract. The links between PubMed abstracts and genes were obtained from the NCBI "pubmed2gene" file. Abstracts that contained GO terms were identified by exact string match. We also labeled the abstracts containing a GO term with all of the term's parent terms. In addition, each abstract was labeled with a GO term only once regardless of how many times the term occurred in the abstract. The ontology fingerprints were derived from 178,687 abstracts linked to at least one human gene. In total, we constructed Ontology Fingerprints for 25,357 human genes using 5,001 ontology terms mapped to the PubMed abstracts that linked to human genes.
Where v i denotes the activity reading of observed node i, s i denotes its state; μ i ,0 and μ i ,1 represent the average activity reading of node i at sate 0 and state 1 respectively; σ i,0 and σ i,1 represent the variance of activity readings of node i at sate 0 and state 1 respectively. The fluorescent measurements of the seven observed nodes are modeled using a mixture of signals produced by phosphorylated and unphosphorylated proteins.
where pa(s i ) denotes the set of parent nodes of node i, and j denotes one of i's parent nodes; s j represents the state of j; β i,0 is the interception and β i,j is the logistic regression coefficient between node i and its parent node j.
The number of all possible network structures of a Bayesian network G (Equation (3)) is super exponential [38, 39] with respect to the number of nodes. Thus, exhaustive search of all possible structures is intractable. In this study, we developed a heuristic approach to utilize prior biological knowledge to guide a stochastic search of biologically plausible candidate graphs, equivalent to selecting networks with higher prior p(M). Based on these candidate networks, we further performed a data-driven search of network structure through parameterization. We identified an optimal cell-type-specific network for HepG2 cells by combining the networks that were preferentially selected based on prior knowledge and that explained the observed data well.
Using the provided canonical network as a starting point, we explored the space of the cell-type-specific networks by stochastically adding and deleting edges. The edge selection was based on the available prior biological knowledge in order to search for network structures that are more biologically sensible. To this end, we employed the Ontology Fingerprint  to represent the prior knowledge of proteins of interest. The Ontology Fingerprint of a gene provides the characteristics of the cellular component, molecular function, or biological process captured in the literature with a quantitative measure. By comparing two genes' Ontology Fingerprints using a modified inner product, a similarity score is generated to quantify the gene-gene relationship--the higher the score, the more the two genes are biologically relevant. We used these similarity scores to guide the exploration of model space of possible networks.
We calculated the similarity scores for all pairs of 40 genes in the canonical pathway. The similarity score was used to assess whether an edge should be added or deleted in the canonical network: edges linking two genes with strong biological relevance (i.e. high similarity score) will be added into the network with a higher chance, while edges with weak biological relevance and weak data support will be deleted from the network with a higher chance. Figure 2 shows the heuristic rules of network search. The candidate graphs were then used to infer the parameters by applying the EM algorithm.
Given a candidate network produced in the aforementioned space exploration, we further evaluated if the model explains the observed experimental data well by calculating the term p(D|M) in Equation (4). This involves learning the parameters of the network model and considering all possible combination of parameterization of the model to derive the marginal probability p(D|M). In this study, we employed LASSO logistic regression to perform regularized (aka Bayesian) estimation of parameters. We also used the Bayesian information criteria (BIC)  as a surrogate of the marginal probability of the network to assess the goodness of fit of the models. In addition, we took advantage of the fact that, when the logistic regression parameter between a target phosphoprotein and one of its parents is set to zero by the Lasso logistic regression, we can effectively delete the edge between these two proteins--searching for network model through parameterization.
Logistic regression was then used in the M-step to estimate the parameters of the generalized linear model. In order to reduce the search space, LASSO regression implemented in the LARS package from R  was applied in the final round of the EM algorithm to determine whether to perform regularization. This would set certain parameters to zero between a parent-child protein pair in the candidate network [44, 45] while retaining the edges that were sufficient to model the observed data. Lasso regression could thus reduce the number of edges in networks that have weak or duplicated effect on signaling cascade.
To predict the fluorescent signals of 7 phosphoproteins in response to cytokine stimuli under 40 testing conditions, the phosphorylation states of these proteins were sampled using the aforementioned EM algorithms (E step only) and the belief propagation algorithm. The fluorescent signals were then simulated by mixture of the signals of proteins in both phosphorylated and unphosphorylated states defined in Equation (1). We generated 50 samples of the activation state for each protein node according to its posterior probability and each sample predicted the strength of fluorescent signal of the monitored proteins from the learned normal distribution conditioned on sampled states. The final prediction was then produced by averaging the predicted measurements of the observed nodes across all samples.
The task of learning cell-type-specific network is equivalent to determining which subset of vertices and edges from the canonical network should be retained for that cell type. We addressed the task of learning network structure through combining prior knowledge and experimental data in the following steps: 1) stochastically exploring candidate network structures based on prior knowledge; 2) training candidate Bayesian network using experimental data, which further modifies network structure through parameterization, i.e., setting the parameters associated with certain edges to the values that would be equivalent to deleting these edges; and 3) selecting the network model that best simulates the experimental results. A Bayesian network can also readily simulate the propagation of a signal in the system using a belief propagation algorithm , which can predict the system's response to cellular stimuli.
The novelty of our approach is to update the network by leveraging prior biological knowledge captured in the Ontology Fingerprints  in order to efficiently search for better network structure. The similarity of the Ontology Fingerprints of a pair of genes captures their biological relevance, e.g. whether they participate in a common biology process within a common biological setting such as the same cell type. Therefore, two genes with similar Ontology Fingerprints are more likely to cooperatively work in a common biological environment than those that are not. This information could be used as prior knowledge to preferentially retain or reject the edges in the canonical network in a principled manner.
The predicted network represents a biologically plausible signaling pathway specific to HepG2 cells, partially due to the novel graph search algorithm based on the Ontology Fingerprints. For instance, the connections between IKK and IKB tended to be kept during graph updating due to the relatively high similarity of their Ontology Fingerprints, with the similarity score ranking above the 80th percentile. In contrast, the connection between ERK1/2 and HSP2/7 was deleted with a high probability since their similarity score lies on the 30th percentile. Overall, the model updating process based on the novel graph search algorithm seamlessly included prior biological knowledge embedded in the literature and GO. Based on the training data of HepG2 cell, employing LASSO regression  in learning Bayesian network parameters further identifies main paths specifically transducing the signal in this cell type, resulting in a sparse network.
Our results also indicate that Bayesian network is particularly suitable for modeling cellular signal transduction in that principled statistical inference algorithms, e.g., the belief propagation algorithm, enabled us to represent hidden variables (nodes without observations) in the graph and to infer detailed signal transduction in the pathway. In contrast, other modeling approaches reported at the DREAM4 conference, e.g., methods based biochemical systems theory , usually ignore all hidden variables to reduce the complexity of network modeling and parameter estimation at the cost of missing intermediate information. The full network predicted by our approach consists of 37 nodes connected by totally 47 edges, and each edge is associated with a parameter that quantifies the relationship of the signal propagated from the parent node to its child node (Figure 3B). In this network, twenty-four nodes are hidden but our inference algorithm correctly inferred their states and relationships between the nodes in the network. For instance, the directed edge from RAS to RAF, both of which are hidden nodes, was assigned with a positive coefficient (β = 53.12), indicating that RAS plays a strong activation role for RAF1. This infer-ence agrees with previous findings [48–50] that RAF1 is a critical RAS effector target, and its activation is a consequence of RAS activation and the formation of RAS-GTP-RAF1 complex. The positive connection between another pair of hidden nodes, from RAC to PAK (β = 16.49), is also consistent with the fact that PAK is the downstream effector of RAC [51, 52]. Zimmerman and Moelling  suggested that AKT-mediated phosphorylation of RAF1 leads to the inhibition of the Raf-MEK-ERK cascade and the modulation of the cellular response [54, 55]. Indeed, our algorithm correctly captured this relationship, which assigned a negative coefficient (β = -1.17E-12) to the edge from AKT to RAF1 in our predicted network. While the low coefficient may reflect the fact that the inference used the measurement performed on two distal nodes--AKT and ERK, the negative value is indeed consistent with the known inhibitory effect. These evidences demonstrate that our approach can utilize the limited observed data to infer the signal transduction of the full network, even though the state of certain nodes are not observed.
Using the predicted HepG2 specific network and the learned parameters, we then predicted the phosphoprotein activity levels of the 7 proteins under the test conditions given by the DREAM 4 Challenge. The predicated phosphoprotein activities were evaluated against experimental measurement by the organizers of DREAM4 challenge using two criteria: first, the accuracy evaluated by a prediction cost function (sum of squared errors over all the predictions); second, network parsimony. Our group (Team 451) ranked within the top five (#4 or #5 depending on different DREAM4 ranking methods) among all submissions for this challenge (http://wiki.c2b2.columbia.edu/dream/results/DREAM4/?c=3_1). This outcome demonstrates that incorporating prior biological knowledge in the form of the Ontology Fingerprint with statistical algorithms for graph searching and parameter estimation can significantly outperform many other approaches for signaling network inference. Our results also demonstrate a novel way to integrate ontological data and literature in learning signaling network construction, as well as the feasibility of applying ontology as biological information in other challenging data-mining problems.
A signaling network is a complex and dynamic system that governs biological activities and coordinates cellular functions [56, 57]. Defects in signal transduction are responsible for diseases such as cancer, autoimmunity, and diabetes . By understanding signaling networks, mechanisms of diseases can be investigated more specifically, and the disease could be targeted and treated more efficiently. Moreover, different cell types often activate different parts of signaling networks, resulting in different responses to the same perturbation. In this study, we addressed the DREAM4 challenge of predicting signaling networks using two innovative approaches: 1) by incorporating prior knowledge in the form of the Ontology Fingerprint, we efficiently and preferentially search biologically plausible models, and 2) by using LASSO regression, we unified the Bayesian network parameter learning and structure learning in a data-driven manner. These improvements are principled from a statistical learning point of view and sensible from a biological point of view.
Participants of the DREAM4 challenge developed various computational approaches to model the signaling network and predict their cellular responses to different stimuli. Dynamic mathematical modeling implemented in a system of differential equations is one of the mainstream approaches [59, 60]. The method represents signal transduction as detailed and biochemically realistic mathematical equations with the need to estimate many free parameters. However, the parameter estimation becomes extremely challenge as the number of species in the network increases . To circumvent this pitfall, one of the participant teams using this approach omitted all hidden nodes, i.e. species not subjected to experimental manipulation or measurement. Such simplification resulted in missing information of network topology and intermediate signal transduction. An alternative approach is to depict the signaling pathway as a logical model and utilize a two-state discrete (Boolean) logic to approximate the signal propagation in the network. However, the Boolean model is a deterministic approach not rigorous enough to capture real biological events. Furthermore, this model also involved node compression process to remove non-identifiable elements .
By contrast, Bayesian network analysis represents an effective mean to encode both the prior knowledge of network topology and the probabilistic dependency in signaling networks [19, 61]. This approach has the advantage of being able to handle hidden nodes in a principled manner and to model mixed information of both the noisy continuous measurements and the discrete regulatory logic by modeling these nodes as latent variables and infer novel signaling paths from observed data. Such advantage is particularly useful in real world application where experimental measurements are expansive and limited to certain selected proteins. The utility of these data can be maximized by using latent variables to infer novel signaling paths that contain proteins not been measured. However, the application of Bayesian network in real world modeling is limited due to the super exponential space one has to search in order to identify the optimal model . Compared with other approaches applied in the DREAM4 challenge, our approach has several significant advantages: 1) it is able to predict the discrete state of proteins in a probabilistic manner under different stimuli, without the requirement of node compression; 2) the incorporation of prior biological knowledge embedded in the Ontology Fingerprint accelerates the search for optimal network topology, in other words, it increases the probability of obtaining an optimal network within limited learning time; 3) the Ontology Fingerprint enhanced network search process makes the inferred network more biologically sensible; 4) the LASSO model regularization method efficiently assist the search for a sparse network.
Our algorithm was further improved by embedding biological information from the Ontology Fingerprint into the learning stage of the Bayesian network modeling. This was accomplished through the introduction of prior distributions for the variables. The seamless integration of prior knowledge into the Bayesian network framework allowed us to construct a cell-type specific signal transduction pathway and to use the pathway to predict novel perturbation outcomes in the DREAM4 competition. The Ontology Fingerprint derived from PubMed literature and biomedical ontology serve as a comprehensive characterization of genes. Compared to current gene annotation, the Ontology Fingerprints were generated by a largely unsupervised method, thus do not need well-annotated corpus which is difficult to assemble. In addition, the enrichment p-value associated with each ontology term in an Ontology Fingerprint can be used as a quantitative measure of biological relevance between genes--a feature that is lacking in current gene annotations. This comprehensive and quantitative characterization of genes works well as prior knowledge in our graph searching strategy. In contrast, commonly used graph searching algorithms, such as genetic algorithms, only rely on a randomized exhaustive search that is not able to utilize useful prior information. This limitation not only makes these algorithms inefficient in searching the plausible model space but also potentially lead to networks that are biologically irrelevant.
To assess the contribution of the Ontology Fingerprints to Bayesian network learning algorithm, we compared the likelihoods of Bayesian networks iteratively updated with or without the guidance of prior knowledge derived from the Ontology Fingerprints. Starting with the canonical network, we iteratively updated network structure until a fixed number of networks were obtained. The converged likelihood of each network was obtained by Monte Carlo EM algorithm (MCEM) . The likelihoods from Ontology Fingerprint-guided network update were significantly higher than those without the guide (Wilcoxon signed-rank test, p-value = 3.4 × 10-2). In addition, we investigated the performance of Ontology Fingerprint enhanced Bayesian network in eliminating biologically irrelevant relationships from the network. We randomly added edges with similarity scores of zero into the canonical network, and considered the new network as a noisy network. Starting with this noisy network, we performed the same comparison as described above, and the resulting likelihoods from Ontology Fingerprint-guided network update were also significantly higher than the update process without prior knowledge (Wilcoxon signed-rank test, p-value = 1.5 × 10-3). Furthermore, the network update with prior knowledge successfully identified and eliminated noisy edges quickly at the first several iterations. These results demonstrated that integrating the Ontology Fingerprint as prior knowledge can speed up the convergence of likelihood, resulting in the increased efficiency of both identifying optimal network structure and retaining biological meaningful connections in the final network.
In addition to prior knowledge, our approach also employed the LASSO technique  to select a plausible model in a data driven manner. LASSO is one of the regularization algorithms originally proposed for linear regression models, and has become a popular model shrinkage and selection method. The LASSO method combines shrinkage and model selection by automatically setting certain regression coefficients to zero . This approach effectively deleted certain candidate edges between signaling molecules, and helped to remove redundant variables to obtain a concise model in the final step.
By incorporating prior biological knowledge, utilizing advanced statistical method for parameter estimation and modeling unobserved nodes as latent variables, we developed a novel approach to infer active signaling networks from experimental data and a canonical network. Our results demonstrated that these improvements allow us to predict signaling network structure and responses that match closely to those identified by experimental approaches.
This work is partly supported by PhRMA Foundation Research Starter Grant, Computational Biology Core of 1 UL1 RR029882-01, R01GM063265-09S1; P20 RR017677-10 and a pilot project from 5P20RR017696-05 (WJZ), as well as grants 5R01LM010144 and 5R01LM009153 (XL). LCT was supported by NLM training grant 5-T15-LM007438-02. TQ was supported by PhRMA Foundation Research Starter Grant, NIH/NCRR 5P20RR017677-10, NIH/NIGMS R01GM063265-09S1 and T32GM074934 07. KJS was funded by Grant 5K12GM081265-03, an Institutional Research and Academic Career Development Award (IRACDA) program from NIGMS. We would like to thank Dr. John Schwacke for providing us with the R code to generate the plot of protein phosphorylation activity.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.