Molecular interaction data at proteomic and genetic levels provide physical and functional insights into a molecular biosystem and are helpful for the construction of pathway structures complementarily. Despite advances in inferring biological pathways using genetic interaction data, there still exists weakness in developed models, such as, activity pathway networks (APN), when integrating the data from proteomic and genetic levels. It is necessary to develop new methods to infer pathway structure by both of interaction data.
We utilized probabilistic graphical model to develop a new method that integrates genetic interaction and protein interaction data and infers exquisitely detailed pathway structure. We modeled the pathway network as Bayesian network and applied this model to infer pathways for the coherent subsets of the global genetic interaction profiles, and the available data set of endoplasmic reticulum genes. The protein interaction data were derived from the BioGRID database. Our method can accurately reconstruct known cellular pathway structures, including SWR complex, ER-Associated Degradation (ERAD) pathway, N-Glycan biosynthesis pathway, Elongator complex, Retromer complex, and Urmylation pathway. By comparing N-Glycan biosynthesis pathway and Urmylation pathway identified from our approach with that from APN, we found that our method is able to overcome its weakness (certain edges are inexplicable). According to underlying protein interaction network, we defined a simple scoring function that only adopts genetic interaction information to avoid the balance difficulty in the APN. Using the effective stochastic simulation algorithm, the performance of our proposed method is significantly high.
We developed a new method based on Bayesian network to infer detailed pathway structures from interaction data at proteomic and genetic levels. The results indicate that the developed method performs better in predicting signaling pathways than previously described models.
A cellular biological system is controlled by the molecules at different levels, such as protein phosphorylation or genetic variations, and their interactions. Protein interaction (i.e., protein-protein interaction) refers to physical interconnection between two or more proteins that occur in a cell, by which protein components can carry out most of cellular molecular processes . Genetic interaction refers to functional relationship between two genes, which can be measured by the difference between the phenotype levels of double gene mutations and the expected neutral level evaluated by the corresponding single mutant phenotype level [2, 3]. The publicly available data sets, such as Biological General Repository for Interaction Datasets (BioGRID, https://thebiogrid.org/), Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/), Human Protein Reference Database (HPRD, http://www.hprd.org/), Search Tool for the Retrieval of Interacting Genes/Proteins (STRING: http://www.string-db.org/) and so on, collect thousands of proteins and a few genetic interactions from several of species.
Given a great deal of these interaction data collected, it is of challenges to elucidate biological meaning behind the data, especially to identify biological pathways underlying the data . A few methods and tools have been developed to predict signaling pathways based on protein interaction networks [5,6,7,8]. Several different studies utilized various biological data to discover regulatory networks [9,10,11,12] and reconstruct metabolic networks [13,14,15,16]. There are other methods that uncover pathway networks by integrating protein-protein interaction data and gene expression data [17,18,19]. In genetic interaction studies, the most important method is cluster analysis, grouping genes by the similar genetic interaction profiles [20,21,22]. Some other studies focus on aggravating or alleviating relationships between related gene groups [23,24,25]. In order to automatically identify detailed pathway structures using high-throughput genetic interaction data, the activity pathway network (APN) was developed . However, these available approaches cannot fully take advantages of the complementarity between protein and genetic interaction data to infer the biological pathway structures.
In this paper, we present a Bayesian model that integrates high-throughput protein and genetic interaction data to reconstruct detailed biological pathway structures. The model can organize related genes into the corresponding pathways, arrange the order of genes within each pathway, and decide the orientation of each interconnection. Based on protein interaction network, the model predicts detailed pathway structures by using genetic interaction information to delete redundancy edges and reorient the kept edges in the network. Similar to APN , our model represents a biological pathway network as a Bayesian network , in which each node presents the activity of a gene product. Different from APN that drew network sample from complete network, our method introducing protein interaction networks as underlying pathway structures. In addition, a scoring function is defined by gene pairwise score, which can avoid the unadjusted balance between gene pairwise score and edge score in the APN. Thus, our model is able to improve computational efficiency of stochastic simulation algorithm and overcome the limitation of APN that some edges in the results are difficult to interpret. In our model, each edge in the network can capture physical docking, and represent functional dependency.
We model a pathway network as a Bayesian network that is a directed acyclic graph. The activity of a gene is assigned to a node in the network . The edge in the network is an interaction in protein interaction network. Additionally, it presents the conditional dependency between the nodes connected as well. The experiments of genetic interaction are not for detection of the influence between pairwise genes but for measurement of impact of mutation of these two genes on phenotype of interest. Thus, it is impossible to evaluate conditional probability distribution between the nodes of the Bayesian network, and the standard Bayesian learning methods lost their efficacy. Here, we only utilize conditional independence assumptions of the Bayesian network theory to construct a network that can represent independence assumptions hidden in the gene interaction data. As in Ref. , based on the independence assumptions, it is elucidated that given the activity level of X, the fitness level is independent of the activity level of Y, if gene X is fully epistatic to gene Y. The constructed network can encode a linear pathway substructure between X and Y, in which Y must be the father node of X, that is, the direction of edge between is decided.
For a candidate pathway network (Fig. 1b) sampled from protein interaction network, we score it in term of genetic interaction quantitative measurement using method in Ref. . For every pair of genes, there are four topological structures and their local scores shown in Fig. 2. Despite the larger score indicating the more possible local structure for each gene pair, we still need every one of four scores to find the optimal global structure. We computed the four possible scores for each pair of genes before all the steps to improve computation efficiency.
Using the scoring methods in Fig. 2 and dataset D of genetic interaction and protein interaction, we can compute a local score for every pair of genes in a candidate pathway network N, and sum up all of the scores for all pairs to define the global score function f(N), to which the Bayesian network posterior probability distribution p(N|D) is proportional, shown as eq. (1). In Bayesian network theory , a network N with the higher posterior probability or global score should be more accord with the data set.
Different from study of Ref. , we do not include every edge score in f(N), because the edge in our network represents protein interaction that insures its existence. Then, it avoids the dilemma how to adjust the balance between the two scores.
We utilized annealed importance sampling [26, 28] to learn the pathway structure by the above distribution p (N| D) ∝f(N). The annealed importance sampling approach can assign weights to pathway networks sampled by simulated annealing schedules, then to evaluate that converge to the real network structure. The approach is appropriate for sampling N from multi-modal distributions p (N| D) or abbreviated to p(N), since its independent sampling method can overcome some problems of convergence and autocorrelation in general Markov chain Monte Carlo (MCMC) samplers. Figure 1 presents the brief procedure of an annealing run of the annealed importance sampling.
After K annealing runs, the sampler generates K pathway networks and their weights. Then we can compute the confidence for any given substructure s, shown as
Where I(∙) is the indicator function, Nk is the sample at the kth annealing run, and ωk is the important weight. Based on the theory of annealed importance sampling, we can compute confidences of all structure forms of an interesting gene subset, and choose the maximal one as the possible detailed pathway structure of the subset.
Pseudo-code for pathway network reconstruction
Input: Matrix P: protein interaction network
Vector S: signal mutation levels
Matrix D: double mutation levels
Matrix E: typical value for double mutation levels
Vector T: temperatures for the annealing run
Integer K: number of parallel annealing runs
Some optional parameters
Output: Matrix of directed pathway networks and their weights
Compute all scores for every possible gene pair by inputs of genetic interaction data
Compute p(N) by scores of gene pairs in N
m = length(TV)
Design distributions pj(j = 0, … , m − 1) (as Fig. 1) to approach P(N)
For i = 1 to K:
Sample initial network nm − 1 from uniform distribution pm − 1
For j = m-2 to 0:
Generate nj from nj − 1 by uniform distribution over P
Accept nj according to Metropolis–Hastings algorithm by Tj and pj
Update importance weight
Save network Ni and its weight ωi
Return networks Ni (i = 1, … , K) and their importance weights
Specify interesting pathways and compute their confidence
The MATLAB codes of our algorithm can be freely downloaded at .
Results and discussion
We applied our developed method to the genetic interaction measured by the protein folding in the endoplasmic reticulum  and the corresponding protein interaction network. The genetic interaction data set contains 444 queries crossed to the same 444 array strains from the budding yeast, Saccharomyces cerevisiae, and keeps available 86,396 raw double mutants from the 444 × 444 genetic interaction pairs . Another genetic interaction data are from the coherent subsets of the global genetic interaction network , including 198 single mutants and 30,256 double mutants. We used regression method to predict the missing genetic interaction data from known genetic interaction profiles. The protein interaction data of the above gene set are downloaded from BioGRID till December 2016.
Due to the fact that the raw measurements of genetic interaction data are limited in publicly available databases, we applied our developed method to an available data set from Ref. . Though there are some raw measurement data sets in Refs. [2, 30], either smaller number of samples or the higher sparsity makes it infeasible to apply our method to these data sets. That also explains why few available methods were designed to reconstruct pathways by integrating genetic interaction and protein interaction data. We compared our results with those predicted by the APN to validate the advantage of our method.
In our method, we modeled the pathway network as a Bayesian network. The sampling algorithm of annealed importance is applied to curate networks with the probability distribution defined by genetic interaction data, and simultaneously assign weights to them. And the corresponding protein interaction network of the genes in genetic interactions was used to represent underlying sample population, interpreting existence of potential edges in the sampled networks. Using these sampled networks and their assigned weights, we can estimate the detailed structure of the gene subset with high confidence (see Methods). Two substructures reconstructed by our method are shown in Fig. 3. Though the genetic interaction data for SWR complex are not complete, our approach still pools the existing genes together (Fig. 3a). It precisely reconstructs the known functional dependencies of ERAD pathway (Fig. 3b).
We compared N-Glycan biosynthesis pathway substructure reconstructed by our model with the result of APN (Fig. 4). The detailed structures of the pathway from our model (Fig. 4b) and APN (Fig. 4c)  are very similar. Both of them are similar to the true one in Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.kegg.jp). One obvious difference is the place of OST3 that is incorrectly placed in APN (Fig. 4c). It may be due to the scoring function of APN based on edge score that strengthens the confidence of edge (ALG3, OST3). The edge from ALG3 to OST3 has a high confidence, 0.65, indicating that APN really cannot interpret some edges in its result. Moreover, the orders of genes from our model and APN may be reverse to the true one  because the mechanisms of genetic interaction dependency are represented by phenotype (the unfolded protein response or fitness). Intriguingly, the OST3 position is correctly predicted in our method. It indicates the power of our developed method by integration of protein interaction data. However, we still found the limitation of the protein interaction data. The edge (CWH41, DIE2) is not presented in our result, because the corresponding protein interaction is not found in currently used protein interaction databases. In future, we are planning to include more predicted protein interaction data from STRING, and design parallel computing in high-performance computers to improve the performance.
We also applied our model to infer pathways from another available data set of a global genetic interaction profiles . From about 5.4 million gene pairs, we only selected coherent subsets in which the gene pairs have the high Pearson correlation coefficients, for our method based on annealed importance sampling is not suitable for so large data set. Using our model, we reconstructed three substructures, that is Urmylation pathway (Fig. 5c), Elongator complex (Fig. 3c), and Retromer complex (Fig. 3d). In Fig. 5, we compared our developed method with APN. The edge (NFS1, NCS2) presented in results of APN, as shown in Fig. 5b is difficult to interpret. However, our result in Fig. 5c is consistent with protein information from BioGrid as shown in Fig. 5a. The interactions of UBA4, NFS4, and NCS2 were predicted by our method. The edge (UBA4, AHP1) in Fig. 5d is not inferred by these two methods. For our model, the reason may be the incompleteness of protein interaction network that is the main weakness of our model.
In this paper, we propose a Bayesian network model to identify pathway structures by integrating protein interaction with genetic interaction data. Our approach makes use of the complementarity between protein (physical) and genetic (functional) interaction data to refer the biological pathway structures. We define a scoring function by which the sampling algorithm of annealed importance can draw some pathway networks and their weights that are used to evaluate the candidate pathway structures. The results show that our model can predict the pathway structures more accurately.
Activity pathway networks
Biological general repository for interaction datasets
Human protein reference database
Kyoto encyclopedia of genes and genomes.
Markov chain Monte Carlo
Saccharomyces genome database
Search tool for the retrieval of interacting genes/proteins
De Las RJ, Fontanillo C. Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6:e1000807.
Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003;34:166–76.
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7 Suppl 1:S7.
Grzegorczyk M, Husmeier D. Improvements in the reconstruction of time-varying gene regulatory networks: dynamic programming and regularization by information sharing among genes. Bioinformatics. 2011;27:693–9.
Barba M, Dutoit R, Legrain C, Labedan B. Identifying reaction modules in metabolic pathways: bioinformatic deduction and experimental validation of a new putative route in purine catabolism. BMC Syst Biol. 2013;7:99.
Guillen-Gosalbez G, Sorribas A. Identifying quantitative operation principles in metabolic pathways: a systematic method for searching feasible enzyme activity patterns leading to cellular adaptive responses. BMC Bioinformatics. 2009;10:386.
Shirshin E, Cherkasova O, Tikhonova T, Berlovskaya E, Priezzhev A, Fadeev V. Native fluorescence spectroscopy of blood plasma of rats with experimental diabetes: identifying fingerprints of glucose-related metabolic pathways. J Biomed Opt. 2015;20:051033.
Wang Y, Wu QF, Chen C, Wu LY, Yan XZ, Yu SG, Zhang XS, Liang FR. Revealing metabolite biomarkers for acupuncture treatment by linear programming based feature selection. BMC Syst Biol. 2012;6(Suppl 1):S15.
Schuldiner M, Collins SR, Thompson NJ, Denic V, Bhamidipati A, Punna T, Ihmels J, Andrews B, Boone C, Greenblatt JF, et al. Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell. 2005;123:507–19.
Jonikas MC, Collins SR, Denic V, Oh E, Quan EM, Schmid V, Weibezahn J, Schwappach B, Walter P, Weissman JS, et al. Comprehensive characterization of genes required for protein folding in the endoplasmic reticulum. Science. 2009;323:1693–7.
Qi Y, Suhail Y, Lin YY, Boeke JD, Bader JS. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome Res. 2008;18:1991–2004.
Support for the authors was provided by the National Natural Science Foundation of China (#11371016), the Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) (#IRT_15R58). The publication costs were funded by the National Natural Science Foundation of China (#11371016).
Availability of data and materials
The protein interaction data analyzed during the current study are available in the BioGRID. The genetic interaction data are available in article [22, 30].
CHF came up with the idea of the study, built the model of the study, designed the algorithm, and wrote the manuscript; SD debugged the program, and performed functional and statistical analyses; GXJ assisted in functional, statistical and data analyses, and revised the manuscript; XXW gathered the data, and performed data analyses; ZGY supervised the model building, and statistical computational approaches, and revised the manuscript. All authors read and approved the final version of the manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Fu, C., Deng, S., Jin, G. et al. Bayesian network model for identification of pathways by integrating protein interaction with genetic interaction data.
BMC Syst Biol11
(Suppl 4), 81 (2017). https://doi.org/10.1186/s12918-017-0454-9