Bayesian network
We model a pathway network as a Bayesian network, that is, a directed acyclic graph in which each node represents the activity of a gene [26]. Each edge corresponds to an interaction in the protein interaction network and also represents the conditional dependency between the nodes it connects. Genetic interaction experiments do not detect the influence of one gene on another; they measure the impact of mutating the two genes on the phenotype of interest. It is therefore impossible to estimate the conditional probability distributions between the nodes of the Bayesian network, and standard Bayesian learning methods are not applicable. Here, we use only the conditional independence assumptions of Bayesian network theory to construct a network that represents the independence assumptions hidden in the genetic interaction data. As in Ref. [26], these assumptions imply that if gene X is fully epistatic to gene Y, then, given the activity level of X, the fitness level is independent of the activity level of Y. The constructed network can then encode a linear pathway substructure between X and Y in which Y must be the parent node of X; that is, the direction of the edge between them is determined.
Scoring
For a candidate pathway network (Fig. 1b) sampled from the protein interaction network, we score it in terms of the quantitative genetic interaction measurements using the method in Ref. [26]. For every pair of genes, there are four possible topological structures, whose local scores are shown in Fig. 2. Although a larger score indicates a more likely local structure for a gene pair, all four scores are still needed to find the optimal global structure. To improve computational efficiency, we compute the four possible scores for each pair of genes before all other steps.
Using the scoring methods in Fig. 2 and the dataset D of genetic and protein interactions, we compute a local score for every pair of genes in a candidate pathway network N and sum the scores over all pairs to define the global score function f(N), to which the Bayesian network posterior probability distribution p(N|D) is proportional, as shown in eq. (1). In Bayesian network theory [27], a network N with a higher posterior probability, or global score, is more consistent with the data set.
$$ f(N)=\exp \left(\sum_{x,y\in N,\ x\ne y} \mathrm{Score}\left(x,y\right)\right) $$
(1)
Unlike Ref. [26], we do not include a score for each edge in f(N), because every edge in our network represents a protein interaction, which ensures its existence. This avoids the dilemma of how to balance the two scores.
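The global score in eq. (1) can be illustrated with a minimal Python sketch. It assumes that the four local scores of each gene pair have already been precomputed and that a candidate network is summarized by the local structure it implies for each pair. The function name `log_global_score`, the integer structure labels 0 to 3, and all numbers are hypothetical; the actual local scores are those of Fig. 2 and are not reproduced here. The sketch also assumes that each unordered pair is counted once, and it works with log f(N) to avoid numerical overflow.

```python
import math
from itertools import combinations

def log_global_score(genes, structure_of, local_scores):
    """Return log f(N): the sum of local scores over unordered gene pairs."""
    total = 0.0
    for x, y in combinations(sorted(genes), 2):
        label = structure_of[(x, y)]          # local structure that N implies for (x, y)
        total += local_scores[(x, y)][label]  # lookup in the precomputed score table
    return total

# Toy, made-up scores: four candidate local structures per gene pair.
genes = ["g1", "g2", "g3"]
local_scores = {
    ("g1", "g2"): [0.5, -1.0, 0.2, 0.0],
    ("g1", "g3"): [-0.3, 0.8, 0.1, 0.0],
    ("g2", "g3"): [0.0, 0.0, 1.5, -0.5],
}
# Hypothetical candidate network, given as one structure label per pair.
structure_of = {("g1", "g2"): 0, ("g1", "g3"): 1, ("g2", "g3"): 2}

log_f = log_global_score(genes, structure_of, local_scores)
f = math.exp(log_f)  # f(N), proportional to p(N|D)
```

Because the score table is built once, evaluating a new candidate network only requires table lookups, which is the motivation for precomputing the four scores per pair.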
Sampling
We used annealed importance sampling [26, 28] to learn the pathway structure from the above distribution p(N|D) ∝ f(N). The approach assigns weights to pathway networks sampled along simulated annealing schedules, and the weighted samples are then used to compute estimates that converge to the real network structure. It is appropriate for sampling N from a multi-modal distribution p(N|D), abbreviated p(N), because its independent samples avoid some of the convergence and autocorrelation problems of general Markov chain Monte Carlo (MCMC) samplers. Figure 1 outlines the procedure for one annealing run of annealed importance sampling.
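To make the weighting idea concrete, the following Python sketch runs annealed importance sampling on a toy problem. It is not the pathway sampler itself: the state space is ten integers on a ring rather than networks, the bimodal table `LOG_F` stands in for log f(N), the intermediate distributions are assumed to be f raised to powers β between 0 and 1, and the Metropolis move (a step to a neighbouring state) is a placeholder for a network edit. All names and numbers are illustrative assumptions.

```python
import math
import random

# Toy bimodal log f over 10 states (made-up values; modes at states 2 and 7).
LOG_F = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]

def ais_run(log_f, betas, rng, sweeps=5):
    """One annealing run; returns (final_state, log_importance_weight)."""
    n = len(log_f)
    x = rng.randrange(n)          # exact draw from the uniform start (beta = 0)
    log_w = 0.0
    for b_prev, b in zip(betas, betas[1:]):
        # Weight update: ratio of consecutive unnormalized distributions at x.
        log_w += (b - b_prev) * log_f[x]
        # Metropolis moves that leave the distribution f**b invariant.
        for _ in range(sweeps):
            y = (x + rng.choice((-1, 1))) % n        # symmetric proposal on a ring
            log_u = math.log(1.0 - rng.random())     # log of a uniform draw in (0, 1]
            if log_u < b * (log_f[y] - log_f[x]):
                x = y
    return x, log_w

rng = random.Random(0)
betas = [j / 20 for j in range(21)]                  # annealing schedule from 0 to 1
runs = [ais_run(LOG_F, betas, rng) for _ in range(2000)]
weights = [math.exp(lw) for _, lw in runs]

# Weighted estimate of the probability mass on the two modes.
mode_mass = sum(w for (x, _), w in zip(runs, weights) if x in (2, 7)) / sum(weights)
```

Each run starts from an independent draw, so the runs can be executed in parallel, and the importance weights correct for the fact that a finite annealing schedule does not sample the target exactly.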
Pooling
After K annealing runs, the sampler has generated K pathway networks and their weights. We can then compute the confidence for any given substructure s as
$$ C(s)=\frac{\sum_{k=1}^K{\omega}_kI\left(s\subset {N}_k\right)}{\sum_{k=1}^K{\omega}_k} $$
(2)
where I(∙) is the indicator function, N_k is the sample from the kth annealing run, and ω_k is its importance weight. Based on the theory of annealed importance sampling, we can compute the confidences of all structural forms of a gene subset of interest and choose the one with the maximal confidence as the likely detailed pathway structure of the subset.
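The confidence in eq. (2) is a weighted fraction and can be sketched in a few lines of Python. The representation of each sampled network as a set of directed (parent, child) edges, the function name `confidence`, and the toy samples and weights are assumptions made for illustration only.

```python
def confidence(networks, weights, substructure):
    """C(s): weighted fraction of sampled networks that contain every edge of s."""
    s = set(substructure)
    hit = sum(w for net, w in zip(networks, weights) if s <= set(net))
    return hit / sum(weights)

# Three toy samples N_k (sets of directed edges) and their importance weights.
networks = [
    {("Y", "X"), ("X", "Z")},
    {("Y", "X")},
    {("X", "Y"), ("X", "Z")},
]
weights = [2.0, 1.0, 1.0]

# Confidence that Y is the parent of X.
c = confidence(networks, weights, {("Y", "X")})
```

In this toy case the substructure appears in the first two samples, so the confidence is (2.0 + 1.0) / 4.0 = 0.75. Comparing the confidences of the alternative structures over the same gene subset then selects the most supported one.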
Pseudo-code for pathway network reconstruction
Input:
- Matrix P: protein interaction network
- Vector S: single mutation levels
- Matrix D: double mutation levels
- Matrix E: typical value for double mutation levels
- Vector T: temperatures for the annealing run
- Integer K: number of parallel annealing runs
- Some optional parameters
Output: Matrix of directed pathway networks and their weights
Procedure:
Compute all scores for every possible gene pair from the genetic interaction inputs
Compute p(N) from the scores of the gene pairs in N
m = length(T)
Design distributions p_j (j = 0, … , m − 1) (as in Fig. 1) to approach p(N)
For i = 1 to K:
Return networks N_i (i = 1, … , K) and their importance weights
Specify pathways of interest and compute their confidences
The MATLAB code for our algorithm can be freely downloaded from [29].