Maximum expected accuracy (MEA) alignment of biological networks
Let us assume that we have a set of N PPI networks . Each network has a set of nodes and edges , where e
i,j
represents the interaction between nodes v
i
and v
j
in the network . For each pair of PPI networks and , we denote the pairwise node similarity score for a node pair (u
i
, v
j
), where and , as s(u
i
, v
j
). In this study, we use the BLAST bit score between proteins as their node similarity score, but other types of similarity scores based on structural or functional similarity can be also utilized if available.
Suppose is the true alignment of the networks in the set G, which is unknown and needs to be predicted. As in [12, 17], we can define the accuracy of a given network alignment as follows
(1)
where 1 (·) is an indicator function, whose value is 1 if the mapping u
i
~ v
j
is included in the true alignment and 0 otherwise. The given measure assesses the goodness of the alignment based on the relative proportion of correctly aligned nodes. Of course, since the true alignment is not known, the accuracy of a network alignment cannot be measured using (1), hence we cannot directly use this measure to compare different potential alignments to choose the best one. A reasonable alternative would be to estimate the expected accuracy as follows
(2)
where P (u
i
~ v
j
|G) is the posterior alignment probability between the nodes u
i
and v
j
given the set of networks G. Based on this measure, our objective is then to predict the maximum expected accuracy (MEA) network alignment of the networks in G as follows
(3)
A similar MEA approach [18] has been formerly adopted by a number of multiple sequence alignment algorithms, including ProbCons [17], ProbAlign [19], and PicXAA [20–22]. The MEA framework has been shown to be very effective in constructing accurate alignment of multiple biological sequences, making it one of the most popular approaches for sequence alignment. Recently, the MEA approach has been also applied to comparative network analysis, where RESQUE [4] performs MEA-based network querying and SMETANA [12] performs MEA-based multiple network alignment.
Comparing and aligning networks based on context-sensitive random walk
In order to find the alignment that maximizes the expected accuracy defined in (2), we first need an accurate method for estimating the posterior node alignment probability P (u
i
~ v
j
| G). For this purpose, we adopt a context-sensitive random walk (CSRW) model, motivated by the pair hidden Markov model (pair-HMM) that has been widely used in sequence alignment [23]. The pair-HMM provides a simple, yet very effective, mathematical framework for estimating the alignment probability between symbols in different biological sequences. Unlike the traditional HMM, which generates a single symbol sequence, the pair-HMM generates a pair of aligned symbol sequences. Pair-HMM makes transitions between three different internal states M, I
X
, and I
Y
, where the M state emits an aligned pair of symbols, one symbol in sequence X and the other in sequence Y, while I
X
and I
Y
emit an unaligned symbol in sequence X and sequence Y, respectively. Given two biological sequences, the pair-HMM can be used to estimate the probability whether a given symbol pair was jointly emitted at state M, hence should be aligned to each other. This probability can be computed using the forward and backward algorithms and the resulting alignment probability provides us with a measure of confidence about the (biological) relevance between the given symbols (i.e., nucleotides, amino acids).
One of the most important features of pair-HMM is that it properly recognizes that conserved sequence patterns and motifs in different species may contain inserted and/or deleted symbols (often referred to as "indels") and therefore it specifically tries to model these indels. In a similar manner, a mathematical model that can recognize node insertions and deletions in different biological networks that contain conserved subnetwork regions and network motifs may be useful for obtaining a reliable posterior node-to-node alignment probability. Recently, random walk models have been shown to be effective for estimating the node correspondence in different networks [7, 12, 15] in a way that seam-lessly integrates both node similarity and topological similarity. However, the random walk models that were used in previous network alignment algorithms did not explicitly consider indels.
In this work, we adopt a novel context-sensitive random walk model that has been recently proposed to improve on existing models by taking such indels into account [24]. In a way that is conceptually similar to the pair-HMM, the CSRW has three different internal states M, , and , each of which corresponds to a different mode of random walk. At the M state, the random walker simultaneously moves on both networks to enter a pair of "matching" nodes. On the other hand, at the state, the random walker only moves on network to enter a potentially "inserted" node in that may not have a corresponding node in the network . Similarly, at the state, the random walker only moves on to enter a potentially inserted node in . Transitions between states take place in a context-sensitive manner, where the random walker examines the neighboring nodes to determine the mode of random walk. For example, if there are node pairs with significant node similarity (i.e., potential orthologous nodes) in the immediate neighborhood, the CSRW switches to the M state to make a simultaneous move on both networks and randomly enter one of these node pairs. Otherwise, the CSRW switches to either or and performs an individual random walk only on one of the networks. Based on this random walk model, we compute the long-run proportion of time that a given pair of nodes will be simultaneously visited (i.e., at the M state), which can be used to compute a probabilistic correspondence score between these two nodes, as we will describe in the following section.
Estimation of node correspondence scores
Suppose we want to measure the correspondence between nodes that belong to two different networks and , both of which are included in G, the set of PPI networks to be aligned. For every node pair (u
i
, v
j
), where and , our goal is to quantify the level of confidence - which we refer to as the node correspondence score - using the CSRW model discussed earlier. For this purpose, we first construct the transition probability matrix that corresponds to the random walk. Let be the set of node pairs (u
i
, v
j
) with a positive pairwise node similarity score s(u
i
, v
j
)
(4)
We also define the set of non-similar node pairs as follows
(5)
Let the current position of the random walker in the product graph be (u
c
, v
c
), where and . In each time step, the random walker examines the set of similar neighboring nodes to determine its mode of random walk (corresponding to one of the three possible internal states), where is the set of neighbors of the node u
c
in the network and is the set of neighbors of the node v
c
in the network . If there are similar node pairs among the neighboring node pairs, hence is not empty, the random walker switches its internal state to the M state and performs a simultaneous walk on both networks, moving from (u
c
, v
c
) to one of the nodes
. We define the transition probability for this simultaneous walk as follows
(6)
In case there is no similar node pair around the current position of the random walker, that is , the random walker randomly changes its state to either or , and performs an individual walk on the corresponding network or . The probability that a given network will be chosen for an individual random walk is proportional to its size (i.e., number of nodes in the network), which ensures that both networks are equally well-traversed at the I states. The random walker randomly moves to one of the neighboring nodes with equal probability on the selected network, while staying at the same node on the other network. Based on this behavior, the transition probabilities at state are given by
(7a)
for , and the transition probabilities at state are given by
(7b)
for .
Based on the transition probabilities given by (6), (7a), and (7b), we can construct the transition probability matrix P for the random walk on the two networks and . Given P, we can estimate the longrun proportion of time that the random walker spends in each pair of nodes (u
i
, v
j
) by computing the steady state distribution π. In practice, since real PPI networks typically have a relatively small number of interactions (therefore only few edges for most nodes), the resulting transition probability matrix for the CSRW is sparse, which makes it relatively straightforward to compute the steady state distribution using the power method.
In order to increase the computational efficiency of the proposed network alignment method, instead of using the original transition probability matrix P, we use a reduced matrix . The reduced matrix is obtained by removing the rows and columns in P that correspond to node pairs in while keeping only the rows and columns that correspond to node pairs in . After the reduction, is re-normalized to make it a legitimate stochastic matrix. In practice, since the CSRW is designed to spend more time at node pairs with higher similarity, the random walker spends a relatively small amount of time at node-pairs that belong to the set , and using the reduced matrix instead of P only minimally affects the estimated long-run proportion of time spent at . As a result, the difference in terms of network alignment performance that results from replacing the original matrix P by this reduced matrix appears to be small as shown in the supplementary material (see Section S1).
We make one further modification to the CSRW in [24] by allowing the random walker to restart at a new position at each time step with a fixed restart probability λ. Note that a similar "random walk with restart" approach was used by IsoRank [6] and IsoRankN [7], although these algorithms do not utilize the CSRW adopted in our method. We allow the random walker to select its restart position according to the pairwise node similarity, such that node pairs with higher node similarity have higher chance to be the restart position of the random walker. To this aim, we normalize the pairwise node similarity scores so that they sum up to 1. Our final node correspondence score vector c is obtained from a linear combination of the steady-state distribution of the context-sensitive random walker (estimated using the reduced transition probability matrix ) and the normalized node similarity score vector s as follows
(8)
The above formulation, obtained by allowing the CSRW to restart the random walk at a new position, is especially useful when comparing real PPI networks, which are often incomplete and contain many isolated nodes. Simulation results show that the incorporation of the restart scheme can make our CSRW-based alignment method more robust, especially when the available topological data are either unreliable or insufficient for detecting the similarities between networks (see Section S2).
In order to determine the restart probability λ, we first analyze the structure of the reduced product graph of and that contains only similar node pairs included in . Intuitively, it is desirable to increase the restart probability λ if the networks are disconnected and decrease the probability if the networks are well connected. For example, if all the nodes in the reduced product graph are completely disconnected, it is desirable to restart the random walker at every step. Additionally, when we consider the following two cases - (i) most nodes in the product graph are connected and there are only a few disconnected nodes; (ii) the product graph is equally divided into N connected subnetworks of identical size - it would be desirable to assign a higher λ to the latter case. Based on these intuitions, we set the restart probability λ as the ratio of the total number of nodes in the top K% smallest subnetworks to the total number of nodes in the reduced product graph. In this work, we used K = 99% to determine the restart probability λ.
Constructing the multiple network alignment
Once we have computed the node correspondence scores in (8) for every pair of networks in G, we take a greedy approach as in [12] to construct the multiple network alignment. The overall alignment process is as follows. First, in order to improve the reliability of the node correspondence scores, we selectively apply the probabilistic consistent transformation (PCT) defined in [12]. If λ is larger than a predefined threshold λ
t
, we do not apply PCT to the node correspondence scores. A large λ implies that the product graph is ill connected (e.g., containing a large number of isolated nodes), in which case applying the PCT would not be helpful and may in fact make the scores less reliable. This is because the PCT in [12] was developed based on the assumption that the product graphs for all network pairs are relatively well connected. After the potential score refinement step through PCT, we begin with an empty alignment and greedily add aligned node pairs (u
i
, v
j
) to the network alignment, starting from the pairs with the highest node correspondence scores, until there is no other node pair left that can be added without creating inconsistencies in the network alignment. Assuming that the node correspondence scores in (8) obtained by the context-sensitive random walk model with restart accurately reflect the true correspondence between nodes - such that the score is proportional to the posterior node alignment probability - the proposed network alignment scheme can be viewed as a heuristic way to find the MEA alignment of the networks in G.