Preliminaries
We model a protein interaction network as an undirected, unweighted graph where the nodes are the proteins, and two nodes are connected by an edge if the corresponding proteins are annotated as interacting with each other.
Formally, a graph is given by a set of vertices V and a set of edges E. The degree of a node u ∈ V, denoted by d(u), is the number of edges adjacent to u. A graph is often represented by its adjacency matrix. The adjacency matrix of a graph G = (V, E) is defined by

A(u, v) = 1 if (u, v) ∈ E, and A(u, v) = 0 otherwise.
We can learn a lot about the structure of a graph by taking a random walk on it. A random walk is a process where at each step we move from some node to one of its neighbors. The transition probabilities are given by edge weights, so in the case of an unweighted network the probability of transitioning from u to any adjacent node is 1/d(u). Thus the transition probability matrix (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one:

W = D⁻¹A.
Here the D matrix is the degree matrix, which is a diagonal matrix given by

D(u, u) = d(u), and D(u, v) = 0 for u ≠ v.
In a random walk it is useful to consider a probability distribution vector p over all the nodes in the graph. Here p is a row vector, where p(u) is the probability that we are at node u and Σu∈V p(u) = 1. Because we transition between nodes with probabilities given by W, if pt is the probability distribution vector at time t, then pt+1 = ptW.
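As a concrete illustration, the transition rule pt+1 = ptW can be sketched in a few lines of Python. This is a minimal sketch; the 5-node example graph is hypothetical, not one of the protein networks studied here.

```python
# One step of a random walk on an unweighted, undirected graph.
# The graph below is a hypothetical 5-node example (adjacency lists).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def walk_step(p, graph):
    """Compute p_{t+1} = p_t W, where W(u, v) = 1/d(u) for each neighbor v of u."""
    nxt = {u: 0.0 for u in graph}
    for u, mass in p.items():
        for v in graph[u]:
            nxt[v] += mass / len(graph[u])
    return nxt

p = {u: 0.0 for u in graph}
p[0] = 1.0               # start the walk at node 0
p = walk_step(p, graph)  # now at node 1 or node 2, each with probability 1/2
```

Because each row of W sums to one, the total probability mass is conserved by every step.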
PageRank
A PageRank vector prα(s) is the steady-state probability distribution of a random walk with restart probability α. The starting vector s gives the probability distribution for where the walk transitions after restarting. Formally, prα(s) is the unique solution of the linear system

prα(s) = αs + (1 − α) prα(s)W.
The PageRank vector with a uniform vector for s gives the global PageRank of each vertex. PageRank with non-uniform starting vectors is known as personalized PageRank.
Here we always use a starting vector that has all of its probability in one vertex, defined as follows:

eu(u) = 1, and eu(v) = 0 for v ≠ u.
prα(eu) is thus the steady-state probability distribution of a walk that always returns to u at restart, and we will refer to it as the personalized PageRank vector of u. We will use prα(eu)[v] to denote the amount of probability that v has in prα(eu), and use a shorthand of pr(u → v) for this quantity, dropping the α in the subscript because in our computations it is always fixed. As pointed out in [35], v's global PageRank, denoted by PR(v), satisfies

PR(v) = (1/n) Σu∈V pr(u → v), where n = |V|.
Thus pr(u → v) can be thought of as the contribution that u makes to the PageRank of v.
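The personalized PageRank vector prα(eu) can be approximated numerically by iterating the PageRank equation to its fixed point. A minimal Python sketch, on a hypothetical example graph:

```python
# Power iteration for the personalized PageRank vector pr_alpha(e_u):
# repeatedly apply p <- alpha * e_u + (1 - alpha) * p W until convergence.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def personalized_pagerank(graph, u, alpha=0.15, iters=200):
    p = {v: 0.0 for v in graph}
    p[u] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for w, mass in p.items():
            for nb in graph[w]:
                nxt[nb] += (1 - alpha) * mass / len(graph[w])
        nxt[u] += alpha  # restart: an alpha-fraction of the walk returns to u
        p = nxt
    return p

ppr0 = personalized_pagerank(graph, 0)  # pr(0 -> v) for every vertex v
```

Each entry ppr0[v] is pr(0 → v), the share of steady-state probability that vertex v receives from vertex 0.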
PageRank Affinity
For two vertices u and v we define their PageRank Affinity to be the minimum of the PageRank that u contributes to v and v contributes to u:

pr-aff(u, v) = min(pr(u → v), pr(v → u)).
This quantity can be computed by solving the PageRank equation for prα(eu) and prα(ev), and reporting the minimum of the two PageRank contributions. The restart probability of the random walk (α) must be greater than 0 to ensure that prα(eu) and prα(ev) have unique solutions, and must be much smaller than 1 to prevent the random walk from returning too often to the starting vertex and being too local. We set α to 0.15, which is typical for computations of PageRank.
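This definition can be sketched directly in Python, using power iteration as a stand-in for an exact linear-system solver; the 5-node example graph is hypothetical:

```python
# PageRank Affinity: pr-aff(u, v) = min(pr(u -> v), pr(v -> u)).
# ppr() solves the PageRank equation by power iteration rather than
# by a direct linear solve; the example graph is hypothetical.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def ppr(graph, u, alpha=0.15, iters=200):
    p = {v: 0.0 for v in graph}
    p[u] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for w, mass in p.items():
            for nb in graph[w]:
                nxt[nb] += (1 - alpha) * mass / len(graph[w])
        nxt[u] += alpha
        p = nxt
    return p

def pr_affinity(graph, u, v, alpha=0.15):
    return min(ppr(graph, u, alpha)[v], ppr(graph, v, alpha)[u])

aff = pr_affinity(graph, 0, 3)  # symmetric in u and v by construction
```

Taking the minimum makes the measure symmetric, so pr_affinity(graph, u, v) and pr_affinity(graph, v, u) always agree.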
Approximate PageRank Affinity
We can also use approximate PageRank to compute closeness between nodes. While it is possible to compute exact PageRank vectors for smaller graphs by solving the PageRank equation, it is computationally infeasible to do this for larger networks. To calculate approximate PageRank, we use the ApproximatePR algorithm from [36], which computes an ϵ-approximate PageRank vector for a random walk with restart probability α in time O(1/(ϵα)). An ϵ-approximate PageRank vector for prα(s), denoted by aprα(s), satisfies

prα(s)[S] − ϵ·vol(S) ≤ aprα(s)[S] ≤ prα(s)[S]    (1)

for any subset of vertices S, where p[S] = Σv∈S p[v], and vol(S) = Σv∈S d(v). In other words, the amount of error in the approximate PageRank vector for any subset of vertices is at most the product of ϵ and the sum of degrees of its nodes.
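The flavor of this guarantee can be illustrated with a simplified push-style routine in pure Python. This is not the ApproximatePR algorithm of [36] (which uses a lazy walk and more careful bookkeeping); it is only a sketch of the residual invariant, on a hypothetical example graph: every vertex's unprocessed residual ends below ϵ·d(u).

```python
# Simplified push-style approximation of a personalized PageRank vector.
# Residual mass r is settled into apr an alpha-fraction at a time; a vertex
# is only processed while its residual exceeds eps * d(u), so the final
# residual satisfies r(u) <= eps * d(u) for every vertex u.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def approximate_ppr(graph, s, alpha=0.15, eps=1e-4):
    apr = {u: 0.0 for u in graph}  # approximate PageRank vector
    r = {u: 0.0 for u in graph}    # residual (not yet processed) probability
    r[s] = 1.0
    queue = [s]
    while queue:
        u = queue.pop()
        if r[u] <= eps * len(graph[u]):
            continue
        mass, r[u] = r[u], 0.0
        apr[u] += alpha * mass      # settle an alpha-fraction at u
        for v in graph[u]:          # push the remainder to the neighbors
            r[v] += (1 - alpha) * mass / len(graph[u])
            if r[v] > eps * len(graph[v]):
                queue.append(v)
    return apr, r

apr, resid = approximate_ppr(graph, 0)
```

Probability mass is conserved between apr and the residual, which is what makes the error analysis in terms of ϵ and vertex degrees possible.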
Algorithm Description
We develop an algorithm that approximates PageRank Affinity, using ApproximatePR as a subroutine. Our approximatePRaffinity algorithm takes a query vertex v, an approximation parameter ϵ, and an integer k as input, and returns the k nodes closest to v in the graph. The algorithm is outlined below.
Algorithm 1 approximatePRaffinity(v, ϵ, k)
aprα(ev) = ApproximatePR(v, ϵ)
for each u do
apr(v → u) = aprα(ev)[u]
end for
for each u do
apr(u → v) = (d(v)/d(u))·apr(v → u)
end for
for each u do
affinity(u) = min(apr(u → v), apr(v → u))
end for
return the k vertices with highest affinity scores
We first compute an approximate personalized PageRank vector of v, denoted by aprα(ev), to approximate the amount of PageRank that v gives to each vertex u, denoted by apr(v → u). We then use the observation that for undirected graphs

d(u)·pr(u → v) = d(v)·pr(v → u)    (2)

to approximate the PageRank contribution of each vertex in the graph to v. We then calculate the affinity to v of each vertex u as

affinity(u) = min(apr(u → v), apr(v → u)),

and return the k nodes with highest affinity values. Equation 2 follows from the discussion of computing PageRank contributions in the time-reverse Markov chain in [35], and the fact that in an undirected graph the amount of probability that a vertex has in the stationary distribution of a random walk is proportional to its degree.
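The whole pipeline can be sketched in Python. For self-containment, a power-iteration PPR stands in for ApproximatePR, and the rescaling step applies the undirected-graph identity of Equation 2; the example graph is hypothetical:

```python
# Sketch of approximatePRaffinity: one PPR vector from the query vertex v,
# then the reverse contributions via d(u) * pr(u -> v) = d(v) * pr(v -> u).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def ppr(graph, u, alpha=0.15, iters=200):
    p = {v: 0.0 for v in graph}
    p[u] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for w, mass in p.items():
            for nb in graph[w]:
                nxt[nb] += (1 - alpha) * mass / len(graph[w])
        nxt[u] += alpha
        p = nxt
    return p

def approximate_pr_affinity(graph, v, k, alpha=0.15):
    to_u = ppr(graph, v, alpha)  # pr(v -> u) for every u
    scores = {}
    for u in graph:
        if u == v:
            continue
        # Equation 2 (undirected graphs): pr(u -> v) = (d(v) / d(u)) * pr(v -> u)
        from_u = to_u[u] * len(graph[v]) / len(graph[u])
        scores[u] = min(from_u, to_u[u])
    return sorted(scores, key=scores.get, reverse=True)[:k]

closest = approximate_pr_affinity(graph, 0, 2)  # the two nodes closest to 0
```

Only one PPR computation is needed per query: the contributions in the opposite direction come for free from the degree rescaling.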
It follows from Equation 1 that the amount of error in the probability that u has in the approximate personalized PageRank vector of v is at most ϵ·d(u):

pr(v → u) − ϵ·d(u) ≤ apr(v → u) ≤ pr(v → u).    (3)
We denote by pr-aff(u, v) the exact PageRank Affinity of u and v, and by apr-aff(u, v) the Approximate PageRank Affinity computed by approximatePRaffinity. Using Equations 2 and 3 we can verify that the amount of error in the Approximate PageRank Affinity of vertices u and v is at most the product of ϵ and the larger of their degrees:

pr-aff(u, v) − ϵ·max(d(u), d(v)) ≤ apr-aff(u, v) ≤ pr-aff(u, v).
Runtime Analysis
The approximate PageRank vector computed by ApproximatePR has few non-zero entries. This saves computation time because we do not need to consider vertices with 0 probability in the approximate PageRank vector (they have an affinity of 0). We define the support of a probability distribution vector p, denoted by Supp(p), as the set of all vertices that have non-zero probability in p:

Supp(p) = {v ∈ V : p[v] > 0}.
ApproximatePR computes an approximate PageRank vector with small support, which is useful for large graphs that have many vertices. More specifically, the number of non-zero entries in the approximate PageRank vector is less than 1/(ϵα):

|Supp(aprα(ev))| < 1/(ϵα).

Thus the exact runtime of approximatePRaffinity is the time necessary to compute aprα(ev), which takes O(1/(ϵα)), plus the time necessary to compute the affinity to v of each vertex in Supp(aprα(ev)), which is linear in the size of the support set, plus the time necessary to find the k vertices with largest affinity scores, which takes at most k·|Supp(aprα(ev))|, giving a total runtime of O(1/(ϵα) + k/(ϵα)). Moreover, if we treat α as a constant in this analysis (because we always set it to 0.15), this expression simplifies to O(k/ϵ).
Properties of PageRank Vectors
It is well-known that a PageRank vector can be expressed as a weighted average of random walk vectors [36]:

prα(s) = α Σt≥0 (1 − α)^t sW^t.    (4)

The sW^t term gives the probability distribution of the random walk after t steps. Equation 4 thus shows that in computing PageRank we consider paths of all lengths, with less weight given to longer paths based on the value of α.
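Equation 4 can be checked numerically: a truncated weighted average of walk vectors converges to the same vector as fixed-point iteration of the PageRank equation. A Python sketch on a hypothetical example graph:

```python
# Compare the series form pr_alpha(s) = alpha * sum_t (1-alpha)^t s W^t
# against fixed-point iteration of pr = alpha*s + (1-alpha)*pr*W.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
alpha = 0.15

def step(p, graph):  # p -> pW
    nxt = {u: 0.0 for u in graph}
    for u, mass in p.items():
        for v in graph[u]:
            nxt[v] += mass / len(graph[u])
    return nxt

s = {u: 0.0 for u in graph}
s[0] = 1.0

pr_series = {u: 0.0 for u in graph}  # truncated series, 200 terms
walk = dict(s)
for t in range(200):
    for u in graph:
        pr_series[u] += alpha * (1 - alpha) ** t * walk[u]
    walk = step(walk, graph)

pr_fixed = dict(s)                   # fixed-point iteration
for _ in range(200):
    moved = step(pr_fixed, graph)
    pr_fixed = {u: alpha * s[u] + (1 - alpha) * moved[u] for u in graph}
```

After 200 terms the truncation error is on the order of (1 − α)^200, i.e. negligible, so the two vectors agree to high precision.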
Another important property of PageRank vectors is that if u and v are in the same cluster, both pr(u → v) and pr(v → u) are likely to be high. The quality of a cluster C is measured by the proportion of outgoing edges, known as conductance, which we denote by Φ(C). A cluster of lower conductance is better because its nodes are more connected among themselves than they are with the other nodes in the graph. It is proved in [36] that for any set C, there is a subset of vertices C' ⊆ C, such that for any vertex u ∈ C', the personalized PageRank vector of u, denoted by prα(eu), satisfies

prα(eu)[C] ≥ 1 − Φ(C)/α.

In other words, pr(u → v) = prα(eu)[v] is high on average if u and v are in the same good (low-conductance) cluster C and u ∈ C'. Moreover, the set C' is large, as the sum of degrees of its nodes, denoted by vol(C'), satisfies vol(C') ≥ vol(C)/2.
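Conductance itself is straightforward to compute. A sketch using one common definition, Φ(C) = (number of edges leaving C) / min(vol(C), vol(V∖C)), on a hypothetical example graph:

```python
# Conductance of a vertex set C: edges leaving C divided by the smaller of
# vol(C) and vol(V \ C), where vol is the sum of degrees.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def conductance(graph, C):
    C = set(C)
    cut = sum(1 for u in C for v in graph[u] if v not in C)  # outgoing edges
    vol_C = sum(len(graph[u]) for u in C)
    vol_rest = sum(len(graph[u]) for u in graph) - vol_C
    return cut / min(vol_C, vol_rest)

phi = conductance(graph, {0, 1, 2})  # only the edge (2, 3) leaves the triangle
```

Lower values indicate better (more internally connected) clusters, consistent with the bound above.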
Other Measures of Closeness
In our experiments on protein networks, we compare PageRank Affinity and Approximate PageRank Affinity with several other measures of closeness, which are described below. Some of these measures assign an affinity score to each pair of vertices, while others simply order pairs by their closeness.
Shortest Path and Shortest Path Multiplicity
The shortest path closeness of two vertices is the inverse of the length of the shortest path between them. However, using the length of the shortest path does not allow for much granularity, so we also consider the multiplicity of the shortest path to break ties between pairs that are the same distance apart.
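Both the distance and the number of distinct shortest paths can be read off a single breadth-first search. A Python sketch; the 4-cycle example graph is hypothetical:

```python
# BFS that tracks, for every vertex, the shortest distance from s and the
# number of distinct shortest paths realizing that distance.
from collections import deque

graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}  # a 4-cycle

def shortest_path_info(graph, s, t):
    dist = {s: 0}
    count = {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:             # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = count[u]
                q.append(v)
            elif dist[v] == dist[u] + 1:  # another shortest path into v
                count[v] += count[u]
    return dist.get(t), count.get(t, 0)

d, m = shortest_path_info(graph, 0, 2)  # distance 2, reachable via 1 or via 3
```

The multiplicity m breaks ties between pairs at the same distance, as described above.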
Common Neighbors
A very intuitive measure of closeness of two vertices is the number of neighbors that they share in the graph. In our experiments, we notice that in addition to counting common neighbors, it also helps to take into account whether the two nodes are directly connected, by adding a small constant to their closeness score if this is the case.
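This measure can be sketched in a few lines; the bonus constant for direct adjacency (0.5) is an illustrative choice, not necessarily the value used in the experiments, and the example graph is hypothetical:

```python
# Common-neighbor closeness with a small bonus for direct adjacency.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def common_neighbor_score(graph, u, v, bonus=0.5):
    score = len(set(graph[u]) & set(graph[v]))  # shared neighbors
    if v in graph[u]:                           # directly connected
        score += bonus
    return score
```

For example, vertices 0 and 1 share one neighbor and are adjacent, so they score higher than vertices 0 and 3, which share one neighbor but are not adjacent.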
Partitioning
We also compare with another measure of closeness, motivated by efforts to partition PPI networks and determine overlap with known protein complexes. It is observed that the densest clusters are often the ones that overlap most with known complexes [30]. Therefore, we partition the protein network, and score pairs of vertices that are in the same cluster by the edge density of the cluster. To partition the network, we use Metis [37], a widely used algorithm that finds high-quality, balanced clusters in the graph. Once we partition the network, we consider protein pairs in denser clusters closer than pairs in less dense clusters, because pairs in denser clusters are more likely to be part of the same functional unit. Of course, this approach only allows us to consider a small fraction of the pairs, because we have no way to evaluate the closeness of two proteins assigned to different clusters.
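Scoring a cluster by its edge density can be sketched as follows (Metis itself is not reproduced here; the graph is a hypothetical example, and density is taken as the fraction of possible intra-cluster edges that are present):

```python
# Edge density of a cluster: fraction of possible intra-cluster edges present.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def edge_density(graph, C):
    C = set(C)
    internal = sum(1 for u in C for v in graph[u] if v in C) / 2  # each edge counted twice
    possible = len(C) * (len(C) - 1) / 2
    return internal / possible

density = edge_density(graph, {0, 1, 2})  # the triangle is fully connected
```

Pairs inside denser clusters are then ranked as closer than pairs inside sparser ones.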
Cliques and k-cores
In addition to partitioning the graph and evaluating the edge density of each cluster, we can also search for dense components directly by enumerating maximum cliques and finding k-cores. A k-core is a vertex-induced subgraph where the degree of each node is at least k [38]. We then consider pairs that are part of a larger clique closer than pairs that are part of a smaller clique, and consider pairs that are part of an m-core closer than pairs that are part of an n-core if m > n. However, once again, these measures allow us to evaluate the closeness of only a small number of pairs in the network.
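A k-core can be found by repeatedly peeling away low-degree vertices. A sketch on a hypothetical example graph:

```python
# k-core: iteratively delete vertices whose degree within the surviving
# subgraph is below k; what remains is the maximal k-core.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def k_core(graph, k):
    alive = set(graph)
    changed = True
    while changed:
        changed = False
        for u in list(alive):
            if sum(1 for v in graph[u] if v in alive) < k:
                alive.discard(u)
                changed = True
    return alive

core2 = k_core(graph, 2)  # the triangle {0, 1, 2} survives
```

Peeling is iterative because removing one vertex can drop a neighbor's remaining degree below k.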
Commute time
Another way to assess the closeness of two nodes using a random walk on the graph is to consider the inverse of the commute time between them. The commute time between vertices u and v is the expected number of steps taken for a random walk from u to reach v and return, which is computed as described in [39].
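Commute time can also be sketched via hitting times: h(u, v), the expected number of steps for a walk from u to reach v, satisfies h(v, v) = 0 and h(u, v) = 1 + Σw h(w, v)/d(u) over the neighbors w of u, and the commute time is h(u, v) + h(v, u). This illustrative sketch finds the fixed point by simple value iteration rather than by the computation described in [39]; the 3-node path graph is hypothetical:

```python
# Commute time via hitting times, computed by value iteration on
# h(u, v) = 1 + (1/d(u)) * sum of h(w, v) over neighbors w, with h(v, v) = 0.
graph = {0: [1], 1: [0, 2], 2: [1]}  # a 3-node path

def hitting_time(graph, target, iters=5000):
    h = {u: 0.0 for u in graph}
    for _ in range(iters):
        h = {
            u: 0.0 if u == target
            else 1.0 + sum(h[w] for w in graph[u]) / len(graph[u])
            for u in graph
        }
    return h

def commute_time(graph, u, v):
    return hitting_time(graph, v)[u] + hitting_time(graph, u)[v]

ct = commute_time(graph, 0, 2)
```

On this path graph the result matches the known identity that commute time equals 2|E| times the effective resistance between the two endpoints.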