Our algorithm consists of three main steps: (1) verifying the existence of the weak ties effect in PPI networks; (2) constructing a reliable network by exploring the roles of edges; and (3) identifying protein complexes with a core-attachment based method. We describe them in turn.
Weak ties phenomenon in PPI networks
A network consists of two basic elements: vertices and edges. Many measures have been developed to characterize the structural and functional role of a vertex, including random walk-based indices [43] and the PageRank score [44]. In comparison, the role of edges has been studied less extensively.
Edges in a network usually play one of two roles: some contribute to global connectivity, such as those connecting two clusters, while others enhance locality, such as those inside a cluster. In social networks, the two roles are reflected in two important phenomena, namely homophily [45] and the weak ties effect [46]. Homophily means that connections are more likely to form among individuals with similar backgrounds and common characteristics. The weak ties phenomenon, on the other hand, means that less similar individuals tend to be connected with weaker strength. These weak ties play an important role in maintaining global connectivity. The weak ties phenomenon has been shown to exist in mobile communication networks [39] and document networks [40]. However, the weak ties effect in PPI networks remains to be tested.
To investigate the weak ties effect in PPI networks, we quantify how the topological structure changes during an edge percolation process. Specifically, if the weak ties effect exists in terms of topological similarity, the network disintegrates faster when edges are deleted successively in ascending order of similarity than in descending order. Similar to [40], two measures are employed to quantify how the topological structure changes as edges are removed. The first is the fraction of vertices contained in the giant component, denoted by $R_{GC}$. The second is the normalized susceptibility, defined as
(1)
where s is the size of a connected subgraph, N is the size of the whole network and the sum includes all connected components. An obvious gap occurs when the network disintegrates [47].
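As an illustration of the edge percolation process, the following minimal Python sketch tracks the giant-component fraction while edges are removed in ascending order of a score. The function names and the integer vertex labels are hypothetical. The susceptibility is computed under the assumption, common in the percolation literature, that it sums s^2 over the components other than the largest one and divides by N; Eq.(1) may differ in that detail, so the `exclude_largest` flag is provided.

```python
from collections import defaultdict


def component_sizes(n, edges):
    """Sizes of the connected components of a graph on vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    sizes = defaultdict(int)
    for x in range(n):
        sizes[find(x)] += 1
    return sorted(sizes.values(), reverse=True)


def giant_fraction(n, edges):
    """Fraction of vertices in the largest connected component."""
    return component_sizes(n, edges)[0] / n


def susceptibility(n, edges, exclude_largest=True):
    """Sum of s^2 over components divided by N (assumed form, see text)."""
    sizes = component_sizes(n, edges)
    if exclude_largest:
        sizes = sizes[1:]
    return sum(s * s for s in sizes) / n


def percolation_curve(n, edges, score):
    """Giant-component fraction after removing 0, 1, 2, ... edges,
    taking edges in ascending order of score."""
    ordered = sorted(edges, key=score)
    return [giant_fraction(n, ordered[removed:])
            for removed in range(len(ordered) + 1)]
```

On two triangles joined by a single low-score edge, removing that edge first halves the giant component immediately, which is the behavior the weak ties hypothesis predicts for bridges.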
Prior to studying the weak ties, the bridgeness of an edge should be discussed. In [40] it is defined as
$$B(u, \upsilon) = \frac{\sqrt{C_u C_\upsilon}}{C_{(u,\upsilon)}} \qquad (2)$$
where (u, υ) is the edge with endpoints u and υ, $C_u$ is the size of the maximal clique containing vertex u, and $C_{(u,\upsilon)}$ is the size of the maximal clique containing the edge (u, υ). This index, however, cannot distinguish bridges from non-bridges because it fails to take into account the difference between a pair of vertices. According to Eq.(2), the bridgeness value of every edge in a clique is 1. This is unreasonable because, intuitively, the larger a clique is, the lower the probability that an edge in it is a bridge. For example, edges in a 3-clique are more likely to be bridges than those in an 8-clique.
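The limitation of Eq.(2) can be checked on a toy graph. The sketch below assumes that Eq.(2) takes the form sqrt(C_u C_υ) / C_(u,υ), with each C being the size of the largest maximal clique containing the given vertex or edge. The helper names are hypothetical, and the basic Bron-Kerbosch enumeration used here is only practical for small graphs.

```python
from math import sqrt


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    adj maps each vertex to the set of its neighbors.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def bridgeness_eq2(adj, u, v):
    """Assumed form of Eq.(2); (u, v) must be an edge of the graph."""
    cliques = maximal_cliques(adj)
    c_u = max(len(c) for c in cliques if u in c)
    c_v = max(len(c) for c in cliques if v in c)
    c_uv = max(len(c) for c in cliques if u in c and v in c)
    return sqrt(c_u * c_v) / c_uv


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
```

Under this assumed form, the joining edge (2, 3) scores 1.5, while every edge inside a triangle scores exactly 1, and the same value of 1 would be obtained for edges inside a clique of any size.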
If (u, υ) is a bridge, the roles of vertices u and υ should differ greatly since they belong to different groups, indicating that they are topologically dissimilar. Therefore, a new bridgeness is defined as
(3)
where J(u, υ) is the Jaccard similarity, i.e., $J(u, \upsilon) = |N(u) \cap N(\upsilon)| / |N(u) \cup N(\upsilon)|$ with N(u) being the set of neighbors of vertex u, and $C_{u \setminus \upsilon}$ is the size of the maximal clique containing u but not υ. The factor 1 - J(u, υ) measures the dissimilarity between the two endpoints, while the latter component quantifies the relation between the neighbors of the two endpoints. The physical interpretation of Eq.(3) is that only edges whose endpoints are topologically dissimilar and which maintain the global connectivity are bridges. Compared with Eq.(2), the new index is more reasonable: for example, its value for an edge in an m-clique decreases as the size of the clique increases.
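The dissimilarity factor 1 - J(u, υ) can be illustrated in isolation. The sketch below covers only this factor, not the full index of Eq.(3), and assumes that N(u) excludes u itself; the function names are hypothetical. For an edge inside an isolated m-clique, the two endpoints share m - 2 neighbors out of m vertices in the union, so 1 - J = 2/m, which shrinks as the clique grows.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


def clique(m):
    """Adjacency sets of an isolated m-clique on vertices 0..m-1."""
    return {i: set(range(m)) - {i} for i in range(m)}


def dissimilarity(adj, u, v):
    """The factor 1 - J(u, v)."""
    return 1.0 - jaccard(adj, u, v)
```

For example, `dissimilarity(clique(3), 0, 1)` is about 0.667 while `dissimilarity(clique(8), 0, 1)` is 0.25, in line with the intuition that edges in a 3-clique are more likely to be bridges than edges in an 8-clique.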
Similar to Ref. [39], we quantify the weak ties phenomenon according to an edge percolation process. Generally speaking, if the weak ties phenomenon exists in terms of topological similarity, the network will disintegrate much faster when edges are removed successively in ascending order of similarity than in descending order. Figure 2 (a) shows that $R_{GC}$ decreases much faster when the less similar edges are removed first. As shown in Figure 2 (b), a sharp peak occurs when the edges are removed from the weakest to the strongest, demonstrating the disintegration of the networks involved. Careful comparison of Figure 2 (a)(b) further shows that no percolation phase transition appears in the opposite removal order, since there is no clear peak. These results strongly support the weak ties phenomenon in PPI networks. In addition to the existence of the weak ties phenomenon, we are also interested in quantifying the role of edges in maintaining global connectivity. How well the bridgeness characterizes the weak ties phenomenon is investigated in Figure 2 (c)(d). Figure 2 (c) indicates that $R_{GC}$ decreases much faster when the stronger bridges are removed first. As shown in Figure 2 (d), a sharp peak occurs when the edges are removed from the strongest to the weakest bridge, demonstrating the disintegration of the networks involved. This suggests that the bridgeness is a good alternative for describing tie strength. To make a fair comparison between the index in [40] and ours, we also investigated how the networks change in terms of the bridgeness in Eq.(2), as shown in Figure 2 (e)(f). Comparing Figure 2 (c)(d) with Figure 2 (e)(f), we conclude that the network disintegrates more quickly (bigger gaps in $R_{GC}$ and in the normalized susceptibility) when the novel bridgeness is adopted, indicating that the new index is more effective in characterizing the bridges in networks.
Furthermore, the relation between the topological similarity and the bridgeness is also studied. The topological similarity for a protein pair is defined as
(4)
where A is the adjacency matrix of the network involved, $(A^k)_{ij}$ denotes the number of walks of length k connecting vertices $\upsilon_i$ and $\upsilon_j$, and β is a parameter controlling the relative importance of each component. Long walks receive greater weights when β > 1, while short ones get more attention if β < 1. Here, we set β = 0.618. The result is shown in Figure 3. It demonstrates that there is a negative correlation between bridgeness and topological similarity, i.e., the weaker the similarity between a pair of proteins is, the stronger the bridgeness of the corresponding edge is.
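The role of the matrix powers can be made concrete with a small sketch: entry (i, j) of the k-th power of the adjacency matrix counts the walks of length k between vertices i and j. The second function combines these counts with weights β^k up to a cutoff length; this truncated power weighting is an assumption made for illustration, and the exact form of Eq.(4) may differ (for example, by an additional normalization). All names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_counts(adj_matrix, max_len):
    """Return [A^1, ..., A^max_len]; (A^k)[i][j] counts walks of length k."""
    powers = [adj_matrix]
    for _ in range(max_len - 1):
        powers.append(mat_mul(powers[-1], adj_matrix))
    return powers


def weighted_walk_sum(adj_matrix, beta, max_len):
    """Truncated sum over k of beta**k * (A^k)[i][j] (assumed weighting)."""
    n = len(adj_matrix)
    total = [[0.0] * n for _ in range(n)]
    for k, power in enumerate(walk_counts(adj_matrix, max_len), start=1):
        for i in range(n):
            for j in range(n):
                total[i][j] += beta ** k * power[i][j]
    return total


# Path graph 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
```

On the path graph, the end vertices 0 and 2 are joined by no walk of length 1 and exactly one walk of length 2, so with β < 1 their similarity is smaller than that of the adjacent pair (0, 1).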
Constructing a reliable network
Gavin et al. [8] have pointed out that the core of a complex has relatively more interactions, while the attachments bind to the core proteins to form a biological complex, implying that a core is more densely connected than the complex as a whole.
To assess the topological proximity of a core, the proximity of a pair of vertices should be defined beforehand. The most commonly used measure is the graph distance, that is, the length of the shortest path connecting the pair of vertices. This quantity, however, is not appropriate for biological networks because of two drawbacks: first, it does not take into account the local structural features of the networks; second, it is very susceptible to noise, e.g., a single missing edge affects the proximity significantly. Vertices connected by paths of various lengths are likely to be functionally closer than vertices connected via a single path. Specifically, given an edge (u, υ), it is reasonable to consider that information is transferred from u to υ through channels. The more channels there are, the better the connectivity is. In biological networks, genetic information is transferred through pathways. From the perspective of graph theory, it is natural to consider the channels as the various walks connecting u and υ. Likewise, we also take into consideration the strength of paths: the effect via longer paths with more intermediate vertices is very likely to be weaker than that via shorter ones with fewer intermediaries. Given a walk of length k, say $\upsilon_1 \to \upsilon_2 \to \dots \to \upsilon_{k+1}$, its strength is defined as the product of the weights on the edges in the walk, i.e., $\prod_{i=1}^{k} w_{i,i+1}$, where $w_{i,i+1}$ is the weight on the edge $(\upsilon_i, \upsilon_{i+1})$.
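The strength of a walk, as described above, is simply the product of the edge weights along it. A minimal sketch follows; the dictionary keyed by unordered vertex pairs and the example weights are hypothetical and are chosen only to show that longer walks over edges with weights below 1 contribute less.

```python
from math import prod


def walk_strength(weights, walk):
    """Product of the edge weights along consecutive vertices of a walk.

    weights maps frozenset({a, b}) to the weight of the undirected edge
    (a, b); walk is a sequence of vertices.
    """
    return prod(weights[frozenset((a, b))] for a, b in zip(walk, walk[1:]))


example_weights = {
    frozenset((0, 1)): 0.5,
    frozenset((1, 2)): 0.4,
}
```

With these weights, the walk 0, 1, 2 has strength 0.5 × 0.4, whereas the longer walk 0, 1, 0, 1, 2 between the same endpoints has strength 0.5 × 0.5 × 0.5 × 0.4, a quarter of the former.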
Given an un-weighted PPI network, how to assign weights to edges is one of the key steps in our algorithm. As shown in Figure 3, there is a negative correlation between bridgeness and topological similarity. Thus, a novel strategy for the weight on the interaction (u, υ) based on the bridgeness in Eq.(3) is developed as
(5)
The larger the bridgeness of an interaction is, the smaller its weight is.
We can now define the similarity between a pair of proteins via walks of various lengths. Let $(D^k)_{u\upsilon}$ denote the sum of the strengths of all walks of length k connecting u and υ. Since the connectivity in cores is high, any pair of proteins in the same core should be tightly connected by short walks. Therefore, the similarity for a pair of proteins is the sum of the strengths of the walks connecting them, which generalizes Eq.(4) as
(6)
where W is the matrix with elements $(W)_{ij} = D(i, j)$.
For any protein pair, if the similarity between the two proteins is large enough, we have reason to believe that they should be connected; otherwise, they should be unconnected. In particular, the proteins within a core should be connected to each other. To construct a virtual and reliable network for the original PPI network, similar to [25], we propose the following definition.
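Definition 1 The reliable network $\Phi(G, \tau) = (V_\tau, E_\tau)$ for a PPI network G = (V, E) is the graph with $V_\tau = V$ and $E_\tau = \{(u, \upsilon) \mid u, \upsilon \in V, \psi(S_{u,\upsilon}, \tau) = 1\}$, where ψ(x, τ) is the threshold function that equals 1 if x ≥ τ and 0 otherwise.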
There are two natural physical interpretations of Φ(G, τ). First, if the similarity of a pair of proteins is regarded as the reliability score of the corresponding edge, Φ(G, τ) can be considered a reliable version of the original network. Second, it can be understood as a perturbation of the original network that adds edges between vertices connected by enough short walks and deletes edges between vertex pairs connected by too few short walks.
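The construction in Definition 1 and its reading as a perturbation can be sketched as follows. The sketch assumes that ψ(x, τ) equals 1 when x ≥ τ and 0 otherwise, takes the similarity matrix as given, and uses hypothetical names and toy values.

```python
def reliable_network(similarity, tau):
    """Edges (u, v) with u < v whose similarity reaches the threshold tau."""
    n = len(similarity)
    return {(u, v)
            for u in range(n)
            for v in range(u + 1, n)
            if similarity[u][v] >= tau}


def perturbation(original_edges, similarity, tau):
    """Edges added to and deleted from the original network."""
    virtual = reliable_network(similarity, tau)
    original = {tuple(sorted(e)) for e in original_edges}
    return virtual - original, original - virtual


# Toy symmetric similarity matrix on four proteins (illustrative values).
toy_similarity = [
    [0.0, 0.9, 0.7, 0.1],
    [0.9, 0.0, 0.8, 0.2],
    [0.7, 0.8, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
toy_edges = [(0, 1), (1, 2), (2, 3)]
```

With τ = 0.5, the pair (0, 2) is added because its similarity is high even though no interaction was observed, and the weakly supported interaction (2, 3) is deleted.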
In this way, the core of a protein complex corresponds to a maximal clique in the virtual network. In the following, we design an algorithm to discover complexes by extracting cores and attachments, respectively.
A core-attachment algorithm
The first task is to extract all the maximal cliques in the virtual network, known as the classic all-cliques problem, which is NP-hard [48]. Exact algorithms are therefore impractical because of their complexity, and heuristic algorithms are used instead. The Coach algorithm detects dense subgraphs quickly and accurately from each vertex's neighborhood graph [24]. We adopt the protein-complex core mining algorithm in Coach to approximately identify all cliques in the virtual network Φ(G, τ). Other methods, such as greedy algorithms or tabu search, could also be used to identify the cliques.
Although we adopt the same strategy to detect the cores, our algorithm differs greatly from the Coach algorithm in two respects: first, our algorithm detects cores in a virtual network built on the weak ties phenomenon, while Coach works on the original network; second, the strategies for selecting attachments differ greatly.
Given a core denoted by an induced subgraph G(U), where U is the protein set of the core in the virtual network Φ(G, τ), one crucial step in revealing the attachments is to construct the candidate protein set CS(U). For simplicity, we limit ourselves to the proteins connected to at least one protein in U, i.e., CS(U) = {υ │ υ ∊ V \ U, ∃u ∊ U such that (u, υ) ∊ E}. What remains is to determine the correct membership of each protein υ in CS(U) by exploring the closeness between υ and U. If υ is an attachment of G(U), there should be no protein u ∊ U such that the interaction (u, υ) is a bridge. In other words, there must be many short walks connecting υ and the vertices in U. Thus, we define a new similarity function based on the bridgeness to quantify the closeness of a vertex υ to its core as
(7)
which quantifies the average closeness of υ to U in terms of connectivity. The larger cl(υ, U) is, the more walks connect υ and the core. Thus, a vertex υ ∊ CS(U) is selected as an attachment when it has more connections with U than the average connectivity in N(U).
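The candidate set CS(U) described above can be computed directly from the adjacency structure of the original network. The following is a minimal sketch with hypothetical names; it covers only the construction of the candidates, not the closeness-based selection of attachments.

```python
def candidate_set(adj, core):
    """Proteins outside the core that interact with at least one core protein.

    adj maps each protein to the set of its neighbors in the original
    PPI network; core is the protein set U of a detected core.
    """
    core = set(core)
    return {v for u in core for v in adj[u]} - core


# Toy network: core {0, 1, 2}, protein 3 touches the core, protein 4 does not.
toy_adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
```

Here `candidate_set(toy_adj, {0, 1, 2})` contains only protein 3, since protein 4 has no interaction with any core protein.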
The procedure can be described as follows:
Step 1: Compute the bridgeness for each interaction in PPI network G according to Eq.(3);
Step 2: Compute similarity matrix S based on Eqs.(5)(6);
Step 3: Construct the virtual network Φ(G) with a predefined threshold τ;
Step 4: Extract the cores using Protein-complex core mining algorithm [24];
Step 5: Detect the attachments for each core.
Performance measures
The biological significance of the numerically computed modules can be validated by comparing them with experimentally determined complexes (introduced in the Results section).
F-measure
Let PS (Predicted Set of Complexes) and BS (Benchmark Set of Complexes) be the set of protein complexes predicted by a computational algorithm and the set of real complexes in the benchmark, respectively. $N_{cb}$ is the number of real complexes that match at least one predicted complex, i.e., $N_{cb} = |\{b \mid b \in BS, \exists p \in PS, NA(p, b) \ge t\}|$, where t is the threshold that determines whether two sets match. $N_{cp}$ is the number of predicted complexes that match at least one real complex, i.e., $N_{cp} = |\{p \mid p \in PS, \exists b \in BS, NA(p, b) \ge t\}|$. The F-measure can be used to quantify the closeness between the two complex sets [20]:
$$F = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (8)$$
where $Precision = N_{cp} / |PS|$ and $Recall = N_{cb} / |BS|$ [49].
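The F-measure computation can be sketched as follows. Two assumptions are made here that the text does not state explicitly: the matching score NA(p, b) is taken to be the commonly used neighborhood affinity |p ∩ b|² / (|p| · |b|), and precision and recall are taken as N_cp / |PS| and N_cb / |BS|. The function names are hypothetical, and the threshold t must be supplied by the caller.

```python
def neighborhood_affinity(p, b):
    """Assumed matching score: |p & b|^2 / (|p| * |b|) for non-empty sets."""
    common = len(p & b)
    return common * common / (len(p) * len(b))


def f_measure(predicted, benchmark, t):
    """Harmonic mean of precision and recall under matching threshold t."""
    n_cp = sum(1 for p in predicted
               if any(neighborhood_affinity(p, b) >= t for b in benchmark))
    n_cb = sum(1 for b in benchmark
               if any(neighborhood_affinity(p, b) >= t for p in predicted))
    precision = n_cp / len(predicted)
    recall = n_cb / len(benchmark)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


toy_predicted = [{1, 2, 3}, {7, 8}]
toy_benchmark = [{1, 2, 3, 4}, {5, 6}]
```

In the toy data, only the first predicted complex matches a benchmark complex (affinity 9/12 = 0.75), so with t = 0.5 both precision and recall equal 0.5.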
Coverage rate
The coverage rate assesses how many proteins in the real complexes are covered by the predicted complexes [50, 51]. Specifically, given the set of benchmark complexes BS and the set of predicted complexes PS, a │BS│ × │PS│ matrix T is constructed in which each element $T_{ij}$ is the number of proteins shared by the i-th benchmark complex and the j-th predicted complex. The coverage rate is defined as
$$CR = \frac{\sum_{i=1}^{|BS|} \max_{j} T_{ij}}{\sum_{i=1}^{|BS|} N_i} \qquad (9)$$
where $N_i$ is the number of proteins in the i-th benchmark complex.
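A minimal sketch of the coverage rate follows. It assumes that Eq.(9) takes, for each benchmark complex, the largest overlap with any predicted complex, sums these maxima, and divides by the total number of proteins in the benchmark complexes. The function names and toy sets are hypothetical.

```python
def overlap_matrix(benchmark, predicted):
    """T[i][j]: number of proteins shared by benchmark i and prediction j."""
    return [[len(b & p) for p in predicted] for b in benchmark]


def coverage_rate(benchmark, predicted):
    """Sum of best per-complex overlaps over total benchmark size."""
    t = overlap_matrix(benchmark, predicted)
    covered = sum(max(row, default=0) for row in t)
    return covered / sum(len(b) for b in benchmark)


cr_benchmark = [{1, 2, 3, 4}, {5, 6}]
cr_predicted = [{1, 2, 3}, {5, 9}]
```

For the toy sets, T is [[3, 0], [0, 1]], so 4 of the 6 benchmark proteins are covered.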
P-value
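The P-value [18] is employed to assess the statistical significance of a predicted cluster with respect to a functional group. Specifically, given a cluster C with k proteins in a functional group F, the P-value is defined as
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{|F|}{i} \binom{|V| - |F|}{|C| - i}}{\binom{|V|}{|C|}} \qquad (10)$$
where │V│ denotes the size of the PPI network involved.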
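The P-value can be sketched under the assumption that Eq.(10) is the usual hypergeometric tail probability, that is, the chance that a random cluster of the same size contains at least k proteins of the functional group. The function and parameter names are hypothetical.

```python
from math import comb


def p_value(network_size, group_size, cluster_size, k):
    """Probability that at least k of cluster_size randomly drawn proteins
    belong to a functional group of group_size proteins, out of
    network_size proteins in total (assumed hypergeometric form)."""
    total = comb(network_size, cluster_size)
    below = sum(comb(group_size, i)
                * comb(network_size - group_size, cluster_size - i)
                for i in range(k))
    return 1 - below / total
```

For example, with 10 proteins in the network, a functional group of 4, and a cluster of 3 containing 2 group members, the function returns 40/120 = 1/3. Smaller values indicate that the overlap is less likely to arise by chance.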
Geometric accuracy
To measure the robustness of the algorithm, the following measures are adopted [51]. Similar to Eq.(9), a matrix T is obtained by considering the annotated complexes as the BS. The clustering-wise sensitivity Sn is defined as
$$Sn = \frac{\sum_{i=1}^{n} \max_{j} T_{ij}}{\sum_{i=1}^{n} N_i} \qquad (11)$$
where n, m and $N_i$ are the size of BS, the number of clusters obtained by the algorithm, and the number of proteins in the i-th complex, respectively. The positive predictive value PPV is defined as
$$PPV = \frac{\sum_{j=1}^{m} \max_{i} T_{ij}}{\sum_{j=1}^{m} \sum_{i=1}^{n} T_{ij}} \qquad (12)$$
Based on Sn and PPV, the geometric accuracy is defined as
$$Acc = \sqrt{Sn \times PPV} \qquad (13)$$
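These three measures can be sketched from the overlap matrix T. The sketch assumes the usual definitions: Sn sums the row maxima of T and divides by the total size of the annotated complexes, PPV sums the column maxima and divides by the sum of all entries of T, and the geometric accuracy is the geometric mean sqrt(Sn × PPV). The function names and the toy matrix are hypothetical, and T is assumed to be non-empty with a positive total.

```python
from math import sqrt


def sensitivity(t, complex_sizes):
    """Sum of row maxima of T over the total size of the complexes."""
    return sum(max(row) for row in t) / sum(complex_sizes)


def positive_predictive_value(t):
    """Sum of column maxima of T over the sum of all entries of T."""
    columns = list(zip(*t))
    return sum(max(col) for col in columns) / sum(sum(col) for col in columns)


def geometric_accuracy(t, complex_sizes):
    """Geometric mean of sensitivity and positive predictive value."""
    return sqrt(sensitivity(t, complex_sizes) * positive_predictive_value(t))


# Two annotated complexes of sizes 4 and 2 against two predicted clusters.
toy_t = [[3, 1],
         [0, 2]]
toy_sizes = [4, 2]
```

For the toy matrix, both Sn and PPV equal 5/6, and so does the geometric accuracy.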
Geometrical separation
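Before describing the geometrical separation, we define the separation between the i-th complex and the j-th cluster as
$$sep_{ij} = \frac{T_{ij}}{\sum_{j'=1}^{m} T_{ij'}} \times \frac{T_{ij}}{\sum_{i'=1}^{n} T_{i'j}} \qquad (14)$$
where T is the matrix defined above. Then, the geometrical separation Sep is defined as
$$Sep = \sqrt{Sep_{co} \times Sep_{cl}} \qquad (15)$$
where $Sep_{co} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} sep_{ij}$ and $Sep_{cl} = \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} sep_{ij}$.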