Implementation
The cytoHubba plugin is implemented in Java on top of the Cytoscape API. It provides eleven node ranking methods for evaluating the importance of nodes in a biological network: Degree [1], Edge Percolated Component [9], Maximum Neighborhood Component [10], Density of Maximum Neighborhood Component [10], Maximal Clique Centrality (proposed in this paper), Bottleneck [11], EcCentricity [12], Closeness [13], Radiality [14], Betweenness [15], and Stress [16]. Each method is associated with a function F that assigns every node v a numeric score F(v). We say that a node u ranks higher than another node v if the score of u, F(u), is greater than that of v, F(v). The eleven methods fall into two major categories: local and global methods. To score a node, a local method considers only the relationship between the node and its direct neighbors, whereas a global method examines the relationship between the node and the entire network.
The algorithms
A. Local-based methods
We first state the notation used to describe these methods. We assume that a biological network G = (V, E) is an undirected network, where V is the collection of nodes within the network and E is the edge set. We also write G = (V(G), E(G)), where V(G) is the collection of nodes in a network G and E(G) is the collection of edges in G. For a set S, we use |S| to denote its cardinality (i.e. the number of elements in the set).
A local-based method considers only the direct neighborhood of a vertex. Given a node v, N(v) denotes the collection of its neighbors. The four local-based methods are as follows:
1. Degree method (Deg)
Deg(v)=|N(v)|.
2. Maximum Neighborhood Component (MNC)
MNC(v) = |V(MC(v))|, where MC(v) is a maximum connected component of the G[N(v)] and G[N(v)] is the induced subgraph of G by N(v).
3. Density of Maximum Neighborhood Component (DMNC)
Based on MNC, Lin et al. proposed DMNC(v) = |E(MC(v))| / |V(MC(v))|^ε, where ε = 1.7 [10].
4. Maximal Clique Centrality (MCC)
To increase the sensitivity and specificity, we propose MCC to discover featured nodes. The intuition behind MCC is that essential proteins tend to be clustered in a yeast protein-protein interaction network [17]. Given a node v, the MCC of v is defined as MCC(v) = Σ_{C ∈ S(v)} (|C| − 1)!, where S(v) is the collection of maximal cliques that contain v, and (|C| − 1)! is the product of all positive integers less than |C|. If there is no edge between the neighbors of the node v, then MCC(v) is equal to its degree.
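The four local scores can be illustrated with a short Python sketch. It is not the plugin's Java code: the adjacency-dictionary representation, the function names, and the example network are assumptions made for illustration. MCC is read here as the sum of (|C| − 1)! over the maximal cliques containing v, which are enumerated with a plain Bron-Kerbosch recursion. When several neighborhood components share the maximum size, the sketch simply keeps the first one it finds.

```python
from math import factorial


def largest_component(adj, nodes):
    """Return (node set, edge count) of a largest connected component of G[nodes]."""
    nodes = set(nodes)
    best, seen = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u] & nodes:  # stay inside the induced subgraph
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    edges = sum(len(adj[u] & best) for u in best) // 2  # each edge counted twice
    return best, edges


def deg(adj, v):
    return len(adj[v])


def mnc(adj, v):
    comp, _ = largest_component(adj, adj[v])
    return len(comp)


def dmnc(adj, v, eps=1.7):
    comp, edges = largest_component(adj, adj[v])
    return edges / len(comp) ** eps if comp else 0.0


def maximal_cliques(adj, r, p, x):
    """Bron-Kerbosch (no pivoting): yield maximal cliques that extend clique r."""
    if not p and not x:
        yield set(r)
    for u in list(p):
        yield from maximal_cliques(adj, r | {u}, p & adj[u], x & adj[u])
        p = p - {u}
        x = x | {u}


def mcc(adj, v):
    # Starting from r = {v} and p = N(v) lists exactly the maximal cliques containing v.
    return sum(factorial(len(c) - 1)
               for c in maximal_cliques(adj, {v}, set(adj[v]), set()))


# Hypothetical example: a triangle a-b-c with a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
scores_a = (deg(adj, "a"), mnc(adj, "a"), dmnc(adj, "a"), mcc(adj, "a"))
# Deg(a) = 3, MNC(a) = 2, DMNC(a) = 1 / 2**1.7, MCC(a) = 2! + 1! = 3
```

In this example node d has no edge among its neighbors, so mcc(adj, "d") equals its degree, 1, as stated above.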
B. Global-based methods
In cytoHubba we implement six node ranking methods based on shortest paths and one method based on percolated connectivity. Before introducing the shortest-path-based methods, we introduce some notation. The length of a shortest path between nodes u and v is denoted dist(u, v). Let C(v) be the component that contains node v. dist(u, v) is infinite if C(u) ≠ C(v), so the original methods of this category cannot be applied to networks with disconnected components. To overcome this problem, we enhance the original methods [11–16]; for a connected network, the score of a node computed by an enhanced method is the same as that computed by the original one.
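The role of dist(u, v) in a disconnected network can be illustrated with a breadth-first search. The Python sketch below is illustrative only: the adjacency-dictionary representation and the function names are assumptions, not part of cytoHubba.

```python
from collections import deque
from math import inf


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every node in its component C(source)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def dist(adj, u, v):
    """dist(u, v); infinite when u and v lie in different components."""
    return bfs_distances(adj, u).get(v, inf)


# Hypothetical network with two components: the path a-b-c and the edge x-y.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "x": {"y"},
    "y": {"x"},
}
same_component = dist(adj, "a", "c")   # 2
other_component = dist(adj, "a", "x")  # inf, because C(a) != C(x)
```

Any score that sums or maximizes dist(u, v) over all node pairs would be dominated by such infinite values, which is why the methods are restricted to, or normalized by, the component C(v).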
1. Closeness (Clo)
2. EcCentricity (EC)
3. Radiality (Rad)
Here Δ_C(v) is the maximum distance between any two vertices of the component C(v).
4. BottleNeck (BN)
Let T_s be a shortest path tree rooted at node s. BN(v) = Σ_{s ∈ V} p_s(v), where p_s(v) = 1 if more than |V(T_s)|/4 paths from node s to other nodes in T_s meet at the vertex v; otherwise p_s(v) = 0.
5. Stress (Str)
Str(v) = Σ_{s ≠ t ≠ v ∈ C(v)} σ_st(v), where σ_st(v) is the number of shortest paths from node s to node t that pass through node v.
6. Betweenness (BC)
BC(v) = Σ_{s ≠ t ≠ v ∈ C(v)} σ_st(v) / σ_st, where σ_st is the number of shortest paths from node s to node t.
7. Edge Percolated Component (EPC)
Given a threshold between 0 and 1 inclusive, we create 1000 reduced networks by assigning a random number between 0 and 1 to every edge and removing each edge whose random number is less than the threshold.
Let G_k be the reduced network generated by the k-th reduction. If nodes u and v are connected in G_k, set δ^k_uv to 1; otherwise δ^k_uv = 0. For a node v in G, EPC(v) is defined as
EPC(v) = (1/|V|) Σ_{k=1}^{1000} Σ_{t ∈ V} δ^k_vt.
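The edge percolation procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the plugin's Java code. The adjacency-dictionary representation, the function name epc, and the rounds and seed parameters are hypothetical. The sketch also assumes that a node's score is the number of nodes connected to it (counting the node itself), summed over the reduced networks and divided by |V|.

```python
import random


def epc(adj, threshold, rounds=1000, seed=0):
    """Edge Percolated Component scores (sketch; normalization by |V| is assumed)."""
    rng = random.Random(seed)
    nodes = list(adj)
    # List each undirected edge once; node labels are assumed to be sortable.
    edges = sorted({tuple(sorted((u, w))) for u in adj for w in adj[u]})
    total = dict.fromkeys(nodes, 0)
    for _ in range(rounds):
        # Build the reduced network: drop an edge if its random number < threshold.
        kept = {u: set() for u in nodes}
        for u, w in edges:
            if rng.random() >= threshold:
                kept[u].add(w)
                kept[w].add(u)
        # Every node in a component of size n is connected to n nodes (itself included).
        seen = set()
        for s in nodes:
            if s in seen:
                continue
            comp, stack = {s}, [s]
            while stack:
                u = stack.pop()
                for w in kept[u]:
                    if w not in comp:
                        comp.add(w)
                        stack.append(w)
            seen |= comp
            for u in comp:
                total[u] += len(comp)
    return {v: total[v] / len(nodes) for v in nodes}


# Hypothetical example: the path a-b-c, scored with 10 reduced networks.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
no_removal = epc(path, threshold=0.0, rounds=10)   # every edge survives
all_removed = epc(path, threshold=1.0, rounds=10)  # every edge is removed
```

With a threshold of 0 no edge is ever removed, so each node scores rounds × |C(v)| / |V|. With a threshold of 1 every edge is removed and each node scores rounds / |V|. Intermediate thresholds favor nodes whose connectivity survives random edge loss.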
The demo dataset and evaluation
The protein-protein interaction data used in this study are from the Database of Interacting Proteins (DIP) [18] (http://dip.doe-mbi.ucla.edu, version: 20140117). Essential protein lists are collected from the Saccharomyces Genome Deletion Project (SGDP) [19] and the Saccharomyces Genome Database (SGD) [20]. The protein ID mapping table from UniProt ID to NCBI Gene ID is downloaded from the UniProt FTP site.
The PPI network is loaded into Cytoscape and scored with the 11 methods using the cytoHubba plugin. The performance of each method is estimated by how many essential proteins it includes in the top x ranked list (x = 10, 20, 30, ..., 100), measured by Precision: