In this section we describe the basic concepts of our method in detail. First, we explain how PPI and gene expression data are integrated and why this integration can lead to biologically more meaningful functional modules. We then describe the graph clustering algorithm DetMod, which determines functional modules on a PPI graph weighted with gene expression information. In its first step, DetMod constructs modules, each starting from a 'seed' protein. It then merges modules by examining a score computed for each extracted module. Two modules are merged when the score of the merged cluster is better than the scores of the original clusters. However, if the merged cluster does not overlap significantly, in terms of its members, with one of the original clusters, both the merged cluster and that original cluster are preserved.
Data Integration
We chose to integrate these two types of data for several reasons. First, PPI data derived from high-throughput techniques contain many false interactions [49]. In addition, protein interaction measurements stem from a limited range of experimental conditions, so they identify only a small portion of all possible protein-protein interactions. Clustering the PPI graph alone (without considering gene expression data) therefore yields only partially valid functional modules, because interactions that would make the modules more coherent are missing. Moreover, graph clustering algorithms commonly neglect peripheral proteins that link loosely to clusters, even when these few interactions are true and experimentally confirmed [9, 24]. An important aspect of PPI networks, however, is that they provide information about direct partners, a property that is lost in co-expression networks. Gene expression data, on the other hand, provide information about the genome under many different experimental conditions, despite the embedded noise [13]. Although co-expression of two genes implies that they are under the same transcriptional control and are functionally correlated, the resulting interactions are often indirect.
Specifically, we used high-confidence PPI data in the form of a graph G(V, E), where vertices represent proteins and edges represent interactions. We then applied a clustering algorithm to the corresponding gene expression profiles. The number of clusters was determined both by the algorithm itself and by the functional enrichment of the clusters in GO (Gene Ontology) terms. Next, we weighted the interaction between two proteins x and y according to the weight function:
\[ W(x, y) = n_1 \left( \|x - K_x\|^2 + \|y - K_y\|^2 \right) + n_2 \|K_x - K_y\|^2 \]
||·|| denotes a distance metric; of the many available metrics, we used the Euclidean distance in this study. Kx and Ky denote the centroids of the clusters to which genes x and y, respectively, belong. The constants n1 and n2 weight the two terms of the function; they can take equal or different values, depending on which term (if any) we want to emphasize. We chose n2 > n1 (specifically n2 = 0.7 and n1 = 0.3) because we consider the distance between centroids more significant than the distance of each gene from its own centroid. This choice was motivated by the noise (outliers) in gene expression profiles, and it was supported by several runs of the algorithm, in which we consistently obtained better results when n2 was larger than n1.
The outcome of our integration method is a weighted PPI graph, to which the proposed algorithm is applied in order to detect functional modules that are supported by both types of data.
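The weight function can be illustrated with a minimal sketch. It assumes that the norms are squared Euclidean distances and that expression profiles and cluster centroids are available as numeric vectors of equal length. The names `edge_weight` and `sq_dist` are illustrative and are not taken from the original implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def edge_weight(x, y, k_x, k_y, n1=0.3, n2=0.7):
    """Weight of the interaction between genes x and y.

    x, y     : expression profiles of the two genes
    k_x, k_y : centroids of the expression clusters containing x and y
    n1, n2   : constants weighting the two terms (0.3 and 0.7 in the text)
    """
    # Distance of each gene from the centroid of its own cluster.
    within = sq_dist(x, k_x) + sq_dist(y, k_y)
    # Distance between the two cluster centroids.
    between = sq_dist(k_x, k_y)
    return n1 * within + n2 * between
```

Under this reading, two genes in the same expression cluster have a zero centroid term, so their weight depends only on how far each profile lies from the shared centroid.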
Basic notations
As already mentioned, our approach combines gene expression profiles and PPI data in the form of a weighted graph G(V, E). By N(x) we denote the neighbours of a node x, that is, the set of nodes connected to x. The degree of x is then the number of its neighbours, |N(x)|. For a given subgraph G1 of a larger graph G, we define the internal degree of x, |N^{G1}_{INT}(x)|, as the number of edges connecting x with other vertices belonging to G1, and the external degree of x as the number of nodes that are connected to x and belong to G but not to G1.
These concepts extend easily to weighted graphs. The weighted degree of a node x is the sum of the weights of the edges between x and its neighbours, divided by |N(x)|. The weighted internal degree of x is the sum of the weights of the edges between x and its neighbours within G1, divided by |N^{G1}_{INT}(x)|:
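\[ \beta^{G_1}_{INT}(x) = \frac{1}{|N^{G_1}_{INT}(x)|} \sum_{u \in N^{G_1}_{INT}(x)} W(x, u) \qquad (1) \]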
The weighted external degree is defined analogously.
The density of a graph G(V, E) is generally measured as the ratio of the number of edges in the graph to the number of all possible edges, which equals |V|(|V|-1)/2 for an undirected graph. The weighted density of a graph or subgraph, Dw(G), is the sum of the weights of the actual edges divided by the number of possible edges among all nodes in G:
\[ D_w(G) = \frac{\sum_{(u,v) \in E} W(u, v)}{|V|(|V|-1)/2} \]
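The definitions above can be sketched as follows. The sketch assumes an undirected weighted graph stored as a symmetric dictionary of dictionaries (`graph[u][v]` is the weight of edge u-v) and counts |V|(|V|-1)/2 possible edges. The function names, the storage format, and the choice to return 0.0 for a node with no qualifying neighbours are all assumptions made for illustration.

```python
def weighted_internal_degree(graph, x, members):
    """Mean weight of the edges joining x to its neighbours inside `members`."""
    inside = [w for u, w in graph[x].items() if u in members]
    # Returning 0.0 when x has no internal neighbour is a convention chosen here.
    return sum(inside) / len(inside) if inside else 0.0


def weighted_external_degree(graph, x, members):
    """Mean weight of the edges joining x to its neighbours outside `members`."""
    outside = [w for u, w in graph[x].items() if u not in members]
    return sum(outside) / len(outside) if outside else 0.0


def weighted_density(graph, members):
    """Sum of internal edge weights over the number of possible edges."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Each undirected edge appears twice in a symmetric adjacency map.
    total = sum(w for u in members
                for v, w in graph[u].items() if v in members) / 2
    return total / (n * (n - 1) / 2)
```

Because the weights in this method are distances, lower values of these quantities indicate more similar expression among the nodes involved.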
Detect module from 'seed' protein
In its first phase, DetMod applies another algorithm, Detect Module from Seed Protein (DMSP) [22], which itself operates in two phases. First, it accepts one 'seed' protein and selects a subset of its most promising neighbours; it then expands this initial kernel to accept more proteins. The expansion is based on certain assumptions concerning the number of neighbours of each protein as well as the weights of these connections.
DMSP starts by selecting only a certain number of the neighbours of the 'seed' protein (hereafter s). These adjacent nodes are sorted in descending order of significance, and the resulting subset of nodes (proteins) is called the kernel.
The original kernel is selected according to two criteria: the density of the kernel and its weighted internal and external degrees. Initially, the kernel Ks consists of all the neighbours of s. Then, for each neighbour ui belonging to Ks, we find NINT(Ks) and NEXT(Ks), as well as βINT and βEXT. The objective in selecting the kernel of the seed node is two-fold. First, we check that the number of edges a kernel node has within the rest of the kernel is larger than, or at least equal to, the number of edges it has outside the group. We enforce this through a condition on the internal and external degrees of each node:
(2)
In this study we set p1 to a value above 45%. Second, once a selected node has been confirmed to fulfil the first condition, we require that its weighted internal degree be smaller than its weighted external degree. Nodes that fail these criteria are discarded, while those that pass are sorted according to how well they satisfy them.
This original subset of proteins is refined further in order to obtain an even more coherent kernel. This is achieved by minimizing Dw(Ks):
(3)
In this step, DMSP removes the nodes one at a time in order of significance, starting from the least significant, until the weighted density reaches a minimum.
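The refinement step can be sketched as a greedy loop. The sketch assumes that the candidates are already ranked from most to least significant, that a `density` callable returns the weighted density of a node list, and that "reaching a minimum" means stopping at the first removal that no longer lowers the density. The stopping rule and the name `prune_kernel` are assumptions; the text does not spell them out.

```python
def prune_kernel(ranked_nodes, density):
    """Drop the least significant nodes while the weighted density decreases.

    ranked_nodes : kernel candidates, most significant first
    density      : callable returning the weighted density of a node list
                   (hypothetical helper supplied by the caller)
    """
    kernel = list(ranked_nodes)
    best = density(kernel)
    while len(kernel) > 1:
        candidate = kernel[:-1]      # remove the least significant node
        value = density(candidate)
        if value >= best:            # no further improvement: stop (assumed rule)
            break
        kernel, best = candidate, value
    return kernel
```

A stopping rule that searches all prefixes for the global minimum would be an equally plausible reading and only changes the loop condition.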
After the initial kernel has been created, DMSP iteratively adds adjacent nodes, again based on certain criteria. The depth of the neighbourhood (relative to the initial kernel) that DMSP explores varies with the problem and data set: as long as the criteria below are satisfied, the algorithm may go beyond the second or third level of neighbours. The first criterion is the same as the first criterion of the initial stage of DMSP, described by (2). Once this criterion has been checked, a node is selected for addition to the module if it satisfies the following:
(4)
Here G is the final module built from the initial kernel (i.e. initially G = Ks). The constant p2 can take any value between 0.9 and 1.0; in this study we set p2 to 0.9. Ideally p2 would equal 1.0, but because we work on a real and very complex biological problem we allow it to range down to 0.9. Our experiments showed that lower values could create artifacts in the final determination of modules. Relation (4) states that, for an adjacent node ui of some kernel node v to become a member of the module, its weight must be less than or equal to a specific percentage of the weighted internal degree of v.
We should emphasize that DMSP uses two values of p1 in relation (2), which describes the balance between internal and external neighbours. Which value applies depends on whether the current node is a direct neighbour of the kernel. This gives a two-layer scheme with a looser criterion for immediate neighbours and a stricter one for remote neighbours of the initial kernel: specifically, we set p1 in relation (2) to 0.75 when DMSP checks the remote neighbours of the initial kernel for new members.
DetMod analysis
In the first phase, DetMod iteratively applies DMSP to every node of the overall graph: each node is treated as a seed protein, from which a candidate functional module is created. Each newly constructed module is checked for overlap with the modules created previously. If the degree of overlap with any of them exceeds a certain threshold, the new module is discarded. The pseudocode for the first part of DetMod is given below:
Procedure Create_Basic_Modules
1) G' = G
2) Modules_List = Empty, Nodes_List = Empty
3) While G' != Empty
   I) Retrieve randomly a node v from G'
   II) Apply DMSP to create a new functional module M with v as seed, M = DMSP(G, v)
   III) find = false
   IV) For all modules in Modules_List
       i) Check if there is a module with more than p% overlap with M
       ii) If there is, find = true then break
   V) End_For
   VI) If find != true
       i) Keep M in Modules_List
       ii) Keep v in Nodes_List
   VII) End_If
   VIII) Delete v from G'
4) End_While
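The procedure above can be sketched in Python under a few assumptions. DMSP is passed in as a callable `dmsp(graph, seed)` that returns the nodes of a module. The overlap between two modules is measured as the shared fraction of the smaller module, and the threshold `max_overlap` stands in for p. The overlap measure, the default threshold, and all names are illustrative, since the text does not define them.

```python
import random


def overlap(a, b):
    """Shared fraction of the smaller module (assumed overlap measure)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def create_basic_modules(graph, dmsp, max_overlap=0.5, rng=None):
    """Build one candidate module per seed and keep the non-redundant ones.

    graph       : mapping from each node to its neighbours
    dmsp        : callable (graph, seed) -> iterable of module nodes
    max_overlap : hypothetical stand-in for the threshold p
    """
    rng = rng or random.Random(0)
    remaining = list(graph)
    rng.shuffle(remaining)  # visiting a shuffled list draws nodes at random
    modules, seeds = [], []
    for v in remaining:
        module = set(dmsp(graph, v))
        # Discard the module if it overlaps too much with an earlier one.
        if not any(overlap(module, m) > max_overlap for m in modules):
            modules.append(module)
            seeds.append(v)
    return modules, seeds
```

For example, with a toy `dmsp` that returns a seed together with its direct neighbours, a graph made of two disjoint triangles yields exactly two modules regardless of the order in which the seeds are visited.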
As we have seen, DetMod allows every node to be part of more than one module. In this way DetMod accommodates the complexity of genes or their products and their tendency to participate in different groups towards achieving different goals. For this reason we introduce the notion of a node score (easily extended to a module score). This metric has a dual purpose: it checks whether the majority of the immediate neighbours of a node are in the same cluster as the node, and it captures the repeated appearance of a node together with its immediate neighbours in many different clusters.
The node score is an extension of the node degree and is related to the connectivity of a node with respect to its neighbours in every module.
To compute the score of a node v (an example is given in Figure 5), we isolate the modules to which the node belongs and then check, for each of its neighbours, which of these modules they have in common. We add an imaginary neighbour to the total number of neighbours of v every time an actual neighbour and the node share a module. In mathematical terms:
(5)
with:
(6)
where vc is the set of modules to which the node v belongs and N is the set of the real neighbours of v. Given the total number of neighbours of a node under this scheme, we can calculate the score of the node as:
(7)
After determining the score of every node we can calculate the score of a module by averaging the score of the nodes that constitute it.
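The two counting steps described in words can be sketched as below. The sketch covers only the extended neighbour count and the averaging of node scores into a module score. It does not implement the node score itself, because that formula is not recoverable from the text. It assumes one imaginary neighbour is added per module shared between v and each real neighbour; the function names and data layout are illustrative.

```python
def extended_neighbour_count(v, neighbours, membership):
    """Real neighbours of v plus one imaginary neighbour per shared module.

    neighbours : mapping node -> set of real neighbours
    membership : mapping node -> set of module identifiers it belongs to
    """
    v_modules = membership.get(v, set())
    shared = sum(len(v_modules & membership.get(u, set()))
                 for u in neighbours[v])
    return len(neighbours[v]) + shared


def module_score(module, node_scores):
    """Average of the (precomputed) scores of the nodes in a module."""
    return sum(node_scores[n] for n in module) / len(module)
```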
In the second phase of the algorithm, the functional modules retrieved in the first phase are checked to determine whether they can be merged.
Specifically, DetMod checks every pair of connected modules to determine whether merging them would produce a new module with a higher score than one or both of its predecessors.
Procedure Merge
N = number of modules
1) For i = 1:N-1
   2) For j = i+1:N
      I) If Modi and Modj are connected
         i) Modnew = merge(Modi, Modj)
         ii) If Score(Modnew) > Score(Modi) AND Score(Modnew) > Score(Modj)
             (a) If Modnew has no overlap conflict, save it in Merge_Modules
             (b) Delete the other two modules
         iii) Else if Score(Modnew) > Score(Modi) OR Score(Modnew) > Score(Modj)
             (a) If Modnew has no overlap conflict, save it in Merge_Modules
             (b) Delete the module with the worst score
      II) End_If
   3) End_For
4) End_For
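A simplified single pass of this pairwise merge test can be sketched as follows. The sketch omits the overlap conflict check, takes the module score as a caller-supplied callable `score`, and treats two modules as connected when they share a node or an edge joins them. The meaning of "connected" and all names here are assumptions made for illustration.

```python
from itertools import combinations


def connected(graph, a, b):
    """True if the modules share a node or an edge joins them (assumed meaning)."""
    return bool(a & b) or any(u in b for n in a for u in graph[n])


def merge_pass(graph, modules, score):
    """One pass over all module pairs; returns the surviving and merged modules.

    graph   : mapping from each node to its neighbours
    modules : iterable of node collections
    score   : callable returning the score of a module (hypothetical helper)
    """
    modules = [frozenset(m) for m in modules]
    merged, dropped = [], set()
    for a, b in combinations(modules, 2):
        if not connected(graph, a, b):
            continue
        new = a | b
        s_new, s_a, s_b = score(new), score(a), score(b)
        if s_new > s_a and s_new > s_b:
            merged.append(new)
            dropped.update((a, b))              # both predecessors are replaced
        elif s_new > s_a or s_new > s_b:
            merged.append(new)
            dropped.add(a if s_a < s_b else b)  # drop only the worse predecessor
    kept = [m for m in modules if m not in dropped]
    # dict.fromkeys removes duplicate merged modules while keeping their order.
    return kept + [m for m in dict.fromkeys(merged) if m not in kept]
```

Collecting deletions in `dropped` and applying them after the loop keeps the pair enumeration stable, which is one way to avoid modifying the module list while iterating over it.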