In this section we describe the basic concepts of our method in detail. First we explain how PPI and gene expression data are integrated and why this procedure can lead to biologically more meaningful functional modules. Then we describe the graph clustering algorithm DetMod, which determines functional modules on the PPI network. The proposed algorithm identifies functional modules on a PPI graph that is weighted with gene expression information. In its first step, DetMod constructs modules, each starting from a 'seed' protein. Next, DetMod merges modules by examining a score computed for each extracted module. Merging is preferred when the score of the merged cluster is better than the score of the original clusters. However, if the merged cluster does not significantly overlap, in terms of its members, with one of the original clusters, then both the merged and the old cluster are preserved.
Data Integration
We chose to combine these two types of data for several reasons. First, PPI data derived from high-throughput techniques contain many false interactions [49]. In addition, protein interaction measurements stem from a limited range of experimental conditions, so they identify only a small portion of all possible protein-protein interactions. Clustering the PPI graph alone (without considering gene expression data) therefore yields only partially valid functional modules, because interactions that would lead to more coherent modules are excluded. Moreover, graph clustering algorithms commonly neglect peripheral proteins that link loosely to clusters, even if these few interactions are true and experimentally confirmed [9, 24]. However, an important aspect of PPI networks is that they provide information about direct partners, a property that is lost when dealing with coexpression networks. Gene expression data, on the other hand, provide information about the genome under many different experimental conditions, despite the embedded noise [13]. Although coexpression of two gene profiles implies that they are under the same transcriptional control and functionally correlated, the resulting interactions are often indirect.
Specifically, we used high-confidence PPI data in the form of a graph G(V, E), where vertices represent proteins and edges represent interactions. We then applied a clustering algorithm to the respective gene expression profiles. The number of clusters was determined both by the algorithm itself and by the functional enrichment of the clusters in GO (Gene Ontology) terms. Next, we weighted the interaction between two proteins according to the weight function:
W(x,y) = n_{1}\left(\lVert x - K_{x}\rVert^{2} + \lVert y - K_{y}\rVert^{2}\right) + n_{2}\lVert K_{x} - K_{y}\rVert^{2}
Here ‖·‖ denotes a distance metric; many choices are possible, and in this study we used the Euclidean distance. K_{x} and K_{y} are the centroids of the clusters to which genes x and y, respectively, belong. The constants n_{1} and n_{2} weight the two terms of the function. They can have the same or different values, depending on which term (if any) we want to emphasise. We chose n_{2} > n_{1} (specifically n_{2} = 0.7, n_{1} = 0.3) because we consider the distance between centroids more significant than the distance of each gene from its centroid. This choice was motivated by the noise (outliers) in gene expression profiles. These values were set on the basis of several runs of the algorithm; in general, we consistently obtained better results when n_{2} was larger than n_{1}.
The outcome of our integration method is a weighted PPI graph, to which the proposed algorithm is applied in order to detect functional modules that are supported by both types of data.
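As an illustration, the weight function above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `sq_dist` and `edge_weight` are hypothetical, expression profiles and centroids are assumed to be plain lists of numbers of equal length, and the squared Euclidean distance is used as in the formula.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def edge_weight(x, y, k_x, k_y, n1=0.3, n2=0.7):
    """Weight of the interaction between genes x and y.

    x, y     : expression profiles of the two genes
    k_x, k_y : centroids of the expression clusters containing x and y
    n1, n2   : term weights (0.3 and 0.7 are the values reported in the text)
    """
    # Distance of each gene from its own cluster centroid.
    within = sq_dist(x, k_x) + sq_dist(y, k_y)
    # Distance between the two cluster centroids.
    between = sq_dist(k_x, k_y)
    return n1 * within + n2 * between
```

Under this sketch, two genes that sit on the centroid of the same expression cluster receive weight 0, and the weight grows as the genes drift from their centroids or as their clusters move apart, so smaller weights correspond to better-supported interactions.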
Basic notations
As already mentioned, our approach combines gene expression profiles and PPI data in the form of a weighted graph G(V, E). By N(x) we denote the neighbours of a node x, that is, the set of nodes connected to x. The degree of x is then the number of its neighbours, |N(x)|. For a given subgraph G_{1} of a larger graph G, we define the internal degree of x, |N_{G1}^{INT}(x)|, as the number of edges connecting x with other vertices belonging to G_{1}, and the external degree, |N_{G1}^{EXT}(x)|, as the number of nodes to which x is connected that exist in G but do not belong to G_{1}.
These concepts extend easily to weighted graphs. The weighted degree of a node x is the sum of the weights of the edges between x and its neighbours, divided by |N(x)|. The weighted internal degree of a node x is the sum of the weights of the edges between x and its neighbours within G_{1}, divided by |N_{G1}^{INT}(x)|:
\beta_{G_{1}}^{INT}(x) = \frac{1}{\left|N_{G_{1}}^{INT}(x)\right|} \sum_{y \in N_{G_{1}}^{INT}(x)} w_{xy}
(1)
The weighted external degree is defined analogously.
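The definitions above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the weighted graph is stored as a dictionary of dictionaries (`graph[x][y]` is the weight w_{xy}) and that a subgraph G_{1} is given as a set of node names; the function names are hypothetical, and returning 0.0 for a node with no neighbours on one side is a convention of this sketch, not something stated in the text.

```python
def internal_neighbours(graph, module, x):
    """Neighbours of x that belong to the subgraph (module)."""
    return {y for y in graph[x] if y in module}


def external_neighbours(graph, module, x):
    """Neighbours of x that lie outside the subgraph (module)."""
    return {y for y in graph[x] if y not in module}


def weighted_degree(graph, x, neighbours):
    """Mean weight of the edges between x and the given neighbours.

    Returns 0.0 when the neighbour set is empty (sketch convention).
    """
    if not neighbours:
        return 0.0
    return sum(graph[x][y] for y in neighbours) / len(neighbours)


# Toy graph: node "a" has two neighbours inside the module and one outside.
graph = {
    "a": {"b": 1.0, "c": 3.0, "d": 5.0},
    "b": {"a": 1.0},
    "c": {"a": 3.0},
    "d": {"a": 5.0},
}
module = {"a", "b", "c"}
beta_int = weighted_degree(graph, "a", internal_neighbours(graph, module, "a"))
beta_ext = weighted_degree(graph, "a", external_neighbours(graph, module, "a"))
```

For node "a" the internal degree is 2 and the external degree is 1, with weighted internal and external degrees of 2.0 and 5.0 respectively.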
The density of a graph G(V, E) is generally measured as the ratio of the number of edges in the graph to the number of all possible edges, which is taken as |V|(|V| − 1) for an undirected graph. The weighted density of a graph or subgraph, D_{w}(G), is the sum of the weights of the actual edges over the number of possible edges among all nodes in G:
D_{w}(G) = \frac{\sum_{\langle x,y\rangle \in E} w_{xy}}{\left|V\right|\left(\left|V\right| - 1\right)}
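A minimal Python sketch of the weighted density follows. It assumes a dictionary-of-dictionaries graph (`graph[x][y]` is w_{xy}, stored symmetrically), counts each undirected edge once in the numerator, and uses the denominator |V|(|V| − 1) exactly as written above. The function name and the convention of returning 0.0 for fewer than two nodes are assumptions of the sketch.

```python
def weighted_density(graph, nodes):
    """Sum of edge weights inside `nodes` over |V|(|V| - 1)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0  # sketch convention: density undefined for < 2 nodes
    total = 0.0
    seen = set()
    for u in nodes:
        for v, w in graph[u].items():
            edge = frozenset((u, v))
            # Count each undirected edge once, and only if both ends are inside.
            if v in nodes and edge not in seen:
                seen.add(edge)
                total += w
    return total / (n * (n - 1))


# Toy triangle with weights 1, 2 and 3.
triangle = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 3.0},
    "c": {"a": 2.0, "b": 3.0},
}
```

For the toy triangle the weights sum to 6 and |V|(|V| − 1) = 6, so the weighted density is 1.0.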
Detect module from 'seed' protein
In its first phase, DetMod applies another algorithm, Detect Module from Seed Protein (DMSP) [22], which operates in two stages. First, it accepts one 'seed' protein and selects a subset of its most promising neighbours; it then expands this initial kernel to accept more proteins. The expansion is based on certain assumptions concerning the number of neighbours of the specific protein as well as the weights of these connections.
DMSP begins by selecting only a certain number of the neighbours of the 'seed' protein (hereafter named s). These adjacent nodes are sorted in descending order of significance, and this subset of nodes (proteins) is called the kernel.
The original kernel is selected according to two criteria: the density of the kernel, and its weighted internal and external degrees. Initially, the kernel K_{s} is equal to all the neighbours of s. Then, for each neighbour u_{i} belonging to K_{s}, we compute its internal and external degrees with respect to K_{s}, as well as β^{INT} and β^{EXT}. The objective for selecting the kernel of the seed node is twofold. First, we check that the number of edges from a kernel node to the rest of the kernel is larger than or at least equal to the number of edges that the node has outside the group. We accomplish this by imposing the following condition on the internal and external degrees of each node:
IO(K_{s}, u_{i}) = \frac{\left|N_{K_{s}}^{INT}(u_{i})\right|}{\left|N_{K_{s}}^{EXT}(u_{i})\right| + \left|N_{K_{s}}^{INT}(u_{i})\right|} > p_{1}
(2)
In this study we set the threshold p_{1} at 45%, so that this ratio must exceed 45%. Once a node has been confirmed to fulfil the first condition, we additionally require that its weighted internal degree be smaller than its weighted external degree. Nodes that fail these criteria are discarded, while those that pass are sorted according to how well they satisfy them.
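The two kernel criteria can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the text: the graph is a dictionary of dictionaries, the seed is counted as part of the group when degrees are computed, a node with no external neighbours is taken to pass the weighted-degree test, and surviving nodes are ranked by their IO ratio (highest first). All function names are hypothetical.

```python
def io_ratio(graph, group, u):
    """Fraction of u's neighbours that lie inside the group (relation (2))."""
    if not graph[u]:
        return 0.0
    return sum(1 for y in graph[u] if y in group) / len(graph[u])


def mean_weight(graph, u, nbrs):
    """Mean weight of the edges from u to nbrs (0.0 if nbrs is empty)."""
    nbrs = list(nbrs)
    return sum(graph[u][y] for y in nbrs) / len(nbrs) if nbrs else 0.0


def candidate_kernel(graph, seed, p1=0.45):
    """Neighbours of the seed that pass both kernel criteria, best first."""
    group = set(graph[seed]) | {seed}  # assumption: seed belongs to the group
    kept = []
    for u in graph[seed]:
        inside = [y for y in graph[u] if y in group]
        outside = [y for y in graph[u] if y not in group]
        ratio = io_ratio(graph, group, u)
        beta_int = mean_weight(graph, u, inside)
        beta_ext = mean_weight(graph, u, outside)
        # Assumption: with no external neighbours the second test passes.
        if ratio > p1 and (not outside or beta_int < beta_ext):
            kept.append((ratio, u))
    kept.sort(key=lambda item: (-item[0], item[1]))  # assumption: rank by IO
    return [u for _, u in kept]


toy = {
    "s": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 4.0},
    "a": {"s": 1.0, "b": 1.0, "d": 4.0},
    "b": {"s": 1.0, "a": 1.0},
    "c": {"s": 1.0, "x": 5.0, "y": 5.0, "z": 5.0},
    "d": {"s": 4.0, "a": 4.0, "e": 1.0},
    "e": {"d": 1.0},
    "x": {"c": 5.0}, "y": {"c": 5.0}, "z": {"c": 5.0},
}
```

In the toy graph, node "c" is rejected because only one of its four neighbours lies inside the group, and node "d" is rejected because its internal edges are heavier than its external one; "a" and "b" remain.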
This initial subset of proteins is refined further in order to obtain an even more coherent kernel. This is achieved by minimizing D_{w}(K_{s}):
{D}_{w}^{\mathrm{min}}({K}_{s})=\mathrm{min}(\underset{{K}_{s}}{\mathrm{arg}}){D}_{w}\left({K}_{s}\right)
(3)
In this step, DMSP removes the sorted nodes one at a time, starting from the least significant, until the weighted density reaches a minimum.
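One possible reading of this pruning step is a greedy loop that stops at the first local minimum of the weighted density. The sketch below follows that reading; the stopping rule, the `min_size` guard (without it, a kernel shrunk to a single node would trivially have density 0), and the function names are assumptions, not details given in the text.

```python
def weighted_density(graph, nodes):
    """Sum of edge weights inside `nodes` over |V|(|V| - 1)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    total, seen = 0.0, set()
    for u in nodes:
        for v, w in graph[u].items():
            edge = frozenset((u, v))
            if v in nodes and edge not in seen:
                seen.add(edge)
                total += w
    return total / (n * (n - 1))


def prune_kernel(graph, ranked, min_size=2):
    """Drop least-significant nodes while the weighted density keeps falling.

    ranked   : kernel nodes sorted from most to least significant
    min_size : hypothetical lower bound on the kernel size
    """
    kernel = list(ranked)
    best = weighted_density(graph, kernel)
    while len(kernel) > min_size:
        trial = kernel[:-1]  # remove the least significant node
        density = weighted_density(graph, trial)
        if density >= best:  # no improvement: local minimum reached
            break
        kernel, best = trial, density
    return kernel


loose = {
    "a": {"b": 1.0, "c": 9.0},
    "b": {"a": 1.0, "c": 9.0},
    "c": {"a": 9.0, "b": 9.0},
}
tight = {
    "a": {"b": 6.0, "c": 1.0},
    "b": {"a": 6.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
}
```

In the `loose` example, removing "c" lowers the density from 19/6 to 0.5, so it is dropped; in the `tight` example, removing "c" would raise the density, so the kernel is left unchanged.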
After the creation of the initial kernel, DMSP iteratively adds adjacent nodes, again based on certain criteria. The depth of the neighbours (relative to the initial kernel) examined by DMSP varies with the specific problem and data set: as long as the criteria described below are satisfied, the algorithm may go beyond the 2^{nd} and 3^{rd} level. The first criterion the algorithm checks is the same as the first one of the initial stage of DMSP, described by (2). Once this criterion is satisfied, a node is added to the module if it also satisfies the following:
{W}_{v{u}_{i}}\le {p}_{2}\cdot {\beta}_{G}^{INT}\left(v\right)
(4)
Here G is the module being built from the initial kernel (i.e. initially G = K_{s}). The constant p_{2} can be chosen anywhere between 0.9 and 1.0; in this study we set p_{2} to 0.9. Ideally the value of p_{2} should be equal to 1.0, but given that we work on a real and very complex biological problem we allow the value to range down to 0.9. Our experiments showed that a lower value could create artifacts in the final determination of modules. Relation (4) states that, for an adjacent node u_{i} of some kernel node v to become a member of the module, the weight of the edge between them must be less than or equal to a specific percentage of the weighted internal degree of node v.
At this point we should emphasize that DMSP uses two values of p_{1} in relation (2), which describes the relation between internal and external neighbours. Which value is used depends on whether the current node is a direct neighbour of the kernel or not. This gives a two-layer scheme in which we retain a looser criterion for immediate neighbours and a stricter one for the remote neighbours of the initial kernel (specifically, we set p_{1} in equation (2) to 0.75 when DMSP checks for members among the remote neighbours of the initial kernel).
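The expansion stage, with relations (2) and (4) and the two-layer p_{1} scheme, can be sketched as below. The dictionary-of-dictionaries graph, the fixed-point loop, the deterministic (sorted) visiting order, and the test for "direct neighbour of the kernel" (adjacent to at least one initial kernel node) are assumptions of this sketch; the function names are hypothetical.

```python
def beta_int(graph, module, v):
    """Weighted internal degree of v with respect to the module."""
    inside = [graph[v][y] for y in graph[v] if y in module]
    return sum(inside) / len(inside) if inside else 0.0


def io_ratio(graph, module, u):
    """Fraction of u's neighbours that already belong to the module."""
    if not graph[u]:
        return 0.0
    return sum(1 for y in graph[u] if y in module) / len(graph[u])


def expand(graph, kernel, p1_near=0.45, p1_far=0.75, p2=0.9):
    """Grow a module from the kernel until no candidate passes both tests."""
    kernel = set(kernel)
    module = set(kernel)
    changed = True
    while changed:
        changed = False
        for v in sorted(module):            # snapshot of current members
            for u in sorted(graph[v]):      # candidates adjacent to v
                if u in module:
                    continue
                # Looser threshold for direct neighbours of the initial kernel.
                near = any(k in graph[u] for k in kernel)
                p1 = p1_near if near else p1_far
                passes_io = io_ratio(graph, module, u) > p1          # (2)
                passes_w = graph[v][u] <= p2 * beta_int(graph, module, v)  # (4)
                if passes_io and passes_w:
                    module.add(u)
                    changed = True
    return module


net = {
    "a": {"b": 2.0, "c": 1.0, "d": 5.0},
    "b": {"a": 2.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "h": 0.5},
    "d": {"a": 5.0, "e": 1.0, "f": 1.0, "g": 1.0},
    "e": {"d": 1.0}, "f": {"d": 1.0}, "g": {"d": 1.0},
    "h": {"c": 0.5},
}
```

Starting from the kernel {a, b}, node "c" is admitted as a direct neighbour, node "h" is then admitted as a remote neighbour under the stricter threshold, and node "d" is rejected because most of its neighbours lie outside the module.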
DetMod analysis
In the first phase, DetMod iteratively applies DMSP to every node of the overall graph; each node is thus regarded as a seed protein from which a possible functional module is created. Each newly constructed module is checked for overlap with the modules created previously. If the degree of overlap is above a certain threshold, the module is discarded. The pseudocode for the first part of DetMod is given below:
Procedure Create_Basic_Modules
1) G' = G;
2) Modules_List = Empty
3) While G' != empty
    I) Retrieve randomly a node v from G'; set find = false
    II) Apply DMSP to create a new functional module M with v as seed, M = DMSP(G, v)
    III) For all modules in Modules_List
        i) Check if there is a module with more than p% overlap with M
        ii) If there is, find = true then break
    IV) End_for
4) If find != true
    I) Keep M in Modules_List
    II) Keep v in Nodes_List
5) End_If
6) Delete v from G'
7) End while
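The procedure above can be sketched in Python as follows. The DMSP step is passed in as a callable, since only its interface matters here. The overlap measure (shared members as a fraction of the smaller module) and the default threshold `p=0.5` are assumptions of this sketch: the text specifies neither. The names `overlap` and `create_basic_modules` are hypothetical.

```python
import random


def overlap(a, b):
    """Shared members as a fraction of the smaller module (assumed measure)."""
    return len(a & b) / min(len(a), len(b))


def create_basic_modules(graph, dmsp, p=0.5, rng=None):
    """Seed a module from every node, discarding heavily overlapping ones.

    dmsp : callable (graph, seed) -> iterable of nodes, standing in for DMSP
    p    : overlap threshold (hypothetical default)
    """
    rng = rng or random.Random(0)
    remaining = list(graph)
    rng.shuffle(remaining)          # nodes are visited in random order
    modules, seeds = [], []
    for v in remaining:
        module = frozenset(dmsp(graph, v))
        if not module:
            continue
        # Discard the new module if it overlaps too much with a kept one.
        if any(overlap(module, kept) > p for kept in modules):
            continue
        modules.append(module)
        seeds.append(v)
    return modules, seeds


# Toy stand-in for DMSP: the seed together with all of its neighbours.
def closed_neighbourhood(graph, seed):
    return {seed} | set(graph[seed])


two_triangles = {
    "a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"a": 1.0, "b": 1.0},
    "x": {"y": 1.0, "z": 1.0}, "y": {"x": 1.0, "z": 1.0}, "z": {"x": 1.0, "y": 1.0},
}
modules, seeds = create_basic_modules(two_triangles, closed_neighbourhood)
```

With the toy stand-in, every seed in a triangle produces the same module, so only the first module from each triangle is kept and the duplicates are discarded.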
As we have seen, DetMod allows every node to be part of more than one module. In this way DetMod accounts for the complexity of genes or their products and their tendency to participate in different groups in order to achieve different goals. For this reason we introduce the notion of a node score (easily extended to a module score). This metric has a dual purpose: it checks whether the majority of the immediate neighbours of a node are in the same cluster, and it accounts for the repeated appearance of a node and its immediate neighbours in many different clusters.
The node score is an extension of the node degree and is related to the connectivity of a node with respect to its neighbours in every module.
To compute the score of a node v (example given in Figure 5), we identify the modules to which v belongs and then check, for each of its neighbours, which modules they have in common. We add an imaginary neighbour to the total number of neighbours of v every time an actual neighbour and v have a common module. In mathematical terms:
{N}_{v}^{TOT}={\displaystyle \sum _{i=1}^{N}\Xi \left(v,{u}_{i}\right)}
(5)
with:
\Xi(v,u) = \begin{cases} \left|v_{C} \cap u_{C}\right|, & v_{C} \cap u_{C} \ne \varnothing \\ 1, & \text{else} \end{cases}
(6)
where v_{C} is the set of modules to which node v belongs, and N is the number of real neighbours of v. Given the total number of neighbours of a node under this scheme, the score of a node is calculated as:
{S}_{v}^{G}=\frac{{N}_{G}^{INT}(v)}{{N}_{v}^{TOT}}
(7)
After determining the score of every node, we calculate the score of a module by averaging the scores of the nodes that constitute it.
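Relations (5) to (7) and the module score can be sketched as follows. The sketch assumes that modules are given as a list of node sets indexed by position and that the graph is a dictionary keyed by node whose values can be iterated to obtain neighbours; all function names are hypothetical.

```python
def memberships(modules):
    """Map each node to the set of module indices it belongs to (v_C)."""
    member_of = {}
    for idx, module in enumerate(modules):
        for node in module:
            member_of.setdefault(node, set()).add(idx)
    return member_of


def total_neighbours(graph, member_of, v):
    """N_v^TOT: each neighbour counts once, or once per shared module."""
    total = 0
    for u in graph[v]:
        shared = member_of.get(v, set()) & member_of.get(u, set())
        total += len(shared) if shared else 1
    return total


def node_score(graph, modules, member_of, v, g):
    """S_v^G: internal degree of v in module g over N_v^TOT."""
    n_int = sum(1 for u in graph[v] if u in modules[g])
    n_tot = total_neighbours(graph, member_of, v)
    return n_int / n_tot if n_tot else 0.0


def module_score(graph, modules, g):
    """Average node score over the members of module g."""
    member_of = memberships(modules)
    scores = [node_score(graph, modules, member_of, v, g) for v in modules[g]]
    return sum(scores) / len(scores)


g4 = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
mods = [{"a", "b", "c"}, {"c", "d"}]
```

In this toy case, nodes "a" and "b" have all their neighbours inside the first module and score 1, while "c", which also links to "d" in the second module, scores 2/3 in the first module; the first module therefore scores 8/9 and the second 2/3.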
In the second phase of the algorithm, the functional modules retrieved in the first phase are examined to determine whether they can be merged.
Specifically, DetMod checks every pair of connected modules to determine whether merging them would yield a new module with a higher score than one or both of its predecessors.
Procedure Merge
N = number of modules
1) For i = 1:N-1
2) For j = i+1:N
    I) If Mod_{i} and Mod_{j} are connected
        i) Mod_{new} = merge(Mod_{i}, Mod_{j})
        ii) If Score(Mod_{new}) > Score(Mod_{i}) AND Score(Mod_{new}) > Score(Mod_{j})
            (a) If Mod_{new} has no overlap conflict save in Merge_Modules
            (b) Delete the other two modules
        iii) Else if Score(Mod_{new}) > Score(Mod_{i}) OR Score(Mod_{new}) > Score(Mod_{j})
            (a) If Mod_{new} has no overlap conflict save in Merge_Modules
            (b) Delete the module with the worst score
    II) End_If
3) End_For
4) End_For
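A simplified Python sketch of the merging pass is given below. It is an illustration under stated assumptions rather than the authors' procedure: the scoring function is passed in as a callable, two modules are treated as connected when they share a node or are joined by at least one edge, and the "overlap conflict" check is omitted because the text does not define it. The names `connected` and `merge_pass` are hypothetical.

```python
def connected(graph, a, b):
    """True if the modules share a node or are linked by an edge (assumed)."""
    return bool(a & b) or any(v in b for u in a for v in graph[u])


def merge_pass(graph, modules, score):
    """One pass over all module pairs, following the Merge pseudocode.

    score : callable (module) -> float, higher is better
    The overlap-conflict test of the pseudocode is not implemented here.
    """
    modules = [frozenset(m) for m in modules]
    merged, dropped = [], set()
    for i in range(len(modules) - 1):
        for j in range(i + 1, len(modules)):
            a, b = modules[i], modules[j]
            if not connected(graph, a, b):
                continue
            new = a | b
            s_new, s_a, s_b = score(new), score(a), score(b)
            if s_new > s_a and s_new > s_b:
                merged.append(new)       # better than both: drop both
                dropped.update((i, j))
            elif s_new > s_a or s_new > s_b:
                merged.append(new)       # better than one: drop the worse
                dropped.add(i if s_a <= s_b else j)
    result = []
    for m in [m for k, m in enumerate(modules) if k not in dropped] + merged:
        if m not in result:              # avoid duplicate merged modules
            result.append(m)
    return result


path = {
    "a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0}, "d": {"c": 1.0},
    "e": {"f": 1.0}, "f": {"e": 1.0},
}
out = merge_pass(path, [{"a", "b"}, {"c", "d"}, {"e", "f"}], score=len)
```

With module size used as a toy score, the two modules joined by the edge b-c are merged and removed, while the disconnected module {e, f} is left untouched. In practice the score would be the module score defined earlier, recomputed for the merged module.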