DIGNiFI algorithm
The core assumption of disease gene prioritization from a PPI network is that genes sharing topological similarities tend to be associated with phenotypically close disorders and may cause the same or similar diseases [8, 11, 29]. This “guilt by association” principle has been widely used to prioritize candidate disease genes. Hence, the most important task in using a PPI network is measuring the similarity between known genes and candidate genes. To rank the candidate genes, we calculate the similarity in two different ways: one is designed for directly connected genes and the other for indirectly connected genes in the PPI network.
A PPI network can be represented as a graph G(V,E,W), where the set of nodes V denotes proteins and the set of edges E denotes interactions between proteins, with edge weights W. Given a protein \(v\in V\), \(\Gamma_{v}\) represents the set consisting of v and its neighbors. From a topological view, the more direct neighbors two genes have in common, the more similar they are likely to be. Hence, given a protein pair \(v_{i}\) and \(v_{j}\), we calculate the similarity between them using Eq. 1.
$$ Sim\left(v_{i},v_{j}\right)= \left\{\begin{array}{ll} DN\left(v_{i},v_{j}\right)&\;e(v_{i},v_{j})\in{E}\\ LRW_{v_{i}v_{j}}(t)&\;otherwise \end{array}\right. $$
(1)
The value of \(DN(v_{i},v_{j})\) is defined as:
$$ DN\left(v_{i},v_{j}\right)=\frac{\sum_{v_{k}\in\left(\Gamma_{v_{i}}\bigcap\Gamma_{v_{j}}\right)} w\left(v_{i},v_{k}\right)*w\left(v_{j},v_{k}\right)}{\sqrt{max\left\{ K_{v_{i}},K_{v_{j}}\right\}}} $$
(2)
where \(w(v_{i},v_{j})=1\) if \(v_{i}\) and \(v_{j}\) are directly connected or if \(v_{i}=v_{j}\), and \(w(v_{i},v_{j})=0\) otherwise. \(K_{v_{i}}\) denotes the total weight of the edges incident to \(v_{i}\), and we use the larger of \(K_{v_{i}}\) and \(K_{v_{j}}\) to suppress the hub node effect. The value of \(LRW_{v_{i}v_{j}}\) is derived by Eq. 3 according to [21].
$$ LRW_{v_{i}v_{j}}(t)=\frac{K_{v_{i}}}{M}{\pi}_{v_{i}v_{j}}(t)+\frac{K_{v_{j}}}{M}{\pi}_{v_{j}v_{i}}(t) $$
(3)
where M is the number of links in the network, \({\pi}_{v_{i}v_{j}}(t)\) is the \(v_{j}\)-th entry of \(\boldsymbol{\pi}_{v_{i}}(t)\), and \(\boldsymbol{\pi}_{v_{i}}(t)\) is calculated by
$$ \boldsymbol{\pi}_{v_{i}}(t+1)=\boldsymbol{P^{T}\pi}_{v_{i}}(t) $$
(4)
in which \(\boldsymbol{\pi}_{v_{i}}(0)\) is an N×1 vector (N is the number of nodes in the network) whose \(v_{i}\)-th entry is equal to 1 and whose other entries are 0. P is the transition probability matrix, with \({P}_{v_{i}v_{j}}=a_{v_{i}v_{j}}/k_{v_{i}}\) representing the probability that a random walker at node \(v_{i}\) will walk to \(v_{j}\) in the next step, where \(a_{v_{i}v_{j}}\) equals 1 if \(v_{i}\) and \(v_{j}\) are connected and 0 otherwise, and \(k_{v_{i}}\) is the degree of \(v_{i}\).
For two connected nodes, we use DN, which emphasizes the similarity of common direct neighbors, to calculate the similarity between them. Because hub nodes are connected to more nodes, this approach may give them inappropriately high ranks even though they are not necessarily the most directly similar genes, so we normalize by the maximum weight to penalize hub nodes. For two indirectly connected nodes, we use LRW. One difficulty with general random-walk-based similarity measures is that they depend sensitively on parts of the network far away from the source nodes [30]. For example, the walker has a certain probability of going too far away from a source node even when the source and target nodes are in reality close to each other. The LRW method counteracts this dependence and assigns high similarity scores to the target node and the nodes nearby. In addition, the t-step Local Random Walk algorithm has lower computational complexity than other random-walk-based algorithms and is suitable for large-scale, sparse networks [21]. As most disease genes connect with each other through a limited number of steps and as the PPI network is a large-scale yet sparse network [31, 32], LRW is an efficient way to calculate the similarity between genes in the PPI network.
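As an illustration of Eqs. 1 to 4, the following is a minimal sketch rather than the implementation used in this work. It assumes a binary, symmetric adjacency matrix with a zero diagonal and no isolated nodes, takes \(K_{v}\) to be the row sum of that matrix, and takes M to be the number of undirected links. The function names, the toy network, and the choice t = 2 are hypothetical.

```python
import numpy as np

def dn_similarity(A, i, j):
    """Eq. 2, assuming a binary symmetric adjacency matrix A with zero diagonal."""
    W = A + np.eye(A.shape[0])       # w(v, v) = 1, so Gamma_v contains v itself
    K = A.sum(axis=1)                # K_v: total edge weight incident to v (assumption)
    shared = float(W[i] @ W[j])      # sum of w(i,k) * w(j,k); zero outside the shared neighborhood
    return shared / np.sqrt(max(K[i], K[j]))

def lrw_similarity(A, i, j, t):
    """Eqs. 3 and 4: t-step local random walks started at i and at j."""
    n = A.shape[0]
    K = A.sum(axis=1)
    M = A.sum() / 2.0                # number of links; each link appears twice in A
    P = A / K[:, None]               # P[u, v] = a_uv / k_u; assumes no isolated node
    pi_i = np.zeros(n)
    pi_i[i] = 1.0                    # walker starts at i
    pi_j = np.zeros(n)
    pi_j[j] = 1.0                    # walker starts at j
    for _ in range(t):
        pi_i = P.T @ pi_i            # Eq. 4
        pi_j = P.T @ pi_j
    return K[i] / M * pi_i[j] + K[j] / M * pi_j[i]

def topological_similarity(A, i, j, t):
    """Eq. 1: DN for directly connected pairs, LRW otherwise."""
    if A[i, j] > 0:
        return dn_similarity(A, i, j)
    return lrw_similarity(A, i, j, t)

# Hypothetical toy network with links 0-1, 0-2, 1-2 and 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(topological_similarity(A, 0, 1, t=2))  # connected pair, scored by DN
print(topological_similarity(A, 0, 3, t=2))  # unconnected pair, scored by LRW
```

Because both scores depend only on the network, they can be computed once for all pairs and reused across diseases.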
For a given disease d, if \(S_{d_{k}}\) denotes the set of known genes, then the probability that a new candidate gene \(v_{c}\) is a causal gene is evaluated by the sum of the similarity scores between the candidate gene and all known genes, as shown in Eq. 5:
$$ Score_{v_{c}}=\sum_{v_{i}\in{S_{d_{k}}}}Sim(v_{i},v_{c}) $$
(5)
After calculating the total score, we rank the candidate genes of the given disease by their total scores. Figure 1 shows the flow chart of using DIGNiFI to prioritize disease-causing genes for a query disease. As the similarity scores of DN and LRW can be pre-calculated, the complexity of ranking candidate genes for a new disease depends only on the number of known genes.
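The scoring and ranking step of Eq. 5 can be sketched as follows, assuming the pairwise similarities have already been computed and stored in a matrix. The function name, the matrix values, and the gene indices are hypothetical and serve only as an example.

```python
import numpy as np

def rank_candidates(S, known, candidates):
    """Eq. 5: score each candidate by summing its similarity to every known gene.

    S is a precomputed similarity matrix; known and candidates are lists of
    row/column indices into S.
    """
    scores = {c: float(S[known, c].sum()) for c in candidates}
    # Highest total score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical precomputed similarities for four genes.
S = np.array([[0.0, 0.9, 0.5, 0.1],
              [0.9, 0.0, 0.4, 0.3],
              [0.5, 0.4, 0.0, 0.2],
              [0.1, 0.3, 0.2, 0.0]])
ranking = rank_candidates(S, known=[0, 1], candidates=[2, 3])
print(ranking)  # gene 2 ranks above gene 3
```

With the matrix precomputed, each candidate costs one sum over the known genes, which matches the complexity argument above.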
Integration with biological resources
It is well known that PPI data contain various false-positive and false-negative links. Therefore, integrating different data resources with PPI data should reduce the bias of using PPI data as a single resource and increase the ability of the PPI network to prioritize disease-causing genes. Recent research has demonstrated that genes with similar phenotypes often share common molecular signatures such as biological function, as measured by GO annotations [8]. Also, protein complex data is distinct from PPI network data, with clear, biologically relevant distinctions. For example, PEX26, PEX16 and PEX3 are three causal genes of Zellweger Syndrome. These genes do not have any direct interaction in the PPI network, but they do form a real protein complex. Hence, we integrate GO annotations and protein complex data to further improve our method.
Gene ontology annotation
The Gene Ontology project [33] provides a collection of well-defined biological terms for annotating genes and describing the characteristics of their gene products. GO annotation terms cover three separate fields: biological process, molecular function, and cellular component [34]. Many computational methods have used semantic similarity to calculate the similarity between two concepts in a taxonomy [35]. We employed a modification of a previous method [36] to calculate the semantic similarity of two genes by considering the number of common GO terms and how many genes those common GO terms annotate. Specifically, we calculate the similarity of two genes based on their shared GO terms, including biological process, molecular function and cellular component GO terms. We define the annotation size of a GO term as the number of genes annotated with that term. We then calculate the semantic similarity between two genes from the annotation size of their common GO terms. Thus, if two genes share a GO term with a smaller annotation size, they are considered functionally more similar.
To describe the algorithm clearly, we first give some definitions. For a given gene \(v_{i}\), suppose it is annotated with m different GO terms. \(S_{k}(v_{i})\) denotes the set of genes annotated with the GO term \(g_{k}\), whose annotation set includes \(v_{i}\), where \(1\le k\le m\). Suppose n is the number of common GO terms between genes \(v_{i}\) and \(v_{j}\), where \(n\le m\). \(S_{k}(v_{i},v_{j})\) denotes the set of genes annotated with the GO term \(g_{k}\), whose annotation set includes both \(v_{i}\) and \(v_{j}\), where \(k\le n\). Then, the semantic similarity of two genes based on GO annotations is calculated by the following formula:
$$ SimGO\left(v_{i},v_{j}\right)=-log\frac{min_{k}|S_{k}\left(v_{i},v_{j}\right)|}{|S_{max}|} $$
(6)
where \(min_{k}|S_{k}(v_{i},v_{j})|\) is the minimum size of \(S_{k}(v_{i},v_{j})\) and \(|S_{max}|\) is the maximum annotation size among all GO terms.
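A minimal sketch of Eq. 6 is given below. The natural logarithm and the value 0 for gene pairs with no shared GO term are assumptions, since neither is specified above. The function name, term labels, and gene labels are hypothetical.

```python
import math

def sim_go(annotations, gene_i, gene_j):
    """Eq. 6, with annotations mapping each GO term to its set of annotated genes."""
    # Annotation sizes of the GO terms shared by both genes.
    shared_sizes = [len(genes) for genes in annotations.values()
                    if gene_i in genes and gene_j in genes]
    if not shared_sizes:
        return 0.0                   # assumption: no shared term means no GO similarity
    s_max = max(len(genes) for genes in annotations.values())
    return -math.log(min(shared_sizes) / s_max)   # natural log is an assumption

# Hypothetical annotations: one broad term and one specific term.
annotations = {
    "term_broad": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
    "term_specific": {"g1", "g2"},
}
print(sim_go(annotations, "g1", "g2"))  # shares the specific term, so the score is high
print(sim_go(annotations, "g1", "g3"))  # shares only the broadest term, so the score is zero
```

The smallest shared annotation set drives the score, so a pair that shares a rarely used term is rated as more similar than a pair that shares only broad terms.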
Protein complex
Protein complexes are direct manifestations of the biological interconnectivity of genes. It is likely that variants of genes whose protein products form complexes together may lead to similar disease phenotypes. Indeed, protein complexes have already been successfully used to predict disease-causing genes [37, 38]. However, these approaches overlook the information in actual protein complexes because they use only complexes inferred from topological properties (neighbors or densely connected subsets). Furthermore, these previous studies did not consider any of the unique characteristics of each protein complex. Many groups have demonstrated that dense subgraphs in a PPI network generally correspond to protein complexes [39, 40], and some studies show that if the nodes of a subgraph have more internal weight (or edges) than external weight (or edges), the subgraph is more likely to form a group [41]. Thus, the density and internal weight ratio of a protein complex in a PPI network can serve as an index of the richness of protein interactions within the complex. In other words, proteins are more similar if they are in a denser protein complex. To address these two issues, we use the internal weight ratio [42] and the density to assign a network reliability score to an actual complex \(C_{k}\). The formula is shown in Eq. 7:
$$ Score(C_{k})=density(C_{k})*\frac{w^{in}(C_{k})}{w^{in}(C_{k})+w^{bound}(C_{k})} $$
(7)
where \(w^{in}(C_{k})\) is the total weight of the edges within a complex and \(w^{bound}(C_{k})\) is the total weight of the edges that connect the complex with the rest of the network. The density of a protein complex \(C_{k}\) is defined in Eq. 8:
$$ density(C_{k})=\frac{2*|E_{C_{k}}|}{|V_{C_{k}}|*(|V_{C_{k}}|-1)} $$
(8)
where \(E_{C_{k}}\) and \(V_{C_{k}}\) denote the edges and nodes in the complex, respectively. \(Score(C_{k})\) thus quantifies the richness and reliability of the interactions within \(C_{k}\).
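Eqs. 7 and 8 can be illustrated with the sketch below, which assumes a symmetric adjacency (or weight) matrix with a zero diagonal. Returning 0 for complexes with fewer than two members or with no incident edges is an assumption made to avoid division by zero. The function name and the toy network are hypothetical.

```python
import numpy as np

def complex_score(A, members):
    """Eqs. 7 and 8 for one complex, given a symmetric matrix A with zero diagonal."""
    idx = np.array(sorted(members))
    n = len(idx)
    if n < 2:
        return 0.0                                # assumption: density undefined for one node
    inside = A[np.ix_(idx, idx)]                  # sub-matrix restricted to the complex
    w_in = inside.sum() / 2.0                     # each internal edge appears twice
    n_edges = np.count_nonzero(inside) / 2.0
    w_bound = A[idx].sum() - inside.sum()         # weight of edges leaving the complex
    if w_in + w_bound == 0:
        return 0.0                                # assumption: complex absent from the network
    density = 2.0 * n_edges / (n * (n - 1))       # Eq. 8
    return density * w_in / (w_in + w_bound)      # Eq. 7

# Hypothetical toy network with links 0-1, 0-2, 1-2 and 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(complex_score(A, {0, 1, 2}))  # fully connected triangle with one boundary edge
print(complex_score(A, {2, 3}))     # one internal edge, two boundary edges
```

In this toy network both complexes have density 1, so the difference in their scores comes entirely from the internal weight ratio.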
If two genes both belong to a set M of protein complexes (here M denotes the shared complexes, not the number of links used in Eq. 3), the similarity score between them is calculated as:
$$ SimCOM(v_{i},v_{j})=\sum_{k\in M}Score(C_{k}) $$
(9)
Finally, to integrate the biological similarity (SimBio, i.e. SimGO and SimCOM) with the topological similarity (DIGNiFI, as given by Eq. 1), parameters α and β are used. The total score of a candidate gene with a known gene is calculated as:
$$ \begin{aligned} Sim\left(v_{i},v_{j}\right)=&\left(1-\alpha-\beta\right)DIGNiFI\left(v_{i},v_{j}\right)\\ &+\alpha SimGO\left(v_{i},v_{j}\right)+\beta SimCOM\left(v_{i},v_{j}\right) \end{aligned} $$
(10)
Then, a candidate gene of a query disease is ranked by summing up the similarity scores between the candidate gene and the known genes of that disease.
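Eqs. 9 and 10 can be sketched as follows. The constraint that α and β are non-negative and sum to at most 1 is an assumption, adopted so that the three weights form a convex combination. The function names, complex labels, scores, and parameter values are hypothetical.

```python
def sim_com(complexes, complex_scores, gene_i, gene_j):
    """Eq. 9: sum Score(C_k) over the complexes that contain both genes."""
    return sum(complex_scores[name]
               for name, members in complexes.items()
               if gene_i in members and gene_j in members)

def integrated_similarity(topo, go, com, alpha, beta):
    """Eq. 10: weighted combination of topological, GO and complex similarity."""
    # Assumption: the weights are non-negative and alpha + beta does not exceed 1.
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative with alpha + beta <= 1")
    return (1 - alpha - beta) * topo + alpha * go + beta * com

# Hypothetical complexes and precomputed Score(C_k) values.
complexes = {"c1": {"a", "b", "c"}, "c2": {"a", "b"}, "c3": {"c", "d"}}
complex_scores = {"c1": 0.75, "c2": 0.5, "c3": 0.2}

com = sim_com(complexes, complex_scores, "a", "b")   # genes a and b share c1 and c2
total = integrated_similarity(topo=2.0, go=1.0, com=com, alpha=0.2, beta=0.2)
print(com, total)
```

Setting α = β = 0 in this sketch recovers the purely topological similarity of Eq. 1.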