### Calculation of semantic similarity scores

We adopt three methods based on the information content of GO terms (Resnik [23], Schlicker et al. [24] and Lin [25]) and one method based on the structure of the gene ontology (Wang et al. [26]) to calculate semantic similarity scores between GO terms.

Given the gene ontology and the annotations of human genes, the probability of occurrence of a GO term *t* in annotations, *p*(*t*), is estimated as the number of times the term or any of its descendants is used in annotations divided by the total number of annotations, as

$$p(t) = \frac{n(t)}{N},$$

where $n(t)$ denotes the number of annotations that use *t* or any of its descendants and $N$ the total number of annotations.

In general, more specific terms are less frequently used in annotations and thus have lower probabilities of occurrence. A pair of terms *a* and *b* usually has more than one common ancestor in the ontology. Let $\mathcal{A}(a,b)$ be the set of all common ancestors of *a* and *b*. The probability of occurrence of the most concrete common ancestor of *a* and *b* is then calculated as

$$p_{ms}(a,b) = \min_{t \in \mathcal{A}(a,b)} p(t).$$

With these definitions, the method of Resnik [23] calculates the semantic similarity score between two terms *a* and *b* as the information content (negative logarithm of the probability) of the most concrete common ancestor of the two terms, as

$$\mathrm{sim}_{\mathrm{Resnik}}(a,b) = -\log p_{ms}(a,b).$$

The method of Lin [25] normalizes the above information content with the average information content of the two terms, as

$$\mathrm{sim}_{\mathrm{Lin}}(a,b) = \frac{2\log p_{ms}(a,b)}{\log p(a) + \log p(b)}.$$

The method of Schlicker et al. [24] further weights the above quantity with the probability of occurrence of the most concrete common ancestor of the two terms, as

$$\mathrm{sim}_{\mathrm{Schlicker}}(a,b) = \frac{2\log p_{ms}(a,b)}{\log p(a) + \log p(b)}\,\bigl(1 - p_{ms}(a,b)\bigr).$$
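
The three information-content-based scores described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy ontology `ANCESTORS`, the probabilities `P`, and all function names are hypothetical, and the sketch assumes that a term counts as its own ancestor when common ancestors are collected.

```python
import math

# Hypothetical toy ontology: each term maps to the set of its ancestors.
ANCESTORS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"root", "A"},
    "D": {"root", "A"},
}

# Hypothetical probabilities of occurrence p(t); the root covers all annotations.
P = {"root": 1.0, "A": 0.5, "B": 0.5, "C": 0.2, "D": 0.3}


def p_ms(a, b):
    """Probability of the most concrete (least probable) common ancestor.

    Assumption of this sketch: each term is included among its own ancestors.
    """
    common = (ANCESTORS[a] | {a}) & (ANCESTORS[b] | {b})
    return min(P[t] for t in common)


def sim_resnik(a, b):
    """Information content of the most concrete common ancestor."""
    return -math.log(p_ms(a, b))


def sim_lin(a, b):
    """Resnik score normalized by the average information content of a and b."""
    denom = math.log(P[a]) + math.log(P[b])
    if denom == 0.0:
        # Both terms have p = 1 (e.g., the root); the ratio is undefined,
        # and this sketch returns 0.0 by convention.
        return 0.0
    return 2.0 * math.log(p_ms(a, b)) / denom


def sim_schlicker(a, b):
    """Lin score weighted by (1 - p_ms), penalizing very general ancestors."""
    return sim_lin(a, b) * (1.0 - p_ms(a, b))
```

For the sibling terms `"C"` and `"D"`, the most concrete common ancestor is `"A"` with probability 0.5, so `sim_resnik("C", "D")` equals log 2 and the Schlicker score is half the Lin score.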

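Unlike the above methods, which rely on annotations of genes, the method of Wang et al. [26] depends only on the structure of the gene ontology to calculate semantic similarity scores between GO terms. Let *t* be a GO term, $A_t$ the set of its ancestors and $C_t$ the set of its children in the GO structure. Wang et al. iteratively calculate an *s*-value for every term $a \in A_t \cup \{t\}$ to measure the contribution of *a* to the semantics of *t*, as

$$S_t(a) = \begin{cases} 1, & a = t, \\ \max\limits_{x \in C_a} \, w_e \, S_t(x), & a \neq t, \end{cases}$$

where *x* ranges over the children of *a* that are either *t* itself or ancestors of *t*, and the weight factor $w_e$ = 0.8 if *x* and *a* have the “is_a” relationship and $w_e$ = 0.6 if *x* and *a* have the “part_of” relationship. Then, a semantic value for a term *t* is calculated as

$$SV(t) = \sum_{a \in A_t \cup \{t\}} S_t(a).$$

Finally, the semantic similarity score between two terms *a* and *b* is calculated as

$$\mathrm{sim}_{\mathrm{Wang}}(a,b) = \frac{\sum_{t \in (A_a \cup \{a\}) \cap (A_b \cup \{b\})} \bigl(S_a(t) + S_b(t)\bigr)}{SV(a) + SV(b)}.$$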
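
A minimal sketch of the structure-based score of Wang et al. as described above, under stated assumptions: the toy DAG `PARENTS` and the function names are hypothetical, and the *s*-values are propagated upward from the term to its ancestors, keeping the maximum over all paths.

```python
# Hypothetical toy DAG: each term maps to {parent: relationship type}.
PARENTS = {
    "root": {},
    "A": {"root": "is_a"},
    "B": {"root": "is_a"},
    "C": {"A": "is_a", "B": "part_of"},
    "D": {"A": "part_of"},
}

# Edge weights stated in the text.
WEIGHT = {"is_a": 0.8, "part_of": 0.6}


def s_values(t):
    """Return {a: S_t(a)} for t itself and every ancestor a of t."""
    s = {t: 1.0}
    frontier = [t]
    while frontier:
        x = frontier.pop()
        for parent, rel in PARENTS[x].items():
            candidate = WEIGHT[rel] * s[x]
            # Keep the largest contribution over all children of `parent`;
            # revisit the parent whenever its value improves.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def semantic_value(t):
    """SV(t): sum of the s-values of t and all of its ancestors."""
    return sum(s_values(t).values())


def sim_wang(a, b):
    """Shared s-values of a and b, normalized by SV(a) + SV(b)."""
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))
```

In this toy DAG, `"C"` and `"D"` share the ancestors `"A"` and `"root"`, and a term compared with itself scores 1.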

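With the semantic similarity scores between GO terms calculated by any of the above methods, we calculate the semantic similarity between two genes as follows. The semantic similarity score between a GO term *t* and a set of GO terms $T$ is calculated as

$$\mathrm{sim}(t, T) = \max_{t' \in T} \mathrm{sim}(t, t').$$

The semantic similarity score between two sets of GO terms $T_1$ and $T_2$ is calculated as

$$\mathrm{sim}(T_1, T_2) = \frac{\sum_{t \in T_1} \mathrm{sim}(t, T_2) + \sum_{t \in T_2} \mathrm{sim}(t, T_1)}{|T_1| + |T_2|}.$$

Let *g* and *g′* be two genes. Let $T_g$ and $T_{g'}$ be the two sets of GO terms with which *g* and *g′* are annotated, respectively. The semantic similarity between *g* and *g′* is then calculated as

$$\mathrm{sim}(g, g') = \mathrm{sim}(T_g, T_{g'}).$$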
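
The step from term-level to gene-level similarity can be sketched as below. This is an illustration under assumptions rather than the authors' code: it takes the best-match form (maximum over a set for a single term, then the average of best matches in both directions), and the lookup table `TERM_SIM`, the helper `toy_term_sim`, and the other names are hypothetical.

```python
# Hypothetical symmetric term-level similarity scores for three GO terms.
TERM_SIM = {
    frozenset(("x", "y")): 0.5,
    frozenset(("x", "z")): 0.2,
    frozenset(("y", "z")): 0.4,
}


def toy_term_sim(a, b):
    """Toy term similarity: 1 for identical terms, table lookup otherwise."""
    return 1.0 if a == b else TERM_SIM[frozenset((a, b))]


def sim_term_set(t, terms, term_sim):
    """Similarity between a term and a set of terms: the best match."""
    return max(term_sim(t, u) for u in terms)


def sim_sets(terms1, terms2, term_sim):
    """Average best-match similarity over both sets of terms."""
    total = sum(sim_term_set(t, terms2, term_sim) for t in terms1)
    total += sum(sim_term_set(t, terms1, term_sim) for t in terms2)
    return total / (len(terms1) + len(terms2))


def sim_genes(g1, g2, annotations, term_sim):
    """Similarity between two genes via the GO terms annotating them."""
    return sim_sets(annotations[g1], annotations[g2], term_sim)
```

Any of the term-level scores (Resnik, Lin, Schlicker et al., Wang et al.) could be passed in as `term_sim`; the toy table is used here only to keep the example self-contained.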

Applying the above method to every pair of genes, we obtain a pairwise semantic similarity matrix of genes. This matrix can be viewed as the weight matrix of a fully connected network whose vertices are genes and whose edge weights are the semantic similarity scores between genes. However, such a fully connected network may contain a large number of low-confidence edges between gene pairs with low semantic similarity scores. We therefore filter out low-weight edges by introducing a threshold *κ* (100 by default in this paper) and keeping, for each gene, only the edges to its *κ* nearest neighbors, i.e., the *κ* genes with the highest similarity scores. By doing this, we obtain a gene semantic similarity network.
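
The nearest-neighbor filtering can be sketched as follows. The function name `knn_filter` is hypothetical, and two details are assumptions of this sketch because the text does not specify them: a gene is not counted as its own neighbor, and the filtered matrix is not symmetrized afterwards.

```python
def knn_filter(sim, k):
    """Keep, in each row, only the k largest off-diagonal similarity scores.

    `sim` is a square matrix given as a list of lists; all other entries of
    the returned matrix are set to 0. Ties are resolved by column order.
    """
    n = len(sim)
    filtered = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Stable sort: among equal scores, the lower column index comes first.
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            filtered[i][j] = sim[i][j]
    return filtered
```

Because each row is filtered independently, gene *i* may keep an edge to gene *j* while *j* drops its edge to *i*; a symmetric network would need an additional rule, which the text does not state.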

### Prioritization of candidate genes

The random walk with restart on the heterogeneous network model [17] is one of the state-of-the-art methods that combine a disease similarity network with a protein-protein interaction network to prioritize candidate genes. This model simulates a random walker wandering on a heterogeneous network composed of a phenotype similarity network, a protein-protein interaction network, and known associations between diseases and genes. In each step of the process, the random walker may start a new journey with probability *γ* or move on with probability 1 – *γ*. When starting a new journey, the walker chooses the query disease of interest as the starting point with probability *η* or a seed gene known to be associated with the query disease with probability 1 – *η*. When moving on, the walker jumps from the disease similarity network to the protein-protein interaction network or vice versa with probability *λ*, or wanders within its current network (the disease network or the protein-protein interaction network) with probability 1 – *λ*. When wandering, the walker moves at random to one of its direct neighbors.

In this model, the protein-protein interaction network serves as a simplified yet systematic view of functional relationships among genes. Since a gene semantic similarity network also provides a means of measuring functional relationships among genes, conceptually we can also use a gene semantic similarity network with the phenotype similarity network to infer disease genes. Following the literature [17], we use the following random walk with restart model on the heterogeneous network that is composed of a phenotype similarity network, a gene semantic similarity network, and known associations between diseases and genes.

We represent the phenotype similarity network using a weight matrix $\mathbf{D} = (d_{ij})_{m \times m}$, where *m* denotes the number of diseases and $d_{ij}$ the similarity score between the *i*-th disease and the *j*-th disease. By normalizing each row of this matrix, we obtain a transition matrix $\mathbf{U} = (u_{ij})_{m \times m}$, where $u_{ij} = d_{ij} / \sum_{k} d_{ik}$, representing the probability that a random walker moves from the *i*-th disease to the *j*-th disease.

We represent the gene semantic similarity network using a weight matrix $\mathbf{G} = (g_{ij})_{n \times n}$, where *n* denotes the number of genes and $g_{ij}$ the similarity score between the *i*-th gene and the *j*-th gene. By normalizing each row of this matrix, we obtain a transition matrix $\mathbf{V} = (v_{ij})_{n \times n}$, where $v_{ij} = g_{ij} / \sum_{k} g_{ik}$, representing the probability that a random walker moves from the *i*-th gene to the *j*-th gene.

We represent known associations between diseases and genes using an adjacency matrix $\mathbf{A} = (a_{ij})_{m \times n}$, where $a_{ij} = 1$ indicates that the *j*-th gene is known to be associated with the *i*-th disease, and $a_{ij} = 0$ otherwise. By normalizing each row of this matrix, we obtain a transition matrix $\mathbf{R} = (r_{ij})_{m \times n}$, where $r_{ij} = a_{ij} / \sum_{k} a_{ik}$, representing the probability that a random walker jumps from the *i*-th disease to the *j*-th gene. Note that we define $r_{ij} = 0$ when $\sum_{k} a_{ik} = 0$, i.e., when no gene is known to be associated with the *i*-th disease. Similarly, by normalizing each row of the transpose of the matrix **A**, we obtain a transition matrix $\mathbf{S} = (s_{ij})_{n \times m}$, where $s_{ij} = a_{ji} / \sum_{k} a_{ki}$, representing the probability that a random walker jumps from the *i*-th gene to the *j*-th disease. We also define $s_{ij} = 0$ when $\sum_{k} a_{ki} = 0$, i.e., when the *i*-th gene is not associated with any disease.

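With the above four transition matrices, we define

$$\tilde{\mathbf{W}} = \begin{pmatrix} (1-\lambda)\,\mathbf{U} & \lambda\,\mathbf{R} \\ \lambda\,\mathbf{S} & (1-\lambda)\,\mathbf{V} \end{pmatrix}$$

and further normalize every row of this matrix to obtain the transition matrix of the heterogeneous network $\mathbf{W} = (w_{ij})$, where $w_{ij} = \tilde{w}_{ij} / \sum_{k} \tilde{w}_{ik}$ and $\tilde{w}_{ij}$ denotes the entry of $\tilde{\mathbf{W}}$ in row *i* and column *j*. The parameter *λ* is the probability that the random walker jumps from the disease similarity network to the gene semantic similarity network or vice versa.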
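
The construction of the heterogeneous transition matrix can be sketched as below, assuming a block layout with diseases first and genes second, so that the weights are (1 – *λ*) within a network and *λ* between networks. The function names and the toy matrices in the usage note are hypothetical; matrices are plain lists of lists.

```python
def row_normalize(matrix):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


def transpose(matrix):
    """Transpose a rectangular matrix given as a list of lists."""
    return [list(column) for column in zip(*matrix)]


def build_transition(D, G, A, lam=0.7):
    """Assemble and row-normalize the (m + n) x (m + n) transition matrix.

    D: m x m disease similarities, G: n x n gene similarities,
    A: m x n binary disease-gene associations, lam: jumping probability.
    """
    U, V = row_normalize(D), row_normalize(G)
    R, S = row_normalize(A), row_normalize(transpose(A))
    top = [[(1 - lam) * u for u in U[i]] + [lam * r for r in R[i]]
           for i in range(len(D))]
    bottom = [[lam * s for s in S[i]] + [(1 - lam) * v for v in V[i]]
              for i in range(len(G))]
    # The final normalization matters for vertices without cross-network
    # links: their rows would otherwise sum to 1 - lam instead of 1.
    return row_normalize(top + bottom)
```

For example, with `D = [[1, 0.5], [0.5, 1]]`, a 3 x 3 gene matrix, and `A = [[1, 0, 0], [0, 0, 0]]`, the second disease has no known gene, so its row in the result reduces to its normalized disease similarities.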

When the random walker starts in the disease similarity network, we let it start from the query disease; the initial probability is therefore 1 for the query disease and 0 for all other diseases. We use a vector $\mathbf{u}^{(0)}$ to represent these probabilities. When the random walker starts in the gene similarity network, we let it start at random from one of the genes known to be associated with the query disease; the initial probability is therefore 1/*s* for every seed gene (supposing there are *s* seed genes in total) and 0 for all other genes. We use a vector $\mathbf{v}^{(0)}$ to represent these probabilities. Let *η* be the probability that the random walker starts from the disease similarity network. We then have the initial probability vector

$$\mathbf{p}^{(0)} = \begin{pmatrix} \eta\,\mathbf{u}^{(0)} \\ (1-\eta)\,\mathbf{v}^{(0)} \end{pmatrix}.$$

Finally, let $\mathbf{p}^{(t)}$ be the vector of probabilities of finding the random walker at each vertex of the heterogeneous network at step *t*. We have

$$\mathbf{p}^{(t+1)} = (1-\gamma)\,\mathbf{W}^{\mathsf{T}}\,\mathbf{p}^{(t)} + \gamma\,\mathbf{p}^{(0)}.$$

After a number of steps, the probabilities reach a steady state. This is obtained by performing the iteration until the difference between $\mathbf{p}^{(t)}$ and $\mathbf{p}^{(t+1)}$ is sufficiently small, i.e., until the $L_1$ norm of $\Delta\mathbf{p} = \mathbf{p}^{(t+1)} - \mathbf{p}^{(t)}$ is less than a small positive number *ε*. The steady-state probability $\mathbf{p}^{(\infty)}$ then gives a measure of the strength of association of each gene with the query disease of interest, and we can then rank candidate genes according to their steady-state probabilities.
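
A minimal sketch of the iteration, assuming a row-stochastic transition matrix `W` (list of lists) with disease vertices indexed before gene vertices. The function names and the iteration cap `max_iter` are additions of this sketch, not part of the model described above.

```python
def initial_vector(m, n, query, seeds, eta=0.5):
    """Initial probabilities over m diseases followed by n genes.

    `query` is the index of the query disease and `seeds` the indices of its
    known genes. If `seeds` is empty, the vector sums to eta rather than 1.
    """
    p0 = [0.0] * (m + n)
    p0[query] = eta
    for g in seeds:
        p0[m + g] = (1.0 - eta) / len(seeds)
    return p0


def random_walk_with_restart(W, p0, gamma=0.5, eps=1e-4, max_iter=1000):
    """Iterate p <- (1 - gamma) * W^T p + gamma * p0 until the L1 change < eps."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [
            gamma * p0[j] + (1.0 - gamma) * sum(W[i][j] * p[i] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < eps:
            return nxt
        p = nxt
    return p  # reached only if the tolerance is not met within max_iter steps
```

Candidate genes would then be ranked by the gene entries of the returned vector, i.e., the entries at positions *m* and above.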

It has been shown that the random walk model is not sensitive to the parameters involved in the model [17]. Hence, we follow the literature [17] and set the parameters by default to *λ* = 0.7, *η* = 0.5, *γ* = 0.5 and *ε* = 10^{–4}.

### Validation methods and evaluation criteria

We perform three large-scale leave-one-out cross-validation experiments to examine the performance of the proposed method in prioritizing genes that are known to be associated with certain diseases (i.e., disease genes) from a set of candidates. First, in the validation against a linkage interval, we take a known association between a gene and a disease in each run, assume the association is unknown, and prioritize the gene against a set of 99 control genes that are located nearest to the disease gene according to their genomic distance on the same chromosome. Second, in the validation against random genes, we select the control genes in each validation run as 99 (or 999) genes chosen at random from all genes in the gene semantic similarity network. Third, in the genome-wide scan of disease genes, we take all genes in the gene semantic similarity network as the control genes in each validation run.

We use two measures to evaluate the performance of the proposed method. Taking the cross-validation against a linkage interval as an example, after each validation run we obtain a score (the steady-state probability) for each candidate gene and rank the genes by their scores (ties are broken by assigning ranks at random to genes with equal scores) to obtain a ranked list of candidate genes. We then calculate the rank ratio of each candidate gene by dividing its rank by the number of candidate genes in the list. For a set of validation runs, we calculate the following two measures. First, we calculate the mean rank ratio (MRR) as the average of the rank ratios of all disease genes in the validation runs. Second, given a threshold on the rank ratio, we calculate the sensitivity as the fraction of disease genes ranked above the threshold and the specificity as the fraction of control genes ranked below the threshold. Varying the threshold from 0.0 to 1.0, we draw a receiver operating characteristic (ROC) curve and calculate the area under this curve (AUC). A smaller MRR and a larger AUC indicate better performance of a prioritization method.
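
The two measures can be sketched as follows. The function names are hypothetical, and one detail is an assumption of this sketch: a gene counts as "ranked above the threshold" when its rank ratio is less than or equal to the threshold. The AUC is computed with the trapezoidal rule over all observed rank ratios.

```python
import random


def rank_ratio(scores, target, rng=random):
    """Rank ratio of `target` among the candidates in `scores` (gene -> score).

    Ties are broken at random: shuffle first, then apply a stable sort.
    """
    order = list(scores)
    rng.shuffle(order)
    order.sort(key=lambda gene: scores[gene], reverse=True)
    return (order.index(target) + 1) / len(order)


def mean_rank_ratio(disease_rank_ratios):
    """MRR: average rank ratio of the disease genes over all validation runs."""
    return sum(disease_rank_ratios) / len(disease_rank_ratios)


def auc(disease_rr, control_rr):
    """Area under the ROC curve obtained by sweeping a rank-ratio threshold."""
    thresholds = sorted({0.0, 1.0, *disease_rr, *control_rr})
    points = []
    for th in thresholds:
        sensitivity = sum(r <= th for r in disease_rr) / len(disease_rr)
        # 1 - specificity: fraction of control genes at or above the threshold.
        false_positive_rate = sum(r <= th for r in control_rr) / len(control_rr)
        points.append((false_positive_rate, sensitivity))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

In a single run with four candidates, a disease gene ranked first gives an AUC of 1 and a disease gene ranked last gives an AUC of 0 under this convention.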