Constructing a gene semantic similarity network for the inference of disease genes
© Jiang et al; licensee BioMed Central Ltd. 2011
- Published: 14 December 2011
The inference of genes that are truly associated with inherited human diseases from a set of candidates resulting from genetic linkage studies has been one of the most challenging tasks in human genetics. Although several computational approaches have been proposed to prioritize candidate genes relying on protein-protein interaction (PPI) networks, these methods can usually cover less than half of known human genes.
We propose to rely on the biological process domain of the gene ontology to construct a gene semantic similarity network and then use the network to infer disease genes. We show that the constructed network covers about 50% more genes than a typical PPI network. By analyzing the gene semantic similarity network with the PPI network, we show that gene pairs tend to have higher semantic similarity scores if the corresponding proteins are closer to each other in the PPI network. By analyzing the gene semantic similarity network with a phenotype similarity network, we show that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. We further use the gene semantic similarity network with a random walk with restart model to infer disease genes. Through a series of large-scale leave-one-out cross-validation experiments, we show that the gene semantic similarity network can achieve not only higher coverage but also higher accuracy than the PPI network in the inference of disease genes.
- Gene Ontology
- Semantic Similarity
- Similarity Network
- Human Protein Reference Database
- Prioritization Method
Notwithstanding the remarkable success of statistical methods such as linkage analysis and association studies in identifying genetic variants underlying inherited human diseases in the past few decades, susceptibility genomic regions obtained by these methods may contain dozens or even hundreds of candidate genes. This calls for effective computational methods to infer, from a long list of candidates, the genes that are truly associated with a query disease of interest.
In the face of this challenge, several methods have been proposed to score genes in a candidate list according to their functional relevance to the genes that are already known to be associated with the query disease (i.e., seed genes) and then prioritize the candidates according to their scores. The basic assumption of these methods, which is typically referred to as the “guilt-by-direct-association” principle, is that genes associated with a disease should have similar functions. It is therefore crucial for these methods to estimate functional similarity between genes. For this purpose, a wide variety of genomic information has been adopted, with examples including protein sequences, gene expression profiles, literature descriptions, protein-protein interactions (PPI), gene ontology annotations, and many others. Methods using multiple genomic data sources have also been proposed [9, 10].
Depending on seed genes to prioritize candidate genes restricts the scope of application of the above methods, because the genetic bases of about half of the known human diseases are completely unknown according to the Online Mendelian Inheritance in Man (OMIM) database. To overcome this limitation, recent studies have suggested the “guilt-by-indirect-association” principle, which relies on the modular nature of inherited human diseases [8, 12] and resorts to a phenotype similarity network of diseases to prioritize candidate genes [14–17, 20]. These methods successfully extend the scope of prioritizing candidate genes to diseases whose genetic bases are completely unknown.
However, all methods based on the “guilt-by-indirect-association” principle thus far are designed to be used with one or more protein-protein interaction networks. For example, Wu et al. used a linear regression model to explain phenotype similarity using protein network proximity. Zhang et al. extended the regression model to include multiple protein-protein interaction networks. Li and Patra utilized a random walk model to simulate the steady-state probability of a random walker staying at a gene. Although a protein-protein interaction network could provide a simplified yet systematic view of functional relationships between genes, the coverage of available protein-protein interaction networks is typically low, and the reliability of different protein-protein interaction networks varies considerably, making the selection of a suitable network far from trivial. Moreover, focusing on common interactions in multiple networks to improve the confidence of edges will sacrifice the coverage of the resulting network, while focusing on the union of interactions to improve the coverage will result in a network of low reliability.
Motivated by these observations, we propose to construct a gene semantic similarity network using the biological process domain of gene ontology and GO annotations of human genes. We show that the gene semantic similarity network covers 14,085 genes, about 50% more genes than the widely used Human Protein Reference Database (HPRD) protein-protein interaction network. Via a comprehensive analysis of the gene semantic similarity network with the HPRD network, we show that gene pairs tend to have higher semantic similarity scores if the corresponding proteins are closer to each other in the HPRD network. Through a detailed analysis of the gene semantic similarity network with a phenotype similarity network, we show that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. We further use the gene semantic similarity network with a random walk with restart model to infer disease genes. Through a series of large-scale leave-one-out cross-validation experiments, we show that the gene semantic similarity network can achieve not only higher coverage but also higher accuracy than the HPRD network in the inference of disease genes. With these results, we conjecture that the gene semantic similarity network can serve as a better assessment of functional relationship between genes and then be used in a large number of applications in systems biology.
We propose to prioritize candidate genes using 1) a gene semantic similarity network that is constructed using the biological process (BP) domain of the gene ontology (GO) and known GO annotations of human proteins, 2) a phenotype similarity network of human diseases, and 3) known associations between diseases and genes.
First, we extract 18,850 GO terms in the biological process domain from the gene ontology (released on April 18, 2010) and extract 186,080 annotations of human proteins from the UniProtKB GO annotations of human proteins (released on April 18, 2010). Focusing on proteins with corresponding gene identifiers in the Ensembl database, we obtain 59,681 annotations that involve 14,085 human genes and 5,596 GO terms.
Second, we obtain a phenotype similarity profile, represented as a matrix of similarity scores between 5,080 human diseases, from the literature. Since most small similarity scores in this profile are likely to be noise and only high scores have clear biological meanings, we follow the literature to keep the first five nearest neighbors for each disease and obtain a phenotype similarity network, in which vertices are human diseases and weighted edges indicate similarity scores between diseases.
Finally, we use the high-quality Human Protein Reference Database (HPRD) to demonstrate the relationship between a gene semantic similarity network and a protein-protein interaction network. After removing duplications and self-linked interactions, we extract from release 9 (released on April 13, 2010) of this database 37,067 interactions between 9,518 human genes.
Construction of gene semantic similarity networks
We adopt three methods based on information contents of GO terms (Resnik, Schlicker et al. and Lin) and one method based on the structure of gene ontology (Wang et al.) to calculate similarity scores for GO terms, and we use a method in the literature to calculate similarity scores for genes (see Methods for details). Hence, we obtain four semantic similarity networks, each containing 14,085 human genes.
Gene semantic similarity correlates with protein network proximity
There have been a few methods relying on protein-protein interaction networks to infer disease genes. The basic assumption of these methods is that interacting proteins are usually related in their functions, and thus the proximity of two proteins in a protein-protein interaction network can be used as an estimation of the functional relationship between the corresponding genes. Therefore, we first show that the similarity score between two genes in a gene semantic similarity network correlates with the proximity score of the corresponding proteins in a protein-protein interaction network.
These results suggest that gene semantic similarity scores are correlated with protein proximity scores. Hence, given the successful applications of protein-network-based methods [14–20], it is reasonable to use gene semantic similarity networks for the inference of disease genes.
Gene semantic similarity implies disease phenotype similarity
We use gene semantic similarity scores calculated using the method of Resnik as an example to demonstrate the relationship between gene semantic similarity and disease phenotype similarity (Figure 3:A). In group 1, we look at each disease separately. We collect genes that are associated with a disease, plot pairwise semantic similarity scores of these genes, and obtain a median semantic similarity score of 0.1945 for this group of genes. In group 2, we look at the nearest neighbor (the disease with the highest similarity score) of each disease in the disease similarity network. We collect genes associated with a disease and genes associated with the nearest neighbor of the disease, and we obtain a median pairwise semantic similarity score of 0.1635 for this group of genes. In group 3, we look at the second nearest neighbor of each disease in the disease similarity network. We collect genes that are associated with a disease and its second nearest neighbor, and we obtain a median pairwise semantic similarity score of 0.1486 for this group of genes. Similarly, in groups 4, 5 and 6, we look at the third, fourth and fifth nearest neighbor of each disease, respectively, and we obtain median pairwise semantic similarity scores of 0.1441, 0.1394 and 0.1383 for the corresponding groups of genes, respectively. Finally, in group 7, we look at 10,000 pairs of genes that are selected at random, and we obtain a median pairwise semantic similarity score of 0.0649.
These results demonstrate that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. In other words, semantic similarity of genes implies phenotype similarity of the diseases with which the genes are associated.
Gene semantic similarity networks improve the accuracy in prioritizing candidate genes
Table 1. Performance of the semantic similarity networks and the HPRD network in the validation experiments. Candidate genes are selected from the overlap of the semantic similarity and the HPRD networks.
We use the gene semantic similarity network constructed using the method of Resnik as an example to demonstrate the performance of the proposed approach. At the threshold κ = 100, we obtain a network composed of 14,085 genes and 2,112,750 edges. Taking the overlap of genes in this network and those in the HPRD database, we obtain 8,286 genes. Focusing on these genes, we obtain 2,397 associations between 1,572 diseases and 1,391 genes. We then perform the leave-one-out cross-validation experiment against a linkage interval and obtain a mean rank ratio (MRR) of disease genes of 10.60% and an area under the receiver operating characteristic curve (AUC) of 90.30%. We further perform the validation experiment against random genes and obtain an MRR of 10.65% and an AUC of 90.25%. Since a random guess will yield an MRR of 50% and an AUC of 50%, these results clearly suggest the effectiveness of relying on the semantic similarity network to uncover disease genes. For gene semantic similarity networks constructed using the other methods, we obtain similar results (Table 1).
We replace the gene semantic similarity network with the HPRD network and repeat the experiments. In the validation of a linkage interval, we obtain an MRR of 14.21% and an AUC of 86.65%. In the validation of random genes, we obtain an MRR of 14.40% and an AUC of 86.46%. We further plot the ROC curves of the validation results in Figure 4, from which we observe that the curves for the gene semantic similarity networks climb much faster towards the top left corner of the plot than that for the HPRD network. From these results, we conclude that the gene semantic similarity networks are superior to the HPRD network in the prioritization of candidate genes.
Gene semantic similarity networks improve the coverage in prioritizing candidate genes
The reliability and coverage of existing protein-protein interaction data sets are quite different. Focusing on common interactions in these data sets to improve the confidence will sacrifice the coverage; considering the union of interactions to improve the coverage will result in a network of low reliability. A gene semantic similarity network, however, can cover a large proportion of human genes while providing highly accurate inference of disease genes.
Table 2. Performance of the semantic similarity networks in the validation experiments. Candidate genes are selected from the semantic similarity networks.
We further increase the number of random genes in each validation run to 999 and find that the AUC only drops slightly to 90.36%, suggesting that the prioritization method is not sensitive to the number of control genes in validation. With this understanding, we pursue the more ambitious goal of a genome-wide scan for disease genes and obtain an MRR of 10.16% and an AUC of 90.10% in uncovering the disease genes from all 14,085 genes in the gene semantic similarity network.
We also notice that relying on semantic similarity networks constructed using the other methods (with default threshold values) yields results similar to those analyzed above (Table 2).
In this paper, we have proposed to rely on the biological process domain of gene ontology and GO annotations of human genes to construct a semantic similarity network of genes, and then use the network with a phenotype similarity network of diseases to infer genes that are associated with a query disease of interest.
The main objective of this research is to overcome one of the shortcomings of existing protein-protein interaction networks, i.e., the low coverage. The constructed gene semantic similarity network covers 14,085 genes, about 50% more than the widely used HPRD network. More importantly, as demonstrated in our comprehensive analysis, the improvement in coverage is accompanied by the gain in accuracy in the inference of disease genes. Hence, the gene semantic similarity network can serve as a better assessment of functional relationship between genes and then be used in a large number of applications in systems biology.
The filtration of low semantic similarity scores is important to the success of the proposed approach. We currently achieve this goal by keeping the first κ nearest neighbors of each gene. Alternatively, we can introduce a threshold and discard all edges whose weight (similarity score) is less than the threshold. According to our experiments, this alternative strategy is likely to yield a disconnected network and thus adversely affect the performance of a prioritization method relying on the network. Therefore, we resort to the nearest neighbor strategy to filter out low semantic similarity scores.
Certainly, our research can be further improved in the following aspects. First, although we have focused on the biological process domain in this paper, it is conceptually straightforward to use the molecular function and the cellular component domains to construct gene semantic similarity networks. According to our experiments, semantic similarity networks relying on these two gene ontology domains have coverage similar to that of the biological process domain and can achieve performance comparable to that of the HPRD network in the inference of disease genes (data not shown). Therefore, a possible improvement of our approach is to construct a gene semantic similarity network with the integration of all three domains in the gene ontology.
Second, the semantic similarity network and the protein-protein interaction network assess the functional relationship between genes from different points of view. Therefore, the inference of disease genes may benefit from the integrated use of these two types of networks. Furthermore, since the effectiveness of relying on the “guilt-by-association” principle (without using the phenotype similarity profile) and multiple genomic data to infer disease genes has been demonstrated in previous studies, it is reasonable to pursue the goal of using the phenotype similarity profile with multiple genomic data to achieve more accurate inferences of disease genes.
Calculation of semantic similarity scores
We adopt three methods based on information contents of GO terms (Resnik, Schlicker et al. and Lin) and one method based on the structure of gene ontology (Wang et al.) to calculate semantic similarity scores between GO terms.
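As an illustration of the information-content idea behind two of these measures, the following minimal sketch computes Resnik and Lin similarities on a tiny made-up ontology. It assumes the standard textbook definitions (information content IC(t) = -log p(t); Resnik similarity as the IC of the most informative common ancestor; Lin similarity as twice that value divided by the sum of the two terms' ICs). The term names, the `PARENTS` graph and the `DIRECT_COUNTS` annotation counts are hypothetical and are not taken from the gene ontology or from the data used in this paper.

```python
import math

# Hypothetical toy ontology: each term maps to its set of parent terms.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "a": 1, "b": 2, "c": 3, "d": 2}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t); p(t) counts annotations to t or any descendant."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):  # propagate counts up the ontology
            cumulative[anc] += count
    return {t: -math.log(cumulative[t] / total) for t in PARENTS}


def resnik(t1, t2, ic):
    """IC of the most informative common ancestor of t1 and t2."""
    return max(ic[t] for t in ancestors(t1) & ancestors(t2))


def lin(t1, t2, ic):
    """Resnik score scaled by the information contents of both terms."""
    denom = ic[t1] + ic[t2]
    return 2.0 * resnik(t1, t2, ic) / denom if denom > 0 else 0.0


ic = information_content()
print(resnik("c", "d", ic), lin("c", "d", ic))
```

In this toy graph the only informative ancestor shared by `c` and `d` is `a`, so their Resnik score equals the information content of `a`, while the Lin score rescales it into the range from 0 to 1.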
Applying the above method to every pair of genes, we obtain a pairwise semantic similarity matrix of genes. Certainly, this matrix can be thought of as the weight matrix of a fully connected network, whose vertices are genes and whose edges represent semantic similarity scores between genes. However, such a fully connected network may contain a large number of low-confidence edges between gene pairs with low semantic similarity scores. We therefore further filter out edges with low weights (similarity scores) in the fully connected network by introducing a threshold κ (defaulting to 100 in this paper) and keeping only the first κ nearest neighbors for each gene. By doing this, we obtain a gene semantic similarity network.
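The nearest-neighbor filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `keep_k_nearest` is hypothetical, and the final symmetrization step (an edge is kept if either endpoint selects the other as a neighbor) is an assumption, since the text does not say how asymmetric neighbor choices are resolved.

```python
import numpy as np


def keep_k_nearest(sim, k):
    """Keep, for each gene, only its k most similar other genes."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    k = min(k, n - 1)                    # a gene has at most n - 1 neighbors
    work = sim.copy()
    np.fill_diagonal(work, -np.inf)      # a gene is not its own neighbor
    filtered = np.zeros_like(sim)
    for i in range(n):
        top = np.argsort(work[i])[::-1][:k]   # indices of the k largest scores
        filtered[i, top] = sim[i, top]
    # Assumption: keep an edge if either endpoint selected it.
    return np.maximum(filtered, filtered.T)


# Hypothetical 4-gene similarity matrix, filtered with k = 1.
toy = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
])
network = keep_k_nearest(toy, k=1)
print(network)
```

Unlike a global cutoff on edge weights, this per-gene rule guarantees that every gene keeps at least one edge, which is consistent with the connectivity argument made in the discussion.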
Prioritization of candidate genes
The random walk with restart on the heterogeneous network model is one of the state-of-the-art methods that utilize a disease similarity network with a protein-protein interaction network to prioritize candidate genes. This model simulates the process in which a random walker wanders on a heterogeneous network composed of a phenotype similarity network, a protein-protein interaction network, and known associations between diseases and genes. In each step of the process, the random walker may start on a new journey with probability γ or move on with probability 1 – γ. When starting a new journey, the walker may choose the query disease of interest as the starting point with probability η or choose a seed gene known to be associated with the query disease with probability 1 – η. When moving on, the walker may choose to jump from the disease similarity network to the protein-protein interaction network or vice versa with probability λ or choose to wander in either the disease network or the protein-protein interaction network with probability 1 – λ. When wandering about, the walker moves at random to one of its direct neighbors.
In this model, the protein-protein interaction network serves as a simplified yet systematic view of functional relationships among genes. Since a gene semantic similarity network also provides a means of measuring functional relationships among genes, conceptually we can also use a gene semantic similarity network with the phenotype similarity network to infer disease genes. Following the literature, we use the following random walk with restart model on the heterogeneous network that is composed of a phenotype similarity network, a gene semantic similarity network, and known associations between diseases and genes.
We represent the phenotype similarity network using a weight matrix $D = (d_{ij})_{m \times m}$, where $m$ denotes the number of diseases and $d_{ij}$ the similarity score between the $i$-th disease and the $j$-th disease. By normalizing each row of this matrix, we obtain a transition matrix $U = (u_{ij})_{m \times m}$, where $u_{ij} = d_{ij} / \sum_{k=1}^{m} d_{ik}$, representing the probability that a random walker moves from the $i$-th disease to the $j$-th disease.
We represent the gene semantic similarity network using a weight matrix $G = (g_{ij})_{n \times n}$, where $n$ denotes the number of genes and $g_{ij}$ the similarity score between the $i$-th gene and the $j$-th gene. By normalizing each row of this matrix, we obtain a transition matrix $V = (v_{ij})_{n \times n}$, where $v_{ij} = g_{ij} / \sum_{k=1}^{n} g_{ik}$, representing the probability that a random walker moves from the $i$-th gene to the $j$-th gene.
We represent known associations between diseases and genes using an adjacency matrix $A = (a_{ij})_{m \times n}$, where $a_{ij} = 1$ indicates that the $j$-th gene is known to be associated with the $i$-th disease, and $a_{ij} = 0$ otherwise. By normalizing each row of this matrix, we obtain a transition matrix $R = (r_{ij})_{m \times n}$, where $r_{ij} = a_{ij} / \sum_{k=1}^{n} a_{ik}$, representing the probability that a random walker jumps from the $i$-th disease to the $j$-th gene. Note that we define $r_{ij} = 0$ when $\sum_{k=1}^{n} a_{ik} = 0$, i.e., when there is no gene known to be associated with the $i$-th disease. Similarly, by normalizing each row of the transpose of the matrix $A$, we obtain a transition matrix $S = (s_{ij})_{n \times m}$, where $s_{ij} = a_{ji} / \sum_{k=1}^{m} a_{ki}$, representing the probability that a random walker jumps from the $i$-th gene to the $j$-th disease. We also define $s_{ij} = 0$ when $\sum_{k=1}^{m} a_{ki} = 0$, i.e., when the $i$-th gene is not associated with any disease.
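The row normalization used for all four transition matrices, including the convention that an all-zero row stays zero, can be sketched as below. The helper name `row_normalize` and the toy association matrix are hypothetical and serve only as an illustration.

```python
import numpy as np


def row_normalize(weights):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    weights = np.asarray(weights, dtype=float)
    sums = weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(weights)
    np.divide(weights, sums, out=out, where=sums > 0)  # skip empty rows
    return out


# Hypothetical associations: 2 diseases x 3 genes; disease 1 has no known gene.
A = np.array([
    [1, 1, 0],
    [0, 0, 0],
])
R = row_normalize(A)     # disease -> gene jump probabilities
S = row_normalize(A.T)   # gene -> disease jump probabilities
print(R)
print(S)
```

Here the first disease splits its jump probability evenly between its two genes, while the second disease and the third gene, which have no known associations, get all-zero rows in `R` and `S` respectively.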
and further normalize every row of this matrix to obtain the transition matrix of the heterogeneous network $W = (w_{ij})$, where $w_{ij}$ is the corresponding entry of the matrix divided by the sum of its row. The parameter λ is the probability that the random walker jumps from the disease similarity network to the gene semantic similarity network or vice versa.
After a number of steps, the probability will reach a steady state. This is obtained by performing the iteration until the difference between p(t) and p(t+1) is sufficiently small (i.e., the L1 norm of Δp = p(t+1) – p(t) is less than a small positive number ε). The steady-state probability p(∞) then gives a measure of the strength of association of each gene to the query disease of interest, and we can then rank candidate genes according to their steady-state probabilities.
It has been shown that the random walk model is not sensitive to the parameters involved in the model. Hence, we follow the literature and default the parameters to λ = 0.7, η = 0.5, γ = 0.5 and ε = $10^{-4}$.
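To make the procedure concrete, the following sketch runs a random walk with restart on a small heterogeneous network using the default parameters above. Several details are assumptions rather than statements of the authors' implementation: the combined matrix is taken to be the block matrix [[(1 – λ)U, λR], [λS, (1 – λ)V]] before row normalization, the update is taken to be p(t+1) = (1 – γ) Wᵀ p(t) + γ p(0), and when the query disease has no seed gene all restart probability is placed on the disease. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def row_normalize(weights):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    weights = np.asarray(weights, dtype=float)
    sums = weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(weights)
    np.divide(weights, sums, out=out, where=sums > 0)
    return out


def rwr_heterogeneous(D, G, A, disease, lam=0.7, eta=0.5, gamma=0.5,
                      eps=1e-4, max_iter=1000):
    """Return steady-state scores of all genes for one query disease."""
    D, G, A = (np.asarray(x, dtype=float) for x in (D, G, A))
    m, n = A.shape
    U, V = row_normalize(D), row_normalize(G)
    R, S = row_normalize(A), row_normalize(A.T)
    # Assumed block layout: diseases first, genes second.
    W = row_normalize(np.block([[(1 - lam) * U, lam * R],
                                [lam * S, (1 - lam) * V]]))
    # Restart vector: eta on the query disease, 1 - eta shared by seed genes.
    p0 = np.zeros(m + n)
    seeds = np.flatnonzero(A[disease])
    if seeds.size:
        p0[disease] = eta
        p0[m + seeds] = (1 - eta) / seeds.size
    else:
        p0[disease] = 1.0  # assumption: no seed genes, restart at the disease
    p = p0.copy()
    for _ in range(max_iter):
        nxt = (1 - gamma) * W.T @ p + gamma * p0
        converged = np.abs(nxt - p).sum() < eps  # L1 norm of the change
        p = nxt
        if converged:
            break
    return p[m:]  # scores of the genes only


# Hypothetical toy data: 2 diseases, 3 genes; disease 1 has no known gene.
D = [[0.0, 1.0], [1.0, 0.0]]
G = [[0.0, 0.8, 0.1], [0.8, 0.0, 0.2], [0.1, 0.2, 0.0]]
A = [[1, 0, 0], [0, 0, 0]]
scores = rwr_heterogeneous(D, G, A, disease=1)
print(np.argsort(-scores))
```

In this toy example the query disease has no known gene, so the walker can only reach the gene network through the similar disease 0; gene 0, which is associated with disease 0, therefore receives the highest score, followed by its most similar gene.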
Validation methods and evaluation criteria
We perform three large-scale leave-one-out cross-validation experiments to examine the performance of the proposed method in prioritizing genes that are known to be associated with certain diseases (i.e., disease genes) from a set of candidates. First, in the validation against a linkage interval, we take a known association between a gene and a disease in each run, assume the association is unknown, and prioritize the gene against a set of 99 control genes that are located nearest to the disease gene according to their genomic distance on the same chromosome. Second, in the validation against random genes, we select control genes in each validation run as 99 (or 999) genes that are selected at random from all genes in a gene semantic similarity network. Third, in the genome-wide scan of disease genes, we select control genes in each validation run as all genes in a gene semantic similarity network.
We use two measures to evaluate the performance of the proposed method. Taking the cross-validation against a linkage interval as an example, after each validation run, we obtain a score (the steady-state probability) for each candidate gene and further rank genes according to their scores (ties are broken by assigning ranks to genes with equal scores at random) to obtain a ranking list of candidate genes. We then calculate rank ratios of candidate genes by dividing their ranks by the number of candidate genes in the list. For a set of validation runs, we calculate the following two measures. First, we calculate the mean rank ratio (MRR) of all disease genes as the average of rank ratios of all disease genes in the validation runs. Second, given a threshold of rank ratio, we calculate the sensitivity as the fraction of disease genes ranked above the threshold and the specificity as the fraction of control genes ranked below the threshold. Varying the threshold value from 0.0 to 1.0, we are able to draw a receiver operating characteristic (ROC) curve and further calculate the area under this curve (AUC). Obviously, smaller MRR and larger AUC values indicate higher performance of a prioritization method.
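The two measures can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: "ranked above the threshold" is interpreted as having a rank ratio no larger than the threshold, the threshold is swept over an evenly spaced grid, and the area is computed with the trapezoidal rule. The function names and the toy validation runs are hypothetical.

```python
import numpy as np


def rank_ratios(scores, rng):
    """Rank candidates by descending score (random tie-breaking) and scale."""
    scores = np.asarray(scores, dtype=float)
    tiebreak = rng.random(scores.size)
    order = np.lexsort((tiebreak, -scores))  # primary key: -score
    ranks = np.empty(scores.size, dtype=int)
    ranks[order] = np.arange(1, scores.size + 1)
    return ranks / scores.size


def evaluate(runs, seed=0, grid=1001):
    """runs: list of (candidate scores, index of the held-out disease gene)."""
    rng = np.random.default_rng(seed)
    disease, control = [], []
    for scores, target in runs:
        ratios = rank_ratios(scores, rng)
        disease.append(ratios[target])
        control.extend(np.delete(ratios, target))
    disease, control = np.array(disease), np.array(control)
    thresholds = np.linspace(0.0, 1.0, grid)
    sensitivity = np.array([(disease <= t).mean() for t in thresholds])
    false_pos = np.array([(control <= t).mean() for t in thresholds])  # 1 - specificity
    # Trapezoidal area under the (1 - specificity, sensitivity) curve.
    auc = np.sum(np.diff(false_pos) * (sensitivity[1:] + sensitivity[:-1]) / 2)
    return disease.mean(), auc


# Hypothetical runs in which the disease gene always has the top score.
runs = [([0.9, 0.1, 0.2, 0.3], 0), ([0.5, 0.8, 0.1, 0.2], 1)]
mrr, auc = evaluate(runs)
print(mrr, auc)
```

With four candidates per run and the disease gene always ranked first, the rank ratio of every disease gene is 1/4, the smallest value attainable for that list length, and the ROC curve passes through the top-left corner.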
This work was partly supported by the National Natural Science Foundation of China (60805010, 61175002, 71101010, 60928007, 60934004), National Basic Research Program of China (973 Program) (2012CB316504), Tsinghua University Initiative Scientific Research Program, Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation, and the Fundamental Research Funds for the Central Universities (FRF-BR-11-019A).
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S2.
1. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-37.
2. Glazier AM, Nadeau JH, Aitman TJ: Finding genes that underlie complex traits. Science. 2002, 298 (5602): 2345-9. 10.1126/science.1076641.
3. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics. 2005, 6: 55. 10.1186/1471-2105-6-55.
4. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006, 78 (6): 1011-25. 10.1086/504300.
5. Gaulton KJ, Mohlke KL, Vision TJ: A computational system to select candidate genes for complex human traits. Bioinformatics. 2007, 23 (9): 1132-40. 10.1093/bioinformatics/btm001.
6. Kohler S, Bauer S, Horn D, Robinson PN: Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008, 82 (4): 949-58. 10.1016/j.ajhg.2008.02.013.
7. Schlicker A, Lengauer T, Albrecht M: Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics. 2010, 26 (18): i561-7. 10.1093/bioinformatics/btq384.
8. Oti M, Brunner HG: The modular nature of genetic diseases. Clin Genet. 2007, 71: 1-11.
9. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al: Gene prioritization through genomic data fusion. Nat Biotechnol. 2006, 24 (5): 537-44. 10.1038/nbt1203.
10. Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, et al: A genomewide functional network for the laboratory mouse. PLoS Comput Biol. 2008, 4 (9): e1000165. 10.1371/journal.pcbi.1000165.
11. Amberger J, Bocchini CA, Scott AF, Hamosh A: McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009, 37 (Database issue): D793-6.
12. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al: The human disease network. Proc Natl Acad Sci U S A. 2007, 104 (21): 8685-90. 10.1073/pnas.0701361104.
13. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Hum Genet. 2006, 14 (5): 535-42. 10.1038/sj.ejhg.5201585.
14. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, et al: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007, 25 (3): 309-16. 10.1038/nbt1295.
15. Wu X, Jiang R, Zhang MQ, Li S: Network-based global inference of human disease genes. Mol Syst Biol. 2008, 4: 189.
16. Wu X, Liu Q, Jiang R: Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics. 2009, 25: 98-104. 10.1093/bioinformatics/btn593.
17. Li Y, Patra JC: Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010, 26 (9): 1219-24. 10.1093/bioinformatics/btq108.
18. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010, 6: e1000641. 10.1371/journal.pcbi.1000641.
19. Zhang W, Sun F, Jiang R: Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics. 2011, 12 (Suppl 1): S11. 10.1186/1471-2105-12-S1-S11.
20. Chen Y, Jiang T, Jiang R: Uncover disease genes by maximizing information flow in the phenome-interactome network. Bioinformatics. 2011, 27 (13): i167-i176. 10.1093/bioinformatics/btr213.
21. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, et al: Human Protein Reference Database-2009 update. Nucleic Acids Res. 2009, 37 (Database issue): D767-72.
22. Smedley D, Haider S, Ballester B, Holland R, London D, et al: BioMart-biological queries made easy. BMC Genomics. 2009, 10: 22. 10.1186/1471-2164-10-22.
23. Resnik P: Semantic similarity in a taxonomy: An Information-Based measure and its application to problems of ambiguity in natural language. J Artif Intell Res. 1999, 11: 95-130.
24. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 2006, 7: 302. 10.1186/1471-2105-7-302.
25. Lin D: An Information-Theoretic Definition of Similarity. Proceedings of the 15th International Conference on Machine Learning. 1998, Morgan Kaufmann, 296-304.
26. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23 (10): 1274-81. 10.1093/bioinformatics/btm087.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.