- Open Access
An improved method for functional similarity analysis of genes based on Gene Ontology
© The Author(s). 2016
- Published: 23 December 2016
Measures of gene functional similarity are essential tools for gene clustering, gene function prediction, evaluation of protein-protein interaction, disease gene prioritization and other applications. In recent years, many gene functional similarity methods have been proposed based on the semantic similarity of GO terms. However, these leading approaches may make error-prone judgments, especially when they measure the specificity of GO terms and the information content (IC) of a term set. Therefore, estimating gene functional similarity reliably is still a challenging problem.
We propose WIS, an effective method to measure gene functional similarity. First, WIS computes the IC of a term by employing its depth, the number of its ancestors and the topology of its descendants in the GO graph. Second, WIS calculates the IC of a term set by considering the weighted inherited semantics of terms. Finally, WIS estimates gene functional similarity based on the IC overlap ratio of term sets. WIS is superior to some other representative measures in experiments on functional classification of genes in a biological pathway, collaborative evaluation of GO-based semantic similarity measures, protein-protein interaction prediction and correlation with gene expression. Further analysis suggests that WIS takes full account of the specificity of terms and the weighted inherited semantics between GO terms.
The proposed WIS method is an effective and reliable way to compare gene function. The web service of WIS is freely available at http://nclab.hit.edu.cn/WIS/.
- Gene Ontology
- Specificity of terms
- Weighted inherited semantics
- Gene functional similarity
Gene Ontology (GO) is a standardized, precisely defined and controlled vocabulary of terms. It comprises three orthogonal ontologies: cellular component (CC), molecular function (MF) and biological process (BP). These ontologies are structured as three directed acyclic graphs (DAGs) in which the nodes correspond to terms describing a certain biological semantic category and the edges represent defined relationships between terms. Genes and gene products in many biomedical databases, such as UniProt and SwissProt, have been annotated with GO terms [5, 6]. Therefore, semantic similarity applied to the GO annotations of genes can provide a measure of their functional similarity.
In recent years, researchers have proposed many gene functional similarity methods based on GO [2, 5, 7–19]. These measures have been widely used in important applications such as protein-protein interaction prediction [20–23], network prediction [24–26], cellular localization prediction, disease gene prioritization [8, 28, 29], pathway modeling and improving the quality of microarray data analysis. Measuring functional similarity is more informative for understanding the biological roles and functions of genes, although it may sometimes be less objective and striking compared with sequence and structure similarity [5, 32, 33].
Measures of gene functional similarity can mainly be divided into two categories: pairwise approaches and groupwise approaches, both of which rely on the GO graph. Pairwise methods measure gene functional similarity in two steps. The first step measures the semantic similarity scores of term pairs using term comparison techniques. The second step integrates the semantic similarities of term pairs into a single functional similarity. Three distinct rules have been proposed for this integration: the average rule, the maximum rule and the best match average (BMA) rule. It is well accepted that the BMA rule is the best overall. The techniques that pairwise approaches use to measure the semantic similarity between GO terms can be divided into three categories: node-based, edge-based and hybrid.
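The three integration rules can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names and the toy similarity table are hypothetical, and the BMA variant shown (the mean of the two directional best-match averages) is one common formulation; a pooled variant that divides the sum of all best matches by the total number of terms is also in use.

```python
from typing import Callable, Sequence

TermSim = Callable[[str, str], float]


def average_rule(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """Mean of all pairwise term similarities."""
    scores = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores) if scores else 0.0


def maximum_rule(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """Largest pairwise term similarity."""
    return max((sim(a, b) for a in terms_a for b in terms_b), default=0.0)


def best_match_average(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """Mean of the two directional best-match averages (one common BMA variant)."""
    if not terms_a or not terms_b:
        return 0.0
    best_for_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_for_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    mean_a = sum(best_for_a) / len(best_for_a)
    mean_b = sum(best_for_b) / len(best_for_b)
    return (mean_a + mean_b) / 2.0


# Hypothetical term similarities, for illustration only.
TOY_SIM = {("a1", "b1"): 1.0, ("a2", "b1"): 0.5}


def toy_sim(a: str, b: str) -> float:
    return TOY_SIM[(a, b)]


gene_a_terms = ["a1", "a2"]
gene_b_terms = ["b1"]
```

With this toy table the average rule is pulled down by the weaker pair, the maximum rule ignores it entirely, and BMA sits between the two, which is the usual argument for preferring it.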
Node-based measures [9–11, 40] were originally developed for WordNet and then applied to GO. Resnik considered the most informative common ancestor (MICA) of two terms. Jiang and Conrath (JC) and Lin take into consideration the specificity of the terms themselves as well as the specificity of the MICA. GraSM considers the average IC of all disjoint common ancestors rather than the MICA only. However, these methods all suffer from the 'shallow annotation' problem, in which the semantic similarity values between terms near the root of the ontology are sometimes very high [5, 41].
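The node-based measures named above can be sketched from their standard definitions. The toy DAG, the IC values and all function names below are hypothetical and serve only as an illustration; IC values would normally come from annotation frequencies or from the ontology structure.

```python
from typing import Dict, Set

# Hypothetical toy ontology: root -> p -> {a, b}.
ANCESTORS: Dict[str, Set[str]] = {
    "root": set(),
    "p": {"root"},
    "a": {"root", "p"},
    "b": {"root", "p"},
}
# Hypothetical IC values; the root carries no information.
IC: Dict[str, float] = {"root": 0.0, "p": 1.0, "a": 2.0, "b": 3.0}


def mica_ic(t1: str, t2: str) -> float:
    """IC of the most informative common ancestor (a term counts as its own ancestor)."""
    common = (ANCESTORS[t1] | {t1}) & (ANCESTORS[t2] | {t2})
    return max((IC[t] for t in common), default=0.0)


def resnik_sim(t1: str, t2: str) -> float:
    """Resnik: similarity is the IC of the MICA."""
    return mica_ic(t1, t2)


def lin_sim(t1: str, t2: str) -> float:
    """Lin: MICA IC normalised by the IC of the two terms."""
    denom = IC[t1] + IC[t2]
    return 2.0 * mica_ic(t1, t2) / denom if denom > 0 else 0.0


def jc_distance(t1: str, t2: str) -> float:
    """Jiang and Conrath: a distance, so smaller means more similar."""
    return IC[t1] + IC[t2] - 2.0 * mica_ic(t1, t2)
```

In this sketch `lin_sim("p", "p")` evaluates to 1.0 even though `p` sits directly below the root, which is one way the shallow annotation problem shows up for normalised measures.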
Edge-based approaches [16, 19, 42, 43] count the number of edges along the paths that link two GO terms. The drawback of these approaches is that they assume every edge in the GO graph represents a uniform distance and only count the edges on the paths from one term to another. More recently, several researchers have attempted to address this issue by assigning different weights to edges at different levels [15, 17]. However, two problems remain. One is that two terms at a given graph distance near the root still receive the same semantic similarity as two terms at the same graph distance far from the root. The other is that edge weights are difficult to determine because of the complex relationships between terms in the GO graph.
The hybrid methods [2, 12, 44, 45] not only consider the structure of the ontology but also distinguish edges by their types and levels. Wang designed a method in which each edge is assigned a fixed weight according to the type of relationship between terms. This weight is called the semantic contribution factor (ωe). Wang's method has two main disadvantages. One is that the semantic contribution factor (ωe) is fixed according to the type of link between GO terms. The other is that the semantic contribution depends only on the maximum product of weights over all the paths linking the two terms.
Groupwise methods measure gene functional similarity by comparing the terms that annotate genes as groups. According to Pesquita, there are three categories of groupwise measures of gene functional similarity: set, graph and vector. Purely set-based approaches are not common, because few measures consider direct annotations only.
While simUI does not consider the specificity of the terms in the GO graph, simGIC takes the IC of a term as its specificity. As pointed out by Teng, simGIC ignores the IC shared between terms, and this may also result in misjudgments of gene functional similarity.
AS(t_i) is the ancestor set of term t_i. Eq. (8) suggests that a term inherits all the semantics of its ancestors. In other words, a term transmits all of its semantics equally to each descendant. Besides, Teng's method does not take into account the specificity of edges in the ontology. This model therefore does not match human intuition.
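The difference between simUI and simGIC can be sketched from their standard definitions: both compare the annotation sets of two genes after extending them with all ancestors, simUI as a plain Jaccard index and simGIC as an IC-weighted Jaccard index. The toy DAG, IC values and function names below are hypothetical and for illustration only.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy ontology: root -> p -> {a, b}.
ANCESTORS: Dict[str, Set[str]] = {
    "root": set(),
    "p": {"root"},
    "a": {"root", "p"},
    "b": {"root", "p"},
}
# Hypothetical IC values for the toy terms.
IC: Dict[str, float] = {"root": 0.0, "p": 1.0, "a": 2.0, "b": 3.0}


def with_ancestors(terms: Iterable[str]) -> Set[str]:
    """Extend a set of direct annotations with all ancestor terms."""
    closed: Set[str] = set()
    for term in terms:
        closed.add(term)
        closed |= ANCESTORS[term]
    return closed


def sim_ui(terms_a: Iterable[str], terms_b: Iterable[str]) -> float:
    """Unweighted Jaccard index over the ancestor-extended sets."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_gic(terms_a: Iterable[str], terms_b: Iterable[str]) -> float:
    """Jaccard index in which every term is weighted by its IC."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    denom = sum(IC[t] for t in a | b)
    return sum(IC[t] for t in a & b) / denom if denom > 0 else 0.0
```

For two genes annotated with `a` and `b` respectively, the shared terms are `p` and `root`; simUI counts them as half of the union, while simGIC discounts them because they carry little information.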
The distribution of term IC based on different models
For Sanchez's model, 87% of the term IC values are higher than 0.9, and only a small proportion fall between 0 and 0.9. The result for Seco's model has a similar problem: only 15% of the term IC values lie in the range 0 to 0.7. Therefore, these two models do not reflect the specificity of different terms in the ontology, and the resulting distribution of term IC is unreasonable. The results of Teng's model are a great improvement over the two models above, with term IC values distributed reasonably across the intervals. However, further analysis suggests that the IC values cluster at some points, such as 0.39 and 0.42. By contrast, WIS distributes the term IC evenly across the intervals, and the cumulative curve of WIS is smoother than Teng's. This is because WIS makes the best use of the term information in the ontology and fully defines the specificity of a term. As a result, WIS performs better than the other models in terms of the distribution of term IC (see the 'Discussion' section for details).
Functional classification of genes in a biological pathway
Functions of genes in valine degradation pathway
As demonstrated in Fig. 3, WIS clusters the 11 genes into 3 clusters correctly. The first cluster contains genes SFA1, ADH1, ADH2, ADH3, ADH4 and ADH5, all of which share the same reaction subtype (EC number 184.108.40.206). PDC1, PDC5 and PDC6 are clustered into another group with the same EC number (220.127.116.11). BAT1 and BAT2 are clustered precisely into the third group. The result suggests that the clustering produced by WIS is consistent with the human perspective on the functional classification of genes in the pathway.
In contrast, the clustering results obtained by the other measures are mixed. Method Hybrid fails at the first level, where it assigns high similarity to BAT2 and PDC6. For method Teng, since ADH4 has a higher similarity with SFA1 than with PDC1, PDC5 and PDC6, these genes are not in their proper positions. For method Wang, BAT1, PDC1 and PDC5 are grouped together at the first level, so the clustering result is clearly incorrect. Therefore, the functional similarities obtained by methods Hybrid, Teng and Wang cannot characterize the functional relationships of genes in the pathway consistently with the human perspective.
The results of CESSM
The performances of different methods on seven experiments
As shown in Table 2, there are 24 groups of experiments in total. For SeqSim, WIS achieves the highest correlation in five out of six experiments, the exception being MF_IEA+. For Pfam, WIS ranks first on the BP_IEA+ and MF_IEA- experiments. Moreover, WIS ranks first on the MF_IEA+ experiment for Res. In contrast, for ECC, Teng shows the best correlation on the MF_IEA+, BP_IEA+, MF_IEA- and BP_IEA- experiments. Teng also ranks first on the Res experiments for CC_IEA+, MF_IEA- and CC_IEA-. Additionally, simGIC and simUI achieve the highest correlations in the corresponding experiments. Among the pairwise methods, Resnik and Lin show the highest correlations on only four experiments in total.
We also sum the correlations on ECC, SeqSim and Res for each method, considering annotations with and without IEA separately. WIS performs best of the seven methods. For annotations with IEA, the sum for WIS is 5.2467, ranking first, followed by simGIC and Resnik with 5.2145 and 5.0678 respectively. For annotations without IEA, WIS also ranks first, followed by simGIC and Teng. Detailed results are provided in Additional file 1: Table S5, Figure S2 and S3, available online.
In summary, WIS performs better than the other six measures on the MF, BP and CC ontologies when they are evaluated on ECC, Pfam, SeqSim and resolution. Overall, WIS ranks first in eight experiments, followed by the Teng and simUI methods. It is noteworthy that WIS and the other groupwise methods (Teng, simUI, simGIC) generally perform better than the pairwise methods (Resnik, Lin and JC) on CESSM.
Protein-Protein interaction of yeast and human
Functional similarities between genes in the yeast and human PPI datasets are computed by eleven measures: Resnik, Jiang and Conrath, Lin, Wang, simGIC, simUI, Teng, ResnikGrasm, WangWV, simRel and WIS. The pairwise approaches adopt the BMA rule to combine the semantic similarities of terms. The original method of Wang is referred to as method Wang; after applying our proposed weighting scheme, it is referred to as WangWV. WangWV is included to assess the effectiveness of the proposed weighting scheme. We then plot the ROC curves for each method and calculate the areas under the curves (AUC). We also calculate F1-scores at different classification cut-off points for the Resnik and WIS measures.
AUC of the functional similarity measures for three GOs using BMA in the PPI task on yeast dataset (IEA+ and IEA-)
AUC of the functional similarity measures for three GOs using BMA in the PPI task on human dataset (IEA+ and IEA-)
F1-score (mean and max) of the Resnik and WIS measures for the yeast PPI task (IEA+ and IEA-)
F1-score (mean and max) of the Resnik and WIS measures for the human PPI task (IEA+ and IEA-)
The mean and max F1-scores on the yeast dataset are shown in Table 5. The PPI prediction of WIS is always better than that of Resnik in terms of mean F1-score. The mean F1-score of WIS is considerably higher than that of Resnik on both the IEA+ and IEA- yeast datasets, while WIS does not show a clear advantage over Resnik on max F1-score. In terms of max F1-score, Resnik achieves excellent performance on the IEA+ datasets, while WIS is superior to Resnik only on the IEA- dataset of the CC ontology. For example, the max F1-scores on the yeast IEA+ experiments are 0.8802, 0.8029 and 0.8055 for Resnik, compared with 0.8717, 0.7762 and 0.7780 for WIS.
The mean and max F1-scores on the human dataset are shown in Table 6. Resnik ranks first only on max F1-score in the CC_IEA+ experiment. WIS is superior to Resnik in all the remaining experiments. In summary, WIS outperforms other leading functional similarity methods, including Resnik, on the yeast and human PPI datasets.
Comparison analysis based on correlation with gene expression data
Pearson’s correlation of functional similarity measures for three GOs using BMA against gene expression data (IEA+ and IEA-)
The CC ontology has the highest correlations in all cases, followed by the BP and MF ontologies. The experimental results show that Resnik generally outperforms the other methods. As demonstrated in Table 7, Resnik shows the highest correlations on the CC_IEA+, MF_IEA+ and MF_IEA- experiments, while WIS ranks first on the BP_IEA+ and BP_IEA- experiments with 0.4367 and 0.2941. Method Teng gets the highest correlation on the CC_IEA- experiment. Although WIS ranks first in only two experiments and is inferior to Resnik, which other authors have indicated to be better on the yeast dataset [53, 54], its overall performance is better than that of the other groupwise methods. Groupwise and pairwise methods show comparable performance on this dataset. WIS shows the best correlations (or one of the best) between gene expression and functional similarity with all three GO ontologies.
The specificity of terms
How to measure the IC of terms reasonably is a controversial problem, but it is generally believed that a model should make the best use of term information and highlight the specificity of each term. Therefore, a novel model for measuring the IC of a term is proposed. Compared with other models, WIS considers not only the depths of terms but also the number of their ancestors and the topology of their descendants in the GO graph. Our model is therefore able to represent the specificity of terms as fully as possible.
Comparisons of IC computational models
Whether the information of t affects the result
The weighted inherited semantics between term and its parents
As proposed by Teng, the semantics of a term is divided into two parts: one is the inherited semantics, which is the same as the semantics of its ancestors, and the other is the extended semantics, which is specific to the term itself. However, Teng's model has one serious drawback. Because the edges in the GO graph are not all equal, the semantics that a term inherits from its parents ought to be weighted according to the edge, rather than being identical to the semantics of its ancestors.
In order to avoid repeated summing of the IC shared between terms, WIS divides the semantics of a term into two parts. One is the weighted inherited semantics, which comes from its parents, and the other is the extended semantics. Since WIS makes the best use of the relationships between terms, its results for measuring the annotating term set are more reasonable. The results of WIS confirm that it is an effective and reliable way to estimate gene functional similarity.
The difficulty on verifying the results
Since there is no direct way to ascertain the true functional similarity between two genes, assessing how well a measure captures similarity in function is not trivial. To give a comprehensive comparison, we select four groups of experiments to verify the performance of existing gene functional similarity methods.
The selected measures perform differently on different experiments. For example, groupwise methods outperform pairwise methods on the CESSM dataset, while simRel performs best on the human PPI experiments, followed by WIS. The reason may lie in the characteristics of the different data sets. The proteins in CESSM are all well annotated. In contrast, the yeast data set considers only high quality interactions and ignores the annotation richness of genes. Therefore, the number of annotations per gene is crucial to the performance of functional similarity measures. Besides, because authoritative and uniform evaluation criteria are lacking, some problems remain in comparing these methods objectively. Therefore, how to measure functional similarity reliably is still a meaningful research area.
On PPI classification of the yeast and human datasets, as the results in Tables 3 and 4 show, Resnik also achieves high AUC values. The current GOA database is known to be incomplete, and many proteins are annotated with only one or two GO terms. Moreover, proteins that are not well studied tend to be annotated with more general terms (near the root of the ontology). In this situation, if two proteins are annotated with the same GO terms, the functional similarity calculated by most methods is always 1.0, which does not match human intuition. As a result, methods that cannot distinguish between identical annotations may not perform well. Of the eleven methods listed in Table 4, three pairwise methods can distinguish between identical annotations: Resnik, simRel and ResnikGrasm. The results show that these three pairwise methods indeed perform better than the other methods, which cannot make this distinction. Further experiments assessing the performance of Resnik and simRel could provide stronger evidence.
In future work, WIS can be evaluated on human miRNA target gene sets and on correlation with a sequence similarity dataset. WIS also needs to be verified on other model organisms that have high quality biological data. Since annotation richness is crucial to the performance of functional similarity methods, WIS should be investigated on datasets with different annotation richness. Finally, there may be some scope for improving the proposed measure by studying the specificity of terms further and measuring the IC of a term set more reasonably.
We proposed a novel method, namely WIS, to measure gene functional similarity based on GO. It is extensively evaluated in four different experiments: functional classification of genes in a biological pathway, the CESSM dataset, protein–protein interaction prediction and correlation with gene expression. The experimental results suggest that WIS is a more effective and reliable way to estimate gene functional similarity than the other tested methods. WIS has the following advantages.
First, WIS makes the best use of term information in the GO graph. WIS measures the IC of a term by considering its depth, the number of its ancestors and the topology of its descendants in the ontology. As a result, WIS overcomes the limitation of corpus bias, which heavily affects corpus-based approaches. Therefore, WIS can measure the specificity of terms more objectively than other methods.
Second, WIS measures the IC of a term set by combining the inherited and extended IC of terms. The inherited IC is the weighted semantics that a term receives from its parents, and the extended IC is specific to the term itself. Because WIS considers both types of semantics, it can effectively avoid repeated summing of the IC shared between terms, which is the key point for estimating the IC of a term set reasonably and correctly.
Third, WIS is very promising since it outperforms most existing state-of-the-art methods across the experiments. Pairwise approaches are sensitive to the number of annotations per gene since they are based on the combination of similarities between term pairs. In contrast, groupwise approaches are sensitive to the specificity of terms because they estimate gene functional similarity by comparing the terms in groups. Since WIS can measure the IC of terms and term sets more reasonably, its performance is more stable than that of the other tested methods in the experiments. Therefore, it is an effective and reliable way to estimate gene functional similarity. The online service of WIS is freely available at http://nclab.hit.edu.cn/WIS.
Measure the IC of a term
Measure the IC of a term set by means of considering weighted inherited semantics of terms
Then, for the sake of measuring the IC of a term set, we take full account of the term IC as well as the weighted inherited semantics between terms. As a result, the semantics of a term is divided into two parts: one is weighted inherited semantics from its parents, and the other is extended semantics which is special in itself.
In this way, WIS can effectively avoid repeated summing of term shared IC.
Example: measure the IC of a term set based on WIS
The weight values of corresponding edges in Figure 11
The computational process for measuring the IC of term set S
Columns: Elements in S; IC(S) + IC_extended
IC(t_2) − ω_12 ∗ IC(t_1) = 0.003
IC(t_3) − ω_13 ∗ IC(t_1) = 0.011
IC(t_4) − ω_24 ∗ IC(t_2) = 0.010
IC(t_5) − ω_35 ∗ IC(t_3) = 0.012
IC(t_6) − ω_46 ∗ IC(t_4) − ω_36 ∗ IC(t_3) = 0.012
IC(t_7) − ω_67 ∗ IC(t_6) − ω_57 ∗ IC(t_5) = 0.003
In step 1, term set S is empty, and IC(S) is 0. We then add the first term t_1 to S. According to Equation (13), IC_extended is 0. Therefore, the result of step 1, IC(S) + IC_extended, is 0.
In step 2, term set S contains t_1 only, and IC(S) is 0. We add the second term t_2 to S. According to Equation (13), IC_extended(t_2 → t_1) is 0.003. Therefore, the result of step 2, IC(S) + IC_extended, is 0.003.
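The accumulation described in these steps can be sketched as below. This is a minimal illustration of the pattern visible in the worked example (the extended IC of a term is its IC minus the weighted IC of its parents, and IC(S) grows by that amount as each term is added), not a reproduction of Equation (13). It assumes that S contains all ancestors of its terms and that terms are added parents first. The four-term DAG, the IC values, the edge weights and the function names are hypothetical and do not correspond to the example in the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy DAG: t1 is the root, t2 and t3 are its children,
# and t4 has both t2 and t3 as parents.
PARENTS: Dict[str, List[str]] = {"t1": [], "t2": ["t1"], "t3": ["t1"], "t4": ["t2", "t3"]}
# Hypothetical IC values and edge weights (parent, child) -> weight.
IC: Dict[str, float] = {"t1": 0.0, "t2": 0.2, "t3": 0.3, "t4": 0.6}
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("t1", "t2"): 0.5,
    ("t1", "t3"): 0.5,
    ("t2", "t4"): 0.5,
    ("t3", "t4"): 0.5,
}


def extended_ic(term: str) -> float:
    """IC of a term minus the weighted IC inherited from its parents."""
    inherited = sum(WEIGHTS[(parent, term)] * IC[parent] for parent in PARENTS[term])
    return IC[term] - inherited


def term_set_ic(ordered_terms: Sequence[str]) -> Tuple[float, List[Tuple[str, float]]]:
    """Add terms one by one (parents first) and accumulate their extended IC.

    Returns the final IC(S) and the running total after each step.
    """
    total = 0.0
    trace: List[Tuple[str, float]] = []
    for term in ordered_terms:
        total += extended_ic(term)
        trace.append((term, total))
    return total, trace


ic_of_set, steps = term_set_ic(["t1", "t2", "t3", "t4"])
```

Because each term contributes only what it adds beyond its parents, the IC that `t4` shares with `t2` and `t3` is not counted a second time, which is the effect the text attributes to the weighted inherited semantics.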
Measure the gene functional similarity between two genes based on WIS
Experimental data and evaluation of the proposed approach
Assessing how well a measure captures the functional similarity between two genes is not trivial, because there is no direct way to ascertain the true functional similarity between them [2, 39]. However, the performance of existing functional similarity measures can be verified in terms of pathway gene clustering [12, 55], correlation with sequence similarity [13, 46], gene expression profiling, protein-protein interactions [14, 54] and so on. In this article, the performance of WIS is validated in four groups of experiments: biological pathways of yeast, the CESSM dataset, protein-protein interaction datasets of yeast and human, and gene expression data of yeast. Additionally, it is noteworthy that the pairwise approaches adopt the BMA rule to combine the semantic similarities of terms, since it is the best rule for the evaluation of functional similarity measures.
Gene Ontology data
We downloaded the Gene Ontology data from the Gene Ontology database (dated August 2015), containing 41,624 ontology terms in total, subdivided into 3817 cellular component, 27,864 biological process and 9943 molecular function terms. Gene annotations for GO terms were downloaded from the Gene Ontology database for S. cerevisiae and H. sapiens (dated October 2015).
Biological pathway of yeast
Genes that participate in a certain biological pathway may be involved in several different molecular functions. They are assigned different Enzyme Commission (EC) numbers according to the subtype of reaction that they catalyze at the molecular level. Therefore, classifying the genes according to their molecular functions is an effective way to validate the accuracy of functional similarity methods. If the clustering results are consistent with the manual classification based on the biological reactions, the measure is effective in characterizing the functional similarity between genes. We have therefore taken a few pathways from the yeast pathway database (http://pathway.yeastgenome.org/); owing to space limitations, the validation results are shown for the valine degradation pathway only.
We use the CESSM tool to compare WIS with other leading methods. CESSM is a widely used platform which provides a standard dataset. It consists of 13,430 pairs of proteins involving 1039 distinct proteins and implements 11 state-of-the-art semantic similarity measures. We only consider the best-match average (BMA) rule for Resnik's, Lin's and Jiang and Conrath's methods, together with simGIC, simUI and Teng. CESSM provides Pearson correlations with sequence similarity (SeqSim), protein family similarity (Pfam) and enzyme commission classification similarity (ECC), as well as resolution (Res), to evaluate these measures. SeqSim is computed using a relative measure of sequence similarity based on BLAST bitscores, called the RRBS method. The similarity between two proteins is computed by dividing the sum of the reciprocal BLAST bitscores by the sum of their dependent BLAST bitscores. The value of SeqSim ranges from 0 to 1.0. ECC is calculated using the EC class similarity of proteins; its value is between 0 and 4 and corresponds to the number of EC digits that two proteins share. Pfam is measured via Jaccard similarity, where the similarity between two proteins is the ratio between the number of domains they share and the total number of domains they have. Resolution is the relative intensity with which values on the sequence similarity scale are translated into semantic similarity. It depicts the ability of a method to distinguish different levels of sequence similarity. Higher correlation and resolution values support the efficiency of a measure. A detailed explanation of these criteria has been given by Pesquita.
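Two of the CESSM reference criteria are simple enough to sketch directly from the description above. The snippet is an illustration rather than the CESSM implementation: the function names and the domain identifiers are hypothetical, and the ECC function assumes that "shared digits" means matching leading EC fields, counted from the left until the first mismatch.

```python
from typing import Iterable


def pfam_similarity(domains_a: Iterable[str], domains_b: Iterable[str]) -> float:
    """Jaccard similarity: shared domains divided by all distinct domains."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ecc_similarity(ec_a: str, ec_b: str) -> int:
    """Number of leading EC fields (0 to 4) shared by two EC numbers.

    Assumption: an unspecified field written as '-' never counts as shared.
    """
    shared = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b or field_a == "-":
            break
        shared += 1
    return shared


# Hypothetical domain identifiers, for illustration only.
protein_x_domains = ["DOM_A", "DOM_B"]
protein_y_domains = ["DOM_B", "DOM_C"]
```

Note that the two criteria live on different scales (0 to 1 for Pfam, 0 to 4 for ECC), which does not matter for CESSM because each is only used as one side of a Pearson correlation.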
Protein-Protein interaction data of Yeast and Human
We collect protein-protein interaction (PPI) datasets of yeast and human from Jain and Davis's database [53, 58]. The database has around 3800 yeast PPIs and 1500 human PPIs, which are the core set of the DIP yeast database (dated 2009). Negative datasets with the same number of PPIs for yeast and human are independently generated by randomly choosing annotated gene pairs for the BP, CC and MF ontologies that are absent from a combined dataset of all possible PPIs [58, 59]. We conducted our experiments using the same data. In order to draw the ROC plots, the threshold on the functional similarity values between all gene pairs is varied between 0 and 1. Gene pairs with similarity values greater than the threshold are predicted to be positives, while those below the threshold are predicted to be negatives. The true positive, true negative, false positive and false negative counts are then computed, and the ROC curves are plotted. The area under the curve (AUC) obtained from the ROC plots is used to compare the performance of WIS against the other functional similarity measures. The F1-scores are also calculated for the corresponding measures.
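The threshold sweep described above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code used in the study: the function names and the four toy gene pairs are hypothetical, labels are 1 for interacting pairs and 0 for negative pairs, and the AUC is obtained with the trapezoidal rule over the ROC points.

```python
from typing import List, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when pairs scoring above the threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 = 2TP / (2TP + FP + FN) at a single cut-off."""
    tp, fp, _tn, fn = confusion_counts(scores, labels, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve from a sweep over all distinct score thresholds."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative pairs are required")
    # From the strictest threshold (nothing predicted positive) to the loosest (everything).
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points: List[Tuple[float, float]] = []
    for threshold in thresholds:
        tp, fp, _tn, _fn = confusion_counts(scores, labels, threshold)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical similarity scores for two interacting and two non-interacting pairs.
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 0, 0]
```

On the toy data one negative pair outscores one positive pair, so three of the four positive/negative comparisons are ranked correctly and the AUC is 0.75. The mean and max F1-scores reported in the paper would correspond to averaging or maximising `f1_score` over the chosen cut-off points.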
Gene expression data for yeast
Correlation between gene expression and gene functional similarity is another desirable criterion, since many gene products that participate in the same biological process or are functionally related have similar expression profiles. Therefore, the comparison of expression similarity and functional similarity between genes can be used as a standard performance evaluation. Methods with higher correlation are regarded as performing better. The gene expression dataset for S. cerevisiae comes from Jain and Davis. The dataset contains 5000 S. cerevisiae gene pairs randomly selected from a list of all possible pairs of proteins in the gene expression dataset. We use all 5000 gene pairs from their study and consider genes with electronic annotations (IEA+) and non-electronic annotations (IEA-).
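The comparison reduces to a Pearson correlation between two lists of per-pair values: the expression similarity and the functional similarity of each gene pair. A minimal sketch follows; the function name and the toy numbers are hypothetical and are not taken from the dataset.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-pair values, for illustration only.
expression_similarity = [0.10, 0.40, 0.35, 0.80]
functional_similarity = [0.20, 0.50, 0.30, 0.90]
r = pearson(expression_similarity, functional_similarity)
```

In the setting described above, one such coefficient would be computed per method, per ontology and per annotation set (IEA+ or IEA-), and the coefficients would then be compared across methods.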
This article has been published as part of BMC Systems Biology Volume 10 Supplement 4, 2016: Proceedings of the 27th International Conference on Genome Informatics: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-4.
M. Guo is supported by the National Natural Science Foundation of China (61271346, 61571163, and 61532014). C. Wang is supported by the Natural Science Foundation of China (61402132), and X. Liu is supported by the Natural Science Foundation of China (91335112). Publication costs for this article were funded by the National Natural Science Foundation of China.
Availability of data and materials
The dataset(s) supporting the conclusions of this article were downloaded from the relevant public databases.
• Ontology data: we downloaded the Gene Ontology data from the Gene Ontology database (http://geneontology.org/page/download-ontology, dated August 2015) containing 41,624 ontology terms subdivided into 3817 cellular components, 27,864 biological process and 9943 molecular function terms.
• GO Annotation data: Gene annotations for GO terms were downloaded from the Gene Ontology database for S. cerevisiae and H. sapiens (http://geneontology.org/page/download-annotations, dated August 2015).
• Home page: http://nclab.hit.edu.cn/WIS.
ZT conceived the idea, designed the experiments, and drafted the manuscript. MG, CW and XL guided the whole work. ZT gave advice on writing skills. All authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
The human and yeast GO annotations are publicly available to all researchers and free for academic use. They raise no ethical issues. No human participants or individual clinical data are involved in this study.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nature genetics. 2000;25(1):25–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Xu Y, Guo M, Shi W, Liu X, Wang C. A novel insight into Gene Ontology semantic similarity. Genomics. 2013;101(6):368–75.PubMedView ArticleGoogle Scholar
- Bairoch AM, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro Rojas S, Gasteiger E, Huang H, Lopez R, Magrane M. The universal protein resource (UniProt). Nucleic acids research. 2005;33(Database issue):D154–159.PubMedView ArticleGoogle Scholar
- Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R. CluSTr: a database of clusters of SWISS-PROT+ TrEMBL proteins. Nucleic acids research. 2001;29(1):33–6.PubMedPubMed CentralView ArticleGoogle Scholar
- Song X, Li L, Srimani PK, Yu PS, Wang JZ. Measure the semantic similarity of go terms using aggregate information content. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2014;11(3):468–76.View ArticleGoogle Scholar
- Peng J, Wang T, Wang J, Wang Y, Chen J. () Extending gene ontology with gene association networks. Bioinformatics 2016;32(8):1185–1194.PubMedView ArticleGoogle Scholar
- Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on Gene Ontology. BMC bioinformatics. 2006;7(1):302.
- Schlicker A, Lengauer T, Albrecht M. Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics. 2010;26(18):i561–7.
- Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy, arXiv preprint cmp-lg/9709008. 1997.
- Lin D. An information-theoretic definition of similarity. In: ICML. 1998. p. 296–304.
- Resnik P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res. 1999;11:95–130.
- Wang JZ, Du Z, Payattakool R, Philip SY, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
- Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC bioinformatics. 2008;9(5):1.
- Bandyopadhyay S, Mallick K. A New Path Based Hybrid Measure for Gene Ontology Similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014;11(1):116–27.
- Wu H, Su Z, Mao F, Olman V, Xu Y. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic acids research. 2005;33(9):2822–37.
- Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose MA. A knowledge-based clustering algorithm driven by gene ontology. Journal of biopharmaceutical statistics. 2004;14(3):687–700.
- Li M, Wu X, Pan Y, Wang J. hF‐measure: A new measurement for evaluating clusters in protein–protein interaction networks. Proteomics. 2013;13(2):291–300.
- Smyth GK. Limma: linear models for microarray data. Bioinformatics and computational biology solutions using R and Bioconductor Springer. 2005;397–420.
- Pekar V, Staab S. Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. In: Proceedings of the 19th international conference on Computational linguistics-Volume 1: 2002. Association for Computational Linguistics: 1–7.
- Brameier M, Wiuf C. Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. Journal of biomedical informatics. 2007;40(2):160–73.
- Cho YR, Zhang AD, Xu X. Semantic similarity based feature extraction from microarray expression data. Int J Data Min Bioin. 2009;3(3):333–45.
- Yang D, Li YH, Xiao H, Liu Q, Zhang M, Zhu J, Ma WC, Yao C, Wang J, Wang D, et al. Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics. 2008;24(2):265–71.
- Qu Y, Xu S. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics. 2004;20(12):1905–1913.
- Lee PH, Lee D. Modularized learning of genetic interaction networks from biological annotations and mRNA expression data. Bioinformatics. 2005;21(11):2739–47.
- Yu G, Fu G, Wang J, Zhu H. Predicting Protein Function via Semantic Integration of Multiple Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 13;(2):220–232.
- Yu G, Zhu H, Domeniconi C. Predicting protein functions using incomplete hierarchical labels. BMC Bioinformatics. 2015;16(1).
- Lei Z, Dai Y. Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC bioinformatics. 2006;7(1):491.
- Cheng L, Li J, Ju P, Peng J, Wang Y. SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association. 2014.
- Chen J, Aronow BJ, Jegga AG. Disease candidate gene identification and prioritization using protein interaction networks. BMC bioinformatics. 2009;10(1):73.
- Guo X, Liu R, Shriver CD, Hu H, Liebman MN. Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics. 2006;22(8):967–73.
- Tuikkala J, Elo L, Nevalainen OS, Aittokallio T. Improving missing value estimation in microarray data with gene ontology. Bioinformatics. 2006;22(5):566–72.
- Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics. 2013;29(11):1424–32.
- Peng J, Wang T, Hu J, Wang Y, Chen J. Constructing Networks of Organelle Functional Modules in Arabidopsis. Current Genomics. 2016;17(5):427–438.
- Seco N, Veale T, Hayes J. An intrinsic information content metric for semantic similarity in WordNet. ECAI. 2004;16:1089.
- Harispe S, Sánchez D, Ranwez S, Janaqi S, Montmain J. A framework for unifying ontology-based semantic similarity measures: A study in the biomedical domain. Journal of biomedical informatics. 2014;48:38–53.
- Sánchez D, Batet M. Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective. Journal of biomedical informatics. 2011;44(5):749–59.
- Sánchez D, Batet M, Isern D. Ontology-based information content computation. Knowledge-Based Systems. 2011;24(2):297–303.
- Guzzi PH, Mina M, Guerra C, Cannataro M. Semantic similarity analysis of protein data: assessment with biological features and issues. Briefings in bioinformatics. 2012;13(5):569–85.
- Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS computational biology. 2009;5(7):e1000443.
- Couto FM, Silva MJ, Coutinho PM. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. In: Proceedings of the 14th ACM international conference on Information and knowledge management: 2005. ACM: 343–344.
- Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A. Correlation between gene expression and GO semantic similarity. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2005;2(4):330–8.
- Yu H, Gao L, Tu K, Guo Z. Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene. 2005;352:75–81.
- Del Pozo A, Pazos F, Valencia A. Defining functional distances over gene ontology. BMC bioinformatics. 2008;9(1):50.
- Othman RM, Deris S, Illias RM. A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of biomedical informatics. 2008;41(1):65–81.
- Shen Y, Zhang S, Wong H-S. A new method for measuring the semantic similarity on Gene Ontology. In: Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on: 2010. IEEE. pp. 533–8.
- Mistry M, Pavlidis P. Gene Ontology term overlap as a measure of gene functional similarity. BMC bioinformatics. 2008;9(1):327.
- Tversky A. Features of similarity. Psychological review. 1977;84(4):327.
- Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. Coexpression analysis of human genes across many microarray data sets. Genome research. 2004;14(6):1085–94.
- Pesquita C, Faria D, Bastos H, Falcão A, Couto F. Evaluating GO-based semantic similarity measures. In: Proc 10th Annual Bio-Ontologies Meeting: 2007. 38.
- Alvord G, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome biology. 2007;8(9):183.
- Chabalier J, Mosser J, Burgun A. A transversal approach to predict gene product networks from ontology-based similarity. BMC bioinformatics. 2007;8(1):235.
- Couto FM, Silva MJ, Coutinho PM. Measuring semantic similarity between Gene Ontology terms. Data & knowledge engineering. 2007;61(1):137–52.
- Jain S, Bader GD. An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC bioinformatics. 2010;11(1):562.
- Xu T, Du L, Zhou Y. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC bioinformatics. 2008;9(1):1.
- Zhang S-B, Lai J-H. A hybrid measure for the semantic similarity of gene ontology terms. In: Systems and Informatics (ICSAI), 2014 2nd International Conference on: 2014. IEEE: 911–6.
- Pesquita C, Pessoa D, Faria D, Couto F. CESSM: Collaborative evaluation of semantic similarity measures. JB2009: Challenges in Bioinformatics. 2009;157:190.
- Devos D, Valencia A. Practical limits of function prediction. Proteins: Structure, Function, and Bioinformatics. 2000;41(1):98–107.
- Pesaranghader A, Matwin S, Sokolova M, Beiko RG. simDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics. 2016;32(9):1380–7.
- Razick S, Magklaras G, Donaldson IM. iRefIndex: a consolidated protein interaction database with provenance. BMC bioinformatics. 2008;9(1):1.