An integrative approach for measuring semantic similarities using gene ontology
- Jiajie Peng†1, 2,
- Hongxiang Li†1,
- Qinghua Jiang3,
- Yadong Wang1 and
- Jin Chen2, 4
© Peng et al.; licensee BioMed Central Ltd. 2014
Published: 12 December 2014
Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, existing GO-based similarity measurements are limited because each measure considers only a subset of the information in GO. An appropriate integration of the existing measures that takes more of the information in GO into account is needed.
We propose a novel integrative measure called InteGO 2 that automatically selects appropriate seed measures and then integrates them using a metaheuristic search method. Experimental results show that InteGO 2 significantly improves the performance of gene similarity measurement in human, Arabidopsis and yeast on both the molecular function and biological process GO categories.
InteGO 2 computes gene-to-gene similarities more accurately than the existing measures tested and is highly robust. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
The Gene Ontology (GO) represents biological knowledge through a structured, controlled vocabulary of terms, which are interrelated to form a directed acyclic graph (DAG) describing the functional information of gene products [1, 2]. GO consists of three categories that are shared by all organisms: molecular function (MF), biological process (BP) and cellular component (CC). As a widely used bioinformatics resource, GO provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications including predicting gene functional associations, homology analysis, assessing target gene functions, and predicting subcellular localization.
Since GO was released, various computational measurements have been developed to compute gene functional similarities by comparing the GO terms with which the genes are annotated [7–23]. These term-comparison measurements can be classified into three categories based on the types of knowledge in GO that they use: edge-based, node-based, and hybrid.
The measures in the edge-based category take the structure of GO into account [11, 12, 22]. Using the topological information of the GO DAG, a recently designed method, Relative Specificity Similarity (RSS), models both the distance from a given term pair to its closest leaf terms and the distance to the pair's most recent common ancestor (MRCA). The edge-based measures, however, depend entirely on the topology of the GO DAG, and it is inappropriate to simply treat terms at the same topological level as equal.
In the node-based category, methods originally designed for natural language processing [24–26] are used for term comparisons. In the earlier measures, the similarity of two GO terms is defined as the information content of their most informative common ancestor (MICA), which indicates its specificity. This approach was further advanced by modeling the distance from a given term pair to its MICA. The results show strong correlations with yeast gene co-expression and protein sequence similarity [24, 27]. However, the node-based measures consider only the annotations and common ancestors, neglecting the complex topology of the GO DAG.
Hybrid measurements have recently been proposed to consider more complete information in GO. The Wang measure uses all of the parent terms of the target terms, which takes the topology of the GO DAG into account. Hybrid Relative Specificity Similarity (HRSS) combines information content, topology, annotations and MICA. Experimental results show that both the Wang and HRSS measures perform better than the traditional node-based measures [15, 22]. However, these measures still focus on only a few types of information in GO and neglect the others.
Since none of the existing measures can employ all the information in GO, an integrative approach that unites the strengths of the existing measures is preferable. In this direction, we previously proposed a rank-based gene semantic similarity measure called InteGO, which synergistically integrates multiple similarity measures (called seed measures) to take into account more aspects of GO (structure, annotation, MICA, MRCA, all of the common parents, etc.). InteGO first selects measures based on an evaluation set, and then integrates the selected measures using one of four straightforward methods (maximum, minimum, average and median). Experimental results showed that InteGO performs significantly better than the seed measures. However, the performance of InteGO is still limited, because it is vulnerable to the selection of low-performance measures, and its fixed integration strategy may not be suitable for all gene pairs.
* Our new integrative measure not only takes into account the state-of-the-art GO based measures, but also selects the most appropriate seed measures for each gene pair.
* A metaheuristic search method is presented in InteGO 2 to flexibly integrate multiple seed measures.
InteGO 2 has three steps. First, we calculate all the similarity scores using all the candidate measures and then rank them, resulting in a ranked matrix $M_r$. Second, a grouping process is applied to $M_r$ to identify the common features of all the ranked results, with which we define a set of seed measures for each gene pair, saved in $S_{seed}$. Third, we integrate all the measures in $S_{seed}$ with an addition model, in which the parameter of each component is estimated by applying a learning process to a training set T. The three steps of InteGO 2 are introduced below.
Step 1. Computing similarities using all measures
where g1 and g2 are two target genes; m is a candidate measure in $S_{all}$; and |GS| is the number of genes in gene set GS, which, according to Figure 1, is either the input gene set G or the training set T. RankSim(g1g2, m) ∈ [0, 1] indicates how similar g1 and g2 are compared with all of the gene pairs in GS. Note that although the similarities computed by each measure may be on a different scale or have a different distribution, the ranked results are comparable. Therefore, the integration of all the ranked results may better reflect functional similarity.
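The effect of ranking can be illustrated with a small sketch. Eq. 1 is not reproduced in this text, so the normalization below is an assumption: the rank score of a gene pair is taken to be the fraction of pairs whose raw score is less than or equal to its own. The function name `rank_sim` and the toy scores are hypothetical.

```python
from bisect import bisect_right

def rank_sim(raw_scores):
    """Map the raw scores of one measure to rank-based scores in (0, 1].

    raw_scores: dict mapping a gene pair (g1, g2) to a raw similarity score.
    Assumption (not the paper's exact Eq. 1): the rank score of a pair is the
    fraction of pairs whose raw score is <= its own, so ties share one value.
    """
    ordered = sorted(raw_scores.values())
    n = len(ordered)
    return {pair: bisect_right(ordered, s) / n for pair, s in raw_scores.items()}

# Two hypothetical measures on very different scales give the same ranked result.
measure_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5}
measure_b = {("g1", "g2"): 12.0, ("g1", "g3"): 3.0, ("g2", "g3"): 7.5}
same_ranking = rank_sim(measure_a) == rank_sim(measure_b)
```

Because only the ordering of the pairs matters, the two toy measures become directly comparable after ranking, which is the property the integration step relies on.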
Step 2. Selecting seed measures
Since different similarity measures use different types of information in GO, or model data in different ways, one measure may perform the best on certain functional categories but not on the others. Alternatively, the integration of suitable measures makes it possible to calculate the overall similarity score by considering all the aspects of GO. A key problem here is to select the most appropriate measures (called seed measures) for every gene pair from a pool of candidate measures.
An illustrative example of a seed measure group is shown in Figure 2(a). In the figure, with the decrease of d from d1 to , the isolated measures are in the order of m1, m3, m4, and m5, and the seed measure group includes m2, m6, m7, and m8.
A seed measure group can be labeled as high, low, or mix according to its distribution on the number axis. Mathematically, we define the label of a seed measure group using the highest number of the isolated measures in the leftmost, middle or rightmost part of the number axis. For example, the seed measure group is high in Figure 2(a), low in Figure 2(b), and mix in Figure 2(c). We label the seed measure groups because the integration strategy may differ between seed measure group types.
Step 3. Integrating seed measures
where type is the type of the seed measure group; i is a seed measure in the seed measure group; RankSim(i) is the similarity of the given gene pair calculated with measure i (Eq. 1); $X_i$ is the parameter of seed measure i, where X is H, M or L; max, min and ave represent the maximum, minimum and average of all the RankSim values for g1 and g2 over all the seed measures; and $X_\alpha$, $X_\beta$, $X_\gamma$ are their respective parameters. We include the maximum, minimum and average in Eq. 2 because the earlier InteGO experiments show that the maximal, minimal and average values perform better than individual measures in the tested conditions.
In order to use Eq. 2 for seed measure integration, the parameters, e.g. $X_\alpha$, $X_\beta$, $X_\gamma$, need to be assigned. Instead of leaving this difficult job to the end users, we estimate these parameters using a training set T. Specifically, we adopt a metaheuristic search method that gradually updates the parameters in Eq. 2 to maximize the score of an objective function on T.
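As a rough illustration of the addition model described above, the sketch below combines the rank scores of the seed measures with their maximum, minimum and average as a weighted sum. Eq. 2 itself is not reproduced in this text, so the exact form is an assumption; the sum-to-one check follows the constraint stated in the initialization step of the search. The function name `integrate`, the argument names and the example weights are hypothetical.

```python
def integrate(rank_sims, weights, w_max, w_min, w_ave):
    """Weighted additive combination of seed-measure rank scores for one gene pair.

    rank_sims: dict mapping a seed measure name to its RankSim value in [0, 1].
    weights:   dict mapping a seed measure name to its parameter X_i.
    w_max, w_min, w_ave: parameters for the maximum, minimum and average terms
                         (X_alpha, X_beta, X_gamma in the text).
    """
    values = list(rank_sims.values())
    total = sum(weights[m] for m in rank_sims) + w_max + w_min + w_ave
    if abs(total - 1.0) > 1e-9:
        raise ValueError("parameters must sum to 1.0")
    score = sum(weights[m] * rank_sims[m] for m in rank_sims)
    score += w_max * max(values)
    score += w_min * min(values)
    score += w_ave * sum(values) / len(values)
    return score

# Hypothetical parameters for one seed measure group type.
pair_scores = {"Resnik": 0.8, "Wang": 0.6, "simUI": 0.4}
params = {"Resnik": 0.2, "Wang": 0.2, "simUI": 0.1}
integrated = integrate(pair_scores, params, w_max=0.3, w_min=0.1, w_ave=0.1)
```

With non-negative parameters that sum to one, the integrated score is a convex combination of values in [0, 1] and therefore stays in that range. In this reading, one such parameter set would be kept per group type (high, mix, low).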
There is a wide variety of metaheuristics, including simulated annealing, tabu search, iterated local search, variable neighborhood search, and greedy randomized adaptive search. Others add a learning component to the search, such as ant colony optimization, evolutionary computation, and genetic algorithms. In this paper, we adopt the tabu search method. Compared with a simple local search procedure, tabu search carefully explores the neighborhood of each solution through the use of memory structures (the tabu list) to avoid getting stuck in poor-scoring areas or areas where scores plateau. Specifically, given the training set T, we use Enzyme Commission (EC) numbers as evidence of molecular function, with the criterion that the molecular functions of a group of genes are similar if they share the same EC number [15, 30, 31]. Therefore, we can locate the best candidate solutions for the next move in the search process.
where c is a small positive constant used as a Laplacian smoothing parameter; $G(e_i)$ is the set of all genes whose EC number is $e_i$, excluding gene g; $G(e_j)$ is the set of all genes whose EC number is $e_j$; and g is a gene assigned to $e_i$. Sim(g, g′, t) and Sim(g, g∗, t) are defined in Eq. 2. In Eq. 4, the numerator and denominator represent the inter-EC distance and intra-EC distance respectively. The higher $diff_g(e_i, e_j)$ is, the more pronounced the positive difference between the inter-EC distance and the intra-EC distance.
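The ratio described above can be sketched as follows. Eq. 4 is not reproduced in this text, so two choices below are assumptions rather than the paper's definition: distance is taken as 1 minus similarity, and distances are averaged over each gene set. The function `diff_score`, the lookup helper `sim` and the toy similarities are hypothetical.

```python
from statistics import mean

def diff_score(gene, same_ec, other_ec, sim, c=1e-3):
    """Ratio of inter-EC distance to intra-EC distance for one gene.

    same_ec:  genes sharing the EC number of `gene` (excluding `gene` itself).
    other_ec: genes assigned to a different EC number.
    sim:      function returning the integrated similarity of two genes.
    c:        small positive smoothing constant that keeps the ratio finite.
    Assumptions: distance = 1 - similarity, aggregated by the mean.
    """
    intra = mean(1.0 - sim(gene, g) for g in same_ec)
    inter = mean(1.0 - sim(gene, g) for g in other_ec)
    return (inter + c) / (intra + c)

# Toy similarities: gene "a" is close to "b" (same EC) and far from "c" and "d".
toy = {frozenset(("a", "b")): 0.9,
       frozenset(("a", "c")): 0.2,
       frozenset(("a", "d")): 0.4}

def sim(g1, g2):
    return toy[frozenset((g1, g2))]

d = diff_score("a", same_ec=["b"], other_ec=["c", "d"], sim=sim)
```

A value above 1 means the gene is, on average, farther from genes of the other EC group than from genes of its own group, which is the behavior a good similarity measure should produce.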
1. Initialize TL as an empty tabu list, and a set of random parameters in Eq. 2 as the current solution s (starting point) satisfying $\sum_{i \in MG} X_i + X_\alpha + X_\beta + X_\gamma = 1.0$, where X is H, M, or L. The initial best solution is bs = s.
2. Calculate the neighborhood solutions of s by increasing or decreasing one or more parameters in s. Note that we learn one group of parameters at a time. For example, while learning the parameters $H_x$, the other two groups $L_x$ and $M_x$ are fixed.
3. Select the best solution for the next move, s′, from the neighborhood solutions of s using the optimization function (Eq. 5).
4. If s′ > bs (compared by objective score), let s′ be the current solution, update TL and set bs = s′.
5. If s′ ≤ bs and s′ ∉ TL, still let the current solution s = s′ and update TL. Otherwise, delete s′ from the neighborhood solutions and go back to step 3.
6. Repeat steps 2 to 5 until bs is stable.
7. To avoid bias, repeat steps 1 to 6 multiple times and choose the best result.
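The procedure above can be sketched as a small tabu search over parameters that sum to one. This is a minimal sketch under stated assumptions, not the InteGO 2 implementation: the parameters are discretized into `units` equal parts, a neighbor is produced by moving one part from one parameter to another (which preserves the sum-to-one constraint), and "stable" is read as a fixed number of moves without improvement. The `objective` argument stands in for Eq. 5, and all names are hypothetical.

```python
import random

def tabu_search(objective, n_params, units=20, tabu_size=50, patience=25, seed=0):
    """Maximize objective(weights) over non-negative weights that sum to 1.

    A solution is a tuple of integer counts summing to `units`; the weight of
    parameter i is counts[i] / units.
    """
    rng = random.Random(seed)
    counts = [0] * n_params
    for _ in range(units):                      # step 1: random starting point
        counts[rng.randrange(n_params)] += 1
    current = tuple(counts)

    def score(sol):
        return objective([c / units for c in sol])

    best, best_score = current, score(current)
    tabu = [current]                            # memory of recently visited solutions
    stale = 0
    while stale < patience:                     # step 6: stop once the best is stable
        candidates = []
        for i in range(n_params):               # step 2: build the neighborhood
            for j in range(n_params):
                if i != j and current[i] > 0:
                    cand = list(current)
                    cand[i] -= 1
                    cand[j] += 1
                    cand = tuple(cand)
                    if cand not in tabu:
                        candidates.append(cand)
        if not candidates:
            break
        # Steps 3 to 5: move to the best non-tabu neighbor, even if it is worse
        # than the best solution so far; this lets the search leave plateaus.
        current = max(candidates, key=score)
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)
        current_score = score(current)
        if current_score > best_score:
            best, best_score, stale = current, current_score, 0
        else:
            stale += 1
    return [c / units for c in best], best_score

# Stand-in objective with a known optimum at (0.5, 0.3, 0.2).
target = [0.5, 0.3, 0.2]

def toy_objective(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

best_weights, best_value = tabu_search(toy_objective, n_params=3)
```

Step 7 (repeating the whole search to avoid bias) corresponds to calling `tabu_search` with several different `seed` values and keeping the highest-scoring result. Learning one parameter group at a time would amount to one such call per group type with the other groups held fixed inside the objective.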
We evaluate InteGO 2 on three model organisms (human, Arabidopsis and yeast) with different levels of GO annotation scale and complexity. For each of them, we use EC numbers and pathways as independent biological evidence for the molecular function and biological process categories in GO respectively. Finally, we test the robustness of InteGO 2 by gradually removing the seed measures with the best performance.
The GO annotation and structure data were downloaded from the GO website (http://www.geneontology.org/GO.downloads.shtml). The EC number and pathway information for human, Arabidopsis and yeast was downloaded from HumanCyc (http://humancyc.org), PlantCyc (http://ftp.plantcyc.org/Pathways) and the Saccharomyces Genome Database (http://www.yeastgenome.org/download-data/curation) respectively. InteGO 2 was implemented in Python 2.7 with the NetworkX package (http://networkx.github.io).
Performance evaluation on molecular function
Proteins sharing the same EC numbers are considered to have similar molecular functions. For every manually curated pathway in human, Arabidopsis and yeast, we grouped the genes based on their EC numbers (full four digits) and tested the difference between the inter- and intra-group gene-gene similarities. There are in total 125, 205 and 32 EC groups with at least three genes in human, Arabidopsis and yeast respectively.
In the experiments, we chose seven widely used measures from all three categories as candidate measures. We also added a fake measure to simulate the situation where a wrong measure is included, in order to test the robustness of InteGO 2. Among the seven measures, the SimUI and TO measures use the GO annotation information directly; the Resnik, Schlicker and SimGIC measures use annotation information to calculate the information content of GO terms; the Wang measure considers the complex topology of GO; and HRSS considers the shared path based on information content. A more detailed description is given in Additional file 3. In the fake measure, a random half of the similarity scores were computed with the Resnik measure, and the other half were set to 1 or 0, such that the similarity of two genes with the same EC is 0 and otherwise 1 (the reversed values ensure that the fake measure has low quality).
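The construction of the fake measure can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name `fake_measure`, the dictionary-based inputs and the toy data are assumptions, and the Resnik scores are taken as precomputed.

```python
import random

def fake_measure(pairs, resnik, ec_of, seed=0):
    """Build a deliberately poor similarity measure.

    pairs:  list of gene pairs (g1, g2).
    resnik: dict mapping each pair to its (precomputed) Resnik score.
    ec_of:  dict mapping each gene to its EC number.
    A random half of the pairs keeps the Resnik score; the other half gets a
    reversed score: 0 if the two genes share an EC number, 1 otherwise.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    reversed_pairs = set(rng.sample(pairs, len(pairs) // 2))
    scores = {}
    for g1, g2 in pairs:
        if (g1, g2) in reversed_pairs:
            scores[(g1, g2)] = 0.0 if ec_of[g1] == ec_of[g2] else 1.0
        else:
            scores[(g1, g2)] = resnik[(g1, g2)]
    return scores

# Toy data: four genes in two EC groups, with a constant placeholder Resnik score.
toy_ec = {"a": "1.1.1.1", "b": "1.1.1.1", "c": "2.7.1.1", "d": "2.7.1.1"}
toy_pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
toy_resnik = {p: 0.37 for p in toy_pairs}
toy_fake = fake_measure(toy_pairs, toy_resnik, toy_ec)
```

Including such a measure among the candidates checks whether the seed selection and parameter learning steps can down-weight a misleading input.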
In order to evaluate InteGO 2 systematically, we adopted a cross-validation strategy, randomly selecting 1/5 of the human ECs as the testing set (200 genes involved) and using the other 4/5 of the human ECs as the training set (823 genes involved). The same training set was used for Arabidopsis and yeast (1151 and 121 genes involved respectively). The parameters in Eq. 2 were estimated on the training set and then applied directly to the testing set to compute the EC-based LogFC scores using Eq. 5.
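The split described above is made at the level of EC groups rather than individual genes, so that all genes of one EC number fall on the same side. A minimal sketch, assuming the EC groups are held in a dictionary from EC number to gene list (the function name `split_by_ec` and the toy data are hypothetical):

```python
import random

def split_by_ec(ec_groups, test_fraction=0.2, seed=0):
    """Randomly assign whole EC groups to a training set and a testing set.

    ec_groups: dict mapping an EC number to the list of genes annotated with it.
    Returns (train, test), two dicts with disjoint EC numbers.
    """
    ecs = sorted(ec_groups)                 # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(ecs)
    n_test = max(1, round(len(ecs) * test_fraction))
    test_ecs = set(ecs[:n_test])
    train = {ec: ec_groups[ec] for ec in ecs if ec not in test_ecs}
    test = {ec: ec_groups[ec] for ec in test_ecs}
    return train, test

# Ten toy EC groups with three genes each.
toy_groups = {"1.1.1.%d" % i: ["g%d_%d" % (i, k) for k in range(3)] for i in range(10)}
train_groups, test_groups = split_by_ec(toy_groups)
```

Splitting by EC number keeps the intra-EC and inter-EC comparisons of the testing set free of any gene pair seen during parameter estimation.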
We found that the parameters for the three types of seed measure groups (high, low and mix) are significantly different, reflecting different integration strategies. The largest parameter is that of the maximum in the high seed measure groups, that of the minimum in the low seed measure groups, and that of the simUI measure in the mix seed measure groups.
Statistical analysis was carried out to test the significance of the InteGO 2 results. The p-values of the t-test indicate that the results of InteGO 2 are significantly different from the results of the other measures, except for the simGIC, simUI and Wang measures on Arabidopsis and yeast (t-test, supplementary Table S4 in Additional file 4).
Performance evaluation on biological process
Given that genes annotated to similar biological processes may be involved in the same manually curated pathway, we grouped genes based on the pathway information and evaluated InteGO 2 on these gene groups. There are in total 258, 154 and 141 pathways with at least two genes in HumanCyc, PlantCyc and the Saccharomyces Genome Database respectively.
Statistical analysis was carried out to test the significance of the InteGO 2 results. The p-values of the t-test indicate that the results of InteGO 2 are significantly different from the results of the other measures, except for the simGIC, simUI and Wang measures on Arabidopsis (t-test, supplementary Table S8 in Additional file 4).
The results indicate that InteGO 2 successfully utilizes the GO information by integrating seed measures appropriately to better capture functional similarities between genes.
Robustness of InteGO 2
Performance evaluation on protein sequences
Generating functional association maps
The calculation of GO-based gene functional similarity has already been widely applied [3–6]. However, since the existing measurements use only a subset of the GO information (e.g., topology of the DAG, annotations, MICA, edge length and all parent terms), there is a compelling need to integrate these measurements.
In this paper, we proposed a new integrative measure called InteGO 2, which automatically selects the most appropriate seed measures and integrates them using an addition model. First, we calculate the ranked similarity scores using all the measures. Second, seed measures are selected using a grouping process. Third, the parameters of the addition model are estimated by optimizing an objective function on a training set. Experimental results using ECs and pathways show that InteGO 2 performs the best among all the measures. They also show that InteGO 2 is robust against the unavailability of candidate measures. Note that we proposed InteGO in our previous work to unify different measures; it can be considered a simplified case of InteGO 2.
To demonstrate the advantages of InteGO 2, we computed the gene similarity scores for all the human, Arabidopsis and yeast genes on both the molecular function and biological process GO categories, and generated a functional association map for each organism. The new functional association maps, together with the existing biological networks, can be beneficial in medical diagnostics, and they may also provide more biological insight into gene function and regulation. In the future, we will apply InteGO 2 to more organisms and data sets (such as a protein-family-based index), and compare the new functional association maps with existing biological networks (such as protein-protein interaction networks and genetic interaction networks) to predict protein or genetic interactions based on the GO similarity scores.
This project has been funded by the U.S. Department of Energy, grant no. DE-FG02-91ER20021 to J.C.; the National High Technology Research and Development Program of China (grant nos. 2012AA020404 and 2012AA02A602) and the National Natural Science Foundation of China (grant no. 61173085) to Y.W.
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5.
- Consortium GO: Gene Ontology annotations and resources. Nucleic acids research. 2013, 41: D530-D535.
- Blake J: Ten quick tips for using the gene ontology. PLoS computational biology. 2013, 9: e1003343. 10.1371/journal.pcbi.1003343.
- Vafaee F, Rosu D, Broackes-Carter F, Jurisica I: Novel semantic similarity measure improves an integrative approach to predicting gene functional associations. BMC systems biology. 2013, 7: 22. 10.1186/1752-0509-7-22.
- Nehrt N, Clark W, Radivojac P, Hahn M: Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS computational biology. 2011, 7: e1002073. 10.1371/journal.pcbi.1002073.
- Lewis B, Shih I, Jones-Rhoades M, Bartel D, Burge C: Prediction of mammalian microRNA targets. Cell. 2003, 115: 787-798. 10.1016/S0092-8674(03)01018-3.
- Lu Z, Hunter L: GO molecular function terms are predictive of subcellular localization. PSB. 151.
- Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003, 19: 1275-1283. 10.1093/bioinformatics/btg153.
- Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose M: A knowledge-based clustering algorithm driven by gene ontology. Journal of biopharmaceutical statistics. 2004, 14: 687-700. 10.1081/BIP-200025659.
- Couto F, Silva M, Coutinho P: Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. CIKM. 2005, 343-344.
- Bodenreider O, Aubry M, Burgun A: Non-lexical approaches to identifying associative relations in the gene ontology. PSB. 2005, 91.
- Wu H, Su Z, Mao F, Olman V, Xu Y: Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic acids research. 2005, 33: 2822-2837. 10.1093/nar/gki573.
- Yu H, Gao L, Tu K, Guo Z: Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene. 2005, 352: 75-81.
- Schlicker A, Domingues F, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC bioinformatics. 2006, 7: 302. 10.1186/1471-2105-7-302.
- Riensche R, Baddeley B, Sanfilippo A, Posse C, Gopalan B: Xoa: Web-enabled cross-ontological analytics. IEEE Congress on Services. 2007, 99-105.
- Wang J, Du Z, Payattakool R, Philip S, Chen C: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23: 1274-1281. 10.1093/bioinformatics/btm087.
- Yu H, Jansen R, Stolovitzky G, Gerstein M: Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications. Bioinformatics. 2007, 23: 2163-2173. 10.1093/bioinformatics/btm291.
- del Pozo A, Pazos F, Valencia A: Defining functional distances over Gene Ontology. BMC bioinformatics. 2008, 9: 50. 10.1186/1471-2105-9-50.
- Pesquita C, Faria D, Falcao A, Lord P, Couto F: Semantic similarity in biomedical ontologies. PLoS computational biology. 2009, 5: e1000443. 10.1371/journal.pcbi.1000443.
- Othman R, Deris S, Illias R: A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of biomedical informatics. 2008, 41: 65-81. 10.1016/j.jbi.2007.05.010.
- Yang H, Nepusz T, Paccanaro A: Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics. 2012, 28: 1383-1389. 10.1093/bioinformatics/bts129.
- Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P: Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics. 2013, 29: 1424-1432. 10.1093/bioinformatics/btt160.
- Wu X, Pang E, Lin K, Pei Z: Improving the measurement of semantic similarity between gene ontology terms and gene products: Insights from an edge-and ic-based hybrid method. PloS one. 2013, 8: e66745. 10.1371/journal.pone.0066745.
- Peng J, Wang Y, Chen J: Towards integrative gene functional similarity measurement. BMC bioinformatics. 2014, 15: S5.
- Resnik P: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research. 1999, 11: 95-130.
- Jiang J, Conrath D: Semantic similarity based on corpus statistics and lexical taxonomy. ROCLING. 1997, 9008.
- Lin D: An information-theoretic definition of similarity. ICML. 1998, 98: 296-304.
- Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, Martinez-Cruz L, Rubio A: Correlation between gene expression and GO semantic similarity. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2005, 2: 330-338. 10.1109/TCBB.2005.50.
- Marler R, Arora J: The weighted sum method for multi-objective optimization: new insights. Structural and multidisciplinary optimization. 2010, 41: 853-862. 10.1007/s00158-009-0460-7.
- Glover F: Future paths for integer programming and links to artificial intelligence. Computers & Operations Research. 1986, 13: 533-549. 10.1016/0305-0548(86)90048-1.
- Karp P: Call for an enzyme genomics initiative. Genome biology. 2004, 5: 401. 10.1186/gb-2004-5-8-401.
- Díaz-Mejía J, Pérez-Rueda E, Segovia L: A network perspective on the evolution of metabolism by gene duplication. Genome biology. 2007, 8: R26. 10.1186/gb-2007-8-2-r26.
- Allison D, Cui X, Page G, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006, 7: 55-65. 10.1038/nrg1749.
- Rhee S, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nature Reviews Genetics. 2008, 9: 509-515. 10.1038/nrg2363.
- Gentleman R: Visualizing and distances using GO URL. [http://www.bioconductor.org/docs/vignettes.html]
- Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome research. 2004, 14: 1085-1094. 10.1101/gr.1910904.
- Pesquita C, Faria D, Bastos H, Falcao A, Couto F: Evaluating GO-based semantic similarity measures. Annual Bio-Ontologies Meeting. 2007, 37-40.
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. Journal of molecular biology. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.
- Guengerich F: Cytochrome p450 and chemical toxicology. Chemical research in toxicology. 2007, 21: 70-83.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.