An integrative approach for measuring semantic similarities using gene ontology
 Jiajie Peng†^{1, 2},
 Hongxiang Li†^{1},
 Qinghua Jiang^{3},
 Yadong Wang^{1} (corresponding author) and
 Jin Chen^{2, 4} (corresponding author)
https://doi.org/10.1186/1752-0509-8-S5-S8
© Peng et al.; licensee BioMed Central Ltd. 2014
Published: 12 December 2014
Abstract
Background
Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, existing GO-based similarity measurements are limited because each of them considers only a subset of the information in GO. An appropriate integration of the existing measures that takes more of the information in GO into account is needed.
Results
We propose a novel integrative measure called InteGO 2 that automatically selects appropriate seed measures and then integrates them using a metaheuristic search method. Experimental results show that InteGO 2 significantly improves the performance of gene similarity measurement in human, Arabidopsis and yeast on both the molecular function and biological process GO categories.
Conclusions
InteGO 2 computes gene-to-gene similarities more accurately than the existing measures tested and is highly robust. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
Background
The Gene Ontology (GO) provides a representation of biological knowledge through a structured, controlled vocabulary of terms, which are interrelated to form a directed acyclic graph (DAG) describing the functional information of gene products [1, 2]. GO consists of three categories that are shared by all organisms: molecular function (MF), biological process (BP) and cellular component (CC) [1]. As a widely used bioinformatics resource, GO provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications including predicting gene functional associations [3], homology analysis [4], assessing target gene functions [5], and predicting subcellular localization [6].
Since GO was released, various computational measurements have been developed to compute gene functional similarities by comparing the GO terms with which the genes are annotated [7–23]. These term comparison measurements can be classified into three categories based on the types of knowledge in GO that they use: edge-based, node-based, and hybrid [18].
The measures in the edge-based category take the structure of GO into account [11, 12, 22]. By using the topological information of the GO DAG, a recently designed method, Relative Specificity Similarity (RSS), models both the distance from a given term pair to its closest leaf terms and the distance to its most recent common ancestor (MRCA) [22]. The edge-based measures, however, are still fully dependent on the topology of the GO DAG, and it is inappropriate to simply treat terms at the same topological level as equal [18].
In the node-based category, methods originally designed for natural language processing [24–26] are utilized for term comparisons. In the earlier measures, the similarity of two GO terms is defined as the information content of their most informative common ancestor (MICA), which indicates its specificity. This approach was further advanced by modeling the distance between a given term pair and its MICA [13]. The results show strong correlations with yeast gene co-expression and protein sequence similarities [24, 27]. However, the node-based measures only consider the annotations and common ancestors, neglecting the complex topology of the GO DAG.
Hybrid measurements have recently been proposed to consider more complete information in GO. The Wang measure [15] utilizes all of the parent terms of the target terms, which takes the topology of the GO DAG into account. Hybrid Relative Specificity Similarity (HRSS) employs the concepts of information content, adapting topology, annotations and MICA [22]. Experimental results show that both the Wang and HRSS measures perform better than the traditional node-based measures [15, 22]. However, these measures still focus on only several types of information in GO and neglect others.
Since none of the existing measures can employ all the information in GO, an integrative approach that unites the strengths of the existing measures is preferred. In this direction, Peng et al. [23] proposed a rank-based gene semantic similarity measure called InteGO, which synergistically integrates multiple similarity measures (called seed measures) to take into account more aspects of GO (structure, annotation, MICA, MRCA, all of the common parents, etc.). InteGO first selects measures based on an evaluation set, and then integrates the selected measures using one of four straightforward methods (maximum, minimum, average and median). Experimental results showed that InteGO performs significantly better than the seed measures [23]. However, the performance of InteGO is still limited, because it is vulnerable to the selection of low-performance measures, and its fixed integration strategy may not be suitable for all gene pairs.

* Our new integrative measure not only takes into account the state-of-the-art GO-based measures, but also selects the most appropriate seed measures for each gene pair.

* A metaheuristic search method is presented in InteGO 2 to flexibly integrate multiple seed measures.
Method
InteGO 2 has three steps. First, we calculate the similarity scores using all the candidate measures and then rank them, resulting in a ranked matrix M_{r}. Second, a grouping process is applied to M_{r} to identify the common features of all the ranked results, with which we define a set of seed measures for each gene pair, saved in S_{seed}. Third, we integrate all the measures in S_{seed} with an addition model, in which the parameter of each component is estimated by applying a learning process to a training set T. The three steps of InteGO 2 are introduced in the following text.
Step 1. Computing similarities using all measures
where g_{1} and g_{2} are two target genes, m is a candidate measure in S_{all}, and |GS| is the number of genes in gene set GS, which, according to Figure 1, is either the input gene set G or the training set T. RankSim(g_{1}g_{2}, m) ∈ [0, 1] indicates how similar g_{1} and g_{2} are compared with all of the gene pairs in GS. Note that although the similarities computed with each measure may be on a different scale or have a different distribution, the ranked results are comparable. Therefore, the integration of all the ranked results may better reflect functional similarity.
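As an illustration of the ranking idea, the following minimal Python sketch converts the raw scores of a single measure into rank-based scores in (0, 1]. Dividing the rank by the number of scored gene pairs and giving tied scores their average rank are assumptions made for this sketch and may differ from the exact form of Eq. 1; the function name `rank_sim` and the toy gene names are hypothetical.

```python
from itertools import combinations


def rank_sim(raw_scores):
    """Map raw pairwise similarities from one measure to ranks scaled to (0, 1].

    raw_scores: dict mapping a gene pair (tuple) to a raw similarity.
    Assumption: ranks are divided by the number of scored pairs, and
    tied scores share the average of their ranks.
    """
    pairs = sorted(raw_scores, key=raw_scores.get)  # ascending raw score
    n = len(pairs)
    ranked = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of pairs tied with pairs[i].
        while j + 1 < n and raw_scores[pairs[j + 1]] == raw_scores[pairs[i]]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranked[pairs[k]] = avg_rank / n
        i = j + 1
    return ranked


# Toy example: three genes, raw scores on an arbitrary scale.
genes = ["g1", "g2", "g3"]
raw = dict(zip(combinations(genes, 2), [0.2, 7.5, 3.1]))
ranked = rank_sim(raw)
```

Because only the ordering of the raw scores matters, measures with very different scales yield directly comparable values after this transformation.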
Step 2. Selecting seed measures
Since different similarity measures use different types of information in GO, or model data in different ways, one measure may perform the best on certain functional categories but not on the others. Alternatively, the integration of suitable measures makes it possible to calculate the overall similarity score by considering all the aspects of GO. A key problem here is to select the most appropriate measures (called seed measures) for every gene pair from a pool of candidate measures.
An illustrative example of a seed measure group is shown in Figure 2(a). In the figure, as d decreases from d_{1} to d′, the measures m_{1}, m_{3}, m_{4}, and m_{5} become isolated in that order, and the seed measure group includes m_{2}, m_{6}, m_{7}, and m_{8}.
A seed measure group can be labeled as high, low, or mix according to its distribution on the number axis. Mathematically, we define the label of a seed measure group according to whether the largest number of isolated measures lies in the leftmost, middle, or rightmost part of the number axis. For example, the seed measure group in Figure 2(a) is high, the one in Figure 2(b) is low, and the one in Figure 2(c) is mix. We label the seed measure groups because the integration strategy could be different for different seed measure group types.
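The grouping idea can be sketched in Python as follows. The isolation criterion used here, that a measure is isolated when its gap to the nearest other measure on the number axis exceeds a threshold d, is an assumption chosen to match the description of Figure 2(a) and is not necessarily the criterion used by InteGO 2. The function name `split_seed_group` and the toy scores are hypothetical.

```python
def split_seed_group(rank_scores, d):
    """Split the measures for one gene pair into a seed group and isolated measures.

    rank_scores: dict mapping a measure name to its RankSim value.
    d: distance threshold on the number axis.
    Assumption: a measure is isolated when no other measure lies within d of it.
    """
    seeds, isolated = [], []
    for measure, x in rank_scores.items():
        gaps = [abs(x - y) for other, y in rank_scores.items() if other != measure]
        if gaps and min(gaps) <= d:
            seeds.append(measure)
        else:
            isolated.append(measure)
    return seeds, isolated


# Toy RankSim values for one gene pair (illustrative only).
scores = {"m1": 0.05, "m2": 0.80, "m3": 0.30, "m6": 0.83, "m7": 0.86, "m8": 0.90}
seeds, isolated = split_seed_group(scores, d=0.05)
```

With this criterion, lowering d can only move measures from the seed group to the isolated set, which is consistent with measures becoming isolated one after another as d decreases.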
Step 3. Integrating seed measures
where type is the type of the seed measure group; i is a seed measure in the seed measure group; RankSim(i) is the similarity of the given gene pair calculated with measure i (Eq. 1); X_{i} is the parameter of seed measure i, where X is H, M or L; max, min and ave represent the maximum, minimum and average of all the RankSim values for g_{1} and g_{2} over all the seed measures; and X_{α}, X_{β} and X_{γ} are their respective parameters. We include the maximum, minimum and average in Eq. 2 because the experimental results in [23] show that the maximal, minimal and average values perform better than individual measures in the tested conditions.
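The addition model can be illustrated with a short Python sketch. It assumes that Eq. 2 is a weighted sum of the seed-measure RankSim values plus weighted maximum, minimum and average terms, with one parameter set per seed group type; this form is inferred from the definitions above. The function name `integrate`, the dictionary layout and all numeric weights are hypothetical.

```python
def integrate(seed_scores, params):
    """Weighted sum of seed-measure scores plus their max, min and average.

    seed_scores: dict mapping a seed measure to its RankSim for one gene pair.
    params: dict with per-measure weights under "measures" and the three
            extra weights "alpha" (max), "beta" (min) and "gamma" (average).
    """
    values = list(seed_scores.values())
    total = sum(params["measures"][m] * s for m, s in seed_scores.items())
    total += params["alpha"] * max(values)
    total += params["beta"] * min(values)
    total += params["gamma"] * sum(values) / len(values)
    return total


# One hypothetical parameter set per seed group type; the weights sum to 1.
PARAMS = {
    "high": {
        "measures": {"Resnik": 0.2, "Wang": 0.2},
        "alpha": 0.3,
        "beta": 0.1,
        "gamma": 0.2,
    },
}
score = integrate({"Resnik": 0.8, "Wang": 0.9}, PARAMS["high"])
```

Because all weights are non-negative and sum to 1 and every term lies in [0, 1], the integrated score also stays in [0, 1].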
In order to use Eq. 2 for seed measure integration, the parameters, e.g. X_{α}, X_{β} and X_{γ}, need to be assigned. Instead of leaving this difficult job to the end users, we estimate the parameters using a training set T. Specifically, we adopt a metaheuristic search method to gradually update the parameters in Eq. 2 so as to maximize the score of an objective function on T.
There is a wide variety of metaheuristics, including simulated annealing, tabu search, iterated local search, variable neighborhood search, and greedy randomized adaptive search. Some metaheuristics also add a learning component to the search, such as ant colony optimization, evolutionary computation, and genetic algorithms. In this paper, we adopt the tabu search method. Compared with a simple local search procedure, tabu search carefully explores the neighborhood of each solution through the use of memory structures (a tabu list) to avoid getting stuck in poor-scoring areas or in areas where scores plateau [29]. Specifically, given the training set T, we use the EC (Enzyme Commission) number to explain molecular function, with the criterion that the molecular functions of a group of genes are similar if they have the same EC numbers [15, 30, 31]. Therefore, we can locate the best candidate solutions for the next move in the search process.
where c is a small positive constant that serves as a Laplacian smoothing parameter; G(e_{i}) is the set of all genes whose EC number is e_{i}, except gene g; G(e_{j}) is the set of all genes whose EC number is e_{j}; and g is a gene assigned to e_{i}. Sim(g, g′, t) and Sim(g, g^{∗}, t) are defined in Eq. 2. In Eq. 4, the numerator and denominator represent the inter-EC distance and intra-EC distance respectively. The higher diff_{g}(e_{i}, e_{j}) is, the more pronounced the positive difference between the inter-EC distance and the intra-EC distance is.
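A rough Python sketch of such an inter-EC versus intra-EC comparison for a single gene is given below. It assumes that distance is defined as 1 minus similarity, that distances are averaged over each EC group, and that the score is the logarithm of the smoothed ratio; these choices are assumptions for illustration and may not match Eq. 4 exactly. The function name `diff_g`, the toy similarity table and the gene names are hypothetical.

```python
import math


def diff_g(gene, same_ec, other_ec, sim, c=1e-6):
    """Log ratio of mean inter-EC distance to mean intra-EC distance for one gene.

    same_ec: genes sharing the EC number of `gene` (excluding `gene` itself).
    other_ec: genes assigned to a different EC number.
    sim: function returning the integrated similarity of two genes.
    Assumptions: distance = 1 - similarity; c is the smoothing constant.
    """
    inter = sum(1.0 - sim(gene, h) for h in other_ec) / len(other_ec)
    intra = sum(1.0 - sim(gene, h) for h in same_ec) / len(same_ec)
    return math.log((inter + c) / (intra + c))


# Toy similarity table (illustrative values only).
TOY = {
    frozenset(("a", "b")): 0.9,  # same EC as gene a
    frozenset(("a", "x")): 0.2,  # different EC
    frozenset(("a", "y")): 0.3,  # different EC
}


def toy_sim(g, h):
    return TOY[frozenset((g, h))]


score = diff_g("a", same_ec=["b"], other_ec=["x", "y"], sim=toy_sim)
```

A positive value means that the gene is, on average, closer to genes with its own EC number than to genes with the other EC number, which is the behaviour a good similarity measure should show.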
1. Initialize TL as the empty tabu list, and a set of random parameters in Eq. 2 as the current solution s (starting point) satisfying ∑_{i∈MG}X_{i} + X_{α} + X_{β} + X_{γ} = 1.0, where X is H, M, or L. The initial best solution is bs = s.
2. Calculate the neighborhood solutions of s by increasing or decreasing one or multiple parameters in s. Note that we learn one group of parameters at a time. For example, while learning the parameters for H_{x}, the other two groups L_{x} and M_{x} are fixed.
3. Select the best solution for the next move, s′, from the neighborhood solutions of s using the optimization function (Eq. 5).
4. If s′ > bs, let s′ be the current solution, update TL, and set bs = s′.
5. If s′ ≤ bs and s′ ∉ TL, still let the current solution s = s′ and update TL. Otherwise, delete s′ from the neighborhood solutions and go back to step 3.
6. Repeat steps 2 to 5 until bs is stable.
7. To avoid bias, repeat steps 1 to 6 multiple times and choose the best result.
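The search procedure can be sketched in Python 3 as a generic tabu search over a weight vector that sums to 1. This is a simplified illustration under stated assumptions, not the InteGO 2 implementation: weights live on a grid of `units` equal steps, a neighbor is obtained by moving one step of weight from one parameter to another, all parameters are learned together rather than one group at a time, and "bs is stable" is taken to mean no improvement for `patience` moves. The objective is passed in as a function; all names and default values are hypothetical.

```python
import random
from collections import deque


def neighbours(s):
    """All solutions reached by moving one unit of weight between two parameters."""
    out = []
    for i in range(len(s)):
        for j in range(len(s)):
            if i != j and s[i] >= 1:
                t = list(s)
                t[i] -= 1
                t[j] += 1
                out.append(tuple(t))
    return out


def tabu_search(objective, n_params, units=20, tabu_size=50,
                patience=20, restarts=5, seed=0):
    """Maximize objective(weights), where weights are non-negative and sum to 1."""
    rng = random.Random(seed)

    def weights(s):
        return [u / units for u in s]

    best, best_score = None, float("-inf")
    for _ in range(restarts):                       # step 7: several restarts
        counts = [0] * n_params                     # step 1: random start
        for _ in range(units):
            counts[rng.randrange(n_params)] += 1
        s = tuple(counts)
        run_best, run_score = s, objective(weights(s))
        tabu = deque([s], maxlen=tabu_size)         # fixed-length tabu list
        stale = 0
        while stale < patience:                     # step 6: until bs is stable
            # Steps 2 and 5: neighborhood of s without tabu solutions.
            cands = [t for t in neighbours(s) if t not in tabu]
            if not cands:
                break
            # Step 3: best admissible neighbor; accepted even if it is worse.
            s = max(cands, key=lambda t: objective(weights(t)))
            tabu.append(s)
            score = objective(weights(s))
            if score > run_score:                   # step 4: new best solution
                run_best, run_score, stale = s, score, 0
            else:
                stale += 1
        if run_score > best_score:
            best, best_score = run_best, run_score
    return weights(best), best_score


# Toy objective: recover a known weight vector (illustrative only).
TARGET = [0.5, 0.3, 0.2]


def toy_objective(w):
    return -sum((a - b) ** 2 for a, b in zip(w, TARGET))


found, found_score = tabu_search(toy_objective, n_params=3)
```

In practice the toy objective would be replaced by the EC-based score computed on the training set T, and accepting non-improving moves while forbidding recently visited solutions is what lets the search leave plateaus.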
Results
We evaluate InteGO 2 on three model organisms (human, Arabidopsis and yeast) with different levels of GO annotation scale and complexity [33]. For each of them, we use EC numbers and pathways as independent biological evidence for the molecular function and biological process categories in GO respectively. Finally, we test the robustness of InteGO 2 by gradually removing the seed measures with the best performance.
Data preparation
The GO annotation and structure data were downloaded from the GO website (http://www.geneontology.org/GO.downloads.shtml). The EC number and pathway information of human, Arabidopsis and yeast were downloaded from HumanCyc (http://humancyc.org), PlantCyc (http://ftp.plantcyc.org/Pathways) and the Saccharomyces Genome Database (http://www.yeastgenome.org/downloaddata/curation) respectively. InteGO 2 was implemented in Python 2.7 using the NetworkX package (http://networkx.github.io).
Performance evaluation on molecular function
Proteins sharing the same EC numbers are considered to have similar molecular functions. For every manually curated pathway in human, Arabidopsis and yeast, we grouped the genes based on their EC numbers (full four digits) and tested the difference between the inter- and intra-group gene-gene similarities. There are in total 125, 205 and 32 EC groups with at least three genes in human, Arabidopsis and yeast respectively.
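The grouping step can be sketched in Python as follows. The sketch keeps only fully specified four-digit EC numbers and groups with at least three genes; for simplicity it ignores the pathway level described above. The function name `ec_groups`, the input layout and the toy gene names are hypothetical.

```python
from collections import defaultdict


def ec_groups(gene_to_ecs, min_size=3):
    """Group genes by full four-digit EC number and keep groups of at least min_size.

    gene_to_ecs: dict mapping a gene name to a list of EC number strings.
    """
    groups = defaultdict(set)
    for gene, ecs in gene_to_ecs.items():
        for ec in ecs:
            parts = ec.split(".")
            # Skip partial EC numbers such as "2.7.1.-".
            if len(parts) == 4 and all(p.isdigit() for p in parts):
                groups[ec].add(gene)
    return {ec: genes for ec, genes in groups.items() if len(genes) >= min_size}


# Toy annotation table (illustrative only).
toy = {
    "geneA": ["1.1.1.1"],
    "geneB": ["1.1.1.1", "2.7.1.-"],
    "geneC": ["1.1.1.1"],
    "geneD": ["3.4.21.4"],
}
groups = ec_groups(toy)
```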
In the experiments, we chose seven widely used measures from all three categories as candidate measures. We also added a fake measure to simulate the situation where a wrong measure is included, in order to test the robustness of InteGO 2. Among the seven measures, the SimUI [34] and TO [35] measures use the GO annotation information directly; the Resnik [24], Schlicker [13] and SimGIC [36] measures use annotation information to calculate the information content of GO terms; the Wang [15] measure considers the complex topology of GO; and HRSS [22] considers the shared path based on information content. A more detailed description is given in Additional file 3. In the fake measure, a random half of the similarity scores were computed with the Resnik measure, and the other half were set to 0 for two genes with the same EC number and to 1 otherwise (the reversed values ensure that the fake measure has low quality).
In order to evaluate InteGO 2 systematically, we adopted a cross-validation strategy, randomly selecting 1/5 of the human ECs as the testing set (200 genes involved) and using the other 4/5 of the human ECs as the training set (823 genes involved). The same training set was used for Arabidopsis and yeast (1151 and 121 genes involved respectively). Using the training set, the parameters in Eq. 2 were estimated, and these were directly applied to the testing set to compute the EC-based LogFC scores using Eq. 5.
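A random split of EC groups into training and testing sets can be sketched in Python as follows. The fixed random seed, the rounding rule and the function name `split_ecs` are assumptions made for reproducibility of the example; the placeholder EC identifiers are hypothetical.

```python
import random


def split_ecs(ec_ids, test_fraction=0.2, seed=0):
    """Randomly split EC identifiers into (training, testing) lists."""
    ids = sorted(ec_ids)                  # fixed order before shuffling
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Example with 125 placeholder EC identifiers, the number of human EC groups.
placeholder_ecs = ["EC_%03d" % k for k in range(125)]
train, test = split_ecs(placeholder_ecs)
```

Splitting at the level of EC groups rather than individual genes keeps all genes of one EC number on the same side of the split.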
We found that the parameters for the three types of seed measure groups (high, low and mix) are significantly different, reflecting different integration strategies. The largest parameter in the high seed measure groups is that of the maximum, in the low seed measure groups that of the minimum, and in the mix seed measure groups that of the SimUI measure.
Statistical analysis was carried out to test the significance of the InteGO 2 results. The p-values of the t-test indicate that the results of InteGO 2 are significantly different from the results of the other measures, except for the SimGIC, SimUI and Wang measures on Arabidopsis and yeast (t-test, supplementary Table S4 in Additional file 4).
Performance evaluation on biological process
Given that genes annotated to similar biological processes may be involved in the same manually curated pathway, we grouped genes based on the pathway information and evaluated InteGO 2 on these gene groups. There are in total 258, 154 and 141 pathways with at least two genes in HumanCyc, PlantCyc and the Saccharomyces Genome Database respectively.
Statistical analysis was carried out to test the significance of the InteGO 2 results. The p-values of the t-test indicate that the results of InteGO 2 are significantly different from the results of the other measures, except for the SimGIC, SimUI and Wang measures on Arabidopsis (t-test, supplementary Table S8 in Additional file 4).
The results indicate that InteGO 2 successfully utilizes the GO information by integrating seed measures appropriately to better capture functional similarities between genes.
Robustness of InteGO 2
Performance evaluation on protein sequences
Generating functional association maps
Conclusions
The calculation of GO-based gene functional similarity has already been widely applied [3–6]. However, since the existing measurements only use a subset of the GO information (e.g., topology of the DAG, annotations, MICA, edge length and all the parent terms), there is a compelling need to integrate these measurements.
In this paper, we proposed a new integrative measure called InteGO 2, which automatically selects the most appropriate seed measures and integrates them using an addition model. First, we calculate the ranked similarity scores using all the measures. Second, seed measures are selected using a grouping process. Third, the parameters of the addition model are estimated by optimizing an objective function on a training set. Experimental results using ECs and pathways show that InteGO 2 performs the best among all the measures. They also show that InteGO 2 is robust against the unavailability of candidate measures. Note that InteGO, which we proposed in previous work to unify different measures [23], can be considered a simplified case of InteGO 2.
To demonstrate the advantages of InteGO 2, we computed the gene similarity scores for all the human, Arabidopsis and yeast genes on both the molecular function and biological process GO categories, and generated a functional association map for each organism. The new functional association maps, together with the existing biological networks, can be beneficial in medical diagnostics, and they may also provide more biological insights into gene function and regulation. In the future, we will apply InteGO 2 to more organisms and data sets (such as a protein-family-based index) and compare the new functional association maps with existing biological networks (such as protein-protein interaction networks and genetic interaction networks) to predict protein or genetic interactions based on the GO similarity scores.
Declarations
Acknowledgements
This project has been funded by the U.S. Department of Energy, grant no. DE-FG02-91ER20021 to J.C.; the National High Technology Research and Development Program of China (grant nos. 2012AA020404 and 2012AA02A602) and the National Natural Science Foundation of China (grant no. 61173085) to Y.W.
Declarations
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5.
References
1. Consortium GO: Gene Ontology annotations and resources. Nucleic acids research. 2013, 41: D530-D535.
2. Blake J: Ten quick tips for using the gene ontology. PLoS computational biology. 2013, 9: e1003343. 10.1371/journal.pcbi.1003343.
3. Vafaee F, Rosu D, Broackes-Carter F, Jurisica I: Novel semantic similarity measure improves an integrative approach to predicting gene functional associations. BMC systems biology. 2013, 7: 22. 10.1186/1752-0509-7-22.
4. Nehrt N, Clark W, Radivojac P, Hahn M: Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS computational biology. 2011, 7: e1002073. 10.1371/journal.pcbi.1002073.
5. Lewis B, Shih I, Jones-Rhoades M, Bartel D, Burge C: Prediction of mammalian microRNA targets. Cell. 2003, 115: 787-798. 10.1016/S0092-8674(03)01018-3.
6. Lu Z, Hunter L: GO molecular function terms are predictive of subcellular localization. PSB. 151
7. Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003, 19: 1275-1283. 10.1093/bioinformatics/btg153.
8. Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose M: A knowledge-based clustering algorithm driven by gene ontology. Journal of biopharmaceutical statistics. 2004, 14: 687-700. 10.1081/BIP-200025659.
9. Couto F, Silva M, Coutinho P: Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. CIKM. 2005, 343-344.
10. Bodenreider O, Aubry M, Burgun A: Non-lexical approaches to identifying associative relations in the gene ontology. PSB. 2005, 91
11. Wu H, Su Z, Mao F, Olman V, Xu Y: Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic acids research. 2005, 33: 2822-2837. 10.1093/nar/gki573.
12. Yu H, Gao L, Tu K, Guo Z: Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene. 2005, 352: 75-81.
13. Schlicker A, Domingues F, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC bioinformatics. 2006, 7: 302. 10.1186/1471-2105-7-302.
14. Riensche R, Baddeley B, Sanfilippo A, Posse C, Gopalan B: XOA: Web-enabled cross-ontological analytics. IEEE Congress on Services. 2007, 99-105.
15. Wang J, Du Z, Payattakool R, Philip S, Chen C: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23: 1274-1281. 10.1093/bioinformatics/btm087.
16. Yu H, Jansen R, Stolovitzky G, Gerstein M: Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications. Bioinformatics. 2007, 23: 2163-2173. 10.1093/bioinformatics/btm291.
17. del Pozo A, Pazos F, Valencia A: Defining functional distances over Gene Ontology. BMC bioinformatics. 2008, 9: 50. 10.1186/1471-2105-9-50.
18. Pesquita C, Faria D, Falcao A, Lord P, Couto F: Semantic similarity in biomedical ontologies. PLoS computational biology. 2009, 5: e1000443. 10.1371/journal.pcbi.1000443.
19. Othman R, Deris S, Illias R: A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of biomedical informatics. 2008, 41: 65-81. 10.1016/j.jbi.2007.05.010.
20. Yang H, Nepusz T, Paccanaro A: Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics. 2012, 28: 1383-1389. 10.1093/bioinformatics/bts129.
21. Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P: Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics. 2013, 29: 1424-1432. 10.1093/bioinformatics/btt160.
22. Wu X, Pang E, Lin K, Pei Z: Improving the measurement of semantic similarity between gene ontology terms and gene products: Insights from an edge- and IC-based hybrid method. PloS one. 2013, 8: e66745. 10.1371/journal.pone.0066745.
23. Peng J, Wang Y, Chen J: Towards integrative gene functional similarity measurement. BMC bioinformatics. 2014, 15: S5
24. Resnik P: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research. 1999, 11: 95-130.
25. Jiang J, Conrath D: Semantic similarity based on corpus statistics and lexical taxonomy. ROCLING. 1997, 9008
26. Lin D: An information-theoretic definition of similarity. CM. 1998, 98: 296-304.
27. Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, Martinez-Cruz L, Rubio A: Correlation between gene expression and GO semantic similarity. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2005, 2: 330-338. 10.1109/TCBB.2005.50.
28. Marler R, Arora J: The weighted sum method for multi-objective optimization: new insights. Structural and multidisciplinary optimization. 2010, 41: 853-862. 10.1007/s00158-009-0460-7.
29. Glover F: Future paths for integer programming and links to artificial intelligence. Computers & Operations Research. 1986, 13: 533-549. 10.1016/0305-0548(86)90048-1.
30. Karp P: Call for an enzyme genomics initiative. Genome biology. 2004, 5: 401. 10.1186/gb-2004-5-8-401.
31. Díaz-Mejía J, Pérez-Rueda E, Segovia L: A network perspective on the evolution of metabolism by gene duplication. Genome biology. 2007, 8: R26. 10.1186/gb-2007-8-2-r26.
32. Allison D, Cui X, Page G, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006, 7: 55-65. 10.1038/nrg1749.
33. Rhee S, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nature Reviews Genetics. 2008, 9: 509-515. 10.1038/nrg2363.
34. Gentleman R: Visualizing and distances using GO. [http://www.bioconductor.org/docs/vignettes.html]
35. Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome research. 2004, 14: 1085-1094. 10.1101/gr.1910904.
36. Pesquita C, Faria D, Bastos H, Falcao A, Couto F: Evaluating GO-based semantic similarity measures. Annual Bio-Ontologies Meeting. 2007, 37-40.
37. Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. Journal of molecular biology. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.
38. Guengerich F: Cytochrome P450 and chemical toxicology. Chemical research in toxicology. 2007, 21: 70-83.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.