Novel semantic similarity measure improves an integrative approach to predicting gene functional associations

Background: Elucidating direct and indirect protein interactions and gene associations is required to fully understand the workings of the cell. This can be achieved through both low- and high-throughput biological experiments and in silico methods. We present GAP (Gene functional Association Predictor), an integrative method for predicting and characterizing gene functional associations. GAP integrates different biological features using a novel taxonomy-based semantic similarity measure to predict and prioritize high-quality putative gene associations. The proposed similarity measure increases the information gain from the available gene annotations. The annotation information is incorporated from several public pathway databases, Gene Ontology annotations, and drug and disease associations from the scientific literature.
Results: We evaluated GAP by comparing its prediction performance with that of several other well-known functional interaction prediction tools over a comprehensive dataset of known direct and indirect interactions, and observed significantly better prediction performance. We also selected a small set of GAP’s highly-scored novel predicted pairs (i.e., pairs currently not found in any known database or dataset) and, by manually searching the literature for experimental evidence accessible in the public domain, confirmed different categories of predicted functional associations with available evidence of interaction. We also provided additional supporting evidence for a subset of the predicted functionally-associated pairs using an expert-curated database of genes associated with autism spectrum disorders.
Conclusions: GAP’s predicted “functional interactome” contains ≈1M highly-scored predicted functional associations, of which about 90% are novel (i.e., not experimentally validated). GAP’s novel predictions connect disconnected components and singletons to the main connected component of the known interactome.
It can, therefore, be a valuable resource for biologists by providing corroborating evidence for and facilitating the prioritization of potential direct or indirect interactions for experimental validation. GAP is freely accessible through a web portal: http://ophid.utoronto.ca/gap.

$$\mathrm{IC}_{\mathrm{Resnik}}(t) = -\log(p(t)). \qquad (1)$$

Seco measure: An early example of an intrinsic approach to estimating information content was introduced by Seco et al. [2] for assessing the similarity of terms in the WordNet thesaurus:

$$\mathrm{IC}_{\mathrm{Seco}}(t) = 1 - \frac{\log(\mathrm{hyponyms}(t) + 1)}{\log(\mathrm{max}_t)},$$

where hyponyms(t) returns the number of hyponyms of term t and max_t is a constant set to the maximum number of terms in WordNet. The denominator is a scaling factor, which ensures that the information content of the most informative term is 1.
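The two definitions above can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the Seco measure, 1 − log(hyponyms(t) + 1) / log(max_t), and uses a small made-up taxonomy; the term names, the `CHILDREN` table, and the function names are hypothetical and are not part of GAP.

```python
import math

# Hypothetical toy taxonomy: each term maps to its direct children.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def count_hyponyms(term):
    """Number of distinct terms strictly below `term` in the taxonomy."""
    seen = set()
    stack = list(CHILDREN[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN[node])
    return len(seen)

def ic_resnik(p_t):
    """Corpus-based information content: -log of the term's probability."""
    return -math.log(p_t)

def ic_seco(term, max_terms):
    """Intrinsic information content, scaled by the taxonomy size."""
    return 1.0 - math.log(count_hyponyms(term) + 1) / math.log(max_terms)

if __name__ == "__main__":
    n_terms = len(CHILDREN)
    for term in CHILDREN:
        print(term, round(ic_seco(term, n_terms), 3))
```

In this sketch a leaf term has no hyponyms and therefore receives the maximum value of 1, while the root, which subsumes every other term, receives 0.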
Leaves measure: The Leaves measure assumes that the information content of a term t depends only on the number of terminal concepts, i.e., leaf terms, subsumed by t in a given taxonomy:

$$\mathrm{IC}_{\mathrm{leaves}}(t) = 1 - \frac{\log(\mathrm{leaf\_subconcepts}(t, \mathrm{relations}) + 1)}{\log(\mathrm{all\_leaves}(\mathrm{relations}))},$$

where leaf_subconcepts(t, relations) is the number of most specific terms subsumed by t, and all_leaves(relations) denotes the number of terminal terms in the ontology graph induced by the relations under consideration, e.g., is_a and part_of.

Estimating the significance of gene similarity scores
We assess the significance of the gene similarity scores, $\delta(g_i, g_j)$, returned by GAP using a phenotype-based permutation test, as follows. Let $n$ be the number of genes, which we consider ordered, i.e., $g_1, \ldots, g_n$, and let $\pi$ be a uniformly random permutation of $\{1, \ldots, n\}$, with image $\{\pi(1), \ldots, \pi(n)\}$. For each feature $F_k$, let $\{F_k^1, \ldots, F_k^i, \ldots, F_k^n\}$ be the feature sets associated with the $n$ genes. We use $\pi$ to randomly reassign these feature sets so that $F_k^i$ is reassigned to the $\pi(i)$-th gene. We then recompute the similarity between genes using the permuted samples. We repeat this process 10,000 times to generate a null distribution for the GAP gene similarity scores.
The nominal p-value for each $\delta(g_i, g_j)$ is then calculated as the proportion of permutation samples for which the sampled gene similarity score $\delta_\pi(g_i, g_j)$ is greater than or equal to the actual gene similarity score $\delta(g_i, g_j)$:

$$p\text{-value} = \frac{1}{10{,}000} \sum_{\pi} I\big(\delta_\pi(g_i, g_j) \geq \delta(g_i, g_j)\big),$$

where the sum runs over the 10,000 sampled permutations and $I(\text{condition}) = 1.0$ when the condition is satisfied and 0.0 otherwise.
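The permutation procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the `similarity` function is a stand-in (mean Jaccard overlap across features) and not GAP's actual measure, and the data layout (`feature_sets[k][i]` holding the feature set of gene i for feature k) is hypothetical.

```python
import random

def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(feature_sets, i, j):
    """Stand-in for delta(g_i, g_j): mean Jaccard overlap over all features."""
    return sum(jaccard(sets[i], sets[j]) for sets in feature_sets) / len(feature_sets)

def permutation_p_value(feature_sets, i, j, n_perm=10_000, seed=0):
    """Proportion of permutations whose score is >= the observed score."""
    rng = random.Random(seed)
    n = len(feature_sets[0])
    observed = similarity(feature_sets, i, j)
    hits = 0
    for _ in range(n_perm):
        pi = list(range(n))
        rng.shuffle(pi)  # uniformly random permutation of the gene indices
        permuted = []
        for sets in feature_sets:
            reassigned = [None] * n
            for source, target in enumerate(pi):
                # The feature set of gene `source` goes to gene pi(source).
                reassigned[target] = sets[source]
            permuted.append(reassigned)
        if similarity(permuted, i, j) >= observed:
            hits += 1
    return hits / n_perm
```

The same permutation is applied to every feature within one iteration, following the description above; drawing an independent permutation per feature would be a different null model.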

Performance Evaluation Measures
To assess GAP's performance, we used the F1-score, precision versus recall curves, and the area under the Receiver Operator Characteristic (ROC) curve, as described below.

F1-score
The F1-score [3] is a performance measure that combines precision and recall values into a single score. In our context, for each query gene, precision refers to the fraction of retrieved interacting partners that are known (i.e., found in the gold standard dataset) to interact with the given gene. Recall measures the fraction of the known interacting partners of the query gene that have been retrieved by the interaction prediction tool. The F1-score is the harmonic mean of precision and recall, and ranges between 0 and 1. We used the F1-score to compare GAP's performance across different configuration settings because a precision versus recall graph would be unreadable given the large number of curves to be compared. Once the best-performing configuration had been identified using the F1-score, the precision versus recall curve was used for the subsequent performance comparisons.
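As a small worked illustration of these definitions, the following Python sketch computes precision, recall, and F1 for a single query gene. The partner identifiers are placeholders, and the function name is hypothetical.

```python
def precision_recall_f1(retrieved, known):
    """Precision, recall and F1 for one query gene.

    retrieved: partners returned by the prediction tool
    known:     partners listed in the gold standard
    """
    retrieved, known = set(retrieved), set(known)
    true_positives = len(retrieved & known)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(known) if known else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

if __name__ == "__main__":
    # Four partners retrieved, two of which are among the three known partners.
    print(precision_recall_f1(["P1", "P2", "P3", "P4"], ["P1", "P2", "P5"]))
```

With two of four retrieved partners correct and two of three known partners found, precision is 1/2, recall is 2/3, and F1 is 4/7.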

Precision vs. recall curve (PR curve)
Although easy to calculate, the F1-score has certain limitations: it is not sensitive to the ranked order of the retrieved interactions as it is computed using the unordered set of retrieved interactions. As such, the rank (score) of the truly interacting protein pairs does not affect the F1-score, which is a shortcoming as, in general, users are more interested in the highly-scored predicted interactions, and expect the "true positives" to appear at the top of the ranked list of predicted interactions.
To remedy this, we used precision versus recall curves [3], which plot the precision values at every recall point. Given a query gene and a ranked list of predicted interacting partners, a PR curve is constructed by traversing down the list and plotting the precision value for each recall point. In general, a predictor A is assumed to be better than a predictor B if, at every recall point, A's precision value is higher than B's.
The precision versus recall graph is defined for a single query; however, to arrive at a meaningful conclusion, performance comparisons should be based on several queries. We therefore need a technique for interpolating precision values in order to evaluate the overall retrieval performance for a given set of queries. In this paper, we used the ceiling interpolation method, which is commonly used in the information retrieval literature [4].
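A minimal Python sketch of averaging PR curves over several queries is given below. It assumes the common definition of ceiling interpolation, in which the interpolated precision at a recall level is the highest precision observed at that or any higher recall; the choice of recall levels and all function names are assumptions made for illustration.

```python
def pr_points(ranked, known):
    """(recall, precision) at each rank where a known partner is found."""
    known = set(known)
    points, hits = [], 0
    for rank, item in enumerate(ranked, start=1):
        if item in known:
            hits += 1
            points.append((hits / len(known), hits / rank))
    return points

def interpolate(points, levels):
    """Ceiling interpolation: best precision at any recall >= each level."""
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

def mean_interpolated_curve(queries, levels):
    """Average the interpolated precision over all (ranked, known) queries."""
    curves = [interpolate(pr_points(ranked, known), levels)
              for ranked, known in queries]
    return [sum(column) / len(curves) for column in zip(*curves)]

if __name__ == "__main__":
    queries = [
        (["a", "x", "b", "y"], {"a", "b"}),
        (["x", "a"], {"a", "z"}),
    ]
    print(mean_interpolated_curve(queries, [0.0, 0.5, 1.0]))
```

Interpolating every query at the same recall levels is what makes the per-query curves directly comparable and averageable.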

Area Under the ROC Curve (AUC)
Receiver Operator Characteristic (ROC) curves [5] plot the true-positive rate (i.e., recall) against the false-positive rate for different cutoff values of the predicted scores. ROC curves therefore measure the trade-off between sensitivity and specificity.
The ROC curve can be aggregated into a scalar metric by computing the area under the curve (AUC) [5]. The AUC can be interpreted as "the expectation that a uniformly drawn random positive is ranked before a uniformly drawn random negative" [6], which is equivalent to the Wilcoxon-Mann-Whitney U statistic [7]. The AUC takes values between 0.0 and 1.0. Since random guessing produces the diagonal line, which has an area of 0.5, any useful classifier should have an AUC greater than 0.5.
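The rank-based interpretation of the AUC can be made concrete with a short Python sketch that counts, over all positive-negative pairs, how often the positive receives the higher score (ties count as one half). This quadratic-time version is meant only to illustrate the definition; the function name and the example scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties contribute one half
    return wins / (len(positive_scores) * len(negative_scores))

if __name__ == "__main__":
    # Three of the four positive-negative pairs are ordered correctly.
    print(auc([0.9, 0.3], [0.5, 0.1]))
```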

Gene association and protein interaction prediction methods compared with GAP
To comprehensively assess GAP's performance, we considered a broad range of gene association and protein-protein interaction prediction methods. We focused on methods that are functional, i.e., not specifically designed for direct protein interactions (for instance, tools using molecular docking or protein structural similarity algorithms were not selected because, by design, they predict only a subset of interactions). We also favored tools that either offer web servers or make their predicted interactome available for download. Furthermore, as we are interested in human gene associations, we excluded methods specifically designed for other species. The predictors considered for inclusion in this study, their prediction features and methodologies, and the corresponding selection constraints are listed in Table S1, which is followed by a brief description of the subset of methods selected for comparison with GAP.
Table S1: List of considered gene association or protein interaction prediction methods. Each method is referred to by the first author's name and, if available, the server's name and hyperlink are also provided. Studies are listed chronologically. The predictive features and the methodology used by each study are given in the second column. The constraints of each method that led us to exclude the corresponding predictor from the comparison are listed in the last column.

Method | Predictive features / methodology | Constraints
[...] | Gene ontology, co-occurrence in tissue, gene expression, sequence similarity, homology based, and domain interaction / uses four active learning algorithms for selecting the protein pairs to be used for training a random forest algorithm | Not available
I. Lee et al. [12], WormNet | Gene expression, physical and genetic interaction assays of C. elegans, scientific literature, functional associations of yeast orthologs / uses a modified Bayesian integration of different data types, and a log-likelihood scoring mechanism | Designed specifically for C. elegans
M. Singhal et al. [13] | Domain information / uses support vector machine algorithm | Designed for physical interaction prediction only, not available

*"Not available" means that the predicted interactome for human proteins is not available for download and there is no web server through which one can retrieve the interacting partners of the genes of interest.

Selected methods, brief description
Below is a brief description of the tools selected for comparison with GAP.

GeneMANIA: Gene Multiple Association Network Integration Algorithm [20]
The GeneMANIA algorithm comprises (1) a heuristic algorithm, based on ridge regression, which calculates a composite functional association network from several networks derived from different genomic or proteomic data sources, e.g., protein-protein, protein-DNA and genetic interactions, pathways, reactions, gene expression data, protein domains and phenotypic screening profiles, and (2) an efficient implementation of Gaussian field label propagation algorithms, which predict gene function given the composite network constructed by the heuristic algorithm.

I2D-Pred: Interologous Interaction Database-Predicted [24]
I2D (Interologous Interaction Database) is an on-line database of known and predicted mammalian and eukaryote protein-protein interactions. We used I2D's known human PPIs in the construction of our gold standard, and compared GAP's performance against I2D-Pred, I2D's set of 59,373 interolog-based predicted interactions. I2D-Pred is constructed by mapping model organism (i.e., S. cerevisiae, C. elegans, D. melanogaster, M. musculus, and rat) protein interactions to human protein orthologs using BLASTP and the reciprocal best-hit approach. Using the constructed database of model organism-to-human orthologs, each model organism protein was translated to its human ortholog, and a predicted human interaction was added to the database if both proteins in the model organism interaction were conserved in humans.
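The interolog mapping described above can be sketched in Python as follows. This is a simplified illustration, not I2D's implementation: it assumes that best BLASTP hits in each direction are already available as dictionaries, and all protein identifiers and function names are hypothetical.

```python
def reciprocal_best_hits(model_to_human, human_to_model):
    """Keep model-organism proteins whose best human hit points back to them."""
    return {
        model: human
        for model, human in model_to_human.items()
        if human_to_model.get(human) == model
    }

def predict_interologs(model_interactions, orthologs):
    """Transfer an interaction to human when both partners have an ortholog."""
    predicted = set()
    for protein_a, protein_b in model_interactions:
        human_a = orthologs.get(protein_a)
        human_b = orthologs.get(protein_b)
        if human_a is not None and human_b is not None:
            # frozenset makes the predicted pair unordered
            predicted.add(frozenset((human_a, human_b)))
    return predicted

if __name__ == "__main__":
    orthologs = reciprocal_best_hits(
        {"y1": "H1", "y2": "H2", "y3": "H1"},  # best human hit per yeast protein
        {"H1": "y1", "H2": "y2"},              # best yeast hit per human protein
    )
    print(predict_interologs([("y1", "y2"), ("y2", "y3")], orthologs))
```

Here y3 is dropped from the ortholog table because its best human hit, H1, maps back to y1, so the y2-y3 interaction is not transferred.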

PIP: Potential Interactions of Proteins [22]
PIP is a web server delivering predicted protein interactions for human, rat, and fission yeast. The predictions are made via homology with experimentally derived protein-protein interactions from various species. The homologous interacting pairs of experimentally supported protein interactions are identified by running BLAST searches for the entire genomes of the species of interest against all proteins in the DIP [29] and MIPS [30] databases. The putative protein interactions are given confidence scores based on their homology to experimentally observed interacting proteins. The confidence scores are then weighted according to the amount of available experimental evidence, i.e., higher weight is given to more frequently observed interactions. Once the network of interacting proteins is constructed, the number of individual interactions is reduced by using a clustering method aimed at identifying key interconnected network nodes.

PIPs: Human protein-protein interactions prediction database [16]
The PIPs database is a web resource that predicts human protein-protein interactions using a naïve Bayesian method that combines information from gene expression, orthology, domain co-occurrence, post-translational modifications, co-localization, and the analysis of the local topology of the predicted PPI network.
Each evidence type is considered as a separate module providing an interaction score. The individual module scores are combined into a prediction score corresponding to the overall likelihood of the potential interaction given the available data. PIPs contains 37,606 high-probability interactions (i.e., with a score ≥ 1, indicating that the interaction is more likely to occur than not). Of these, 3,400 are not reported in the HPRD, BIND, or DIP interaction databases [16].
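One generic way to combine independent evidence modules under a naïve Bayes assumption is to multiply the prior odds of interaction by the likelihood ratio contributed by each module. The Python sketch below illustrates that idea only; it is an assumption about the general technique rather than a description of the PIPs implementation, and the numeric values and function names are invented for the example.

```python
import math

def posterior_odds(prior_odds, likelihood_ratios):
    """Naive Bayes combination: prior odds times per-module likelihood ratios."""
    return prior_odds * math.prod(likelihood_ratios)

def is_high_probability(score, threshold=1.0):
    """A score >= 1 means interaction is at least as likely as non-interaction."""
    return score >= threshold

if __name__ == "__main__":
    # Hypothetical prior odds and module likelihood ratios for one protein pair.
    score = posterior_odds(0.5, [4.0, 2.0])
    print(score, is_high_probability(score))
```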

PPI Finder: A Mining Tool for Human Protein-Protein Interactions [15]
PPIFinder is a web-based tool which mines human protein-protein interactions from PubMed abstracts based on name co-occurrence and interaction-related keywords. PPIFinder uses a hybrid frame-based approach which incorporates both statistical and computational methods. It follows a typical frequency-based statistical method for retrieving genes related to a query gene based on their co-occurrences in the PubMed abstracts. However, PPIFinder also employs computational linguistic methods to extract semantic descriptions of the predicted interactions from the literature. PPIFinder also searches for Gene Ontology annotations and uses the shared GO annotations to infer potential protein interactions.
According to the reported statistics [15], only 28% of the co-occurring protein pairs in PubMed abstracts appeared in any of the frequently used human PPI databases (HPRD, BioGRID and BIND). On the other hand, out of the known interacting pairs in HPRD, 69% co-occur in the literature, and 65% share GO annotations.

STRING: Search Tool for the Retrieval of Interacting Genes/Proteins [8]
The STRING database and web tool is a meta-resource of known and predicted protein-protein associations derived from four sources: genomic context, high-throughput experiments, co-expression, and scientific literature. STRING is developed by a consortium of academic institutions and is regularly updated; the latest version covers about 5.2 million proteins from 1,133 species. STRING imports protein association knowledge from databases of physical interactions and databases of curated biological pathways (e.g., MINT, HPRD, BIND, DIP, BioGRID, KEGG, and Reactome). Besides the experimentally derived gene associations, STRING also stores computationally predicted interactions from the text mining of scientific texts as well as interactions inferred from genomic features.
In terms of usage, given a query gene, STRING retrieves all genes that repeatedly occur within the same cluster as the gene of interest, where a gene cluster is defined as in [31]. Different genomic features (e.g., gene neighborhood, gene fusion events, and co-expression) are used in constructing the gene clusters. Text-based predicted interactions are derived simply by searching for gene name co-occurrence in PubMed abstracts [32].

CD82 and AMFR connectivity in the protein-protein interaction network
At this level, AMFR is connected to CD82, while the network contains 60% of the nodes and 40% of the interactions of the whole PPI interactome. CD82 and AMFR are connected via ten different shortest paths of length three; one, via EGFR and HSPD1, is highlighted in the figure. CD82 is highly predicted by GAP to interact with AMFR, and we confirmed that this interaction is experimentally validated in the literature. Therefore, our confirmation of this GAP prediction improves the connectivity of the protein-protein interaction network. The names of genes with many connections on the next level of the network are explicitly displayed in sub-figures (b) and (c).

Figure S2: Functional inter-connectivity of all autism genes predicted by GAP using the upper 20-quantile as the threshold-setting measure (p-value < 0.05). The size of each node is proportional to the node degree, and the node color changes across the full spectrum from red (the lowest degree) to purple (the highest degree).

Autism-related genes predicted by GAP while novel to the SFARI database
Figure S3: Histogram of the association-degrees of 11,215 genes predicted to be functionally associated with SFARI known autism genes. The association-degree of a predicted gene corresponds to the number of SFARI autism genes predicted to be functionally associated with it.

Figure S4: (a): Network of functional associations among SFARI known autism genes and novel autism genes predicted by GAP. Yellow nodes are SFARI genes, and black nodes are 114 novel autism genes. Edges correspond to predicted functional associations between novel and known genes (edges among SFARI autism genes are filtered out). The predicted novel genes are densely connected to autism genes, and form a highly interconnected network with several clique-like subgraphs, as highlighted in the graph. Approximate cliques are identified using NAViGaTOR's [33] plug-in to the NeAT toolbox [34].

The last three columns display, for each sub-graph, the minimum, the maximum, and the average values of the node degrees, respectively.

GAP's predicted functional interactome
GAP's high-confidence (p-value < 0.01; estimated by a phenotype-based permutation test) predicted "functional interactome" contains ≈1M functional associations among about 19K human genes. Of these, about 90% are novel (i.e., not listed in publicly available datasets of experimentally validated direct and indirect interactions). GAP's novel predictions connect previously disconnected components and singletons to the main body of the known interactome and are shown in Figure S5.