Revealing functionally coherent subsets using a spectral clustering and an information integration approach
© Richards et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Contemporary high-throughput analyses often produce lengthy lists of genes or proteins. It is desirable to divide the genes into functionally coherent subsets for further investigation, by integrating heterogeneous information regarding the genes. Here we report a principled approach for managing and integrating multiple data sources within the framework of graph-spectrum analysis in order to identify coherent gene subsets.
We investigated several approaches to integrate information derived from different sources that reflect distinct aspects of gene functional relationships, including functional annotations of genes in the form of the Gene Ontology, co-mentioning of genes in the literature, and shared transcription factor binding sites among genes. Given a list of genes, we constructed a graph containing the genes in each information space; the graphs were then kernel transformed so that they could be integrated; finally, functionally coherent subsets were identified using a spectral clustering algorithm. In a series of simulation experiments, known functionally coherent gene sets were mixed and recovered using our approach.
The results indicate that spectral clustering approaches are capable of recovering coherent gene modules even under noisy conditions, and that information integration serves to further enhance this capability. When applied to a real-world data set, our methods revealed biologically sensible modules, and highlighted the importance of information integration. The implementation of the statistical model is provided under the GNU general public license, as an installable Python module, at: http://code.google.com/p/spectralmix.
In biomedical sciences, experimental results often come in the form of one or more gene sets, and biologists are commonly tasked with interpreting these lists, a task that can easily become overwhelming given the amount of data and the number of data sources currently available. Frequently, gene products carry out their function by working closely with the products of other genes, which motivates the study of genes as a set rather than as individual units. We refer to these multi-gene units, when they carry out one or more related biological processes, as 'functional modules'. There are a number of rationales for studying genes from a modular perspective [1–3]. Modules of genes may be interesting because of physical interactions, a common subcellular location, or because they are meaningful players in a system of interconnected biological processes. Whichever the case, it is of significant interest to be able to home in on interesting subsets of genes that perform coherent functions, particularly by making use of multiple types of information sources.
Currently, a common approach to discovering functional modules from a gene list is via the use of enrichment-based methods [8–10], which determine if constituents of a predefined collection of gene sets are observed more frequently than expected in the list. Often, these predefined reference gene sets reflect a single information source; for example, gene sets are commonly grouped according to annotations based on the Gene Ontology (GO) to narrow the search based on one or more known functions. The requirement of predefined gene sets subjects the methods to limits imposed by those who construct the gene sets, thus reducing the chances of finding de novo coherent subsets. In situations where the nature of the interesting subsets is unknown, data-driven methods are more suitable than methods based on predefined reference sets. Additionally, because evidence for gene-gene relationships within a module may occur in different forms, a caveat of most existing methods is that they do not consider the connections across distinct biological aspects, and thus would fail to identify diverse types of functional modules.
Experimental methods and thus their resulting data come in many diverse forms, and in light of this it remains challenging to assess the functional coherence of a group of genes by considering multiple biological aspects. As an example, consider the following hypothetical scenario: from protein-protein interaction data, we find that protein A physically interacts with protein B, and from a signal transduction database one learns that protein B is a kinase that phosphorylates protein C. The challenge is to find out that proteins A, B, and C are functionally related in an automated way. Here, we describe a novel approach for revealing functionally coherent subsets ab initio from an arbitrary gene list by assimilating information from multiple data sources.
There are two main challenges in combining heterogeneous information to identify functionally coherent subsets from a gene list. First, storing and accessing multiple information sources can be challenging for organisms of modest to large genome size; to address this, we implemented a web server to handle storage and facilitate access (see Supplemental Methods in the Additional File 1). Second, how to encode diverse information regarding genes in a fashion that enables identification of functional modules remains an active research area. One notable method that uses a Bayesian approach to integrate heterogeneous data sources was devised for the purpose of function prediction. Identifying functionally coherent subgroups is a related but distinct problem.
In this section, we detail the results of a number of simulation experiments as well as an example application. Using a simulation approach, we examined the algorithm's ability to retrieve gene subsets, and specifically, we studied how the addition of new data sources impacts performance. Then, we tested the usefulness of our approach in recovering coherent gene sets from 'noisy' gene lists, as are often encountered in high-throughput experiments. We then show the results of applying our method to a real-world data set.
Given a gene list, our task is to identify functionally coherent gene subsets. Here, we used simulation experiments to evaluate the efficacy of spectral clustering for this task. In each simulation experiment, between 3 and 8 functionally coherent gene subsets were randomly mixed, and our method was then used to recover the original subset partitions. We used pathways from the KEGG database and protein complexes from the MINT database as 'known' functionally coherent modules, in that the proteins in these modules either perform related functions or form physical modules.
The figure shows that the spectral clustering algorithm significantly outperforms random cluster assignments, and the difference becomes more obvious as k increases. Overall, the trend for spectral clustering is that of decreasing efficacy with increasing k, and in the case of the GO precision and recall, both are similarly affected. This decreasing trend is likely due to the fact that in general, clustering tasks become more difficult as more gene sets are mixed. An additional reason for the declining performance might be the fact that many metabolic and signal transduction pathways, as well as molecular complexes, are comprised of a mixture of functional modules, or coherent gene sets, and thus assigning modules to an appropriate pathway is not a straightforward task, especially as the number of modules and potential pathways increases.
The observed improvements in data partitioning due to information integration are highly encouraging. The results indicate that different information sources do contain distinct yet complementary information, and that efficient information integration techniques can exploit this complementarity to achieve better gene set recovery. The kernel fusion and transformation step in spectral clustering (see equations 3, 4, and 5) provided a principled way of integrating information, in that the sum of two kernel functions does not require exceptions and heuristics.
Our results show that spectral clustering performs better using the GO as an information source than using gene co-mentioning data (PubMed). One possible explanation is that information from the GO database is 'richer' than that of gene co-mentioning. It is easier to establish the functional relatedness of a pair of genes because many genes are annotated in the GO database, and our approach of revealing functional relationships using the graphical representation of the GO can readily assess the relatedness between a pair of genes, even though they may be annotated with different GO terms. We believe the strength of our approach lies in the fact that it captures the functional relationship between genes by taking into account both the structure of the GO and the strength of the relationship using semantic distance. This observation may lead to other possible approaches to representing the functional relationship between genes; for example, one could use rigorous topic modeling of the literature associated with a gene in order to capture the functional relationships between genes [16, 17]. On the other hand, the gene co-mentioning data matrix is fairly sparse, and not all information is directly relevant; thus, as an information source alone, gene co-mentioning does not perform well. Finally, a key observation from this experiment is that, although an information source may not be rich in information, it may be valuable if it is complementary to other information sources.
To further evaluate the results, we plotted the individual simulations as graphs, and inspected the calls made by the algorithm (see Supplemental Results in the Additional File 1). From these observations, we see that, in general, the false positive genes are weakly connected to the rest of the true positive genes, which explains why they were not included as part of the noise group. Also, the true positive genes are generally highly connected, in terms of edge weights, which indicates that spectral clustering is capable of accurately capturing the relatedness of functionally coherent genes, and our overall procedure is capable of dealing with noise inevitably found in biological data.
Overall, the methods presented in this paper allow for efficient gene subset searching in both simulated and real-world data. Our approach should be of interest to a spectrum of biologists: it can be used to sift through large amounts of experimental data, and will help the experimentalist to identify specific genes or biological functions of interest. A method that effectively partitions mixtures of genes into functional modules is highly desirable in contemporary high-throughput biology, particularly in microarray studies. Our results show the value of spectral clustering, and particularly of information integration, in this setting. This research also prompts new research avenues, including the discovery of additional informative data sources and the adaptation of these techniques to other problems such as the prediction of gene function.
The GO defines the relationships between annotation terms in a hierarchical way, using expert knowledge. Annotation and ontology definition files used in this study were downloaded from: http://www.geneontology.org/GO.downloads.database.shtml (03.16.2011). Given the ontology structure and annotation information, a variety of methods and information sources have been proposed to quantitatively describe the relationships between terms [23, 24], often referred to as semantic distances. The distances were used to construct a weighted graph of all terms provided by the GO. The GO graph was then used to quantify the distance between any two genes. Edges were not drawn when the following evidence codes were used: Inferred from Electronic Annotation (IEA), Inferred from Sequence or Structural Similarity (ISS), Inferred from Sequence Orthology (ISO), Inferred from Sequence Alignment (ISA), Inferred from Sequence Model (ISM), Inferred from Genomic Context (IGC), and Inferred from Reviewed Computational Analysis (RCA).
The goal of constructing a weighted graph representing the structure and semantic relationships of the GO is to use this data structure to determine the functional relatedness of genes, because a pairwise distance matrix among the genes is needed for spectral clustering. For each gene pair, all GO terms that were used to annotate the two genes were considered, and the functional distance between the genes was determined as the length of the shortest path between the genes in the GO graph, using a bidirectional version of Dijkstra's algorithm as implemented in NetworkX. Using the information integration techniques discussed below, the three aspects of the GO (biological process, molecular function, and cellular component) were combined. The distances between genes tend to be smaller for cellular component than for the other aspects (see Supplemental Results in the Additional File 1). However, by using all three aspects simultaneously, a single very small distance will have less of an effect on the overall gene-gene distance than three reasonably small distances.
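The shortest-path step can be illustrated with a minimal sketch. The term names, edge weights, annotations, and the helper `gene_distance` below are all hypothetical, and taking the minimum over every pair of annotating terms is an assumption about how "all GO terms" are combined; only `networkx.bidirectional_dijkstra` is the library routine named in the text.

```python
import itertools
import networkx as nx

# Toy weighted term graph; node names and semantic distances are made up.
go_graph = nx.Graph()
go_graph.add_weighted_edges_from([
    ("GO:A", "GO:B", 1.0),
    ("GO:B", "GO:C", 2.0),
    ("GO:A", "GO:C", 4.0),
    ("GO:C", "GO:D", 1.5),
])

# Hypothetical gene-to-term annotations.
annotations = {
    "gene1": ["GO:A"],
    "gene2": ["GO:C", "GO:D"],
    "gene3": ["GO:D"],
}

def gene_distance(graph, annotations, gene_a, gene_b):
    """Smallest shortest-path length over all term pairs annotating the two genes.

    Assumption: the gene-gene distance is the minimum over term pairs.
    """
    best = float("inf")
    for term_a, term_b in itertools.product(annotations[gene_a], annotations[gene_b]):
        try:
            length, _path = nx.bidirectional_dijkstra(
                graph, term_a, term_b, weight="weight"
            )
        except nx.NetworkXNoPath:
            continue  # terms in disconnected parts of the graph
        best = min(best, length)
    return best

d12 = gene_distance(go_graph, annotations, "gene1", "gene2")
d13 = gene_distance(go_graph, annotations, "gene1", "gene3")
d23 = gene_distance(go_graph, annotations, "gene2", "gene3")
print(d12, d13, d23)
```

Genes that share an annotating term receive a distance of zero under this rule, which is one reason a different aggregation (for example, an average over term pairs) might be preferred in practice.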
When a pair of genes is co-mentioned in the biomedical literature, the genes are often related to each other in some way: they may participate in the same biological processes co-operatively or, alternatively, they may counteract each other. The reasons for co-mentioning are many; nonetheless, a biomedical document seldom mentions genes that are totally irrelevant to one another, although certain exceptions exist. To populate a pairwise distance matrix of all genes using co-mentioning data, a current file containing a mapping between genes and biomedical literature was downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz). Using these data, the distance between two genes was calculated as the maximum number of shared publications minus the observed number of shared publications.
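As a small illustration of this distance, the sketch below uses a hypothetical gene-to-publication mapping in the spirit of gene2pubmed. It assumes that the "maximum number of shared publications" is taken over all gene pairs in the list under study; the function name `comention_distances` and the identifiers are invented for the example.

```python
from itertools import combinations

# Hypothetical gene -> set of PubMed IDs mapping (illustrative values only).
gene_to_pubs = {
    "geneA": {101, 102, 103},
    "geneB": {102, 103, 104},
    "geneC": {103},
    "geneD": {999},
}

def comention_distances(gene_to_pubs):
    """Distance = (max shared publications over all pairs) - (shared publications)."""
    genes = sorted(gene_to_pubs)
    shared = {
        (g, h): len(gene_to_pubs[g] & gene_to_pubs[h])
        for g, h in combinations(genes, 2)
    }
    # Assumption: the maximum is computed within the gene list itself.
    max_shared = max(shared.values())
    return {pair: max_shared - count for pair, count in shared.items()}

distances = comention_distances(gene_to_pubs)
for pair, dist in sorted(distances.items()):
    print(pair, dist)
```

The most frequently co-mentioned pair gets a distance of zero, and pairs that are never co-mentioned get the maximum distance.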
Spectral clustering aims to divide a set of data points into highly related subsets. Unlike conventional clustering methods such as K-means clustering, spectral clustering groups data points based on their 'relatedness' rather than their geometric closeness. As a result, a set of data points can be partitioned into a cluster based on a chain of strong pairwise connections even though the points are geometrically remote. Thus, the method is particularly well-suited for capturing the relationships between gene subsets by taking into account their relatedness across different biological aspects, for example, gene products that are linearly connected in a metabolic pathway.
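In the formulation of Ng et al., pairwise distances are converted to affinities with a Gaussian kernel. Assuming that form applies here (the symbol A_ij for the affinity and the factor of 2 in the denominator are assumptions), the transformation can be written as:

```latex
A_{ij} = \exp\!\left( -\frac{\lVert d_{ij} \rVert^{2}}{2\sigma^{2}} \right), \qquad i \neq j
```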
where ||d_ij|| is a measure of distance between objects i and j and σ is the bandwidth parameter. In related works, σ was selected automatically by minimizing a quantity referred to as distortion [12, 31], an objective function that assesses the quality of the clustering. Empirically, we have found that searching for σ based on distortion tends to increase recall at the expense of precision; therefore, we opted to search for an optimal σ by maximizing the mean silhouette value instead. A silhouette value measures how similar the data points in a cluster are to each other, relative to the points outside the cluster, and thus reflects the coherence of a cluster. The values of σ used in this study are reported in Supplemental Table 1 in the Additional File 1. The parameters were estimated by mixing known functionally coherent groups, scanning intervals of possible values, calculating precision and recall, and finally visually inspecting both the affinity values and the plotted results.
Treating each row in Y as a point, the points are then clustered into k subsets using the K-means clustering algorithm. Finally, the original point g_i is assigned to cluster j if row i of the matrix Y was assigned to cluster j. We note that L is not actually the Laplacian (I - L) as traditionally thought of in graph theory, though we keep with the terminology of Ng et al. To carry out the clustering and related tasks, an installable Python package was developed and made publicly available through a Mercurial repository: http://code.google.com/p/spectralmix.
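The overall procedure can be sketched in a few lines of NumPy. This is an illustrative reading of the steps as laid out by Ng et al., not the code of the spectralmix package: the Gaussian kernel form, the function names `kmeans` and `spectral_clustering`, the toy distance matrix, and the value of σ are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(distances, k, sigma):
    # Gaussian kernel affinity with a zero diagonal (assumed form).
    A = np.exp(-distances ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # L = D^{-1/2} A D^{-1/2}, the matrix called "L" by Ng et al.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k largest.
    _eigvals, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, -k:]
    # Normalise each row of X to unit length to obtain Y.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Cluster the rows of Y; row i's cluster is the cluster of gene i.
    return kmeans(Y, k)

# Toy distance matrix: two groups of three genes, close within, far between.
D = np.full((6, 6), 10.0)
D[:3, :3] = 1.0
D[3:, 3:] = 1.0
np.fill_diagonal(D, 0.0)

labels = spectral_clustering(D, k=2, sigma=2.0)
print(labels)
```

A hand-written k-means keeps the sketch dependency-free; any standard K-means implementation could be substituted for it.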
One of the major goals, given a gene set of interest, is to integrate the information from distinct information sources, such that one can take advantage of complementary information to reveal connections among genes that would be missed when an individual information source is used. Within the framework of spectral clustering, information integration can be performed at different stages: 1) create a pairwise distance matrix by combining all information sources, or 2) after kernel transformation, combine the similarity matrices derived from different information sources in the kernel space. Integration at the distance stage is inherently difficult because of differences in location, scale, and distribution types among the sources. The second approach, also referred to as kernel fusion, is not only a logical way to integrate data but has also been shown to be effective. A major advantage of this approach is that it is principled; that is, the same approach is taken each time, which avoids having to adapt the technique for each newly encountered information source. In this study, we performed kernel transformations of the distance matrices from each information source into corresponding affinity matrices, each scaled using an information-source-specific σ. The affinity matrices were then summed element-wise to produce a unified affinity matrix.
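A minimal sketch of the fusion step follows. The Gaussian kernel form, the helper names `gaussian_affinity` and `fuse_affinities`, the toy distance matrices, and the σ values are all assumptions chosen for illustration; the values of σ actually used are reported in the supplemental table.

```python
import numpy as np

def gaussian_affinity(distances, sigma):
    """Kernel-transform a distance matrix into an affinity matrix (assumed Gaussian form)."""
    A = np.exp(-np.asarray(distances, dtype=float) ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A

def fuse_affinities(distance_matrices, sigmas):
    """Element-wise sum of per-source affinity matrices, each with its own sigma."""
    names = sorted(distance_matrices)
    return sum(gaussian_affinity(distance_matrices[n], sigmas[n]) for n in names)

# Two hypothetical sources on very different scales for the same three genes.
distance_matrices = {
    "go": np.array([[0.0, 1.0, 4.0],
                    [1.0, 0.0, 3.0],
                    [4.0, 3.0, 0.0]]),
    "pubs": np.array([[0.0, 20.0, 5.0],
                      [20.0, 0.0, 0.0],
                      [5.0, 0.0, 0.0]]),
}
sigmas = {"go": 1.0, "pubs": 15.0}  # made-up bandwidths

fused = fuse_affinities(distance_matrices, sigmas)
print(fused)
```

Because each source is mapped into the interval (0, 1] by its own kernel before summation, differences in location and scale between the raw distances do not need to be reconciled by hand.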
Partitioning a set of genes into groups does not guarantee that the resulting clusters represent coherent gene subsets; three challenges must be overcome. The first is noise: experimental results commonly contain noise, and as a result any method that clusters genes must be shown to be reasonably resilient to random noise. The second is determining the optimal number of subsets, and the third is assessing whether a subset is functionally coherent or not. To judge the quality of a given clustering, the average silhouette value for a cluster may be used, where values ≤ 0 are considered poorly clustered. This heuristic is useful for filtering or ranking; however, it does not tell us much about method performance in the face of noise. In order to determine the extent to which noise plays a role, a more rigorous set of simulations was run, in which known positive control data sets were combined with varying quantities of inserted noise.
In order to determine a suitable number of clusters into which to partition the data, another modification was made. Given data X that have been kernel transformed and cast into eigen-decomposition space as Y, we consider the first two eigenvectors. Originally, Y may be partitioned in this space using K-means or another clustering algorithm; however, we may repartition the data by scanning over a range of k (3-8) and settling on the value that maximizes the average silhouette index. Because k is normally unknown, the search for an optimal number of clusters is a necessary component of the algorithm.
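The selection rule described above can be sketched as follows. The functions `mean_silhouette` and `choose_k`, the toy points (stand-ins for rows of Y), and the candidate labelings are hypothetical; in practice the labelings would come from running the clustering step once per candidate k, and singleton clusters are scored as zero by convention.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Average silhouette value over all points, using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cluster_ids = set(labels.tolist())
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # Smallest mean distance to any other cluster.
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def choose_k(points, candidate_labelings):
    """Return the k whose labeling maximises the mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, lab) for k, lab in candidate_labelings.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy embedded points forming three tight pairs.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 10), (1, 10)]
candidate_labelings = {
    2: [0, 0, 1, 1, 0, 0],  # merges two of the true groups
    3: [0, 0, 1, 1, 2, 2],  # matches the true groups
    4: [0, 0, 1, 1, 2, 3],  # splits one true group into singletons
}

best_k, scores = choose_k(points, candidate_labelings)
print(best_k, scores)
```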
The resulting k subsets were then analyzed for functional coherence. The method used, also called GOSteiner, is based solely on the Gene Ontology and uses a graph-theoretic approach to determine the statistical significance, in terms of functional coherence, of an arbitrary gene set. Given the current state of functional annotation completeness for the GO, it is expected that some functionally interesting clusters will be missed; however, the number of false positives is expected to be very low with GOSteiner.
The simulation experiments provide a controlled environment in which comparisons can be made over a variety of experimental conditions, including the impact of different approaches for populating and calculating distance matrices, the effect of combining different data types, and the impact of noise on clustering. In this study, a varying number (k) of functionally coherent gene sets were randomly selected and mixed to form a single gene set. The functionally coherent sets come from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and the Molecular INTeraction (MINT) database. The clustering algorithm was then applied to partition the genes. For each iteration of the simulation, the clustering assignments for the newly created gene set must be evaluated. We used an evaluation method that counts pairs of similarly and non-similarly labeled genes in the same way the Rand index is calculated. This evaluation method allows the calculation of both precision and recall and a summarizing F1 score (see Supplemental Methods in the Additional File 1). It is important to note that precision and recall in this setting differ from their traditional use in information retrieval, because we are considering pairwise relationships instead of the genes themselves. Simulations were run 20 times for each k in order to carry out performance comparisons under different conditions; for example, to compare the use of different information sources. Once the simulations were run, the data were grouped by simulation condition and performance metric. To compare these groups statistically, an ANOVA would normally be used. However, ANOVA with repeated measures could not be used to compare the groups or blocks (e.g. GO, Pubs, GO-Pubs), because the assumptions of equal variance and normality were violated.
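A pair-counting evaluation of this kind can be sketched as below. The exact definitions used in the study are given in the Supplemental Methods; this example assumes the standard pair-counting form, in which a true positive is a pair of genes placed together in both the known and the predicted partition. The function name `pairwise_scores` and the labels are illustrative.

```python
from itertools import combinations

def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 between two partitions of the same genes."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair correctly placed together
        elif same_pred:
            fp += 1  # pair wrongly placed together
        elif same_true:
            fn += 1  # pair wrongly separated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two known modules of three genes; the prediction misplaces one gene.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

precision, recall, f1 = pairwise_scores(true_labels, pred_labels)
print(precision, recall, f1)
```

Because the counts are over gene pairs, a single misplaced gene affects several pairs at once, which is why these scores are not directly comparable to per-gene precision and recall.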
To check the model assumptions, the Shapiro-Wilk test for normality and Bartlett's test for homogeneity of variances were used. The non-parametric alternative, Friedman's method for randomized blocks, was used first to determine whether there was a difference among the groups; in the cases where the null hypothesis of no difference was rejected, a post hoc analysis was subsequently performed. All tests were carried out using the statistical language R, and an implementation of the post hoc test was written in R based on the coin package.
To illustrate the utility of our proposed method, we applied the algorithm to time-series microarray data. The K-means clustering algorithm was used, and all clusters that contained one or more genes annotated with a GO term pertaining to 'mitochondria' were used to create a gene set of interest from the original probes. In all, the gene set of interest contained 458 genes. Next, spectral clustering was run on the gene set using GO, publications, gene expression, and all possible combinations of the individual sources. For the gene expression data, the correlation coefficient was used as a distance metric. The purpose of the experiment was to determine whether our approach is capable of identifying coherent subsets among these genes using different combinations of information sources as a means to reveal new biological information. In addition, we were interested in the relative performance of each information source, so to ensure an unbiased comparison k was set to 10 for each information source used. The performance of the information sources is compared using a weighted mean of the p-values (see Figure 5). After each partitioning of the genes into putative modules, each cluster was assessed for functional coherence using the GOSteiner method.
We thank Matt Shotwell and Joshua Swearingen for valuable discussion and Gaëlle Blanvillain for proofreading. This research was partially supported by National Institutes of Health (NIH) grants R01LM009153, R01LM010144, T15LM07438 and EY13520.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.