Revealing functionally coherent subsets using a spectral clustering and an information integration approach

Background: Contemporary high-throughput analyses often produce lengthy lists of genes or proteins. It is desirable to divide the genes into functionally coherent subsets for further investigation, by integrating heterogeneous information regarding the genes. Here we report a principled approach for managing and integrating multiple data sources within the framework of graph-spectrum analysis in order to identify coherent gene subsets.

Results: We investigated several approaches to integrate information derived from different sources that reflect distinct aspects of gene functional relationships, including functional annotations of genes in the form of the Gene Ontology, co-mentioning of genes in the literature, and shared transcription factor binding sites among genes. Given a list of genes, we constructed a graph containing the genes in each information space; the graphs were then kernel transformed so that they could be integrated; finally, functionally coherent subsets were identified using a spectral clustering algorithm. In a series of simulation experiments, known functionally coherent gene sets were mixed and recovered using our approach.

Conclusions: The results indicate that spectral clustering approaches are capable of recovering coherent gene modules even under noisy conditions, and that information integration serves to further enhance this capability. When applied to a real-world data set, our methods revealed biologically sensible modules and highlighted the importance of information integration. The implementation of the statistical model is provided under the GNU General Public License, as an installable Python module, at: http://code.google.com/p/spectralmix.

1 Supplemental Methods

Data integration and retrieval framework
To help maximize the reusability and flexibility of our data management system, we used the representational state transfer (REST) architectural style, which is becoming more widely used in biology [3]. An architectural style is, in general, defined by the configuration of its architectural elements: components, connectors, data, and the relationships among them. Under REST, communication is carried out using resources, which are identified by Uniform Resource Identifiers (URIs). The methods described in this work use a centralized database scheme that contains multiple information types, which may be queried through the use of resources. For the database we used PostgreSQL (http://www.postgresql.org), although the code is organized in such a way that it is essentially database agnostic. To deal with the business logic, or exchange of information with the database, we used the SQL toolkit and object relational mapper SQLAlchemy (http://www.sqlalchemy.org). The Pylons framework (http://pylonshq.com/) was used to implement a RESTful storage and retrieval system that follows the Model-View-Controller (MVC) paradigm. The majority of scripting was carried out using the Python (http://www.python.org) programming language, with a few additional scripts written in Perl (http://www.perl.org). A basic API was developed for the R statistical language [4] and for Python.
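As a rough illustration of what "resources identified by URIs" can look like from a client's point of view, the following sketch builds resource URIs with the Python standard library. The base URL, the resource name `genes`, the identifier, and the `taxon` query parameter are all hypothetical and are not taken from the system described above.

```python
from urllib.parse import urlencode, urljoin


def resource_uri(base_url, resource, identifier=None, **query):
    """Build a URI for a (hypothetical) REST resource.

    base_url   -- root of the service, e.g. "http://localhost:5000/api"
    resource   -- collection name, e.g. "genes" (illustrative only)
    identifier -- optional identifier of a single item in the collection
    query      -- optional query-string filters
    """
    path = resource if identifier is None else f"{resource}/{identifier}"
    uri = urljoin(base_url.rstrip("/") + "/", path)
    return f"{uri}?{urlencode(query)}" if query else uri


# Hypothetical examples: one item, and a filtered collection.
single_item = resource_uri("http://localhost:5000/api", "genes", "example_gene")
filtered = resource_uri("http://localhost:5000/api", "genes", taxon="4932")
```

In a layout like this, the same URI scheme can be consumed from R or Python clients alike, which is one reason a resource-oriented design helps keep the storage layer database agnostic.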
1.2 Genomic sequence: an information source of co-regulation among genes.
Functionally related genes tend to be co-regulated at the transcription level in order to interact or cooperate at the protein level; thus genes sharing genomic motifs that potentially function as transcription factor binding sites (TFBSs) are likely to be related functionally. A large number of search methods have been proposed for the discovery of TFBSs, yet entropy-based methods are the ones most commonly used [7]. The putative TFBSs were found by searching promoter regions with the position specific scoring matrices (PSSMs) available from the TRANSFAC [2] database, based on maximum entropy scores. The promoter regions were obtained from the NCBI contig builds available from ftp://ftp.ncbi.nih.gov/genomes/ and the regions were defined as -300bp to +1000bp relative to the transcription start site. For each PSSM, we scanned through all genes in the genome of interest and obtained a maximum entropy score (x) for each gene. These scores were compiled into an empirical cumulative distribution function (CDF) for each PSSM, in order to determine the p-value of a PSSM instance. The one-sided p-value, as determined from the empirical CDF, can be interpreted as the chance of observing a given entropy value, or a value with a better match, when considering all genes in the genome. The threshold for significance was set at p-value ≤ 0.05.
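The empirical p-value computation described above can be sketched as follows. This is a minimal illustration, assuming that a higher score indicates a better match (the direction is not stated explicitly in the text); the function name and the example scores are hypothetical.

```python
import numpy as np


def empirical_p_value(genome_scores, observed_score):
    """One-sided empirical p-value for one PSSM.

    Returns the fraction of genome-wide maximum scores that are at least as
    large as the observed score (assumption: larger score = better match).
    """
    scores = np.asarray(genome_scores, dtype=float)
    return float(np.mean(scores >= observed_score))


# Hypothetical genome-wide maximum scores for one PSSM (one value per gene).
genome_scores = list(range(1, 101))

p_strong = empirical_p_value(genome_scores, 96)  # 5 of 100 genes score >= 96
p_weak = empirical_p_value(genome_scores, 90)    # 11 of 100 genes score >= 90

# Apply the significance threshold used in the text.
strong_is_putative_tfbs = p_strong <= 0.05
weak_is_putative_tfbs = p_weak <= 0.05
```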

Evaluation of clustering results
Given the resulting labels from the algorithm and the original labels, one of the more popular methods of evaluation is the Rand index [5]. This index compares two clusterings by counting pairs of observations and measures the extent to which the clusterings agree or disagree. Let a clustering $C = \{C_1, \ldots, C_K\}$ be a partition of the data $X$. Let the original clustering be denoted by $C$ and another clustering for the same data $X$ be $C'$. Pairs of observations may be counted in the following ways:
$a$ = pairs that are in the same cluster under both $C$ and $C'$
$b$ = pairs that are in the same cluster under $C'$ but not under $C$
$c$ = pairs that are in the same cluster under $C$ but not under $C'$
$d$ = pairs that are in different clusters under both $C$ and $C'$
The Rand index was originally designed to assess partition accuracy of two classes, but here we are interested in larger values of $k$. Therefore, we adopted the statistics from the Rand index and used them to calculate modified recall and precision measures. The former can be defined as $\frac{a}{a+c}$ and the latter as $\frac{a}{a+b}$. A metric that combines both of these measures is the $F_1$ score,
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \qquad (1)$$

2 Supplemental Results
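As a worked illustration of the pair-counting measures defined under "Evaluation of clustering results", the sketch below counts pairs for two label vectors and derives precision, recall and the F1 score. It assumes that the first argument is the original clustering, that b counts pairs grouped together only by the other clustering, and that c counts pairs grouped together only by the original clustering; function names and example labels are hypothetical.

```python
from itertools import combinations


def pair_counts(original, other):
    """Count object pairs by agreement between two clusterings.

    a: same cluster in both; b: same cluster only in `other`;
    c: same cluster only in `original`; d: different clusters in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(original)), 2):
        same_original = original[i] == original[j]
        same_other = other[i] == other[j]
        if same_original and same_other:
            a += 1
        elif same_other:
            b += 1
        elif same_original:
            c += 1
        else:
            d += 1
    return a, b, c, d


def pair_precision_recall_f1(original, other):
    """Pair-based precision a/(a+b), recall a/(a+c), and their F1 score."""
    a, b, c, _ = pair_counts(original, other)
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: six genes, one of which is assigned to the wrong module.
original_labels = [0, 0, 0, 1, 1, 1]
predicted_labels = [0, 0, 1, 1, 1, 1]
counts = pair_counts(original_labels, predicted_labels)
precision, recall, f1 = pair_precision_recall_f1(original_labels, predicted_labels)
```

For this example the counts are a = 4, b = 3, c = 2 and d = 6, so precision is 4/7, recall is 2/3, and the F1 score is 8/13.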

DNA motifs data as an information source
The objective in these experiments was to evaluate DNA motifs as a potential source of information in the discovery of functional modules. Distances between genes were calculated as the maximum number of shared transcription factor binding sites minus the observed number of shared transcription factor binding sites. The motif data differ from the GO and PubMed sources in that they represent more of an experimental-type data source. Simulations were run as was done with the GO and PubMed data sources, this time including motifs as an information source, to see how it compares to the GO and PubMed data sources. The results can be summarized as follows. 1) Using each information source alone, the performance of spectral clustering improves in the following order: PubMed, DNA motifs, and GO. 2) Combining PubMed with DNA motifs, or GO with DNA motifs, does not lead to significant enhancements, although combining GO, PubMed and DNA motifs led to a marginal enhancement in comparison with the combination of GO and PubMed data. A possible explanation for the lack of enhancement from adding the DNA motif data, even though by itself the DNA motif information performs much better than PubMed, is that only a limited number of TFBSs are experimentally characterized compared to the number thought to exist. The trend may also be explained by the degenerate nature of TFBSs, which causes search methods to produce high numbers of false positives [1].
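The motif-based distance described above can be sketched as follows. This is an illustration under two assumptions that the text does not spell out: the "maximum number of shared transcription factor binding sites" is taken over all distinct gene pairs in the list, and the distance of a gene to itself is set to zero. Gene names and motif identifiers are hypothetical.

```python
import numpy as np


def motif_distance_matrix(gene_motifs):
    """Distance = (max shared TFBS count over gene pairs) - (shared TFBS count).

    gene_motifs maps each gene name to the set of putative TFBS identifiers
    found in its promoter. Requires at least two genes.
    """
    genes = sorted(gene_motifs)
    motifs = sorted(set().union(*gene_motifs.values()))
    column = {motif: k for k, motif in enumerate(motifs)}

    # Binary gene-by-motif membership matrix.
    membership = np.zeros((len(genes), len(motifs)), dtype=int)
    for row, gene in enumerate(genes):
        for motif in gene_motifs[gene]:
            membership[row, column[motif]] = 1

    shared = membership @ membership.T            # pairwise shared-site counts
    off_diagonal = ~np.eye(len(genes), dtype=bool)
    max_shared = shared[off_diagonal].max()       # assumption: max over pairs

    distances = max_shared - shared
    np.fill_diagonal(distances, 0)                # assumption: zero self-distance
    return genes, distances


# Hypothetical input: three genes with made-up motif identifiers.
example = {
    "geneA": {"M1", "M2", "M3"},
    "geneB": {"M2", "M3"},
    "geneC": {"M4"},
}
genes, distances = motif_distance_matrix(example)
```

Here geneA and geneB share the maximum of two sites and therefore have distance 0, while geneC shares no sites with either and has distance 2 to both.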

GO Graphs by Aspect
The distances between genes can be measured by any of the three Gene Ontology aspects individually or more than one in combination via kernel fusion. Here we show the differences between these aspects in terms of all pairwise distances in each of the three graphs.

Noise Filtering: An Example
A major objective of the information integration and clustering framework is to direct attention to genes that, based on the given data source(s), represent the most coherent subsets. To address this challenge, we used a MINT protein interaction complex (S. cerevisiae) that consisted of 23 genes, to which we added 27 genes randomly selected from the genome. The goal was to run the algorithm, without making an assumption for k, and check the results to see if we could recover significantly coherent subset(s) that make up the original 23 genes. Using combined GO and PubMed data as information sources, our algorithm broke the list of 50 genes into 4 clusters. Two of the clusters contained only a single gene, and could not be assessed for statistical significance. The remaining two modules contained 21 and 23 genes. After being subjected to coherence testing using the GOSteiner method [6], their respective p-values were 0.999 and < 0.0001. The sole statistically significant module contained 18/23 (i.e. 78%) of the genes from the original coherent gene set and five noise genes (false positives; 22%). In summary, our procedure identified 4/50 genes explicitly as noise, and 23/50 genes were determined to be noise by statistical significance testing using GOSteiner. Overall, this is an impressive result, indicating that the techniques in our procedure (spectral clustering, information integration, noise modeling, determination of cluster number, and finally functional coherence assessment) successfully revealed the majority of the truly coherent genes as a candidate subset.
To further evaluate the results, we inspected the affinity matrix representing the 50 genes. Elements of the matrix whose values are above the 50th percentile were included as edges connecting the genes. The connectivity of the genes in the significant cluster is shown in Figure 3. In addition to the 23 member genes of the cluster, we also included the 5 genes from the original coherent set that were not assigned to the significant module (yellow nodes), to illustrate how these genes were clustered in our process. The true positive genes (blue nodes) are all highly connected to each other with strong edges; there is connectivity between three true positives and a false positive gene (SPO14). From these observations, we see that the false positive genes are weakly connected to the rest of the true positive genes, which explains why they were not included as part of the noise group.
Figure 3: A functional module (MINT) uncovered from noise. The original gene list comprised 23 genes belonging to a protein complex and 27 noise genes. Spectral clustering was run on the gene list, and statistical significance was determined for each of the underlying modules. Shown here is the one statistically significant module, with the protein complex genes that fell into other modules added in yellow. Genes in the significant module that truly belonged to the complex are marked as true positives (blue). Genes in the significant module that were truly noise are shown in violet (false positives), and the genes missed by the module are shown in yellow (false negatives). The true negatives are removed from the graph for clarity. The edges with a weight (affinity) less than the median edge weight are also removed for ease of visualization.
The results indicate that spectral clustering is capable of accurately capturing the relatedness of functionally coherent genes, and our overall procedure is capable of dealing with noise inevitably found in biological data.
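The edge filtering used for the visualization (keeping only affinities above the 50th percentile) can be sketched as below. This is an illustration only: it assumes a symmetric affinity matrix and computes the median over the off-diagonal pairwise entries, which may differ from how the percentile was computed for Figure 3. Gene names and affinity values are hypothetical.

```python
import numpy as np


def edges_above_median(affinity, names):
    """Return (gene_i, gene_j, weight) for pairs whose affinity exceeds the median.

    Only the upper triangle of the symmetric matrix is inspected, so every
    undirected edge is reported once and self-affinities are ignored.
    """
    A = np.asarray(affinity, dtype=float)
    rows, cols = np.triu_indices_from(A, k=1)
    threshold = np.median(A[rows, cols])  # assumption: median of pairwise affinities
    return [
        (names[i], names[j], float(A[i, j]))
        for i, j in zip(rows, cols)
        if A[i, j] > threshold
    ]


# Hypothetical 4-gene affinity matrix: g1, g2 and g3 form a tight group.
names = ["g1", "g2", "g3", "g4"]
affinity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
edges = edges_above_median(affinity, names)
```

With these values the median pairwise affinity is 0.5, so only the three edges among g1, g2 and g3 are kept and g4 appears as an isolated node.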

Estimates of kernel bandwidth parameter
Affinities were calculated using the Gaussian kernel function, where $\|d_{ij}\|$ is a measure of distance between objects $i$ and $j$ and $\sigma$ is the bandwidth parameter. The estimates, by species and information source, are reported in Table 3.1.
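A minimal sketch of turning a distance matrix into Gaussian-kernel affinities is given below. It assumes the commonly used form $A_{ij} = \exp(-\|d_{ij}\|^2 / (2\sigma^2))$; the exact normalization used in this work (for example $2\sigma^2$ versus $\sigma^2$ in the denominator) is an assumption here, and the function name and example values are hypothetical.

```python
import numpy as np


def gaussian_affinity(distances, sigma):
    """Map pairwise distances to affinities in (0, 1].

    Assumed form: A_ij = exp(-d_ij**2 / (2 * sigma**2)).
    A larger bandwidth sigma makes distant objects look more similar.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    D = np.asarray(distances, dtype=float)
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2))


# Hypothetical distance matrix for three objects.
D = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
A_narrow = gaussian_affinity(D, sigma=1.0)
A_wide = gaussian_affinity(D, sigma=2.0)
```

Comparing `A_narrow` and `A_wide` shows why the bandwidth needs to be estimated per species and information source: the same distances yield quite different affinity graphs, and hence different spectra, depending on $\sigma$.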