Distance measure evaluation
In the first part of our analysis, we evaluated four distance measures that are frequently applied to detect co-expression between genes: the Pearson correlation [10], the Spearman correlation [11], the Euclidean distance [12], and the mutual information [13]. For each of the three species S. cerevisiae, D. melanogaster and C. elegans separately (see Methods section for details), we calculated the co-expression values between all pairs of genes, using one distance measure at a time, and ranked the pairs according to their distances. To evaluate the accuracy of the co-expression predictions, we calculated a functional similarity measure that describes how well the two genes in a pair are associated according to biological expert knowledge. For this, we followed a suggestion by Lord et al. [14] that has previously been used for gene co-expression network analysis [15] and employed the directed acyclic graph (DAG) structure of the Gene Ontology (GO) annotation system [16] (version date: March 24th, 2006). In the GO system, a gene can be annotated with more than one functional attribute. For each of the two genes in a pair, we extracted a subtree of biological process annotation attributes (nodes). As the measure of functional similarity between the two genes, we then calculated the ratio of the number of nodes found in both subtrees (intersection) to the number of nodes found in either subtree (union), which defines a similarity measure ranging from 0 for unrelated genes to 1 for genes with identical annotations. The KEGG PATHWAY maps [17] (release 35) were considered in parallel to provide a second source of biological expert knowledge that is independent of the GO annotations. For this, we followed a previously published approach [7] in which two genes are said to have similar function if they occur on the same PATHWAY map.
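The GO-based similarity described above can be sketched as an intersection-over-union of annotation subtrees. This is a minimal illustration, not the code used in the study: the toy DAG `TOY_PARENTS`, the function names, and the choice to include each annotated term together with all of its ancestors (including the root) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy DAG: each term maps to its direct parents.
TOY_PARENTS: Dict[str, List[str]] = {
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}


def ancestor_closure(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors in the DAG."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


def go_similarity(terms_a: Iterable[str], terms_b: Iterable[str],
                  parents: Dict[str, List[str]]) -> float:
    """Ratio of shared nodes (intersection) to all nodes (union) of two subtrees."""
    sub_a = ancestor_closure(terms_a, parents)
    sub_b = ancestor_closure(terms_b, parents)
    union = sub_a | sub_b
    if not union:
        return 0.0  # neither gene is annotated
    return len(sub_a & sub_b) / len(union)


# Two genes annotated to sibling terms share "A" and "root" out of four nodes.
print(go_similarity(["A1"], ["A2"], TOY_PARENTS))  # 0.5
```

Note that if a shared root node is counted, as in this sketch, two annotated genes never reach a similarity of exactly 0; excluding the root from the subtrees would restore the lower bound of 0 for unrelated genes.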
We used the functional similarities as defined by GO or KEGG to evaluate each list of gene pairs ranked by co-expression (based on 1771 genes for S. cerevisiae and 2065 genes for D. melanogaster and C. elegans, see also Figure 2 and below). For each number of best co-expressed gene pairs (i.e. for the 1 to ~8000 gene pairs with the smallest distances), we calculated the average functional similarity of our predictions (which is the same as the accuracy or the positive predictive value). The list of genes that are known not to interact is small and far from complete, so it is very difficult to evaluate the sensitivity of our predictions. We therefore preferred to express the results in terms of accuracies rather than as true and false positives in receiver operating characteristic (ROC) curves. Assuming a relation between functional coupling and co-expression, we expected to find the highest accuracy for the best co-expressed gene pairs; incorporating less strongly co-expressed pairs should then decrease the accuracy. With this cumulative way of evaluation, we took the perspective of somebody asking a question such as: "How well are the 1000 best co-expressed gene associations supported by functional annotations?" or "What is a reasonable number of gene associations I should include to find functionally similar pairs of genes to a certain level of accuracy?"
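The cumulative evaluation can be illustrated with a short sketch: gene pairs are sorted by increasing distance, and the accuracy for the k best pairs is the mean functional similarity of those k pairs. The function name and the dictionary-based inputs are assumptions made for this example.

```python
from itertools import accumulate
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def cumulative_accuracy(distances: Dict[Pair, float],
                        similarities: Dict[Pair, float]) -> List[float]:
    """Mean functional similarity of the k best co-expressed pairs, for k = 1..N."""
    # Smallest distance first, i.e. best co-expressed pair first.
    ranked = sorted(distances, key=distances.get)
    running_totals = accumulate(similarities[pair] for pair in ranked)
    return [total / k for k, total in enumerate(running_totals, start=1)]


# Hypothetical toy data: three gene pairs with distances and GO similarities.
toy_distances = {("a", "b"): 0.1, ("a", "c"): 0.5, ("b", "c"): 0.9}
toy_similarities = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "c"): 0.0}
print(cumulative_accuracy(toy_distances, toy_similarities))  # [1.0, 0.75, 0.5]
```

With KEGG annotations, the same function applies if the similarity of a pair is set to 1 when both genes occur on the same PATHWAY map and to 0 otherwise.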
First, we observed for all species and distance measures that the accuracies obtained from GO annotations were generally higher than the accuracies obtained from KEGG maps, while the shapes of all graphs were similar: the accuracies decreased continuously as we included more and more of the lower ranked gene pairs (Figure 3). The S. cerevisiae accuracies decreased less dramatically than the accuracies for the two other species, indicating a strong contribution of the lower ranked gene pairs to the observed average accuracy. In terms of overall accuracy, S. cerevisiae performed best, followed by D. melanogaster, while C. elegans performed worst. This order coincided with the overall annotation level (background similarity) of the three species, i.e. the background accuracy was also lowest for C. elegans (see also the legends in Figure 3). Other factors, such as the experimental conditions chosen to generate the expression datasets and the overall quality of the datasets, might also play a role.
A functional GO analysis [18] of the genes involved in the top 8000 interactions in S. cerevisiae showed several highly significant GO terms related to the ribosome, accounting for 317 of the 1131 genes (28%) in the top 8000 interactions. For the top 8000 D. melanogaster interactions, we found an overrepresentation of developmental GO terms, which fits well with the experimental conditions of this dataset. For the C. elegans dataset, the only two significant GO terms referred to "organelle part".
Only minor performance differences were found between the distance measures, and different datasets and species favored different measures. The Euclidean distance and the mutual information were each the best method in some situations and the worst in others. The most robust method appeared to be the Spearman correlation, as it was often the best and never the worst method. We noticed that the Euclidean distance has to be handled with care: when we tested the influence of different data normalization schemes (see below), the Euclidean distance performed poorly when the datasets were not z-normalized (data not shown).
Conserved co-expression across species
After evaluating a set of commonly applied distance measures for co-expression detection in three species separately, we asked: "How much gain in accuracy for functional gene interactions do we see when the analysis is restricted to interactions that are conserved between two or even three evolutionarily distant species?" For this, we employed the principle of orthology: two genes are orthologous to each other when they arose from a speciation event [19]. A common assumption in this context, even though it is not part of the definition of orthology, is that two orthologs retain their function partly or even completely (see also Methods section for details). The evolutionary distance between two orthologous genes might play a role insofar as orthologs from closely related species are generally thought to retain a higher level of functional conservation than orthologs from distant species. We incorporated orthology into our analysis by 'synchronizing' the expression datasets between two species, so that each gene in one dataset has exactly one corresponding orthologous gene in the other dataset. The number of genes used for the subsequent analysis was thereby reduced compared to the distance measure evaluation performed before (see Figure 2 for the number of genes), because genes with unclear orthology relationships were removed from the analysis. Using the datasets synchronized between two or three species, gene associations in different species can be linked to each other. We then combined the distances from two or three species by averaging them between the pairs of orthologous genes (the geometric mean gave better results than the arithmetic mean). The Spearman distance measure was selected for this analysis, as it gave the best overall performance across the datasets (Figure 3). One prerequisite was to normalize the distances from the different species to a common range between zero and one (see Methods section for details).
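The combination step can be sketched as follows. The sketch assumes that gene pairs in every species have already been mapped to shared ortholog identifiers, and it uses min-max scaling as one possible way to bring the distances into the range between zero and one; the actual normalization is described in the Methods section and may differ. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def min_max_normalize(distances: Dict[Pair, float]) -> Dict[Pair, float]:
    """Rescale distances linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(distances.values()), max(distances.values())
    if hi == lo:
        return {pair: 0.0 for pair in distances}
    return {pair: (d - lo) / (hi - lo) for pair, d in distances.items()}


def combine_conserved(per_species: List[Dict[Pair, float]]) -> Dict[Pair, float]:
    """Geometric mean of normalized distances over pairs present in all species."""
    if not per_species:
        return {}
    # Keep only ortholog pairs for which every species provides a distance.
    shared = set.intersection(*(set(d) for d in per_species))
    if not shared:
        return {}
    normalized = [min_max_normalize({p: d[p] for p in shared}) for d in per_species]
    n = len(normalized)
    return {p: math.prod(nd[p] for nd in normalized) ** (1.0 / n) for p in shared}


# Hypothetical distances for three ortholog pairs in two species.
species_1 = {("o1", "o2"): 0.0, ("o1", "o3"): 0.25, ("o2", "o3"): 1.0}
species_2 = {("o1", "o2"): 0.0, ("o1", "o3"): 1.0, ("o2", "o3"): 0.25}
combined = combine_conserved([species_1, species_2])
ranked = sorted(combined, key=combined.get)  # best conserved pair first
```

One property of the geometric mean is visible here: a pair whose normalized distance is zero in any one species receives a combined distance of zero regardless of the other species, whereas the arithmetic mean would not behave this way.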
For each expression dataset, the impact of incorporating orthologous expression data was evaluated. Similar to the analysis shown in Figure 3, co-expressed gene pairs for each dataset were ranked according to distances calculated from one, two, or three species and each ranked list was evaluated using GO and KEGG functional annotation (Figure 4).
For the S. cerevisiae dataset, we found that incorporating co-expression conservation with the C. elegans dataset increased the accuracy, and that conservation with the D. melanogaster dataset resulted in an even larger increase (Figure 4A and 4D). The joint conservation of S. cerevisiae with both other species at the same time increased the accuracy further, giving a consistent picture for both GO and KEGG functional annotations.
For C. elegans, the simultaneous conservation with both other species also outperformed the accuracies obtained when considering only one of the two other species (Figure 4C and 4F). The GO annotation system slightly favored the conservation with S. cerevisiae, while the situation was reversed for KEGG.
While both S. cerevisiae and C. elegans benefited from considering the conservation with either of the two other species, for D. melanogaster the incorporation of S. cerevisiae could even decrease accuracies below the level obtained with D. melanogaster alone. Here we can only speculate about the reasons. The relatively large number of ribosomal gene interactions found for the S. cerevisiae dataset (see findings above) might not correspond to high-scoring interactions in the D. melanogaster or C. elegans datasets and might therefore lead to decreased accuracy. Another explanation is that the information content of the S. cerevisiae dataset might be relatively poor, so that it benefits from the incorporation of either of the two other datasets but gives only little advantage to the C. elegans dataset and even has a negative influence on the D. melanogaster dataset. For D. melanogaster, accounting for the conservation with C. elegans increased accuracies, so that the combination of S. cerevisiae and C. elegans still gave an overall gain.
Influence of dataset and normalization
During our analysis, we noticed the strong influence of the normalization scheme on the performance of the distance measures. For the Euclidean distance in particular, we observed extremely poor performance when the expression data were not z-normalized (data not shown). The Pearson correlation and the mutual information were only slightly affected. The Spearman correlation was not affected at all, since the rank ordering inherent in this method is invariant to z-normalization.
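The different sensitivity of the Euclidean distance and the Spearman correlation can be reproduced with a small sketch. It assumes that z-normalization is applied per gene profile (subtracting the mean and dividing by the standard deviation across conditions) and that profiles are non-constant and contain no tied values; the function names and the toy profiles are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def z_normalize(profile: List[float]) -> List[float]:
    """Scale a non-constant profile to zero mean and unit standard deviation."""
    mu, sd = mean(profile), pstdev(profile)
    return [(x - mu) / sd for x in profile]


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranks(profile: List[float]) -> List[int]:
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(profile)), key=profile.__getitem__)
    result = [0] * len(profile)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Two hypothetical profiles with the same shape but a tenfold scale difference.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

raw_distance = euclidean(gene_a, gene_b)                          # large
z_distance = euclidean(z_normalize(gene_a), z_normalize(gene_b))  # close to zero
raw_rho = spearman(gene_a, gene_b)
z_rho = spearman(z_normalize(gene_a), z_normalize(gene_b))        # unchanged
```

Without normalization, the Euclidean distance is dominated by the difference in expression level rather than by the shape of the profiles, whereas the ranks, and therefore the Spearman correlation, are unchanged by any increasing linear rescaling of a profile.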
For S. cerevisiae, we also tested the performance of the distance measures and the influence of conservation using a second dataset (Hughes et al. [20]) that contained up to 300 experimental conditions. Compared to the Spellman et al. dataset used in the analysis presented here, the Hughes et al. dataset contained many genes with a larger number of extremely high (and probably not biologically meaningful) expression values (data not shown). As a result of these outliers, we observed poor performance for the best co-expressed gene pairs. Even after removing these outliers, the Hughes et al. dataset gave poorer results in our evaluations, so we decided not to use it for our analysis.