 Research article
 Open Access
 Published:
Increased entropy of signal transduction in the cancer metastasis phenotype
BMC Systems Biologyvolume 4, Article number: 104 (2010)
Abstract
Background
The statistical study of biological networks has led to important novel biological insights, such as the presence of hubs and hierarchical modularity. There is also a growing interest in studying the statistical properties of networks in the context of cancer genomics. However, relatively little is known as to what network features differ between the cancer and normal cell physiologies, or between different cancer cell phenotypes.
Results
Based on the observation that frequent genomic alterations underlie a more aggressive cancer phenotype, we asked if such an effect could be detectable as an increase in the randomness of local gene expression patterns. Using a breast cancer gene expression data set and a model network of protein interactions we derive constrained weighted networks defined by a stochastic information flux matrix reflecting expression correlations between interacting proteins. Based on this stochastic matrix we propose and compute an entropy measure that quantifies the degree of randomness in the local pattern of information flux around single genes. By comparing the local entropies in the nonmetastatic versus metastatic breast cancer networks, we here show that breast cancers that metastasize are characterised by a small yet significant increase in the degree of randomness of local expression patterns. We validate this result in three additional breast cancer expression data sets and demonstrate that local entropy better characterises the metastatic phenotype than other nonentropy based measures. We show that increases in entropy can be used to identify genes and signalling pathways implicated in breast cancer metastasis and provide examples of denovo discoveries of gene modules with known roles in apoptosis, immunemediated tumour suppression, cellcycle and tumour invasion. Importantly, we also identify a novel gene module within the insulin growth factor signalling pathway, alteration of which may predispose the tumour to metastasize.
Conclusions
These results demonstrate that a metastatic cancer phenotype is characterised by an increase in the randomness of the local information flux patterns. Measures of local randomness in integrated protein interaction mRNA expression networks may therefore be useful for identifying genes and signalling pathways disrupted in one phenotype relative to another. Further exploration of the statistical properties of such integrated cancer expression and protein interaction networks will be a fruitful endeavour.
Background
The statistical study of biological networks has shed substantial novel insights into how biological networks are organised and how this organisation may relate to cellular function [1]. An important class of biological networks are protein interaction networks (PINs), as determined from highthroughput interaction screens and literature curation [2–4]. Edges in these networks mainly describe interactions of proteins that are part of the same physical complex or posttranslational modifications mediating signal transduction flows. Topological studies of these PINs have shown that they have important properties such as scalefreeness of the degree distribution, characterised by most proteins having small degree constrasting with a small yet significant number of highly connected nodes, called hubs [1]. Hubs have been shown to encode for essential genes in the context of yeast knockout experiments [5] and are also frequently altered in cancer [6, 7], highlighting that topological properties of these networks may carry important biological information. More recently, hierarchical modularity and betweenness centrality are other network properties that have been actively investigated in the context of biological networks [8, 9]. Specifically, the observation that proteins of high betweenness centrality tend to be also essential [9] and may also be frequently altered in cancer [10, 11] suggests that proteins of high betweenness centrality are also important for normal cellular physiology. However, it is recognised that wellknown disease genes are more frequently studied leading to literature bias and hence to a skew in the topological properties of such genes [12].
Existing PINs are also largely incomplete and therefore only represent very crude models of protein interactions [2]. Nevertheless they have provided useful structural models and their integration with gene expression data has led to novel insights, also in cancer genomics where they have helped identify candidate gene modules driving disease progression and novel diagnostic/prognostic markers [11, 13]. A distinct advantage of the integrative PINmRNA expression approach is that it can overcome some of the inherent problems associated with analysing gene expression data on its own, such as poor signal to noise ratios and the difficulty of interpreting gene expression signatures. Indeed, integrative PINmRNA expression studies have helped tease out relevant patterns of expression variation in the contextual framework of signalling pathways and protein complexes [14, 15]. Another potential advantage of using PINs is in relation to expression correlation networks to distinguish between direct and indirect correlations. For example, partial correlations and Graphical Gaussian Models have been proposed for deriving relevance networks where edges represent only potential direct influences between genes [16]. An alternative to partial correlations is to use the structural PIN to remove the (longdistance) indirect correlations between genes.
So far though the focus has been on the identification of specific subnetworks that may be causally involved in disease progression [11, 13, 17–19]. Much less has been done on the exploration of statistical properties of networks underlying the cancer phenotype with the exception of [12] where they performed a comprehensive analysis of cancer networks by first selecting genes differentially expressed between cancer and normal tissue (the "seeds") and then constructing the networks as the nearestneighbor PIN expansion of these seeds. It has become clear however that using PIN information a posteriori does not make full justice of the potential offered by a genuine integrative PINmRNA expression analysis. Indeed, it is increasingly recognised that approaches based on singlegene differential expression analysis are too simplistic and have been gradually superseeded by networkbased methods [11, 13]. A deeper understanding of the statistical properties of the underlying cancer networks is therefore important, as this may guide the choice of network metrics to use for better identifying signalling pathways or functional modules involved in disease progression.
In this work we investigate statistical properties [20] of integrated PINmRNA expression networks with the aim of identifying network properties that vary significantly between different cancer phenotypes. In particular, we hypothesized that since the frequency of genomic alterations is generally associated with a poor prognosis phenotype [21, 22], that this could be reflected by disruptions in the integrated PINmRNA expression networks, which would manifest as increases in the randomness of the local correlation patterns. To this end, we here introduce a local entropy measure that quantifies such randomness and demonstrate that (i) entropy is increased in primary tumours that metastasize (henceforth "metastatic phenotype"), (ii) that it better characterises the metastatic phenotype than other metrics which do not quantify randomness, and (iii) that this entropy measure can be used to discover genes and associated functional modules involved in cancer progression. To our knowledge this constitutes the first demonstration that metastatic cancer is characterised by a higher degree of randomness in the underlying signal transduction patterns and that this effect can be measured from integration of gene expression data with existing PINs.
Methods
The protein interaction network
We downloaded the Human Protein Reference Database (HPRD) interaction network [2] from Pathway Commons http://www.pathwaycommons.org The HPRD network has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. Protein interactions in this network include physical stable interactions such as those defining protein complexes, as well as transient interactions such as posttranslational modifications and enzyme reactions found in signaltransduction pathways, including 20 highly curated immune and cancer signalling pathways ("NetPath"). The HPRD network used here consisted of 8396 proteins with 36877 documented interactions.
Breast cancer gene expression data sets
We downloaded the normalised data of four breast cancer gene expression data sets from GEO http://www.ncbi.nlm.nih.gov/geo/[23–26]. These data sets were all profiled with the Affymetrix platform. We refer to these data sets as "EMC" (Erasmus Medical Centre) [23], "Frid" [24], "LoiUnt" [25] and "Mainz" [26]. In the case of [25] we used the data from the untreated patient population. As clinical endpoint we used a binary outcome variable indicating if the tumour metastasized or not. This was possible for the Mainz, Frid and LoiUnt data sets. For the EMC cohort, we used recurrence as a surrogate. The number of samples and their breakup according to clinical outcome in each study were 286 (107 recurred, 179 no recurrence) for "EMC", 200 (46 metastasized, 154 no metastasis) for "Mainz", 129 (27 metastasized, 102 no metastasis) for "Frid" and 125 (28 metastasized, 97 no metastasis) for "LoiUnt". To simplify the jargon we also refer to samples that recurred (EMC cohort) as metastatic samples.
Integrated PINmRNA expression networks
Next, we built integrated PINmRNA expression networks from the HPRD PIN and a gene expression data matrix. From the PIN we first extracted the proteins for which there were corresponding gene expression profiles. Typically, this resulted in a reduced PIN over approximately 7000 common genes/proteins with a 11 correspondence between nodes in the PIN and gene expression profiles. The reduced PIN had several connected components, the maximally connected component containing approximately 95% of the original nodes. Let P_{ ij }denote the adjacency matrix of the PIN and let N denote the number of nodes.
Since we are interested in investigating differences in network properties between metastatic and nonmetastatic cancer we constructed correlation expression networks by separately computing gene pairwise Pearson correlations across metastatic and nonmetastatic cancers. Since typically the number of metastatic samples is less than that of nonmetastatic tumours, to ensure that results are independent of systemsize, we subselected a number of nonmetastatic tumours at random from the original set. Thus, let denote the Pearson correlation of genes i and j across the n_{ M }metastatic samples, and let denote the corresponding correlation across n_{ M }randomly selected nonmetastatic samples. We then assign to each pair of genes an edge weight w_{ ij }= C_{ ij }. Finally, since we are interested in the patterns of local correlation we set w_{ ij }= 0 whenever P_{ ij }= 0. Thus, we assign nonzero weights only to edges that are present in the corresponding PIN. The resulting network defines the integrated PINmRNA network, and while the structural networks are identical for the metastatic and nonmetastic phenotypes, the values of the edge weights will differ.
The stochastic information flux matrix
There is still freedom in how we can assign weights to the edges in the PINmRNA network. In particular, since 1 ≤ C_{ ij }≤ 1, we may redefine the edge weights as for both the metastatic and nonmetastatic correlation networks. This monotonic transformation ensures that the edge weights satisfy 0 ≤ w_{ ij }≤ 1. A weight close to 1 represents therefore a strong positive correlation between the two genes, a weight of approximately 0.5 describes a nonexistent correlation, while strong negative correlations will be assigned small weights. For any given node i with neighbours j ∈ N(i) where N(i) = {j ∈ N: P_{ ij }= 1}, we can then assign a probability distribution as follows:
so that ∑_{j∈N(i)}p_{ ij }= 1. We note that while this defines a stochastic matrix, it is not doubly stochastic, i.e ∑_{i∈N(j)}p_{ ij }≠ 1, because in general p_{ ij }≠ p_{ ji }.
We interpret this probability distribution as follows. Strong positive correlations in the PIN very likely describe interactions that are necessary for the cell to carry out a specific modular function. We call this a "positive information flux" interaction to indicate that this interaction describes propagation of an activating signal. Thus, although the positive interaction could be between members of a given complex without explicitly representing signal transduction flow, we nevertheless argue that the positive correlation is necessary for the protein complex to carry out a specific function within the (signal transduction) pathway in which it operates. Because of this, we interpret negative correlations as "negative information flux" interactions to indicate that complexes may be in an inactive state or to represent inhibitory interactions. Thus, these interactions are given the smallest "positive flux" weights. Weak or nearzero correlations are given a larger positive flux (p_{ ij }~ 0.5) than negative interactions (p_{ ij }~ 0) because we argue that these interactions have less evidence for inducing an inactivating signal. Thus, the integrated PINmRNA networks with the edge weights defined by p_{ ij }as in equation (1) describe positive flux interactions (as measured by genegene correlations in expression) subject to the structural constraint of the PIN. Applying this procedure to the two C^{(M)} and C^{(N)} correlation matrices, yields two integrated PINmRNA networks, one for the metastatic phenotype with stochastic matrix , and one for the nonmetastatic phenotype with stochastic matrix Since the number of metastatic samples is less than that of nonmetastatic tumours (true for all cohorts considered here), and since we constructed the weighted networks using the same number of samples from each phenotype, we have freedom in the choice of nonmetastatic samples to use. We exploit this freedom by making in each cohort 10 random selections of nonmetastatic samples (in each case selecting the same number of samples as there are of metastatic cases), called "bootstraps", which allows the robustness of our results to be evaluated.
We point out that gene expression gives only a rough proxy for protein activity and that therefore our integrated PINmRNA networks and stochastic information flux matrices only represent very crude models of signal transduction flow. In particular, we expect many egdes to be assigned low fluxes when in fact corresponding protein levels and activities may be highly correlated. Thus, it is clear that our networks do not represent truly integrated networks. However, in the absence of largescale protein expression and phosphorylation data, specially in relation to cancer phenotypes, relying on correlations in mRNA expression between genes whose protein products have been documented to physically interact (including postranslational modifications) is one viable alternative.
Entropy of information flux distribution
We propose the following entropy measure [20] that quantifies the amount of randomness/disorder in the local flux distribution surrounding any given node i ∈ N. This measure is similar to the local disparity/heterogeneity index of weighted networks considered in other contexts [27, 28]. Specifically, if k_{ i }denotes the degree of node i, then the local entropy of node i is defined by
Thus, in the extreme case where all the flux is constrained along one interaction, p_{ ij }= δ_{ ij }(where δ_{ ij }is the Kroneckerdelta), S_{ i }= 0, while in the case where the flux is distributed equally across all edges i.e if p_{ ij }= 1/k_{ i }, the entropy attains its maximum value S_{ i }= 1.
We denote by S the distribution of local entropies across the whole network, i.e S = {S_{1},..., S_{ N }} where N is the number of nodes in the network. Since the stochastic matrix p_{ ij }is defined separately for metastatic and nonmetastatic phenotypes we obtain separate distributions of local entropies S^{(M)} and S^{(N)}.
To determine if these two distributions are significantly different we use the nonparametric paired Wilcoxon rank sum test. The paired test is appropriate because we need to control for possible dependencies between the two entropy values associated with each node (see below). We use the Pvalue of the test as a measure of the degree of difference between the values in the two phenotypes. We also define a differential entropy, dS (also denoted as "diffS"), on a nodebynode basis, by
It is important to point out that the sampling properties of the entropy measure as defined here will depend on the degree of the node. To see this, consider a node of degree 10, with half of the neighbors having p_{ ij }= 1/5 while the other half have p_{ ij }= 0. Then, the entropy is S = log 5/log 10 ≈ 0.70. Consider now another node of degree 100, again with half of the neighbors having p_{ ij }= 1/50 and the rest p_{ ij }= 0. In this case, the entropy is S = log 50/log 100 ≈ 0.85. Thus, while the two nodes have, modulo the degree, the same stochastic flux distribution, the entropies are widely different with the higher connected node exhibiting much higher entropy. A consequence of this is that the sampling properties and in particular the expected variance of the entropy will decrease with increasing degree. This means therefore that when comparing entropy measures between the two phenotypes, the comparison must be done on a degreebydegree basis. In particular, the paired Wilcoxon (nonparametric) test used above to evaluate differences in the global entropy distributions performs the comparison on a node by node basis and therefore circumvents the degreedependence problem posed by the entropy measure.
Feature selection based on local differential entropy
It is clear however that ranking nodes according to the magnitude of entropy change may skew selection to nodes of small degree as high degree nodes are less likely to show big entropy changes. To address this difficulty we adopt an empirical approach to derive a null distribution of expected entropy changes as a function of the degree of the node. Specifically, the approach we used is:

1.
Randomise the gene expression profiles of each sample, reconstruct stochastic flux matrices for the two phenotypes and calculate differential entropy values.

2.
For nodes with the same degree, estimate sampling variance. Generally, for nodes of higher degree (i.e k_{ i }> 50), there were too few nodes to estimate sampling variance, so highdegree nodes were pooled together and a single variance estimate obtained for each pool. In particular, nodes with degrees in the following ranges were pooled in corresponding bins: (50, 75), (75, 100), (100, 150), (150, max k_{ i }). This procedure was justified as we observed that the sampling variance was fairly constant for highdegree nodes.

3.
Repeat steps 1 & 2 a total of 100 times to obtain 100 variance estimates for each degree or pooled degree.

4.
Obtain smoothed standard error estimates by averaging the standard errors over the 100 randomisations.

5.
We observed that the resulting standard deviation, σ, was a decreasing monotonic function of the degree of the node. In particular, it took the powerlaw form
(4) 
6.
Estimate the parameters a and b using nonlinear least squares [29].

7.
For node i of degree k_{ i }and observed differential entropy value dS_{ i }we computed a Pvalue by comparing dS_{ i }to the value expected under the null given by a Gaussian of mean 0 and standard deviation σ (k_{ i }) with the parameters as estimated in the previous step.

8.
Finally, nodes are ranked according to the significance of the Pvalues, and subsequent FDR estimation is applied [30].
Average local correlation
We compare our local entropy metric to an alternative metric that quantifies the strength of the local positive flux distribution. Specifically, given a node i we can estimate the strength of the positive information flux surrounding node i by simply averaging the correlations over the nearest neighbors
Since this is done for each cancer phenotype separately, we obtain for each node the difference in the mean local correlation as
In addition, we also consider the average of local absolute correlations, that is,
and similarly, the difference in mean local absolute correlation,
However, since generally we would expect larger local entropies to reflect decreases in the positive information flux or to reflect lower absolute correlations (i.e less order), we redefine these metrics as
and use these metrics in the comparisons to entropy. In comparing all the metrics between the two phenotypes, the Pvalues obtained from the paired Wilcoxon rank sum tests are directly comparable, since the same number of values are being compared for each choice of metric.
Results
Using entropy to characterise the metastatic cancer phenotype
Based on the observation that more aggressive metastatic breast cancer is associated with a higher genomic instability (i.e a higher frequency of genomic alterations) [21, 22, 24, 31], we sought to investigate if this could be reflected in integrated PINmRNA expression networks. Specifically, we hypothesized that genomic (and epigenomic) alterations would lead to disruptions in the dynamics of information flow on protein interaction networks, and that this disruption would manifest itself as a higher degree of randomness in the patterns of local gene expression correlations.
While gene expression data represents a steady state, we posited that information flux in the network and under a certain condition is driven by positive correlations in gene expression between the interacting proteins. We thus assigned to each edge in the PIN a positive "flux weight", the magnitude of which reflects the strength of the Pearson correlation between the corresponding expression profiles (Methods). The weights were normalised further to yield a stochastic matrix p_{ ij }(∑_{j∈N(i)}p_{ ij }= 1 for all i ∈ N, Methods), describing the relative strengths of the local Pearson correlations. Thus, p_{ ij }can be interpreted as the probability of signal transmission from protein i to a neighbor j relative to all other interacting neighbors of protein i. Using this stochastic matrix we propose to use an entropylike measure that quantifies the amount of disorder or randomness in the local flux distribution (Methods). Specifically, for each node i we compute an entropy of flux distribution, S_{ i }, as
Thus, nodes with significant changes in S between two phenotypes represent points in the gene network where the degree of randomness in the information flow changes.
There are in principle other measures one could use to quantify the disruption in information flow in a network. One possibility is to consider for each node i the distribution of Pearson correlation values C_{ ij }with each of its neighbors [11]. By comparing the distribution of correlation values in metastic breast cancer with the corresponding distribution in nonmetastatic cancer one can thus identify nodes for which the local flux distribution changes. These nodes may therefore represent genes that are disrupted in the more aggressive phenotype. However, we argued that the degree of randomness in the local flux distribution maybe a better characteristic of a metastatic network than any statistic based on the difference in correlation values (Figure 1). Indeed, as the hypothetical example in Figure 1 shows, the metastasis phenotype may be associated with a significant increase in randomness but not necessarily with a significant change in the average nearestneighbor correlation. On the other hand, there could be nodes with a significant change in the mean nearest neigbour correlation but without a change in local randomness. We therefore sought to determine if randomness could be a better intrinsic characteristic of the metastatic phenotype than orthogonal measures such as local mean correlation or absolute correlation which do not necessarily quantify such randomness.
Increased entropy in metastatic breast cancer
To test our hypotheses we first integrated a highly curated network of protein interactions (from the Human Protein Reference Database) [2] with a highquality cancer gene expression data set from 286 breast cancer patients [23], of which 107 relapsed (Methods). There were 7279 protein coding genes with expression profiles and with representation in the HPRD PIN. Of these, 167 were isolated nodes, which were therefore removed from the analysis, leaving n = 7112 nodes. The PIN is sparse containing a total of only 31678 edges, representing 0.01% of the maximum possible number of edges. The degree distribution exhibited longtails with the highest degree node having 222 neighbors. About 60% of nodes had a degree less or equal than 5, while ~75% of nodes had a degree less than 10. To ensure comparibility of correlation values, we sampled 107 samples at random from the 179 nonmetastatic samples, to build the corresponding nonmetastatic PINmRNA network. The metastatic PINmRNA network was constructed using all 107 metastatic samples.
We first restricted the analysis to the 1903 nodes of degree ≥ 10, representing about 25% of all nodes in the network, and compared the entropy measures between the metastatic and nonmetastatic networks. First, we found that the local entropies of both metastatic and nonmetastatic networks were significantly lower than that of a null network obtained by randomising the expression profile of each tumour sample (Figure 2A). This is in line with the expectation that local correlation patterns are far from random. We also observed that our proposed entropy measure was discriminatory of the two cancer phenotypes, with a small yet significant increase in entropy in the metastatic network (Wilcoxon paired test P = 2 × 10^{16}). To test the robustness of these results we compared the metastatic network to other nonmetastatic PINmRNA networks obtained by randomly selecting a different set ("bootstraps") of 107 nonmetastatic tumours. Viewing these as perturbations of the original nonmetastatic network, we performed a total of 10 bootstraps and in all cases the entropy of the metastatic network was higher (Figure 2B, Wilcoxon paired test P < 0.005 in all cases). We also observed that differential entropy values were highly correlated between the different bootstraps (mean Pearson correlation 0.77), thus allowing values to be averaged.
Differences in local entropy between the metastatic and nonmetastatic network were not substantial in absolute terms, however, the paired test results clearly showed statistical significance and a trend towards increased entropy. To obtain a handle on the statistical significance of individual entropy changes, we first noted that the variance in differential entropy values was dependent on the degree of the node (Additional file 1A), which was not unexpected (Methods). Specifically, high degree nodes are generally less likely to exhibit large entropy changes. To overcome this problem we derived degreedependent variance estimates of the differential entropy, which allowed specific Pvalues to be derived (Methods). We observed that the variance decreased with the degree of the node in the form of a powerlaw allowing Pvalues to be estimated for nodes of any degree (Figure 2C). Confirming the paired test analysis, individual Pvalues exhibited a clear skew towards 0 suggesting many significant changes (Figure 2D). Among the top 200 ranked nodes (stringent FDR < 0.001) (Additional file 2), 133 showed entropy increases (Binomial test P < 10^{6}). Using the Pvalues to rank the genes also confirmed that the selected nodes were now not skewed towards those of lowdegree (Additional file 1B).
Because these results were obtained on the selected nodes of high degree (k_{ i }≥ 10), we asked if this result would remain significant had we included all nodes. We thus repeated the analysis using all 5592 nodes with at least two neighbors (entropy undefined for k_{ i }< 2). This showed that entropy was also significantly higher in metastatic breast cancer (Wilcoxon paired test P < 10^{100}), despite the fact that nonhubs made up 75% of the network. Thus, while nodes of lower degree generally also showed increases in entropy we focused our analysis on the higherdegree nodes (k_{ i }≥ 10) as these are more prone to disruption in cancer [6, 7].
Validation in other breast cancer cohorts
We next asked if the increase in local entropy found in the EMC data set is a robust finding independent of the expression data set used. We thus collected three additional breast cancer expression data sets that had been profiled with Affymetrix arrays [24–26]. For all three data sets we constructed the corresponding PINmRNA networks in both metastatic and nonmetastatic disease using the same procedure as for the EMC data. The PINmRNA networks were of similar size to the one obtained from the EMC data set. Remarkably, as with the EMC PINmRNA networks, we observed that the local entropies (of nodes with degree k_{ i }≥ 10) were significantly increased in the metastatic networks (Figure 3A). In all cases, we verified that the increase in entropy was robust to the choice of nonmetastatic network. We emphasize here again that metastatic and nonmetastatic PINmRNA networks were derived using the same number of samples, thus excluding the possibility that the increase in entropy is a samplesize effect. However, the increase in entropy was less marked when the analysis was extended to include all nodes of degree ≥ 2 (Mainz P = 4 × 10^{7}, Frid P > 0.5, LoiUnt P = 2 × 10^{5}).
It is natural to ask if the increase in entropy seen across all cohorts is due to the same genes or instead if in each cohort the subset of genes exhibiting increases is different. To address this we selected the 133 genes which showed the most significant increases in entropy in the EMC data set (out of the top 200 genes with most significant changes (FDR < 0.001), 133 showed increases) and verified that these generally also showed more significant increases in entropy in the validation sets (Figure 3B). This supports the view that there is genewise consistency in the metastasisassociated entropy increases seen across all four cohorts.
Randomness as an intrinsic property of the metastatic phenotype
Given the robust finding of increased entropy in metastatic breast cancer, we next asked if other local measures could discriminate between the two cancer phenotypes. Specifically, we compared entropy against the average local correlation, that is, the mean correlation of a node with its nearest neighbors, as well as the mean local absolute correlation (Methods).
We expected that the mean correlation (or absolute correlation) in expression would decrease in metastatic breast cancer as a result of the increased frequency in genomic/epigenomic alterations. We confirmed this across all four data sets using the same onetailed paired Wilcoxon test (Table 1). Comparison of the Pvalues obtained using entropy and average correlation measures however also showed that entropy was substantially more different between metastatic and nonmetastatic breast cancer (Table 1).
As an alternative test we also asked how many nodes exhibited an increase or decrease in entropy, average correlation or absolute correlation and tested the strength of any deviation from the null using the binomial test. By performing this binomial test we are also comparing entropy to a paired ttest of correlation values as the directionality of change is determined by the difference in mean correlations. We observed that there were a substantially larger number of nodes exhibiting increased entropy in metastatic breast cancer than nodes exhibiting lower average correlation (Figure 4A). In line with this, binomial test Pvalues of skewness derived from differential entropy values were also substantially more significant than those derived from differences in mean local correlation (Figure 4B). Interestingly, comparing differential entropy to changes in the local mean absolute correlation, we observed that the absolute correlation was a good surrogate for entropy. However, entropy did generally show more significant changes than the absolute correlation (Table 1, Figure 4). Thus, these results demonstrate that increased entropy in the local patterns of gene expression correlations may be a more intrinsic property of the metastatic phenotype and thus constitute a more useful measure for identifying gene subnetworks implicated in breast cancer metastasis than subnetworks derived from direct correlation measures.
Biological significance
We would expect that nodes exhibiting the most significant increases in entropy may point to genes and signalling pathways that are more frequently altered in the metastatic or poor prognosis breast cancer phenotype, irrespective of the modality of the underlying alteration. Of the 133 genes showing the most significant increases in entropy (Additional file 2) several have expression levels that have already been associated with a poor prognosis and distant metastasis in breast cancer (e.g CDC2, CCNB1, MYBL2, MAD2L1) and which have functions related to the cellcycle and DNA replication [32, 33]. Graphical depiction of the nearest neighbor subnetwork surrounding e.g MYBL2 confirmed the increase in entropy in the metastatic phenotype (Figure 5). For instance, one can observe how a strong inhibitory interaction with CCND1 and strong correlations with CCNE1 and RBL1 become disrupted in the metastatic phenotype. It is clear that nearest neighbors of MYBL2 (e.g CCNA2) also undergo increases in entropy, thus implicating larger gene modules, in this case a cellcycle module, that are altered in metastatic disease. Interestingly, GSEA (Gene Set Enrichment Analysis using a Fishertest) of genes with increased entropy in the metastatic phenotype revealed enrichment of several biological pathways, notably apoptosis, natural killer cell mediated cytotoxicity and interleukin2 (IL2), androgen receptor and insulin growth factor (IGF) signalling (Table 2). Corresponding subnetworks for IL2RB, IGFBP7 and BCL2 clearly confirmed the increase in entropy associated with these genes and pathways (Figure 5).
Since GSEA was performed on the top 200 genes using the 1903 gene list as the background reference, the enrichment of any biological pathways and in particular those of apoptosis and IL2/IGF is very significant and not caused by the selection of nodes of high degree. As a control and to confirm this further, we observed no enrichment of any biological pathway among the genes showing decreases in entropy (Table 2). Although this could be explained in part by the smaller number of genes showing decreases (67 vs 133), we only obtained significance of one biological term (regulation of translation initiation) when we expanded the 67 gene set to the 133 with the most significant entropy decreases.
It is likely that the increases in entropy associated with specific genes are caused by a higher frequency of underlying mutations, losses or epigenetic silencing. Using copynumber and sequencing data of a large cohort of 171 breast tumours (38 metastasized, 130 nometastasis) [31] we observed an increased frequency of losses in those tumours that metastasized for many of the top ranked genes including BCL2 (18% vs 13%), IGFBP7 (11% vs 6%), IGF1 (8% vs 2%) (Additional file 3). Similarly, TP53 was mutated in 31% of tumours that metastasized against 17% in those tumours that did not. Of the top 133 genes exhibiting increases in entropy, 90 could be mapped to the array CGH study and of these, 68 (75%) showed more frequent losses in primary tumours that metastasized (Additional file 4). In contrast, 65% of genes exhibiting no significant changes in entropy were more frequently lost in metastatic cases (Additional file 4). Thus, genes with significant increases in entropy were approximately 1.6 times more likely to undergo more frequent losses in disseminating tumours (Fisher test P = 0.08).
Discussion
In this work we have explored a specific statistical property of integrated protein interaction and cancer mRNA expression networks. Our hypothesis was that the metastatic cancer phenotype is characterised by a higher level of randomness in the sense of a more disordered local gene expression correlation pattern. In other words, since one may view the protein interactions as imposing to some degree constraints on the allowed local gene expression patterns [9, 11], we posited that these constraints would be violated to a higher extent in breast tumours which exhibit a more invasive phenotype. To test this we introduced a stochastic matrix on the PIN, modeling positive information fluxes around any given node, and from it we defined a local entropy measure which quantifies the degree of randomness of this flux distribution. Using this entropy measure we could show that the metastatic phenotype is indeed characterised by a higher level of entropy independently of the breast cancer cohort considered. While entropy changes were not large in absolute terms due to the focus on nodes of relatively high degree (degree ≥ 10), we were able to show using an empirically derived degree dependent variance estimator, that the changes in entropy were significant with a clear skew towards higher entropy in the metastatic phenotype. Moreover, we were able to show that genes exhibiting significant increases in local entropy in one data set, did so also in the independent cohorts, supporting the view that the disruptions in local expression patterns that we have found are biologically genuine and of relevance to the metastatic phenotype. It is interesting though that the increase in entropy was less marked when analysis was extended to all nodes of degree larger than 1, indicating that lowdegree nodes showed less consistent directional changes, in line with the expectation that lowdegree nodes may be functionally less important.
We also observed significant enrichment of relevant biological pathways, specially those with tumour suppressor functions, among genes showing increases in entropy in the metastatic phenotype, but not so for genes showing decreases in entropy. This supports the view that it is increases in randomness that may be of most functional consequence. A higher randomness in the coexpression patterns of genes important to the apoptotic cascade may indicate that this pathway is less functional in primary tumours that eventually metastasize. Similarly, our findings suggest that increased randomness in the patterns of expression of the IL2 pathway, a wellknown tumour suppressor pathway mediating tumour inhibition through formation of natural killer cells [34], is a critical determinant of breast cancer metastasis. This confirms other reports that immuneresponse pathways are important in prognosis prediction of breast cancer [35, 36]. While the IGFpathway has been implicated as a mediator of breast cancer progression and is also involved in the regulation of apoptosis [37–39], our results suggest a novel important role for IGFBP7 and IGFsignalling.
Generally, there were other instances where the identified genes formed interlinked subnetworks implicating whole gene modules that show increased entropy. It will therefore be interesting to extend our analysis to metrics that quantify the degree of randomness across whole subnetworks as opposed to single nodes. Among the larger modules, we identified one associated with the cellcycle (CDC2, CDC20, MYBL2, MAD2L1) consistent with many other studies showing the importance of cellcycle genes in breast cancer prognosis [32, 33]. Of note, we also found local entropy increases in subnetworks surrounding genes involved in invasion (SPARC, MMP1, MMP2, MMP3). SPARC expression itself has been associated with clinical outcome in breast cancer in a large number of studies, for example see [40, 41]. In this regard, it is worth pointing out that we have rediscovered these important cellcycle and extracellular matrix modules without directly comparing expression levels between metastatic and nonmetastatic breast cancer, but rather by using entropy to compare the randomness of expression patterns within each phenotype. Thus, our proposed entropy metric is able to identify biological mechanisms related to the cancer hallmarks and a poor prognosis phenotype.
Perhaps most remarkably is the fact that entropy outperformed other local measures of information flux, in the sense that entropy better characterized the metastatic phenotype. One competing measure we considered was the average nearestneighbor correlation in gene expression, which clearly was less effective in distinguishing the two cancer phenotypes. Similarly, entropy characterised the metastatic phenotype more closely than the effect size statistic provided by a ttest. The ttest is similar to the method used in [11] to identify features disrupted in poor prognosis breast cancer. In that study, the authors identified BRCA1 as one of the top genes exhibiting changes in the mean local correlation. In terms of differential entropy however, BRCA1 was not highly ranked because differential entropy only measures the change in the degree of randomness of the local correlation distribution, while as shown in [11]BRCA1 exhibits a fairly uniform switch from positive to negative correlations with its interacting partners, a switch which is therefore not necessarily associated with an increase in disorder or randomness. This example highlights the fact that although differential entropy can miss important genes, that it constitutes a metric that is complementary to the one used in [11]. Therefore, it will be interesting to compare the alternative metrics in relation to the ranking of nodes. However, the result obtained here that increased entropy is a better distinguishing feature of the metastatic phenotype supports the view that changes in the randomness of the local correlation patterns may be more relevant than changes in mean correlation levels. To confirm this further, we took the mean of the absolute correlations as another metric and compared it to entropy. In line with the expectation that increases in entropy are generally caused by drops in absolute correlation values, we observed that changes in the mean absolute correlation, while not outperforming entropy, was a good surrogate for it. It will therefore also be interesting to compare our entropy metric to other network metrics that quantify randomness or to metrics that combine randomness and changes in mean correlation levels.
The observed increase in entropy in the metastatic breast cancer networks may have a clear biological interpretation, as genomic and epigenomic alterations are more frequent in primary tumours that metastasize [21, 24, 31]. This increased frequency of alterations is true for losses, mutations and amplifications (highlevel gains) but does not necessarily hold for lowlevel gains [24, 31]. It was therefore encouraging to observe that genes undergoing the most significant increases in entropy were also more likely to undergo more frequent losses in the tumours that metastasized. Although this association was only marginal, this could be because the genomic alteration data came from a different cohort. Another potential difficulty in linking entropy changes to gene alterations is that a substantial number of alterations are not copynumber changes but instead may represent point mutations, rearrangements, or epigenetic aberrations, and large scale profiling of these types of alterations is still ongoing. Thus, the question if differential entropy selects for genes that are more frequently altered and how the directionality of change might relate to the type of alteration is still an open question that we hope to address in future studies. An easier context in which to study this would be to compare integrated PINmRNA networks for cancer and corresponding normal tissue, as normal tissue exhibits significantly much lower levels of alterations. However, this also requires relatively large normal and cancer tissue data sets profiled on the same platform and ideally as part of the same study. While in breast cancer such data sets are only now becoming available, it will be interesting in the meantime to pursue this question in other cancers.
Conclusions
Metastatic breast cancer is characterised by an increase in the randomness of the local expression correlation patterns. The entropy metric proposed here may therefore be a useful tool for identifying genes and signalling pathways implicated in the metastatic process of other cancers. We foresee that further networktheoretical studies of integrated PINmRNA expression networks in the context of cancer genomics will be fruitful.
References
 1.
Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004, 5: 101113. 10.1038/nrg1272
 2.
Prasad TS, Kandasamy K, Pandey A: Human protein reference database and human proteinpedia as discovery tools for systems biology. Methods Mol Biol. 2009, 577: 6779. full_text
 3.
Stelzl U, Wanker EE: The value of high quality proteinprotein interaction networks for systems biology. Curr Opin Chem Biol. 2006, 10: 551558. 10.1016/j.cbpa.2006.10.005
 4.
Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21: 20762082. 10.1093/bioinformatics/bti273
 5.
Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 4142. 10.1038/35075138
 6.
Hornberg JJ, Bruggeman FJ, Westerhoff HV, Lankelma J: Cancer: a systems biology disease. Biosystems. 2006, 83: 8190. 10.1016/j.biosystems.2005.05.014
 7.
Jonsson PF, Bates PA: Global topological features of cancer proteins in the human interactome. Bioinformatics. 2006, 22: 22912297. 10.1093/bioinformatics/btl390
 8.
Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, et al.: Evidence for dynamically organized modularity in the yeast proteinprotein interaction network. Nature. 2004, 430: 8893. 10.1038/nature02555
 9.
Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M: The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol. 2007, 3: e59 10.1371/journal.pcbi.0030059
 10.
Ozgür A, Vu T, Erkan G, Radev DR: Identifying genedisease associations using centrality on a literature mined geneinteraction network. Bioinformatics. 2008, 24: i277i285. 10.1093/bioinformatics/btn182
 11.
Taylor IW, Linding R, WardeFarley D, Liu Y, Pesquita C, et al.: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009, 27: 199204. 10.1038/nbt.1522
 12.
Platzer A, Perco P, Lukas A, Mayer B: Characterization of proteininteraction networks in tumors. BMC Bioinformatics. 2007, 8: 224 10.1186/147121058224
 13.
Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140 10.1038/msb4100180
 14.
Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert JP: Classification of microarray data using gene networks. BMC Bioinformatics. 2007, 8: 35 10.1186/14712105835
 15.
Ulitsky I, Shamir R: Identification of functional modules using network topology and highthroughput data. BMC Syst Biol. 2007, 1: 8 10.1186/1752050918
 16.
Schäfer J, Strimmer K: An empirical bayes approach to inferring largescale gene association networks. Bioinformatics. 2005, 21: 754764. 10.1093/bioinformatics/bti062
 17.
Tuck DP, Kluger HM, Kluger Y: Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics. 2006, 7: 236 10.1186/147121057236
 18.
Pujana MA, Han JD, Starita LM, Stevens KN, Tewari M, et al.: Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007, 39: 13381349. 10.1038/ng.2007.2
 19.
Nibbe RK, Koyutürk M, Chance MR: An integrative omics approach to identify functional subnetworks in human colorectal cancer. PLoS Comput Biol. 2010, 6: e1000639 10.1371/journal.pcbi.1000639
 20.
Barrat A, Barthelemy M, Vespignani A: Dynamical Processes on Complex Networks. CUP. 2008,
 21.
Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z: A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet. 2006, 38: 10431048. 10.1038/ng1861
 22.
Negrini S, Gorgoulis VG, Halazonetis TD: Genomic instabilityan evolving hallmark of cancer. Nat Rev Mol Cell Biol. 2010, 11: 220228. 10.1038/nrm2858
 23.
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, et al.: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer. Lancet. 2005, 365: 671679.
 24.
Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, et al.: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10: 529541. 10.1016/j.ccr.2006.10.009
 25.
Loi S, HaibeKains B, Desmedt C, Lallemand F, Tutt AM, et al.: Definition of clinically distinct molecular subtypes in estrogen receptorpositive breast carcinomas through genomic grade. J Clin Oncol. 2007, 25: 12391246. 10.1200/JCO.2006.07.1522
 26.
Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, et al.: The humoral immune system has a key prognostic impact in nodenegative breast cancer. Cancer Res. 2008, 68: 54055413. 10.1158/00085472.CAN075206
 27.
Derrida B, Flyvbjerg H: Statistical properties of randomly broken objects and of multivalley structures in disordered systems. J Phys A:Math Gen. 1987, 20: 52735288. 10.1088/03054470/20/15/039.
 28.
Barthelemy M, Gondran B, Guichard E: Spatial structure of internet traffic. Physica A. 2003, 319: 633642. 10.1016/S03784371(02)013821.
 29.
Bates DM, Watts DG: Nonlinear Regression Analysis and Its Applications. 1988, Wiley,
 30.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100: 94409445. 10.1073/pnas.1530509100
 31.
Chin SF, Teschendorff AE, Marioni JC, Wang Y, BarbosaMorais NL, et al.: Highresolution acgh and expression profiling identifies a novel genomic subtype of er negative breast cancer. Genome Biol. 2007, 8: R215 10.1186/gb2007810r215
 32.
Teschendorff AE, Naderi A, BarbosaMorais NL, Caldas C: Pack: Profile analysis using clustering and kurtosis to find molecular classifiers in cancer. Bioinformatics. 2006, 22: 22692275. 10.1093/bioinformatics/btl174
 33.
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, et al.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006, 98: 262272. 10.1093/jnci/djj052
 34.
Pardoll D: Metastasispromoting immunity: when t cells turn to the dark side. Cancer Cell. 2009, 16: 8182. 10.1016/j.ccr.2009.07.007
 35.
Teschendorff AE, Miremadi A, Pinder SE, Ellis IO, Caldas C: An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biol. 2007, 8: R157 10.1186/gb200788r157
 36.
Teschendorff AE, Caldas C: A robust classifier of high predictive value to identify good prognosis patients in ernegative breast cancer. Breast Cancer Res. 2008, 10: R73 10.1186/bcr2138
 37.
Neuberg M, Buckbinder L, Seizinger B, Kley N: The p53/igf1 receptor axis in the regulation of programmed cell death. Endocrine. 1997, 7: 107109. 10.1007/BF02778075
 38.
Samani AA, Yakar S, LeRoith D, Brodt P: The role of the igf system in cancer growth and metastasis: overview and recent insights. Endocr Rev. 2007, 28: 2047. 10.1210/er.20060001
 39.
Yee D, Wood TL: The igf system in mammary development and breast cancer. J Mammary Gland Biol Neoplasia. 2008, 13: 351352. 10.1007/s1091100890962
 40.
Naderi A, Teschendorff AE, BarbosaMorais NL, Pinder SE, Green AR, et al.: A geneexpression signature to predict survival in breast cancer across independent data sets. Oncogene. 2007, 26: 15071516. 10.1038/sj.onc.1209920
 41.
Farmer P, Bonnefoi H, Anderle P, Cameron D, Wirapati P, et al.: A stromarelated gene signature predicts resistance to neoadjuvant chemotherapy in breast cancer. Nat Med. 2009, 15: 6874. 10.1038/nm.1908
Acknowledgements
AET was supported by a Heller Research Fellowship. SS was supported by a Newton International Fellowship.
Author information
Additional information
Authors' contributions
AET designed, analysed and wrote the paper. SS contributed with ideas and to the writing of the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 3: Relation of differential entropy to patterns of differential loss and gain in breast tumours. For the top 200 genes showing the most significant entropy changes, we provide the frequencies of genomic copynumber gain and loss across 171 breast tumours [31], stratified according to whether primary tumours metastasized (DM) or not (NoDM). The differential gains and losses are also provided in table. NA indicates that that gene had no oligo representation on the array. (XLS 40 KB)
Additional file 4: Increased entropy and genomic loss. Of the top 200 nodes with most signficant entropy changes, 133 showed significant increases in the metastatic phenotype, and 90 of these genes could be mapped to an oligo array comparative genomic hybridisation study over 171 breast tumours [31]. Left panel plots the difference in the frequency of loss of the gene between the tumours that metastasized and those that did not (yaxis) against the negative logarithm of the Pvalue for the differential entropy (xaxis). Right panel shows the same plot for the 133 genes showing least significant changes in entropy. Genes with significant increases in entropy were 1.6 times more likely to be more frequently lost in poor prognosis tumours (Fisher test P = 0.08). (PDF 17 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Entropy Measure
 Metastatic Phenotype
 Local Entropy
 Absolute Correlation
 Human Protein Reference Database