New network topology approaches reveal differential correlation patterns in breast cancer
 Michael Bockmayr^{1},
 Frederick Klauschen^{1},
 Balazs Györffy^{2},
 Carsten Denkert^{1} and
 Jan Budczies^{1}Email author
DOI: 10.1186/17520509778
© Bockmayr et al.; licensee BioMed Central Ltd. 2013
Received: 15 April 2013
Accepted: 6 August 2013
Published: 15 August 2013
Abstract
Background
Analysis of genomewide data is often carried out using standard methods such as differential expression analysis, clustering analysis and heatmaps. Beyond that, differential correlation analysis was suggested to identify changes in the correlation patterns between disease states. The detection of differential correlation is a demanding task, as the number of entries in the genebygene correlation matrix is large. Currently, there is no gold standard for the detection of differential correlation and statistical validation.
Results
We developed two untargeted algorithms (DCloc and DCglob) that identify differential correlation patterns by comparing the local or global topology of correlation networks. Construction of networks from correlation structures requires fixing of a correlation threshold. Instead of a single cutoff, the algorithms systematically investigate a series of correlation thresholds and permit to detect different kinds of correlation changes at the same level of significance: strong changes of a few genes and moderate changes of many genes. Comparing the correlation structure of 208 ER breast carcinomas and 208 ER+ breast carcinomas, DCloc detected 770 differentially correlated genes with a FDR of 12.8%, while DCglob detected 630 differentially correlated genes with a FDR of 12.1%. In twofold crossvalidation, the reproducibility of the list of the top 5% differentially correlated genes in 140 ER tumors and in 140 ER+ tumors was 49% for DCloc and 33% for DCglob.
Conclusions
We developed two correlation network topology based algorithms for the detection of differential correlations in different disease states. Clusters of differentially correlated genes could be interpreted biologically and included the marker genes hydroxyprostaglandin dehydrogenase (PGDH) and acylCoA synthetase medium chain 1 (ACSM1) of invasive apocrine carcinomas that were differentially correlated, but not differentially expressed. Using random subsampling and crossvalidation, DCloc and DCglob were shown to identify specific and reproducible lists of differentially correlated genes.
Keywords
Differential correlation Microarray data Breast cancer Molecular subtypes Differential coexpressionBackground
Over the last 15 years, global gene expression profiling using microarrays has been established as a common tool for disease research. With this approach, disease mechanisms may be studied by comparative expression profiling of disease and healthy tissues or two disease states A and B. In recent years, this approach helped to discover prognostic markers and signatures and to identify target structures for drug intervention. Alterations of gene regulation often result in up or downregulated genes. Therefore, looking for differentially expressed genes using statistical tests is one of the most common strategies for the comparative analysis of microarray data [1].
However, this approach ignores the fact that most of the biological processes require orchestrated action of many genes. Therefore, gene correlation and coexpression have been intensively studied since the early days of microarrays and the seminal work of Eisen et al. [2]. Today, hierarchical clustering and heatmaps are ubiquitous in studies of microarray data. Heatmaps usually serve as a tool for visualization of the results. Clustering has also been shown to be useful for the identification of disease subtypes, such as, for instance, defining molecular subtypes of breast cancer [3].
Complementary to differential expression analysis, the study of differential correlation or differential coexpression aims at a deeper understanding of the expression patterns in diseased tissues. As an example, a number of downstream targets could be regulated by a master gene, for example a transcription factor. In tissues where the regulatory mechanism is functional, the module of the target genes will be expressed in an ordered pattern. However, in diseased tissues where the regulatory mechanism is dysfunctional, the expression of the gene module will be unordered or random. Correlation changes of this kind can be detected by differential correlation (DC) analysis, but might be overlooked by differential expression (DE) analysis.
The number of pairwise correlations in global expression data of human tissues is quadratic in the number of genes and exceeds one million. Casebycase testing would lead to a multiple testing problem, connected with searching for a few differentially correlated gene pairs within a huge number of unregulated correlations. Therefore, it should be more efficient not to study each gene pair separately, but to take into account the overall structure of correlations. Shortly after the microarray technology became common, algorithms for the detection of differential coexpression and differential correlation were developed [4–6]. Meanwhile, a multitude of algorithms were published [7] that can be divided into targeted, semitargeted and untargeted approaches [8].
In targeted approaches, predefined gene modules are analyzed for correlation changes between the two disease states. Frequent choices for the modules are Gene Ontology (GO) categories, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, or clusters from additional external expression data sets. For example, in gene set coexpression analysis (GSCA) a dispersion index is calculated for each of the modules and the significance is assessed using a permutation test [9]. In [10], a difference network framework is developed and a test statistics is defined by averaging over the edge weights between members of the modules. In another kind of targeted approach, the analysis of correlations is restricted to a predefined network, for example to the human interactome [11]. In [12], the expression pattern of breast cancer on the interactome network was analyzed and it was shown that the metastatic cancer phenotype is characterized by an increase of randomness of the local information flux patterns.
In semitargeted approaches, modules in one of the disease states are defined using clustering, and these modules are investigated for correlation in the other disease state. The differential clustering algorithm (DCA) starts with clustering of the tissues in the reference disease state and proceeds with reordering the genes in the reference clusters according to the correlations in the second disease state [13]. CoXpress starts with hierarchical clustering of the reference samples and proceeds with a resamplingbased approach to find those modules that are coexpressed in one state, but not in the other [14].
Untargeted approaches do not depend on externally defined modules or modules defined by clustering of a reference data set. Therefore, untargeted algorithms are capable of detecting DC in more general situations where differential regulation neither occurs within predefined external nor internal modules. Many of the untargeted approaches start with constructing correlation (or interaction) networks of each of the disease states and proceed with identification of differentially correlated subnetworks [15–18]. The recently published DICER algorithm [19] is able to address two different scenarios of DC: differentially correlated clusters, but also differentially correlated metamodules. Here, a metamodule is defined as a pair of gene sets with genes inside the sets being correlated in both disease states, but with differing correlations between the gene sets.
Transformation of a correlation structure into a network requires fixing of a threshold. Whenever a correlation exceeds the threshold, the corresponding two network nodes are joined. A novelty of the current study is to investigate changes in network topology, but at the same time to evaluate a series of correlation thresholds that comprehensively cover the range of correlations in the data. In this way, different kinds of correlation changes can be detected at the same level of significance: strong changes of a few genes, but also moderate changes of many genes.
Worldwide, breast cancer is classified into molecular subtypes based on estrogen receptor (ER) and HER2 status. Determination of the molecular subtype is essential to tailor adjuvant treatment and to estimate of the risk of recurrence after surgery. In the last decade, DE between molecular subtypes of breast cancer was extensively investigated [20–22]. However, the literature on DC analysis of breast cancer is limited and includes a denovo partitioning method [23] and a targeted analysis of KEGG pathways [24]. Therefore, we tested the new developed untargeted algorithms in a large gene expression data set of 208 ER, 208 ER+, 208 HER2 and 208 HER2+ breast carcinomas.
Methods
The algorithms DCglob and DCloc (Additional file 1) were implemented using the statistical programming language R[25]. While the global algorithm focuses on comparison of the connected components of the networks, the local algorithm compares the next neighbors of the gene under consideration. The general workflow of the algorithms is illustrated in Figure 1.
Computing time
Calculations were done on a Linux computer including 16 GB RAM and an Intel Core i72600 processor, 3.40 GHz. In the first step, the gene correlation matrix was calculated and used as input for both of the algorithms. Because this calculation did not significantly contribute, the computing time was independent of the number of samples. The time to calculate the strength of differential correlation for 12703 genes was 52 minutes for DCglob (200 thresholds) and 69 minutes for DCloc (100 thresholds). Including FDR calculation by subsampling analysis (100 subsamples), the calculation time was 88 hours for DCglob and 116 hours for DCloc.
Global topology algorithm
Step 1 The algorithm compares gene expression data of n samples in disease condition A with gene expression data of n samples in disease condition B. First, the correlation matrix C^{ q } comprising the Pearson correlations ${c}_{\mathit{\text{ij}}}^{q}=\text{cor}({\text{gene}}_{i},\phantom{\rule{2.77626pt}{0ex}}{\text{gene}}_{j})$ of all pairs of genes is calculated for both disease conditions q=A,B. The Fishertransformed correlation matrix ${z}_{\mathit{\text{ij}}}^{q}=\frac{1}{2}\xb7log\frac{1+\underset{\mathit{\text{ij}}}{\overset{q}{c}}}{1\underset{\mathit{\text{ij}}}{\overset{q}{c}}}$ is the starting point for all further calculations.
Step 2 Correlation networks are computed for a comprehensive series of correlation thresholds. In the breast cancer data set, the highest correlation between genes is c=0.985 and corresponds to a Fishertransformed value of z=2.5. Therefore, we choose the set of thresholds T to be the series of 200 equidistant values between 0 and 2.5. For each threshold t∈T, we obtain correlation networks ${N}_{A}^{t}$ and ${N}_{B}^{t}$ corresponding to the disease conditions A and B.
Step 3 The decomposition of the networks ${N}_{A}^{t}$ and ${N}_{B}^{t}$ into connected components is computed and all connected components containing 3 or more genes are selected. We restricted to clusters of 3 or more genes, because this is the minimum number where the network topology comes into play. Changes of pairwise correlations could be more effectively studied using a direct approach. Formally, let $\{{A}_{1}^{t},\dots ,{A}_{k}^{t}\}$ and $\{{B}_{1}^{t},\dots ,{B}_{l}^{t}\}$ denote the sets of these connected components in ${N}_{A}^{t}$ and ${N}_{B}^{t}$ respectively. After these preparations, we remove the genes that are contained in a connected component for both networks ${N}_{A}^{t}$ and ${N}_{B}^{t}$ from the set of all genes G, yielding ${\stackrel{~}{G}}^{t}:=G\setminus \bigcup _{1\le i\le k,1\le j\le l}({A}_{i}^{t}\cap {B}_{j}^{t})$. Then, we build subnetworks ${\xd1}_{A}^{t}$ resp. ${\xd1}_{B}^{t}$ of ${N}_{A}^{t}$ resp. ${N}_{B}^{t}$ induced by the genes in ${\stackrel{~}{G}}^{t}$ and compute the sets ${\xc3}^{t}$ and ${\stackrel{~}{B}}^{t}$ of genes that are contained in connected components of ${\xd1}_{A}^{t}$ and ${\xd1}_{B}^{t}$ with cardinality greater or equal 3. Because we removed the genes that are contained in correlation clusters for both disease conditions, the remaining genes that are in correlation clusters for one of the disease conditions are considered as differentially correlated for correlation threshold t.
As an example, ${I}_{j}^{A}\left(t\right)=1$ indicates that gene j is member of a connected component in the network ${N}_{A}^{t}$ but not in the network ${N}_{B}^{t}$.
Step 4 Finally, genes that are members of connected components in only one of the networks over a large range of correlation thresholds are selected. To this end, an interval of maximal length [ a,b] is chosen such that ${I}_{j}^{q}\left(t\right)=1$ for all t∈ [ a,b]∩T. Thus, the interval contains as series of threshold values for which a gene is correlated in one of the networks, but not in both networks. The interval length b−a is converted to a pvalue using Steiger’s test for the comparison of correlation coefficients [26] and used to measure the strength of differential correlation. A list of differentially correlated genes includes all genes with pvalues p<S below a threshold S.
Local topology algorithm
Step 1 This step is identical to Step 1 performed for DCglob.
In this definition, the number of common next neighbors in both networks is divided by the total number of next neighbors. To focus on changes that affect correlation clusters of at least 3 genes, we set ${d}_{i}^{t}:=0$ if ${V}_{A}^{i,t}\cup {V}_{B}^{i,t}<3$.
Thus, we obtain a value in [0,1], which represents the strength of differential correlation for each gene. A list of differentially correlated genes includes all genes with d > s above a threshold s.
Estimation of false discovery rates
Statistical evaluation of the algorithms was performed by a repeated random subsampling analysis. We wanted to falsify the null hypothesis that both patients groups exhibit the same gene correlation structure. Therefore, we randomly subsampled arbitrary breast cancer patients to generate the null distribution. This procedure mixes ER+ and ER patients (as well as HER2+ and HER2 patients) and therefore is appropriate to assess the significance of differential correlations between the ER+ and ER subtype (as well as the HER2+ and HER2 subtype). Then, we compared the number of differentially correlated genes between breast cancer subtypes n_{ AB } to the number of differentially correlated genes between randomly sampled sets of breast cancer n_{0}.
wherein π_{0} denotes the proportion of not differentially correlated genes. This is a standard method for estimating the FDR from a subsampling or permutation analysis, see for example [27]. For breast cancer, the number of differentially correlated genes turned out to be small compared to the total number of genes. Therefore, slightly overestimating the FDR, we used the approximation π_{0}=1.
Dataset
We generated a large gene expression data set of breast cancer (1317 samples) by fusion of publicly available microarray data sets. Raw data of GSE1456, GSE2034, GSE4922, GSE6532, GSE7390 and GSE11121 with respectively 159, 286, 327, 578, 198 and 200 samples were downloaded from the Gene Expression Omnibus (GEO) website [28]. All the samples were analyzed using the Affymetrix Human Genome U133A microarray. As remarked in [29] some of the samples were contained in two or more data sets. Thus, we removed 431 samples and ended up with a breast cancer gene expression data set of 1317 unique samples. The raw data were preprocessed using the mas5 protocol as implemented in the R package affy[30] and transformed to log2 scale. All samples consisted of surgical collected freshfrozen tissue of primary tumors without neoadjuvant treatment.
A large number of genes was represented by more than one microarray probe. In this case, we selected the probe with the highest expression level resulting in a gene expression data set of 12703 unique genes. Immunohistochemistry (IHC) and in situ hybridization (ISH) where necessary are the gold standard for the determination of the ER and HER2 status. However, immunohistochemical data of ER and HER2 protein expression were not available for all samples. Hence, ER and HER2 classification was performed using the expression level of the estrogen receptor 1 gene (probe 205225_at) and the HER2 gene (probe 216836_s_at) from the microarray data (Additional file 2). A high concordance between RNA based determination of ER and HER2 states and the IHC based standard method was demonstrated before [22, 31]. A value of 10 was chosen as threshold for the ER status and a value of 12 as threshold for the HER2 status.
Visualization and functional analysis
Heatmaps were generated using the R function heatmap. Hierarchical clustering was executed using the average linkage method with Pearson correlations as similarity measure. Prior to the analysis, the expression level of each gene was centered to mean 0 and standard deviation 1. Construction and analysis of networks was carried out using the R package igraph[32]. Visualization of the networks was realized using Cytoscape[33]. Gene enrichment analysis was executed using DAVID [34, 35] with the genes represented by the microarray as background.
Results
Two algorithms were developed for the detection of differential correlation in different disease states (Figure 1). The algorithms are based on the detection of either global (DCglob) or local (DCloc) changes in the topology of the correlation network. Both algorithms include the analysis of correlation networks corresponding to a series of correlation thresholds that covers the range of correlations in the data.
Identification of differentially correlated genes
Distribution of ER and HER2 status
Microarray cohort  Californian population[36]  

Total  1317  (100%)  100% 
ER+  1035  (78.6%)  79.4% 
ER  282  (21.4%)  20.6% 
HER2+  208  (15.8%)  21.7% 
HER2  1109  (84.2%)  78.3% 
ER+/HER2+  120  (9.1%)  14.6% 
ER+/HER2  915  (69.5%)  64.8% 
ER/HER2+  88  (6.7%)  7.1% 
ER/HER2  194  (14.7%)  13.5% 
The genes were ranked according to the strength of differential correlation p (DCglob) and d (DCloc), see Additional files 3 and 4. The statistic p can be interpreted as the significance of the range of correlations, where the gene under consideration takes part in a change of global topology. The statistic d can be interpreted as topological dissimilarity of the networks in the neighborhood of the gene under consideration. Stronger differential correlation corresponds to lower p, but higher d.
Numbers of detected genes by DCglob and by DCloc
Algorithm  Threshold  ER  HER2  

Genes  FDR  Genes  FDR  
DCglob  p<0.1  630  12.1%  804  9.5% 
p<0.05  420  8.9%  544  6.9%  
DCloc  d>0.25  770  12.8%  1027  9.6% 
d>0.3  185  5.4%  238  4.2% 
Variation of the cutoff parameters
Differential correlation vs. differential expression
We looked for differential expression between the breast cancer subtypes using the standard approach of Welch’s test. After using the BenjaminiHochberg (BH) method for multiple testing correction and a FDR of 5%, 55% of all genes were differentially expressed between ER+ and ER, and 31% were significantly differentially expressed between HER2+ and HER2. Among the differentially correlated genes identified by DCglob (p<0.1), 76% were differentially expressed between ER+ and ER and 48% were differentially expressed between HER2+ and HER2. For DCloc (d>0.25), the percentages were similar (73% and 47%). Thus, DC analysis provided additional information beyond DE analysis. As an example, the marker genes for the invasive apocrine subtype of breast cancer acylCoA synthetase medium chain 1 (ACSM1) and hydroxyprostaglandin dehydrogenase (PGDH) exhibit strong differential correlation (DCglob p=8.0E05 and p=0.0004; DCloc d=0.26 and d=0.27), but they would not be detected in a differential expression analysis (p=0.21 and p=0.55 after BenjaminiHochberg correction).
Functional analysis of the differentially correlated genes
Gene enrichment analysis (ER+ vs. ER)
Category  Catalog  DCloc  DCglob  

N  p  N  p  
Extracellular matrix  GOTERM_CC_FAT  60  9.6E13  29  2.8E01 
Cell adhesion  GOTERM_BP_FAT  80  2.0E07  43  5.5E01 
Cell cycle  GOTERM_BP_FAT  79  2.2E05  63  3.7E03 
Immune response  GOTERM_BP_FAT  70  6.4E05  n.s.  
Growth factor binding  GOTERM_MF_FAT  22  1.5E04  n.s.  
Organelle fission  GOTERM_BP_FAT  30  1.6E03  24  4.2E02 
ECMreceptor interaction  KEGG_PATHWAY  18  3.9E03  n.s.  
Ribosome  KEGG_PATHWAY  18  5.5E03  n.s.  
Oxidoreductase  SP_PIR_KEYWORDS  42  8.3E02  42  2.2E02 
Gene enrichment analysis (HER2+ vs. HER2)
Category  Catalog  DCloc  DCglob  

N  p  N  p  
Translational elongation  GOTERM_BP_FAT  57  1.5E32  14  3.1E01 
Ribonucleoprotein  SP_PIR_KEYWORDS  83  7.1E31  46  1.2E09 
Ribosome  KEGG_PATHWAY  55  4.0E30  14  1.8E01 
Acetylation  SP_PIR_KEYWORDS  303  8.5E20  276  2.3E28 
Mitotic cell cycle  GOTERM_BP_FAT  77  6.8E13  65  7.8E12 
Regulation of ubiquitinprotein  GOTERM_BP_FAT  27  3.3E09  17  6.5E04 
ligase activity during  
mitotic cell cycle  
Immune response  GOTERM_BP_FAT  98  1.0E08  n.s  
Oxidative phosphorylation  KEGG_PATHWAY  28  1.7E04  30  4.5E08 
Mitochondrion  GOTERM_CC_FAT  93  4.0E01  106  6.6E07 
Proteasomal ubiquitindepend  GOTERM_BP_FAT  27  2.7E06  20  8.4E04 
ent protein catabolic process  
Mitochondrial membrane part  GOTERM_CC_FAT  26  1.3E04  25  4.5E06 
MHC protein complex  GOTERM_CC_FAT  13  3.5E04  7  1.7E01 
Growth factor binding  GOTERM_MF_FAT  22  8.5E03  n.s  
NADH dehydrogenase activity  GOTERM_MF_FAT  10  7.6E02  11  2.4E03 
ATP synthesis coupled  GOTERM_BP_FAT  12  2.7E02  12  5.9E03 
electron transport  
Antiapoptosis  GOTERM_BP_FAT  30  5.0E02  n.s 
Heatmap analysis
Cluster 1 (orange) included ACSM1 and PGDH, two genes that were described as markers for invasive apocrine carcinomas (IACs), a subgroup of ER breast cancer recently studied by Celis et al. [37, 38]. It was enriched with other markers of IAC (Table 5). Cluster 2 (darkblue) consisted of genes that are upregulated in HER2+ breast cancer, including ERBB2, GRB7, STARD3 and PSMD3 that are located in the HER2 amplicon [39]. It was significantly enriched for genes of the HER2 amplicon (Table 5). Cluster 3 (red) was highly enriched with marker genes for the androgenresponsive subgroup of ER breast cancer described by Doane et al. [40]. Cluster 4 (bluegreen) contained some cell cycle genes (CDC16, TFDP1). Cluster 5 (lightblue) contained FOXC1, a gene with regulatory and prognostic relevance in triplenegative breast cancer [41]. Cluster 6 (green) contained many genes that are related to ATPase activity (ATP5J, DHX15, CCT8, PSMC6 and ATP5O). Finally, Cluster 7 (purple) contained genes coding for keratins (KRT18, KRT19, KRT7, KRT8), claudins (CLDN3, CLDN4) and CD24. While part of the correlations in Cluster 2 (HER2), Cluster 4 and Cluster 6 (ATPase activity) are preserved in ER+ breast cancer, Cluster 1 (IAC markers), Cluster 3 (AR signaling) and Cluster 5 are completely rearranged in ER+ breast cancer.
DCloc identified only two different clusters. Among others, Cluster 1 (lightblue) contains genes related to the immune response (PTPRC, SIT1, CXCL13, FAIM3, IFI35). Like the third cluster identified by DCglob, the second cluster includes AR and FOXA1. Among the 25 genes present in this cluster, 18 are also present in Cluster 3 of the DCglob analysis. The two clusters identified by DCloc are not completely disarranged in the ER+ subtype, but the correlation between the genes in the clusters is much weaker in the ER+ compared to the ER subtype.
Network analysis
Reproducibility analysis
Further, we investigated the reproducibility of lists of differentially correlated genes in dependence of the sample size. When comparing the list of the top 5% genes in the training data set to the list of the top 5% genes in the validation data set, the reproducibility raised up to 33% for DCglob and to 49% for DCloc (Figure 5C). When relaxing the reproducibility condition to the list of the top 20% genes in the validation set, this percentage reached 64% for DCglob and 80% for DCloc (Figure 5D).
Discussion and conclusions
We developed two novel algorithms for the detection of differential correlation (DC) in highdimensional molecular data. Both approaches are untargeted in the sense that they do not depend on predefined gene modules and start with mapping of the correlation structures to networks. DCglob analyzes global changes of network topology, while DCloc analyzes local changes of the network topology. An innovative ingredient of the algorithms is the analysis of a series of networks that are constructed from a series of correlation thresholds. Therefore, detection of different kinds of correlation changes (strong changes of a few genes or moderate changes of many genes) is feasible. As default setting, the networks are constructed from an equidistant (after Fishertransformation) series of 100 correlation thresholds between 0 and the maximal correlation in the data set. For the global algorithm, we increased the number of thresholds to 200 to achieve a more precise ranking of the genes. Using 200 instead of 100 thresholds did not significantly change the lists of DC genes. For both algorithms, a statistic for the strength of DC (p or d) is calculated as average of topological changes over all correlation thresholds. Using averaging as method for summary ensures that changes at different correlation thresholds are taken into account with equal weight. This i appropriate in a situation without prior knowledge on the strength and biological relevance of correlation changes.
An essential part of the analysis is to work with two equalsized sample groups that are compared for differential correlations. The number of available HER2+ samples was 208 and therefore we randomly drew 208 ER+, 208 ER and 208 HER2 samples for analysis. Unequal samples size would lead to a different sensitivity and/or specificity for the detection of correlations and therefore would lead to false positive discoveries in the DC analysis. We therefore recommend to work with equalsized sample groups.
The results of DCloc and DCglob were consistent: Pearson correlations between −p and d were strong and highly significant (p<2E16) for the analysis of estrogen receptor status (R = 0.59) and the analysis of HER2 status (R = 0.64). Extensive subsampling analysis was carried out to demonstrate the significance of the findings, and to estimate FDRs associated with the lists of differentially correlated genes. Both algorithms detected a significant number of differentially correlated genes in molecular subtypes of breast cancer compared with the null model. In the analysis of breast cancer subtypes, DCloc turned out to be more sensitive than DCglob (Figure 2C). This was in line with a higher number of significantly enriched functional categories for DCloc(Table 3). Additionally, the performance of DCloc was considerable better than DCglob in terms of reproducibility when analyzing two independent breast cancer data sets (Figure 5). Assuming that 1/3 of the detected genes should reproduce in twofold crossvalidation, 90 patients in each group were appropriate for DCloc and 130 patients in each group were appropriate for DCglob. However, the results of the global algorithm were more easily interpretable in the heatmap analysis (Figure 3).
Gene enrichment analysis revealed significantly enriched terms within the lists of differentially correlated genes. We tested heatmaps and networks as tools for visualization and indepth analysis of the correlation changes. Clustering and heatmap analysis turned out to be particularly useful for a biological interpretation of the correlation patterns. ER tumors tend to be more aggressive and are more difficult to treat than ER+ tumors. Therefore, we analyzed the correlation patterns in ER tumors in more detail. Interestingly, among the clusters of genes detected by DCglob, there were many genes that could serve markers for substratification of the ER or the triplenegative subtype (see Table 5).
We detected different types of changes in the correlation patterns between ER+ and ER breast cancer. For example, the structure of the cluster associated with the HER2 amplicon (Cluster 2) was relatively well preserved. However, there was a difference in correlation strength, which can be possibly explained by the unequal distribution of the HER2 tumors between the ER+ and the ER subtype (see Table 1). In contrast, there were gene clusters with strong correlation in ER breast cancer and almost no correlation in ER+ breast cancer (Clusters 1, 3 and 5).
As an example, the genes in Cluster 1 were strongly correlated in ER breast cancer, but most of them were uncorrelated in ER+ breast cancer. In the marker gene enrichment analysis, we found a highly significant overlap between these genes and marker genes of invasive apocrine carcinomas (IACs). In fact, the genes of Cluster 1 were highly expressed in a small subgroup of triplenegative (ER and HER2) breast cancer (10 tumors), but poorly expressed in the remaining ER tumors and in all ER+ tumors. The two major marker genes for IACs [37], PGDH and ACSM1, were highly differentially correlated, but not differentially expressed between the ER and the ER+ subtype.
Interestingly, the genes in Cluster 3 were highly expressed in the ER/HER2+ tumors and the IACs, but not in the remaining triplenegative tumors. The marker gene enrichment analysis demonstrated a highly significant overlap of this cluster with AR signaling. The AR gene itself turned out to be considerably higher expressed in ER+ tumors compared to ER tumors (fold change =3.01, p=1.2E30). This result is in agreement with immunohistochemical data showing that the number of androgen receptor positive tumors is larger in the ER+ subtype [42]. Further analysis showed that the expression of most of the genes of the AR signaling cluster was high in the ER+ tumors, but variable in the ER tumors. Thus, the low correlation of the AR signaling genes in ER+ tumors is a consequence of the missing variance of the pathway in these tumors, where it is always highly expressed. These observations suggest that AR signaling is always active in ER+ tumors, while it is under regulation (active or inactive) in ER tumors. AR signaling based stratification is of interest in the ER subtype, but not in the ER+ subtype, in line with a recent result that high AR protein expression was associated with better survival in triplenegative breast cancer [42].
In summary, functionally relevant pathways (Table 3, 4 and 5) could be identified that show highly correlated gene expression in one of the subtypes, but not in the complementary subtype. Coexpression of a pathway is likely to be a consequence of pathway regulation, for example by transcription factors. A pathway highly expressed in a cancer subtype is potentially supporting or even essential for the growth of tumors cells in this kind of tumor. Thus, the clusters identified by DC analysis can be beneficial for patient stratification and can represent interesting targets for new therapies. In this context we reidentified the HER2amplicon that is successfully targeted by trastuzumab or other antiHER2 drugs [43].
These examples illustrate that DC analysis, in particular network topology based approaches, can help to identify biologically important gene clusters beyond the results of conventional DE analysis. Using DCloc and DCglob, we detected hundreds of differentially correlated genes in the molecular subtypes of breast cancer, while keeping the FDR under control. In a twofold crossvalidation approach we showed that results of both algorithms were reproducible in an independent data set.
Within the last decade, a multitude of methods and algorithms were developed for the detection of differential correlations. The algorithms address different biological questions and it is difficult to decide which of the algorithms works best in terms of power and of biological interpretability. However, statistical properties like specificity and reproducibility could be evaluated in a comparable way for many of the algorithms. We believe that statistical evaluation should be stronger emphasized in future studies of differential correlations.
Abbreviations
 DC:

Differential correlation
 DE:

Differential expression
 FDR:

False discovery rate
 ER:

Estrogen receptor
 HER2:

Human epidermal growth factor receptor 2
 IHC:

Immunohistochemistry
 BH:

BenjaminiHochberg
 IAC:

Invasive apocrine carcinoma
 AR:

Androgen receptor.
Declarations
Acknowledgements
This project was funded by the European Commission, FP7 grant #278659 (RESPONSIFY).
Authors’ Affiliations
References
 Allison D, Cui X, Page G, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 5565. 10.1038/nrg1749.PubMedView ArticleGoogle Scholar
 Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proc Nat Acad Sci. 1998, 95 (25): 1486314868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
 Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, BørresenDale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Nat Acad Sci. 2001, 98 (19): 1086910874. 10.1073/pnas.191367098.PubMedPubMed CentralView ArticleGoogle Scholar
 Li KC: Genomewide coexpression dynamics: Theory and application. Proc Nat Acad Sci. 2002, 99 (26): 1687516880. 10.1073/pnas.252466999.PubMedPubMed CentralView ArticleGoogle Scholar
 Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential genegene coexpression patterns. Bioinformatics. 2004, 20 (17): 31463155. 10.1093/bioinformatics/bth379.PubMedView ArticleGoogle Scholar
 Kostka D, Spang R: Finding disease specific alterations in the coexpression of genes. Bioinformatics. 2004, 20 (Suppl 1): i194i199. 10.1093/bioinformatics/bth909.PubMedView ArticleGoogle Scholar
 de la Fuente A: From ‘differential expression’ to ‘differential networking’ – identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010, 26 (7): 326333. 10.1016/j.tig.2010.05.001.PubMedView ArticleGoogle Scholar
 Tesson B, Breitling R, Jansen R: DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics. 2010, 11: 49710.1186/1471210511497.PubMedPubMed CentralView ArticleGoogle Scholar
 Choi Y, Kendziorski C: Statistical methods for gene set coexpression analysis. Bioinformatics. 2009, 25 (21): 27802786. 10.1093/bioinformatics/btp502.PubMedPubMed CentralView ArticleGoogle Scholar
 Southworth L, Owen A, Kim S: Aging mice show a decreasing correlation of gene expression within genetic modules. PLoS Genet. 2009, 5 (12): e100077610.1371/journal.pgen.1000776.PubMedPubMed CentralView ArticleGoogle Scholar
 Taylor I, Linding R, WardeFarley D, Liu Y, Pesquita C: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009, 27: 199204. 10.1038/nbt.1522.PubMedView ArticleGoogle Scholar
 Teschendorff A, Severini S: Increased entropy of signal transduction in the cancer metastasis phenotype. BMC Syst Biol. 2010, 4: 10410.1186/175205094104.PubMedPubMed CentralView ArticleGoogle Scholar
 Ihmels J, Bergmann S, Berman J, Barkai N: Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet. 2005, 1: e3910.1371/journal.pgen.0010039.PubMedPubMed CentralView ArticleGoogle Scholar
 Watson M: CoXpress: differential coexpression in gene expression data. BMC Bioinformatics. 2006, 7: 509509. 10.1186/147121057509.PubMedPubMed CentralView ArticleGoogle Scholar
 Choi J, Yu U, Yoo O, Kim S: Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics (Oxford, England). 2005, 21 (24): 43484355. 10.1093/bioinformatics/bti722.View ArticleGoogle Scholar
 Voy B, Schar J, Perkins A, Saxton A, Borate B, Chesler E, Branstetter L, Langston M: Extracting gene networks for lowdose radiation using graph theoretical algorithms. PloS Comput Biol. 2006, 2 (7): e8910.1371/journal.pcbi.0020089.PubMedPubMed CentralView ArticleGoogle Scholar
 Zhang B, Li H, Riggins RB, Zhan M, Xuan J, Zhang Z, Hoffman EP, Clarke R, Wang Y: Differential dependency network analysis to identify conditionspecific topological changes in biological networks. Bioinformatics. 2009, 25 (4): 526532. 10.1093/bioinformatics/btn660.PubMedPubMed CentralView ArticleGoogle Scholar
 Altay G, Asim M, Markowetz F, Neal D: Differential C3NET reveals disease networks of direct physical interactions. BMC Bioinformatics. 2011, 12: 29610.1186/1471210512296.PubMedPubMed CentralView ArticleGoogle Scholar
 Amar D, Safer H, Shamir R: Dissection of regulatory networks that are altered in disease via differential coexpression. PLoS Comput Biol. 2013, 9 (3): e100295510.1371/journal.pcbi.1002955.PubMedPubMed CentralView ArticleGoogle Scholar
 Gruvberger S, Ringnér M, Chen Y, Panavally S, Saal LH, Borg A, Fernö M, Peterson C, Meltzer PS: Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res. 2001, 61 (16): 59795984.PubMedGoogle Scholar
 van’t Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530536. 10.1038/415530a.View ArticleGoogle Scholar
 Budczies J, Weichert W, Noske A, Müller B, Weller C, Wittenberger T, Hofmann H, Dietel M, Denkert C, Gekeler V: Genomewide gene expression profiling of formalinfixed paraffinembedded breast cancer core biopsies using microarrays. J Histochem Cytochem. 2011, 59 (2): 146157. 10.1369/jhc.2010.956607.PubMedPubMed CentralView ArticleGoogle Scholar
 Freudenberg J, Sivaganesan S, Wagner M, Medvedovic M: A semiparametric Bayesian model for unsupervised differential coexpression analysis. BMC Bioinformatics. 2010, 11: 23410.1186/1471210511234. [http://www.biomedcentral.com/14712105/11/234] 10.1186/1471210511234PubMedPubMed CentralView ArticleGoogle Scholar
 Tegge AN, Caldwell CW, Xu D: Pathway correlation profile of genegene coexpression for identifying pathway perturbation. PLoS ONE. 2012, 7 (12): e5212710.1371/journal.pone.0052127.PubMedPubMed CentralView ArticleGoogle Scholar
 R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, [http://www.Rproject.org/] [ISBN 3900051070]Google Scholar
 Steiger J: Test for comparing elements of a correlation matrix. Psychol Bull. 1980, 87: 245251.View ArticleGoogle Scholar
 Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003, 100 (16): 94409445. 10.1073/pnas.1530509100.PubMedPubMed CentralView ArticleGoogle Scholar
 Edgar R, Domrachev M, Lash AE: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207210. 10.1093/nar/30.1.207.PubMedPubMed CentralView ArticleGoogle Scholar
 Györffy B, Lanczky A, Eklund AC, Denkert C, Budczies J, Li Q, Szallasi Z: An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Res Treat. 2010, 123 (3): 725731. 10.1007/s1054900906749.PubMedView ArticleGoogle Scholar
 Gautier L, Cope L, Bolstad BM, Irizarry RA: Affy—analysis of affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307315. 10.1093/bioinformatics/btg405.PubMedView ArticleGoogle Scholar
 Müller BM, Kronenwett R, Hennig G, Euting H, Weber K, Bohmann K, Weichert W, Altmann G, Roth C, Winzer KJ, Kristiansen G, Petry C, Dietel M, Denkert C: Quantitative determination of estrogen receptor, progesterone receptor, and HER2 mRNA in formalinfixed paraffinembedded tissue–a new option for predictive biomarker assessment in breast cancer. Diagn Mol Pathol. 2011, 20: 110. 10.1097/PDM.0b013e3181e3630c.PubMedView ArticleGoogle Scholar
 Csardi G, Nepusz T: The igraph software package for complex network research. InterJournal Complex Systems. 2006,, 1695Google Scholar
 Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T: Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011, 27 (3): 431432. 10.1093/bioinformatics/btq675.PubMedPubMed CentralView ArticleGoogle Scholar
 Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID Bioinformatics resources. Nat Protoc. 2009, 4 (1): 4457.View ArticleGoogle Scholar
 Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1): 113. 10.1093/nar/gkn923.PubMed CentralView ArticleGoogle Scholar
 Bauer K, Parise C, Caggiano V: Use of ER/PR/HER2 subtypes in conjunction with the 2007 St Gallen consensus statement for early breast cancer. BMC Cancer. 2010, 10: 22810.1186/1471240710228.PubMedPubMed CentralView ArticleGoogle Scholar
 Celis JE, Gromov P, Cabezon T, Moreira JMA, Friis E, Jirstrom K, LlombartBosch A, TimmermansWielenga V, Rank F, Gromova I: 15Prostaglandin dehydrogenase expression alone or in combination with ACSM1 defines a subgroup of the apocrine molecular subtype of breast carcinoma. Mol Cell Proteomics. 2008, 7 (10): 17951809. 10.1074/mcp.R800011MCP200.PubMedView ArticleGoogle Scholar
 Celis JE, Cabezon T, Moreira JM, Gromov P, Gromova I, TimmermansWielenga V, Iwase T, Akiyama F, Honma N, Rank F: Molecular characterization of apocrine carcinoma of the breast: Validation of an apocrine protein signature in a welldefined cohort. Mol Oncol. 2009, 3 (3): 220237. 10.1016/j.molonc.2009.01.005.PubMedView ArticleGoogle Scholar
 Staaf J, Jonsson G, Ringner M, VallonChristersson J, Grabau D, Arason A, Gunnarsson H: Highresolution genomic and expression analyses of copy number alterations in HER2amplified breast cancer. Breast Cancer Res. 2010, 12 (3): R2510.1186/bcr2568.PubMedPubMed CentralView ArticleGoogle Scholar
 Doane A, Danso M, Lal P, Donaton M, Zhang L, Hudis C, Gerald W: An estrogen receptornegative breast cancer subset characterized by a hormonally regulated transcriptional program and response to androgen. Oncogene. 2006, 25 (28): 39944008. 10.1038/sj.onc.1209415.PubMedView ArticleGoogle Scholar
 Ray PS, Wang J, Qu Y, Sim MS, Shamonki J, Bagaria SP, Ye X, Liu B, Elashoff D, Hoon DS, Walter MA, Martens JW, Richardson AL, Giuliano AE, Cui X: FOXC1 is a potential prognostic biomarker with functional significance in basallike breast cancer. Cancer Res. 2010, 70 (10): 38703876. 10.1158/00085472.CAN094120.PubMedView ArticleGoogle Scholar
 Loibl S, Müller B, von Minckwitz G, Schwabe M, Roller M, DarbEsfahani S, Ataseven B, du Bois A, FisslerEckhoff A, Gerber B, Kulmer U, Alles J, Mehta K, Denkert C: Androgen receptor expression in primary breast cancer and its predictive and prognostic value in patients treated with neoadjuvant chemotherapy. Breast Cancer Res Treat. 2011, 130 (2): 47787. 10.1007/s1054901117158.PubMedView ArticleGoogle Scholar
 Smith I, Procter M, Gelber R, Guillaume S, Feyereislova A, Dowsett M, Goldhirsch A, Untch M, Mariani G, Baselga J: 2year followup of trastuzumab after adjuvant chemotherapy in HER2positive breast cancer: a randomised controlled trial. Lancet. 2007, 369 (9555): 2936. 10.1016/S01406736(07)600282.PubMedView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.