Maximizing capture of gene co-expression relationships through pre-clustering of input expression samples: an Arabidopsis case study

Background In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases networks may be constructed from input samples that measure gene expression under a variety of different conditions, such as for different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories it is often unmanageable to associate samples into condition-specific groups, and combining samples from various conditions has a negative effect on network size. A fixed significance threshold is often applied also limiting the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples and individual network construction of each group as a means for dynamic significance thresholding. The net effect is increase sensitivity thus maximizing the total co-expression relationships in the final co-expression network compendium. Results A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an un-supervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold for the network. The overall gene count across all GILs reached 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded only 3,297 genes (15.9%) and 129,134 relationships. in the global network. Conclusions Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using un-supervised methods. Because RMT ensures only highly significant interactions are kept, the GIL compendium consists of 558,022 unique high quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and are a means to explore complex pathway, polygenic, and pleiotropic relationships for this focal model plant. The networks can be explored at sysbio.genome.clemson.edu. Finally, this method is applicable to any large expression profile collection for any organism and is best suited where a knowledge-independent network construction method is desired.


Background
Cellular processes underlying expression of complex traits have a major impact on health, agriculture, and our understanding of general biology. The gene co-expression network (GCN) models coordinated gene expression across a series of input data sets such as microarrays [1,2]. In a GCN, nodes represent genes, and edges describe significant gene co-expression relationships. The GCN also exhibits properties common to most naturally occurring networks such as scale-free, small world and hierarchical topology [3,4]. Due to the availability of large quantities of publically available expression data and the relative ease of construction, GCNs have been constructed for a broad array of organisms including human [2,5,6], yeast [7][8][9], Arabidopsis [10][11][12][13], rice [14,15], maize [16], potato [17] and many more. These networks have elucidated gene sets involved in varied biological systems including cell wall biosynthesis [13], mouse weight [18], and complex trait expression [19][20][21][22].
A GCN is constructed by performing pair-wise correlation analysis of every gene on the array. Typically, Pearson's Correlation Coefficient (PCC) is used, and a large n x n matrix of PCC values is obtained where n is the number of measurable genes. An analytical method is then employed to identify the level at which correlation values should be thresholded to yield biologically meaningful co-expression relationships. Often, co-expression networks are built for analysis of differentially connected genes between two or more experimental conditions (environment, disease state, genotype, or tissue type) [23][24][25]. In these cases networks are constructed separately for each condition and the context of the connectivity can then be examined to identify modules (or gene sets) putatively causal for phenotypic traits. Genes with known causality can be used to guide selection of modules.
In some cases, global co-expression networks are created using a large number of samples from publicly available repositories-the goal being to mine knowledge across the compendium of samples not previously identified through individual experiments. For thousands or tens-of-thousands of samples two challenges occur. First, classification of samples into conditions becomes manually intractable and sample descriptions are often insufficient to automate proper classification. Therefore, conditional context is often ignored, samples are combined and similarity measurements are calculated across all samples. However, this treats all genes and samples (with varying conditions) equally when calculating similarity scores. As the number of samples with different conditions increases in the input dataset, the number of significant co-expression relationships decreases [26]. Therefore, networks built from a set of samples with a large number of conditions tend to be small. Cheng and Church recognized the illogic of weighting all genes and samples the same for similarity calculation and proposed biclustering of expression data [27] to generate biclusters of genes with similar expression in similar contexts (conditions). Many types of biclustering methods have been developed [28]. However, for large disparate sample sets, the difficulty in classifying samples into conditional groups makes biclustering difficult.
A second challenge for network construction is identification of a proper significance threshold. Many methods have been employed for significance thresholding. These include ad hoc methods [1,[29][30][31], permutation testing [5], linear regression [13], rank-based methods [32,33], Fisher's test of homogeneity [34], spectral graph theory [35], Random Matrix Theory (RMT) [36,37], Partial Correlation and Information Theory (PCIT) [26], methods that use topological properties [38], and supervised machine learning [39,40]. In some cases a constant threshold is applied to the entire network. While significant relationships can be found using a constant threshold, a dynamic threshold is better suited for context-dependent co-expression variability. Dynamic thresholding can increase sensitivity and decrease false-positives. Rank-based methods, PCIT and supervised machine learning methods all employ a form of dynamic thresholding.
To address the challenges of large sample, large conditional network construction, we use a method of segregating input samples into groups before network construction. Without knowledge of sample conditions, we approximate expression conditions by pre-clustering of input gene-expression profiles into groups, and apply dynamic significance thresholding through independent network construction for each sample group. We refer to the network from each group as a Gene Interaction Layer (GIL). The GIL compendium represents an attempt to maximize the capture of all possible interactions across all samples ( Figure 1). The GIL compendium also allows for gene and gene-gene interactions to exist in multiple GILs and therefore provides a framework for the analysis of intersecting biological pathways and potentially pleiotropic interactions. To demonstrate the effectiveness of this approach, we generated a GIL collection for the model plant Arabidopsis thaliana (Arabidopsis) by pre-cluste-ring 7,105 input samples. Our results indicate that the Arabidopsis GIL compendium represents a dramatic improvement in capture of gene co-expression relationships.

Results
Arabidopsis GILs were constructed by partitioning 7,105 publicly available Affymetrix ATH1 microarray RNA expression samples using a network construction pipeline [41] which implements Random Matrix Theory (RMT) for biological signal thresholding [9]. Samples were RMA [42] normalized prior to pre-clustering. Additionally, a typical "global" network was constructed using all normalized input samples. The global network comprised 3,297 nodes and 129,134 edges (average degree, <k> = 78) representing 15.9% of the known Arabidopsis genes as measured by the ATH1 platform. As mentioned previously, this small network size is expected given the mixture of multiple sample conditions. To improve capture of measurable genes and co-expression relationships, we partitioned the input samples into K groups with similar expression patterns using the K-means clustering algorithm [43]. Each cluster of samples were then RMA normalized again within their respective groups and coexpression networks were built for each cluster.
The effects of K size on the GIL collection The choice of K size in k-means clustering should have an impact on the topological properties of each GIL. To quantify these differences we performed K-means clustering nine times at K sizes of 30, 60, 70, 80, 90, 120, 150, 180, and 210, yielding 990 total groups. The average array count for each K size ranged from 236.8 (K=30) to 33.8 (K=210) arrays. Changes in the various topological properties including clustering co-efficient, closeness, betweenness, page rank, etc., between different K sizes can be found in Additional file 1: Figure S1. In summary, relative to the global network, the average number of nodes in each group was lower and ranged from 2,090 (K=30) to 1,819 (K=150) ( Figure 2). The average number of edges in each group was much lower relative to the global network and declined with increasing K size ranging from 12,265 (K=30) to 6,421 (K=210) (Figure 2).
To test if the degree of pre-clustering of samples had a negative effect on network topology, we randomly reassigned 50% of the samples among the groups and reconstructed the networks. This re-assignment occurred at K sizes of 30, 50, 60, 70, 80, 90, and 120. Sample re-assignment had a noticeable effect on scale-free behavior as Global Network

Gene Interaction Layers: GIL Compendium Partitioned Datasets
Gene Expression Datasets Figure 1 Deconstructing a global network into gene interaction layers (GILs). Gene expression datasets can be used to construct a co-expression network in total (global) or sorted by expression pattern into groups prior to network construction (GILs). exhibited by a decrease fit to the Kronecker scale-free graph model [44] and decrease in average scaling exponent γ (Figure 3). The scaling exponent for the global network was 1.36. Therefore, the randomization of samples between pre-clustered groups seemed to generate networks with properties more similar to the "global" network. However, despite differences in fit to the Kronecker scale-free graph model and to the scaling exponent, all networks exhibited scale-free behavior; therefore each K size generated networks that appear realistic.

Identification of the best K size
To identify the best K size, we examined the network from each K group for total gene capture, functional annotation enrichment, and module count. First, we examined total node capture among the GILs from each K group. While the global network captured 15.9% of possible genes measured by the array platform, the total gene space captured across all networks of a K group increased with increasing K size ( Figure 4). At K=180, for example, 99.6% of all measurable genes were captured at least once. Second, we modularized the networks using MCL [45] and tested for function term enrichment (Bonferroni p < 0.001) for each module relative to whole genome background using Gene Ontology (GO) terms [46], KEGG pathways [47], Interpro protein domains [48], Plant Ontology (PO) terms [49], and AraCyc pathways [50]. In general, the number of total enriched terms tended to increase along with K size (Additional file 1: Figure S2). However, the number of unique enriched terms tended to plateau at K=90. Third, we investigated modular behavior by examining the average number of modules for each K. For this test we used a second module detection algorithm called link-community detection [51]. The link-community method allows for nodes to be present in more than one module which is more representative of shared genes found in intersecting pathways. The average number of link-community modules (LCMs) per GIL decreased as K size increased while the number of MCL modules rose slightly (Additional file 1: Figure S3). Therefore, to balance a maximum number of nodes, edges, functional enrichment, and high module count representing both coverage (MCL) and multifunctionality (LCM), K=90 was chosen as the K-means clustering size for the Arabidopsis GIL collection.

Summary of the K 90 GIL collection
In summary, the K90 GIL collection consisted of 86 GILs and contained a total of 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. Counting the experiments contained in the K90 GIL collection revealed that each GIL was comprised of multiple NCBI GEO experiments (μ = 78.9 arrays; σ=53.7; Additional file 2: Table S1) derived from a blend of GEO series (μ=12.3; σ=10.7), indicating that GILs were not simply the product of segregating datasets into the original experiment groups. After module detection using the Markov Clustering (MCL; [45]) and link-community [51] methods, we circumscribed 38,234 MCL modules, encapsulating 94.7% of measurable genes and 26,570 linkcommunity modules capturing 71.0% of genes, each with a median occurrence of six genes per module (min = 2 and max = 707) ( Table 1). Four GILs (43,46,52,70) in the K90 collection did not construct due to high levels of Figure 3 GILs are scale free at varied partition sizes. Primary graph shows log-likelihood fit to a Kronecker scale-free graph model at k-means cluster sizes K. Clusters were randomized at 0% (solid bars) and 50% (striped bars) prior to network construction. Inset graph shows average scaling exponent (<γ>) for GILs constructed from K-means of cluster size K at 0% (solid line) and 50% (dashed line) randomization.
correlation between genes. For MCL modules, we identified 867 enriched function terms (657 unique) within the global network while the K90 GIL collection yielded 37,614 enriched function terms (4,789 unique)a 7.2fold improvement in representation of unique functional terms in the GILs.
To test the amount of unique function captured by kmeans presorting, we compared the K=90 GIL set to networks constructed by randomizing the input samples at 50% and 100%, leaving cluster number and sizes intact. The total number of unique nodes (gene capture) decreased from 19,588 (at 0% randomization) to 13,568 (at 100% randomization), yet the number of unique edges increased from 558,023 (at 0%) to 705,399 (at 100%). Interestingly, this higher level of connectivity between smaller numbers of genes resulted in more MCL modules ranging from 39,931 (at 0%) to 47,623 (at 100%) as well as more total enriched function terms ranging from 37,614 (at 0%) to 64,236 (at 100%) (Additional file 1: Figure S2). However, many of these enriched functions were re-captured terms since the number of unique enriched terms was reduced from 4,789 (at 0%) to 3,518 (at 100%). Thus, K-means pre-sorting of datasets prior to network construction appears to have a positive effect on capture of unique biological function.

Exploring GIL function
Gene expression samples in the K90 GIL collection were not segregated by specific experimental conditions, but rather segregated using knowledge-independent pre-clustering. Therefore, annotation of the experimental conditions of samples may help describe the biological context in each GIL. Experimental descriptions from the samples in GEO confer limited information but common keywords from these descriptions can provide valuable insight. Therefore, we attempted to annotate each GIL for biological context by using highly represented keywords from the GEO experiment descriptions. We expected two GILs would be more similar in function if they shared genes between modules. Therefore, GILs were ordered using average linked hierarchical clustering with shared node count between GIL LCMs as the similarity metric (Additional file 1: Figure S4). This procedure resulted in groups of GILs that showed non-random similarity in assigned keywords (p < 0.05) for keywords "leaf", "root", "seed", "seedling", "iron", and "mutant" (significant keywords are indicated as red boxes in Additional file 1: Figure S4). However, overlap in keywords is evident between GILs indicating that GILs are not solely comprised of samples from a single experimental condition.
To determine if the GIL keyword assignment was biologically relevant, GILs were tested for an increase in relevant, enriched functional terms (Columns in Additional file 1: Figure S4). First, GILs were distributed into keyword and non-keyword groups for five different keywords ( Figure 5). For example, 10 GILs with the "auxin" keyword were placed in an auxin group while the remaining 76 GILs were placed in a non-auxin group. Analogous   groups were constructed for "leaf", "light/dark", "root" and "stress". The number of specific enriched words from functional terms (e.g. *auxin*, *photo*, *stress*, *ribosome*; * = wildcard) found to be functionally enriched in LCMs were counted in both keyword and non-keyword groups. The ratio of these enriched terms between keyword and non-keyword groups can be found in Figure 5.
For the auxin group, there was a 6.78 fold increase of enriched functional terms that contained "auxin". In addition, we performed functional term enrichment on all genes in a GIL as final indicator of GIL context (Additional file 3: Table S2).

Discussion
The partitioning method we describe dramatically increases the capture of significant co-expression relationships by segregating expression profiles into similar groups prior to correlation calculation. In effect, this approach is a dynamic significance thresholding strategy where a local hard threshold is determined for each GIL separately using the correlation matrix of each GIL. It is different from other dynamic thresholding methods which perform thresholding using the global correlation matrix. With pre-clustering of samples, correlation calculations occur locally for each GIL, allowing for existing threshold detection methods, such as RMT, to be used in a dynamic way. Naturally, there will be some false positives that surpass any thresholding method, but the lack of knowledge of true positive co-expression relationships make sensitivity and specificity difficult to quantify. A second benefit for pre-clustering is reduction of complexity that results from mixing of samples from various conditions. It has been shown that as the number of conditions in a sample set increase, the distribution curve of correlation changes such that most correlations are centered around zero [26]. Thus, there are fewer high correlations and fewer significant co-expression relationships, and hence smaller networks as more conditions are present. Pre-sorting of samples groups them by similar overall expression levels, reduces the complexity of the dataset and mimics separation of samples by condition.
A third benefit to un-supervised pre-sorting of samples is that limited human understanding does not constrain the network. For example, grouping a set of samples into a control group (condition #1) and a disease state group (condition #2), may overlook changes in gene expression due to differences in genotype, tissue type and environmental conditions between the various individuals in the study. Even for carefully designed studies, confounding effects due to unrecognized or immeasurable conditions may hide subtle expression relationships. By pre-sorting without bias towards specific conditions, the underlying expression levels of each sample dictate membership in a group. The resulting GIL can therefore be multi-conditional which is more realistic than the idea that co-expression within a GIL is specific to a single condition. As mentioned previously, we do not see a single condition present for a GIL, although we did see some evidence that some GILs may be more representative of certain biological contexts.
Fourth, the GIL collection provides a novel framework for exploration of gene co-expression. We do not combine the co-expression relationships from all GILs into a single global network-GILs remain as separate entities. The modules within a single GIL provide a unique set of relationships for a unique biological function. A module in one GIL with a similar set of genes as a module in another GIL may have different co-expression relationships. This redundancy provides a realistic framework where genes and pathways are represented in different condition contexts. While the exact biological context of each GIL is unknown, differential connectivity between similar modules can indicate how co-expression changes between genes of interest. It is clear from our attempts to annotate the GILs that more research is required to assign conditions to the GILs and place genes and gene interactions into experimental contexts.
A disadvantage to K-means clustering for pre-sorting of input samples is that samples can only belong to a single GIL. It seems realistic to believe that the gene expression in one sample could share similarity with samples from several GILs. Biclustering draws on this idea that gene coexpression is a product of overlapping genes and samples. Also, Ruan et. al. constructed a sample co-expression network where samples were nodes and edges existed when samples shared similar expression patterns [52]. They show that module detection of a "sample network" yields modules with samples that share a similar biological context (e.g. lymphoma cell types). It seems logical to therefore use a link-community algorithm for detecting sample modules in such a sample network, from which we could therefore construct GILs where samples can be members of multiple GILs. Additionally, we used K-means clustering to segregate the expression datasets, but other clustering approaches, such as that proposed by Ruan et. al, or other commonly used clustering techniques, could also be effective in pre-sorting of samples.
Finally, is the A. thaliana interactome represented by the K90 GIL compendium comprehensive? While we detected 558,022 unique gene interactions, there were 1,584,378 total interactions, as well as gene modules that intersect through shared nodes between GILs. While an accurate protein coding gene count for A. thaliana has yet to be determined, a recent tally identified 27,416 loci and 35,386 transcript variants (TAIR10 build). The ATH1 microarray platform interrogates 20,677 of these known genes albeit with little transcript variant discriminatory power. The K90 GIL collection captured 19,588 loci in 558,022 temporal gene interactions. Extrapolating to include genes not measured by the array, we estimate the unique Arabidopsis co-expression interactome to be a minimum of 589,045 binary interactions. While this number is certain to increase with higher resolution RNA expression measurements (e.g. deep RNAseq sampling) obtained under additional experimental conditions, it does provide a baseline model of the RNA co-expression interactome.
An interactome estimate on a 10 5 scale is not unprecedented for A. thaliana. A recent A. thaliana study estimated the baseline protein interaction space to be 299,000 ± 79,000 (μ ± SD) without accounting for protein isoforms. This estimate was based on approximately 2% (2,661 peptides) of the predicted interactome [53]. While not statistically significant, 2.1% (237/11,374 edges) of the PPI interactions overlapped with the K90 GIL collection. It is intriguing that a PPI network predicted to represent 2% of the interactome overlaps with 2% of interactome modeled by the GIL collection. We hypothesize that the capture of interactome space represented in the K90 GIL collection approaches the maximal interaction space of the measured genes across all the input samples.

Conclusions
Our results indicate that pre-clustering transcriptome measurements by expression similarity prior to co-expression network construction captures more significant co-expression relationships than networks constructed without pre-clustering. In summary, we believe that this approach is simple, reduced-bias, and practical to reduce noise in large expression dataset collections. The enhanced resolution afforded by the GIL collection provides a more holistic platform for improved identification of gene modules that can be interrogated for novel biochemical pathways, used to assign new genes to known pathways, predict gene sets causal for important complex traits, and indicate pleiotropic relationships within and between modules derived from specific GILs in the GIL collection. The A. thaliana K90 GIL compendium described here is therefore the most comprehensive and we believe the most natural model of the gene co-expression interactome for a plant species. The method should be applicable to any organism where a large number of gene expression profiling datasets have been generated.

Partitioning expression datasets by K-means clustering
Expression measurements used for network construction were obtained from publicly available Affymetrix Arabidopsis ATH1 Genome Array experiments available in NCBI GEO (platform GPL198). A total of 7,158 samples were obtained in November 2011. RMA normalization [42] was performed for all samples together using the command-line utility of RMAExpress (http://rmaexpress.bmbolstad.com).
Sample outlier detection was performed using the arrayQualityMetrics [54] tool for Bioconductor [55]. Samples that failed two of the three outlier tests were removed from the dataset. The normalized expression file, an n × m matrix of expression values with 22,746 rows of probe sets (no control) and 7,105 columns of samples (Additional file 2: Table S1), was clustered by similarity of expression values using K-means clustering via the kmeans function of R. K-means clustering was repeated for different values of K at 30, 60, 70, 80, 90, 120, 150, 180, and 210.

Co-expression network construction
Each cluster of samples derived from the K-means clustering was used to construct individual networks-one for each cluster. Additionally, a "global" network was constructed consisting of all available samples. The network construction process was performed on each dataset cluster which included: 1) RMA normalization of the clustered datasets and not all possible arrays; 2) removal of control probes; 3) outlier sample removal; 4) removal of ambiguous probe sets; 6) calculation of a similarity matrix of pair-wise Pearson correlation values between all probe sets; 7) use of Random Matrix Modeling (RMT) [9] to identify a significant threshold to cull the similarity matrix; 8) conversion of the thresholded ATH1 probeset similarity matrix to an Arabidopsis gene network; and 9) module detection. Details specific to each step are further described.
The software package RMAExpress (http://rmaexpress. bmbolstad.com) was used to normalize each cluster individually. The samples of each cluster were provided as input, and the Chip Description File (CDF) was obtained from the Affymetrix website. A file containing an n x m expression matrix of normalized expression values, where n is the number of probe sets and m is the number of samples, was generated for each cluster. The rows of the expression matrix were then reduced in size through the removal of control probes using an in-house Perl script. Outlier samples within the cluster itself were identified using the arrayQualityMetrics [54] tool for Bioconductor [55]. The columns of the expression matrix were then reduced in size through removal of the outlier samples using an in-house Perl script. The rows of the expression matrix were then reduced again to remove any ambiguous probesets. Ambiguous probe sets are those that potentially hybridize to multiple genes in the Arabidopsis genome. Ambiguous probe sets were identified using megablast from the NCBI BLAST [56] package to align a FASTA file of probe sequences obtained from the Affymetrix website with the Arabidopsis coding sequences (CDS) obtained from the TAIR website (Lamesch et. al. 2012). Parameters for megablast set the word size to 25 (-W 25, the length of the probe sequence), and disabled low-complexity filtering (-F F). An in-house Perl script counted the probe to gene mappings and identified ambiguous probe sets. Pearson correlation coefficients were calculated using optimized C-code and an m x m similarity matrix was produced, where m is the number of remaining probe sets. Random Matrix Theory, which employs threshold detection using the nearest neighbor spacing distribution of eigenvalues, was used to identify an appropriate threshold for filtering the similarity matrix [9]. Optimized C-code, produced in-house, was used for RMT thresholding. An adjacency matrix is effectively produced by ignoring values (or setting values to zero) in the similarity matrix that are less than the threshold. Using an in-house Perl script, a network file was constructed, and edges in the network correspond to probe sets in the adjacency matrix that have a non-zero correlation value. Using mappings of probe set to TAIR gene models derived earlier with megablast, the probe set-based network was converted to a gene-based network. Edge lists for all K90 GILs and the Global network can be found in Additional file 4 and online at sysbio.genome.clemson. edu. Finally, modules, or sets of closely linked genes, were identified in each of the networks using the link community method [51] which circumscribes edges into communities using the 'linkcomm' binary version 1.0-4 in R [57] or MCL modules using the 'mcl' binary version 11-294 (http://micans.org/mcl/). Gene assignment to modules can be found in Additional file 5: Table S3.

Functional enrichment analysis
Functional enrichment was performed for each of the modules from each GIL. Functional terms used for enrichment include Gene Ontology (GO) [46], InterPro [58], AraCyC [50], Plant Ontology (PO) [49], and KEGG [47] terms. Mappings of GO, PO, AraCyC, and InterPro terms to genes were obtained from the TAIR website (Lamesch et. al. 2012). KEGG terms were obtained by uploading Arabidopsis coding sequences (CDS) to the KEGG/KAAS server which maps KEGG terms using a homology-based method [59]. Functional enrichment was then performed using an in-house Perl script that functions similar to the online DAVID tool [60,61]. The tool uses a Fisher's Exact test to look for significant over-representation of terms within a module versus the TAIR10 genes represented on the ATH1 platform as background. Module functional enrichment results (Bonferroni adjusted p < 0.001) can be found in Additional file 6: Table S4. Network functional enrichment results (Bonferroni adjusted p < 0.001) can be found in Additional file 3: Table S2.