R software implementation
We have implemented the sample network approach in a freely available, custom R software function called SampleNetwork. SampleNetwork has been designed to facilitate detailed exploration of sample relationships and expedite genomic data pre-processing decisions via sample network analysis. SampleNetwork enables semi-automatic, interactive sample network construction and network concept calculations. Network concepts include node-based measures such as the standardized sample connectivity (Z.K) and the standardized sample clustering coefficient (Z.C), as well as network-based measures such as cor(K C) and the mean inter-sample adjacency (ISA, or density). These concepts and many others are defined below and in Supplementary Methods (Additional file 1). By calculating the distributions of node-based sample network concepts, SampleNetwork enables the user to identify and remove outlying samples in an iterative and interactive fashion; by calculating network-based sample network concepts, SampleNetwork enables the user to gauge overall progress towards data cleanliness and sample homogeneity. These features are described in detail in our online tutorial (see below and Additional file 3). SampleNetwork also enables significance testing of sample covariates with respect to sample metrics, and data normalization. Data normalization may be performed pursuant to outlier removal using the quantile normalization method proposed in ref. [43].
Because sample networks often reveal groupings of samples that reflect batch effects (technical variation), which are typically not removed by standard normalization procedures, we have also incorporated existing methods that allow the user to automatically correct for batch effects. Specifically, we have found that the R function ComBat created by Johnson and colleagues [36] is quite adept at removing batch effects. Consequently, if batch effects are present, the user has the option of correcting for them by calling ComBat from within SampleNetwork, which automates its execution. SampleNetwork also requires installation of the following R (http://www.r-project.org/) and Bioconductor (http://www.bioconductor.org/) packages: affy [44], cluster, impute [45], preprocessCore, and WGCNA [34]. With each successive round of data processing, SampleNetwork produces and exports the results of sample network analysis automatically (e.g. Figure S1; Additional file 1).
We have also created a companion R software function called ModuleSampleNetwork to explore the properties of sample networks when formed over subsets of features. In our application, subsets of features correspond to modules of co-expressed genes [21], but we note that subsets can be defined by the user according to any criteria. ModuleSampleNetwork does not enable outlier testing and removal or data normalization, but instead seeks to compare module sample network properties between subgroups of samples (e.g. Figure 6) and across modules (e.g. Figure 4). An example workflow would involve using SampleNetwork to pre-process a microarray dataset, then using WGCNA [34] to identify modules of co-expressed genes, and finally using ModuleSampleNetwork to explore sample network properties at the modular level.
While both SampleNetwork and ModuleSampleNetwork are user-friendly, they are interactive and require judicious feedback from the user (for example, regarding thresholds for outlier removal). To illustrate how the software can be used in practice, we provide a detailed, annotated tutorial with R code (Additional file 3) highlighting the required input files, parameter choices, user interactions, and resulting output files. The beneficial effects of outlier detection and removal, data normalization, and correction for batch effects, as implemented using SampleNetwork, are clearly delineated by significance testing of sample covariates with respect to sample metrics, analysis of differential expression, and analysis of network concepts with each successive round of data processing, as described in the online tutorial. This tutorial, (Additional file 3) along with the required input files and the SampleNetwork and ModuleSampleNetwork R functions, is available on our web site (http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/SampleNetwork).
Microarray data pre-processing
Raw microarray data (.CEL files) [14] were downloaded from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3790). Detailed information on sample characteristics and sample processing can be found in [14]. A summary of sample characteristics can also be found in Additional file 2. To eliminate non-specific and mis-targeted probes prior to generating expression values, a mask file (“HG-U133A”) was obtained from http://masker.nci.nih.gov/ev/[46] and applied to the raw microarray data using the R (http://www.r-project.org/) package “ProbeFilter” [47] (http://arrayanalysis.mbni.med.umich.edu/MBNIUM.html#ProbeFilter). After applying the mask file, only probe sets with at least seven remaining probes were retained for further analysis (n = 18,631). Expression values were generated in R using the “expresso” function of the “affy” package (http://www.bioconductor.org/) [48] with “mas” settings and no normalization, followed by scaling of arrays to the same average intensity (200).
Sample networks based on general similarity or dissimilarity measures
The input of most clustering procedures is a similarity or dissimilarity measure. In Additional file 1, we define these measures and describe general approaches for turning a similarity or dissimilarity matrix into a sample network.
Defining sample adjacency
To construct sample networks, a measure of connection strength, or adjacency, is defined for each pair of samples i and j and denoted by a
ij
. A mathematical constraint on a
ij
is that its values must lie between 0 and 1. In our implementation, we defined the adjacency between (microarray) samples S
i
and S
j
as follows:
(1)
where β = 2. Technically, a
ij
is a signed weighted adjacency matrix [22, 49]. A major advantage of defining a network adjacency measure (as opposed to a general similarity measure) between samples is that it allows specification of network concepts (see below). Our proposed sample adjacency measure (based on β = 2) also has several other advantages. First, it preserves the sign of the correlation (although in most applications negative correlations among samples are unlikely to occur). Second, it preserves the continuous nature of the correlation information; alternative approaches based on thresholding the correlation coefficient may lead to information loss. Third, while any other power β could be used, the choice of β = 2 results in an adjacency measure that is close to the correlation when the correlation is large (e.g. larger than 0.6, which is often the case among samples in microarray data).
We note that SampleNetwork also allows the user to define sample adjacencies using Euclidean distance, which may be desirable in some applications. Future efforts may seek to compare the properties of sample networks using these and other adjacency measures.
Network concepts
After constructing an adjacency matrix, nodes (samples) can be characterized in terms of a number of existing network concepts (see refs. [10, 12] for comprehensive overviews of network concepts). Several of these concepts are reviewed briefly below, including the connectivity (also known as degree in unweighted networks) and the clustering coefficient, which we find to be particularly useful in the context of sample networks.
Connectivity
The connectivity (k) of the i-th network node is defined by:
The maximum connectivity is defined as:
The scaled connectivity Ki of the i-th network node is defined as:
The standardized connectivity Z.Ki of the i-th network node is defined as:
(5)
Sample network interpretation of the connectivity: Using our proposed measure of sample adjacency (signed weighted network with β = 2), we find that
(6)
if all sample correlations are > 0.6. In other words, samples with high connectivity tend to be highly positively correlated with other samples. The connectivity is the most widely used concept for distinguishing the nodes of a network. As illustrated in the motivational example above and as detailed in our R tutorial (Additional file 3), samples with low connectivity may represent outliers.
Clustering coefficient
The clustering coefficient (C) of node i measures the density of local connections, or “cliquishness” [11]. For weighted networks, 0 ≤ a
ij
≤ 1 implies that 0 ≤ C
i
≤ 1 [22]:
(7)
The standardized clustering coefficient Z.Ci of the i-th network node is defined as:
(8)
Sample network interpretation of the clustering coefficient: The higher the clustering coefficient of a sample, the higher is the average pairwise correlation among its closest neighbors. If all of a sample’s closest neighbors have pairwise correlations of −1, the clustering coefficient will be zero.
Density and mean intersample adjacency (ISA)
We find it useful to characterize sample networks using the mean (off-diagonal) adjacency measure, i.e.
(9)
where A = [a
ij
]. The mean adjacency is also known as the density of the network. In sample networks, we often refer to the density as the mean intersample adjacency (ISA).
Sample network interpretation of the density: Using our proposed measure of sample adjacency (signed weighted network with β = 2), we find that
(10)
if all sample correlations are > 0.6. Thus, the mean adjacency is roughly equal to the mean correlation in sample networks.
The standardized C(k) curve and cor(K,C) network concept
Empirical results obtained through application of the SampleNetwork R function to many datasets indicated that as outlying samples are removed, data are normalized, and technical artifacts (e.g. batch effects) are corrected, Z.K and Z.C exhibit a progressively linear, inverse relationship. A similar relationship has been observed in unweighted (binary) networks, where the relationship between the (unstandardized) connectivity and (unstandardized) clustering coefficient of network nodes, i.e. the C(k) curve, has previously been reported to follow a scaling law (C ≅ kα[29, 31]), with values approaching −1 often observed for the scaling exponent α in biological systems [30, 32]. It has been suggested that this relationship may emerge as a consequence of hierarchically modular networks, where nodes with low connectivity form small, densely connected clusters, and nodes with high connectivity serve to bridge these many small clusters into one large, integrated network [31].
We define the standardized C(k) curve as a scatter plot between Z.K and Z.C where Z.K and Z.C denote the standardized sample connectivity and the standardized sample clustering coefficient, respectively. We also introduce a new network concept, cor(K,C), which we define as the Spearman correlation between Z.K and Z.C. We note that other measures of correlation could also be used (e.g. Pearson correlation). Since the Spearman correlation is invariant with respect to a monotonically increasing transformation (e.g. standardization), we find that
(11)
where k denotes the unscaled connectivity. As described in Results, we find that cor(K,C) is inversely related to the density (i.e. mean adjacency) in simulated networks. However, because cor(K,C) is invariant if one scales all off-diagonal adjacencies by a constant, it is more accurate to consider cor(K,C) as an indicator of network heterogeneity. The network concept Heterogeneity is defined as:
(12)
Let us briefly consider the special case of an exactly factorizable network in which the network adjacency factors into node-specific contributions (a
ij
= CF(i) CF(j)) [10, 50]. In this case, we have shown that the Spearman correlation cor(K C) is actually determined by the network heterogeneity:
(13)
Thus, cor(K C) close to 1 indicates that network heterogeneity is high. Divergence of cor(K C) from 1 (in a negative direction) implies increasing homogeneity; once a critical level of homogeneity in the network is breached (analogous to a percolation transition [33]), cor(K C) becomes negative. In practice, however, the relationship described above does not generalize to non-factorizable networks. In our real data applications that involve non-factorizable networks, cor(K C) also exhibits a dependence on the network size n.
Identification of significant differences between cor(K,C)
Differences in standardized C(k) curves may distinguish biologically interesting groups of samples. For example, assume two sample networks (corresponding to two groups of samples) and two corresponding measures of cor(K,C). To identify significant differences in cor(K,C) between two sample networks, we use a test for assessing the significance of differences in correlations from samples of different sizes. First, cor(K,C) for each sample group is transformed using the Fisher transformation:
(14)
where k indexes the sample networks being compared. For the comparison between groups (sample networks) 1 and 2, the difference between the resulting z-scores is divided by the joint standard error:
(15)
where n1 and n2 represent the number of samples in groups 1 and 2, respectively. Under the null hypothesis of equal cor(K,C), z
diff
follows asymptotically a normal distribution (under weak assumptions). Therefore we calculate significance levels (P-values) for z
diff
based upon the standard normal distribution.
Simulation model for illustrating the ability of cor(K,C) to distinguish sample groups in the absence of differential expression
To further illustrate the utility of cor(K C), we simulated a set of 500 genes (referred to as a “module”) with the following properties: i) the first principal component (the observed module eigengene [ME]) exhibited no relationship to a simulated sample trait (referred to as “disease status”), and ii) cor(K C) distinguished “control” subjects from those with “moderate” or “severe” disease status. The module was simulated to contain two unrelated sub-modules of 200 and 300 genes, respectively. The first sub-module contained a signal for the simulated sample trait, while the second sub-module contained noise genes with no relation to disease status. The first sub-module was simulated in two steps. First, we used a seed ME as input for the simulateModule function from the R package WGCNA [34]. This function simulates genes with varying correlations around the seed ME and exports standardized gene expression values (i.e. each gene has mean = 0 and variance = 1). Second, we added a mean value to each module gene. Importantly, the mean gene expression values depended on the value of the seed ME. For subjects whose seed ME values were above the median, mean expression values were drawn from a normal distribution with mean = 2 and standard deviation = 2. For subjects whose seed ME values were below the median, mean expression values were 2/3 those of the control subjects (i.e. it was assumed that the disease lowered the mean gene expression values in sub-module 1). Analogously, we simulated the expression values for the second sub-module. However, we assumed that the mean gene expression values were derived from a normal distribution with mean = 2/3 and standard deviation = 2/3 (i.e. the mean values of these genes tended to have lower expression values than those of the first sub-module). The sample trait was simulated by thresholding the seed ME of the first sub-module. We assumed that healthy control subjects have a high value of the seed ME. Specifically, we simulated 100 individuals, with 50 designated as “control” subjects (darkgreen), 25 designated as “moderate” disease status (red), and 25 designated as “severe” disease status (turquoise), as indicated in Figure 5. In practice, the seed ME was not known. Instead, the observed ME for the entire module was obtained as the first principal component of the set of 500 genes.
Additional network concepts for sample networks
In addition to characterizing sample networks via the connectivity and the clustering coefficient, it is also possible to characterize sample networks using additional network concepts. Such concepts include decentralization and homogeneity, as well as summaries of node-based measures such as the mean correlation, mean connectivity, mean clustering coefficient, mean intersample adjacency (or density), and mean maximum adjacency ratio (MAR). When applied to sample networks, these concepts provide a battery of measures for comparing the consistency of sample behavior within and across datasets. These network concepts are calculated automatically by SampleNetwork and are discussed further in Additional file 1 and our R tutorial (Additional file 3).
Differential expression analysis
To determine whether specific CN gene co-expression modules were associated with DE in HD, for each CN module we calculated the ME (i.e. the first principal component obtained by singular value decomposition), which is a vector that summarizes the characteristic expression pattern of a module [10]. We then used Student’s t-test to determine whether the mean expression levels of the ME differed between groups (distinguished by HD diagnosis). An advantage of this approach is that the extent of modular DE can be summarized by a single P-value. Future efforts may seek to incorporate higher-order representative features (beyond the first principal component) to explore additional relationships between gene co-expression modules and disease status [51]. Differential gene expression in CN between CTRL and HD subjects (Additional file 4) was assessed using Student’s t-test on log2-transformed expression values. The resulting P-values were corrected for multiple comparisons by controlling for the false-discovery rate [52]. The resulting local false-discovery rates (referred to as Q-values), along with mean expression levels for CTRL and HD, are reported for all genes in the salmon module in Additional file 4.
Ingenuity pathways analysis
Ingenuity Pathways Analysis (IPA; http://www.ingenuity.com/) was used to determine whether gene co-expression modules identified in [21] were enriched with functional interactions among their constituent genes. For each module, probe sets that were positively correlated with the module eigengene (P < 0.001) were used to search for enrichment. Network construction was restricted to experimentally verified, direct physical interactions. IPA reported false-discovery rate (FDR)-corrected P-values for the 500 most enriched functionally annotated categories of genes in each module. Results for the salmon module are reported in Additional file 5.