To the best of our knowledge, this work provides the first formal demonstration that network methods can distinguish biologically meaningful relationships among samples in genomic datasets. We have shown that sample networks can identify outlying samples when hierarchical clustering procedures cannot, and even when hierarchical clustering procedures are not used at all. We have described a novel network statistic, cor(K,C), and shown that it can be used to i) evaluate sample homogeneity, ii) identify sample characteristics (e.g. diagnosis) with global effects, and iii) enable comparisons among groups of samples using pre-selected lists of features (e.g. gene co-expression modules). By applying the latter approach to microarray data generated from human brain tissue, we have identified a neuronal signal transduction module that is an epicenter of transcriptional dysregulation in striatal samples from individuals with HD. The advantages of using network methods for describing sample relationships in genomic datasets are summarized below.
A major advantage of constructing sample networks is that individual samples can subsequently be described using established node-based network concepts such as the connectivity and the clustering coefficient. These concepts are independent of the choice or use of clustering algorithms and depend only on the adjacency measure used to construct the network. The distributions of standardized node-based network concepts provide an unbiased and quantitative framework for identifying samples that “behave” differently, even if the underlying causes of this behavior are unknown. Intuitively, if the connectivity for a given sample (when measured over all genes) is significantly lower than all other sample connectivities from the same biological system, it suggests that there is something different about that sample compared to the others. The investigator must ask him/herself whether the observed difference is likely to reflect biological or technical variation. In light of the multiple steps that comprise a typical genomic experiment, each of which may introduce technical variation, a conservative approach is to exclude aberrant samples if there are no obvious biological factors that might explain their discordant behavior.
Compared with other methods for identifying outlying samples in genomic data, our approach offers several additional advantages. First, because sample relationships are defined with respect to a correlation matrix, it is platform-agnostic and does not require access to raw data (although in practice it is preferable to process raw data in a consistent fashion). Second, it is easily applied to very large datasets, in contrast to clustering procedures that rely upon visual inspection of dendrograms to identify outlying samples. Third, it produces a battery of measures for summarizing the consistency and integrity of genomic datasets (e.g. mean intersample adjacency [ISA, or density], decentralization, homogeneity, etc.), which can be compared across disparate studies, technology platforms, and biological systems. Such measures are especially useful for meta-analyses, where objective assessment of data quality is highly desirable before seeking to pool or compare results across studies. Finally, as implemented in SampleNetwork and described in Additional file 3, our approach is both flexible and efficient, enabling users to move quickly through large datasets in an iterative fashion, specifying groups of samples for processing, identifying and removing outliers, testing the significance of sample covariates, and performing data normalization. To enhance user-friendliness, we have also incorporated the R function ComBat , which is an effective tool for removing batch effects (Additional file 1). At each stage, relevant output files are produced and exported automatically.
At the same time, there are several important caveats associated with our proposed approach for using network concepts to identify outlying samples in genomic data. It should be noted that our approach works best for datasets with large numbers of samples (e.g. more than 10). It is also important to note that standardized network concepts such as Z.K are relative measures whose interpretation depends on context. For example, in a relatively homogeneous sample network (e.g. mean ISA > 0.97), a Z.K value of −2.5 implies higher adjacencies for the sample in question than it would in a more heterogeneous sample network (e.g. mean ISA < 0.9). In light of these considerations, it can be helpful to have “targets” in mind, such as an expectation of what the mean ISA should approach for a given biological system, technology platform, and adjacency measure. These targets can be guided by prior experience (for example, cancer datasets often exhibit substantial sample heterogeneity) or by the use of technical and biological replicates. Lastly, although we have focused primarily on Z.K and to a lesser extent Z.C as intuitive indicators of outlying status, it is possible that other node-based network concepts (or indeed, other measures of adjacency) could produce different results.
Beyond facilitating relatively simple tasks such as outlier identification, sample networks provide a novel perspective on more complex challenges such as group comparisons. Our results indicate that the standardized C(k) curve in weighted sample networks is a powerful tool for identifying sample characteristics with global effects on genomic activity. The stark divergence of cor(K,C) for HD CN samples motivated us to explore how cor(K,C) would be affected by other network topologies, leading to the observation that cor(K,C) undergoes a percolation-like transition that is related to network density and size. Although cor(K,C) was inversely related to network density in our simulations, we note that cor(K,C) is invariant if one scales all off-diagonal adjacencies by a constant. Therefore, it is more accurate to consider cor(K,C) as an indicator of network heterogeneity (or homogeneity; Additional file 1). In the special situation of an exactly factorizable network, we find that cor(K,C) is determined by the network heterogeneity (Methods). One practical implication of these findings is that cor(K,C) can serve as a useful indicator of data “cleanliness”: with each iteration of sample outlier removal or data normalization performed using SampleNetwork, cor(K,C) should approach −1.
We note that our findings with respect to the percolation-like transition for cor(K,C) are also applicable to unweighted (binary) networks. We have observed a similar transition for cor(K,C) in unweighted gene networks as the threshold for dichotomizing the adjacency matrix is progressively increased (Figure S8; Additional file 1). At permissive (low) thresholds, which produce networks in which most nodes are connected, cor(K,C) is negative; as the threshold is raised, producing networks in which most nodes are not connected, the relationship begins to invert, becoming positive at more stringent (high) thresholds (Figure S8; Additional file 1).
In unweighted networks, the relationship between the (unstandardized) connectivity and (unstandardized) clustering coefficient of network nodes, i.e. the C(k) curve, has previously been reported to follow a scaling law: C ≅ k
α[29, 31]. It has been shown that the value of the scaling exponent α is not universal, but negative values approaching −1 have been observed in biological systems [30, 32]. The inverse relationship for the C(k) curve has been interpreted as evidence of hierarchical modularity in network structure [30, 31]. Specifically, it has been suggested that in hierarchically modular networks, nodes with low connectivity form small, densely connected clusters, while nodes with high connectivity serve to bridge these many small clusters into one large, integrated network . However, the C(k) curve has primarily been studied in the context of metabolic, protein interaction, and gene regulatory networks, as well as other non-biological networks [30, 32, 37].
To the best of our knowledge, a percolation-like transition in the C(k) curve has not previously been reported. However, prior work has revealed that global topological properties of unweighted networks, such as those embodied in the C(k) curve, can be predicted by knowledge of local motif structure, and vice versa . Motifs, or subgraphs, describe basic interaction patterns among small groups of nodes [38, 39]. In unweighted networks, it has been shown that subgraphs naturally segregate into two classes: highly abundant type I subgraphs, which are sparsely interconnected, and less abundant type II subgraphs, which are densely interconnected . It has also been shown that a phase boundary separating type I and type II subgraphs can be accurately predicted using global network topological properties, including the C(k) curve . Therefore, we propose that the transition in the standardized C(k) curve observed in our analysis reflects a concomitant transition in local motif structure, which in turn reflects the degradation of sample network topology in CN by HD. Although motifs have been studied almost universally in the context of unweighted networks, we are aware of at least one study that has presented an approach for generalizing motif scoring to weighted networks . Our results suggest that future research investigating the relative strengths of distinct motifs in weighted networks and their relationship to global network topological properties is warranted.
The effect of HD on the standardized C(k) curve for CN samples was initially observed over all genes, which is consistent with the large impact that HD exerts on the CN transcriptome [14, 17, 19, 20, 41]. Because the transcriptomes of human brain regions, including CN, are organized into biologically meaningful gene co-expression modules , we reasoned that constructing sample networks for previously identified CN modules might expose variation in the standardized C(k) curve, which in turn might implicate specific biological processes in connection with HD pathology . This approach constitutes a novel strategy for exploring the effects of disease on sets of genes. We identified several modules that exhibited highly significant differences in cor(K
C) between CTRL and HD subjects in CN. One potential drawback of our approach is that relatively small differences in cor(K
C) can appear significant as |cor(K
C)| approaches 1; for example, M34 was significant despite a relatively small difference between CTRL (cor(K
C) = −0.98) and HD (cor(K
C) = −0.91) subjects. For the four most significant modules, however, the differences in cor(K
C) were > 1, indicating that the standardized C(k) curve had flipped from negative (CTRL) to positive (HD).
As illustrated above, differences between standardized C(k) curves are not simply a proxy for differences in network density, but also relate to network size and heterogeneity. We have also observed that small numbers of samples that are highly discordant (i.e. severe outliers) can have a large impact on the standardized C(k) curve (M.C.O. and S.H., unpublished observations). Thus, the standardized C(k) curve is an aggregate measure, and one that may be used to complement existing strategies for conducting both unsupervised and supervised analyses. We also note that in the present study, the overall relationship between differential expression (DE) and differences between the standardized C(k) curves of CTRL and HD subjects was weak. For example, although the salmon module (which exhibited the most significant difference in cor(K,C) between CTRL and HD) was strongly associated with DE, the red module (which also exhibited a significant difference in cor(K,C) between CTRL and HD) was not. Furthermore, our simulation study confirms that situations may exist in which cor(K,C) can distinguish meaningful sample subgroups in the absence of DE. These findings deserve additional study.