Organizational structure and the periphery of the gene regulatory network in B-cell lymphoma

Background The physical periphery of a biological cell is mainly described by signaling pathways which are triggered by transmembrane proteins and receptors that are sentinels to control the whole gene regulatory network of a cell. However, our current knowledge about the gene regulatory mechanisms that are governed by extracellular signals is severely limited. Results The purpose of this paper is three fold. First, we infer a gene regulatory network from a large-scale B-cell lymphoma expression data set using the C3NET algorithm. Second, we provide a functional and structural analysis of the largest connected component of this network, revealing that this network component corresponds to the peripheral region of a cell. Third, we analyze the hierarchical organization of network components of the whole inferred B-cell gene regulatory network by introducing a new approach which exploits the variability within the data as well as the inferential characteristics of C3NET. As a result, we find a functional bisection of the network corresponding to different cellular components. Conclusions Overall, our study allows to highlight the peripheral gene regulatory network of B-cells and shows that it is centered around hub transmembrane proteins located at the physical periphery of the cell. In addition, we identify a variety of novel pathological transmembrane proteins such as ion channel complexes and signaling receptors in B-cell lymphoma.

http://www.biomedcentral.com/1752-0509/6/38 and a peripheral layer, connected by an intermediate layer exhibiting a reduced modularity. The central layers of these networks were described to be highly enriched by genes that are located in the nucleus for regulating, e.g., the cell cycle, while the periphery is governed by metabolic, transport systems and cell communication processes. These results are consistent with the simplified view that the physical periphery of a cell produces signaling cascades that are induced by extracellular signals that are detected by transmembrane protein receptors. In turn, this leads to a transduction and amplification of extrinsic and intrinsic signaling cascades through the cytoplasm to the nucleus culminating in the regulation of gene expression. For an intuitive visualization of these intricate processes see Figure 1.
The inference of gene interactions in a gene regulatory network from gene expression data is often discussed in connection with the nuclear transcriptional regulatory network [1,16,17]. In the simplified transcription factor vs target gene model, a transcription factor affects directly the gene expression of the mRNA of a target gene. This may give the impression that gene interactions inferred from expression data need to be interpreted in the context of transcription regulation. For this reason, inferred networks from gene expression data are frequently equated with the transcriptional regulatory network. However, this is not justified because expression data convey only information about the dynamic state of genes correspondingly their mRNAs and, hence, do not provide direct information about any type of biochemical binding, including transcription regulation, at all. Instead, inferred interactions from expression data are not limited to transcription regulation, but can also include protein-protein interactions [18]. To emphasize this, we use the terminology gene regulatory network for a network that is inferred from gene expression data to point out that this is not necessarily a transcription regulatory network but a mixture of this and a protein-protein network [19].
The major purpose of this paper is to infer a gene regulatory network from a large-scale B-cell lymphoma gene expression data set, and to investigate its structural and biological organization. Immature B-cell lymphocytes are cells from the bone marrow that play an important role in the adaptive immune system. When B-cells are activated by an antigen they differentiate to memory B-cells, to antibody secreting plasma B-cells or proliferate intermediately to germinal centers (centroblasts and centrocytes) [20]. Bcells are one of the most interesting cell types for the study of mammalian signaling and cell differentiation processes due to their unique physiological properties governing the and Multiple myeloma (MM, plasma cells). For our analysis, we use the microarray data set from [21] which contains samples from the germinal centers of lymphoma patients and experimental transformed germinal center cell types.
In a previous study, it has been found that the C3NET inference algorithm has a considerably higher true positive (TP) rate for leaf edges of genes in a network that are sparsely connected [18]. For this reason we hypothesize that this method has characteristics which are very beneficial for the inference of peripheral regions of the gene regulatory network of B-cells. Due to the fact that Bcells are highly receptive to external stimuli, as described above, knowledge of these interactions seems viable for In order to analyze the structural organization of B-cell lymphoma, we infer a gene regulatory network by using C3NET in combination with an ensemble approach. This means, instead of applying the inference method to one data set, we are applying it to a bootstrap [22] ensemble of data sets. This allows not only to assess local networkbased measures down to the level of individual edges [23,24] but also to obtain an average network structure which is amenable for a hierarchical analysis, as we will show in this article.
There are several large-scale B-cell lymphoma related gene expression data sets available of germinal center tumor samples from Diffuse large B-cell lymphoma (DLBCL), Follicular lymphoma (FL) and Burkitt lymphoma (BL) [25][26][27][28][29]. In this paper, we study the gene regulatory network from B-cell lymphoma by using the data set in [21]. For an independent validation of our results we study in addition two Diffuse large B-cell lymphoma data set described in [25,27].
To demonstrate the validity of our bootstrap approach, we are using simulations comparing results from a bootstrap ensemble with an ensemble of independently generated data. For a principle overview of the generation of the bootstrap data, see Figure 2. In this figure, the data set D B k refers to the k-th data set from the bootstrap ensemble.
In this paper, we infer the peripheral region of the gene regulatory network inferred from a large-scale B-cell lymphoma gene expression data set by using the C3NET algorithm. We provide a functional and a structural analysis of the largest connected component for this network.
Further, we analyze the hierarchical organization of the network components of the B-cell gene regulatory network as revealed by the bootstrap approach.

Methods
In the following section we present the methods and the data used for our analysis.

Simulated Gene Expression Data
We simulate gene expression data sets for a variety of different network structures by using SynTReN and GeNGe [30,31]. For each network type, we generate 300 data sets with a sample size of 100, 200, 500 and 1000. Further, for each of these data sets, a bootstrap ensemble of size b = 100 was generated by sampling with replacement.
In addition, we generate simulated gene expression data sets for a network consisting of 8 network modules, which are organized in a hierarchical manner; see Figure 4 for a visualization. Each network module is generated using a Modular Topology Model (MTM) network model, each with a size of 25 genes. A MTM network has properties such as a scale-free degree distribution, high clustering coefficients and short path lengths as observed in real biological networks [32,33]. We construct 5 different networks by weakly connecting the 8 individual modules with a different number of connections. Specifically, the individual network modules are connected by 0, 3, 5, 10 and 15 edges, resulting in a total of 5 networks, each consisting of 200 genes. For each of the 5 networks, we generate independently 100 gene expression data sets with sample size 500 by using netsim [32]. Netsim generates time-series data. In order to obtain steady state expression data each sample in a data set is taken after the 50th time point. The gene expression profiles are generated with a sigmoidal activator function.

Preprocessing of B-cell lymphoma microarray data sets
The collection of the microarray gene expression data used in this study are from [21], which are accessible from the NCBI Gene Expression Omnibus (GEO) [34] (accession GSE2350). We denote the GSE2350 dataset that includes transformed and untransformed B-cell lymphoma samples as the Basso GSE2350 dataset. For our analysis we consider only samples for which raw gene expression data in form of CEL files are available. From the total of 387 samples of the GSE2350 dataset, 344 samples were available with raw CEL files. In the following, we call this data set D. The data set includes two Affymetrix chip platforms, hgu95a and hgu95a v2. We used the mixture CDF environment hgu95av12mixcdf 1.0.tar.gz available from http:// bmbolstad.com/misc/mixtureCDF/MixtureCDF.html to include only probe sets that have the same probe set annotation. For a cross-dataset validation of our study, we preprocessed two additional B-cell lymphoma data set. We retrieved a Diffuse-large B-cell lymphoma hgu133plus2 Affymetrix microarray data set with accession GSE11318 [27] including 203 samples, and a Diffuse large B-cell lymphoma hgu133a Affymetrix microarray data set with the accession GSE22470 [25] including 271 samples. These two data sets contain only untransformed B-cell lymphoma samples. We denote these as the Lenz GSE11318 dataset and the Salaverria GSE22470 dataset.
We processed all CEL files for each data set using RMA, a quantile normalization and summarization [35][36][37]. We extracted the log 2 expression intensities for each probe set. Because a gene can be represented by more than one probe set, we calculate the median expression value for each gene by mapping the annotation of Affymetrix-ID to Entrez gene IDs to obtain a summary value for the genes. The Basso GSE2350 dataset comprises a total of 9, 684 genes and 344 samples, where we do not exclude any unmapped probesets.
In order to perform a cross-dataset validation of the Basso GSE2350 dataset, we discarded all gene and probe set identifiers from the Lenz GSE11318 dataset and Salaverria GSE22470 dataset that are not present in the Basso GSE2350 dataset. After removal, the expression matrix of the Lenz GSE11318 dataset comprises 8, 727 genes and 203 samples and the expression matrix of the Salaverria GSE22470 dataset comprises 8, 664 genes and 271 samples.

Gene regulatory network inference
We use the C3NET method [18] to infer the gene regulatory networks for the simulated gene expression data sets and the B-cell data. For each data set, a copula transformation is applied to the gene expression matrix as performed in [17]. Mutual information for all gene pairs is computed using the Pearson estimator [38,39] ( 1 ) http://www.biomedcentral.com/1752-0509/6/38 Here, ρ is the Pearson correlation coefficient. F-scores are estimated by Here, p is the precision and r the recall.

Functional Analysis
The procedure for the Gene Ontology (GO) [40] enrichment analysis was implemented in R using the Entrez gene to GO annotation from the hgu95a v2 and the org.Hs.eg.db package and for the GO enrichment analysis the topGO package [41] from Bioconductor in R [42]. The significance level of the enrichment for a GO term was determined by a hyper-geometric test (Fisher's Exact Test [43]). For the analysis, only terms assigned to more than 3 candidate genes are considered for the analysis.

Network gene centrality pathway analysis
For the cross-dataset validation of the B-cell C3NET gene regulatory networks inferred from different data sets, we conducted a pathway-based network comparison. This method allows to identify functional subnetworks with the strongest structural similarities between pairs of gene regulatory networks. We compare the underlying network structure between two C3NET gene regulatory networks, using the node betweenness centrality [44] measure. The node betweenness centrality for a gene v i in a network is defined by [44] Betweenness centrality measures the proportion of all shortest paths between gene v k and gene v l , which traverse gene v i denoted by n i kl , referred to all shortest paths between gene v k and gene v l denoted by p kl .
For two given gene regulatory networks, G a and G b , we estimate the betweenness centrality values for all genes from a Gene Ontology (GO) term. Then, for each GO term, we perform Spearman's rank correlation test [43] for the ranks of the betweenness centrality values. We adjust p-values using a FDR [45] correction for a given significance level of α = 0.05. For the analysis we use the Gene Ontology (GO) annotation from the Bioconductor org.Hs.eg.db package.

Hierarchical network organization
In order to analyze the hierarchical organization of the Bcell C3NET gene regulatory network, we perform a threestep procedure based on bootstrap samples of the data. In the first step, we infer a network G by using all 344 samples of the microarray data set D. For this network, we identify its network components, which represent connected components. That means, from G we obtain a set of network components C = {C 1 , . . . , C K } whereas C i represents a list of genes that can be found in component i. These components have the property that for any pair of genes, e.g., g j , g k ∈ C i there exists a path connecting gene g j with g k . However, for gene pairs from different components, e.g., g j ∈ C i and g k ∈ C i there exists no path that connects these genes. The size of each network component is given by N i = |C i |, i.e., for example that network component C i contains N i genes. Here, K indicates the total number of components found in G we consider for our analysis. We would like to emphasize that the network components are naturally obtained because the inferred network G is usually an unconnected network, due to the conservative working principle of C3NET. Hence, C = {C 1 , . . . , C K } correspond to these individual network components. In the second step, an ensemble of bootstrap data sets is generated from D, as described in Figure 2, from which we infer an ensemble of networks , we estimate for each gene pair the fraction of inferred edges present in the ensemble, This corresponds to the probability Pr(e ij = 1|{G B i }). Due to the fact that the underlying network G is undirected, c ij is symmetric, i.e., c ij = c ji .
Finally, in the third step, the information obtained in step one and step two is combined by estimating the average neighborhood closeness,ĉ kk , from each network component k to network component k , obtained the following way. A mean feature vector,ĉ k , is estimated for each network component k bŷ Here, N k is the size of network component k andĉ k (i) is the i-th component of the vectorĉ k , which has length N. The interpretation ofĉ k (i) is Pr(component k is connected with gene i |{G B i }). From this, the average neighborhood closeness between network component k and network component k is obtained bŷ The interpretation ofĉ kk is Pr(component k is connected Hence, the average neighborhood closenessĉ kk provides structural information about the involvement of individual genes between network component k and k utilizing the variability within the data, as exploited by the bootstrap ensemble. We use the average neighborhood closenessĉ kk to define a K × K similarity matrix U by For our analysis we are using U to define an error measure d 2 , defined in section 'Graph edit distance hierarchy error' . Due to the probabilistic interpretation ofĉ kk , which implies that 0 ≤ĉ kk ≤ 1, these components are easily transformed into distance values by For our analysis we use the resulting K ×K distance matrix D for a hierarchical clustering in combination with the "Ward" method. The overall procedure is summarized in Figure 5.

Consistency of bootstrap ensembles
We start our analysis by performing simulations to compare the distributions of F-scores of an ensemble of independently generated data sets with two bootstrap ensembles. For an illustration of the generation of these bootstrap data and the difference between the three types of F-scores, see Figure 2. The colors of the arrows in this figure correspond to the colors of the boxplots shown in Figure 3. That means the blue boxplots correspond to F-scores obtained for an ensemble of 300 independently generated data sets. The red boxplots correspond to 30000 (= 300 × 100) F-scores obtained by bootstrapping each of the 300 data sets 100 times. We call this bootstrap ensemble BE 1. The boxplots in green show the 300 averaged F-scores, i.e., each F-score is averaged over 100 bootstrap samples. We call this bootstrap ensemble BE 2. Figure 3 shows the distribution of these F-scores for scale-free networks in dependence of four different sample sizes. In general, one can see that the distributions of Fscores of the two bootstrap ensembles are similar in range, median and the interquartiles to the F-scores obtained for the ensemble of independently generated data sets. However, the F-scores for BE 1 (shown in red) contain some outliers. This can be expected, because the bootstrapping of the data leads in general to a loss of information, due to the fact that not all samples are available for the inference task. For this reason, the median F-scores decline slightly, as can be seen from Figure 3. However, this decline is rather moderate, e.g., compared to the overall increase for larger sample sizes. Further, there are only few outliers, indicating that only very few bootstrap data sets lead to atypical results. Hence, our analysis demonstrates that the usage of bootstrap ensembles leads to a good approximation compared to results for and ensemble of independently generated data sets. Due to the fact that the latter data are only available in simulation studies, but not for real biological data, a bootstrap ensemble is a valid approach to estimate the variability of the population of inferred networks from an ensemble of data sets. We repeated the above analysis for different network topologies, including random networks and directed acyclic graphs (not shown), and found qualitatively similar results as for the scale-free networks shown above.
These results demonstrate that the bootstrap data lead to very similar results as the independent data, independent of the sample size. Hence, in the specific context http://www.biomedcentral.com/1752-0509/6/38 of network inference bootstrapping data is an efficient means to generate an ensemble of data to resemble an independently generated ensemble.

Inferrability of a hierarchical organization
Next, we evaluate the performance of our bootstrap approach for the inference of a network with a hierarchical structure. For this reason, we simulate gene expression data by using a network with a defined hierarchical organization of 8 interconnected MTM network modules, see Figure 4. The individual network modules are interconnected by 0, 3, 5, 10 and 15 edges. In total, we analyze gene expression data sets for 5 ensembles, each consisting of 100 data sets. In order to measure the performance of our bootstrap approach for the inference of the hierarchical organization of the network components from the simulated data we developed two measures. The first measure, d 1 , is the dendrogram clustering error that scores clustering errors of dendrogram splits between the true hierarchical structure and the inferred hierarchical structure ( Figure 6). The second measure, d 2 , is the graph edit distance hierarchy error that computes the graph edit distance [46][47][48] between the reference adjacency matrix of interconnected network modules and the inferred similarity matrix U, described in section 'Hierarchical network organization' . We estimate for each simulated gene expression data set a distance matrix D and a dendrogram of the inferred network module hierarchy, obtained by application of the "Ward" method. A summary of this procedure is given in Figure 7.

Dendrogram clustering error
The dendrogram clustering error, d 1 , measures the number of clustering errors from the lowest to the highest split, s i , between the inferred hierarchical structure and the underlying true hierarchical reference structure RH (Figure 4). A split, s i , in the dendrogram describes either (a) a cluster of two network modules or (b) a cluster of network modules to a cluster of network modules. We calculate the clustering error d 1 by The binary error function f (s i |RH) ∈ {0, 1} scores the clustering error of a split as follows. Suppose, the reference hierarchy RH is resembled in split s i for the cases (a) two clustered modules share the same parent module or (b) have a parent/child relationship. A clustering error of a http://www.biomedcentral.com/1752-0509/6/38   split is counted as 1 if neither of the two cases is true, otherwise it is zero. For the case of a network module being clustered to a cluster of network modules, the relation (a) or (b) must be given to at least one network module of the cluster. The dendrogram distance d 1 scores the total clustering error from the lowest to the highest split s i in the dendrogram. The maximal number of clustering errors is the total number of splits defined in the dendrogram. An example for the calculation of the error score d 1 is shown in Figure 6. In order to obtain a null model of the dendrogram clustering error that corresponds to the random clustering of network modules, we generate reference networks with permuted module labels. That means, we assume a network with a modular structure as shown in Figure 4, but permute the labels of the corresponding modules within this network.
In Figure 8 A we show the empirical cumulative distribution function (ecdf ) of the dendrogram clustering error d 1 for a variety of networks with different numbers of module interconnecting edges. For the null model, that means for networks resembling a random hierarchy structure as defined above, only about 20% of the cases reach a clustering error with ≤ 1. This is similar to the results obtained for networks with no interconnections between modules (black). Interestingly, already for 3 connections between modules (red) we observe that 40% of these networks have a clustering error ≤ 1. The results for 5 (green) and 10 (blue) module interconnecting edges show 70% and 80% with a clustering error ≤ 1. Finally, for 15 module interconnecting edges all hierarchy networks can be recovered with d 1 ≤ 1.

Graph edit distance hierarchy error
The graph edit distance hierarchy error, d 2 , is a measure for the error of the inferred module hierarchy. The reference hierarchy is described by the adjacency matrix R for the modules. Here an entry of R ij = 1 denotes that the two network modules i and j are connected. We measure the distance between the true underlying hierarchy R and the inferred similarity matrix U, given by Eq. 7, by In Figure  number of interconnecting edges between the network modules. This means adding edges between the network modules helps in reducing the inference error. For the networks with no interconnections between the network modules (black) d 2 is largest, as expected. These results correspond to the absence of a hierarchy between the network modules. Overall, the results for d 2 are similar to d 1 demonstrating that regardless of the chosen error measure a relatively low number of interconnecting edges is sufficient to enable the recovery of at least parts of the present hierarchy in the network.

Analyzing network components of the B-cell C3NET gene regulatory network
We infer a C3NET [18] gene regulatory network from 344 samples of the B-cell lymphoma microarray gene expression data set from [21]. The resulting B-cell C3NET gene regulatory network comprises 9, 684 genes and 9, 221 edges distributed over 463 separate network components (with > 1 gene). For a cross-dataset validation, we further inferred two additional C3NET gene regulatory networks from the DLBCL data sets in [25,27].
The DLBCL-C3NET gene regulatory network of the Lenz GSE11318 dataset comprises 8, 727 genes and 8, 134 edges and the DLBCL-C3NET gene regulatory network of the Salaverria GSE22470 dataset comprises 8, 664 genes and 8, 108 edges. Figure 9 shows a summary of the size distributions of the inferred network components.
For the B-cell C3NET gene regulatory network, the K = 25 largest network components (5% right quantile) have > 100 genes and comprise a total of 4, 673 genes representing 48% of all genes in the network. The giant connected component consists of 884 genes and 883 edges. For the two DLBCL-C3NET gene regulatory networks the largest K = 25 network components of the inferred networks comprise 3, 331/3, 477 genes representing 38%/40% of all genes in the entire gene regulatory network. The giant connected components of the two DLBCL gene regulatory networks consist of 299/395 genes and 298/394 edges.

Functional Network Analysis
In order to obtain a biological interpretation of the inferred B-cell C3NET gene regulatory network, we perform a Gene Ontology [40] (Table 1), Molecular Function (Table 2) and Cellular Component ( Table 3). The genes in the giant connected component show an enrichment in biological processes for G-protein-coupled-receptor protein signaling pathway (89 genes), cell-cell signaling (87 genes) and calcium ion transport (26 genes) ( Table 1). The cellular component analysis shows an enrichment, e.g., for plasma membrane proteins (264 genes), ion channel complexes (125 genes) and cell junction proteins (48 genes) ( Table 3). The molecular function analysis shows an enrichment, e.g., for G-protein coupled receptor activity (60 genes) and ion channel activity (38 genes) ( Table 2).
To study the biological relation and functional diversity between the individual network components, we cluster the network components according to the results of the Gene Ontology enrichment analysis from the category Cellular Component. Specifically, we conduct a clustering analysis of GO terms for the K = 25 network components in the following way. From testing 1, 020 GO terms of the category Cellular Component we find that 529 of these test at least for one network component significant.
The functional hierarchical clustering of the network components is generated from the overlap of significant Gene Ontology terms between all pairwise comparisons of the Gene Ontology enrichment analysis of the individual network components. A pairwise distance matrix is computed from the shared number of Gene Ontology terms for a significance level of α = 0.05. For the hierarchical clustering we use the "Ward" method, see Figure 10 A (first column).
The numbers of the leaves in the dendrogram correspond to the rank-labels of the network components, whereas '1' corresponds to the GCC. The provided GO terms correspond to the most frequently enriched terms found in the corresponding branches of the dendrogram. The hierarchical clustering based on the functional GO analysis of the network components separates the dendrogram into two principle branches. The first branch, shown in red, consists of highly enriched extracellular proteins, intrinsic and integral membrane proteins, cell junction and ion channel complex proteins. The second branch, shown in blue, is highly enriched for intracellular proteins from the nucleus, mitochondrion and cytoplasm.
In order to provide a comparison with the gene regulatory networks inferred from the two DLBCL gene expression data sets, we perform the same analysis for the Lenz GSE11318 dataset and the Salaverria GSE22470 dataset ( Figure 10). The network components of the DLBCL gene regulatory networks show a similar bipartition as observed for the Basso GSE2350 gene regulatory network separating into two principle branches for the peripheral and the intracellular regions of the cell. A major difference is that the DLBCL gene regulatory networks show a bipartition within the principle branch enriched with genes of the peripheral regions.

Hierarchical organization of the B-cell C3NET gene regulatory network
Next, we study the hierarchical organization of the K = 25 largest network components of the B-cell gene regulatory network. This analysis is similarly conducted as for the simulated data, described in section 'Inferrability of a hierarchical organization' . That means, first, we generate b = 100 bootstrap data sets from which we infer an ensemble of networks G B i b=100 i=1 . Then, we determine from these networks a distance matrix D, which we use for a hierarchical clustering. As agglomeration clustering method we use again the "Ward" method.
The resulting dendrograms are shown in Figure 10 B (second column). Also in these dendrograms, the ranklabels of the network components correspond to the leaf labels. As for the clustering of significantly enriched GO terms between the individual network components, we observe a bifurcation into two principal branches. Though the subgroupings of individual components differ to some extend in the respective categories, one can see that the same two principal branches are obtained as for the clustering of the GO terms in Figure 10 A. The first branch corresponds to the extracellular and membrane  intrinsic proteins enriched network components and the second branch belongs to intracellular network components enriched by genes in the nucleus, mitochondria and cytoplasm.
We would like to emphasize that the generation of both dendrograms is based on complementary information. Figure 10 A is obtained from dissimilarity values among GO terms, not considering the inferred interactions among genes. In contrast, Figure 10 B is obtained from a structural analysis of the inferred network, not considering GO terms. This demonstrates that the extracted information from two complementary analysis methods leads to coinciding information with respect to the principle separation of cellular components of a biological cell.
Further, we compare the results of the B-cell C3NET gene regulatory network to the DLBCL-C3NET gene regulatory networks inferred from the Lenz GSE11318 dataset and the Salaverria GSE22470 dataset (Figure 10 B). Although the subgroupings between the  functional and structural hierarchical clustering differ to some extend, overall, the network components of the two DLBCL gene regulatory networks show a similar clustering into two major branches of peripheral and intracellular regions. However, the bipartiton of the structural network components (second column in Figure 10) is less pronounced as observed for the B-cell C3NET gene regulatory network for the Basso GSE2350 dataset.

Identification of novel key signaling pathways in B-cell lymphoma
Hub genes of the B-cell C3NET gene regulatory network are genes with the largest node degree among all genes in the network. Intuitively, such genes are the most interesting targets to study as they are more likely to be associated with multiple pathways, e.g., signaling pathways and thus form putative key regulators for a large diversity of biological processes. From the entire B-cell C3NET gene regulatory network, we extracted the largest 25 hub genes with more than 20 connections. In Table 4 we give an overview of these hub genes including their gene identifiers and a selected GO term in order to facilitate the interpretation of their functional context. The selected hub genes play crucial roles in signaling processes such as receptors, ion channels and transporters, cell adhesion proteins and transcription factors. To our knowledge these genes were not studied in B-cell lymphoma to date ( Table 4).
The structure of the giant connected component (GCC) network consists of small, interconnected network modules with intramodular hub genes consisting in total of 884 genes and 883 edges ( Figure 11). The GCC contains the largest hub gene (CACNA1F) of the entire B-cell C3NET gene regulatory network, and in total 7 of the top-ranked 25 hub genes ( Figure 11). We highlighted the hub genes in the GCC network in Figure 11 and find that these are involved in adhesion, signaling and proliferation processes. In the following, we discuss some of these hub genes in more detail. The largest hub gene, with a total of 46 connections, is the calcium channel subunit CACNA1F. This gene belongs to the class of voltage gated calcium channels that regulate calcium influx and intracellular processes such as signaling pathways and gene expression. In particular, CACNA1F was found to be highly expressed in human lymphoid tissues such as plasma cells within germinal centers and therefore assumed to play a role in immune responses [49]. CLDN9 (35 gene neighbors) is a member of tight junction proteins which establish a permeability barrier between cells that play also a role in transport and signalling processes and have also been observed to be important for HPC virus entry [50]. CALCA (32 gene neighbors) calcitonin lowers calcium concentrations and plays a role in adhesion, cell migration and cell differentiation [51]. NR5A1 (27 gene neighbors) is a transcription factor that regulates cell growth, cell differentiation and developmental processes [52]. SNX29 (26 gene neighbors) belongs to a protein family involved in protein membrane trafficking that have a variety of protein motif binding domains [53].

Influence of activator and repressor links
In this section we study the inferrability of activator and repressor links. First we determine the correlation coefficient of all significant edges in the inferred network and obtain their corresponding p-values from testing for a vanishing Pearson correlation coefficient. Second, we conduct a multiple testing correction using the Benjamini-Hochberg procedure [45]. The edges that are statistically significant are identified as activator correspondingly repressor edges if the sign of the correlation coefficient is positive respectively negative.
In the inferred B-cell C3NET gene regulatory network, we identify a total of 847 repressor edges and 8, 372 activator edges. The estimated true reconstruction rate for repressor and activator edges is obtained from the bootstrap ensemble. A two-sample Kolmogorov-Smirnov test [43] comparing the distributions of the true reconstruction rates indicates a significant difference between these two distributions with a p-value of p = 2.2 × 10 −16 . Further, we find that activator edges are easier to infer than repressor edges, because activator edges have statistically a higher true positive rate than repressor edges.

Relationship of node degrees in the gene regulatory network and gene expression values
Next, we investigate the node degrees of genes in the inferred B-cell C3NET gene regulatory network and compare these with the variances of their gene expression values. We perform a loess (locally weighted scatterplot smoothing) [54] regression on the logarithm of the variances of the gene expression values and the corresponding node degree for each gene. We observe a positive correlation for genes up to a node degree of 7. In contrast, genes with a higher node degree show a negative correlation (results not shown). Thus, genes with a higher node degree in the inferred B-cell C3NET gene regulatory network show a smaller variation in their expression profile among the different samples of the expression data set.
Similarly, the connection between the gene expression variation and the node degrees in a protein-protein network was studied in [55]. There it was shown that with an increasing degree of the proteins, the gene expression variation decreases. Hence, for degrees larger than 7, both results coincide, however, for smaller degrees there seem to be differences between a protein-protein network and a gene regulatory network.

Cross-dataset validation for cellular component subnetworks
We perform a cross-dataset validation studying the structural similarity of our B-cell C3NET gene regulatory network with two additional DLBCL-C3NET gene regulatory networks we inferred from observational germinal center tumor data sets from [27] (Lenz GSE11318 dataset) and [25] (Salaverria GSE22470 dataset). In order to assess the structural similarity between networks, we use the (vertex) betweenness centrality measure [44] in combination with Spearman's rank correlation coefficient [43]. We use Spearman's rank correlation coefficient to test if structural components of two networks are similar to each other with respect to the order of the vertex betweenness centrality values of the genes. Specifically, in the following, we study two different scales of the networks. First, we http://www.biomedcentral.com/1752-0509/6/38 compare the entire networks using all genes. This corresponds to a global comparison. Second, we compare subnetworks defined as cellular components according to the gene ontology database. This corresponds to a local comparison.
From the global comparison, we find that the B-cell C3NET gene regulatory network shows a significant correlation of r ∼ 0.12 (p ≤ 2.2 −16 ) to the DLBCL-C3NET gene regulatory networks of the Salaverria GSE22470 dataset and r ∼ 0.14 (p ≤ 2.2 −16 ) to the DLBCL-C3NET gene regulatory network of the Lenz GSE11318 dataset. A comparison between the two DLBCL-C3NET gene regulatory networks shows also a significant correlation of r ∼ 0.24 (p ≤ 2.2 −16 ).
For the local comparisons, we test a total of 435 cellular components (corresponding to gene sets) that can be found in the networks having more than 10 genes. From these cellular components, we identify the ones with a http://www.biomedcentral.com/1752-0509/6/38 statistically significant Spearman rank correlation coefficient between profile vectors whose components correspond to the vertex betweenness centrality values of the genes in cellular components. To the resulting nominal pvalues, we are applying the Benjamini-Hochberg multiple testing correction procedure [45] to control the FDR at a level of 5%.
From the comparisons of the B-cell C3NET gene regulatory network with the DLBCL-C3NET gene regulatory network obtained from the Lenz GSE11318 dataset, we identify 95 (21%) gene sets, and for the comparison of the B-cell C3NET gene regulatory network with the DLBCL-C3NET gene regulatory networks obtained from the Salaverria GSE22470 dataset, we find 72 (16.5%) gene sets with a statistically significant correlation. In total, 58 terms are simultaneously significant in both network comparisons. These terms involve the basal part of cell, cell periphery, endosome and 17 gene sets sharing the parental term GO:0032991 macromolecular complex, e.g., histone mehyltransferase complex, anaphase-promoting complex, ribosome and cation chanel complex. In Table 5, we show the 30 Gene Ontology cellular component gene sets with the highest structural similarity between the B-cell C3NET gene regulatory network and the two DLBCL-C3NET gene regulatory networks. Each of the presented terms is statistically significant in both comparisons and the subscript 'ave' indicates the averaged values over these two comparisons.

Discussion
In this article, we inferred a B-cell gene regulatory network from B-cell lymphoma gene expression data [21] using the C3NET algorithm [18]. We found that the inferred B-cell C3NET gene regulatory network is characterized by individual network components that are organized by smaller interconnected network modules with intramodular hub genes. Further, we found that the giant connected component of the network is composed of 884 genes which show a significant enrichment for plasma membrane proteins that are involved in G protein signaling pathways and ion channel complexes. From the literature, it is known that ion channels play a key role for the signal transduction mechanism in lymphocytes [56]. Additionally, we found that the 25 largest components of the entire network can be categorized into two major classes. The first class, including the largest network component, is enriched by genes that are located at the membrane and the extracellular space at the physical periphery of the cell whereas the second class comprises network components located in the intracellular organelles such as in the cytoplasm, nucleus and mitochondrion. Further, the hub genes of the B-cell C3NET gene regulatory network were identified to play crucial roles in cell signaling, adhesion and cell proliferation processes.
It is believed that B-cell lymphoma subtypes show characteristic gene expression profiles of B-cells that are arrested in specific developmental stages [57]. The emergence of a lymphoma phenotype is thus understood to result from an impairment of pathways that control B-cell differentiation, proliferation and apoptosis processes [57]. The organizational structure of gene regulatory networks is a rich source of information to study specific molecular mechanisms of B-cell lymphoma. However, the combination of observational and experimental conditions from a variety of different B-cell lymphoma, including transformed and untransformed cells, as for our data [21], does not allow to infer a gene regulatory network for one particular subtype of B-cell lymphoma. Thus we are of the opinion that our inferred B-cell C3NET gene regulatory network represents an average representation of B-cell lymphoma reflecting different phenotypic subtypes with which the information conveyed by the gene expression values is associated.
In [18] it has been demonstrated that not all regions within a network can be inferred with the same inference accuracy. That means, the inference of networks is heterogeneous with respect to distinct edges in the network. It has been shown that moderately interconnected genes are easier to infer. This corresponds to the edges of linearly connected genes and the edges toward the leaf nodes of the network that are at the 'periphery' of the network. The results in [18] have been obtained for simulated data. However, for a real biological gene regulatory network it was unclear what genes correspond to the periphery of this cellular network. In contrast, in this paper we demonstrated that the periphery of the inferred B-cell C3NET gene regulatory network is centered around transmembrane proteins and the linear parts of the gene regulatory network correspond to signaling pathways and transmembrane receptor or ion channel proteins involved in signaling cascades. We would like to note that these transmembrane proteins could form putative drug targets for B-cell lymphoma.
The C3NET algorithm selects at most one edge for each gene, having maximum mutual information value. Therefore, this algorithm intends to capture the conservative causal core of the whole regulatory network only. This is in contrast to many other network inference methods [17,21,58]. For this reason, it is no surprise that a previous analysis of the same data set employing a different network inference method [17] found that their inferred regulatory network is governed by major hub genes, which mark key regulators such as transcription factors [21]. In particular, the network inferred by ARACNE consisted of 129, 000 edges and their major hub genes are reported to be cell cycle regulators. In contrast to these results, we found by our analysis a network with 9, 684 genes and 9, 221 edges enriched for signaling pathways http://www.biomedcentral.com/1752-0509/6/38 Table 5 Network similarity analysis for cellular components between the B-cell C3NET gene regulatory network and the Lenz and Salaverria gene regulatory network for 30 from the 58 cellular component subnetworks with the highest correlation coefficient of the betweenness centrality, significant in both comparisons -the columns denote the size (number of genes) of a Gene Ontology term represented in the subnetworks, betw avg the average betweenness for the two comparisons, r avg Spearman's rank correlation coefficient and p avg the FDR adjusted p-value GOID Term Size betw avg r avg pval avg (FDR) us to obtain a structural clustering reflecting the hierarchical organization of these network components. We would like to emphasize that this hierarchical clustering does not utilize information about GO terms. This is in contrast to the hierarchical clustering of GO terms presented in Figure 10 A. Nevertheless, we identified the major branches in Figure 10 B that correspond well to the clustering of the GO terms in Figure 10 A. We would like to indicate that our results confirm findings presented in [15]. It was found that the yeast and the E. coli protein network can be separated into two highly modular subnetworks which showed a functional enrichment for intracellular and extracellular processes. Hence, this may hint to a fundamental organization scheme of cellular networks. A potential hypothesis derived from these results is that the hierarchy among the network components may reflect aspects of the information flow between these components [62,63]. There are several advantages resulting form our approach, we would like to highlight. First, our investigation of the hierarchical organization of the B-cell C3NET gene regulatory network is at the abstraction level of network components or modules, but not genes. As such it resembles a systems approach [64][65][66]. This leads to a tremendous reduction in the complexity of the problem, and specifically in the interpretation of the obtained dendrograms shown in Figure 10. Second, on a technical note the size of the bootstrap ensemble was chosen large enough so that a further increase in its size does not lead to a modification of the obtained clustering. For this reason, the obtained results are stable. Third, the merit of bootstrapping is well known in many branches of statistics [22,67], where it is frequently used to quantify the variability within the data. In our approach, we utilize the data variability by exploiting mutual information values which are too weak in the whole data set to either pass a statistical test or which are not the maximum mutual information value for any gene. For example, there may be genes that have several significant interactions with other genes within a very small margin. For such cases, the bootstrapping allows to favor different gene pairs, because a slight change in the constitution of a data set may lead to alternating selections regarding the maximum mutual information valued gene pair.
Finally, we would like to note that results from our reanalysis of the data set [21] demonstrate that the biological information buried within large-scale high-throughput data is rich allowing to investigate a multitude of different biological questions.

Conclusions
With the increasing quality of network inference algorithms, we are heading toward the next major challenge we are facing in the post-genomic era, namely: What do the inferred networks mean? An analysis of the hierarchical organization of a network is just one aspect thereof, but we think, an import one. Due to the fact that one can study the hierarchy among genes, pathways, subnetworks or combinations thereof the complexity of this problem might be unprecedented. The bootstrap approach presented in this paper represents a simple, yet, flexible method in order to tame the complexity of the problem resulting, additionally, in an interpretable structure.