Cluster and propensity based approximation of a network

Ranola, John Michael; Langfelder, Peter; Lange, Kenneth; Horvath, Steve

doi:10.1186/1752-0509-7-21

Methodology article
Open access
Published: 14 March 2013

Cluster and propensity based approximation of a network

John Michael Ranola¹,
Peter Langfelder²,
Kenneth Lange^1,2,4 &
…
Steve Horvath^2,3

BMC Systems Biology volume 7, Article number: 21 (2013) Cite this article

9057 Accesses
5 Citations
4 Altmetric
Metrics details

Abstract

Background

The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.

Results

Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).

Conclusions

The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.

Background

The research of this article was originally motivated by two types of network models: correlation networks and multigraphs. After reviewing these special network models, we describe how structural insights gained from them can be used to tackle research questions arising in the study of general networks specified by network adjacencies and more generally to unsupervised learning scenarios modeled by similarity measures.

Background: adjacency matrix and multigraphs

Networks are used to describe the pairwise relationships between n nodes (or vertices). For example, we use networks to describe the functional relationships between n genes. We consider networks that are fully specified by an n × n adjacency matrix A = (A_{i
j}), whose entry A_{i
j} quantifies the connection strength from node i to node j. For an unweighted network, A_{i
j} equals 1 or 0, depending on whether a connection (or link or edge) exists from node i to node j.

For a weighted network, A_{i
j} equals a real number between 0 and 1 specifying the connection strength from node i to node j. For an undirected network, the connection strength A_{i
j} from i to j equals the connection strength A_{j
i} from j to i. In other words, the adjacency matrix A is symmetric. For a directed network, the adjacency matrix is typically not symmetric. Unless we explicitly mention otherwise, we will deal with undirected networks. In this paper the diagonal entries A_{i
i} of the adjacency matrix A have no special meaning. We arbitrarily set them equal to 1 (the maximum adjacency value); other authors set them equal to 0 [1].

In an (unweighted) multigraph, the adjacencies A_{i
j} = n_{i
j} are integers specifying the number of edges between two nodes. A general similarity matrix (whose entries are non-negative real numbers possibly outside [0,1]) can be interpreted as a weighted multigraph. In each of the network types, the connectivities

\begin{align} k_{i} = & \sum_{j \neq i} A_{ij} \end{align}

(1)

are important statistics pertinent to finding highly connected hubs. In an unweighted network (a graph), k_i is the degree of node i.

Background: correlation- and co-expression networks

Network methods are frequently used to analyze experiments recording levels of transcribed messenger RNA. The gene expression profiles collected across samples can be highly correlated and form modules (clusters) corresponding to protein complexes, organelles, cell types, and so forth [2–4]. It is natural to describe these pairwise correlations in network language. The intense interest in co-expression networks has elicited a number of new models and statistical methods for data analysis [3, 5–11], with recent applied research focusing on differential network analysis and regulatory dysfunction [12, 13].

A correlation network is a network whose adjacency matrix A = (A_{i
j}) is constructed from the correlations between quantitative measurements summarized in an m × n data matrix X = (x_{i
j}). The m rows of X correspond to sample measurements (subjects), and the n columns of X correspond to network nodes (genes). The j th column x_j of X serves as a node profile across the m samples. A correlation network adjacency matrix is constructed from the pairwise correlations Corr(x_i,x_j) in either of two ways. An unweighted gene co-expression network is defined by thresholding the absolute values of the correlation matrix. A weighted adjacency matrix is a continuous transformation of the correlation matrix. For reasons explained in [5, 14], it is advantageous to define the adjacency A_{i
j} between two genes i and j as a power β ≥ 1 of the absolute value of their correlation coefficient; thus,

\begin{align} A_{ij} = & | Corr (x_{i}, x_{j}) |^{β} . \end{align}

Weighted gene co-expression networks have found many important medical applications, including identifying brain cancer genes [14], characterizing obesity genes [15, 16], understanding atherosclerosis [17], and locating the differences between human and chimpanzee brains [9]. One of the important steps of weighted correlation network analysis is to find network modules, usually via hierarchical clustering. Each module (cluster) is then represented by the module eigengene defined by a certain singular value decomposition (SVD). Suppose Y denotes the expression data of a single module (cluster) after the appropriate columns of X have been extracted and standardized to have mean 0 and variance 1. The SVD of Y is the decomposition Y=U D V^t, where the columns of U and V are orthogonal, D is a diagonal matrix with nonnegative diagonal entries (singular values) presented in descending order, and the superscript t indicates a matrix or vector transpose. The sign of the dominant singular vector u₁ (the first column of U) is fixed by requiring a positive average correlation with the columns of Y; u₁ is referred to as the module eigenvector or eigengene. One can show that u₁ is an eigenvector of the m×m sample correlation matrix $\frac{1}{m} Y Y^{t}$ corresponding to the largest eigenvalue. The eigenvector u₁ explains the maximum amount of variation in the columns of Y.

Let d_i be the i th singular value of Y. The eigenvector factorizability

\begin{align} EF (u_{1}) = \frac{| d_{1} |^{4}}{\sum_{j} | d_{j} |^{4}} \end{align}

measures how well a network factors [18]. This measure is very similar to the proportion of variation explained, $d_{1}^{2} / \sum d_{j}^{2}$ . One can prove [18] that when EF(u₁) ≈ 1, the correlation matrix Y approximately factors as

\begin{align} Corr (x_{i}, x_{j}) & \approx Corr (x_{i}, u_{1}) Corr (x_{j}, u_{1}) . \end{align}

In co-expression networks, modules are often approximately factorizable [18, 19]. For a network comprised of multiple modules, it should come as no surprise that when the eigenvector factorizabilities of all modules are close to 1, the correlation network factors as

\begin{array}{l} A_{ij} \approx | Corr (x_{i}, u_{1}^{c_{i}}) |^{β} | Corr (x_{j}, u_{1}^{c_{j}}) |^{β} | Corr (u_{1}^{c_{i}}, u_{1}^{c_{j}}) |^{β} \\ \approx p_{i} p_{j} r_{c_{i} c_{j}}, \end{array}

(2)

where $u_{1}^{c_{i}}$ is the module eigenvector of the module containing i, $p_{i} = | Corr (x_{i}, u_{1}^{c_{i}}) |^{β}$ measures the intramodular connectivity (module membership) of node i with respect to its module, and $r_{c_{i} c_{j}} = | Corr (u_{1}^{c_{i}}, u_{1}^{c_{j}}) |^{β}$ measures the similarity between clusters c_i and c_j. The quantity

\begin{align} {kME}_{i} & = Corr (x_{i}, u_{1}^{c_{i}}) \end{align}

(3)

is called the module membership measure or conformity [18, 19].

Unlike general networks, correlation networks allow assessment of the statistical significance of an edge (via a correlation test) and generate informative network statistics such as the module eigenvector. But correlation network methods can only be applied to model the correlations between numeric variables. An open question is whether correlation network methods can be generalized to general networks by defining a suitable decomposition of a general network similarity measure. In the following, we will address this question.

Results and discussion

CPBA is a sparse approximation of a similarity measure

Consider a general n×n symmetric adjacency matrix A, for example one generated by a multigraph. Because the diagonal entries of A are irrelevant, A is determined by its $(\binom{n}{2})$ upper-diagonal entries. We now describe a low-rank matrix approximation to A based on partitioning the n nodes into K clusters labeled 1,…,K. Motivated by (Eq. 2), our approximation of a general similarity relies on three main ingredients. The first is a cluster assignment indicator c = (c_i) whose entry c_i equals a when node i belongs to cluster a. The cluster label a = 0 is special and is reserved for singleton nodes outside any of the “proper” clusters. The clusters are required to be non-empty except for the improper cluster 0. The second ingredient is a K×K cluster similarity matrix R = (r_{a
b}) whose entries quantify the relationships between clusters. The third and final ingredient is the propensity vector p = (p_i) whose components quantify the tendency (propensity) of the various nodes to form edges. The goal of cluster and propensity based approximation (CPBA) is to construct an approximation to A by optimally choosing the cluster assignment indicator c, the cluster similarity matrix R, and the propensity vector p. CPBA assumes that the adjacency matrix A_{i
j} can be approximated as

\begin{align} A_{ij} & \approx r_{c_{i} c_{j}} p_{i} p_{j} . \end{align}

(4)

The right-hand side with $(\binom{K}{2}) + n$ parameters can be interpreted as a sparse parametrization of the left-hand side with $(\binom{n}{2})$ parameters. In a weighted correlation network, the propensity p_i of node i is approximately |kME_i|^β. The cluster similarity r_{a
b}, defined by the correlation $| Corr (u_{1}^{a}, u_{1}^{b}) |^{β}$ between eigengenes, is an intuitive measure of the interactions between modules. The diagonal entries r_{a
a} of R are identically 1.

Objective functions for estimating CPBA

In practice, CPBA parameters c, p, and R of a general similarity are unknown and must be estimated by optimizing a suitably defined objective function. In this article, we describe estimation methods that are based on optimizing two superficially different objective functions. Our first objective is just the squared Frobenius matrix norm

\begin{align} Frobenius (c, p, R) & = \sum_{i} \sum_{j \neq i} {(A_{ij} - r_{c_{i} c_{j}} p_{i} p_{j})}^{2} . \end{align}

(5)

Our second objective is the Poisson log-likelihood

\begin{array}{l} Poisson (c, p, R) = \sum_{i} \sum_{j \neq i} ln [\frac{{(r_{c_{i} c_{j}} p_{i} p_{j})}^{A_{ij}} e^{- r_{c_{i} c_{j}} p_{i} p_{j}}}{A_{ij}!}] \\ = \sum_{i} \sum_{j \neq i} [A_{ij} ln (r_{c_{i} c_{j}} p_{i} p_{j}) - r_{c_{i} c_{j}} p_{i} p_{j} - ln (A_{ij}!)] . \end{array}

(6)

Our later multigraph example interprets Poisson(c,p,R) in this traditional sense. The functional form of the Poisson log-likelihood even applies when the A_{i
j} are non-integer. The factorial A_{i
j}!, which is irrelevant to maximization in any case, can then be defined via the gamma function. In practice maximization of the Poisson log-likelihood and minimization of the Frobenius norm yield very similar numerical updates.

In the Methods section, we describe a powerful MM algorithm for optimizing the objective functions and estimating its parameters. We now pause and briefly describe a few major applications. First, the sparse parametrization can be used to derive relationships between network statistics; our previous research highlights this possibility [18, 19]. For example, the connectivity index (Eq. 1) can be approximated by

\begin{align} k_{i} & = \sum_{j \neq i} A_{ij} \approx p_{i} \sum_{j \neq i} r_{c_{i} c_{j}} p_{j} . \end{align}

(7)

Second, since our optimization algorithms also strive to choose the best cluster assignment indicator c, they naturally give rise to clustering algorithms. Cluster reassignment is carried out node by node in a sequential fashion. For the sake of computationally efficiency, all parameters are fixed until node reassignment has stabilized. If parameters are updated as each node is visited, then the computational overhead seriously hinders analysis of networks with ten thousand nodes. Our limited experience suggests that more frequent re-estimation of parameters is less likely to end with an inferior optimal configuration. Hence, the tradeoff is complex.

Other major uses depend on the underlying model. In the Frobenius setting, the model can be used to generalize conformity-based decomposition of a network as shown in Example 2. In the Poisson log-likelihood setting, our model suggest a new clustering procedure. In contrast to other clustering procedures, the CPBA models provide a means of relating clusters to each other via the cluster similarities r_{a
b}. Furthermore, likelihood based objective functions permit statistical tests for assessing the significance of an edge. For example, in the multigraph model, the significance of the number of connections between two nodes can be ascertained by comparing the observed number of connections to the expected number of connections under the Poisson model. Finally, likelihood based objective functions provide a rational basis for estimating the number of clusters in a data set.

In the following three examples, we illustrate how to generalize a variety of network models to include clustering.

Example 1: Generalizing the random multigraph model

We recently explored a random multigraph model [20] that allows multiple edges to form between two nodes and edges to form with different probabilities. Edges still form independently. Under the random multigraph model, each node i is assigned a propensity p_i. The random number of edges between nodes i and j is then assumed to follow a Poisson distribution with mean p_ip_j. This model relies entirely on propensities and ignores cluster similarities. We will refer to it as the Pure Propensity Poisson Model (PPP) to avoid confusion with CPBA. Thus, the PPP log-likelihood is

\begin{array}{l} Pure Propensity Poisson (p) \\ = \sum_{i} \sum_{j \neq i} ln [\frac{{(p_{i} p_{j})}^{A_{ij}} e^{- p_{i} p_{j}}}{A_{ij}!}] \\ = \sum_{i} \sum_{j \neq i} [A_{ij} ln (p_{i} p_{j}) - p_{i} p_{j} - ln (A_{ij}!)] \\ = \sum_{i} \sum_{j \neq i} [n_{ij} ln (p_{i} p_{j}) - p_{i} p_{j} - ln (n_{ij}!)], \end{array}

(8)

where A_{i
j}=n_{i
j} is the number of edges between nodes i and j. While future work could explore alternatives to the Poisson distribution, it is attractive for several reasons. First, it is the simplest model that gives the requisite flexibility. Second, a Poisson random variable accurately approximates a sum of independent Bernoulli random variables. A binomial distribution also serves this purpose, but it imposes a hard upper bound on the number of successes. Third, the Poisson model is convenient mathematically since it yields nice MM updates in maximum likelihood estimation of the model parameters [20]. Fourth, a likelihood formulation permits testing for statistically significant connections between nodes.

Although the parametrization (Eq. 8) of PPP is flexible and computationally tractable, it ignores cluster formation. To address this limitation, we propose to exploit the CPBA parametrization. This extension is natural because many large multigraphs appear to be made up of smaller sub-networks, often referred to as modules, that are highly connected internally and only sparsely connected externally. For example, consider a co-authorship multigraph where an edge is placed between two scientists whenever they co-author an article. Scientists working at the same institution and in the same department tend to be highly connected. Similarly, scientists tend to collaborate with other scientists working on the same research topics. Cluster structure is also inherent in biology. For instance, genes often function in pathways, and proteins often cluster in evolutionary families. Thus, when a network exhibits clustering, the propensity to form connections within a cluster is usually higher than the propensity to form connections between clusters. This phenomenon cannot be modeled using our original PPP model [20] and provides the motivation for injecting cluster similarity into the multigraph model. Our hope is that the CPBA based multigraph model will better account for differences in intracluster and intercluster connections and lead to better identification of significant connections. In the absence of an explicit model for clustering, the PPP model is likely to falter on a dataset that exhibits clustering. The most likely result is a host of significant connections between nodes in the same cluster since they all exhibit more edges than expected by chance. These types of significant connections are often uninteresting. In the above mentioned co-authorship network, the cluster structure may reflect institutional affiliations. In this case, it may be more interesting to identify pairs of researchers who publish more (or less) than is expected based on their workplace location.

To keep the number of parameters to a minimum, the cluster similarity matrix R = (r_{a
b}) is assumed to be symmetric with a unit diagonal. Thus, our new random multigraph model, CPBA, adds just $(\binom{K}{2})$ parameters for K clusters. As previously postulated, the number of edges between nodes i and j in clusters c_i and c_j is Poisson distributed with mean $r_{c_{i} c_{j}} p_{i} p_{j}$ .

Example 2: Generalizing the conformity-based decomposition of a network

To demonstrate the value in our clustering model and tap into the wealth of data on weighted networks [21], we propose a clustering extension. Because weighted networks by definition have edge weights in [0,1], we drop the Poisson assumption and instead minimize the Frobenius criterion (Eq. 5). A major benefit of this model is that it generalizes the conformity-based decomposition of a network [21]. An adjacency matrix A = (A_{i
j}) is exactly factorizable if and only if there exists a vector f = (f_i) with non-negative elements such that

\begin{align} A_{ij} = f_{i} f_{j} \end{align}

(9)

for all i≠j. In this setting, f_i is often called the conformity of node i. Although the term factorizable network was first proposed in [19], numerous examples of these types of networks can be found in the literature. A physical model for experimentally determined protein-protein interactions is exactly factorizable [22]. In that model, the affinity A_{i
j} between proteins i and j is the product of conformities f_i = exp( − α_i), where α_i is the number of hydrophobic residues on protein i. Since it can be shown that f is uniquely defined if the network contains n ≥ 3 nodes and all A_{i
j} > 0 [19, 21], it is easy to see that the propensity vector matches the conformity vector, p = f, when all r_{a
b} = 1. Even when a network is not factorizable, our method can estimate conformities while simultaneously clustering the nodes into more factorizable modules. In addition, the entries of the cluster similarity matrix R can be interpreted as adjacencies between modules. Thus, the cluster similarity matrix represents a network whose nodes are networks themselves. In correlation network applications, we proposed a similar measure [23], and for gene networks we defined a measure of the probability of overlap between gene enrichment categories. Although these measures are useful in their respective contexts, they cannot be generalized to other networks. In contrast, by incorporating cluster similarity into our model, we have a standard way of calculating these measures for any type of network.

MM algorithm and R software implementation

Our software implementation of CPBA is freely available in the R package PropClust. On a laptop with a 2.4 GHz i5 processor and 4 GB of RAM, PropClust can estimate the parameters for 1000 nodes for a given cluster assignment in 0.1 seconds. For 3000 nodes, the same analysis takes 1 second. In practice, initial clusters are never perfect and must be re-configured as well. PropClust adopts a block descent (or ascent) strategy that alternates cluster re-assignment and parameter re-estimation until clusters stabilize. Block descent takes under 10 rounds on average if initial cluster assignments are good. Note that all parameters are fixed in cluster re-assignment, and all clusters are fixed in parameter re-estimation. Furthermore, both steps decrease the value of the objective function. Early versions of PropClust re-estimated parameters as each node was moved. This tactic proved to be too computationally burdensome on large-scale problems despite its slightly better performance in finding optimal clusters.

Judicious choice of the initial clusters is realized by a divide-and-conquer strategy. First, hierarchical clustering coupled with dynamic branch cutting [24] is used to cluster nodes into manageable blocks of user-defined maximum size, for instance at most 1000 nodes each. Second, the CPBA algorithm is applied to each block to arrive at clusters within blocks. Our co-expression network application shows that this initialization procedure works well even in large data sets. Another way to accelerate clustering is exploit parallel processing in the MM algorithm. Parallelization of the MM algorithm is easily achieved since the parameters separate in the surrogate function and updating the propensities via (Eq. 12) and (Eq. 15) requires only the previous parameter values. Cluster re-assignment avoids continuous optimization altogether and is very fast.

Simulated clusters in the Euclidean plane

Our first simulated dataset suggests a geometric interpretation of propensities and cluster similarities. For this dataset we simulated four distinct clusters on the Euclidean plane by sampling from a rotationally symmetric normal distribution with covariance matrix I and means corresponding to the four cluster centers shown in Figure 1A. The numbers of points in the four clusters were 50, 100, 150, and 200, respectively. The adjacency between two points is defined as 1−[dist / max(dist)]², where dist denotes Euclidean distance between the points and max(dist) denotes the maximum distance between any two points. Thus as depicted in Figure 1B, points closer together have a higher adjacency than those further apart. As anticipated, the MM algorithm provided the correct cluster assignments. Figure 1C also makes it evident that the estimated propensity of a point is significantly correlated to the Euclidian distance between the point and its cluster’s center. This result is expected since a connectivity k_i is related to a propensity p_i through equation (Eq. 7). Within a module, connectivity is also related to its cluster’s center through the formula

\begin{align} k_{i} & = (n - 1) - \frac{n | | x_{i} - \bar{x} | |^{2} + \sum_{j} | | x_{j} - \bar{x} | |^{2}}{max (dist)} \end{align}

(10)

where n is the number of nodes in the cluster, x_i is the position of node i, and $\bar{x}$ is the position of the cluster center [21]. This formula also explains why there is a separate line for each cluster in Figure 1C. Finally, Figure 1D shows that the cluster similarity r_{k
l} of clusters k and l is significantly correlated to the distance between the centers of k and l. In summary, a propensity can be viewed as a measure of the centrality of a node, while a cluster similarity reflects the distance between two cluster centers.

Simulated gene co-expression network

To illustrate how CPBA generalizes to weighted correlation networks, we simulated gene expression data using the simulateDatExpr5Modules function in the WGCNA package in R [25]. Given the simulated expression data, we calculated Pearson’s correlation coefficient for each pair of genes and formed an adjacency matrix. Applying CPBA based clustering to the simulated data led to clusters that overlap perfectly with the simulated clusters. As Figure 2 depicts, the estimated propensities p_i are very significantly correlated to the node connectivities k_i. This strong relationship reflects (Eq. 7). Furthermore, as seen in Figure 2, cluster similarity is significantly correlated to true eigengene adjacency, namely, $r_{c_{i} c_{j}} \approx | Corr (u_{1}^{c_{i}}, u_{1}^{c_{j}}) |^{β}$ . In both simulations several different cluster assignment initializations were tried and all led to the same, correct, result.

Real gene co-expression network application to brain data

In this real data example, we demonstrate that CPBA generalizes weighted correlation network analysis and can deal with fairly large data sets. The human brain expression data in question were measured on the Affymetrix U133A platform [4]. Following Oldham et al. 2008 [4], we restrict our analysis to the roughly 10⁴ probes that were highly expressed in brain tissue. The biological modules discovered by Oldham et al. 2008 [4] via WGCNA are fairly well understood and correspond to cell types such as astrocytes, oligodendrocytes, and neurons enriched for specific cell markers. In re-analyzing these data, we defined initial clusters as sketched in our discussion of the R software implementation of CPBA. This strategy obviates the need to pre-specify the number of clusters present in a data set. The results of PropClust are depicted in the second color band of Figure 3A. Overall, we find that CPBA yields modules very similar to those identified by WGCNA. The overlap with the well annotated modules of Oldham et al. [4] shows that the two clustering procedures yield meaningful and nearly equivlaent modules. CPBA has the advantage of giving cluster similarities. Figure 3B shows that eigengene based network adjacency (defined as the correlation between eigengenes raised to the soft-thresholding power 4) is highly correlated (r=0.93) with the cluster similarity parameter calculated by CPBA. For genes within a given module, Figures 3C-E demonstrate that the node propensities estimated under CPBA are highly correlated with the module membership measures k M E raised to the soft thresholding power 4. Finally, Figures 3I-J show that the connectivities k_i in the correlation network are highly correlated (r=0.96) with the connectivities calculated under the CPBA approximation and with the corresponding CPBA propensities (r=0.88). Figure 3K shows that there is a high correlation (r=0.93) between CPBA based connectivity (i.e. the row sum of the CPBA matrix) and the propensity parameter.

These results demonstrate that CPBA is roughly equivalent to WGCNA in a typical co-expression network. We expect that CPBA will also be helpful in understanding network topology. For example, Figure 3F shows that the weighted co-expression network satisfies the approximate scale-free topology (SFT) property. Future research should aim to characterize the general fit of CPBA parameters to the SFT property. In this example, the CPBA based connectivities and propensities shown in Figures 3G and 3H agree well.

OMIM disease and gene networks

Here we present an application that is not amenable to correlation network models but is arguably well suited for multigraph models. Specifically, we consider a bipartite multigraph between genes and diseases based on curated data from the reference Online Mendelian Inheritance of Man (OMIM), which tracks published links between diseases and corresponding genes [26]. These data were previously studied in detail by Goh et al. [27], who showed that diseases and their associated genes are related at higher levels of cellular and organ function. In the current application we validated their functional clustering using the CPBA model.

Following Goh et al. [27], we analyzed the data in two ways. First we created a disease network by placing an edge between two diseases for each gene they were both linked to. Only the links labeled as high quality by OMIM were considered. This construction yielded a multigraph of 2552 diseases with 1401 diseases connected to at least one other disease. We created a second, complementary multigraph by placing an edge between two genes for each disease they were both linked to. For this multigraph, there were 4045 genes with 1978 genes connected to at least one other gene. As suggested by the Medical Subject Headings (MeSH) list [28], we applied the CPBA model with K=10 clusters for the gene network and K = 14 clusters for the disease network, leaving out irrelevant categories.

We categorized the diseases using MeSH with little success. Nearly half of the diseases (47%) were not mapped to any category, and another 36% were mapped to multiple categories. Using the clustering obtained from the CPBA analysis of the disease network, we looked at whether any MeSH categories were overrepresented in a cluster. Ignoring diseases present in multiple MeSH categories, we found 8 significant categories at P < 0.01, including neoplasms, musculoskeletal diseases, and eye diseases (See Table 1). Although significant results were obtained, only a handful of diseases in each cluster contributed to the statistic. Upon closer inspection of the clusters, we found that many seemingly well-defined diseases were either not mapped or multiply mapped. For example, the eye disease cluster contains morning glory disc anomaly, coloboma, best macular dystrophy, cone-rod retinal dystrophy, and iris hypoplasia which are all clearly eye diseases, but not classified as such by MeSH. The eye disease cluster is depicted in Figure 4.

Table 1 Over-represented MeSH categories in the disease network

Full size table

Additionally, we found 540 significant connections between diseases at P < 0.01 and 148 significant connections at P < 0.001. The top 10 connections are listed in Table 2. The disease pair Adrenoleukodystrophy and Zellweger syndrome came in first; these two diseases are two of only three peroxisome biogenesis diseases belonging to the Zellweger spectrum [29]. It is also interesting to look for highly connected hub clusters, namely, clusters with high similarity to several other clusters. To define a measure of cluster connectivity, one can use the row sum of the cluster similarity matrix R. The neoplasm cluster has the highest row sum and is the cluster with the highest cluster connectivity. This makes sense given the complexity and diversity of cancers within the cluster.

Table 2 Disease network top 20 significant connections CPBA

Full size table

Looking at the complementary gene network, we checked for overrepresentation of Gene Ontology (GO) terms using BinGO on Cytoscape [30]. We found that each cluster had an overrepresentation of many GO terms. In the cluster with the well-known tumor suppressor protein TP53, we found 875 statistically significant GO terms at P < 0.01. Of these, 585 terms are still significant at P < 0.001 after accounting for multiple testing. The top 10 GO terms include both positive and negative regulation of cellular and biological processes, regulation of cell proliferation, anatomical structure development, regulation of apoptosis, and others that are clearly associated with TP53. Finally, we found 1316 significant connections between genes at P < 0.01 and 418 at P < 0.001. The top 10 connections are listed in Table 3. Many of these gene pairs are known to interact from other supporting evidence. For example, interaction between the top ranking pair, Hemoglobin Alpha 1 globin chain (HBA1) and Hemoglobin Subunit Beta (HBB), is confirmed by their co-crystal structure in x-ray crystallography [31] and by automated yeast two-hybrid (Y2H) interaction mating [32]. Figure 5 depicts the full gene network derived from OMIM.

Table 3 Gene network top 20 significant connections CPBA

Full size table

Empirical comparison of edge statistics

In this section we compare our current CPBA model with our original Pure Propensity Poisson (PPP) model on two real datasets: the OMIM disease network and the complimentary OMIM gene network. On the whole we find that the CPBA model produces more plausible P-values for the edge-count tests. Conditioning on clusters enables CPBA to detect significant intercluster connections often missed by the PPP model. It also produces more reasonable P-values within clusters since propensities are not artificially deflated by the lack of connections between nodes from different clusters. We now consider how these trends play out in the OMIM disease network and the OMIM gene network.

In the disease network we find that, among the 20 most significant connections under the CPBA model, 5 are intercluster connections (See Table 2). Under the PPP model in contrast, none of the 20 most significant connections link different CPBA clusters (See Table 4). In fact, none of the top 50 connections of the PPP model occur between different CPBA clusters. The significant connection between Usher syndrome and retinitis pigmentosa would have gone completely unnoticed under the PPP model. This would be a shame because retinitis pigmentosa is a major symptom of Usher syndrome [33]. Another missed intercluster pairing, Waardenburg syndrome and Craniofacial-deafness-hand syndrome, also deserves recognition since both syndomes involve deafness and common facial features [34, 35].

Table 4 Disease network top 20 significant connections PPP model

Full size table

Comparing the intracluster connections, we find that CPBA and PPP produce similar results, with 8 connections present in both lists. However, the P-values of these connections differ sharply under the two models. Since the PPP model essentially assumes a single cluster, estimated propensities trend lower in response to the lack of connections between nodes from different clusters. This results in lower means for the Poisson distributions and more extreme P-values. This phenomenon is especially evident in the pairing between Adrenoleukodystrophy and Zellweger syndrome; in the CPBA model the test for excess edges has −Log₁₀(P) = 8.57, whereas in the PPP model −Log₁₀(P) = 12.06.

The same story holds for the gene network. Among the 20 most significant connections under CPBA, 7 are intercluster connections (Table 3). Under the PPP model the corresponding number is 0 (Table 5). One of the more interesting missed connections occurs between BDNF (brain-derived neurotrophic factor) and HTR2A (seratonin receptor 2A). Both genes are associated with attention in schizophrenia [36]. As for intracluster connections, all intracluster connections found in the CPBA list are also found in the PPP list. However, the P-values for the most significant pair (HBB and HBA) differ by almost 5 orders of magnitude.

Table 5 Gene network top 20 significant connections PPP model

Full size table

To summarize, the CPBA model was able to find significant intercluster edge counts that the PPP model missed. Indeed, the PPP model was unable to find a single signficant intercluster pair in either data set. Although conditioning on clusters resulted in less impressive intracluster P-values, the CPBA model was still able to detect most of the significant intracluster pairings found by the PPP model. Figure 6 provides a scatterplot of −Log₁₀(P) for all significant connections obtained under either model. Points are colored red if they represent a pairing within a cluster and black if they represent a pairing between different clusters. The figure justifies our contentions that the CPBA model is more sensitive to intercluster connections and less sensitive to intracluster connections than its less nuanced competitor. So while there will be fewer significant intracluster connections, they will arguably be more interesting. Most likely these virtues of the CPBA model carry over to other data sets.

Simulations for evaluating edge statistics

To drive home the last point, we took a block diagonal adjacency matrix containing 1’s in its diagonal blocks and 0’s in its off-diagonal blocks and introduced a few off-block connections. In our initial matrix with three diagonal blocks of 100, 200, and 500 nodes, we changed 60 off-block entries from 0’s to 1’s. Each pair of node sets accounted for 20 of these switches. We then analyzed the modified matrix under both the CPBA and PPP models. Figure 7 plots −Log₁₀(P) versus true adjacencies for the modified entries. Based on its identification of clusters, the CPBA model yields a better fit to the data. Comparison of −Log₁₀(P) values under the two models shows that CPBA is more adept at finding significant intercluster connections. The evidence from the receiver operating characteristic (ROC) curve is very convincing on this point. The area under the ROC curve for the CPBA model was 0.95 compared to just 0.38 for the PPP model.

Hidden relationships between fortune 500 companies

To illustrate the utility of CPBA in a non-biological setting, we briefly describe a multigraph model of cross-company management. Specifically, we took the Fortune 500 Companies of 2011 and put an edge between two companies for each shared member on their boards of directors. The original data is found in Freebase [37]. As discussed below, the use of the Bayesian Information criterion (BIC) or the Akaike Information criterion for estimating clusters is problematic. For example, the BIC suggests an optimal number of clusters K around 10, while the AIC gives a less plausible value of K > 20. In the following, we assume K = 10 clusters. It is noteworthy that most companies do not cluster into groups of related industries. This makes sense because conflict of interest norms preclude companies in the same field from sharing board members. Overt clustering is consequently discouraged.

Based on the underlying probability model, we ascertained the significance of the edge counts for company pairs. Table 6 lists the 10 most significant connections under the 10-cluster model. Several connections stand out. The significant pairing between Fidelity National Financial and Fidelity National Information Services is rather obvious. The same holds for the pairing between Autozone and AutoNation Inc. Other connections are less obvious. The pairing between General Motors and DuPont may reflect the fact that Pierre du Pont, the founder of DuPont, at one point owned a third of all General Motors stock. This remained true until federal antitrust prosecutors filed suit, and the Supreme Court ruled against DuPont, forcing the company to dispose all of its GM shares in 1961 [38]. Although the shares are gone, it seems that some ties persist.

Table 6 Fortune 500 top 10 significant connections

Full size table

Relationship to other network models and future research

Because so much effort has been devoted to the mathematical and statistical explication of complex networks, we can only touch on the relationship of the CPBA parametrization to other network models. Complex networks can be described by random graphs (the Erdös and Rényi model [39]), small-world models (the Watts Strogatz model [40]), scale-free networks (the preferential attachment model of Barabasi and Albert [41, 42]), and other growing random network (GRN) models. These models involve graphs rather than multigraphs, so the number of edges per node pair equals 0 or 1. The CPBA has interesting ramifications for random graphs with arbitrary connectivity distributions [43]. If the edges are placed randomly in a network with such connectivities, then the probability P_{i
j} of observing an edge between nodes i and j is exactly factorizable. In fact, $P_{ij} = k_{i}^{- 3} k_{j}^{- 3}$ , where k_i is the connectivity (degree) of node i [42, 44]. Thus, P_{i
j} can be well approximated by CPBA with propensities $p_{i} = k_{i}^{- 3}$ and cluster similarities r_{a
b}=1. The Erdös and Rényi (ER) model, which assumes uniform edge probabilities, is too restrictive for realistic networks. The CPBA parametrization adapts well to random graphs if we replace the mean edge count with the edge formation probabilities

\begin{align} P_{ij} & = \frac{p_{i} p_{j} r_{c_{i} c_{j}}}{1 + p_{i} p_{j} r_{c_{i} c_{j}}} . \end{align}

This reformulation of the model is consistent with construction of an MM algorithm for parameter estimation [45]. Future research should explore the topological properties of such models.

Growing random networks (GRNs) are also of interest since many networks grow by the continuous addition of new nodes and exhibit preferential attachment. Thus, the likelihood of connecting to a node depends on the node’s current connectivity [41–44, 46, 47]. At each time step of a growing random network [44], a new node is added, and a directed edge to one of the earlier nodes is created. This growing network has a directed-tree graph topology whose basic elements are nodes connected by directed edges. In general, the topology of a general GRN is determined by the connection kernel A_k, which is the probability that a newly-introduced node forms an edge to an existing site with k edges (k−1 incoming and 1 outgoing). Future research could explore how to define a connection kernel (or more generally a GRN) so that the resulting network can be well approximated using the CPBA of the adjacency matrix. The Barabasi-Albert (BA) model is an important special case of a GRN [41, 42] that leads to a scale-free network. In the BA model, the degree of a node satisfies the power-law (or scale-free) distribution

P (k) \sim k^{γ} .

For homogeneous connection kernels, A_k∼k^ν, and scale free networks only arise if ν = 1 [44]. Future research could explore whether the adjacency matrix of the BA model can be well approximated using the CPBA. Toward this end it may useful to observe that the probability P_{i
j} of finding an edge between nodes i and j in the BA model is given by [42, 44]

\begin{matrix} P_{ij} = \frac{4 (k_{j} - 1) (4 k_{i} + k_{j} + 2)}{k_{i} (k_{i} + 1) (k_{i} + k_{j} - 1) (k_{i} + k_{j}) (k_{i} + k_{j} + 1) (k_{i} + k_{j} + 2)} \end{matrix}

which, importantly, assumes that node i with connectivity k_i was added later to the growing network than node j (implying that k_i < k_j). In view of this temporal assumption, P_{i
j} is not symmetrical in i and j; it also contains no parameters to capture clustering. Thus, there is no obvious relationship between the BA model and the CPBA approximation of a network. Future research can investigate how to parameterize preferential attachment so that the resulting probability P_{i
j} of finding an edge fits well to the CPBA.

Relationship to other clustering methods

Although the MM algorithm that estimates the CPBA parameters naturally generates a clustering method, CPBA is not just another clustering method. Our applications highlight the utility of the parameter estimates and the resulting likelihood based tests. CPBA not only provides a sparse parametrization of a general similarity matrix, but it also identifies hub nodes and clusters and enables significance tests for excess edges (between nodes) and shared similarities (between clusters). We do not claim that CPBA based clustering outperforms existing clustering methods in the simple task of clustering.

Substitutes for CPBA clustering include hierarchical clustering, partitioning around medoids [48], spectral clustering [49], mixture models [50], component models [51], and many more [52–56]. Because CPBA can be interpreted as a generalization of weighted correlation network methods, there is no need to invoke it instead of WGCNA when it comes to co-expression network applications. In modeling relationships between quantitative variables, one can use a host of other methods, for example sparse Gaussian graphical models [57, 58], Bayesian networks, and structural equation models. CPBA is not meant to replace these powerful approaches for modeling relationships between quantitative variables. Its main attraction is that it applies to a general similarity measure. Since input data sometimes consists of a similarity (or dissimilarity) measure, CPBA fills a useful niche.

Conclusions

The current paper introduces the CPBA model (cluster and propensity based approximation) for general similarity measures and sketches an efficient MM algorithm for estimation of the CPBA parameters. These advances will prove valuable in dissecting networks involving functional or evolutionary modules. The CPBA model is attractive for several reasons. First, it invokes relatively few parameters while providing sufficient flexibility for modeling observed similarities. Second, the cluster similarity parameters are good at revealing higher-order relationships between clusters. The row sum of the cluster similarity matrix can be used to define a cluster connectivity measure and to identify hub clusters such as the neoplasm hub in the disease network. Third, the CPBA model naturally generalizes network approximations that are already part of scientific practice, namely, the propensity based approach in multigraph models, the conformity decomposition in weighted networks, and the eigenvector based approximation in correlation networks. Fourth, the connections to the MM algorithm make the model well adapted to high-dimensional optimization. Fifth, the Poisson multigraph version of the model enables assessment of the statistical significance of edge counts and similarities between clusters. Sixth, likelihood-based models such as the Poisson multigraph model provide a rational basis for estimating the number of clusters. While it is beyond our scope to evaluate different methods for estimating the number of clusters in a data set, it is worth mentioning that our R implementation allows users to initialize clusters via hierarchical clustering. This tactic obviates the need to pre-specify the number of clusters.

Using simulated clusters in the plane and simulated co-expression networks, we demonstrate that CPBA generalizes existing methods. The planar examples show how a propensity can be intuitively seen as a measure of a node’s closeness to its cluster’s center and how a cluster similarity can be seen as a measure of proximity between two clusters. The simulated gene expression dataset exposes the CPBA model’s close ties to the previously studied concepts of intramodular connectivity, module eigengenes, and eigengene adjacency. Our analysis of real gene expression data reassures us that CPBA clustering results are similar to those of a benchmark method used in co-expression network analysis. The CPBA propensity parameters mirror the module eigengene based connectivity kME, and the cluster similarity measures mimic the network eigengenes. In our view, the main value of the CPBA model lies in generalizing correlation network methods.

To illustrate the versatility of CPBA, we applied it to the gene and disease networks of OMIM. The evidence that CPBA identifies biologically meaningful clusters is readily apparent in the significant enrichment of MeSH categories in the disease clusters and in the significant enrichment of GO terms in the gene clusters. While many other clustering procedures could have been used, CPBA has the advantage of dealing with dissimilarity measures as opposed to numeric input variables. It also provides Poisson likelihood based significance tests for edge counts (either pairs of genes and or pairs of diseases) that respect the underlying cluster structure. Finally, the row sums of the cluster similarity measure can be used to define hub clusters, and the estimated propensities can be used to define hub nodes. As we hoped, there were biologically meaningful ties between significantly connected pairs of genes and diseases. Several of these biologically plausible explanations are discussed in the text.

Although our examples are mainly biological, one can apply CPBA in many other contexts. For example, we employed CPBA to highlight shared board members among the Fortune 500 companies. This example illustrates how significant connections mirror the underlying ties between nodes. The edge count significance test suggests that the antitrust suit against GM and DuPont was no accident. To its credit, CPBA not only generalizes correlation network methods to general similarity matrices, but it also provides a valuable extension of random multigraph methods to weighted and unweighted multigraph data. CPBA is not just another clustering procedure but offers unique test statistics that permit identification of hub clusters and significant edge counts. We anticipate that the CPBA model will prove attractive to a wide range of scientists.

Methods

Maximizing the Poisson log-likelihood based objective function

Our algorithm for maximizing the Poisson log-likelihood (Eq. 6) given a clustering assignment c combines block ascent and the MM principle [59–61]. Clustering proceeds by re-assigning each node in turn until clusters stabilize. It may take several cycles through the nodes to achieve stability. Reassignment fixes parameters and selects the assignment with the highest log-likelihood. In the Poisson log-likelihood (Eq. 6), we take

\begin{matrix} ln L (c, R, P) = \sum_{i} \sum_{j \neq i} n_{ij} [ln (r_{c_{i} c_{j}} p_{i} p_{j}) - (r_{c_{i} c_{j}} p_{i} p_{j}) - ln (n_{ij}!)] \end{matrix}

where $r_{c_{i} c_{j}}$ is the cluster similarity between clusters c_i and c_j, p_i is the propensity of node i, and A_{i
j}=n_{i
j} is the number of connections between nodes i and j.

To optimize the objective function for a given cluster assignment, we employ block ascent and alternate updating R and p. Fixing p, it is possible to to solve for the best cluster similarity parameters r_{a
b} exactly. Indeed, setting the partial derivative

\begin{align} \frac{\partial}{\partial r_{c_{i} c_{j}}} Poisson = \sum_{k \in c_{i}} \sum_{l \in c_{j}} (\frac{n_{kl}}{r_{c_{i} c_{j}}} - p_{k} p_{l}) \end{align}

equal to zero and solving for $r_{c_{i} c_{j}}$ yields the simple update.

\begin{align} {\hat{r}}_{c_{i} c_{j}} & = \frac{\sum_{k \in c_{i}} \sum_{l \in c_{j}} n_{kl}}{(\sum_{k \in c_{i}} p_{k}) (\sum_{l \in c_{j}} p_{l})} . \end{align}

(11)

We expect the estimated r_{a
b} to occur within the unit interval [0,1] because edge formation is enhanced within clusters.

To update the propensity vector p with R fixed, we turn to an MM algorithm [59–61]. The MM principle says we should minorize the objective function by a surrogate function and maximize the surrogate function. This action drives the objective function uphill. One function minorizes another at a point p_m if it is tangent to the other function at p_m and falls below it elsewhere. The arithmetic-geometric mean inequality

\begin{align} p_{i} p_{j} & \leq \frac{1}{2} p_{mi} p_{mj} [{(\frac{p_{i}}{p_{mi}})}^{2} + {(\frac{p_{j}}{p_{mj}})}^{2}] \end{align}

is the key to minorizing the Poisson log-likelihood. Substituting the right-hand side for p_ip_j in the log-likelihood (Eq. 6) gives a surrogate function with parameters separated and leads directly to the propensity updates

\begin{align} p_{m + 1, i} & = \sqrt{\frac{p_{mi} \sum_{j} n_{ij}}{\sum_{j \neq i} r_{c_{i} c_{j}} p_{mj}}} . \end{align}

(12)

In practice, this MM algorithm may require an excessive number of iterations to converge. To accelerate convergence, we employ a Quasi-Newton extrapolation specifically designed for high-dimensional problems (Methods and [62]). The overall ascent algorithm (outer iterations) on R and p may also be slow to converge. It can also be accelerated by the same extrapolation scheme. Accelerating both inner and outer iterations gives a fast numerically stable procedure for estimating R and p for c fixed.

Minimizing the Frobenius norm based objective function

Minimization of the Frobenius objective function (Eq. 5) employs block descent and again alternates updating R and p. In this case setting the partial derivative

\begin{align} \frac{\partial}{\partial r_{c_{i} c_{j}}} Frobenius = - 2 \sum_{k \in c_{i}} \sum_{l \in c_{j}} (A_{kl} - r_{c_{i} c_{j}} p_{k} p_{l}) p_{k} p_{l} \end{align}

equal to zero and solving for $r_{c_{i} c_{j}}$ yields the simple update

\begin{align} {\hat{r}}_{c_{i} c_{j}} = \frac{\sum_{k \in c_{i}} \sum_{l \in c_{j}} A_{kl} p_{k} p_{l}}{(\sum_{k \in c_{i}} p_{k}^{2}) (\sum_{l \in c_{j}} p_{l}^{2})} . \end{align}

(13)

To update p for R fixed, we again rely on the MM principle. However, since we now seek to minimize the objective function, we majorize it. This is accomplished by first expanding it in the form

\begin{array}{l} Frobenius (c, p, R) = \sum_{i} \sum_{j \neq i} \\ \times [A_{ij}^{2} - 2 A_{ij} r_{c_{i} c_{j}} p_{i} p_{j} + {(r_{c_{i} c_{j}} p_{i} p_{j})}^{2}] . \end{array}

(14)

In majorization, one is allowed to work piecemeal. Thus, we majorize the term involving (p_ip_j)² by the earlier arithmetic-geometric mean inequality

\begin{align} p_{i}^{2} p_{j}^{2} & \leq \frac{1}{2} {(p_{mi})}^{2} {(p_{mj})}^{2} [{(\frac{p_{i}}{p_{mi}})}^{4} + {(\frac{p_{j}}{p_{mj}})}^{4}] \end{align}

taking into account squares. The term involving −p_ip_j can be majorized by the inequality x ≥ 1+ lnx for x ≥ 0 in the form

\begin{align} - p_{i} p_{j} & \leq - p_{mi} p_{mj} [1 + ln (\frac{p_{i} p_{j}}{p_{mi} p_{mj}})] . \end{align}

Substituting upper bounds side for (p_ip_j)² and −p_ip_j in the expanded objective function (Eq. 14) gives a surrogate function with parameters separated and leads directly to the propensity updates

\begin{align} p_{m + 1, i} = {[\frac{{(p_{mi})}^{3} \sum_{b} \sum_{j \in b} r_{c_{i} c_{j}} A_{ij} p_{mj}}{\sum_{b} \sum_{j \in b} {(r_{c_{i} c_{j}})}^{2} {(p_{mj})}^{2}}]}^{1 / 4} . \end{align}

(15)

As in the Poisson case, acceleration is advisable for both inner MM iterations and the outer block descent iterations. The same Quasi-Newton extrapolation [62] is pertinent and gives a fast numerically stable procedure for estimating R and p for c fixed.

Model initialization

Initial cluster assignment

Many algorithms exist for creating initial cluster assignments [56]. For most datasets these assignments only affect the time to convergence and not the converged solution. Our R software implements hierarchical clustering and does not require pre-specifying the number of clusters. More specifically, our software applies average linkage hierarchical clustering with dynamic branch cutting [24]. Dissimilarities are set equal to 1 minus similarities.

Initial propensities

One way to initialize propensities is to assume a single cluster and estimate propensities as suggested in our earlier work [20]. An alternative in the Frobenius model is to initialize p_i by the sum of the connections of node i divided by the square root of the sum of all connections [21],

\begin{align} p_{i} = \frac{\sum_{j \neq i} A_{ij}}{\sqrt{\sum_{k} \sum_{j \neq k} A_{kj}}} . \end{align}

(16)

This initialization can be motivated by showing that the above equation holds if $r_{c_{i} c_{j}} = 1$ (equivalently, the network consists of a single cluster) and $\sum p_{i} ≫ \sum p_{i}^{2}$ . While the assumption of perfect cluster similarity is unrealistic, it leads to initial values that work well in practice. For the Poisson model the analog is

\begin{align} p_{i} = \frac{\sum_{j \neq i} n_{ij}}{\sqrt{\sum_{k} \sum_{j \neq k} n_{kj}}} . \end{align}

(17)

Cluster similarity parameters

Because the block updates (Eq. 11) and (Eq. 13) for the cluster similarity parameters only depend on cluster assignment and propensities, it is natural to use those updates for initialization as well.

Clustering algorithm

1.
Choose the objective function (Frobenius or Poisson).
2.
Initialize the cluster assignment, for example, via hierarchical clustering.
3.
Initialize the propensity vector p by (Eq. 16) or (Eq. 17) and the cluster similarity matrix R by (Eq. 11) or (Eq. 13).
4.
Parameter Estimation: Given cluster assignments, re-estimate parameters through the updates (Eq. 11) and (Eq. 12) or (Eq. 13) and (Eq. 15). Declare convergence when the objective function changes by less than a threshold, say 10⁻⁵.
5.
Cluster Reassignment:
1. (a)
  Randomly permute the nodes.
2. (b)
  For each node taken in order, try all possible cluster reassignments for the node.
3. (c)
  Assign the node to the cluster that leads to the biggest improvement in the objective function.
4. (d)
  Repeat step 5 until no nodes are reassigned.
6.
Repeat steps 4 and 5 until no nodes are reassigned.
7.
(Optional) Repeat steps 1- 5 for other cluster numbers and use a cluster number estimation procedure for choosing the number of clusters.

Quasi-Newton acceleration

In this section we briefly review a Quasi-Netwon acceleration method described more fully in [62]. Newton’s method seeks a root of the equation 0 = x−F(x), where F(x) is a smooth map. For CPBA this is the algorithm map summarized by Equations (11) and (12) for Poisson updates or Equations (13) and (15) for Frobenius updates. Because the function G(x) = x − F(x) has differential d G(x) = I − d F(x), Newton’s method iterates according to

\begin{align} x_{n + 1} = x_{n} - dG {(x_{n})}^{- 1} G (x_{n}) = x_{n} - {[I - dF (x_{n})]}^{- 1} G (x_{n}) . \end{align}

Quasi-Netwon acceleration approximates d F(x_n) by a low-rank matrix M and explicitly forms the inverse (I − M)⁻¹.

Construction of M relies on secants. We can generate a secant by taking two iterates of the algorithm starting from the current iterate x_n. If we are close to the optimal point x_∞, then we have the linear approximation

\begin{align} F \circ F (x_{n}) - F (x_{n}) \approx M [F (x_{n}) - x_{n}], \end{align}

where M = d F(x_∞). We abbreviate the secant requirement as M u = v, where u = F(x_n)−x_n and v = F∘F(x_n)−F(x_n). To improve the approximation of M, one can use several secant constraints M u_i = v_i for i = 1,…,q. These are expressed in matrix form as M u = v. For our purposes the value q=6 works well.

Provided U has full column rank q, the minimum of the strictly convex function $| | M | |_{F}^{2}$ subject to the constraints M u=v is attained at M=V(U^tU)⁻¹U^t [62]. Fortunately, a variant of Sherman-Morrison formula [63] implies that the matrix I−M = I−V(U^tU)⁻¹U^t has the explicit inverse

\begin{align} {[I - V {(U^{t} U)}^{- 1} U^{t}]}^{- 1} = I + V {[U^{t} U - U^{t} V]}^{- 1} U^{t} . \end{align}

Thus, the quasi-Newton acceleration can be expressed as

\begin{align} x_{n + 1} & = x_{n} - {[I - V {(U^{t} U)}^{- 1} U^{t}]}^{- 1} [x_{n} - F (x_{n})] \\ = x_{n} - [I + V {(U^{t} U - U^{t} V)}^{- 1} U^{t}] [x_{n} - F x_{n})] \\ = F (x_{n}) - V {(U^{t} U - U^{t} V)}^{- 1} U^{t} [x_{n} - F (x_{n})] . \end{align}

This update involves inversion of the small q × q matrix U^tU − U^tV; all other operations reduce to matrix times vector multiplications.

Estimating the number of clusters

Estimating the number of clusters is the Achilles heel of cluster analysis. While this topic is beyond our scope, it is worth mentioning that an advantage of model based approaches is that likelihood criteria can be brought to bear. Since adding clusters entails more parameters, it is tempting to use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to estimate the number of cluster in the Poissom model [64, 65]. Both of these criteria balance the tradeoff between the number of parameters and the fit of the model. Specifically these methods choose the number of clusters K that minimize A I C = − 2 ln(L) + 2c or B I C = − 2 ln(L) + c ln(n), respectively, where c is the number of parameters, L is the likelihood, and n is the sample size. We caution the reader that AIC and BIC may be inappropriate for the present task because both criteria invoke strong assumptions. For example, AIC is derived by assuming a regular model, for instance, a linear model with Gaussian noise. Hence, AIC may be inappropriate for models with latent variables such as cluster labels. BIC may be inappropriate because our approach is frequentist rather than Bayesian. A review of the limitations and utility of these criteria can be found in [66].

Ethical approval

This article involved publicly available human data sets which are completely anonymized. This study is therefore exempted from requiring ethics approval. No animal data were used. We fully comply with the Declaration of Helsinki and the “Animal Research: Reporting In Vivo Experiments” (ARRIVE) guidelines.

Availability and requirements

Project name: PropClust R package Project home page:http://www.genetics.ucla.edu/labs/horvath/PropClust Operating system(s): Platform independent Programming language: R Licence: GNU GPL 3 The propensity based clustering method propensityDecomposition is implemented in the R package PropClust. The package also contains the function CPBAdecomp for carrying out the propensity decomposition of a network.

Abbreviations

BA:: Barabasi Albert
CPBA:: Cluster and propensity based approximation
ER:: Erdos renyi
GO:: Gene ontology
GRN:: Growing random network
kME:: Connectivity based on the module eigenvector or eigengene
MeSH:: Medical subject headings
MM:: Minorization maximization or majorization minimization
PPP:: Pure propensity poisson
SFT:: Scale free topology.

References

von Luxburg U: A tutorial on spectral clustering. Stat Comput. 2007, 17 (4): 395-416. 10.1007/s11222-007-9033-z.
Article Google Scholar
Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
Article PubMed CAS PubMed Central Google Scholar
Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302 (5643): 249-255. 10.1126/science.1087447.
Article PubMed CAS Google Scholar
Oldham M, Konopka G, Iwamoto K, Langfelder P, Kato T, Horvath S, Geschwind D: Functional organization of the transcriptome in human brain. Nat Neurosci. 2008, 11 (11): 1271-1282. 10.1038/nn.2207.
Article PubMed CAS PubMed Central Google Scholar
Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005, 4: 17-
Google Scholar
Huang Y, Li H, Hu H, Yan X, Waterman M, Huang H, Zhou X: Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics. 2007, 23 (13): i222-i229. 10.1093/bioinformatics/btm222.
Article PubMed CAS Google Scholar
Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, Cloughesy T, Nelson S, Mischel P: Analysis of oncogenic signaling networks in Glioblastoma identifies ASPM as a novel molecular target. Proc Natl Acad Sci USA. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.
Article PubMed CAS PubMed Central Google Scholar
Carlson M, Zhang B, Fang Z, Mischel P, Horvath S, Nelson SF: Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics. 2006, 7 (7): 40-
Article PubMed PubMed Central Google Scholar
Oldham M, Horvath S, Geschwind D: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A. 2006, 103 (47): 17973-17978. 10.1073/pnas.0605938103.
Article PubMed CAS PubMed Central Google Scholar
Chen L, EmmertStreib F, Storey J: Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol. 2007, 8 (219):
Keller M, Choi Y, Wang P, Belt Davis D, Rabaglia M, Oler A, Stapleton D, Argmann C, Schueler K, Edwards S, Steinberg H, Chaibub Neto E, Kleinhanz R, Turner S, Hellerstein MK, Schadt E, Yandell B, Kendziorski C, Attie A: A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility. Genome Res. 2008, 18 (5): 706-716. 10.1101/gr.074914.107.
Article PubMed CAS PubMed Central Google Scholar
Dawson J, Ye S, Kendziorski C: An empirical bayesian framework for discovering differential co-expression. Bioinformatics. 2012, 68 (2): 455-465.
CAS Google Scholar
de la Fuente A: From ‘differential expression’ to ‘differential networking’ -identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010, 26 (7): 326-333. 10.1016/j.tig.2010.05.001.
Article PubMed CAS Google Scholar
Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, TF C, Nelson S, Mischel P: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a novel molecular target. Proc Natl Acad Sci U S A. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.
Article PubMed CAS PubMed Central Google Scholar
Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A, Horvath S: Integrating genetics and network analysis to characterize genes related to mouse weight. PloS Genet. 2006, 2 (8):
Fuller T, Ghazalpour A, Aten J, Drake T, Lusis A, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007, 18 (6-7): 463-472. 10.1007/s00335-007-9043-3.
Article PubMed PubMed Central Google Scholar
Gargalovic PS, Gharavi NM, Clark MJ, Pagnon J, Yang WP, He A, Truong A, Baruch-Oren T, Berliner JA, Kirchgessner TG, Lusis A J: The unfolded protein response is an important regulator of inflammatory genes in Endothelial cells. Arterioscler Thromb Vasc Biol. 2006, 26 (11): 2490-2496. 10.1161/01.ATV.0000242903.41158.a1. [http://atvb.ahajournals.org/cgi/content/abstract/26/11/2490] 10.1161/01.ATV.0000242903.41158.a1
Article PubMed CAS Google Scholar
Horvath S, Dong J: Geometric interpretation of gene co-expression network analysis. PloS Comput Biol. 2008, 4: 8-10.1371/journal.pcbi.0040008.
Article Google Scholar
Dong J, Horvath S: Understanding network concepts in modules. BMC Syst Biol. 2007, 1: 24-10.1186/1752-0509-1-24.
Article PubMed PubMed Central Google Scholar
Ranola J, Ahn S, Sehl ME, Smith D, Lange K: A Poisson model for random multigraphs. Bioinformatics. 2010, 26 (16): 2004-2011. 10.1093/bioinformatics/btq309.
Article PubMed CAS PubMed Central Google Scholar
Horvath S: Weighted Network Analysis: Applications in Genomics and Systems Biology. 1 edition. 2011, New York: Springer
Book Google Scholar
Deeds E, Ashenberg O, Shakhnovich E: A simple physical model for scaling in protein-protein interaction networks. Proc Natl Acad Sci U S A. 2006, 103 (2): 311-316. 10.1073/pnas.0509715102.
Article PubMed CAS PubMed Central Google Scholar
Langfelder P, Horvath S: Eigengene networks for studying the relationships between co-expression modules. BMC Syst Biol. 2007, 1: 54-10.1186/1752-0509-1-54.
Article PubMed PubMed Central Google Scholar
Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the dynamic tree cut library for R. Bioinformatics. 2007, 24 (5): 719-20.
Article PubMed Google Scholar
Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008, 9: 559-10.1186/1471-2105-9-559.
Article PubMed PubMed Central Google Scholar
McKusick-Nathans Institute of Genetic Medicine JHU: Online mendelian inheritance in man, OMIM®. [http://omim.org/]
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The human disease network. Proc Natl Acad Sci. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104. [http://www.pnas.org/content/104/21/8685.abstract] 10.1073/pnas.0701361104
Article PubMed CAS PubMed Central Google Scholar
Rogers F: Medical subject headings. Bull Med Libr Assoc. 1963, 51: 114-116.
PubMed CAS PubMed Central Google Scholar
Steinberg S, Dodt G, Raymond G, Braveman N, Moser A, Moser H: Peroxisome biogenesis disorders. Biochemica et Biophysica Acta - Mol Cell Res. 2006, 1763 (12): 1733-10.1016/j.bbamcr.2006.09.010.
Article CAS Google Scholar
Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005, 21 (16): 3448-3449. 10.1093/bioinformatics/bti551. [http://bioinformatics.oxfordjournals.org/content/21/16/3448.abstract] 10.1093/bioinformatics/bti551
Article PubMed CAS Google Scholar
Shaanan B: Structure of human Oxyhaemoglobin at 2.1 a resolution. J Mol Biol. 1983, 171: 31-59. 10.1016/S0022-2836(83)80313-1.
Article PubMed CAS Google Scholar
Stelzl Uea: A human protein-protein interaction network: a resource for Annotating the Proteome. Cell. 2005, 122 (6): 957-968. 10.1016/j.cell.2005.08.029.
Article PubMed CAS Google Scholar
U.S. National Institute of Health’s informational page on Usher syndrome. [http://www.nidcd.nih.gov/health/hearing/pages/usher.aspx]
U.S. National Institute of Health’s informational page on Waardenburg syndrome. [http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0002401/]
U.S. National Institute of Health’s informational page on Craniofacial-deafness-hand syndrome. [http://ghr.nlm.nih.gov/condition/craniofacial-deafness-hand-syndrome]
Alfmova M, Lezhelko T, Golimbet V, Korovaltseva G, Lavrushkina O, Kolesina N, Frolova L, Muratova A, Abramova L, Kaleda V: Investigation of association of the brain-derived neurotrophic factor (BDNF) and a serotonin receptor 2A (5-HTR2A) genes with voluntary and involuntary attention in schizophrenia. Zh Nevrol Psikhiatr Im S S Korsakova. 2008, 108 (4): 62-69.
Google Scholar
Freebase open data resource. [http://www.freebase.com/]
DuPont Wikipedia entry. [http://en.wikipedia.org/wiki/DuPont]
Erdös P, Renyi A: On random graphs. Publicationes Mathematicae. 1959, 6: 290-297.
Google Scholar
Watts D, Strogatz S: Collective dynamics of ‘small-world’ networks. Nature. 1998, 393 (6684): 440-442. 10.1038/30918.
Article PubMed CAS Google Scholar
Barabasi A, Albert R: Emergence of scaling in random networks. Science. 1999, 286 (5439): 509-512. 10.1126/science.286.5439.509.
Article PubMed Google Scholar
Albert R, Barabasi A: Statistical mechanics of complex networks. Rev Mod Phys. 2002, 74: 47-97. 10.1103/RevModPhys.74.47.
Article Google Scholar
Newman MEJ, Strogatz SH, Watts DJ: Random graphs with arbitrary degree distributions and their applications. Phys Rev E. 2001, 64: 026118-[http://link.aps.org/doi/10.1103/PhysRevE.64.026118]
Article CAS Google Scholar
Krapivsky PL, Redner S, Leyvraz F: Connectivity of growing random networks. Phys Rev Lett. 2000, 85: 4629-4632. 10.1103/PhysRevLett.85.4629. [http://link.aps.org/doi/10.1103/PhysRevLett.85.4629] 10.1103/PhysRevLett.85.4629
Article PubMed CAS Google Scholar
Lange K: Numerical Analysis for Statisticians. 2010, New York: Springer
Book Google Scholar
Strogatz SH: Exploring complex networks. Nature. 2001, 410 (6825): 268-276. 10.1038/35065725. [http://dx.doi.org/10.1038/35065725] 10.1038/35065725
Article PubMed CAS Google Scholar
Durrett R: Random Graph Dynamics. 2006, New York: Cambridge University Press
Book Google Scholar
Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: John Wiley and Sons, Inc
Book Google Scholar
Zhou D, Burges CJC: Spectral clustering and transductive learning with multiple views. Proceedings of the 24th international conference on Machine learning, ICML ’07. 2007, New York: ACM, 1159-1166. [http://doi.acm.org/10.1145/1273496.1273642]
Chapter Google Scholar
Newman MEJ, Leicht EA: Mixture models and exploratory analysis in networks. Proc Natl Acad Sci U S A. 2007, 104 (23): 9564-9569. 10.1073/pnas.0610537104.
Article PubMed CAS PubMed Central Google Scholar
Sinkkonen J, Aukia J, Kaski S: Component models for large networks. arXiv e-prints. 2008, arXiv: 0803.1628-
Google Scholar
Hofman JM, Wiggins CH: Bayesian approach to network modularity. Phys Rev Lett. 2008, 100 (25): 258701-
Article PubMed PubMed Central Google Scholar
Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N: Learning systems of concepts with an infinite relational model. AAAI. 2006, United States: AAAI Press, 381-388.
Google Scholar
Airoldi EM, Blei DM, Fienberg SE, Xing EP: Mixed membership stochastic blockmodels. J Mach Learn Res. 2008, 9: 1981-2014.
PubMed PubMed Central Google Scholar
Newman M: Modularity and community structure in networks. PNAS. 2006, 103: 8577-8582. 10.1073/pnas.0601602103.
Article PubMed CAS PubMed Central Google Scholar
Schaeffer SE: Graph clustering. Comput Sci Rev. 2007, 1: 27-64. 10.1016/j.cosrev.2007.05.001. [http://www.sciencedirect.com/science/article/pii/S1574013707000020] 10.1016/j.cosrev.2007.05.001
Article Google Scholar
Yin J, Li H: A sparse conditional gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat. 2011, 5 (4): 2630-2650. 10.1214/11-AOAS494.
Article PubMed PubMed Central Google Scholar
Xulvi-Brunet R, Li H: Co-expression networks: graph properties and topological comparisons. Bioinformatics. 2010, 26 (2): 205-214. 10.1093/bioinformatics/btp632.
Article PubMed CAS PubMed Central Google Scholar
Hunter D, Lange K: A tutorial on MM algorithms. Am Stat. 2004, 58: 30-37. 10.1198/0003130042836.
Article Google Scholar
Lange K: Optimization. 2004, New York: Springer
Book Google Scholar
Wu T, Lange K: The MM alternative to EM. Stat Sci. 2010, 25 (4): 492-505. 10.1214/08-STS264.
Article Google Scholar
Zhou H, Alexander D, Lange K: A quasi-Newton acceleration for high-dimensional optimization algorithms. Stat Comput. 2011, 21: 261-273. 10.1007/s11222-009-9166-3. [http://dx.doi.org/10.1007/s11222-009-9166-3] http://dx.doi.org/10.1007/s11222-009-9166-3 10.1007/s11222-009-9166-3
Article PubMed PubMed Central Google Scholar
Lange K: Numerical Analysis for Statisticians. 1999, New York: Springer-Verlag
Google Scholar
Akaike H: A new look at the statistical model identification. Automatic Control, IEEE Trans. 1974, 19 (6): 716-723. 10.1109/TAC.1974.1100705.
Article Google Scholar
Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6 (2): 461-464. 10.1214/aos/1176344136. [http://www.jstor.org/stable/2958889] 10.1214/aos/1176344136
Article Google Scholar
Watanabe S: Algebraic Geometry and Statistical Learning Theory. 2009, Cambridge: Cambridge University Press
Book Google Scholar

Download references

Acknowledgements

This research was supported in part by United States Public Health Service grants GM53275, HG006139, MH59490, and UCLA CTSI Grant UL1TR000124.

Author information

Authors and Affiliations

Biomathematics, University of California, Los Angeles, CA, USA
John Michael Ranola & Kenneth Lange
Human Genetics, UCLA, Los Angeles, CA, USA
Peter Langfelder, Kenneth Lange & Steve Horvath
Biostatistics, UCLA, Los Angeles, CA, USA
Steve Horvath
Statistics, UCLA, Los Angeles, CA, USA
Kenneth Lange

Authors

John Michael Ranola
View author publications
You can also search for this author in PubMed Google Scholar
Peter Langfelder
View author publications
You can also search for this author in PubMed Google Scholar
Kenneth Lange
View author publications
You can also search for this author in PubMed Google Scholar
Steve Horvath
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to John Michael Ranola, Kenneth Lange or Steve Horvath.

Additional information

Competing interests

The authors declare that they have no conflict of interest.

Authors’ contributions

JR, SH, and KL jointly developed the methods and wrote the article. JR carried out the analysis and implemented the software. PL helped with the R software implementation and carried out the example analysis on empirical expression data. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Ranola, J.M., Langfelder, P., Lange, K. et al. Cluster and propensity based approximation of a network. BMC Syst Biol 7, 21 (2013). https://doi.org/10.1186/1752-0509-7-21

Download citation

Received: 20 June 2012
Accepted: 14 February 2013
Published: 14 March 2013
DOI: https://doi.org/10.1186/1752-0509-7-21

Cluster and propensity based approximation of a network

Abstract

Background

Results

Conclusions

Background

Background: adjacency matrix and multigraphs

Background: correlation- and co-expression networks

Results and discussion

CPBA is a sparse approximation of a similarity measure

Objective functions for estimating CPBA

Example 1: Generalizing the random multigraph model

Example 2: Generalizing the conformity-based decomposition of a network

MM algorithm and R software implementation

Simulated clusters in the Euclidean plane

Simulated gene co-expression network

Real gene co-expression network application to brain data

OMIM disease and gene networks

Empirical comparison of edge statistics

Simulations for evaluating edge statistics

Hidden relationships between fortune 500 companies

Relationship to other network models and future research

Relationship to other clustering methods

Conclusions

Methods

Maximizing the Poisson log-likelihood based objective function

Minimizing the Frobenius norm based objective function

Model initialization

Initial cluster assignment

Initial propensities

Cluster similarity parameters

Clustering algorithm

Quasi-Newton acceleration

Estimating the number of clusters

Ethical approval

Availability and requirements

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Competing interests

Authors’ contributions

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Systems Biology

Contact us