Cluster and propensity based approximation of a network
 John Michael Ranola^{1},
 Peter Langfelder^{2},
 Kenneth Lange^{1, 2, 4} and
 Steve Horvath^{2, 3}
DOI: 10.1186/1752-0509-7-21
© Ranola et al.; licensee BioMed Central Ltd. 2013
Received: 20 June 2012
Accepted: 14 February 2013
Published: 14 March 2013
Abstract
Background
The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood-based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.
Results
Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood-based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bipartite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).
Conclusions
The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood-based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.
Keywords
Network decomposition; Model-based clustering; MM algorithm; Propensity; Network conformity

Background
The research of this article was originally motivated by two types of network models: correlation networks and multigraphs. After reviewing these special network models, we describe how structural insights gained from them can be used to tackle research questions arising in the study of general networks specified by network adjacencies and more generally to unsupervised learning scenarios modeled by similarity measures.
Background: adjacency matrix and multigraphs
Networks are used to describe the pairwise relationships between n nodes (or vertices). For example, we use networks to describe the functional relationships between n genes. We consider networks that are fully specified by an n × n adjacency matrix A = (A_{ i j }), whose entry A_{ i j } quantifies the connection strength from node i to node j. For an unweighted network, A_{ i j } equals 1 or 0, depending on whether a connection (or link or edge) exists from node i to node j.
For a weighted network, A_{ i j } equals a real number between 0 and 1 specifying the connection strength from node i to node j. For an undirected network, the connection strength A_{ i j } from i to j equals the connection strength A_{ j i } from j to i. In other words, the adjacency matrix A is symmetric. For a directed network, the adjacency matrix is typically not symmetric. Unless we explicitly mention otherwise, we will deal with undirected networks. In this paper the diagonal entries A_{ i i } of the adjacency matrix A have no special meaning. We arbitrarily set them equal to 1 (the maximum adjacency value); other authors set them equal to 0 [1].
The connectivities (row sums) $k_i = \sum_{j \neq i} A_{ij}$ are important statistics pertinent to finding highly connected hubs. In an unweighted network (a graph), k_{ i } is the degree of node i.
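As an illustrative sketch (Python, our own toy example, not part of the PropClust implementation), the connectivity of each node in a small symmetric weighted adjacency matrix with unit diagonal can be computed as a row sum that excludes the diagonal:

```python
# Hedged sketch: connectivity k_i = sum over j != i of A_ij.
# The 3x3 adjacency matrix below is an invented toy example.
def connectivity(A):
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

A = [
    [1.0, 0.8, 0.1],   # diagonal entries arbitrarily set to 1
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
k = connectivity(A)    # highly connected hubs have large k_i
```
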
Background: correlation and coexpression networks
Network methods are frequently used to analyze experiments recording levels of transcribed messenger RNA. The gene expression profiles collected across samples can be highly correlated and form modules (clusters) corresponding to protein complexes, organelles, cell types, and so forth [2–4]. It is natural to describe these pairwise correlations in network language. The intense interest in coexpression networks has elicited a number of new models and statistical methods for data analysis [3, 5–11], with recent applied research focusing on differential network analysis and regulatory dysfunction [12, 13].
Weighted gene coexpression networks have found many important medical applications, including identifying brain cancer genes [14], characterizing obesity genes [15, 16], understanding atherosclerosis [17], and locating the differences between human and chimpanzee brains [9]. One of the important steps of weighted correlation network analysis is to find network modules, usually via hierarchical clustering. Each module (cluster) is then represented by the module eigengene defined by a certain singular value decomposition (SVD). Suppose Y denotes the expression data of a single module (cluster) after the appropriate columns of X have been extracted and standardized to have mean 0 and variance 1. The SVD of Y is the decomposition Y=U D V^{ t }, where the columns of U and V are orthogonal, D is a diagonal matrix with nonnegative diagonal entries (singular values) presented in descending order, and the superscript t indicates a matrix or vector transpose. The sign of the dominant singular vector u_{1} (the first column of U) is fixed by requiring a positive average correlation with the columns of Y; u_{1} is referred to as the module eigenvector or eigengene. One can show that u_{1} is an eigenvector of the m×m sample correlation matrix $\frac{1}{m}\mathit{Y}{\mathit{Y}}^{t}$ corresponding to the largest eigenvalue. The eigenvector u_{1} explains the maximum amount of variation in the columns of Y.
The correlation $kME_i = \text{cor}(x_i, u_1)$ between the expression profile of node i and the module eigengene is called the module membership measure or conformity [18, 19].
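As an illustrative sketch (Python/NumPy, our own toy data; not the WGCNA implementation), the module eigengene and the module membership measure can be computed from a standardized expression matrix via the SVD, with the sign of u_1 fixed by its average correlation with the columns of Y:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy module: 50 samples by 10 genes sharing one underlying signal (assumption).
base = rng.normal(size=50)
Y = base[:, None] + 0.5 * rng.normal(size=(50, 10))
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)      # standardize columns

U, D, Vt = np.linalg.svd(Y, full_matrices=False)
u1 = U[:, 0]                                  # dominant left singular vector
# Fix the sign so u1 has positive average correlation with the columns of Y.
if np.mean(Y.T @ u1) < 0:
    u1 = -u1
# Module membership (kME): correlation of each profile with the eigengene.
kME = np.array([np.corrcoef(Y[:, j], u1)[0, 1] for j in range(Y.shape[1])])
```
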
Unlike general networks, correlation networks allow assessment of the statistical significance of an edge (via a correlation test) and generate informative network statistics such as the module eigenvector. But correlation network methods can only be applied to model the correlations between numeric variables. An open question is whether correlation network methods can be generalized to general networks by defining a suitable decomposition of a general network similarity measure. In the following, we will address this question.
Results and discussion
CPBA is a sparse approximation of a similarity measure
The right-hand side with $\binom{K}{2}+n$ parameters can be interpreted as a sparse parametrization of the left-hand side with $\binom{n}{2}$ parameters. In a weighted correlation network, the propensity p_{ i } of node i is approximately kME_{ i }^{ β }. The cluster similarity r_{ a b }, defined as the power $|\text{Corr}({u}_{1}^{a},{u}_{1}^{b})|^{\beta}$ of the correlation between module eigengenes, is an intuitive measure of the interactions between modules. The diagonal entries r_{ a a } of R are identically 1.
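The parametrization $A_{ij} \approx r_{c_i c_j} p_i p_j$ can be sketched in a few lines (Python; the propensities, cluster similarities, and assignments below are invented toy values, not estimates):

```python
# Hedged sketch of the CPBA approximation A_ij ~ r_{c_i c_j} * p_i * p_j.
def cpba_approx(p, R, c):
    """Build the approximating matrix; diagonal arbitrarily set to 1."""
    n = len(p)
    A = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = R[c[i]][c[j]] * p[i] * p[j]
    return A

p = [0.9, 0.8, 0.7, 0.6]          # node propensities (toy)
R = [[1.0, 0.2], [0.2, 1.0]]      # cluster similarities, unit diagonal
c = [0, 0, 1, 1]                  # cluster assignments
A = cpba_approx(p, R, c)
# Intra-cluster entries use r_aa = 1; inter-cluster entries shrink by r_ab = 0.2.
```
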
Objective functions for estimating CPBA
Our later multigraph example interprets Poisson(c,p,R) in this traditional sense. The functional form of the Poisson log-likelihood even applies when the A_{ i j } are non-integer. The factorial A_{ i j }!, which is irrelevant to maximization in any case, can then be defined via the gamma function. In practice, maximization of the Poisson log-likelihood and minimization of the Frobenius norm yield very similar numerical updates.
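A minimal sketch of this objective (Python, ours; PropClust itself is implemented in Fortran 95 and R), with lgamma standing in for log A_{ i j }! so that non-integer entries are handled via the gamma function:

```python
import math

def poisson_loglik(A, p, R, c):
    """Poisson log-likelihood over pairs i < j with mean mu_ij = r_{c_i c_j} p_i p_j.
    lgamma(A_ij + 1) generalizes log(A_ij!) to non-integer edge counts."""
    n = len(p)
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            mu = R[c[i]][c[j]] * p[i] * p[j]
            ll += A[i][j] * math.log(mu) - mu - math.lgamma(A[i][j] + 1)
    return ll
```
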
Second, since our optimization algorithms also strive to choose the best cluster assignment indicator c, they naturally give rise to clustering algorithms. Cluster reassignment is carried out node by node in a sequential fashion. For the sake of computational efficiency, all parameters are held fixed until node reassignment has stabilized. If parameters are updated as each node is visited, then the computational overhead seriously hinders analysis of networks with ten thousand nodes. Our limited experience suggests that more frequent re-estimation of parameters is less likely to end with an inferior optimal configuration. Hence, the tradeoff is complex.
Other major uses depend on the underlying model. In the Frobenius setting, the model can be used to generalize the conformity-based decomposition of a network as shown in Example 2. In the Poisson log-likelihood setting, our model suggests a new clustering procedure. In contrast to other clustering procedures, the CPBA models provide a means of relating clusters to each other via the cluster similarities r_{ a b }. Furthermore, likelihood-based objective functions permit statistical tests for assessing the significance of an edge. For example, in the multigraph model, the significance of the number of connections between two nodes can be ascertained by comparing the observed number of connections to the expected number under the Poisson model. Finally, likelihood-based objective functions provide a rational basis for estimating the number of clusters in a data set.
In the following three examples, we illustrate how to generalize a variety of network models to include clustering.
Example 1: Generalizing the random multigraph model
where A_{ i j }=n_{ i j } is the number of edges between nodes i and j. While future work could explore alternatives to the Poisson distribution, it is attractive for several reasons. First, it is the simplest model that gives the requisite flexibility. Second, a Poisson random variable accurately approximates a sum of independent Bernoulli random variables. A binomial distribution also serves this purpose, but it imposes a hard upper bound on the number of successes. Third, the Poisson model is convenient mathematically since it yields nice MM updates in maximum likelihood estimation of the model parameters [20]. Fourth, a likelihood formulation permits testing for statistically significant connections between nodes.
Although the parametrization (Eq. 8) of PPP is flexible and computationally tractable, it ignores cluster formation. To address this limitation, we propose to exploit the CPBA parametrization. This extension is natural because many large multigraphs appear to be made up of smaller subnetworks, often referred to as modules, that are highly connected internally and only sparsely connected externally. For example, consider a coauthorship multigraph where an edge is placed between two scientists whenever they coauthor an article. Scientists working at the same institution and in the same department tend to be highly connected. Similarly, scientists tend to collaborate with other scientists working on the same research topics. Cluster structure is also inherent in biology. For instance, genes often function in pathways, and proteins often cluster in evolutionary families. Thus, when a network exhibits clustering, the propensity to form connections within a cluster is usually higher than the propensity to form connections between clusters. This phenomenon cannot be modeled using our original PPP model [20] and provides the motivation for injecting cluster similarity into the multigraph model. Our hope is that the CPBA-based multigraph model will better account for differences in intracluster and intercluster connections and lead to better identification of significant connections. In the absence of an explicit model for clustering, the PPP model is likely to falter on a dataset that exhibits clustering. The most likely result is a host of significant connections between nodes in the same cluster since they all exhibit more edges than expected by chance. These types of significant connections are often uninteresting. In the above-mentioned coauthorship network, the cluster structure may reflect institutional affiliations. In this case, it may be more interesting to identify pairs of researchers who publish more (or less) than is expected based on their workplace location.
To keep the number of parameters to a minimum, the cluster similarity matrix R = (r_{ a b }) is assumed to be symmetric with a unit diagonal. Thus, our new random multigraph model, CPBA, adds just $\binom{K}{2}$ parameters for K clusters. As previously postulated, the number of edges between nodes i and j in clusters c_{ i } and c_{ j } is Poisson distributed with mean ${r}_{{c}_{i}{c}_{j}}{p}_{i}{p}_{j}$.
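Under this Poisson model, a one-sided test for an excess of edges between two nodes can be sketched as follows (pure-Python illustration; the propensity and similarity values are invented):

```python
import math

def poisson_upper_tail(n_obs, mu):
    """P(N >= n_obs) for N ~ Poisson(mu): one-sided test for excess edges."""
    # P(N >= n) = 1 - sum_{k < n} exp(-mu) mu^k / k!
    cdf = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n_obs))
    return 1.0 - cdf

# Toy example: two same-cluster nodes (r = 1) with propensities 0.5 and 0.4
# share 3 edges; the Poisson mean is only 0.2, so the P-value is small.
mu = 1.0 * 0.5 * 0.4
p_value = poisson_upper_tail(3, mu)
```
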
Example 2: Generalizing the conformitybased decomposition of a network
for all i≠j. In this setting, f_{ i } is often called the conformity of node i. Although the term factorizable network was first proposed in [19], numerous examples of these types of networks can be found in the literature. A physical model for experimentally determined protein-protein interactions is exactly factorizable [22]. In that model, the affinity A_{ i j } between proteins i and j is the product of conformities f_{ i } = exp( − α_{ i }), where α_{ i } is the number of hydrophobic residues on protein i. Since it can be shown that f is uniquely defined if the network contains n ≥ 3 nodes and all A_{ i j } > 0 [19, 21], it is easy to see that the propensity vector matches the conformity vector, p = f, when all r_{ a b } = 1. Even when a network is not factorizable, our method can estimate conformities while simultaneously clustering the nodes into more factorizable modules. In addition, the entries of the cluster similarity matrix R can be interpreted as adjacencies between modules. Thus, the cluster similarity matrix represents a network whose nodes are networks themselves. In correlation network applications, we proposed a similar measure [23], and for gene networks we defined a measure of the probability of overlap between gene enrichment categories. Although these measures are useful in their respective contexts, they cannot be generalized to other networks. In contrast, by incorporating cluster similarity into our model, we have a standard way of calculating these measures for any type of network.
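For an exactly factorizable network with all A_{ i j } > 0 and n ≥ 3, the conformities can be recovered in closed form from any triple of nodes; a toy sketch (ours, not the estimation procedure used by PropClust):

```python
import math

# Build an exactly factorizable toy network A_ij = f_i * f_j from known
# conformities (invented values); the diagonal is arbitrarily set to 1.
f_true = [0.9, 0.6, 0.3, 0.5]
n = len(f_true)
A = [[f_true[i] * f_true[j] if i != j else 1.0 for j in range(n)]
     for i in range(n)]

def conformity(A, i):
    """f_i = sqrt(A_ij * A_ik / A_jk) for any two other nodes j, k."""
    j, k = [x for x in range(len(A)) if x != i][:2]
    return math.sqrt(A[i][j] * A[i][k] / A[j][k])

f_est = [conformity(A, i) for i in range(n)]
```
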
MM algorithm and R software implementation
Our software implementation of CPBA is freely available in the R package PropClust. On a laptop with a 2.4 GHz i5 processor and 4 GB of RAM, PropClust can estimate the parameters for 1000 nodes for a given cluster assignment in 0.1 seconds. For 3000 nodes, the same analysis takes 1 second. In practice, initial clusters are never perfect and must be reconfigured as well. PropClust adopts a block descent (or ascent) strategy that alternates cluster reassignment and parameter re-estimation until clusters stabilize. Block descent takes under 10 rounds on average if initial cluster assignments are good. Note that all parameters are fixed in cluster reassignment, and all clusters are fixed in parameter re-estimation. Furthermore, both steps decrease the value of the objective function. Early versions of PropClust re-estimated parameters as each node was moved. This tactic proved too computationally burdensome on large-scale problems despite its slightly better performance in finding optimal clusters.
Judicious choice of the initial clusters is realized by a divide-and-conquer strategy. First, hierarchical clustering coupled with dynamic branch cutting [24] is used to cluster nodes into manageable blocks of user-defined maximum size, for instance at most 1000 nodes each. Second, the CPBA algorithm is applied to each block to arrive at clusters within blocks. Our coexpression network application shows that this initialization procedure works well even in large data sets. Another way to accelerate clustering is to exploit parallel processing in the MM algorithm. Parallelization of the MM algorithm is easily achieved since the parameters separate in the surrogate function and updating the propensities via (Eq. 12) and (Eq. 15) requires only the previous parameter values. Cluster reassignment avoids continuous optimization altogether and is very fast.
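The block descent (or ascent) alternation described above can be sketched generically (Python; the function names, callable interfaces, and stopping rule are our own illustration, not PropClust's API):

```python
def block_ascent(A, c0, estimate_params, reassign_clusters, max_rounds=10):
    """Alternate parameter re-estimation (clusters fixed) with node-by-node
    cluster reassignment (parameters fixed) until the assignment stabilizes."""
    c = list(c0)
    params = None
    for _ in range(max_rounds):
        params = estimate_params(A, c)          # e.g. MM updates for p and R
        c_new = reassign_clusters(A, c, params) # parameters held fixed
        if c_new == c:                          # clusters have stabilized
            break
        c = c_new
    return c, params
```
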
Simulated clusters in the Euclidean plane
where n is the number of nodes in the cluster, x_{ i } is the position of node i, and $\bar{\mathit{x}}$ is the position of the cluster center [21]. This formula also explains why there is a separate line for each cluster in Figure 1C. Finally, Figure 1D shows that the cluster similarity r_{ k l } of clusters k and l is significantly correlated to the distance between the centers of k and l. In summary, a propensity can be viewed as a measure of the centrality of a node, while a cluster similarity reflects the distance between two cluster centers.
Simulated gene coexpression network
Real gene coexpression network application to brain data
These results demonstrate that CPBA is roughly equivalent to WGCNA in a typical coexpression network. We expect that CPBA will also be helpful in understanding network topology. For example, Figure 3F shows that the weighted coexpression network satisfies the approximate scale-free topology (SFT) property. Future research should aim to characterize the general fit of CPBA parameters to the SFT property. In this example, the CPBA based connectivities and propensities shown in Figures 3G and 3H agree well.
OMIM disease and gene networks
Here we present an application that is not amenable to correlation network models but is arguably well suited for multigraph models. Specifically, we consider a bipartite multigraph between genes and diseases based on curated data from the reference Online Mendelian Inheritance in Man (OMIM), which tracks published links between diseases and corresponding genes [26]. These data were previously studied in detail by Goh et al. [27], who showed that diseases and their associated genes are related at higher levels of cellular and organ function. In the current application we validated their functional clustering using the CPBA model.
Following Goh et al. [27], we analyzed the data in two ways. First we created a disease network by placing an edge between two diseases for each gene they were both linked to. Only the links labeled as high quality by OMIM were considered. This construction yielded a multigraph of 2552 diseases with 1401 diseases connected to at least one other disease. We created a second, complementary multigraph by placing an edge between two genes for each disease they were both linked to. For this multigraph, there were 4045 genes with 1978 genes connected to at least one other gene. As suggested by the Medical Subject Headings (MeSH) list [28], we applied the CPBA model with K = 10 clusters for the gene network and K = 14 clusters for the disease network, leaving out irrelevant categories.
Overrepresented MeSH categories in the disease network
Name  MeSH num.  −Log_{10}(P) 

Hemic & lymphatic diseases  C15  8.32 
Eye diseases  C11  7.78 
Cardiovascular diseases  C14  4.23 
Nervous system diseases  C10  4.04 
Neoplasms  C4  3.37 
Musculoskeletal diseases  C5  2.91 
Endocrine system diseases  C19  2.04 
Congenital, hereditary, & neonatal diseases & abnormalities  C16  2.03 
Disease network top 20 significant connections (CPBA)
Rank  Disease 1  Disease 2  C1  C2  −Log_{10}(P)  

1  Zellweger syndrome  Adrenoleukodystrophy  14  14  8.57 
2  Muscular dystrophy-dystroglycanopathy (limb-girdle)  Muscular dystrophy-dystroglycanopathy (congenital)  2  2  7.05 
3  Ullrich congenital muscular dystrophy  Bethlem myopathy  14  14  6.48 
4  Iminoglycinuria  Hyperglycinuria  14  14  6.48 
5  Alport syndrome  Hematuria  14  14  5.31 
6  Colorblindness  Blue cone monochromacy  14  14  5.31 
7  Refsum disease  Zellweger syndrome  14  14  5.05 
8  Usher syndrome  Retinitis pigmentosa  8  6  5.04 
9  Seckel syndrome  Microcephaly  14  14  4.96 
10  Leukoencephalopathy with vanishing white matter  Ovarioleukodystrophy  14  14  4.96 
11  Omenn syndrome  Severe combined immunodeficiency  14  14  4.90 
12  Tuberous sclerosis  Lymphangioleiomyomatosis  14  14  4.60 
13  Cone-rod dystrophy  Macular degeneration  6  10  4.60 
14  Bronchiectasis with or without elevated sweat chloride  Pseudohypoaldosteronism  11  11  4.47 
15  Leri-Weill dyschondrosteosis  Langer mesomelic dysplasia  14  14  4.10 
16  Multiple pterygium syndrome  Myasthenic syndrome  14  14  4.00 
17  Craniofacial-deafness-hand syndrome  Waardenburg syndrome  3  11  3.77 
18  Nicotine addiction  Epilepsy  3  8  3.76 
19  Hirschsprung disease  Pheochromocytoma  11  2  3.70 
20  Langer mesomelic dysplasia  Short stature  14  14  3.62 
Gene network top 20 significant connections (CPBA)
Rank  Gene 1  Gene 2  Cluster 1  Cluster 2  −Log_{10}(P) 

1  HBB  HBA1  2  2  9.05 
2  SHOXY  SHOX  10  10  7.36 
3  BDNF  HTR2A  5  4  7.07 
4  SH2B3  JAK2  2  8  7.05 
5  TSC2  TSC1  10  10  6.28 
6  FOXC1  PITX2  7  7  5.73 
7  MAPT  PSEN1  4  6  5.66 
8  OPN1MW  OPN1LW  10  10  5.58 
9  COL4A4  COL4A3  10  10  5.58 
10  RAG2  RAG1  10  10  5.56 
11  SCNN1G  SCNN1B  5  5  5.25 
12  HBB  KLF1  2  10  5.09 
13  COL6A1  COL6A3  10  10  5.08 
14  COL6A2  COL6A3  10  10  5.08 
15  SLC6A19  SLC36A2  10  10  5.08 
16  SLC6A20  SLC36A2  10  10  5.08 
17  SLC6A20  SLC6A19  10  10  5.08 
18  COL6A2  COL6A1  10  10  5.08 
19  GPC3  OFD1  8  7  4.75 
20  LTBP2  CYP1B1  10  7  4.73 
Empirical comparison of edge statistics
In this section we compare our current CPBA model with our original Pure Propensity Poisson (PPP) model on two real datasets: the OMIM disease network and the complementary OMIM gene network. On the whole we find that the CPBA model produces more plausible P-values for the edge-count tests. Conditioning on clusters enables CPBA to detect significant intercluster connections often missed by the PPP model. It also produces more reasonable P-values within clusters since propensities are not artificially deflated by the lack of connections between nodes from different clusters. We now consider how these trends play out in the OMIM disease network and the OMIM gene network.
Disease network top 20 significant connections (PPP model)
Rank  Disease 1  Disease 2  C1  C2  −Log_{10}(P)  

1  Muscular dystrophy-dystroglycanopathy (limb-girdle)  Muscular dystrophy-dystroglycanopathy (congenital)  2  2  13.31 
2  Zellweger syndrome  Adrenoleukodystrophy  14  14  12.06 
3  Leber congenital amaurosis  Retinitis pigmentosa  6  6  10.12 
4  Neuropathy  Charcot-Marie-Tooth disease  12  12  8.99 
5  Blood group  Malaria  13  13  8.76 
6  Ullrich congenital muscular dystrophy  Bethlem myopathy  14  14  8.57 
7  Iminoglycinuria  Hyperglycinuria  14  14  8.57 
8  Usher syndrome  Deafness  8  8  8.48 
9  Hemolytic uremic syndrome  Macular degeneration  10  10  8.24 
10  Bronchiectasis with or without elevated sweat chloride  Pseudohypoaldosteronism  11  11  7.75 
11  Refsum disease  Zellweger syndrome  14  14  7.14 
12  Meckel syndrome  Joubert syndrome  6  6  7.08 
13  Omenn syndrome  Severe combined immunodeficiency  14  14  6.99 
14  Left ventricular noncompaction  Cardiomyopathy  12  12  6.97 
15  Mitochondrial complex I deficiency  Leigh syndrome  2  2  6.85 
16  Alport syndrome  Hematuria  14  14  6.70 
17  Colorblindness  Blue cone monochromacy  14  14  6.70 
18  Atrial fibrillation  Long QT syndrome  2  2  6.64 
19  Cone-rod dystrophy  Retinitis pigmentosa  6  6  6.56 
20  Microphthalmia with coloboma  Microphthalmia  6  6  6.46 
Comparing the intracluster connections, we find that CPBA and PPP produce similar results, with 8 connections present in both lists. However, the P-values of these connections differ sharply under the two models. Since the PPP model essentially assumes a single cluster, estimated propensities trend lower in response to the lack of connections between nodes from different clusters. This results in lower means for the Poisson distributions and more extreme P-values. This phenomenon is especially evident in the pairing between Adrenoleukodystrophy and Zellweger syndrome; in the CPBA model the test for excess edges has −Log_{10}(P) = 8.57, whereas in the PPP model −Log_{10}(P) = 12.06.
Gene network top 20 significant connections (PPP model)
Rank  Gene 1  Gene 2  Cluster 1  Cluster 2  −Log_{10}(P) 

1  HBB  HBA1  2  2  13.87 
2  SHOXY  SHOX  10  10  10.15 
3  SDHD  SDHB  5  5  9.96 
4  SCNN1G  SCNN1B  5  5  9.27 
5  RAG2  RAG1  10  10  8.34 
6  TSC2  TSC1  10  10  8.14 
7  SDHC  SDHB  5  5  7.79 
8  FOXC1  PITX2  7  7  7.54 
9  OPN1MW  OPN1LW  10  10  7.43 
10  COL4A4  COL4A3  10  10  7.43 
11  GDF6  GDF3  7  7  7.29 
12  TERC  TERT  9  9  7.20 
13  CISH  TIRAP  4  4  7.12 
14  GDNF  RET  5  5  7.04 
15  COL6A1  COL6A3  10  10  6.94 
16  COL6A2  COL6A3  10  10  6.94 
17  SLC6A19  SLC36A2  10  10  6.94 
18  SLC6A20  SLC36A2  10  10  6.94 
19  SLC6A20  SLC6A19  10  10  6.94 
20  COL6A2  COL6A1  10  10  6.94 
Simulations for evaluating edge statistics
Hidden relationships between Fortune 500 companies
To illustrate the utility of CPBA in a nonbiological setting, we briefly describe a multigraph model of cross-company management. Specifically, we took the Fortune 500 companies of 2011 and put an edge between two companies for each shared member on their boards of directors. The original data is found in Freebase [37]. As discussed below, the use of the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) for estimating the number of clusters is problematic. For example, the BIC suggests an optimal number of clusters K around 10, while the AIC gives a less plausible value of K > 20. In the following, we assume K = 10 clusters. It is noteworthy that most companies do not cluster into groups of related industries. This makes sense because conflict of interest norms preclude companies in the same field from sharing board members. Overt clustering is consequently discouraged.
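As a hedged sketch of how a BIC could be computed for a CPBA fit (Python; the parameter count of n propensities plus K-choose-2 off-diagonal cluster similarities follows the text, while treating the node pairs as the observations is our own reading):

```python
import math

def cpba_bic(loglik, n_nodes, K, n_pairs):
    """BIC = -2 * loglik + (number of parameters) * log(number of observations).
    Parameters: n_nodes propensities plus K*(K-1)/2 cluster similarities."""
    n_params = n_nodes + K * (K - 1) // 2
    return -2.0 * loglik + n_params * math.log(n_pairs)

# Toy usage: a fit with log-likelihood -100 on 10 nodes (45 pairs), K = 2.
bic = cpba_bic(-100.0, 10, 2, 45)
```
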
Fortune 500 top 10 significant connections
Rank  Company 1  Company 2  −Log_{10}(P)  Edges 

1  U.S. Bancorp  Ecolab  6.01  4 
2  PetSmart  Dean Foods  4.53  3 
3  Sempra Energy  Aecom Technology Corp.  4.39  3 
4  General Motors  DuPont  4.07  3 
5  Cardinal Health  Aon Corp.  4.07  3 
6  Lockheed Martin  Monsanto  4.07  2 
7  Fidelity National Financial  Fidelity National Inf. Services  4.06  2 
8  HewlettPackard  News Corporation  3.89  2 
9  AutoZone  AutoNation, Inc.  3.8  3 
10  United Technologies Corporation  PACCAR  3.74  2 
Relationship to other network models and future research
This reformulation of the model is consistent with construction of an MM algorithm for parameter estimation [45]. Future research should explore the topological properties of such models.
The edge probability P_{ i j } of the Barabási-Albert (BA) model importantly assumes that node i with connectivity k_{ i } was added later to the growing network than node j (implying that k_{ i } < k_{ j }). In view of this temporal assumption, P_{ i j } is not symmetric in i and j; it also contains no parameters to capture clustering. Thus, there is no obvious relationship between the BA model and the CPBA approximation of a network. Future research can investigate how to parameterize preferential attachment so that the resulting probability P_{ i j } of finding an edge fits well to the CPBA.
Relationship to other clustering methods
Although the MM algorithm that estimates the CPBA parameters naturally generates a clustering method, CPBA is not just another clustering method. Our applications highlight the utility of the parameter estimates and the resulting likelihood-based tests. CPBA not only provides a sparse parametrization of a general similarity matrix, but it also identifies hub nodes and clusters and enables significance tests for excess edges (between nodes) and shared similarities (between clusters). We do not claim that CPBA-based clustering outperforms existing clustering methods in the simple task of clustering.
Substitutes for CPBA clustering include hierarchical clustering, partitioning around medoids [48], spectral clustering [49], mixture models [50], component models [51], and many more [52–56]. Because CPBA can be interpreted as a generalization of weighted correlation network methods, there is no need to invoke it instead of WGCNA when it comes to coexpression network applications. In modeling relationships between quantitative variables, one can use a host of other methods, for example sparse Gaussian graphical models [57, 58], Bayesian networks, and structural equation models. CPBA is not meant to replace these powerful approaches for modeling relationships between quantitative variables. Its main attraction is that it applies to a general similarity measure. Since input data sometimes consists of a similarity (or dissimilarity) measure, CPBA fills a useful niche.
Conclusions
The current paper introduces the CPBA model (cluster and propensity based approximation) for general similarity measures and sketches an efficient MM algorithm for estimation of the CPBA parameters. These advances will prove valuable in dissecting networks involving functional or evolutionary modules. The CPBA model is attractive for several reasons. First, it invokes relatively few parameters while providing sufficient flexibility for modeling observed similarities. Second, the cluster similarity parameters are good at revealing higher-order relationships between clusters. The row sum of the cluster similarity matrix can be used to define a cluster connectivity measure and to identify hub clusters such as the neoplasm hub in the disease network. Third, the CPBA model naturally generalizes network approximations that are already part of scientific practice, namely, the propensity based approach in multigraph models, the conformity decomposition in weighted networks, and the eigenvector based approximation in correlation networks. Fourth, the connections to the MM algorithm make the model well adapted to high-dimensional optimization. Fifth, the Poisson multigraph version of the model enables assessment of the statistical significance of edge counts and similarities between clusters. Sixth, likelihood-based models such as the Poisson multigraph model provide a rational basis for estimating the number of clusters. While it is beyond our scope to evaluate different methods for estimating the number of clusters in a data set, it is worth mentioning that our R implementation allows users to initialize clusters via hierarchical clustering. This tactic obviates the need to prespecify the number of clusters.
Using simulated clusters in the plane and simulated coexpression networks, we demonstrate that CPBA generalizes existing methods. The planar examples show how a propensity can be intuitively seen as a measure of a node’s closeness to its cluster’s center and how a cluster similarity can be seen as a measure of proximity between two clusters. The simulated gene expression dataset exposes the CPBA model’s close ties to the previously studied concepts of intramodular connectivity, module eigengenes, and eigengene adjacency. Our analysis of real gene expression data reassures us that CPBA clustering results are similar to those of a benchmark method used in coexpression network analysis. The CPBA propensity parameters mirror the module eigengene based connectivity kME, and the cluster similarity measures mimic the network eigengenes. In our view, the main value of the CPBA model lies in generalizing correlation network methods.
To illustrate the versatility of CPBA, we applied it to the gene and disease networks of OMIM. The evidence that CPBA identifies biologically meaningful clusters is readily apparent in the significant enrichment of MeSH categories in the disease clusters and in the significant enrichment of GO terms in the gene clusters. While many other clustering procedures could have been used, CPBA has the advantage of dealing with dissimilarity measures as opposed to numeric input variables. It also provides Poisson likelihood based significance tests for edge counts (either pairs of genes and or pairs of diseases) that respect the underlying cluster structure. Finally, the row sums of the cluster similarity measure can be used to define hub clusters, and the estimated propensities can be used to define hub nodes. As we hoped, there were biologically meaningful ties between significantly connected pairs of genes and diseases. Several of these biologically plausible explanations are discussed in the text.
Although our examples are mainly biological, one can apply CPBA in many other contexts. For example, we employed CPBA to highlight shared board members among the Fortune 500 companies. This example illustrates how significant connections mirror the underlying ties between nodes. The edge count significance test suggests that the antitrust suit against GM and DuPont was no accident. To its credit, CPBA not only generalizes correlation network methods to general similarity matrices, but it also provides a valuable extension of random multigraph methods to weighted and unweighted multigraph data. CPBA is not just another clustering procedure but offers unique test statistics that permit identification of hub clusters and significant edge counts. We anticipate that the CPBA model will prove attractive to a wide range of scientists.
Methods
Maximizing the Poisson log-likelihood based objective function
The Poisson multigraph model treats the edge count between nodes i and j as Poisson distributed with mean ${r}_{{c}_{i}{c}_{j}}{p}_{i}{p}_{j}$, so that, up to an additive constant, the log-likelihood to be maximized is
$L(R,p)=\sum_{i<j}\left[{A}_{ij}\ln\left({r}_{{c}_{i}{c}_{j}}{p}_{i}{p}_{j}\right)-{r}_{{c}_{i}{c}_{j}}{p}_{i}{p}_{j}\right],$
where ${r}_{{c}_{i}{c}_{j}}$ is the cluster similarity between clusters c_{ i } and c_{ j }, p_{ i } is the propensity of node i, and A_{ i j }=n_{ i j } is the number of connections between nodes i and j.
We expect the estimated r_{ a b } to occur within the unit interval [0,1] because edge formation is enhanced within clusters.
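To make the objective concrete, here is a minimal sketch (in Python, for illustration only; the authors' implementation is the R package PropClust) that evaluates the Poisson log-likelihood up to an additive constant, under the mean structure in which the edge count A_ij is Poisson with mean r_{c_i c_j} p_i p_j:

```python
import numpy as np

def cpba_poisson_loglik(A, c, p, R):
    """Poisson log-likelihood (up to an additive constant) of edge counts A
    under mean r_{c_i c_j} p_i p_j.  A is a symmetric matrix of counts,
    c holds cluster labels, p propensities, and R cluster similarities."""
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            lam = R[c[i], c[j]] * p[i] * p[j]
            total += A[i, j] * np.log(lam) - lam
    return total
```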
In practice, this MM algorithm may require an excessive number of iterations to converge. To accelerate convergence, we employ a quasi-Newton extrapolation specifically designed for high-dimensional problems (Methods and [62]). The overall ascent algorithm (outer iterations) on R and p may also be slow to converge; it can be accelerated by the same extrapolation scheme. Accelerating both the inner and outer iterations gives a fast, numerically stable procedure for estimating R and p for c fixed.
Minimizing the Frobenius norm based objective function
As in the Poisson case, acceleration is advisable for both the inner MM iterations and the outer block descent iterations. The same quasi-Newton extrapolation [62] is pertinent and gives a fast, numerically stable procedure for estimating R and p for c fixed.
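For comparison, a minimal sketch of the Frobenius objective, assuming the CPBA approximation A_ij ≈ r_{c_i c_j} p_i p_j and summation over distinct pairs (the exact form is given by the equations referenced above):

```python
import numpy as np

def cpba_frobenius(A, c, p, R):
    """Frobenius objective: squared error between the observed similarities
    A_ij and the CPBA approximation r_{c_i c_j} p_i p_j, summed over i < j."""
    n = len(p)
    return sum((A[i, j] - R[c[i], c[j]] * p[i] * p[j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```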
Model initialization
Initial cluster assignment
Many algorithms exist for creating initial cluster assignments [56]. For most datasets these assignments affect only the time to convergence, not the converged solution. Our R software implements hierarchical clustering and does not require prespecifying the number of clusters. More specifically, our software applies average linkage hierarchical clustering with dynamic branch cutting [24]. Dissimilarities are set equal to 1 minus similarities.
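The initialization can be sketched as follows. This toy average-linkage agglomeration with a fixed merge threshold `d_max` is a simplified stand-in for the average linkage hierarchical clustering with dynamic branch cutting used by the software:

```python
import numpy as np

def init_clusters(S, d_max):
    """Average-linkage agglomeration on the dissimilarity D = 1 - S.
    Clusters are merged greedily until the closest pair of clusters is
    farther apart than d_max; returns a label for each node."""
    n = S.shape[0]
    D = 1.0 - S
    clusters = [[i] for i in range(n)]

    def avg_link(a, b):
        # average linkage: mean dissimilarity over all cross-cluster pairs
        return np.mean([D[i, j] for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(avg_link(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        d, a, b = min(pairs)
        if d > d_max:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```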
Initial propensities
Cluster similarity parameters
Because the block updates (Eq. 11) and (Eq. 13) for the cluster similarity parameters only depend on cluster assignment and propensities, it is natural to use those updates for initialization as well.
Clustering algorithm
1. Choose the objective function (Frobenius or Poisson).
2. Initialize the cluster assignment, for example, via hierarchical clustering.
3. Initialize the propensity vector p by (Eq. 16) or (Eq. 17) and the cluster similarity matrix R by (Eq. 11) or (Eq. 13).
4. Parameter estimation: given cluster assignments, re-estimate parameters through the updates (Eq. 11) and (Eq. 12) or (Eq. 13) and (Eq. 15). Declare convergence when the objective function changes by less than a threshold, say 10^{−5}.
5. Cluster reassignment:
 (a) Randomly permute the nodes.
 (b) For each node taken in order, try all possible cluster reassignments for the node.
 (c) Assign the node to the cluster that leads to the biggest improvement in the objective function.
 (d) Repeat step 5 until no nodes are reassigned.
6. Repeat steps 4 and 5 until no nodes are reassigned.
7. (Optional) Repeat steps 1-5 for other cluster numbers and use a cluster number estimation procedure for choosing the number of clusters.
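The cluster reassignment sweep (step 5) can be sketched as follows, assuming an `objective` callable that scores a complete label vector (written here for maximization; for the Frobenius objective one would negate it). This is an illustrative Python sketch, not the PropClust implementation:

```python
import numpy as np

def reassign_clusters(objective, c, K, rng):
    """Cluster reassignment sweeps: visit nodes in random order and move each
    node to the cluster label that most improves the objective; sweeps repeat
    until no node changes cluster."""
    c = np.asarray(c).copy()
    while True:
        moved = False
        for i in rng.permutation(len(c)):        # (a) randomly permute the nodes
            old = c[i]
            best_k, best_val = old, objective(c)
            for k in range(K):                   # (b) try every cluster for node i
                if k == old:
                    continue
                c[i] = k
                val = objective(c)
                if val > best_val:               # (c) keep the best reassignment
                    best_k, best_val = k, val
            c[i] = best_k
            moved = moved or (best_k != old)
        if not moved:                            # (d) stop when no node moves
            return c
```

In the full algorithm this sweep alternates with the parameter re-estimation of step 4 until both stabilize.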
QuasiNewton acceleration
Quasi-Newton acceleration approximates the differential dF(x_{ n }) by a low-rank matrix M and explicitly forms the inverse (I − M)^{−1}.
where M = dF(x_{ ∞ }). We abbreviate the secant requirement as Mu = v, where u = F(x_{ n })−x_{ n } and v = F∘F(x_{ n })−F(x_{ n }). To improve the approximation of M, one can impose several secant constraints Mu_{ i } = v_{ i } for i = 1,…,q. These are expressed in matrix form as MU = V, where the columns of U and V are the vectors u_{ i } and v_{ i }. For our purposes the value q = 6 works well.
This update involves inversion of the small q × q matrix U^{ t }U − U^{ t }V; all other operations reduce to matrix times vector multiplications.
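A schematic Python implementation of this extrapolation is given below. The accelerated step is written in one algebraically consistent form, x ← F(x) + V(U^{t}U − U^{t}V)^{−1}U^{t}(F(x) − x), which matches the q × q inversion described above; this sketch is an assumption-laden illustration, not the authors' code:

```python
import numpy as np

def qn_accelerate(F, x0, q=6, n_iter=15, tol=1e-8):
    """Quasi-Newton (secant) acceleration of the fixed-point iteration
    x -> F(x).  M = dF is approximated via the secant pairs M u_i = v_i,
    and the Newton-type step only requires solving a small q x q system."""
    x = np.asarray(x0, dtype=float).copy()
    us, vs = [], []
    for _ in range(n_iter):
        Fx = F(x)
        u = Fx - x                    # u = F(x) - x
        if np.linalg.norm(u) < tol:   # fixed point (numerically) reached
            return x
        v = F(Fx) - Fx                # v = F(F(x)) - F(x)
        us.append(u)
        vs.append(v)
        us, vs = us[-q:], vs[-q:]     # keep the q most recent secant pairs
        U, V = np.column_stack(us), np.column_stack(vs)
        S = U.T @ U - U.T @ V         # small q x q matrix to invert
        try:
            x = Fx + V @ np.linalg.solve(S, U.T @ u)   # accelerated step
        except np.linalg.LinAlgError:
            x = Fx                    # fall back to the unaccelerated update
    return x
```

For an affine map F, the step becomes exact once q independent secants span the relevant directions, which is why the extrapolation can converge in a handful of iterations where plain iteration would need hundreds.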
Estimating the number of clusters
Estimating the number of clusters is the Achilles heel of cluster analysis. While this topic is beyond our scope, it is worth mentioning that an advantage of model based approaches is that likelihood criteria can be brought to bear. Since adding clusters entails more parameters, it is tempting to use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to estimate the number of clusters in the Poisson model [64, 65]. Both criteria balance the tradeoff between the number of parameters and the fit of the model. Specifically, these methods choose the number of clusters K that minimizes AIC = −2 ln(L) + 2c or BIC = −2 ln(L) + c ln(n), respectively, where c is the number of parameters, L is the likelihood, and n is the sample size. We caution the reader that AIC and BIC may be inappropriate for the present task because both criteria invoke strong assumptions. For example, AIC is derived by assuming a regular model, for instance, a linear model with Gaussian noise. Hence, AIC may be inappropriate for models with latent variables such as cluster labels. BIC may be inappropriate because our approach is frequentist rather than Bayesian. A review of the limitations and utility of these criteria can be found in [66].
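For completeness, the two criteria are trivial to compute once the maximized log-likelihood is in hand (a generic sketch, not tied to any particular model):

```python
import numpy as np

def aic_bic(loglik, n_params, n_obs):
    """AIC = -2 ln L + 2c and BIC = -2 ln L + c ln n, where c is the number
    of parameters and n the sample size; smaller values are preferred."""
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * np.log(n_obs)
    return aic, bic
```

One would fit the model for each candidate number of clusters K, record the maximized log-likelihood, and pick the K minimizing the chosen criterion, bearing in mind the caveats above.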
Ethical approval
This article involved publicly available human data sets that are completely anonymized. The study is therefore exempt from requiring ethics approval. No animal data were used. We fully comply with the Declaration of Helsinki and the “Animal Research: Reporting In Vivo Experiments” (ARRIVE) guidelines.
Availability and requirements
Project name: PropClust R package
Project home page: http://www.genetics.ucla.edu/labs/horvath/PropClust
Operating system(s): Platform independent
Programming language: R
Licence: GNU GPL 3
The propensity based clustering method propensityDecomposition is implemented in the R package PropClust. The package also contains the function CPBAdecomp for carrying out the propensity decomposition of a network.
Abbreviations
BA: Barabási–Albert
CPBA: Cluster and propensity based approximation
ER: Erdős–Rényi
GO: Gene Ontology
GRN: Growing random network
kME: Connectivity based on the module eigenvector or eigengene
MeSH: Medical Subject Headings
MM: Minorization-maximization or majorization-minimization
PPP: Pure propensity Poisson
SFT: Scale-free topology
Declarations
Acknowledgements
This research was supported in part by United States Public Health Service grants GM53275, HG006139, MH59490, and UCLA CTSI Grant UL1TR000124.
Authors’ Affiliations
References
 von Luxburg U: A tutorial on spectral clustering. Stat Comput. 2007, 17 (4): 395-416. 10.1007/s11222-007-9033-z.View Article
 Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView Article
 Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302 (5643): 249-255. 10.1126/science.1087447.PubMedView Article
 Oldham M, Konopka G, Iwamoto K, Langfelder P, Kato T, Horvath S, Geschwind D: Functional organization of the transcriptome in human brain. Nat Neurosci. 2008, 11 (11): 1271-1282. 10.1038/nn.2207.PubMedPubMed CentralView Article
 Zhang B, Horvath S: A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol. 2005, 4: 17
 Huang Y, Li H, Hu H, Yan X, Waterman M, Huang H, Zhou X: Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics. 2007, 23 (13): i222-i229. 10.1093/bioinformatics/btm222.PubMedView Article
 Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, Cloughesy T, Nelson S, Mischel P: Analysis of oncogenic signaling networks in Glioblastoma identifies ASPM as a novel molecular target. Proc Natl Acad Sci USA. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.PubMedPubMed CentralView Article
 Carlson M, Zhang B, Fang Z, Mischel P, Horvath S, Nelson SF: Gene connectivity, function, and sequence conservation: predictions from modular yeast coexpression networks. BMC Genomics. 2006, 7 (7): 40PubMedPubMed CentralView Article
 Oldham M, Horvath S, Geschwind D: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A. 2006, 103 (47): 17973-17978. 10.1073/pnas.0605938103.PubMedPubMed CentralView Article
 Chen L, EmmertStreib F, Storey J: Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol. 2007, 8 (219):
 Keller M, Choi Y, Wang P, Belt Davis D, Rabaglia M, Oler A, Stapleton D, Argmann C, Schueler K, Edwards S, Steinberg H, Chaibub Neto E, Kleinhanz R, Turner S, Hellerstein MK, Schadt E, Yandell B, Kendziorski C, Attie A: A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility. Genome Res. 2008, 18 (5): 706-716. 10.1101/gr.074914.107.PubMedPubMed CentralView Article
 Dawson J, Ye S, Kendziorski C: An empirical Bayesian framework for discovering differential coexpression. Bioinformatics. 2012, 68 (2): 455-465.
 de la Fuente A: From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010, 26 (7): 326-333. 10.1016/j.tig.2010.05.001.PubMedView Article
 Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, TF C, Nelson S, Mischel P: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a novel molecular target. Proc Natl Acad Sci U S A. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.PubMedPubMed CentralView Article
 Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A, Horvath S: Integrating genetics and network analysis to characterize genes related to mouse weight. PloS Genet. 2006, 2 (8):
 Fuller T, Ghazalpour A, Aten J, Drake T, Lusis A, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007, 18 (6-7): 463-472. 10.1007/s00335-007-9043-3.PubMedPubMed CentralView Article
 Gargalovic PS, Gharavi NM, Clark MJ, Pagnon J, Yang WP, He A, Truong A, Baruch-Oren T, Berliner JA, Kirchgessner TG, Lusis AJ: The unfolded protein response is an important regulator of inflammatory genes in endothelial cells. Arterioscler Thromb Vasc Biol. 2006, 26 (11): 2490-2496. 10.1161/01.ATV.0000242903.41158.a1. [http://atvb.ahajournals.org/cgi/content/abstract/26/11/2490]PubMedView Article
 Horvath S, Dong J: Geometric interpretation of gene coexpression network analysis. PloS Comput Biol. 2008, 4: 8. 10.1371/journal.pcbi.0040008.View Article
 Dong J, Horvath S: Understanding network concepts in modules. BMC Syst Biol. 2007, 1: 24. 10.1186/1752-0509-1-24.PubMedPubMed CentralView Article
 Ranola J, Ahn S, Sehl ME, Smith D, Lange K: A Poisson model for random multigraphs. Bioinformatics. 2010, 26 (16): 2004-2011. 10.1093/bioinformatics/btq309.PubMedPubMed CentralView Article
 Horvath S: Weighted Network Analysis: Applications in Genomics and Systems Biology. 1 edition. 2011, New York: SpringerView Article
 Deeds E, Ashenberg O, Shakhnovich E: A simple physical model for scaling in protein-protein interaction networks. Proc Natl Acad Sci U S A. 2006, 103 (2): 311-316. 10.1073/pnas.0509715102.PubMedPubMed CentralView Article
 Langfelder P, Horvath S: Eigengene networks for studying the relationships between coexpression modules. BMC Syst Biol. 2007, 1: 54. 10.1186/1752-0509-1-54.PubMedPubMed CentralView Article
 Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the dynamic tree cut library for R. Bioinformatics. 2007, 24 (5): 719-720.PubMedView Article
 Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008, 9: 559. 10.1186/1471-2105-9-559.PubMedPubMed CentralView Article
 McKusickNathans Institute of Genetic Medicine JHU: Online mendelian inheritance in man, OMIM®. [http://omim.org/]
 Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The human disease network. Proc Natl Acad Sci. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104. [http://www.pnas.org/content/104/21/8685.abstract]PubMedPubMed CentralView Article
 Rogers F: Medical subject headings. Bull Med Libr Assoc. 1963, 51: 114-116.PubMedPubMed Central
 Steinberg S, Dodt G, Raymond G, Braveman N, Moser A, Moser H: Peroxisome biogenesis disorders. Biochimica et Biophysica Acta - Mol Cell Res. 2006, 1763 (12): 1733. 10.1016/j.bbamcr.2006.09.010.View Article
 Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005, 21 (16): 3448-3449. 10.1093/bioinformatics/bti551. [http://bioinformatics.oxfordjournals.org/content/21/16/3448.abstract]PubMedView Article
 Shaanan B: Structure of human oxyhaemoglobin at 2.1 Å resolution. J Mol Biol. 1983, 171: 31-59. 10.1016/S0022-2836(83)80313-1.PubMedView Article
 Stelzl U, et al: A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122 (6): 957-968. 10.1016/j.cell.2005.08.029.PubMedView Article
 U.S. National Institute of Health’s informational page on Usher syndrome. [http://www.nidcd.nih.gov/health/hearing/pages/usher.aspx]
 U.S. National Institute of Health’s informational page on Waardenburg syndrome. [http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0002401/]
 U.S. National Institute of Health’s informational page on Craniofacialdeafnesshand syndrome. [http://ghr.nlm.nih.gov/condition/craniofacialdeafnesshandsyndrome]
 Alfmova M, Lezhelko T, Golimbet V, Korovaltseva G, Lavrushkina O, Kolesina N, Frolova L, Muratova A, Abramova L, Kaleda V: Investigation of association of the brain-derived neurotrophic factor (BDNF) and a serotonin receptor 2A (5-HTR2A) genes with voluntary and involuntary attention in schizophrenia. Zh Nevrol Psikhiatr Im S S Korsakova. 2008, 108 (4): 62-69.
 Freebase open data resource. [http://www.freebase.com/]
 DuPont Wikipedia entry. [http://en.wikipedia.org/wiki/DuPont]
 Erdős P, Rényi A: On random graphs. Publicationes Mathematicae. 1959, 6: 290-297.
 Watts D, Strogatz S: Collective dynamics of ‘small-world’ networks. Nature. 1998, 393 (6684): 440-442. 10.1038/30918.PubMedView Article
 Barabasi A, Albert R: Emergence of scaling in random networks. Science. 1999, 286 (5439): 509-512. 10.1126/science.286.5439.509.PubMedView Article
 Albert R, Barabasi A: Statistical mechanics of complex networks. Rev Mod Phys. 2002, 74: 47-97. 10.1103/RevModPhys.74.47.View Article
 Newman MEJ, Strogatz SH, Watts DJ: Random graphs with arbitrary degree distributions and their applications. Phys Rev E. 2001, 64: 026118. [http://link.aps.org/doi/10.1103/PhysRevE.64.026118]View Article
 Krapivsky PL, Redner S, Leyvraz F: Connectivity of growing random networks. Phys Rev Lett. 2000, 85: 4629-4632. 10.1103/PhysRevLett.85.4629. [http://link.aps.org/doi/10.1103/PhysRevLett.85.4629]PubMedView Article
 Lange K: Numerical Analysis for Statisticians. 2010, New York: SpringerView Article
 Strogatz SH: Exploring complex networks. Nature. 2001, 410 (6825): 268-276. 10.1038/35065725.PubMedView Article
 Durrett R: Random Graph Dynamics. 2006, New York: Cambridge University PressView Article
 Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: John Wiley and Sons, IncView Article
 Zhou D, Burges CJC: Spectral clustering and transductive learning with multiple views. Proceedings of the 24th international conference on Machine learning, ICML ’07. 2007, New York: ACM, 1159-1166. [http://doi.acm.org/10.1145/1273496.1273642]View Article
 Newman MEJ, Leicht EA: Mixture models and exploratory analysis in networks. Proc Natl Acad Sci U S A. 2007, 104 (23): 9564-9569. 10.1073/pnas.0610537104.PubMedPubMed CentralView Article
 Sinkkonen J, Aukia J, Kaski S: Component models for large networks. arXiv eprints. 2008, arXiv: 0803.1628
 Hofman JM, Wiggins CH: Bayesian approach to network modularity. Phys Rev Lett. 2008, 100 (25): 258701PubMedPubMed CentralView Article
 Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N: Learning systems of concepts with an infinite relational model. AAAI. 2006, United States: AAAI Press, 381-388.
 Airoldi EM, Blei DM, Fienberg SE, Xing EP: Mixed membership stochastic blockmodels. J Mach Learn Res. 2008, 9: 1981-2014.PubMedPubMed Central
 Newman M: Modularity and community structure in networks. PNAS. 2006, 103: 8577-8582. 10.1073/pnas.0601602103.PubMedPubMed CentralView Article
 Schaeffer SE: Graph clustering. Comput Sci Rev. 2007, 1: 27-64. 10.1016/j.cosrev.2007.05.001. [http://www.sciencedirect.com/science/article/pii/S1574013707000020]View Article
 Yin J, Li H: A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat. 2011, 5 (4): 2630-2650. 10.1214/11-AOAS494.PubMedPubMed CentralView Article
 Xulvi-Brunet R, Li H: Coexpression networks: graph properties and topological comparisons. Bioinformatics. 2010, 26 (2): 205-214. 10.1093/bioinformatics/btp632.PubMedPubMed CentralView Article
 Hunter D, Lange K: A tutorial on MM algorithms. Am Stat. 2004, 58: 30-37. 10.1198/0003130042836.View Article
 Lange K: Optimization. 2004, New York: SpringerView Article
 Wu T, Lange K: The MM alternative to EM. Stat Sci. 2010, 25 (4): 492-505. 10.1214/08-STS264.View Article
 Zhou H, Alexander D, Lange K: A quasi-Newton acceleration for high-dimensional optimization algorithms. Stat Comput. 2011, 21: 261-273. 10.1007/s11222-009-9166-3.PubMedPubMed CentralView Article
 Lange K: Numerical Analysis for Statisticians. 1999, New York: SpringerVerlag
 Akaike H: A new look at the statistical model identification. IEEE Trans Automatic Control. 1974, 19 (6): 716-723. 10.1109/TAC.1974.1100705.View Article
 Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6 (2): 461-464. 10.1214/aos/1176344136. [http://www.jstor.org/stable/2958889]View Article
 Watanabe S: Algebraic Geometry and Statistical Learning Theory. 2009, Cambridge: Cambridge University PressView Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.