Understanding network concepts in modules

Background
Network concepts are increasingly used in biology and genetics. For example, the clustering coefficient has been used to understand network architecture; the connectivity (also known as degree) has been used to screen for cancer targets; and the topological overlap matrix has been used to define modules and to annotate genes. Dozens of potentially useful network concepts are known from graph theory.

Results
Here we study network concepts in special types of networks, which we refer to as approximately factorizable networks. In these networks, the pairwise connection strength (adjacency) between two network nodes can be factored into node-specific contributions, named node 'conformity'. The node conformity turns out to be highly related to the connectivity. To provide a formalism for relating network concepts to each other, we define three types of network concepts: fundamental, conformity-based, and approximate conformity-based concepts. Fundamental concepts include the standard definitions of connectivity, density, centralization, heterogeneity, clustering coefficient, and topological overlap. The approximate conformity-based analogues of fundamental network concepts have several theoretical advantages. First, they allow one to derive simple relationships between seemingly disparate network concepts. For example, we derive simple relationships between the clustering coefficient, the heterogeneity, the density, the centralization, and the topological overlap. Second, they allow one to show that fundamental network concepts can be approximated by simple functions of the connectivity in module networks.

Conclusion
Using protein-protein interaction, gene co-expression, and simulated data, we show that a) many networks comprised of module nodes are approximately factorizable and b) in these types of networks, simple relationships exist between seemingly disparate network concepts.
Our results are implemented in freely available R software code, which can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/ModuleNetworks

Here we describe empirical studies regarding the relationship between network concepts and module size. This additional file accompanies our main article 'Understanding network concepts in modules'.
We report results for the 4 networks described in our main article: Drosophila PPI, yeast PPI, yeast weighted gene co-expression network, and the corresponding unweighted co-expression network. A description of the datasets and the modules can be found in the main article.
We use scatter plots and correlation coefficients to relate network concepts to the underlying module sizes (see Figures 1, 2, 3 and 4). The colors of the points correspond to the module colors. Grey denotes the improper module, which comprises the genes outside any proper module.
Each section corresponds to one dataset, and each subsection examines the approximation of one fundamental network concept, either by its approximate CF-based analogue or by one of our two observations (Observations 2 and 3 in the main article). The first 4 subsections study the relationships between the fundamental network concepts and their approximate CF-based analogues for density, centralization, heterogeneity, and clustering coefficient. The next 2 subsections study the relationships $\mathrm{mean}(\mathrm{ClusterCoef}) \approx (1 + \mathrm{Heterogeneity}^2)^2 \times \mathrm{Density}$ and $\mathrm{TopOverlap}_{[1]j} \approx \frac{k_{[1]}}{n}\,(1 + \mathrm{Heterogeneity}^2)$, which are equations (11) and (14) in the main article. We show that the reported relationships between network concepts remain significant even after adjusting the analysis for module size. To adjust for module size, we use multiple linear regression models. For example, we regress the fundamental network concept on its approximate CF-based analogue and module size. While module size is sometimes a significant predictor, we find that it is less significant than the approximate CF-based analogue. The fact that the approximate CF-based analogue remains a highly significant covariate in the multiple regression model demonstrates that the relationship between fundamental network concepts and CF-based approximations is not trivially due to the underlying module size.
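To make the two observations concrete, the following Python sketch evaluates both sides of equations (11) and (14) on a small, exactly factorizable toy network. The conformity values, the network size, and all function names are illustrative assumptions and are not part of our R software. The network concepts follow the standard definitions (connectivity as the row sum of the adjacency matrix, heterogeneity as the coefficient of variation of the connectivity, and the weighted clustering coefficient and topological overlap); the adjacency matrix is assumed to be symmetric with entries in [0, 1].

```python
import math

def connectivity(A):
    """Connectivity k_i: sum of adjacencies between node i and all other nodes."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def density(A):
    """Mean off-diagonal adjacency."""
    n = len(A)
    return sum(connectivity(A)) / (n * (n - 1))

def heterogeneity(A):
    """Coefficient of variation of the connectivity."""
    k = connectivity(A)
    n = len(k)
    return math.sqrt(max(0.0, n * sum(x * x for x in k) / sum(k) ** 2 - 1))

def clustering_coefficients(A):
    """Weighted clustering coefficient of every node."""
    n = len(A)
    coefs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        num = sum(A[i][j] * A[j][l] * A[l][i]
                  for j in others for l in others if l != j)
        den = (sum(A[i][j] for j in others) ** 2
               - sum(A[i][j] ** 2 for j in others))
        coefs.append(num / den)
    return coefs

def hub_topological_overlap(A):
    """Topological overlap between the most connected node [1] and every other node j."""
    n = len(A)
    k = connectivity(A)
    hub = max(range(n), key=lambda i: k[i])
    overlaps = []
    for j in range(n):
        if j == hub:
            continue
        shared = sum(A[hub][l] * A[j][l] for l in range(n) if l != hub and l != j)
        overlaps.append((shared + A[hub][j]) / (min(k[hub], k[j]) + 1 - A[hub][j]))
    return k[hub], overlaps

# Toy network (assumption): a_ij = CF_i * CF_j with conformities spread over [0.3, 0.9].
n = 50
cf = [0.3 + 0.6 * i / (n - 1) for i in range(n)]
A = [[1.0 if i == j else cf[i] * cf[j] for j in range(n)] for i in range(n)]

het2 = heterogeneity(A) ** 2
cc = clustering_coefficients(A)
mean_cc = sum(cc) / n
approx_cc = (1 + het2) ** 2 * density(A)   # right-hand side of equation (11)

k_hub, overlaps = hub_topological_overlap(A)
approx_tom = k_hub / n * (1 + het2)        # right-hand side of equation (14)

print(f"mean(ClusterCoef) = {mean_cc:.4f}, equation (11) gives {approx_cc:.4f}")
print(f"TopOverlap[1]j lies in [{min(overlaps):.4f}, {max(overlaps):.4f}], "
      f"equation (14) gives {approx_tom:.4f}")
```

Because the toy network is exactly factorizable, both approximations should hold up to terms that shrink as the number of nodes grows. In real modules the agreement is expected to depend on how factorizable the module is.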
We also use multiple linear regression models to argue that the relationship among fundamental network concepts remains highly significant even after adjusting for module size.
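The module-size adjustment described above amounts to an ordinary least-squares fit with two covariates. The sketch below is a minimal plain-Python illustration on synthetic data. The simulated values, the random seed, and the function names `invert` and `ols` are assumptions made for illustration; they are not taken from our R tutorial or from the real networks.

```python
import math
import random

def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(M)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(p):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def ols(covariates, y):
    """Least-squares fit of y on an intercept plus the given covariate columns.

    Returns (coefficients, t_statistics, r_squared); index 0 is the intercept.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in covariates] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    ssr = sum(r * r for r in resid)
    ybar = sum(y) / n
    sst = sum((v - ybar) ** 2 for v in y)
    sigma2 = ssr / (n - p)  # residual variance estimate
    t = [beta[a] / math.sqrt(sigma2 * inv[a][a]) for a in range(p)]
    return beta, t, 1 - ssr / sst

# Synthetic modules (assumption): the fundamental concept tracks its approximate
# CF-based analogue up to small noise and does not depend on module size.
rng = random.Random(1)
module_size = [float(rng.randint(20, 400)) for _ in range(60)]
cf_analogue = [rng.uniform(0.05, 0.6) for _ in range(60)]
fundamental = [x + rng.gauss(0.0, 0.01) for x in cf_analogue]

beta, t, r2 = ols([cf_analogue, module_size], fundamental)
for name, b, tv in zip(["intercept", "approx. CF-based analogue", "module size"], beta, t):
    print(f"{name:>26}: estimate = {b: .4f}, t = {tv: .2f}")
print(f"R-squared = {r2:.4f}")
```

Comparing the t statistics of the two covariates mirrors the comparison reported in the sections below: a covariate can reach a small p-value when the number of observations is large, so the relative size of the t statistics is the more informative quantity.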
In the last subsection, we report the relationship between module size and factorizability F(A). We find that larger modules tend to be less factorizable than smaller modules.
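As a rough illustration of what the factorizability measures, the sketch below assumes the form $F(A) = 1 - \sum_{i \neq j}(a_{ij} - CF_i CF_j)^2 / \sum_{i \neq j} a_{ij}^2$ and takes the conformity vector as given rather than estimating it from the adjacency matrix. Both choices are assumptions made for illustration, and the function name and toy numbers are hypothetical; the main article should be consulted for the precise definition and for the estimation of the conformity.

```python
def factorizability(A, cf):
    """Share of off-diagonal adjacency explained by the products CF_i * CF_j.

    Assumed form: 1 - sum_{i != j} (a_ij - cf_i * cf_j)^2 / sum_{i != j} a_ij^2.
    For a fixed, supplied conformity vector this is only a lower bound on the
    value obtained when the conformity is chosen optimally.
    """
    n = len(A)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum((A[i][j] - cf[i] * cf[j]) ** 2 for i, j in pairs)
    den = sum(A[i][j] ** 2 for i, j in pairs)
    return 1 - num / den

# Toy example (assumption): four nodes with hypothetical conformities.
cf = [0.9, 0.8, 0.6, 0.5]
exact = [[1.0 if i == j else cf[i] * cf[j] for j in range(4)] for i in range(4)]

# Weaken one connection so that the network is no longer exactly factorizable.
perturbed = [row[:] for row in exact]
perturbed[0][1] = perturbed[1][0] = 0.2

print(factorizability(exact, cf))
print(factorizability(perturbed, cf))
```

The exactly factorizable network attains the maximal value, while the perturbed network scores lower because one adjacency no longer equals the product of the two node conformities.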
The data and a corresponding software tutorial can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/ModuleNetworks

Drosophila PPI Network
The relationships between fundamental network concepts and module sizes for the Drosophila protein-protein interaction network are illustrated in Figure 1.

The most significant predictor of density is the approximate CF-based density; module size is not significant in this model. The regression model explains 99 percent of the variance (R-squared measure), and the relationship between the fundamental network concept and the covariates is highly significant (p < 2.2e-16).

The most significant predictor of centralization is the approximate CF-based centralization; module size is not significant in this model.

The most significant predictor of heterogeneity is the approximate CF-based heterogeneity; module size is not significant in this model.

The most significant predictor of the clustering coefficient is the approximate CF-based clustering coefficient. While module size is significant, the corresponding Student t statistic (3.105) is much smaller than that of the approximate CF-based clustering coefficient (51.05). This suggests that the significant p-value (0.00194) of module size is partly due to the large number of observations (1371).

Drosophila PPI: Clustering Coefficient and the Approximation by Equation (11)
In our article, we derive the following relationship in approximately factorizable module networks:
$\mathrm{mean}(\mathrm{ClusterCoef}) \approx (1 + \mathrm{Heterogeneity}^2)^2 \times \mathrm{Density}$.
Here we study whether this relationship remains significant after adjusting the analysis for module size. In the regression model, the covariate 'Approx. Clustering Coefficients' denotes $(1 + \mathrm{Heterogeneity}^2)^2 \times \mathrm{Density}$. We find that the most significant predictor of the clustering coefficient is this approximate clustering coefficient.

Drosophila PPI: TOM with the Hub Node and the Approximation by Equation (14)
In our article, we argue that the topological overlap between the most highly connected node and all other nodes is approximately constant. Specifically, if we denote the index of the most highly connected node by [1] and its connectivity by $k_{[1]} = \max(k)$, then
$\mathrm{TopOverlap}_{[1]j} \approx \frac{k_{[1]}}{n}\,(1 + \mathrm{Heterogeneity}^2)$.
Here we study whether this relationship remains significant after adjusting the analysis for module size.

Regarding the relationship between module size and factorizability, the results suggest that module size explains 70 percent of the variation in factorizability.

Yeast PPI Network
The relationships between fundamental network concepts and module sizes for yeast protein-protein networks are illustrated in Figure 2.

Yeast Weighted Gene Co-Expression Network
The relationships between fundamental network concepts and module sizes for the weighted yeast gene co-expression networks are illustrated in Figure 3.

Yeast Unweighted Gene Co-Expression Network
The relationships between fundamental network concepts and module sizes for the unweighted yeast gene co-expression networks are illustrated in