petal: Co-expression network modelling in R
© The Author(s) 2016
Published: 1 August 2016
Networks provide effective models to study complex biological systems, such as gene and protein interaction networks. With the advent of new sequencing technologies, many life scientists are grasping for user-friendly methods and tools to examine biological components at the whole-systems level. Gene co-expression network analysis approaches are frequently used to successfully associate genes with biological processes and demonstrate great potential to gain further insights into the functionality of genes, thus becoming a standard approach in Systems Biology. Here the objective is to construct biologically meaningful and statistically strong co-expression networks, the identification of research dependent subnetworks, and the presentation of self-contained results.
We introduce petal, a novel approach to generate gene co-expression network models based on experimental gene expression measures. petal focuses on statistical, mathematical, and biological characteristics of both, input data and output network models. Often over-looked issues of current co-expression analysis tools include the assumption of data normality, which is seldom the case for hight-throughput expression data obtained from RNA-seq technologies. petal does not assume data normality, making it a statistically appropriate method for RNA-seq data. Also, network models are rarely tested for their known typical architecture: scale-free and small-world. petal explicitly constructs networks based on both these characteristics, thereby generating biologically meaningful models. Furthermore, many network analysis tools require a number of user-defined input variables, these often require tuning and/or an understanding of the underlying algorithm; petal requires no user input other than experimental data. This allows for reproducible results, and simplifies the use of petal. Lastly, this approach is specifically designed for very large high-throughput datasets; this way, petal’s network models represent as much of the entire system as possible to provide a whole-system approach.
petal is a novel tool for generating co-expression network models of whole-genomics experiments. It is implemented in R and available as a library. Its application to several whole-genome experiments has generated novel meaningful results and has lead the way to new testing hypothesizes for further biological investigation.
Within the life sciences, high-throughput technologies such as RNA-sequencing, microarrays, mass spectrometry, and ChIP-sequencing produce large experimental omics datasets at increasing volume. Analysts are left to organize, structure, and analyse these data in sufficient and efficient ways. Computational Biology, Bioinformatics, Systems Biology, Network Biology, and Network Medicine offer interdisciplinary tools to help solve these challenges. Here, our focus is the efficient and effective analysis of high-throughput gene expression data from microarrays and next-generation sequencing platforms (RNA-seq) via co-expression networks.
Applications of networks and their analysis have become standard tools in the Systems Biology toolbox for their versatility and powerful approach to whole-system analysis, their ability to handle very large complex datasets, and their proficiency to present large-scale gene association [1–4]. The networks can be examined with standard tools from Graph Theory to identify systematic changes, patterns, similarities and possibly regulations between genes. Co-expression network construction and analysis have found many uses in the life sciences, such as functional groupings of genes in plants under stress conditions, and identification of molecular targets for future targeted gene therapy [5, 6].
Next, an adjacency function paired with a threshold transform the association measures into an unweighted or weighted network. In an unweighted network edges indicate only that an association exists between vertices implying a binary graph. In a weighted network all vertices are connected at different strength of association resulting in a completely connected graph. These networks, weighted or unweighted, are mathematically presented by the adjacency (incidence) matrix.
The resulting network model should follow typical properties of complex networks such as scale-free and small-world. Both these structural properties are standard characteristics of true complex biological network systems [7–13]. To determine these architectural characteristics of networks, topological measures taken from Graph Theory are calculated. These topological properties are robust descriptive measures that objectively describe the network’s architecture. Such measures include cluster coefficient, path-length, connectivity degree, vertices degree distribution, diameter, density, and many others .
Small-world In 1998 Duncan Watts and Steven Strogatz introduced a small-world network model . For a network model to be small-world it must be made of densely connected subnetworks that are linked together in such a fashion that the path between any vertex pair is relatively short . Mathematically, to categorize a network as small-world, its average cluster coefficient (meanCC) and average path length (meanPath) are calculated. A vertex’s cluster coefficient indicates how well its neighbours are connected: when a vertex has a cluster coefficient equal to one then all of its neighbours are connected to each other. In a small-world network model the average cluster coefficient of all vertices is larger than in a random graph. The path length between two vertices is the number of edges within their shortest path. The average path length of a small-world model must be relatively short in comparison to random network models. This phenomenon is often referred to as ‘six-degrees of separation’ [13, 15].
After the network model is constructed it can be analysed. The underlying assumption of co-expression network analysis is that genes with similar expression patterns are possibly co-expressed, co-regulated, share common functionality, and/or might be regulated by a joint transcription factor. Consequently, groups of similar expression profiles across experimental conditions can be hypothesized to share common functionality by means of the ‘Guilt-by-Association’ principle . As a result, common practice is to examine the constructed co-expression network for its topological properties to determine tightly connected vertices (clusters, modules) which help to reveal whole-system expression patterns, putative gene interactions, potential functional groupings, the association of functions to genes of unknown function, and possible regulations within the system. Examining network properties in combination with well-defined testing hypotheses can lead to the identification of putative key players within a pathway and thus possible drug targets in future research.
The choice of a proper association measure relevant to data distribution and experimental hypothesis.
The absence of explicit confirmation that the constructed network follows the scale-free and small-world properties.
The inconvenience of having to enter a large number of user-specified input variables.
The restriction of using only datasets attached to a tool’s integrated database.
To know the meaning of gene modules.
To extract results of interest and/or interpreting the output presented by the application.
The measure used to transform the expression matrix into an association matrix should depend on the expression data distribution to be statistically valid. The Pearson Correlation Coefficient (PE) is the most common default measure in co-expression network tools [5, 17–21]. Cytoscape , a platform offering numerous network applications, has only one plugin that constructs co-expression networks simply based on PE. PE is a convenient choice because scientists are familiar with it, and its computational cost is very low in comparison to Spearman Correlation Coefficient (SP). On the contrary, PE is not an appropriate association measure for most data, as it is based on normality assumptions. For example, RNA-seq data typically follow a negative binomial distribution , hence PE is not a statistically robust measure and alternatives should be available.
Furthermore, co-expression networks are shown to have small-world and scale-free properties and can be realistically modelled by these two model structures [7–9]. To our knowledge there is no co-expression network method naturally constructing networks with both these biological properties. For example, Weighted Gene Co-expression Network Analysis (WGCNA) [24, 25], an R-library, includes a scale-free topological fitting index that can be manually tuned to construct a scale-free network, but WGCNA does not purposefully construct networks following small-world architecture, nor is the scale-free property robust within the WGCNA algorithm. A small change in user-specified parameters can shift the edge definition causing the loss of scale-freeness.
Another disadvantage of network analysis tools is that many methods assume that the user is familiar with network properties to select appropriate input variables or knows how to tune these variables. A common mandatory user-defined parameter is the association measure threshold. This threshold greatly affects conclusions drawn from the network model, as it governs the construction of the network. Therefore, the threshold should be objectively computed, rather than subjectively chosen by the user. Common practice is to define association with a Pearson Correlation Coefficient of 0.8 and greater [21, 26]. However, there is no consensus on threshold values; it is more of an arbitrary selection that does not necessarily reflect biological relevance.
One of the main goals in network analysis is the identification of tightly connected subnetworks often referred to as gene modules: the general hypothesis is that genes with similar expression might also share functional similarity. Module detection is the general practice of defining (tightly) connected gene groups or partitioning the network into smaller subnetworks [14, 18, 25, 28]. Tightly connected groups are also called clusters or communities. The notion of ‘tightly connected’ has no consensus. The terms cluster, modules and communities are loosely defined and are interchangeably used in the literature [14, 25, 28]. In  gene modules are defined as disjoint subsets of nodes with more connections withhin the module than to genes outside the module. WGCNA labels modules as “clusters of nodes” and a “subset of nodes that are tightly connected to each other” . Mahanta cursive state that “one of the most important applications of gene co-expression networks is to identify functional modules or network modules, which are represented by the strongly connected regions of the co-expression networks” . Ficklin and Feltus describe genes in modules to “participate in similar biological processes; therefore, guilt-by-association inference can be applied to module genes with no known functions that are connected to module genes of known function” . Modules identified by different methods inherit divergent graph properties. To obtain insights in regards to intra-connectivity within the extracted gene groups, their properties should be provided. For example, density can be calculated is a measure of tightness. Density is the proportion of all possible edges and edges actually present in the network model. A density of one implies that every vertex has an edge to every other vertex in the particular subnetwork. The closer the density is to 1, the more densely/tightly connected the subnetwork is.
Finally, networks are generated in order to study experimental hypotheses, thus resulting structures containing genes of interest (GoIs) should be extracted and presented to the user in a clear and understandable manner. Also, these results should be both self-contained and easily transferable to other standalone tools such as Cytoscape [22, 31] or Pajek .
The petal tool was designed to strengthen the standard flow of co-expression analysis. Upon the evaluation of current co-expression network tools [20, 25, 27, 33] our first goal was to develop a network construction algorithm confirming that resulting network models follow real biological network characteristics: scale-free and small-world [7–13]. An additional goal was to generate network models based on entire omics expression datasets to ensure a true whole-transcriptome or whole-genome representation rather than based on a pre-selected subset of genes (e.g., differentially expressed genes). Another main aim was to present the researcher with a tool that is easy to use and does not require any prior network theory knowledge. Consequently, the number of input parameters needed to be minimized.
Step 1: Defining association via a measure
Metrics, such as measures of correlation or geometric distances, are applied to gene expression vectors to represent association between genes. petal includes the following measures: Pearson Correlation Coefficient (PE), Spearman Correlation Coefficient (SP), Kendall Rank Coefficient, Euclidean Distance, Manhattan Distance, Canberra Distance, and Mutual Information to account for parametric and non-parametric data distribution and to accommodate a variety of testing hypotheses. The default measure in petal is SP as it can be applied to non-normally distributed data. PE, a very popular measure [21, 26], is a parametric measure, and should only be applied to normally distributed (expression) data. A robust alternative to the PE is the Spearman Correlation Coefficient (SP) [35, 36]. PE, SP, and Kendall are correlation measures and compare the behaviour of the gene expression profiles (n-dimensional vector) over the n measurements. They solely evaluate the pattern similarity and do not calculate geometric distance between gene pairs, i.e., the spatial difference between gene expression vectors. Euclidean-, Manhattan-, and Canberra distances provide insight into the spacial difference between gene pairs, but do not provide any information in regards to common differential expression between gene pairs. Distances are non-parametric measures and can be applied to any data distribution. Mutual Information is another non-parametric measure based on entropy and can process missing data values better than other measures [37, 38]. Association can be calculated by a number of other less commonly used measures; for more detail on different measures refer to [18, 39–41].
Step 2: Defining edges via adjacency function and threshold
Unweighted network models have been widely studied in Graph Theory and carry well-defined properties. In consideration of these well established attributes, petal uses this discrete transformation (Eq. 2) to construct unweighted network models to take advantage of graph theoretical characteristics.
Calculating initial threshold list
Sample sorted measure table used to define threshold list
g e n e u
g e n e w
g e n e q
g e n e r
g e n e t
g e n e q
g e n e q
g e n e s
g e n e w
g e n e p
Selection of ‘best’ threshold
Sample network threshold table
To evaluate a network model for small-world architecture, the average cluster coefficient and average path length are calculated, these are calculated for each network model and recorded as meanCC and meanPath, respectively (Column 4 and 5 in Table 2).
With this linear transformation of the power-law function linear regression can be used. The true distribution is log-transformed. Linear regression is applied to the log-transformed degree distribution to determine the coefficient of determination (R 2) and the slope of the linear regression. The slope a of the linear regression corresponds to the power in Eq. 1 and should lie within the interval (1,3) [12, 14]. When the linear regression is a good fit for the log-transformed true degree distribution, indicated by a high R 2 value, and a in the interval (1,3) the model can be categorized as scale-free. petal evaluates each network model for how well the log-transformed degree distribution fits the a linear line via linear regression and records its corresponding R 2 value and a to determine scale-free behaviour (Column 2 and 3 in Table 2).
A network component is a set of vertices that are connected by paths. If a network is made of one component it is considered a connected network. If a network model has two components, then this model has two disjoint subnetworks and not every vertex has a path to every other vertex within the entire network model. Network architectures, scale-free and small-world, are defined under the assumption that the network is connected; however, their defining topological properties (vertex degree distribution, average cluster coefficient, average path length) can be calculated without this assumption by excluding vertex pairs in different components when calculating averages. As a result, the calculated values for the network parameters can be very misleading if obtained from a disjoint network model. It is seldom the case for biological network models based on expression data to be one single component. The biggest component of a multi-component network must include at least 90–98 % of the network’s vertices for the topological measures to reliably define the model’s architecture, otherwise the topological measures can lead to misinterpretation . Consequently, petal validates the reliability of the calculated network parameters (R 2, a, meanCC, and meanPath) by determining the number of components in each model. This information is then used to identify the largest network component and its relative size to the entire current network model. To our knowledge, no other network construction algorithm considers the importance of verifying the percentage of genes in the largest component to uphold the calculated network characteristics. petal documents the relative size of the largest component (%bigComp) to confirm that the previously calculated properties as trustworthy (column 7 in Table 2).
One of our goals is to present a whole-omics approach and not just focus on a pre-selected set of genes. Therefore, an objective is to include as many genes as possible from the original dataset. For each network model, vertices which are not connected to any other vertex are removed from consideration as they do not provide any information in terms of association and the percentage of remaining vertices is recorded (column 6 in Table 2).
The resulting network models are weighted against each other based on their topological properties. The ‘best’ threshold is considered to have generated a network model that is scale-free, small-world, with its biggest component including at least 95 % of the network’s vertices, and retains the maximum number of vertices from the original dataset. If such a network cannot be identified, the user is alerted that none of the considered network models are scale-free and small-world, but each model remains accessible.
Depending on the calculated first and last thresholds, the interval between these two values can be relatively large. Consequently, the step sizes between considered thresholds are large and a ‘better’ threshold might be missed between the measured thresholds. To account for a large step size between threshold values, a refining step is included in the algorithm. Refining thresholds is an optional step, as this comes at a cost of longer runtime.
Besides t best , one or more thresholds also meet the criteria of the algorithm, denoted as t alt for alternative thresholds. t alt and t best are sorted, the strongest and weakest associations are set to the new first and last thresholds, respectively.
Only t best produces a scale-free, small-world network model. t b e s t−1 and t b e s t+1 are set to the first and last threshold, respectively.
With the assignment of the new first and last threshold, the interval between the two is again split into six equal subintervals, resulting in the list of refined thresholds. The new first and last thresholds cover a smaller spectrum resulting in smaller step sizes and thus making the choice of final threshold more precise. The algorithm then proceeds by recalculating the network threshold table.
Step 3: Identifying structures within networks
One of the goals of co-expression network analysis is to extract structures (subnetworks, paths) from the entire network and examine these for biological patterns or association. Gene module detection is a standard procedure after network construction. Often hierarchical clustering is performed on a pairwise-distance matrix to organize the networks into hierarchical trees these can be cut at a user-specified height to obtain network modules. These modules can have very different topological properties. Furthermore, when modules are defined by hierarchical clustering some network information is lost, such as the interactions within a cluster (intra-connectivity).
Cliques are completely connected subnetworks; every vertex connects to every other vertex. They share the same topological properties regardless of dimension. For example, the diameter, cluster coefficient, and density of any clique is always equal to one. The members of a clique form an equivalence class following the transitive property which results in less variation across clique members’ expression profiles compared to groupings obtained from standard clustering routines . The mathematical definition of a clique is: A subnetwork of j vertices is a clique if and only if the subnetwork has j(j−1)/2 number of edges. Extracting cliques from a network is a common network analysis step, but computationally very expensive and an NP-complete problem.
Also, the extraction of fully connected subgraphs is considered too stringent for some biological testing hypotheses and very time-consuming when the network is densely connected. Another consideration when using cliques is that they might be too restricted in their properties based on the input data. If the input data are clean, meaning that technical or experimental noise and faulty measurements have been removed, then cliques are not considered stringent. Gene expression data is rather noisy for which cliques can be too inflexible of a structure. As an alternative fuzzy cliques can be used . Fuzzy cliques are ‘almost’ cliques. Similar to modules, clusters, or communities, there is no standard definition of ‘almost’. When fuzzy cliques are discussed, topological properties should be reported to determine how strong or weak of fuzzy clique it is.
Extracting groups based on genes of interest
Another goal of co-expression network is to extract groups of genes that behave similarly over time or under varying environmental conditions. In general, the researcher has interest in a particular set of genes and wants to identify other genes which behave similarly to the genes of interest (GoIs). Consequently, petal allows the user to upload a list of gene identifiers to easily explore the GoIs. The genes of interest can be investigated more closely within the identified ‘best’ network by looking at their direct neighbourhoods referred to as vicinity networks (VN). A VN is a subnetwork representing the intermediate neighbourhood of a single vertex or of a completely connected set of vertices (clique). A VN of vertex i is a subnetwork including vertex i and all its direct neighbours and their edges. A VN of a clique includes all clique members and their common neighbours. Let there be s members in clique r, then the VN of clique r includes the s members and the the common neighbours of the s members. The topological properties of VNs can vary greatly, but their extraction from a network is very fast. These smaller subnetworks can be examined more closely and cliques are extracted at a much smaller computational cost from VNs than from the entire network. Often, some precision is lost when computational time is decreased; there is no loss of information when gene-specific cliques are extracted from its vicinity network than when they are mined from the entire network.
Genes from the provided list are considered individually. For each entered gene a unique VN will be extracted and written to file. When this option is chosen, we recommend to keep the list of genes relatively short, such as twenty.
Assume k genes of interest were uploaded, to test for connections between these k genes, they are extracted from the network, resulting in a k×k adjacency matrix. From this subnetwork with k nodes all maximal cliques are identified. Each maximal clique is treated separately while identifying its neighbours. Neighbours of each maximal clique are written to a file distinguishing between neighbours and the clique genes obtained from the user’s identifiers.
Sample vicinity network table
petal’s main functions
petal is made of three main functions. dataToVNs requires two inputs: the file name of the expression data in tab-delimited format and the file containing the genes of interest (GoIs). This function takes the expression data matrix and supplies the user with groupings of the GoIs. Optional function input includes the upload of a gene annotation file, choice of measure, and thresholds. If no file name for the gene list is specified the function returns an error message and guides the user to use createSWSFnetFromFile. createSWSFnetFromFile only constructs the network models and calculates the ‘best’ model. This function can be followed up with downstreamAnalysis in which genes of interest (GoIs) can be loaded as well as a gene annotation file.
The user supplies the expression data file to petal. In addition, there are four optional steps: the selection of an association measure, user-specified thresholds (for the advanced user), the upload of a list of genes which are of particular interest to the researcher, and a gene annotation file. Additionally, the user has the option to evaluate their data distribution by using the function graphHistQQFromFile before constructing a network model. This provides the user with an estimate of their data distribution to identify whether data are approximately normally distributed. In this case, the user can then select a parametric similarity measure, such as the Pearson Correlation Coefficient (PE). The Spearman Correlation Coefficient (SP) is currently set as the default measure as it is data distribution-dependent. The second optional step is to select up to five association thresholds instead of using the automated threshold computation. To examine the network structures and association of a few genes petal allows the user to upload a list of genes which are extracted with their one-neighbour vicinity networks (VN) for in-depth evaluation. Lastly, if a gene annotation file is available, it can also be loaded in order to attach the information to each identified VN.
Upon completion, petal’s accessible files include: general information file (.txt), network file (.txt), adjacency matrices (.RData), two topology tables (.txt), vicinity network files (.txt), and the expression profiles (.tiff) of each vicinity network. The network file can be directly uploaded into Cytoscape. Cytoscape, an Open Source tool, can be used for visualization and offers several network viewing tools via various plugins [22, 31, 42]. The.RData files of the network adjacency matrices are provided for convenient loading into R, enabling the advanced user to personalize downstream analysis if desired. In addition, the user can look at the characteristics of networks generated on different thresholds. Further, a table is provided which includes all network vertices with their degree and cluster coefficient. Each identified vicinity network is reported with gene membership and its density. Also each VN’s gene expression profiles can be viewed via.tiff image files.
Results and discussion
petal provides an easy to use R-library with the possibility of manual adjustment for the advanced user. With only one function call, the user obtains a sophisticated network analysis without any graph theoretical knowledge. The network model is guaranteed to be scale-free and small-world without any parameter specification. There is no tuning of parameters required. Gene specific groups can be extracted from the network which are conveniently automatically annotated if an annotation file is provided, and the expression profiles of all genes within the group are graphed. This feature saves the researcher a great amount of time. Furthermore, as the analysis can be done within one or two function calls, petal is accessible to scientists with minimal computer or R programming knowledge.
Comparison to other tools
petal produces network models that present associations among genes of a studied system based on experimental data. These models provide a comprehensive view of the entire system which comes at a cost of longer computational runtime compared to most other current tools (e.g., WGCNA). On the other hand, user time is drastically reduced due to restricting user-intervention, decreasing the manual execution of computational steps. WGCNA, although very low in computational costs, does not purposefully generate small-world networks, and ensures scale-free networks only with user intervention. In addition, WGCNA’s extracts gene modules from a tree structure, which is a simplification of a network graph and information is lost in the process. WGCNA is a powerful network analysis tool, but requires many input parameters, making it hard for the novice user to take advantage of this R-library. Cytoscape [22, 31, 42] is a very popular tool to view networks. To our knowledge the construction of co-expression networks is unique to two plugins; one builds networks exclusively on the PE measure, and the other on the Mutual Information metric. petal offers a number of different measure and helps the user to choose an appropriate measure based on the specific expression data distribution. We also believe that petal’s output is very user-friendly so that the scientist can interpret the results easily and examine densely connected subnetworks of GoIs, both mathematically and via additional viewers such as Cytoscape or Pajek.
Runtime and Memory
Empirical evaluation of petal’s runtime and memory requirement
Dimension of dataset
Max memory [GB]
genes × measures
Application to the sciences
The utility of petal is demonstrated with an application of an Illumina RNA-seq whole-genome sequencing experiment of the mountain pine beetle (Dendroctonus ponderosae). Mountain pine beetles are obligate parasites of pine trees. They have destroyed a wide area of forest land and are a serious threat to conifer forests in the western North America. They rely on aggregation pheromones to coordinate the ‘mass attacks’ necessary to overwhelm a host tree’s defences and thus successfully colonize a tree. A molecular level understanding of this process may provide new methods to manage these devastating pests. Although pheromone biosynthetic pathways have been previously studied, the enzymes involved have not yet been completely identified, characterized, and understood [43–45]. Aw et al. presented the first genomic analysis of the mountain pine beetle and identified candidate genes encoding enzymes involved in pheromone-biosynthesis by studying their gene expression patterns , which yielded two confirmed pheromone biosynthesizing enzymes . The hypothesis is that genes encoding these enzymes are regulated in parallel. Of particular interest is a group of 28 genes previously implicated in pheromone biosynthetic pathways.
In this experiment, the Illumina NextSeq 500 platform was used to generate RNA-seq measures of gene transcription of more than 13,000 genes of the mountain pine beetle. Four biological replicates were collected for each of the four specimen types: fed/unfed male/female. Sequences were trimmed and filtered for nucleotide-base quality and 19–35 million sequences were aligned to the Dendroctus ponderosae reference genome. Unambiguously aligned sequences were counted for all annotated mountain pine beetle genes. Count data underwent standard protocols for low-count filtering, upper quartile normalization and transformation into counts per million following the DESeq2 processing pipeline . Experimental findings relevant to beetle biology and biochemistry will be described in a forthcoming manuscript, in which the data will be made publicly available.
After data quality control, the dataset contains 11,342 gene identifiers across 16 measures. The petal histogram and Q-Q plot confirms our assumption that our RNA-seq data are non-normally distributed (Additional file 1). The expression data are upload into petal alone with a list of 28 gene identifiers of interest and the corresponding gene annotation file. The 28 genes are analysed together as they have been hypothesized to play a joint role in the pheromone biosynthetic pathways. The petal run was performed on a server with 2.5 GHz processors and 256 GB RAM, one processor was used and it took 4.42 h utilizing at most 7.5 GB RAM.
NetworkStats.txt obtained from petal for the mountain pine beetle dataset
Another interesting subnetwork, VN6, is a vicinity network obtained from four other GoIs. All 85 common neighbours are tightly interconnected: the 89-gene subnetwork has a density of 0.96. Upon closer examination, VN6 and VN13 share three GoIs and their intersection includes 80 genes/vertices. Adding the four GoIs that do not overlap between the two VNs to the 80-gene intersection results in a subnetwork with a density of 0.9943. The union of the GoIs of VN6 and VN13 are highlighted in orange in Fig. 4 and the profiles are shown in orange. Biologically, this subnetwork presents a group of 84 very similarly expressed genes, including various cytochrome P450; this grouping agrees with a hypothesized link between tree resin detoxification and pheromone production [43, 46].
Overall, this approach enables the researcher to quickly view genes with similar expression patterns. With current annotation of the genes at hand, simple observations of the similarity or differences of functions of similarly-behaving genes can be made.
petal is written for life scientists to construct high level co-expression networks and to extract vicinity networks of interest. petal is very user-friendly by requiring little prior knowledge of network science without sacrificing the quality output that comes from complex, well graph-theoretically defined networks. petal’s adaptability allows for the analysis of experimental expression data of most sizes. petal is an easy-to-use tool, attractive to a wide range of scientists with flexible and customizable options.
Availability and requirements
Project name: petalProject homepage: https://github.com/julipetal/petalNet Operating system: LinuxProgramming language: ROther requirements: igraph (version 0.7)License: GPL-3Restriction for non-academic use: None
PE, pearson correlation coefficient; SP, spearman correlation coefficient; WGCNA, weighted gene co-expression network analysis; GoI, gene of interest; Q-Q plot, quantile-quantile plot; meanCC, mean cluster coefficient; meanPath, mean path length; VN, vicinity network
This work is based upon work supported by the Department of Energy (DOE), Office of Science, Genomic Science Program under Award Number DE-SC0008834. This work was also made possible by a grant from the National Institute of General Medical Sciences (P20GM103440) from the National Institutes of Health through its support of the Nevada Center for Bioinformatics. The contents of this manuscript are solely the responsibility of the authors and do not necessarily represent the official views of the DOE or the NIH. We also express our gratitude to Claus Tittiger and Jeff Nadeau for allowing us to apply petal to their current mountain pine beetle experiment.
Publication of this article was funded by a grant from the National Institute of General Medical Sciences (P20GM103440) from the National Institutes of Health through its support of the Nevada Center for Bioinformatics.
This article has been published as part of BMC Systems Biology Vol 10 Suppl 2 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-2.
JP developed the R package petal, and wrote the manuscript. SS profiled petal and suggested adjustments to the algorithm of petal. FCH made suggestions for improvements and contributed to the writing. KAS initiated the project, and contributed to the writing. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, Leonardson A, Castellini LW, Wang S, Champy M-F, Zhang B, Emilsson V, Doss S, Ghazalpour A, Horvath S, Drake TA, Lusis AJ, Schadt EE. Variations in dna elucidate molecular networks that cause disease. Nature. 2008; 452(7186):429–35.PubMedPubMed CentralView ArticleGoogle Scholar
- Jeger MJ, Pautasso M, Holdenrieder O, Shaw MW. Modelling disease spread and control in networks: implications for plant sciences. New Pytologist. 2007; 174(2):279–97.View ArticleGoogle Scholar
- LÃşpez-Kleine L, Leal L, Lopez C. Biostatistical approaches for the reconstruction of gene co-expression networks based on transcriptomic data. Brief. Funct. Genom. 2013; 12(5):457–67.View ArticleGoogle Scholar
- Schwikowski B, Uetz P, Field S. A network of protein-protein interactions in yeast. Nat Biotechnol. 2000; 18(12):1257–61.PubMedView ArticleGoogle Scholar
- Mutwil M, Usadel B, Schütte M, Loraine A, Ebenhöh O, Persson S. Assembly of an interactive correlation network for the arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol. 2010; 152(1):429–37.View ArticleGoogle Scholar
- Yang Y, Han L, Yuan Y, Li J, Hei N, Liang H. Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types. Nat Commun. 2014; 5(3231):1–9.Google Scholar
- van Noort V, Snel B, Huynen MA. The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Reports. 2004; 5(3):280–28.PubMedPubMed CentralView ArticleGoogle Scholar
- Ruan J, Dean AK, Zhang W. A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Syst Biol. 2010; 4(8):1–21.Google Scholar
- Xulvi-Brunet R, Li H. Co-expression networks: graph properties and topological comparisons. Bioinforma. 2010; 26(2):205–14.View ArticleGoogle Scholar
- Barabési AL, Albert R. Emergence of scaling in random network. Sci. 1999; 286(5439):509–12.View ArticleGoogle Scholar
- Barabés AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004; 5(2):101–13.View ArticleGoogle Scholar
- Barabési AL. Scale-free networks: a decade and beyond. Sci. 2009; 325(5939):412–3.View ArticleGoogle Scholar
- Watts DJ, Strogatz SH. Collective dynamics of ’small-world’ networks. Nature. 1998; 393(6684):440–2.PubMedView ArticleGoogle Scholar
- Newman MEJ. Networks: An Introduction. New York, NY: Oxford University Press; 2012.Google Scholar
- Watts DJ. Six Degrees: the Science of a Connected Age. New York, NY: WW Norton and Company, Inc; 2004.Google Scholar
- De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010; 8(10):717–29.PubMedGoogle Scholar
- Giorgi FM, Del Fabbro C, Licausi F. Comparative study of rna-seq- and microarray-derived coexpression networks in arabidopsis thaliana. Bioinforma (Oxford, England). 2013; 29(6):717–24.View ArticleGoogle Scholar
- Dehmer M, Emmert-Streib F, Graber A, Salvador A. Applied Statistics for Appliednetwork Biology, Methods in Systems Biology, 1st edn. Weinheim, Germany: Wiley-Blackwell; 2011.View ArticleGoogle Scholar
- Movahedi S, Van Bel M, Heyndrickx KS, Vandepoele K. Comparative co-expression analysis in plant biology. Plant Cell Environ. 2012; 35(10):1787–98.PubMedView ArticleGoogle Scholar
- Srinivasasainagendra V, Page GP, Mehta T, Coulibaly I, Loraine AE. Cressexpress: a tool for large-scale mining of expression data from arabidopsis. Plant Physiol. 2008; 147(3):1004–16.PubMedPubMed CentralView ArticleGoogle Scholar
- Horvath S, Dong J. Geometric interpretation of gene coexpression network analysis. PLoS Comput Biol. 2008; 4(8):1–27.View ArticleGoogle Scholar
- Cytoscape. Network Data Integration, Analysis, and Visualization in a Box. http://www.cytoscape.org/. Accessed: 2015-12-05.
- Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of rna-seq data. BMC Bioinforma. 2013; 14(91):1–18.Google Scholar
- Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005; 4(1):0–43.Google Scholar
- Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysi. BMC Bioinforma. 2008; 9(559):1–13.Google Scholar
- Maschietto M, Tahira AC, Puga R, Lima L, Mariani D, da Silveira Paulsen B, Belmonte-de-Abreu P, Vieira H, Krepischi AC, Carraro DM, Palha JA, Rehen S, Brentani H. Co-expression network of neural-differentiation genes shows specific pattern in schizophrenia. BMC Medical Genomics. 2015; 8(23):1–15.Google Scholar
- Ogata Y, Suzuki H, Sakurai N, Shibata D. Cop: a database for characterizing co-expressed gene modules with biological information in plants. Bioinformatics, Oxford, England. 2010; 26(9):1267–8.View ArticleGoogle Scholar
- Ruan J, Zhang W. Identification and evaluation of weak community structures in network In: Gil Y, Mooney R, editors. Proceedings of the 21st National Conference on Artificial Intelligence: 16–20 July 2006; Menio Park, CA. Boston, Massachusetts, USA: AAAI: 2006. p. 470–5.Google Scholar
- Mahanta P, Ahmed H, Bhattacharyya DK, Kalita JK. An effective method for network module extraction from microarray data. BMC bioinforma. 2012; 13(Suppl 13:S4):1–11.Google Scholar
- Ficklin SP, Feltus FA. Gene coexpression network alignment and conservation of gene modules between two grass species: maize and rice. Plant Physiol. 2011; 156(3):1244–56.PubMedPubMed CentralView ArticleGoogle Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11):2498–504.PubMedPubMed CentralView ArticleGoogle Scholar
- Batagelj V. Pajek. http://mrvar.fdv.uni-lj.si/pajek/. Accessed: 2015-10-30.
- Ben-Dor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol. 1999; 6(3–4):281–97.PubMedView ArticleGoogle Scholar
- Ihaka R, Gentleman R. R (programming language). http://cran.r-project.org. Accessed 12 Feb 2016.
- Xiao X, Moreno-Moral A, Rotival M, Bottolo L, Petretto E. Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules. PLoS Gen. 2014; 10(1):1–16.View ArticleGoogle Scholar
- de Jong S, Boks MP, Fuller TF, Strengman E, Janson E, de Kovel CG, Ori APS, Vi N, Mulder F, Blom JD, Glenthøj B, Schubart CD, Cahn W, Kahn RS, Horvath S, Ophoff RA. A gene co-expression network in whole blood of schizophrenia patients is independent of antipsychotic-use and enriched for brain-expressed genes. PLoS ONE. 2012; 7(6):1–10.Google Scholar
- Bidkhori G, Narimani Z, Hosseini Ashtiani S, Moeini A, Nowzari-Dalin A, Masoudi-Nejad A. Reconstruction of an integrated genome-scale co-expression network reveals key modules involved in lung adenocarcinoma. PloS ONE. 2013; 8(7):1–10.View ArticleGoogle Scholar
- Broderick G, Fuite J, Kreitz A, Vernon SD, Klimas N, Fletcher MA. A formal analysis of cytokine networks in chronic fatiguesyndrome. Brain Behav Immunity. 2010; 24(7):1209–17.View ArticleGoogle Scholar
- Munneke B, Schlauch KA, Simonsen KL, Beavis WD, Doerge RW. Adding confidence to gene expression clustering. Genetics. 2005; 170(4):2003–11.PubMedPubMed CentralView ArticleGoogle Scholar
- Song L, Langfelder P, Horvath S. Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinforma. 2012; 13(328):1–21.Google Scholar
- Cushman JC, Tillett RT, Wood JA, Branco JM, Schlauch KA. Large-scale mrna expression profiling in the common ice plant, mesembryanthemum crystallinum, performing c3 photosynthesis and crassulacean acid metabolism (cam. J Exp Bot. 2008; 59(7):1875–94.PubMedView ArticleGoogle Scholar
- Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia Sea. A travel guide to cytoscape plugins. Nat Method. 2012; 9(11):1069–76.View ArticleGoogle Scholar
- Aw T, Schlauch K, Keeling CI, Young S, Bearfield JC, Blomquist GJ, Tittiger C. Functional genomics of mountain pine beetle (dendroctonus ponderosae) midguts and fat bodies. BMC Genomics. 2010; 11(215):1–12.Google Scholar
- Keeling CI, Henderson H, Li M, Yuen M, Clark EL, Fraser JD, Huber DPW, Liao NY, Docking TR, Birol I, Chan SK, Taylor GA, Palmquist D, Jones SJM, Bohlmann J. Transcriptome and full-length cdna resources for the mountain pine beetle, dendroctonus ponderosae hopkins, a major insect pest of pine forests. Insect Biochem Mol Biol. 2012; 42(8):525–36.PubMedView ArticleGoogle Scholar
- Keeling CI, Yuen MM, Liao NY, Roderick Docking T, Chan SK, Taylor GA, Palmquist DL, Jackman SD, Nguyen A, Li M, Henderson H, Janes JK, Zhao Y, Pandoh P, Moore R, Sperling FA, W Huber DP, Birol I, Jones SJ, Bohlmann J. Draft genome of the mountain pine beetle, dendroctonus ponderosae hopkins, a major forest pest. Genome Bioliology. 2013; 14(3):1–20.Google Scholar
- Song M, Delaplain P, Nguyen TT, Liu X, Wickenberg L, Jeffrey C, Blomquist G, Tittiger C. exo-brevicomin biosynthetic pathway enzymes from the mountain pine beetle, dendroctonus ponderosae. Insect Biochem Mol Biol. 2014; 53:73–80.PubMedView ArticleGoogle Scholar
- Love MI, Huber W, S A. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biol. 2014; 15(12):1–21.View ArticleGoogle Scholar