Disentangling the multigenic and pleiotropic nature of molecular function
© Stoney et al. 2015
Published: 9 December 2015
Biological processes at the molecular level are usually represented by molecular interaction networks. Function is organised and modularity identified based on network topology; however, this approach often fails to account for the dynamic and multifunctional nature of molecular components. For example, a molecule engaging in spatially or temporally independent functions may be inappropriately clustered into a single functional module. To capture biologically meaningful sets of interacting molecules, we use experimentally defined pathways as spatial/temporal units of molecular activity.
We defined functional profiles of Saccharomyces cerevisiae based on a minimal set of Gene Ontology terms sufficient to represent each pathway's genes. The Gene Ontology terms were used to annotate 271 pathways, accounting for pathway multi-functionality and gene pleiotropy. Pathways were then arranged into a network, linked by shared functionality. Of the genes in our data set, 44% appeared in multiple pathways performing a diverse set of functions. Linking pathways by overlapping functionality revealed a modular network with energy metabolism forming a sparse centre, surrounded by several denser clusters comprised of regulatory and metabolic pathways. Signalling pathways formed a relatively discrete cluster connected to the centre of the network. Genetic interactions were enriched within the clusters of pathways by a factor of 5.5, confirming the organisation of our pathway network is biologically significant.
Our representation of molecular function according to pathway relationships enables analysis of gene/protein activity in the context of specific functional roles, as an alternative to typical molecule-centric graph-based methods. The pathway network demonstrates the cooperation of multiple pathways to perform biological processes and organises pathways into functionally related clusters with interdependent outcomes.
Biological functions must be carried out in a synchronised manner to ensure proper timing of processes like cell division and metabolism. Molecular functions arise from complicated sets of physical interactions between large numbers of proteins, RNAs and various regulatory pathways, which can be difficult to reconstruct, represent and analyse. In systems biology, molecular function is mapped using molecular interaction networks. Protein-protein interaction (PPI) networks are frequently used to map protein functionality [1–5]. Within interaction networks, molecules are usually represented as single nodes connected by physical interactions. Functionally similar nodes tend to cluster together into dense sub-networks, referred to as functional modules [4, 6, 7] or "pathways", forming the basis of network analysis to study function [3–5]. One aim of identifying sub-networks is to illustrate the position and connectivity that molecules and functional modules have within the network. They are used to examine the organisation of different functions within the cell, showing how information is passed through physical interactions to enable the system to function as a whole. Many studies have used Saccharomyces cerevisiae to model functionality [8–11] due to the availability of extensive PPI, genetic interaction (GI) and gene annotation data, making it an ideal organism for developing methods of functional organisation.
A great deal of research has focused on computational methods used to identify clusters/sub-networks based on topological features [12–14]. However, such networks tend to utilise the sum of a molecule's interactions, without accounting for the temporal and spatial nature of its interactions. Simply because two proteins can interact does not mean that they will interact in every context. Clustering approaches tend to treat spatial/temporal edges as if they are constant. These sub-networks, which represent functional modules, may as a result bring together functions that are unrelated in the cell. Evidence for this comes from discrepancies in community detection in networks created from different data types. The combination of different data types has been shown to improve the functional homogeneity of topological clusters.
To deal with the issue of spatial/temporal edges we propose a method using experimentally validated pathways as the units of cellular processes. In this context pathways represent groups of proteins shown to interact under specific experimental conditions. This differs from the definition used in Kelley (2005), in which clusters in PPI networks were described as pathways. In our approach proteins that participate in multiple context-dependent interactions appear in multiple pathways, rather than being represented by a single highly connected node. Gene Ontology (GO) annotations derived from experimental evidence or sequence homology were used to assign collective functionality to the pathways. Annotated pathways were then connected according to functional overlap. Linking pathways by shared functionality enables us to examine the flow of information among biological functions, giving insight into the organisation of function within the cell.
S. cerevisiae pathway names and their constituent genes/proteins were retrieved from ConsensusPathDB (CPDB). Pathways were represented as sets of genes. The original data set consisted of 1050 pathways with 2114 genes.
CPDB collects pathway data from multiple databases, which results in a large degree of pathway duplication and overlap, making pathway consolidation necessary. Three types of data duplication were identified: duplicated pathway names, duplicated gene sets, and small pathways that were subsets of larger pathways. Databases resourced by CPDB may assign slightly different gene sets to identical pathway names, as a result of varying pathway boundaries. Repeated pathway names were identified and amalgamated into single entities by merging the gene sets. Pathways with identical gene sets were identified and redundant pathways were removed.
Table 1. Transformation of data during processing (labels: unannotated genes removed; number of unique genes; median genes per pathway; mean genes per pathway).
Pathways containing fewer than three genes/proteins were considered too small for reliable statistical analysis of function and were removed. The effect that our processing had on the data set is documented in Table 1. The final data set consisted of 271 pathways and 1433 genes, with a median of six genes/proteins per pathway.
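The consolidation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `consolidate`, the `(name, gene_set)` input format, the order of the steps, and the choice to drop strict subsets outright are all hypothetical, since the text does not specify how subset pathways were resolved.

```python
from collections import defaultdict


def consolidate(records, min_size=3):
    """Merge, de-duplicate and filter pathways given as (name, genes) pairs."""
    # 1. Merge gene sets that share a pathway name.
    merged = defaultdict(set)
    for name, genes in records:
        merged[name] |= set(genes)

    # 2. Keep one pathway per identical gene set (first name seen wins).
    by_genes = {}
    for name, genes in merged.items():
        by_genes.setdefault(frozenset(genes), name)

    # 3. Drop pathways with fewer than `min_size` genes.
    kept = {name: genes for genes, name in by_genes.items() if len(genes) >= min_size}

    # 4. Drop pathways that are a strict subset of another pathway
    #    (assumption: the source does not say how subsets were handled).
    return {
        name: set(genes)
        for name, genes in kept.items()
        if not any(genes < other for other in kept.values())
    }


# Toy input with made-up pathway and gene names.
toy = [
    ("glycolysis", {"A", "B"}),
    ("glycolysis", {"B", "C"}),          # same name: gene sets are merged
    ("glycolysis copy", {"A", "B", "C"}),  # identical gene set: redundant
    ("tiny", {"X", "Y"}),                # fewer than three genes
    ("sub", {"D", "E", "F"}),            # strict subset of "big"
    ("big", {"D", "E", "F", "G"}),
]
consolidated = consolidate(toy)
```

On the toy input only the merged "glycolysis" pathway and "big" survive. A real run would start from the 1050 CPDB pathways.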
Generation of a full set of GO identifiers for each gene
Functional gene annotations were retrieved from the Gene Ontology. GO terms were assigned to the genes within each pathway. Only experimentally derived annotations or annotations generated using sequence orthologs were used, leaving 132 genes (9%) unannotated (Table 1). Unannotated genes were omitted from the data set. To increase annotation completeness, the GO hierarchy was downloaded and parent annotations were added to genes.
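Adding parent annotations amounts to taking the transitive closure of each gene's terms over the GO hierarchy. The sketch below shows one way this could be done. The function name `add_parent_terms`, the `parents` mapping (term to its direct parents) and the placeholder term labels are assumptions for illustration and are not taken from the GO files themselves.

```python
def add_parent_terms(direct_terms, parents):
    """Return the direct GO terms plus every ancestor reachable via `parents`."""
    closure = set(direct_terms)
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):
            if parent not in closure:  # the GO graph is a DAG, so terms can share ancestors
                closure.add(parent)
                stack.append(parent)
    return closure


# Placeholder hierarchy: "leaf" has two parents that share a grandparent.
toy_parents = {
    "leaf": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
}
expanded = add_parent_terms({"leaf"}, toy_parents)
```

Because every ancestor is added, a gene annotated with one specific term can end up with many annotations, which is consistent with the large increase in annotations per gene reported in the results.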
Removal of uninformative GO terms
The hierarchical nature of the Gene Ontology resulted in some annotations being too general and frequent to be considered informative. For this reason, and based on assessment of the GO annotation frequencies across the genes in the data set, annotations present in over 50% of genes were removed; these deleted annotations are listed in Additional File 1. These annotations are highly unlikely to be identified as enriched within a single pathway during later processing stages. Removing them at this point reduces repeated testing.
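The frequency filter can be expressed in a few lines. This is a sketch only: the name `drop_frequent_terms` and the `gene -> set of GO terms` input layout are assumptions, and the 50% cut-off is the one stated above.

```python
from collections import Counter


def drop_frequent_terms(gene_terms, max_fraction=0.5):
    """Remove GO terms annotated to more than `max_fraction` of all genes."""
    # Count, for each term, the number of genes carrying it.
    counts = Counter(term for terms in gene_terms.values() for term in set(terms))
    limit = max_fraction * len(gene_terms)
    frequent = {term for term, n in counts.items() if n > limit}
    filtered = {gene: set(terms) - frequent for gene, terms in gene_terms.items()}
    return filtered, frequent


# Toy annotations with placeholder labels: "general" is on 3 of 4 genes.
toy_annotations = {
    "g1": {"general", "specific_a"},
    "g2": {"general", "specific_b"},
    "g3": {"general"},
    "g4": {"specific_a"},
}
filtered, removed = drop_frequent_terms(toy_annotations)
```

Note that "specific_a" sits on exactly half of the toy genes and is kept, because the rule removes only terms present in over 50% of genes.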
Annotation of pathways
GO annotations associated with pathway genes were used to infer the function of the CPDB pathways. Only biological process annotations were used; molecular function and cellular component information were not incorporated. The Shapiro test was performed to ensure that none of the GO terms were randomly distributed across the pathways (p << 0.001). Enrichment profiles were created to include all the GO terms enriched within a pathway's genes. Functional profiles were then generated to show the most specific enriched GO terms capable of describing the gene set. Functional profiles should therefore be considered as describing the main functional roles of each pathway, at the highest level of specificity possible.
Functional enrichment profiles were created using Fisher's exact test to identify all annotations enriched to a p-value of 0.01, within the pathway's gene set. The parameters used were: instances of the GO annotation within the pathway (how many genes the annotation was attributed to), instances of other GO terms in the pathway, instances of the annotation outside the pathway, and instances of all other GO terms outside the pathway. Using an enrichment score of 0.01 as the threshold for allocating GO terms, annotations are assigned at 99% specificity. Rather than correcting for multiple testing, we use later processing stages to remove false positive annotations, which are designed to be flexible to the varying specificity of GO term-pathway relationships. P-values gained from Fisher's exact tests are therefore referred to as enrichment scores.
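The four counts listed above form a 2x2 contingency table. The following sketch computes an enrichment score from such a table using only the standard library. It assumes a one-sided (over-representation) test, which the text does not state explicitly, and the function name `fisher_enrichment` and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    a: instances of the GO term inside the pathway
    b: instances of other GO terms inside the pathway
    c: instances of the GO term outside the pathway
    d: instances of other GO terms outside the pathway
    """
    inside = a + b            # all annotation instances inside the pathway
    term_total = a + c        # all instances of this GO term
    total = a + b + c + d
    upper = min(inside, term_total)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    favourable = sum(
        comb(term_total, k) * comb(total - term_total, inside - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, inside)


score = fisher_enrichment(3, 1, 1, 3)
is_enriched = score < 0.01  # threshold used in the text
```

For the small table above the score is 17/70, well above the 0.01 threshold, so the term would not be assigned to the pathway.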
The functional profile of a pathway is defined as a reduced set of enriched GO annotations that give maximum representation of a pathway's genes. Enriched annotations that were only present in one gene/protein within the pathway were excluded, as they are likely to be spurious and give a poor representation of the pathway's function.
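The text does not give the algorithm used to reduce an enrichment profile to a functional profile. One plausible reading is a greedy set cover, sketched below purely as an assumption: terms annotating a single pathway gene are discarded, then the term covering the most still-uncovered genes is chosen repeatedly. The name `functional_profile` is hypothetical, and this sketch ignores term specificity, which the authors' method takes into account.

```python
def functional_profile(term_genes, min_genes=2):
    """Greedy cover: pick enriched terms until no term adds uncovered genes.

    term_genes maps each enriched GO term to the pathway genes it annotates.
    """
    # Exclude terms present in only one gene of the pathway.
    candidates = {t: set(g) for t, g in term_genes.items() if len(set(g)) >= min_genes}
    covered, profile = set(), []
    while candidates:
        # Ties are broken by term name so the result is deterministic.
        best = max(candidates, key=lambda t: (len(candidates[t] - covered), t))
        if not candidates[best] - covered:
            break
        profile.append(best)
        covered |= candidates.pop(best)
    return profile


# Toy enrichment profile with placeholder term and gene labels.
toy_profile = functional_profile({
    "t1": {"A", "B", "C"},
    "t2": {"C", "D"},
    "t3": {"A", "B"},   # adds nothing once t1 is chosen
    "t4": {"E"},        # single-gene term, excluded
})
```

Greedy selection does not guarantee the smallest possible cover, but it yields a short list of terms that together describe as many pathway genes as the candidates allow.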
Pleiotropic genes within pathways
Pleiotropy describes genes that contribute to more than one phenotype, implying that the gene/protein is involved in more than one function. This may be due to presence of the gene/protein in different pathways, or the genes within a single pathway affecting multiple functions, resulting in pathway multi-functionality. These additional functions may be missed in the initial formation of functional profiles, as only the most enriched annotations for each gene set are included. A second processing stage was added to capture pleiotropic annotations. Semantic distances between GO terms were taken from Ames et al. (2013). Semantic distances were available for 88% of GO annotations within the enriched profiles. Identifying phenotypic pleiotropy is complex, as the distinction between different characters and multiple attributes of a single character is often unclear. To ensure that the terms we add are truly pleiotropic we have chosen to use only terms that are semantically very different from existing terms in the functional profile.
Within functional profiles, the median semantic distance between pairs of GO terms was 6 and 95% of GO term pairs had semantic distances below 11.2. Therefore a semantic distance of 11.2 was used as the threshold for pleiotropy. To avoid false positive annotations, GO terms from enriched profiles were only considered pleiotropic if they had an enrichment score below 0.0005. The semantic distance between each GO term in each pathway's enriched profile and all the GO terms in the functional profile was measured. Any enriched annotations that had a distance greater than 11.2 from all of the GO terms in the functional profile were considered pleiotropic and added to the functional profile. Using these parameters 32 GO terms were added to 25 pathways.
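The selection rule for pleiotropic annotations can be sketched as below. The thresholds (11.2 and 0.0005) come from the text. The function name `pleiotropic_terms`, the data layout, and the decision to skip terms with a missing semantic distance are assumptions made for illustration.

```python
def pleiotropic_terms(enriched, profile, distance, min_distance=11.2, max_score=0.0005):
    """Return enriched terms that are semantically far from every profile term.

    enriched: {term: enrichment score} for one pathway
    profile:  list of terms already in the pathway's functional profile
    distance: {frozenset({term_1, term_2}): semantic distance}
    """
    added = []
    for term, score in enriched.items():
        if term in profile or score >= max_score:
            continue
        pairs = [frozenset((term, member)) for member in profile]
        # Assumption: a term with any unknown distance is not added.
        if all(p in distance and distance[p] > min_distance for p in pairs):
            added.append(term)
    return added


# Toy data with placeholder terms and made-up distances.
toy_distance = {
    frozenset(("x", "p1")): 12.0, frozenset(("x", "p2")): 13.0,
    frozenset(("y", "p1")): 12.0, frozenset(("y", "p2")): 5.0,   # too close to p2
    frozenset(("z", "p1")): 15.0, frozenset(("z", "p2")): 15.0,
    frozenset(("w", "p1")): 14.0,                                # w-p2 unknown
}
toy_enriched = {"x": 0.0001, "y": 0.0001, "z": 0.004, "w": 0.0001}
toy_added = pleiotropic_terms(toy_enriched, ["p1", "p2"], toy_distance)
```

Only "x" passes both filters: "y" is too close to a profile term, "z" has too weak an enrichment score, and "w" lacks a distance to one profile term.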
where A and B are two sets of GO terms.
Linking functionally similar annotations
Genetic interaction analysis
GIs frequently occur between genes/proteins in pathways that share functions. Based on this knowledge it is expected that topological clusters (see Additional File 2) in the network will be enriched for GIs. This was tested using a set of GIs from BIOGRID. Excluding GIs involving genes that were absent from the data set resulted in a list of 29,309 GIs. For each GI, the set of pathways that each gene/protein participates in was retrieved, and all pathway combinations were examined. If both genes/proteins appeared in a single pathway, a within-pathway GI was recorded, whereas if each gene/protein appeared in a different pathway but the pathways were in the same cluster, a within-cluster GI was recorded. GIs linking pathways from different clusters or involving unclustered pathways were recorded as uncharacterised.
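The classification of a single GI can be sketched as follows. The function name `classify_gi` and both lookup tables are hypothetical. The sketch also assumes each GI receives one label, with within-pathway taking precedence over within-cluster, whereas the text does not say how a GI spanning several pathway combinations was counted.

```python
from itertools import product


def classify_gi(gene_a, gene_b, gene_pathways, pathway_cluster):
    """Label one genetic interaction by where its two genes sit in the network.

    gene_pathways:   {gene: set of pathways containing it}
    pathway_cluster: {pathway: cluster id}; unclustered pathways are absent
    """
    pathways_a = gene_pathways.get(gene_a, set())
    pathways_b = gene_pathways.get(gene_b, set())
    if pathways_a & pathways_b:
        return "within-pathway"
    # Examine every combination of one pathway per gene.
    for pa, pb in product(pathways_a, pathways_b):
        cluster_a = pathway_cluster.get(pa)
        if cluster_a is not None and cluster_a == pathway_cluster.get(pb):
            return "within-cluster"
    return "uncharacterised"


# Toy network with placeholder gene, pathway and cluster labels.
toy_gene_pathways = {"g1": {"P1"}, "g2": {"P1", "P2"}, "g3": {"P3"}, "g4": {"P4"}}
toy_clusters = {"P1": 0, "P3": 0, "P4": 1}
labels = [
    classify_gi("g1", "g2", toy_gene_pathways, toy_clusters),
    classify_gi("g1", "g3", toy_gene_pathways, toy_clusters),
    classify_gi("g1", "g4", toy_gene_pathways, toy_clusters),
]
```

Counting the labels over all 29,309 GIs would give the within-pathway and within-cluster totals needed for an enrichment calculation.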
Characterising the profiles of multi-pathway genes/proteins
To establish whether genes/proteins acting in multiple pathways are performing different roles, we performed pairwise comparisons of semantic distances between the annotations in multiple functional profiles. The sum of the semantic distances was divided by the number of genes in the profiles' union.
Generation of the gene/protein overlap heat map
Many proteins were present in multiple pathways. To examine the relatedness of these pathways' functions, a heat map was created to compare gene/protein overlap against functional similarity. Pathways were arranged into a tree based on functional similarity, shown on both axes. This was calculated by carrying out pairwise comparisons of all GO terms between functional profiles, and taking the mean semantic distances. The tree structure was created by QuickTree using the Unweighted Pair Group Method with Arithmetic Mean joining method. The heat map was created by calculating the percentage of gene/protein overlap between pathways and colouring cells accordingly.
Results and discussion
We produced a set of functionally annotated pathways, which were assembled into a network to show functional organisation. The major functional subgraphs are identified and the relationship between functions is discussed. The functional variability of genes/proteins that participate in multiple pathways is evaluated. GI enrichment within network clusters was measured.
Biological functions require the cooperation of multiple genes and proteins. Most functional representations associated with individual genes/proteins are derived from the curation of scientific papers that focus on small numbers of genes, making them highly idiosyncratic and often failing to capture the cooperative aspect of biological function. In order to create systems-wide models that are more suitable to biological interpretation and understanding, new representations are needed that better reflect the cooperative nature of function. Biological pathways are a suitable candidate for higher-level representation of biological function, since they group genes and proteins that interact to produce a specific cellular or physiological outcome.
Generation of a functionally representative set of pathways
A set of 1050 S. cerevisiae pathways was obtained from CPDB and processed to remove data duplication and reduce the range of pathway sizes (pathway sizes in the original data set ranged from 1 to 310). Removal of duplicated pathway names and gene sets, as well as pathways containing fewer than three genes, reduced the number of pathways in the data set to 553 (Table 1). Further processing of duplicated data selectively removed pathways whose size deviated from the median, helping to reduce the standard deviation from 23.2 in the original data set to 13.1 in the final data set. The largest pathway in the original data set was 'Metabolism' containing 310 genes, which would have dominated much of the network. The largest pathway in the final data set was 'Protein processing in endoplasmic reticulum' with a more comparable 78 genes.
Assignment of Gene Ontology Terms to Genes
Annotations were available for 92% of genes in the data. Adding parent annotations to the GO terms initially assigned to the genes increased the median number of annotations from two to 38 and the maximum from eight to 149.
Removing highly frequent, uninformative annotations from the data set reduced the median number of annotations per gene from 38 to 31. Within this final data set the range of annotations assigned to genes was large, ranging from one to 208; 75% of genes had between 14 and 66 annotations. This variability may be due to genes being attributed GO terms with large numbers of parent annotations or gene/protein multi-functionality.
Generation of functional profiles of pathways
Fisher's exact test produced large numbers of overrepresented GO terms for each pathway (median 26, range 1-159). This is in part related to the hierarchical nature of the Gene Ontology, implying that many of these annotations are describing a small number of functions at various levels of detail. Functional profiles were created to give a succinct representation of each pathway's specific functions, by selecting a reduced set of GO terms to describe the maximum number of genes/proteins inside each pathway (median 2, range 1-9). Only 35% of pathways were described by a single GO term, demonstrating that functions defined by the Gene Ontology cannot be directly mapped onto pathways, as the relationship is more complex. A moderate correlation was found between the number of genes/proteins in a pathway and the number of GO terms in its functional profile (coefficient 0.5). The majority of pathways had unique functional profiles; however, 13% of functional profiles were not unique to a pathway, indicating that some GO functions may be shared by discrete groups of pathways.
Improved functional profile comprehensiveness through incorporation of gene pleiotropy
Examples of the data added through the inclusion of pleiotropic genes. Table entries: cellular carbohydrate catabolic process; trehalose degradation II; cellular carbohydrate catabolic process; fructose metabolic process.
Functional diversity of pathways
Functional network subgraphs
A further difference between our network and PPI networks is that PPI networks tend to be hub-based, with a topology dominated by a small number of highly connected hub proteins and scale-free properties [5, 30, 31]. Scale-free networks, characterised by a power-law degree distribution P(k) ~ k^(-γ), where γ is typically between 2 and 3, are common in both biological and non-biological networks. Within our network, hub nodes would be expected to appear as highly multifunctional pathways. However, although the degree distribution did follow a power law (γ = 1.3), the low γ value indicates that none of the nodes are disproportionately influential.
Co-occurrence with genetic interactions
Enrichment of GIs within pathways and network clusters.
Genetic interaction data
Pathway dependent gene/protein multi-functionality
Comparison to Over Representation Analysis
To validate our results we compared them to DAVID, a tool commonly used for over representation analysis (ORA). We used DAVID to group genes based on GO annotation similarity. We then measured the number of shared annotations between genes from the same or different ORA groups. Gene pairs within the same ORA groups shared a mean of 3.9 annotations (n = 63,143) while genes in different ORA groups shared a mean of 0.6 (n = 764,398), indicating that the edges within our network are strongly supported by DAVID functional groupings (Welch's t-test gave p = 0.0).
Our method produces pathway annotations from GO data and organises pathways into a network representation of cellular function. The network contains 271 pathways, covering a wide range of functions including metabolism, signal transduction, gene expression and DNA maintenance. Yeast has 6604 genes of which 5151 are characterised; therefore the 1433 annotated genes analysed within the pathways of this network should not be considered as complete coverage. Our method can however be adjusted to allow more genes and pathways into the final data set, or to study specific sets of pathways. The highly frequent GO terms in Additional File 1 highlight the bias towards metabolic pathways within the data set.
We have developed a method for organising cellular processes based on function, which accounts for temporal interactions modelled through pathways and allows multifunctional genes to be portrayed independently in their different biological contexts. The network illustrates the organisation of function, as multiple pathways co-operate to ensure cellular processes are coordinated. Pathway multi-functionality was examined, determining that pathways vary greatly in the number and diversity of GO functions they facilitate. The functional variability of genes within multiple pathways was also demonstrated. Appreciation of multi-functionality at the level of both genes and pathways is critical for understanding pleiotropic genes and their relationship to multiple phenotypes, interpreting GIs and considering the transfer of information within the cell. Our representation of cellular function will enable analysis of gene/protein activity in the context of their functional roles, instead of the typical molecule-centric approach. This method can be adapted to incorporate different data types into the network, such as expression data and genetic interaction data. Future work will include incorporation of expression data to create directed edges showing the information flow between pathways.
RAS is funded by a BBSRC studentship (BB/J014478/1) to DLR, JMS and GN. RMA is supported by the Wellcome Trust Institutional Strategic Support Award (WT105618MA). We thank Andy Brass, Daniela Delneri, Nikola Milosevic, Michele Filannino (University of Manchester) and Maurício Kritz (National Laboratory for Scientific Computing, Brazil) for advice and support.
The publication charges for this article were funded by an RCUK block grant to the University of Manchester.
This article has been published as part of BMC Systems Biology Volume 9 Supplement 6, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Systems biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/9/S6.
- Janjić V, Pržulj N: Biological function through network topology: a survey of the human diseasome. Brief Funct Genomics. 2012, 11: 522-32.
- Lee J, Lee J: Hidden information revealed by optimal community structure from a protein-complex bipartite network improves protein function prediction. PLoS One. 2013, 8 (4): e60372.
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol. 2000, 18 (12): 1257-1261.
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88.
- Yook SH, Oltvai ZN, Barabási AL: Functional and topological characterization of protein interaction networks. Proteomics. 2004, 4 (4): 928-942.
- Vidal M, Cusick ME, Barabási AL: Interactome networks and human disease. Cell. 2011, 144 (6): 986-998.
- Chen J, Yuan B: Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics. 2006, 22 (18): 2283-2290.
- Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol. 2005, 23 (5): 561-566.
- Przulj N, Wigle DA, Jurisica I: Functional topology in a network of protein interactions. Bioinformatics. 2004, 20 (3): 340-348.
- Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, et al: The genetic landscape of a cell. Science. 2010, 327 (5964): 425-431.
- Dutkowski J, Kramer M, Surma MA, Balakrishnan R, Cherry JM, Krogan NJ, Ideker T: A gene ontology inferred from molecular networks. Nat Biotechnol. 2013, 31: 38-45.
- Blondel V, Guillaume J: Fast unfolding of communities in large networks. J Stat ... 2008, 1-12.
- Song J, Singh M: How and when should interactome-derived clusters be used to predict functional modules and protein function?. Bioinformatics. 2009, 25 (23): 3143-3150.
- Wang J, Li M, Deng Y, Pan Y: Recent advances in clustering methods for protein interaction networks. BMC Genomics. 2010, 11 (Suppl 3): S10.
- Hyduke DR, Palsson BØ: Towards genome-scale signalling network reconstructions. Nat Rev Genet. 2010, 11 (4): 297-307.
- Ames RM, Macpherson JI, Pinney JW, Lovell SC, Robertson DL: Modular biological function is most effectively captured by combining molecular interaction data types. PLoS One. 2013, 8 (5): e62670.
- Kamburov A, Wierling C, Lehrach H, Herwig R: ConsensusPathDB--a database for integrating human functional interaction networks. Nucleic Acids Res. 2009, 37 (Database issue): D623-D628.
- Belinky F, Nativ N, Stelzer G, Zimmerman S, Iny Stein T, Safran M, Lancet D: PathCards: multi-source consolidation of human biological pathways. Database (Oxford). 2015, 2015: bav006-bav006.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29.
- Shapiro S, Wilk B: An Analysis of Variance Test for Normality (Complete Samples). Biometrika. 1965, 52: 591-611.
- He X, Zhang J: Toward a molecular understanding of pleiotropy. Genetics. 2006, 173 (4): 1885-1891.
- Wagner GP, Zhang J: The pleiotropic structure of the genotype-phenotype map: the evolvability of complex organisms. Nat Rev Genet. 2011, 12: 204-213.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (Database issue): D535-D539.
- Howe K, Bateman A, Durbin R: QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics. 2002, 18 (11): 1546-1547.
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, et al: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32 (Database issue): D262-D266.
- Carroll SY, Stimpson HEM, Weinberg J, Toret CP, Sun Y, Drubin DG: Analysis of yeast endocytic site formation and maturation through a regulatory transition point. Mol Biol Cell. 2012, 23 (4): 657-668.
- Kunze M, Pracharoenwattana I, Smith SM, Hartig A: A central role for the peroxisomal membrane in glyoxylate cycle function. Biochim Biophys Acta - Mol Cell Res. 2006, 1763 (12): 1441-1452.
- Dudek J, Rehling P, van der Laan M: Mitochondrial protein import: Common principles and physiological networks. Biochim Biophys Acta - Mol Cell Res. 2013, 1833 (2): 274-285.
- Akira S, Takeda K: Toll-like receptor signalling. Nat Rev Immunol. 2004, 4: 499-511.
- Albert R: Scale-free networks in cell biology. J Cell Sci. 2005, 118 (Pt 21): 4947-4957.
- Winterbach W, Van Mieghem P, Reinders M, Wang H, de Ridder D: Topology of molecular interaction networks. BMC Syst Biol. 2013, 7: 90.
- Barabási AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004, 5 (2): 101-113.
- Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research. 2009, 37 (1): 1-13.
- Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, et al: Saccharomyces Genome Database: The genomics resource of budding yeast. Nucleic Acids Res. 2012, 40 (Database issue): 700-705.
- Gagiano M, Bauer FF, Pretorius IS: The sensing of nutritional status and the relationship to filamentous growth in Saccharomyces cerevisiae. FEMS Yeast Res. 2002, 2 (4): 433-470.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.