Mining functional subgraphs from cancer proteinprotein interaction networks
 Ru Shen^{1, 2},
 Nalin CW Goonesekere^{3} and
 Chittibabu Guda^{1, 4}Email author
https://doi.org/10.1186/175205096S3S2
© Shen et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Abstract
Background
Proteinprotein interaction (PPI) networks carry vital information about proteins' functions. Analysis of PPI networks associated with specific disease systems including cancer helps us in the understanding of the complex biology of diseases. Specifically, identification of similar and frequently occurring patterns (network motifs) across PPI networks will provide useful clues to better understand the biology of the diseases.
Results
In this study, we developed a novel patternmining algorithm that detects cancer associated functional subgraphs occurring in multiple cancer PPI networks. We constructed nine cancer PPI networks using differentially expressed genes from the Oncomine dataset. From these networks we discovered frequent patterns that occur in all networks and at different size levels. Patterns are abstracted subgraphs with their nodes replaced by node cluster IDs. By using effective canonical labeling and adopting weighted adjacency matrices, we are able to perform graph isomorphism test in polynomial running time. We use a bottomup pattern growth approach to search for patterns, which allows us to effectively reduce the search space as pattern sizes grow. Validation of the frequent common patterns using GO semantic similarity showed that the discovered subgraphs scored consistently higher than the randomly generated subgraphs at each size level. We further investigated the cancer relevance of a select set of subgraphs using literaturebased evidences.
Conclusion
Frequent common patterns exist in cancer PPI networks, which can be found through effective pattern mining algorithms. We believe that this work would allow us to identify functionally relevant and coherent subgraphs in cancer networks, which can be advanced to experimental validation to further our understanding of the complex biology of cancer.
Keywords
Background
Proteinprotein interaction (PPI) networks carry vital information on the molecular functions and biological processes of cells. Analysis of PPI networks associated with specific disease systems including cancer helps us to better understand the complex biology of diseases. PPI networks are dynamically modulated in a tissuespecific microenvironment; hence, a set of similarly expressed genes from two types of cancer tumors may exhibit different PPI patterns. A lot of gene expression data has been accumulated on cancerspecific tumors warranting the need for developing effective algorithms to translate the differentially expressed gene lists into functionally coherent modules that are common to all cancers or shared in a given subset of cancers. To achieve this, genes are mapped to corresponding proteins and known PPIs are represented as a network graph for further analysis. Using graph theorybased algorithms, pairs of networks can be compared to identify common, distinct or frequent subnetworks. These subnetworks containing a set of proteins (nodes) with a distinct set of connections (edges) can represent a functional unit in a pathway or in a biological process. Similarly, frequent subnetworks (network motifs) may represent recurring functional units within a network or among multiple networks. In this study, we focus on developing a graphbased algorithm to identify common and frequent network motifs from PPI networks of nine different cancers.
Graphs have been widely used to model a variety of data types such as PPI networks [1], biological pathways [2] and molecular structure of chemical compounds [3]. Graph comparison has a wide range of applications in biological data analysis. For example, by aligning biological pathways represented by graphs, evolutionarily conserved patterns are identified [2]. Similarly, by measuring the discrepancies between PPI networks of healthy and sickened individuals, interactions that are involved in disease outbreak and progression are determined [4].
Existing methods for graph comparison can be categorized into the following three major types: distancebased, alignmentbased and kernelbased methods. In a distancebased method, similarity of graphs is measured based on the graphs' common structures [5, 6]. The larger a maximum common subgraph (MCS) is, the more similar are the two graphs; and thus the smaller the MCS distance between the graphs is. The MCS distance between the graphs is defined to be 1V_{ mcs } /{V_{1}, V_{2}} where V is the number of nodes in graph G = (V, E) [5]. The MCS distance method only considers the maximum common subgraph when comparing graph similarity. It will only identify graphs that globally resemble each other and ignore graphs that share many similar but disconnected subgraphs. Another distancebased method [7] measures the similarity of graphs based on their edit distance. With substitutions, deletions and insertions for both nodes and edges, any graph can be transformed into another graph by iteratively applying these operations. Intuitively the more operations needed, the more dissimilar the two graphs are. With a cost function associated with each operation, the graph edit distance is defined to be the minimum total cost to transform one graph to the other. However, similar to the MCS method, the edit distance methods also measure only the global similarity of the graphs.
The alignmentbased methods utilize the idea of graph alignment that is conceptually similar to sequence alignment. In sequence alignment, different scores or penalties are assigned for matches, mismatches and gaps, and the alignment algorithm looks for the best way to arrange the sequences so that the overall alignment score is maximized. In graph alignment, the similarities of graphs are determined by the conservation of interactions, which is measured through the edges and similarity of nodes [8, 9]. Depending on the requirement, the nodebased or edgebased weights are used in calculating the alignment score [8]. Graph alignment algorithms such as PathBLAST [2] use the dynamic programming approach to find optimum solutions. Graph alignment algorithms can detect global or local similarity depending on the scoring function used by the algorithm. However these algorithms either end up with exponential running time or turn to heuristic methods for solutions when dealing with graphs that contain cycles.
The third approach, using kernelbased methods measures graph similarities through kernel functions. Existing graph kernels can be viewed as a special case of Rconvolution kernels proposed by Haussler [10]. The basic idea of a graph kernel is to decompose a graph into smaller substructures, and build the kernel based on similarities between the decomposed substructures. The natural and most general Rconvolution on graphs would decompose graphs to all of their subgraphs and compare each pair of the subgraphs. However, it is proven in that computing allsubgraph kernel is as hard as deciding subgraph isomorphism which is NPhard [11]. Alternative graph kernels include product graph kernel, marginalized kernel, subtreepattern kernel, and so on. These kernels differ in the way they decompose graphs to substructures and the similarity measure they use to compare the substructures. Similar to distancebased methods, kernel methods can only be used to measure global similarity of graphs. There is no information about which subgraphs contribute to the similarities.
One of the most important tasks in the analysis of PPI networks is to predict functional modules that represent either stable protein complexes or groups of transiently interacting proteins that together can accomplish a biological function. These functional modules can be mapped to specific subgraphs in PPI networks. Below, we discuss three methods that have been used to extract substructures from graphs: (i) frequent subgraph identification, (ii) graph segmentation and (iii) corebased clustering. Aprioribased approach and pattern growth approach are the two major types of algorithms for identifying frequent subgraphs. The discovery of frequent subgraphs usually consists of two steps that include candidate generation and frequency counting. Aprioribased algorithms such as FSG [12] generate candidates of larger size by joining two smaller subgraphs. In order for two frequent ksubgraphs to be eligible for joining, they must contain the same (k1)subgraph. This introduces a lot of overhead, as there are multiple ways to join two subgraphs of size k. The frequency verification step involves subgraph isomorphism test and therefore is not feasible for large graphs. On the other hand, the pattern growth approach [13] extends patterns from a single pattern directly, instead of joining two smaller subgraphs. Pattern growth approach needs to deal with the redundancy problem: the same (k+1)subgraph can be generated from extending many different ksubgraphs. Both aprioribased approach and pattern growth approach are restricted by the graph size due to the subgraph isomorphism problem. Heuristic methods such as Subdue [14] look for incomplete result set. Subdue is an approximate algorithm and finds patterns that can best compress the original graph by substituting those patterns with a single vertex. Minimum description length (MDL) is used to evaluate how efficient the graph can be compressed.
Graph segmentation method extracts substructures by partitioning graphs into disjoint dense subgraphs. Kmeans clustering [15] aims to partition graphs to clusters that minimize the withincluster sum of squares. Mincut [16] and a more recent spectral clustering algorithm [17] consider not only the withincluster density but also intercluster distance. King et al. [18] used a costbased local search algorithm to find highly interconnected subsets of nodes.
In contrast to the graph segmentation method, where the central nodes of the subgraphs are usually randomly chosen, in corebased clustering the central nodes are selected before clustering is performed [19, 20]. The central nodes are also referred to as seeds or core of substructures. MCODE method [1] selects the central nodes based on the highest kcore of the nodes neighborhood. A kcore is a graph of minimal degree k. All nodes are weighted based on their local network density using the highest kcore of the nodes neighborhood. SPICi method proposed by Jiang and Singh [19] chose the nodes that have highest weighted degree as core nodes. After selecting the central nodes, clusters are expanded to maximize the local density of the substructures. The expansion stops when local density stops increasing or when all nodes are exhausted.
Due to the NPhardness of many graph problems, most of the previous methods offer approximate solutions to measure graph similarity. In this paper we present a method that produces the exact solutions in graph comparison and pattern identification. Our algorithm works in a bottom up fashion. It starts from onenode subgraph, and proceeds to oneedge and multipleedge subgraph. At each loop the search space is reduced by eliminating parts of networks that are not eligible for next round of comparison. Even though the runtime increases exponentially as the size of subgraph increases, in our case the size of the search space, as the base of the exponential, reduces quickly. Therefore we can obtain the complete result in a reasonable amount of time. As we look for common substructures across the networks, we also perform graph isomorphism test. Graph isomorphism problem is known to be in NP; however, it's unknown to be in P or NPcomplete if P ≠ NP. In our specific context of network comparison, we solve this in polynomial time with our patternlabeling algorithm.
We applied our algorithm on nine cancer associated PPI networks to identify common and frequent patterns in these networks. We collected differentially expressed genes from microarray studies of various solid tumor tissues derived from the Oncomine database [21]. Using the algorithm we identified common frequent subgraphs of up to 10 edges in these networks. These subgraphs may correspond to functional modules that play common roles in cancer diseases as they occur multiple times in all the nine cancer networks.
Results and discussion
Cancer protein interaction networks
Our PPI networks are constructed from a comprehensive, nonredundant dataset of experimentallyderived PPIs [22] that are collected from five major databases including IntAct [23], MINT [24], HPRD [25], DIP [26] and BIND [27]. Our goal is to mine for cancerassociated subgraphs from the global interaction networks; however, PPI data that are specific to a cancer tumor do not exist in the public domain. Hence, we used all the available PPI datasets for humans from five major databases as the basis for our studies. In our final human PPI network, there are 19,710 unique proteins representing 95,931 unique interactions. Note that this unique set of proteins exhibit some level of redundancy because splice variants with minimal sequence differences are included as unique proteins due to the fact that PPIs are isoformspecific.
Number of genes and proteins mapped under each cancer network.
Cancer type  Number of genes  Number of proteins  Edge count  Node count 

Bladder cancer  11771  29286  47909  10726 
Breast cancer  11373  26498  33558  8611 
Cervical cancer  9811  22447  19332  6288 
Colorectal cancer  18982  40905  58212  13273 
Esophagus cancer  5135  13380  13405  4218 
Gastric cancer  12137  28224  41289  9707 
Melanoma  8763  22421  30843  7677 
Pancreatic cancer  17339  37160  52125  12199 
Prostate cancer  11181  27598  41658  9621 
Network analysis
The reason we are interested in frequent patterns is that the presence of these subgraphs in PPI networks constitute an analogy to motifs in multiple sequence alignment. These frequent subgraphs represent conserved functional modules that play significant roles in the disease systems we study. First we look for frequent subgraphs within a network because of the possibility of finding more than one identical subgraph from nodes that belong to the same cluster (see below). Then we perform comparative analysis across multiple networks to measure the commonality across networks. These subgraphs must be connected components, which is a prerequisite for forming protein complexes or pathways. Our method of frequent pattern extraction involves the following three key steps: identification of node similarity, graph isomorphism test and discovery of frequent patterns.
Identification of node similarity
Each node in a PPI network represents a unique protein. Nodes are considered similar if the proteins they represent have similar functions. We use the sequence alignment algorithm Blastclust [28] to cluster protein sequences into mutually exclusive groups. Proteins present in the same cluster are deemed functionally similar to each other and they will be assigned the same cluster ID. Blastclust is a singlelinkage clustering algorithm to cluster sequences hierarchically. It begins with pairwise alignment and places a sequence in a cluster if it matches at least one of the sequences already in the cluster. Blastclust uses the BLASTP algorithm to compute the pairwise matches. We used stringent criteria of 90% sequence identity over 95% of the length of each sequence and divided 18,888 proteins to 14,838 clusters. The cluster ID will be tagged to each node in the network and will be used in pattern labeling process as described in the following section.
Graph isomorphism test
The basic idea in canonical graph labeling [29] is to represent relational graph data using a sequence of symbols that can uniquely identify the graph. Kuramochi et al. [12] proposed to use concatenation of upper triangle of adjacency matrix as canonical label of graphs. For graphs with no edge weights, an adjacency matrix is a binary matrix. Every row and column corresponds to a node in the graph. The value at M[i, j] is 1 if there is an edge between node i and node j and 0 otherwise. For undirected graphs, the adjacency matrix is symmetric on its main diagonal. Therefore we can use upper right triangle of the adjacency matrix to fully represent a graph. The ordering of rows and columns will determine the content of adjacency matrix. We order the rows and columns using protein IDs the nodes are labeled with. The adjacency matrix generated in such way unambiguously represents a given graph. To create the canonical label of the graph, we first concatenate the protein IDs sorted in order. Then we concatenate the upper triangle of the adjacency matrix.
PageRank algorithm [30] is used by Google Internet search engine to measure relative importance of web pages. The algorithm calculates a numeric value for each node to indicate its ranking in the overall network. Based on the ranking information, Google can determine which web pages are more important or more relevant and tune their search results accordingly. A similar idea can be applied to compute structural equivalence. In PageRank, all graph nodes are considered of the same type. So the ranking information solely reflects the positions of nodes in the graph. In our case, we want to first differentiate graph nodes based on their cluster ID; then differentiate the nodes based on their equivalence class (see below). To achieve this purpose, we assign weights to nodes based on their cluster ID. We associate a unique integer value with each cluster. The same integer value will be assigned to all nodes in the cluster as the weight of the node. The magnitude of the weight is not an indication of the functional importance of the cluster. It is solely used to differentiate the clusters.
Hyperlink matrix is a stochastic matrix. Every column of H sums to 1. The entry H[i, j] indicates the probability of moving from node j to node i. It can also be understood as the ratio of contribution node j makes to node i among all nodes j connected to. Let v be the vector storing relative importance of nodes. v[i] denotes the relative importance of node i. A node's relative importance is determined by the contribution all other nodes have made to it. So we need to solve the equation Hv = v. This is actually to find the Eigen vector corresponding to eigenvalue 1 of matrix H. Eigenvalue computation can be performed in polynomial time.
It shows that A1 and A2 are of the same relative importance. They will be included in the same equivalence class. B1 and B2 will also be included in the same equivalence class. Then we sort nodes based on cluster ID at first level and equivalence class at second level. In matrix M when we shuffle nodes in the same equivalence class, the matrix content will not be changed; the canonical label remains the same. Therefore permutations are not needed to generate a unique pattern label.
In Figure 3B, node A1, A2 and A3 are from the same cluster. But A3 falls into a different equivalence class from A1 and A2 because their relative importance values (the middle column) are different. When we sort the nodes, the relative positions between equivalence classes are fixed. The order is based on the relative importance value. The relative position within equivalence classes can be changed without impacting the content of matrix.
Using the algorithm described above we can generate pattern labels for graphs. Generally it takes O(n^{3}) time to compute eigenvalue decomposition. Constructing adjacency matrix and hyperlink matrix each takes O(n^{2}) time. Sorting of nodes takes O(n lg n) time. Thus the algorithm to compute pattern labels runs in polynomial time.
Discovery of frequent patterns
Finding frequent subgraphs is an NPhard problem. When the size of the subgraph is a variant, finding frequent subgraphs takes exponential runtime. Therefore, to solve frequent subgraphs problem we need to effectively reduce the search space as subgraph size increases. To accomplish this, we take the bottom up approach to find small subgraphs first and proceed to larger subgraphs. We start with frequent subgraphs of 1 node. We look for clusters with size no less than the given threshold in each network. This can be done through a simple counting of nodes within each cluster in each network. Among the selected clusters, we look for those present in all networks. Nodes belonging to these clusters are kept; the rest are removed from the networks. Edges incident to removed nodes are also removed from the networks. On the remaining part of the networks we will discover patterns of next size level.
Figure 4 summarize the number of common and frequent patterns at each edge size. From 2edge to 4edge, the number of patterns increases as pattern size increases. In these cases, the number of patterns appears to be influenced by the possible combinations of edges, which is an increasing function of number of edges. From 4edge on, as the number of edges increases, there is a decline in the number of patterns. This is because it's harder for large size patterns to be both frequent and common. As shown in Figure 4, the 10edge is the maximum size of common and frequent pattern that could be found on our datasets. Beyond this point the number of patterns will become zero as the pattern size increases beyond 10 edges.
Performance validation
We compared our method with FSG, which is a frequent subgraphmining algorithm [12], on analyzing the 9 cancer PPI networks. Given a set of network transactions, FSG looks for subgraph patterns that exist in at least σ percent of the networks, where σ is the support threshold. We ran both programs on our 24core 2.93 Ghz Intel Xeon server. We set FSG σ to 100, which is equivalent to our method of finding common patterns in all given networks. FSG doesn't have the option of setting the subgraph support within each network and its default value is 1. At 2edge and 3edge levels, FSG ran faster than our method using less than one second and 1 second, respectively; while our method used 6 and 20 seconds, respectively. At 4edge level, FSG spent similar amount of time as our method, which is around 30 seconds. But FSG was not able to continue the task at 5andhigheredge levels and ran out of memory. The running time and resource requirements increased exponentially as the subgraph size increased. Our method, on the other hand, showed a much slower rate of increase in time complexity. When support within network is set to 2, our program took 800 seconds to find 5edge patterns. The running time reached the maximum for the 9edge patterns and then finally reduced to 600 seconds at the 10edge group.
The subgraph patterns identified by us are frequent within each network and also common to all the nine cancer networks. Hence, we hypothesize that each subgraph corresponds to an important functional module in cancer. We used GO semantic similarity [32] as a metric to quantitatively verify the functional importance of the frequent common patterns, and thus the performance of our method, in detecting the functional subgraphs. Semantic similarity provides a quantitative measure of how similar a pair of proteins is, based on the annotations (GO terms) in a given GO concept category. The idea is that the interacting proteins are more likely associated with similar cellular processes and/or involved in similar function. Hence, this similarity measure is higher for functionally related proteins, and vice versa. This concept has been very effective in interpreting the functional similarities of genes/proteins based on gene annotation information from heterogeneous data sources [33, 34].
Role of subgraph patterns in cancer
We also looked at some of the smaller subgraphs containing 28 edges and found a number of network patterns associated with cytoskeletal functions. One of the 8edge patterns is related to a functional unit consisting of actin (α, β and γ isoforms) and six actin associated genes, ACTR1A, CCT5, GSN, SPTAN1, TPM1, DYNLL1 and their homologs, that are differentially expressed across nine cancer types. CCT5 is a molecular chaperone, and is part of the TCP1 ring complex, known to fold various proteins including actin and tubulin. We find that CCT5 is uniformly upregulated across datasets. We hypothesize that CCT5 may play an important role in ensuring the correct folding of cytoskeletal proteins that are produced during cell proliferation in cancer. It is well known that the actin cytoskeleton is substantially modified in transformed cells, and this occurs in concert with changes in a host of actin filamentassociated regulatory proteins [46]. These changes are thought be integrally involved in the abnormal growth properties of tumor cells, their ability to adhere to tissue, and their increased ability to metastasize [47].
In the 5edge group of patterns, we have identified a functional module centered on the wellknown oncogene MYC, and Myc binding proteins, Max, Mycbp2 (PAM), and SP1, that are differentially regulated in nine cancers. Interestingly, this functional pattern also includes α and β tubulins and their homologs in various subgraphs as shown in Figure 5. The MYC protooncogene family has been the subject of intense scrutiny due to the involvement of deregulated MYC genes in a wide range of cancers [48]. Myc is a shortlived protein that promotes proliferation by regulating the expression of specific target genes. Myc requires the constitutively expressed family member Max to function. Myc and Max form heterodimers via basic helixloophelix leucine zipper domains and bind to Ebox regulatory elements in target genes. Myc overexpression upregulates genes directed towards cell growth: ribosome biogenesis, protein synthesis, and metabolism [49], and Myc has also been shown to repress genes that attenuate cell cycle progression [50]. Highthroughput sequencing of ChIP DNA (ChIPseq) has been used to locate 3465 DNA regions bound by Myc, 20% of which were up or downregulated as a consequence of cMyc expression [51]. Oncogenic activation is known to occur from constitutive and overexpression of the cMyc protein. For example, in Burkitt's lymphoma, a translocation of MYC, t(8,14) to a location that falls within the regulation of the strong promoter of immunoglobin genes increases the amount of expression of the MYC gene.
Conclusion
In this paper, we present a novel algorithm for mining frequent and common patterns across multiple cancer PPI networks. The comprehensive PPI datasets used in this study exhibit powerlaw distribution across all cancer networks. By using effective canonical labeling and adopting weighted adjacency matrices, we are able to perform graph isomorphism test in polynomial running time. The search starts from small patterns of 1 node, proceeds by incrementing the subgraph size 1 edge at a time, and stops when no frequent patterns are discovered for a certain edge level. As the size increments, the infrequent edges in the original networks are removed, thus reducing the search space for the next round of searching. We applied the algorithm on nine cancer PPI networks and identified frequent and common patterns of different sizes up to 10 edges. To validate the performance of our method, we compared these patterns against the randomly generated patterns at each edgegroup, using GO semantic similarity measure. Patterns identified in this study exhibited significantly higher scores compared to the random ones at all edgegroup levels indicating that these patterns are functionally cohesive modules. Further investigations on the specific role of each module in cancer revealed their intricate association with various cancerassociated processes such as transcriptional regulation, cell growth, cell proliferation, etc. Ingenuity pathway analysis of a 10edge module demonstrated that the cancerassociated functions are tightly dependent among the nodes of the subgraph as evidenced by both direct and interactions. Based on these results, we believe that the methodology developed in this study is capable of identifying common and frequent subgraphs from large and multiple interaction networks. While we used cancer PPI networks in our study, this is a generic methodology and hence can be applied to mine subgraphs from many other networks.
Methods
Human protein interactome dataset
We created a comprehensive, nonredundant dataset of experimentallyderived interacting proteins by combining multiple datasets (downloaded in the PSI MI 2.5 format) from five major protein interaction databases that include DIP (Database of Interacting Proteins) [26], IntAct [23], BIND (Biomolecular Interaction Network Database) [27], HPRD (Human Protein Reference Database) [25] and MINT (Molecular Interaction database) [24]. These datasets are fairly overlapping both within and across databases, and protein sequences in these databases are originally indexed with different source identifiers from UniProt, DIP, GenBank, etc. We have collected only those proteins belonging to the human species. To remove redundancy, we first created datasets of unique sequences (based on fulllength protein sequence string comparison) within each database and then merged them to create a nonredundant dataset of interacting protein sequences, each indexed with our internal identifier. Finally, we obtained 19,710 unique protein sequences representing 95,931 unique PPIs.
Calculation of GO semantic similarity
The semantic similarity of GO terms between two interacting proteins was calculated for all possible pairs of proteins in the human PPI network. The GO terms associated with each protein were obtained from the GO database. The GO annotation (GOA) for a protein can be based on three concepts i.e., biological process (P), molecular function (F) and cellular component (C). The best semantic similarity measure between the GO terms of the two proteins, under each GO concept, was determined for all pairs of proteins using the method proposed by Brown and Jurisica [33].
In brief, the similarity score between two GO terms is higher if they share a common parent with a more specific GO term (less frequent), and vice versa. The total similarity score is the sum of the best similarity scores from each concept.
Graph theory preliminaries
Definition 1 (Labeled graph) A labeled graph is a triple G = (V, E, μ), where

V is the node set

E is the edge set, E ⊆ V × V

μ:V → L_{V} is a function assigning labels to nodes
In PPI networks, nodes are labeled with protein IDs. Since each protein appears at most once in a PPI network, no two nodes share same labels. Formally: ∀ v_{i}, v_{j} ∈ V, v_{i} ≠ v_{j} → μ(v_{i}) ≠ μ(v_{j}).
Definition 2 (Undirected graph, connected graph) A graph G = (V, E, μ) is an undirected graph if and only if
∀v_{i}, v_{j} ∈ V: (v_{i}; v_{j}) ∈ E ↔ (v_{j} ; v_{i}) ∈ E. In an undirected graph G, two nodes v_{i} and v_{j} are connected if G contains a path from v_{i} to v_{j}. A graph is said to be connected if every pair of nodes in the graph are connected.
Definition 3 (Subgraph) Graph G' = (V', E', μ') is a subgraph of graph G = (V, E, μ) if V' ⊆ V and E' ⊆ (V' × V') ∩ E) and μ' = μ.
Definition 4 (Graph isomorphism) Given two labeled graphs G = (V, E, μ) and G' = (V', E', μ'). Graph isomorphism is a bijective function f: V → V' such that ∀v_{i}, v_{j} ∊ V, (v_{i}, v_{j}) ∊ E ↔ (f(v_{i}), f(v_{j})) ∊ E'.
Definition 5 (Frequent subgraph) Given a graph G = (V, E, μ), support(g) is the number of isomorphic embeddings of subgraph g. A subgraph is frequent if its support is no less than a given minimum support threshold.
Algorithms
Algorithm 1 frequentCommonDiscover(G,σ)
1: for Every G _{ i } in G do
2: C_{ i } ← Find node clusters with size no less than σ
3: end for
4: F ^{ 0 } ← Find node clusters that are present in all C _{ 0 } ~ C _{ k }
//k is number of graphs in G
5: for Every G _{ i } in G do
6: Remove nodes not present in clusters in F ^{ 0 }
7: end for
8: for Every G _{ i } in G do
9: Label edges with concatenation of sorted label of nodes at both ends
10: Label edge groups with concatenation of sorted cluster ID of nodes at both ends
11: L_{ i } ← Find edge groups with size no less than σ
12: end for
13: F ^{ 1 } ← Find edge groups that are present in all L _{ 0 } ~ L _{ k }
14: for Every G _{ i } in G do
15: Remove edges not present in groups in F ^{ 1 }
16: end for
17: t ← 2
18: while F ^{ t1 } is not empty do
19: for Every G _{ i } in G do
20: E ← Enumerate t number of edges
21: for Every E _{ j } in E do
22: if connected then
23: Assign canonical labels to subgraphs using subgraphLabel(E _{ j } )
24: Assign pattern labels to subgraphs using patternLabel(E _{ j } )
25: end if
26: end for
27: Compute embeddings of patterns using MIS()
28: Pi ← Find subgraph patterns with embeddings no less than σ
29: end for
30: F ^{ t } ← Find subgraphs patterns that are present in all P _{ 0 } ~ P _{ k }
31: for Every G _{ i } in G do
32: Remove subgraphs not present in patterns in F ^{ t }
33: end for
34: t ← t + 1
35: end while
Algorithm 2 patternLabel(E)
1: Extract node set N from E
2: Assign weights to nodes based on their cluster ID
3: Construct weighted adjacency matrix
4: Construct hyperlink matrix
5: Compute eigenvalue decomposition of hyperlink matrix
6: Sort nodes by cluster ID first
7: Within cluster, sort nodes by corresponding values in eigen vector
8: Construct binary adjacency matrix, with nodes in order
9: Concatenate node list and upper diagonal of binary adjacency matrix
10: Return the sequence of symbols
Author information
RS is a graduate student in CG's lab with training in computer science and this work is part of her dissertation research. NCWG is an Associate professor with training in biochemistry and molecular biology. CG (Associate professor) has an interdisciplinary background in molecular and computational biology. He has published a number of computational methods with a variety of applications in biomedical research, since 2001.
Declarations
Acknowledgements
This work was partly supported by NIH/NIGMS grants to CG [1R01GM086533 and 1R15GM080681]; and startup funds to CG from the University of Nebraska Medical Center.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM)  Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
Authors’ Affiliations
References
 Badger GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 210.1186/1471210542.View ArticleGoogle Scholar
 Kelley BP, Yuan B, Lewitter F: Pathblast: a tool for alignment of protein interaction networks. Nucleic Acids Res. 2004, 32: W8388. 10.1093/nar/gkh411.PubMed CentralView ArticlePubMedGoogle Scholar
 Cook DJ, Holder LB: Mining graph data. 2007, John Wiley & Sons, Inc.Google Scholar
 Borgwardt KM, Kriegel HP: Graph kernels for disease outcome prediction from proteinprotein interaction networks. Pacific Symposium on Biocomputing. 2007, 415.Google Scholar
 Bunke H, Shearer K: A graph distance metric based on the maximal common subgraph. Pattern Recognition Lett. 1998, 19: 255259. 10.1016/S01678655(97)001797.View ArticleGoogle Scholar
 Fernandez ML, Valiente G: A graph distance metric combining maximum common subgraph and minimum common supergraph. Pattern Recognition Lett. 2001, 22: 753758. 10.1016/S01678655(01)000174.View ArticleGoogle Scholar
 Sanfeliu A, Fu KS: A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics. 1983, 13: 353363.View ArticleGoogle Scholar
 Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of protein interaction networks by integer quadratic programming. Bioinformatics. 2007, 23: 16311639. 10.1093/bioinformatics/btm156.View ArticleGoogle Scholar
 Zaslavskiy M, Bach F, Vert JP: Global alignment of proteinprotein interaction networks by graph matching methods. Bioinformatics. 2009, 25: i2591267. 10.1093/bioinformatics/btp196.PubMed CentralView ArticlePubMedGoogle Scholar
 Haussler D: Convolutional kernels on discrete structures. Technical Report. 1999, Computer Science Department, UC Santa CruzGoogle Scholar
 Gartner T, Flach P, Wrobel S: On graph kernels: Hardness results and efficient alternatives. Proc Annual Conf Computational Learning Theory. 2003, 129143.Google Scholar
 Kuramochi M, Karypis G: An efficient algorithm for discovering frequent subgraphs. Technical Report. 2002, University of Minnesota, Department of Computer Science, 226.Google Scholar
 Yan X, Han J: Graphbased substructure pattern mining. Proceedings of 2002 International Conference on Data Mining. 2002, 721724.Google Scholar
 Ketkar NS, Holder LB, Cook DJ: Subdue: Compressionbased frequent pattern discovery in graph data. OSDM. 2005Google Scholar
 Lloyd SP: Least square quantization in pcm. IEEE Transactions on Information Theory. 1982, 28: 129137. 10.1109/TIT.1982.1056489.View ArticleGoogle Scholar
 Goldberg AV: Finding a maximum density subgraph. Technical Report. 1984, University of California, Berkeley, CAGoogle Scholar
 Ng AY, Jordan MI, Weiss Y: On spectral clustering: Analysis and an algorithm. Proceedings of Advances in Neural Information Processing Systems. 2001, 849856.Google Scholar
 King AD, Przulj N, Jurisica I: Protein complex prediction via costbased clustering. Bioinformatics. 2004, 20: 30133020. 10.1093/bioinformatics/bth351.View ArticlePubMedGoogle Scholar
 Jiang P, Singh M: Spici: a fast clustering algorithm for large biological networks. Bioinformatics. 2010, 26: 11051111. 10.1093/bioinformatics/btq078.PubMed CentralView ArticlePubMedGoogle Scholar
 Le TV, Kulikowski CA, Muchnik IB: Coring method for clustering a graph. Proceedings of ICPR. 2008Google Scholar
 Rhodes DR, Yu J, Shanker K: Oncomine: A cancer microarray database and integrated datamining platform. Neoplasia. 2004, 6: 16.PubMed CentralView ArticlePubMedGoogle Scholar
 Guda C, King BR, Pal LR, Guda P: A TopDown Approach to Infer and Compare DomainDomain Interactions across Eight Model Organisms. PLoS ONE. 2009, 4: e509610.1371/journal.pone.0005096.PubMed CentralView ArticlePubMedGoogle Scholar
 Kerrien S, Aranda B, Breuza L: The IntAct molecular interaction database in 2012. Nucleic Acids Res. 2012, 40: D841D846. 10.1093/nar/gkr1088.PubMed CentralView ArticlePubMedGoogle Scholar
 Chatraryamontri A, Ceol A, Palazzi LM: MINT: the Molecular INTeraction database. Nucleic Acids Res. 2006, 35: D572D574. 10.1093/nar/gkl1094.PubMed CentralView ArticlePubMedGoogle Scholar
 Prasad TSK, Goel R, Kandasamy K: Human Protein Reference Database  2009 Update. Nucleic Acids Res. 2009, 37: D767772. 10.1093/nar/gkn892.View ArticleGoogle Scholar
 Xenarios I, Rice DW, Salwinski L: DIP: the Database of Interacting Proteins. Nucleic Acids Res. 2000, 28: 289291. 10.1093/nar/28.1.289.PubMed CentralView ArticlePubMedGoogle Scholar
 Bader GD: Design and use of the Biomolecular Interaction Network Database (BIND) for storing and analyzing proteinprotein interaction data. PhD thesis. 2003, Graduate Department of Biochemistry, University of TorontoGoogle Scholar
 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403410.View ArticlePubMedGoogle Scholar
 Hartke SG, Radcliffe AJ: Mckay's canonical graph labeling algorithm. Contemporary Mathematics book series. 2009, 479: 99111.View ArticleGoogle Scholar
 Page L, Brin S, Motwani R, Winograd T: The pagerank citation ranking: Bringing order to the web. Proceedings of the 7th International World Wide Web Conference. 1998, 161172.Google Scholar
 Shannon P, Markiel A, Ozier O: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 24982504. 10.1101/gr.1239303.PubMed CentralView ArticlePubMedGoogle Scholar
 Guda P, Chittur SV, Guda C: Comparative Analysis of ProteinProtein Interactions in CancerAssociated Genes. Genomics, Proteomics, Bioinformatics. 2009, 7: 2536. 10.1016/S16720229(08)600303.PubMed CentralView ArticlePubMedGoogle Scholar
 Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21: 20762082. 10.1093/bioinformatics/bti273.View ArticlePubMedGoogle Scholar
 Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23: 12741281. 10.1093/bioinformatics/btm087.View ArticlePubMedGoogle Scholar
 Schlossmacher GA, Stevens A, White A: Glucocorticoid receptor mediated apoptosis: mechanisms of resistance in cancer cells. J Endocrinol. 2011, 211: 1725. 10.1530/JOE110135.View ArticlePubMedGoogle Scholar
 Limbourg FP, Liao JK: Nontranscriptional actions of the glucocorticoid receptor. J Mol Med. 2003, 81: 168174.PubMed CentralPubMedGoogle Scholar
 Van Beekum O, Kalkhoven E: Aberrant forms of histone acetyltransferases in human disease. Subcell Biochem. 2007, 41: 233262.PubMedGoogle Scholar
 York B, O'Malley BW: Steroid receptor coactivator (SRC) Family: Masters of Systems Biology. J Biol Chem. 2010, 285: 3874338750. 10.1074/jbc.R110.193367.PubMed CentralView ArticlePubMedGoogle Scholar
 Xu J, Wu RC, O'Malley BW: Normal and cancerrelated functions of the p160 steroid receptor coactivator (SRC) family. Nat Rev Cancer. 2009, 9: 615630. 10.1038/nrc2695.PubMed CentralView ArticlePubMedGoogle Scholar
 Iyer NG, Ozdag H, Caldas C: p300/CBP and cancer. Oncogene. 2004, 23: 42254231. 10.1038/sj.onc.1207118.View ArticlePubMedGoogle Scholar
 Mullighan CG, Zhang J, Kasper LH: CREBBP mutations in relapsed acute lymphoblastic leukaemia. Nature. 2011, 471: 235239. 10.1038/nature09727.PubMed CentralView ArticlePubMedGoogle Scholar
 Tang Y, Zhao W, Chen Y, Zhao Y, Gu W: Acetylation is indispensable for p53 activation. Cell. 2008, 133: 612626. 10.1016/j.cell.2008.03.025.PubMed CentralView ArticlePubMedGoogle Scholar
 Ferreon JC, Lee CW, Arai M, MartinezYamout MA, Dyson HJ, Wright PE: Cooperative regulation of p53 by modulation of ternary complex formation with CBP/p300 and HDM2. Proc Natl Acad Sci USA. 2009, 106: 65916596. 10.1073/pnas.0811023106.PubMed CentralView ArticlePubMedGoogle Scholar
 Shi D, Pop MS, Kulikov R, Love IM, Kung AL, Grossman SR: CBP and p300 are cytoplasmic E4 polyubiquitin ligases for p53. Proc Natl Acad Sci USA. 2009, 106: 1627516280. 10.1073/pnas.0904305106.PubMed CentralView ArticlePubMedGoogle Scholar
 Salicioni AM, Xi M, Vanderveer LA, Balsara B, Testa JR, Dunbrack RL, Godwin AK: Identification and structural analysis of human RBM8A and RBM8B: two highly conserved RNAbinding motif proteins that interact with OVCA1, a candidate tumor suppressor. Genomics. 2000, 69: 5462. 10.1006/geno.2000.6315.View ArticlePubMedGoogle Scholar
 Pawlak G, Helfman DM: Cytoskeletal changes in cell transformation and tumorigenesis. Curr Opin Genet Dev. 2001, 11: 4147. 10.1016/S0959437X(00)001544.View ArticlePubMedGoogle Scholar
 Jordan MA, Wilson L: Microtubules and actin filaments: dynamic targets for cancer chemotherapy. Curr Opin Cell Biol. 1998, 10: 123130. 10.1016/S09550674(98)800951.View ArticlePubMedGoogle Scholar
 Eilers M, Eisenman RN: Myc's broad reach. Genes Dev. 2008, 22: 27552766. 10.1101/gad.1712408.PubMed CentralView ArticlePubMedGoogle Scholar
 Patel JH, Loboda AP, Showe MK, Showe LC, McMahon SB: Analysis of genomic targets reveals complex functions of MYC. Nat Rev Cancer. 2004, 4: 562568. 10.1038/nrc1393.View ArticlePubMedGoogle Scholar
 Staller P, Peukert K, Kiermaier A, Seoane J, Lukas J, Karsunky H: Repression of p15INK4b expression by Myc through association with Miz1. Nat Cell Biol. 2001, 3: 392399. 10.1038/35070076.View ArticlePubMedGoogle Scholar
 Zeller KI, Zhao X, Lee CW, Chiu KP, Yao F, Yustein JT: Global mapping of cMyc binding sites and target gene networks in human B cells. Proc Natl Acad Sci USA. 2006, 103: 1783417839. 10.1073/pnas.0604129103.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.