Protein-protein interaction (PPI) networks carry vital information on the molecular functions and biological processes of cells. Analysis of PPI networks associated with specific disease systems including cancer helps us to better understand the complex biology of diseases. PPI networks are dynamically modulated in a tissue-specific microenvironment; hence, a set of similarly expressed genes from two types of cancer tumors may exhibit different PPI patterns. A lot of gene expression data has been accumulated on cancer-specific tumors warranting the need for developing effective algorithms to translate the differentially expressed gene lists into functionally coherent modules that are common to all cancers or shared in a given subset of cancers. To achieve this, genes are mapped to corresponding proteins and known PPIs are represented as a network graph for further analysis. Using graph theory-based algorithms, pairs of networks can be compared to identify common, distinct or frequent sub-networks. These sub-networks containing a set of proteins (nodes) with a distinct set of connections (edges) can represent a functional unit in a pathway or in a biological process. Similarly, frequent sub-networks (network motifs) may represent recurring functional units within a network or among multiple networks. In this study, we focus on developing a graph-based algorithm to identify common and frequent network motifs from PPI networks of nine different cancers.

Graphs have been widely used to model a variety of data types such as PPI networks [1], biological pathways [2] and molecular structure of chemical compounds [3]. Graph comparison has a wide range of applications in biological data analysis. For example, by aligning biological pathways represented by graphs, evolutionarily conserved patterns are identified [2]. Similarly, by measuring the discrepancies between PPI networks of healthy and sickened individuals, interactions that are involved in disease outbreak and progression are determined [4].

Existing methods for graph comparison can be categorized into the following three major types: distance-based, alignment-based and kernel-based methods. In a distance-based method, similarity of graphs is measured based on the graphs' common structures [5, 6]. The larger a maximum common subgraph (MCS) is, the more similar are the two graphs; and thus the smaller the MCS distance between the graphs is. The MCS distance between the graphs is defined to be 1-|*V*
_{
mcs
}|/{|*V*
_{1}|, |*V*
_{2}|} where |V| is the number of nodes in graph G = (V, E) [5]. The MCS distance method only considers the maximum common subgraph when comparing graph similarity. It will only identify graphs that globally resemble each other and ignore graphs that share many similar but disconnected subgraphs. Another distance-based method [7] measures the similarity of graphs based on their edit distance. With substitutions, deletions and insertions for both nodes and edges, any graph can be transformed into another graph by iteratively applying these operations. Intuitively the more operations needed, the more dissimilar the two graphs are. With a cost function associated with each operation, the graph edit distance is defined to be the minimum total cost to transform one graph to the other. However, similar to the MCS method, the edit distance methods also measure only the global similarity of the graphs.

The alignment-based methods utilize the idea of graph alignment that is conceptually similar to sequence alignment. In sequence alignment, different scores or penalties are assigned for matches, mismatches and gaps, and the alignment algorithm looks for the best way to arrange the sequences so that the overall alignment score is maximized. In graph alignment, the similarities of graphs are determined by the conservation of interactions, which is measured through the edges and similarity of nodes [8, 9]. Depending on the requirement, the node-based or edge-based weights are used in calculating the alignment score [8]. Graph alignment algorithms such as PathBLAST [2] use the dynamic programming approach to find optimum solutions. Graph alignment algorithms can detect global or local similarity depending on the scoring function used by the algorithm. However these algorithms either end up with exponential running time or turn to heuristic methods for solutions when dealing with graphs that contain cycles.

The third approach, using kernel-based methods measures graph similarities through kernel functions. Existing graph kernels can be viewed as a special case of R-convolution kernels proposed by Haussler [10]. The basic idea of a graph kernel is to decompose a graph into smaller substructures, and build the kernel based on similarities between the decomposed substructures. The natural and most general R-convolution on graphs would decompose graphs to all of their subgraphs and compare each pair of the subgraphs. However, it is proven in that computing all-subgraph kernel is as hard as deciding subgraph isomorphism which is NP-hard [11]. Alternative graph kernels include product graph kernel, marginalized kernel, subtree-pattern kernel, and so on. These kernels differ in the way they decompose graphs to substructures and the similarity measure they use to compare the substructures. Similar to distance-based methods, kernel methods can only be used to measure global similarity of graphs. There is no information about which subgraphs contribute to the similarities.

One of the most important tasks in the analysis of PPI networks is to predict functional modules that represent either stable protein complexes or groups of transiently interacting proteins that together can accomplish a biological function. These functional modules can be mapped to specific subgraphs in PPI networks. Below, we discuss three methods that have been used to extract substructures from graphs: (i) frequent subgraph identification, (ii) graph segmentation and (iii) core-based clustering. *Apriori*-based approach and pattern growth approach are the two major types of algorithms for identifying frequent subgraphs. The discovery of frequent subgraphs usually consists of two steps that include candidate generation and frequency counting. *Apriori*-based algorithms such as FSG [12] generate candidates of larger size by joining two smaller subgraphs. In order for two frequent k-subgraphs to be eligible for joining, they must contain the same (k-1)-subgraph. This introduces a lot of overhead, as there are multiple ways to join two subgraphs of size k. The frequency verification step involves subgraph isomorphism test and therefore is not feasible for large graphs. On the other hand, the pattern growth approach [13] extends patterns from a single pattern directly, instead of joining two smaller subgraphs. Pattern growth approach needs to deal with the redundancy problem: the same (k+1)-subgraph can be generated from extending many different k-subgraphs. Both *apriori*-based approach and pattern growth approach are restricted by the graph size due to the subgraph isomorphism problem. Heuristic methods such as Subdue [14] look for incomplete result set. Subdue is an approximate algorithm and finds patterns that can best compress the original graph by substituting those patterns with a single vertex. Minimum description length (MDL) is used to evaluate how efficient the graph can be compressed.

Graph segmentation method extracts substructures by partitioning graphs into disjoint dense subgraphs. K-means clustering [15] aims to partition graphs to clusters that minimize the within-cluster sum of squares. Min-cut [16] and a more recent spectral clustering algorithm [17] consider not only the within-cluster density but also inter-cluster distance. King et al. [18] used a cost-based local search algorithm to find highly interconnected subsets of nodes.

In contrast to the graph segmentation method, where the central nodes of the subgraphs are usually randomly chosen, in core-based clustering the central nodes are selected before clustering is performed [19, 20]. The central nodes are also referred to as seeds or core of substructures. MCODE method [1] selects the central nodes based on the highest k-core of the nodes neighborhood. A k-core is a graph of minimal degree k. All nodes are weighted based on their local network density using the highest k-core of the nodes neighborhood. SPICi method proposed by Jiang and Singh [19] chose the nodes that have highest weighted degree as core nodes. After selecting the central nodes, clusters are expanded to maximize the local density of the substructures. The expansion stops when local density stops increasing or when all nodes are exhausted.

Due to the NP-hardness of many graph problems, most of the previous methods offer approximate solutions to measure graph similarity. In this paper we present a method that produces the exact solutions in graph comparison and pattern identification. Our algorithm works in a bottom up fashion. It starts from one-node subgraph, and proceeds to one-edge and multiple-edge subgraph. At each loop the search space is reduced by eliminating parts of networks that are not eligible for next round of comparison. Even though the run-time increases exponentially as the size of subgraph increases, in our case the size of the search space, as the base of the exponential, reduces quickly. Therefore we can obtain the complete result in a reasonable amount of time. As we look for common substructures across the networks, we also perform graph isomorphism test. Graph isomorphism problem is known to be in NP; however, it's unknown to be in P or NP-complete if P ≠ NP. In our specific context of network comparison, we solve this in polynomial time with our pattern-labeling algorithm.

We applied our algorithm on nine cancer associated PPI networks to identify common and frequent patterns in these networks. We collected differentially expressed genes from microarray studies of various solid tumor tissues derived from the Oncomine database [21]. Using the algorithm we identified common frequent subgraphs of up to 10 edges in these networks. These subgraphs may correspond to functional modules that play common roles in cancer diseases as they occur multiple times in all the nine cancer networks.