Comparing biological networks via graph compression

Background Comparison of various kinds of biological data is one of the main problems in bioinformatics and systems biology. Data compression methods have been applied to comparison of large sequence data and protein structure data. Since it is still difficult to compare global structures of large biological networks, it is reasonable to try to apply data compression methods to comparison of biological networks. In existing compression methods, the uniqueness of compression results is not guaranteed because there is some ambiguity in selection of overlapping edges. Results This paper proposes novel efficient methods, CompressEdge and CompressVertices, for comparing large biological networks. In the proposed methods, an original network structure is compressed by iteratively contracting identical edges and sets of connected edges. Then, the similarity of two networks is measured by a compression ratio of the concatenated networks. The proposed methods are applied to comparison of metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis, and are compared with an existing method. These results suggest that our methods can efficiently measure the similarities between metabolic networks. Conclusions Our proposed algorithms, which compress node-labeled networks, are useful for measuring the similarity of large biological networks.


Background
Development of algorithms for comparing various kinds of biological data is one of the important topics in bioinformatics and systems biology. Methods for comparison of DNA and/or protein sequences have been extensively studied and have been applied to analyses of real sequence data quite successfully. Due to increased interests in systems biology, extensive studies have recently been done on comparison of biological networks.
For comparison of metabolic networks, Ogata et al. developed a method based on clustering [1], Tohsato et al. extended a multiple sequence alignment technique to multiple alignment of metabolic pathways using a scoring scheme based on EC (Enzyme Commission) numbers [2], Pinter et al. applied a tree matching technique to alignment of metabolic pathways [3], and Wernicke and Rasche developed a simple backtracking algorithm utilizing the local diversity property [4]. For comparison of protein-protein interaction networks, Kelley et al. developed PathBlast using dynamic programming [5], Liang et al. developed NetAlign using a clique-based method for computing maximal common subgraphs [6], Li et al. developed MNAligner using integer quadratic programming [7], Singh et al. developed IsoRank algorithm based on Google's PageRank method [8], and Zaslavskiy et al. developed a gradient ascentbased method and a message passing-based method [9].
On the other hand, data compression methods have been applied to comparison of large sequence data [10,11] and protein structure data [12,13]. Since it is still difficult to compare global structures of large biological networks and data compression-based methods can be applied to comparison of large-scale sequence data, it is reasonable to try to apply data compression methods to comparison of biological networks. In this paper, we propose such methods.
In order to apply data compression to biological networks, data compression methods for graphs are required. For compression of graphs, Adler and Mitzenmacher developed a method based of Huffman coding of vertices [14], Peshkin developed GRAPHI-TOUR based on iterative contractions of identical edges [15], and Cook and Holder developed SUBDUE based on contraction of frequent subgraphs and MDL (minimum description length) principle [16], which was further extended to EDIF for lossless compression by Yang et al. [17]. However, the method by Adler and Mitzenmacher does not seem to be useful for comparison of networks because it does not make much use of structural information. In GRAPHI-TOUR, the uniqueness of compression results is not guaranteed because there is some ambiguity in selection of overlapping edges (isomorphic graphs may be differently compressed depending on the orderings of vertices in input data), which is not suitable for comparison of network structures. This point is also unclear in EDIF and SUBDUE. Therefore, we develop in this paper novel graph compression methods for which it is guaranteed that two isomorphic graphs are compressed in the same way. Using these compression methods, we measure the similarity of two networks by means of the universal similarity metric (USM) proposed by Li et al. [11]. USM is defined using Kolmogorov complexity which represents the amount of information contained in data, and is obtained by removing redundant parts maximally. Therefore, Kolmogorov complexities are approximated by compression sizes.
We apply the proposed methods to comparison of metabolic networks, and compare the results with those of GRAPHITOUR. The results of hierarchical clustering for several organisms suggest that the proposed methods outperforms GRAPHITOUR, and can adequately measure the similarities between metabolic networks.

Graph compression method
Since our proposed methods are based on GRAPHI-TOUR, we briefly review the GRAPHITOUR algorithm [15]. GRAPHITOUR is based on iterative contractions of identical edges. In order to efficiently contract edges, GRAPHITOUR selects edges appearing most frequently, and solves an instance of maximum cardinality matching problem, which finds as many edges as possible such that no two edges share a common vertex. Figure 1 shows an example of contraction of identical edges. The graph of (A) contains 4 edges labeled with 'a' and 'b', 2 edges with 'a' and 'a', 1 edge with 'b' and 'b', and 1 edge with 'a' and 'c'. GRAPHITOUR selects edges with 'a' and 'b' because they appear most frequently, and solves the maximum cardinality matching problem for their edges. However, optimal solutions are not necessarily uniquely determined. (B) shows a contracted graph after the top-left edge with 'a' and 'b' is substituted with a new vertex labeled with 'ab'. On the other hand, (C) shows a contracted graph after the top-right edge with 'a' and 'b' is substituted. This example implies that GRAPHITOUR can generate different compressed graphs. In order to measure the similarity of networks, the same compressed graph should always be obtained. Therefore, we improve GRAPHITOUR for that purpose, and propose the following algorithm, which we call CompressEdge, to uniquely determine contracted edges in each iteration.

End
This proposed algorithm avoids contraction of edges which share a common vertex. In the example of Figure 1, our algorithm does not choose edges whose endpoints are 'a' and 'b', instead chooses the second candidate edges whose endpoints are 'a' and 'a', and obtains the graph of (D) as the result. It should be noted that the proposed algorithm does not solve the maximum cardinality matching problem because it selects only edges such that all edges with the same labels do not share a common vertex.
However, it is not sufficient to uniquely determine contracted edges because there can be more than one set which has the same number of edges, that is, |ε most | > 1. Therefore, we introduce a total order to sets of labels to determine the priority of edges. Each edge has a pair of labels (l 1 , l 2 ) corresponding to two endpoints of the edge. Let s 1 and s 2 be sets of labels. We can define a total order for s 1 and s 2 as follows.
First, we sort s 1 and s 2 by descending order, respectively. We compare i-th elements s hold for some i. The proposed algorithm selects edges with the smallest set of labels from ε most according to the total order. For example, if we compare s 1 = {l 1 , l 3 } with s 2 = {l 3 , l 2 } under l 1 <l 2 <l 3 , s 1 and s 2 are sorted as (l 3 , l 1 ) and (l 3 , l 2 ), respectively, and we have s 1 <s 2 .
When edges with (l 1 , l 2 ) are contracted, a new label l n is added to L, where l n > l for all l (≠ l n ) L. In computational experiments, Morgan index [18] based on graph structures is assigned to each vertex. However, new added labels themselves do not reflect the original graph structure. Therefore, in order to make effective use of the total order of original labels, we introduce a set of labels for each label l, s (l), which consists of only original labels. Then, s (l n ) is defined to be s (l 1 ) ∪ s (l 2 ) when (l 1 , l 2 ) is substituted with l n . The algorithm compares s (l 1 ) ∪ s(l 2 ) with s (l 3 ) ∪ s (l 4 ) before comparing edges of (l 1 , l 2 ) and (l 3 , l 4 ). For example, for the graph of Figure 1D, the algorithm selects edges with ('aa', 'b') as contracted edges because it appears most frequently.

Extension to contraction of multiple edges
In the previous algorithm, identical edges are contracted at each iteration. In this section, we propose another algorithm, which we call CompressVertices, to contract identical sets of multiple connected edges. In order to uniquely determine the contracted sets of connected edges, we must introduce a total order to a set of edges. For that purpose, we apply degree sequence [19], which is defined to be the non-increasing sequence of degrees of vertices. For example, the degree sequence of Figure  1A is (3, 3, 3, 2, 2, 2, 1). Moreover, in order to distinguish labels of vertices, we introduce the non-increasing sequence of pairs of the degree and the label for each vertex included in the set of edges, which we call dlsequence. In dl-sequence, the degree is not calculated for the original graph, but is for the set of edges, and we define the inequality of elements of dl-sequence by; (d 1 , l 1 ) > (d 2 , l 2 ) if d 1 > d 2 or (d 1 = d 2 and l 1 > l 2 ). Then, we can define a total order for dl-sequences that is a-a-b, is ((2, 'a'), (1, 'b'), (1, 'a')). Here, in the example, if the edges are substituted with a new vertex, a self loop is remained to the new vertex due to the edge of (' a', 'b') as shown in Figure 1E. The remained edges should be also contracted. Therefore, we modify CompressEdge, and propose the following algorithm to contract identical sets of vertices and the edges between the vertices instead of individual edges. In the example of Figure 1E, the graph of Figure 1F is or (s i = s j and dl i <dl j ) for all V(dl j ) V most ; return (V(dl i ),dl i );

End
First, the algorithm tries to find two vertices and the edge between the vertices to be contracted. If it does not find such two vertices, it tries to find more than two vertices and the edges between the vertices until it finds or up to the given number of vertices, M. Figure 2 shows an example of contraction by the algorithm. The graph of (A) contains 4 edges labeled with 'a' and 'b', 2 edges with 'a' and 'c', and 2 edges with 'b' and 'c'. However, it cannot select any edge because they are overlapping each other. Therefore, it tries to find three vertices to be contracted. (B) shows the candidate sets of vertices {' a', 'b', 'c '} and three edges ('a', 'b'), ('b', 'c'), and ('c', 'a'), which appear most frequently two times in the graph. However, they are overlapping again. Finally, it selects the candidate sets of (C), vertices {'a', 'b', 'b'} and two edges of ('a', 'b'), which appear also most frequently. (D) shows the graph contracted from the graph (A) by substituting the selected vertices and edges with a new vertex labeled with 'abb'.
CompressVertices (2) of M = 2 is equivalent to Com-pressEdge. It should be noted that the algorithm uniquely determines the contracted sets of connected edges in each iteration although different subgraphs can be substituted with vertices of the same label in M ≥ 4. For example, both graphs of a-b-a-b and a-a-b-b are represented as ((2, 'b'), (2, 'a'), (1, 'b'), (1, 'a')) in dl-sequence. It is to be noted that for three vertices, different subgraphs cannot have the same dl-sequence. The algorithm is still efficient because it compares dl-sequences instead of comparing subgraphs, where subgraph isomorphism problem is known to be in NP-complete.

Similarity measure
The universal similarity metric (USM) was proposed by Li et al. [11], and has been applied to several biological data [12,13]. USM between two objects o 1 and o 2 is defined using Kolmogorov complexity K(o) as follows: It should be noted that K(o) is considered as a measure of the amount of information that the object o contains.
Since it is not possible to obtain these Kolmogorov complexities for real data, we approximate K(G) of a graph G by C(G) = |R| + |E C |, where |R| means the number of rules extracted from G by our method, and | E C | means the number of remaining edges after the compression of G. The conditional Kolmogorov complexity K(G 1 |G 2 ) of G 1 given G 2 can be approximated to be C(G 1 ∪ G 2 ) − C(G 2 ) as in [12,13], where G 1 ∪ G 2 means the concatenated graph G′(V′, E′) of G 1 (V 1 , E 1 ) and G 2 (V 2 , E 2 ) such that V′ = V 1 ∪ V 2 , E′ = E 1 ∪ E 2 , |V′| = |V 1 | + |V 2 | and |E′| = |E 1 | + |E 2 |. Even if there are identical vertices (i.e. vertices with identical labels) between G 1 and G 2 , they are added to V′ as different vertices.
It should be noted that GU SM(G, G) = 0 if |E c | =0 for C(G). If G 1 and G 2 are similar, then G 1 and G 2 are generated from almost the same set of rules R, that is, C (G 1 ∪ G 2 ) ≈ C(G 1 ) ≈ C(G 2 ) ≈ |R|, and GU SM(G 1 ,G 2 ) approaches to 0.

Results and discussion
To evaluate the proposed measure, we used metabolic pathways for several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis, from the KEGG database [20] (see Table 1). We used chemical compounds as nodes and chemical reactions as edges. For evaluation purposes, it is not appropriate to use protein-protein interaction networks because the accuracy of the networks is not sufficient [21]. Furthermore, we compared the results with those of GRAPHITOUR because other methods such as PathBLAST [5], NetAlign [6], and MNAligner [7] do not give the similarity of networks, and focus on finding interesting subnetworks or most similar subnetworks to a query network.
In our first computational experiment, all nodes in the metabolic networks were labeled with chemical compounds, and there was only one edge having the same labels, that is, |ε(l 1 , l 2 )| = 1. Then, our compression algorithm for G(V, E) produced rules R and the remaining graph G c (V c , E c ) as C(G) = |R| + |E c | = |E|. This means that G is not compressed. Many other methods also cannot select frequent subnetworks for such networks.
Since we would like to compare network structures for the metabolic networks, we replaced labels with Morgan index [18].  We fixed the number of iterations of the Morgan index procedure, applied our compression algorithms to individual and concatenated metabolic networks, G 1 , G 2 , G 1 ∪ G 2 , and calculated GU SM (G 1 , G 2 ) from C(G 1 ), C (G 2 ) and C(G 1 ∪ G 2 ). To confirm that our compression algorithms work for measuring the similarity of metabolic networks, we obtained hierarchical clustering results using the nearest neighbor (single linkage) method, and compared them with actual phylogenetic trees and hierarchical clustering results by GRAPHI-TOUR. Moreover, we performed such experiments with several numbers of iterations from 1 to 20 because the number of iterations of the original Morgan index is at most 11 for the metabolic networks. Figure 4 shows the results of hierarchical clustering using CompressEdge for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis with Morgan indices of 1, 2, 3, 6, 11, and 12 iterations. The numbers of contracted edges for the metabolic network of H. sapiens with Morgan indices of 1, 2, 3, 6, 11, and 12 iterations were 251, 1367, 1387, 1395, 1395, and 1395, respectively. The results of more than 5 iterations were almost similar to those of 12 iterations. Figure 5 shows the results on the number of different values of Morgan indices for the metabolic networks for 1-20 iterations of the Morgan index procedure. We can see from this figure that the number of different values of Morgan indices is almost constant for more than 11 iterations. For a small number of iterations, it is considered that metabolic networks were not compressed well because many edges have the same labels and share common nodes. This means that the number of iterations is required to be large for measuring the similarity more accurately. However, for that purpose, much larger numbers than the number of iterations of the original Morgan indices, that is at most 11, are not needed because the number of different values of Morgan indices is almost constant for more than 11 iterations (see Figure 5). According to the results of hierarchical clustering in Figure 4, H. sapiens was always the nearest to M. musculus. Bacterial organisms of B. subtilis and E. coli were furthest from H. sapiens in the result of 12 iterations. It is considered that the result of 12 iterations is almost consistent to actual phylogenetic trees. This suggests that the proposed method can adequately measure the similarities between metabolic networks. Figure 6 shows the results of hierarchical clustering using CompressVertices(3) for metabolic networks. The result of 3 iterations was already similar to the results of more than 5 iterations. It is considered that there are more overlapping edges in networks for a smaller number of iterations, and our proposed method contracted their edges well for a small number of iterations. Figure 7 shows the results of hierarchical clustering using GRAPHITOUR for metabolic networks. In the result of 10 iterations, E. coli was farthest from H. sapiens. The result of 12 iterations was not well clustered because D. melanogaster is likely to be closer to H. sapiens than to C. elegans [22]. These suggest that our proposed method can measure the similarities better than GRAPHITOUR.
Furthermore, the proposed methods are efficient. The computational time was at most 9 seconds in Compres-sEdge, 17 seconds in CompressVertices (3)    We also compared our proposed methods with simple methods based on node and edge numbers. We calculated distances between two organisms as the differences of the number of nodes or edges, and obtained hierarchical clustering results using the nearest neighbor method. We can see from the results (Figures S1 and S2 in the supplementary information web page) that H. sapiens was the nearest to M. musculus, but other relations were different from actual phylogenetic trees. This result shows that our proposed methods are better than those based on node or edge numbers. In addition, our methods generate rules to contract edges iteratively. It means that they try to find hierarchical substructures for a given graph. We defined the similarity between graphs, GUSM, based on the number of such rules. It can be considered that GUSM implies whether hierarchical substructures are also similar, or not.
We proposed two methods, CompressEdge and Com-pressVertices(M), where M denotes the maximum number of vertices to be contracted at a time. It should be noted that the minimum number is 2 because a vertex itself cannot be contracted. CompressEdge is equivalent to CompressVertices(2). If a graph has many overlapping edges, CompressEdge (and CompressVertices (2)) would not extract many rules, that is, many edges are not contracted and are remained. Consider a graph G that all edges of G are overlapping. The similarity between G and itself, GU SM (G, G), must be 0. However, GU SM (G, G) = 1 because any edge of G is not contracted, C (G) = |E c | and C (G ∪ G) = 2|E c |. Therefore, CompressVertices(M) for M > 2 is needed.

Conclusions
In this paper, we have proposed novel methods for compressing biological networks. One of the important properties of the proposed methods is that isomorphic networks are compressed in the same way. Unlike many other methods comparing biological networks, our methods are able to give the similarity metric of their networks. Moreover, our methods are very efficient because they do not compare subgraphs directly. We have applied the proposed compression methods to comparison of metabolic networks, and compared the results with those of an existing method. The results suggest that the proposed compression methods are useful for comparison of biological networks.
Our proposed methods were applied to only undirected graphs in this paper. However, it is possible to extend our algorithms to deal with directed graphs. It is easy to extend CompressEdge, and the modified method can still be efficient. Our methods are efficient as well as GRA-PHITOUR, and it is important to keep the efficiency after extending them. Metabolic networks can also be represented as directed graphs and undirected graphs using chemical compounds as nodes and the relation between compounds (e.g., involve the same reaction) as edges. It is an interesting issue to examine whether our methods are robust for any representation of a network.
It is considered in general that random networks are more difficult to be compressed than scale-free networks. However, our methods cannot compress metabolic networks that are known as scale-free networks because all nodes in metabolic networks are labeled with distinct chemical compounds, and there is only one edge having the same labels. Our proposed methods are useful only for comparison of networks. If the same labels can appear multiple times, it is expected that our methods can also differentiate these networks. However, it is difficult to compare them in a simple way because the compression size depends on the distribution of node labels. Since we do not have realistic models for generation of random and scale-free networks with node labels, application of our proposed methods to differentiation of random networks from scale-free networks is left as future work.
Though we have applied the methods to comparison of networks, the application is not limited to comparison. It might be applied to detection of network motifs with hierarchical structures because our method iteratively compresses edges (edges can be replaced by small subgraphs).
One drawback of our proposed compression methods is that it is not a lossless compression method (i.e., the original network cannot be reconstructed from compressed data). Therefore, improvement of the method towards lossless compression is also important future work.