- Research
- Open Access

# Predicting protein complex in protein interaction network - a supervised learning based method

*BMC Systems Biology*
**volume 8**, Article number: S4 (2014)

## Abstract

### Background

Protein complexes are important for understanding principles of cellular organization and function. High-throughput experimental techniques have produced a large amount of protein interactions, making it possible to predict protein complexes from protein-protein interaction networks. However, most current methods are based on unsupervised learning and cannot exploit the large number of known complexes that are available.

### Methods

We present a supervised learning-based method for predicting protein complexes in protein-protein interaction networks. The method extracts rich features from both the unweighted and weighted networks to train a Regression model, which is then used for clique filtering, growth, and candidate complex filtering. The model utilizes additional "uncertainty" samples and is therefore more discriminative when used in the complex detection algorithm. In addition, our method uses the maximal cliques found by the Cliques algorithm as the initial cliques, which has been shown to be more effective than expanding from seed proteins as other methods do.

### Results

The experimental results on several PIN datasets show that, in most cases, the performance of our method is superior to that of comparable state-of-the-art protein complex detection techniques.

### Conclusions

The results demonstrate several advantages of our method over other state-of-the-art techniques. Firstly, our method is a supervised learning-based method that can make full use of the information in the available known complexes instead of relying only on the topological structure of the PIN. That also means that, if more training samples are provided, our method can achieve better performance than the unsupervised methods. Secondly, we design a rich feature set to describe the properties of the known complexes, which includes not only features from the unweighted network but also features from the weighted network built from Gene Ontology information. Thirdly, our Regression model utilizes additional "uncertainty" samples and therefore becomes more discriminative; its effectiveness for complex detection is indicated by our experimental results.

## Background

Most proteins form complexes to accomplish their biological functions [1, 2]. Protein complexes are important for understanding principles of cellular organization and function. While there are a number of ways to detect protein complexes experimentally, Tandem Affinity Purification (TAP) with mass spectrometry [3] is the preferred experimental detection method used by many research groups. However, there are several limitations to this method [4]. For example, its multiple washing and purification steps tend to eliminate transient low-affinity protein complexes. Also, the tag proteins used in the experiments may interfere with protein complex formation. Gavin et al. [1] have shown that TAP-MS captures only a limited number of the known yeast protein complex subunits. Furthermore, in TAP-MS the subcellular location of complexes is lost due to the *in vitro* purification of whole-cell lysates [5]. This means that time-consuming preparation of subcellular fractionated lysates may be needed for a less-studied cellular process in order to employ subcellular localization information to validate the experimental results and detect false negatives or false positives. Given these experimental limitations, computational approaches are useful complements to the experimental methods for detecting protein complexes [6].

In the post-genome era, high-throughput experimental techniques have produced a large amount of protein interactions, making it possible to predict protein complexes from the protein interaction networks (PIN). Automatic complex identification approaches are increasingly proposed to extract the set of proteins from the PIN as complexes.

The PIN can be described as a graph whose nodes correspond to the proteins and whose edges correspond to the interactions; complex detection thus amounts to finding subgraphs of the PIN. Since the proteins in the same complex interact heavily with each other, protein complexes generally correspond to dense subgraphs in the PIN [7]. The proposed complex detection methods can be roughly divided into four categories. (1) Agglomerative models, in which every single node or some subgraph forms a cluster at the beginning and clusters are allowed to merge and grow under certain constraints. For example, the MCODE method is based on growing seeds selected by weight [8]. Similarly, the DPClus method expands clusters starting from seed vertices [9]. (2) Clique finding methods. The CFinder system finds functional modules in the PIN by detecting k-clique percolation clusters using the Clique Percolation Method [10]. CMC is also a clique-based method, which uses a protein-protein interaction iteration method to update the network [11]. The ClusterONE method starts from a single seed vertex and then applies a greedy growth procedure that adds or removes vertices in order to find groups with high cohesiveness [12]. (3) Traditional graph clustering methods, based on the premise that, since the PIN can be described as a graph, general graph clustering algorithms can be applied to detect dense clusters as protein complexes. The Markov clustering method (MCL) simulates random walks within graphs and thus partitions the PPI network into many non-overlapping dense clusters [13]. (4) Complex detection methods based on the core-attachment architecture proposed by Gavin et al., who demonstrated that protein complexes have a core-attachment architecture [2]. An example is COACH, which first selects some subgraphs as core structures and then adds attachments to each core to construct a complex [14].

However, most of the above methods are unsupervised learning methods, which predict protein complexes based on pre-defined rules. Although these unsupervised methods have the advantage of requiring neither annotation nor a training process, they cannot make full use of the information in the available known complexes. Numerous true complexes have been catalogued, and these can serve as prior knowledge for supervised learning methods. Qi et al. were the first to apply supervised learning to complex detection. Using graph topological patterns and biological properties as features, they trained a probabilistic Bayesian network (BN) model to score subgraphs in the protein interaction graph and identify new complexes [15]. Shi et al. proposed a semi-supervised prediction model based on a neural network, and their results show that integrating biological features and topological features to represent protein complexes is more meaningful than using dense subgraphs [16]. Chen et al. analyzed the graph properties and biological properties of protein complexes and constructed a prediction model using the filtered features [17]. However, their method only determines whether a candidate protein complex is a true complex and does not deal with the construction of candidate protein complexes from the PIN. Qiu et al. developed multiple kernels from heterogeneous data sources and combined them in an SVM classifier to predict co-complexed protein pairs [18]. Like that of Chen et al., their method does not deal with the construction of candidate protein complexes from the PIN, although the co-complexed protein pairs it predicts can extend known MIPS complexes and identify maximal cliques as candidate protein complexes.

In this paper, we present a supervised learning-based method to discover complexes in the PIN by learning from true complexes. Compared with other supervised learning-based methods (e.g. Qi et al. [15] and Shi et al. [16]), our method introduces new features from the weighted network: the density and the mean and maximum degrees of the weighted network, which prove to be quite effective in improving performance. In addition, our method is the first to use a three-category training set. Since the additional samples and categories provide more information for training the regression model, the learned model becomes more discriminative. Finally, our method uses the maximal cliques found by the Cliques algorithm [19] as the initial cliques, which has been shown to be more effective than expanding from seed proteins as other methods do.

## Methods

### Complex detection algorithm

The aim of complex detection is to discover subgraphs representing the predicted protein complexes from the PIN. We propose a supervised learning-based method comprising four steps, as shown in Table 1. The inputs are an unweighted network, a weighted network and a training set. The unweighted and weighted networks are constructed from the DIP database (the Database of Interacting Proteins [20]), which contains 4928 proteins and 17201 interactions. Interactions with GO similarities less than 0.9 are regarded as false positives and deleted from the PIN, as discussed in the section *Weighted network construction*. The size of the training set is of great importance for a supervised learning-based method. However, it is currently difficult to obtain a sufficient number of positive training samples in the complex detection field. Thus, to obtain more training samples, we used 422 complexes that are predicted by the COACH method but do not match the true complexes in the benchmark. Since COACH is a state-of-the-art complex detection method, a predicted complex that does not match a known true complex could still be a true complex. We assigned these samples an "uncertainty" status, denoting that their potential of being true complexes is higher than that of the negative samples and lower than that of the positive ones. Consequently, we constructed three training sample categories: 668 true complexes from available PPI databases as the positive samples, 422 complexes predicted by the COACH method as the intermediate samples, and 2004 subgraphs obtained by randomly selecting nodes as the negative samples. The additional samples and categories provide more information, making the learned model more discriminative.

In the first step, feature vectors are generated for the complexes in the training set from the unweighted and weighted PIN, based on the features discussed in the section *Complex feature selection*. It should be noted that all the features are extracted from the true protein complexes as they appear in the PIN (i.e. the true protein complexes are treated as (unweighted or weighted) subgraphs of the whole (unweighted or weighted) protein interaction network). The Regression model is subsequently trained by solving the optimization problem with gradient descent.

In the second step, the Cliques algorithm is used to find maximal cliques in the PIN [19]. Although enumerating all maximal cliques is NP-hard, this does not pose a problem in PPI networks because PPI networks are usually sparse [11]. The Cliques algorithm uses a depth-first search strategy to enumerate all maximal cliques, and it effectively prunes non-maximal cliques during the enumeration. In our experiments, we explored the effect of two different minimal clique sizes on performance: sizes greater than or equal to 3 and 4 (denoted as clique_size ≥ 3 and clique_size ≥ 4 respectively). Furthermore, because of the high density of the PIN, the cliques may overlap heavily in their nodes. For example, two cliques with four nodes may have three nodes in common. Therefore, the cliques are filtered as follows. The set of cliques is ranked in descending order of the scores given by the Regression model, denoted as {C_{1}, C_{2}, ..., C_{k}}. For each clique C_{i} (i = 1, 2, ..., k) and each clique C_{j} with a lower score (j = i + 1, ..., k), it is checked whether the number of nodes shared by C_{i} and C_{j} is larger than or equal to a threshold (set to 2 and 3 for clique_size ≥ 3 and clique_size ≥ 4 respectively). If so, the clique with the lower score is removed.
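The second step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enumeration below is a generic Bron-Kerbosch search with pivoting, used as a stand-in for the Cliques algorithm of [19], and the filter greedily keeps a clique only if it shares fewer than the threshold number of nodes with every higher-scoring clique already kept, which is one reading of the filtering rule. The names `maximal_cliques`, `filter_cliques` and `score` are hypothetical; `score` stands for the trained Regression model.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def maximal_cliques(graph: Graph, min_size: int = 3) -> List[FrozenSet[str]]:
    """Enumerate maximal cliques depth-first (Bron-Kerbosch with pivoting).

    Stand-in for the Cliques algorithm; not the implementation of [19].
    """
    found: List[FrozenSet[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            # r cannot be extended, so it is maximal.
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        # Pivoting prunes branches that cannot yield a new maximal clique.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(graph), set())
    return found


def filter_cliques(
    cliques: List[FrozenSet[str]],
    score: Callable[[FrozenSet[str]], float],
    overlap_threshold: int,
) -> List[FrozenSet[str]]:
    """Drop a clique when it shares >= overlap_threshold nodes with a
    higher-scoring clique that has already been kept."""
    kept: List[FrozenSet[str]] = []
    for clique in sorted(cliques, key=score, reverse=True):
        if all(len(clique & other) < overlap_threshold for other in kept):
            kept.append(clique)
    return kept
```

With clique_size ≥ 4 the text sets the threshold to 3, so two four-node cliques sharing three nodes collapse to the higher-scoring one.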

In the third step, a growing operation is performed on each clique obtained in the previous step. For a clique C_{i}, the set of its neighbors is denoted as N(C_{i}). For each node v_{i} in N(C_{i}), it is checked whether adding it to C_{i} gives the new subgraph C_{i} ∪ {v_{i}} a higher score under the Regression model. The operation is repeated until no node addition leads to a higher score. After the growing operation, the grown cliques constitute the set of candidate complexes.
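A minimal sketch of the growing operation follows. It assumes that, in each round, the neighbor giving the largest score increase is added; the text does not say whether the first improving neighbor or the best one is taken, so this choice is an assumption. `grow` and `score` are hypothetical names, with `score` again standing for the Regression model.

```python
from typing import Callable, Dict, Iterable, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def grow(
    clique: Iterable[str],
    graph: Graph,
    score: Callable[[Set[str]], float],
) -> Set[str]:
    """Greedily add neighbouring nodes while the model score increases."""
    current = set(clique)
    best = score(current)
    while True:
        # N(C): nodes adjacent to the current subgraph but not inside it.
        neighbours = set().union(*(graph[v] for v in current)) - current
        candidates = [(score(current | {v}), v) for v in neighbours]
        if not candidates:
            break
        top_score, top_node = max(candidates)
        if top_score <= best:
            break  # no addition improves the score
        current.add(top_node)
        best = top_score
    return current
```

In a toy run, any function of a node set can play the role of `score`, for example the number of internal edges divided by the number of nodes.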

The candidate complexes may still overlap heavily, since they may have neighbor nodes in common. Therefore, in the fourth step, a filtering operation similar to that of the second step is performed. For two candidate complexes, C_{1} = {p_{1}, p_{2}, ..., p_{m}} and C_{2} = {q_{1}, q_{2}, ..., q_{n}}, their overlapping rate is calculated as follows:

The merging threshold (denoted as merg_thred) is set to a value between 0 and 1. The merging operation is performed as follows. First, the candidate complexes are ranked in descending order of the scores given by the Regression model. Then, for each candidate complex C_{i}, its overlapping rate with every candidate C_{j} that has a lower score is calculated. If the overlapping rate is higher than merg_thred, C_{i} and C_{j} are merged provided the score of their union is higher than that of C_{i} itself; otherwise, the complex C_{j} is removed.
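The merging step might be sketched as below. The overlapping rate is passed in as a function because its exact definition is not reproduced here; the `jaccard` default (shared nodes divided by the size of the union) is purely a placeholder assumption, not the formula used in our method. `merge_candidates` and `score` are hypothetical names.

```python
from typing import Callable, Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Placeholder overlapping rate (an assumption): |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def merge_candidates(
    candidates: Iterable[Iterable[str]],
    score: Callable[[Set[str]], float],
    merg_thred: float,
    overlap: Callable[[Set[str], Set[str]], float] = jaccard,
) -> List[Set[str]]:
    """Merge or discard lower-scoring candidates that overlap a better one."""
    remaining = sorted((set(c) for c in candidates), key=score, reverse=True)
    result: List[Set[str]] = []
    while remaining:
        current = remaining.pop(0)  # highest-scoring candidate left
        survivors = []
        for other in remaining:
            if overlap(current, other) > merg_thred:
                union = current | other
                if score(union) > score(current):
                    current = union  # merge C_j into C_i
                # Merged or not, the lower-scoring candidate is consumed.
            else:
                survivors.append(other)
        result.append(current)
        remaining = survivors
    return result
```

A larger merg_thred lets fewer pairs pass the overlap test, so fewer merges or removals occur and more candidates survive.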

### Weighted network construction

The PIN can be modeled as a simple graph G = (V, E), in which a node element in node set V represents a protein and an edge element in edge set E represents an interaction between two distinct proteins. In our method, a weighted graph is introduced to represent PIN as G = (V, w(E)), where w(E) represents the weighted interaction. In this way, we extract the complex features based on two different networks--an unweighted and a weighted network.

Protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates due to the limitations of the associated experimental techniques and the dynamic nature of protein interaction maps. Therefore, the complex features extracted from the unweighted network are insufficient for describing a complex. Gene Ontology (GO) provides a collection of well-defined biological terms--known as GO terms--spanning biological processes, molecular functions and cellular components. Here, based on the method presented in [21], we use GO annotation from SGD [22] to estimate the similarity between proteins, and then use it as the edge weight of the network.

In our method, the semantic similarity between two proteins is calculated based on the annotation size of the GO term (which is defined as the number of annotated proteins on the GO term) on which both proteins are annotated. According to the transitivity property of GO annotation, if a protein *p* is annotated on a GO term *gi*, it is also annotated on the GO terms on the path from *gi* to the root GO term in the GO structure. Thus, the proportion of the annotation size of a GO term to the total number of annotated proteins can quantify the specificity of the GO term. If two proteins are annotated on a more specific GO term and have more common GO terms, then they are functionally more similar. We define the semantic similarity *sim*(*p, q*) between two proteins *p* and *q* as follows:

*C*(*p,q*) denotes the set of the GO terms whose annotation includes *p* and *q*. If both *p* and *q* are annotated on *n* different GO terms, *Si*(*p, q*) (1≤*i*≤*n*) denotes the set of proteins annotated on the GO term *gi* whose annotation includes *p* and *q*. *Smax* is the maximum annotation size among all GO terms in the directed acyclic graph (DAG) structure. The proportion of the annotation size of a GO term (*Si*(*p,q*)) to the total number of annotated proteins (*Smax*) can quantify the specificity of the GO term. If *p* and *q* are annotated on a more specific GO term and on more common GO terms than *p* and *l* (another protein), then *p* is semantically more similar to *q* than to *l*. In addition, the graph topology is also introduced into the weight calculation. For an input graph *G* = (*V, E*), we assign the topological weight of an edge [*u, v*] to be the number of neighbors shared by the vertices *u* and *v* (which represent proteins *p* and *q* respectively). Then the sum of *sim*(*p, q*) and the topological weight is assigned to the edge between *u* and *v*.

In our experiments, if proteins are not annotated with GO terms, 0 is used as their interaction weight. Interactions with GO similarities less than 0.9 are regarded as false positive interactions and deleted from the PIN.
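The weighting and filtering described above can be illustrated as follows. The sketch takes the GO similarity as a precomputed lookup (`go_sim`, a hypothetical name) rather than computing *sim*(*p, q*) itself. It also assumes that a missing annotation means a similarity of 0, and that shared neighbors are counted on the input graph before any edge is deleted; neither detail is fixed by the text.

```python
from typing import Dict, FrozenSet, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def weighted_network(
    graph: Graph,
    go_sim: Dict[FrozenSet[str], float],
    min_sim: float = 0.9,
) -> Dict[Tuple[str, str], float]:
    """Return edge weights sim(p, q) + shared-neighbour count,
    keeping only edges whose GO similarity is at least min_sim."""
    weights: Dict[Tuple[str, str], float] = {}
    for u in graph:
        for v in graph[u]:
            if u >= v:
                continue  # visit each undirected edge once, as (u, v) with u < v
            # Unannotated pairs get similarity 0 (assumption) and are dropped.
            sim = go_sim.get(frozenset((u, v)), 0.0)
            if sim < min_sim:
                continue  # treated as a false positive interaction
            shared = len(graph[u] & graph[v])  # topological weight
            weights[(u, v)] = sim + shared
    return weights
```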

### Complex feature selection

Extracting appropriate features for the subgraphs representing complexes is related to the problem of measuring the similarity between complex subgraphs. We designed the following features to describe a complex subgraph in the PPI network. Some features are extracted from the unweighted network and other features from the weighted network.

1. Graph density: Graph density has been used in many complex detection methods and has been shown to be an important feature for complex detection [8]. Let G = (V, E) be an unweighted graph with |V| vertices and |E| edges, and let |E|_{m} = |V|(|V|-1)/2 be the theoretical maximum number of edges in G. The unweighted graph density is defined as the ratio of |E| to |E|_{m}. For the weighted graph G = (V, w(E)), the weight of the edge <u, v> is denoted by w(u, v), and the density of the weighted graph is defined as follows:

2. Degree statistics: In the unweighted graph, the degree of a node is the number of its neighbors and describes how the node is connected to others. For the unweighted graph, the mean and median degrees are chosen as the node degree features. In the weighted graph, the degree of a node is defined as the sum of the weights of the edges between the node and its neighbors, and the mean and maximum degrees are chosen as the node degree features.

3. Edge weight statistics: Similar to the node degree, the edge weight is another important measure of the weighted network as it describes the feature of the edge. The mean of all weights is chosen as the edge weight statistics feature.

4. Clustering coefficient: The clustering coefficient measures how densely the neighbors of each node are connected to one another and can be used to describe the modularity of the graph. Let G = (V, E) be a complex graph with V = {v_{1}, v_{2}, ..., v_{n}} (n is the number of nodes). For each node v_{i}, the set of its neighbors is denoted as V_{i}' = {v_{i1}, v_{i2}, ..., v_{ik}}, and N_{i} = (V_{i}', E_{i}) is the subgraph of G induced by V_{i}'. Define C_{i} = 2|E_{i}|/(k(k-1)) (if k ≤ 1, C_{i} = 0), where k denotes the number of nodes in V_{i}'. The mean of {C_{1}, ..., C_{n}} is chosen as the clustering coefficient feature [23].

5. Topological change: For a weighted graph, topological change features are obtained by measuring the topological changes when different weight cutoffs (ranging from 1 to 8) are applied to the graph. Let G_{i} = (V, E_{i}) (*i* = 1, ..., 8) be the graphs in which only the edges with weights higher than *i* remain, that is, E_{i} = {e | w(e) > *i*}. Topological changes are measured as T_{i} = (|E_{i}|-|E_{i+1}|)/|E_{i}| (*i* = 1, ..., 7; if |E_{i}| = 0, let T_{i} = 0). In our feature set, T_{i} (*i* = 2, ..., 7) are chosen as the topological change features [17], which measure the distribution of the edge degrees in the weighted network.

The five groups of features discussed above are used in our experiments for describing the complexes from different perspectives (as shown in Table 2). Four features are based on the unweighted network and six features are based on the weighted network.
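The feature groups can be made concrete with a short sketch that computes them for one candidate subgraph. Two points are assumptions rather than statements of our implementation: the weighted density is taken as the sum of edge weights divided by |E|_{m}, by analogy with the unweighted density, and the second unweighted degree statistic is taken to be the median. The function names and dictionary keys are hypothetical, and the default cutoffs (5, 7) are only an example.

```python
from statistics import mean, median
from typing import Dict, Sequence, Set, Tuple

Graph = Dict[str, Set[str]]              # undirected graph as adjacency sets
Weights = Dict[Tuple[str, str], float]   # weight per undirected edge


def unweighted_features(nodes: Set[str], graph: Graph) -> Dict[str, float]:
    """Density, degree statistics and mean clustering coefficient
    of the subgraph induced by `nodes`."""
    n = len(nodes)
    degrees = [len(graph[v] & nodes) for v in nodes]
    max_edges = n * (n - 1) / 2
    coeffs = []
    for v in nodes:
        nbrs = graph[v] & nodes
        k = len(nbrs)
        if k <= 1:
            coeffs.append(0.0)
            continue
        links = sum(len(graph[u] & nbrs) for u in nbrs) / 2  # |E_i|
        coeffs.append(2 * links / (k * (k - 1)))
    return {
        "density": (sum(degrees) / 2) / max_edges if max_edges else 0.0,
        "mean_degree": mean(degrees) if degrees else 0.0,
        "median_degree": median(degrees) if degrees else 0.0,
        "clustering": mean(coeffs) if coeffs else 0.0,
    }


def weighted_features(
    nodes: Set[str], weights: Weights, cutoffs: Sequence[int] = (5, 7)
) -> Dict[str, float]:
    """Weighted density, degree and edge-weight statistics, plus
    topological change T_i for the requested cutoffs."""
    n = len(nodes)
    max_edges = n * (n - 1) / 2
    degree = {v: 0.0 for v in nodes}
    inside = []
    for (u, v), w in weights.items():
        if u in nodes and v in nodes:
            inside.append(w)
            degree[u] += w
            degree[v] += w
    feats = {
        # Assumed form: total edge weight over the maximum edge count.
        "density": sum(inside) / max_edges if max_edges else 0.0,
        "mean_degree": mean(degree.values()) if degree else 0.0,
        "max_degree": max(degree.values(), default=0.0),
        "mean_weight": mean(inside) if inside else 0.0,
    }
    for i in cutoffs:
        e_i = sum(1 for w in inside if w > i)         # |E_i|
        e_next = sum(1 for w in inside if w > i + 1)  # |E_{i+1}|
        feats[f"T{i}"] = (e_i - e_next) / e_i if e_i else 0.0
    return feats
```

Concatenating the two dictionaries in a fixed key order gives the feature vector that is passed to the Regression model.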

### Regression model

In our method, the Regression model is introduced to evaluate the likelihood that a subgraph is a true complex. Regression analysis is a statistical method used to model and analyze several features [24]. The goal of regression is to summarize observed data as simply, usefully and elegantly as possible [25]. Complex detection requires a model that can evaluate the likelihood that a subgraph is a true complex, and regression analysis suits this task because it can learn a model over multiple features from the training set.

We model the problem of evaluating the likelihood that a subgraph is a true complex as a linear regression function, $f\left(x\right)={\omega}^{T}\cdot x$, where *f(x)* is a linear function of the features and *ω* is the weight vector, whose dimension equals the number of features. The linear least squares approach is used to obtain the regression model, and the least square function is defined as follows:

where *i* indexes the training samples, *N* is the number of samples, *y*_{i} is the annotation of sample *i* (in our model ${y}_{i}\in \left\{0,1,2\right\}$; *y*_{i} is set to 0, 1 and 2 for the negative, intermediate and positive samples respectively), *x*_{i} is the feature vector, and $f\left({x}_{i}\right)$ is the score of sample *i*. This approach leads to an optimization problem: the model is optimal when the least square function attains its minimum value.

We solve the optimization problem with the gradient descent algorithm, an iterative algorithm in which, at each step, the gradient of the objective function is calculated and the next point is found by moving in the negative gradient direction scaled by the step size. The gradient of the objective with respect to each parameter can be calculated as follows:

where *ω*_{j} is the weight of the *j*th dimension, and *ω* is updated by $\omega \leftarrow \omega -\eta \cdot \Delta \omega $, where the learning rate *η* can be set to a small positive value.
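The training loop can be sketched as batch gradient descent on a squared-error objective. The sketch assumes the objective $\frac{1}{2}\sum_{i=1}^{N}{\left(f\left({x}_{i}\right)-{y}_{i}\right)}^{2}$, whose gradient with respect to *ω*_{j} is $\sum_{i=1}^{N}\left(f\left({x}_{i}\right)-{y}_{i}\right){x}_{ij}$; the constant factor is an assumption and only rescales the learning rate. The function names and default values are hypothetical.

```python
from typing import List, Sequence


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear score f(x) = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_regression(
    samples: Sequence[Sequence[float]],
    labels: Sequence[float],  # 0 = negative, 1 = intermediate, 2 = positive
    iterations: int = 500,
    eta: float = 0.01,
) -> List[float]:
    """Batch gradient descent for linear least squares."""
    dim = len(samples[0])
    w = [0.0] * dim
    for _ in range(iterations):
        grad = [0.0] * dim
        for x, y in zip(samples, labels):
            err = predict(w, x) - y
            for j in range(dim):
                grad[j] += err * x[j]  # contribution to dE/dw_j
        # Step against the gradient: w <- w - eta * grad.
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w
```

Since $f\left(x\right)={\omega}^{T}\cdot x$ has no intercept, a constant feature has to be appended by the caller if an offset is wanted. Stopping after a different number of iterations yields a different weight vector, which is the sense in which models with different iteration counts are distinct models.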

### Datasets

Our method was tested on the DIP database, which has been widely used in the complex detection field, so that our results are comparable with those of others. DIP contains 4928 proteins and 17201 interactions. We built the weighted network by calculating the GO similarity of the proteins as discussed in the previous section. The 6120 interactions with GO similarities less than 0.9 were deleted, since a lower GO similarity indicates that two proteins have fewer common functional annotations and their interaction is more likely to be a false one.

Our training set includes 668 positive samples, 422 intermediate samples and 2004 negative samples. The positive samples are obtained from four sources: (I) MIPS [26], (II) Aloy et al. [27], (III) the SGD database [22] and (IV) TAP06 [2]. Moreover, as existing research shows that most complexes include more than one protein [28], we chose the complexes that have at least two different proteins as the true complexes. The intermediate samples are 422 complexes predicted by the COACH method, and 2004 subgraphs obtained by randomly selecting nodes are used as the negative samples. The size distribution of the positive sample set follows a power law, as do those of the intermediate and negative sample sets, as shown in Figure 1. After a Regression model is trained with the training set, our complex detection algorithm is run on the DIP PIN. The detected complexes are then evaluated using the metrics introduced in the following section.

### Evaluation metrics

The neighborhood affinity score (NA (A, B)) [29] is used as a measure to evaluate the similarity of two given clusters A and B, and is defined as follows:

$$NA\left(A,B\right)=\frac{{\left|A\cap B\right|}^{2}}{\left|A\right|\times \left|B\right|}$$

The neighborhood affinity score between a predicted complex p and a true complex b, NA (p, b), is used to determine whether they match. If NA (p, b) ≥ *ω*, they are considered to be matching (*ω* is usually set to 0.25). Let P and B be the sets of the predicted complexes and of the true complexes in the benchmark respectively, let *N*_{cp} be the number of correct predictions, i.e. predicted complexes that match at least one true complex, and let *N*_{cb} be the number of true complexes that match at least one predicted complex. The precision and recall are defined as follows:

$$precision=\frac{{N}_{cp}}{\left|P\right|},\qquad recall=\frac{{N}_{cb}}{\left|B\right|}$$

The F-measure (the harmonic mean of precision and recall and defined as (2PR) / (P + R) where P denotes precision and R recall) can be used to evaluate the overall performance, which is a popular metric in the performance evaluation of complex detection methods.
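The matching-based metrics can be computed as in the following sketch, which uses the usual definition of the neighborhood affinity score, |A ∩ B|² / (|A| × |B|), together with precision = *Ncp* / |P| and recall = *Ncb* / |B|. The function names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def neighborhood_affinity(a: Set[str], b: Set[str]) -> float:
    """NA(A, B) = |A ∩ B|^2 / (|A| * |B|)."""
    common = len(a & b)
    return common * common / (len(a) * len(b))


def precision_recall_f(
    predicted: Sequence[Set[str]],
    benchmark: Sequence[Set[str]],
    omega: float = 0.25,
) -> Tuple[float, float, float]:
    """Precision, recall and F-measure under the NA >= omega matching rule."""
    # Ncp: predictions that match at least one true complex.
    ncp = sum(
        1 for p in predicted
        if any(neighborhood_affinity(p, b) >= omega for b in benchmark)
    )
    # Ncb: true complexes matched by at least one prediction.
    ncb = sum(
        1 for b in benchmark
        if any(neighborhood_affinity(p, b) >= omega for p in predicted)
    )
    precision = ncp / len(predicted) if predicted else 0.0
    recall = ncb / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Note that *Ncp* and *Ncb* generally differ, because several predictions may match the same true complex and vice versa.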

Recently, the sensitivity (*Sn*), positive predictive value (*PPV*), accuracy (*Acc*) and *P-value* have also been proposed to evaluate the performance of complex detection methods [29]. Given *n* benchmark complexes and *m* predicted complexes, let *T*_{ij} be the number of proteins in common between the *i*th benchmark complex and the *j*th predicted complex. *Sn* and *PPV* are then defined as follows:

$$Sn=\frac{\sum_{i=1}^{n}\max_{j}\left\{{T}_{ij}\right\}}{\sum_{i=1}^{n}{N}_{i}},\qquad PPV=\frac{\sum_{j=1}^{m}\max_{i}\left\{{T}_{ij}\right\}}{\sum_{j=1}^{m}{T}_{.j}}$$

where *N*_{i} is the number of proteins in the *i*th benchmark complex, and *T*_{.j} is defined as:

$${T}_{.j}=\sum_{i=1}^{n}{T}_{ij}$$

Generally, a high *Sn* value indicates that the prediction has a good coverage of the proteins in the true complexes, whereas a high *PPV* value indicates that the predicted complexes are likely to be true complexes. As a summary metric, the accuracy of a prediction, *Acc*, can then be defined as the geometric average of *Sn* and *PPV*:

$$Acc=\sqrt{Sn\times PPV}$$

These metrics are by no means absolute measures--they all have their own limitations, *Sn, PPV* and *Acc* in particular. For example, if a method predicts a giant complex that covers many proteins in the known true complex set, this method will yield a very high *Sn* score. Similarly, *PPV* value cannot evaluate overlapping clusters reliably. Here is a case in point: if the known gold standard MIPS complex set is taken to match with itself, then the resulting *PPV* value is 0.772 instead of 1, although the precision and recall are both correctly calculated as 1. In such cases, the *Acc* score, as the geometric average of *Sn* and *PPV*, will not make sense either. Therefore, in the performance comparison, the F-measure is used as the main metric, and the *Acc* is only used as an auxiliary one.
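The overlap effect on *PPV* is easy to reproduce with a small sketch. It assumes the standard definitions: *Sn* is the sum over benchmark complexes of the largest *T*_{ij} divided by the total benchmark size, *PPV* is the sum over predicted complexes of the largest *T*_{ij} divided by the sum of all *T*_{ij}, and *Acc* is their geometric mean. The function name is hypothetical.

```python
from math import sqrt
from typing import Sequence, Set, Tuple


def sn_ppv_acc(
    benchmark: Sequence[Set[str]], predicted: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Sensitivity, positive predictive value and accuracy."""
    # t[i][j]: proteins shared by benchmark complex i and predicted complex j.
    t = [[len(b & p) for p in predicted] for b in benchmark]
    n, m = len(benchmark), len(predicted)
    total_benchmark = sum(len(b) for b in benchmark)
    sn_num = sum(max(row) for row in t) if m else 0
    sn = sn_num / total_benchmark if total_benchmark else 0.0
    # Column sums T_.j and column maxima over benchmark complexes.
    col_sum = [sum(t[i][j] for i in range(n)) for j in range(m)]
    col_max = [max(t[i][j] for i in range(n)) for j in range(m)] if n else []
    ppv = sum(col_max) / sum(col_sum) if sum(col_sum) else 0.0
    return sn, ppv, sqrt(sn * ppv)
```

For instance, matching the overlapping set [{a, b, c}, {c, d}] against itself gives *Sn* = 1 but *PPV* = 5/7, because the shared protein c is counted in both columns of the denominator.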

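*P-value* refers to the statistical significance of the occurrence of a predicted protein complex with respect to a given functional annotation, which is computed by the following hypergeometric distribution:

$$P\text{-}value=1-\sum_{i=0}^{k-1}\frac{\binom{\left|F\right|}{i}\binom{\left|V\right|-\left|F\right|}{\left|C\right|-i}}{\binom{\left|V\right|}{\left|C\right|}}$$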

where a predicted complex *C* contains *k* proteins in the functional group *F* and the entire PPI network contains |*V*| proteins. The functional homogeneity of a predicted complex is the smallest p-value over all the possible functional groups. A predicted complex with a low *p-value* indicates that it is enriched by proteins from the same function group and it is thus likely to be a true protein complex.
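A hedged sketch of this test is given below. It assumes the standard upper-tail hypergeometric form, i.e. the probability that a random set of |*C*| proteins drawn from the |*V*| proteins of the network contains at least *k* members of *F*, and sums the tail directly, which equals one minus the lower-tail sum. The function and parameter names are hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def functional_p_value(
    n_proteins: int,    # |V|: proteins in the whole PPI network
    group_size: int,    # |F|: proteins annotated with the functional group
    complex_size: int,  # |C|: proteins in the predicted complex
    k: int,             # proteins of the complex that belong to F
) -> float:
    """Probability of seeing at least k group members by chance."""
    total = comb(n_proteins, complex_size)
    upper = min(complex_size, group_size)
    tail = sum(
        comb(group_size, i) * comb(n_proteins - group_size, complex_size - i)
        for i in range(k, upper + 1)
    )
    return tail / total
```

The functional homogeneity of a complex would then be the minimum of this value over all functional groups considered.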

## Results and discussions

The effects of different parameters (e.g. Regression model iteration time, two- or three-category training set, clique_size and merg_thred) and feature sets on performance, the performance comparison with other methods and the statistical evaluation of the predicted protein complexes are discussed in this section.

It should be noted that in the experiments discussed in the following two sections (*the effect of different parameters on performance* and *the effect of different feature sets on performance*), we used the 668 positive samples as the benchmark to evaluate our identified complexes, which means that the training and testing data overlap. In a classical classification task, this would affect the validity of the results. However, in our complex detection method the problem is less serious, since the Regression model trained on the training data is not used directly to classify candidate complexes as true or false, but to assign them scores used for clique filtering, growth, and candidate complex filtering, as described in the previous section. Nevertheless, to avoid the problem as far as possible, when comparing our results with those of other methods we used either a procedure similar to five-fold cross-validation (introduced in the section *Performance comparison with other methods*) or different training and testing data. However, the five-fold cross-validation procedure significantly increases the experiment time. For example, the number of experiments needed for Table 3 would increase from 144 (16 × 9) to 720 (16 × 9 × 5). Therefore, in the experiments of the following two sections we still used the 668 positive samples as the benchmark to approximately evaluate the impact of parameters such as the Regression model iteration time, the two- or three-category training set, clique_size, merg_thred, and the different feature sets on performance, while in the section *Performance comparison with other methods* (which compares our results with those of other methods) we used the five-fold cross-validation procedure or different training and testing data.

### The effect of different parameters on performance

In our method, we introduced the Regression model to evaluate the likelihood that a subgraph is a true complex. In the Regression model, the squared regression error decreases as the number of iterations grows, so different iteration counts yield different models. Table 3 compares the F-measures of these Regression models. In the table, Model100 denotes the model with 100 iterations, Model200 the model with 200 iterations, and so on. Among them, Model500 achieves the highest F-measure of 0.5910 when merg_thred is 0.8 and clique_size ≥ 3. With further increases in the number of iterations, the F-measure begins to decrease. Analyzing the results, we found that Model500 achieves higher precision than the models with more iterations (e.g. Model4000) while having almost the same recall. Figure 2 depicts the F-measure curves of the different models and shows that, in most cases, the F-measure of each model keeps increasing as merg_thred increases and reaches its peak when merg_thred is 0.7 or 0.8. Therefore, in the following experiments, the Regression model with 500 iterations and a merg_thred of 0.8 are used.

As mentioned in the previous section, our Regression model is built with the three-category training set, which could improve the discrimination of the model. To verify its effectiveness, we compared the performance of the two-category and three-category training sets with clique_size ≥ 3. As can be seen from Figure 3, the F-measure and accuracy obtained with the three-category training set are much better than those obtained with the two-category training set when merg_thred is 0.8. Therefore, the three-category training set is used in our experiments.

In addition, we conducted comparison experiments with different clique_size and merg_thred values. Table 4 shows the performance for clique_size ≥ 3 and clique_size ≥ 4 as merg_thred ranges from 0.1 to 0.9. The experimental results show that our method returns more complexes and achieves a higher F-measure when clique_size ≥ 3, as the lower clique_size allows the Cliques algorithm to find more cliques (and hence more predicted complexes). In contrast, when clique_size ≥ 4, few predicted complexes are returned to match the true complexes. Therefore, we set clique_size ≥ 3 in our experiments. The advantage of clique_size ≥ 4 is that it returns fewer complexes with higher precision. For example, when merg_thred is 0.2, it returns 113 complexes and achieves the highest precision (0.7876). In addition, as merg_thred grows from 0.1 to 0.9, the recall increases while the precision decreases. The reason is that, when merg_thred increases, fewer merging operations are performed and our method predicts more complexes, achieving a higher recall. However, the precision deteriorates since more predicted complexes remain unmatched to any true complex. The F-measure achieves its highest value, 0.5910, when merg_thred is 0.8 (clique_size ≥ 3).

### The effect of different feature sets on performance

To evaluate the contribution of different feature sets to the performance, we conducted experiments with different feature sets. The experimental results with three different feature sets (four unweighted features, seven weighted features, and all features) are shown in Table 5. The F-measures achieved with the seven features from the weighted network are much better than those achieved with the four features from the unweighted network, and almost as good as those achieved with all features, which shows that the feature set from the weighted network is effective in improving the performance. The reason is that the weighted network feature set combines the GO information with the topological properties. In addition, the combination of the unweighted and weighted network feature sets achieves an F-measure of 0.5910 (merg_thred is 0.8), which indicates that the construction of our feature set is effective.

We also analyzed the contribution of individual features to the performance. Table 6 shows the feature rankings obtained with two different criteria. The Regression model assigns each feature a weight that reflects its importance, and the features are ranked by these weights in descending order in Table 6 (columns 2 and 3 show the feature names and their weights).

Moreover, we conducted experiments to measure the performance of our complex detection method when each feature was removed in turn. The more sharply the performance declines when a feature is removed, the more important the feature is deemed to be. In this way, the features are ranked by the F-measure of each one-feature-removed experiment in ascending order in Table 6 (columns 4 and 5 show the ranked feature names and their F-measures). It should be noted that the topological change features enhance the performance only when *i* = 5 and 7; therefore, the other topological change features were removed from our feature set.
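The one-feature-removed ranking can be sketched as follows. This is a minimal illustration, not our implementation: `evaluate` is a hypothetical callable standing in for a full train-and-detect run that returns an F-measure, and the feature names and contributions in the toy example are invented.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_removal(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Rank features by the F-measure obtained when each one is left out."""
    scores = []
    for removed in features:
        kept = [f for f in features if f != removed]
        scores.append((removed, evaluate(kept)))
    # Ascending F-measure: the feature whose removal hurts most comes first.
    return sorted(scores, key=lambda item: item[1])


# Toy stand-in for a full experiment; these contributions are invented.
TOY_CONTRIBUTION = {"feature_a": 0.30, "feature_b": 0.05, "feature_c": 0.15}


def toy_evaluate(kept: List[str]) -> float:
    """Hypothetical F-measure: the sum of the kept features' contributions."""
    return sum(TOY_CONTRIBUTION[f] for f in kept)


ranking = rank_by_removal(list(TOY_CONTRIBUTION), toy_evaluate)
```

In the toy example, removing `feature_a` lowers the score the most, so it is ranked first, mirroring how columns 4 and 5 of Table 6 are ordered.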

In accordance with the results in Table 5, the features from the weighted network play a more important role in the feature set than those from the unweighted network. In particular, the mean edge weight and density features of the weighted network rank among the top 3 in both lists. This is also consistent with the idea behind previous complex detection algorithms based on detecting dense subgraphs.

Compared with the other supervised learning-based methods of Qi et al. and Shi et al., our method introduces some new features from the weighted network (in bold and italic in Table 6): the density and the mean and maximum degrees of the weighted network. Our experiment shows that these features are quite effective in complex detection: together they contribute a performance improvement of 2.6 percentage points in F-measure (from 0.5650 to 0.5910).

To assess the effectiveness of our Regression model, we conducted comparative experiments between the Regression model and an equal weight model that assigns the same weight to all features. The results are shown in Table 7, from which it can be seen that the F-measures of the Regression model are superior to those of the equal weight model at different merg_thred values. When merg_thred is 0.8, the Regression model achieves an F-measure of 0.5910, which is substantially better than that of the equal weight model (0.4366). This indicates that the Regression model is effective in assigning appropriate weights to different features and, therefore, in improving the performance.

### Performance comparison with other methods

The performance comparison with the state-of-the-art unsupervised methods MCODE, COACH, CMC and ClusterONE is shown in Table 8. To compare with these methods as fairly as possible, we designed an experimental procedure similar to five-fold cross-validation to obtain the complexes with our method: the 668 positive examples are divided into five folds {S_{1}, S_{2}, S_{3}, S_{4}, S_{5}}. In each cross-validation experiment, four folds (plus 422 intermediate samples and 2004 negative samples) are used as the training set, and the trained Regression model is then used to detect the complexes in the DIP PIN. Since the detected complexes may include those in the training set, the predicted complexes that match the training set are removed (the match threshold is set to 0.9, calculated by Equation (6)). For example, if the training set is {S_{1} ∪ S_{2} ∪ S_{3} ∪ S_{4}}, then the set of remaining complexes is supposed to include the complexes in S_{5}.

After five rounds of such experiments, the five result sets of remaining complexes are combined; the combined set is supposed to include the complexes in {S_{1} ∪ S_{2} ∪ S_{3} ∪ S_{4} ∪ S_{5}}. Similar complexes with a match score higher than 0.6 (a value determined through our experiments to achieve the best performance) are then removed. Finally, the remaining detected complexes are used as our final result, which is evaluated against {S_{1} ∪ S_{2} ∪ S_{3} ∪ S_{4} ∪ S_{5}} (the 668 positive examples). In this way we avoid the problem that the training and testing sets may overlap. Since MCODE, COACH, CMC and ClusterONE are unsupervised methods, their results are obtained directly on the PIN with their optimal parameters.
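The evaluation protocol described above can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: `na_score` assumes the neighborhood-affinity form |A ∩ B|² / (|A| · |B|) for the match score of Equation (6); `detect` is a hypothetical callable that bundles model training (including the intermediate and negative samples) and complex detection on the PIN; the round-robin fold split and the treatment of a score of exactly 0.9 as a match are also assumptions.

```python
from typing import Callable, FrozenSet, List, Sequence

Complex = FrozenSet[str]
Detector = Callable[[Sequence[Complex]], List[Complex]]


def na_score(a: Complex, b: Complex) -> float:
    """Assumed match score: |A & B|^2 / (|A| * |B|), for non-empty sets."""
    common = len(a & b)
    return common * common / (len(a) * len(b))


def split_folds(items: Sequence[Complex], k: int = 5) -> List[List[Complex]]:
    """Round-robin split into k folds (the actual split is an assumption)."""
    return [list(items[i::k]) for i in range(k)]


def cross_validated_predictions(
    positives: Sequence[Complex],
    detect: Detector,
    train_match: float = 0.9,
    merge_match: float = 0.6,
    k: int = 5,
) -> List[Complex]:
    folds = split_folds(positives, k)
    pooled: List[Complex] = []
    for held_out in range(k):
        training = [c for i, fold in enumerate(folds) if i != held_out for c in fold]
        predicted = detect(training)
        # Drop predictions that reproduce a complex seen during training.
        pooled.extend(
            p for p in predicted
            if all(na_score(p, t) < train_match for t in training)
        )
    # Combine the five rounds and drop near-duplicates (score above 0.6).
    final: List[Complex] = []
    for p in pooled:
        if all(na_score(p, q) <= merge_match for q in final):
            final.append(p)
    return final


# Toy usage: a detector that echoes its training set plus one novel complex.
toy_positives = [frozenset({f"p{i}", f"q{i}", f"r{i}"}) for i in range(10)]
novel = frozenset({"x", "y", "z"})
toy_result = cross_validated_predictions(
    toy_positives, lambda training: list(training) + [novel]
)
```

In the toy run, every echoed training complex is filtered out in its own round, and the five copies of the novel complex collapse into one, so only predictions independent of the training folds reach the final evaluation.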

The experiments were performed on four PIN datasets: DIP, Gavin [30], Krogan [31] and Collins [32] (their details are shown in Table 9). In these networks, interactions with GO similarities less than 0.9 are regarded as false positive interactions and deleted, as described in the previous section.

As shown in Table 8, on the DIP dataset, a widely used dataset in the complex detection field, our method obtains the best result on almost every evaluation metric. In terms of the F-measure, the most frequently used evaluation metric, our method achieves the highest performance (0.5710), far above those of MCODE (0.2150), COACH (0.4735), CMC (0.4766) and ClusterONE (0.4533). Measured by F-measure, our method also performs best on the other PIN datasets except Collins, where its F-measure (0.6266) is lower than that of ClusterONE (0.6909) but still higher than those of the other methods.

The main advantage of our method over the other methods is that it uses supervised learning in the complex detection process, which makes full use of the information in the available known complexes to achieve better performance.

Qi et al. were the first to introduce a supervised learning-based method into complex detection. Table 10 gives the performance comparison between their method and ours. Since the program used by Qi et al. is not available, we use their published results [15]. Qi et al. used MIPS and TAP06 as the positive sets. Thus, to make the results as comparable as possible, our datasets were processed in the same way as in Qi et al., i.e. the complexes composed of a single protein or a pair of proteins were filtered out. After this filtering, 200 complexes remained in MIPS and 150 complexes remained in TAP06. It should be pointed out that the numbers of remaining complexes in TAP06 in Qi et al.'s method and ours are almost the same (152 to 150), whereas those remaining in MIPS are markedly different (101 to 200). Moreover, in line with Qi et al.'s method, we kept only the proteins from the two true complex sets in the PIN, yielding 1353 proteins and 5072 interactions. We conducted experiments using MIPS as the positive training set and TAP06 as the testing set, and vice versa.

In our experiments, we set clique_size ≥ 3 and merg_thred to 0.8, with all the evaluation metrics in Table 10 computed in the same way as in Qi et al.'s method. It should be pointed out that, in Qi et al.'s method, the criterion for deciding whether a predicted complex matches a true complex differs from the NA(A, B) value computed in Equation (6). Qi et al. assumed that, if the proteins common to the predicted complex and the true complex constitute more than 50% of each one, the predicted complex is taken as a match to the true complex. The precision, recall and F-measure are all calculated based on this definition and are shown in Table 10.
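The matching criterion of Qi et al. described above can be sketched as follows. The function and variable names are illustrative. The precision and recall definitions (the fraction of predicted complexes that match some true complex, and the fraction of true complexes matched by some predicted complex) follow a common convention and are an assumption here, since they are not restated in this section.

```python
from typing import FrozenSet, Sequence, Tuple

Complex = FrozenSet[str]


def qi_match(predicted: Complex, true: Complex, min_fraction: float = 0.5) -> bool:
    """True if the shared proteins make up more than 50% of each complex."""
    common = len(predicted & true)
    return (common > min_fraction * len(predicted)
            and common > min_fraction * len(true))


def precision_recall(
    predicted: Sequence[Complex], reference: Sequence[Complex]
) -> Tuple[float, float]:
    """Precision and recall under the overlap criterion (assumed convention)."""
    matched_predicted = sum(any(qi_match(p, t) for t in reference) for p in predicted)
    matched_reference = sum(any(qi_match(p, t) for p in predicted) for t in reference)
    precision = matched_predicted / len(predicted) if predicted else 0.0
    recall = matched_reference / len(reference) if reference else 0.0
    return precision, recall


# Toy complexes (invented): one of two predictions matches one of two references.
toy_predicted = [frozenset({"a", "b", "c"}), frozenset({"x", "y"})]
toy_reference = [frozenset({"a", "b", "c", "d"}), frozenset({"m", "n", "o"})]
toy_precision, toy_recall = precision_recall(toy_predicted, toy_reference)
```

Note that the criterion is strict: a pair of complexes sharing exactly half of each one's proteins does not count as a match.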

As can be seen from Table 10, in both training-testing settings the F-measures of our method are better than those of Qi et al.'s method. In particular, when TAP06 is used as the training set and MIPS as the testing set, our F-measure is 19.4 percentage points higher than that of Qi et al.'s method (0.506 versus 0.312). Although the results are not fully comparable because of the different numbers of remaining complexes in MIPS, they still show the effectiveness of our method.

We also compared our method with that of Shi et al. (a semi-supervised prediction model with a neural network). They used MIPS as both the training and testing set and achieved an F-measure of 0.397 (0.333 in precision and 0.491 in recall) on the DIP database. With the same experimental setting and evaluation metrics, our method obtains a better F-measure of 0.5144 (0.4194 in precision and 0.665 in recall).
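The reported F-measures are consistent with the harmonic mean of precision and recall, as the short check below shows. The formula F = 2PR / (P + R) is the standard definition and is assumed here, since it is not restated in this section.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


# Precision and recall values quoted in the text above.
shi_f = round(f_measure(0.333, 0.491), 3)    # reported as 0.397
ours_f = round(f_measure(0.4194, 0.665), 4)  # reported as 0.5144
```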

The better performance of our method over the other two supervised learning-based methods, those of Qi et al. and Shi et al., may be due to three key reasons. Firstly, as discussed in the previous section, our method introduces some new features from the weighted network (the density and the mean and maximum degrees of the weighted network), which prove to be quite effective in improving the performance. Secondly, in our method, the initial cliques are the maximal cliques found by the Cliques algorithm, which has been shown to be more effective than expanding from seeding proteins. In contrast, in the other two methods, each seeding protein is connected to its highest-weight neighbor and the pair is subsequently used as the starting clique. We conducted an experiment in which the starting cliques were selected in this way, with all other experimental settings unchanged, and achieved an F-measure of 0.5418, which is inferior to the result of our method (0.5910). Finally, our method introduces the three-category training set for the first time. Since the additional samples and categories provide more information for training the regression model, the learned model becomes more discriminative.

For comparison purposes, the performances of MCODE, COACH, CMC and ClusterONE on the same PIN are also presented in Table 10. On the MIPS testing set, our method also outperforms the others. However, on the TAP06 testing set, the performances of COACH and ClusterONE are better than that of our method. The reason is the limited number of positive samples (200 complexes from MIPS). When we introduced more positive samples from Aloy and SGD (263 complexes in total from MIPS, Aloy and SGD), a much better performance was achieved (0.447 in F-measure, the last row in Table 10, denoted as "Ours1"), which is very close to those of COACH and ClusterONE (0.449 and 0.470 in F-measure, respectively). Similarly, on the MIPS testing set, a better performance was achieved when more positive samples were introduced (improved from 0.506 to 0.518 in F-measure). This shows that, as a supervised learning method, our method can achieve better performance than the unsupervised methods if more training samples are provided.

### Statistical evaluation of the predicted protein complexes

To substantiate the biological significance of our predicted complexes, we calculated their functional p-values, which represent the probability of the co-occurrence of proteins with common functions. A low p-value for a predicted complex therefore generally indicates that the collective occurrence of these proteins in the complex is not merely due to chance, and thus that the complex has high statistical significance.
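As an illustration of how such a p-value can be computed, the sketch below evaluates the hypergeometric tail probability that underlies GO term enrichment tests. It is a minimal stand-alone example rather than the tool used in our experiments: the function and parameter names are hypothetical, the numbers in the usage example are invented, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       complex_size: int, hits: int) -> float:
    """P(at least `hits` annotated proteins in a random set of `complex_size`).

    population   -- number of proteins in the background set
    annotated    -- background proteins carrying the GO term
    complex_size -- number of proteins in the predicted complex
    hits         -- proteins in the complex carrying the GO term
    """
    upper = min(annotated, complex_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, complex_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, complex_size)


# Toy numbers (invented): all 3 proteins of a complex share a term that
# annotates only 3 of 10 background proteins.
toy_p = enrichment_p_value(population=10, annotated=3, complex_size=3, hits=3)
```

With these toy numbers only one of the 120 possible three-protein sets consists entirely of annotated proteins, so the p-value is 1/120.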

Table 11 gives ten examples of predicted complexes with low p-values that are matched with true complexes from the benchmark (in our experiments, the p-values of complexes are calculated with SGD's GO::TermFinder [33]). Although some of our predicted complexes with low p-values were not matched with the true complexes, they still have high biological significance, as some of them may be true complexes that are still undiscovered. Examples of such complexes are given in Table 12 and Figure 4 and might be of use to biologists looking for new protein complexes.

## Conclusions

Protein complexes are important for understanding principles of cellular organization and function. Since high-throughput experimental techniques produce a large number of protein interactions, many complex detection algorithms have been proposed. However, most of the current methods are based only on the topological structure of the PIN and do not make use of the information in the available known complexes.

In this paper, we present a supervised learning-based method to detect complexes from a PIN. In this method, a training set is constructed and used to obtain a Regression model, which is subsequently used to assess the detected complexes during clique filtering, growth, and candidate complex filtering. The evaluation and analysis of our predictions demonstrate several advantages of our method over other state-of-the-art techniques. Firstly, our method is a supervised learning-based method that can make full use of the information in the available known complexes instead of being based only on the topological structure of the PIN. This also means that, if more training samples are provided, our method can achieve better performance than the unsupervised methods. Secondly, we design a rich feature set to describe the properties of the known complexes, which includes not only features from the unweighted network but also features from the weighted network built from the Gene Ontology information. The weighted network features achieve a much better performance than the unweighted network features, which demonstrates the effectiveness of using Gene Ontology. Thirdly, our Regression model utilizes additional "uncertainty" samples and therefore becomes more discriminative; its effectiveness for complex detection is clearly indicated by our experimental results.

Our future work will focus on exploring more effective features for complex detection in PINs. In particular, extracting features from biomedical resources such as Gene Ontology may be a promising approach. In addition, we plan to cooperate with biomedical experts on protein complex detection in disease-specific PINs, through which the effectiveness of our method can be further verified.

## References

- 1.
Gavin AC, Bösche M: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.

- 2.
Gavin AC, Aloy P: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.

- 3.
Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999, 17: 1030-1032. 10.1038/13732.

- 4.
Tarassov K, Messier V: An in vivo map of the yeast protein interactome. Science. 2008, 320: 1465-1470. 10.1126/science.1153878.

- 5.
Schönbach C: Molecular biology of protein-protein interactions for computer scientists. Biological data mining in protein interaction networks IGI Global, USA. Edited by: Li XL, Ng SK. 2009, 1-13.

- 6.
Li XL, Wu M, Kwoh CK, Ng SK: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 2010, 11 (Suppl 1): S3-10.1186/1471-2164-11-S1-S3.

- 7.
Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci. 2003, 100: 12123-12128. 10.1073/pnas.2032324100.

- 8.
Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics. 2003, 4: 2-10.1186/1471-2105-4-2.

- 9.
Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7: 207-10.1186/1471-2105-7-207.

- 10.
Adamcsek B, Palla G: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22: 1021-1023. 10.1093/bioinformatics/btl039.

- 11.
Liu GM, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25: 1891-1897. 10.1093/bioinformatics/btp311.

- 12.
Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods. 2012, 9: 471-472. 10.1038/nmeth.1938.

- 13.
Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins. 2004, 54: 49-57.

- 14.
Wu M, Li XL, Kwoh CK, Ng SK: A Core-Attachment based Method to Detect Protein Complexes in PPI Networks. BMC Bioinformatics. 2009, 10: 169-10.1186/1471-2105-10-169.

- 15.
Qi YJ, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z: Protein complex identification by supervised graph local clustering. Bioinformatics. 2008, 24: i250-i258. 10.1093/bioinformatics/btn164.

- 16.
Shi L, Lei X, Zhang A: Protein complex detection with semi-supervised learning in protein interaction networks. Proteome Sci. 2011, 9 (Suppl 1): S5-10.1186/1477-5956-9-S1-S5.

- 17.
Chen L, Shi X, Kong X, Zeng Z, Cai YD: Identifying protein complexes using hybrid properties. J Proteome Res. 2009, 8 (11): 5212-5218. 10.1021/pr900554a.

- 18.
Qiu J, Noble WS: Predicting Co-Complexed Protein Pairs from Heterogeneous Data. PLoS Comput Biol. 2008, 4: e1000054-10.1371/journal.pcbi.1000054.

- 19.
Tomita E, Tanaka A, Takahashi H: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science. 2006, 363: 28-42. 10.1016/j.tcs.2006.06.015.

- 20.
Xenarios I, Salwínski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.

- 21.
Cho YR, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007, 265: 147-160.

- 22.
Dwight SS, Harris MA: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 2002, 30: 69-72. 10.1093/nar/30.1.69.

- 23.
Stelzl U, Worm U: A Human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122: 957-968. 10.1016/j.cell.2005.08.029.

- 24.
Cohen J, Patricia C, West SG, Aiken LS: Applied multiple regression/correlation analysis for the behavioral sciences. Edited by: Riejert D, Planer J. 2003, Lawrence Erlbaum Associates, Mahwah, New Jersey, 3

- 25.
Weisberg S: Applied Linear Regression (3nd ed). 1980, John Wiley & Sons, Inc. New York

- 26.
Mewes HW, Amid C: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, 32 (Database): D41-D44.

- 27.
Aloy P, Böttcher B: Structure-based assembly of protein complexes in yeast. Science. 2004, 303: 2026-2029. 10.1126/science.1092645.

- 28.
Dudley AM, Janse DM, Tanay A, Shamir R, Church GM: A global view of pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol. 2005, 1: 2005-0001

- 29.
Brohée S, Helden JV: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488-10.1186/1471-2105-7-488.

- 30.
Gavin AC, Aloy P: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.

- 31.
Krogan NJ, Cagney G: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440: 637-643. 10.1038/nature04670.

- 32.
Collins SR, Kemmeren P: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007, 6: 439-450.

- 33.
Boyle EI, Weng S: GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004, 20: 3710-3715. 10.1093/bioinformatics/bth456.

## Acknowledgements

This work is supported by grants from the Natural Science Foundation of China (grant no. 61070098, 61272373 and 61340020), Trans-Century Training Programme Foundation for the Talents by the Ministry of Education of China (grant no. NCET-13-0084) and the Fundamental Research Funds for the Central Universities (grant no. DUT13JB09 and DUT14YQ213). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Declarations**

Publication of this article was funded by the following grants: the Natural Science Foundation of China (grant no. 61070098, 61272373 and 61340020), Trans-Century Training Programme Foundation for the Talents by the Ministry of Education of China (grant no. NCET-13-0084) and the Fundamental Research Funds for the Central Universities (grant no. DUT13JB09 and DUT14YQ213).

This article has been published as part of *BMC Systems Biology* Volume 8 Supplement 3, 2014: IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Systems Biology Approaches to Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S3.

## Additional information

### Competing interests

The authors have declared that no competing interests exist.

### Authors' contributions

ZHY and NT conceived of the study, carried out its design and drafted the manuscript. FYY and NT performed the experiments. FYY, HFL, JW and ZWY participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

## About this article

### Cite this article

Yu, F.Y., Yang, Z.H., Tang, N. *et al.* Predicting protein complex in protein interaction network - a supervised learning based method.
*BMC Syst Biol* **8**, S4 (2014) doi:10.1186/1752-0509-8-S3-S4

### Keywords

- Protein-protein interaction network
- Protein complexes
- Gene Ontology
- Supervised learning