PPSampler2: Predicting protein complexes more accurately and efficiently by sampling
- Chasanah Kusumastuti Widita^{1} and
- Osamu Maruyama^{2}
https://doi.org/10.1186/1752-0509-7-S6-S14
© Widita and Maruyama; licensee BioMed Central Ltd. 2013
Published: 13 December 2013
Abstract
Predicting the sets of components of heteromeric protein complexes is a challenging problem in Systems Biology, and many tools have been proposed for it. Among them, PPSampler, a protein complex prediction algorithm based on the Metropolis-Hastings algorithm, is reported to outperform other tools. In this work, we improve PPSampler by refining the scoring functions and the proposal distribution used inside the algorithm, so that the predicted clusters are more accurate and the resulting algorithm runs faster. The new version is called PPSampler2. In computational experiments, PPSampler2 is shown to outperform other tools, including PPSampler. The F-measure score of PPSampler2 is 0.67, which is at least 26% higher than those of the other tools. In addition, about 82% of the predicted clusters that are unmatched with any known complexes are statistically significant on the biological process aspect of Gene Ontology. Furthermore, the running time is reduced to twenty minutes, which is 1/24 of that of PPSampler.
Background
Protein complexes are essential molecular entities in the cell because the intrinsic functions of an individual protein are often performed in the form of a protein complex. Thus, identifying all protein complexes of an organism helps to elucidate the molecular mechanisms underlying biological processes. However, reliable protein complex purification experiments are rather laborious and time-consuming. Computational prediction methods are therefore expected to provide reliable candidates for true protein complexes.
Most computational approaches to predicting the components of protein complexes are based on the observation that densely connected subgraphs in a protein-protein interaction (PPI) network often overlap with known protein complexes. One difference among those methods is the search strategy used to find good clusters of proteins. For example, MCL [1] is a clustering algorithm in which clusters are formed by repeatedly executing an inflation step and a random walk step. RRW [2] and NWE [3] execute random walks with restarts and generate predicted protein clusters using the resulting stationary probabilities of the random walks. Note that the stationary probability from one protein to another is likely to be high when both lie within a densely connected subgraph. COACH [4] finds extremely dense subgraphs, called cores, and predicts protein clusters by extending the cores with additional proteins outside them.
Our previous method, PPSampler [5], is based on the Metropolis-Hastings algorithm, in which a partition of all proteins is generated as a sample according to a probability distribution specified by a scoring function of a partition of all proteins. The entire scoring function consists of three scoring functions denoted by f_{1}, f_{2}, and f_{3}. The main part of f_{1} is equivalent to the total sum of the PPI weights within predicted clusters of size two or more. The second scoring function, f_{2}, is based on the constraint that the frequency of sizes of predicted clusters obeys a power-law distribution. This constraint is derived from the observation that the frequency of sizes of known complexes obeys a power-law distribution in CYC2008 [6] and CORUM [7], which are databases of protein complexes of yeast and human, respectively. Thus f_{2} evaluates the difference between a given power-law distribution and the distribution of sizes of clusters in a partition. The third scoring function, f_{3}, is the gap between the number of proteins within predicted clusters of size two or more and a target value of that number. It should be noted that f_{2} and f_{3} can be considered to be regularization terms that encourage sparse structures (see, for example, [8] for sparse structure). PPSampler is reported to outperform other prediction methods, including MCL, RRW, NWE, and COACH, in our previous work [5]. In particular, the F-measure score of PPSampler is 0.536, which is at least 30% better than those of the other methods.
In this paper, we first improve the scoring functions, f_{1}, f_{2}, and f_{3}, of PPSampler in order to predict protein complexes more accurately. The first scoring function, f_{1}, is refined by replacing the sum of the weights of PPIs within a cluster with a generalized density of the cluster. The remaining scoring functions, f_{2} and f_{3}, are also newly modeled, using Gaussian distributions. The resulting scoring functions are called g_{1}, g_{2}, and g_{3}, respectively. Notice that g_{2} and g_{3} are also regularization terms that encourage sparse structures. Secondly, the new entire scoring function is formulated as the negative of the sum of g_{1}, g_{2}, and g_{3}, whereas that of PPSampler is the negative of the product of f_{1}, f_{2}, and f_{3}. Lastly, the proposal distribution, which proposes a candidate state given a current state, is also improved to enable a more efficient random walk over the states. Note that the second and third modifications enable the algorithm to run faster.
The resulting method is called PPSampler2. Hereafter PPSampler is called PPSampler1 to distinguish clearly between it and PPSampler2. The F-measure score of PPSampler2 is 26% higher than that of PPSampler1. In addition, about 82% of the predicted clusters that are unmatched with any known complexes are statistically significant on the biological process aspect of Gene Ontology. Furthermore, the running time is drastically reduced from eight hours to twenty minutes. Interestingly, it turns out that the two new scoring functions of g_{1} and g_{3} make g_{2}, the scoring function based on a power-law distribution, unnecessary in the sense that, without g_{2}, PPSampler2 always returns almost the same results as with g_{2}. This would be due to the effect of the generalized density of g_{1}.
Methods
Search by sampling
where T is a temperature parameter. Note that the higher the probability of a state, the lower its score. Thus, by sampling, a state with a minimized score can be found. To exploit the Metropolis-Hastings algorithm, in addition to D and f(C), a proposal distribution, denoted by Q(C′|C), which is a probability distribution of C′ ∈ D given C ∈ D, must also be specified. These are formulated for PPSampler2 in the subsequent sections, after the PPI dataset is defined.
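The acceptance step of such a sampler can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the target distribution has the Boltzmann form P(C) ∝ exp(−f(C)/T), which is consistent with the description above but is not stated explicitly there, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(f_cur, f_new, q_forward, q_backward, T):
    """Metropolis-Hastings acceptance probability for a move C -> C'.

    Assumption of this sketch: P(C) is proportional to exp(-f(C) / T).
    f_cur = f(C), f_new = f(C'), q_forward = Q(C' | C), q_backward = Q(C | C').
    """
    if f_new == math.inf:
        # A candidate with infinite score has probability zero.
        return 0.0
    # Work in log space so that a tiny temperature does not overflow.
    log_ratio = (-(f_new - f_cur) / T
                 + math.log(q_backward) - math.log(q_forward))
    if log_ratio >= 0.0:
        return 1.0
    return math.exp(log_ratio)


def accept(f_cur, f_new, q_forward, q_backward, T, rng=random):
    """Draw the accept/reject decision for one proposed move."""
    p = acceptance_probability(f_cur, f_new, q_forward, q_backward, T)
    return rng.random() < p
```

With a very small temperature, a move that raises the score is almost never accepted, so the walk behaves close to a greedy search over the states.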
Weighted protein-protein interaction network
A PPI network is often used as an input to protein complex prediction tools. It can be defined as an undirected graph, G = (V, E), where V is a set of proteins under consideration, and E is a subset of V × V \ {{u, u}|u ∈ V}, representing a set of PPIs. Notice that any self-interactions, {u, u}, are excluded in E. Suppose that each PPI, e, has a weight, w(e) ∈ ℝ_{+}, representing the reliability of the interaction of e. Note that the higher the weight of an interaction, the more reliable the interaction. For a pair of proteins, e, not in E, the weight of e is defined as w(e) = 0.
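One possible in-memory representation of such a weighted network is sketched below. The names `build_weights` and `weight` are hypothetical; the sketch only encodes the three properties stated above: interactions are unordered, self-interactions are excluded, and a pair not in E has weight zero.

```python
def build_weights(triples):
    """Build a weight table from (protein, protein, weight) triples.

    Each interaction is stored under an unordered pair, and
    self-interactions {u, u} are dropped.
    """
    weights = {}
    for u, v, w in triples:
        if u == v:
            continue  # self-interactions are excluded from E
        weights[frozenset((u, v))] = float(w)
    return weights


def weight(weights, u, v):
    """Return w({u, v}), which is 0 for any pair that is not in E."""
    return weights.get(frozenset((u, v)), 0.0)
```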
States
An element of C is called a cluster of proteins. All partitions of V are states in D. In the subsequent section, the formulation of the score of C is given explicitly.
Scoring functions
In our previous work [5], the entire scoring function, which is denoted by f ′(C) in this paper, for a partition C is formulated as the negative of the product of three different scoring functions, f_{1}(C), f_{2}(C), and f_{3}(C), i.e., f ′(C) = −f_{1}(C)·f_{2}(C)·f_{3}(C). These three scoring functions are formulated as follows. The first scoring function, f_{1}, returns the total sum of the PPI weights within predicted clusters of size two or more in C. The second scoring function, f_{2}, evaluates the difference between a given power-law distribution and the distribution of sizes of clusters in C. The third scoring function, f_{3}, is the gap between the number of proteins within predicted clusters of size two or more in C and a target value of that number.
As can be seen, f is changed from the product of three terms used in the previous work to a sum of three terms. The motivation is to increase the acceptance rate of proposed states. For current and candidate states, C and C′, the term −(f(C′) − f(C)), which is calculated in the Metropolis-Hastings algorithm, can be expected to be higher than −(f ′(C′) − f ′(C)) due to the difference between the forms of f and f ′.
The three new scoring functions, g_{1}(C), g_{2}(C) and g_{3}(C), use the same source data as f_{1}(C), f_{2}(C) and f_{3}(C), respectively, but are refined in the following way.
Scoring function g_{1}(C)
where N is a parameter specifying the upper bound on the size of a cluster in C. The above function, g_{1}(c), can be interpreted as follows. If c is of size one, g_{1}(c) is set to be zero. This means that c has no influence on g_{1}(C). Next, g_{1}(c) is negative infinity if the size of c is greater than N, or if c includes a protein which has no interactions with the other proteins in c. In this case, P(C) goes to zero. Otherwise, g_{1}(c) is equal to the total sum of the weights of all interactions within c divided by the positive square root of the size of c.
Note that the scoring function of the previous work [5] corresponding to g_{1}(c) is f_{1}(c). The difference between g_{1}(c) and f_{1}(c) appears only in the last case of the three cases, in which the score of a cluster, c, is formulated as $\sum_{u,v(\ne u)\in c} w(u,v)$ in the previous work. If it is furthermore divided by the factor of $\sqrt{|c|}$, the resulting term is equivalent to the scoring function defined above, $g_{1}(c) = \sum_{u,v(\ne u)\in c} \frac{w(u,v)}{\sqrt{|c|}}.$
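The three cases of g_{1}(c) described above can be sketched in code as follows. This is an illustrative reading of the text, not the authors' code. It assumes that each interaction is counted once per unordered pair (counting ordered pairs would only double every finite value), and that `weights` is a hypothetical dictionary mapping unordered protein pairs to PPI weights.

```python
import math
from itertools import combinations


def g1_cluster(cluster, weights, N=100):
    """Score one cluster c following the three cases described for g_1(c).

    weights maps frozenset({u, v}) to w(u, v); missing pairs have weight 0.
    N is the upper bound on the cluster size.
    """
    members = sorted(cluster)
    size = len(members)
    if size == 0:
        raise ValueError("a partition does not contain empty clusters")
    if size == 1:
        return 0.0  # singletons do not contribute to g_1(C)
    if size > N:
        return -math.inf  # cluster too large
    total = 0.0
    has_partner = {u: False for u in members}
    for u, v in combinations(members, 2):
        w = weights.get(frozenset((u, v)), 0.0)
        if w > 0.0:
            total += w
            has_partner[u] = True
            has_partner[v] = True
    if not all(has_partner.values()):
        # Some protein interacts with no other protein in the cluster.
        return -math.inf
    return total / math.sqrt(size)
```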
The new scoring function, g_{1}(c), can be considered to be a density measure. Density measures have been used in many previous works to infer protein complexes. For example, Wu et al. [4] use a typical density measure, $x/\frac{|c|(|c|-1)}{2}$, where x is the number of interactions within c. Its denominator, $\frac{|c|(|c|-1)}{2}$, is the maximum possible number of edges in a subgraph with |c| nodes. Note that because the PPI network used in their work is supposed to be unweighted, the numerator is just the number of edges in c. However, it can be observed that the larger a cluster, the relatively lower the value of the above measure. Namely, the larger a cluster, the more severely it is evaluated. Feng et al. [11] eased this peculiarity by adopting |c| as the denominator instead of $\frac{|c|(|c|-1)}{2}$, so that the resulting measure is $\frac{x}{|c|}$. In this work, in addition to the denominators mentioned above, which are $\frac{|c|(|c|-1)}{2}$ and |c|, more gradual functions, $\sqrt{|c|}$ and log_{2} |c|, have also been evaluated. It turns out that $\sqrt{|c|}$ and log_{2} |c| give similar F-measure scores, which are higher than those of the others. Thus, $\sqrt{|c|}$ is selected as the denominator in PPSampler2.
Scoring function g_{2}(C)
where γ is the power-law parameter and its default value is set to be 2. This default value is an approximation of the value, 2.02, of the regression curve obtained from the relative frequency of CYC2008 complexes of size i = (2, 3, . . . , N ) by minimizing the sum of squared errors at sizes i. The sum of the squared errors is small (0.0014).
Scoring function g_{3}(C)
In this work, $\sigma_{3}^{2}$ is set to be 10^{6}.
Proposal distribution
A proposal distribution, Q(C′|C), provides the transition probability to a candidate state C′∈ D given a current state C ∈ D. The proposal distribution we use here is obtained by improving that of PPSampler1. The differences between them will be pointed out in the following explanation of our new proposal distribution.
At first, a protein, u ∈ V, is chosen uniformly at random. Thus, the probability of choosing u is $\frac{1}{|V|}$. The randomly chosen protein u is removed from the cluster including u in C, and then the destination of u is determined by the following probabilistic procedure. As a result, a conditional probability is associated with the resulting state.
The value of β is set to be β = 1/100 as in [5].
Notice that the reduction of the running time of PPSampler2 is realized by the combination of two factors. One is that the scoring function, f, is changed from the product of three terms to their sum. The other is that the new proposal distribution proposes states which are likely to have higher probabilities.
Initial state
The initial state is the same as that of PPSampler1, which is the following partition. Let u and v be the pair of proteins with the highest PPI weight among all of the given PPI weights. Then the cluster consisting only of u and v is created. In addition to it, each of the remaining proteins forms a singleton cluster. It is trivial that the probability of this state is not zero.
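The construction of this initial partition can be sketched as follows. The function name and the dictionary `weights`, which maps unordered protein pairs to PPI weights, are hypothetical, and ties for the highest weight are broken arbitrarily in this sketch.

```python
def initial_partition(proteins, weights):
    """Initial state: the heaviest pair forms one cluster, the rest are singletons.

    weights maps frozenset({u, v}) to the PPI weight w(u, v).
    """
    best_pair = max(weights, key=weights.get)  # pair with the highest weight
    clusters = [frozenset(best_pair)]
    clusters.extend(frozenset([p]) for p in proteins if p not in best_pair)
    return clusters
```

Every protein appears in exactly one cluster, and the only non-singleton cluster contains an interacting pair, so its score is finite.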
Output of PPSampler2
PPSampler2 returns as output the state, C, with the highest probability among all the states sampled. After removing all the clusters of size one in C, the remaining clusters are all treated as predicted complexes.
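A minimal sketch of this output step is given below. It assumes that the sampler has recorded each visited state together with its score, and it uses the fact that the highest probability corresponds to the lowest score; the function name and the data layout are hypothetical.

```python
def predicted_complexes(sampled_states):
    """Pick the best sampled partition and drop its singleton clusters.

    sampled_states is an iterable of (score, partition) pairs, where a
    partition is a list of sets of proteins. The lowest score corresponds
    to the highest probability.
    """
    _, best_partition = min(sampled_states, key=lambda state: state[0])
    return [cluster for cluster in best_partition if len(cluster) >= 2]
```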
Matching statistics
Performance measures
Thus, it is one if s and t are identical to each other. We say that s and t are matched if ov(s, t) ≥ η, where η is a predefined threshold. Notice that, if s and t share two or more proteins, the overlap ratio is equal to the ratio of the number of common proteins between s and t to the geometric mean of the sizes of s and t.
On the other hand, if s and t share fewer than two proteins, the overlap ratio is defined as zero. The reason is as follows. The typical value of η in the literature is $\sqrt{0.2} \approx 0.4472$ (see, for example, [12]). With this threshold, if s and t are both of size two and share only one protein, they would be determined to be matched without this rule because $\frac{1}{\sqrt{2 \cdot 2}} = 0.5 > 0.4472$. Notice that this case tends to happen by chance. Suppose that there are many known complexes of size two. In this situation, by predicting many clusters of size two, a known complex of size two can be matched with such a predicted cluster by sharing only one protein. The overlap ratio defined here is designed to avoid this unfavorable situation. Note that η is set to be $\sqrt{0.2}$ in this work.
Notice that clusters of size one are not counted at all in these matching statistics. Hereafter, a predicted cluster of any tool means a set of two or more proteins predicted as a protein complex.
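The overlap ratio and the matching rule described above can be written as a short sketch. The function names are hypothetical; the definition follows the text, with the ratio set to zero when fewer than two proteins are shared.

```python
import math

ETA = math.sqrt(0.2)  # threshold used in this work


def overlap_ratio(s, t):
    """Overlap ratio ov(s, t) between two sets of proteins."""
    s, t = set(s), set(t)
    common = len(s & t)
    if common < 2:
        return 0.0  # sharing fewer than two proteins never counts
    return common / math.sqrt(len(s) * len(t))


def is_matched(s, t, eta=ETA):
    """s and t are matched if their overlap ratio is at least eta."""
    return overlap_ratio(s, t) >= eta
```

For two size-two sets that share one protein, the plain ratio would be 0.5 and would pass the threshold, whereas this definition returns zero and the pair is not matched.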
Statistical significance by Gene Ontology
The Gene Ontology (GO) provides a unified representation of gene and gene product attributes across all species [13]. Thus, GO is often exploited to find some biological coherence of a newly found group of proteins, like functional modules and protein complexes. For a predicted cluster, if a more specific GO term annotates more proteins in the predicted cluster, the term would be a better biological characterization of the cluster.
where c contains b proteins in the set, M, of proteins annotated by t, and V is the set of all proteins in the whole PPI network [14]. In this work, the p-values (with Bonferroni correction) of predicted clusters are calculated by the tool, "Generic gene ontology (GO) term finder" (http://go.princeton.edu/cgi-bin/GOTermFinder), whose implementation depends on GO::TermFinder [15]. The p-value cutoff used in this work is set to be the default value of 0.01.
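For illustration, the uncorrected p-value of one GO term for one cluster can be computed as below. This sketch assumes the standard hypergeometric tail probability, i.e., the probability that a random set of |c| proteins drawn from V contains at least b proteins of M; the exact formula used by the tool is not reproduced above, and the Bonferroni correction is not included. The function and parameter names are hypothetical.

```python
from math import comb


def go_enrichment_p_value(n_total, n_annotated, cluster_size, b):
    """Hypergeometric tail P(X >= b) for one GO term and one cluster.

    n_total      = |V|, all proteins in the PPI network
    n_annotated  = |M|, proteins annotated by the term t
    cluster_size = |c|
    b            = number of proteins of c that are in M
    """
    denominator = comb(n_total, cluster_size)
    upper = min(cluster_size, n_annotated)
    tail = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, cluster_size - i)
        for i in range(b, upper + 1)
    )
    return tail / denominator
```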
Result
In this section, we report the results of performance comparison, carried out in a similar way as [5], of PPSampler2 with the following public tools, MCL [16], MCODE [12], DPClus [17], CMC [18], COACH [4], RRW [2], NWE [3], and PPSampler1 [5]. The outputs of these algorithms are evaluated by the known protein complexes of CYC2008 and GO terms.
Materials
The set of all PPIs with their weights in WI-PHI [19] is given as input to the above algorithms. It contains 49607 non-self-interactions among 5953 proteins (393 self-interactions are excluded). The average degree of the proteins is 16.7. Every interaction is assigned a weight representing its reliability. The weight of an interaction is determined from datasets derived from high-throughput assays, including tandem affinity purification coupled to mass spectrometry (TAP-MS) and the yeast two-hybrid system, and a literature-curated physical interaction dataset, which is used as a benchmark set. The log-likelihood of each dataset is calculated with the benchmark set. The weight of an interaction is formulated as the sum, over those datasets, of the product of the socio-affinity index [20] of the interaction on a dataset and the log-likelihood of the dataset. The resulting weights range from 6.6 to 146.6. The higher the weight of an interaction, the more reliable the interaction. Note that among the above algorithms, MCL, RRW, NWE, PPSampler1, and PPSampler2 exploit the weights, and the others do not.
The gold standard dataset of known complexes used here is the set of complexes in the CYC2008 database [6]. This database has 408 curated heteromeric protein complexes of S. cerevisiae. It is pointed out in our previous work [5] that among those complexes, 172 (42%) and 87 (21%) are hetero-dimeric and trimeric complexes, respectively.
Configuration setting
Default parameter values of PPSampler2.
Parameter | Notation and value |
---|---|
Temperature | T = 10^{−9} |
Number of iterations | L = 2 × 10^{6} |
Maximum cluster size | N = 100 |
Probability of making a new single cluster | β = 0.01 |
Parameters of g_{2} | γ = 2 |
 | $\sigma_{2,i}^{2} = 1000 \times 1.1^{-i}$ |
Parameters of g_{3} | λ = 2000 |
 | $\sigma_{3}^{2} = 10^{6}$ |
Performance comparison for all predicted clusters
The matching results of all predicted clusters.
Measure | MCL | MCODE | DPClus | CMC | COACH | RRW | NWE | PPSampler1 | PPSampler2 |
---|---|---|---|---|---|---|---|---|---|
#protein | 5869 | 2432 | 4888 | 5868 | 4094 | 4240 | 1626 | 2001 | 2009.90 ± 0.30 |
#cluster | 880 | 156 | 925 | 978 | 1353 | 1984 | 720 | 350 | 402.10 ± 5.20 |
Avg. size | 6.67 | 15.59 | 6.91 | 20.65 | 13.29 | 2.14 | 2.26 | 5.72 | 5.00 ± 0.06 |
N_{pc} | 206 | 27 | 192 | 79 | 416 | 196 | 204 | 188 | 248.40 ± 2.37 |
N_{kc} | 246 | 31 | 219 | 84 | 253 | 204 | 212 | 218 | 302.70 ± 3.00 |
The (average) precision score of PPSampler2 is 0.618, which is 15% higher than the second best, given by PPSampler1 (0.537). In addition, the third best is 0.307, achieved by COACH, which is only 50% of the best. Thus, PPSampler2 outperforms the other algorithms in precision.
In recall, PPSampler2 outperforms the others, too. The recall score is 0.742, followed by 0.620 and 0.603 given by COACH and MCL, respectively. Thus, the best score is 20% and 23% higher than them, respectively. Note that the recall score of PPSampler2 is 39% higher than that of PPSampler1, 0.534.
In F-measure, PPSampler2 achieves the highest score, 0.674. It is 26% higher than the second highest, 0.536, given by PPSampler1. Note that PPSampler1 needs about eight hours to achieve that F-measure score. On the other hand, PPSampler2 can obtain its F-measure score in twenty minutes. Namely, PPSampler2 runs 24 times faster than PPSampler1. Thus, PPSampler2 is superior to PPSampler1 in prediction accuracy as well as running time. Furthermore, the third highest F-measure score, achieved by COACH, is 0.411. Thus the F-measure of PPSampler2 is 64% higher than it, which shows by how much PPSampler2 outperforms the others.
It would be interesting to see which known complexes are successfully detected by PPSampler2. All of the known protein complexes perfectly detected by PPSampler2 and not by the other tools are extracted. For each of those known protein complexes, the best overlap ratio obtained by each algorithm is given in Additional file 1. The number of such complexes is 35, and their sizes range widely from 2 to 25. Interestingly, MCL finds all of these complexes approximately, except the first one. This can be related to the common feature of MCL and PPSampler2 that the structure of their solutions is modeled as a partition of all proteins.
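The reported scores can be recomputed from the counts in the table of all predicted clusters. The sketch below assumes that precision is N_{pc} divided by the number of predicted clusters, that recall is N_{kc} divided by the 408 known complexes, and that the F-measure is their harmonic mean; for PPSampler2 it uses the averaged counts, which is an approximation of averaging the scores over runs. The function name is hypothetical.

```python
def precision_recall_f(n_pc, n_clusters, n_kc, n_known):
    """Precision, recall and F-measure from matching counts."""
    precision = n_pc / n_clusters
    recall = n_kc / n_known
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from the table of all predicted clusters (408 known complexes).
rows = {
    "PPSampler2": (248.40, 402.10, 302.70),
    "PPSampler1": (188, 350, 218),
    "COACH": (416, 1353, 253),
}
for name, (n_pc, n_clusters, n_kc) in rows.items():
    p, r, f = precision_recall_f(n_pc, n_clusters, n_kc, 408)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Under these assumptions the recomputed values agree with the scores quoted in the text to three decimal places.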
Size-dependent performance comparisons
As mentioned before, 172 (42%) of the 408 curated heteromeric protein complexes in the CYC2008 database are heterodimeric protein complexes, and 87 (21%) of them are heterotrimeric protein complexes. In total, 259 (63%) of the 408 complexes are complexes of size two or three. They can be said to be the majority of the known protein complexes. Thus, the performance on those hetero-dimeric and trimeric complexes will be dominant in the performance on the set of the 408 complexes. The performance on those small-sized complexes is therefore evaluated. In addition, the performance on the remaining predicted clusters, i.e., those of size four or more, is also considered, because many prediction algorithms have been evaluated by known complexes of size four (or possibly three) or more (see, for example, [3, 21]). Thus, it is interesting to see how well PPSampler2 performs across these size ranges.
The performance measures specialized for this purpose are almost the same as the ones formulated in [5]. For the set, C, of all clusters predicted by an algorithm, we denote by C|_{i} the subset of C whose elements are of size i and by C|_{≥i} the subset of C whose elements are of size i or more. For the set, K, of all known protein complexes, we denote by K|_{i} the subset of K whose elements are of size i and by K|_{≥i} the subset of K whose elements are of size i or more. For each of the sizes i = 2 and 3, the precision and recall for size i are defined as precision$(C|_{i}, K, \sqrt{0.2})$ and recall$(C, K|_{i}, \sqrt{0.2})$, respectively. The corresponding F-measure is the harmonic mean of this precision and recall. In a similar way, the precision and recall for size four or more are defined as precision$(C|_{\ge 4}, K, \sqrt{0.2})$ and recall$(C, K|_{\ge 4}, \sqrt{0.2})$, respectively. Note that K is again set to be the set of all protein complexes in CYC2008.
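These size-restricted measures can be sketched as follows. The sketch assumes that precision(C, K, η) is the fraction of clusters in C matched with some complex in K and that recall(C, K, η) is the fraction of complexes in K matched with some cluster in C; the helper names are hypothetical.

```python
import math

ETA = math.sqrt(0.2)


def overlap_ratio(s, t):
    """Overlap ratio; zero when fewer than two proteins are shared."""
    common = len(set(s) & set(t))
    if common < 2:
        return 0.0
    return common / math.sqrt(len(set(s)) * len(set(t)))


def precision(C, K, eta=ETA):
    """Fraction of predicted clusters matched with some known complex."""
    if not C:
        return 0.0
    hits = sum(1 for c in C if any(overlap_ratio(c, k) >= eta for k in K))
    return hits / len(C)


def recall(C, K, eta=ETA):
    """Fraction of known complexes matched with some predicted cluster."""
    if not K:
        return 0.0
    hits = sum(1 for k in K if any(overlap_ratio(c, k) >= eta for c in C))
    return hits / len(K)


def restrict(sets, size, at_least=False):
    """Return the subset of size exactly `size`, or of `size` or more."""
    if at_least:
        return [s for s in sets if len(s) >= size]
    return [s for s in sets if len(s) == size]
```

For example, the precision for size two is `precision(restrict(C, 2), K)`, and the recall for size four or more is `recall(C, restrict(K, 4, at_least=True))`.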
The matching results on size two.
Measure | MCL | MCODE | DPClus | CMC | COACH | RRW | NWE | PPSampler1 | PPSampler2 |
---|---|---|---|---|---|---|---|---|---|
#protein | 462 | 6 | 2 | 12 | 0 | 3648 | 1264 | 258 | 219.20 ± 6.58 |
#cluster | 231 | 3 | 1 | 6 | 0 | 1824 | 632 | 129 | 109.60 ± 3.29 |
N_{pc} | 7 | 0 | 0 | 0 | 0 | 122 | 129 | 39 | 45.30 ± 1.90 |
N_{kc} | 79 | 5 | 58 | 32 | 71 | 60 | 83 | 57 | 103.20 ± 3.03 |
The matching results on size three.
Measure | MCL | MCODE | DPClus | CMC | COACH | RRW | NWE | PPSampler1 | PPSampler2 |
---|---|---|---|---|---|---|---|---|---|
#protein | 456 | 162 | 120 | 616 | 60 | 309 | 162 | 180 | 247.80 ± 13.56 |
#cluster | 152 | 54 | 40 | 216 | 20 | 103 | 54 | 60 | 82.60 ± 4.52 |
N_{pc} | 13 | 6 | 3 | 5 | 0 | 45 | 43 | 32 | 47.60 ± 4.18 |
N_{kc} | 55 | 4 | 38 | 16 | 52 | 48 | 50 | 48 | 69.80 ± 0.60 |
The matching results on size four or more.
Measure | MCL | MCODE | DPClus | CMC | COACH | RRW | NWE | PPSampler1 | PPSampler2 |
---|---|---|---|---|---|---|---|---|---|
#protein | 4951 | 2264 | 4799 | 5795 | 4052 | 283 | 200 | 1563 | 1542.90 ± 15.27 |
#cluster | 497 | 99 | 884 | 756 | 1333 | 57 | 34 | 161 | 209.90 ± 2.81 |
Avg. size | 9.96 | 22.87 | 7.09 | 25.84 | 13.45 | 5.00 | 5.88 | 9.71 | 7.35 ± 0.10 |
N_{pc} | 186 | 21 | 189 | 74 | 416 | 29 | 32 | 117 | 155.50 ± 4.36 |
N_{kc} | 112 | 22 | 123 | 36 | 130 | 96 | 79 | 113 | 129.70 ± 1.90 |
Evaluation by Gene Ontology
It is reasonable to suppose that the list of protein complexes recorded in databases is still incomplete, which implies that there are potential protein complexes not yet recorded. Under this assumption, clusters that are statistically significant by GO and unmatched with any known complexes are good candidates for potential protein complexes.
Statistically significant clusters unmatched with any known complexes.
Measure | MCL | MCODE | DPClus | CMC | COACH | RRW | NWE | PPSampler1 | PPSampler2 |
---|---|---|---|---|---|---|---|---|---|
#unmat. | 674 | 129 | 733 | 899 | 937 | 1788 | 516 | 162 | 158 |
BP | 203 | 41 | 220 | 356 | 611 | 322 | 215 | 107 | 130 |
CC | 137 | 33 | 155 | 267 | 530 | 214 | 142 | 70 | 80 |
MF | 150 | 34 | 163 | 276 | 510 | 191 | 127 | 75 | 87 |
Random clusters
The evaluation of randomly generated clusters of proteins is also carried out to see how meaningful the evaluation of clusters predicted from the original PPIs is. A random partition of all proteins is generated by shuffling the input PPIs in the following way. The text file of WI-PHI has three columns, corresponding to an interactor, the other interactor, and their weight. Each column is permuted randomly. PPSampler2 is applied to the resulting random PPIs with the default parameter set. This process is repeated three times, and the performance scores are almost the same. Thus, one of the runs is picked and summarized as follows.
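The column-wise shuffling can be sketched as follows, assuming the PPIs have been read into a list of (interactor, interactor, weight) rows. The function name and the `seed` parameter are hypothetical. The sketch does not treat self-interactions or duplicate pairs that the shuffling may create, since their handling is not described above.

```python
import random


def shuffle_columns(rows, seed=None):
    """Permute each of the three columns independently.

    rows is a list of (interactor_a, interactor_b, weight) tuples.
    """
    rng = random.Random(seed)
    col_a = [r[0] for r in rows]
    col_b = [r[1] for r in rows]
    col_w = [r[2] for r in rows]
    for column in (col_a, col_b, col_w):
        rng.shuffle(column)  # in-place random permutation of one column
    return list(zip(col_a, col_b, col_w))
```

Each column keeps exactly the same values as before, so the set of proteins and the weight distribution are preserved while the pairing between them is randomized.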
The number of proteins within predicted clusters of size two or more is 2012. The number of those predicted clusters is 731, and among them only one cluster is matched with a known complex. Thus, the precision score is 0.001. The number of complexes matched with some predicted clusters is also one. Then the recall score is 0.002, and the resulting F-measure score is 0.002. The numbers of statistically significant clusters on the GO aspects, biological process, cellular component, and molecular function are 53, 25, and 34, respectively, and their fractions to the number of predicted clusters are 7.3, 3.4, and 4.7%, respectively. Thus, these results imply that the predicted clusters from the original PPIs of WI-PHI are very meaningful.
Robustness
In this section, we assess the robustness of PPSampler2 with respect to some of its important parameters.
PPSampler2 without regulation of frequency of sizes of predicted clusters
Suppose that, instead of $\sqrt{|c|}$, the denominator were $\frac{|c|(|c|-1)}{2}$, which is the total number of possible interactions within c. Under this assumption, g_{1}(c) is equivalent to the mean of the weights of all pairs of proteins within c. This implies that a cluster of size two whose interaction has a higher weight than the neighboring interactions is likely to be formed, because if a neighboring protein is added to such a size-two cluster, the averaged weight becomes lower than that of the size-two cluster. Thus, if the denominator is $\frac{|c|(|c|-1)}{2}$, g_{1} cannot make the frequency of sizes of predicted clusters obey a power-law distribution. On the other hand, if the denominator of g_{1} is $\sqrt{|c|}$, clusters are allowed to be larger to some extent. Namely, even if the weights of neighboring interactions are lower than that of the interaction of a size-two cluster, g_{1} can become larger by adding other proteins. This would be the mechanism for finding a set of clusters whose size distribution is a power-law distribution.
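A small numerical example illustrates this argument. The weights below are hypothetical values chosen within the range of WI-PHI weights: a size-two cluster {u, v} with w(u, v) = 100, and a neighboring protein x with w(u, x) = w(v, x) = 30.

```python
import math


def density(total_weight, size, denominator):
    """Total interaction weight of a cluster divided by denominator(size)."""
    return total_weight / denominator(size)


def all_pairs(size):
    """Number of possible interactions within a cluster of the given size."""
    return size * (size - 1) / 2


pair_weight = 100.0              # hypothetical w(u, v)
neighbour_weights = [30.0, 30.0]  # hypothetical w(u, x) and w(v, x)

for name, denom in (("|c|(|c|-1)/2", all_pairs), ("sqrt(|c|)", math.sqrt)):
    before = density(pair_weight, 2, denom)
    after = density(pair_weight + sum(neighbour_weights), 3, denom)
    print(f"{name}: size two {before:.2f}, size three {after:.2f}")
```

With the denominator |c|(|c|−1)/2 the score drops from 100.00 to 53.33 when x is added, so the pair stays as it is, whereas with $\sqrt{|c|}$ the score rises from 70.71 to 92.38, so growing the cluster is favored.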
Conclusions
We have proposed a new protein complex prediction method, PPSampler2, by improving the scoring functions and proposal distribution of PPSampler1. The performance of PPSampler2 is superior to that of the other methods. In particular, 92% of the predicted clusters are either matched with known complexes or statistically significant on the biological process aspect of GO. Namely, most of the clusters predicted by PPSampler2 are biologically reliable. Thus, PPSampler2 is useful for finding good candidates for potential protein complexes.
Declarations
Acknowledgements
The authors would like to thank anonymous reviewers for their valuable comments and suggestions, which were helpful in improving the paper.
The publication fee was funded by the Institute of Mathematics for Industry at Kyushu University.
This article has been published as part of BMC Systems Biology Volume 7 Supplement 6, 2013: Selected articles from the 24th International Conference on Genome Informatics (GIW2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.
References
- Vlasblom J, Wodak S: Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009, 10: 99. 10.1186/1471-2105-10-99.
- Macropol K, Can T, Singh A: RRW: Repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics. 2009, 10: 283. 10.1186/1471-2105-10-283.
- Maruyama O, Chihara A: NWE: Node-weighted expansion for protein complex prediction using random walk distances. Proteome Science. 2011, 9 (Suppl 1): S14. 10.1186/1477-5956-9-S1-S14.
- Wu M, Li X, Kwoh C, Ng S: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009, 10: 169. 10.1186/1471-2105-10-169.
- Tatsuke D, Maruyama O: Sampling strategy for protein complex prediction using cluster size frequency. Gene. 2013, 518: 152-158. 10.1016/j.gene.2012.11.050.
- Pu S, Wong J, Turner B, Cho E, Wodak S: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009, 37: 825-831. 10.1093/nar/gkn1005.
- Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Res. 2010, 38: D497-D501. 10.1093/nar/gkp914.
- Liu Q, Ihler A: Learning scale free networks by reweighted ℓ1 regularization. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). 2011, 40-48.
- Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970, 57: 97-109. 10.1093/biomet/57.1.97.
- Liu JS: Monte Carlo strategies in scientific computing. 2008, Springer, New York.
- Feng J, Jiang R, Jiang T: A Max-Flow-Based Approach to the Identification of Protein Complexes Using Protein Interaction and Microarray Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011, 8: 621-634.
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2. 10.1186/1471-2105-4-2.
- The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Li X, Wu M, Kwoh CK, Ng SK: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 2010, 11 (Suppl 1): S3. 10.1186/1471-2164-11-S1-S3.
- Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004, 20: 3710-5. 10.1093/bioinformatics/bth456.
- Enright A, Dongen SV, Ouzounis C: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30: 1575-1584. 10.1093/nar/30.7.1575.
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7: 207. 10.1186/1471-2105-7-207.
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25: 1891-1897. 10.1093/bioinformatics/btp311.
- Kiemer L, Costa S, Ueffing M, Cesareni G: WI-PHI: A weighted yeast interactome enriched for direct physical interactions. Proteomics. 2007, 7: 932-943. 10.1002/pmic.200600448.
- Gavin AC: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.
- Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinform Comput Biol. 2008, 6: 435-466. 10.1142/S0219720008003497.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.