In this section, we describe our approach for predicting small protein complexes, which consists of two stages: first, size-specific supervised weighting (SSS) of the PPIs; second, extracting small complexes from this weighted PPI network.
Size-specific supervised weighting (SSS) of the PPI network
SSS uses supervised learning to weight each edge of the reliable PPI network with two posterior probabilities, given the edge's features: the probability of being a small-co-complex edge (i.e. of belonging to a small complex), and the probability of being a large-co-complex edge. The features are derived from diverse data sources, from the topological characteristics of each source, and from an isolatedness feature computed after an initial calculation of the posterior probabilities. We first describe the data sources and features we use, then describe our weighting approach.
Data sources and features
We use three different data sources (PPI, functional association, and literature co-occurrence), together with their topological characteristics, as features. Each data source provides a list of scored protein pairs: a pair (a, b) with score s means that a is related to b with confidence s according to that data source. For both yeast and human, the following data sources are used:
• PPI : PPI data is obtained by taking the union of physical interactions from BioGRID [10], IntAct [11] and MINT [12] (data from all three repositories downloaded in January 2014). In addition, in yeast we also incorporate the widely-used Consolidated PPI dataset [13]. We unite these datasets, and score and filter the PPIs, using a simple reliability metric based on the Noisy-Or model to combine experimental evidence (also used in [14]). For each experimental detection method i, we estimate its reliability as the fraction of the interactions it detected in which both interacting proteins share at least one high-level cellular-component Gene Ontology term. The reliability of an interaction (a, b) is then estimated as:
rel(a, b) = 1 − ∏_{i ∈ E_{a,b}} (1 − rel_i)^{n_{i,a,b}}

where rel_i is the estimated reliability of experimental method i, E_{a,b} is the set of experimental methods that detected interaction (a, b), and n_{i,a,b} is the number of times that experimental method i detected interaction (a, b). The scores from the Consolidated dataset are discretized into ten equally-spaced bins (0−0.1, 0.1−0.2, . . .), each of which is considered as a separate experimental method in our scoring scheme. We avoid duplicate counting of evidence across the datasets by using their publication IDs (in particular, PPIs from the Consolidated dataset are removed from the BioGRID, IntAct, and MINT datasets).
• STRING : Predicted functional association data is obtained from the STRING database [15] (data downloaded in January 2014). STRING predicts each association between two proteins a and b (or their respective genes) using the following evidence types: gene co-occurrence across genomes; gene fusion events; gene proximity in the genome; homology; co-expression; physical interactions; co-occurrence in literature; and orthologs of the latter five evidence types transferred from other organisms. (STRING also includes evidence obtained from databases, which we discard, since it may include the very co-complex relationships we are trying to predict.) Each evidence type is associated with quantitative information (e.g. the number of gene fusion events), which STRING maps to a confidence score of functional association based on co-occurrence in KEGG pathways. The confidence scores of the different evidence types are then combined probabilistically to give a final functional association score for (a, b). Only pairs with score greater than 0.5 are kept.
• LIT : Co-occurrence of proteins or genes in PubMed literature (data downloaded in June 2012). Each pair (a, b) is scored by the Jaccard similarity of the sets of papers that a and b appear in:
LIT(a, b) = |A_a ∩ A_b| / |A_a ∪ A_b|

where A_x is the set of PubMed papers that contain protein x. For yeast, these are the papers that contain the gene name or open reading frame (ORF) ID of x as well as the word "cerevisiae"; for human, these are the papers that contain the gene name or UniProt ID of x as well as the words "human" or "sapiens".
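The Noisy-Or reliability score and the LIT Jaccard score above can be sketched as follows; the method names, reliabilities, and paper sets in the usage example are illustrative, not values from the paper.

```python
def noisy_or_reliability(evidence):
    """Noisy-Or combination of experimental evidence for one PPI (a, b).
    evidence maps each detection method i to (rel_i, n_iab), where rel_i is
    the method's estimated reliability and n_iab is the number of times it
    detected the interaction."""
    unreliability = 1.0
    for rel_i, n_iab in evidence.values():
        unreliability *= (1.0 - rel_i) ** n_iab
    return 1.0 - unreliability

def lit_score(papers_a, papers_b):
    """Jaccard similarity of the sets of PubMed papers mentioning a and b."""
    a, b = set(papers_a), set(papers_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Illustrative: an interaction detected twice by a method with reliability
# 0.4 and once by a method with reliability 0.7.
rel = noisy_or_reliability({"Y2H": (0.4, 2), "AP-MS": (0.7, 1)})
```

Each additional detection can only increase the combined reliability, which matches the Noisy-Or intuition that independent pieces of evidence accumulate.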
For each protein pair in each data source, we derive three topological features: degree (DEG), shared neighbors (SHARED), and neighborhood connectivity (NBC). For each data source, the edge weight used to calculate these topological features is the data source score of the edge.
• DEG : The degree of the protein pair (a, b), i.e. the sum of the scores of the outgoing edges from the pair:

DEG(a, b) = Σ_{x ∈ N_a} w(a, x) + Σ_{y ∈ N_b} w(b, y)

where w(x, y) is the data source score of edge (x, y), and N_a is the set of all neighbours of a, excluding a (similarly for N_b).
• NBC : The neighborhood connectivity of the protein pair (a, b), defined as the weighted density of all neighbors of the protein pair excluding the pair themselves. Here w(x, y) is the data source score of edge (x, y), N_{a,b} is the set of all neighbours of a and b (excluding a and b themselves), and λ is a dampening factor.
• SHARED : The extent of shared neighbors between the protein pair, derived using the Iterative AdjustCD function (with two iterations) [4].
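Of the three topological features, DEG has the simplest form and can be sketched directly from its definition. Whether the (a, b) edge itself contributes is not spelled out above; this sketch follows the literal reading, in which each endpoint's full neighbourhood is summed (so the internal edge, if present, is counted from both sides). The graph representation is an illustrative choice, not the paper's data structure.

```python
def deg(graph, a, b):
    """DEG feature for the protein pair (a, b).
    graph maps each protein to a dict {neighbour: data-source score}.
    Sums the scores of all edges incident to a plus all edges incident
    to b, per the definition DEG(a,b) = sum over N_a + sum over N_b."""
    return sum(graph.get(a, {}).values()) + sum(graph.get(b, {}).values())

# Illustrative toy network: a-b (1.0), a-c (0.5), b-d (0.2).
toy = {
    "a": {"b": 1.0, "c": 0.5},
    "b": {"a": 1.0, "d": 0.2},
    "c": {"a": 0.5},
    "d": {"b": 0.2},
}
```

For the toy network, DEG(a, b) = (1.0 + 0.5) + (1.0 + 0.2) = 2.7.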
This gives a total of twelve features: the three data source scores PPI, STRING, and LIT, and nine topological features (three for each data source): DEG_PPI, DEG_STRING, DEG_LIT, SHARED_PPI, SHARED_STRING, SHARED_LIT, NBC_PPI, NBC_STRING, and NBC_LIT. In addition, a feature called isolatedness is incorporated after an initial calculation of the posterior probabilities, as described below.
Size-specific supervised weighting of the PPI network (SSS)
In this step, we weight the edges of the PPI network with our size-specific supervised weighting (SSS) approach. We use a highly reliable subset of the PPI network, keeping only the top k edges with the highest PPI reliability scores; in our experiments we set k = 10000, but similar results are obtained for other values of k. SSS uses supervised learning to weight each edge with three scores, given the features of the edge: its posterior probability of being a small-co-complex edge (i.e. of belonging to a small complex), of being a large-co-complex edge, and of not being a co-complex edge. These features consist of the twelve features described above (PPI, STRING, LIT, and the nine topological features), as well as an isolatedness feature derived from an initial calculation of the posterior probabilities. We use a naive-Bayes maximum-likelihood model to derive the posterior probabilities.
Each edge (a, b) in the network is cast as a data instance, with its set of features F. Using a reference set of protein complexes, each edge (a, b) in the training set is given a class label lg-comp if both a and b are in the same large complex; it is labelled sm-comp if both a and b are in the same small complex; otherwise it is labelled non-comp. Learning proceeds by the following steps (illustrated in Figure 1):
1 Minimum description length (MDL) supervised discretization [16] is performed to discretize the features (excluding the isolatedness feature). MDL discretization recursively partitions the range of each feature so as to minimize the information entropy of the classes. If a feature cannot be discretized, no partition reduces the information entropy, and the feature is removed; this step thus also serves as a simple form of feature selection.
2 The maximum-likelihood parameters are learned for the three classes lg-comp, sm-comp, and non-comp:

P(F = f | sm-comp) = n_{sm,F=f} / n_{sm}
P(F = f | lg-comp) = n_{lg,F=f} / n_{lg}
P(F = f | non-comp) = n_{non,F=f} / n_{non}

for each discretized value f of each feature F (excluding the isolatedness feature). Here n_{sm} is the number of edges with class label sm-comp, and n_{sm,F=f} is the number of edges with class label sm-comp whose feature F has value f; n_{lg}, n_{lg,F=f}, n_{non}, and n_{non,F=f} are defined analogously for the classes lg-comp and non-comp.
3 Using the learned models, the class posterior probabilities are calculated for each edge (a, b) using the naive-Bayes formulation:

P(sm-comp | F_1 = f_1, . . ., F_k = f_k) ∝ P(sm-comp) ∏_{j=1}^{k} P(F_j = f_j | sm-comp)

where f_1, . . ., f_k are the discretized feature values of the edge. The posterior probabilities are calculated in a similar fashion for the other two classes lg-comp and non-comp. We abbreviate the posterior probability of edge (a, b) being in each of the three classes as P_{(a,b),sm}, P_{(a,b),lg}, and P_{(a,b),non}.
4 A new feature ISO (isolatedness) is calculated for each edge (a, b), based on the probability that the edge is isolated (not adjacent to any other edges) or is part of an isolated triangle; here N_x denotes the neighbours of x, excluding x itself. The ISO feature is discretized with MDL.
5 The maximum-likelihood parameters for the ISO feature are learned for the three classes.
6 The posterior probabilities for the three classes, P_{(a,b),sm}, P_{(a,b),lg}, and P_{(a,b),non}, are recalculated for each edge (a, b), this time incorporating the new ISO feature.
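Steps 2 and 3 above amount to counting label/feature-value co-occurrences and normalising. A minimal sketch, using made-up toy edges with a single discretized feature (the feature name and counts are illustrative only):

```python
from collections import defaultdict

def learn_parameters(edges):
    """edges: list of (features_dict, label) training instances.
    Returns maximum-likelihood class priors P(class) and conditional
    probabilities P(F = f | class)."""
    n_class = defaultdict(int)   # n_sm, n_lg, n_non
    n_value = defaultdict(int)   # counts keyed by (class, feature, value)
    for feats, label in edges:
        n_class[label] += 1
        for feat, val in feats.items():
            n_value[(label, feat, val)] += 1
    total = len(edges)
    priors = {c: n / total for c, n in n_class.items()}
    conds = {k: n / n_class[k[0]] for k, n in n_value.items()}
    return priors, conds

def posteriors(feats, priors, conds):
    """Naive-Bayes posteriors P(class | features) for one edge:
    prior times product of conditionals, renormalised over classes."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for feat, val in feats.items():
            p *= conds.get((c, feat, val), 0.0)
        joint[c] = p
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()} if z > 0 else joint

# Toy training set: 3 small-complex, 2 large-complex, 5 non-complex edges.
toy_edges = ([({"DEG_PPI": "hi"}, "sm-comp")] * 3
             + [({"DEG_PPI": "hi"}, "lg-comp")] * 2
             + [({"DEG_PPI": "lo"}, "non-comp")] * 5)
toy_priors, toy_conds = learn_parameters(toy_edges)
post = posteriors({"DEG_PPI": "hi"}, toy_priors, toy_conds)
```

In the toy example only co-complex edges ever take the value "hi", so an edge with that value splits its posterior mass between sm-comp and lg-comp in proportion to their priors.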
Extracting small complexes
After using SSS to weight the PPI network, the small complexes are extracted. This stage, called Extract, consists of two steps (see Figure 1): first, the small-co-complex probability weight of each edge is disambiguated into size-2 and size-3 complex components; next, each candidate complex is scored by its cohesiveness-weighted density, which is based on both its internal and outgoing edges.
In the disambiguation step, the small-co-complex probability weight of each edge (a, b), P_{(a,b),sm}, which denotes the probability of being in a small (either size-2 or size-3) complex, is decomposed into two kinds of component scores (we use the term score instead of probability since the derivation is not probabilistic): a size-2 component score, for being in the size-2 complex composed of a and b; and, for each common neighbour c of a and b, a size-3 component score, for being in the size-3 complex composed of a, b, and c. Intuitively, if an edge is contained within a triangle with high edge weights, then it is likelier to be part of the size-3 complex corresponding to that triangle than of a size-2 complex; thus its size-2 component score is reduced based on the weights of its incident triangles.
Similarly, if an edge is contained within one triangle with high edge weights and within another triangle with low edge weights, then it is likelier to form a size-3 complex with the former triangle than with the latter; thus its size-3 component score for a specific triangle is reduced based on the weights of its other incident triangles.
In the next step, each candidate complex is scored by weighting the density of the cluster with its cohesiveness, which is adapted from the cluster cohesiveness described in [5]. We define the cohesiveness of a cluster as the ratio of the sum of its internal edges' weights to the sum of its internal and outgoing edges' weights, where the internal weights are the component scores calculated above, and the outgoing weights are the posterior probabilities of being either small or large co-complex edges, P_{(x,y),sm} + P_{(x,y),lg} for each outgoing edge (x, y). The cohesiveness of a size-2 cluster (a, b) and of a size-3 cluster (a, b, c) are the corresponding internal-to-total weight ratios.
We then define the score of a cluster as its cohesiveness-weighted density: the product of its weighted density and its cohesiveness. A size-2 cluster (a, b) is scored using its edge's size-2 component score, and a size-3 cluster (a, b, c) using the size-3 component scores of its three internal edges.
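The cohesiveness-weighted density can be sketched as follows, under the assumption (not stated explicitly above) that the weighted density of a cluster is the mean of its internal component scores; the function and argument names are illustrative, and the exact normalisation in the paper may differ.

```python
def cluster_score(internal, outgoing):
    """Cohesiveness-weighted density of a candidate small complex.
    internal: component scores of the cluster's internal edges (one value
    for a size-2 cluster, three for a size-3 cluster).
    outgoing: the P_(x,y),sm + P_(x,y),lg weights of edges leaving the
    cluster.
    Returns density * cohesiveness, where cohesiveness is the ratio of
    internal weight to internal-plus-outgoing weight, and density is
    assumed here to be the mean internal component score."""
    if not internal:
        return 0.0
    total_in = sum(internal)
    total = total_in + sum(outgoing)
    cohesiveness = total_in / total if total > 0 else 0.0
    density = total_in / len(internal)
    return density * cohesiveness

# A size-2 cluster with one strong internal edge (0.8) and two weak
# outgoing edges (0.2 each) keeps most, but not all, of its density.
example = cluster_score([0.8], [0.2, 0.2])
```

Multiplying density by cohesiveness penalises clusters that are well connected internally but also heavily connected to the rest of the network, matching the intuition behind cohesiveness in [5].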