The prediction framework of ppiPre integrates heterogeneous features: three GO-based semantic similarities, one KEGG-based similarity indicating whether two proteins are involved in the same pathways, and three topology-based similarities that use only the structure of the PPI network.

We chose these three kinds of features because they are highly available for the PPIs of different species and can be easily accessed in the R environment. Unlike other methods and software tools, ppiPre does not integrate biological features, such as structural and domain information, that may be unavailable for species or proteins that are not well studied.

### GO-based semantic similarities

Proteins are annotated by GO with terms from three aspects: biological process (BP), molecular function (MF), and cellular component (CC). Each aspect is described by a directed acyclic graph (DAG). Interacting protein pairs are known to be more likely to be involved in similar biological processes or located in similar cellular components than non-interacting pairs [2, 24, 25]. Thus, if two proteins are semantically similar based on their GO annotations, the probability that they actually interact is higher than for two less similar proteins.

Several similarity measures have been developed for evaluating the semantic similarity between two GO terms [26–28]. The information content (IC) of GO terms and the structure of the GO DAG are often used in these measures.

The IC of a term *t* can be defined as follows:

IC(t) = -\log\left(p(t)\right)

(1)

where *p*(*t*) is the probability of occurrence of the term *t* in a certain GO aspect. Two recently proposed IC-based semantic similarity measures are integrated in ppiPre: Topological Clustering Semantic Similarity (TCSS) [29] and IntelliGO [30].
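To make Eq. (1) concrete, the short sketch below computes the IC of two terms from hypothetical annotation counts (the GO identifiers and counts are invented for illustration; ppiPre itself computes these values in R):

```python
import math

def information_content(term, counts, total):
    """IC(t) = -log(p(t)), where p(t) is the occurrence probability
    of term t within one GO aspect."""
    return -math.log(counts[term] / total)

# Hypothetical annotation counts within the BP aspect.
counts = {"GO:0008150": 1000, "GO:0006412": 50}
total = sum(counts.values())

# The rarer term carries more information than the frequent one.
print(information_content("GO:0006412", counts, total) >
      information_content("GO:0008150", counts, total))  # True
```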

#### TCSS

In TCSS, the GO DAGs are divided into subgraphs, and a PPI is scored higher if the two proteins are annotated within the same subgraph. The algorithm consists of two major steps.

In the first step, a threshold on the ICs of all terms is used to generate multiple subgraphs. The roots of the subgraphs are the terms whose IC values are below the predefined threshold; if the roots of two subgraphs have similar IC values, the two subgraphs are merged. Because some GO terms have more than one parent term, the resulting subgraphs may overlap. To remove the overlap, edge removal and term duplication are performed. Edge removal applies a transitive reduction of the GO DAG, generating the smallest graph that has the same transitive closure as the original subgraph. After edge removal, any term still included in two or more subgraphs is duplicated into each of them. More details are described in [29].

After the first step, a meta-graph is constructed by connecting all subgraphs. The second step, normalized scoring, is then performed: for two GO terms, the normalized semantic similarity is calculated on the meta-graph rather than on the whole GO DAG, so that more balanced semantic similarity scores are obtained.

Using the frequency of proteins that are annotated to GO term *t* and its children, the information content of annotation (ICA) for a GO term *t* is:

ICA(t) = -\ln\left(\frac{\left|P_t \cup \bigcup_{c \in N(t)} P_c\right|}{\sum_{t' \in O}\left|P_{t'} \cup \bigcup_{c \in N(t')} P_c\right|}\right)

(2)

where *P*_{t} is the set of proteins annotated with *t* in aspect *O*, and *N*(*t*) is the set of child terms of *t*.
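Eq. (2) can be sketched as follows; the terms, child relations, and protein annotations below are hypothetical:

```python
import math

def ica(term, annotations, children, all_terms):
    """ICA(t): -ln of the fraction of annotations covered by term t
    together with its child terms (Eq. 2)."""
    def covered(t):
        proteins = set(annotations.get(t, ()))
        for c in children.get(t, ()):
            proteins |= set(annotations.get(c, ()))
        return proteins

    denom = sum(len(covered(t)) for t in all_terms)
    return -math.log(len(covered(term)) / denom)

# Hypothetical annotations in one GO aspect.
annotations = {"t1": {"p1", "p2"}, "t2": {"p2", "p3"}, "t3": {"p4"}}
children = {"t1": ["t2"]}  # t2 is a child of t1
terms = ["t1", "t2", "t3"]

# t1 covers {p1, p2, p3} via its child, while t3 covers only {p4},
# so t3 is the more informative (higher-ICA) term.
print(ica("t3", annotations, children, terms) >
      ica("t1", annotations, children, terms))  # True
```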

The information content of subgraph (ICS) for a term t_m^s in the *m*-th subgraph G_m^s is defined as follows:

ICS\left(t_m^s\right) = \frac{ICA\left(t_m^s\right)}{\max_{t \in G_m^s} ICA\left(t\right)}

(3)

The information content of meta-graph (ICM) for a term t_n^m in the meta-graph G^m is defined as follows:

ICM\left(t_n^m\right) = \frac{ICA\left(t_n^m\right)}{\max_{t \in G^m} ICA\left(t\right)}

(4)

Finally, the similarity between two proteins *i* and *j* is defined as:

Sim_{TCSS}(i,j) = \max_{s_m \in T_i,\ t_n \in T_j} \begin{cases} ICM_{\max}\left(LCA(s_m, t_n)\right) & \text{if } s_m \in G_m^s \text{ and } t_n \in G_n^s \\ ICS_{\max}\left(LCA(s_m, t_n)\right) & \text{if } s_m, t_n \in G_n^s \end{cases}

(5)

where *LCA*(*s*_{m}, *t*_{n}) is the common ancestor of the terms *s*_{m} and *t*_{n} with the highest IC, and *T*_{i} and *T*_{j} are the sets of GO terms annotating the two proteins *i* and *j*, respectively.
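A simplified sketch of Eqs. (3)-(5) on a toy meta-graph follows. The ICA values, subgraph assignments, and ancestor relations are all hypothetical; the protein-level score in Eq. (5) additionally takes the maximum over all term pairs annotating the two proteins:

```python
# Hypothetical ICA values and subgraph structure.
ica = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 4.0}
subgraph = {"a": 1, "b": 1, "c": 2, "d": 2}  # which subgraph each term is in
meta_terms = {"a", "c"}                      # subgraph roots forming the meta-graph
lca = {("b", "d"): "a"}                      # highest-IC common ancestor per pair

def ics(t):
    """Eq. (3): ICA normalised within the term's own subgraph."""
    peers = [x for x in ica if subgraph[x] == subgraph[t]]
    return ica[t] / max(ica[x] for x in peers)

def icm(t):
    """Eq. (4): ICA normalised over the meta-graph."""
    return ica[t] / max(ica[x] for x in meta_terms)

def sim_tcss(s, t):
    """Eq. (5) for a single term pair, scored through its LCA."""
    anc = lca[tuple(sorted((s, t)))]
    if subgraph[s] == subgraph[t]:
        return ics(anc)   # same subgraph: subgraph-normalised score
    return icm(anc)       # different subgraphs: meta-graph score

# b and d live in different subgraphs, so the score is ICM of their LCA.
print(sim_tcss("b", "d"))  # ICM("a") = 1.0 / 2.0 = 0.5
```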

#### IntelliGO

The IntelliGO similarity measure introduces a novel annotation vector space model in which the coefficient of each GO term combines complementary properties: the IC of the term and the evidence code (EC) [31] of the annotation assigning the term to a protein. The coefficient *α*_{t} given to term *t* is defined as follows:

\alpha_t = w(g, t) \cdot IAF(t)

(6)

where *w*(*g*, *t*) is the weight of the EC indicating the origin of the annotation between protein *g* and GO term *t*, and *IAF*(*t*) (Inverse Annotation Frequency) reflects how rarely term *t* occurs among all the proteins annotated in the aspect to which *t* belongs.

For two proteins *i* and *j*, IntelliGO uses their vector representations \vec{i} and \vec{j} to measure their similarity, which is defined as follows:

Sim_{IntelliGO}(i, j) = \frac{\vec{i} \cdot \vec{j}}{\sqrt{\vec{i} \cdot \vec{i}}\,\sqrt{\vec{j} \cdot \vec{j}}}

(7)

The detailed explanation of the definition can be found in [30].
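Treating Eq. (7) as a plain weighted cosine over annotation vectors gives the minimal sketch below. The EC weights and IAF values are hypothetical, and the full IntelliGO dot product is more elaborate (see [30]):

```python
import math

def intelligo_coeff(ec_weight, iaf):
    """alpha_t = w(g, t) * IAF(t)  (Eq. 6)."""
    return ec_weight * iaf

def cosine_sim(vec_i, vec_j):
    """Eq. (7): generalised cosine between two annotation vectors,
    represented here as {term: coefficient} dicts."""
    dot = lambda a, b: sum(a[t] * b.get(t, 0.0) for t in a)
    return dot(vec_i, vec_j) / (math.sqrt(dot(vec_i, vec_i)) *
                                math.sqrt(dot(vec_j, vec_j)))

# Hypothetical annotation vectors: terms weighted by Eq. (6).
vec_i = {"GO:0006412": intelligo_coeff(1.0, 2.3),
         "GO:0008150": intelligo_coeff(0.6, 0.4)}
vec_j = {"GO:0006412": intelligo_coeff(0.8, 2.3)}

print(round(cosine_sim(vec_i, vec_j), 3))
```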

#### Wang's method

The similarity measure proposed by Wang [32], which is based on the graph structure of the GO DAG, is also implemented in the ppiPre package.

In the GO DAG, each edge has a type, "is-a" or "part-of". In Wang's measure, a weight is assigned to each edge according to its type. *DAG*_{t} = (*t*, *T*_{t}, *E*_{t}) represents the subgraph made up of term *t* and its ancestors, where *T*_{t} is the set of ancestor terms of *t* and *E*_{t} is the set of edges in *DAG*_{t}.

In *DAG*_{t}, *S*_{t}(*n*) measures the semantic contribution of term *n* to term *t*, which is defined as:

\begin{cases} S_t(t) = 1 \\ S_t(n) = \max\left\{ w_e \cdot S_t(n') \mid n' \in \text{children of}(n) \right\} & \text{if } n \neq t \end{cases}

(8)

The similarity between two GO terms *m* and *n* is defined as:

Sim_{Wang}(m, n) = \frac{\sum_{t \in T_m \cap T_n} \left(S_m(t) + S_n(t)\right)}{SV(m) + SV(n)}

(9)

where *SV*(*m*) is the sum of the semantic contributions of all the terms in *DAG*_{m}.

The semantic similarity between two proteins *i* and *j* is defined as the maximum similarity between any term annotating *i* and any term annotating *j*.
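Eqs. (8)-(9) can be sketched on a toy DAG as below. The terms, the parent relation, and the edge weights are hypothetical; Wang [32] fixes the weights per edge type:

```python
# Toy DAG: each node maps to its parents with hypothetical edge weights
# (e.g. different weights for "is-a" and "part-of" edges in Wang [32]).
parents = {"b": {"a": 0.8}, "c": {"a": 0.8}, "d": {"b": 0.8, "c": 0.6}}

def s_values(t):
    """Semantic contribution S_t(n) of every term n in DAG_t (Eq. 8).
    Propagates w_e * S_t(child) upward, keeping the maximum per term."""
    s = {t: 1.0}
    frontier = [t]
    while frontier:
        node = frontier.pop()
        for parent, w in parents.get(node, {}).items():
            contrib = w * s[node]
            if contrib > s.get(parent, 0.0):
                s[parent] = contrib
                frontier.append(parent)
    return s

def sim_wang(m, n):
    """Eq. (9): shared semantic contributions over SV(m) + SV(n)."""
    sm, sn = s_values(m), s_values(n)
    shared = set(sm) & set(sn)
    return sum(sm[t] + sn[t] for t in shared) / (sum(sm.values()) + sum(sn.values()))

def sim_proteins(terms_i, terms_j):
    """Protein-level similarity: maximum over all annotating term pairs."""
    return max(sim_wang(m, n) for m in terms_i for n in terms_j)

print(round(sim_wang("b", "d"), 3))  # shared ancestors a and b: 0.669
```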

### KEGG-based similarity

Proteins that work together in the same KEGG pathway are likely to interact [33, 34]. The KEGG-based similarity between proteins *i* and *j* is calculated using co-pathway membership information from the KEGG database. The similarity is defined as:

Sim_{KEGG}(i, j) = \frac{\left|P(i) \cap P(j)\right|}{\left|P(i) \cup P(j)\right|}

(10)

where *P*(*i*) is the set of KEGG pathways in which protein *i* is involved.
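Eq. (10) is a Jaccard index over pathway sets; a minimal sketch (the pathway identifiers are hypothetical):

```python
def sim_kegg(pathways_i, pathways_j):
    """Eq. (10): Jaccard index over the KEGG pathway sets of two proteins."""
    i, j = set(pathways_i), set(pathways_j)
    if not i | j:
        return 0.0  # neither protein is annotated to any pathway
    return len(i & j) / len(i | j)

# Hypothetical KEGG pathway memberships: 1 shared pathway out of 3 in total.
print(sim_kegg({"hsa00010", "hsa00020"}, {"hsa00020", "hsa00030"}))
```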

### Topology-based similarities

To handle proteins that lack annotations in the GO or KEGG databases, topology-based similarity measures are also integrated. In ppiPre, three different topological similarities are implemented.

The Jaccard similarity [35] between two proteins *i* and *j* is defined as:

Sim_{Jac}(i, j) = \frac{\left|N(i) \cap N(j)\right|}{\left|N(i) \cup N(j)\right|}

(11)

where *N*(*i*) is the set of all direct neighbours of protein *i* in the PPI network.

The Adamic-Adar (AA) similarity [36] penalizes high-degree proteins by assigning larger weights to low-degree nodes in the PPI network. The AA similarity between two proteins *i* and *j* is defined as:

Sim_{AA}(i, j) = \sum_{n \in N(i) \cap N(j)} \frac{1}{\log k_n}

(12)

where *k*_{n} is the degree of protein *n*.

The Resource Allocation (RA) similarity [37] is similar to the AA similarity; it treats the common neighbours of two nodes as resource transmitters. The RA similarity between two proteins *i* and *j* is defined as:

Sim_{RA}(i, j) = \sum_{n \in N(i) \cap N(j)} \frac{1}{k_n}

(13)
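The three topological measures can be sketched on a toy network stored as an adjacency dictionary (the protein names are hypothetical):

```python
import math

# Toy undirected PPI network as an adjacency dict.
network = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

def sim_jaccard(i, j):
    """Eq. (11): shared neighbours over all neighbours."""
    return len(network[i] & network[j]) / len(network[i] | network[j])

def sim_aa(i, j):
    """Eq. (12): down-weight shared neighbours by log-degree."""
    return sum(1.0 / math.log(len(network[n])) for n in network[i] & network[j])

def sim_ra(i, j):
    """Eq. (13): shared neighbours as resource transmitters (1/degree)."""
    return sum(1.0 / len(network[n]) for n in network[i] & network[j])

print(sim_jaccard("a", "b"))  # a and b share all their neighbours: 1.0
```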

### Prediction framework

The experimentally verified interacting protein pairs are very incomplete, and non-interacting protein pairs far outnumber interacting ones. Therefore the classical SVM [38], which can handle small and unbalanced data sets, is chosen to integrate the different features in ppiPre. We tested different kernels in e1071, and the results showed no significant difference, so the default kernel and parameters are used in ppiPre.

The prediction framework of ppiPre is presented in Figure 1. Heterogeneous features are calculated for the gold-standard PPI data set supplied by the user, and the SVM classifier is trained on the gold-standard positive and negative data sets (solid arrows). After the classifier is trained, the same features are calculated for the query PPIs input by the user, and the trained classifier predicts false positive and false negative PPIs in the input data (hollow arrows).