Protein-protein interaction (PPI) data
The PPI dataset is represented as a weighted directed graph G = (V, E, w), where nodes (V) represent proteins, edges (E) PPIs, and the scores (w) the confidence in each interaction. The scores (edge weight w) range from 0, indicating no interaction, to 1, indicating an interaction with high confidence.
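As a minimal sketch, such a graph can be held in a nested adjacency mapping. The helper name `build_graph` and the protein names and scores are hypothetical and chosen only for illustration; they are not taken from STRING or from the BowTieBuilder implementation.

```python
def build_graph(scored_edges):
    """Build a directed adjacency mapping {u: {v: w}} from (u, v, w) triples."""
    graph = {}
    for u, v, w in scored_edges:
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"confidence {w} is outside [0, 1]")
        # Register both proteins as nodes, even if they end up without edges.
        graph.setdefault(u, {})
        graph.setdefault(v, {})
        if w > 0.0:  # a score of 0 indicates no interaction, so no edge is stored
            graph[u][v] = w
    return graph


# Invented example data, not real STRING scores.
G = build_graph([("A", "B", 0.9), ("B", "C", 0.7), ("A", "C", 0.0)])
```

Storing only edges with a positive score keeps the structure sparse, which matches the later assumption that most proteins are not connected to each other.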
The PPI dataset used in this work is obtained from the STRING database (version 7.1) [9, 22–24]. This dataset contains computationally and experimentally derived PPIs, including interactions from other databases (e.g., MINT [25], BioGRID [26], DIP [27], and Reactome [28]), microarray experiments, high-throughput experiments, and a mined literature corpus. Furthermore, PPIs are transferred between orthologous pairs of proteins across different organisms. The information from all of these sources is combined into a single score per PPI that expresses the overall confidence in that interaction. This score is derived from the joint membership of the interacting proteins in KEGG pathways [29].
Problem complexity and formalization
The problem posed here is similar to the problem of finding Steiner trees in graphs [30], or more specifically, vertex-weighted Steiner trees. In this problem formalization, a weighted graph G = (V, E, w) with w ∈ ℝ^{+} and a non-empty set of terminals T ⊆ V are given. The optimal Steiner tree is defined as the connected subgraph G' = (V', E', w') with G' ⊆ G and T ⊆ V' for which the summed weight w_{sum}(E') = ∑_{e∈E'} w_{e} is minimal. The Steiner tree problem on graphs was shown to be NP-complete [31] and is therefore, in most cases, solved with heuristics. One of these heuristics is Prim's algorithm [32], which iteratively extends the subgraph G' by adding the vertex with the smallest distance until all nodes in T are connected in G'. A more recent heuristic presented by Mehlhorn et al. [33] proceeds by first calculating the minimal distance between all nodes in T, and then assembling the minimal Steiner tree by iteratively connecting the nodes with the smallest distance to each other.
Here, the aim is to select a subgraph G' ⊆ G that connects a set of source proteins S to a set of target proteins T. Given a graph G = (V, E, w) with w ∈ [0, 1], a source set S ⊆ V, and a target set T ⊆ V that is disjoint from S (S ∩ T = ∅), the aim is to find the optimal subgraph G' ⊆ G such that for every s ∈ S and for every t ∈ T at least one path P(s, t) exists in G', whenever such a path exists in G. If either the source or the target set is empty, the problem formalization of Steiner trees is applied. Then the aim is to connect all nodes that are given either in S or in T (S ∪ T) to each other.
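The connectivity requirement can be made concrete with a small feasibility check. This is only an illustrative sketch: the function names `reachable` and `is_feasible` and the toy graphs are assumptions, and the check says nothing about whether a feasible subgraph is optimal.

```python
def reachable(graph, start):
    """Return all nodes reachable from start along directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_feasible(graph, subgraph, sources, targets):
    """True if every (s, t) pair connected in graph is also connected in subgraph."""
    for s in sources:
        in_graph = reachable(graph, s)
        in_sub = reachable(subgraph, s)
        for t in targets:
            # A pair only has to be connected in G' if it is connected in G.
            if t in in_graph and t not in in_sub:
                return False
    return True


# Toy graphs as {node: [successors]}; "u" cannot be reached from "s" in G.
G = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": [], "u": []}
ok = is_feasible(G, {"s": ["a"], "a": ["t"]}, ["s"], ["t", "u"])
broken = is_feasible(G, {"s": ["a"]}, ["s"], ["t", "u"])
```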
Objective function for the pathways
For any given pathway P, the overall confidence w(P) is calculated by multiplying the individual confidence values of the utilized edges:
w(P) = ∏_{e∈P} w_{e}   (1)
This objective is based on the assumption that the edge scores reflect independent confidence values, and implies that the resulting score gives the overall confidence in the pathway – that all contained edges are true biological interactions.
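The following sketch illustrates this objective and a standard reformulation of it. The function names are hypothetical. Taking the negative logarithm of each score turns the product into a sum, so maximizing the pathway confidence is equivalent to minimizing an additive cost; this transformation is a common device and an assumption here, not a detail stated in the text.

```python
import math


def pathway_confidence(edge_scores):
    """Product of the edge confidences (1.0 for a pathway without edges)."""
    return math.prod(edge_scores)


def pathway_cost(edge_scores):
    """Sum of -log(w) over the edges; requires every score to be in (0, 1]."""
    return sum(-math.log(w) for w in edge_scores)


scores = [0.9, 0.8, 0.5]  # invented edge confidences
conf = pathway_confidence(scores)
cost = pathway_cost(scores)
```

Because every score is at most 1, each term -log(w) is non-negative, which is the precondition shortest-path algorithms such as Dijkstra's require.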
Inferring signal transduction pathways
Finding optimal paths between two proteins
Although the problem of finding the optimal pathway is NP-complete, some special instances exist that are solvable in polynomial time. If, for instance, the source set S and target set T both contain one node, the problem reduces to finding the highest scoring path between them. This problem can be solved by applying Dijkstra's algorithm [34]. Given two nodes, this algorithm finds the highest scoring path with a runtime complexity of O((E + V) log V), where V gives the number of proteins and E the number of PPIs. For PPI networks, it can be assumed that most proteins are not connected to each other (E ≪ V^{2}); therefore, Dijkstra's algorithm is implemented using adjacency lists, and thus the runtime is reduced to O(V log V + E). The scores between all nodes, obtained by Dijkstra's algorithm, are stored in a distance matrix D_{S×T} with |S| rows and |T| columns, and the respective paths are referred to by P^{D}(s, t).
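A minimal sketch of such a search is given below. It assumes a graph stored as {u: {v: w}} with scores in (0, 1] and uses a binary heap, which corresponds to the O((E + V) log V) bound rather than the faster variant. The function name `best_path` and the example scores are hypothetical. Scores are converted to -log(w) so that the highest scoring path becomes the shortest one.

```python
import heapq
import math


def best_path(graph, source, target):
    """Return (score, path) of the highest scoring path, or (0.0, None) if none exists."""
    dist = {source: 0.0}  # smallest known sum of -log(w) for each node
    prev = {}             # predecessor on the best known path
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue      # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            if w <= 0.0:
                continue  # a score of 0 means no interaction
            nd = d - math.log(w)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return 0.0, None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-dist[target]), path


# Invented scores: s-a-t scores 0.9 * 0.8, s-b-t scores 0.5 * 0.99.
G = {"s": {"a": 0.9, "b": 0.5}, "a": {"t": 0.8}, "b": {"t": 0.99}, "t": {}}
score, path = best_path(G, "s", "t")
```

Calling `best_path` once per pair (s, t) would fill the matrix D and the associated paths P^{D}(s, t).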
BowTieBuilder
When multiple source and target proteins are provided, we employ a greedy approach, referred to as BowTieBuilder, to construct the signaling pathway P. In the first step, BowTieBuilder initializes the signaling pathway P = (V = S ∪ T, E = ∅, w = ∅) by including the source S and target T nodes, and flags these nodes as 'not visited'. In the second step, the distance matrix D_{S×T} is constructed by determining the maximal scoring (Equation 1) paths between the nodes in S and the nodes in T with Dijkstra's algorithm, where the distance is set to ∞ if no path exists. This preprocessing is similar to the heuristic presented by Mehlhorn et al. [33] for finding Steiner trees. In the next stage of the inference, the highest scoring path P^{D}(s, t) in D that connects a 'not visited' node to a 'visited' node is added to P. If no such path exists, the two 'not visited' nodes with the highest scoring path P^{D}(s, t) in D are connected to each other and, likewise, the path P^{D}(s, t) is added to P. Subsequently, the nodes in that path are flagged as 'visited' and D is updated to include all distances to the nodes in P^{D}(s, t). This step is repeated, in each iteration integrating further 'not visited' source and target nodes. The method terminates when all nodes in S ∪ T are flagged as 'visited' or when, for the remaining nodes, no path to any other node in S ∪ T exists. Then the final signaling pathway P is returned. If either S or T is an empty set, D is initialized such that it contains all distances between any two nodes in the input set (D_{(S∪T)×(S∪T)}). Despite this change in the initialization of D, the algorithm proceeds in the same manner and finally returns the signaling pathway P, which connects all nodes to each other. The structure of the BowTieBuilder algorithm is given in the following:

1. Initialize the pathway P with all nodes S ∪ T, and flag all nodes in S ∪ T as 'not visited'.
2. Calculate the distance matrix D_{S×T} between the nodes in S and T with Dijkstra's algorithm.
3. Select the shortest path in D that connects a 'not visited' and a 'visited' node in P, or, if no such path exists, a 'not visited' node in S to a 'not visited' node in T.
4. Add the nodes and edges of the selected path to P and flag all nodes in the pathway as 'visited'.
5. Update D to include all distances to the nodes in P^{D}(s, t).
6. Repeat the steps 2–5 until every node in S is connected to some node in T, and vice versa if such a path exists in G.
7. Export final pathway P.
As an optional parameter, the maximum path length l is introduced, since very long paths are more likely to introduce false-positive PPIs. This is accomplished by setting the length of a path with more than l edges to ∞.
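The greedy procedure can be sketched as follows. This is one possible reading of the description, not the authors' implementation: all function and variable names are hypothetical, best paths are recomputed on demand instead of maintaining and updating the matrix D, the optional maximum path length l is not implemented, and the case of an empty source or target set is not handled. The graph is assumed to be stored as {u: {v: w}} with scores in (0, 1].

```python
import heapq
import math


def best_path(graph, source, target):
    """Highest scoring path as (score, node list), or (0.0, None) if none exists."""
    dist, prev, heap, done = {source: 0.0}, {}, [(0.0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in graph.get(u, {}).items():
            if w <= 0.0:
                continue
            nd = d - math.log(w)  # maximizing a product = minimizing sum of -log
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return 0.0, None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return math.exp(-dist[target]), path[::-1]


def bowtie_builder(graph, sources, targets):
    """Greedy sketch; returns the (nodes, edges) of the inferred pathway."""
    nodes = set(sources) | set(targets)  # the pathway starts with all terminals
    edges, visited = set(), set()        # every node starts as 'not visited'

    def candidates(starts, ends):
        found = (best_path(graph, a, b) for a in starts for b in ends if a != b)
        return [(score, path) for score, path in found if path is not None]

    while True:
        open_s = [s for s in sources if s not in visited]
        open_t = [t for t in targets if t not in visited]
        in_pathway = sorted(visited)
        # Prefer linking a 'not visited' terminal to an already 'visited' node.
        cands = candidates(open_s, in_pathway) + candidates(in_pathway, open_t)
        if not cands:
            # Otherwise link a 'not visited' source to a 'not visited' target.
            cands = candidates(open_s, open_t)
        if not cands:
            break  # all terminals are visited, or the rest cannot be connected
        _, path = max(cands, key=lambda c: c[0])
        nodes.update(path)
        edges.update(zip(path, path[1:]))
        visited.update(path)
    return nodes, edges


# Invented example: two sources and two targets that share the intermediate "c".
G = {"s1": {"c": 0.9}, "s2": {"c": 0.8}, "c": {"t1": 0.9, "t2": 0.7}, "t1": {}, "t2": {}}
P_nodes, P_edges = bowtie_builder(G, ["s1", "s2"], ["t1", "t2"])
```

In this toy example the first selected path is s1, c, t1; the remaining terminals are then attached to the already visited node c, which yields the shared core that gives the method its name.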
Additional inference methods
When applying heuristics, it is advisable to compare different approaches to each other to analyze their properties. For this purpose, we implemented three alternative inference methods: all interactions, shortest paths, and all shortest paths.
all interactions: In this modification, the standard BowTieBuilder is applied and the resulting pathway P is obtained. Then, all PPIs (edges) between any two nodes in P are added whenever they are contained in G.
shortest paths: In this inference method, every node in the source set S is connected to the target set T through the maximal scoring path, and vice versa. In this case the pathway P can be directly derived from paths corresponding to the maximal scores in matrix D. More specifically, for each row and column, the path corresponding to the maximal entry in D is added to P.
all shortest paths: In this inference method, for every pair of source (S) and target (T) proteins the highest scoring path P^{D}(s, t) is added to P. Thus, every source and target node is directly connected if a corresponding path exists in G.
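The three alternatives can be sketched on top of precomputed best paths. In this sketch the matrix D is assumed to be a dictionary that maps (s, t) to a (score, path) tuple and simply omits pairs without a path; this layout, the function names, and the example values are illustrative assumptions.

```python
def edges_of(path):
    """Directed edges along a node list."""
    return set(zip(path, path[1:]))


def shortest_paths(D, sources, targets):
    """Keep the best path of every row (source) and every column (target) of D."""
    chosen = []
    for s in sources:
        row = [D[s, t] for t in targets if (s, t) in D]
        if row:
            chosen.append(max(row, key=lambda entry: entry[0])[1])
    for t in targets:
        col = [D[s, t] for s in sources if (s, t) in D]
        if col:
            chosen.append(max(col, key=lambda entry: entry[0])[1])
    return set().union(*(edges_of(p) for p in chosen))


def all_shortest_paths(D):
    """Keep the best path of every source-target pair."""
    return set().union(*(edges_of(path) for _, path in D.values()))


def all_interactions(graph, pathway_nodes):
    """All edges of G whose two endpoints are both contained in the pathway."""
    return {(u, v) for u in pathway_nodes for v in graph.get(u, {}) if v in pathway_nodes}


# Invented best paths between two sources and two targets.
D = {
    ("s1", "t1"): (0.81, ["s1", "c", "t1"]),
    ("s1", "t2"): (0.50, ["s1", "x", "t2"]),
    ("s2", "t1"): (0.72, ["s2", "c", "t1"]),
    ("s2", "t2"): (0.60, ["s2", "y", "t2"]),
}
sp = shortest_paths(D, ["s1", "s2"], ["t1", "t2"])
asp = all_shortest_paths(D)
G = {"s1": {"c": 0.9, "t1": 0.3}, "c": {"t1": 0.9}, "t1": {}}
ai = all_interactions(G, {"s1", "c", "t1"})
```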
Output
Inferred signal transduction pathways are exported in the formats GML (Graph Modelling Language), XGML, and GraphViz, and can be visualized with the graph viewer yEd [35] or with Cytoscape [36, 37].
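As an illustration of the GraphViz export, the sketch below serializes a pathway into the DOT language. The function name `to_dot`, the choice of the edge score as label, and the example edges are assumptions; the exact attributes written by BowTieBuilder are not specified in the text.

```python
def to_dot(edges, name="pathway"):
    """Serialize (source, target, score) triples as a GraphViz digraph."""
    lines = [f"digraph {name} {{"]
    for u, v, w in sorted(edges):
        # Node names are quoted so that identifiers with special characters stay valid.
        lines.append(f'  "{u}" -> "{v}" [label="{w:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot([("s", "a", 0.9), ("a", "t", 0.8)])
```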
Validation
To validate the correctness of the inferred pathways, we compute the recall and precision rates with respect to a specified reference pathway. These rates can be calculated with respect to PPIs or proteins. Let I denote the set of inferred PPIs/proteins and R the set of PPIs/proteins in the reference pathway. The recall rate is defined as the fraction of PPIs/proteins in the reference pathway that are inferred (Equation 2) and the precision rate is defined as the fraction of inferred PPIs/proteins that are contained in the reference pathway (Equation 3).
recall = |I ∩ R| / |R|   (2)
precision = |I ∩ R| / |I|   (3)
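Both rates follow directly from set operations, as the following sketch shows. The function name and the example protein sets are hypothetical, and returning 0.0 for an empty set is a convention chosen here. For PPIs, the same function can be applied to sets of edges.

```python
def recall_precision(inferred, reference):
    """Return (recall, precision) of an inferred set against a reference set."""
    inferred, reference = set(inferred), set(reference)
    hits = len(inferred & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(inferred) if inferred else 0.0
    return recall, precision


# Invented protein sets: two of the three reference proteins are recovered.
recall, precision = recall_precision({"A", "B", "C", "D"}, {"B", "C", "E"})
```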
The topological validation is only performed for pathways that are provided by KEGG. Another possibility for testing the plausibility of inferred pathways – without the need for validation pathways – is to test if the inferred pathway can be associated with a certain biological process. To perform such an analysis, we map the proteins contained in each pathway to their 'biological process', defined by the Gene Ontology (GO) [38]. The tool Term Finder [39] is used for this purpose, which calculates a p-value for each biological process using the hypergeometric distribution.
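A hypergeometric enrichment p-value of this kind can be sketched with the standard library alone. This is a generic illustration and not the Term Finder code: the function name, the parameter names, and the numbers are assumptions, and any multiple-testing correction is omitted.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` proteins without replacement from a
    population in which `annotated` proteins carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Invented numbers: 10 proteins, 4 annotated, pathway of 3 with 2 annotated.
p = hypergeom_pvalue(population=10, annotated=4, sample=3, hits=2)
```

For this toy input the favourable outcomes are C(4,2)·C(6,1) + C(4,3)·C(6,0) = 40 out of C(10,3) = 120 possible draws.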
A direct validation against other methods for automatically inferring signal transduction pathways is omitted, because most of these algorithms are validated through pathways with one source and one target protein [10, 11, 13]. The recall and precision rates obtained by the different methods can, however, give a rough estimate of the relative performance.
Source and target proteins
BowTieBuilder is applied to several sets of source and target proteins. In principle, any type of source or target protein can be processed by BowTieBuilder; in this work, however, if not stated otherwise, the source proteins are membrane-bound proteins and the target proteins are TFs.
To infer signaling pathways for different biological processes, we collect several sets of membrane-bound proteins and TFs. To infer signaling pathways that control the yeast cell cycle, we collect membrane-TF sets for the yeast cell cycle phases G1 and S from the respective KEGG pathway (KEGG identifier: sce04111). For the analysis of the yeast MAPK pathway, the membrane and TF sets are obtained from the KEGG MAPK pathway (KEGG identifier: hsa04010). In addition, the human membrane and TF sets of the ErbB pathway are collected from KEGG (KEGG identifier: hsa04012), and the human membrane and TF sets related to the TLR-mediated innate immune pathway are collected from a publication of Kitano et al. [18].
To combine signal transduction pathways with gene regulatory networks, all TFs that were inferred as regulators in a previous study [21] are used here as the target list. In this study, TFs were inferred to have a regulatory effect from two gene expression datasets [40, 41] and known cis-regulatory elements. In addition to these TFs, a list of membrane proteins was collected from the Yeast Membrane Protein Library (YMPL). Based on these TFs and membrane proteins, a signaling pathway is inferred that potentially explains the higher-level regulation of these TFs in the respective gene regulatory network. All source and target proteins are provided in Additional File 1.
Bow tie score
As mentioned earlier, BowTieBuilder favors signaling pathways that are structured like a bow tie, but it does not demand such a structure. Thus, it is of interest to quantify to what extent signaling pathways follow the bow tie structure and, in addition, to determine the core proteins. For this purpose, we provide a bow tie score (b(p) ∈ [0, 1]) that determines how 'central' a protein p is. This score is also used to determine the bow tie score of the complete pathway. This score is related to the 'betweenness' measure, in which the number of shortest paths that include the core protein determines the centrality.
To calculate this score, the possible number of connecting paths between the source S and target T proteins is first determined, which is simply the number of source proteins multiplied by the number of target proteins, |S|·|T|. Then the number of source and target protein pairs that can be connected by a path containing p is calculated. This is given by the number of source proteins from which p can be reached (S_{p}) multiplied by the number of target proteins that can be reached from p (T_{p}). Every edge can only be traversed in one direction, since the signaling pathway is a directed graph that is traversed from the source to the target proteins. The corresponding bow tie score for any protein p reads:
b(p) = (S_{p} · T_{p}) / (|S| · |T|)
To determine the core elements of any signaling pathway, b(p) is calculated for every intermediate protein p. Given these scores, the core component is defined by the set of proteins with the maximal b(p) score. This also gives the overall score of the signaling pathway. In some cases it is helpful to distill the subnetwork that constitutes the bow tie structure by removing all paths that do not pass through the core component. We refer to such signaling pathways as 'core bow tie'.
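The score and the core component can be sketched as follows, taking b(p) as the number of sources upstream of p times the number of targets downstream of p, divided by the number of sources times the number of targets. The pathway is assumed to be a directed graph stored as {node: [successors]}; the function names and the example pathway are hypothetical.

```python
def reachable(graph, start):
    """All nodes reachable from start along directed edges (including start)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie_scores(pathway, sources, targets):
    """Bow tie score b(p) for every intermediate protein p of the pathway."""
    downstream = {u: reachable(pathway, u) for u in pathway}
    terminals = set(sources) | set(targets)
    scores = {}
    for p in pathway:
        if p in terminals:
            continue  # only intermediate proteins are scored
        sources_upstream = sum(1 for s in sources if p in downstream.get(s, ()))
        targets_downstream = sum(1 for t in targets if t in downstream[p])
        scores[p] = sources_upstream * targets_downstream / (len(sources) * len(targets))
    return scores


def core_component(scores):
    """Proteins with the maximal b(p), together with that maximal score."""
    best = max(scores.values())
    return {p for p, b in scores.items() if b == best}, best


# Invented pathway: s1 and s2 funnel through a and c, while s3 bypasses them via b.
P = {"s1": ["a"], "s2": ["a"], "s3": ["b"], "a": ["c"], "b": ["t2"],
     "c": ["t1", "t2"], "t1": [], "t2": []}
b = bow_tie_scores(P, ["s1", "s2", "s3"], ["t1", "t2"])
core, pathway_score = core_component(b)
```

In this example a and c each lie between two of the three sources and both targets, so they form the core with a score of 4/6, whereas b connects only one source to one target and scores 1/6.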