In the following sections we describe the algorithms and example studies used to evaluate the ability of POBN to identify predictive and non-predictive regulatory relationships from gene expression data.
Gene expression data
We tested POBN using a set of 266 gene expression profiles of E. coli described elsewhere [9] and available online on GEO as GSE6836. This dataset represents a diverse range of biological backgrounds and environmental conditions, including genetic perturbations, drug treatments, different growth phases, and a range of metabolic states. The study is well suited to this work because it contains many samples and covers a range of perturbations.
The selection of genes that could be meaningfully analyzed with POBN was based on two criteria: (1) genes must exhibit differential expression in at least some samples; and (2) genes must be present as target genes in the RegulonDB regulatory network. The first criterion was enforced by selecting the 300 genes with the largest variation, as measured by the standard deviation of the expression value across samples. This selection approach tends to favor genes with larger absolute expression levels and a diverse range of expression values across the samples. When the second criterion was applied, only 189 of the 300 genes were present as targets in the RegulonDB network, producing a final list of 189 genes.
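As an illustration of this filtering step, the following minimal sketch assumes the expression matrix is loaded as a pandas DataFrame with genes as rows and samples as columns; the names expression and regulondb_targets are placeholders and do not correspond to objects used in the published analysis.

import pandas as pd

# expression: DataFrame with genes as rows and the 266 samples as columns
# regulondb_targets: set of gene names that appear as targets in RegulonDB
# (both names are illustrative placeholders for this sketch)
def select_genes(expression, regulondb_targets, n_top=300):
    # Criterion 1: the n_top genes with the largest standard deviation
    # of expression across samples
    top_genes = expression.std(axis=1).nlargest(n_top).index
    # Criterion 2: keep only genes that are RegulonDB targets
    return [gene for gene in top_genes if gene in regulondb_targets]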
Data discretization
For computational efficiency, the scoring metrics used in this study require that the data be discretized. Data were binned into three states (high, medium, and low), with the top third of the values assigned to the high bin, the bottom third to the low bin, and the remaining values to the medium bin. This equal-sized binning strategy is widely used for discretization of gene expression data in the systems biology community [10–13] and has been shown empirically to be robust in capturing relevant details of the systems under study. We note that the POBN method can be used with continuous values directly; however, the currently available continuous scoring algorithms are computationally impractical for networks involving more than a few hidden nodes. Instead, POBN scores the probability of each network using discrete data, as described below.
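A minimal sketch of this equal-sized three-bin discretization, assuming each gene is binned independently across its samples, is shown below; the exact implementation used in the analysis may differ in how ties and remainders are handled.

import numpy as np

def discretize_tertiles(values):
    # Bin one gene's expression values into 0 = low, 1 = medium, 2 = high,
    # assigning the bottom third of values to low and the top third to high
    values = np.asarray(values)
    order = np.argsort(values)               # sample indices from lowest to highest value
    k = len(values) // 3
    bins = np.ones(len(values), dtype=int)   # default: medium
    bins[order[:k]] = 0                      # bottom third -> low
    bins[order[len(values) - k:]] = 2        # top third -> high
    return bins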
Scoring method
Each regulatory bipartite network tested here is modeled as a Bayesian network with the regulators as hidden nodes. In this network, variables are modeled using a multinomial model with Dirichlet priors, as described elsewhere [14–16]. Using a multinomial likelihood with Dirichlet parameter priors leads to the BD scoring metric used in this work [14, 17]. Modeling the parameter priors as Dirichlet distributions also helps to reduce over-fitting. See additional file 1: scoring_metric.doc for a more detailed discussion of Bayesian networks with hidden nodes. Below we provide a brief summary of the modified sampling-scoring method used in this work.
The goal of our method is to test each individual regulatory connection (regulator-gene association) in a bipartite network and identify which connections are more predictive than others for a specific data set. To do this, the score, or likelihood, of the network with all initial connections present is evaluated first. A connection is then removed and the score of the resulting network without that connection is evaluated. Non-predictive associations are identified as those whose removal improves the initial network score.
In most Bayesian network problems, a completely observed data set is used to estimate the likelihood or score of a network with the metric explained in additional file 1: scoring_metric.doc. The scoring process with complete data is fast, and thousands of networks can be scored in minutes. However, in the networks studied in this work the activity of the regulators is not observed, which produces a data set with numerous missing entries. One way to handle these missing entries is to use Gibbs sampling, as described in additional file 1: scoring_metric.doc and elsewhere [18–21]. However, using Gibbs sampling alone would require that every possible disconnection be sampled and scored, a process that is computationally impractical when many missing entries are present.
To overcome this computational limitation, we have developed a new sampling-scoring approach to evaluate the score of a network each time a regulator-gene connection is eliminated. This new POBN approach is described below.
The sampling and scoring were done using PEBL, a Python library previously developed in our group [17]. PEBL evaluates the probability of a discretized dataset given a topology using the BD (Bayesian Dirichlet) scoring metric described elsewhere [14]. The source code for PEBL is freely available online (http://code.google.com/p/pebl-project/).
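For reference, the following sketch shows a BD-family score for a single node given fully observed discrete data, using a uniform Dirichlet pseudocount. This is an illustration of the type of metric involved, not PEBL's implementation, and the pseudocount choice (alpha) is an assumption.

import numpy as np
from itertools import product
from scipy.special import gammaln

def bd_node_score(child, parents, arity=3, alpha=1.0):
    # Log BD score of one node given its parent set, for complete discrete data.
    # child:   1-D integer array of the node's states (0 .. arity-1)
    # parents: 2-D integer array with one row per parent (may have zero rows)
    # alpha:   uniform Dirichlet pseudocount per (parent configuration, child state)
    score = 0.0
    for config in product(range(arity), repeat=parents.shape[0]):
        mask = np.ones(len(child), dtype=bool)
        for parent_row, parent_state in zip(parents, config):
            mask &= (parent_row == parent_state)
        counts = np.array([np.sum(child[mask] == k) for k in range(arity)])
        # Gamma-ratio term for this parent configuration
        score += gammaln(arity * alpha) - gammaln(arity * alpha + counts.sum())
        score += np.sum(gammaln(alpha + counts) - gammaln(alpha))
    return score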
The scoring process in our simulations comprised two main steps: (1) Gibbs sampling of the missing entries using the initial global graph and the discretized data, and (2) calculation of an average score across the sampled missing entries. Each sample is a configuration of states for the missing values; one configuration assigns a single sampled state to every missing entry. The average score for each network was estimated using the BD metric and the hidden-node states sampled in step (1). This scoring process was carried out for the initial network and for each single-edge removal from that initial network. Note that Gibbs sampling was performed only for the initial network; the same set of state configurations was used to estimate the average score of each single-edge removal. This approximation assumes that the number of predictive edges is greater than the number of non-predictive edges, so the predictive edges contribute significantly more to the network score.
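The averaging step can be summarized by the following sketch, in which sampled_completions holds the fixed set of hidden-regulator state configurations produced by Gibbs sampling on the initial network and score_network stands for a complete-data scorer (for example, a sum of node scores such as the bd_node_score sketch above). All names are illustrative, and the exact averaging used in the analysis may differ.

import numpy as np

def average_network_score(network, observed_data, sampled_completions, score_network):
    # network             : topology to score, e.g. a list of (regulator, gene) edges
    # observed_data       : dict mapping each target gene to its discretized values
    # sampled_completions : list of dicts, each mapping every hidden regulator to one
    #                       sampled state configuration (one Gibbs round)
    # score_network       : function(network, complete_data) -> log score
    log_scores = []
    for hidden_states in sampled_completions:
        complete_data = {**observed_data, **hidden_states}  # fill in the missing entries
        log_scores.append(score_network(network, complete_data))
    # Average over the fixed set of sampled completions
    return float(np.mean(log_scores))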
We used an empirical approach to estimate how many configurations of states needed to be saved after Gibbs sampling in order to estimate the score of any network. The criterion used to classify a connection as predictive or non-predictive was the score rank of the initial global network relative to all tested edge disconnections (the scores of the networks after each disconnection); this evaluation process is explained in the next section. Based on this approach, sets of 100, 250, 500, and 1,000 sampled configurations were tested for a first POBN optimization cycle. We observed insignificant differences between the results obtained with 250, 500, and 1,000 rounds of sampling: of the 75-78 edges classified as non-predictive (as described in the next section), only 3 differed between these three sets, and the networks obtained by disconnecting these three discrepant edges ranked just below the threshold used to classify an edge as predictive or non-predictive. With 100 rounds of sampled configurations, 6 edges differed. As a result of this analysis, all subsequent POBN optimization cycles were run by collecting 250 configurations after Gibbs sampling of the initial network, as a balance between accuracy and computational efficiency.
Network searching and optimization process
The 189 filtered genes were mapped to target genes in the RegulonDB network. This mapping produced a bipartite graph containing 570 connections between 62 unobserved regulators and the 189 selected genes.
Given this network as an initial global network, we examined all one-edge removals, resulting in 570 additional networks to be scored (one per removed edge), for a total of 571 networks. A complete list of the connections in the initial global network is provided in additional file 2: initial_global_net.xls.
If removing an edge from the global network improved the score, the disconnected edge was labeled as non-predictive. If the initial global network did not rank first on the score list, we performed another scoring round (POBN cycle) after updating the global graph by eliminating the set of non-predictive connections. These steps were repeated until the starting global regulatory network (the initial network of each cycle) ranked first on the list as the best network. Figure 2 shows a conceptual flow diagram of this process. Source code for the optimization is provided in additional file 3: POBN_single_round.py, which also explains the data and network formats required by PEBL.
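The overall optimization loop can be summarized by the following sketch, where gibbs_sample_hidden and score_with_fixed_samples are placeholders for the sampling and averaged-scoring steps described above; this is a conceptual outline of the cycle in Figure 2 rather than the code in additional file 3.

def pobn_optimize(initial_edges, data, gibbs_sample_hidden, score_with_fixed_samples,
                  n_configurations=250):
    # Iteratively prune non-predictive edges until the current global network
    # scores at least as well as every one-edge removal
    edges = list(initial_edges)
    while True:
        # Gibbs sampling is performed once per cycle, on the current global network
        samples = gibbs_sample_hidden(edges, data, n_configurations)
        base_score = score_with_fixed_samples(edges, data, samples)

        # Score every one-edge removal using the same fixed set of samples
        non_predictive = [edge for edge in edges
                          if score_with_fixed_samples([e for e in edges if e != edge],
                                                      data, samples) > base_score]
        if not non_predictive:      # the global network ranked first: stop
            return edges
        edges = [e for e in edges if e not in non_predictive]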
Synthetic network validation
To assess the POBN algorithm's discriminatory capability, we tested the algorithm on a synthetic dataset in which the predictive and non-predictive edges are known beforehand. First, a bipartite network was defined with 4 regulators and 9 response variables. Each response variable was assigned 1-3 parents, for a total of 11 connections in the synthetic network. After defining the topology, conditional probabilities were assigned to all nodes based on their connectivity and the 3 allowed discrete values. Based on these parameters, 5,000 samples were generated to simulate a discrete static data set; full details are provided in the supplementary material (see additional file 4: synthetic_data_generation.py for the data generation script).
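The following sketch illustrates this kind of data generation for a small bipartite network. The topology and conditional probability tables below are drawn at random for illustration and are not those used in additional file 4.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_regulators, n_targets, n_states, n_samples = 4, 9, 3, 5000

# Each target gets 1-3 regulator parents (the assignment here is random;
# the synthetic network in the paper has 11 connections in total)
parents = {t: rng.choice(n_regulators, size=rng.integers(1, 4), replace=False)
           for t in range(n_targets)}

# Random conditional probability tables: a distribution over the 3 target states
# for every joint configuration of that target's parents
cpts = {t: {cfg: rng.dirichlet(np.ones(n_states))
            for cfg in product(range(n_states), repeat=len(ps))}
        for t, ps in parents.items()}

samples = np.empty((n_samples, n_regulators + n_targets), dtype=int)
for s in range(n_samples):
    reg_states = rng.integers(0, n_states, size=n_regulators)  # regulators drawn uniformly
    samples[s, :n_regulators] = reg_states
    for t, ps in parents.items():
        cfg = tuple(reg_states[p] for p in ps)
        samples[s, n_regulators + t] = rng.choice(n_states, p=cpts[t][cfg])

# Hiding the regulator columns reproduces the unobserved-regulator setting used to test POBN
observed = samples[:, n_regulators:]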
After sampling, the values of the 4 regulators were removed, making these regulator variables unobserved. Next, all possible non-predictive connections were added to the initial graph one at a time, for a total of 25 cases, i.e. 25 = (4 regulators)*(9 response variables) - (11 defined connections). Each time a single non-predictive edge was added, the weights of all edges in the graph were evaluated as described in the Scoring Method section above. The same process was repeated for all pairs of non-predictive edges that could be added to the original graph (300 pair-addition cases).
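A short sketch of how these candidate additions can be enumerated is given below; the set defined_edges is an illustrative placeholder for the 11 connections of the synthetic network, and each augmented network would then be scored as described in the Scoring Method section.

from itertools import combinations, product

regulators, targets = range(4), range(9)
# Illustrative placeholder for the 11 defined (regulator, target) connections
defined_edges = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 3), (2, 4),
                 (3, 5), (3, 6), (0, 7), (1, 8), (2, 8)}

candidate_edges = [e for e in product(regulators, targets) if e not in defined_edges]
single_additions = [[e] for e in candidate_edges]          # 4*9 - 11 = 25 cases
pair_additions = list(combinations(candidate_edges, 2))    # C(25, 2) = 300 cases
print(len(single_additions), len(pair_additions))          # 25 300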