Fast Bayesian inference for gene regulatory networks using ScanBMA

Background Genome-wide time-series data provide a rich set of information for discovering gene regulatory relationships. As genome-wide data for mammalian systems are being generated, it is critical to develop network inference methods that can handle tens of thousands of genes efficiently, provide a systematic framework for the integration of multiple data sources, and yield robust, accurate and compact gene-to-gene relationships. Results We developed and applied ScanBMA, a Bayesian inference method that incorporates external information to improve the accuracy of the inferred network. In particular, we developed a new strategy to efficiently search the model space, applied data transformations to reduce the effect of spurious relationships, and adopted the g-prior to guide the search for candidate regulators. Our method is highly computationally efficient, thus addressing the scalability issue with network inference. The method is implemented as the ScanBMA function in the networkBMA Bioconductor software package. Conclusions We compared ScanBMA to other popular methods using time series yeast data as well as time-series simulated data from the DREAM competition. We found that ScanBMA produced more compact networks with a greater proportion of true positives than the competing methods. Specifically, ScanBMA generally produced more favorable areas under the Receiver-Operating Characteristic and Precision-Recall curves than other regression-based methods and mutual-information based methods. In addition, ScanBMA is competitive with other network inference methods in terms of running time.

a particular gene is only a small fraction of the number of possible regulators.
A popular method for inferring gene regulatory networks from time series data uses Dynamic Bayesian Networks (DBN) [1][2][3][4][5]. DBN methods estimate a probabilistic graphical model, given the time-series data. DBN methods work well, but the network size that they can handle in practice is limited because of their computational cost.
Ordinary differential equations (ODE) are alternative methods for constructing networks [6,7]. These methods are deterministic rather than statistical, although ODE methods can be combined with statistical methods. DBN on local networks within a larger ODE model inference method have been used, for example [8].
Another class of methods is based on regression models in which parent nodes (regulators) are inferred for each target gene. Vector autoregressive models have been proposed for inferring causal links between genes. Often this http://www.biomedcentral.com/1752-0509/8 /47 takes the form of a model selection problem, and methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) [9,10], elastic net [11,12], and Bayesian model averaging (BMA) [13,14] have been used [15][16][17][18][19]. Morrissey, et al. [20] implemented a Markov Chain Monte Carlo (MCMC) sampler for a fully Bayesian formulation of the autoregressive model.
Mutual information methods have been used extensively on genetics data [21][22][23][24], but usually for steady-state rather than time-series data. These methods are typically non-directional. Recently, mutual information methods have been extended to analyze time-series data and produce directed networks [25,26]. Mutual information methods have the advantage of being able to identify nonlinear relationships.

Our contributions
We present a new approach using Bayesian Model Averaging (BMA) for variable selection from time-series gene expression data. Our new method offers the following advances over our previous work [18,19]: • We develop a new algorithm called ScanBMA that searches the model space more efficiently and thoroughly than previous algorithms. It is much faster than previous implementations of BMA for a large number of predictors, resulting in running time comparable to that of LASSO. It allows inference for networks of thousands of genes to be completed in minutes on a standard laptop computer. • We transform the time-series data to reduce spurious correlations. Specifically, we remove the effect of a gene on itself by subtracting the mean expression level for each gene at each time point and then using the residuals from a regression of its expression at the current time point on its expression at the previous time point. • We use Zellner's g-prior [27] for the regression parameters and show that using the g-prior to compute posterior probabilities out-performs our previous effort using the Bayesian Information Criterion (BIC). • We address the scalability of network inference methods. Our new implementation produces running times of minutes compared to hours or even days for some competing methods, thus offering substantial improvements.
We also carried out extensive empirical studies of our new method. Specifically, we compared our new method, ScanBMA, to other network construction methods in the literature, namely LASSO, the mutual information methods MRNET (Maximum Relevance/Minimum Redundancy) [24], CLR (Context Likelihood or Relatedness) [23] and ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [22], and also Dynamic Bayesian Networks (DBN) when the latter were computationally feasible.
We benchmarked the performance of our approach, ScanBMA, using two datasets. The first dataset measures the gene expression levels over time of 97 yeast segregants perturbed with the drug rapamycin. The second dataset consists of simulated time-series data from the DREAM4 (Dialogue for Reverse Engineering Assessment and Methods) challenge. For the yeast dataset, we found that our method outperformed competitors and previous analyses in recovering regulatory relationships previously reported in the literature. For the DREAM4 data, for which no prior information was available, our method performed comparably to other methods, while producing more compact networks. Finally, the ScanBMA algorithm presents a substantial improvement in running time over previous implementations of BMA. The method is implemented as the ScanBMA function in the networkBMA Bioconductor software package.

Method outline
In ScanBMA, network inference is formulated as a series of variable selection problems in which parent nodes (regulators) are inferred for each target gene. The BMA framework accounts for model uncertainty in variable selection by averaging over the posterior distributions from multiple models, weighted by their posterior model probabilities [13,14]. A challenge of BMA is to efficiently select a set of models to be averaged over. ScanBMA uses a greedy approach to explore the model space and uses the Occam's window principle [28] to eliminate unlikely models.
We previously developed a supervised framework to integrate external data sources, including co-expression, genome-wide binding, sequence polymorphism, physical interaction, genetic interaction, and literature curation data [18]. Using a training set consisting of approximately 500 known regulatory relationships in the literature, we computed prior probabilities of regulatory relationships across all candidate genes and regulators. These prior probabilities were then used to compute the posterior probabilities of of candidate regression models. We used Zellner's g-prior [27] to specify the prior for the model parameters in ScanBMA. We developed an expectationmaximization (EM) algorithm to estimate the prior variance parameter g.
Before the regression step, we apply a univariate measure (such as R-squared or BIC) to rank candidate regulators for each target gene using these prior probabilities of regulatory relationships. The parameter nvar controls the number of top regulators used in the regression step of each target gene. We have performed http://www.biomedcentral.com/1752-0509/8/47 empirical studies to study the effect of and estimate the optimal nvar.

Assessment
A number of metrics have been used to evaluate the quality of inferred networks. We focus on a few that compare the inferred network with a gold-standard network of true edges. One measure that we use is the precision of the inferred network, equal to the number of true positives divided by the total number of edges in the inferred network. Precision is important to researchers because an experiment to verify relationships identified in exploratory work can be expensive. Thus, the more confident we can be when identifying relationships, the better. In light of this, we also look at the area under the precision-recall curve (AUPRC). This gives a more comprehensive view of network quality and does not require that a threshold be chosen for the posterior probability of an edge or for the number of edges included. We also look at the area under the ROC curve (AUROC), which is widely used to assess the quality of networks.
Due to incomplete knowledge in real data, we use a partial assessment based on the YEASTRACT database [29]. This is a literature-curated repository of regulatory relationships between known transcription factors and target genes in yeast, based on more than 1,300 literature references. Due to incomplete knowledge in yeast biology, the lack of an edge in YEASTRACT is not hard evidence of the absence of a relationship between two genes, although it is used as such in our evaluation. In contrast, the true underlying networks from the simulated DREAM4 data are known, so that an absence of a true edge is a false negative and a presence of a non-existing edge is a false positive. Table 1 summarizes the assessment results for different methods applied to the yeast dataset. Our new ScanBMA method with nvar = 20 had a precision of 0.39, much higher than any other method, including our previous method, iBMA [19]. The area under the ROC curve (AUROC) for ScanBMA was also much better than for the competing methods. Note that for random guessing, the expected AUROC is 0.5 and the expected AUPRC is 0.038. For these data, the mutual information methods CLR and MRNET produced very large networks -nearly half of the number of possible edges, which is not concordant with what is known about regulatory networks of this type. Table 2 compares the preferred version of ScanBMA with four other versions, each of which lacks one of the components of our final method. This table shows that each component contributes to the accuracy of our final network. As shown by [19], the incorporation of the informative prior yields the largest increase in performance, but the other components also contribute. The use of the g-prior reduces the number of false positives, while the data transformation substantially reduces the size of the inferred network.

Results: yeast time series data
The precision-recall curves for the different methods on the yeast data are shown in Figure 1. This figure shows the quality of the ScanBMA network even beyond the 95% posterior inclusion probability cutoff. The precision stays high through a large range of recall, whereas for the other methods it quickly drops to the level of random guessing. The figure also shows that nvar = 20 performs better than nvar = 3556.
When analyzing gene networks where the number of nodes is in the thousands, computation time can be an important consideration. We compared ScanBMA with the other methods on the yeast data by running each method on 20 target genes under controlled conditions to find the average cpu time per gene. Table 3 shows that ScanBMA with nvar = 3556, i.e. considering all other genes whose expression varied as potential regulators, is within a factor of 3 of LASSO, the fastest of the other methods. Some of the mutual information methods, on the other hand, are much slower, with  MRNET taking about 50 times longer than ScanBMA. Table 3 also shows that ScanBMA produces a substantial improvement in computational efficiency over our previous method iBMA [19], especially when nvar is large. Dynamic Bayesian network methods were not included in the comparison because they analyze the entire network at once and do not scale well enough to be run feasibly on large network datasets such as the yeast data. Table 4 summarizes the results of the competing methods for the DREAM4 10-gene networks. For the DREAM4 networks, we were able to add the Dynamic Bayesian Networks method, as implemented in the ebdbnet R package [30], to the comparison because the networks are small. ScanBMA again performed best among the methods, particularly in terms of the areas under the ROC and precision-recall curves, even though no external information was available. However, the extent of ScanBMA's superiority to other methods, notably LASSO, was smaller in this case than for the yeast data, reflecting the lack of external information. The precision-recall curves from the various methods are shown for the first of the five 10-gene networks in Figure 2. Figure 3 shows the actual networks returned by each method in comparison with the true network. The ScanBMA network resembles the true network fairly well. In particular, the compactness of the ScanBMA network is apparent, particularly when compared with LASSO and MRNET. The small number of false positives may be useful in focusing the attention of the biologist on edges of high interest when searching for new regulatory relationships.

Results: simulated DREAM4 data
The results of the methods for the DREAM4 100-gene networks are summarized in Table 5. For these networks, the mutual information methods MRNET and CLR performed best in terms of the area under curve measures. ScanBMA was not quite as good by these measures, but its precision was much higher than that of any other method. Figure 4 illustrates the precision-recall curves for the various methods, showing increased precision across a broader range for ScanBMA.

Conclusions
We have presented a Bayesian Model Averaging method for inferring gene regulatory networks from time series data. It incorporates external information in a principled way via the prior edge probabilities, transforms the data Figure 1 Yeast precision-recall curves. Precision-Recall curves for different methods on the yeast data. ScanBMA was run using the g-prior, transformed data, and informative prior with nvar=20 and 3556. http://www.biomedcentral.com/1752-0509/8/47 MRNET >500 ScanBMA [20] 0.04 ScanBMA [3556] 11.2 iBMA [20] 0.08 iBMA [3556] 85 ScanBMA was run with g-prior, transformed data, informative prior. Superscript indicates value of nvar.
to reduce spurious correlations, and uses Zellner's g-prior for model parameters, with g estimated from the data. We have introduced a new algorithm, ScanBMA, to search the model space efficiently. Our method infers compact networks with higher precision than the competing methods we have assessed, important features for further analysis in searching for new regulatory relationships. We found that our method outperformed previous methods as well as LASSO and mutual information methods on yeast time-series data. In addition, our method performed comparably to competing methods, including Dynamic Bayesian Networks, on simulated data from the DREAM4 challenge, even in the absence of prior information. The networks from ScanBMA are also similar in size to the target networks.

Bayesian model averaging
In regression-based methods for network inference, we infer regulators (parent nodes) for each target gene. Hence, network inference can be formulated as a series of variable selection problems. We use the following multiple linear regression model for network inference from time-series data: iid ∼ N(0, σ 2 ε ) is the error term, for i = 1, . . . , n, t = 2, . . . , T and s = 1, . . . , S. We are particularly interested in whether β hi = 0, indicating a regulatory relationship from gene h to gene i.
One difficulty is that for gene network time series data there are typically far more potential regulators than observations. To address this problem, as well as to take into account uncertainty in model selection, we use BMA to obtain the posterior probability that each regulator is in the model [13,14]. BMA takes model uncertainty into account by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities. Thus the posterior probability that β hi = 0, also called the posterior inclusion probability of β hi , is An additional issue is that the number of possible models is too large to enumerate in a reasonable amount of time. MCMC approaches have been applied where the number of potential regulators is large [31], but they are expensive computationally. Iterative BMA has also been applied to large genetics datasets, but the number of regulators was restricted prior to analysis [18,32].

Data transformations
One concern when identifying regulatory relationships among genes is that there is a great deal of variation in gene expression levels that does not come from these interactions. For example, in the yeast data, many of the genes experience a sharp change in expression level over the first few time points caused by the application of the drug. This common trajectory is not important for inference and can produce many large correlations that do not correspond to actual interactions. In addition, we have found that removing the effect of a gene on itself can improve inference. By doing this we are removing excess variation in order to gain accuracy in inferring relationships. http://www.biomedcentral.com/1752-0509/8/47  In light of these observations, we transform the data in two steps to remove these extra sources of variation. First, we use time-adjusted data by subtracting the mean expression level for each gene i at timepoint t across strains: whereX i,t,· = S −1 S s=1 X i,t,s . This removes the overall effect of the drug.
Second, we take the residuals from regressing the gene on itself at the previous timepoint: where α i is the regression coefficient in the simple linear regression model of the expression level of gene i at time t on its expression level at time t − 1:

Specifying prior distributions
An important feature of our method is a way to incorporate external information. The Bayesian approach requires the specification of prior distributions and so includes this directly.
BMA requires two types of prior information: the prior edge probability, π hi , that potential regulator h regulates target gene i, for each h and i, and the prior distribution of the parameter vector for each model considered.
For the prior edge probabilities π hi , we considered two different prior distributions. The first is based on the empirical finding of Guelzim et al. [33] that each target gene is regulated by a small number of transcription factors, estimated as 2.76 per gene [19]. We implement this by setting π hi = 2.76/6000 for all h and i. This prior distribution does not incorporate any gene-specific information. We call it the Guelzim prior.
The second prior edge distribution we consider is that of Lo et al. [19], which uses external information in the form of expression data, binding-site data, curated literature and other sources to come up with a prior probability for each individual possible regulatory relationship. This leads to values of π hi specific to each regulator-target pair. We refer to this as the informative prior. Integration of multiple information sources has been shown to be beneficial in network construction [34,35].
As a prior for the model parameters, corresponding to the strengths of the relationships, we use Zellner's g-prior [27], as in [36]. The prior distribution of the parameter vector (β [k] 0 , where X k is the design matrix for model M k and g > 0 controls the prior variance of the regression parameters. This prior yields an analytic form for the posterior model probabilities Pr(M k |D), namely where M 0 is the null model with no regulators, d k is the number of regulators present in model M k , and R 2 k is the R 2 value for M k . This is an alternative to using BIC to approximate the posterior model probabilities, as has been done previously [32,37].
The parameter g controls the expected size of the regression parameters β hi , and is approximately equal to the prior variance of β hi /SE(β hi ), whereβ hi is the OLS estimator. This suggests that g should be at least 1, otherwise the individual β hi 's would be expected to be nearly indistinguishable from the noise even under ideal conditions. Also, the effective number of data points in the prior is n/g. Using g = n corresponds to a unit information prior and yields similar results to using the BIC approximation. Raftery [38] argued that the prior should be no more spread out than the unit information prior, suggesting a choice of g in the range 1 ≤ g ≤ n. http://www.biomedcentral.com/1752-0509/8/47 We estimate g from the data by maximum marginal likelihood, where the likelihood is summed over the model space. We do this using an EM algorithm [39,40], where the "missing data" are the γ hi 's indicating whether regulator gene h is in the model for target gene i. First, we run BMA with a selected value of g, and from this we obtain the posterior probabilities of the models used. We then maximize with respect to g. We use the new value of g in the next iteration of ScanBMA. We do this until convergence.

Searching the model space using ScanBMA
To find the models to be included in the BMA summation, we use the Occam's window principle, according to which models that are substantially worse than the best model found (by a factor of C in posterior probability) are discarded [28]. We used C = 100, based on an extensive review of conventional standards of scientific evidence. To find the models in Occam's window, we propose a new, fast algorithm called ScanBMA. The main idea of ScanBMA is to keep an active set of models around which to search. All models which can be created by adding or removing a single predictor from those in the active set are then evaluated and, if they fall within Occam's window of the current best model, are added to the active set to expand the search. The method is outlined in Algorithm 1. Because ScanBMA does not average over every model in the model space, the algorithm yields posterior inclusion probabilities of either 100% or 0% for many regulators. These extreme posterior inclusion probabilities are only approximations, and we can refine them. To estimate the posterior inclusion probability of a predictor X h with an approximate posterior probability of 100%, we calculate the ratio O −h,i of the posterior probability of the best model to that of the best model with the predictor X h removed. We then approximate the posterior probability of predictor h by O −h,i /(1 + O −h,i ), which will be less than 100%.
Similarly, to estimate the posterior inclusion probability of a predictor X h with approximate posterior probability 0%, we compute O h,i , the ratio of the posterior probability of the best model to that of the best model with the predictor X h added. Then the approximate posterior inclusion probability of predictor h is (1 + O h,i ) −1 , which will be greater than 0%.
This post-processing step yields a unique ordering among the predictors, which is useful in evaluation.
A variant of ScanBMA that we found to greatly improve computational efficiency without degrading performance is to restrict the number, nvar, of potential regulators h considered for each target gene i to those with the highest prior probabilities, π hi . Guelzim et al. [33] found that the number of transcription factors per gene has approximately an exponential distribution with mean 2.76; they did not find any of the target genes that they considered to have more than 13 regulators. As a result, we consider nvar = 20, which leaves some margin over the Guelzim maximum. We compare this with considering all genes with observed variation in expression as potential regulators, which amounts to setting nvar = 3556.

Data
To validate our method, we applied it to time-series data from a gene expression experiment on yeast as well as simulated time-series datasets from the DREAM4 competition.
The yeast data come from an experiment on 97 strains of yeast crossed from two parent strains [18]. Each strain was subjected to a treatment of the macrolide drug rapamycin, chosen to cause changes in gene transcription across the genome. Gene expression levels were measured for approximately 6,000 genes every 10 minutes from 0 to 50 minutes after the administration of the drug, using Affymetrix microarrays. The data were then filtered to remove genes that did not show significant variation over the measurements, leaving 3,556 genes.
The yeast data can be represented as a threedimensional array consisting of 3,556 genes, 97 segregants and 6 time points. There are no replicates in these data. However, the segregant axis and the time axis capture the http://www.biomedcentral.com/1752-0509/8/47 genetic and temporal variations. These yeast data are publicly available from ArrayExpress with accession number E-MTAB-412.
The simulated DREAM4 In Silico Network Challenge [41][42][43] provided 5 networks each of size 10 and 100 genes [44]. Time-series data for each network are produced by artificially perturbing a portion of the genes in the network and simulating the response of the network over time.
Specifically, the size 10 DREAM data consist of 10 genes, 21 time points and 5 replicates. The size 100 DREAM data consist of 100 genes, 21 time points, and 10 replicates.
Since our focus is on time-series data, we did not use the other data sources provided by the DREAM4 challenge. In particular, the DREAM4 challenge provided the results of simulated gene knock-out experiments for all genes, and in fact the winning entry in the competition used only the knock-out data and ignored the time series data [45]. In practice, however, this is unrealistic, since it is typically feasible to do knock-out experiments for only a small proportion of the genes. Time series data do have the potential to provide some information about all potential regulatory relationships, and our goal is to develop methods for doing this, so we have ignored the knock-out data here.

Competing methods
To evaluate the performance of our method, we compared it with LASSO [9], as well as with the mutual information methods MRNET, CLR and ARACNE. LASSO is a variable selection method based on a linear regression model, and thus can be used as an alternative to BMA. It is known for being computationally efficient.
We used the implementation of LASSO in the glmnet package in R [12], with the shrinkage penalty parameter, sometimes denoted by λ, chosen by cross-validation.
Mutual information methods have been used extensively in identifying relationships among genes [21][22][23]. Since mutual information methods are non-directional, we used a modified version inspired by [46], including the response as the first column of the matrix given to the mutual information method, and the predictors as the other columns. We then took the first column from the resulting mutual information matrix as the measure of the directed relationship from the predictors to the target gene. MRNET, CLR and ARACNE are different methods for using the mutual information to infer a weighted adjacency matrix between genes. All are implemented in the minet package in R [47].
We have also investigated the possibility of using Dynamic Bayesian Networks (DBN). Methods based on DBNs have been implemented in several R packages. These include the GeneNet package [48], but this is not designed for multiple time-series and so was not a comparable method on our datasets. The ebdbnet (Empirical Bayes Estimation of Dynamic Bayesian Networks) R package [30], and the G1DBN R package [49], do allow for multiple time-series, but they are not designed to handle networks of the size of the yeast data. They are more appropriate for networks of sizes up to a few hundred genes. We were able to use the ebdbnet package for the much smaller DREAM4 networks.

YEASTRACT: Assessment of yeast data
The YEASTRACT database [29] is a literature-curated repository of regulatory relationships between known transcription factors and target genes in yeast, based on more than 1,300 literature references. Among the 3,556 genes in our filtered yeast time-series data, this database contains documented regulatory relationships for 127 transcription factors (TFs), encompassing a total of 17,173 edges. We therefore evaluate only inferred relationships from these 127 transcription factors, and inferred regulatory relationships from other genes are not used in evaluation.