### Data

The NASC array database [2] contained data for 2,904 arrays when this research began, and these are the data used in our analyses. They derive from samples taken from a range of plant organs under many different environmental conditions and treatments, capturing a large proportion of the *Arabidopsis* gene expression repertoire. The compendium includes time series and perturbation experiments; however, each microarray is treated as independent, and no explicit use is made of the time series, mutant or perturbation designs. MAS5.0 summarization is provided by default by NASC arrays and was used for convenience in this study. Other normalization methods have since emerged that outperform MAS5.0 (see [40] for a comparison). However, the data are subsequently coarsely discretized into three classes, which removes most of the differences between these approaches, particularly since the largest differences are expected for consistently low-expression genes, which are excluded from our analysis by the expression filter detailed in the next section.

### Data quantization and filtering

The gene expression signal values are quantized into three classes (denoted LOW/MED/HIGH) on a per-gene basis, for two reasons: (i) genes are expressed in different quantities, so a weakly expressed gene at its maximum may still have a lower signal value than a more abundant gene at its minimum; and (ii) per-gene quantization allows the data to be split into three equal-sized classes (equal probability mass). This ensures that all quantized genes have very similar entropy, which is important in the model selection phase of the incremental network learning algorithm described later. Figure 2 shows example gene expression histograms and decision boundaries for four clock genes, and Figure 4 shows the quantized gene expression profiles of the genes analysed in the second section of Results. In other work, differential expression analysis of data from specifically designed experiments (condition against control) has been used to create under-expressed, normal, and over-expressed classes for each gene [3]. However, our large compendium of array data, taken from many different conditions, cannot be processed in this way, so to avoid selection bias later we quantize each gene into three equal-sized classes. Pe'er *et al*. [41] fit a mixture of Gaussians to the expression signals for each gene, under the assumption that the gene occupies one of a few discrete functional expression states related to its activity, with each mixture component corresponding to a state or class. However, analysis of our expression profiles (e.g. Figure 2) reveals no natural splits indicating clusters linked to the active state of a gene, so we have not tried to cluster the signals; moreover, allowing a different number of classes per gene would introduce selection bias at the incremental network-growing stage of our algorithm.
Methods that employ a predictive discretization procedure as part of the model selection process (jointly optimising a discretization policy and the model structure) have been devised [42], and show that the discretization policy does affect the networks inferred. We do not currently attempt this, owing to the added complexity involved and to our main aim of adding genes to a network from a large set of (potentially thousands of) candidate genes, which would make this approach infeasible.
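A minimal sketch of this per-gene tercile quantization (the function name and the simulated signal are our own illustration, not part of the study's pipeline):

```python
import numpy as np

def quantize_terciles(signal):
    """Quantize one gene's expression signals into three equal-sized
    classes (0=LOW, 1=MED, 2=HIGH) using its own empirical terciles."""
    lo, hi = np.quantile(signal, [1 / 3, 2 / 3])
    classes = np.digitize(signal, [lo, hi])  # 0 below lo, 1 in [lo, hi), 2 above
    return classes, (lo, hi)

# hypothetical continuous signal for one gene across 2,904 arrays
rng = np.random.default_rng(0)
signal = rng.lognormal(mean=4.0, sigma=1.0, size=2904)
classes, (lo, hi) = quantize_terciles(signal)
counts = np.bincount(classes, minlength=3)
# per-gene terciles give near-equal class sizes, hence near-maximal entropy
```

Because the boundaries are computed per gene, every quantized gene ends up with close to one third of its observations in each class, which is what equalizes the entropy across genes.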

Three classes are chosen for a simple reason: three is the smallest number of discrete classes that allows non-linear relationships to be captured. It is anticipated that little extra information would be gained from four or more classes, while the number of free parameters in the model would become too large. With three classes the LOW and HIGH categories are separated by a MED class, whereas with only two classes expression levels close to the decision boundary may be misclassified.

An expression distribution filter is also used to remove those genes (strictly, probe sets) whose decision class boundaries are within 10 units of each other (raw signal values; see Figure 2), since the lack of a significant difference between LOW and HIGH expression values for a gene may indicate a lack of true signal in the data. This ensures that the expression levels defining the LOW and HIGH classes differ by enough that the classification is not simply down to measurement noise. A signal value of less than 20 is often regarded as background noise, i.e. the gene is actually off. The expression distribution filter acts as a low-entropy filter on the continuous expression data, removing genes with little variability in their signal values; this inherently filters out genes with overall low expression. Generally, a gene that always has a low expression level is trivially predictable (always low/off), and of no use to any model. This filtering step reduces the set of genes from 22,815 to 15,172. In practice it removes low-entropy genes for which around two-thirds of the signal values are less than 20 (off/background noise).

As an example of a gene excluded by the expression distribution filter: we wished to include *GATA9* (a duplicated version of *GATA12*) as a *GATA* factor of interest, but quantizing its expression signal values into three equal-sized classes would require thresholds of 12.56 and 21.01. With a difference between thresholds of less than 10 units, this gene exhibits a consistently low signal with low variance and is excluded.
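The boundary-gap test can be sketched as follows; the `flat_gene` values below are hypothetical stand-ins for a GATA9-like low-variance profile, not actual NASC signal values:

```python
import numpy as np

def passes_distribution_filter(signal, min_gap=10.0):
    """Keep a probe set only if its two tercile decision boundaries
    (LOW/MED and MED/HIGH, in raw signal units) are >= min_gap apart."""
    lo, hi = np.quantile(signal, [1 / 3, 2 / 3])
    return (hi - lo) >= min_gap

rng = np.random.default_rng(1)
# GATA9-like profile: consistently low signal, low variance (hypothetical values)
flat_gene = rng.normal(loc=17.0, scale=4.0, size=2904)
# a gene with genuine dynamic range across conditions
varying_gene = rng.lognormal(mean=5.0, sigma=1.0, size=2904)
```

A gene like `flat_gene`, whose signals sit almost entirely in the background-noise range, has tercile boundaries only a few units apart and is removed, whereas `varying_gene` passes.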

### Bayesian networks for learning GRNs

The gene regulatory network is modelled with a discrete static Bayesian network (for an introduction to Bayesian networks see [43]). Our aim here is to learn the model structure *S* for the Bayesian network. The model structure is defined by a directed acyclic graph (DAG) encoding the dependencies between the variables (genes). The method aims to learn the model which is most likely to have generated the quantized gene expression data, *D*. We could choose to calculate the marginal likelihood p(*D*|*S*^{h}) for each model structure *S*^{h}, but for reasons of efficiency an approximate BIC score is calculated as BIC(*S*^{h}) = ln p(*D*|*S*^{h}, θ̂) - (*d*/2) ln *N*, where θ̂ is the maximum likelihood estimate of the model parameters for a model with structure *S*^{h}, *d* is the number of parameters in the model, and *N* is the size of the dataset. In practice, the BIC score tends to score DAGs with fewer edges relatively more highly than the marginal likelihood does. No prior on model structures is used. A greedy hill-climbing search, in which edges are added, reversed or deleted at each iteration, is run until an optimum is reached. Fifty restarts (from random initial DAGs) are performed in order to avoid local optima.
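For a discrete network the BIC score above can be computed directly from class counts. The sketch below is our own illustration (the hill-climbing structure search itself is omitted); it scores a fixed DAG given as a parent list:

```python
import numpy as np
from itertools import product

def bic_score(data, parents, n_states=3):
    """BIC = log-likelihood at the MLE minus (d/2) ln N, for a discrete
    Bayesian network. `data` is an (N, n_vars) int array of class labels;
    `parents[i]` lists the parent columns of variable i."""
    N = data.shape[0]
    loglik, d = 0.0, 0
    for i, pa in enumerate(parents):
        # d counts free parameters: (n_states - 1) per parent configuration
        d += (n_states - 1) * n_states ** len(pa)
        for config in product(range(n_states), repeat=len(pa)):
            mask = np.all(data[:, pa] == config, axis=1) if pa else np.ones(N, bool)
            counts = np.bincount(data[mask, i], minlength=n_states)
            total = counts.sum()
            if total == 0:
                continue
            probs = counts / total  # maximum likelihood estimate of the CPT row
            nz = counts > 0
            loglik += np.sum(counts[nz] * np.log(probs[nz]))
    return loglik - 0.5 * d * np.log(N)

# two unconnected genes on random data: each contributes about N*ln(1/3)
rng = np.random.default_rng(2)
data = rng.integers(0, 3, size=(300, 2))
score = bic_score(data, parents=[[], []])
```

A search procedure would call `bic_score` on each candidate DAG while adding, reversing or deleting edges.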

### Incremental growing of networks

Starting from a set of genes known to be involved in the biological process of interest, the Bayesian network learning algorithm detailed above may be used to find the gene network most likely to have generated the observed gene expression patterns. On each iteration, each possible gene from an extended set (either a selection made by an expert from the literature, or the full set of all possible genes) is added separately to the set of genes in the current model (see Figure 1 for an overview and below for a formal algorithmic description). For each of these sets, the 'best' network is learned from the training data, as described in the previous section. The scores are compared, and the network with the highest score is accepted as the model at this stage (see Figure 7 for an example plot of the distribution of these scores). This model comparison or model selection phase is a notoriously hard and unsolved problem in the machine learning community. However, through the quantization steps outlined above, the entropy of all the gene sets under consideration at any iteration is the same, removing this bias from the model selection task. This gives an order in which to add genes to the model; ideally this indicates which genes are more likely than others to be explained by the inferred network, and therefore related to the initial genes. The procedure can be described algorithmically as follows:

1. Determine the set G of all candidate probes for inclusion in the network (representing all 15,000+ genes, or an extended selection identified as possibly of interest), G = {*g*_{j} : j = 1,…,N_{g}}, together with a set of initial genes of interest *s*_{0}, and decide how large a network to grow, setting max_net_size = min(N_{g}, some limit).

2. At each iteration *i*, let *s*_{i} denote the best gene set. For each gene *g* ∈ G \ *s*_{i-1}, form a candidate gene set *s*_{i}' = *s*_{i-1} ∪ {*g*} and learn a Bayesian network structure *S*_{i}' that maximises the BIC score for the gene set *s*_{i}'. If BIC(*S*_{i}') > best_score, then update best_score and save the best set and structure as *s*_{i} = *s*_{i}' and *S*_{i} = *S*_{i}' respectively.

3. Repeat step 2, incrementing *i*, whilst |*s*_{i}| < max_net_size, resetting best_score to -∞ at the start of each iteration.
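Steps 1-3 can be sketched as follows, assuming a hypothetical `score_fn` that encapsulates the hill-climbing structure learning of the previous section (returning a score and a learned structure for a given gene set):

```python
def grow_network(initial_genes, candidate_genes, data, score_fn, max_net_size):
    """Greedy incremental growth: at each iteration, try adding every
    remaining candidate gene, learn the best-scoring network for each
    enlarged set, and keep the single best addition.

    `score_fn(gene_set, data)` is assumed to learn a structure for the
    given genes (e.g. greedy hill climbing with restarts) and return
    a (score, structure) pair."""
    current = list(initial_genes)
    history = []  # (gene added, score, structure) per iteration
    while len(current) < max_net_size:
        best_score, best_gene, best_structure = float("-inf"), None, None
        for g in candidate_genes:
            if g in current:
                continue
            score, structure = score_fn(current + [g], data)
            if score > best_score:
                best_score, best_gene, best_structure = score, g, structure
        if best_gene is None:  # candidate pool exhausted
            break
        current.append(best_gene)
        history.append((best_gene, best_score, best_structure))
    return current, history
```

The `history` list records the order in which genes were added, which is the ranking the text describes.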

Generally, it is not possible to compare network structures on different data sets, since the likelihood term p(*D*|*S*^{h}, θ̂) is data dependent. This section explains how the quantization steps undertaken allow model comparison between networks formed from different gene sets. The likelihood decomposes as a product over the individual observations x ∈ *D*: p(*D*|*S*) = ∏ p(x|*S*). In our case we have *N* = 2,904 microarray observations, and each x covers the current selection of genes. Each p(x|*S*) further decomposes as a product of the conditional probabilities of each variable (gene) given its parents: p(x|*S*) = ∏ p(*x*_{i}|parents(*x*_{i})). The contribution from an unconnected gene *g* is therefore ∏ p(*g*) over all observations, which is (1/3)^{N} (*N* observations, each with probability 1/3 of being correct, since the classes are of equal size). This contributes *N* ln(1/3) to the log likelihood. Owing to the quantization, every unconnected gene contributes the same amount to the likelihood. To compare directly the structure *S*_{A} learned from *s*_{i} ∪ *g*_{A} with the structure *S*_{B} learned from *s*_{i} ∪ *g*_{B}, we would need to add gene *g*_{B} unconnected to *S*_{A}, and *g*_{A} unconnected to *S*_{B}. However, due to quantization, *g*_{A} and *g*_{B}, when not connected to any other genes in the model, contribute the same amount to the likelihood (and to the penalization term of the BIC score). These terms therefore cancel, and we can directly compare the scores BIC(*S*_{A}) vs BIC(*S*_{B}). The choice of quantization, ensuring each gene has the same entropy, thus allows model comparison between networks of the same size that differ by a single gene.
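The cancellation argument can be checked numerically: any gene quantized into equal thirds contributes N ln(1/3) to the log likelihood when unconnected, regardless of which gene it is. A self-contained illustration with hypothetical class labels:

```python
import numpy as np

N = 2904  # number of microarray observations
# two different "genes", each quantized into exactly equal thirds
labels_a = np.repeat([0, 1, 2], N // 3)
labels_b = np.random.default_rng(4).permutation(labels_a)

def unconnected_loglik(labels, n_states=3):
    """Log likelihood contribution of a parentless node at the MLE."""
    counts = np.bincount(labels, minlength=n_states)
    probs = counts / counts.sum()  # MLE of the marginal distribution
    return np.sum(counts * np.log(probs))

# both contributions equal N*ln(1/3), so they cancel in any BIC comparison
contrib_a = unconnected_loglik(labels_a)
contrib_b = unconnected_loglik(labels_b)
```

Since the BIC penalty for an unconnected three-state node is likewise identical (two free parameters), neither term can tip a comparison between BIC(*S*_{A}) and BIC(*S*_{B}).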

Iteratively learning networks exhaustively from a full set of 15,000+ genes is possible, but computationally time consuming. The scheme is embarrassingly parallel, and a speed-up close to linear in the number of processors is achieved. Future work may investigate exploiting cheap mutual-information-based measures to pre-select genes for the expensive network learning stage.

The main contribution of this algorithm is the incremental addition of genes from a large (potentially whole-genome) set. Its purpose is to identify a network that incorporates the initial key genes of interest and to form a hypothesis of how they fit into a larger network - one that is statistically likely to have generated the observed data. This is done in a greedy manner to make the problem tractable. If we consider adding *k* genes from a set of *N* possible genes, the greedy search creates fewer than *kN* sets from which to learn networks: first *N* sets, each with one new gene, are created and the best chosen; then a further *N*-1 sets are created and the best chosen; and so on, so the actual number of sets is *N* + (*N*-1) + … + (*N*-*k*+1) = *k*(*N* - (*k*-1)/2). An exhaustive search requires all ^{N}C_{k} possible sets to be created and evaluated. For the interesting scenarios, *k* << *N* and *N*!/((*N*-*k*)!*k*!) is of the order *N*^{k}. So the speed-up obtained (without optimization) is of the order *N*^{k} vs *kN*; with *N* = 10,000 and *k* = 10, this gives a speed-up of around 10^{35}, making the problem tractable.

To demonstrate the trade-off between speed and accuracy, we consider a small-scale example, starting from the second example in Results: from the 9 key genes, an exhaustive search over all combinations of three genes from the remaining 28 is used to learn networks of 12 genes (3276 sets), and a score is obtained for each. On the same example, the greedy search learns networks on only 81 gene sets. Genes are added by our greedy search in the order COL2, COL1, PRR5 (then LUX and PRR7). All the top-scoring networks in the exhaustive search contain both COL1 and COL2, with a third gene from {PRR5, CRY1, PRR7, LUX}, all with similarly good scores. In three sets of runs of the exhaustive learning, each with 50 restarts of the greedy algorithm for learning networks (within the greedy search for iterative addition of genes), a small variation in the best scores of the learned networks was observed, as a global optimum was not reached in every case (there are more than 10^{25} possible DAGs on 12 nodes, so this is not surprising). The order in which genes are added appears to be good, and is subject to noise in network learning rather than in the addition order itself: if a gene is missed from the order, it is likely to be added at the next iteration anyway.

A statistically rigorous stopping criterion has not yet been devised. This has not been a focus, since as the networks grow, large changes do not generally occur; new genes are added at the periphery, with much of the network unchanged from one iteration to the next. Typically 3 or 4 edges change when each gene is added, and 2 or 3 of these involve the added gene, so when to stop is not a critical issue.
Currently, a practical network size in the low twenties is used when learning from the whole genome, while for small gene lists (up to 40 genes) the whole set is used.
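The search-cost comparison can be verified with a few lines of arithmetic; the counts for the 28-candidate example match those quoted in the text:

```python
from math import comb

def greedy_sets(N, k):
    """Number of candidate gene sets built by the greedy search:
    N + (N-1) + ... + (N-k+1), one pass over the shrinking candidate
    pool per added gene."""
    return sum(N - i for i in range(k))

# the small-scale example: 28 remaining candidates, add 3 genes
assert comb(28, 3) == 3276        # exhaustive search
assert greedy_sets(28, 3) == 81   # greedy search

# whole-genome scale: N = 10,000 candidates, k = 10 additions
n_greedy = greedy_sets(10_000, 10)   # 99,955 sets, i.e. fewer than kN = 100,000
n_exhaustive = comb(10_000, 10)      # astronomically many subsets
```

The ratio `n_exhaustive / n_greedy` is what makes the exhaustive approach infeasible at genome scale.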

The networks learned will be related to the initial genes. Starting with a set of genes involved in a specific process, genes involved in more general global regulation of expression (such as the circadian clock) may be incorporated; conversely, starting from a set of genes involved in general global regulation, it is unlikely that the same specific genes would be added - other general regulatory genes would be added first. As the iterations continue, genes less related to the initial genes are added. Indeed, different networks are learned when iteratively adding genes from different lists (cf. Figures 5 & 6). The algorithm finds the statistically best explanation from a given gene list. Thus, the network in Figure 5 is a good network, likely to have generated the expression data for those genes, whereas the network in Figure 6, which adds genes from the whole genome, does not contain links between the other groups of genes (i.e. cold/salt stress). We can therefore either constrain the list to find a network capturing the relationships between a chosen set of genes, or identify a network incorporating genes from an unconstrained (whole-genome) list.

### Conditional independence, Markov equivalence, CPDAGs & CPTs

The learned network structures encode the conditional independence relations between the variables (genes). The resulting Bayesian network is represented by a DAG and the corresponding conditional probability distributions (parameter tables). However, a number of structures can be Markov equivalent, and this set of equivalent DAGs can be represented as a Completed Partially Directed Acyclic Graph (CPDAG), in which the direction of a regulatory relationship is only indicated where a causal direction can be inferred - specifically, where there is strong enough evidence of a v-structure, i.e. two parents of a common child that are not themselves connected. Thus, rather than just capturing correlations between genes, we form a predictive model that is most likely to have generated the observed data. The learned network models shown in Figures 3, 5 & 6 are CPDAGs, each representing an equivalence class of DAGs with the same predictive ability, i.e. equally likely to have generated the observed data.

### Generation of synthetic data for performance evaluation

In order to perform quantitative evaluation of our algorithms, the underlying ground-truth network must be known. We therefore construct a network, generate data from it, and then attempt to recover the original network from the data. To investigate the influence of the underlying network structure on the learning algorithm, we wish to control the strength of the relationships, rather than using random CPDs which may by chance be strong or very weak. In the real microarray data, each gene's expression values are quantized into 3 equal-sized classes, which is essential in the model comparison step of the iterative growing of networks; our discrete synthetic data should have the same property.

The CPTs represent multinomial distributions, for which the Dirichlet distribution is the conjugate prior; we can therefore generate data whilst controlling the strength of the relationships between variables. If each row of a CPT is drawn from Dir(α_{1}, α_{2}, α_{3}), the alpha parameters control the strength of the relationships. When all α_{i} = 1, each row is drawn uniformly from the probability simplex. For α_{i} >> 1, the rows concentrate around the uniform distribution (1/3, 1/3, 1/3), which has high entropy and is not predictive; for α_{i} << 1, they tend towards deterministic relationships, with one state taking almost all the probability mass. Inspection of the CPTs learned from data shows empirically that they are far from fully deterministic, but more informative than uniform (information-less), so alpha parameters of less than one are expected. From the network learned on 35 genes (see Results), the learned CPTs were used to estimate the parameters of a Dirichlet distribution [44], giving α = (0.6, 0.9, 0.6) as the maximum likelihood estimate. The rows of the learned CPTs show that the middle entry (MED) has the smallest variance, rarely taking very small or very large probabilities compared with the LOW and HIGH states; the MED class separates the LOW and HIGH expression values and captures some noise in the data, so we would inherently expect this state to be more uniform than the other two.

For each gene, the probability of being in a particular state given a particular configuration of parent states is drawn from a Dirichlet distribution with α = (0.6, 0.9, 0.6); α controls the entropy of the Dirichlet sampling. The expected frequencies of parent states for each gene are taken into account when allocating the CPTs, in order to maintain roughly equal frequencies of the states of each node.
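Sampling CPT rows from this Dirichlet can be sketched as follows (the helper name is our own, and the expected-parent-frequency adjustment described above is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = (0.6, 0.9, 0.6)  # ML estimate reported in the text (LOW, MED, HIGH)

def sample_cpt(n_parent_configs, alpha=alpha, rng=rng):
    """Draw each row of a conditional probability table independently
    from Dir(alpha); rows index parent-state configurations, columns
    the LOW/MED/HIGH states of the child gene."""
    return rng.dirichlet(alpha, size=n_parent_configs)

cpt = sample_cpt(9)  # e.g. a gene with two three-state parents (3^2 rows)
```

With these sub-unity alphas, most sampled rows put the bulk of their mass on one or two states, mimicking the informative but non-deterministic CPTs observed in the learned networks.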

When this model (the Bayesian network with its CPTs) is sampled from, a dataset is formed from which we can reverse-engineer the known network. We perform network inference on different-sized data sets (by truncating the largest one). Network structures are learned as described previously, again starting from nodes corresponding to the same 9 key genes as before. Networks are learned on datasets of size 10, 100, 500, 1000, 2000, 3000, 5000 and 10,000 to demonstrate behaviour across dataset sizes.
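Generating a dataset from the network with its CPTs amounts to ancestral sampling; a minimal sketch, assuming nodes are indexed in topological order (parents before children):

```python
import numpy as np

def ancestral_sample(parents, cpts, n_samples, rng):
    """Sample observations from a discrete 3-state Bayesian network.
    Nodes are assumed to be in topological order; cpts[i] has one row
    per base-3-encoded parent configuration and one column per state."""
    n = len(parents)
    data = np.zeros((n_samples, n), dtype=int)
    for s in range(n_samples):
        for i in range(n):
            row = 0
            for p in parents[i]:
                row = row * 3 + data[s, p]  # encode the parents' sampled states
            data[s, i] = rng.choice(3, p=cpts[i][row])
    return data

# toy example: node 1 deterministically copies node 0
rng = np.random.default_rng(6)
cpts = [np.array([[1 / 3, 1 / 3, 1 / 3]]), np.eye(3)]
data = ancestral_sample([[], [0]], cpts, 100, rng)
```

Truncating the rows of one large sampled dataset then yields the nested datasets of sizes 10 through 10,000 used for the evaluation.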

### Evaluation metrics

The network structures are evaluated based on counts of correctly and incorrectly inferred edges. Since a number of DAGs are Markov equivalent, the CPDAG of each DAG is used when a learned DAG is compared with the underlying one. A predicted edge is counted as a false positive (FP) if the corresponding edge in the original network is not present, or if the predicted edge is compelled in one direction and the corresponding edge in the original network is compelled in the opposite direction; otherwise it is counted as a true positive (TP). A predicted non-edge is counted as a true negative (TN) if the corresponding edge in the original network is also absent; otherwise it is counted as a false negative (FN). Note that the comparison is made against the whole network, even when the model includes only a limited number of nodes, since the algorithm adds nodes iteratively.
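The edge-counting rules can be sketched as follows; the dict-based CPDAG encoding is our own illustration, not from the text:

```python
def score_edges(predicted, truth, nodes):
    """Count TP/FP/TN/FN over all node pairs by comparing CPDAGs.
    Each CPDAG is a dict mapping frozenset({a, b}) to None for an
    undirected (reversible) edge, or to a (parent, child) tuple when
    the direction is compelled; absent pairs are non-edges."""
    tp = fp = tn = fn = 0
    for idx, a in enumerate(nodes):
        for b in nodes[idx + 1:]:
            p = predicted.get(frozenset((a, b)), "absent")
            t = truth.get(frozenset((a, b)), "absent")
            if p == "absent":
                if t == "absent":
                    tn += 1
                else:
                    fn += 1
            elif t == "absent":
                fp += 1  # predicted an edge the original network lacks
            elif p is not None and t is not None and p != t:
                fp += 1  # compelled in opposite directions
            else:
                tp += 1  # edge present and directions compatible
    return tp, fp, tn, fn

# toy example on three genes: one direction conflict, one compatible edge
truth = {frozenset(("A", "B")): None, frozenset(("A", "C")): ("A", "C")}
pred = {frozenset(("A", "B")): ("A", "B"), frozenset(("A", "C")): ("C", "A")}
result = score_edges(pred, truth, ["A", "B", "C"])
```

Note that a predicted compelled edge matching an undirected edge in the truth counts as a TP, mirroring the rule that only opposite compelled directions yield an FP.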