Dissecting the fission yeast regulatory network reveals phase-specific control elements of its cell cycle

Background Fission yeast Schizosaccharomyces pombe and budding yeast Saccharomyces cerevisiae are among the original model organisms in the study of the cell-division cycle. Unlike budding yeast, no large-scale regulatory network has been constructed for fission yeast. It has only been partially characterized. As a result, important regulatory cascades in budding yeast have no known or complete counterpart in fission yeast. Results By integrating genome-wide data from multiple time course cell cycle microarray experiments we reconstructed a gene regulatory network. Based on the network, we discovered in addition to previously known regulatory hubs in M phase, a new putative regulatory hub in the form of the HMG box transcription factor SPBC19G7.04. Further, we inferred periodic activities of several less known transcription factors over the course of the cell cycle, identified over 500 putative regulatory targets and detected many new phase-specific and conserved cis-regulatory motifs. In particular, we show that SPBC19G7.04 has highly significant periodic activity that peaks in early M phase, which is coordinated with the late G2 activity of the forkhead transcription factor fkh2. Finally, using an enhanced Bayesian algorithm to co-cluster the expression data, we obtained 31 clusters of co-regulated genes 1) which constitute regulatory modules from different phases of the cell cycle, 2) whose phase order is coherent across the 10 time course experiments, and 3) which lead to identification of phase-specific control elements at both the transcriptional and post-transcriptional levels in S. pombe. In particular, the ribosome biogenesis clusters expressed in G2 phase reveal new, highly conserved RNA motifs. Conclusion Using a systems-level analysis of the phase-specific nature of the S. pombe cell cycle gene regulation, we have provided new testable evidence for post-transcriptional regulation in the G2 phase of the fission yeast cell cycle. Based on this comprehensive gene regulatory network, we demonstrated how one can generate and investigate plausible hypotheses on fission yeast cell cycle regulation which can potentially be explored experimentally.


The Co-clustering Regression Model
, denote the expected value of the relative expression level ), (t y ge at time t for an individual gene g in experiment e. Assuming that the random errors ) (t ge ! are standard normally distributed, we model the expression value y ge (t) using the regression model ( with basis functions given by For non-time course data, we have . To identify co-clustered genes, it is reasonable to use the first full cycle time period data for every gene since it is least affected by the loss of cell phase synchronization. Clearly, using asynchronous data produces noisy clusters. Moreover, if we restrict ourselves to the first period of the cell cycle, then it is easy to see that the RPM model of [3] can be approximated by the above model (1). An advantage of using (1) is that it is linear in the unknown parameters and hence is easy to fit. Thus, we can reasonable assume that the model is equivalent to the RPM model.

Combined cluster model
The . For regression coefficients we use the Gaussian prior and for error variances we use the inverse gamma prior, which are commonly used conjugate priors [1]. The model specification is completed with a prior distribution placed on the clustering configuration C. Here we use a uniform prior over the space of all clustering configurations. To avoid outlier genes from forming stray clusters, we limited clusters to a minimum size of five genes.
The clustering algorithm of [4] requires calculations involving * posterior covariance matrix of the basis function coefficients for each cluster. Here ) (t X is a matrix whose columns are the basis functions described above and I is the identity matrix. Since we have a relatively large number of experiments, including some long time courses, these matrix inversions and subsequent matrix multiplications are computationally demanding. Gram-Schmidt orthonormalization of the basis functions is therefore used to reduce the computational burden, without changing the basic structure of the regression model as in [4].

Automatic relevancy detection for experiments within a cluster
Since data used in this study are from independent experiments performed under varied conditions, there could be considerable variation among the expression profiles.
Consequently, even though a pair of genes is theoretically co-expressed, often the observed data from the ten experiments may reveal that they are co-expressed in only a subset of the experiments and not necessarily in all ten experiments. Thus one of our goals is to identify subsets of co-expressed genes even though, due to variations among the observed data, they may not necessarily be co-expressed in all ten experiments.
The above problem can be formulated within the Bayesian framework by using a mixture distribution for the error variance . If an experiment does not demonstrate coexpression of the genes in cluster k, then the corresponding data suggest a regression model with a relatively high variance. We consider a two component mixture model, with both components modeled by inverse-gamma distribution ) , , the conjugate prior.
Hence we propose the following model are chosen to reflect low and high variance, respectively. In our application we chose 1 ' = = ! ! , and 10 ' , . The resulting probability density functions for the two components are shown in Figure S5.

Clustering Algorithm
Given the observed data, the above probabilistic model results in a posterior distribution on the clustering and variance component allocations, given up to proportionality by, where the final term in the product is the density of all of the observed data given ) , ( z C .
We then use ! as the objective function for our agglomerative clustering procedure.

Agglomerative clustering
Although the statistical model described here is an extension of [1,4], the agglomerative clustering algorithm of [4] can still be applied here. Higher potential scores ! can be achieved through various schemes to relocate genes to different clusters after agglomeration. One such method is used here, whilst full details of a range of methods will appear in a separate paper. The revised agglomerative clustering algorithm for N genes proceeds as follows: • Step 1: Start with C=N clusters, each cluster containing the expression levels for one gene. Calculate the potential N ! using (2). k, c kl =c lk , and hence identify the new nearest cluster to k, k $ '. • Step 9: Repeat Steps 3-8 until C=1. • Step 10: Looking back over the clusterings visited, find the number of clusters C in the hierarchy maximizing the posterior distribution, argmax C π C . This is our optimal configuration (C * ,z * ).

Note that
Step 9 determines the number of clusters in our optimal clustering. To avoid small clusters which may be difficult to interpret, we limited this analysis to clusters containing at least 5 genes.     and blue spikes represent statistically significant motif matches for potential forkhead and HMG-1 box binding sites in the upstream sequences of genes from Cluster 4 (see Table 2 in main text). A black triangle marks the end of an intergenic region.