Expression data
We selected seven gene expression datasets from the GEO database [25]. They are time series datasets in the embryonic development of seven different tissues in mouse, including liver development (GSE13149), eye development (GSE13103), lung development (GSE11539), brain development (GSE8091), cardiac development (GSE5671), testis development (GSE6881) and ovary development (GSE6882). We directly used their "series matrix" in GEO to calculate the Pearson correlations between gene pairs. Note that a gene may have several probesets. For a gene pair, we took the maximal correlation of their probeset pairs. Here we concentrated on the brain development data.
In addition, we extracted experimental gene expression data of Tcf3 (GSE16375) [23]. These data depict gene expressions before and after the silence of Tcf3 by doxycycline treatment. We took the mean expressions for different replicates.
Assign CRM and microRNA target genes
We used MOPAT[18], a motif pair tree method, to predict 144490 CRMs, consisting of 494 motifs for 5355 mouse development genes.
MiRNAs and their predicted target genes were assembled from TargetScanMouse (release 5.1) [19]. Since miRNAs in the same miRNA family are similar in regulation, all miRNAs were classified into miRNA families using the annotation in miRBase [26]. In all, we analyzed 135 miRNA families and 8811 target genes. For simplicity, we shortened "miRNA family" to "miRNA" in the article.
All gene symbols were converted to MGI Marker Accession ID using the MGI database [27].
Evaluation of the significance of coherence
To assign the significance of the CRM-miRNA co-regulated gene set, we performed a randomization test. For each CRM-miRNA target gene set, we randomly assigned its co-regulated genes from all CRM target genes and kept the size of the gene set. Meanwhile, for the randomization of CRMs and miRNAs, we randomly assign their target genes from the whole gene set. Note that for CRM-miRNA module, genes were not picked up from the whole gene set since we wanted to exclude the influence of CRM and focus exclusively on the effect of miRNAs.
We repeated the test five times to form the background distribution. Moreover, we used 100 tests to calculate the p-value, defined as the proportion of random sets that had the same, or higher, coherence than the real set. We have considered multiple testing problems and have corrected the p values using FDR modification.
Meanwhile, to directly evaluate the effects of miRNAs, we generated some artificial miRNAs by randomly assigning their target genes from the target genes of all miRNAs. Here we kept the distribution of the miRNAs' target gene numbers [28]. This randomization test was performed five times to form the background distribution.
Applying these two kinds of randomizations, we could construct two backgrounds and evaluate the significance of every CRM-miRNA module.
Construction of the combinatorial regulation network
CRM → miRNA
The pri-miRNAs and their location in the genome were downloaded from miRBase [26]. Their upstream 5kb sequences were extracted from Ensembl [29].
We first applied the same approach as MOPAT [18] to search the TF binding sites in the pri-miRNA's upstream region. We extracted the position weight matrices from TRANSFAC 9.2 [30] and calculated the log-likelihood ratio scores on every site of the pri-miRNA upstream sequence. For every motif, the site with a score larger than the cutoff was predicted to be a binding site. Here the cutoff was calculated as the 99.99% quartile of a random sequence.
A CRM was predicted to regulate the miRNA if its containing motifs were all identified to have binding sites in the pri-miRNA's upstream sequence and these binding sites were within 200bp.
miRNA → CRM
All mouse position weight matrixs in TRANSFAC were mapped to the genes encoding the TFs that bind these PWMs [8]. A miRNA was predicted to target a CRM if at least one TF in the CRM had corresponding genes targeted by the miRNA.
In all, we identified 695432 CRM → miRNA pairs and 5060 miRNA → CRM pairs.
Evaluation of the significance of patterns in the regulation network
We constructed 100 random networks to measure the significance of every regulatory pattern by swapping the edges of the real network randomly. Thus, each random network had the same number of CRMs, miRNAs and genes with the actual network. We also kept the same number of CRM-gene, miRNA-gene and CRM-miRNA interaction pairs, respectively. For every pattern, a z-score was calculated as the difference of its actual occurrence and the average of its random occurrences, normalized by the standard deviation of the random occurrences. Patterns with high z-score were supposed to be enriched in the regulatory network [9].
Host genes
We extracted sixteen miRNAs that have host genes from miRBase [26]. After overlapping with our mouse development gene of interest, there remained ten miRNAs, while nine of them were targeted by the same CRMs with its host genes.
Analysis of the experimental data of Tcf3
The relationship between gene expressions and the regulatory effects of Tcf3 and miRNAs can be formulated as a linear model [16]:
(1)
g
k
is the log ratio of expression of the k-th gene after Tcf3 depletion and before treatment; b
TF,k
and are 1 or 0 to indicate whether the k-th gene is targeted by Tcf3 or the i-th miRNA. The parameters a
TF,k
and show the regulatory effects of Tcf3 and the i-th miRNA; in other words, they measure the expression variation of genes targeted by Tcf3 and the i-th miRNA.
To select miRNAs in this model, we first used an alternative model, considering Tcf3 and one miRNA at one time, that is
All miRNAs were fitted to this model, and those with significant p value were selected to fit the original model (1). We used a stepwise linear regression of AIC criteria to select miRNAs. For the selected miRNAs, their target genes had significantly different expression levels after Tcf3 depletion, indicating that these miRNAs have significant interaction with Tcf3.