Using potential master regulator sites and paralogous expansion to construct tissue-specific transcriptional networks
© Haubrock et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Transcriptional networks of higher eukaryotes are difficult to obtain. Available experimental data from conventional approaches are sporadic, while those generated with modern high-throughput technologies are biased. Computational predictions are generally perceived as being flooded with high rates of false positives. New concepts about the structure of regulatory regions and the function of master regulator sites may provide a way out of this dilemma.
We combined promoter scanning using positional weight matrices with a four-genome conservation analysis to predict high-affinity, highly conserved transcription factor (TF) binding sites and to infer TF-target gene relations. These relations were expanded to paralogous TFs and filtered for tissue-specific expression patterns to obtain a reference transcriptional network (RTN) as well as tissue-specific transcriptional networks (TTNs).
When validated with experimental data sets, the predictions showed the expected trends of true positive and true negative predictions, resulting in satisfactory sensitivity and specificity characteristics. The validation also showed that confining the network reconstruction to the top-ranking 1% of TF-target predictions gives rise to networks with the expected degree distributions. Their expansion to paralogous TFs enriches them with tissue-specific regulators, providing a reasonable basis to reconstruct tissue-specific transcriptional networks.
The concept of master regulator or seed sites provides a reasonable starting point to select predicted TF-target relations, which, together with a paralogous expansion, allow for reconstruction of tissue-specific transcriptional networks.
Regulation of transcription is mediated through complex arrays of transcription factor binding sites (TFBSs), which constitute promoter and enhancer regions. In spite of the advent of high-throughput approaches to identify TFBSs in a given cellular context, the available information, most comprehensively collected in the TRANSFAC® database, is still fragmented and biased with regard to the systems selected. Consequently, any transcriptional network reconstructed from the available experimental data is highly incomplete. This situation deteriorates further when such a transcriptional "reference" network is filtered with gene expression data in order to generate tissue-specific networks. Therefore, constructing comprehensive gene regulatory networks still depends on reliable algorithms for predicting individual TFBSs as a basis for inferring TF-target gene relations. These predictions, however, depend on the availability of information about the DNA-binding specificity of, ideally, all TFs encoded by a genome. Unfortunately, we are far from this ideal situation, so that such predictions can be made only for a subset of, e.g., human TFs. Although promising methods have been reported for inferring DNA-binding specificities by homology modeling [2, 3], the required 3D structures of TF-DNA complexes are known for only a minority of factors.
Recent studies have applied high-throughput approaches to map active promoters and enhancers in a particular cellular context by capturing epigenetic characteristics such as specific histone methylation patterns. However, it remains to be revealed how exactly a given gene is regulated, i.e., which functional TFBSs are present in its regulatory regions, and what the original signal is that flags a promoter region as such. Conceivably, the recently published concepts of master transcription factors or pioneer transcription factors may provide a clue to this problem.
In this study, we started from the following related working model as hypothesis: In the genome of a given higher eukaryotic cell, promoter sequences have to be "flagged" in order to be recognizable by the transcription machinery. Each of these flags is realized by a high-affinity TFBS, which, due to its functional importance, is generally conserved among genomes that are phylogenetically not too distant. These high-affinity and conserved sites serve as nucleation centers, or "seeds", to govern the proper assembly of TFs at one promoter, which also involves a set of additional transcription factors with binding sites of decreasing affinity and acting in a concomitantly optional manner.
We started from 35,750 RefSeq-annotated human promoter regions (UCSC track refGene, Apr. 14, 2010, hg19), which are linked to 21,532 unique genes. We selected the 1-kb upstream regions based on the RefSeq annotation to cover the corresponding human promoter regions. We retrieved orthologous promoter regions from the mouse, dog, and cow genomes from the 46_WAY_MULTIZ_hg19 whole-genome alignments provided by UCSC for 46 vertebrates using UCSC/Galaxy. The corresponding sequence builds are hg19, mm9, canFam2, and bosTau4. Gaps resulting from the multiple genome alignment were removed from the promoter sequences. Potential transcription factor binding sites (TFBSs) were then identified using all available vertebrate matrices (854 PWMs) of the TRANSFAC matrix library (release 2009.4) and the program Match™. We applied all vertebrate matrices using the default minFN ("minimize false negatives") thresholds in order to retrieve almost all potential binding sites that are at least of the quality of the sites from which the corresponding matrix was built. The predictions were then mapped back to the whole-genome alignments. We next filtered for conserved TFBS predictions: a conserved TFBS has to start or end at a non-gap symbol in the corresponding promoter alignment. Finally, we ranked all conserved TFBSs according to their Match score and selected the top-ranking 1%, 2%, 3%, 5%, etc. for evaluation. The 100% profile comprises all conserved TFBSs that were identified with minFN thresholds. For further analyses of the network characteristics, the top-ranking 1% of predicted binding sites for each matrix were used.
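The final ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(matrix_id, gene, score)` record layout is a hypothetical stand-in for parsed Match output, and keeping at least one hit per matrix is our own choice, not something the procedure above specifies.

```python
import math
from collections import defaultdict

def top_fraction_per_matrix(hits, fraction=0.01):
    """Keep the best-scoring `fraction` of conserved hits for each matrix.

    `hits` is an iterable of (matrix_id, gene, score) tuples
    (hypothetical layout, not the actual Match output format).
    """
    by_matrix = defaultdict(list)
    for hit in hits:
        by_matrix[hit[0]].append(hit)

    selected = []
    for matrix_hits in by_matrix.values():
        # Rank the hits of this matrix by score, best first.
        ranked = sorted(matrix_hits, key=lambda h: h[2], reverse=True)
        # Assumption: always keep at least one hit per matrix.
        keep = max(1, math.ceil(len(ranked) * fraction))
        selected.extend(ranked[:keep])
    return selected

# Toy data: 200 hits for matrix M1, 2 hits for matrix M2.
hits = [("M1", f"gene{i}", i / 200) for i in range(200)]
hits += [("M2", "geneA", 0.9), ("M2", "geneB", 0.8)]
top = top_fraction_per_matrix(hits, fraction=0.01)
```

Ranking within each matrix, rather than across all matrices at once, avoids favouring matrices whose scores are systematically higher.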
Using the TRANSFAC library we ended up with a list of predicted transcription factor binding sites related to the TRANSFAC matrix identifiers. To build gene regulatory networks we translated these matrix identifiers, which are linked to lists of related species-specific proteins, to official human gene symbols.
For "paralogous expansion", we used our new Human Transcription Factor Classification to construct gene regulatory networks (http://www.bioinf.med.uni-goettingen.de/projects/tfclassification/). This collection classifies human transcription factors into families and subfamilies mainly based on the sequence similarities of their DNA-binding domains (DBDs). Since at the lowest classification level, the DBDs are usually extremely similar, the DNA-binding specificities can be assumed to be nearly identical as well. We therefore expanded all TF-target links to all members of the corresponding TF (sub-)family, for which no matrix is as yet available.
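A minimal Python sketch of this expansion step follows. The data structures are hypothetical, the target name `GENE_X` is a placeholder, and taking the union of targets when several members of a (sub)family already have a matrix is an assumption made for the sketch.

```python
def expand_to_paralogs(tf_targets, clades):
    """Let (sub)family members without a matrix inherit predicted targets.

    tf_targets: dict mapping TF symbol -> set of predicted target genes
                (only TFs that have a matrix appear as keys)
    clades:     dict mapping (sub)family id -> list of member TF symbols
    """
    expanded = {tf: set(targets) for tf, targets in tf_targets.items()}
    for members in clades.values():
        # Assumption: pool the targets of all members that have a matrix.
        inherited = set()
        for tf in members:
            inherited |= tf_targets.get(tf, set())
        if not inherited:
            continue  # no member of this clade has a matrix
        for tf in members:
            if tf not in tf_targets:
                expanded[tf] = set(inherited)
    return expanded

# Toy example: one family in which only one member has a matrix.
clades = {"6.5.4": ["TBX2", "TBX3", "TBX4", "TBX5"]}
tf_targets = {"TBX5": {"NPPA", "GENE_X"}}
expanded = expand_to_paralogs(tf_targets, clades)
```

TFs that already have their own matrix keep their original target sets unchanged; only members without a matrix receive inherited links.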
The predicted binding sites were verified using experimentally identified regulatory regions from the ENCODE project. ENCODE provides a regulatory super-track as a downloadable file. This archive summarizes all transcription factor ChIP-seq experiments that have been done within the ENCODE project based on human genome build 37 (hg19). Altogether, genome-wide binding sites and their genomic coordinates are available for more than 140 different human transcription factors. They were used to evaluate our TFBS predictions and the inferred TF-target predictions by computing the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) for some human transcription factors. If a predicted TFBS is also found in a ChIP-seq region, we count it as a TP. If a TFBS prediction is not detected by a ChIP-seq experiment, it is an FP. An FN is obtained when a ChIP-seq region overlaps a promoter region by at least 500 nucleotides, but we do not predict a TFBS there. A TN is a case in which we neither predicted a TFBS nor found a ChIP-seq region. From these counts we determine the positive predictive value (or precision; PPV = TP/(TP+FP)), the specificity (Spec = TN/(TN+FP)), and the true positive rate (TPR = TP/(TP+FN); also: sensitivity or recall) to assess the accuracy of our predictions against the ChIP-seq data.
From UniGene, we downloaded the gene expression profiles for 8 different tissues: brain, heart, kidney, liver, ovary, prostate, spleen, and testis.
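The three measures can be illustrated with a short Python sketch. It works at promoter level for a single TF, which is a simplification assumed here; the function names and the toy promoter identifiers are hypothetical.

```python
def confusion_counts(all_promoters, predicted, bound):
    """Count TP, FP, FN, TN for one TF.

    all_promoters: every promoter considered
    predicted:     promoters with a predicted TFBS for this TF
    bound:         promoters overlapped by a ChIP-seq region for this TF
    """
    all_promoters = set(all_promoters)
    predicted = set(predicted)
    bound = set(bound)
    tp = len(predicted & bound)                  # predicted and bound
    fp = len(predicted - bound)                  # predicted, not bound
    fn = len(bound - predicted)                  # bound, not predicted
    tn = len(all_promoters - predicted - bound)  # neither
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """PPV, specificity and TPR; assumes all denominators are non-zero."""
    return {
        "PPV": tp / (tp + fp),
        "Spec": tn / (tn + fp),
        "TPR": tp / (tp + fn),
    }

# Toy example with ten promoters.
promoters = [f"p{i}" for i in range(10)]
predicted = ["p0", "p1", "p2", "p3"]
bound = ["p0", "p1", "p2", "p4", "p5", "p6"]
counts = confusion_counts(promoters, predicted, bound)
scores = metrics(*counts)
```

In this toy case the counts are TP = 3, FP = 1, FN = 3 and TN = 3, giving PPV = 0.75, Spec = 0.75 and TPR = 0.5.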
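How such expression profiles can turn a reference network into a tissue-specific one is sketched below. The exact filtering rule is not spelled out here, so the sketch assumes the simplest plausible one: an edge is kept only if both the TF gene and the target gene are expressed in the tissue. Function and gene names other than the TBX and NPPA symbols are hypothetical.

```python
def tissue_network(edges, expressed):
    """Restrict a reference network to genes expressed in one tissue.

    edges:     iterable of (tf, target) pairs of the reference network
    expressed: set of gene symbols expressed in the tissue
    Assumption: an edge survives only if both of its ends are expressed.
    """
    return {(tf, target) for tf, target in edges
            if tf in expressed and target in expressed}

# Toy reference network and a toy heart expression set.
reference_edges = {("TBX5", "NPPA"), ("TBX3", "NPPA"), ("TF_A", "GENE_B")}
expressed_in_heart = {"TBX5", "TBX3", "NPPA"}
heart_edges = tissue_network(reference_edges, expressed_in_heart)
```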
Previous studies have shown that sequence conservation can improve transcription factor binding site predictions [11, 12]. Therefore, we combined standard PWM scanning with a four-species conservation filter to identify potential TFBSs and, on this basis, to infer TF-target gene relations for a comprehensive reference transcriptional network (RTN). With this strategy (see Methods for details), we predicted 4.3 × 10^7 TFBSs that are conserved among these four species (hg19, mm9, canFam2, bosTau4). These predictions are linked to 16,900 unique human gene symbols. 47.3% of all human promoters (35,750 RefSeq-annotated human promoter regions) share at least one conserved predicted binding site with mouse, dog, and cow. When selecting only the best 1% of predictions for each TRANSFAC matrix, we found that 15,619 genes (43.7%) share a conserved, high-scoring binding site. Altogether, we ended up with 490,277 TFBS predictions.
We used a fundamentally revised version of an earlier transcription factor classification, based on their DNA-binding domains, to identify groups of TFs that share DNA-binding specificity to the largest extent possible. They may be regarded as paralogs, resulting from early gene duplication events (Wingender, manuscript in preparation). This classification scheme comprises four abstraction levels: superclass, class, family, and (optionally) subfamily. Whenever one member of a bottommost clade (family or subfamily) has a TRANSFAC matrix associated, all potential binding sites and, thus, target genes predicted for this TF were copied to all other clade members. This expansion of the transcriptional reference network led to an increase of the TF genes from 442 to 742 (by 67.9%), and increased the number of directed edges in the network from 277,661 to 728,667 (by 162.4%) (Additional file 1, first line of the table). The expansion approach was also cross-validated for those cases where distinct (sub)family members had different TRANSFAC matrices associated (data not shown).
Altogether, we decided to continue with the 1% profiles and the resulting networks.
In the Venn diagram of Figure 2, the overlap between the predictions and any experimental data set may nevertheless appear small when compared with the overlap between the two ChIP-seq data sets. It should be noted, however, that we explicitly accepted a high number of false negatives as an unavoidable trade-off of the approach chosen here, which aims only at high-affinity and highly conserved sites, regarded as potential master regulator or seed sites.
In general, the number of TFs in the expanded tissue transcription networks (eTTN) increased on average 1.5 times compared to those in the TTNs, whereas the number of nonTFs is almost constant (Additional file 1). This increase of TF numbers results in an even larger increase in the number of regulations (directed edges), which is on average 2.5 times higher than before the expansion, suggesting that the eTTNs are much more densely connected than the TTNs. It is noted that the increasing ratios of genes and regulations are generally consistent with the reference network and across the different tissues (Additional file 1). This indicates that the extra TFs in the eRTN, which are highly tissue specific (Figure 5), are a characteristic of all tissues studied so far.
The individual eTTNs differ considerably in their sizes. By far the largest is the brain network, comprising 75% of the TF genes, 78% of the nonTF genes and 61% of the edges of the eRTN. At the other end of the scale, the spleen network shares with the eRTN only 31% of the TF genes, 38% of the nonTF genes, and 11% of the edges. On average, 41% of the regulations represented in the eRTN survive the tissue-specific filtering.
Interestingly, the in-degree of TF genes is consistently about 50% larger than that of nonTF genes. This is true for the (e)RTN as well as for all (e)TTNs. This difference is only slightly diminished by the paralogous expansion (see Additional file 4).
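For readers who want to reproduce this kind of comparison, the following Python sketch computes in- and out-degrees from a list of directed TF-target edges and averages the in-degree over a gene set. The edge list and all gene names are made-up toy data, not values from our networks.

```python
from collections import Counter

def degree_tables(edges):
    """Return (in_degree, out_degree) counters for directed (tf, target) edges."""
    out_degree = Counter(tf for tf, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree

def mean_in_degree(in_degree, genes):
    """Average in-degree over `genes`; genes without incoming edges count as 0."""
    genes = list(genes)
    return sum(in_degree[g] for g in genes) / len(genes)

# Toy network: three TF genes (TF1-TF3) and two nonTF genes (G1, G2).
edges = {
    ("TF1", "TF2"), ("TF3", "TF2"), ("TF1", "TF3"),
    ("TF1", "G1"), ("TF2", "G1"), ("TF3", "G2"),
}
in_deg, out_deg = degree_tables(edges)
tf_mean = mean_in_degree(in_deg, ["TF1", "TF2", "TF3"])
nontf_mean = mean_in_degree(in_deg, ["G1", "G2"])
```

A TF gene can appear on both sides of an edge, so it contributes to the out-degree table as a regulator and to the in-degree table as a target.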
However, such a global increase of in- and out-degrees does not change the features of the degree distributions of the eTTNs, which all show an exponential distribution of both in- and out-degree (Additional file 5).
It has been reported that during heart development, T-box transcription factors play a particularly important role. Mutations in human TBX genes may result in cardiovascular malformations. Their gene products, the TBX factors, form a complex spatio-temporal pattern defining the identity of the different heart structures.
Human TBX factors are spread over five families, one of them comprising TBX2, TBX3, TBX4 and TBX5 (family 6.5.4 in our TF classification). Of these, only TBX5 is associated with a positional weight matrix in TRANSFAC. However, it has been reported that, for instance, TBX3 can assist pluripotent reprogramming of embryonic fibroblasts and is required to specify the atrioventricular system (AV). It prevents genes that are markers for other parts of the organ (e.g., for the chamber myocardium) from being expressed in the AV; one of them is the gene of the atrial natriuretic factor (NPPA). It is noteworthy that after paralogous expansion, our heart network reveals NPPA as one of the more than 2000 target genes of TBX3, a relation that would have been lost otherwise.
With the efforts described in this paper, we attempted to reconstruct a realistic transcriptional network that (1) is devoid of false positive TF-target relations to the utmost extent possible, (2) includes as many regulator nodes (TFs) as possible, and (3) therefore provides a reasonable basis to reconstruct tissue-specific transcriptional networks. In order to minimize the number of false positive predictions, which is a well-known problem in computationally identifying TFBSs, we focused only on highly conserved and high-affinity (by virtue of Match score) binding sites to identify the TF-target relations represented by the arcs in our reference network. Since we obtained relatively high PPVs for most TFs, we are confident that the network we obtained is reliable. This is supported further by the observation that the FP rates we determined by comparing our predictions with experimental data sets, which always represent one (or very few) specific cellular situation(s), are highly overestimated. Comparing experimental data sets for one and the same TF, but obtained from different cell types, generally revealed minimal overlaps, indicating that many alleged FPs may in fact turn into true positives in a different cellular context. Rather, we suppose that most, if not all, high-affinity and conserved predicted TFBSs provide a regulatory potential that might be used in a certain cellular situation.
We are aware that our very stringent approach results in large numbers of false negatives, since many experimentally validated TFBSs have a very low Match score and gain their functionality through the proper context of other elements. Including this kind of context, or the proper "syntax" of promoters, will be the subject of further studies and a corresponding update of our network. The inclusion of enhancers will also be a task for future work. We have observed that conservation as a criterion does not apply well to enhancer regions, so that new concepts have to be developed for their identification and characterization.
Altogether, we are confident that the networks we have reconstructed reflect a relevant part of reality. This is also supported by the observed kind of degree distribution of the most stringent network, which follows a clear exponential law, as was to be expected at least for the in-degree distribution (see  and references cited therein) and from our own earlier observations for the out-degree distribution as well. Relaxing the filter criteria leads to degree distributions with more random characteristics.
We have also shown that, on the basis of such restrictive filtering, the networks can be reliably expanded by including related TFs and allowing them to inherit all target relations, and with that the full out-degree, of already characterized (sub)family members. Since these newly added regulators predominantly provide tissue-specific regulatory information, this expanded network is a good basis to construct reliable transcriptional networks for individual tissues. A first overview revealed that the degree distributions of these networks follow the same rules as those of the reference network. In addition, first investigations have shown that the hub composition of all these networks was also comparable. Finally, we could show that, in the particular case of heart development, paralogous expansion was able to rescue target genes for a specific transcription factor (TBX3) that would otherwise not have been accessible in the corresponding tissue-specific network.
A paralog-expanded transcriptional network has been constructed based on the concept of master regulator or seed sites. It has been shown that the paralogous expansion provides a reliable basis to reconstruct tissue-specific transcriptional networks. The obtained networks show the expected statistical and topological characteristics. A first case study additionally provided biological evidence for the reliability and usefulness of these networks in including regulatory information that would have been missed without this expansion. From that we conclude that our approach to constructing transcriptional networks is valid and provides a solid ground for further studies, in particular with regard to the analysis of regulatory processes, e.g. the mechanisms governing cell differentiation.
eRTN: expanded reference transcriptional network
eTTN: expanded tissue-specific transcriptional network
PPV: positive predictive value
PWM: positional weight matrix
RTN: reference transcriptional network
TFBS: transcription factor binding site
TTN: tissue-specific transcriptional network.
The authors acknowledge the financial support by the European Commission under FP7 (contract no. 202272, LipidomicNet).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.