Genome-wide analysis of the transcription factor binding preference of human bi-directional promoters and functional annotation of related gene pairs
© Liu et al; licensee BioMed Central Ltd. 2011
Published: 4 May 2011
Bi-directional gene pairs have received considerable attention for their prevalence in vertebrate genomes. However, their biological relevance and exact regulatory mechanism remain less understood. To study the inner properties of this gene organization and the difference between bi- and uni-directional genes, we conducted a genome-wide investigation in terms of their sequence composition, functional association and regulatory motif discovery.
We identified 1,210 bi-directional gene pairs based on the GRCh37 assembly data, accounting for 11.6% of all human genes with annotated RNAs. CpG islands were detected in 98.42% of bi-directional promoters and 61.07% of uni-directional promoters. Functional enrichment analyses in GO and GeneGO both revealed that bi-directional genes tend to be associated with housekeeping functions in metabolic pathways and nuclear processes, and that 46.84% of the pair members are involved in the same biological function. By fold-enrichment analysis, we characterized 73 and 43 putative transcription factor binding sites (TFBSs) that preferentially occur in bi-directional promoters from the TRANSFAC and JASPAR databases, respectively. By text mining, some of them were verified by individual experiments, and several novel binding motifs were also identified.
Bi-directional promoters feature a significant enrichment of CpG islands as well as a high GC content. We provided insight into the functional constraints of bi-directional genes and found that paired genes are biased toward functional similarities. We hypothesized that this functional association underlies the co-expression of bi-directional genes. Furthermore, we proposed a set of putative regulatory motifs in bi-directional promoters for further experimental studies of the transcriptional regulation of bi-directional genes.
According to the orientation and status of the 5’ end, adjacently located genes can be arranged in a convergent, divergent or tandem configuration. Among these categories, the divergent gene arrangement is found more frequently than expected by chance in the human genome, accounting for about 10% of all human genes[2, 3]. A bi-directional gene pair is defined as two genes arranged in a head-to-head (adjacent 5’ ends) fashion on opposite strands of DNA with less than 1,000 bp between their transcription start sites (TSSs). Accordingly, the entire intervening region between the two TSSs is designated as a putative bi-directional promoter. A gene is termed uni-directional if no oppositely oriented TSS is found within 10 kb upstream of the given TSS, or if a similarly oriented TSS is found at least 1 kb upstream. In that case, the entire 1 kb of 5’ flanking DNA is considered the uni-directional promoter.
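The pair definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Gene` record, its field names and the function name are hypothetical, and the study itself used a Perl script whose details are not given here.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (not a structure from the paper)."""
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # genomic coordinate of the transcription start site


def is_bidirectional_pair(a: Gene, b: Gene, max_gap: int = 1000) -> bool:
    """True if two genes are head-to-head on opposite strands with TSSs < max_gap apart."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    left, right = sorted((a, b), key=lambda g: g.tss)
    # Head-to-head (divergent): the gene with the smaller TSS coordinate lies on
    # the minus strand and the other on the plus strand, so both are transcribed
    # away from the shared intervening region.
    divergent = left.strand == "-" and right.strand == "+"
    return divergent and (right.tss - left.tss) < max_gap
```

For example, a minus-strand gene with its TSS at position 1000 and a plus-strand gene with its TSS at position 1500 on the same chromosome would qualify, whereas the same two TSSs with the strands swapped would not, since those genes point toward each other.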
Considerable attention has been focused on bi-directional genes in recent years. Examples including LRRC49/THAP10, SURF-1/SURF-2, COL4A1/COL4A2, PCD10/SERPINI1 and HAND2/DEIN have been identified in humans through individual experiments. A considerable number of bi-directional gene pairs were found to be conserved among mammalian species[9, 10]. Since evolutionary conservation usually indicates functional importance, we proposed that bi-directional gene organization is under selection to fulfil a specific functional role. Because most bi-directional gene pairs have been found in the course of studying a single gene, a genome-wide analysis of their function and physiological consequences is still lacking.
The expression data obtained from biotechnologies such as SAGE and microarray indicated a correlated expression profile between bi-directional genes[11–13]. Based on the assumption that ‘co-expression implies co-regulation’, the requirement for co-regulation of functionally related genes appears to underlie the observed co-expression. However, it is still under discussion whether the co-expression evolved merely as a consequence of their physical proximity or if function dictated their co-regulation. There are several examples of bi-directional gene pairs that are related by function, e.g. in DNA repair[1, 2], aging, de novo purine synthesis and carcinogenesis. Despite this observation, a systematic study on the degree of internal co-function of the bi-directional genes has not been carried out to date.
More recent studies have suggested an intrinsic difference in nucleotide composition between bi-directional promoters and uni-directional ones[1, 2, 13, 16]. These characteristic features led us to hypothesize that divergent genes are transcribed with a special set of regulatory signals. Currently our understanding of transcriptional regulation relies greatly on experimental identification of prospective regulatory regions, yet many specifics underlying the regulatory design are unknown. Therefore, it seems necessary to re-evaluate the underlying mechanisms and biological relevance of bi-directional promoters systematically.
Table: Distribution of bi-directional gene pairs on each chromosome (columns include total gene number).
Genes regulated by bi-directional promoters were examined for functional classifications and associations. Among the 1,644 genes involved in the 822 human bi-directional gene pairs, 1,121, 1,219, and 1,256 genes were directly annotated by the ‘biological process’, ‘molecular function’ and ‘cellular component’ subcategories of the GO annotation system, respectively. We found several GO classes significantly over-represented among bi-directional genes. Cellular, metabolic and biosynthetic processes emerged as the most significantly enriched functional classes. GO terms for cell cycle and its child nodes were also significantly over-represented, as were cellular response to stress or stimulus and the related subclasses of damage response and break repair. To summarize, the most enriched GO categories correspond to the known basic physiological roles of the cell, indicating that bi-directional genes are frequently involved in basic cellular metabolic processes. See Additional file 1 for the complete list of enriched GO terms.
We then set out to find the GO terms that represent coordinated functions of bi-directional pairs. In Biological Process, the GO terms related to metabolic process and its branches, such as primary metabolic process, cellular process and biopolymer biosynthetic process, topped the list for both gene pair members. Their child nodes were concentrated in RNA (mRNA, ncRNA) metabolic process, cellular (macromolecule or biopolymer) catabolic process, organelle organization, mitotic cell cycle, etc. In Molecular Function, the GO terms involving DNA-directed RNA polymerase activity, RNA methyltransferase activity, purine NTP-dependent helicase activity, NAD or NADH binding, NADH dehydrogenase (quinone) activity, etc. are significantly over-represented compared to others. In Cellular Component, we found that the two members of a bi-directional pair tend to be assigned to the same class: organelle, organelle envelope, nucleus, nucleoplasm, nucleolus, membrane-bounded or non-membrane-bounded organelle, etc. Interestingly, almost all the terms shared by the two divergent genes are related to metabolism and energy transfer. We proposed that genes involved in functions such as metabolism are more likely to be organized in the head-to-head configuration.
Statistically enriched GeneGO Pathway categories
Regulatory processes/Cell cycle
Metabolic maps/Metabolic maps (common pathways)/Energy metabolism
Metabolic maps/Metabolic maps (common pathways)
Metabolic maps/Metabolic maps (common pathways)/Nucleotide metabolism
Metabolic maps/Metabolic maps (common pathways)/Vitamin and cofactor metabolism
Thus far we have analyzed the level of gene function enrichment using two functional annotation schemes separately. The GO results show a clear agreement with those derived from the GeneGO pathways. For example, the significantly enriched GO terms include DNA metabolic process, which corresponds to the Metabolic maps/Metabolic maps (common pathways)/Nucleotide metabolism pathway in GeneGO; cell cycle, which corresponds to the pathway of the same name in GeneGO; and response to DNA damage stimulus, which corresponds to Regulatory processes/DNA-damage in GeneGO. This agreement is also apparent in that ‘DNA repair’ is the most enriched GO term and ‘DNA damage_Nucleotide excision repair’, which belongs to the Regulatory processes/DNA-damage pathway, is one of the top enriched pathways in GeneGO as well.
Table: The experimentally validated TFBSs that occurred in bi-directional promoters (columns include the regulated gene pair).
Interestingly, the over-represented recognition sequences for MYC, ELK1, NF-Y, SP1, ATF, GABPA, SREBP-1, NF-E2, STAT5A, NF-1 and SOX-9 rank among the most conserved motifs found in human promoters.
Given the enrichment of these motifs in bi-directional promoters and their strong evolutionary conservation across mammalian promoters, we assume that the predicted TFBSs located within bi-directional promoters are more likely to be functional in co-regulation than other TFBSs. It also appears that TFs within the same family tend to have similar binding preferences: a TFBS is either over-represented or under-represented in parallel with those of other family members. These observations suggest a common mode of expression across the members of a transcription factor family.
In this study, 11.6% of human genes were shown to be arranged in a head-to-head fashion. This proportion is slightly larger than in most previous reports, although Piontkivska et al. reported 1,369 bi-directional promoters. The inconsistency is partly due to updates of TSS coordinates as EST and mRNA evidence accumulated. In addition, we used the more highly curated RefGene track instead of the spliced human EST collection, because the large and complicated EST data, which contain thousands of transcripts captured by oligo-capping techniques, lead to an overestimation of the frequency of transcripts and thus introduce false positives. Moreover, our work focuses on pure mRNA gene pairs, and a large part of the non-coding RNA, transcribed RNA and miscRNA was excluded from further analysis. Herein we provide solid evidence for the previous observation that bi-directional promoters show a significant enrichment of CpG islands as well as a high GC content. Since CpG islands are usually targets of regulation by methylation, methylation may induce changes in chromatin structure that confer either positive or negative effects on transcription. Misregulation of a bi-directional promoter elicited by mutation or hypermethylation would simultaneously silence the genes on both sides. Loss of their vital biological functions may well explain the role of bi-directional genes in human conditions such as aging, brain disease and oncogenesis.
Our study provided insight into the functional constraints of bi-directional genes. Functional enrichment analyses in GO and GeneGO both revealed that bi-directional genes are often associated with housekeeping functions. GO terms for metabolic processes, such as DNA, RNA, biopolymer or macromolecule metabolism, as well as nuclear processes, such as DNA repair and replication or cell cycle regulation, are significantly enriched. The GeneGO pathways that are involved in growth or proliferation, such as those engaged in Energy metabolism, Nucleotide metabolism, and Vitamin and cofactor metabolism, tend to be more enriched with bi-directional genes. Pathways in genetic information processing (transcription, translation and DNA repair) and cell cycle tend to be enriched as well. To summarize, bi-directional genes are significantly enriched in housekeeping functions such as metabolic pathways and nuclear processes.
Further analyses revealed that the significant functional categories are more likely to be shared by bi-directional genes. This indicated that the bi-directional genes are strongly biased toward functional similarities and coordinated regulation. We postulate that for bi-directional genes involved in basic biological processes, coordinated regulation ensures their synchronized action and thus minimizes transcriptional error. In contrast, genes with less coordinated regulation may be involved in pathways that are more flexible in responding to environmental changes.
We compared the TFBSs between bi- and uni-directional promoters according to their rate of occurrence. We discovered several transcription factors that preferentially regulate bi-directional promoters. Some of the TFBSs matched well with experimentally determined ones, and several novel binding motifs were also identified. These bi-directional gene associated motifs may be regarded as the best candidates for functional regulatory elements. In addition, the motif search results could help identify novel genes that are linked to a known gene via a bi-directional promoter; such genes probably perform important conserved functions.
We are also aware of some limitations in our analysis. The motif collections used for the identification of TFBSs are still incomplete, and the evolutionary importance of the over-representation of TFBSs remains to be elucidated. Although some of their functions are indicated by the functional categories (GO terms) of experimentally verified motifs, conclusive evidence of the role played by regulatory factors in the co-regulation of the two genes remains to be obtained in experiments. Eventually, the combination of computational and experimental approaches will permit us to construct mechanistic models of the transcriptional regulatory networks of bi-directional genes. It would be interesting, as a future endeavor, to examine these regulatory elements in other species in a similar fashion and compare the results to those obtained herein. Comparative analyses across multiple species could validate our predicted regulators by their appearance in other species. Related work is still in progress.
In this work, we conducted a systematic investigation of bi-directional gene organization focusing on sequence features, functional association and regulatory motif discovery. We confirmed known properties of bi-directional gene organization and also provided new observations. We found that bi-directional gene pairs show a higher probability of being functionally associated, which led us to hypothesize that the requirement for co-regulation of functionally related genes is a possible cause of the observed co-expression of bi-directional genes. We also proposed that a special set of motifs in bi-directional promoters plays a role in the transcriptional regulation of bi-directional genes. Our data also provide putative regulatory motifs for experimental studies of how the expression of divergent gene pairs is regulated.
Human genome assembly GRCh37, released as NCBI Build36 and Ensembl release 55, was downloaded from the Genome Reference Consortium (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/). Gene annotation (NCBI Build36) was retrieved from the NCBI Entrez Gene ftp site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/). The transcript annotation, including transcription orientation, strand and start site (hg19), was downloaded from the hg19 RefGene table of the UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/). A total of 45,408 genes (excluding the mitochondrial genome) and 31,357 transcripts were collected and filtered for redundancy. This resulted in 44,293 non-redundant RefSeq transcript items. Genes without clear mRNA information (NR, XR and XM) were filtered out so that only genes with well-defined transcripts were retained. The 28,520 mRNAs were collapsed into 21,757 unique and non-overlapping clusters, which were then sorted by chromosome position and TSS coordinate to determine the adjacent gene pairs. Discrimination of bi-directional gene pairs and uni-directional genes was performed by a Perl script according to the definition of Trinklein et al. Redundant gene pair entries that share the same intergenic sequence were removed.
Based on the mapping information between genes and their transcripts, possible multiple TSSs were assessed. The intergenic regions between the TSSs of bi-directional genes were taken as bi-directional promoters. For uni-directional genes, the 1000 bp region upstream of the TSS was extracted as the promoter. Promoter regions were extracted from the chromosome FASTA files of the GRCh37 genome assembly (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/).
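As a rough illustration of the promoter definitions just described, the following Python sketch computes promoter intervals from TSS coordinates. The function names and the coordinate convention (0-based, half-open intervals) are assumptions made for this example and are not taken from the original pipeline.

```python
from typing import Tuple


def unidirectional_promoter(tss: int, strand: str, length: int = 1000) -> Tuple[int, int]:
    """Return the [start, end) interval covering `length` bp upstream of the TSS.

    Assumes 0-based coordinates with `tss` being the index of the first
    transcribed base. "Upstream" depends on the strand.
    """
    if strand == "+":
        # Upstream of a plus-strand gene lies at smaller coordinates.
        return max(0, tss - length), tss
    if strand == "-":
        # Upstream of a minus-strand gene lies at larger coordinates.
        return tss + 1, tss + 1 + length
    raise ValueError(f"unknown strand: {strand!r}")


def bidirectional_promoter(minus_tss: int, plus_tss: int) -> Tuple[int, int]:
    """Return the [start, end) interval strictly between two divergent TSSs."""
    if minus_tss >= plus_tss:
        raise ValueError("minus-strand TSS must lie to the left of the plus-strand TSS")
    return minus_tss + 1, plus_tss
```

With these conventions, a plus-strand TSS at index 5000 gives the interval (4000, 5000), and the same TSS on the minus strand gives (5001, 6001). The sequence itself would then be sliced from the chromosome FASTA record and, for minus-strand genes, reverse-complemented if a strand-specific orientation is needed.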
The intergenic sequences of bi-directional genes were extended symmetrically on both sides to 1000 bp to meet the length requirement of the CpG island definition. A CpG island finder script was run with two sets of parameter criteria: %GC >= 50, Obs/Exp >= 0.60, length 500 bp; and %GC >= 55, Obs/Exp >= 0.60, length 500 bp. The CpG frequency within the bi-directional and uni-directional promoters was then calculated.
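The two quantities used in these criteria can be computed as in the sketch below. It assumes the widely used Gardiner-Garden and Frommer form of the observed/expected ratio, (number of CpG × N) / (number of C × number of G), and it treats the length value as a minimum; neither detail is confirmed for the script used in this study. The sketch also scores one sequence as a single window rather than scanning it with a sliding window, and the function names are hypothetical.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (%GC, CpG observed/expected ratio) for one sequence window."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_percent = 100.0 * (c + g) / n
    # Assumed formula: Obs/Exp = (#CpG * N) / (#C * #G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def passes_cpg_criteria(seq: str, min_gc: float = 50.0,
                        min_obs_exp: float = 0.60, min_length: int = 500) -> bool:
    """Check one window against thresholds like those listed in the text."""
    gc_percent, obs_exp = cpg_stats(seq)
    return len(seq) >= min_length and gc_percent >= min_gc and obs_exp >= min_obs_exp
```

Calling `passes_cpg_criteria(seq, min_gc=55.0)` would correspond to the stricter of the two parameter sets.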
We utilized Gene Ontology (GO) categories (http://www.geneontology.org/) and the commercial software MetaCore-GeneGO Pathway Maps (http://www.genego.com/metacore.php) to group functionally related genes and to contrast the functional distribution of bi-directional genes with the average distribution in the whole genome. The analysis of over-represented GO terms for bi-directional genes was performed with GOEAST. Statistical enrichment of a category was quantified using the hypergeometric test, and the Yekutieli adjustment method was applied to correct for multiple testing.
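For readers unfamiliar with this kind of enrichment test, a minimal Python sketch is given below. It is not GOEAST code: the function names are hypothetical, the test is written as a one-sided upper-tail hypergeometric probability, and the "Yekutieli" adjustment is assumed here to mean the Benjamini-Yekutieli FDR procedure for dependent tests.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): probability of seeing at least k annotated genes.

    N = genes in the background, K = background genes carrying the GO term,
    n = genes in the study set, k = study-set genes carrying the GO term.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_yekutieli(pvalues: List[float]) -> List[float]:
    """Adjusted p-values under the Benjamini-Yekutieli procedure (assumed method)."""
    m = len(pvalues)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In practice one p-value would be computed per GO term with `hypergeom_upper_tail`, and the resulting list would be passed to `benjamini_yekutieli` before applying a significance cutoff.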
Genes were then mapped to the GeneGO database with MetaCore™ tools to infer pathways preferentially targeted by bi-directional genes. In MetaCore™, the statistical significance of an enriched pathway is indicated by a P value from Fisher’s exact test. The false discovery rate (FDR) was also applied to correct for multiple testing.
Putative TFBSs in promoter regions were identified by searching for matches to the position-weight matrices (PWMs) in the JASPAR[24, 25] and TRANSFAC databases. Predetermined PWMs for 73 and 87 vertebrate TFBSs were extracted from TRANSFAC (public version 7.0) and JASPAR PSSM, respectively. Alignment of PWMs to the genomic sequence was performed with COTRASIF (http://biomed.org.ua/COTRASIF/). TFBSs within bi-directional promoters were categorized as over-represented, shared or under-represented at a 2-fold threshold. A TFBS was defined as over-represented if its normalized number of binding sites in bi-directional promoters was 2-fold larger than that in uni-directional ones, and as under-represented if its normalized number of binding sites in bi-directional promoters was 2-fold smaller than that in uni-directional ones. A shared motif is the intermediate state. A total of 18,840 uni-directional promoters were used as the contrast set for bi-directional genes.
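The 2-fold categorization can be illustrated with the short Python sketch below. It assumes that "normalized" means the number of predicted sites divided by the number of promoters in each set; normalizing by total promoter length would be an equally plausible reading. The function name and the handling of boundary cases are likewise choices made for this example.

```python
def classify_tfbs(bi_sites: int, bi_promoters: int,
                  uni_sites: int, uni_promoters: int,
                  fold: float = 2.0) -> str:
    """Label a motif by its per-promoter site frequency in the two promoter sets."""
    bi_rate = bi_sites / bi_promoters      # assumed normalization: sites per promoter
    uni_rate = uni_sites / uni_promoters
    if bi_rate == 0 and uni_rate == 0:
        return "shared"                    # no evidence either way
    if bi_rate >= fold * uni_rate:
        return "over-represented"
    if bi_rate * fold <= uni_rate:
        return "under-represented"
    return "shared"
```

For instance, a motif with 300 sites in 1,000 bi-directional promoters and 1,884 sites in 18,840 uni-directional promoters has rates of 0.3 and 0.1 per promoter and would be labeled over-represented. The counts in this example are invented for illustration only.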
This research work has been supported in part by the National 973 Program of China (No. 2007CB947002) and the National Natural Science Foundation of China (20872107).
This article has been published as part of BMC Systems Biology Volume 5 Supplement 1, 2011: Selected articles from the 4th International Conference on Computational Systems Biology (ISB 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.