Volume 6 Supplement 1

Selected articles from The 5th IEEE International Conference on Systems Biology (ISB 2011)

Open Access

The preservation of bidirectional promoter architecture in eukaryotes: what is the driving force?

BMC Systems Biology20126(Suppl 1):S21

DOI: 10.1186/1752-0509-6-S1-S21

Published: 16 July 2012

Abstract

Background

The bidirectional gene architecture has been studied in many organisms, and the conservation of bidirectional arrangement has received considerable attention. However, the explanation for the evolutionary conservation about this genomic structure is still insufficient. In this study the large scale identification and pathway enrichment analysis for bidirectional genes were performed in several eukaryotes and the comparative analysis of this arrangement between human and mouse were dissected for the purpose of discovering the driving force of the preservation of this genomic structure.

Results

We identified the bidirectional gene pairs in eight different species and found this structure to be prevalent in eukaryotes. The pathway enrichment analysis indicated the bidirectional genes at the genome level are conserved in certain pathways, such as the DNA repair and some other fundamental cellular pathways. The comparative analysis about the gene expression, function, between human and mouse bidirectional genes were also performed and we observed that the selective force of this architecture doesn't derive from the co-regulation between paired genes, but the functional bias of bidirectional genes at whole genome level is observed strengthened during evolution.

Conclusions

Our result validated the coexpression of bidirectional genes; however failed to support their functional relevance. The conservation of bidirectional promoters seems not the result of functional connection between paired genes, but the functional bias at whole genome level, which imply that the genome-wide functional constraint is important for the conservation of bidirectional structure.

Background

The bidirectional promoters, as a special arrangement of neighbouring genes, have been discussed in many previous studies. The bidirectional gene pairs were defined as the divergent genes with the distance between their transcription start sites (TSS) less than 1 kb [1]. The frequency distribution of distance between adjacent gene pairs showed that the bidirectional promoters are prevalent in human genome [1]. It was later discovered that this genomic architecture is also abundant in mouse, Arabidopsis thaliana, yeast and many other species [24]. Comparative genomic analysis suggested that this gene-pair structure is conserved in vertebrates [2, 5, 6]. It was therefore believed that the bidirectional promoters possess special biological meaning [24, 7].

The co-regulation was believed to be the distinctive feature of bidirectional gene pairs, and the mechanism of the similarity of expression profiles may be the sharing of the regulatory elements [1]. The previous study by Li et al concluded that this genomic arrangement is ancient and conserved during the evolutionary process, where the function relevance of this structure was also reported in the literature [5]. Other comparative genomic researches about the bidirectional gene pairs were also performed [2, 6], but the reason for the structure conservation is still not clear now. The comparative analyses about the expression and function attribution of bidirectional gene pairs at whole genome level between human and mouse in our work could provide the potential explanations for this question.

In this study, we first performed the large-scale identification and pathway enrichment analysis of bidirectional gene pairs among several eukaryotes. Then we analyzed the general evolutionary tendency of this architecture. The functional preference of bidirectional genes at whole genome level was discovered and this preference was found to be conserved among species. The function relevance at the paired genes level as the driving force for the preservation of bidirectional promoters was excluded. The functional bias of bidirectional genes at the whole genome level is strengthened in human compared with mouse, which may imply the genuine origin of the conservation of bidirectional architecture.

Results

Bidirectional promoters are prevalent in eukaryotic genomes

The bimodal distribution for the distance between transcription start sites (TSSs) of adjacent genes in opposite strands was detected in human genome, and the minor peak was considered as the peak of the distance between bidirectional gene pairs [1]. The distribution of distance between TSSs of neighbouring genes on opposite strands in eight individual eukaryotes was summarized, and the Kernel Density Estimation [8] was then applied to smooth the histograms with Gaussian curves (Figure 1). Then we fitted the distance distribution with two mixed Gaussian distributions and obtained the approximate locations of the minor and major peaks (Additional file 1). The fitting result reveals that the locations of minor peaks are more stable among species and maintain less than 1 kb while the location of major peaks tends to be proportional to the genome size. The Pearson correlation coefficient reflects that the significant correlation between the predicted location of major peak and the genome c-value (R2 = 0.9578, p-value = 2.379E-05), however the correlation was not observed in the minor peak (p-value = 0.185 (Figure 2). The c-value of each genome was extracted from the Animal Genome Size Database [9]. Although the genome size varies among species, the location of minor peak possesses considerable stability which on the other hand confirms the identification criteria of distance between TSSs less than 1kb for defining bidirectional promoters in former research [1].
https://static-content.springer.com/image/art%3A10.1186%2F1752-0509-6-S1-S21/MediaObjects/12918_2012_Article_890_Fig1_HTML.jpg
Figure 1

Distribution of distance between head to head gene pairs in eight eukaryotic organisms. The distances between TSSs of head to head gene pairs were calculated in eight organisms individually. The distance distribution curves were then smoothed using Kernel Density Estimation. The binomial distribution can be observed in all the eight charts, where the minor peak represents the enrichment of bidirectional promoters.

https://static-content.springer.com/image/art%3A10.1186%2F1752-0509-6-S1-S21/MediaObjects/12918_2012_Article_890_Fig2_HTML.jpg
Figure 2

Relationship between percentage of bidirectional genes and genome c-value. The percentages of bidirectional genes were computed in all eight organisms. The percentages were fitted with corresponding genome c-value using linear regression fitting. The negative correlation was observed between percentage of bidirectional genes and genome c-value.

The bidirectional promoters were then identified via employing the criteria of TSSs distance less than 1 kb in eight eukaryotic organisms. The number of recognized bidirectional promoters and the bidirectional genes are shown in Table 1. Kanako (2005) argued that the enrichment of bidirectional pairs was not observed in non-mammals [2]. However, even though the minor peaks in some species are not that obvious in the distance distribution, the percentage of bidirectional genes is still considerable, and the percentage goes up along with the decreasing genome size (Table 1). As a result, the bidirectional arrangement should be prevalent in eukaryotic genomes. It has been raised that there is a negative relationship between the ratio of bidirectional genes and the gene density of each chromosome in human and mouse genome [5]. This correlation can also been discovered across species when comparing the ratio of bidirectional genes with the genome size (Pearson correlation test, R2 = 0.547, p-value = 0.036). Interestingly, the percentage of bidirectional genes falls sharply from invertebrates to vertebrates. This may be the consequence of the large-scale segmental duplications which was believed to increase gene numbers and genome size during the origin of vertebrates [10].
Table 1

Statistical results of bidirectional promoters and bidirectional genes in the eight selected eukaryotes.

Organism

Number of bidirectional promoters

Number of bidirectional genes

Number of all protein coding genes

Percentage of bidirectional genes

Homo sapiens

1178

2348

20686

11.35%

Mus musculus

1311

2617

22793

11.48%

Rattus norvegicus

698

1393

22925

6.08%

Bos taurus

574

1144

19030

6.01%

Gallus gallus

687

1367

15310

8.93%

Drosophila melanogaster

2210

4398

13671

32.17%

Caenorhabditis elegans

1813

3612

20212

17.87%

Saccharomyces cerevisiae

1864

3423

6664

51.37%

The statistical results about the bidirectional pairs and involved genes, the genome c-value for the eight organisms were also enclosed in the table.

The prevalence of bidirectional promoters indicates this genomic architecture or the involved genes may have some special properties which make them preserved during the evolutionary history. We attempted to provide a potential explanation by the comparative analysis of bidirectional promoter among species, especially between human and mouse genome.

Co-regulation of bidirectional gene pairs hardly determines the fate of bidirectional promoters

It was examined that the sequence of bidirectional promoters can regulate both divergent genes [1]. As a result, the co-regulation of paired bidirectional genes can be expected. The co-expression level of paired bidirectional genes had been confirmed to be significantly higher than other neighbouring gene structures by whole-genome microarray data analysis [1, 5]. The significant function relevance had also been observed in the paired genes [5].

However, there are two potential deficiencies in the former analyses. First, the tandem duplications, which may cause the co-regulation of paired gene as trivial reasons [11], must be removed to purify the influence of bidirectional promoters. The tandem duplications, representing the genes duplicated in tandem [12], have pretty high sequence similarity and show symmetry not only in gene expression but also in function. Second, the similar expression pattern in neighbouring genes has been reported in human, drosophila and C. elegans [11, 13, 14], and chromatin-level gene regulation are thought as the most probable explanation for this phenomenon [11]. Consequently, in order to exclude the contribution of chromatin-level gene regulation, the co-regulation level of bidirectional genes should be compared with other adjacent gene architectures as well as the random gene pairs.

The neighbouring gene pairs in human genome were divided into three classes: 1) bidirectional gene pairs (BIP), which represent the adjacent gene pairs with TSSs distance less than 1 kb on the opposite strands; 2) remote head to head gene pairs (rH2H), the adjacent gene pairs on the opposite strands except the bidirectional gene pairs; 3) head to tail gene pairs (H2T), which represent the neighbouring gene pairs on the same strands. The tandem duplications were excluded from all these three architectures, and the randomly paired genes were chosen as the control. The Pearson correlation coefficient between gene expression profiles was computed employing the wide-distributed microarray data across 78 human cell types and 62 mouse cell types stemmed from a previous transcriptome analysis by Su et al [15]. A significant high correlation was observed when comparing the co-expression level of all adjacent genes pairs with the random pairs (Figure 3; Wilcoxon test P-value < 2.2E-16). Meanwhile, the bidirectional gene pairs had significant higher coexpression level than the rH2H and H2T (Wilcoxon test, P-value 1.139E-12, 1.225E-13). This analysis was also conducted in the mouse genome, and the results agreed with that of human genome (Figure 3). After excluding the tandem duplications and considering the influence of local expression similarity, the bidirectional genes still possess similar expression profiles. The shared control region may be the most reasonable explanations for this coordinated expression.
https://static-content.springer.com/image/art%3A10.1186%2F1752-0509-6-S1-S21/MediaObjects/12918_2012_Article_890_Fig3_HTML.jpg
Figure 3

Pearson correlation coefficient for four kinds of gene pairs in human and mouse genome. The Pearson correlation coefficient between the expression profiles of paired genes in the four categories of gene pairs: BIP, rH2H, H2T and random pairs were computed. The *** represents that the correlation coefficient for this pair was significantly higher than random pairs (p-value < 0.01).

The functional similarity of bidirectional gene pairs has been evaluated in a former study [5], and the significant correlation were found in all three Gene Ontology(GO) subsystems. The functional similarity between genes was quantified as the Resnik probability. Here we applied the same GO Resnik semantic measure described in Li et al's literature [5], for the "biological process" (BP) no significant co-function tendency was observed comparing with the random gene pairs after excluding the tandem duplications. As for the subsystem "molecular function" (MF), there is a significant higher functional similarity compared with random gene pairs (p-value = 0.007345), however this discrepancy was not found when compared with other adjacent gene pairs. Consistent with Li's result, the tendency to show functional relevant in subsystem "cellular component" (CC) is stronger not only compared with random pair (p-value = 5.877E-13) but also with other neighbouring gene pairs (p-value = 1.469E-07) (Figure 4). In general, the convinced functional relationship of bidirectional genes can only be found in the cellular component term. Among the three GO subsystems, the BP refers to the biological objective to which the gene or gene product contributes [16], and this term preferably represents the biological function of the gene. In consideration of this fact, the functional relevance may not be the statistically significant attribution for bidirectional genes. Moreover, we examined the bidirectional gene pairs involving DNA repair. The bidirectional genes have been found to be enriched in DNA repair pathways [1, 7]. The DNA repair genes were collected by the pathway annotation in KEGG pathway. For the 105 DNA repair genes, 35 are regulated by bidirectional promoter; however, all the 35 paired genes don't perform a role in DNA repair.
https://static-content.springer.com/image/art%3A10.1186%2F1752-0509-6-S1-S21/MediaObjects/12918_2012_Article_890_Fig4_HTML.jpg
Figure 4

The functional similarities for human bidirectional genes. The box plot for the bidirectional gene pairs in three GO subsystems "biological process"(BP), "molecular function"(MF), "cellular component"(CC) reflects that the similarity in BP term was not obvious.

As pointed out by Yanai and his co-worker, the similar gene expression profiles may not imply similar functions [17], so the co-expression of bidirectional genes may not be driven by the biological function but by the shared regulatory elements. The co-regulation of bidirectional gene pairs may not serve as the selection criteria of this genome architecture but the consequence.

The function preference of bidirectional genes increase along with the selection of bidirectional architecture

It was reported that the genes involved in DNA-repair are more likely to be arranged in bidirectional manner [1, 7, 18]. The functional preference of bidirectional genes deserves considerable attention. Here we adopted the hypergeometric distribution to find the enriched KEGG pathway for bidirectional genes as described in Materials and Methods. The enriched pathways of bidirectional genes in eight selected organisms were shown in Table 2. There are some enriched function classes which are constant among species, especially the transcription and transcription related (Splicesome, RNA-degradation, RNA polymerase) and DNA-repair-related pathways (Nucleotide excision repair, Non-homologous end-joining). It was pointed out that the bidirectional gene pairs are only conserved in vertebrates, the bidirectional linkages are disorganized during the evolution from invertebrates to the vertebrates [2]. Nevertheless, the enriched pathways among all the eukaryotes show substantial conservation. Although the bidirectional linkages of paired genes were lost during the evolution, some genes in particular pathways were still regulated by bidirectional promoters but paired with another gene. The genome greatly expanded at the origin of vertebrates, but otherwise remained relatively constant [19]. As a result, the bidirectional linkages in invertebrates were broken and shuffled at the emerging of vertebrates which cause the non-conservation of bidirectional gene pairs between invertebrates and vertebrates.
Table 2

Enriched KEGG pathways of bidirectional genes in eight different species.

Pathway

human

mouse

rat

cow

chicken

fruit fly

C.elegans

yeast

Count

Spliceosome

hsa03040

mmu03040

rno03040

bta03040

gga03040

dre03040

cel03040

sce03040

8

RNA degradation

hsa03018

mmu03018

rno03020

bta03018

 

dre03018

cel03018

sce03018

7

Nucleotide excision repair

hsa03420

mmu03420

rno03420

 

gga03420

dre03420

cel03420

sce03420

7

RNA polymerase

hsa03020

mmu03020

 

bta03020

 

dre03020

cel03030

sce03020

6

Non-homologous end-joining

hsa03450

 

rno03450

bta03450

gga03450

dre03450

cel03450

 

6

Oxidative phosphorylation

hsa00190

mmu00190

rno00190

bta00190

gga00190

 

cel00190

 

6

Pyrimidine metabolism

hsa00240

mmu00240

   

dre00240

cel00240

sce00240

5

Ribosome

hsa03010

  

bta03010

 

dre03010

cel03010

sce03010

5

Base excision repair

hsa03410

  

bta03410

gga03410

dre03410

 

sce03410

5

Purine metabolism

 

mmu00230

   

dre00230

cel00230

sce00230

4

DNA replication

hsa03030

mmu03030

   

dre03030

 

sce03030

4

Proteasome

    

gga03050

dre03050

cel03050

sce03050

4

Mismatch repair

hsa03430

    

dre03430

cel03430

sce03430

4

Homologous recombination

hsa03440

mmu03440

   

dre03440

cel03440

 

4

Aminoacyl-tRNA biosynthesis

hsa00970

 

rno00970

 

gga00970

 

cel00970

 

4

N-Glycan biosynthesis

 

mmu00510

rno00510

   

cel00510

 

3

Cell cycle

hsa04110

mmu04110

rno04110

     

3

Peroxisome

hsa04146

mmu04146

rno04146

     

3

Systemic lupus erythematosus

hsa05322

mmu05322

 

bta05322

    

3

O-Mannosyl glycan biosynthesis

 

mmu00514

  

gga00514

  

sce00514

3

Folate biosynthesis

     

dre00790

cel00790

sce00970

3

Metabolic pathways(Global Pathway)

hsa01100

     

cel01100

sce01100

3

Protein export

     

dre03060

cel03060

sce03060

3

Basal transcription factors

     

dre03022

 

sce03022

2

SNARE interactions in vesicular transport

     

dre04130

 

sce04130

2

Parkinson's disease

hsa05012

  

bta05012

    

2

Citrate cycle (TCA cycle)

  

rno00020

   

cel00020

 

2

The first three letters represent the abbreviation of each organism and the following numbers mean the accession number of the pathway in KEGG. The COUNT is the number of species which is enriched with the pathway. The pathways which only enriched in one organism were excluded.

The more interesting finding is that the conserved bidirectional gene enriched pathways are more likely to involve the basic functions in cell. In order to confirm this observation, the tissue specificity of gene expression was then evaluated by the gene expression profiles. Large-scale gene expression variation has been used to select house-keeping genes in many former researches; the genes with lower expression variation among tissues are regarded as potential house-keeping genes [12, 20, 21]. The calculation formula for gene expression specificity is presented in Materials and Methods [22]. The tissue specificity of bidirectional genes is significantly lower than other genes (Wilcoxom sum rank test, p-value 2.459E-13 for human and 2.338E-13 for mouse), which means the bidirectional genes express widely among different tissues and prefer to perform fundamental functions.

In order to check the potential constraints for the conservation of bidirectional arrangements, we classified the human bidirectional genes into two categories: the conserved bidirectional genes whose mouse orthologous genes are still arranged in bidirectional architecture (human cBIP gene) and the human-specific bidirectional genes whose mouse orthologous genes are not regulated by bidirectional promoter (human sBIP gene). The human-mouse one to one orthologous gene pairs were extracted from Ensembl using Biomart [23]. Only those bidirectional gene pairs with one to one orthologous mouse genes were discussed in this study. The mouse bidirectional genes were also classified based on this criterion. The enriched pathways of cBIP genes show great similarity in human and mouse, all the enriched pathways in mouse are also enriched in human genome, which indicates the function classes of preserved bidirectional genes are stable during evolution. However, the pathway enrichment for sBIP genes varies widely between human and mouse (Table 3, Table 4). The above observations imply that the bidirectional genes tend to perform functions in particular pathways and this tendency can determine the selective conservation of bidirectional gene pairs. For instance, the DNA-repair related pathways can be found in human cBIP genes (hsa03410, hsa03420, hsa03430), mouse cBIP genes (mmu03420, mmu03430) and human sBIP genes (hsa03410, hsa03420), but not in mouse sBIP genes. The genes which do not perform the DNA-repair functions were eliminated from the bidirectional arrangement during the evolution of human genome structure, while the co-opted bidirectional genes in human genome more and more participate in these pathways. As a conclusion, the bidirectional genes tend to perform particular fundamental functions like DNA repair and this function preference may affect the fate of bidirectional structure during evolution; however the reason for the trend of enrichment in these particular pathways need further investigation.
Table 3

Enriched KEGG pathway of human cBIP and sBIP gene classes.

KEGG pathway ID

Pathway Name

p-value

gene type

hsa03040*

Spliceosome

4.24E-04

cBIP

hsa03020*

RNA polymerase

1.09E-03

cBIP

hsa05012

Parkinson's disease

1.64E-03

cBIP

hsa05010

Alzheimer's disease

1.78E-03

cBIP

hsa00190

Oxidative phosphorylation

1.93E-03

cBIP

hsa03018*

RNA degradation

1.93E-03

cBIP

hsa04110

Cell cycle

3.01E-03

cBIP

hsa04142

Lysosome

4.32E-03

cBIP

hsa05016

Huntington's disease

5.11E-03

cBIP

hsa00970

Aminoacyl-tRNA biosynthesis

8.60E-03

cBIP

hsa00052

Galactose metabolism

1.54E-02

cBIP

hsa03440 #

Homologous recombination

2.10E-02

cBIP

hsa04146

Peroxisome

3.62E-02

cBIP

hsa03420 #

Nucleotide excision repair

4.03E-02

cBIP

hsa00230

Purine metabolism

4.32E-02

cBIP

hsa03410 #

Base excision repair

4.47E-02

cBIP

hsa00240

Pyrimidine metabolism

2.28E-02

sBIP

hsa00562

Inositol phosphate metabolism

2.36E-02

sBIP

hsa01100

Metabolic pathways

9.65E-03

sBIP

hsa03020*

RNA polymerase

9.05E-03

sBIP

hsa03030

DNA replication

1.92E-02

sBIP

hsa03050

Proteasome

4.85E-02

sBIP

hsa03410 #

Base excision repair

1.58E-02

sBIP

hsa03420 #

Nucleotide excision repair

3.70E-02

sBIP

hsa04115

p53 signaling pathway

4.81E-02

sBIP

hsa05215

Prostate cancer

4.30E-02

sBIP

The two categeries of human bidirectional genes (cBIP, sBIP) were computed for the KEGG pathway enrichment respectively. cBIP represents the genes regulated by the human-mouse conserved bidirectional promoters, while the sBIP represents the genes in the species-specific bidirectional gene pairs. * Transcription and translation related pathways. # DNA repair related pathways

Table 4

Enriched KEGG pathway of mouse cBIP and sBIP gene classes.

KEGG pathway ID

Pathway Name

p-value

gene type

mmu03040*

Spliceosome

3.30E-03

cBIP

mmu04110

Cell cycle

3.70E-03

cBIP

mmu04142

Lysosome

4.13E-03

cBIP

mmu03018*

RNA degradation

5.70E-03

cBIP

mmu00970

Aminoacyl-tRNA biosynthesis

8.74E-03

cBIP

mmu03020*

RNA polymerase

8.97E-03

cBIP

mmu00190

Oxidative phosphorylation

1.07E-02

cBIP

mmu03440 #

Homologous recombination

1.36E-02

cBIP

mmu00052

Galactose metabolism

1.59E-02

cBIP

mmu05012

Parkinson's disease

1.98E-02

cBIP

mmu04146

Peroxisome

3.03E-02

cBIP

mmu03420 #

Nucleotide excision repair

3.30E-02

cBIP

mmu00240

Pyrimidine metabolism

6.22E-03

sBIP

mmu03020*

RNA polymerase

5.15E-03

sBIP

mmu00062

Fatty acid elongation in mitochondria

1.10E-02

sBIP

mmu00903

Limonene and pinene degradation

1.40E-02

sBIP

mmu00280

Valine, leucine and isoleucine degradation

1.98E-02

sBIP

mmu05340

Primary immunodeficiency

4.08E-02

sBIP

mmu00860

Porphyrin and chlorophyll metabolism

4.96E-02

sBIP

The two catteries of mouse bidirectional genes (cBIP, sBIP) were computed for the KEGG pathway enrichment respectively. cBIP represents the genes regulated by the human-mouse conserved bidirectional promoters, while the sBIP represents the genes in the species-specific bidirectional gene pairs. * Transcription and translation related pathways # DNA repair related pathways

Discussion

In this study, we found that the bidirectional gene pairs were prevalent in eukaryotes and the percentage of bidirectional genes declines along with the increasing of genome size. The increasing of genome size is much faster than that of gene number during evolution, which can be attributed to two reasons. First, the growing number and length of introns make the gene longer [24]. Second, the intergenic distance is also greatly expanded. The expanding of intergenic distance inevitably reduces the percentage of bidirectional promoters.

Only protein-coding gene was considered in this work, the non coding transcripts which are pervasive in many organisms were recently found enriched in the upstream of protein-coding genes and shared the same promoter with the adjacent genes [4, 25, 26]. But most of the pervasive non-coding transcripts at bidirectional promoters were considered as unstable which would be degraded soon after the birth, and the function of them was not clear right now [27]. In an attempt to validate the distance distribution of head to head transcript pairs, we took the listed non-coding genes in Ensembl into consideration and re-identified the bidirectional promoters. Although this gene collection don't include all transcripts, the distribution is similar with the former (Additional file 2), and the minor peak which represents the bidirectional promoters indeed increased. However these transcripts are rarely expression quantified and function annotated. Our research focused on the evolution force of the efficient bidirectional promoters, so only the bidirectional promoters which encode two protein-coding genes were considered.

The co-regulation of bidirectional gene pairs has been reported in many studies including the co-expression and function relevance [1, 5]. Here we validated the coexpression of bidirectional genes rather than the cofunction. The bidirectionality has been proved to be an inherent feature of promoters [1], and the proposed divergent transcription model also thought the genes were transcribed in both direction synchronously [27]. However the GO similarity analysis didn't agree with the functional relevance of paired genes. The bidirectional genes transcribed simultaneously but perform different function in cell. The co-regulation of bidirectional pairs may stem from the shared promoter; however it hardly has effect on the selection of bidirectional promoters because the natural selection of gene order bases on the functional relevance such as operon in prokaryotes. The shuffling of bidirectional linkage between invertebrates and vertebrates also proves the bidirectional structures are not kept by co-regulation.

The cross-species pathway enrichment analysis showed that the functions of bidirectional genes are greatly conserved in certain fundamental function classes like DNA-repair and transcription related pathways. And this function preference may increase along with the selection of bidirectional structure. The bidirectionality is the inherent feature of promoters [1], the < 1 kb interval between head to head gene pairs can basically determine the co-regulation of paired genes. We assumed that the surrounding nucleotide composition of these genes may be the genuine trigger, the upstream genome structure of these genes are more stable and avoid the insertion of non coding DNAs or other genes which leads to the shorter interval between adjacent gene; however this assumption requires further validation.

Methods

Identification of bidirectional promoters in eight eukaryote genomes

The chromosomal positions and sequences information of all the protein-coding genes were fetched from the Ensembl database [28] (Ensembl gene Build 58) using the Biomart system [29] for eight selected organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Bos Taurus, Gallus gallus, Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae. The mitochondria genome and unmapped fragments were not included in the following analysis. The gene start sites in Ensembl gene annotation database were regarded as the reliable transcription start site (TSS) of each gene because the full-length cDNA was used to confirm the gene boundaries [30]. The protein-coding genes on each chromosome were sorted according to the TSS coordinates. The neighbouring genes on the same strand were recognized as the head to tail gene pairs, while the opposite strand as the head to head gene pairs. Then the distances between head to head gene TSSs were calculated for the eight organisms respectively.

Removal of tandem duplication

As indicated in the previous works, the tandem duplication can contribute to the local similarity of gene attributions and this substantially affects the neighbouring gene effect analysis [31]. We therefore removed the tandem duplications from the neighbouring gene pairs for the following coexpression and cofunction analysis. For each adjacent gene pairs, corresponding protein sequences were obtained from Ensembl database (Build 58), and then the protein sequences were imported into pair-wise BLAST to get the e-value of sequence similarity (standard setting, word size 2). This method with 0.2 as cut off value has been proved to be powerful to remove ~90% of related genes from a dataset [12]. In this article, we used smaller cut off to reduce false positive rate. The pair with e-value < 0.01 was regarded as tandem duplication and eliminated in the following gene pair similarity analysis.

Extraction of conserved and species-specific bidirectional gene pairs by orthologous linkage between human and mouse

If the human paired bidirectional genes both have the one-to-one orthologous gene in mouse genome and the orthologous gene pairs were still arranged in bidirectional architecture, these bidirectional gene pairs were counted as the conserved bidirectional gene pairs (human cBIP pairs), while other gene pairs as the human specific bidirectional gene pairs (human sBIP pairs). Similarly, the mouse bidirectional gene pairs are also divided into mouse cBIP pairs and sBIP pairs using the human-mouse linkage. The 14024 one-to-one orthologous gene relationships between human and mouse were extracted from Ensembl database via the Biomart. As a result, 540 human conserved bidirectional promoters and 270 human unique bidirectional promoters were classified, while these numbers are 540 and 207 in mouse genome.

Pathway enrichment analysis of bidirectional genes

The KEGG database [32] collected the pathway information for many organisms, and we further determined if these pathways are enriched with the bidirectional genes using hypergeometric distribution. For a given pathway in a particular organism, we fixed the total number of protein-coding genes in this organism (N), the number of genes involving this pathway (N1), the number of total bidirectional genes in this organism (N2), and treated the number of bidirectional genes in this pathway as a random variable. Under the null hypothesis that the genes are not enriched in this pathway, this random variable follows a hypergeometric distribution. The enrichment p-value can be then defined as the probability that the gene number in this pathway is greater than or equal to the observed value (N0), which can be represented by the following equation:
P = 1 - N 0 - 1 N 1 i N - N 1 N 2 - i N N 2

The calculations in the parentheses refer to the combinatorial calculation. Pathway was recognized as enriched with bidirectional genes if the p-value was lower than 0.05.

Gene expression specificity and coexpression level

The raw microarray data were obtained from Su et al. [15]. For human genome 156 Affymatrix U133A microarray experiments across 78 human cell types were deployed, while for the mouse genome, the object of analysis turned into the 122 custom-designed GNF microarray chips representing 61 mouse cell types in Su's dataset. The microarray data was pre-processed by RMA method [33] with R affy package [34]. If a gene can map to several probesets, the mean value of the probesets' expression level was regarded as the gene expression value. For each gene the expression specificity was then calculated as the following equation:
τ = i = 1 n 1 - E i E max n - 1 ,

where n represents the number of expression datasets, Emax as the maximum expression value of all across cell type expression values, Ei as gene expression value in each microarray experiment. In human and mouse genome, for the mapped head to head gene pairs, head to tail pairs and random-generated 20000 gene pairs, the gene coexpression level were then evaluated as the Pearson correlation coefficient between expression profiles of paired genes separately.

Gene Ontology association analysis

The GO annotation for each gene was extracted from Gene Ontology database [16]. For one gene, the direct annotation was extended to general annotation by appending all the parent nodes of the direct annotation in the GO vocabulary tree [5]. The detail about the algorithm of Resnik semantic similarity was discussed in Li's work [5]. Among all the neighbouring gene pairs, the functional similarities of annotated pairs were then calculated in all three GO subsystems: "biological process", "molecular function", "cellular component", employing an R package for computing semantic similarity based on Gene Ontology annotations called csbl.go [35].

Declarations

Acknowledgements

This work was supported by the National Natural Science Foundation of China (31170795, 91029703), International S&T Cooperation Program of Suzhou (SH201120) and the Major State Basic Research Development Program of China (2010CB945600).

This article has been published as part of BMC Systems Biology Volume 6 Supplement 1, 2012: Selected articles from The 5th IEEE International Conference on Systems Biology (ISB 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S1.

Authors’ Affiliations

(1)
Center for Systems Biology, Soochow University

References

  1. Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM: An abundance of bidirectional promoters in the human genome. Genome Res. 2004, 14 (1): 62-66.PubMed CentralView ArticlePubMed
  2. Koyanagi KO, Hagiwara M, Itoh T, Gojobori T, Imanishi T: Comparative genomics of bidirectional gene pairs and its implications for the evolution of a transcriptional regulation system. Gene. 2005, 353 (2): 169-176. 10.1016/j.gene.2005.04.027.View ArticlePubMed
  3. Wang Q, Wan L, Li D, Zhu L, Qian M, Deng M: Searching for bidirectional promoters in Arabidopsis thaliana. BMC Bioinformatics. 2009, 10 (Suppl 1): S29-10.1186/1471-2105-10-S1-S29.PubMed CentralView ArticlePubMed
  4. Neil H, Malabat C, d'Aubenton-Carafa Y, Xu Z, Steinmetz LM, Jacquier A: Widespread bidirectional promoters are the major source of cryptic transcripts in yeast. Nature. 2009, 457 (7232): 1038-1042. 10.1038/nature07747.View ArticlePubMed
  5. Li YY, Yu H, Guo ZM, Guo TQ, Tu K, Li YX: Systematic analysis of head-to-head gene organization: evolutionary conservation and potential biological relevance. PLoS Comput Biol. 2006, 2 (7): e74-10.1371/journal.pcbi.0020074.PubMed CentralView ArticlePubMed
  6. Yang MQ, Taylor J, Elnitski L: Comparative analyses of bidirectional promoters in vertebrates. BMC Bioinformatics. 2008, 9 (Suppl 6): S9-10.1186/1471-2105-9-S6-S9.PubMed CentralView ArticlePubMed
  7. Adachi N, Lieber MR: Bidirectional gene organization: a common architectural feature of the human genome. Cell. 2002, 109 (7): 807-809. 10.1016/S0092-8674(02)00758-4.View ArticlePubMed
  8. Simonoff JS: Smoothing methods in statistics. Springer_Verlag, New York. 1996
  9. Gregory TR, Nicol JA, Tamm H, Kullman B, Kullman K, Leitch IJ, Murray BG, Kapraun DF, Greilhuber J, Bennett MD: Eukaryotic genome size databases. Nucleic Acids Res. 2007, 35: D332-338. 10.1093/nar/gkl828.PubMed CentralView ArticlePubMed
  10. Skrabanek L, Wolfe KH: Eukaryote genome duplication - where's the evidence?. Curr Opin Genet Dev. 1998, 8 (6): 694-700. 10.1016/S0959-437X(98)80039-7.View ArticlePubMed
  11. Lercher MJ, Blumenthal T, Hurst LD: Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res. 2003, 13 (2): 238-243. 10.1101/gr.553803.PubMed CentralView ArticlePubMed
  12. Lercher MJ, Urrutia AO, Hurst LD: Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat Genet. 2002, 31 (2): 180-183. 10.1038/ng887.View ArticlePubMed
  13. Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, van Sluis P, Hermus MC, van Asperen R, Boon K, Voute PA: The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science. 2001, 291 (5507): 1289-1292. 10.1126/science.1056794.View ArticlePubMed
  14. Spellman PT, Rubin GM: Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002, 1 (1): 5-10.1186/1475-4924-1-5.PubMed CentralView ArticlePubMed
  15. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.PubMed CentralView ArticlePubMed
  16. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMed
  17. Yanai I, Korbel JO, Boue S, McWeeney SK, Bork P, Lercher MJ: Similar gene expression profiles do not imply similar tissue functions. Trends Genet. 2006, 22 (3): 132-138. 10.1016/j.tig.2006.01.006.View ArticlePubMed
  18. Liu Bingchuan, Chen Jiajia, Shen B: Genome-wide Analysis of the Transcription Factor Binding Preference of Human Bidirectional Promoters and Functional Annotation of the Related Gene Pairs. The Fourth International Conference on Computatiomal Systems Biology (ISB2010). 2010, 81-92.
  19. Bird AP: Gene number, noise reduction and biological complexity. Trends Genet. 1995, 11 (3): 94-100. 10.1016/S0168-9525(00)89009-5.View ArticlePubMed
  20. de Jonge HJ, Fehrmann RS, de Bont ES, Hofstra RM, Gerbens F, Kamps WA, de Vries EG, van der Zee AG, te Meerman GJ, ter Elst A: Evidence based selection of housekeeping genes. PLoS One. 2007, 2 (9): e898-10.1371/journal.pone.0000898.PubMed CentralView ArticlePubMed
  21. Eisenberg E, Levanon EY: Human housekeeping genes are compact. Trends Genet. 2003, 19 (7): 362-365. 10.1016/S0168-9525(03)00140-9.View ArticlePubMed
  22. Liao BY, Scott NM, Zhang J: Impacts of gene essentiality, expression pattern, and gene compactness on the evolutionary rate of mammalian proteins. Mol Biol Evol. 2006, 23 (11): 2072-2080. 10.1093/molbev/msl076.View ArticlePubMed
  23. Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A: BioMart Central Portal--unified access to biological data. Nucleic Acids Res. 2009, 37: W23-27. 10.1093/nar/gkp265.PubMed CentralView ArticlePubMed
  24. Lewin B: Genes 9. 2008, Jones and Bartlett Publishers
  25. Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 2008, 322 (5909): 1845-1848. 10.1126/science.1162228.PubMed CentralView ArticlePubMed
  26. Seila AC, Calabrese JM, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA, Sharp PA: Divergent transcription from active promoters. Science. 2008, 322 (5909): 1849-1851. 10.1126/science.1162253.PubMed CentralView ArticlePubMed
  27. Seila AC, Core LJ, Lis JT, Sharp PA: Divergent transcription: a new feature of active promoters. Cell Cycle. 2009, 8 (16): 2557-2564. 10.4161/cc.8.16.9305.View ArticlePubMed
  28. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E: Ensembl 2009. Nucleic Acids Res. 2009, 37: D690-697. 10.1093/nar/gkn828.PubMed CentralView ArticlePubMed
  29. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004, 14 (1): 160-169.PubMed CentralView ArticlePubMed
  30. Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl automatic gene annotation system. Genome Res. 2004, 14 (5): 942-950. 10.1101/gr.1858004.PubMed CentralView ArticlePubMed
  31. Lercher MJ, Chamary JV, Hurst LD: Genomic regionality in rates of evolution is not explained by clustering of genes of comparable expression profile. Genome Res. 2004, 14 (6): 1002-1013. 10.1101/gr.1597404.PubMed CentralView ArticlePubMed
  32. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28 (1): 27-30. 10.1093/nar/28.1.27.PubMed CentralView ArticlePubMed
  33. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMed
  34. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307-315. 10.1093/bioinformatics/btg405.View ArticlePubMed
  35. Ovaska K, Laakso M, Hautaniemi S: Fast gene ontology based clustering for microarray experiments. BioData Min. 2008, 1 (1): 11-10.1186/1756-0381-1-11.PubMed CentralView ArticlePubMed

Copyright

© Xu et al.; licensee BioMed Central Ltd. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Advertisement