A regulatory similarity measure using the location information of transcription factor binding sites in Saccharomyces cerevisiae
BMC Systems Biology volume 8, Article number: S9 (2014)
Defining a measure for regulatory similarity (RS) of two genes is an important step toward identifying co-regulated genes. To date, transcription factor binding sites (TFBSs) have been widely used to measure the RS of two genes because transcription factors (TFs) binding to TFBSs in promoters is the most crucial and well understood step in gene regulation. However, existing TFBS-based RS measures consider the relation of a TFBS to a gene as a Boolean (either 'presence' or 'absence') without utilizing the information of TFBS locations in promoters.
Functional TFBSs of many TFs in yeast are known to have a strong positional preference to occur in a small region in the promoters. This biological knowledge prompts us to develop a novel RS measure that exploits the TFBS location information. The performances of different RS measures are evaluated by the fraction of gene pairs that are co-regulated (validated by literature evidence) by at least one common TF under different RS scores. The experimental results show that the proposed RS measure is the best co-regulation indicator among the six compared RS measures. In addition, the co-regulated genes identified by the proposed RS measure are also shown to be able to benefit three co-regulation-based applications: detecting gene co-function, gene co-expression and protein-protein interactions.
The proposed RS measure provides a good indicator for gene co-regulation. Besides, its good performance reveals the importance of the location information in TFBS-based RS measures.
Identification of co-regulated genes is helpful for solving many biological problems, such as unraveling the underlying molecular mechanisms of specific cellular functions, identifying functionally related proteins and dissecting gene regulatory networks [1–3]. The first step toward identifying co-regulated genes is to define the regulatory similarity (i.e., the degree of co-regulation) of two genes. Gene regulation is a complex process that involves various mechanisms: transcription factor (TF) binding, miRNA binding, epigenetic modifications, etc. Various data related to these mechanisms, such as TF binding sites, miRNA binding sites and histone modification patterns, are now available for the study of gene regulation. Among them, TF binding sites (TFBSs) have been the most widely used, because the binding of TFs to TFBSs in promoters is the most crucial and best understood step in gene regulation.
To date, many studies have used TFBS data to measure the regulatory similarity (RS) of two genes [4–8]. However, existing TFBS-based RS measures consider the relation of a TFBS to a gene as a Boolean (either 'presence' or 'absence') without utilizing the information of TFBS locations. In yeast and human, functional TFBSs of many TFs are known to have a strong positional preference to occur in a small region in the promoters [9, 10]. This biological knowledge prompts us to develop a novel RS measure that exploits the TFBS location information. Following Allocco et al.'s approach, the performances of different RS measures are evaluated by the fraction of gene pairs that are co-regulated (validated by the literature evidence deposited in the YEASTRACT database) by at least one common TF under different RS scores. The experimental results show that the proposed RS measure is the best co-regulation indicator among the six compared RS measures. In addition, the co-regulated genes identified by the proposed RS measure are shown to benefit three co-regulation-based applications: detecting gene co-function, gene co-expression and protein-protein interactions.
This study proposes a novel RS measure using the TFBS location information. This section first describes the datasets used in this study and five existing TFBS-based RS measures followed by the proposed RS measure.
Following previous studies in the literature, the promoter of a yeast gene in this study is defined as the intergenic region between this gene and its nearest non-overlapping upstream gene [13–18]. The genomic locations of the start and stop codons of 6604 genes of Saccharomyces cerevisiae (the budding yeast) were retrieved from Nagalakshmi et al.'s work. The genomic locations of 422576 TFBSs of 163 yeast TFs were collected from the SwissRegulon database, which deposits high-quality TFBS datasets predicted using Bayesian probabilistic analysis. Users can choose different posterior probability cutoffs to control the quality of the retrieved TFBSs. This study adopted a moderate cutoff of 0.5 and includes a section that discusses the influence of TFBS quality on the proposed RS measure.
Existing TFBS-based RS measures
Table 1 lists five existing TFBS-based RS measures of two genes, a and b. The first three RS measures do not consider the copies of TFBSs (namely, a TF having multiple TFBSs is treated the same as a TF having one TFBS), while the last two do. In this context, the sets of TFs whose TFBSs exist in the promoters of a and b are denoted as TF_a and TF_b, respectively. TFs that have TFBSs in the promoters of both a and b (i.e., TF_a ∩ TF_b) are called common TFs. In the first group of RS measures, Garten et al. adopted the cumulative hypergeometric test to estimate the significance of the observed overlap between TF_a and TF_b in comparison with random expectation. Veerla and Höglund adopted the Jaccard index to define the similarity of promoter organization between two genes. This index calculates the RS as the ratio of the size of the intersection of TF_a and TF_b to the size of their union. Shalgi et al. proposed a variant of Eq. (2) that replaces the denominator with the smaller of the sizes of TF_a and TF_b. In the second group of RS measures, Park et al. used the proportion of TFBSs in common as the RS of two genes and introduced a penalty term for TFBSs appearing in only one gene's promoter. Van Helden adopted the Poisson distribution to define the RS of two genes as the difference between the similarity score (1 minus the p-value of the observed TFBSs in common) and the dissimilarity score (the difference between the p-values of the observed TFBSs in a and in b).
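The three copy-insensitive measures described above can be illustrated with a minimal Python sketch. It follows only the verbal descriptions in this section; the function names, the example TF sets and the exact form of the hypergeometric tail are assumptions for illustration and may differ in detail from Eqs. (1)-(3) in Table 1.

```python
from math import comb

def jaccard_rs(tf_a, tf_b):
    """Size of the intersection of the two TF sets divided by the size of their union."""
    union = tf_a | tf_b
    return len(tf_a & tf_b) / len(union) if union else 0.0

def min_overlap_rs(tf_a, tf_b):
    """Variant whose denominator is the size of the smaller TF set."""
    smaller = min(len(tf_a), len(tf_b))
    return len(tf_a & tf_b) / smaller if smaller else 0.0

def hypergeom_tail(n_total, n_a, n_b, n_common):
    """P(overlap >= n_common) when n_b TFs are drawn at random from n_total TFs,
    n_a of which belong to gene a (assumed form of the cumulative test)."""
    upper = min(n_a, n_b)
    favourable = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
                     for k in range(n_common, upper + 1))
    return favourable / comb(n_total, n_b)

# Hypothetical TF sets for two promoters (illustrative only).
tf_a = {"Abf1", "Rpn4", "Reb1"}
tf_b = {"Abf1", "Rpn4", "Gcn4", "Cbf1"}
print(jaccard_rs(tf_a, tf_b))       # 2 common TFs out of 5 in the union
print(min_overlap_rs(tf_a, tf_b))   # 2 common TFs out of the smaller set of 3
print(hypergeom_tail(163, len(tf_a), len(tf_b), len(tf_a & tf_b)))
```

A smaller tail probability indicates an overlap that is less likely under random expectation; the value 163 is the number of TFs in the dataset described earlier.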
The proposed RS measure
Equations (1)-(5) consider the relation of a TFBS to a gene as a Boolean (either 'presence' or 'absence') without utilizing the information of TFBS locations in the promoters. The biological knowledge that the biological relevance of TFBSs is highly related to their locations in the promoters [9, 10] motivates us to introduce the TFBS location information into the RS measure as follows:
, Eq. (6)
where L is the longer of the promoter lengths of genes a and b, i indexes the common TFs (those that have TFBSs in the promoters of both a and b), and d_i is the smallest distance between any TFBS of the i-th common TF in one promoter and any TFBS of the same TF in the other promoter. In this context, d_i is called the TFBS offset distance. A schematic explanation of Eq. (6) is shown in Figure 1, where TFBSs have different shapes for different TFs and different colors for the genes in which they are located. The two promoters of a and b are aligned by their start codons (Gene View). To compute d_i, only the TFBSs of the i-th common TF are used and those of other TFs are ignored (TF View). In Figure 1, a small d_i, which leads to a high RS, indicates that the TFBSs of the i-th common TF in the two promoters lie in a similar region.
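The computation of the TFBS offset distance can be sketched in Python as follows. The offset distance follows the definition given above. The scoring step, however, is only an assumed stand-in (each common TF contributes 1 - d_i/L, and the contributions are summed); it is chosen because it grows as d_i shrinks, and it should not be read as the exact form of Eq. (6). The promoter data, names and positions are hypothetical.

```python
def offset_distance(sites_a, sites_b):
    """Smallest distance between any site in promoter a and any site in promoter b.
    Positions are measured from each gene's start codon, so the promoters are
    implicitly aligned at the start codons."""
    return min(abs(x - y) for x in sites_a for y in sites_b)

def location_rs(promoter_a, promoter_b, len_a, len_b):
    """promoter_*: dict mapping TF name -> list of TFBS positions.
    ASSUMPTION: each common TF contributes 1 - d_i / L; this is a placeholder
    scoring rule, not necessarily the paper's Eq. (6)."""
    L = max(len_a, len_b)                            # the longer promoter length
    common = promoter_a.keys() & promoter_b.keys()   # common TFs
    return sum(1 - offset_distance(promoter_a[tf], promoter_b[tf]) / L
               for tf in common)

# Hypothetical promoters (positions in bp upstream of the start codon).
prom_a = {"Abf1": [120, 400], "Rpn4": [200]}
prom_b = {"Abf1": [130], "Rpn4": [650], "Gcn4": [90]}
score = location_rs(prom_a, prom_b, len_a=500, len_b=1000)
print(round(score, 2))  # Abf1: d = 10, Rpn4: d = 450, so about 0.99 + 0.55
```

In this toy example Abf1, whose sites nearly coincide, contributes almost the maximum, while Rpn4, whose sites are far apart, contributes much less; Gcn4 is ignored because it is not a common TF.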
Results and discussion
Small TFBS offset distances imply high regulatory similarity
This study is motivated by the biological knowledge that functional TFBSs of many TFs in yeast have a strong positional preference in the promoters. Because the critical regions in the promoters that make TFBSs functional are unknown, Eq. (6) is actually based on a derived hypothesis: if the offset distance of two TFBSs of a common TF in two genes' promoters is small, the two TFBSs are likely to both lie in the critical regions and therefore to be co-functional. To investigate the practicability of this hypothesis, the relation between co-functionality and the TFBS offset distance was analyzed as follows. As shown in Figure 1, a TFBS offset distance can be computed given a TF t and two genes a and b, denoted as a <t, a, b> tuple. In this analysis, the co-functionality related to a TFBS offset distance was defined as the fraction of tuples with that offset distance for which the literature evidence collected in the YEASTRACT database shows that TF t regulates both a and b. The detailed steps are listed below:
• For a TF t, all gene pairs <a, b> whose promoters have the TFBS of t were collected.
• The TFBS offset distance (d_i in Figure 1) of t relative to <a, b> was calculated.
• The tuple <t, a, b> was stored in the bucket B_d, where d is the TFBS offset distance of <t, a, b>.
• After the first three steps were repeated for all TFs, each bucket contained all tuples having the same TFBS offset distance.
• Finally, for each bucket B_d, the fraction of tuples in which the literature evidence showed that TF t regulates both a and b was plotted against d.
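The bucketing steps above can be sketched in Python as follows. The data layout (a nested dictionary of TFBS positions and a set of documented TF-gene associations) and all names are assumptions made for illustration; they are not the data structures used in the study.

```python
from collections import defaultdict
from itertools import combinations

def cofunctionality_by_offset(tfbs, documented):
    """tfbs: dict TF -> dict gene -> list of TFBS positions in that gene's promoter.
    documented: set of (TF, gene) pairs supported by literature evidence.
    Returns dict offset distance d -> fraction of tuples <t, a, b> in bucket B_d
    for which t is documented to regulate both a and b."""
    buckets = defaultdict(list)
    for tf, genes in tfbs.items():
        # All gene pairs whose promoters contain a TFBS of this TF.
        for a, b in combinations(sorted(genes), 2):
            d = min(abs(x - y) for x in genes[a] for y in genes[b])
            buckets[d].append((tf, a) in documented and (tf, b) in documented)
    return {d: sum(flags) / len(flags) for d, flags in buckets.items()}

# Hypothetical toy data.
tfbs = {
    "T1": {"g1": [100], "g2": [110], "g3": [300]},
    "T2": {"g1": [50], "g2": [60]},
}
documented = {("T1", "g1"), ("T1", "g2"), ("T2", "g1")}
print(cofunctionality_by_offset(tfbs, documented))
```

Each key of the returned dictionary corresponds to one point in a plot such as Figure 2, with the offset distance on the x-axis and the fraction on the y-axis.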
The results are shown in Figure 2, where each point is a bucket, the x-axis is the TFBS offset distance, and the y-axis is the fraction of tuples in the bucket for which the literature evidence showed that TF t regulates both a and b. Figure 2 shows an obvious linear relation (R² = 0.8106), which suggests that the above hypothesis is practically usable. Eq. (6) implements this concept through d_i: a common TF with a smaller TFBS offset distance (d_i) contributes a larger term to the RS.
The proposed RS measure is a good co-regulation indicator
Following Allocco et al.'s approach, this study evaluated TFBS-based RS measures by the fraction of gene pairs that are co-regulated (validated by the literature evidence) by at least one common TF under different RS scores. From the 6604 yeast genes retrieved from Nagalakshmi et al.'s work, 359 genes having no TFBSs were excluded. The remaining 6245 genes formed 19496890 gene pairs, from which 1443 head-to-head gene pairs (both genes in such a pair share the same promoter) were further excluded. Finally, the remaining 19495447 gene pairs were used as the evaluation dataset. Figure 3 shows the results of Eqs. (1)-(6) on the evaluation dataset. In Figure 3, the x-axis is the RS score obtained by each RS measure and the y-axis is, among all gene pairs with the corresponding RS score, the fraction that are co-regulated (validated by the literature evidence) by at least one common TF.
The results show that the proposed RS score is highly correlated with the likelihood that a gene pair is co-regulated by at least one common TF. The plot of the proposed RS measure (Figure 3a) is increasing and smooth over most of its range, except for a few points at the left. Its R² of Spearman rank correlation (0.963) is significantly higher than random expectation (p-value less than 0.001). The R² of the proposed measure is also significantly higher than those of the existing RS measures (see Table 2). Since the unique feature of the proposed RS measure is the use of TFBS location information, this shows that TFBS location information is useful in calculating the regulatory similarity between two genes. The previous section presented the underlying hypothesis together with numerical evidence. The results in this section further show that Eq. (6), as an implementation of the hypothesis, works. Although Eq. (6) may incorrectly increase the weights of TFBSs that are co-present in non-critical regions, it effectively decreases the weights of those present in the critical region of one gene but in a non-critical region of the other gene.
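The dataset sizes quoted above can be checked with a few lines of arithmetic. The sketch below only reproduces the counts stated in the text; the variable names are illustrative.

```python
from math import comb

total_genes = 6604                        # genes retrieved from Nagalakshmi et al.
genes_with_tfbs = total_genes - 359       # genes without any TFBS are excluded
all_pairs = comb(genes_with_tfbs, 2)      # unordered gene pairs
evaluation_pairs = all_pairs - 1443       # head-to-head pairs are excluded

print(genes_with_tfbs, all_pairs, evaluation_pairs)
```

The three printed numbers correspond to the 6245 genes, 19496890 gene pairs and 19495447 evaluation pairs reported in the text.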
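For readers who wish to reproduce this kind of comparison, the following is a minimal pure-Python sketch of the Spearman rank correlation between RS scores and co-regulated fractions. The text does not state how the reported R² was derived from the rank correlation, so only the coefficient itself is computed here; the input values are hypothetical.

```python
def ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical RS scores and co-regulated fractions.
rs_scores = [0.1, 0.2, 0.3, 0.4]
fractions = [0.01, 0.05, 0.04, 0.20]
print(spearman(rs_scores, fractions))
```

A coefficient close to 1 means that the co-regulated fraction rises almost monotonically with the RS score, which is the behaviour expected of a good co-regulation indicator.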
The effects of TFBS qualities
The SwissRegulon database, whose TFBS data were used in this study, provides a posterior probability parameter that lets users control the quality of the obtained TFBSs. In fact, most resources of TFBS locations provide parameters such as ChIP-chip p-value and phylogenetic conservation and let users choose the most appropriate values for their applications [13, 17, 21]. This section examines whether the TFBS quality affects the performance of the proposed RS measure and, if so, which TFBS quality settings are recommended.
Figure 4 shows the results of the proposed RS measure using different SwissRegulon posterior probability cutoffs. The obvious turn in the region 0.00~0.05 of the curves corresponding to high cutoffs (0.8 and 0.9) reveals that, in that region, the proposed RS measure (x-axis) was poorly correlated with the likelihood that a gene pair is co-regulated by at least one common TF (y-axis). The curves of the next two lower cutoffs (0.7 and 0.6) were smoother but still had a small peak around x = 0.15. As the cutoff dropped, the correlation between the x-axis and the y-axis became stronger. These results suggest a counterintuitive conclusion: the proposed RS measure requires the TFBS quality to be below a threshold. This conclusion can be explained by the TFBS quantity (Table 3), since the quality cutoff also affects the quantity. The TFBS quantity at cutoff 0.1 was about three times that at cutoff 0.7 and ten times that at cutoff 0.9. The results suggest that the proposed RS measure is more sensitive to a drastic change in TFBS quantity than to TFBS quality. With a sufficient quantity of TFBSs, the proposed RS measure is robust to current TFBS data, even when using the data with the lowest quality (cutoff 0.1).
This section uses a case (yeast CCT8) to explain the performance advantage of the proposed RS measure. CCT8 is a subunit of the cytosolic chaperonin Cct ring complex. In this case study, yeast CCT8 was the gene of interest and its co-regulated genes were sought. For this purpose, the RSs of all yeast genes to CCT8 were calculated and the 30 highest ranked genes were considered as co-regulated gene candidates of CCT8 (Table 4). To examine what is unique about the proposed RS measure, we focused on a candidate, RPN8, which is identified only by the proposed RS measure and not by the other five compared RS measures. We further examined which genes were ranked ahead of RPN8 (and therefore pushed it out of the candidate list) by the other RS measures and found an interesting competing gene, RSC1.
Table 5 shows the rank orders of the two genes (RPN8 and RSC1) among all yeast genes by their similarity to CCT8 using different RS measures. In this table, the proposed RS measure gave RPN8 a better rank (#29) than RSC1 (#117), but all the other five RS measures gave the reverse rank order. To investigate the details, the promoters of CCT8, RPN8 and RSC1 were plotted (Figure 5). Figure 5a depicts the aligned promoters of CCT8 and RPN8, while Figure 5b depicts the aligned promoters of CCT8 and RSC1. The number of common TFs of CCT8 and RPN8 is three, and the number of common TFs of CCT8 and RSC1 is five. This is why the other TFBS-based RS measures rank RSC1 above RPN8. However, two of the three common TFs of CCT8 and RPN8 have small TFBS offset distances (Rpn4 and Abf1), whereas only one of the five common TFs of CCT8 and RSC1 has a small TFBS offset distance (Abf1). Since the proposed RS measure is the only one that considers the information of TFBS locations, this explains why it ranked RPN8 and RSC1 in a different order from the other measures.
To justify the correctness of the rank order, the biological relevance of the common TFs was analyzed. In this study, a TF is defined as biologically relevant to a gene if the literature evidence obtained from the YEASTRACT database shows that the TF regulates the gene. In Figure 5, all TFs with small TFBS offset distances are biologically relevant to both target genes (Rpn4 and Abf1 to both CCT8 and RPN8 in (a), and Abf1 to both CCT8 and RSC1 in (b)). Furthermore, none of the other TFs, which have large TFBS offset distances, is simultaneously relevant to both downstream genes. This supports the correctness of the proposed RS measure as well as the importance of incorporating the information of TFBS locations.
Good RS measure benefits co-regulation-based applications
Co-regulated genes are considered to influence many biological behaviors, and co-regulation measures have been used in various applications [22, 23]. The section "The proposed RS measure is a good co-regulation indicator" shows that the proposed RS measure is a better co-regulation indicator than its five competitors. This section discusses whether this leads to better results in three co-regulation-based applications: detecting gene co-function, gene co-expression and protein-protein interactions.
In this study, the scenario of detecting gene co-function, gene co-expression and protein-protein interactions using gene co-regulation was designed as follows. First, users have a target gene of interest. The RS score of the target gene against each gene in the genome is calculated. The n genes with the highest RSs are called the regulatory neighborhood (RN) of the target gene, and n is called the neighborhood size. The degree of co-function of the RN is then evaluated using the functional enrichment score proposed by Reimand et al., denoted as FES in this study. In FES, genes are considered to perform similar biological functions if they have similar Gene Ontology (GO) terms. The degree of co-expression of the RN is evaluated by the co-expression score proposed by Yang and Wu, denoted as CES in this study. CES is the average of the pairwise expression correlations in the RN. The degree of protein-protein interactions of the RN is evaluated by the interaction enrichment score proposed by Reimand et al., denoted as IES in this study. IES measures the tendency of an RN to form protein complex modules.
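The selection of a regulatory neighborhood and the computation of a co-expression score can be sketched in Python as follows. The text describes CES only as the average of pairwise expression correlations, so the use of Pearson correlation here is an assumption, as are the function names and the toy data; FES and IES are not sketched because their definitions are not given in this section.

```python
from itertools import combinations

def regulatory_neighborhood(rs_to_target, n):
    """rs_to_target: dict gene -> RS score against the target gene.
    Returns the n genes with the highest RS scores."""
    return sorted(rs_to_target, key=rs_to_target.get, reverse=True)[:n]

def pearson(x, y):
    """Pearson correlation of two equally long profiles (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def coexpression_score(genes, expression):
    """Average pairwise correlation of expression profiles within a gene set
    (requires at least two genes). ASSUMPTION: Pearson correlation is used."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)

# Hypothetical RS scores against a target gene and toy expression profiles.
rs_to_target = {"g1": 0.9, "g2": 0.2, "g3": 0.7}
expression = {"g1": [1.0, 2.0, 3.0], "g2": [3.0, 1.0, 2.0], "g3": [2.0, 4.0, 6.0]}
rn = regulatory_neighborhood(rs_to_target, n=2)
print(rn, coexpression_score(rn, expression))
```

Repeating this for several neighborhood sizes n and for each RS measure would give curves comparable in spirit to a comparison across measures, with a higher score indicating a more strongly co-expressed neighborhood.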
The results of the proposed RS measure and the five existing RS measures in the three applications are shown in Figure 6 and Table 6. The proposed RS measure achieved the highest performance among all the compared RS measures in all applications and all neighborhood sizes. In all three applications, the RS measures of van Helden, Veerla and Höglund and Garten et al. had similar performance and were the second best group.
This study proposed a novel measure that computes the regulatory similarity (RS) of two genes using the location information of transcription factor binding sites. Based on the documented regulatory associations between TFs and genes in the YEASTRACT database, this study has shown that the proposed RS measure is a good co-regulation indicator. Furthermore, its good performance can benefit three co-regulation-based applications. The proposed RS measure will be helpful for unraveling the underlying molecular mechanisms of specific cellular functions and dissecting gene regulatory networks.
Terai G, Takagi T, Nakai K: Prediction of co-regulated genes in Bacillus subtilis on the basis of upstream elements conserved across three closely related species. Genome Biol. 2001, 2 (11): research0048.0001-research0048.0012
Polanski K, Rhodes J, Hill C, Zhang P, Jenkins DJ, Kiddle SJ, Jironkin A, Beynon J, Buchanan-Wollaston V, Ott S: Wigwams: identifying gene modules co-regulated across multiple biological conditions. Bioinformatics. 2014, 30 (7): 962-970. 10.1093/bioinformatics/btt728.
Lin TW, Wu JW, Chang DTH: Combining phylogenetic profiling-based and machine learning-based techniques to predict functional related proteins. PLoS One. 2013, 8 (9): e75940. 10.1371/journal.pone.0075940.
Garten Y, Kaplan S, Pilpel Y: Extraction of transcription regulatory signals from genome-wide DNA-protein interaction data. Nucleic Acids Research. 2005, 33 (2): 605-615. 10.1093/nar/gki166.
Veerla S, Höglund M: Analysis of promoter regions of co-expressed genes identified by microarray analysis. BMC Bioinformatics. 2006, 7 (1): 384. 10.1186/1471-2105-7-384.
Shalgi R, Lieber D, Oren M, Pilpel Y: Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Computational Biology. 2007, 3 (7): e131. 10.1371/journal.pcbi.0030131.
Park PJ, Butte AJ, Kohane IS: Comparing expression profiles of genes with similar promoter regions. Bioinformatics. 2002, 18 (12): 1576-1584. 10.1093/bioinformatics/18.12.1576.
Van Helden J: Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics. 2004, 20 (3): 399-406. 10.1093/bioinformatics/btg425.
Hansen L, Mariño-Ramírez L, Landsman D: Many sequence-specific chromatin modifying protein-binding motifs show strong positional preferences for potential regulatory regions in the Saccharomyces cerevisiae genome. Nucleic Acids Research. 2010, 38 (6): 1772-1779. 10.1093/nar/gkp1195.
Tabach Y, Brosh R, Buganim Y, Reiner A, Zuk O, Yitzhaky A, Koudritsky M, Rotter V, Domany E: Wide-scale analysis of human functional transcription factor binding reveals a strong bias towards the transcription start site. PLoS One. 2007, 2 (8): e807. 10.1371/journal.pone.0000807.
Allocco DJ, Kohane IS, Butte AJ: Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics. 2004, 5 (1): 18. 10.1186/1471-2105-5-18.
Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, Sá-Correia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Research. 2006, 34 (suppl 1): D446-D451.
MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics. 2006, 7 (1): 113. 10.1186/1471-2105-7-113.
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804. 10.1126/science.1075090.
Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS: Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001, 106 (6): 697-708. 10.1016/S0092-8674(01)00494-9.
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431 (7004): 99-104. 10.1038/nature02800.
Chang DTH, Huang CY, Wu CY, Wu WS: YPA: an integrated repository of promoter features in Saccharomyces cerevisiae. Nucleic Acids Research. 2011, 39 (suppl 1): D647-D652.
Chang DTH, Li WS, Bai YH, Wu WS: YGA: Identifying distinct biological features between yeast gene sets. Gene. 2013, 518 (1): 26-34. 10.1016/j.gene.2012.11.089.
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320 (5881): 1344-1349. 10.1126/science.1158441.
Pachkov M, Erb I, Molina N, Van Nimwegen E: SwissRegulon: a database of genome-wide annotations of regulatory sites. Nucleic Acids Research. 2007, 35 (suppl 1): D127-D131.
Tsai HK, Chou MY, Shih CH, Huang GTW, Chang TH, Li WH: MYBS: a comprehensive web server for mining transcription factor binding sites in yeast. Nucleic Acids Research. 2007, 35 (suppl 2): W221-W226.
Bhardwaj N, Lu H: Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics. 2005, 21 (11): 2730-2738. 10.1093/bioinformatics/bti398.
Gyenesei A, Wagner U, Barkow-Oesterreicher S, Stolte E, Schlapbach R: Mining co-regulated gene profiles for the detection of functional associations in gene expression data. Bioinformatics. 2007, 23 (15): 1927-1935. 10.1093/bioinformatics/btm276.
Reimand J, Vaquerizas JM, Todd AE, Vilo J, Luscombe NM: Comprehensive reanalysis of transcription factor knockout expression data in Saccharomyces cerevisiae reveals many new targets. Nucleic Acids Research. 2010, 38 (14): 4768-4777. 10.1093/nar/gkq232.
Gene Ontology Consortium: The gene ontology: enhancements for 2011. Nucleic Acids Research. 2012, 40 (D1): D559-D564.
Yang TH, Wu W-S: Identifying biologically interpretable transcription factor knockout targets by jointly analyzing the transcription factor knockout microarray and the ChIP-chip data. BMC Systems Biology. 2012, 6 (1): 102. 10.1186/1752-0509-6-102.
This work was supported by Ministry of Science and Technology of Taiwan.
The publication charges of this article were funded by Ministry of Science and Technology of Taiwan grant NSC 102-2221-E-006-085-MY2.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5.
WSW and DTHC conceived the research topic, provided essential guidance, developed the algorithm and wrote the manuscript. MLW and CMY performed all the simulations. All authors read and approved the final manuscript.