SPIC: A novel similarity metric for comparing transcription factor binding site motifs based on information contents
BMC Systems Biology volume 7, Article number: S14 (2013)
Discovering transcription factor binding sites (TFBSs) is one of the primary challenges in deciphering the complex gene regulatory networks encoded in a genome. A set of short DNA sequences recognized by a transcription factor (TF) is known as a motif, which can be represented in matrix form, such as a position-specific scoring matrix (PSSM) or a position frequency matrix (PFM). Very frequently, we need to query a motif against a database of motifs to find similar motifs, merge similar TFBS motifs possibly recognized by the same TF, separate irrelevant motifs, or filter out spurious motifs. All these applications therefore require a metric that captures slight differences between irrelevant motifs and highlights the similarity between motifs of the same group. Although several motif similarity metrics have been proposed, their performance is still far from satisfactory for these applications.
A novel metric, SPIC (Similarity with Position Information Contents), is proposed in this paper for measuring the similarity between a column of one motif and a column of another motif. In defining this similarity score, we consider the likelihood that the column of the first motif's PFM can be produced by the column of the second motif's PSSM, multiply that likelihood by the information content of the column of the second motif's PSSM, and vice versa. To compute the similarity between two motifs, we evaluated SPIC combined with a local or a global alignment method using an affine gap penalty function. We also compared SPIC with seven existing state-of-the-art metrics on three datasets for their ability to cluster motifs from the same group and to retrieve motifs from a database.
When used jointly with the Smith-Waterman local alignment method with an affine gap penalty function (gap open penalty equal to 1, gap extension penalty equal to 0.5), SPIC outperforms the seven existing state-of-the-art motif similarity metrics combined with their best alignments in matching motifs in database searches, clustering sub-motifs of the same TF, and distinguishing relevant motifs from a miscellaneous group of motifs.
We have developed a novel motif similarity metric that can more accurately match motifs in database searches, and more effectively cluster similar motifs and differentiate irrelevant motifs than do the other seven metrics we are aware of.
As one of the most important cellular functions, transcriptional regulation determines the specific gene products in a cell, upon which all the other cellular functions are based [1, 2]. Transcriptional regulation is triggered by the binding of TF proteins to specific DNA sequences of 6-25 bp (base pairs), called cis-regulatory elements (CREs) or transcription factor binding sites (TFBSs), in a gene's promoter region or in remote regulatory regions such as enhancers, silencers and insulators. These TF-DNA interactions in a cell form the transcriptional regulatory network (TRN) of the cell. In principle, the TRNs of all cell types of an organism are encoded in its genome; however, deciphering these TRNs from the genome sequence turns out to be a very challenging task [5, 6]. The first step toward this goal is to recognize all TFBSs in a genome [5, 7, 8]. Although the binding sites of the same TF usually share conserved features and the same length, they can show some level of degeneracy and can be located in very long non-coding sequences, making their computational prediction very difficult. A set of conserved binding sites of the same TF is usually called a motif, which can be verified by experiments or predicted by comparing a set of DNA sequences potentially containing the TFBSs. Many de novo motif-finding algorithms have been developed to identify TFBSs, exploiting the fact that TFBSs are often more conserved than their surrounding DNA segments. A position frequency matrix (PFM) or a position-specific scoring matrix (PSSM) is commonly employed to represent a motif [9, 10]. The two matrices are derived from the alignment of the motif's individual binding site sequences, and largely mirror the positional binding preference of the corresponding TF. Thus, we can use either matrix to scan sequences potentially containing TFBSs to discover them.
After using motif-finding tools to obtain putative motifs, we often want to infer their cognate TFs by looking for matching motifs in a validated TFBS motif database, or to cluster similar sub-motifs of the same TF obtained by different methods to remove redundancy or to form a complete motif [11–13]. Moreover, the motifs of a TF family also show some level of similarity and form a familial binding profile (FBP), because the TFs in a family belong to a structurally related class [14, 15]. Consequently, an effective metric is desired for measuring motif-motif similarity in the applications mentioned above. Most current motif comparison methods consist of two parts: a column similarity metric for comparing two columns taken from the PFMs (or the PSSMs) of two motifs, and a pairwise alignment algorithm for the two motifs that uses the column similarity metric and a gap penalty function. The metrics for measuring column-to-column similarity mainly include the sum of squared distances (SSD) [15, 16], the p-value of Chi-square (pCS), the average log-likelihood ratio (ALLR), the average Kullback-Leibler (AKL), and Pearson's correlation coefficient (PCC). Either the Needleman-Wunsch or the Smith-Waterman algorithm is typically applied to search for the optimal alignment, assuming an affine gap penalty function. Mahony et al. built the web server STAMP, which integrates these metrics and alignment algorithms, after assessing them [11, 23]. Besides these metrics used with alignment algorithms, two alignment-free metrics for comparing motifs, Mosta and KFV, were designed by Pape et al. and by Xu and Su, respectively. The two alignment-free metrics and those in STAMP have been evaluated by Xu and Su, who showed the KFV method to be better than Mosta and the others.
Note that the seven metrics mentioned above employ only PFMs; none of them uses the column information contents (ICs) or PSSMs. In fact, if the total ICs of two motifs are low, the motifs may still receive a high similarity score because of high correlation between each pair of columns. So if two motifs have columns with low ICs, these columns need to be deleted before the metrics are used for the comparison. These metrics work well for clustering similar motifs but can hardly separate true motifs from spurious ones with low-IC columns.
Here we present a novel metric named SPIC (Similarity with Position Information Contents) with better performance for column-to-column motif comparison. A similar metric with ungapped alignment was proposed in GLECLUBS and eGLECLUBS, our genome-wide TFBS motif prediction tools for prokaryotes based on comparative genomics. In this paper, we improve the metric by considering different alignment algorithms with gap functions. In particular, besides the PFMs and PSSMs, the information content of each position is incorporated into the SPIC metric. More specifically, for any two columns taken from two motifs, SPIC first computes a score between the PSSM of one column, multiplied by that column's IC, and the PFM of the other column, and vice versa. The similarity between the two columns is then defined from these two scores with normalization. When evaluated on the datasets from STAMP, KFV, and GLECLUBS [12, 13], SPIC outperforms all the existing metrics in recovering motifs by searching a database and in grouping closely related motifs.
The STAMP tool contains five column similarity metrics. The detailed definitions of these metrics are summarized in Table 1. In these definitions, for each column of a PFM, denotes the probability of each base , the average of , the total counts of all bases, and the total counts of base . denotes the background probability of each base and is assumed to be 0.25 for all bases. In the Asymptotic Covariance (AC) metric designed by Pape et al., the asymptotic covariance between the counts of all binding sites separately from two TFBS motifs and their reverse complementary TFBSs in an m-length background sequence is calculated (see Table 1). The KFV (k-mer frequency vector) metric, recently proposed by Xu and Su, first converts each PFM of length k into a -dimension composition vector and then uses the cosine angle to calculate the similarity between the vectors of two motifs.
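As a rough illustration of what a column similarity metric computes, the sketch below implements two of the simpler measures named above, Pearson's correlation coefficient and the sum of squared distances, on a pair of columns given as base probabilities in A, C, G, T order. The textbook forms are assumed here; the exact definitions in Table 1 (for example, any offset or sign convention applied to SSD) may differ, and the function names are hypothetical.

```python
import math

def pcc_column(x, y):
    """Pearson correlation between two columns of base probabilities."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A perfectly uniform column has no variance; treating the
        # correlation as 0 is a choice made for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def ssd_column(x, y):
    """Sum of squared differences between two columns (a distance, not a similarity)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

col_a = [0.7, 0.1, 0.1, 0.1]   # A-rich column
col_c = [0.1, 0.7, 0.1, 0.1]   # C-rich column
print(pcc_column(col_a, col_a), pcc_column(col_a, col_c), ssd_column(col_a, col_c))
```

Both functions see only the base probabilities of the two columns, so two nearly uniform (low-information) columns can still correlate strongly.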
The SPIC Metric
Given a motif composed of TFBSs with a length , let be its PFM and be its PSSM defined as,
where denotes the probability of base contained in background sequences, and are the probability and number of base located at the column of , respectively. Note that a pseudo-count is required for calculating these probabilities. The definition of the information content (IC) of column is as below,
Given two PFMs F1 and F2 and two PSSMs P1 and P2 of two motifs and respectively, the similarity value between two columns and from and respectively is computed by
In formula (4), the column ICs are used to strengthen the effect of the high-information columns of a motif and to weaken the influence of the low-information columns on the similarity score. Note that formula (4) indicates the likelihood of P_i(A) generating F_j(B). The denominator used to normalize the scores in the similarity function (3) is generally an upper bound of the numerator. In rare instances, the numerator in function (3) may be greater than the denominator, so the number "1" is also used to normalize the scores.
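The idea described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a uniform background of 0.25, base-2 logarithms, a simple pseudo-count scheme, and the other motif's column expressed as probabilities. The normalizer in particular is hypothetical (the sum of the two self-scores, floored at 1), chosen only to mimic the "upper bound, or 1" behaviour described above. All function names are invented for this sketch.

```python
import math

BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}  # uniform background, as assumed in the text

def column_probs(counts, pseudo=1.0):
    """Turn one PFM column (base -> count) into probabilities with a pseudo-count."""
    total = sum(counts[b] for b in BASES) + pseudo
    return {b: (counts[b] + pseudo * BACKGROUND[b]) / total for b in BASES}

def pssm_column(probs):
    """Log-odds scores log2(p_b / q_b) for one column."""
    return {b: math.log2(probs[b] / BACKGROUND[b]) for b in BASES}

def information_content(probs):
    """Relative-entropy information content of one column, in bits."""
    return sum(probs[b] * math.log2(probs[b] / BACKGROUND[b]) for b in BASES)

def directed_score(counts_a, counts_b, pseudo=1.0):
    """IC of column A times the expected PSSM score of A under column B's frequencies."""
    pa = column_probs(counts_a, pseudo)
    pb = column_probs(counts_b, pseudo)
    pssm_a = pssm_column(pa)
    return information_content(pa) * sum(pb[b] * pssm_a[b] for b in BASES)

def spic_like(counts_a, counts_b, pseudo=1.0):
    """Symmetric, normalized column similarity in the spirit of SPIC."""
    numerator = (directed_score(counts_a, counts_b, pseudo)
                 + directed_score(counts_b, counts_a, pseudo))
    # Hypothetical normalizer (an assumption of this sketch): the two
    # self-scores, never allowed to fall below 1.
    denominator = max(1.0, directed_score(counts_a, counts_a, pseudo)
                      + directed_score(counts_b, counts_b, pseudo))
    return numerator / denominator

a_rich = {"A": 10, "C": 0, "G": 0, "T": 0}
c_rich = {"A": 0, "C": 10, "G": 0, "T": 0}
flat = {"A": 5, "C": 5, "G": 5, "T": 5}
print(spic_like(a_rich, a_rich), spic_like(a_rich, c_rich), spic_like(flat, flat))
```

Under these assumptions, two uniform columns score zero because their IC is zero, whereas a correlation-based metric has no such weighting. This is the behaviour that IC weighting is meant to provide.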
Pairwise column alignment
To compute the similarity between two motifs, we first need to align their columns. We consider both local and global alignments between two motifs, defined in the same way as in pairwise sequence alignment. Let Ω(M1(X), M2(Y), G) be any alignment between two motifs M1 and M2 with gaps G, where column X of M1 is aligned with column Y of M2. The similarity score between motifs M1 and M2 with the alignment is defined as,
where is the similarity between the two aligned columns M1(X) and M2(Y) and computed by a column similarity metric, and g(G) is a gap penalty function. So the motif-motif similarity score is defined as the score of the best alignment between motifs M1 and M2, i.e.,
For a given column similarity metric, we compute the similarity score between two motifs using the Needleman-Wunsch (NW) global alignment algorithm or the Smith-Waterman (SW) local alignment algorithm, assuming an affine gap penalty function with the gap-extension penalty being half of the gap-opening penalty. An extended SW alignment algorithm without gaps is also evaluated. Furthermore, an empirical p-value is assigned to the similarity score to measure the likelihood between two aligned motifs.
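To make the alignment step concrete, the sketch below scores the best local alignment of two motifs by summing the column similarities of aligned column pairs, subtracting an affine gap penalty, and maximizing over alignments. It uses the standard three-matrix (Gotoh-style) recurrence for Smith-Waterman with affine gaps. Several things are assumptions of this sketch: the function name, the convention that a gap of length k costs gap_open + (k - 1) * gap_extend, and the callback `sim(i, j)` standing in for any column similarity metric.

```python
def local_affine_score(n, m, sim, gap_open=1.0, gap_extend=0.5):
    """Best local alignment score between motifs with n and m columns.

    sim(i, j) returns the similarity of column i of motif 1 and column j
    of motif 2 (0-based). A gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    # H: best score of an alignment ending at (i, j), or 0 for an empty one.
    # E: best score ending with column j of motif 2 aligned to a gap.
    # F: best score ending with column i of motif 1 aligned to a gap.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sim(i - 1, j - 1),
                          E[i][j],
                          F[i][j])
            best = max(best, H[i][j])
    return best

# Toy usage: letters stand in for motif columns, +1 for a match, -1 otherwise.
m1, m2 = "ACGTACGT", "ACGTTACGT"
score = local_affine_score(len(m1), len(m2),
                           lambda i, j: 1.0 if m1[i] == m2[j] else -1.0)
print(score)
```

Local alignment only behaves sensibly when unrelated columns receive negative scores. Otherwise every extension of the alignment increases the total.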
Datasets of motifs
In this study, three datasets of motifs verified by experiments are employed for testing and evaluation. Dataset-1, first chosen from JASPAR by Mahony et al., is composed of 96 true motifs that belong to 13 known TF structural classes. Among these motifs, 25 belong to the Zinc-Finger (ZF) family. Dataset-2, created by Xu and Su to test the ability of the KFV metric to identify redundant PFMs, is composed of 124 JASPAR core motifs and three sub-motifs for each core motif, each generated by randomly selecting two-thirds of its sequences. Dataset-3, available at http://gleclubs.uncc.edu/pbs, contains about 10^5 putative motifs that were predicted in our earlier work [12, 13] from more than two thousand sets of genome-wide orthologous intergenic sequences in E. coli K12 and 54 other reference genomes of gamma-proteobacteria. According to the RegulonDB database (version 6), these predicted motifs cover 1,411 known TFBSs of 122 true motifs (or TFs) in E. coli K12. More details of the three datasets are summarized in Table 2.
Implementation of metrics
The seven metrics (PCC, AKL, ALLR, pCS, SSD, AC, and KFV) listed in Table 1 were compared with SPIC for their ability to cluster relevant true motifs, filter out spurious motifs, or recover motifs from a database. We used the STAMP platform (http://www.benoslab.pitt.edu/stamp/) to compute the scores of the first five, alignment-dependent metrics, the Mosta package included in SABINE (http://www.ra.cs.uni-tuebingen.de/software/SABINE/downloads/index.htm) to compute the AC scores, and the KFV web server (http://bioinfo.uncc.edu/kfv/) to compute the KFV scores.
To inspect the ability of these metrics to recognize the motifs of the same TFs in Dataset-1 and Dataset-2, ROC (Receiver Operating Characteristic) curves were plotted. In database searches, we define the "performance accuracy" as the percentage of motifs correctly recovered using the best-hit method. The ROC curves were drawn according to the following rule. Given a dataset consisting of n motifs whose TF structural classes are known, we list all n(n+1)/2 pairs of motifs and compute the similarity score of each pair using SPIC and the other metrics. We call two motifs a mismatch if the similarity score between them is less than a threshold, and a match otherwise. We call a match a true positive (TP) if the two motifs belong to the same FBP, and a mismatch a true negative (TN) if the two motifs belong to different FBPs. The ROC curve plots the TP rate against the FP rate under different motif similarity thresholds.
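The thresholding rule above translates directly into code. The sketch below is illustrative only: it takes one similarity score per motif pair together with a flag saying whether the pair belongs to the same FBP, sweeps every observed score as a threshold, and returns the resulting (FP rate, TP rate) points. The function name and the input layout are assumptions made for this example.

```python
def roc_points(scores, same_group):
    """Return (FP rate, TP rate) points, one per distinct score used as threshold.

    scores[k] is the similarity of motif pair k; same_group[k] is True when
    the two motifs of pair k belong to the same group. A pair is called a
    match when its score is greater than or equal to the threshold.
    Assumes at least one same-group pair and one different-group pair.
    """
    positives = sum(1 for y in same_group if y)
    negatives = len(same_group) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, same_group) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, same_group) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

# Four motif pairs: two from the same family, two from different families.
print(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))
```

A metric that ranks every same-group pair above every different-group pair yields a curve that rises to a TP rate of 1 while the FP rate is still 0.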
Results and discussions
Given the profile of a motif whose cognate TF is unknown, one frequently used application is to search for the motif in a database. A column similarity metric associated with an alignment algorithm, or an alignment-free similarity metric, is employed to compare the query motif with each motif in the database. The database motifs whose similarity scores exceed a threshold are "hit" by the query motif. However, the motifs of TFs that belong to the same TF family or are closely related evolutionarily show some degree of similarity, while the binding sites within a motif are sometimes highly degenerate. So it is often difficult to distinguish similar motifs and to identify the required motifs precisely in a database. The SSD, PCC and KFV metrics are chosen for comparison with SPIC for their ability to retrieve motifs of the same TF family in Dataset-1, because SSD, PCC and KFV were shown to perform better than the other three column similarity metrics combined with an optimal alignment and than the alignment-free AC score. As described in Xu and Su, the accuracy of a metric is calculated as the percentage of motifs in a dataset whose TF families are the "best hit" by the metric.
As evaluated by Mahony et al., the PCC metric combined with the SW ungapped alignment algorithm (PCC/SWU), and the SSD metric combined with SW alignment (SSD/SW) with gap extension equal to 0.5 and gap open equal to 1, are the two best metric and alignment settings on Dataset-1 among the five column similarity metrics with all their possible alignment settings. According to Xu and Su, KFV gives its best results when 4-mers and the cosine angle are used for vector construction and comparison. Here we also tested SPIC with the NW and SW alignment algorithms over a wide range of gap open penalties (gap extension is always set to half the gap open). The top seven performing alignment strategies of SPIC and the optimal strategies of PCC/SWU, SSD/SW and KFV are listed in Table 3. Among these strategies, the combination of the SPIC metric and the Smith-Waterman algorithm (SPIC/SW) with gap open equal to 1 achieves the highest accuracy on Dataset-1. The results in Table 3 show that SPIC offers more high-performing strategies than the other metrics.
To further compare our best strategy, SPIC/SW (gap open = 1), with PCC/SWU, which has the best performance in STAMP, and with the optimal strategy of KFV (4-mer, cosine angle) for recovering motifs from a dataset, we performed ROC analysis of the three strategies on Dataset-1 and Dataset-2. As shown in Figure 1, SPIC/SW (gap open = 1) performs better than both PCC/SWU in STAMP and KFV (4-mer, cosine angle) for motif recovery on Dataset-1 and Dataset-2.
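The best-hit accuracy described above can be sketched as follows. This is an illustration under assumptions rather than the evaluation code used in the study: each motif in turn is treated as the query, the highest-scoring other motif is taken as its best hit, and the query counts as correct when the best hit comes from the same TF family. The function name and the `similarity(i, j)` callback are hypothetical.

```python
def best_hit_accuracy(families, similarity):
    """Fraction of motifs whose best-scoring other motif is in the same family.

    families: list of family labels, one per motif (at least two motifs).
    similarity(i, j): similarity score between motif i and motif j.
    """
    correct = 0
    for i in range(len(families)):
        others = [j for j in range(len(families)) if j != i]
        best = max(others, key=lambda j: similarity(i, j))
        if families[best] == families[i]:
            correct += 1
    return correct / len(families)

# Toy example: a symmetric score table for four motifs from two families.
families = ["ZF", "ZF", "bHLH", "bHLH"]
table = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(best_hit_accuracy(families, lambda i, j: table[i][j]))
```

Ties are broken by list order in this sketch. A real evaluation would need an explicit tie rule.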
Separation of true motifs from spurious motifs
In some algorithms for genome-wide prediction of transcription factor binding sites based on phylogenetic footprinting, such as GLECLUBS [12, 13] and PhyloNet, the sub-motifs and redundant motifs of any TF need to be merged into a unique motif, while spurious motifs need to be discarded [12, 13, 16]. To this end, we need a metric that not only measures pairwise motif similarity precisely, but also differentiates irrelevant motifs effectively. More specifically, the desired metric should assign a sufficiently high similarity score to two sub-motifs of the same TF motif, and a sufficiently low similarity score to two motifs without any evolutionary relationship, so that true motifs can be separated from spurious ones. Dataset-3, generated by GLECLUBS [12, 13], is composed of a very large number of spurious motifs and a tiny fraction of true motifs. To discover true motifs in Dataset-3, we evaluate SPIC and the other seven metrics for their ability to cluster the sub-motifs of each TF into a motif and to separate true motifs from spurious ones.
For this purpose, we need a group of true motifs for evaluation on Dataset-3. The 122 TF motifs of E. coli K12 in RegulonDB are used to generate a large number of sub-motifs. For each TF motif consisting of n BSs (), we randomly split it into a sub-motif of size k and a sub-motif of size for each . So pairs of sub-motifs can be generated for a motif of size n. For each sub-motif of size k, we repeat the foregoing split procedure on each sub-motif to generate pairs of sub-sub-motifs (also called sub-motifs afterwards). The procedure terminates when the size of each sub-motif is 1. We then employ these metrics with their best strategies to calculate the similarity scores between each pair of sub-motifs [11, 25] as well as the scores between each pair of motifs in Dataset-3. In Figure 2, the curves labeled "all pairs" are the distributions of the similarity scores between each pair of motifs in Dataset-3 after score normalization, and the curves labeled "known inner" are the distributions of the normalized similarity scores between each pair of true sub-motifs. Because each pair of true sub-motifs is related whereas most of the motifs in Dataset-3 are unrelated to one another, a well-performing metric should clearly separate the curve labeled "all pairs" from the curve labeled "known inner". Comparing the two curves generated by SPIC with those generated by the other metrics in the charts of Figure 2, we find that the two areas under SPIC's distribution curves have the smallest overlap. Specifically, the last chart of Figure 2 collects the overlapping rates, which demonstrate that SPIC has the highest performance among these metrics in recovering true motifs and separating them from spurious ones.
Because many applications involve a motif comparison procedure, we proposed a novel similarity metric, SPIC, based on column information contents. When used jointly with the SW alignment algorithm, it achieves better performance than the best strategies of the existing metrics in recovering motifs from a database, grouping relevant motifs, merging sub-motifs or redundant motifs, and separating true motifs from spurious ones.
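One simple way to quantify how well two score distributions are separated is the shared area of their normalized histograms. The sketch below is an assumption-laden illustration, not the procedure used for Figure 2: the way the overlapping rate was actually computed is not specified here. It further assumes scores already normalized to the range [0, 1], and the function name and bin count are hypothetical.

```python
def overlap_rate(scores_a, scores_b, bins=20):
    """Shared area of two normalized histograms over [0, 1].

    Returns 0 when the two score sets fall in disjoint bins and 1 when
    their histograms are identical. Scores outside [0, 1] are clamped
    into the first or last bin.
    """
    def histogram(scores):
        counts = [0] * bins
        for s in scores:
            index = min(max(int(s * bins), 0), bins - 1)
            counts[index] += 1
        return [c / len(scores) for c in counts]

    hist_a, hist_b = histogram(scores_a), histogram(scores_b)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))

all_pairs = [0.05, 0.10, 0.12, 0.20, 0.55]    # mostly unrelated motif pairs
known_inner = [0.55, 0.70, 0.80, 0.85, 0.90]  # pairs of true sub-motifs
print(overlap_rate(all_pairs, known_inner))
```

A smaller value means the metric pushes related and unrelated pairs further apart, which is the property compared across metrics in Figure 2.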
Levine M, Tjian R: Transcription regulation and animal diversity. Nature. 2003, 424 (6945): 147-151. 10.1038/nature01763.
Lagha M, Bothma JP, Levine M: Mechanisms of transcriptional precision in animal development. Trends Genet. 2012.
Kadonaga JT: Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell. 2004, 116 (2): 247-257. 10.1016/S0092-8674(03)01078-X.
Davidson EH: The Regulatory Genome: Gene Regulatory Networks In Development And Evolution. 2006, Academic Press
Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, et al: Unlocking the secrets of the genome. Nature. 2009, 459 (7249): 927-930. 10.1038/459927a.
Rister J, Desplan C: Deciphering the genome's regulatory code: the many languages of DNA. Bioessays. 2010, 32 (5): 381-384. 10.1002/bies.200900197.
Reed JL, Famili I, Thiele I, Palsson BO: Towards multidimensional genome annotation. Nat Rev Genet. 2006, 7 (2): 130-141. 10.1038/nrg1769.
Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB: Annotating non-coding regions of the genome. Nat Rev Genet. 2010, 11 (8): 559-571. 10.1038/nrg2814.
GuhaThakurta D: Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006, 34 (12): 3585-3598. 10.1093/nar/gkl372.
Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16 (1): 16-23. 10.1093/bioinformatics/16.1.16.
Mahony S, Auron PE, Benos PV: DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol. 2007, 3 (3): e61-10.1371/journal.pcbi.0030061.
Zhang S, Xu M, Li S, Su Z: Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes. Nucleic Acids Res. 2009, 37 (10): e72-10.1093/nar/gkp248.
Zhang S, Li S, Pham PT, Su Z: Simultaneous prediction of transcription factor binding sites in a group of prokaryotic genomes. BMC Bioinformatics. 2010, 11: 397-10.1186/1471-2105-11-397.
Tan K, McCue LA, Stormo GD: Making connections between novel transcription factors and their DNA motifs. Genome Res. 2005, 15 (2): 312-320. 10.1101/gr.3069205.
Sandelin A, Wasserman WW: Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. 2004, 338 (2): 207-215. 10.1016/j.jmb.2004.02.048.
Wang T, Stormo GD: Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. Proc Natl Acad Sci USA. 2005, 102 (48): 17400-17405. 10.1073/pnas.0505147102.
Schones DE, Sumazin P, Zhang MQ: Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics. 2005, 21 (3): 307-313. 10.1093/bioinformatics/bth480.
Wang T, Stormo GD: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003, 19 (18): 2369-2380. 10.1093/bioinformatics/btg329.
Kullback S, Leibler RA: On Information and Sufficiency. Ann Math Statist. 1951, 22 (1): 79-86. 10.1214/aoms/1177729694.
Pietrokovski S: Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 1996, 24 (19): 3836-3845. 10.1093/nar/24.19.3836.
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5.
Mahony S, Benos PV: STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007, 35 (Web Server): W253-258. 10.1093/nar/gkm272.
Pape UJ, Rahmann S, Vingron M: Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics. 2008, 24 (3): 350-357. 10.1093/bioinformatics/btm610.
Xu M, Su Z: A novel alignment-free method for comparing transcription factor binding site motifs. PLoS One. 2010, 5 (1): e8797-10.1371/journal.pone.0008797.
Mahony S, Benos PV: STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007, 35 (Web Server): W253-258. 10.1093/nar/gkm272.
Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, Garcia-Sotelo JS, Lopez-Fuentes A, et al: RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res. 2011, 39 (Database): D98-105. 10.1093/nar/gkq1110.
Zhang S, Jiang L, Du C, Su Z: A novel information contents based similarity metric for comparing TFBS motifs. 2012 IEEE 6th International Conference on Systems Biology (ISB): 18-22 Aug. 2012; Xi'an: IEEE Xplore. 2012, 32-36.
A preliminary version of this paper was published in the proceedings of IEEE ISB2012. We would like to thank the reviewers for their critical comments and suggestions, which really helped us to improve the manuscript.
The publication of this article has been funded by a grant (61103073, SZ) from National Science Foundation of China, a grant (11JCYBJC26600, SZ) from Natural Science Funds of Tianjin, a grant from Doctoral Funds of Tianjin Normal University (52X09013, LJ), and two grants (EF0849615 and CCF1048261, ZS) from National Science Foundation of USA.
This article has been published as part of BMC Systems Biology Volume 7 Supplement 2, 2013: Selected articles from The 6th International Conference of Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S2.
The authors declare that they have no competing interests.
SZ and ZS conceived the project. SZ and CD designed the metric. SZ and XZ implemented and conducted the experiments. ZS and SZ wrote the paper.