mAPC-GibbsOS: an integrated approach for robust identification of gene regulatory networks
© Shi et al.; licensee BioMed Central Ltd. 2013
Published: 9 December 2013
Identification of cooperative gene regulatory networks is an important topic in biological research, especially in cancer research. Traditional approaches suffer from substantial noise in gene expression data and false positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory networks. Methods that can reveal the underlying modularized structure while remaining robust to noise and false positives are therefore needed.
We proposed and developed an integrated approach to identify gene regulatory networks, which consists of a novel clustering method, motif-guided affinity propagation clustering (mAPC), and a sampling-based method, the Gibbs sampler based on outlier sum statistic (GibbsOS). mAPC is used in the first step to obtain co-regulated gene modules by clustering genes with a similarity measure that takes into account both gene expression data and binding motif information. This clustering reduces the effect of noise in microarray data and yields modularized gene clusters. However, because motif binding data contain many false positives, some genes not regulated by certain transcription factors (TFs) will be falsely clustered with true target genes. To overcome this problem, GibbsOS is applied in the second step to refine each cluster and identify the true target genes. To evaluate the performance of the proposed method, we generated simulation data under different signal-to-noise ratios and false positive ratios. The experimental results show improved accuracy in clustering and transcription factor identification. Moreover, target gene identification is improved compared with GibbsOS alone. Finally, we applied the proposed method to two breast cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of breast cancer, as supported by their functional annotations.
We have developed a two-step approach for gene regulatory network identification, featuring an integrated method to identify modularized regulatory structures and refine their target genes subsequently. Simulation studies have shown the robustness of the method against noise in gene expression data and false positives in motif binding data. The proposed method has been applied to two breast cancer gene expression datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the method in identifying key regulatory networks related to the progression and recurrence of breast cancer.
Living cells must be able to respond correctly to internal and external stimuli by adjusting gene expression levels. Transcription factors (TFs) cooperatively regulate genes, forming gene regulatory networks that play a crucial role in the gene regulation process. Recently, biological researchers have shown that some diseases such as cancer are closely related to the breakdown of regulatory networks, and many oncogenes (i.e., genes closely related to cancer) have been shown to be enriched in this regulation mechanism. Identification of transcriptional gene regulatory networks has thus become a promising direction in biology and bioinformatics. Several statistical methods, such as principal component analysis (PCA) and independent component analysis (ICA), have been developed to discover the underlying regulation mechanism. However, the strong assumption of independent or uncorrelated components is not easily satisfied in many real biological applications. Because genes tend to act cooperatively, identifying co-expressed gene modules is an intuitive way to reconstruct regulatory networks. Clustering-based methods such as fuzzy c-means clustering have therefore been developed to discover co-expressed gene modules. However, co-expressed gene modules differ from the co-regulated gene modules in which we are interested. Co-regulated genes are regulated by some common TFs and tend to have similar gene expression patterns. In contrast, co-expressed genes are not necessarily regulated by common TFs. Moreover, these methods fail to incorporate the motif binding information obtained by matching upstream DNA sequences to TFs using whole genome sequencing techniques.
Dynamic Bayesian networks are one family of integrative methods; they take the motif binding information as prior knowledge and learn the network from gene expression data. However, the method has difficulty analyzing data with a large pool of candidate TFs, which limits its application to real biological studies. Network component analysis (NCA) and several NCA-based methods such as FastNCA are among the successful integrative methods; they are specifically developed to interpret a gene regulatory network as a bipartite network. Under some reasonable assumptions referred to as the NCA criteria, NCA decomposes gene expression data to estimate the TF activities and then infers the regulation strengths. Nevertheless, motif binding data are often contaminated with many false positive connections, and NCA is very sensitive to them. To address the problem of false positive connections, Gu et al. developed a regression-based Gibbs sampling method (namely GibbsOS) to discover true target genes from an initial gene pool. GibbsOS employs the same model as NCA and summarizes regression t-test statistics into an outlier sum statistic; then, with the help of a Gibbs sampling strategy, it identifies true target genes from the gene pool. However, it does not take the modularized regulatory structure into consideration; GibbsOS therefore performs poorly when a large number of TF candidates are investigated, which significantly limits its application to real biological studies.
The limitations of current methods can be summarized as follows: (i) sensitivity to contamination (e.g., noise and false positives) in genomic data, (ii) failure to identify the modularized structure and (iii) inability to handle a large number of candidate TFs. In this paper, we aim to tackle these limitations by proposing a novel method that combines a clustering method with GibbsOS to discover the hidden regulation mechanism; the clustering method is called motif-guided affinity propagation clustering (mAPC), a modified version of affinity propagation clustering (APC). To evaluate the performance, we generate synthetic data under different signal-to-noise ratios (SNRs) and numbers of false positive connections, and use them to show that our method improves regulatory network identification. In addition, two breast cancer patient datasets are used to demonstrate the feasibility of the proposed method for real biological studies. Experimental results show that the proposed method is able to identify active TFs and their target genes and hence to reconstruct the underlying regulatory network.
Results and discussion
Motif-guided affinity propagation clustering and Gibbs sampler based on outlier sum statistics
In the second step, we apply GibbsOS to each cluster to remove false positive connections for target gene identification. For convenience of explanation, we define true target genes as "foreground" genes and genes not regulated by TFs as "background" genes; GibbsOS can then be seen as identifying foreground genes from the entire gene pool. A detailed description of the method, including the mathematical details, is given in the Methods section.
The simulation data are synthesized with MATLAB functions and comprise 300 genes (100 foreground genes and 200 background genes), 80 TFs and 20 experiments (or samples). The motif binding data are generated with modularized structures for both foreground genes and background genes, and the TF activities are generated as Gaussian random variables with mean 0 and variance 1. The foreground gene expression data are then synthesized as a linear combination of motif binding data and TF activities using the log-linear model of Liao et al. For the background genes, the gene expression data are generated as Gaussian random variables (with mean 0 and variance 1), and the amplitude is rescaled to ensure equal variance between foreground and background gene expression patterns. To perturb the data, noise is randomly added to the gene expression data at a given signal-to-noise ratio (SNR). The level of false positives (FPs) added to the motif binding data is measured by the FP ratio, defined as the number of false positive connections divided by the number of true positive connections within the foreground genes. To test the performance of the proposed method against noise in gene expression data and false positives in motif binding data, we first fix the SNR at 5 dB and test the performance of mAPC clustering and TF identification under three FP ratios (0.5, 1.0 and 1.5). We then fix the FP ratio at 1.0 and generate simulation data under three SNR levels (0 dB, 5 dB and 10 dB) to assess the effect of noise on the performance of mAPC-GibbsOS.
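The data generation described above can be sketched in Python as follows. This is a minimal illustration and not the authors' MATLAB code: the function name `simulate`, the module layout (`genes_per_module`, `tfs_per_module`), the range of regulation strengths, and the definition of SNR as 10·log10(signal variance / noise variance) are all assumptions made for the example. The injection of false positive connections among foreground genes is omitted for brevity.

```python
import math
import random
from statistics import pvariance

def simulate(n_fg=100, n_bg=200, n_tf=80, n_samples=20, snr_db=5.0,
             genes_per_module=10, tfs_per_module=4, seed=0):
    """Toy generator: foreground expression = A @ S, background = pure noise."""
    rng = random.Random(seed)
    n_modules = max(1, n_fg // genes_per_module)

    def module_tfs(gene_index):
        # Hypothetical modular layout: consecutive genes share a block of TFs.
        module = (gene_index // genes_per_module) % n_modules
        return [(module * tfs_per_module + k) % n_tf for k in range(tfs_per_module)]

    # TF activities S (n_tf x n_samples), i.i.d. Gaussian with mean 0, variance 1.
    S = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_tf)]

    # Binary binding matrix B for all genes; strengths A only for foreground genes.
    B = [[0] * n_tf for _ in range(n_fg + n_bg)]
    A = [[0.0] * n_tf for _ in range(n_fg)]
    for g in range(n_fg + n_bg):
        for m in module_tfs(g):
            B[g][m] = 1
            if g < n_fg:
                A[g][m] = rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 1.5)

    # Foreground expression as a linear combination of TF activities.
    fg = [[sum(A[g][m] * S[m][k] for m in range(n_tf)) for k in range(n_samples)]
          for g in range(n_fg)]
    signal_var = pvariance([v for row in fg for v in row])

    # Background expression: Gaussian, scaled to the foreground variance.
    bg = [[rng.gauss(0.0, math.sqrt(signal_var)) for _ in range(n_samples)]
          for _ in range(n_bg)]

    # Additive noise at the requested SNR (assumed definition, in dB).
    noise_sd = math.sqrt(signal_var / 10 ** (snr_db / 10))
    E = [[v + rng.gauss(0.0, noise_sd) for v in row] for row in fg + bg]
    return E, A, S, B

E, A, S, B = simulate(snr_db=5.0)
print(len(E), len(E[0]), len(S))
```

In this sketch every binding entry of a background gene is a false connection with respect to its expression, which mimics the situation that the second step of the method is meant to clean up.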
Adjusted Rand index values for clustering evaluation.
AUC values for mAPC-based TF identification.
AUC values of target gene identification of mAPC-GibbsOS vs. GibbsOS.
AUC values of target gene identification under different sample sizes.
Breast cancer microarray data
Our method is further tested on two estrogen receptor (ER) related breast cancer patient datasets, described in Symmans et al. and Loi et al., to identify gene regulatory networks. The patient samples in the two datasets are divided into an 'early recurrence' group (< 3 years) and a 'late recurrence' group (> 6 years) according to survival time. The Symmans et al. dataset consists of 21 samples in the 'early recurrence' group and 41 samples in the 'late recurrence' group, and the Loi et al. dataset has 49 samples in the 'early recurrence' group and 76 samples in the 'late recurrence' group. An initial gene set is selected by a t-test on gene expression data between the 'early recurrence' and 'late recurrence' groups with a p-value threshold of 0.05. In this study, we analyze the up-regulated genes (over-expressed in the 'early recurrence' group) and the down-regulated genes (over-expressed in the 'late recurrence' group) separately. For the Symmans et al. data, a total of 615 up-regulated genes and 344 down-regulated genes are selected, while 668 up-regulated genes and 559 down-regulated genes are selected for the Loi et al. data. Motifs are selected from ER-related signaling pathways and binding sites, which are believed to have strong connections with cancer progression. Finally, 88 and 84 motifs are chosen for the Symmans et al. data and the Loi et al. data, respectively.
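The gene selection step can be illustrated with a small Python sketch. The text does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, as are the function names `welch_t` and `split_genes`. To stay within the standard library, the sketch thresholds the t statistic against a critical value `t_crit` supplied by the caller (looked up for the desired p-value and degrees of freedom) instead of computing p-values directly.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    if va + vb == 0:
        return 0.0, float(len(a) + len(b) - 2)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def split_genes(expr, early_idx, late_idx, t_crit):
    """Return (up, down): genes over-expressed in the early or late group."""
    up, down = [], []
    for gene, values in expr.items():
        early = [values[i] for i in early_idx]
        late = [values[i] for i in late_idx]
        t, _ = welch_t(early, late)
        if t > t_crit:
            up.append(gene)
        elif t < -t_crit:
            down.append(gene)
    return up, down

# Toy data: three samples per group; the gene names are made up.
expr = {
    "g_up":   [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "g_down": [1.0, 1.2, 0.8, 5.1, 4.9, 5.0],
    "g_flat": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
}
up, down = split_genes(expr, early_idx=[0, 1, 2], late_idx=[3, 4, 5], t_crit=2.78)
print(up, down)
```

The value 2.78 is roughly the two-sided 5% critical value for 4 degrees of freedom and only suits this toy example; real group sizes such as 21 versus 41 samples call for a different critical value or an exact p-value computation.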
In this paper, we have proposed a new method consisting of a clustering method (i.e., mAPC) and a sampling based method (i.e., GibbsOS) to tackle the problem of regulatory network identification. mAPC is different from traditional clustering methods in terms of constructing co-regulated gene modules by utilizing both microarray gene expression data and motif binding information. Following mAPC, GibbsOS is applied to refine the module for target gene identification to solve the issue of false positive connections in motif binding data.
The proposed method is tested on simulation data with different SNRs and FP ratios. Significant improvements are observed in both gene module identification and target gene identification. To further test the method in real biomedical applications, two breast cancer patient datasets are used to identify regulatory networks related to the recurrence of breast cancer. As a result, a key set of regulatory networks has been reconstructed with active transcription factors and their target genes. Importantly, these regulatory networks are functionally enriched in the progression and recurrence of breast cancer, warranting further investigation of their functional roles by biological experiments.
The model expresses the N × K matrix of measured gene expression data as the product of the N × M regulation strength matrix and the M × K matrix of TF activities, plus a term representing the inevitable experimental noise; here N is the number of genes, K is the number of experiments and M is the number of TFs. This model interprets the regulatory mechanism as a bipartite network, in which the expression of a gene is considered a direct result of the TF activities weighted by the associated regulation strengths. Based on this model, we can divide the whole gene set into two distinct categories: (1) "foreground" genes that are truly regulated by TFs and (2) "background" genes that are not related to TFs. Only the foreground genes satisfy this relationship between gene expression and TF activity; it is therefore necessary to identify the modularized structure on the foreground gene set rather than on the whole gene set.
Motif-guided affinity propagation clustering (mAPC)
In this similarity measure, a trade-off parameter between 0 and 1 adjusts the contribution of gene expression data relative to that of motif binding information. If the parameter is 1, the clusters generated by mAPC depend entirely on motif binding information. Conversely, if it is 0, mAPC reduces to classical APC applied to gene expression data alone. Because gene expression data are noisy, the motif binding term lowers the effect of noise through positive support from binding information. On the other hand, false connections in the motif binding matrix are penalized by a large negative gene expression similarity, because genes that are not co-regulated do not share similar gene expression patterns. In general, this type of balanced cost function provides a better representation of gene modules in terms of both co-regulation and co-expression.
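The weighting described above can be sketched as a convex combination of two pairwise similarities. The exact similarity functions of mAPC are not given in this section, so the choices below are assumptions for illustration only: Pearson correlation for expression similarity, Jaccard overlap of bound TFs for motif similarity, and the parameter name `lam` for the trade-off parameter.

```python
import math

def pearson(x, y):
    """Pearson correlation of two expression profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def jaccard(u, v):
    """Fraction of TFs bound to either gene that are bound to both."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

def combined_similarity(expr, binding, lam):
    """(1 - lam) * expression similarity + lam * motif similarity, for i != j."""
    n = len(expr)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = ((1 - lam) * pearson(expr[i], expr[j])
                             + lam * jaccard(binding[i], binding[j]))
    # The diagonal is left at 0 here; affinity propagation treats diagonal
    # entries as "preferences" that are set separately.
    return sim

expr = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
binding = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(combined_similarity(expr, binding, lam=0.5)[0])
```

With `lam=0` the matrix depends on expression alone and with `lam=1` on binding alone, matching the two limiting cases in the text. The resulting matrix would then be passed to an affinity propagation routine.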
Transcription factor identification
Based on the calculated p-value, we can determine the enrichment of TFs in different clusters with a pre-defined threshold (which can be adjusted according to various cases).
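As an illustration of such an enrichment p-value, the sketch below uses a one-sided hypergeometric test, a common choice for asking whether a TF's binding targets are over-represented in a cluster. Whether the paper uses exactly this test is not stated in this section, so the test itself, the function name `enrichment_pvalue` and its argument names are assumptions.

```python
from math import comb

def enrichment_pvalue(n_total, n_bound, cluster_size, n_bound_in_cluster):
    """P(X >= n_bound_in_cluster) when cluster_size genes are drawn without
    replacement from n_total genes, of which n_bound carry the TF's motif."""
    upper = min(n_bound, cluster_size)
    denom = comb(n_total, cluster_size)
    tail = sum(comb(n_bound, k) * comb(n_total - n_bound, cluster_size - k)
               for k in range(n_bound_in_cluster, upper + 1))
    return tail / denom

# Toy numbers: 300 genes, 30 bound by the TF, a cluster of 20 with 10 bound genes.
p = enrichment_pvalue(300, 30, 20, 10)
print(p < 0.05)
```

A TF would be called enriched in a cluster when this p-value falls below the chosen threshold.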
Gibbs sampler based on outlier sum statistics (GibbsOS)
In this statistic, the normalizing term is the m-th diagonal element of a matrix computed from the regressors (in ordinary least squares, the inverse of their cross-product matrix). The test statistic t follows a Student's t-distribution whose degrees of freedom are set by the number of samples and the number of regressors, from which we obtain the corresponding p-value and decide on the hypothesis at a predefined confidence level.
Although we have described how to identify foreground genes, we do not know the ground truth behind the data. It is impossible to draw foreground genes accurately as seed genes, but we can be sure that there are multiple true foreground genes in the pool, so we can let those foreground genes support each other. This can be done iteratively, assuming that the candidate genes for TF j can be divided into two disjoint sets containing foreground genes and background genes, respectively. To start the iteration, we randomly select one gene from the candidate set of TF j, each candidate being equally likely. Together with the genes selected for the other TFs, this gives a foreground gene list. We then apply the linear regression t-test mentioned in Equation (11) to every candidate gene of TF j except the selected one, so that the number of t statistics generated for TF j is one fewer than the number of its candidate genes.
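For reference, the standard ordinary least squares form of a coefficient t-test is written out below. The notation is an assumption chosen for illustration and may differ from the paper's: $\mathbf{e}_n$ is the 1 × K expression profile of gene n, $\mathbf{S}$ is the p × K matrix of activities of the p TFs used as regressors, and $\hat{\mathbf{a}}_n$ is the 1 × p vector of estimated regulation strengths.

```latex
\hat{\mathbf{a}}_n = \mathbf{e}_n \mathbf{S}^{\top} \left(\mathbf{S}\mathbf{S}^{\top}\right)^{-1},
\qquad
\hat{\sigma}^2 = \frac{\lVert \mathbf{e}_n - \hat{\mathbf{a}}_n \mathbf{S} \rVert^2}{K - p},
\qquad
t_m = \frac{\hat{a}_{nm}}{\hat{\sigma}\sqrt{c_{mm}}},
\quad
c_{mm} = \left[\left(\mathbf{S}\mathbf{S}^{\top}\right)^{-1}\right]_{mm}
```

Under the null hypothesis that the m-th regulation strength is zero, $t_m$ follows a Student's t-distribution with K − p degrees of freedom in this standard setting.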
The form of OS statistic indicates the dependency of the decision of one TF on the choices of foreground genes for other TFs, but what we are interested in is the marginal function which is independent of the choices of other TFs.
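For readers unfamiliar with the outlier sum, the sketch below follows the general recipe of Tibshirani and Hastie: standardize the values by their median and median absolute deviation, then sum the standardized values that exceed the third quartile plus the interquartile range. Whether GibbsOS uses exactly this thresholding is not stated in this section, so the code is an assumption-laden illustration; in the present setting the inputs would be the regression t statistics of the candidate genes for one TF.

```python
from statistics import median, quantiles

def outlier_sum(values):
    """Sum of robustly standardized values above q75 + IQR (0.0 if MAD is 0)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return 0.0
    z = [(v - med) / mad for v in values]
    q1, _, q3 = quantiles(z, n=4)   # quartiles of the standardized values
    threshold = q3 + (q3 - q1)
    return sum(v for v in z if v > threshold)

# Eleven unremarkable statistics and one clear outlier.
t_stats = [-0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 6.0]
print(outlier_sum(t_stats))
```

Only the value 6.0 clears the threshold in this toy input, so the statistic equals its standardized value. A large outlier sum indicates that a few genes respond much more strongly than the bulk.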
where t denotes the index of the sampling iteration. At each iteration, we sequentially sample one candidate foreground gene for each TF. After a sufficient number of iterations, we have not only sampled the candidate genes but can also estimate their marginal distributions. In our case, we simply use the frequency with which a gene appears in the sampled sequence to approximate the empirical marginal distribution. Genes with higher frequency are then more likely to be foreground genes.
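The sampling loop and the frequency-based estimate of the marginals can be sketched generically as follows. This is not the GibbsOS implementation: the conditional distribution is replaced by a caller-supplied, non-negative `score(j, gene, current)` function (in GibbsOS it would be derived from the outlier sum statistic given the genes currently selected for the other TFs), and `gibbs_frequencies`, `burn_in` and the toy score are hypothetical names introduced for the example.

```python
import random
from collections import Counter

def gibbs_frequencies(candidates, score, n_iter=2000, burn_in=200, seed=0):
    """Sample one gene per TF in turn; return per-TF sampling frequencies."""
    rng = random.Random(seed)
    # Random initial foreground gene for every TF.
    current = {j: rng.choice(genes) for j, genes in candidates.items()}
    counts = {j: Counter() for j in candidates}
    for it in range(n_iter):
        for j, genes in candidates.items():
            # Conditional weights given the current choices for the other TFs.
            weights = [score(j, g, current) for g in genes]
            current[j] = rng.choices(genes, weights=weights, k=1)[0]
            if it >= burn_in:
                counts[j][current[j]] += 1
    kept = n_iter - burn_in
    return {j: {g: counts[j][g] / kept for g in genes}
            for j, genes in candidates.items()}

# Toy score that ignores `current`: genes "a" and "d" are strongly favored.
def toy_score(j, gene, current):
    return 9.0 if gene in ("a", "d") else 1.0

candidates = {"TF1": ["a", "b", "c"], "TF2": ["d", "e"]}
freq = gibbs_frequencies(candidates, toy_score)
print(freq["TF1"])
```

Genes whose frequency is high after burn-in are the ones the sampler visits most often, which is the empirical marginal described above.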
Adjusted Rand index for performance evaluation
Note that the ARI is at most 1, is close to 0 when the agreement between two clusterings is no better than chance, and can be negative when the agreement is worse than chance; a higher value indicates that the two clusterings are more similar. An ARI of 1 means that the two clusterings being compared are identical.
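The ARI can be computed directly from the contingency table of two label assignments using the Hubert and Arabie formula. The sketch below is a plain standard-library version; the function name `adjusted_rand_index` is chosen for the example, and the inputs must contain at least two items.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two clusterings given as equal-length label lists."""
    n = len(labels_a)
    # Pair counts within contingency cells, rows (clustering A) and columns (B).
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both clusterings put everything in one cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

The first call returns 1.0 because the index ignores label names, and the second returns a negative value, illustrating that the ARI can fall below 0.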
This work is supported by the National Institutes of Health (NIH) [CA149653, CA149147, CA164384, and NS29525-18A, in part] and by the National Institutes of Health/National Cancer Institute/Science Applications International Corporation (NIH/NCI/SAIC) [HHSN261200800001E].
The publication costs for this article were funded by the National Institutes of Health (NIH) [CA149653].
This article has been published as part of BMC Systems Biology Volume 7 Supplement 5, 2013: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S5.
- Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM: A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics. 2009, 10: 252-263. 10.1038/nrg2538.
- Gong T, Xuan J, Chen L, Riggins RB, Li H, Hoffman EP, Clarke R, Wang Y: Motif-guided sparse decomposition of gene expression data for regulatory module identification. BMC Bioinformatics. 2011, 12: 82. 10.1186/1471-2105-12-82.
- Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics. 2001, 17: 763-774. 10.1093/bioinformatics/17.9.763.
- Lee S, Batzoglou S: Application of independent component analysis to microarrays. Genome Biology. 2003, 4: R76. 10.1186/gb-2003-4-11-r76.
- Dembele D, Kastner P: Fuzzy c-means method for clustering microarray data. Bioinformatics. 2003, 19: 973-980. 10.1093/bioinformatics/btg119.
- Latchman DS: Transcription factors as potential targets for therapeutic drugs. Current Pharmaceutical Biotechnology. 2000, 1: 57. 10.2174/1389201003379022.
- Sabatti C, James GM: Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics. 2006, 22: 739-746. 10.1093/bioinformatics/btk017.
- Liao JC, Boscolo R, Yang Y-L, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100: 15522-15527. 10.1073/pnas.2136632100.
- Chang C, Ding Z, Hung YS, Fung PCW: Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data. Bioinformatics. 2008, 24: 1349-1358. 10.1093/bioinformatics/btn131.
- Gu J, Xuan J, Riggins RB, Chen L, Wang Y, Clarke R: Robust identification of transcriptional regulatory networks using a Gibbs sampler on outlier sum statistic. Bioinformatics. 2012, 28: 1990-1997. 10.1093/bioinformatics/bts296.
- Tibshirani R, Hastie T: Outlier sums for differential gene expression analysis. Biostatistics. 2006, 8: 2-8.
- Casella G, George EI: Explaining the Gibbs sampler. American Statistician. 1992, 46: 167-174.
- Frey BJ, Dueck D: Clustering by passing messages between data points. Science. 2007, 315: 972-976. 10.1126/science.1136800.
- Hubert L, Arabie P: Comparing partitions. Journal of Classification. 1985, 2: 193-218. 10.1007/BF01908075.
- Symmans WF, Hatzis C, Sotiriou C, Andre F, Peintinger F, Regitnig P, Daxenbichler G, Desmedt C, Domont J, Marth C, Delaloge S, Bauernhofer T, Valero V, Booser DJ, Hortobagyi GN, Pusztai L: Genomic index of sensitivity to endocrine therapy for breast cancer. J Clin Oncol. 2010, 28: 4111-9. 10.1200/JCO.2010.28.4273.
- Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, Daidone MG, Pierotti MA, Berns EM, Jansen MP, Foekens JA, Delorenzi M, Bontempi G, Piccart MJ, Sotiriou C: Predicting prognosis using molecular profiling in estrogen receptor positive breast cancer treated with tamoxifen. BMC Genomics. 2008, 9: 239. 10.1186/1471-2164-9-239.
- Bjornstrom L, Sjoberg M: Mechanisms of estrogen receptor signaling: convergence of genomic and nongenomic actions on target genes. Molecular Endocrinology. 2005, 19: 833-842. 10.1210/me.2004-0486.
- Mishra GR, Suresh M, et al: Human protein reference database--2006 update. Nucleic Acids Res. 2006, 34: D411-414. 10.1093/nar/gkj141.
- Orton RJ, Sturm OE, Vyshemirsky V, Calder M, Gilbert DR, Kolch W: Computational modelling of the receptor-tyrosine-kinase-activated MAPK pathway. Biochemical Journal. 2005, 392: 249-261. 10.1042/BJ20050908.
- Daschner PJ, Ciolino HP, Plouzek CA, Yeh GC: Increased AP-1 activity in drug resistant human breast cancer MCF-7 cells. Breast Cancer Research and Treatment. 1999, 53: 229-240. 10.1023/A:1006138803392.
- Xiao X, Li BX, Mitton B, Ikeda A, Sakamoto KM: Targeting CREB for cancer therapy: friend or foe. Current Cancer Drug Targets. 2010, 10: 384-391. 10.2174/156800910791208535.
- Sankpal NV, Moskaluk CA, Hampton GM, Powell SM: Overexpression of CEBPβ correlates with decreased TFF1 in gastric cancer. Oncogene. 2006, 25: 643-649.
- Boudny V, Kovarik J: JAK/STAT signaling pathways and cancer. Janus kinases/signal transducers and activators of transcription. Neoplasma. 2002, 49: 349-355.
- Kakizawa T, et al: Silencing mediator for retinoid and thyroid hormone receptors interact with octamer transcription factor-1 and acts as a transcriptional repressor. Journal of Biological Chemistry. 2001, 276: 9720-9725. 10.1074/jbc.M008531200.
- Li L, Davie JR: The role of Sp1 and Sp3 in normal and cancer cell biology. Annals of Anatomy. 2010, 192: 275-283. 10.1016/j.aanat.2010.07.010.
- Buggy Y, Maguire TM, McGreal G, McDermott E, Hill ADK, O'Higgins N, Duffy MJ: Overexpression of the Ets-1 transcription factor in human breast cancer. British Journal of Cancer. 2004, 91: 1308-1315. 10.1038/sj.bjc.6602128.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.