Volume 1 Supplement 1
Effect of microarray data heterogeneity on regulatory gene module discovery
© Mishra and Gillies; licensee BioMed Central Ltd. 2007
Published: 8 May 2007
An integrative genomics approach, in which data from different microarray experiments are merged to study regulatory networks, has been adopted in several recent studies. However, we propose that blind use of this approach can be misleading. Our hypothesis is that as microarray data from different experiments are merged, local patterns of activity, for example the cell cycle, can be masked by more global and dominant patterns such as stress reactions. We have carried out a systematic study in which data of increasing heterogeneity are clustered to determine groups of functionally related genes. These clusters are then tested for similarity to each other.
To validate our hypothesis, the primary requirement is to obtain the regulatory modules from various datasets and their mixtures, and then to measure how similar they are to each other. A decreasing trend in similarity as increasingly heterogeneous data are mixed should support our hypothesis. A number of researchers have worked on the problem of finding regulatory networks; among the most important contributions are [2, 3], which incorporate prior knowledge in the form of known transcription factors or DNA binding data to guide the clustering process. These works have shown that the resulting clusters of regulated gene modules are biologically meaningful. We used the Module Networks algorithm, a well-established approach that has had success in finding biologically relevant modules. To measure the similarity among the sets of regulated gene clusters produced by this algorithm, we chose the modified Rand Index, which has been shown to be a very stable index of partition similarity.
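As an illustration of the similarity measure, the sketch below computes the adjusted Rand index in the Hubert and Arabie form for two gene-to-module assignments. It is a minimal sketch, not the authors' code: the function names, the dictionary layout (gene identifier mapped to module label) and the choice to compare only genes present in both clusterings are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels of item i in
    each partition. Returns 1.0 for identical partitions and values
    near 0.0 for agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs of items placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)


def compare_module_assignments(modules_a, modules_b):
    """Compare two gene -> module mappings on the genes they share.

    Restricting to shared genes is an assumption of this sketch.
    """
    shared = sorted(set(modules_a) & set(modules_b))
    return adjusted_rand_index(
        [modules_a[g] for g in shared],
        [modules_b[g] for g in shared],
    )
```

Under the hypothesis above, one would expect the value returned for a single-study clustering against a progressively mixed clustering to fall as more heterogeneous data are added, and to approach the value obtained against a random dataset.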
Materials and methods
To validate our hypothesis we chose to work with two very diverse kinds of data from the Stanford Microarray Database (SMD). In one, yeast is exposed to stress conditions; the other comes from cell-cycle studies. Changes in gene expression under stress conditions are much more drastic (for both repressed and induced genes) than in cell-cycle experiments, where optimal conditions are created for growth. We started by analysing data from individual researchers for stress-related experiments, referred to in this paper as DS-STRESS1 (76 microarrays), DS-STRESS2 (49 microarrays) and DS-STRESS3 (41 microarrays). In the next stage we merged all the stress microarrays to create the dataset we call DS-STRESS. To compare these clusterings against an entirely different category, we took 93 microarray data sets from cell-cycle experiments, referred to in this article as DS-CCYCLE. A further mixture of the stress and cell-cycle data was named DS-STRESS-CCYCLE. Finally, we extracted all available yeast data (1082 microarrays, not limited to stress or cell-cycle experiments), named DS-ALL, and compared the earlier results against it. To assess the statistical significance of our results we also generated a random microarray dataset for all the genes by drawing random numbers from a Gaussian distribution with zero mean and unit standard deviation. This dataset was named DS-RANDOM.
For normalization, we assume that the average log R/G ratio on each array should be zero. We then filter the genes, retaining those whose log (base 2) R/G ratio is greater than 2 for at least one experiment. A list of 145 transcription factors (TFs) was taken as prior knowledge from the Yeastract website (http://yeastract.com/). We analysed all these data using the software package Genomica, which is provided by the authors of the Module Networks algorithm.
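The dataset construction described above can be sketched as follows. This is an illustrative sketch only: the function names and the in-memory layout (gene identifier mapped to a list of log-ratio values, one per microarray) are hypothetical. Keeping only genes present in every input dataset is an assumption, since the text does not say how genes missing from some experiments were handled, and the number of arrays in the random dataset is left as a parameter because it is not stated.

```python
import random


def merge_datasets(*datasets):
    """Merge expression datasets by concatenating their arrays gene-wise.

    Each dataset maps gene id -> list of values (one per microarray).
    Only genes present in every dataset are kept (an assumption).
    """
    if not datasets:
        return {}
    shared = set(datasets[0]).intersection(*datasets[1:])
    return {g: [v for d in datasets for v in d[g]] for g in sorted(shared)}


def make_random_dataset(gene_ids, n_arrays, seed=0):
    """Random control dataset: i.i.d. Gaussian values, mean 0, std 1."""
    rng = random.Random(seed)  # seeded so the control is reproducible
    return {g: [rng.gauss(0.0, 1.0) for _ in range(n_arrays)] for g in gene_ids}
```

For example, a merged stress dataset would be built as `merge_datasets(ds_stress1, ds_stress2, ds_stress3)`, and the stress and cell-cycle mixture by merging that result with the cell-cycle data, where the `ds_*` variables are placeholders for data loaded elsewhere.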
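The normalization and filtering steps can be sketched as below. This is a minimal sketch under stated assumptions rather than the procedure implemented in Genomica: the function names and the data layout (gene identifier mapped to a list of log2 R/G values, one per array, with no missing values) are hypothetical. The threshold is applied to the signed value by default, following the wording above; the `use_absolute` flag is an added option for also retaining strongly repressed genes.

```python
from statistics import mean


def center_arrays(data):
    """Shift each array so that its mean log2(R/G) over all genes is zero.

    data maps gene id -> list of log2(R/G) values, one per array.
    """
    if not data:
        return {}
    genes = list(data)
    n_arrays = len(data[genes[0]])
    array_means = [mean(data[g][j] for g in genes) for j in range(n_arrays)]
    return {g: [v - m for v, m in zip(data[g], array_means)] for g in genes}


def filter_genes(data, threshold=2.0, use_absolute=False):
    """Keep genes whose log2(R/G) exceeds the threshold in at least one array.

    use_absolute=True compares |log2(R/G)| instead (an assumption,
    not stated in the text).
    """
    def passes(value):
        return (abs(value) if use_absolute else value) > threshold

    return {g: vals for g, vals in data.items() if any(passes(v) for v in vals)}
```

Filtering after centering means the threshold is applied to normalized values; the order of the two steps follows the order in which they are described above.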
Comparison of individual stress versus progressively mixed datasets
Comparison of stress and cell-cycle (mixed) versus progressively mixed datasets
1. Tanay A, Steinfeld I, Kupiec M, Shamir R: Integrative analysis of genome-wide experiments in the context of a large high-throughput data compendium. Mol Syst Biol. 2005, 1: 2005.2002. 10.1038/msb4100005
2. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics. 2003, 34 (2): 166-176. 10.1038/ng1165
3. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: Computational discovery of gene modules and regulatory networks. Nature Biotechnology. 2003, 21 (11): 1337-1342. 10.1038/nbt890
4. Hubert L, Arabie P: Comparing Partitions. Journal of Classification. 1985
5. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11 (12): 4241-4257.
6. Saldanha AJ, Brauer MJ, Botstein D: Nutritional Homeostasis in Batch and Steady-State Culture of Yeast. Mol Biol Cell. 2004, 15 (9): 4089-4104. 10.1091/mbc.E04-04-0306
7. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297.