 Research article
 Open Access
 Published:
Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes
BMC Systems Biology volumeÂ 1, ArticleÂ number:Â 15 (2007)
Abstract
Background
Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology.
Results
We present a more formal approach, the modkprototypes algorithm, for clustering biological samples based on simultaneously considering microarray gene expression data and classes of known phenotypic variables such as clinical chemistry evaluations and histopathologic observations. The strategy involves constructing an objective function with the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data and simple matching for histopathology categorical values in order to measure dissimilarity of the samples. Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets. A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The approach is shown to work well with a simulated mixed data set and two real data examples containing numeric and categorical data types. One from a heart disease study and another from acetaminophen (an analgesic) exposure in rat liver that causes centrilobular necrosis.
Conclusion
The modkprototypes algorithm partitioned the simulated data into clusters with samples in their respective class group and the heart disease samples into two groups (sick and buff denoting samples having pain type representative of angina and nonangina respectively) with an accuracy of 79%. This is on par with, or better than, the assignment accuracy of the heart disease samples by several wellknown and successful clustering algorithms. Following modkprototypes clustering of the acetaminophenexposed samples, informative genes from the cluster prototypes were identified that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. The biological processes cell growth and/or maintenance, amine metabolism, and stress response were shown to discern between no and moderate levels of acetaminopheninduced centrilobular necrosis. The use of wellknown and traditional measurements directly in the clustering provides some guarantee that the resulting clusters will be meaningfully interpretable.
Background
Clustering of biological samples based on microarray gene expression data is now standard practice in clinical, biological, toxicological and pharmacological studies [1â€“4]. However, there are limitations to various clustering algorithms. For instance, the classic kmeans clustering algorithm [5, 6] uses Euclidean distance to measure dissimilarity and to cluster objects while the kmodes algorithm [7] only supports categorical or qualitative data through a simple matching objective function as a measure of dissimilarity. The inability of clustering algorithms to incorporate biological data in the grouping process can limit thorough interpretation of the data and its underlying biology.
Several approaches to incorporate biological data associated with samples into the analysis of gene expression data have been proposed recently. Shannon et al. [8] utilized Mantel statistics to correlate gene expression measurements with clinical covariates. The correlations are based on separate distance matrices computed using gene expression data and clinical covariates. Pearson correlation is used to assess main effects, whereas partial correlation coefficients are used to assess correlation between gene expression and a subset of the sample covariates conditioned on other sample covariates. Another approach introduced by Sese et al. [9] describes an itemset constrained clustering method where the optimal cluster that maximizes the interclass variance of gene expression with pathological features between groups is computed. Informative gene expression clusters annotated with disease descriptions of the liver were revealed. Kasturi and Acharya [10] proposed a modelfree clustering method called information fusion, which uses SOMs Kohonen learning to update the weights for clusters and to essentially correlate microarray gene expression patterns with repeated motifs in the upstream region of genes. A potential limitation to this approach is that the grid of the nodes for the SOM has to be defined beforehand and the results of the clustering are dependent on the geometry of the grid. The development of additional methods to simultaneously cluster samples based on microarray gene expression data with associated biological information is reasonably expected to improve the grouping of samples and to enhance the discovery of biological processes that are correlated with phenotypic endpoints.
Recent work has shown that better inference of genomic indicators for an outcome is obtained by integrating gene expression data with clinical or phenotypic data. For instance, Gevaert et al. [11] demonstrated that partial integration of clinical measurements with gene expression data through separate Bayesian networks that are joined by a single phenotype variable, improved the prediction of the prognosis of breast cancer. Others have used principal component analysis with an analysis of variance or partial least squares to associate gene expression data with clinical measurements for improved classification or prediction of an outcome [12, 13]. In addition, a novel clustering approach that incorporates epigenetic (genes monitored for hypermethylation according to a binary [0,1] status) and phenotypic data (clinical measurements encoded as ordinal categorical variables), was shown to group tumor samples sufficiently well enough for discovery of informative pathways that adhere to strict heritability in breast cancer [14]. The approach, called heritable clustering, was suggested to be a framework to integrate other biological data. However, the extension of the algorithm for the analysis of high dimensional gene expression data integrated with clinical data as continuous measurements and phenotypic data as categorical values simultaneously has not been investigated.
Since the kmeans and kmodes algorithms are efficient for processing large numeric and categorical data sets respectively, the combination of objective functions for measuring dissimilarity has been applied in the kprototypes algorithm as a practical approach to extend the kmeanslike algorithm for clustering large data sets with categorical values [7]. To test the utility of the kprototypes algorithm for clustering samples based on numeric microarray gene expression data and clinical chemistry evaluations with histopathological observations as categorical values, we introduce a modkprototypes algorithm. The approach follows the kmeans paradigm with randomization of initialization of the algorithm and is evaluated initially using two data sets. A simulated data set and a heart disease mixed type data set for proofofprinciple. The strategy involves constructing an objective function from the sum of the squared Euclidean distances for numeric data with simple matching for categorical values in order to measure dissimilarity of the samples. Separate weighting terms are used to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure in order to determine the optimal number of clusters in the mixed type data. A cluster's prototype is formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group. The cluster's prototype is taken as a representation of the feature values that depicts the phenotype of the samples in the group.
Further rigorous investigation of the modkprototypes clustering method is then pursued by applying it to gene expression data and associated phenotypic evaluations from acetaminophenexposed rat liver samples. Acetaminophen, which is an analgesic, causes centrilobular necrosis in the rat liver at high dose exposures. Using a chisquare test and GO annotations of selected genes which significantly distinguish differences between prototypes of clusters of the acetaminophen data set across all three data domains, phenotypic prototypes were obtained which were descriptive of, and anchored to, necrosis of the centrilobular region of the rat liver. This is an endpoint manifested from high dose exposures of acetaminophen in the rat liver.
Results
Clustering mixed data types
The data sets used for clustering and the components of the modkprototypes algorithm are shown in Figure 1a. The Î±, Î² and Î³ weighting terms influence how much each data domain contributes to the clustering of the samples. An objective function with the sum of the squared Euclidean distances for numeric data and simple matching for categorical values is used to measure the dissimilarity of the samples. Samples are grouped using kmeans clustering based on numeric attributes and kmodes clustering for attributes with categorical values. The DVI and CU measures comprise the DVI_CU score that measures the validity of the clustering. The modkprototypes algorithm is shown in Figure 1b and is a modification of the original kprototypes algorithm [15]. For k = 2 to N number of samples and for B iterations, assignment of each sample is made to one of the k clusters based on the minimal distance of the sample to the prototypes of the clusters. The prototypes are updated and the samples are reassigned repeatedly until there is no more change in cluster assignment. The DVI_CU score is computed for the final assignment of the samples. The number of clusters in the data is estimated by finding the assignment of the samples, over all B initializations and all k partitions, which yielded the optimal validity score.
Initial validation of the modkprototypes algorithm was performed by evaluating the clustering of the samples in the simulated and the Cleveland Clinic heart disease mixed data sets. Clustering of the simulated data was performed with adaptive weighting of the numeric and categorical data. After 50 trial clustering attempts over 2 to k possible clusters in the data, the modkprototypes algorithm partitioned the data into 3 clusters with the samples in their respective class group (i.e., samples #s 12â€“22, 33â€“43 and 44â€“54 together respectively). Figure S3 in Additional file 1 illustrates the minimization of the DVI_CU index at k = 3.
Clustering of the Cleveland Clinic heart disease data was performed with equal domain weighting. A plot of the DVI_CU validity measure at all values of k shows a minimum at k = 2, implying that the estimated number of clusters is two (Figure 2a). Additional file 2 shows the assignment of the samples to either of the two clusters along with the categorical value for the chest pain type attribute. Cluster 1 has 169 patient samples grouped together with pain type of angina suggestive of having heart disease (sick) while Cluster 2 has 134 patient samples grouped similarly together with nonangina representative of being without heart disease (buff). The accuracy of the clustering of the patients into the two groups was 79%. This is on par with, or better than, the classification accuracy of the samples by the NTGrowth, C4 and CLASSIT, and conceptual clustering classification algorithms which were reported to the University of California at Irvine repository for machine learning as 77%, 74.8% and 78.9% respectively. This analysis indicates that the modkprototypes algorithm can effectively cluster mixed data types leading to relatively accurate assignment of the samples to clusters with the appropriate clinical label.
Similarly, application of the modkprototypes algorithm with equal domain weighting to the acetaminophen mixed data indicates a minimum value for the DVI _CU validity measure at k = 3 (Figure 2b), implying that there are three clusters in the data. Ten samples were grouped into Cluster 1, nine into Cluster 2 and 45 into Cluster 3 (Additional file 3). The samples in Cluster 3 are comprised mostly of low dosed (50, 150 mg/kg) samples and high dosed (1500 and 2000 mg/kg) samples at 6 hrs except for 5 animals (rats #s 405, 406, 423, 518 and 520) that had low ALT and AST enzyme levels (Additional file 4). Elevated levels of ALT and AST correlate with liver injury. Cluster 2 contains all samples exposed to a high dose of acetaminophen for 18 or 24 hrs. Cluster 1 has samples exposed to high dose of acetaminophen for 48 hrs, except for moderate responder rats #s 407, 416 and 420, that were dosed for 18 or 24 hrs and had moderately elevated ALT and AST enzyme levels.
Validation of clustering the acetaminophen mixed data
We next assessed the ability of the algorithm to cluster the samples according to the level of liver necrosis. At toxic doses of acetaminophen, glutathione is depleted leading to the formation of a reactive intermediate that covalently binds to sulfhydryl groups of several cellular proteins [16]. These adducts are thought to contribute to tissue necrosis [17]. The indicator variable representing the histopathological observations made by boardcertified pathologists on the centrilobular region of the liver was removed from the data set prior to running the modkprototypes algorithm. This variable was then used as an external indicator to validate the assignment of samples to the three clusters. This observation has four feature values for all the exposed samples denoting either no, minimal, mild, or moderate severity of necrosis of the centrilobular region of the liver. Using the modkprototypes algorithm with k set at 3 and equal weighting of the microarray, clinical chemistry and histopathology domain data, 90% of the cluster assignments of the acetaminophentreated samples had an adjusted Rand Index R' value greater than 0.64 when compared to the groups of the samples according to the observed level of necrosis (Figure 3). Since there were three clusters generated in the mixed data, yet four classes of acetaminophenexposed centrilobular necrosis of the liver, perfect agreement was not possible, but the achieved clustering approached maximal validity given the external classification (Figures 2b and 3).
Weighting schemes for clustering the acetaminophen mixed data
Proposed weighting of the domain data
Differential weighting of the domains may lead to further improved accuracy of the clustering procedure, as proposed by Lance and Williams [18] who introduced a clustering algorithm dependent on the weight of dissimilarity between objects [5]. User defined weights for clustering permit more or less influence to be given to particular components of the dissimilarity function. Several investigators at the NIEHS/National Center for Toxicogenomics responded to a survey in which they were asked to propose weights for clustering the acetaminophen microarray, clinical chemistry and histopathology data sets by assigning values for the modkprototypes algorithm parameters Î±, Î² and Î³ respectively. Their suggestions are listed in Table 1.
Two respondents, numbers 7 and 8, suggested different weighting schemes according to whether the end goal of the clustering of the samples was to either identify biomarkers related to histopathological changes following exposure to a toxicant, or to ascertain biological processes and pathways related to the phenotype of the samples. One respondent suggested two weighting schemes, one which is purported for data containing microarray, clinical chemistry and histopathology measurements (suggestion 5a) and one for molecular validation of toxicological evaluations of the samples (suggestion 5b). Two other respondents, numbers 14 and 15, suggested that the domain data be weighted according to preferred outcomes in the analysis of the data. The former proposed (i) equal weighting or (ii) coupling biology with phenotype whereas the latter proposed clustering based on (i) general effects of the treatment, (ii) specific injury and endpoints from the exposure or (iii) the affected pathways. The averages of the suggested weights were 0.51, 0.24 and 0.25 for Î±, Î² and Î³ respectively. The standard deviations of the suggested weights were low (<0.16). However, the standard deviations of the Î² and Î³ weights were less than that of the Î± weight.
Simultaneous clustering using domain weights
Table 2 lists the determination of k and validation results of the simultaneous clustering of the acetaminophen microarray, clinical chemistry and histopathology data sets using the modkprototypes algorithm with specified weights. Equal weights of the domain data in the clustering process resulted in k = 3 and R' of 0.64 when the four centrilobular necrosis of the liver histopathology observation levels were used as an external indicator of clustering validity. Adaptive (Î±, Î² and Î³ adjusted to 0.26, 0.39 and 0.35 respectively) or proposed weighting of all three domain data yielded clustering results with k = 3, but R' = 0.64 or 0.67.
The highest agreement among all weighting schemes tested was achieved when the clinical chemistry data was given all of the weight. However, utilizing only the microarray data in the clustering process resulted in partitioning of the samples into just two groups with R' = 0.51. Excluding the microarray data from the analysis by weighting the clinical chemistry and the histopathology data equally yielded three clusters and R' = 0.64. Placing all the weight on the histopathology data or splitting the weighting equally between the microarray and histopathology data resulted in the poorest agreements (R' = 0.33 and 0.46 respectively) of the cluster assignments. Interestingly, having the weighting shared between at least the microarray and clinical chemistry domain data appears to be advantageous for clustering the data irrespective of the balancing of the weights. Surprisingly, no weighting scheme that included the histopathology data resulted in cluster groups with k > 4. However, partitions with k = 6 and k = 5 were obtained respectively, when clinical chemistry data alone and microarray with clinical chemistry data were used for clustering the samples. The latter resulted in R' = 0.66. With the weight of the microarray data set > 0.5 and some weight given to the histopathology data, weighting schemes for clustering of the biological samples validated with R' = 0.67 and k = 3 (the estimated number of clusters in the data [Figure 2b]).
Phenotypic Prototypes
Endpoint components of the prototypes
The groups of samples from the modkprototypes algorithm were analyzed next for phenotypic prototypes by extracting histopathological feature value labels, clinical chemistry measurements, and genes from the prototypes of the clusters that (1) distinguish between pathologic outcomes and (2) best represent the underlying biology of the data. This analysis was performed on the acetaminophen microarray, clinical chemistry and histopathology data (including the centrilobular necrosis variable) with k = 3 and the Î±, Î² and Î³ weights set at 0.51, 0.24 and 0.25, respectively (see Tables 1 and 2 for the averages of the suggested weights). Table 3 lists partial prototypes of the resulting Clusters 1, 2, and 3 that represent samples grouped with moderate, no and mild levels, respectively, of necrosis of the centrilobular region. Samples in Clusters 1 and 3 were qualified by moderate and mild necrosis. By contrast, the majority of the samples in Cluster 2 were either low dosed (50 or 150 mg/kg) at any of the durations of exposure, or high dosed (1500 and 2000 mg/kg) for short durations (6 and 18 hrs). Except for 3 alteredresponder rats (#s 423, 520 and 522) that were dosed for 24 or 48 hrs (Table 4). These exposures were expected to give at least a mild hepatotoxic phenotype. However, the ALT and AST levels for these animals were far below the treatment group average for these enzymes (see Additional file 4). Clusters 1 and 3 contained only high dosed samples (1500 and 2000 mg/kg) with the durations of exposure beyond 6 hrs (18, 24 or 48 hrs). In Cluster 3, most of the samples (6 of 9) were exposed for a time frame in which partial recovery from the treatment is expected (namely 48 hrs), whereas Cluster 1 only contains samples dosed for either 18 or 24 hrs. The samples in Clusters 1 and 3 also showed markedly and moderately elevated ALT and AST enzyme levels, as well as moderate and minimal congestion of the sinusoid region, respectively. Furthermore, the samples from the rats in Cluster 3 were represented by a histopathologic prototype characterized by minimal inflammatory cell infiltration in the centrilobular region, regeneration and degradation of the hepatocytes. The latter observed in the left medial lobe region. Samples from rats #s 407, 416 and 420 were dosed with 1500 mg/kg acetaminophen for either 18 or 24 hrs durations, but had only modest elevations of ALT and AST. Finally, from the prototype, samples in Cluster 1 were observed to have minimal hypertrophy of the hepatocytes predominantly. The rest of the histopathology feature values for the three clusters were not informative (all had no observed endpoint) and therefore not included as representative features in the phenotypic prototypes.
Of the clinical chemistry measurements listed in Table 3 for each cluster prototype, ALT and AST levels clearly distinguish Cluster 1 samples labelled with the prototype feature as moderate necrosis of the centrilobular region of the liver from the two other clusters. In addition, elevated levels of TBA and decreased blood cholesterol differentiate samples in Cluster 1 from samples in Clusters 2 and 3 reasonably well.
Gene expression component of the prototypes
Differences in gene expression levels between each cluster are shown in Figure 4a. The Cluster 2 prototype labelled with no necrosis of the centrilobular region had the least amount of differential gene expression of the samples in the cluster. Samples in Clusters 1 and 3 with moderate and mild necrosis of the centrilobular region as representative indicators respectively, had numerous genes with over 2fold differential expression. The most dramatic gene expression differences were observed in the comparison of no versus moderate (Cluster 2 vs Cluster 1) necrosis of the centrilobular region of the liver, while the moderate versus mild comparison showed only slight differences in magnitude of expression between gene expression prototypes.
To extract genes from the pairwise comparisons of the expression component of the prototypes for the clusters that could statistically distinguish between levels of necrosis of the centrilobular region of the liver, a chisquare goodnessoffit test was employed using the observed difference in a gene's expression ratios between two prototypes and the expected gene expression differences of all pairwise comparisons for all genes in the prototypes. With Î± set at 0.05, 82 genes, including several Cytochrome P450 genes and heme oxygenase 1, were identified as significant and unique in distinguishing contrasts between different levels of necrosis of the centrilobular region of the liver (Additional file 5). A subset of the genes is shown in Table 5. In particular, the GO biological processes cell growth and/or maintenance, amine metabolism and stress response discerned between clusters of samples grouped according to no and moderate necrosis of the centrilobular region of the rat liver. Mild and moderate centrilobular necrosis appeared to involve amine metabolism.
A clearer picture of the differences between the samples in the clusters labelled with either no, mild or moderate necrosis of the centrilobular region of the rat liver was obtained by comparisons of Clusters 1, 2 and 3 using just the expression values of the 82 genes extracted from the prototypes (Figure 4b). About 75% of the genes progressively increase or decrease in differential expression as the level of necrosis of the centrilobular region of the liver transitions from no to mild to moderate. Finally, hierarchical clustering of the biological samples reveals very good grouping of the low dosed and high dosed samples. The latter very prominent and tight within time groups (Figure 5). Interestingly, as shown in Figure 6, the nine no or moderatelyresponding rats (#s 405, 406, 407, 416, 420, 423, 518, 520 and 522) were distinctly different from their counterpart dosebytime group subjects in terms of ALT enzyme levels. The high dosed 6 hrs rats differed from the high dosed 18, 24 and 48 hrs rats by a small cluster of genes that include an activator (Mitogen activated protein kinase kinase kinase 12 [Map3k12]) of the cJun Nterminal kinase (JNK) pathway, a transactivator of thyroidstimulating hormone beta, and a regulator of neuronal differentiation and development.
Discussion
Clustering of microarray gene expression data has matured by virtue of the growing number of analytical approaches for partitioning data. kmeans is one of the most widely used unsupervised clustering methods for gene expression data. Unfortunately, kmeans clustering, and other approaches such as SOMs do not guarantee globally optimal partitioning, require specifying the number of clusters or the configuration of the underlying classification structure, and suffer from inflexibility with respect to incorporation of associated biological data. More importantly, most clustering algorithms support only quantitative or qualitative data but not both simultaneously. Huang [15] introduced the kprototypes algorithm that utilizes the clustering objective function of kmeans for numeric measurements and kmodes for categorical values to partition data. We have proposed modifying this algorithm by adding an objective function to support and weight multidomain, mixed type biological data within the kmeans clustering paradigm. The advantage of our modkprototypes algorithm is that simultaneous clustering of gene expression data with clinical chemistry evaluations and histopathology observations results in informative clusters that are formed with prototypes of genes and values from endpoint variables that are anchored to the phenotypes of samples with similar biological outcomes.
Our method is one of a class of approaches that seek to incorporate biological data directly into the clustering process [9, 14]. Using necrosis of the centrilobular region of the rat liver following acetaminophen exposure as an endpoint to couple with gene expression profiles, clinical chemistry evaluations and histopathological observations, simultaneous clustering of the data with the modkprototypes algorithm revealed phenotypic prototypes which were capable of distinguishing between no, mild and moderate levels of necrosis of the liver (Tables 3 to 5; Figure 4). For instance, non or moderatelyresponding rats to acetaminophen exposure were distinctly different from their counterpart dosebytime group subjects. Furthermore, the high dosed 6 hrs rats vs the high dosed 18, 24 and 48 hrs rats differed by a small cluster of genes involved in signal transduction and growth regulation. Not surprisingly, Cytochrome P450 genes and heme oxygenase 1, which have functions in detoxification and redox regulation in response to oxidative stress, were found to be indicators of toxicity in the gene expression component of the phenotypic prototypes that differentiated between levels of necrosis of the centrilobular region of the rat liver (Table 5). Several published reports of gene expression data generated from treatment of biological samples with toxic agents describe the altered expression of genes such as these in wellknown biological pathways that are perturbed subsequent to incipient toxicity [19â€“24].
Weighting of the terms in the modkprototypes algorithm offers the flexibility to balance the influence of each domain of the data while simultaneously clustering the mixed data (see equation 1). This is advantageous for semisupervised clustering when different goals for analyzing the data are in mind. The interest might be to cluster biological samples based on gene expression data with clinical chemistry measurements and histopathology observations for the purpose of finding biomarkers related to histopathological changes, or identifying which biological processes and pathways are related to the phenotypic endpoint. From empirical analysis of acetaminophentreated rat liver sample data using adaptive weighting or different weighting schemes, giving some weight to histopathology observations and at least half of the weight to the microarray data set is advantageous to clustering the data (Table 2). Interestingly, although applying all the weight to the clinical chemistry data gave the best fit between cluster assignment and histopathology evaluation of centrilobular necrosis, the number of clusters in the data was overestimated. This indicates that improper weighting of the domain data can potentially bias the clustering of the samples. Further work is being done to weight the domain data heuristically.
The high dimensionality of data has challenged the efficiency and reliability of clustering algorithms for quite sometime. In high dimensional space, data points become sparse making the use of some distance measures meaningless. However, results from experiments on realworld high dimensional data have shown that distance measures based on the Minkowski L_{ d }metrics, where d is either 1 or 2, increases or remains constant as the dimensionality of the data increases [25]. Our modkprototypes algorithm is based on the Euclidean (L_{2}) distance metric for the high dimensional microarray data and clinical chemistry data. Given the aforementioned theoretical work plus our own simulation of a smaller scaled data set and reduction of the high dimensional numeric data (see Additional file 1), we are convinced that the clustering of the samples using the modkprototypes algorithm is not dependent on the scale or dimensionality of the data. The simulation results also provide evidence that the algorithm is at least able to find a small number of true/known clusters when they exist. Furthermore, the phenotypically anchored genes that were acquired from the prototypes of the clusters from the acetaminophenexposed samples suggest that the modkprototypes algorithm forms groups of samples that are biologically meaningful. Additional applications of the method to a variety of simulated and real data sets are underway. These should also help in determining its usefulness over a range of scales and data dimensions.
As more biological data becomes available, sophisticated methods for clustering integrated data will be necessary in order to glean more meaningful information about the underlying biology of the samples. Efforts such as integrative genomics, systems biology, toxicogenomics, pharmacogenomics and biomedical informatics are generating volumes of biological data and information spanning transcriptomics, proteomics, metabolomics, toxicology, pharmacology, clinical biology and genetics to leverage each domain data for a more informed assessment of biological outcomes [12, 26â€“30]. Case in point, the work of Baskin et al. [31] to collectively analyze microarray, clinical data and pathology observations revealed that gene expression patterns were very much consistent with the clinical outcomes, gross pathology and histopathology from influenza virusinfected pigtailed macaques primates. However, the identified clusters may not contain genes that are directly associated with the appearance of clinical signs or pathological indications of tissue infection because the domains of data were analyzed independently.
The modkprototypes algorithm is wellsuited as a clustering method for grouping biological samples constrained by integrated data and feature values. It yields representatives of the clusters (the prototypes) which can potentially provide an initial insight to the biological mechanism driving the similarities of the samples and the phenotypes associated with gene expression. This concept of phenotypic anchoring has been proposed and tested as a means to link the cause of a disease or response with gene expression patterns and the altered biological processes that follow the observed effect [32â€“34]. We propose that the modkprototypes clustering method will provide a feasible computational alternative to embark on bridging multidomain data analysis frameworks for integrative genomics, systems biology, pharmacology and toxicology.
Conclusion
Many existing methods for clustering gene expression data do not incorporate phenotypic data about the samples. We developed the modkprototypes algorithm using an objective function with the sum of the squared Euclidean distances and simple matching for clustering biological samples based on numeric data and categorical values respectively. It is a formal approach to cluster gene expression data with phenotypic data. The algorithm is based on the original kprototypes algorithm but is adapted along the kmeans paradigm, it contains weighting terms for microarray, clinical and histopathology data, and is designed to determine the number of clusters in the data by minimizing a DVI_CU measure over all possible numbers of clusters and randomization of the initialization of the algorithm.
The advantage of simultaneous clustering of gene expression data with clinical chemistry evaluations and histopathology observations is that informative clusters are formed with prototypes of genes and endpoint features that are linked to the phenotypes of samples with similar biological outcomes. Following modkprototypes clustering of the acetaminophen data with weighting of the domain data, informative genes from the cluster prototypes were identified that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. From empirical analysis of acetaminophentreated rat liver sample data using adaptive weighting or different weighting schemes, having some weight given to the histopathology observations and weight of the microarray data set > 0.5, are advantageous to clustering the samples. Clustering the mixed data types in this fashion was better than typical kmeans style clustering of either microarray or clinical chemistry numeric data alone (i.e. the other data sets weights set at 0) and better than kmodes clustering of the samples based solely on the histopathology data. We found that the expression profiles of several Cytochrome P450 genes and heme oxygenase 1 were significant in their differentiation between levels of centrilobular necrosis of the rat liver. Cytochrome P450 genes are in high proportion in the liver and produce detoxification enzymes to metabolize toxicants. Furthermore, the high dosed 6 hrs rats vs the high dosed 18, 24 and 48 hrs rats differed by a small cluster of genes containing an activator of the cJun Nterminal kinase pathway, a transactivator of thyroidstimulating hormone beta and a regulator of neuronal differentiation and development. But overall, cell growth and/or maintenance, amine metabolism and stress response were biological processes that discerned between no and moderate levels of acetaminopheninduced necrosis of the centrilobular region of the rat liver. The use of wellknown and traditional measurements directly in the clustering process provides some guarantee that the resulting clusters will be meaningfully interpretable. However, we realize that improper weights for the domain data can bias the clustering of the samples. In future work, we will investigate weighting the domain data heuristically.
Methods
Heart disease mixed data
Heart disease data from the Cleveland Clinic heart disease database maintained at the University of California at Irvine repository for machine learning was used as a data set with mixed features to evaluate the ability of the clustering algorithm to group samples based on mixed data types. The data set consists of 303 patients defined by 13 clinical features, five of which are numeric and eight categorical or nominal. The data has two classes: 165 individuals with no heart disease (buff) and 138 individuals with heart disease (sick).
Acetaminophen microarray gene expression data and analysis
Microarray gene expression data was derived from left liver lobe mRNA samples collected from 4 male Fischer F344/N rats per dose group exposed to either 50 mg/kg, 150 mg/kg (low doses), 1500 mg/kg or 2000 mg/kg (high doses) body weight acetaminophen during a light period (between 12 noon and 1 pm) as well as liver mRNA collected from control (vehicletreated) male rats [35]. Animals were sacrificed and mRNA extracted from liver specimens 6, 18, 24, or 48 hrs after treatment. Each RNA sample from a treated animal was compared with a pool of timematched control mRNAs and analyzed in duplicate (dye reversal experiments) on Agilent011868 (G4130A) rat oligonucleotide microarrays (Agilent Technologies, Palo Alto, CA). Acetaminophen exposure to the rat liver at 50 and 150 mg/kg is subtoxic. However, 1500 and 2000 mg/kg doses induce severe toxicity which peaks 24 hrs after exposure but the rats show signs of recovery 48 hrs after exposure.
Scanning of the microarray chips and acquisition of data from scanned images were as previously described [22]. Briefly, background subtracted pixel intensity values were log transformed, normalized and assessed for significance of expression (pvalue < 0.05, Bonferroni corrected) using an ANOVA model comparing treated samples with timematched controls. The approximately 3100 significant genes' pixel intensity ratio values from dye reversal hybridizations were combined (same subjects only) using Rosetta Resolver version 5.1.0.1.23 (Rosetta Biosoftware, Seattle, WA) error model weighted averaging [36, 37]. Two gene features (A_43_P22641 and A_43_P22629), which had all values missing, were removed from analysis. The data used for clustering is in Additional file 6.
Acetaminophen histopathology observations and clinical chemistry evaluations
48 histopathological observations of the acetaminophentreated rat liver specimen slides and 10 clinical chemistry measurements on biosamples from the treated animals were collected as previously described [35]. Observations include: inflammatory cell infiltration of the centrilobular region or region not otherwise specified, necrosis of the centrilobular region or of hepatocytes, hyperplasia of the centrilobular hepatocytes, glycogen depletion, degeneration or regeneration of the hepatocytes or the centrilobular region, congestion or glycogen depletion of the centrilobular region or sinusoid and hyperplasia of the bile duct. Microscopic qualifiers were categorized as no, minimal, mild, moderate or marked. Discrepancies in histopathology observations were resolved by a team of boardcertified pathologists [38].
Clinical chemistry evaluations of serum samples were performed using a Roche Cobas Fara chemistry analyzer (Roche Diagnostic Systems, Westwood, NJ) to numerically measure serum enzyme levels. These included indicators of liver injury (Alanine Aminotransferase [ALT] and Aspartate aminotransferase [AST]), Sorbitol dehydrogenase [SDH], cholesterol levels, indication of renal injury (urea nitrogen [BUN]), assessment of cholestasis â€“ bile flow interruption (total bile acids [TBA], Creatine [Creat], Alkaline Phosphatase [ALP]), total protein (TP) and albumin (ALB) levels. Elevated levels of ALT and AST correlate with liver injury [39]. Missing values were imputed for rats #s 308 and 309 with either the group average or the overall mean value for each evaluation.
Simulation of data for clustering using the modkprototypes algorithm
Numeric data
A data set comprised of numeric data with 64 features and 33 objects was simulated from three distinct probability distributions. Normal deviates (mean 0 and standard deviation 20) were drawn at random from the 3 probability distributions generating 11 objects in each. Samples that belong to the same class are #s12â€“22, 33â€“43, and 44â€“54 respectively.
Categorical data
A data set comprised of categorical data with 10 features and 33 objects was simulated from an HMM using R code in the HMM discrete nonparametric package. The HMM contained 3 states modelled on levels of toxicity (no/low, moderate and severe) and 5 severity levels (none, minimal, mild, moderate and marked) of centrilobular necrosis observed in rat livers exposed to acetaminophen. An independent cDNA microarray gene expression data set [22] acquired from rat liver samples exposed to 50, 150, 1500 and 2000 mg/kg of acetaminophen for 6, 24 and 48 hrs was used to estimate transition probabilities from a set of ~700 differentially expressed genes as well as a set of 700 genes selected at random. A third set of transition probabilities were manually created with a high probability (p >= 0.6) of visiting or remaining in the no/low toxicity state. An illustration of the HMM, a curve of the loglikelihood from the training, the transition and emission probabilities are in the supplemental materials (Additional file 1).
A mixed data set (Additional file 7) was created from the merge of the numeric and the categorical simulated data sets.
Modified k (modk)prototypes algorithm
The Huang [15]kprototypes algorithm which combines the kmeans and the kmodes objective functions for clustering numeric data and categorical values respectively, was modified to follow the kmeans algorithm paradigm, and was also optimized to search for clusters formed closest to the global minima of the objective function. In addition, a separate numeric objective function was utilized for the microarray and the clinical chemistry data resulting in the following modkprototypes objective function:
d({X}_{i},{Q}_{l})=\mathrm{\xce\pm}{\displaystyle \underset{j=1}{\overset{{m}_{r}}{\xe2\u02c6\u2018}}{({x}_{ij}^{r}\xe2\u02c6\u2019{q}_{lj}^{r})}^{2}}+\mathrm{\xce\xb2}{\displaystyle \underset{j=1}{\overset{{m}_{s}}{\xe2\u02c6\u2018}}{({x}_{ij}^{s}\xe2\u02c6\u2019{q}_{lj}^{s})}^{2}}+\mathrm{\xce\xb3}{\displaystyle \underset{j=1}{\overset{{m}_{s}}{\xe2\u02c6\u2018}}\mathrm{\xce\xb4}({x}_{ij}^{c},{q}_{lj}^{c})}\left(1\right)
where X_{ i }is the i^{th} sample, for i = 1 to N number of samples, Q_{ l }is the l^{th} prototype, for l = 1 to k number of clusters, m_{ r }is the number of microarray numeric attributes, m_{ s }is the number of clinical chemistry numeric attributes, m_{ c }is the number of histopathological categorical attributes, Î±, Î² and Î³ denote the weights (W) for the microarray, clinical chemistry and histopathology data domain dissimilarity measures, respectively. The weights for data domain d at the n^{th} step (W_{ d }[n]) are adapted (for controlling how much each data domain contributes to the clustering of the samples) as follows:
{W}_{d}[n]=\{\begin{array}{cc}\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$3$}\right.& n=0\\ (1\xe2\u02c6\u2019\mathrm{\xcf\u201e})\xc3\u2014{W}_{d}[n\xe2\u02c6\u20191]+\mathrm{\xcf\u201e}\xc3\u2014avecorr({X}^{d},{Q}^{d})& otherwise\end{array}\left(2\right)
where tau (Ï„) is the exponential weighting update factor in the range [0,1] and avecorr(X^{d}, Q^{d}) is the average correlation coefficient (Pearson for numeric data, Jaccard for categorical data) between the samples and the prototypes based on the feature values from domain d.
avecorr({X}^{d},{Q}^{d})=\{\begin{array}{ll}\left(\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$N$}\right.\right){\displaystyle \underset{i=1,{X}_{i}^{d}\xe2\u02c6\u02c6{C}_{l}}{\overset{N}{\xe2\u02c6\u2018}}{\left(\raisebox{1ex}{$\mathrm{cov}({X}_{i}^{d},{Q}_{l}^{d})$}\!\left/ \!\raisebox{1ex}{${s}_{{X}_{i}^{d}}{s}_{{Q}_{l}^{d}}$}\right.\right)}^{2}}\hfill & if\text{domian}d\text{isnumeric}\hfill \\ \left(\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$N$}\right.\right){\displaystyle \underset{i=1,{X}_{i}^{d}\xe2\u02c6\u02c6{C}_{l}}{\overset{N}{\xe2\u02c6\u2018}}\left(\raisebox{1ex}{${p}_{({X}_{i}^{d},{Q}_{l}^{d})}$}\!\left/ \!\raisebox{1ex}{${p}_{({X}_{i}^{d},{Q}_{l}^{d})}+2{q}_{({X}_{i}^{d},{Q}_{l}^{d})}$}\right.\right)}\hfill & if\text{domian}d\text{iscategorical}\hfill \end{array}
where cov is the sample covariance, s is the sample standard deviation, N is the number of samples, p is the number of features that match and q is the number of features that do not match. The value of Ï„ was set to 0.05 in order to adjust the weight of each domain by 5% at each iteration. The weights are nonnegative and their sum is constrained to equal 1. The weights could potentially go to the boundaries [0,1] depending on the data. However, they can easily be constrained to always be above some lower bound, e.g. 0.05, or even fixed at proportions that are appropriate or reasonable to a domain expert.
Letting z represent r for microarray numeric data or s for clinical chemistry numeric data, the distance between {X}_{i}^{z} and {Q}_{l}^{z} containing missing values is defined as:
{d}_{j}=\{\begin{array}{ll}0\hfill & if{x}_{ij}^{z}or{q}_{lj}^{z}ismissing\hfill \\ {x}_{ij}^{z}\xe2\u02c6\u2019{q}_{lj}^{z}\hfill & otherwise.\hfill \end{array}\left(3\right)
Then the distance between {X}_{i}^{z} and {Q}_{l}^{z} is:
d({X}_{i}^{z},{Q}_{l}^{z})=\frac{p}{p\xe2\u02c6\u2019{p}_{0}}{\displaystyle \underset{j=1}{\overset{p}{\xe2\u02c6\u2018}}{d}_{j}^{2}}\left(4\right)
where d is the Euclidean distance, p is the number of numeric features and p_{0} is the number of numeric features with missing values in {X}_{i}^{z} and {Q}_{l}^{z} or both.
For categorical (c) feature values, the dissimilarity measure between {X}_{i}^{c} and {Q}_{l}^{c} is defined by the total number of mismatches of the corresponding histopathologic features from the sample and the prototype {X}_{i}^{c} and {Q}_{l}^{c} respectively such that
d({X}_{i}^{c},{Q}_{l}^{c})={\displaystyle \underset{j=1}{\overset{{m}_{c}}{\xe2\u02c6\u2018}}\mathrm{\xce\xb4}({x}_{ij}^{c},{q}_{lj}^{c})\left(5\right)}
where
\mathrm{\xce\xb4}({x}_{ij}^{c},{q}_{lj}^{c})=\{\begin{array}{ll}0\hfill & if{x}_{ij}^{c}={q}_{lj}^{c}\hfill \\ 1\hfill & if{x}_{ij}^{c}\xe2\u2030{q}_{lj}^{c}.\hfill \end{array}\left(6\right)
For B (typically set to 100) times, the modkprototypes algorithm initialization is seeded by the domain data vector of a randomly selected sample for each of the k clusters. For adaptive clustering, recursion was used to update the prototypes in order to find the configuration of the initial kprototypes which ultimately results in the reduction of the objective function closest to the global minimum. Matlab code and a standalone executable program for the modkprototypes algorithm to simultaneously cluster gene expression data with clinical chemistry and pathological evaluations are available [48].
Determination of cluster number (k) and validation of cluster assignment
To determine the number of clusters in a data set, the DVI of Shen et al. [40] was used. The DVI is based on an intra/inter ratio validity index that also includes scaling of the intra and the intercluster distances.
DV{I}_{k}=\left\{\frac{intra(k)}{\underset{i=2,\mathrm{...},N}{\mathrm{max}}\{intra(i)\}}+\frac{inter(k)}{\underset{i=2,\mathrm{...},N}{\mathrm{max}}\{inter(i)\}}\right\}
where
inter(k)=\frac{\underset{i,j}{Max}\left({\xe2\u20ac\u2013{Q}_{i}\xe2\u02c6\u2019{Q}_{j}\xe2\u20ac\u2013}^{2}\right)}{\underset{i\xe2\u2030j}{Min}\left({\xe2\u20ac\u2013{Q}_{i}\xe2\u02c6\u2019{Q}_{j}\xe2\u20ac\u2013}^{2}\right)}{\displaystyle \underset{i=1}{\overset{k}{\xe2\u02c6\u2018}}\left(\frac{1}{{\displaystyle \underset{j=1}{\overset{k}{\xe2\u02c6\u2018}}\left({\xe2\u20ac\u2013{Q}_{i}\xe2\u02c6\u2019{Q}_{j}\xe2\u20ac\u2013}^{2}\right)}}\right)},
k is the number of clusters, N is the number of samples and intra is the average Euclidean distance between samples and the prototype Q of the cluster each sample is assigned to.
For mixed data with numeric and categorical values, the DVI was modified to include a CU measure [41] that defines the probability of matching a categorical feature value given a cluster versus the probability of the categorical feature value given the entire data set
C{U}_{m}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$m$}\right.{\displaystyle \underset{k=1}{\overset{m}{\xe2\u02c6\u2018}}P({C}_{k})}\left[{\displaystyle \underset{i}{\xe2\u02c6\u2018}{\displaystyle \underset{j}{\xe2\u02c6\u2018}P{({A}_{i}={V}_{ij}{C}_{k})}^{2}}}\xe2\u02c6\u2019{\displaystyle \underset{i}{\xe2\u02c6\u2018}{\displaystyle \underset{j}{\xe2\u02c6\u2018}P{({A}_{i}={V}_{ij})}^{2}}}\right]\left(7\right)
where P(A_{ i }= V_{ ij }) is the unconditional probability of feature A_{ i }taking on the value V_{ ij }, P(A_{ i }= V_{ ij } C_{ k }) is the conditional probability of A_{ i }= V_{ ij }given cluster C_{ k }, and k is the cluster number from 1 to m. The DVI modified with CU
DVI_CU = (DVI + 1/CU) (8)
is minimized over all k sets for each run of the modkprototypes clustering algorithm. Validation of cluster assignment was carried out using R', the adjusted Rand index [42â€“44]. When two partitions agree totally, R' is 1 and when the partitions are selected by chance, R' is 0.
Generation of phenotypic prototypes
A cluster's prototype is formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group. Hence, the cluster's prototype is taken as a representation of the feature values that depicts the phenotype of the samples in the group. The process for obtaining phenotypic prototypes is to extract all the histopathologic feature value labels and clinical chemistry measurements as well as significant genes from the prototypes of the clusters that can distinguish between pathological outcomes and best represent the underlying biology of the group of samples. Let the observed difference between the expression ratio of the g th gene (p in total) from the gene expression component of prototype q for i th and j th (i not equal to j) cluster (k in total) be observed_{ g }= (q_{ gi } q_{ gj }) and the expected change in expression be
expected=\frac{{\displaystyle \underset{g=1}{\overset{p}{\xe2\u02c6\u2018}}{\displaystyle \underset{i=1}{\overset{k\xe2\u02c6\u20191}{\xe2\u02c6\u2018}}{\displaystyle \underset{j=i+1}{\overset{k}{\xe2\u02c6\u2018}}({q}_{gik}\xe2\u02c6\u2019{q}_{gjk})}}}}{\left(\begin{array}{c}k\\ 2\end{array}\right)p}
Averaging over all the genes gives an estimate of the expected difference between a gene's ratio values in the prototypes of two clusters being compared. Assuming independence and an approximately normal distribution of differences, genes which have expression ratios which significantly distinguish between prototypes of clusters are evaluated using a standard chisquare (X^{2}) goodnessoffit test [45]:
{\mathrm{\xcf\u2021}}_{c}^{2}=\frac{2{(observe{d}_{g}\xe2\u02c6\u2019expected)}^{2}}{expected}\xe2\u2030\yen {\mathrm{\xcf\u2021}}_{\mathrm{\xce\pm},1}^{2}\left(9\right)
where the null hypothesis is that the expression value of the g^{th} gene does not distinguish between prototypes of a pair of clusters that are compared. The null hypothesis is rejected at a level of Î±, the probability of a type I error, if {\mathrm{\xcf\u2021}}_{c}^{2} â‰¥ Ï‡^{2}(1, Î±) where Ï‡^{2}(1, Î±) is the Î±level critical value of a Ï‡^{2}distribution with 1 degree of freedom. An Î± of 0.05 gives reliable results. Genes from a comparison of two prototypes which significantly distinguish the clusters are annotated for biological function and process(es) using the GO database [46, 47].
Abbreviations
 modkprototypes:

modified kprototypes
 ALT:

Alanine aminotransferase
 AST:

Aspartate aminotransferase
 SDH:

Sorbitol dehydrogenase
 BUN:

blood urea nitrogen
 TBA:

total bile acids
 Creat:

Creatine
 ALP:

Alkaline Phosphatase
 TP:

total protein
 ALB:

albumin
 DVI:

dynamic validity index
 CU:

category utility
 R':

Adjusted Rand Index
 DVI_CU:

dynamic validity index with category utility
 SOM:

self organizing map
 HMM:

hidden Markov model
 GO:

Gene Ontology.
References
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999, 96 (12): 67456750. 10.1073/pnas.96.12.6745
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531537. 10.1126/science.286.5439.531
Hamadeh HK, Bushel PR, Jayadev S, Martin K, DiSorbo O, Sieber S, Bennett L, Tennant R, Stoll R, Barrett JC, Blanchard K, Paules RS, Afshari CA: Gene expression analysis reveals chemicalspecific profiles. Toxicol Sci. 2002, 67 (2): 219231. 10.1093/toxsci/67.2.219
Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, BenDor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G: Geneexpression profiles in hereditary breast cancer. N Engl J Med. 2001, 344 (8): 539548. 10.1056/NEJM200102223440801
Kaufman L, Rousseeuw PJ: Finding groups in data : an introduction to cluster analysis. Wiley series in probability and mathematical statistics Applied probability and statistics, . 1990, xiv, 342 p.New York , Wiley
, : Some methods for classification and analysis of multivariate observations. Proc 5th Berkeley Symp Math Statist Prob. 1967, 1: 281297.
Huang Z: Extensions to the kmeans algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998, 2: 283304. 10.1023/A:1009769707641.
Shannon WD, Watson MA, Perry A, Rich K: Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genet Epidemiol. 2002, 23 (1): 8796. 10.1002/gepi.1115
Sese J, Kurokawa Y, Monden M, Kato K, Morishita S: Constrained clusters of gene expression profiles with pathological features. Bioinformatics. 2004, 20 (17): 31373145. 10.1093/bioinformatics/bth373
Kasturi J, Acharya R: Clustering of diverse genomic data using information fusion. Bioinformatics. 2005, 21 (4): 423429. 10.1093/bioinformatics/bti186
Gevaert O, De Smet F, Timmerman D, Moreau Y, De Moor B: Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006, 22 (14): e18490. 10.1093/bioinformatics/btl230
Selaru FM, Yin J, Olaru A, Mori Y, Xu Y, Epstein SH, Sato F, Deacu E, Wang S, Sterian A, Fulton A, Abraham JM, Shibata D, Baquet C, Stass SA, Meltzer SJ: An unsupervised approach to identify molecular phenotypic components influencing breast cancer features. Cancer Res. 2004, 64 (5): 15841588. 10.1158/00085472.CAN033208
Tan Y, Shi L, Hussain SM, Xu J, Tong W, Frazier JM, Wang C: Integrating timecourse microarray gene expression profiles with cytotoxicity for identification of biomarkers in primary rat hepatocytes exposed to cadmium. Bioinformatics. 2006, 22 (1): 7787. 10.1093/bioinformatics/bti737
Wang Z, Yan P, Potter D, Eng C, Huang TH, Lin S: Heritable clustering and pathway discovery in breast cancer integrating epigenetic and phenotypic data. BMC Bioinformatics. 2007, 8 (1): 38 10.1186/14712105838
, : Clustering large data sets with mixed numeric and categorical values. Proceedings of the 14th International Joint Conference on Knowledge Discovery and Data Mining, . 1997
Hodgson E: A textbook of modern toxicology. 2004, xxi, 557 p.Hoboken, N.J. , John Wiley, 3rd
Jollow DJ, Mitchell JR, Potter WZ, Davis DC, Gillette JR, Brodie BB: Acetaminopheninduced hepatic necrosis. II. Role of covalent binding in vivo. J Pharmacol Exp Ther. 1973, 187 (1): 195202.
Lance GN, Williams WT: A general theory of classificatory sorting strategies:1. Hierarchical systems. Computer J. 1966, 9: 373380.
Bauer I, Vollmar B, Jaeschke H, Rensing H, Kraemer T, Larsen R, Bauer M: Transcriptional activation of heme oxygenase1 and its functional significance in acetaminopheninduced hepatitis and hepatocellular injury in the rat. J Hepatol. 2000, 33 (3): 395406. 10.1016/S01688278(00)802755
Hamadeh HK, Bushel PR, Jayadev S, DiSorbo O, Bennett L, Li L, Tennant R, Stoll R, Barrett JC, Paules RS, Blanchard K, Afshari CA: Prediction of compound signature using high density gene expression profiling. Toxicol Sci. 2002, 67 (2): 232240. 10.1093/toxsci/67.2.232
Heijne WH, Slitt AL, van Bladeren PJ, Groten JP, Klaassen CD, Stierum RH, van Ommen B: Bromobenzeneinduced hepatotoxicity at the transcriptome level. Toxicol Sci. 2004, 79 (2): 411422. 10.1093/toxsci/kfh128
Heinloth AN, Irwin RD, Boorman GA, Nettesheim P, Fannin RD, Sieber SO, Snell ML, Tucker CJ, Li L, Travlos GS, Vansant G, Blackshear PE, Tennant RW, Cunningham ML, Paules RS: Gene expression profiling of rat livers reveals indicators of potential adverse effects. Toxicol Sci. 2004, 80 (1): 193202. 10.1093/toxsci/kfh145
Waring JF, Cavet G, Jolly RA, McDowell J, Dai H, Ciurlionis R, Zhang C, Stoughton R, Lum P, Ferguson A, Roberts CJ, Ulrich RG: Development of a DNA microarray for toxicology based on hepatotoxinregulated sequences. EHP Toxicogenomics. 2003, 111 (1T): 5360.
Wormser U, Calp D: Increased levels of hepatic metallothionein in rat and mouse after injection of acetaminophen. Toxicology. 1988, 53 (23): 323329. 10.1016/0300483X(88)902247
Hinneburg A, Aggarwal C, Keim DA: What is the nearest neighbor in high dimensional spaces?. Marking the millennium : 26th International Conference on Very Large Databases, Cairo, Egypt, 1014 September. 2000, Morgan Kaufmann
Hood E: Pharmacogenomics: the promise of personalized medicine. Environ Health Perspect. 2003, 111 (11): A5819.
Nuwaysir EF, Bittner M, Trent J, Barrett JC, Afshari CA: Microarrays and toxicology: the advent of toxicogenomics. Mol Carcinog. 1999, 24 (3): 153159. 10.1002/(SICI)10982744(199903)24:3<153::AIDMC1>3.0.CO;2P
Waring JF, Halbert DN: The promise of toxicogenomics. Curr Opin Mol Ther. 2002, 4 (3): 229235.
Waters MD, Fostel JM: Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet. 2004, 5 (12): 936948. 10.1038/nrg1493
Waters MD, Selkirk JK, Olden K: The impact of new technologies on human population studies. Mutat Res. 2003, 544 (23): 349360. 10.1016/j.mrrev.2003.06.022
Baskin CR, GarciaSastre A, Tumpey TM, BielefeldtOhmann H, Carter VS, NistalVillan E, Katze MG: Integration of clinical data, pathology, and cDNA microarrays in influenza virusinfected pigtailed macaques (Macaca nemestrina). J Virol. 2004, 78 (19): 1042010432. 10.1128/JVI.78.19.1042010432.2004
Hamadeh HK, Knight BL, Haugen AC, Sieber S, Amin RP, Bushel PR, Stoll R, Blanchard K, Jayadev S, Tennant RW, Cunningham ML, Afshari CA, Paules RS: Methapyrilene toxicity: anchorage of pathologic observations to gene expression alterations. Toxicol Pathol. 2002, 30 (4): 470482.
Moggs JG, Tinwell H, Spurway T, Chang HS, Pate I, Lim FL, Moore DJ, Soames A, Stuckey R, Currie R, Zhu T, Kimber I, Ashby J, Orphanides G: Phenotypic anchoring of gene expression changes during estrogeninduced uterine growth. Environ Health Perspect. 2004, 112 (16): 15891606.
Paules R: Phenotypic anchoring: linking cause and effect. Environ Health Perspect. 2003, 111 (6): A3389.
Irwin RD, Parker JS, Lobenhofer EK, Burka LT, Blackshear PE, Vallant MK, Lebetkin EH, Gerken DF, Boorman GA: Transcriptional profiling of the left and median liver lobes of male f344/n rats following exposure to acetaminophen. Toxicol Pathol. 2005, 33 (1): 111117. 10.1080/01926230590522257
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102 (1): 109126. 10.1016/S00928674(00)000155
Stoughton R, H. D: US Patent #6351712. 2002
Boorman GA, Haseman JK, Waters MD, Hardisty JF, Sills RC: Quality review procedures necessary for rodent pathology databases and toxicogenomic studies: the National Toxicology Program experience. Toxicol Pathol. 2002, 30 (1): 8892. 10.1080/01926230252824752
Hamadeh HK, Afshari CA: Toxicogenomics : principles and applications. 2004, xx, 361 p.Hoboken, N.J. , WileyLiss
Shen J, Deng Y, Lee ES ,Chang SI ,SJ. B: Determination of cluster number in clustering microarray data. Applied Math and Computation. 2005, 169: 11721185. 10.1016/j.amc.2004.10.076.
Gluck M, Corter J: Information, uncertainty, and the utility of categories. Proc 7th Ann Conf Cog Soc. 1985, 283287.
Jain AK, Dubes RC: Algorithms for clustering data. 1988, xiv, 320 p.Englewood Cliffs, N.J. , Prentice Hall
Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17 (4): 309318. 10.1093/bioinformatics/17.4.309
Hubert L, Arabie P: Comparing partitions. J of Classification. 1985, 2: 193218. 10.1007/BF01908075.
Rao PV: Statistical research methods in the life sciences. 1998, xiv, 889 p.Pacific Grove, CA , Duxbury Press
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 2529. 10.1038/75556
Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11 (8): 14251433. 10.1101/gr.180801
modkprototypes application. http://dir.niehs.nih.gov/microarray/software/modkprototypes/
Acknowledgements
We give thanks to Gary Boorman and Rick Irwin of the NIEHS/National Toxicology Program (NTP) for the design of the acetaminophen study and for generation of the gene expression, clinical chemistry and the histopathology data. The data is publicly available at the Chemical Effects in Biological Systems (CEBS) database, accession number 0020000100110005. Thanks to Robert Detrano, M.D., Ph.D., V.A. Medical Center, Long Beach and Cleveland Clinic Foundation for generation of the heart disease data. The data is available at the University of California at Irvine repository for machine learning web site. Many thanks to the NIEHS/National Center for Toxicogenomics ToxicologyPathology group, other scientists and especially Alexandra Heinloth and Richard S. Paules for their advice on weighting schemes and the biology for different domains of toxicogenomics data. We greatly appreciate Judong Shen for the computation for the Dynamic Validity Index and Pablo Tamayo for R code to simulate numeric data. Thanks to Shuangshuang Dai and Yohan Jin for their critical review of the manuscript. This research was supported, in part by, the Intramural Research Program of the NIH and NIEHS.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
PRB performed the analysis of the gene expression data, considered the utilization of the DVI_CU measure and the kprototypes algorithm for gene expression and phenotypic data, implemented the modkprototypes clustering algorithm and DVI_CU, applied them to the mixed type data and wrote the paper. RDW provided statistical guidance for the work. GG provided advice and guidance throughout the project. Both GG and RDW assisted in the evaluation of results. All authors read and approved the final manuscript.
Electronic supplementary material
12918_2006_15_MOESM1_ESM.pdf
Additional file 1: Generation and clustering of simulated mixed data and real data with reduced dimensions of the microarray data. Supplemental_materials.pdf is a pdf file to be viewed with Adobe Acrobat. (PDF 116 KB)
12918_2006_15_MOESM2_ESM.txt
Additional file 2: Cluster assignment of the heart disease samples using equal weights. Table_S1.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 76 bytes)
12918_2006_15_MOESM3_ESM.txt
Additional file 3: Cluster assignment of the acetaminophentreated samples using equal weights. Table_S2.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 1 KB)
12918_2006_15_MOESM4_ESM.txt
Additional file 4: Histopathology observations and clinical chemistry measurements. Table_S3.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 46 KB)
12918_2006_15_MOESM5_ESM.txt
Additional file 5: Significant and unique genes that distinguish between levels of centrilobular necrosis of the rat liver. Table_S4.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 5 KB)
12918_2006_15_MOESM6_ESM.txt
Additional file 6: Approximately 3100 genes determined to be significantly differentially expressed by ANOVA modelling. DEGs.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 2 MB)
12918_2006_15_MOESM7_ESM.txt
Additional file 7: Simulated mixed data of different types (numeric and categorical). sim_mixed_data.txt is a tabdelimited text file to be opened and viewed with any standard spreadsheet software. (TXT 9 KB)
Authorsâ€™ original submitted files for images
Below are the links to the authorsâ€™ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Bushel, P.R., Wolfinger, R.D. & Gibson, G. Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Syst Biol 1, 15 (2007). https://doi.org/10.1186/17520509115
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/17520509115
Keywords
 Gene Expression Data
 Microarray Gene Expression Data
 Mixed Data
 Histopathology Observation
 Centrilobular Necrosis