Multivariate gene expression analysis reveals functional connectivity changes between normal/tumoral prostates

Background: Prostate cancer is a leading cause of death in the male population; therefore, a comprehensive study of the genes and the molecular networks involved in the tumoral prostate process is necessary. In order to understand the biological process behind potential biomarkers, we have analyzed a set of 57 cDNA microarrays containing ~25,000 genes.
Results: Principal Component Analysis (PCA) combined with Maximum-entropy Linear Discriminant Analysis (MLDA) was applied in order to identify genes with the most discriminative information between normal and tumoral prostatic tissues. Data analysis was carried out using three different approaches, namely: (i) differences in gene expression levels between normal and tumoral conditions from a univariate point of view; (ii) in a multivariate fashion using MLDA; and (iii) with a dependence network approach. Our results show that malignant transformation in the prostatic tissue is more related to functional connectivity changes in the gene dependence networks than to differential gene expression. The MYLK, KLK2, KLK3, HAN11, LTF, CSRP1 and TGM4 genes presented significant changes in their functional connectivity between normal and tumoral conditions and were also classified as the top seven most informative genes for the prostate cancer genesis process by our discriminant analysis. Moreover, among the identified genes we found classically known biomarkers and genes which are closely related to tumoral prostate, such as KLK3 and KLK2, as well as several other potential ones.
Conclusion: We have demonstrated that changes in functional connectivity may be implicit in the biological process which renders some genes more informative to discriminate between normal and tumoral conditions. Using the proposed method, namely, MLDA, in order to analyze the multivariate characteristic of genes, it was possible to capture the changes in dependence networks which are related to cell transformation.


Background
Cancer is one of the main public health problems in the United States and worldwide [1]. Among the diverse types of neoplasia, prostate cancer is the third most common cancer in the world [2] and is ranked as the second leading cause of cancer death in men, the first being lung cancer [1]. Its incidence and mortality vary in different parts of the world, being highest in Western countries, mainly among Africans [3].
With the widespread use of the prostate-specific antigen (PSA) test, more men are examined and, consequently, identification of patients with asymptomatic low-stage tumors has increased considerably [4,5]. The majority of prostate cancers are confined to the prostate gland and rarely affect life expectancy. In about 30% of the cases, however, a specialized group of cells from the primary tumor mass may invade and colonize distant tissues, causing death. Metastatic disease rather than the primary tumor itself is therefore responsible for death, and the prognosis is directly related to the spread of the tumor. Unfortunately, the therapeutic approaches currently used against advanced stages of prostatic cancer are not effective [6]. Therefore, it is extremely important to understand the basic molecular biology involved in this disease in order to prevent the progression of the tumor [6]. However, the identification and analysis of these molecular mechanisms have been hampered by the heterogeneity and high molecular complexity of the processes involved in the development of this disease.
In the last few years, several efforts have been made towards determining the genetic mechanisms involved in the development of this tumor [6,7]. A widely used approach in studying the development of several types of cancer has been high-throughput gene expression microarray analysis, which has provided a wealth of information about tumor marker genes. Conventional methods of microarray data analysis have been systematically used to examine differentially expressed genes [8] and molecular pathways [9], and discriminative methods have been used in order to identify biomarkers [10,11].
In general, discriminant studies focus only on the classification accuracy of the method and on a pre-step selection of the features (genes) which best classify the samples [12]. This selection of features is often carried out by selecting a subgroup of the most differentially expressed genes [13] or in a multivariate fashion [12]. However, an understanding of the structure responsible for the regulation of this discriminative set of genes in prostatic cancer is required [14].
Many years of intensive research have demonstrated that signaling molecules are organized into complex biochemical networks. These signaling circuits are complicated systems consisting of multiple elements interacting in a multifarious fashion. Signaling networks are regulated both in time and space [15] and allow the cell to decide which cellular process (cell division, differentiation, transformation, or apoptosis) is the most appropriate response for each situation. Due to the high connectivity and complexity of these biological systems, small modifications in a few members ("hub" genes, i.e., highly functionally connected genes) of these biochemical networks are sufficient to perturb the whole system [16], consequently resulting in a change in the cell's phenotype [17]. Frequently, changes in the relative concentration of molecules, such as mRNAs and proteins, are the only parameter analyzed in biological systems. However, the biomolecules' concentration is not the only important variable; their compartmentalization and diffusion are also determinants of the cell's phenotype. Therefore, these approaches are reductionist in defining a good biomarker as the most differentially expressed gene or protein when comparing distinct cellular contexts.
Here, we report a cDNA microarray-based study in prostatic cancer aimed at understanding why some genes are good predictors in discriminating normal versus tumoral samples and others are not. We demonstrate that the discriminative information between normal and tumoral prostates is related to the change in functional connectivity between certain genes and not necessarily to their differential expression, as has often been assumed. Moreover, we present a systematic and straightforward approach based on MLDA (Maximum-entropy Linear Discriminant Analysis) to identify putative biomarkers in high dimensional data (when the number of features is greater than the number of observations), and a dependence network analysis in order to interpret sets of discriminative genes. This idea is illustrated in Figure 1.

Results

Simulation
The combination of PCA (Principal Component Analysis) + MLDA (Maximum-entropy Linear Discriminant Analysis) [18] was applied to the simulated data described in the Methods section in order to demonstrate that functional connectivity changes may be captured by the proposed approach. Figure 2 describes the weights in absolute values attributed by MLDA to each feature (artificially generated genes). The features are sorted in decreasing order of weight. Red crosses represent the genes whose functional connectivity is altered between conditions 1 and 2. Blue crosses represent the genes whose connectivity is unaltered.

Samples classification
Applying the PCA combined with the MLDA approach to all ~25,000 genes available in our microarray dataset [19], it was possible to classify the samples with an accuracy of 96.5% (a misclassification of 2 out of 57 samples), using a leave-one-out cross validation.
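The leave-one-out procedure behind this figure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid classifier is a stand-in for PCA+MLDA, and the function names are hypothetical. Each sample is held out in turn, the classifier is fitted to the remaining samples, and the held-out sample is predicted. Two errors in 57 samples correspond to 55/57, i.e. about 96.5%.

```python
def nearest_centroid_predict(train_rows, train_labels, row):
    """Stand-in classifier (NOT MLDA): assign the class with the closest mean profile."""
    centroids = {}
    for label in sorted(set(train_labels)):
        members = [r for r, l in zip(train_rows, train_labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], row))


def leave_one_out_accuracy(rows, labels, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(train_rows, train_labels, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


def accuracy_percent(n_samples, n_errors):
    """Accuracy in percent for a given number of misclassified samples."""
    return 100.0 * (n_samples - n_errors) / n_samples
```

Any classifier with the same `predict(train_rows, train_labels, row)` signature could be passed to `leave_one_out_accuracy` in place of the stand-in.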

Projection matrix ψ MLDA analysis
The projection matrix ψ MLDA contains the weights (degree of relationship between the gene and the normal/tumoral state) for each feature (gene). Figure 3 describes the weights in absolute values attributed by MLDA to each gene. The genes are sorted in a decreasing order of weight. Table 1 illustrates the top 100 features identified as the most informative genes related to malignant transformation by the PCA+MLDA approach ranked in a decreasing order of weight values. This set of 100 most informative genes represents ~0.4% of the total number of genes available in the microarrays (~25,000 genes). Notice that these 100 genes have a MLDA weight different from zero, i.e., the 100th gene RPS28 has a MLDA weight (~0.035, Table 1) located before the convergence of the curve to zero (Figure 3, the horizontal red line indicates the 100th gene). In order to verify the stability and robustness of our results, 27 observations out of 32 from normal sample and 20 out of 25 from tumoral sample were randomly selected and the ψ MLDA was re-calculated. This step was performed 100 times and the mean rank for each gene was obtained. About 80% of the originally obtained top 100 most discriminative genes were ranked as the top 100 most discriminative genes.

Figure 1 A pictorial scheme of the combination of PCA+MLDA and dependence network analysis for two populations (normal and tumoral prostatic tissues).

Figure 2 The discriminative weight of each simulated feature. Axes: features (horizontal), weight (vertical).
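The resampling check can be sketched as below, under stated assumptions. The per-feature score used here (absolute difference of class means) is only a stand-in for the MLDA weight, and all function names, including the toy data generator, are hypothetical. The sketch subsamples each class, re-ranks the features, averages the ranks over the repetitions, and reports the fraction of the original top k features that remain in the top k by mean rank.

```python
import random


def score_features(normal, tumoral):
    """Stand-in for the MLDA weight: absolute difference of class means per feature."""
    n_features = len(normal[0])
    scores = []
    for f in range(n_features):
        mean_normal = sum(row[f] for row in normal) / len(normal)
        mean_tumoral = sum(row[f] for row in tumoral) / len(tumoral)
        scores.append(abs(mean_normal - mean_tumoral))
    return scores


def ranks_from_scores(scores):
    """Rank 1 is the feature with the largest score."""
    order = sorted(range(len(scores)), key=lambda f: -scores[f])
    ranks = [0] * len(scores)
    for position, f in enumerate(order, start=1):
        ranks[f] = position
    return ranks


def top_k_stability(normal, tumoral, k, n_keep_normal, n_keep_tumoral, n_rep=100, seed=0):
    """Fraction of the full-data top k that stays in the top k by mean resampled rank."""
    rng = random.Random(seed)
    full_ranks = ranks_from_scores(score_features(normal, tumoral))
    full_top = {f for f, rank in enumerate(full_ranks) if rank <= k}
    rank_sums = [0] * len(full_ranks)
    for _ in range(n_rep):
        sub_normal = rng.sample(normal, n_keep_normal)
        sub_tumoral = rng.sample(tumoral, n_keep_tumoral)
        for f, rank in enumerate(ranks_from_scores(score_features(sub_normal, sub_tumoral))):
            rank_sums[f] += rank
    by_mean_rank = sorted(range(len(rank_sums)), key=lambda f: rank_sums[f] / n_rep)
    return len(full_top & set(by_mean_rank[:k])) / k


def toy_data(seed, n_features=20, n_shifted=5):
    """Hypothetical toy data: 32 'normal' and 25 'tumoral' rows, first features shifted."""
    rng = random.Random(seed)
    normal = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(32)]
    tumoral = [[rng.gauss(5.0 if f < n_shifted else 0.0, 1.0) for f in range(n_features)]
               for _ in range(25)]
    return normal, tumoral
```

With the sample sizes of the text, a call would look like `top_k_stability(normal, tumoral, k=100, n_keep_normal=27, n_keep_tumoral=20, n_rep=100)`.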
We have also manually annotated this set of 100 genes [see Table 1 and Additional file 1]. We believe manual annotation to be more accurate than automatic computer-based annotation, since it may capture semantic information from published articles more efficiently.

Putative differentially expressed genes
We have also searched for differentially expressed genes. About 25% of the genes listed in Table 1 show no statistical evidence of differential expression between normal and tumoral conditions.

Relevance networks
Both normal and tumoral relevance networks with the top 100 most informative genes were constructed, considering a false discovery rate of 5%; they are illustrated in Figures 4 and 5, respectively. Nodes in red are the genes whose functional connectivity (estimated using the non-parametric Hoeffding's D measure [20]) changed considerably between normal and tumoral conditions, i.e., they become "hubs" (highly connected genes) [16] in tumoral prostates. "Hub" genes were also maintained when relevance networks were constructed under different FDR thresholds (1, 5 and 10%).

Discussion
Firstly, the PCA+MLDA approach was applied to a simulated data set in order to illustrate that differences in connectivity may be behind the oncogenesis process. Sato et al. (2008) [21] have already demonstrated in another context (neuroscience) that the information contained in the connectivity may be useful for sample classification. The simulation was performed in a large scale multidimensional condition, where the relevant features (genes whose connectivity changed) are only 2% (500 out of 25,000 genes). Interestingly, MLDA was able to correctly identify the discriminative features, represented by red crosses in Figure 2. Notice that the relevant features for discrimination do not present differential expression between conditions 1 and 2 (by construction).
In order to verify whether gene expression data contain the information to discriminate normal from tumoral prostatic samples, we have applied the PCA+MLDA approach to actual biological data, obtaining a high classification accuracy (96.5%) by leave-one-out cross-validation. In this case, we have used all the principal components in order to avoid losing information. PCA is applied because of computational cost and memory limitations. It is important to mention that the numerical results are identical in the absence of the PCA step [22]. Notice that MLDA does not require a pre-step feature selection, because it also works for high dimensional data. Therefore, it was possible to include all of the 25,000 genes of the microarray dataset.
Since it was possible to verify that gene expression data retains information for classification, we analyzed the ψ MLDA projection matrix which contains the weight values for each feature (gene). Notice that the majority of the genes shown in Figure 3 have weights near zero, and only a few genes actually have discriminative information (high weight).
By analyzing Table 1, it is possible to verify that most of the 100 informative genes had already been described in the literature as genes related to cancer (76 genes), and 45 genes had specifically been associated with prostate tumor. Interestingly, most of the other 24 genes do not have references describing their functionality. Therefore, they may be associated with cancer but have not been studied yet. The description of the 76 genes in the literature corroborates the results obtained by the PCA+MLDA method, indicating that these genes are informative to discriminate between normal and tumoral samples. The stability and robustness of this result were verified by obtaining around 80% of the same top 100 genes when five observations were excluded randomly from the normal sample and five from the tumoral sample in 100 re-calculations. For more details about annotation of the top 100 genes and the complete list of the ~25,000 genes, please see Additional file 2.

Figure 3 The discriminative weight of each gene. The genes are sorted (in decreasing order) by the absolute value of the weight. The horizontal red line indicates the 100th gene.

Comparing the weights obtained by MLDA and the differentially expressed genes, it is surprising that the most differentially expressed genes are not necessarily the most discriminative ones. In other words, a multivariate combination of genes may be regulating the normal/tumoral state, i.e., the combination of genes may contain more information about normal/tumoral conditions than a univariate differentially expressed gene.
Since it is known that a complex network is involved in the regulation of several molecular processes, we further analyzed the dependence network involved in these putative biomarkers in order to gain new insights. The analysis of Figures 4 and 5 indicates that exactly the top seven most discriminative genes described in Table 1 (MYLK, KLK2, KLK3, HAN11, LTF, CSRP1 and TGM4) changed their functional connectivity between normal and tumoral conditions. These seven genes become "hubs" [16], i.e., highly connected genes in the tumoral condition, whereas in the normal condition, their connectivity was not different when compared to that of other genes. Furthermore, these seven genes maintained the position of the top seven most discriminative ones also when we resampled the samples (the experiment which was performed in order to verify the stability and robustness of the top 100 genes). A Z-value summary table related to these seven genes is given in Table 2. Z-values increase from normal to tumoral conditions, representing the changes in functional connectivity between these two conditions. The mean Z-values were calculated between the "hub" gene and the other 99 genes. In addition, in the list of the most discriminative features, there are genes which are more differentially expressed than these seven (lower p-value); however, their connectivity did not change. Kostka and Spang (2004) [17] have already suggested that differences in co-regulation between normal/disease states may be related to some pathologies. Moreover, Sato et al. (2008) [21] have reported that changes in network connectivity may influence classification methods. These reports support our results showing that changes in functional connectivity may be closely related to the normal/tumoral states in prostate and that these changes in dependence may contain additional information when compared to differential gene expression.

Table 1 (final row and footnote) RPS28, ribosomal protein S28, 0.03497, 0.15578. *: genes already described to be related to prostatic cancer. In bold are the genes which do not present statistical evidence of differential expression between normal and tumoral conditions.
Almost all of the top seven genes identified as the most discriminative features between normal and tumoral phenotypes had previously been described in the literature as being associated with cancer. The only gene that so far has not been correlated with cancer is HAN11, probably because little is known about this gene (only two articles were found in the literature describing it). Five of these top seven genes, namely MYLK, KLK2, KLK3, LTF and TGM4, had already been specifically related to prostate carcinoma (Table 1).
Myosin light chain kinase (MYLK) is one of them. This enzyme catalyzes the phosphorylation of a specific serine residue on the 20 kDa light chain of myosin II (MLC20), consequently regulating the actin-myosin II interaction [23]. This reaction is responsible for smooth muscle contraction/relaxation and organization of the cytoskeleton. Due to the central role played by the cytoskeleton in cell division and motility, it has been demonstrated that MYLK inhibition induces apoptosis in mammary and prostate cancer cells and inhibits the growth of mammary and prostate tumors in rats and mice [24]. Furthermore, since MLC20 phosphorylation is necessary for cell motility [25,26], MYLK inhibition blocks cancer cell invasion and adhesion in vitro. As a result, some reports have described the use of MYLK inhibitors as anti-cancer agents, since they prevent cancer cell migration [27,28].
KLK3, also known as prostate specific antigen (PSA), is another gene which presents high functional connectivity in tumoral samples. PSA is a serine protease belonging to the human kallikrein gene family; it is secreted into seminal plasma and is responsible for semen liquefaction. It is the first FDA (Food and Drug Administration)-approved tumor marker for cancer detection [29]. The prostatic gland volume affects the PSA level in serum, because PSA is produced and secreted by prostatic tissue [30,31]. However, increased levels of KLK3 are also observed in some patients with benign prostate hyperplasia. Therefore, elevated PSA concentration in patients' plasma may be indicative not only of prostate cancer but also of other prostatic pathologies. Consequently, the use of PSA as a cancer-specific marker is questioned.
Nowadays, 15 members of the kallikrein family (KLKs) are described in humans [32]. Among the KLKs, the highest homology is found between PSA and KLK2. In this case, the identity is 78% and 80% at the amino acid and DNA level, respectively [33]. KLK2 is another gene that presented functional connectivity changes between normal/tumoral conditions. The ratio of KLK2 to free PSA improves the discrimination of benign prostate hyperplasia and prostate cancer patients [34]. In addition, it has already been described that KLK2 discriminates between high and low grade tumors [35]. There is evidence indicating that KLK2 is more closely correlated to the total volume and higher grade prostate cancers than PSA [36].

Figure 4 A normal prostate relevance network constructed with the top 100 most discriminative genes and FDR of 5%. Core genes are represented in red.
Identification of both of these classic biomarkers of prostate carcinomas (PSA and KLK2) in our list of the most informative genes provides additional evidence for the hypothesis that functional connectivity changes, and not only differential expression levels, are highly correlated with the normal/tumoral process.
Another gene classified as one of the most discriminative prostate cancer biomarkers, whose anti-tumorigenic role has already been described [...]
Therefore, the literature supports the suggestion that these top seven genes (except for HAN11) may be considered the most closely related and informative prostate cancer biomarkers. Consequently, this suggests that the malignant transformation process in prostatic tissue is more correlated with functional connectivity changes in the gene dependence networks than with differential gene expression itself.
Almost all of the 100 genes identified by PCA+MLDA are correlated with cancer and, in many cases, with prostate cancer. For example, TIMP3 and ADAMTS1 (Table 1) are genes classically correlated with invasion and the metastatic process, the main cancer attributes responsible for death.

Conclusion
In summary, our main goal using PCA+MLDA was not dimension reduction or verification of the classification accuracy, but to investigate the discriminative characteristics extracted from the whole microarray dataset and how one can interpret them, although this procedure may also be used for classification, yielding good results, as previously described.
We have demonstrated that changes in functional connectivity may underlie the biological process which renders some genes more informative to discriminate between normal and tumoral conditions. Using the proposed PCA+MLDA method in order to analyze the multivariate gene characteristic, it was possible to capture the changes in dependence networks which are related to cell transformation. Identification of seven genes (MYLK, KLK2, KLK3, HAN11, LTF, CSRP1, TGM4) whose connectivity is altered between normal/tumoral conditions may provide novel insights into specific targets against tumor progression.

Figure 5 A tumoral prostate relevance network constructed with the top 100 most discriminative genes and FDR of 5%. Core genes are represented in red.

Methods

Principal component analysis (PCA)
Principal component analysis is a dimension reduction technique used to reduce the high dimensional space (number of genes).
PCA is defined as a linear transformation which maps the data to a new orthogonal coordinate system. The new coordinates are linear combinations of the original features, constructed so that the greatest variance by any projection lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
In other words, PCA summarizes the original features information by retaining characteristics of the dataset which most contribute to its variance.
For a gene expression data matrix X containing the genes in the columns and the observations in the rows (normalized to have zero mean and unit variance), the PCA transformation matrix ψ PCA is given by the eigenvectors of cov(X), where cov is the covariance matrix. In order to prevent losing any variance information, ψ PCA is composed of all eigenvectors with non-zero eigenvalues. Here, PCA is used only to reduce computational and memory costs.
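The PCA step can be written out as follows. This is a sketch under the standard textbook definitions, not a transcription of the authors' equation; the symbols e_k, λ_k, r and Y are introduced here for illustration. X is taken to be the n × m matrix of n observations and m genes with centred columns.

```latex
% Assumption: X is n x m (n observations, m genes), columns centred.
% e_k, \lambda_k, r and Y are illustrative symbols, not taken from the source.
\begin{align}
  \operatorname{cov}(X) &= \frac{1}{n-1}\, X^{\top} X, \\
  \operatorname{cov}(X)\, e_k &= \lambda_k e_k,
      \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \\
  \psi_{PCA} &= [\, e_1, e_2, \dots, e_r \,],
      \qquad r \le \min(m,\, n-1), \\
  Y &= X\, \psi_{PCA}.
\end{align}
```

Because the columns are centred, the rank of X is at most n − 1. With 57 microarrays, at most 56 eigenvalues are non-zero, so the ~25,000-dimensional profiles can be represented in at most 56 coordinates without discarding variance, which is where the computational and memory savings come from.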

Maximum-entropy linear discriminant analysis (MLDA)
In gene expression data analysis, we usually have a large number of genes (features), but only a small number of observations, i.e., microarray experiments.
A critical problem in applying conventional Linear Discriminant Analysis (LDA) to these types of data is the singularity and instability of the within-class scatter matrix calculated when the number of features approaches the number of available examples. In order to overcome this limitation, we applied the MLDA approach.
The MLDA method is concerned with the stabilization of the pooled covariance matrix estimate S p . The stabilized covariance matrix is constructed by selecting the largest dispersions relative to the average eigenvalue of S p . It is based on the maximum entropy covariance selection idea developed by Thomaz et al. (2004) [18].
It is known that the estimation errors of small eigenvalues are greater than those of large eigenvalues. Therefore, Thomaz et al. (2007) [44] proposed to expand only the smaller and less reliable eigenvalues of S p , keeping most of the larger eigenvalues unchanged.
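The eigenvalue expansion can be illustrated with a minimal sketch. It assumes that "expanding" means replacing every eigenvalue of S p that falls below the average eigenvalue by that average, which is how the maximum entropy covariance selection rule is usually stated; the function name is hypothetical, and the eigen-decomposition of S p and the reassembly of the stabilized matrix are omitted.

```python
def expand_small_eigenvalues(eigenvalues):
    """Replace each eigenvalue below the average eigenvalue by the average.

    Larger eigenvalues are kept unchanged, so only the small, less reliable
    dispersions are modified. Assumes `eigenvalues` are those of the pooled
    covariance matrix S_p, obtained elsewhere.
    """
    if not eigenvalues:
        raise ValueError("at least one eigenvalue is required")
    average = sum(eigenvalues) / len(eigenvalues)
    return [max(value, average) for value in eigenvalues]


# Toy spectrum with two zero eigenvalues, as happens when there are more
# genes than microarrays; the average is 15.5 / 6.
toy_spectrum = [9.0, 4.0, 2.0, 0.5, 0.0, 0.0]
stabilized_spectrum = expand_small_eigenvalues(toy_spectrum)
```

After the expansion no eigenvalue is zero, so the stabilized matrix is invertible, which is what conventional LDA needs from the within-class scatter matrix.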
The algorithm may be described as follows:
1. Let the between-class scatter matrix S b be defined as
$$S_b = \sum_{i=1}^{g} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{T}$$
and the within-class scatter matrix S w be defined as
$$S_w = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{i,j} - \bar{x}_i)(x_{i,j} - \bar{x}_i)^{T}$$
where x i, j is the m-dimensional (m: number of genes) observation j from class ∏ i (i = 1, 2, where 1 = normal and 2 = tumoral in our case) containing the gene expressions in the rows, n i is the number of observations (microarrays) from class ∏ i , and g is the total number of classes (g = 2 in our case).
The vector $\bar{x}_i$ is the unbiased sample mean of class ∏ i and $\bar{x}$ is the grand mean vector,
$$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{i,j}, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{g} n_i \bar{x}_i$$
where n is the total number of microarrays, i.e., n = n 1 + n 2 + ... + n g .

The main advantage of MLDA is that it avoids both the singularity and instability of the within-class scatter matrix S w when applied directly to gene expression data, which consists of a low number of observations and a high number of features.
The implemented R code is available in Additional file 3.

Simulation
This simulation was designed in order to demonstrate that MLDA is capable of discriminating two different conditions and also of identifying the intrinsic functional connectivity changes underlying the tumoral process. For this simulation, artificial gene expressions for 25,000 genes (features) were generated, based on the simulation illustrated in [21]. The 25,000 genes were divided into three sets: A (250 genes), B (250 genes) and C (24,500 genes). For each gene, 30 observations representing the "normal" condition and 30 observations representing the "tumoral" condition were generated. The model was built to investigate the situation where there are functional connectivity changes and there are no differences in gene expression between conditions 1 and 2. Notice that there is no difference in means between A and B.
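A small generator in the spirit of this design is sketched below. The model itself is an assumption made here for illustration and is not the authors' model: in the "coupled" condition the genes of sets A and B share one latent factor and are therefore correlated, in the other condition they are independent, and every gene has mean zero and the same variance in both conditions. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_condition(n_obs, n_a, n_b, n_c, coupled, seed):
    """Return n_obs rows of n_a + n_b + n_c artificial gene expression values.

    Assumed model (illustration only): when `coupled` is True, genes in sets A
    and B equal a shared latent factor plus noise; otherwise they are
    independent. Set C is always independent. Means are zero and variances
    are equal in both cases, so only the dependence structure differs.
    """
    rng = random.Random(seed)
    noise_sd = 0.5
    total_sd = math.sqrt(1.0 + noise_sd ** 2)  # matches the coupled variance
    rows = []
    for _ in range(n_obs):
        latent = rng.gauss(0.0, 1.0)
        row = []
        for _ in range(n_a + n_b):
            if coupled:
                row.append(latent + rng.gauss(0.0, noise_sd))
            else:
                row.append(rng.gauss(0.0, total_sd))
        row.extend(rng.gauss(0.0, total_sd) for _ in range(n_c))
        rows.append(row)
    return rows


def column(rows, j):
    """Extract the expression values of gene j across observations."""
    return [row[j] for row in rows]


def pearson(x, y):
    """Pearson correlation, used here only to inspect the simulated dependence."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

With the sizes given in the text, one condition would be generated as `simulate_condition(30, 250, 250, 24500, coupled=True, seed=1)` and the other with `coupled=False`. A univariate test on any single gene would then see no mean shift, while the pairwise dependence between A and B genes differs between the two conditions.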

Differentially expressed genes
In order to identify putative differentially expressed genes, we have applied the non-parametric Wilcoxon test with the false discovery rate (FDR) [45] controlled at 5%. The Wilcoxon procedure tests the median; therefore, it is more robust to outliers than the t-test (which tests the mean).
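FDR control over a list of per-gene p-values can be sketched as follows. The text cites [45] without naming the procedure, so the Benjamini-Hochberg step-up rule used here is an assumption; the function name is hypothetical, and the p-values are taken as given (for example from the Wilcoxon test).

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= fdr * k / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected


# Toy p-values (hypothetical, not taken from the study).
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
toy_rejections = benjamini_hochberg(toy_p_values, fdr=0.05)
```

In the toy list only the two smallest p-values fall under their thresholds (0.005 and 0.010), so two genes would be declared differentially expressed at an FDR of 5%.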

Relevance networks
Relevance networks [46] were constructed using Hoeffding's D measure [20], a non-parametric association method (the R code is freely available in the Hmisc package at [47]), which is more robust to outliers than Pearson's correlation. Pairwise associations were measured and the false discovery rate (FDR) [45] was controlled at 1, 5 and 10%. "Hub" genes were determined by calculating the degree (the number of adjacent edges, i.e., functional connectivities) of each gene and selecting the genes with the highest degree.
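The hub selection step can be sketched as below, assuming the relevance network is already available as a list of gene pairs that passed the FDR threshold. The function names and the toy gene labels are hypothetical.

```python
from collections import Counter


def node_degrees(edges):
    """Count adjacent edges per gene in an undirected network.

    Duplicate pairs (in either orientation) and self-loops are ignored.
    """
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degrees = Counter()
    for edge in unique_edges:
        for gene in edge:
            degrees[gene] += 1
    return degrees


def top_hubs(edges, n_top):
    """Return the n_top genes with the highest degree (ties broken by name)."""
    degrees = node_degrees(edges)
    return sorted(degrees, key=lambda gene: (-degrees[gene], gene))[:n_top]


# Toy network with placeholder gene labels; ("g2", "g1") repeats ("g1", "g2").
toy_edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3"), ("g2", "g1")]
toy_hubs = top_hubs(toy_edges, n_top=1)
```

Running the same function on the normal and on the tumoral edge lists and comparing the degrees gene by gene is one simple way to see which genes become "hubs" only in the tumoral network.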

Microarrays
We have analyzed the normal and tumoral prostate dataset publicly available at the Stanford MicroArray Database [48,19]. This dataset is composed of ~25,000 genes with 32 observations for normal state and 25 for tumoral condition.

Authors' contributions
AF has made substantial contributions to the conception, design and implementation of the study, and has also been responsible for drafting the manuscript. LRG has made substantial contributions to the biological interpretations, and has been responsible for drafting some parts of the manuscript. JRS has made substantial contributions to data analysis and applications of statistical concepts. RY, CET and MCS have discussed the results and critically revised the manuscript for important intellectual content.