Multivariate gene expression analysis reveals functional connectivity changes between normal/tumoral prostates
© Fujita et al. 2008
Received: 29 August 2008
Accepted: 05 December 2008
Published: 05 December 2008
Prostate cancer is a leading cause of death in the male population; a comprehensive study of the genes and molecular networks involved in prostate tumorigenesis is therefore necessary. In order to understand the biological processes behind potential biomarkers, we analyzed a set of 57 cDNA microarrays containing ~25,000 genes.
Principal Component Analysis (PCA) combined with Maximum-entropy Linear Discriminant Analysis (MLDA) was applied in order to identify the genes carrying the most discriminative information between normal and tumoral prostatic tissues. Data analysis was carried out using three different approaches: (i) differences in gene expression levels between normal and tumoral conditions from a univariate point of view; (ii) a multivariate analysis using MLDA; and (iii) a dependence network approach. Our results show that malignant transformation in the prostatic tissue is more related to functional connectivity changes in the dependence networks than to differential gene expression. The MYLK, KLK2, KLK3, HAN11, LTF, CSRP1 and TGM4 genes presented significant changes in their functional connectivity between normal and tumoral conditions and were also classified as the top seven most informative genes for the prostate cancer genesis process by our discriminant analysis. Moreover, among the identified genes we found classically known biomarkers and genes closely related to tumoral prostate, such as KLK3 and KLK2, as well as several other potential ones.
We have demonstrated that changes in functional connectivity may be implicit in the biological process which renders some genes more informative to discriminate between normal and tumoral conditions. Using the proposed method, namely, MLDA, in order to analyze the multivariate characteristic of genes, it was possible to capture the changes in dependence networks which are related to cell transformation.
Cancer is one of the main public health problems in the United States and worldwide. Among the diverse types of neoplasia, prostate cancer is the third most common cancer in the world, being ranked as the second leading cause of death in men, the first being lung cancer. Its incidence and mortality vary in different parts of the world, being highest in Western countries, mainly among Africans.
With the widespread use of the prostate-specific antigen (PSA) test, more men are examined and, consequently, the identification of patients with asymptomatic low-stage tumors has increased considerably [4, 5]. The majority of prostate cancers are confined to the prostate gland and rarely affect life expectancy. In about 30% of cases, however, a specialized group of cells from the primary tumor mass may invade and colonize distant tissues. Metastatic disease, rather than the primary tumor itself, is therefore responsible for death, and the prognosis is directly related to the spread of the tumor. Unfortunately, the therapeutic approaches currently used against advanced stages of prostate cancer are not effective. Therefore, it is extremely important to understand the basic molecular biology of this disease in order to prevent tumor progression. However, the identification and analysis of these molecular mechanisms have been hampered by the heterogeneity and high molecular complexity of the processes involved in the development of this disease.
In the last few years, several efforts have been made towards determining the genetic mechanisms involved in the development of this tumor [6, 7]. A widely used approach in studying the development of several types of cancer has been high-throughput gene expression microarray analysis, which has provided a wealth of information about tumor marker genes. Conventional methods of microarray data analysis have been systematically used to examine differentially expressed genes and molecular pathways, and discriminative methods have been used in order to identify biomarkers [10, 11].
In general, discriminant studies focus only on the classification accuracy of the method and on a preliminary selection of the features (genes) which best classify the samples. This selection of features is often carried out by selecting a subgroup of the most differentially expressed genes or in a multivariate fashion. However, an understanding of the structure responsible for the regulation of this discriminative set of genes in prostatic cancer is also required.
Many years of intensive research have demonstrated that signaling molecules are organized into complex biochemical networks. These signaling circuits are complicated systems consisting of multiple elements interacting in a multifarious fashion. Signaling networks are regulated both in time and space, and allow the cell to decide which cellular process (cell division, differentiation, transformation, or apoptosis) is the most appropriate response for each situation. Due to the high connectivity and complexity of these biological systems, small modifications in a few members ("hub" genes, i.e., highly functionally connected genes) of these biochemical networks are sufficient to perturb the whole system, consequently resulting in a change in the cell's phenotype. Frequently, changes in the relative concentration of molecules, such as mRNAs and proteins, are the only parameter analyzed in biological systems. However, the concentration of biomolecules is not the only important variable; their compartmentalization and diffusion are also determinants of the cell's phenotype. Therefore, approaches that define a good biomarker as the most differentially expressed gene or protein when comparing distinct cellular contexts are reductionist.
Applying PCA combined with the MLDA approach to all ~25,000 genes available in our microarray dataset, it was possible to classify the samples with an accuracy of 96.5% (a misclassification of 2 out of 57 samples), using leave-one-out cross-validation.
ψMLDA: the weights attributed by MLDA.
Official Full Name
myosin light chain kinase
kallikrein-related peptidase 2
kallikrein-related peptidase 3
WD repeat domain 68
cysteine and glycine-rich protein 1
transglutaminase 4 (prostate)
actin gamma 2 smooth muscle enteric
myosin light chain 6 alkali smooth muscle and non-muscle
retinol dehydrogenase 11 (all-trans/9-cis/11-cis)
alpha-2-glycoprotein 1 zinc-binding
NIPA-like domain containing 3
FXYD domain containing ion transport regulator 3
tropomyosin 2 (beta)
crystallin alpha B
actin alpha 2 smooth muscle aorta
ribosomal protein S6
transmembrane protein 130
acid phosphatase prostate
Purkinje cell protein 4
sorbin and SH3 domain containing 1
actin alpha cardiac muscle 1
transforming growth factor beta 3
mucosa associated lymphoid tissue lymphoma translocation gene 1
zinc finger protein 532
palladin cytoskeletal associated protein
inhibitor of growth family member 5
serpin peptidase inhibitor clade A (alpha-1 antiproteinase antitrypsin) member 3
keratin 5 (epidermolysis bullosa simplex Dowling-Meara/Kobner/Weber-Cockayne types)
ribosomal protein L5
insulin-like growth factor 1 (somatomedin C)
zinc finger protein 92 (HTF12)
folate hydrolase (prostate-specific membrane antigen) 1
cysteine-rich angiogenic inducer 61
four and a half LIM domains 1
H19 imprinted maternally expressed transcript
neurofilament heavy polypeptide 200 kDa
protein phosphatase 1 regulatory (inhibitor) subunit 12B
anthrax toxin receptor 2
myosin regulatory light chain MRLC2
chromosome 20 open reading frame 103
ubiquitin A-52 residue ribosomal protein fusion product 1
T cell receptor gamma variable 9
secreted protein acidic cysteine-rich (osteonectin)
delta/notch-like EGF repeat containing
prion protein (p27-30)
pyruvate dehydrogenase kinase isozyme 4
homocysteine-inducible endoplasmic reticulum stress-inducible ubiquitin-like domain member 1
heat shock protein 90 kDa alpha (cytosolic) class B member 1
glutathione S-transferase M2 (muscle)
v-ets erythroblastosis virus E26 oncogene homolog (avian)
connective tissue growth factor
guanylate cyclase 1 soluble alpha 3
TIMP metallopeptidase inhibitor 3
lactate dehydrogenase B
ribonuclease RNase A family 4
caveolin 1 caveolae protein 22 kDa
transmembrane 9 superfamily member 2
heat shock 22 kDa protein 8
tubulin alpha 1a
PDZ and LIM domain 5
LIM domain containing preferred translocation partner in lipoma
MAD2L1 binding protein
ADAM metallopeptidase with thrombospondin type 1 motif 1
ras homolog gene family member A
thioredoxin interacting protein
oxoglutarate (alpha-ketoglutarate) dehydrogenase (lipoamide)
ribosomal protein L35
ankylosis progressive homolog (mouse)
mortality factor 4 like 2
cysteine-rich secretory protein LCCL domain containing 2
aldehyde dehydrogenase 3 family member A2
sodium channel voltage-gated type II beta
SPARC-like 1 (mast9 hevin)
immunoglobulin J polypeptide linker protein for immunoglobulin alpha and mu polypeptides
zinc finger protein 134
mitochondrial ribosomal protein L43
hypothetical protein LOC152485
calmodulin 2 (phosphorylase kinase delta)
collagen type IX alpha 2
P antigen family member 4 (prostate associated)
calmodulin 1 (phosphorylase kinase delta)
anterior gradient homolog 2 (Xenopus laevis)
ribosomal protein S28
We have also manually annotated this set of 100 genes [see Table 1 and Additional file 1]. We believe manual annotation to be more accurate than automatic computer-based annotation, since it may capture semantic information from published articles more efficiently.
We have also searched for differentially expressed genes. About 25% of the genes listed in Table 1 do not present statistical evidence to be differentially expressed between normal and tumoral conditions.
Firstly, the PCA+MLDA approach was applied to a simulated data set in order to illustrate that differences in connectivity may be behind the oncogenesis process. Sato et al. (2008) have already demonstrated in another context (neuroscience) that the information contained in the connectivity may be useful for sample classification. The simulation was performed in a large-scale multidimensional condition, where the relevant features (genes whose connectivity changes) are only 2% of the total (500 out of 25,000 genes). Interestingly, MLDA was able to correctly identify the discriminative features, represented by red crosses in Figure 2. Notice that the relevant features for discrimination do not present differential expression between conditions 1 and 2 (by construction).
In order to verify whether gene expression data contain the information needed to discriminate normal from tumoral prostatic samples, we applied the PCA+MLDA approach to actual biological data, obtaining a high classification accuracy (96.5%) under leave-one-out cross-validation. In this case, we used all the principal components in order to avoid losing information. PCA is applied only to reduce computational cost and memory use; it is important to mention that the numerical results are identical in the absence of the PCA step. Notice that MLDA does not require a preliminary feature selection step, because it also works for high dimensional data. Therefore, it was possible to include all of the 25,000 genes of the microarray dataset.
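The leave-one-out procedure mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a simple nearest-centroid classifier stands in for PCA+MLDA (it is not the method used in this work), and the function names and toy values are hypothetical.

```python
def nearest_centroid_predict(train, train_labels, x):
    """Predict the label whose class centroid is closest to x.

    NOTE: stand-in classifier for illustration only, not MLDA.
    """
    centroids = {}
    for label in set(train_labels):
        members = [v for v, l in zip(train, train_labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda l: squared_distance(centroids[l], x))


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train, train_labels, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical one-feature toy data (not taken from the microarray dataset).
toy_samples = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2]]
toy_labels = ["normal"] * 4 + ["tumoral"] * 3
toy_accuracy = leave_one_out_accuracy(toy_samples, toy_labels)
```

With 57 samples the same loop runs 57 times, and 2 misclassified held-out samples correspond to 55/57, i.e. the 96.5% reported above.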
Since it was possible to verify that gene expression data retains information for classification, we analyzed the ψ MLDA projection matrix which contains the weight values for each feature (gene). Notice that the majority of the genes shown in Figure 3 have weights near zero, and only a few genes actually have discriminative information (high weight).
By analyzing Table 1, it is possible to verify that most of the 100 informative genes had already been described in the literature as genes related to cancer (76 genes), and 45 genes had specifically been associated with prostate tumors. Interestingly, most of the other 24 genes do not have references describing their functionality. Therefore, they may be associated with cancer but have not been studied yet. The description of the 76 genes in the literature corroborates the results obtained by the PCA+MLDA method, indicating that these genes are informative for discriminating between normal and tumoral samples. The stability and robustness of this result were verified by recovering around 80% of the same top 100 genes when five observations were randomly excluded from the normal samples and five from the tumoral samples in 100 re-calculations. For more details about the annotation of the top 100 genes and the complete list of the ~25,000 genes, please see Additional file 2.
Comparing the weights obtained by MLDA with the differentially expressed genes, it is surprising that the most differentially expressed genes are not necessarily the most discriminative ones. In other words, a multivariate combination of genes may be regulating the normal/tumoral state, i.e., the combination of genes may contain more information about normal/tumoral conditions than a single differentially expressed gene considered in a univariate fashion.
The seven "hub" genes (table columns: mean Z-value (normal); mean Z-value (tumoral)).
Almost all of the top seven genes identified as the most discriminative features between normal and tumoral phenotypes had previously been described in the literature as being associated with cancer. The only gene that so far has not been correlated with cancer is HAN11, probably because little is known about this gene (only two articles describing it were found in the literature). Five of these top seven genes, namely MYLK, KLK2, KLK3, LTF and TGM4, had already been specifically related to prostate carcinoma (Table 1).
Myosin light chain kinase (MYLK) is one of them. This enzyme catalyzes the phosphorylation of a specific serine residue on the 20 kD light chain of myosin II (MLC20), consequently regulating the actin-myosin II interaction. This reaction is responsible for smooth muscle contraction/relaxation and organization of the cytoskeleton. Consistent with the central role played by the cytoskeleton in cell division and motility, MYLK inhibition has been shown to induce apoptosis in mammary and prostate cancer cells and to inhibit the growth of mammary and prostate tumors in rats and mice. Furthermore, since MLC20 phosphorylation is necessary for cell motility [25, 26], MYLK inhibition blocks cancer cell invasion and adhesion in vitro. As a result, some reports have described the use of MYLK inhibitors as anti-cancer agents, since they prevent cancer cell migration [27, 28].
KLK3, also known as prostate specific antigen (PSA), is another gene which presents high functional connectivity in tumoral samples. PSA is a serine protease of the human kallikrein gene family which is secreted into seminal plasma and is responsible for semen liquefaction. It is the first FDA (Food and Drug Administration)-approved tumor marker for cancer detection. The prostatic gland volume affects the PSA level in serum, because PSA is produced and secreted by prostatic tissue [30, 31]. However, increased levels of KLK3 are also observed in some patients with benign prostate hyperplasia. Therefore, an elevated PSA concentration in a patient's plasma may be indicative not only of prostate cancer, but also of other prostatic pathologies. Consequently, the use of PSA as a cancer-specific marker is questioned.
Nowadays, 15 members of the kallikrein family (KLKs) are described in humans. Among the KLKs, the highest homology is found between PSA and KLK2: the identity is 78% and 80% at the amino acid and DNA level, respectively. KLK2 is another gene that presented functional connectivity changes between normal/tumoral conditions. The ratio of KLK2 to free PSA improves the discrimination between benign prostate hyperplasia and prostate cancer patients. In addition, it has already been described that KLK2 discriminates between high and low grade tumors. There is evidence indicating that KLK2 is more closely correlated with total tumor volume and higher grade prostate cancers than PSA.
Identification of both of these classic biomarkers of prostate carcinomas (PSA and KLK2), in our list of the most informative genes, provides additional evidence to the hypothesis that functional connectivity changes and not only differential expression levels are highly correlated to normal/tumoral process.
Another gene classified as one of the most discriminative prostate cancer biomarkers, and whose anti-tumorigenic role has already been described, is lactotransferrin (LTF). This non-heme iron-binding glycoprotein is found in a variety of biological secretions, such as semen, as well as in several secretions derived from glandular epithelium cells, including the prostate. LTF mRNA and protein levels are downregulated in prostate cancer, with significant PSA recurrence associations, due to promoter silencing by hypermethylation. It has been reported that bovine lactotransferrin significantly inhibits colon, esophagus, lung, bladder and liver cancers in rats. Prostate cancer cells treated with LTF presented a high apoptotic response, growth arrest at G1 and a reduced S phase, suggesting a role for specific cell cycle regulatory mechanisms in LTF-mediated cell growth inhibition.
CSRP1 (cysteine and glycine-rich protein 1) and TGM4 (human prostate-specific transglutaminase gene) are two other genes that become "hubs" during tumoral development. The former belongs to the CSRP family, encoding a group of LIM domain proteins which may be involved in regulatory processes that are important for development and cellular differentiation. Hirasawa and collaborators (2006) suggest the use of CSRP1 as an important biomarker of hepatocellular carcinoma malignancy, because CSRP1 is inactivated in this model by aberrant methylation. The latter, TGM4, was described as a candidate biomarker of region-specific epithelial identity in the prostate, being involved in the formation of stable protein-protein or protein-polyamide bonds.
Therefore, the literature supports the suggestion that these top seven genes (except for HAN11) may be considered the most closely related and informative prostate cancer biomarkers. Consequently, this suggests that the malignant transformation process in prostatic tissue is more correlated with functional connectivity changes in the gene dependence networks than with differential gene expression itself.
Almost all of the 100 genes identified by PCA+MLDA are correlated with cancer and, in many cases, with prostate cancer. For example, TIMP3 and ADAMTS1 (Table 1) are genes classically correlated with invasion and the metastatic process, the main cancer attributes responsible for death.
In summary, our main goal using PCA+MLDA was not dimension reduction or verification of the classification accuracy, but to investigate the discriminative characteristics extracted from the whole microarray dataset and how one can interpret them, although this procedure may also be used for classification, yielding good results, as previously described.
We have demonstrated that changes in functional connectivity may underlie the biological process which renders some genes more informative for discriminating between normal and tumoral conditions. Using the proposed PCA+MLDA method to analyze the multivariate characteristics of genes, it was possible to capture the changes in dependence networks which are related to cell transformation. Identification of seven genes (MYLK, KLK2, KLK3, HAN11, LTF, CSRP1, TGM4) whose connectivity is altered between normal/tumoral conditions may provide novel insights into specific targets against tumor progression.
Principal component analysis is a dimension reduction technique used to reduce the high dimensional space (number of genes).
PCA is defined as a linear transformation which maps the data to a new orthogonal coordinate system. The linear combinations are constructed so that the greatest variance by any projection lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
In other words, PCA summarizes the original features information by retaining characteristics of the dataset which most contribute to its variance.
For a gene expression data matrix X containing the genes in the columns and the observations in the rows (normalized to have zero mean and unit variance), the PCA transformation matrix ψ PCA is given by
$\psi_{PCA} = \mathrm{eigenvectors}(\mathrm{cov}(X^{T}))$ (1)
where cov is the covariance matrix. In order to prevent losing any variance information, $\psi_{PCA}$ is composed of all eigenvectors with non-zero eigenvalues. Here, PCA is used only to reduce computational and memory costs.
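As an illustration of the idea that the first principal component is the direction of greatest variance, the following minimal sketch estimates it by power iteration on the gene-by-gene covariance matrix. It is a sketch under assumptions, not the implementation used here: power iteration is swapped in for a full eigendecomposition, the data are only centered (not scaled to unit variance), and the function names and toy values are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance between columns (genes); rows are observations."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
            for a in range(m)]


def first_principal_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration (unit length).

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance: no direction of variance to find
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: two "genes" where the second is twice the first.
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
toy_pc1 = first_principal_component(covariance_matrix(toy_rows))
```

For the toy data all of the variance lies along the direction (1, 2), so the first component is proportional to that vector.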
In gene expression data analysis, we usually have a large number of genes (features), but only a small number of observations, i.e., microarray experiments.
A critical problem in applying conventional Linear Discriminant Analysis (LDA) to these types of data is the singularity and instability of the within-class scatter matrix calculated when the number of features approaches the number of available examples. In order to overcome this limitation, we applied the MLDA approach.
The MLDA method is concerned with stabilizing the pooled covariance matrix estimate $S_p$. The stabilized covariance matrix is constructed by selecting the largest dispersions with respect to the average eigenvalue of $S_p$. It is based on the maximum entropy covariance selection idea developed by Thomaz et al. (2004).
It is known that the estimation errors of small eigenvalues are greater than those of large eigenvalues. Therefore, Thomaz et al. (2007) proposed to expand only the smaller and less reliable eigenvalues of $S_p$, keeping most of the larger eigenvalues unchanged.
The algorithm may be described as follows:
1. Calculate the within-class scatter matrix $S_w$ from the observations $x_{i,j}$, where $x_{i,j}$ is the m-dimensional (m: number of genes) observation j from class $\Pi_i$ (i = 1, 2, where 1 = normal and 2 = tumoral in our case) containing the gene expressions in the rows, $n_i$ is the number of observations (microarrays) from class $\Pi_i$, g is the total number of classes (g = 2 in our case), and n is the total number of microarrays, i.e., $n = n_1 + \dots + n_g$.
2. Calculate the eigenvectors $\psi$ and eigenvalues $\Lambda$ of $S_p$, where $S_p = S_w/[n - g]$.
3. Calculate the average eigenvalue $\bar{\lambda}$ of $S_p$.
4. Construct the new matrix of eigenvalues based on the following largest dispersion criterion: $\Lambda^{*} = \mathrm{diag}[\max(\lambda_1, \bar{\lambda}), \dots, \max(\lambda_m, \bar{\lambda})]$
The main advantage of MLDA is that it avoids both the singularity and instability of the within-class scatter matrix S w when applied directly to gene expression data, which consists of a low number of observations and a high number of features.
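The largest dispersion criterion, in which each eigenvalue of the pooled covariance matrix is replaced by the larger of itself and the average eigenvalue, can be sketched as below. This is a minimal illustration only: it assumes the eigenvalues have already been computed by some eigendecomposition routine, and the function name and toy values are hypothetical.

```python
def expand_small_eigenvalues(eigenvalues):
    """Apply max(lambda_i, average eigenvalue) to each eigenvalue.

    Eigenvalues at or above the average are kept unchanged; smaller ones
    (including zeros from a singular matrix) are raised to the average.
    """
    average = sum(eigenvalues) / len(eigenvalues)
    expanded = [max(lam, average) for lam in eigenvalues]
    return expanded, average


# Hypothetical eigenvalues of a singular pooled covariance matrix:
# with few observations and many genes, most eigenvalues are zero.
toy_eigenvalues = [4.0, 2.0, 0.0, 0.0]
toy_expanded, toy_average = expand_small_eigenvalues(toy_eigenvalues)
```

Because every expanded eigenvalue is at least as large as the (positive) average, the matrix rebuilt from them is no longer singular, which is what allows the discriminant analysis to proceed when genes far outnumber observations.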
The implemented R code is available in the Additional file 3.
This simulation was designed in order to demonstrate that MLDA is capable of discriminating two different conditions and also of identifying the intrinsic functional connectivity changes underlying the tumoral process. For this simulation, artificial gene expressions for 25,000 genes (features) were generated, following a previously described simulation design. The 25,000 genes were divided into three sets: A (250 genes), B (250 genes) and C (24,500 genes). For each gene, 30 observations representing the "normal" condition and 30 observations representing the "tumoral" condition were generated. The model used to investigate the situation where there are functional connectivity changes but no differences in gene expression between conditions 1 and 2 was as follows:
$\phi(A) = 1 + 0.3\,\varepsilon$
$\mathrm{gene}(A) = \phi(A) + 0.3\,\theta_A$
$\mathrm{gene}(B) = \phi(B) + 0.5\,\theta_B$
$\mathrm{gene}(C) = \theta_C$
where $\varepsilon$, $\epsilon$, $\theta_A$, $\theta_B$ and $\theta_C$ are independent Gaussian random variables with mean of zero and variance of one. This model considers two latent variables $\phi(A)$ and $\phi(B)$. Moreover, there is a functional relationship between A and B. Notice that there is no difference in means between A and B.
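To make the idea of a connectivity change without a change in mean expression concrete, the following sketch simulates one gene from set A and one from set B. The definition of the latent variable for B is not given above, so the code ASSUMES, purely for illustration, that it equals the latent variable of A plus 0.3 times independent Gaussian noise in the coupled condition, and 1 plus 0.3 times independent Gaussian noise in the uncoupled condition. Function names, the seed and the sample size are likewise hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_pair(n_obs, coupled, rng):
    """Simulate one gene from set A and one from set B."""
    gene_a, gene_b = [], []
    for _ in range(n_obs):
        phi_a = 1 + 0.3 * rng.gauss(0, 1)
        if coupled:
            # ASSUMPTION: B's latent variable follows A's (functional link).
            phi_b = phi_a + 0.3 * rng.gauss(0, 1)
        else:
            # ASSUMPTION: B's latent variable is independent of A's.
            phi_b = 1 + 0.3 * rng.gauss(0, 1)
        gene_a.append(phi_a + 0.3 * rng.gauss(0, 1))
        gene_b.append(phi_b + 0.5 * rng.gauss(0, 1))
    return gene_a, gene_b


rng = random.Random(0)
# A large sample is used so that the estimates are stable; the text uses 30.
a_coupled, b_coupled = simulate_pair(5000, True, rng)
a_free, b_free = simulate_pair(5000, False, rng)
corr_coupled = pearson(a_coupled, b_coupled)
corr_free = pearson(a_free, b_free)
mean_b_coupled = sum(b_coupled) / len(b_coupled)
mean_b_free = sum(b_free) / len(b_free)
```

Under these assumptions the mean of the B gene stays close to 1 in both conditions, while its correlation with the A gene is clearly positive only in the coupled condition, so a univariate test on means would not separate the conditions.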
In order to identify putative differentially expressed genes, we applied the non-parametric Wilcoxon test under a false discovery rate (FDR) control of 5%. The Wilcoxon procedure tests the median; therefore, it is more robust to outliers than the t-test (which tests the mean).
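The FDR control step can be sketched as follows, assuming the Benjamini-Hochberg step-up procedure; the text does not name the specific FDR procedure, so this choice is an assumption. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= k * q / m,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical per-gene p-values (e.g. from a Wilcoxon test).
toy_p_values = [0.8, 0.001, 0.2, 0.012, 0.035]
toy_rejected = benjamini_hochberg(toy_p_values, q=0.05)
```

Note that the third smallest p-value (0.035) is below 0.05 but is not rejected, because its rank-specific threshold is 3 × 0.05 / 5 = 0.03.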
Relevance networks were constructed using Hoeffding's D measure, a non-parametric association measure which is more robust to outliers than Pearson's correlation (the R code is freely available in the Hmisc package). Pairwise associations were measured and the false discovery rate (FDR) was controlled at 1, 5 and 10%. "Hub" genes were determined by calculating the degree (the number of adjacent edges, i.e., functional connectivities) of each gene and selecting the genes with the highest degrees.
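The degree-based selection of "hub" genes can be sketched as below. This is an illustration only: it assumes the network is already available as a list of significant gene pairs (edges), and the function names and the gene labels g1 to g4 are hypothetical placeholders rather than genes from the dataset.

```python
from collections import Counter


def degrees(edges):
    """Count the number of adjacent edges for every gene in the network."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree


def top_hubs(edges, k):
    """Return the k genes with the highest degree."""
    return [gene for gene, _ in degrees(edges).most_common(k)]


def degree_change(edges_normal, edges_tumoral):
    """Degree in the tumoral network minus degree in the normal network."""
    normal, tumoral = degrees(edges_normal), degrees(edges_tumoral)
    genes = set(normal) | set(tumoral)
    return {gene: tumoral[gene] - normal[gene] for gene in genes}


# Hypothetical networks: g1 gains connections in the tumoral condition.
toy_normal = [("g1", "g2"), ("g3", "g4")]
toy_tumoral = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]
toy_hubs = top_hubs(toy_tumoral, 1)
toy_change = degree_change(toy_normal, toy_tumoral)
```

Comparing degrees between the two networks, as in `degree_change`, is one simple way to express a change in functional connectivity for each gene.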
We have analyzed the normal and tumoral prostate dataset publicly available at the Stanford MicroArray Database [48, 19]. This dataset is composed of ~25,000 genes with 32 observations for normal state and 25 for tumoral condition.
This work was supported by grants of the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.