Patients
Tumor specimens from 118 women with primary breast cancer that either expressed the canonical long form of ERα (ERα66) [ER+] or did not [ER-] were collected between 1980 and 1998 and stored in the Paul Strauss Cancer Center bio-bank. They were used with the patients’ verbal informed consent and with the approval of the hospital ethics committee. Since the tumor pieces used in this study were regarded as post-operative waste material, verbal consent was recorded by the surgeon during the preoperative examination. The Hospital Ethics Committee for Clinical Research, located at the Paul Strauss Center for Anticancer Research, 3 rue Porte de l’Hôpital, 67000 Strasbourg, France, approved the procedure. Sixty [ER+] and 58 [ER-] tumor samples were included in this retrospective study. Immediately after resection, one half of each tumor was snap-frozen in liquid nitrogen, whereas the other half was fixed in 4 % formalin and used for immunohistological analyses. [ER] status was determined by a standard ligand binding assay. In short, snap-frozen tumor samples were pulverized and cytosols were extracted by ultracentrifugation. Human serum albumin was used as a standard control for protein normalization. Cytosol (10 μL) was incubated with 5 nmol/L [3H]estradiol. After incubation, 100 μL of supernatant was transferred to an isoelectric focusing gel in order to separate bound, unbound and nonspecifically bound hormone. Samples with >10 fmol/mg bound ER were considered [ER+].
RT-qPCR analysis
Expression levels of ERα66, ERα36, GPER, EGFR and HER2, as well as SNAIL1, CXCR4, RANKL, DDB2, VIM and MMP9, were determined by real-time PCR. The gene encoding the large ribosomal protein (RPLPO) was used as a control to obtain normalized values. Primers are listed in [see Additional file 7: Table S4]. Assays were performed at least in triplicate, and the mean values were used to calculate expression levels by the ΔCt method, relative to the expression of the RPLPO housekeeping gene. Briefly, total RNA was extracted using RNeasy Plus Universal tissue Mini (Qiagen, Courtabœuf, France) and reverse transcribed (GoScript Reverse Transcription System, Promega, Charbonnières-les-Bains, France). Real-time PCR was then performed using iTaq Universal SYBR Green Supermix (Bio-Rad, France) in an Opticon2 thermocycler (Bio-Rad) as described elsewhere [26].
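As an illustration of the normalization step, the following minimal sketch computes a ΔCt value from replicate wells and converts it to a relative expression level. The function names, the Ct values and the conversion 2^(-ΔCt) (which assumes a PCR efficiency close to 100 %) are assumptions made for this example and are not taken from the study.

```python
from statistics import mean

def delta_ct(target_ct, reference_ct):
    """Delta Ct = mean Ct of the target gene minus mean Ct of the reference gene."""
    return mean(target_ct) - mean(reference_ct)

def relative_expression(target_ct, reference_ct):
    """Relative expression 2^(-delta Ct); assumes roughly 100 % PCR efficiency."""
    return 2.0 ** (-delta_ct(target_ct, reference_ct))

# Hypothetical triplicate Ct values (illustrative only, not study data).
era36_ct = [27.0, 27.2, 26.8]   # target gene
rplpo_ct = [18.0, 18.1, 17.9]   # housekeeping reference

d_ct = delta_ct(era36_ct, rplpo_ct)
level = relative_expression(era36_ct, rplpo_ct)
print(f"delta Ct = {d_ct:.2f}, relative expression = {level:.5f}")
```

A higher ΔCt means that the target needs more cycles than the reference to reach the detection threshold, and therefore corresponds to a lower normalized expression level.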
Statistical analysis and modeling
Mathematical modeling of biological processes has recently emerged as an essential tool to help cancer biologists and clinical pathologists improve personalized diagnosis, therapy and prognosis. The first step in many gene regulatory network modeling tasks is the identification of co-regulated or co-expressed genes. To this end, most studies rely on linear correlation and statistical hypothesis tests. However, these tools do not detect nonlinear relationships between gene expressions, although nonlinearity is generally the case [13, 14]. We therefore propose to apply nonlinear correlation and conditional mutual information techniques to the gene expressions in order to detect co-regulated genes more accurately and exhaustively. More precisely, to confirm that a relationship exists between two gene expressions, we combine two hypothesis tests. The first is based on a nonlinear correlation measure, Spearman’s rank correlation coefficient, to which we associate a hypothesis test on the dependence of the two gene expressions. When the p-value of this test is less than or equal to a fixed threshold (0.05 or 0.01 in our study), we conclude that a link between these genes is possible; this link must then be confirmed by a second computation based on the mutual information value together with a significance analysis.
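To make the first test concrete, the sketch below computes Spearman’s rank correlation coefficient as the Pearson correlation of the ranks and compares it with the linear coefficient on a monotonic but strongly nonlinear relationship. The data and function names are hypothetical, and the sketch stops at the coefficient: the associated p-value would in practice come from a statistics library (for example `scipy.stats.spearmanr`, which returns both the coefficient and a p-value).

```python
from math import exp, sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values: gene_b grows exponentially with gene_a.
gene_a = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
gene_b = [exp(2 * v) for v in gene_a]

rho = spearman(gene_a, gene_b)
r = pearson(gene_a, gene_b)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Because the ranks of the two series coincide, the rank correlation is maximal while the linear coefficient is noticeably lower. Note that Spearman’s coefficient captures monotonic relationships only, which is one reason for crossing it with a mutual information test.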
We consider statistical significance testing for the mutual information M(X, Y), where X and Y are the random variables associated with the expression of two genes among the target genes. Mutual information is a measure of the mutual dependence of two variables; here we use it to measure this dependence for every pair of genes. The null hypothesis H0 of the test is that X and Y are independent.
The mutual information M(X, Y) is given by
$$ M(X,Y)=H(X)+H(Y)-H(X,Y) $$
where H(X) and H(Y) are the marginal entropies and H(X, Y) is the joint (Shannon) entropy of X and Y.
Here, for the samples (x_i), i = 1, ..., n, the marginal entropy is computed as
$$ H(X)=-\sum_{i=1}^{n}P\left(x_i\right)\log_2\left(P\left(x_i\right)\right) $$
and the joint entropy is computed by
$$ H\left(X,Y\right)=-\sum_{i=1}^{n}\sum_{j=1}^{n}P\left(x_i,y_j\right)\log_2\left(P\left(x_i,y_j\right)\right) $$
Intuitively, mutual information measures how much knowing one of these variables reduces the uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa. So their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in Y (or X) alone, i.e. the entropy of Y (or X).
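The two extreme cases can be checked numerically. The sketch below estimates the entropies from empirical frequencies of already discretized values and evaluates M(X, Y) = H(X) + H(Y) − H(X, Y). The discretization and the toy data are assumptions of this example, since continuous expression levels must first be binned in some way that is not specified here.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of the symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(x, y):
    """M(X, Y) = H(X) + H(Y) - H(X, Y), with the joint entropy taken over pairs."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical discretized expression levels (four levels, two samples each).
x = [0, 0, 1, 1, 2, 2, 3, 3]
y_bijective = [3 - v for v in x]            # X and Y determine each other
y_independent = [0, 1, 0, 1, 0, 1, 0, 1]    # every (x, y) pair equally frequent

print(entropy(x))                            # 2.0 bits
print(mutual_information(x, y_bijective))    # 2.0, equal to H(X)
print(mutual_information(x, y_independent))  # 0.0
```

In the first case the mutual information equals the entropy of X, and in the second it vanishes, in line with the two extremes described above.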
First, we estimate the distribution of the mutual information under H0. The main difficulty with the mutual information measure is that there is no “reference” value (0.8, for example) above which the two variables can be declared dependent. To decide whether or not the two variables are dependent, we therefore perform a hypothesis test that compares the experimental data with randomly generated data. These surrogate data series are obtained by permuting the elements of one of the two gene expression series. We then compare the resulting mutual information values: if the value computed from the original data is significantly higher than those computed from the surrogates, we conclude that the two variables (here, the gene expressions) are dependent.
Importantly, these surrogates are computed from the same number of observations and preserve the distributions of X and Y (Fig. 3). We can then determine a one-sided p-value for the observed mutual information, i.e. the probability, assuming H0, of observing a mutual information value greater than the one actually measured. This can be done either by directly counting the proportion of surrogates that exceed the measured value, or by assuming a normal distribution of the mutual information and computing the p-value with a z-test.
When this p-value falls below a given significance threshold, often 0.05 or 0.01, the observed result would be highly unlikely under the null hypothesis H0. We then reject H0 and conclude that a significant relationship exists between the two gene expressions.
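The surrogate procedure can be sketched as follows, using the counting variant of the p-value. Several choices here are assumptions of the example rather than details given in the text: the rank-based discretization into four bins, the number of surrogates, the add-one correction that keeps the estimated p-value above zero, and the synthetic data.

```python
import random
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of the symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(x, y):
    """M(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def quantile_bins(values, n_bins=4):
    """Discretize by rank into n_bins groups of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    bins = [0] * len(values)
    for rank, k in enumerate(order):
        bins[k] = rank * n_bins // len(values)
    return bins

def mi_permutation_test(x, y, n_surrogates=1000, n_bins=4, seed=0):
    """Return the observed MI and a one-sided p-value estimated from surrogates."""
    rng = random.Random(seed)
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    observed = mutual_information(bx, by)
    shuffled = list(by)
    exceed = 0
    for _ in range(n_surrogates):
        # Permuting one series breaks the pairing but keeps both marginals.
        rng.shuffle(shuffled)
        if mutual_information(bx, shuffled) >= observed:
            exceed += 1
    # Add-one correction (an assumption of this sketch).
    return observed, (exceed + 1) / (n_surrogates + 1)

# Synthetic data: gene_y depends on gene_x non-monotonically, gene_z does not.
data_rng = random.Random(1)
gene_x = [data_rng.uniform(0, 4) for _ in range(60)]
gene_y = [(v - 2) ** 2 + data_rng.gauss(0, 0.2) for v in gene_x]
gene_z = [data_rng.uniform(0, 4) for _ in range(60)]

mi_xy, p_xy = mi_permutation_test(gene_x, gene_y)
mi_xz, p_xz = mi_permutation_test(gene_x, gene_z)
print(f"X-Y: MI = {mi_xy:.3f} bits, p = {p_xy:.4f}")
print(f"X-Z: MI = {mi_xz:.3f} bits, p = {p_xz:.4f}")
```

The U-shaped dependence between `gene_x` and `gene_y` is the kind of relationship that a rank correlation can miss, whereas the mutual information of the original pairing stands well above the surrogate values.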
From these networks, we evaluate in three steps how relevant a single gene is as a breast tumor classifier. First, after choosing the gene and a classification threshold that separates the samples into two categories, we identify two networks, one per category, connecting the gene to the other markers by using nonlinear correlation and mutual information techniques. Then, we define and compute the distance between the two networks, which takes into account both the structural differences between the networks (whether or not a relation exists between the markers, and its direction when it exists) and the behavioral differences in the relationships between genes. The distance between the two networks thus represents the classification performance of the classifier gene and allows us to find the most relevant classifiers.