Integrative network analysis of TCGA data for ovarian cancer

Zhang, Qingyang; Burdette, Joanna E; Wang, Ji-Ping

doi:10.1186/s12918-014-0136-9

Methodology Article
Open access
Published: 31 December 2014

Integrative network analysis of TCGA data for ovarian cancer

Qingyang Zhang¹,
Joanna E Burdette² &
Ji-Ping Wang¹

BMC Systems Biology volume 8, Article number: 1338 (2014) Cite this article

9312 Accesses
39 Citations
4 Altmetric
Metrics details

Abstract

Background

Over the past years, tremendous efforts have been made to elucidate the molecular basis of the initiation and progression of ovarian cancer. However, most existing studies have been focused on individual genes or a single type of data, which may lack the power to detect the complex mechanisms of cancer formation by overlooking the interactions of different genetic and epigenetic factors.

Results

We propose an integrative framework to identify genetic and epigenetic features related to ovarian cancer and to quantify the causal relationships among these features using a probabilistic graphical model based on the Cancer Genome Atlas (TCGA) data. In the feature selection, we first defined a set of seed genes by including 48 candidate tumor suppressors or oncogenes and an additional 20 ovarian cancer related genes reported in the literature. The seed genes were then fed into a stepwise correlation-based selector to identify 271 additional features including 177 genes, 82 copy number variation sites, 11 methylation sites and 1 somatic mutation (at gene TP53). We built a Bayesian network model with a logit link function to quantify the causal relationships among these features and discovered a set of 13 hub genes including ARID1A, C19orf53, CSKN2A1 and COL5A2. The directed graph revealed many potential genetic pathways, some of which confirmed the existing results in the literature. Clustering analysis further suggested four gene clusters, three of which correspond to well-defined cellular processes including cell division, tumor invasion and mitochondrial system. In addition, two genes related to glycoprotein synthesis, PSG11 and GALNT10, were found highly predictive for the overall survival time of ovarian cancer patients.

Conclusions

The proposed framework is effective in identifying possible important genetic and epigenetic features that are related to complex cancer diseases. The constructed Bayesian network has identified some new genetic/epigenetic pathways, which may shed new light into the molecular mechanisms of ovarian cancer.

Background

Ovarian cancer, one of the most malignant gynecologic cancers, is the fifth leading cause of cancer-related deaths among women in the United States. According to the American Cancer Society, 21,980 women will receive a new diagnosis and 14,270 will die of this disease in 2014. The majority of ovarian cancers are serous ovarian carcinomas and only less than 20% of them can be detected early. High-stage cancer patients are usually treated with platinum/taxane-based chemotherapy after debulking surgery. Platinumresistant cancer recurs in approximately 25% of patients within six months after therapy, and the overall five-year survival probability is only 31% [1]. While the molecular mechanism of ovarian cancer remains unclear, studies have suggested that many different factors may contribute to this disease, among which there are tens of well-known oncogenes and tumor suppressors including TP53, PIK3C, BRCA1 and BRCA2. In particular the mutation of gene TP53 is the most common, occurring in at least 70% of advanced-stage cases [1],[2]. Many of the existing studies however, have been focused on a single type of data, most frequently, gene expression analysis [3]-[5]. As pointed out by many researchers, the analysis based on individual gene often fail to provide even moderate prediction accuracy of the cancer status. Thus a systems biology approach that combines multiple genetic and epigenetic profiles for an integrative analysis provides a new direction to study the regulatory network associated with ovarian cancer.

The rapid advances in next-generation sequencing technology now allow genome-wide analysis of genetic and epigenetic features simultaneously. The timely advent of TCGA project has provided the most comprehensive genomic data resource from over 20 types of cancers (http://cancergenome.nih.gov/). For example, the TCGA ovarian cancer data contain both clinical and molecular profiles from 572 tumor samples and 8 normal controls. The molecular profile includes gene expression (microarray), genotype (SNP), exon expression, MicroRNA expression (microarray), copy number variation (CNV), DNA methylation, somatic mutation, gene expression (RNA-seq), MicroRNA-seq and protein expression. The clinical information includes records on recurrence, survival, and treatment resistance. These massive complex data sets have driven enthusiasm to study the molecular mechanism of cancers through computational approaches [1],[6]-[8]. Among the developed methods, Bayesian Network (BN) is one of the most frequently used multivariate models. The BN approach is more appealing than graphs constructed based on correlation or mutual information metrics for it allows rigorous statistical inference of causality between genetic and epigenetic features. However most of the existing studies have been focused on one type of data either continuous or discrete [9]-[13]. How to combine different types of complex data for causal inference in BN poses a big challenge. In addition, deducing the complex network structure from data remains an open problem partially due to the lack of prior information, relatively smaller sample size and the high dimensionality of data (number of possible nodes) [13],[14].

A necessary and important step to construct a BN from tens of thousands of features is feature selection, i.e., to identify a subset of the most-relevant features. Removing irrelevant or redundant features helps improve computing efficiency and estimation accuracy in the causal network. Existing feature selection methods can be roughly classified into two categories: wrapper approach [15],[16] and filter approach [17]-[19]. For large data sets, the filter approach using significance test for difference between the cancer and control samples is more commonly used due to its simplicity. As some features could be causal to other features while having no direct association with the cancer phenotypes, the independent test can filter out many related features (see a simulation study in the Methods section). One innovation of this paper is a novel stepwise correlation-based selector (SCBS) that mimics the hierarchy of the BN for feature selection. The selected features from the TCGA data are a mixture of continuous and categorical variables. To integrate them into the same BN, we discretize the continuous variables and use a logit link function for casual inference. The proposed approach is applied to the TCGA ovarian cancer data and leads to a series of interesting findings that shed light into the genetic/epigenetic mechanisms of ovarian cancer.

Results

Preprocessing of TCGA ovarian cancer data

In this paper, we only consider four types of molecular data including gene expression, DNA copy number variation, promoter methylation and somatic mutation (summarized in Table 1). This data set contains the expression values of 17,812 genes, out of which, 12,831 had methylation level measured for each CpG island located in their promoter regions. If multiple CpG islands exist for a given gene, we took the average as the overall methylation level. The copy number was measured for each chromosomal segment, recorded as a seg.mean value, with the segment length varying from hundred up to tens of millions base pairs by the Circular Binary Segmentation (CBS) performed by TCGA. Out of 17,812 genes, 15,352 have well defined locations on the genome provided by UCSC Genome Browser (http://genome.ucsc.edu/) and each of them was assigned a value as the copy number. If a gene entirely falls within a chromosome segment, we assigned it the corresponding seg.mean value (236 out of 15,352 genes that spans two chromosomal segments were not assigned any seg.mean value). For the somatic mutation data, we defined a binary variable where "1" stands for all the non-silent mutations (coding different amino acid) and "0" for silent mutations or not being mutated, resulting in 9,895 genes with somatic mutations (mutation occurred in at least one sample). For the gene expression and methylation data, we applied an existing method by [8] to remove the effects due to different age groups and batches. Figure 1 illustrates an example for BRCA1 gene where the boxplots showed this procedure was effective in removing the age and batch effects.

Table 1 Summary of TCGA ovarian cancer data

Full size table

The second step in data preprocessing is to discretize the continuous random variables. We classified the gene expression level into three groups using k-means clustering algorithm, namely low, midium, and high. Likewise the promoter methylation level was classified into either hyper or hypo state, and the copy number status into two states: gain or loss. The discretized variables were the input of the Bayesian network to be discussed below.

Feature selection

The pipeline we propose in this paper assumes that cancer phenotype is directly associated with gene expression, which can be potentially driven by genetic and epigenetic changes (this assumption was made from the biological point of view. It can be dropped without affecting the modeling and computing). Figure 2 illustrates the workflow of our proposed framework. We first identify a set of tumor suppressors and oncogenes by differential expression analysis between the cancer and control groups. This set of genes, together with those that were previously reported in the literature as cancer-relevant, form the set of seed genes for further feature selection. These seed genes are then fed into our proposed stepwise correlation-based selector (SCBS) to select other features. The SCBS is motivated by the hierarchy of causality in the Bayesian network. For example, suppose there is a causal relationship A →B →cancer. Though A to B or B to cancer has strong directed association, the association between A and cancer could decay greatly so that it cannot be detected. The SCBS procedure starts with detection of features significantly correlated with the cancer and then progressively selects subsequent features that correlate with the selected features. Our simulation study presented in the Methods section shows that the proposed SCBS is more effective to select important features that are involved in the phenotype-related pathways but indirectly associated with the cancer phenotype.

The first step in feature selection is to define a set of seed genes out of 12,000 genes that have records of expression level and at least one of the three (epi)genetic factors (i.e., CNV, methylation and mutation). We conducted a nonparametric test to identify the most differentially expressed genes between case and control groups, as well as the most differentially methylated genes, most differentially mutated genes and genes with most differentiate copy numbers. A gene was defined as a candidate oncogene or tumor suppressor if it satisfies all the following three conditions: (1) the p-value of gene expression was statistically significant under Benjamini-Hochberg (BH) procedure with false discovery rate (FDR) ≤ 0.05; (2) the p-value of somatic mutation/promoter methylation/copy number was also significant under BH procedure with FDR ≤0.05; and (3) the absolute value of correlation coefficient between gene expression and somatic mutation/promoter methylation/copy number was greater than 0.4. The p-values were calculated using Wilcoxon rank-sum test and the correlation coefficients were calculated using Pearson’s method. This procedure resulted in 48 potential tumor suppressors or oncogenes (Table 2), 16 of which were well-studied tumor suppressors and oncogenes for ovarian cancer from literature (out of a total of 36 presented in [2], Table 3). The union of the two sets of 68 genes were defined as seed genes and generated an additional 271 nodes out of more than 50,000 candidate features by the stepwise correlation based selection (SCBS) procedure, which include 177 gene expressions, 82 copy number variation sites, 11 methylation sites and one somatic mutation site at gene TP53.

Table 2 48 tumor suppressors and oncogenes from TCGA data

Full size table

Table 3 36 tumor suppressors and oncogenes from literature

Full size table

Bayesian network prediction

The 339 nodes (discretized if continuous) were fit into the Bayesian network through a logit link function using the blockwise coordinate descent algorithm for penalized maximum likelihood estimation procedure [20] (Methods).The predicted network contains 698 edges (Figure 3, details are tabulated in Additional file 1: Table S1), where the direction of the edge indicates the downstream feature is regulated by the upstream one. We found the CNVs are the major factor that accounts for differential gene expression. In addition most of the 82 genes were CNV-amplified in cancer samples, suggesting that many amplified genes may act as cancer drivers, confirming findings from a breast cancer study [21]. This network also confirmed many previously reported gene-gene interactions. To name a few, for example, the edge from TPX2 to AURKA could be explained by the fact that the protein encoded by the TPX2 gene activates AURKA by inducing autophosphorylation [22]. The connection between BRCA1 and NBR2 could be due to the shared bi-directional promoter between the two genes [23]. The connection between STAT3 and ETV6 was suggested previously that ETV6 is a negative regulator of STAT3 activity [24]. The edge from CDKN2A to CCNE1 is a known gene-gene regulation in the RB signaling pathway [1]. The edge from MYC to IMPHD2 confirms that MYC depletion results in repression of IMPHD2 (a gene coding rate-limiting enzyme) [25]. These results suggested that the proposed pipeline is capable of revealing important genetic or epigenetic pathways that underlie the complex cancer phenotype.

The average degree (indegree plus outdegree) of the graph is 4.124 indicating that the inferred network is sparse with moderate complexity. Due to the directionality of BN, one could also obtain the indegree and outdegree separately for every single node. Figure 4 shows the observed distribution of outdegree, where the mean and standard deviation are 2.15 and 2.31 respectively. We identified 13 nodes with significantly larger outdegree (greater than mean + 2 × SD, ≥7 edges) in the network including ARID1A, C19orf53, CSNK2A1, DERL1, TRMT6, COL5A2, TCF21, LUM, TPX2, UBE2C, DPM1, NDUFB7, and NDUFB9 (Table 4). These hub genes all have known functions and have causal effect on at least seven neighboring genes, suggesting that they may play important roles in driving corresponding local subnetworks. Some of the hub genes have been reported in the literature that are related to ovarian cancer. For instance, ARID1A is known to promote the formation of SWI/SNF chromatin remodeling complexes containing BRG1 or BRM, and is a candidate tumor suppressor not only in clear cell ovarian cancer but also in endometrioid cancers and uterine endometrioid carcinomas [22],[26],[27]. C19orf53 is known to be associated with Leydig cell tumors which are a member of the sex cord-stromal tumor group of ovarian and testicular cancers and it has a potential role in hypercalcemia of malignancy [22]. CSKN2A1 is a well-known oncogene that can phosphorylate a number of key intracellular signaling proteins implicated in tumor suppression (P53 and PTEN) and oncogenesis (MYC, JUN, NF-kappaB). This gene also influences Wnt signaling via beta-catenin phosphorylation and the PI3K signaling pathway via the phosphorylation of AKT [22]. Interestingly these 13 hub genes can clearly distinguish the cancer samples from the normal samples as revealed by a multi-dimensional scaling plot (MDS, [28]) based on the correlation dissimilarity metric (comparable clustering effect was observed based on the entire set of 245 genes, Figure 5a,b). This suggests that the thirteen hub genes may present the major difference between the cancer and normal samples. The early-stage and high-grade tumor samples however are not well distinguished.

Table 4 13 local drivers (hub genes) in the predicted network

Full size table

Gene clusters

The 245 genes (listed in Additional file 2: Table S2) were identified to fall into four major clusters corresponding to distinct functions by k-means clustering method (Figure 6b). Cluster 1 (black in Figure 7a) contains 18 genes, mainly related to cell division, mitosis, spindle formation etc. Cluster 2 (red) contains 23 genes, most of which are functionally related to growth factor, cell shape, cell motility, tumor invasion etc. Cluster 3 (green) contains 20 genes, mostly related to mitochondrial system, membrane process etc. Cluster 4 (blue in Figure 7a) is the largest and most complicated cluster harboring 184 genes. This large cluster communicates between the other three clusters (as shown in Figure 7b), which are nearly independent from each other. This can be seen from the summary statistics of within and between cluster causal edges in Table 5. These findings could be implicative of some important molecular pathways, which may or may not have been identified, that drive the development of ovarian cancer.

Table 5 Number of causal edges within/between four clusters in the TCGA ovarian cancer data

Full size table

We also looked into the subnetwork within each cluster. Figure 8 shows the local subnetwork corresponding to the first cluster which is involved mainly in cell division processes. Two hub genes, TPX2 and UBE2C, are in the central positions of this network. Our finding that there are four gene clusters may suggest that the development of ovarian cancer could be partially driven by the cell cycle regulation, as well as the pathways related to cell shape and motility, and mitochondrial system. Intervention (activation or suppression) on the hub genes or other important genes may alter the entire network, therefore may control key aspects of disease development.

Survival-centric network prediction

We applied SCBS and Bayesian network model to construct a survival-centric network. In feature selection, we used continuous data to select genetic/epigenetic features that were most relevant to the overall survival time of ovarian cancer patients. Starting from the first node (overall survival time in days), the SCBS procedure selected 66 genes (Table 6) and two methylation sites as the nodes in the Bayesian Network. Interestingly only 6 genes including CCDC19, MMP1, SLC family, TEKT2, WDR family and ZMYND10 had been reported relevant to cancer survival in a separate study (where a total of 88 genes were reported, [29]). The death risk (binary) is then used as the phenotype node to build a phenotype-induced network. We defined the overall survival time for less than 2 years as the "high-risk" (134 patients) and the survival time greater than 4 years as the "low-risk" (101 patients). In our predicted graph, there are 9 hub genes (similarly defined as above) that drive corresponding local subnetworks including C2orf39, FAP, SLC2A2, LAPTM5, CD53, THBS2, CCDC63, SLC17A, and LCT. Gene FAP has a known function to control fibroblast growth or epithelial-mesenchymal interactions during development, tissue repair, and epithelial carcinogenesis [22]. The inferred Bayesian network (Figure 9, details are tabulated in Additional file 3: Table S3) identified two genes, namely PSG11 and GALNT10, that may be directly associated with the overall survival time of ovarian cancer patients (Figure 10). Both genes are functionally related to glycoprotein synthesis, as well as many other genes in the network such as SLC2A2, SLC17A, CD53, THBS2, LCT, GYPA [22]. This indicates the biological pathway related to glycoprotein synthesis may be implicative of death risk of ovarian cancer patients. As reported in literature [30],[31], several tumor-associated glycoproteins were found on the surface of many cancer cells including ovary, breast, colon, and pancreatic cells and they may play potential roles in early detection of caners. One well-known such glycoprotein is CA-125 (encoded by gene MUC16), which is the primary protein used to measure serous cancer tumor load, especially during recurrence, and it is heavily glycosylated [31]. Pregnancy-specific glycoproteins (PSG) are mainly produced by the placental syncytiotrophoblasts during pregnancy and these proteins comprise a subgroup of the carcinoembryonic antigen family [22]. The protein encoded by gene GALNT10 may have increased catalytic activity toward glycosylated peptides compared to activity toward non-glycosylated peptides [22]. As pointed out by several research groups [32]-[34], some certain glycoproteins are closely associated with women cancers such as ovarian cancer and breast cancer, affecting the death risk, chemotherapy resistance and prognosis of ovarian cancer patients. The network also involves genes of other important functions including microtubules (RSHL3, TEKT1 and OLR1), extracelluar processes (ECM1, THBS2), hematopoietic (MS4A4, SRGN, LAPTM5), and human immune system (LILRA1, SIGLEC7, LAT2, LAIR1). This network suggested many causal relationship between different features, some of which have been known, for example, the edge HOTAIR →HOXC10 may be due to the fact that HOTAIR (a noncoding RNA gene) is located within the Homeobox C (HOXC) gene cluster and it regulates the expression of HOX genes such as HOXC and HOXD [22],[35].

Table 6 66 survival-related genes

Full size table

Discussion

In this paper we proposed an integrative approach in the Bayesian network framework for causal inference between genetic and epigenetic features in complex cancer data. It presents novelty in two aspects. First, we demonstrated that the stepwise correlation-based selection approach is more effective than simple single-round selection method in identifying important features in the genetic/epigenetic pathways, particularly those that are indirectly associated with the underlying phenotype. The method we proposed relies on the correlation strength among connected nodes and may fail when the connections are weak, especially for high-dimension data where assigning significance is challenging. Using literature-verified genes as seeds helps to better select features relevant to the phenotype. The SCBS procedure is model-free and computationally efficient and it can be applied to other graphical models such as Markov Random Field (MRF, undirected graph) and gene-gene or protein-protein interaction (PPI) network problems.

Second, we proposed a method for modeling causal relationships between features of different types (continuous or discrete) in a Bayesian network through a logit link function. The block-wise coordinate descent (BCD) algorithm accompanied with the Bayesian network model provides a simple and efficient way to estimate the parameters in the model. With a moderate sample size, this method achieves reasonable accuracy even for a moderate-scale network containing 200 nodes. This pipeline can be readily applied to other complex cancer data for pathway analysis or to find the common pathways between different but correlated phenotypes. For example, the TCGA now has accumulated more than 1,000 samples from breast cancer patients. It would be highly intriguing to know whether these two common diseases share any common molecular basis, especially for the basal-like breast cancer. As previously reported in literature [36], the basal-like subtype of breast cancer is the most distinct of the four subtypes (Luminal A, Luminal B, HER2-enriched and Basal-like) and it is similar to high-grade serous ovarian cancer at the mRNA expression level.

The Bayesian network model allows strict probabilistic inference, but also has limitations. First, it lacks flexibility to model cyclic causal relationships. The BN cannot model any cyclic pathways, for example, A →B→C→A, which though may exist in gene regulation in vivo. Second, most of the existing BN learning algorithms assume sparsity for computational feasibility. If the true network is dense or locally dense (dense subnetworks), the weak causations may fail to be detected. Our proposed BN with logit link function can model categorical variables and discretized continuous variables simultaneously, while discretization procedure may cause loss of information. A more desirable way could be to model the discrete variable nodes with logit distribution, but the continuous variable nodes with Gaussian or other continuous distributions. This direction demands future research. The coordinate descent is successful in solving the LASSO-type problem, especially in the sparse BN problem. Due to the super-exponentially increased number of possible BNs, the traditional sampling-based methods such as Metropolis-Hastings algorithm ([13],[14]) is computational infeasible to estimate the network with moderate number of nodes, e.g., 100 nodes. In our problem involving 339 nodes, the BCD algorithm took about 20 minutes on a single CPU to complete the estimation for a given penalizing parameter λ (Materials and Methods).

We illustrated the flexibility of this pipeline with two phenotypes, namely the ovarian cancer phenotype itself and the survival time of the cancer patient. The induced network by the cancer phenotype contains a set of 245 genes forming into four major clusters of distinct major functions, coordinated by 13 hub genes. Some of the hub genes (e.g., ARID1A) have been reported by other researchers for their important roles in genetic pathways, while other genes (NDUFB7 and NDUFB9) are newly discovered in this study. Their functions in ovarian cancer need to be further investigated. Our discovery that the mitochondrial systems are regulated in serous tumors is consistent with the hypothesis that the Warburg effect impacts tumor progression as suggested in published studies ([37]). We also found that pathways related to glycoprotein synthesis, hematopoietic and immune systems correlate with the survival rate of ovarian cancer patients. In particular, we discovered that the two genes related to glycoprotein synthesis, PSG11 and GALNT10, can significantly affect the overall survival time of ovarian cancer patients.

Conclusions

Understanding the biological mechanism of ovarian cancer has significant practical importance for clinical diagnosis and treatment. Timely advent of TCGA project has provided the most comprehensive genomic data resource for cancer study at the molecular and system level. Nevertheless, how to utilize these complex data for discovery of molecular mechanism of cancers remains one of the biggest challenges in this field. To this end, we propose a new integrative approach in this paper, which presents two innovations: a stepwise feature selection procedure and a Bayesian network model that incorporates both continuous and discrete features for causal inference. The predicted graph for the ovarian cancer data confirmed numerous genetic pathways reported in the literature, as well as many new ones that may provide new clues to guide future research. The graph suggested 13 hub genes that may drive certain subnetworks therefore play important roles in ovarian cancer. Clustering analysis suggested four gene clusters corresponding to distinct biological processes including cell division, tumor invasion and mitochondrial system. In addition, we found that genes related to glycoprotein synthesis, hematopoietic, immune system could be highly predictive of overall survival time of ovarian cancer patients.

Methods

Data

Datasets in ovarian cancer were downloaded from the Cancer Genome Atlas (TCGA) data portal (http://tcga-data.nci.nih.gov). We extracted four types of molecular data including gene expression, promoter methylation, somatic mutation and DNA copy number variation, using the "data matrix" tool provided by TCGA data portal.

Bayesian network with a logit link function

Bayesian Network can be used to model a set of random variables (nodes) and their conditional dependencies (directed edges) [11]. In general, the joint likelihood function of nodes X ₁,…,X _p in a BN can be expressed as:

P (X_{1}, \dots, X_{p}) = \prod_{i = 1}^{p} P (X_{i} | Π_{i}^{G})

(1)

where graph $G = (V, E)$ represents the topological structure of the Bayesian network, V={X ₁,…,X _p} denotes the set of nodes and $E = \{X_{j} \to X_{i}, X_{j} \in \prod_{i}^{G}\}$ denotes the set of edges, and $Π_{i}^{G} \subseteq \{X_{1}, \dots, X_{p}\} ∖ \{X_{i}\}$ stands for the parent set of X _i. We say $X_{j} \in Π_{i}^{G}$ if X _j causes X _i, written as X _j→X _i. A BN is called Gaussian Bayesian Network (GBN) if X _i is normally distributed with the mean equal to a linear combination of $X_{j} \in Π_{i}^{G}$ . The GBN is the most popular BN model and its structure learning problem has been discussed by several researchers [20],[38]. In our motivating example, however the network involves both continuous and discrete random variables. Here we discretize the continuous random variable and consider a multinomial logistic model.

Let X _i take values from {1,…,K _i} with probabilities $\{π_{i 1}, \dots, π_{{iK}_{i}}\} (s.t. \sum_{k = 1}^{K_{i}} π_{ik} = 1)$ , the BN model with logit link function can be written as:

log \frac{π_{ik}}{π_{{iK}_{i}}} = β_{ik 0} + \sum_{j \neq i} \sum_{l = 1}^{K_{j} - 1} β_{ikjl} I \{X_{j} = l\},

(2)

where k=1,…,K _i−1 and i=1,…,p. Here we transform the network structure to a coefficient matrix where β _ikjl=0 for all k and l means X _j↛X _i, and otherwise X _j→X _i. Therefore estimating the structure of is equivalent to estimating matrix {β _ikjl}. For simplicity, we illustrate the parameter estimation using binomial logistic model where all X _i’s only take values 0 or 1. Define π _i≡ prob (X _i=1), then

log \frac{π_{i}}{1 - π_{i}} = β_{i 0} + \sum_{j \neq i} β_{ij} X_{j} .

(3)

Suppose we observe data from N subjects. Let X _n=(X _n0,X _n1,…,X _np)^T, where X _n0=1 is the dummy variable, and X _ni=0 or 1 for n=1,…,N;i=1,…,p. Define π _ni≡ prob (X _ni=1). Let $X_{n_{- i}} = {(X_{n 0}, \dots, X_{n (i - 1)}, X_{n (i + 1)} \dots, X_{np})}^{T}$ , and β _i=(β _i0,…,β _i(i−1),β _i(i+1),…,β _ip)^T, then:

π_{ni} = exp (β_{i}^{T} X_{n_{- i}}) /(1 + exp (β_{i}^{T} X_{n_{- i}})) .

(4)

To achieve the sparsity, we apply the L ₁ penalty to the log-likelihood ([20],[38]) as follows:

\begin{matrix} L^{*} (β) & = \sum_{n = 1}^{N} \sum_{i = 1}^{p} (X_{ni} β_{i}^{T} X_{n_{- i}} - log (1 + exp (β_{i}^{T} X_{n_{- i}}))) \\ - λ \sum_{i = 1}^{p} ∥ β_{i} ∥_{L_{1}} . \end{matrix}

(5)

We aim to optimize the objective function (5) under the constraint of acyclicity. Finding the global maximizer is typically difficult in such a high dimensional space. Here we consider the coordinate descent (CD) algorithm, which has been successfully used to solve lasso regression problems [20],[39]. The CD algorithm is based on single-parameter updating strategy to minimize the objective function coordinate-by-coordinate. For our model, the single-parameter updating can be done as follows, in particular, we seek the maximizer ${\hat{β}}_{ij}$ of the following objective function given all the other parameters, denoted β _{−i j}:

\begin{matrix} L_{i}^{*} (β_{ij} | β_{- ij}) & = \sum_{n = 1}^{N} (X_{ni} β_{i}^{T} X_{n_{- i}} - log (1 + exp (β_{i}^{T} X_{n_{- i}}))) \\ - λ | β_{ij} | . \end{matrix}

(6)

After excluding the constant part, we have:

L_{i}^{*} (β_{ij} | β_{- ij}) = C_{1} β_{ij} - λ | β_{ij} | - \sum_{n = 1}^{N} log (1 + C_{2 n} exp (C_{3 n} β_{ij})) .

where $C_{1} = \sum_{n = 1}^{N} X_{ni} X_{nj} \geq 0$ , $C_{2 n} = exp (\sum_{k \neq i, j} β_{ik} X_{nk}) > 0$ , C _3n=X _nj≥0. Note that $L_{i}^{*} (β_{ij} | β_{- ij})$ is concave and differentiable at (−∞,0)∪(0,∞). Let $f (β_{ij}) = - \sum_{n = 1}^{N} log (1 + C_{2 n} exp (C_{3 n} β_{ij}))$ , do the following to find the maximizer of $L_{i}^{*} (β_{ij} | β_{- ij})$ :

1.
If $f^{'} (β_{ij}) |_{β_{ij} = 0} \in (- \infty, - C_{1} - λ)$ , then $L_{i}^{*} (β_{ij} | β_{- ij})$ is decreasing at 0 and ${\hat{β}}_{ij} < 0$ . Find ${\hat{β}}_{ij}$ by Newton’s method;
2.
If $f^{'} (β_{ij}) |_{β_{ij} = 0} \in (- C_{1} - λ, - C_{1} + λ)$ , then $L_{i}^{*} (β_{ij} | β_{- ij})$ is increasing on (−∞,0) and decreasing on (0,∞), so ${\hat{β}}_{ij} = 0$ ;
3.
If $f^{'} (β_{ij}) |_{β_{ij} = 0} \in (- C_{1} + λ, \infty)$ , then $L_{i}^{*} (β_{ij} | β_{- ij})$ is increasing at 0 and ${\hat{β}}_{ij} > 0$ . Find ${\hat{β}}_{ij}$ by Newton’s method.

The acyclicity constraint brings a major difficulty in BN learning problem, especially when the topological order of nodes is unknown. One immediate result of this constraint is that β _ij and β _ji cannot be both nonzero. To take advantage of this implication, Fu and Zhou (2013) proposed a blockwise coordinate decent (BCD) algorithm where the p(p−1) parameters are partitioned in to p(p−1)/2 blocks. Each block consists of β _ij and β _ji. The BCD algorithm [20] can be implemented as follows (starting from an empty network where all β _ij=0):

Step 1: For each pair of β _ij and β _ji, ${\hat{β}}_{ij} \Leftarrow 0$ stands for that ${\hat{β}}_{ij}$ has to be 0 under the constraint of acyclicity:
1. a.
  If ${\hat{β}}_{ji} \Leftarrow 0$ , find the maximizer ${\hat{β}}_{ij}$ of $L_{i}^{*}$ w.r.t. β _ij.
2. b.
  If ${\hat{β}}_{ij} \Leftarrow 0$ , find the maximizer ${\hat{β}}_{ji}$ of $L_{j}^{*}$ w.r.t. β _ji.
3. c.
  If either ${\hat{β}}_{ij}$ or ${\hat{β}}_{ji}$ can be nonzero, then compare the two sums: $S_{1} = L_{i}^{*} |_{β_{ij} = 0} + L_{j}^{*} |_{β_{ji} = {\hat{β}}_{ji}}$ and $S_{2} = L_{i}^{*} |_{β_{ij} = {\hat{β}}_{ij}} + L_{j}^{*} |_{β_{ji} = 0}$ . Find maximizer of max(S ₁,S ₂).
Step 2: Repeat step 1 until the maximum difference between two successive cycles is below some threshold.

To check the acyclicity of the candidate graph after edge i→j is added, we use a simple breadth-first algorithm detailed as follows. The time complexity of this algorithm is O(|V|+|E|):

Step 1: Remove all the edges coming into j and identify the children set of j, denoted by C S _j.
Step 2: If i?C S _j, then stop. Otherwise remove all the edges coming into C S _j and find the children set of C S _j.
Step 3: Repeat until i is found or all the edges were removed.

Notice that in multinomial setting, the acyclicity constraint forces multiple β’s to be zero. For instance, in ternary case, X _i→X _j indicates that four β’s are zeroes simultaneously. Another important issue in Lasso regression is the choice of the tuning parameter λ. Cross-validation is the most commonly used method for selecting λ, which however, tends to select a too small λ resulting in high false positive rate ([20]). To overcome this difficulty, an empirical λ selection method was proposed in [20] that can guarantee significant increase of the maximized likelihood value as a function of the graph complexity (number of edges). We employed this method for the selection of λ and the sequence of candidates is set to be {1/16,1/8,1/4,1/2,1,2,4,8,16}.

Simulation I: BCD algorithm for BN with discrete nodes

In the first simulation, we evaluated the performance of BCD algorithm in the proposed Bayesian network with logit link function as follows. We first simulated a random graph (i.e., $G = (V, E)$ ) with p nodes and 2p edges respectively. The simulated graph is a weakly connected directed graph generated using R package bnlearn ([40]). Given each , we simulated N independent observations, i.e., X _n=(X _n1,…,X _np), for n=1,..,N. If X _nj is caused by other nodes, then we simulated it based on Bernoulli distribution with success probability following the logistic regression model. For simplicity, for each observation the causal effect β _ij (if X _i→X _j in ) in the logistic regression model was set as a constant. For those nodes in the network but not caused by any other nodes, we simulated them from binomial distribution independently with success probability randomly generated from uniform distribution from 0.1 to 0.9. In this simulation we consider a crossed design of p=50,100,200, β _ij=0.5,1 and N=500,1000,2000. For each setting of (p,β _ij,N), 10 replicated samples were generated. We evaluated the estimated network under two different criteria: the directed version and undirected version of network (skeleton). In the former, we count an edge as a true positive only if it has the correct link and direction. For the skeleton comparison, an edge is counted as a true positive as long as it has the correct link. Table 7 presents the average number of predicted edges (P) from the 10 replicates for each setting, true positive rate (TPR) and false discovery rate (FDR) for both directed and undirected edges. Unsurprisingly the estimation of network structure is affected by the complexity of network, magnitude of causal effect and sample size. For a sparse network with 200 nodes, and sample size N=2000, and β=1, the BCD algorithm achieves an average TPR of 0.7 and 0.96 for directed and undirected (skeleton) graphs respectively. This simulation demonstrates that the BCD algorithm performs reasonably well when applied to categorical data in a moderate complex network when the sample size is relatively large.

Table 7 Simulation I results

Full size table

Comparison of three BN models on real data

To benchmark our logistic BN model, we compared it with two other BN models, namely the Gaussian BN model and multinomial BN model, on a popular data set [41] where the true causal network is known and experimentally-validated. This data set contains the abundance measurement of 11 proteins in 5400 samples, and has been used to elucidate the signaling pathway structure. Both the continuous and discrete versions of data are available online. The known protein-protein network is a Bayesian Network containing 11 nodes and 20 directed edges. Because this data set is based on experimental interventions, we adapt our model by deleting the intervention terms from the likelihood function. Figure 11 shows the true graph and estimated graphs by three different models and Table 8 summarizes the true positive rate and false discovery rate by three models. In terms of TPR and FDR, the logistic model appears to perform slightly better than other two models.

Table 8 Comparison of three different BN models

Full size table

Simulation II: stepwise correlation-based feature selector

In the second simulation, we compare the proposed stepwise correlation-based feature selector with other existing methods. The feature selection step is to define a set of nodes of smallest possible size but include most possible nodes that are involved in the casual structure in the true phenotype-induced Bayesian network. Suppose we aim to select p variables from S (S>>p) candidates as the nodes in Bayesian network based on a random sample of N observations. Assume that the phenotype-induced Bayesian network truly has p+1 nodes (including the phenotype itself). The proposed stepwise feature selection method starts with the phenotype node and selects the features that are most correlated with the current nodes in a stepwise fashion based on a correlation or mutual information metric (in this paper, we use correlation). This procedure is a natural mimic of network structure and can identify those nodes indirectly associated with the phenotype. In practice, the method can be implemented as follows:

Step 1: Calculate the correlation coefficients between the current node X _i and all the other nodes, denoted by ? _ij,j?i. Keep k most correlated nodes with X _i based on ? _ij for further filtering.
Step 2: Calculate the p-value of correlation coefficient for each of the k nodes from step 1, select the node if the p-value is significant under Benjamini-Hochberg (BH) procedure with FDR =0.05.
Step 3: Repeat step 1 and 2 until p nodes are selected.

In practice we need to pre-define the value of p and k based on the complexity of the network. The choice of p is subject to the feature pools size S and the scale of the network to build. The computing time is sub linear to p. We recommend to choose a k of 4, 5 or 6 to attain moderate complexity or sparsity of the network (see a simulation study below for the choice of k).

To test the SCBS method, we conducted a simulation study with S=10,000 features, among which only p=49 are truly involved in the phenotype related network. We first generated 50 random sparse graphs (i.e., $G = (V, E)$ ) consisting of 50 nodes and 100 directed edges (one node will be randomly chosen to be the phenotype node) using bnlearn ([40]). For each graph, we simulated eight samples according to the binomial logistic model, four with constant β _ij=1 and the rest with β _ij=2 at four different sample sizes N=500,1000,2000,5000. By using the topological order, each node can be simulated conditioning on the outcome of its parent nodes. For those nodes in the directed network but not caused by any other nodes, or the 9950 features outside the network, we simulated them from binomial distribution independently with success probability randomly generated from uniform distribution from 0.1 to 0.9. The SCBS with k=4 was applied to each data set and 49 features are identified from the sea of 10,000 candidates. Compared to the Pearson’s Chi-square test (single-round test for features between two phenotypic groups), the proposed SCBS performs uniformly better in all situations in terms of true positive rate (as shown in Figure 12, total positives were controlled at 49 for both methods). In particular, a small β poses more challenges in estimation or testing, while the SCBS can outperform the Chi-square test by 3-4 folds in true positive rate of identified phenotype-related features.

A further simulation study on the choice of k was carried out using the same strategy as above with S=10,000, p=49, β _ij=1 and N=1000. We generated 50 random graphs with average degree 4 (100 edges), 50 graphs with average degree 8 (200 edges) and 50 graphs with average degree 12 (300 edges), then apply SCBS with different k (k=1,2,…,10) to each data set and select 49 features. Figure 13 shows the true positive rates of the selected features under different choices of k for networks with different average degrees. It is shown that increasing network density leads to increasing true positive rates, and a k of 4, 5 or 6 performs better in all situations. Therefore for networks with moderate complexity, we recommend to choose k=4,5,6 since smaller k tends to miss weakly connected nodes and larger k tends to catch more false positives.

Availability of supporting data

The data sets supporting the results of this article are available in the Cancer Genome Atlas repository, https://tcga-data.nci.nih.gov/tcga/.

Additional files

Abbreviations

TCGA:: The Cancer Genome Atlas
BN:: Bayesian network
SCBS:: Stepwise correlation-based feature selector
TPR:: True positive rate
FDR:: False discovery rate

References

The Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma . Nature. 2011, 474: 609-615. 10.1038/nature10166.
Article PubMed Central Google Scholar
Bast R, Hennessy B, Mills G: The biology of ovarian cancer: new opportunities for translation . Nat Cancer Rev. 2009, 9: 415-428. 10.1038/nrc2644.
Article CAS Google Scholar
Chon H, Lancaster J: Microarray-based gene expression studies in ovarian cancer . Cancer Control. 2011, 18: 8-15.
PubMed Google Scholar
Konstantinopoulos P, Spentzos D, Cannistra S: Gene-expression profiling in epithelial ovarian cancer . Nat Clin Pract Oncol. 2008, 5: 577-587. 10.1038/ncponc1178.
Article CAS PubMed Google Scholar
Sowter H, Ashworth A: BRCA1 and BRCA2 as ovarian cancer susceptibility genes . Carcinogenesis. 2005, 26 (10): 1651-1656. 10.1093/carcin/bgi136.
Article CAS PubMed Google Scholar
Xu Y, Zhang J, Yuan Y, Mitra R, Müller P, Ji Y: A Bayesian graphical model for integrative analysis of TCGA data. In 2012 IEEE International Workshop on Genomic Signal Processing and Statistics. Washington DC, USA: BioMed Central; 2012:135–138.
Chen L, Xuan J, Gu J, Wang Y, Zhang Z, Wang T, Shih L: Integrative network analysis to identify aberrant pathway networks in ovarian cancer . Pac Symp Biocomput. 2012, 2012: 31-42.
Google Scholar
Hsu FH, Serpedin E, Hsiao TH, Bishop A, Dougherty E, Chen Y: Reducing confounding and suppression effects in TCGA data: an integrated analysis of chemotherapy response in ovarian cancer . BMC Genomics. 2012, 13: S13-10.1186/1471-2164-13-S6-S13.
Article PubMed Central PubMed Google Scholar
Glymour C, Cooper F: Computation, Causation, and Discovery: Cambridge MIT Press; 1999.
Spirtes P, Glymour C, Scheines R: Causation, Prediction, and Search, Cambridge: Springer; 1993.
Jensen F: An introduction to Bayesian Networks, Cambridge: Springer; 1997.
Heckerman D: A tutorial on learning with Bayesian networks. Technical Report1995.
Madigan D, York J: Bayesian graphical models for discrete data . Int Stat Rev. 1995, 63: 215-232. 10.2307/1403615.
Article Google Scholar
Ellis B, Wong W: Learning causal Bayesian network structures from experimental data . J Am Stat Assoc. 2008, 103 (482): 778-789. 10.1198/016214508000000193.
Article CAS Google Scholar
Kohavi R, John G: Wrappers for feature subset selection . Artif Intell. 1997, 97: 273-324. 10.1016/S0004-3702(97)00043-X.
Article Google Scholar
Leng J, Valli C, Armstrong L: A wrapper-based feature selection for analysis of large data 3rd ICCEE. 2010:167–170. ECU Publications Press; Chengdu, China.,
Haindl M, Somol P, Ververidis D, Kotropoulos C: Feature selection based on mutual correlation. Technical Report1999.
Jouve P, Nicoloyannis N: A filter feature selection. Technical Report2010.
Yu L, Liu H: Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the Twentieth International Conference on Machine Learning. Washington DC, USA: Morgan Kaufmann; 2003:856–863.
Fu F, Zhou Q: Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent . J Am Stat Assoc. 2013, 108 (501): 288-300. 10.1080/01621459.2012.754359.
Article CAS Google Scholar
Zaman N, Li L, Jaramillo M, Sun Z, Tibiche C, Banville M, Collins C, Trifiro M, Paliouras M, Nantel A, O’Connor-McCourt M, Wang E: Signaling network assessment of mutations and copy number variations predict breast cancer subtype-specific drug targets . Cell Rep. 2013, 5: 216-223. 10.1016/j.celrep.2013.08.028.
Article CAS PubMed Google Scholar
GeneCards. http://www.genecards.org.
Xu C, Brown M, Nicolai H, Chambers J, Griffiths B, Solomon E: Isolation and characterisation of the NBR2 gene which lies head to head with the human BRCA1 gene . Hum Mol Genet. 1997, 6 (7): 1057-1062. 10.1093/hmg/6.7.1057.
Article CAS PubMed Google Scholar
Schick N, Oakeley E, Hynes N, Badache A: TEL/ETV6 is a signal transducer and activator of transcription 3(Stat3)-induced repressor of Stat3 activity . J Biol Chem. 2004, 279 (37): 38787-38796. 10.1074/jbc.M312581200.
Article CAS PubMed Google Scholar
Mannava S, Grachtchout V, Wheeler L, Im M, Zhuang D, Slavina E, Mathews C, Shewach D, Nikiforov M: Direct role of nucleotide metabolism in C-MYC-dependent proliferation of melanoma cells . Cell Cycle. 2008, 7 (15): 2392-2400. 10.4161/cc.6390.
Article PubMed Central CAS PubMed Google Scholar
Guan B, Wang T, Shih L: ARID1A, a factor that promotes formation of SWI/SNF-mediated chromatin remodeling, is a tumor suppressor in gynecologic cancers . Cancer Res. 2011, 71 (21): 6718-6727. 10.1158/0008-5472.CAN-11-1562.
Article PubMed Central CAS PubMed Google Scholar
Lowery W, Schildkraut J, Akushevich L, Bentley R, Marks J, Huntsman D, Berchuck A: Loss of ARID1A-associated protein expression is a frequent event in clear cell and endometrioid ovarian cancers . Int J Gynecol Cancer. 2012, 22: 9-14. 10.1097/IGC.0b013e318231f140.
Article PubMed Central PubMed Google Scholar
Schiffman S, Reynolds M, Young F: Introduction to Multidimensional Scaling: Theory, Methods, and Applications, West Yorkshire: Emerald; 1981.
Yoshihara K, Tajima A, Yahata T, Kodama S, Fujiwara H, Suzuki M, Onishi Y, Hatae M, Sueyoshi K, Fujiwara H, Kudo Y, Kotera K, Masuzaki H, Tashiro H, Katabuchi H, Inoue I, Tanaka K: Gene expression profile for predicting survival in advanced-stage serous ovarian cancer across two independent datasets . PLoS One. 2010, 5 (3): e9615-10.1371/journal.pone.0009615.
Article PubMed Central PubMed Google Scholar
Louhimo J, Alfthan H, Stenman U, Haglund C: Serum HCG beta and CA 72-4 are stronger prognostic factors than CEA, CA 19-9 and CA 242 in pancreatic cancer . Oncology. 2004, 66 (2): 126-131. 10.1159/000077438.
Article CAS PubMed Google Scholar
Thériault C, Pinard M, Comamala M, Migneault M, Beaudin J, Matte I, Boivin M, Rancourt C, Piché A: MUC16 (CA125) regulates epithelial ovarian cancer cell growth, tumorigenesis and metastasis . Gynecol Oncol. 2011, 121 (3): 434-443. 10.1016/j.ygyno.2011.02.020.
Article PubMed Google Scholar
Kuzmanov U, Musrap N, Kosanam H, Smith C, Batruch I, Dimitromanolakis A, Diamandis E: Glycoproteomic identification of potential glycoprotein biomarkers in ovarian cancer proximal fluids . Clin Chem Lab Med. 2012, 51 (7): 1467-1476.
Google Scholar
Joensuu H, Klemi P, Toikkanen S, Jalkanen S: Glycoprotein CD44 expression and its association with survival in breast cancer . Am J Pathol. 1993, 143 (3): 867-874.
PubMed Central CAS PubMed Google Scholar
Baekelandt M, Holm R, Nesland J, Kristensen G, Tropé C: P-glycoprotein expression is a marker for chemotherapy resistance and prognosis in advanced ovarian cancer . Anticancer Res. 2000, 20 (2B): 1061-1067.
CAS PubMed Google Scholar
Rinn J, Kertesz M, Wang J, Squazzo S, Xu X, Brugmann S, Goodnough L, Helms J, Farnham P, Segal E, Chang H: Functional demarcation of active and silent chromatin domains in human HOX Loci by Noncoding RNAs . Cell. 2007, 129 (7): 1311-1323. 10.1016/j.cell.2007.05.022.
Article PubMed Central CAS PubMed Google Scholar
The Cancer Genome Atlas Research Network: Comprehensive molecular portraits of human breast tumours . Nature. 2012, 490: 61-70. 10.1038/nature11412.
Article Google Scholar
Kroemer G: Mitochondria in cancer . Oncogene. 2006, 25: 4630-4632. 10.1038/sj.onc.1209589.
Article CAS PubMed Google Scholar
Yuan M, Lin Y: Model selection and estimation in the Gaussian graphical model . Biometrika. 2007, 94: 19-35. 10.1093/biomet/asm018.
Article Google Scholar
Friedman J, Hastie T, Höfling H, Tibshirani R: Pathwise Coordinate Optimization . Ann Appl Stat. 2007, 1: 302-332. 10.1214/07-AOAS131.
Article Google Scholar
Scutari M: Learning Bayesian networks with the bnlearn R package . J Stat Softw. 2010, 35 (3): 1-22.
Article Google Scholar
Sachs K, Perez O, Pe’er D, Lauffenburger D, Nolan G: Causal protein-signaling networks derived from multiparameter single-cell data . Science. 2005, 308 (5721): 523-529. 10.1126/science.1105809.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This research is supported by NIH grant (U54CA143869).

Author information

Authors and Affiliations

Department of Statistics, Northwestern University, Evanston, IL, 60208, USA
Qingyang Zhang & Ji-Ping Wang
Department of Medicinal Chemistry and Pharmacognosy, University of Illinois, Chicago, 60607, IL, USA
Joanna E Burdette

Authors

Qingyang Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Joanna E Burdette
View author publications
You can also search for this author in PubMed Google Scholar
Ji-Ping Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Qingyang Zhang or Ji-Ping Wang.

Additional information

Competing interests

The authors have declared that no competing interests exist.

Authors’ contributions

JPW conceived the study. QZ and JPW analyzed the data. QZ, JEB and JPW wrote the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12918_2014_136_MOESM1_ESM.xlsx

Additional file 1: Table S1. Detailed information about the predicted network in Figure 6, including upstream (parent) and downstream (child) gene names for each gene in the network, gene locations (chromosome, strand, start and end positions), activation (represented by 1) or suppression (represented by -1) between two genes, affected by genetic/epigenetic changes or not. (XLSX 36 KB)

Additional file 2: Table S2. Genes in each of the four clusters. (XLSX 15 KB)

12918_2014_136_MOESM3_ESM.xlsx

Additional file 3: Table S3. Detailed information about the survival-centric network in Figure 9, including upstream (parent) and downstream (child) gene names for each gene in the network, gene locations (chromosome, strand, start and end positions), activation (represented by 1) or suppression (represented by -1) between two genes, affected by genetic/epigenetic changes or not. (XLSX 11 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Zhang, Q., Burdette, J.E. & Wang, JP. Integrative network analysis of TCGA data for ovarian cancer. BMC Syst Biol 8, 1338 (2014). https://doi.org/10.1186/s12918-014-0136-9

Download citation

Received: 10 September 2014
Accepted: 11 December 2014
Published: 31 December 2014
DOI: https://doi.org/10.1186/s12918-014-0136-9

Integrative network analysis of TCGA data for ovarian cancer

Abstract

Background

Results

Conclusions

Background

Results

Preprocessing of TCGA ovarian cancer data

Feature selection

Bayesian network prediction

Gene clusters

Survival-centric network prediction

Discussion

Conclusions

Methods

Data

Bayesian network with a logit link function

Simulation I: BCD algorithm for BN with discrete nodes

Comparison of three BN models on real data

Simulation II: stepwise correlation-based feature selector

Availability of supporting data

Additional files

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Systems Biology

Contact us