Identifying cancer biomarkers by network-constrained support vector machines

Background One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers. Results We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis. Conclusions We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.


Background
While promising progress in research has been made in recent years, predicting cancer outcomes is a difficult task since cancer is a complicated disease and its mechanisms remain largely unclear. Biomarkers play an important role in the diagnosis of cancer, and also in assessing prognosis and directing treatment of cancer. As microarray technology makes it possible to measure the expression of tens of thousands of genes simultaneously, biomarker identification has become one of the major tasks in the field of microarray data analysis. Common statistical practice attempts to find biomarkers differentially expressed across different phenotypes, such as cancer samples vs. normal samples, in a high-dimensional gene space. Given clinical outcomes data, the problem can also be formulated as a prediction problem that is designed to find informative genes from a classification model with good prediction performance.
Traditional methods [1][2][3][4][5][6][7][8] are largely developed based on microarray data alone, with the assumption that each individual gene contributes independently to clinical outcomes. Thus, the reproducibility of prediction performance is often unexpectedly low when tested across different data sets (even though data are acquired from apparently similar study designs). This problem may be explained in part by the properties of microarray data that are often noisy and the cellular and molecular heterogeneity of cancer specimens. Unfortunately, biomarkers selected by many current algorithms often have limited mechanistic coherence related to the specific cancer under study, partly because the approaches do not deal effectively with the challenges posed by working in high dimensional data spaces [9].
Genes generally work collaboratively, and many cancer-related genes are involved in multiple pathways [10]. Recently, several methods have been developed to identify significant gene sets or pathways involved in diseases or biological processes by incorporating some prior biological knowledge. For example, gene set enrichment analysis or pathway enrichment analysis [11][12][13] uses the membership information in functional gene clusters or pathways, which facilitates an understanding of the underlying biological mechanism(s). Other algorithms use interacting structures, such as protein-protein interactions (PPIs), protein-DNA interactions, or regulatory pathways. For example, Chuang et al. [14] proposed a protein-protein network-based approach to identify biomarkers of metastasis using breast cancer gene expression data. Biomarkers identified by this approach are encoded as subnetworks of interacting proteins within a large human PPI network. The average expression activities of these subnetworks were then used for prediction of metastasis. A noticeable limitation of this method is that the network structure was not taken into consideration in the classifier building step. Li et al. [15] introduced a network-constrained regularization procedure for linear regression analysis of microarray data. Specifically, a network-constrained term based on the L1-norm of regression coefficients was used to enforce the smoothness of the coefficients for each network. In a general regression framework, the effectiveness of this approach has been initially demonstrated with relevant genes or subnetworks identified showing an improved association with the appropriate phenotypes. However, in many cases only binary information of clinical outcomes are known (recurrent/non-recurrent, alive/dead), therefore a binary prediction model is more suitable than a regression model for cancer prediction. Zhu et al. [16] recently started using support vector machines to build binary classifiers as prediction models, in which an F ∞ -norm constraint was proposed to account for gene-gene interaction information. As an initial attempt, they applied this approach to breast cancer data to study three small, focused networks centered upon TP53, BRCA1, and BRCA2, respectively, showing the potential of this approach to identify those frequently mutated cancer related genes, although the results apply to genes largely known from previous studies [16].
We have developed an integrated approach, networkconstrained support vector machine (netSVM), to predict clinical outcome of patients and to identify biologically meaningful biomarkers by incorporating protein-protein interacting network information. Specifically, we embed a network constraint into the objective function of an SVM to impose the smoothness of coefficient over a prediction network. The network constraint is represented by a Laplacian matrix of protein-protein interactions. We first validate the netSVM approach using simulation studies to explore the effectiveness of the proposed method. We then apply the netSVM to breast cancer data for cancer biomarker identification. The study shows that our method can be used to improve the prediction performance across data sets, especially when signal-to-noise ratio (SNR) is relatively low. More importantly, the identified genes and subnetwork are highly related to biological pathways involved in breast cancer progression and metastasis.

Results and discussion
Network-constrained support vector machines We propose an integrated approach using gene expression data and PPI network information to predict clinical outcomes of breast cancer and to identify cancer biomarkers. For these studies, we are less interested in describing clinically useful classifiers than we are in using clinically relevant outcomes data to support a classifier from which we can obtain mechanistically relevant biological insights. Figure 1 shows the framework of the proposed method. The method takes gene expression data and PPI network knowledge as the input, builds a classifier using a network-constrained support vector machine (netSVM), and then predicts the outcome of new samples based on the trained classifier. Significant genes or subnetworks from the classifier can be detected through a significance test based on permutation of sample labels. Unlike conventional SVM, netSVM adds a network constraint in the gene space to its objective function; thus we obtain highly connected genes as the significant features and should improve prediction performance across different data sets. The approach is described in the Methods part with its mathematical details outlined.

Simulation experiments
We simulated microarray gene expression data under two conditions by a modified MRF-GG model [17]. First, a Markov random field (MRF) model was used to determine the states of genes -differentially expressed (DE) and equally expressed (EE) -given a known, ground truth subnetwork. Based on the states of genes, the Gamma-Gamma (GG) model [18] was then used for modeling the gene expression levels in the two conditions (see Methods).
We conducted simulation studies on a breast cancerrelated network that contains 584 genes and 2,280 interactions. Genes are either breast cancer related [19] or involved in estrogen signaling pathways collected from Ingenuity Pathway Analysis (Ingenuity ® Systems, http://www.ingenuity.com). Interactions were extracted from the HPRD database [20]. Weights in the network were set to 1 if there are known connections between two genes, and to 0 otherwise. Parameters in the GG model (a = 10, a 0 = 0.9 and ν = 0.5) are those in Newton et al. [18]. When generating simulation data sets, we added different levels of noise and adjusted parameter w (see Eq. 11 in Methods) to control the false positive rate in the sampled DE subnetworks. For each scenario, we randomly generated 100 training and testing data sets, each data set with 100 training samples and 100 testing samples.
We implemented network-constrained SVMs for training and testing. A 10-fold cross validation was conducted on the training data set to select the optimal value of parameter l, a trade-off parameter between classification error and network constraint (see Eq. 4 in Methods). We then computed the accuracy, sensitivity, and specificity for classification performance evaluation on the testing data. The classifier's performance in recovering ground truth subnetwork genes was also assessed using receiver operating characteristic (ROC) analysis [2] of the ranked gene list. Specifically, genes were ranked by their p-values through a significance test. True positive rate and false positive rate were calculated in the ranked gene list, and the area under the ROC curve (AUC) were calculated for an overall performance evaluation.
As a comparison, we also implemented many existing methods for classifier training and performance evaluation. Among them, F ∞ -norm SVM [16], Larsnet [15] and Chuang's method [14] are network-based methods that integrate gene expression data and protein-protein interaction network information. Conventional SVM, Lasso [21] and Linear Discriminant Analysis (LDA) [22] are gene-based methods that are based on gene expression data alone. Note that for LDA we used t-test to select top ranked (significant) genes for prediction if number of genes is greater than number of samples. Similarly, we conducted 10-fold cross validation to determine the optimal parameters for the methods and compared the classifier's performance in term of prediction accuracy in the outcome of testing samples and recovering ground truth subnetwork genes.
We first fixed weight (ω = 10) and added different levels of Gaussian noise to the simulated gene expression data. Table 1 shows that the AUC values of prediction performance on testing data sets for netSVM and other existing methods. From the table we can see that when signal-to-noise ratio is relatively high (>4 db), most of the methods can achieve good prediction results, except for two regression methods, Larsnet and Lasso. However, when signal-to-noise ratio is low, which is a common problem with microarray gene expression data, netSVM gives rise to an improved classification performance compared to other methods. The regression methods do not show good prediction performances in noisy conditions. One possible reason is that the simulation data are generated based on statistical distributions rather than precise regression models. The AUC values for subnetwork identification are shown in Table 2. We can see that network-based methods outperform gene-based methods consistently, and netSVM outperforms all other methods. This indicates that integrating PPI network information could improve discovering underlying subnetworks. Figure 2 and Figure 3 show the detailed comparison between netSVM and conventional SVM in terms of AUC values of prediction performance and subnetwork identification, respectively. NetSVM outperforms SVM significantly in identifying the ground truth subnetwork or relevant genes. We further evaluated the performance of uncovering underlying network/genes with different false positive rates in the data by varying weights (ω), to control the false positive rate of sampled subnetworks compared with the ground truth subnetwork. With a fixed signalto-noise ratio (SNR = 0 dB), the prediction performance of six methods are similar with the ones in Table 1 (results are not shown). However, the performance in identifying underlying subnetworks is substantially different, which is shown in Table 3. From the table, we can conclude that network-based methods outperform gene-based methods in general. Figure 4 shows the detailed comparison between netSVM and conventional SVM. From the figure we can see that netSVM achieves  higher AUC values than conventional SVM significantly, especially when false positive rate of sampled subnetwork is high (>40%). Since our method is designed to emphasize the role of hub genes, the negative effect on prediction accuracy of hub genes is greater than other genes when the genes have inconsistent, abnormal expression between training and testing data sets. We assessed the robustness of the method by perturbing the expression levels of all ground truth genes and hub genes, respectively. Genes were considered as hub genes if their connection degrees are larger than 10. We added different levels of noise in the test data sets and compared the prediction performance of netSVM and that of conventional SVM. From  simulation experiments, we can see that netSVM is more robust than conventional SVM when perturbing all ground truth genes ( Figure 5(a)). The performance degrades even faster when perturbing hub genes alone, but it is still acceptable when compared to the performance of conventional SVM ( Figure 5(b)).

Breast cancer microarray data
We studied two gene expression profiles of breast cancer patients previously reported by van de Vijver et al. [23] and Wang et al. [24]. We focused on estrogen receptor (ER) positive patients in our study, aiming to improve our understanding of estrogen signaling and action. Among the ER positive patients, 78 patients in van de Vijver et al. [23] and 80 in Wang et al. [24] had been diagnosed with metastasis during their follow-up visits within 5 years of surgery, which were assigned to 'recurrence' group. The remaining 217 and 129 patients, respectively in the two studies, were then labeled as 'non-recurrence'. In order to construct a network, we collected a set of genes that are either breast cancer related [19] or involved in estrogen signaling pathways from Ingenuity Pathway Analysis (Ingenuity ® Systems, http://www.ingenuity.com). The protein-protein interactions (PPIs) were extracted from the HPRD database [20]. In this study, the weights in the network are set as 1 if there are connections between two genes and 0 otherwise. After mapping the network to the two gene expression data sets [23,24], we obtained a PPI network with 553 breast cancer related genes and 2,257 interactions.
We conducted a stratified, 5-fold cross validation on the first data set to build a classifier, and then tested on the second data set to measure its prediction performance, and vice versa. For cross-validation with network-constrained SVM, the samples are divided into five subsets: three are used to train the classifier, one is used to determine the optimal value of parameter l, and validation performance is calculated on the remaining subset using the optimal l. To obtain a more reliable evaluation of the performance, we repeated the crossvalidation procedure 100 times by random partitions. The most frequently occurring value of parameter l during the cross validation was used to build a classifier based on all training samples for independent testing. We evaluated the prediction performance of netSVM using ROC analysis, from which AUC, accuracy, sensitivity, and specificity were calculated.
Similarly we compared the prediction performance of netSVM with other network-based methods and genebased methods in terms of cross-validation and independent testing. For the cross-validation performance, Table  4 shows the mean and standard deviation of prediction performance for all methods; network-based methods achieved a slightly better classification performance than the gene-based methods. Table 5 shows the prediction performance of independent testing on two data sets. The prediction performance of netSVM and Chuang's method are comparable and better than other networkbased and gene-based methods. This indicates that netSVM, along with Chuang method, is of a better reproducibility to predict independent data sets as compared to conventional SVM and other methods. The overlaps in the top 50 ranked genes from two data sets also show that netSVM has a better reproducibility for network identification (Figure 6).
We also compared the prediction performance with the performances reported in the original studies  [23,24]. 70 gene signatures were identified in van de Vijver et al. [23] and 76 gene signatures in Wang et al. [24]. In the setting of cross validation netSVM achieved a slightly better prediction performance than the original studies (Table 4). However, in the setting of independent testing across data sets, netSVM achieved a significant improvement in prediction accuracy as compared to that from the 70 or 76 gene signatures identified in the original studies (Table 5). Furthermore, netSVM can identify more overlapped genes from two data sets (~20%) than those of previous studies (~1%) (Figure 6), which indicates that netSVM has a better reproducibility across data sets in terms of prediction performance and biomarker identification.
As observed in the simulation study, netSVM is more sensitive to hub genes if they have abnormal expression between two data sets. To study the possibility of this effect, we examined the expression changes of hub genes and non-hub genes in two breast cancer data sets. Figure 7 shows the distribution of difference of fold changes between two data sets for 100 hub genes and 100 non-hub genes. The variance of hub genes is in overall smaller than non-hub genes. This observation is consistent with our assumption that hub genes have little expression changes between difference phenotypes, so that they have less variations across different data sets as compared to their down-stream genes. We also conducted a statistical analysis to assess the significance of robustness of selected genes across two data sets. We  take the variance of difference of fold change as the summary statistic and generate the null distribution from randomly selected genes (of the same number as the identified genes). The empirical p-value is then calculated by the frequency of occurrences of null variance less than the observed one. The p-values for the top 50 genes selected by netSVM are 0.09 in van de Vijver et al. [23] and 0.02 in Wang et al. [24], respectively, which are much more significant than those from the genes selected by SVM (0.13 in van de Vijver et al. [23] and 0.18 in Wang et al. [24], respectively). These results further support and validate that network-based methods can perform better than single gene-based methods. We further examined the top ranked genes and their composed networks from the classifiers defined by network-constrained SVM and conventional SVM on two data sets. The genes were ranked by their p-values through a significance test (see the Methods section for the detailed procedure). We compared various network properties including number of edges, average node degree and network density. Network density is defined as 2 × m/n× (n-1), where m is the number of edges and n is the number of nodes in the network. Figure 8 and Figure 9 show the trends of network properties with different network sizes for netSVM and SVM, respectively. From the figures we can see that netSVM results in much denser subnetworks than does SVM for the top ranked genes. Figure 10 shows the number of overlapped genes in the top ranked genes from van de Vijver et al. [23] and Wang et al. [24]. netSVM results in more overlapped genes in the top ranked subnetworks than SVM, indicating that a good reproducibility can be obtained by using netSVM across different data sets.
To obtain a more detailed comparison and understanding of the subnetworks identified by SVM and netSVM, we selected the top 50 genes (p-value threshold 0.05) to check the subnetworks from van de Vijver et al. [23] and Wang et al. [24]. For SVM, 20 genes (17 edges) in van de Vijver et al. [23] and 18 genes (15 edges) on Wang et al. [24] are connected to form subnetworks. Only 5 genes overlap in the two subnetworks. For netSVM, 47 genes (100 edges) on van de Vijver et al. [23] (shown in Figure 11) and 49 genes (131 edges) on Wang et al. [24] (shown in Figure 12) are connected to form subnetworks. Moreover, 17 genes overlap in the two subnetworks. We further input these gene lists to the DAVID database [25] for functional annotation and pathway enrichment analysis. 'Pathways in cancer' is highly enriched in two subnetworks identified by netSVM (Benjamini p-value = 2.1 e-12 on van de Vijver et al. [23]; Benjamini p-value = 4.6 e-21 on Wang et al. [24]), which is much more significant than those obtained with SVM (Benjamini p-value = 0.12 on van de Vijver et al. [23]; Benjamini p-value = 1.1 e-6 on Wang et al. [24]). The networks are shown in Figures 11 and  12 as displayed by the Cytoscape software [26,27]. Figures 11 and 12 show three major components in each subnetwork and they are quite similar. The first Figure 6 Overlaps of the top 50 genes selected from van de Vijver et al. [23]and Wang et al. [24]using different methods. The overlap is defined as the number of interaction genes divided by the number of union genes in two gene lists. component contains common (or shared) genes of SRC, CHUK, CASP7, HDAC1, MDM2, NFKB1A and JAK2 (the right panels in Figure 11 and Figure 12). The major functional annotations for these genes are apoptosis (p = 5.7 e-7), response to cytokine stimulus (p = 2.4 e-5), chemokine signaling pathway (p = 1.6 e-7) and JAK-STAT signaling pathway (p = 3.4 e-06), as estimated using the DAVID database [25]. The second component includes genes of FN1, CAV1, TGFBR1, MAPK1, MAPK14, MAPK3, SMAD4, and PXN (the left panels in Figure 11 and Figure 12), which are enriched in regulation of apoptosis (p = 8.1 e-7); regulation of growth; regulation of cell proliferation (p = 1.1 e-7); TGF-beta signaling pathway (p = 1.8 e-6) and MAPK signaling pathway (p = 3.1 e-5). For the remaining genes, one is centered by TP53 ( Figure 11) and another is centered by AR and BRCA1 (Figure 12) in the nucleus. Both components are enriched in regulation of cell cycle (p = 7.0 e-5).
The significant genes in the subnetworks from the Wang and van de Vijver data sets potentially represent a strong prognostic signature in breast cancer. The functions of most of these genes are related to biological pathways already known to be involved in disease progression, such as apoptosis, cell cycle and cell proliferation, and these functional results are consistent with ones discovered in the original studies [23,24]. Importantly, some gene itself may not show differential expression between two phenotypes, but may play an important role in interconnecting other differentially expressed genes in PPI network [14]. Therefore we consider the genes with high degree of interactions in PPI network as hub genes. Our proposed method can highlight several hub genes and signaling pathways that were not identified in the original studies, such as MAPK, TGF-beta, and JAK-STAT signaling pathways (see Figures 11(b) &12(b)). The subnetworks from two data sets have been extensively studied in Chuang et al. [14], where many subnetworks are functionally related to signaling of cell growth and survival, cell proliferation and replication, apoptosis, metabolism, etc. However, with the limitation inherited from a local search, many subnetworks only contain a small number of genes, which makes it difficult to gain a global picture of underlying biological mechanisms. This is especially problematic for signaling pathways, because signaling pathways are considered to be more global (from membrane to cytoplasm and to nucleus) rather than local protein interactions. As a comparison, the networks identified by netSVM are more related to signaling pathways; and the genes in the networks are likely to be associated with diverse cellular locations ranging from extracellular matrix, plasma membrane, cytoplasm and nucleus (Figures 11(b) &12(b)).
In estrogen receptor-positive breast cancer, MAPK activation is robustly increased during ligand (estrogen)independent cell proliferation resulting from long-term estrogen deprivation [28], and combined inhibition of the Ras/MAPK and Notch signaling pathways is being explored as a potential new modality for breast cancer treatment [29]. A previous study [30] has also shown that MAPK inhibition in estrogen receptor-negative breast cancer cell lines can restore estrogen receptor expression and growth inhibition by the antiestrogen Tamoxifen. Recent studies have further shown that activation of MAPK signaling pathway could mediate response to Tamoxifen in breast cancer patients [31] and the combination of MAPK and PI3K inhibitors is an effective strategy to overcome endocrine therapy resistance [32]. Transforming growth factor-beta (TGFbeta) is often considered a tumor suppressor, which is implicated in many types of human cancer including breast cancer [33]. However, other recent studies have shown that TGF-beta signaling may positively influence the metastatic cascade in breast cancer by enabling cells to become motile and enhancing the ability of cells to survive clearance from the lungs during the metastatic process [34]. Regulation of JAK-STAT signaling is highly complex and involves cross-talk with numerous other signaling pathways. For example, the functions of activated STATs can be altered through association with other transcription factors such as c-Jun, c-Fos, NF-Kap-paB, SMAD, SP1, p300, CBP, BRCA1 and MCM5 [35]. Furthermore, STAT1 [36], STAT3 [37] and STAT5 [38] have all been shown to play important roles in endocrine-resistant breast cancer.
As a final note, the prediction accuracy from two data sets is not high enough for recurrence prediction of breast cancer for clinical applications. This limitation is a challenge to the field, which is largely caused by the sample heterogeneity, complexity of breast cancer and experimental noise in microarray data. However, our method can achieve a comparable performance with other network-based methods. Besides, our method can identify important network biomarkers that are functionally related to breast cancer, aiming for a mechanistic understanding of breast cancer. The networks and enriched pathways identified from two data sets have shown that there is a convergent point at the functional level even with a large discrepancy observed at the gene expression level.

Conclusion
In this paper, we have developed a novel method (netSVM) for cancer biomarker identification that incorporates gene-gene interaction information. This network information has been explicitly formulated as a Laplacian matrix and embedded into the objective function of SVM for optimization. Therefore, the contribution of hub genes to the classification hyperplanes of SVM is greatly enhanced, even when these hub genes are not significantly differentially expressed between the two phenotypes. Our method for subnetwork identification in simulated and breast cancer data shows significantly improved reproducibility of prediction performance across different data sets when compared to other network-based methods and gene-based methods. Finally, several signaling pathways revealed by netSVM have high functional relevance to breast cancer, and these may provide us new insight into the underlying mechanism of breast cancer progression and metastasis.
The proposed method works under the assumption that hub genes usually have little expression changes, thus to help improve the generalizability across different data sets by integrating network information. The method may not achieve an improved performance if the assumption is violated. In addition, since the proposed method utilizes protein-protein interaction data as prior knowledge, the performance largely relies on the correctness of prior knowledge. Therefore in future, it is necessary to assess the influence of prior knowledge onto the method. Meanwhile, it is desirable to incorporate more sophisticated network identification approaches into this method to improve the prediction accuracy for clinical applications.
Although we focused on a breast cancer study in the paper, the proposed method can be generalized to different applications (e.g., studying drug resistance in breast cancer) or other cancer studies (e.g., ovarian  Methods Support vector machine (SVM) is a classification scheme that addresses the general case of nonlinear and non-separable classification tasks efficiently. The goal of an SVM is to find a hyperplane that maximizes the width of the margin between the classes and at the same time minimizes the empirical errors. Since the coefficients in weight vector correspond to real genes for linear SVM, we will focus on discussing the network-constrained SVM for linear case only in order to have a clear biological interpretation of those significant features (i.e., genes).

Support vector machine
Given a training sample set (x 1 , y 1 ),..., (x l , y l ) with p features and l samples, where x i R p and y i {-1 + 1}, the SVM learning algorithm aims to find a linear function of the form f(x) = b · x + b, with b R p and b R such that a data point x is assigned to a label +1 if f(x) > 0, and a label -1 otherwise. The linear SVM classifier can be obtained by solving the following optimization problem: where the slack variable ξ i > 0 denotes the difference of sample i to the required functional margin. The sum of ξ i can be seen as an upper bound of the empirical risk. And the regularization constant C > 0 determines the trade-off between 1/2||b||² (the complexity term) and the sum of ξ i .
By introducing non-negative Lagrangian multipliers a i , the above optimization problem is equivalent to maximizing the dual Lagrangian function with respect to a i in Equation (2): This is a quadratic programming problem and the solution to Equation (2) gives that β = can be simply computed with any training point such that equality holds in Equation (1).

Network-constrained SVM
Consider a gene network that is represented by a graph G = (V, E, W), where V is a set of vertices that correspond to p genes, E = {u~v} is a set of edges indicating that gene u and v are linked on the network and W is the weights of the edges. The degree of a vertex v is where w(u, v) indicates the weight of edge u~v. For this application, the weights could represent the probabilities of having edges between two vertices. Following Chung et al. [39], we define the Laplacian matrix L of G with the uv th element to be: This matrix is symmetric and non-negative definite and its corresponding eigenvalues or spectra reflect many properties of the graph as detailed in [39].
We define the network-constrained SVM given nonnegative parameter l as follows: Compared to Equation (1), the only difference is that we add one more regularization term lb T Lb into the objective function. We already know that the first regularization term is designed to maximize the width of the margin between two classes. We will thus focus on discussing the meaning of the second regularization term.
Note that L can be written as L = SS T , where S is the matrix whose rows are indexed by the vertices and whose columns are indexed by the edges of G such that each column (corresponding to an edge e = {u, v}) has an entry w(u, v)/ √ d u in the row corresponding to u, an entry − w(u, v)/ √ d u in the row corresponding to v, and zero entries elsewhere. Therefore we can see that b T Lb can be re-written as From this representation we can understand that the added regularization term lb T Lb imposes the smoothness of parameters (coefficients) b over the network via penalizing the weighted sum of squares of the scaled difference of coefficients between neighboring vertices in the network.
It is worth noting that the network-constrain SVM is different from Laplacian SVM [40]. Network-constrained SVM imposes smoothness for weight vector b, while Laplacian SVM imposes smoothness for Lagrangian multipliers a. In Laplacian SVM, it assumes that the data of each class, which follow a manifold and decision function must avoid passing through the manifold. In network-constrained SVM, the underlying assumption is that the genes highly connected in a network have synergistic effect and they should be considered together rather than individually.
Next, we will discuss how to solve the problem of Equation (4). Here we propose a simple algorithm by reducing it to a conventional SVM optimization problem. Since L is symmetric and semi-positive definite, Equation (4) can be represented as where, when P = U 1/2 Further with the definition of b* = P T b, the problem in Equation (6) can be reduced to where x * i = ((P T ) −1 ) T x i . Therefore, this optimization problem can be solved by its corresponding dual problem similar to Equation (2). The solution gives that β * = l i=1 α i y i x * i and we can recover b through b = (P T ) -1 b*. Note that l is a parameter that can be optimized through cross validation in practice.

Significance analysis of subnetworks defined by netSVM
From the input network, we want to know which parts of the network are significantly contributing to the decision boundary for classification. As is shown in the Equation (4), the larger the absolute value of an element in coefficient vector b, the more important the corresponding gene is. Based on the clinical outcome information, we design a significance test to evaluate the significance of each gene in the network and then significant subnetworks can be determined by those genes whose p-values are less than some predefined threshold. For each gene i in the network, we take its absolute value of coefficient b i as a summary statistic. To form a null distribution, we randomly permute training sample labels, and learn the coefficient vector b 0 using networkconstrained SVM on the training samples with permuted labels. The procedure is repeated B times, and all the corresponding absolute values of b i 0 will be used to form the null distribution. The p-value of gene i can be calculated as follows: Simulation of microarray gene expression data We modified a Markov random field (MRF) model in [17] to embed differentially expressed subnetwork/genes in a PPI network given a ground truth subnetwork. Let S be a binary vector indicating the differential expressed states of genes in a PPI network G, 0 representing 'equally expressed' ('EE')and 1 representing 'differentially expressed' ('DE'). Assume the ground truth differential subnetwork is G 0 , which means S {G0} = 1 and S {G-G0} = 0. We sample the gene state according to the following probability based on Markov random field model: In the original model, μ i (1-k) denotes the number of neighbors of gene i having state 1-k, k = 0, 1. g k and c are the parameters predefined. In order to introduce different level of false positives in the sampled differential subnetwork, we added one parameter to control the probability of keeping initial states of ground truth DE genes and background EE genes. Here we define μ i (1-k) as a function of parameter ω as follows: , where S 1 = S, S 0 = 1 − S.
The larger ω is, the more consistent the simulated DE genes and ground truth genes are. Therefore we can vary ω to generate different simulation gene expression data sets with different levels of consistency.
Then, we simulated gene expression data X given S using a Gamma-Gamma (GG) model [18,41]. In the GG model, the observed variable x (gene expression level) is a Gamma distribution having shape parameter a>0 and scale parameter c g , with a mean value μ g = ac g . Its probability density function is: p(x|α, χ g ) = x α−1 exp{−x/χ g } χ α g (α) .
In the above equation, the scale parameter c g has a Gamma distribution with shape parameter a 0 and scale parameter ν. Given these three parameters, we can simulate gene expression levels in two conditions with multiple replicates. Particularly for this study, we assume that equally expressed gene has same expected mean value for all samples and differentially expressed gene has different expected mean values for samples in different conditions. We fist sampled the scale parameter c g based on Gamma distribution (a 0 , ν) and then sampled gene expression levels using parameters (a, c g ) given the states of genes.