Network-based logistic regression integration method for biomarker identification

Background Many mathematical and statistical models and algorithms have been proposed to do biomarker identification in recent years. However, the biomarkers inferred from different datasets suffer a lack of reproducibilities due to the heterogeneity of the data generated from different platforms or laboratories. This motivates us to develop robust biomarker identification methods by integrating multiple datasets. Methods In this paper, we developed an integrative method for classification based on logistic regression. Different constant terms are set in the logistic regression model to measure the heterogeneity of the samples. By minimizing the differences of the constant terms within the same dataset, both the homogeneity within the same dataset and the heterogeneity in multiple datasets can be kept. The model is formulated as an optimization problem with a network penalty measuring the differences of the constant terms. The L1 penalty, elastic penalty and network related penalties are added to the objective function for the biomarker discovery purpose. Algorithms based on proximal Newton method are proposed to solve the optimization problem. Results We first applied the proposed method to the simulated datasets. Both the AUC of the prediction and the biomarker identification accuracy are improved. We then applied the method to two breast cancer gene expression datasets. By integrating both datasets, the prediction AUC is improved over directly merging the datasets and MetaLasso. And it’s comparable to the best AUC when doing biomarker identification in an individual dataset. The identified biomarkers using network related penalty for variables were further analyzed. Meaningful subnetworks enriched by breast cancer were identified. Conclusion A network-based integrative logistic regression model is proposed in the paper. It improves both the prediction and biomarker identification accuracy. Electronic supplementary material The online version of this article (10.1186/s12918-018-0657-8) contains supplementary material, which is available to authorized users.

L 1 penalty to the original least square problem, which leads to the sparsity of the coefficients and thus achieves the variable selection goal. Based on this idea, several other variable selection methods are proposed, such as the sparse logistic regression [9], sparse partial least square regression [4], sparse partial least square classification [3]. Due to the high co-linearity of some covariates, these methods may select different variables that have similar effects on the responses. To take into account this issue, elastic net model adds both the L 1 and L 2 norm as the penalty of the coefficients [10]. With a balance of both norms, the highly correlated variables can be selected together. This model is further extended to the network constraints with sparsity [11][12][13][14]. In [11], the L 2 norm in elastic net model is changed to a Laplacian term, which penalizes the variables that have connections in a given network, such that the coefficients of these variables tend to be the same. In [12,13], the coefficients in the Laplacian term are replaced by their absolute value, which considers the case that the highly correlated variables have opposite contributions to the response. Different computational algorithms are given in these two papers. By adding the network constraints, the variables having high correlations or connections can be selected together, which reduces the effect of co-linearity, and thus improves the variable selection robustness.
Though such a lot of methods have been developed for variable selection, these models are mainly for one single dataset. Due to the small sample size relative to the large number of variables and the batch effects in different platforms or different laboratories, the biomarkers inferred from one dataset often suffer a lack of reproducibilities. As a potential solution to such problems, integrative analysis is a cost-effective option, since many genomic databases are nowadays publicly available. For example, the public functional genomics data repository NCBI GEO has more than 2.5 billion samples on more than 18 thousand platforms [1]. Here, integrative analysis means combining the data or information from multiple independent studies that are designed for the same biological or medical problems in order to draw more reliable conclusions, though some integration methods focusing on incorporating different data types have been developed [15][16][17][18][19][20][21][22][23]. To this purpose, there mainly exist two types of approaches: analysis by data merging and meta-analysis. The merging approach integrates the same data type after transforming the original data to numerically comparable measures or correcting the confounder factors first [24][25][26][27][28], while the meta-analysis approach combines the results of individual studies at the interpretative level [29][30][31][32][33][34][35][36][37]. In the data merging approach, the first step is to do cross-platform normalization or confounder correction, followed by the variable selection methods for one single dataset. Compared to data merging approach, meta-analysis is more complex and has taken into account more factors in the integrative process. The key issues for guiding conducting a meta-analysis of gene expression microarray datasets has given in [30]. An early R package for implementing metaanalysis is in [29], which has been widely applied such as the work in [34,37]. Later on, several methods were proposed on meta-analysis. Ma et al. first proposed a Meta Threshold Gradient Descent Regularization (MTGDR) approach [31], then they developed a 2-norm group bridge penalization approach such that the markers with consistent effects across multiple studies can be identified [32]. They further proposed a sparse boosting for marker identification [33]. Li et al. proposed meta-lasso (MetaLasso) method for variable selection in meta-analysis, which used a hierarchical decomposition on regression coefficients to identify important genes, and kept the selection flexibility across different datasets [36]. These methods are all based on logistic regression for selecting the genes in microarray datasets, without considering the gene-gene interactions or the high correlations between the genes. Though the work [35] presented a statistical framework for identifying the differential co-expressed gene pairs as markers, they did not consider the general gene-gene interactions.
In this article, we investigate the integrative analysis of multiple datasets from different platforms or laboratories that are designed for the same biological questions. We propose a penalization approach based on logistic regression for biomarker selection. The penalization includes two parts: penalization of the sample relations and penalization of the variables. Penalization of the sample relations defines a new penalty as the the function of the sample relation network, and aims to make the regression coefficients for the samples from the same source be the same while allowing the heterogeneity across different datasets. The advantages of taking into account the sample relation network in general regression have been addressed in [38]. The penalization of the variables takes advantage of the recent development on network constraints penalization methods in single dataset such that the variables having high correlation or given connections can be selected together, which cannot be easily extended to from the current integrative models [31-33, 36, 37]. Numerical experiments on both simulated datasets and real datasets show the performance of our formulation.

Methods
We assume the variables are measured in M different experiments with M > 1. Let X m denote the measurement of the variables in the m-th experiment, which is an N m × p matrix with N m being the sample size and p being the number of variables. We let X m i denote the i-th row in X m . Let Y m denote the clinical outcomes in the m-th experiment, which is a vector of binary values representing case/control state or different disease states, and Y m i be the i-th entry of Y m . We let X = X 1 T , X 2 T , · · · , X M T T be the values of the variables for all the samples, and Y = be the clinical outcomes for all the samples. Let X i denote the i-th row in X, and Y i the i-th entry in Y, correspondingly. N = M m=1 N m is the total number of samples for all the considered datasets.
We first consider the logistic regression model for each single dataset. Let p m i = P Y m i = 1|X m i denote the probability that the sample i in the m-th experiment has the outcome 1. Then the logistic regression model can be formulated as: . To obtain β m 0 , β m , we can maximize the log-likelihood function, which can be formulated as a minimization problem as follows: where To do biomarker identification using logistic regression model, different penalties have been proposed to add to (1), which can be formulated as: where λ is a parameter to control the importance of the regularization term.
When α = 0, it corresponds to ridge penalty, when α = 1, it corresponds to the LASSO [2], and when 0 < α < 1, it corresponds to the elastic net (Enet) [10]. Later on, Li et al. proposed the network penalty (Network) [11], where P α (β m ) = 1 2 (1 − α)(β m ) T Lβ m + α β m 1 , and L is the normalized Laplacian matrix for a network measuring the connections or the correlations of the variables. With this model, the connected variables in the network tend to be selected together. This penalty is further extended to P α (β m ) = 1 2 (1 − α)(|β| m ) T L|β| m + α β m 1 to tackle the case when the highly correlated variables have opposite contributions to the response (Abs-Network) [12,13].
When we identify the biomarkers from multiple datasets generated for the same biological question, different biomarkers may be selected when we do the experiments in each individual dataset. This may be due to the batch effects from different platforms/experimental conditions, or the high co-linearity among the variables. In reality, the same variables should contribute to their corresponding response equally. Thus we assume β 1 = β 2 = · · · β M = β. To estimate these parameters, direct merging all the datasets together is one choice. However, it cannot explain the heterogeneity of different datasets. To explain the heterogeneity, and to make the final response appear with high probabilities, we set the constant term in the model to be different for different samples. We let β m 0,i be the constant term corresponding to the sample i in the m-th experiment. Due to the homogeneity within the same dataset, all the parameters should be the same for the same dataset. Thus we add constraints to make the model satisfy this condition. Our formulation now becomes: μ is a parameter to control the importance of the penalty term β T 0L β 0 .L is the Laplacian matrix for the sample relation network, where if two samples are from the same dataset, we assign an edge between them, otherwise, there is no edge. By minimizing the term β T 0L β 0 , β m 0,i will tend to be the same for different i and a fixed m, and it depends on m when sample i is from different experiments. This penalty helps make the constant term in logistic regression be the same for the same dataset, and allowing the differences across different datasets.
To solve the optimization problem (3), we notice that its formulation is similar to (2), except the constraint on β 0 . Thus we can apply similar methods to the one solving (2). In [9,39], proximal Newton method is applied to solve the problem (2) [40]. This method mainly includes two steps: first a Newton step is applied to the log-likelihood term to get a temporal point; then the original optimization problem is approximated at this point by a quadratic function with the original penalty kept. Usually this quadratic optimization problem can be solved efficiently. Using the same technique, for our formulation (3), we first derive a temporal point with Newton method for the log-likelihood term. Different from (2), here β 0 is a vector of size N. To take advantage of the standard logistic regression, we letX = (X, Now we can use standard Newton method to get a new pointβ temp by computing: Here X T WX is the Hessian matrix for the likelihood function.
The quadratic approximation problem to be solved is: where H =X T WX. It is equivalent to the following problem: To solve problem (6), we refer to the coordinate descent algorithm [41]. Given β, the objective function is convex and smooth with respect to β 0 , thus we can compute the elements in β 0 simultaneously. Given β 0 , the objective function is nonsmooth when L 1 penalty exists, and may be convex or not depending on the penalty term for β. When we use the Abs-Network penalty, it is nonconvex. However, as described in [41,42], this term follows the regularity condition, which implies that if moving along all coordinate directions fail to decrease the objective function, it arrives at the local minimum. Thus we can use the coordinate descent algorithm. In the following, we derive the computation procedure for (6) with Abs-Network penalty. For other penalties mentioned above, the computation procedure is similar. Given β, β 0 should satisfy the equation: which is obtained by computing the partial derivative with respect to β 0 . Given β 0 , we compute β using cyclic coordinate descent, which can be computed using: where Here, we useβ 0 ,β j to denote the fixed parameters in the coordinate descent process. S(u, v) is a soft-thresholding function defined as: denotes the (k, j) entry in the network adjacency matrix for the variables, d j denotes the degree of the j-th variable in the network of variables. Algorithm 1 shows the full process of computingβ. For other penalties, we can infer the algorithm similarly.

Algorithm 1 Network-based integrative logistic regression Input:
Information for all the considered samples: X, Y; Normalized network Laplacian for the variables: L; Network Laplacian for the datasets information:L; Parameters: λ, μ, α.

Output:
Coefficients in logistic regression model:β 0 , β; repeat Compute Z using (4); repeat Compute β 0 by solving the linear equations (7); for k = 1 to p do Compute β k using (8); end for until The objective function in (6) converges until The objective function in (3) converges return β 0 , β; After the computation, we can get β and β 0 . Then we average the value β m 0 to get an estimate of the constant term for the data in the m-th experiment, and do the prediction.

Results
In this section, we first evaluate the proposed integrative logistic regression model using simulation studies, we then apply the method to multiple gene expression datasets for studying breast cancer metastasis.

Simulation study
The experiments are designed to classify the case/control samples using gene expression datasets. We simulated the gene expression datasets using the similar method as that in [11]. Suppose we have n TF transcription factors (TFs) and each regulates n RG genes. The resulting regulatory network includes n TF + n RG genes and the edges between each of the TFs and the regulated genes. We assume four TFs and the genes that they regulate are related to the response Y. We generated the input variables using the following distributions: • The expression levels for the n TF TFs follow standard normal X TF j ∼ N(0, 1); • The expression levels of the TF and the gene that it regulates are jointly distributed as a bivariate normal with a correlation of 0.7, which implies that conditioning on the expression level of the TF, the regulated gene expression level follows normal distribution: N(0.7X TF j , 0.51).
We designed two settings of the regression coefficients. The first one is shown in (9), The constant term β 0 is set to be different in multiple datasets. Here, we generated four different datasets, and the mean of β m 0 :β m 0 for m = 1, 2, 3, 4 is set to be − 3, − 1, 1, and 3. β m 0,i for each sample i follows N β m 0 , 0.5 . In this case, when integrating the four datasets, one main concern for predicting Y is the batch effects shown in β 0 . Y i is generated following Bernoulli distribution with the probability P(Y i = 1|X i ).
For each setting, we generated 100 training samples and 100 test samples for the four datasets. We set μ = 1 and α = 0.5 directly except the LASSO penalty, and used 5-fold cross validation (CV) to train the model and got the parameter λ. Then we applied the model obtained using the full training set with the parameters that gave the best AUC (area under ROC curve) to see the prediction results in the test set. We took prediction sensitivity, specificity, accuracy, AUC, and the variable selection precision, recall, F 1 score to measure the prediction and variable selection results of the model. The variable penalty term is set to be LASSO [2], elastic net (Enet) [10], network constraint (Network) [11], and network-regularized penalty using absolute value of the coefficients (Abs-Network) [12]. We note that better results are expected if we use CV to choose all these three parameters (μ, α, λ) together at the cost of more parameter tuning computation time.
To evaluate the performance of the proposed method, we compared it with the methods without integration, direct data merging, and MetaLasso [36]. For the results obtained without integration, we trained the model in each of the four training datasets separately, and predicted the samples in all the four test sets. We then recorded the best result among the four. For direct data merging, we merged the four training sets as the training set, and the four test sets as the test set, followed by the penalized logistic regression. We implemented the R package 'MetaLasso' for MetaLasso [36]. We added the prefix 'Int-' to denote the method after integration using the corresponding penalty, and 'Merge-' to denote a direct merging of all the datasets. We implemented the whole computation process for 30 times for each method, and computed the mean and standard deviation (sd) values of the seven evaluation measures. Tables 1 and 2 show the prediction and variable selection results for setting 1, and Tables 3 and 4 show the results for setting 2. We highlighted the highest values for each measure. For the prediction results, it is clear that adding network constraints to integrate multiple datasets in the model improves the results under both  β is shown in (9), β 1 0 ,β 2 0 ,β 3 0 ,β 4 0 = (−3, −1, 1, 3) The maximum value for each measure is highlighted using boldface font of our simulation settings. Normally, direct data merging outperforms the methods without integration, and integration outperforms direct merging. For MetaLasso, it only uses the LASSO penalty, and outperforms LASSO and Merge-LASSO. But it is not as good as our proposed integration method using LASSO penalty. This shows that our integration technique can capture more information in multiple datasets. For the variable selection results, the highest F 1 score is achieved using Int-Abs-Network, though Merge-Network and Merge-Abs-Network achieve β is shown in (9), β 1 0 ,β 2 0 ,β 3 0 ,β 4 0 = (−3, −1, 1, 3) The maximum value for each measure is highlighted using boldface font  the highest precision. This should come from the fact that data merging enhances the signal of the important variables and variable network constraints combine the related variables. However, merging may miss some associated genes due to the heterogeneity across multiple   −1, 1, 3) The maximum value for each measure is highlighted using boldface font datasets, which may lead to a lower recall in some experiments. Compared to it, Int-Abs-Network performs more robust and gets the highest F 1 score.

Real data study
We downloaded two datasets GSE2034 and GSE1456 from Gene Expression Omnibus (GEO). These two datasets were generated for studying breast cancer metastasis. The information of the samples is shown in Table 5. Both datasets were measured on Affymetrix HGU133 microarrays, and each dataset includes 22283 transcripts. We combined those probes corresponding to the same gene using their mean value as the gene expression level. We then downloaded the protein-protein interaction (PPI) data from https://thebiogrid.org for the humans, and removed the genes that have no information in the PPI network. We chose 2000 genes with the largest variance from each dataset and took their intersection as our considered gene set. Finally a total of 1456 genes were selected. For each gene, we imputed the missing value using the mean value of the gene, and normalized the expression of each gene.
We applied three types of methods to these two datasets. The first type of method applied logistic regression model with the four penalties to the merged data directly. The second type is MetaLasso, and our proposed method is as the third type. When using our method, we set μ = 1 and α = 0.5 except LASSO penalty. We selected the parameter λ using 3-fold cross validation between 0.02 to 0.1 with a step size 0.02 and got the AUC under ROC curve. We then trained the model using the whole dataset and got the biomarkers. Table 6 shows the mean of AUCs and its standard deviation when doing CV. It's clear that our method outperforms direct data merging. MetaLasso achieved the AUC value 0.62 with sd 0.02, and it selected three genes as the biomarkers. In [14], several biomarker identification methods have been applied to these two datasets separately. The best AUC for GSE2034 is 0.690 and 0.736 for GSE1456, respectively. The stability for the selected genes measured using Jaccard index is about 0.2,  which means the intersection of the selected genes in both datasets over their union is about 0.2. This shows the instability of biomarker identification from different datasets. With our method, using the same biomarkers, we can get a comparable AUC 0.70. We note that, if we tune all the three parameters λ, α, μ together, we may get better results at the cost of more computational time.
As we know, when LASSO penalty is applied, less genes will be selected, while when the elastic penalty is added, the correlated genes can be selected. By adding the PPI network related penalties, we aim at finding the subnetworks that cooperatively contribute to the disease development. Table 6 also shows the number of the selected genes for different penalties. Since the development of disease is a very complex process, subnetwork biomarkers are reasonable. We presented the subnetwork biomarker identification using Abs-Network penalty in the following.
When using Abs-Network penalty, 270 genes were selected, among which there are 30 connected subnetworks. We put all the 30 subnetworks in Additional file 1. Figure 1 shows the connected components. We also did gene ontology (GO) enrichment analysis and KEGG pathway enrichment analysis for these subnetworks using "clusterProfiler" [43]. Twenty one of all these networks are enriched by GO: CC, MF and BP, and KEGG pathway. We put all these enrichment results in Additional files 2, 3, 4, and 5. One typical subnetwork is shown in Fig. 2. This subnetwork is enriched by KEGG pathway hsa05224: breast cancer with an adjusted p-value 10 −4 . There are 7 genes in this subnetwork, of which three genes ("GSK3B" , "CTNNB1" , "PIK3CA") are associated with breast cancer. And these three genes are also associated with some other cancers such as endometrial cancer, colorectal cancer, prostate cancer, and others. GSK3B interacts with CTNN1, while it interacts with PIK3CA through BEX1.

Discussion
Biomarker identification has been a hot research topic for several years. Many mathematical and statistical models and algorithms have been proposed to tackle this problem. However, due to the heterogeneity of datasets from different platforms or laboratories, the biomarkers inferred from one single dataset are lack of reproducibilities. Even different datasets are generated for the same purpose, the intersection of the inferred biomarkers is small. This motivates us to consider the integrative methods for robust biomarker identification using multiple datasets.
In this work, we assumed the regression coefficients of the variables for multiple datasets are the same. The differences in predicting the response of each sample Fig. 2 One identified subnetwork biomarker using network-based integrative logistic regression with Abs-Network penalty lie in the constant term. To combine the information from different samples, we added the network penalty to make the constant term within the same dataset be the same. To achieve the biomarker identification purpose, we added more penalties than LASSO, such as network related penalties to select the subnetworks as biomarkers. We then developed proximal Newton method to solve the optimization problem, and gave the detailed formulations for the Abs-Network penalty. Algorithms for other formulations can be easily inferred. Since this algorithm involves solving linear equations, it is slower than that solving the model without integration term. Developing faster algorithms so as to apply the model to larger dataset is very important.
We applied the proposed model to both simulated datasets and real datasets. For the simulation study, it is not easy to make simulations similar to the real data. We tried two different settings to see the performance of the model. Both experiments gave reasonable results. In the real data study, we integrated two breast cancer gene expression datasets. We compared the results with direct merging the datasets and MetaLasso, and we checked the existing works on biomarker identification and prediction in each dataset separately. Our method performs much better than direct merging and Meta-Lasso. And it achieved results comparable to the best results in each single dataset. All these results show the good performance of our proposed method. In our model, we assumed the sources of the test dataset are included in that of the training dataset, thus when we do prediction, we can directly use the corresponding constant term in logistic regression. This limits the application of the proposed model for the datasets whose sources are not known.
In this study, we tested our method in only one real data setting. Other datasets may not have the same properties as our tested datasets. Thus applying our model to more real datasets, and incorporating more information to the model so as to improve the prediction accuracy is one of the future works.

Conclusions
In this work, we proposed an integrative method for classification based on logistic regression model. By adding a network-based penalty for the constant term in logistic regression for the samples from different datasets, both the homogeneity within each dataset and the heterogeneity between different datasets are kept. After adding network related penalties besides LASSO, subnetwork biomarkers can be identified. In both simulation datasets and the real datasets, the proposed method shows good performance. This method may help better identify the biomarkers by integrating multiple datasets.