Identifying drug-pathway association pairs based on L2,1-integrative penalized matrix decomposition

Background Traditional drug identification methods follow the “one drug-one target” thought. But those methods ignore the natural characters of human diseases. To overcome this limitation, many identification methods of drug-pathway association pairs have been developed, such as the integrative penalized matrix decomposition (iPaD) method. The iPaD method imposes the L1-norm penalty on the regularization term. However, lasso-type penalties have an obvious disadvantage, that is, the sparsity produced by them is too dispersive. Results Therefore, to improve the performance of the iPaD method, we propose a novel method named L2,1-iPaD to identify paired drug-pathway associations. In the L2,1-iPaD model, we use the L2,1-norm penalty to replace the L1-norm penalty since the L2,1-norm penalty can produce row sparsity. Conclusions By applying the L2,1-iPaD method to the CCLE and NCI-60 datasets, we demonstrate that the performance of L2,1-iPaD method is superior to existing methods. And the proposed method can achieve better enrichment in terms of discovering validated drug-pathway association pairs than the iPaD method by performing permutation test. The results on the two real datasets prove that our method is effective.


Background
Studies of the mechanism of carcinogenesis have led to the implementation that cancer is radically a disease of a variety of genetic aberrations [1]. And at present, the main method to treat cancer is drug therapy. New drug research is an important topic of the drug discovery. And one of the basic research concept of these new drugs is to determine the interaction between drugs and targets. And it can be used to predict candidate drugs, which may act on targets [2]. Besides under the guidance of the concept of pharmacology research and development for new drugs, it can also be used for the relocation of existing drugs, and to forecast the new targets for known drugs [3]. Drug discovery technology is in primary stage, but many related algorithms have been developed to find drug targets. In general, original drug target identification algorithms follow the "one drug-one target" line [4]. The purpose of those methods is to discover the effective drugs, which act on individual targets. It is obvious that those methods do not take into consideration of the relations among genes. Thus, "one drugone target" algorithms ignore related pathways [5]. Generally, many complex diseases are resulted from unique pathway functions rather than individual genes. And the function of drugs is not just aiming at single proteins, but rather affecting the complex interaction of some associated biological pathways [6]. Therefore, identifying drug-pathway associations is a momentous task for quickening the development of drug discovery.
With the rapid development of high-throughput drugs and pathways related data, it is feasible for researchers to infer drug-pathway interactions. A large amount of studies have utilized the drug related data to obtain insights on drug-pathway modes of action [7]. Gene Set Enrichment Analysis (GSEA) is a traditional method to identify drug-pathway associations. GSEA is proposed by Harvard University and MIT's broad institute research group. It is utilized to analyze genome-wide expression microarray and drug related data. You can download it after free register [8], whose website is http://software. broadinstitute.org/gsea/index.jsp. Based on known genepathway association information, the GSEA method can forecast responsiveness of pathways. But the GSEA method does not consider the known pathway information, its identification precision is poor [9]. In order to improve the identification precision and use the prior information, the FacPad method is proposed to predict drug-pathway associations, and it build a sparse Bayesian factor analysis model to infer pathway responsive for drug treatments [6]. In order to further improve the performance of the FacPad method, another Bayesian model named "iFad" is developed to discover the novel drugpathway associations [10]. And Ma et al. apply the iFad method to analyze gene expression and drug related data from the NCI-60 cell lines. The NCI-60 cell lines is from the NCI-60 project, which provides useful information for various types of "Omics" characterization of 60 human cancer cell lines with nine different cancer types. The iFad method can discover effective drug-pathway associations. However, its computational costing is expensive since this method applies the Markov Chain Monte Carlo (MCMC) algorithm [11] to perform statistical inferences. At the same time, some prior parameters in the iFad model require to be specified in advance by the investigators. With the rapid development of modern genomics and pharmacology technologies, the dimensionality of the raw data becomes larger and larger, that is, these data have a large number of variables [12]. Thus, the size of sample is also becoming larger and larger. And the computational expense of dealing with the high-dimensional data becomes more expensive. Based on the above problems, an efficient method named "iPaD" is proposed to analyze drug related data [13]. Li et al. use integrative penalized matrix decomposition (iPaD) method to jointly analyze drug expression and drug sensitivity data. And Li et al. apply the iPaD method to the Cancer Cell Line Encyclopedia (CCLE) and NCI-60 datasets. Compared with the NCI-60 data set, the CCLE data set has the larger sample size. At the moment, the CCLE project has more than 1000 cell lines. Compared with the iFad method, the iPaD method has obvious superiority in computational efficiency. And the iPaD method only has one parameter required to be turned. In addition, the iPaD method applies the L 1 -norm penalty to obtain sparse solutions. However, the sparsity produced byL 1 -norm penalty is too dispersive [14].
In this paper, we impose the L 2, 1 -norm penalty to replace the L 1 -norm penalty on the drug-pathway association matrix. The L 2, 1 -norm regularization penalty can make each row of the drug-pathway association matrix as a whole and produce row sparsity solutions [15,16]. Besides, the L 2, 1 -norm penalty can select the most prominent morphometric variables [17]. In this paper, compared with the iPaD method, our new proposed method has two outstanding advantages: firstly, the L 2, 1 -iPaD method can achieve better performance in identifying validated drug-pathway associations by applying our proposed method to the CCLE and NCI-60 datasets; secondly, in this paper, we also perform permutation test to evaluate the significance of the identified drugpathway associations, the experimental results demonstrate that our proposed method can gain the smaller P-values. Thus, we can obtain that our proposed method can achieve better overall enrichment in terms of identifying drug-pathway association pairs.
In the next subsection, at first, we will describe a novel algorithm named L 2, 1 -iPaD to identify drug-pathway associations. And then we will apply the L 2, 1 -iPaD method on two real datasets (the CCLE and NCI-60 datasets) and give the results of our proposed and iPaD methods. Finally, we will give the conclusions and future work.

Model description
Given a gene expression data matrix Y (1) with the size of N × G (1) and a drug sensitivity data matrix Y (2) with the size of N × G (2) . N denotes the number of samples, G (1) and G (2) denote the number of genes and drugs, respectively. The traditional iPaD method decomposes the gene expression matrix Y (1) into the pathway activity level matrix X ∈ R N × K and the gene-pathway interaction matrix B (1) . K denotes the number of pathways. And the iPaD method decomposes the drug related data matrix Y (2) into the pathway activity level matrix X and the drug-pathway interaction matrix B (2) . The model of iPaD method can be introduced as follows: where E (1) and E (2) denote the error matrices in (1). Then the model (1) can be written as the following form: In general, a drug is associated with a few pathways, therefore, drug-pathway association matrix B (2) is sparse. Based on this fact, in this paper, we propose a novel method to improve the performance of the iPaD method. We employ the L 2, 1 -norm regularization to replace the L 1 -norm regularization in the iPaD method. Then the optimization model of L 2, 1 -iPaD method can be written as follows: where ‖W‖ F denotes the Frobenius norm of the matrix W. The detailed definition of the Frobenius norm can be written as , where w i is the i-th row of the matrix W. ‖W‖ 2, 1 denotes theL 2, 1 -norm of the matrix W. The definition of L 2, 1 -norm is first proposed in reference [18]. And the L 2, 1 -norm has been applied in many research direction such as the feature identification [19,20] and image direction [21,22]. The definition of L 2, 1 -norm can be written as W k k Specifically, we firstly need to calculate the L 2 -norm of the vector w i , and then compute the L 1 -norm of the vector b(w) = (‖w 1 ‖ 2 , ‖w 2 ‖ 2 , ⋯, ‖w m ‖ 2 ) T [23]. The L 2, 1 -norm penalty achieves the rows sparsity of the drugpathway association matrix B (2) . Thus, the irrespective drug-pathway pairs can be abandoned. In addition, L 1 ð Þ i;j ∈ 0; 1 f g is a known prior knowledge matrix with the size of K × G (1) . In the model of L 2, 1 -iPaD method, L 1 ð Þ i;j is used for indicating gene-pathway association matrix B (1) . When L 1 ð Þ i;j ¼ 1 , the i-th pathway will be associated with the j-th gene. When L 1 ð Þ i;j ¼ 0, the i-th pathway will not be associated with the j-th gene. Thus, similar to reference [13], in order to merge the known pathway-gene relationship, we impose the first constraint on gene-pathway association matrix B (1) . Besides, the second constraint on pathway activity level matrix X is used to guarantee that the optimization problem (3) is identifiable.

Optimization algorithm
In this paper, the optimization model (3) is convex, that is, when X is fixed, optimizing gene-pathway association matrix B (1) and drug-pathway association matrix B (2) are both convex optimization problems. And when genepathway association matrix B (1) and drug-pathway association matrix B (2) are fixed, optimizing X is also a convex optimization problem. Thus, in this paper, we optimize X by fixing gene-pathway association matrix B (1) and drug-pathway association matrix B (2) , and optimize gene-pathway association matrix B (1) and drugpathway association matrix B (2) by fixing X.
where Y = [Y (1) , Y (2) ] and B = [B (1) , B (2) ]. We use gradient descent method [24] to solve the problem (4). We first calculate the derivative of matrix X, the detailed computation process can be written as follows: Thus, according to the update formula of gradient descent method, X can be updated by where μ denotes a step size. And then at each iteration, we will project X k + 1 to the feasible region, that is, we Þ . If X k + 1 satisfies this condition, we will perform next step, if not, we will make it as X k + 1 = X k + 1 /‖X k + 1 ‖. In addition, we also apply the Nesterov's algorithm [25] to quicken the convergence speed of this algorithm.
Updating B (1) In this paper, we assume that the relationship of genes and pathways is already known. Similar to [13], we also apply ordinary least squares (OLS) algorithm to solve genepathway association matrix B (1) , that is, we decompose the original problem into G (1) separate OLS problems.
According to the update formula of ordinary least squares algorithm, the gene-pathway association matrix B (1) can be updated as follows: where Y 1 Updating B (2) We observe each column of drug-pathway association matrix B (2) , and decompose the optimization problem into G (2) separate L 2, 1 -norm minimization problems: Note that the problem (9) cannot merge known drugpathway association information. Thus, in order to use prior information, we modify the optimization problem (9) as follows, f g is also an indicating matrix with the size of K × G (2) . It is used to indicate drug-pathway matrix B (2) . Besides, λ is a turning parameter to turn the sparsity of matrix B (2) . Following, we will introduce the optimization process.
We first solve the part that the drug-pathway association B (2) is pointed by L (2) , that is: Note that we omit the notation in the objective function of problem (11). The objective function of problem (11) can be rewritten as the following equation.
where J 1 (⋅) is an auxiliary function and I ∈ R K × K is a unit matrix. Then we compute the derivative of J 1 (B (2) ), and set its result to zero, we have Thus, we can obtain: Then we solve the part that the drug-pathway association B (2) is pointed by 1 -L (2) . According to reference [26], we propose an efficient method to solve this problem. This problem can be described as follows: where D is a diagonal matrix with the i-th diagonal element as: Then we make X 1 = XD -1/2 and B 2 ð Þ ð Þ . Thus, the problem (15) can be rewritten as follows: The objective function of problem (17) can be rewritten as follows: Then we compute the derivative of J 2 B 1 2 ð Þ À Á , and then set its result to zero, we obtain: Thus, we have: Therefore, we can obtain the updating formula of matrix B (2) , that is, B 2 ð Þ ¼ D −1=2 B 1 2 ð Þ . Note that diagonal matrix D depends on drug-pathway association matrix B (2) . We summarize the alternating optimization algorithm for the L 2, 1 -iPaD method in Algorithm 1.

Dealing with missing values
The gene expression data matrix Y (1) and drug related data matrix Y (2) in original data set have a few missing values. In order to strengthen the performance of our proposed method, we need to deal with missing values. Since each column of the gene-pathway association matrix B (1) and drug-pathway interaction matrix B (2) can be solved separately, the missing values in original data set can be removed in the process of updating matrix B (1) and B (2) . However, we treat X as a whole matrix in updating matrix X. It is not easy to handle missing values, directly. Similar to [13], we use the soft-impute algorithm to handle the missing values during the process of updating X. The soft-impute algorithm can solve the incomplete matrix learning problem [27,28]. Following, we will introduce the detailed process for handling missing values in the L 2, 1 -iPaD method.
Firstly, suppose that Ω ∈ {0, 1} is an indicating matrix with the size of N × (G (1) + G (2) ), and the matrix Ω can indicate observed values in the matrix Y (Y = [Y (1) , Y (2) ]). And H Ω is an operator, and when it projects the matrix X onto the space indicated by Ω, it satisfies the following formula: Hence, the optimization problem for X can be expressed as follows: Then let Ω 1 = 1 − Ω, which is used to indicate the missing values in the matrix Y, the problem (22) can be rewritten as: The detailed proving process can be found in [13,27]. The problem (23) means that at every iteration, they will plug into H Ω 1 XB ð Þ for the next iteration. And this is exactly the main thought of the soft-impute method [27].
The specific step of the optimization algorithm is summarized in Algorithm 2.

Parameters selection and significance test
In the problem (3), λ is the only turning parameter, which is used to turn the sparsity of the drug-pathway interaction matrix B (2) . The more important the drugpathway associations is, the earlier the non-zero elements will become. Thus, we set the value of λ from producing the first non-zero elements in drug-pathway interaction matrix B (2) to 0.1. And then we use ten-fold cross-validation to obtain an appropriate λ value. Thus, we find an appropriate λ value according to the smallest residual sum of squares (RSS). The Figs. 1 and 2 show the changing curve for RSS on the NCI-60 and CCLE datasets, respectively. Thus, we can estimate the importance of the identified association pairs by recording the order of the values in the drug-pathway association matrix B (2) in which the values become non-zero. However, this identification method can not assess the significance of identified drug-pathway associations. Therefore, we perform permutation test to assess the significance of identified drug-pathway associations after gaining an appropriate λvalue, and calculate the P -values of every element in the drug-pathway association matrix B (2) . Similar to [13], we also compute P-values by the following equation: where B 2 ð Þ t ð Þ i;j denotes the value of drug-pathway association matrix B (2) in the t-th permutation. T denotes the overall number of permutations. And B 2 ð Þ i;j is the estimated value in the original data.

Results and discussion
In this section, we will show the experimental results on the real datasets, including the CCLE and NCI-60 datasets. And, in order to present the performance of our proposed method, we compare our proposed method with the iPaD algorithm.

Results on the CCLE data set
In this subsection, in order to assess the performance of L 2, 1 -iPaD method, we apply this method to the CCLE data set in [13]. CCLE data set is downloaded from the CCLE project, which can provide public information as for the genomic data, analysis and visualization for about 1046 terms. The CCLE data set is made up of 480 cell lines (usually samples) with transcription data for 1802 genes and drug related data for 22 drugs covering 58 pathways. And those pathways are downloaded from KEGG database. And the drug related data are measured by area over the dose-response curve ("activity area") since activity area can both express the potency and efficacy of chemical drugs. Besides, it has less unknown values [13]. In this paper, the known drug-pathway association pairs are regarded as validation information. And the prior information matrix L (2) is set to a zero matrix. In the iPaD method, the authors perform 2000 permutation to estimate P-values. The smaller the value of P-value is, the stronger the significance of identified drug-pathway association pairs becomes. And to be fair, we also perform 2000 permutation to assess the significance of identified drug-pathway association pairs in our method. Table 1 lists the P-values on CCLE data set for the L 2, 1 -iPaD and iPaD methods. Obviously, the L 2, 1 -iPaD method can mostly obtain smaller P-values than the iPaD method. In Table 1 the superior results are in italic type. Thus, our method is better than the iPaD method in identifying drug-pathway associations. Moreover, the L 2, 1 -iPaD method is a sparse optimization algorithm. Thus, nonzero elements in the drug-pathway association matrix B (2) are regarded as the drug-pathway association pairs. After applying our method to the CCLE data set, we discover that the L 2, 1 -iPaD method can identify 368 drug-pathway pairs, whose p-values are no more than 0.05, and 66 drug-pathway association pairs among them are verified in the CancerResource. But the iPaD method can only identify 88 drug-pathway association pairs, whose p-values are no more than 0.05, and 25 pairs among them are verified in the CancerResource. And then we compute the number of identified drug-pathway association pairs that P-values are no more than 0.005. For the iPaD method, it can identify 51 drug-pathway association pairs, and 16 drug-pathway association pairs among them can be verified in the CancerResource. And our proposed method can discover 53 association pairs with 16 drugpathway associations verified in the CancerResource database. Tables 2 and 3 list the identification and verification rates of drug-pathway association pairs on the CCLE data set for the L 2, 1 -iPaD and iPaD methods. Note that in Table 2, we also compare our method with the iFad method. And the identification number denotes the number of drug-pathway association pairs, which posterior probabilities are no more than 0.9. The number of verification denotes the number of identified drugpathway association pairs, which are validated in the CancerResource. The results of iFad is derived from the Fig. 1 The changing curve for RSS on the NCI-60 data set Fig. 2 The changing curve for RSS on the CCLE data set reference [13]. It is obvious that the performance of our method is better than the iFad method. In Table 1, we can discover that the drug of Nutlin-3 is associated with Chronic myeloid leukemia pathway. And, a study published in [29] said that the drug of Nutlin-3 is a tumor suppresser, which can up-regulate the expression of Notch1 in both lymphoid and myeloid leukemic cells. And we discover that PD-0332991 is a CDK 4/6 inhibitor [30] and can act on chronic myeloid leukemia [31]. In Table 1, LBW242 is associated with Chronic myeloid leukemia pathway, but their association is not validated in the CancerResource. A study published in 2007 says that LBW242 can effect on mutant FLT3-expressing cells in potentiating antileukemic therapies [32]. Therefore, our method can also identify drug-pathway association pairs which are not validated in CancerResource.
We note that combining different cancer types of data can increase the number of samples and can be better to identify common signals from different cancer. But, this operation may weaken the knowledge, which is specific to certain cancer types. Thus, in this paper, we apply the L 2, 1 -iPaD method to the lung cancer data set, which is extracted from the CCLE data set. And we also apply the iPaD method to analyze lung cancer data set. Table 4 lists the identification and validation rates on the lung cancer data set.

Results on the NCI-60 data set
We also apply our method on the NCI-60 data set, which has been used in [10,13]. The NCI-60 data set is from the NCI-60 project, which consists of 60 human cancer cell lines with nine cancer types. The specific data pre-processed process can be found in [10]. The NCI-60 data set is made up of transcription and drug sensitivity data, which are all downloaded from the CellMiner database [33], and can be found from the URL of http://discover.nci.nih.gov/cellminer. This data set contains 57 cell lines from eight different cancer types and 1863 genes covering 58 KEGG pathways and 101 drugs. The drug sensitivity is measured by GI 50 values, which is the minimum concentration of the drug needed to inhibit the growth of 50% [34]. As a consequence, the lower GI 50 values can manifest the drugsensitive response, the higher GI 50 values can manifest  Note: a The results of iFad method are derived from the reference thirteen. And the identification number denotes the number of drug-pathway association pairs, which posterior probabilities are no more than 0.9. The number of verification denotes the number of identified drug-pathway association pairs, which are validated in the CancerResource The superior results are in italic type the drug-resistant response [5]. In the NCI-60 data set, we also use the ten-fold cross-validation to discover an appropriate λ value. And then we also perform 2000 permutations to obtain the P-values, which can estimate the significance of the drug-pathway associations. Table 5 lists the P-values on NCI-60 data set for the L 2, 1 -iPaD and iPaD methods. After applying our proposed method to the NCI-60 data set, we discover that the L 2, 1 -iPaD method can find 562 association pairs, with the P-values no more than 0.05, and 163 pairs among them are verified in the CancerResource. However, the iPaD method can only discover 247 drug-pathway association pairs with the P-values no more than 0.05, and among those drug-pathway associations only 74 drugpathway pairs are verified in the CancerResource. And then we calculate the number of identified drug-pathway association pairs that P-values are no more than 0.005. The iPaD method can find 72 association pairs with 26 drug-pathway association pairs among them validated in the CancerResource database. But our proposed method can identify 89 drug-pathway association pairs with 33 association pairs among them validated in the CancerResource. Tables 6 and 7 list the identification and verification rates of drug-pathway association pairs on NCI-60 data for the L 2, 1 -iPaD and iPaD methods. From Tables 6 and 7, we can prove that our proposed method can discover more drug-pathway association pairs than the iPaD method. Note that in Table 6, we also compare our method with the iFad method. Similar to the results of CCLE data set, the identification number denotes the number of drug-pathway association pairs with posterior probabilities that are no more than 0.9. The number of verification denotes the number of identified drugpathway association pairs, which are validated in the CancerResource. The results of iFad is derived from the reference [13]. In the NCI-60 data set, the performance of our method is superior than the iFad method. In our method results, Cell cycle pathway is related with Tiazofurin, which is a C-nucleoside, is converted in sensitive cells to the active metabolite, TAD, which tightly bound at the NADH site inhibited IMP DH activity [35].
The results of reference [36] may be utilized in cancer chemotherapy to combine Tiazofurin with biologic response modifiers which recruit quiescent leukemic cells into the cell cycle. And Selenazofurin is an IMPDH inhibitor. The reference [37] has introduced that Selenazofurin and Tiazofurin are due to a cell cycle block that causes the cells to accumulate in the S-phase. Lomustine is a kind of anti-cancer drugs. It is associated with Tight junction, which contributes to the barrier property of brain endothelial cells [38]. In Table 5, we can find that Mycophenolic Acid (MPA) is related with the Cell cycle pathway, but their association is not validated in the CancerResource. Similar to the CCLE data set, we also use published literatures to prove their associations. The authors in [39] demonstrate that in peripheral blood lymphocytes, MPA can lead to an inhibition for the cell cycle proliferation. As a consequence, in the NCI-60 data set, our method can also infer drug-pathway association pairs, which are not validated in the CancerResource.

Conclusions
Drug-pathway association identification is an important issue in pharmacology. In this paper, we develop an effective algorithm named "L 2, 1 -iPaD" to discover novel drug-pathway associations. In the optimization model, the objective function has only one turning parameter λ. Thus, our proposed method is nearly turning-free. To find the best performance of our method, we apply tenfold cross-validation to discover an appropriate λ value. And to estimate the significance of the identified drugpathway association pairs, we perform permutation test to calculate the P-values. For the purpose of assessing the performance of the L 2, 1 -iPaD method, we apply this method in the CCLE and NCI-60 datasets. The experimental results in the CCLE and NCI-60 datasets demonstrate that our proposed method can discover more drug-pathway association pairs than the iPaD method. And the L 2, 1 -iPaD method can identify more validated associations. With the development of genomics and pharmacology, dealing with transcription and drug sensitivity data has become feasible. Our proposed method has tremendously improved the performance of the original algorithm. In the future, we are ready to propose more efficient and Table 3 The identification and verification rates on CCLE data set with the P-values < 0.005   The superior results are in italic type Table 6 The identification and verification rates on NCI-60 data set with the P-values < 0.05 Note: a The results of iFad method are derived from the reference thirteen. And the identification number denotes the number of drug-pathway association pairs, which posterior probabilities are no more than 0.9. The number of verification denotes the number of identified drug-pathway association pairs, which are validated in the CancerResource The superior results are in italic type Table 7 The identification and verification rates on NCI-60 data set with the P-values < 0.005 The superior results are in italic type robust algorithms to handle the high-throughput drug related data. And the rapid growth of the high-throughput gene expression and drug related data is calling for more effective algorithms to solve the computational problems.