Prediction of kinase inhibitor response using activity profiling, in vitro screening, and elastic net regression

Background Many kinase inhibitors have been approved as cancer therapies. Recently, libraries of kinase inhibitors have been extensively profiled, thus providing a map of the strength of action of each compound on a large number of its targets. These profiled libraries define drug-kinase networks that can predict the effectiveness of untested drugs and elucidate the roles of specific kinases in different cellular systems. Predictions of drug effectiveness based on a comprehensive network model of cellular signalling are difficult, due to our partial knowledge of the complex biological processes downstream of the targeted kinases. Results We have developed the Kinase Inhibitors Elastic Net (KIEN) method, which integrates information contained in drug-kinase networks with in vitro screening. The method uses the in vitro cell response of single drugs and drug pair combinations as a training set to build linear and nonlinear regression models. Besides predicting the effectiveness of untested drugs, the KIEN method identifies sets of kinases that are statistically associated to drug sensitivity in a given cell line. We compared different versions of the method, which is based on a regression technique known as elastic net. Data from two-drug combinations led to predictive models, and we found that predictivity can be improved by applying logarithmic transformation to the data. The method was applied to the A549 lung cancer cell line, and we identified specific kinases known to have an important role in this type of cancer (TGFBR2, EGFR, PHKG1 and CDK4). A pathway enrichment analysis of the set of kinases identified by the method showed that axon guidance, activation of Rac, and semaphorin interactions pathways are associated to a selective response to therapeutic intervention in this cell line. Conclusions We have proposed an integrated experimental and computational methodology, called KIEN, that identifies the role of specific kinases in the drug response of a given cell line. The method will facilitate the design of new kinase inhibitors and the development of therapeutic interventions with combinations of many inhibitors.


Background
The important role of kinases in cancer biology [1] has spurred a considerable effort towards the synthesis of libraries of fully profiled kinase inhibitors, providing a map of the strength of each compound on a large number of its potential targets [2][3][4]. In particular, a recently published dataset has profiled several hundred kinase inhibitors using a panel of more than 300 kinases [4]. These profiled libraries define a network of interactions between drugs and their kinase targets [5], and represent a valuable resource for the development of new therapies. In this paper, we introduce a novel computational method that incorporates profiled libraries and in vitro measurements to predict the response of cells to previously untested drugs. Besides making prediction about the cellular response to drugs, the method identifies critical kinase targets and pathways that are statistically associated to drug sensitivity in a given cell line.
Statistical inference and regression methods in conjunction with gene expression or mutations have been used to identify specific biomarkers associated with an increased sensitivity/resistance to drugs. For instance, the sensitivity to PARP inhibitors of Ewing's sarcoma cells with mutations in the EWS gene and to MEK inhibitors in NRAS-mutant cell lines with AHR expression have been predicted using analysis of variance and the elastic net method [6] and then experimentally validated [7,8]. In these analyses, the statistical variable associated to drugs was represented by the half maximal inhibitory concentration (IC 50 ) in different cell lines. However, besides the IC 50 , there are many other types of information that characterize chemical compounds. These types of information can enhance the statistical analyses and improve the accuracy of predictions. For instance, a method to predict drugs sensitivity in cell lines based on the integration of genomic data with molecular physico-chemical descriptors of the drugs has been recently proposed [9]. Another useful type of information is the residual activity of kinases after interacting with a compound. Kinase profiling, patient genetic profiles, and sensitivity of primary leukemia patient samples to kinase inhibitors were recently used by Tyner et al. [10] to identify functionally important kinase targets and clarify kinase pathway dependence in cancer.
In this paper, the residual activity of kinases upon drug interaction is used to make predictions of the cellular response for in vitro experiments using an elastic net [6] regression approach. This regression method reduces the number of predictors to a minimum set, providing a clear picture of the kinases involved in the response of cell lines. A primary screen (single drug) and a secondary screen (two-drug combinations) are used as the training set for the regression. The two-drug screening exhibits a broader distribution in the response and provides a good level of predictability. In fact, the model based only on single drug response did not pass the statistical cross-validation test.
We are applying this Kinase Inhibitor Elastic Net (KIEN) method to predict cell viability of a lung cancer cell line (A549) and a normal fibroblast cell line (IMR-90) after drug treatment. We found that the regression can be improved through a logarithmic transformation on the data. Using the results of the regression, we identified a set of kinases that are strongly associated to a selective response of A549 and not IMR-90. Then, a pathway-based enrichment using Reactome [11] revealed ten significant pathways using this set of kinases, including axonal guidance and related semaphorin interactions as top hits.
This paper is organized as follows: Section In vitro screen of the kinase inhibitor library contains the experimental results of the primary and secondary in vitro screening corresponding to single drugs and two-drug combinations.
These experimental results and residual kinase activity are analyzed with Pearson's correlation in Section Analysis of correlations. This simple correlation analysis gives a first glance of the kinases that are statistically associated to a significant change in the viability of cancer and normal cell lines. In Section Elastic net regression, we introduce the elastic net approach and we present the results of a leaveone-out cross validation for predictions on single and pairs of drugs. We also present in this section the results obtained using the logarithmic transformation on the variables and a pathway enrichment analysis using Reactome [11]. The Discussion of the results is in Section Discussion, conclusions in Section Conclusions, and Materials and Methods in Section Materials and methods.

In vitro screen of the kinase inhibitor library
Our methodology begins with the high-throughput screening of single drug and drug pair experiments. The 244 kinase inhibitors (KIs) of the EMD drug library were screened at 1000 nM individually and the treatment lasted for 72 hours. To quantify a selective response of a cancer cell line with respect to a control normal cell line, we define the selectivity S of a single drug or drug combination as where v N indicates the viability of normal cells (IMR90) after treatment, and v C the viability of cancer cells (A549) after treatment. From the screening of the 244 KIs, the top hit was PDK1/Akt1/Flt3 Dual Pathway Inhibitor (CAS # 331253-86-2) as ranked by selectivity ( Figure 1). For the secondary screen, we used the PDK1/ Akt1/Flt3 Dual Pathway Inhibitor as the starting point and combined this compound with the other KIs as a drug pair combination. The dose of PDK1/Akt1/Flt3 Dual Pathway Inhibitor was studied to ensure proper dosing range and minimize toxicity. We used 125 nM, which maintains the normal cell line IMR-90's viability >90% ( Figure 2). For the other 243 KIs we used the standard dose of 1000 nM. Several pairs in the secondary screen showed very high selectivity. The top hit from the secondary screen of the library was Alsterpaullone 2-cyanoethyl (CAS # 852529-97-0) with a selectivity of S = 6.14 for the pair (Figure 3).

Analysis of correlations
In our second step, we analyzed the Pearson's correlation of the primary and secondary screening with a published dataset [4] containing target profiles for 140 kinase inhibitors. Therefore, even though we had a library of 244 KIs in the experimental screening, we were limited to utilizing 140 KIs for the analysis. For each inhibitor, the dataset provides the residual activity (0 ≤ A ≤ 1) of 291 kinases after drug treatment. This quantity is a measure of the strength of inhibition of a drug on each kinase. For each kinase k, we calculate the Pearson's correlation, C k , between the selectivity S i and the activities A k, i , with i ∈ {1, …, M} indicating the single drug or drug pair in the set. For drug pairs, the activity is estimated as a product of the residual activities of the two drugs. The kinases are then ranked based on the p-value of their correlation with selectivity, and we calculate the False Discovery Rate (FDR) adjusted p value [12]. The list of kinases mostly correlated to the selectivity from the primary and secondary screen are listed in Table 1. We also did calculations of the correlation between the normal or cancer cell viability and the activities. The results for the top kinase-viability correlations for the primary and secondary screen are shown in the supplementary materials (Additional file 1: Table S1).

Elastic net regression
Next, we build a regression model that predicts the response of a cell line to a drug or drug combination i. The response we predict is the normal and cancer cell viability, from which the selectivity can be derived. For this purpose, we define a regression problem in which we use the residual activity of the kinase k under the effect of drug i, which we indicate as A k, i , as predictors of the viability. The response can be written as A fitting procedure based on a training set of measurements produces the coefficients (β 0 , β 1 , …, β p ). Equation  (1) can then be used to predict the viability of a new drug that has not been tested, but of which the profiling information is available. Note that we are integrating two different types of data: kinase profiling data is obtained through enzymatic assays that probe directly the interaction between drug and kinases, while the in vitro cell response data is the result of complex signaling that involves many pathways downstream of the affected kinases. The coefficients β k can be seen as a measure of the sensitivity of a given cell line due to alterations in the activity of kinase k.
It is well known that the least square method does not perform well in the case of linear regression with many predictors. In our case, we would like to use a database of drugs that have been profiled on about 300 kinases.
However, it would be desirable to select and keep in the final model a minimal set of the kinases that provide a simple model, useful to gain biological insight. The lasso technique [13] is a powerful method to reduce the number of predictors by imposing a penalty on the regression coefficients. However, in the presence of a group of kinase predictors with strong mutual correlation, the lasso could select only one kinase predictor from the group while missing the others. To prevent this problem, our method uses the elastic net approach. This method incorporates the lasso penalty as well as a ridge penalty to keep the regression coefficients small without completely removing them [6]. The weights of the ridge and lasso penalties in the least square procedure can be optimized for best performance of the method.  A negative correlation indicates that inhibition of that particular kinases is associated to a higher selectivity. The top two hits with negative correlation, TGFBR2 and CDK4 are known to have an important role in cell proliferation, invasion and metastasis in lung adenocarcinoma [21,22].
We show in Figure 4(a) and (b) the results of a leave one out cross validation (LOOCV) method for the primary (a) and secondary screen (b). For each of the 140 drugs, we apply the elastic net method using the remaining 139 drugs and then we compare the result to the measured value. This cross validation method is a particular case of the more general k-fold cross validation procedure in which k is equal to the size of the training set [14]. The cross LOOCV shows that the information contained in the primary screen is not sufficient to define a predictive model. The fact that some kinases in Table 1 show some significant correlation with the response when considered individually is in general not a sufficient condition for defining a predictive, multiple regression model. On the other hand, the secondary screen is able to reproduce the viability of many drugs, especially the ones with the stronger effect on both cell lines. Overall, the data from the secondary screen presents a much broader distribution with a tail representing a few drug combinations particularly effective. The regression works better in identifying these highly effective pairwise combinations and the relative ranking of their strengths. Data is not particularly informative for drugs and drug pair combinations that are not effective, which concentrate in the neighborhood of~1.
Data transformations can represent a powerful strategy to improve regression. We applied a logarithmic transformation, which is consistent with the hypothesis of an independent action on the different kinases on the total viability. In this case we assume that the viability can be rewritten in the form By applying a log transformation on both sides of Eq. (2) we reduce the problem to a linear regression, to which the elastic net strategy can be applied. We show in Figure 5 the results of the LOOCV for the primary and secondary screen using the logarithmic data transformation. As in the linear case, we find that the method fails the cross validation procedure if we use data from the primary screen, while the secondary screen with log transformed data gives better R 2 .
In addition to a regression model that can be used to predict the efficacy of drugs that have not been tested, the β i coefficients can be used to rank kinases in terms of their relevance in the regression. Therefore, these coefficients identify the kinases whose inhibition is associated to a decrease in the cell viability. A ranking based on the differential β C i −β N i , where the index N and C identify the regression model of the cancer and normal cells, gives insight on specific pathways important for a selective response of cancer cells. Table 2 shows a list of kinases ranked in terms of β C i −β N i , where the coefficients have been obtained using the logarithmic data transformation on the secondary screen. In order to test whether selected pathways were significantly enriched for the identified kinase genes in Table 2, a pathway-based enrichment analysis was Figure 4 Leave-one-out Cross Validation of the elastic net regression model based on the primary (top) and secondary (bottom) screens for normal and cancer cell lines. Each of the 140 point in these figures corresponds to one of the 140 drug. "Regression" refers to the viability predicted by the regression model using all data from the other 139 drugs as training set, while "Measured" refers to the actual viability measured for the drug or drug combination. Note that only the secondary screen leads to predictive models with significant R 2 for the two cancer cell types. conducted using the results from the elastic net kinase analysis and Fisher exact tests. Ten pathways from Reactome were identified as significant (p < 0.05) using this kinase list, including axon guidance, activation of Rac, and semaphorin interactions as top hits (Table 3).

Discussion
Drug-kinase profiling represents a controller-target network [5] that when combined with in vitro testing, can be used in regression models to predict drug response and to identify pathways statistically associated to drug sensitivity. Network methods in biology are often based on the analysis of large datasets from high-throughput experiments. An example is given by gene regulatory networks, which presents many challenges either when restricted to a homogeneous set of data [15,16] or when it includes different classes of data [17][18][19][20]. In our KIEN method, information from the drug-target network and experimental query of the biological system are integrated. The goal is not a reconstruction of a regulatory network, but to identify a set of kinases linked to a therapeutic response in a given cell line. In order to establish associations, the system has to be perturbed by the use of kinase inhibitor drugs. The response to these single drugs or drug combinations becomes a training set that when combined with the kinase profiling, can lead to predictions.
The elastic net method is one of the most widely used regularization techniques. Regularization techniques are used in statistical and machine learning models to achieve an optimal tradeoff between accuracy and simplicity. Simplicity makes a model less prone to overfitting and more likely to generalize. In our analysis, we found that the elastic net regressions based on single drug responses were not successful, while drug pair data provided statistically significant predictions. A possible explanation for this finding is the following: single drugs might be less able to overcome the robustness of biological networks [5]. The phenotypic signal is therefore blunted and not easily measured. If a second drug is added, any compensatory capacity is already stretched and the effects from the inhibition of each kinase can be seen more clearly. Using data from drug pairs, we found that noise can be better filtered out and stronger statistical associations between kinases and therapeutic response are revealed. Clearly, if a different training set with higher variance in efficacy measures were used in the primary screen, it is likely that also single drug in vitro response would have given a significant predictive model.
We identified several kinases that are implicated in lung cancer that gives biological significance to our KIEN method. In particular, TGFBR2 appears as a top hit both in the correlation and in the elastic net methods. This finding is consistent with recent siRNA experiments on A549 cell lines [21], which demonstrated that silencing of this receptor reduces cell proliferation, invasion, and metastasis. The Cyclin-dependent kinase 4 (CDK4) appears as a second top target in the correlation analysis, and is also highly significant in the KIEN analysis. Experiments using lentiviral-mediated shRNA to inhibit CDK4 in A549 have shown inhibited cell cycle Figure 5 Leave-one-out Cross Validation of the elastic net regression model based on the primary (top) and secondary (bottom) screens for normal and cancer cell lines after logarithmic transformation on the data. Each of the 140 point in these figures corresponds to one of the 140 drugs. "Regression" refers to -log of the viability predicted by the regression model using all data from the other 139 drugs as training set, while "Measured" refers to -log of the actual viability measured for the drug or drug combination. Note that, as in Figure 4, only the secondary screen leads to predictive models with significant R 2 for both cell types. The R 2 for the Cancer cell lines is considerably better using the log transformation.
progression, suppressed cell proliferation, colony formation, and migration [22], and there is an ongoing clinical trial using a CDK4/6 inhibitor in lung cancer [23]. The KIEN analysis identified EGFR, which is known to be overexpressed in the majority of nonsmall cell lung cancers [24]. Furthermore, RNAi experiments targeting EGFR demonstrated cancer growth suppression in A549 xenograft in mice [25]. The third kinase in Table 2, PHKG1 has also been found to be upregulated in human tumor samples, including lung adenocarcinoma, and aberrations in its gene copy number is a feature of many human tumors [26].
The pathway-based enrichment provides a broader view on the role of the kinases identified by our method in Table 2. Among the top three pathways shown in Table 3 are activation of Rac and Semaphorin interactions. Rac proteins play a key role in cancer signaling and they belong to the RAS superfamily [27]. We also identified a set of semaphorins in our analysis that is represented in the top significantly enriched pathways. Semaphorins, previously known as collapsins, are a set of proteins containing a 500-amino acid sema domain among others (including PSI and immunoglobulin type domains), which can be transmembranous or secreted [28]. It is known that Sema3E cleavage promotes invasive growth and metastasis in vivo [28]. These genes also have selective targeting by Rac and Rho family members. This generates hypotheses of possible pathways that could be targeted therapeutically. However, these hypotheses need to be validated by further experiments with different inhibitors for the same targets or with alternative methods, e.g. using siRNA.

Conclusions
We have introduced an integrated experimental and computational methodology that identifies the role of specific kinases in the drug response of a given cell line.
The key element of our KIEN methodology is a multiple regression procedure that uses in vitro screen data as a training set. If a new library of kinase inhibitor compounds were to be synthetized and profiled, then our model would be able to immediately estimate the effect of these drugs on in vitro experiments on a given cell line. We have shown an application to a lung cancer cell A larger difference is associated with a selective response of A549 upon inhibition. Note that in addition to TGFB2R and CDK4, which were identified with the correlation approach of Table 1, additional kinases known to have an important role in lung cancer such as EGFR [24,25] and PHKG1 [26] are found using the elastic net approach. line, but our method can be extended to different cell lines. The method will facilitate the design of new kinase inhibitors and the development of therapeutic interventions with combinations of many inhibitors [29]. The procedure could be extended to three drug combinations, if measurements for these larger combinations were available. Finally, the method could be extended to regression models that are specific of cancer cells with the same set of mutations, or it could be directly used with patient-derived primary cells to identify a personalized treatment.

Materials
The primary screening of a kinase inhibitor (KI) library comprised of 244 KIs was purchased from EMD Chemicals, and diluted with DMSO to 2 mM concentrations for high-throughput screening purposes. The KI library was stored at −80°C. Additionally, PDK1/Akt1/Flt3 Dual Pathway Inhibitor (CAS # 331253-86-2) was ordered from EMD. Only 140 out of 244 were used in the drugtarget network reconstruction because the drug profiling information was available only for these compounds. One kinase inhibitor known to affect the kinase targets indirectly was excluded. We provide in Additional file 2 the chemical structure of kinase inhibitors with highest selectivity in the primary and secondary screening.

Kinase inhibitor experiments
IMR-90 (1500 cells/well) and A549 (750 cells/well) were seeded on 384-well microplates (Grenier Bio-One) and incubated for 3 hours before the addition of kinase inhibitor(s

ATP measurements
ATPlite 1Step (Perkin Elmer) was used to evaluate the cell number and cytotoxicity. ATP measurements were done by dispensing 20 uL of the ATPlite 1Step solution to each well to a final volume of 60 uL. The plate was placed on a shaker at 1100 rpm and the luminescence activity was detected by Analyst GT Plate Reader. The percent (%) of control is the quantity of ATPlite 1step measurement of the treated versus the untreated wells of each individual cell type. The ATP standard was prepared with culture media to final volume of 40 uL, and 20 uL of ATPlite 1step reagent was added. Additional file 1: Figure S3 shows the ATP standard curve. The plate was read immediately.

Computational methods
Correlations between selectivity/viability and kinase activity were calculated using the python scipy linregress function, which also provide p-values. Ranking the p-values and directly applying the Benjamini-Hochberg procedure gave us the FDR values. The elastic net regression was carried out using the Scikit-learn package [30] which finds the coefficients β that minimize the function where v is the vector of the observed viabilities and A is the matrix containing the residual activity of the kinases from the profiling, and M is the total number of drugs or drug combinations used. The parameters α and β determine the relative weights of the lasso and ridge penalties quantified using L 1 (|| | 1 ) and L 2 (|| || 2 ) norm, respectively. We used α = 0.15 and ρ = 0.01 in the results of Figures 4 and 5 and in Table 2. We also tried other values of these parameters, which did not give a significant difference in the results.

Pathway-based enrichment
Reactome pathways were downloaded using a newer build of the 'biomaRt' library (v2.12.0) in Bioconductor/R (v2.15.0). Gene symbols from the kinase list were converted to Entrez gene identifier numbers ('entrezgene') and mapped against the gene ids in each Reactome pathway. For each pathway, the set of significant genes enriched within any given pathway was computed using a Fisher exact test. The procedure computes the significance (p-value) of observing significant kinases, as deemed significant by our method, within the selected