Predicting the response to a drug for cancer disease patients based on genomic information is an important problem in modern clinical oncology. This problem occurs in part because many available drug sensitivity prediction algorithms do not consider better quality cancer cell lines and the adoption of new feature representations; both lead to the accurate prediction of drug responses. By predicting accurate drug responses to cancer, oncologists gain a more complete understanding of the effective treatments for each patient, which is a core goal in precision medicine.

Results

In this paper, we model cancer drug sensitivity as a link prediction, which is shown to be an effective technique. We evaluate our proposed link prediction algorithms and compare them with an existing drug sensitivity prediction approach based on clinical trial data. The experimental results based on the clinical trial data show the stability of our link prediction algorithms, which yield the highest area under the ROC curve (AUC) and are statistically significant.

Conclusions

We propose a link prediction approach to obtain new feature representation. Compared with an existing approach, the results show that incorporating the new feature representation to the link prediction algorithms has significantly improved the performance.

Background

Cancer has a significant global impact on public health; it is the second leading cause of death in the United States of America [1]. Cancer patients respond differently to potential drugs (i.e., chemotherapy) due to environmental causes, tumor heterogeneity, and genetic factors, making cancer drug discovery difficult [2,3,4,5]. The increasing number of deaths associated with cancer has attracted the attention of researchers from numerous domains, such as computational biology, machine learning, and data mining [6,7,8,9]. Costello et al. [10] assessed the performance of 44 drug sensitivity prediction algorithms based on profiling datasets (i.e., genomic, proteomic, and epigenomic data) in breast cancer cell lines. The training set consists of 35 cell lines, in which each cell line is associated with 28 drug responses. The test set consists of 18 cell lines. The task of each prediction algorithm is to learn a model from the training cell lines and perform predictions on the test set. The predictions correspond to a ranking of the 28 drugs—from the most sensitive to the most resistant for each cell line on the test set. The top-performing approach [10] improved the performance by integrating several profiling datasets with improved representation with a probabilistic nonlinear regression model. The second-best performing approach employed random forest regression to make predictions on the test set. The prediction algorithms were evaluated using the weighted probabilistic c-index (wpc-index) and resampled Spearman correlations [10]. The remaining prediction algorithms were not statistically different.

Geeleher et al. [11] proposed the following approach to drug sensitivity in which the input data are baseline expressions with drug IC_{50} values in cell lines and in vivo tumor gene expressions. The raw microarray data for the cell lines and clinical trials are processed separately and then combined and homogenized. The homogenized expression data consist of cell line expression data (i.e., baseline gene expression levels in the cell lines) and clinical trial expression data (i.e., baseline tumor expression data from the clinical trial). A learning algorithm is applied to the cell line expression data with the associated drug IC_{50} values for cell lines to learn a model. The resulting model is applied to clinical trial expression data to yield drug sensitivity predictions.

Two problems associated with the previous drug sensitivity prediction algorithms contribute to the degradation of the performance: (1) the poor quality of cell lines, especially when cell lines are not screened against all compounds [12]; and (2) the failure to adopt a new feature representation, because new feature representations provide a basis for improving the performance of learning algorithms [13,14,15].

In this paper, we model the cancer drug sensitivity as a link prediction problem, which is a classical research topic in computational social science [16,17,18,19] and biomedicine [20, 21]. Modeling the problem as link prediction enables us to exploit two link prediction algorithms: (1) the supervised link prediction algorithm, which aims to select better quality cancer cell lines; and (2) the extended supervised link prediction, which selects cancer cell lines and the top-k genes (i.e., features) using state of the art CUR matrix decomposition [22]. Our experimental results indicate that the proposed link prediction algorithms outperform the baseline prediction algorithms proposed by Geeleher et al. [11].

The key contributions of our paper are as follows: 1) we represent cancer drug sensitivity as a link prediction problem, which to the best of our knowledge is the first robustly transfer cancer drug sensitivity prediction to link prediction, 2) we connect a social network domain to a health informatics domain for advancing health informatics, 3) we propose two link prediction algorithms, and 4) we perform an experimental study on clinical trial data to demonstrate the predictive power and stability of our proposed link prediction algorithms against the prediction algorithms that employ the current approach [11].

This paper is organized as follows: In Related works section, we review the relevant literature, which pertains to both link prediction and cancer drug sensitivity prediction. In Methods section, we describe how the cancer drug sensitivity problem can be modeled as a link prediction problem. Then, we propose two link prediction algorithms that employ our link prediction approach: the supervised link prediction algorithm (A1) and the extended supervised link prediction algorithm (A2). Results and experiments section reports the experimental results and compares our link prediction algorithms against the baseline on the clinical trial data that pertains to breast cancer and multiple myeloma. Conclusions section summarizes our contributions in this paper.

Related works

Link prediction in gene regulatory networks

Given m genes, in which each gene has n expression values, we can denote their gene expression profiles by G∈ℝ^{m × n}, which contains m rows—each row corresponds to a gene—and n columns—each column corresponds to an expression value [23]. To learn a model, we need to know the regulatory relationships (i.e., labels) among the genes, which are stored in the matrix H∈ℝ^{p × 3}. H contains p rows—each row shows a known regulatory relationship between two genes—and three columns. The first column shows the source gene (i.e., the transcription factor). The second column shows the target gene, and the third column shows the label, which is denoted as +1 (i.e., present link) when the source gene regulates the target gene or −1 (i.e., missing link) when the source gene does not regulate the target gene. Thus, H represents the observed (i.e., known) gene regulatory network. To learn a model, we need to construct the training set D∈ℝ^{p × 2n + 1}. The p examples in D are constructed as follows: For each pair of genes with the associated label in matrix H, the n expression values of each pair of genes in matrix G are extracted, and the concatenation of the n expression values of each pair of genes and the corresponding label is performed. For example, consider the ith example in the training set D, which is denoted by D_{
i
} and defined as

where \( {g}_i^{\mathtt{1}},{g}_i^{\mathtt{2}},\dots, {g}_i^n \) are the n expression values of \( {\mathtt{g}}_i \) (also called the expression profile of \( {\mathtt{g}}_i \)), \( {g}_l^{\mathtt{1}},{g}_l^{\mathtt{2}},\dots, {g}_l^n \) are the n expression values of \( {\mathtt{g}}_l \), and y_{
i
}∈ {1, −1}. The ith example of the test set, T, is denoted by T_{
i
} and constructed as follows:

where \( {g}_i^{\mathtt{1}},{g}_i^{\mathtt{2}},\dots, {g}_i^n \) are the n expression values of \( {\mathtt{g}}_i \), and \( {g}_j^{\mathtt{1}},{g}_j^{\mathtt{2}},\dots, {g}_j^n \) are the n expression values of \( {\mathtt{g}}_j \). These feature vector definitions have been used by the existing supervised inference of gene regulatory networks [23,24,25,26,27,28]. After constructing the feature vectors, the learning algorithm is applied to D to induce (i.e., learn) the model h. The resulting model is used to perform prediction on T. The known regulations among genes enable using the induction principle to predict new regulations (i.e., labels): If gene \( {\mathtt{g}}_j \) has an expression profile that is similar to gene \( {\mathtt{g}}_l \), which is known to be regulated by \( {\mathtt{g}}_i \), then \( {\mathtt{g}}_j \) is likely to be regulated by \( {\mathtt{g}}_i \) [29]. Genes with similar expression profiles that are likely to be co-regulated have been used in the unsupervised clustering of expression profiles [30,31,32].

Cancer drug sensitivity prediction

The gene expression profiles denoted by X ∈ℝ^{p × n}, which contains p rows—each row corresponds to a cell line or a sample—and n columns—each column corresponds to a gene. Y = (y_{1}, … , y_{
p
})^{T} consists of the corresponding real-value drug responses (i.e., drug IC_{50} values) to X, where Y ∈ℝ^{p} (i.e., the p-dimensional column vector). IC_{50} is defined as the concentration of a compound that is required to produce 50% cancer cell growth inhibition after 48 h of treatment [33]. A training set is defined as \( \mathrm{D}={\left\{\left({\mathtt{g}}_i,{y}_i\right)\right\}}_{i=1}^p \), where \( {\mathtt{g}}_i\in \mathrm{X}\;\mathrm{and}\kern0.24em {y}_i\in \mathrm{Y}. \). Let the ith example of the training set D, denoted by D_{
i
}, be defined as

where \( {g}_i^{\mathtt{1}},{g}_i^{\mathtt{2}},\dots, {g}_i^n \) represent the n genes of the cancer cell line \( {\mathtt{g}}_i \) (also called the expression profile of \( {\mathtt{g}}_i \)), and y_{
i
}∈ℝ is the drug response value. The ith example of the test set T, denoted by T_{
i
}, is constructed as follows:

These feature vector definitions have been used by existing supervised cancer drug sensitivity prediction algorithms [9,10,11, 33,34,35,36]. A learning algorithm is applied to D to induce model h, which is subsequently used to perform predictions on T. Known cancer cell lines with associated drug responses enabled the use of the induction principle: If tumor \( {\mathtt{g}}_j \) has an expression profile similar to \( {\mathtt{g}}_i \), then \( {\mathtt{g}}_j \) is likely to have a drug response value closer to the drug response value associated with \( {\mathtt{g}}_i \).

Methods

The fundamental task of cancer drug sensitivity prediction is to correctly predict the response of a tumor to the drug. This prediction is typically achieved based on how closely this tumor (also referred to as the test example) is related to a known cancer cell line with the associated drug response. Proximity, which is a measure of closeness, lies at the heart of both link prediction in gene regulatory networks and cancer drug sensitivity prediction [29, 37].

Feature vector construction

To bridge link prediction and cancer drug sensitivity, we transform the feature representations of Eqs. (3) and (4) to the corresponding Eqs. (1) and (2) as follows: Let \( {\left\{\left({\mathtt{g}}_i,{y}_i\right)\right\}}_{i=1}^p\subseteq \mathbf{D} \) be the cancer cell lines, where D∈ℝ^{p × n + 1} , b = p.

1

Find the k^{’} nearest neighbors \( {\mathtt{g}}_1^{\ast },{\mathtt{g}}_2^{\ast },\dots, {\mathtt{g}}_{k^{\hbox{'}}}^{\ast } \) of each \( {\mathtt{g}}_i \) in D. (In this study k^{’}= 1.)

2

Generate synthetic cell lines along the lines between the randomly selected k^{’} nearest neighbors and each \( {\mathtt{g}}_i \) using the following lines of code:

Store \( \left[{\mathtt{g}}_i,{\mathtt{g}}_b,{y}_i\right] \) in G

2.1.2

end for

2.2

end for

where the index b refers to only those synthetic cell lines (e.g., \( {\mathtt{g}}_{p+1} \) when the index b = p + 1) that differ from the cell lines in D, whose indexes run from 1 to p, λ = 0.3, and G∈ℝ^{p × 2n + 1} is the new feature representation of the cell lines of the training set. Step 2.1.1.2 creates the synthetic cell line \( {\mathtt{g}}_b \). Let G_{
i
} be the ith row of G, defined as

where \( {\mathtt{g}}_i^{\mathtt{1}},{\mathtt{g}}_i^{\mathtt{2}},\dots, {\mathtt{g}}_i^n \) represent n genes of the cancer cell line \( {\mathtt{g}}_i \), \( {g}_{p+1}^{\mathtt{1}},{g}_{p+1}^{\mathtt{2}},\dots, {g}_{p+1}^n \) represent the synthetic n genes of the synthetic cancer cell line \( {\mathtt{g}}_{p+1} \), and y_{
i
}∈ℝ denotes that both \( {\mathtt{g}}_i \) and \( {\mathtt{g}}_{p+1} \) are linked by sharing the same drug response value. Let \( {\left\{\left({\mathtt{g}}_i,{y}_i\right)\right\}}_{i=1}^q\subseteq \mathbf{T} \) be the test set of tumors, where T∈ℝ^{q × n}. Note that Steps 1–2 are similar to the Synthetic Minority Oversampling Approach (SMOTE) [38, 39], However, Step 2.1.1.3 is a different core step in which we increase the dimensionality (i.e., the number of features) instead of the size, as SMOTE does. We then apply the previous steps (i.e., Steps 1 and 2—changing Step 2.1 to i = 1 to q and Step 2.1.1.3 to Store \( \left[{\mathtt{g}}_i,{\mathtt{g}}_b\right] \) in G^{'}) to T to obtain G^{'}∈ℝ^{q × 2n}. G^{'} is the new feature representation of the clinical trial expression data of the test set. Let \( {\mathbf{G}}_i^{\hbox{'}} \) be the ith row of G^{'}, which is defined as

where \( {\mathtt{g}}_j^{\mathtt{1}},{\mathtt{g}}_j^{\mathtt{2}},\dots, {\mathtt{g}}_j^n \) represent n genes of tumor \( {\mathtt{g}}_j \), and \( {\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}^{\mathtt{1}},{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}^{\mathtt{2}},\dots, {\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}^n \)represent n synthetic genes of the synthetic tumor \( {\mathtt{g}}_{p+2{k}^{\hbox{'}}+1} \). A learning algorithm is called on the training set, G to induce the model h, which is subsequently used to perform predictions on the test set G^{'}. The logic behind the mechanism of the induction principle is as follows: If the expression profiles of the pair of tumors \( \left({\mathtt{g}}_j,{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}\right) \) are similar to those of the cell lines \( \left({\mathtt{g}}_i,{\mathtt{g}}_{p+1}\right) \), then \( \left({\mathtt{g}}_j,{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}\right) \) is likely to have a drug response value closer to the drug response value associated with \( \left({\mathtt{g}}_i,{\mathtt{g}}_{p+1}\right) \). In machine learning terms, let \( \left({\mathtt{g}}_i,{\mathtt{g}}_{p+1},{y}_i\right)\in {\mathbf{\mathbb{R}}}^{2n+1} \) be a row feature vector that encodes information about the pair of cancer cell lines \( \left({\mathtt{g}}_i,{\mathtt{g}}_{p+1}\right) \). Given a new pair of tumors encoded by \( \left({\mathtt{g}}_j,{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}\right) \), if \( \left({\mathtt{g}}_j,{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}\right) \) has feature values similar to \( \left({\mathtt{g}}_i,{\mathtt{g}}_{p+1}\right) \), whose label is y_{
i
}, then \( \left({\mathtt{g}}_j,{\mathtt{g}}_{p+2{k}^{\hbox{'}}+1}\right) \) is more likely to have a closer response (i.e., label) value to y_{
i
}.

Notations and algorithms

Notations

To provide a better understanding of our proposed prediction algorithms, the notations used throughout the remainder of this paper are summarized as follows: Matrices are denoted by boldface uppercase letters, e.g., matrix X. We denote the row vectors of a matrix by boldface uppercase letters with a subscript, e.g., X_{
j
} is the jth row of matrix X. Vectors are denoted by boldface lowercase letters, e.g., vector v. Vector entries are denoted by italic lowercase letters with a subscript, e.g., v_{
i
} is the ith entry of vector v. The number of entries of a vector is denoted by the cardinality symbol, e.g. ∣v∣ is the number of elements of vector v. Scalars are denoted by italic lowercase letters, e.g., m. f , f^{∗} , and h are reserved letters, where f refers to a learning algorithm (e.g., SVR), f^{∗} refers to an induced (i.e., learned) model, and h is an induced model used to perform predictions on the test set. We refer to specific learning algorithms and induced models using subscripts. For example, \( {f}_i\left({f}_i^{\ast },\mathrm{respectively}\right) \) denotes the ith learning algorithm and induced model, respectively.

The supervised link prediction algorithm (A1)

Figure 1 outlines the supervised link prediction algorithm, which we designate A1, as follows. (a) Given a training set of cancer cell lines with associated drug responses D∈ℝ^{p × n + 1} and a test set of tumors T∈ℝ^{q × n} that are described as in cancer drug sensitivity prediction subsection. (b) Transform D and T using the feature vector construction method described in feature vector construction subsection, to obtain a new feature representation G∈ℝ^{p × 2n + 1} for the training set and a new feature representation G^{'}∈ℝ^{q × 2n} for the test set. (c) Our link filtering method aims to select a better quality training set that works as follows: Each row (i.e., feature vector) in the new representations G and G^{'} can be viewed as a cell line or tumor, represented by a 2n–dimensional row vector when the drug responses of the training set G are excluded. We weigh each cell line [40] \( {\mathtt{g}}_i \) in the training set G by the minimum distance from the cell line \( {\mathtt{g}}_i \) to all tumors \( {\mathtt{g}}_j^{\hbox{'}} \) in the testing set G^{'}:

where \( {\mathtt{g}}_i\in {\mathbf{\mathbb{R}}}^{2n} \), \( {\mathtt{g}}_j^{\hbox{'}}\in {\mathbf{\mathbb{R}}}^{2n} \), w_{
i
} is the weight assigned to \( {\mathtt{g}}_i \), and \( \mathrm{dist}\left({\mathtt{g}}_i,{\mathtt{g}}_{j^{\ast}}^{\hbox{'}}\right) \) is the Euclidean distance. Let w = (w_{1}, w_{2}, … , w_{
p
}). Then, we perform the following steps to select better quality training cell lines using our modified version of Query by Committee (QBC) [41,42,43]:

1

Let med be the median of the w vector of weights of each \( {\mathtt{g}}_i \) in G

2

Let \( \mathbf{X}=\left\{\left({\mathtt{g}}_i,{y}_i\right)|\left({\mathtt{g}}_i,{y}_i\right)\in \mathbf{G}\;\mathrm{and}\;{w}_i\le med\right\} \)

3

Let \( {\mathbf{X}}^{\hbox{'}}=\left\{{\mathtt{g}}_i|{\mathtt{g}}_i\kern0.24em \mathrm{in}\kern0.24em \mathbf{G}\;\mathrm{and}\;{w}_i\le med\right\} \)

4

Let \( \mathbf{Z}=\left\{\left({\mathtt{g}}_i,{y}_i\right)|\left({\mathtt{g}}_i,{y}_i\right)\in \mathbf{G}\;\mathrm{and}\;{w}_i\ge med\right\} \)

5

Let \( {\mathbf{Z}}^{\hbox{'}}=\left\{{\mathtt{g}}_i|{\mathtt{g}}_i\kern0.24em \mathrm{in}\kern0.24em \mathbf{G}\;\mathrm{and}\;{w}_i\ge med\right\} \)

6

Apply the learning algorithm, f_{1} or f_{2}, to X or Z, respectively, to induce the model \( {f}_1^{\ast}\;\left({f}_2^{\ast },\mathrm{respectively}\right) \). (In this study, we chose ridge regression as the learning algorithm)

7

Apply the model \( {f}_1^{\ast}\;\left({f}_2^{\ast },\mathrm{respectively}\right) \) to perform predictions on Z^{'} or X′, respectively) and store predictions in v or b respectively)

8

Let q =∣v∣ = ∣b∣

9

Let P = (v, b)^{T}

10

Let r = {y_{
i
}| y_{
i
} in Z} and e = {y_{
i
}| y_{
i
} in X}

QBC aims to partition the training set G into S and U, where S or U is treated as the labeled or unlabeled set, respectively. QBC is accompanied by two major items: (1) the set of models (i.e., the committee) that are consistent with all labeled cell lines in S; and (2) given the unlabeled set, U, the QBC applies the models (i.e., the committee) to U to select the unlabeled tumor that maximizes the disagreement because it represents the most important tumor that will be added to S, in addition to querying the drug response value associated with the tumor. The main obstacle of the first major step of QBC is to find models that agree on all the labels of set S with reasonable computational complexity [43]. Thus, we relax the first major step according to Steps 1–14, where relaxation is practiced to address the first major step [41]. Steps 1–5 partition the training set into X and Z using the median as a threshold, where X or Z contains cell lines from G that are near or far, respectively, from the test set G^{'}. Steps 6–14 aim to assign the set of cell lines where the model incurred fewer errors (or more errors, respectively) to S or U, respectively. The logic behind these steps (i.e., Steps 13–14) is that we want S or U, respectively, to contain the set of cell lines that are more or less, respectively, correctly labeled by one model (i.e., one member of the committee). Steps 1–14 are motivated by other QBC approaches [41,42,43], in which the success of the second major step of QBC is dependent on the first major step.

15

Repeat k” times

15.1

Apply the learning algorithms f_{1} , f_{2} , … , f_{
t
} on S to induce the models (i.e., committee) \( {f}_1^{\ast },{f}_2^{\ast },\dots, {f}_t^{\ast } \). (In this study, t = 3, and the learning algorithms include support vector regression with a linear kernel (SVR + L), SVR with a polynomial kernel of degree 5, and SVR with a sigmoid kernel (SVR + S))

15.2

Let \( {w}_t^{\hbox{'}} \) be the weight of the ith model \( {f}_i^{\ast } \) where \( {\boldsymbol{w}}^{\hbox{'}}=\sum_{i=1}^t{w}_i^{\hbox{'}}=1 \). (In this study, t = 3 and \( {w}_1^{\hbox{'}}={w}_2^{\hbox{'}}={w}_3^{\hbox{'}}=\frac{1}{3} \))

15.3

For each \( {\mathtt{g}}_j \) in U, let \( {f}^{\hbox{'}}\left({\mathtt{g}}_j\right)=\sum_{i=1}^t{\boldsymbol{w}}_i^{\hbox{'}}{f}_i^{\ast}\left({\mathtt{g}}_j\right) \) where \( {f}_i^{\ast}\left({\mathtt{g}}_j\right) \) is the prediction of the ith learned model on the jth cell line \( {\mathtt{g}}_j \), and \( {f}^{\hbox{'}}\left({\mathtt{g}}_j\right) \) is the weighted ensemble average of the jth cell line \( {\mathtt{g}}_j \).

15.4

Find the cell line \( {\mathtt{g}}_{j^{\ast }} \) that maximizes the disagreement:

Find the label \( {y}_{j^{\ast }} \) of \( {\mathtt{g}}_{j^{\ast }} \) in U

15.6

Add the pair \( \left({\mathtt{g}}_{j^{\ast }},{y}_{j^{\ast }}\right)\in \mathbf{U} \) to S and remove the pair \( \left({\mathtt{g}}_{j^{\ast }},{y}_{j^{\ast }}\right) \) from U

15.7

Update ∣v∣ = ∣v∣ − 1

16

Return S

Steps 15.1–15.4.1 return the index of the cell line in set U that maximizes the disagreement, where disagreement is defined in Step 15.4.1 [44]. Then, \( \left({\mathtt{g}}_{j^{\ast }},{y}_{j^{\ast }}\right) \) is added to or removed from S or U respectively, as shown in Steps 15.5–15.6. (In this study, k” = 5.) Step 15.7 updates |v| as the size of U is reduced after each iteration. S (Step 16) is the returned set that will be used as the training set. (d) We apply a learning algorithm on S to induce the model h. Finally (i.e., (e in Fig. 1)), we apply model h to perform predictions on the test set G^{'} (i.e., the set of new feature representations of the clinical trial expression data). In the remainder of this paper, we refer to the supervised link prediction algorithms that employ the following machine learning algorithms (SVR and RR) as: A1 + SVR + L, A1 + SVR + S, and A1 + RR (abbreviations are listed in Table 1).

The extended supervised link prediction algorithm (A2)

Figure 2 shows the data flow diagram of the extended supervised link prediction (A2). Steps (a), (b), and (c) are the same as Steps (a), (b), and (c) of the supervised link prediction algorithm. (d) Mahoney et al. [22] proposed CUR matrix decomposition as a dimensionality reduction paradigm that aims to obtain a low rank approximation of matrix S, which is expressed in terms of the actual rows and columns of the original matrix S:

$$ \kern0.5em \mathbf{S}\approx \mathbf{CUR}, $$

(8)

where C consists of a small number of the actual columns of S,R consists of a small number of the actual rows, and U is a constructed matrix that guarantees that CUR is close to S. We select k genes based on their importance score (refer to Equation 9), which depends on matrix S and the input rank parameter l (in this study, we used the default parameter value for l in CUR function [45].) If \( {v}_j^{\xi } \) is the j-th element of the ξ − th right singular vector of S, then the normalized statistical leverage scores are equal to

for all j = 1..2n, and \( \sum_{j=1}^{2n}{\pi}_j=1 \). Statistical leverage scores have been successfully employed in data analysis to identify the most influential genes and outlier detection [22]. A high statistical leverage score for a given gene indicates that the gene is regarded as an important (i.e., influential) gene. A low statistical leverage score for a given gene indicates that the gene is regarded as a less important gene. We store the indexes of the highest k leverage scores in I; these correspond to the positions of the k most influential genes in matrix S. We select k genes from the training set S using their positions in I and store subsampled cell line expression data with k genes in S^{'}. (e) A learning algorithm is called on S^{'} to induce model h. (f) The k genes in the test set G^{'} are selected using their positions in I and stored in G^{''}. Model h is applied on the test set G^{''} to perform predictions. We refer to the extended supervised link prediction algorithms that employ machine learning algorithms as A2 + SVR + L, A2 + SVR + S, and A2 + RR (see Table 1).

Results

We empirically evaluate our proposed approach and compare it against the baseline approach proposed by Geeleher et al. [11] on clinical trial datasets. This section first describes the datasets and experimental methodology and presents the experimental results.

Datasets

Data pertaining to breast cancer

The training set D∈ℝ^{482 × 6539} contains 482 cancer cell lines, 6538 genes, and drug IC_{50} values that correspond to a 482-dimensional column vector. The test set T∈ℝ^{24 × 6538} consists of 24 breast cancer tumors and 6538 genes. The drug IC_{50} values for docetaxel (a chemotherapy drug) [46, 47] were downloaded from (http://genemed.uchicago.edu/~pgeeleher/cgpPrediction/). The cell line expression data were downloaded from the ArrayExpress repository [48] (accession number is E-MTAB-783, also available at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-783/?query=EMTAB783). The clinical trial data corresponding to the test set were downloaded from the Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/) with accession numbers GSE350 and GSE349 [49,50,51]. The data with accession numbers GSE350 and GSE349 contain 10 and 14 samples, respectively. If the remaining tumor was <25% or ≥25%, a breast cancer patient is considered to be sensitive or resistant, respectively, to docetaxel treatment. All the data were downloaded and processed according to the approach proposed by Geeleher et al. [11].

Data pertaining to multiple myeloma

The training set D∈ℝ^{280 × 9115} contains 280 cancer cell lines, 9114 genes, and drug IC_{50} values that correspond to a 280-dimensional column vector. The test set T∈ℝ^{188 × 9114} is composed of 188 multiple myeloma patients and 9114 genes. The drug IC_{50} values for bortezomib [52, 53] were downloaded from (http://genemed.uchicago.edu/~pgeeleher/cgpPrediction/), and the data for the cancer cell lines were downloaded from the ArrayExpress repository (accession number is E-MTAB-783 or available at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-783/?query=EMTAB783). The clinical trial data corresponding to the test set were downloaded from the Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/) with accession number GSE9782 [54]. The data were downloaded, processed and mapped according to Geeleher et al. [11].

Data pertaining to non-small cell lung cancer and triple-negative breast cancer

The training sets correspond to an 258 × 9508 matrix and an 497 × 9621 matrix for non-small cell lung cancer and triple-negative breast cancer, respectively. The test sets correspond to an 25 × 9507 matrix (excluding labels) and an 24 × 9620 matrix (excluding labels) for non-small cell lung cancer and triple-negative breast cancer, respectively. The data were downloaded from (http://genemed.uchicago.edu/~pgeeleher/cgpPrediction/) [11].

Experimental methodology

Kernel-based methods, such as SVM and support vector regression (SVR), are popular machine learning algorithms and exhibit state-of-art performance in many applications [55, 56], including biological fields [57]. Therefore, in our experiments, we used SVR with linear kernel (SVR + L) and sigmoid kernel (SVR + S) as machine learning algorithms, coupled with our proposed link prediction algorithms (A1 or A2). We also employed our proposed link prediction algorithms with linear ridge regression (RR). In total, we considered 9 drug sensitivity prediction algorithms, as summarized in Table 1.

Each prediction algorithm was trained on the same training set, whose labels are continuous to yield models (see Methods section). Then, each model is applied to the same test set to yield predictions, as discussed in Methods section. The test set consists of the clinical trial expression data of patients, including baseline tumor expression data from primary tumor biopsies prior to treatment with an anticancer drug. The responses (i.e., labels) of the test set are categorical (e.g., either “sensitive” or “resistant”). These labels were clinically evaluated by the degree of reduction in tumor size to the given drug [11].

To evaluate whether the proposed approach exhibits stable superior performance as the sample size changes, we gradually reduced the sample size for the training set by 1 to 4% in each run. That is, we have 5 runs with sample sizes of 482, 478, 473, 468, and 463 and 280, 278, 275, 272, and 269 for the two datasets, respectively.

The accuracy of the prediction algorithms is measured using the Area Under the ROC Curve (AUC), as shown in [11]. The higher AUC an algorithm has, the better performance that algorithm achieves. We denote the mean of the AUC values averaged over the five runs of the test set as the MAUC. A run of the test set is defined as predictions of a learned model on the test set, such that the model is learned from the training set. The size of this training set is varied to assess the stability of prediction algorithms, in which a stable prediction algorithm is one for which the prediction accuracy on the test set does not change dramatically due to small changes in the size of the training set [58, 59]. This type of assessment is important in biological systems, in which the best prediction algorithm outperforms other algorithms many times in the conducted experiments. Statistical significance is measured between all pairs of the prediction algorithms.

The software employed in this study included support vector regressions with linear and sigmoid kernels in the LIBSVM package [60], ridge regression [11], gene selection using CUR and topLeverage functions in the rCUR package [45], and R code for processing the datasets and performance evaluation [11]. We used R to write the code for the link prediction algorithms and perform the experiments.

Experimental results

Tables 2 and 3 show the AUC of 9 docetaxel and bortezomib, respectively, sensitivity prediction algorithms on clinical breast cancer or multiple myeloma trial data. For each variation in training set size the prediction algorithm with the best performance (i.e., the highest AUC) on the clinical trial data is shown in bold.

Table 2 shows that our prediction algorithms perform better than the baseline prediction algorithms (i.e., B + SVR + L and B + SVR + S) including B + RR, which is a prediction algorithm proposed by Geeleher et al. Row “m” and “d”, shows the number of cell lines or genes, respectively, in the training set that were provided to each prediction algorithm. We provided the same training set to each prediction algorithm. Rows “m + A1” and “m + A2”, or “d + A1” and “d + A2” show the number of selected cell lines or genes, respectively, that were used in the prediction algorithms that employed our approach for learning the models. The results of our prediction algorithms are dominant compared with the baseline prediction algorithms that employ clinical trial data of breast cancer in terms of the AUC of four runs and the MAUC. In contrast to the baseline prediction algorithms, the performance of our prediction algorithms on the test set outperforms in terms of the AUC when we reduce the training set size.

Table 3 shows that our prediction algorithms perform better than the baseline prediction algorithms (i.e., B + SVR + L and B + SVR + S) and B + RR, which is a prediction algorithm proposed by Geeleher et al. Row “m” or “d”, respectively, shows the number of cell lines or genes, respectively, in the training set that were provided to each prediction algorithm. We provided the same training set to each prediction algorithm. Rows “m + A1” and “m + A2” or “d + A1” and “d + A2” show the number of selected cell lines or genes, respectively, used in the prediction algorithms that employ our approach for learning the models. The results of our prediction algorithms are dominant compared with the baseline prediction algorithms on the multiple myeloma clinical trial data in terms of the AUC of each run and the MAUC. In particular, A2 + RR achieves the highest mean AUC (MAUC) of 0.693 and performed the best in all runs. In contrast to the baseline prediction algorithms, the performance of A2 + RR on the test results in the best AUC as we reduce the training set size, which indicates that A2 + RR has a stable performance.

Table 4 shows the p-values of the two-tailed Wilcoxon signed rank test [61, 62] to measure the statistical significance between the prediction algorithms using clinical trial data of breast cancer and multiple myeloma patients. The p-values indicate that our A1 + SVR + L and A2 + SVR + L prediction algorithms significantly outperformed the baseline prediction algorithms B + SVR + L, B + SVR + S, and B + RR. The remaining prediction algorithms that employ our approach are not statistically different from B + SVR + S.

Figures 3 and 4 show the ranking of all prediction algorithms from the highest to the lowest MAUC using clinical trial data pertaining to breast cancer and multiple myeloma patients, respectively. Each MAUC is calculated over the 5 runs of the clinical trial data. As shown in Figs. 3 and 4, our prediction algorithms outperform the baseline prediction algorithms [11] w.r.t the MAUC.

Figure 5 shows the predictions of three prediction algorithms on the test set (clinical data samples of 24 breast cancer patients) when the prediction algorithms were trained on a dataset with the size m = 482 (i.e., the complete training set without any reductions). Figure 5a–c show the predictions of A2 + SVR + L A1 + SVR + L and B + SVR + S, respectively. For A2 + SVR + L in Fig. 5a, the difference between the predicted drug sensitivity in breast cancer patients was highly statistically significant (P=472 × 10^{−6} from the result of a t-test) between the trial-defined sensitive and resistant groups. The result of A1 + SVR + L in Fig. 5b was also highly statistically significant (P=614 × 10^{−6} from a t-test). B + SVR + S in Fig. 5c achieved statistical significance (P=1176 × 10^{−6} from a t-test). Higher sensitivity or higher resistance, respectively, denote the greater or lesser effectiveness of the drug. In Fig. 5d, the ROC reveals AUC values of 0.892, 0.878 and 0.842 for A2 + SVR + L, A1 + SVR + L, and B + SVR + S, respectively, as shown in Table 2.

In Fig. 6, the predictions of three prediction algorithms are reported on the test set (clinical trial data of 188 multiple myeloma samples of patients) when prediction algorithms learned models from a training set of size m = 280 (i.e., the training set without any reductions). Figure 6a–c show the predictions of the A2 + RR, A1 + RR, and B + RR, algorithms, respectively. For A2 + RR (Fig. 6a), the difference between the predicted drug sensitivity in multiple myeloma patients was highly significant (P=8 × 10^{−6} from a t-test) between trial-defined responder groups and non-responder groups. The result of A1 + RR was also highly significant (P=11 × 10^{−6} from a t-test), while B + RR achieved statistically significant result (P=2612 × 10^{−6} from a t-test). Figure 6d–f break down the responders and non-responders of Fig. 6a–c, respectively, to CR, PR, MR, NC or PD. In Fig. 6g, The ROC reveals AUCs of 0.686, 0.685, and 0.614 for A2 + RR, A1 + RR, and B + RR, respectively, as shown in Table 3.

We also evaluated the performance of prediction algorithms on the clinical trial data pertaining to non-small cell lung cancer patients and the triple-negative breast cancer patients. We observed similar results that our prediction algorithms noticeably outperform the baseline prediction algorithms (See Additional file 1: Tables S1 and S2).

It is worth mentioning that we also assessed the performance of other machine learning algorithms, including random forests [63], support vector regression with a polynomial kernel of degree 2, and support vector regression with a Gaussian kernel. Moreover, we applied other dimensionality reduction methods such as principal component analysis (PCA) [64] based on the prcomp package in R [65], sparse PCA [66, 67], non-negative and sparse cumulative PCA, and negative and sparse PCA [68, 69]. However, they did not exhibit acceptable predictive performance; consequently, their results are not included in this paper.

Discussion

Gene (feature) selection is important to the success of the proposed method. After many years of biomedical research, some signaling pathways have been known for being implicated in various cancers. It is tempted to exploit this pathway information for feature selection. For example, we might consider adding the signaling pathways as a constraint to get reliable feature sets. Consequently, we assessed the performance of the proposed prediction algorithms using only the genes in the signaling pathways that are known to the cancers. We obtained inferior results (See Additional file 2 for details). It is noted that the current pathway information is limited. If we consider only those signaling genes, we may miss those important genes not identified yet by domain knowledge. This may hurt the overall performance as shown in our case. Therefore, a better strategy may be to include all genes but assign more weights to those signaling pathway genes. This is an interesting direction, and we leave it to our future work.

Conclusion

In this paper, we introduce a link prediction approach to cancer drug sensitivity prediction. The benefit of introducing a link prediction approach is to obtain satisfactory feature representation for better prediction performance. We propose two algorithms that employ the link prediction approach: (1) A supervised link prediction algorithm, which selects better quality training cancer cell lines using a modified version of QBC; and (2) An extended supervised link prediction, which selects both better training cancer cell lines and a subset of important genes using state of the art CUR matrix decomposition.

In our study, the link prediction algorithms use two machine learning algorithms: support vector regression and ridge regression. The experimental results demonstrate the stability of the proposed link prediction algorithms, which outperform drug sensitivity prediction algorithms of an existing approach as measured by their higher and statistically significant AUC scores.

Abbreviations

A1:

The supervised link prediction algorithm

A2:

The extended supervised link prediction algorithm

AUC:

Area under curve

B:

The baseline approach

GEO:

Gene expression omnibus

IC50:

Half-maximal inhibitory concentration

MAUC:

Mean area under curve

QBC:

Query by committee

ROC:

Receiver operating characteristic

RR:

Ridge regression

SVR + L:

Support vector regression with a linear kernel

SVR + S:

Support vector regression with a sigmoid kernel

References

Siegel RL, Miller KD, Jemal A. Cancer statistics, 2015. CA Cancer J Clin. 2015;65(1):5–29.

Covell DG. Data Mining Approaches for Genomic Biomarker Development: Applications Using Drug Screening Data from the Cancer Genome Project and the Cancer Cell Line Encyclopedia. PLoS One. 2015;10(7):e0127433.

Costello JC, Heiser LM, Georgii E, Gonen M, Menden MP, Wang NJ, Bansal M, Ammad-ud-din M, Hintsanen P, Khan SA, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32(12):1202–12.

Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15(3):R47.

Yadav B, Gopalacharyulu P, Pemovska T, Khan SA, Szwajda A, Tang J, Wennerberg K, Aittokallio T. From drug response profiling to target addiction scoring in cancer cell models. Dis Model Mech. 2015;8(10):1255–64.

Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798-828.

Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine learning. MIT press; 2012.

Coates A, Ng AY. Learning feature representations with k-means. In: Neural Networks: Tricks of the Trade. Springer; 2012. p. 561–80.

Leskovec J, Faloutsos C. Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. Philadelphia: ACM; 2006. p. 631–6.

Getoor L, Diehl CP. Link mining: a survey. SIGKDD Explorations. 2005;7(2):3–12.

Turki T, Wang JTL. A New Approach to Link Prediction in Gene Regulatory Networks. In: Intelligent Data Engineering and Automated Learning – IDEAL 2015: 16th International Conference, Wroclaw, Poland, October 14–16, 2015, Proceedings. Edited by Jackowski K, Burduk R, Walkowiak K, Woźniak M, Yin H. Cham: Springer International Publishing; 2015. p. 404-15.

Gillani Z, Akash MS, Rahaman MM, Chen M. CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regularity networks. BMC bioinformatics. 2014;15(1):395.

Turki T, Bassett W, JTL W. A Learning Framework to Improve Unsupervised Gene Network Inference. In: Perner P, editor. Machine Learning and Data Mining in Pattern Recognition: 12th International Conference, MLDM 2016, New York, NY, USA, July 16–21, 2016, Proceedings. Cham: Springer International Publishing; 2016. p. 28–42.

Riddick G, Song H, Ahn S, Walling J, Borges-Rivera D, Zhang W, Fine HA. Predicting in vitro drug sensitivity using Random Forests. Bioinformatics. 2011;27(2):220–4.

Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, Ballester PJ, Saez-Rodriguez J. Machine Learning Prediction of Cancer Cell Sensitivity to Drugs Based on Genomic and Chemical Properties. PLoS ONE. 2013;8(4):e61318.

Jang IS, Neto EC, Guinney J, Friend SH, Margolin AA. Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. In: Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing: 2014. NIH Public Access: 63.

Falgreen S, Dybkær K, Young KH, Xu-Monette ZY, El-Galaly TC, Laursen MB, Bødker JS, Kjeldsen MK, Schmitz A, Nyegaard M. Predicting response to multidrug regimens in cancer patients using cell line experiments and regularised regression models. BMC Cancer. 2015;15(1):235.

Chiluka N, Andrade N, Pouwelse J. A link prediction approach to recommendations in large-scale user-generated content systems. In: Advances in Information Retrieval. Springer; 2011. p. 189–200.

Turki T, Wei Z. A greedy-based oversampling approach to improve the prediction of mortality in MERS patients. In: 2016 Annual IEEE Systems Conference (SysCon): 18–21 April 2016 2016. 1–5.

Turki T, Wei Z. IPRed: Instance Reduction Algorithm Based on the Percentile of the Partitions. In: MAICS: 2015. 181–185.

Settles B. Active learning literature survey. Univ Wis Madison. 2010;52(55–66):11.

Melville P, Mooney RJ. Diverse ensembles for active learning. In: Proceedings of the twenty-first international conference on Machine learning: 2004. ACM: 74.

Gilad-Bachrach R, Navot A, Tishby N. Query by committee made real. In: Advances in neural information processing systems: 2005. 443–450.

Krogh A, Vedelsby J. Neural network ensembles, cross validation, and active learning. Adv Neural Inf Proces Syst. 1995;7:231–8.

Joensuu H, Kellokumpu-Lehtinen P-L, Bono P, Alanko T, Kataja V, Asola R, Utriainen T, Kokko R, Hemminki A, Tarkkanen M, et al. Adjuvant Docetaxel or Vinorelbine with or without Trastuzumab for Breast Cancer. N Engl J Med. 2006;354(8):809–20.

Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003;31(1):68–71.

Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, Mohsin S, Osborne CK, Chamness GC, Allred DC, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet. 362(9381):362–9.

Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Tham Y-L, Kalidas M, Elledge R, Mohsin S, Osborne CK, et al. Patterns of Resistance and Incomplete Response to Docetaxel by Gene Expression Profiling in Breast Cancer Patients. J Clin Oncol. 2005;23(6):1169–77.

Mulligan G, Mitsiades C, Bryant B, Zhan F, Chng WJ, Roels S, Koenig E, Fergus A, Huang Y, Richardson P, et al. Gene expression profiling and correlation with outcome in clinical trials of the proteasome inhibitor bortezomib. Blood. 2007;109(8):3177–88.

Balfer J, Bajorath J. Systematic Artifacts in Support Vector Regression-Based Compound Potency Prediction Revealed by Statistical and Activity Landscape Analysis. PLoS One. 2015;10(3):e0119301.

Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009:kxp008.

Sigg CD, Buhmann JM. Expectation-maximization for sparse and non-negative PCA. In.: 2008: 960–967.

The authors thank anonymous reviewers for their valuable comments of the manuscripts submitted to ICIBM 2016, which helped improve the paper considerably. TT thanks King Abdulaziz University for their scholarship and Saudi Arabian Cultural Mission for their academic and financial support.

Funding

Publication charges for this article have been funded by the first corresponding author.

This article has been published as part of BMC Systems Biology Volume 11 Supplement 5, 2017: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2016: systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume-11-supplement-5.

Author information

Authors and Affiliations

Department of Computer Science, King Abdulaziz University, P.O. Box 80221, Jeddah, 21589, Saudi Arabia

Turki Turki

Bioinformatics Program and Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA

TT and ZW conceived the study. TT and ZW designed the algorithms. TT implemented the algorithms and conducted the experiments. TT and ZW wrote and approved the final manuscript.

Performance evaluation of prediction algorithms on clinical trial data pertaining to non-small cell lung cancer patients and triple-negative breast cancer patients. (DOCX 31 kb)

Performance of prediction algorithms using signaling pathways as a constraint to get reliable feature set. (DOCX 25 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Turki, T., Wei, Z. A link prediction approach to cancer drug sensitivity prediction.
BMC Syst Biol11
(Suppl 5), 94 (2017). https://doi.org/10.1186/s12918-017-0463-8