
Network-based characterization of drug-protein interaction signatures with a space-efficient approach

Abstract

Background

Characterizing drug-protein interaction networks in terms of biological features has become a challenging problem in pharmaceutical science, with the goal of better understanding polypharmacology.

Results

We present a novel method for systematically analyzing the underlying features characteristic of drug-protein interaction networks, which we call “drug-protein interaction signatures”. We develop an efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by space-efficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers.

Conclusions

Our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue that these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions, and they are expected to contribute to rational drug design.

Background

Target proteins of drug molecules are classified into a primary target and off-targets. The former is the desired target, whereas the latter could lead to adverse drug reactions [1] or unexpected beneficial effects in drug repositioning [2]. Therefore, comprehensive analysis throughout primary targets and off-targets on a genome-wide scale is crucial in drug discovery. The in silico approach is expected to improve the research productivity in this field.

Several computational methods have been presented for predicting drug-protein interactions (or compound-protein interactions) on a large scale from chemogenomic and pharmacogenomic viewpoints. The basic idea behind the chemogenomic approach is that chemically similar drugs are expected to interact with similar proteins, where the similarities of drugs and proteins are defined based on their chemical structures and amino acid sequences, respectively [3–8]. On the other hand, the key idea behind the pharmacogenomic approach is that phenotypically similar drugs are predicted to interact with similar proteins, on the basis of drug side effects and/or protein sequences [9–12]. However, previous predictive models are not easily interpretable, which makes it difficult to extract biological features characterizing drug-protein interactions and to gain insight into the theoretical basis of the interactions.

The characterization of drug-protein interaction networks with biological features has become a challenging problem in modern pharmaceutical science toward a better understanding of polypharmacology. It is hypothesized that polypharmacology involves various features of drugs and target proteins (e.g., chemical substructures, pharmacophores, functional sites, and pathways) as well as complicated associations among these heterogeneous features.

A variety of feature extraction methods have recently been proposed for automatically characterizing drug-protein interactions. A data mining method was proposed for extracting molecular substructure pairs appearing frequently in interacting drug-target pairs [13]. Machine learning methods with sparse statistical models were presented to associate protein domains with drug chemical substructures [14, 15] or with drug side effects [16]. The inference of proteins eliciting drug side effects has been reported by several groups [17, 18]. However, the scalability of these methods is very limited, and these studies were conducted from the perspective of either protein functional sites, drug chemical substructures or drug phenotypic effects. There is a strong and growing need to develop efficient and scalable methods for characterizing overall drug-protein interactions with many types of features of drugs and proteins at once.

We present a novel method for systematic analyses of the underlying features characteristic of drug-protein interaction networks, which we call “drug-protein interaction signatures”. We develop a new efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by space-efficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers. In the results, our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue that these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions. To the best of our knowledge, this is the first report on characterizing a large-scale drug-protein interaction network with various biological features of drugs and proteins in an integrative framework. The drug-protein interaction signatures comprehensively inferred with our method are expected to contribute to rational drug design.

Results

Drug-protein interactions

We obtained the information on drug-protein interactions from five databases: ChEMBL [19], KEGG [20], DrugBank [21], PDSP Ki [22], and Matador [23]. The number of unique drug-protein interactions in the merged dataset is 78,692. These interactions involve 2302 drugs and 2334 target proteins, and the number of all possible drug-protein pairs is 5,372,868. We used this dataset in our experiments.
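As a quick check on these counts, the size of the pair space and the fraction of pairs with a known interaction follow directly from the reported numbers:

$$2302 \times 2334 = 5{,}372{,}868, \qquad \frac{78{,}692}{5{,}372{,}868} \approx 0.0146. $$

Known interactions therefore make up roughly 1.5 % of all possible drug-protein pairs, so the positive class is a small minority of the pair space.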

Drug profiles

We described drug chemical structures by 17,017 chemical substructures using the KEGG Chemical Function and Substructures (KCF-S) descriptor [24]. We represented each drug by a 17,017-dimension binary vector where the presence or absence of each of the KCF-S substructures is coded as 1 or 0. The resulting vector is referred to as a chemical profile.

We obtained the information about adverse drug reactions (ADRs) from the public release of the adverse event reporting system (AERS) of the US Food and Drug Administration (FDA) [25]. We derived 2,904,050 reports from 2004 to 2010 and mapped the drug names to KEGG following a previous study [12]. Based on the resulting 10,543 ADRs, we represented each drug by a 10,543-dimension binary vector where the presence or absence of each ADR is coded as 1 or 0. The resulting vector is referred to as an ADR profile.

Finally, we constructed an integrative feature vector of each drug by concatenating the chemical and the ADR profiles into a single one. The dimension of the resulting feature vector of each drug was 27,560.

Protein profiles

We obtained functional domains, biological pathways, and pathway modules (compactly clustered pathways) about proteins from the KEGG [20] and the PFAM [26] databases.

Based on 2678 PFAM domains, we represented each protein by a 2678-dimension binary vector where the presence or absence of a functional domain is coded as 1 or 0. The resulting vector is referred to as a domain profile. Based on 270 KEGG pathway maps, we represented each protein by a 270-dimension binary vector where the presence or absence of the involvement in a biological pathway is coded as 1 or 0. The resulting vector is referred to as a pathway profile. Based on 107 KEGG pathway modules, we represented each protein by a 107-dimension binary vector where the presence or absence of the involvement in a pathway module is coded as 1 or 0. The resulting vector is referred to as a module profile.

Finally, we constructed an integrative feature vector of each protein by concatenating the domain, pathway, and module profiles into a single profile. The dimension of the resulting feature vector of each protein was 3055.
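The construction of the binary profiles and their concatenation can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the three tiny vocabularies and the helper names `binary_profile` and `concatenate` are hypothetical, and only the final dimension checks use numbers taken from the text.

```python
def binary_profile(present, vocabulary):
    """Return a 0/1 list marking which vocabulary items are present."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]


def concatenate(*profiles):
    """Join several profiles into one integrative feature vector."""
    merged = []
    for profile in profiles:
        merged.extend(profile)
    return merged


# Toy vocabularies (illustrative only; the real ones have 2678, 270 and 107 entries).
domain_vocab = ["PF00001", "PF00002", "PF14446"]
pathway_vocab = ["hsa04080", "hsa03050"]
module_vocab = ["M00110"]

# Features annotated to one hypothetical protein.
protein_features = {"PF00001", "hsa04080"}

protein_profile = concatenate(
    binary_profile(protein_features, domain_vocab),
    binary_profile(protein_features, pathway_vocab),
    binary_profile(protein_features, module_vocab),
)
print(protein_profile)  # [1, 0, 0, 1, 0, 0]

# Dimensions reported in the text.
drug_dim = 17017 + 10543        # chemical profile + ADR profile
protein_dim = 2678 + 270 + 107  # domain + pathway + module profiles
print(drug_dim, protein_dim)    # 27560 3055
```

The same two helpers apply unchanged to drugs, with the chemical and ADR vocabularies in place of the protein vocabularies.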

We address the problem of extracting features characterizing drug-protein interaction networks in the framework of supervised classification.

Linear model for drug-protein pairs

Let C be a drug (or a drug candidate compound) and let P be a target protein (or a target candidate protein). We represent a drug-protein pair (C,P) as a high-dimensional feature vector Φ(C,P) and present a linear function, \(f(C,P)=\mathbf{w}^{T}\Phi(C,P)\), whose output is used to predict whether (C,P) is an interacting pair or not. The weight vector w is estimated such that each drug-protein pair is correctly classified into the interaction class (positive class) or non-interaction class (negative class) based on the training set.

An advantage of the linear model is that one can interpret features effective for predictions from learned models. Since each element in Φ(C,P) corresponds to an element of w, effective features can be selected by extracting highly weighted features. However, the performance of the linear model depends heavily on the feature vector design.

We represent each drug-protein pair as a high-dimensional feature vector by taking the tensor product of a drug profile and a protein profile. The representation is similar to that in previous studies [15, 16]. The profile of a drug C is defined as a D-dimension binary vector:

$$\Phi(C)=(c_{1},c_{2},...,c_{D})^{T}, $$

where \(c_{i} \in \{0,1\}\), i=1,...,D. The profile of a protein P is defined as a \(D^{\prime}\)-dimension binary vector: \(\Phi (P)=(p_{1},p_{2},...,p_{D^{\prime }})^{T}\), where \(p_{j} \in \{0,1\}\), \(j=1,...,D^{\prime}\). We compute the tensor product between a drug profile Φ(C) and a protein profile Φ(P), and define a feature vector Φ(C,P) as follows:

$$\Phi(C,P) = (c_{1}p_{1},c_{1}p_{2},...,c_{1}p_{D^{\prime}},c_{2}p_{1},...,c_{D}p_{1},...,c_{D}p_{D^{\prime}})^{T}, $$

where Φ(C,P) is composed of all possible products between elements in Φ(C) and those in Φ(P). The resulting feature vector is a \(D\times D^{\prime}\)-dimension binary vector, i.e., a fingerprint, for encoding cross-integrated biological features. This is referred to as a “tensor-product fingerprint”.

In this study, Φ(C) was a 27,560-dimension binary vector, and Φ(P) was a 3055-dimension binary vector. Thus, the tensor-product fingerprint Φ(C,P) of each drug-protein pair is an 84,195,800-dimension binary vector.
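Because the fingerprint is binary and sparse, it never has to be materialized as a dense vector: the product \(c_{i}p_{j}\) is 1 only when both bits are 1. The sketch below illustrates this under the assumption of 0-based indexing, with element \(c_{i}p_{j}\) placed at position \(i \times D^{\prime} + j\) to match the ordering in the definition above; the function names are hypothetical.

```python
def tensor_product_dense(drug_profile, protein_profile):
    """Dense tensor-product fingerprint: all products c_i * p_j in row-major order."""
    return [c * p for c in drug_profile for p in protein_profile]


def tensor_product_indices(drug_on, protein_on, d_prime):
    """Sparse form: sorted positions of the 1-bits, given the 1-bits of each profile."""
    return sorted(i * d_prime + j for i in drug_on for j in protein_on)


drug_profile = [1, 0, 1, 0]   # D = 4
protein_profile = [0, 1, 1]   # D' = 3

dense = tensor_product_dense(drug_profile, protein_profile)
drug_on = [i for i, c in enumerate(drug_profile) if c]
protein_on = [j for j, p in enumerate(protein_profile) if p]
sparse = tensor_product_indices(drug_on, protein_on, len(protein_profile))

print(sparse)  # [1, 2, 7, 8]
print(sparse == [d for d, bit in enumerate(dense) if bit])  # True

# Dimension of the fingerprint used in this study.
print(27560 * 3055)  # 84195800
```

The number of 1-bits in the fingerprint is the product of the numbers of 1-bits in the two profiles, which is what makes a set representation practical despite the 84-million-dimensional space.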

A simpler way of representing each drug-protein pair is to concatenate Φ(C) and Φ(P) into a single feature vector as \(\Phi(C,P)=(\Phi(C)^{T},\Phi(P)^{T})^{T}\) [7]. However, this representation cannot capture associations between drug features and protein features. The feature vector is referred to as a “concatenated fingerprint”.
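The difference between the two fingerprints becomes explicit when the linear model is written out for each. The split of the weight vector into a drug part \(\mathbf{w}_{C}\) and a protein part \(\mathbf{w}_{P}\), and the double index \(w_{ij}\), are notation introduced here for illustration only:

$$f_{\text{concat}}(C,P) = \mathbf{w}_{C}^{T}\Phi(C) + \mathbf{w}_{P}^{T}\Phi(P) = \sum_{i=1}^{D} w^{C}_{i} c_{i} + \sum_{j=1}^{D^{\prime}} w^{P}_{j} p_{j}, $$

$$f_{\text{tensor}}(C,P) = \sum_{i=1}^{D}\sum_{j=1}^{D^{\prime}} w_{ij}\, c_{i}\, p_{j}. $$

With the concatenated fingerprint the score is a sum of a drug-only term and a protein-only term, so no weight is tied to a specific (drug feature, protein feature) pair. With the tensor-product fingerprint each weight \(w_{ij}\) belongs to exactly one such pair, at the cost of \(D \times D^{\prime} = 84{,}195{,}800\) weights instead of \(D + D^{\prime} = 30{,}615\).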

Logistic regression

We apply logistic regression to train the weight vector in the linear model and introduce L1-regularization to prevent over-fitting. L1-regularization induces sparsity in the weight vector and drives most of the weights corresponding to unimportant features to zero, which makes it easier to interpret the model and extract features.

Minimizing the logistic loss with L1-regularization for a large amount of high-dimensional data is difficult, but several efficient algorithms have recently been proposed. To the best of our knowledge, LIBLINEAR [27] is among the most efficient and high-performance implementations, but it requires a huge amount of memory for extremely high-dimensional data. In fact, it was not computationally feasible for our dataset in this study because of the memory problem (see the “Results” section). To overcome this difficulty, we introduce a gradient-based method.

Given a collection of drug-protein pairs and their labels \((\Phi(C_{i},P_{j}),y_{ij})\), where \(y_{ij} \in \{+1,-1\}\) (i=1,...,n, j=1,...,m), the logistic loss is defined as

$$LR(\mathbf{w}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} \log \left(1+\exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right)\right). $$

The logistic loss with L1-regularization is defined as

$$ L_{1}\text{-}LR(\mathbf{w}) \,=\,\! \sum\limits_{i=1}^{n}\!\sum\limits_{j=1}^{m} \!\log\! \left(\!1\,+\,\exp\! \left(\!-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j})\! \right) \!\right) + C\|\mathbf{w}\|_{1}, $$

where \(\|\mathbf{w}\|_{1}\) is the L1 norm (the sum of the absolute values of the elements of the vector) and C is a regularization parameter.
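The objective can be evaluated directly on the set representation of the fingerprints, since \(\mathbf{w}^{T}\Phi(C_{i},P_{j})\) is just the sum of the weights at the 1-bits. The following is a minimal sketch under that assumption; the sparse dictionary for w, the function names, and the toy examples are hypothetical and are not taken from the authors' implementation.

```python
import math


def score(w, on_bits):
    """w^T Phi for a binary fingerprint given as the set of its 1-bit positions."""
    return sum(w.get(d, 0.0) for d in on_bits)


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def l1_logistic_loss(w, examples, C):
    """Logistic loss over all examples plus C times the L1 norm of w."""
    data_term = sum(log1pexp(-y * score(w, bits)) for bits, y in examples)
    penalty = C * sum(abs(v) for v in w.values())
    return data_term + penalty


# Two toy drug-protein pairs: (set of 1-bit positions, label in {+1, -1}).
examples = [({1, 2}, +1), ({7, 8}, -1)]
w = {1: 0.5, 8: -0.5}   # sparse weight vector; missing entries are zero

loss_at_zero = l1_logistic_loss({}, examples, C=0.1)
loss_at_w = l1_logistic_loss(w, examples, C=0.1)
print(loss_at_zero, loss_at_w)
```

With all weights at zero every example contributes log 2, which gives a convenient sanity check for any implementation of the loss.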

Since L1-LR(w) is a convex function, any local minimum is a global minimum. However, the gradient of L1-LR(w) is not defined everywhere, because the L1 norm is non-differentiable at points where \(w_{d}=0\). Instead, we compute the d-th dimensional gradient \(\nabla_{d} LR(\mathbf{w})\) of the smooth part LR(w) as follows:

$$ \nabla_{d} LR(\mathbf{w}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} \frac{-y_{ij} \Phi_{d}(C_{i},P_{j}) \exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right)}{1+\exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right)}, $$

where \(\Phi_{d}(C_{i},P_{j})\) is the d-th dimensional value of \(\Phi(C_{i},P_{j})\). We then compute the \(D\times D^{\prime}\)-dimensional gradient vector \(\nabla LR(\mathbf {w}) \in \mathfrak {R}^{D\times D^{\prime }}\) as

$$\nabla LR(\mathbf{w}) = \left(\nabla_{1} LR(\mathbf{w}),\nabla_{2} LR(\mathbf{w}),...,\nabla_{D\times D^{\prime}} LR(\mathbf{w}) \right)^{T}. $$
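For binary fingerprints the gradient formula simplifies: \(\Phi_{d}(C_{i},P_{j})\) is 0 or 1, so each example adds the same coefficient to every dimension in its set of 1-bits, and the ratio \(\exp(-z)/(1+\exp(-z))\) equals the sigmoid of \(-z\). The sketch below illustrates this and checks it against a central finite difference; the dense list for w, the function names, and the toy data are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(w, examples):
    """LR(w) = sum of log(1 + exp(-y * w^T Phi)) over all examples."""
    total = 0.0
    for bits, y in examples:
        z = -y * sum(w[d] for d in bits)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total


def logistic_gradient(w, examples):
    """Gradient of LR(w); only the 1-bit dimensions of each example are touched."""
    grad = [0.0] * len(w)
    for bits, y in examples:
        margin = y * sum(w[d] for d in bits)
        coeff = -y * sigmoid(-margin)   # -y * exp(-m) / (1 + exp(-m))
        for d in bits:
            grad[d] += coeff
    return grad


def numerical_gradient(w, examples, eps=1e-6):
    """Central finite differences, used only to verify the analytic gradient."""
    approx = []
    for d in range(len(w)):
        up = list(w)
        down = list(w)
        up[d] += eps
        down[d] -= eps
        approx.append((logistic_loss(up, examples) - logistic_loss(down, examples)) / (2 * eps))
    return approx


w = [0.5, -0.25, 0.0, 1.0]
examples = [([0, 2], +1), ([1, 3], -1), ([0, 3], +1)]
grad = logistic_gradient(w, examples)
check = numerical_gradient(w, examples)
print(max(abs(a - b) for a, b in zip(grad, check)) < 1e-6)  # True
```

The cost of one gradient evaluation is proportional to the total number of 1-bits over all fingerprints, not to the 84-million-dimensional size of w.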

The use of \(\nabla LR(\mathbf{w})\) enables the global minimum of L1-LR(w) to be found with an efficient gradient-based optimization algorithm called orthant-wise limited-memory quasi-Newton (OWL-QN) [28]. The L1-regularized logistic regression methods with the proposed tensor-product fingerprint and with the concatenated fingerprint are referred to as L1LOG-tensor and L1LOG-concat, respectively.
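One ingredient of OWL-QN that shows how the non-differentiable L1 term is handled is the pseudo-gradient, which replaces the ordinary gradient at coordinates where \(w_{d}=0\). The sketch below follows the commonly stated definition of the pseudo-gradient and is an illustration only: it omits the quasi-Newton direction, the orthant projection, and the line search, and it is not the implementation used in this study. The function name is hypothetical.

```python
def pseudo_gradient(w, grad, C):
    """Pseudo-gradient of LR(w) + C * ||w||_1, given grad = gradient of LR at w."""
    result = []
    for w_d, g_d in zip(w, grad):
        if w_d > 0:
            result.append(g_d + C)      # L1 term is differentiable: slope +C
        elif w_d < 0:
            result.append(g_d - C)      # L1 term is differentiable: slope -C
        elif g_d + C < 0:
            result.append(g_d + C)      # right derivative is negative: move w_d up
        elif g_d - C > 0:
            result.append(g_d - C)      # left derivative is positive: move w_d down
        else:
            result.append(0.0)          # zero is optimal for this coordinate
    return result


w = [1.0, -2.0, 0.0, 0.0, 0.0]
grad = [0.25, 0.25, -1.5, 1.5, 0.25]
pg = pseudo_gradient(w, grad, C=1.0)
print(pg)  # [1.25, -0.75, -0.5, 0.5, 0.0]
```

The last case is what produces sparsity: a weight at zero stays at zero whenever the magnitude of the loss gradient in that coordinate does not exceed C.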

For comparison, we also trained models with L2-regularized logistic regression using the gradient-based limited-memory quasi-Newton algorithm (L-BFGS) [29]. The L2-regularized logistic regression methods with the tensor-product fingerprint and with the concatenated fingerprint are referred to as L2LOG-tensor and L2LOG-concat, respectively.

Space-efficient representation of drug-protein pairs

A compact representation of drug-protein pairs is crucial for training linear models in memory, so we use a set representation whose items are the dimensions at which the fingerprint has a 1-bit. However, storing these sets still consumes a huge amount of memory, which limits scalability for extremely high-dimensional data. To overcome this memory problem, we constructed two space-efficient representations of fingerprints. We present a brief overview of these representations (further details are given in the supplemental material [30]).

Figure 1 illustrates the construction of our two representations. We first represent each fingerprint \(\Phi(C_{i},P_{j})\) as a set \(S(C_{i},P_{j})=\{d \mid \Phi_{d}(C_{i},P_{j})=1, d=1,...,D\times D^{\prime}\}\) that contains the dimensions at which \(\Phi(C_{i},P_{j})\) has a 1-bit. We refer to this set representation of fingerprints as SET. To make each stored item small, we then compute the difference between the k-th item \(S(C_{i},P_{j})[k]\) and the (k−1)-th item \(S(C_{i},P_{j})[k-1]\) as \(S(C_{i},P_{j})[k]-S(C_{i},P_{j})[k-1]\) and keep the results in a new set \(S^{\prime}(C_{i},P_{j})\). We can recover \(S(C_{i},P_{j})\) by cumulatively adding the items in \(S^{\prime}(C_{i},P_{j})\).
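The difference step and its inverse can be sketched in a few lines. This is a minimal illustration with hypothetical function names; it assumes the items are sorted in increasing order and that the first item is stored as its difference from 0, which the text does not specify.

```python
from itertools import accumulate


def to_gaps(sorted_items):
    """Replace each item by its difference from the previous item (first: from 0)."""
    gaps = []
    previous = 0
    for item in sorted_items:
        gaps.append(item - previous)
        previous = item
    return gaps


def from_gaps(gaps):
    """Recover the original items by cumulatively adding the differences."""
    return list(accumulate(gaps))


s = [3, 10, 11, 250, 84195800]   # 1-bit positions of one toy fingerprint
gaps = to_gaps(s)
print(gaps)                      # [3, 7, 1, 239, 84195550]
print(from_gaps(gaps) == s)      # True
```

Positions can be as large as 84,195,800 and need up to 27 bits each, whereas the differences between neighboring positions are often much smaller, which is what a variable-length encoding can exploit.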

Fig. 1

Brief summary of constructing space-efficient representations of fingerprints for drug-protein pairs constructed with our proposed method: VLA and SUCTRIE

We constructed our two space-efficient representations of fingerprints by leveraging the idea behind succinct data structures, which represent data in a small amount of space while preserving fast operations. The first one is a variable-length array for compactly representing fingerprints. The set \(S^{\prime}(C_{i},P_{j})\) is represented by two bit strings \(R_{ij}\) and \(P_{ij}\), which are indexed by a rank/select dictionary, i.e., a succinct data structure for bit strings. We can randomly access any element in \(S^{\prime}(C_{i},P_{j})\) in O(1) time by using fast operations in the rank/select dictionary [31]. We refer to this variable-length array representation of fingerprints as VLA.
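The idea of a variable-length array backed by two bit strings can be illustrated with a deliberately simplified sketch. The exact layout of the two bit strings is given in the supplemental material and is not reproduced here; the layout below (one bit string of concatenated minimal binary codes, one marker bit string flagging where each code starts) is an assumption chosen for illustration, and the class and method names are hypothetical. The `select` operation is implemented as a linear scan, whereas a real rank/select dictionary answers it in O(1) time, and a real implementation would pack the bits into machine words rather than Python lists.

```python
class VariableLengthArray:
    """Stores non-negative integers using only as many bits as each value needs."""

    def __init__(self, values):
        self.bits = []     # concatenated binary codes of all values
        self.starts = []   # marker bit string: 1 where a code begins, else 0
        self.size = len(values)
        for value in values:
            code = bin(value)[2:]              # minimal binary code, "0" for zero
            for k, ch in enumerate(code):
                self.bits.append(int(ch))
                self.starts.append(1 if k == 0 else 0)

    def _select1(self, k):
        """Position of the k-th 1 (k is 0-based) in the marker bit string."""
        seen = -1
        for pos, bit in enumerate(self.starts):
            if bit:
                seen += 1
                if seen == k:
                    return pos
        raise IndexError(k)

    def __getitem__(self, k):
        if not 0 <= k < self.size:
            raise IndexError(k)
        begin = self._select1(k)
        end = self._select1(k + 1) if k + 1 < self.size else len(self.bits)
        value = 0
        for bit in self.bits[begin:end]:
            value = (value << 1) | bit
        return value


gaps = [3, 7, 1, 239]
vla = VariableLengthArray(gaps)
print([vla[k] for k in range(len(gaps))])  # [3, 7, 1, 239]
print(len(vla.bits))                       # 14
```

Here the four values occupy 2 + 3 + 1 + 8 = 14 payload bits plus 14 marker bits, compared with 4 × 32 = 128 bits in a fixed-width 32-bit array.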

The second one is a type of succinct trie for representing fingerprints. The trie is a data structure for strings, and it is also practical for representing fingerprints. A standard pointer-based implementation of a trie consumes a huge amount of memory, resulting in limited scalability. Instead, we present a compact representation of the trie by using a succinct data structure called LOUDS [32]. We can recover the original fingerprints by traversing the succinct trie in a depth-first manner. We refer to this succinct trie representation of fingerprints as SUCTRIE.
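The following sketch illustrates the two ideas in this paragraph on a toy scale: fingerprints that share a prefix share trie nodes, and the shape of the trie can be written as a LOUDS bit string (breadth-first order, one `1` per child followed by a `0` for each node, preceded by `10` for a virtual super-root). It is an assumption-laden illustration, not the SUCTRIE implementation: the trie here is still pointer-based, the fingerprints are taken to be sequences of differences, and a complete succinct trie would also store the edge labels and terminal flags in arrays aligned with the bit string and navigate it with rank/select. All names are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}     # edge label -> child node
        self.terminal = False  # True if a stored sequence ends here


def build_trie(sequences):
    root = Node()
    for seq in sequences:
        node = root
        for item in seq:
            node = node.children.setdefault(item, Node())
        node.terminal = True
    return root


def enumerate_sequences(node, prefix=()):
    """Depth-first traversal that yields every stored sequence."""
    if node.terminal:
        yield list(prefix)
    for item in sorted(node.children):
        yield from enumerate_sequences(node.children[item], prefix + (item,))


def louds_bits(root):
    """LOUDS encoding of the tree shape, visiting nodes in breadth-first order."""
    bits = "10"                # virtual super-root with the root as its only child
    queue = deque([root])
    while queue:
        node = queue.popleft()
        bits += "1" * len(node.children) + "0"
        for item in sorted(node.children):
            queue.append(node.children[item])
    return bits


fingerprints = [[3, 7, 1], [3, 7, 5], [3, 9]]
trie = build_trie(fingerprints)
print(list(enumerate_sequences(trie)))  # [[3, 7, 1], [3, 7, 5], [3, 9]]
print(louds_bits(trie))                 # 1010110110000
```

The three fingerprints contain eight items in total but the trie has only five labeled edges, and the tree shape of its six nodes takes 2 × 6 + 1 = 13 bits.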

Extraction of drug-protein interaction signatures

We applied the proposed method (L1LOG-tensor) to extract drug-protein interaction signatures from drug profiles (chemical substructures and adverse drug reactions) and protein profiles (protein domains, biological pathways, and pathway modules), based on a large-scale drug-protein interaction network. Each signature is the association between a drug feature and protein feature, where two features in the same signature are thought of as being associated in terms of drug-protein interactions. The results of all extracted drug-protein interaction signatures are presented in the supplemental material [30].

L1LOG-tensor extracted 105,684 signatures, while L2LOG-tensor extracted 7,843,218 signatures. Note that the number of all possible combinations of drug features and protein features is 84,195,800. The number of signatures from our L1LOG-tensor method was much smaller than that from L2LOG-tensor, owing to the sparsity induced by L1-regularization. This makes it easier to analyze the extracted drug-protein interaction signatures for biological interpretation, so we focus on the results from L1LOG-tensor below.
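Reading signatures off a trained model amounts to mapping each non-zero weight back to its (drug feature, protein feature) pair. The sketch below assumes 0-based positions laid out as \(i \times D^{\prime} + j\), a sparse dictionary for w, and made-up feature names; whether only positively weighted associations are counted as signatures is not stated here, so the sketch keeps every non-zero weight and sorts by magnitude.

```python
def extract_signatures(w, d_prime, drug_names, protein_names):
    """Return (drug feature, protein feature, weight) for every non-zero weight."""
    signatures = []
    for position, weight in w.items():
        if weight != 0.0:
            i, j = divmod(position, d_prime)   # invert position = i * D' + j
            signatures.append((drug_names[i], protein_names[j], weight))
    signatures.sort(key=lambda s: -abs(s[2]))
    return signatures


# Hypothetical toy model with D = 3 drug features and D' = 3 protein features.
drug_names = ["substructure_A", "substructure_B", "adr_X"]
protein_names = ["domain_1", "pathway_1", "module_1"]
w = {1: 0.8, 5: -0.2, 6: 0.0, 8: 1.5}

signatures = extract_signatures(w, 3, drug_names, protein_names)
for drug_feature, protein_feature, weight in signatures:
    print(drug_feature, protein_feature, weight)
# adr_X module_1 1.5
# substructure_A pathway_1 0.8
# substructure_B module_1 -0.2
```

On the scale of this study, 105,684 non-zero weights correspond to about 0.13 % of all feature combinations.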

Figure 2 shows a network representation of some of the drug-protein signatures extracted with L1LOG-tensor, displaying highly weighted associations among five types of drug or protein features: drug chemical substructures (blue), adverse drug reactions (red), biological pathways (green), pathway modules (yellow), and protein domains (gray). Only selected results are shown due to space limitations. The inferred signature association network provides clues about the important features behind the drug-protein interaction network. To the best of our knowledge, there has been no previous study on the inference of these associations.

Fig. 2

Part of obtained drug-protein interaction signature network among five features, i.e., drug chemical substructures (blue), adverse drug reactions (red), protein domain (gray), biological pathway (green), and pathway module (yellow). Node size represents degree of each feature, and edge width represents corresponding weight in model

Biological interpretation of the extracted signatures

We examined the biological interpretation of the drug-protein interaction signatures extracted with L1LOG-tensor. We give only a few examples due to space limitations. The results for all analyzed signatures and the associated figures and tables are presented in the supplemental material [30].

Figure 3 shows an extracted signature representing the association between a drug-chemical substructure (SKELETON C1b(N1d)-C1b(O7a) in the KCF-S format) and a biological pathway (hsa04080 Neuroactive ligand-receptor interaction), where the vertical axis on the heat map (a) shows all drugs sharing the extracted substructure, and the horizontal axis shows all proteins sharing the extracted pathway. The extracted drug-chemical substructures on the associated drug structures (b) are in pink. Drugs and proteins in known interacting pairs tend to have the extracted features of the same signature. For example, Propantheline bromide (D00481), Methanthelinium bromide (D00721), Acetylcholine chloride (D00999), Carbachol (D00524), Succinylcholine chloride (D00766), and Suxamethonium chloride (D02275) share a choline skeleton, and all are known to act on acetylcholine receptors. However, there are many other drugs sharing the extracted drug feature and proteins sharing the extracted protein feature for which no drug-protein interaction is known. Thus, it may be possible to predict previously unknown interactions between drugs and proteins through the extracted features in the signatures. See Table 1 and Fig. 4 for detail.

Fig. 3

Example of drug-protein interaction signature: association between drug-chemical substructure (SKELETON C1b(N1d)-C1b(O7a) in KCF-S format) and biological pathway (hsa04080 Neuroactive ligand-receptor interaction). a Horizontal axis shows drugs sharing chemical substructure, and vertical axis shows proteins sharing biological pathway. Color of each element corresponds to number of databases storing corresponding interaction. b Chemical structures of drugs sharing substructure are shown, and extracted substructure is highlighted in red (see Table 1 for further details)

Table 1 Association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z(C5a+O7a)-C1z(C1a)” and KEGG pathway “hsa04080 Neuroactive ligand-receptor interaction”. See also Fig. 4

Fig. 4

The association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z(C5a+O7a)-C1z(C1a)” and KEGG pathway “hsa04080 Neuroactive ligand-receptor interaction”. a The heat map shows the number of databases (among KEGG, DrugBank, Matador, ChEMBL, and PDSP Ki) that register each confirmed drug-protein interaction. Horizontal and vertical axes show drugs and proteins, respectively. Gray, blue, green, yellow, orange and red indicate that 0, 1, 2, 3, 4 and 5 databases contain the corresponding interaction. b Chemical structures of some drugs, where red areas (if any) show the extracted substructure indicated by KCF-S. See also Table 1

Table 2 shows an extracted signature representing the association between an ADR (R01631 Graft-versus-host disease) and protein domain (PF14446 Prokaryotic RING finger family 1), where all drugs sharing the extracted ADR and all proteins sharing the extracted protein domain are also shown. Interestingly, most drugs sharing the ADR (R01631 Graft-versus-host disease) were related to inflammation, immunosuppression, and cancer, which supports the recently expanded concept that inflammation is a critical component of cancer progression [33]. See Fig. 5 and Table 3 for detail.

Fig. 5

The association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C1x)-C1y(C1x)-C1z(C1a+C1y)” and KEGG pathway module “hsa_M00110 C19/C18-Steroid hormone biosynthesis”. a The heat map shows the number of databases (among KEGG, DrugBank, Matador, ChEMBL, and PDSP Ki) that register each confirmed drug-protein interaction. Horizontal and vertical axes show drugs and proteins, respectively. Gray, blue, green, yellow, orange and red indicate that 0, 1, 2, 3, 4 and 5 databases contain the corresponding interaction. b Chemical structures of some drugs, where red areas (if any) show the extracted substructure indicated by KCF-S. See also Table 3

Table 2 Example of drug-protein interaction signature: association between adverse drug reaction (ADR) (R01631 Graft-versus-host disease) and protein domain (PF14446 Prokaryotic RING finger family 1)
Table 3 The association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C1x)-C1y(C1x)-C1z(C1a+C1y)” and KEGG pathway module “hsa_M00110 C19/C18-Steroid hormone biosynthesis”

Figure 4 shows an extracted signature representing the association between a drug-chemical substructure (RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z(C5a+O7a)-C1z(C1a) in the KCF-S format) and biological pathway (hsa04080 Neuroactive ligand-receptor interaction). It was observed that Megestrol acetate (D00952), Cyproterone acetate (D01368) and Chlormadinone acetate (D01299) share common ring structures. All these drugs are known to act on many neuroactive ligand-receptors. See Table 1 for detail.

Figure 6 shows an extracted signature representing the association between a drug-chemical substructure (SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x in the KCF-S format) and a biological pathway (hsa03050 Proteasome). Proteasome inhibitors have been applied to the treatment of cancer, especially multiple myeloma. The substructure “SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x” corresponds to a phenylalanine residue, which is captured as a characteristic substructure in the known proteasome inhibitors Bortezomib (D03150) and Carfilzomib (D08880). See Table 4 for detail.

Fig. 6

The association between KCF-S “SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x” and KEGG pathway “hsa03050 Proteasome”. a The heat map shows the number of databases (among KEGG, DrugBank, Matador, ChEMBL, and PDSP Ki) that register each confirmed drug-protein interaction. Horizontal and vertical axes show drugs and proteins, respectively. Gray, blue, green, yellow, orange and red indicate that 0, 1, 2, 3, 4 and 5 databases contain the corresponding interaction. b Chemical structures of some drugs, where red areas (if any) show the extracted substructure indicated by KCF-S. See also Table 4 for detail

Table 4 The association between KCF-S “SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x” and KEGG pathway “hsa03050 Proteasome”

Performance evaluation on generalization property

If the extracted signatures are biologically meaningful in terms of drug-protein interactions, a model built on them should generalize well when predicting drug-protein interactions that were not used for training.

We tested five feature extraction methods (L1LOG-tensor, L2LOG-tensor, L1LOG-concat, L2LOG-concat, and L1LOG-LIBLINEAR-tensor) on their ability to reconstruct known drug-protein interactions. As mentioned above, L1LOG-tensor is our proposed method. The others are previous methods based on current algorithms or conventional fingerprints (see the Logistic regression section for further details). L1LOG-tensor and L2LOG-tensor use tensor-product fingerprints represented by our space-efficient algorithm. L1LOG-concat and L2LOG-concat use previous concatenated fingerprints [7] represented by the LIBLINEAR algorithm [27]. L1LOG-LIBLINEAR-tensor is a method [15, 16] which uses the tensor-product fingerprints represented by the LIBLINEAR algorithm [27].

We conducted the following 5-fold cross-validation in a pair-wise manner. We first randomly divided all drug-protein pairs in the gold standard set into five subsets. Next, we considered four of the subsets as a training set and the remaining subset as a test set. We learned a predictive model on the drug-protein pairs in the training set. Finally, we applied the predictive model to the drug-protein pairs in the test set.
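The pair-wise splitting procedure can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy pair identifiers; the random seed and the round-robin assignment of shuffled pairs to folds are assumptions, as the text only states that the division is random.

```python
import random


def pairwise_folds(pairs, k=5, seed=0):
    """Yield (train, test) splits; every pair is in the test set of exactly one fold."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for t in range(k):
        test = folds[t]
        train = [pair for f, fold in enumerate(folds) if f != t for pair in fold]
        yield train, test


# Toy gold standard: all pairs between 4 drugs and 5 proteins.
pairs = [(drug, protein) for drug in range(4) for protein in range(5)]
splits = list(pairwise_folds(pairs, k=5, seed=0))

for train, test in splits:
    print(len(train), len(test))   # 16 4 for every fold
```

Because the split is over pairs, a drug or protein in a test pair usually also appears in some training pair.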

We used the receiver operating characteristic curve (ROC curve), which is defined as a plot of true positive rates against false positive rates based on various thresholds, and the precision-recall curve (PR curve), which is defined as a plot of precision (positive predictive value) against recall (sensitivity) based on various thresholds, as evaluation measures for prediction performance.

We computed the area under the ROC curve (AUC score) and the area under the PR curve (AUPR score). The parameters involved in each method (e.g., regularization parameter) were fit with AUC and AUPR as the objective functions.
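Both scores can be computed from the labels and predicted scores of the test pairs. The sketch below is one common way to do so and is an assumption about the estimator, which the text does not specify: the AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half, and the AUPR is approximated by average precision, with ties in the scores broken arbitrarily. The pairwise AUC loop is quadratic and only suitable for small examples. Function names are hypothetical.

```python
def auc_score(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = auc_score(labels, scores)
aupr = average_precision(labels, scores)
print(round(auc, 4), round(aupr, 4))  # 0.8333 0.8333
```

When positives are rare, as they are in this dataset, the AUPR is typically the more sensitive of the two measures because precision is directly affected by the number of high-scoring negatives.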

Figure 7 shows the AUC and AUPR scores in the pair-wise cross-validation, where the number of negative pairs in the training set was changed from the same number of positive examples to that of all possible negative examples in the training set. We observed that the prediction accuracy of the models trained with all five methods improved as the number of negative examples in the training set increased. This suggests that using all possible negative examples for learning a predictive model will enhance prediction reliability. L1LOG-tensor performed the best.

Fig. 7

AUC score (left) and AUPR score (right) in pair-wise cross validation

L1LOG-LIBLINEAR-tensor did not perform well with an increasing number of negative examples in the training set because of the memory storage problem. The learning process with the LIBLINEAR algorithm consumed all the memory of our machine, which had 128 GB of memory. In contrast, the other four methods were able to finish the training process. This suggests that our space-efficient algorithm is better suited to learning a predictive model on extremely high-dimensional data.

L1LOG-tensor and L2LOG-tensor performed better than L1LOG-concat and L2LOG-concat, which suggests that the tensor-product fingerprint can capture relevant information for drug-protein interaction prediction. On the other hand, the concatenated fingerprint cannot capture enough information, even though the calculation is faster.

Table 5 shows the AUC score, AUPR score, training time, and consumed memory in the pair-wise cross-validation. L1LOG-tensor and L2LOG-tensor consumed 24 GB for learning predictive models on all possible drug-protein pairs, which suggests their applicability to large-scale drug-protein interaction prediction. They also took about 24 hours, which can be considered reasonable on a practical level, though they were slower than L1LOG-concat and L2LOG-concat.

Table 5 AUC score, AUPR score, training time in seconds, and consumed memory in megabytes in the pair-wise cross validation experiments

In the pair-wise cross-validation, drugs and proteins in test pairs often overlap with those in the training set. We therefore conducted a different 5-fold cross-validation, which we call “block-wise cross-validation”, in which the drugs and proteins in the test pairs do not overlap with those in the training set. The results of this block-wise cross-validation are shown in Fig. 8 and Table 6. The same tendency as in the pair-wise cross-validation was also seen in the block-wise cross-validation. However, the AUC and AUPR scores in the block-wise cross-validation were much lower than those in the pair-wise cross-validation. The results indicate that, in practice, predicting unknown interactions for new drug candidates (without known targets) and orphan proteins (without known ligands) is much more difficult than detecting missing interactions between drugs with known targets and proteins with known ligands.
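One way to realize such a split is to partition the drugs and the proteins separately and to keep only pairs whose drug and protein fall on the same side of the partition. The exact blocking scheme used in the study is not described here, so the sketch below is an assumption: in each fold, the test set contains pairs of held-out drugs with held-out proteins, the training set contains pairs of the remaining drugs with the remaining proteins, and mixed pairs are discarded. The function name is hypothetical.

```python
import random


def blockwise_folds(drugs, proteins, k=5, seed=0):
    """Yield (train, test) pair lists that share neither drugs nor proteins."""
    rng = random.Random(seed)
    drugs = list(drugs)
    proteins = list(proteins)
    rng.shuffle(drugs)
    rng.shuffle(proteins)
    drug_blocks = [set(drugs[i::k]) for i in range(k)]
    protein_blocks = [set(proteins[i::k]) for i in range(k)]
    for t in range(k):
        held_drugs = drug_blocks[t]
        held_proteins = protein_blocks[t]
        train = [(c, p) for c in drugs if c not in held_drugs
                 for p in proteins if p not in held_proteins]
        test = [(c, p) for c in held_drugs for p in held_proteins]
        yield train, test


block_splits = list(blockwise_folds(range(10), range(10), k=5, seed=0))
for train, test in block_splits:
    print(len(train), len(test))   # 64 4 for every fold
```

Under this scheme every test pair involves a drug and a protein that the model has never seen during training, which mirrors the new-drug and orphan-protein setting discussed above.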

Fig. 8

AUC score (left) and AUPR score (right) in block-wise cross validation

Table 6 AUC score, AUPR score, training time in seconds, and consumed memory in megabytes in the block-wise cross validation experiments

Finally, we tested SUCTRIE, VLA, and SET on the space efficiency of their fingerprint representations. Note that SET is a standard representation, and SUCTRIE and VLA are those constructed with our proposed method. Figure 9 shows a plot of the consumed memory against the number of fingerprints. SET is known to use a large amount of memory for storing all possible fingerprints. In fact, it consumed 57 GB for storing all possible drug-protein pairs in our dataset, which limits its practical usage. In contrast, our proposed representations SUCTRIE and VLA are more space-efficient than SET. The consumed memory of SUCTRIE was slightly smaller than that of VLA. SUCTRIE and VLA consumed 16 and 20 GB, respectively, for storing all possible drug-protein pairs, suggesting the usefulness of SUCTRIE and VLA. In fact, we would not have been able to conduct all the analyses for this study without SUCTRIE.

Fig. 9

Comparison of consumed memory between different fingerprint representations: SUCTRIE, VLA and SET

Conclusions

We proposed a novel method of extracting the underlying features characterizing overall drug-protein interactions, which we call “drug-protein interaction signatures”. We extracted a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules, and argued that the extracted drug-protein interaction signatures were biologically meaningful. Our proposed method is original in its space-efficient representations for high-dimensional fingerprints of drug-protein pairs, in its characterization of a large-scale drug-protein interaction network with various features in an integrative framework, and in the interpretability of the extracted feature associations.

Our proposed method is expected to be useful for various applications in drug discovery. A limitation of the method is that it cannot extract the associations between different attributes of drugs or between different attributes of proteins. For example, it cannot detect the associations between drug-chemical substructures and adverse drug reactions or the associations between protein domains and biological pathways. Extending the method to analyze such more complicated associations is important future work.

Abbreviations

ADRs:

Adverse drug reactions

AERS:

Adverse event reporting system

AUC score:

Area under the ROC curve

AUPR score:

Area under the PR curve

FDA:

Food and drug administration

KCF-S:

KEGG chemical function and substructures

KEGG:

Kyoto encyclopedia of genes and genomes

L-BFGS:

Limited-memory Broyden-Fletcher-Goldfarb-Shanno (a limited-memory quasi-Newton method)

OWL-QN:

Orthant-wise limited-memory quasi-Newton

PR:

Precision-recall

ROC curve:

Receiver operating characteristic curve

References

  1. Whitebread S, Hamon J, Bojanic D, Urban L. Keynote review: In vitro safety pharmacology profiling: an essential tool for successful drug development. Drug Discov Today. 2005; 10(21):1421–33.

  2. Chong CR, Sullivan DJ. New uses for old drugs. Nature. 2007; 448:645–6.

  3. Faulon JL, Misra M, Martin S, Sale K, Sapra R. Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics. 2008; 24:225–33.

  4. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008; 24:232–40.

  5. Jacob L, Vert JP. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008; 24:2149–56.

  6. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB, Whaley R, Glennon RA, Hert J, Thomas KL, Edwards DD, Shoichet BK, Roth BL. Predicting new molecular targets for known drugs. Nature. 2009; 462(7270):175–81.

  7. Yabuuchi H, Niijima S, Takematsu H, Ida T, Hirokawa T, Hara T, Ogawa T, Minowa Y, Tsujimoto G, Okuno Y. Analysis of multiple compound-protein interactions reveals novel bioactive molecules. Mol Syst Biol. 2011; 7:472.

  8. Lounkine E, Keiser MJ, Whitebread S, Mikhailov D, Hamon J, Jenkins JL, Lavan P, Weber E, Doak AK, Côté S, et al. Large-scale prediction and testing of drug activity on side-effect targets. Nature. 2012; 486(7403):361–7.

  9. Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008; 321(5886):263–6.

  10. Yamanishi Y, Kotera M, Kanehisa M, Goto S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics. 2010; 26(12):246–54.

  11. Atias N, Sharan R. An algorithmic framework for predicting side-effects of drugs. J Comput Biol. 2011; 18(3):207–18.

  12. Takarabe M, Kotera M, Nishimura Y, Goto S, Yamanishi Y. Drug target prediction using adverse event report systems: a pharmacogenomic approach. Bioinformatics. 2012; 28:611–8.

  13. Takigawa I, Tsuda K, Mamitsuka H. Mining Significant Substructure Pairs for Interpreting Polypharmacology in Drug-Target Network. PLoS ONE. 2011; 6:16999.

  14. Yamanishi Y, Pauwels E, Saigo H, Stoven V. Extracting Sets of Chemical Substructures and Protein Domains Governing Drug-Target Interactions. J Chem Inf Model. 2011; 51:1183–94.

  15. Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y. Identification of chemogenomic features from drug-target interaction networks using interpretable classifiers. Bioinformatics. 2012; 28(18):487–94. https://doi.org/10.1093/bioinformatics/bts412.

  16. Iwata H, Mizutani S, Tabei Y, Kotera M, Goto S, Yamanishi Y. Inferring protein domains associated with drug side effects based on drug-target interaction network. BMC Syst Biol. 2013; 7(Suppl 6):18. https://doi.org/10.1186/1752-0509-7-S6-S18.

  17. Mizutani S, Pauwels E, Stoven V, Goto S, Yamanishi Y. Relating drug–protein interaction network with drug side effects. Bioinformatics. 2012; 28(18):522–8.

  18. Kuhn M, Al Banchaabouchi M, Campillos M, Jensen LJ, Gross C, Gavin A-C, Bork P. Systematic identification of proteins that elicit drug side effects. Mol Syst Biol. 2013; 9(1).

  19. Gaulton A, Bellis L, Bento A, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington J. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012; 40:1100–7.

  20. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2013; 40:109–14.

  21. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014; 42:1091–7.

  22. Roth BL, Lopez E, Patel S, Kroeze WK. The multiplicity of serotonin receptors: Uselessly diverse molecules or an embarrassment of riches? Neuroscientist. 2000; 6:252–62.

  23. Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales E, Gewiess A, Jensen L, Schneider R, Skoblo R, Russell R, Bourne P, Bork P, Preissner R. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008; 36:919–22.

  24. Kotera M, Tabei Y, Yamanishi Y, Moriya Y, Tokimatsu T, Kanehisa M, Goto S. KCF-S: KEGG Chemical Function and Substructure for improved interpretability and prediction in chemical bioinformatics. BMC Syst Biol. 2013; 7(Suppl 6):2.

  25. FDA. 2018. http://www.fda.gov/.

  26. Finn R, Tate J, Mistry J, Coggill P, Sammut J, Hotz H, Ceric G, Forslund K, Eddy S, Sonnhammer E, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2008; 36:281–8.

  27. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: A library for large linear classification. J Mach Learn Res. 2008; 9:1871–4.

  28. Andrew G, Gao J. Scalable training of L1-regularized log-linear models. In: Proceedings of the Twenty-Fourth International Conference on Machine Learning: 2007. p. 33–40.

  29. Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math Program. 1989; 45:503–28.

  30. Supplementary information. 2018. http://labo.bio.kyutech.ac.jp/~yamani/drugprotein/.

  31. Jacobson G. Succinct static data structures. PhD thesis, Carnegie Mellon University; 1989.

  32. Jacobson G. Space-efficient Static Trees and Graphs. In: Proceedings of the 30th Annual Symposium of Foundations of Computer Science: 1989. p. 549–54.

  33. Coussens LM, Werb Z. Inflammation and cancer. Nature. 2002; 420:860–7.

Acknowledgements

We thank anonymous reviewers for their thoughtful and constructive reviews.

Funding

Publication of this work was supported by JST PRESTO Grant Number JPMJPR15D8.

Availability of data and materials

All results and datasets are available at [30].

About this supplement

This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume-13-supplement-2.

Author information

Contributions

YT and YY designed research. YT designed the method. YT and MK performed the experiments. YT, MK and YY wrote the paper. All of the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Yasuo Tabei.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Tabei, Y., Kotera, M., Sawada, R. et al. Network-based characterization of drug-protein interaction signatures with a space-efficient approach. BMC Syst Biol 13 (Suppl 2), 39 (2019). https://doi.org/10.1186/s12918-019-0691-1

  • Published:

  • DOI: https://doi.org/10.1186/s12918-019-0691-1

Keywords