Network-based characterization of drug-protein interaction signatures with a space-efficient approach

Tabei, Yasuo; Kotera, Masaaki; Sawada, Ryusuke; Yamanishi, Yoshihiro

doi:10.1186/s12918-019-0691-1

Volume 13 Supplement 2

Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology

Research
Open access
Published: 05 April 2019

BMC Systems Biology volume 13, Article number: 39 (2019)

Abstract

Background

Characterizing drug-protein interaction networks in terms of biological features has become a challenging problem in pharmaceutical science on the way to a better understanding of polypharmacology.

Results

We present a novel method for systematically analyzing the underlying features characteristic of drug-protein interaction networks, which we call “drug-protein interaction signatures”. We develop a new efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by space-efficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers.

Conclusions

Our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue that these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions, and that they are expected to contribute to rational drug design.

Background

Target proteins of drug molecules are classified into a primary target and off-targets. The former is the desired target, whereas the latter could lead to adverse drug reactions [1] or unexpected beneficial effects in drug repositioning [2]. Therefore, comprehensive analysis of both primary targets and off-targets on a genome-wide scale is crucial in drug discovery. The in silico approach is expected to improve research productivity in this field.

Several computational methods have been presented for predicting drug-protein interactions (or compound-protein interactions) on a large scale from chemogenomic and pharmacogenomic viewpoints. The basic idea behind the chemogenomic approach is that chemically similar drugs are expected to interact with similar proteins, where the similarities of drugs and proteins are defined based on their chemical structures and amino acid sequences, respectively [3–8]. On the other hand, the key idea behind the pharmacogenomic approach is that phenotypically similar drugs are predicted to interact with similar proteins, on the basis of drug side effects and/or protein sequences [9–12]. However, previous predictive models are not easily interpretable, making it difficult to extract biological features characterizing drug-protein interactions and to gain insights into the theoretical basis of interactions.

The characterization of drug-protein interaction networks with biological characteristics has become a challenging problem in modern pharmaceutical science toward a better understanding of polypharmacology. It is hypothesized that polypharmacology involves various features of drugs and target proteins (e.g., chemical substructures, pharmacophores, functional sites, and pathways) and complicated associations between these heterogeneous features.

A variety of feature extraction methods have recently been proposed for automatically characterizing drug-protein interactions. A data mining method was proposed for extracting molecular substructure pairs appearing frequently in interacting drug-target pairs [13]. Machine learning methods with sparse statistical models were presented to associate protein domains with drug chemical substructures [14, 15] or with drug side effects [16]. The inference of proteins eliciting drug side effects has been reported by several groups [17, 18]. However, the scalability of these methods is very limited, and these studies were conducted from the perspective of either protein functional sites, drug chemical substructures or drug phenotypic effects. There is a strong and growing need to develop efficient and scalable methods for characterizing overall drug-protein interactions with many types of features of drugs and proteins at once.

We present a novel method for systematic analyses of the underlying features characteristic of drug-protein interaction networks, which we call “drug-protein interaction signatures”. We develop a new efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by space-efficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers. In the results, our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue that these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions. To the best of our knowledge, this is the first report on characterizing a large-scale drug-protein interaction network with various biological features of drugs and proteins in an integrative framework. The drug-protein interaction signatures comprehensively inferred with our method are expected to contribute to rational drug design.

Results

Drug-protein interactions

We obtained the information on drug-protein interactions from five databases: ChEMBL [19], KEGG [20], DrugBank [21], PDSP Ki [22], and Matador [23]. The number of unique drug-protein interactions in the merged dataset is 78,692. These interactions involve 2302 drugs and 2334 target proteins, and the number of all possible drug-protein pairs is 5,372,868. We used this dataset in our experiments.

Drug profiles

We described drug chemical structures by 17,017 chemical substructures using the KEGG Chemical Function and Substructures (KCF-S) descriptor [24]. We represented each drug by a 17,017-dimension binary vector where the presence or absence of each of the KCF-S substructures is coded as 1 or 0. The resulting vector is referred to as a chemical profile.

We obtained the information about adverse drug reactions (ADRs) from the public release of the adverse event reporting system (AERS) of the US Food and Drug Administration (FDA) [25]. We derived 2,904,050 reports from 2004 to 2010 and mapped the drug names to KEGG following a previous study [12]. Based on the resulting 10,543 ADRs, we represented each drug by a 10,543-dimension binary vector where the presence or absence of each ADR is coded as 1 or 0. The resulting vector is referred to as an ADR profile.

Finally, we constructed an integrative feature vector of each drug by concatenating the chemical and the ADR profiles into a single one. The dimension of the resulting feature vector of each drug was 27,560.
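As a minimal sketch of how such binary profiles can be assembled, the snippet below encodes one drug against two toy vocabularies and concatenates the results. The vocabularies, the example annotations, and the helper name `binary_profile` are hypothetical and chosen only for illustration; the actual profiles use 17,017 KCF-S substructures and 10,543 ADR terms.

```python
def binary_profile(present, vocabulary):
    """Return a 0/1 list with a 1 wherever a vocabulary item is present."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]

# Hypothetical toy vocabularies (stand-ins for KCF-S substructures and ADR terms).
substructures = ["sub_a", "sub_b", "sub_c", "sub_d"]
adrs = ["adr_x", "adr_y", "adr_z"]

# Hypothetical drug annotated with two substructures and one ADR.
chemical_profile = binary_profile({"sub_a", "sub_c"}, substructures)
adr_profile = binary_profile({"adr_y"}, adrs)

# The integrative drug feature vector is the concatenation of both profiles.
drug_profile = chemical_profile + adr_profile

# Dimension check for the sizes reported in the text.
FULL_DRUG_DIM = 17017 + 10543
```

Protein profiles can be built the same way from domain, pathway, and module vocabularies.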

Protein profiles

We obtained functional domains, biological pathways, and pathway modules (compactly clustered pathways) for proteins from the KEGG [20] and PFAM [26] databases.

Based on 2678 PFAM domains, we represented each protein by a 2678-dimension binary vector where the presence or absence of a functional domain is coded as 1 or 0. The resulting vector is referred to as a domain profile. Based on 270 KEGG pathway maps, we represented each protein by a 270-dimension binary vector where the presence or absence of the involvement in a biological pathway is coded as 1 or 0. The resulting vector is referred to as a pathway profile. Based on 107 KEGG pathway modules, we represented each protein by a 107-dimension binary vector where the presence or absence of the involvement in a pathway module is coded as 1 or 0. The resulting vector is referred to as a module profile.

Finally, we constructed an integrative feature vector of each protein by concatenating the domain, pathway, and module profiles into a single profile. The dimension of the resulting feature vector of each protein was 3055.

We address the problem of extracting features characterizing drug-protein interaction networks in the framework of supervised classification.

Linear model for drug-protein pairs

Let C be a drug (or a drug candidate compound) and let P be a target protein (or a target candidate protein). We represent a drug-protein pair (C,P) as a high-dimensional feature vector Φ(C,P) and present a linear function, f(C,P)=w^TΦ(C,P), whose output is used to predict whether (C,P) is an interacting pair or not. The weight vector w is estimated such that each drug-protein pair in the training set is correctly classified into the interaction class (positive class) or the non-interaction class (negative class).

An advantage of the linear model is that one can interpret features effective for predictions from learned models. Since each element in Φ(C,P) corresponds to an element of w, effective features can be selected by extracting highly weighted features. However, the performance of the linear model depends heavily on the feature vector design.

We represent each drug-protein pair as a high-dimensional feature vector by taking the tensor product of a drug profile and a protein profile. The representation is similar to that in previous studies [15, 16]. The profile of C is defined as a D-dimension binary vector:

$$\Phi(C) = (c_{1},c_{2},...,c_{D})^{T}, $$

where c_i∈{0,1}, i=1,...,D. The profile of P is defined as a D^′-dimension binary vector: $\Phi (P)=(p_{1},p_{2},...,p_{D^{\prime }})^{T}$, where p_j∈{0,1}, j=1,...,D^′. We compute the tensor product between a drug profile Φ(C) and a protein profile Φ(P), and define a feature vector Φ(C,P) as follows:

$$\Phi(C,P) = (c_{1}p_{1},c_{1}p_{2},...,c_{1}p_{D^{\prime}},c_{2}p_{1},...,c_{D}p_{1},...,c_{D}p_{D^{\prime}})^{T}, $$

where Φ(C,P) is composed of all possible products between elements in Φ(C) and those in Φ(P). The resulting feature vector is a D×D^′-dimension binary vector, i.e., a fingerprint, that encodes cross-integrated biological features. This is referred to as a “tensor-product fingerprint”.

In this study, Φ(C) was a 27,560-dimension binary vector, and Φ(P) was a 3055-dimension binary vector. Thus, the tensor-product fingerprint Φ(C,P) of each drug-protein pair is an 84,195,800-dimension binary vector.
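Because both profiles are binary, the tensor-product fingerprint never has to be materialized as a dense vector: only the positions where both a drug feature and a protein feature are 1 need to be listed. The sketch below illustrates this under two assumptions that are not stated in the text: indices are 0-based, and the pair (i, j) is mapped to dimension d = i·D^′ + j, which follows the ordering of the formula above. The function names and the toy weights are hypothetical.

```python
def tensor_fingerprint_items(drug_profile, protein_profile):
    """Return sorted 0-based indices d = i * D' + j where c_i * p_j = 1."""
    d_prime = len(protein_profile)
    drug_on = [i for i, c in enumerate(drug_profile) if c == 1]
    prot_on = [j for j, p in enumerate(protein_profile) if p == 1]
    return [i * d_prime + j for i in drug_on for j in prot_on]

def linear_score(weights, items):
    """f(C, P) = w^T Phi(C, P); for a binary fingerprint this is a sum over active items."""
    return sum(weights.get(d, 0.0) for d in items)

drug = [1, 0, 1]          # toy drug profile, D = 3
protein = [0, 1, 1, 0]    # toy protein profile, D' = 4
items = tensor_fingerprint_items(drug, protein)

weights = {1: 0.5, 10: -0.2}   # hypothetical sparse weight vector
score = linear_score(weights, items)

# Dimension of the full fingerprint for the profile sizes in this study.
FULL_DIM = 27560 * 3055
```

For the toy input, the active dimensions are those where one of the two active drug features meets one of the two active protein features, so four of the twelve dimensions are set.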

A simpler way of representing each drug-protein pair is to concatenate Φ(C) and Φ(P) into a single feature vector as Φ(C,P)=(Φ(C)^T,Φ(P)^T)^T [7]. This feature vector is referred to as a “concatenated fingerprint”. However, a linear model on it cannot capture associations between drug features and protein features.
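To make the missing cross terms explicit, the two linear models can be written out side by side. This is a short worked derivation rather than a formula from the original text; the symbols w^{(C)}, w^{(P)} and w_{ij} are notation introduced here for illustration only. With the concatenated fingerprint, the score splits into a drug-only part and a protein-only part,

$$ f_{\text{concat}}(C,P) = \sum_{i=1}^{D} w^{(C)}_{i} c_{i} + \sum_{j=1}^{D^{\prime}} w^{(P)}_{j} p_{j}, $$

whereas with the tensor-product fingerprint every pair of a drug feature and a protein feature has its own weight,

$$ f_{\text{tensor}}(C,P) = \sum_{i=1}^{D}\sum_{j=1}^{D^{\prime}} w_{ij}\, c_{i} p_{j}. $$

The first model has D+D^′ weights and no term that depends on c_i and p_j jointly; the second has D×D^′ weights, one per candidate association.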

Logistic regression

We apply logistic regression to train the weight vector in the linear model and introduce L₁-regularization to prevent over-fitting. The L₁-regularization induces sparsity in the weight vector and drives most of the weights corresponding to unimportant features to zero, which makes it easier to interpret the model and extract features.

Minimizing the logistic loss with L₁-regularization for a large amount of high-dimensional data is difficult, but several efficient algorithms have recently been proposed. To the best of our knowledge, LIBLINEAR [27] is the most efficient and high-performance algorithm, but it requires a huge amount of memory for extremely high-dimensional data. In fact, it was not computationally feasible for our dataset in this study because of the memory problem (see the “Results” section). To overcome this difficulty, we introduce a gradient-based method.

Given a collection of drug-protein pairs and their labels (Φ(C_i,P_j),y_ij) where y_ij∈{+1,−1}(i=1,...,n,j=1,...,m), the logistic loss is defined as

$$LR(\mathbf{w}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} \log \left(1+\exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right) \right). $$

The logistic loss with L₁-regularization is defined as

$$ L_{1}\text{-}LR(\mathbf{w}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} \log \left(1+\exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right) \right) + C\|\mathbf{w}\|_{1}, $$

where ∥w∥₁ is the L₁ norm (the sum of the absolute values of the elements of w) and C is a regularization parameter (not to be confused with the drug C).
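The objective above can be evaluated directly from the set of active dimensions of each fingerprint. The following is a minimal sketch, assuming each training example is given as a list of active dimensions plus a label in {+1, −1} and the weight vector is stored sparsely as a dictionary; the names `l1_logistic_loss`, `log1p_exp`, and `reg` (standing for the regularization parameter C) are hypothetical.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def l1_logistic_loss(weights, examples, reg):
    """Sum of logistic losses over (items, label) pairs plus reg * ||w||_1.

    weights:  dict mapping dimension -> weight (absent dimensions are 0)
    examples: list of (active_items, y) with y in {+1, -1}
    reg:      regularization parameter (C in the text)
    """
    loss = 0.0
    for items, y in examples:
        # For a binary fingerprint, w^T Phi is the sum of weights on active items.
        margin = y * sum(weights.get(d, 0.0) for d in items)
        loss += log1p_exp(-margin)
    return loss + reg * sum(abs(w) for w in weights.values())

examples = [([0, 2], +1), ([1], -1)]
zero_loss = l1_logistic_loss({}, examples, reg=1.0)
```

With all weights at zero, every example contributes log 2 to the loss and the penalty term vanishes.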

Since L₁-LR(w) is a convex function, a weight vector w at which its gradient vanishes would be a global minimizer. However, the gradient of L₁-LR(w) cannot be computed everywhere, because the L₁ norm is non-differentiable at points where w_d=0. Instead, we compute the d-th dimensional gradient ∇_dLR(w) of LR(w) as follows:

$$ \nabla_{d} LR(\mathbf{w}) = \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} \frac{-y_{ij} \Phi_{d}(C_{i},P_{j}) \exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right)}{1+\exp \left(-y_{ij}\mathbf{w}^{T}\Phi(C_{i},P_{j}) \right)}, $$

where Φ_d(C_i,P_j) is the d-th dimensional value of Φ(C_i,P_j). We then compute the D×D^′-dimensional gradient vector $\nabla LR(\mathbf {w}) \in \mathbb {R}^{D\times D^{\prime }}$ as

$$\nabla LR(\mathbf{w}) = \left(\nabla_{1} LR(\mathbf{w}),\nabla_{2} LR(\mathbf{w}),...,\nabla_{D\times D^{\prime}} LR(\mathbf{w}) \right)^{T}. $$
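The gradient formula can be checked numerically on a toy problem. The sketch below is an illustration under the same assumptions as before (examples given as lists of active dimensions, sparse weights in a dictionary); it is not the authors' implementation, it omits overflow protection for very large margins, and all names are hypothetical. It uses the identity exp(−m)/(1+exp(−m)) = 1/(1+exp(m)).

```python
import math

def margin(weights, items, y):
    """y * w^T Phi for a binary fingerprint given by its active items."""
    return y * sum(weights.get(d, 0.0) for d in items)

def logistic_loss(weights, examples):
    return sum(math.log1p(math.exp(-margin(weights, it, y))) for it, y in examples)

def logistic_gradient(weights, examples):
    """Gradient of LR(w) as a sparse dict; only dimensions active in some example can be nonzero."""
    grad = {}
    for items, y in examples:
        # exp(-m) / (1 + exp(-m)) simplifies to 1 / (1 + exp(m)).
        coeff = -y / (1.0 + math.exp(margin(weights, items, y)))
        for d in items:  # Phi_d = 1 exactly on the active items
            grad[d] = grad.get(d, 0.0) + coeff
    return grad

examples = [([0, 2], +1), ([1, 2], -1), ([0], -1)]
w = {0: 0.3, 1: -0.1, 2: 0.7}
grad = logistic_gradient(w, examples)

# Central finite difference in dimension 2 as an independent check.
eps = 1e-6
w_plus, w_minus = dict(w), dict(w)
w_plus[2] += eps
w_minus[2] -= eps
numeric = (logistic_loss(w_plus, examples) - logistic_loss(w_minus, examples)) / (2 * eps)
```

Each example touches only its own active dimensions, so the cost of one gradient evaluation is proportional to the total number of 1 bits in the training fingerprints rather than to D×D^′.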

The use of ∇LR(w) enables the global minimum of L₁-LR(w), and thus the optimal w, to be found using an efficient gradient-based optimization algorithm called orthant-wise limited-memory quasi-Newton (OWL-QN) [28]. The L₁-regularized logistic regression methods with the proposed tensor-product fingerprint and with the concatenated fingerprint are referred to as L1LOG-tensor and L1LOG-concat, respectively.
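One ingredient that lets OWL-QN work with only ∇LR(w) is a pseudo-gradient that replaces the undefined derivative of the L₁ term at w_d=0 with the one-sided derivative that points downhill, if there is one. The sketch below shows only this ingredient, following the description in [28] as we understand it; it is not a full optimizer (there is no quasi-Newton direction, orthant projection, or line search), and the function name and the dense-list interface are assumptions made for illustration.

```python
def pseudo_gradient(weights, loss_grad, reg):
    """Orthant-wise pseudo-gradient of LR(w) + reg * ||w||_1 (dense lists)."""
    result = []
    for w, g in zip(weights, loss_grad):
        if w > 0:
            result.append(g + reg)
        elif w < 0:
            result.append(g - reg)
        else:
            # At w = 0 the one-sided derivatives are g - reg (left) and g + reg (right).
            if g + reg < 0:
                result.append(g + reg)   # increasing w decreases the objective
            elif g - reg > 0:
                result.append(g - reg)   # decreasing w decreases the objective
            else:
                result.append(0.0)       # zero is locally optimal in this coordinate
    return result

pg = pseudo_gradient(
    weights=[0.0, 0.0, 0.0, 1.0, -1.0],
    loss_grad=[-2.0, 2.0, 0.5, 0.5, 0.5],
    reg=1.0,
)
```

The third coordinate illustrates the source of sparsity: when the magnitude of the loss gradient at zero does not exceed the regularization parameter, the pseudo-gradient is zero and the weight stays at zero.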

For comparison, we also trained models with L₂-regularized logistic regression using a gradient-based algorithm, the limited-memory quasi-Newton method (L-BFGS) [29]. The L₂-regularized logistic regression methods with the tensor-product fingerprint and with the concatenated fingerprint are referred to as L2LOG-tensor and L2LOG-concat, respectively.

Space-efficient representation of drug-protein pairs

Compact representation of drug-protein pairs is crucial for training linear models in memory. A standard choice is the set representation, whose items are the dimensions of the fingerprint in which the bit is 1. However, this still consumes a huge amount of memory, resulting in limited scalability for extremely high-dimensional data. To overcome this memory problem, we constructed two space-efficient representations of fingerprints. We present a brief overview of these representations (further details are given in the supplemental material [30]).

Figure 1 illustrates the construction of our two representations. We first represent each fingerprint Φ(C_i,P_j) as a set S(C_i,P_j)={d|Φ_d(C_i,P_j)=1,d=1,...,D×D^′} that contains the dimensions of Φ(C_i,P_j) in which the bit is 1. We refer to this set representation of fingerprints as SET. To keep each stored item small, we then compute the difference between the k-th item S(C_i,P_j)[k] and the (k−1)-th item S(C_i,P_j)[k−1] as (S(C_i,P_j)[k]−S(C_i,P_j)[k−1]) and keep the results in a new set S^′(C_i,P_j). We can recover S(C_i,P_j) by cumulatively adding the items in S^′(C_i,P_j).
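The difference step and its inverse can be sketched in a few lines. The snippet assumes the items are sorted in increasing order and that the first item is stored as its difference from 0; the text does not specify how the first item is handled, so that choice and the function names are assumptions.

```python
from itertools import accumulate

def to_gaps(items):
    """Replace each sorted item by its difference from the previous one (first item vs. 0)."""
    return [cur - prev for prev, cur in zip([0] + items[:-1], items)]

def from_gaps(gaps):
    """Recover the original items by cumulatively adding the gaps."""
    return list(accumulate(gaps))

items = [3, 4, 9, 10, 1000]
gaps = to_gaps(items)
restored = from_gaps(gaps)
```

The gaps are never larger than the original items, so they can be stored with fewer bits, which is what the two representations described next exploit.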

We constructed our two space-efficient representations of fingerprints by leveraging the idea behind succinct data structures, which represent data compactly while still supporting fast operations. The first one is a variable-length array for compactly representing fingerprints. S^′(C_i,P_j) is represented by two bit strings R_ij and P_ij, which are indexed by a rank/select dictionary, i.e., a succinct data structure for bit strings. We can randomly access any element in S^′(C_i,P_j) in O(1) time by using fast operations in the rank/select dictionary [31]. We refer to this variable-length array representation of fingerprints as VLA.
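The general idea of a variable-length array backed by two bit strings can be illustrated with a toy class. This is a sketch under stated assumptions, not the data structure from the supplemental material: here one bit string holds the minimal binary codes of the values back to back, the other marks where each code starts, and the select operation is a naive linear scan instead of the O(1) operation of a real rank/select dictionary. The class and attribute names are hypothetical, and Python strings stand in for packed bits.

```python
class VariableLengthArray:
    """Toy variable-length array: values stored back to back using only as many bits as needed."""

    def __init__(self, values):
        codes = [format(v, "b") for v in values]   # minimal binary code per value
        self.bits = "".join(codes)                  # concatenated codes
        # Marker string: 1 at the first bit of every code, 0 elsewhere.
        self.starts = "".join("1" + "0" * (len(c) - 1) for c in codes)
        self.size = len(codes)

    def _select1(self, k):
        """Position of the k-th 1 (0-based) in the marker string.

        Naive linear scan; a real rank/select dictionary answers this in O(1).
        """
        seen = -1
        for pos, bit in enumerate(self.starts):
            if bit == "1":
                seen += 1
                if seen == k:
                    return pos
        raise IndexError(k)

    def __getitem__(self, k):
        if not 0 <= k < self.size:
            raise IndexError(k)
        begin = self._select1(k)
        end = self._select1(k + 1) if k + 1 < self.size else len(self.bits)
        return int(self.bits[begin:end], 2)

vla = VariableLengthArray([3, 1, 5, 1, 990])
values = [vla[k] for k in range(vla.size)]
```

In this example the five gaps occupy 17 code bits plus 17 marker bits, compared with 160 bits if each were stored as a 32-bit integer.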

The second one is a type of succinct trie for representing fingerprints. The trie is a data structure for strings, and it is also practical for representing fingerprints. A standard pointer-based implementation of a trie consumes a huge amount of memory, resulting in limited scalability. Alternatively, we present a compact representation of the trie by using a succinct data structure called LOUDS [32]. We can recover the original fingerprints by traversing the succinct trie in a depth-first manner. We refer to this succinct trie representation of fingerprints as SUCTRIE.
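The following sketch illustrates the two ideas in this paragraph on a toy input: fingerprints that share a prefix of items share a path in a trie, and the shape of that trie can be written as a LOUDS bit string (a super-root "10", then, for each node in level order, one 1 per child followed by a 0). It is an illustration only. The nested-dict trie, the "$" end marker stored as an extra leaf, and the function names are assumptions, and a usable LOUDS trie would also store the edge labels in level order and navigate with rank/select, both of which are omitted here.

```python
from collections import deque

def build_trie(item_lists):
    """Pointer-style trie as nested dicts; '$' marks the end of a stored fingerprint."""
    root = {}
    for items in item_lists:
        node = root
        for item in items:
            node = node.setdefault(item, {})
        node["$"] = {}
    return root

def enumerate_trie(node, prefix=()):
    """Recover all stored item lists by depth-first traversal."""
    out = []
    for key, child in node.items():
        if key == "$":
            out.append(list(prefix))
        else:
            out.extend(enumerate_trie(child, prefix + (key,)))
    return out

def louds_bits(root):
    """LOUDS bit string: super-root '10', then per node in level order one '1' per child and a '0'."""
    bits = ["10"]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = list(node.values())
        bits.append("1" * len(children) + "0")
        queue.extend(children)
    return "".join(bits)

fingerprints = [[1, 2, 9], [1, 2, 10], [1, 5]]
trie = build_trie(fingerprints)
recovered = enumerate_trie(trie)
bits = louds_bits(trie)
```

The toy trie has 9 nodes (including the root and three end markers), so its shape takes 2·9+1 = 19 bits, whereas a pointer-based layout needs at least one machine word per edge.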

Extraction of drug-protein interaction signatures

We applied the proposed method (L1LOG-tensor) to extract drug-protein interaction signatures from drug profiles (chemical substructures and adverse drug reactions) and protein profiles (protein domains, biological pathways, and pathway modules), based on a large-scale drug-protein interaction network. Each signature is the association between a drug feature and protein feature, where two features in the same signature are thought of as being associated in terms of drug-protein interactions. The results of all extracted drug-protein interaction signatures are presented in the supplemental material [30].

L1LOG-tensor extracted 105,684 signatures, while L2LOG-tensor extracted 7,843,218 signatures. Note that the number of all possible combinations of drug features and protein features is 84,195,800. The number of signatures from our L1LOG-tensor method was much smaller than that of L2LOG-tensor, due to the sparsity induced by L₁-regularization. This makes it easier to analyze the extracted drug-protein interaction signatures for biological interpretation, so we focus on the results from L1LOG-tensor below.
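Reading signatures off a trained model amounts to mapping each nonzero weight index back to its pair of features. The sketch below assumes 0-based indices laid out as d = i·D^′ + j (consistent with the tensor-product ordering, but an assumption about the implementation); the feature names and weights are made up for illustration.

```python
def decode_signatures(weights, n_protein_features, drug_names, protein_names):
    """Map each nonzero weight index d = i * D' + j back to a (drug feature, protein feature) pair."""
    signatures = []
    for d, w in weights.items():
        if w != 0.0:
            i, j = divmod(d, n_protein_features)
            signatures.append((drug_names[i], protein_names[j], w))
    # Strongest associations first.
    return sorted(signatures, key=lambda s: abs(s[2]), reverse=True)

drug_names = ["sub_a", "sub_b", "adr_x"]                 # hypothetical drug features
protein_names = ["dom_1", "path_1", "path_2", "mod_1"]   # hypothetical protein features
weights = {1: 0.5, 10: -0.9, 6: 0.0}                     # hypothetical learned weights
signatures = decode_signatures(weights, len(protein_names), drug_names, protein_names)

# Fraction of candidate associations kept by L1LOG-tensor, from the counts in the text.
l1_fraction = 105684 / (27560 * 3055)
```

With the reported counts, L1LOG-tensor retains roughly 0.13% of the 27,560 × 3055 candidate associations.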

Figure 2 shows a network representation of some of the drug-protein signatures extracted with L1LOG-tensor, displaying highly weighted associations among five types of drug or protein features: drug chemical substructures (blue), adverse drug reactions (red), biological pathways (green), pathway modules (yellow), and protein domains (gray). Only selected results are shown due to space limitations. The inferred signature association network provides clues about the important features behind the drug-protein interaction network. To the best of our knowledge, no previous study has inferred these associations.

Biological interpretation of the extracted signatures

We constructed biological interpretations for the drug-protein interaction signatures extracted with L1LOG-tensor. We give only a few examples due to space limitations. The results for all analyzed signatures and the figures/tables are presented in the supplemental material [30].

Figure 3 shows an extracted signature representing the association between a drug-chemical substructure (SKELETON C1b(N1d)-C1b(O7a) in the KCF-S format) and a biological pathway (hsa04080 Neuroactive ligand-receptor interaction), where the vertical axis on the heat map (a) shows all drugs sharing the extracted substructure, and the horizontal axis shows all proteins sharing the extracted pathway. The extracted drug-chemical substructures on the associated drug structures (b) are in pink. Drugs and proteins in known interacting pairs tend to have such extracted features in the same signature. For example, Propantheline bromide (D00481), Methanthelinium bromide (D00721), Acetylcholine chloride (D00999), Carbachol (D00524), Succinylcholine chloride (D00766), and Suxamethonium chloride (D02275) share a choline skeleton, and all are known to act on acetylcholine receptors. However, there are many other drugs sharing the extracted drug feature and proteins sharing the extracted protein feature for which no drug-protein interactions are known. Thus, it may be possible to predict previously unknown interactions between drugs and proteins through the extracted features in the signatures. See Table 1 and Fig. 4 for details.

Table 1 Association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z(C5a+O7a)-C1z(C1a)” and KEGG pathway “hsa04080 Neuroactive ligand-receptor interaction”. See also Fig. 4

Table 2 shows an extracted signature representing the association between an ADR (R01631 Graft-versus-host disease) and a protein domain (PF14446 Prokaryotic RING finger family 1), where all drugs sharing the extracted ADR and all proteins sharing the extracted protein domain are also shown. Interestingly, most drugs sharing the ADR (R01631 Graft-versus-host disease) were related to inflammation, immunosuppression, and cancer, which is consistent with the recently expanded concept that inflammation is a critical component of cancer progression [33]. See Fig. 5 and Table 3 for details.

Table 2 Example of drug-protein interaction signature: association between adverse drug reaction (ADR) (R01631 Graft-versus-host disease) and protein domain (PF14446 Prokaryotic RING finger family 1)

Table 3 The association between KCF-S “RING C1x-C1x-C1y(C1z)-C1y(C1x)-C1y(C1x)-C1z(C1a+C1y)” and KEGG pathway module “hsa_M00110 C19/C18-Steroid hormone biosynthesis”

Figure 4 shows an extracted signature representing the association between a drug-chemical substructure (RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z(C5a+O7a)-C1z(C1a) in the KCF-S format) and a biological pathway (hsa04080 Neuroactive ligand-receptor interaction). Megestrol acetate (D00952), Cyproterone acetate (D01368), and Chlormadinone acetate (D01299) share common ring structures. All these drugs are known to act on many neuroactive ligand-receptors. See Table 1 for details.

Figure 6 shows an extracted signature representing the association between a drug-chemical substructure (SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x in the KCF-S format) and a biological pathway (hsa03050 Proteasome). Proteasome inhibitors have been applied to the treatment of cancer, especially multiple myeloma. The substructure “SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x” corresponds to a phenylalanine residue, which is captured as a characteristic substructure in the known proteasome inhibitors Bortezomib (D03150) and Carfilzomib (D08880). See Table 4 for details.

Table 4 The association between KCF-S “SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x” and KEGG pathway “hsa03050 Proteasome”

Performance evaluation on generalization property

If the extracted signatures are biologically meaningful in terms of drug-protein interactions, models built on them should generalize well when predicting drug-protein interactions.

We tested five feature extraction methods, L1LOG-tensor, L2LOG-tensor, L1LOG-concat, L2LOG-concat, and L1LOG-LIBLINEAR-tensor, for their ability to reconstruct known drug-protein interactions. As mentioned above, L1LOG-tensor is our proposed method. The others are previous methods based on current algorithms or conventional fingerprints (see the Logistic regression section for further details). L1LOG-tensor and L2LOG-tensor use tensor-product fingerprints represented by our space-efficient algorithm. L1LOG-concat and L2LOG-concat use previous concatenated fingerprints [7] represented by the LIBLINEAR algorithm [27]. L1LOG-LIBLINEAR-tensor is a method [15, 16] which uses the tensor-product fingerprints represented by the LIBLINEAR algorithm [27].

We conducted the following 5-fold cross-validation in a pair-wise manner. We first randomly divided all drug-protein pairs in the gold standard set into five subsets. Next, we considered four of the subsets as a training set and the remaining subset as a test set. We learned a predictive model on the drug-protein pairs in the training set. Finally, we applied the predictive model to the drug-protein pairs in the test set.
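The splitting procedure can be sketched as follows. This is a minimal illustration with toy integer identifiers for drugs and proteins; the function name, the fixed seed, and the round-robin assignment of shuffled pairs to folds are assumptions rather than details from the text.

```python
import random

def pairwise_folds(pairs, n_folds=5, seed=0):
    """Shuffle the pairs and yield (train, test) splits, one per fold."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [p for m, fold in enumerate(folds) if m != k for p in fold]
        yield train, folds[k]

# Toy gold standard: every combination of 4 drugs and 5 proteins.
pairs = [(drug, protein) for drug in range(4) for protein in range(5)]
splits = list(pairwise_folds(pairs))
```

Because the split is made over pairs, a drug or protein that appears in a test pair usually also appears in some training pair.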

We used the receiver operating characteristic curve (ROC curve), which is defined as a plot of true positive rates against false positive rates based on various thresholds, and the precision-recall curve (PR curve), which is defined as a plot of precision (positive predictive value) against recall (sensitivity) based on various thresholds, as evaluation measures for prediction performance.

We computed the area under the ROC curve (AUC score) and the area under the PR curve (AUPR score). The parameters involved in each method (e.g., the regularization parameter) were tuned with AUC and AUPR as the objective functions.
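Both scores can be computed from a list of prediction scores and true labels. The sketch below is one common way to do so and is not necessarily how the scores in this study were calculated: the AUC is obtained as the fraction of positive-negative pairs ranked correctly, and the AUPR is estimated by average precision, assuming no tied scores. The quadratic-time AUC is for illustration only and would be too slow for millions of pairs.

```python
def auc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr_score(scores, labels):
    """Area under the PR curve estimated as average precision (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank   # precision at the rank of each positive
    return total / hits

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, -1, 1, -1, -1]
auc = auc_score(scores, labels)
aupr = aupr_score(scores, labels)
```

In this toy example five of the six positive-negative pairs are ordered correctly, and the precisions at the two positives are 1 and 2/3.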

Figure 7 shows the AUC and AUPR scores in the pair-wise cross-validation, where the number of negative pairs in the training set was varied from the number of positive examples up to all possible negative examples in the training set. We observed that the prediction accuracy of the trained models generally improved as the number of negative examples in the training set increased. This suggests that using all possible negative examples for learning a predictive model will enhance prediction reliability. L1LOG-tensor performed the best.

L1LOG-LIBLINEAR-tensor did not perform well with an increasing number of negative examples in the training set because of the memory storage problem. The learning process with the LIBLINEAR algorithm consumed all the memory of our machine, which has 128 GB of memory. In contrast, the other four methods with our space-efficient algorithm were able to finish the training process. This suggests that our space-efficient algorithm is more suitable and powerful for learning a predictive model on extremely high-dimensional data.

L1LOG-tensor and L2LOG-tensor performed better than L1LOG-concat and L2LOG-concat, which suggests that the tensor-product fingerprint can capture relevant information for drug-protein interaction prediction. On the other hand, the concatenated fingerprint cannot capture enough information, even though its computation is faster.

Table 5 shows the AUC score, AUPR score, training time, and consumed memory in the pair-wise cross-validation. L1LOG-tensor and L2LOG-tensor consumed 24 GB for learning predictive models on all possible drug-protein pairs, which suggests their applicability to large-scale drug-protein interaction prediction. They also took about 24 hours, which can be considered reasonable on a practical level, though they were slower than L1LOG-concat and L2LOG-concat.

Table 5 AUC score, AUPR score, training time in seconds, and consumed memory in megabytes in the pair-wise cross validation experiments

In the pair-wise cross-validation, drugs and proteins in test pairs often overlap with those in the training set. We therefore conducted a different 5-fold cross-validation, which we call “block-wise cross-validation”, in which the drugs and proteins in the test pairs do not overlap with those in the training set. The results of this block-wise cross-validation are shown in Fig. 8 and Table 6. The same tendency as in the pair-wise cross-validation was seen in the block-wise cross-validation. However, the AUC and AUPR scores in the block-wise cross-validation were much lower than those in the pair-wise cross-validation. The results indicate that, in practice, predicting unknown interactions for new drug candidates (without known targets) and orphan proteins (without known ligands) is much more difficult than detecting missing interactions between drugs with known targets and proteins with known ligands.
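One way to realize a split without drug or protein overlap is sketched below. The text does not spell out the exact construction, so this is an assumption: drugs and proteins are each divided into five groups, the test set of fold k contains pairs whose drug and protein both belong to group k, the training set contains pairs whose drug and protein both lie outside group k, and the remaining mixed pairs are left out of that fold. All names are hypothetical.

```python
import random

def blockwise_folds(drugs, proteins, n_folds=5, seed=0):
    """Yield (train, test) pair lists such that test drugs and proteins never occur in train."""
    rng = random.Random(seed)
    drugs, proteins = list(drugs), list(proteins)
    rng.shuffle(drugs)
    rng.shuffle(proteins)
    drug_fold = {d: k % n_folds for k, d in enumerate(drugs)}
    protein_fold = {p: k % n_folds for k, p in enumerate(proteins)}
    for k in range(n_folds):
        test = [(d, p) for d in drugs for p in proteins
                if drug_fold[d] == k and protein_fold[p] == k]
        train = [(d, p) for d in drugs for p in proteins
                 if drug_fold[d] != k and protein_fold[p] != k]
        yield train, test

# Toy identifiers: 10 drugs and 10 proteins.
block_splits = list(blockwise_folds(range(10), range(10)))
```

Under this construction a model is always evaluated on drugs and proteins it has never seen during training, which mirrors the new-drug and orphan-protein scenario discussed above.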

Table 6 AUC score, AUPR score, training time in seconds, and consumed memory in megabytes in the block-wise cross validation experiments

Finally, we compared the space efficiency of the SUCTRIE, VLA, and SET fingerprint representations. Note that SET is a standard representation, and SUCTRIE and VLA are those constructed with our proposed method. Figure 9 shows a plot of the consumed memory against the number of fingerprints. SET uses a large amount of memory for storing all possible fingerprints. In fact, it consumed 57 GB for storing all possible drug-protein pairs in our dataset, which limits its practical usage. In contrast, our proposed representations SUCTRIE and VLA are more space-efficient than SET. The consumed memory of SUCTRIE was slightly smaller than that of VLA. SUCTRIE and VLA consumed 16 and 20 GB, respectively, for storing all possible drug-protein pairs, suggesting the usefulness of SUCTRIE and VLA. In fact, we would not have been able to conduct all the analyses for this study without SUCTRIE.

Conclusions

We proposed a novel method of extracting the underlying features characterizing overall drug-protein interactions, which we call “drug-protein interaction signatures”. We extracted a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules, and argued that the extracted drug-protein interaction signatures were biologically meaningful. Our proposed method is original in its space-efficient representations for high-dimensional fingerprints of drug-protein pairs, in its characterization of a large-scale drug-protein interaction network with various features in an integrative framework, and in the interpretability of the extracted feature associations.

Our proposed method will be useful for various applications in drug discovery. A limitation of the method is that it cannot extract the associations between different attributes of drugs or between different attributes of proteins. For example, it cannot detect the associations between drug-chemical substructures and adverse drug reactions or the associations between protein domains and biological pathways. Extending the method to analyze such more complicated associations is important future work.

Abbreviations

ADRs: Adverse drug reactions
AERS: Adverse event reporting system
AUC score: Area under the ROC curve
AUPR score: Area under the PR curve
FDA: Food and Drug Administration
KCF-S: KEGG chemical function and substructures
KEGG: Kyoto encyclopedia of genes and genomes
L-BFGS: Limited-memory quasi-Newton
OWL-QN: Orthant-wise limited-memory quasi-Newton
PR: Precision-recall
ROC curve: Receiver operating characteristic curve

References

Whitebread S, Hamon J, Bojanic D, Urban L. Keynote review: In vitro safety pharmacology profiling: an essential tool for successful drug development. Drug Discov Today. 2005; 10(21):1421–33.
Article CAS Google Scholar
Chong CR, Sullivan DJ. New uses for old drugs. Nature. 2007; 448:645–6.
Article CAS Google Scholar
Faulon JL, Misra M, Martin S, Sale K, Sapra R. Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics. 2008; 24:225–33.
Article CAS Google Scholar
Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008; 24:232–40.
Article Google Scholar
Jacob L, Vert JP. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008; 24:2149–56.
Article CAS Google Scholar
Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB, Whaley R, Glennon RA, Hert J, Thomas KL, Edwards DD, Shoichet BK, Roth BL. Predicting new molecular targets for known drugs. Nature. 2009; 462(7270):175–81.
Article CAS Google Scholar
Yabuuchi H, Niijima S, Takematsu H, Ida T, Hirokawa T, Hara T, Ogawa T, Minowa Y, Tsujimoto G, Okuno Y. Analysis of multiple compound-protein interactions reveals novel bioactive molecules. Mol Syst Biol. 2011; 7:472.
Article CAS Google Scholar
Lounkine E, Keiser MJ, Whitebread S, Mikhailov D, Hamon J, Jenkins JL, Lavan P, Weber E, Doak AK, Côté S, et al.Large-scale prediction and testing of drug activity on side-effect targets. Nature. 2012; 486(7403):361–7.
Article CAS Google Scholar
Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008; 321(5886):263–6.
Yamanishi Y, Kotera M, Kanehisa M, Goto S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics. 2010; 26(12):246–54.
Atias N, Sharan R. An algorithmic framework for predicting side-effects of drugs. J Comput Biol. 2011; 18(3):207–18.
Takarabe M, Kotera M, Nishimura Y, Goto S, Yamanishi Y. Drug target prediction using adverse event report systems: a pharmacogenomic approach. Bioinformatics. 2012; 28:611–8.
Takigawa I, Tsuda K, Mamitsuka H. Mining Significant Substructure Pairs for Interpreting Polypharmacology in Drug-Target Network. PLoS ONE. 2011; 6:16999.
Yamanishi Y, Pauwels E, Saigo H, Stoven V. Extracting Sets of Chemical Substructures and Protein Domains Governing Drug-Target Interactions. J Chem Inf Model. 2011; 51:1183–94.
Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y. Identification of chemogenomic features from drug-target interaction networks using interpretable classifiers. Bioinformatics. 2012; 28(18):487–94. https://doi.org/10.1093/bioinformatics/bts412.
Iwata H, Mizutani S, Tabei Y, Kotera M, Goto S, Yamanishi Y. Inferring protein domains associated with drug side effects based on drug-target interaction network. BMC Syst Biol. 2013; 7(Suppl 6):18. https://doi.org/10.1186/1752-0509-7-S6-S18.
Mizutani S, Pauwels E, Stoven V, Goto S, Yamanishi Y. Relating drug–protein interaction network with drug side effects. Bioinformatics. 2012; 28(18):522–8.
Kuhn M, Al Banchaabouchi M, Campillos M, Jensen LJ, Gross C, Gavin A-C, Bork P. Systematic identification of proteins that elicit drug side effects. Mol Syst Biol. 2013; 9(1).
Gaulton A, Bellis L, Bento A, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington J. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012; 40:1100–7.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2013; 40:109–14.
Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014; 42:1091–7.
Roth BL, Lopez E, Patel S, Kroeze WK. The multiplicity of serotonin receptors: Uselessly diverse molecules or an embarrassment of riches? Neuroscientist. 2000; 6:252–62.
Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales E, Gewiess A, Jensen L, Schneider R, Skoblo R, Russell R, Bourne P, Bork P, Preissner R. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008; 36:919–22.
Kotera M, Tabei Y, Yamanishi Y, Moriya Y, Tokimatsu T, Kanehisa M, Goto S. KCF-S: KEGG Chemical Function and Substructure for improved interpretability and prediction in chemical bioinformatics. BMC Syst Biol. 2013; 7(Suppl 6):2.
FDA. 2018. http://www.fda.gov/.
Finn R, Tate J, Mistry J, Coggill P, Sammut J, Hotz H, Ceric G, Forslund K, Eddy S, Sonnhammer E, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2008; 36:281–8.
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: A library for large linear classification. J Mach Learn Res. 2008; 9:1871–4.
Andrew G, Gao J. Scalable training of L₁-regularized log-linear models. In: Proceedings of the Twenty-Fourth International Conference on Machine Learning: 2007. p. 33–40.
Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math Program. 1989; 45:503–28.
Supplementary information. 2018. http://labo.bio.kyutech.ac.jp/~yamani/drugprotein/.
Jacobson G. Succinct static data structures. PhD thesis, Carnegie Mellon University; 1989.
Jacobson G. Space-efficient Static Trees and Graphs. In: Proceedings of the 30th Annual Symposium of Foundations of Computer Science: 1989. p. 549–54.
Coussens LM, Werb Z. Inflammation and cancer. Nature. 2002; 420:860–7.

Acknowledgements

We thank the anonymous reviewers for their thoughtful and constructive reviews.

Funding

Publication of this work was supported by JST PRESTO Grant Number JPMJPR15D8.

Availability of data and materials

All results and datasets are available at [30].

About this supplement

This article has been published as part of BMC Systems Biology Volume 13 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume-13-supplement-2.

Author information

Authors and Affiliations

RIKEN Center for Advanced Intelligence Project, Nihonbashi 1-chome Mitsui Building, 15th floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, 103-0027, Japan
Yasuo Tabei
School of Engineering, Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
Masaaki Kotera
Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
Ryusuke Sawada & Yoshihiro Yamanishi
PRESTO, Japan Science and Technology Agency, Saitama, 332-0012, Japan
Yoshihiro Yamanishi

Authors

Yasuo Tabei
Masaaki Kotera
Ryusuke Sawada
Yoshihiro Yamanishi

Contributions

YT and YY designed the research. YT designed the method. YT and MK performed the experiments. YT, MK and YY wrote the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Yasuo Tabei.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Tabei, Y., Kotera, M., Sawada, R. et al. Network-based characterization of drug-protein interaction signatures with a space-efficient approach. BMC Syst Biol 13 (Suppl 2), 39 (2019). https://doi.org/10.1186/s12918-019-0691-1

Published: 05 April 2019
DOI: https://doi.org/10.1186/s12918-019-0691-1
