Scalable prediction of compound-protein interactions using minwise hashing
BMC Systems Biology volume 7, Article number: S3 (2013)
Abstract
The identification of compound-protein interactions plays a key role in drug development, in the discovery of new drug leads and new therapeutic protein targets. There is therefore a strong incentive to develop new efficient methods for predicting compound-protein interactions on a genome-wide scale. In this paper we develop a novel chemogenomic method to make a scalable prediction of compound-protein interactions from heterogeneous biological data using minwise hashing. The proposed method mainly consists of two steps: 1) construction of new compact fingerprints for compound-protein pairs by an improved minwise hashing algorithm, and 2) application of a sparsity-induced classifier to the compact fingerprints. We test the proposed method on its ability to make a large-scale prediction of compound-protein interactions from compound substructure fingerprints and protein domain fingerprints, and show superior performance of the proposed method compared with the previous chemogenomic methods in terms of prediction accuracy, computational efficiency, and interpretability of the predictive model. None of the previously developed methods is computationally feasible for the full dataset consisting of about 200 million compound-protein pairs. The proposed method is expected to be useful for virtual screening of a huge number of compounds against many protein targets.
Background
The identification of compound-protein interactions is an important part of drug development, in the discovery of new drug leads and new therapeutic protein targets. The completion of the human genome sequencing project has made it possible for us to analyze the genomic space of possible proteins coded in the human genome. At the same time, many efforts have also been devoted to the constitution of molecular databanks to explore the entire chemical space of possible compounds including synthesized molecules or natural molecules extracted from animals, plants, or microorganisms. However, there is little knowledge about the interactions between compounds and proteins. For example, the US PubChem database stores more than 30 million chemical compounds, but the number of compounds with information on their target proteins is very limited [1]. In this context, the importance of chemogenomics research, which investigates the relationship between the chemical space and the genomic space, has recently grown fast [2, 3]. A key issue in chemogenomics is computational prediction of compound-protein interactions on a genome-wide scale.
Recently, a variety of in silico chemogenomic approaches have been developed to predict compound-protein interactions or drug-target interactions, assuming that similar compounds are likely to interact with similar proteins. The state-of-the-art in the chemogenomic approach is to build the chemogenomic space of compound-protein pairs as the tensor product of the chemical space of compounds and the genomic space of proteins, and analyze compound-protein pairs by machine learning classifiers such as support vector machine (SVM) [4–8]. However, the input of the SVM method in most previous works is the pairwise kernel similarity matrix of compound-protein pairs, which makes it difficult to analyze large-scale data. For example, it is impossible to apply standard implementations such as LIBSVM [9] and SVM^{light} [10], because they require prohibitive computational time and the kernel matrix for compound-protein pairs is too large to construct explicitly in memory. None of the previous chemogenomic methods is suitable for scalable screening of millions or billions of compound-protein pairs.
Fingerprints are a powerful way to efficiently summarize information about various biomolecules (e.g., compounds, proteins) by encoding their molecular structures or physicochemical properties into finite-dimensional binary vectors. The fingerprint representation has a long history in chemoinformatics, and many 1D, 2D or 3D descriptors for molecules have been proposed [11] and adopted in many molecular databases such as PubChem [1] and ChemDB [12]. The fingerprints can be used for exploring the chemical space based on their Euclidean distance or Tanimoto coefficients, and can also be used as inputs of various machine learning classifiers to predict various biological activities of compounds [13]. The fingerprint representation is applicable to proteins as well [14, 15].
In this study we consider representing compound-protein pairs by fingerprints and using them as inputs of linear SVM, because the linear SVM provides us with interpretable predictive models and works well for super-high dimensional data [16]. A straightforward way is to represent each compound-protein pair by taking the tensor product of the compound fingerprint and the protein fingerprint, which enables biological interpretation of chemogenomic features (functional associations between compound substructures and protein domains) behind interacting compound-protein pairs [8]. However, the resulting fingerprint is sparse and super-high dimensional. Even worse, the total number of fingerprints is the product of the number of compounds and the number of proteins, so it is difficult to train classical linear SVM for extremely large-scale data. Although optimization techniques of linear SVM have recently advanced [17–20], they are not enough to analyze a huge number of compound-protein pairs in practice.
In this paper we develop a novel chemogenomic method to make a scalable prediction of compound-protein interactions from heterogeneous biological data using minwise hashing, which is applicable to virtual screening of a huge number of compounds against many human proteins. The proposed method mainly consists of two steps: 1) construction of new compact fingerprints for compound-protein pairs by an improved minwise hashing algorithm, and 2) application of the linear SVM to the compact fingerprints. A unique feature of the proposed method is that the linear SVM with the compact fingerprints generated by the minwise hashing is able to simulate the nonlinear property of the kernel SVM. We test the proposed method on its ability to make a large-scale prediction of compound-protein interactions from compound substructure fingerprints and protein domain fingerprints, and show superior performance of the proposed method compared with the previous chemogenomic methods in terms of prediction accuracy, computational efficiency, and interpretability of the predictive model. None of the previously developed methods is computationally feasible for the full dataset consisting of about 200 million compound-protein pairs.
Materials
Compound-protein interactions involving human proteins were obtained from the STITCH database [21]. Compounds are small molecules and proteins belong to many different classes such as enzymes, transporters, ion channels, and receptors. The dataset consists of 300,202 known compound-protein interactions out of 216,121,626 possible compound-protein pairs, involving 35,366 compounds and 6,111 proteins. Note that duplicated compounds were removed. The set of known interactions is used as gold standard data.
Chemical structures of compounds were encoded by a chemical fingerprint with 881 chemical substructures defined in the PubChem database [1]. Each compound was represented by a substructure fingerprint (binary vector) whose elements encode for the presence or absence of each of the 881 PubChem substructures by 1 or 0, respectively.
Genomic information about proteins was obtained from the UniProt database [22], and the associated protein domains were obtained from the PFAM database [23]. Proteins in our dataset were associated with 4,137 PFAM domains. Each protein was represented by a domain fingerprint (binary vector) whose elements encode for the presence or absence of each of the retained 4,137 PFAM domains by 1 or 0, respectively.
Methods
We treat the in silico chemogenomics problem as the following machine learning problem: given a set of n compound-protein pairs (C_{1}, P_{1}), ..., (C_{n}, P_{n}), estimate a function f(C, P) that predicts whether a compound C binds to a protein P. In addition, we attempt to estimate an interpretable function f in order to extract informative features. Since our dataset consists of about 216 million compound-protein pairs, we propose an efficient and general approach to solve these problems.
Model
Linear models such as linear support vector machines (linear SVM) and logistic regression are a feasible tool for large-scale classification and regression tasks, and they provide comprehensible models for these tasks. Generally, linear models represent each example E as a feature vector Φ(E) ∈ ℜ^{D} and then estimate a linear function f(E) = w^{T}Φ(E) whose sign is used to predict whether the example E is positive or negative. Note that fingerprints are used as feature vectors in this study. The weight vector w ∈ ℜ^{D} is estimated based on its ability to correctly predict the classes of examples in the training set. Since each element of the weight vector w corresponds to an element of the fingerprint Φ(E), we can interpret salient features by sorting elements of Φ(E) according to the values of the corresponding elements of w.
In this study each compound-protein pair corresponds to an example. Thus, it is necessary to represent each compound-protein pair (C, P) as a single fingerprint Φ(C, P) and then estimate a function f(C, P) = w^{T}Φ(C, P) whose sign is used to predict whether a compound C interacts with a protein P or not. As in the previous case, we can extract effective features in Φ(C, P) for compound-protein interaction predictions.
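The scoring and feature-ranking idea described above can be sketched in a few lines. This is only an illustrative sketch: the function names (`score`, `predict`, `rank_features`) and the toy weights are assumptions made for this example and are not taken from the paper or its software.

```python
def score(weights, fingerprint):
    """Return f(E) = w^T Phi(E) for a binary fingerprint given as a 0/1 list."""
    return sum(w for w, bit in zip(weights, fingerprint) if bit)

def predict(weights, fingerprint):
    """Predict +1 (interacting) if the score is positive, otherwise -1."""
    return 1 if score(weights, fingerprint) > 0 else -1

def rank_features(weights, top=3):
    """Indices of the features with the largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:top]

# Toy weight vector and fingerprint, chosen by hand for illustration only.
w = [0.8, -0.5, 0.0, 1.2, -0.1]
phi = [1, 1, 0, 0, 1]
print(score(w, phi), predict(w, phi), rank_features(w, top=2))
```

Because the fingerprint is binary, the inner product reduces to summing the weights of the features that are present, which is why sparse fingerprints are cheap to score.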
Fingerprint representation of compound-protein pairs
The fingerprint representation of compound-protein pairs has a large impact not only on the classification ability of linear models but also on the interpretability of features. To meet both demands, we represent each compound-protein pair by a fingerprint built from the compound fingerprint and the protein fingerprint.
The fingerprint of a compound C is represented by a D-dimensional binary vector: Φ(C) = (c_{1}, c_{2}, ..., c_{D})^{T} where c_{i} ∈ {0, 1}, i = 1, ..., D. The fingerprint of a protein P is represented by a D′-dimensional binary vector as well: Φ(P) = (p_{1}, p_{2}, ..., p_{D′})^{T} where p_{i} ∈ {0, 1}, i = 1, ..., D′. We define the fingerprint of each compound-protein pair as the tensor product of Φ(C) and Φ(P) as follows:
\[
\Phi(C, P) = \Phi(C) \otimes \Phi(P) = (c_1 p_1, c_1 p_2, \ldots, c_1 p_{D'}, c_2 p_1, \ldots, c_D p_{D'})^{T}.
\]
Φ(C, P) consists of all possible products of elements in the two fingerprints Φ(C) and Φ(P), so the fingerprint is a D × D′ dimensional binary vector. The dimensions of Φ(C), Φ(P), and Φ(C, P) in this study are D = 881, D′ = 4,137, and DD′ = 3,644,697, respectively.
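As a small illustration of the tensor-product fingerprint described above, the sketch below builds it for toy inputs, both as a dense 0/1 list and as the set of active indices. The helper names and the row-major index convention (entry (i, j) stored at position i·D′ + j, 0-based) are assumptions made for this example.

```python
def tensor_fingerprint(compound_fp, protein_fp):
    """Flattened tensor product: the entry for (i, j) is c_i * p_j,
    stored at index i * D' + j (0-based, row-major)."""
    return [c * p for c in compound_fp for p in protein_fp]

def tensor_set(compound_fp, protein_fp):
    """The same fingerprint as the set of indices whose value is 1."""
    d_prime = len(protein_fp)
    on_c = [i for i, c in enumerate(compound_fp) if c]
    on_p = [j for j, p in enumerate(protein_fp) if p]
    return {i * d_prime + j for i in on_c for j in on_p}

c = [1, 0, 1]       # toy compound fingerprint, D = 3
p = [0, 1, 1, 0]    # toy protein fingerprint, D' = 4
phi = tensor_fingerprint(c, p)
print(len(phi), sorted(tensor_set(c, p)))

# Dimension reported in the text: 881 substructures x 4,137 domains.
print(881 * 4137)
```

The set form never materializes the D × D′ vector: its size is the number of active substructures times the number of active domains, which is what makes the later hashing step practical.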
Minwise hashing
We propose to use minwise hashing for analyzing fingerprints efficiently. In this section, we briefly review minwise hashing [24]. A key observation is that any fingerprint can be uniquely represented by a set. Each fingerprint Φ(C, P) is represented by a set S ⊆ Ω = {1, 2, ..., D × D′}. Given two sets S_{i} and S_{j}, the Jaccard similarity J(S_{i}, S_{j}) of S_{i} and S_{j} is defined as
\[
J(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}.
\]
Minwise hashing is a random projection of sets such that the expected Hamming distance between the obtained symbol strings is proportional to one minus the Jaccard similarity [24]. We pick ℓ random permutations π_{k}, k = 1, ..., ℓ, each of which maps [1, M] to [1, M]. Let T_{i} = t_{i1}, ⋯, t_{iℓ} be the resultant string projected from S_{i}. The projection is defined as the minimum element of the random permutation of the given set,
\[
t_{ik} = \min \pi_k(S_i).
\]
For example, if π_{k} maps 1 to 3, 4 to 1, 6 to 6, and 7 to 4, then S_{i} = (1, 4, 6, 7) is transformed to π_{k}(S_{i}) = (3, 1, 6, 4), and the final product is t_{ik} = 1. The collision probability, i.e., the probability that two sets S_{i} and S_{j} are projected to the same element (t_{ik} = t_{jk}), is
\[
\Pr[t_{ik} = t_{jk}] = J(S_i, S_j).
\]
Therefore, the expected Hamming distance between T_{i} and T_{j} is identical to ℓ(1 − J(S_{i}, S_{j})).
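The relation between the Hamming distance of minwise-hashed strings and the Jaccard similarity can be checked numerically on toy data. The sketch below is illustrative only: it uses explicit random permutations of a tiny universe (feasible only at toy scale), 0-based indices, and hypothetical helper names that do not come from the paper's software.

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two non-empty sets."""
    return len(a & b) / len(a | b)

def make_permutations(universe_size, num_perms, seed=0):
    """Draw num_perms random permutations of {0, ..., universe_size - 1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_perms):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash(s, perms):
    """String T = t_1, ..., t_l where t_k is the minimum of pi_k over the set s."""
    return [min(perm[x] for x in s) for perm in perms]

def hamming(t_i, t_j):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(t_i, t_j))

perms = make_permutations(universe_size=50, num_perms=2000)
s_i = set(range(0, 30))
s_j = set(range(10, 40))
t_i, t_j = minhash(s_i, perms), minhash(s_j, perms)

# 1 - (Hamming distance / l) is an estimate of the Jaccard similarity.
estimate = 1.0 - hamming(t_i, t_j) / len(perms)
print(jaccard(s_i, s_j), estimate)
```

With more permutations the estimate concentrates around the exact value; the string length ℓ therefore trades accuracy against storage.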
Saving memory by additional hashing
The common practice in minwise hashing is to store each hashed value using 64 bits [24]. The storage (and computational) cost is prohibitive in large-scale applications. To overcome this problem, Li et al. proposed b-bit minwise hashing [25, 26], which keeps only the lowest b bits of each hashed value. However, the theoretical analysis of its collision probability is complicated.
Here we introduce a simple yet effective method for which a theoretical estimate of the collision probability can easily be derived. In our method, the hashing values are further hashed randomly to a set {1, ..., N}, where N << M. This projection is defined as follows:
\[
s_{ik} = h(t_{ik}),
\]
where h : {1, ..., M} → {1, ..., N} is a random hash function. If t_{ik} and t_{jk} are identical, s_{ik} and s_{jk} always collide. If not, they collide with probability 1/N. Thus, the collision probability is obtained as follows:
\[
\Pr[s_{ik} = s_{jk}] = J(S_i, S_j) + \frac{1 - J(S_i, S_j)}{N}. \qquad (1)
\]
Figure 1 shows the collision probability as a function of the size of hashing values N for four different Jaccard similarities, 0.1, 0.3, 0.5 and 0.7. The collision probability barely changes once N is 2^{8} or larger. Thus, a small size of hashing values can be chosen without loss of accuracy.
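The behaviour described for Figure 1 can be reproduced from the two collision cases stated above. The sketch below assumes the collision probability has the form J + (1 − J)/N, which follows from those two cases; the function name is hypothetical.

```python
def collision_probability(jaccard, n_values):
    """Collision probability after additional hashing into n_values buckets:
    identical minhash values always collide (probability J), and different
    ones collide by chance with probability 1/N."""
    return jaccard + (1.0 - jaccard) / n_values

for j in (0.1, 0.3, 0.5, 0.7):
    row = [round(collision_probability(j, 2 ** b), 4) for b in (2, 4, 8, 16)]
    print(j, row)
```

The excess over J is (1 − J)/N, so at N = 2^8 it is already below 0.004 for any J, consistent with the observation that larger N changes little.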
Building compact fingerprints by minwise hashing
Learning linear models with large-scale high-dimensional data is a difficult problem in terms of computational cost. Here we propose a method to represent the original fingerprint of a compound-protein pair by a new fingerprint whose size is smaller than that of the original fingerprint.
A crucial observation is that any fingerprint can be uniquely represented as a set, and can also be uniquely converted into a string. First, we convert the original fingerprint of each compound-protein pair into a string by applying minwise hashing and additional hashing. Next, we expand the hashing values that make up the string into a new binary vector whose dimension is much smaller than that of the original fingerprint.
Let S(C, P) be a set representation of Φ(C, P) where i is contained in S(C, P) iff the i-th element of Φ(C, P) is 1. We apply minwise hashing π_{k} (k = 1, ..., ℓ) to S(C, P) to generate a string T(C, P) = t_{1}, t_{2}, ..., t_{ℓ}, where each element t_{k} takes a value ranging from 1 to M. We additionally hash each element t_{k} to a new small value t′_{k} ranging from 1 to N (N << M) by applying the additional hash h, and generate a new string T′(C, P) = t′_{1}, t′_{2}, …, t′_{ℓ}. Each value t′_{k} in the string T′(C, P) is expanded to an N-dimensional binary vector f_{k}, where the t′_{k}-th element is 1 and the others are 0. Finally, we concatenate f_{1}, ..., f_{ℓ} into a single vector, and obtain an ℓN-dimensional binary vector F(C, P) = (f_{1}, ..., f_{ℓ}). The newly obtained F(C, P) is referred to as the "compact fingerprint". Figure 2 shows an illustration of the proposed procedure.
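The construction of the compact fingerprint can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: permutations and the additional hash h are stored as explicit lookup tables (only practical for a tiny universe), indices are 0-based, and the names `build_hashers` and `compact_fingerprint` are hypothetical.

```python
import random

def build_hashers(universe_size, num_perms, n_values, seed=0):
    """Return num_perms random permutations and one random hash table h
    mapping {0..universe_size-1} to {0..n_values-1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_perms):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    h = [rng.randrange(n_values) for _ in range(universe_size)]
    return perms, h

def compact_fingerprint(s, perms, h, n_values):
    """Minwise-hash the set s, re-hash each value with h, and one-hot
    encode each result in its own block of n_values positions."""
    fp = [0] * (len(perms) * n_values)
    for k, perm in enumerate(perms):
        t_k = min(perm[x] for x in s)      # minwise hashing value
        fp[k * n_values + h[t_k]] = 1      # one-hot position inside block k
    return fp

perms, h = build_hashers(universe_size=12, num_perms=5, n_values=4)
f = compact_fingerprint({1, 2, 9, 10}, perms, h, n_values=4)
print(len(f), sum(f))
```

Each of the ℓ blocks contains exactly one 1, so the compact fingerprint has ℓ non-zero entries regardless of how many bits were set in the original fingerprint.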
Linear support vector machines (Linear SVM)
We use linear SVM as a classifier. The predictive model is typically learned by minimizing an objective function with a regularization term. The most common regularization is L_{2}-regularization, which leaves most elements of the weight vector non-zero, so the predictive model is difficult to interpret because of its many non-zero weights. L_{2}-regularized linear SVM is referred to as L2SVM. Another regularization is L_{1}-regularization, which drives most elements of the weight vector to zero, so L_{1}-regularization is popular for its high interpretability owing to the induced sparsity. L_{1}-regularized linear SVM is referred to as L1SVM.
Given a training set of compound-protein pairs and labels {F(C_{i}, P_{i}), y_{i}}_{i=1}^{n}, y_{i} ∈ {+1, −1}, linear SVM is formulated as the following unconstrained optimization problem:
To prevent overfitting, the weight vector is optimized with L_{1}-regularization and L_{2}-regularization as follows:
and
where ‖·‖_{1} and ‖·‖_{2} are the L_{1} and L_{2} norms, and C is a hyperparameter. Recently, optimization algorithms for linear SVM have rapidly advanced. In this study, we use an efficient implementation named LIBLINEAR [18]^{1}.
^{1}The software is available from http://www.csie.ntu.edu.tw/~cjlin/liblinear/
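For orientation, one standard way to write such objectives is given below. The loss function is not specified in the text above, so the hinge loss used here is an assumption (the squared hinge loss is an equally common choice in LIBLINEAR), and the scaling of the L_{2} term, for instance a factor of 1/2, is a matter of convention.

```latex
% Unregularized linear SVM with hinge loss (assumed form)
\min_{w} \; \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{T} F(C_i, P_i)\bigr)

% L1-regularized variant (L1SVM)
\min_{w} \; \|w\|_{1} + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{T} F(C_i, P_i)\bigr)

% L2-regularized variant (L2SVM)
\min_{w} \; \|w\|_{2}^{2} + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{T} F(C_i, P_i)\bigr)
```

In all three forms a larger C puts more emphasis on fitting the training pairs and less on keeping the weight vector small.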
In our method, we propose to use the compact fingerprint F(C, P) instead of the original fingerprint Φ(C, P) as an input for L1SVM and L2SVM. L1SVM and L2SVM with the compact fingerprints F(C, P) are referred to as Minwise Hashing-based L1SVM (MHL1SVM) and Minwise Hashing-based L2SVM (MHL2SVM), respectively. In contrast, L1SVM and L2SVM with the original fingerprints Φ(C, P) are referred to simply as L1SVM and L2SVM, respectively, which correspond to previous methods [8].
In most previous works the kernel SVM method was used, but the input of kernel SVM is the kernel similarity matrix for compound-protein pairs [5, 6], which makes it difficult to apply the kernel SVM to large-scale interaction prediction. This is because the time complexity of the quadratic programming problem for kernel SVM is O(n_{c}^{3} × n_{p}^{3}), where n_{c} is the number of compounds and n_{p} is the number of proteins, and the space complexity is O(n_{c}^{2} × n_{p}^{2}), which is just for storing the kernel matrix. Moreover, kernel SVM offers no interpretability of the predictive model because it is not able to extract features.
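To give a sense of scale for the space complexity, the following back-of-the-envelope calculation uses the dataset sizes reported in the Materials section. It assumes a dense kernel matrix stored with 8 bytes per entry, which is an assumption made for illustration; exploiting symmetry would halve the figure without changing the conclusion.

```python
n_compounds = 35_366
n_proteins = 6_111

# Every compound-protein pair is one example.
n_pairs = n_compounds * n_proteins

# A dense kernel matrix has one entry per pair of examples.
kernel_entries = n_pairs ** 2
bytes_needed = kernel_entries * 8          # assuming 8-byte floats

print(f"{n_pairs:,} pairs")
print(f"{bytes_needed / 1e15:.0f} petabytes for the full kernel matrix")
```

The result is several orders of magnitude beyond the memory of any single machine, which is why methods that need the explicit kernel matrix cannot be run on the full dataset.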
Relation to kernel SVM
In this section, we describe a theoretical foundation for using linear SVM with compact fingerprints and discuss the relation to kernel SVM [5, 6]. A kernel matrix is an n × n matrix K satisfying ∑_{ij} c_{i}c_{j}K_{ij} ≥ 0 for all real vectors c. Such a property is called positive definite (PD), which is necessary to effectively train an SVM classifier with a kernel matrix. A matrix A is PD if it can be written as an inner product of matrices B^{T}B.
Our linear SVM with compact fingerprints simulates nonlinear SVMs with the Jaccard similarity matrix for the following reasons.

1. Each element of the pairwise kernel matrix of compound-protein pairs is defined as the number of common elements between two sets S(C, P) and S(C′, P′), i.e., |S(C, P) ∩ S(C′, P′)|. The pairwise kernel matrix is PD. Jaccard similarity is a pairwise kernel normalized by the cardinality of the union of the two sets S(C, P) and S(C′, P′), i.e., |S(C, P) ∪ S(C′, P′)|. The Jaccard similarity matrix of compound-protein pairs, where each element is the Jaccard similarity of two sets S(C, P) and S(C′, P′), is also PD.
2. Let the minwise hashing matrix of compound-protein pairs be a matrix whose element is defined as the inner product of two compact fingerprints F(C, P) and F(C′, P′). The minwise hashing matrix is PD.
3. The (i, j)-element of the Jaccard similarity matrix correlates with the (i, j)-element of the minwise hashing matrix.
4. While Jaccard similarity is a nonlinear function, the inner product is a linear function.
The third reason is true because the collision probability, which is the probability that the minwise hashing and additional hashing values for two sets S(C, P) and S(C′, P′) are the same, is positively correlated with the Jaccard similarity J(S(C, P), S(C′, P′)) (Equation 1).
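The link between the inner product of compact fingerprints and the Jaccard similarity can be made explicit with a short derivation. It assumes that Equation 1 has the form J + (1 − J)/N, as implied by the two collision cases described earlier; the symbol t″_{k} for the hashed values of the second pair is introduced here only for illustration.

```latex
% Each block of the compact fingerprint is one-hot, so the inner product
% counts the positions k at which the two hashed strings agree.
\langle F(C,P),\, F(C',P') \rangle = \sum_{k=1}^{\ell} \mathbf{1}\,[\,t'_k = t''_k\,]

% Taking expectations with the collision probability of Equation 1:
\mathbb{E}\,\langle F(C,P),\, F(C',P') \rangle
  = \ell \left( J + \frac{1 - J}{N} \right),
\qquad J = J\bigl(S(C,P),\, S(C',P')\bigr)

% Solving for J gives an estimate from an observed inner product:
\hat{J} = \frac{N \,\langle F(C,P),\, F(C',P') \rangle / \ell \; - 1}{N - 1}
```

The expected inner product is therefore an increasing affine function of J, which is the sense in which the linear kernel on compact fingerprints tracks the Jaccard kernel.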
Feature extraction for biological interpretation
Extracting informative features in the original fingerprint for predicting compound-protein interactions is also an important task. Each value of the weight vector in a linear model corresponds to the importance of the corresponding feature of the original fingerprint in the classification task. In our method, however, we apply minwise hashing and additional hashing to the original fingerprint, and build the compact fingerprint to efficiently train a linear SVM classifier. Thus, it is not trivial to extract features in the original fingerprint in our framework.
We propose to keep the inverse mappings π_{k}^{-1} and h^{-1} for the permutation π_{k} and the additional hashing h, and apply h^{-1} and π_{k}^{-1} to each element in the compact fingerprint in order to recover the weight vector for the original fingerprint. Let π_{k}^{-1} : [1, M] → [1, M] (k = 1, ..., ℓ) be the inverse mapping for the permutation π_{k} : [1, M] → [1, M]. Let h^{-1} : [1, N] → [1, M]^{∗} be the inverse mapping for the additional hashing h : [1, M] → [1, N]. Note that h^{-1} is, in general, a one-to-many mapping because N << M.
First, we apply the inverse mapping h^{-1} to each element in the compact fingerprint to recover the values hashed by the additional hashing h. Since h^{-1} is a one-to-many mapping, several values are recovered. Then, the inverse mapping π_{k}^{-1} is applied to each value in order to recover an element in the original fingerprint. Finally, we compute an average of the weights learned by linear SVMs, which provides the recovered weight vector for the original fingerprint. Figure 3 shows an illustration of the proposed procedure.
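One possible reading of this recovery procedure is sketched below. The text does not spell out exactly which weights are averaged, so the interpretation used here is an assumption: for each original feature j, collect the weight of the compact position that j would be hashed to in each of the ℓ blocks, and average over the blocks. This is equivalent to walking backwards through h^{-1} and π_{k}^{-1} from every compact position. Lookup tables, 0-based indices, and the function name are hypothetical.

```python
def recover_weights(compact_weights, perms, h, n_values):
    """Map weights learned on the compact fingerprint back to the original
    feature space by averaging over the l hash blocks (assumed reading)."""
    universe_size = len(h)
    recovered = []
    for j in range(universe_size):
        total = 0.0
        for k, perm in enumerate(perms):
            # Original feature j lands on position h[pi_k(j)] of block k.
            total += compact_weights[k * n_values + h[perm[j]]]
        recovered.append(total / len(perms))
    return recovered

# Hand-made toy setting: 3 original features, l = 2 permutations, N = 2.
toy_perms = [[2, 0, 1], [0, 1, 2]]
toy_h = [0, 1, 0]
toy_weights = [1.0, 2.0, 3.0, 4.0]     # one weight per compact position
print(recover_weights(toy_weights, toy_perms, toy_h, n_values=2))
```

Because h is many-to-one, several original features share a compact position and therefore share its weight; averaging over blocks reduces, but does not remove, this ambiguity.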
Results
Performance evaluation
We tested MHL1SVM and MHL2SVM (newly proposed methods) on their abilities to predict compound-protein interactions from compound substructure fingerprints and protein domain fingerprints, and compared the performance with L1SVM and L2SVM (previous methods [8]) in terms of prediction accuracy and computational cost. Note that the kernel SVM (the state-of-the-art [4–7]) was not computationally feasible for our large data. Our full dataset is too large (about 216 million compound-protein pairs), so we used a subset of the full data for efficient evaluation of the four different methods. In the sub-dataset, the numbers of positive and negative examples were balanced: 300,202 each, 600,404 in total. We performed two types of 5-fold cross-validation: pairwise cross-validation and blockwise cross-validation.
In the pairwise cross-validation we perform the following procedure: 1) We randomly split compound-protein pairs in the gold standard set into five subsets of roughly equal sizes, and take each subset in turn as a test set. 2) We train a predictive model on the remaining four subsets. 3) We compute the prediction scores for compound-protein pairs in the test set. 4) Finally, we evaluate the prediction accuracy over the five folds. The pairwise cross-validation assumes the situation where we want to detect missing interactions between known ligand compounds and known target proteins with information about interaction partners.
In the blockwise cross-validation we perform the following procedure: 1) We randomly split compounds and proteins in the gold standard set into five compound subsets and five protein subsets, and take each compound subset and each protein subset in turn as test sets. 2) We train a predictive model on compound-protein pairs in the remaining four compound subsets and four protein subsets. 3) We compute the prediction scores for compound-protein pairs involving the test compound set and the test protein set. 4) Finally, we evaluate the prediction accuracy over the five folds. The blockwise cross-validation assumes the situation where we want to detect new interactions for newly arriving ligand candidate compounds and target candidate proteins with no information about interaction partners.
In both cases, we evaluated the performance by the area under the ROC curve (AUC) and execution time. The cross-validations were performed by varying the hyperparameter C = 10^{-5}, 10^{-4}, ..., 10^{5}, and the value achieving the best AUC score was chosen.
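The difference between the two splitting schemes can be sketched as follows. The helper names and the toy data are assumptions made for illustration; in particular, pairing the f-th compound subset with the f-th protein subset is one simple way to realize the blockwise scheme and may differ from the authors' exact protocol.

```python
import random

def pairwise_folds(pairs, n_folds=5, seed=0):
    """Split the pairs themselves into n_folds subsets of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def blockwise_folds(pairs, n_folds=5, seed=0):
    """Split compounds and proteins separately; a test pair has both its
    compound and its protein held out, a training pair has neither."""
    rng = random.Random(seed)
    compounds = sorted({c for c, _ in pairs})
    proteins = sorted({p for _, p in pairs})
    rng.shuffle(compounds)
    rng.shuffle(proteins)
    c_fold = {c: i % n_folds for i, c in enumerate(compounds)}
    p_fold = {p: i % n_folds for i, p in enumerate(proteins)}
    splits = []
    for f in range(n_folds):
        test = [(c, p) for c, p in pairs if c_fold[c] == f and p_fold[p] == f]
        train = [(c, p) for c, p in pairs if c_fold[c] != f and p_fold[p] != f]
        splits.append((train, test))
    return splits

# Toy data: every combination of 10 compound ids and 8 protein ids.
toy_pairs = [(c, p) for c in range(10) for p in range(8)]
folds = pairwise_folds(toy_pairs)
blocks = blockwise_folds(toy_pairs)
print([len(fold) for fold in folds], [len(test) for _, test in blocks])
```

In the blockwise scheme no compound or protein in a test set ever appears in the corresponding training set, which is what makes it the harder and more realistic setting for new compounds and proteins.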
We investigated the effects of the length of strings ℓ and the size of hashing values N in the minwise hashing process of MHL1SVM and MHL2SVM on the performance. We tried five different string lengths, ℓ = 5, 10, 15, 30, 50. The size of additional hashing values N was varied from 2^{2} to 2^{32}. Figures 4 and 5 show the AUC scores for MHL1SVM and MHL2SVM in the pairwise cross-validation. It was observed that the AUC scores reached the maximum with the string length ℓ = 10 and the size of additional hashing values N = 2^{16}, and the AUC score was comparable to that for the original fingerprint.
Figures 6 and 7 show the execution time for performing the minwise hashing and for learning SVM classifiers, where the string length ℓ is varied from 5 to 50 and the size of additional hashing values is fixed to N = 2^{16}. The AUC scores of MHL1SVM and MHL2SVM with the string length ℓ = 10 and the size of additional hashing values N = 2^{16} were comparable to those of L1SVM and L2SVM. In addition, MHL1SVM and MHL2SVM achieved a speed-up compared with L1SVM and L2SVM.
The same trends as in the pairwise cross-validation were observed in the blockwise cross-validation as well. The corresponding results for the blockwise cross-validation are shown in Figures 8, 9, 10 and 11. The AUC scores in the blockwise cross-validation were lower than those in the pairwise cross-validation, which implies that predicting unknown interactions for newly arriving compounds and proteins outside of the learning set is much more difficult than detecting missing interactions between compounds and proteins in the learning set.
Experiments on large-scale datasets
We evaluated the performance on the full data consisting of 216,121,626 compound-protein pairs, using the best parameter values found for each method in the cross-validation experiments of the previous subsection. We examined the effect of the ratio of positive to negative compound-protein pairs on the performance. Note that the number of negative examples is much larger than that of positive examples in our dataset. We varied the number of negative examples in the cross-validation from the same number as the positive examples up to the number of all possible negative examples.
Figure 12 shows the memory usage of the four different methods. It was observed that the memory usage grew linearly as the number of compound-protein pairs increased in each method. In particular, both L1SVM and L2SVM required about 200 GB of memory. On the other hand, MHL1SVM and MHL2SVM took only about 30 GB of memory. There is little difference in memory usage between L1-regularization and L2-regularization.
Table 1 shows the AUC scores in the pairwise cross-validation. It was observed that the AUC scores of MHL1SVM and MHL2SVM were comparable to those of L1SVM and L2SVM, respectively. Table 2 shows the training time in the pairwise cross-validation, where the training time includes the minwise hashing process and an upper limit of 24 hours was placed on the execution time of all methods. MHL1SVM and MHL2SVM are significantly faster than L1SVM and L2SVM, respectively. In particular, MHL2SVM is about 10 times faster than L2SVM. L1SVM did not finish the computation for such a large number of compound-protein pairs within 24 hours. On the other hand, our MHL1SVM finished the computation and took only 25,060 seconds on average.
The same trends as in the pairwise cross-validation were observed in the blockwise cross-validation as well (see Tables 3 and 4).
Table 5 shows the AUC scores and training times when using all possible negative examples, where only one fold of the 5-fold cross-validation was performed on this dataset. On this extremely large data, L1SVM and L2SVM did not finish the computation within 24 hours. On the other hand, MHL1SVM and MHL2SVM finished the computation, and the AUC scores were reasonable. The training times of MHL1SVM and MHL2SVM were 157,013 and 10,054 seconds, respectively. These results suggest the usefulness of our proposed methods in large-scale applications.
Figure 13 shows the numbers of features extracted by MHL1SVM and MHL2SVM. The number of features extracted by MHL1SVM is about three times smaller than the number extracted by MHL2SVM. This result suggests that MHL1SVM provides us with more selective features, which would help to make a biological interpretation of the functional associations between compound substructures and protein domains behind compound-protein interactions.
Discussion and conclusion
In this paper we proposed a novel chemogenomic method to predict unknown compound-protein interactions on a large scale, which was made possible by using an improved minwise hashing algorithm to efficiently represent the fingerprints of compound-protein pairs. Interestingly, the linear SVM with the compact fingerprints generated by the minwise hashing is able to simulate the nonlinear property of the kernel SVM (the state-of-the-art). The originality of the proposed method lies in the scalable prediction of compound-protein interactions, in the computational efficiency, and in the interpretability of the predictive model. It should be pointed out that none of the previous methods was computationally feasible for the full data. The proposed method is expected to be useful for virtual screening of a large number of compounds against many protein targets.
The proposed method can be used whenever compounds and proteins are represented by binary descriptors (chemical substructures and protein domains in this study). However, a limitation of the proposed method is that the performance depends on the definitions of chemical substructures of compounds and functional domains of proteins. The use of other descriptors (e.g., Klekota-Roth, ECFP6, Daylight, and Dragon) could improve the generalization properties of the method. Datasets, all results and software are available at https://sites.google.com/site/interactminhash/.
References
Chen B, Wild D, Guha R: PubChem as a source of polypharmacology. J Chem Inf Model. 2009, 49: 20442055. 10.1021/ci9001876.
Stockwell B: Chemical genetics: ligandbased discovery of gene function. Nat Rev Genet. 2000, 1: 116125.
Dobson C: Chemical space and biology. Nature. 2004, 432 (7019): 824828. 10.1038/nature03192.
Nagamine N, Sakakibara Y: Statistical prediction of proteinchemical interactions based on chemical structure and mass spectrometry data. Bioinformatics. 2007, 23: 20042012. 10.1093/bioinformatics/btm266.
Faulon JL, Misra M, Martin S, Sale K, Sapra R: Genome scale enzymemetabolite and drugtarget interaction predictions using the signature molecular descriptor. Bioinformatics. 2008, 24: 225233. 10.1093/bioinformatics/btm580.
Jacob L, Vert JP: Proteinligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008, 24: 21492156. 10.1093/bioinformatics/btn409.
Yabuuchi H, Niijima S, Takematsu H, Ida T, Hirokawa T, Hara T, Ogawa T, Minowa Y, Tsujimoto G, Okuno Y: Analysis of multiple compoundprotein interactions reveals novel bioactive molecules. Mol Syst Biol. 2011, 7: 472
Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y: Identification of chemogenomic features from drugtarget interaction networks using interpretable classifiers. Bioinformatics. 2012, 28: i487i494. 10.1093/bioinformatics/bts412.
Chang C, Lin C: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2: 127.
Joachims T: Learning to classify text using support vector machines: Methods, Theory and Algorithms. 2002, Kluwer Academic Publishers, 186:
Todeschini R, Consonni V: Handbook of Molecular Descriptors. 2002, New York, USA: Wiley-VCH
Chen J, Swamidass S, Dou Y, Bruand J, Baldi P: ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005, 21: 4133-4139. 10.1093/bioinformatics/bti683.
Lodhi H, Yamanishi Y: Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques. IGI Global. 2010
Park K, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19: 1656-1663. 10.1093/bioinformatics/btg222.
Lanckriet G, Deng M, Cristianini N, Jordan M, Noble W: Kernel-based data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput. 2004, 300-311.
Ben-Hur A, Soon Ong C, Sonnenburg S, Schölkopf B, Rätsch G: Support Vector Machines and Kernels for Computational Biology. PLoS Computational Biology. 2008, 4 (10): e1000173. 10.1371/journal.pcbi.1000173.
Fan RE, Chang KW, Hsieh CJ, Wang X, Lin CJ: LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research. 2008, 9: 1871-1874.
Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S: A Dual Coordinate Descent Method for Large-scale Linear SVM. Proceedings of the 25th International Conference on Machine Learning. 2008, 408-415.
Joachims T: Training Linear SVMs in Linear Time. Proceedings of the 12th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2006, 217-226.
Shalev-Shwartz S, Singer Y, Srebro N: Pegasos: primal estimated sub-gradient solver for SVM. Proceedings of the 24th international conference on Machine learning. 2007, 807-814.
Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen L, Beyer A, Bork P: STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res. 2010, 38 (suppl 1): D552-D556.
Consortium TU: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38: D142-D148.
Finn R, Tate J, Mistry J, Coggill P, Sammut J, Hotz H, Ceric G, Forslund K, Eddy S, Sonnhammer E, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.
Broder A, Charikar M, Frieze A: Min-Wise Independent Permutations. Journal of Computer and System Sciences. 2000, 60: 630-659. 10.1006/jcss.1999.1690.
Li P, König AC: b-bit minwise hashing. Proceedings of 27th International World Wide Web Conference. 2010, 671-680.
Li P, König AC, Gui W: b-bit minwise hashing for estimating three-way similarities. Twenty-Fourth Annual Conference on Neural Information Processing Systems. 2010
Acknowledgements
This work was supported by MEXT/JSPS KAKENHI Grant Numbers 24700140 and 25700029. This work was also supported by the Program to Disseminate Tenure Tracking System, MEXT, Japan, and Kyushu University Interdisciplinary Programs in Education and Projects in Research Development. This work was also supported by the PRESTO program of the Japan Science and Technology Agency (JST).
Declarations
The publication cost for this work was supported by the PRESTO program of the Japan Science and Technology Agency (JST).
This article has been published as part of BMC Systems Biology Volume 7 Supplement 6, 2013: Selected articles from the 24th International Conference on Genome Informatics (GIW2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.
Additional information
Competing interests
None declared.
Authors' contributions
YT implemented the algorithms, performed all analyses, and drafted the manuscript. YY prepared the datasets and drafted the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tabei, Y., Yamanishi, Y. Scalable prediction of compound-protein interactions using minwise hashing. BMC Syst Biol 7 (Suppl 6), S3 (2013). https://doi.org/10.1186/1752-0509-7-S6-S3
DOI: https://doi.org/10.1186/1752-0509-7-S6-S3