Volume 5 Supplement 2
Drug-drug relationship based on target information: application to drug target identification
© Park and Kim; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Drugs that bind to common targets likely exert similar activities. In this target-centric view, the inclusion of richer target information may better represent the relationships between drugs and their activities. Under this assumption, we expanded the “common binding rule” assumption of QSAR to create a new drug-drug relationship score (DRS).
Our method uses various chemical features to encode drug target information into the drug-drug relationship. Specifically, drug pairs were transformed into numerical vectors containing the basal drug properties and their differences. Machine learning techniques such as data cleaning, dimension reduction, and ensemble classification were then used to prioritize drug pairs bound to a common target. In other words, the estimation of the drug-drug relationship is restated as a large-scale classification problem, which provides a framework for applying state-of-the-art machine learning techniques with thousands of chemical features to define new drug-drug relationships.
Various aspects of the presented score were examined to determine its reliability and usefulness: the abundance of common domains among the predicted drug pairs, ca. 80% coverage of known targets, successful identification of unknown targets, and a meaningful correlation with another cutting-edge method for analyzing drug similarities. The most significant strength of our method is that the DRS can be used to describe phenotypic similarities, such as pharmacological effects.
Recently, many studies have examined the quantitative structure-activity relationship (QSAR) between drugs, as researchers seek to characterize chemical compounds in terms of their activities. Thus far, these studies have adopted a mathematical procedure that transforms chemical properties into numeric features, the so-called “molecular descriptors.” Many thousands of descriptors have been devised and have proven useful for predicting a variety of drug activities, such as drug-likeness [1], pharmacokinetic parameters [2], acute toxicity [3], multi-modal binding propensity [4], and many other physicochemical properties [5] (e.g. log P). Furthermore, descriptors have also been used to infer the drug-drug relationship, which expands their applicability to virtual screening [6, 7], chemical library construction [8], drug clustering [9] and classification [10–12].
The wide use of chemical information (descriptors) rests on an implicit assumption that drugs that bind to the same target likely exert similar activities. In line with this thinking, the theory of “neighborhood behavior” [13] has long asserted that structurally similar drugs likely bind to a common therapeutic target. Therefore, it can be said that drug target information is the most direct evidence for inferring a drug’s activity. In this target-centric view, the inclusion of richer target information may better represent the relationships between drugs and their activities. However, drug-drug relationships have typically been calculated using chemical structural information [14–16]. That is, a chemical structure is converted into numerical features representing various chemical properties [17], and the structural features are then used to define the drug-drug relationship by determining which features are the same and which are different. The weak point of this method is that it cannot account for the many structurally unrelated drugs that bind to a common target [18, 19].
In this study, we present a new drug-drug relationship score (DRS) which aims to encode both the drug target information and the global structural similarity. The “common binding rule” assumption of QSAR studies was used and expanded to posit the existence of common rules governing drug-target interaction which could be learned from large-scale drug-target interaction data.
Specifically, more than 2,000 descriptors were used to transform drug pairs into numerical vectors. The estimation of drug-drug relationships was thus restated in a classification framework that prioritizes drug pairs with a common target. This procedure was based on the assumption that drugs sharing a target are much more similar than drugs that are only alike in terms of structure. To improve the reliability of the score, data cleaning, iterative under-sampling, and the ensemble approach were combined with a Random Forest classifier.
The classification performance was validated using both internal and external test sets. In addition, the reliability and usefulness of the DRS were examined in terms of the abundance of common domains among the predicted drug pairs, ca. 80% coverage of known targets, successful examples of unknown target identification, and a meaningful correlation with another cutting-edge technique. Significantly, the DRS showed better performance for describing similarity in pharmacological effects, perhaps due to the encoded target information.
Results and discussion
Generating drug-drug relationship score
These results suggest that the DRS contains more useful target information than traditional similarity measures, and the classification model seems to be unbiased by the huge amounts of negative data. In addition, true positives (correctly predicted drug pairs) covered many structurally-unrelated drug pairs (Additional file 1), implying that the DRS could capture the important spatial features of structurally-unrelated drug-pairs. On the other hand, the performances of the five structural similarity measures were virtually identical, although PubChem fingerprint showed the best performance.
Predicted drug pairs seem to be promising: high domain-matching ratio
In the classification framework, drug pairs that do not share any known common targets were considered as negative data. However, such drugs may in fact share common targets that remain unknown because of insufficient knowledge about drug-target interactions. Therefore, using the DRS to mine unknown drug-drug relationships could be very interesting work. Indeed, new similarities between drugs have been used to reposition marketed drugs by revealing unknown drug-drug relationships [20, 21]. From this viewpoint, drug pairs predicted as positives might have a better chance of sharing a common target than negative drug pairs.
Specifically, the proportion of negative drug pairs that shared common PFAM domains was investigated as a function of the DRS. Note that negative drug pairs are those without any known common targets. The results showed that a higher DRS corresponded to a higher domain-matching ratio. For example, more than 50% of drug pairs had common target domains when the DRS threshold was set to 0.5, which was significantly higher than the random expectation (less than 1%). Accordingly, the domain-matching ratio suggests that the DRS might be useful for finding unknown drug-drug relationships.
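The domain-matching ratio described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(drs, shares_domain)` input layout, and the toy values are all hypothetical.

```python
def domain_matching_ratio(pairs, threshold):
    """Fraction of negative drug pairs at or above a DRS threshold
    whose targets share at least one protein domain.

    pairs: iterable of (drs, shares_domain) tuples (hypothetical layout).
    """
    selected = [shares for drs, shares in pairs if drs >= threshold]
    if not selected:
        return float("nan")  # no pair reaches the threshold
    return sum(selected) / len(selected)

# Toy negative pairs: (DRS, do the targets share a domain?)
toy_pairs = [(0.8, True), (0.6, True), (0.55, False),
             (0.2, False), (-0.4, False), (-0.7, True)]

ratio_at_half = domain_matching_ratio(toy_pairs, 0.5)   # 2 of 3 pairs match
ratio_overall = domain_matching_ratio(toy_pairs, -1.0)  # 3 of 6 pairs match
```

Sweeping the threshold over the score range gives the ratio-versus-DRS trend discussed in the text.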
New target identification by drug-drug relationship score
Drug target prediction examples by the DRS. Columns: score (frequency for score ties) from the positive drug pairs; score (frequency for score ties) from the newly predicted drug pairs (not sharing a target).
0.939 [Potassium H2]
0.655 [Potassium D2,D3,KQT2,A1]
0.918 [β1 adrenergic receptor]
0.635 [β1 adrenergic receptor]
0.988 [β1 adrenergic receptor]
0.699 [β1 adrenergic receptor]
0.505 [Kv subfamily C member 4]
0.451 [Kv subfamily KQT member 1]
0.451 [Kv subfamily E member 1]
For most drugs, the target prediction scheme employing the DRS worked well, even for the new targets discovered by Keiser et al. [20]. For example, the alpha-1 adrenergic receptor, a target of Motilium, was found in the fourth rank (with a score that was tied with the first rank). In addition, other targets such as the potassium channel (K+) and serotonin receptor 2A (5HT-2A) were successfully discovered, even though they were not included in the DrugBank database and were thus not in the training set. As expected, the positive drug pairs seemed to be helpful for predicting new targets (e.g. α1 of Motilium, α2 of Xenazine and δ of Prantal) by annotation transfer based on the shared target. Interestingly, the newly discovered targets (bold) and those targets not annotated in DrugBank (underlined) could also be discovered by the new DRS predictions.
As another case study, we tried to find the off-targets of celecoxib (DB00482), which is known to show unexpected nanomolar inhibition of carbonic anhydrase 2 [24, 25], an effect that was not annotated in the DrugBank database. As expected, the known targets of celecoxib appeared in the predicted target list based on positive drug pairs, but carbonic anhydrase 2 could be found only from the newly predicted drug pairs (score 0.826, first rank). In addition, recent studies have shown that celecoxib blocks human cardiac voltage-gated potassium channels (Kv), which accounts for the drug’s known cardiovascular side effects [26, 27]. Indeed, the target predictions for celecoxib gave high scores for potassium channels, such as potassium voltage-gated channel subfamily C member 4 (0.505), potassium voltage-gated channel subfamily KQT member 1 (0.451), and potassium voltage-gated channel subfamily E member 1 (0.451). Note that the DRS ranges from -1 to 1.
Correlation with another drug similarity score
Campillos et al. calculated the target-sharing probabilities of drugs based on the similarity of side effects and chemical structure [21]. Because both the target-sharing probability and the DRS prioritize drug pairs with common targets, we compared the two methods for each drug group. In the previous study [21], drug pairs with at least 25% probability of sharing a protein target were selected and divided into five groups: the first group (G1) was drug pairs known to share targets (true positives in our study); the second (G2) was drug pairs with similar structures or targets; the third (G3) was drug pairs without known human targets; the fourth (G4) was drug pairs from the same therapeutic category; and the last (G5) was drug pairs predicted only by the side effect similarities.
Pearson’s product-moment correlation coefficient was used to test the significance of the correlation between the two methods. Because the G1 group consisted of drug pairs that shared a target and were included in the training set, the score from our method should obviously be high. On the other hand, all of the drug pairs in the other groups were new predictions, so significant correlations between the two scores are meaningful. Specifically, the correlation coefficients in G2, G4, and G5 were 0.688 (p-value 1.74e-07), 0.724 (2.85e-05), and 0.396 (2.41e-05), respectively (Additional file 3). Note that the G3 group was not considered because it contained too few drug pairs (eight). Accordingly, the two scores are largely correlated with each other even though they use different information.
Pharmacological effect similarity by drug-drug relationship score
Chemical similarity has frequently been used to estimate relationships between drugs. For example, in the drug discovery process, the chemical library can be scanned with a query drug to find those compounds which bind to the same target as the query. This drug/target activity viewpoint led us to develop a new target-centric drug-drug relationship score (DRS) under the assumption that drugs that bind to a common target have other common factors. Indeed, the DRS was shown to be closely related to similarities in pharmacological effects.
In our method, to represent drug pairs with their target information, the estimation of drug-drug relationships was restated as a large-scale classification problem that distinguished drug pairs with a common target. In addition, the classification model was improved through data cleaning, iterative under-sampling, and an ensemble approach in combination with a Random Forest classifier. The usefulness of the DRS was demonstrated with internal and external validations, as well as a high domain matching ratio for the new predictions, successful identifications of unknown targets, and a meaningful correlation with another cutting-edge method for studying drug-similarity.
Drug-target interaction data
Drug structures and data on targets and drug-target interactions were retrieved from the DrugBank database (April 2011). After drugs that caused errors during the descriptor calculation by PaDEL [29] were removed, the numbers of remaining drugs and drug-target interactions were 5,858 and 14,490, respectively. The simple network properties of the relationship are shown in Additional file 5. See the previous work by Yildirim et al. for detailed network properties of the drug-target network [30].
Drug representation by molecular descriptor
Molecular descriptors (descriptors) are the result of standardized numerical calculations and of logical and mathematical interpretations of chemical information. To characterize drugs, descriptors were calculated using PaDEL software [29]. Specifically, PaDEL descriptors (801), PubChemFP (PubChem fingerprint, 881), EStateFP (E-State fragments, 79), MACCSFP (MACCS keys, 166) and SubFPC (SMARTS patterns for functional group classification, 307) fingerprints were calculated for each drug. In this procedure, descriptors that generated calculation errors or gave almost the same values for more than 90% of drugs were removed. As a result, 89,354 target-sharing drug pairs were selected as positives, and represented in descriptor space. The drugs were then projected onto the top 162 principal components (PCs), which cumulatively explained 90% of the variance. The purpose of considering the major principal components was to eliminate noise and remove redundant information derived from inter-correlations between descriptors.
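The projection step can be illustrated with a small NumPy sketch that keeps the smallest number of leading PCs whose cumulative explained variance reaches a target (90% here). This is an illustration under stated assumptions, not the authors' pipeline: the function name and toy data are hypothetical, and the descriptors are only mean-centered because the text does not say whether they were also scaled.

```python
import numpy as np

def project_to_pcs(X, var_target=0.90):
    """Project rows of X (drugs x descriptors) onto the leading principal
    components that together explain at least var_target of the variance."""
    Xc = X - X.mean(axis=0)                      # center each descriptor
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # variance ratio per PC
    cumulative = np.cumsum(explained)
    # smallest k such that the first k PCs reach the target
    k = int(np.searchsorted(cumulative, var_target) + 1)
    return Xc @ Vt[:k].T, k

# Toy data: 200 "drugs", 10 correlated descriptors driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

scores, k = project_to_pcs(X)   # scores has one row per drug and k columns
```

Because the toy descriptors are strongly inter-correlated, only a few PCs are retained, which mirrors the stated purpose of removing redundant information.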
Construction of the drug pair vector
A feature vector representing a drug pair was constructed from the PC-based drug representation (Figure 1). The drug pair vector consisted of an M vector and an E vector: the M vector (the component-wise mean of the two drugs’ PCs) represents the basal chemical properties, and the E vector (the component-wise squared difference between the two drugs’ PCs) represents the differences in chemical properties. Both vectors are symmetric in the two drugs, so the drug pair vector does not depend on the order of the pair.
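A minimal sketch of the drug pair vector follows. The function name is hypothetical, and joining M and E by simple concatenation is an assumption, since the text only says the pair vector "consisted of" both parts.

```python
import numpy as np

def pair_vector(pc_a, pc_b):
    """Build a drug pair vector from the PC representations of two drugs."""
    pc_a = np.asarray(pc_a, dtype=float)
    pc_b = np.asarray(pc_b, dtype=float)
    m = (pc_a + pc_b) / 2.0       # M vector: mean of the PCs (basal properties)
    e = (pc_a - pc_b) ** 2        # E vector: squared differences of the PCs
    return np.concatenate([m, e])  # assumed layout: [M, E]

drug_a = [1.0, 2.0]
drug_b = [3.0, 0.0]
v = pair_vector(drug_a, drug_b)   # M = [2, 1], E = [4, 4]
```

Under this assumed layout, a pair of drugs each described by 162 PCs would give a 324-dimensional vector, and swapping the two drugs leaves the vector unchanged.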
Generation of the drug-drug relationship score from classification model
Another problem in tackling the classification was that negative samples far outnumbered positive samples, which raised the issue of class imbalance. When all the samples were used, the number of negative samples was about 200 times larger than the number of positive samples. The negatives therefore had to be under-sampled: because machine learning techniques usually seek to minimize the total prediction error, a classifier trained on imbalanced data tends to be biased towards the majority class.
To minimize the problem, all positive samples were kept, whereas an iterative under-sampling procedure was used to construct multiple negative sample sets. First, the density of structural similarity between drugs was obtained by calculating the PubChem structure similarity for all negative drug pairs. After that, a number of negative drug pairs equivalent to the number of positive drug pairs (89,236) was chosen, based on a sampling probability inversely proportional to the density of structural similarity. This procedure aimed to select more diverse negative drug pairs, so as not to be biased towards specific drug groups. The above procedure was repeated ten times to obtain ten negative sample sets. Then, ten Random Forest classification models were constructed, each from the positive samples and one of the negative sample sets. Finally, the classification scores from the ten models were averaged, and the result was regarded as the final drug-drug relationship score. The score is designed to be higher for common-target drug pairs, and ranges from -1 to 1. Note that, to guarantee an “unseen” test set, the score from a single classifier was used only to estimate the classification performance, whereas the average score from the ten classifiers was applied to predict new drug targets.
In this study, Random Forest was used to construct the classification models. Random Forest, developed by Leo Breiman and Adele Cutler, is a collection of tree-based classifiers in which each tree is constructed using an independent feature-sampling procedure [31]. Each tree is built from a sample drawn with replacement, so that about one-third of the samples are left out. These OOB (out-of-bag) samples are used to obtain an unbiased estimate of the classification error. The votes from the ensemble of decision trees determine the most popular class. The Random Forest classifier has been shown to be relatively free from the over-fitting problem as compared to other machine learning methods.
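The density-weighted under-sampling can be sketched as below. This is a hedged illustration only: the text does not say how the similarity density was estimated, so the equal-width histogram, the bin count, the function name, and the toy similarity values are all assumptions.

```python
import numpy as np

def undersample_negatives(similarity, n_select, n_sets=10, bins=20, seed=0):
    """Draw n_sets index sets of negative pairs, each of size n_select, with
    sampling probability inversely proportional to the local density of the
    structural similarity (estimated by an equal-width histogram on [0, 1])."""
    similarity = np.asarray(similarity, dtype=float)
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, bins + 1)
    bin_idx = np.digitize(similarity, edges[1:-1])   # bin index 0 .. bins-1
    counts = np.bincount(bin_idx, minlength=bins)    # pairs per bin (density)
    weights = 1.0 / counts[bin_idx]                  # rarer similarity -> higher weight
    p = weights / weights.sum()
    return [rng.choice(similarity.size, size=n_select, replace=False, p=p)
            for _ in range(n_sets)]

# Toy similarities for 5,000 negative pairs, mostly low as is typical
rng = np.random.default_rng(1)
sim = rng.beta(2, 8, size=5000)
negative_sets = undersample_negatives(sim, n_select=100)
```

Each index set would then be combined with the full positive set to train one classifier, and the ten classifier scores averaged to give the final DRS.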
Validation of classification performance
Two approaches were used to estimate the classification performance. The first was internal cross-validation using out-of-bag (OOB) samples from the Random Forest classifiers. Random Forest performs a type of cross-validation in parallel with the training step by using the OOB error estimate. Specifically, the samples that are left out (about one-third of the samples) after bootstrapping in the training step become OOB samples. Because these OOB samples have not been used in the tree construction, they can be used to estimate the test set error (OOB error).
In addition, external validation using an independent test set was adopted to estimate the general prediction error on unseen data. Prior to the training procedure, 50 drugs were randomly selected, and all drug pairs that included any of those 50 drugs were removed from the training data. After the training procedure, the resulting classifier was tested against the held-out drug pairs. This procedure was used to generate a test set consisting of unseen drug data, and to mimic the virtual screening procedure of scanning a chemical library for the most similar drug. The performances of the internal and external validations were shown by a sensitivity-specificity plot. Sensitivity is defined as TP/(TP+FN) and specificity as TN/(TN+FP), where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives, respectively.
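The size of the OOB set can be checked with a short simulation. Drawing n samples with replacement leaves each sample out with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n, consistent with the "about one-third" figure. The function name and sample size below are illustrative choices.

```python
import numpy as np

def oob_fraction(n_samples, seed=0):
    """Fraction of samples never drawn in one bootstrap of size n_samples."""
    rng = np.random.default_rng(seed)
    drawn = rng.integers(0, n_samples, size=n_samples)  # sampling with replacement
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[drawn] = True
    return 1.0 - in_bag.mean()

frac = oob_fraction(100_000)   # close to 1/e for large n
```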
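The two quantities can be computed at a given score cutoff as in the sketch below. Sweeping the cutoff over the score range yields the points of a sensitivity-specificity plot. The function name, the `>=` convention at the cutoff, and the toy labels and scores are assumptions for illustration.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when pairs scoring at or above
    threshold are predicted to share a target."""
    tp = fn = tn = fp = 0
    for is_positive, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 3 target-sharing pairs and 4 non-sharing pairs
labels = [True, True, True, False, False, False, False]
scores = [0.9, 0.6, 0.2, 0.7, 0.1, -0.3, -0.8]
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
# TP = 2, FN = 1, TN = 3, FP = 1
```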
Drug structural similarity by various fingerprints
In the present study, the 881-bit PubChem fingerprint with the Tanimoto coefficient (ratio of intersection bits to union bits) was regarded as the basic measure of chemical structural similarity. In addition, 1024-bit ExtFP (extends the Fingerprint with additional bits describing ring features), 1024-bit FP (Fingerprint of length 1024 and search depth of 8), 1024-bit GraphFP (specialized version of the Fingerprint which does not take bond orders into account), and 4860-bit KRFP (presence of chemical substructures), all calculated with PaDEL software, were also used to compare the performance of different fingerprints. To estimate the performance, drug pairs were sorted by the Tanimoto coefficient for each fingerprint, and each pair was checked for whether the two drugs shared the same target (Figure 2).
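The Tanimoto coefficient on binary fingerprints can be sketched as follows. Fingerprints are represented here as sets of "on" bit positions, which is a choice made for the example, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: on-bits in common divided by on-bits in either."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0            # assumed convention for two empty fingerprints
    return len(a & b) / union

# Toy fingerprints given as indices of set bits
fp_query = {1, 5, 9, 20}
fp_other = {5, 9, 33}
similarity = tanimoto(fp_query, fp_other)   # 2 shared bits / 5 bits in the union
```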
Prediction of potential targets by the drug-drug relationship score
We developed a drug target prediction scheme based on the DRS. The score of a candidate target for the query drug was obtained by transferring the DRS between the query drug and a database drug that binds to that target. When more than one database drug bound to the target, the highest DRS (between the query and those database drugs) was assigned as the target score. In addition, if targets had the same score, the one whose database drugs more frequently scored above the predefined cutoff (0.5) was ranked first.
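The ranking rule can be sketched as below. All names and the input layout are hypothetical, and the tie-break is interpreted as the number of database drugs for that target whose DRS with the query exceeds 0.5.

```python
from collections import defaultdict

def rank_targets(drs_to_db_drugs, drug_targets, cutoff=0.5):
    """Rank candidate targets for a query drug.

    drs_to_db_drugs: {database drug: DRS between query and that drug}
    drug_targets:    {database drug: list of its known targets}
    Returns (target, score, frequency) tuples, best first.
    """
    best = defaultdict(lambda: float("-inf"))
    freq = defaultdict(int)
    for drug, score in drs_to_db_drugs.items():
        for target in drug_targets.get(drug, ()):
            best[target] = max(best[target], score)   # transfer the highest DRS
            if score > cutoff:
                freq[target] += 1                     # count for breaking ties
    ranked = sorted(best, key=lambda t: (-best[t], -freq[t]))
    return [(t, best[t], freq[t]) for t in ranked]

# Toy example with hypothetical drugs and targets
drs = {"drugA": 0.9, "drugB": 0.9, "drugC": 0.6, "drugD": 0.3}
targets = {"drugA": ["T1"], "drugB": ["T2"], "drugC": ["T2"], "drugD": ["T3"]}
ranking = rank_targets(drs, targets)
```

In the toy example T1 and T2 tie at 0.9, and T2 is ranked first because two of its drugs score above the cutoff.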
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (2009-0086964), a grant of the Korea Healthcare technology R&D Project, Ministry for Health, Welfare & Family Affairs, Republic of Korea [A092006], and the Korea Institute of Science and Technology Information Supercomputing Center.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S2.
1. Clark DE, Pickett SD: Computational methods for the prediction of 'drug-likeness'. Drug Discov Today. 2000, 5 (2): 49-58. 10.1016/S1359-6446(99)01451-8.
2. Turner JV, Maddalena DJ, Cutler DJ: Pharmacokinetic parameter prediction from drug structure using artificial neural networks. Int J Pharm. 2004, 270 (1-2): 209-219. 10.1016/j.ijpharm.2003.10.011.
3. Lee S, Park K, Ahn HS, Kim D: Importance of structural information in predicting human acute toxicity from in vitro cytotoxicity data. Toxicol Appl Pharmacol. 2010, 246 (1-2): 38-48. 10.1016/j.taap.2010.04.004.
4. Park K, Lee S, Ahn HS, Kim D: Predicting the multi-modal binding propensity of small molecules: towards an understanding of drug promiscuity. Mol Biosyst. 2009, 5 (8): 844-853. 10.1039/b901356c.
5. Bonchev D: The overall Wiener index--a new tool for characterization of molecular topology. J Chem Inf Comput Sci. 2001, 41 (3): 582-592. 10.1021/ci000104t.
6. Walters WP, Stahl MT, Murcko MA: Virtual screening - an overview. Drug Discov Today. 1998, 3 (4): 160-178. 10.1016/S1359-6446(97)01163-X.
7. Willett P, Barnard JM, Downs GM: Chemical similarity searching. J Chem Inf Comput Sci. 1998, 38 (6): 983-996. 10.1021/ci9800211.
8. Miller MA: Chemical database techniques in drug discovery. Nat Rev Drug Discov. 2002, 1 (3): 220-227. 10.1038/nrd745.
9. McGregor MJ, Pallai PV: Clustering of large databases of compounds: using the MDL 'keys' as structural descriptors. J Chem Inf Comput Sci. 1997, 37 (3): 443-448. 10.1021/ci960151e.
10. Bender A, Mussa HY, Glen RC, Reiling S: Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci. 2004, 44 (5): 1708-1718. 10.1021/ci0498719.
11. Bajorath J: Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening. J Chem Inf Comput Sci. 2001, 41 (2): 233-245. 10.1021/ci0001482.
12. Xue L, Bajorath J: Molecular descriptors for effective classification of biologically active compounds based on principal component analysis identified by a genetic algorithm. J Chem Inf Comput Sci. 2000, 40 (3): 801-809. 10.1021/ci000322m.
13. Patterson DE, Cramer RD, Ferguson AM, Clark RD, Weinberger LE: Neighborhood behavior: a useful concept for validation of 'molecular diversity' descriptors. J Med Chem. 1996, 39 (16): 3049-3059. 10.1021/jm960290n.
14. Brown RD, Martin YC: The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J Chem Inf Comput Sci. 1997, 37 (1): 1-9. 10.1021/ci960373c.
15. Hagadone TR: Molecular substructure similarity searching - efficient retrieval in 2-dimensional structure databases. J Chem Inf Comput Sci. 1992, 32 (5): 515-521. 10.1021/ci00009a019.
16. Kearsley SK, Sallamack S, Fluder EM, Andose JD, Mosley RT, Sheridan RP: Chemical similarity using physiochemical property descriptors. J Chem Inf Comput Sci. 1996, 36 (1): 118-127. 10.1021/ci950274j.
17. Livingstone DJ: The characterization of chemical structures using molecular properties. A survey. J Chem Inf Comput Sci. 2000, 40 (2): 195-209. 10.1021/ci990162i.
18. Park K, Kim D: Binding similarity network of ligand. Proteins. 2008, 71: 960-971. 10.1002/prot.21780.
19. Sheridan RP, Kearsley SK: Why do we need so many chemical similarity search methods?. Drug Discov Today. 2002, 7 (17): 903-911. 10.1016/S1359-6446(02)02411-X.
20. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB: Predicting new molecular targets for known drugs. Nature. 2009, 462 (7270): 175-181. 10.1038/nature08506.
21. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target identification using side-effect similarity. Science. 2008, 321 (5886): 263-266. 10.1126/science.1158140.
22. Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997, 28 (3): 405-420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L.
23. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res. 2011, 39 (Database issue): D1035-D1041.
24. Knudsen JF, Carlsson U, Hammarström P, Sokol GH, Cantilena LR: The cyclooxygenase-2 inhibitor celecoxib is a potent inhibitor of human carbonic anhydrase II. Inflammation. 2004, 28 (5): 285-290. 10.1007/s10753-004-6052-1.
25. Weber A, Casini A, Heine A, Kuhn D, Supuran CT, Scozzafava A, Klebe G: Unexpected nanomolar inhibition of carbonic anhydrase by COX-2-selective celecoxib: new pharmacological opportunities due to related binding site recognition. J Med Chem. 2004, 47 (3): 550-557. 10.1021/jm030912m.
26. Brueggemann LI, Mani BK, Mackie AR, Cribbs LL, Byron KL: Novel actions of nonsteroidal anti-inflammatory drugs on vascular ion channels: accounting for cardiovascular side effects and identifying new therapeutic applications. Mol Cell Pharmacol. 2 (1): 15-19.
27. Macías A, Moreno C, Moral-Sanz J, Cogolludo A, David M, Alemanni M, Pérez-Vizcaíno F, Zaza A, Valenzuela C, González T: Celecoxib blocks cardiac Kv1.5, Kv4.3 and Kv7.1 (KCNQ1) channels: effects on cardiac action potentials. J Mol Cell Cardiol. 49: 984-992.
28. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34 (Database issue): D668-D672.
29. Yap CW: PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011, 32 (7): 1466-1474. 10.1002/jcc.21707.
30. Yildirim MA, Goh KI, Cusick ME, Barabási AL, Vidal M: Drug-target network. Nat Biotechnol. 2007, 25 (10): 1119-1126. 10.1038/nbt1338.
31. Breiman L: Random Forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.