Drug-drug relationship based on target information: application to drug target identification

Background
Drugs that bind to common targets likely exert similar activities. In this target-centric view, the inclusion of richer target information may better represent the relationships between drugs and their activities. Under this assumption, we expanded the “common binding rule” assumption of QSAR to create a new drug-drug relationship score (DRS).
Method
Our method uses various chemical features to encode drug target information into the drug-drug relationship information. Specifically, drug pairs were transformed into numerical vectors containing the basal drug properties and their differences. After that, machine learning techniques such as data cleaning, dimension reduction, and an ensemble classifier were used to prioritize drug pairs bound to a common target. In other words, the estimation of the drug-drug relationship is restated as a large-scale classification problem, which provides a framework for applying state-of-the-art machine learning techniques with thousands of chemical features to define drug-drug relationships.
Conclusions
Various aspects of the presented score were examined to determine its reliability and usefulness: the abundance of common domains for the predicted drug pairs, approximately 80% coverage for known targets, successful identification of unknown targets, and a meaningful correlation with another cutting-edge method for analyzing drug similarities. The most significant strength of our method is that the DRS can be used to describe phenotypic similarities, such as pharmacological effects.


Introduction
Recently, many studies have examined the quantitative structure-activity relationship (QSAR) between drugs, as researchers seek to characterize chemical compounds in terms of their activities. Thus far, the studies have adopted a mathematical procedure which transforms chemical properties into numeric features, the so-called "molecular descriptor." Until now, many thousands of descriptors have been devised and have proven to be useful for predicting a variety of drug activities, such as drug-likeness [1], pharmacokinetic parameters [2], acute toxicity [3], multi-modal binding propensity [4], and many other physicochemical properties [5] (e.g. log P). Furthermore, descriptors have also been used to infer the drug-drug relationship, which expands the applicability to virtual screening [6,7], chemical library construction [8], drug clustering [9] and classification [10][11][12].
The wide availability of chemical information (descriptors) is based on an implicit assumption that drugs that bind to the same target likely exert similar activities. In line with this thinking, the theory of "neighborhood behavior" [13] has long asserted that structurally similar drugs likely bind to a common therapeutic target. Therefore, it can be said that drug target information is the most direct evidence for inferring a drug's activity. In this target-centric view, the inclusion of richer target information may better represent the relationships between drugs and their activities. However, drug-drug relationships have typically been calculated using chemical structural information [14][15][16]. That is, a chemical structure is converted into numerical features representing various chemical properties [17], and the structural features are then used to define the drug-drug relationship by determining which features are the same and which are different. However, the weak point of this method is that it cannot consider many structurally unrelated drugs bound to a common target [18,19].
In this study, we present a new drug-drug relationship score (DRS) which aims to encode both the drug target information and the global structural similarity. The "common binding rule" assumption of QSAR studies was used and expanded to posit the existence of common rules governing drug-target interaction which could be learned from large-scale drug-target interaction data.
Specifically, more than 2,000 descriptors were used to transform drug pairs into numerical vectors. The estimation of drug-drug relationships was thus restated in a classification framework that prioritizes drug pairs with a common target. This procedure was based on the assumption that drugs sharing a target are much more similar than drugs that are only alike in terms of structure. To improve the reliability of the score, data cleaning, iterative under-sampling, and the ensemble approach were combined with a Random Forest classifier.
The classification performance was validated using both an internal and an external test set. In addition, the reliability and usefulness of the DRS were examined in terms of the abundance of common domains for the predicted drug pairs, approximately 80% coverage for known targets, successful examples of unknown target identification, and a meaningful correlation with another cutting-edge technique. Significantly, the DRS showed better performance for describing similarity in pharmacological effects [8], perhaps due to the encoded target information.

Results and discussion
Generating drug-drug relationship score
To derive the DRS, a drug pair vector was constructed by averaging and subtracting paired drug features in descriptor space (Figure 1). All drug pairs were classified into two groups: positive drug pairs (which shared at least one common target) and negative drug pairs (which did not share any targets). After that, machine learning techniques were adopted to prioritize drug pairs bound to a common target (see the Methods section for the detailed procedures). Conceptually, this procedure implemented the assumption that drugs with common targets might have more similar actions than structurally similar drugs.
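The pair-vector construction can be illustrated with a minimal sketch. It assumes each drug is already a fixed-length list of numeric features and, following the Methods, takes the difference part as the element-wise squared difference. The function name `pair_vector` and the toy values are hypothetical.

```python
from typing import List, Sequence


def pair_vector(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Concatenate the element-wise mean and squared difference of two drugs.

    The mean part stands for the shared (basal) properties of the pair and
    the squared-difference part for how far apart the two drugs are.
    """
    if len(a) != len(b):
        raise ValueError("both drugs must use the same feature space")
    mean_part = [(x + y) / 2.0 for x, y in zip(a, b)]
    diff_part = [(x - y) ** 2 for x, y in zip(a, b)]
    return mean_part + diff_part


# Toy feature values, not taken from the study.
drug_a = [1.0, 0.0, -2.0]
drug_b = [3.0, 0.0, 2.0]
vec = pair_vector(drug_a, drug_b)
print(vec)
```

Both parts are symmetric in the two drugs, so the order in which a pair is presented does not change its vector.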
To estimate the classification performance, we performed internal cross-validation, using out-of-bag (OOB) samples, and external validation, using an independent test set. As a baseline method, 2D structural similarity measures based on the different fingerprints of the drugs were calculated and compared with the DRS. That is, the drug pairs were sorted by the Tanimoto coefficient and checked to see if they shared the same target. The performance is represented by the sensitivity-specificity plot in Figure 2. The results of internal cross-validation showed that the DRS outperformed the 2D similarity measures in retrieving common-target drugs (Figure 2a). When the score threshold was set to zero, the sensitivity and the specificity both reached about 0.8. The results of external validation showed a similar trend, although the performance was slightly lower than in the internal cross-validation (Figure 2b).
These results suggest that the DRS contains more useful target information than traditional similarity measures, and the classification model seems to be unbiased by the huge amounts of negative data. In addition, true positives (correctly predicted drug pairs) covered many structurally unrelated drug pairs (Additional file 1), implying that the DRS could capture the important spatial features of structurally unrelated drug pairs. On the other hand, the performances of the five structural similarity measures were virtually identical, although the PubChem fingerprint showed the best performance.
Predicted drug pairs seem to be promising: high domain-matching ratio
In the classification framework, drug pairs that do not share any known common targets were considered as negative data. However, it is possible that the drugs share common targets that remain unknown because of insufficient knowledge about drug-target interactions. Therefore, using the DRS to mine unknown drug-drug relationships could be very interesting work. Indeed, new similarities between drugs have been used to reposition marketed drugs by revealing unknown drug-drug relationships [20,21]. From this viewpoint, drug pairs predicted as positives might have a better chance of sharing a common target than negative drug pairs.
To test this hypothesis, the PFAM domains [22] of the targets of the negative drugs were investigated to see if the drug pairs had a target of the same domain (Figure 3a). It was assumed that drug targets of the same domains likely bind to the same drug because of their structural and sequence homology. For example, the structural similarity between DB02270 and DB00884 was very low (Tanimoto coefficient based on PubChem fingerprint: 0.15) in spite of a high DRS (0.77, when the range was adjusted from 0 to 1 to match the structural similarity). The maximum target identity between possible target pairs was also relatively low (sequence identity: 23%). However, the overall target structures, especially the ligand binding pockets, were very similar (Cα RMSD 2.56 Å for PDB IDs 1YV5 and 1RQI) because they shared the same PFAM domain: polyprenyl synthetase (PF00348). Indeed, the binding modes of the drugs appeared very similar to one another (Figure 3b). In addition, many drug pairs with potentially similar binding pockets could be discovered by the domain matching information.
Specifically, the proportion of negative drug pairs that shared common PFAM domains was investigated according to the DRS. Note that negative drug pairs are those without any common targets. The results showed that a higher DRS corresponded to a higher domain-matching ratio. For example, more than 50% of drug pairs had common target domains when the DRS was set to 0.5, which was significantly higher than the random expectation (less than 1%). Accordingly, the domain-matching ratio suggests that the DRS might be useful for finding unknown drug-drug relationships.
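The domain-matching ratio can be computed along the following lines. This is only a sketch under stated assumptions: each negative pair is given as a DRS value plus the set of PFAM domains found among each drug's targets, a pair counts as selected when its DRS is at or above the threshold, and a match means the two domain sets intersect. The function name and the toy pairs are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Pair = Tuple[float, Set[str], Set[str]]  # (DRS, domains of drug A, domains of drug B)


def domain_matching_ratio(pairs: Iterable[Pair], threshold: float) -> Optional[float]:
    """Fraction of pairs with DRS >= threshold whose target domains overlap."""
    selected = 0
    matched = 0
    for drs, domains_a, domains_b in pairs:
        if drs >= threshold:
            selected += 1
            if domains_a & domains_b:  # at least one shared PFAM domain
                matched += 1
    if selected == 0:
        return None  # no pair reaches the threshold
    return matched / selected


# Toy negative pairs; the scores and most accessions are placeholders.
toy_pairs = [
    (0.80, {"PF00348"}, {"PF00348"}),
    (0.60, {"PF00001"}, {"PF00002"}),
    (0.55, {"PF00001", "PF00069"}, {"PF00069"}),
    (0.10, {"PF00001"}, {"PF00001"}),
    (-0.40, {"PF00005"}, {"PF00010"}),
]
ratio_at_half = domain_matching_ratio(toy_pairs, 0.5)
print(ratio_at_half)
```

Sweeping the threshold over the score range gives the ratio as a function of the DRS.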

New target identification by drug-drug relationship score
The newly predicted positive drug pairs (i.e. false positives in terms of classification) were used to identify potential targets. The target identification scheme based on the maximum DRS transferred the information on drug-drug relationships to the drug targets (see Methods). This scheme was successful for about 80% of the known drug targets (Additional file 2). To estimate the target-finding capability for unknowns, the drug-target interactions recently discovered by Keiser et al. [20] were used as a test set. Note that only drugs whose discovered targets were not annotated in the DrugBank database were used in this study [23]. This process was similar to finding new targets of known drugs. The tested drugs were DMT (DB01488), Motilium (DB01184), Xenazine (DB04844), Prantal (DB00729), Paxil (DB00715), Prozac (DB00472), and Rescriptor (DB00705), and their known targets are listed in Table 1, along with their DRS values and ranks. In addition, the target scores from the false positive drug pairs (those with a high DRS value but no common target) were separated from those of the known positive drug pairs (which shared a common target). This separation (Table 1) was designed to determine whether the new target predictions were meaningful. For most drugs, the target prediction scheme employing the DRS worked well, even for the new targets discovered by Keiser et al. For example, the alpha-1 adrenergic receptor, the target of Motilium, was found at the fourth rank (with a score that was tied with the first rank). In addition, other targets such as the potassium channel (K+) and serotonin receptor 2A (5HT-2A) were successfully discovered, even though they were not included in the DrugBank database and were thus not in the training set. As expected, the positive drug pairs seemed to be helpful for predicting new targets (e.g. α1 of Motilium, α2 of Xenazine and δ of Prantal) by annotation transfer based on the shared target.
Interestingly, the newly discovered targets (bold) and those targets not annotated in the DrugBank (underlined) could also be discovered by the new DRS predictions.
As another case study, we tried to find the off-targets of celecoxib (DB00482), which is known to show unexpected nanomolar inhibition of carbonic anhydrase 2 [24,25], an effect which was not annotated in the DrugBank database. As expected, the known targets of celecoxib appeared in the predicted target list based on positive drug pairs, but carbonic anhydrase 2 could be found only from the newly predicted drug pairs (score 0.826, first rank). In addition, recent studies have shown that celecoxib blocks human cardiac voltage-gated potassium channels (Kv), which accounts for the drug's known cardiovascular side effects [26,27]. Indeed, the target predictions of celecoxib resulted in high scores for potassium channels, such as potassium voltage-gated channel subfamily C member 4 (0.505), potassium voltage-gated channel subfamily KQT member 1 (0.451), and potassium voltage-gated channel subfamily E member 1 (0.451). Note that the range of the DRS is from -1 to 1.
Correlation with another drug similarity score
Campillos et al. calculated the target-sharing probabilities of drugs based on the similarity of side effects and chemical structure [21]. Because both the target-sharing probability and the DRS prioritized drug pairs with common targets, we compared the two methods for each drug group. In the previous study [21], drug pairs with at least 25% probability of sharing a protein target were selected and divided into five groups: the first group (G1) was drug pairs known to share targets (true positives in our study); the second (G2) was drug pairs with similar structures or targets; the third (G3) was drug pairs without known human targets; the fourth (G4) was drug pairs from the same therapeutic category; and the last (G5) was drug pairs predicted only by the side effect similarities.
Pearson's product-moment correlation coefficient was used to test the significance of the correlation between the two methods. Because the G1 group consisted of drug pairs that shared a target and were included in the training set, the score by our method should obviously be high. On the other hand, all of the drug pairs in the other groups were new predictions, so the significant correlations between the two scores seemed to be meaningful. Specifically, the correlation coefficients in G2, G4, and G5 were 0.688 (p-value 1.74e-07), 0.724 (2.85e-05), and 0.396 (2.41e-05), respectively (Additional file 3). Note that the G3 group was not considered because of the insufficient number (eight) of drug pairs in the group. Accordingly, the two scores are largely correlated with each other even though they use different information.
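For reference, Pearson's coefficient between two paired score lists can be computed as in the sketch below. The function name and the toy scores are hypothetical, and the sketch returns only the coefficient, not the p-values reported above.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation between two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy paired scores for four drug pairs (placeholders, not study data).
drs_scores = [1.0, 2.0, 3.0, 4.0]
other_scores = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(drs_scores, other_scores)
print(r)
```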

Pharmacological effect similarity by drug-drug relationship score
How well does the DRS represent the actions of drugs? To answer this question, the DRS was used to estimate the similarity of pharmacological effects between drugs. For this, the Anatomical Therapeutic Chemical (ATC) system was used (http://www.whocc.no/atc/). The ATC system divides drugs into different groups according to the organ or system on which they act, as well as their therapeutic and chemical characteristics. Reflecting the hierarchical structure of the ATC system, the terms of the 2nd and 3rd ATC levels were considered to see if the DRS correlated with the pharmacological effect similarity. Specifically, the drug pairs used in the external validation set (i.e. unseen data) were sorted by different drug similarity measures, and the number of drugs with matching ATC terms was plotted according to that score (Figure 4). We found that the correlation between

Conclusions
Chemical similarity has frequently been used to estimate relationships between drugs. For example, in the drug discovery process, the chemical library can be scanned with a query drug to find those compounds which bind to the same target as the query. This drug/target activity viewpoint led us to develop a new target-centric drug-drug relationship score (DRS) under the assumption that drugs that bind to a common target have other common factors. Indeed, the DRS was shown to be closely related to similarities in pharmacological effects. In our method, to represent drug pairs with their target information, the estimation of drug-drug relationships was restated as a large-scale classification problem that distinguished drug pairs with a common target. In addition, the classification model was improved through data cleaning, iterative under-sampling, and an ensemble approach in combination with a Random Forest classifier. The usefulness of the DRS was demonstrated with internal and external validations, as well as a high domain-matching ratio for the new predictions, successful identification of unknown targets, and a meaningful correlation with another cutting-edge method for studying drug similarity.

Drug-target interaction data
Drug structures and data on targets and drug-target interactions were retrieved from the DrugBank database (April 2011) [28]. After drugs that caused errors during the descriptor calculation by PaDEL [29] were removed, the numbers of remaining drugs and drug-target interactions were 5,858 and 14,490, respectively. The simple network properties of the relationship are shown in Additional file 5. See the previous work by Yildirim et al. for detailed network properties of the drug-target network [30].

Drug representation by molecular descriptor
Molecular descriptors (descriptors) are the result of standardized numerical calculations and logical, mathematical interpretations of chemical information. To characterize drugs, descriptors were calculated using PaDEL software [29]. Specifically, PaDEL descriptors (801), PubChemFP (PubChem fingerprint, 881), EStateFP (E-State fragments, 79), MACCSFP (MACCS keys, 166) and SubFPC (SMARTS patterns for functional group classification, 307) fingerprints were calculated for each drug. In this procedure, descriptors that generated calculation errors or gave almost the same values for more than 90% of drugs were removed. As a result, 89,354 target-sharing drug pairs were selected as positives, and represented in descriptor space. The drugs were then projected onto the largest 162 principal components (PCs), which cumulatively explained 90% of the variance. The purpose of considering the major principal components was to eliminate noise and remove redundant information derived from inter-correlations between descriptors.
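Two of the steps above lend themselves to a small sketch: dropping descriptors that are nearly constant across drugs, and choosing how many principal components to keep for a target share of variance. The sketch assumes that "almost the same value" can be approximated by equality after rounding, and that the per-component explained-variance ratios come from whatever PCA routine is used. All names and toy numbers are hypothetical.

```python
from collections import Counter
from typing import Sequence


def is_near_constant(column: Sequence[float], max_share: float = 0.90,
                     ndigits: int = 6) -> bool:
    """True if one (rounded) value covers more than max_share of the drugs."""
    counts = Counter(round(v, ndigits) for v in column)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(column) > max_share


def components_for_variance(explained_ratio: Sequence[float],
                            target: float = 0.90) -> int:
    """Smallest number of leading PCs whose cumulative variance reaches target."""
    total = 0.0
    for k, ratio in enumerate(explained_ratio, start=1):
        total += ratio
        if total >= target:
            return k
    raise ValueError("the given components never reach the target variance")


# Toy descriptor columns: 20 drugs each.
mostly_zero = [0.0] * 19 + [1.0]      # one value covers 95% of drugs
informative = [0.0] * 10 + [1.0] * 10  # one value covers 50% of drugs
# Toy explained-variance ratios, sorted from largest to smallest.
toy_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n_components = components_for_variance(toy_ratios)
print(is_near_constant(mostly_zero), is_near_constant(informative), n_components)
```

With real data, the same cumulative-variance rule applied to the full descriptor matrix is what would yield a fixed component count such as the 162 PCs reported here.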

Construction of the drug pair vector
A feature vector representing a drug pair was constructed from the PC-based drug representation (Figure 1). The drug pair vector consisted of an M and an E vector, where the M vector (constructed by averaging PCs between drugs) represents the basal chemical properties and the E vector (obtained by calculating the squared errors of PCs) represents the chemical property differences. Accordingly, the drug pair vector represented the basal chemical properties and their differences.
sampled, because machine learning techniques usually seek to minimize total prediction errors, so the classification for the imbalanced data tends to be biased towards larger samples.
To minimize the problem, all positive samples were kept, whereas an iterative under-sampling procedure was used to construct multiple negative sample sets. First, the density of structural similarity between drugs was obtained by calculating the PubChem structural similarity for all negative drug pairs. After that, a number of negative drug pairs equivalent to the number of positive drug pairs (89,236) was chosen, based on a sampling probability inversely proportional to the density of structural similarity. This procedure aimed to select more diverse negative drug pairs, so as not to be biased towards specific drug groups. The above procedure was repeated ten times to obtain ten negative sample sets. Then, ten Random Forest classification models were constructed, each combining one negative sample set with the positive samples. Finally, the classification scores from the ten classification models were averaged, and the result was regarded as the final drug-drug relationship score. The score is designed to be higher for common-target drug pairs and ranges from -1 to 1. Note that, to guarantee an "unseen" test set, the score from a single classifier was only used to estimate the classification performance, whereas the average score from the ten classifiers was applied to predict new drug targets.
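The under-sampling and averaging steps can be sketched as follows. This is only an illustration under stated assumptions: the similarity density is estimated with a simple equal-width histogram, each pair's weight is the inverse of its bin count, and weighted sampling without replacement uses the Efraimidis-Spirakis key method. None of these choices is specified in the text, and all names and toy values are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_density_weights(similarities: Sequence[float], n_bins: int = 10) -> List[float]:
    """Weight each pair by 1 / (number of pairs in its similarity bin)."""
    bins = [min(int(s * n_bins), n_bins - 1) for s in similarities]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


def weighted_sample(weights: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Draw k distinct indices, favouring large weights (Efraimidis-Spirakis keys)."""
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]


def negative_sample_sets(similarities: Sequence[float], n_positive: int,
                         n_sets: int = 10, seed: int = 0) -> List[List[int]]:
    """Build n_sets negative sets, each the same size as the positive set."""
    rng = random.Random(seed)
    weights = inverse_density_weights(similarities)
    return [weighted_sample(weights, n_positive, rng) for _ in range(n_sets)]


def ensemble_score(per_model_scores: Sequence[float]) -> float:
    """Average the scores that the individual classifiers give to one drug pair."""
    return sum(per_model_scores) / len(per_model_scores)


# Toy data: 90 negative pairs in a crowded low-similarity bin, 10 in a sparse one.
similarities = [0.1] * 90 + [0.9] * 10
weights = inverse_density_weights(similarities)
sets = negative_sample_sets(similarities, n_positive=10)
print(len(sets), ensemble_score([0.2, 0.4, 0.6]))
```

In this toy case the two occupied bins carry the same total weight, so pairs from the sparse region are drawn far more often than their raw frequency would suggest, which is the intended diversifying effect.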
In this study, Random Forest was used to construct the classification models. Random Forest, developed by Leo Breiman and Adele Cutler, is a collection of tree-based classifiers in which each tree is constructed using an independent feature-sampling procedure [31]. Each tree is built by sampling with replacement, so that about one-third of the samples are left out. These out-of-bag (OOB) samples are used to get an unbiased estimate of the classification error. The voting results from the ensemble of decision trees determine the most popular class. The Random Forest classifier has been shown to be relatively free from the over-fitting problem as compared to other machine learning methods.

Validation of classification performance
Two approaches were used to estimate the classification performance. The first of these was internal cross-validation using out-of-bag (OOB) samples from the Random Forest classifiers. Random Forest performs a type of cross-validation in parallel with the training step by using the OOB error estimate. Specifically, the samples that are left out (about one-third of samples) after bootstrapping in the training step become OOB samples. Because these OOB samples have not been used in the tree construction, they can be used to estimate test set errors (OOB error).
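The "about one-third" figure follows from a standard property of bootstrap sampling. As a short worked derivation, assuming each tree draws n samples with replacement from n training samples, the probability that a given sample is never drawn is:

```latex
\[
P(\text{sample is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n}
\;\xrightarrow[n \to \infty]{}\; e^{-1} \approx 0.368
\]
```

So each tree leaves out roughly 37% of the training samples, which is the portion available for its OOB error estimate.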
In addition, external validation using an independent test set was adopted to estimate the general prediction error on unseen data. Prior to the training procedure, 50 drugs were randomly selected, and all drug pairs that included any of those 50 drugs were removed from the training data. After the training procedure, the resulting classifier was tested against these held-out drug pairs. This procedure was used to generate a test set consisting of unseen drug data, and to mimic a virtual screening procedure that scans a chemical library for the most similar drug. The performances of the internal cross-validation and the external validation are shown in a sensitivity-specificity plot. Sensitivity is defined as TP/(TP+FN) and specificity as TN/(TN+FP), where TP is the number of true positives, FN of false negatives, TN of true negatives, and FP of false positives.
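As a small worked example of these two definitions, the sketch below computes sensitivity and specificity at one score threshold. It assumes a pair is predicted positive when its score is at or above the threshold. The function name and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            threshold: float = 0.0) -> Tuple[float, float]:
    """Return (TP/(TP+FN), TN/(TN+FP)) for predictions score >= threshold."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative pair")
    return tp / (tp + fn), tn / (tn + fp)


# Toy scores in [-1, 1]; label 1 = pair shares a target, 0 = it does not.
toy_scores = [0.9, 0.2, -0.3, 0.4, -0.8, -0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold=0.0)
print(sens, spec)
```

Repeating the calculation over a range of thresholds traces out a sensitivity-specificity curve of the kind shown in Figure 2.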

Drug structural similarity by various fingerprints
In the present study, the 881-bit PubChem fingerprint with the Tanimoto coefficient (ratio of intersection bits to union bits) was regarded as the basic measure of chemical structural similarity. In addition, 1024-bit ExtFP (extends the Fingerprint with additional bits describing ring features), 1024-bit FP (Fingerprint of length 1024 and search depth of 8), 1024-bit GraphFP (specialized version of the Fingerprint which does not take bond orders into account), and 4860-bit KRFP (presence of chemical substructures) calculated with PaDEL software were also used to compare the performance of different fingerprints. To estimate the performance, drug pairs were sorted by the Tanimoto coefficient using the different fingerprints to check if the two drugs shared the same target (Figure 2).
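The Tanimoto coefficient itself is simple to compute once each fingerprint is represented as the set of its "on" bit positions. The sketch below assumes that representation and, as a convention of this example only, returns 0.0 when both fingerprints are empty. The bit positions are made up.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Ratio of shared on-bits to the union of on-bits of two fingerprints."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(bits_a & bits_b) / union


# Toy fingerprints given as positions of set bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
similarity = tanimoto(fp_a, fp_b)
print(similarity)
```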

Prediction of potential targets by the drug-drug relationship score
We developed a drug target prediction scheme based on the DRS. The target score for the query drug was obtained by transferring the DRS between the query drug and a database drug that binds to the target. When two or more database drugs bind to the target, the highest DRS (between the query and the database drugs) was assigned as the target score. In addition, if targets had the same score, the one whose drugs more frequently scored above the predefined cutoff (0.5) was ranked first.
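The scheme can be sketched as a maximum-score transfer with a tie-break. The sketch assumes the tie-break counts how many database drugs binding a target have a DRS above 0.5 with the query, which is one reading of the rule above. The function name, drug identifiers and target names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rank_targets(query_drs: Dict[str, float], drug_targets: Dict[str, Set[str]],
                 cutoff: float = 0.5) -> List[Tuple[str, float, int]]:
    """Rank targets by the highest DRS of any binding drug, then by hit count."""
    best: Dict[str, float] = {}
    hits: Dict[str, int] = defaultdict(int)
    for drug, score in query_drs.items():
        for target in drug_targets.get(drug, set()):
            # Transfer the pair score to the target, keeping the maximum.
            best[target] = max(best.get(target, float("-inf")), score)
            if score > cutoff:
                hits[target] += 1
    # Sort by score, then by number of supporting drugs, then by name for stability.
    ordered = sorted(best, key=lambda t: (-best[t], -hits[t], t))
    return [(t, best[t], hits[t]) for t in ordered]


# Toy DRS values between a query drug and four database drugs.
query_drs = {"D1": 0.8, "D2": 0.8, "D3": 0.6, "D4": 0.3}
drug_targets = {
    "D1": {"T_A"},
    "D2": {"T_B"},
    "D3": {"T_B", "T_C"},
    "D4": {"T_A"},
}
ranking = rank_targets(query_drs, drug_targets)
print(ranking)
```

Here T_A and T_B tie on the maximum score, and T_B ranks first because two of its drugs exceed the cutoff.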
the target score and the success ratio represent that the score finds the known targets within the corresponding rank (x-axis).