Predicting target-ligand interactions using protein ligand-binding site and ligand substructures
© Wang et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
Cell proliferation, differentiation, gene expression, metabolism, immunization and signal transduction all require the participation of ligands and targets. Identifying the rules that govern molecular recognition between the chemical topological substructures of ligands and the binding sites of targets remains a great challenge.
We assume that ligand-target interactions are determined by ligand substructures together with the physical-chemical properties of the binding sites. We therefore propose a fragment interaction model (FIM) to describe the interactions between ligands and targets, with the aim of facilitating the chemical interpretation of ligand-target binding. First, we extract target-ligand complexes from the sc-PDB database and obtain the target binding sites and the ligands from them. We then represent each binding site as a fragment vector based on a target fragment dictionary composed of 199 clusters (denoted as fragments in this work), obtained by clustering 4200 trimers according to their physical-chemical properties. Next, we represent each ligand as a substructure vector based on a dictionary containing 747 substructures. Finally, we build the FIM by generating the interaction matrix M (representing the fragment interaction network). The FIM can then be used to predict unknown ligand-target interactions and to provide the binding details of those interactions.
Five-fold cross-validation shows that the proposed model achieves a higher AUC score (92%) than three prevalent algorithms, CS-PD (80%), BLM-NII (85%) and RF (85%), demonstrating the remarkable predictive ability of FIM. We also show that the ligand binding sites (local information) outweigh sequence similarities (global information) in ligand-target binding, and that introducing too much global information harms the predictive ability. Moreover, the derived fragment interaction network provides chemical insights into the interactions.
Target-ligand binding is a local event, and local information dominates the binding ability. Although integrating global information can improve the predictive ability, its contribution is very limited. The fragment interaction network is helpful for understanding the mechanism of ligand-target interaction.
Through various high-throughput experimental projects for analyzing the genome, transcriptome and proteome, we are beginning to understand the genomic space. Simultaneously, the high-throughput screening of large-scale chemical compound libraries with various biological assays enables us to explore the chemical space. However, our knowledge about the relationship between the chemical and genomic spaces is very limited. For example, the PubChem database at NCBI stores information on millions of chemical compounds, but the number of compounds with information on their target proteins is very limited. Therefore, there is a strong incentive to develop new methods capable of detecting these potential target-ligand interactions efficiently.
Because experimental approaches are limited by time and cost, a number of approaches attempt to predict target-ligand relationships in silico. Traditional computational methods fall roughly into two categories: target-based approaches and ligand-based approaches. Target-based approaches rely mainly on target information. Molecular docking is a target-based approach [4, 5] that predicts the preferred binding orientation through conformation searching and energy minimization. Docking can provide excellent conformations, but it is difficult to find a ranking or evaluation function that selects the most appropriate orientation. Another target-based method compares target similarities, that is, it compares the targets of a given ligand by sequence, EC number, domains, 3D structure, etc. Ligand-based methods compare candidate ligands with the known ligands of a given target to make a prediction. Three-dimensional quantitative structure-activity relationship (3D-QSAR) modeling is a typical ligand-based approach, which indirectly reflects the non-bonding interaction characteristics between the ligand and the target. The most widely used 3D-QSAR methods are comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA). CoMFA first aligns the ligands capable of binding to a given target and then measures the field intensities around the aligned ligands with different atom probes (force field-based). Finally, the measured field intensities are regressed against the activity values, and the regression equation is applied to predict interactions. Moreover, the CoMFA coefficients can be mapped back into 3D space to obtain a 3D-QSAR model, which can guide the optimization of lead compounds.
Recently, some methods that consider both target and ligand information have proven promising in drug design and discovery. Jacob et al. used the EC number (Enzyme Commission number) and PubChem fingerprints (a set of molecular substructures) to represent targets and ligands respectively, and proposed a pairwise support vector machine (pSVM) method to predict target-ligand interactions. Laarhoven et al. described the targets and ligands by sequences and compound 3D structures respectively, and introduced the target-ligand interaction network to build the prediction model. Bleakley et al. proposed the Bipartite Local Model (BLM), which integrates the ligand-based and target-based methods to generate a comprehensive prediction. BLM has been further studied by Xia et al., Laarhoven et al. and Mei et al. [12, 10, 13]. BLM shows very good predictive ability; however, it cannot handle the situation in which both the ligand and the target are unseen in the training set. Yamanishi et al. represented the genomic space with sequences and target profiles, and the chemical space with compound 3D structures and ligand profiles, and then generated a uniform "pharmacological space" to build the prediction model. Cheng et al. applied the mass distribution property from physics to the target-ligand network to predict target-ligand interactions. Cao et al. integrated the genomic and chemical spaces into random forests to obtain better predictive ability.
Materials and methods
Statistics of the data set
Representation of targets' binding sites based on target dictionary
where α(α01), α(α11), α(α12) and α_tri(α01, α11, α12) are 5-dimensional vectors. α01 is the center (major) amino acid, and α11 and α12 are the left and right (subordinate) amino acids, respectively. The positions of α11 and α12 are not distinguished, which means that α_tri(α01, α11, α12) and α_tri(α01, α12, α11) are equivalent. In the third layer, hierarchical clustering (Ward's algorithm) is used to cluster the 4200 trimers into 199 clusters [19, 18]. Together, the clusters constitute the dictionary.
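The figure of 4200 trimers follows from the symmetry between the two subordinate positions: 20 choices for the center residue times 210 unordered neighbour pairs (repeats allowed). The following minimal sketch enumerates them. The helper name `canonical_trimer` and the alphabetical ordering of the one-letter codes are illustrative assumptions, not taken from the paper, and the descriptor vectors and the clustering step are not reproduced here.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def canonical_trimer(center, left, right):
    """Return a key that is identical for (center, left, right) and (center, right, left)."""
    first, second = sorted((left, right))
    return (center, first, second)


# Enumerate every distinct trimer: a center residue plus an unordered neighbour pair.
trimers = [
    (center, a, b)
    for center in AMINO_ACIDS
    for a, b in combinations_with_replacement(AMINO_ACIDS, 2)
]

print(len(trimers))  # 20 * 210 = 4200
```

Storing each trimer under its canonical key means that a lookup never depends on which neighbour happens to be read first.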
where B_s(·) denotes a binding site feature vector and c(·) denotes a cluster; for example, c(G(N N), G(M N), ...) represents a cluster that contains G(N N), G(M N), etc. Because the trimers in the same cluster have similar chemical properties, the clusters can be viewed as chemical "groups", on the basis of which the ligand binding sites are decomposed into fragments.
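As an illustration of how a binding site might be turned into a fragment vector, the sketch below maps each trimer of a site to its cluster and counts the hits per cluster. Whether the paper uses counts or presence/absence is not stated in the surviving text, so the counting is an assumption; the function name, the toy dictionary and the three-cluster size (instead of 199) are likewise hypothetical.

```python
def binding_site_vector(site_trimers, trimer_to_cluster, n_clusters):
    """Count how many trimers of the site fall into each cluster (fragment)."""
    vector = [0] * n_clusters
    for center, left, right in site_trimers:
        # Neighbour order does not matter, so look up the sorted form.
        key = (center, *sorted((left, right)))
        vector[trimer_to_cluster[key]] += 1
    return vector


# Hypothetical toy dictionary with 3 clusters instead of 199.
toy_clusters = {
    ("G", "N", "N"): 0,
    ("G", "M", "N"): 0,
    ("D", "A", "K"): 1,
    ("I", "R", "T"): 2,
}
site = [("G", "N", "M"), ("G", "N", "N"), ("I", "T", "R")]
print(binding_site_vector(site, toy_clusters, 3))  # [2, 0, 1]
```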
Generation of ligand dictionary and ligand representation
Representing the chemical space involves two steps: defining a dictionary and describing ligands as features. We integrated several data sources to build the dictionary (data are shown in the supplementary materials). The PubChem database provides 881 predefined chemical substructures. We modified these fingerprints to fit our model. First, single atoms and bonds were removed because they are not at the same structural level as trimers. Second, some substructures, such as benzene, were removed because they are too common to serve as discriminative features. Third, the functional groups (fingerprints) of molecules defined in Checkmol were integrated. Finally, we generated a dictionary of 747 substructures. Based on this dictionary, each ligand was represented by a 747-dimensional binary vector whose elements indicate the presence or absence of each substructure by 1 or 0.
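The binary encoding can be sketched as follows. The substructure matching itself (for example, pattern matching with a cheminformatics toolkit) is assumed to have been done elsewhere; the function name, the substructure labels and the five-entry dictionary standing in for the 747 substructures are all hypothetical.

```python
def ligand_vector(matched_substructures, dictionary):
    """Return a 0/1 list marking which dictionary substructures the ligand contains."""
    matched = set(matched_substructures)
    return [1 if name in matched else 0 for name in dictionary]


# Hypothetical 5-entry dictionary standing in for the 747 substructures.
toy_dictionary = ["keto", "primary_amine", "carboxyl", "hydroxyl", "thioether"]
print(ligand_vector({"keto", "hydroxyl"}, toy_dictionary))  # [1, 0, 0, 1, 0]
```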
Construction of fragment interaction model
where s* represents a binding site and l* represents a ligand; s* and l* need not appear in the data set. M represents the interaction matrix (network) between the genomic and chemical spaces. If sign(F(s*, l*)) is 1, we predict a positive interaction; otherwise we predict a negative interaction (sign(·) is the sign function, which returns −1 or 1).
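A minimal sketch of the prediction step, assuming that F takes the bilinear form s*ᵀ M l* (the equation itself is not reproduced in the text here). The optional `bias` term, the tie-breaking of a zero score to −1, the function names and the toy matrix are assumptions made for illustration.

```python
def interaction_score(site_vec, ligand_vec, M, bias=0.0):
    """Bilinear score: sum over a, b of site_vec[a] * M[a][b] * ligand_vec[b], plus bias."""
    return bias + sum(
        s_a * M[a][b] * l_b
        for a, s_a in enumerate(site_vec)
        for b, l_b in enumerate(ligand_vec)
    )


def predict(site_vec, ligand_vec, M, bias=0.0):
    """Return 1 for a predicted interaction, -1 otherwise (a zero score maps to -1)."""
    return 1 if interaction_score(site_vec, ligand_vec, M, bias) > 0 else -1


# Toy matrix: 2 site fragments x 3 ligand substructures.
toy_M = [[0.5, -1.0, 0.0],
         [0.0, 2.0, -0.5]]
site = [1, 2]
print(predict(site, [1, 0, 1], toy_M))  # -1  (score = 0.5 - 1.0 = -0.5)
print(predict(site, [0, 1, 0], toy_M))  # 1   (score = -1.0 + 4.0 = 3.0)
```

Because the score is a plain sum over fragment pairs, each term s_a · M[a][b] · l_b can be read as the contribution of one site fragment and one ligand substructure.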
Although Equation 7 has appeared in many papers [9, 11, 23], the kernels in those works were nonlinear and could not be mapped back to the input features (because of the kernel trick), so little was known about how the genomic space interacts with the chemical space. In this paper, we adopt a linear kernel without the kernel trick, so that the genomic and chemical interaction matrix can be calculated through Equation 10, which makes the model chemically interpretable.
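To make the role of the linear kernel concrete: if the pairwise kernel is the product of linear kernels, K((s, l), (s', l')) = (s·s')(l·l'), then a dual decision function Σ_i c_i K((s_i, l_i), (s, l)) collapses to sᵀ M l with M = Σ_i c_i s_i l_iᵀ. The sketch below checks this identity numerically. The product form of the kernel and the correspondence with Equation 10 are assumptions, the coefficients c_i (standing for dual weight times label) are made-up numbers, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def interaction_matrix(coeffs, sites, ligands):
    """M = sum_i c_i * outer(s_i, l_i), built explicitly from the training pairs."""
    n_s, n_l = len(sites[0]), len(ligands[0])
    M = [[0.0] * n_l for _ in range(n_s)]
    for c, s, l in zip(coeffs, sites, ligands):
        for a in range(n_s):
            for b in range(n_l):
                M[a][b] += c * s[a] * l[b]
    return M


def dual_score(coeffs, sites, ligands, s_new, l_new):
    """Kernel form: sum_i c_i * (s_i . s_new) * (l_i . l_new)."""
    return sum(c * dot(s, s_new) * dot(l, l_new)
               for c, s, l in zip(coeffs, sites, ligands))


def bilinear_score(M, s_new, l_new):
    """Explicit form: s_new^T M l_new."""
    return sum(s_new[a] * dot(M[a], l_new) for a in range(len(s_new)))


# Toy training pairs (2 site fragments, 3 ligand substructures).
coeffs = [0.5, -1.0, 0.25]
sites = [[1, 0], [0, 2], [1, 1]]
ligands = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
M = interaction_matrix(coeffs, sites, ligands)

s_new, l_new = [2, 1], [1, 0, 1]
print(dual_score(coeffs, sites, ligands, s_new, l_new))  # 2.5
print(bilinear_score(M, s_new, l_new))                   # 2.5
```

With a nonlinear kernel no such finite matrix M exists in the input feature space, which is why the explicit fragment-by-fragment weights are available only in the linear case.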
Representative methods for comparison
To evaluate the proposed method, we chose three representative methods for comparison: the chemical substructures and protein domains correlation model (CS-PD), the bipartite local model with neighbor-based interaction-profile inferring (BLM-NII) and random forest (RF).
• CS-PD: In the CS-PD model, proteins are described by domains and ligands are represented by substructures. The sparse canonical correspondence analysis (SCCA) algorithm is applied to recognize the physical-chemical factors between the domains and substructures. In the prediction phase, the domain and substructure physical-chemical factors of a given target-ligand pair are added to generate a discriminant value. If the value is higher than a threshold, the target and ligand are predicted to interact with each other.
• BLM-NII: On the one hand, excluding target t_i, a list is made of all other known targets of ligand l_j, together with a separate list of the targets not known to be targeted by ligand l_j. The known targets are given the label +1 and the others the label −1. A classification rule is then sought that discriminates the +1-labeled data from the −1-labeled data using the available genomic sequence data for the targets. This rule is applied to predict the label of target t_i and ligand l_j. On the other hand, fixing the same target t_i and excluding ligand l_j, a list is made of all other known ligands targeting t_i, together with a list of ligands not known to target t_i. As before, ligands known to target t_i are given the label +1 and the others the label −1. A classification rule is sought that discriminates the +1-labeled data from the −1-labeled data using the available chemical structure data for the ligands. This rule is also used to predict the label of target t_i and ligand l_j. Finally, the two results are combined to generate a final label. For new targets or ligands, neighbor-based interaction-profile inferring is applied to obtain an interaction profile.
• RF: The targets are described by CTD (Composition-Transition-Distribution) features, and the ligands are represented as fingerprints. The two kinds of features are then concatenated into a single vector to generate the data set. Finally, random forest (RF) is employed to predict interactions.
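The two-sided procedure of the bipartite local model can be sketched as below. This is a simplified illustration only: the "classification rule" is replaced by a similarity-weighted vote, the two sides are combined by taking the maximum (one possible choice), and the neighbor-based inferring step for new targets or ligands is omitted. The function names, the label matrix and the similarity values are made up.

```python
def local_score(similarities, labels, skip):
    """Similarity-weighted vote over known labels, ignoring the held-out index."""
    return sum(sim * lab
               for k, (sim, lab) in enumerate(zip(similarities, labels))
               if k != skip)


def blm_score(Y, target_sim, ligand_sim, i, j):
    """Score the pair (target i, ligand j); Y[k][m] is +1 or -1 for known labels."""
    # Target side: label the other targets by whether they bind ligand j,
    # then score target i from its similarity to those targets.
    target_labels = [row[j] for row in Y]
    from_targets = local_score(target_sim[i], target_labels, skip=i)
    # Ligand side: label the other ligands by whether they bind target i,
    # then score ligand j from its similarity to those ligands.
    from_ligands = local_score(ligand_sim[j], Y[i], skip=j)
    # Combine the two local predictions (here: the maximum).
    return max(from_targets, from_ligands)


# Toy data: 3 targets x 3 ligands.
Y = [[1, -1, 1],
     [1, -1, -1],
     [-1, 1, -1]]
target_sim = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
ligand_sim = [[1.0, 0.2, 0.8],
              [0.2, 1.0, 0.1],
              [0.8, 0.1, 1.0]]

score = blm_score(Y, target_sim, ligand_sim, 0, 0)
print(score > 0)  # True: the pair (target 0, ligand 0) is predicted to interact
```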
Investigation on the interaction data
Comparison result of the prediction performances
Table 2 shows that the ACC and AUC scores of CS-PD are 56.5% and 79.9% respectively, which means that its rate of correct predictions is only slightly higher than that of random guessing (the expected correct rate of random guessing is 50%) and its overall performance is not good. We suspect that CS-PD performs poorly because it lacks a powerful classifier and serves only as a feature extraction approach. BLM-NII performs well on our data set, but not as well as on its original data set (Yamanishi's "Gold Standard"). The AUC score of BLM-NII is 85.8% on our data set, whereas it exceeds 98% in all four categories (enzyme, ion channel, GPCR, nuclear receptor) of its original data set. The difference between the data sets could be the main cause of the AUC difference. Unfortunately, not all the crystal structures of the targets in Yamanishi's data set have been determined, so we could not apply our approach to the "Gold Standard". The ACC and AUC scores of RF are 74.3% and 85.1% respectively, which are similar to those of BLM-NII. The bagging ensemble procedure might improve the predictive ability of the RF model. The ACC and AUC of FIM are 82.7% and 91.6% respectively, which are much higher than those of CS-PD, BLM-NII and RF. Compared with the state-of-the-art method (BLM-NII), the ACC and AUC scores improve by more than 10% and 5% respectively. In short, FIM shows remarkable predictive ability and outperforms the other three approaches on our data set.
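For readers less familiar with the two metrics, the sketch below computes ACC at a fixed threshold and AUC as the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair. It also shows how a model can rank pairs fairly well (high AUC) while its thresholded predictions are mediocre (low ACC), the pattern seen for CS-PD. The labels, scores and the threshold of 0 are made-up illustrations, not data from the paper.

```python
def accuracy(labels, scores, threshold=0.0):
    """Fraction of pairs whose thresholded score matches the +1/-1 label."""
    predicted = [1 if s > threshold else -1 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a positive outranks a negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, -1, -1, -1]
scores = [2.0, 0.5, -0.3, 0.4, -1.0, -2.0]
print(round(accuracy(labels, scores), 3))  # 0.667
print(round(auc(labels, scores), 3))       # 0.889
```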
The role of global information in the binding
Based on the above facts, it is reasonable to infer that the fluctuation caused by global information is limited and that local information dominates the predictive accuracy for binding, which supports our assumption that target-ligand binding is a local event.
Fragment interaction network analysis
In this section, we first give a brief overview of the fragment interaction matrix. We then investigate the underlying chemical mechanisms of the fragment interactions.
According to the hypothesis, the feature interactions reflect chemical interactions; it is therefore necessary to investigate whether the feature interactions are consistent with the hypothesis. Since the number of interactions is large, we analyze only the top twenty interactions (Figure 5C); the others can be analyzed similarly. In Figure 5C, the first letter of a site fragment is the center amino acid of the trimer cluster, and the letters in parentheses represent the subordinate amino acids. SMARTS strings (a kind of molecular pattern) represent the ligand fragments. Figure 5C suggests that the feature interactions reflect the chemical interactions well, which is consistent with the hypothesis. For example, the major amino acid of site fragment 147 (TF147) is aspartic acid (abbreviated D), which can interact with ligand fragment 92 (LF92, containing a keto group) through a hydrogen bond if the distance and orientation are appropriate. In some situations, the major amino acid of a target feature cannot form a significant interaction with the ligand feature, but a subordinate amino acid can. For example, the major amino acid of site fragment 57 (TF57) is isoleucine (abbreviated I), which is a hydrophobic amino acid. Isoleucine cannot interact with ligand fragment 44 (LF44), which contains an amino group. However, the subordinate amino acids of site fragment 57, such as threonine (abbreviated T) and arginine (abbreviated R), can form a hydrogen bond with ligand fragment 44 if the distance and orientation are appropriate.
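Selecting the strongest fragment pairs from an interaction matrix can be sketched as follows. Ranking by the signed weight (rather than, say, its absolute value) is an assumption, since the selection criterion for the top twenty is not spelled out in the surviving text; the function name and the toy matrix are hypothetical.

```python
import heapq


def top_interactions(M, k):
    """Return the k largest entries of M as (weight, site_fragment, ligand_fragment)."""
    entries = ((w, a, b) for a, row in enumerate(M) for b, w in enumerate(row))
    return heapq.nlargest(k, entries)


# Toy matrix: 2 site fragments x 3 ligand fragments.
toy_M = [[0.1, 0.9, -0.4],
         [0.7, -0.2, 0.3]]
print(top_interactions(toy_M, 2))  # [(0.9, 0, 1), (0.7, 1, 0)]
```

Each returned index pair can then be mapped back to its trimer cluster and ligand substructure for chemical inspection.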
Discussion and conclusion
In this work, we regard binding as a local event and emphasize local information in target-ligand interaction prediction. We use site-ligand interactions instead of target-ligand interactions and propose a chemically interpretable model to describe them. We first extract the ligand-binding sites from target-ligand complexes. We then break the binding sites and ligands into fragments so that they can be encoded as fragment vectors based on the target and ligand dictionaries respectively. Finally, we assume that the fragment interactions determine the site-ligand interaction and propose a model, the fragment interaction model (FIM), to formalize this assumption. The proposed model achieves a higher AUC score (92%) than three prevalent algorithms, CS-PD (80%), BLM-NII (85%) and RF (85%). In addition, the fragment interaction network derived from FIM is chemically interpretable. Compared with the BLM-NII, RF and CS-PD models, FIM requires a crystal structure to extract the local information (binding site), which sometimes hinders its application. However, with the growing number of determined protein crystal structures and advances in molecular modeling techniques, a 3D structure can be modeled computationally and the binding site extracted from it.
Compared with traditional target-based or ligand-based approaches, the proposed FIM method has the advantage of finding target candidates and ligand candidates simultaneously. Moreover, FIM can predict the interaction between previously unseen targets and ligand candidates. Unlike other methods that use both target and ligand information, our method emphasizes the basic chemical interactions between amino acids and ligand fragments, which is more general and could be applied beyond drug-target interactions. Furthermore, we no longer represent the target as a whole but extract the ligand-binding sites from target-ligand complexes and use the binding sites to describe the genomic space. On the one hand, representing the genomic space by binding sites allows us to provide site-ligand interaction predictions, which is important for multi-site targets. On the other hand, the binding sites are local, which helps to achieve a chemically interpretable model. Following this approach, we break the binding sites and ligands into fragments, and regard the fragment interactions as the interactions between the genomic and chemical spaces. Under FIM, it is therefore clear how the genomic space interacts with the chemical space.
In summary, we highlight the local information in the binding process and attempt to establish a clear relationship between the genomic and chemical spaces. The proposed model (FIM) uses the ligand binding sites as local information and views the interactions between binding site fragments and ligand fragments as the interactions between the genomic and chemical spaces. The fragment interactions are straightforward and chemically interpretable, and the fragment interaction network reflects the chemical interactions. The comparison shows that FIM outperforms the other three approaches. The investigation of the role of global information shows that local information dominates the predictive accuracy and that integrating global information improves the predictive ability only to a very limited extent.
Publication of this article has been funded by the National Science Foundation of China [61272274, 60970063, 61402340, 31270101]; the program for New Century Excellent Talents in Universities [NCET-10-0644]; the Ministry of Science and Technology of China (973 and 863 Programs) and the National Mega Project on Major Drug Development.
This article has been published as part of BMC Systems Biology Volume 9 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/9/S1
- Wheeler DL, Barrett T, Bryant SH, Canese K, Benson DA: Database resources of the national center for biotechnology information. Nucleic Acids Res. 2006, 34 (Database issue): 173-180.
- Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008, 24: 232-240. 10.1093/bioinformatics/btn162.
- Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol. 2007.
- Cheng AC, Coleman RG, Smyth KT, Cao Q, Soulard P, Caffrey DR, Salzberg AC, Huang ES: Structure-based maximal affinity model predicts small-molecule druggability. Nat Biotechnol. 2007, 25: 71-75. 10.1038/nbt1273.
- Rarey M, Kramer B, Lengauer T, Klebe G: A fast flexible docking method using an incremental construction algorithm. J Mol Biol. 1996, 261: 470-489. 10.1006/jmbi.1996.0477.
- Kitchen DB: Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov. 2004, 3: 935-949. 10.1038/nrd1549.
- Nantasenamat C: A practical overview of quantitative structure-activity relationship. Excli J. 2009, 8: 74-88.
- Chen B, Wild D, Guha R: Pubchem as a source of polypharmacology. J Chem Inf Model. 2009, 49: 2044-2055. 10.1021/ci9001876.
- Jacob L: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008, 24: 2149-2156. 10.1093/bioinformatics/btn409.
- Laarhoven T: Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011, 27: 3036-3043. 10.1093/bioinformatics/btr500.
- Bleakley K, Yamanishi Y: Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics. 2009, 25: 2397-2403. 10.1093/bioinformatics/btp433.
- Xia Z, Wu LY, Zhou X, Wong ST: Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol. 2010, 4 (Suppl 2): 6. 10.1186/1752-0509-4-S2-S6.
- Mei JP, Kwoh CK, Yang P, Li XL, Zheng J: Drug-target interaction prediction by learning from local information and neighbors. Bioinformatics. 2013, 29: 238-245. 10.1093/bioinformatics/bts670.
- Cheng F, Liu C: Prediction of drug-target interactions and drug repositioning via network-based inference. PLOS Computational Biology. 2012, 8: 1002503. 10.1371/journal.pcbi.1002503.
- Cao DS, Liang YZ, Deng Z, Hu QN, He M, Xu QS, Zhou GH, Zhang LX, Deng ZX, Liu S: Genome-scale screening of drug-target associations relevant to ki using a chemogenomics approach. PLoS One. 2013, 8: 57680. 10.1371/journal.pone.0057680.
- Meslamani J, Rognan D, Kellenberger E: sc-pdb: a database for identifying variations and multiplicity of 'druggable' binding sites in proteins. Bioinformatics. 2011, 27: 1324-1326. 10.1093/bioinformatics/btr120.
- Berman HM: The protein data bank: a historical perspective. Acta Crystallographica Section A: Foundations of Crystallography. 2008, A64: 88-95.
- Nagamine N, Sakakibara Y: Statistical prediction of protein chemical interactions based on chemical structure and mass spectrometry data. Bioinformatics. 2007, 23: 2004-2012. 10.1093/bioinformatics/btm266.
- Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics. 2005, 21: 218-226. 10.1093/bioinformatics/bth483.
- Hotelling H: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology. 1933, 24: 417-441.
- Ward JH: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 1963, 58: 236-244. 10.1080/01621459.1963.10500845.
- Haider N: Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach. Molecules. 2010, 15: 5079-5092. 10.3390/molecules15085079.
- Yamanishi L: Extracting sets of chemical substructures and protein domains governing drug-target interactions. J Chem Inf Model. 2011, 51: 1183-1194.
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S: From genomics to chemical genomics: new developments in kegg. Nucleic Acids Res. 2006, 34 (Database issue): 354-357.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.