Volume 7 Supplement 6
Stringent DDI-based Prediction of H. sapiens-M. tuberculosis H37Rv Protein-Protein Interactions
© Zhou et al.; licensee BioMed Central Ltd. 2013
Published: 13 December 2013
H. sapiens-M. tuberculosis H37Rv protein-protein interaction (PPI) data are essential for understanding the infection mechanism of M. tuberculosis H37Rv, but such data are currently very scarce. This scarcity seriously limits the study of the interaction between this important pathogen and its host, H. sapiens. Computational prediction of H. sapiens-M. tuberculosis H37Rv PPIs is an important strategy to fill this gap. Domain-domain interaction (DDI) based prediction is one of the most frequently used computational approaches for predicting both intra-species and inter-species PPIs. However, the performance of DDI-based host-pathogen PPI prediction has been rather limited.
We develop a stringent DDI-based prediction approach with emphasis on (i) differences between the specific domain sequences on annotated regions of proteins under the same domain ID and (ii) calculation of the interaction strength of predicted PPIs based on the interacting residues in their interaction interfaces.
We compare our stringent DDI-based approach to a conventional DDI-based approach for predicting PPIs based on gold standard intra-species PPIs and coherent informative Gene Ontology terms assessment. The assessment results show that our stringent DDI-based approach achieves much better performance in predicting PPIs than the conventional approach. Using our stringent DDI-based approach, we have predicted a small set of reliable H. sapiens-M. tuberculosis H37Rv PPIs which could be very useful for a variety of related studies.
We also analyze the H. sapiens-M. tuberculosis H37Rv PPIs predicted by our stringent DDI-based approach using cellular compartment distribution analysis, functional category enrichment analysis and pathway enrichment analysis. The analyses support the validity of our prediction result. In addition, based on an analysis of the H. sapiens-M. tuberculosis H37Rv PPI network predicted by our stringent DDI-based approach, we have discovered some important properties of domains involved in host-pathogen PPIs. We find that both host and pathogen proteins involved in host-pathogen PPIs tend to have more domains than proteins involved in intra-species PPIs, and that these domains have more interaction partners than domains on proteins involved in intra-species PPIs.
The stringent DDI-based prediction approach reported in this work provides a stringent strategy for predicting host-pathogen PPIs. It also performs better than a conventional DDI-based approach in predicting PPIs. We have predicted a small set of accurate H. sapiens-M. tuberculosis H37Rv PPIs which could be very useful for a variety of related studies.
Keywords: protein-protein interaction (PPI); H. sapiens-M. tuberculosis H37Rv PPIs; domain-domain interaction (DDI)
Tuberculosis is an infectious disease that causes millions of deaths each year. M. tuberculosis, the causative agent of tuberculosis, infects around one-third of the world's population [1, 2]. Tuberculosis is one of the most common opportunistic infections in HIV-infected patients and is also one of the most common causes of death among HIV patients [3, 4].
Host-pathogen PPIs are essential for a pathogen's colonization of, adhesion to and invasion of host cells, and are therefore crucial for understanding the infection mechanism and the interaction between pathogen and host. Unfortunately, high-quality large-scale experimental host-pathogen PPIs are not available for many host-pathogen systems, especially for H. sapiens and M. tuberculosis H37Rv. Many computational approaches have been developed to predict host-pathogen PPIs, including approaches based on homology, interacting domains/motifs, structure, and machine learning. DDI-based approaches are often used for predicting both intra-species and inter-species PPIs, under the assumption that domain-domain interactions mediate protein-protein interactions, because domains are the basic building blocks determining the structure and function of proteins.
In this work, we develop a stringent DDI-based approach for predicting H. sapiens-M. tuberculosis H37Rv PPIs by taking into account the differences between the specific domain sequences (we name each one a "domain instance") found on the annotated regions of proteins under the same domain ID. Interactions between query domain instances are inferred from very stringent sequence alignment to structural template domain instances. Moreover, we adopt an effective scoring strategy that ranks how likely the predicted proteins are to interact with each other by examining the interacting residues in the interaction interfaces. In this study, two amino acids on two different domain instances are defined as interacting residues if they share at least one atomic interaction: a hydrogen bond, an electrostatic interaction, or a van der Waals interaction. Thus, we work at a much more stringent and finer level of domain interaction by examining not only the sequence similarity of each domain instance but also the interaction interface compatibility between domain instances. In contrast, conventional DDI-based approaches generally use popular tools to annotate the domains in proteins and then check whether two proteins contain a pair of domains whose IDs match a pair of domains known to interact in some other pair of proteins. Matching a query domain instance to a template domain instance based on domain ID, as done in such conventional DDI-based approaches, is rather coarse and often leads to matching of domain instances that do not have the same interaction interfaces.
Using gold standard H. sapiens PPIs, we assess the performance of our stringent DDI-based approach and the conventional DDI-based approach by comparing their precision-recall curves and the number of predicted PPIs overlapping with gold standard PPIs. We also use the percentage of predicted H. sapiens PPIs with coherent informative Gene Ontology (GO) annotations to compare the two approaches. These assessments demonstrate that our stringent DDI-based approach performs much better than a conventional DDI-based approach. Cellular compartment distribution analysis, pathway enrichment analysis, and functional category enrichment analysis support the validity of our predicted H. sapiens-M. tuberculosis H37Rv PPI dataset. Our stringent DDI-based approach can be used for predicting host-pathogen PPIs in a variety of host-pathogen systems. We have also discovered some interesting properties of both pathogen and host proteins participating in host-pathogen PPIs: these proteins tend to have more domains, and their domains tend to have much higher degrees.
Our stringent DDI-based approach predicts PPIs by inferring domain instance interactions from structural template domain instance interactions. Using the MUSCLE alignment program, we align query protein domain instances to template domain instances under a stringent threshold (length difference ≤ 20% and sequence similarity ≥ 50%) and transfer the possible interactions between template structural domain instances to our query domain instances. Here, the length difference is the difference in length (longer sequence length minus shorter sequence length) divided by the length of the query domain instance; sequence similarity is the number of correctly aligned residues divided by the length of the query domain instance.
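The two alignment criteria can be illustrated with a minimal sketch. It assumes that the query and template instances have already been aligned (for example with MUSCLE) and are supplied as two gapped rows of equal length, and it assumes that a "correctly aligned" residue means an identical, non-gap column; the text does not spell this out. The function names are hypothetical.

```python
from typing import Tuple


def alignment_metrics(query_aln: str, template_aln: str) -> Tuple[float, float]:
    """Return (length_difference, sequence_similarity) for one pairwise alignment.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned rows must have equal length")
    query_len = sum(1 for c in query_aln if c != "-")
    template_len = sum(1 for c in template_aln if c != "-")
    if query_len == 0:
        raise ValueError("query domain instance is empty")
    # Assumption: "correctly aligned" is read as identical residues in a column.
    matches = sum(
        1 for q, t in zip(query_aln, template_aln) if q == t and q != "-"
    )
    # Longer minus shorter length, divided by the query length.
    length_difference = abs(query_len - template_len) / query_len
    sequence_similarity = matches / query_len
    return length_difference, sequence_similarity


def passes_stringent_threshold(
    query_aln: str,
    template_aln: str,
    max_length_difference: float = 0.20,
    min_similarity: float = 0.50,
) -> bool:
    """Apply the thresholds quoted in the text (<= 20% and >= 50%)."""
    length_difference, similarity = alignment_metrics(query_aln, template_aln)
    return (
        length_difference <= max_length_difference
        and similarity >= min_similarity
    )
```

For example, a 10-residue query aligned to a 9-residue template with 9 identical columns gives a length difference of 0.1 and a similarity of 0.9, so the pair would be kept.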
We then predict the possible PPIs from interacting query domain instances. The structural domain instances are extracted from the 3did database. Each interacting query domain instance pair is scored according to the similarity of the interacting residues in the interaction interfaces, and the best query instance score is used to represent the interaction strength of the predicted PPI (how likely the two proteins in the PPI are to interact with each other). We predict both host-pathogen (H. sapiens-M. tuberculosis H37Rv) and intra-species (H. sapiens) PPIs in this work. For comparison, we use a conventional DDI-based approach to predict possible intra-species (H. sapiens) PPIs. We assess our stringent DDI-based approach and the conventional approach using gold standard H. sapiens PPIs and the percentage of predicted PPIs that have coherent informative GO annotation. These assessments show that our stringent DDI-based approach performs better in predicting PPIs than the conventional approach. Cellular compartment distribution analysis, pathway enrichment analysis, and functional enrichment analysis support our prediction results and show that the predicted PPIs correspond to the M. tuberculosis H37Rv infection process. We further analyze some basic domain properties of proteins involved in the host-pathogen protein-protein interaction network (PPIN), in comparison with other proteins involved in the intra-species PPIN, by examining the number of domains and domain interaction degrees.
PPI prediction--our stringent DDI-based approach
It is a reasonable assumption that an observed interaction between two domain instances can be used to infer the interaction of another domain instance pair, provided the two domain instance pairs are sufficiently similar to preserve the relevant interaction interfaces. Specifically, consider two protein domains A and B. Let A_i and B_i be two instances of domains A and B, respectively. Suppose we know that these two instances have a direct physical interaction (from the crystal structure of a protein complex). Given the observation of A_i and B_i, one could infer the interaction of another instance pair of A and B, A_j and B_j, by using a sequence similarity threshold between (A_i, B_i) and (A_j, B_j).
In general, conventional DDI-based approaches disregard the details of the interaction between these domain instances in real 3D space (i.e., the interaction interface between the two instances) and thus effectively match the domain instances based on name. In contrast, we formulate a stringent approach that emphasizes the similarity of the interaction interface of the domain instances. Specifically, we assign a positive prediction score to pairs with high interface residue similarity with respect to the observed interaction instances in the existing protein structural data.
The data on structural domain instances, including the interacting domain pairs, the structural and sequence details of interacting domain instances, and the interacting residues in the interaction interfaces, are extracted from the 3did database. The individual domain instances with 3did structural data serve as "template domain instances", and pairs of interacting domain instances with 3did structural data serve as "template interacting domain instance pairs". The FASTA sequences of all H. sapiens and M. tuberculosis H37Rv proteins are obtained from UniProt. Their respective protein domain annotations are obtained from InterPro, from which we collect the sequences of domain instances that have at least one template domain instance from 3did. These domain instances are named the "query domain instances". They are aligned to each of the template domain instances under the same domain ID using the MUSCLE alignment program. Only query domain instances meeting the stringent threshold of length difference ≤ 20% and sequence similarity ≥ 50% are kept for the following analysis. For each pair (A_i, B_i) of query domain instances that meets the stringent alignment threshold to a template interacting domain instance pair (A, B), we infer the interaction interface residues in (A_i, B_i) as the residues that are aligned to the interaction interface residues in (A, B). A score for this interaction interface of (A_i, B_i) is then computed by summing the BLOSUM62 substitution scores between the residues in this interaction interface and the corresponding residues in the interaction interface of (A, B) to which they are aligned. This score is defined as the "domain instance interaction strength". Query domain instances with multiple possible template instances are scored based on the template with the best domain instance interaction strength.
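The interface scoring step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the substitution table is only a small excerpt of BLOSUM62 covering the residues in the example, template interface positions are given as 0-based residue indices, and interface positions where the query has a gap are skipped (the text does not say how such positions are handled). All function names are hypothetical.

```python
# Excerpt of the BLOSUM62 matrix; a full matrix would be used in practice.
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("L", "L"): 4, ("I", "I"): 4, ("I", "L"): 2,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def interface_pairs(query_aln, template_aln, template_interface):
    """Return (query_residue, template_residue) pairs at interface positions.

    template_interface holds 0-based indices into the ungapped template.
    """
    pairs = []
    template_index = -1
    for q, t in zip(query_aln, template_aln):
        if t == "-":
            continue
        template_index += 1
        # Assumption: interface positions aligned to a query gap are skipped.
        if template_index in template_interface and q != "-":
            pairs.append((q, t))
    return pairs


def domain_instance_interaction_strength(aligned_partners, matrix=BLOSUM62_EXCERPT):
    """Sum substitution scores over the inferred interfaces of both partners.

    aligned_partners: one (query_aln, template_aln, template_interface)
    triple for each side of the interacting pair.
    """
    total = 0
    for query_aln, template_aln, interface in aligned_partners:
        for q, t in interface_pairs(query_aln, template_aln, interface):
            total += substitution_score(q, t, matrix)
    return total
```

When a query pair matches several templates, the same function would be called once per template and the best value kept.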
For any possible pair of proteins, if they have a query domain instance pair (one domain instance on each of the two proteins), then these two proteins are predicted to be interacting with an interaction score equaling the domain instance interaction strength of that query domain instance pair. If the protein pair has more than one underlying query domain instance interaction pair, then the query domain instance pair with the best score is used to represent the protein pair. This best score is taken as "interaction strength" of this protein pair.
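The lift from domain instance pairs to protein pairs can be sketched as a simple maximum. The sketch assumes that "best" means the highest domain instance interaction strength; the identifiers and the mapping from instances to proteins are hypothetical.

```python
def predict_ppis(instance_pair_scores, instance_to_protein):
    """Keep, for each protein pair, the best-scoring domain instance pair.

    instance_pair_scores: {(instance_a, instance_b): interaction strength}
    instance_to_protein: {instance id: protein id}
    Returns {(protein_a, protein_b): interaction strength}, with each
    protein pair stored in sorted order so that it appears only once.
    """
    best = {}
    for (inst_a, inst_b), score in instance_pair_scores.items():
        pair = tuple(sorted((instance_to_protein[inst_a], instance_to_protein[inst_b])))
        # Assumption: a higher score is a better score.
        if pair not in best or score > best[pair]:
            best[pair] = score
    return best


def rank_ppis(ppi_scores):
    """Order predicted PPIs from strongest to weakest interaction strength."""
    return sorted(ppi_scores.items(), key=lambda item: item[1], reverse=True)
```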
PPI prediction--a conventional DDI-based approach
The conventional DDI-based approach predicts how likely two proteins are to interact with each other by integrating known intra-species PPIs with domain profiles, based on an association method (sequence-signature algorithm) proposed by Sprinzak et al. Specifically, domains are annotated in each protein in a known intra-species PPI dataset. Then, the probability P(d, e) that two proteins containing a specific pair of domains (d, e) would interact is estimated for each pair of domains in a Bayesian manner. Finally, given a new pair of proteins, their probability of interaction is estimated by a naive combination of the probabilities from each pair of domains (d_i, e_j) contained in the pair of proteins. This predicted probability (called the "interaction strength" of the conventional approach) can be used to rank the list of predicted PPIs.
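The final combination step can be illustrated with a short sketch. The text does not state the exact combination rule, so the sketch assumes one common naive choice: treat each domain pair as an independent chance to interact and combine them as 1 minus the product of (1 - P(d_i, e_j)). The domain-pair probabilities are taken as given, and all names are hypothetical.

```python
from itertools import product


def interaction_probability(domains_a, domains_b, pair_probability):
    """Combine domain-pair probabilities into a protein-pair probability.

    pair_probability maps frozenset({d, e}) to an estimate of P(d, e);
    domain pairs without an estimate contribute probability 0.
    Assumption: domain pairs are combined as independent events.
    """
    prob_no_interaction = 1.0
    for d, e in product(set(domains_a), set(domains_b)):
        p = pair_probability.get(frozenset((d, e)), 0.0)
        prob_no_interaction *= 1.0 - p
    return 1.0 - prob_no_interaction
```

With P(PF1, PF2) = 0.5 and P(PF1, PF3) = 0.2, a protein carrying PF1 and a protein carrying PF2 and PF3 would receive 1 - 0.5 × 0.8 = 0.6 under this assumption.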
This conventional DDI-based approach is applied to predict host-pathogen PPIs as follows. For each pair of proteins (one in H. sapiens and one in M. tuberculosis), we compute their probability of interaction as described above, based on DDIs in a yeast physical PPI dataset collected from MINT, BioGRID, and IntAct. This conventional DDI-based approach is also applied to predict human intra-species PPIs. In this case, for each pair of proteins (both in H. sapiens), we compute their probability of interaction as described above, based on DDIs in the same yeast physical PPI dataset. As a control, we ensure that the domains considered are the same domain set considered in the stringent DDI-based approach; i.e., we restrict the domain set to domains contained in 3did.
Assessment based on gold standard H. sapiens PPIs
Table 1. Assessment of the stringent and the conventional DDI-based approaches through gold standard H. sapiens PPIs.
Approach | Predicted PPIs | Overlap with Gold Standard
Conventional DDI-based Approach | Top 3085 PPIs | 81
Conventional DDI-based Approach | Top 885 PPIs | 11
Stringent DDI-based Approach | All 839 PPIs | 82
Assessment using coherent informative GO annotation of predicted H. sapiens PPIs
Table 2. Number of informative GO terms annotated to proteins involved in PPIs predicted by the stringent and the conventional DDI-based approach.
Columns: CC term No., BP term No., MF term No.
Rows: Conventional DDI-based Approach (All 724185 PPIs; Top 839 PPIs); Stringent DDI-based Approach (All 839 PPIs).
Cellular compartment distribution of H. sapiens proteins targeted by the predicted host-pathogen PPIs
The assessments above show that our stringent DDI-based approach performs much better than the conventional DDI-based approach in predicting reliable intra-species PPIs. We now analyze the host-pathogen PPIs predicted by our stringent DDI-based approach.
The cellular compartments of the H. sapiens proteins targeted by the predicted H. sapiens-M. tuberculosis H37Rv PPIs are a useful indicator of the quality of the predicted host-pathogen PPIs. If the targeted H. sapiens proteins are located in cellular compartments that are highly relevant to the pathogen's infection, or that are very likely to be involved in interactions with the pathogen, then the result supports the host-pathogen predictions. Gene Ontology (Cellular Compartment, CC) is a very comprehensive annotation system for human proteins. However, as the Gene Ontology is hierarchical, we use only informative CC terms for our analysis.
Table 3. Cellular compartment distribution of H. sapiens proteins targeted by host-pathogen PPIs predicted by the stringent DDI-based approach.
Columns: informative CC term, No. of Proteins
GO:0005759 mitochondrial matrix
GO:0045211 postsynaptic membrane
GO:0005741 mitochondrial outer membrane
GO:0016469 proton-transporting two-sector ATPase complex
GO:0044439 peroxisomal part
GO:0031965 nuclear membrane
GO:0048471 perinuclear region of cytoplasm
GO:0016324 apical plasma membrane
GO:0005925 focal adhesion
GO:0035770 ribonucleoprotein granule
GO:0016605 PML body
GO:0016607 nuclear speck
GO:0030018 Z disc
Functional enrichment analysis of proteins involved in host-pathogen PPIs
Table 4. Functional enrichment analysis of H. sapiens proteins involved in the host-pathogen PPI dataset predicted by the stringent DDI-based approach.
GO:0050660 FAD binding
GO:0016462 pyrophosphatase activity
GO:0004022 alcohol dehydrogenase (NAD) activity
GO:0032559 adenyl ribonucleotide binding
GO:0042626 ATPase activity, coupled to transmembrane movement of substances
GO:0015405 P-P-bond-hydrolysis-driven transmembrane transporter activity
GO:0042625 ATPase activity, coupled to transmembrane movement of ions
GO:0000287 magnesium ion binding
GO:0004466 long-chain-acyl-CoA dehydrogenase activity
GO:0003960 NADPH:quinone reductase activity
GO:0070402 NADPH binding
GO:0004745 retinol dehydrogenase activity
GO:0019841 retinol binding
GO:0042288 MHC class I protein binding
Pathway enrichment analysis of proteins involved in host-pathogen PPIs
Pathway data are important functional information for identifying the overall cellular functions related to a list of proteins. If a set of proteins is significantly enriched in some pathways, it is very likely that these proteins play similar or coordinated roles in vivo. Thus, pathway enrichment analysis is also one of the most frequently used strategies for analyzing predicted host-pathogen PPIs.
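As an illustration of what "significantly enriched" can mean, the sketch below computes a one-sided hypergeometric p-value. This is an assumption for illustration only: the text does not state which statistical test or tool was used for the enrichment analysis, and the function name and arguments are hypothetical.

```python
from math import comb


def enrichment_p_value(population_size, pathway_size, sample_size, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population_size: number of annotated proteins in the background
    pathway_size:    number of background proteins in the pathway
    sample_size:     number of proteins in the predicted set
    hits:            number of predicted proteins in the pathway
    """
    if not 0 <= hits <= min(pathway_size, sample_size):
        raise ValueError("hits is outside the feasible range")
    total = comb(population_size, sample_size)
    # Sum the probabilities of observing `hits` or more pathway members.
    tail = sum(
        comb(pathway_size, k) * comb(population_size - pathway_size, sample_size - k)
        for k in range(hits, min(pathway_size, sample_size) + 1)
    )
    return tail / total
```

In practice such p-values would usually be corrected for multiple testing across pathways before calling a pathway enriched.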
Pathway enrichment analyses of H. sapiens proteins involved in the host-pathogen PPI dataset predicted by the stringent DDI-based approach.
Fatty Acid Metabolism
Valine, Leucine and Isoleucine Degradation
Fatty Acid Beta Oxidation
Glycolysis and Gluconeogenesis
2-Oxobutanoate Degradation I
p53 Signaling Pathway
Ethanol Degradation II (cytosol)
Pathway enrichment analyses of M. tuberculosis H37Rv proteins involved in the host-pathogen PPI dataset predicted by the stringent DDI-based approach.
Fatty Acid β oxidation I
Analysis of domain properties of proteins involved in host-pathogen PPIs
The analysis of protein domain properties considers the number of domains on each protein and the degrees of those domains. These properties directly reflect differences between the proteins involved in the inter-species host-pathogen PPIN and those in the intra-species PPIN. We analyze the domain properties of both the M. tuberculosis H37Rv and the H. sapiens proteins involved in the predicted host-pathogen PPIs, and compare them with other proteins in their own intra-species PPIN. As a control experiment, we also conduct the same analysis on the H. sapiens proteins in the gold standard H. sapiens-HIV PPIs to see whether they exhibit similar properties.
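The two quantities compared in this analysis can be sketched as below. The sketch assumes that a domain's degree is its number of distinct partner domains in a DDI dataset and that the average degree is taken over all domain occurrences on the proteins of interest; the text does not give these details, and the function name and inputs are hypothetical.

```python
from collections import defaultdict


def domain_properties(protein_domains, ddi_pairs):
    """Return (average number of domains, average domain degree).

    protein_domains: {protein id: list of domain ids annotated on it}
    ddi_pairs: iterable of (domain, domain) interactions
    """
    partners = defaultdict(set)
    for d, e in ddi_pairs:
        partners[d].add(e)
        partners[e].add(d)

    occurrences = [d for domains in protein_domains.values() for d in domains]
    if not protein_domains or not occurrences:
        raise ValueError("no proteins or no annotated domains")

    average_domains = len(occurrences) / len(protein_domains)
    # Domains absent from the DDI dataset have degree 0.
    average_degree = sum(len(partners.get(d, ())) for d in occurrences) / len(occurrences)
    return average_domains, average_degree
```

Running the function once on the proteins in host-pathogen PPIs and once on the remaining proteins of the intra-species PPIN gives the two rows to compare.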
As the host-pathogen PPIs are predicted by the stringent DDI-based approach, we use a different domain annotation system in this analysis to avoid bias. Both M. tuberculosis H37Rv and H. sapiens protein domains are annotated using HMMER v3.0. The domain profiles used in the protein domain annotation are from Pfam-A. The thresholds for domain annotation are an independent E-value (i-Evalue) ≤ 1E-20 and accuracy ≥ 0.9. For each domain annotated on each protein, we retrieve the corresponding domain sequence for the following analysis.
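The filtering and sequence retrieval described here can be sketched as follows, assuming the HMMER hits have already been parsed into records. The `DomainHit` record and its field names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    protein: str
    pfam_id: str
    i_evalue: float   # independent E-value of the domain hit
    accuracy: float   # mean posterior probability of the aligned residues
    start: int        # 1-based start of the domain on the protein
    end: int          # 1-based inclusive end of the domain


def keep_confident_hits(hits, max_i_evalue=1e-20, min_accuracy=0.9):
    """Keep hits meeting the thresholds quoted in the text."""
    return [
        h for h in hits
        if h.i_evalue <= max_i_evalue and h.accuracy >= min_accuracy
    ]


def domain_sequence(protein_sequence, hit):
    """Extract the annotated domain region from the full protein sequence."""
    return protein_sequence[hit.start - 1:hit.end]
```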
Protein domain property analysis result.
H. sapiens proteins
H. sapiens proteins
Average No. of domains
Average Domain degrees
Software Packages and Datasets
The software packages and database tools used in this study are:
The datasets used in this study are:
M. tuberculosis H37Rv PPI dataset consisting of four reliable subsets of the B2H PPI dataset and the STRING PPI dataset (threshold at 770).
Protein domain annotation (protein2ipr) from InterPro; date of download is March 5th, 2012.
DDI data from the 3did database (version November 28, 2010).
DDI data from the DOMINE database V2.0.
Pfam-A domain profiles.
H. sapiens-HIV-1 PPI dataset downloaded from "HIV-1, human protein interaction database at NCBI".
Prediction of host-pathogen PPIs
Because of the stringent alignment threshold used for matching query and template domain instances, many instances with large sequence variation under the same domain ID are filtered out, leaving very few domain instances for study. In addition, our template interacting domain instances come from structurally resolved data in 3did, so the number of template domain instances is relatively small. Because of these two factors, the PPI datasets predicted by our stringent DDI-based approach are usually small. We have predicted 92 H. sapiens-M. tuberculosis H37Rv PPIs, and this small set of predicted host-pathogen PPIs is analyzed using several approaches, as discussed in the following sections. We visualize the predicted host-pathogen PPIN consisting of these 92 H. sapiens-M. tuberculosis H37Rv PPIs using Cytoscape in Figure 1. The orange dots are M. tuberculosis H37Rv proteins, while the blue dots are H. sapiens proteins. The predicted H. sapiens-M. tuberculosis H37Rv PPI dataset can be found in Additional File 1. From Figure 1 we can observe that, as in many host-pathogen PPINs, the pathogen proteins tend to be hubs in the host-pathogen PPIN.
Prediction of intra-species PPIs
Currently, no large-scale high-quality H. sapiens-M. tuberculosis H37Rv PPI dataset is available, so we cannot directly assess the performance of our stringent DDI-based approach in the inter-species host-pathogen system. We therefore turn to the intra-species system for the assessments. We predict intra-species H. sapiens PPIs using the stringent and the conventional DDI-based approaches. Altogether, 839 H. sapiens PPIs are predicted by the stringent DDI-based approach. In contrast, 724185 H. sapiens PPIs are predicted by the conventional DDI-based approach. The difference between the two approaches is already obvious from the number of PPIs predicted. Our stringent DDI-based approach relies on very high sequence similarity to the template domain instances and makes predictions at the level of individual domain instances. Therefore only a small number of PPIs are predicted. The small number of structurally resolved template interacting domain instances also limits the number of PPIs we can predict using our stringent DDI-based approach. The conventional DDI-based approach, in contrast, derives the possible interacting domain information from known PPI datasets (which can be abundant for some species) and treats all domain instances annotated under the same domain ID as the same, so it can predict a large number of PPIs. We compare the performance of our stringent DDI-based approach and the conventional DDI-based approach based on gold standard PPI datasets and the percentage of PPIs having coherent informative GO terms.
Assessment based on gold standard H. sapiens PPIs
We collect the known H. sapiens physical PPI dataset from MINT, BioGRID, and IntAct as our gold standard PPI dataset to assess the H. sapiens PPIs predicted by the stringent and the conventional DDI-based approaches. We calculate and plot the precision-recall curves of the stringent and the conventional DDI-based approaches; see Figure 2. From the plots we can see that both prediction approaches achieve better precision when the threshold increases. This shows that the scoring strategies adopted by both prediction approaches in calculating the "interaction strength" are valid indicators of the likelihood that a predicted PPI is real. The precision-recall curves also clearly show that, overall, the stringent DDI-based approach consistently predicts PPIs with much higher precision than the conventional DDI-based approach; see Figure 2. As the conventional DDI-based approach makes a large number of predictions, it has higher recall. The precision-recall curve shows that our stringent DDI-based approach can predict only a small number of PPIs, but with much higher precision than the conventional approach. As the two approaches predict very different numbers of PPIs, we also choose some specific points to compare the performance of the two prediction approaches; see Table 1. When our stringent DDI-based approach predicts 839 H. sapiens PPIs, 82 of them overlap with the gold standard; when the conventional DDI-based approach predicts 885 H. sapiens PPIs, only 11 of them overlap with the gold standard. Put differently, our stringent DDI-based approach has to predict 839 H. sapiens PPIs in order to have 82 H. sapiens PPIs overlapping with the gold standard, whereas the conventional DDI-based approach has to predict 3085 H. sapiens PPIs in order to have 81 H. sapiens PPIs overlapping with the gold standard; see Table 1. All these assessments using the gold standard H. sapiens PPIs clearly show that our stringent DDI-based approach is more stringent and performs better than the conventional DDI-based approach.
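The precision-recall comparison can be sketched as follows. The sketch assumes that predicted PPIs are ranked by decreasing interaction strength and that a PPI is an unordered protein pair; the function name and the toy identifiers are hypothetical.

```python
def precision_recall_points(ranked_ppis, gold_standard):
    """Return one (precision, recall) point per cutoff rank.

    ranked_ppis: predicted protein pairs, strongest prediction first
    gold_standard: known interacting protein pairs
    """
    gold = {frozenset(pair) for pair in gold_standard}
    if not gold:
        raise ValueError("gold standard is empty")
    points = []
    hits = 0
    for rank, pair in enumerate(ranked_ppis, start=1):
        if frozenset(pair) in gold:
            hits += 1
        # Precision over the top `rank` predictions, recall over the gold set.
        points.append((hits / rank, hits / len(gold)))
    return points
```

Applied to the counts quoted above, the precision at the reported points is 82/839 ≈ 9.8% for the stringent approach, against 11/885 ≈ 1.2% and 81/3085 ≈ 2.6% for the conventional approach.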
Assessment based on coherent informative GO annotation of predicted H. sapiens PPIs
To further compare the performance of the stringent and the conventional DDI-based approaches, we calculate the percentage of PPIs that have coherent informative GO terms. In Figure 3 and Figure 4, the overall percentage of PPIs having coherent informative GO terms indicates that both approaches work well, as moving towards a higher threshold (a smaller number of top PPIs) leads to a higher percentage of PPIs having coherent informative GO terms. As shown in Figure 3, the PPI dataset predicted by our stringent DDI-based approach starts with a high percentage of PPIs having coherent informative GO terms; this indicates good overall performance, as the predicted dataset has a low noise level and high quality. In contrast, the PPI dataset predicted by the conventional DDI-based approach does not perform as well in terms of the overall percentage of PPIs having coherent informative GO terms. It starts with a low percentage of PPIs having coherent informative GO terms, with especially low percentages for cellular compartment (CC) terms and biological process (BP) terms; this indicates that the dataset predicted by the conventional DDI-based approach is noisy and of lower quality. As the PPI datasets predicted by the two approaches differ greatly in the number of predicted PPIs, the overall plots alone may not be a sufficient assessment. Therefore, we focus on the top 839 PPIs predicted by each of the two approaches and plot their percentage of PPIs having coherent informative GO terms in Figure 5. We can clearly observe that PPIs predicted by the stringent DDI-based approach have a consistently higher percentage of coherent informative CC and BP terms; see Figure 5.
The percentage of PPIs that have coherent informative GO terms may also be influenced by the number of GO terms that are annotated to the proteins in the PPI datasets. We therefore summarize in Table 2 the number of GO terms that are annotated to proteins in all 839 PPIs predicted by the stringent DDI-based approach, and to proteins in all 724185 PPIs and in the top 839 PPIs predicted by the conventional DDI-based approach. This table shows that although a high percentage of the PPIs predicted at a high threshold by the conventional DDI-based approach have coherent informative GO terms, this may be due to the fact that these top 839 PPIs are annotated with very few distinct GO terms. Even with such a small number of informative GO terms, the percentage of PPIs predicted by the conventional DDI-based approach having coherent informative GO terms is still consistently lower than that of the stringent DDI-based approach; this strongly supports the conclusion that the stringent DDI-based approach performs much better than the conventional DDI-based approach in predicting reliable PPIs.
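The percentage of PPIs with coherent informative GO terms can be sketched as below. The sketch assumes that a PPI counts as coherent when the two proteins share at least one informative GO term of the aspect being examined (CC, BP or MF); the precise definition is not restated in this section, and all names are hypothetical.

```python
def coherent_fraction(ppis, annotations, informative_terms):
    """Fraction of PPIs whose two proteins share an informative GO term.

    ppis: list of (protein_a, protein_b)
    annotations: {protein id: set of GO term ids}
    informative_terms: set of GO term ids regarded as informative
    """
    if not ppis:
        raise ValueError("no PPIs to assess")
    coherent = 0
    for a, b in ppis:
        shared = annotations.get(a, set()) & annotations.get(b, set())
        # Assumption: one shared informative term is enough for coherence.
        if shared & informative_terms:
            coherent += 1
    return coherent / len(ppis)


def coherent_fraction_at_top_k(ranked_ppis, annotations, informative_terms, k):
    """Same measure restricted to the k highest-ranked predictions."""
    return coherent_fraction(ranked_ppis[:k], annotations, informative_terms)
```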
Cellular compartment distribution of H. sapiens proteins targeted by predicted host-pathogen PPIs.
The cellular compartment distribution of the H. sapiens proteins targeted by the host-pathogen PPIs predicted by our stringent DDI-based approach is an important indicator of the performance of the prediction approach and the quality of the predicted H. sapiens-M. tuberculosis H37Rv PPIs. If the targeted H. sapiens proteins are mostly located in cellular compartments closely related to pathogen infection, then the predicted results are more convincing. We identify the informative CC terms for H. sapiens proteins. We then calculate the number and percentage of proteins in the datasets that have been annotated with each of the informative CC terms, and plot the informative CC terms of the H. sapiens proteins targeted by the stringent DDI-based approach in Figure 6, with detailed statistics given in Table 3.
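The counting step described above can be sketched in a few lines. The inputs (a list of targeted proteins, their CC annotations and the set of informative CC terms) are assumed to be available already, and the function name is hypothetical.

```python
from collections import Counter


def cc_distribution(target_proteins, cc_annotations, informative_cc_terms):
    """Count targeted proteins per informative CC term.

    Returns a list of (term, count, percentage of targeted proteins),
    sorted by decreasing count.
    """
    proteins = set(target_proteins)
    if not proteins:
        raise ValueError("no targeted proteins")
    counts = Counter()
    for protein in proteins:
        for term in cc_annotations.get(protein, set()) & informative_cc_terms:
            counts[term] += 1
    return [
        (term, count, 100.0 * count / len(proteins))
        for term, count in counts.most_common()
    ]
```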
Many of the host-pathogen PPIs predicted by the stringent DDI-based approach target H. sapiens proteins located in very relevant cellular compartments. M. tuberculosis H37Rv infection has a close relationship with mitochondria activities and function and induces quantitatively distinct changes in the mitochondrial proteome . Ultrastructural changes in the mitochondria and mitochondrial clustering are also observed in the M. tuberculosis H37Rv infected cells . The augmentation of mitochondrial activity by M. tuberculosis H37Rv enables manipulation of host cellular mechanisms to inhibit apoptosis and ensure fortification against anti-microbial pathways . Therefore mitochondrial matrix(GO:0005759), mitochondrial outer membrane(GO:0005741) and proton-transporting two-sector ATPase complex(GO:0016469), are relevant to M. tuberculosis H37Rv infection.
H. sapiens proteins located at flagellum (GO:0019861) have much higher chance of interacting with M. tuberculosis H37Rv during infection as proteins located at flagellum are the first set of proteins that M. tuberculosis H37Rv comes across before invading the cell.
The CC term peroxisomal part(GO:0044439) is also strongly related to M. tuberculosis infection. It is found that the interaction between the mycobacterial phagosome and the endoplasmic reticulum leads to proteasome degradation and MHC class I presentation of M. tuberculosis antigens.
Focal adhesion(GO:0005925) is also closely interconnected to the M. tuberculosis infection process. In many bacterial pathogens, protein tyrosine phosphatases (PTPases) are essential for dephosphorylating host focal adhesion proteins and focal adhesion kinase. This dephosphorylation leads to destabilization of focal adhesions involved in the internalization of bacterial pathogens by eukaryotic cells [26, 27]. Therefore the proteins located at "Focal adhesion" compartment are very important target for M. tuberculosis infection of host. This strongly supports the validity of the prediction results of our stringent DDI-based approach.
The cellular compartment lamellipodium(GO:0030027) also supports the validity of our prediction results. It has been reported that host cell's actin filament network is interfered by pathogenic species of mycobateria [28–30]. A more recent study shows that M. tuberculosis affects actin polymerisation .
The CC term nucleolus (GO:0005730) may also be related to M. tuberculosis infection, as M. tuberculosis infection of human macrophages blocks several responses to IFN-γ. The inhibitory effect of M. tuberculosis is directed at the transcription of IFN-γ-responsive genes. Several studies show that M. tuberculosis and its purified protein derivative induce the HIV LTR primarily through transcriptional activation.
The cellular compartment distribution analysis of the H. sapiens proteins targeted by host-pathogen PPIs strongly supports the validity of the PPI dataset predicted by our stringent DDI-based approach.
Functional enrichment analysis of proteins involved in host-pathogen PPIs
Functional enrichment analysis points out the possible functional relevance of H. sapiens proteins involved in the H. sapiens-M. tuberculosis H37Rv PPIN predicted by the stringent DDI-based approach. The representative result, namely the most significantly enriched level 5 MF GO terms, is given in Table 4.
Most of the significantly enriched functional categories are strongly related to M. tuberculosis H37Rv infection, including adenyl ribonucleotide binding (GO:0032559); ATPase activity, coupled to transmembrane movement of substances (GO:0042626); P-P-bond-hydrolysis-driven transmembrane transporter activity (GO:0015405); ATPase activity, coupled to transmembrane movement of ions (GO:0042625); long-chain-acyl-CoA dehydrogenase activity (GO:0004466); NADPH:quinone reductase activity (GO:0003960); NADPH binding (GO:0070402); retinol dehydrogenase activity (GO:0004745); retinol binding (GO:0019841); and MHC class I protein binding (GO:0042288).
As described above, M. tuberculosis H37Rv infection is closely related to the mitochondria. Therefore, all MF terms closely related to mitochondria are relevant to M. tuberculosis H37Rv infection. The relevant GO terms include ATPase activity, coupled to transmembrane movement of substances (GO:0042626); P-P-bond-hydrolysis-driven transmembrane transporter activity (GO:0015405); ATPase activity, coupled to transmembrane movement of ions (GO:0042625); NADPH:quinone reductase activity (GO:0003960); and NADPH binding (GO:0070402).
MHC class I protein binding(GO:0042288) is a strongly immune-related term which is also very relevant to M. tuberculosis H37Rv infection. Proteins enriched in this term play an important role in presenting M. tuberculosis antigens, which is essential for the immune response to this pathogen.
Long-chain-acyl-CoA dehydrogenase activity (GO:0004466) is a fatty acid-related term which is very relevant to M. tuberculosis H37Rv infection. Fatty acids and cholesterol appear to be the favored nutrients for M. tuberculosis inside H. sapiens cells. The breakdown of fatty acids and cholesterol can generate propionyl-CoA, which gives rise to potentially toxic intermediates. Through the methylcitrate cycle, the methylmalonyl pathway, or incorporation of the propionyl-CoA into methyl-branched lipids in the cell wall, M. tuberculosis expands the acetyl-CoA pool and alleviates the pressure from propionyl-CoA.
This functional enrichment analysis shows that our stringent DDI-based approach is accurate and has merits in identifying possible H. sapiens proteins that are involved in H. sapiens-M. tuberculosis H37Rv PPIs.
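As an illustration of how the significance of an enriched GO term can be scored, the following is a minimal sketch of a one-sided hypergeometric test. It is an assumption made for illustration, not the scoring used to produce Table 4, and the function name and the counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of annotated proteins in the background set
    K: number of background proteins carrying the GO term
    n: number of proteins in the study set (e.g. predicted host targets)
    k: number of study-set proteins carrying the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # number of ways to draw the study set
    upper = min(K, n)   # largest possible overlap with the term
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical toy counts: 12 of 150 study proteins carry a term that
# annotates 200 of 18000 background proteins.
p_value = hypergeom_enrichment_p(k=12, n=150, K=200, N=18000)
print(f"enrichment p-value: {p_value:.3e}")
```

A small p-value indicates that the term occurs in the study set more often than expected by chance. In practice the p-values would also need correction for the number of GO terms tested.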
Pathway enrichment analysis of proteins involved in host-pathogen PPIs
Pathway enrichment analysis of the proteins involved in the host-pathogen PPIN can provide rich information on the functional relevance of both the host and the pathogen proteins involved. The analysis should show that the host proteins involved in host-pathogen interactions are functionally correlated with pathways relevant to the pathogen's infection. Indeed, H. sapiens proteins involved in the H. sapiens-M. tuberculosis H37Rv PPIN predicted by the stringent DDI-based approach are mostly enriched in pathways that are closely relevant to M. tuberculosis infection; see Table 5. For example, "Fatty Acid Metabolism", "Fatty Acid Beta Oxidation", and "Glycolysis and Gluconeogenesis" are closely related to M. tuberculosis infection, as fatty acids are one of the favored nutrients for M. tuberculosis inside H. sapiens cells. M. tuberculosis is able to grow on a variety of carbon sources, but mounting evidence has implicated fatty acids as the major source of carbon and energy for M. tuberculosis during infection. M. tuberculosis also switches its carbon source from sugars to fatty acids during the persistent phase of infection. Biosynthesis of sugars from intermediates of the tricarboxylic acid cycle is essential for growth. The pathways "Metabolic Pathways", "Valine, Leucine and Isoleucine Degradation", "2-Oxobutanoate Degradation I", and "Ethanol Degradation II (cytosol)" may also be closely related to M. tuberculosis infection, as they involve intermediates of the tricarboxylic acid cycle, which is essential for the growth of M. tuberculosis. They may also contribute to the carbon flow of M. tuberculosis metabolism inside the human cell.
M. tuberculosis H37Rv proteins involved in the H. sapiens-M. tuberculosis H37Rv PPIN predicted by the stringent DDI-based approach are significantly enriched in the "Fatty Acid β oxidation I" pathway; see Table 6. This strongly supports the validity of our prediction results. As discussed above, fatty acids are the major source of carbon and energy for M. tuberculosis during infection, and pathways involved in fatty acid metabolism strongly indicate association with the infection state of M. tuberculosis H37Rv. It has been found that when the pathogen's acyl-coenzyme A synthetase gene is disrupted, infected mice survive significantly longer than those infected with the wild type, suggesting attenuation of the mutated pathogen. In fact, the pathogen never attains the plateau phase of infection in mouse lungs when its acyl-coenzyme A synthetase gene is disrupted. The M. tuberculosis fatty acyl-coenzyme A synthetase gene may serve to recycle mycolic acids for the long-term survival of the tubercle bacilli. Carbon rerouting is marked by a switch from metabolic pathways generating energy and biosynthetic precursors in growing bacilli to pathways for storage compound synthesis during growth arrest. This result is in accord with the cellular compartment distribution and functional enrichment analyses above.
All the results support the validity of the H. sapiens-M. tuberculosis H37Rv PPIs predicted by our stringent DDI-based approach. Therefore the prediction results from our stringent DDI-based approach can serve as a reliable reference of PPIs between H. sapiens and M. tuberculosis H37Rv.
Analysis of domain properties of proteins involved in host-pathogen PPIs
We compare two domain properties of both H. sapiens and M. tuberculosis H37Rv proteins in the predicted H. sapiens-M. tuberculosis H37Rv PPIN and in their own intra-species PPINs. We also conduct a similar analysis on H. sapiens proteins involved in the gold standard H. sapiens-HIV PPIN as a control experiment. Table 7 provides summary results from the analysis of H. sapiens and M. tuberculosis H37Rv proteins. H. sapiens proteins targeted by the predicted H. sapiens-M. tuberculosis H37Rv PPIN show properties very similar to those of H. sapiens proteins targeted by the gold standard H. sapiens-HIV PPIN. This also supports the validity of our prediction results to some extent.
In both the predicted H. sapiens-M. tuberculosis H37Rv PPIN and the gold standard H. sapiens-HIV PPIN, H. sapiens proteins tend to have more domains, and those domains tend to have higher degrees, than proteins in the intra-species H. sapiens PPIN.
The findings from analyzing domain properties may help illuminate the basic mechanisms by which host and pathogen proteins interact with each other, and may be useful in assessing predicted host-pathogen PPINs.
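The two domain properties compared above can be sketched as follows. This is a minimal illustration under stated assumptions: the degree of a domain is taken to be its number of distinct partner domains in a DDI set, and the function name, the input layout, and the toy identifiers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def domain_properties(protein_domains, ddi_pairs, proteins):
    """Return (mean domains per protein, mean degree of those domains).

    protein_domains: dict mapping a protein ID to a list of its domain IDs
    ddi_pairs: iterable of (domain_a, domain_b) interacting domain pairs
    proteins: the protein IDs of the network being summarized
    """
    # Degree of a domain = number of distinct partner domains (assumption).
    partners = defaultdict(set)
    for a, b in ddi_pairs:
        partners[a].add(b)
        partners[b].add(a)

    domain_counts = [len(protein_domains.get(p, ())) for p in proteins]
    degrees = [
        len(partners[d]) for p in proteins for d in protein_domains.get(p, ())
    ]
    mean_domains = mean(domain_counts) if domain_counts else 0.0
    mean_degree = mean(degrees) if degrees else 0.0
    return mean_domains, mean_degree


# Hypothetical toy data: P1 plays the role of a host protein in an
# inter-species network, P2 of a protein seen only in an intra-species one.
toy_domains = {"P1": ["PF_A", "PF_B"], "P2": ["PF_C"]}
toy_ddis = [("PF_A", "PF_B"), ("PF_A", "PF_C")]
print(domain_properties(toy_domains, toy_ddis, ["P1"]))
print(domain_properties(toy_domains, toy_ddis, ["P2"]))
```

Running the same function on the protein sets of the inter-species and intra-species networks gives the pair of summary values that such a comparison relies on.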
Sequence similarity between domain instances in DDI-based prediction
Compared with conventional DDI-based approaches, our stringent DDI-based approach emphasizes the importance of domain instances in inferring interactions from template DDIs. This emphasis on stringent sequence similarity between template and query domain instances when transferring interactions significantly improves prediction performance. It also draws attention to the large sequence variation among domain instances, which may limit conventional DDI-based approaches.
It is also noteworthy that the stringent alignment of domain instances could serve as a basis for new algorithms that predict intra- and inter-species PPIs.
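The idea of transferring an interaction only when query domain instances closely resemble the template instances can be sketched as below. This is a simplified stand-in rather than the scoring used in this work: it uses plain percent identity over pre-aligned sequences, and the 80% threshold, the function names, and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two gapped, equal-length aligned sequences.

    Columns where both sequences have a gap are ignored; a residue
    facing a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def keep_prediction(alignments, min_identity: float = 80.0) -> bool:
    """Keep a predicted PPI only if every (template, query) domain
    alignment reaches the identity threshold (hypothetical cutoff)."""
    return all(percent_identity(t, q) >= min_identity for t, q in alignments)


# Hypothetical toy alignments for the two sides of one template DDI.
side_one = ("ACDEF", "ACDEF")
side_two = ("ACDE-F", "ACDQGF")
print(percent_identity(*side_two))
print(keep_prediction([side_one, side_two]))
```

Requiring both sides of the template DDI to pass is what makes the filter stringent: a domain ID match alone is not enough to transfer the interaction.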
Pros and cons of DDI-based prediction
The main advantage of our stringent DDI-based approach, as discussed above, is that it predicts a small set of PPIs with higher accuracy. A possible limitation of this approach is the lack of large-scale, high-quality, structurally resolved DDIs. However, it is reasonable to expect that more protein complex structures will be resolved, which would significantly strengthen the effectiveness of our stringent DDI-based approach.
Producing only a small number of PPIs does not detract from the merits of our stringent DDI-based approach, because a small number of highly accurate PPIs may be more valuable than a huge number of PPIs with a substantial fraction of noise. Highly accurate predicted PPIs, even when few in number, are usually very welcome in experimental research, as they are a much more valuable reference for experimental verification than large datasets with high noise.
Accurate sequence alignment among domain instances makes our approach much more computationally expensive than the conventional DDI-based approach. This may limit the application of our stringent DDI-based approach to large-scale prediction of PPIs across many host-pathogen systems.
In this work, we have proposed a stringent DDI-based prediction approach based on high sequence similarity between template domain instances and query domain instances. The assessment based on gold-standard H. sapiens PPIs and informative GO annotation shows that the stringent DDI-based approach performs better than the conventional DDI-based approach. We have also predicted a small set of accurate H. sapiens-M. tuberculosis H37Rv PPIs. Through cellular compartment distribution, functional enrichment, and pathway enrichment analysis, we have demonstrated that this small set of accurate H. sapiens-M. tuberculosis H37Rv PPIs is valid and closely corresponds to M. tuberculosis H37Rv infection. This dataset of H. sapiens-M. tuberculosis H37Rv PPIs can be used for a variety of related studies as an important reference.
Interacting domain instances and structural information from 3did can be downloaded from: http://compbio.ddns.comp.nus.edu.sg/~zhouhufeng/Research/DDIbased/data/.
This project was supported in part by two NGS scholarships and a Singapore Ministry of Education Tier-2 grant MOE2009-T2-2-004.
Funding for the publication of this paper is provided by Wong's KITHCT chair professorship.
This article has been published as part of BMC Systems Biology Volume 7 Supplement 6, 2013: Selected articles from the 24th International Conference on Genome Informatics (GIW2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.
- Butler D: New fronts in an old war. Nature. 2000, 406 (6797): 670-672. 10.1038/35021291.
- Koul A, Herget T, Klebl B, Ullrich A: Interplay between mycobacteria and host signalling pathways. Nature Reviews Microbiology. 2004, 2 (3): 189-202. 10.1038/nrmicro840.
- Hestvik A, Hmama Z, Av-Gay Y: Mycobacterial manipulation of the host cell. FEMS Microbiology Reviews. 2006, 29 (5): 1041-1050.
- Global Tuberculosis Programme WHO: Global Tuberculosis Control: WHO Report. 2010, Global Tuberculosis Programme, World Health Organization.
- Zhou H, Jin J, Wong L: Progress in computational studies of host-pathogen interactions. J Bioinform Comput Biol. 2013, 11 (2): 1230001. 10.1142/S0219720012300018.
- Edgar RC: MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.
- Stein A, Céol A, Aloy P: 3did: Identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research. 2011, 39 (suppl 1): D718-D723.
- Dyer MD, Murali TM, Sobral BW: Computational prediction of host-pathogen protein-protein interactions. Bioinformatics. 2007, 23 (13): i159-i166. 10.1093/bioinformatics/btm208.
- The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Research. 2012, 40 (D1): D71-D75.
- Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al.: InterPro in 2011: New developments in the family and domain prediction database. Nucleic Acids Research. 2012, 40 (D1): D306-D312. 10.1093/nar/gkr948.
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA. 1992, 89 (22): 10915-10919. 10.1073/pnas.89.22.10915.
- Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology. 2001, 311 (4): 681-692. 10.1006/jmbi.2001.4920.
- Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Letters. 2002, 513: 135-140. 10.1016/S0014-5793(01)03293-8.
- Stark C, Breitkreutz B, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone M, Nixon J, Van Auken K, Wang X, Shi X, et al.: The BioGRID interaction database: 2011 update. Nucleic Acids Research. 2011, 39 (suppl 1): D698-D704.
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A: IntAct: An open source molecular interaction database. Nucleic Acids Research. 2004, 32 (suppl 1): D452-D455.
- Zhou H, Wong L: Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets. BMC Genomics. 2011, 12 (Suppl 3): S20. 10.1186/1471-2164-12-S3-S20.
- Dennis G, Sherman B, Hosack D, Yang J, Gao W, Lane H, Lempicki R: DAVID: Database for annotation, visualization, and integrated discovery. Genome Biology. 2003, 4 (5): P3. 10.1186/gb-2003-4-5-p3.
- Zhou H, Jin J, Zhang H, Bo Y, Wozniak M, Wong L: IntPath--an integrated pathway gene relationship database for model organisms and important pathogens. BMC System Bio. 2012, 6 (Suppl 2): S2. 10.1186/1752-0509-6-S2-S2.
- Fu W, Sanders-Beer B, Katz K, Maglott D, Pruitt K, Ptak R: Human immunodeficiency virus type 1, human protein interaction database at NCBI. Nucleic Acids Research. 2009, 37 (suppl 1): D417-D422.
- Eddy S: Accelerated profile HMM searches. PLoS Computational Biology. 2011, 7 (10): e1002195. 10.1371/journal.pcbi.1002195.
- Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E: The Pfam protein families database. Nucleic Acids Research. 2004, 32 (suppl 1): D138-D141.
- Smoot M, Ono K, Ruscheinski J, Wang P, Ideker T: Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011, 27 (3): 431-432. 10.1093/bioinformatics/btq675.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: Protein domains identifier. Nucleic Acids Research. 2005, 33 (suppl 2): W116-W120.
- Yellaboina S, Tasneem A, Zaykin D, Raghavachari B, Jothi R: DOMINE: A comprehensive collection of known and predicted domain-domain interactions. Nucleic Acids Research. 2011, 39 (suppl 1): D730-D735.
- Jamwal S, Midha MK, Verma HN, Basu A, Rao KV, Manivel V: Characterizing virulence-specific perturbations in the mitochondrial function of macrophages infected with mycobacterium tuberculosis. Scientific Reports. 2013, 3: 1328.
- Persson C, Carballeira N, Wolf-Watz H, Fällman M: The PTPase YopH inhibits uptake of Yersinia, tyrosine phosphorylation of p130Cas and FAK, and the associated accumulation of these proteins in peripheral focal adhesions. The EMBO Journal. 1997, 16 (9): 2307-2318. 10.1093/emboj/16.9.2307.
- Black D, Bliska J: Identification of p130Cas as a substrate of Yersinia YopH (Yop51), a bacterial protein tyrosine phosphatase that translocates into mammalian cells and targets focal adhesions. The EMBO Journal. 1997, 16 (10): 2730-2744. 10.1093/emboj/16.10.2730.
- Guérin I, de Chastellier C: Pathogenic mycobacteria disrupt the macrophage actin filament network. Infection and Immunity. 2000, 68 (5): 2655-2662. 10.1128/IAI.68.5.2655-2662.2000.
- Guérin I, de Chastellier C: Disruption of the actin filament network affects delivery of endocytic contents marker to phagosomes with early endosome characteristics: the case of phagosomes with pathogenic mycobacteria. European Journal of Cell Biology. 2000, 79 (10): 735-749. 10.1078/0171-9335-00092.
- Anes E, Kühnel M, Bos E, Moniz-Pereira J, Habermann A, Griffiths G: Selected lipids activate phagosome actin assembly and maturation resulting in killing of pathogenic mycobacteria. Nature Cell Biology. 2003, 5 (9): 793-802. 10.1038/ncb1036.
- Esposito C, Marasco D, Delogu G, Pedone E, Berisio R: Heparin-binding hemagglutinin HBHA from Mycobacterium tuberculosis affects actin polymerisation. Biochemical and Biophysical Research Communications. 2011, 410 (2): 339-344. 10.1016/j.bbrc.2011.05.159.
- Ting L, Kim A, Cattamanchi A, Ernst J: Mycobacterium tuberculosis inhibits IFN-γ transcriptional responses without inhibiting activation of STAT1. The Journal of Immunology. 1999, 163 (7): 3898-3906.
- Toossi Z, Xia L, Wu M, Salvekar A: Transcriptional activation of HIV by Mycobacterium tuberculosis in human monocytes. Clinical and Experimental Immunology. 1999, 117 (2): 324-330. 10.1046/j.1365-2249.1999.00952.x.
- Lee W, VanderVen BC, Fahey RJ, Russell DG: Intracellular Mycobacterium tuberculosis exploits host-derived fatty acids to limit metabolic stress. Journal of Biological Chemistry. 2013, 288 (10): 6788-6800. 10.1074/jbc.M112.445056.
- Marrero J, Rhee KY, Schnappinger D, Pethe K, Ehrt S: Gluconeogenic carbon flow of tricarboxylic acid cycle intermediates is critical for Mycobacterium tuberculosis to establish and maintain infection. Proceedings of the National Academy of Sciences USA. 2010, 107 (21): 9819-9824. 10.1073/pnas.1000715107.
- Shi L, Sohaskey CD, Pfeiffer C, Datta P, Parks M, McFadden J, North RJ, Gennaro ML: Carbon flux rerouting during Mycobacterium tuberculosis growth arrest. Molecular Microbiology. 2010, 78 (5): 1199-1215. 10.1111/j.1365-2958.2010.07399.x.
- Dunphy KY, Senaratne RH, Masuzawa M, Kendall LV, Riley LW: Attenuation of Mycobacterium tuberculosis functionally disrupted in a fatty acyl-coenzyme A synthetase gene fadD5. Journal of Infectious Diseases. 2010, 201 (8): 1232-1239. 10.1086/651452.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.