The biological context of HIV-1 host interactions reveals subtle insights into a system hijack
© Dickerson et al. 2010
Received: 1 October 2009
Accepted: 7 June 2010
Published: 7 June 2010
In order to replicate, HIV, like all viruses, needs to invade a host cell and hijack it for its own use, a process that involves multiple protein interactions between virus and host. The HIV-1, Human Protein Interaction Database available at NCBI's website captures this information from the primary literature, containing over 2,500 unique interactions. We investigate the general properties and biological context of these interactions and, thus, explore the molecular specificity of the HIV-host perturbation. In particular, we investigate (i) whether HIV preferentially interacts with highly connected and 'central' proteins, (ii) known phenotypic properties of host proteins inferred from essentiality and disease-association data, and (iii) biological context (molecular function, processes and location) of the host proteins to identify attributes most strongly associated with specific HIV interactions.
After correcting for ascertainment bias in the literature, we demonstrate a significantly greater propensity for HIV to interact with highly connected and central host proteins. Unexpectedly, we find there are no associations between HIV interaction and inferred essentiality. Similarly, we find a tendency for HIV not to interact with proteins encoded by genes associated with disease. Crucially, we find that functional categories over-represented in HIV-host interactions are innately enriched for highly connected and central proteins in the host system.
Our results imply that HIV's propensity to interact with highly connected and central proteins is a consequence of interactions with particular cellular functions, rather than being a direct effect of network topological properties. The lack of a propensity for interactions with phenotypically essential proteins suggests a selective pressure to minimise virulence in retroviral evolution. Thus, the specificity of HIV-host interactions is complex, and only superficially explained by network properties.
Human immunodeficiency virus type 1 (HIV-1) and its associated illnesses have major health and socio-economic impacts, particularly in developing countries . Concomitant with the progression of the HIV pandemic there has, thus, been a major international research effort, leading to a detailed understanding of HIV biology. One of the most important aspects of this knowledge is the set of known contacts between viral proteins and the host system[2–4], fundamental to HIV's life cycle. HIV, like all viruses, subjugates and exploits host cells in order to propagate. To achieve this, the HIV virion must first bind to a host cell, primarily CD4+ T cells, macrophages and dendritic cells, and then 'hijack' their cellular machinery . Untreated HIV infection leads to a decrease in CD4+ T cell count, eventually resulting in the loss of cell-mediated immunity, an immunocompromised state and the onset of AIDS (Acquired Immunodeficiency Syndrome) . However, infection with the HIV-like simian immunodeficiency virus (SIV) in its "natural" hosts, does not generally result in the development of AIDS, even when viral loads are high . Despite SIV exhibiting high viral loads, and there being a decreased CD4+ T cell count in natural hosts, these infections are effectively non-pathogenic. The differences between natural and human hosts must, thus, be due to the molecular specificity of viral perturbation of the host system: that is the gain (or loss) of protein-protein interactions during adaptation to different host species or because these host systems differ themselves.
More general work on the use of the host system by pathogens  has found patterns in the types of interactions and infection strategies employed by multiple pathogens. Specifically, pathogens appear to preferentially interact with "key" human proteins that already participate in multiple interactions and/or have central importance in intra-cellular communication. Highly connected proteins, or "hubs", have classically characterised vulnerable points in a network due to their role in a large number of interactions and due to their association with essentiality [9–11]. Similarly, "bottlenecks", that is proteins with a high betweenness centrality, a measure of the total number of shortest paths going through the protein [12, 13], also associate with protein essentiality [14–16]. It has been inferred that this non-uniform contact with the host system represents evolutionary pressure to optimise exploitation of the host cell .
In order to test the hypothesis that the specificity of HIV interactions is in some way explained by network properties, we examine their biological context by integrating known phenotypic properties. Our analysis is based on the HIV-1, Human Protein Interaction Database (HHPID), which currently comprises over 2,500 unique interactions, curated from over 3,200 papers with over half of the interactions validated by being linked to multiple publications [2, 4]. While this data set no doubt contains false positive interactions and potential bias, it nevertheless constitutes an excellent catalogue of HIV-human interactions as represented by published research .
In terms of phenotypic properties, whilst it is difficult to assess human gene essentiality directly, we can use mouse genome knockout data as a proxy for the importance of a gene in terms of a known phenotypic consequence in disrupting its product's function . Similarly, gene-disease associations from The Online Mendelian Inheritance in Man (OMIM) provide another cohort of genes for which deleterious mutations are associated with phenotypic consequence. Integrating these phenotypic data into our network would be expected to corroborate any relationships with topological properties, since proteins with a high connectivity and high betweenness centrality have been demonstrated as having a tendency to be essential [9–11, 14–16].
Correcting for ascertainment bias, however, we find that there is no significant relationship between HIV interaction and protein essentiality, and there is a potential under-representation of disease-association amongst HIV interacting human proteins. Moreover we find that HIV's propensity to interact with highly connected and central proteins is most probably a consequence of interactions with specific cellular functions. Thus, the biological context of HIV-interacting proteins, rather than their individual properties, has been the key determinant in the infection of hosts by retroviruses.
Numerous studies have suggested that betweenness centrality also has some significance for the properties of proteins [14–16]. Does HIV preferentially interact with proteins that have a high betweenness? We calculated the ES of the betweenness centrality (in the same way as for degree), amongst the sample data sets. The ES(betweenness) of HIV-interacting proteins is 0.90 and the average ES amongst the rand (pop) sample is 0.84 (p-value of 1.98 × 10⁻²¹), whilst that of rand (lit) is 0.88 (p-value of 4.36 × 10⁻⁸). Again, despite a significant difference between rand (pop) and rand (lit), HIV-interacting proteins can be shown to have a higher betweenness centrality than expected (Figure 2B).
To highlight the consequence of the betweenness centrality/degree overlap, a partial human-human protein interaction network visualisation was created using HIV-host interactions from HHPID (pink edges) and then incorporating any additional human-human interactions the human partner has (blue edges) from NCBI (see methods). This was annotated with nodes that are hubs (high degree), bottlenecks (high betweenness centrality) or both hubs and bottlenecks (Figure 1). Furthermore, in the full network (n = 21,504), HIV-interacting proteins were over-represented both amongst hubs that are not bottlenecks (n = 92), of which 51.09% are HIV-interacting (p-value of 1.34 × 10⁻²⁸), and amongst bottlenecks that are not hubs (n = 85), of which 32.94% are HIV-interacting (p-value of 6.72 × 10⁻¹²).
These results raise some questions: why has HIV evolved to preferentially interact with key host proteins? Is HIV preferentially interacting with functionally "essential" proteins, as has been suggested for pathogens generally ?
To place our findings in a stronger biological context, we next investigated the relationship between HIV-host interactions and protein function. A functional understanding of the host-pathogen interaction network can be gained by integrating annotations from GO [4, 23]. To investigate HIV's use of the host system in more detail, we identified biological processes over-represented for HIV interactions (see also Pinney et al. ). These categories represent diverse functions exploited by multiple interactions, involving multiple HIV genes, demonstrating that HIV proteins co-ordinate to target specific parts of the human cellular system.
Our results confirm that HIV preferentially interacts with hubs and bottlenecks - key host proteins that are apparently important to the cell (Figures 2 and 4). As proteins with a high connectivity and high betweenness centrality have previously been shown to demonstrate a tendency towards being essential [9–11, 14–16] (and see Figures 4A and 4B), we investigated whether selection for interactions with essential proteins could account for these network topological observations. This was done by integrating phenotypic data - assessed with protein essentiality inferred from mouse knockout data - into our analysis. After correcting for ascertainment bias, however, we found no significant relationship between HIV-1 interaction and protein essentiality (Figure 5B). That is, HIV-1 proteins appear to be no more or less likely to interact with essential proteins than expected by random chance. This lack of over-representation of interactions with essential proteins (despite a significant tendency to interact with key host proteins) could be the result of ancestral selection pressure on retroviruses to minimise interactions with phenotypically essential proteins. Specifically, this would be consistent with selection acting on HIV's retroviral ancestors (due to longstanding co-evolution of retroviruses with host species) to minimise the pathogenic outcome of infection and maximise transmission potential, presumably in a trade-off between virulence and transmissibility [24, 25].
Using an alternate measure of phenotype associated with perturbation: disease association, we investigated these observations further. Disease genes have previously been shown to display no propensity towards encoding either lowly or highly connected proteins  and we find that this is also true of the human protein interaction network when the overlap with essential genes is removed (Figure 6A and 6B). Accordingly, we would expect to observe no relationship between disease-association and HIV interaction amongst human proteins. Initially we find an over-representation of disease-association amongst HIV-interacting human proteins (Figure 7B). However, after compensating for ascertainment bias in the literature, we find the opposite: there appears to be an under-representation of disease-association amongst HIV-interacting proteins (Figure 7B). As there is no apparent relationship between connectivity and disease-association (Figure 6A), the under-representation of disease-association amongst HIV-interacting proteins is not related to network topology. Rather, we hypothesise that this under-representation of disease-association could again represent a selection pressure on retroviral proteins to avoid interacting with proteins associated with adverse phenotypes.
Given these results, how can we explain HIV's tendency to interact with high-degree and high-betweenness host proteins? Dyer and co-workers  have suggested that viral and bacterial proteins tend to interact with key proteins, as they may control critical human cellular processes, through their high connectivity and betweenness centrality. We find that the two concepts are interrelated: certain human proteins are central because they represent essential cellular functions, e.g., immune response. HIV interacts with these proteins to achieve its biology, and their high connectivity is simply secondary to this. Indeed, proteins involved in the over-represented biological process GO terms tend to be highly connected and central (Figure 8). Thus, HIV's propensity to interact with highly connected and central proteins is mainly a consequence of its interactions with particular cellular functions, rather than being related to global network properties in any straightforward way.
The specificity of the HIV-1 host interaction from HHPID, in the context of these underlying host protein functions, permits a detailed analysis of HIV's perturbation of the host system. Indeed, focussing on biological functions (from GO), our analysis demonstrates the directionality and complexity of both pro-pathogen (the majority promoting HIV's replication cycle) and pro-host (the host response to infection) interactions with specific cellular functions . Collectively this highlights the subtle but complex manipulation of the host cell.
Throughout our analyses, we have corrected for the potential effect of ascertainment bias . However, as it is very difficult to provide an accurate estimate for the degree of bias in the HHPID data, we have deliberately chosen a very conservative methodology for bias correction. Therefore, whilst we can be confident that degree and betweenness are both higher than expected after correction, it is possible that we are over-correcting in the case of the essentiality and disease-association data. Our results should therefore be interpreted as indicating no evidence for over-representation of these properties amongst HIV-interacting proteins; further research into bias correction methods for genome-scale data will be needed in order to provide more definitive conclusions.
In order to fully understand HIV's hijack of the host system it will be necessary to study in detail the functional modules that are being exploited. This is exemplified by the complexity of HIV-host interactions, with the same functions being targeted multiple times (Figure 9). It will also be important to study the directionality of interactions, i.e., those that are pro-pathogen interactions as opposed to pro-host interactions, or even bystander interactions (incidental interactions of little consequence to either virus or host). Our finding that there are patterns in the types of interactions HIV makes can be explained by the cellular functions that HIV requires in order to replicate. The apparent tendency for HIV to 'avoid' phenotypically important molecules underlines, despite HIV's recent acquisition by humans, the long-standing relationship that retroviruses have with their hosts. As more data become available, it will be informative to study this co-evolution of pathogens with their (often changing) host species. Understanding the precise molecular specificity of both the adaptation and persistence of pathogens with their hosts will yield novel insights into virulence and, potentially, new intervention strategies.
Human protein interactions were derived from multiple sources: BioGRID (http://www.thebiogrid.org), BIND (http://www.bind.ca) and HPRD (http://www.hprd.org), and filtered from the NCBI "interactions" file (ftp://ftp.ncbi.nlm.nih.gov/gene/GeneRIF). Interaction data contained in these data sets are derived from multiple sources. HIV-host interactions and properties were derived from The HIV-1, Human Protein Interaction Database (available at http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions). The data set currently comprises 1,435 human genes encoding 1,448 proteins that interact with 19 HIV-1 proteins making 2,589 unique interactions, curated from over 3,200 papers published between 1984 and 2007 [2, 4]. This paper also made extensive use of the "gene_info" and "gene2refseq" files provided by the Entrez Gene database (ftp://ftp.ncbi.nlm.nih.gov/gene), filtered to human genes (n = 36,455) and limited to those known to be protein-coding (n = 21,504). All data sets were current as of July 2009.
To predict the essentiality of a human gene, we used the phenotype information of the corresponding mouse ortholog. A human gene was defined as essential if a knockout of its mouse ortholog confers lethality. We obtained the human-mouse orthology and mouse phenotype data from Mouse Genome Informatics http://www.informatics.jax.org/. We considered the annotations of postnatal, prenatal and perinatal lethality as lethal phenotypes, and the rest of the phenotypes as nonlethal ones. Overall, 27,697 annotations were filtered to leave 2,145 genes with an inferred essentiality.
The Online Mendelian Inheritance in Man (OMIM) Morbid Map (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim) contains the most complete curated disorder-gene associations. The data was filtered for the "(3)" tag (http://www.ncbi.nlm.nih.gov/Omim/omimfaq.html#gene_map_symbols), for which there is strong evidence that at least one mutation in the particular gene is causative to the disorder, to identify 3,328 unique diseases across 3,049 genes. We used the gene_info file to convert OMIM gene symbols to NCBI GeneIDs to facilitate integration. This data was used as a proxy for mild phenotypic effect.
GO terms were collected for each human gene from the NCBI "gene2go" file (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA). Term ancestors were then determined for each term from "gene_ontology_edit.obo" (http://www.geneontology.org) to ensure complete coverage. Selected GO terms were taken from [3, 23], retested for over-representation amongst HIV-interacting human proteins using Fisher's Exact Test in R and separated into the three ontologies: biological process, cellular component and molecular function.
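The over-representation test described above can be sketched in a few lines. This is a minimal illustration, not the R call used in the study: it assumes a one-sided test (enrichment only), and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_over_representation(k, n_hiv, term_total, population):
    """One-sided Fisher's exact test p-value for over-representation.

    k          -- HIV-interacting genes annotated with the GO term
    n_hiv      -- all HIV-interacting genes
    term_total -- all genes annotated with the GO term
    population -- all genes in the background set

    Returns P(X >= k) where X is hypergeometric, i.e. the chance of seeing
    at least k annotated genes when n_hiv genes are drawn at random.
    """
    upper = min(n_hiv, term_total)
    denominator = comb(population, n_hiv)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper.
    numerator = sum(
        comb(term_total, i) * comb(population - term_total, n_hiv - i)
        for i in range(k, upper + 1)
    )
    return numerator / denominator
```

For example, `fisher_over_representation(2, 3, 4, 10)` asks how likely it is that at least 2 of 3 sampled genes carry an annotation held by 4 of 10 genes. Exact integer arithmetic keeps the result stable even for genome-sized backgrounds, at the cost of speed.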
Networks were visualised as graph-based layouts using Cytoscape .
The degree of a vertex in a network is the number of connections it has. In the case of a PPI network, this represents the number of other proteins the vertex interacts with. The degree of a single vertex is therefore equal to the number of adjacent edges.
A protein with a high degree is considered a hub and these have frequently been identified as the most vulnerable points in biological networks [9–11, 14–16]. Yu et al. classify a protein as a hub if it falls within the top 20% of proteins when sorted according to their degree. A cut-off of 20% in our data would categorise as a hub any protein with a degree ≥3. We therefore chose a stricter cut-off of 2%, so that a protein is only classified as a hub if it has a degree ≥23.
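The degree calculation and percentile cut-off can be illustrated with a short sketch. It assumes the network is supplied as an undirected edge list, that self-interactions and duplicate edges are ignored, and that ties at the cut-off are all counted as hubs; none of these details are stated in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def degrees(edges):
    """Return {vertex: degree} for an undirected edge list."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # assumption: self-interactions do not count towards degree
            neighbours[u].add(v)
            neighbours[v].add(u)
    return {vertex: len(ns) for vertex, ns in neighbours.items()}


def top_fraction_cutoff(scores, fraction):
    """Lowest score found within the top `fraction` of ranked vertices."""
    ranked = sorted(scores.values(), reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[n_top - 1]


def hubs(edges, fraction=0.02):
    """Vertices whose degree reaches the top-`fraction` cut-off."""
    deg = degrees(edges)
    cutoff = top_fraction_cutoff(deg, fraction)
    return {vertex for vertex, d in deg.items() if d >= cutoff}
```

Because every vertex with a degree equal to the cut-off is kept, the returned set can be slightly larger than the nominal fraction. Vertices with no interactions never appear in an edge list, so they would need to be added separately if the ranking is meant to cover all genes.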
Betweenness is a centrality measure of a vertex within a graph that summarises its relative importance both locally and globally [14–16]. Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not and are considered bottlenecks. Bottlenecks are generally a more accurate indicator of essentiality than degree or hub propensity, despite the two being correlated. For a graph G = (V, E), the betweenness centrality $C_B(v)$ for vertex v is $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$, where $\sigma_{st}$ is the number of geodesic (shortest) paths from s to t, and $\sigma_{st}(v)$ is the number of geodesic paths from s to t that pass through the vertex v. We use Brandes' algorithm to calculate the betweenness centrality of all vertices in G, normalised by dividing through the number of pairs of vertices not including v: $(|V| - 1)(|V| - 2)$. As for hubs, we define a bottleneck as the top 2% of ranked proteins, so a bottleneck is classified as such with a normalised betweenness centrality ≥ 2.43 × 10⁻⁴.
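For readers who want to see the calculation concretely, the following is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph, with the normalisation described above. It assumes the graph is given as an adjacency dictionary in which every vertex appears as a key; the representation and names are illustrative and not taken from the study.

```python
from collections import deque


def betweenness(adj):
    """Normalised betweenness centrality via Brandes' algorithm.

    adj -- {vertex: set of neighbouring vertices}, undirected and unweighted.
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # w reached for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest vertices first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each source s contributes ordered (s, t) pairs, so divide by (n-1)(n-2).
    n = len(adj)
    scale = (n - 1) * (n - 2)
    return {v: (c / scale if scale > 0 else 0.0) for v, c in cb.items()}
```

On the three-vertex path a-b-c, the middle vertex lies on the only shortest path between the other two and scores 1.0, while the end vertices score 0.0. The running time is proportional to the number of vertices times the number of edges, which is what makes the calculation feasible for a genome-scale network.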
The Wilcoxon rank-sum test, implemented in R , was used to compare the distributions of degree/betweenness across the entire genome against individual over-represented biological processes GO terms (see above). This enables us to determine whether the distributions for each GO term are significantly different to that found in the genome.
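As an illustration of the comparison described above, the sketch below computes a two-sided rank-sum p-value using midranks for ties and the normal approximation. It is not the R implementation used in the study: it omits the continuity correction, treats the two groups as independent samples, and uses hypothetical names throughout.

```python
from math import erfc, sqrt


def rank_sum_test(sample, background):
    """Return (U statistic for `sample`, two-sided p-value)."""
    pooled = sorted([(x, 0) for x in sample] + [(x, 1) for x in background])
    n1, n2 = len(sample), len(background)
    n = n1 + n2
    rank_sum = 0.0  # sum of midranks belonging to `sample`
    tie_term = 0.0  # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / sqrt(var_u)
    return u, erfc(abs(z) / sqrt(2))
```

In use, `sample` would hold the degree (or betweenness) values of the genes annotated with one GO term and `background` the values for the comparison set. The tie correction matters here because degree is a small integer for most proteins, so tied ranks are common.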
For every protein-coding gene contained in Entrez Gene (n = 21,504), we obtained the number of unique publications using "Entrez Programming Utilities" (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). In total, 409,964 publications were recorded, with an average of 19 articles per gene (2,217 genes were not matched to a publication).
Rejection sampling was used to generate sets of random genes that matched the publication frequency distribution of the HIV-interacting human set f(x) (n = 1,431), from the overall protein-coding gene population with publication frequency distribution g(x) (n = 21,504).
The John von Neumann Monte Carlo algorithm was used, such that instead of sampling directly from the distribution f(x), we use an envelope distribution Mg(x), where M is a constant selected such that f(x) < Mg(x) for all observed publication counts x:
A) A gene (with publication count x) is selected at random from the overall population with publication frequency distribution g(x). A random number U from U(0,1) is also selected.
B) If $U < \frac{f(x)}{M g(x)}$, x is accepted as a realisation of f(x) and the gene is kept; otherwise step (A) is repeated.
The procedure is repeated until a set of genes of the required size is obtained. The samples match the distribution with a p-value of 0.43 (chi-squared, Figure 3). Using this procedure we constructed 10,000 sets of 1,431 randomised genes, rand (lit) , matching the publication frequency distribution of the HIV-interacting human genes. For comparison, 10,000 fully randomised samples, rand (pop) , were also generated by standard random sampling from the set of all genes. When comparing observed properties to these random samples, a z-score calculation was used to standardise the raw score s of each property tested, and this was converted to a P-value using R . This enables us to determine whether any results in the HIV-interacting set are due to ascertainment bias.
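The sampling procedure and the z-score conversion can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: f(x) and g(x) are estimated as empirical frequencies of publication counts, the envelope constant is taken as the largest ratio f(x)/g(x), each gene is used at most once per sample, and the p-value is one-sided (upper tail). All function names are hypothetical.

```python
import random
from collections import Counter
from math import erfc, sqrt
from statistics import mean, stdev


def empirical_pmf(counts):
    """Relative frequency of each publication count."""
    total = len(counts)
    return {x: c / total for x, c in Counter(counts).items()}


def rejection_sample(population, target_counts, size, rng=random):
    """Draw `size` distinct genes matching the target publication distribution.

    population    -- {gene: publication count} for all candidate genes
    target_counts -- publication counts of the reference (HIV-interacting) genes
    """
    f = empirical_pmf(target_counts)
    g = empirical_pmf(list(population.values()))
    if any(x not in g for x in f):
        raise ValueError("every target count must occur in the population")
    eligible = sum(1 for x in population.values() if x in f)
    if size > eligible:
        raise ValueError("not enough genes with matching publication counts")
    m = max(f[x] / g[x] for x in f)  # envelope: f(x) <= m * g(x) everywhere
    genes = list(population)
    chosen = set()
    while len(chosen) < size:
        gene = rng.choice(genes)  # step A: draw a gene from g
        x = population[gene]
        u = rng.random()  # step A: U ~ U(0, 1)
        if gene not in chosen and u < f.get(x, 0.0) / (m * g[x]):  # step B
            chosen.add(gene)
    return chosen


def upper_tail_p(observed, null_values):
    """z-score of `observed` against the random samples, with one-sided p-value."""
    z = (observed - mean(null_values)) / stdev(null_values)
    return z, 0.5 * erfc(z / sqrt(2))
```

Calling `rejection_sample` repeatedly gives the literature-matched sets; computing a property on each set yields the `null_values` against which the observed HIV value is standardised. Drawing distinct genes slightly perturbs the target distribution when the pool of well-studied genes is small, which is one reason to check the fit of the samples as described above.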
Following the example of Dyer et al., we adapted the gene set enrichment analysis (GSEA) method of Subramanian et al. to test for significant differences between HIV-interacting and random sets of genes (both rand (lit) and rand (pop)). For a graph G = (V, E) let L be the list V ranked by either degree or by betweenness centrality. Let S be a subset of vertices within L, for example, the vertices that are HIV-interacting, rand (lit) or rand (pop). Let $l_i$ be the value (of degree or centrality) at index i of L, such that 1 ≤ i ≤ |L|. We write i ∈ S when the protein whose rank is i belongs to S. First, calculate $N_S = \sum_{j \in S} l_j$, the sum of all the values of S. Next, for each index i of L, we compute two values: $P_{hit}(S, i) = \sum_{j \in S,\, j \le i} \frac{l_j}{N_S}$, the weighted fraction of proteins in S with an index ≤ i, and $P_{miss}(S, i) = \sum_{j \notin S,\, j \le i} \frac{1}{|L| - |S|}$, the fraction of proteins not in S with an index ≤ i. The enrichment score is therefore the largest positive value of the difference between the two, $es(S, L) = \max_i \left[ P_{hit}(S, i) - P_{miss}(S, i) \right]$. A large positive value of es(S, L) indicates that the proteins in S have high degree or high betweenness centrality. To compute p-values for the observed es(S, L), Dyer and co-workers selected |S| random proteins from L 1,000,000 times and estimated the p-value based on this distribution. However, we expect S to be biased, so similarly biased random samples of size |S| must be taken from L. We therefore used rejection sampling to generate 10,000 samples of size $|S_{HIV}|$ with the publication frequency distribution of $S_{HIV}$, in preference to the naïve random selection. A p-value was calculated from the z-score using R.
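The enrichment score lends itself to a short worked sketch. The version below assumes L is ranked from highest to lowest value, breaks ties in input order, and reports 0.0 when the running difference is never positive; these choices and the names used are assumptions for illustration, not details confirmed by the text.

```python
def enrichment_score(values, in_set):
    """Largest positive running difference between P_hit and P_miss.

    values -- {protein: degree or betweenness centrality}
    in_set -- set of proteins forming the subset S
    """
    ranked = sorted(values, key=values.get, reverse=True)  # the ranked list L
    total_weight = sum(values[p] for p in ranked if p in in_set)
    n_miss = sum(1 for p in ranked if p not in in_set)
    if total_weight == 0 or n_miss == 0:
        raise ValueError("S needs positive total weight and L needs non-members")
    p_hit = 0.0
    p_miss = 0.0
    best = 0.0
    for protein in ranked:
        if protein in in_set:
            p_hit += values[protein] / total_weight  # weighted step for a hit
        else:
            p_miss += 1.0 / n_miss  # uniform step for a miss
        best = max(best, p_hit - p_miss)
    return best
```

With values `{'a': 4, 'b': 3, 'c': 2, 'd': 1}`, the subset `{'a', 'b'}` sits entirely at the top of the ranking and scores 1.0, whereas `{'c', 'd'}` sits at the bottom and scores 0.0. Scores for the random samples are computed the same way and give the distribution against which the observed score is standardised.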
JED is supported by a Wellcome Trust studentship and JWP by a Royal Society University Research Fellowship. Thanks also to the Apple Research & Technology Support scheme for support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.