### Main Data sets

Human protein interactions were derived from multiple sources: BioGRID (http://www.thebiogrid.org), BIND (http://www.bind.ca) and HPRD (http://www.hprd.org), and filtered from the NCBI "interactions" file (ftp://ftp.ncbi.nlm.nih.gov/gene/GeneRIF). The interaction data in each of these data sets are themselves collated from multiple sources. HIV-host interactions and their properties were derived from the HIV-1, Human Protein Interaction Database (available at http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions). This data set currently comprises 1,435 human genes encoding 1,448 proteins that interact with 19 HIV-1 proteins, making 2,589 unique interactions curated from over 3,200 papers published between 1984 and 2007 [2, 4]. This paper also made extensive use of the "gene_info" and "gene2refseq" files provided by the Entrez Gene database (ftp://ftp.ncbi.nlm.nih.gov/gene), filtered to human genes (n = 36,455) and limited to those known to be protein-coding (n = 21,504). All data sets were current as of July 2009.

### Protein Essentiality

To predict the essentiality of a human gene, we used the phenotype information of the corresponding mouse ortholog. A human gene was defined as essential if a knockout of its mouse ortholog confers lethality. We obtained the human-mouse orthology and mouse phenotype data from Mouse Genome Informatics (http://www.informatics.jax.org/) [17]. We considered annotations of postnatal, prenatal and perinatal lethality to be lethal phenotypes, and all other phenotypes to be nonlethal. Overall, 27,697 annotations were filtered to leave 2,145 genes with an inferred essentiality.

### Gene Ontology

GO terms [23] were collected for each human gene from the NCBI "gene2go" file (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA). The ancestors of each term were then determined from "gene_ontology_edit.obo" (http://www.geneontology.org) to ensure complete coverage. Selected GO terms were taken from [3, 23], retested for over-representation amongst HIV-interacting human proteins using Fisher's Exact Test in R [26], and separated into the three ontologies: biological process, cellular component and molecular function.
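The two steps described above, ancestor propagation and an over-representation test, can be sketched in Python as follows. This is a minimal illustration rather than the R procedure used in the study: the function names, the toy term identifiers and the choice of a one-sided (upper-tail) test are all assumptions made for the example.

```python
from math import comb

def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> set-of-parents mapping.

    `parents` is a hypothetical in-memory form of the ontology graph.
    """
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def fisher_exact_greater(set_with_term, set_size, pop_with_term, pop_size):
    """One-sided Fisher's exact test (assumed direction: over-representation).

    Probability of seeing at least `set_with_term` annotated genes in a set of
    `set_size` genes drawn from `pop_size` genes, of which `pop_with_term`
    carry the term (hypergeometric upper tail).
    """
    upper = min(pop_with_term, set_size)
    tail = sum(
        comb(pop_with_term, i) * comb(pop_size - pop_with_term, set_size - i)
        for i in range(set_with_term, upper + 1)
    )
    return tail / comb(pop_size, set_size)

# Toy ontology: leaf -> mid -> root (identifiers are illustrative only).
toy_parents = {"leaf": {"mid"}, "mid": {"root"}}
leaf_ancestors = ancestors("leaf", toy_parents)

# Toy counts: 10 genes, 4 annotated; a set of 3 genes, all 3 annotated.
p_toy = fisher_exact_greater(3, 3, 4, 10)
```

Exact integer arithmetic is used for the binomial coefficients, so the sketch stays accurate for populations of the size considered here, at the cost of speed.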

### Network Visualisations

Networks were visualised as graph-based layouts using Cytoscape [27].

### Degree, Hubs, Betweenness and Bottlenecks

The degree of a vertex in a network is the number of connections it has, and is therefore equal to the number of edges adjacent to that vertex. In a PPI network, the degree represents the number of other proteins with which the protein interacts.

A protein with a high degree is considered a hub, and hubs have frequently been identified as the most vulnerable points in biological networks [9–11, 14–16]. Yu et al. [16] classify a protein as a hub if it falls within the top 20% of proteins when sorted according to their degree. In our data, a cut-off of 20% would categorise any protein with a degree ≥3 as a hub; we therefore chose a stricter cut-off of 2%, so that a protein is classified as a hub only if it has a degree ≥23.
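As an illustration of the hub definition, the following Python sketch computes vertex degrees from an edge list and derives the degree cut-off for a chosen top fraction. The function names and the toy edge list are hypothetical, and the handling of ties and self-interactions is an assumption rather than a description of the original analysis.

```python
import math
from collections import defaultdict

def degrees(edges):
    """Degree of each vertex: the number of distinct other vertices it touches."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # assumption: self-interactions do not count towards degree
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {v: len(n) for v, n in neighbours.items()}

def top_fraction_cutoff(values, fraction):
    """Smallest value found among the top `fraction` of the ranked values."""
    ranked = sorted(values, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[k - 1]

def hubs(deg, fraction=0.02):
    """Vertices whose degree is at or above the top-`fraction` cut-off."""
    cutoff = top_fraction_cutoff(deg.values(), fraction)
    # Assumption: vertices tied with the cut-off value are all kept.
    return {v for v, d in deg.items() if d >= cutoff}, cutoff

toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
toy_deg = degrees(toy_edges)
toy_hubs, toy_cutoff = hubs(toy_deg, fraction=0.25)
```

Because ties at the cut-off are retained, the resulting hub set can be slightly larger than the nominal fraction.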

Betweenness is a centrality measure of a vertex within a graph that summarises its relative importance both locally and globally [14–16]. Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not, and are considered bottlenecks. Bottlenecks are generally a more accurate indicator of essentiality than degree or hub propensity [16], despite the two being correlated. For a graph $G = (V, E)$, the betweenness centrality $C_B(v)$ for vertex $v$ is:

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of geodesic (shortest) paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of geodesic paths from $s$ to $t$ that pass through vertex $v$. We use Brandes' algorithm [12] to calculate the betweenness centrality of all vertices in $G$, normalised by dividing by the number of pairs of vertices not including $v$: $(|V|-1)(|V|-2)$. As for hubs, we define bottlenecks as the top 2% of ranked proteins, so a protein is classified as a bottleneck if its normalised betweenness centrality is $\geq 2.43 \times 10^{-4}$.
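For readers who want to see the computation spelled out, the following is a minimal Python sketch of Brandes' algorithm for an unweighted, undirected graph. It is not the implementation used in the study: the adjacency-dictionary input and the function name are assumptions, and the sum is taken over ordered pairs of source and target so that dividing by (n-1)(n-2) gives values between 0 and 1.

```python
from collections import deque

def betweenness(adj):
    """Normalised betweenness centrality via Brandes' algorithm.

    `adj` maps each vertex to the set of its neighbours (undirected,
    unweighted; a hypothetical input format for this sketch).
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # w seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    n = len(adj)
    scale = (n - 1) * (n - 2)  # ordered pairs of vertices excluding v
    return {v: (c / scale if scale > 0 else 0.0) for v, c in cb.items()}

# Toy path graph A - B - C: every shortest path between A and C passes through B.
toy_path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_bc = betweenness(toy_path)
```

Each vertex serves once as the source of a breadth-first search, which gives the O(|V||E|) running time that makes the calculation practical for genome-scale networks.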

The Wilcoxon rank-sum test, implemented in R [26], was used to compare the distribution of degree or betweenness across the entire genome against the distribution for each over-represented biological process GO term (see above). This enables us to determine whether the distribution for each GO term is significantly different from that found in the genome.

### Ascertainment bias

For every protein-coding gene contained in Entrez Gene (n = 21,504), we obtained the number of unique publications using the "Entrez Programming Utilities" (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). In total, 409,964 publications were recorded, with an average of 19 articles per gene (2,217 genes were not matched to a publication).

Rejection sampling was used to generate sets of random genes that matched the publication frequency distribution of the HIV-interacting human set, *f*(*x*) (n = 1,431), from the overall protein-coding gene population with publication frequency distribution *g*(*x*) (n = 21,504).

The John von Neumann Monte Carlo algorithm [28] was used: instead of sampling directly from the distribution *f*(*x*), we sample from an envelope distribution *Mg*(*x*), where *M* is a constant selected such that *f*(*x*) < *Mg*(*x*) for all observed publication counts *x*:

A) A gene (with publication count *x*) is selected at random from the overall population with publication frequency distribution *g*(*x*). A random number *U* from *U*(0,1) is also selected.

B) If $U < \frac{f(x)}{M g(x)}$, *x* is accepted as a realisation of *f*(*x*) and the gene is kept; otherwise, step (A) is repeated.
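Steps (A) and (B) can be sketched in Python as below. This is an illustrative reading of the procedure, not the code used in the study: the function names and data layout are hypothetical, publication counts are compared without binning, *M* is taken as the largest observed ratio of target to population frequency, and each gene is allowed to appear only once per sample. All of these are assumptions.

```python
import random
from collections import Counter

def frequency(counts):
    """Relative frequency of each publication count."""
    total = len(counts)
    return {x: c / total for x, c in Counter(counts).items()}

def rejection_sample(population, target_counts, size, rng):
    """Draw `size` distinct genes whose publication counts follow the target.

    `population` maps gene -> publication count; `target_counts` lists the
    publication counts of the set whose distribution should be matched.
    """
    f = frequency(target_counts)
    g = frequency(list(population.values()))
    eligible = sum(1 for x in population.values() if x in f)
    if size > eligible:
        raise ValueError("not enough genes with matching publication counts")
    # Assumption: M is the largest observed ratio f(x) / g(x).
    m = max((f[x] / g[x] for x in f if x in g), default=1.0)
    genes = list(population)
    chosen = set()
    while len(chosen) < size:
        gene = rng.choice(genes)   # step (A): random gene from the population
        x = population[gene]
        u = rng.random()           # step (A): U drawn from U(0, 1)
        if gene not in chosen and u < f.get(x, 0.0) / (m * g[x]):
            chosen.add(gene)       # step (B): accept and keep the gene
    return chosen

# Toy population: 80 genes with 1 publication and 20 genes with 5 publications.
toy_population = {f"gene{i}": (1 if i < 80 else 5) for i in range(100)}
toy_sample = rejection_sample(toy_population, [5] * 10, 10, random.Random(0))
```

In the toy case the target set contains only genes with five publications, so every accepted gene has that count, however many draws are rejected along the way.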

The procedure is repeated until a set of genes of the required size is obtained. The samples match the distribution with a p-value of 0.43 (chi-squared, Figure 3). Using this procedure we constructed 10,000 sets of 1,431 randomised genes, $\mathit{rand}_{(lit)}$, matching the publication frequency distribution of the HIV-interacting human genes. For comparison, 10,000 fully randomised samples, $\mathit{rand}_{(pop)}$, were also generated by standard random sampling from the set of all genes. When comparing observed properties to these random samples, a z-score calculation was used to standardise the raw score $s$ of each property tested,

$$z = \frac{s - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation of that property across the random samples, and this was converted to a P-value using R [26]. This enables us to determine whether any results in the HIV-interacting set are due to ascertainment bias.
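The standardisation step can be illustrated with a short Python sketch using only the standard library. The text does not state whether the sample or population standard deviation was used, or whether the P-value was one- or two-sided, so the choices below (sample standard deviation, both tail variants offered) are assumptions, as are the function names.

```python
from statistics import NormalDist, mean, stdev

def z_score(observed, random_scores):
    """Standardise an observed score against scores from the random samples."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # assumption: sample standard deviation
    return (observed - mu) / sigma

def upper_tail_p(z):
    """One-sided P-value: probability of a standard normal value above z."""
    return 1.0 - NormalDist().cdf(z)

def two_sided_p(z):
    """Two-sided P-value under a standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Toy example: observed score 12 against random-sample scores 8, 10 and 12.
toy_z = z_score(12, [8, 10, 12])
```

The conversion assumes that the property is approximately normally distributed across the random samples, which is more plausible with 10,000 samples than with the three toy values shown.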

### Gene set enrichment analysis

Following the example of Dyer et al. [8], we adapted the gene set enrichment analysis (GSEA) method of Subramanian et al. [19] to test for significant differences between HIV-interacting and random sets of genes (both $\mathit{rand}_{(lit)}$ and $\mathit{rand}_{(pop)}$). For a graph $G = (V, E)$, let $L$ be the list of vertices $V$ ranked by either degree or betweenness centrality. Let $S$ be a subset of the vertices in $L$, for example the vertices that are HIV-interacting, $\mathit{rand}_{(lit)}$ or $\mathit{rand}_{(pop)}$. Let $l_i$ be the value (of degree or centrality) at index $i$ of $L$, such that $1 \leq i \leq |L|$. We write $i \in S$ when the protein whose rank is $i$ belongs to $S$. First, calculate

$$N_R = \sum_{i \in S} l_i,$$

the sum of all the values of $S$. Next, for each index $i$ of $L$, we compute two values,

$$P_{hit}(S, i) = \sum_{j \in S,\, j \leq i} \frac{l_j}{N_R},$$

the weighted fraction of proteins in $S$ with an index $\leq i$, and

$$P_{miss}(S, i) = \sum_{j \notin S,\, j \leq i} \frac{1}{|L| - |S|},$$

the fraction of proteins not in $S$ with an index $\leq i$. The enrichment score $es(S, L)$ is therefore the largest positive value of $P_{hit}(S, i) - P_{miss}(S, i)$ over all indices $i$. A large positive value of $es(S, L)$ indicates that the proteins in $S$ have high degree or high betweenness centrality.

To compute p-values for the observed $es(S, L)$, Dyer and co-workers [8] selected $|S|$ random proteins from $L$ 1,000,000 times and estimated the p-value based on this distribution. However, we expect $S$ to be biased, so similarly biased random samples of size $|S|$ must be taken from $L$. We therefore used rejection sampling to generate 10,000 samples of size $|S_{HIV}|$ with the distribution of $S_{HIV}$, in preference to naïve random selection. A p-value was calculated from the z-score using R [26].
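The running-sum statistic described in this section can be sketched in Python as follows. The function name and input layout are hypothetical, and the sketch assumes that the ranked list is sorted from the highest to the lowest value and that a score of zero is returned when the running difference is never positive.

```python
def enrichment_score(ranked, values, subset):
    """Largest positive running difference between hit and miss fractions.

    `ranked` lists proteins from highest to lowest value, `values` maps each
    protein to its degree or betweenness, and `subset` is the set S.
    """
    total_hit_weight = sum(values[p] for p in ranked if p in subset)
    n_miss = sum(1 for p in ranked if p not in subset)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("subset must have positive weight and a non-empty complement")
    p_hit = 0.0
    p_miss = 0.0
    best = 0.0  # assumption: report 0 if the difference is never positive
    for protein in ranked:
        if protein in subset:
            p_hit += values[protein] / total_hit_weight  # weighted hit step
        else:
            p_miss += 1.0 / n_miss                       # uniform miss step
        best = max(best, p_hit - p_miss)
    return best

toy_values = {"a": 4, "b": 3, "c": 2, "d": 1}
toy_ranked = sorted(toy_values, key=toy_values.get, reverse=True)
toy_es = enrichment_score(toy_ranked, toy_values, {"a", "d"})
```

To obtain a p-value in the manner described above, the same function would be applied to each randomly drawn subset and the observed score compared with the resulting distribution.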