HINT: High-quality protein interactomes and their applications in understanding human disease
© Das and Yu; licensee BioMed Central Ltd. 2012
Received: 22 July 2011
Accepted: 30 June 2012
Published: 30 July 2012
A global map of protein-protein interactions in cellular systems provides key insights into the workings of an organism. A repository of well-validated high-quality protein-protein interactions can be used in both large- and small-scale studies to generate and validate a wide range of functional hypotheses.
We develop HINT (http://hint.yulab.org) - a database of high-quality protein-protein interactomes for human, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Oryza sativa. These were collected from several databases and filtered both systematically and manually to remove low-quality/erroneous interactions. The resulting datasets are classified by type (binary physical interactions vs. co-complex associations) and data source (high-throughput systematic setups vs. literature-curated small-scale experiments). We find strong sociological sampling biases in literature-curated datasets of small-scale interactions. An interactome without such sampling biases was used to understand network properties of human disease-genes - hubs are unlikely to cause disease, but if they do, they usually cause multiple disorders.
HINT is of significant interest to researchers in all fields of biology as it addresses the ubiquitous need for a repository of high-quality protein-protein interactions. These datasets can be used to generate specific hypotheses about specific proteins and/or pathways, as well as to analyze global properties of cellular networks. HINT will be regularly updated and all versions will be tracked.
Numerous recent efforts in systems biology have tried to characterize the set of all possible pairwise physical interactions or the binary protein “interactome” of an organism [1–3]. Most proteins perform their functions through interactions. Thus, these large-scale maps are critical in elucidating the biological roles of functional products of genes that are identified by large-scale genome and cDNA sequencing projects. Because most of these efforts are discovery-oriented and try to explore previously unknown functionalities, it is of utmost importance to ensure that the resultant maps are of high quality. Erroneous results at this stage could propagate into both ill-conceived hypotheses and futile downstream experiments. Moreover, it has been shown that high-quality interaction networks can provide key insights into fundamental topological and biological properties of cellular systems [5–8]. Although there are numerous databases [9–16] that try to systematically curate the entire repository of interactions for different organisms, there has been very little effort in filtering out unreliable ones. This has led to low overlaps between independent publications and resultant confusion as to which interactions are correct [17–19].
There are two major types of protein-protein interaction data – binary physical interactions and co-complex associations. While some databases distinguish between these two orthogonal datasets, others fail to do so. Binary interactions represent a direct biophysical interaction between two proteins. On the other hand, co-complex associations provide information about co-membership in a complex. A lot of these associations may actually represent indirect interactions [17, 18]. The biological information conveyed by these two kinds of interactions is different and for many applications it is necessary to have a clear distinction between these two.
There are two major methods to obtain a global map of binary interactions – literature-curation (LC) and high-throughput experiments (HT). LC refers to systematically collecting interaction data from thousands of small-scale studies directed at validating a single or a few specific hypotheses. On the other hand, HT experiments produce large-scale interaction maps. Because most LC data are generated by hypothesis-driven experiments, it is much easier to infer biological function from those studies as compared to HT experiments. On the other hand, although the search space of some HT experiments might be focused on certain functional groups, most HT experiments are not designed to detect the presence or absence of specific interactions. Any experiment can have two kinds of bias – “assay bias” and “sampling bias”. The first arises because no assay is perfect and all experiments, HT or small-scale, have their own characteristic biases. However, small-scale studies also have a sampling bias, i.e., they are typically focused on one or a few proteins of interest and hence selectively sample interactions from only a part of the search space. HT experiments are free of this sampling bias, i.e., the search space is scanned without a priori expectations [17, 19]. Thus, for many global topological analyses, it is often necessary to use only the HT datasets.
Here, we describe a publicly available protein-protein interaction database, HINT (High-quality INTeractomes) that directly addresses the above three issues and provides high-quality binary and co-complex interactions for human, S. cerevisiae, S. pombe, and O. sativa. The binary interactomes have also been divided into LC and HT subsets. Using these datasets, we show that there are significant sociological sampling biases in LC datasets, i.e., well-studied proteins tend to have more interactions in LC datasets for both human and S. cerevisiae. Finally, using only the high-quality HT interactions for human, we find that disease genes (i.e., genes that have a causal connection with one or more diseases) with more interactions tend to cause more diseases. Even though this result is unexpected in light of previous findings that interaction hubs are less likely to cause disease [21, 22], it will help understand mechanisms of various disease processes and develop corresponding treatments.
The set of all protein-protein interactions for the organisms was downloaded from the public databases – BioGrid, DIP, HPRD, IntAct, iRefWeb, MINT, MIPS and VisAnt. Not all four organisms were present in all the databases. Though some of the databases mentioned above store both genetic and physical interactions, only physical interactions were used in building the interactomes. Certain tools [13, 23] also provide scoring schemes for protein-protein interactions. However, we do not include these for HINT as they integrate both computational predictions and experimentally determined interactions. Our goal is to provide a repository of only experimentally well-validated high-quality protein-protein interactions.
Source databases – download dates and versions
Database: Download date (version if applicable)
BioGrid: 11 January 2012 (v 3.1.84)
DIP: 12 January 2012
HPRD: 12 January 2012 (Release 9)
IntAct: 12 January 2012 (2011 release)
iRefWeb: 12 January 2012 (v 4.1)
MINT: 17 January 2012 (2012 release)
MIPS: 17 January 2012
VisAnt: 13 January 2012 (Release 3.93)
Database statistics – Summary of high-quality interactions obtained from the different data-sources and those finally included in HINT
S. cerevisiae binary
S. cerevisiae co-complex
S. pombe binary
S. pombe co-complex
O. sativa binary
For the binary network, we generated two sub-interactomes – the high-quality LC (HQ-LC) and the high-quality HT (HQ-HT) sub-interactomes. Interactions that are supported by both forms of evidence belong to both.
Network statistics for binary interactions: unfiltered binary interactions, filtered binary interactions, number of proteins in filtered network, average degree of filtered network.
Network statistics for co-complex interactions: unfiltered co-complex interactions, filtered co-complex interactions, number of proteins in filtered network, average degree of filtered network.
HT and LC interactions: number of HT interactions, number of LC interactions.
There has been a great deal of effort in the literature at discovering new protein-protein interactions in different species to gain an understanding of the entire interactome of each organism. However, due to experimental errors and inaccurate curation, databases often contain interactions that are low-quality or erroneous. Since accuracy is of paramount importance in generating new hypotheses using these interaction data, it is essential to have an easily accessible repository of high-quality binary protein-protein interactions. HINT is a repository created by combining information from commonly used databases. To ensure quality control, we adopt the following protocol. Since the number of HT publications is relatively low as compared to the vast number of small-scale studies, we manually inspect each of the HT studies (Additional file 3: Table S1 and Additional file 4: Table S2). We ensure that high-quality HT experiments included in HINT have been verified by orthogonal traditional assays (e.g., co-immunoprecipitation). Experiments that do not perform any validation of their screen are considered low-quality and are therefore removed. More recently, we developed a statistical framework to comprehensively evaluate the quality of HT datasets verified by orthogonal assays in both human and S. cerevisiae [17, 19]. Using this framework, we can quantitatively and experimentally measure the quality of individual interactions, as well as the whole dataset. The quality of interactions reported by an HT experiment can be measured by two independent statistical parameters – the number of interactions validated, i.e., the “validation rate” and the number of interactions that could be re-tested in the validation carried out, i.e., the “retest rate”.
The first parameter is a measure of the confidence associated with the validation carried out (i.e., more confidence can be associated with the results when a larger fraction of the reported interactions are validated), while the second one directly assays the reproducibility of the HT experiment. We carried out a comprehensive re-curation for all HT experiments included in HINT. A list of these parameters for all the HT experiments can be found in Additional file 5: Table S3 and Additional file 6: Table S4.
On the other hand, since it is impossible to manually check all small-scale studies, we require two independent publications to report the same interaction for it to be included in our dataset. This is because while some interactions from dedicated small-scale studies are high-quality and have been repeated multiple times in the literature, a significant fraction of interactions from small-scale experiments are not easily reproducible. In fact, many of the interactions that cannot be reproduced are supported by only one publication, were not produced by dedicated experiments and were often not even mentioned in the paper (Additional file 7: Figure S3). More importantly, it has been experimentally shown that such interactions are indeed of low quality [17, 19]. Thus, our repository of high-quality interactions contains only manually validated HT experiments and interactions from small-scale studies that have been reported at least twice in the literature.
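The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (protein pair, publication identifier, source type), the set of manually vetted HT publications, and all identifiers are hypothetical placeholders, not HINT's actual data or pipeline. The sketch also assumes that HT and LC support are counted separately, which the text does not state explicitly.

```python
from collections import defaultdict

# Hypothetical evidence records: (protein_a, protein_b, publication_id, source),
# where source is "HT" (high-throughput) or "LC" (literature-curated).
records = [
    ("A", "B", "pub1", "HT"),
    ("B", "A", "pub2", "LC"),
    ("A", "C", "pub3", "LC"),
    ("C", "D", "pub4", "LC"),
    ("D", "C", "pub5", "LC"),
    ("E", "F", "pub6", "HT"),
]

# Assumption: the HT publications that passed manual inspection.
validated_ht_pubs = {"pub1"}


def high_quality(records, validated_ht_pubs):
    """Split interactions into HQ-HT and HQ-LC sets (the two may overlap)."""
    ht_support = defaultdict(set)
    lc_support = defaultdict(set)
    for a, b, pub, source in records:
        pair = frozenset((a, b))  # interactions are undirected
        if source == "HT" and pub in validated_ht_pubs:
            ht_support[pair].add(pub)
        elif source == "LC":
            lc_support[pair].add(pub)
    hq_ht = set(ht_support)
    # LC interactions need at least two independent publications.
    hq_lc = {pair for pair, pubs in lc_support.items() if len(pubs) >= 2}
    return hq_ht, hq_lc


hq_ht, hq_lc = high_quality(records, validated_ht_pubs)
```

In this toy input, A–B is kept through its validated HT publication, C–D through two LC publications, while A–C (one LC publication) and E–F (unvalidated HT screen) are dropped.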
The organism of interest is selected from a drop-down menu, followed by entering the query proteins separated by semi-colons. Up to a maximum of 10 proteins can be entered per query. The database supports Entrez gene IDs and gene names for proteins in human, ORF names and gene names for proteins in S. cerevisiae and S. pombe, and UniProt IDs for O. sativa. The user also has the option of specifying the cutoff number of publications for each of the query proteins. One can also specify a particular evidence type for searching interactions. For each interacting protein, the gene name is listed in the first column, followed by the list of PubMed IDs of the papers supporting this interaction in column 2. The last column lists the PSI-MI evidence code that describes the kind of evidence supporting the interaction. The gene names are linked to the NCBI Entrez Gene database for human and S. cerevisiae, the GeneDB database for S. pombe, and the UniProt database for O. sativa. The PubMed IDs link to the NCBI website for the relevant abstracts.
For batch download, separate links are provided for the binary and co-complex interactomes of each organism. The binary interactome is also divided into the LC and HT networks. Note that the LC and HT networks are not mutually exclusive: certain protein-protein interactions have been discovered both by HT experiments and by LC, and these are included in both interactomes.
Using HINT, it will now be possible to analyze, visualize, and generate reliable hypotheses about a part of or the complete interactome of the four different organisms – human, S. cerevisiae, S. pombe, and O. sativa. Future efforts may be directed at similarly collecting and filtering data for other organisms and also updating the current dataset based on new findings.
HINT clearly distinguishes between binary and co-complex interactions. The binary network represents direct interactions between two proteins. On the other hand, the co-complex network merely indicates membership of a group and does not necessarily imply pairwise interactions. In most cases, the exact topology of the complex is unknown. Two primary methods, the spoke model and the matrix model, are used to represent these complexes. However, both models are approximations and merely suggest possible topologies. Since different reports base their choice of model on study-specific conditions, all co-complex associations were included as curated in the source databases. No re-curation was performed. Moreover, compared to co-complex interactome models, binary maps have a greater fraction of transient signaling connections and inter-complex connections [17, 33]. Since these two datasets represent fundamentally different biological entities, their overlap is low (Additional file 8: Figure S4) and it is important to differentiate between them in certain studies. For example, recent studies have examined how mutations may either lead to complete loss of gene products or edge-specific changes in the interactome [34, 35]. We show in a recent study that the pathogenesis of human disease can be better understood by looking at the position of mutations on interaction interfaces. These approaches are applicable to direct binary interactions, as it is more difficult to infer interface pairs from co-complex associations. The latter can be resolved using information on three-dimensional structures of protein complexes if these are available. Thus, based on the context, it may be more appropriate to use one interactome over the other. Moreover, there are significant differences in the topological properties of these two networks. We calculated the clustering coefficient and the edge betweenness for the different interaction networks in HINT.
Clustering coefficient measures the density of clustering in an interaction network. We find that co-complex networks have a significantly higher clustering coefficient (P < 10⁻⁸ in both cases as calculated by a two-sample Kolmogorov-Smirnov test) than binary networks (Additional file 9: Figure S5). This shows that co-complex associations tend to be much denser in terms of topological structure. Edge betweenness is used to detect community structure in networks. A higher betweenness value for an edge indicates that it connects different modules and disrupting this edge will fragment the network into disjoint components. We find that binary networks for both human and S. cerevisiae have a significantly higher betweenness (P < 10⁻⁸ in both cases as calculated by a two-sample Kolmogorov-Smirnov test) than co-complex networks for the two organisms (Additional file 9: Figure S5). This suggests that co-complex associations form tightly regulated modules and binary interactions are often used to form links between these modules. We did not use the S. pombe or O. sativa networks for our global topological calculations as these interactomes are highly underexplored at this stage and the small number of interactions available makes the networks unsuitable for global analyses.
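To make the spoke and matrix models and the clustering coefficient concrete, the following Python sketch expands one hypothetical four-member complex under each model and computes the average local clustering coefficient of the resulting graphs. The function names and the toy complex are illustrative assumptions. This is not the analysis pipeline used for HINT, which compared distributions across whole networks with a Kolmogorov-Smirnov test.

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: the bait is linked to every prey; preys are not linked."""
    return {frozenset((bait, p)) for p in preys}


def matrix_edges(members):
    """Matrix model: every pair of complex members is linked."""
    return {frozenset(pair) for pair in combinations(members, 2)}


def average_clustering(edges):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the links that exist among this node's neighbors.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors)


complex_members = ["bait", "p1", "p2", "p3"]
spoke = spoke_edges("bait", complex_members[1:])
matrix = matrix_edges(complex_members)

print(average_clustering(spoke))   # star graph: no triangles
print(average_clustering(matrix))  # clique: every neighbor pair is linked
```

The two expansions of the same complex sit at opposite extremes of clustering, which gives some intuition for why co-complex networks, however they are modeled, can differ sharply in topology from binary maps.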
Over the last decade, it has become clear that a human disease is rarely the consequence of an isolated abnormality in a particular gene but is generally the outcome of complex perturbations of the underlying cellular network. This has led to systematic studies of interactome networks, and numerous insights have been obtained from such studies. The structure of these networks is governed by key biological principles and changes in their global properties may be linked to human disease. Further advances in such studies are expected to uncover the biological significance of disease-associated mutations discovered by genome-wide association studies and help in identifying biomarkers and novel drug targets.
However, we were unable to reproduce the same results using the LC interactome (Figure 5A; error bars correspond to standard error of the mean assuming a binomial distribution). Among proteins that have at least one interaction, the percentage of disease genes increases significantly with degree (P < 10⁻⁸ as calculated by a one-way ANOVA). This led us to believe that the difference could be due to study biases in the LC data. To systematically analyze if this is true, we plotted the average number of publications against the number of interactions of proteins separately for the HT and LC interactomes. Intuitively, there should be no strong correlation between these two entities as the number of publications associated with a protein should have no connection with its degree. The average number of publications does not vary significantly with degree for the HT dataset but increases dramatically for the LC interactome (see Figures 5B and 5C). This illustrates the strong study bias in the LC data – proteins with a greater number of interactions tend to be revisited more often by small-scale studies. Our results are consistent with earlier findings that the degree of proteins in the LC interactome is strongly correlated with the number of publications associated with them [17, 44]. This makes the LC interactome unsuitable for global topological analyses. The low overlap between the HT and LC interactomes (Additional file 8: Figure S4) also confirms that these are in fact two separate networks that need to be appropriately used based on the context.
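The study-bias check described above amounts to grouping proteins by degree and averaging their publication counts. A minimal Python sketch follows. The edge list and publication counts are made-up toy values and the function name is hypothetical; it only illustrates the shape of the computation, not the actual data behind Figures 5B and 5C.

```python
from collections import defaultdict

# Hypothetical inputs: an undirected edge list and a publication count per protein.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")]
publications = {"A": 120, "B": 40, "C": 30, "D": 5, "E": 3, "F": 4}


def mean_publications_by_degree(edges, publications):
    """Return {degree: mean publication count of proteins with that degree}."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    grouped = defaultdict(list)
    for protein, k in degree.items():
        grouped[k].append(publications.get(protein, 0))
    return {k: sum(v) / len(v) for k, v in sorted(grouped.items())}


profile = mean_publications_by_degree(edges, publications)
print(profile)
```

A profile that rises steeply with degree, as in this toy LC-like input, is the signature of sampling bias; a flat profile is what one would expect from an unbiased screen.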
To further investigate whether protein interactomes can help us understand disease mechanisms and uncover previously unknown disease genes, we used the HT human interactome to analyze what fraction of disease genes are disease hubs, i.e., genes causing multiple diseases. We examined the distribution of disease hubs as a function of their degrees (Figure 5D; error bars correspond to standard error of the mean assuming a binomial distribution). We observed that proteins with a higher number of interactions are significantly more likely to be disease hubs (P < 10⁻⁸ as calculated by a one-way ANOVA). Though this may seem contradictory to earlier findings in Figure 5A, these two are in fact independent results. It is true that if a disease gene has more interactions, there is a higher probability of its fitness being affected. However, in Figure 5D, we focused only on disease genes. By virtue of the fact that these are observed in the population as disease genes, their mutations are less likely to cause embryonic lethality. Therefore the evolutionary constraints in Figure 5A do not apply here. It is logical to expect that a disease protein with multiple interactions will have a greater propensity for causing multiple diseases. This is because a protein with more interactions is involved in more biological functions. This result also means that protein-protein interactions are important in the pathogenesis of many human diseases. Further studies on alteration of interactions by disease mutations may reveal insights into molecular mechanisms of various diseases and provide information about potential drug targets.
HINT is a comprehensive repository of high-quality binary and co-complex physical interactions in human, S. cerevisiae, S. pombe, and O. sativa. It establishes and implements systematic techniques for separating interactions based on both type (i.e., binary and co-complex) and data-source (i.e., LC and HT). Making these distinctions is critical for many applications. Using only the HT dataset, we demonstrated that human disease genes with a greater number of interactions tend to cause more diseases. Future directions involve implementation of the same techniques for other organisms of biological interest.
As one of the primary goals of the database is to clearly distinguish binary interactions from co-complex associations, two separate and mutually exclusive lists of evidence codes were created – one for each category. An evidence code is a unique number assigned by the PSI-MI initiative to a particular form of experimental information in support of an interaction. The lists used for both categories can be found in Additional file 10: Table S5, Additional file 11: Table S6 and Additional file 12: Table S7. Using these lists, all the interactions were classified into binary and/or co-complex. Interactions supported by evidence codes that are in neither of the two lists are excluded. Different databases use different gene identifiers and, as this may lead to error, all gene identifiers in each database were converted to Entrez gene IDs for human, ORF names for S. cerevisiae and S. pombe, and UniProt IDs for O. sativa. For each of the organisms, gene names (when available) are also provided in the bulk download files. Mapping files were obtained from UniProt and the NCBI gene databases.
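The classification step can be pictured with a short Python sketch. The identifiers MI:0018 (two hybrid), MI:0004 (affinity chromatography technology) and MI:0676 (tandem affinity purification) are real PSI-MI terms, but their assignment to the two lists here is an assumption made for illustration; the actual lists are those given in the additional files. MI:9999 is a made-up placeholder standing for a code that appears in neither list.

```python
# Illustrative, mutually exclusive code lists (assumed, not HINT's curated lists).
BINARY_CODES = {"MI:0018"}                 # two hybrid
CO_COMPLEX_CODES = {"MI:0004", "MI:0676"}  # affinity chromatography, TAP


def classify(evidence_codes):
    """Return the set of categories supported by an interaction's evidence codes."""
    categories = set()
    for code in evidence_codes:
        if code in BINARY_CODES:
            categories.add("binary")
        elif code in CO_COMPLEX_CODES:
            categories.add("co-complex")
        # Codes in neither list contribute nothing.
    return categories


# An interaction may carry several codes, so it can land in both categories.
both = classify(["MI:0018", "MI:0676"])
binary_only = classify(["MI:0018"])
excluded = classify(["MI:9999"])  # empty set: the interaction is dropped
```

Keeping the code lists disjoint while letting an interaction accumulate categories from all of its evidence is what allows the "binary and/or co-complex" outcome described above.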
As described earlier, for an interaction to qualify as high-quality, it has to have at least one manually verified HT evidence code or at least two LC evidence codes supporting it. For certain database-specific details, please refer to the Methods section in the Supplementary information (Additional file 13: Supplementary Methods).
The proportion of disease genes in bin i was calculated as N_i/T_i (expressed as a percentage), where N_i is the number of disease genes in bin i and T_i is the total number of genes in bin i.
Here each bin corresponds to the number of interactions – 0, 1, 2, 3, and ≥4 respectively. These values are shown in Figure 5A. The error bars represent standard error of the mean assuming a binomial distribution (each gene is either involved or not involved in disease).
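Written out, and assuming that the plotted value is a simple percentage and that the error bars use the standard binomial formula for the standard error of a proportion (neither expression is reproduced in this text, so both are a reconstruction), the quantities for bin i would be:

```latex
\[
p_i = \frac{N_i}{T_i},
\qquad
P_i = 100 \times p_i \ \%,
\qquad
\mathrm{SE}_i = 100 \times \sqrt{\frac{p_i \,(1 - p_i)}{T_i}} \ \%
\]
```

Here N_i and T_i are as defined above, p_i is the per-bin proportion, and SE_i shrinks as the number of genes in the bin grows.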
The proportion of disease hubs in bin j was calculated as N_j/T_j (expressed as a percentage), where N_j is the number of disease hubs in bin j and T_j is the total number of disease genes in bin j.
Here each bin corresponds to the number of interactions – 0, 1, and ≥2 respectively and a disease hub is any disease gene implicated in three or more diseases. These values are shown in Figure 5D. The error bars represent standard error of the mean assuming a binomial distribution (each protein is either a disease hub or it is not).
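The binning for disease hubs can be sketched in Python as follows. The input records are invented toy values, the function names are hypothetical, and the standard error uses the usual binomial formula sqrt(p(1 − p)/T), which is an assumption about how the error bars were computed.

```python
import math

# Hypothetical records for disease genes only: (degree, number of diseases).
disease_genes = [
    (0, 1), (0, 1), (0, 3), (0, 1),
    (1, 1), (1, 4),
    (2, 3), (5, 3), (3, 1), (7, 5),
]


def bin_index(degree):
    """Bins follow the text: 0, 1, and >= 2 interactions."""
    return min(degree, 2)


def hub_fraction_by_bin(genes, hub_threshold=3):
    """Return {bin: (fraction of disease hubs, binomial standard error)}."""
    totals = {0: 0, 1: 0, 2: 0}
    hubs = {0: 0, 1: 0, 2: 0}
    for degree, n_diseases in genes:
        j = bin_index(degree)
        totals[j] += 1
        if n_diseases >= hub_threshold:  # three or more diseases
            hubs[j] += 1
    result = {}
    for j, t in totals.items():
        if t == 0:
            continue  # skip empty bins to avoid division by zero
        p = hubs[j] / t
        result[j] = (p, math.sqrt(p * (1 - p) / t))
    return result


stats = hub_fraction_by_bin(disease_genes)
for j, (p, se) in stats.items():
    print(j, round(100 * p, 1), round(100 * se, 1))
```

Because only disease genes enter the denominator, this quantity is conditional on a gene already being a disease gene, which is why it can rise with degree even when the overall share of disease genes does not.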
AP/MS: Affinity Purification followed by Mass Spectrometry
ORF: Open Reading Frame
PSI-MI: Proteomics Standards Initiative Molecular Interaction.
JD was supported by the Tata Graduate Fellowship. HY is supported by US National Institute of General Medical Sciences. This work was funded by US National Institute of General Medical Sciences grant R01 GM097358 to HY.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.