Reconstruction of the experimentally supported human protein interactome: what can we learn?

Background Understanding the topology and dynamics of the human protein-protein interaction (PPI) network will significantly contribute to biomedical research, therefore its systematic reconstruction is required. Several meta-databases integrate source PPI datasets, but the protein node sets of their networks vary depending on the PPI data combined. Due to this inherent heterogeneity, the way in which the human PPI network expands via multiple dataset integration has not been comprehensively analyzed. We aim at assembling the human interactome in a global structured way and exploring it to gain insights of biological relevance. Results First, we defined the UniProtKB manually reviewed human “complete” proteome as the reference protein-node set and then we mined five major source PPI datasets for direct PPIs exclusively between the reference proteins. We updated the protein and publication identifiers and normalized all PPIs to the UniProt identifier level. The reconstructed interactome covers approximately 60% of the human proteome and has a scale-free structure. No apparent differentiating gene functional classification characteristics were identified for the unrepresented proteins. The source dataset integration augments the network mainly in PPIs. Polyubiquitin emerged as the highest-degree node, but the inclusion of most of its identified PPIs may be reconsidered. The high number (>300) of connections of the subsequent fifteen proteins correlates well with their essential biological role. According to the power-law network structure, the unrepresented proteins should mainly have up to four connections with equally poorly-connected interactors. Conclusions Reconstructing the human interactome based on the a priori definition of the protein nodes enabled us to identify the currently included part of the human “complete” proteome, and discuss the role of the proteins within the network topology with respect to their function. 
As the network expansion has to comply with the scale-free theory, we suggest that the core of the human interactome has essentially emerged. Thus, it could be employed in systems biology and biomedical research, despite the considerable number of currently unrepresented proteins. The latter are probably involved in specialized physiological conditions, justifying the scarcity of related PPI information, and their identification can assist in designing relevant functional experiments and targeted text mining algorithms.


Background
Deciphering the structure and dynamics of protein-protein interaction (PPI) networks is among the major objectives of systems biology research in the quest for the mechanisms of life. For the human protein interactome in particular, its reconstruction and the further exploration of its topology and dynamics are expected to have a significant impact on biomedical research and applications [1,2]. The number of experimentally supported PPIs has drastically increased for model organisms since 2000 [3][4][5][6][7] and for the human interactome since 2005 [8,9], mainly due to the gradually increasing number of high-throughput methodologies for PPI detection. The experimentally identified PPIs are mined from the literature and stored in bulk in PPI databases, most of which are repositories for many species. For the human interactome, the various source PPI databases report the protein identifiers at different molecular levels of biological information, and include protein interaction sets of limited overlap due to their own literature mining criteria, differences in PPI incorporation rates from small-scale experiments, as well as differences in methods for PPI selection, curation and updating [10][11][12][13][14]. Therefore, several PPI meta-databases also exist, combining information from multiple source databases [15][16][17][18][19][20][21][22][23]. However, as each meta-database has distinct curation objectives and methods for data normalization and integration, the results obtained from its combined PPI dataset may not be directly comparable to those of a direct query on the source databases [11,12]. In addition, the set of protein nodes of a meta-database network varies depending on the PPIs of the employed source datasets, and it may change upon updating or incorporation of new datasets. This creates heterogeneity between the various PPI meta-databases and hinders the direct comparison among their networks [11]. 
Because of this inherent heterogeneity, although there have been many studies comparing a variety of PPI datasets [10][11][12][13][14], the way in which the human protein interactome expands via the integration of multiple datasets has not been comprehensively explored; therefore, a global perspective of the biology emerging from the network structure remains elusive.
The objective of the present study is to reconstruct the current experimentally supported network of direct human protein interactions in a global structured way, explore it to obtain information about the fraction of the human proteome that it currently involves, discuss the biological role of proteins within the topology of the network, and identify the proteins that are presently absent from the network ("orphan" proteins). To this end, we started by defining the UniProtKB manually reviewed human "complete" proteome [24] as the reference set of nodes that the human PPI network can have. Then, we mined five major source PPI databases, i.e. HPRD [25], IntAct [26], MINT [27], DIP [28] and BioGRID [29], for direct interactions exclusively between members of the defined reference protein set. After appropriate updating of the old and filtering of the obsolete protein identifiers, the acquired PPI data were normalized to and combined at the UniProt protein identifier level. We analyzed the reconstructed network to discuss whether the role of proteins revealed by their position in the interactome topology is supported by the currently available knowledge about their function. In addition, based on the verified scale-free structure of the human PPI network [1,30], we predict the number of connections of the unrepresented proteins and provide a novel perspective on the presently "missing" part of the interactome.

Protein and PPI datasets
The UniProtKB/Swiss-Prot manually reviewed human "complete" proteome
From UniProtKB, the knowledgebase of the Universal Protein (UniProt) resource [24], we downloaded the tab-delimited files of: (a) the entire set of human UniProt identifiers, and (b) the manually reviewed human "complete" proteome. The latter contained 20,242 UniProt identifiers in the Dec 14 2011 release of UniProtKB downloaded on Jan 23 2012. The two tab-delimited files included all default columns augmented by the cross-references with the EMBL nucleotide, the NCBI nucleotide and the Entrez Gene databases. The text file indicating the correspondence of the secondary to the respective primary UniProt identifier(s) was downloaded too.

The Human Protein Reference Database (HPRD)
HPRD is a manually curated reference database for human protein information [25]. In this study, we used only its binary PPI dataset, which is provided in the form of interactions between HPRD identifiers. From the total 19651 HPRD identifiers in the HPRD version 9, downloaded on Jan 23 2012, 9673 were involved in at least one of the 39204 PPIs reported as binary interactions. Only the primary one-to-one correspondence of the HPRD identifiers to nucleotide sequence identifiers was considered. Any necessary updating or conversion of the nucleotide sequence identifiers to other molecular levels of biological information (i.e. gene or protein level) was carried out through cross-reference with current versions of the relevant databases.

IntAct
IntAct, a main partner of the International Molecular Exchange (IMEx) Consortium [10], is a repository of molecular interaction data for multiple organisms [26].
In the single file supplied by IntAct for external use, including interaction information from all species, PPIs are provided mainly at the UniProt protein identifier level. From the Jan 3, 2012 release downloaded on Jan 30, 2012, only the non-"spoke" PPIs between two human protein identifiers were retained, as the label "spoke" characterizes the PPIs originating from protein complex expansion.

The Molecular INTeraction database (MINT)
Similarly to IntAct, MINT [27] is a repository of literature-curated PPIs from multiple organisms and an IMEx consortium partner with PPI information provided mainly at the UniProt protein identifier level. The binary PPI file for human used in the present study was downloaded on Jan 30, 2012 (release date: Dec 8, 2011).

Database of Interacting Proteins (DIP)
DIP [28] is also a collection of experimentally supported protein interactions from multiple organisms and among the first partners of the IMEx consortium. In the PPI file for human downloaded on Jan 30, 2012 (release date: Oct 27 2011), PPIs are provided as interactions between DIP identifiers. Most DIP identifiers are mapped to UniProt protein identifier(s), and most of them to NCBI nucleotide RefSeq identifier(s), too.

The Biological General Repository for Interaction Datasets (BioGRID)
BioGRID [29] is the most recently initiated among the five source PPI databases used in this study, currently participating in the IMEx consortium as an affiliate member. The PPI file for human was downloaded from the BioGRID web site on Jan 30, 2012 (release 3.1.84 tab2 file). PPIs are provided as interactions between BioGRID identifiers, which are in one-to-one correspondence to Entrez Gene identifiers (GeneID). BioGRID provides extensive information about the experimental method and the nature, i.e. low- or high-throughput, of the experimental set-up used for any PPI detection; however, it neither makes a distinction between binary interaction and protein complex data nor provides a relevant filtering criterion. To avoid including PPI data expanded from protein complexes, we opted to keep (a) all physical associations identified in low-throughput setups and (b) from the physical associations detected only in high-throughput experiments, those derived from any of the "protein complementation assay (PCA)", "reconstituted complex", "protein-peptide", "FRET", "two-hybrid" or "co-crystal structure" methods. Genetic interactions provided in BioGRID were de facto filtered out.
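The BioGRID selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys (`a`, `b`, `type`, `method`, `throughput`) are hypothetical and do not reproduce the column names of the BioGRID tab2 file, and the method labels are written as they are quoted in the text, not as BioGRID spells them.

```python
from collections import defaultdict

# Method labels as quoted in the text (lowercased); the real BioGRID
# vocabulary may be spelled differently, so treat these as assumptions.
ALLOWED_HT_METHODS = frozenset({
    "protein complementation assay (pca)",
    "reconstituted complex",
    "protein-peptide",
    "fret",
    "two-hybrid",
    "co-crystal structure",
})


def select_direct_pairs(records):
    """Return the unordered interactor pairs that pass the filter.

    Each record is a dict with hypothetical keys:
    'a', 'b' (interactor ids), 'type' ('physical' or 'genetic'),
    'method' (detection method) and 'throughput' ('low' or 'high').
    """
    evidence = defaultdict(list)
    for rec in records:
        if rec["type"] != "physical":
            continue  # genetic interactions are dropped
        evidence[frozenset((rec["a"], rec["b"]))].append(rec)

    kept = set()
    for pair, recs in evidence.items():
        # (a) any low-throughput physical evidence keeps the pair
        has_low = any(r["throughput"] == "low" for r in recs)
        # (b) otherwise, a high-throughput record with an allowed method
        has_allowed_high = any(
            r["throughput"] == "high"
            and r["method"].lower() in ALLOWED_HT_METHODS
            for r in recs
        )
        if has_low or has_allowed_high:
            kept.add(pair)
    return kept
```

Grouping the evidence per unordered pair before deciding reflects that criterion (b) applies only to associations with no low-throughput support at all.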

PPI data mining
Direct PPIs with both interactors belonging to the set of the 20,242 primary UniProt identifiers included in the manually reviewed human "complete" proteome were mined from: (a) the binary PPI dataset of HPRD, (b) all PPIs of IntAct not characterized with the term "spoke" in the "expansion" field, (c) the binary PPI dataset of MINT, (d) the DIP dataset, which is provided as containing only binary manually reviewed PPIs, and (e) all physical associations in BioGRID detected in at least one low-throughput experiment or by any of the detection methods mentioned above, if identified only in high-throughput setups.

Protein identifier normalization
Normalization of the protein identifiers to the UniProt identifier level was required for: (a) HPRD, since it reports the interactors at the nucleotide sequence level, (b) BioGRID, which reports the interactors at the gene level and (c) a few cases in IntAct, MINT and DIP, for which an identifier other than the default UniProt identifier has been used.

Source PPI dataset uploading
To upload, store and handle the five PPI datasets and integrate them into the final reconstructed PPI network, the Microsoft SQL Server (MSSQL) 2008 Developer Edition platform equipped with SQL Server Integration Services (SSIS) was used under the University of Patras academic license. The source PPI dataset uploading was organized in a set of SSIS modules executed at the server side. Each module involves a series of subtasks for the filtering and updating of certain data from the source PPI dataset, along with a large number of checks to monitor and handle exceptions, avoiding thus the contamination of the final database with erroneous or ill-formatted data. Additional file 1 shows the workflow for the IntAct uploading sub-module.
The first subtask of the filtering and updating algorithm involves the extraction of the interactions between human protein identifiers. Subsequently, the main interactor identifiers are retained for each PPI. For IntAct, MINT and DIP, the interactors are expected to be represented by a UniProtKB accession number. If the relevant format is not recognized by the algorithm for either of the two interactors, then the non-UniProt interactor identifier is compared against a maintained interactor identifier dictionary. If matched to a dictionary entry and identified as active, the non-UniProt interactor identifier is replaced by the corresponding primary UniProt identifier. If it has become obsolete or cannot be assigned to a UniProtKB accession number, it is removed from the finally uploaded dataset along with all associated PPIs. If active, all isoform UniProt protein identifiers are replaced by their primary UniProt identifier(s). Any remaining non-UniProt interactor identifiers are stored in a separate table, for the curator to appropriately update the interactor identifier dictionary, so that the "patching" process is completed in a second iteration. In HPRD, the interactor identifier dictionary is used to update the nucleotide sequence identifiers to their currently active entries. Notably, among the 9673 HPRD identifiers involved in PPIs, 119 were identified to correspond to obsolete nucleotide sequence identifiers, 4 corresponded to non protein-coding RNAs, while 16 were replaced by new nucleotide sequence identifiers; due to this updating, in three cases, two HPRD identifiers were assigned to the same nucleotide sequence identifier. In BioGRID, all interactors were identified by an active Entrez GeneID, thus no updating was necessary. For the PPIs remaining after the interactor identifier patching step, the algorithm inspects the identifier of the supporting publication(s). If no publication is provided, the PPI is removed from the uploaded dataset. 
If a non-PubMed publication identifier is provided, this is patched based on an in-memory maintained dictionary as described for the interactor identifiers in the previous step. The utilized interactor identifier dictionary was created based on information recovered from the online UniProt converter and the online versions of all relevant databases on February 2, 2012. The Digital Object Identifier (DOI) numbers and IMEx reference identifiers were assigned to their PubMed publication identifiers based on an online converter and the online version of MINT, respectively. After uploading IntAct, MINT and DIP, their PPI data were further processed based on information from UniProtKB to include only interactions between two active primary UniProt identifiers in the human manually reviewed "complete" proteome.
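The identifier "patching" logic described above can be illustrated with a minimal Python sketch. It is not the SSIS workflow used in the study: the function names, the in-memory dictionary layout and the placeholder accession `OLD001` are assumptions introduced for illustration, and the isoform pattern assumes the usual `ACCESSION-N` notation.

```python
import re

# Isoform identifiers are assumed to look like "P04637-2".
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def normalize_accession(accession, secondary_to_primary, reference_set):
    """Map one interactor identifier to primary reference accession(s).

    secondary_to_primary: hypothetical dict from a secondary or
    non-UniProt identifier to a list of primary accessions.
    reference_set: the primary accessions of the reference proteome.
    An empty list means the identifier is obsolete or unmapped, so
    every PPI that involves it should be dropped.
    """
    base = ISOFORM_SUFFIX.sub("", accession.strip())
    primaries = secondary_to_primary.get(base, [base])
    return [p for p in primaries if p in reference_set]


def normalize_ppi(a, b, secondary_to_primary, reference_set):
    """Return the set of normalized, order-independent PPI pairs."""
    left = normalize_accession(a, secondary_to_primary, reference_set)
    right = normalize_accession(b, secondary_to_primary, reference_set)
    return {tuple(sorted((x, y))) for x in left for y in right}


# Toy usage: P04637 (TP53) and P0CG48 (UBC) are taken from the text;
# "OLD001" is a made-up secondary identifier.
reference = {"P04637", "P0CG48"}
dictionary = {"OLD001": ["P0CG48"]}
pairs = normalize_ppi("OLD001", "P04637-2", dictionary, reference)
```

Returning a set of pairs accommodates the case in which a secondary identifier corresponds to more than one primary identifier.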

Gene functional classification analysis
Gene functional classification analysis was carried out using the DAVID Bioinformatics Resources version 6.7 [31,32] by combining all available gene annotation categorizations.

Identification of network characteristics
The identification of the reconstructed PPI network characteristics was carried out using the relevant "Network Analysis" tool of the open source network visualization and analysis software Cytoscape, version 2.8 [33].

Results and discussion
Reconstructing the human protein interactome based on a well-defined set of protein nodes
The novelty of our approach regarding the PPI data integration from major literature-curated source PPI datasets compared to existing meta-databases was the a priori definition of the set of nodes of the human protein interactome, considering the UniProtKB manually reviewed human "complete" proteome as a robust, well-defined reference set. Thus, instead of merging PPI information for any protein identifier stored in the source databases, the latter were selectively mined for PPIs exclusively between members of the reference human protein set defined above.
For proper normalization of the source PPI datasets to the UniProt identifier level, it was also important to consider the continuous updating of biological information, since it can lead to changes in the annotation of protein identifiers and in their associations at other molecular levels. Thus, we proceeded to a careful updating of the old and filtering of the obsolete protein identifiers in the source datasets based on the current knowledge about gene annotation. UniProtKB and its cross-references with major resources at the nucleotide sequence and gene levels of molecular information (i.e. NCBI, Entrez Gene and EMBL databases) provided a valuable reference for the appropriate normalization of HPRD and BioGRID identifiers to the UniProt level, and of a small fraction of IntAct, MINT and DIP protein entries that were not provided at the default UniProt level. It is noted that during this conversion to the UniProt level, 1920 BioGRID identifiers reported as human were found to correspond to non-human UniProt identifiers (data not shown), leading thus to the exclusion of their PPIs from the final integrated PPI network.
In the normalized HPRD, IntAct, MINT, DIP and BioGRID files, only the PPIs between two active primary UniProt identifiers in the manually reviewed human "complete" proteome were retained. These datasets were combined keeping one record for each included PPI. A last source of PPI redundancy in the normalized datasets that had to be eliminated concerns the double reporting of an interaction with the two interactors listed in opposite order. In some cases, such duplications may have been intentionally included by the curator of a source PPI dataset to report the experimentally supported order of the interactors; duplications of this type were encountered in IntAct and MINT. In most cases, however, they were just a product of the protein identifier conversions at the various stages of the PPI dataset uploading and formatting and had to be eliminated at the integration stage.
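The integration step, including the removal of interactions reported twice in opposite interactor order, can be sketched as follows. The input layout (a dict from source name to `(interactor_a, interactor_b, pubmed_id)` rows) and the function name are assumptions made for this illustration; the study itself performed the integration in a relational database.

```python
from collections import defaultdict


def integrate(sources):
    """Merge normalized source datasets into one non-redundant PPI set.

    sources: dict mapping a source name to an iterable of
    (interactor_a, interactor_b, pubmed_id) rows (assumed layout).
    Returns a dict keyed by the canonical (sorted) interactor pair,
    recording which sources and publications support each PPI.
    """
    merged = defaultdict(lambda: {"sources": set(), "pubmed": set()})
    for name, rows in sources.items():
        for a, b, pmid in rows:
            # (A, B) and (B, A) collapse onto the same key
            key = (a, b) if a <= b else (b, a)
            merged[key]["sources"].add(name)
            merged[key]["pubmed"].add(pmid)
    return dict(merged)


# Toy usage with made-up publication identifiers
toy = {
    "IntAct": [("P04637", "P0CG48", "111"), ("P0CG48", "P04637", "111")],
    "MINT": [("P0CG48", "P04637", "222")],
}
network = integrate(toy)
```

Keeping the supporting sources and publications per canonical pair preserves the information needed later to count unique contributions and references per PPI.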
The final integrated PPI dataset will be referred to as the PICKLE (Protein InteraCtion KnowLedge BasE) dataset. Table 1 shows the number of (a) the direct PPIs in the PICKLE and the normalized source PPI datasets, (b) the UniProt identifiers in the manually reviewed human "complete" proteome covered by each of them, and (c) the publications providing experimental evidence for the PPIs. As expected, the integrated PICKLE dataset is much larger than any of the individual source datasets with respect to the number of PPIs, of the protein interactors and of the supporting publications, verifying the value of PPI resource integration.
Reconstructing the PPI network in this global structured way offers the following advantages: (a) we resolve the issue of potential protein identifier, and consequently PPI, redundancy in the network originating from the combination of records of multiple databases reporting at different levels of biological information; (b) we determine which protein nodes of the manually reviewed human "complete" proteome remain with no direct PPIs ("orphan" proteins) and discuss this fact in the context of the current information about these proteins; (c) we comment on the proteins represented in the interactome with a high number of PPIs with respect to the importance of their function within the entire network; (d) we consider the human interactome in its entirety, commenting on its future expansion to the maximum potential format in the context of the expected scale-free structure, a fundamental feature of PPI networks [30,34]. Consequently, the interactome reconstructed in the presented way can only grow in edges (PPIs) between the defined set of protein nodes, while keeping its scale-free form. In this global context, we can argue for the expected number of interactions for the "orphan" protein nodes and for the type of their interactors, suggesting a novel perspective for the currently "missing" part of the network, as discussed in the following sections.
The reconstructed interactome covers nearly 60% of the manually reviewed human "complete" proteome
Out of the 20,242 UniProt identifiers in the manually reviewed human "complete" proteome, 11827 (58.4%) were found to have a total number of 75965 direct interactions (Table 1). Gene functional classification analysis (see Methods section) of the proteins currently included in the reconstructed interactome compared to the "orphan" ones did not indicate any functional annotations that could differentiate the one group from the other. Thus, the presently "orphan" proteins are not associated with any apparent functional or subcellular location characteristics that could "hinder" them from binding with other proteins.
Table 1 caption: The size of the reconstructed network is compared to the size of the source PPI datasets normalized to the UniProt identifier level for the defined reference protein set. N/A: not applicable. (1) Members of the UniProtKB manually reviewed human "complete" proteome. (2) Between members of the UniProtKB manually reviewed human "complete" proteome.
Dataset integration augments the overall network mainly with additional interactions for largely overlapping sets of proteins
HPRD and BioGRID are the main contributors of the overall human PPI network, comprising, respectively, 78.7% and 78.3% of its UniProt identifiers, and 48.9% and 56.1% of its PPIs (Table 1 and Figures 1, 2 and 3). Moreover, exclusion of the information from HPRD and BioGRID would, respectively, decrease the overall network by 20.4% and 18.9% in proteins and 33.2% and 39.1% in PPIs. These characteristics can be partially justified by the number of references used by each of these two databases, constituting 72.2% (HPRD) and 51.8% (BioGRID) of the total number of supporting references. In addition, HPRD is one of the first literature-curated databases, although its rate of reference (and thus PPI) incorporation has declined after 2005 (Figure 3B). BioGRID is currently the fastest growing, having also incorporated a significant part of the HPRD PPI network at the time of its creation [11,29]. This information complements the observed much higher curation overlap between HPRD and BioGRID compared to the other pairs of source PPI datasets discussed by Turinsky et al. in [12]. On the other hand, IntAct corresponds to the largest ratio of PPIs per number of references, i.e. 12.1, followed by MINT, i.e. 6.7 (Table 1), indicating that a major fraction of their datasets originates from references of high-throughput PPI experiments. 
Notably, the reconstructed human protein interactome is mainly supported by small-scale studies (Figure 4A); 91% of the references supporting the PICKLE PPI dataset refer to a maximum of five PPIs, and only 51 publications report more than 100 PPIs. In this aspect, PICKLE follows the characteristics of HPRD, currently the main contributor of references to the overall dataset. It is worth mentioning that 84% of the 75965 PPIs in the human interactome are supported by only one reference (Figure 4B) and just 42 PPIs by more than 20 (Additional file 2). Considering that the degree of confidence of a given PPI increases with the number of independent supporting references [35], it is evident that, apart from exploiting existing models for PPI assessment [36], further targeted experimentation is required for validating the majority of the PPI data. A noteworthy observation of our work, revealing an interesting aspect of the literature-supported PPI data collections, is that the fraction of protein nodes that each source dataset uniquely contributes to the integrated network is much smaller than the corresponding fraction for the PPIs, even for the largest HPRD and BioGRID datasets (Figures 1 and 2). The PPI diversity between the source datasets has been discussed earlier [e.g. 10,12] and mainly attributed to the fact that the various databases incorporate knowledge from different publications. This was recently presented for the IMEx Consortium member databases [10] and validated in the present study by the very small number, i.e. eight, of common references between the five employed datasets (Figure 1). Furthermore, Turinsky et al. [12] showed that the source databases exploit different curation criteria even for the shared publications. 
Thus, it is striking that, despite the heterogeneous text mining and data curation methods used by the various databases, the integration of multiple source PPI datasets augments mainly the interactome with different PPIs for essentially the same part of the manually reviewed human "complete" proteome.
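The comparison between uniquely contributed nodes and uniquely contributed PPIs can be illustrated with a short Python sketch on toy data. The function names and the two toy sources `X` and `Y` are hypothetical; the numbers are not those of Table 1 or Figures 1 and 2.

```python
def unique_fraction(per_source):
    """Fraction of the integrated set contributed only by each source.

    per_source: dict mapping a source name to a set of items
    (either protein nodes or canonical PPI pairs).
    """
    union = set().union(*per_source.values())
    if not union:
        return {name: 0.0 for name in per_source}
    result = {}
    for name, items in per_source.items():
        others = set().union(
            *(v for k, v in per_source.items() if k != name)
        )
        result[name] = len(items - others) / len(union)
    return result


def nodes_of(edges):
    """Collect the protein nodes that appear in a set of PPI pairs."""
    return {protein for edge in edges for protein in edge}


# Toy data: two sources share all three proteins but differ in PPIs
toy_edges = {
    "X": {("A", "B"), ("A", "C")},
    "Y": {("A", "B"), ("B", "C")},
}
toy_nodes = {name: nodes_of(edges) for name, edges in toy_edges.items()}

node_unique = unique_fraction(toy_nodes)  # no source adds a new protein
edge_unique = unique_fraction(toy_edges)  # each source adds one new PPI
```

In this toy case neither source contributes a protein of its own, while each contributes one third of the PPIs, which mirrors the qualitative pattern described above.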
This observation suggests that the knowledge about direct PPIs that is available in the literature and can be promptly identified through existing text mining algorithms refers mainly to the fraction, i.e. approximately 60%, of the manually reviewed human "complete" proteome already incorporated in the interactome, while evidence for PPIs for the remaining 40% cannot be easily spotted. In this context, as PPI information from all high-throughput experiments has been included in at least one of the source datasets, there are two possibilities for the "orphan" proteins: either there is currently no available PPI information in the literature, or, if existing, it should concern reports of targeted small-scale functional experiments. From this kind of report, protein interactions can only be indirectly deduced, thus requiring advanced directed text mining algorithms. Furthermore, there is a higher probability for such experiments to refer to PPIs occurring under specialized and/or highly transient or rare physiological conditions, while this type of interaction cannot be easily identified in high-throughput experiments. These implied direct interaction characteristics for the "orphan" proteins support a peripheral role for most of them within the topology of the PPI network. In this context, the actual determination of the "orphan" proteins may assist in directed literature mining to extract potentially existing relevant PPI information from currently unexploited reports or promote further experimentation to verify the argument.
The proteins with a high number of interactions are involved in essential biological processes
Analysis of the integrated human PPI network characteristics indicated that 11577 out of the 11827 UniProt identifiers are connected in one component. The remaining 250 proteins are currently in separate components of up to four nodes, among which are 114 homodimers and 46 heterodimers (Table 2). The largest functional categories for these proteins, as indicated by gene functional classification analysis, concerned 107 glycoproteins, 64 of which are homodimers, and 89 signal peptides, 65 of which are glycoproteins; 68 of the signal peptides, including 39 glycoproteins, are associated with the extracellular matrix. While the network diameter, i.e. the greatest distance between two protein nodes, was determined to be 12, the characteristic path length is 3.69. This feature, along with a radius equal to 1 and the high value of the shortest paths metric (i.e. 95%), indicates a well-connected network, despite its low density (i.e. 0.001) (Table 2). The distribution of PPIs per protein, i.e. protein degree, indicated that 53% of the proteins have up to five interactions (Figure 5), while 16 UniProt identifiers had more than 300 PPIs each (Table 3). This pattern is consistent with the relevant "network biology" theory supported by Barabasi [30,37], according to which the human PPI network is expected to follow a scale-free structure with few protein hubs and the majority of the protein nodes having a small number of interactions. Indeed, even though it is currently incomplete and many interactions are still in need of verification, the reconstructed human protein interactome correlates well with the power law (Figure 5), implying that the degree distribution of the current PPI network already suggests the role of most proteins as high-, middle- or low-degree nodes.
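As an illustration of how a degree distribution can be compared with a power law of the form n(k) = a·k^(-γ), the following Python sketch fits a straight line to the log-transformed degree counts by ordinary least squares. This is a simple, commonly used approximation and is not claimed to be the procedure implemented in the Cytoscape tool used in the study; the function names, the convention of counting a homodimer as one connection, and the simple undirected-graph definition of density are assumptions of the sketch.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each degree k to the number of nodes with that degree.

    edges: iterable of (a, b) pairs; a self-interaction (a == b)
    is counted once for the node, which is an assumption here.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        if b != a:
            degree[b] += 1
    return Counter(degree.values())


def fit_power_law(counts):
    """Least-squares fit of log n(k) against log k.

    Returns (a, gamma) for the model n(k) = a * k ** (-gamma).
    Requires at least two distinct degrees.
    """
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2E / (N (N - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Synthetic degree counts that follow n(k) = 1024 * k ** -2 exactly
a, gamma = fit_power_law({1: 1024, 2: 256, 4: 64, 8: 16})
```

With the node and PPI counts reported in the text, `density(11827, 75965)` evaluates to roughly 0.001, in line with the reported network density.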
The sixteen proteins determined to have more than 300 PPIs (Table 3) are mainly implicated in the regulation of apoptosis (10 proteins), the MAP kinase signalling pathway (6 proteins) and the cell cycle (7 proteins). A full list of the most significant protein ontology clusters for these high-degree proteins is shown in Additional file 3. Notably, eight of them have been associated with pathways in cancer, while subsets of nine are involved in transcription regulation, covalent chromatin modification or the ubiquitin-like modifier (ubl) conjugation pathway. This information indicates that the observed central role of these proteins within the topology of the PPI network is not a mere result of them being extensively studied, i.e. "study bias", but correlates well with the current knowledge about their function, as has also been suggested earlier for the cancer-associated proteins [38,39]. An additional fact that argues against "study bias" for these proteins is that, apart from various targeted small-scale experiments, many of their direct interactions have also been detected in independent high-throughput setups. For example, at least 54 interactions of the cellular tumor antigen p53 [8,40], 257 interactions of the 14-3-3 protein zeta/delta [41], 212 interactions of the Myc proto-oncogene protein [42] and 48 interactions of the TNF receptor-associated factor 6 [43] have been identified in high-throughput studies.
Table 2 The characteristics of the integrated PPI network. Columns: Network characteristic (1), Value (2). (1) Detailed description for every characteristic can be found in http://med.bioinf.mpi-inf.mpg.de/netanalyzer/help/2.6.1/index.html. (2) Determined using the network analysis tool of Cytoscape (2.8.2).
Polyubiquitin: a hub to be discussed
Polyubiquitin (UniProt identifier: P0CG48, UBC) was the protein identified with the largest number of interactions in the reconstructed network. It interacts with more than a thousand, i.e. 1112, members of the manually reviewed human "complete" proteome, while the second ranked high-degree node, i.e. TP53 (UniProt identifier: P04637), has 476 interactions. Notably, this much larger number of interactions for polyubiquitin compared to the other protein hubs deviates from the scale-free network structure, assigning a centralized role to a single protein (Figure 5). Querying the PICKLE dataset, we identified HPRD, IntAct, MINT, DIP and BioGRID reporting, respectively, 19 (2 unique), 5 (0 unique), 143 (48 unique), 53 (15 unique) and 1423 (909 unique) polyubiquitin PPIs. Without exhausting our search regarding polyubiquitin PPI supporting publications, we detected that our integrated dataset contains interactions from studies investigating polyubiquitin function in the context of protein degradation (e.g. [44]). Polyubiquitin can be covalently linked to a protein through an isopeptide bond and mark it for degradation at the proteasome. However, it is questionable whether this one-sided polyubiquitin action on a protein should be included in the interactome or should be considered in the post-translational modification (PTM) network [45,46]. The latter could explain why, apart from BioGRID, the other source databases used in this work have considered a limited number of polyubiquitin PPIs. 
In the context of the non-directional PPI network, the existence of an interaction link from one protein to another directly implies a link in the opposite direction, too. Consequently, the absence of a protein and thus its interactions will affect its neighbours and add a certain stress to the network, the extent of which depends on the network structure and dynamics. In the case of unidirectional polyubiquitination of a protein leading to its degradation, the absence of the protein will neither affect polyubiquitin nor exert a stress on the rest of the polyubiquitin substrates. Thus, this type of action of one protein on another should be considered as a separate category from the non-directional protein-protein interactions and modelled differently for its role in cell physiology dynamics. On the other hand, the monoubiquitination of proteins for regulatory purposes (e.g. [47]) fits into the notion of the non-directional PPI network. However, even in this case, it is questionable whether ubiquitin itself or rather the ubiquitinated proteins should be included as nodes of the network. In this context, the incorporation of ubiquitin PPIs in the interactome should be cautiously curated. Accordingly, this argument is also relevant to other proteins involved in interactions of similar type, like the small ubiquitin-related modifiers (SUMO1-4) and neddylin (NEDD8) engaged in the sumoylation and neddylation reactions, respectively.
The bulk of the proteins currently absent from the network should have up to four interactions
As shown, the reconstructed human protein interactome follows the scale-free structure with a very good correlation (Figure 5). The part of the network that contributes to the decrease in the correlation coefficient refers to the proteins with up to four interactions. The difference between the data and the power-law curve for a nearly perfect fit is calculated to be about 8300 UniProt identifiers, with more than 6500 of them corresponding to degree equal to 1. Consequently, with 8415 UniProt identifiers not currently included in the interactome ("orphan" proteins), it could be speculated that the vast majority of them should have up to four interactions with nodes in the same degree group. This anticipated network structure implies that the core of the human protein interactome has essentially been revealed and could provide a reasonable explanation for the current lack of PPI information for about 40% of the human proteome, agreeing with a specialized "peripheral" role for most of these "orphan" proteins. Indeed, with most of them expected to have a single PPI, and in general no more than four, with similarly poorly connected proteins, the probability of them being involved in specialized physiological conditions is high. This speculation is further corroborated by the fact that interactions for these proteins cannot be easily confirmed in PPI identification experiments, as discussed in section C.

Conclusions
We have obtained an integrated set of direct PPIs, normalized and free of outdated protein identifier annotations, referring to the well-defined UniProtKB manually reviewed human "complete" proteome. We suggest that this PPI network, which involves approximately 60% of the "complete" proteome, represents the core of the human protein interactome. Based on a global view of the way in which the current network will have to expand to its maximum potential in accordance with the scale-free theory, we provide a novel perspective on its currently "missing" part. We envisage that the proteins not yet identified in direct PPI assays may participate in specialized biological functions, interacting with a limited number of other poorly connected proteins. Now determined, this set of "orphan" proteins may trigger targeted text mining efforts or appropriately designed functional experiments for the identification of any relevant PPIs. In effect, we suggest that this reconstructed human interactome already provides a useful tool for generating valuable working hypotheses for the investigation of important biological processes and molecular functions in the context of biomedical research and applications.