Volume 6 Supplement 2
PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from H-Invitational protein-protein interactions integrative dataset
- Shingo Kikugawa†1,
- Kensaku Nishikata†1, 2, 3,
- Katsuhiko Murakami†1,
- Yoshiharu Sato1,
- Mami Suzuki1,
- Md Altaf-Ul-Amin2,
- Shigehiko Kanaya2 and
- Tadashi Imanishi1
© Kikugawa et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Proteins interact with other proteins or biomolecules in complexes to perform cellular functions. Existing protein-protein interaction (PPI) databases and protein complex databases for human proteins are not organized to provide protein complex information or facilitate the discovery of novel subunits. Data integration of PPIs focused specifically on protein complexes, subunits, and their functions is therefore needed. Predicted candidate complexes or subunits are also important for experimental biologists.
Based on integrated PPI data and literature, we have developed a human protein complex database with a complex quality index (PCDq), which includes both known and predicted complexes and subunits. We integrated six PPI data sets (BIND, DIP, MINT, HPRD, IntAct, and GNP_Y2H), and predicted human protein complexes by finding densely connected regions in the PPI networks. They were curated with the literature so that missing proteins were complemented and some complexes were merged, resulting in 1,264 complexes comprising 9,268 proteins with 32,198 PPIs. The evidence level of each subunit was assigned as a categorical variable. This indicated whether it was a known subunit and whether a specific function was inferable from sequence or network analysis. To summarize the categories of all the subunits in a complex, we devised a complex quality index (CQI) and assigned it to each complex. We examined the proportion of consistency of Gene Ontology (GO) terms among protein subunits of a complex. Next, we compared the expression profiles of the corresponding genes and found that many proteins in larger complexes tend to be expressed cooperatively at the transcript level. The proportion of duplicated genes in a complex was evaluated. Finally, we identified 78 hypothetical proteins that were annotated as subunits of 82 complexes, which included known complexes. Of these hypothetical proteins, after our prediction had been made, four were reported to be actual subunits of the assigned protein complexes.
We constructed a new protein complex database PCDq including both predicted and curated human protein complexes. CQI is a useful source of experimentally confirmed information about protein complexes and subunits. The predicted protein complexes can provide functional clues about hypothetical proteins. PCDq is freely available at http://h-invitational.jp/hinv/pcdq/.
Proteins interact with other proteins or biomolecules to perform their functions, and protein complexes are the fundamental functional units of these macromolecular systems. Comprehensive analysis of PPIs provides a valuable framework for understanding the protein functions required for various biological processes in cells. Moreover, it can provide annotation clues about proteins with unknown function [1–3].
An important issue for the elucidation of the functional organization of the proteome is the extraction of information about protein complex formation and function from the PPI network.
In recent years, a number of well-organized public PPI databases have become available, including Biomolecular Interaction Network Database (BIND) [4, 5], Database of Interacting Proteins (DIP), Molecular INTeraction database (MINT) [7, 8], Human Protein Reference Database (HPRD), IntAct, and Genome Network Project Y2H data (GNP-Y2H; http://genomenetwork.nig.ac.jp/index_e.html). In the present PPI data, the main focuses are on protein-binding partners or binary protein interactions. Knowledge about how gene products form complexes, interactions among complexes, or protein interconnectivity in a complex is still scarce. The overlap of PPI data entities across databases is relatively low. The existence of only a partial map of the whole interactome space limits the broad application of systems modeling. Accordingly, it is essential to integrate PPI data in order to fill in as many holes in the interactome space as possible. Some integration of the above PPI data has been conducted by STRING, OPHID, and HAPPI. However, protein complex information has been poorly annotated in these resources.
Several human protein complex databases have been developed to date, including CORUM [14, 15] and a disease-related complex database. The protein complexes in CORUM were collected only from literature. The database does not provide information about many uncharacterized proteins whose interactions are supported by PPI data. The disease-related complex database is focused on disease complexes, using information on proteins known to be involved in similar disorders. Accordingly, it contains a relatively small number of complexes (506) and lacks many other important complexes.
In this study, we integrated human PPI data from the six databases and predicted human protein complexes from the integrated PPI data set by finding densely connected regions with cluster properties in the PPI network based on graph theory, as described in our previous report. The novelty of our prediction method is that we optimized parameter settings for the prediction tool DPClus using an original reference dataset. After prediction, experienced annotators manually annotated the predicted protein complexes according to our standardized procedures, using literature mining and the wealth of annotation data in the human full-length cDNA database "H-Invitational Database" (H-InvDB) that we developed [18–20]. Using the data from H-InvDB, we performed several analyses of the annotated complexes to increase the validity of our annotation. This is the first attempt at comprehensive manual curation of human protein complexes predicted from PPI networks.
Construction and content
Integration of PPI data into H-InvDB proteins
To integrate PPI information, we collected PPI data from six databases, BIND [4, 5], DIP, MINT [7, 8], HPRD, IntAct, and GNP, as major resources for PPI. We used XML and flat files obtained from these databases on October 25, 2007. These databases, except for GNP, store experimentally determined PPIs from many organisms collected by literature curation, whereas GNP stores original Y2H experimental data on humans. Computationally predicted PPIs were excluded from this study. A standardized interaction data model is needed for storing PPI data from different sources. Following the method described in the Atlas biological data warehouse, we designed data loading applications for each PPI database and a relational data storage system compliant with the Proteomics Standards Initiative Molecular Interaction Standard (PSI-MI) controlled vocabulary, a community-standard XML format for the presentation of protein interaction data. This system allowed us to unify data from different sources. We used only human PPIs in this study and did not use cross-species PPI data such as human proteins interacting with mouse proteins or data with ambiguous taxonomic labels such as "Mammalia," commonly found in the HPRD download file. To survey human PPIs from the landscape of the human interactome, we mapped the PPI information onto the H-InvDB proteins. We removed PPI data redundancies by evaluating sequence similarity and then integrated human PPIs with the H-InvDB proteins. As a result, we obtained 32,198 human PPIs composed of 9,268 proteins.
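The merging step can be illustrated with a minimal sketch. It assumes that every interactor has already been mapped to a single protein identifier, so it does not reproduce the sequence-similarity step described above. The function name `merge_ppis` and the toy identifiers are hypothetical and are not part of the PCDq pipeline.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def merge_ppis(
    sources: Dict[str, Iterable[Tuple[str, str]]]
) -> Dict[FrozenSet[str], Set[str]]:
    """Merge binary interactions from several sources into a non-redundant set.

    Each interaction is stored as an unordered pair, so (A, B) and (B, A)
    collapse into one entry. The value records which sources reported it.
    """
    merged: Dict[FrozenSet[str], Set[str]] = {}
    for source_name, pairs in sources.items():
        for a, b in pairs:
            key = frozenset((a, b))  # unordered pair of protein identifiers
            merged.setdefault(key, set()).add(source_name)
    return merged


# Toy input with hypothetical identifiers (not real H-InvDB protein IDs).
toy_sources = {
    "BIND": [("P1", "P2"), ("P2", "P3")],
    "HPRD": [("P2", "P1"), ("P3", "P4")],
}
merged = merge_ppis(toy_sources)
proteins = set().union(*merged)  # all proteins that take part in any PPI
```

Keeping the set of supporting sources per interaction makes it easy to see how little the sources overlap, which is the motivation for integration given earlier.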
Prediction of protein complexes with clustering tool DPClus after parameter optimization using an original reference protein complex set
In a PPI network, nodes represent proteins and edges represent interactions. We previously developed an algorithm called DPClus, which extracted densely connected regions in a network and demonstrated that many of these regions correspond to known protein complexes or protein functional units [17, 30]. DPClus is a robust algorithm unaffected by a high rate of false positives in data from high-throughput interaction-detection techniques. DPClus can detect clusters of networks that are separated by sparse regions, keeping track of the periphery of a cluster by monitoring cluster properties of its neighbor. Thus the program considers two parameters, "network density" and "cluster property."
To evaluate the optimal values of these two parameters for predicting protein complexes, we used a set of experimentally determined protein complexes (the reference protein complex set). We manually collected 89 protein complexes from the scientific literature and retrieved 55 complexes from three-dimensional structures of human protein complexes recorded in the PDB. We performed parameter optimization to select the values of the two parameters that achieve the best match of the predicted set with the reference complex set. DPClus was run many times for all possible combinations of the two parameters (network density and cluster property, varied from 0.0 to 1.0 with increments of 0.1). In the parameter optimization process, DPClus was restricted to finding complex sizes of three or more. For this case, a predicted complex needs at least two proteins in common with a known complex to be considered a match. Two scores were checked for each parameter set: the sum of recalls, which is a ratio of the number of matched proteins of a known complex to those of a predicted complex, and the sum of precisions, which is a ratio of the number of matched proteins of a predicted complex to those of a known complex. Recall and precision were zero when proteins of a known complex matched fewer than two proteins of a predicted complex. Recall and precision were one when proteins of a known complex matched perfectly to the proteins of a predicted complex. To avoid overprediction of duplicated complexes, which shared several proteins and matched an identical known complex, the best recall and precision scores were divided by their frequencies. For the best prediction performance of DPClus, the two parameters, network density and cluster property, were optimized using the largest protein subunits of the reference complex set. We simulated prediction with 100 different parameter sets and the best, with network density 0.6 and cluster property 0.5, was determined from the best ROC curves.
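The matching rule can be sketched as follows. The wording of the recall and precision definitions above is open to more than one reading, so this sketch assumes the conventional one: recall is the shared proteins divided by the size of the known complex, and precision is the shared proteins divided by the size of the predicted complex. The function name `match_scores` and the toy complexes are hypothetical.

```python
from typing import Set, Tuple


def match_scores(
    predicted: Set[str], known: Set[str], min_shared: int = 2
) -> Tuple[float, float]:
    """Return (recall, precision) for one predicted/known complex pair.

    Assumed definitions (conventional, may differ from the authors' exact ones):
      recall    = |predicted & known| / |known|
      precision = |predicted & known| / |predicted|
    Both are zero when fewer than `min_shared` proteins are shared.
    """
    shared = len(predicted & known)
    if shared < min_shared:
        return 0.0, 0.0
    return shared / len(known), shared / len(predicted)


# Toy example with hypothetical protein labels.
known = {"A", "B", "C", "D"}
predicted = {"A", "B", "C", "E", "F"}
recall, precision = match_scores(predicted, known)
```

Under these assumptions a perfect match gives 1.0 for both scores, and a pair sharing a single protein gives 0.0 for both, in line with the boundary cases stated in the text.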
With this parameter set, DPClus predicted 1,264 complexes matching 92 of the 144 known complexes. The average recall and precision of these 92 matched complexes were 0.54 and 0.66, respectively. We also calculated the average number of complexes that share a common protein. On average, each protein was present in 1.24 complexes of the reference complex set. With the optimized parameters, the predicted set gave an identical value. With this parameter set (network density 0.6, cluster property 0.5), we predicted 1,319 protein complexes in the integrated PPI network composed of 32,198 human PPIs.
In prediction of protein complexes by DPClus, we adopted the "overlapping clustering mode," which allows identical proteins to be classified into different clusters, because it is biologically well established that proteins can be present in multiple complexes at different times and locations. For example, POLR2E/RPB5 (HIP000039507), POLR2F/RPB6 (HIP000096671), POLR2H/RPB8 (HIP000027404), POLR2K/RPB12 (HIP000043404), and POLR2L/RPB10 (HIP000064404) are conserved throughout RNA polymerases I, II, and III. Before complex prediction, we evaluated the optimal values of DPClus parameters by comparing the predicted complex set with the experimentally determined set of 144 reference complexes.
Manual annotation of the predicted protein complexes: re-clustering, functional annotation, protein category, complex quality index (CQI), and naming of complexes
Using the clusters or protein complexes predicted by DPClus, we performed manual annotation by the following procedures: 1) curators searched the scientific literature for evidence that the proteins of the predicted complexes are experimentally defined complex members or subunits, 2) missing proteins were manually added to the predicted complexes if literature evidence showed that they were subunits of the complexes, and 3) data such as complex names; descriptions; localizations; and complex-complex interactions (CCIs), and their subunit functions, structures, expression profiles, gene loci, and PPIs among protein subunits were integrated. We did not exclude proteins that were predicted to be subunits but lacked literature evidence; instead, we considered them as complex subunit candidates. The provision of predicted candidates is one of the advantages of PCDq.
We assigned the protein subunits, or member proteins of complexes, of the predicted complexes to three categories based on the evidence levels: category I, proteins that are confirmed as subunits of known complexes in the literature or as ternary structures in the PDB; category II, proteins for which no evidence of complex membership was found in the literature, but which have functions related to those of the shared category I subunits in the predicted complexes according to their protein definitions or Gene Ontology (GO) terms; and category III, proteins that are predicted as complex subunits by DPClus and do not fall into the other two categories. Because our protein complex prediction allowed the same proteins to be subunits of different complexes, such shared proteins could be classified into different categories in different complexes.
To summarize the categories of all the subunits in a complex, we devised a CQI and assigned a CQI value to each complex. CQI is an index of the different levels of evidence for an annotated complex based on the protein category, defined by "[Number of category I proteins].[category II proteins].[category III proteins]/[Total number of proteins in a predicted complex]." For example, if the CQI of a complex is "5.2.1/8," the complex has eight subunits with five, two, and one protein classified into categories I, II, and III, respectively.
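The CQI string format can be illustrated with a short sketch that builds a CQI from a list of subunit categories and parses it back. The function names `format_cqi` and `parse_cqi` are hypothetical and only mirror the definition given in the text.

```python
from collections import Counter
from typing import Iterable, Tuple

VALID_CATEGORIES = ("I", "II", "III")


def format_cqi(categories: Iterable[str]) -> str:
    """Build a CQI string 'nI.nII.nIII/total' from per-subunit categories."""
    counts = Counter(categories)
    unknown = set(counts) - set(VALID_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    return f"{counts['I']}.{counts['II']}.{counts['III']}/{total}"


def parse_cqi(cqi: str) -> Tuple[int, int, int, int]:
    """Split a CQI string into (category I, II, III counts, total)."""
    left, total_text = cqi.split("/")
    n1, n2, n3 = (int(part) for part in left.split("."))
    total = int(total_text)
    if n1 + n2 + n3 != total:
        raise ValueError("category counts do not add up to the total")
    return n1, n2, n3, total


# The example from the text: five, two, and one subunits in categories I-III.
example_cqi = format_cqi(["I"] * 5 + ["II"] * 2 + ["III"])
example_counts = parse_cqi(example_cqi)
```

Parsing the index back into counts is what allows a dataset to be filtered by evidence level, for example by keeping only complexes whose category I count exceeds some threshold.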
The predicted complexes were named based on scientific names from the literature, if the majority of proteins in a complex were common to a known complex and a name (e.g., exosome, spliceosome) for the complex was available; however, we used artificial descriptions using concatenated gene symbols when not all symbols of proteins were available (e.g., GLI1-STK36-SUFU complex, DBNL-ITK-PLCG1-SH3BP2 containing complex). Descriptions of complexes were quoted from references with their PubMed IDs. Functional categories and subcellular localizations were added if the descriptions were available in the literature.
Database of protein complex annotations and visualization tool PPI-Map for CCIs
The novel human protein complex database, called PCDq, provides three main views: protein complex information in the "protein complex view," an integrative overview of a PPI in the "PPI view," and network information including both PPIs and CCIs in "PPI-Map." The complex view provides names, functions, protein subunits, subunit roles, and CQI. The PPI view provides PPI partners for a specified protein. Finally, the new visualization tool PPI-Map allows users to visualize protein interactions graphically: not only PPIs among the protein subunits but also CCIs, via a seamless and detailed annotation of each protein complex and its subunits. These three views have hyperlinks to one another and also to transcript/locus/protein views of the H-InvDB human gene/transcript/protein database. Considering all of these features, PCDq is a useful platform for understanding protein function from the viewpoint of protein complexes as another important functional level, as well as their interactions. The CQI provides unique and reliable clues for inferring some roles of proteins whose functions are unknown.
Statistics of PCDq
Table 1. Protein and complex annotation summary. (a) Number of proteins: proteins in the PPI data set; proteins in the predicted complexes. (b) Number of complexes.
We defined three types of predicted complexes: perfectly matched, partially matched, and hypothetical complex. These correspond to a complex with all subunits in category I, a complex with at least two proteins in category I, and a complex with all subunits in category III, respectively (Table 1b). By this annotation, the number of complexes was 136 for perfectly matched (type I), 405 for partially matched (type II), and 723 for hypothetical (type III) complexes (Table 1b).
Consistency of GO terms assigned to subunits in a complex
Given that proteins in a complex cooperatively play a biological role, it is expected that they are present in the same location in a cell at a certain time and that they act cooperatively in the same biological process or pathway. To assess the quality of our protein complex annotation, we calculated the enrichment ratio of consistency of GO terms among subunits of a complex. This assessment is based on the assumption that the same GO terms are assigned to proteins in a single protein complex.
All GO terms of "biological process," "cellular component," and "molecular function" assigned to the H-InvDB transcripts were used for this study. The depth of GO terms from the root in the GO hierarchy was set to five and GO terms representing nodes with depth less than five were ignored. If the GO term assigned to the transcript had depth greater than five, the corresponding parental node with depth five was reassigned and redundancy was removed. As a control set representing the entire proteome, we collected GO terms assigned to all 36,073 representative transcripts in H-InvDB. All protein subunits in 1,264 complexes were used as one set of protein complexes (PCset1) for the assessment. To construct the manually curated set of protein complexes (PCset2), we collected only category I proteins from perfectly or partially matched complexes (these complexes were defined in the subsection "Statistics of PCDq") and discarded category II or III proteins, which have not been described as subunits of a complex in the literature. PCset2 contained 541 complexes.
First, we estimated the enrichment of some GO terms in a complex compared to GO terms assigned to the proteome. The proteome set comprised 36,073 proteins, each derived from a distinct locus or gene of H-InvDB. The enrichment of GO terms was examined against two sets of protein complexes, PCset1 and PCset2. Significance of enrichment of a given GO term in a complex was tested by one-sided Fisher exact test for a 2 × 2 contingency table (A, B, C, D). "A" represents the number of subunits expressing the given GO term, and "B" is the number of subunits not having the GO term in the protein complex. "C" and "D" represent the corresponding numbers estimated for the entire proteome.
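A one-sided Fisher exact test on a 2 × 2 table (A, B, C, D) can be sketched with the standard library alone. This is an illustration of the test in general, not the authors' code: it treats the four cells exactly as given and sums the hypergeometric probabilities of tables with at least A subunits carrying the GO term. The function name `fisher_one_sided` and the toy counts are hypothetical.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided (enrichment) Fisher exact test for the table [[a, b], [c, d]].

    a: subunits of the complex with the GO term
    b: subunits of the complex without the GO term
    c, d: the corresponding counts for the comparison set
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b              # size of the complex
    col1 = a + c              # all proteins carrying the GO term
    total = a + b + c + d
    upper = min(row1, col1)   # largest possible value of the top-left cell
    favourable = sum(
        comb(col1, x) * comb(total - col1, row1 - x)
        for x in range(a, upper + 1)
    )
    return favourable / comb(total, row1)


# Toy table with hypothetical counts.
p_value = fisher_one_sided(3, 1, 1, 3)
```

For real data the same quantity is available from statistical libraries; the explicit sum is shown here only to make the definition of the one-sided p-value concrete.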
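Second, we calculated the proportion of consistency of GO terms among the subunits of each complex as a consistency index, Ncons / Nall, where Ncons is the number of edges that connect two proteins sharing the same GO term and Nall is the number of possible combinations (edges) for all subunits of the complex.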
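A minimal sketch of a consistency index built from Ncons and Nall is given below. It assumes that the index is the ratio Ncons / Nall and that two subunits count as consistent when they share at least one GO term; both points are assumptions of this sketch. The function name and the GO labels are hypothetical placeholders.

```python
from itertools import combinations
from typing import Dict, Set


def consistency_index(go_terms_by_subunit: Dict[str, Set[str]]) -> float:
    """Fraction of subunit pairs that share at least one GO term.

    N_all is the number of possible subunit pairs in the complex and
    N_cons is the number of those pairs with a common GO term.
    Returns 0.0 for complexes with fewer than two subunits.
    """
    pairs = list(combinations(sorted(go_terms_by_subunit), 2))
    if not pairs:
        return 0.0
    n_cons = sum(
        1 for a, b in pairs
        if go_terms_by_subunit[a] & go_terms_by_subunit[b]
    )
    return n_cons / len(pairs)


# Toy complex with placeholder GO labels (not real GO identifiers).
toy_complex = {
    "P1": {"GO:x"},
    "P2": {"GO:x", "GO:y"},
    "P3": {"GO:z"},
}
toy_index = consistency_index(toy_complex)
```

With this definition an index of 1.0 means that every pair of subunits in the complex shares a GO term.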
We observed that 450 of the 1,264 PCset1 complexes (35.6%) had one or more enriched GO terms (Fisher exact test, p-value ≤ 0.01). In contrast, 254 of the 541 PCset2 complexes (47%) had one or more enriched GO terms. The ratio of protein complexes having enriched GO terms was greater in PCset2 than in PCset1, suggesting that the reliability of protein complex annotation was refined by manual checking.
Intriguingly, we found 28 PCset1 unique complexes with consistency index 1.0. Although the existence of the protein complexes has not yet been validated experimentally, the compatibility between the prediction of protein complexes by our clustering method and the consistency of GO terms offers reliable candidates for novel functional protein complexes to be validated by future experiments.
Similarity of gene expression profiles among proteins in the same complexes
Based on the idea that coexpressed genes are more likely to have the same or similar functions, cluster analysis of gene expression data has been used to predict the functions of non-annotated proteins [34, 35]. Reversing the process, we examined whether proteins in the same complex (involved in the same functions) have similar expression profiles. For each complex, we compared the expression profiles of protein subunits in the complex. When the subunits of a complex are similar in their expression profiles, the profile should provide some functional information about a complex whose function is unknown.
Expression profiles of 729 complexes were obtained from the Human Anatomic Gene Expression Library (H-ANGEL), the satellite database of H-InvDB. From the download file of H-ANGEL ("H-ANGEL_matrix.txt," December 2007 version), gene expression data measured by the iAFLP method for 10 tissue categories were extracted. For some loci, multiple iAFLP-tags correspond to the same locus. In such cases, the different expression profiles for a single locus were averaged over the tags. The expression profile of a gene was expressed by a vector of 10 elements. The similarity of gene expression profiles between two loci was calculated as the cosine of the two vectors. The similarity of multiple gene expression profiles for subunits of a protein complex was defined by the averaged cosines of all combinations of all the different subunits. The cosines of a complex were evaluated by simulation. For every number (k) of subunits in the complex, we randomly selected k-genes from genes having expression profiles. We then calculated the averages of the cosines of the expression profiles. We repeated the procedure 100,000 times for every number of subunits (k), and used the results for p-value estimation.
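The similarity measure and the simulation can be sketched as follows. The cosine and the averaging over all subunit pairs follow the description above. The p-value rule, the fraction of random gene sets whose mean cosine is at least the observed one, is an assumption of this sketch, as are the function names and the small default number of iterations.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two expression vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(profiles: Sequence[Sequence[float]]) -> float:
    """Average cosine over all pairs of subunit profiles (needs at least two)."""
    pairs = list(combinations(profiles, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def empirical_p_value(
    complex_profiles: Sequence[Sequence[float]],
    background: List[Sequence[float]],
    n_iter: int = 1000,  # the text reports 100,000 repetitions per k
    seed: int = 0,
) -> float:
    """Fraction of random k-gene sets at least as similar as the observed complex."""
    rng = random.Random(seed)
    k = len(complex_profiles)
    observed = mean_pairwise_cosine(complex_profiles)
    hits = sum(
        1 for _ in range(n_iter)
        if mean_pairwise_cosine(rng.sample(background, k)) >= observed
    )
    return hits / n_iter


# Toy two-tissue profiles (the study used vectors of 10 elements).
toy_mean = mean_pairwise_cosine([[1, 0], [1, 0], [0, 1]])
toy_background = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3], [0, 2]]
toy_p = empirical_p_value([[1, 0], [2, 0], [3, 0]], toy_background, n_iter=200)
```

Drawing the random sets separately for each complex size k matters because the average over many pairs has a narrower null distribution for large complexes than for small ones.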
Table 2. Protein complexes comprising protein subunits with significantly similar gene expression profiles. Column header: # of genes. Complexes listed: 19S proteasome of the 26S proteasome; 20S proteasome of the 26S proteasome; RNA polymerase II complex; COP9 signalosome (CSN); GAGE6-GMCL1L containing complex; 18S U11/U12 complex.
Relationship between the establishment of protein complexes and gene duplication
To investigate the contribution of gene duplication to the establishment of protein complexes, we examined portions of duplicated genes (proteins) or paralogs in the complexes.
For all combinations of subunits in a protein complex, we evaluated whether the genes were paralogous (two genes copied by segmental duplication) following the method of Gu et al. Gene models that were mapped onto "random" or "haplotype" contigs were not used in the analysis. FASTA package version 34t25 was used for the analysis. In addition, we conducted another paralog analysis with BLASTP using less stringent criteria for the assignment of duplicated genes. BLAST version 2.2.17 was used. If the gene pair showed similarity with E-value less than 1E-05, we assigned it as paralogous.
This paralog assignment method yielded 2,353 duplicated genes in a total of 4,191 genes that were the components of 1,264 complexes. Of the 1,264 complexes, 336 (26.5%) were judged to have at least one paralog pair. Moreover, we obtained 218 complexes (17.2%) in which more than half of the components were judged to be paralogous to another gene in the same complex. Using a less stringent method with BLASTP (E-value ≤ 1E-05), these percentages were estimated to be 38.5% and 27.3%, respectively.
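The two per-complex counts reported above can be sketched as follows, assuming that a set of paralogous gene pairs has already been produced by a similarity search. The function name `paralog_summary` and the toy gene labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def paralog_summary(
    complexes: Dict[str, Set[str]],
    paralog_pairs: Set[FrozenSet[str]],
) -> Tuple[int, int]:
    """Count complexes with any paralog pair and with a paralogous majority.

    Returns (complexes with at least one paralog pair,
             complexes in which more than half of the components are
             paralogous to another gene in the same complex).
    """
    with_pair = 0
    majority = 0
    for members in complexes.values():
        paralogous_members: Set[str] = set()
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) in paralog_pairs:
                paralogous_members.update((a, b))
        if paralogous_members:
            with_pair += 1
        if len(paralogous_members) > len(members) / 2:
            majority += 1
    return with_pair, majority


# Toy data with hypothetical gene labels.
toy_complexes = {
    "c1": {"G1", "G2", "G3"},        # 2 of 3 members paralogous
    "c2": {"G4", "G5", "G6", "G7"},  # 2 of 4 members paralogous
    "c3": {"G8", "G9"},              # no paralog pair
}
toy_pairs = {frozenset(("G1", "G2")), frozenset(("G4", "G5"))}
toy_counts = paralog_summary(toy_complexes, toy_pairs)
```

Dividing the two counts by the number of complexes gives percentages of the kind reported in the text.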
The replication factor C (RFC) complex (complex 105) is a good example of the formation of a protein complex induced by gene duplication. This complex consists of five RFC subunits and one binding partner, PCNA. The complex is known to be associated with DNA synthesis, and the function and machinery are conserved between yeast and human, indicating that this is an ancient protein complex. Paralog assignment suggested that three (RFC 36, 37, 40) of five RFC subunits are paralogous; i.e., originating from a common ancestor, whereas the result obtained by the less stringent BLASTP method suggested that all five subunits are mutually paralogous. The presence of the "RFC box" motif in all five proteins and the consistency of exon-intron boundaries also support the homologous relationships of these five subunits. These results indicate that the enlargement of a protein complex is mainly mediated by homologous interactions and that gene duplication events markedly contribute to the establishment of protein complexes.
Functional assignments for hypothetical proteins in the annotated complexes
First, we explain the definition of proteins with no functional assignments, known as "hypothetical proteins." H-InvDB proteins were analyzed with standardized functional annotation by curators who classified the proteins into several categories: i) identical to known human proteins, ii) similar to known proteins (having 50% sequence similarity), iii) InterPro-domain-containing proteins, and iv) hypothetical proteins (with no biological functions inferred). The "hypothetical proteins" discussed here are of the fourth category.
Table 3. Hypothetical proteins whose functions can be easily inferred from their partners. Column header: HIP (protein ID). Complexes listed: Fanconi anemia (FA) core complex; C8orf32-EFCBP2-RUNX1T1-ZNF652 containing complex; SRGAP3-WASF1 containing complex; C19orf25-KNTC1-ZW10 containing complex; NONO-PSPC1-WBP4-ZNRD1 containing complex; SCF (Skp1, cullin 1, F-box) ubiquitin E3 ligase complex.
After annotation, we found that some of the hypothetical proteins were reported in the literature as actual protein subunits (Table 3). The results show the high potential value of our predicted complex data and indicate that the complex annotation used for our database can be a key tool for new discovery of protein complexes and their functions.
PCDq comprises both known and predicted complexes and subunits. The evidence level for each subunit was also determined and summarized as a complex quality index (CQI) for each protein complex.
The expected users of PCDq are both experimental biologists and computational scientists. Biologists can seek candidate protein subunits for known or unknown protein complexes and review the information (functions, gene expressions, PPIs, etc.) about a protein complex. Computational scientists can collect integrated PPI network datasets with various levels of reliability using original annotation in the form of protein categories and CQIs. Thus, for users who would like to develop a method for protein complex prediction, PCDq provides different thresholds for dataset assembly using CQI.
Users can download the dataset of PCDq, including protein complex list, their subunits (members), and related functional annotation from the H-InvDB download page (http://h-invitational.jp/hinv/dataset/download.cgi, see "Results of computational analysis").
To assess the quality of our protein complex annotation, we estimated the enrichment and the proportion of consistency of GO terms among subunits of a complex. This assessment is based on the assumption that the same GO terms are assigned to the proteins in a single protein complex. The proportions of protein complexes having enriched GO terms and the degree of GO term consistency were greater in the manually curated set of protein complexes (PCset2) than in all the predicted complexes (PCset1) or the random set, indicating the relatively high quality of manual annotation and the advantage of protein complex prediction followed by manual annotation as opposed to only single computational prediction.
Next, for each complex, we compared the expression profiles of the protein subunits in the complex based on the idea that proteins in the same complex would have similar functions and that coexpressed genes are more likely to have similar functions. The result showed that the subunits of large complexes tend to be expressed similarly. The ratio of duplicated genes to all the proteins in a complex was evaluated, and the results indicated that the enlargement of a protein complex is mainly mediated by homologous interactions and that gene duplication events markedly contribute to the establishment of protein complexes.
Recent statistics of H-InvDB proteins show that 35% of H-InvDB representative transcripts are hypothetical proteins. Assigning functions to hypothetical proteins of unknown function is one of the most important issues in proteome analysis. Since subunits of a complex generally tend to have the same biological function, prediction of a protein complex allows increased confidence in the annotation of hypothetical proteins. After the construction of PCDq by protein complex prediction and annotation, we found that 78 hypothetical proteins were contained in 82 predicted complexes. Of these 78, 13 were subunits of 12 functionally annotatable complexes. These hypothetical proteins are probably involved in biological processes shared by other subunits of their complexes. Thus complex prediction gives us some clues for inferring their functions. For example, it is suggested that the hypothetical proteins HIP000013164 and HIP000053526 in the dREAM complex function in the cell cycle, and that HIP000177716 (FA core complex), HIP000079962 (INO80 complex), and HIP000024165 (Lamins complex) function in DNA repair, DNA repair and transcription, and nuclear organization, respectively. The remaining eight hypothetical proteins that could be assigned functions are summarized in Table 3. When we checked the recent literature after making the predictions, four of the thirteen hypothetical proteins were found to be actual subunits of the predicted protein complexes, and their PCDq entries were updated. Thus, protein complex prediction and annotation offers clues to the functions of hypothetical proteins.
We predicted and annotated 1,264 human protein complexes from integrated PPI data. GO analysis increased the reliability of both complex prediction and manual annotation. The analysis of expression profiles and duplicated genes made it clear that protein subunits tend to be expressed similarly and are mutually paralogous within complexes. Comprehensive protein complex prediction and annotation will provide strong functional annotation clues about hypothetical proteins. We constructed a new human protein complex database with quality index (PCDq) to provide this comprehensive annotation of human protein complexes.
Availability and requirements
PCDq is freely available at the URL http://h-invitational.jp/hinv/pcdq/.
BIND (Biomolecular Interaction Network Database)
BLAST (Basic Local Alignment Search Tool)
CAGE (Cap Analysis of Gene Expression)
CQI (Complex Quality Index)
DIP (Database of Interacting Proteins)
EST (Expressed Sequence Tag)
FDR (False Discovery Rate)
GNP (Genome Network Project)
H-ANGEL (Human Anatomic Gene Expression Library)
HPRD (Human Protein Reference Database)
iAFLP (introduced Amplified Fragment Length Polymorphism)
MINT (Molecular INTeraction database)
ORF (Open Reading Frame)
PCDq (protein complex database with quality index)
PDB (Protein Data Bank)
PSI-MI (Proteomics Standards Initiative Molecular Interaction Standard)
RFC (Replication Factor C)
ROC (Receiver Operating Characteristic)
XML (Extensible Markup Language).
This work was supported by the Japan Biological Informatics Consortium (JBIC). We appreciate the great efforts of Ryoko Sanbonmatsu, Ryuichi Sakate, Akiko Ogura Noda, Yoshihiro Kawahara, Jun-ichi Takeda, Emilio Campos, and Takayuki Oonishi in complex annotation and Masaru Watanabe in the database tool construction.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast. 2001, 18: 523-531. 10.1002/yea.706.
- Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FC: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell. 2002, 9: 1133-1143. 10.1016/S1097-2765(02)00531-2.
- Titz B, Schlesner M, Uetz P: What do we learn from high-throughput protein interaction data?. Expert Rev Proteomics. 2004, 1: 111-121. 10.1586/147894184.108.40.206.View ArticlePubMed
- Bader GD, Hogue CW: BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics. 2000, 16: 465-477. 10.1093/bioinformatics/16.5.465.View ArticlePubMed
- Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003, 31: 248-250. 10.1093/nar/gkg056.PubMed CentralView ArticlePubMed
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.PubMed CentralView ArticlePubMed
- Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett. 2002, 513: 135-140. 10.1016/S0014-5793(01)03293-8.View ArticlePubMed
- Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007, 35: D572-574. 10.1093/nar/gkl950.PubMed CentralView ArticlePubMed
- Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.PubMed CentralView ArticlePubMed
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al: IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004, 32: D452-455. 10.1093/nar/gkh052.PubMed CentralView ArticlePubMed
- Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, et al: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39: D561-568. 10.1093/nar/gkq973.PubMed CentralView ArticlePubMed
- Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21: 2076-2082. 10.1093/bioinformatics/bti273.View ArticlePubMed
- Chen JY, Mamidipalli S, Huan T: HAPPI: an online database of comprehensive human annotated and predicted protein interactions. BMC genomics. 2009, 1 (10 Suppl): S16-View Article
- Ruepp A, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Stransky M, Waegele B, Schmidt T, Doudieu ON, St\"umpflen V, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes. Nucleic acids research. 2008, 36: D646-650.PubMed CentralView ArticlePubMed
- Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W: CORUM: the comprehensive resource of mammalian protein complexes-2009. Nucleic acids research. 2010, 38: D497-501. 10.1093/nar/gkp914.PubMed CentralView ArticlePubMed
- Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, et al: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007, 25: 309-316. 10.1038/nbt1295.View ArticlePubMed
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7: 207-10.1186/1471-2105-7-207.PubMed CentralView ArticlePubMed
- Yamasaki C, Murakami K, Takeda J, Sato Y, Noda A, Sakate R, Habara T, Nakaoka H, Todokoro F, Matsuya A, et al: H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res. 2010, 38: D626-632. 10.1093/nar/gkp1020.PubMed CentralView ArticlePubMed
- Yamasaki C, Murakami K, Fujii Y, Sato Y, Harada E, Takeda J, Taniya T, Sakate R, Kikugawa S, Shimada M, et al: The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts. Nucleic Acids Res. 2008, 36: D793-799.PubMed
- Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, et al: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004, 2: e162-10.1371/journal.pbio.0020162.PubMed CentralView ArticlePubMed
- Shah SP, Huang Y, Xu T, Yuen MM, Ling J, Ouellette BF: Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics. 2005, 6: 34-10.1186/1471-2105-6-34.PubMed CentralView ArticlePubMed
- Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2012, 40: D48-53. 10.1093/nar/gkr1202.PubMed CentralView ArticlePubMed
- Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40: D130-135. 10.1093/nar/gkr1079.PubMed CentralView ArticlePubMed
- The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40: D71-75.PubMed CentralView Article
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMed
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007, 35: D5-12. 10.1093/nar/gkl1031.PubMed CentralView ArticlePubMed
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.View ArticlePubMed
- Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol. 2002, 20: 991-997. 10.1038/nbt1002-991.View ArticlePubMed
- Kumar A, Snyder M: Protein complexes take the bait. Nature. 2002, 415: 123-124. 10.1038/415123a.View ArticlePubMed
- Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang HC, Hirai A, et al: Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res. 2006, 16: 686-691. 10.1101/gr.4527806.PubMed CentralView ArticlePubMed
- Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE: Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998, 54: 1078-1084. 10.1107/S0907444998009378.View ArticlePubMed
- Werner F: Structure and function of archaeal RNA polymerases. Mol Microbiol. 2007, 65: 1395-1404. 10.1111/j.1365-2958.2007.05876.x.View ArticlePubMed
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMed CentralView ArticlePubMed
- Devos D, Valencia A: Practical limits of function prediction. Proteins. 2000, 41: 98-107. 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S.View ArticlePubMed
- Rost B: Enzyme function less conserved than anticipated. J Mol Biol. 2002, 318: 595-608. 10.1016/S0022-2836(02)00016-5.View ArticlePubMed
- Tanino M, Debily MA, Tamura T, Hishiki T, Ogasawara O, Murakawa K, Kawamoto S, Itoh K, Watanabe S, de Souza SJ, et al: The Human Anatomic Gene Expression Library (H-ANGEL), the H-Inv integrative display of human gene expression across disparate technologies and platforms. Nucleic Acids Res. 2005, 33: D567-572. 10.1093/nar/gki388.PubMed CentralView ArticlePubMed
- Kawamoto S, Ohnishi T, Kita H, Chisaka O, Okubo K: Expression profiling by iAFLP: a PCR-based method for genome-wide gene expression profiling. Genome Res. 1999, 9: 1305-1312. 10.1101/gr.9.12.1305.PubMed CentralView ArticlePubMed
- Kim KI, van de Wiel MA: Effects of dependence in high-dimensional multiple testing problems. BMC Bioinformatics. 2008, 9: 114-10.1186/1471-2105-9-114.PubMed CentralView ArticlePubMed
- Liu CT, Yuan S, Li KC: Patterns of co-expression for protein complexes by size in Saccharomyces cerevisiae. Nucleic Acids Res. 2009, 37: 526-532.PubMed CentralView ArticlePubMed
- Gu Z, Cavalcanti A, Chen FC, Bouman P, Li WH: Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol. 2002, 19: 256-262. 10.1093/oxfordjournals.molbev.a004079.View ArticlePubMed
- Lipman DJ, Pearson WR: Rapid and sensitive protein similarity searches. Science. 1985, 227: 1435-1441. 10.1126/science.2983426.View ArticlePubMed
- Cai J, Uhlmann F, Gibbs E, Flores-Rozas H, Lee CG, Phillips B, Finkelstein J, Yao N, O'Donnell M, Hurwitz J: Reconstitution of human replication factor C from its five subunits in baculovirus-infected insect cells. Proc Natl Acad Sci USA. 1996, 93: 12896-12901. 10.1073/pnas.93.23.12896.PubMed CentralView ArticlePubMed
- O'Donnell M, Onrust R, Dean FB, Chen M, Hurwitz J: Homology in accessory proteins of replicative polymerases--E. coli to humans. Nucleic Acids Res. 1993, 21: 1-3. 10.1093/nar/21.1.1.PubMed CentralView ArticlePubMed
- Litovchick L, Sadasivam S, Florens L, Zhu X, Swanson SK, Velmurugan S, Chen R, Washburn MP, Liu XS, DeCaprio JA: Evolutionarily conserved multisubunit RBL2/p130 and E2F4 protein complex represses human cell cycle-dependent genes in quiescence. Mol Cell. 2007, 26: 539-551. 10.1016/j.molcel.2007.04.015.View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.