A novel function prediction approach using protein overlap networks
© Liang et al.; licensee BioMed Central Ltd. 2013
Received: 8 February 2013
Accepted: 12 July 2013
Published: 17 July 2013
Construction of a reliable network remains the bottleneck for network-based protein function prediction. We built an artificial network model called protein overlap network (PON) for the entire genome of yeast, fly, worm, and human, respectively. Each node of the network represents a protein, and two proteins are connected if they share a domain according to InterPro database.
The function of a protein can be predicted by counting the occurrence frequency of GO (gene ontology) terms associated with domains of direct neighbors. The average success rate and coverage were 34.3% and 43.9%, respectively, for the test genomes, and were increased to 37.9% and 51.3% when a composite PON of the four species was used for the prediction. As a comparison, the success rate was 7.0% in the random control procedure. We also made predictions with GO term annotations of the second layer nodes using the composite network and obtained an impressive success rate (>30%) and coverage (>30%), even for small genomes. Further improvement was achieved by statistical analysis of manually annotated GO terms for each neighboring protein.
The PONs are composed of dense modules accompanied by a few long distance connections. Based on the PONs, we developed multiple approaches effective for protein function prediction.
KeywordsProtein overlap network Protein function prediction Composite network Functional genomics
Proteins are basic functional components in any biological process. Discovery of the functions of an individual protein is therefore a critical step towards understanding biological processes and the complete biological system. Besides experimental studies, computational prediction plays an important role in current protein function investigation . The general biochemical function of a protein can be inferred if the amino acid sequence or 3-D structure of the protein resembles another protein whose function is known [2–7]. The rationale for the homology-based method is that two proteins with a similar sequence or structure could evolve from a common ancestor and thus have similar functions. Although this homology-based method remains the most widely utilized computational tool for function assignment, some proteins identified from genome sequencing do not have any homologs that were functionally characterized in the previous studies. On the other hand, homologous proteins might also acquire different functions in evolution . Thus computational methods other than the homology-based method are much in demand to improve the accuracy and coverage for protein function prediction.
Graph theoretical analysis can be used for predicting the function of uncharacterized proteins [9–11]. Briefly, genes or the products of genes, i.e. proteins, are represented as nodes in the network and two nodes are connected if they are in some type of association, e.g., protein-protein interaction (PPI). Possible functions of a protein can be assigned based on the frequently observed known functions of its immediate interacting partners . Advanced approaches, such as defining the neighborhood of a protein with a radius of n, considering the shared neighborhood of a pair of proteins , transferring common annotations in a module to the uncharacterized members [15–19], were also developed over the years.
Unlike function inference methods, construction of a reliable network remains the bottleneck for network-based function prediction. High-throughput experimental methods such as two-hybrid techniques , affinity precipitation , and synthetic lethal screening [22, 23] have been developed to construct PPI networks. Gene expression networks were built from gene pairs showing significant correlation of expression in microarray experiments [24, 25]. In spite of wide applications, the high-throughput experimental methods often sacrifice specificity for scale. The constructed networks have a high level of false positives  and the predicted protein functions are even noisier.
The so-called genomic context (GC) methods, namely, gene fusion , gene order conservation , phylogenetic profiles , and operon rearrangement [30, 31] exploit the genome sequences themselves to predict protein-protein associations. The derived information can be used for the construction of a PPI network. As a result, the network based approach showed better performance than the raw scores of GC-based methods in protein function prediction .
To a less extent, domain co-occurrence networks (DCN) were investigated for the prediction of protein and domain functions . A protein domain is a continuous sequence within the protein that represents a structural, functional, and evolutional unit. A simple protein may contain a single domain, whereas 70% of eukaryotic proteins are composed of multiple domains . In a DCN, two domains are connected by an edge if there is one protein containing both of them. Previous studies showed that DCNs are scale-free and small-world networks . The domain function could be predicted by counting the occurrence frequency of GO terms associated with the neighboring nodes in a single-genome DCN while the function of a multiple-domain protein could be derived by integrating the predicted function of each domain. The prediction results were not improved by sophisticated algorithms such as χ2 method or support vector machine (SVM) learning .
As a counterpart to DCN, the protein overlap network (PON) provides another way for function prediction. In a PON, each node represents a protein and two nodes are connected with an edge if the proteins share a common domain. While a few efforts have been made to study the properties of DCNs [32, 34–36], to our knowledge, there has been no detailed research report but a few brief comments about PONs . In this study, we constructed the PONs with the entire genomes of individual organisms, studied the network properties, and made predictions of protein function with the composite PON of multiple genomes.
The protein annotations were downloaded from InterPro , a database that integrates various resources, for the complete genomes of yeast (S. cerevisiae), fruit fly (D. melanogaster), worm (C. elegans), and human (H. sapiens) in April 2012. Every protein was annotated with domains retrieved from the Pfam database , and the functions of each domain were represented by their available GO (gene ontology) term annotations . The four species were selected because they were all well-studied model organisms spanning comprehensive levels of complexity. We found 4,759 proteins possessing a UniProt ID  and Pfam annotation, and a total number of 2619 domains with 947 GO term annotations in yeast, as well as 12,933 proteins (3422 domains and 1114 GO terms) of fly, 15,433 proteins (3172 domains and 1093 GO terms) of worm, and 41,053 proteins (4290 domains and 1363 GO terms) of human. The GO term annotations were the same for the common domains contained in various proteins of the four species.
We also used UniProt-GOA (gene ontology annotation) database  for prediction and evaluation, in which GO terms were assigned to gene products using a combination of high-quality electronic mappings and manual curation. Even homologous proteins with the same domain composition might acquire various functions in evolution and thus annotated with different GO terms. The gene association files were downloaded from http://www.ebi.ac.uk/GOA/ for yeast, fly, worm, and human in April 2012.
Network construction and property analysis
In our constructed PONs, each node represents one protein. Two nodes are connected by an edge if the corresponding proteins share a common domain defined by Pfam. The proteins and the domains acquired above were used to construct a PON for yeast, fly, worm, and human, respectively.
where n i is the number of connected node pairs (edges) among nodes directly connected to the central node n and k i is the number of neighboring nodes of the central node n.
Protein function prediction with a domain-based method
We used the neighbor-counting method to retrieve GO terms associated with all neighbors of a query protein in the PON and ranked them by their occurrence frequencies. Instead of simply counting the number of protein nodes associated with a GO term, the occurrence frequency was calculated as the number of domains that were associated with the GO term and presented in the neighboring nodes. Naturally, GO terms associated with the common domains shared by a query protein and its neighbors should be considered more favorable than those associated with domains observed only in the neighbors. For the purpose of testing the prediction method, we assumed that the function was unknown for all domains in the query protein. Thus, unless specifically indicated, we made predictions using only domains that were not observed in the query protein, i.e., excluding the common domains. In cases where a GO term was associated with both the common domain and other domains in neighbors, the GO term was considered in prediction but the common domain was not counted in the occurrence frequency. The result was compared with the combination of annotated GO terms of all domains in the query protein. As a control to evaluate the prediction performance, we randomly switched GO terms belonging to two proteins and the corresponding domains throughout the whole genome.
Protein function prediction with GOA database
Annotated GO terms obtained from the GOA database were statistically analyzed for neighboring nodes of the query protein. Those with a high occurrence frequency, i.e., associated with a large number of proteins, were selected and compared with the annotations of the query in the GOA database.
Evaluation of protein function prediction
We used the success rate of prediction as the primary evaluation method. The success rate was defined as the ratio of correctly predicted GO terms to the total predictions. For the purpose of evaluation, we limited the number of predicted GO terms to be the same as that of the annotated GO terms in the query protein. Frequently, several GO terms had the same occurrence frequency around the cut-off and only a part of them were randomly selected to meet the criteria. We might also have a smaller number of GO terms associated with the neighboring proteins than the total annotations of the query protein and thus all of the analyzed GO terms constituted the prediction. In addition to the success rate, we also evaluated top 3 ranked GO terms, which means that the prediction was considered correct only if one of the three GO terms matched any annotation of the query protein. Similarly, the GO term with the highest occurrence frequency was evaluated for the top 1 GO term accuracy. A node was considered predictable if any of its neighboring nodes was annotated with one or more GO terms. Predictable nodes with observed GO term annotations were subject for evaluation and the coverage was calculated as the percentage of the evaluated nodes out of all nodes in the network.
Statistical properties of protein overlap network
Network properties of the main sub-graphs of four investigated genomes
No. of proteins
No. of domains
No. of GO terms
For the entire network of the four species, 65 ~ 75% proteins were annotated with at least one GO term according to the InterPro database and the mean value of the annotations was very similar (2.4 ~ 2.6) despite of the different genome size. The nodes in a PON sub-graph frequently have a common GO term annotation. About 30% of the nodes share the most popular GO term, protein binding function, for the main sub-graphs of the fly, worm, and human networks. On the other hand, the most popular GO term for the main sub-graph of the yeast is ATP binding function, which is shared by 41% of the nodes. Nevertheless, 95% of the nodes of the third largest sub-graph of the yeast network, which contains 106 proteins and 36 domains, share the most popular GO term Protein binding function. Compared to the main sub-graphs, other sub-graphs usually are small, comprise only a few domains, and have a very popular GO term among the nodes.
Protein function prediction with a single-genome PON
Results of protein function prediction with a single-genome PON
Prediction with a composite PON
We constructed the composite PON by combining different genomes because the success rate of the domain-based method is positively correlated with the size of the network and the GO term annotation of a domain is exactly the same for various genomes according to InterPro database. The composite PON from the four species (yeast, fly, worm, and human) containing 74178 nodes and 5079 domains was constructed to investigate whether the prediction could be improved with a larger PON. The accuracies obtained for each of the four species using the composite PON were compared with those obtained using their individual single-genome PONs. For the three species with a smaller-sized genome, yeast, fly, and worm, the success rate was slightly improved by up to 4% and the coverage was also increased by 7%-12%. For the human genome, the coverage slightly increased by 1.2% though the success rate decreased by 1.6% (Table 2 & Table 3).
There are two advantages using the composite PON for protein function predictions: one is to improve the accuracy for those nodes that already can be predicted with the corresponding small-sized single-genome PON, and the other is to improve the coverage by adding new predictions, albeit at lower accuracy. With the composite PON, the success rate significantly increased by 8.7%, 2.6%, and 4.6% for those annotated and originally predicable nodes for yeast, fly, and worm, respectively. On the other hand, the new predictions that were unpredictable with a single-genome PON showed lower success rates (<20%) except for worm (32%), partly due to the lack of annotated domain types at the neighboring nodes. For example, the average occurrence frequencies of the most popular GO terms were 1.3 and 1.8 for the new predictions in the composite network compared with 6.3 and 6.2 for the originally predictable nodes for fly and worm, respectively. In future studies, more proteins and genomes may be used in the combination to improve the prediction accuracy and coverage even for the human genome.
Prediction using the second layer neighbors
Results of protein function prediction with a composite PON of four genomes
Overall success rate (%)
Success rate (%)
High degree nodesa
Low degree nodesb
Results of protein function prediction with the second layer nodes
No. of domains associated with a statistically analyzed GO term
> = 1
> = 4
> = 16
Success rate (%)
Success rate (%)
Success rate (%)
Prediction and evaluation with gene ontology annotation (UniProt-GOA) database
Results of prediction and evaluation with GOA database
Success rate (%)
Top 1 accuracy (%)
Top 3 accuracy (%)
Prediction with a single-genome network
Prediction with the composite network constructed from the other three genomes excluding the predicted one
Moreover, the GOA database used for prediction is a comprehensive GO annotation dataset and the coverage is very high for the four species predicted with the composite PON of other genomes, respectively, achieving a mean value of 79.2%. As a comparison, the domain-based method yield a mean coverage of 71.0%, which is also high compared with those in Table 3 for considering the common domains in the prediction. Furthermore, we define the recall rate as the percentage of correctly predicted GO terms in total annotations of the query protein. The recall rate of the domain-based method (36.6%) is significantly lower than that of the GOA database method (47.5%) since the InterPro database used for the domain-based method is only one of information sources to build the GOA database and the predicted GO terms are often less than the annotations in GOA database.
In general, the GOA database method shows a better success rate, coverage of query proteins, and recall rate of observed GO terms than the domain-based method, especially in prediction with a single-genome PON. However, there are only 13 species with genome-wide annotations in the GOA database until the end of 2012, therefore the applicability is rather limited. It is also not clear to what extent the high success rate predicted with the GOA database is related to the sequence similarity between common domains of the query protein and neighbors. On the other hand, we optimized the domain-based method without considering the common domains. The unknown function of a domain can be predicted from other functionally characterized domains, which is especially helpful for biological experimentalists.
We built PONs for protein function prediction. Our function prediction algorithm is based on two observations. Firstly, if two domains (or domain combinations), B and C, combine with the domain A to form two proteins: AB and AC, it is possible that the domain B and C share the same GO terms. Actually, for all connected pairs, e.g., AB and AC, in the human PON, a GO term associated with domain B is also associated with domain C in 24.4% of cases compared with 4.7% in the random control procedure. In fact, this indirect functional association was also observed at protein level in early studies. Chua et al. found proteins interacting with the same partners had a greater likelihood of sharing similar physical or biochemical characteristics . Secondly, a GO term that is associated with A is also the annotated GO term of B in 35.4% cases, i.e., domains within a protein may share the same GO term annotations, which is the basis for domain function prediction with a DCN . It should be noted that the GO terms shared by B and C are not due only to the fact that B shares GO terms with A and A shares GO terms with C (35.4% × 35.4% <24.4%).
Wang et al. made protein function predictions by integrating the GO terms predicted for each of its domains using a single-genome DCN . If all the conditions were equally set, the domain-based PON methods and the DCN methods should produce exactly the same results for protein function prediction. Nevertheless, PONs contain richer information than DCNs, which makes it possible to yield better prediction results. For example, the current Pfam 26.0 has a sequence coverage of only 69.7% and an amino acid coverage of 44.4% for the human genome . For the DCN method, the prediction completely relies on the Pfam annotation. However, the complete GO term annotations of a protein could be obtained from GOA database  and used for function prediction with a PON even if only a part of the protein is annotated with Pfam domains.
GO term comparison is a yes-or-no binary criterion in this study even though two different GO terms may share functional similarity. Nevertheless, the mean value of success rates calculated with the binary criterion strongly correlates with that of the similarity scores in large-scale calculations like those presented here. For example, the mean success rates predicted and evaluated with the GOA database were 48.3, 41.1, and 35.5% respectively, for yeast proteins using the single-genome PON, the composite PON of four species, and the composite PON excluding yeast. When the semantic score between the predicted GO term set and the annotated set of each protein was calculated with the online method G-sesame , the mean values correspondingly increased to 0.709, 0.678, and 0.638, respectively.
The currently proposed methods can be used for the prediction of new functions of a protein in addition to the annotated GO terms, or a protein with unknown function. In fact, there are 3.2% yeast proteins that are not annotated but predictable with the domain based method using the single-genome PON (Table 2). The number increased to 7.5% by using the composite network of yeast, fly, worm, and human. In future, more genomes could be used to build the composite network to improve the prediction accuracy and coverage for the domain-based prediction method. Results from multiple approaches such as the domain-based method, prediction with the GOA database, and analysis of the second layer neighbors could also be combined to produce a comprehensive prediction.
The protein overlap networks have different properties compared with the domain co-occurrence networks. The PON of an entire genome contains many disconnected sub-graphs with one main sub-graph, which comprises 50% of proteins except for the relatively small PON of yeast. The logarithmic relationship between the number of nodes and the corresponding degree value does not follow a typical power law distribution of scale-free network. The clustering coefficient of the networks is exceptionally high (>0.9). Using a PON, protein functions can be predicted with the domain-based method or gene ontology annotation database. The GOA database method usually shows a better performance than the domain-based method. The composite PON of multiple genomes can be used to enhance the performance for the domain-based method, especially in prediction for small genomes. But the GOA database method achieves the best accuracy for a protein when the single-genome PON of the same species is used. The success rate of the domain-based method is close to that of the GOA method in prediction for proteins without genome-wide sequence information and domain annotations. Moreover, the domain-based method can be optimized to predict the functions for an uncharacterized domain from the annotated GO terms of other domains.
Protein overlap network
Domain co-occurrence networks
Support vector machine
Gene ontology Annotation.
This work was partly supported by a kakenhi grant 24570184: Grant-in-Aid for Scientific Research (C) from the Japan Society for the Promotion of Science (JSPS).
- Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature. 2000, 405: 823-826. 10.1038/35015694.PubMedView ArticleGoogle Scholar
- Dobson PD, Cai YD, Stapley BJ, Doig AJ: Prediction of protein function in the absence of significant sequence similarity. Curr Med Chem. 2004, 11: 2135-2142. 10.2174/0929867043364702.PubMedView ArticleGoogle Scholar
- Procter JB, Thompson J, Letunic I, Creevey C, Jossinet F, Barton GJ: Visualization of multiple alignments, phylogenies and gene family evolution. Nat Methods. 2010, 7: S16-S25. 10.1038/nmeth.1434.PubMedView ArticleGoogle Scholar
- Watson JD, Laskowski RA, Thornton JM: Predicting protein function from sequence and structural data. Curr Opin Struct Biol. 2005, 15: 275-284. 10.1016/j.sbi.2005.04.003.PubMedView ArticleGoogle Scholar
- Pal D, Eisenberg D: Inference of protein function from protein structure. Structure. 2005, 13: 121-130. 10.1016/j.str.2004.10.015.PubMedView ArticleGoogle Scholar
- Ponomarenko JV, Bourne PE, Shindyalov IN: Assigning new GO annotations to protein data bank sequences by combining structure and sequence homology. Proteins. 2005, 58: 855-865. 10.1002/prot.20355.PubMedView ArticleGoogle Scholar
- Sael L, Chitale M, Kihara D: Structure- and sequence-based function prediction for non-homologous proteins. J Struct Funct Genomics. 2012, 13: 111-123. 10.1007/s10969-012-9126-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity?. J Mol Biol. 2003, 333: 863-882. 10.1016/j.jmb.2003.08.057.PubMedView ArticleGoogle Scholar
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88-PubMedPubMed CentralView ArticleGoogle Scholar
- Janga SC, Diaz-Mejia JJ, Moreno-Hagelsieb G: Network-based function prediction and interactomics: the case for metabolic enzymes. Metab Eng. 2011, 13: 1-10. 10.1016/j.ymben.2010.07.001.PubMedView ArticleGoogle Scholar
- Reimand J, Hui S, Jain S, Law B, Bader GD: Domain-mediated protein interaction prediction: From genome to network. FEBS Lett. 2012, 586: 2751-2763. 10.1016/j.febslet.2012.04.027.PubMedView ArticleGoogle Scholar
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol. 2000, 18: 1257-1261. 10.1038/82360.PubMedView ArticleGoogle Scholar
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein–protein interaction data. Yeast. 2001, 18: 523-531. 10.1002/yea.706.PubMedView ArticleGoogle Scholar
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22: 1623-1630. 10.1093/bioinformatics/btl145.PubMedView ArticleGoogle Scholar
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA. 2003, 100: 12123-12128. 10.1073/pnas.2032324100.PubMedPubMed CentralView ArticleGoogle Scholar
- Rives AW, Galitski T: Modular organization of cellular networks. Proc Natl Acad Sci USA. 2003, 100: 1128-1133. 10.1073/pnas.0237338100.PubMedPubMed CentralView ArticleGoogle Scholar
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinforma. 2003, 4: 2-10.1186/1471-2105-4-2.View ArticleGoogle Scholar
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinforma. 2006, 7: 207-10.1186/1471-2105-7-207.View ArticleGoogle Scholar
- Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins. 2004, 54: 49-57.PubMedView ArticleGoogle Scholar
- Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature. 1989, 340: 245-246. 10.1038/340245a0.PubMedView ArticleGoogle Scholar
- Larsson PO, Mosbach K: Affinity precipitation of enzymes. FEBS Lett. 1979, 98: 333-338. 10.1016/0014-5793(79)80211-2.PubMedView ArticleGoogle Scholar
- Novick P, Osmond BC, Botstein D: Suppressors of yeast actin mutations. Genetics. 1989, 121: 659-674.PubMedPubMed CentralGoogle Scholar
- Bender A, Pringle JR: Use of a screen for synthetic lethal and multicopy suppressee mutants to identify two new genes involved in morphogenesis in Saccharomyces cerevisiae. Mol Cell Biol. 1991, 11: 1295-1305.PubMedPubMed CentralView ArticleGoogle Scholar
- Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249-255. 10.1126/science.1087447.PubMedView ArticleGoogle Scholar
- Aoki K, Ogata Y, Shibata D: Approaches for extracting practical information from gene co-expression networks in plant biology. Plant Cell Physiol. 2007, 48: 381-390. 10.1093/pcp/pcm013.PubMedView ArticleGoogle Scholar
- Hakes L, Pinney JW, Robertson DL, Lovell SC: Protein-protein interaction networks and biology–what’s the connection?. Nat Biotechnol. 2008, 26: 69-72. 10.1038/nbt0108-69.PubMedView ArticleGoogle Scholar
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature. 1999, 402: 86-90. 10.1038/47056.PubMedView ArticleGoogle Scholar
- Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999, 96: 2896-2901. 10.1073/pnas.96.6.2896.PubMedPubMed CentralView ArticleGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999, 96: 4285-4288. 10.1073/pnas.96.8.4285.PubMedPubMed CentralView ArticleGoogle Scholar
- Snel B, Bork P, Huynen MA: The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci USA. 2002, 99: 5890-5895. 10.1073/pnas.092632599.PubMedPubMed CentralView ArticleGoogle Scholar
- Janga SC, Collado-Vides J, Moreno-Hagelsieb G: Nebulon: a system for the inference of functional relationships of gene products from the rearrangement of predicted operons. Nucleic Acids Res. 2005, 33: 2521-2530. 10.1093/nar/gki545.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang Z, Zhang XC, Le MH, Xu D, Stacey G, Cheng J: A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny. PLoS One. 2011, 6: e17906-10.1371/journal.pone.0017906.PubMedPubMed CentralView ArticleGoogle Scholar
- Han JH, Batey S, Nickson AA, Teichmann SA, Clarke J: The folding and evolution of multidomain proteins. Nat Rev Mol Cell Biol. 2007, 8: 319-330. 10.1038/nrm2144.PubMedView ArticleGoogle Scholar
- Wuchty S: Scale-free behavior in protein domain networks. Mol Biol Evol. 2001, 18: 1694-1702. 10.1093/oxfordjournals.molbev.a003957.PubMedView ArticleGoogle Scholar
- Przytycka T, Davis G, Song N, Durand D: Graph theoretical insights into evolution of multidomain proteins. J Comput Biol. 2006, 13: 351-363. 10.1089/cmb.2006.13.351.PubMedPubMed CentralView ArticleGoogle Scholar
- Cohen-Gihon I, Nussinov R, Sharan R: Comprehensive analysis of co-occurring domain sets in yeast proteins. BMC Genomics. 2007, 8: 161-10.1186/1471-2164-8-161.PubMedPubMed CentralView ArticleGoogle Scholar
- Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al: InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012, 40: D306-D312. 10.1093/nar/gkr948.PubMedPubMed CentralView ArticleGoogle Scholar
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.PubMedPubMed CentralView ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40: D71-D75.PubMed CentralView ArticleGoogle Scholar
- Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ, Bely B, Browne P, Chan WM, Eberhardt R, et al: The UniProt-GO Annotation database in 2011. Nucleic Acids Res. 2012, 40: D565-D570. 10.1093/nar/gkr1048.PubMedPubMed CentralView ArticleGoogle Scholar
- Dijkstra EW: A note on two problems in connexion with graphs. Numer Math. 1959, 1: 269-271. 10.1007/BF01386390.View ArticleGoogle Scholar
- Ye Y, Godzik A: Comparative analysis of protein domain organization. Genome Res. 2004, 14: 343-353. 10.1101/gr.1610504.PubMedPubMed CentralView ArticleGoogle Scholar
- Vogel C, Teichmann SA, Pereira-Leal J: The relationship between domain duplication and recombination. J Mol Biol. 2005, 346: 355-365. 10.1016/j.jmb.2004.11.050.PubMedView ArticleGoogle Scholar
- Du Z, Li L, Chen CF, Yu PS, Wang JZ: G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Res. 2009, 37: W345-W349. 10.1093/nar/gkp463.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.