A vertex similarity-based framework to discover and rank orphan disease-related genes
© Zhu et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
A rare or orphan disease (OD) is any disease that affects a small percentage of the population. While opportunities now exist to accelerate progress toward understanding the basis for many more ODs, the prioritization of candidate genes is still a critical step for disease-gene identification. Several network-based frameworks have been developed to address this problem with varied results.
We have developed a novel vertex similarity (VS) based, parameter-free prioritization framework to identify and rank orphan disease candidate genes. We validate our approach using 1598 known orphan disease-causing genes (ODGs) representing 172 orphan diseases (ODs). We compare our approach with a state-of-the-art parameter-based approach (PageRank with Priors, or PRP) and with another parameter-free method (Interconnectedness, or ICN). Our results show that the VS-based approach outperforms ICN and is comparable to PRP. We further apply VS-based ranking to identify and rank potential novel candidate genes for several ODs.
We demonstrate that the VS-based parameter-free ranking approach can be successfully used for disease candidate gene prioritization and can complement other network-based methods for candidate disease gene ranking. Importantly, our VS-ranked top candidate genes for the ODs match the known literature, suggesting several novel causal relationships for further investigation.
In the USA, a rare or orphan disease (OD) is defined as a disease that affects fewer than 200,000 inhabitants. According to one estimate, there are as many as 8000 ODs, many of which are known to be of genetic origin, affect children at a very early age, and are life-threatening and/or chronically debilitating [2, 3]. Although the advent of next-generation sequencing technologies has accelerated the disease gene discovery pipeline, the prioritization of candidate genes is still a critical step for disease-gene identification. We, and several earlier studies [6–9], have shown that genes associated with phenotypically close disorders tend to share molecular signatures, including similar expression profiles, participation in the same biological processes or pathways, protein interactions or complexes, and literature co-citation. We have recently completed a global analysis of all ODs that have at least one known associated mutant gene (data from the Orphanet and OMIM databases) and showed that the relationship between ODs cannot be fully captured by the gene-based network alone. Integrating diverse biomedical and genomic data types can facilitate the synthesis of hypotheses about disease-causing mutant genes. Additionally, it can help address an important question: are there any candidate genes related to the known causal genes for a disease? A useful way to approach this question is to rank the genes in a test set based on their similarity to a reference or 'seed' set. Such a "guilt by association" ranking approach has become an important way to prioritize candidate disease genes, such as the candidates found in genome-wide association or linkage studies. The genes within a locus shown to be linked to a particular disease, for example, can be prioritized based on their similarities to a reference set of known genes for that disease. We and others have developed several computational approaches that perform this task automatically [4, 13–23].
Network-based analyses have been equally successful in the identification and prioritization of disease candidate genes [6, 7, 24–31], especially where the genes are relatively poorly annotated. Network-based candidate gene ranking approaches can be broadly grouped into two categories: parameter-based and parameter-free methods. The parameter-based methods, such as PageRank with Priors (PRP), Random Walk (RW), and PRIoritizatioN and Complex Elucidation (PRINCE), usually require auxiliary parameters that need to be trained using available data sets. PRP, for instance, needs a parameter β to control the probability of jumping back to the initial node, and the PRINCE algorithm uses a parameter to describe the relative importance of prior information. Since selecting optimal parameters can be a challenge, parameter-free approaches are preferred and considered more user-friendly. Additionally, most parameter-based approaches take into account the global information in the entire network, which often requires extensive computation. For example, in PRP, the scores of all the nodes need to be updated iteratively until they converge. This process typically becomes extremely slow and inefficient, especially when the network is large. The parameter-free methods (e.g. Interconnectedness, or ICN), on the other hand, measure the closeness of each candidate gene to known disease genes by taking into account the direct link and the shared neighbors between two genes, and tend to be less computationally intensive. The performance of parameter-free methods, however, usually falls short of that of parameter-based ranking approaches. Here, we report a novel network-based parameter-free framework for discovering and prioritizing candidate orphan disease genes. We specifically focus on two aspects: a) enhancing prioritization performance compared to current parameter-free methods and b) achieving performance comparable to the parameter-based ones.
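The iterative update that makes PRP comparatively expensive can be illustrated with a minimal sketch. It assumes the standard PageRank-with-priors recurrence, in which β is the probability of jumping back to the seed nodes; the function name, the toy graph, and the convergence tolerance are hypothetical and are not taken from the implementation used in this study.

```python
def pagerank_with_priors(adj, seeds, beta=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - beta) * W p + beta * prior until the scores settle.

    adj   : dict mapping node -> set of neighbours (undirected, no isolated nodes)
    seeds : nodes that share the prior probability mass equally
    beta  : back probability, i.e. chance of jumping back to a seed
    """
    nodes = list(adj)
    prior = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(prior)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # Each neighbour m passes an equal share of its score to n.
            inflow = sum(score[m] / len(adj[m]) for m in adj[n])
            new[n] = (1.0 - beta) * inflow + beta * prior[n]
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy undirected network (hypothetical node names).
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
scores = pagerank_with_priors(toy, seeds={"A"}, beta=0.3)
```

Every pass touches every node and edge, which is why the cost grows with network size, whereas a local similarity measure only inspects the neighbourhoods of the two genes being compared.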
We test, in a leave-one-out cross-validation setting, the utility of our approach in prioritizing genes for 172 ODs with at least five known causal genes (from the Orphanet database). We compare the performance of our method to two approaches, one each from the parameter-based and parameter-free categories. To demonstrate the utility of our approach, we rank the immediate neighbors of known OD genes as potential novel candidate genes. The immediate neighboring gene sets were compiled using (a) protein interactions; (b) functional linkage networks [32, 33]; and (c) literature co-citations.
Hypothesizing that genes connected to one or more known disease genes ("seed genes") are also likely to be implicated in the same disease, our goal is to find novel candidate genes with "strong" associations to the seed genes. Our proposed VS-based candidate gene ranking approach is based on the guilt-by-association principle. Two nodes or vertices are considered similar if their immediate neighbors in the network are themselves similar (common biological process, pathway, etc.). This principle is used to build a self-consistent matrix formulation of functional similarity that can be evaluated iteratively using only knowledge of the adjacency matrix of the network (based on functional annotations of genes). To this end, we consider the similarity between two vertices (genes) as a measure of their association strength in a network. Thus, two vertices with a high similarity are likely to be strongly related. To find the similarities between the seed genes and the candidate (test set) genes, we introduce a vertex similarity measurement in our algorithm. Vertex similarity, which defines the similarity of two vertices based on the structure of the network, has been used for information retrieval on the World Wide Web and in social network analysis. Similarity measurements, such as cosine similarity, have been successfully applied for computing similarity between documents that are described as vectors of keywords. However, to the best of our knowledge, there have been no reports of using it as a measure of similarity between two genes in a functional network or of using it to rank candidate disease genes.
where $\Gamma_A$ and $\Gamma_B$ represent the degrees (number of connections or edges a node has to other nodes) of nodes A and B, respectively, and $\sigma_{shared} = |\Gamma_A \cap \Gamma_B|$ represents the number of neighbor nodes shared by A and B.
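The quantities defined above (the two degrees and the number of shared neighbors) can be read directly off an adjacency structure. The sketch below computes them for a toy network. The normalisation in `cosine_similarity` is the cosine (Salton) index, used here only as an assumed stand-in to show how such quantities can combine into a score; it should not be read as the similarity formula of this study. All names are hypothetical.

```python
import math


def local_quantities(adj, a, b):
    """Return (degree of a, degree of b, number of shared neighbours)."""
    shared = adj[a] & adj[b]
    return len(adj[a]), len(adj[b]), len(shared)


def cosine_similarity(adj, a, b):
    """Assumed stand-in normalisation: sigma_shared / sqrt(deg_a * deg_b)."""
    deg_a, deg_b, sigma_shared = local_quantities(adj, a, b)
    if deg_a == 0 or deg_b == 0:
        return 0.0
    return sigma_shared / math.sqrt(deg_a * deg_b)


# Toy undirected network: A-B, A-C, A-D, B-C.
toy = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
deg_a, deg_b, sigma_shared = local_quantities(toy, "A", "B")
sim_ab = cosine_similarity(toy, "A", "B")
```

Here A and B have degrees 3 and 2 and share the single neighbor C, so the stand-in score is 1/sqrt(6).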
where $C_k$ is a node on the shortest path between A and B, and r is the discovery range that controls the maximum degree of separation (at most r hops). In other words, if the shortest path between nodes A and B is longer than r hops, or if no path connects them, Sim(A, B) equals 0.
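The role of the discovery range r can be sketched with a breadth-first search that stops expanding once r hops have been reached. The sketch returns the nodes on one shortest path (the intermediate entries correspond to the $C_k$ above) or `None`, in which case Sim(A, B) would be 0. How the similarities along the path are combined is not shown; the function name and toy chain are hypothetical.

```python
from collections import deque


def shortest_path_within(adj, source, target, r):
    """Return one shortest path source..target with at most r hops, else None."""
    if source == target:
        return [source]
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == r:
            continue  # do not expand beyond the discovery range
        for nxt in adj[node]:
            if nxt in parent:
                continue  # already reached by a path that is no longer
            parent[nxt] = node
            depth[nxt] = depth[node] + 1
            if nxt == target:
                # Walk the parent links back to the source.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Toy chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
within_three = shortest_path_within(chain, "A", "D", r=3)
within_two = shortest_path_within(chain, "A", "D", r=2)
```

With r = 3 the path A, B, C, D is found; with r = 2 the search gives up and the pair would receive a similarity of 0.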
where Sim(i, j) is the connection score between genes i and j. All candidate genes are then ranked based on these scores.
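As an illustration of the ranking step, the sketch below scores each candidate by aggregating its pairwise similarity to every seed gene and then sorts the candidates. Summation is an assumption made for this example, as is the shared-neighbor count used as the pairwise score; gene and node names are hypothetical.

```python
def rank_candidates(candidates, seeds, sim):
    """Rank candidates by the sum of sim(candidate, seed) over all seeds.

    Summation is an illustrative assumption, not a confirmed detail.
    """
    scores = {c: sum(sim(c, s) for s in seeds) for c in candidates}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy undirected network: seeds S1, S2; candidates G1, G2, G3.
adj = {
    "S1": {"X", "Y"},
    "S2": {"X"},
    "X": {"S1", "S2", "G1", "G2"},
    "Y": {"S1", "G1"},
    "G1": {"X", "Y"},
    "G2": {"X"},
    "G3": set(),
}


def shared_neighbours(a, b):
    """Toy pairwise score: number of neighbours common to a and b."""
    return len(adj[a] & adj[b])


ranking = rank_candidates(["G1", "G2", "G3"], ["S1", "S2"], shared_neighbours)
```

Any pairwise similarity function with the same two-argument signature can be passed in place of `shared_neighbours`.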
To compare the performance of our VS-based approach in candidate disease gene ranking, we selected two methods, one each from the parameter-based and parameter-free categories: PageRank with Priors (PRP) and Interconnectedness (ICN). Parts of the PRP implementation use the JUNG (Java Universal Network/Graph; jung.sourceforge.net) framework, as described earlier. To evaluate the performance of the VS-based approach and compare it with the two other methods, we used a leave-one-out cross-validation procedure. In each cross-validation trial, we removed a single OD causal gene ("target gene") from the data, and each of the three algorithms was evaluated by its success in ranking the "target gene" (see Methods for additional details).
When we increased the rank cut-off (k), the VS-based approach performed as well as PRP with a back probability of 0.3. Additionally, our VS-based approach performed better than ICN, another parameter-free method. We also note that VS outperformed PRP when the back probability was set to 0.05.
However, since biological networks tend to have low diameters, we believe that low values of the steps/hops are preferable. Interestingly, a previous study provided examples of two real data applications where the number of hops or steps between disease causal genes (m) was set to two and reported that m = 2 was preferable over m = 1. Since the edge information between two genes may be noisy or incomplete, we believe that our VS-based approach for novel candidate disease gene ranking is desirable, as it takes into account alternative measures of pairwise interconnectedness and is not limited to direct interactions or to having a shared neighbor node.
Table: Examples of orphan diseases and VS-ranked top 5 candidate genes
Orphan disease | No. of known causal genes | VS-ranked top 5 candidate genes
Cone rod dystrophy | | CRB1, RDH5, USH1C, EFEMP1, CABP4
Severe combined immunodeficiency | | CD3G, JAK1, ZAP70, IL2RB, IL4
Fanconi anemia | | HES1, SAMD3, CYP19A1, XRCC3, USP1
 | | PEX7, PHEX, ABCD2, ABCD1, ABCD3
Autosomal dominant Charcot-Marie-Tooth disease, type 2 | | STAT4, FAIM, MARCH5, STAT6, CRYGC
Gonadal dysgenesis | | ZFY, ZFX, PTCH2, SOX9, AMH
Hereditary nonpolyposis colon cancer | | MRC1, MSH3, CARKD, TRIT1, EXO1
Papillary or follicular thyroid carcinoma | | CORO2A, ZBTB33, KIF11, AAAS, SEH1L
 | | KCNE3, MINK1, KCNJ3, ALG10B, KCNJ9
Maturity-onset diabetes of the young (MODY syndrome) | | GCKR, IDDM7, MAFA, ST6GAL1, INSRL
Among other examples, HES1, the top-ranked gene for Fanconi anemia, is a novel interacting protein of the Fanconi anemia core complex, and cells depleted of HES1 exhibit a Fanconi anemia-like phenotype. The two top-ranked genes for gonadal dysgenesis, ZFX and ZFY, are known to function in sex differentiation, and Zfx mutant mice are reported to have fewer germ cells than wild-type mice. Likewise, maturity-onset diabetes of the young (MODY syndrome) is linked to kinetic alterations and regulation of glucokinase activity [46, 47], and in our ranking the glucokinase regulator gene (GCKR) is the top-ranked gene for MODY syndrome. Interestingly, a recent study of Japanese families proposes GCKR as a susceptibility gene for familial diabetes. While our ranking provides further support for the involvement of the top-ranked genes in the investigated ODs, it also suggests that top-scoring genes not previously associated with these ODs could be candidates for further research.
The vertex similarity (VS) method is a parameter-free approach for prioritizing candidate disease genes: it calculates the similarity between nodes rather than training parameters on data sets and updating them at every step. Through cross-validation experiments we show that VS outperforms ICN, another parameter-free method, and that it is comparable to parameter-based methods such as PRP. We demonstrate the utility of the VS-based parameter-free ranking approach in ranking OD candidate genes and, importantly, the top-ranked candidate genes for the ODs match the known literature, suggesting several novel causal relationships for further investigation.
Our approach, however, has some limitations. First, as with any training-set-dependent candidate gene ranking approach, we assume that the OD causal genes yet to be discovered will be consistent with what is already known about an OD and/or its genetic basis, which may not always be the case. This also means that our approach currently cannot be used to rank novel candidate OD genes if an OD lacks known causal genes. Similarly, even if an OD has known causal genes, VS cannot be used when no protein interactome data are available for them. An alternative in such cases would be to consider other types of networks (coexpression or functional networks). Second, it is important to note that the prioritization by our approach can only be as accurate as the current protein interactome data. Third, if a seed gene has only one known interaction, then that interactant will be ranked higher.
The ODs and causal gene information were downloaded from Orphanet. We merged some of the OD subtypes of a single disease based on their given disorder names, as described previously [5, 8]. From this, we selected 172 ODs that have at least five causal genes. The total number of genes across the 172 selected diseases was 1598. The human protein interactome used in this study was compiled from several resources [49–54], with both redundant interactions and self-loops removed.
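Removing redundant interactions and self-loops when merging several interaction resources can be sketched as follows. The sketch assumes that interactions are undirected pairs of protein identifiers; the identifiers and function names are hypothetical.

```python
def clean_interactions(pairs):
    """Drop self-loops and collapse duplicate undirected interactions."""
    edges = set()
    for a, b in pairs:
        if a == b:
            continue  # self-loop
        # Sorting makes (a, b) and (b, a) the same undirected edge.
        edges.add(tuple(sorted((a, b))))
    return edges


def build_adjacency(edges):
    """Turn a set of undirected edges into a node -> neighbour-set mapping."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Pairs pooled from several hypothetical resources.
raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P2", "P3"), ("P1", "P2")]
edges = clean_interactions(raw)
network = build_adjacency(edges)
```

Storing each edge in a canonical order is one simple way to make duplicates from different resources collapse automatically.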
We performed a leave-one-out cross-validation using the 172 ODs and the 1312 OD-causing genes that are present in the PPI network. We used the human protein interaction network as the global network to evaluate the prioritization performance of VS and the other two methods. The human protein interactome used in our study contains protein-protein interactions from large-scale yeast two-hybrid experiments [49, 50], computational predictions, and curation of the literature [52–54], with both redundant interactions and self-loops removed. The assembled PPI network consists of 11,765 proteins and 69,167 interactions. During each validation trial, one seed gene ("target gene") from one of the selected 172 ODs was picked out and mixed with 99 random genes from the PPI network to form a test set of 100 candidate genes. The remaining seed genes of the OD were used as the training set. The test set genes were then prioritized using the three approaches: PRP (with back probabilities 0.3 and 0.05), ICN, and the VS-based approach. During each run, the rank of the "target gene" was noted. We evaluated the performance of each algorithm in terms of the success rate versus the rank cut-off (k). If the "target gene" is ranked among the top k in a particular validation run, the run is considered a 'success'. The validation runs are repeated until all the seed genes have been used as the target gene and their ranks have been obtained. The "success rate" is defined as the ratio of successful validation runs to the total number of validation runs for all the existing OD genes from the 172 ODs. The same strategy was followed for all three algorithms. In the case of PRP, which is a parameter-based method, we selected a back probability of 0.3 since we have shown previously that the performance of PRP in ranking candidate disease genes was best at p = 0.3.
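The validation protocol described above can be summarised in a short sketch. It assumes that the 99 random genes are drawn from network genes that are not known causal genes of the same disease, which the text does not specify; the function names, the toy data, and the two toy rankers are hypothetical and exist only to exercise the procedure.

```python
import random


def loocv_success_rate(disease_genes, network_genes, rank_fn, k, n_random=99, seed=0):
    """Fraction of leave-one-out runs in which the target lands in the top k.

    disease_genes : dict mapping disease -> list of known causal genes
    network_genes : iterable of all genes in the network
    rank_fn       : callable(test_set, training_set) -> genes, best first
    """
    rng = random.Random(seed)
    successes = trials = 0
    for genes in disease_genes.values():
        # Assumption: random genes exclude the disease's own causal genes.
        pool = sorted(g for g in network_genes if g not in genes)
        for target in genes:
            training = [g for g in genes if g != target]
            test_set = rng.sample(pool, n_random) + [target]
            ranked = rank_fn(test_set, training)
            if ranked.index(target) < k:
                successes += 1
            trials += 1
    return successes / trials


# Toy data: one disease with five causal genes in a 205-gene network.
diseases = {"OD1": ["d1", "d2", "d3", "d4", "d5"]}
all_genes = [f"g{i}" for i in range(200)] + diseases["OD1"]


def ideal_ranker(test_set, training):
    """Toy ranker that always places 'd'-prefixed genes first."""
    return sorted(test_set, key=lambda g: (not g.startswith("d"), g))


def worst_ranker(test_set, training):
    """Toy ranker that always places 'd'-prefixed genes last."""
    return sorted(test_set, key=lambda g: (g.startswith("d"), g))


best_rate = loocv_success_rate(diseases, all_genes, ideal_ranker, k=1)
worst_rate = loocv_success_rate(diseases, all_genes, worst_ranker, k=10)
```

Evaluating the same function over a range of k values yields the success rate versus rank cut-off curve used to compare the methods.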
To identify and rank novel OD candidate genes, we used the immediate neighbors of known OD genes as the test set. The immediate neighboring genes of the selected ODs' causal genes were compiled based on (a) protein interactions; (b) functional linkage networks [32, 33]; and (c) literature co-citations. The protein interactome data, as described earlier, were compiled from several resources [49–54]. The functional linkage network-based candidate gene sets were derived from two resources: (i) HumanNet, a probabilistic functional gene network of Homo sapiens, and (ii) a functional protein interaction network built upon expert-curated pathways. The test set genes based on literature co-citations were compiled using the OMIM database. Briefly, for the selected ODs, we identified the corresponding OMIM records, which summarize results from publications about gene-disease relationships. For the OD-mapped OMIM records, we first extracted the cited literature (links to PubMed records for the references cited in an OMIM entry). Using these OD-related PubMed records, we extracted the related genes from the 'gene2pubmed' file from NCBI. For a given OD with known causal genes, we pooled all neighboring genes (immediate neighbors or direct interactants) of the causal genes from the different sources and used them as a test set for ranking in the global protein interactome using the VS-based approach.
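The literature co-citation step can be sketched as a filter over the 'gene2pubmed' file. The sketch assumes the tab-delimited layout with the columns tax_id, GeneID, and PubMed_ID, and restricts the result to human entries (taxonomy ID 9606). The function name and the sample rows, including all identifiers in them, are made up for illustration.

```python
import io


def genes_cited_in(lines, pmids, tax_id="9606"):
    """Collect GeneIDs linked to any of the given PubMed IDs for one species."""
    genes = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if tax == tax_id and pmid in pmids:
            genes.add(gene_id)
    return genes


# Made-up rows in the assumed gene2pubmed layout.
sample = io.StringIO(
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t101\t11111\n"
    "9606\t102\t22222\n"
    "10090\t103\t11111\n"
    "9606\t104\t11111\n"
)
related = genes_cited_in(sample, pmids={"11111"})
```

In practice the `pmids` set would hold the PubMed IDs cited in the OMIM records mapped to an OD, and an open file handle would replace the in-memory sample.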
This work was supported in part by Cincinnati Digestive Health Sciences Center (Public Health Service Grant P30 DK078392) and Cincinnati Children's Hospital Medical Center.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.