 Research article
 Open Access
DomainRBF: a Bayesian regression approach to the prioritization of candidate domains for complex diseases
BMC Systems Biology volume 5, Article number: 55 (2011)
Abstract
Background
Domains are basic units of proteins, and thus exploring associations between protein domains and human inherited diseases will greatly improve our understanding of the pathogenesis of human complex diseases and further benefit the medical prevention, diagnosis and treatment of these diseases. Within a given domain-domain interaction network, we make the assumption that similarities of disease phenotypes can be explained using proximities of domains associated with such diseases. Based on this assumption, we propose a Bayesian regression approach named "domainRBF" (domain Rank with Bayes Factor) to prioritize candidate domains for human complex diseases.
Results
Using a compiled dataset containing 1,614 associations between 671 domains and 1,145 disease phenotypes, we demonstrate the effectiveness of the proposed approach through three large-scale leave-one-out cross-validation experiments (random control, simulated linkage interval, and genome-wide scan), evaluated in terms of three criteria (precision, mean rank ratio, and AUC score). We further show through a series of permutation tests that the proposed approach is robust to the parameters involved and to the underlying domain-domain interaction network. Having assessed the validity of this approach, we show the possibility of ab initio inference of domain-disease associations and gene-disease associations, and we illustrate the strong agreement between our inferences and the evidence from genome-wide association studies for four common diseases (type 1 diabetes, type 2 diabetes, Crohn's disease, and breast cancer). Finally, we provide a precalculated genome-wide landscape of associations between 5,490 protein domains and 5,080 human diseases and offer free access to this resource.
Conclusions
The proposed approach effectively ranks susceptible domains among the top of the candidates, and it is robust to the parameters involved. The ab initio inference of domain-disease associations shows strong agreement with the evidence provided by genome-wide association studies. The predicted landscape provides a comprehensive understanding of associations between domains and human diseases.
Background
Over the past few decades, remarkable success has been achieved for such traditional gene-mapping approaches as family-based linkage analysis [1, 2] and population-based association studies [3, 4] in pinpointing genes that are responsible for human inherited diseases [5, 6]. Nevertheless, these traditional methods are either only capable of linking diseases with genetic regions that typically contain dozens to hundreds of genes, or usually require carefully selected candidate genes that are biologically related to the disease under investigation [5, 6]. Consequently, the development of computational methods for the inference of genes and their protein products that are truly responsible for the disease of interest has been one of the major tasks in human genetics and functional genomics [7–18]. Particularly, a protein typically consists of several structural domains, each of which is closely related to a specific function of the protein. Therefore, the inference of causative genes could be aided by first dividing products of candidate genes into discrete domains with known functions and structural features and then inferring the association of these domains with the disease of interest [19–21]. Following this direction, a new protein domain called PAAD has been discovered to be associated with apoptosis, cancer, and autoimmune diseases [19], and a novel domain called G8 has been reported to be linked to polycystic kidney disease and nonsyndromic hearing loss [20]. However, most of these discoveries have thus far been made with the assistance of protein sequence analysis and other experimental techniques. Even though such findings are significant, the associations reported are still very sporadic. Therefore, it would be helpful to develop computational methods to directly infer possible associations between domains and human diseases.
It has been shown that deleterious nonsynonymous single nucleotide polymorphisms (nsSNPs) that are responsible for a specific disease of interest may change structures of some protein domains, affect functions of corresponding proteins, and further result in the disease under investigation. Therefore, existing associations between domains and diseases can be constructed by bridging protein domains that contain known deleterious nsSNPs and human diseases with which the nsSNPs are associated [22]. Furthermore, recent advances in computational functional genomics have enabled the large-scale prediction of domain-domain interactions and have led to repositories of known and predicted domain-domain interactions such as DOMINE [23] and InterDom [24, 25]. Accordingly, large-scale inference of unknown associations between domains and human diseases can be performed by using these data sources. For example, one of our previous studies [22] adopted the "guilt-by-association" principle [26] to compute scores that quantify the strength of associations between a query disease and candidate domains from domain-domain interaction data and known associations between the query disease and other domains, and then ranked candidate domains according to their scores. However, the scope of application of this approach is limited because the "guilt-by-association" principle relies on known associations between the query disease and domains to infer novel associations for the query disease. Under these conditions, the method cannot be applied to diseases whose genetic bases are completely unknown.
Recent studies on the modular nature of human genetic diseases have shown that diseases sharing common clinical characteristics are often caused by functionally related genes [16, 27]. With the application of text mining techniques, it has also been possible to calculate pairwise similarities for most human disease phenotypes [28]. With these advances, various methods have been proposed to prioritize candidate genes through the combined use of disease phenotype similarity and gene proximity [7, 8, 29–32]. Inspired by the successes of these methods, we propose in this paper to infer associations between domains and human disease phenotypes based on the assumption that phenotypically similar diseases are caused by functionally related domains. More specifically, we resort to a linear regression framework to model the relationship between a domain proximity profile and a phenotype similarity profile, and we develop a Bayesian regression approach, called domainRBF (domain Ranking with Bayes Factor), to calculate Bayes factors that quantify the strength of associations between corresponding domain proximity profiles and phenotype similarity profiles.
We compile a set of known domain-disease associations using the Pfam database [33] and annotations of nsSNPs in the UniProt database [34, 35], extract a domain-domain interaction network from the DOMINE database [23] as well as the InterDom database [24, 25], and then download a precalculated phenotype similarity network [28]. Using these data, we show that domain proximities calculated from a domain-domain interaction network do, indeed, imply phenotype similarities of diseases. We next validate the approach and evaluate its performance using three criteria: precision, mean rank ratio, and AUC score. To accomplish this, we apply three large-scale leave-one-out cross-validation experiments against random control, simulated linkage interval, and genome-wide scan with two domain proximity measures: diffusion kernel and shortest path with Gaussian kernel. Results show that the proposed approach can successfully recover known associations between domains and human diseases. We further show the robustness of this approach to the parameters involved and the underlying domain-domain interaction network through a series of permutation tests. Having assessed the validity and robustness of this approach, we then infer domain-disease associations in an ab initio way and illustrate the strong agreement of the inference results with evidence from genome-wide association studies for four common human diseases: type 1 diabetes, type 2 diabetes, Crohn's disease, and breast cancer. We further demonstrate the possibility of inferring gene-disease associations from domain-disease associations. Finally, we calculate a genome-wide landscape of associations between 5,490 domains and 5,080 human diseases using all known domain-disease associations, and we provide a freely accessible website for this resource.
Methods
Overview of the DomainRBF approach
We ground the inference of domains that are associated with human inherited diseases on a set of known domain-disease associations that are compiled from the Pfam database [33] and annotations of nsSNPs in the UniProt database [34, 35], a domain-domain interaction network extracted from the DOMINE database [23] and the InterDom database [24, 25], as well as a precalculated phenotype similarity network containing pairwise similarity scores among more than 5,000 human genetic disease phenotypes in the OMIM database [28].
Based on the assumption that phenotypically similar diseases are caused by functionally related domains, we propose a linear regression framework to model the relationship between a domain proximity profile and a phenotype similarity profile, and we resort to a Bayesian approach to solve the linear regression model. As shown in Figure 1 (inspired by Ideker and Sharan [36]), given a query phenotype p and the precalculated pairwise similarity scores between phenotypes, we extract scores between the query phenotype and all other phenotypes that have at least one associated domain and obtain a phenotype similarity profile for the query phenotype. On the other hand, for a query domain d in a set of candidate domains, we resort to the domain-domain interaction network to calculate proximity scores of the query domain to all domains that are known to be associated with some phenotypes and further calculate a domain proximity profile. With these two profiles, we propose a Bayesian regression approach called domainRBF (domain Ranking with Bayes Factor) to calculate a Bayes factor that quantifies the strength of association between the query domain and the query phenotype, using the phenotype similarity profile as the response variable and the domain proximity profile as the predictor variable. Finally, we rank candidate domains according to their corresponding Bayes factors and obtain a ranked list of the candidates.
Data sources
Domain-disease associations
A domain is defined as associated with a disease if the domain contains at least one nonsynonymous single nucleotide polymorphism (nsSNP) associated with the disease [22]. Therefore, associations between domains and diseases are obtained by combining known associations between nsSNPs and diseases with relationships between proteins and domains.
Known associations between nsSNPs and diseases are obtained from annotations of nsSNPs in the UniProt database [34, 35], in which nsSNPs are classified into three categories: disease, polymorphism, and unclassified. In version 57.15 (released on March 2, 2010) of this database, 23,372 nsSNPs belong to the disease category, 36,303 belong to the polymorphism category, and the remaining 2,019 nsSNPs are currently unclassified. For each of the nsSNPs in the disease category, the entry ID of the specific disease in the OMIM database is also provided. Consequently, we obtain 19,552 associations between 19,552 nsSNPs and 1,592 diseases.
Relationships between human proteins and domains are obtained from the Pfam database [33], which provides a large collection of both high-quality protein domain families (Pfam-A) and low-quality protein domain families (Pfam-B). In version 24.0 of the Pfam-A collection (released in October 2009), 11,912 domain families that cover more than 75.15% of known proteins are collected. Using this data source, we obtain 96,276 relationships between 4,324 domains and 66,498 human proteins.
Using the above data sources and having defined a domain as associated with a disease if the domain contains at least one nsSNP associated with the disease, we are able to establish 1,614 associations between 671 domains and 1,145 diseases.
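The combination step described above can be sketched as a simple join. The sketch below is illustrative only: it assumes that "contains" means the nsSNP position falls inside the domain's residue range on the same protein, and all identifiers (`PROT_A`, `DOM_1`, `DISEASE_X`, and so on) are hypothetical placeholders rather than real UniProt, Pfam, or OMIM entries.

```python
def domain_disease_associations(nssnps, domain_regions):
    """Join nsSNP-disease annotations with protein-domain regions.

    nssnps: iterable of (protein, residue_position, disease)
    domain_regions: iterable of (protein, domain, start, end), inclusive range
    Returns the set of (domain, disease) pairs for which at least one
    disease nsSNP lies inside the domain (assumed containment rule).
    """
    regions_by_protein = {}
    for protein, domain, start, end in domain_regions:
        regions_by_protein.setdefault(protein, []).append((domain, start, end))

    associations = set()
    for protein, position, disease in nssnps:
        for domain, start, end in regions_by_protein.get(protein, []):
            if start <= position <= end:
                associations.add((domain, disease))
    return associations


# Hypothetical toy input, for illustration only.
toy_nssnps = [
    ("PROT_A", 42, "DISEASE_X"),   # inside DOM_1
    ("PROT_A", 300, "DISEASE_Y"),  # outside every domain of PROT_A
    ("PROT_B", 15, "DISEASE_X"),   # inside DOM_2
]
toy_regions = [
    ("PROT_A", "DOM_1", 10, 120),
    ("PROT_B", "DOM_2", 1, 80),
]
toy_associations = domain_disease_associations(toy_nssnps, toy_regions)
```

A set is used so that several nsSNPs in the same domain for the same disease count as a single association, matching the "at least one nsSNP" definition.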
Domain-domain interaction networks
Our inference of domain-disease associations is based on domain-domain interaction networks extracted from DOMINE [23] and InterDom [24, 25], two of the most widely used databases of known and predicted domain-domain interactions.
The latest version of DOMINE (released in February 2008) contains a total of 20,513 domain-domain interactions, out of which 4,349 (gold-standard positives) are inferred from PDB entries (the union of the sets of interactions from iPfam [37] and 3did [38, 39]), and 17,781 are predicted by at least one of 8 different computational approaches using Pfam domain definitions. Of the 17,781 predicted interactions, there are 3,143 high-confidence predictions (predicted by ME [40] or at least two different approaches), 729 medium-confidence predictions (hetero-domain interactions in which both domains have the same annotations in the biological process of the gene ontology), and 13,909 remaining low-confidence predictions [23].
The latest version of InterDom (released on July 31, 2007) contains a total of 148,938 domain-domain interactions, out of which 7,718 are inferred from PDB entries [41], 143,820 are inferred from BIND [42] and DIP [43] entries, and 4,631 are inferred from the domain fusion hypothesis. InterDom further uses a probabilistic scoring system to give confidence scores to domain interactions that are derived independently by multiple methods from different data sources. Finally, interactions with 90%, 75%, 50%, and 25% confidence levels are provided [24, 25].
In our work, we use two domain-domain interaction networks extracted from these data. First, we discard singletons in the PDB part of the DOMINE database [23] and obtain a small network that is composed of 2,285 interactions between 1,971 domains (2.32 interactions per domain on average). Second, we combine 37,177 interactions whose confidence scores are at least 90% in the InterDom database and all interactions in the DOMINE database to obtain a large domain-domain interaction network that is composed of 48,778 interactions between 5,490 domains (17.77 interactions per domain on average).
The phenotype similarity network
The phenotype similarity network of human diseases is a fully connected network obtained from an earlier work of van Driel et al. [28], in which the pairwise relationships between 5,080 human genetic diseases from the OMIM database are mapped. Briefly, van Driel et al. use the anatomy (A) and the disease (C) sections of the medical subject headings vocabulary (MeSH) to extract terms from the OMIM database, thus providing a standard way of presenting the OMIM records as corresponding phenotype feature vectors. As a result, each disease phenotype is characterized by a vector of standardized and weighted phenotypic feature terms mapped from corresponding OMIM records in the full text (TX) and clinical synopsis (CS) fields. Then, for each pair of disease phenotypes, a similarity score is calculated by the cosine of their feature vector angle. The reliability of the phenotype similarity score has been tested [28], showing that these similarities are positively correlated with a number of measures of gene functions. The final phenotype similarity network contains pairwise similarity scores for 5,080 OMIM records, covering a majority of recorded human disease phenotypes.
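As a small illustration of the similarity score described above, the cosine of the angle between two feature vectors can be computed as follows. This is only a sketch: the vectors are hypothetical stand-ins for weighted MeSH-term vectors, and the term weighting and standardization used by van Driel et al. [28] are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an empty feature vector
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors for three phenotypes.
phenotype_a = [0.8, 0.0, 0.6]
phenotype_b = [0.4, 0.0, 0.3]  # same direction as phenotype_a
phenotype_c = [0.0, 1.0, 0.0]  # shares no terms with phenotype_a
```

Because the score depends only on the direction of the vectors, `phenotype_a` and `phenotype_b` are maximally similar even though their weights differ in magnitude.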
The DomainRBF model
Given the phenotype similarity network, we use y_{pp'} to denote the similarity score between a query disease phenotype p and another disease phenotype p'. We further define the phenotype similarity profile for disease phenotype p as y_{p} = (y_{pp_{1}}, y_{pp_{2}}, ..., y_{pp_{m}})^{T}, i.e., the similarities between the disease phenotype p and all m disease phenotypes p_{1}, p_{2}, ..., p_{m} that have at least one associated domain.
On the other hand, given a domain-domain interaction network of n nodes, we calculate the proximity between two domains using two measures: (1) shortest path with Gaussian kernel (SG) and (2) diffusion kernel (DK). The shortest path distance between two domains u and v, SP(u,v), is defined as the length of the shortest path between the two domains. Using the Gaussian kernel, the proximity measure SG(u,v) is obtained as SG(u,v) = exp{−β(SP(u,v))^{2}}, where β is a free parameter. The diffusion kernel for the network is defined as K = (k_{uv})_{n×n} = e^{−γL}, where 0 < γ < 1 is a free parameter that controls the magnitude of diffusion. The matrix L = D − A is the Laplacian of the network, where D is a diagonal matrix containing node degrees, and A is the adjacency matrix of the domain-domain interaction network. With the diffusion kernel K = (k_{uv})_{n×n}, we define the diffusion proximity of two domains u and v as DK(u,v) = k_{uv}, i.e., the corresponding element in the diffusion kernel. Then, let x_{dd'} denote the proximity between domains d and d' in the domain-domain interaction network, and let D(p) denote the set of domains known to be associated with a phenotype p. We define the proximity of domain d to disease phenotype p as the summation of proximity scores between domain d and all domains known to be associated with disease phenotype p, i.e., x_{dp} = ∑_{d'∈D(p)} x_{dd'}. We further define the domain proximity profile for domain d as x_{d} = (x_{dp_{1}}, x_{dp_{2}}, ..., x_{dp_{m}})^{T}.
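The two proximity measures and the domain-to-phenotype summation can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the network is assumed to be undirected and unweighted with a symmetric 0/1 adjacency matrix, unreachable pairs are given SG proximity 0, and the matrix exponential is computed through the eigendecomposition of the symmetric Laplacian. The function names and the toy network are hypothetical.

```python
from collections import deque

import numpy as np


def shortest_path_lengths(adj):
    """All-pairs shortest path lengths SP(u, v) by BFS; unreachable pairs get inf."""
    n = adj.shape[0]
    dist = np.full((n, n), np.inf)
    for source in range(n):
        dist[source, source] = 0.0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(adj[u]):
                if np.isinf(dist[source, v]):
                    dist[source, v] = dist[source, u] + 1.0
                    queue.append(v)
    return dist


def sg_proximity(adj, beta=1.0):
    """SG(u, v) = exp(-beta * SP(u, v)^2); assumes beta > 0."""
    return np.exp(-beta * shortest_path_lengths(adj) ** 2)


def diffusion_kernel(adj, gamma=0.05):
    """K = exp(-gamma * L) with L = D - A, via eigendecomposition of symmetric L."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-gamma * eigvals)) @ eigvecs.T


def domain_phenotype_proximity(prox, d, associated):
    """x_dp: sum of proximities between domain d and the domains in D(p)."""
    return float(prox[d, list(associated)].sum())


# Toy path network 0 - 1 - 2 (hypothetical, for illustration only).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
sg = sg_proximity(adj, beta=1.0)
dk = diffusion_kernel(adj, gamma=0.05)
x_dp = domain_phenotype_proximity(sg, d=0, associated=[1, 2])
```

Repeating `domain_phenotype_proximity` over the m phenotypes p_{1}, ..., p_{m} yields the domain proximity profile x_{d}. Note that the summation as defined in the text includes d itself whenever d belongs to D(p); in a leave-one-out setting the held-out association would have to be removed from D(p) first.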
Then, given a query disease phenotype p and a query domain d, we explain the phenotype similarity profile y_{p} using the domain proximity profile x_{d} via a linear regression model
y = Xβ + ε,
where y = y_{p} is the response vector, X = (1, x_{d}) the design matrix, β = (β_{0}, β_{1})^{T} the coefficient vector, and ε = (ε_{1}, ..., ε_{m})^{T} the residual vector. Note that the first column of the design matrix consists of 1s in order to incorporate the intercept. We propose to solve this linear regression model using a Bayesian approach. We choose a Bayesian approach because it provides a natural way to account for the uncertainty in estimated parameters, and because it provides the Bayes factor, a measure of the strength of evidence for an association, which is defined as the ratio of the marginal likelihoods for y conditional on X under the alternative and the null hypothesis, respectively, as described below.
For the alternative model, we assume that y conditional on X follows a normal distribution,
y | X, β, σ^{2} ~ N(Xβ, σ^{2}I),
with residuals independent and identically distributed, following a normal density with mean 0 and variance σ^{2} (I is the identity matrix). We set conjugate prior distributions for β and σ^{2}, as
β | σ^{2} ~ N(μ_{0}, σ^{2}Σ_{0})
and
σ^{2} ~ Inv-Gamma(n_{0}/2, n_{0}σ_{0}^{2}/2),
where μ_{0} = (μ_{0}, μ_{1})^{T} is composed of prior means, and σ^{2}Σ_{0} of prior variances, with Σ_{0} = diag(σ_{μ}^{2}, σ_{1}^{2}) being a diagonal matrix. The joint distribution of all random quantities y, β, and σ^{2} is then given as
p(y, β, σ^{2} | X) = p(y | X, β, σ^{2}) p(β | σ^{2}) p(σ^{2}).
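Integrating out β and σ^{2}, we obtain the marginal likelihood of y given X as
p_{1}(y|X) = [Γ(n_{n}/2) / Γ(n_{0}/2)] · [|Σ_{n}|^{1/2} / |Σ_{0}|^{1/2}] · [(n_{0}σ_{0}^{2})^{n_{0}/2} / (n_{n}σ_{n}^{2})^{n_{n}/2}] · π^{−n/2},
where n denotes the length of y (n = m here), n_{n} = n + n_{0}, and n_{n}σ_{n}^{2} = n_{0}σ_{0}^{2} + y^{T}y + μ_{0}^{T}Σ_{0}^{−1}μ_{0} − μ_{n}^{T}Σ_{n}^{−1}μ_{n}, with Σ_{n} = (X^{T}X + Σ_{0}^{−1})^{−1} and μ_{n} = Σ_{n}(X^{T}y + Σ_{0}^{−1}μ_{0}).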
On the other hand, for the null model, where y is independent of X, the marginal likelihood of y can be derived in a similar way, as
where , and .
Then, the Bayes factor BF is the ratio of p_{1}(y|X) and p_{0}(y), as
BF = p_{1}(y|X) / p_{0}(y).
Following the literature [44], we take the limit +∞ for σ_{ μ }^{2} and 0 for both n_{0} and σ_{0}^{2}, and we obtain the limit value of the Bayes factor as
For simplicity, we further set μ_{0} = 0 as in the literature [44], and we set σ_{1}^{2} = 1 as the default setting in this paper, although the effects of these parameters are also studied.
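To make the limiting Bayes factor concrete, the sketch below evaluates a closed form derived here from the definitions of Σ_{n} and μ_{n} given above. It is a reconstruction under explicit assumptions and is not quoted from the paper: the null model is assumed to be the intercept-only regression with the same priors on β_{0} and σ^{2}, the limits σ_{μ}^{2} → +∞, n_{0} → 0, σ_{0}^{2} → 0 are taken, and μ_{0} = 0. Under these assumptions Σ_{0}^{−1} becomes diag(0, 1/σ_{1}^{2}) and

BF = (m^{1/2} |Σ_{n}|^{1/2} / σ_{1}) · [(y^{T}y − μ_{n}^{T}Σ_{n}^{−1}μ_{n}) / (y^{T}y − m ȳ^{2})]^{−m/2},

where ȳ is the mean of y. The function name and the toy data are hypothetical; the logarithm is returned for numerical stability.

```python
import numpy as np


def log_bayes_factor(y, x, sigma1_sq=1.0):
    """Log of the limiting Bayes factor for y ~ intercept + x (assumed form).

    Assumes a flat prior on the intercept, prior mean 0 and prior variance
    sigma^2 * sigma1_sq on the slope, and an intercept-only null model.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    m = y.size
    design = np.column_stack([np.ones(m), x])            # X = (1, x_d)
    prior_precision = np.diag([0.0, 1.0 / sigma1_sq])    # limit of Sigma_0^{-1}
    precision_n = design.T @ design + prior_precision    # Sigma_n^{-1}
    mu_n = np.linalg.solve(precision_n, design.T @ y)    # posterior mean
    resid_alt = y @ y - mu_n @ precision_n @ mu_n
    resid_null = y @ y - m * y.mean() ** 2
    _, logdet_precision = np.linalg.slogdet(precision_n)  # = -log|Sigma_n|
    return (0.5 * np.log(m) - 0.5 * logdet_precision - 0.5 * np.log(sigma1_sq)
            - 0.5 * m * (np.log(resid_alt) - np.log(resid_null)))


# Hypothetical toy profiles: one response tracks x, the other does not.
x = np.arange(20.0)
wiggle = np.where(np.arange(20) % 2 == 0, 1.0, -1.0)
y_related = 2.0 * x + 0.5 * wiggle
y_unrelated = wiggle
log_bf_related = log_bayes_factor(y_related, x)
log_bf_unrelated = log_bayes_factor(y_unrelated, x)
```

Candidate domains would then be ranked by decreasing Bayes factor (equivalently, decreasing log Bayes factor) for a fixed query phenotype. A log Bayes factor above 0 corresponds to BF > 1, i.e., evidence in favour of an association.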
Note that before the construction of the Bayesian regression relationship between y_{p} and x_{d}, we apply an inverse-normal transform to y_{p} to guarantee that the response variable is normally distributed. As illustrated in [45, 46], the transform formula we use is:
where r_{i} is the rank of the i-th element in the vector y_{p}, m the length of y_{p}, and Φ the cumulative distribution function of the standard normal distribution.
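A rank-based inverse-normal transform of this kind can be sketched with the Python standard library. The exact offset in the formula is not recoverable from the text here, so the sketch assumes the common form Φ^{−1}((r_{i} − 0.5)/m) and exposes the offset as a parameter; ties are broken by position, which is also an assumption.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map each value to a standard normal quantile of its rank.

    Uses Phi^{-1}((r_i - offset) / m), with r_i the 1-based ascending rank
    and m the number of values. The default offset of 0.5 is an assumption.
    """
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i])
    ranks = [0] * m
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [standard_normal.inv_cdf((r - offset) / m) for r in ranks]


# Hypothetical similarity scores, for illustration only.
transformed = inverse_normal_transform([0.9, 0.1, 0.5])
```

The transform preserves the ordering of the similarity scores while forcing their empirical distribution to be approximately standard normal, which is what the normal likelihood in the regression model assumes.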
Validation methods and evaluation criteria
On the basis of the domain-domain interaction network and known associations between protein domains and disease phenotypes, we proceed to validate how well the proposed approach performs in recovering these known associations. We adopt three large-scale leave-one-out cross-validation experiments for this purpose.
First, in the validation of random controls, we prioritize domains that are known to be associated with disease phenotypes (i.e., disease domains) against randomly selected control domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against a set of 99 randomly selected control domains.
Second, in the validation of simulated linkage intervals, we prioritize domains that are known to be associated with disease phenotypes (i.e., seed domains) against domains that are located around the seed domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against a set of control domains that are located within 10 Mbp upstream and downstream of this domain.
Third, in the validation of genome-wide scan, we prioritize seed domains against all known domains. Specifically, in each run of the validation, we select an association between a domain and a disease phenotype, assume that the association is unknown, and prioritize the domain against all other domains in the domain-domain interaction network.
In each of the above leave-one-out cross-validation experiments, we repeat the validation run for every known association between a domain and a disease phenotype and thereby obtain a number of ranking lists. We further normalize the ranks by dividing them by the total number of candidate domains in the ranking list to obtain rank ratios, and we calculate the values of three criteria to measure the performance of a prioritization method.
The first criterion is termed precision. We consider a prediction as successful if the known disease domain is ranked at the top (with rank 1). Then, the proportion of successful predictions among all predictions is defined as the precision. A high precision suggests that a method has high prediction power. The second criterion is termed mean rank ratio, which is simply the average of rank ratios for all known disease domains in a cross-validation experiment. This criterion provides a summary of the ranks of all domains that are known to be associated with disease phenotypes, and the smaller the mean rank ratio, the better the method. The third criterion is termed AUC, which is the area under the receiver operating characteristic (ROC) curve. Given a list of rank ratios and a predefined threshold, we define the sensitivity as the percentage of disease domains that are ranked above the threshold and the specificity as the percentage of control domains that are ranked below the threshold. By varying the threshold values, we are able to plot a ROC curve, which shows the relationship between sensitivity and 1 − specificity. Calculating the area under the ROC curve, we obtain the AUC score, which provides an overall measure for the performance of the prioritization approach.
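The three criteria can be computed from the outcome of the validation runs as sketched below. The sketch assumes that each run is summarized by the rank of the held-out disease domain and the number of candidates, that control domains occupy all remaining ranks, and that the area under the ROC curve is obtained with the trapezoidal rule; the function name and toy runs are hypothetical.

```python
def evaluate(runs):
    """Compute (precision, mean rank ratio, AUC) from validation runs.

    runs: list of (rank_of_disease_domain, n_candidates), rank 1 = best.
    """
    precision = sum(1 for rank, _ in runs if rank == 1) / len(runs)

    disease = [rank / n for rank, n in runs]
    mean_rank_ratio = sum(disease) / len(disease)

    # Control domains take every rank not held by the disease domain.
    control = [k / n for rank, n in runs for k in range(1, n + 1) if k != rank]

    # Sweep the threshold over all observed rank ratios to trace the ROC curve.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(disease + control)):
        sensitivity = sum(r <= threshold for r in disease) / len(disease)
        false_positive_rate = sum(r <= threshold for r in control) / len(control)
        points.append((false_positive_rate, sensitivity))

    # Trapezoidal area under the (1 - specificity, sensitivity) curve.
    auc = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return precision, mean_rank_ratio, auc


# Hypothetical outcome of four runs with four candidates each.
toy_runs = [(1, 4), (2, 4), (1, 4), (4, 4)]
toy_precision, toy_mean_rank_ratio, toy_auc = evaluate(toy_runs)
```

In the toy example, two of the four held-out domains are ranked first, so the precision is 0.5, and the mean rank ratio is (0.25 + 0.5 + 0.25 + 1.0) / 4 = 0.5.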
Results
Domain proximity implying phenotype similarity
The DomainRBF approach is based on the assumption that similarities of disease phenotypes can be explained by proximities of domains associated with the phenotypes within a domain-domain interaction network via a regression model. In order to validate this assumption, we discard singletons in the PDB part of the DOMINE database [23] and obtain a domain-domain interaction network that is composed of 2,285 interactions between 1,971 domains. Focusing on these domains, we obtain 1,066 associations between 763 phenotypes and 378 domains. Then, we calculate a Bayes factor for each of these associations, and run a Wilcoxon signed rank test to check whether the resulting Bayes factors are significantly greater than 1 (the random case). Results show that the p-value is smaller than 2.2 × 10^{−16}, indicating that the similarities of disease phenotypes have a strong relationship with the proximities of associated domains.
To further substantiate this point, we perform a series of permutations of disease-disease, domain-disease, and domain-domain relationships. First, we break the disease-disease relationship by permuting the phenotype similarity profile. Second, we break the domain-disease relationship by two methods: (1) permuting domain-disease associations and (2) replacing domains in known disease-domain associations with randomly selected domains. Third, we break the domain-domain relationship by permuting connections in the underlying domain-domain interaction network, while keeping node degrees and recalculating the diffusion kernel. For each of the above permutations, we calculate Bayes factors of disease domains and present the results in Figure 2, which shows that the median of Bayes factors based on the original data is much higher than the medians obtained from the different permuted relationships.
We also perform similar studies using the large domain-domain interaction network (48,778 interactions between 5,490 domains) that includes the entire DOMINE database [23] and the high-confidence part of the InterDom database [24, 25]. Results show that Bayes factors for known domain-disease associations are also significantly greater than 1, with the p-value of the Wilcoxon signed rank test again smaller than 2.2 × 10^{−16}. We further perform a series of permutation tests and present the results [Additional file 1: Supplemental Figure S1]. These studies support our hypothesis that similarities between diseases can be explained by the proximities of domains associated with such diseases within a given domain-domain interaction network. In other words, domain proximity implies phenotype similarity.
Performance of the DomainRBF approach
Since interactions from the PDB entries have the highest confidence among domain-domain interactions, we first test the validity of our approach on the PDB part of the DOMINE database [23]. We implement three large-scale leave-one-out cross-validation experiments against random controls, simulated linkage intervals and genome-wide scan, respectively, each on the basis of two distance measures: diffusion kernel (DK) and shortest path with Gaussian kernel (SG).
For each of the three validation experiments, using either the diffusion kernel or the shortest path with Gaussian kernel, we draw a histogram of rank ratios for all 1,066 known associations, as shown in Figure 3. From the figure we see that rank ratios are concentrated mostly within the first few bins, and that the corresponding frequencies generally decline as the rank ratios increase. In other words, the proposed approach is capable of ranking domains known to be associated with some disease phenotypes among the top of the candidates.
We then assess the performance of the proposed approach using the three criteria (mean rank ratio, precision, and AUC score) and summarize the results in Table 1. First, we can see from these results that the domainRBF approach can successfully recover the associations between protein domains and human disease phenotypes. For example, in the cross-validation for random controls, the precisions are greater than 26%, the mean rank ratios are less than 12%, and the AUC scores are greater than 88%. In the cross-validation for linkage intervals, the precisions are greater than 23%, the mean rank ratios are less than 12%, and the AUC scores are greater than 89%. In the cross-validation for genome-wide scan, the precisions are greater than 5%, the mean rank ratios are less than 12%, and the AUC scores are greater than 88%. We therefore conclude that the domainRBF approach is effective in the identification of domains that are associated with human disease phenotypes.
Second, we conjecture from these results that the diffusion kernel measure is slightly better than the shortest path measure with Gaussian kernel, because the mean rank ratios obtained using the diffusion kernel are in general smaller, and the precisions and AUC scores are in general larger, than those obtained using the shortest path with Gaussian kernel. This phenomenon might be explained by the fact that the diffusion kernel is a global network-distance measure. As such, the distance between two domains depends not only on the relative location of the candidate domain to all other domains (as the shortest path with Gaussian kernel does), but also on the graph structure of the entire network. Thus, for interaction networks with different graph structures, two nodes with the same shortest path distance usually have different diffusion kernel distances, and it is possible that this difference makes the diffusion kernel distance a more reasonable and precise description of the similarity between two domains in the interaction network. This point has also been explicitly illustrated in the literature [29].
Third, we conjecture from these results that the domainRBF approach with properly defined priors can achieve higher performance than the non-Bayesian linear regression method. We compare the performance of the (Bayesian) domainRBF approach with the (non-Bayesian) ordinary linear regression method through the three large-scale leave-one-out cross-validation experiments, and we also list the results in Table 1. Although both approaches can successfully recover the associations between protein domains and human disease phenotypes, the results show that the domainRBF approach achieves better performance than the ordinary linear regression approach in most cases. For example, in all three cross-validation experiments, the domainRBF approach achieves higher precisions (with only two exceptions for genome-wide scan), smaller mean rank ratios (by at least 5.32%), and larger AUC scores (by at least 3.19%). When looking at the ROC curves (Figure 4), we see that the curve of the domainRBF approach climbs much faster towards the upper left corner of the plot than does that of the ordinary linear regression approach, suggesting that the Bayesian domainRBF approach is superior to the non-Bayesian ordinary linear regression method.
Robustness of the DomainRBF approach
Effects of network interactions
The above validation results suggest that the domainRBF approach can successfully prioritize candidate domains and put the domain that is truly associated with the query disease phenotype at the top of the candidates. However, it is still necessary to determine whether the correct prioritization of disease domains is due to the connectivity information that includes in the domaindomain interactions, domainphenotype associations, and phenotypephenotype similarities. To accomplish this, we artificially destroy informative interactions in the above three networks and see what performances will turn out. It is expected that both the mean rank ratios and the AUC scores will be around 50%, together with very low precisions. With this understanding, we perform three permutation experiments: 1) shuffling interactions among domains while fixing the node degree (number of direct neighbours) distribution of the entire interaction network, 2) shuffling interactions among domainphenotype associations while fixing the number of associated domains for each of the phenotypes, and 3) shuffling the phenotypephenotype similarity while fixing the distribution of phenotype similarities, respectively. Then we repeat the leaveoneout crossvalidation experiments using the shuffled networks, which contain no informative interactions among domains, among domain and phenotypes, or among phenotypes, respectively. As shown in Figure 5, the results obtained are generally consistent with our expectation in that AUC scores are all around 50%. We therefore conclude that the successful prioritization of candidate domains is indeed due to the informative interactions among domains that are included in the domaindomain interaction network.
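The first permutation experiment, shuffling interactions while fixing every node's degree, is commonly realized by repeated edge swaps. The sketch below shows one such realization using only the standard library; it is an assumption that the shuffling was done this way, and the sketch further assumes a simple undirected network without self-interactions or duplicate edges. The function name and toy network are hypothetical.

```python
import random


def shuffle_edges_preserving_degrees(edges, n_swaps, seed=0):
    """Rewire an undirected simple graph by edge swaps, keeping node degrees.

    Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    and is rejected if it would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(edge)) for edge in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # guard against graphs with no valid swap
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        new_first = tuple(sorted((a, d)))
        new_second = tuple(sorted((c, b)))
        if a == d or c == b or new_first in edge_set or new_second in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_first, new_second}
        edge_list[i], edge_list[j] = new_first, new_second
        done += 1
    return edge_list


# Hypothetical toy network: a 6-node ring with two chords.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3), (1, 4)]
toy_shuffled = shuffle_edges_preserving_degrees(toy_edges, n_swaps=20, seed=1)
```

After shuffling, the proximity matrices would have to be recomputed on the rewired network before repeating the leave-one-out runs, since both the shortest paths and the diffusion kernel depend on the edge set.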
Effects of different domain-domain interaction networks
We notice that the two compiled domaindomain interaction networks have different properties. For example, the average degree of the smaller network that includes only PDB entries is 2.32, while that of the larger network that includes predicted interactions from both DOMINE and InterDom is 17.77. It is possible that many predicted interactions may actually be noise and thus negatively affect the prioritization of disease domains. Hence, it is necessary to validate the robustness of the proposed approaches to the underlying domaindomain interactions. For this purpose, we implement the same validation process based on the large compiled domaindomain interaction network that is composed of all interactions in the DOMINE database and highconfidence interactions in the InterDom database. Results are presented in Table 2, from which we can see that the performances of the domainRBF approach using the large domaindomain interaction network that includes the entire DOMINE database and the highconfidence interactions in the InterDom database are generally somewhat inferior to those using the PDB part of the DOMINE database. For instance, when using the domainRBF approach, the disparity of precisions, mean rank ratios and AUCs are all within the scope of 10 percent. We then conjecture from these results that the proposed domainRBF approach is quite robust to the possible noise in the domaindomain interaction network.
Effects of parameters in the distance measures
We further notice that the parameter β in the shortest path measure with Gaussian kernel and the parameter γ in the diffusion kernel are free parameters that need to be predetermined (see Materials and Methods for details). In the above cross-validation experiments we set these parameters to 1 and 0.05, respectively, for simplicity. However, it is necessary to examine whether the prioritization methods are sensitive to these parameters. For this purpose, we select several values across the range of these parameters, perform the cross-validation experiments, and see how the results change accordingly. We take the prioritization results using the domainRBF approach against random controls (in Table 1) as an example to illustrate the influence of β. Since this parameter ranges from 0 to +∞, we perform a grid search by varying it from 0.1 to 10 in steps of 0.1 and observe the effect on precision, mean rank ratio, and AUC score, as shown in Figure 6(A). From the curves we can see that when β changes from 0.1 to 1, the three criteria improve markedly, while beyond β = 1.0 (precision = 26.56%, mean rank ratio = 11.99%, and AUC score = 88.80%) the curves become fairly stable. Even so, we find that the peak performance is obtained at β = 3.7 (precision = 28.89%, mean rank ratio = 10.67%, and AUC score = 90.20%), and the worst performance is obtained at β = 0.1 (precision = 18.01%, mean rank ratio = 19.23%, and AUC score = 81.63%). From these results, we conclude that the prioritization methods are not sensitive to this free parameter when β is greater than 1. Similarly, we find that the prioritization methods are not sensitive to the free parameter γ when it is smaller than 0.15 (data not shown). The corresponding changes in precision, mean rank ratio, and AUC score are shown in Figure 6(B).
We find that the peak performance is obtained at γ = 0.03 (precision = 29.55%, mean rank ratio = 10.61%, and AUC score = 90.16%), and the worst performance is obtained at γ = 0.93 (precision = 24.86%, mean rank ratio = 13.34%, and AUC score = 87.46%). From the results, we can see that the proposed approach is quite robust when β in the shortest path with Gaussian kernel is greater than 1 or when γ in the diffusion kernel is smaller than 0.15.
Effects of parameters in the domainRBF approach
Besides the two parameters in the distance measures, there are also four parameters in the domainRBF approach that need to be predetermined, namely μ_{0}, σ_{1}^{2}, n_{0}, and σ_{0}^{2}, all of which are included in the priors of the domainRBF approach (see Materials and Methods for details). In the real implementation we set μ_{0} = 0, n_{0} = 0, and σ_{0}^{2} = 0, for the reason explained in the literature [44], and we set σ_{1}^{2} = 1, for simplicity. Therefore, we only need to test the robustness of the approach when different values of σ_{1}^{2} are used. To achieve this objective, we set σ_{1}^{2} as 0.001, 0.01, 0.1, 1, 10, and 100, respectively, and we apply the approach to the same cross validation process. We list the results in Table 3, which shows that when σ_{1}^{2} is smaller than 1, the domainRBF approach is quite robust to the change of σ_{1}^{2}, with the change of precision within 1.12%, change of mean rank ratios within 0.53%, and change of AUC scores within 0.54%. On the other hand, when σ_{1}^{2} is larger than 1, the decrease in performances becomes slightly conspicuous, but remains within the scope of 3.04% for precisions, 2.22% for mean rank ratios and 1.77% for AUC scores. Hence we can see that our domainRBF approach is generally robust to the change of parameters.
Effects of seed domain-disease associations
In order to test the influence of the number of seed (i.e., known) associations on the prioritization results, we select at random 100%, 90%, 80%, 70%, 60%, and 50% of the original seed associations, respectively, and we repeat the leave-one-out validation processes. We only calculate the performance using the domainRBF approach based on the diffusion kernel measure, and we choose the PDB part of the DOMINE database as the domain-domain interaction network. Results show that as the percentage of seed associations decreases from 100% to 50%, performance also decreases slightly in terms of precision, mean rank ratio and AUC score, despite some exceptions (see Table 4). For example, in the cross-validation for random controls, the changes of precisions are no more than 2.99%, the changes of mean rank ratios are no more than 2.72%, and the changes of AUC scores are no more than 2.73%. In the cross-validation for linkage intervals, the changes of precisions are no more than 2.73%, the changes of mean rank ratios are no more than 0.75%, and the changes of AUC scores are no more than 0.78%. In the cross-validation for genome-wide scan, the changes of precisions are no more than 1.95%, the changes of mean rank ratios are no more than 1.04%, and the changes of AUC scores are no more than 0.95%. From these results, we conclude that the prioritization methods are not sensitive to the number of seed associations in our problem.
In order to study how known domaindisease associations for other diseases contribute to the inference of domains that are associated with the query disease, we keep 10%, 20%, 30%, 40%, and 50% disease phenotypes that have the highest similarity scores to the query disease, respectively, and we repeat the leaveoneout validation processes, using the diffusion kernel measure and the small domaindomain interaction network that is composed of the PDB part of the DOMINE database. Results (Table 5) show that our method is robust in this experiment, in the sense that the values of the three evaluation criteria do not change significantly. For example, in the crossvalidation for random controls, the changes of precisions are no more than 6.64%, the changes of mean rank ratios are no more than 1.15%, and the changes of AUC scores are no more than 1.20%. However, we also notice that the performance of our approach tends to drop when more phenotypes with lower similarity scores are included. For example, in the experiment, our approach achieves the highest performance when keeping only 10% phenotypes which have the highest similarity scores to the query phenotype and the lowest performance when keeping 50% of the most similar phenotypes, although the drop in performance is small. We also repeat the above analysis using the large domaindomain interaction network that includes the entire DOMINE database and the highconfidence part of the InterDom database, and we obtain similar results [Additional file 2: Supplemental Table S1]. From these results, we conclude that seed domaindisease associations in which the diseases have high phenotype similarity scores with the query disease have main contributions in the prioritization procedure.
Ab initio inference of domain-disease and gene-disease associations
Above we have used several large-scale leave-one-out cross-validation experiments to evaluate the performance and robustness of the proposed domainRBF approach. However, it might be argued that a disease may be associated with more than one domain and that, consequently, the inclusion of domains already known to be associated with the query disease in the calculation of the domain proximity profile may ease the identification of novel associations. Following this line of reasoning, we demonstrate the capability of the proposed domainRBF approach in the prediction of novel associations for query diseases by performing the following ab initio inference experiments. For each query disease, we calculate domain proximity profiles with the exclusion of all domains that are known to be associated with the disease (i.e., as if the genetic bases of the disease were completely unknown), apply the domainRBF method to score candidate domains, and then prioritize the candidates. We again perform random control, linkage interval, and genome-wide validation experiments and evaluate the performance of our approach in terms of precision, mean rank ratio, and AUC scores. We perform this ab initio inference using the large network that is composed of all interactions in DOMINE and high-confidence interactions in InterDom (with diffusion kernel), and we summarize the results in Table 6.
In comparison with the results in Table 2, we find that the performance of the domainRBF approach slightly drops (less than 4%). For example, for the random control validation, the precision is almost the same, the mean rank ratio drops from 14.96% to 17.34%, and the AUC score from 85.82% to 83.43%. For the linkage interval validation, the precision drops from 27.56% to 24.39%, the mean rank ratio from 14.12% to 17.07%, and the AUC score from 86.68% to 84.62%. For the genomewide validation, the precision even increases slightly from 3.47% to 3.90%, while the mean rank ratio drops from 14.11% to 16.45% and the AUC score from 85.72% to 83.27%. In other words, for a query disease of interest, our approach is capable of inferring novel associations between domains and the disease without prior knowledge about genetic bases of the disease. This characteristic of our approach is of great importance, because the genetic bases for about half of the diseases in the OMIM database are still unknown [47].
We also study the contribution of seed domain-disease associations by keeping a fraction of disease phenotypes that have the highest similarity scores to the query disease and repeating ab initio prediction experiments. We observe from the results [Additional file 2: Supplemental Tables S2 and S3] that seed domain-disease associations in which the diseases have high phenotype similarity scores with the query disease contribute most to the prioritization procedure. We further study whether the ab initio prediction tends to give higher ranks to domains that occur more frequently in human proteins. We group the frequency of occurrence of domains in all human proteins into 11 bins (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, and 101 and above), and we look at how the ranks of domains that are known to be associated with diseases are distributed across the bins. From the results [Additional file 1: Supplemental Figure S2], we see that the median of the mean ranks of such domains does not change much across bins, indicating that our method is not biased towards common domains.
Inspired by the success of ab initio inference of domains and diseases, we further propose the following application of the proposed domainRBF approach in the inference of genes that are associated with diseases, by combining predicted domaindisease associations and known domainprotein relations. As shown in Figure 7, given a query disease and a gene whose products (proteins) are usually composed of several protein domains, we look at corresponding Bayes factors of these domains and define an association score that measures the strength of association between the gene and the disease as the maximum among these Bayes factors. Then, given a set of candidate genes, we are able to obtain association scores for the genes and further rank the genes according to their scores.
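The scoring rule described above (a gene's association score is the maximum Bayes factor among the domains of its products) can be sketched as follows. The Bayes factor values and gene names are made up for illustration, and the function names are hypothetical; only the max rule itself comes from the text.

```python
def gene_association_score(gene_domains, bayes_factor):
    """Score a gene as the maximum Bayes factor among its annotated domains.

    Returns None when none of the gene's domains has a Bayes factor.
    """
    scores = [bayes_factor[d] for d in gene_domains if d in bayes_factor]
    return max(scores) if scores else None

def rank_genes(gene_to_domains, bayes_factor):
    """Rank candidate genes by association score, highest first."""
    scored = {}
    for gene, domains in gene_to_domains.items():
        score = gene_association_score(domains, bayes_factor)
        if score is not None:
            scored[gene] = score
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical Bayes factors for one query disease (values are invented).
bayes_factor = {"NACHT": 12.0, "CARD": 7.5, "FGF": 1.2, "Collagen": 0.4}
# Hypothetical gene-to-domain annotation (gene names are placeholders).
gene_to_domains = {
    "geneA": ["CARD", "NACHT"],
    "geneB": ["FGF"],
    "geneC": ["Collagen"],
    "geneD": ["UnannotatedDomain"],
}
print(rank_genes(gene_to_domains, bayes_factor))
```

Genes whose domains have no Bayes factor are dropped from the ranking in this sketch, mirroring the exclusion of genes without domain annotation.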
We validate our gene prioritization approach using known genedisease associations extracted from the BioMart database [48, 49]. After ruling out diseases that do not exist in our phenotype similarity network and genes whose products do not have domain annotation, we obtain 2,847 associations between 1,737 genes and 1,875 diseases. We then prioritize each of these genes that are known to be associated with some diseases (disease genes) against a total of 14,944 genes whose products have domain annotations. Results show that in 207 out of the 2,847 associations, the known disease genes rank first in the candidate list of 14,944 genes, obtaining a precision of 7.27%, as well as a fold enrichment of 1,087 (It should be noted that fold enrichment is defined in existing literature [32] as follows: for a method that is able to rank known disease genes among the top a% of all candidates in b% validation runs, the fold enrichment is b/a on average). In other words, our domainRBF approach can also be effectively used as an intermediate step to infer associations between genes and diseases.
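As a check on the arithmetic, the reported precision and fold enrichment can be reproduced from the counts given in the text. The sketch below applies the quoted definition (b/a) with "rank first" interpreted as the top 1 of 14,944 candidates; the function name is hypothetical.

```python
def fold_enrichment(hits, runs, top_k, n_candidates):
    """Fold enrichment b / a, following the definition quoted in the text.

    b: percentage of validation runs in which the known gene is in the top k
    a: percentage of the candidate list covered by the top k
    """
    b = 100.0 * hits / runs
    a = 100.0 * top_k / n_candidates
    return b / a

precision_pct = 100.0 * 207 / 2847
enrichment = fold_enrichment(hits=207, runs=2847, top_k=1, n_candidates=14944)
print(round(precision_pct, 2), round(enrichment))
```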
Genome-wide evidence of associations between domains and common human diseases
The identification of susceptible single nucleotide polymorphisms (SNPs) conferring risk for common human diseases is one of the main tasks of genome-wide association studies (GWAS). Since the study that identified the association of complement factor H (CFH) with age-related macular degeneration (AMD) in 2005, over 450 GWAS have been performed and more than 2,000 susceptible SNPs or genetic loci have been reported [50]. With these resources, it is of interest to determine the extent to which the genome-wide ab initio inference of associations between domains and diseases is consistent with these GWAS results.
Given a disease of interest, we collect from SNPedia [51, 52] or other relevant literature a list of reported susceptible SNPs, and we count how many of these SNPs appear within 5 Mbp of the domains that are ranked in the top 10 in our genome-wide ab initio inference, which uses all domains in the domain-domain interaction network as candidates. We further implement the following permutation test to check whether such susceptible SNPs are significantly enriched within these regions:
1. Count the number of reported SNPs that appear within 5 Mbp of domains that are ranked among the top 10 in the genome-wide ab initio inference. Record this number as N_{0}.
2. For the ith permutation, select 10 domains at random from all domains in the genome-wide ab initio inference. Count the number of reported SNPs that appear within 5 Mbp of these domains. Record this number as N_{i}.
3. Repeat the above random selection M times (M = 10,000 in our study). Count the number of times that N_{i} (i = 1,...,M) is greater than or equal to N_{0}. Record this number as m.
4. Calculate a p-value as p = m/M.
The null hypothesis in the above permutation test is that the number of the reported susceptible SNPs within 5 Mbp regions of the high ranking domains (top 10) is not different from that of randomly selected domains. Therefore, a small p-value indicates that the reported susceptible SNPs tend to be closer to the high ranking domains. In other words, high ranking domains are more likely to be associated with the disease under investigation.
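The four steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the data layout (SNPs as chromosome/position pairs, each domain mapped to a list of genomic regions), the function names, and the toy coordinates are all assumptions.

```python
import random

WINDOW = 5_000_000  # 5 Mbp, as in the text

def count_snps_near(snps, regions, window=WINDOW):
    """Count SNPs lying within `window` bp of at least one region.

    snps: list of (chromosome, position)
    regions: list of (chromosome, start, end)
    """
    count = 0
    for chrom, pos in snps:
        if any(chrom == c and start - window <= pos <= end + window
               for c, start, end in regions):
            count += 1
    return count

def permutation_test(snps, ranked_domains, domain_regions, top=10, M=10_000, seed=0):
    """Return (N0, p) where p = m / M, following steps 1 to 4."""
    rng = random.Random(seed)

    def regions_of(domains):
        return [r for d in domains for r in domain_regions[d]]

    # Step 1: SNPs near the top-ranked domains.
    n0 = count_snps_near(snps, regions_of(ranked_domains[:top]))
    # Steps 2 and 3: compare against M random draws of `top` domains.
    m = 0
    for _ in range(M):
        sample = rng.sample(ranked_domains, top)
        if count_snps_near(snps, regions_of(sample)) >= n0:
            m += 1
    # Step 4: empirical p-value.
    return n0, m / M

# Hypothetical toy data: 20 domains, one region each, on separate chromosomes.
ranked = [f"dom{i}" for i in range(20)]
domain_regions = {d: [(f"chr{i}", 1_000_000, 1_001_000)] for i, d in enumerate(ranked)}
snps = [("chr0", 2_000_000), ("chr1", 3_000_000), ("chr2", 500_000), ("chr0", 50_000_000)]
n0, p = permutation_test(snps, ranked, domain_regions, M=2000)
print(n0, p)
```

In this toy setting three SNPs fall near the top 10 domains, and a random draw of 10 out of 20 domains covers all three with probability C(17,7)/C(20,10), roughly 0.105, so the printed p-value should be close to that.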
We then select four disease examples (type 1 diabetes, type 2 diabetes, Crohn's disease, and breast cancer), apply the above permutation test method to these diseases, and analyze the results in detail. We choose these four diseases because they are common and have GWAS results available. It has been shown that diabetes had affected 2.8% of the population worldwide by 2000 [53], with type 2 diabetes as the most common form of this disease [54]. It is also known that Crohn's disease affects 0.1% to 0.2% of people within the UK [55], and that breast cancer is the most common type of non-skin cancer in women and the fifth most common cause of cancer death [56, 57].
Type 1 Diabetes
Type 1 diabetes, formerly called juvenile diabetes or insulin-dependent diabetes, is a condition in which pancreatic β cell destruction usually leads to absolute insulin deficiency [58]. The genetic susceptibility of type 1 diabetes is strongly associated with HLA-DQ and DR on chromosome 6, but genetic factors on other chromosomes such as the insulin gene on chromosome 11 and the cytotoxic T-lymphocyte antigen gene on chromosome 2 may modulate disease risk [59]. In our study, we compile from SNPedia 48 reported susceptible SNPs, and 25 of them are found to be within 5 Mbp regions of 6 domains (i.e., NACHT, Recep_L_domain, Collagen, HNF1A_C, CARD, and FGF) that are ranked among the top 10. We present the detailed list of these domains and SNPs [Additional file 2: Supplemental Table S4]. In summary, a susceptible SNP is located inside a domain 3 times (once in each of Recep_L_domain, Collagen, and HNF1A_C), within 1 Mbp upstream or downstream of a domain 15 times, and within a 5 Mbp region of a domain 44 times. The permutation test, as described above, yields a p-value of 0.0313, which is smaller than 0.05. From these results we conjecture that domains ranked among the top 10 do, indeed, tend to be closer to, or even include, known susceptible SNPs for this disease.
In addition, we also examine the 4 domains that are not close to susceptible SNPs reported by GWAS. For domain HNF1B_C (PF04812), we notice that Urhammer et al. [60] have pointed out that mutations and polymorphisms in HNF1 cause the type 3 form of maturityonset diabetes of the young (MODY3), and for domain HNF1_N (PF04814), mutations and the common polymorphism Ala/Val in position 98 of HNF1 also cause MODY3. It is known that MODY3 is a kind of monogenic diabetes, which is different from type 1 diabetes that involves more complex combinations of causes involving multiple genes and environmental factors (i.e., polygenic). However, most commonly MODY3 acts like a very mild version of type 1 diabetes, with continued partial insulin production and normal insulin sensitivity [61]. Therefore, domain HNF1B_C (PF04812) and domain HNF1_N (PF04814), which are highly ranked in terms of our approach, may also be closer to, even include, known susceptible SNPs for this disease.
Type 2 Diabetes
Type 2 diabetes, formerly called adult-onset diabetes or non-insulin-dependent diabetes, is the most common form of diabetes. It usually begins with insulin resistance, a condition in which fat, muscle, and liver cells do not use insulin properly [62]. Numerous SNPs have been associated with (slightly) increased risk for type 2 diabetes [63, 64], but they only marginally improve the odds of predicting whether an individual will get type 2 diabetes based on the traditional clinical characteristics combining age, sex and weight [65]. In our study, we compile from SNPedia and reference [64] a total of 53 reported susceptible SNPs, and 24 of them are found to be within 5 Mbp regions of 7 domains (i.e., HNF1B_C, IF_tail, Sulfatase, Collagen, Alk_phosphatase, Pkinase_Tyr, and FGF) that are ranked among the top 10. We present the detailed list of these domains and SNPs [Additional file 2: Supplemental Table S5]. In summary, a susceptible SNP is located within 1 Mbp upstream or downstream of a domain 47 times, and within a 5 Mbp region of a domain 102 times. The permutation test yields a p-value of 0.0363, which is smaller than 0.05. From these results we conjecture that domains ranked among the top 10 do, indeed, tend to be closer to known susceptible SNPs for this disease.
In addition, we also examine the 3 domains that are not close to susceptible SNPs reported by GWAS. We notice that both domains HNF1A_C (PF04813) and HNF1_N (PF04814) contain mutations that may cause the type 3 form of maturityonset diabetes of the young (MODY3), as pointed out by Urhammer et al. [60]. Although type 2 diabetes may share some characteristics in common with MODY3, no direct evidence has yet been found to demonstrate that these two domains cause type 2 diabetes.
Crohn's Disease
Crohn's disease, a chronic inflammatory disorder of the gastrointestinal tract, is thought to result from the combination of effect of environmental factors and genetic predisposition [66, 67]. Recently, genomewide association studies have made notable progress in the study of this disease, with the number of confirmed associated loci increasing from two to more than ten [68, 69]. In our study, we compile from SNPedia 36 reported susceptible SNPs, and 29 of them are found to be within 5 Mbp regions of 8 domains (i.e., Sulfatase, NACHT, Collagen, Crystall, Pkinase_Tyr, CARD, Gla, and Hormone_1) that are ranked among top 10. We present the detailed list of these domains and SNPs [Additional file 2: Supplemental Table S6]. In summary, we observe 5 times that a susceptible SNP locates inside a domain (3 times in NACHT and 2 times in Pkinase_Tyr), 21 times that a susceptible SNP locates within 1 Mbp upstream or downstream of a domain, and 49 times that a susceptible SNP locates within 5 Mbp region of a domain. The permutation test yields a pvalue of 0.0029, which is far smaller than 0.05. From these results we conjecture that domains ranked among the top 10 do, indeed, tend to be closer to, or even include, known susceptible SNPs for this disease.
Breast Cancer
Breast cancer, the most common malignancy in women in the Western world [70], exhibits a characteristic of familial clustering [70, 71]. Although little is known currently to explain the familial clustering of breast cancer, a large amount of susceptible genes and SNPs of this disease have been recently reported, including the wellknown high breast cancer risk in BRCA1 and BRCA2 mutation carriers as well as the risk for breast cancer in certain rare syndromes caused by mutations in TP53, STK11, PTEN, CDH1, NF1 or NBN [70]. In our study, we compile from SNPedia 60 reported susceptible SNPs, and 38 of them are found to be within 5 Mbp regions of 7 domains (i.e., Sulfatase, Pkinase_Tyr, Crystall, FGF, Collagen, NACHT, and Recep_L_domain) that are ranked among the top 10. We present the detailed list of these domains and SNPs [Additional file 2: Supplemental Table S7]. In summary, we observe 6 times that a susceptible SNP locates inside a domain (all in Pkinase_Tyr), 21 times that a susceptible SNP locates within 1 Mbp upstream or downstream of a domain, and 80 times that a susceptible SNP locates within 5 Mbp region of a domain. The permutation test yields a pvalue of 0.0159, which is smaller than 0.05. From these results we conjecture that domains ranked among the top 10 do, indeed, tend to be closer to, or even include, known susceptible SNPs for this disease.
Contributions of seed domain-disease associations in the analysis of the four diseases
For each of the four disease examples, we further evaluate the contribution of seed domain-disease associations by keeping the 10%, 20%, 30%, 40%, and 50% of disease phenotypes that have the highest similarity scores to the query disease, obtaining a rank list of all domains, and then using the permutation test to check whether the number of known susceptible SNPs is still significantly enriched around the top ranking domains. As shown in [Additional file 2: Supplemental Table S8], we find that all resulting p-values are smaller than 0.05, and are also numerically close to those obtained using all phenotypes. We therefore conjecture that our approach is robust to the seed domain-disease associations in the inference for these disease examples.
A predicted landscape of domain-disease associations
With the above validation results demonstrating the possibility of recovering the associations between protein domains and disease phenotypes, we further apply the domainRBF approach to all available protein domains and human disease phenotypes and predict a genomewide landscape of the associations between protein domains and human disease phenotypes. There are a total of 5,080 phenotypes in the phenotype similarity network and 5,490 protein domains in the domaindomain interaction network (the union of the entire DOMINE and InterDom network). For each phenotype, we perform a prioritization of all domains with the use of the domainRBF approach (using the diffusion kernel measure). The prioritization results, together with a freely accessible web interface, are provided at http://bioinfo.au.tsinghua.edu.cn/domainRBF/domain. All domains on the webpage are linked to the DOMINE database and the InterDom database, from which further information can be obtained.
On the basis of the above prioritization results, we aggregate the Bayes factors between all the 5,490 domains and 1,145 phenotypes, and obtain a matrix of altogether 6,286,050 elements. Here we first make a log (base 10) transform of original matrix, and then implement clustering while removing the rows in which the values are all smaller than 0.1. Since phenotypes clustered together generally have similar molecular basis, or share significant genetic overlaps [32], we implement a twoway hierarchical clustering [72], to identify interesting areas where large values of Bayes factors are highly enriched. The clustering result is demonstrated in the form of a heat map, as shown in Figure 8(A). We then manually inspect and annotate each of the phenotype clusters with one of the 22 disorder classes based on the physiological system affected [73]. Through clustering, many highly scored blocks or regions are formed in the heat map, each of which represents a set of functionally related domains implicated in a set of genetically overlapping phenotypes [32]. Specifically, we take the region in the pink circle as an example, which is enlarged in Figure 8(B). Phenotypes in the region selected are enriched with diseases related to the muscle system, and domains are also conjectured to share similar functions with adjacent domains in the same region.
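The preprocessing before clustering can be sketched as below. The text does not state whether the 0.1 cut-off applies to the raw or the log-transformed Bayes factors; this sketch assumes the log-transformed values, since the filter is described after the transform. Function names and the toy matrix are hypothetical, and the clustering step itself is omitted.

```python
import math

def log10_transform(matrix):
    """Apply a base-10 log to every (strictly positive) Bayes factor."""
    return [[math.log10(v) for v in row] for row in matrix]

def drop_low_rows(labels, matrix, threshold=0.1):
    """Remove rows in which all values are smaller than `threshold`."""
    kept = [(label, row) for label, row in zip(labels, matrix)
            if any(v >= threshold for v in row)]
    return [label for label, _ in kept], [row for _, row in kept]

# Size of the full matrix reported in the text.
n_elements = 5490 * 1145
print(n_elements)

# Hypothetical toy matrix: rows are domains, columns are phenotypes.
domains = ["domX", "domY", "domZ"]
bayes_factors = [
    [10.0, 1.0],   # log10 -> [1.0, 0.0]      kept
    [1.0, 1.1],    # log10 -> [0.0, ~0.041]   dropped (all below 0.1)
    [0.5, 100.0],  # log10 -> [~-0.30, 2.0]   kept
]
kept_domains, kept_rows = drop_low_rows(domains, log10_transform(bayes_factors))
print(kept_domains)
```

The filtered matrix would then be passed to a two-way hierarchical clustering routine to produce the heat map.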
We further apply the above prioritization method to all human disease phenotypes and obtain a landscape of genephenotype associations that include 5,080 disease phenotypes and 14,944 human genes. The prioritization results, together with a freely accessible web interface, are provided at http://bioinfo.au.tsinghua.edu.cn/domainRBF/gene/. All genes on the webpage are linked to the Ensembl database [74], from which further information can be obtained.
Discussion and Conclusions
In this paper, we studied the problem of identifying domains that are associated with human inherited diseases under a prioritization framework. We proposed an approach called domainRBF from the perspective of Bayesian regression, verified its superior performance through three largescale crossvalidation experiments, and demonstrated the robustness of this approach via a series of permutation tests. We further proposed to perform ab initio inference of domaindisease associations and genedisease associations. Finally, we calculated a landscape between 5,490 protein domains and 5,080 disease phenotypes.
In comparison with previous studies that rely on phenotype similarity and proteinprotein interaction data to infer genedisease associations [32], our approach can achieve higher resolution in pinpointing susceptibility functional units in the genome, essentially because a domain is only a fraction of a protein and is typically small in size (ranging between 40 and 700 residues [75] with an average of approximately 100 residues [76]). Moreover, as demonstrated in the Results section, our approach can also be used as an intermediate step in the inference of genedisease associations.
However, our method has the following limitations. First, our method can only be applied to diseases that are included in phenotype similarity data and domains that are included in domain interaction data. In the case of phenotype similarity, a possible solution would involve the development of a visualization and annotation system such as the one in [77] that can associate a new disease to a standard vocabulary and then calculate similarities for the new disease. In the case of domain interaction data, a possible solution would involve the development of effective computational methods to predict domaindomain interactions.
Second, our method currently only considers conjugate priors in the Bayesian regression model. Although such formulation results in analytic solutions and thus alleviates the computational burden in the calculation of Bayes factors, it is known that the specification of prior is intrinsically complicated and subjective [44]. The main consideration is that the posterior mean and variance should not depend on the units in which the disease similarities are measured and should also be invariant to the shift of the response variable. To meet these requirements, the use of the Jeffreys prior [78] could be considered, and a Markov chain Monte Carlo (MCMC) approach could be adopted for the calculation of the marginal likelihood.
Our approach can be further studied from the following aspects. First, in addition to the domaindomain interaction network, information such as annotations of Pfam domains in the Gene Ontology (GO) can also provide a means for calculating similarities between domains. Recently, methods for calculating semantic similarities between GO terms have been packed into userfriendly software [79]. It is therefore possible to calculate pairwise semantic similarities between every two domains and then use this similarity profile with our domainRBF model to infer associations between domains and human inherited diseases.
Second, it is conceptually straightforward to extend the domainRBF model to infer interactive effects of multiple domains on a query disease. For example, given a query disease and a set of candidate domains, we can enumerate all two-way combinations of the domains and then use the domainRBF model to infer possible associations between the disease and interactions of two domains. Nevertheless, such a brute-force method is computationally intensive and hardly feasible for the study of three-way or even higher-order interactive effects of candidate domains.
Third, with the accumulation of publicly available data in genomewide association (GWA) studies, we can consider the integration of our method and GWA studies. For example, given a disease of interest and a set of candidate domains, we can prioritize the candidate domains using our method and obtain the ranks of the domains. On the other hand, given pvalues of SNPs in a GWA study, we can obtain the statistical significance of candidate domains in the GWA study by combining the pvalues of SNPs located in the domains and then prioritize the candidates to obtain their ranks. With these two ranks, we can resort to statistical methods, such as the one described in [8], to obtain a single rank for each candidate domain.
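One way to combine the p-values of SNPs located in a domain is Fisher's method, sketched below. The text does not name a combination method, so this choice is an assumption; it also presumes independent SNP p-values, which linkage disequilibrium may violate. The domain names are taken from the Results, but the p-values are invented.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method (assumed choice).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical SNP p-values per candidate domain (values are invented).
snp_pvalues = {
    "NACHT": [0.001, 0.04],
    "CARD": [0.20, 0.50, 0.70],
    "FGF": [0.03],
}
combined = {d: fisher_combined_pvalue(ps) for d, ps in snp_pvalues.items()}
gwa_rank = sorted(combined, key=combined.get)  # most significant first
print(gwa_rank)
```

The resulting GWA-based rank and the domainRBF rank would then be merged by a rank aggregation method such as the one in [8], which is not sketched here.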
References
 1.
Lathrop GM, Lalouel JM, Julier C, Ott J: Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA. 1984, 81: 34433446. 10.1073/pnas.81.11.3443
 2.
Ott J: Computer-simulation methods in human linkage analysis. Proc Natl Acad Sci USA. 1989, 86: 4175-4178. 10.1073/pnas.86.11.4175
 3.
Balding DJ: A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006, 7: 781-791. 10.1038/nrg1916
 4.
Cardon LR, Bell JI: Association study designs for complex diseases. Nat Rev Genet. 2001, 2: 91-99.
 5.
Glazier AM, Nadeau JH, Aitman TJ: Finding genes that underlie complex traits. Science. 2002, 298: 2345-2349. 10.1126/science.1076641
 6.
Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-237.
 7.
Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics. 2005, 6: 55. 10.1186/1471-2105-6-55
 8.
Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al.: Gene prioritization through genomic data fusion. Nat Biotechnol. 2006, 24: 537-544. 10.1038/nbt1203
 9.
van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG: A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet. 2003, 11: 57-63. 10.1038/sj.ejhg.5200918
 10.
Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006, 78: 1011-1025. 10.1086/504300
 11.
Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics. 2002, 18 (Suppl 2): S110-115. 10.1093/bioinformatics/18.suppl_2.S110
 12.
Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002, 31: 316-319.
 13.
Turner FS, Clutterbuck DR, Semple CA: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol. 2003, 4: R75. 10.1186/gb-2003-4-11-r75
 14.
Gaulton KJ, Mohlke KL, Vision TJ: A computational system to select candidate genes for complex human traits. Bioinformatics. 2007, 23: 1132-1140. 10.1093/bioinformatics/btm001
 15.
Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using protein-protein interactions. J Med Genet. 2006, 43: 691-698. 10.1136/jmg.2006.041376
 16.
Oti M, Brunner HG: The modular nature of genetic diseases. Clin Genet. 2007, 71: 1-11.
 17.
George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA: Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res. 2006, 34: e130. 10.1093/nar/gkl707
 18.
Sharma A, Chavali S, Tabassum R, Tandon N, Bharadwaj D: Gene prioritization in Type 2 Diabetes using domain interactions and network analysis. BMC Genomics. 2010, 11: 84. 10.1186/1471-2164-11-84
 19.
Pawlowski K, Pio F, Chu Z, Reed JC, Godzik A: PAAD - a new protein domain associated with apoptosis, cancer and autoimmune diseases. Trends Biochem Sci. 2001, 26: 85-87. 10.1016/S0968-0004(00)01729-1
 20.
He QY, Liu XH, Li Q, Studholme DJ, Li XW, Liang SP: G8: a novel domain associated with polycystic kidney disease and nonsyndromic hearing loss. Bioinformatics. 2006, 22: 2189-2191. 10.1093/bioinformatics/btl123
 21.
Fontalba A, Martinez-Taboada V, Gutierrez O, Pipaon C, Benito N, Balsa A, Blanco R, Fernandez-Luna JL: Deficiency of the NF-κB inhibitor caspase activating and recruitment domain 8 in patients with rheumatoid arthritis is associated with disease severity. J Immunol. 2007, 179: 4867-4873.
 22.
Wang W, Zhang W, Jiang R, Luan Y: Prioritisation of associations between protein domains and complex diseases using domain-domain interaction networks. IET Syst Biol. 2010, 4: 212-222. 10.1049/iet-syb.2009.0037
 23.
Raghavachari B, Tasneem A, Przytycka TM, Jothi R: DOMINE: a database of protein domain interactions. Nucleic Acids Res. 2008, 36: D656-661.
 24.
Ng SK, Zhang Z, Tan SH, Lin K: InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res. 2003, 31: 251-254. 10.1093/nar/gkg079
 25.
Ng SK, Zhang Z, Tan SH: Integrative approach for computationally inferring protein domain interactions. Bioinformatics. 2003, 19: 923-929. 10.1093/bioinformatics/btg118
 26.
Altshuler D, Daly M, Kruglyak L: Guilt by association. Nat Genet. 2000, 26: 135-137. 10.1038/79839
 27.
Oti M, Huynen MA, Brunner HG: Phenome connections. Trends Genet. 2008, 24: 103-106. 10.1016/j.tig.2007.12.005
 28.
van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Hum Genet. 2006, 14: 535-542. 10.1038/sj.ejhg.5201585
 29.
Köhler S, Bauer S, Horn D, Robinson PN: Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008, 82: 949-958. 10.1016/j.ajhg.2008.02.013
 30.
Chen J, Aronow BJ, Jegga AG: Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics. 2009, 10: 73. 10.1186/1471-2105-10-73
 31.
Chen J, Bardes EE, Aronow BJ, Jegga AG: ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009, 37: W305-311. 10.1093/nar/gkp427
 32.
Wu X, Jiang R, Zhang MQ, Li S: Network-based global inference of human disease genes. Mol Syst Biol. 2008, 4: 189.
 33.
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al.: The Pfam protein families database. Nucleic Acids Res. 2010, 38: D211-222. 10.1093/nar/gkp985
 34.
The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38: D142-148.
 35.
Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, Suzek BE, Martin MJ, McGarvey P, Gasteiger E: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics. 2009, 10: 136. 10.1186/1471-2105-10-136
 36.
Ideker T, Sharan R: Protein networks in disease. Genome Res. 2008, 18: 644-652. 10.1101/gr.071852.107
 37.
Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics. 2005, 21: 410-412. 10.1093/bioinformatics/bti011
 38.
Stein A, Panjkovich A, Aloy P: 3did Update: domain-domain and peptide-mediated interactions of known 3D structure. Nucleic Acids Res. 2009, 37: D300-304. 10.1093/nar/gkn690
 39.
Stein A, Russell RB, Aloy P: 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res. 2005, 33: D413-417.
 40.
Lee H, Deng M, Sun F, Chen T: An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics. 2006, 7: 269. 10.1186/1471-2105-7-269
 41.
Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, et al.: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002, 58: 899-907. 10.1107/S0907444902003451
 42.
Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND-The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001, 29: 242-245. 10.1093/nar/29.1.242
 43.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303
 44.
Servin B, Stephens M: Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet. 2007, 3: e114. 10.1371/journal.pgen.0030114
 45.
Li KC: Genome-wide coexpression dynamics: theory and application. Proc Natl Acad Sci USA. 2002, 99: 16875-16880. 10.1073/pnas.252466999
 46.
Ma X, Lee H, Wang L, Sun F: CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data. Bioinformatics. 2007, 23: 215-221. 10.1093/bioinformatics/btl569
 47.
McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet. 2007, 80: 588-604. 10.1086/514346
 48.
Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A: BioMart Central Portal-unified access to biological data. Nucleic Acids Res. 2009, 37: W23-27. 10.1093/nar/gkp265
 49.
Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart-biological queries made easy. BMC Genomics. 2009, 10: 22. 10.1186/1471-2164-10-22
 50.
Ku CS, Loy EY, Pawitan Y, Chia KS: The pursuit of genome-wide association studies: where are we now?. J Hum Genet. 2010, 55: 195-206. 10.1038/jhg.2010.19
 51.
Yu W, Ned R, Wulf A, Liu T, Khoury MJ, Gwinn M: The need for genetic variant naming standards in published abstracts of human genetic association studies. BMC Res Notes. 2009, 2: 56. 10.1186/1756-0500-2-56
 52.
Malzahn D, Balavarca Y, Lozano JP, Bickeboller H: Tests for candidate-gene interaction for longitudinal quantitative traits measured in a large cohort. BMC Proc. 2009, 3 (Suppl 7): S80. 10.1186/1753-6561-3-s7-s80
 53.
Wild S, Roglic G, Green A, Sicree R, King H: Global prevalence of diabetes: estimates for the year 2000 and projections for 2030. Diabetes Care. 2004, 27: 1047-1053. 10.2337/diacare.27.5.1047
 54.
Type 2 Diabetes Overview. http://diabetes.webmd.com/guide/type2diabetes
 55.
Genetic complexity of Crohn's disease revealed. http://www.well.ox.ac.uk/jun08geneticsofcrohnsdisease
 56.
Boyle P, Levin B: World Cancer Report 2008. http://www.iarc.fr/en/publications/pdfsonline/wcr/2008/wcr_2008.pdf
 57.
Most frequent cancers: women. http://globocan.iarc.fr/factsheets/populations/factsheet.asp?uno=900
 58.
Daneman D: Type 1 diabetes. Lancet. 2006, 367: 847-858. 10.1016/S0140-6736(06)68341-4
 59.
Lernmark A: Type 1 diabetes. Clin Chem. 1999, 45: 1331-1338.
 60.
Urhammer SA, Fridberg M, Hansen T, Rasmussen SK, Moller AM, Clausen JO, Pedersen O: A prevalent amino acid polymorphism at codon 98 in the hepatocyte nuclear factor-1alpha gene is associated with reduced serum C-peptide and insulin responses to an oral glucose challenge. Diabetes. 1997, 46: 912-916. 10.2337/diabetes.46.5.912
 61.
Yamagata K, Oda N, Kaisaki PJ, Menzel S, Furuta H, Vaxillaire M, Southam L, Cox RD, Lathrop GM, Boriraj VV, et al.: Mutations in the hepatocyte nuclear factor-1alpha gene in maturity-onset diabetes of the young (MODY3). Nature. 1996, 384: 455-458. 10.1038/384455a0
 62.
Tuomilehto J, Lindstrom J, Eriksson JG, Valle TT, Hamalainen H, Ilanne-Parikka P, Keinanen-Kiukaanniemi S, Laakso M, Louheranta A, Rastas M, et al.: Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N Engl J Med. 2001, 344: 1343-1350. 10.1056/NEJM200105033441801
 63.
Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, et al.: A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007, 445: 881-885. 10.1038/nature05616
 64.
Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PI, Abecasis GR, Almgren P, Andersen G, et al.: Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet. 2008, 40: 638-645. 10.1038/ng.120
 65.
van Hoek M, Dehghan A, Witteman JC, van Duijn CM, Uitterlinden AG, Oostra BA, Hofman A, Sijbrands EJ, Janssens AC: Predicting type 2 diabetes based on polymorphisms from genome-wide association studies: a population-based study. Diabetes. 2008, 57: 3122-3128. 10.2337/db08-0425
 66.
Ogura Y, Bonen DK, Inohara N, Nicolae DL, Chen FF, Ramos R, Britton H, Moran T, Karaliuskas R, Duerr RH, et al.: A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature. 2001, 411: 603-606. 10.1038/35079114
 67.
Braat H, Peppelenbosch MP, Hommes DW: Immunology of Crohn's disease. Ann N Y Acad Sci. 2006, 1072: 135-154. 10.1196/annals.1326.039
 68.
Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM, et al.: Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet. 2008, 40: 955-962. 10.1038/ng.175
 69.
Mathew CG: New links to the pathogenesis of Crohn disease provided by genome-wide association scans. Nat Rev Genet. 2008, 9: 9-14.
 70.
Ripperger T, Gadzicki D, Meindl A, Schlegelberger B: Breast cancer susceptibility: current knowledge and implications for genetic counselling. Eur J Hum Genet. 2009, 17: 722-731. 10.1038/ejhg.2008.212
 71.
Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, et al.: Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007, 447: 1087-1093. 10.1038/nature05887
 72.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863
 73.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104: 8685-8690. 10.1073/pnas.0701361104
 74.
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al.: The Ensembl genome database project. Nucleic Acids Res. 2002, 30: 38-41. 10.1093/nar/30.1.38
 75.
Jones S, Stewart M, Michie A, Swindells MB, Orengo C, Thornton JM: Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 1998, 7: 233-242.
 76.
Wheelan SJ, Marchler-Bauer A, Bryant SH: Domain size distributions can predict domain boundaries. Bioinformatics. 2000, 16: 613-618. 10.1093/bioinformatics/16.7.613
 77.
Robinson PN, Kohler S, Bauer S, Seelow D, Horn D, Mundlos S: The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet. 2008, 83: 610-615. 10.1016/j.ajhg.2008.09.017
 78.
Jeffreys HS: Theory of probability. 1998, Oxford [Oxfordshire]: Clarendon Press; New York: Oxford University Press, 3rd edition.
 79.
Frohlich H, Speer N, Poustka A, Beissbarth T: GOSim-an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics. 2007, 8: 166. 10.1186/1471-2105-8-166
Acknowledgements
This work was partly supported by the National Science Foundation of China (60805010, 60928007, and 60934004), Tsinghua University Initiative Scientific Research Program, Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation, Research Fund for the Doctoral Program of Higher Education of China (200800031009), and the Scientific Research Foundation for Returned Overseas Chinese Scholars. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WZ derived the model, implemented the method, collected the results, and drafted the manuscript. YC participated in the design of the study. FS designed the research and revised the manuscript. RJ designed the research and drafted and revised the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Supplemental Tables
Additional file 2: Supplemental Table S1 lists contributions of seed domain-disease associations (leave-one-out cross-validation experiments using the large domain-domain interaction network). Supplemental Table S2 lists contributions of seed domain-disease associations (ab initio prediction experiments using the small domain-domain interaction network). Supplemental Table S3 lists contributions of seed domain-disease associations (ab initio prediction experiments using the large domain-domain interaction network). Supplemental Table S4 lists the genome-wide evidence of associations between domains and type 1 diabetes. Supplemental Table S5 lists the genome-wide evidence of associations between domains and type 2 diabetes. Supplemental Table S6 lists the genome-wide evidence of associations between domains and Crohn's disease. Supplemental Table S7 lists the genome-wide evidence of associations between domains and breast cancer. Supplemental Table S8 lists contributions of seed domain-disease associations in the analysis of the four disease examples. (PDF 529 KB)
Cite this article
Zhang, W., Chen, Y., Sun, F. et al. DomainRBF: a Bayesian regression approach to the prioritization of candidate domains for complex diseases. BMC Syst Biol 5, 55 (2011). https://doi.org/10.1186/1752-0509-5-55
Keywords
 Diffusion Kernel
 Rank Ratio
 Candidate Domain
 Susceptible SNPs
 Human Disease Phenotype