Molecular mechanistic associations of human diseases
© Stegmaier et al. 2010
Received: 18 December 2009
Accepted: 6 September 2010
Published: 6 September 2010
The study of relationships between human diseases provides new possibilities for biomedical research. Recent achievements on human genetic diseases have stimulated interest to derive methods to identify disease associations in order to gain further insight into the network of human diseases and to predict disease genes.
Using about 10000 manually collected causal disease/gene associations, we developed a statistical approach to infer meaningful associations between human morbidities. The derived method clustered cardiometabolic and endocrine disorders, immune system-related diseases, solid tissue neoplasms and neurodegenerative pathologies into prominent disease groups. Analysis of biological functions confirmed characteristic features of corresponding disease clusters. Inference of disease associations was further employed as a starting point for prediction of disease genes. Efforts were made to underpin the validity of results by relevant literature evidence. Interestingly, many inferred disease relationships correspond to known clinical associations and comorbidities, and several predicted disease genes were subjects of therapeutic target research.
Causal molecular mechanisms present a unifying principle to derive methods for disease classification, analysis of clinical disorder associations, and prediction of disease genes. According to the definition of causal disease genes applied in this study, these results are not restricted to genetic disease/gene relationships. This may be particularly useful for the study of long-term or chronic illnesses, where pathological derangement due to environmental factors or arising as part of sequel conditions is of importance and may not be fully explained by genetic background.
Diseases and accompanying symptoms are spawned by systems of molecules, which operate within and across cell and tissue boundaries. A major goal of medical research is to identify the molecular components which play a role in causing a pathological condition. Since the first seminal achievements, events at the molecular level have been recognized as key to understanding disease mechanisms.
Phenotype/genotype associations provide evidence for a role of affected gene products in respective causal mechanisms and extensive resources document medically relevant gene variants [2, 3]. Recent studies on hereditary phenotypes have shown that similarities among disorders imply involvement of functionally related gene products, summarized as "phenotypic overlap implies genetic overlap". The modular nature of human genetic diseases suggests that modules of similar disorders, also denoted as disease subnetworks, can be juxtaposed with modules of molecules which commonly contribute to a biological function, or interact in molecular complexes or pathways [4–7]. Several studies support the modularity concept and it was successfully applied to derive computational approaches for prediction of candidate genes as well as functional links between molecules [8–12].
It is now clear that analysis of disease relationships unfolds new opportunities for both medical and biological research. Several of the aforementioned works determined pairwise disorder similarity with a score derived from text-mining of OMIM phenotype descriptions. Rzhetsky et al. analyzed associations among 161 diseases based on their co-occurrence in patient records. Possibilities to correlate diseases through protein interaction networks or molecular pathways were also explored [13, 14]. Sam et al. used relations between proteins, Gene Ontology (GO), and phenotypes established in the PhenoGO NLP system together with Reactome protein interactions to find diseases involving common protein-protein interaction networks such as xeroderma pigmentosum and Cockayne syndrome, for which a functional link was previously discussed. Li and Agarwal obtained disease/gene associations through literature mining of MEDLINE abstracts and constructed a network of diseases which share common molecular pathways. In this network they identified novel disease relationships and observed that a disease is linked to several pathways and a pathway is linked to several diseases.
We present a novel approach to analyze mechanistic relationships between human diseases. Using about 10000 causal disease/gene associations annotated in the BIOBASE Knowledge Library (BKL), we developed a statistical method that quantifies pairwise similarity between disorders. Connecting diseases at a certain significance threshold, the statistical approach revealed groups of diseases which feature characteristic biological functions. So far, computationally inferred disease relationships were mainly examined with regard to shared molecular networks. Yet, many disease associations reported in this work correspond to known clinical associations and causal links between pathologies. Furthermore, we used disease associations and gene associations to predict causal disease genes. The results suggest that analysis of causal mechanisms provides a unified framework for disease classification and discovery of causal components, and can be used to obtain computational evidence for clinical disease associations as well as hypotheses about their molecular foundation.
We extracted disease/gene associations which had been manually classified as causal or preventative from the BIOBASE Knowledge Library™ (Methods). In the following, we denote respective genes as causal genes. The data set comprised 375 diseases which were connected to at least 5 of 3051 causal genes by a total of 9871 disease/gene associations. Similarity of involved molecular mechanisms for each disease pair was assessed by calculating the number of common causal genes and the corresponding P-value as described in Methods.
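The pairwise comparison can be illustrated with a minimal sketch. It counts the causal genes shared by two diseases and converts the count into an upper-tail probability under a Poisson model. The Poisson mean (`expected_overlap`) is a hypothetical value chosen for illustration; the procedure that estimates such parameters is described in Methods and is not reproduced here. Gene sets and disease names are toy data.

```python
import math


def shared_genes(genes_a, genes_b):
    """Return the set of causal genes two diseases have in common."""
    return set(genes_a) & set(genes_b)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of an overlap of k or more."""
    if k <= 0:
        return 1.0
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below_k)


# Toy data: the assignments of genes to diseases are illustrative only.
disease_genes = {
    "disease_A": {"INS", "LEP", "PPARG", "TNF", "IL6"},
    "disease_B": {"LEP", "PPARG", "ADIPOQ", "IL6", "MC4R"},
}

overlap = shared_genes(disease_genes["disease_A"], disease_genes["disease_B"])
expected_overlap = 0.2  # assumed Poisson mean of the chance overlap for this pair
p_value = poisson_upper_tail(len(overlap), expected_overlap)
```

In this toy case three genes are shared, and the small assumed mean makes an overlap of that size unlikely by chance.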
We first constructed a map connecting all diseases with a minimum of two common genes and a maximal similarity P-value of 0.001. This map consisted of one giant component with 123 disease nodes, three medium-sized components with 14, 12, and 10 nodes as well as 29 small components with two to six nodes. In total, there were 239 of the 375 diseases, so that 136 diseases were not connected to any other at the required similarity threshold.
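The map construction amounts to thresholding disease pairs and collecting connected components. The following sketch assumes that the number of shared genes and the P-value are already available for each pair; the function names and the toy input are hypothetical.

```python
from collections import defaultdict


def build_disease_map(pair_stats, min_shared=2, max_p=0.001):
    """Connect disease pairs with enough shared genes and a small enough P-value."""
    graph = defaultdict(set)
    for (a, b), (n_shared, p_value) in pair_stats.items():
        if n_shared >= min_shared and p_value <= max_p:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the components of the map, largest first, via depth-first search."""
    seen, components = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy input: (disease, disease) -> (number of shared genes, P-value)
pair_stats = {
    ("d1", "d2"): (3, 1e-4),
    ("d2", "d3"): (2, 5e-4),
    ("d4", "d5"): (4, 1e-5),
    ("d1", "d6"): (1, 1e-5),  # fails the minimum overlap
    ("d5", "d6"): (3, 0.02),  # fails the P-value threshold
}
disease_map = build_disease_map(pair_stats)
components = connected_components(disease_map)
```

Diseases that pass neither criterion with any partner, such as `d6` here, do not appear in the map at all, which mirrors the 136 unconnected diseases reported above.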
We tested whether the number of 239 diseases connected at the chosen P-value threshold was statistically significant. For this, we calculated false discovery rates (FDR) for P-values of disease pairs with at least two common causal genes using the fdrtool package. According to fdrtool, the P-value cut-off 0.001 corresponded to a false discovery rate of 0.024. Hence, the disease connections were statistically significant also after multiple testing correction. For comparison, 282 disorders were connected at an FDR of 0.05 (P-value 3.73e-3).
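To illustrate how a P-value cut-off relates to a false discovery rate, the sketch below applies the Benjamini-Hochberg step-up adjustment. This is a substitute chosen for brevity: fdrtool estimates FDR by modelling the null distribution and will generally give different numbers. The P-values are toy data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p_values = [0.0001, 0.001, 0.02, 0.03, 0.5]
q_values = benjamini_hochberg(toy_p_values)
# Disease pairs retained at an FDR of 0.05 in this toy example
significant = [p for p, q in zip(toy_p_values, q_values) if q <= 0.05]
```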
In the giant component, diseases are congregated in three subregions. The top of the network (yellow colored, Fig. 1) comprises mostly muscular, cardiovascular and metabolic diseases such as diabetic disorders, obesity, myopathies and heart failure, but also stroke and brain ischemia. Many of the disease entities gathered in this region are recognizable as components of the cardiometabolic syndrome, a clinical clustering of cardiovascular disease risk factors like obesity, hypertension, and insulin resistance [21, 22]. Notably, two neoplastic diseases, namely parathyroid neoplasms and pituitary neoplasms (orange nodes), are located in a branch shared with acromegaly, adenoma, hyperparathyroidism, and hypoparathyroidism (yellow nodes). Acromegaly is an endocrine disorder which is caused in more than 95% of the cases by benign, growth hormone producing pituitary adenoma. Other endocrine neoplasia such as parathyroid neoplasms can occur as part of an acromegaly-causing syndrome called multiple endocrine neoplasia (MEN). Hence, this branch involves endocrine disorders and known comorbidities.
Through thrombosis and thrombocytopenia, also connected to the more general class of blood platelet disorders, the top region is joined with an area containing hematological malignancies like leukemia and lymphoma (red nodes, Fig. 1) as well as several other immune system-related pathologies among others multiple sclerosis, acquired immunodeficiency syndrome, and rheumatoid arthritis (purple nodes, Fig. 1). The third subregion of the giant component contains exclusively non-hematological malignancies like liver neoplasms, brain neoplasms, and melanoma (orange colored, Fig. 1). The connection to the central part of the component occupied by immune system-related disorders occurs through hepatocellular carcinoma and glioma, which are linked with multiple myeloma.
The three medium-sized components (Fig. 2) represent developmental abnormalities, audio-visual disorders as well as neurodegenerative and psychiatric illnesses. One cluster (Fig. 2A) concatenates, among others, variants of congenital mental retardation, eye abnormalities, tooth abnormalities as well as glaucoma, cataract and renal tubular acidosis. Retinal diseases, blindness as well as hearing loss and deafness are located together in another group (Fig. 2B). In the third disease group (Fig. 2C), we find Parkinsonian disorders, Alzheimer disease, dementia, as well as bipolar disorder and alcoholism. Several of the smaller disease groups (Additional file 1) reflect the hierarchy of MeSH headings which are used for BKL disease annotation, e.g. hepatitis descriptors (group 7), ataxias (group 14), osteoporosis and postmenopausal osteoporosis (group 21), and growth disorders and dwarfism (group 31). The link between xeroderma pigmentosum and Cockayne syndrome (group 10) as well as their connection to the hair disease Trichothiodystrophy were previously discussed [13, 18, 25, 26].
To examine whether the revealed disease associations reflect common causal mechanisms, we compared GO assignments of genes in the six largest disease groups. The Gene Ontology is an extensive resource of functional annotations of genes in three main categories: Biological Process, Molecular Function and Cellular Component. The Fisher test is typically used to test for significant enrichment of GO categories in gene sets (Methods). Starting from enrichment P-values obtained with the standard test, we assigned GO biological processes to disease groups identified in this work and ranked them by a preponderance value that compares P-values of different gene sets (Methods). Beyond identification of significantly enriched biological functions, this comparative approach makes it possible to detect functional differences between gene sets even when the standard method does not assign top ranks to respective GO categories. The analysis was performed on six disease groups. We explored functional differences between disease clusters using curated GO annotation from the BKL. Calculation of preponderance values and GO term assignments were also performed with enrichment P-values reported by the DAVID Functional Annotation Tool [27, 28] as a public source of GO annotation. The first three groups used in the analysis match regions of the giant component: the top region mainly comprising cardiometabolic diseases and also including parathyroid and pituitary neoplasms, the middle region consisting of leukemia, lymphoma, and immune system-related pathologies, as well as the lower region with solid tissue neoplasms (Fig. 1). In the following, we denote these groups as clusters M, I, and C, respectively. The other three disease groups were obtained from networks shown in Fig. 2, and are in the following denoted as clusters D (Fig. 2A), P (Fig. 2B), and N (Fig. 2C). Enrichment P-values were calculated and compared for complementary sets made up of genes which were specific for each cluster.
Respectively, 337 genes, 279 genes, 683 genes, 107 genes, 82 genes, and 130 genes represented cluster M, I, C, D, P, and N. All diseases were associated with at least one gene in corresponding gene sets, except for transient ischemic attack, so that results only apply to 62 of 63 disorders in cluster M.
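The standard enrichment test mentioned above can be sketched as a one-sided Fisher exact test, which is the upper tail of a hypergeometric distribution. The function name, its arguments, and the counts are hypothetical. The preponderance value is defined in Methods and is not covered by this sketch.

```python
from math import comb


def enrichment_p_value(n_population, n_annotated, n_selected, n_hits):
    """One-sided Fisher exact P-value for over-representation of a GO category.

    n_population: all genes considered
    n_annotated:  genes in the population carrying the GO category
    n_selected:   genes in the cluster-specific gene set
    n_hits:       genes in the set that carry the GO category
    """
    max_hits = min(n_annotated, n_selected)
    total = comb(n_population, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_population - n_annotated, n_selected - k)
        for k in range(n_hits, max_hits + 1)
    )
    return tail / total


# Toy counts: 20 genes overall, 5 annotated, 4 in the gene set, 3 of them annotated
p_enrichment = enrichment_p_value(20, 5, 4, 3)
```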
Since the BKL assignments of genes to GO biological processes are manually curated, we carried out the same analysis using enrichment P-values calculated by the DAVID Functional Annotation Tool in order to validate our results with an alternative source of GO annotations. In Additional file 2 we report the top 30 biological processes associated with each disease cluster according to enrichment P-values calculated by the DAVID tool. The topics of categories assigned to disease clusters based on GO annotation of DAVID are in good agreement with those observed in the analysis of curated BKL annotation. A notable difference is the absence of cell cycle categories among the top 30 biological processes assigned to cluster C in the analysis using DAVID. Cell cycle terms were still associated with disease cluster C, albeit with lower ranks than in the BKL analysis. For instance, preponderance values calculated for DAVID enrichment P-values ranked the GO categories "regulation of cell cycle" and "cell cycle" at position 86 and 88 (data not shown), respectively, whereas they were ranked 5th and 9th in the BKL analysis (Fig. 3). Nevertheless, the top ranked biological processes assigned to disease cluster C based on either DAVID or BKL share a common theme of cell proliferation, apoptosis, angiogenesis and developmental pathways. Hence, both resources confirmed that the disease clusters feature biological processes that reflect the type of clustered disorders.
To further explore the relevance of inferred disease associations, we inspected vicinities of some selected disorders defined by a certain similarity level. Here, we made use of the statistical method to extract all diseases associated with a pathology of interest through at least two common causal genes and a similarity P-value below 0.01. In the following, we exemplify four cases: the three metabolic disorders type 1 diabetes (T1DM), type 2 diabetes (T2DM), and obesity, and the neurodegenerative disorder Parkinson disease (PD).
Disorders associated with obesity, Parkinson disease, T1DM and T2DM at a P-value threshold of 0.01 and an overlap of at least 2 genes.
Coronary Artery Disease
Amyotrophic Lateral Sclerosis
Attention Deficit Disorder with Hyperactivity
Diabetes Mellitus, Type 2
Diabetes Mellitus, Type 1
Graft vs. Host Disease
Diabetes Mellitus, Type 2
Epilepsy, Temporal Lobe
Hyperlipoproteinemia Type II
Polycystic Ovary Syndrome
In summary, the statistical analysis of causal genes enabled us to find meaningful disease associations. Interestingly, many of these associations correspond to clinical observations of comorbidities and known etiological relationships between diseases as supported by highlighted scientific literature. Examination of disease clusters revealed characteristic biological functions which confirm the causal mechanistic basis of inferred disease relationships. Altogether, our findings suggest that causal molecular mechanisms provide for an expedient principle to gain further insight into the network of human diseases.
Having a method to identify meaningful disease associations, our next goal was to apply disease similarities as a starting point for causal gene prediction. Following our previous results, we assumed that gene sets of associated disorders potentially harbor novel mechanistic components of the disease of interest. The short-list of candidate genes was then culled from associated pathologies hypothesizing that frequent co-occurrence in causal gene sets implies functional relationship with a known disease gene (Methods).
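A simplified sketch of the candidate selection follows. It gathers genes from the associated diseases and keeps those that co-occur with a known causal gene in a minimum number of diseases. The P-value filter for gene similarity, which the method also applies, is omitted here, and all names and data are hypothetical.

```python
def predict_candidate_genes(target, associated, disease_genes, min_shared_diseases=2):
    """Return candidate genes mapped to their best disease overlap with a known gene."""
    known = disease_genes[target]

    # Invert the association data: gene -> set of diseases it is causal for
    gene_diseases = {}
    for disease, genes in disease_genes.items():
        for gene in genes:
            gene_diseases.setdefault(gene, set()).add(disease)

    # Candidates are genes of associated diseases not yet known for the target
    candidates = set().union(*(disease_genes[d] for d in associated)) - known

    predictions = {}
    for candidate in candidates:
        best = max(
            (len(gene_diseases[candidate] & gene_diseases[k]) for k in known),
            default=0,
        )
        if best >= min_shared_diseases:
            predictions[candidate] = best
    return predictions


# Toy association data
disease_genes = {
    "target": {"G1", "G2"},
    "near_1": {"G1", "G2", "G3"},
    "near_2": {"G1", "G3", "G4"},
    "far": {"G2", "G5"},
}
predicted = predict_candidate_genes("target", ["near_1", "near_2"], disease_genes)
```

Here `G3` is proposed because it shares two diseases with the known gene `G1`, whereas `G4` shares only one disease with any known gene and is dropped.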
Causal genes predicted for T1DM, T2DM and obesity and supporting literature referenced by PubMed identifiers.
By manual literature research we could verify the majority of predictions as shown by the PubMed identifiers of relevant research articles given next to corresponding candidate gene symbols. Corroboration of our predictions was least successful for T1DM with 6 of 13 candidate genes left unverified, whereas only 3 of 20 genes proposed to be involved in T2DM, namely ADRB1, IL2, and ITGA2B, were not confirmed.
As an additional step, we performed network analysis of signal transduction molecules encoded by known and predicted causal genes using the network cluster algorithm of ExPlain™. The algorithm constructs signaling pathways connecting as many molecules from an input set as possible with a distance constraint for reaction cascades. As a result, input molecules are clustered into networks of two or more molecules. These network clusters can be visualized and subjected to other bioinformatic analyses. In our pursuit, the application served two purposes. Firstly, molecular pathways point out potential mechanisms by which known and predicted causal components exert a common function. Secondly, signaling cascades may allude to additional, previously unknown constituents of disease mechanisms. In the following, we examined network clusters of known and predicted causal components of T1DM as well as T2DM.
ExPlain™ reported two network clusters for T1DM. In the following, we provide gene symbols in parentheses where these differ from the protein names reported in ExPlain™ networks. A small cluster consisted of the known causal component CD154 (CD40LG) and the predicted molecule alpha IIb-integrin encoded by ITGA2B (Table 2), providing computational evidence for a role of ITGA2B in T1DM. Additional file 4 shows the larger cluster including ten known causal components (red nodes) and the novel component IGF-2 (IGF2) (green node) connected by other molecules (blue nodes) through activating (black arrows) or inhibitory (red arrows) reactions. By manual literature research, we verified involvement of SOCS3, Jak1 (JAK1), and SHP-1 (PTPN6) in T1DM. Notably, Grb-14 (GRB14) and PTP1B (PTPN1) are known molecular constituents of insulin resistance [43, 44] and development of PTP1B inhibitors for therapeutic modulation of insulin sensitivity is an active field of research. While PTP1B and Grb-14 functions were mainly explored with regard to their causal role in T2DM and obesity, the prevalence of insulin resistance in conjunction with type 1 diabetes has recently gained attention [46, 47].
We further obtained two network clusters of T2DM molecules shown in Additional files 5 and 6. In a small network (Additional file 5), known causal components ADA and CD26 (DPP4) form a cascade with the predicted causal component CD44 (green node) connected by RANTES (CCL5) (blue node), which harbors promoter polymorphisms associated with type 2 diabetes. The larger network (Additional file 6) comprises 19 known causal components (red nodes) and 5 predicted components, namely activated protein C (PROC), alpha-IIb integrin (ITGA2B), Cu-ZnSOD (SOD1), IRS-2 (IRS2) and IL-2 (IL2) (green nodes). Moreover, scientific literature supports a mechanistic role of several network components, such as PKCdelta (PRKCD), PKCtheta (PRKCQ), GSK3 (GSK3B), p85 (PIK3R1), Rac1 (RAC1), p65PAK (PAK1), Akt (AKT1), PDK1 (PDPK1), and mTOR (MTOR).
Taken together, disease and gene associations successfully predicted causal genes for obesity, T1DM, and T2DM, and scientific literature verified the majority of proposed candidates. Molecular network analysis of T1DM and T2DM gene sets then suggested signal transduction cascades connecting predicted and known causal proteins encoded by respective genes. Additional constituents of causal disease mechanisms were inferred along with molecular pathways and a good part of them (more than 1/3) were supported by literature evidence. Notably, many of the cited research articles investigated respective causal components as therapeutic targets for T1DM or T2DM. These results demonstrate the utility of causal mechanism-based disease analysis for inference of novel disease genes.
We examined how the number of predictions correlated with P-value thresholds and observed an approximately linear dependence in all three examples (Fig. 4B, D, and 4F). This shows that the P-values effectively controlled the number of predictions, albeit the rates of change were not the same for the three disorders.
Cross-validation (CV) was performed to evaluate the robustness of our results with perturbed disease/gene association data. We carried out 20 rounds of cross-validation for obesity, T1DM, and T2DM by removing 5 randomly chosen genes from their causal gene sets. Poisson parameters were re-estimated for diseases and for genes based on 105 random gene sets and 105 random disease sets, respectively, using the modified association data. Subsequently, we predicted causal genes for each of the disorders at a P-value threshold of 0.01 and an overlap threshold of 2.
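The sampling step of this procedure can be sketched as follows. Each round withholds five randomly chosen causal genes and keeps the rest for prediction. Re-estimation of the Poisson parameters and the prediction itself are not shown, and the gene identifiers, function name, and seed are hypothetical.

```python
import random


def cross_validation_samples(causal_genes, n_rounds=20, n_removed=5, seed=0):
    """Return (retained genes, withheld genes) pairs, one per CV round."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genes = sorted(causal_genes)
    samples = []
    for _ in range(n_rounds):
        withheld = set(rng.sample(genes, n_removed))
        samples.append((set(genes) - withheld, withheld))
    return samples


# Toy causal gene set with 30 placeholder identifiers
toy_genes = {f"GENE{i}" for i in range(30)}
cv_samples = cross_validation_samples(toy_genes)
```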
In summary, cross-validation demonstrated that the method robustly produced a limited set of genes which preferably included the originally reported gene predictions. We could observe effects of sampling random disease or gene sets in the empirical estimation of Poisson parameters. Increasing the number of random sets would mitigate the variability inherent to the sampling procedure. This does not represent a major drawback, since the regression analysis needs to be conducted only once and parameter estimates can then be used in subsequent comparisons. Furthermore, we assume that recovery of known disease genes was limited to a certain fraction of the causal gene sets due to the low coverage of underlying disease/gene associations. This suggests that other types of information such as molecular interactions could complement the predictions entirely based on disease/gene associations.
We therefore tested the utility of combining predictions derived from disease/gene associations with a method that employs molecular interactions. The GeneWanderer is a tool that applies a global network distance measure to rank candidate genes according to their context to known disease genes in a network. The algorithm assigns a distance value to candidate genes based on a random walk with restart (RWR), which reflects how well candidate genes are connected to modules of disease genes. RWR distance values are higher for genes that are well connected to known disease genes and were successfully applied to prioritize candidate genes. We compared the ranks of distance values obtained with and without inclusion of genes predicted on the basis of disease/gene associations for known disease genes that were omitted from the cross-validation samples. Thus, for each of the left-out disease genes, we calculated RWR distance values using either only the truncated set of disease genes or predicted genes in addition. In the following, we denote RWR distance values as network score. An interaction network containing 10486 genes and 109089 interactions was compiled from the IntAct, BioGRID, and Reactome databases (Additional file 7). We used the ranks of network scores of the test disease genes among all genes of the network to compare the performance with and without addition of genes predicted by disease/gene associations in each CV sample.
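The idea behind a random walk with restart can be shown with a generic sketch. It is not the GeneWanderer implementation: the restart probability, convergence tolerance, and toy network are assumptions for illustration. The walker starts at the known disease genes (seeds), moves to a random neighbour at each step, and returns to the seeds with probability `restart`.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node of the network.

    adjacency maps each node to the set of its neighbours (undirected network).
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # Restart term: probability mass that jumps back to the seed genes
        nxt = {n: restart * start[n] for n in nodes}
        # Walk term: each node spreads its remaining mass evenly over its neighbours
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                continue  # isolated nodes pass nothing on
            share = (1.0 - restart) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Toy chain network A - B - C - D with A as the only known disease gene
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
network_scores = random_walk_with_restart(toy_network, seeds={"A"})
```

In the toy chain, the direct neighbour of the seed receives a higher score than more distant genes, consistent with higher values for genes that are well connected to known disease genes.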
Knowledge about components of causal mechanisms has proven useful to analyze relationships between human diseases. The definition of causal genes applied in the BKL includes genotype associations, yet also covers other sources of evidence for involvement in causal molecular systems. This may be of importance taking into account that the activity of gene products in the context of molecular networks may bias the ability of genes to harbor pathological mutations as shown by several studies [59–61]. Probably, more or less complex patterns of genetic variation contribute to every disease. However, their functional effects become manifest in molecular interactions, where networks of proteins, yet also protein/DNA interactions, take an important part and genetic alterations are one of many possibilities to induce derangement.
On the foundation of causal disease/gene associations, we built a method to quantify the similarity of two diseases that accounts for unequal frequencies of genes in the entire set of associations. To provide a familiar and intuitive quantity, the presented method reports a P-value for the overlap of causal gene sets. At standard P-value thresholds, 0.001 and 0.01, the statistical analysis revealed meaningful disease associations as demonstrated by constructing a map of human diseases, GO analysis of disease clusters and inspection of vicinal disorders for obesity, T1DM, T2DM and PD at the lower threshold.
Human disease networks were previously studied with respect to physiological disease classes [10, 14]. Here, we first constructed the disease map and afterwards demonstrated that biological processes coincided with well-known attributes of clustered disorders as confirmed by manually curated GO biological process annotation of the BKL as well as GO annotations available through the DAVID Functional Annotation Tool. Furthermore, we observed that the giant component of our disease map as well as the vicinities of obesity, T1DM, and T2DM clustered components of the cardiometabolic syndrome. Notably, the disease vicinities that were explored in more detail reflected not only similarities between obesity, T1DM, and T2DM, but captured also specific relationships such as connections to immunological disorders in the T1DM vicinity. When we applied causal disease associations to predict new disease genes, the majority of predictions proposed for obesity, T1DM, and T2DM could be supported by references to scientific literature. Altogether, our results corroborate that the inferred disease associations reflect common molecular mechanisms and indicate applications for disease gene prediction as well as disease classification and definition.
Limitations of current standard classifications were previously challenged with regard to molecular or complex systems approaches [63, 64]. Protein-protein interactions and molecular pathways have been employed to identify disease relationships [13, 14]. So far our method for disease comparison left molecular interactions unspecified and addressed only the overlap of causal gene sets. A first step towards incorporating molecular networks into our analysis was taken by clustering causal components in signal transduction networks. As evidenced by scientific literature, the network clusters contained many known causal components and therapeutic targets. Furthermore, we translated the analysis of disease associations and of causal gene associations into a method to select for new disease genes. While disease gene prediction entirely based on experimentally verified disease/gene associations faced limitations originating from the sparseness of the available data, we could show that a combination of disease associations and molecular network analysis enhanced the possibility to identify new disease genes. Incorporation of molecular interactions is therefore an important area for further development, where greater fidelity with molecular systems that underlie disease mechanisms can be achieved.
We would like to point out that our method for prediction of causal genes relied on four cut-off values consisting of a P-value and a minimal overlap parameter in the first and the second step (Fig. 9). It may be difficult to tune each of the parameters to achieve optimal results. Throughout this work, we set identical cut-offs for disease and gene similarity and confined analyses to standard P-value thresholds (0.001 or 0.01). Furthermore, the overlap threshold was always set to a small value of 2 with the purpose of controlling a minimal level of shared causal genes or diseases. We think that with this setting the overlap cut-off sufficiently complements the P-value. As demonstrated, the number of false positives grows linearly with the P-value threshold (Fig. 4), so that this parameter lends itself to further adjust the algorithm. One possibility is to choose a number of predictions admissible for validation and to select a P-value cut-off that satisfies this constraint. For this type of approach our method offers, in addition to using a common value for disease and gene similarity, the possibility to fix the disease similarity threshold (e.g. to 0.01) and to subsequently rank predicted causal genes according to overlap P-values with known disease genes. Furthermore, the method achieved best precision for obesity, T1DM, and T2DM at P-value thresholds of 0.005, 0.003, and 0.01, respectively. At a P-value of 0.005 the observed precision for T1DM and T2DM was still at least 75%. This indicates that the range from 0.001 to 0.01 is suitable to choose a threshold for causal gene prediction and suggests 0.005 as a possible starting point.
In our efforts to verify the inferred disease associations, we were able to highlight many instances of known clinical associations and comorbidities, suggesting that these are special cases of related pathologies sharing causal mechanisms which also connect them etiologically. Interestingly, these findings inversely confirm a previous study, where co-occurrence of disorders in medical records was used to predict genetic overlap. Validation of disease co-prevalences often requires laborious population studies. According to the results of this work, the decision to mount a study could be supported by testing hypotheses about disease associations computationally. Simultaneously, shared causal components provide insight into the molecular basis of etiological disease relationships and suggest potential diagnostic markers.
So far, different methods have been proposed to investigate human disease associations. A main difference lies in the representation of disease entities by features that are eventually compared to obtain a figure of similarity. While this work focused on genes that were manually classified as causal disease genes, other approaches used clinical characteristics, phenotypes, or genes and pathways [5, 8, 14]. Each choice of feature representation involves advantages and disadvantages with respect to quality, coverage, or detail of information. For instance, manual curation promises greater quality than computationally derived annotation, but its coverage is often inferior. Furthermore, associated genes capture disease components in finer detail than descriptions of clinical characteristics, but we assume that for a disease the latter are more often defined than associated genes. It is therefore of importance to compare the different approaches to recognize and validate strengths and weaknesses.
To the best of our knowledge, no reference set has been established to systematically examine the ability of different methods to correctly identify disease associations, so that a necessary step towards such a comparison is to assemble a set of known disease links.
Another future direction will be to combine different levels of information such as causal genes, affected biological processes, and clinical characteristics to gain further insight into disease subtypes and corresponding mechanisms. Of interest are phenotypically similar disease subtypes that present different molecular mechanisms as well as similarities on the molecular level that cannot be mapped to known clinical characteristics. Identification of such disease subtypes and of "hidden" disease similarities may open new avenues to develop therapeutic approaches for respective disorders.
We developed a novel approach to analyze human disease associations and demonstrated its utility in several application areas. Causal molecular mechanisms present a unifying principle for disease classification and definition, analysis of clinical disorder associations, as well as prediction of disease genes, therapeutic targets and diagnostic markers. According to the definition of causal disease genes applied in this study, these results are not restricted to genetic disease/gene relationships. This may be particularly useful for the study of long-term or chronic illnesses, where pathological derangement due to environmental factors or arising as part of sequel conditions is of importance and may not be fully explained by genetic background. The possibility to identify common molecular mechanisms for clinically associated disorders enables further insight into disease interactions. First steps in that direction were presented in this work for obesity and diabetic disorders, as constituents of the cardiometabolic syndrome. An important conclusion from this work is that components of molecular mechanisms characterize associated diseases. Using this knowledge enables identification of disease associations, which reflect common molecular mechanisms, and provides a starting point to identify missing causal components. Making use of such disease associations and consideration of knowledge about molecular interactions can be combined to handle limitations imposed by the sparseness of experimentally verified, curated disease/gene associations. Future lines of research will include incorporation of molecular interactions into the method for disease comparison and development of software tools that exploit the findings of this work.
Manually collected information on causal disease/gene associations was obtained from the BIOBASE Knowledge Library™ (BKL). The BKL groups disease/gene associations into four types (correlative, causal, preventative, and negative), depending on the conclusion that can be drawn from a relevant research article. In this study, we used only associations of the causal and of the preventative type. Causal relationships are derived from experiments that confirm or suggest the hypothesis that a gene encodes a product whose deranged activity entails a disease or a certain condition as part of a disease. The derangement may be inheritable or emerge during disease onset or progression. Preventative disease/gene associations additionally indicate that experimental evidence is available for a therapeutic effect of modulating the deranged activity. In this article, we denote the respective genes as causal genes. To ensure a certain level of annotation for disease comparison, we considered only disease entities with at least five causal genes. The final data set comprised 375 diseases and 3051 causal genes connected by a total of 9871 disease/gene associations. Data used in this study are available upon request.
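The selection steps described above can be illustrated with a minimal Python sketch. It assumes that associations are available as (disease, gene, type) records; this record layout, the function name build_dataset and the toy entries are hypothetical and do not reflect the BKL export format or the actual data.

```python
from collections import defaultdict

# Association types retained in the study; correlative and negative are discarded.
KEPT_TYPES = {"causal", "preventative"}
MIN_CAUSAL_GENES = 5  # annotation threshold stated in the text


def build_dataset(associations, min_genes=MIN_CAUSAL_GENES):
    """Return {disease: set of causal genes} for sufficiently annotated diseases.

    `associations` is an iterable of (disease, gene, association_type) tuples.
    This layout is an assumption made for illustration only.
    """
    genes_by_disease = defaultdict(set)
    for disease, gene, assoc_type in associations:
        if assoc_type in KEPT_TYPES:
            genes_by_disease[disease].add(gene)
    # Keep only disease entities with at least `min_genes` causal genes.
    return {d: g for d, g in genes_by_disease.items() if len(g) >= min_genes}


# Toy records, invented for illustration.
toy = [("Disease A", f"GENE{i}", "causal") for i in range(4)]
toy += [("Disease A", "GENE4", "preventative"), ("Disease A", "GENE5", "correlative")]
toy += [("Disease B", f"GENE{i}", "causal") for i in range(3)]

dataset = build_dataset(toy)
n_associations = sum(len(genes) for genes in dataset.values())
```

In this toy example, Disease A is retained with five causal genes (the correlative record is ignored), whereas Disease B falls below the threshold and is dropped.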
The BKL specifies diseases using MeSH descriptors, which constitute a hierarchy of broader and narrower subject headings. The hierarchical structure of MeSH descriptors allows for incorporation of disease-related scientific information into the knowledgebase, even when the disorder to which an article pertains is not distinguishable at the most specific level. Inference of disease associations performed in this work ignored hierarchical dependencies. Hence, it was anticipated that similarities between some disorders merely reflect the underlying MeSH structure.
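Because hierarchical dependencies were ignored, it can be useful to flag disease pairs whose similarity may stem from the MeSH structure alone. The following Python sketch is one possible check and was not part of the described method. It relies on the MeSH convention that a descriptor carries one or more dot-separated tree numbers and that a narrower heading extends the tree number of its broader heading. The function names and the tree numbers shown are placeholders chosen for illustration, not actual MeSH entries.

```python
def is_ancestor(tree_a, tree_b):
    """True if tree number `tree_a` is a proper ancestor of `tree_b`."""
    # The trailing dot prevents "C01.10" from matching "C01.100".
    return tree_b.startswith(tree_a + ".")


def hierarchically_related(trees_a, trees_b):
    """True if any tree number of one descriptor is an ancestor of one of the other.

    Each argument is the collection of tree numbers of one descriptor,
    since a descriptor may occupy several positions in the hierarchy.
    """
    return any(
        is_ancestor(a, b) or is_ancestor(b, a)
        for a in trees_a
        for b in trees_b
    )


# Placeholder tree numbers, invented for illustration.
broad_disease = ["C01.100"]
narrow_disease = ["C01.100.250", "C02.700.300"]
unrelated_disease = ["C03.400"]

related = hierarchically_related(broad_disease, narrow_disease)
not_related = hierarchically_related(broad_disease, unrelated_disease)
```

Pairs flagged in this way could be interpreted with caution, since shared causal genes are expected when one disease entity is a narrower heading of the other.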
The functions lm and summary.lm of the R statistical computing environment were used to estimate regression models and to calculate AIC values.
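For readers unfamiliar with AIC-based model comparison, the following Python sketch shows the principle for an ordinary least-squares fit with Gaussian errors. It is not the R code used in the study, and the models compared in the study are not specified here. The toy data and the function names are assumptions. The AIC is computed as -2 log L + 2k, where k counts the regression coefficients plus the error variance.

```python
import math


def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x; returns the fit and its RSS."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


def gaussian_aic(rss, n, n_coefficients):
    """AIC of a least-squares model with Gaussian errors (maximized likelihood)."""
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    # k = regression coefficients + 1 for the estimated error variance.
    return -2 * log_lik + 2 * (n_coefficients + 1)


# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
n = len(x)

# Intercept-only model: the fitted value is the mean of y.
mean_y = sum(y) / n
rss_null = sum((yi - mean_y) ** 2 for yi in y)
aic_null = gaussian_aic(rss_null, n, n_coefficients=1)

# Model with intercept and slope.
intercept, slope, rss_linear = fit_simple_linear(x, y)
aic_linear = gaussian_aic(rss_linear, n, n_coefficients=2)
```

The model with the lower AIC is preferred; for the toy data above this is the model that includes the slope.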
In the course of disease gene prediction, the described procedure was adopted to estimate P-values for the number of diseases shared by causal genes.
Functional characteristics of disease clusters were analyzed with Gene Ontology Biological Processes. Among the three GO vocabularies for biological processes, cellular components, and molecular functions, we selected the biological process ontology because its terms were deemed to best represent molecular mechanisms that may be targets of derangement in disease. Biological process terms describe biological objectives accomplished via one or more ordered assemblies of molecular functions. In the majority of cases, more than one gene contributes to a biological process, whereas GO molecular functions denote biochemical activities of individual genes. Several biological processes are well-known targets of disease mechanisms, such as cell cycle (GO:0007049) in cancer or immune response (GO:0006955) in autoimmune or infectious diseases.
Signal transduction networks of molecules encoded by known and predicted causal genes were constructed with the network cluster tool of ExPlain™. The algorithm searches for shortest paths of at most three reaction steps between input molecules. Information about signaling reactions is taken from the TRANSPATH™ database. The ExPlain tool identifies networks, so-called network clusters, which connect a maximal number of input molecules given the distance constraint.
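The distance-constrained path search can be illustrated with a breadth-first search. The Python sketch below is not the ExPlain algorithm, whose implementation is not described here. It assumes that reactions are given as a directed adjacency mapping and that input molecules are grouped whenever a path of at most three steps links them in either direction. The function names and the toy network are hypothetical.

```python
from collections import deque

MAX_STEPS = 3  # distance constraint stated in the text


def shortest_path(graph, source, target, max_steps=MAX_STEPS):
    """Breadth-first search for a shortest path of at most `max_steps` edges."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if len(path) - 1 == max_steps:
            continue  # do not extend paths that already reached the step limit
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def network_clusters(graph, inputs, max_steps=MAX_STEPS):
    """Group input molecules linked by short paths; clusters are sorted largest first."""
    inputs = list(inputs)
    cluster_of = {m: {m} for m in inputs}
    for i, a in enumerate(inputs):
        for b in inputs[i + 1:]:
            if shortest_path(graph, a, b, max_steps) or shortest_path(graph, b, a, max_steps):
                merged = cluster_of[a] | cluster_of[b]
                for m in merged:
                    cluster_of[m] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(unique, key=len, reverse=True)


# Toy reaction network, invented for illustration (upper case = input molecules).
graph = {
    "A": ["x"], "x": ["B"],
    "B": ["y"], "y": ["z"], "z": ["C"],
    "D": ["p"], "p": ["q"], "q": ["r"], "r": ["E"],
}
clusters = network_clusters(graph, ["A", "B", "C", "D", "E"])
path_ab = shortest_path(graph, "A", "B")
path_de = shortest_path(graph, "D", "E")
```

In the toy network, A, B and C form one cluster because A reaches B in two steps and B reaches C in three, whereas D and E remain separate because four steps lie between them.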
Layout and visualization of disease networks as well as disease/gene networks were performed with the yEd graph editor developed by yWorks.
This study was supported by the European Union grants EURODIA (LSHM-CT-2006-518153) and Gen2Phen (FP7 no: 200754). We thank our anonymous reviewers for their comments, which helped to improve this article.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.