- Methodology article
- Open Access
Molecular mechanistic associations of human diseases
© Stegmaier et al; licensee BioMed Central Ltd. 2010
- Received: 18 December 2009
- Accepted: 6 September 2010
- Published: 6 September 2010
The study of relationships between human diseases provides new possibilities for biomedical research. Recent achievements on human genetic diseases have stimulated interest in deriving methods to identify disease associations in order to gain further insight into the network of human diseases and to predict disease genes.
Using about 10000 manually collected causal disease/gene associations, we developed a statistical approach to infer meaningful associations between human morbidities. The derived method clustered cardiometabolic and endocrine disorders, immune system-related diseases, solid tissue neoplasms and neurodegenerative pathologies into prominent disease groups. Analysis of biological functions confirmed characteristic features of corresponding disease clusters. Inference of disease associations was further employed as a starting point for prediction of disease genes. Efforts were made to underpin the validity of results by relevant literature evidence. Interestingly, many inferred disease relationships correspond to known clinical associations and comorbidities, and several predicted disease genes were subjects of therapeutic target research.
Causal molecular mechanisms present a unifying principle to derive methods for disease classification, analysis of clinical disorder associations, and prediction of disease genes. According to the definition of causal disease genes applied in this study, these results are not restricted to genetic disease/gene relationships. This may be particularly useful for the study of long-term or chronic illnesses, where pathological derangement due to environmental factors or as part of sequel conditions is of importance and may not be fully explained by genetic background.
- Gene Ontology
- Disease Gene
- Causal Gene
- Disease Association
Diseases and accompanying symptoms are spawned by systems of molecules, which operate within and across cell and tissue boundaries. A major goal of medical research is to identify the molecular components which play a role in causing a pathological condition. Since the first seminal achievements, events at the molecular level have been recognized as key to understanding disease mechanisms.
Phenotype/genotype associations provide evidence for a role of affected gene products in respective causal mechanisms and extensive resources document medically relevant gene variants [2, 3]. Recent studies on hereditary phenotypes have shown that similarities among disorders imply involvement of functionally related gene products, summarized as "phenotypic overlap implies genetic overlap". The modular nature of human genetic diseases suggests that modules of similar disorders, also denoted as disease subnetworks, can be juxtaposed with modules of molecules which commonly contribute to a biological function, or interact in molecular complexes or pathways [4–7]. Several studies support the modularity concept and it was successfully applied to derive computational approaches for prediction of candidate genes as well as functional links between molecules [8–12].
It is now clear that analysis of disease relationships unfolds new opportunities for both medical and biological research. Several aforementioned works determined pairwise disorder similarity with a score derived from text-mining of OMIM phenotype descriptions. Rzhetsky et al. analyzed associations among 161 diseases based on their co-occurrence in patient records. Possibilities to correlate diseases through protein interaction networks or molecular pathways were also explored [13, 14]. Sam et al. used relations between proteins, Gene Ontology (GO), and phenotypes established in the PhenoGO NLP system together with Reactome protein interactions to find diseases involving common protein-protein interaction networks such as xeroderma pigmentosum and Cockayne syndrome, for which a functional link was previously discussed. Li and Agarwal obtained disease/gene associations through literature mining of MEDLINE abstracts and constructed a network of diseases which share common molecular pathways. In this network they identified novel disease relationships and observed that a disease is linked to several pathways and a pathway is linked to several diseases.
We present a novel approach to analyze mechanistic relationships between human diseases. Using about 10000 causal disease/gene associations annotated in the BIOBASE Knowledge Library (BKL), we developed a statistical method that quantifies pairwise similarity between disorders. Connecting diseases at a certain significance threshold, the statistical approach revealed groups of diseases which feature characteristic biological functions. So far, computationally inferred disease relationships were mainly examined with regard to shared molecular networks. Yet, many disease associations reported in this work correspond to known clinical associations and causal links between pathologies. Furthermore, we used disease associations and gene associations to predict causal disease genes. The results suggest that analysis of causal mechanisms provides a unified framework for disease classification and discovery of causal components, and can be used to obtain computational evidence for clinical disease associations as well as hypotheses about their molecular foundation.
A molecular mechanistic map of human diseases
We extracted disease/gene associations which had been manually classified as causal or preventative from the BIOBASE Knowledge Library™ (Methods). In the following, we denote respective genes as causal genes. The data set comprised 375 diseases which were connected to at least 5 of 3051 causal genes by a total of 9871 disease/gene associations. Similarity of involved molecular mechanisms for each disease pair was assessed by calculating the number of common causal genes and the corresponding P-value as described in Methods.
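The pairwise comparison described above can be illustrated with a minimal sketch. It counts common causal genes for every disease pair and attaches an upper-tail Poisson probability to each count. The actual P-value calculation is defined in Methods; the fixed Poisson mean `expected_overlap`, the function names, and the toy disease and gene identifiers below are assumptions made only for illustration.

```python
import math
from itertools import combinations


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def pairwise_overlaps(disease_genes):
    """Yield (disease_a, disease_b, number of common causal genes)."""
    for a, b in combinations(sorted(disease_genes), 2):
        yield a, b, len(disease_genes[a] & disease_genes[b])


# Hypothetical toy data: each disease has at least five causal genes.
disease_genes = {
    "D1": {"G1", "G2", "G3", "G4", "G5"},
    "D2": {"G2", "G3", "G4", "G6", "G7"},
    "D3": {"G8", "G9", "G10", "G11", "G12"},
}

# Assumed expected overlap under the null model (not a value from the paper).
expected_overlap = 0.1

for a, b, k in pairwise_overlaps(disease_genes):
    print(a, b, k, poisson_tail(k, expected_overlap))
```

In a real analysis the Poisson mean would depend on the sizes of the two gene sets and on gene frequencies, rather than being a single constant as assumed here.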
We first constructed a map connecting all diseases with a minimum of two common genes and a maximal similarity P-value of 0.001. This map consisted of one giant component with 123 disease nodes, three medium-sized components with 14, 12, and 10 nodes as well as 29 small components with two to six nodes. In total, there were 239 of the 375 diseases, so that 136 diseases were not connected to any other at the required similarity threshold.
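As a rough sketch of how such a map can be assembled, the following code connects disease pairs that satisfy both thresholds and then lists the connected components by size. The input tuples, function names, and toy values are hypothetical; only the thresholds (at least two common genes, P-value of at most 0.001) come from the text.

```python
from collections import defaultdict


def build_map(pair_stats, min_overlap=2, max_p=0.001):
    """Build an undirected disease graph from (a, b, overlap, p_value) tuples."""
    graph = defaultdict(set)
    for a, b, overlap, p_value in pair_stats:
        if overlap >= min_overlap and p_value <= max_p:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of the graph, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical pair statistics: (disease_a, disease_b, common genes, P-value).
pair_stats = [
    ("A", "B", 3, 1e-5),
    ("B", "C", 2, 5e-4),
    ("C", "D", 1, 1e-6),   # rejected: only one common gene
    ("E", "F", 4, 1e-4),
    ("A", "G", 5, 0.02),   # rejected: P-value above the threshold
]

components = connected_components(build_map(pair_stats))
print([sorted(c) for c in components])
```

Diseases that pass neither criterion with any partner (D and G in the toy data) never enter the graph, which mirrors the unconnected diseases reported above.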
We tested whether the number of 239 diseases connected at the chosen P-value threshold was statistically significant. For this, we calculated false discovery rates (FDR) for P-values of disease pairs with at least two common causal genes using the fdrtool package. According to fdrtool, the P-value cut-off 0.001 corresponded to a false discovery rate of 0.024. Hence, the disease connections were statistically significant also after multiple testing correction. For comparison, 282 disorders were connected at an FDR of 0.05 (P-value 3.73e-3).
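To illustrate the idea of multiple testing correction for a list of pairwise P-values, the sketch below applies the Benjamini-Hochberg step-up procedure. This is a different and simpler estimator than the one implemented in fdrtool, so it will not reproduce the values reported above; the function name and the example P-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values of four disease pairs.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
print(q_values)
```

A disease pair would then be retained if its adjusted value falls below the chosen FDR level, for example 0.05.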
In the giant component, diseases are congregated in three subregions. The top of the network (yellow colored, Fig. 1) comprises mostly muscular, cardiovascular and metabolic diseases such as diabetic disorders, obesity, myopathies and heart failure, but also stroke and brain ischemia. Many of the disease entities gathered in this region are recognizable as components of the cardiometabolic syndrome, a clinical clustering of cardiovascular disease risk factors like obesity, hypertension, and insulin resistance [21, 22]. Notably, two neoplastic diseases, namely parathyroid neoplasms and pituitary neoplasms (orange nodes), are located in a branch shared with acromegaly, adenoma, hyperparathyroidism, and hypoparathyroidism (yellow nodes). Acromegaly is an endocrine disorder which is caused in more than 95% of the cases by benign, growth hormone-producing pituitary adenoma. Other endocrine neoplasia such as parathyroid neoplasms can occur as part of an acromegaly-causing syndrome called multiple endocrine neoplasia (MEN). Hence, this branch involves endocrine disorders and known comorbidities.
Through thrombosis and thrombocytopenia, also connected to the more general class of blood platelet disorders, the top region is joined with an area containing hematological malignancies like leukemia and lymphoma (red nodes, Fig. 1) as well as several other immune system-related pathologies among others multiple sclerosis, acquired immunodeficiency syndrome, and rheumatoid arthritis (purple nodes, Fig. 1). The third subregion of the giant component contains exclusively non-hematological malignancies like liver neoplasms, brain neoplasms, and melanoma (orange colored, Fig. 1). The connection to the central part of the component occupied by immune system-related disorders occurs through hepatocellular carcinoma and glioma, which are linked with multiple myeloma.
The three medium-sized components (Fig. 2) represent developmental abnormalities, audio-visual disorders as well as neurodegenerative and psychiatric illnesses. One cluster (Fig. 2A) concatenates, among others, variants of congenital mental retardation, eye abnormalities, tooth abnormalities as well as glaucoma, cataract and renal tubular acidosis. Retinal diseases, blindness as well as hearing loss and deafness are located together in another group (Fig. 2B). In the third disease group (Fig. 2C), we find Parkinsonian disorders, Alzheimer disease, dementia, as well as bipolar disorder and alcoholism. Several of the smaller disease groups (Additional file 1) reflect the hierarchy of MeSH headings which are used for BKL disease annotation, e.g. hepatitis descriptors (group 7), ataxias (group 14), osteoporosis and postmenopausal osteoporosis (group 21), and growth disorders and dwarfism (group 31). The link between xeroderma pigmentosum and Cockayne syndrome (group 10) as well as their connection to the hair disease Trichothiodystrophy were previously discussed [13, 18, 25, 26].
To examine whether the revealed disease associations reflect common causal mechanisms, we compared GO assignments of genes in the six largest disease groups. The Gene Ontology is an extensive resource of functional annotations of genes in three main categories: Biological Process, Molecular Function and Cellular Component. The Fisher test is typically used to test for significant enrichment of GO categories in gene sets (Methods). Starting from enrichment P-values obtained with the standard test, we assigned GO biological processes to disease groups identified in this work and ranked them by a preponderance value that compares P-values of different gene sets (Methods). Beyond identification of significantly enriched biological functions, this comparative approach makes it possible to detect functional differences between gene sets even when the standard method does not assign top ranks to respective GO categories. We explored functional differences between disease clusters using curated GO annotation from the BKL. Calculation of preponderance values and GO term assignments were also performed with enrichment P-values reported by the DAVID Functional Annotation Tool [27, 28] as a public source of GO annotation. The first three groups used in the analysis match regions of the giant component: the top region mainly comprising cardiometabolic diseases also including parathyroid and pituitary neoplasms, the middle region constituted of leukemia, lymphoma, and immune system-related pathologies, as well as the lower region with solid tissue neoplasms (Fig. 1). In the following, we denote these groups as clusters M, I, and C, respectively. The other three disease groups were obtained from networks shown in Fig. 2, and are in the following denoted as clusters D (Fig. 2A), P (Fig. 2B), and N (Fig. 2C). Enrichment P-values were calculated and compared for complementary sets made up of genes which were specific for each cluster.
Respectively, 337 genes, 279 genes, 683 genes, 107 genes, 82 genes, and 130 genes represented cluster M, I, C, D, P, and N. All diseases were associated with at least one gene in corresponding gene sets, except for transient ischemic attack, so that results only apply to 62 of 63 disorders in cluster M.
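The standard enrichment test mentioned above can be sketched as a one-sided Fisher exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch covers only this standard test; the preponderance value is defined in Methods and is not reproduced here. The function name, argument names, and example counts are assumptions.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_hits):
    """One-sided Fisher exact test for over-representation.

    n_universe:  number of genes in the background set
    n_annotated: background genes annotated with the GO category
    n_selected:  genes in the cluster-specific gene set
    n_hits:      selected genes annotated with the GO category
    Returns P(X >= n_hits) under the hypergeometric null model.
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 3 annotated, 3 selected, 3 hits.
print(enrichment_p_value(10, 3, 3, 3))
```

Exact integer arithmetic via `math.comb` avoids rounding problems for small P-values, at the cost of speed for very large gene sets.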
Since the BKL assignments of genes to GO biological processes are manually curated, we carried out the same analysis using enrichment P-values calculated by the DAVID Functional Annotation Tool in order to validate our results with an alternative source of GO annotations. In Additional file 2 we report the top 30 biological processes associated with each disease cluster according to enrichment P-values calculated by the DAVID tool. The topics of categories assigned to disease clusters based on GO annotation of DAVID are in good agreement with those observed in the analysis of curated BKL annotation. A notable difference is the absence of cell cycle categories among the top 30 biological processes assigned to cluster C in the analysis using DAVID. Cell cycle terms were still associated with disease cluster C, albeit with lower ranks than in the BKL analysis. For instance, preponderance values calculated for DAVID enrichment P-values ranked the GO categories "regulation of cell cycle" and "cell cycle" at position 86 and 88 (data not shown), respectively, whereas they were ranked 5th and 9th in the BKL analysis (Fig. 3). Nevertheless, the top ranked biological processes assigned to disease cluster C based on either DAVID or BKL share a common theme of cell proliferation, apoptosis, angiogenesis and developmental pathways. Hence, both resources confirmed that the disease clusters feature biological processes that reflect the type of clustered disorders.
To further explore the relevance of inferred disease associations, we inspected vicinities of some selected disorders defined by a certain similarity level. Here, we made use of the statistical method to extract all diseases associated with a pathology of interest through at least two common causal genes and a similarity P-value below 0.01. In the following, we exemplify this for three metabolic disorders, namely type 1 diabetes (T1DM), type 2 diabetes (T2DM), and obesity, and for the neurodegenerative disorder Parkinson disease (PD).
Disorders associated with obesity, Parkinson disease, T1DM and T2DM at a P-value threshold of 0.01 and an overlap of at least 2 genes.
Coronary Artery Disease
Amyotrophic Lateral Sclerosis
Attention Deficit Disorder with Hyperactivity
Diabetes Mellitus, Type 2
Diabetes Mellitus, Type 1
Graft vs. Host Disease
Diabetes Mellitus, Type 2
Epilepsy, Temporal Lobe
Hyperlipoproteinemia Type II
Polycystic Ovary Syndrome
In summary, the statistical analysis of causal genes enabled us to find meaningful disease associations. Interestingly, many of these associations correspond to clinical observations of comorbidities and known etiological relationships between diseases as supported by highlighted scientific literature. Examination of disease clusters revealed characteristic biological functions which confirm the causal mechanistic basis of inferred disease relationships. Altogether, our findings suggest that causal molecular mechanisms provide for an expedient principle to gain further insight into the network of human diseases.
Prediction of causal genes
Having a method to identify meaningful disease associations, our next goal was to apply disease similarities as a starting point for causal gene prediction. Following our previous results, we assumed that gene sets of associated disorders potentially harbor novel mechanistic components of the disease of interest. The short-list of candidate genes was then culled from associated pathologies hypothesizing that frequent co-occurrence in causal gene sets implies functional relationship with a known disease gene (Methods).
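The two-step idea can be sketched as follows, under stated assumptions. Step 1 collects diseases associated with the disease of interest; step 2 keeps candidate genes from those diseases that share enough diseases with at least one known causal gene. The precise procedure is given in Methods; here the P-value calculations are passed in as placeholder functions (`disease_p`, `gene_p`), and all names and toy data are hypothetical.

```python
def predict_candidates(target, disease_genes, disease_p, gene_p,
                       min_overlap=2, max_p=0.01):
    """Return candidate causal genes for `target`, sorted by name."""
    known = disease_genes[target]

    # Step 1: diseases associated with the target via shared causal genes.
    associated = [
        d for d, genes in disease_genes.items()
        if d != target
        and len(genes & known) >= min_overlap
        and disease_p(target, d) <= max_p
    ]

    # Index: gene -> set of diseases in which it is a causal gene.
    gene_diseases = {}
    for d, genes in disease_genes.items():
        for g in genes:
            gene_diseases.setdefault(g, set()).add(d)

    # Step 2: candidates from associated diseases that co-occur with a
    # known causal gene in enough causal gene sets.
    pool = set().union(*(disease_genes[d] for d in associated)) - known
    predicted = set()
    for candidate in pool:
        for known_gene in known:
            shared = gene_diseases[candidate] & gene_diseases[known_gene]
            if len(shared) >= min_overlap and gene_p(candidate, known_gene) <= max_p:
                predicted.add(candidate)
                break
    return sorted(predicted)


def always_significant(a, b):
    """Placeholder P-value function; a real one would follow Methods."""
    return 0.0


# Hypothetical toy associations.
toy = {
    "T": {"G1", "G2", "G3"},
    "A": {"G1", "G2", "X1"},
    "B": {"G2", "G3", "X1", "X2"},
    "C": {"G9", "X2"},
}
print(predict_candidates("T", toy, always_significant, always_significant))
```

In the toy data, X1 shares two diseases with the known gene G2 and is retained, whereas X2 shares at most one disease with any known gene and is discarded.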
Causal genes predicted for T1DM, T2DM and obesity and supporting literature referenced by PubMed identifiers.
By manual literature research we could verify the majority of predictions as shown by the PubMed identifiers of relevant research articles given next to corresponding candidate gene symbols. Corroboration of our predictions was least successful for T1DM with 6 of 13 candidate genes left unverified, whereas only 3 of 20 genes proposed to be involved in T2DM, namely ADRB1, IL2, and ITGA2B, were not confirmed.
As an additional step, we performed network analysis of signal transduction molecules encoded by known and predicted causal genes using the network cluster algorithm of ExPlain™. The algorithm constructs signaling pathways connecting as many molecules from an input set as possible with a distance constraint for reaction cascades. As a result, input molecules are clustered into networks of two or more molecules. These network clusters can be visualized and subjected to other bioinformatic analyses. In our pursuit, the application served two purposes. Firstly, molecular pathways point out potential mechanisms by which known and predicted causal components exert a common function. Secondly, signaling cascades may allude to additional, previously unknown constituents of disease mechanisms. In the following, we examined network clusters of known and predicted causal components of T1DM as well as T2DM.
ExPlain™ reported two network clusters for T1DM. In the following, we provide gene symbols in parentheses where these differ from the protein names reported in ExPlain™ networks. A small cluster consisted of the known causal component CD154 (CD40LG) and the predicted molecule alpha IIb-integrin encoded by ITGA2B (Table 2), providing computational evidence for a role of ITGA2B in T1DM. Additional file 4 shows the larger cluster including ten known causal components (red nodes) and the novel component IGF-2 (IGF2) (green node) connected by other molecules (blue nodes) through activating (black arrows) or inhibitory (red arrows) reactions. By manual literature research, we verified involvement of SOCS3, Jak1 (JAK1), and SHP-1 (PTPN6) in T1DM. Notably, Grb-14 (GRB14) and PTP1B (PTPN1) are known molecular constituents of insulin resistance [43, 44] and development of PTP1B inhibitors for therapeutic modulation of insulin sensitivity is an active field of research. While PTP1B and Grb-14 functions were mainly explored with regard to their causal role in T2DM and obesity, the prevalence of insulin resistance in conjunction with type 1 diabetes has recently gained attention [46, 47].
We further obtained two network clusters of T2DM molecules shown in Additional files 5 and 6. In a small network (Additional file 5), known causal components ADA and CD26 (DPP4) form a cascade with the predicted causal component CD44 (green node) connected by RANTES (CCL5) (blue node), which harbors promoter polymorphisms associated with type 2 diabetes. The larger network (Additional file 6) comprises 19 known causal components (red nodes) and 5 predicted components, namely activated protein C (PROC), alpha-IIb integrin (ITGA2B), Cu-ZnSOD (SOD1), IRS-2 (IRS2) and IL-2 (IL2) (green nodes). Moreover, scientific literature supports a mechanistic role of several network components, such as PKCdelta (PRKCD), PKCtheta (PRKCQ), GSK3 (GSK3B), p85 (PIK3R1), Rac1 (RAC1), p65PAK (PAK1), Akt (AKT1), PDK1 (PDPK1), and mTOR (MTOR).
Taken together, disease and gene associations successfully predicted causal genes for obesity, T1DM, and T2DM, and scientific literature verified the majority of proposed candidates. Molecular network analysis of T1DM and T2DM gene sets then suggested signal transduction cascades connecting predicted and known causal proteins encoded by respective genes. Additional constituents of causal disease mechanisms were inferred along with molecular pathways and a good part of them (more than 1/3) were supported by literature evidence. Notably, many of the cited research articles investigated respective causal components as therapeutic targets for T1DM or T2DM. These results demonstrate the utility of causal mechanism-based disease analysis for inference of novel disease genes.
Evaluation of causal gene predictions
We examined how the number of predictions correlated with P-value thresholds and observed an approximately linear dependence in all three examples (Fig. 4B, D, and 4F). This shows that the P-values effectively controlled the number of predictions, albeit the rates of change were not the same for the three disorders.
Cross-validation (CV) was performed to evaluate the robustness of our results with perturbed disease/gene association data. We carried out 20 rounds of cross-validation for obesity, T1DM, and T2DM by removing 5 randomly chosen genes from their causal gene sets. Poisson parameters were re-estimated for diseases and for genes based on 105 random gene sets and 105 random disease sets, respectively, using the modified association data. Subsequently, we predicted causal genes for each of the disorders at a P-value threshold of 0.01 and an overlap threshold of 2.
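The sampling part of this procedure can be sketched as below: in each of 20 rounds, 5 randomly chosen genes are removed from the causal gene set of a disease, and the reduced set is handed to the prediction step. Re-estimation of Poisson parameters and the prediction itself are omitted; the function name, the fixed random seed, and the toy gene identifiers are assumptions.

```python
import random


def cv_samples(causal_genes, rounds=20, n_removed=5, seed=0):
    """Yield (reduced_gene_set, held_out_genes) for each CV round."""
    rng = random.Random(seed)      # fixed seed only for reproducibility
    genes = sorted(causal_genes)   # sorted so sampling is deterministic
    for _ in range(rounds):
        held_out = set(rng.sample(genes, n_removed))
        yield set(genes) - held_out, held_out


# Hypothetical causal gene set of one disease.
toy_genes = {f"G{i}" for i in range(12)}
samples = list(cv_samples(toy_genes))
print(len(samples), sorted(samples[0][1]))
```

Each held-out set can afterwards be compared with the genes predicted from the reduced data to count how many removed genes are recovered.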
In summary, cross-validation demonstrated that the method robustly produced a limited set of genes which preferably included the originally reported gene predictions. We could observe effects of sampling random disease or gene sets in the empirical estimation of Poisson parameters. Increasing the number of random sets would mitigate the variability inherent to the sampling procedure. This does not represent a major drawback, since the regression analysis needs to be conducted only once and parameter estimates can then be used in subsequent comparisons. Furthermore, we assume that recovery of known disease genes was limited to a certain fraction of the causal gene sets due to the low coverage of underlying disease/gene associations. This suggests that other types of information such as molecular interactions could complement the predictions entirely based on disease/gene associations.
We therefore tested the utility of combining predictions derived from disease/gene associations with a method that employs molecular interactions. The GeneWanderer is a tool that applies a global network distance measure to rank candidate genes according to their context to known disease genes in a network. The algorithm assigns a distance value to candidate genes based on a random walk with restart (RWR), which reflects how well candidate genes are connected to modules of disease genes. RWR distance values are higher for genes that are well connected to known disease genes and were successfully applied to prioritize candidate genes. We compared the ranks of distance values obtained with and without inclusion of genes predicted on the basis of disease/gene associations for known disease genes that were omitted from the cross-validation samples. Thus, for each of the left-out disease genes, we calculated RWR distance values using either only the truncated set of disease genes or predicted genes in addition. In the following, we denote RWR distance values as network score. An interaction network containing 10486 genes and 109089 interactions was compiled from the IntAct, BioGRID, and Reactome databases (Additional file 7). We used the ranks of network scores of the test disease genes among all genes of the network to compare the performance with and without addition of genes predicted by disease/gene associations in each CV sample.
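For readers unfamiliar with the random walk with restart, a minimal sketch on a toy network is given below. It iterates p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized adjacency matrix and p(0) places equal probability on the seed genes. This is not the GeneWanderer implementation; the restart probability of 0.75, the function name, and the toy network are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.75,
                             tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for all nodes.

    adjacency: dict mapping each node to the set of its neighbours
               (undirected, every node has at least one neighbour).
    seeds:     set of start nodes, e.g. known disease genes.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            # Spread the walking mass of n evenly over its neighbours.
            share = (1.0 - restart) * p[n] / len(adjacency[n])
            for m in adjacency[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a chain A-B-C-D with E attached to D.
network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = random_walk_with_restart(network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes closer to the seed receive higher scores, which corresponds to the ranking of candidate genes by network score described above.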
Knowledge about components of causal mechanisms has proven useful to analyze relationships between human diseases. The definition of causal genes applied in the BKL includes genotype associations, yet also covers other sources of evidence for involvement in causal molecular systems. This may be of importance taking into account that the activity of gene products in the context of molecular networks may bias the ability of genes to harbor pathological mutations as shown by several studies [59–61]. Probably, more or less complex patterns of genetic variation contribute to every disease. However, their functional effects become manifest in molecular interactions, where networks of proteins, yet also protein/DNA interactions, take an important part and genetic alterations are one of many possibilities to induce derangement.
On the foundation of causal disease/gene associations, we built a method to quantify the similarity of two diseases that accounts for unequal frequencies of genes in the entire set of associations. To provide a familiar and intuitive quantity, the presented method reports a P-value for the overlap of causal gene sets. At standard P-value thresholds, 0.001 and 0.01, the statistical analysis revealed meaningful disease associations as demonstrated by constructing a map of human diseases, GO analysis of disease clusters and inspection of vicinal disorders for obesity, T1DM, T2DM and PD at the lower threshold.
Human disease networks were previously studied with respect to physiological disease classes [10, 14]. Here, we first constructed the disease map and afterwards demonstrated that biological processes coincided with well-known attributes of clustered disorders as confirmed by manually curated GO biological process annotation of the BKL as well as GO annotations available through the DAVID Functional Annotation Tool. Furthermore, we observed that the giant component of our disease map as well as the vicinities of obesity, T1DM, and T2DM clustered components of the cardiometabolic syndrome. Notably, the disease vicinities that were explored in more detail reflected not only similarities between obesity, T1DM, and T2DM, but captured also specific relationships such as connections to immunological disorders in the T1DM vicinity. When we applied causal disease associations to predict new disease genes, the majority of predictions proposed for obesity, T1DM, and T2DM could be supported by references to scientific literature. Altogether, our results corroborate that the inferred disease associations reflect common molecular mechanisms and indicate applications for disease gene prediction as well as disease classification and definition.
Limitations of current standard classifications were previously challenged with regard to molecular or complex systems approaches [63, 64]. Protein-protein interactions and molecular pathways have been employed to identify disease relationships [13, 14]. So far our method for disease comparison left molecular interactions unspecified and addressed only the overlap of causal gene sets. A first step towards incorporating molecular networks into our analysis was taken by clustering causal components in signal transduction networks. As evidenced by scientific literature, the network clusters contained many known causal components and therapeutic targets. Furthermore, we translated the analysis of disease associations and of causal gene associations into a method to select for new disease genes. While disease gene prediction entirely based on experimentally verified disease/gene associations faced limitations originating from the sparseness of the available data, we could show that a combination of disease associations and molecular network analysis enhanced the possibility to identify new disease genes. Incorporation of molecular interactions is therefore an important area for further development, where greater fidelity with molecular systems that underlie disease mechanisms can be achieved.
We would like to point out that our method for prediction of causal genes relied on four cut-off values consisting of a P-value and a minimal overlap parameter in the first and the second step (Fig. 9). It may be difficult to tune each of the parameters to achieve optimal results. Throughout this work, we set identical cut-offs for disease and gene similarity and confined analyses to standard P-value thresholds (0.001 or 0.01). Furthermore, the overlap threshold was always set to a small value of 2 with the purpose of controlling a minimal level of shared causal genes or diseases. We think that with this setting the overlap cut-off sufficiently complements the P-value. As demonstrated, the number of false positives grows linearly with the P-value threshold (Fig. 4), so that this parameter lends itself to further adjust the algorithm. One possibility is to choose a number of predictions admissible for validation and to select a P-value cut-off that satisfies this constraint. For this type of approach our method offers, in addition to using a common value for disease and gene similarity, the possibility to fix the disease similarity threshold (e.g. to 0.01) and to subsequently rank predicted causal genes according to overlap P-values with known disease genes. Furthermore, the method achieved best precision for obesity, T1DM, and T2DM at P-value thresholds of 0.005, 0.003, and 0.01, respectively. At a P-value of 0.005 the observed precision for T1DM and T2DM was still at least 75%. This indicates that the range from 0.001 to 0.01 is suitable to choose a threshold for causal gene prediction and suggests 0.005 as a possible starting point.
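The budget-driven choice of a cut-off described above can be sketched as follows: given overlap P-values of candidate genes, the largest threshold from a grid is selected that keeps the number of predictions within an admissible limit, and the retained candidates are ranked by P-value. The grid values lie in the range discussed in the text; the function names, gene identifiers, and P-values are hypothetical.

```python
def pick_threshold(candidate_p, budget, grid=(0.001, 0.003, 0.005, 0.01)):
    """Return the largest grid threshold yielding at most `budget` predictions.

    Returns None if even the smallest threshold exceeds the budget.
    """
    chosen = None
    for threshold in sorted(grid):
        n_predictions = sum(1 for p in candidate_p.values() if p <= threshold)
        if n_predictions <= budget:
            chosen = threshold
    return chosen


def ranked_predictions(candidate_p, threshold):
    """Return candidates passing the threshold, most significant first."""
    passing = [g for g, p in candidate_p.items() if p <= threshold]
    return sorted(passing, key=candidate_p.get)


# Hypothetical overlap P-values of candidate genes with known disease genes.
candidate_p = {"GA": 0.0005, "GB": 0.002, "GC": 0.004, "GD": 0.008, "GE": 0.02}

threshold = pick_threshold(candidate_p, budget=3)
print(threshold, ranked_predictions(candidate_p, threshold))
```

Because the number of predictions can only grow with the threshold, scanning the grid in ascending order and keeping the last admissible value is sufficient.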
In our efforts to verify the inferred disease associations, we were able to highlight many instances of known clinical associations and comorbidities, suggesting that these are special cases of related pathologies sharing causal mechanisms which also connect them etiologically. Interestingly, these findings inversely confirm a previous study, where co-occurrence of disorders in medical records was used to predict genetic overlap. Validation of disease co-prevalences often requires laborious population studies. According to the results of this work, the decision to mount a study could be supported by testing hypotheses about disease associations computationally. Simultaneously, shared causal components provide insight into the molecular basis of etiological disease relationships and suggest potential diagnostic markers.
So far, different methods have been proposed to investigate human disease associations. A main difference lies in the representation of disease entities by features that are eventually compared to obtain a figure of similarity. While this work focused on genes that were manually classified as causal disease genes, other approaches used clinical characteristics, phenotypes, or genes and pathways [5, 8, 14]. Each choice of feature representation involves advantages and disadvantages with respect to quality, coverage, or detail of information. For instance, manual curation promises greater quality than computationally derived annotation, but its coverage is often inferior. Furthermore, associated genes capture disease components in finer detail than descriptions of clinical characteristics, but we assume that for a disease the latter are more often defined than associated genes. It is therefore of importance to compare the different approaches to recognize and validate strengths and weaknesses.
To the best of our knowledge, no reference set has been established to systematically examine the ability of different methods to correctly identify disease associations, so that a necessary step towards such a comparison is to assemble a set of known disease links.
Another future direction will be to combine different levels of information such as causal genes, affected biological processes, and clinical characteristics to gain further insight into disease subtypes and corresponding mechanisms. Of interest are phenotypically similar disease subtypes that present different molecular mechanisms as well as similarities on the molecular level that cannot be mapped to known clinical characteristics. Identification of such disease subtypes and of "hidden" disease similarities may open new avenues to develop therapeutic approaches for respective disorders.
We developed a novel approach to analyze human disease associations and demonstrated its utility in several application areas. Causal molecular mechanisms present a unifying principle for disease classification and definition, analysis of clinical disorder associations, as well as prediction of disease genes, therapeutic targets and diagnostic markers. According to the definition of causal disease genes applied in this study, these results are not restricted to genetic disease/gene relationships. This may be particularly useful for the study of long-term or chronic illnesses, where pathological derangement due to environmental factors or as part of sequel conditions is of importance and may not be fully explained by genetic background. The possibility to identify common molecular mechanisms for clinically associated disorders enables further insight into disease interactions. First steps in that direction were presented in this work for obesity and diabetic disorders, as constituents of the cardiometabolic syndrome. An important conclusion from this work is that components of molecular mechanisms characterize associated diseases. Using this knowledge enables identification of disease associations, which reflect common molecular mechanisms, and provides a starting point to identify missing causal components. Making use of such disease associations and consideration of knowledge about molecular interactions can be combined to handle limitations imposed by the sparseness of experimentally verified, curated disease/gene associations. Future lines of research will include incorporation of molecular interactions into the method for disease comparison and development of software tools that exploit the findings of this work.
Causal disease/gene associations
Manually collected information on causal disease/gene associations was obtained from the BIOBASE Knowledge Library™ (BKL). The BKL groups disease/gene associations into four types (correlative, causal, preventative, and negative), depending on the conclusion that can be drawn from the relevant research article. In this study, we used only associations of the causal and preventative types. Causal relationships are derived from experiments that confirm or suggest the hypothesis that a gene encodes a product whose deranged activity entails a disease or a certain condition as part of a disease. The derangement may be inheritable or may emerge during disease onset or progression. Preventative disease/gene associations additionally indicate that experimental evidence exists for a therapeutic effect of modulating the deranged activity. In this article, we refer to the respective genes as causal genes. To ensure a certain level of annotation for disease comparison, we considered only disease entities with at least five causal genes. The final data set comprised 375 diseases and 3051 causal genes connected by a total of 9871 disease/gene associations. Data used in this study are available upon request.
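The selection rules described above (keep only causal and preventative associations, then keep only diseases with at least five causal genes) can be illustrated with a minimal Python sketch. The record layout, the function name `build_dataset`, and the toy disease and gene labels are assumptions made for illustration; they do not reflect the actual BKL export format.

```python
from collections import defaultdict

def build_dataset(records, keep_types=("causal", "preventative"), min_genes=5):
    """Filter (disease, gene, association_type) records.

    Hypothetical helper: keeps associations of the selected types and
    returns only diseases annotated with at least `min_genes` distinct genes.
    """
    genes_by_disease = defaultdict(set)
    for disease, gene, assoc_type in records:
        if assoc_type in keep_types:
            genes_by_disease[disease].add(gene)
    return {d: g for d, g in genes_by_disease.items() if len(g) >= min_genes}

# Toy records (made-up labels, not taken from the BKL).
records = [("Disease A", f"GENE{i}", "causal") for i in range(4)]
records += [
    ("Disease A", "GENE4", "preventative"),  # counted as a causal gene
    ("Disease A", "GENE5", "correlative"),   # ignored: wrong association type
]
records += [("Disease B", f"GENE{i}", "causal") for i in range(3)]  # too few genes

dataset = build_dataset(records)
```

In this toy input, "Disease A" is retained with five causal genes, whereas "Disease B" is dropped because it has only three.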
The BKL specifies diseases using MeSH descriptors, which constitute a hierarchy of broader and narrower subject headings. The hierarchical structure of MeSH descriptors allows for incorporation of disease-related scientific information into the knowledgebase, even when the disorder to which an article pertains is not distinguishable at the most specific level. Inference of disease associations performed in this work ignored hierarchical dependencies. Hence, it was anticipated that similarities between some disorders merely reflect the underlying MeSH structure.
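Because hierarchical dependencies were ignored, it can be useful to flag disease pairs whose similarity may simply reflect a broader/narrower relationship. The following Python sketch shows one possible check based on MeSH tree numbers, which are dotted hierarchical codes. It was not part of the analysis described here; the function name and the tree numbers are invented for illustration.

```python
def is_hierarchically_related(tree_numbers_a, tree_numbers_b):
    """Return True if any tree number of one descriptor equals, or is an
    ancestor of, a tree number of the other descriptor."""
    def is_ancestor_or_same(x, y):
        # "X01.100" is an ancestor of "X01.100.200" but not of "X01.1000".
        return y == x or y.startswith(x + ".")

    return any(
        is_ancestor_or_same(a, b) or is_ancestor_or_same(b, a)
        for a in tree_numbers_a
        for b in tree_numbers_b
    )

# Made-up tree numbers; a descriptor may carry several of them.
broad_term = ["X01.100"]
narrow_term = ["X01.100.200", "X07.500.300"]
unrelated_term = ["X02.300"]

related = is_hierarchically_related(broad_term, narrow_term)
unrelated = is_hierarchically_related(broad_term, unrelated_term)
```

Pairs flagged in this way could then be interpreted with caution, since their shared causal genes may stem from annotation at different levels of the same branch.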
The functions lm and summary.lm of the R statistical computing environment were used to estimate regression models and to calculate AIC values.
In the course of disease gene prediction, the described procedure was adopted to estimate P-values for the number of diseases shared by causal genes.
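The model comparison by AIC mentioned above can be illustrated with a small, self-contained Python sketch. It is not the R code used in this study: it fits an intercept-only model and a simple linear regression by ordinary least squares and computes the Akaike information criterion, AIC = 2k - 2 ln L, under a Gaussian error model. Counting the error variance as one of the k parameters is an assumption of this sketch, and the data points are invented.

```python
import math

def rss_intercept_only(y):
    """Residual sum of squares of the model y = b0."""
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)

def rss_simple_regression(x, y):
    """Residual sum of squares of the least-squares fit y = b0 + b1 * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

def gaussian_aic(rss, n, n_coef):
    """AIC = 2k - 2 ln L with the maximum-likelihood variance rss / n.

    k counts the regression coefficients plus the error variance
    (an assumption of this sketch).
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coef + 1
    return 2 * k - 2 * log_lik

# Invented data with a clear linear trend.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

aic_null = gaussian_aic(rss_intercept_only(y), len(y), n_coef=1)
aic_linear = gaussian_aic(rss_simple_regression(x, y), len(y), n_coef=2)
```

The model with the lower AIC is preferred; for these data it is the linear model, because the gain in likelihood outweighs the penalty for the additional coefficient.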
Statistical analysis of biological processes
Functional characteristics of disease clusters were analyzed with Gene Ontology (GO) Biological Processes. Among the three GO vocabularies for biological processes, cellular components, and molecular functions, we selected the biological process ontology because its terms were deemed to best represent molecular mechanisms that may be targets of derangement in disease. Biological process terms describe biological objectives accomplished via one or more ordered assemblies of molecular functions. In the majority of cases, more than one gene contributes to a biological process, whereas GO molecular functions denote biochemical activities of individual genes. Several biological processes are well-known targets of disease mechanisms, such as cell cycle (GO:0007049) in cancer or immune response (GO:0006955) in autoimmune or infectious diseases.
Prediction of causal disease genes
Molecular network clusters
Signal transduction networks of molecules encoded by known and predicted causal genes were constructed with the network cluster tool of ExPlain™. The algorithm searches for shortest paths of at most three reaction steps between input molecules. Information about signaling reactions is taken from the TRANSPATH™ database. The ExPlain tool identifies networks, so-called network clusters, that connect a maximal number of input molecules given the distance constraint.
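The distance-constrained path search can be sketched as a breadth-first search that stops expanding a path once it reaches three reaction steps. The Python code below is only an illustration of this idea and not the ExPlain implementation; the reaction graph, the assumption that reactions are directed, and all molecule names are hypothetical.

```python
from collections import deque

def shortest_path_within(graph, source, target, max_steps=3):
    """Breadth-first search for a shortest path from source to target.

    Returns the path as a list of nodes, or None if no path with at most
    `max_steps` edges exists. `graph` maps a molecule to the molecules it
    acts upon (directed edges, an assumption of this sketch).
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 == max_steps:
            continue  # distance limit reached, do not expand further
        for successor in graph.get(path[-1], ()):
            if successor not in visited:
                visited.add(successor)
                queue.append(path + [successor])
    return None

# Hypothetical signaling reactions: A -> X -> B -> Y -> Z -> C
reactions = {"A": ["X"], "X": ["B"], "B": ["Y"], "Y": ["Z"], "Z": ["C"]}
inputs = ["A", "B", "C"]

# Paths between all ordered pairs of input molecules.
links = {
    (s, t): shortest_path_within(reactions, s, t)
    for s in inputs
    for t in inputs
    if s != t
}
```

Here A reaches B in two steps and B reaches C in three, whereas A and C are five steps apart and therefore remain unconnected. A network cluster would then be assembled from the input molecules linked by such paths.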
Network layout and visualization
Layout and visualization of disease networks as well as disease/gene networks were performed with the yEd graph editor developed by yWorks.
This study was supported by the European Union grants EURODIA (LSHM-CT-2006-518153) and Gen2Phen (FP7 no: 200754). We thank our anonymous reviewers for their comments, which helped to improve this article.