Knowledge-based compact disease models identify new molecular players contributing to early-stage Alzheimer’s disease

Background High-throughput profiling of human tissues typically yield as results the gene lists comprised of a mix of relevant molecular entities with multiple false positives that obstruct the translation of such results into mechanistic hypotheses. From general probabilistic considerations, gene lists distilled for the mechanistically relevant components can be far more useful for subsequent experimental design or data interpretation. Results The input candidate gene lists were processed into different tiers of evidence consistency established by enrichment analysis across subsets of the same experiments and across different experiments and platforms. The cut-offs were established empirically through ontological and semantic enrichment; resultant shortened gene list was re-expanded by Ingenuity Pathway Assistant tool. The resulting sub-networks provided the basis for generating mechanistic hypotheses that were partially validated by literature search. This approach differs from previous consistency-based studies in that the cut-off on the Receiver Operating Characteristic of the true-false separation process is optimized by flexible selection of the consistency building procedure. The gene list distilled by this analytic technique and its network representation were termed Compact Disease Model (CDM). Here we present the CDM signature for the study of early-stage Alzheimer’s disease. The integrated analysis of this gene signature allowed us to identify the protein traffic vesicles as prominent players in the pathogenesis of Alzheimer’s. Considering the distances and complexity of protein trafficking in neurons, it is plausible that spontaneous protein misfolding along with a shortage of growth stimulation result in neurodegeneration. Several potentially overlapping scenarios of early-stage Alzheimer pathogenesis have been discussed, with an emphasis on the protective effects of AT-1 mediated antihypertensive response on cytoskeleton remodeling, along with neuronal activation of oncogenes, luteinizing hormone signaling and insulin-related growth regulation, forming a pleiotropic model of its early stages. Alignment with emerging literature confirmed many predictions derived from early-stage Alzheimer’s disease’ CDM. Conclusions A flexible approach for high-throughput data analysis, the Compact Disease Model generation, allows extraction of meaningful, mechanism-centered gene sets compatible with instant translation of the results into testable hypotheses.


Background
In developed economies, the costs of medical services are constantly rising, stifling the growth and projecting to become unsustainable if the trend remains unchanged [1,2]. Some solutions propose the shift of the focus to early diagnostics of the diseases with the highest societal impact, to designing the strategies for reliable risk assessment and to tailoring prophylaxis efforts to the highest risk groups [3]. Another approach seeks to streamline the process of drug development by focusing the effort on the most promising targets and pre-clinical drug candidates. Both solutions may be assisted by the methods of bio-and chemo-informatics that operate within the realm of systems biology [4][5][6].
The most common type of the data analyzed by bioinformaticians is a set of differentially expressed genes obtained by microarray or RNAseq. Typical candidate list derived from these kinds of studies contains hundreds to thousands differentially expressed genes. However valuable, these sets are riddled with false-positives that changed their expression levels due to compensation for an overall increase in cellular stress or as a secondary effect of certain regulatory events, for example, the suppression of transcription factor' activity or the shift in histone modification landscape. In other words, the differential expression of given gene often is a passive consequence of stress rather than a critical event directly contributing to disease pathogenesis.
Obviously, the focus of the research efforts should be on genes most essential in pathogenesis of given disease. However, this focusing is not trivial, as every chronic disease is studied by multiple research groups that customarily formulate multiple competing hypotheses [7], thus, populating the lists of potential candidates with thousands of entries. For Alzheimer's disease alone, the Gene Cards compiled by Weizmann Institute of Science list 1890 molecules of relevance [8]. With < 25,000 genes known to comprise a genome, and no more than a third of them being expressed in a single tissue [9], this number is indicative that the long gene lists of today reflect rather poor target prioritization. Thus, there is a need for highly prioritized shortlists of potential targets directly linked to major pathogenic processes. Such lists, contracted by ontological enrichment, re-expanded by interaction network and validated by network clustering and alignment with literature were termed here Compact Disease Models (CDMs).
The reproducibility of a result in an independent experiment with at least slightly varied technical settings is the typical verification criterion for any scientific derivation [10]. In accordance to that, the gene-specific probes differentially expressed in the same direction in independently analyzed multiple subsets of the same experimental dataset and also in different experiments are less likely to report noise. Filtering of biological signals by consistency of gene-expression changes already demonstrated its value for enrichment analyses of genes mechanistically important for tumorigenesis [11,12] and metastasis [13]. As compared to gene lists generated using t-test, the lists generated using consistency of differential expression are target-enriched [4,[11][12][13][14] and, thus, are more mechanistically relevant. For example, a reliable cancer mortality signature was produced by meta-analysis for the consensus changes observed in a variety of experiments across a number of model organisms [15]. An enrichment of gene expression signatures with mechanistically relevant targets was also attempted for neurodegeneration studies, such as Alzheimer's and Parkinson's diseases [16].
While any enrichment technique is capable of demonstrating the target enrichment, the utility of this enrichment is determined by the Receiver Operating Characteristic (ROC) of the process and the point of ROC cut-off. Importantly, non-overlapping components of individual datasets may be disease-specific while remaining related to pathogenic mechanism. Therefore, the requirement of consistency has to be imposed with minimally stringent cut-offs [17]. Here we present an approach that provides an improvement over previously described techniques. In that, we implemented sorting out the gene lists into the Consistency Tiers, thus, gaining control of the extent of information loss in the non-overlapping sub-sets. Each Tier can be assessed further by functional enrichment and alignment with independent literature data. The optimal size of the consensus signature could be selected depending on the nature of the disease [17]. All together, our process includes a three-step noise filter comprised of 1) prioritization of candidates by consistency of reported directional gene expression changes, 2) functional enrichment and 3) co-clustering of candidates in a network [18].
The resultant Compact Disease Model (CDM) provides a significant saving of research effort. Assuming N independent platforms being included in given analysis and m intersections needed to provide a robust mechanism-related gene list, the number of potential contributions becomes: where RECis the recall number (the number of totally available true positives), C N m is the number of contributing platform combinations, P(m) is the number of strong mechanistic associates extracted per a single platform combination. In this case, the multi-platform nature of analyzed datasets would compensate for a low ROC curve area observed due to low recall (yield) component, while enrichment is high. Additional platforms to consider are comparative proteomics and quantitative PCR studies, massive parallel genome sequencing, promoter methylation arrays or others. To further improve the recall rate and polish the mechanistic details, the aggregated gene lists produced by all platforms combined are subjected to an interaction network building algorithm, sub-network identification and detailed assessment of the most relevant sub-networks by literature review.
Compacting mechanistically relevant genes into distilled shortlists may have a major impact on the routine verification of individual mechanistic hypotheses. Assume that a hypothesis H assigns correct connectivity between the functions X, Y, Z, the X being a receptor, Y being a G-protein and Z being a kinase. Relevance of the gene list is measured by factor of q, where q is the decimal fraction of bona-fide mechanistically relevant genes in the total list. The assignment will receive experimental verification only if all members X, Y, Z are bona-fide mechanistically related. For a 3-member sequence, the relationship is PEC = q 3 which can be generalized into: where PEC (probability of experimental correctness) is the probability that the mechanistic hypothesis is correct for a n-member sequence; q is the distillation factor of the list, n is the number of steps in a sequential mechanistic hypothesis. Under all other factors and techniques being equal, the exclusion of false positives from the gene lists is especially important for the mechanisms studied to a lesser degree (low g) and for complex hypotheses (high n).
To test CDM approach, we selected an example of Alzheimer disease, the most common form of adult-onset dementia. We were especially interested in addressing the earliest stages of this disease, when the pathological changes are still reversible and/or preventable. The particular focus of our analysis was at previously demonstrated anti-Alzheimer effects of anti-hypertensive drugs [19][20][21][22]. In our study, an application of CDM resulted in the distilled, tiered list of Alzheimer's disease-related genes integrated into a biological network model. As potential players in early disease, a group of genes that encode proteins associated with traffic vesicles, oncogenes, G-protein regulators, gonadotropin hormones and insulin-related signaling molecules was identified. Insights gained by an analysis of this CDM may aid in shifting the therapeutic efforts to the reversible stages of neurodegenerative disease, when the neuronal damage is mild and selfperpetuating misfolded protein oligomers do not yet form.

Selection of datasets
Datasets for the study included A) GSE5281 on GPL570 HG-U133 Plus 2 Affymetrix Human Genome U133 Plus 2.0 Array including 71 normal controls and 91 disease related samples (N = 162); B) GSE15222 on GPL2700 Illumina Sentrix Human Ref-8 Expression Bead Chip, including 187 normal controls and 176 disease samples (N = 363); and C) GSE26927 on GPL6255 Illumina human Ref-8 v2.0 expression bead chip platform, including 58 normal controls and 60 disease samples (N = 118). The latter dataset comprises differential expression data covering several neuropathies: Alzheimer's disease; Amyotrophic lateral sclerosis (ALS); Huntington Disease (HD); Multiple Sclerosis (MS); Parkinson Disease (PD); and Schizophrenia (SHIZ) of approximately equal size. The patient histories and disease severities were extracted from the information that accompanies the public domain datasets at GEO, NCBI at http://www.ncbi.nlm.nih.gov/geo/. Other datasets covering Alzheimer's disease and dementia on GPL96 and GPL90 Affymetrix platforms were explored but not included due to failure to pass the quality controls, namely, large number of missing genes, evidence of data imputation or evidence of weak hybridization/weak signal. The primary data describing datasets A, B and C are presented in Additional file 1.

Forming of a distilled gene list (CDM): consistency profiling step
The dataset GSE5281 comprises several distinct tissue subsets: EC -entorhinal cortex; HIP -hippocampus; MTG -Medial Temporal Gyrus; PC -Posterior Singulate; SFG -Superior Frontal Gyrus; VCX -Primary Visual Cortex, each being comprised of control and Alzheimer's disease samples. The expression values were averaged for each anatomical locus for norm and disease. The averaged signal intensities were sorted by their magnitude and the upper 40% of the entries were included in the analysis on assumption that the expression levels for the remaining low-intensity signals is unlikely to exceed experimental noise. The ratios of the averages produce either up-regulated or down-regulated fold change values. The primary data were subjected to 2-tail, different variance hypothesis T-tests between normal control and disease subsets for each brain tissue type. The p-values of these T tests were converted into negative logarithms and the logarithms were averaged across all tissue types. For GSE5281, these averages formed the Primary Consistency Scores (PCSs). In addition to differential expression, for each gene, absolute expression levels were also tracked.
The dataset GSE15222 comprises normal controls separated into two subsets, numbers 1-85 and 86-178. The disease samples were also separated into two subsets, numbers 1-85 and 86-176. Absolute averaged expression levels were computed for each normal and disease subset, separately. Similarly to analysis of previous dataset, the upper 40% of entries by their expression level intensities were considered significant and included in the analysis. The difference in expression between the normal and disease subsets was assessed by T-tests as described above to compare each disease subset to each normal subset, four separate values were produced, and the negative decimal logarithms of T-test p-values were computed. The average of 4 negative logarithms formed Primary Consistency scores for GSE15222.
All gene-specific labels in GSE5281 and GSE15222 were ranked according to their Primary Consistency scores (PCSs) and the top 10% were selected. The highest ranking probes in GSE5281 and GSE15222 were assigned to a Consistency Tier 3, if the functionally related molecules (members of the same pathway) were also displaying high PCS. Assignment to Consistency Tier 2 was made in either of two situations: (a) two Affymetrix probes representing the same gene were displaying high rank PCS, and the direction of differential change was the same for both probes (all down-regulated or all up-regulated) in the group; (b) Affymetrix and Illumina probes representing the same gene were displaying high rank PCS and the direction of differential change was identical for both platforms. Consistency Tier 1 was assigned if either of three situations: (a) to the genes that displayed high PCS on both Affymetrix and Illumina platforms as well as multiple probes on Affymetrix platform, when the direction of differential expression changes was the same for all genespecific probes; (b) to the probes that simultaneously qualify for Tier 3 and Tier 2; (c) to the three or more probes on Affymetrix platform that simultaneously showed high PCS ranking and the direction of expression changes was identical for all probes. Tiered consistency scores for all scored genes are available as the dataset D of the Additional file 1. The Tier 0 was produced by overlapping the Tier 1 and Tier 2 genes with the highest PCS ranks of GSE26927, thus, identifying a group of genes commonly participating in a number of neuropathies in addition to Alzheimer's disease.

Forming of a distilled gene list (CDM): ontological enrichment analysis step
Quantitative ontological analysis was performed using GO-MINER tool (http://discover.nci.nih.gov/gominer/ index.jsp) using high-throughput online computing option at http://discover.nci.nih.gov/gominer/htgm.jsp. This technique measures preferential enrichment of the differentially expressed gene lists in one or more of approx. 9300 functional categories, organized in a tiled partially overlapping manner, with smaller specific categories merging into greater ones. To compute enrichment in a given category, the genes that are known to be related are tracked in the differential expressed gene list and in the total list. The enrichment coefficient can be estimated as: Where: ENRenrichment coefficient, CGthe number of genes with detected expression changes in a given functional category in the experimental gene list L, Lthe number of genes in the experimental gene list, TGtotal number of genes in a given functional category, Tthe total number of genes assessed. Robustness of the enrichment coefficients is established by permuting the composition of L and expressed as p-value and False Discovery Rate (FDR). The current implementation of GoMiner uses a one-sided p-value calculated from a Fisher's exact test. To get a low p-value, good enrichment and a fairly large size of category are required. The FDR approach addresses the multiple comparison problem, and protects against over-interpreting p-values that do not have a biological meaning.
The combination of Affymetrix and Illumina probe populations was used as the "Total file" or T. Since highly expressed genes are more likely to produce consistent differential expression signatures, the total file (T) was normalized to ensure equal average expression level as compared to the gene lists (L) under study, compensating this bias. Specifically, the total list in each case was ranked by expression levels and the upper rank populations of T producing equal averages to the given L were retained as expression-adjusted total files, and the rest were discarded from the analysis, effectively decreasing the number of genes in T. Tier 0, Tier 1, Tier 1 + Tier 2 and Tier 1 + Tier 2 + Tier 3 gene lists were used as "Changed file". An option "All" was elected for "Data source". Evidence Code was elected as "All", accepting either experimental, curator inferred or computed data of functional involvement. Lookup setting for gene searching in the GO Consortium database was accepted as achieved by both cross-referencing and by use of synonyms. Both p-value of a functional enrichment category and false discovery rate (FDR) were elected as statistical criteria for including the qualifying genes in the summary report. The prospective functional enrichment categories were validated by 100 randomization cycles according to GO-MINER protocol. The smallest size for a functional enrichment category was accepted as 5. The GO-MINER output was sorted by FDR with the cutoff FDR < 0.2. The functional categories with the lowest false discovery rates were resorted by enrichment coefficients in the descending order. The relative functional enrichment coefficients reflect the extent of association of the differentially expressed genes with the pathological mechanism that caused the differential expression event in first place. The outputs of GO-MINER ontological analysis to the genes within Consistency Tiers 0-3 are available in the Additional files 2, 3, 4 and 5 datasets G-J.
The extent of ontological enrichment provides a cutoff for selection of the Consistency Tier levels to be submitted to network association step. The Tier 1 provided a conditionally optimal ROC cut-off due to high ontological enrichment and preserved pathway diversity. The Tier 1 + 2 + 3 was accepted, but considered less preferential due to a substantially lower proportion of mechanistically relevant genes based on ontological enrichment step.

Forming of a distilled gene list (CDM): network analysis step
To organize sets of genes into biological networks, Ingenuity Pathway Assistant (IPA) tool was utilized (http://ingenuity.com/). Briefly, the tool places a gene list in the context of experimental and computed interactions systematized in its database. The functional links between the members of a gene list under study form a network with high clustering coefficients for members of the same biological pathway while clustering coefficients for random associates are low. Indeed, the members of the same pathway must be functionally associated with multiple other members of the same pathway, directly or via intermediates, thus producing non-random clustering. The extent of observed clustering is compared with a random model and the extent of observed clustering is expressed as a p-value of a network. The p-value matches a probability that the associations in the network have emerged randomly. The network is partitioned into subnetworks based on global optimization of clustering when a gene under consideration is shifted between the subnetworks as a test. The formed sub-networks are ranked based on the score, the latter being the negative decimal logarithm of sub-network non-randomness p-values.
The sets of sub-networks were built using gene lists comprising Consistency Tier 1 and a joint list comprised of all three numbered Tiers (Tier 1+ Tier 2 + Tier 3), the latter as a benchmark control to illustrate the loss of the priority rank by the sub-network comprising the genes relevant to neurological diseases. In each analysis, the highest ranking sub-networks were selected, merged and plotted as connectivity graphs. The genes displaying experimentally observed differential expression were coplotted with sets of known network interaction partners. The possible network hubs were expanded, producing additional connections to more distant members. The Additional files 6 and 7 comprise the sub-network compositions for the Tier 1 and (Tier 1 + Tier 2 + Tier 3), including both experimental and inferred members.

Validation of CDM by semantic tag enrichment analysis
The quantitative evaluation of enrichment of the microarrayderived dataset with literature-derived associations was applied as a validation criterion. Each gene lists was converted into Boolean [OR] statement, for example: [gene name 1] OR [gene name 2] OR …etc. and used as a search query in Pubmed. The hits produced by PubMed were defined as Total. Additional delimiting search queries were imposed: A. [(disease or pathology or disorder)]; B.
[cancer]; C [(disease or pathology or disorder) and stress; D. [(disease or pathology or disorder) and (Alzheimer's or Alzheimer or neuropathy or neuropathic or neuro-degeneration or neurodegeneration or neurodegenerative or dementia)]. The numbers of hits for each delimited strategy were enumerated and related to the number for the total list based on gene names only. Variation in the datasets was taken into account by dividing each consistency tier list into subsets and repeating the procedure independently for each subset, pooling the variation and distributing it equally per each subset (within each consistency tier).
In this technique, both [cancer] and [(disease or pathology or disorder)] strings were used as controls accounting for non-specific organism or tissue-level stress that generally accompanies any severe pathology, while the string [(disease or pathology or disorder) and stress] was controlling for explicit gene association with stress in pathological conditions and the string [(disease or pathology or disorder) and ([Alzheimer's] or (Alzheimer or neuropathy or neuropathic or neuro-degeneration or neurodegeneration or neurodegenerative or dementia)] was controlling the expected specific association of the gene lists and the disease of interest.

Results
Overview of differential expression consistency in Alzheimer's disease Affymetrix GPL570 platform comprises approximately 54000 probes, while Illumina platform comprises~22000 probesets. Of the~24000 independent expressed genes measured by both platforms, 78 sets were satisfying criteria of the Tier 0, 105 sets were satisfying the criteria of the Tier 1, 85 sets were satisfying the criteria of the Tier 2, 450 sets matched the Tier 3 and 1298 sets were demonstrating high PCS without being validated by other consistency criteria. On both platforms, the genes within the top 40% range by their absolute expression served as random control. Of the 190 probe-sets in Tier 1+ Tier 2, the Tier 0 comprised 78, indicating that > 40% of genes robustly reported as being differentially expressed in Alzheimer's disease also produced robust detection in other neuropathies in agreement with [16]. In all Consistency Tiers, the fold differences of differential expression effects were relatively small, rarely exceeding 3-fold. In Tier 1, 20 out of 105 probe-sets were upregulated and the remaining 85 being down-regulated. In Tier 2, 12 out of 97 probe-sets were up-regulated, the remaining 85 being down-regulated. In Tier 3, the down-regulated pattern was shown by 316 genes and 176 genes were up-regulated. In the high PCS/unconfirmed group, 715 genes were down-regulated and 570 were up-regulated. In random control, the ratio of up and down-regulation signals was close to 1. The extent of relative down-regulation was strongly correlating with the extent of differential expression detection consistency. These numbers show a greater tendency for down-regulation in Alzheimer's disease-related genes and support functional significance of the consistency profiled gene lists, in agreement with degenerative character of the Alzheimer's process [23].
The absolute expression levels were also positively correlating with consistency of differential expression detection. Thus, Tier 0 average signal was~4300 arbitrary units, the remaining (Tier 1+ Tier 2) signals were, after exclusion of Tier 0, at~2330 arbitrary units, while the random control genes were at~1725 arbitrary units for the top 40% of ranked intensities and~730 arbitrary units for the entire array (see the Dataset C in Additional file 1). The 3 to 5-fold increase in average absolute expression in the Consistent gene lists vs. Random Control may be an artifact: the genes with higher expression levels may also display higher signal-to-noise ratio at hybridization. Also, at a higher concentration of transcript the thermodynamic quotient and Gibbs energy of binding increases. For genes with higher expression levels the relative proportion of binding at non-specific sites is lower. However, it is also known that mechanistically important targets tend to be the hubs of the biological network that also tend to be expressed at higher levels than non-hub entities [24,25]; this feature of biological networks provides additional robustness [26,27]. However, to account for possible gene intensity bias, further functional enrichment analysis was conducted after respective normalization.
Another concern was a possibility of a bias due to decreased inherent variation within consistently reporting gene sets, as could be expected for tightly regulated pathways. If this is true, the consistency in differential expression of these genes would reflect not a prevalence of their biological relevancy, but lower levels of respective backgrounds. To rule this scenario out, variations were measured among all Consistency Tiers and were compared to variations observed in Random Control genes both globally and in the subsets selected by matching of their expression intensities. The results are presented in Figure 1.
This analysis points at higher variation in consistent differentially expressing datasets. Thus, the biased scenario was ruled out and the consistency of the gene expression changes, indeed, was found to reflect the difference between the disease and the norm.
Still, there might be a concern that the differentially expressed data represent stress responses at both organism and tissue-specific levels, in other words, the responses expected to be pertinent to any severe pathology rather than to reflect a disease-specific mechanism. Figure 2 shows the extracted Consistency Tiers as analyzed by the methodology described above. The method comprises the Boolean presentation of the gene list crossed with the delimiting statement reflecting either association with a non-specific stress or with a specific disease. The statements like [(disease or disorder or pathology)] crossed with the corresponding Boolean representations of the consistency tiers would locate the literature publications associating the genes of interest with any disease, non-specific to the study. The query statements like [cancer] crossed with the corresponding Boolean representations of the consistency tiers would locate the literature publications associating the genes of interest with cancer as another proxy for a non-specific multiple roles played by many signaling molecules. The statement relating the gene list of interest to stressresponse comprises the negative control. Thus, relative enrichment of the disease-specific vs. disease nonspecific PubMed hits for certain levels of consistency would represent a measure of ensuring the mechanistic involvement of the genes in the disease-specific pathogenesis Figure 2 represent Venn-transformed enrichment diagrams for all Consistency Tiers.
Based on the results presented in Figure 2, it is apparent that only the Tier 0 produces a highly enriched disease associated gene list, the Venn Tier 1-0 is still significantly more enriched than the Random Control gene list, while an enrichments in Venn Tiers 2-1 and 3-2 were marginal. In fact, the enrichment for the delimiter query [(disease or pathology or disorder) and (Alzheimer's or Alzheimer or neuropathy or neurodegeneration)] in the Tier 0 was approximately 10 fold as compared to the Random Control. The degrees of enrichment against all non-specific disease-related controls were similar in each Consistency Tier and remained within a margin of experiment error. All together, against the non-delimited gene list' background and against the panel of negative controls, the disease-specific genes demonstrate the relative enrichment of~10 in the Tier 0,~3 in the Tier 1-0,~1.5-2 in the Tiers 1-2 and 3-2.
In its composition, the Tier 3 roughly corresponds to the target enrichment achievable in a benchmark (traditional) gene expression analysis based on T-test alone ( Figure 2). Low literature tag enrichment in consistency Tier 3 and in all tiers below Tier 3 questions the meaningfulness of the data not pre-cleaned by triple filtration of CDM or similar knowledge-based techniques. The data of Figure 2 show that the information most useful for mechanism-inferring purposes contained within Tier 1 and, in some case, Tier 0.

Functional enrichment analysis
Specific pathological mechanisms manifest by differential expression of genes that belong to just a few selected pathways. The effected functional categories may develop high and statistically significant enrichment coefficients in the changed gene list. The higher the extent of enrichment, the stronger is the link between the disease mechanism and the functional category of interest, pointing to greater specificity of the signal. Table 1 below combines expressionnormalized functional enrichment coefficients computed for Top 20 most enriched ontological categories for gene lists in the Tiers 0-3.
In general, the enrichment coefficients positively correlate with consistency scores, decreasing in the direction from Tier 1 (highest consistency) to Tier 3 (lowest consistency). The trend reversal from Tier 0 to Tier 1 can be explained by significant reduction (by 90%) of the total gene number in the Tier 0 as a result of expression normalization. In the Top 20 categories, the enrichment coefficients were in the range of 4.5-19, with a tendency to an upper side of the range. In non-normalized datasets, the stringent normalization by absolute expression masks the extent of functional enrichment as it may reach the values of~100 for Tier 0 and~40 for Tier 1, thus, being far above the typical values observed in traditional microarray experiments [28]. Thus, much higher distillation coefficient q of the model (1)-(2) may be reached. Based on comparison of the functional enrichments in CDM approach and the benchmark exemplified by [28], the increase in network-assisted capability to infer relevant mechanisms may be quite dramatic and certainly merits further study.

Biological network modeling
The molecules populating Consistency Tier 1 that was optimal in terms of balance between functional enrichment  and recall rate of the most relevant mechanistic participant entries were analyzed using Ingenuity Pathway Assistant (IPA) tool. The first order interactions were imputed automatically, the partners being the hubs of the cell signaling pathways in network proximity to the changed genes. The addition provided by the IPA network is valuable, since this feature partially compensates for low recall rate observed when the consistency criteria are applied. The distinctions between the sub-networks are algorithm-generated and therefore somewhat artificial. We retained for further analysis all significant hubs regardless of the sub-network they were assigned to by IPA. TGtotal genes in a given functional category within the total list T, CGchanged genes within the list L, ENRenrichment coefficient, LOG10(p)logarithm of p-value of one-sided Fisher's test of significance of a given ENR at a given group size, FDRfalse discovery rate. Venn Tier 1 + 2 is the combination of the Tier 1 and Tier 2, Venn Tier 1+ 2 + 3 is the combination of the Tiers 1, 2 and 3.
A sub-network 1 of the interaction network is shown in Figure 3 and the composition of the entire network is provided in Table 2.
According to the Table 2, the sub-networks 1 and 2 demonstrate significant scores associated with p-value of 10 -49 and 10 -47 , respectively, where the score being the probability that the genes associated at this extent of clustering were drawn randomly. Per IPA functional assignment, the highest score sub-network 1 corresponds to neurological diseases and comprises vesicle-forming components in agreement with GO-MINER analysis, validating the CDM approach from the point of internal consistency. As a control, a gene list sampling equal to Tier 1 in size and selected from the T-test enriched population (benchmark method, no CDM processing) was analyzed. The control list prioritized different sub-networks and emphasized the pro-inflammatory pathways, but not the neurological disease-specific sub-network as does CDM list (data not shown). The same trend persisted across the levels of detection consistency from T-test only and up to the Tier 3. The sub-network 2 comprises mostly cytoskeleton and mitochondrial components. The subnetwork 3 displays a score of 31 and comprises other components such as chaperones, ubiquitin pathway members and proteasome subunits. Sub-networks 4-9 were significant but displayed lower scores.
The abundance of oncogenes in the associated hub subset was remarkable. To quantify the extent of association with oncogenes, the symbol T was defined as Pubmed response to the gene symbol, assumed to be proportional to the total number of biological interactions mediated by the gene and its products. High T numbers correspond to the hubs of biological network.
To put it in a larger genomic context, a random sample of 118 gene names was extracted and the T-values were determined, producing 2 hits with 1000 < T < 5000, 2 hits with 5000 < T < 10000 and 1 hit with T > 10000. Based on this sampling and the total number of genes2 0000, an estimate of~500 hubs with T > 5000 was shown for the human interaction network. In the network sample associated with the Tier 1 consistently expressed gene list, 36 hubs of the comparable connectivity was present per 96 network associates. This is not a remarkable finding, considering that Ingenuity databank is likely biased in favor of hub enrichment. However, the finding that the ratio of oncogenes to tumor suppressors is skewed toward oncogenes is counterintuitive for a degenerative disease.
To assess the background ratio of oncogenes vs. suppressors, several databases were enquired. Search of OMIM (www.ncbi.nlm.nih.gov/omim) leads to 647 hits responding to the query [("oncogene or oncogenes")], while 882 hits responded to the queries [("tumor suppressor" or "tumor suppressors")]. These numbers correspond to~7:5 ratio of tumor suppressors to oncogenes in the global network. Similar search with the databases "Genes" and "Proteins" at NCBI produced~1:2 ratios. An analysis of the database GeneCards at www.genecards.org leads to the ratio~1:1 for the same queries. Table 2 Composition and scoring of Alzheimer's disease molecular sub-networks generated by IPA Molecules in network comprise both experimentally discovered molecules and known/predicted close interaction partners. The Score is a measure of clustering coefficient between the sub-network components. Focus Molecules are the differentially expressed gene products with experimental evidence of the linkage to disease pathogenesis and are denoted by capital letters, while small letters designate inferred interactions.
With that, an average ratio of~0.8:1 of tumor suppressor to oncogenes may be assumed as a random global control. This ratio markedly differs from the ratio observed in our data. Tumor suppressor/oncogene ratio in the hubs associated with neuropathy network was 2:9 (APC and TGFB1 as tumor suppressors, ERK1/2, AKT, MYC, FOS, AP-1, BCL2L1, HSP70, HSP90, RELA being progrowth and oncogenic, while the MAPK3, MAPK8 and MAPK14 were marked as having context dependent dual functions). Except APC, none of the hits in this T range was associated with "tumor suppressor" label, while AKT, MYC, FOS, AP-1, BCL2L1, RELA were denoted as "oncogenes".
We further tested if this oncogene association is limited to our data or is more general. A PubMed query [(Alzheimer's or Alzheimer or neuropathy or neuropathic or neuro-degeneration or neurodegeneration or neurodegenerative or dementia)] was further delimited by the keywords [("oncogene" or "oncogenes")] as well as [("tumor suppressor" or "tumor suppressors")]. The ratio of 3.4:1 was observed, while a control query [(disease or disorder)] produced 1.9:1 ratio. Similar queries [neuropathy or neurodegeneration] and [dementia or "cognitive decline"] produced the ratio 3.5:1 above the random~2:1, consistent with our data.
The high-T sub-population of network associates was segregated from the initial changed gene list (Tiers 0, 1-0, 2-1 combined) and all sub-populations underwent a similar analysis as above. Specifically, the corresponding gene lists were converted into Boolean queries and were delimited with "oncogene" and "tumor suppressor" terms. The Alzheimer's related genes were compared with a random gene sample. The random control and the initial (nontiered) list of differentially expressed genes demonstrated comparable oncogene/tumor suppressor hit ratios of 2:1, while the population of extracted network associates produced the hit ratio of 5.7:1. For sense of perspective, the corresponding ratio for oncogene BCL2-centered network was 4.6:1 and for tumor-suppressor-centered p53centered network was 1:2.2. Considering these ratios, the Alzheimer's disease network associates were as a group more oncogenic than the associates of BCL-2, considered to be a benchmark oncogene and this result is counterintuitive, considering the degenerative character of the disease.
In another analysis, the control query [disease or disorder] and [activation or activator] generated~125000 hits, while the query [disease or disorder] and [deactivate or deactivator or suppressor or repress or repressor] generated~25000 hits. The target query [(Alzheimer's or Alzheimer or neuropathy or neuropathic or neurodegeneration or neurodegeneration or neurodegenerative or dementia)] and [activation or activator] produced 17500 hits, while the query [(Alzheimer's or Alzheimer or neuropathy or neuropathic or neuro-degeneration or neurodegeneration or neurodegenerative or dementia)] and [deactivate or deactivator or suppressor or repress or repressor] produced~1300 hits. The ratios point to neuropathies being more preferentially associated with activation processes (compare 125000:25000 for the control and 17500:1300 for the neuropathies).
Thus, we conclude that an analysis of entire PubMed shows that the neuropathy-related information is more closely and paradoxically associated with oncogenes and activation than with tumor suppressors and deactivation, confirming the trend observed in our data.

A study of intersection for Alzheimer's Disease and Angiotensin receptor blocker response pathways
The limited number of mechanistically relevant members comprising CDM list allows aligning with the literature data covering downstream effects of Angiotensin receptor AT-1. Both the results in Table 2 of the current study and the AT-1 literature review indicate AT-1 related genes as likely to mediate the effect of ARBs (angiotensin receptor blockers) on Alzheimer's development.
Literature analysis points to significant interaction of AT-1 pathway with oncogene activation as well as with luteinizing hormone and insulin dependent pathways in neurons [30][31][32][33][34][35]. The role of oncogene modulation in response to AT1R blockers is complex, with some oncogenes being inhibited [30,31], while some being upregulated [32]. In the cases when the ARBs exhibit neuroprotective effects via c-JUN inhibition, levels of other oncogenes remain as they were and the overall impact of oncogenic activation could still be executed through collateral routes. An example of such collateral pathway is a compensatory increase in activity of oncogenic angiotensin II receptor II (AT-2)/MAS pathway after the blockade of angiotensin II receptor I (AT-1) [36]. In mouse model, the alleviation of Alzheimer's disease was experimentally achieved by hippocampal delivery of the oncogenic fibroblast growth factor FGF2 [37]. The predominance of neuroprotective effect in oncogene stimulation by ARBs is emphasized by ARB induction of IGF1, a molecule with a powerful anabolic and prosurvival impact [35]. Thus, the connection between AT1R inhibition and general activation of neuronal oncogenes is rather prominent in the body of research literature.
The LH/FSH regulation was previously linked to Alzheimer-like degeneration in murine models [38], thus, lending greater significance to stimulation of luteinizing hormone expression by angiotensin II that was previously observed in neurons [34].
Another group of entries in the higher score subnetworks 1 and 2 belongs to cytoskeleton rearrangement pathways. Cytoskeleton rearrangement related signaling that is the necessary step in vasoconstriction and vasodilation, a major short term effect of any anti-hypertensive drug. Angiotensin pathway is certainly involved in the pathways featured in the sub-networks 1 and 2 [39][40][41]. According to our data, the most abundant sub-networks of the Tier 1/ sub-network 1 are the regulators of G-protein signaling and G-proteins: GNAS, GNB5, GNG3, GPRASP2, RGS4, RGS6, RGS7. The vascular remodeling and vasodilating role of GRS2, GRS3, GRS5, GRS18 and GNB3 in regulating of vascular tonicity in the context of Angiotensin receptor (AT-1) signaling was described previously in [42][43][44][45][46]. Vascular tone is maintained by the cytoskeleton rearrangement and the intracellular motility, thus connecting it to the protein misfolding and/or defective chaperone complex formation.
To conclude, substantial literature evidence connects the CDM derived in the current report and organized in the network of Table 2 with Angiotensin Receptor I pathway or its blockers, in the neuronal setting.

Discussion
Inferring clinically relevant insights from the complex picture of the quantitative changes in gene expression levels remains a major challenge of systems biology. An interpretation of the disease signature remains the least standardized part of analytic procedures. In most cases, this analysis is riddled with subjective inference about whether given change in expression levels should be classified as causal, passively associated with observed phenotype or simply incidental to study design. Recent introduction of knowledge-based algorithms is expected to aid in producing reasonable hypotheses linking altered pathways to phenotypic changes. In this study, we attempted to devise knowledge-based algorithm capable of generating network clustering-validated, highly prioritized shortlists of potential targets pertinent to pathogenesis, the Compact Disease Models, or CDMs. To generate CDMs, we assume that molecular targets pertinent to pathogenesis of certain chronic disease may be recognized by their consistent visibility (differential expression, association of SNPs, functional evidence etc.) across most of independently designed experiments. In other words, a molecular target highlighted in a majority of studies (high-prevalence target) is more likely to be mechanistically important than the target detected in a minority of studies (low-prevalence target), although this relationship is may be not so straightforward [17]. The genes prioritized by relatively simple approach described in our study are later validated by assessing the reproducibility of the findings across other technological platforms. Only the strongest signals prevalent in all sample subsets ("compartments") would pass such a rigorous filter. We believe that cross-validated, highly prevalent molecules are more likely to reflect more fundamental aspects of the disease and stand closer to causation. If the high-prevalence targets also form a highly clustered network, this association cross-validates substantial role for each of its constituents in the disease etiology, by compounding the difference in the probability of being truly causative between more and less prevalent signals.
In this report, Alzheimer's Compact Disease Model (CDM) was built using both Illumina and Affymetrix platforms through extraction of differential expression data followed by tiering the gene expression evidence by its consistency. Both microarray platforms rely on oligonucleotide multi-probe approach; however, the experimental workflow, probe length, probe choice and signal processing statistics between the two platforms substantially differ [47,48]. In our approach, this interplatform discrepancy is expected to serve as a filter that eliminates the signals that display poor consistency due to low reproducibility of expression level changes or due to elevated person-to-person expression variability, thus, cutting out the probability to detect meaningless (i.e. false positive) signals.

Entropic disease model and pleiotropic role of angiotensin receptor blockers
Based on the observations of the current report, a pleiotropic model of early-stage Alzheimer disease could be proposed.
From very general considerations, a rigid differentiation program and complex shape of the neuron makes it an inherently disadvantaged cell type. Thermodynamically, expression of a gene in the nucleus that is followed by long route of the delivery of resultant protein to the target site at the synaptic junction either in a folded or properly pre-folded state is unfavorable due to high Boltzmann entropy loss associated with long processes. In the neurons, the travel distances may reach 0.5 -1 m; the maintenance of properly folded protein requires costly coordination of its intracellular traffic with the chaperone assembly sites and migration of the chaperone-protein complex to the destination. The high Boltzmann entropy loss of the process has to be matched by a high influx of free energy in the protein traffic path, derived in sufficient stimulation of anabolic and trophic pathways. This fundamental understanding is in agreement with our findings that the pathways jointly implicated in both blood pressure control and neurodegeneration are mostly anabolic and pro-survival. When anabolic pathways become down-regulated in CNS due to aging, neurotoxicity or mutation, it takes its toll on energy balance within the cell and increases the risk of misfolding. Thus, the long-term sustainability of anabolic processes in the neurons may be favored by regular anti-hypertensive treatments that assist cell survival.
In this balance of energy, the state of cytoskeleton organization determines the Boltzmann entropy loss. Disorganized cytoskeleton has higher initial entropy and the required intra-neuronal coordination would impose higher entropy costs. Thus, the intensity of anabolic processes is not the only factor determining energy supply for proper protein folding and trafficking. The luteinizing and follicle stimulating hormones (LH and FSH) both regulate menstrual cycle in females and spermatogenesis in males serving as upstream stimulators of androgen and estrogen production. Importantly, gonadotropins were found to be involved in the earliest stages of Alzheimer's disease and in memory-related processes in humans and in multiple murine models [49,50]. Non-pituitary expression of FSH and its co-localization with FSH receptor and GnRH receptor in rat cerebellar cortex was shown previously [51]. Brain regions susceptible to degeneration in AD are enriched in both LH and its receptor; moreover, in animal models of AD, pharmacologic suppression of LH and FSH reduced plaque formation [52]. As both the oocytes meiosis that is triggered by LH and the process of spermatogenesis that is initiated by FSH require extensive cytoskeleton remodeling [53,54], it is tempting to speculate that the Boltzmann entropy state of neuronal cytoskeleton may be, in part, dependent on FSH and LH stimulation, possibly through G-protein activation. Respectively, G-protein regulators form a tightly connected cluster around FSH and LH nodes (Figure 3).
It is generally accepted that the probability of any chronic disease of old age increases in parallel with an increase in genomic entropy that degrades the complexity of epigenetic landscapes [55]. Age-dependent demethylation of the genome leads to an increase in the transcription of non-coding RNAs, while CpG-rich 5' regions of select genes may become hypermethylated [56]. In case of neurons, the global hypomehylation and site-specific hypermethylation was found to be associated with degenerative and psychotic diseases [57,58]. In agreement with these observations, our data point to an overall decrease in transcript expression levels in the most of the functional categories showing high enrichment coefficients by GO-MINER. In some form, the down-regulation bias was traced among~200 members of the Tiers 0-2 and~1300 members of consistency Tier 3 and PCS groups. It is possible that this phenomenon is reflected by negative downstream changes in the stability of RNA transcripts and proteins, efficiency of translation and posttranslational modifications and, again, protein folding and trafficking. An exception to this trend is prominent up-regulation of NF-kB pathway (Figure 3, NF1C), that is involved in inflammation, cellular stress and apoptosis.
Another chromatin remodeling associated pathway is insulin signaling (Table 2). Importantly, IGF1 pathway is implicated in both life-span control and antihypertensive response. For example, a protective hormone Klotho, a competitive antagonist of IGF1 in kidney, is known to reverse degenerative nephropathies in murine models, and, as well, shown as downregulated in aging primates through chromatin methylation [59]. Interestingly, an inhibition of angiotensin II signaling by counteracting expression of IGF-II receptor is also shown to up-regulate Klotho [60,61]. Taken together, these data suggest a potential of antihypertensive agents to oppose the longterm age-related chromatin remodeling.

Literature validation of the entropic model built based on CDM filtered gene list
The principle assumption of our study is that the pathways that relevant to the disease mechanism should be consistently discovered in a number of independent studies. Many pathways highlighted by our enrichment strategy were also described as experimental findings relevant to the context of early stages of Alzheimer's disease, or, in general, the process of neurodegeneration. Extreme functional enrichment for protein traffic vesicle proteins observed in the CDM dataset points to a substantial role of this mechanism in the neurodegeneration, and is likely to be an early pathogenetic event. Some experimental reports confirm impairment of protein vesicle traffic in early stages of neurodegeneration [62,63]. The body of literature that discusses protein traffic vesicles in the context of neurodegeneration is relatively small and recent, as compared to more common and more general discussions of cytoskeleton and heat shock protein involvement in Alzheimer's disease. Hence, we may conclude that the proposed CDM building technique aids the acquisition of relatively novel mechanistic insights underrepresented in broader literature.
Another important finding of the report is abundance of oncogenes in the Alzheimer's disease interaction network built around the CDM gene list cut-off at a Tier 1 consistency. The independent literature search uncovers numerous publications describing the connection of oncogenes to improper, but possibly compensatory reactivation of cell cycle in terminally differentiated neurons that eventually leads to a cell death [64][65][66]. An alternative hypothesis points to the fact that patients with Alzheimer's disease have lower risk of incident cancer than general population [64,67,68]. One explanation to that paradox is a mitochondrial disfunction that is both implicated in early stages of Alzheimer's disease development and impacted by the oncogene-tumor suppressor balance [66][67][68]. The connection between oncogene activation and bioenergy available to a neuron appears to be well described in the literature, in agreement with the conclusions of CDM-based analysis. The mechanistic support to the bioenergetic view of oncogene role over improper re-activation of cell cycle is provided by much higher score rank of the Ingenuity sub-network 2 ( Table 2) comprising mitochondrial ATP-ase subunits vs. sub-networks 8 and 9, comprising cell cycle components.
Additionally, the CDM gene list analysis uncovers the prominent role of follicule-stimulating hormone, luteinizing hormone and gonadotropin in the development of early Alzheimer's. The independent literature search presents evidence of increased expression of LH in the neurons vulnerable to Alzheimer's disease [69]. In aged transgenic mice with Alzheimer-type of brain degeneration (Tg 2576), an ablation of the luteinizing hormone by a gonadotropin-releasing hormone analogue leuprolide acetate significantly attenuated cognitive decline and decreased amyloid-beta deposition as compared to placebo-treated animals [52]. Hence, the data presented in [69] and [52] and supported by CDM model indicate an involvement of FSH/LH pathway in Alzheimer's.
The gene list distilling steps and its subsequent compacting are crucial to CDM-based hypothesis generation. If these steps would be omitted, the resultant CDM would be represented by impractically large gene network, dominated by the non-specific pathways common for many pathologies. In Alzheimers, non-compacted gene lists are dominated by stress response and inflammatory pathways marked as the highest scoring sub-networks. Without denying the aggravating role of inflammation in Alzheimer's disease, inflammation abating approaches are unlikely to produce sustainable therapeutic results as they target relatively late stages of pathogenesis. Importantly, the distillation of the gene list into compact model (CDM) introduces an opportunity to catch a glimpse at possibly causative mechanistic alternatives that otherwise would took years to uncover through hypothesis-driven experimental studies that tend to look after overall plausibility of possible findings at the stage of the study design. While some models caution us against too stringent cutoffs for initial CDM composition [17], milder cutoffs that are combined with a cross-platform analysis appear to be a promising direction that requires further efforts.