Integrated network analysis of transcriptomic and proteomic data in psoriasis

Background: Psoriasis is a complex inflammatory skin pathology of autoimmune origin. Several cell types are perturbed in this pathology, and the underlying signaling events are complex and still poorly understood.
Results: To gain insight into the molecular machinery underlying the disease, we conducted a comprehensive meta-analysis of proteomics and transcriptomics data on psoriatic lesions from independent studies. Network-based analysis revealed similarities in regulation at both the proteomics and transcriptomics levels. We identified a group of transcription factors responsible for overexpression of psoriasis genes and a number of previously unknown signaling pathways that may play a role in this process. We also evaluated functional synergy between transcriptomics and proteomics results.
Conclusions: We developed a network-based methodology for integrative analysis of high-throughput data sets of different types. Investigation of proteomics and transcriptomics data sets on psoriasis revealed versatility in the regulatory machinery underlying the pathology and showed complementarity between the two levels of cellular organization.


Background
Psoriasis vulgaris is one of the most prevalent chronic inflammatory skin diseases, affecting approximately 2% of individuals in Western societies and found worldwide in all populations. Psoriasis is a complex disease that affects the cellular, gene and protein levels and presents as skin lesions. The skin lesions are characterized by abnormal keratinocyte differentiation, hyperproliferation of keratinocytes, and infiltration of inflammatory cells [1,2]. The mechanisms of psoriasis pathology are complex and involve genetic and environmental factors. As we gain more knowledge about molecular pathways implicated in the disease, novel therapies emerge (such as etanercept and infliximab that target TNF-α or CD11a-mediated pathways [3,4]). In recent years, microarray mRNA expression profiling [5][6][7][8] of lesional psoriatic skin revealed over 1,300 differentially expressed genes. Enrichment analysis (EA) showed that these genes encode proteins involved in regeneration, hyperkeratosis, metabolic function, immune response, and inflammation, and revealed a number of modulating signaling pathways. These efforts may help to develop new-generation drugs.
However, enrichment analysis limits our understanding of altered molecular interactions in psoriasis, as it provides a relative ranking based on ontology terms, resulting in a fragmented and disconnected representation of perturbed pathways. Furthermore, analysis of gene expression alone is not sufficient for understanding the whole variety of pathological changes at different levels of cellular organization. Indeed, new methodologies have been applied to the analysis of OMICs data in complex diseases, including algorithm-based biological network analysis [9][10][11][12][13] and meta-analysis of multiple datasets of different types [14][15][16][17][18][19]. Here, we applied several techniques of network and meta-analysis to reveal the similarities and differences between transcriptomics- and proteomics-level perturbations in psoriasis lesions. We particularly focused on revealing novel regulatory pathways that play a role in psoriasis development and progression.

Skin biopsies
Acquisition of the human tissue was approved by the review board of the Vavilov Institute of General Genetics of the Russian Academy of Sciences, and the study was conducted with the patients' consent and according to the Declaration of Helsinki Principles.
A total of 6 paired nonlesional and lesional (all plaque-type) skin biopsies from 3 psoriatic patients were profiled using 2D electrophoresis. All the donors who gave biopsy tissue (both healthy controls and individuals with psoriasis) provided written informed consent for the tissue to be taken and used in this study. Clinical data for all patients are listed in Additional file 1.
Full-thickness punch biopsies were taken from uninvolved skin (at least 2 cm distant from any psoriatic lesion; 6 mm diameter) and from the involved margin of a psoriatic plaque (6 mm diameter) from every patient.

Gel image analysis
Protein spots on 2-DE gels were visualized by silver staining [21] and scanned at a resolution of 300 dpi. Images were analyzed using Melanie III software (GeneBio, Switzerland). The conventional analysis involved (i) determination of protein spot relative volume (%Vol), expressed as the sum of pixel intensities in a given spot divided by the sum of pixel intensities in all spots on the gel; (ii) gel alignment; and (iii) spot matching. Sets of %Vol values for every spot were then processed by Student's t-test to determine whether the level of a given protein differed significantly between two specified groups.
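The %Vol definition and the group comparison can be illustrated with a minimal Python sketch. It is not the Melanie III implementation: the function names are hypothetical, the ratio is scaled by 100 on the assumption that %Vol is reported as a percentage, and the t statistic shown is the unpaired, pooled-variance form, which is one possible reading of "Student's t-test".

```python
from math import sqrt
from statistics import mean, variance


def relative_volumes(spot_intensities):
    """Return %Vol per spot: summed pixel intensity of the spot over the gel total.

    Assumption: the ratio is multiplied by 100 to express it as a percentage.
    """
    total = sum(spot_intensities.values())
    return {spot: 100.0 * value / total for spot, value in spot_intensities.items()}


def student_t(group_a, group_b):
    """Two-sample Student t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))
```

The statistic would then be compared against a t distribution with na + nb - 2 degrees of freedom to obtain a p-value.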

Protein identification by MALDI-TOF mass-spectrometry
Protein spots were cut out (~3 mm³) from 2-DE gels, destained, and digested in-gel with trypsin. Mass spectrometry of trypsin-digested proteins (spots No. 1, 2, 3, 4) was performed using a Microflex MALDI-TOF mass spectrometer (Bruker, Germany). Peptide samples (0.2-1 μl) were mixed with an equal volume of 2,5-dihydroxybenzoic acid solution (20 mg/ml; Sigma, USA) in 20% acetonitrile and 0.1% trifluoroacetic acid, and the resulting droplets were dried in air. Mass spectra were obtained for the mass range from 800 to 4000 daltons in reflector mode and calibrated using internal standards (trypsin autolysis peaks, MH+ 1046.54 and 2212.10 daltons). Peptide peak lists were generated by the SNAP algorithm (XMass software, Bruker). Proteins were identified using the Mascot database search engine. The search parameters were as follows: mass tolerance 100 ppm, NCBI protein sequence database, Homo sapiens taxon, one missed cleavage, and variable modifications of propionamide for cysteines and oxidation for methionines.
Low molecular weight proteins (9-20 kDa, spots No. 5, 6, 7, 8, 9, 10) were identified using nanoLC-MS/MS mass spectrometry, which is more convenient given the lack of cleavage sites in such proteins. Analysis of the trypsin digest was performed on an electrospray ion trap (XCT Ultra Ion Trap Chip Cube 6330 series, Agilent) equipped with a chipcube head. One μl of each sample was loaded (flow rate 3 μl/min) onto a reverse-phase in-chip column (40 nl capacity) for 10 minutes under isocratic buffer (5% acetonitrile in 0.1% TFA). Following sample application, peptides were separated by a linear gradient (0.3 μl/min) of solution A (0.1% TFA/water) and solution B (80% ACN/0.1% TFA/water) for 60 minutes. Spectrum scanning was repeated three times for each sample of protein gel spots and tissue hydrolysate. Mass spectra of eluted peptides were obtained simultaneously under positive polarity for the 425-1325 m/z range in both MS and MS/MS mode, with 2.1 kV applied, accumulation of 85,000 ions for 50 milliseconds, and averaging over 2 spectra. Mass spectra were processed with Spectrum Mill MS Proteomics Workbench software (Agilent). Proteins were identified using the SwissProt Human Database with the following parameters: score ≥ 7 for peptide and ≥ 20 for protein, minimum S/N ratio 20, maximum peptide ion charge +4, precursor mass tolerance ± 2.5 Da, and product mass tolerance ± 0.7 Da. Protein identification required detection of at least 3 peaks of the same peptide ion, with a maximum of 2 missed cleavages.

Microarray data analysis
We used a recently published data set [22] from the GEO database (http://www.ncbi.nlm.nih.gov/geo/; accession number GSE14095). We compared 28 pairs of samples (each pair comprised a sample of lesional skin and a sample of healthy skin from the same patient). Values for each sample were normalized by the sample median value in order to unify the distributions of expression signals. To assess differential expression, we used a paired Welch t-test with FDR correction [23]. A probe set was considered differentially expressed if its average fold change exceeded 2.5 and its FDR-corrected p-value was less than 0.01.
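The normalization and selection steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: normalization is taken to mean division by the sample median on a linear scale, the FDR correction of [23] is assumed to be the Benjamini-Hochberg step-up procedure, the per-probe p-values are assumed to have been computed beforehand, and all function names are hypothetical.

```python
from statistics import median


def median_normalize(sample):
    """Divide every expression value in one sample by that sample's median."""
    m = median(sample)
    return [value / m for value in sample]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjustment.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_upregulated(fold_changes, pvalues, fc_cutoff=2.5, fdr_cutoff=0.01):
    """Indices of probe sets with fold change > 2.5 and adjusted p-value < 0.01."""
    adjusted = benjamini_hochberg(pvalues)
    return [i for i, (fc, q) in enumerate(zip(fold_changes, adjusted))
            if fc > fc_cutoff and q < fdr_cutoff]
```

The two cut-offs are the ones stated in the text; everything else is a placeholder for whatever the original pipeline used.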

Overconnection analysis
All network-based analyses were conducted with the MetaCore software suite (http://www.genego.com). This software employs a dense, manually curated database of interactions between biological objects and a variety of tools for functional analysis of high-throughput data.
We defined a gene as overconnected with the gene set of interest if the corresponding node had more direct interactions with the nodes of interest than would be expected by chance. Significance of overconnection was estimated using the hypergeometric distribution with parameters r, the number of interactions between the examined node and the list of interest; R, the degree of the examined node; n, the sum of interactions involving genes of interest; and N, the total number of interactions in the database:

$$p(r, R, n, N) = \sum_{i=r}^{\min(n, R)} \frac{\binom{R}{i}\binom{N-R}{n-i}}{\binom{N}{n}}$$
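As a minimal sketch, the overconnection p-value can be computed as the upper tail of a hypergeometric distribution over the four parameters r, R, n and N defined above. The function name is hypothetical, and exact integer arithmetic via `math.comb` is a design choice for clarity; for database-scale values of N a log-space implementation would likely be preferable.

```python
from math import comb


def overconnection_pvalue(r, R, n, N):
    """Probability of observing at least r interactions with the list of interest.

    r: interactions between the examined node and the list of interest
    R: degree of the examined node
    n: sum of interactions involving genes of interest
    N: total number of interactions in the database
    """
    upper = min(n, R)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the tail.
    tail = sum(comb(R, i) * comb(N - R, n - i) for i in range(r, upper + 1))
    return tail / comb(N, n)
```

A node would be called overconnected when this value falls below the chosen significance threshold.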

Hidden nodes analysis
In addition to directly interacting objects, we also considered objects that may not interact directly with the objects of interest but are important upstream regulators of them [24]. The approach is generally the same as described above, but shortest paths instead of direct links are taken into account. As we were interested in transcriptional regulation, we defined a transcriptional activation shortest path as the preferred shortest path from any object in the MetaCore database to a transcription factor target object from the data set. We added the condition that the path contain an even number of inhibiting interactions (this is required for the path to have a net activating effect). If the number of such paths containing the examined gene and leading to one of the objects of interest was higher than expected by chance, this gene was considered a significant hidden regulator. The significance of a node's importance was estimated using the hypergeometric distribution with parameters r, the number of shortest paths containing the currently examined gene and leading to a gene of interest; R, the total number of shortest paths leading to a gene of interest through a transcription factor; n, the total number of transcription activation shortest paths containing the examined gene; and N, the total number of transcription activation shortest paths in the database.
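The activation condition on a path can be made concrete with a small sketch. It assumes that each inhibiting interaction flips the sign of the transmitted signal, so a path is net activating when its inhibitions cancel in pairs; the effect labels and the function name are hypothetical and do not reflect the MetaCore data model.

```python
def is_net_activating(path_effects):
    """Return True if a path of signed interactions has a net activating effect.

    path_effects: sequence of "activation" or "inhibition" labels, one per step.
    Assumption: every inhibition inverts the signal, so an even count
    (including zero) leaves the overall effect activating.
    """
    inhibitions = sum(1 for effect in path_effects if effect == "inhibition")
    return inhibitions % 2 == 0
```

For example, a regulator that inhibits an inhibitor of a transcription factor is counted as activating that factor.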

Rank aggregation
Both topology significance approaches produced lists of genes significantly linked to a gene or protein set of interest, ranked by the corresponding p-values. To combine the results of these two approaches, we used the weighted rank aggregation method described in [25]. Weighted Spearman distance was used as the distance measure, and a genetic algorithm was employed to select the optimal aggregated list of size 20. This part of the work was carried out in R 2.8.1 (http://www.r-project.org).
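The idea of rank aggregation can be illustrated with a deliberately simplified Python sketch. It departs from the method of [25] in two labeled ways: it uses the unweighted Spearman footrule distance rather than the weighted one, and it replaces the genetic algorithm with an exhaustive search, which is only feasible for toy inputs. Items absent from a list are assigned the rank just past its end, which is an assumption of this sketch. All names are hypothetical.

```python
from itertools import permutations


def footrule_distance(candidate, ranked_list):
    """Unweighted Spearman footrule distance between two top-k lists."""
    pos_c = {gene: i + 1 for i, gene in enumerate(candidate)}
    pos_l = {gene: i + 1 for i, gene in enumerate(ranked_list)}
    missing_c = len(candidate) + 1    # rank given to genes absent from candidate
    missing_l = len(ranked_list) + 1  # rank given to genes absent from ranked_list
    genes = set(candidate) | set(ranked_list)
    return sum(abs(pos_c.get(g, missing_c) - pos_l.get(g, missing_l)) for g in genes)


def aggregate(ranked_lists, k):
    """Exhaustively pick the ordered list of size k closest to all input lists."""
    universe = sorted(set().union(*ranked_lists))
    return min(permutations(universe, k),
               key=lambda cand: sum(footrule_distance(cand, lst) for lst in ranked_lists))
```

With lists of realistic size, the search space is far too large for enumeration, which is why a stochastic optimizer such as a genetic algorithm is used in practice.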

Network analysis
In addition to topology analysis, we examined overexpressed genes and proteins using various algorithms for selecting connected biologically meaningful subnetworks enriched with objects of interest. Significance of enrichment is estimated using hypergeometric distribution.
We first used an algorithm intended to find regulatory pathways that are presumably activated under pathological conditions. It defines a set of transcription factors that directly regulate genes of interest and a set of receptors whose ligands are in the list of interest, and then constructs a series of networks, one for each receptor. Each network contains all shortest paths from a receptor to the selected transcription factors and their targets. This approach allows us to reveal the most important areas of regulatory machinery affected under the investigated pathological condition. Networks are sorted by enrichment p-value.
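Collecting all shortest paths from a receptor to a transcription factor can be sketched with a breadth-first search over a directed graph. This is a generic illustration, not the MetaCore implementation; the adjacency-dictionary representation and the function name are assumptions.

```python
from collections import deque


def all_shortest_paths(graph, source, target):
    """Enumerate every shortest directed path from source to target.

    graph: dict mapping a node to an iterable of its downstream neighbours.
    """
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                # First visit: record distance and the first shortest-path parent.
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                # Another parent lying on an equally short path.
                parents[nxt].append(node)
    if target not in dist:
        return []

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return build(target)
```

A receptor-centred network would then be the union of these paths over all selected transcription factors, extended by the factors' targets.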
The second algorithm was aimed at defining the most influential transcription factors. It considers a transcription factor from the database and gradually expands the subnetwork around it until it reaches a predefined threshold size (we used networks of 50 nodes). Networks are sorted by enrichment p-value.
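The expansion step can be sketched as follows. The text does not specify the order in which neighbours are added, so breadth-first growth from the seed transcription factor is an assumption of this sketch, as are the function name and the adjacency-dictionary input.

```python
from collections import deque


def expand_subnetwork(graph, seed, max_nodes=50):
    """Grow a subnetwork around a seed node until it holds max_nodes nodes.

    Assumption: nodes are added in breadth-first order from the seed.
    """
    selected = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(selected) < max_nodes:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                selected.append(nxt)
                queue.append(nxt)
                if len(selected) == max_nodes:
                    break
    return selected
```

Each resulting subnetwork would then be scored for enrichment with objects of interest, and the networks ranked by that p-value.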
The proteins belonged to a diverse set of pathways and processes. Keratin 17, keratin 14, and keratin 16 are members of the keratin gene family. The keratins are intermediate filament proteins responsible for the structural integrity of epithelial cells. SERPINB4 and SERPINB3 are serine protease inhibitors that modulate the host immune response against tumor cells. Enolase 1, more commonly known as alpha-enolase, is a glycolytic enzyme expressed in most tissues and one of the isozymes of enolase. Superoxide dismutase 2 (SOD2) transforms toxic superoxide, a byproduct of the mitochondrial electron transport chain, into hydrogen peroxide and diatomic oxygen. Galectins are a family of beta-galactoside-binding proteins implicated in modulating cell-cell and cell-matrix interactions. Differential and in situ hybridizations indicate that this lectin (galectin-7) is specifically expressed in keratinocytes. Its cellular localization and its striking down-regulation in cultured keratinocytes imply a role in cell-cell and/or cell-matrix interactions necessary for normal growth control. S100A9 and S100A7 proteins are localized in the cytoplasm and/or nucleus of a wide range of cells and are involved in the regulation of a number of cellular processes such as cell cycle progression and differentiation. S100A7 is markedly overexpressed in the skin lesions of psoriatic patients.
We attempted to connect the proteins into a network using a collection of over 300,000 manually curated protein interactions and several variants of "shortest path" algorithms implemented in the MetaCore suite [30] (Figure 2, see Methods for details). The genes encoding overabundant proteins were found to be regulated by several common transcription factors (TFs) including members of the NF-kB and AP-1 complexes, STAT1, STAT3, c-Myc and SP1. Moreover, the upstream pathways activating these TFs were initiated by the overabundant S100A9 through its receptor RAGE [31] and signal transduction kinases (JAK2, ERK, p38 MAPK). This network also included a positive feedback loop, as S100A9 expression was determined to be controlled by NF-kB [32]. The topology of this proteomics-derived network was confirmed by several transcriptomics studies [33][34][35][36][37][38] which showed overexpression of these TFs in psoriasis lesions. Transiently expressed TFs normally have low protein levels and, therefore, usually fail to be detected by proteomics methods.
Figure 1 Representative silver-stained 2DE gel images of lesional and uninvolved skin biopsy lysates. a) Gel image of lesional skin biopsy lysate; b) gel image of uninvolved skin biopsy lysate. Spots corresponding to proteins overexpressed in lesions are marked with red rectangles and numbered. Spot 1 corresponds to 3 proteins of the keratin family, spot 2 to SCCA2, spot 3 to SCCA1, spot 4 to enolase 1, spot 5 to SOD2, and spot 6 to galectin-7. S100A7 is found in spots 7 and 8, and S100A9 corresponds to spots 9 and 10.
The RAGE receptor is clearly the key regulator in this network and plays the major role in orchestrating the observed changes in protein abundance. This protein is found in both keratinocytes and leukocytes, though normally its expression is low [39]. RAGE participates in a range of processes in these cell types, including inflammation. It is being investigated as a drug target for treatment of various inflammatory disorders [40]. Thus, we propose that RAGE may also play a significant role in psoriasis.

Differentially expressed genes
We used an Affymetrix gene expression data set from a recent study [22] involving 33 psoriasis patients. Originally, more than 1300 probe sets were found to be upregulated in lesions as compared with nonlesional skin of the same individuals. We identified 451 genes overexpressed in lesional skin under more stringent statistical criteria: 28 samples of lesional skin were matched with their nonlesional counterparts from the same patients in order to exclude individual expression variations, and genes with fold change > 2.5 and FDR-adjusted p-value < 0.01 were considered upregulated. The list of overexpressed genes can be found in Additional file 2. The genes encoding 7 out of 10 proteomic markers were overexpressed, consistent with the proteomics data. Expression of Enolase 1, Keratin 14 and Galectin 7 was not altered.

Common transcription regulation for overexpressed genes and differentially abundant proteins
Despite good consistency between the proteomics and expression datasets, the two orders of magnitude difference in list size makes direct correlation analysis difficult. Therefore, we applied interactome methods to analyze common upstream regulation of the two datasets at the level of transcription factors. First, we defined the sets of the most influential transcription factors using two recently developed methods: interactome analysis [41] and the "hidden nodes" algorithm [42]. The former method ranks TFs based on their one-step overconnectivity with the dataset of interest compared to the randomly expected number of interactions. The latter approach takes into account direct and more distant regulation, calculating the p-values for local subnetworks by an aggregation algorithm [42]. We calculated and ranked the top 20 TFs for each data type and added several TFs identified by network analysis approaches (Table 2). The TFs common to both data types were taken as the set of 'important pathological signal transducers' (Figure 3). Notably, they closely resemble the set of TFs regulating the protein network in Figure 2.

Identification of influential receptors
In the next step, we applied the "hidden nodes" algorithm to identify the most influential receptors that could trigger the maximal possible transcriptional response. In total, we found 226 membrane receptors significantly involved in regulation of 462 differentially expressed genes ('hidden nodes' p-value < 0.05). The complete list of receptors can be found in Additional file 3. Because topological significance alone does not necessarily prove that all receptors are involved in real signaling or are even expressed in the sample, we filtered this list by expression performance. The receptors retained were those whose encoding genes or corresponding ligands were overexpressed more than 2.5-fold. We assumed that the pathways initiated by overexpressed receptors and ligands are more likely to be activated in psoriasis. Here we assumed that expression alterations and protein abundance are at least collinear. An additional criterion was that the candidate receptors had to participate in the same signaling pathways as at least one of the common TFs. No receptor was rejected based on this criterion. In total, 44 receptors passed the transcription cut-off. Of these, 24 receptor genes were overexpressed, 23 had overexpressed ligands, and 3 had overexpression of both ligands and receptors (IL2RB, IL8RA and CCR5; see Figures 4 and 5 and Additional file 4). Interestingly, for several receptors, more than one ligand was overexpressed (Figure 4). Several receptors are composed of several subunits, only one of which was upregulated (for example, the IL-2 receptor has only the gamma subunit gene significantly upregulated).
Table fragment: Protein S100-A9 (S100 calcium-binding protein A9), S100A9, lesion only; Protein S100-A7 (S100 calcium-binding protein A7) (Psoriasin), S100A7, lesion only.
Of the 44 receptors we identified by topology analysis, 21 were previously reported as psoriasis markers (they are listed in Table 3 with corresponding references). The other 23 receptors were not reported to be linked to psoriasis or known to be implicated in other inflammatory diseases. These receptors participate in different cellular processes (development, cell adhesion, chemotaxis, apoptosis and immune response) (Table 3).
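The expression-based filter applied to the topologically significant receptors can be sketched as below. The data structures, identifiers and function name are hypothetical; only the 2.5-fold cut-off and the "receptor gene or ligand overexpressed" rule come from the text.

```python
def filter_receptors(candidates, fold_change, ligands, cutoff=2.5):
    """Keep receptors whose own gene or at least one ligand exceeds the cut-off.

    candidates: receptor identifiers that passed the topological test
    fold_change: dict mapping gene identifier to lesional/nonlesional fold change
    ligands: dict mapping receptor identifier to its ligand identifiers
    Returns a dict receptor -> (receptor_overexpressed, ligand_overexpressed).
    """
    kept = {}
    for receptor in candidates:
        receptor_up = fold_change.get(receptor, 0.0) > cutoff
        ligand_up = any(fold_change.get(ligand, 0.0) > cutoff
                        for ligand in ligands.get(receptor, ()))
        if receptor_up or ligand_up:
            kept[receptor] = (receptor_up, ligand_up)
    return kept
```

Counting the two flags separately reproduces the bookkeeping in the text: receptors with an overexpressed gene plus receptors with an overexpressed ligand, minus those with both, gives the total (24 + 23 - 3 = 44).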

Discussion
Meta-analysis of multiple OMICs data types and studies is becoming an important research tool in understanding complex diseases. Several methods were developed for correlation analysis between datasets of different types, such as mRNA and proteomics [18,[43][44][45][46]. However, there are many technological challenges to resolve, including mismatching protein IDs and mRNA probes, fundamental differences in OMICs technologies, and differences in experimental set-ups in studies done by different groups [47]. Moreover, biological reasons such as differences in RNA and protein degradation processes also contribute to the variability of different data types. As a result, transcriptome and proteome datasets usually show only weak positive correlation, although they are considered complementary. More recent studies focused on functional similarities and differences observed for different levels of cellular organization and reflected in different types of OMICs data [48][49][50][51]. For example, common interacting objects were found for distinct altered transcripts and proteins in type 2 diabetes [52]. In one leukemia study [53], the authors found that distinct alterations at the transcriptomics and proteomics levels reflect different sides of the same deregulated cellular processes.
The overall concordance between mRNA and protein expression landscapes was addressed in earlier studies, although the data types were compared mostly at the gene/protein level with limited functional analysis [14,47]. Later, ontology enrichment co-examination of transcriptomics and proteomics data showed that the two data types affect similar biological processes and are complementary [49,53,54]. However, the key issue of biological causality and functional consequences of distinct regulation events at both mRNA and protein levels of cellular organization has not yet been specifically addressed. These issues cannot be resolved by low-resolution functional methods like enrichment analysis. Instead, one has to apply more precise computational methods such as topology and biological networks, which take into consideration directed binary interactions and multi-step pathways connecting objects between datasets of different types regardless of their direct overlap at the gene/protein level [12,13]. For example, topology methods such as "hidden nodes" [24,41] can identify and rank the upstream regulatory genes responsible for expression and protein level alterations, while network tools help to uncover the functional modules most affected in the datasets, identify the most influential genes/proteins within the modules and suggest how specific modules contribute to clinical phenotype [10,52].
Table 2 footnote: Some genes were considered significant by only one of the two topological approaches (this is evident for proteomics data, where the low number of proteins limits the capabilities of topological analysis). A missing p-value means that the corresponding gene did not pass the 0.05 significance threshold and was listed among the top factors only due to a low p-value for the other topological approach. Only one p-value was determined in the case of network analysis (the enrichment p-value for the network built around the seed transcription factor). Transcription factors common to the proteomics and transcriptomics levels are in bold text.
In this study, we observed substantial direct overlap between transcriptomics and proteomics data, as 7 out of 10 overabundant proteins in psoriasis lesions were encoded by differentially overexpressed genes. However, the two orders of magnitude difference in dataset size (462 genes versus 10 proteins) made the standard correlation methods inapplicable. Besides, proteomics datasets display a systematic bias toward abundant proteins, favoring "effector" proteins such as structural, inflammatory, and core metabolism proteins over the transiently expressed and rapidly degraded signaling proteins. Therefore, we applied topological network methods to identify common regulators for the two datasets, such as the most influential transcription factors and receptors. We have identified some key regulators of the "proteomics" set among differentially expressed genes, including transcription factors, membrane receptors and extracellular ligands, thus reconstructing upstream signaling pathways in psoriasis. In particular, we identified 24 receptors previously not linked to psoriasis.
Importantly, many ligands and receptors defined as putative starts of signaling pathways were activated by transcription factors in the same pathways, indicating positive regulatory loops activated in psoriasis. The versatility and variety of signaling pathways activated in psoriasis is also striking, as is evident from the differential overexpression of 44 membrane receptors and ligands in skin lesions. This complexity and redundancy of psoriasis signaling likely contributes to the inefficiency of current treatments, even novel therapies such as monoclonal antibodies against TNF-α and IL-23. Thus, the key regulator, the RAGE receptor, triggers multiple signaling pathways which stay activated even when certain immunological pathways are blocked. Our study suggests that combination therapy targeting multiple pathways may be more efficient for psoriasis (particularly considering the feasibility of topical formulations). In addition, the 24 receptors we identified by topology analysis and previously not linked with psoriasis can be tested as potential novel targets for disease therapy.
Our picture of the functional machinery of psoriasis is still incomplete, and additional studies can help in "filling the gaps" in our understanding of its molecular mechanisms. For instance, kinase activity is still unaccounted for, as signaling kinases are activated only transiently and are often missed in gene expression studies. Topological analysis methods such as "hidden nodes" [24] may help to reconstruct regulatory events missing in the data. Also, the emerging phosphoproteomics methodology may prove to be a helpful and complementary OMICs technology. The network analysis methodology does not depend on the type of data analyzed or on any gene/protein content overlap between the studies, and is well suited to functional integration of multiple data types.

Conclusion
We have applied network-based methods to integrate and explore two distinct high-throughput disease data sets of different origin and size. By identifying common regulatory machinery that is likely to cause overexpression of genes and proteins, we arrived at the signaling pathways that might contribute to the altered state of the regulatory network in psoriatic lesions. Our approach allows integrative investigation of different data types and produces biologically meaningful results, leading to new potential therapy targets. We have demonstrated that the pathology can be caused and maintained by a large number of distinct cascades, many not previously described as implicated in psoriasis; therefore, combined therapies targeting multiple pathways might be effective in treatment.

Additional material
Additional file 1 Table S1. Patient description
Additional file 2 Table S2. List of genes significantly upregulated in lesions at the mRNA level
Additional file 3 Table S3. List of receptors significantly involved in regulation of genes overexpressed in psoriatic plaque ('hidden nodes' algorithm result)
The term 'possible' was used if a protein name co-occurred with psoriasis in articles, but no clear evidence of its implication was shown. In some cases, ligands are associated with psoriasis (e.g., IL-10).