Kernel-PCA data integration with enhanced interpretability
 Ferran Reverter^{1},
 Esteban Vegas^{1} and
 Josep M Oller^{1}
https://doi.org/10.1186/1752-0509-8-S2-S6
© Reverter et al; licensee BioMed Central Ltd. 2014
Published: 13 March 2014
Abstract
Background
Nowadays, combining the different sources of information to improve the available biological knowledge is a challenge in bioinformatics. Among the most powerful methods for integrating heterogeneous data types are kernel-based methods. Kernel-based data integration approaches consist of two basic steps: first, an appropriate kernel is chosen for each dataset; second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task.
Results
We analyze the integration of data from several sources of information using kernel PCA, from the point of view of reducing dimensionality. Moreover, we improve the interpretability of kernel PCA by adding to the plot the representation of the input variables that belong to any dataset. In particular, for each input variable or linear combination of input variables, we can represent the direction of maximum growth locally, which allows us to identify those samples with higher/lower values of the variables analyzed.
Conclusions
The integration of different datasets and the simultaneous representation of samples and variables together give us a better understanding of biological knowledge.
Background
With the recent rapid advancement of high-throughput technologies, such as next-generation sequencing, array comparative genomic hybridization and mass spectrometry, databases are increasing in both the amount and the complexity of the data they contain. One of the main goals in mining this type of data is to visualize the relationships between the biological variables involved [1]. For instance, visualizing gene expression guides the process of finding genes with similar expression patterns. However, due to the number of genes involved, it is more effective to display the data by means of a low-dimensional plot. Here we focus on the problem of reducing dimensionality and on the interpretability of the resulting data representations.
Principal component analysis (PCA) has a very long history and is known to be a very powerful tool in the linear case. PCA is used as a visualization tool for the analysis of microarray data [2, 3]. However, the sample space that many research problems deal with, such as that of microarray data, is nonlinear in nature. One reason for this nonlinearity might be that the interactions of the genes are not completely understood: many biological pathways are still not fully characterized, so it is quite naive to assume that genes are connected in a linear fashion. Following this line of thought, research into nonlinear dimensionality reduction for microarray gene expression data has increased. Finding methods that can handle such data is of great importance if we are to glean as much information as possible from them.
Kernel representation offers an alternative to explicit nonlinear functions by implicitly projecting the data into a high-dimensional feature space, which increases the computational power of linear learning machines [4, 5]. Kernel methods enable us to construct nonlinear versions of any algorithm that can be expressed solely in terms of dot products; this is known as the kernel trick. Kernel machines can be used to implement several learning algorithms, but the interpretability of the resulting output representations may be cumbersome, because the input variables are only handled implicitly [6].
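The kernel trick can be illustrated with a small, self-contained sketch (not from the paper): for the homogeneous polynomial kernel of degree 2, evaluating k(x, y) = (x·y)² reproduces the dot product of an explicit feature map without ever constructing that map.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)**2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
# The kernel evaluates the dot product in feature space without ever
# computing phi explicitly -- this is the kernel trick.
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```

For high-degree kernels (or the Gaussian kernel, whose feature space is infinite-dimensional) the explicit map becomes intractable while the kernel evaluation stays cheap, which is precisely why algorithms written in terms of dot products benefit.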
Nowadays, combining multiple sources of data to improve the biological knowledge available is a challenging task in bioinformatics. Data analysis of different sources of information is not simply a matter of adding the analysis of each separate dataset; instead it consists of the simultaneous analysis of multiple variables in the different datasets [7].
Some of the most powerful methods for integrating heterogeneous data types are kernel-based methods [8, 9]. Kernel-based data integration approaches can be described in two basic steps. First, an appropriate kernel is chosen for each dataset. Second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task. Basic mathematical operations such as multiplication, addition, and exponentiation preserve the properties of kernel matrices and hence produce valid kernels. The simplest approach is to use positive linear combinations of the different kernels.
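The positive linear combination can be sketched as follows; the two synthetic "views", the kernel widths, and the weights β are illustrative assumptions, not data or parameters from the study.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
# Two heterogeneous views of the same 6 samples (e.g. one expression-like
# and one metabolite-like block; sizes chosen for illustration only).
X1 = rng.normal(size=(6, 120))
X2 = rng.normal(size=(6, 21))

K1 = rbf_kernel(X1, gamma=0.01)
K2 = rbf_kernel(X2, gamma=0.05)

# Positive linear combination: a valid kernel, because positive
# semi-definite matrices are closed under non-negative weighting and sums.
beta = (0.5, 0.5)
K = beta[0] * K1 + beta[1] * K2

assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) > -1e-10  # PSD up to numerical noise
```

Any downstream kernel algorithm (kernel PCA included) can then be run on K exactly as it would on a single-source kernel.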
In this work, we analyze the integration of data from several sources of information using kernel PCA, from the point of view of reducing dimensionality and extending previous results [10]. Moreover, we improve kernel PCA interpretability by adding to the plot the representation of the input variables that belong to any dataset. In particular, for each input variable or linear combination of input variables, we can represent the direction of maximum growth locally, which allows us to identify those samples with higher/lower values of the variables analyzed. Therefore the integration of different datasets and the simultaneous representation of samples and variables together give us a better understanding of biological knowledge. This paper starts by briefly reviewing the notion of kernel PCA (Section 2). Section 3 contains our main results: a set of procedures to enhance the interpretability of kernel PCA when multiple datasets are analyzed simultaneously. We then present our results and apply them in parallel to analyze a nutrigenomic study in mouse [11].
Results and discussion
Kernel methods enable us to construct nonlinear versions of any algorithm that can be expressed solely in terms of dot products; this is the case of kernel PCA. Kernel PCA can be used to reduce dimensionality, thereby improving on linear PCA, but the interpretability of the output representations may be cumbersome because the input variables are only handled implicitly.
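As a minimal illustration of kernel PCA expressed purely through dot products, the sketch below (a plain NumPy implementation assumed for illustration, not the authors' code) double-centers a kernel matrix, extracts the leading eigenvectors, and checks that with a linear kernel the sample scores coincide with ordinary PCA up to the sign of each axis.

```python
import numpy as np

def kernel_pca(K, n_comp=2):
    """Project samples onto the top principal axes in feature space,
    given an (uncentered) kernel matrix K."""
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one      # double centering
    lam, alpha = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:n_comp]            # largest eigenvalues first
    lam, alpha = lam[idx], alpha[:, idx]
    return Kc @ (alpha / np.sqrt(lam))              # sample scores

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
scores = kernel_pca(X @ X.T)                        # linear kernel

# Sanity check: with a linear kernel, kernel PCA reproduces ordinary PCA
# scores up to the sign of each axis.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = Xc @ Vt[:2].T
for j in range(2):
    assert np.allclose(np.abs(scores[:, j]), np.abs(pca_scores[:, j]), atol=1e-8)
```

Replacing the linear kernel `X @ X.T` with a nonlinear kernel matrix is all that is needed to obtain a nonlinear embedding, which is the sense in which kernel PCA improves on linear PCA.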
In this section, we propose a set of procedures to improve the interpretability of kernel PCA. The procedures are related to the following aspects:
1) Representation of input variables.
2) Data integration and representation of input variables.
3) Representation of linear combinations of input variables.
4) Revealing the interpretability of input variables.
To illustrate these procedures we use an example from metabolomics and genomics. The datasets come from a nutrigenomic study in mouse [11]. Forty mice were studied and two sets of variables were acquired: expressions of 120 genes measured in liver cells; and concentrations (in percentages) of 21 hepatic fatty acids (FAs) measured by gas chromatography. Biological units (mice) are cross-classified according to two factors: genotype, which can be wild-type (WT) or PPARα-deficient mice (PPAR); and diet, with 5 classes of diet in accordance with the FA composition.
The oils used for the experimental diet preparation were: corn and rapeseed oils (50:50), as the reference diet (ref); hydrogenated coconut oil, as a saturated FA diet (coc); sunflower oil, as an ω6 FA-rich diet (sun); linseed oil, as an ω3 FA-rich diet (lin); and corn, rapeseed and fish oils (42.5:42.5:15), as the fish diet. In the study, it cannot be assumed that variations in one set of variables cause variations in the other; we do not know a priori if changes in gene expression imply changes in FA concentrations or vice versa. Indeed, the nuclear receptor PPARα, which acts as a ligand-induced transcriptional regulator, is known to be activated by various FAs and to regulate the expression of several genes involved in FA metabolism. It should be noted that the main observations discussed in [11], which were extracted separately from the two datasets by both classical multidimensional tools (hierarchical clustering and PCA) and standard test procedures, are also highlighted by kernel PCA graphical representations.
Representation of input variables
In order to achieve interpretability we add supplementary information to the kernel PCA representations. We have developed a procedure to represent any given input variable on the subspace spanned by the eigenvectors of $\tilde{C}$ (see Methods).
where Z_{ s } is in the form of (7).
Figure 1 shows the samples and the variables X_{1} to X_{5} at each sample point. Variables are represented by vectors that indicate the direction of maximum growth of each variable. In fact, we can see that the vectors point towards the groups characterized by higher values of each variable. For instance, the variables X_{1} and X_{2} point to group 1, and the variables X_{3}, X_{4}, and X_{5} point to group 3.
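One way to approximate such maximum-growth arrows numerically is sketched below. This is a finite-difference stand-in, assuming an RBF kernel and a hypothetical `KPCA` helper with out-of-sample projection, not the paper's closed-form expression: each arrow is the tangent of the projected curve obtained by perturbing a sample along one input variable.

```python
import numpy as np

def rbf(X, Y, gamma):
    """Cross-kernel matrix k(x_i, y_j) for the Gaussian kernel."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

class KPCA:
    """Minimal kernel PCA with out-of-sample projection, used here
    only to draw variable arrows numerically (illustrative helper)."""
    def __init__(self, X, gamma, n_comp=2):
        self.X, self.gamma = X, gamma
        m = X.shape[0]
        K = rbf(X, X, gamma)
        one = np.full((m, m), 1.0 / m)
        Kc = K - one @ K - K @ one + one @ K @ one
        lam, A = np.linalg.eigh(Kc)
        idx = np.argsort(lam)[::-1][:n_comp]
        self.K, self.one = K, one
        self.A = A[:, idx] / np.sqrt(lam[idx])   # normalized dual coefficients

    def project(self, Y):
        """Scores of (possibly new) points Y on the leading components."""
        m = self.X.shape[0]
        Kt = rbf(Y, self.X, self.gamma)
        onet = np.full((Y.shape[0], m), 1.0 / m)
        Ktc = Kt - onet @ self.K - Kt @ self.one + onet @ self.K @ self.one
        return Ktc @ self.A

def variable_arrow(kpca, x, k, h=1e-5):
    """Direction of maximum growth of input variable k at sample point x,
    estimated by central differences on the kernel PCA projection."""
    e = np.zeros_like(x)
    e[k] = h
    return (kpca.project((x + e)[None]) - kpca.project((x - e)[None]))[0] / (2 * h)

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
kp = KPCA(X, gamma=0.3)
arrow = variable_arrow(kp, X[0], k=0)   # 2-D arrow for variable 0 at sample 0
assert arrow.shape == (2,) and np.all(np.isfinite(arrow))
```

Because the kernel is nonlinear, the arrow for a given variable changes from sample to sample, which is why the arrows are drawn locally at each sample point rather than once for the whole plot.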
Analyzing the nutrigenomic dataset
Data integration and representation of input variables
is a kernel on $(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$. Here, $x_1, x_1' \in \mathcal{X}_1$ and $x_2, x_2' \in \mathcal{X}_2$.
Then, formula (1) allows us to display variables that belong to any of the datasets over the kernel PCA representation of samples, simultaneously.
Analyzing the nutrigenomic dataset
Representation of linear combinations of input variables
Then, formula (1) allows us to represent any linear combination of input variables.
Analyzing the nutrigenomic dataset
Revealing the interpretability of input variables
Our procedure for representing input variables on the two-dimensional subspace spanned by the two main eigenvectors of $\tilde{C}$ displays the variables as vectors whose direction is that of maximum growth of the variable at a given point; in particular, at the sample points.
So, if we fix a direction in this plane, given by a vector w, we can search for input variables whose representation on the kernel PCA plane is correlated with this direction. Suppose we observe clusters of samples in the kernel PCA representation; then an interesting direction is given by the vector defined by any two cluster centroids.
Finally, we order all the variables according to R_{ k } and select those with the highest and the lowest values. In this way, for each sample cluster, we can find the correlated variables with higher and lower values. Knowledge of such variables can improve the biological interpretability of the results.
A natural extension of this procedure is to take as w the vector corresponding to one of the input variables. Then, if we know that a certain input variable is useful for interpreting the kernel PCA representation, we can search for other input variables whose representation on the kernel PCA plane is correlated with this feature. If we are integrating multiple datasets, we can search for correlated variables in each dataset.
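A simplified stand-in for this ranking can be sketched as follows. It correlates raw variable values with each sample's coordinate along w, rather than using the paper's arrow-based R_{k}; the data and the direction w are synthetic and purely illustrative.

```python
import numpy as np

def rank_variables(X, scores, w):
    """Rank input variables by correlation with direction w in the kernel
    PCA plane (simplified stand-in for R_k: correlates raw variable
    values with the samples' coordinate along w)."""
    t = scores @ (w / np.linalg.norm(w))       # sample coordinate along w
    r = np.array([np.corrcoef(X[:, k], t)[0, 1] for k in range(X.shape[1])])
    order = np.argsort(r)                      # most negative ... most positive
    return order, r

rng = np.random.default_rng(3)
scores = rng.normal(size=(40, 2))              # stand-in kernel PCA scores
w = np.array([1.0, 0.0])                       # e.g. vector between two cluster centroids
X = np.column_stack([
    scores @ w + 0.1 * rng.normal(size=40),    # variable 0 tracks w
    -(scores @ w) + 0.1 * rng.normal(size=40), # variable 1 anti-tracks w
    rng.normal(size=40),                       # variable 2 is unrelated
])

order, r = rank_variables(X, scores, w)
assert order[-1] == 0 and order[0] == 1        # 0 most positive, 1 most negative
```

Listing the variables at both ends of `order` yields exactly the kind of two-tailed summary shown in Tables 1 and 2.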
Analyzing the nutrigenomic dataset
Table 1. Fatty acids: correlation with the preferred direction.

FA        mean    sd
C16.1ω.7  -0.927  0.100
C20.3ω.9  -0.917  0.336
C18.1ω.7  -0.907  0.270
C14.0     -0.898  0.131
C18.3ω.6  -0.862  0.372
C18.1ω.9  -0.695  0.132
C16.1ω.9  -0.480  0.224
C16.0     -0.295  0.265
C20.1ω.9  -0.176  0.401
C22.5ω.3   0.198  0.346
C20.3ω.3   0.235  0.383
C20.5ω.3   0.300  0.219
C20.3ω.6   0.386  0.227
C18.0      0.392  0.171
C22.6ω.3   0.453  0.151
C20.2ω.6   0.601  0.306
C20.4ω.6   0.664  0.360
C22.4ω.6   0.684  0.367
C18.2ω.6   0.718  0.290
C18.3ω.3   0.727  0.482
C22.5ω.6   0.731  0.499
Table 2. Genes: correlation with the preferred direction.

gene     mean    sd
S14      -0.998  0.002
ACC2     -0.997  0.004
LPL      -0.997  0.005
ap2      -0.996  0.006
NGFiB    -0.996  0.005
i.FABP   -0.995  0.007
COX1     -0.993  0.012
CIDEA    -0.993  0.012
MDR1     -0.991  0.016
Lpin     -0.991  0.007
MTHFR    -0.991  0.012
Lpin1    -0.989  0.009
i.BAT    -0.988  0.014
PPARg    -0.986  0.025
ACAT2    -0.984  0.013
CYP2b10  -0.978  0.022
hABC1    -0.976  0.021
ACC1     -0.975  0.012
SPI1.1    0.353  0.042
GSTpi2    0.587  0.038
In Table 1, we can observe that FAs with negative correlation, such as C16.1ω.7, C20.3ω.9 and C18.1ω.7, are FAs with higher concentrations in samples with the coc diet. In contrast, FAs that are positively correlated, such as C22.4ω.6, C18.2ω.6, C18.3ω.3 and C22.5ω.6, are FAs with higher concentrations in samples with other types of diet. Furthermore, in Table 2, we can observe that genes with negative correlation at the top of the table, such as S14, ACC2 and LPL, are more highly expressed in samples with the coc diet, whereas the positively correlated genes at the bottom of the table are less expressed in the coc diet samples. These results are in agreement with those found in [12].
Conclusions
With the rapidly increasing amount of genomic, proteomic, and other high-throughput data available, the importance of data integration has grown significantly in recent years. Biologists, medical scientists, and clinicians are also interested in integrating the high-throughput data that has recently become available with previously existing clinical, laboratory and biological information.
Kernel methods, in particular kernel PCA, constitute a powerful methodology because they allow us to reduce dimensionality and integrate multiple datasets simultaneously. Moreover, in this paper we have introduced a set of procedures to improve the interpretability of kernel PCA representations. The procedures are related to the following aspects: 1) representation of variables; 2) linear combination of representations of variables; 3) data integration and representation of variables; and 4) revealing the interpretability of input variables. Our procedure is a kernel-based exploratory tool for data mining that enables us to extract nonlinear features while representing the variables.
Methods
Implying $\langle \varphi(x), \varphi(y) \rangle = \langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$. After completion we can turn our feature space into a Hilbert space $\mathcal{H}_k$ [5]. The space $\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) induced by the kernel function $k$.
The solutions $\tilde{\alpha}^k$, $k = 1, \ldots, r$, are normalized by normalizing the corresponding vectors $\tilde{V}^k$ in $\mathcal{H}_k$, which translates into $\tilde{\lambda}_k \langle \tilde{\alpha}^k, \tilde{\alpha}^k \rangle = 1$.
where $\tilde{V}$ is an $m \times r$ matrix whose columns are the eigenvectors $\tilde{V}^1, \ldots, \tilde{V}^r$.
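This normalization can be checked numerically. The sketch below (a linear-kernel toy example, assumed only so the result is easy to verify) confirms that enforcing λ_k⟨α^k, α^k⟩ = 1 makes each eigenvector a unit vector in feature space.

```python
import numpy as np

# Each eigenvector V^k = sum_i alpha_i^k phi(x_i) has squared norm
# <V^k, V^k> = (alpha^k)^T Kc alpha^k = lambda_k <alpha^k, alpha^k>,
# so requiring lambda_k * <alpha^k, alpha^k> = 1 makes V^k a unit vector.
rng = np.random.default_rng(4)
X = rng.normal(size=(8, 5))
K = X @ X.T                                    # linear kernel, for checkability
m = K.shape[0]
one = np.full((m, m), 1.0 / m)
Kc = K - one @ K - K @ one + one @ K @ one     # centered kernel matrix
lam, alpha = np.linalg.eigh(Kc)
lam, alpha = lam[::-1], alpha[:, ::-1]         # descending eigenvalue order
r = 2
alpha = alpha[:, :r] / np.sqrt(lam[:r])        # enforce lambda_k <alpha, alpha> = 1

for k in range(r):
    # norm of V^k in feature space, computed entirely through the kernel matrix
    nrm2 = alpha[:, k] @ Kc @ alpha[:, k]
    assert np.isclose(nrm2, 1.0)
```

Note that the norm is evaluated through the kernel matrix alone, without ever forming the feature vectors, which is what makes the same normalization applicable to any valid kernel.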
Declarations
Acknowledgements
This article has been published as part of BMC Systems Biology Volume 8 Supplement 2, 2014: Selected articles from the High-Throughput Omics and Data Integration Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S2.
This work was partially supported by COST-BMBS, Action BM1006 "Next Generation Sequencing Data Analysis Network", SeqAhead.
Declarations
The publication costs for this article were partially funded by MEC MTM2008-00642.
References
1. Gorban AN, Kegl B, Wunsch DC, Zinovyev A: Principal Manifolds for Data Visualization and Dimension Reduction. Springer; 2007.
2. Pittelkow YE, Wilson SR: Visualisation of gene expression data: the GE-biplot, the Chip-plot and the Gene-plot. Statistical Applications in Genetics and Molecular Biology 2003.
3. Park M, Lee JW, Lee JB, Song SH: Several biplot methods applied to gene expression data. Journal of Statistical Planning and Inference 2008, 138:500-515. doi:10.1016/j.jspi.2007.06.019.
4. Shawe-Taylor J, Cristianini N: Kernel Methods for Pattern Analysis. Cambridge University Press; 2004.
5. Scholkopf B, Smola AJ: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press; 2002.
6. Li X, Shu L: Kernel-based nonlinear dimensionality reduction for microarray gene expression data analysis. Expert Systems with Applications 2009, 36:7644-7650. doi:10.1016/j.eswa.2008.09.070.
7. Hamid JS, Hu P, Roslin NM, Ling V, Greenwood CMT, Beyene J: Data integration in genetics and genomics: methods and challenges. Human Genomics and Proteomics 2009.
8. Lanckriet GRG, De Bie T, Cristianini N, Jordan MI, Noble WS: A statistical framework for genomic data fusion. Bioinformatics 2004, 20(16):2626-2635. doi:10.1093/bioinformatics/bth294.
9. Daemen A, Gevaert O, De Moor B: Integration of clinical and microarray data with kernel methods. In Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '07). Lyon, France; 2007:5411-5415.
10. Reverter F, Vegas E, Oller JM: Kernel methods for dimensionality reduction applied to the "omics" data. In Principal Component Analysis: Multidisciplinary Applications. Edited by Sanguansat P. InTech; 2012.
11. Martin PG, Guillou H, Lasserre F, Déjean S, Lan A, Pascussi JM, Sancristobal M, Legrand P, Besse P, Pineau T: Novel aspects of PPARα-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology 2007, 54:767-777.
12. González I, Déjean S, Martin PGP, Goncalves O, Besse P, Baccini A: Highlighting relationships through regularized canonical correlation analysis: application to high throughput biology data. Journal of Biological Systems 2009, 17(2):173-199. doi:10.1142/S0218339009002831.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.