 Methodology
 Open Access
Optimal projection method determination by Logdet Divergence and perturbed von-Neumann Divergence
BMC Systems Biology volume 11, Article number: 115 (2017)
Abstract
Background
Positive semidefiniteness is a critical property in kernel methods for Support Vector Machine (SVM) by which efficient solutions can be guaranteed through convex quadratic programming. However, a lot of similarity functions in applications do not produce positive semidefinite kernels.
Methods
We propose a projection method that constructs a projection matrix for indefinite kernels. As a generalization of the spectrum methods (the denoising method and the flipping method), the projection method shows better or comparable performance compared with the corresponding indefinite kernel methods on a number of real-world data sets. Under the Bregman matrix divergence theory, a suggested optimal λ for the projection method can be found by unconstrained optimization in kernel learning. In this paper we focus on determining the optimal λ precisely within this unconstrained optimization framework, and we develop a perturbed von-Neumann divergence to measure kernel relationships.
Results
We compared optimal λ determination under the Logdet divergence and the perturbed von-Neumann divergence, aiming at finding a better λ for the projection method. Results on a number of real-world data sets show that the projection method with the optimal λ given by the Logdet divergence demonstrates near-optimal performance, and that the perturbed von-Neumann divergence can help determine a relatively better optimal projection method.
Conclusions
The projection method is easy to use for dealing with indefinite kernels, and the parameter embedded in the method can be determined through unconstrained optimization under the Bregman matrix divergence theory. This may provide a new way in kernel SVMs for varied objectives.
Background
Support vector machines (SVMs), a supervised machine learning technique, were introduced by Vapnik [1, 2]. In machine learning, SVMs [3] are traditionally considered one of the best algorithms in terms of structural risk minimization. Kernels in SVMs work by embedding the data in a high-dimensional feature space, in which one can construct an optimal separating hyperplane [4]. Furthermore, kernel methods have wide applications in the field of bioinformatics. The authors of [5] proposed incremental kernel ridge regression to predict soft tissue deformations after CMF surgery. In [6], researchers utilized the kernel-based linear discriminant analysis (LDA) method to address the problem of automatically tuning multiple kernel parameters. To address the nonlinear problem of nonnegative matrix factorization (NMF) and the semi-nonnegative problem of the existing kernel NMF methods, the authors of [7] developed a nonlinear NMF based on a self-constructed Mercer kernel which preserves the nonnegative constraints on both bases and coefficients in kernel feature space. Positive semidefiniteness (PSD) is crucial [8] for a kernel matrix in SVMs, since it is required to guarantee the existence of a Reproducing Kernel Hilbert Space (RKHS). In an RKHS, one can formulate a convex optimization problem to obtain an optimal solution. Sometimes, however, similarity matrices generated for practical use cannot ensure the PSD property. For example, in evaluating pairwise similarity between DNA and protein sequences, popular functions like BLAST and Dynamic Time Warping generate indefinite kernel matrices [9–11]. The generalized histogram intersection kernel, which is conditionally positive definite, is not usually positive semidefinite [12]. Hyperbolic tangent kernels [13, 14], although suitable in practice, are sometimes indefinite as well. As far as we know, it is still not very clear how to deal effectively with indefinite kernels in the SVM framework.
Training indefinite SVMs therefore becomes a challenging optimization problem since convex solutions are no longer valid for standard SVMs in this learning scenario [15].
To deal with indefinite kernels, a number of methods have been proposed in the literature [16]. Representative earlier studies tackled the problem by altering the spectrum of an indefinite kernel matrix so as to create a PSD one. The authors of [17] developed the denoising method, which deems negative eigenvalues to be noise and replaces them with zero. The flipping method is another effective way of transforming an indefinite kernel into a PSD one, by changing the sign of the negative eigenvalues [18]. The authors of [19] proposed the diffusion method, which considers the data distribution and replaces the eigenvalues with their exponential form. The shifting method shifts the eigenvalues by introducing new parameters to ensure that all the eigenvalues are nonnegative [20]. The authors of [13] developed a method to find stationary points under a non-convex dual formulation of SVMs with sigmoid kernels. In [21], indefinite kernel learning is considered as a minimization problem in a pseudo-Euclidean space. In [22], a max-min optimization problem is further proposed so as to find a proxy kernel for the indefinite kernel. Based on a confidence function, a simple generalization of SVMs is suggested by Guo and Schuurmans [23]. Kernel principal component analysis has also been developed as a kernel transformation method to deal with indefinite kernels [24].
In this paper, we develop an effective method, the projection method, to convert an indefinite kernel into a PSD one. Compared with the existing methods, the proposed one is more flexible and comprehensive: one can easily obtain different types of methods, such as the flipping or the denoising method, by varying its parameter. Furthermore, our suggested λ under the Logdet divergence and the perturbed von-Neumann divergence yields near-optimal performance on the data sets considered, and can be regarded as a good choice for dealing with indefinite kernels. Besides, our suggested projection matrix also has certain special mathematical properties, and the connection between the spectrum methods and the projection method can be investigated through an analysis of the eigenvalues.
The rest of the paper is organized as follows. Firstly, we present the projection method and the associated theorem. Then we propose the optimal λ determination for the projection matrix under an unconstrained optimization framework. After that, we apply two indefinite kernels to some real-world data sets, which range from cancer prediction to glycan classification, and we validate the suggested optimal λ with the experimental data. Discussions of the experimental validation of the suggested optimal λ under the Logdet divergence and the perturbed von-Neumann divergence follow. Finally, in the last section, we give concluding remarks with possible future work.
Methods
Assume that \(\{(\mathbf {X}_{i},y_{i})\}_{i=1}^{n}\), where \(\mathbf {X}_{i}\in \mathbb {R}^{p}\) and \(y_{i}\in \{1,-1\}\), is a given list of labeled patterns. A function \(k:\chi \times \chi \to \mathbb {R}\), where χ represents the input space, can be regarded as a kernel function. The kernel induced by the kernel function is defined by
$$ K=\left(k(\mathbf {X}_{i},\mathbf {X}_{j})\right)_{i,j=1}^{n}. $$ (1)
According to Mercer’s theorem, a valid kernel should be positive semidefinite. Thus, to deal with invalid kernels, kernel transformation strategies are increasingly popular. When the kernel K is not positive semidefinite, we may decompose it as \(K=P\cdot D\cdot P^{\prime }\), where D is a diagonal matrix whose diagonal entries are not all nonnegative, P is an orthonormal matrix whose jth column is the eigenvector for the jth eigenvalue in D, and \(P^{\prime }\) represents the transpose of P. Eigenvalue transformation is the representative method in kernel transformation [17–20].
In the following, we present our suggested projection method for transforming an indefinite kernel to a PSD one.
Lemma 1
There exists an n×m (m<n) matrix B satisfying \(B^{\prime }B=I_{m}\) such that \((I_{n}-\lambda BB^{\prime })\) has 1−λ and 1 as its eigenvalues, with multiplicities m and n−m respectively. Moreover, it shares the same set of eigenvectors with K.
Proof
Since K is a real symmetric matrix, we can decompose it as \(K=P\cdot D\cdot P^{\prime }\), where \(P=[\vec {p}_{1},\vec {p}_{2},\ldots,\vec {p}_{n}]\) and \(D=\text {diag}[d_{1},d_{2},\ldots,d_{n}]\) is a diagonal matrix with diagonal elements \(d_{i}\), i=1,2,…,n. Without loss of generality, we may assume that the eigenvalues are sorted in ascending order. We further assume that the positive inertia index is l and the negative inertia index is m, so that \(d_{1},\ldots,d_{m}\) are the negative eigenvalues.
Denote \(B=[\vec {p}_{1},\vec {p}_{2},\ldots,\vec {p}_{m}]\). We have \(B^{\prime }B=I_{m}\), since \(\vec {p}_{i}\), i=1,2,…,n, are orthonormal eigenvectors. Then we have
$$ I_{n}-\lambda BB^{\prime }=P\cdot \text {diag}[\underbrace{1-\lambda,\ldots,1-\lambda }_{m},\underbrace{1,\ldots,1}_{n-m}]\cdot P^{\prime }. $$ (2)
Thus, it has 1 and (1−λ) as its eigenvalues, where the multiplicity of (1−λ) is m and the multiplicity of 1 is n−m. Furthermore, the eigenvectors of \((I_{n}-\lambda BB^{\prime })\) are exactly the same as those of the kernel K. □
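Lemma 1 can be checked numerically on a small example. The following is only a minimal sketch, assuming NumPy and a toy 2×2 indefinite matrix chosen for illustration; the function name `projection_factor` is hypothetical and does not come from the paper.

```python
import numpy as np

def projection_factor(K, lam):
    """Return I_n - lam * B B', where the columns of B are the eigenvectors
    of the symmetric matrix K that belong to its negative eigenvalues."""
    d, P = np.linalg.eigh(K)          # ascending eigenvalues, orthonormal eigenvectors
    B = P[:, d < 0]                   # n x m block, so B'B = I_m
    return np.eye(K.shape[0]) - lam * B @ B.T

# Toy indefinite matrix with eigenvalues -1 and 3, hence m = 1 and n - m = 1.
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])
lam = 2.5
M = projection_factor(K, lam)

# Lemma 1 predicts the eigenvalue 1 - lam with multiplicity m and 1 with multiplicity n - m.
eigs = np.sort(np.linalg.eigvalsh(M))
# Sharing the eigenvectors of K implies that M and K commute.
commutes = np.allclose(M @ K, K @ M)
```

Here `eigs` holds the two eigenvalues of the projection factor, and `commutes` records whether it commutes with K, as the shared eigenvectors require.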
Theorem 1
Let K be an n×n real symmetric matrix which is indefinite. Then there exists an n×m (m<n) matrix B satisfying \(B^{\prime }B=I_{m}\) such that \((I_{n}-\lambda BB^{\prime })K\) is a positive semidefinite kernel, where λ≥1 is a regularization parameter.
Proof
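Denote \(B=[\vec {p}_{1},\vec {p}_{2},\ldots,\vec {p}_{m}]\), where the definitions of \(\vec {p}_{i}, i=1,2,\ldots,m\) are the same as in Lemma 1. By Eq. (2), we have
$$ (I_{n}-\lambda BB^{\prime })K=P\cdot \text {diag}[(1-\lambda)d_{1},\ldots,(1-\lambda)d_{m},d_{m+1},\ldots,d_{n}]\cdot P^{\prime }. $$ (3)
Since λ≥1 and \(d_{i}<0\) for 1≤i≤m, we have \((1-\lambda)d_{i}\geq 0\) for 1≤i≤m, while the remaining eigenvalues \(d_{m+1},\ldots,d_{n}\) are nonnegative. This guarantees that the kernel matrix \((I_{n}-\lambda BB^{\prime })K\) is positive semidefinite. □
In particular, by Eq. (3), we obtain the denoising method by letting λ=1, and the flipping method is the particular case of the projection method with λ=2.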
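The transformation in Theorem 1 and its two special cases can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, the helper name `project_kernel` is hypothetical, and the symmetrization step is an added safeguard against round-off, not part of the theorem.

```python
import numpy as np

def project_kernel(K, lam):
    """Return (I_n - lam * B B') K for a symmetric, possibly indefinite, matrix K.

    B collects the eigenvectors of K with negative eigenvalues, so the
    negative eigenvalues d_i become (1 - lam) * d_i and the others are unchanged.
    """
    K = (K + K.T) / 2.0                       # assumption: remove round-off asymmetry
    d, P = np.linalg.eigh(K)
    B = P[:, d < 0]
    return (np.eye(K.shape[0]) - lam * B @ B.T) @ K

# Toy indefinite matrix with eigenvalues -1 and 3.
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])

denoised = project_kernel(K, 1.0)   # lam = 1: negative eigenvalue replaced by 0
flipped = project_kernel(K, 2.0)    # lam = 2: negative eigenvalue changes sign
```

For this toy matrix the denoised kernel has eigenvalues 0 and 3 and the flipped kernel has eigenvalues 1 and 3; any λ≥1 gives a positive semidefinite result.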
Optimal λ determination
Since λ is an embedded parameter of the projection method, it is necessary to study how to determine an optimal λ>0 that yields strong prediction power. To this end, we begin with the definition of the Bregman matrix divergence [25].
Definition 1
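(Bregman Matrix Divergences) The Bregman matrix divergence between K and \(K_{0}\) is defined as follows:
$$ D_{\phi }(K,K_{0})=\phi (K)-\phi (K_{0})-\text {tr}\left((\nabla \phi (K_{0}))^{\prime }(K-K_{0})\right). $$
Here ϕ(K) is a strictly convex differentiable function of K and tr(K) means the trace of matrix K.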
A number of matrix divergences [25, 26] exist in the literature.

1. Mahalanobis Divergence (p=2):
$$ D_{\phi}(K,K_{0})=\text{tr}\left(K^{2}-2KK_{0}+K_{0}^{2}\right). $$ (4)
2. Frobenius Divergence \(\left (\phi (K) = \|K\|_{F}^{2}\right)\):
$$ D_{\phi}(K,K_{0})=\|K-K_{0}\|_{F}^{2}. $$ (5)
3. von-Neumann Divergence (ϕ(K)=tr(K log(K)−K)):
$$ D_{\phi}(K,K_{0})=\text{tr}(K\log K - K \log K_{0} - K + K_{0}). $$ (6)
4. LogDet Divergence (ϕ(K)=− log det(K)):
$$ D_{\phi}(K,K_{0})=\text{tr}\left(KK_{0}^{-1}\right)-\log \det\left(KK_{0}^{-1}\right)-n. $$ (7)
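The four divergences in Eqs. (4) to (7) can be evaluated directly for small matrices. The sketch below assumes NumPy and symmetric positive definite inputs (the matrix logarithm and the inverse are not defined otherwise); the function names are illustrative only.

```python
import numpy as np

def sym_logm(K):
    """Matrix logarithm of a symmetric positive definite matrix via eigh."""
    d, P = np.linalg.eigh(K)
    return (P * np.log(d)) @ P.T      # scales column j of P by log(d_j)

def mahalanobis_div(K, K0):
    """Eq. (4): tr(K^2 - 2 K K0 + K0^2)."""
    return np.trace(K @ K - 2.0 * K @ K0 + K0 @ K0)

def frobenius_div(K, K0):
    """Eq. (5): squared Frobenius norm of K - K0."""
    return np.linalg.norm(K - K0, "fro") ** 2

def von_neumann_div(K, K0):
    """Eq. (6): tr(K log K - K log K0 - K + K0)."""
    return np.trace(K @ sym_logm(K) - K @ sym_logm(K0) - K + K0)

def logdet_div(K, K0):
    """Eq. (7): tr(K K0^{-1}) - log det(K K0^{-1}) - n."""
    M = K @ np.linalg.inv(K0)
    _, logabsdet = np.linalg.slogdet(M)
    return np.trace(M) - logabsdet - K.shape[0]

K = np.diag([2.0, 1.0])
K0 = np.eye(2)
values = (mahalanobis_div(K, K0), frobenius_div(K, K0),
          von_neumann_div(K, K0), logdet_div(K, K0))
```

For symmetric K and K0, Eqs. (4) and (5) give the same value, because tr(K K0) = tr(K0 K); all four divergences vanish when K = K0.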
Inspired by the work in [27] where authors proposed a framework of kernel learning [28] by unconstrained optimization, we reformulate the problem as a kernel learning one in a similar manner. The optimal λ can be obtained by minimization of \(D_{\phi }(\tilde {K},K)\) where \(\tilde {K}\) is the optimal PSD kernel which is close to original kernel K in terms of divergence. Noting that
then the minimization problem can be equivalently transformed to the following:
For the Mahalanobis divergence, by Lemma 1, we know that \(KK_{0}=K_{0}K\), as the two matrices share the same set of eigenvectors. Therefore, the minimization problem can be expressed as
The optimal λ can be quickly obtained as 0.
For Frobenius Divergence, the minimization problem in finding optimal λ as derived from Eq. (5) is
It is easy to see that the optimal λ is 0.
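As a short worked check of this claim, assume that the projected kernel is \(\tilde {K}=(I_{n}-\lambda BB^{\prime })K\) as in Theorem 1 and that the divergence is taken between \(\tilde {K}\) and K. Since \(K\vec {p}_{i}=d_{i}\vec {p}_{i}\) and the \(\vec {p}_{i}\) are orthonormal,
$$ \tilde {K}-K=-\lambda BB^{\prime }K=-\lambda \sum _{i=1}^{m}d_{i}\vec {p}_{i}\vec {p}_{i}^{\prime },\qquad \|\tilde {K}-K\|_{F}^{2}=\lambda ^{2}\sum _{i=1}^{m}d_{i}^{2}. $$
The right-hand side is a nonnegative quadratic in λ whose unique minimizer is λ=0 whenever K has at least one negative eigenvalue. Because \(\tilde {K}\) and K are symmetric and commute, the Mahalanobis divergence in Eq. (4) takes the same value, which is consistent with the optimal λ of 0 reported for both divergences.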
For von Neumann Divergence, we can deduce the minimization problem to be
Applying differentiation to Eq. (10), we obtain the optimal value of λ=0.
The optimal λ=0 does not perturb the original kernel matrix at all, which is not reasonable. Hence we focus on the LogDet divergence [27], for which the optimal λ can be determined through the following formula
Since the calculation involves the inverse of the matrix K, where K is not necessarily positive definite, we use the pseudo-inverse instead. Thus, the final theoretical optimal λ becomes:
Results
Materials
To evaluate our method experimentally, we adopted a number of life science data sets for which the generated kernels are indefinite; most of them are cancer-related data sets. Three data sets are obtained from the libsvm data sets [29]. The first is the sonar data set, with 208 data instances, of which 97 are positive and 111 are negative. The liver disorder data set has 345 data instances, of which 145 are positive and 200 are negative. The breast cancer data set has 683 data instances in total, of which 444 are negative and 239 are positive, and the number of attributes is 10. Another two data sets pertain to cystic fibrosis and leukemia. Within the cystic fibrosis data set, there are 177 glycan structures in total, containing 89 glycans related to cystic fibrosis, 107 related to respiratory mucin and 101 related to bronchial mucin. For the leukemia-related data set, 355 structures are included, originating from four human blood components: leukemic cells, erythrocytes, serum and plasma, containing 162, 111, 85 and 73 examples respectively. All the glycan structures are retrieved from the KEGG/GLYCAN database [30], where annotations are retrieved from the CarbBank/CCSD database [31]. If the glycan data set contains N glycans \(\{g_{1},g_{2},\cdots,g_{N}\}\), we denote the set of all q-grams existing in these N glycans as the q-gram set \({\Phi }_{q}=\{{\phi }_{q}^{1},{\phi }_{q}^{2},\cdots, {\phi }_{q}^{n_{q}}\}\). For a specific glycan \(g_{i}\) in the data set, the q-gram representation is a column vector \( x_{i}^{q}=[x_{1i}^{q},x_{2i}^{q}, \cdots, x_{n_{q}i}^{q}]^{T} \), where \(x_{li}^{q}\) is the number of occurrences of the lth q-gram in the glycan \(g_{i}\). The number of attributes within the data set depends on the value of q (q=1 to 9), so we have 9 data sets derived from the cystic fibrosis data and 9 from the leukemia data. The last data set is about lung cancer and is obtained from the NCBI (National Center of Biotechnology Information) GEO (Gene Expression Omnibus) [32].
Affymetrix Human Genome U133 Plus 2.0 Array experiments were carried out on a set of 91 non-small cell lung cancer (NSCLC) samples, containing 46 tumors and 45 controls. Detailed information on the data sets can be found in Table 1.
Attribute distribution information for different q is provided in Fig. 1 for the Leukemia Data and the Cystic Fibrosis Data. We can see that the number of attributes in the Leukemia Data increases as q increases, which is not the case in the Cystic Fibrosis Data: there, the number of attributes first increases and then decreases as q increases. A possible reason is that the glycan structures in the Leukemia Data set are more complicated than those in the Cystic Fibrosis Data set.
Experiments
We perform the experiments in a 5-fold cross-validation setting and measure the performance of the models with the Area Under Curve (AUC). The AUC (calculated as the area under the ROC curve) is commonly used for model evaluation. We report the AUC values averaged over 10 repetitions of 5-fold cross-validation for the considered methods. Here we introduce two kernels for illustration purposes: the Generalized Histogram Intersection (GHI) kernel [12] and the cosine kernel. These two kernels are indefinite in most cases (shown in Additional file 1: Table SI), and neither has been used in biological applications like glycan classification or cancer prediction.
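For readers who want to reproduce the evaluation measure, the AUC can be computed from labels and decision values without any library. The sketch below uses the rank (Mann-Whitney) formulation, which equals the area under the ROC curve; it is plain Python, the function name `auc` is illustrative, and it is not taken from the authors' code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of +1 / -1 class labels.
    scores: decision values, where a larger value means "more positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0       # positive example ranked above negative example
            elif p == q:
                wins += 0.5       # ties count as half
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
example = auc([-1, -1, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In a repeated cross-validation setting, this value would be computed on each held-out fold and then averaged over all folds and repetitions.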
GHI kernel is frequently used in image classification and the definition is as follows:
where \(\mathbf {X}_{j}\) represents a data vector and \(X_{ji}\) represents the ith element of \(\mathbf {X}_{j}\), j=1,2,…,n.
When α=β, the kernel can be proved to be positive semidefinite. The experimental results in Table SI (in Additional file 1) are consistent with this statement, as the minimal eigenvalue of the GHI kernel is 0 when α=β. In our experimental settings we use different values of α,β∈{1,2,3}, α≠β, to evaluate the performance of the proposed projection method.
The cosine kernel function is defined by:
which is different from the usual definition of cosine similarity: \(\frac {\mathbf {X}_{j}'\mathbf {X}_{k}}{\|\mathbf {X}_{j}\|\cdot \|\mathbf {X}_{k}\|}\). The reason we did not consider the usual definition is that the kernel matrix generated from that function is positive semidefinite, which fails to satisfy our requirements.
Experiments on GHI kernel
The performance of the Projection method and the original GHI kernel SVM is summarized in Tables 2, 3, and 4. Values in bold face mark the best performance; no marks are made when both methods show comparable performance. When α differs from β, the GHI kernel is indefinite (see Additional file 1: Table SI), and we can see that the Projection method outperforms the GHI kernel method.
The performance for sonar data set is reported in Table 2. For example, when α=1,β=2, Projection Method shows the averaged AUC value 81.47% with standard deviation 0.99% while in GHI kernel method the averaged AUC value is 53.42% with standard deviation 4.94%. When (α,β)=(1,3), Projection Method shows 84.02% in the averaged AUC value, with standard deviation 1.19%. However, the averaged AUC value for GHI kernel method is only 54.10% with standard deviation 4.92%. When (α,β)=(2,3), the averaged AUC value for Projection Method is 84.31%, larger than the averaged AUC value for GHI Method 83.06%. The standard deviation in Projection Method is 1.56%, while in GHI kernel method standard deviation is 2.04%. This implies that Projection method is more powerful and stable compared to original GHI kernel method.
For the liver disorder data set, we can see from Table 2 that the Projection method performs significantly better than the GHI kernel method when α≠β. The best performance of the GHI kernel when indefinite is around 60% in AUC value, which is not satisfactory. When α=β, both methods show comparable performance.
For the breast cancer data set, results in Table 2 indicate that the Projection method is clearly superior to the GHI kernel method when α=1,β=2 and α=1,β=3, whereas for α=2,β=3 both methods show comparable performance. This illustrates that indefinite kernels can sometimes also perform well. Overall, however, the superiority of the projection method over the original GHI kernel method is clearly shown in this data set.
In the cystic fibrosis data set, we get 9 different comparison results as q varies from 1 to 9, as shown in Table 3. There is no clear difference between the Projection method and the GHI kernel, as the GHI kernel is positive semidefinite for almost all considered pairs of α and β (see Additional file 1: Table SI for reference). The only 2 cases in which the GHI kernel is indefinite are α=1,β=2 and α=1,β=3 for q=1, and the minimal eigenvalue of the generated GHI kernel in these 2 cases is only −0.08, quite close to 0, demonstrating that the generated kernel is almost positive semidefinite.
Results for the Leukemia Data are summarized in Table 4. Similar to the results for the cystic fibrosis data, the projection method and the GHI kernel method show similar performance in most of the cases for q from 1 to 9. From Additional file 1: Table SI we can see that the GHI kernel is indefinite when α≠β for q=1,2,3. When q=1,2, the Projection method is better than the GHI kernel method for α=1,β=2 and α=1,β=3; however, the GHI kernel method is comparable to the Projection method for α=2,β=3.
Some interesting results can be found for the NSCLC data, as shown in Table 2. The Projection method and the GHI kernel method show exactly the same performance when α=β, yielding 100% in AUC values. Since the Projection method does not perturb the original kernel when it is positive semidefinite (the GHI kernel when α=β), we can conclude that the GHI kernel is a preferred kernel for tumor differentiation with the NSCLC data. When α differs from β, the results are different. When α=1,β=2, the Projection method shows 99.72% in averaged AUC value with 0.01% standard deviation, while the GHI kernel method only reaches 64.07% in averaged AUC value with a large standard deviation of 7.42%. When α=2,β=3, the Projection method shows 99.99% in averaged AUC value with 0 standard deviation, while the GHI kernel method reaches 73.07% in averaged AUC value with a large standard deviation of 8.17%. An exception occurs when α=1,β=3, where the Projection method only reaches 61.46% in averaged AUC value and the GHI kernel method is even worse, achieving only 51.47%.
We can conclude that the performance of the projection method varies across different pairs of (α,β). There exists a best pair (α,β) for inducing the best projection method, but different data sets may be suited to different pairs. The GHI kernel method can sometimes also perform well when the kernel is indefinite. In general, however, the projection method is clearly better than the GHI kernel for the data sets considered above.
Experiments on Cosine kernel
Table 5 compares the performance of the Projection method with the Cosine Kernel method for the considered data sets. Our Projection method demonstrates visibly better performance than the Cosine Kernel method in terms of averaged AUC values. From the bold text in the left column of the table, we can see that the Projection method is superior in almost all cases (except q=8, where the two methods show comparable performance). Apart from that, the Projection method is more stable than the Cosine Kernel method, because the standard deviation of the AUC values for each data set is smaller for the Projection method. In the Liver Disorder Data set, the averaged AUC value of the Cosine Kernel method is 65.63% with standard deviation 2.75%, and the Projection method is much better, achieving an averaged AUC value of 73.71%. The superiority of the Projection method over the Cosine Kernel method is clearly demonstrated in the Sonar Data as well: the averaged AUC value is 67.46% for the Cosine Kernel method but 89.57% for the Projection method. For the Cystic Fibrosis Data and the Leukemia Data, the Projection method shows a general decrease in performance as q increases, whereas for the Cosine Kernel method there is no obvious correlation between the performance and the value of q. In the NSCLC Data set, both the Projection method and the Cosine Kernel method show unsatisfactory performance, though the Projection method is clearly better than the original Cosine Kernel method.
Discussion
Experimental results show that the Projection method is better than or comparable with the compared kernel methods: the GHI kernel and the Cosine kernel. Although the GHI kernel and the Cosine kernel can sometimes yield good performance when indefinite, the Projection method still demonstrates comparable performance in those cases. The necessity of the Projection transformation for the considered indefinite kernels is clearly demonstrated. The Projection method with λ≥1 can transform an indefinite kernel into a PSD one. The optimal λ determination for the Projection method under four different divergences is also considered. Among the deduced optimal values of λ, we focus on the one under the LogDet divergence, as it is more realistic.
In the following, we conduct experiments on the considered data sets to check whether the suggested optimal λ of the Projection method attains optimal performance among the various values of λ>0.
Optimal λ in the projection method for sonar data
We set parameters α≠β∈{1,2,3} for GHI kernel and consider λ∈[0.1,200] with step size 0.1. Figure 2 plots the performance of Projection Method with different λ∈[0.1,200]. The ‘ ⋆’ shape in black color marks the performance of the Projection method with suggested optimal λ obtained under LogDet Divergence. The red line represents projected GHI kernel with α=1,β=2. The green line represents projected GHI kernel with α=1,β=3. The blue line represents projected GHI kernel with α=2,β=3. The cyan line plots the performance of projected Cosine Kernel.
The suggested optimal λ in Fig. 2 is 2.0 in Projected GHI kernel for all pairs of (α,β),α≠β. The performance of Projection Method shows a steady decrement when λ>2, implying that λ=2 is a good choice for projection method. When λ<1, the performance of projection method is quite unstable because the PSD property cannot be guaranteed.
It is very interesting to see that the suggested optimal λ is the same for the two considered kernels. Comparing the projected GHI kernel across different (α,β) pairs, we can see that the projection method with α=1,β=3 shows the best performance, 0.8733, while the experimental best performance is 0.8735, achieved at λ=1.9. When α=1,β=2, the projection method with the suggested optimal λ achieves 0.8246 in averaged AUC value, and the experimental best result is 0.8276. When α=2,β=3, the projection method with the suggested optimal λ achieves 0.8540 in averaged AUC value, and the experimental best result is 0.8557. For the projected Cosine Kernel, the experimental best AUC value of 0.9126 is achieved at λ=3.8, while our suggested optimal λ=2 yields an AUC value of 0.9051; the difference between the two values is small: 0.0075. We can conclude that, on this data set, the suggested optimal λ gives at least near-optimal performance.
Optimal λ in the projection method for liver disorder data
Figure 3 shows the performance of the Projection method with different kernels for the Liver Disorder Data. The experimental optimal λ for the GHI kernel with α=1,β=2 is 1.8, achieving an averaged AUC value of 0.7570. Our suggested λ under the Logdet divergence is 2.38, achieving an averaged AUC value of 0.7566. The performance difference between the Projection method with the theoretical optimal λ=2.38 and with the experimental optimal λ=1.8 is very small: 0.0004. When α=1,β=3, the experimental optimal λ for the GHI kernel is 1.6, with an averaged AUC value of 0.7381. Our suggested optimal λ is 2.37, with an averaged AUC value of 0.7333. The performance difference between the Projection method with the suggested optimal λ=2.37 and with the experimental optimal λ=1.6 is also very small: 0.0048. The experimental best AUC value of 0.7417 for the Cosine kernel is achieved at λ=0.4, while our suggested optimal λ=2.17 yields an AUC value of 0.7292. It can be seen that when λ>120, the performance of the projection method with the two considered kernels fluctuates. When λ<120, the performance of the projection method first increases and then decreases. Again, the suggested optimal λ gives at least near-optimal performance.
Optimal λ in the projection method for breast cancer data
Figure 4 records the performance of the Projection method for the Breast Cancer Data. The experimental optimal λ is 1.1 for the GHI kernel with α=1,β=2, achieving an averaged AUC value of 0.9713. Our suggested optimal λ is 4.5957, with an averaged AUC value of 0.9693. The performance becomes slightly worse as λ increases. The performance difference between the Projection method with the suggested optimal λ=4.5957 and with the experimental optimal λ=1.1 is subtle: about 0.002. Similar results are shown for the GHI kernel with other (α,β) pairs. The experimental best AUC value of 0.9941 is achieved at λ=0.8 for the Cosine Kernel, while our suggested optimal λ=4.29 yields an AUC value of 0.9939. Comparing the projected GHI kernel and the projected Cosine Kernel, we can see that the projected Cosine kernel shows visibly better performance, suggesting that we should choose the projected Cosine kernel for breast cancer prediction. When all the kernels are considered, the suggested optimal λ is a preferable choice for obtaining near-optimal performance across different values of λ.
Optimal λ in the projection method for cystic fibrosis data
Figure 5 records the performance of the Projection method with the 2 considered kernels. We can see that in almost all cases the Projection method shows identical performance for λ∈(0,200], except for q=1 when α=1,β=2 and α=1,β=3. One possible explanation is that the GHI kernel is already PSD before projection (please see Additional file 1: Table SI). When q=1 and α=1,β=2, the best AUC value for λ∈(0,200] is 0.7908 and the smallest AUC value is 0.7890, while the theoretical optimal λ yields 0.7905 in AUC value, which is near optimal. When α=1,β=3, the smallest AUC value is 0.7831 as λ approaches 200, the best AUC value is 0.7841 at λ=22.5, and the suggested optimal λ through the Logdet divergence yields 0.7840, which is also near optimal. For the Cosine kernel, the performance of the Projection method first improves and then descends gradually for q=1,2 and 3. For example, the best experimental performance for q=1 is achieved at λ=77, with an AUC value of 0.7987, while our suggested optimal λ=5.8 gets 0.7955 in AUC value, which is near optimal. When q increases, the performance of the projection method first improves and stays relatively stable afterwards. For example, when q=4, the suggested optimal λ=3.67 gets 0.7846 in AUC value and the experimental best performance of 0.8012 is obtained at λ=26.7. The denoising method (λ=1) achieves 0.6574 and the flipping method (λ=2) achieves 0.7441, implying that the projection method with the suggested λ is better than these two methods. Although it is not optimal, the performance of the projection method with the suggested λ is satisfactory, being only slightly inferior to the optimum.
Optimal λ in the projection method for leukemia data
Experimental results for the Projection method with the GHI kernel and the Cosine kernel on the leukemia data for q∈{1,2,…,9} are shown in Fig. 6. Similar to the Cystic Fibrosis Data, the Projection method with the GHI kernel shows almost identical performance for λ∈(0,200] when q∈{4,5,6,7,8,9}. This is consistent with the results in Table SI (please refer to Additional file 1), where the original GHI kernel is positive semidefinite when q≥4, as the minimal eigenvalue of the kernel matrix is 0. When q=1, the experimental best AUC value is 0.9364 for α=1,β=2, and our suggested optimal λ yields 0.9342. When q=2, the experimental best AUC value is 0.9545 for α=1,β=2, and our suggested optimal λ yields 0.9542. When q=3, the experimental best AUC value is 0.9539 for α=1,β=2, and our suggested optimal λ yields 0.9539. Experimental results are similar for other pairs of (α,β). Results for the projected GHI kernel show that our suggested optimal λ can induce a near-optimal projection method. In the case of the Cosine kernel, we can get some information from the cyan line in Fig. 6. For all the considered q, there is no overall tendency when λ≤2, but the averaged AUC values decrease slowly and steadily after the optimal performance is achieved. An interesting phenomenon is that the projection method always shows poor performance when λ=1 (the denoising method). For example, when q=2 the averaged AUC value of the projection method is 0.6134 for λ=1, but 0.7450 for λ=0.9 and 0.9318 for λ=1.1. When q=3 the averaged AUC value of the projection method is 0.4221 for λ=1, but 0.7784 for λ=0.9 and 0.9151 for λ=1.1. A probable explanation is that the denoising strategy neglects some hidden information embedded in the negative eigenvalues and eigenvectors, which is critical for describing the Leukemia Data.
Regarding the suggested optimal λ for the projected Cosine Kernel, we can see that the Projection method with the suggested optimal λ achieves at least near-optimal performance for all q∈{1,2,…,9}.
Optimal λ in the projection method with NSCLC data
The performance of Projection Method with GHI kernel and Cosine Kernel in NSCLC Data is shown in Fig.7. When α=1,β=2, the optimal averaged AUC value in experiment is 0.9979 and projection method with our suggested optimal λ=2 can also ensure best performance 0.9979. When α=1,β=3, the experimental optimal averaged AUC value 0.6145 and projection method with λ=2 can also ensure best performance 0.6145. When α=2,β=3, the optimal averaged AUC value 1 in experiment and projection method with our suggested optimal λ=2 can also ensure equivalent best performance. Another conclusion can be made is that projection method with GHI kernel in different pairs of (α,β) may perform significantly different. In this experiment, we can see that α=1,β=3 is not fit for the task. Taking into consideration of the projected Cosine Kernel method, we can also conclude that cosine kernel is not suitable for dealing with tumor differentiation in NSCLC data.
Table 6 lists the optimal λ under the Logdet divergence with the considered kernels for all the considered data sets. The first 3 columns give the suggested optimal λ for the projected GHI kernel method. It is interesting to see that for the cystic fibrosis data set the suggested optimal λ is either 1 or 100 in most cases (except for q=1 when α=1,β=2 and α=1,β=3). The situation is similar for the leukemia data set, where the suggested optimal λ is either 1 or 100 in most cases (q≥4). Note that our suggested optimal λ has the formula \( 1+\frac{m}{\sum_{i=1}^{m} d_{i}\,\text{tr}(K^{-1}\vec{p}_{i}\vec{p}_{i}')} \) (see Eq. (12)). Computational error may occur when the optimal λ is calculated to be close to ±∞, which is not realistic. We therefore amend the value accordingly: the optimal λ is defined to be 1 when it approaches −∞ and 100 when it approaches +∞. The last column lists the theoretical optimal λ for the considered data sets with the Cosine kernel. Computational error does not affect the projection method with the Cosine kernel on the cystic fibrosis data, as \(\sum_{i=1}^{m} d_{i}\,\text{tr}(K^{-1}\vec{p}_{i}\vec{p}_{i}')\), the denominator of the fraction in the formula, is not close to 0. We can draw some conclusions from the table. Firstly, for different kernels, the suggested optimal λ differs in most cases within the same data set. Secondly, for different data sets, the suggested optimal λ differs in most cases even under the same kernel type. Focusing on the GHI kernel, the suggested optimal λ for different (α,β) differs in most cases. Comparing the GHI kernel and the Cosine kernel, we see that even for the same data set one type of kernel may be positive semidefinite while the other is indefinite, which also partly explains why the suggested optimal λ differs.
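As a reading aid, the formula and the amendment described above can be written in display form. This is only a sketch: it assumes that the exponent on K is −1 (the later remark about substituting a pseudo-inverse for the inverse points that way), it uses the cyclic property of the trace, and the symbol \(\tilde{\lambda}\) for the value before the amendment is introduced here purely for illustration.

\[ \tilde{\lambda} = 1+\frac{m}{\sum_{i=1}^{m} d_{i}\,\text{tr}(K^{-1}\vec{p}_{i}\vec{p}_{i}')} = 1+\frac{m}{\sum_{i=1}^{m} d_{i}\,\vec{p}_{i}'K^{-1}\vec{p}_{i}}, \qquad \lambda_{\text{opt}} = \begin{cases} 1, & \tilde{\lambda}\to-\infty,\\ 100, & \tilde{\lambda}\to+\infty,\\ \tilde{\lambda}, & \text{otherwise.} \end{cases} \]

The second expression makes it visible that each trace is a scalar quadratic form, so it is a denominator close to 0 that drives \(\tilde{\lambda}\) towards ±∞ and triggers the amendment.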
Is there a better optimal λ for the projection method?
As stated above, under the Logdet divergence we can determine a near-optimal λ for the projection method. In this subsection we consider whether the projection method can be improved by finding a better optimal λ. Recall that the von Neumann divergence has the formula \(D_{\phi}(K,K_{0})=\text{tr}(K\log K-K\log K_{0}-K+K_{0})\). Here we apply a small perturbation to this formula and use \(D_{\phi}(K,K_{0})=\text{tr}K\,\text{tr}(\log K-\log K_{0})+\text{tr}(-K+K_{0})\). We can then determine the optimal λ by minimizing the following function
We can easily get
Therefore, the new optimal λ_{opt1} is given by the following formula:
We next conducted experiments on all the considered data sets to compare the optimal λ from the Logdet divergence, λ_{opt}, with the newly proposed λ_{opt1}, in conjunction with the projection method.
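To make the two divergences discussed in this subsection concrete, the following minimal sketch evaluates both for symmetric positive definite matrices. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, the matrix logarithm is computed through an eigendecomposition, and the first term of the perturbed divergence is read as the product of two traces, tr(K)·tr(log K − log K_0).

```python
import numpy as np


def sym_logm(K):
    """Matrix logarithm of a symmetric positive definite matrix (via eigh)."""
    w, V = np.linalg.eigh(K)
    if np.any(w <= 0):
        raise ValueError("matrix must be positive definite")
    # Scale each eigenvector column by the log of its eigenvalue.
    return (V * np.log(w)) @ V.T


def von_neumann_div(K, K0):
    """tr(K log K - K log K0 - K + K0)."""
    return float(np.trace(K @ sym_logm(K) - K @ sym_logm(K0) - K + K0))


def perturbed_von_neumann_div(K, K0):
    """tr(K) * tr(log K - log K0) + tr(-K + K0), assuming a product of traces."""
    log_diff = sym_logm(K) - sym_logm(K0)
    return float(np.trace(K) * np.trace(log_diff) + np.trace(K0 - K))


# Small illustrative matrices (not taken from the paper's data sets).
K = np.array([[2.0, 0.5], [0.5, 1.0]])
K0 = np.eye(2)
print(von_neumann_div(K, K0), perturbed_von_neumann_div(K, K0))
```

Under this reading, tr(log K) equals log det K, so the perturbed form depends on K and K_0 only through their traces and determinants, which makes it cheaper to evaluate than the original von Neumann divergence.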

Lambda Comparison with Projection Method in Sonar Data, Liver Disorder Data, Breast Cancer Data and NSCLC Data
As shown in Table 7, the optimal λ newly determined through the perturbed von Neumann divergence performs similarly to the optimal λ generated by the Logdet divergence. The only clear differences occur for the Sonar data, with the GHI kernel when α=2,β=3 and with the Cosine kernel. For the GHI kernel with α=2,β=3, λ_{opt} is superior to λ_{opt1}, while for the Cosine kernel λ_{opt1} is superior to λ_{opt}. Regarding the optimal λ determined under the different divergences, λ_{opt} differs from λ_{opt1}. For the GHI kernel, the optimal λ values determined under the Logdet divergence and the perturbed von Neumann divergence are similar on the Liver data set but quite different on the other data sets. For the Cosine kernel, λ_{opt} and λ_{opt1} are quite different from each other. Although the optimal λ determined under the Logdet divergence differs from that determined under the perturbed von Neumann divergence, the performance is comparable. Comparing both kernels, the Cosine kernel with λ_{opt1} is the preferred option.

Lambda Comparison with Projection Method in Cystic Fibrosis Data
From Table 8 we can draw some conclusions. For the GHI kernel, the projection method clearly shows almost identical performance with λ_{opt} and λ_{opt1}. It is interesting to see that in the GHI kernel case λ_{opt} and λ_{opt1} are equal except when q=1 for α=1,β=2 and α=1,β=3. From Table SI we know that the GHI kernel is indefinite in these 2 cases. Although the values of λ_{opt} and λ_{opt1} are quite different when q=1 for α=1,β=2 and α=1,β=3, the performances are similar. For the Cosine kernel, the projection method with λ_{opt1} tends to perform better for q∈{1,2,3,4,5,6,9}. Clear differences, marked in bold face, can be seen when q=3,4,5. Besides, λ_{opt1} for the Cosine kernel is larger than λ_{opt} in most cases (q∈{1,2,…,7}), meaning that the projection method with the Cosine kernel tends to perform better for relatively large λ. Comparing the GHI kernel and the Cosine kernel, we find that the GHI kernel in general tends to perform better for small q, and the Cosine kernel performs better when q is large.

Lambda Comparison with Projection Method in Leukemia Data
We can draw similar conclusions for the Leukemia data. As shown in Table 9, the projection method shows almost identical performance with λ_{opt} and λ_{opt1} even though different optimal λ values are obtained (see q=1,2,3 respectively). When q=1, λ_{opt1} is smaller than λ_{opt}. When q=2,3, λ_{opt1} is larger than λ_{opt}. Although the values of the optimal λ differ, the performances are quite similar, meaning that the projection method with the GHI kernel on the Leukemia data is not very sensitive to the optimal λ. When q∈{4,5,6,7,8,9}, λ_{opt} and λ_{opt1} are identical; we can see from Table SI that the GHI kernel is already PSD in these cases. For the Cosine kernel, the optimal λ determined by the Logdet divergence differs from that determined by the perturbed von Neumann divergence. The projection method with λ_{opt1} performs slightly better than the projection method with λ_{opt}. Besides, λ_{opt1} for the Cosine kernel is larger than λ_{opt}, implying that the projection method tends to perform better for large λ. Focusing on the performance of the projection method with λ_{opt1}, we find that, unlike on the Cystic Fibrosis data set, the projected Cosine kernel with λ_{opt1} tends to perform better for small q, while the projected GHI kernel with λ_{opt} tends to perform better for large q.
In summary, when λ∈(0,1), the positive semidefiniteness of the projected kernel matrix cannot be assured, and the performance tends to be extremely unstable. The suggested optimal λ in the projection method is related to the eigenvalues of the original kernel matrix and thus varies across data sets. Besides, the suggested optimal λ under the Logdet divergence and under the perturbed von Neumann divergence differ on the same data set in most cases. Even so, the projection method achieves near-optimal performance with either choice. When the optimal λ under the Logdet divergence and the optimal λ under the perturbed von Neumann divergence are very different, the performance of the projection method in the two cases is still similar, showing that in this case the projection method is relatively insensitive to the value of the suggested optimal λ (the projection method achieves near-optimal performance over a large range of λ values). Our suggested theoretical λ under the Logdet divergence and the perturbed von Neumann divergence sometimes cannot guarantee the best performance. There are two possible reasons. One is that the determination of the optimal λ by unconstrained optimization in the kernel learning framework assumes positive definite kernels, whereas we use indefinite kernels here. The other is that the inverse of the kernel matrix was replaced by the pseudo-inverse.
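The remark about λ∈(0,1) can be checked numerically with a small sketch. The parametrisation used here is an assumption chosen only to match the behaviour described in the text (λ=1 gives the denoising method, λ=2 gives the flipping method): every negative eigenvalue d_i of the kernel matrix is replaced by (1−λ)d_i. The projection matrix actually used in the paper is defined in the Methods section and may be written differently; the function name `project_kernel` and the toy matrix are hypothetical.

```python
import numpy as np


def project_kernel(K, lam):
    """Assumed spectrum form: each negative eigenvalue d_i becomes (1 - lam) * d_i."""
    K = (K + K.T) / 2.0                      # symmetrise against round-off
    d, P = np.linalg.eigh(K)                 # eigenvalues d, eigenvectors in columns of P
    d_new = np.where(d < 0, (1.0 - lam) * d, d)
    return (P * d_new) @ P.T                 # rebuild the matrix from the new spectrum


# Toy indefinite "kernel" with eigenvalues -1 and 3 (not from the paper's data).
K = np.array([[1.0, 2.0], [2.0, 1.0]])
for lam in (0.5, 1.0, 2.0):
    smallest = np.linalg.eigvalsh(project_kernel(K, lam)).min()
    print(f"lambda={lam}: smallest eigenvalue = {smallest:.4f}")
```

Under this assumed form, the smallest eigenvalue stays negative for λ=0.5, is zero up to round-off for λ=1, and is positive for λ=2, which agrees with the statement that positive semidefiniteness cannot be assured when λ∈(0,1).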
Conclusions
In this paper, we propose a projection method for addressing indefinite kernel learning problems. The projection method is constructed from an eigenspace perspective. Varying the parameter λ makes it very flexible, moving from the denoising method to the flipping method. These two spectrum-based methods are well-known techniques for dealing with indefinite kernels. Two kernels that are not generally PSD are introduced for comparison: the GHI kernel and the Cosine kernel. We show better performance for the projection method in terms of AUC values under 5-fold cross-validation. The optimal λ embedded in the projection method can be determined by solving an unconstrained optimization problem. The experimental studies are consistent with the theoretical analysis, as the projection method with our suggested λ always achieves at least near-optimal performance for λ>0. In pursuit of a more precise method for determining the optimal λ, we also compared the optimal λ obtained with the Logdet divergence and with the perturbed von Neumann divergence, aiming to find a better λ for the projection method. The determined optimal λ differs across the kernels and data sets involved, and the results obtained are in general similar. Our proposed projection method may be regarded as a good choice for dealing with indefinite kernels. Future work may contribute to more precise methods for determining the optimal λ and to more variants of the projection method for indefinite kernels, with the hope that they can be applied in other areas.
References
1. Vapnik V. The Nature of Statistical Learning Theory, 2nd edn. New York: Springer; 1995.
2. Vapnik V. Statistical Learning Theory. New York: John Wiley; 1998.
3. Carrizosa E, Morales DR. Supervised classification and mathematical optimization. Comput Oper Res. 2013; 40:150–65.
4. Scholkopf B, Koji T, Jean PV. Kernel Methods in Computational Biology. London: MIT Press; 2004.
5. Pan B, Zhang G, Xia J, Yuan P, Ip H, He Q, Lee P, Chow B, Zhou X. Prediction of soft tissue deformations after CMF surgery with incremental kernel ridge regression. Comput Biol Med. 2016; 75:1–9.
6. Liu X, Yuen P, Feng G, Chen W. Learning kernel in kernel-based LDA for face recognition under illumination variations. IEEE Signal Process Lett. 2009; 16:1019–22.
7. Pan B, Lai J, Chen W. Nonlinear nonnegative matrix factorization based on Mercer kernel construction. Pattern Recogn. 2011; 44:2800–10.
8. Scholkopf B, Smola AJ. Learning with Kernels. London: MIT Press; 2001.
9. Altschul SF, et al. A basic local alignment search tool. J Mol Biol. 1990; 215:403–10.
10. Saigo H, Vert J, Ueda N, Akutsu T. Protein homology detection using string alignment kernels. Bioinformatics. 2004; 11:1682–9.
11. Shimodaira H, Noma Ki, Nakai M, Sagayama S. Dynamic time-alignment kernel in support vector machine. In: Advances in Neural Information Processing Systems 14. London: MIT Press; 2002. p. 921–8.
12. Boughorbel S, Tarel J, Bougemaa N. Generalized histogram intersection kernel for image recognition. In: Proc. IEEE The 2005 International Conference on Image Processing. Genoa: IEEE; 2005. p. 161–4.
13. Lin HT, Lin CJ. A study on sigmoid kernel for SVM and the training of non-PSD kernels by SMO-type methods. Taipei, Taiwan: National Taiwan University; 2003. Technical report.
14. Smola AJ, Óvári ZL, Williamson RC. Regularization with dot-product kernels. In: Advances in Neural Information Processing Systems 13. London: MIT Press; 2001. p. 308–14.
15. Haasdonk B. Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell. 2005; 27:482–92.
16. Muñoz A, Diego IM. From Indefinite to Positive Semi-Definite Matrices. Berlin Heidelberg: Springer; 2006. p. 764–72.
17. Pekalska E, Paclik P, Duin RPW. A generalized kernel approach to dissimilarity-based classification. J Mach Learn Res. 2002; 2(2):175–211.
18. Graepel T, Herbrich R, Bollmann-Sdorra P, Obermayer K. Classification on pairwise proximity data. Adv Neural Inf Process Syst. 1998; 11:438–44.
19. Wu G, Chang EY, Zhang ZH. An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. In: International Conference on Machine Learning. Bonn: ACM; 2005. p. 1682–9.
20. Roth V, Laub J, Kawanabe M. Optimal cluster preserving embedding of non-metric proximity data. IEEE Trans Pattern Anal Mach Intell. 2000; 25:1540–51.
21. Ong C, Mary X, Canu S, Smola A. Learning with non-positive kernels. In: International Conference on Machine Learning. Banff: ACM; 2004. p. 639–46.
22. Luss R, D’Aspremont A. Support vector machine classification with indefinite kernels. Math Program Comput. 2009; 1(2):97–118.
23. Guo Y, Schuurmans D. A reformulation of support vector machines for general confidence functions. In: Proceedings of Asian Conference on Machine Learning: Advances in Machine Learning. Nanjing: Springer; 2009. p. 109–19.
24. Gu S, Guo Y. Learning SVM classifiers with indefinite kernels. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence. Toronto: AAAI Press; 2012.
25. Brian K, Sustik MA, Dhillon IS. Learning low-rank kernel matrices. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh: ACM; 2006. p. 505–12.
26. Nock R, Magdalou EBB, Nielsen F. Mining matrix data with Bregman matrix divergences for portfolio selection. Matrix Inf Geom. Berlin Heidelberg: Springer; 2013. p. 373–402.
27. Li FX, Fu YS, Dai YH, Cristian S, Wang J. Kernel learning by unconstrained optimization. In: Proceedings of International Conference on Artificial Intelligence and Statistics. vol. 5. Proceedings of Machine Learning Research; 2009. p. 328–35.
28. Conforti D, Guido R. Kernel based support vector machine via semidefinite programming: Application to medical diagnosis. Comput Oper Res. 2010; 37(8):1389–94.
29. Libsvm Data Sets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Accessed 8 Apr 2016.
30. Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita KF, Ueda N, Hamajima M, Kawasaki T, Kanehisa M. KEGG as a glycome informatics resource. Glycobiology. 2006; 16(5):63–70.
31. Doubet S, Albersheim P. CarbBank. Glycobiology. 1992; 2(6):505–7.
32. NCBI (National Center of Biotechnology Information) GEO (Gene Expression Omnibus) Repository. https://www.ncbi.nlm.nih.gov/gds/. Accessed 2 Mar 2017.
33. Jiang H, Ching WK, Qiu YS, Cheng XQ. Projection method for support vector machines with indefinite kernels. In: Proceedings of the 12th International Symposium on Operations Research and Its Applications in Engineering, Technology and Management (ISORA 2015). LuoYang: IET; 2015. p. 137–43.
Acknowledgements
The authors would like to thank Prof. Kiyoko Aoki-Kinoshita for providing the cystic fibrosis data and Samuel Emersion Harvey for helping to polish the manuscript. A preliminary version of the paper has been published in the proceedings of ISORA 2015 [33].
Funding
This research is supported in part by the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China, HKU Strategic Theme in Computation and Information, National Natural Science Foundation of China Grant Nos. 11626229, 11271144, 11671158, and 61472428 and Natural Science Foundation of SZU (No. 2017058). The publication costs are funded by Natural Science Foundation of SZU (No. 2017058).
Availability of data and materials
All the data sets are publicly available and can be accessed from the following databases: LIBSVM Data (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets), CFG (Consortium for Functional Glycomics, http://www.functionalglycomics.org/) and NCBI (National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/).
About this supplement
This article has been published as part of BMC Systems Biology Volume 11 Supplement 6, 2017: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2016: systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume11supplement6.
Authors’ contributions
JH designed the research. JH and WKC proposed the methods and did the theoretical analysis. JH and QYS collected the data. JH, QYS and CXQ conducted the experiments and analyzed the results. JH, QYS, WKC and CXQ wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Table SI. Additional file 1 contains one table, Table SI, which records the maximum and minimum eigenvalues of the considered kernels generated for all the considered data sets. (PDF 14 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 SVM
 Indefinite kernel
 Projection method
 Bregman matrix divergence