Overview of Feature Reduction/Classification Process
Microarray data sets have a significant feature-rich/case-poor problem which can lead to over-fitting (i.e., models that produce excellent results on the training data but may not be valid or perform well on the test data) unless the number of features is significantly reduced prior to the generation of any classification or prediction model. The objective of this three-step process is to identify those significant features which are most useful in producing an accurate classification or prediction model. The process of feature reduction/classification is depicted in Figure 1 and consists of a Coarse Feature Reduction (CFR) process, followed by a Fine Feature Selection (FFS) process and then classification.
Coarse Feature Reduction
The automated CFR employs a simple two-sample t-test followed by variance pruning (cut-off based on the coefficient of variation). It is a simple process that removes a large number of probes that are not useful for classification, i.e., those not considered statistically significant for classification. See [22–24] for details on variance pruning.
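As a concrete illustration, the sketch below applies a two-sample t-test and a coefficient-of-variation cut-off to a gene-expression matrix. The threshold values (p < 0.05, CV > 0.1) and the function name are illustrative assumptions, not values taken from the original study.

```python
import numpy as np
from scipy import stats

def coarse_feature_reduction(X, labels, p_cutoff=0.05, cv_cutoff=0.1):
    """Illustrative CFR: two-sample t-test followed by variance pruning.

    X      : (n_samples, n_probes) expression matrix
    labels : binary class labels (0/1) of length n_samples
    Returns the column indices of probes that survive both filters.
    The cut-off values are illustrative, not those used in the paper.
    """
    group0 = X[labels == 0]
    group1 = X[labels == 1]

    # Two-sample t-test per probe: keep probes that differ between classes.
    _, p_values = stats.ttest_ind(group0, group1, axis=0)
    significant = p_values < p_cutoff

    # Variance pruning: keep probes whose coefficient of variation is large
    # enough to carry usable signal.
    cv = X.std(axis=0) / (np.abs(X.mean(axis=0)) + 1e-12)
    variable = cv > cv_cutoff

    return np.where(significant & variable)[0]
```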
Partial Least Squares
This section contains a brief, heuristic overview of Partial Least Squares (PLS). PLS is an extension of least squares regression (LSR). In LSR, the response variable y is predicted from p coordinates and n observations, denoted by X = {x_1, x_2, ..., x_n}^T, where each x_i is a p-dimensional vector. PLS finds "new variables" through the construction of specific combinations of the original coordinates. These "latent variables" explain both the y response as well as the covariate space and are denoted by the following expressions:
(1) X = t_1 p_1 + t_2 p_2 + ... + t_M p_M + ε
(2) y = t_1 q_1 + t_2 q_2 + ... + t_M q_M + ζ
where:
- t_s = the s-th latent variable (or conjugate vector; an n by 1 column vector). Generally most of the variability is characterized by M latent variables, with a maximum of M = 5 required for most problems.
- p_s and q_s = the s-th weight vectors (p_s is a 1 by p row vector, q_s is a scalar).
- ε, ζ = small errors in the remaining parts not explained by the latent variables.
For this microarray data set, we began with 271 features after CFR and reduced this set to a minimum of 1 and a maximum of 5 latent variables (see Results section). Therefore, the principal advantage of PLS for a problem of this type is its ability to handle a very large number of features: a fundamental problem of a feature-rich/case-poor data set. PLS then performs a least-squares fit (LSF) onto these latent variables, where this LSF is a linear combination that is highly correlated with the desired y response while, at the same time, accounting for the feature space variability. A summary of the features and advantages of PLS follows:
- PLS algorithms are very resistant to over-fitting, when compared to LSR, and are fast and reasonably easy to implement.
- For most problems with few data points and high dimensionality, where PLS excels, a least squares solution may not be possible due to the singularity problem.
- PLS regression maps the original data into a lower-dimensional space using a projection matrix W and computes a least squares solution in this space. See the algorithm below for the definition of W.
- What makes PLS especially interesting for biomedical and data mining applications is its extension using kernels, which leads to kernelized PLS (K-PLS), similar to the treatment in SVM.
- PLS may be considered an improved principal component analysis (PCA).
- The first key difference from PCA is that PLS computes an orthogonal factorization of the input vector X and response y (note: y can also be a vector) in the process of computing the projection matrix W.
- The second key difference from PCA is that the least squares model for K-PLS is based on an approximation of the input and response data, not the original data.
- PLS and PCA use different mathematical models to compute the final regression coefficients. Specifically, in PLS the new set of basis vectors (analogous to the eigenvectors of X^T X in PCA) is not a succession of orthogonal directions that explain the largest variance in the data, but rather a set of conjugate gradient vectors in the correlation matrices which span a Krylov space.
An algorithm for the PLS paradigm follows:
1. Let X_1 = X, y_1 = y.
2. For m = 1 to M, where M = the desired number of latent variables, do:
   (a) Compute the direction of maximum variance: w_m = (X_m)^T y_m
   (b) Project X onto w: t_m = X_m w_m
   (c) Normalize t: t_m = t_m / |t_m|
   (d) Deflate X: X_{m+1} = X_m - t_m (t_m)^T X_m
   (e) Deflate y: y_{m+1} = y_m - t_m (t_m)^T y_m
   (f) Normalize y after deflation: y_{m+1} = y_{m+1} / |y_{m+1}|
3. Finally, compute the regression coefficients using the latent variables: β = W (T^T X W)^{-1} T^T y
where:
- w_m is the m-th column vector of W
- t_m is the m-th column vector of T
- X_m and y_m are the input matrix and response vector that are being deflated, and β is the linear regression coefficient vector. A geometric representation of part of the algorithm and the insight of deflation can be seen in Figure 2.
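To make the steps above concrete, the following is a minimal sketch of the deflation algorithm in Python/NumPy. It assumes X and y have already been mean-centered; the function name and default M are illustrative, not part of the original implementation.

```python
import numpy as np

def pls_fit(X, y, M=5):
    """Minimal PLS sketch following the deflation algorithm above.

    X: n-by-p centered input matrix, y: length-n centered response.
    Returns the regression coefficient vector beta of length p.
    """
    n, p = X.shape
    Xm = X.copy()
    ym = y.astype(float).copy()
    W = np.zeros((p, M))   # weight vectors w_m as columns
    T = np.zeros((n, M))   # latent variables t_m as columns
    for m in range(M):
        w = Xm.T @ ym                      # (a) direction w_m = (X_m)^T y_m
        t = Xm @ w                         # (b) project: t_m = X_m w_m
        t = t / np.linalg.norm(t)          # (c) normalize t_m
        Xm = Xm - np.outer(t, t @ Xm)      # (d) deflate X
        ym = ym - t * (t @ ym)             # (e) deflate y
        ym = ym / np.linalg.norm(ym)       # (f) normalize y after deflation
        W[:, m] = w
        T[:, m] = t
    # Step 3: beta = W (T^T X W)^{-1} T^T y
    beta = W @ np.linalg.solve(T.T @ X @ W, T.T @ y)
    return beta
```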
Kernelized Partial Least Squares
Non-linear relationships between variables may be found by embedding the data in a kernel-induced feature space. See [25] for a good reference on kernel learning. This kernel "trick" is used in PLS and is called K-PLS. Consider now a mapping ϕ, which maps any data vector from the sample space to some other (possibly infinite dimensional) Euclidean space (the feature space F):
(3) ϕ: x → ϕ(x) ∈ F
The mapping will "recode" the data set as:
(4) {(ϕ(x_1), y_1), (ϕ(x_2), y_2), ..., (ϕ(x_n), y_n)}
This mapping of the data set is from a non-linear input space to a linear feature space. That is, although the data representation in the input space X is non-linear, after the data are processed by the ϕ mapping, the data characterized by this mapping are linear in the feature space, with the happy result that linear techniques may be used on the mapped data while preserving the non-linear properties represented in the input space. This mapping is accomplished, as previously stated, by using a valid kernel function.
Adding this kernel-induced capability to the PLS approach means that a real time, non-linear optimal training method now exists which can be used to perform computer aided diagnosis. A second advantage of this approach is that a kernel function K(x1,x2) computes the inner products 〈ϕ(x1),ϕ(x2)〉 in the feature space directly from the samples x1 and x2, without having to explicitly perform the mapping, making the technique computationally efficient. This is especially useful for algorithms that only depend on the inner product of the sample vectors, such as SVM and PLS.
Computationally, kernel mappings have the following important properties: (1) they enable access to an exceptionally high (even infinite) dimensional and, consequently, very flexible feature space at a correspondingly low time and space computational cost; (2) they solve a convex optimization problem without becoming "trapped" in local minima; and, more importantly, (3) the approach decouples the design of the algorithm from the specification of the feature space. Therefore, both the learning algorithms and the specific kernel designs become easier to analyze.
The algorithm used to develop the K-PLS model is given below. Details can be found in [26].
1. Let K_0 = (K_ij = ⟨ϕ(x_i), ϕ(x_j)⟩ = K(x_i, x_j)), i, j = 1, ..., n, be the n by n Gram matrix induced by K, the selected kernel function corresponding to ϕ(·). Let K_1 be the centered form of K_0, let y be the response variable, normalized to have a mean of 0 and a standard deviation of 1, and let M be the desired number of latent variables.
2. For m = 1 to M, do:
   (a) t_m = K_m · y_m
   (b) Normalize t: t_m = t_m / |t_m|
   (c) Deflate K: K_{m+1} = (I - t_m (t_m)^T) K_m (I - t_m (t_m)^T)
   (d) Deflate y: y_{m+1} = y_m - t_m (t_m)^T y_m
   (e) Normalize y after deflation: y_{m+1} = y_{m+1} / |y_{m+1}|
3. Finally, compute the regression coefficients β using the latent variables, analogously to step 3 of the linear PLS algorithm above.
4. The regression equation then becomes:
(5) ŷ(x) = Σ_{i=1}^{n} β_i K_1(x_i, x)
Note that x is any sample from the testing data to be predicted and K_1(x_i, x) is an element of the centered form of the training/testing kernel matrix.
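The sketch below illustrates one common formulation of K-PLS: the PLS deflation applied to a centered Gram matrix, here built with a Gaussian kernel. The kernel choice, the centering steps, and the final dual-coefficient formula follow the standard kernel-PLS literature rather than the exact details of [26], so treat it as an illustration, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpls_fit(X, y, M=5, sigma=1.0):
    """Illustrative K-PLS fit: PLS-style deflation on the centered Gram
    matrix. Returns the dual coefficients beta of length n."""
    n = X.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K1 = C @ gaussian_kernel(X, X, sigma) @ C    # centered Gram matrix
    y1 = (y - y.mean()) / y.std()                # normalized response

    Km, ym = K1.copy(), y1.copy()
    T = np.zeros((n, M))                         # latent variables t_m
    U = np.zeros((n, M))                         # responses used at step (a)
    for m in range(M):
        u = ym.copy()
        t = Km @ u                               # (a) t_m = K_m y_m
        t /= np.linalg.norm(t)                   # (b) normalize t_m
        T[:, m], U[:, m] = t, u
        P = np.eye(n) - np.outer(t, t)
        Km = P @ Km @ P                          # (c) two-sided deflation of K
        ym = ym - t * (t @ ym)                   # (d) deflate y
        ym /= np.linalg.norm(ym)                 # (e) normalize y
    # Dual regression coefficients (standard kernel-PLS form, an assumption).
    return U @ np.linalg.solve(T.T @ K1 @ U, T.T @ y1)

def kpls_predict(beta, X_train, y_train, X_test, sigma=1.0):
    """Prediction via equation (5): y(x) = sum_i beta_i K_1(x_i, x)."""
    n = X_train.shape[0]
    K_train = gaussian_kernel(X_train, X_train, sigma)
    K_test = gaussian_kernel(X_test, X_train, sigma)
    C = np.eye(n) - np.ones((n, n)) / n
    ones_t = np.ones((X_test.shape[0], n)) / n
    K_test_c = (K_test - ones_t @ K_train) @ C   # center test kernel consistently
    return (K_test_c @ beta) * y_train.std() + y_train.mean()
```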
Evolutionary Programming derived K-PLS machines
The particular K-PLS kernel types and kernel parameters were derived using an evolutionary process based on the work of Fogel [27] called Evolutionary Programming (EP). EP is a stochastic process in which a population of candidate solutions is evolved to match the complexity and structure of the problem space.
This process iteratively generates, evaluates, and selects candidates to produce a near-optimal solution without using gradient information, and is therefore well suited to the task of simultaneously generating both the K-PLS model architecture (kernel) and its parameters. A description of this process is shown in Figure 3 and given in more detail below.
1. Initial K-PLS parameter population created: A population of candidate solutions (K-PLS kernel architectures and parameters) is randomly generated.
2. Mutation of K-PLS machines: Each of these candidate solutions is then copied and mutated, yielding a solution pool of twice the original size, using the equation given below:
(6) v'_i = v_i · exp((1/√(2m)) · N(0,1) + (1/√(2√m)) · N_i(0,1)), i = 1, ..., m
where m is the total number of configurable parameters being evolved, N(0,1) is a standard normal random variable sampled once for all m parameters of the v vector, and N_i(0,1) is a standard normal random variable sampled anew for each of the m parameters of the v vector.
The second step of this mutation process comprises the updating of each configurable parameter for all elements of the evolving population. If we let the vector γ_i denote these elements for each individual member of the population, this update is accomplished as follows:
(7) γ'_i = γ_i + v'_i · C
Here C is a standard Cauchy random variable. It is used because it has longer tails and offers better mutation performance.
3. Selection of K-PLS machines: All elements of this pool are scored using an objective function. These objective function scores are then used to order the candidate solutions from "most fit" to "least fit." Better results are usually obtained by using tournament selection methodologies. With tournament selection, each candidate solution competes against a random subset of the remaining solutions. Finally, the upper 50% of the solution pool is selected to continue as the basis for the next generation and the remaining 50% are "killed off" (discarded) to reduce the pool to the original population size. This process is generally repeated for a specified number of generations, unless some other stopping criterion is used.
For more details on the EP process, refer to our previous work [28].
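A minimal sketch of this evolutionary loop is given below. The fitness function, population size, and parameter initialization are placeholders (a generic score to be maximized), and the self-adaptive mutation follows equations (6) and (7) as reconstructed above; it is a sketch, not the authors' EP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve(fitness, m, pop_size=20, generations=50, tournament_k=5):
    """Illustrative EP loop: lognormal self-adaptation (eq. 6), Cauchy
    mutation (eq. 7), and tournament selection. `fitness` maps a parameter
    vector of length m (e.g., kernel hyper-parameters) to a score to maximize."""
    tau_global = 1.0 / np.sqrt(2.0 * m)
    tau_local = 1.0 / np.sqrt(2.0 * np.sqrt(m))

    # Each individual: (parameter vector gamma, strategy vector v).
    pop = [(rng.normal(size=m), np.ones(m)) for _ in range(pop_size)]

    for _ in range(generations):
        offspring = []
        for gamma, v in pop:
            # Eq. (6): mutate the strategy parameters.
            v_new = v * np.exp(tau_global * rng.normal()
                               + tau_local * rng.normal(size=m))
            # Eq. (7): perturb each parameter with a Cauchy step.
            gamma_new = gamma + v_new * rng.standard_cauchy(size=m)
            offspring.append((gamma_new, v_new))

        # Tournament selection over the doubled pool: each candidate earns a
        # win for every member of a random subset it scores at least as well as.
        pool = pop + offspring
        scores = np.array([fitness(g) for g, _ in pool])
        wins = np.zeros(len(pool))
        for i in range(len(pool)):
            rivals = rng.choice(len(pool), size=tournament_k, replace=False)
            wins[i] = np.sum(scores[i] >= scores[rivals])
        keep = np.argsort(-wins)[:pop_size]      # upper 50% survives
        pop = [pool[i] for i in keep]

    best = max(pop, key=lambda ind: fitness(ind[0]))
    return best[0]

# Usage sketch: maximize a toy objective (stand-in for a K-PLS validation score).
best_params = evolve(lambda g: -np.sum(g ** 2), m=3)
```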
Support Vector Machine and its capacity to reach the global optimum
The K-PLS results were validated by using another kernel-based Statistical Learning Theory model called a Kernelized Support Vector Machine (K-SVM). SVMs were developed by Vapnik [29–31]. Tutorials on SVMs can be found in [32] and [25].
The discussion below provides the theoretical explanation for why SVMs can always be trained to a global minimum and thereby should provide better diagnostic accuracy when compared with neural networks trained by back-propagation.
Assume there exist n experimentally derived observations. Each observation (or training example) consists of a vector x_i containing the input pattern and a corresponding known classification y_i. The objective of the learning machine is to formulate a mapping x_i → y_i. Now consider a set of functions f(x,α) with adjustable parameters α that defines a set of possible mappings x → f(x,α). Here, x is given and α is chosen. In the case of a traditional neural network of fixed architecture, the α values would correspond to the weights and biases. The quantity R(α), known as the expected risk, associated with learning machines is defined as:
(8) R(α) = ∫ ½ |y - f(x, α)| p(x, y) dx dy
where p(x, y) is an unknown probability density function from which the examples were drawn. This risk function is the expected value of the test (or validation) error for a trained learning machine. It may be shown that the best possible generalization ability of a learning machine is achieved by minimizing R(α), the expected risk. There exists an error bound on generalization for binary classification, which holds with probability at least 1 - η, 0 ≤ η ≤ 1, for all approximating functions that minimize the expected risk:
(9) R(α) ≤ R_emp(α) + √( (h(ln(2n/h) + 1) - ln(η/4)) / n )
The first term on the right-hand side, R_emp(α), is known as the "empirical risk," expressed by:
(10) R_emp(α) = (1/2n) Σ_{i=1}^{n} |y_i - f(x_i, α)|
Empirical risk is a measure of the error rate on the training set for a fixed, finite number of observations. This value is fixed for a particular choice of α and a given training set {(x_i, y_i), i = 1, 2, ..., n}. The second term in (9) is the "Vapnik-Chervonenkis (VC) confidence interval." This term is a function of the number of training samples n, the probability value η and the VC dimension h. The VC dimension is the maximum number of training samples that can be learned by a learning machine without error for all possible labelings of the classification functions f(x,α), and is, therefore, a measure of the capacity of the learning machine. In traditional neural network implementations, the confidence interval is fixed by choosing a network architecture a priori. Neural network training by back-propagation minimizes the empirical risk only.
In contrast to neural networks, in an SVM design and implementation, not only is the empirical risk minimized, but the VC confidence interval is also minimized by using the principle of structural risk minimization (SRM). Therefore, SVM implementations simultaneously minimize the empirical risk as well as the risk associated with the VC confidence interval, as defined in the above expression. The bound in (9) also shows that as n → ∞, the empirical risk approaches the true risk because the VC confidence interval approaches zero. The reader may recall that obtaining larger and larger sets of valid training data would sometimes produce (with a great deal of training experience) a better performing neural network using classical training methods. This restriction does not apply to the SRM principle and is the fundamental difference between training neural networks and training SVMs. Finally, because SVMs minimize the expected risk, they provide a global minimum.
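As an illustration of this validation step, the sketch below trains a kernelized SVM on the CFR-reduced features using scikit-learn (a tooling choice made here for illustration; the original study used its own implementation). The RBF kernel and the grid of C and γ values are assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

def train_ksvm(X, y):
    """Fit an RBF-kernel SVM with a small hyper-parameter search.

    X: (n_samples, n_features) matrix of CFR-selected features
    y: binary class labels
    """
    param_grid = {
        "C": [0.1, 1.0, 10.0, 100.0],        # soft-margin penalty
        "gamma": [1e-3, 1e-2, 1e-1, 1.0],    # RBF kernel width
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_

# Usage sketch: compare cross-validated accuracy against the K-PLS model.
# accuracy = cross_val_score(train_ksvm(X, y), X, y, cv=5).mean()
```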
Measures of similarity for classification provided by various kernels
Understanding what similarity means as applied to K-PLS and K-SVM often provides additional insight into proper kernel selection. Therefore, we now consider kernel functions and their application to K-PLS and K-SVMs. K-PLS and K-SVM solutions in non-linear, non-separable learning environments utilize kernel based learning methods. Consequently, it is important to understand the practical implications of using these kernels. Kernel based learning methods are those methods which use a kernel as a non-linear similarity measure to perform comparisons. That is, these kernel mappings are used to construct a decision surface that is non-linear in the input space, but has a linear image in the feature space. To be a valid mapping, these inner product kernels must be symmetric and also satisfy Mercer's theorem [33]. The concepts described here are not limited to K-PLS and K-SVMs; the general principles apply to other kernel based classifiers as well.
A kernel function should yield a higher output from input vectors which are very similar than from input vectors which are less similar. An ideal kernel would provide an exact mapping from the input space to a feature space which was a precise, separable model of the two input classes; however, such a model is usually unobtainable, particularly for complex, real-world problems, and those problems in which the input vector provided contains only a subset of the information content needed to make the classes completely separable. As such, a number of statistically-based kernel functions have been developed, each providing a mapping into a generic feature space that provides a reasonable approximation to the true feature space for a wide variety of problem domains. The kernel function that best represents the true similarity between the input vectors will yield the best results, and kernel functions that poorly discriminate between similar and dissimilar input vectors will yield poor results. As such, intelligent kernel selection requires at least a basic understanding of the source data and the ways different kernels will interpret that data.
Some of the more popular kernel functions are the (linear) dot product (11), the polynomial kernel (12), the Gaussian Radial Basis Function (GRBF) (13), and the Exponential Radial Basis Function (ERBF) (14), which will be discussed below.
The dot product and polynomial kernels are given by
(11) K(u, v) = u · v
(12) K(u, v) = (u · v + 1)^d
respectively. Both use the dot product (and therefore the angle between the vectors) to express similarity; however, the input vectors to the polynomial kernel must be normalized (i.e., unit vectors). This restricts the range of the dot product in (12) to ±1, yielding kernel outputs between 0 and 2^d, where d is the degree of the polynomial. The implication of the dot product kernel having a positive and negative range (versus the strictly non-negative polynomial kernel) is that the classification process can learn from an unknown vector's dissimilarity to a known sample, rather than just its similarity. While the dot product kernel gives relatively equal consideration to similar and dissimilar input vectors, the polynomial kernel gives exponentially greater consideration to those cases which are very similar than to those that are orthogonal or dissimilar. The value of d determines the relative importance given to the more similar cases, with higher values implying greater importance. Measures of similarity for these two kernels are depicted in Figures 4 and 5.
The Gaussian and Exponential RBF kernels are given by
(13) K(u, v) = exp(-||u - v||² / (2σ²))
(14) K(u, v) = exp(-||u - v|| / (2σ²))
respectively.
The Gaussian and Exponential RBF kernels use the Euclidean distance between the two input vectors as a measure of similarity instead of the angle between them (see Figures 6 and 7).
Since ||u - v|| is always non-negative, both kernels achieve a maximum output of one when ||u - v|| = 0, and approach zero as ||u - v|| increases. This approach is made faster or slower by smaller or larger values for σ, respectively. Figure 6 shows the output of the GRBF kernel as a function of the distance between the input vectors for several different values of σ. Figure 7 shows the output of the ERBF kernel.
It is clear from Figures 6 and 7 that the distance at which the kernel output reaches approximately zero varies with σ, and therefore the choice of σ for this kernel is essential in properly distinguishing the level of similarity between two input vectors. If the value of σ is too small (that is, most pairs of vectors are far enough apart that the kernel output is near zero), the SVM will have too little information to make an accurate classification. If the value of σ is too large, so that even very distant pairs of input vectors produce a moderate output, the decision surface will be overly smooth. This may mask smaller distinctive characteristics which exist in the ideal decision surface, and will also increase the effect outliers in the training data have on the classification of an unknown point.
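The following sketch implements the four kernels as reconstructed in equations (11)–(14) and can be used to reproduce curves like those in Figures 4–7; the specific σ and d values are illustrative.

```python
import numpy as np

def dot_kernel(u, v):
    return np.dot(u, v)                                          # eq. (11)

def polynomial_kernel(u, v, d=3):
    return (np.dot(u, v) + 1.0) ** d                             # eq. (12), unit vectors assumed

def gaussian_rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))    # eq. (13)

def exponential_rbf(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) / (2.0 * sigma ** 2))   # eq. (14)

# Usage sketch: kernel output versus distance for several kernel widths,
# mirroring the behaviour discussed around Figures 6 and 7.
u = np.zeros(2)
for sigma in (0.5, 1.0, 2.0):
    outputs = [gaussian_rbf(u, np.array([d, 0.0]), sigma)
               for d in np.linspace(0.0, 4.0, 5)]
    print(f"sigma={sigma}: {np.round(outputs, 3)}")
```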
Using PLS, KPLS, and SVM in clinical research
While the methods covered in this paper offer statistically significant improvements in diagnostic and prognostic biomedical applications, there has been great difficulty in utilizing advances such as these in clinical research. The statistics used to evaluate the performance of these techniques are not readily converted into direct clinical information that may help in patient care or pharmaceutical research. In order to address this, we have devised a framework to combine these techniques with well accepted and understood traditional biostatistics methods, the Cox Proportional Hazard model and the Kaplan-Meier (K-M) Curve. These two techniques each help address the question of how important a particular parameter is to evaluating risk/survival. The following subsections will give a basic overview of how Cox and K-M can be combined with our techniques. For simplicity, such a combination will only be described with PLS, though it could just as easily be done with KPLS or SVM.
PLS and Kaplan-Meier curves
Developed in the 1950s, the K-M curve is the gold standard in survival analysis [34]. In a normal survival curve, the number of survivors at a particular moment in time is divided by the total number of patients. These points are plotted against time to give a curve which starts at 1 and slowly curves downward until it eventually reaches 0. A K-M curve introduces an additional element: the ability to utilize censored data. Censored data are partial data, where a final survival time is unknown but a minimum survival time is known. This can happen when patients die of unrelated causes, patient data are lost, or patients no longer keep in contact with the researcher. To handle censored data, when the partial survival time is reached, the patients are removed from the number of survivors and from the total number of patients. These removals are marked on a K-M curve with a cross. The best way to utilize a K-M curve is to create different curves for different groups and compare them. For instance, a K-M curve for men and one for women would be far apart if a particular condition were much more fatal in one gender than the other. Using this concept with our techniques, we can use PLS (or KPLS/SVM) to split a data set of patients into good and poor prognosis categories. This can be done by first splitting the training data around some cut-off survival time (survival being the lack of recurrence), such as survival before or after 36 months, and training the system to make predictions on a validation set. K-M curves can then be made for the predicted groups, and if the difference between them is significant, then the system is performing well. A chi-square test is the standard for comparing curves, and a p-value from that test below 0.05 would indicate a statistically significant difference between the two prognosis categories predicted by PLS.
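A sketch of this comparison using the lifelines package (one possible implementation, not the tooling used in the original work) is shown below; the log-rank test is the usual chi-square-based comparison of two K-M curves.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_predicted_groups(time, event, predicted_good):
    """Compare K-M curves of PLS-predicted good vs. poor prognosis groups.

    time           : follow-up time (e.g., months to recurrence or censoring)
    event          : 1 if recurrence observed, 0 if censored
    predicted_good : boolean array, True where the model predicts good prognosis
    """
    good, poor = predicted_good, ~predicted_good

    kmf = KaplanMeierFitter()
    kmf.fit(time[good], event_observed=event[good], label="predicted good prognosis")
    ax = kmf.plot_survival_function()
    kmf.fit(time[poor], event_observed=event[poor], label="predicted poor prognosis")
    kmf.plot_survival_function(ax=ax)

    # Log-rank (chi-square based) comparison of the two curves.
    result = logrank_test(time[good], time[poor],
                          event_observed_A=event[good],
                          event_observed_B=event[poor])
    return result.p_value   # p < 0.05 indicates a significant separation
```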
PLS and Cox Hazard Ratios
Another common survival analysis technique is the Cox Proportional Hazard model [35]. The Cox model is a semi-parametric linear regression model which assumes that the hazard of an observation is proportional to an unknown "baseline" hazard common to all observations. The proportionality to this baseline is modeled as an exponential of a linear function of the covariates. From this model, a single Cox Hazard Ratio (CHR) value is derived which represents the "risk" of an event occurring associated with being in a particular group. The larger the CHR, the greater the risk over time of the event occurring for one group relative to the other. Similar to the K-M curve, PLS can separate patients into two prognosis categories, and the CHR is a measure of the effectiveness of that categorization. A large ratio would indicate that the output of this method is a useful prognostic predictor of patient recurrence.
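A corresponding sketch, again using lifelines as an illustrative tool, fits a Cox model with the PLS-predicted group as the single covariate and reports the hazard ratio associated with membership in the poor-prognosis group.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def hazard_ratio_for_prediction(time, event, predicted_poor):
    """Cox hazard ratio for the PLS-predicted poor-prognosis group.

    time           : follow-up time
    event          : 1 if recurrence observed, 0 if censored
    predicted_poor : 1 where the model predicts poor prognosis, else 0
    """
    df = pd.DataFrame({
        "time": time,
        "event": event,
        "predicted_poor": predicted_poor,
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    # The hazard ratio is the exponential of the fitted coefficient.
    return float(np.exp(cph.params_["predicted_poor"]))
```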