A high performance profile-biomarker diagnosis for mass spectral profiles
- Henry Han^{1, 2}Email author
Affiliated with
DOI: 10.1186/1752-0509-5-S2-S5
© Han; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Advertisement
DOI: 10.1186/1752-0509-5-S2-S5
© Han; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Although mass spectrometry based proteomics demonstrates an exciting promise in complex diseases diagnosis, it remains an important research field rather than an applicable clinical routine for its diagnostic accuracy and data reproducibility. Relatively less investigation has been done yet in attaining high-performance proteomic pattern classification compared with the amount of endeavours in enhancing data reproducibility.
In this study, we present a novel machine learning approach to achieve a clinical level disease diagnosis for mass spectral data. We propose multi-resolution independent component analysis, a novel feature selection algorithm to tackle the large dimensionality of mass spectra, by following our local and global feature selection framework. We also develop high-performance classifiers by embedding multi-resolution independent component analysis in linear discriminant analysis and support vector machines.
Our multi-resolution independent component based support vector machines not only achieve clinical level classification accuracy, but also overcome the weakness in traditional peak-selection based biomarker discovery. In addition to rigorous theoretical analysis, we demonstrate our method’s superiority by comparing it with nine state-of-the-art classification and regression algorithms on six heterogeneous mass spectral profiles.
Our work not only suggests an alternative direction from machine learning to accelerate mass spectral proteomic technologies into a clinical routine by treating an input profile as a ‘profile-biomarker’, but also has positive impacts on large scale ‘omics' data mining. Related source codes and data sets can be found at: https://sites.google.com/site/heyaumbioinformatics/home/proteomics
With recent surges in proteomics, mass spectral proteomic pattern diagnostics has become a highly promising way of diagnosing, predicting, and monitoring cancers or other advanced diseases for its cost-effectiveness and efficiency [1]. Recent studies not only demonstrate that proteomic profiling can detect the anonymous protein peaks differently expressed between cancer patients and healthy subjects, but also show the absence or presence of disease can be discovered through proteomic pattern classification. However, this novel technology remains an important research field rather than a clinical routine because of the unresolved problems in data reproducibility and classification. The data reproducibility issue refers to that no two independent studies have been found to produce same proteomic patterns. On the other hand, the data classification issue refers to that the classification accuracy obtained from mass spectral data is inadequate to attain a clinical level (e.g., 99.5%) in most studies. Although impressive sensitivities and specificities were reported in some case studies, their classification methods have no guarantee to extend to other mass spectral data to maintain a same level performance.
Many methods and protocols are proposed and being developed to enhance mass spectral data reproducibility from biological and technological aspects. They include employing peptide profiling to replace proteomics profiling to get extremely high resolution data, improving experimental designs to avoid mingles between biological and technological variables, and developing more robust preprocessing algorithms [2–5]. However, mass spectral data reproducibility enhancement seems to be facing a built-in challenge from the technology itself [6], i.e., almost any small, even tiny changes in the part of proteome will be amplified to rather large even huge differences in mass spectra, no matter whether the sources of the changes are from biological factors or experimental conditions. The sensitive signal amplification mechanism somewhat limits the potential of these reproducibility enhancement techniques and presents difficulties in achieving reproducible and consistent diagnosis.
On the other hand, rather fewer studies have been invested in improving mass spectral proteomic pattern classification than those of enhancing data reproducibility. To attain high disease diagnostic accuracy, many studies focus on identifying biomarkers from mass spectral profiles, which are generally a small set of protein expression peaks at selected m/z (mass/charge) ratios, through different machine learning approaches (e.g., peak selection), [7, 8]. These studies are definitely important and interesting. However, they bear the following limitations. (1) The biomarker selection processing is generally individual data oriented case study. There is no guarantee to generalize it to other profiles. (2) The biomarkers obtained from these studies by nature are not reproducible because of the irreproducibility of their source data. In other words, the identified mass spectral biomarkers may lose their reusability and predictability, even if they can achieve exceptional sensitivity and specificity in classification. It is highly likely that another totally different set of biomarkers would be identified if the same type of mass spectra were generated from another set of cancer patients and healthy individuals under the same experimental conditions. (3) The sensitivity and specificity levels from the biomarkers’ classification are still inadequate to qualify this young technology as a robust clinical routine.
How could we accelerate mass spectral proteomics to become a clinical routine in complex disease diagnosis while the studies on data reproducibility enhancement are still underway? We address this challenge from a machine-learning viewpoint by developing a high-performance mass spectral pattern recognition algorithm in this study. Although data reproducibility plays a very important role in mass spectral proteomics, the essential factor to determine whether this exciting technology can fully explore its potential, to a large degree, may rely on the levels of sensitivity and specificity from mass spectral pattern classification.
If there exists a novel pattern recognition algorithm able to attain a 99.5% level accuracy in mass spectra classification for an input proteomic profile, then the profile can be viewed as a profile biomarker in disease diagnosis. This is because the high-accuracy diagnostic results would be reproducible for all input profiles by taking advantage of the novel classification technique. Under such a situation, the data reproducibility probably may not be a major concern to prevent reproducible biomarker discovery because the profile biomarker is able to “reproduce itself” by attaining clinical level diagnosis.
The high or even huge dimensionality of mass spectral data presents a challenge for high-performance proteomic pattern classification, especially for most traditional classification algorithms that were developed under the assumption that input data with a small or medium dimensionality. A mass spectral profile can be represented as a p × n matrix after preprocessing, where a row represents the ion-intensities of a set of observations (samples) at a mass charge ratio (m/z), which is similar to a gene in microarray data, and a column represents the ion-intensities of a single sample across a set of m/z ratios. Unlike traditional data (e.g., financial data), the number of variables in a mass spectral profile is much greater than the number of observations, i.e., n>>p. In addition, only a small portion of testing points (m/z ratios) among the thousands of them have meaningful contribution to data variations or demonstrate biological relevance in disease detection. Furthermore, mass spectral data by nature are not-noise free due to the non-linearity in proteomic profiling. Preprocessing techniques are unable to remove some built-in systematic noise completely. The information redundancy, noise, and high-data dimensionalities in mass spectral data not only make some traditional classification methods (e.g., Fisher discriminant analysis) lose discriminative power, but also present an urgent challenge in computational proteomics.
Many feature selection methods are employed to decrease dimensionalities, remove noise, and extract meaningful features before mass spectra classification. These methods can be categorized as input-space feature selection and subspace feature selection. The input-space feature selection reduces the dimensionality of data by selecting a subset of features to conduct a hypothesis testing or create a model under some selection criteria in the same space as input data (e.g., t-test). On the other hand, the subspace feature selection, also called transform-based feature selection, reduces data dimensionality by transforming data into a low-dimensional subspace induced by a linear or nonlinear transformation. The subspace feature selection methods are probably the most used data reduction techniques in proteomics for their popularity and efficiency. They include principal component analysis (PCA) [9], independent component analysis (ICA) [10, 11], nonnegative matrix factorization (NMF) [12], and their different extensions [13, 14]. We mainly focus on the subspace feature selection methods in this study.
These algorithms, however, are generally good at selecting global features rather than local features. The global and local features consist of high frequency and low frequency features (signals) respectively. For example, a testing point (an m/z ratio) with several exceptionally high peaks on cancer samples, which are seldom found at most testing points, can be viewed as a local feature. On the other hand, a testing point whose expression value plot curve is similar to those of other testing points is a global feature. As different frequency signals capturing different data behaviour, the global and local features interpret the global and local behaviour of data, and contribute to the global and local characteristics of data respectively. Since there is no robust screening mechanism available to distinguish the two types of features in most subspace feature selection methods, the global features may demonstrate ‘obvious’ advantages over the local features in the feature selection. That is, the low frequency signals have less likelihood to contribute to the inferred low-dimensional data, which usually are the linear combinations of all input variables, than the high frequency signals. For example, the positive and negative weights in the linear combination to calculate each principal component in PCA are likely to partially cancel each other. However, it causes that the weights representing contributions from local features are more likely to be cancelled out because of their frequencies. As such, unlike the global features, the local features are hard to extract for most subspace feature-selection algorithms. Finally, the low dimensional data inferred from the transform-based feature selection may miss some local data-characteristics described by the local features. In other words, the global features dominate the feature selection and these algorithms demonstrate a global feature selection mechanism.
Although difficult to extract out, the local features are probably the key to attaining a high-performance mass spectral pattern classification for its subtle data behaviour capturing, especially because many mass spectral samples share very similar global characteristics but different local characteristics. For example, it’s easy to distinguish a 10-years old, five-feet girl Jean between a 25-year old six-feet male Mike, because they have different global features. However, it is not easy to distinguish Mike with his twin brother Peter because they share almost same global characteristics: height, weight, hair color, etc. Nevertheless, some careful people can still detect them because Peter has a mole near his mouth but Mike does not, i.e., the mole here works as the local feature to facilitate such detection. For another example, some benign tumor samples may display very similar global characteristics but quite different local characteristics with malignant tumor samples. To attain a high-accuracy diagnosis, it is must to capture the local data characteristics to distinguish these samples sharing the similar global characteristics from each other. It may be particularly important in mass spectral proteomics because some sub-type samples may demonstrate very similar ‘global patterns’ under the same profiling technology.
A major reason for the global feature selection mechanism displayed in these algorithms is that there is no screening technique available to separate two types of features in feature selection. In other words, PCA, ICA, NMF, and their variants all belong to a single-resolution feature selection method, where all features are indistinguishably analyzed in a single-resolution despite the nature of their frequencies. Such an indistinguishable treatment causes the most-often data entries to have a high likelihood to dominate feature selection and the less-often data entries may lose opportunities. In other words, the global features are more likely to be selected than the local features and prevents effective local data-characteristics capturing. As such, the low dimensional data inferred from these methods (e.g., the projection data onto the three principal components in PCA) may probably only demonstrate the global data characteristics. Obviously, the mass spectral samples with similar global characteristics but different local characteristics will not be recognized in the following classification. Moreover, the global feature selection mechanism may bring redundant global features in the following classification because almost only the features that interpreting global characteristics are involved in training the corresponding learning machine (e.g., SVM). The redundant global features will unavoidably decrease the generalization of the learning machine and increase the risk of misclassifications or over-fitting. Finally, the learning machines integrated with the global feature selection algorithms will display instabilities in classifications, i.e., they may perform well on some data but fail badly on the others due to different contributions of the global features to the classification.
To avoid the global feature selection mechanism, it is desirable to distinguish features (e.g., sort) according to their frequencies by building some screening techniques to separate two types of features in the feature selection. In this study, we conduct multi-resolution data analysis via a discrete wavelet transform (DWT) [15] to separate features according to their frequencies. The discrete wavelet transform (DWT) hierarchically organizes data in a multi-resolution way by low and high pass filters. The low (high)-pass filters only pass low (high)-frequency signals but attenuate signals with frequencies higher (lower) than a cutoff frequency. As such, the DWT coefficients at the coarse level capture the global features of the input data and the coefficients at the fine levels capture the local features of the data, i.e., the low frequency and high frequency signals are represented by the coefficients in the coarse and fine resolutions respectively. Obviously, we can overcome the global feature selection mechanism after such a multi-resolution feature separation by selectively extracting local features and filtering redundant global features.
In this study, we present a novel multi-resolution independent component analysis (MICA) algorithm for effective feature selections for mass spectral data. Unlike the traditional feature selection methods, it suppresses redundant global features and extracts local features to capture gross and subtle data characteristics via multi-resolution data analysis. Then, we propose a multi-resolution independent component analysis based support vector machines (MICA-SVM) to achieve a high-performance proteomic pattern classification. In addition to rigorous machine learning analysis, we demonstrate the proposed classifier’s superiority by comparing it with nine state-of-the-art peers on six heterogeneous profiles generated from different profiling technologies and processed by different preprocessing algorithms. The exceptional classification performance (~99.5% average classification ratios) and excellent stability suggest this algorithm a great potential to facilitate mass spectral proteomics into a clinical routine, even if data reproducibility is not guaranteed.
Multi-resolution independent component analysis (MICA) is built from the discrete wavelet transforms (DWT), principal component analysis (PCA), the first loading vector based data reconstruction, inverse discrete wavelet transforms (IDWT) induced meta-data approximation, and independent component analysis (ICA) based subspace spanning. The DWT decomposes input data in a multi-resolution form by using a wavelet and scaling function. Mathematically, it is equivalent to multiplying input data by a set of orthogonal matrices block by block. The coefficients at the coarse and fine levels represent input data’s global and local features respectively. Alternatively, ICA seeks to represent input data as a linear combination of a set of statistically independent components by minimizing their mutual information. Theoretically, it is equivalent to inverting the central limit theorem (CLT) by searching maximally non-normal projections of the original data distribution. More detailed information about DWT, PCA, and ICA can be found in [15, 11].
MICA seeks the low dimensional meta-sample (prototype) for each high-dimensional mass spectral sample in the subspace generated by the statistically independent components from a meta-profile of the input data. As the same dimensional approximation of the original high-dimensional data, the meta-profile keeps the most important global features, drops the redundant global features, and exacts almost all local features of the original data. The meta-profile is computed by conducting an inverse DWT for the updated coefficient matrices, where the coarse level coefficients are selectively suppressed by the first loading vector reconstruction to filter the redundant global features, and the fine level coefficients are kept to extract the local features. It is worth pointing out that the independent components in MICA are calculated by conducting independent component analysis for the meta-profile. Unlike the independent components in the classic ICA that are mainly retrieved from the global features, the independent components calculated by MICA are statistically independent signals that contain contributions from almost all local features and the most important global features. As such, the latter is more representative in revealing the latent data structure than the former. Moreover, MICA brings an automatic de-noising mechanism via its redundant global feature suppressing. Since the coarse level coefficients (e.g., the first level coefficients) in the DWT generally contain “contributions” from noise, suppressing the coarse level coefficients not only filters unnecessary global features, but also removes the noise automatically. The automatic de-noising prevents noise from entering feature selection and the following classifier training, which will contribute to the robust mass spectral pattern classification. The MICA algorithm can be described as following steps.
1. Wavelet transforms. Given a protein expression profile with p samples across n m/z ratios MICA conducts a L-level column-wise DWT for input data to obtain wavelet coefficients, which consist of total L detail coefficient matrices: and an approximation coefficient matrix i.e., where
2. Redundant global feature suppressing and local feature extraction. A level threshold is selected to suppress redundant global features and maintain local features.
a). If
1). conduct principal component analysis for each detail coefficient matrix D _{ j } to obtain its principal component (PC) matrix and corresponding score matrix
2). reconstruct and update the detail coefficient matrix D _{ j } by using the first loading vector u _{1} in the PC matrix as where is a n _{ j } × 1 vector with all entries being ‘1’s.
b). If j > τ keep all detail coefficient matrices intact.
3). Inverse discrete wavelet transforms. Conduct the corresponding inverse discrete wavelet transforms using the updated coefficient matrices to get the meta-profile of i.e.,
4). Independent component analysis. Conduct the classic independent component analysis for X ^{*} to obtain components and the mixing matrix: X ^{*} = AZ ^{,} where
5). Subspace decomposition. The meta-profile X ^{*} is the approximation of X by removing the redundant global features and retaining almost all local features by selecting features on behalf of their frequencies. It is easy to decompose each sample in the subspace spanned by all independent components Each statistically independent component is a basis in the subspace, i.e., where the mixing matrix and In other words, each sample can be represented as where the meta-sample a _{ i } is the i ^{ th } row of the mixing matrix recording the coordinate values of the sample x _{ i } in the subspace. As a low dimensional vector, the meta-sample a _{ i } retains almost all local features and the most important global features of the original high-dimensional sample x _{ i }. Thus, it can be viewed as a data-locality preserved prototype of x _{ i }. It is worthwhile to note that each meta-sample in the subspace is the data locality persevered prototype of its corresponding high-dimensional mass spectral sample.
The MICA-based support vector machine applies the classic support vector machine (SVM) [16] to the meta-samples calculated from MICA to gain classification in a low-dimensional space. Unlike the traditional SVM that builds a maximum margin hyperplane in the original high-dimensional space where n ~ 10^{3} – 10^{4}, MICA-SVM separates biological samples by constructing the maximum margin hyperplane in the spanned subspace where using the meta-samples. If we assume the number of support vectors N _{ s } is much less than the training points l, then, the time complexity of the MICA-SVM is which is much lower than that of the classic SVM: provided the same number of training points and support vectors. We briefly describe the MICA-SVM algorithm for binary classification at first.
The normal of the maximum-margin hyperplane is calculated as and the intercept term b can be calculated as The decision function is used to determine the class type of a testing sample x', where are the corresponding meta-samples of samples computed from MICA respectively, and is a SVM kernel function mapping the meta-samples into a same-dimensional or high-dimensional feature space. In this work, we mainly focus on the linear kernel for its efficiency in proteomic pattern classification. In fact, we have found that a SVM classifier under a standard Gaussian ( ‘rbf’) kernel kernel) inevitably encounters overfitting for mass spectral proteomic data through rigorously theoretical analysis. The details can be found in the additional file 1.
To demonstrate the superiority of our algorithm, we include five publicly available large-scale mass spectral profiles: colorectal (CRC) [7], hepatocellular carcinoma (HCC) [8], ovarian-qaqc, prostate [17], and cirrhotic [8], in our experiments. They are heterogeneous data generated from different profiling technologies and preprocessed by different algorithms. The HCC and cirrhotic datasets are two binary-class datasets separated from a three-class profile consisting of 78 HCC, 72 control, and 51 cirrhotic samples [8].
Five heterogeneous mass spectral profiles
Dataset | #m/z | #Sample | Technology |
---|---|---|---|
Colorectal | 16331 | 48 controls + 64 cancers | MALDI-TOF high resolution |
HCC | 23846 | 72 controls + 78 cancers | MALDI-TOF high resolution |
Ovarian-qaqc | 15000 | 95 controls + 121 cancers | SELDI-TOF high resolution |
Prostate | 15154 | 63 controls + 69 cancers | SELDI-TOF low resolution |
Cirrhotic | 23846 | 72 controls + 51 diseases | MALDI-TOF high resolution |
We compared our algorithm with six state-of-the-art peers in terms of average classification rates, sensitivities, and specificities under k-fold (k=10) and 100-trial of 50% holdout cross validations (HOCV). The classification accuracy in the i ^{ th } classification is the ratio of the correctly classified testing samples over total testing samples: The sensitivity and specificity are defined as the ratios: respectively, where tp (tn) is the number of positive (negative) targets correctly classified, and fp (fn) is the number of negative (positive) targets incorrectly classified respectively. In the 100-trial of 50% holdout cross validation, all samples in each data set are pooled together and randomly divided into half to get training and testing data. Such a partition is repeated 100 times to get 100 sets of training and testing data sets. In the k-fold cross validation, an input dataset is partitioned into k disjoint, equal or approximately equal proportions. One proportion is used for testing and the other k-1 proportions are used for training alternatively in the total k rounds of classifications. These cross validations are able to decrease potential biases in algorithm performance evaluations compared with the pre-specifying training or testing data approach.
The six comparison algorithms can be categorized into two types. The first type consists of the standard support vector machines (SVM) and linear discriminant analysis (LDA), both of which are the state-of-the-art classification methods. The second type consists of four methods embedding subspace feature-selections in SVM and LDA: they are support vector machines with principal component analysis (PCA) / independent component analysis (ICA) / nonnegative matrix factorization (NMF), and linear discriminant analysis (LDA) with principal component analysis. We refer to them as PCA-SVM, ICA-SVM, NMF-SVM, and PCA-LDA respectively. The implementation details of these algorithms can be found in [14].
We employ the ‘db8’ wavelet in MICA to conduct a 12-level discrete wavelet transform for each dataset and select the level threshold as τ=2 for all profiles uniformly. Although not an optimal level threshold for all data, it guarantees automatic de-noising and “fair” algorithm comparisons. Moreover, the meta-samples obtained from MICA at τ=2 can clearly distinguish two types of samples. Although other level threshold selections may be possible, any too ‘coarse’ (e.g.τ=1) or too ‘fine’ (e.g.τ=10) level threshold selection may miss some important global or local features and affect following classifications.
Performance of seven algorithms under the 100 trials of 50% HOCV
Dataset | Ave. classification rate ± std (%) | Ave. sensitivity ± std (%) | Ave. specificity ± std (%) |
---|---|---|---|
Colorectal | |||
mica-svm | 99.05±01.82 | 98.84±03.41 | 99.28±01.82 |
svm | 95.71±02.01 | 95.28±03.67 | 96.19±02.88 |
pca-svm | 95.37±01.98 | 93.60±04.18 | 96.83±02.93 |
ica-svm | 95.57±02.02 | 93.69±04.11 | 97.11±02.81 |
nmf-svm | 92.46±02.97 | 89.91±06.92 | 94.65±04.04 |
lda | 87.39±04.60 | 84.97±07.56 | 89.36±06.11 |
pca-lda | 94.21±02.75 | 93.87±03.64 | 94.51±04.03 |
HCC | |||
mica-svm | 99.07±01.03 | 98.82±01.73 | 99.31±01.62 |
svm | 93.08±02.33 | 93.42±03.55 | 92.95±04.12 |
pca-svm | 89.65±02.86 | 89.09±04.46 | 90.33±04.59 |
ica-svm | 90.15±02.63 | 89.76±04.27 | 90.70±04.35 |
nmf-svm | 89.81±03.17 | 87.68±07.28 | 92.22±05.14 |
lda | 89.48±03.67 | 91.55±04.42 | 87.75±06.88 |
pca-lda | 91.20±02.81 | 90.08±05.18 | 92.38±03.62 |
Ovarian-qaqc | |||
mica-svm | 99.09±01.09 | 98.94±02.15 | 99.25±01.11 |
svm | 97.64±01.36 | 97.42±02.04 | 97.86±02.18 |
pca-svm | 98.63±00.88 | 99.28±01.20 | 98.12±01.58 |
ica-svm | 98.52±00.83 | 99.06±01.30 | 98.10±01.50 |
nmf-svm | 92.47±03.23 | 94.23±03.72 | 91.15±04.89 |
lda | 81.42±04.48 | 87.86±05.17 | 76.26±06.91 |
pca-lda | 98.42±01.04 | 99.30±01.13 | 97.73±01.94 |
Prostate | |||
mica-svm | 99.36±00.99 | 99.09±01.43 | 99.64±01.66 |
svm | 95.91±02.09 | 95.75±03.05 | 96.18±03.99 |
pca-svm | 97.94±01.65 | 98.48±01.70 | 97.43±03.24 |
ica-svm | 98.23±01.61 | 98.36±01.84 | 98.14±02.84 |
nmf-svm | 91.21±04.67 | 94.44±04.70 | 87.46±06.69 |
lda | 89.92±04.77 | 94.43±04.65 | 85.02±10.46 |
pca-lda | 97.50±02.20 | 97.90±02.50 | 97.10±03.25 |
Cirrhotic | |||
mica-svm | 99.52±00.85 | 99.44±01.65 | 99.62±00.95 |
svm | 95.10±03.17 | 92.97±05.91 | 96.71±03.10 |
pca-svm | 91.52±03.76 | 88.00±08.39 | 94.15±03.86 |
ica-svm | 92.07±03.41 | 88.47±07.92 | 94.72±03.48 |
nmf-svm | 88.03±03.10 | 80.43±07.84 | 93.42±03.62 |
lda | 86.66±06.11 | 86.57±10.30 | 86.93±08.19 |
pca-lda | 92.39±03.62 | 89.04±07.45 | 94.83±03.48 |
Five classifier performance under the 10-fold CV
Dataset | Ave. classification rate ± std (%) | Ave. sensitivity ± std (%) | Ave. specificity ± std (%) |
---|---|---|---|
Colorectal | |||
mica-svm | 100.0 ± 00.00 | 100.0 ± 00.00 | 100.0 ± 00.00 |
pca-lda | 93.71 ± 07.46 | 93.50 ± 10.55 | 93.57 ± 11.45 |
svm | 96.27 ± 06.45 | 96.00 ± 08.43 | 96.67 ± 07.03 |
pca-svm | 95.45 ± 06.43 | 94.00 ± 09.66 | 96.90 ± 06.55 |
ica-svm | 96.35 ± 04.73 | 96.00 ± 08.43 | 96.67 ± 07.03 |
HCC | |||
mica-svm | 99.33 ± 02.11 | 98.57 ± 04.52 | 100.0 ± 00.00 |
pca-lda | 91.33 ± 05.49 | 90.36 ± 06.69 | 92.32 ± 08.87 |
svm | 93.99 ± 06.55 | 94.64 ± 09.11 | 93.57 ± 09.00 |
pca-svm | 90.16 ± 06.20 | 91.61 ± 09.89 | 88.75 ± 10.94 |
ica-svm | 92.79 ± 07.10 | 91.79 ± 11.42 | 93.75 ± 08.84 |
Ovarian-qaqc | |||
mica-svm | 99.52 ± 01.51 | 100.0 ± 00.00 | 99.17 ± 02.64 |
pca-lda | 99.07 ± 01.96 | 100.0 ± 00.00 | 98.33 ± 03.51 |
svm | 97.68 ± 03.25 | 96.78 ± 05.20 | 98.40 ± 03.38 |
pca-svm | 98.61 ± 02.23 | 99.00 ± 03.16 | 98.33 ± 03.51 |
ica-svm | 99.09 ± 01.92 | 99.00 ± 03.16 | 99.17 ± 02.64 |
Prostate | |||
mica-svm | 100.0 ± 00.00 | 100.0 ± 00.00 | 100.0 ± 00.00 |
pca-lda | 98.52 ± 03.13 | 98.57 ± 04.52 | 98.33 ± 05.27 |
svm | 96.98 ± 03.90 | 94.29 ± 07.38 | 100.0 ± 00.00 |
pca-svm | 99.23 ± 02.43 | 100.0 ± 00.00 | 98.33 ± 05.27 |
ica-svm | 98.45 ± 03.27 | 98.57 ± 04.52 | 98.33 ± 05.27 |
Cirrhotic | |||
mica-svm | 100.0 ± 00.00 | 100.0 ± 00.00 | 100.0 ± 00.00 |
pca-lda | 96.73 ± 05.77 | 94.00 ± 09.66 | 98.57 ± 04.52 |
svm | 96.79 ± 04.14 | 96.00 ± 08.43 | 97.14 ± 06.02 |
pca-svm | 95.13 ± 05.75 | 90.33 ± 13.92 | 98.57 ± 04.52 |
ica-svm | 96.67 ± 05.83 | 94.00 ± 13.50 | 98.57 ± 04.52 |
However, our algorithm achieved the average 99.05% classification rate (sensitivity: 98.84% and specificity: 99.28%) under 100 trials of 50% HOCV where much less priori knowledge are available in classification than the LOOCV and 5-fold cross validation. In addition, under the 10-fold cross validation, the proposed algorithm achieves 99.33% and 99.52% predication ratios on the HCC and ovarian-qaqc data respectively. More impressively, it attains 100% classification ratios on the colorectal, prostate, and cirrhotic data. Unlike the other methods displaying instabilities in classifications, our algorithm demonstrates strong stability in attaining high-accuracy pattern detections for all the five profiles. This observation is also supported by its lower standard deviations of the three classification measures of MICA-SVM than those of the others.
We also have found that there are almost no statistically significant differences between SVM and its subspace feature selection based extensions (e.g., PCA-SVM), which achieve same level or slightly lower performance than the standard SVM. The reason seems to be rooted in the global feature selection mechanisms of the PCA, ICA, and NMF methods. As we pointed out before, since some mass spectral samples may display very similar global-characteristics but different local-characteristics, a SVM classifier integrated with a global feature selection method may inevitably encounter difficulty in distinguishing these samples. Although extracted by different transformation methods, the global features seem to have nearly same level contributions to proteomic data classification statistically. Moreover, the redundant global features brought by the global feature selection mechanism may get involved in the SVM learning, which would limit all the SVM-related classifiers’ generalization and cause instability in classification. This point can be also observed through their relatively high standard deviations of the classification rates, sensitivities and specificities. For example, the standard deviations of the three measures from the PCA-SVM classifier are 3.76%, 8.39%, and 3.86% respectively, which are much higher than those from the MICA-SVM classifier (0.85%, 1.65%, and 0.95%) on the cirrhotic profile. Similar observations can also be found for the other data sets.
The MICA-based support vector machines can be also extended to handle the multi-class classification, which has not been seriously addressed in mass spectral proteomics. However, it can be more practical in cancer diagnosis because detecting different pathologic states of cancers is essential in early cancer discovery. We ‘merge’ the HCC and cirrhotic data into a three-class profile to seek high-accuracy detection between healthy individuals (controls) and patients with hepatocellular carcinoma (HCC) and cirrhosis, where cirrhosis can be viewed as an early HCC stage to some degree because chronic hepatitis C causes HCC via the stage of cirrhosis.
We employ the ‘one-against-one’ method in our MICA-based multi-class SVM classification for its proved advantage over the ‘one-against-all’ and ‘directed acyclic SVM’ methods [18]. The ‘one-against-one’ method builds k(k-1)/2 binary SVM classifiers for a data set with k classes: {1,2,…k}. Each classifier is trained on data from two classes, i.e., training samples are from the ith and jth classes where i,j=1,2,..k. We describe our MICA-based ‘one-against-one’ SVM as follows.
where a _{ t } is the meta-sample calculated for the training sample x _{ i }. After building all k(k-1)/2 classifiers, we first determine if a testing sample x' is from class the ith or jth class by a local decision function where a' is the meta-sample of x'. Then, we use the ‘Max-wins’ voting approach to infer its final class type: if the local decision function says x' is in the ith class, then the ith class wins one vote; Otherwise, the jth class wins one vote. Finally, x' will belong to the class with the largest vote.
We also implemented the ‘one-against-one’ method in SVM, PCA-SVM and ICA-SVM multi-class classification for a fair comparison. It was interesting to find that the four classifiers: PCA-LDA, SVM, PCA-SVM, and ICA-SVM had equivalent performance under the two types of cross validations for this trinary data. Just as before, the LDA and NMF-SVM algorithms had lower level performance than those of the four algorithms. However, the MICA-SVM algorithm achieved average classification ratios: 97.37% and 98.52% respectively under the 100 trials of 50% HOCV and 10-fold CV, which were much higher than the corresponding average 83.79% and 86.61% level classification ratios attained by the four peers under the same cross validations.
MICA-LDA performance on six mass spectral data sets
Cross validation | Ave. classification rate ± std (%) | Ave. sensitivity ± std (%) | Ave. specificity ± std (%) |
---|---|---|---|
50%HOCV | |||
Colorectal | 96.21±02.07 | 96.20±03.70 | 96.29±02.53 |
HCC | 98.19±01.67 | 100.0±00.00 | 96.50±03.21 |
Ovarian-qaqc | 90.62±02.73 | 91.93±04.67 | 89.58±04.05 |
Prostate | 93.03±03.05 | 96.15±03.24 | 89.42±05.24 |
Cirrhotic | 98.93±01.37 | 97.52±03.05 | 99.97±00.28 |
Three-class | 96.84±01.32 | 98.69±01.49 | 96.21±02.10 |
10-fold CV | |||
Colorectal | 96.26 ± 06.77 | 95.00 ± 15.81 | 96.90 ± 06.55 |
HCC | 97.38 ± 03.39 | 100.0 ± 00.00 | 95.00 ± 06.45 |
Ovarian-qaqc | 88.46 ± 07.16 | 91.78 ± 13.41 | 86.03 ± 06.54 |
Prostate | 90.87 ± 07.86 | 92.86 ± 10.10 | 88.57 ± 13.65 |
Cirrhotic | 99.17 ± 02.64 | 98.00 ± 06.32 | 100.0 ± 00.00 |
Three-class | 97.33 ± 03.44 | 96.25 ± 06.04 | 98.57 ± 04.52 |
Performance of MICA-SVM, PLS-REG, NR-LLD, and RPLS-LLD
Algorithms | MICA-SVM | PLS-REG | NR-LLD | RPLS-LLD |
---|---|---|---|---|
Data | Average classification rates under the 100 trials of 50% HOCV (%) | |||
Colorectal | 99.05±01.82 | 96.64±02.19 | 97.02±01.78 | 96.23±02.32 |
HCC | 99.07±01.03 | 94.60±02.12 | 91.09±02.65 | 94.40±02.19 |
Ovarian-qaqc | 99.09±01.09 | 95.44±02.00 | 98.18±01.52 | 96.68±02.00 |
Prostate | 99.36±00.99 | 98.32±01.43 | 96.74±01.97 | 98.32±01.47 |
Cirrhotic | 99.52±00.85 | 91.36±03.44 | 89.82±03.50 | 92.84±02.36 |
Three-class | 97.37±01.20 | 68.51±04.82 | 84.04±03.66 | 85.23±03.31 |
Data | Average classification rates under10-fold CV (%) | |||
Colorectal | 100.0±00.00 | 94.53±08.80 | 98.09±04.03 | 97.18±06.25 |
HCC | 99.33±02.11 | 92.70±07.26 | 93.32±07.26 | 93.37±06.29 |
Ovarian-qaqc | 99.52±01.51 | 98.64±04.31 | 98.18±03.18 | 98.16±02.38 |
Prostate | 100.0±00.00 | 99.32±02.43 | 96.26±05.19 | 99.29±02.26 |
Cirrhotic | 100.0±00.00 | 95.06±07.00 | 89.36±09.60 | 94.36±06.62 |
Three-class | 98.52±03.35 | 77.11±08.23 | 85.99±08.17 | 88.64±06.83 |
The instabilities of current classification methodologies are widely found in mass spectral proteomics. In fact, almost all of these classification methods were proposed through analyzing an individual dataset [1–3, 5, 7, 8]. They may work efficiently on the individual data but lack stability when applied to other heterogeneous data generated from different profiling technologies or processed by different preprocessing methods. In fact, such instabilities not only present difficulties in reproducible biomarker discovery, but also hamper exploring the clinical potentials of this technology. Although algorithmic stability analysis is essential in computational proteomics, there is even no ad-hoc investigation on this topic. To evaluate the algorithmic stabilities of mass spectral proteomic data classification algorithms, we present an algorithmic stability analysis by introducing two scale-free measures: algorithm stability index and relative stability. The algorithm stability index measures the stability of an algorithm across a number of datasets. A high algorithm index value indicates better stability of an algorithm. Alternatively, the relative stability measures the stabilities of a set of classification algorithms with respect to a specific algorithm, which is selected as the MICA-SVM algorithm in this study. A small relative stability indicates an algorithm with a relatively close performance to that of the MICA-SVM algorithm.
Given a classification algorithm running on M heterogeneous profiles under a cross validation, the algorithm stability index δ _{ a } and the relative stability δ _{ r } are defined as, where μ _{ i }, s _{ i } are the average classification rate and the corresponding standard deviation of the algorithm on the i ^{ th } profile respectively, and the parameter is the average classification ratio of the MICA-SVM algorithm on the i ^{ th } profile.
A remaining question is how to determine the optimal level threshold in MICA so that the following SVM classifier achieves best performance. It is reasonable to believe an optimal level threshold will contribute to capturing important local and global features of the original data in the meta-samples. We here employ a log-condition number of the mixing matrix A to estimate the status of global and local feature capturing, where λ _{max} and λ _{min} are the maximum and minimum singular values of the mixing matrix. A large log-condition number indicates the better global and local feature capturing. The level-threshold is counted ‘optimal’ if the log-condition number of the mixing matrix is the largest. If log-condition numbers from two level thresholds are same numerically, the lower level threshold (which is required to be > 1) is counted as the optimal one. For instance, the largest and 2^{nd} largest α values are achieved at τ=1 and τ=7 respectively on the ovarian-qaqc data. However, our algorithm achieved the best average classification performance at τ=7, where the average classification rate, sensitivity and specificity are 99.74%, 99.73% and 99.76% respectively (The average classification rate is 95.28% at τ=1).
In this study, we present a multi-resolution feature selection algorithm: multi-resolution independent component analysis (MICA) for effective feature selection for mass spectral data, propose a high-performance classification algorithm for heterogeneous proteomic profiles, and demonstrate its superiority by comparing it with nine peers. Our approach seeks reproducible high-accuracy diagnosis by treating an input profile a whole biomarker from a machine-learning viewpoint. It shows a great potential to facilitate mass spectral proteomics technology into a clinical routine, even if the data reproducibility is not guaranteed. It is worthwhile to note that independent component analysis is a necessary step to achieve good classification performance. We have found that a similar multi-resolution principal component analysis based SVM algorithm is not able to reach a comparable performance as our algorithm because of the loss of statistical independence in the feature selection. Although our methodology can achieve the clinical-level disease diagnosis for mass spectra even if the data reproducibility is not guaranteed, we do not intend to de-emphasize the importance in enhance mass spectral proteomic profile reproducibility because of its potential in identifying reproducible biomarkers. In fact, previous studies [21] pointed out that data reproducibility may affect data analysis and bring biases. For example, hierarchical clustering may bring different results for mass spectra acquired in day one and the same data a month later. However, it is also reasonable to expect the proposed algorithm’s exceptional performance on the mass spectral data with robust reproducibility for its generality on heterogeneous data.
Our study suggests a new direction to accelerate mass spectral proteomic technologies into a clinical routine. The novel concepts of global and local feature selection, multi-resolution data analysis based redundant global feature suppressing, and effective local feature extraction techniques proposed in this study will also have positive impacts on large scale ‘omics’ data mining. The exceptional discriminative power demonstrated by MICA-based classifiers in multi-class proteomic data classification also contributes to early stage cancer diagnosis. It is interesting to find the MICA-based methods can be also applied to achieve exceptional gene expression pattern classification and meaningful biomarker discovery [22]. In the following work, in addition to further polishing our algorithm by comparing them with other state-of-the-art methodologies or data analysis tools [23], we are interested in investigating the multi-resolution independent component analysis based unsupervised or semi-supervised learning algorithms in proteomic pattern discovery by integrating the multi-resolution feature selection with the state-of-the-art clustering or semi-supervised learning algorithms, and generalize corresponding methods to the related topics such as gene subnetwork identification [24], and biomedical text classification in our future work.
The author wants to thank the anonymous reviewers for their valuable comments in improving this manuscript.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S2.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.