A high performance profile-biomarker diagnosis for mass spectral profiles

Background Although mass spectrometry based proteomics shows exciting promise for the diagnosis of complex diseases, it remains an active research field rather than an applicable clinical routine because of concerns about its diagnostic accuracy and data reproducibility. Compared with the effort devoted to enhancing data reproducibility, relatively little investigation has addressed high-performance proteomic pattern classification. Methods In this study, we present a novel machine learning approach to achieve clinical-level disease diagnosis from mass spectral data. We propose multi-resolution independent component analysis, a novel feature selection algorithm that tackles the large dimensionality of mass spectra by following our local and global feature selection framework. We also develop high-performance classifiers by embedding multi-resolution independent component analysis in linear discriminant analysis and support vector machines. Results Our multi-resolution independent component analysis based support vector machines not only achieve clinical-level classification accuracy but also overcome the weakness of traditional peak-selection based biomarker discovery. In addition to rigorous theoretical analysis, we demonstrate our method's superiority by comparing it with nine state-of-the-art classification and regression algorithms on six heterogeneous mass spectral profiles. Conclusions Our work not only suggests an alternative direction from machine learning to accelerate mass spectral proteomic technologies into a clinical routine by treating an input profile as a 'profile-biomarker', but also has positive impacts on large-scale 'omics' data mining. Related source code and data sets can be found at: https://sites.google.com/site/heyaumbioinformatics/home/proteomics

where x, y represent two samples) as the SVM, PCA-SVM, and ICA-SVM algorithms. Interestingly, the over-fitting has its own special characteristics: the SVM-based algorithms can only recognize the majority type of the training data in each classification trial because their corresponding kernel matrices are identity or near-identity matrices. Since these algorithms have almost identical performance under the 'rbf' kernel, without loss of generality, we use the SVM algorithm to characterize the over-fitting and prove that it is actually inevitable for these learning machines on mass spectral data.
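The degeneracy of the 'rbf' Gram matrix on high-dimensional, high-magnitude data can be reproduced with a few lines of NumPy; the data below are a synthetic stand-in for mass spectral profiles, not the profiles studied in this paper:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """RBF kernel Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

# Synthetic stand-in for a mass spectral profile: 20 samples, 5000 m/z bins,
# large intensity values (hypothetical data, not the paper's profiles).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 5000))

K = rbf_gram(X)
off_diag_sum = K.sum() - np.trace(K)
print(off_diag_sum)                 # ~0: every non-diagonal entry underflows
print(np.allclose(K, np.eye(20)))   # the Gram matrix is (near) identity
```

Because the squared pairwise distances are on the order of millions here, every off-diagonal entry underflows to zero, which mirrors the identity kernel matrices described above.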
We show that each SVM kernel matrix under the 'rbf' kernel is identity or near identity. For convenience, we treat all samples in each mass spectral profile as training samples, which covers all possible training entries for an SVM learning machine under any cross validation. Sub-figure 1 in the following Figure plots the pairwise distances ||x_i − x_j|| (i ≠ j). By the definition of the 'rbf' kernel, it is easy to find that any non-diagonal kernel entry is less than e^{−2^9} for the colorectal, HCC, cirrhotic and prostate data, so their kernel matrices are identity matrices. This point is also supported by the sums of all non-diagonal entries in their kernel matrices. Although the kernel matrix of the ovarian-qaqc data is not an identity matrix, the sum of its non-diagonal kernel entries is still small. Sub-figure 3 visualizes 78 non-diagonal items in the lower triangle of its kernel matrix. Like an identity matrix, the kernel matrix has all eigenvalues equal to 1 (sub-figure 4). Obviously, these identity or near-identity kernel matrices can only represent the concept of identity and cannot generalize to new data in SVM learning. The following theorem states that an SVM can only recognize the majority type of the training data, no matter the type of a test sample, if its kernel matrix is the identity or near identity. A majority (non-majority) type simply refers to the type with more (fewer) counts in a binary-class profile. For example, 'cancer' is the majority type in the colorectal data because it accounts for 69 of all 112 samples. The majority type ratio is the number of majority type samples over the total number of samples (e.g., 69/112).

Theorem. Let a training dataset X = {x_1, x_2, ..., x_{N+n}}, consisting of N majority type samples and n non-majority type samples (N > n), be input to a standard SVM with a kernel function k(x, y) whose kernel matrix is the identity or near identity. The training sample labels are specified as y_i ∈ {−1, 1}, where '−1' and '1' represent the 'cancer' and 'control' classes respectively. Then for a testing sample x' ∈ D − X, its class type is determined as the majority type of the training data; in particular, if x' belongs to the non-majority type, the classification rate will be zero.
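The theorem's behavior can be checked empirically with scikit-learn's SVC; the sample counts and label coding below are hypothetical, chosen only so that the training data has a clear majority type:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training set: 7 majority ('cancer', label -1) and 3 non-majority
# ('control', label 1) high-dimensional samples whose intensities are large
# enough that the 'rbf' Gram matrix degenerates to the identity.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 100, size=(10, 5000))
y_train = np.array([-1] * 7 + [1] * 3)

clf = SVC(kernel='rbf', gamma=1.0).fit(X_train, y_train)

# Test samples are also far from every training sample, so the decision
# function reduces to its bias term and every prediction is the majority type.
X_test = rng.uniform(0, 100, size=(5, 5000))
print(clf.predict(X_test))
```

Since all off-diagonal kernel values underflow to zero, the dual solution forces the bias toward the majority class, so any non-majority test sample is misclassified, as the theorem states.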
Since an SVM learning machine can only recognize the majority type of the training data under the 'rbf' kernel according to the theorem, it is interesting to see that the classification ratios for the five profiles under the 10-fold CV are exactly their majority type ratios: 64/112 (colorectal), 78/150 (HCC), 121/216 (ovarian-qaqc), 69/132 (prostate), 78/123 (cirrhotic), and their corresponding specificities or sensitivities are 100% or 0% respectively. The average classification rate for each profile under the 100 trials of 50% HOCV is only approximately its majority type ratio, because the non-majority type may become the 'local majority type' in some training sets due to sampling; even so, the corresponding average sensitivity and specificity remain complementary to each other. For example, the average classification rate for the colorectal data is 56.21% < 57.14% (64/112) with sensitivity 2% and specificity 98%, which means the non-majority type ('control') was counted as the 'local majority type' in 2 of the 100 classification trials. If the number of majority type samples is close to the number of non-majority type samples, then the likelihood that the non-majority type is counted as the 'local majority type' in a training set will be high, and the average classification rate will correspondingly deviate from the majority type ratio. For example, the prostate data has 69 majority type ('cancer') samples and 63 non-majority type ('control') samples. The average classification rate for this data is 47.86%, which is far from the majority type ratio 52.27% (69/132), with sensitivity 67% and specificity 33%, i.e., the non-majority type was counted as the 'local majority type' in 33 of the 100 classification trials.
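The arithmetic behind these majority type ratios and the complementary sensitivity/specificity can be verified with a short sketch of a degenerate 'majority-only' classifier, using the prostate label counts (69 'cancer', 63 'control') quoted above; the label coding follows the theorem's convention:

```python
import numpy as np

def majority_only_metrics(y_true, majority_label):
    """Metrics for a degenerate classifier that outputs the majority label for every sample."""
    y_pred = np.full_like(y_true, majority_label)
    acc = np.mean(y_pred == y_true)
    sens = np.mean(y_pred[y_true == -1] == -1)  # 'cancer' coded as -1
    spec = np.mean(y_pred[y_true == 1] == 1)    # 'control' coded as 1
    return acc, sens, spec

# Prostate-like label counts from the text: 69 'cancer' (-1) and 63 'control' (1).
y = np.array([-1] * 69 + [1] * 63)
acc, sens, spec = majority_only_metrics(y, majority_label=-1)
print(acc)          # 69/132, i.e. the majority type ratio
print(sens, spec)   # 1.0 and 0.0: sensitivity and specificity are complementary
```

The accuracy of such a classifier is exactly the majority type ratio, and one of sensitivity or specificity is 100% while the other is 0%, matching the 10-fold CV figures above.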
The biological reason for the over-fitting problem associated with SVM-based classification is the sensitive signal-amplification mechanism of mass spectral proteomics technologies, where any subtle change in part of the proteome is amplified into large differences in mass spectral expression. The exponential transform in the 'rbf' kernel maps the already-amplified expression differences between two biological samples to zero or near-zero kernel values in the feature space, i.e., the corresponding kernel matrices are identity or near-identity matrices. In other words, the SVM learning machine inevitably loses its detection capabilities. Interestingly, it seems that the over-fitting problem can be overcome by setting a 'fine' threshold (e.g., τ=10) so that only the least frequent signals are captured in MICA. For example, the MICA-SVM algorithm with the 'bior4.4' wavelet under the 'rbf' kernel can achieve average classification rates of 90.64% (sensitivity: 78.78%, specificity: 100%) for the colorectal data, 96.11% (sensitivity: 91.84%, specificity: 99.59%) for the ovarian-qaqc data, and 94.38% (sensitivity: 90.62%, specificity: 98.88%) for the prostate data under the 100 trials of 50% holdout cross-validation. However, our simulations suggest that proteomic pattern classification is a linearly separable problem and that the 'linear' kernel is the optimal kernel selection in the SVM-based algorithms.
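The multi-resolution idea can be sketched roughly as follows, assuming PyWavelets and scikit-learn are available; this is not the authors' exact MICA algorithm, and the profile dimensions, decomposition level, and component count are all hypothetical:

```python
import numpy as np
import pywt
from sklearn.decomposition import FastICA

# A rough sketch of the multi-resolution idea: decompose each spectrum with the
# 'bior4.4' wavelet, keep a single detail subband, and extract independent
# components from that subband as low-dimensional features for a classifier.
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(30, 4096))  # hypothetical profiles: 30 samples x 4096 m/z bins

level = 3
# wavedec returns [cA_level, cD_level, ..., cD_1]; [-1] is the finest detail subband.
fine_details = np.array([pywt.wavedec(x, 'bior4.4', level=level)[-1] for x in X])

ica = FastICA(n_components=5, random_state=0)
features = ica.fit_transform(fine_details)  # independent-component features
print(features.shape)
```

The resulting feature matrix could then be fed to an SVM in place of the raw spectra, which is the role the MICA features play in the MICA-SVM results reported above.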