SVM classifier to predict genes important for self-renewal and pluripotency of mouse embryonic stem cells

Background Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of a developing blastocyst and can be cultured indefinitely in-vitro. Their distinct features are their ability to self-renew and to differentiate to all adult cell types. Genes that maintain mESCs self-renewal and pluripotency identity are of interest to stem cell biologists. Although significant steps have been made toward the identification and characterization of such genes, the list is still incomplete and controversial. For example, the overlap among candidate self-renewal and pluripotency genes across different RNAi screens is surprisingly small. Meanwhile, machine learning approaches have been used to analyze multi-dimensional experimental data and integrate results from many studies, yet they have not been applied to specifically tackle the task of predicting and classifying self-renewal and pluripotency gene membership. Results For this study we developed a classifier, a supervised machine learning framework for predicting self-renewal and pluripotency mESCs stemness membership genes (MSMG) using support vector machines (SVM). The data used to train the classifier was derived from mESCs-related studies using mRNA microarrays, measuring gene expression in various stages of early differentiation, as well as ChIP-seq studies applied to mESCs profiling genome-wide binding of key transcription factors, such as Nanog, Oct4, and Sox2, to the regulatory regions of other genes. Comparison to other classification methods using the leave-one-out cross-validation method was employed to evaluate the accuracy and generality of the classification. Finally, two sets of candidate genes from genome-wide RNA interference screens are used to test the generality and potential application of the classifier. Conclusions Our results reveal that an SVM approach can be useful for prioritizing genes for functional validation experiments and complement the analyses of high-throughput profiling experimental data in stem cell research.


Background
Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of a developing blastocyst and can be cultured indefinitely in-vitro. Their distinct features are their ability to self-renewal as well as to differentiate into all adult cell types including the germ-line. These features render mESCs ideal for applications in basic scientific research and translational medicine. To harness their full potential, better understanding of the molecular mechanisms of mESCs self-renewal maintenance and pluripotency is critical. Therefore, genes that are critical to mESCs self-renewal maintenance are of interest to the stem cell research field. In the past decade, significant steps have been made toward identifying and characterizing the genes and regulatory networks that compose the self-renewal machinery. A mESCs stemness membership gene (MSMG) signature has been proposed through application of high-throughput profiling approaches such as mRNA expression microarrays combined with advanced computational analyses as well as through low-throughput detailed functional studies [1][2][3]. Genes that are predominantly expressed in mESCs cells are considered putative candidates for being MSMGs. Nevertheless, the overlap among candidate MSMGs across different studies is surprisingly small, whereas the full identification of MSMGs, the genes responsible for self-renewal and pluripotency, remains largely incomplete.
Fuelled by the growing volume, diversity and complexity of genome-wide profiling data generated from highthroughput biotechnologies, advanced computational approaches such as machine learning have been used to analyze multi-dimensional experimental data and integrate results from many studies [4][5][6][7][8][9][10]. Support Vector Machines (SVM) is a popular supervised machine learning method that is based on statistical learning theory [11]. SVM has been widely applied as a classification tool to address biological questions such as gene function prediction [4], protein homolog identification [5], and disease diagnosis [6]. For example, previous studies used SVM and gene expression data for gene function classification [7] and cancer tissue sample classification [8]. Such studies used a single type of experimental data to conduct the analyses. Recently, Zhu et al. developed a network-based SVM approach where they combined prior knowledge with microarray data to improve the predictive performance for cancer tissue diagnostics [9]. In another study, SVM-based predictions were applied to infer gene function by concatenating normalized features from diverse datasets [10]. Hence, there is a trend of combining heterogeneous data-types to improve classification where the SVM approach is the computational method of choice. Here we attempted to use this approach to tackle the task of predicting MSMGs utilizing two types of high-throughput data by combining several independent studies.
We hypothesized that we can utilize data from mESCsrelated mRNA microarrays profiling and genome-wide transcription factor binding profiling (ChIP-seq) applied to characterize mESCs to classify genes important for ES cell self-renewal and pluripotency (MSMGs). We believe that within these datasets there are subtle patterns from which a gene's functional characteristic, in regards to the self-renewal and pluripotency involvement, Yes or No question, can be inferred. We employed an SVM-based approach to construct a classifier that can be used to predict the class membership as being MSMG or not-MSMG for genes by combining genome-wide mRNA expression profiling data and ChIP-seq data. The accuracy and generality of the classifier are evaluated using the leave-one-out-cross-validation (LOOCV) approach. We also compared the SVM classifier with other machine learning classification methods, including linear discriminant classifier, decision trees, and artificial neural networks. Furthermore, we tested the ability of the SVM classifiers to predict the class membership of positive and negative lists of genes resulting from two genome-wide RNAi screen studies to demonstrate how such classification approach can be useful for helping in prioritizing hits from such screens.

Learning from heterogeneous data types
We extracted 91 features/attributes from mRNA gene expression and ChIP-seq experiments for each gene (vector) from mESCs-related studies (detailed description is provided in the Methods section). 79 features/ attributes were created from mRNA expression microarray profiling data extracted from the Gene Expression Omnibus (GEO) database [12] references to the files are provided in the methods and Additional files 1, 2 and 3. In addition to the 79 features/attributes created from mRNA expression microarray data, we produced 12 features/attributes from ChIP-seq studies [13]. All 12 ChIP-seq experiments we used profile the global genome-wide binding of transcription factors known to be important for maintaining self-renewal and pluripotency [14]. We implemented two types of preprocessing approaches for generating features/attributes from the ChIP-seq datasets: With the first approach, we converted the results from the ChIP-seq experiments into Boolean values where zero represents absence and one represents presence of binding sites in proximity to a gene detected as a peak in a ChIP-seq experiment. The second approach for creating features from the ChIPseq data was to compute a continuous binding value calculated as a weighted sum of intensities of all of the peaks of the transcription factor weighted by the distance between the peak and the transcription start site (TSS) [15]: Where a ij is the binding value of the transcription factor j on gene i, g k is the intensity of the k th binding peak of transcription factor j, d k is the distance between the TSS of gene i and the k th binding peak, and d 0 is a constant. a ij is then log-transformed and quantile-normalized. This method was previously introduced by Ouyang et al. [15]. Altogether, the features created from the ChIP-seq datasets are either 12 binary-valued vectors or 12 continuous-valued vectors consequently named ChIP-binary and ChIP-continuous in the charts and tables.
In order to train any supervised machine-learning classifier it is required to have a gold standard training set of classified examples. In our case, these are genes that are known to be either MSMG or not-MSMG. For this we obtained an expert collection set of genes labeled as MSMG or not-MSMG (Table 1). This classification of genes/proteins was mainly followed from a study that designed a customize microarray for mESCs [16]. In addition, we used manual expert curation process which included the construction of a literature-based self-renewal regulatory network in mESCs from low throughput studies [17]. In all, we obtained 46 genes as positive examples, classified as MSMG, and 70 genes as negative examples ( Table 1). The training sample for positive genes is relatively small since we discarded controversial candidates.
In the first set of computational experiments we tested different versions of SVM classifiers and combinations of training data types to determine which kernel function and which data type or combination of data-types performs best. Table 2 summarizes such evaluations using the LOOCV. In general, the performance of SVM on combined data types, microarray data and ChIP-seq data, appears to perform better than SVM trained on an individual data type. Additionally, the LOOCV results show that the SVM classifier with the radial basis function (RBF) kernel function appears to perform slightly better than classifiers with linear or polynomial kernel functions. The RBF function kernel used is a Gaussian radial basis function with a gamma variable that ranges between 0.1-10 as determined through the outer loop of the LOOCV selected based on the highest accuracy. In addition to the LOOCV, we also employed a three-fold cross-validation in order to plot a receiver operating characteristic (ROC) curve to compute an area under the curve (AUC) score ( Figure 1). An alternative approach to using different individual SVM kernel functions with all the features/attributes is to combine two or more SVM kernel functions for optimizing performance. In a prior similar study it was shown that using different kernels on heterogeneous datasets works better for gene function classification [5]. We did not observe much advantage of implementing weighted combinations of kernels applied to each data type separately. The reason may be the different data types used, ChIP-seq and mRNA expression microarrays data in our study versus phylogenetic and mRNA expression microarrays data in the other study. ChIPseq and mRNA expression microarrays data is intuitively more correlated [15].
Next we asked which features/attributes/studies contribute the most for successful classification of MSMG genes. For this purpose we implemented a feature selection and ranking algorithm. We applied the SVM Recursive Feature Elimination (RFE) algorithm [18] to rank all features for evaluating their discriminatory capabilities. The top 20 discriminatory features from the RBF-SVM and Poly-SVM classifiers are listed in Additional file 4  Comparison of performance of several kernel functions used for SVM learning applied on single and heterogeneous data types (mRNA expression and ChIPseq). The best performer for each category is bold-highlighted. Kernel functions include: linear kernel, polynomial kernel (poly) and Gaussian radial basis kernel (RBF) (see methods). Datasets include: micro-mRNA expression microarrays; chip_binary-ChIP-seq data with pre-processing into binary feature values; chip_contin-ChIP-seq data with pre-processing into continuous feature values. Performance of two data integration strategies: "weight"weighted kernel matrices; "simple"one kernel matrix by concatenation of the two data types (see methods). As an example, "simple_binary_poly" means the approach of concatenating microarray and binary ChIP-seq data and training using an SVM with a polynomial kernel function.
In summary, we show that the SVM-based classification can be successfully applied for discriminating between MSMG and non-MSMG, whereas combining heterogeneous data types improves learning.

SVM outperforms other classification methods
In the second set of computational experiments we compared the performance of the SVM classifiers to the following four other types of machine learning classification methods: Linear Discriminant Analysis (LDA) [19], Decision Trees (DT) [20], Artificial Neural Network (ANN) [21], and a simple classifier we created by comparing genes expressed in mESCs vs. genes expressed in embryonic bodies (EB). LDA uses training data to estimate the parameters of discriminant functions which determine boundaries in predictor space between various classes. Alternatively, DT offer a nonparametric model generating a classification tree where each branched node is split based on the values of features of gene vectors computed using information theory. ANN contain an input layer that takes in the feature values, a hidden layer made of nodes connected to the input layer with weighted links that can be adjusted, and an output layer consisting of the resultant classification. In addition, to rule out the possibility that the SVM and other classifiers simply detect mESC-specific genes, we also compared our methods to a simple classifier which predicts MSMGs based only on the gene expression fold change between mESCs and EB cells. Figure 2 summarizes the results of comparison between the different classification methods. In general, the results show that the SVM classifier outperforms the other methods. In all cases the best trained SVM either outperforms or is comparable to the other methods. However, these are not conclusive results since we haven't attempted to optimize parameter settings for the other classification methods. The average prediction accuracy of the simple classifier is 0.76 indicating that comparing the fold change for genes expressed in mESCs to EB cells is predictive by itself; however, this approach does not perform as well as any of the other classifiers.

Prioritizing candidate genes from genome-wide RNAi screens
The third set of computational experiments further test the generality of the SVM-based MSMG prediction classifier. Here we aim to assess whether genome-wide experimental characterization of genes, such as those data produced by mRNA expression profiling and genome-wide transcription factor binding profiling, can truly confer functional description (i.e. self-renewal and pluripotency membership). With this question in mind, we choose two independent studies to generate two test-sets of genes as positive and negative examples. The positive example test-set comes from a study that identified candidate genes functional in maintaining mESCs self-renewal using a genome-wide RNAi screen [22]. Whereas the negative test-set are genes identified as being important for the insulin signaling pathway, also identified from another genome-wide RNAi screen [23].
The insulin pathway related screen is considered as irrelevant to our MSMG definition and MSMG prediction task. However, we cannot rule out the possibility that some genes from the negative example test-set are also involved in stem-cell self-renewal and pluripotency regulation. The ratio (percentage) of predicted MSMG genes from the positive and negative test-set samples can be viewed as "signal-to-noise" ratio (Table 3). Overall, regardless of the data type used, whether we use microarray data alone or integrated data from microarrays and ChIP-seq, the number of genes predicted to be MSMG from the positive test set is significantly higher than from the negative test set (p-value ≈ 5.34 × 10 -12 , two-tail t-test) (Figure 3). Additionally, there is a high correlation (r = 0.89, Spearman's rank correlation) between the prediction accuracy from the LOOCV evaluation of SVMs and the signal-to-noise ratio generated from SVM predictions on the independent RNAi datasets. In other words, the prediction capacity of the SVM for future samples can be well estimated from its performance on our test-set examples using the LOOCV method. Hence, the SVM classifier is capable of discriminating between relevant RNAi screens hits and not relevant hits from another RNAi screen.
Given the labor-intensive effort and cost of identifying candidate genes from large-scale RNAi screens, the classifiers developed here may help in further prioritizing hits for functional experimental verification. Genomewide RNAi screens are considered noisy, containing high degree of false positives, where slightly different experimental protocols and statistical analyses can yield different results. As an example, recently Ding's group [24] demonstrated how a genome-wide RNAi screen approach was used to identify novel regulators of embryonic stem cell maintenance. Their results reveal a small overlap with the study we used here as a test-set [22]; 11 out of 209 candidate genes from the RNAi screen implemented by Ding's group overlaps with the study we used here in which 148 candidates were reported. Taken this into consideration, future work should continually test and train classifiers by using Ranked methods based on signal-to-noise ratio performance of predicting the percentage of genes as positive from the positive test set (self-renewal screen) and as positive from the negative test set (insulin-pathway screen). Performance of machine learning methods is evaluated and accuracy is measured using LOOCV. Labelling of panels is as follows, "microarray": using genome-wide mRNA microarray profiling data; "chip": using genome-wide ChIP-seq of transcription factors data; "micro-chip": using both microarray and ChIP-seq. The fold-changebased predictor results are only under the "microarray" panel since it uses only microarray data.  [25]. Similarly, misclassified genes are also found in the genome-wide insulin signaling pathway RNAi screen. Specifically, we found several candidate genes that are always predicted as MSMG, for example, Pim3 and Tnk2. It was shown that self-renewal of mESCs is supported by Pim1 and Pim3 [26], whereas Tnk2 was reported to stimulate breast cancer development in humans [27]. Considering the relation between stem cells self-renewal maintenance and cancer cells development, Tnk2 also appears to be a promising candidate for qualifying as a bona-fide MSMG. We emphasize that the misclassified genes, initially identified as critical for insulin signaling pathway, do not appear in our training set and therefore never seen by the SVM classifier before. Nevertheless, the classifier consistently predicted them as MSMGs.

Discussion
In this study we demonstrate the ability of SVM classifiers to predict MSMG membership. The results confirm that SVM is a fine choice for this type of classification task for this type of data. Since genome-wide RNAi screens used for discovering functional genes in stemcell self-renewal and pluripotency maintenance produce candidate lists that are inherently noisy, the SVM-based classifier can be applied to prioritize experimental choices when facing with a large list of candidate genes to verify and further functionally characterize. SVM has the advantage of being flexible for handling different data types as features in an input vector. This facilitates combining various data sources complementing each other which in general we show can increase accuracy. In our study, we only used pre-translational data. In other words, genes can only be differentiated from other genes at the mRNA and protein/DNA interaction levels. This means that post-translational properties cannot be correctly learned. Fortunately, with the growing availability of high-throughput data at the proteome level, i.e. phosphoproteomics profiling of embryonic stem cells [28,29], classification methods such as the one developed here have the potential to increase their prediction accuracy by combining such datasets.
Our computational experiments demonstrate that in general the SVM classifiers benefit from incorporating heterogeneous data. However, learning from various data types was not beneficial for the LDA and Decision Trees classifiers (Figure 2). The decrease in performance for LDA and Decision Trees might be due to sensitivity to features that do not provide substantial contribution to the classification. In addition, we did not implement a search for optimal parameter settings and feature selections for those classifiers. This would probably allow better performance and assessment of the different types of classifiers we tested.
When training a classifier to predict gene membership such as MSMG, there is a tradeoff between the size of the training set and the accuracy that can be achieved. In this study we chose to use a relatively small yet more reliable training set to increase our certainty about the true positives and true negatives. Alternatively, we could integrate together, as positive and negative training sets, genes identified from various studies, including both highthroughput and labor-intensive small-scale approaches. It would be interesting to see if such an approach would improve the performance of MSMG classification.

Conclusions
In summary, our results reveal that SVM classifiers are useful for predicting genes important for self-renewal and pluripotency of mESCs. Such an approach can be useful for prioritizing genes for functional experiments and complement the analyses of high-throughput profiling experimental data in stem cell research.

Data selection and preprocessing
To minimize the variability of different platforms and inconsistency that arise from gene ID mapping, we extract expression profile data collected from only two Affymatrix platforms (GPL339 and GPL340). These studies include time-series differential expression data from This method for preprocessing microarray data for SVM training was borrowed from Brown et al. [8].

Weighted kernel functions
The SVM classifiers we implemented to map the data from the input space to a high-dimensional space in which classification can be performed by locating data points with respect to a hyperplane that separates binary classes. The feature space can be adjusted by selecting a kernel function, which is used to transform the data for optimization of the classification [4,19]. In this study we utilized three common kernel functions and compared their prediction accuracy: (1) Linear kernel: (2) Polynomial kernel: (3) Gaussian radial basis kernel It is common practice to set σ to be equal to the median of the Euclidean distances from each positive sample vector to its closest negative sample vector [6], we note that such choice was not optimal for this particular application. Therefore, our strategy for determining σ is to sample a range of values (10 -2 to 10 1 ), using LOOCV, to select the best σ value that maximizes the prediction accuracy.

Classifier training
We integrated the two data types, mRNA expression and ChIP-seq, simply by concatenation. We then trained the SVM classifier using one of three kernel functions. Alternatively, motivated by the hypothesis that the classification would be better for treating each data type separately, we employed a strategy of integrating two kernel functions each applied to one of the two different data types. The weights for each classifier are determined by an F1 score.
F1 score is a measure of accuracy that takes into accounts both precision and recall.
As a result, we can obtain a classifier that is made of two weighted kernels.
K m and K c represent the kernel matrix measured from the microarray data and the ChIP-seq data, respectively. Hence, the more accurate, the more weight each data type in the kernel matrix would be. The motivation for combining kernel matrices by their weights is that each kernel matrix (from single data type) should exert their effects on the final training of the SVM according to their performance. We named these strategies "simple" and "weighted" in the figures and tables.

Comparison to other classification methods
Analyses for LDA, Decision Trees and ANN were performed with the default settings using the Statistics Toolbox in MATLAB, Natick, MA. For ANN, we used the Neural Network Toolbox in MATLAB implementing back-propagation to learn a two-layered-feed-forward network with five neurons in the hidden layer. To increase the reliability of the ANN results, we trained 30 ANNs and the final result is computed as the average accuracy. The simple fold-change-based predictor/classifier classifies genes as MSMG if the ESC-to-EB gene expression ratio is more than one. Such ratio was extracted from studies that compared gene expression in mESCs and EBs (GEO accessions: GSE3223, GSE10518). The accuracy of this simple predictor based on these two independent datasets is 0.73 and 0.79, respectively.

Leave-one-out cross-validation (LOOCV)
The performance of SVM classifiers and other machine learning classification methods is evaluated by LOOCV. Each classifier is trained on n-1 of the total n training samples and tested on the one left out. This step iterates n times to calculate the average performance of the trained classifier as an estimation of prediction error for unseen samples. We measured the accuracy to assess the learning performance:

ROC curves
We also evaluated the performance of various SVM classifiers by measuring the average area under the curve (AUC) using a receiver operating characteristic curve (ROC). Each time we left one fold of training samples out as a testing set and trained the SVM on the other two folds. This step is iterated three times and the average performance can be calibrated as AUC. The ROC curves were generated by varying the decision threshold of each SVM classifier.

Application of the SVM-based classifier to indentify and classify unseen MSMGs
To evaluate the SVM classifier, we chose two independent sets of genes from genome-wide RNAi screens. One screen identified genes important for stem cell selfrenewal and pluripotency, whereas the other screen identified genes important to insulin signaling. We were able to match the IDs of 126 genes from the screen that identified stem cell self-renewal out of 148 genes identified in the study [22]. These genes were used as the positive example test set. 101 genes out of the 126 genes identified as insulin signaling pathway members from the second study were ID matched [23] and used as the negative test set. SVM classifiers trained on our original training set of 46 positive and 70 negative examples were tested for their ability to classify genes from these two independent sets. The ratio of percentage of predicted MSMGs from the positive and negative test samples can be viewed as a signal-to-noise ratio: ratio = Predicted MSMGs in positive set Predicted MSMGs in n % e egative set% (11)