A novel two-layer SVM model in miRNA Drosha processing site detection

Background: MicroRNAs (miRNAs) are a large class of non-coding RNAs with important functions, and they are widespread in animals, plants and viruses. Studies have shown that an RNase III family member called Drosha recognizes most miRNAs, initiates their processing and determines the mature miRNAs. Identifying Drosha processing sites will shed light on both miRNA identification and the mechanism of Drosha processing.
Methods: We developed a computational method for Drosha processing site prediction, named DroshaPSP, which employs a two-layer mathematical model that integrates structure features in the first layer and sequence features in the second layer. The performance of DroshaPSP was estimated by 5-fold cross-validation and measured by ACC (accuracy), Sn (sensitivity), Sp (specificity), P (precision) and MCC (Matthews correlation coefficient).
Results: Testing DroshaPSP on the miRNA data of Drosophila melanogaster showed that Sn, Sp and MCC reached 0.86, 0.99 and 0.86, respectively.
Conclusions: We found that the Shannon entropy, a chemical kinetics feature, is significant for distinguishing the true sites from nearby sites and for improving performance.


Background
MicroRNAs (miRNAs) are a large class of ~22 nt long non-protein-coding RNAs that post-transcriptionally interfere with the expression of their target genes by binding to the 3'-untranslated regions (3'UTR) [1]. MiRNAs have been found to degrade or suppress the expression of a great number of target genes [2,3] in plants, animals and viruses [4], and they play important roles in embryo development, cell growth and tissue differentiation, apoptosis and proliferation, morphogenesis and so on [5][6][7][8].
Drosha is a Class 2 RNase III enzyme. In most animals, except for a few miRNAs that are produced by the miRtron pathway [9], Drosha cleaves the long primary miRNAs (pri-miRNAs) into precursor miRNA (pre-miRNA) hairpins of ~70 nt in length [10], which initiates miRNA processing [11,12]. The Drosha processing step determines the sequence regions of the pre-miRNAs that Dicer subsequently processes into mature miRNAs. Because Dicer selects its cleavage sites by measuring a set distance from the Drosha processing sites [13], Drosha is considered the key determinant of the mature miRNAs. Furthermore, Drosha processing also determines the efficiency and specificity of the expression of most miRNAs [14]. Therefore, accurate identification of Drosha processing sites will facilitate the recognition of miRNAs and the understanding of the mechanisms of miRNA biogenesis.
Both experimental and computational methods have been employed to identify Drosha processing sites. Kadener et al. experimentally identified 137 Drosha target sites in pri-miRNAs at the genome scale of Drosophila with tiling microarray technology [15]. Computational methods are another option for identifying Drosha processing sites quickly and at low cost. The 'Microprocessor SVM' is a computational program used to identify human Drosha processing sites with a feature set formed by structure information features and base pair information features of the pre-miRNA hairpin. However, the accuracy of 'Microprocessor SVM' in predicting known 5'-Drosha processing sites in human is approximately 50% [16]. One possible reason for the low accuracy is the absence of chemical kinetics features, such as the Shannon entropy of pre-miRNA folding.
In this study, we introduce a computational method named DroshaPSP that integrates the Shannon entropy [17] into the feature set to search for Drosha processing sites on the pre-miRNA hairpin structure. The Shannon entropy has been verified to be a significant measure in the folding of non-coding RNA sequences (ncRNAs), especially miRNAs [18]. It is widely accepted that folding of the pri-miRNA into a hairpin structure is required for Drosha processing, so we inferred that the Shannon entropy is important for the Drosha processing step. As we expected, our Drosha processing site prediction program, called DroshaPSP, gave SN nearly 0.91 while SP was over 0.99, and the MCC reached 0.94. This result confirmed our hypothesis that chemical kinetics features, in particular the Shannon entropy, are important for Drosha processing.
We reported our research results at BIBM 2012 [19]. In this supplement, we describe the Methods in more detail, explaining how we established the two-layer classifier based on SVM, and discuss the irreplaceability of the first layer.

Data
Drosophila melanogaster was chosen as the study species due to its small genome.
The Drosophila melanogaster miRNA annotation data, including the sequences of pre-miRNAs, the structure data of miRNA hairpins, the sequences of mature miRNAs and the sequences of miRNA stars, were downloaded from miRBase (http://www.mirbase.org/) [4], which collects comprehensive annotation information on Drosophila melanogaster miRNAs. It should be noted that the miRNAs produced by the miRtron pathway were not considered in this study, because they are not processed by Drosha.
The sequence data of the Drosophila melanogaster genome were obtained from the Ensembl database [20].

Predicting steps of DroshaPSP
A two-layer prediction model is used in DroshaPSP to predict the processing sites of Drosha, as shown in Figure 1. For a given gene sequence, DroshaPSP first determines the hairpin structure with the prediction model HairpinSVM, and then identifies the Drosha processing sites of the hairpin structure with the prediction model DroshaSVM, which integrates the structure, sequence and entropy information.

HairpinSVM: Pre-miRNA like hairpin structure determination
HairpinSVM is a classifier built on the support vector machine (SVM) [21] to identify the pre-miRNA-like hairpins that are the potential substrates of DroshaSVM. We selected the most widely used kernel, the radial basis function (RBF) kernel, for HairpinSVM. The RBF kernel of SVM [22] was implemented with the LIBSVM package [23].
As shown in Figure 2A, all the pre-miRNA sequences (70~100 nt) obtained from miRBase were first mapped to the Drosophila melanogaster genomic sequences with Blast [24] and extended to 180 nt. These 180 nt long sequences constituted the sample database (the Sample DB). For each sample in the Sample DB, all of its subsequences longer than 50 nt were input to the RNAfold software [25]. The hairpin structures returned by RNAfold were the candidates for HairpinSVM. When several subsequences of a sample gave the same folding structure, only the longest one was retained. In brief, all the possible structures output by RNAfold were considered as pre-miRNA candidates. In the candidate dataset, the candidates identical to the corresponding pre-miRNA structure given by miRBase formed the positive training set, and the others constituted the negative training set. In total, we obtained 641 positive training samples and 3024 negative training samples for HairpinSVM.
HairpinSVM uses 12 structure features to select the most probable pre-miRNA-like hairpin structures (Table 1).
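The candidate enumeration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `base_pairs`, `hairpin_candidates` and `fold_fn` are hypothetical, and the sketch assumes that "the same folding structure" means the same set of base pairs in the coordinates of the 180 nt sample, which the text does not state explicitly. The folding routine is passed in as a callable so that any folding program can be plugged in.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Pair = Tuple[int, int]


def base_pairs(structure: str, offset: int = 0) -> FrozenSet[Pair]:
    """Return the base pairs of a dot-bracket string, shifted by `offset`."""
    stack: List[int] = []
    pairs = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs.add((j + offset, i + offset))
    return frozenset(pairs)


def hairpin_candidates(
    sample: str,
    fold_fn: Callable[[str], str],
    min_len: int = 51,  # "longer than 50 nt"
) -> List[Tuple[int, int, str]]:
    """Fold every subsequence of at least `min_len` nt and keep, for each
    distinct set of base pairs (assumed meaning of "same structure"),
    only the longest subsequence. Returns (start, end, structure) tuples."""
    best: Dict[FrozenSet[Pair], Tuple[int, int, str]] = {}
    n = len(sample)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            structure = fold_fn(sample[start:end])
            key = base_pairs(structure, offset=start)
            if not key:
                continue  # no base pairs, so not a hairpin candidate
            kept = best.get(key)
            if kept is None or (end - start) > (kept[1] - kept[0]):
                best[key] = (start, end, structure)
    return sorted(best.values())
```

With the ViennaRNA Python bindings installed, `fold_fn` could plausibly be `lambda s: RNA.fold(s)[0]`; that choice is an assumption here and is not taken from the paper. A 180 nt sample yields several thousand subsequences to fold, which is consistent with the run time reported in the discussion.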

DroshaSVM: Drosha processing site classifier
The output of DroshaSVM is the probability that each candidate is a Drosha processing site. The candidates are the sites in the 5'-stems of the hairpins output by HairpinSVM (Figure 2B). Similar to Microprocessor SVM, we defined the true Drosha processing sites as the 5'-ends of the mature miRNAs and miRNA stars in the 5'-stem of the pre-miRNA hairpin annotated by miRBase. If miRBase gives no such annotation for a pre-miRNA hairpin, we presumed that the 3'-end of the mature miRNA leaves a 2 nt overhang relative to the corresponding true 5' Drosha processing site. For DroshaSVM training, we collected 641 experimentally validated positive samples from the miRBase database. The negative sample set is formed by the other 30,873 sites in the 5'-stems of known pre-miRNAs.
Like HairpinSVM, DroshaSVM adopts the RBF kernel for its prediction model. Besides the commonly used features, such as the base pair and its probability and the distance from the loop, we also integrated the entropy features into DroshaSVM (Table 2). The Shannon entropy is a dynamical feature that has been verified to be a significant measure in the folding of non-coding RNA sequences (ncRNAs), especially miRNAs. The scaled values of the features were used as input for SVM model training.
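As an illustration of how a per-site entropy feature might be derived, the sketch below computes a positional Shannon entropy from base-pairing probabilities and extracts a 10-position window (the candidate site and the 9 nt after it). Several points are assumptions rather than details from the paper: the entropy of position i is taken over all pairing partners plus the unpaired state, the logarithm is base 2, "forward" is read as increasing sequence index, and out-of-range positions are padded with 0. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def positional_entropy(
    n: int, pair_probs: Dict[Tuple[int, int], float]
) -> List[float]:
    """Shannon entropy (in bits) of the pairing state of each position.

    `pair_probs` maps a pair (i, j) with i < j to its pairing probability,
    e.g. as obtained from a partition-function folding program.
    """
    per_position: List[List[float]] = [[] for _ in range(n)]
    for (i, j), p in pair_probs.items():
        per_position[i].append(p)
        per_position[j].append(p)

    entropies = []
    for probs in per_position:
        # Assumed: the unpaired state takes the remaining probability mass.
        unpaired = max(0.0, 1.0 - sum(probs))
        h = -sum(p * math.log2(p) for p in probs + [unpaired] if p > 0)
        entropies.append(h)
    return entropies


def site_window(
    values: Sequence[float], site: int, width: int = 10, pad: float = 0.0
) -> List[float]:
    """Values at the candidate site and the following `width - 1` positions."""
    return [
        values[k] if 0 <= k < len(values) else pad
        for k in range(site, site + width)
    ]
```

A position that pairs with a single partner with probability 1 has entropy 0, while a position that is paired or unpaired with equal probability has entropy 1 bit, so higher values mark positions whose pairing state is less well defined.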

Estimating the performance
We applied a 5-fold cross-validation test to both prediction models. In brief, the positive and negative samples are first divided randomly into 5 folds. The classifier is then trained on the data from 4 folds and tested on the data from the remaining fold, in turn. From the results of the 5-fold cross-validation, five widely used measures are computed to estimate the performance of both HairpinSVM and DroshaSVM: ACC (accuracy), Sn (sensitivity), Sp (specificity), P (precision) and MCC (Matthews correlation coefficient). The measures are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{P} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TN, TP, FN and FP represent the counts of true negatives, true positives, false negatives and false positives, respectively. Because the positive and negative training sets are unbalanced, the MCC rather than the ACC is used to estimate the overall performance and to determine the default threshold [26]. To evaluate the classifiers comprehensively, the receiver operating characteristic curve (ROC curve) is also used to present the performance intuitively.
The DroshaPSP program was tested on the testing dataset, and its performance was also assessed by ACC, SN, SP, P and MCC.
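The five measures, and the selection of a decision threshold by MCC, can be sketched as follows. The measures use the standard definitions in terms of TP, TN, FP and FN. The function names are hypothetical, and the threshold search (trying every observed score as a cut-off and keeping the first one with the highest MCC) is an assumed procedure, since the text does not say how the default threshold was searched.

```python
import math
from typing import Dict, Sequence, Tuple


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """ACC, Sn, Sp, P and MCC from confusion counts (0.0 when undefined)."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "P": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


def best_threshold_by_mcc(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, MCC) maximizing MCC; a site is predicted
    positive when its score is >= threshold."""
    best_t, best_mcc = float("nan"), -2.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y != 1)
        mcc = confusion_metrics(tp, tn, fp, fn)["MCC"]
        if mcc > best_mcc:
            best_t, best_mcc = t, mcc
    return best_t, best_mcc
```

With roughly 30,000 negative sites against a few hundred positives, a classifier that rejects every site would still score an ACC near 0.98, which is why a measure such as MCC that uses all four counts is preferred for choosing the threshold.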

Results
We developed a program called DroshaPSP, based on the SVM method, to automatically identify the Drosha processing sites in a given sequence. For a given sequence, HairpinSVM first determines whether it is a pre-miRNA- [...] Figure 3B and Figure 3D, which indicate that the performance of HairpinSVM and DroshaSVM was stable. The test results suggest that HairpinSVM and DroshaSVM give reliable predictions of pre-miRNA hairpin structures and Drosha processing sites.

Performance of the DroshaPSP program
To test the whole prediction program, we used all miRNAs of Drosophila melanogaster in miRBase version 18.0 as the testing set. The test showed that SN was 0.859 while SP reached 0.999, and the values of ACC and P were 0.998 and 0.870, respectively. The comprehensive measure MCC reached 0.864.

Estimating the importance of the features
Estimating the influence of each feature on the SVM classifiers shows the importance of each feature and gives a better understanding of miRNA maturation. To this end, the F-score method is applied. The F-score is an effective measure of how well a feature discriminates between two sets. Given training vectors $x_k$, $k = 1, ..., m$, with the numbers of positive and negative instances denoted $n_+$ and $n_-$, respectively, the F-score of the ith feature is calculated as:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the ith feature over the whole, positive and negative data sets, respectively; $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the ith feature of the kth positive and negative instance. The larger the F-score is, the more likely this feature is discriminative.
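A minimal sketch of the F-score calculation is given below, following the description above: the numerator measures how far the positive and negative means lie from the overall mean, and the denominator sums the sample variances within each class. The function name `f_scores` and the handling of a zero denominator are choices made for this illustration.

```python
from typing import List, Sequence


def f_scores(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """F-score of every feature column of X; y holds 1 for positive
    instances and any other value for negative instances."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label != 1]
    if len(pos) < 2 or len(neg) < 2:
        raise ValueError("need at least two instances in each class")

    scores = []
    for i in range(len(X[0])):
        mean_all = sum(row[i] for row in X) / len(X)
        mean_pos = sum(row[i] for row in pos) / len(pos)
        mean_neg = sum(row[i] for row in neg) / len(neg)

        # Separation of the class means from the overall mean.
        between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
        # Sum of the sample variances within each class.
        within = (
            sum((row[i] - mean_pos) ** 2 for row in pos) / (len(pos) - 1)
            + sum((row[i] - mean_neg) ** 2 for row in neg) / (len(neg) - 1)
        )

        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # perfectly separated, zero variance
        else:
            scores.append(0.0)  # constant feature carries no information
    return scores
```

Note that the F-score evaluates each feature on its own, so it does not capture information that only emerges from combinations of features.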
Figure 4A and Figure 4B present the F-score of each feature used in HairpinSVM and DroshaSVM, respectively. The F-score of a feature stands for its contribution to the classifier. Figure 4A shows that the energy features, including the free energy of the thermodynamic ensemble and the minimal free energy, are the most effective features for selecting pre-miRNA-like hairpins. The stem structure features, such as pair, length and stem length, take second place. Other stem structure features that affect the balance of the 5' stem and 3' stem, such as the number of bulges in the folding structure and the fraction of paired bases in the sequence, contribute only a little to HairpinSVM. According to Figure 4A, the loop structure features are less important than the stem features. For DroshaSVM, the F-scores of all the features used are shown in Figure 4B. Unexpectedly, the F-score of the base types is low at all the sites we selected. These facts suggest that the base types are not so important, whereas the stability and probability of the base pairs at these sites are effective features for Drosha processing site prediction. We found that the region from position 3 to position 9 has higher F-scores, so these may be the functional positions in Drosha processing. However, different features have their own regions of high F-score. The entropy has its highest F-scores at positions 5 and 6, and the base pairing probability and structure have relatively high scores, especially the probability at positions 8 and 9. In addition, all the features of the candidate sites themselves have low F-scores. The explanation for this observation may be that the processing sites themselves have little to do with the Drosha processing site determination.

Table 2 The features used in DroshaSVM
ID      Name           Description
1       Loop_Distance  Distance from the processing site candidate to the loop of the hairpin structure.
2~11    Structure      Whether the candidate site and the 9 nt sites forward are paired or not.
12~21   Base           The base types of the candidate site and the 9 nt sites forward.
22~31   Probability    The base pairing probability of the candidate site and the 9 nt sites forward.
32~41   Entropy        The Shannon entropy of the candidate site and the 9 nt sites forward.

The Shannon entropy affects the Drosha process
As far as we know, this is the first time the Shannon entropy has been used in Drosha processing site identification. The Shannon entropy is a powerful chemical kinetics feature that has been shown to be effective in ncRNA folding [18]. According to the F-score analysis (Figure 4), the traditional features, probability and structure information, have high F-scores, but the Shannon entropy also shows an effect that should not be ignored. The F-scores of the Shannon entropy are higher than those of the base types at the candidate site and the sites forward. When we removed the Shannon entropy, the AUC of the ROC curve of DroshaSVM obtained with the modified feature set decreased by 9% (AUC = 0.886). We then examined the scores calculated by DroshaSVM, with the Shannon entropy included in or removed from the feature set, in the region from 3 nt upstream to 3 nt downstream of the true Drosha processing sites. Figure 5 is a histogram of the average score calculated by DroshaSVM for the sites at different distances from the true Drosha processing sites in both cases. With the feature set that includes the Shannon entropy, the figure clearly shows that the average score of the true Drosha processing sites is much higher than that of the nearby sites, and that there is no significant difference among the nearby sites at different distances from the true Drosha processing sites. With the feature set that lacks the Shannon entropy, the average score of the neighboring sites within 2 nt shows a remarkable increase that depends on the distance from the true processing sites.
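The averaging behind a histogram of this kind can be sketched as follows. This is an assumed reconstruction of the analysis rather than the authors' code: the function name and the input layout (one list of per-position scores and one true site index per hairpin) are hypothetical, and the example scores in the usage note are invented solely to show the call.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def mean_score_by_offset(
    hairpins: Sequence[Tuple[Sequence[float], int]], max_offset: int = 3
) -> Dict[int, float]:
    """Average classifier score at each offset from the true site.

    Each element of `hairpins` is (scores, true_site), where scores[p] is
    the score of position p and true_site is the index of the true site.
    Offsets that fall outside a hairpin are skipped for that hairpin.
    """
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for scores, true_site in hairpins:
        for offset in range(-max_offset, max_offset + 1):
            pos = true_site + offset
            if 0 <= pos < len(scores):
                sums[offset] += scores[pos]
                counts[offset] += 1
    return {offset: sums[offset] / counts[offset] for offset in sorted(counts)}
```

For example, `mean_score_by_offset([([0.1, 0.2, 0.9, 0.2], 2)], max_offset=1)` returns the mean scores at offsets -1, 0 and +1. Running the function once on scores from each feature set gives the two series to compare.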
These experiments demonstrated that the Shannon entropy is a significant feature for identifying Drosha processing sites and indicated that Drosha processing is influenced by the chemical kinetics of pre-miRNA folding.

Discussion and conclusion
The precise detection of Drosha processing sites is a crucial step in miRNA identification and in revealing the process of miRNA maturation. In this study, we proposed a two-layer prediction model named DroshaPSP to identify Drosha processing sites by combining sequence and structure information, and the evaluation results show that our method can achieve high prediction accuracy.
In our model, we introduced a novel dynamical feature, the Shannon entropy, which helps to distinguish the true processing sites from nearby sites. In the previous study, the true processing sites and the neighboring sites within 2 nt were hard to tell apart because Microprocessor SVM assigned them similar scores, which is a serious problem in predicting Drosha processing sites. Finding features that can sufficiently distinguish the genuine Drosha processing sites from the neighboring ones was our prime interest. For this purpose, we brought in the Shannon entropy. As shown in Figure 5, with the Shannon entropy, DroshaPSP can clearly pinpoint the true processing site within its neighborhood.
Drosophila melanogaster was chosen as our study species because of its extensive annotation of Drosha processing sites on miRNAs. We did not compare DroshaPSP with Microprocessor SVM, because the parameters of the latter method were derived from human miRNAs, which are reported to differ considerably from Drosophila melanogaster miRNAs; for example, Drosha has different cleavage partners in human and Drosophila. Thus, a direct comparison of two prediction models derived from these two distinct species would give unfair results.
It is noteworthy that the purpose of HairpinSVM, the first layer of DroshaPSP, is not to scan the given sequence for pre-miRNAs, but to select the pre-miRNA-like hairpin structure from all the RNA folding results of the given sequence. Therefore, HairpinSVM cannot be replaced by other pre-miRNA prediction programs. To classify the pre-miRNA-like hairpin structures clearly, the negative samples must be chosen carefully. Our negative samples are close to the positive samples in location and sequence but have clearly different hairpin structures, which makes them very suitable and leads to the good performance of the first-layer classification.
Although our proposed two-layer SVM method has high prediction accuracy, it is rather time-consuming because of the large amount of folding work done by RNAfold, which is computationally demanding. For example, prediction on a 180 nt sequence requires more than 3 minutes. This shortcoming limits its application to large datasets.
In the future, we will try to cut down the run time by changing the programming language and to improve the prediction accuracy of DroshaPSP with more structure features, including the structure, base probability and entropy for each site. We will also extensively evaluate the performance of DroshaPSP with prediction models trained on Drosha processing sites from other species. In addition, we are planning to develop a stand-alone implementation with a parallel computation option for Drosha processing site recognition on different OS platforms.
In conclusion, we developed a Drosha processing site prediction program, called DroshaPSP, which is composed of two SVM-based classifiers, HairpinSVM and DroshaSVM. HairpinSVM achieved an MCC of 0.88, and DroshaSVM performed even better, with the MCC reaching 0.94. For the overall performance of DroshaPSP, the MCC reached 0.86 while SN was 0.86 and SP was over 0.99. We introduced the Shannon entropy into the feature set of DroshaPSP for the first time and gained a substantial improvement. We found that the Shannon entropy helps DroshaSVM to distinguish the true processing site from its neighborhood.