A novel two-layer SVM model in miRNA Drosha processing site detection
© Hu et al.; licensee BioMed Central Ltd. 2013
- Published: 23 October 2013
MicroRNAs (miRNAs) are a large class of non-coding RNAs with important functions that are widespread in animals, plants and viruses. Studies have shown that an RNase III family member called Drosha recognizes most miRNAs, initiates their processing and determines the mature miRNAs. Identifying Drosha processing sites will shed light on both miRNA identification and the mechanism of Drosha processing.
We developed a computational method for Drosha processing site prediction, named DroshaPSP, which employs a two-layer mathematical model that integrates structure features in the first layer and sequence features in the second layer. The performance of DroshaPSP was estimated by 5-fold cross-validation and measured by ACC (accuracy), Sn (sensitivity), Sp (specificity), P (precision) and MCC (Matthews correlation coefficient).
Testing DroshaPSP on the miRNA data of Drosophila melanogaster indicated that Sn, Sp and MCC reached 0.86, 0.99 and 0.86, respectively.
We found that the Shannon entropy, a chemical kinetics feature, is significant for distinguishing the true sites from nearby sites and for improving performance.
- Support Vector Machine
- Shannon Entropy
- Support Vector Machine Model
- Mature miRNAs
- Hairpin Structure
MicroRNAs (miRNAs) are a large class of ~22nt long non-protein-coding RNAs that post-transcriptionally interfere with the expression of their target genes by binding to the 3'-untranslated regions (3'UTR). miRNAs have been found to degrade or suppress the expression of a large number of target genes [2, 3] in plants, animals and viruses, and they play important roles in embryo development, cell growth and tissue differentiation, apoptosis and proliferation, morphogenesis and so on [5–8].
Drosha is a class 2 RNase III enzyme. In most animals, except for a few miRNAs produced by the mirtron pathway, it is Drosha that cleaves the long primary miRNAs (pri-miRNAs) into precursor miRNA (pre-miRNA) hairpins of ~70nt in length, which initiates miRNA processing [11, 12]. The Drosha processing step determines the sequence region of the pre-miRNA that Dicer subsequently processes into the mature miRNA. Because Dicer selects its cleavage sites by measuring a set distance from the Drosha processing sites, Drosha is considered the key determinant of the mature miRNA. Furthermore, Drosha processing also determines the efficiency and specificity of most miRNA expression. Therefore, accurate identification of Drosha processing sites will facilitate both the recognition of miRNAs and the understanding of miRNA biogenesis.
Both experimental and computational methods have been employed to identify Drosha processing sites. Kadener et al. experimentally identified 137 Drosha target sites from pri-miRNAs at the genome scale in Drosophila using tiling microarray technology. Computational methods are another option for identifying Drosha processing sites quickly and at low cost. 'Microprocessor SVM' is a computational program that identifies human Drosha processing sites with a feature set formed from structure information and base pair information of the pre-miRNA hairpin. However, the accuracy of 'Microprocessor SVM' in predicting known 5' Drosha processing sites in human is approximately 50%. One possible reason for the low accuracy is that some chemical kinetics features are missing, such as the Shannon entropy of pre-miRNA folding.
In this study, we introduced a computational method named DroshaPSP that integrates the Shannon entropy into the feature set to search for Drosha processing sites on the pre-miRNA hairpin structure. The Shannon entropy has been verified to be a significant measure in the folding of non-coding RNA sequences (ncRNAs), especially miRNAs. It is widely accepted that folding of the pri-miRNA into a hairpin structure is required for Drosha processing, so we inferred that the Shannon entropy is important for the Drosha processing step. As we expected, our Drosha processing site prediction program, DroshaPSP, gave an SN of nearly 0.91 while SP was over 0.99, and the MCC reached 0.94. This result supported our hypothesis that chemical kinetics features, in particular the Shannon entropy, are important for Drosha processing.
We reported our research results at BIBM 2012. In this supplement, we describe the Methods in more detail, explaining how we established the two-layer classifier based on SVM, and we discuss why the first layer cannot be replaced.
Drosophila melanogaster was chosen as the study species due to its small genome.
The Drosophila melanogaster miRNA annotation data, including the sequences of pre-miRNAs, the structure data of miRNA hairpins, the sequences of mature miRNAs and the sequences of miRNA stars, were downloaded from miRBase (http://www.mirbase.org/), which collects comprehensive annotation information on Drosophila melanogaster miRNAs. It should be noted that miRNAs produced by the mirtron pathway were not considered in this study, because they are not processed by Drosha.
The sequence data of the Drosophila melanogaster genome were obtained from the Ensembl database.
Predicting steps of DroshaPSP
HairpinSVM: Pre-miRNA like hairpin structure determination
HairpinSVM is a classifier based on the support vector machine (SVM) that identifies the pre-miRNA like hairpins, which are the potential substrates of DroshaSVM. We selected the most widely used kernel, the radial basis function kernel (RBF kernel), for HairpinSVM. The RBF kernel SVM was implemented with the LIBSVM package.
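For readers unfamiliar with the RBF kernel, the following minimal sketch shows the similarity function it computes between two feature vectors, K(x, y) = exp(-gamma * ||x - y||^2). The function name and the example `gamma` value are illustrative assumptions; the source does not report the kernel parameters that were used.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical usage: identical vectors have similarity 1, and the
# similarity decays towards 0 as the vectors move apart.
similarity = rbf_kernel([70.0, 0.45], [72.0, 0.40], gamma=0.1)
```

A larger `gamma` makes the similarity fall off faster with distance, so in practice it is usually tuned together with the SVM penalty parameter.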
The features used in HairpinSVM
The length of the sequence
The loop size of hairpin structure
The stem length of hairpin structure
The number of base pairs in folding result
The fraction of paired base in sequence
The number of bulges in the folding structure output by RNAfold
The average length of bulges in sequence
The ratio between the nucleotides in bulges and those in the sequence
The minimal free energy output by RNAfold
The free energy of the thermodynamic ensemble
The probability of this single structure in the Boltzmann weighted ensemble of all structures.
The ensemble diversity is the average base-pair distance between all structures in the thermodynamic ensemble.
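The structural features in this list can be derived from a dot-bracket folding string such as the one RNAfold prints. The sketch below is one possible reading of those features, not the authors' implementation: it assumes a single stem-loop, measures the stem as the span of the 5' arm, and counts every unpaired run inside the stem (bulge or internal loop) as a bulge. The energy and ensemble features are taken from the folding program's output and are not computed here.

```python
import re


def hairpin_structure_features(structure):
    """Structural features of one stem-loop written in dot-bracket notation."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError("unexpected symbol: " + symbol)
    if stack or not pairs:
        raise ValueError("unbalanced or unpaired dot-bracket string")

    outer_open, outer_close = min(pairs)  # outermost base pair
    inner_open, inner_close = max(pairs)  # pair closing the terminal loop
    if inner_open > min(close for _, close in pairs):
        raise ValueError("structure is not a single stem-loop")

    # Unpaired runs inside the two arms of the stem (assumed to be "bulges").
    five_prime_arm = structure[outer_open:inner_open + 1]
    three_prime_arm = structure[inner_close:outer_close + 1]
    bulges = re.findall(r"\.+", five_prime_arm) + re.findall(r"\.+", three_prime_arm)
    bulge_nt = sum(len(run) for run in bulges)

    length = len(structure)
    return {
        "length": length,
        "loop_size": inner_close - inner_open - 1,
        "stem_length": inner_open - outer_open + 1,
        "base_pairs": len(pairs),
        "paired_fraction": 2 * len(pairs) / length,
        "bulge_count": len(bulges),
        "mean_bulge_length": bulge_nt / len(bulges) if bulges else 0.0,
        "bulge_fraction": bulge_nt / length,
    }


# Toy structure, not a real pre-miRNA.
features = hairpin_structure_features("..((.((....))..)).")
```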
DroshaSVM: Drosha processing site classifier
The output of DroshaSVM is a probability for each candidate Drosha processing site. The candidates are the sites on the 5'-stems of the hairpins output by HairpinSVM (Figure 2B). Similar to Microprocessor SVM, we defined the true Drosha processing sites as the 5'-ends of the mature miRNAs and miRNA stars located in the 5'-stem of the pre-miRNA hairpin, as annotated by miRBase. If miRBase gave no such annotation for a pre-miRNA hairpin, we presumed that the 3'-end of the mature miRNA leaves a 2nt overhang relative to the true Drosha processing site on the 5'-stem. For DroshaSVM training, we collected 641 experimentally validated positive samples from the miRBase database. The negative sample set is formed by the other 30,873 sites in the 5'-stems of known pre-miRNAs.
The features used in DroshaSVM
Distance from processing site candidate to loop of the hairpin structure.
Structure description: whether the candidate site and each of the 9nt sites forward are paired or not.
The base types of the candidate site and 9nt sites forward.
The base pairing probability of the candidate site and 9nt sites forward.
The Shannon entropy of the candidate site and 9nt sites forward.
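As an illustration of the last feature, the sketch below computes a per-position Shannon entropy from base pairing probabilities and then reads out the candidate site plus the 9 sites forward. Several details are assumptions, since the source does not spell them out: the unpaired state is counted as one outcome, the logarithm is base 2, "forward" is taken to mean increasing sequence index, and windows that run past the end are padded with zeros.

```python
import math


def positional_entropy(pair_probs, length):
    """Shannon entropy of the pairing distribution at each position.

    pair_probs maps 0-based index pairs (i, j) to pairing probabilities.
    """
    partners = [dict() for _ in range(length)]
    for (i, j), probability in pair_probs.items():
        partners[i][j] = probability
        partners[j][i] = probability
    entropies = []
    for i in range(length):
        paired = sum(partners[i].values())
        # Assumption: the unpaired state is one outcome of the distribution.
        outcomes = list(partners[i].values()) + [max(0.0, 1.0 - paired)]
        entropies.append(-sum(p * math.log2(p) for p in outcomes if p > 0.0))
    return entropies


def window_features(values, site, width=10):
    """Values at the candidate site and the following width - 1 positions."""
    window = list(values[site:site + width])
    return window + [0.0] * (width - len(window))


# Toy probabilities for a 4 nt sequence, not real folding output.
toy_entropy = positional_entropy({(0, 3): 0.5, (1, 2): 1.0}, length=4)
toy_window = window_features(toy_entropy, site=2, width=3)
```

A position that pairs with one partner with certainty has entropy 0, while a position that is equally likely to be paired or unpaired has entropy 1 bit, which is why the measure captures how well defined the local structure is.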
Estimating the performance
To evaluate the classifiers comprehensively, the receiver operating characteristic curve (ROC curve) is used to present the performance intuitively.
The DroshaPSP program was tested on the testing dataset, and its performance was also assessed by ACC, SN, SP, P and MCC.
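The five measures can all be computed from the counts of true positives, false positives, true negatives and false negatives. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """ACC, SN, SP, P and MCC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "SN": ratio(tp, tp + fn),  # sensitivity (recall)
        "SP": ratio(tn, tn + fp),  # specificity
        "P": ratio(tp, tp + fp),   # precision
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)
```

Because true processing sites are far outnumbered by non-sites, ACC and SP are dominated by the negatives, which is why MCC is useful as a single balanced summary.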
We developed a program called DroshaPSP that automatically identifies Drosha processing sites in a given sequence using the SVM method. A given sequence is first examined by HairpinSVM to decide whether it forms a pre-miRNA-like hairpin structure. If HairpinSVM predicts it as a positive sample, DroshaSVM then determines whether there are Drosha processing sites and where they are.
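The two-layer flow described here can be sketched as a simple filter followed by a per-site scorer. This is only an outline under stated assumptions: the hairpin record layout, the two callables standing in for HairpinSVM and DroshaSVM, and the 0.5 probability threshold are all hypothetical and do not reproduce the actual program.

```python
def predict_processing_sites(hairpins, is_pre_mirna_like, site_probability,
                             threshold=0.5):
    """Run layer 1 (hairpin filter), then layer 2 (site scoring)."""
    predictions = []
    for hairpin in hairpins:
        # Layer 1: keep only hairpins classified as pre-miRNA like.
        if not is_pre_mirna_like(hairpin):
            continue
        # Layer 2: score every candidate site on the 5'-stem.
        for site in hairpin["stem_sites"]:
            probability = site_probability(hairpin, site)
            if probability >= threshold:
                predictions.append((hairpin["name"], site, probability))
    return predictions


# Toy stand-ins for the two trained classifiers.
toy_hairpins = [
    {"name": "h1", "stem_sites": [3, 4, 5]},
    {"name": "h2", "stem_sites": [1]},
]
toy_predictions = predict_processing_sites(
    toy_hairpins,
    is_pre_mirna_like=lambda hairpin: hairpin["name"] == "h1",
    site_probability=lambda hairpin, site: 0.9 if site == 4 else 0.1,
)
```

Passing the classifiers in as callables keeps the control flow separate from the models, so either layer could be swapped without touching the loop.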
Performance of the classifiers
Performance of the DroshaPSP program
To test the whole prediction program, we used all miRNAs of Drosophila melanogaster in miRBase version 18.0 as the testing set. The test showed that SN was 0.859 while SP reached 0.999, and ACC and P were 0.998 and 0.870, respectively. The comprehensive measurement MCC reached 0.864.
Estimating the importance of the features
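The F-score of the $i$th feature is defined as

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the $i$th feature over the whole, positive and negative data sets, respectively; $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the $i$th feature of the $k$th positive and negative instance; and $n_+$ and $n_-$ are the numbers of positive and negative instances. The larger the F-score is, the more likely this feature is discriminative.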
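A minimal sketch of this F-score for a single feature follows, assuming the commonly used definition: the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function name and the example values are hypothetical.

```python
def f_score(positive, negative):
    """F-score of one feature from its positive and negative sample values."""
    n_pos, n_neg = len(positive), len(negative)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two instances")
    mean_pos = sum(positive) / n_pos
    mean_neg = sum(negative) / n_neg
    mean_all = (sum(positive) + sum(negative)) / (n_pos + n_neg)
    # Numerator: separation of the class means from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the within-class sample variances.
    within = (sum((x - mean_pos) ** 2 for x in positive) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in negative) / (n_neg - 1))
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return between / within


# Hypothetical feature values for illustration only.
example_score = f_score([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
```

The score treats each feature independently, so it ranks features by individual discriminative power and does not capture information shared between features.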
The Shannon entropy affects the Drosha process
To our knowledge, this is the first time that the Shannon entropy has been used in Drosha processing site identification. The Shannon entropy is a powerful chemical kinetics feature that has been shown to be effective in ncRNA folding. According to the F-score analysis (Figure 4), the traditional features, pairing probability and structure information, obtained high F-scores, but the Shannon entropy also showed an effect that should not be ignored. The F-scores of the Shannon entropy were higher than those of the base pair information at the candidate site and the sites forward. When we removed the Shannon entropy from the feature set, the AUC of the ROC curve of DroshaSVM decreased by 9% (AUC = 0.886).
These experiments demonstrated that the Shannon entropy is a significant feature for identifying Drosha processing sites and indicated that Drosha processing is influenced by the chemical kinetics of pre-miRNA folding.
The precise detection of Drosha processing sites is crucial for miRNA identification and for revealing how miRNAs mature. In this study, we proposed a two-layer prediction model named DroshaPSP to identify Drosha processing sites by combining sequence and structure information, and the evaluation results show that our method can achieve high prediction accuracy.
In our model, a novel dynamical feature, the Shannon entropy, was introduced to help distinguish the true processing sites from nearby sites. In the previous study, the true processing sites and the neighboring sites within 2nt were indistinct because Microprocessor SVM assigned them similar scores, which led to a serious problem in predicting Drosha processing sites. Finding features that can sufficiently separate the genuine Drosha processing sites from the neighboring ones was therefore our prime interest, and it is for this purpose that we brought in the Shannon entropy. As shown in Figure 5, with the Shannon entropy, DroshaPSP can clearly pinpoint the true processing site within its neighborhood.
Drosophila melanogaster was chosen as our study species because of its extensive annotation of Drosha processing sites on miRNAs. We did not compare DroshaPSP with Microprocessor SVM, because the parameters of the latter method were derived from human miRNAs, which have been reported to differ considerably from Drosophila melanogaster miRNAs; for example, Drosha has different cleavage partners in human and Drosophila. Thus, a direct comparison of two prediction models derived from these two distinct species would be unfair.
It is noteworthy that the purpose of HairpinSVM, the first layer of DroshaPSP, is not to scan the given sequence for pre-miRNAs, but to select the pre-miRNA like hairpin structures from all the RNA folding results of the given sequence. HairpinSVM therefore cannot be replaced by other pre-miRNA prediction programs. In order to classify the pre-miRNA like hairpin structures clearly, the negative samples should be chosen carefully. Our negative samples are close to the positive samples in location and sequence but have clearly different hairpin structures, which makes them very suitable and leads to good performance in the first layer classification.
Although our proposed two-layer SVM method has high prediction accuracy, it is rather time-consuming, because a large amount of folding work is done by RNAfold, which is computationally demanding. For example, predicting a 180nt sequence requires more than 3 minutes. This shortcoming limits its application to large datasets.
In the future, we will try to cut down the run time by changing the programming language and to improve the prediction accuracy of DroshaPSP with more structure features, including the structure, base probability and entropy for each site. We will also extensively evaluate the performance of DroshaPSP with prediction models trained on Drosha processing sites from other species. In addition, we are planning to develop a stand-alone implementation with a parallel computation option for Drosha processing site recognition on different OS platforms.
In conclusion, we developed a Drosha processing site prediction program, called DroshaPSP, which is composed of two SVM-based classifiers, HairpinSVM and DroshaSVM. HairpinSVM achieved an MCC of 0.88, and DroshaSVM was even better, with an MCC of 0.94. For the overall DroshaPSP program, MCC reached 0.86 while SN was equal to 0.86 and SP was over 0.99. We brought the Shannon entropy into the feature set of DroshaPSP for the first time and gained a substantial improvement. The Shannon entropy helped DroshaSVM to distinguish the true processing site from its neighborhood.
This study was supported by the Ministry of Education of China (20050487037), the Program for New Century Excellent Talents in University (NCET-060651), the National Platform Project of China (2005DKA64001), the National Natural Science Foundation of China (90608020 and 30971642), and Natural Science Foundation of Hubei Province of China (2009CDA161).
The publication costs for this article were funded by the Ministry of Education of China (20050487037), the Program for New Century Excellent Talents in University (NCET-060651), the National Platform Project of China (2005DKA64001), the National Natural Science Foundation of China (90608020 and 30971642), and Natural Science Foundation of Hubei Province of China (2009CDA161).
This article has been published as part of BMC Systems Biology Volume 7 Supplement 4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S4.
- Bartel DP: MicroRNAs: Genomics, biogenesis, mechanism, and function (Reprinted from Cell, vol 116, pg 281-297, 2004). Cell. 2007, 131 (4): 11-29.
- Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM: Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005, 433 (7027): 769-773. 10.1038/nature03315.
- Vasudevan S, Tong Y, Steitz JA: Switching from repression to activation: microRNAs can up-regulate translation. Science. 2007, 318 (5858): 1931-1934. 10.1126/science.1149460.
- Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Research. 2008, 36: D154-D158. 10.1093/nar/gkn221.
- Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116 (2): 281-297. 10.1016/S0092-8674(04)00045-5.
- Cheng AM, Byrom MW, Shelton J, Ford LP: Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res. 2005, 33 (4): 1290-1297. 10.1093/nar/gki200.
- Harfe BD: MicroRNAs in vertebrate development. Curr Opin Genet Dev. 2005, 15 (4): 410-415. 10.1016/j.gde.2005.06.012.
- Wienholds E, Kloosterman WP, Miska E, Alvarez-Saavedra E, Berezikov E, de Bruijn E, Horvitz HR, Kauppinen S, Plasterk RH: MicroRNA expression in zebrafish embryonic development. Science. 2005, 309 (5732): 310-311. 10.1126/science.1114519.
- Okamura K, Hagen JW, Duan H, Tyler DM, Lai EC: The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell. 2007, 130 (1): 89-100. 10.1016/j.cell.2007.06.028.
- Han JJ, Lee Y, Yeom KH, Kim YK, Jin H, Kim VN: The Drosha-DGCR8 complex in primary microRNA processing. Genes & Development. 2004, 18 (24): 3016-3027. 10.1101/gad.1262504.
- Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S, et al.: The nuclear RNase III Drosha initiates microRNA processing. Nature. 2003, 425 (6956): 415-419. 10.1038/nature01957.
- Vermeulen A, Behlen L, Reynolds A, Wolfson A, Marshall WS, Karpilow J, Khvorova A: The contributions of dsRNA structure to Dicer specificity and efficiency. RNA. 2005, 11 (5): 674-682. 10.1261/rna.7272305.
- Park J-E, Heo I, Tian Y, Simanshu DK, Chang H, Jee D, Patel DJ, Kim VN: Dicer recognizes the 5' end of RNA for efficient and accurate processing. Nature. 2011, 475 (7355): 201-205. 10.1038/nature10198.
- Feng Y, Zhang X, Song Q, Li T, Zeng Y: Drosha processing controls the specificity and efficiency of global microRNA expression. Biochim Biophys Acta. 2011, 1809 (11-12): 700-707. 10.1016/j.bbagrm.2011.05.015.
- Kadener S, Rodriguez J, Abruzzi KC, Khodor YL, Sugino K, Marr MT, Nelson S, Rosbash M: Genome-wide identification of targets of the drosha-pasha/DGCR8 complex. RNA. 2009, 15 (4): 537-545. 10.1261/rna.1319309.
- Helvik SA, Snove O, Saetrom P: Reliable prediction of Drosha processing sites improves microRNA gene prediction. Bioinformatics. 2007, 23 (2): 142-149. 10.1093/bioinformatics/btl570.
- Huynen M, Gutell R, Konings D: Assessing the reliability of RNA folding using statistical mechanics. J Mol Biol. 1997, 267 (5): 1104-1112. 10.1006/jmbi.1997.0889.
- Freyhult E, Gardner PP, Moulton V: A comparison of RNA folding measures. BMC Bioinformatics. 2005, 6: 241. 10.1186/1471-2105-6-241.
- Hu X, Zhou Y, Ma C: Recognizing drosha processing sites by a two-step prediction model with structure and sequence information. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 1-4. 10.1109/BIBM.2012.6392714.
- Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al.: The Ensembl genome database project. Nucleic Acids Research. 2002, 30 (1): 38-41. 10.1093/nar/30.1.38.
- Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. 1992: ACM. 1992, 144-152.
- Burges CJC: A tutorial on Support Vector Machines for pattern recognition. Data Min Knowl Discov. 1998, 2 (2): 121-167. 10.1023/A:1009715923555.
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2 (3): 27.
- McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research. 2004, 32: W20-W25. 10.1093/nar/gkh435.
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly. 1994, 125 (2): 167-188. 10.1007/BF00818163.
- Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000, 16 (5): 412-424. 10.1093/bioinformatics/16.5.412.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.