Predicting protein-binding regions in RNA using nucleotide profiles and compositions

Background Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. Results We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remains high (87.6% accuracy and 0.752 MCC). In testing the model on independent datasets, it achieved an accuracy of 82.2% and an MCC of 0.656. Testing of our model and other state-of-the-art methods on a same dataset showed that our model is better than the others. Conclusions Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. But, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0386-4) contains supplementary material, which is available to authorized users.


Background
Interactions between protein and RNA molecules are essential to various cellular processes, such as post transcriptional gene regulation, translation, and alternative splicing [1]. Many studies have been conducted to identify RNA-binding proteins (RBPs) or binding sites in protein and RNA molecules. In particular, recent advances in high-throughput experimental technologies, including next-generation sequencing technologies and crosslinking and immunoprecipitation (CLIP), have accelerated the discovery of RBPs and their target RNAs. Despite the increased number of known RBPs and their target RNAs, the mechanism of protein-RNA interactions is not fully uncovered and a large number of RBPs and their target RNAs remain to be uncovered. For example, for the ∼ 20, 500 protein-coding genes in humans, only 1,542 RBPs (7.5%) and their target RNAs have been identified so far [2].
As a complement to experimental methods, several computational methods have been proposed, which are largely motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms. Computational methods in general are much less time-consuming and costly than experimental methods.
Most existing computational methods are primarily limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. For instance, BindN+ [3], an upgraded version of BindN [4], uses a support vector machine (SVM) to predict the RNA-or DNA-binding residues from biochemical features and evolutionary information of protein sequences. RNABindRPlus [5] also predicts RNA-binding residues in a protein sequence by combining predictions from an optimized SVM and those from a sequence homology method. aaRNA [6] predicts RNA binding residues in protein using sequence-and structure-based features.
Compared to the task of predicting RNA-binding sites in proteins, predicting protein-binding sites in RNA is more challenging for several reasons [7]. Until very recently, there were few computational methods that can predict protein-binding sites in RNA. catRAPID estimates the binding propensity of RNA and protein molecules by combining secondary structure, hydrogen bonding and van der Waals contributions [8]. It often predicts an entire RNA sequence as a binding site even for an RNA sequence of 50 or more nucleotides. DeepBind [9] is known to outperform state-of-the-art experimental and computational methods. It uses deep convolutional neural networks, trained on a huge amount of data from high-throughput experiments. For the problem of predicting RBP-binding sites in RNA sequences, DeepBind was trained on data from RNAcompete, CLIP-seq and RIP-seq [10]. It contains ∼ 200 distinct models, each for different RBPs, so the user should try all of them in the absence of prior information on RBP. As output, it only provides a predictive binding score without protein-binding sites in the input RNA sequence. A new prediction model called PRIdictor [11,12] predicts binding sites in RNA and protein sequences at the nucleotide and residue level. Wong et al. [13] developed a method that predicts interacting nucleotides and residues between DNA and proteins.
In this paper, we propose a new method for predicting protein-binding regions in mRNA, which are associated with post-transcriptional regulation of gene expression. The method uses sequence profiles constructed from logodds scores of mono-and di-nucleotides and sequence compositions of mono-, di-and tri-nucleotides. As shown in the paper, the proposed method showed a high performance in testing on a large number of human RNA sequences and was substantially better than other methods. The rest of the paper presents the details of our approach and its experimental results.

Datasets
We obtained protein-binding sites in RNAs from CLIPdb [14], which provides curated published CLIP-seq data sets for four species (human, mouse, worm, and yeast). To obtain a sufficient amount of reliable data, we restricted the data to those binding regions of 25 nucleotides in '+' strands of human mRNAs, which were identified by PAR-CLIP technology [15] and have the binding affinity score > 0.9 in PARalyzer [16]. Human mRNAs were selected against others because the largest amount of RBP binding sites is known in human mRNAs. Different RBPs are known to have different binding preferences within an mRNA. We examined the type of RBP binding regions in the extracted human mRNAs by mapping the Ensembl transcripts to the GRCh37 assembly. Coding sequence (CDS) regions of mRNA are the most frequent binding regions of RBPs, followed by 3 UTR (Additional file 1).
The reason for selecting 25 nucleotides as the size of a binding region is because protein-binding regions identified by PAR-CLIP are typically between 21 and 35 nucleotides in length, and binding regions of 25 nucleotides resulted in the larger amount of data from CLIPdb than other choices for the size (see Additional file 2 for the distribution of the length of RBP-binding regions). After extracting a total of 5,145 RBP-binding regions for 14 RBPs, we assembled RNA sequences using the reference human genome GRCh37/hg19. These RNA sequences were used as positive data in our study (Additional file 3). RBP sequences were obtained from NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/).
For negative data, we selected 51,450 (10-fold of the positive data) non-binding regions of 25 nucleotides in the same reference human genome GRCh37/hg19. The human genome contains more non-binding regions than protein-binding regions, so we constructed several datasets with different ratios of binding to non-binding regions (called 1:1, 1:2, 1:4, 1:6, 1:8 and 1:10 datasets hereafter).
In order to remove redundancy in the datasets, we first executed CD-HIT-EST [17] on each of the six datasets (1:1, 1:2, 1:4, 1:6, 1:8 and 1:10 datasets) and removed those with a sequence similarity of 80% or higher. After removing similar sequences, 4372 sequences out of the 5,145 RBP-binding sequences were left. The remaining 4372 RBP-binding sequences were partitioned into two datasets: training dataset (70% of the remaining RBPbinding sequences) and test dataset (30%). Thus, there are no similar RNA sequences between training and test datasets and within training or test datasets. Table 1 shows the number of sequences in the training and test datasets with different ratios of positive to negative instances. Since the redundancy removal was enforced separately in the 1:1, 1:2, 1:4, 1:6, 1:8 and 1:10 datasets, the ratio of positive to negative instances may not be exactly 1 : n (n = 1, 2, 4, 6, 8, 10) (see Additional files 4 and 5).
The PWM of mono-nucleotides, also known as position specific score matrix (PSSM) or sequence profile, is frequently used with slightly different definitions [3,18]. We computed PWM + and PWM − from a training dataset of protein-binding sequences and non-binding sequences, respectively (see Fig. 1). Each element of PWM + and PWM − represents the frequency of i-th nucleotide (i is any one of A, C, G and U) in the j-th position of RNA of n nucleotides. We combined PWM + and PWM − of a training dataset into mPWM by Eq. 1, which represents the log-odds score the i-th nucleotide in the j-th position.
The PWM of di-nucleotides (dPWM) is less commonly used than PWM of mono-nucleotides, but can elucidate higher order structures of protein-binding sequences. We built dPWM in a similar way to mPWM. We first constructed dPWM + and dPWM − from a training dataset of protein-binding sequences and non-binding sequences, respectively. Each element of dPWM + and dPWM − represents the frequency of the di-th di-nucleotide (di is any one of AA, AC, . . . , UU) in the j-th position (j = 1, 2, . . ., n − 1) of RNA of n nucleotides. dPWM + and dPWM − of a training dataset were combined into dPWM, which represents log-odds score the di-th di-nucleotide in the j-th position. The same mPWM and dPWM generated from a training dataset were used in both training and testing the prediction model.
In addition to the position weight matrices of two types, we computed nucleotide compositions of three types: mono-nucleotide composition (mC), di-nucleotide composition (dC) and tri-nucleotide composition (tC). Thus, a single RNA sequence of n nucleotides is represented in a feature vector with 2n + 83 elements (n elements for mPWM, n − 1 elements for dPWM, and 84 elements for nucleotide compositions). For a sequence of 25 nucleotides, a single feature vector contains 133 elements (see Fig. 2 for the structure of a feature vector).

Protein features
To represent a protein sequence, 20 amino acids are first  [19]. Every amino acid in each protein sequence is transformed into an index representing Since similar sequences were removed separately in each 1:n dataset, the number of negative data (N) is not an exact multiple of the number of positive data (P) Fig. 1 Construction of mono-nucleotide position weight matrix (mPWM). Both binding and non-binding sequences are used to generate an mPWM, in which each element (i, j) represents the log-odds score of the i-th nucleotide (i =A, C, G and U) in the j-th position (j = 1, 2, . . ., sequence length n). F in PWM + , PWM − and mPWM denotes the frequency of a nucleotide at a position an amino acid group. For each protein sequence, the composition, transition, and distribution of amino acid groups are represented in a feature vector [19]. The composition is the normalized frequency of each group in the protein sequence. The transition is the normalized frequency of transition between each group in the protein sequence. The distribution is the normalized position of the first, 25, 50, 75 and 100%-th amino acid of each group in the protein sequence. A protein sequence is represented by a feature vector with 63 elements (7 compositions, 21 transitions, and 35 distributions). Thus, a model that predicts RBP binding sites using both RNA and proteins features require 63 more elements in a feature vector than that using RNA features only.

Prediction model
We built a support vector machine (SVM) model using a library for support vector machine (LIBSVM) [20]. As a kernel the radial basis function (RBF) was selected instead of the linear kernel because the number of instances (> 100,000 RNA sequences) in our dataset is much larger than the number of features (≈ 200). Besides, it is known that there is no need to consider linear SVM if complete model selection has been conducted using the Gaussian kernel [21]. The SVM model with the RBF kernel has two parameters, cost (C) and γ . We determined the best parameter values (C = 32 and γ = 0.0078125) by running the grid search tool of LIBSVM on the training dataset. Unless specified otherwise, all the results shown in this paper were obtained with C = 32 and γ = 0.0078125.
For comparative purposes, we also built another model using WEKA random forest (http://www.cs.waikato.ac. nz/ml/weka/). As discussed later in the Result section, the SVM model was chosen as the final model for the web server after it was compared with the random forest model. The results of the random forest model shown in this paper were obtained with 60 trees and 25 features, which resulted in the best performance.

Evaluation of the model
The performance of the SVM and random forest models was evaluated using six measures: sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and Matthews correlation coefficient (MCC), which are defined as follows.
True positives (TP), true negatives (TN), false positives (FP), and false negative (FN) represent correctly predicted binding regions, correctly predicted non-binding regions, non-binding regions that are incorrectly predicted as binding, and binding regions that are incorrectly predicted as non-binding, respectively.
As described above, our prediction model uses PWM of two types and nucleotide compositions as RNA features. To examine the contribution of the features to the prediction performance, we tried different combinations of features in 10-fold cross validation.
We evaluated the model in several different ways. First, we performed two types of cross validation: (1) standard 10-fold cross validation with six different training datasets (1:1, 1:2, 1:4, 1:6, 1:8 and 1:10 training datasets) and (2) leave-one-protein-out (LOPO) cross validation [22] with the 1:1 training dataset. The reason for performing LOPO cross validation is because typical k-fold cross validation tends to over-estimate predictive performances for paired inputs such as protein-protein interactions (PPIs) or protein-RNA interactions. Recently Park and Marcottee [23] and Hamp and Rost [24] have demonstrated that both standard and refined cross validations lead to inflated accuracy of PPI prediction methods. In LOPO cross validation with respect to RBPs, all RNA sequences (both RBP-binding and non-binding sequences) for one RBP are taken out for testing and remaining RNA sequences are used for training.
In addition to cross validations of two types, we also tested the SVM model on independent datasets, which were not used in training the model. We also compared our SVM model with DeepBind [9] and catRAPID [8] using another test dataset. Out of the 14 RBPs used in our study, DeepBind provides 7 distinct models, one for each of 7 RBPs (FUS, FXR1, FXR2, IGF2BP2, LIN28A, QKI, TARDBP). For a fair comparison, we extracted new 700 RBP-binding regions of 25 nucleotides from CLIPdb (100 RBP-binding regions for each of the 7 RBPs). To remove redundancy between the 700 RNA sequences and the training dataset, we executed CD-HIT-EST-2D on them with a cut-off value of 0.8. (see Table 2 for the number of remaining RNA sequences after running CD-HIT-EST-2D).
Since catRAPID requires an RNA sequence of at least 50 nucleotides, we extended the RBP-binding regions by including 13 nucleotides on each side of the binding regions in their original genome sequences. Redundancy between the extended RNA sequences and the training dataset was removed by running CD-HIT-EST-2D on them with a cut-off value of 0.9 because instead of 0.8 since the cut-off value of 0.8 removed too many RNA sequences (see Table 3 for the number of remaining RNA sequences after running CD-HIT-EST-2D). As negative data for the 700 RNA sequences, we extracted additional 100 non-binding regions of 25 and 51 nucleotides in the reference human genome GRCh37/hg19. Table 4 compares different combinations of features in 10-fold cross validation of our SVM model with the 1:1 training dataset. Among the single features, mPWM and dPWM were much better than nucleotide compositions. With mPWM or dPWM alone, the SVM model achieved an accuracy above 89% and an MCC above 0.79. This result indicates that mPWM and dPWM are very powerful features in predicting protein-binding regions in RNA sequences. Compared to using single features alone, using two different features resulted in performance improvement in sensitivity, accuracy, NPV and MCC. Nucleotide compositions alone achieved a much lower performance than sequence profiles of log-odds scores of mono-nucleotides and those of di-nucleotides, but performance gain was obtained with combination of nucleotide compositions and sequence profiles (sensitivity of 91.61%, specificity of 92.39%, accuracy of 92.02%, PPV of 91.69%, NPV of 92.31% and MCC of 0.840). Table 5   The specificity of our method is the same for all RBPs because it used a same set of negative data for all RBPs with a single model, whereas DeepBind has distinct models for each RBP dataset with 1:1 ratio of positive to negative instances. In particular, PPV and MCC were significantly decreased as the ratio of negative instances was increased. But, NPV was rather increased slightly.

Cross validations
As the dataset contains more negative instances, sensitivity, PPV and MCC of the random forest model were decreased. In particular, it showed a substantial decrease in sensitivity. Since there are much more non-binding sites than binding sites in actual RNA sequences, we determined that finding all possible binding sites at the expense of low PPV is better than missing the binding sites. Thus, we selected the SVM model as the final model for the web server.
As stated earlier, the SVM model with the RBF kernel is known to be better than the SVM with linear kernel when the number of instances is much larger than the number of features. For comparative purposes, we built an SVM model with linear kernel and performed 10-fold cross validation of the model (Additional file 6). The SVM model with linear kernel showed a slightly lower performance than the SVM model with the RBF kernel.
Our SVM model uses the protein sequence as an additional information when it is available. Additional file 7 shows the results of 10-fold cross validation of the SVM model when it is given a protein sequence in addition to an RNA sequence. The best performance was observed in the balanced dataset with 1:1 ratio of positive to negative instances (sensitivity of 93.18%, specificity of 92.01%, accuracy of 92.57%, PPV of 91.44%, NPV of 93.64% and MCC of 0.851).
Results of LOPO cross validation with respect to RBPs in the 1:1 training dataset are shown in Table 6. Since different RBPs have very different numbers of known RBP-binding regions, we examined a weighted average of performance measures instead of a simple average of them. The weighted average was computed from the total values of TP, FP, TN and FN of all runs. In LOPO cross validation, the model showed a sensitivity of 85.54%, a specificity of 89.53%, an accuracy of 87.60%, a PPV of 88.42%, an NPV of 86.89% and an MCC of 0.752. This result indicates that LOPO cross validation of our SVM model obtained a lower performance than 10-fold cross validation, but its average performance is reasonably high.

Independent tests
For rigorous evaluation of our model, we tested it on independent datasets (30% of the entire data), which were not used in training the model. As in the 10-fold cross validation, we tested it on six test datasets with different ratios of positive to negative instances (called 1:1, 1:2, 1:4, 1:6,  1:8, and 1:10 test datasets hereafter). As shown in Table 7, the specificity, PPV and MCC were decreased as the ratio of negative instances was increased.
In particular, PPV and MCC were significantly decreased as the dataset contains more negative instances. This trend was also observed in 10-fold cross validation. However, other performance measures (sensitivity, accuracy, and NPV) were rather increased, and specificity was decreased slightly. Figure 3 shows the ROC curves of 10-fold cross validation and independent testing of the SVM models. In 10-fold cross validation, the SVM model with the RBF kernel yielded a slightly larger area under the ROC curve (AUC = 0.9732) than the SVM model with linear kernel (AUC = 0.9714). Likewise, in independent testing the SVM model with RBF kernel showed a slightly larger AUC (0.8912) than the SVM with linear kernel (0.8878).
Since the prediction model was trained with RBPbinding RNA sequences of 25 nucleotides, we examined whether it is applicable to RNAs of different sizes. For RNAs of k nucleotides (k < 25), we extracted a total of 12,576 RBP-binding RNAs from CLIPdb. When testing the model on each RNA sequence with < 25 nucleotides, we selected a position in the RNA sequence which Using all 3 features showed the best performance. mPWM: mono-nucleotide position weight matrix, dPWM: di-nucleotide position weight matrix, compositions: frequency of mono-nucleotides, di-nucleotides, and tri-nucleotides in the RNA sequence results in the maximum sum of log-odds scores from an ungapped alignment of the sequence with mPWM. Based on the selected position, we encoded both mPWM and dPWM features and filled zeros for matrix elements that have no corresponding nucleotides in the RNA sequence to make the size of the feature vector comparable to those for 25-mer RNAs. Nucleotide compositions of short RNA sequences were encoded in the same way as RNA sequences of 25 nucleotides. The prediction performance with short RNA sequences was lower than that with 25-mer RNAs, but its accuracy is as high as 74.4% (Additional file 8). We also tested the prediction model on RNA sequences with > 25 nucleotides, and details are discussed in the next section. Additional file 9 shows the change in accuracy of the model for RNA sequences with lengths between 21 and 40 nucleotides.
Without changing the original mPWM and dPWM, we tested our model for new RBPs that were not considered in constructing datasets. It showed a low performance for some RBPs but obtained a high performance for some RBPs (Additional file 10). The best performance was observed for HNRNPD (sensitivity of 94.29%, specificity of 94.37%, accuracy of 94.33%, PPV of 92.52%, NPV of 95.71% and MCC of 0.884).
A negative dataset in our study was constructed by random selection. For comparative purposes, we constructed different negative datasets by extracting a subsequence in the upstream region of each RBP binding region. We tried several different distances ranging from 1 to 1001 nucleotides between the negative instance and the positive instance (i.e., RBP binding region) in a same RNA sequence. The performance of our model with a new negative dataset was as high as that with the previous

Comparison with other methods
For the comparison with DeepBind and catRAPID, we prepared two new datasets of RBP-binding RNA sequences. The first test dataset consists of RNA sequences of 25 nucleotides extracted from CLIPdb. In the first dataset, similar sequences with any in the training dataset were removed by running CD-HIT-EST with a cut-off value of 0.8. The second test dataset was constructed by adding 13 nucleotides in the original genome sequence at both ends of the 25-mer RNAs in the first dataset. The reason that we could not use RBP-binding RNA sequences of 51 nucleotides in CLIPdb is because DeepBind does not provide a prediction model for RBPbinding RNA sequences of 51 nucleotides (DeepBind provides distinct models for each RBP). For negative data of the test datasets, we selected 100 non-binding regions of 25 and 51 nucleotides in the reference human genome GRCh37/hg19. When testing the model on each RNA sequence with > 25 nucleotides, we found a 25-mer subsequence of the RNA sequence which results in the maximum sum of log-odds scores from an alignment of the 25-mer subsequence with mPWM. In a feature vector, we encoded both mPWM and dPWM features of the selected 25-mer subsequence along with nucleotide compositions of the entire RNA sequence. Table 2 shows the results of testing our model and DeepBind on RBP-binding sequences for 7 RBPs. In predicting RBP-binding regions of 25 nucleotides, our model  Table 3, so it was tested on 10 RBP-binding sequences for each RBP. catRAPID showed low discriminative power (DP) values in most test cases. Since DP of catRAPID represents the interaction propensity of a protein-RNA pair with respect to the training sets [8], the result of testing catRAPID on RBP-binding sequences indicates a low confidence level of the prediction. Details of the RBP-binding sequences used for comparison of three methods and raw data obtained from execution of the three methods are available in Additional file 12.

Conclusion
In this paper we proposed a new computational method to predict protein-binding regions in mRNA sequences using sequence profiles constructed from log-odds scores of mono-and di-nucleotides and nucleotide compositions. The method has been implemented in SVM models and evaluated in several ways, including standard 10-fold cross validation on six datasets with different ratios of positive to negative instances, LOPO cross validation, and independent testing with six datasets of different ratios of positive to negative instances. We also compared our method with DeepBind and catRAPID using another test dataset.
Results of cross validation and independent testing of the method on actual RBP-binding regions in human mRNAs showed that sequence profiles of log-odds scores of mono-and di-nucleotides are much more powerful features than nucleotide compositions in finding proteinbinding regions in RNA sequences. Nucleotide compositions alone achieved a much lower performance than sequence profiles of log-odds scores of mononucleotides and those of di-nucleotides, but performance gain was obtained with combination of nucleotide compositions and sequence profiles. The best performance was observed in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a PPV of 91.7%, an NPV of 92.3% and an MCC of 0.84. 10-fold cross validation of RNA and protein sequence feature vector model with a balanced dataset achieved a sensitivity of 93.2%, a specificity of 92.0%, an accuracy of 92.6%, a PPV of 91.4%, an NPV of 93.6% and an MCC of 0.85. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remains high (sensitivity of 85.5%, specificity of 89.5%, accuracy of 87.6%, PPV of 88.4%, NPV of 86.9% and MCC of 0.752). In testing the model on independent datasets, it achieved a sensitivity of 72.5%, a specificity of 91.9%, an accuracy of 82.2%, a PPV of 89.9%, an NPV of 77.0% and an MCC of 0.66. Testing of our model and two other methods showed that our model is better than the others.
The results shown in this paper are preliminary, but demonstrate the potential of our method to predict RBPbinding regions in mRNA. Given that the average length of human mRNAs is about 2 kb and that different RBPs have different binding preferences within an mRNA, it is not straightforward to find RBP binding regions in mRNAs. A computational method like ours will help biologists save time and effort in designing and performing their in vivo or in vitro experiments to detect protein-RNA binding sites by narrowing down candidate binding regions on target RNAs.