Selecting high-quality negative samples for effectively predicting protein-RNA interactions

Background The identification of protein-RNA interactions (PRIs) is important for understanding cell activities. Recently, several machine learning-based methods have been developed for identifying PRIs. However, the performance of these methods is unsatisfactory. One major reason is that they usually use unreliable negative samples in the training process. Methods To boost the performance of PRI prediction, we propose a novel method to generate reliable negative samples. Concretely, we first collect known PRIs as positive samples to generate positive sets. For each positive set, we construct two corresponding negative sets: one by our method and the other by the random method. Each positive set is combined with a negative set to form a dataset for model training and performance evaluation. Consequently, we obtain 18 datasets of different species and different ratios of negative samples to positive samples. Second, sequence-based features are extracted to represent each PRI and protein-RNA pair in the datasets. A filter-based method is employed to cut down the dimensionality of the feature vectors and reduce computational cost. Finally, the performance of support vector machine (SVM), random forest (RF) and naive Bayes (NB) classifiers is evaluated on the 18 generated datasets. Results Extensive experiments show that, compared with using randomly-generated negative samples, all classifiers achieve substantial performance improvements by using the negative samples selected by our method. The improvements on accuracy and geometric mean for the SVM, RF and NB classifiers are as high as 204.5% and 68.7%, 174.5% and 53.9%, and 80.9% and 54.3%, respectively. Conclusion Our method is useful for the identification of PRIs.

A lot of effort has been put into the identification of PRIs using traditional experimental methods and post-experimental methods. As experimental methods consume more time and money than post-experimental methods, the latter are gaining more and more attention. There are mainly two categories of post-experimental methods: 1) structural & chemical-based methods and 2) computational methods.
The first category of methods attempts to analyze the interaction mechanism of protein and RNA at the structural and chemical levels. For example, Jones et al. [11] focused on analyzing protein-RNA complexes, and obtained the physical-chemical properties of RNA-binding residues and the distribution of atom-atom contacts within the complexes. With protein-RNA experimental data, Ellis et al. [12] presented a statistical analysis of the properties of residues binding to various functional RNAs. Besides, some function-based works [13,14] also discussed protein-RNA interactions.
As for computation-based methods, several machine learning techniques have been applied to identifying PRIs, such as random forest (RF), naive Bayes (NB) and support vector machine (SVM). Pancaldi et al. [15] used RF and SVM to identify PRIs by considering more than 100 properties of RNAs and proteins. In contrast, Muppirala et al. [16] used only protein and RNA sequence information for predicting interactions. Similarly, Wang et al. [17] improved naive Bayes (ENB) classifiers for predicting PRIs with only sequence data. Recently, we also proposed a learning method [18] that uses only positive and unlabeled samples for PRI prediction.
Compared with structural & chemical-based methods, computational methods are more efficient and effective. However, the performance of computational methods heavily depends on the quality of training datasets, which usually consist of positive samples and negative samples. Here, positive samples are not the problem; the difficulty lies in that we do not have experimentally-validated negative samples. Previous works [16,17] addressed this problem by randomly pairing RNAs and proteins and then removing those pairs included in the positive set. In this paper, we call this the random method or traditional method. Obviously, randomly-generated negative samples are not necessarily real negative samples, so the quality of random negative sets cannot be guaranteed. This unavoidably degrades the prediction performance of classifiers trained on datasets with random negative samples. This paper addresses how to select highly reliable negative samples to improve PRI prediction. To this end, we present an effective method called FIRE (the abbreviation of FInding Reliable nEgative samples). The basic idea of our method is as follows: given a known PRI between protein i and RNA j, the more different a protein k is from protein i, the less likely it is that protein k interacts with RNA j.
We first construct positive sets using known PRIs. Given a positive set, we establish two negative sets: one by the random method and the other by our method. The positive set is combined with each of the two negative sets to form a dataset for model training and performance evaluation. In such a way, we construct 18 datasets of different species and different ratios of negative samples to positive samples. Then, we extract the features of each pair of protein and RNA. Here, each feature is composed of a conjoint triad of vicinal amino acids and k nucleotide bases. To cut down the computational cost, a filter-based feature selection method is employed to reduce the dimensionality of the feature vectors. Finally, we conduct extensive experiments to evaluate the proposed method by training and testing SVM, RF and NB classifiers on the 18 datasets. The experimental results show that these classifiers perform much better using the negative samples generated by our method than using random negative samples.

Methods
We collected non-redundant known PRIs as positive samples, and generated 18 datasets based on our method and the random method, which were used to evaluate the performance of PRI prediction by SVM, RF and NB classifiers. Figure 1 illustrates the procedure of our method, which consists of five steps: 1) generating negative datasets by using our method FIRE and the random method; 2) constructing feature vectors for each protein-RNA pair; 3) reducing the dimension of the feature vectors; 4) training classifiers; 5) performance evaluation.

Datasets
We constructed 9 non-redundant positive PRI sets from PRIDB [19] and NPInter [20], 9 reliable negative sets based on the positive sets and the STRING [21] database by our method, and 9 random negative sets with the random method. The procedure for negative sample construction will be detailed later. Each positive set is merged with a negative set to construct a PRI dataset; consequently, 18 PRI datasets in total are constructed. PRIDB is a database of protein-RNA interfaces calculated from protein-RNA complexes in PDB [22]. NPInter is a comprehensive database covering eight categories of functional interactions between proteins and noncoding RNAs of six model organisms: Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens, Mus musculus and Saccharomyces cerevisiae. STRING (Search Tool for the Retrieval of Interacting Genes) is an updated online database resource that provides uniquely comprehensive coverage of, and easy access to, both experimental and predicted protein-protein interaction (PPI) information.

Fig. 1 The framework of this work. Here, rectangles are executive modules, and parallelograms are data modules.
The 18 datasets are divided into 3 groups. The first group of datasets (denoted group 1) contains 336 experimentally-validated PRIs that are used as positive samples, which are related to the six organisms above and constructed from the NPInter and STRING databases. This group consists of six sub-datasets (named by SO) as follows:
1. The first sub-dataset (SO_reliable1:1) contains 168 positive samples and 168 reliable negative samples generated by our method; the ratio of positives to negatives is 1:1;
2. The second sub-dataset (SO_reliable2:1) contains 336 positive samples and 168 reliable negative samples; the ratio is 2:1;
3. The third sub-dataset (SO_reliable1:2) contains 168 positive samples and 336 reliable negative samples; the ratio is 1:2;
4. The fourth sub-dataset (SO_random1:1) contains 168 positive samples and 168 random negative samples generated by the random method; the ratio of positives to negatives is 1:1;
5. The fifth sub-dataset (SO_random2:1) contains 336 positive samples and 168 random negative samples; the ratio is 2:1;
6. The last sub-dataset (SO_random1:2) contains 168 positive samples and 336 random negative samples; the ratio is 1:2.

Construction of random negative samples
Previous works [16,17] randomly select negative samples; the underlying hypothesis is: if there is no validated interaction between a protein and an RNA, then the protein and the RNA constitute a negative sample. Obviously, this hypothesis is not completely reasonable. The flowchart for generating random negative samples is shown in Fig. 2.
In Fig. 2, the major steps of the random method are as follows: 1. Each PRI extracted from PRIDB and NPInter is included in the positive set. From the positive set, we obtain a set P of proteins and a set R of RNAs; each protein/RNA in P/R is involved in at least one positive PRI.
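The random method described above can be sketched in a few lines of Python. This is our own minimal illustration, not the paper's code; the function name, the fixed random seed, and the use of only proteins/RNAs that appear in the positive set are assumptions for the sake of a self-contained example.

```python
import random

def random_negatives(positives, m, seed=0):
    """Randomly pair proteins and RNAs drawn from the positive set,
    discarding any pair that is already a known interaction, until
    m negative samples have been collected."""
    proteins = sorted({p for p, _ in positives})   # set P
    rnas = sorted({r for _, r in positives})       # set R
    known = set(positives)
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < m:
        pair = (rng.choice(proteins), rng.choice(rnas))
        if pair not in known:                      # remove pairs in the positive set
            negatives.add(pair)
    return sorted(negatives)
```

Note that the loop assumes m does not exceed the number of non-positive protein-RNA combinations; a production version would check this first.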

Construction of reliable negative samples
The basic idea of our method is as follows: for an experimentally-validated PRI between protein p and RNA r, r is highly likely to interact with any protein p′ that is similar to p; on the contrary, if protein p′ is dissimilar to p, there is low possibility that p′ interacts with r. Based on this idea, we propose the method FIRE to construct reliable negative PRIs. The flowchart of FIRE is shown in Fig. 3. Concretely, for each positive PRI (p, r), we try to find a protein p′ that is as dissimilar as possible to p. If (p′, r) is not an experimentally-validated PRI, then it is selected as a negative PRI. We first compute the similarity between each pair of proteins based on three different data sources, then combine these similarity scores into a final score that measures the similarity between the two proteins. Details are given in the "Protein-protein similarity computation" section.
The procedure of our method FIRE is as follows: 1. Construct the positive set PS of PRIs based on the PRIDB and NPInter databases, and compute the similarity matrix SP of the proteins involved in PS as in the "Protein-protein similarity computation" section. 2. For each protein p_i and RNA r_j that do not form a positive PRI in PS, i.e., (p_i, r_j) ∉ PS, compute a score between p_i and r_j as follows: (a) If protein p_k (k ≠ i) and r_j form a PRI in the positive PRI set PS, then the score SPR_ijk, indicating the confidence of (p_i, r_j) being a positive PRI via protein p_k, is evaluated via SP_ik, the similarity between p_i and p_k. (b) As there may be multiple (say n) positive PRIs involving r_j in PS, we aggregate the scores SPR_ijk over all positive PRIs (p_k, r_j) (k ≠ i, k = 1..n) into an overall score SPR_ij; the smaller SPR_ij, the more likely (p_i, r_j) is a negative PRI. 3. Sort all candidate pairs (p_i, r_j) by their scores SPR_ij in increasing order; the top-m protein-RNA pairs in the sorted list are taken as negative PRIs when m negative PRIs are to be generated.
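The FIRE procedure above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the function name is ours, and the aggregation operator (a plain sum of the similarities SP_ik over the proteins known to bind r_j) is our assumption, since the paper's exact aggregation formula is not reproduced in this text.

```python
def fire_negatives(positives, sim, m):
    """Sketch of FIRE negative selection. `sim` is a nested dict giving
    the protein-protein similarity matrix SP. Each non-positive pair
    (p_i, r_j) is scored by aggregating (here: summing, an assumption)
    the similarities between p_i and every protein known to bind r_j;
    the m lowest-scoring pairs are returned as negatives."""
    proteins = sorted({p for p, _ in positives})
    partners = {}                       # RNA r_j -> proteins known to bind it
    for p, r in positives:
        partners.setdefault(r, set()).add(p)
    known = set(positives)
    scored = []
    for r, binders in partners.items():
        for p in proteins:
            if (p, r) in known:
                continue                # skip validated positives
            spr = sum(sim[p][k] for k in binders if k != p)
            scored.append((spr, p, r))
    scored.sort()                       # increasing score = least likely to interact
    return [(p, r) for _, p, r in scored[:m]]
```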

Protein-protein similarity computation
We compute the similarity between any two proteins involved in the positive set based on three types of data sources: sequence information, functional annotations and protein domains. The computed similarities are called sequence similarity, functional annotation semantic similarity and protein domain similarity, and are merged to obtain the final similarity of the two proteins. Sequence similarity (SS). Protein sequences are obtained from the UniProt database [23]. We compute the sequence similarity between two proteins using a normalized version of the Smith-Waterman score [24]. The normalized Smith-Waterman score between two proteins p_i and p_j is nsw(p_i, p_j) = sw(p_i, p_j)/sw(p_j, p_j), where sw(·,·) denotes the original Smith-Waterman score. Applying this operation in both directions to the protein pair p_i and p_j, we obtain their sequence similarity SS(p_i, p_j) = (nsw(p_i, p_j) + nsw(p_j, p_i))/2.
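A minimal sketch of the normalized Smith-Waterman similarity follows. The local-alignment scoring parameters (match = 2, mismatch = -1, linear gap = -1) are illustrative choices of ours, not those of the paper; only the normalization nsw(p_i, p_j) = sw(p_i, p_j)/sw(p_j, p_j) and the symmetrized SS score come from the text.

```python
def sw(a, b, match=2, mismatch=-1, gap=-1):
    """Plain Smith-Waterman local alignment score with a linear gap
    penalty (scoring parameters are illustrative assumptions)."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def nsw(a, b):
    # Normalize by the self-alignment score of the second sequence.
    return sw(a, b) / sw(b, b)

def seq_similarity(a, b):
    # Symmetrized normalized Smith-Waterman score SS(p_i, p_j).
    return (nsw(a, b) + nsw(b, a)) / 2
```

In practice one would use an established aligner rather than this toy dynamic-programming routine, but the normalization step is the part specific to the method.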
Functional annotation semantic similarity (FS). GO annotations are downloaded from the GO database [25]. The semantic similarity between each pair of proteins is calculated based on the overlap of the GO terms associated with the two proteins [26]. All three GO categories are used in the computation, as similar RNAs are expected to interact with proteins that act in similar biological processes, have similar molecular functions or reside in similar cell compartments. We compute the Jaccard value [27] with respect to the GO terms of each pair of proteins as their similarity. The Jaccard score between the term sets t_i and t_j of proteins p_i and p_j is defined as |t_i ∩ t_j|/|t_i ∪ t_j|, i.e., the ratio of the number of GO terms shared by the two proteins to the total number of their terms; this value is used as the functional annotation semantic similarity FS(p_i, p_j).
Protein domain similarity (DS). Protein domains are extracted from the Pfam database [28]. Each protein is represented by a domain fingerprint (a binary vector) whose elements encode the presence or absence of each retained Pfam domain by 1 or 0, respectively. We compute the Jaccard value of any two proteins p_i and p_j based on their domain fingerprints as their similarity DS(p_i, p_j).
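Both FS and DS reduce to Jaccard computations, on GO term sets and on binary domain fingerprints respectively; a small sketch (function names are ours):

```python
def jaccard(s1, s2):
    """Jaccard index |s1 ∩ s2| / |s1 ∪ s2| between two sets,
    e.g. the GO term sets of two proteins (FS)."""
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def domain_similarity(fp1, fp2):
    """Jaccard value of two binary Pfam domain fingerprints (DS):
    shared domains divided by domains present in either protein."""
    inter = sum(1 for a, b in zip(fp1, fp2) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(fp1, fp2) if a == 1 or b == 1)
    return inter / union if union else 0.0
```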
For proteins p_i and p_j, we compute the aggregated similarity AS(p_i, p_j) by merging the three different similarity measures SS, FS and DS above.
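The paper's exact combination formula for AS is not reproduced in this text; purely for illustration, the sketch below assumes a weighted mean with equal weights, which is one common way to merge bounded similarity scores. Both the function name and the equal weighting are our assumptions.

```python
def aggregated_similarity(ss, fs, ds, weights=(1/3, 1/3, 1/3)):
    """Merge sequence (SS), functional annotation (FS) and domain (DS)
    similarities into one aggregated score AS(p_i, p_j).
    NOTE: the equal-weight mean is an illustrative assumption, not the
    paper's published formula."""
    w1, w2, w3 = weights
    return w1 * ss + w2 * fs + w3 * ds
```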

PRI feature vectors
Existing works [29][30][31] found that properties of amino acids are effective in protein classification. To reduce the dimensionality of protein representation, Shen et al. [32] classified the 20 amino acid residues into seven classes according to their physicochemical properties, and proposed the concept of conjoint triads to represent protein properties. Wang et al. [17] further reduced the dimension of the feature vector by encoding the 20 amino acid residues into four classes: {DE}, {HRK}, {CGNQSTY} and {AFILMPVW}. In this work, we use the same strategy for encoding protein sequences.

Feature value computation
In order to discriminate the significance of different types of features in a feature vector, we introduce the concept of concentration of different features. Denote the number of unique (k + 3)-mers of type i as N_i. The concentration of type i is the ratio of N_i to the total number of unique (k + 3)-mers, that is, C_i = N_i / (N_1 + N_2 + N_3). For example, the number of unique 6-mers is 64 × 4^3 = 4096. The total number of unique (k + 3)-mers used in this study is 5376, therefore the concentration of 6-mers is C_3 = 4096/5376 = 0.762. Then, the elements of a feature vector are calculated by f_j = C_i × t_j, where t_j is the occurrence frequency of a certain unique (k + 3)-mer of type i. A feature vector contains 5376 dimensions, each of which corresponds to a unique (k + 3)-mer of a certain type i (i = 1, 2 and 3). Within a vector, the dimensions are arranged in the order of 6-mers, 5-mers and 4-mers. Then f_i is further normalized to ff_i = (f_i - f_min)/(f_max - f_min), (5) where f_max and f_min denote the maximum and the minimum of all f_j (j = 1, 2, . . . , 5376), respectively.

Feature reduction
In order to reduce the computational cost, we employed a filter-based method for cutting down the dimension of the feature vectors. For the i-th feature ff_j(i) of the j-th vector, let F(i)_p and F(i)_n denote its occurrence frequency in the positive and negative sample sets respectively, calculated as F(i)_p = (1/N) Σ_j ff_j(i) over the positive samples and F(i)_n = (1/M) Σ_j ff_j(i) over the negative samples, where N and M are the numbers of positives and negatives in the dataset. F(i)_p and F(i)_n are further normalized to FF(i)_p and FF(i)_n as in Eq. (5), and then the final score of each feature is defined as FScore(i) = FF(i)_p - FF(i)_n. Our objective is to choose those discriminative features that either frequently occur in the positive set but seldom occur in the negative set, or frequently occur in the negative set but rarely occur in the positive set. In such a way, we choose the features that help us distinguish positive samples from negative samples.
As FScore(i) measures the relative enrichment of the i-th feature in the positives over the negatives, it can be regarded as an indicator of the usefulness of the i-th feature. Based on the calculated FScore values, the most "useful" features, i.e., those with the largest or smallest FScore values, are selected to represent the PRI pairs. Suppose that we reduce the PRI vectors to k dimensions; then we select the k/2 features with the largest FScore values and the k/2 features with the smallest FScore values to represent the k-dimensional PRI vectors. In our work, k is set to 1000.
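The filter-based selection above can be sketched as follows. The definition of FScore as the difference of the normalized positive and negative mean frequencies is our reading of the (elided) original equation, consistent with the fact that both the largest and the smallest scores are kept; the function name is ours.

```python
def select_features(pos_vectors, neg_vectors, k=1000):
    """Filter-based reduction: per feature i, compute its mean frequency
    in the positive and negative sets, min-max normalize both profiles,
    score by their difference (our reading of FScore), and keep the k/2
    largest- and k/2 smallest-scoring feature indices."""
    dim = len(pos_vectors[0])

    def mean_freq(vectors, i):
        return sum(v[i] for v in vectors) / len(vectors)

    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    ffp = minmax([mean_freq(pos_vectors, i) for i in range(dim)])
    ffn = minmax([mean_freq(neg_vectors, i) for i in range(dim)])
    fscore = [p - n for p, n in zip(ffp, ffn)]
    order = sorted(range(dim), key=lambda i: fscore[i])
    # k/2 most negative + k/2 most positive scores.
    return sorted(order[: k // 2] + order[-(k // 2):])
```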

The classifiers and performance metrics
As several studies have successfully used random forest (RF), naive Bayes (NB) and support vector machine (SVM) to predict PRIs [15][16][17], we also use them to evaluate our method by 10-fold cross validation. Four widely-used performance metrics are adopted: sensitivity (SE), specificity (SP), accuracy (ACC) and geometric mean (GM). GM is commonly used in class-imbalance learning [33] because it gives a more accurate evaluation on imbalanced data; therefore, for the imbalanced datasets, we pay more attention to GM than to ACC. These metrics are evaluated as follows: SE = TP/(TP + FN), SP = TN/(TN + FP), ACC = (TP + TN)/(TP + TN + FP + FN), and GM = sqrt(SE × SP), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
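These four metrics are standard and can be computed directly from the confusion-matrix counts (the helper name is ours):

```python
import math

def metrics(tp, tn, fp, fn):
    """SE, SP, ACC and GM from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    gm = math.sqrt(se * sp)                # geometric mean, robust to imbalance
    return se, sp, acc, gm
```

GM illustrates why it suits imbalanced data: a classifier that labels everything positive gets SE = 1 but SP = 0, so GM = 0 even though ACC may look high.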
In addition, we also use AUC (Area Under the receiver operating characteristic (ROC) Curve) to evaluate prediction performance in some experiments. AUC falls between 0 and 1. The maximum value 1 means a perfect prediction. For a random guess, the value of AUC is close to 0.5.

Results and Discussion
In our experiments, eighteen PRI datasets are used; these datasets either contain PRI data of different species or have different ratios of positive PRIs to negative PRIs. For each dataset, 10-fold cross validation is performed with the SVM, RF and NB classifiers respectively, and the performance metrics SE, SP, GM and ACC as well as AUC are used.
In the sequel, for simplicity of notation, we denote the ratio of positive samples to negative samples as PNR, and drop the words "reliable" and "random" from the dataset names in Table 1. For example, both SO_reliable1:1 and SO_random1:1 are simplified to SO1:1; in other words, SO1:1 represents both SO_reliable1:1 and SO_random1:1. To evaluate more clearly the advantage of reliable negative samples over random negative samples, we define the performance improvement ratio (IR) of using our reliable negatives over using random negatives as IR = (result_reliable - result_random)/result_random × 100%, where result_reliable and result_random denote the performance measure (any of SE, SP, GM and ACC) obtained with our reliable negatives and with random negatives, respectively. A positive IR means that using our reliable negatives achieves better performance than using random negatives.

Performance comparison

Table 3 shows the IR values calculated from the results in Figs. 4, 5 and 6. From Table 3, we can see that of the 108 IR values, only 14 are negative, one is 0, and the other 93 (93/108 ≈ 86%) are positive. As GM and ACC are more comprehensive than SE and SP in measuring classification performance, we examine their IR values more carefully: of the 54 IR values for GM and ACC, 51 (51/54 ≈ 94%) are positive. Therefore, in most cases our method outperforms the random method. The largest IR is 760.4%, achieved for SE by SVM on dataset MUS1:2. We can also see that SVM and RF perform better than NB on these datasets.
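The IR metric is a plain relative improvement and can be computed as follows (the function name is ours):

```python
def improvement_ratio(result_reliable, result_random):
    """Performance improvement ratio (in percent) of using reliable
    negatives over random negatives:
    IR = (result_reliable - result_random) / result_random * 100%."""
    return (result_reliable - result_random) / result_random * 100.0
```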
The results above show that using the reliable negative samples selected by our method indeed boosts the performance of PRI prediction, and our method can serve as a practical and effective method for computationally predicting PRIs.

The effect of score threshold
To select negative samples, we have to set a score threshold, and require that all candidate negative samples (protein-RNA pairs) have scores (defined in Eq. (1)) no larger than the threshold. The value of the threshold thus impacts the quality of the selected negative samples, and subsequently the prediction performance. The smaller the threshold, the higher the quality of the selected negatives, but the smaller the number of negatives that can be selected; so there is a tradeoff between the quality and the number of selected negatives. In this section, we examine the impact of the score threshold on prediction performance and thus suggest proper values for the threshold. Here, we use AUC to evaluate prediction performance.
We randomly select 908 non-redundant positive PRIs of Homo sapiens from PRIDB and NPInter, then construct an equal number of negative samples by our method with different score threshold values. Figure 7 shows the results. As we can see, for all three classifiers, the AUC value shows a decreasing trend as the threshold value increases, which conforms to our expectation. Moreover, when the score threshold is less than 0.7, the prediction performance is stable.

Capability of finding new positive PRIs
In this paper, we define a score (Eq. (1)) to measure the relationship between each protein and each RNA. The smaller the score, the more likely the protein-RNA pair is a negative PRI; conversely, the larger the score, the more likely it is a true PRI. So the merits of our method are two-fold: on the one hand, it can be used to select highly credible negative PRIs; on the other hand, it can be used to directly predict positive PRIs. We randomly select 908 non-redundant positive PRIs of Homo sapiens from PRIDB and NPInter, and compute the score of every protein-RNA pair not included in the positive set by our method. Among the screened protein-RNA pairs, for each RNA we extract the top 4 protein-RNA pairs in terms of the aggregated score defined in Eq. (1), requiring the score to exceed 1; this yields 397 protein-RNA pairs involving 107 unique RNAs and 96 unique proteins. We search each protein-RNA pair against the NPInter and PRIDB datasets, and find that 22 pairs have been validated by biological experiments.
Furthermore, from the 397 protein-RNA pairs obtained above, we filter out those pairs whose proteins appear in PRIs of the NPInter and PRIDB datasets, and get 256 protein-RNA pairs involving 56 unique RNAs and 74 unique proteins. We then manually annotate the 74 proteins in the 256 protein-RNA pairs with the Gene Ontology database, and find that 64 (64/74 ≈ 86.5%) of them have RNA-binding, chromatin-binding or nucleotide-binding functions, which play important roles in the positive or negative regulation of transcription, gene expression and RNA processing. Figure 8 shows a protein-RNA interaction network constructed from the true positive PRIs and the predicted ones. The network includes 908 true PRIs represented by solid lines and 256 highly credible predicted PRIs represented by dotted lines. Based on our experimental results, we believe these predicted PRIs are very likely true PRIs.

Conclusion
In this paper, we present a novel method, FIRE, for boosting the performance of protein-RNA interaction prediction by selecting high-quality negative protein-RNA pairs to construct high-performance classifiers. Experiments over 18 PRI datasets show that all three compared classifiers, SVM, RF and NB, achieve better performance on the negative sets selected by our method than on the random negative sets. This means that our method can screen highly-credible negative PRIs, and thus can improve PRI prediction performance. As for future work, we will further explore the interacting mechanism between protein and RNA, and propose new and more effective methods for selecting reliable negative samples.

Fig. 8 The PRI network constructed by the true PRIs and those predicted by our method. The 256 predicted PRIs involve 74 unique proteins and 56 unique RNAs. The yellow ellipses and purple diamonds represent proteins and RNAs, respectively. The solid and dotted lines are the true and predicted PRIs.