70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features

Background Promoters are important sequence regulation elements that are in charge of gene transcription initiation. In prokaryotes, σ70 promoters regulate the transcription of most genes. Promoter recognition is a crucial part of gene structure recognition and the core issue in constructing gene transcriptional regulation networks. Although genome sequencing has been completed for an increasing number of microbial species, the accurate identification of σ70 promoter regions in DNA sequences remains difficult. Results In order to improve the prediction accuracy of sigma70 promoters in prokaryotes, a promoter recognition model, 70ProPred, was established. In this work, two sequence-based features, position-specific trinucleotide propensity based on single-stranded characteristic (PSTNPSS) and electron-ion interaction pseudopotentials of trinucleotides (PseEIIP), were assessed to build the best prediction model. The 79 features of PSTNPSS combined with the 64 features of PseEIIP gave the best performance for sigma70 promoter identification, with an accuracy of 95.56% and a Matthews correlation coefficient (MCC) of 0.90. Conclusion The jackknife tests showed that 70ProPred outperforms the existing sigma70 promoter prediction approaches in terms of accuracy and stability. Additionally, this approach can be extended to predict promoters of other species. To assist experimental biologists, an online web server for the proposed method was established, which is freely available at http://server.malab.cn/70ProPred/. Electronic supplementary material The online version of this article (10.1186/s12918-018-0570-1) contains supplementary material, which is available to authorized users.


Background
Transcription is strictly regulated and controlled by cis-regulatory DNA elements known as promoters and enhancers. These elements control the level of gene expression and cell fate. Promoters are cis-acting DNA sequences that switch gene expression on or off. They are generally located upstream of the transcription start sites of genes. In prokaryotes, promoters are recognized by RNA polymerase and an associated sigma factor [1]. Different σ factors interact with well-defined consensus promoter sequences, and each σ factor is named according to its molecular weight. σ70 is a well-known factor that regulates the transcription of most housekeeping genes under normal circumstances [1]. The σ70 promoter contains two short sequence elements located approximately 10 bp and 35 bp upstream of the transcription start site (TSS), with consensus sequences TATAAT and TTGACA, respectively [2]. It is important to identify the promoters in a genome, because this helps clarify the regulatory mechanisms of the genome and explain disease-causing variants within cis-regulatory elements [3,4]. Promoter identification is also a crucial part of gene structure recognition and the core issue in building gene transcriptional regulation networks. Our understanding of promoters continues to develop, and they attract increasing attention because of their importance not only in developmental gene expression but also in environmental response [5,6].
The rapid development of genome sequencing technology has generated large-scale data [7][8][9], which makes the stable and accurate identification of promoters an important problem. Because standard laboratory methods are time-consuming and costly, bioinformatics methods with high precision are an attractive alternative for fast, large-scale promoter recognition.
The σ70 promoter recognition task is a binary classification task. Feature extraction and classifier design are the key problems in promoter identification. In the past 20 years, a series of approaches based on the features of promoter sequences have been developed for detecting promoter regions in prokaryotes [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. In 2007, Zhang proposed an algorithm using increment of diversity with quadratic discriminant (IDQD) analysis [21]. The position weight matrix (PWM) is commonly used as a description of sequence information, but it sometimes gives poor results [22]. Wu therefore proposed an improved position weight matrix (IPWM) [23] in 2011. In the same year, Lin proposed a hybrid method (IPMD), which combines a position-related scoring function and increment of diversity with a modified Mahalanobis discriminant to predict promoters [16]. The next year, a variable-window Z-curve was used to extract basic features of prokaryotic promoters [14]. Recently, Lin developed an improved Z-curve called 'multi-window Z-curve' (PseZNC), which can express the frequency characteristics and three-dimensional characteristics of sequences of different lengths [15]. These methods mainly focused on σ70 promoter recognition. Among them, several typical machine learning algorithms have been used for prokaryotic promoter region prediction, such as SVM (Support Vector Machine), RF (Random Forest), NB (Naïve Bayes) and PLS (Partial Least Squares). Although these approaches have advanced promoter recognition, their performance shows that accurate promoter prediction is still a long way off, for the following reasons. (i) Most existing approaches overlooked the correlation of neighboring nucleotides at each position, especially how it differs between positive and negative samples. (ii) The local biological and physical properties of DNA may be related to promoters and may play an important role in identifying them, but they were ignored. (iii) Few web servers were provided for the predictors, and hence their usage is quite limited [24].
In the current research, we developed a new bioinformatics tool called 70ProPred to predict σ70 promoters through a combination of position-specific trinucleotide propensity (PSTNP) and electron-ion interaction pseudopotentials (EIIPs) of nucleotides. In jackknife tests, 70ProPred outperformed existing prediction models and should be useful for identifying σ70 promoters.

Methods
Briefly, 70ProPred is a prediction model based on a support vector machine (SVM), built on the PSTNPSS and PseEIIP sequence encoding strategies. An outline of the computational framework of the 70ProPred predictor is shown in Fig. 1.

Datasets
In the present study, we used the latest datasets in both [15] and [16]. A total of 741 experimentally verified σ70 promoter samples from the E. coli K-12 genome were downloaded from RegulonDB 9.0 (http://regulondb.ccg.unam.mx/) [25]. The promoter regions [TSS − 60 … TSS + 20] (with the TSS at position 0) were used as positive samples, each with a length of 81 bp.
As there are not enough experimentally confirmed negative sequences, negative samples were collected from both coding and non-coding regions. The benchmark dataset S used in this study can be expressed as:
$$ S = S^{+} \cup S^{-}, \qquad S^{-} = S^{-}_{\text{coding}} \cup S^{-}_{\text{non-coding}} \tag{1} $$
where S+ contains 741 σ70 promoter samples, S− contains 1400 non-promoter samples, S−coding contains 700 coding sequences, and S−non-coding contains 700 convergent intergenic sequences. Each sample contains 81 nucleotides selected by a sliding window. The symbol ∪ denotes union.
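The window extraction described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the benchmark: the function names are hypothetical, the TSS is assumed to be given as a 0-based index on the forward strand, and the step of the sliding window is an assumption because it is not stated in the text.

```python
def promoter_window(genome: str, tss: int, upstream: int = 60, downstream: int = 20) -> str:
    """Return the [TSS-60 ... TSS+20] window (81 nt) around a 0-based TSS index.

    Assumption: forward strand only; reverse-strand promoters would need
    reverse complementing, which is not handled here.
    """
    start, end = tss - upstream, tss + downstream + 1
    if start < 0 or end > len(genome):
        raise ValueError("window extends past the sequence ends")
    return genome[start:end]


def sliding_windows(region: str, width: int = 81, step: int = 81) -> list:
    """Cut fixed-width samples from a coding or intergenic region.

    The step size is hypothetical; the text only says a sliding window is used.
    """
    return [region[i:i + width] for i in range(0, len(region) - width + 1, step)]
```

With the default arguments every sample has exactly 81 nucleotides, and the TSS sits at index 60 of a positive sample.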

Analysis of σ 70 promoter samples for conserved motif composition
The MEME Suite is designed to discover common sequence motifs in a set of sequences (amino acid or nucleotide). A motif can be regarded as a conserved sequence pattern that recurs across a set of related sequences [26]. MEME is a useful sequence analysis tool that can rapidly detect new, ungapped motifs in biological sequence data (protein, DNA and RNA) [27]. We applied this tool to analyze the main motifs of the σ70 promoter samples and found that only a small fraction of the samples contain the corresponding motifs (Fig. 2). The maximum number of motifs was set to 3, and the remaining arguments were left at their defaults.

Feature vector construction
In order to get more useful information from the sequence, we extracted two kinds of features. Position-specific trinucleotide propensity (PSTNPSS or PSTNPDS) was adopted to reveal the differences in trinucleotide distribution between the positive and negative samples, while the electron-ion interaction pseudopotentials of trinucleotides (PseEIIP) were adopted to represent the interaction of trinucleotides.
Fig. 1 Overview of the proposed 70ProPred predictor. The diagram mainly contains datasets, sequence descriptors and the 70ProPred prediction system. The optimal encoding combination PSTNPSS and PseEIIP is used as the input to train an SVM classifier. After optimization of the SVM parameters, the best SVM model is constructed based on the jackknife performance
Fig. 2 Motifs of σ70 promoter samples as found by the MEME system. The corresponding three motif logos as visualized for σ70 promoter samples (details in Table 1)
Position-specific trinucleotide propensity based on single-stranded or double-stranded characteristic of DNA (PSTNPSS/PSTNPDS)
The recently proposed position-specific theory has been successfully applied to numerous fields of bioinformatics, such as identifying protein S-nitrosylation sites [47], hydroxyproline and hydroxylysine sites [48], DNA replication origin sites [49] and enhancers [41]. In addition, a series of studies have shown that trinucleotide composition is effective in promoter recognition [28,50,51,52]. Inspired by these studies, we presented a novel feature extraction strategy that combines position-specific trinucleotide propensity (PSTNP) with K-mer nucleotide composition information to predict σ70 promoters. A brief account of the feature extraction is as follows.
Let S represent a sample consisting of the nucleotides A, G, C and T, i.e.
$$ S = N_1 N_2 N_3 \cdots N_L \tag{2} $$
where L is the length of the sample (L = 81) and N_i represents the nucleotide at the i-th position of the sequence.

PSTNP SS
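The PSTNPSS feature uses a statistical strategy based on the single-stranded characteristics of DNA. There are 4^3 = 64 trinucleotides: AAA, AAC, AAG, ..., TTT. So, for an 81 bp sample, the position specificity of the trinucleotides can be expressed by the following 64 × 79 matrix [41]:
$$ Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,79} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,79} \\ \vdots & \vdots & \ddots & \vdots \\ z_{64,1} & z_{64,2} & \cdots & z_{64,79} \end{bmatrix}, \qquad z_{i,j} = F^{+}(3mer_i \mid j) - F^{-}(3mer_i \mid j) $$
where F+(3mer_i | j) and F−(3mer_i | j) denote the frequency with which the i-th trinucleotide (3mer_i) appears at the j-th position in the positive (S+) and negative (S−) datasets, respectively (i = 1, ..., 64; j = 1, ..., 79). In the formula, 3mer_1 equals AAA, 3mer_2 equals AAC, …, and 3mer_64 equals TTT.
Therefore, the sample of Eq. 2 can be expressed as:
$$ S = [\phi_1, \phi_2, \ldots, \phi_{79}]^{T} $$
where T is the transpose operator and ϕ_u is defined as follows, with N_u N_{u+1} N_{u+2} being the trinucleotide that starts at position u of the sample:
$$ \phi_u = \begin{cases} z_{1,u}, & \text{when } N_u N_{u+1} N_{u+2} = \text{AAA} \\ z_{2,u}, & \text{when } N_u N_{u+1} N_{u+2} = \text{AAC} \\ \;\vdots & \\ z_{64,u}, & \text{when } N_u N_{u+1} N_{u+2} = \text{TTT} \end{cases} \qquad (1 \le u \le 79) $$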
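The PSTNPSS encoding can be illustrated with a short, dependency-free sketch. It assumes that each matrix entry is the difference between the positional trinucleotide frequencies in the positive and negative sets, and that sequences contain only A, C, G and T; the function names are hypothetical and this is not the code behind the web server.

```python
from itertools import product

K = 3
TRIMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # AAA, AAC, ..., TTT
INDEX = {t: i for i, t in enumerate(TRIMERS)}


def position_frequencies(samples):
    """F(3mer_i | j): fraction of samples that carry trinucleotide i at position j."""
    n_pos = len(samples[0]) - K + 1  # 79 positions for 81-nt samples
    freq = [[0.0] * n_pos for _ in TRIMERS]
    for seq in samples:
        for j in range(n_pos):
            freq[INDEX[seq[j:j + K]]][j] += 1.0 / len(samples)
    return freq


def pstnp_matrix(positives, negatives):
    """Matrix Z with z[i][j] = F+(3mer_i | j) - F-(3mer_i | j)."""
    f_pos = position_frequencies(positives)
    f_neg = position_frequencies(negatives)
    return [[p - n for p, n in zip(row_p, row_n)] for row_p, row_n in zip(f_pos, f_neg)]


def encode(seq, z):
    """Feature vector [phi_1, ..., phi_(L-2)]: look up z for the trinucleotide at each position."""
    return [z[INDEX[seq[j:j + K]]][j] for j in range(len(seq) - K + 1)]
```

For 81-nt samples `encode` returns 79 values, one per trinucleotide position. When this encoding is used inside a jackknife test, it seems prudent to compute Z without the held-out sample, although the text does not say whether this was done.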

PSTNP DS
The PSTNPDS feature uses a statistical strategy based on the double-stranded characteristics of DNA according to complementary base pairing, which gives more evident statistical features. Here we treat A and T as identical, and likewise C and G. Thus, every sample can be converted into a sequence containing A and C only. As shown in Fig. 3, promoter-1 is converted into promoter-1 AC. There are then 2^3 = 8 trinucleotides: AAA, AAC, ..., CCC. Therefore, for an 81 bp sample, the position specificity of the trinucleotides can be expressed by the following 8 × 79 matrix:
$$ Z' = \begin{bmatrix} z'_{1,1} & z'_{1,2} & \cdots & z'_{1,79} \\ z'_{2,1} & z'_{2,2} & \cdots & z'_{2,79} \\ \vdots & \vdots & \ddots & \vdots \\ z'_{8,1} & z'_{8,2} & \cdots & z'_{8,79} \end{bmatrix}, \qquad z'_{i,j} = F^{+}(3mer_i \mid j) - F^{-}(3mer_i \mid j) $$
where F+(3mer_i | j) and F−(3mer_i | j) denote the frequency with which the i-th trinucleotide (3mer_i) appears at the j-th position in the positive (S+) and negative (S−) datasets, respectively (i = 1, ..., 8; j = 1, ..., 79). In the formula, 3mer_1 equals AAA, 3mer_2 equals AAC, …, and 3mer_8 equals CCC. Therefore, the sample of Eq. 2 can be expressed as:
$$ S = [\phi'_1, \phi'_2, \ldots, \phi'_{79}]^{T} $$
where T is the transpose operator and ϕ'_u is defined as follows, with the trinucleotide starting at position u taken from the converted two-letter sequence:
$$ \phi'_u = \begin{cases} z'_{1,u}, & \text{when the trinucleotide at position } u \text{ is AAA} \\ z'_{2,u}, & \text{when the trinucleotide at position } u \text{ is AAC} \\ \;\vdots & \\ z'_{8,u}, & \text{when the trinucleotide at position } u \text{ is CCC} \end{cases} \qquad (1 \le u \le 79) $$

Electron-ion interaction pseudopotentials of trinucleotide (PseEIIP)
Nair [53] proposed the electron-ion interaction pseudopotential (EIIP) values of the nucleotides A, G, C and T. EIIP-based methods have been shown to be effective in previous studies, such as the recognition of gene F56F11.4, the prediction of the cystic-fibrosis gene [54], the recognition of enhancers [41], and others [55,56]. The EIIP values of the nucleotides [53] are shown in Table 2. We let EIIP_A, EIIP_T, EIIP_G and EIIP_C denote the EIIP values of nucleotides A, T, G and C, respectively. We then employed the mean EIIP value of the trinucleotides in each sample to construct the feature vector, which can be formulated as:
$$ V = [\,EIIP_{AAA} \cdot f_{AAA},\; EIIP_{AAC} \cdot f_{AAC},\; \ldots,\; EIIP_{TTT} \cdot f_{TTT}\,] $$
where f_xyz is the normalized frequency of the trinucleotide xyz in the sample, EIIP_xyz = EIIP_x + EIIP_y + EIIP_z is the EIIP value of one trinucleotide, and x, y, z ∈ {A, C, G, T}. The dimension of the vector V is therefore 64.
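Both the two-letter conversion used by PSTNPDS and the PseEIIP vector are simple to compute, as the following sketch suggests. The EIIP values are the published nucleotide values of Nair (A 0.1260, C 0.1340, G 0.0806, T 0.1335). The function names are hypothetical, and normalizing each trinucleotide count by the number of trinucleotides in the sample (L − 2) is an assumption about what "normalized frequency" means here.

```python
from itertools import product

# Nucleotide EIIP values as published by Nair
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # AAA, AAC, ..., TTT


def to_two_letter(seq: str) -> str:
    """Collapse complementary bases for PSTNP_DS: T is read as A and G as C."""
    return seq.translate(str.maketrans("TG", "AC"))


def pse_eiip(seq: str) -> list:
    """64-dimensional vector of EIIP_xyz * f_xyz over all trinucleotides xyz."""
    n = len(seq) - 2  # number of overlapping trinucleotides in the sample
    counts = dict.fromkeys(TRIMERS, 0)
    for j in range(n):
        counts[seq[j:j + 3]] += 1
    # EIIP_xyz is the sum of the three nucleotide values; f_xyz = count / n (assumed)
    return [sum(EIIP[b] for b in t) * counts[t] / n for t in TRIMERS]
```

Unlike PSTNPSS, this encoding does not depend on the training set, so it can be computed for each sample independently.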

Model building and parameter selection
SVM classification algorithms play a significant role in several areas of bioinformatics [18,36,40,57]. In this work, SVM was implemented using the LIBSVM package [58] to build models and make predictions. The radial basis function (RBF) was selected as the kernel function. The penalty parameter C and the kernel parameter γ were optimized using SVMcg in the LIBSVM package, and the final parameters C = 22.6274 and γ = 2.8284 were selected for discriminating σ70 promoters from non-promoters. Because it involves no random partitioning, the jackknife test always produces a unique result for a given dataset [59]. Therefore, all these parameters were optimized through the jackknife test.
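The jackknife (leave-one-out) procedure and a power-of-two parameter grid can be sketched as below. To stay dependency-free, the sketch plugs in a nearest-centroid classifier as a stand-in; it is not an SVM and not the LIBSVM model used in this work. The grid bounds and step are assumptions, although the reported values are consistent with such a grid, since 2^4.5 ≈ 22.6274 and 2^1.5 ≈ 2.8284.

```python
def log2_grid(lo=-5.0, hi=5.0, step=0.5):
    """Candidate values 2^lo, 2^(lo+step), ..., 2^hi (bounds and step are assumed)."""
    n = int(round((hi - lo) / step))
    return [2.0 ** (lo + k * step) for k in range(n + 1)]


def jackknife_accuracy(samples, labels, fit, predict):
    """Train on n-1 samples, test on the held-out one, and repeat for every sample."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


# Stand-in classifier (NOT an SVM): assign the label of the nearest class centroid.
def fit_centroids(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))
```

In a full parameter search, `jackknife_accuracy` would be evaluated once for every (C, γ) pair drawn from the grid, with `fit` and `predict` wrapping an RBF-kernel SVM.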
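In order to evaluate the predictive performance of the model, four metrics were calculated: sensitivity (Sn), specificity (Sp), accuracy (Acc) and MCC:
$$ Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad Acc = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$
In the formulas, TP, TN, FP and FN represent the numbers of true positives (correctly predicted σ70 promoters), true negatives (correctly predicted non-promoters), false positives (non-promoters falsely predicted as σ70 promoters) and false negatives (σ70 promoters falsely predicted as non-promoters), respectively.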

Results and discussion
Prediction of σ70 promoter using only PSTNP
PSTNP was first proposed for predicting enhancers [41]. The obvious advantage of this approach is that the feature vectors are encoded in a way that contains information from all training samples. In this work, the ability of PSTNPSS and PSTNPDS to discriminate σ70 promoters from non-promoters was first evaluated by the jackknife test (Table 3). The F-values of trinucleotides at different positions (Fig. 4) also reflect the difference in prediction results.
Furthermore, we used the entropy (Additional file 1) [23,60] to demonstrate the conserved trinucleotide sites in σ70 promoters. For comparison, the calculated entropy values for the trinucleotides of the σ70 promoters and non-promoters are shown in Fig. 5. The lower the entropy, the more conserved the position is. The figure shows that most sites in σ70 promoters have lower entropy values than the corresponding sites in non-promoters. This suggests that the PSTNPSS feature extraction agrees well with the earlier expectation that trinucleotides are conserved at particular positions in σ70 promoters.
Taken together, the trinucleotide composition based on the single-stranded characteristics of DNA contributes to the detection of σ70 promoters. The results indicate that the PSTNPSS model performs better than the PSTNPDS model in σ70 promoter prediction. Consequently, the training model was optimized on the basis of the PSTNPSS feature.
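The exact entropy definition is given in Additional file 1 and is not reproduced here. As a rough illustration, the sketch below assumes the standard Shannon entropy (in bits) of the trinucleotide distribution at each position; the function name is hypothetical.

```python
import math
from collections import Counter


def positional_entropy(samples, k=3):
    """Shannon entropy (bits) of the k-mer distribution at each position.

    Assumption: standard Shannon entropy; the definition used in the
    supplementary material may differ in detail.
    """
    n_pos = len(samples[0]) - k + 1
    entropies = []
    for j in range(n_pos):
        counts = Counter(seq[j:j + k] for seq in samples)
        total = sum(counts.values())
        entropies.append(-sum(c / total * math.log2(c / total) for c in counts.values()))
    return entropies
```

A position where every sample carries the same trinucleotide has entropy 0, while a position where all 64 trinucleotides are equally frequent reaches the maximum of 6 bits.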

Improving performance by incorporating PseEIIP
Because the physicochemical properties of nucleotides affect promoter recognition, incorporating the average energy of delocalized electrons of the sample (EIIP), specifically the EIIP values of trinucleotides, together with PSTNPSS might boost the performance of the training model. The prediction results are listed in Table 4.
Eventually, the prediction model was established using the PSTNPSS + PseEIIP feature extraction methods combined with the SVM classifier (cost parameter -c 22.6274, kernel parameter -g 2.8284) to predict σ70 promoters.
In order to gauge the predictive performance of the training model, the ROC curve and the area under the ROC curve (AUC) were adopted. The AUC value of the 70ProPred model was 0.990 (Fig. 6).
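For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, not the procedure or data used to obtain the reported value.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy decision values, for illustration only
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 3 of 4 pairs are ordered correctly
```

The pairwise form costs O(n+ × n−) comparisons, which is acceptable for a benchmark of this size (741 × 1400 pairs).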
Furthermore, we constructed a heat map to visualize the distribution of the PseEIIP feature in the positive dataset, as shown in Fig. 7. Each hotspot in the heat map corresponds to a unique trinucleotide; for instance, hotspot (1, 1) corresponds to the triplet AAA. For more detailed information on the heat map, please see Additional file 2: Table S1. Red squares are positively associated with recognition ability.

Comparison of the performance between SVM and other classifiers
In order to select a better classifier for identifying σ70 promoters, we analyzed the performance of KNN [61], NB [62], RF [63], EB [64], LibD3C [65], GBDT [66] and SVM, which are among the most widely used classification algorithms in bioinformatics. Since the number of neighbors affects the performance of the KNN algorithm and the number of trees affects the performance of the RF algorithm, the optimal parameters of KNN and RF were searched in this study, as shown in Additional file 3: Table S2 and Additional file 4: Table S3.
The performances of the above classifiers in the jackknife test are shown in Table 5. The results indicate that SVM performs better than the other classifiers, with the highest MCC value of 0.9018.

Comparison of 70ProPred with other σ 70 promoter prediction methods
The 5-fold cross-validation results achieved by Z-curve [14], PseZNC [15], IPMD [16], PSTNPDS and 70ProPred on the benchmark dataset are listed in Table 6. The main difference among these methods is the feature extraction strategy. The advantage of 70ProPred over the other methods is therefore mainly attributed to the combination of the PSTNPSS and PseEIIP encoding strategies. PSTNPSS employs the primary sequence information of trinucleotides, and PseEIIP is closely related to the physicochemical properties of the DNA sample. As shown in Table 3, the prediction result based on the PSTNPSS feature alone was already higher than those of the other methods. The performance was further improved by adding the PseEIIP feature, as shown in Table 4, especially in sensitivity (Sn). Taken together, the combination of the PSTNPSS and PseEIIP features achieved good performance in the prediction of σ70 promoters.
The results in Table 6 also show that the PSTNPDS-based model performs better than the multi-window Z-curve-based method. PSTNPDS may therefore also supplement existing methods in other DNA-related prediction tasks.

Conclusions
70ProPred is a new bioinformatics tool for predicting σ70 promoters. The tool uses the PSTNPSS and PseEIIP feature extraction methods, and the combination of these features with an SVM achieves an overall MCC value of 0.90. Compared with other σ70 promoter prediction models, 70ProPred produced better results. Although this method shows good performance in σ70 promoter prediction, there is still room for improvement for the following reasons. (i) Since structural information is supplementary to sequence information, future work may build a model that combines the two. (ii) Feature selection algorithms can be used to remove redundant features and improve the prediction model. (iii) Promoters from more species should be used to assess the performance of the 70ProPred method. Our future work is therefore to extend this method to promoter region prediction in other species. We expect that our feature extraction methods are suitable not only for identifying promoters but also for other bioinformatics sequence classification tasks.
Fig. 6 The ROC curves to assess the predictive performance based on different sequence encoding schemes for σ70 promoter