A novel neural response algorithm for protein function prediction
© Yalamanchili et al.; licensee BioMed Central Ltd. 2012
Published: 16 July 2012
High-throughput genome sequencing methods are generating large amounts of data, but experimental functional characterization lags far behind. To close the gap between the number of sequences and their annotations, fast and accurate automated annotation methods are required. Many methods, such as GOblet, GOfigure, and GOtcha, are built on BLAST searches. Unfortunately, the sequence coverage of these methods is low because they cannot detect remote homologues. In addition, the lack of annotation specificity calls for improved automated protein function prediction.
We designed a novel automated protein functional assignment method based on the neural response algorithm, which simulates the neuronal behavior of the visual cortex in the human brain. We first predict the most similar target protein for a given query protein and then assign the target's GO term to the query sequence. When assessed on the test set, our method ranked the actual leaf GO term among the top 5 probable GO terms with an accuracy of 86.93%.
The proposed algorithm is the first application of the neural response algorithm in the biological domain. The use of HMM profiles along with secondary structure information to define the neural response gives our method an edge over other available methods in annotation accuracy. Results of the 5-fold cross validation and the comparison with the PFP and FFPred servers indicate that our method performs well. The program, the dataset, and help files are available at http://www.jjwanglab.org/NRProF/.
Recent advances in high-throughput sequencing technologies have enabled the scientific community to sequence a large number of genomes. Currently there are 1,390 complete genomes annotated in the KEGG genome repository, and many more are in progress. However, experimental functional characterization of these genes cannot match the data production rate. In addition, more than 50% of functional annotations are enigmatic. Even well studied genomes, such as E. coli and C. elegans, have 51.17% and 87.92% ambiguous annotations (putative, probable and unknown) respectively. To close the gap between the number of sequences and their (quality) annotations, we need fast yet accurate automated functional annotation methods. Such computational annotation methods are also critical for analyzing, interpreting and characterizing large complex data sets from high-throughput experimental methods, such as protein-protein interaction (PPI) and gene expression data, by clustering similar genes and proteins.
The definition of biological function is itself enigmatic and highly context dependent [4–6]. This is part of the reason why more than 50% of functional annotations are ambiguous. The functional scope of a protein in an organism differs depending on the aspect under consideration. Proteins can be annotated based on their mode of action, i.e. Enzyme Commission (EC) number (physiological aspect), or on their association with a disease (phenotypic aspect). This lack of functional coherence increases the complexity of automated functional annotation. Another major barrier is the use of different vocabularies by different annotations: the same function can be described differently in different organisms. This problem can be solved by using ontologies, which serve as universal functional definitions. Enzyme Commission (EC), MIPS Functional Catalogue (FunCat) and Gene Ontology (GO) are such ontologies. Because GO is the most recent and the most widely used, many automated annotation methods use GO for functional annotation.
Protein function assignment methods can be divided into two main categories: structure-based methods and sequence-based methods. A protein's function is highly related to its structure, and protein structure tends to be more conserved than the amino acid sequence in the course of evolution [12, 13]. Thus a variety of structure-based function prediction methods [14, 15] rely on structural similarities. These methods start with a predicted structure of the query protein and search for similar structural motifs in structural classification databases such as CATH and SCOP. Structural alignments can reveal remote homology for 80-90% of the entries in the Protein Data Bank even when no significant sequence similarity is found between the two proteins. However, these methods are limited by the accuracy of the initial query structure prediction and by the availability of homologous structures in the structural databases. Although structure-based methods are highly accurate, the large gap between the number of sequences and the number of solved structures restricts their use. Therefore, sequence-based methods are needed.
The main idea behind sequence-based methods is to compare the query protein to well characterized proteins and to assign the function of the best hit directly to the query sequence. GOblet was the first method to assign GO annotations to BLAST search results, by mapping the sequence hits to their GO terms. Later, Ontoblast weighted the GO terms by the E-value of the BLAST search. This was further refined in GOfigure and GOtcha by propagating the scores from one level to the next in the GO hierarchy tree. All these methods are based on BLAST search results; thus they fail to identify remote homologues, which have higher E-values. This problem is tackled by the Protein Function Prediction (PFP) server, which replaces BLAST with PSI-BLAST and can thus detect remote homologues. The PFP server can predict the generalized function of protein sequences with remote homology, but at the cost of low specificity. FFPred is the most recent protein function prediction server. It builds Support Vector Machine (SVM) classifiers on features extracted from the query sequence and therefore does not require prior identification of protein sequence homologues. However, the server needs one SVM classifier for each GO term, which makes it computationally expensive. Furthermore, the server only provides classifiers for 111 Molecular Function and 86 Biological Process categories that represent more general annotations, which limits its use in deciphering specific annotations. The lack of annotation specificity and the high complexity of the existing methods call for improvement in automated protein function prediction.
In the current context of protein functional characterization, the top layer represents whole protein sequences and the subsequent layers consist of sequence motifs. At each layer, the similarity between the templates of two successive layers is computed as a derived kernel, obtained recursively by taking the maximum of the previously computed local kernels. Finally, a mapping engine is built on the kernels derived from the neural response algorithm to map the query protein to its most probable GO term. A detailed description of the whole methodology is given in the Methods section.
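The max-then-compare step described above can be illustrated with a minimal single-level sketch. It is not the NRProF implementation: the base kernel here is a toy identity score between equal-length strings (an assumption standing in for the profile-profile alignment scores used in this work), and the function names and the normalized inner product are illustrative choices.

```python
import math

def local_kernel(a: str, b: str) -> float:
    """Toy base kernel (assumption): fraction of identical positions
    between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("local_kernel expects equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def neural_response(seq: str, templates: list) -> list:
    """For each template, take the maximum local kernel value over all
    windows of the sequence that have the template's length."""
    response = []
    for t in templates:
        windows = [seq[i:i + len(t)] for i in range(len(seq) - len(t) + 1)]
        response.append(max((local_kernel(w, t) for w in windows), default=0.0))
    return response

def derived_kernel(seq_a: str, seq_b: str, templates: list) -> float:
    """Normalized inner product of the two neural responses."""
    ra = neural_response(seq_a, templates)
    rb = neural_response(seq_b, templates)
    dot = sum(x * y for x, y in zip(ra, rb))
    norm = math.sqrt(sum(x * x for x in ra)) * math.sqrt(sum(y * y for y in rb))
    return dot / norm if norm > 0 else 0.0

# Hypothetical motifs and sequences, for illustration only.
templates = ["ACD", "KLM"]
print(neural_response("XXACDXX", templates))
print(derived_kernel("XXACDXX", "ACDYY", templates))
print(derived_kernel("XXACDXX", "KLMXX", templates))
```

In a multi-layer setting the same step would be repeated, with the derived kernel of one layer serving as the local kernel of the next.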
We used the GO terms with no further children (leaf nodes of the GO tree) and their corresponding proteins to assess our method. The rationale for using leaf nodes is that these GO terms are functionally more specific than GO terms at higher levels; we further required that no two GO terms share a common protein, so that the evaluation demonstrates the specific function prediction strength of our method. This also addresses the issue of redundancy in the training set. In addition, we addressed redundancy at the sequence level by eliminating sequences that are more than 80% similar in the training set. This was done using CD-HIT, a program that removes redundant sequences and generates a database of only the representatives. From the extracted GO terms we enumerated all protein pairs belonging to the same GO term and labeled them as positive, i.e. Y(i, j) = 1; protein pairs belonging to different GO terms were labeled as negative, Y(i, j) = 0. Among these labeled pairs, we randomly selected 3000 positive pairs and 3000 negative pairs and used them to train and validate our method. After training, the final mapping function f(N(i, j)) produced a value between 0 and 1 corresponding to the similarity between proteins i and j in the validation set. Applying a threshold of 0.5, we predicted Y(i, j) = 1 (share a GO term) if f(N(i, j)) ≥ 0.5, and Y(i, j) = 0 (do not share a GO term) if f(N(i, j)) < 0.5.
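The pair labeling, balanced sampling and thresholding described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes every protein belongs to exactly one leaf GO term, and the GO identifiers, protein names and function names are hypothetical.

```python
import itertools
import random

def label_pairs(go_to_proteins: dict):
    """Enumerate protein pairs and label them 1 (same GO term) or 0 (different).
    Assumes each protein belongs to exactly one leaf GO term."""
    protein_to_go = {p: go for go, ps in go_to_proteins.items() for p in ps}
    positives, negatives = [], []
    for a, b in itertools.combinations(sorted(protein_to_go), 2):
        if protein_to_go[a] == protein_to_go[b]:
            positives.append((a, b, 1))
        else:
            negatives.append((a, b, 0))
    return positives, negatives

def sample_balanced(positives, negatives, n_per_class, seed=0):
    """Randomly draw the same number of positive and negative pairs."""
    rng = random.Random(seed)
    n = min(n_per_class, len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)

def predict_label(score: float, threshold: float = 0.5) -> int:
    """Predict 1 (share a GO term) if the mapped score reaches the threshold."""
    return 1 if score >= threshold else 0

# Hypothetical toy data, for illustration only.
toy = {"GO:A": ["P1", "P2", "P3"], "GO:B": ["P4", "P5"]}
pos, neg = label_pairs(toy)
training_pairs = sample_balanced(pos, neg, n_per_class=3)
print(len(pos), len(neg), len(training_pairs))
```

With the data set described above, `n_per_class` would be 3000.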
5 Fold cross validation results with respect to the template library
Template Library in layer 2
PROSITE + PFAM
Classification Accuracy of the NRProF, FFPred and PFP server with respect to the compiled test set.
GO terms predicted for the protein Q9H6Y2 by PFP, FFPred and NRProF.
Top 5 GO terms by PFP: GO:0005488, GO:0043169, GO:0003676, GO:0004977, GO:0046026
Top 5 GO terms by FFPred: No GO terms predicted for this sequence
Top 5 GO terms by NRProF: P51532, Q96S44, Q9HCK8 (GO:0002039), Q01638 (GO:0002114), Q13822 (GO:0047391)
GO term prediction Accuracy of the NRProF and PFP server with respect to the test set.
From Table 4, we can infer that our method, NRProF, performs better than the PFP server. We have not reported the accuracy of FFPred, as it is limited to only 111 Molecular Function categories, which makes it suitable for general rather than specific function annotations. Other methods that use the GO vocabulary for protein function prediction include GOblet, GOfigure and GOtcha. However, the PFP server has already been shown to be superior to all of these methods. Thus we have compared our method (NRProF) only with the PFP server.
Impact of CD-Hit cut-off on the accuracy
We could simply calculate the similarity between a query protein and a known one to assign the corresponding GO term. However, with such a similarity alone we can only use naive algorithms such as k-nearest neighbors, whose accuracy is not satisfactory, especially for biological data (proteins), which are essentially multi-dimensional. In addition, we would have to artificially enforce a similarity cut-off between the query and the known protein in order to assign the query protein to the associated GO category. Because the intra GO term similarity varies from GO term to GO term, it is difficult to set such cut-offs. To overcome this, it is necessary to design a machine learning algorithm that can learn and choose the cut-off based on the similarity between the proteins sharing the same GO term, i.e. the similarity cut-off should be high if the intra GO term similarity is high and vice versa. Here, our model assigns the query protein to its associated GO category (1st pair) based on the respective intra GO term similarity, given by the similarity between the proteins constituting the 2nd pair, i.e. the 1st pair will be labeled as similar if its similarity is equivalent to the similarity of the 2nd pair (labeled as similar) and vice versa. In this way we bypass the cut-off that would otherwise need to be enforced on the simple similarity score for assigning GO terms.
Our test set is compiled of leaf GO terms and their corresponding proteins, with no two GO terms sharing a common protein, to demonstrate the specific function prediction strength of our method. However, upon perusal we found that ~25% of the proteins belong to more than one leaf GO term under the category of molecular function. To analyse the effect of excluding such proteins on the accuracy, we compiled a new test set of the same size (300 proteins, including proteins belonging to more than one leaf GO term). We find that considering proteins belonging to more than one leaf GO term has no negative effect on the GO term predictability. In fact, the prediction accuracy is slightly better: 89.63%, compared with 86.93% on the actual test set.
Here we present a novel protein function prediction method, NRProF, based on the neural response algorithm using the Gene Ontology vocabulary. The neural response algorithm simulates the neuronal behavior of the visual cortex in the human brain. It defines a distance metric corresponding to the similarity by reflecting how the human brain can distinguish different sequences. It adopts a multi-layer structure, in which each layer can use one or multiple types of sequence/structure patterns.
NRProF is the first application of the neural response in the biological domain. It finds the protein most similar to the query protein based on the neural response N between the query and the target sequences, and thereby assigns the GO term(s) of the most similar protein to the query protein. It is a composite method that combines the essence of sequence-based, structure-based and evolution-based methods for protein function prediction. The templates from the PRINTS and PFAM databases contribute the functional profiles or signatures (sequence). The mismatch and deletion states in the HMM profiles of the PFAM templates account for the degeneracy due to evolution, and the secondary structure information of the match states in the HHM profiles contributes the structural part. The use of HMM profiles along with the secondary structure information of PROSITE and PFAM sequence motifs to define the neural response gives our method an edge over other available methods in identifying remote homologues, as profile-profile alignments are superior to PSI-BLAST based methods in detecting remote homologues. Thus NRProF can complement most of the existing methods.
Our method is computationally less complex than the other methods, because the initial neural responses of the proteins in the base dataset with respect to the template library are computed only once; from there, the neural response between the query and a target is computed with little computational effort, unlike in BLAST/PSI-BLAST based methods. The simple derived kernel adds to the computational simplicity of our method. We validated our method by 5-fold cross validation and obtained an accuracy of 82%. Given the criterion that a prediction is valid if and only if the actual GO term is the top ranked (1st rank) GO term returned by our method, 82% is a good accuracy. The classification accuracy of 83.8% on a test set of 400 proteins suggests that our method is highly specific in classifying similar proteins with respect to the relative distance between the respective GO terms. Upon further comparison of our method with the PFP and FFPred servers, which are the most sensitive function prediction servers to date, the GO term prediction accuracy of 86.93% indicates that our method is more accurate in predicting specific functions. Thus we conclude that our method is computationally simple yet accurate when compared with the other methods. This is achieved by simulating the neuronal behavior of the visual cortex in the human brain in the form of the neural response.
To demonstrate our approach, we used only the molecular function domain, with a total of 8,912 GO terms. We then extracted the proteins and their sequences belonging to each of the GO terms. To address redundancy we used CD-HIT, a program that removes redundant sequences and generates a database of only the representatives. These protein sequences and their respective GO terms were used as the base dataset for our model. We used only proteins from humans because we wanted to demonstrate the ability of our method to predict and characterize the function of proteins even if they are only remotely homologous to the pre-characterized (human) proteins.
Our template library in the bottom layer consists of HMM profiles from the Pfam database; thus we define the similarity between templates as profile-profile alignment scores. We had 10,257 profiles in the template library, making ~10^6 profile-profile alignments. To align the template HMM profiles we used HHsearch, which is the most sensitive profile-profile alignment tool to date [33–35]. As a refinement for better sensitivity, and to capture remote homology between the templates, we also considered the secondary structure information of the templates, which is considered more conserved and provides additional information. We have previously used secondary structure information to improve protein sequence alignment and remote homologue identification. Thus we converted the HMM profiles to HHM profiles containing the secondary structure information of all the match states in the HMM profiles. We employed HHsearch, which uses PSI-PRED to predict the secondary structure, and added the predictions to the HMM profiles. By doing this we were able to capture remotely homologous templates. Profile-profile alignments have been shown to be more sensitive than PSI-BLAST in the identification of remote similarity. Thus our method has an edge over the PFP server, which is based on PSI-BLAST, in detecting remote homologues.
which is the neural response of the pair (p_i, p_j) on the template set q_1, ..., q_m.
Finally, a mapping engine was built, which defines a function f lying in the reproducing kernel Hilbert space associated with a positive definite kernel K that is derived from the neural responses by inner products (linear kernel) or Gaussian radial basis functions (Gaussian kernel). First, we computed the neural response of all the proteins in the base dataset with respect to the template library in the top layer. The same neural response was computed for the query protein sequence. Next we computed the pairwise neural response N(i, j) between the query sequence i and each sequence j (1..n) in the base dataset. The mapping function f(N(i, j)) produces a value between 0 and 1 corresponding to the similarity between the proteins p_i and p_j. Thus, we can predict the label Y(i, j) = 1 (similar) if f(N(i, j)) ≥ 0.5, and Y(i, j) = 0 (non-similar) if f(N(i, j)) < 0.5. Other thresholds besides 0.5 are also allowed. We then assigned the query protein p_i to the GO term(s) associated with the protein(s) p_j whose label Y(i, j) was set to 1. In this case the sensitivity of the GO term assignments varies with the threshold used (0.5). To overcome this dependency on the threshold, we sorted the proteins in the base dataset in descending order of their similarity f(N(i, j)) to the query protein. We finally extracted the top 5 GO terms and assigned them to the query protein. By doing so, we not only overcome the threshold dependency problem but also use the ranking (the true value of f(N(i, j))) as the confidence score for multiple GO terms associated with a single protein.
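The ranking step described above can be sketched as follows. This is a minimal illustration, not the NRProF code: the scores stand in for f(N(i, j)), each base protein is assumed to carry a single GO term, and collapsing repeated GO terms to their best-scoring protein is an assumption made here for clarity. All names and identifiers are hypothetical.

```python
def top_go_terms(scores: dict, protein_to_go: dict, k: int = 5) -> list:
    """Rank base-dataset proteins by their similarity to the query and
    return the first k distinct GO terms with their scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    result, seen = [], set()
    for protein, score in ranked:
        go = protein_to_go[protein]
        if go in seen:
            continue  # keep only the best-scoring protein per GO term (assumption)
        seen.add(go)
        result.append((go, score))
        if len(result) == k:
            break
    return result

# Hypothetical similarity scores of a query against three base proteins.
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.4}
protein_to_go = {"P1": "GO:A", "P2": "GO:A", "P3": "GO:B"}
print(top_go_terms(scores, protein_to_go, k=2))
```

The returned scores play the role of the confidence values mentioned above, since the ranking itself does not depend on the 0.5 threshold.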
We used two popular classification engines, Support Vector Machines (SVM) and the Least-Squares classifier, as the mapping engine. The main difference between them is the loss function used for training: they use the hinge loss and the least-squares loss respectively. The performance of the two mapping engines is evaluated in the Results section.
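The difference between the two loss functions can be seen in a short sketch. It uses the textbook definitions with labels y in {-1, +1}, which is an assumption made for illustration (the labels Y(i, j) in this work are 0 and 1); the function names are hypothetical.

```python
def hinge_loss(y: int, f: float) -> float:
    """Hinge loss used by SVMs; y is the label in {-1, +1}, f the raw output."""
    return max(0.0, 1.0 - y * f)

def least_squares_loss(y: int, f: float) -> float:
    """Squared error used by the least-squares classifier."""
    return (y - f) ** 2

# A confidently correct output is not penalized by the hinge loss,
# whereas the least-squares loss still penalizes its distance from the label.
print(hinge_loss(1, 2.0), least_squares_loss(1, 2.0))
# A misclassified output is penalized by both losses.
print(hinge_loss(-1, 0.5), least_squares_loss(-1, 0.5))
```

The hinge loss is zero for any output beyond the margin, while the least-squares loss grows whenever the output moves away from the label in either direction.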
We thank Prof. Steve Smale of City University of Hong Kong for valuable discussion, and Alan Lai and Yan Wang of the University of Hong Kong for their critical comments.
This study was supported by grants (781511M, 778609M, N_HKU752/10) from the Research Grants Council of Hong Kong.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 1, 2012: Selected articles from The 5th IEEE International Conference on Systems Biology (ISB 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S1.