Volume 1 Supplement 1
A probabilistic context-free grammar for the detection of binding sites from a protein sequence
© Dyrka and Nebel; licensee BioMed Central Ltd. 2007
Published: 8 May 2007
The analysis of a protein through the evaluation of interactions between the amino acids composing its sequence is a very challenging problem, for which pattern recognition techniques based on Hidden Markov Models (HMMs) have proved to be the most efficient. Although the HMM is a powerful technique, it has limitations. According to formal language theory, its expressive power is similar to that of probabilistic regular grammars. A more powerful class of grammar, the Context-Free Grammar (CFG), has been applied successfully to the recognition and prediction of RNA structure [1, 2]. However, its use in protein pattern recognition is a more challenging task because of the larger set of terminals and the less straightforward relations between residues. In this work, we propose a Probabilistic Context-Free Grammar (PCFG) to represent features of protein structures. To deal with the size of the protein alphabet, we use quantitative properties of amino acids to reduce the number of symbols. Based on that grammar, we designed a tool that detects protein regions involved in binding sites. The PCFG is evolved using a genetic algorithm (GA) to describe a pattern shared by a set of proteins.
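To make the difference in expressive power concrete, the following minimal sketch scores a symbol string with a toy PCFG using the standard inside algorithm. Everything in it is an assumption made for illustration: the grammar, its probabilities, the two-symbol alphabet ('l' for a low and 'h' for a high property level) and the function name `inside_probability` are not taken from this work, and the grammar is assumed to be in Chomsky normal form. The toy grammar only accepts nested pairings of the form l...lh...h, a long-range dependency that a regular grammar cannot enforce.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not the grammar of this work).
# Binary rules: (lhs, left, right) -> probability
BINARY = {
    ("S", "L", "X"): 0.4,
    ("S", "L", "H"): 0.6,
    ("X", "S", "H"): 1.0,
}
# Terminal rules: (lhs, terminal) -> probability
UNARY = {
    ("L", "l"): 1.0,
    ("H", "h"): 1.0,
}


def inside_probability(symbols, unary, binary, start="S"):
    """Probability that `start` derives `symbols` under the given PCFG."""
    n = len(symbols)
    if n == 0:
        return 0.0
    # chart[(i, j)][A] = probability that A derives symbols[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, sym in enumerate(symbols):
        for (lhs, term), p in unary.items():
            if term == sym:
                chart[(i, i + 1)][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (lhs, left, right), p in binary.items():
                    p_left = chart[(i, k)].get(left, 0.0)
                    p_right = chart[(k, j)].get(right, 0.0)
                    if p_left and p_right:
                        chart[(i, j)][lhs] += p * p_left * p_right
    return chart[(0, n)].get(start, 0.0)


nested = inside_probability("llhh", UNARY, BINARY)     # nested pairs: accepted
crossing = inside_probability("lhlh", UNARY, BINARY)   # not nested: probability 0
```

In a setting like the one described here, the rule probabilities would be the quantities tuned by the genetic algorithm, while a routine of this kind would supply the score of each candidate grammar on a sequence fragment.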
Terminal rules based on amino acid properties
Our method relies on the selection of amino acid properties to deal with the size of the protein alphabet: quantitative properties taken from the AAindex database are used to reduce the number of symbols. This database holds the values of over 500 properties for the 20 amino acids. The properties can be clustered into 6 categories: beta propensity, alpha and turn propensities, composition, physicochemical properties, hydrophobicity and other properties. An appropriately small set of properties can be chosen either by expert knowledge or by a principal component analysis (PCA) of preselected properties that reflects the composition of the learning set (weighted PCA, wPCA). For each property, 3 non-terminals (low, medium and high level) are created. Terminal rules are then produced that associate a set of 3 probabilities with each amino acid.
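One way to picture such terminal rules is sketched below. The text does not state how the 3 probabilities are derived from a property value, so the triangular weighting used here is purely an assumption, as are the function name `level_probabilities` and the choice of the Kyte-Doolittle hydropathy scale as the example property.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def level_probabilities(scale):
    """Map each amino acid to probabilities for the low/medium/high levels.

    Assumed scheme (not from the source): the property value is rescaled to
    [0, 1] and split between neighbouring levels with triangular weights,
    so the 3 probabilities of each amino acid sum to 1.
    """
    lowest, highest = min(scale.values()), max(scale.values())
    table = {}
    for aa, value in scale.items():
        x = (value - lowest) / (highest - lowest)
        low = max(0.0, 1.0 - 2.0 * x)
        high = max(0.0, 2.0 * x - 1.0)
        medium = 1.0 - low - high
        table[aa] = {"low": low, "medium": medium, "high": high}
    return table


rules = level_probabilities(HYDROPATHY)
# rules["I"] puts all its weight on "high", rules["R"] on "low";
# intermediate residues such as glycine are shared between two levels.
```

With this encoding the grammar only has to reason about 3 levels per property instead of 20 distinct residues, which is the reduction in symbols that the method aims for.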
Our technique was successfully tested on a PROSITE pattern (PS00219) that has a high false negative rate. As expected, the results show that the choice of amino acid property is key to prediction accuracy. In this case, a good prediction rate was achieved using grammars based on either the charge or the van der Waals volume of amino acids. Results for other properties, such as beta sheet, alpha helix and relative frequency, are poor, as these properties seem to be only weakly related to the binding site function. A grammar based on accessibility performs slightly better. To automate the property selection, we processed a set of 5 amino acid properties representing the following 5 clusters: beta propensity, alpha and turn propensities, composition, physicochemical properties and hydrophobicity. Very good results were achieved for the first wPCA vector (standard PCA did not perform as well). However, because of the poor results obtained with the second wPCA vector, further investigation is needed before the validity of this approach can be established. We also observed that window size did not have a major impact on detection accuracy. Finally, our experiments showed that the best performance was achieved by combining grammars. These combined grammars proved to be more accurate than the PROSITE pattern.
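The window-based scanning and the combination of grammars mentioned above can be pictured with the following sketch. The combination rule is not described in the text, so averaging the scores is an assumption, and the two scorer functions are simple stand-ins for grammar scores rather than grammars from this work. The names `scan`, `charged_fraction` and `acidic_fraction` are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a scorer maps a sequence window to a score.
Scorer = Callable[[str], float]


def scan(sequence: str, scorers: Sequence[Scorer], window: int,
         threshold: float) -> List[Tuple[int, float]]:
    """Return (start, combined score) for every window scoring >= threshold.

    The combined score is the plain mean of the individual scores, which is
    an assumed combination rule chosen for simplicity.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        fragment = sequence[start:start + window]
        combined = mean(score(fragment) for score in scorers)
        if combined >= threshold:
            hits.append((start, combined))
    return hits


# Toy stand-ins for per-grammar scores (not real grammars).
def charged_fraction(fragment: str) -> float:
    return sum(aa in "DEKR" for aa in fragment) / len(fragment)


def acidic_fraction(fragment: str) -> float:
    return sum(aa in "DE" for aa in fragment) / len(fragment)


hits = scan("LLKDELLL", [charged_fraction, acidic_fraction],
            window=4, threshold=0.5)
```

Replacing the stand-in scorers with the scores of evolved grammars, each built on a different amino acid property, would give a combined detector of the kind evaluated here.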
A PCFG based on a quantitative representation of amino acid properties proved to be successful for PS00219. Our process of automated property selection based on weighted PCA is also encouraging, as it contributed to the most accurate results. In the future, we plan to improve the automated property selection process and refine our procedure for combining grammars. We will also introduce a scoring scheme that is independent of window size and speed up the convergence of the evolution process. Finally, tests will be performed on a large variety of binding sites.
- Sakakibara Y: Grammatical Inference in Bioinformatics. IEEE Trans on PAMI. 2005, 27: 1051-1062.
- Sakakibara Y, Brown M, Hughey R, Mian IS, Sjolander K, Underwood RC, Haussler D: Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res. 1994, 22: 5112-5120. 10.1093/nar/22.23.5112
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28: 374- 10.1093/nar/28.1.374