PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection

Wang, Jing; Zhang, Dabing; Li, Jing

doi:10.1186/1752-0509-7-S5-S9

Volume 7 Supplement 5

Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Systems Biology

Research
Open access
Published: 09 December 2013

PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection

Jing Wang^1,2,
Dabing Zhang¹ &
Jing Li^1,2,3

BMC Systems Biology volume 7, Article number: S9 (2013) Cite this article

5405 Accesses
26 Citations
Metrics details

Abstract

Background

Assessment of potential allergenicity of protein is necessary whenever transgenic proteins are introduced into the food chain. Bioinformatics approaches in allergen prediction have evolved appreciably in recent years to increase sophistication and performance. However, what are the critical features for protein's allergenicity have been not fully investigated yet.

Results

We presented a more comprehensive model in 128 features space for allergenic proteins prediction by integrating various properties of proteins, such as biochemical and physicochemical properties, sequential features and subcellular locations. The overall accuracy in the cross-validation reached 93.42% to 100% with our new method. Maximum Relevance Minimum Redundancy (mRMR) method and Incremental Feature Selection (IFS) procedure were applied to obtain which features are essential for allergenicity. Results of the performance comparisons showed the superior of our method to the existing methods used widely. More importantly, it was observed that the features of subcellular locations and amino acid composition played major roles in determining the allergenicity of proteins, particularly extracellular/cell surface and vacuole of the subcellular locations for wheat and soybean. To facilitate the allergen prediction, we implemented our computational method in a web application, which can be available at http://gmobl.sjtu.edu.cn/PREAL/index.php.

Conclusions

Our new approach could improve the accuracy of allergen prediction. And the findings may provide novel insights for the mechanism of allergies.

Background

Allergens are something that can induce type-I hypersensitivity reaction in atopic individuals mediated by Immunoglobulin E (IgE) responses [1–4], which are seriously harmful to human health. For instance, allergenic proteins in food and other hypersensitivity reactions are major causes of chronic ill health in affluent industrial nations, mostly against milk, eggs, peanuts, soy, or wheat, affecting up to 8% of infants and young children [5–7]. Moreover, the introduction of genetically modified foods and new modified proteins is increasing the risk of food allergy in susceptible individuals as well [8, 9]. Consequently, assessing the potential allergenicity of proteins is essential to prevent the inadvertent generation of new allergenic food by agricultural biotechnology.

In 2001, the World Health Organization (WHO) and Food and Agriculture Organization (FAO) proposed guidelines to assess the potential allergencity of a protein, an important part of which is to use bioinformatic methods to determine whether the primary structure (amino acid sequence) of a given protein is sufficiently similar to sequences of known allergenic proteins [10, 11]. In FAO/WHO rules, a protein is identified as a putative allergen if it has at least six contiguous amino acids matched exactly (rule 1) or a minimum of 35% sequence similarity over a window of 80 amino acids (rule 2) when compared with known allergens. Some researches have shown that the bioinformatic rules of FAO/WHO produced many false positives for allergen prediction [12–19]. Since then, a number of other computational prediction methods based on the protein structure or sequence similarity comparing with known allergens have been reported [18, 20–26]. For example, a new approach brought an increase of the precision from 37.6% to 94.8% by identifying motifs from known allergen in 2003 [18]. Statistical learning method SVM (support vector machine) was used for predicting allergens since 2006, and the input features of most SVM-based prediction approaches were compose of either amino acid composition or pair-wise sequence similarity score with known allergens' [20–24, 27]. Furthermore, using identifying epitope, allergen representative peptides or family featured peptides were also applied in the allergen prediction [20, 25, 26]. But the usage of these two methods was limited because very few epitopes and allergen representative peptides have been known until now.

In our previous study, it's observed that, although FAO/WHO criteria have a higher sensitivity and the motif-based approach may give a graph view on the key allergenic motif, we found that the SVM-based method is superior to the others in the accuracy of allergen prediction and processing time [28]. As described as above, a variety of bioinformatic methods for predicting allergen have been reported, most of these approaches depend upon the similarity of protein sequence or primary sequential properties between query protein and the known allergens only. Here, besides protein sequential features, we developed an improved model for identifying potential protein allergenicity using 128 features in terms of their biochemical, physicochemical, subcellular locations. And then, all features were ranked using mRMR (maximum relevance & minimum redundancy) method and an optimal model was rebuilt and evaluated with ten-fold cross validations. At last, we presented a web-based application with a friendly interface that allows users submit individual or batch prediction with query protein or protein list using our new method.

Methods

Datasets

1176 distinct allergen proteins were collected from Swiss-Prot Allergen Index, IUIS Allergen Nomenclature, SDAP [26] and ADFS [29], and were used as the positive dataset. To build a reliable negative dataset, we integrated the previously reported methods[13, 18, 22], and the following processing was done: (1) 522,019 protein entries were downloaded from Swiss-Prot (Swiss-Prot Release 2010_11 of 02-Nov-10); (2) the entries were removed, of which sequence identities > = 30% with any known allergen; (3) all sequences less than 50 amino acid were also discarded; (4) the same number of the negative samples were selected randomly from the remaining subjects in the following cross-validations of the evaluation.

Software

NCBI-BLAST (version 2.2.23) was used to find the similarity between sequences [30]. SSpro/ACCpro 4.03 [31, 32], for predicting secondary structure and solvent accessibility of protein, were obtained from http://download.igb.uci.edu/. In order to access a protein as an allergen or non-allergen, SVM method was implemented using LIBSVM software v3.0 [33], from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The mRMR program [34], from http://penglab.janelia.org/proj/mRMR/, was acquired for feature ranging and selection. A Perl script was written for protein features extraction and allergenicity prediction. ClustalX2 and Muscle was used for multiple sequence alignments with the default parameters [35, 36]. The NJ (Neighbour-Joining) tree was constructed with the aligned protein sequences using MEGA (version 5) with the following parameters: poisson correction, pairwise deletion, and bootstrap (1,000 replicates; random seed) [37].

Feature vector construction

(1) Features of biochemistry and physicochemistry

The following six kinds of biochemical and physicochemical properties were extracted from a given protein sequence: (1) amino acid composition (AAC), (2) molecular weight (MW), (3) hydrophobicity, (4) polarizability, (5) normalized van der Waals volume (NWV), and (6) polarity.

AAC is the fraction of each amino acid in a protein [20]. The fraction of all 20 natural amino acids was calculated using the Eq. (1).

Fraction of amino acid i = \frac{total number of amino acids (i)}{total number of amino acids in protein},

(1)

where $i$ can be any amino acid.

The molecular weight was considered in this study since some researches showed that it's related with allergen identification [38–42]. Except for AAC and MW that reflect global feature of a protein, of the above six types of properties, the construction of all the other four types of biochemical and physicochemical properties, which is related with a single amino acid in a given protein sequence was adopted from the report of Huang et al. [43]. Each of these local types of properties can be classified into three categories. For instance, an amino acid can be grouped as: polar, neutral or hydrophobic for the hydrophobicity. Similarly, the classifications of polarizability, NWV and polarity were also summarized in Table 1[44–46]. And then, in term of each type of property above, the 20 elements of original protein sequence can be recoded using the corresponding three local features such as P (polar), N (neutral) and H (hydrophobic). At last, with method developed by Huang et al. [43], the coded sequence can be integrated into the corresponding global features: C (composition), T (transition) and D (distribution). C refers to the global composition of each of the three groups (3 elements), while T is defined as the proportion of transformation of each pair letters on the total changes along the entire coded sequence (3 elements), and D expresses the distribution pattern of the code letters which is measured by the position of the first, 25%, 50%, 75%, and 100% of each of the three letters along the sequence (5*3 = 15 elements). Therefore the properties which classified into three categories would generate 21 features each (3+3+15 = 21).

Table 1 The classification of protein properties

Full size table

(2) Subcellular location description of proteins

The protein's subcellular location information was also incorporated in input features for SVM, because it is closely correlated with the function of a protein [47, 48]. There were 22 subcellular locations for eukaryotic proteins collect from UniProt [49], therefore, we represented the subcellular location features by a 22-dimensional vector $S L = (s l_{1}, s l_{2}, s l_{3}, \dots, s l_{22})$ , where $s l_{i} = 1$ refers that the query protein is located at the $i$ -th subcellular location site. Conversely, $s l_{i} = 0$ refers that the query protein is not found at the $i$ -th subcellular location site [43]. However, proteins have subcellular location annotations are in the minority. In order to solve this issue, we predicted the localization information for those without annotation based on the sequence similarity with location-known proteins. Upon the sequence similarity evaluated by BLAST [30], the query protein was considered to have the same subcellular locations with a location-known protein if the BLAST score was greater than 120 between them [43].

(3) Feature space

As mentioned above, hydrophobicity, polarizability, NWV and polarity generated 21 elements each. And there were 20 elements for AAC, 1 element for MW and 22 for subcellular locations. In addition, the length of protein was also counted as a component. Therefore, the total feature space to represent a protein sample contained (21*4+20+1+22+1) = 128 components, as listed in Additional file 1 for the details. Consequently, a protein sample can be formulated as a vector in a 128-D (dimensional) space; i.e.,

V = {[v_{1}, v_{2}, v_{3}, \cdot \cdot \cdot, v_{j}, \cdot \cdot \cdot, v_{128}]}^{T}

(2)

where $v_{j}^{}$ is the $j$ -th ( $j$ = 1,2,...,128) component of the protein.

To enhance the accuracy of SVM, each of the 128 features in Eq.2 was scaled by Eq.3.

v_{j} = (v_{j} - μ_{j}) / σ_{j} (j = 1, 2, \dots, 128)

(3)

where $μ_{j}$ is the mean, and $σ_{j}$ is the standard deviation of the $j$ -th component over all protein samples.

Feature selection

(1) mRMR method

mRMR method was developed to rank each feature according to its relevance to the target and redundancy with other features [34]. The program of mRMR was downloaded from http://penglab.janelia.org/proj/mRMR/, and run with the parameters: $λ = 1$ , m = MID.

(2) Incremental Feature Selection (IFS)

As mentioned above, the feature components could be ranked using mRMR method. But it's not uncovered that which components of the feature would be most necessary. The IFS method was adopted in this study to perform feature selection for analyzing the key properties related to allergenicity. Based on the ranked features obtained from the mRMR, 128 feature sets were constructed by adding one component to the set at a time in the order of mRMR features list. The $i$ -th set is formed like $S_{i}^{^{'}} = \{f_{1}^{^{'}}, f_{2}^{^{'}}, \dots, f_{i}^{^{'}}\} (1 \leq i \leq 128)$ , where $f_{i}^{^{'}}$ means the feature at the $i$ -th position after ranking by mRMR.

For each of feature sets, an SVM predictor was constructed and its ten-fold cross-validation performance was derived. Eventually, an IFS curve was obtained, with the component number $i$ as its X-axis and the corresponding sensitivity, specificity and accuracy as its Y-axis. If the IFS curve has a inflection point at X=h, the feature set that played a key role in allergenicity would be $S_{o p t i m a l} = \{f_{1}^{^{'}}, f_{2}^{^{'}}, \dots, f_{h}^{^{'}}\}$ .

Ten fold cross-validation

The performances of all methods applied in this study were evaluated using ten-fold cross-validation. The dataset was randomly partitioned into ten subsets, where each subset has nearly equal number of allergens and non-allergens (negative controls). Of the ten subsets, a single set was retained as the validation data for testing the method, and the remaining nine subsets were used as training data. This process was then repeated 10 times with each of the ten subsets used exactly once as the validation data. The overall performance of a method was the average performance over ten subsets.

Results

Model construction with IFS

As described in the method section, 128 feature sets were built, and the corresponding prediction models were then constructed and evaluated. As shown in Figure 1, it reached the inflection point of IFS curve at accuracy of 91.03% when the number of feature components used was 25. In other words, these 25 feature components selected by mRMR would compose the critical feature set for the classifier of allergen/non-allergen. We analyzed the 25 feature components in the next section to understand key factors for protein's allergenicity.

Optimization of feature components

To investigate which features are crucial for protein's allergenicity, we extracted the 25 feature components at the inflection point from mRMR list, in which two of five property types, "subcellular locations" and "amino acid composition", were significantly enriched by hypergeometric test (p-value < 0.05, Benjamini-Hochberg correction) (Table 2). A heatmap in Figure 2 also illustrated that the features of AAC and SL (subcellular locations) were remarkable [50]. We further try to figure out which of the 22 subcellular locations of particular importance in allergen prediction by taking look at the SL distribution in soybean (Glycine max) and wheat (Triticum aestivum). So far, these two species had most known allergenic proteins. The results revealed that endoplasmic reticulum for soybean only and other two SL (extracellular/cell surface and vacuole) for both soybean and wheat were significantly more enriched in allergens compared to randomly selected proteins (p-value < 0.05) (Table 3 and Additional file 2).

Table 2 The optimal feature components

Full size table

Table 3 The subcellular location analysis

Full size table

Allergen predicting by category

Since people who concern about allergenicity usually focus more on a specific species or category like food-plant rather than all species, we performed a multi-alignment and constructed a phylogeny tree using MEGA software (version 5.0) [37] for 116 allergens which sequence length is between 240 and 600, from the biggest two sub-families in six major categories (Aero-Fungi, Animal, Apple, Food-Plant, Mite and Pollen) respectively. 909 allergens were included in these six major categories, which account for over 77% of all allergens. The NJ (Neighbour-Joining) tree (Figure 3, Additional file 3) illustrated that the sequences of allergens were more conservative within category than between categories. Hence, we attempted to build and evaluate our predictor within Aero-Fungi, Animal, Apple, Food-Plant, Mite and Pollen individually. As displayed in Figure 4, the category-specific models in Pollen and Apple outperformed full model. Even the accuracy of allergen prediction in Apple can reach 100%.

Comparison with existing methods

We compared the performance of our method with the existing approaches for allergen prediction. So far there are three major kinds of computational methods for allergen prediction including FAO/WHO criteria, motif-based method and SVM-based method. Among the SVM-based methods, SVM-AAC taking the amino acid composition as feature vectors is mostly common used. The ROC curves illustrated the superiority of our 128-D feature vector models to the others, in which the overall accuracy reached its peak of 93.42% (Figure 5).

Web-based application

A web server named PREAL (http://gmobl.sjtu.edu.cn/PREAL/index.php) has been developed that allows people evaluate the potential allergenicity of protein(s) on-line using our new method. When a query protein sequence in FASTA format is given, PREAL will report the putative allergenicity. Besides, both category-specific and full model are available in PREAL. PREAL also provides batch prediction, which returns the results by E-mail. A snapshot of the prediction page of PREAL was displayed in Figure 6.

Discussion and conclusions

The aim of this study is to predict the potential allergenicity of proteins efficiently and analyze the key factors resulted in allergenicity. We developed a new SVM-based model by integrating various biochemical and physicochemical properties, as well as sequential features and subcellular locations. The ten-fold cross-validation indicated that the predictor can achieve from 93.42% to 100% overall accuracy. Considering the secondary structure propensity and solvent accessibility contribute to the protein's stability and function, we also expanded our model by adding these two kinds of property. As predicted by SSpro [31], an amino acid can be grouped as: helix, strand or coil for the secondary structure propensity (SSP), and the solvent accessibility can be classified into buried or exposed to solvent predicted by ACCpro (Table 1) [32]. Finally the model can be formulated as a vector in a 156-D (dimensional) space. But the corresponding evaluation indicated the overall accuracy could be increased only 0.01 by the 156 features model while its running time was more than 60 times longer than the 128-D model.

With the feature selection procedure based on the mRMR and IFS methods, we found that the subcellular locations and amino acids composition would play the crucial roles in determining the allergenicity of a protein. For soybean and wheat, the extracellular/cell surface and vacuole are observed to be the exactly effective locations. Key effect factors for allergenicity have not been reported before. Because allergenic proteins had higher sequence similarities within categories, we also carried out the predictor in six major sub-sets in which higher accuracy was obtained. To facilitate application, we built a web-based application providing the prediction approach presented in this paper on-line, so that people can perform a test even large-scale testing expediently.

Despite this, there are some issues should be addressed in further the study. Although the allergen prediction within category preformed pretty well, small amount of allergenic proteins were captured within some category limited its wide usage. Another issue is the difficulty in effective validation of a new method presented by wet experiments expect for the cross-validation.

Abbreviations

mRMR:: Maximum Relevance Minimum Redundancy method
IFS:: Incremental Feature Selection
GO:: Gene Ontology
WHO:: World Health Organization
FAO:: Food and Agriculture Organization
SVM:: support vector machine
ARP:: Allergen Representative Peptides
AAC:: amino acid composition
MW:: molecular weight
SSP:: secondary structure propensity
NWV:: normalized van der Waals volume
NJ:: neighbour joining.

References

Goldsby RA, Kindt TJ, Osborne BA, Kuby J: Immunology. 2003, New York: W.H. Freeman and Company, 5
Google Scholar
Nadler MJ, Matthews SA, Turner H, Kinet JP: Signal transduction by the high-affinity immunoglobulin E receptor Fc epsilon RI: coupling form to function. Adv Immunol. 2000, 76: 325-355.
Article CAS PubMed Google Scholar
Metzger H: The high affinity receptor for IgE on mast cells. Clin Exp Allergy. 1991, 21 (3): 269-279. 10.1111/j.1365-2222.1991.tb01658.x.
Article CAS PubMed Google Scholar
Johansson SG, Bieber T, Dahl R, Friedmann PS, Lanier BQ, Lockey RF, Motala C, Ortega Martell JA, Platts-Mills TA, Ring J, et al: Revised nomenclature for allergy for global use: Report of the Nomenclature Review Committee of the World Allergy Organization, October 2003. J Allergy Clin Immunol. 2004, 113 (5): 832-836. 10.1016/j.jaci.2003.12.591.
Article CAS PubMed Google Scholar
Sampson HA: Food allergy. Part 1: immunopathogenesis and clinical disorders. J Allergy Clin Immunol. 1999, 103 (5 Pt 1): 717-728.
Article CAS PubMed Google Scholar
Sampson HA: Food allergy. Part 2: diagnosis and management. J Allergy Clin Immunol. 1999, 103 (6): 981-989. 10.1016/S0091-6749(99)70167-3.
Article CAS PubMed Google Scholar
Sampson HA: Food allergy: when mucosal immunity goes wrong. J Allergy Clin Immunol. 2005, 115 (1): 139-141. 10.1016/j.jaci.2004.11.003.
Article PubMed Google Scholar
Taylor SL: Protein allergenicity assessment of foods produced through agricultural biotechnology. Annu Rev Pharmacol Toxicol. 2002, 42: 99-112. 10.1146/annurev.pharmtox.42.082401.130208.
Article CAS PubMed Google Scholar
Lee YH, Sinko PJ: Oral delivery of salmon calcitonin. Adv Drug Deliv Rev. 2000, 42 (3): 225-238. 10.1016/S0169-409X(00)00063-6.
Article CAS PubMed Google Scholar
FAO/WHO: Evaluation of allergenicity of genetically modified foods. Report of a joint FAO/WHO expert consultation on allergenicity of foods derived from biotechnology. 2001
Google Scholar
FAO/WHO: Report of the fourth session of the codex ad hoc intergovernmental task force on foods derived from biotechnology. 2003
Google Scholar
Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, Hefle SL: Bioinformatic methods for allergenicity assessment using a comprehensive allergen database. Int Arch Allergy Immunol. 2002, 128 (4): 280-291. 10.1159/000063861.
Article CAS PubMed Google Scholar
Bjorklund AK, Soeria-Atmadja D, Zorzet A, Hammerling U, Gustafsson MG: Supervised identification of allergen-representative peptides for in silico detection of potentially allergenic proteins. Bioinformatics. 2005, 21 (1): 39-50. 10.1093/bioinformatics/bth477.
Article PubMed Google Scholar
Gendel SM: Sequence analysis for assessing potential allergenicity. Ann N Y Acad Sci. 2002, 964: 87-98.
Article CAS PubMed Google Scholar
Kleter GA, Peijnenburg AA: Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE - binding linear epitopes of allergens. BMC Struct Biol. 2002, 2: 8-10.1186/1472-6807-2-8.
Article PubMed Central PubMed Google Scholar
Li KB, Issac P, Krishnan A: Predicting allergenic proteins using wavelet transform. Bioinformatics. 2004, 20 (16): 2572-2578. 10.1093/bioinformatics/bth286.
Article CAS PubMed Google Scholar
Silvanovich A, Nemeth MA, Song P, Herman R, Tagliani L, Bannon GA: The value of short amino acid sequence matches for prediction of protein allergenicity. Toxicol Sci. 2006, 90 (1): 252-258.
Article CAS PubMed Google Scholar
Stadler MB, Stadler BM: Allergenicity prediction by protein sequence. FASEB. 2003
Google Scholar
Aalberse RC: Structural biology of allergens. J Allergy Clin Immunol. 2000, 106 (2): 228-238. 10.1067/mai.2000.108434.
Article CAS PubMed Google Scholar
Saha S, Raghava GPS: AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Research. 2006, 34 (Web Server): W202-W209. 10.1093/nar/gkl343.
Article PubMed Central CAS PubMed Google Scholar
Muh HC, Tong JC, Tammi MT: AllerHunter: a SVM-pairwise system for assessment of allergenicity and allergic cross-reactivity in proteins. PLoS One. 2009, 4 (6): e5861-10.1371/journal.pone.0005861.
Article PubMed Central PubMed Google Scholar
Barrio AM, Soeria-Atmadja D, Nister A, Gustafsson MG, Hammerling U, Bongcam-Rudloff E: EVALLER: a web server for in silico assessment of potential protein allergenicity. Nucleic Acids Research. 2007, 35 (Web Server): W694-W700. 10.1093/nar/gkm370.
Article PubMed Central Google Scholar
Cui J, Han LY, Li H, Ung CY, Tang ZQ, Zheng CJ, Cao ZW, Chen YZ: Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Molecular Immunology. 2007, 44 (4): 514-520. 10.1016/j.molimm.2006.02.010.
Article CAS PubMed Google Scholar
Soeria-Atmadja D: Computational detection of allergenic proteins attains a new level of accuracy with in silico variable-length peptide extraction and machine learning. Nucleic Acids Research. 2006, 34 (13): 3779-3793. 10.1093/nar/gkl467.
Article PubMed Central CAS PubMed Google Scholar
Ivanciuc O, Midoro-Horiuti T, Schein CH, Xie L, Hillman GR, Goldblum RM, Braun W: The property distance index PD predicts peptides that cross-react with IgE antibodies. Mol Immunol. 2009, 46 (5): 873-883. 10.1016/j.molimm.2008.09.004.
Article PubMed Central CAS PubMed Google Scholar
Schein CH, Ivanciuc O, Braun W: Structural Database of Allergenic Proteins (SDAP). Food Allergy. 2006, Edited by SJ M. Washington D.C: ASM Press, 257-283.
Chapter Google Scholar
Zhang L, Huang Y, Zou Z, He Y, Chen X, Tao A: SORTALLER: predicting allergens using substantially optimized algorithm on allergen family featured peptides. Bioinformatics. 2012, 28 (16): 2178-2179. 10.1093/bioinformatics/bts326.
Article CAS PubMed Google Scholar
Wang J, Yu Y, Zhao Y, Zhang D, Li J: Evaluation and integration of existing methods for computational prediction of allergens. BMC Bioinformatics. 2013, S1-14 Suppl 4
Nakamura R, Teshima R, Takagi K, Sawada J: [Development of Allergen Database for Food Safety (ADFS): an integrated database to search allergens and predict allergenicity]. Kokuritsu Iyakuhin Shokuhin Eisei Kenkyusho Hokoku. 2005, 32-36. 123
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Article PubMed Central CAS PubMed Google Scholar
Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002, 47 (2): 228-235. 10.1002/prot.10082.
Article CAS PubMed Google Scholar
Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins. 2002, 47 (2): 142-153. 10.1002/prot.10069.
Article CAS PubMed Google Scholar
Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3):
Peng H, Long F, Ding C: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005, 27 (8): 1226-1238.
Article PubMed Google Scholar
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997, 25 (24): 4876-4882. 10.1093/nar/25.24.4876.
Article PubMed Central CAS PubMed Google Scholar
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.
Article PubMed Central CAS PubMed Google Scholar
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.
Article PubMed Central CAS PubMed Google Scholar
Gomez L, Martin E, Hernandez D, Sanchez-Monge R, Barber D, del Pozo V, de Andres B, Armentia A, Lahoz C, Salcedo G: Members of the alpha-amylase inhibitors family from wheat endosperm are major allergens associated with baker's asthma. FEBS Lett. 1990, 261 (1): 85-88. 10.1016/0014-5793(90)80642-V.
Article CAS PubMed Google Scholar
Nakase M, Usui Y, Alvarez-Nakase AM, Adachi T, Urisu A, Nakamura R, Aoki N, Kitajima K, Matsuda T: Cereal allergens: rice-seed allergens with structural similarity to wheat and barley allergens. Allergy. 1998, 53 (46 Suppl): 55-57.
Article CAS PubMed Google Scholar
Shewry PR, Beaudoin F, Jenkins J, Griffiths-Jones S, Mills EN: Plant protein families and their relationships to food allergy. Biochem Soc Trans. 2002, 30 (Pt 6): 906-910.
Article CAS PubMed Google Scholar
Hoffmann-Sommergruber K: Pathogenesis-related (PR)-proteins identified as allergens. Biochem Soc Trans. 2002, 30 (Pt 6): 930-935.
Article CAS PubMed Google Scholar
Breiteneder H: Thaumatin-like proteins -- a new family of pollen and fruit allergens. Allergy. 2004, 59 (5): 479-481. 10.1046/j.1398-9995.2003.00421.x.
Article PubMed Google Scholar
Huang T, Shi XH, Wang P, He Z, Feng KY, Hu L, Kong X, Li YX, Cai YD, Chou KC: Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS One. 2010, 5 (6): e10972-10.1371/journal.pone.0010972.
Article PubMed Central PubMed Google Scholar
Chothia C, Finkelstein AV: The classification and origins of protein folding patterns. Annu Rev Biochem. 1990, 59: 1007-1039. 10.1146/annurev.bi.59.070190.005043.
Article CAS PubMed Google Scholar
Fauchere JL, Charton M, Kier LB, Verloop A, Pliska V: Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res. 1988, 32 (4): 269-278.
Article CAS PubMed Google Scholar
Grantham R: Amino acid difference formula to help explain protein evolution. Science. 1974, 185 (4154): 862-864. 10.1126/science.185.4154.862.
Article CAS PubMed Google Scholar
Chou KC, Shen HB: Recent progress in protein subcellular location prediction. Anal Biochem. 2007, 370 (1): 1-16. 10.1016/j.ab.2007.07.006.
Article CAS PubMed Google Scholar
Chou KC, Shen HB: Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc. 2008, 3 (2): 153-162. 10.1038/nprot.2007.494.
Article CAS PubMed Google Scholar
The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38 (Database): D142-148.
Google Scholar
Team RDC: R: A language and environment for statistical computing. 2009, Vienna, Austria: R Foundation for Statistical Computing
Google Scholar

Download references

Acknowledgements

The authors would like to thank Dr. Tao Huang for the helpful discussing on sequence recoding and subcellular locations annotation.

This work was supported by the Funds from National Basic Research Program of China (973 Program) (2012CB720804) and National Transgenic Plant Special Fund (2011ZX08011-006, 2011ZX08011-002, 2011BAK10B03), Program for "Chen Xing" Young Scholars, Shanghai Jiao Tong University, and Pujiang Talent program (12PJ1406600).

Declarations

The publication costs for this article were funded by National Transgenic Plant Special Fund of China (2011ZX08011-006).

This article has been published as part of BMC Systems Biology Volume 7 Supplement 5, 2013: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S5.

Author information

Authors and Affiliations

Bor Luh Food Safety Center, National Center for Molecular Characterization of Genetically Modified Organisms, State Key Laboratory of Hybrid Rice, School of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai, 200240, People's Republic of China
Jing Wang, Dabing Zhang & Jing Li
Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, China
Jing Wang & Jing Li
Shanghai Center for Bioinformation Technology, China
Jing Li

Authors

Jing Wang
View author publications
You can also search for this author in PubMed Google Scholar
Dabing Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jing Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jing Li.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JW carried out the programming and analysis studies, and drafted the manuscript. JL conceived of the study, and participated in the manuscript draft. DBZ supervised the research. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: The 128 features for allergen protein identification. (XLS 21 KB)

12918_2013_1219_MOESM2_ESM.xls

Additional file 2: The statistical data of subcellular locations for soybean and wheat. There are 22 subcellular locations (SL) for eukaryotic proteins. Only SL terms located by 3 more allergens were calculated. (XLS 20 KB)

12918_2013_1219_MOESM3_ESM.tif

Additional file 3: The NJ tree of 116 allergen sequences from six categories. The topology of this tree was generated using MEGA 5, summarizing the evolutionary relationships among the allergens from different categories. The branches of the same category were color-coded. The NJ tree was consisted of 116 allergen proteins which met the condition of sequence length is between 240 and 600, and protein family accounted for a higher proportion within the categories. (TIF 998 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Wang, J., Zhang, D. & Li, J. PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection. BMC Syst Biol 7 (Suppl 5), S9 (2013). https://doi.org/10.1186/1752-0509-7-S5-S9

Download citation

Published: 09 December 2013
DOI: https://doi.org/10.1186/1752-0509-7-S5-S9

Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Systems Biology

PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection

Abstract

Background

Results

Conclusions

Background

Methods

Datasets

Software

Feature vector construction

(1) Features of biochemistry and physicochemistry

(2) Subcellular location description of proteins

(3) Feature space

Feature selection

(1) mRMR method

(2) Incremental Feature Selection (IFS)

Ten fold cross-validation

Results

Model construction with IFS

Optimization of feature components

Allergen predicting by category

Comparison with existing methods

Web-based application

Discussion and conclusions

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Electronic supplementary material

Additional file 1: The 128 features for allergen protein identification. (XLS 21 KB)

12918_2013_1219_MOESM2_ESM.xls

12918_2013_1219_MOESM3_ESM.tif

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Systems Biology

Contact us