Innovative proteomics technologies promise to chart protein degradation on a large scale [1–3]. The resulting data sets present an opportunity to further our understanding of metabolic protein stability through informed data analysis and the development and testing of computational models. The present study makes use of Yen and colleagues’ extensive data set, which measures the metabolic stability of about 8000 human proteins. We use this data set to (a) identify the underlying properties that appear to influence protein half-life, (b) develop a predictive model that integrates a number of different relevant data sets and is able to explain its predictions, (c) chart the metabolic stability of the full human proteome in silico, and thereby (d) infer which features influence stability on a global scale.
High-throughput methods for measuring protein degradation typically involve either metabolic labeling or protein tagging. Of the four largest data sets currently available, three were generated in human cells, one using labeling and two using a tagging approach [1, 3], while the fourth data set was generated using protein tagging in yeast.
Doherty and colleagues used stable isotope labeling with amino acids in cell culture (SILAC) coupled with mass spectrometry (MS) to measure the degradation rate of 576 proteins. Cells were grown in medium containing 13C6-arginine before being transferred to 12C6-arginine medium. MS was used to measure the declining abundances of 13C6-labeled proteins over several time points, and a degradation rate was calculated by fitting the abundances to a single exponential curve. One advantage of this method is that metabolic labeling causes minimal cell perturbation. However, by nature of using mass spectrometry, the measurements will be biased towards highly abundant, and theoretically more stable, proteins.
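The rate estimate in such a pulse-chase design reduces to fitting a single exponential to the labeled-protein abundances. A minimal sketch of that fit (function names are our own, not from the study), using the fact that first-order decay is linear in log-abundance:

```python
import numpy as np

def degradation_rate(times, abundances):
    """Estimate a first-order degradation rate constant k by fitting
    abundance(t) = A0 * exp(-k * t), i.e. linear regression of
    log-abundance against time."""
    logs = np.log(np.asarray(abundances, dtype=float))
    slope, _intercept = np.polyfit(np.asarray(times, dtype=float), logs, 1)
    return -slope  # the slope of log-abundance over time is -k

def half_life(k):
    """Half-life implied by a first-order rate constant: t1/2 = ln(2) / k."""
    return np.log(2) / k
```

For noiseless synthetic data generated with k = 0.1, `degradation_rate` recovers 0.1 and `half_life` returns ln(2)/0.1 ≈ 6.93 time units.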
Belle and colleagues used tandem affinity purification (TAP) tagging, along with cycloheximide inhibition of translation and western blotting, to measure the half-life of over 3000 yeast proteins. The most recent tag-based method is called “bleach chase” and was used to measure the half-life of 100 proteins from H1299 cells. Proteins were tagged with yellow fluorescent protein, and their half-lives were inferred by “bleaching” some cells with a pulse of light and measuring the difference in fluorescence decay between bleached and non-bleached cells.
Another tag-based approach, implemented in human HEK293T cells, used a dual-fluorescent tagging method called global protein stability profiling (GPSP). The GPSP method uses two fluorescent proteins, enhanced green fluorescent protein (EGFP) and Discosoma sp. red (DsRed), which are expressed from a single mRNA transcript. The DsRed protein acts as a control, while EGFP is expressed as an N-terminal fusion with a protein of interest. Coupling this approach with fluorescence-activated cell sorting (FACS) and microarray analysis, the authors were able to measure the stability of approximately 8000 human proteins, and it is this data set we use in our study.
An important consideration of N-terminal fusion is the interference that the EGFP tag could have on the function of N-terminal signal sequences. A recent review on the use of fluorescent protein tagging points out that approximately one third of human protein-coding genes contain position-dependent sequence information. In the case of proteins with N-terminal signal peptides, or signal anchors, the fusion of a fluorescent protein to the N-terminus is likely to interfere with normal localization. Indeed, Yen and colleagues found that unstable proteins were enriched for membrane protein gene ontology (GO) terms, but remark that it is unclear what effect fluorescent tagging will have upon the measurement of global degradation rates.
Huang and colleagues recently explored a range of predictive features in the GPSP data set and showed that a simple associative model can classify protein stability with reasonable accuracy, as evaluated on the same data set. However, without paying attention to the potential bias caused by N-terminal tagging, a computational model may inherit the same biases. Our paper therefore presents a protein stability model based on the largest of the available protein degradation data sets, with an emphasis on minimising experimental bias. Indeed, it may be possible to discount the influence of experimental artefacts by first exploring and understanding their impact on models.
We created a method for classifying proteins as having a high metabolic stability (i.e. long half-life) or low stability. We developed this method using the GPSP stability data set, which is by far the most extensive available, and thus easiest to cross-reference to other complementary data resources. We considered that this data set may contain a bias portraying proteins with N-terminal signal peptides and anchors as metabolically unstable due to interference caused by the experimental technique. Consequently, we developed and tested models on two sets of proteins: a full set, and a trimmed set with secreted and transmembrane proteins removed.
Using complementary resources, including the Human Protein Reference Database (HPRD), we explored a wide range of predictive features. We identified groups of features that are statistically enriched in stable and in unstable proteins, ultimately to understand whether they can be used to infer metabolic stability. We subsequently designed a model that explicitly recognizes and integrates known factors of the relevant processes, and employed machine learning to optimise its ability to generalize to novel proteins. Finally, to illustrate metabolic stability on a system scale, we used the model to score the stability of all proteins contained in the HPRD.
Features relevant to protein stability
Protein degradation via the proteasome is mediated through poly-ubiquitination. However, there are a number of other post-translational modifications (PTMs) for which a role in either targeting proteins to, or protecting proteins from, the degradative machinery has been suggested. These modifications include phosphorylation, prolyl hydroxylation, glycosylation and small ubiquitin-related modifier (SUMO) conjugation. For example, phosphorylation can either promote or inhibit ubiquitination by regulating the E3 ligases responsible for ubiquitination. Glycosylation is used as a form of protein quality control in the endoplasmic reticulum, the folding location of most secreted and integral membrane proteins, and this modification can act as a signal for misfolded proteins to be translocated to the cytosol for degradation.
A variety of features have been proposed to influence the targeting of a protein to the ubiquitin proteasome system. The N-end rule is one of the best documented examples of sequence-based degradation signals. The rule originally stated that the in vivo half-life of a protein is associated with the identity of its N-terminal residue, causing high levels of binding selectivity for the E3 ligases that target proteins for ubiquitin-mediated degradation. The rule classified N-terminal residues as stabilizing, or belonging to one of three classes of destabilizing residues: primary, secondary or tertiary. Recently, the rule was extended when it was discovered that N-terminal acetylation of most amino acids acts to create N-degrons. Now it is believed that all but two amino acids (glycine and proline) can act as degradation signals upon acetylation, and are therefore considered “destabilizing”. However, the authors note that many proteins may still have a high level of metabolic stability despite the presence of an N-degron. It is possible that long-lived proteins are protected from degradation through complex formation or folding that makes the N-degron inaccessible.
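The original (pre-acetylation) formulation of the rule is essentially a lookup on the N-terminal residue. The sketch below illustrates that idea; the residue groupings are indicative only, since the exact sets differ between organisms and between revisions of the rule:

```python
# Illustrative residue sets for the original N-end rule; treat these as
# placeholders, not a reference table.
PRIMARY = set("FLWYIRKH")    # recognized directly by N-recognin E3 ligases
SECONDARY = set("DE")        # destabilizing after conjugation of arginine
TERTIARY = set("NQ")         # destabilizing after deamidation to D/E

def n_end_class(n_terminal_residue):
    """Return the N-end rule class of a single N-terminal residue."""
    if n_terminal_residue in PRIMARY:
        return "primary"
    if n_terminal_residue in SECONDARY:
        return "secondary"
    if n_terminal_residue in TERTIARY:
        return "tertiary"
    return "stabilizing"
```

A real implementation would also need to model co-translational processing (e.g. initiator methionine removal and acetylation), which this sketch omits.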
It has been suggested that structural disorder is correlated with protein stability. A bioinformatic study on yeast data found that structural disorder had a weak but significant inverse correlation with protein half-life. However, a separate study of protein structural disorder found that highly disordered proteins (where high disorder is defined as greater than 80% sequence disorder) had far greater metabolic stability than proteins with low structural disorder.
A correlation between the frequencies of certain amino acids and protein stability has also been reported [1, 7]. One study demonstrated that the frequencies of tryptophan, cysteine, leucine and threonine were negatively correlated with protein stability, and conversely, that glutamic acid, aspartic acid, lysine and asparagine were positively correlated with protein stability. The exact biological mechanisms underlying the relationship between amino acid composition and stability are unknown, though there are likely to be a variety of contributing factors. For example, the PEST hypothesis claims that short hydrophilic sequence segments enriched in certain residues are correlated with protein instability. The mechanisms behind this are unknown, and it has been suggested that PEST regions are actually sequence regions enriched in amino acids that confer phosphorylation sites.
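Composition features of this kind reduce to simple residue frequencies. A minimal sketch (the residue groups encode the correlations reported above; the helper name is our own):

```python
from collections import Counter

NEGATIVE = "WCLT"  # residues reported to correlate with instability
POSITIVE = "EDKN"  # residues reported to correlate with stability

def residue_fraction(sequence, residues):
    """Fraction of `sequence` accounted for by the given residue set."""
    counts = Counter(sequence)
    return sum(counts[r] for r in residues) / len(sequence)
```

For example, `residue_fraction(seq, NEGATIVE)` and `residue_fraction(seq, POSITIVE)` give two scalar features per protein that could be fed to any downstream classifier.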
Previous work on modeling protein stability
There are a number of ways to define metabolic stability. For example, one study presented a probabilistic method for classifying the in vitro metabolic stability of chemical compounds. However, to the best of our knowledge there is currently only one published predictor for intracellular protein stability. The contribution of this published work is two-fold: Huang and colleagues identified optimized sets of features relevant to protein stability, and created a predictor that classifies protein stability based upon the “best” feature vectors. Using the Yen data set, the authors created four classes of protein stability (based on a discrete measurement called the protein stability index, or PSI): short (PSI < 2), medium (2 ≤ PSI < 3), long (3 ≤ PSI < 4) and extra long (PSI ≥ 4). The authors went on to define a list of 376 possible feature components that are believed to contribute to protein stability. These included various biochemical/physiochemical attributes of proteins (such as amino acid composition, hydrophobicity and polarity), protein subcellular locations, KEGG enrichment scores, and the number of complexes a protein is involved in.
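The four stability classes amount to a simple binning of PSI. The handling of ties at the class boundaries is an assumption in the sketch below (we use half-open intervals):

```python
def psi_class(psi):
    """Map a protein stability index (PSI) to one of four stability classes.
    Half-open boundary intervals are an assumption, not taken from the study."""
    if psi < 2:
        return "short"
    if psi < 3:
        return "medium"
    if psi < 4:
        return "long"
    return "extra long"
```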
To determine which features distinguish between different classes of stability, the authors defined three 2-class problems based on their four stability classes: (1) short and medium vs long and extra long, (2) short vs medium, and (3) long vs extra long. For each problem, they ordered the feature vectors according to the maximum relevance and minimum redundancy method, which ranks elements in a vector according to their relevance to a target and their redundancy against each other. To optimise the elements in the feature vectors, they used incremental feature selection (IFS) with nearest neighbor (NN) classification (using 1 − cosine similarity as the distance metric) to classify proteins based on increasing feature vector sizes. For each vector size, they used jack-knife cross-validation to determine the accuracy of NN. For problem (1), they reported an overall accuracy of 72.8% using 62 features; for problem (2), an accuracy of 69.8% using 43 features; and for problem (3), an accuracy of 67.8% using 122 features. On a cautionary note, the automated IFS scheme is likely to invoke a “selection bias”, whereby features are optimal for that particular data set, leading to inflated performance figures.
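The evaluation protocol (1-nearest-neighbour with 1 − cosine similarity as the distance, scored by leave-one-out cross-validation) can be sketched as follows; function names are ours:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def jackknife_nn_accuracy(vectors, labels):
    """Leave-one-out (jack-knife) accuracy of a 1-nearest-neighbour
    classifier: each protein is classified by its nearest other protein."""
    correct = 0
    for i, query in enumerate(vectors):
        nearest = min((j for j in range(len(vectors)) if j != i),
                      key=lambda j: cosine_distance(query, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)
```

Note that the feature *selection* must sit outside this loop to avoid the selection bias mentioned above; running IFS on the full data set and then jack-knifing only the classifier inflates the accuracy estimate.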
The authors reported that localisation to the cell membrane was an important contributor to predicting protein stability. In our own analysis of the set of proteins used in their study, we found that proteins localised to the membrane or the cell surface were significantly over-represented in the “short” class compared to other proteins (Additional file 1: Data Set 1). Furthermore, the most significant feature for problem (1), according to the optimal feature vector, is the frequency of hydrophobic residues. We found that an average of 35% of residues in “short” class proteins were classified as hydrophobic, compared to only 29% in the “extra long” class. Considering the hydrophobic nature of transmembrane domains, this likely reflects the membrane-localised proteins in the “short” class. If the GPSP data is indeed biased towards membrane proteins, then it appears that the NN model shares that bias.
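The hydrophobic-residue frequency feature is straightforward to compute. Which residues count as hydrophobic depends on the scale chosen, so the set below is one common grouping, not necessarily the one used in the analyses above:

```python
# One common hydrophobic grouping; membership varies between hydrophobicity
# scales, so this set is an assumption for illustration.
HYDROPHOBIC = set("AVLIMFWC")

def hydrophobic_fraction(sequence):
    """Fraction of residues in `sequence` classed as hydrophobic."""
    return sum(residue in HYDROPHOBIC for residue in sequence) / len(sequence)
```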
The NN method as employed by Huang and colleagues simply classifies a query protein based on the class of the “nearest” protein. As a result, this method does not tell us how given features influence stability, i.e. whether they correlate with stable or unstable proteins. In contrast, probabilistic machine learning methods such as Bayesian networks can give a transparent and explainable model of protein stability. Though Huang and colleagues have presented the first attempt at computationally modeling protein stability, this work can be improved through the removal of potential experimental bias from the data, and the use of a computational method that can explain its predictions.
We chose to use Bayesian networks to model metabolic stability and the relationships between relevant features for several reasons. First, a Bayesian network allows a large number of different types of features to influence the outcome, while encoding only those dependencies that we believe exist. Second, by virtue of its probabilistic nature, uncertain observations can be incorporated, missing data can be managed, and flexible queries to probe and explain predictions can be entertained [20, 21]. Finally, though most observations are strictly “Boolean” (true or false), we are able to integrate “continuous” scores produced by methods particularly suited to challenging though largely independent sub-problems, such as support vector machines (SVMs) applied to detect protein sequence similarity and position weight matrices (PWMs) applied to recognize PTM sites.
We represent observations about stability, the presence of specific domains, sequence and structural features, PTMs, etc. as random variables X1, X2, …, Xn, which may take values x1, x2, …, xn. Variables are organized “graphically” into a network, with pa(X) representing the set of “parent” variables of X, thereby identifying which dependencies can be captured. The joint probability of all variables is given by

P(x1, x2, …, xn) = ∏i P(xi | pa(Xi))

where each variable is conditioned only on the values of its parents.
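As a concrete sketch, the joint probability of a full assignment in such a network is the product of each variable's conditional probability given its parents. The toy variables and numbers below are invented for illustration:

```python
def joint_probability(assignment, parents, cpts):
    """Joint probability of a full assignment in a Bayesian network:
    the product over variables of P(x | pa(x)).

    assignment: dict mapping variable name -> observed value
    parents:    dict mapping variable name -> tuple of parent names
    cpts:       dict mapping variable name -> conditional probability
                table {(value, parent_values): probability}
    """
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpts[var][(value, parent_values)]
    return p

# Invented two-variable toy network: a "degron" variable with no parents,
# and a "stable" variable conditioned on it.
parents = {"degron": (), "stable": ("degron",)}
cpts = {
    "degron": {(True, ()): 0.3, (False, ()): 0.7},
    "stable": {(True, (True,)): 0.2, (False, (True,)): 0.8,
               (True, (False,)): 0.9, (False, (False,)): 0.1},
}
p = joint_probability({"degron": True, "stable": False}, parents, cpts)
```

In the toy network, the assignment (degron = true, stable = false) has probability 0.3 × 0.8 = 0.24; conditioning each variable only on its parents is what keeps the full joint tractable.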
We discuss the selection of relevant variables and their relationships in Section “Data resources and feature identification”, including the use of latent (un-observed) variables. Parameters are set by using the expectation-maximization (EM) algorithm on a training set. We also separate out data used to parameterize PWMs to score PTM sites, and data used to train SVMs equipped with the 1-spectrum kernel to map a protein sequence to its stability class.
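The k-spectrum kernel compares two sequences through their counts of length-k substrings; for k = 1 this is just amino acid composition. A minimal sketch (function names are our own):

```python
from collections import Counter

def spectrum_features(sequence, k=1):
    """Feature map of the k-spectrum kernel: counts of every length-k
    substring of `sequence`. For k = 1 this is amino acid composition."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(a, b, k=1):
    """Kernel value: dot product of the two k-mer count vectors."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    return sum(fa[kmer] * fb[kmer] for kmer in fa)
```

An SVM can then operate on these kernel values directly, without ever materialising the (potentially large) explicit feature vectors.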