Discovering conditional co-regulated protein complexes by integrating diverse data sources

Abstract Background Proteins interacting with each other as a complex play an important role in many molecular processes and functions. Directly detecting protein complexes is still costly, whereas many protein-protein interaction (PPI) maps for model organisms are available owing to the fast development of high-throughput PPI detecting techniques. These binary PPI data provides fundamental and abundant information for inferring new protein complexes. However, PPI data from different experiments do not overlap very much usually. The main reason is that the functions of proteins can activate only on certain environment or stimulus. In a short, PPI is condition-specific. Therefore specifying the conditions on when complexes are present is necessary for a deep understanding of their behaviours. Meanwhile, proteins have various interaction ways and control mechanisms to form different kinds of complexes. Thus the discovery of a certain type of complexes should depend on their own distinct biological or topological characteristics. We do not attempt to find all kinds of complexes by using certain features. Here, we integrate transcription regulation data (TR), gene expression data (GE) and protein-protein interaction data at the systems biology level to discover a special kind of protein complex called conditional co-regulated protein complexes. A conditional co-regulated protein complex has three remarkable features: the coding genes of the member proteins share the same transcription factor (TF), under a certain condition the coding genes express co-ordinately and the member proteins interact mutually as a complex to implement a common biological function. Results A framework of discovering the conditional co-regulated protein complexes is proposed. Testing on the Yeast data sets under the Cell Cycle, DNA Damage and Dauxic Shift conditions, we identified a total of 29 conditional co-regulated complexes, among which the coding genes in 14 complexes show a strong association with their TFs activity. Based on the close relationship among co-regulation, co-expression and protein-protein interactions in the conditional co-regulated protein complexes, 39 novel TRs were predicted and explained. Conclusions This paper was initiated to study conditional co-regulated protein complexes by integrating multiple data sources. Taking into consideration the influence of TFs activity on the protein interactions, we found that the expression coherence of the protein complexes’ coding genes changed in accordance to their TFs’ activity, which implied that the proteins’ interactions also changed in response to the environments. Based on the three features of conditional co-regulated protein complexes, new transcriptional regulation interactions were predicted.


Background
Protein complexes perform all kinds of fundamental biological functions in cells. Thus far, there have few reliable techniques to directly detect protein complexes in a large-scale style, whereas binary interaction between two proteins is relatively easy to be detected by experiments such as Yeast Two-Hybrid (Y2H) [1], tandem affinity purification (TAP) [2] and Mass Spectrometry (MS) [3]. Many protein-protein interaction (PPI) networks of model organisms such as yeast, fruit fly and so on have been mapped. They provide fundamental and abundant data for computational approaches to the inference of new type protein complexes. However, it has been reported that protein interaction data produced by different experiments for the same organism are often associated with high false positive and false negative rates which lead to a low overlapping degree between their results [4]. For example, the common PPI between the two different mass-spectrometry approaches stands at 1,728 pairs, which correspond to only 27.5% of PPI detected by TAP or only 19.2% of PPI detected by highthroughput mass-spectrometric protein complex identification [5]. It's partly due to the limitations of the associated experimental techniques, and the more to point is the dynamic nature of the protein interaction maps. Currently, most of computational approaches mainly use large-scale statistically oriented study or exact local topological analysis of protein complexes. The former ones could acquire the information about the global structural features including particular degree distribution [6], clustering properties [7] and possible hierarchical structure of the examined networks [8] and the later ones were focused on the discovery of functional motifs [9], themes [10], and modules [11]. Recently, a few of works [12][13][14][15][16] tried to answer the question when the complexes present and how to use for other applications. In the point of biology view, interactions between biological molecules including protein-protein interactions are dynamically regulated both in time and in space. Individual proteins can participate in the formation of a variety of different protein complexes and protein complexes have different degrees of stability over conditions. In order to understand the behaviours and functions of protein complexes precisely, it's necessary to take the condition-specific features into account.
Meanwhile, different types of protein complexes have their own distinct biological 'pattern' or 'topology' characteristics. Taking the topological feature of complexes as example, some complexes can be modelled as 'clique' whose members are densely connected within themselves but sparsely connected with the rest of the network [17,18], while other ones can be modelled as 'star' where there is a 'hub' unit playing a central functional role connected to its neighbours [19,20]. Thus, the discovery of certain kind of protein complexes strongly depends on their definition according to their own distinct characteristics. Recently, Jansen [21] found that subunits of the same protein complex showed significant co-expression, both in terms of similarities of absolute mRNA levels and expression profiles. Nitin [22] studied the correlation between gene expression profiles and protein-protein interaction on four evolutionarily diverse species: human, mouse, yeast and E Coli. They found that the gene expression profiles of protein-protein interacting pairs were highly correlated in E.Coli and the likelihood of predicting protein interactions from highly correlated expression data was increased by using additional protocol for other three species. Zhang [10] observed an outstanding phenomenon that co-regulated coding genes with similar profiles often lead to intensive interactions between their protein products and forming a protein complex. Tan [23] proposed an innovative concept of co-regulated protein complex where proteins were encoded by genes that are regulated by the same transcription factors (TFs). These interesting results imply that there is a tight linkage between transcription regulation, gene expression and protein-protein interaction.
Instead of defining protein complexes only by their topological characteristics, we make use of the three remarkable features of conditional co-regulated protein complexes: (1) the coding genes of the member proteins share the same active transcription factor, (2) the coding genes express co-ordinately and (3) the member proteins mutually interact as a complex to implement a common biological function. In order to study their associations under some given condition, we integrate transcription regulation data (TR), gene expression data (GE) and protein-protein interaction data (PPI) at the level of systems biology. Furthermore, we consider the condition-specific features of the interactions including TR and PPI. Because accurate temporal parameters are not yet available for many protein-protein interactions, a common way to estimate temporal characteristics of protein products is using compilations of GE data [24]. We first use gene expression level in GE as the criterion to judge the activity of TFs in the TR. Then starting with the active TFs, conditional co-regulated complex seeds are identified in the PPI network. Finally, extra members of complexes are found by extensive searching, during which new transcription regulation interactions are predicted.

Methods
The genetic information of biological systems contained in genes is first initiated by transcriptional factors, and then mRNAs are translated to proteins to execute biological functions. The activity of TFs could make an impact on their downstream products' the functions. Accordingly, we propose the framework to discover conditional co-regulated protein complex as follows. The framework is shown in the Figure 1.

Preliminary definitions
Let T = {tf 1 ,tf 2 ,…tf s } be a set of transcription factors, P = {p 1 ,p 2 ,…p m } be a set of proteins, and G = {x 1 ,x 2 ,…,x n } be a GE data set. PPI network is represented by PPI=(P, E PPI ) , where E PPI = {(p i , p j ) | p i , p j ∊ P}. TR interaction data is denoted by TI = (T, E TI ) , where E TI = {(tf i , x j ) | Figure 1 Schematic overview of conditional co-regulated protein complex identification. Given a condition, the active TFs are identified first and find out all target genes from TR data for each active TF and their protein products. The proteins and protein-protein interaction form a sub-network in the global PPI network. This sub-network will be decomposed into several weakly connected components (WCCs). A greedy searching way would be implemented to find out possible candidate groups, which is a linked sub-network in the WCC. We calculate a coherent score for candidates. The higher the score is, the more likely the candidate is to become a co-regulated complex. Finally, a relationship between the activity of TF and coherence of candidates with the conditions' changing is investigated, which can be used to judge TF impact on their products' function. Candidates with high coherent scores and their nodes exceeding a threshold are used as seeds to search additional extra members and predict new TRs. tf i ∊ T, x j ∊ G}. In particular, x for p stands for the coding gene for protein p.
Given a GE under certain condition, L(P', E PPI ') ⊆ PPI is a conditional protein complex, if and only if it meets the following requirements.
(i) All x for p where p ∊ P ', share the same active tf ∊ T under the given condition, (ii) L is a connected graph, (iii) The coherent score Score(L) of L is greater than the threshold a and its score degree is consistent with the activity of its TF as least θ conditions in all conditions.
The above three requirements correspond to the three features of co-regulated protein complexes. Parameters a and θ in (iii) are thresholds used to distinguish a conditional co-regulated protein complex from the others.

Identification of active TFs
Identifying the active TFs under given conditions is challenging. A recent research work [25] adopted the assumption that the regulators are themselves transcriptionally regulated. Therefore, their expression profiles can provide informative clues to indicate their activity level. TFs are identified as being 'active' at certain condition if they reach sufficiently high expression levels. We use the Trace-Back algorithm [26] to identify the active TFs when conditions are given.

Identification of co-regulated complex seeds
For all target genes of each active TF, their protein products and their protein-protein interactions form a local sub-network in the global PPI network. This sub-network will be decomposed into several weakly connected components (WCCs). Because WCCs disjoint each other, one TF may correspond to more than one WCC. We take a core-neighbour strategy to search our target complexes. The procedure includes two stages: the first one is to search the core part (also called seed) in the WCCs in a greedy way and the second one is to conduct an extensive search for extra members of the core in the global PPI in a heuristic way. The computational benefit is obvious as compared to ten thousands of edges in a global PPI, the search space in WCCs decreases by orders of magnitude. As constrained by the requirement (iii) in the definition of conditional protein complexes, we also use a coherent scoring threshold α to judge whether a group is a seed. The arc to vertex number of these WCCs varies greatly from several ones to hundreds. It's infeasible to find out all combinations of the proteins which are linked by hundreds of interactions in the WCC due to the computational cost. However, in practice the member proteins in most known protein complexes do not exceed ten. Therefore, we set two parameters l and b to limit the minimum and maximum edges to narrow down the search space. We take a greedy method to search all possible groups meeting the two parameters in WCCs, and then we measure their coherent score in the given condition, and finally identify seeds whose score exceeds a and its score degree is consistent with the activity of its TF as least θ conditions.

Coherence measurement
As the coding genes in a co-regulated complex have coherent expression, the change in the coherence degree can indicate different states of the complex function. Taking the scoring methods as used in [12,16], a protein group is denoted by is the Pearson Correlation Coefficient between the coding genes of protein i and protein j to reflect their coherence. Different from the formula used in [16], we do not include the individual gene's expression variation measured by std () for two reasons. The first reason is that some genes could be active with low expression variation, and the second reason is that the score could be still high with very high expression variation and relatively low coherence.
Thus, the coherence score of L, denoted by T(L), is the sum over the scores of all the edges in L: We note that the coherence score can be influenced by the number of edges in L. Guo [16] and Ideke [13] have proved that the problem could be solved in the following way. In order to compare the coherence between groups with different number of edges, for L with K edges, we randomly choose 10 000 sub-graphs with K edges from the PPI network and compute their score by using formula (2), then calculate the average and standard deviation value of these 10 000 graphs and use formula (3) to standardize the final score for seed L with K edges. After standardization, groups with different number of edges can be compared with the coherence.

Extensive search
In the global PPI network, we seek extra proteins of the seeds which interact intensively and co-express with the given seed with a high coherence score. A protein which expresses differently with the seed will make the score decrease; while a protein which expresses consistently with most parts of the seed will increase the score. Therefore, the search process can be converted to optimize the score by adjusting the structure of the subgraph starting from the seed. Our extensive searching implements a simulated annealing procedure for every seed. The proteins that TR does not indicate the same TF with the seed can be added into the seed by the extensive search. In this situation we can predict a new TR interaction according to the inference model shown in the Figure 2. The pseudo code is shown in Table 1, where L initial is corresponding to the seed. The input parameters T start , T end are the initial and ending temperature respectively, and N is the iteration number.

Data collection
The Yeast dataset used in the method evaluation involves three biological conditions: Cell Cycle [27], DNA Damaging [28], Diauxic Shift [29]. The Cell Cycle wet-lab experiment includes expression measurements of 6 178 genes measured at 77 time points. The DNA Damaging experiment has 6 129 genes' expression values with 52 sampling points. The Diauxic Shift dataset consists of 6 068 gene expression profiles with 7 time points. We also use the total 7 074 TRs data from Luscombe's work [26], which uses the Trace-Back algorithm to determine active TFs. The distribution of active TFs in the three conditions is shown in Figure 3. Total 54 015 protein-protein interactions from the interoporc [30] are also used in our work. In the data pre-processing, we directly neglected the time point with missing value when calculating the correlation coefficient.
Because there are three different molecular types of data in our work, we unified data symbols by mapping all symbols into gene ID as the standard reference. If the corresponding coding genes of the proteins in the PPI cannot be found in the GE dataset over all conditions, we excluded those proteins from the PPI data. The data used in this work are listed as Additional file 1, Additional file 2, and Additional file 3. Starting from the active TFs, all WCCs are decomposed according to the structure of the global PPI network. The edge number of the WCCs varies greatly. The maximal edge number of the WWC for the transcription factor YKL112W can reach 293. We set l = 2 and b = 21 to greedily search complex seeds from the all possible parts whose edge number is between l and b in the WCCs. If two candidates share common 80% proteins, we take the one with a higher coherent score. Because there is no gold-standard threshold for the coherent score to judge which seeds can be in a Figure 2 Inferring new TR interactions. From the existing TR interaction data, genes a, b, c, and d are known to be target genes of a TF, and A, B, C, and D stands for their proteins. From the PPI network and GE, if we observe that the proteins A, B, C, D and an additional protein E interact intensively one another, and that their coding genes a, b, c, d and e express co-ordinately, we can predict that the TF also regulate e under the same condition. The reason is that E is so similar to the co-regulated protein group of A, B, C, D at the levels gene expression and protein-protein interaction that it can be inferred that e also has TR interactions associated with TF just as a, b, c and d do, although the known TR dataset does not indicate this.
co-regulated complex, we had to take the top ones in the ranking list to guarantee the prediction accuracy with the parameter a = 1.90. Table 2 lists a total of 29 complex seeds (YNL216W and YPR104C regulated the common seed) identified by our method from 21 TFs who's at least active in one of the three conditionsᒬ and all seeds' coherence degree is consistent with activity of their TF at least θ = 2. In Table 2, c1 represents cell cycle, c2 stands for DNA Damaging and c3 denotes Diauxic Shift. From this table, we can see that the coherence degree of the coding genes in the seeds corresponding to the TFs YOR028C, YDR451C, YLR183C, YBL021C, YGL013C, YDL020C, YDR207C, YKL112W, YCR065W and YKL109W are perfectly consistent with their TFs' activity. As the expression coherent levels of the proteins' coding genes change in accordance to the activities of the TFs in all conditions, we can infer that the functions of their protein complexes also follow the pace with their coding genes and TFs. They are perfect conditional co-regulated protein complex seeds. Take the complex seed L consisting of YDL156W, YAR007C, and YJL115W with 2 edges as example. Their transcription factor is YDR451C. Figure 4  In order to validate whether the proteins can form a complex, we identify the corresponding MIPS complexes which contain as many proteins in the predicted complex seeds as possible. 21/29 seeds have over 50% proteins covered by the corresponding MIPS complexes (the coverage genes are shown by bold). However, TFs are not the only factors to determine the behaviours of the coding genes. For example, we found that the complex seeds belonging to TF YBR049C may be influenced by other factors or stimulus, as YBR049C is active in Table 1 Pseudo codes of our search method DNA Damaging, but all its complex seeds show low coherent degree in DNA Damaging. We guess it is caused by DNA Damaging. In Table 2, some value is zero. It's because the GE under the given condition does not cover the target genes.

Extensive search
After the extensive search, not only the new TRs could be predicted, but also the extra members for the core seeds could be found. First to show whether the simulated annealing algorithm can reach a convergence point and to illustrate the parameter settings, we conducted an experiment to investigate how the score and edge number are changed with the iterations for the example complex seed L that consists of YDL156W, YAR007C, and YJL115W with 2 edges. Unfortunately, for the simulated annealing algorithm, there are no choices of parameters that will be good for all problems, and there is no general way to find the best choices for a given problem [31]. In theory, the final result could not be decided by the initial state and parameter setting, but the optimal parameter setting could have a significant impact on the method's effectiveness. Three annealing runs starting from different initial annealing temperatures T start =2, T start =1 and T start =0.05 are shown in Figure 5. We could see that all of the results converge. When the distance of T start and T end is big, it will have big acceptance probability for the weak candidates. It could be observed in heading parts of the curve in (a) and (b), whose Score(L) decreased rapidly. When the temperature is cooling down, the weak candidates are likely excluded. However, it could jump out the local optimality to global optimality. In contrast, when T start is near to T end , it's more possible to reject the weak candidates and easily to reach the local optimal. The parameter N decides the searching times. If N is small, it couldn't reach all possible searching space. Therefore, N should be set a bit big. Because our framework is based on core-neighbour strategy, the optimal local seeds have been identified and it could set T start near to T end to reject the weak candidates.
The seed corresponding to the TF Hsf1 (YGL073W) had the highest score in the cell cycle (11.1) and DNA Damaging (5.75) conditions. As mentioned in the  Figure 6 shows the topology of seed and final result of Hsf1 (YGL073W) and their expression profiles. We can note that the two sets of expression profiles exhibit a highly coherent similarity under the conditions of cell cycle and DNA Damaging, which also validates the scoring function. Based on this, we can infer that Hsf1 transcriptionally regulates the target genes YMR186W, YKL117W, and YOR027W as well, which are both covered in the extensive searching of Cell Cycle and DNA Damaging.
We validated our prediction results from three aspects: (1) we retrieved and compared with literature works which predicted the same TRs; (2) we detected the conserved binding motifs from the target genes in the seed and examined whether there were matches in the promoter of the predicted target; (3) we examined whether the function of predicted target genes was consistent with those of target genes in the seed. Of course, the final validation for the prediction result should depend on the biology experiment in the cell cycle condition. We found that the results by [32,33] and [34,35] supported our newly discovered TRs: Hsf1 regulates YMR186W and Hsf1 regulates YOR027W.
However, we have not found direct evidence to support that Hsf1 regulates YKL117W. Maybe, it is a good idea to find evidence from binding motifs to support this. There are two significant binding motifs induced by the tool MEME from the upstream 600bp of the coding genes in the seed, which are shown in Figure 7. The first motif is consistent with a known Consensus Motif (GAAXXTTCXXGAA) for Hsf1. We found that there is a mach to the first motif in the 600bp upstream of YKL117W, YOR027W, and there is a match to the second motif in the upstream of YMR186W. Finally, we compared the function of Hsf1, the coding genes in the seed and the predicted target genes. SGD has an annotation for Hsf1 as following: 'Hsf1 regulates the transcription of hundreds of targets, including genes involved in protein folding, detoxification, energy generation, carbohydrate metabolism, and cell wall organization. Deletion of Hsf1 is lethal and mutants are defective in several processes including maintenance of cell wall integrity, spindle pole body duplication, protein transport, and cell cycle progression'. Meanwhile, we conducted a function enrichment analysis for the ten genes YLR216C, YLL026W, YPL240C, YAL005C, YLL024C, YDR214W, YNL007C, YMR186W, YKL117W, and YOR027W. One finding is that these genes have a common function of 'protein folding', which belongs to the functional scope of Hsf1.
For other extensive searches, in order to guarantee the accuracy, we only consider those complex seeds whose coherence perfectly consistent with the activity of their transcription factor in three conditions. Table 3 shows the results of newly predicted TRs and extra members for the perfect condition-dependent complex seeds listed in Table 2. The superscript number on the predicted target genes corresponds to that of their seed, which have the same TF. Figure 7 Conserved motifs found in the seed and predicted target genes. Two motifs predicted by MEME [36] from the upstream 600p of the five coding genes in the seed. The first motif is consistent with a known Consensus Motif (GAAXXTTCXXGAA) for Hsf1 in TRANSFAC.

Discussion
In this paper, we proposed a framework to discover conditional co-regulated protein complexes by integrating TR, GE and PPI data. This kind of protein complexes has three remarkable features: the coding genes of the member proteins share the same transcription factor, under certain condition the coding genes express coordinately and the member proteins interact mutually as a complex to implement a common biological function.
Comparing to the existing works, one advantage is that our method not only uses the coding genes expression to measure the conditional protein activity but also takes the upstream TF activity into account to study their influence on protein complex. In the experiment, we observed some typical cases in which protein complexes' coding gene coherent degree is strongly associated with their TF activity under different conditions. Another contribution is that we advanced the procedure of discovering co-regulated protein complex to discover potential unknown transcriptional regulation based on the tight relationships among co-regulation, co-expression and protein interaction. Because our work is based on the integration of several heterogeneous data sources, the result of the work could be influenced by the data quality in several aspects. The first one is the missing value in gene expression profiles. Besides the usual ways to directly assign value zero or the average value of the gene row to them, many other approaches have been proposed such as Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute). Brock [37] has examined which imputation method is optimal for a given data set. Optimal imputation method should balance computation cost and untrue estimation. The second one is the available amount of the protein-protein interaction and transcriptional regulation data. Detecting them exactly is still a challenge in the field. In particular, it is hard to distinguish the false positive data. Another one is that an accurate prediction of TR under different conditions is very important for our work. Although this work focuses on the special kind of protein complex, it could help to understand the protein complexes' organization and functional behaviour. Meanwhile new TRs are predicted based on the tight linkage between co-regulation, co-expression and protein-protein interactions. During the extensive search, it cannot detect all TRs for a species one time, but it provides an approach to exploit TRs from the complex mechanism. To make this method widely applicable, two real-life difficulties should be taken with caution. These include: (1) Time-course GE datasets with time points exceeding 10 for species except for yeast are not too many. In fact, most of them are knock-out experiments, which usually re-sample no more than 3 times. It is hard to measure the genes' correlation with such few number of time points. (2) In this work, we used the Transcriptional regulation data directly. In fact much TR information could be extracted from other types of data like TFs binding. Meanwhile, the TF not only could act as promotion, but also repression, which will be considered in our future work. When the data become abundant and available, we believe our proposed method would be applicable for more species.