 Research Article
 Open Access
 Published:
Modelling the influence of dimerisation sequence dissimilarities on the auxin signalling network
BMC Systems Biology volume 10, Article number: 22 (2016)
Abstract
Background
Auxin is a major phytohormone involved in many developmental processes by controlling gene expression through a network of transcriptional regulators. In Arabidopsis thaliana, the auxin signalling network is made of 52 potentially interacting transcriptional regulators, activating or repressing gene expression. All the possible interactions were tested in twoway yeast2hybrid experiments. Our objective was to characterise this auxin signalling network and to quantify the influence of the dimerisation sequence dissimilarities on the interaction between transcriptional regulators.
Results
We applied modelbased graph clustering methods relying on connectivity profiles between transcriptional regulators. Incorporating dimerisation sequence dissimilarities as explanatory variables, we modelled their influence on the auxin network topology using mixture of linear models for random graphs. Our results provide evidence that the network can be simplified into four groups, three of them being closely related to biological groups. We found that these groups behave differently, depending on their dimerisation sequence dissimilarities, and that the two dimerisation subdomains might play different roles.
Conclusions
We propose here the first pipeline of statistical methods combining yeast2hybrid data and protein sequence dissimilarities for analysing proteinprotein interactions. We unveil using this pipeline of analysis the transcriptional regulator interaction modes.
Background
Auxin is a key signal in plant development that regulates organogenesis from embryogenesis onward. This major phytohormone achieves its morphogenetic activity notably by regulating the transcription of a large number of downstream genes. In Arabidopsis thaliana, the control of gene expression in response to auxin involves a complex network of 52 transcriptional regulators, consisting of 29 AUXIN/INDOLE3ACETIC ACID (Aux/IAA), that do not bind DNA, and 23 AUXIN RESPONSE FACTOR (ARF), which are true transcription factors (for review, see [1, 2]).
The current molecular model of the auxin signalling pathway assumes the formation of heterodimers between ARF and Aux/IAA in absence of auxin (Fig. 1 a). According to [3] these transcriptional regulators interact through a Cterminal dimerisation domain (CTD), made of two conserved subsequences known as domain III (DIII) and domain IV (DIV) (Fig. 1 b). ARF can bind DNA through a DNA binding domain (DBD) and act either as activators (ARF+) or repressors (ARF) of auxinresponsive transcription (Fig. 1 b) depending on the amino acid composition of the intermediate domain linking the DBD to domain III/IV (DIII/IV). It should be noted that Aux/IAA do not have a DBD and therefore are unable to regulate alone the transcription of auxinresponsive genes. When auxin accumulates in cells as a result of polar auxin transport or changes in biosynthesis, its perception targets the Aux/IAA to the proteasome [1], leading to their subsequent degradation. This subsequently releases ARF, allowing them to regulate downstream genes.
It is only recently that the topology of the Aux/IAA  ARF network was analysed extensively [4]. A yeast2hybrid (Y2H) [5] highthroughput approach, has allowed to test for most possible interactions between Aux/IAA and ARF proteins (with the exception of ARF15, 21 and 23, see [4] and Methods). A binary network was built from these data and a modelbased graph clustering method [6] that groups proteins on the basis of their connectivity profile (i.e. similar interactors) was used to explore this network. Three clusters of proteins, that closely matched biological groups (i.e. ARF+, ARF and Aux/IAA) [4] were identified in this way, thus demonstrating the rather stereotypical interaction properties of ARF+, ARF and Aux/IAA (see below for more details). Here, we extended this approach to analyse the influence of the DIII/IV primary sequence dissimilarities on the likelihood of interaction between auxin transcriptional regulators. To this end, we used a recently proposed generalisation of the mixture models for random graphs that offers the possibility to deal with valued graphs and to include explanatory variables [7]. This integrative statistical model constitutes the core of our pipeline of methods for analysing the influence of sequence dissimilarities between dimerisation domains on proteinprotein interactions.
Results and discussion
A binary network is often easier to interpret than a valued one. However, in our case, it does not fully represent the “true” biochemical network as an interaction network depends on several properties, such as interaction strength, protein concentration, spatial expression and synthesis/degradation dynamics of the proteins. We will first briefly recall how the binary network was built and analysed in [4]. Then we will compare this previous approach with the analysis of a valued network, built to minimize the loss of information, before investigating how dissimilarities between dimerisation domains can be incorporated in such a modelling framework.
Available Y2H experimental data and binary Aux/IAA  ARF network
We used in this work a previously available Y2H dataset where Aux/IAA and ARF interactions have been tested in yeast both ways [4]. Interactions were tested for each protein fused to the activation (AD) or to the binding domain (BD) of the Gal4 yeast transcription factor, thus allowing to minimize false positives. In addition and to minimize false negatives, two reporter genes, HIS3 and XGal, were used for testing the interaction. In this experiment the interaction capacities of 49 transcriptional regulators were tested (ARF15, 21 and 23 could not be cloned), thus making a total of 2401 interactions tested. We give in Table 1 an example of the results. Note that the Y2H dataset was obtained using only DIII/IV for ARF, while fulllength proteins were used for Aux/IAA (see Conclusions for a discussion of that point).
The Y2H data were previously used to build a binary network [4]. This required choosing thresholds for both tests on the basis of their empirical distributions. The threshold was set between the successive marks ‘+?’ and ‘+’ for the XGal test (see Methods for detailed explanations) and at 0.45 for the HIS3 test [4]; see an illustration of these thresholds in Fig. 2. Decision rules were then used to combine the four test outputs ([4]; see Methods).
We aimed at analysing binary and valued networks potentially influenced by dimerisation sequence dissimilarities. These networks should be built on the same transcriptional regulators. We therefore excluded ARF11 since this ARF showed no interactions in the previously published Aux/IAA  ARF binary network [4]. We also excluded ARF3 and 17 since they do not possess DIII/IV. We then built a new binary network using the same thresholds as in [4]. We also tested HIS3 thresholds at 0.3 and 0.65. Applying these thresholds only slightly modify the binary network (Additional file 1: Figure S2). The binary network built using the HIS3 threshold at 0.45 will thus be used in the rest of this work.
Building a valued network from Y2H data
Combining the XGal and HIS3 test outputs in a single interaction distance requires a standardization procedure (see Methods). The objective of standardization is to avoid dependency on the elementary distance type and scale. It is important to point out that, in our case, the valued network does not represent affinities between proteins, but rather the likelihoods of interaction between proteins. We tested several weightings of the outputs of the XGal and HIS3 tests and in particular:

network A: w _{XGal}=0.75 and w _{HIS3}=0.25;

network B: w _{XGal}=0.5 and w _{HIS3}=0.5;

network C: w _{XGal}=0.25 and w _{HIS3}=0.75.
To this end, we visualized the standardised distance distributions corresponding to “no interaction” (red) and “interaction” (green) according to the previously defined binary assignment (Fig. 3). Network C is characterized by standardised distances corresponding to “no interaction” spread over a wide range of values, thus leading to a rather large overlap with standardised distances corresponding to “interaction”. Network A on the contrary concentrates standardised distances corresponding to “no interaction” over a small range of values, leading to a clear separation with standardised distances corresponding to “interaction”. Finally, network B (corresponding to the balanced weighting) presents a reasonable compromise between the dispersion of standardised distances corresponding to “no interaction” and “interaction” and their overlap. This comparison of the networks highlights the fact that the XGal test seems more reliable than the HIS3 test in this dataset, probably because of the very long tail corresponding to higher interaction likelihoods for this test (Fig. 2). In the following, we will thus present clustering results only for networks A and B.
Network topology analysis using Bernoulli and Gaussian mixture models
To gain further insights into the binary and valued networks topology, we applied a modelbased graph clustering methods in order to group the transcriptional regulators on the basis of their connectivity profiles. The key feature of mixture models for random graphs is to give a probabilistic summary of the connectivity structure by uncovering clusters of proteins that share similar connectivity profiles. The parameters of the model are the cluster weight distribution and the connectivity distributions for each pair of clusters.
In the case of a binary adjacency matrix Z, connectivity distributions are Bernoulli distributions parametrized by connectivity probabilities, that is the probability for proteins of two clusters to interact:
The interaction Z _{ ij } between vertices i and j knowing that i belongs to cluster q and j to cluster ℓ follows a Bernoulli distribution of parameter π _{ q,ℓ }.
In the case of a weighted adjacency matrix X, the connectivity distributions are Gaussian distributions:
It should be noted that parameter μ _{ q ℓ } of a Gaussian mixture (GM) model is the mean likelihood of interaction between proteins of two clusters. This is different from the Bernoulli mixture (BM) model where the parameter π _{ q ℓ } is the probability for proteins of two clusters to interact. This makes the biological interpretation of GM model parameters less straightforward.
The inference of such models is not restricted to the estimation of the cluster weight and connectivity distributions but encompasses the inference of the number of clusters using a penalized likelihood criterion. The principle of penalized likelihood criteria such as the integrated completed likelihood (ICL) criterion consists in making a tradeoff between an adequate fitting of the model to the data and a reasonable number of parameters to be estimated. The ICL criterion is specifically tailored to the clustering objective and is expected to favour models such that the uncertainty of protein assignment to clusters is low. Jeffreys’ rules of thumb [8] suggest that a difference of ICL of at least log(100)=4.6 is needed to deem the model with the higher ICL substantially better. Since the ICL criterion is only asymptotically valid (i.e. for large N), the number of clusters given by this criterion should be considered as indicative. We thus chose to systematically investigate potential interesting clusterings combining ICL values, prior biological knowledge and within and betweencluster distances for assessing the homogeneities and separabilities of clusters. One key output for the validation of a modelbased clustering method is the posterior distributions of protein assignment to clusters. For each protein, this posterior distribution was degenerate (probability of 1 for a given cluster and 0 for the others) whatever the model used, which eased the interpretation of the clustering outputs.
Building a Bernoulli mixture model
Note that the clustering results reported here using BM models slightly differ from those presented in [4] since we only used 46 proteins (instead of 49 proteins, as explained above).
When estimating BM models on the basis of the 46 protein binary network, the ICL criterion favours first the 6cluster BM model and next the 4cluster BM model (Table 2). However, the ICL difference (ΔICL<2) between the 4 and the 6cluster BM models was not significant according to Jeffreys’ rules of thumb.
For the 4cluster BM model (Table 3), we found three clusters corresponding to biologically meaningful groups and an “outlier” cluster. The three clusters $C1^{\text {ARF+}}_{\texttt {BM}}$ , $C2^{\text {ARF}}_{\texttt {BM}}$ and $C3^{\text {IAA}}_{\texttt {BM}}$ show a specific enrichment in respectively ARF+, ARF and Aux/IAA. The fourth cluster $C4^{\text {Outlier}}_{\texttt {BM}}$ can be categorized as “outlier” since it groups one ARF with six Aux/IAA in addition of being poorly defined as detailed below. A connectivity graph representing the interaction probability between clusters is given in Fig. 4.
An important criterion to assess the validity of a clustering model is the betweencluster distance matrix D(q,ℓ) (given below). $C1^{\text {ARF+}}_{\texttt {BM}}$ , $C2^{\text {ARF}}_{\texttt {BM}}$ and $C3^{\text {IAA}}_{\texttt {BM}}$ present withincluster distances (diagonal) smaller than betweencluster distances (off diagonal), showing a strong definition of these clusters (see Eq. 3). The withincluster distance of $C4^{\text {Outlier}}_{\texttt {BM}}$ is in contrast higher than the withincluster distance of the three other clusters. In addition, its withincluster distance is larger that its distance to $C2^{\text {ARF}}_{\texttt {BM}}$ . This configuration can be interpreted in the framework of densitybased clustering (see [9] and references therein) where $C1^{\text {ARF+}}_{\texttt {BM}}$ , $C2^{\text {ARF}}_{\texttt {BM}}$ and $C3^{\text {IAA}}_{\texttt {BM}}$ are characterized by rather high density of elements with respect to the density of elements of $C4^{\text {Outlier}}_{\texttt {BM}}$ . This outlier cluster might be explained in part by biological noise in the Y2H experiments.
In the case of the 6cluster BM model favoured by the ICL criterion, two clusters are not well defined in terms of within and betweencluster distances (Additional file 2: Table S1). The cluster composition shows three Aux/IAA enriched clusters and one outlier cluster (compare the 4 and 6cluster BM models cluster composition in Additional file 1: Figure S3). This is likely a consequence of the tendency of the ICL criterion to select overparameterized models in our context.
Taken together these results suggest that the 4cluster BM model is more relevant both from the point of view of cluster definition and biological meaning. As we will see later, a clustering with three biologically meaningful clusters and an “outlier” cluster is supported by the different models and will therefore be used for comparing the outcome of these models.
Building Gaussian mixture models
We next used GM models for analysing the A and B valued networks. The ICL criterion favours the 5cluster GM model for network A and the 6cluster GM model for network B (Table 2). The more parsimonious model selected for network A may be due to the high dispersion of HIS3 values which have less weight in network A than in network B (w _{HIS3}=0.25 for network A and w _{HIS3}=0.5 for network B) (Fig. 2). This supports the idea that the XGal test is more reliable than the HIS3 test. We thus chose to focus on GM models built on the basis of network A (GMA model).
Analysing the cluster composition for the 5cluster GMA model, we found three biologically meaningful and two “outlier” clusters (see Additional file 1: Figure S4B for the cluster composition). When assessing the clustering quality, we observed that the third cluster, specifically enriched in Aux/IAA, presented a rather large withincluster distance compared to its distances to the other clusters (see Additional file 2: Table S2). The two “outliers” clusters not being that well defined too, we decided to compare the 5cluster GMA model with the 4cluster GMA model since it corresponds to the most relevant clustering found using BM models.
This 4cluster GMA model exhibits a meaningful biological structure with three clusters C1GMAARF+,C2GMAARF and $C3^{\text {IAA}}_{\texttt {GMA}}$ specifically enriched in each family of proteins and an “outlier” cluster $C4^{\text {Outlier}}_{\texttt {GMA}}$ . Remarkably, $C4^{\text {Outlier}}_{\texttt {GMA}}$ is the merging of the two “outlier” clusters identified with the 5cluster GMA model with the exception of IAA29 found in $C2^{\text {ARF}}_{\texttt {GMA}}$ for the 4cluster GMA model (see the compositions in Additional file 1: Figure S4A and B). Thus, when assessing clustering on the basis of the clusterdistance matrix (see Eq. 4) we still observe a rather large withincluster distance for $C3^{\text {IAA}}_{\texttt {GMA}}$ compared to its distances to the other clusters. Since the 5cluster model favoured by the ICL criterion is almost perfectly nested in the 4cluster model, we argue here that the simpler model is more relevant. Again, this can be interpreted as the tendency of the ICL criterion to select overparameterized models.
We give in Fig. 5 the connectivity graph obtained using the 4cluster GMA model. We stress here that the mean interaction likelihood (μ _{ q ℓ }) should not be directly compared to the interaction probabilities (π _{ q ℓ }) represented in the connectivity graph obtained using the BM model (Fig. 4) since they do not represent the same information; see Additional file 1: Figure S5 for the clustered valued adjacency matrix with proteins sorted by increasing withincluster distances.
One should note a specificity of $C3^{\text {IAA}}_{\texttt {GMA}}$ in Table 4 whose lowest protein to cluster distance (0.028) is greater than the highest protein to cluster distances (0.019, 0.016, 0.024) for the three other clusters. This explain the large withincluster distance observed for $C3^{\text {IAA}}_{\texttt {GMA}}$ ; see Eq. 4.
Comparing Bernoulli and Gaussian mixture model clusterings
Cluster compositions of the 4cluster BM model (Table 3) and GMA model (Table 4) are rather similar with 78 % match (Table 5). The differences in cluster assignment concern almost only peripheral elements of the clusters and the core of the four clusters are very similar.
The betweencluster distance matrices suggest that the BM model performs better than the GMA model, allowing for a better definition of the clusters according to within and betweencluster distances. While it may introduce errors (false positives or negatives) depending on the thresholds and decision rules defined for the XGal and HIS3 tests, the binarisation of interactions is thus likely to effectively remove experimental noise. On the opposite, the standardization is a more objective approach, since it scales the outputs of the XGal and HIS3 tests to make them comparable and limits the loss of information. However, standardization does not remove experimental noise, which seems to be in our case a shortcoming for cluster definition. Nevertheless, with both BM and GM models, we were able to identify a strong core structure in the auxin signalling network, closely related to the predicted biological structure [3].
Analysing the influence of the protein primary sequence dissimilarities on the auxin network topology using linear regression mixture models
We next sought to address how the evolution of multigenic families such as the one encoding ARF and Aux/IAA has influenced the auxin signalling network topology by modifying the dimerisation capacities of its members. To get insights into this complex question, we ask here whether dissimilarities in primary sequences of ARF and Aux/IAA dimerisation domain influence the topology of the Aux/IAA  ARF network. Note that we only present results for network A and use the distance between primary sequences as a measure of protein dissimilarities.
Building the dimerisation domain protein distance matrix
One way to analyse the influence of DIII/IV primary sequence on the Aux/IAA  ARF network is to incorporate distances between protein sequences as explanatory variables in a linear regression mixture (LRM) model. To build a distance matrix corresponding to the dimerisation domain differences in terms of amino acid se quences, we started by aligning full protein sequences of all Aux/IAA and ARF presenting the conserved CTD domain (DIII/IV) using CLUSTALW [10]. To recover DIII and DIV amino acid subsequences, we searched for conserved patterns among the aligned sequences using Gblocks [11]. Two conserved blocks were found at the Cterminal part of the sequences, corresponding to the two subdomains DIII and DIV (Methods and Additional file 1: Figure S6). The persite protein distance matrix was then obtained using the amino acid substitution model PAM computed with PROTDIST. We also computed two distance matrices, corresponding respectively to DIII and DIV, to be used in LRM models with two explanatory variables (see below).
Linear regression mixture models with DIII/IV as a single explanatory variable
We built LRM models [7] to investigate the influence of the dimerisation domain dissimilarity on the likelihood of interaction between transcriptional regulators. The linear regression mixture model with a single explanatory variable is written as follows:
where X is the weighted adjacency matrix (response distance matrix) and Y the explanatory distance matrix representing primary sequence dissimilarities between DIII/IV. As for GM models, μ _{ q,ℓ } is the mean likelihood of interaction between proteins of two clusters. The regression parameter β _{ q,ℓ } quantifies the effect of DIII/IV sequence disimilarity on interaction likelihood and is defined for each pair of clusters (q,ℓ) [7].
Introducing an explanatory variable enables to reduce the number of clusters selected by the ICL criterion: four clusters for the LRM model instead of five clusters for the GMA model (Tables 2 and 6). The singleexplanatoryvariable 4cluster LRM model (Table 7) exhibits a biologically meaningful structure with three clusters $C1^{\text {ARF+}}_{\texttt {LRM1}}$ , $C2^{\text {ARF}}_{\texttt {LRM1}}$ and $C3^{\text {IAA}}_{\texttt {LRM1}}$ enriched respectively in ARF+, ARF and Aux/IAA and an “outlier” cluster $C4^{\text {Outlier}}_{\texttt {LRM1}}$ composed of ARF and Aux/IAA; see Additional file 1: Figure S7 for the clustered valued ajdacency matrix with proteins sorted by increasing withincluster distances. This composition is very similar to the one obtained with the 4cluster GMA model (87 % of match) but a bit less to the one obtained with the 4cluster BM model (76 % of match) (Table 5).
Considering the betweencluster distance matrix (Eq. 6) we observe an increase of the ARF+ enriched withincluster distance, while the other clusters show within and betweencluster distances similar to the ones in the GMA model (Eq. 4). The estimated regression coefficients of the linear regression models are given in Eq. 7; see Additional file 1: Figure S8 for a graphical representation of the regressions.
We give in Fig. 6 a simplified representation of the influence of the dimerisation sequence distance on the likelihood of interaction between proteins of two clusters. We stress here that this representation cannot be compared with the connectivity graphs (Figs. 4 and 5) since they do not present the same information.
In the case of $C1^{\text {ARF+}}_{\texttt {LRM1}}$ , the estimated regression coefficients $\hat {\beta }_{\text {III/IV, LRM}}(q,\ell)$ (Eq. 7) show that the closer the dimerisation sequences, the less proteins in $C1^{\text {ARF+}}_{\texttt {LRM1}}$ are likely to interact $\left (\hat {\beta }\left (C1^{\text {ARF+}}_{\texttt {LRM1}},C1^{\text {ARF+}}_{\texttt {LRM1}}\right)=1.024\right)$ . However, as shown in Table 7, $C1^{\text {ARF+}}_{\texttt {LRM1}}$ is not only made of ARF+ but also includes IAA31, 7 and 13. A closer look (Additional file 1: Figure S8, topleft panel) shows that the positive influence detected for within $C1^{\text {ARF+}}_{\texttt {LRM1}}$ interaction comes from the presence of the three Aux/IAA in this cluster. We observed mainly two separated groups (in addition to the homodimers): one with low interaction likelihoods (and low dimerisation sequence distances) that corresponds to ARF+ ⇔ARF+ and Aux/IAA ⇔Aux/IAA interactions, and another one with high interaction likelihoods (and higher dimerisation sequence distances), that corresponds to ARF+ ⇔Aux/IAA interactions (Additional file 1: Figure S8). This indicates that this result is most likely an artefact.
Considering the interaction between $C1^{\text {ARF+}}_{\texttt {LRM1}}$ and $C3^{\text {IAA}}_{\texttt {LRM1}}$ , we also observed a weak but positive effect $\left (\hat {\beta }\left (C1^{\text {ARF+}}_{\texttt {LRM1}},C3^{\text {IAA}}_{\texttt {LRM1}}\right)\!=0.305\!\right)$ of dimerisation sequence distance on the interaction likelihood (the closer the sequences the less likely proteins interact). A closer inspection shows a less dispersed distribution of interaction likelihoods (Additional file 1: Figure S8), supporting the observed effect of dimerisation sequence distances on interaction likelihoods. Similar observations can be made for the interaction between $C1^{\text {ARF+}}_{\texttt {LRM1}}$ and $C4^{\text {Outlier}}_{\texttt {LRM1}} \left (\hat {\beta }\left (C1^{\text {ARF+}}_{\texttt {LRM1}},C4^{\text {Outlier}}_{\texttt {LRM1}}\right)=0.701\right)$ . Surprisingly, no effect of dimerisation sequence distances on within $C4^{\text {IAA}}_{\texttt {LRM1}}$ interaction could be detected $\left (\hat {\beta }\left (C4^{\text {IAA}}_{\texttt {LRM1}},C4^{\text {IAA}}_{\texttt {LRM1}}\right)=0.031\right)$ . Apart from these observations, no other influence of dimerisation sequence distance on interaction likelihood could be identified using this model.
Linear regression mixture models with DIII and DIV as two explanatory variables
We next tested how each dimerisation subdomain DIII and DIV could influence the interaction likelihood by incorporating in the LRM models two explanatory variables (one for each dimerisation subdomain). The linear regression mixture model with two explanatory variables can be written as follows:
The ICL criterion favours the 3cluster twoexplanatoryvariable LRM model and with a nonsignificant difference (ΔICL < 1.4) the 4cluster twoexplanatoryvariable LRM model (Table 6). The cluster composition obtained with the 4cluster twoexplanatoryvariable LRM model (Table 8) is very similar to the one obtained with the 4cluster singleexplanatoryvariable LRM model (Table 7, 95 % of match) and with the GMA model (Table 4, 91 % of match). The 4cluster twoexplanatoryvariable LRM model has 3 clusters $C1^{\text {ARF+}}_{\texttt {LRM2}}$ , $C2^{\text {ARF}}_{\texttt {LRM2}}$ and $C3^{\text {IAA}}_{\texttt {LRM2}}$ enriched respectively in ARF+, ARF and Aux/IAA and an “outlier” cluster $C4^{\text {Outlier}}_{\texttt {LRM2}}$ ; see Additional file 1: Figure S9 for the clustered valued adjacency matrix with proteins sorted by increasing withincluster distances. The 4 clusters deduced from this LRM model have similar within and betweencluster distances (Eq. 9) than the 4 clusters deduced from the singleexplanatoryvariable LRM model (Eq. 6).
The estimated regression coefficients for the two subdomains are given in Eqs. 10 and 11; see Fig. 7 and Additional file 1: Figure S10 for graphical representations of the regressions.
We give in Fig. 8 two representations of the influence of dimerisation subdomain sequence distance on interaction likelihood and thus on network topology.
For $C1^{\text {ARF+}}_{\texttt {LRM2}}$ withincluster interactions, the closer the DIV sequence distances the higher the interaction likelihood while DIII sequence distances had no effect on these withincluster interactions. However, given that the composition of $C1^{\text {ARF+}}_{\texttt {LRM2}}$ was identical in the 4cluster LRM model with a single or two explanatory variables, these results are likely artefactual and linked to the fact that the ARF+ enriched cluster contains three Aux/IAA that contribute to the detected dimerisation sequence influence.
The $C3^{\text {IAA}}_{\texttt {LRM2}}$ withincluster interactions presents an opposite behaviour when analyzing the influence of the two dimerisation subdomains: the closer the DIII sequences, the higher the interaction likelihood $\left (\hat {\beta }_{\text {III}}\left (C3^{\text {IAA}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.268\right)$ ; the farther the DIV sequences, the higher the interaction likelihood $\left (\hat {\beta }_{\text {IV}}\left (C3^{\text {IAA}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.297\right)$ . Counteracting effects of the same order of magnitude for the two subdomains thus likely explain why we could not observe any influence of dimerisation sequence distances on interaction likelihood with the singleexplanatoryvariable LRM model $\left (\hat {\beta }\left (C3^{\text {IAA}}_{\texttt {LRM1}},C3^{\text {IAA}}_{\texttt {LRM1}}\right)=0.031\right)$ .
Domainspecific effects were also found for the interaction between $C1^{\text {ARF+}}_{\texttt {LRM2}}$ and $C3^{\text {IAA}}_{\texttt {LRM2}}$ . DIII sequence distance is positively related to interaction likelihood $\left (\hat {\beta }_{\text {III}}\left (C1^{\text {ARF+}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.294\right)$ , while no –or a very limited– effect of DIV sequence distances is observed $\left (\hat {\beta }_{\text {IV}}\left (C1^{\text {ARF+}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.052\right)$ . This is in agreement with the effect $\left (\hat {\beta }\left (C1^{\text {ARF+}}_{\texttt {LRM1}},C3^{\text {IAA}}_{\texttt {LRM1}}\right)=0.305\right)$ detected within the singleexplanatoryvariable LRM model and indicates that the effect of dimerisation sequence distance on interaction likelihood is mostly linked to DIII, with little or no contribution of DIV.
Finally concerning interactions between $C2^{\text {ARF}}_{\texttt {LRM2}}$ and $C3^{\text {IAA}}_{\texttt {LRM2}}$ , a weak opposite effect was detected with the twoexplanatoryvariable LRM model for each domain $\left (\hat {\beta }_{\text {III}}\left (C2^{\text {ARF}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.194\right)$ and $\left.\hat {\beta }_{\text {IV}}\left (C2^{\text {ARF}}_{\texttt {LRM2}},C3^{\text {IAA}}_{\texttt {LRM2}}\right)=0.138 \right)$ . Again this could not be detected with the singleexplanatoryvariable LRM model $\left (\hat {\beta }\left (C2^{\text {ARF}}_{\texttt {LRM1}},C3^{\text {IAA}}_{\texttt {LRM1}}\right)=0.057\right)$ , most likely because of opposite contributions from the two subdomains.
Conclusions
Interpretation of the auxin signalling network clustering and of the contribution of domain III/IV primary sequences
Our clustering analysis provides interesting insight on the underlying biology. First and in accordance with previous work [4], the different models strongly support the idea that the auxin signalling network can be simplified in three biologically meaningful groups, corresponding roughly to the ARF+, ARF and Aux/IAA (but with an additional outlier group, see below) and showing specific interaction behaviours. The strong interaction likelihood between ARF+ and Aux/IAA was expected from the putative molecular model reviewed in [2]. This suggests that most of the Aux/IAA repress transcriptional activity of ARF+ when a low concentration of auxin is encountered. However, the weak likelihood of interaction between ARF and Aux/IAA, and between ARF and ARF+ remains a surprising conclusion (that was highlighted by [4]), given the overall good conservation of DIII/IV in ARF proteins. Further experiments and analyses need to be conducted to unveil the role of DIII/IV in ARF and its possible contribution to the auxin signalling pathway.
Using LRM models to investigate the influence of protein sequence distances on the auxin signalling network is a first attempt to establish a direct link between protein primary sequences and interaction network topology. By first using a singleexplanatoryvariable LRM model, we uncovered a rather counterintuitive contribution of the primary sequence for a few betweencluster interactions. Notably proteins from the ARF+ enriched cluster interact more likely with proteins from the Aux/IAA enriched cluster that have more distant dimerisation sequences. This suggests that the likelihood of interaction between ARF+ and Aux/IAA increases with the evolutionary distance between DIII/IV sequences. A similar observation could be made for the ARF+ enriched cluster and the outlier cluster, further suggesting that facilitated interactions between more distant proteins could contribute significantly to the structuring of the auxin signalling network. Concerning the ARF+ enriched withincluster interactions, we detected a positive relationship between protein distance and interaction likelihood. However, this is likely an artefact due to the presence of three Aux/IAA in this cluster, preventing us from drawing conclusion from this observation.
The twoexplanatoryvariable LRM models yielded a more precise view by identifying subdomain specific effects. Our results show that DIII explains most of the effect of DIII/IV sequence on the likelihood of interaction between ARF+ and Aux/IAA. Recent structural analyses of DIII/IV [12–15] showed that DIII and DIV mediate interactions between ARF+ and Aux/IAA through two charged interfaces: one face mostly positive and one face mostly negative. DIII contributes principally to the positive face, while DIV contributes to the negative face of these interaction domains. This structure allows for bidirectional interactions. Finding that changes in the primary sequence of DIII alone influence ARF+ ⇔Aux/IAA interaction likelihood then suggests that changes on a single face of the protein impact the global interaction capability. Analysing the contribution of each subdomain also highlighted several antagonistic influences, thus explaining why no effect was detected with the singleexplanatoryvariable LRM model in some cases. This suggests that changes within the primary sequence of one subdomain that could influence the interaction likelihood, can be counteracted by changes within the primary sequence of the other subdomain, an effect that could have occurred during evolution of DIII/IV.
So far DIII/IV structures have been obtained for 4 transcriptional regulators of auxin [12–15]. Obtaining further protein structures, although challenging, could allow testing the hypotheses emerging from our clustering approach. Other strategies would also allow testing further the link between interaction likelihood and protein dissimilarities:

creating a library of mutated version of DIII and DIV for each element of the network to artificially enlarge the network size;

generating similar Y2H data for other species, such as rice or tomato, which also possess large families of auxinrelated transcriptional regulators.
This would be particularly useful for small clusters such as the ARF+ enriched cluster, for which the regression model is constrained by the rather limited number of transcriptional regulators. However, it is important to stress here that the Y2H experiment was performed using additional sequences than DIII/IV for Aux/IAA (full length protein were used: [4]). Although there is no evidence that other Aux/IAA domains contribute to binding, we cannot eliminate the possibility that this introduces a bias that could affect the analysis of the influence of DIII/IV primary sequence distance on interaction likelihood. Testing the interaction capacity using only DIII/IV protein sequences for both Aux/IAA and ARF could be useful in the future to address this question. Note also that [16] has suggested an effect of the ARF middle region on interactions between Aux/IAA and ARF, thus implying that the interaction landscape could be more complex than the one established in our analysis.
It is finally interesting to compare the composition of the outlier cluster obtained with the different models (BM, GM, single and twoexplanatoryvariable LRM models). Four proteins (ARF22, IAA26, 32 and 33) were systematically assigned to the outlier cluster while three others (ARF9, IAA20 and 30) were assigned to the outlier cluster for all models estimated on the basis of the valued graph (GM, single and twoexplanatoryvariable LRM models). The composition of the outlier cluster is thus largely conserved for the different models. While this could be interpreted as a consequence of noise in the Y2H experiments affecting more specifically these proteins, analysis of the distribution of interaction likelihood involving this cluster suggest that proteins in this cluster might actually have a peculiar behaviour in the network (Additional file 1: Figure S9). The outlier cluster is characterized by an highly dispersed interaction likelihood. Proteins identified in the outlier cluster could thus be involved in specific interactions within the network, possibly highlighting an unsuspected function for these proteins in the regulation of auxin signalling.
Methods
Testing the Aux/IAA  ARF interaction capability
The Y2H experiment is a bioengineered tool based on the Gal4 transcription factor from yeast Saccharomyces Cerevisiae. The Gal4 transcription factor is made of an Nterminal DNA binding domain (BD) and a Cterminal activation domain (AD). These two subparts have been artificially separated, and tagging each with proteins allow to test for their interaction capability.
Yeast2hybrid protein interaction testing
In the original screening presented in [4], we manage to test the interaction capability of all members of the Aux/IAA  ARF family, except for ARF 15, 21 and 23. The interaction screening was therefore conducted on 49 transcriptional regulators, representing 1225 tested interactions. In order to be thorough, each interaction was tested both ways, meaning each protein was append to both AD and BD in two separate repetitions (e.g. ADARF1 v.s. BDARF2 and ADARF2 v.s. BDARF1). Finally, considering that this screening method can present false positives, two independent biological tests were conducted for each way. Overall, this represents a total of 4900 test results to analyze.
In this paper, we aim at modelling the influence of dissimilarities between dimerisation sequences on transcriptional regulator interactions. We thus had to remove the members of the Aux/IAA  ARF familly that does not possess the proteinprotein dimerisation domain, namely ARF3 and 17. This brings the number of proteins implicated in the network down to 47. Finally, ARF11 does not present any connexion in the binary network. In order to ease the comparison of the random graph clustering model outputs we chose to remove ARF11 from the analyses.
Reporting genes
The βgalactosidase (βgal) is an enzyme hydrolyzing XGal (or 5bromo4chloro3indolylbetaDgalactopyranosid) into a blue compound revealing its activity (i.e the interaction between proteins). The other reporting gene encodes a protein called imidazoleglycerolphosphate dehydratase (HIS3) which catalyses the sixth step in histidine biosynthesis. It is also from S. Cerevisiae and allow the yeast to produce histidine and thus to survive in an histidinefree medium.
Data description
The XGal test is based on a blue coloration of the media where yeasts are developing. The ordered marks for the XGal test were ‘’, ‘?’, ‘?’, ‘+?’, ‘+’, ‘++’, ‘+++’. We chose to use this full ordinal scale for computing standardized distances in order to build valued graphs. The four first marks ‘’, ‘?’, ‘?’, ‘+?’, were not distinguished and assimilated to ‘’ in [4] for defining a threshold for the XGal test. We fixed this threshold between ‘+?’ and ‘+’ (in the original ordinal scale) as in [4] in order to build binary graphs.
The HIS3 test is based on the capability of yeasts to synthesize histidine in an histidinefree medium. It can be viewed as an estimation of histidine synthesis capability upon function recovery. To assess for this synthesis capacity, a ratio of optical densities (ODs) between yeast growth in a medium without histidine and with histidine was used: {OD histidinefree medium}/ {OD histidinerich medium}. For detailed explanations on the test outputs used in the Y2H screen, see [4].
Network binarisation
Mixture model for optical density ratios
We estimated a threecomponent Gaussian mixture model $ \sum _{i=1}^{3} \alpha _{i} f_{i}\left (z;\mu _{i}, {\sigma _{i}^{2}}\right) $ on the basis of the overall OD ratio sample (HIS3 test) using the mclust R package [17]. The three components were selected using the Bayesian information criterion (BIC). We then investigated possible consistencies between limits between components (given by the values where the posterior probabilities of successive components are equal) and limits between successive marks for the XGal test. The first two components correspond to almost only XGal marks < ‘+’ while the last one corresponds mostly to marks ≥ ‘+’; see Fig. 2. The threshold for the HIS3 test was then fixed close to the limit between the second and the third component and the threshold for the XGal test between marks < ‘+’ and ≥ ‘+’.
Decision rules
Because it is a twoway tworeportinggene experiment, there are several possible test configurations which define the presence or absence of interaction for each tested interaction. In the following tables we give configurations potentially reflecting the ‘presence of interaction’, where we define a given test as “positive” (+) or “negative” () when its result is respectively above or below the defined thresholds: Configuration 1: all the tests are positive,
Configuration 2: only one test is not positive,
Configuration 3: only one way is positive for both reporter genes,
Configuration 4: one reporter gene is positive in each way,
Configuration 5: only one reporter gene is positive both ways,
An analysis not detailed here allowed us to state that the fifth configuration (only one reporter gene is positive both ways) is unreliable. We therefore discarded this case when defining the presence or absence of interaction for the binary network.
Dimerisation domain primary sequences
The protein sequences were obtained using the accession numbers of Aux/IAA and ARF presenting a dimerisation domain (ARF 3, 17 and 23 were thus excluded); see availability of supporting data for list of AGIs. Subsequences corresponding to DIII and DIV were obtained by first making a multiple alignment of the whole protein sequences using ClustalW [10]. Then, we searched for highly conserved regions using Gblocks 0.91b [11]; see availability of supporting data for list of used parameters. We subsequently found three conserved regions, the last two corresponding to DIII and DIV; for more information, see Additional file 1: Figure S6.
The flanking positions detected for domains III and IV from the full amino acid sequences were respectively [12751307] and [13441376]. Both conserved domains have a length of 32 amino acids. The sequence for DIII/IV is obtained from the concatenation of the two separate domains. We also conducted an analysis with slightly extended flanking positions [12721307] and [13441376], but this did not lead to significant changes in the analyses.
Linear regression mixture models for valued random graphs
The first version of the stochastic block model (SBM) was introduced in [18] and assumes that vertices are distributed into clusters and that the probability for an edge to exist between two vertices depends on the clusters the two vertices belong to, as described in Eq. (1). The LRM model used here is an extension to valued graphs with explanatory variables of the model introduced in [18]. An estimation method based on an expectationmaximization (EM) algorithm, with a variational approximation in the Estep was proposed in [7]. We briefly remind here some key ingredients of the estimation procedure. An implementation of the algorithm used in this study is provided by wmixnet [19].
Definition of linear regression mixture models
We consider a graph with n vertices (i=1,…n). Each vertex is assumed to belong to an (unobserved) cluster C _{ q } among Q possible clusters C _{1},…C _{ Q }. The probability for a given vertex to belong to cluster q is denoted by $\alpha _{q} \left (\sum _{q=1}^{Q} \alpha _{q} = 1\right)$ . The vertex memberships are supposed to be independent. For each pair of vertices (i, j), X _{ ij } denote the weight of the edge between them and Y _{ ij } the vector of explanatory variables associated with this pair of vertices. In the proposed model, the edge weights are independent conditionally on the vertex membership:
All the models considered here (except Model (1)) can be casted in this framework, taking for Model (2): Y _{ ij }=∅, b _{ q ℓ }=∅ ; for Model (5): $\textbf {Y}_{\textit {ij}}^{\intercal } =\, [\!Y_{\textit {ij}}]$ , $\textbf {b}_{\textit {q}\ell }^{\intercal } =\, [\!\beta _{\textit {q}\ell }]$ ; and for Model (8): $\textbf {Y}_{\textit {ij}}^{\intercal } = \left [Y_{\text {III}, ij} \; Y_{\text {IV}, ij}\right ]$ , $\textbf {b}_{\textit {q}\ell }^{\intercal } = \left [\beta _{\text {III}, q\ell } \; \beta _{\text {IV}, q\ell }\right ]$ .
Note that all these models are heterogeneous versions of the regression models considered in [7], since both the constants μ and the regression coefficients β depend on the vertex membership. As a consequence, such a model with d explanatory variables and Q clusters involves (Q−1) independent membership probabilities α _{ q }, Q ^{2} constants μ _{ q ℓ } and d Q ^{2} regression coefficients β _{ q ℓ }, that is (Q−1)+Q ^{2}(d+1) independent parameters.
Statistical methods for linear regression mixture models
The estimation of parameters, and the prediction of the vertex membership is made by a variational EM algorithm, first introduced for SBM in [6]. This algorithm is similar to a standard EM algorithm [20], since it alternates until convergence the determination of the conditional distribution of the vertex membership given the observed data (Estep) and the estimation of the parameters (Mstep). The estimation formulas used in the Mstep are given in [7].
In the case of SBM, the Estep cannot be calculated in an exact manner, as it would require to enumerate all possible vertex memberships, which is not possible even for a moderate network size. A variational approximation is used to circumvent this problem. Let τ _{ iq } be the conditional probability for vertex i to belong to cluster q given the observed edge weights. An approximation of τ _{ iq } is computed using the following fixedpoint formula:
where ϕ stands for the probability density function of the standard Gaussian distribution. Each step of the variational EM algorithm can be shown to increase a lower bound $\mathcal {J}$ of the loglikelihood of the observed data which can be rewritten as:
A model selection criterion is needed to choose the number of clusters. To this aim, we apply the ICL criterion derived in [6]. This criterion relies on a double penalty: one for the membership probabilities α _{ q } that are associated with the n vertices and one for the regression parameters (μ _{ q ℓ },β _{ q ℓ }) that are associated with the n(n−1)/2 edges. It finally writes as:
Response distance matrix: standardized distances between transcriptional regulators
The Y2H analysis involves two independent tests, the XGal and the HIS3 tests. The output of the XGal test can be interpreted as a distance defined on an ordinal scale (from no interaction to strong interaction) while the output of the HIS3 test can be interpreted as a distance defined on a ratio scale (between 0 and 1.7). Combining these observed distances requires a standardization procedure. The objective of standardization is to avoid dependency on the elementary distance type and scale. In the case of an ordinal distance (XGal test), observed distances are replaced by ranked distances
where y _{ ij } is the output of the XGal test for proteins i and j, and f _{ n } is the frequency of mark n (the possible marks are assumed to be represented as contiguous positive integers). In this case, the normalization quantity is the mean rank (1+N ^{2})/2, where N is the number of proteins.
The ratioscaled distance (HIS3 test) can be either treated as an intervalscaled distance or as an ordinal distance. Considering that the response curve of the HIS3 test is monotone but highly nonlinear and is close to a Michaelis–Menten kinetics, we chose to consider the output of the HIS3 test as a distance defined on an ordinal scale for standardization. Observed distances are replaced by the ranked distances Rank(y _{ ij }) for the XGal test and Rank(z _{ ij }) for the HIS3 test, and the standardized distances are:
where w _{XGal} and w _{HIS3} are the weights of the XGal and HIS3 tests with w _{XGal}+w _{HIS3}=1. It should be noted that a single marginal distribution was considered for each test used in the two possible configurations (bait or prey) in order to standardize the distances. In the case of missing test values, the distances can be straightforwardly adapted. If z _{ ji } is missing, we obtain:
where M _{XGal} is the number of XGal test values, and M _{HIS3} is the number of HIS3 test values.
The distance matrices {x _{ ij };i,j=1,…,N} corresponding to (w _{XGal},w _{HIS3})=(1,0), (0.75,0.25), (0.5,0.5), (0.25,0.75), (0,1) were built and tested.
Distances between dimerisation domain primary sequences
To use the primary sequence information as an explanatory variable in LRM models, we have to define a distance between two protein sequences. PROTDIST allows to compute such distances by using amino acid substitution models. One can choose between five different models, and we tested three of them: PAM, JTT and PMB. PMB which performed poorly was not used in the analyses. Finally, PAM and JTT outputs being rather similar, we focused on the PAM model, since it seems to be the most common one to date. For more information about the protein substitution models, see the PROTDIST documentation (http://evolution.genetics.washington.edu/phylip/doc/protdist.html).
Assessing the quality of the clustering
We assessed the quality of the clustering obtained by evaluating the separability of the clusters and the dispersion of the proteins within the clusters. Since, in our case, the assignment of proteins to clusters is almost deterministic (i.e. τ _{ iq }≃1 for a unique cluster q and τ _{ i ℓ }≃0 for ℓ≠q where τ _{ iq } is the posterior probability of assigning protein i to cluster q), this assignment can be viewed as a partition. The model parameters, which parametrized the edges of the graph, cannot be used directly to define dispersion measures of the proteins assigned to a given cluster. We thus used the adjacency information to derive dissimilarity measures for the proteins. The distance $D(i,j)=\sum _{k}x_{\textit {ik}}x_{\textit {jk}}/N$ between the ith and jth rows of the weighted adjacency matrix {x _{ ij };i,j=1,…,N} quantifies the difference in connectivity profile between proteins i and j. In the case of the binary adjacency matrix, this distance is the SokalMichener distance between proteins i and j [21]: $D(i,j)=\sum _{k} I(x_{\textit {ik}}\neq x_{\textit {jk}})/N$ , where I() denotes the indicator function. This is the proportion of mismatches between the ith and jth rows of the adjacency matrix.
The distance between protein i and cluster q is given by:
If the proteins are deterministically assigned to a given cluster, this distance simplifies to
where n _{ q } is the number of proteins assigned to cluster q.The distance between cluster q and cluster ℓ can be directly derived as
The within and betweencluster distances can then be defined as
Availability of supporting data
Original Yeast2Hybrid data for Aux/IAA  ARF interaction tests
All Y2H interaction results for XGal and HIS3 reporters are available in supplementary data of [4].
Aux/IAA  ARF protein sequences
Protein sequences can be found within Arabidopsis thaliana proteins banks such as SwissProt Protein Database http://www.expasy.org/ using the following 52 accession numbers: [SwissProt:Q8L7G0, SwissProt:Q94JM3, SwissProt:O23661, SwissProt: Q9ZTX9, SwissProt:P93024, SwissProt:Q9ZTX8, SwissProt:P93022, SwissProt:Q9FGV1, SwissProt: Q9XED8, SwissProt:Q9SKN5, SwissProt:Q9ZPY6, SwissProt:Q9XID4, SwissProt:Q9FX25, SwissProt: Q9LQE8, SwissProt:Q9LQE3, SwissProt:Q93YR9, SwissProt:Q84WU6, SwissProt:Q9C5W9, SwissProt: Q8RYC8, SwissProt:Q9C7I9, SwissProt:Q9C8N9, SwissProt:Q9C8N7, SwissProt:Q9LP07, SwissProt: Q38828, SwissProt:Q38829, SwissProt:Q38830, SwissProt:Q38831, SwissProt:Q38832, SwissProt: Q9C966, SwissProt:O24407, SwissProt:P93830, SwissProt:O24408, SwissProt:O24409, SwissProt:P49677, SwissProt:O24410, SwissProt:Q8LAL2, SwissProt:Q9ZSY8, SwissProt:Q9XFM0, SwissProt:Q93WC4, SwissProt:P49678, SwissProt:Q9M1R4, SwissProt: Q8H174, SwissProt:Q8RYC6, SwissProt:Q9FKM7, SwissProt:Q9C5X0, SwissProt:Q38822, SwissProt:P33077, SwissProt:P33078, SwissProt:Q38824, SwissProt:Q38825, SwissProt:Q38826, SwissProt:Q38827].
Domains III and IV subsequences
To obtain protein subsequences corresponding to conserved domains III and IV, we used the following parameters in Gblocks:

Minimum Number Of Sequences For A Conserved Position: 25

Minimum Number Of Sequences For A Flanking Position: 25

Maximum Number Of Contiguous Nonconserved Positions: 8

Minimum Length Of A Block: 10

Allowed Gap Positions: With Half

Use Similarity Matrices: Yes
See Additional file 1: Figure S6 for a detailed view of the aligned sequences and the conserved subsequences.
Abbreviations
 Aux/IAA:

AUXIN/INDOLE3ACETIC ACID
 ARF:

AUXIN RESPONSE FACTOR
 CTD:

Cterminal dimerisation domain
 DBD:

DNA binding domain
 DIII/IV:

Domain III/IV
 DIII:

Domain III
 DIV:

Domain IV
 Y2H:

Yeast2hybrid
 AD:

Activation domain
 BD:

Binding domain
 GM:

Gaussian mixture
 BM:

Bernoulli mixture
 ICL:

Integrated completed likelihood
 LRM:

Linear regression mixture
 OD:

Optical density
 BIC:

Bayesian information criterion
 SBM:

Stochastic block model
 EM:

Expectationmaximization
References
 1
Leyser O. Dynamic integration of auxin transport and signalling. Curr Biol. 2006; 16(11):424–33. doi:10.1016/j.cub.2006.05.014.
 2
Guilfoyle TJ, Hagen G. Auxin response factors. Curr Opin Plant Biol. 2007; 10(5):453–60. doi:10.1016/j.pbi.2007.08.014.
 3
Remington DL, Vision TJ, Guilfoyle TJ, Reed JW. Contrasting modes of diversification in the Aux/IAA and ARF gene families. Plant Physiol. 2004; 135(3):1738–52. doi:10.1104/pp.104.039669.
 4
Vernoux T, Brunoud G, Farcot E, Morin V, Van den Daele H, Legrand J, et al. The auxin signalling network translates dynamic input into robust patterning at the shoot apex. Mol Syst Biol. 2011; 7:508.
 5
Joung JK, Ramm EI, Pabo CO. A bacterial twohybrid selection system for studying proteinDNA and proteinprotein interactions. Proc Natl Acad Sci USA. 2000; 97(13):7382–7. doi:10.1073/pnas.110149297.
 6
Daudin JJ, Picard F, Robin S. A mixture model for random graphs. Stat Comput. 2007; 18(2):173–83. doi:10.1007/s1122200790467.
 7
Mariadassou M, Robin S, Vacher C. Uncovering latent structure in valued graphs: A variational approach. Ann Appl Stat. 2010; 4(2):715–42. doi:10.1214/10AOAS361.
 8
Kass RE, Raftery AE. Bayes Factors. J Am Stat Assoc. 1995; 90(430):773–95.
 9
Kriegel HP, Kröger P, Sander J, Zimek A. Densitybased clustering. Wiley Interdiscip Rev Data Mining Knowl Discov. 2011; 1(3):231–40. doi:10.1002/widm.30.
 10
Thompson J, Higgins D, Gibson T. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22:4673–80.
 11
Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000; 17(4):540–52.
 12
Nanao MH, VinosPoyo T, Brunoud G, Thévenon E, Mazzoleni M, Mast D, et al. Structural basis for oligomerization of auxin transcriptional regulators. Nat Commun. 2014; 5:3617. doi:10.1038/ncomms4617.
 13
Korasick DA, Westfall CS, Lee SG, Nanao MH, Dumas R, Hagen G, et al.Molecular basis for AUXIN RESPONSE FACTOR protein interaction and the control of auxin response repression. Proc Natl Acad Sci U S A. 2014; 111(14):5427–32. doi:10.1073/pnas.1400074111.
 14
Han M, Park Y, Kim I, Kim EH, Yu TK, Rhee S, et al.Structural basis for the auxininduced transcriptional regulation by Aux/IAA17. Proc Natl Acad Sci USA. 2014; 111(52):18613–8. doi:10.1073/pnas.1419525112.
 15
Dinesh DC, Kovermann M, Gopalswamy M, Hellmuth A, Calderón Villalobos LIA, Lilie H, et al.Solution structure of the PsIAA4 oligomerization domain reveals interaction modes for transcription factors in early auxin response. Proc Natl Acad Sci. 2015:201424077. doi:10.1073/pnas.1424077112.
 16
Shen C, Wang S, Bai Y, Wu Y, Zhang S, Chen M, et al. Functional analysis of the structural domain of ARF proteins in rice (Oryza sativa L.)J Exp Bot. 2010; 61(14):3971–81. doi:10.1093/jxb/erq208.
 17
Fraley C, Raftery AE. MCLUST Version 3 for R: Normal Mixture Modeling and ModelBased Clustering. Technical Report 504, University of Washington, Department of Statistics. 2006.
 18
Nowicki K, Snijders TAB. Estimation and prediction for stochastic blockstructures. J Am Stat Assoc. 2001; 96:1077–87.
 19
Leger JB. Wmixnet: Software for clustering the nodes of binary and valued graphs using the stochastic block model. Technical report, arXiv:1402.3410. 2014. https://www6.inra.fr/miaparis/ProductionScientifique/Logiciel.
 20
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977; 39:1–38.
 21
Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: Wiley; 1990.
 22
Chapman EJ, Estelle M. Mechanism of auxinregulated gene expression in plants. Annu Rev Genet. 2009; 43:265–85.
 23
Hagen G, Guilfoyle T. Auxinresponsive gene expression: genes, promoters and regulatory factors. Plant Mol Biol. 2002; 49(34):373–85.
Acknowledgements
J. Peyhardi (Virtual Plants) for advices regarding the statistical models. iSAM transnational ERASysBio+ Grant to JL and TV.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
TV and YG conceived the project. JL, JBL and YG analysed the Y2H raw data and the topology of the interactome (MixNet and wMixNet). JBL and SR set up the MixNet and wMixNet models, algorithms and software. JL, TV and YG wrote the manuscript with input from the other authors. All authors have read and approved the final version of the manuscript.
Additional files
Additional file 1
Supplemental Figures. This file contains the following: histograms of optical density ratios (HIS3 test) for the successive XGal marks in Figure S1. The adjacency matrix obtained using 3 HIS3 thresholds in Figure S2. The adjacency matrices for the 4cluster and 6cluster BM models in Figure S3. The ranked average distances between transcriptional regulators for the 4cluster GMA model in Figure S4. The valued adjacency matrix for the 4cluster GMA model in Figure S5. The multiple alignment of amino acid sequences of the transcriptional regulators in Figure S6. The valued adjacency matrix for the 4cluster singleexplanatoryvariable LRM model in Figure S7. The linear regressions for each pair of clusters within the 4cluster singleexplanatoryvariable LRM model in Figure S8. The valued adjacency matrix for the 4cluster twoexplanatoryvariable LRM model in Figure S9. The linear regressions for each pair of clusters within the 4cluster twoexplanatoryvariable LRM model in Figure S10. the ranked average distances between transcriptional regulators for the 4cluster singleexplanatoryvariable LRM model in Figure S11. and the ranked average distances between transcriptional regulators for the 4cluster twoexplanatoryvariable LRM model in Figure S12. (PDF 1894 kb)
Additional file 2
Supplemental Tables. This file contains betweencluster distance matrices for the 6cluster BM model in Table S1. and the 5cluster GMA model in Table S2. (PDF 75.2 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Arabidospsis thaliana
 Auxin signalling network
 Transcriptional regulation
 Linear regression model
 Mixture model for random graphs
 Plant development
 Binary and valuedgraph clustering