A typical model of biochemical dynamics describes how the abundances of a set of k molecular entities, \(y=(y_{1},\ldots,y_{q},\ldots,y_{k})\), change with time t. Deterministically, it is usually written as an ordinary differential equation (ODE)
$$ \frac{dy}{dt}=F(y,\theta), $$
(1)
where F(·) is a law that determines the temporal evolution of y and implicitly contains a control signal. The vector \(\theta=(\theta_{1},\ldots,\theta_{l})\) contains the model parameters. To numerically simulate the model, parameter values and the initial condition, \((y_{1}(0),\ldots,y_{k}(0))\), must be set. The method proposed in this paper is a priori in nature; therefore, the parameter values and initial conditions are not inferred from data but must be assumed in advance based on the modeller's knowledge.
Often only certain components of y, for instance the first q, \(y^{(q)}=(y_{1},\ldots,y_{q})\), at specified times, \((t_{1},\ldots,t_{n})\), are of interest. These components, which may correspond to experimentally measured variables, are denoted here as \(Y=\left(y^{(q)}(t_{1}),\ldots,y^{(q)}(t_{n})\right)\).
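For concreteness, the following is a minimal sketch of this setup in R, using the deSolve package. The right-hand side, parameter values, initial condition and observation times are illustrative assumptions (they anticipate the gene expression example discussed later), not prescribed by the method.

```r
# Sketch: simulate Eq. (1) and extract the observed components Y.
library(deSolve)

F_rhs <- function(t, y, theta) {
  with(as.list(c(y, theta)), list(c(
    k_r - g_r * r,        # d(mRNA)/dt:    production - degradation
    k_p * r - g_p * p     # d(protein)/dt: translation - degradation
  )))
}

theta <- c(k_r = 1, k_p = 2, g_r = 0.5, g_p = 0.1)  # assumed a priori
y0    <- c(r = 0, p = 0)                            # initial condition
times <- seq(0, 50, by = 1)                         # t_1, ..., t_n

sol <- ode(y = y0, times = times, func = F_rhs, parms = theta)
Y   <- sol[, "p"]  # observe only the protein component (q = 1)
```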
Conventional sensitivity analysis fails to capture collective interactions between model parameters
Sensitivity analysis predicts how Y will change, \(\partial Y\), in response to small changes in a single parameter, \(\partial \theta_{i}\), or in all parameters, \(\partial \theta=(\partial \theta_{1},\ldots,\partial \theta_{l})\). If the changes in parameters are small, the problem is solved by finding the derivative of a solution of Eq. (1), y(t), with respect to the parameter \(\theta_{i}\), \(z_{i}(t)=\frac{\partial y(t)}{\partial \theta_{i}}\). This derivative can be easily calculated by solving another ODE, \(\frac{dz_{i}}{dt}=\frac{\partial F}{\partial y}z_{i}+\frac{\partial F}{\partial \theta_{i}}\) (see Additional file 1). Evaluating \(z_{i}(t)\) at the times and components of interest defines the sensitivity vector \(S_{i}=\left(z_{i}^{(q)}(t_{1}),\ldots,z_{i}^{(q)}(t_{n})\right)\) of the parameter \(\theta_{i}\). The sensitivity vector describes the shift in Y in response to a perturbation of the parameter \(\theta_{i}\): \(\partial Y=S_{i}\,\partial \theta_{i}\). The collection of sensitivity vectors for all \(i=1,\ldots,l\) constitutes the sensitivity matrix \(S=(S_{1},\ldots,S_{l})\), which summarises the change in Y in response to a perturbation of all model parameters: \(\partial Y=S\,\partial \theta\). The sensitivity matrix, S, is directly linked with the concept of Fisher information. Given that Y is observed with Gaussian unit-variance error, the Fisher information matrix (FIM) can be written as (see Additional file 1)
$$ FI(\theta)=S^{T} S. $$
(2)
Therefore the FIM contains information regarding the size of a perturbation, \(||\partial Y||=\sqrt{\partial \theta^{T} FI(\theta)\, \partial \theta}\). The pairwise similarity between parameters, quantified as the cosine between the \(S_{i}\) and \(S_{j}\) vectors, is also given by elements of the FIM, \(\cos(S_{i},S_{j})={S_{i}^{T}} S_{j}/{||S_{i}||\,||S_{j}||}\). It is not clear, however, how the FIM can serve as a tool to analyse mutual relations between groups of parameters. Below we provide a rigorous and practical solution to this problem.
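Both quantities fall out of S directly. A minimal sketch in R, assuming a sensitivity matrix S whose rows index the observed components at the observation times and whose columns are the sensitivity vectors \(S_{i}\) (in practice S comes from solving the sensitivity ODEs alongside the model; the function names are ours):

```r
fim <- function(S) crossprod(S)   # FI(theta) = S^T S, Eq. (2)

cosine_similarity <- function(S) {
  FI <- fim(S)
  norms <- sqrt(diag(FI))         # ||S_i|| from the FIM diagonal
  FI / outer(norms, norms)        # cos(S_i, S_j) = S_i^T S_j / (||S_i|| ||S_j||)
}
```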
Measuring similarity between parameter groups
Canonical correlations. Canonical correlation analysis (CCA) is a simple extension of the Pearson correlation. With canonical correlations (CCs) it is possible to measure correlations between multidimensional covariates. We modify the well-established definition to suit the considered context. Assume we measure similarity between two subsets of parameters \(\theta _{A}=\left \{\theta _{i_{1}},\ldots,\theta _{i_{a}} \right \}\) and \(\theta _{B}=\left \{ \theta _{j_{1}},\ldots,\theta _{j_{b}} \right \}\) that correspond to the two subsets of sensitivity vectors, \(\Omega _{A}=\left \{S_{i_{1}},\ldots,S_{i_{a}} \right \}\) and \(\Omega _{B}=\left \{ S_{j_{1}},\ldots,S_{j_{b}} \right \}\). The latter can be interpreted as hyper-planes. CCs form a set of correlation coefficients defined recursively. The first CC, \(\rho_{1}\), is the maximal cosine between a linear combination, \(u_{1}\), in \(\Omega_{A}\) and a linear combination, \(v_{1}\), in \(\Omega_{B}\): \(\rho_{1}=\cos(u_{1},v_{1})\). Each subsequent CC is found in the same way under the constraint that the next linear combination must be orthogonal to those found in the previous steps (see Additional file 1). Repeating the procedure \(m=\min(a,b)\) times provides a set of CCs \(1\geq \rho_{1}\geq \ldots \geq \rho_{m}\geq 0\) (see Fig. 1a–b). A value of 1 indicates that there exist linear combinations of parameters in \(\theta_{A}\) and in \(\theta_{B}\) that have an identical impact, whereas 0 indicates the existence of an orthogonal parameter combination. The CCs therefore provide an m-dimensional similarity measure between \(\theta_{A}\) and \(\theta_{B}\).
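Computationally, the CCs are the cosines of the principal angles between the two hyper-planes, which can be obtained from an SVD of the product of orthonormal bases. A minimal sketch (the helper names are ours; the paper's own implementation is in Additional file 2):

```r
# Orthonormal basis of the column space of X (rank-aware, via SVD)
orth_basis <- function(X, tol = 1e-10) {
  s <- svd(X)
  s$u[, s$d > tol * max(s$d), drop = FALSE]
}

# Canonical correlations between the sensitivity vectors of two
# parameter groups, given as column indices idx_A and idx_B of S
canonical_correlations <- function(S, idx_A, idx_B) {
  QA <- orth_basis(S[, idx_A, drop = FALSE])  # basis of Omega_A
  QB <- orth_basis(S[, idx_B, drop = FALSE])  # basis of Omega_B
  # Singular values of QA^T QB are the cosines of the principal
  # angles, i.e. the CCs rho_1 >= ... >= rho_m
  svd(crossprod(QA, QB))$d
}
```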
Mutual information. The above geometric view has a probabilistic interpretation that provides a natural, one-dimensional similarity measure. Assume we estimate the parameter vector θ using the maximum likelihood estimate \(\hat {\theta }\) (equivalently, a Bayesian posterior estimate) from data X=Y+ξ, where ξ is measurement error. For a large number, N, of independent copies of X, the distribution of the estimate \(\hat {\theta }\) given a true value θ is asymptotically multivariate normal
$$ P(\hat{\theta}|\theta)\propto \exp\left(-\frac{N}{2}(\hat{\theta}-\theta)^{T} FI(\theta)(\hat{\theta}-\theta)\right). $$
(3)
Consider the entropy, \(H(\hat {\theta }_{A})\), of the estimate \(\hat {\theta }_{A}\), and the average conditional entropy of \(\hat {\theta }_{A}\) given \(\hat {\theta }_{B}\), \(H(\hat {\theta }_{A}|\hat {\theta }_{B})\). The reduction in entropy of \(\hat {\theta }_{A}\) resulting from knowledge of \(\hat {\theta }_{B}\) is given by Shannon's mutual information between \(\hat {\theta }_{A}\) and \(\hat {\theta }_{B}\), denoted here by \(I(\theta_{A},\theta_{B})\). We propose to use \(I(\theta_{A},\theta_{B})\) as the natural measure of similarity: the more similar \(\theta_{A}\) and \(\theta_{B}\) are, the more knowing one will help in determining the value of the other. In Additional file 1 we show that the mutual information between the estimates \(\hat {\theta }_{A}\) and \(\hat {\theta }_{B}\) and the CCs are closely related:
$$ I(\theta_{A},\theta_{B})=H(\hat{\theta}_{A}) -H(\hat{\theta}_{A}|\hat{\theta}_{B})= -\frac{1}{2}\sum_{i=1}^{m} \log\left(1-{\rho_{i}^{2}}\right), $$
(4)
where \(H(\hat {\theta }_{A}|\hat {\theta }_{B})\) is the conditional entropy of \(\hat {\theta }_{A}\) given \(\hat {\theta }_{B}\). The above measure, which throughout the paper is called MI-CCA, provides a novel and efficient way to quantify the overall similarity between parameter groups via mutual information and CCs.
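Given the CCs, Eq. (4) turns into a one-liner. A minimal sketch reusing the `canonical_correlations` helper above (the cap on ρ is a numerical guard we add, not part of the definition):

```r
# MI-CCA similarity between two parameter groups, Eq. (4)
mi_cca <- function(S, idx_A, idx_B) {
  rho <- canonical_correlations(S, idx_A, idx_B)
  rho <- pmin(rho, 1 - 1e-12)  # guard: rho = 1 would give infinite MI
  -0.5 * sum(log(1 - rho^2))
}
```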
We use the constructed measures to propose a natural definition of parameter identifiability in the multi-parameter scenario.
(δ,ζ)-identifiability
Conventionally, parameters of a statistical model P(Y|θ) are said to be identifiable if there exists a neighbourhood of θ in which every parameter value yields a different distribution P(Y|θ). Equivalently, the FIM must have full rank. This definition refers simultaneously to the entire vector of model parameters θ. The definition of [13] introduces a notion of practical non-identifiability by examining the flatness of the likelihood surface. We propose a novel definition of identifiability of individual parameters in multi-parameter models. It is widely recognised that lack of identifiability can arise from two sources: lack of sensitivity, or compensation of a parameter by the remaining model parameters [7–10, 12, 27–30]. A definition that quantifies this intuition has been missing. Therefore, we propose a natural criterion for whether the parameter \(\theta_{i}\) can be identified along with the remaining model parameters, \(\theta_{-i}\). The parameter \(\theta_{i}\) is said to be (δ,ζ)-identifiable if \(\rho(\theta_{i},\theta_{-i})<\delta\) and \(||S_{i}||>\zeta\). Correlation ρ is used here in the canonical sense. If \(\theta_{i}\)
were estimated as the only parameter of the model, the ζ-condition would require its asymptotic variance to be smaller than \(1/\zeta^{2}\). The δ-condition requires the parameter not to be correlated with any linear combination of the remaining parameters by more than δ. In variance terms, it translates into demanding that the variance does not increase by more than a factor of \(1/(1-\delta^{2})\) when the single-parameter and multi-parameter scenarios are compared (Fig. 1c). The above definition is conceptually similar to the profile likelihood approach. However, it uses the asymptotic likelihood instead of the actual likelihood and therefore does not require any numerical optimisation; based on the FIM, solutions are given analytically by CCs. As a result, identifiability can be determined for models of virtually any size. In practical applications, values of δ and ζ must be selected. The above interpretation of the δ and ζ values provides theoretical grounds to guide how these thresholds can be set. For instance, in the logarithmic parametrisation, setting ζ=1 requires a parameter to be learned with at most an order-of-magnitude error. The parameter δ controls how much the estimate's variance increases when the parameter is estimated jointly with the remaining model parameters rather than alone. Setting stricter values (lower δ and higher ζ) will result in a lower variance of parameter estimates. The efficiency of the method enables the analysis to be performed for a range of δ and ζ values that correspond to different levels of stringency. In the applications considered in this paper we used ζ=1 and δ=0.95; the latter corresponds to an approximately 10-fold increase of variance (Fig. 1c). In Additional file 1 we use one of the analysed experiments to show that these thresholds provide results consistent with the profile likelihood approach. In general, profile likelihoods can also be used to validate the method's predictions as experimental data become available (see Sections 4.3 and 6.6 of Additional file 1).
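Both conditions are cheap to check once S is available. A minimal sketch, reusing the helpers defined above (the function and argument names are ours):

```r
# (delta, zeta)-identifiability of each parameter: theta_i passes if its
# first CC with the remaining parameters is below delta and ||S_i|| > zeta
is_identifiable <- function(S, delta = 0.95, zeta = 1) {
  l <- ncol(S)
  sapply(seq_len(l), function(i) {
    rho_max <- canonical_correlations(S, i, setdiff(seq_len(l), i))[1]
    rho_max < delta && sqrt(sum(S[, i]^2)) > zeta
  })
}
```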
Clustering reveals similarity structure and identifiability
Using the constructed similarity measure we can meaningfully group model parameters. We provide a modification of the conventional hierarchical clustering algorithm. At each level of the hierarchy, clusters are created by merging clusters at the next lower level; at the lowest level, each cluster contains a single parameter. The pair chosen for merging consists of the two groups with the highest mutual information, \(I(\theta_{A},\theta_{B})\). When a new cluster is formed, we verify whether each of the parameters within the newly created cluster satisfies the δ-condition. The parameters most correlated with the remaining parameters of the cluster are removed until all satisfy the δ-condition. To determine the height of linkages, we use the average of \(1-\rho_{i}^{2}\) over the canonical correlations between the clusters, \(\frac{1}{m}\sum_{i=1}^{m}(1-\rho_{i}^{2})\), which, as opposed to \(I(\theta_{A},\theta_{B})\), is normalised. The resulting set of identifiable parameters is not guaranteed to be maximal; finding the maximal set would require testing each subset of the parameter set, which is computationally infeasible. As output, the algorithm provides a visualisation of the similarity structure and a set of identifiable parameters (see Fig. 2). The pseudocode describing the clustering algorithm in detail is presented in Section 3 of Additional file 1, and an R implementation (Additional file 2) is available as an online supplement.
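The merge step can be sketched as a simplified greedy loop (it omits the δ-condition pruning and linkage heights; the full pseudocode is in Section 3 of Additional file 1):

```r
# Greedy agglomeration: repeatedly merge the two clusters with the
# highest MI-CCA similarity (simplified; no delta-pruning shown)
cluster_parameters <- function(S) {
  clusters <- as.list(seq_len(ncol(S)))  # start: one parameter per cluster
  merges <- list()
  while (length(clusters) > 1) {
    pairs <- combn(length(clusters), 2)
    mi <- apply(pairs, 2, function(p)
      mi_cca(S, clusters[[p[1]]], clusters[[p[2]]]))
    best <- pairs[, which.max(mi)]
    merges[[length(merges) + 1]] <- clusters[best]  # record the merged pair
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL                     # drop the absorbed cluster
  }
  merges
}
```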
Example: a simple gene expression model
To clarify the principles behind the method, we use a simplistic gene expression model. We assume that the process begins with the production of mRNA molecules at rate \(k_{r}\). Each mRNA molecule r may be independently translated into protein molecules at rate \(k_{p}\). Both mRNA and protein molecules are degraded, at rates \(\gamma_{r}\) and \(\gamma_{p}\), respectively. Therefore, we have the vector of model parameters \(\theta=(k_{r},k_{p},\gamma_{r},\gamma_{p})\) and the ODEs presented in Figure 1A in Additional file 1. Consider the steady state \(Y=\left (\frac {k_{r}}{\gamma _{r}}, \frac {k_{r} k_{p}}{\gamma _{r}\gamma _{p}}\right)\). We address the following questions: 1) Which model parameters are most similar? 2) Which parameters are identifiable? 3) What consequences does the similarity structure have for model robustness? 4) How can the steady-state experiment be modified to reduce parameter correlations? The similarity of the parameters is determined entirely by the response of the model to changes in parameter values.
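These questions can be answered directly from the steady-state sensitivity matrix. In the logarithmic parametrisation, the entries of S are the exponents in the steady-state formula, so S can be written down by hand. A minimal sketch using the helpers defined earlier (the variable names are ours):

```r
# d log Y / d log theta at the steady state Y = (k_r/g_r, k_r*k_p/(g_r*g_p))
S_ss <- rbind(c(1, 0, -1,  0),   # log mRNA level
              c(1, 1, -1, -1))   # log protein level
colnames(S_ss) <- c("k_r", "k_p", "g_r", "g_p")

canonical_correlations(S_ss, 1, 3)              # k_r vs g_r: rho = 1, fully correlated
canonical_correlations(S_ss, c(1, 3), c(2, 4))  # (k_r,g_r) vs (k_p,g_p): ~0.71
is_identifiable(S_ss)                           # FALSE for all four parameters
```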
The steady-state formula implies that perturbations in \(k_{r}\) and \(\gamma_{r}\) have the same impact, i.e. they increase or decrease both the RNA and the protein level. The same holds for perturbations in \(k_{p}\) and \(\gamma_{p}\). On the contrary, a perturbation in \((k_{r},\gamma_{r})\) does not have the same impact as one in \((k_{p},\gamma_{p})\): the first pair affects the levels of both RNA and protein; the latter affects only the protein level. This intuition is formalised and visualised by the method. The linkages between the parameters \(k_{r},\gamma_{r}\) and between \(k_{p},\gamma_{p}\) are plotted at zero height, and the non-identifiable parameters are marked red (Figure 1B in Additional file 1). The linkage between the pairs is at a non-zero height, as they are not entirely correlated. As for model robustness, the dendrogram shows that mutually compensative perturbations occur within the pairs \((k_{r},\gamma_{r})\) and \((k_{p},\gamma_{p})\). The analysis highlights the sources of non-identifiability and therefore helps to find experiments that render more parameters identifiable. For instance, in this example, pushing the initial condition \(r(t_{0}),p(t_{0})\) above the steady-state levels changes the model dynamics (Figure 1C in Additional file 1). The resulting exponential decay is not invariant with respect to parameter changes; as a result, all parameters can be identified (Figure 1C in Additional file 1).
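The modified experiment can be checked numerically by approximating the time-course sensitivities with finite differences and re-running the identifiability test. A rough sketch reusing `F_rhs`, `theta` and `times` from the simulation snippet above (the perturbed initial condition and step size are illustrative assumptions):

```r
# Finite-difference sensitivities d y(t) / d log theta_i over the time course
sensitivities <- function(theta, y0, times, h = 1e-4) {
  base <- ode(y = y0, times = times, func = F_rhs, parms = theta)[, -1]
  sapply(names(theta), function(nm) {
    th <- theta
    th[nm] <- th[nm] * exp(h)      # perturb the log-parameter
    pert <- ode(y = y0, times = times, func = F_rhs, parms = th)[, -1]
    as.vector((pert - base) / h)   # stack both components over all times
  })
}

y0_high <- c(r = 10, p = 50)       # initial condition above the steady state
S_tc <- sensitivities(theta, y0_high, times)
is_identifiable(S_tc)              # the decay phase can render all four identifiable
```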