Parameter identifiability analysis and visualization in large-scale kinetic models of biosystems

Background Kinetic models of biochemical systems usually consist of ordinary differential equations that have many unknown parameters. Some of these parameters are often practically unidentifiable, that is, their values cannot be uniquely determined from the available data. Possible causes are lack of influence on the measured outputs, interdependence among parameters, and poor data quality. Uncorrelated parameters can be seen as the key tuning knobs of a predictive model. Therefore, before attempting to perform parameter estimation (model calibration) it is important to characterize the subset(s) of identifiable parameters and their interplay. Once this is achieved, it is still necessary to perform parameter estimation, which poses additional challenges. Methods We present a methodology that (i) detects high-order relationships among parameters, and (ii) visualizes the results to facilitate further analysis. We use a collinearity index to quantify the correlation between parameters in a group in a computationally efficient way. Then we apply integer optimization to find the largest groups of uncorrelated parameters. We also use the collinearity index to identify small groups of highly correlated parameters. The results files can be visualized using Cytoscape, showing the identifiable and non-identifiable groups of parameters together with the model structure in the same graph. Results Our contributions alleviate the difficulties that appear at different stages of the identifiability analysis and parameter estimation process. We show how to combine global optimization and regularization techniques for calibrating medium and large scale biological models with moderate computation times. Then we evaluate the practical identifiability of the estimated parameters using the proposed methodology. The identifiability analysis techniques are implemented as a MATLAB toolbox called VisId, which is freely available as open source from GitHub (https://github.com/gabora/visid). Conclusions Our approach is geared towards scalability. It enables the practical identifiability analysis of dynamic models of large size, and accelerates their calibration. The visualization tool allows modellers to detect parts that are problematic and need refinement or reformulation, and provides experimentalists with information that can be helpful in the design of new experiments. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0428-y) contains supplementary material, which is available to authorized users.


.1.1 Equations
This section shows the model equations of the TGF-β pathway, as presented in the work of Geier and coauthors [1]. The model includes 29 kinetic rates, which are as follows: ) 10 ) r 4 = k 4 C TGFb TGFbR P r 5 = k 5 C TGFb TGFbR P C I Smad r 6 = k 6 C I Smad TGFb TGFbR P r 7 = k 7 C Smad C TGFb TGFbR P r 8 = k 8 C Smad r 9 = k 9 C Smad N r 10 = k 10 2C Smad P C Smad P r 11 = k 11 C Smad P Smad P r 12 = k 10 C Smad P C CoSmad r 13 = k 11 C Smad P CoSmad r 14 = k 8 C CoSmad r 15 = k 9 C CoSmad N r 16 = k 12 k 8 C Smad P Smad P r 17 = k 8 C Smad P r 18 = k 9 C Smad P N r 19 = k 12 k 8 C Smad P CoSmad r 20 = k 13 C Smad P N r 21 = k 10 2C Smad P N C Smad P N r 22 = k 11 C Smad P Smad P N r 23 = k 10 C Smad P N C CoSmad N r 24 = k 11 C Smad P CoSmad N r 25 = k 14 C 2 Smad P CoSmad N The kinetic rates appear in the differential equations of the 18 state variables, which represent the time-varying concentrations of the species involved in the pathway: dC TGFb TGFbR dt = −r 1 + r 2 − r 3 + r 4 + r 6 dC TGFb TGFbR P dt = r 3 − r 4 − r 5 dC I Smad TGFb TGFbR P dt = r 5 − r 6 dC Smad dt = −r 7 − r 8 + r 9 dC Smad P dt = r 7 − r 10 + r 11 − r 12 + r 13 − r 17 + r 18 dC I Smad dt = r 28 − r 29 − r 5 + r 6 .
As in [1], we also assume that all the concentrations, except the Smad RNAs (C I Smad mRNA1 and C I Smad mRNA2 ), can be observed in the experiments.

Simulation details
Following the procedure described in [1], the initial conditions of the dynamic state variables were determined by finding their steady states. For this calculation, we took C dGFbR (0) = 1, C Smad (0) = 60 and C CoSmad (0) = 60, the initial concentrations of the other species as zero, and k 3 = 0 to temporarily remove the stimuli from the model. Then, simulations were performed for a suitable long time to obtain the steady state values of the variables. Finally, the value of C TGFb was set to 1.0 and the nominal value (0.01) of k 3 was re-set. The numerical values of the steady state initial condition can be seen in Table 1.
The model equations were solved using the nominal parameters (see Table 2) and the nominal initial conditions (Table 1) for the time interval t ∈ [0, 18000] seconds. The observation functions were evaluated at 15 time points equidistantly as t = linspace(0, 18000, 15) to obtain their nominal values. Pseudoexperimental data was generated using a standard deviation of 10% of the nominal signal level, while the detection thresholds for each observable was set to approximately 1% of their maximum level. This procedure generated 240 data points (16 observables, 15 time points per observable) for the model calibration.

Parameters
A list of the model parameters can be found in Table 2, which shows their name, nominal value, upper (UB) and lower (LB) bounds used for estimation purposes, and the estimated value obtained by optimization. All the identifiable subsets of model parameters can be found in table 3. We computed the subsets using collinearity threshold 20. The collinearity indices corresponding to each set (row) can be found in the second column. Table 3: Identifiable subsets of TGF-β model parameters. Each row is a list of identifiable parameters. The collinearity indices corresponding to the sets are depicted in the second column (CI).

Equations
The model of the genetic network controlling the circadian clock in Arabidopsis Thaliana [2] is described by the following dynamic equations:

Parameters
The nominal and the estimated model parameters can be found in Table 4.

Optimization results: convergence curves
This section presents the convergence curves of the optimization algorithms used for the parameter estimation of models B2 and B4. Figures 1 and 2 show the convergence curves, i.e. the current best objective function value against the CPU time during the optimization of the models, of the calibration of the models B2 and B4, respectively. Here, three methods were compared: the global optimization method enhanced scatter search (eSS) is combined with the adaptive least squares solver (NL2Sol) or with fmincon (from MATLAB). In case of 'eSS-NL2Sol reg', we solved the regularized parameter estimation problem, which prevents over-fitting. We ran each method 5 times, independently, using different initial guesses of parameters and different random seeds for the random number generator. In the figures we depicted the best performing cases for each method.

VisID Toolbox
The VisID MATLAB Toolbox can be downloaded from the following GitHub repository: https://github.com/gabora/visid. The Toolbox provides the practical identifiability analysis of model parameters after a parameter estimation was performed.
Software Requirements.
• A distribution of MEIGO Toolbox (http://www.iim.csic.es/~gingproc/ meigo.html), which contains an implementation of Variable Neighboring Search (VNS) integer optimisation algorithm, is included in the package, but the latest version can be downloaded from the above link.
• Rank revealing QR algorithm can be downloaded from https://www. mpi-magdeburg.mpg.de/1094756/rrqr. In case the algorithm is not available for certain operation system, then the default QR decomposition of MATLAB is used. We found that the built-in QR algorithm works less efficiently for the initialization of finding the largest identifiable parameter set.
Inputs for parameter estimation. Task1-Task4 are sensitivity based calculations and only the estimated parameters and the Jacobian matrix of the residual vector ( ∂R ∂θ (θ)) is required. This is part of the output structure of the parameter estimation task of AMIGO2. Task5 further requires that the model equations are given in the AMIGO2 input format (as character arrays).
Outputs of VisID Toolbox. Depending on which of the above task is evaluated, VisID generates L A T E Xtables for collinear groups of parameters and for all the largest identifiable sets. Further it generates network and edge files, which can be imported in Cytoscape for visualization.
Known limitations. We found that Task 1 works very well even on large scale models. It was performed for the model B1 in [3] (not reported in this manuscript), which contains 1759 parameters. The QR decomposition took approximately 20 minutes on a DELL Workstation Precision. However, in this case memory can be a limiting factor for computations performed on regular PCs.
Task 2 scales exponentially with the number of parameters, since it requires the computation of collinearity indexes for all groups of parameters, thus the model size can be a limiting factor for these calculations.
Task 3 depends on the parameter K and number of parameters. In case there are N parameters, the brute force approach performs K i=2 N i singular value decompositions to estimate the collinearity indexes. Therefore collinear pairs can be always evaluated, triplets are also possible for models with approximately less than 100 parameters. For medium size models, for example for the TGF-β case study we were able to compute collinearity groups of up to 6 parameters in a few minutes.