Multi-way metamodelling facilitates insight into the complex input-output maps of nonlinear dynamic models
© Tøndel et al.; licensee BioMed Central Ltd. 2012
Received: 20 December 2011
Accepted: 20 June 2012
Published: 20 July 2012
Statistical approaches to describing the behaviour, including the complex relationships between input parameters and model outputs, of nonlinear dynamic models (referred to as metamodelling) are gaining more and more acceptance as a means for sensitivity analysis and to reduce computational demand. Understanding such input-output maps is necessary for efficient model construction and validation. Multi-way metamodelling provides the opportunity to retain the block-wise structure of the temporal data typically generated by dynamic models throughout the analysis. Furthermore, a cluster-based approach to regional metamodelling allows description of highly nonlinear input-output relationships, revealing additional patterns of covariation.
By presenting the N-way Hierarchical Cluster-based Partial Least Squares Regression (N-way HC-PLSR) method, we here combine multi-way analysis with regional cluster-based metamodelling, together making a powerful methodology for extensive exploration of the input-output maps of complex dynamic models. We illustrate the potential of the N-way HC-PLSR by applying it both to predict model outputs as functions of the input parameters, and in the inverse direction (predicting input parameters from the model outputs), to analyse the behaviour of a dynamic model of the mammalian circadian clock. Our results display a more complete cartography of how variation in input parameters is reflected in the temporal behaviour of multiple model outputs than has been previously reported.
Our results indicated that the N-way HC-PLSR metamodelling provides a gain in insight into which parameters that are related to a specific model output behaviour, as well as variations in the model sensitivity to certain input parameters across the model output space. Moreover, the N-way approach allows a more transparent and detailed exploration of the temporal dimension of complex dynamic models, compared to alternative 2-way methods.
KeywordsParameter-phenotype map Dynamic models Metamodelling N-way Partial Least Squares Regression Hierarchical analysis HC-PLSR Cluster-analysis Input-output relationships Circadian clock
Dynamic models in systems biology as well as in other fields become increasingly complex as more detailed knowledge is incorporated. The massive presence of nonlinear relationships between their high-dimensional parameter- and solution- spaces is a key characteristic of such systems. Moreover, dynamic models typically generate multidimensional blocks of temporal data. Clearly it is very challenging to obtain a comprehensive overview of the behavioural repertoires of such models across the high-dimensional input parameter space, including the sensitivity of the model output to changes in the various input parameters, as well as interactions between input parameters and correlation patterns between model outputs. For dynamic model construction and validation, sound handling of such information is crucial. Since most of the existing methods for parameter estimation and sensitivity analysis are appropriate only for systems of relatively low output dimensionality and typically focus on one output variable at a time 12, a generic methodology for analysis of model behaviour that is able to handle the entire range of model complexities and give a comprehensive overview of the relationships between the input parameters and all model outputs, is sorely needed.
Statistical approaches are gaining acceptance as a means for analysis of input-output relationships of complex dynamic models 2345678910, and statistical emulation of dynamic models (metamodelling 11) has been demonstrated to be a useful tool both for speeding up computations 12 and as a basis for sensitivity analysis 2313 and uncertainty assessment 141516. Multi-way (N-way) methods have previously been shown to be effective for data integration in e.g. systems biology 1718. We therefore hypothesise that N-way approaches will be especially advantageous for metamodelling of dynamic models due to the capability of integrating temporal data from several output state variables simultaneously while retaining the information about which state trajectory that corresponds to which state variable throughout the analysis (with 2-way methods, this information is lost when concatenating the trajectories for the different state variables prior to the analysis). Consequently, a more detailed exploration of the temporal dimension of dynamic models is possible. This is important in order to obtain a comprehensive overview of how variation in the input parameters is manifested in the model output. Moreover, methods utilising several model outputs simultaneously have already been demonstrated to reduce the model sloppiness by imposing more constraints on the system 5.
Traditionally, metamodelling is carried out in the causal direction, predicting model outputs as functions of the input parameters using e.g. regression methods. Application of metamodelling in the reverse direction is, however, also of potential interest 5. The two modelling directions can be understood as extensions of the classical/inverse calibration modelling 22. Accordingly, we refer to the causal direction as classical metamodelling, and the reverse direction as inverse metamodelling, and in our application of the N-way HC-PLSR we demonstrate how their combination provides more detailed insight into the complexity of the mapping between input parameters and model outputs. Inverse metamodelling may also facilitate fitting of nonlinear models to large amounts of experimental data. Given that the results from the computations can be substituted with relevant experimentally measured data or quantities calculated from measurements, these can be used to predict corresponding parameters. Moreover, combinations of classical and inverse metamodelling can identify the key metrics to measure in order to validate the models. A more comprehensive introduction to the multivariate metamodelling methodology is given in Additional file 1: Section S1.
As long as they handle high-dimensional data with nonlinear relationships and yield interpretable representations, a wide variety of statistical methods can be effectively used for multivariate metamodelling. We have recently shown that multivariate metamodelling based on PLSR and our nonlinear extension HC-PLSR 8 provides good approximations of the input-output mappings 8 as well as informative insight into complex interaction patterns between parameters 9 of advanced nonlinear dynamic models. PLSR can use multiple response variables simultaneously and utilise inter-correlations between them for model stabilisation. PLSR analysis has been shown to effectively reveal covariation patterns in large and complex data sets, and extract correlations between possibly noisy and partially redundant input variables and outputs 6. The success of PLSR in the context of sensitivity analysis and for constraining input parameter values from dynamic model outputs has also been demonstrated by Sobie et al. 56. Highly nonlinear input-output structures may, however, be difficult to model adequately with linear models such as PLSR, even with polynomial extensions. To confront these problems, HC-PLSR was introduced 8. Heterogeneity in model sensitivity to certain parameters between various regions in the parameter space of a dynamic model of the mouse ventricular myocyte was identified by HC-PLSR-based sensitivity analysis in 9. Similarly, zooming into different regions of the state variable behavioural domain provides the opportunity to identify regions where the relationship between certain parameters and the model output is less ambiguous, indicating that these parameters are especially important for defining a specific type of temporal model behaviour. In cases where variation in the input parameters can be directly related to genotypic variations, this may provide valuable information about how a specific genotype can be of particular importance for the manifestation of certain phenotypic characteristics.
Here, we combine three different aspects of multivariate metamodelling: 1) Description of highly nonlinear input-output relationships by regional metamodelling, 2) NPLSR, allowing a retention of a tensor data structure throughout the analysis and 3) Inverse metamodelling in addition to the classical approach, providing more confident conclusions and a more comprehensive model overview. Moreover, particularly complex details are pursued by more detailed metamodelling of individual outputs and their relationships to the varied input parameters. Altogether, this provides a powerful, robust and efficient approach to exploration of the behavioural repertoire of complex dynamic models.
We illustrate our methodology by an application to a complex dynamic model of the mammalian circadian clock developed by Leloup and Goldbeter 24, which is a well-established and validated model. Models of circadian rhythms have e.g. been used for identifying mechanisms of chronotolerance and chronoefficacy for anticancer drugs 25. The dynamic model we analyse describes circadian oscillations of cellular activity in conditions of continuous darkness, and consists of 16 coupled ordinary differential equations (ODEs) describing the dynamics of three genes through intertwined positive and negative feedback loops. By combining the classical and inverse approaches of the N-way HC-PLSR, we capture several interesting parts of the present complex input-output relationships, which are difficult to deduce directly from the model’s differential equations.
In silico data set
Description and range of the parameters varied in the mammalian circadian clock model simulations
Level step size
Maximum rate of Bmal1 mRNA degradation
Maximum rate of Cry mRNA degradation
Maximum rate of Per mRNA degradation
Maximum rate of degradation of nuclear phosphorylated Per-Cry complex
Maximum rate of degradation of nuclear Per-Cry-Clock-Bmal1 complex
Rate constant for entry of the Per-Cry complex into the nucleus
Rate constant for the formation of the Per-Cry complex
Rate constant for entry of the Bmal1 protein into the nucleus
Rate constant for the formation of the inactive Per-Cry-Clock-Bmal1 complex
A separate test set based on 8192 parameter combinations found by random Monte Carlo sampling 2728 within the same parameter levels as used in the calibration set was also generated, resulting in 8125 converging simulations.
Results from the N-way HC-PLSR metamodelling of the mammalian circadian clock model
The low percentage explained Y - variance (Additional file 1: Section S2, Figure S3) showed that the global metamodelling was inadequate in both the classical and inverse direction. However, before proceeding to improved, regional metamodelling, the dominating relationships and patterns between the 9 input parameters and the 16 output state trajectories of the mammalian circadian clock model were assessed using the global approximations.
Input-output map characteristics revealed by the global classical and inverse metamodels
The dominating input-output covariation patterns
Description of the mammalian circadian clock model state variables
State variable name
Concentration of Per mRNA
Concentration of non-phosphorylated Bmal1 protein in the nucleus
Concentration of Cry mRNA
Concentration of Bmal1 mRNA
Concentration of non-phosphorylated Per protein in the cytosol
Concentration of phosphorylated Per protein in the cytosol
Concentration of non-phosphorylated Per-Cry protein complex in the cytosol
Concentration of non-phosphorylated Cry protein in the cytosol
Concentration of phosphorylated Cry protein in the cytosol
Concentration of phosphorylated Per-Cry protein complex in the cytosol
Concentration of non-phosphorylated Per-Cry protein complex in the nucleus
Concentration of phosphorylated Per-Cry protein complex in the nucleus
Concentration of inactive complex between Per-Cry and Clock-Bmal1 in the nucleus
Concentration of non-phosphorylated Bmal1 protein in the cytosol
Concentration of phosphorylated Bmal1 protein in the cytosol
Concentration of phosphorylated Bmal1 protein in the nucleus
Within the parameter space analysed here, the parameters v mB (maximum rate of Bmal1 mRNA degradation), v mC (maximum rate of Cry mRNA degradation) and v mP (maximum rate of Per mRNA degradation) had the highest correlation to the circadian clock state variables along the first three global NPLSR factors, both in the inverse (Figure 3A) and classical (Figure 3B) metamodelling. The same over-all covariation patterns could be seen both in the inverse and the classical global metamodelling, since the orthogonal design in the parameters used in the global metamodelling has no dominant covariance directions, and both the inverse and classical NPLSR metamodels will consequently be dominated by the covariance structures of the state variables (but restricted to the parameter design). The input parameter v mB were e.g. negatively correlated with the state variables B C , B N , B NP and B CP in the NPLSR factor space, which was not surprising since these state variables represent the dynamics of the phosphorylated and non-phosphorylated protein Bmal1 concentrations in the cytosol and nucleus 24. Similarly, v mP was negatively correlated with the state variables M P (dynamics of Per mRNA) and P CP (dynamics of phosphorylated Per protein concentration in the cytosol), while v mC was negatively correlated with M C (dynamics of Cry mRNA) and PC CP (dynamics of phosphorylated Per-Cry complex in the cytosol). These patterns were all in concordance with our intuition of the mammalian circadian clock model.
Prediction results from the global inverse metamodelling
Prediction results from the global classical metamodelling
The results from the test set prediction of the state variable trajectories from the input parameters in the classical metamodelling shown in Figure 4B, indicated that the temporal behaviour of the following state variables could be predicted with high accuracy from the input parameters using global NPLSR: B N , M C , M B , B C , B CP and B NP , while the prediction error was especially high for the state variables P C , PC C , C C , PC N , PC NP and I N .
Separately analysed output space regions in the hierarchical cluster-based metamodelling
To facilitate comparison, it was decided to use the same grouping (clustering) of the observations in both classical and inverse metamodelling. The state variable NPLSR X-factors from the inverse metamodelling were more directly related to the state variable behaviour than the Y-factors representing the state variables in the classical metamodelling due to the asymmetric nature of the NPLSR models (defined primarily based on the X-scores, not the Y-scores). The inverse metamodelling was therefore carried out first, i.e. the clustering of the 8135 calibration set observations was carried out on the inverse metamodelling X-factors obtained from the output state trajectories (TOutput,A,Inverse, see Figure 2), thereby ensuring that the clusters represented different model behaviours. The same clusters were then also used in the classical N-way HC-PLSR metamodelling. This was chosen due to that clustering on the X-factors or the predicted Y-factors () in the classical metamodelling (as would be a more traditional procedure) would both make the clustering more related to the designed parameter combinations instead of the state variable behaviour, since the predicted Y-factors were here predicted as linear combinations of the X-factors that are related to the parameter combinations (see Additional file 1, equation S12c). Predicted Y-factors would have to be used in the clustering instead of the Y-factors directly calculated from the state variable data in the classical metamodelling, since otherwise the classification of new observations (for which state variable data are not available) would not be possible on variables equivalent to those used to cluster the calibration set observations.
Parameter ranges and mean values for the observations in the six N-way HC-PLSR clusters
v mB (nMh-1)
v mC (nMh-1)
v mP (nMh-1)
v dPCN (nMh-1)
v dIN (nMh-1)
k 1 (h-1)
k 3 (nM-1 h-1)
k 5 (h-1)
k 7 (nM-1 h-1)
Additional input-output map characteristics revealed by the regional classical and inverse metamodelling
Prediction results from the hierarchical inverse metamodelling
The three parameters v dPCN (maximum rate of degradation of nuclear phosphorylated Per-Cry complex), v dIN (maximum rate of degradation of nuclear Per-Cry-Clock-Bmal1 complex) and k 7 (rate constant for the formation of the inactive Per-Cry-Clock-Bmal1 complex) were predicted with low accuracy also in the inverse hierarchical metamodelling. This indicated that either the circadian clock model was relatively insensitive to variations in these input parameters, or our metamodelling was not able to describe the complex relationships between these parameters and the model outputs. This has been assessed in more detail below.
Prediction results from the hierarchical classical metamodelling
Several of the circadian clock state variable trajectories could be predicted with considerably higher accuracy in classical metamodelling using N-way HC-PLSR compared to the global NPLSR (Figure 8B). Only the state variables PC C (concentration of the Per-Cry protein complex in the cytosol), PC N (concentration of the Per-Cry protein complex in the nucleus) and I N (concentration of the inactive complex between Per-Cry and Clock-Bmal1 in the nucleus) were predicted with low accuracy (R2-values below 0.4), indicating that the metamodelling with N-way HC-PLSR was able to capture the main features of the input-output mappings for most of the 16 circadian clock state variables.
Figure 7 indicated that the data for the three above mentioned state variables contained extreme outliers (especially in Cluster 1 and 2), which may have caused the low prediction accuracy for these state variables. Both for the calibration set and the test set, being an outlier seemed to be closely associated to having the lowest level of the parameter v mP , but this was not itself sufficient to generate extreme values of these three state variables.
Detailed interpretation of the revealed model sensitivity patterns
As shown in Figure 10, the temporal behaviours of the state variables PC C and PC N were predicted with higher accuracy by all regional metamodels except regional model 1 and 2 (corresponding to clusters containing outliers for these state variables), compared to the global NPLSR. However, the state variable I N was predicted with very low accuracy in all the regional metamodels, and the parameters v dPCN , v dIN and k 7 could not be well predicted in any of the regional inverse metamodels (Figure 9). Two of the parameters that were predicted with low accuracy, v dIN and k 7 , appeared in the differential equation corresponding to the state variable I N . Hence, the low prediction accuracy was probably due to an insufficiently described mapping between I N and these parameters in the N-way HC-PLSR.
In order to reveal the sensitivity patterns for the state variable I N , a separate sensitivity analysis of the relationship between I N and the circadian clock input parameters was carried out using 2-way second order polynomial HC-PLSR analogous to the analysis presented in 8, but with the parameter ranges given in Table 1. This showed that in order to be well-predicted, the state variable I N had to be logarithmised prior to the analysis. This might explain why this state variable trajectory could not be well described together with the other state variables in the N-way HC-PLSR. The results are given in Additional file 1: Section S3.1, and indicated that the parameters having the largest effects on the I N state trajectory were v mB , v mC , v mP , v dIN , k 1 and k 7 . Several interactions between these input parameters were also identified.
The parameter k 7 (rate constant for the formation of the inactive Per-Cry-Clock-Bmal1 complex) was also involved in the differential equations representing the dynamics of the concentration of non-phosphorylated Bmal1 protein in the nucleus (state variable B N ) 24 and the concentration of the non-phosphorylated Per-Cry protein complex in the nucleus (state variable PC N ), in addition to I N . PC N was not well described by the classical metamodelling, but B N was predicted with high accuracy from the parameters (Figure 8B and Figure 10). Hence, the low prediction accuracy for k 7 indicated that the state variable B N was relatively insensitive to the parameter k 7 according to our analysis (within the analysed parameter range), even though the differential equation for this state variable involved k 7 . This was confirmed by the plot of the model sensitivities estimated from the regional metamodels shown in Figure 11 (only in regional model 3 a sensitivity was identified, but this was also low), as well as the separate sensitivity analysis for the state variable B N described in Additional file 1: Section S3.2 (carried out in the same way as for I N ). This illustrates how a combination of a classical and inverse metamodelling can provide more confident conclusions about model behaviour and sensitivity patterns.
The third parameter that could not be constrained from the state variable data was v dPCN (maximum rate of degradation of nuclear phosphorylated Per-Cry complex), which was only involved in the differential equation describing the dynamics of the concentration of the phosphorylated Per-Cry complex in the nucleus (state variable PC NP ). This state variable was predicted with an R2-value of approximately 0.7 in the classical metamodelling, which is not particularly low. Thus our results indicated a low sensitivity of PC NP to the rate of degradation of the corresponding protein. This result was confirmed by the results shown in Figure 11 and the separate sensitivity analysis for the state variable PC NP described in Additional file 1: Section S3.3. This was not straight-forward to explain, and calls for a more comprehensive analysis of the relationship between the state variable PC NP and the input parameter v dPCN . Possible explanations might be that our analysis did not cover the relevant range for this parameter, causing the model sensitivity to this parameter not to be detected, or that its input-output relationship is very complex.
As seen from Figure 11, the parameter v dPCN seemed to have negative impact on the state variable P C (concentration of non-phosphorylated Per protein in the cytosol) in regional NPLSR model 2, even though neither this parameter nor the state variable PC NP (concentration of the phosphorylated Per-Cry complex in the nucleus, for which this parameter was involved in the differential equation), appeared in the differential equation representing P C . Both P C and PC NP were related to the Per protein, though. In this region of the state variable space, the effect of v dPCN on P C was either more pronounced, or the relationship between these variables was less complex and therefore more visible in the analysis. As seen from Figure 7C, Cluster 2 was characterised by high values of several of the circadian clock state variables (especially M P , P C , P CP , PC C , PC N , PC NP and I N ).
The input parameter k 3 (rate constant for the formation of the Per-Cry complex) had a large negative effect on C CP that was visible only in Cluster 5 (Figure 11). This could be explained by the fact that k 3 was involved in the equation for C C , which was part of the differential equation for C CP . Furthermore, v dIN (maximum rate of degradation of nuclear Per-Cry-Clock-Bmal1 complex) seemed to have a slight negative effect on PC NP in Cluster 1. This result was not easily deducible from the equation structure of the circadian clock model, and could not be detected in the global metamodelling. Cluster 1 was characterised by e.g. an especially large spread in the values of the state variable C C .
The parameter k 7 (rate constant for the formation of the inactive Per-Cry-Clock-Bmal1 complex) seemed to have a positive effect on the state variable C C (non-phosphorylated Cry protein in the cytosol) in regional metamodel 1. However, since the inactive Per-Cry-Clock-Bmal1 complex represses the Per and Cry genes in the nucleus, a positive effect of k 7 on C C seemed unlikely. An additional sensitivity analysis was therefore carried out by adding eight simulations to the data set, keeping all parameters except k 7 constant at the mean values found for Cluster 1. The results are shown in Additional file 1: Section S3.4, and indicated that increasing k 7 resulted in a very small decrease in C C and C CP , had a clear positive effect on I N (as expected from the differential equation for I N ), and a negative effect on PC N and PC NP . In order to try to explain the positive effect of k 7 on C C seen in Cluster 1, a separate 2-way PLSR-based sensitivity analysis was therefore also carried out for the state variable C C in Cluster 1 (see Additional file 1: Section S3.4). The effects of v mB , v mC , v mP and k 3 on C C indicated for Cluster 1 were also manifested in the 2-way PLSR analysis, but a positive main effect of k 7 was not confirmed. However, several interaction terms involving k 7 seemed to have effects on C C , such as the interaction between v mP and k 7 (which had a positive effect). Since cross-terms between the input parameters were not included in the N-way PLSR analysis, confounding of these interaction effects with the main effect of k 7 may explain the positive sensitivity to k 7 indicated by the N-way PLSR. The indication of a positive effect of k 7 on C C could also have been caused by other sources of uncertainties in the NPLSR analysis.
Analogous to the increased prediction accuracy obtained for the two parameters k 1 and k 3 , model sensitivity to these parameters could be revealed in several local regions of the state variable space (Figure 11), even though this was not evident from the global metamodelling (Figure 5). According to Figure 11, the model seemed to be insensitive to these parameters in the region represented by Cluster 6. However, the inverse metamodelling indicated that also in Cluster 6 these parameters could be predicted with a much higher accuracy than by the global NPLSR metamodel, indicating model sensitivity to these parameters also in this region of the state variable space. This illustrates the importance of carrying out both classical and inverse metamodelling to gain a more detailed insight into the sensitivity patterns of a complex model.
The main traditional approach to analysis of input-output relationships has been to use aggregated outputs derived from the state trajectories, representing the dynamics of the state variables. For instance, in their original publication of the mammalian circadian clock model 24, the authors employed a sensitivity analysis of only one aggregated output – the circadian clock period– a very important trait, but too aggregated to give sufficient overview of the entire model behaviour. Multivariate metamodelling has, at least in principle, the capacity to reveal the relationships between all input parameters and all model outputs simultaneously. This has here been illustrated for the nine input parameters assumed to be most interesting for the mammalian circadian clock and the 16 state variables of the model, where the generated N-way metamodels allowed flexible quantitative input-output regressions yielding informative graphical insight into the main underlying input-output map characteristics. In our example N=3, but the analysis can be extended to more than three modes.
Our analysis confirmed the main conclusions from the original classical sensitivity analysis of the circadian clock period carried out by Leloup and Goldbeter 24, namely that the mammalian circadian clock model was highly sensitive to parameters related to synthesis and degradation of the protein Bmal1 and its mRNA. However, our analysis improved the overview of the input-output relationships on which the circadian clock period is based. The main patterns found in our previous analysis of the same model, using conventional (2-way) PLSR 7, were also confirmed in the global NPLSR metamodelling. However, the present cluster-based N-way analysis revealed additional aspects of the input-output relationships, for example the negative effect of increasing v dPCN on the state variable P C in the part of the model output space defined by Cluster 2. Hence, the N-way HC-PLSR-based metamodelling worked as intended in this illustration example. In the example used here, the focus was on oscillating state variables. Other types of behaviour of nonlinear dynamic systems such as multiple steady states could potentially lead to additional nonlinearities in the input-output mapping, probably increasing the gain of using a cluster-based approach compared to a global analysis.
An alternative to using NPLSR would be to unfold the state variable trajectory array by concatenating all the trajectory data into one 2-way matrix and use 2-way HC-PLSR to analyse the data. However, the information about related trajectories for different state variables would then be left unused, leading e.g. to loss of the opportunity to visualise covariance structures. In order to evaluate the gain of keeping the 3-way structure in the data, the same analysis was carried out using 2-way HC-PLSR on unfolded state trajectory data as well as on aggregated outputs calculated from the state trajectories. The clustering results from these analyses (shown in Additional file 1: Section S4) indicated that the increased resolution achieved using N-way HC-PLSR could not be achieved when the state trajectory array was unfolded into two dimensions or transformed into aggregated outputs, due to a less reasonable clustering of the observations. Hence, using the NPLSR factors seems to enable identification of more relevant clusters in which to carry out regional metamodelling. The global parameter prediction accuracies obtained were comparable to those obtained with the global inverse N-way PLSR. However, in the hierarchical metamodelling, neither the unfolded state trajectories nor the aggregated outputs could predict the circadian clock parameters with as high accuracy (on average) with 2-way HC-PLSR as with the N-way HC-PLSR.
In contrast to the results obtained using N-way HC-PLSR, our previously published metamodelling of each of the circadian clock state variables separately 8 showed that all circadian clock state variables could be predicted with high accuracy from the parameters (within the parameter space analysed in that publication, which was slightly different in the present analysis). However, there is a clear gain of using a common metamodel for all state variables in terms of obtaining overview of the input-output relationships as well as covariance patterns between the state variables. Nevertheless, as demonstrated here, a separate analysis of the input-output relationships for insufficiently described state variables should accompany this type of analysis in order to gain a more complete insight into the input-output relationships. This was illustrated in our application example for e.g. the state variable I N , which had to be logarithmised and analysed separately in order for its relationships to the input parameters to be adequately described.
In NPLSR, relations between model outputs and input parameters are easily interpretable through plots of the loadings, in contrast to results produced e.g. by genetic algorithms which are often more difficult to interpret (although the latter can also handle multiple outputs). Moreover, due to the decomposition of the data into estimated latent variables, NPLSR can provide efficient dimension reduction possibilities in high-dimensional systems. However, since the NPLSR models presented here used a high number of factors to explain the input-output covariance, the dimension reduction possibilities of NPLSR may not have been fully utilised. This was caused by the differences in the time-to-peak for the different state variables, which the NPLSR uses many factors to describe. Hence, a more careful pre-processing of the data would probably result in NPLSR models using fewer factors, perhaps through shift correction as described by Westad and Martens 29. Work is in progress on testing whether this allows the NPLSR models to use fewer factors while still keeping the same predictive ability. However, even when using relatively many factors, the NPLSR models still enable great dimension reduction possibilities.
Parameter value variances in the calibration and test set and the six N-way HC-PLSR clusters
Since the selection of data subsets in N-way HC-PLSR is based on fuzzy clustering, no prior knowledge about the structure of the data is needed. Hence, this method automatically detects regions of different model behaviour. The number of clusters to use in the hierarchical metamodelling was here specified in advance, based on exploration of the predictive ability of metamodels of varying complexity. However, using instead an optimisation algorithm to find the optimal number of clusters would make semi-automatic exploration of input-output relationships of computational models possible.
The N-way HC-PLSR method presented here provides the opportunity to improve both prediction accuracy and analytical insight by identification of regional subsets of the data within which the relationships between input parameters and model outputs are more transparent than in a global regression analysis. This was exemplified by the model sensitivity to the two parameters k 1 and k 3 that was detected in the regional analysis but not in the global metamodelling.
Our results also indicate that analysing all state trajectories simultaneously using N-way methodology is more effective for identification of different behavioural domains for a system and regions where input-output mappings can be predicted with higher accuracy, than unfolding the state trajectory array into two dimensions or transforming state trajectories into aggregated outputs prior to the analysis. This is due to a more reasonable clustering of the observations. Moreover, application of the method for metamodelling in both the classical and the inverse direction represents a more comprehensive approach to the analysis of complex relationships between the model inputs and the temporal behaviour of the outputs, and allows more confident conclusions. Our results showed that the mammalian circadian clock model was highly sensitive to parameters related to the protein Bmal1, as previously found by Leloup and Golbeter 24, but in addition our approach revealed also more complex sensitivity patterns of the model.
Based on these results, we believe that the presented N-way HC-PLSR method will be instrumental for effective construction and validation of complex models. Due to its efficient handling of N-way data structures, demonstrated here in the analysis of the temporal model behaviour, we hypothesise that N-way HC-PLSR will be an especially useful tool for multivariate metamodelling of spatiotemporal models, a large future application area.
Generation of the in silico data set
A model of the mammalian circadian clock developed by Leloup and Goldbeter 24 was used to estimate the circadian oscillations of cellular activity in conditions of continuous darkness. The model consists of 16 coupled differential equations with state variables describing the dynamics of three key genes (Bmal1, Per and Cry), including their mRNA level, nonphosphorylated and phosporylated proteins as well as protein complexes. The model contains intertwined positive and negative feedback loops driving the circadian oscillations. A curated CellML implementation 303132 of the model was downloaded from http://models.cellml.org. The integration was carried out in SUNDIALS 2.3 33 using a wrapper for PySundials (http://pysundials.sourceforge.net) in the same way as in 8.
The parameter combinations in the calibration set were generated using an Optimised Multi-level Binary Replacement (OMBR) Design 26 of 9 variables with 8 equally spaced levels each (Table 1). This resulted in 8192 simulations with the circadian clock model. The ranges of each parameter are given in Table 1, and were found by an initial range-finding experiment published in 7. A full factorial design of 9 variables Â´a 8 levels would require 89>134 million runs. Hence, the OMBR design was chosen, in order to explore the effects of as many parameters and parameter values as possible. In the OMBR design method, the values of a original parameters are replaced by multi-bit binary representations, and the binary factor bits are then combined in a fractional factorial design according to a chosen confounding pattern. Thereby drastically reduced experimental designs are obtained, yet maintaining reasonable coverage of the parameter space. The OMBR design has been compared to central composite designs and semi-random designs, and has been shown to give good predictive ability 7.
For each parameter combination the resulting differential equation model was solved from the original initial conditions (see 24) until convergence to a stable limit cycle. The test for convergence was done as follows: First the system was solved with rootfinding for variable M B to extract two complete cycles. Convergence of the cycle period was tested by requiring that the period difference relative to the mean of the periods for the two cycles should be less than 0.001. Convergence to synchronous oscillations was tested by (i) interpolating all 16 state variables at 200 equally spaced time points for each cycle, (ii) linearly transforming each state variable such that the minimum and maximum values of each cycle was 0 and 1, respectively, and (iii) requiring that the sum of absolute difference between the two cycles across all the 3200 interpolated time points should be less than 0.0001.
The data set resulting from the simulations of the mammalian circadian clock consisted of sampled values for one period (here 200 timesteps) of 16 state variables (corresponding to the 16 differential equations in the model), for the set of 8192 combinations of values for the nine varied input parameters. This gave a 3-way array of 8192x16x200 data points. A description of the mammalian circadian clock model state variables is given in Table 2. Due to the wide parameter ranges used, 57 (0.7%) of the 8192 simulations did not result in a stable limit cycle. These simulations were omitted in the following analysis. The parameters and state variables were mean-centred and standardised globally in the calibration set prior to the metamodelling.
A separate test set based on 8192 parameter combinations found by random Monte Carlo sampling 2728 within the same parameter levels as used in the calibration set was used. This resulted in 8125 converging test set simulations. In the test set, the variables were pre-processed in the same way as for the calibration set, using the global calibration set means and standard deviations.
Our previously published method for nonlinear metamodelling, HC-PLSR 8, has here been extended to enable use of N-way data by using NPLSR 1934, giving N-way HC-PLSR. HC-PLSR 8 includes regional analysis using subsets of the original data set generated by fuzzy C-means (FCM) clustering 35363738 (see Additional file 1: Section S1).
In N-way HC-PLSR, a global NPLSR model comprising all observations is first generated, and FCM clustering (using Euclidian distance) on a chosen number of first mode (see Figure 1) factors (scores) of the global NPLSR X-factors (or alternatively, Y-factors) is used to separate the observations into groups within which regional NPLSR models are made. The FCM fuzzifier parameter was here chosen equal to 2. To prevent possibly unstable regression models due to a small number of calibration observations, we post-processed the clustering with the requirement that each cluster should contain at least ten observations. Smaller clusters (and their associated observations) were regarded as outliers, and not included in the subsequent regional regression analysis (but were still included in the global regression analysis).
The optimal number of factors to use in the global and regional NPLSR models, respectively, was here chosen according to the minimum cross-validated mean squared error (MSE) of prediction of the response array Y, with the extra requirement that each included component accounts for at least 1% of the total cross-validated Y-variance, in the same way as in 8. Here, 10-fold cross-validation where the ten segments were randomly chosen was used both in the global and the regional NPLS regressions. These ten segments were successively kept out of the NPLSR calibration, and predicted using an NPLSR model based on the remaining observations. NPLSR is described in Additional file 1: Section S1, and the MATLAB® N-way Toolbox v.3.11 34 was used here.
In our N-way HC-PLSR implementation, Linear Discriminant Analysis (LDA) 39, Quadratic Discriminant Analysis (QDA) 40 (the MATLAB® function "classify" from the Statistics Toolbox™ v7.6) or Naive Bayes classification (the MATLAB® function "NaiveBayes" from the Statistics Toolbox™ v7.6) can be used for classification of new observations to be predicted (based on predicted NPLSR factors for new observations). The implementation contains two options for prediction: 1) Prediction using the local regression model calibrated in the most probable cluster, and 2) Prediction using a weighted sum of the local regression models, using the estimated cluster membership values as weights. The N-way HC-PLSR was carried out in MATLAB® 41 Version 7.13 (R2011b), using in-house code which can be obtained from the authors upon request.
Classical and inverse metamodelling of the mammalian circadian clock model
The 3-way array of state variable data (observations x outputs x time points) was first used as regressor in a test set validated N-way HC-PLSR using the parameter combinations as response-variables (inverse metamodelling). To complement this analysis, analogous classical metamodelling with N-way HC-PLSR was carried out, where the parameter combinations were used as regressor variables to predict the 3-way state trajectory array. The calibration- and test sets calculated with the mammalian circadian clock model described above were used. The methodology is illustrated in Figure 2. The most probable local regression model was chosen for prediction, since this gave higher prediction accuracy than using a weighted sum of the local models. No clusters containing less than ten observations were identified, so all calibration set observations were included in the regional regression analysis.
In the inverse metamodelling, the clustering (and classification in the prediction stage) of the observations was based on the global predicted first mode X - factors, the X - scores ( in Additional file 1: Section S1), representing the state variable trajectory data. All factors found to explain a significant amount of the cross-validated calibration set Y-variation were included. The number of clusters to use was chosen by evaluating the ability to constrain (i.e. correctly predict) the input parameters from the model output in the inverse metamodelling, using from 1 to 20 clusters within the calibration set. The calibration set observations were treated as "new observations" (see Figure 2) here, that is, the same procedure as for the test set observations was used in the prediction stage. Using cross-validation to find the optimal number of clusters, as was done to find the optimal number of factors for the NPLSR models, would be too time consuming given the size of the datasets used here. The mean correlation coefficient (R2)-values for the prediction of the parameters were used as selection criterion, and using 6 clusters was considered optimal in order to balance metamodel complexity against predictive ability.
The same clusters were also used for the classical metamodelling, since the X-factors in the inverse metamodelling are more directly related to the model output state variables than the Y-factors from the classical metamodelling (a PLSR model is asymmetric). However, for the classification in the prediction stage of the classical metamodelling to be relevant, was predicted from the first mode predicted Y-factors, the Y-scores (calculated using equation S12c in Additional file 1), since in this case the Y-factors represent the state variable trajectories. This was done using second order polynomial Ordinary Least Squares (OLS) regression (including square terms and cross-terms), calibrated using calibration set from the classical metamodelling (* C A (see Additional file 1, eq. S12c)) as regressors and calibration set T XA,NWay from the inverse metamodelling () as response variables. Hence, , where is calibrated using polynomial OLS regression, and includes the calculation of from . The test set observations were then classified based on , according to the clusters found in the inverse metamodelling. See Additional file 1, sections S1.6 and S1.7 for a more comprehensive description of this methodology, including all predicting equations for test set observations.
QDA was chosen instead of LDA and Naive Bayes classification in this study, since LDA assumes the covariance matrix to be equal for all classes and Naive Bayes classification assumes that the presence of a particular feature of a class is unrelated to the presence of any other feature. In QDA, these assumptions are not made.
In the classical metamodelling, the sensitivity of each state variable to variations in the different parameters was estimated as the product of the second mode X-factors and the transpose of the second mode Y-factors (also called X- and Y- loadings), using the optimal number of NPLSR factors for the corresponding NPLSR model. The NPLSR loadings calculated here were not orthogonal. Work is currently in progress to assess the effects of using non-orthogonal loadings to estimate the model sensitivities.
Additional sensitivity analyses
Some of the input-output relationships were not well described by the N-way HC-PLSR. Additional separate sensitivity analyses were therefore carried out for some of the state variables using 2-way second order polynomial HC-PLSR with the parameters and their cross-terms and second order terms as regressors and the state trajectories as response variables, analogous to the analysis presented in 8. The regressors were mean-centred and standardised prior to the HC-PLSR, while the state trajectories were only centred. Some of the state trajectories were logarithmised prior to the regression analysis.
The same clusters as in the N-way HC-PLSR described above were used. QDA 40 on predicted PLSR Y- scores (see Additional file 1, eq. S7d) were used for classification, and all PLS components (PCs) explaining a significant amount of the cross-validated Y-variance were included. Both in the global and regional regression analyses the optimal number of PLS components to use was found by 10-fold cross-validation. In the regional regression models, the matrices of cross-terms and second order terms were deflated with respect to the variation described by the first order terms (in an OLS regression) in order to better separate the effects of nonlinear terms and first order terms. The HC-PLSR was carried out in MATLAB® 41 Version 7.13 (R2011b), using in-house code that can be obtained from the authors upon request.
For comparison, the inverse metamodelling was carried out using 2-way HC-PLSR where the 3-way state variable array was unfolded by concatenating the time series for all state variables, as well as by using aggregated outputs representing the state variable trajectories. The following aggregated outputs were derived from the state trajectories: period of oscillation, bottom, peak, time-to-bottom and time-to-peak for each state variable curve (see Additional file 1: Section S4). This resulted in 65 aggregated outputs. Both parameters, state variables and aggregated outputs were mean-centred and standardised prior to the HC-PLSR.
The number of PLS components to use in the PLSR models was chosen based on the percent explained cross-validated Y-variance as described in 8 and in the same way as for the N-way HC-PLSR, clustering and classification were done on the global X- scores of all PCs explaining a significant amount of the cross-validation variance. The same settings as described in 8 were used for the HC-PLSR, except that QDA 40 (the MATLAB® function "classify" from the Statistics Toolbox™ v7.6) was used for classification as in the N-way HC-PLSR. The same number of clusters as in the N-way HC-PLSR (6 clusters) was used also in the 2-way HC-PLSR analyses.
This study was supported by the National Program for Research in Functional Genomics in Norway (FUGE) (RCN grant no. NFR151924/S10) and by the Norwegian eScience program (eVITA) (RCN grant no. NFR178901/V30). Rasmus Bro is thanked for providing us with the newest version of The N-way Toolbox for MATLAB.
- Lillacci G, Khammash M: Parameter Estimation and Model Selection in Computational Biology. PLoS Comput Biol 2010, 6: e1000696. 10.1371/journal.pcbi.1000696View Article
- Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S: Global Sensitivity Analysis: The Primer. Chichester: Wiley-Interscience; 2008.
- Saltelli A, Chan K, Scott EM: Sensitivity Analysis. 1st edition. Chichester: Wiley; 2000.
- Santner TJ, Williams BJ, Notz W: The design and analysis of computer experiments. New York, USA: Springer; 2003.View Article
- Sarkar AX, Sobie EA: Regression Analysis for Constraining Free Parameters in Electrophysiological Models of Cardiac Cells. PLoS Comput Biol 2010, 6: e1000914. 10.1371/journal.pcbi.1000914View Article
- Sobie EA: Parameter sensitivity analysis in electrophysiological models using multivariable regression. Biophys J 2009, 96: 1264-1274. 10.1016/j.bpj.2008.10.056View Article
- Tøndel K, Gjuvsland AB, Måge I, Martens H: Screening design for computer experiments: Metamodelling of a deterministic mathematical model of the mammalian circadian clock. J Chemometr 2010, 24: 738-747. 10.1002/cem.1363View Article
- Tøndel K, Indahl UG, Gjuvsland AB, Vik JO, Hunter P, Omholt SW, Martens H: Hierarchical Cluster-based Partial Least Squares Regression is an efficient tool for metamodelling of nonlinear dynamic models. BMC Syst Biol 2011, 5: 90. 10.1186/1752-0509-5-90View Article
- Tøndel K, Vik JO, Martens H, Indahl UG, Smith N, Omholt SW: Hierarchical Multivariate Regression-based Sensitivity Analysis Reveals Complex Parameter Interaction Patterns in Dynamic Models. Submitted
- Vik JO, Gjuvsland AB, Li L, Tøndel K, Niederer SA, Smith N, Hunter PJ, Omholt SW: Genotype-phenotype map characteristics of an in silico heart cell. Front Physio 2011, 2: 106.
- Kleijnen JPC: Design and Analysis of Simulation Experiments. 1st edition. New York, USA: Springer; 2007.
- Conti S, O’Hagan A: Bayesian emulation of complex multi-output and dynamic computer models. J Stat Plan Infer 2010, 140: 640-651. 10.1016/j.jspi.2009.08.006View Article
- Campbell K, McKay MD, Williams BJ: Sensitivity analysis when model outputs are functions. Reliab Eng Syst Safe 2006, 91: 1468-1472. 10.1016/j.ress.2005.11.049View Article
- Cacuci DG: Sensitivity and Uncertainty Analysis: Theory v.1: Theory Vol 1. 1st edition. Boca Raton: Chapman and Hall/CRC; 2003.View Article
- Cacuci DG, Ionescu-Bujor M, Navon IM: Sensitivity and Uncertainty Analysis: Applications to Large-scale Systems Vol 2. 1st edition. Boca Raton: CRC Press; 2005.View Article
- Ayyub BM, Klir GJ: Uncertainty modeling and analysis in engineering and the sciences. Boca Raton: Chapman & Hall/CRC; 2006.View Article
- Yener B, Acar E, Aguis P, Bennett K, Vandenberg S, Plopper G: Multiway modeling and analysis in stem cell systems biology. BMC Syst Biol 2008, 2: 63. 10.1186/1752-0509-2-63View Article
- Conesa A, Prats-Montalbán JM, Tarazona S, Nueda MJ, Ferrer A: A multiway approach to data integration in systems biology based on Tucker3 and N-PLS. Chemometr Intell Lab In Press.
- Bro R: Multiway calibration. Multilinear PLS. J Chemometr 1996, 10: 47-61. 10.1002/(SICI)1099-128X(199601)10:1<47::AID-CEM400>3.0.CO;2-CView Article
- Krishnaiah PR: Multivariate Analysis. New York: Academic Press Inc; 1967.
- Wold S, Martens H, Wold H: The multivariate calibration method in chemistry solved by the PLS method. In Lecture notes in Mathematics, Matrix Pencils. Heidelberg: Springer; 1983:286-293.
- Martens H, Næs T: Multivariate calibration. Chichester, UK: John Wiley and Sons; 1989.
- Martens H, Martens M: Multivariate Analysis of Quality: An Introduction. 1st edition. Chichester: Wiley; 2001.
- Leloup J-C, Goldbeter A: Modeling the mammalian circadian clock: Sensitivity analysis and multiplicity of oscillatory mechanisms. J Theor Biol 2004, 230: 541-562. 10.1016/j.jtbi.2004.04.040View Article
- Altinok A, Lévi F, Goldbeter A: Identifying mechanisms of chronotolerance and chronoefficacy for the anticancer drugs 5-fluorouracil and oxaliplatin by computational modeling. Eur J Pharm Sci 2009, 36: 20-38. 10.1016/j.ejps.2008.10.024View Article
- Martens H, Måge I, Tøndel K, Isaeva J, Høy M, Sæbø S: Multi-level Binary Replacement (MBR) design for computer experiments in high-dimensional nonlinear systems. J Chemometr 2010, 24: 748-756. 10.1002/cem.1366View Article
- Kalos MH, Whitlock PA: Monte Carlo Methods Volume 1: Basics. 1st edition. New York, USA: Wiley; 1986.View Article
- Liu JS: Monte Carlo Strategies in Scientific Computing. New York, USA: Springer; 2008.
- Westad F, Martens H: Shift and intensity modeling in spectroscopy–general concept and applications. Chemometr Intell Lab 1999, 45: 361-370. 10.1016/S0169-7439(98)00144-0View Article
- Lloyd CM, Halstead MDB, Nielsen PF: CellML: its future, present and past. Prog Biophys Mol Bio 2004, 85: 433-450. 10.1016/j.pbiomolbio.2004.01.004View Article
- Cuellar AA, Lloyd CM, Nielsen PF, Bullivant DP, Nickerson DP, Hunter PJ: An Overview of CellML 1.1, a Biological Model Description Language. SIMULATION 2003, 79: 740-747. 10.1177/0037549703040939View Article
- Lloyd CM, Lawson JR, Hunter PJ, Nielsen PF: The CellML Model Repository. Bioinformatics 2008, 24: 2122-2123. 10.1093/bioinformatics/btn390View Article
- Hindmarsh AC, Brown PN, Grant KE, Lee SL, Serban R, Shumaker DE, Woodward CS: SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Trans Math Softw 2005, 31: 363-396. 10.1145/1089014.1089020View Article
- Andersson CA, Bro R: The N-way Toolbox for MATLAB. Chemometr Intell Lab 2000, 52: 1-4. 10.1016/S0169-7439(00)00071-XView Article
- Bezdek JC: Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell: Kluwer Academic Publishers; 1981.View Article
- Berget I, Mevik B-H, Næs T: New modifications and applications of fuzzy C-means methodology. Comput Stat Data Anal 2008, 52: 2403-2418. 10.1016/j.csda.2007.10.020View Article
- Næs T, Isaksson T: Splitting of calibration data by cluster analysis. J Chemometr 1991, 5: 49-65. 10.1002/cem.1180050106View Article
- Næs T, Kubberød E, Sivertsen H: Identifying and interpreting market segments using conjoint analysis. Food Qual Prefer 2001, 12: 133-143. 10.1016/S0950-3293(00)00039-2View Article
- McLachlan GJ: Discriminant Analysis and Statistical Pattern Recognition. 1st edition. Hoboken: Wiley-Interscience; 1992.View Article
- Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning. New York, US: Springer; 2003. Corrected edition.
- The MathWorks, Inc.: MATLAB®, v. 7.13. 2011.
- Martens H, Veflingstad S, Plahte E, Martens M, Bertrand D, Omholt S: The genotype-phenotype relationship in multicellular pattern-generating models - the neglected role of pattern descriptors. BMC Syst Biol 2009, 3: 87. 10.1186/1752-0509-3-87View Article
- Isaeva J, Sæbø S, Wyller JA, Nhek S, Martens H: Fast and comprehensive fitting of complex mathematical models to massive amounts of empirical data. Chemometr Intell Lab In press.
- Isaeva J, Martens M, Sæbø S, Wyller JA, Martens H: The modelome of line curvature: many nonlinear models approximated by a single bi-linear metamodel with verbal profiling. Physica D 2012, 241: 877-889. 10.1016/j.physd.2012.02.002View Article
- Tenenhaus M, Vinzi VE, Chatelin Y-M, Lauro C: PLS path modeling. Comput Stat Data Anal 2005, 48: 159-205. 10.1016/j.csda.2004.03.005View Article
- Pearson K: On lines and planes of closest fit to systems of points in space. Philos Mag 1901, 2: 559-572. 10.1080/14786440109462720View Article
- Jolliffe IT: Principal Component Analysis. 2nd edition. New York, US: Springer; 2002.
- Jolliffe IT: A note on the use of principal components in regression. J Roy Stat Soc C-App 1982, 31: 300-303.
- Hoerl AE, Kennard RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970, 12: 55-67. 10.1080/00401706.1970.10488634View Article
- Tibshirani R: Regression Shrinkage and Selection via the Lasso. J Roy Stat Soc B Met 1996, 58: 267-288.
- Zou H, Hastie T: Regularization and variable selection via the Elastic Net. J Roy Stat Soc B 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.xView Article
- de Jong S: SIMPLS: an alternative approach to partial least squares regression. Chemometr Intell Lab 1993, 18: 251-263. 10.1016/0169-7439(93)85002-XView Article
- Martens H: Non-linear multivariate dynamics modelled by PLSR. In Proceedings of the 6th International Conference on Partial Least Squares and Related Methods. Beijing: Publishing House of Electronics Industry; 2009:139-144.
- Berglund A, Wold S: INLR, implicit non-linear latent variable regression. J Chemometr 1997, 11: 141-156. 10.1002/(SICI)1099-128X(199703)11:2<141::AID-CEM461>3.0.CO;2-2View Article
- Gath I, Geva AB: Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1989, 11: 773-780. 10.1109/34.192473View Article
- Frigui H, Krishnapuram R: A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 1999, 21: 450-465. 10.1109/34.765656View Article
- Bro R: PARAFAC. Tutorial and applications. Chemometr Intell Lab 1997, 38: 149-171. 10.1016/S0169-7439(97)00032-4View Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.