Hierarchical Cluster-based Partial Least Squares Regression (HC-PLSR) is an efficient tool for metamodelling of nonlinear dynamic models
© Tøndel et al; licensee BioMed Central Ltd. 2011
Received: 22 December 2010
Accepted: 1 June 2011
Published: 1 June 2011
Deterministic dynamic models of complex biological systems contain a large number of parameters and state variables, related through nonlinear differential equations with various types of feedback. A metamodel of such a dynamic model is a statistical approximation model that maps variation in parameters and initial conditions (inputs) to variation in features of the trajectories of the state variables (outputs) throughout the entire biologically relevant input space. A sufficiently accurate mapping can be exploited both instrumentally and epistemically. Multivariate regression methodology is a commonly used approach for emulating dynamic models. However, when the input-output relations are highly nonlinear or non-monotone, a standard linear regression approach is prone to give suboptimal results. We therefore hypothesised that a more accurate mapping can be obtained by locally linear or locally polynomial regression. We present here a new method for local regression modelling, Hierarchical Cluster-based PLS regression (HC-PLSR), where fuzzy C-means clustering is used to separate the data set into parts according to the structure of the response surface. We compare the metamodelling performance of HC-PLSR with polynomial partial least squares regression (PLSR) and ordinary least squares (OLS) regression on various systems: six different gene regulatory network models with various types of feedback, a deterministic mathematical model of the mammalian circadian clock and a model of the mouse ventricular myocyte function.
Our results indicate that multivariate regression is well suited for emulating dynamic models in systems biology. The hierarchical approach turned out to be superior to both polynomial PLSR and OLS regression in all three test cases. The advantage, in terms of explained variance and prediction accuracy, was largest in systems with highly nonlinear functional relationships and in systems with positive feedback loops.
HC-PLSR is a promising approach for metamodelling in systems biology, especially for highly nonlinear or non-monotone parameter to phenotype maps. The algorithm can be flexibly adjusted to suit the complexity of the dynamic model behaviour, inviting automation in the metamodelling of complex systems.
Realistic deterministic dynamic models of complex biological systems are difficult to assess with respect to behaviour, since they are characterised by a large number of parameters and state variables as well as nonlinear functional relationships and intricate feedback loops. The relationship between the output from such models and their input parameter variation and initial conditions, and control of built-in functional relationships, can become extremely complex. The traditional analytical tools of nonlinear systems theory are mainly aimed at studying the effects of few parameters on low-dimensional phenomena such as steady states and limit cycles. The high-dimensional models emerging in systems biology bring new challenges such as increasing the speed of numerical solvers, developing tools for automated model simplification and global sensitivity analysis (study of how the system input variations influence the output ). Metamodels of a dynamic model, which are statistical models mapping parametric variation to variation in the state variables throughout the entire relevant parameter space , will be helpful in meeting several of these challenges.
There is currently little tradition for making use of metamodelling in advanced computational biology, but based on experiences from other fields [2, 3] we anticipate that this is up for change. Since computationally demanding multiscale modelling is an emerging field in computational biology, we foresee that metamodel-generated mappings will be very useful as a model reduction technique for speeding up simulations, for performing global high-dimensional sensitivity analyses for a whole range of purposes, for comparing the high-dimensional prediction spaces of competing models as well as for comparing models to high-dimensional phenotypic data. Input parameters and initial conditions can also be predicted from the model output or empirical data, providing opportunities for identification of biologically relevant parameters. However, for these anticipations to be fulfilled we need to develop a robust metamodelling methodology capable of producing accurate predictive mappings for a whole range of different biological models and which allows for extensive automation. Despite that considerable progress has been made in other fields, in particular engineering [1–6], there is a need for substantial development to make this approach generally applicable in biology and feasible for implementation e.g. as part of the model repositories like the CellML repository [7–9].
Most metamodelling efforts published to date make use of ordinary least squares (OLS) regression and response surface methods based on OLS that require the covariance matrix of the regressors (independent variables) to be invertible , that is, the regressors must be linearly independent. These methods are primarily focused on prediction of one single output variable at a time, and usually ignore strong inter-correlations between the output variables. An exception is Artificial Neural Networks (ANN), but these methods produce models of a format that do not allow easy interpretation. Using a Bayesian approach, Conti and O'Hagan  recently demonstrated that it is possible to emulate the output of a dynamic model to a high degree of precision using only a few hundreds of training runs. However, such Gaussian emulators are only effective for model structures having a small number of significant main effects and very mild interactions . Hence, one may claim that none of the current metamodelling approaches do make transparent how all inputs, auxiliaries and outputs are related to each other jointly under a broad range of different conditions. They are also, with few exceptions, unable to utilise inter-correlations between the output variables. Furthermore, according to Li et al. , most sensitivity analysis methods require the regressor variables to be linearly independent, something that is not likely to be the case in many biological modelling situations (e.g. due to the use of highly reduced experimental designs or sampling-based methods to set up the computational experiments). Hence, there is a need for new methodologies for mapping the entire space of biologically relevant parameters and initial conditions to the outcome of the corresponding dynamic models, that can handle linearly dependent input parameters, utilise strong inter-correlations between the outputs, as well as model highly nonlinear and non-monotone input-output relations, which characterise many biological systems. Mild nonlinearities can to some degree be modelled by polynomial regression, using e.g. square- and interaction terms as extra regressors , but a robust metamodelling methodology must be capable also of handling strong nonlinearities, in particular non-monotone input-output relationships [13–15].
A candidate approach is to make use of locally linear or locally polynomial regression modelling of carefully selected subsets of the input-output space of the original complex model. Locally linear approaches to modelling large and complex data sets have been successfully demonstrated in a variety of applications, most of which include a separation of the data into blocks based on prior knowledge about the structure of the data being analysed [16–20]. This suggests that nonlinear and non-monotone response surfaces can be handled by local high-order polynomial models. In order to also account for linearly dependent regressors and inter-correlations between the responses, alternatives to OLS are needed. Bi-linear methods based on estimated latent variables, e.g. Principal Component Analysis (PCA) [21, 22] and Partial Least Squares Regression (PLSR) [23–25] (see Additional file 1: Appendix 1), have shown considerable success in analysis and interpretation of a large number of highly multivariate and complex data sets. Martens and Martens  demonstrated the use of PLSR as an alternative to Analysis Of Variance (ANOVA) to facilitate the interpretation of multi-response data from designed experiments. PLSR maximises the explained covariance between the regressors and the responses. In contrast to most other linear regression methods, PLSR also utilises inter-correlations between the response variables for model stabilisation, and does not require the regressor variables to be linearly independent. PLSR is efficient for compressing inputs, intermediate states and output variables into their most relevant subspaces (spanned by the estimated latent variables, also called PLS components (PCs)), and hence provides a versatile means for data compression by reducing the rank of both regressors (X) and responses (Y). This, in turn, provides an effective approach to identification of important features in a complex system. PLSR is equivalent to OLS when the regressor rank is not reduced, that is, when all PLS components are included. Modelling based on such estimated latent variables (represented by so-called X- and Y- score vectors) also have the advantage of being suited for graphical visualisation, inspection and interpretation via their associated sets of loadings, i.e. the coefficients describing the relationship between the score vectors and the original variables/parameters. Theoretical aspects of PLSR and its relations to the more commonly used method Principal Component Regression (PCR) [27, 28] are given in . Campbell et al.  have shown that metamodels based on subspaces found by PLSR, when compared to Legendre polynomials and PCA, gave the simplest and most predictive basis for sensitivity analysis for a set of computational models. In , the suitability of PLSR for interpretation of complex biological systems and use of PLSR in sensitivity analysis was demonstrated. This motivated us to probe the versatility of a new variant of local modelling, here named Hierarchical Cluster-based PLS regression (HC-PLSR), which assumes no prior knowledge about the data structure. Besides presenting the HC-PLSR method, we report the metamodelling performance of HC-PLSR, global PLSR and global ordinary least squares regression on three dynamic model systems in order of increasing complexity: i) six different gene regulatory network models with various types of feedback [32, 33], ii) a dynamic model of the mammalian circadian clock  and iii) a model of the mouse ventricular myocyte function . These three test beds encompass large classes of dynamic models. We show that the HC-PLSR approach is superior in all cases and that the difference in terms of explained variance and prediction accuracy increases with the degree of nonlinearity and the presence of positive feedback loops.
In silico data sets
Gene regulatory networks
Mammalian circadian clock
Description and range of the parameters varied in the simulations with the mammalian circadian clock model
Level step size
Maximum rate of Bmal1 mRNA degradation
Maximum rate of Cry mRNA degradation
Maximum rate of Per mRNA degradation
Maximum rate of degradation of nuclear phosphorylated Per-Cry complex
Maximum rate of degradation of nuclear Per-Cry-Clock-Bmal1 complex
Rate constant for entry of the Per-Cry complex into the nucleus
Rate constant for the formation of the Per-Cry complex
Rate constant for entry of the Bmal1 protein into the nucleus
Rate constant for the formation of the inactive Per-Cry-Clock-Bmal1 complex
For each parameter combination the resulting differential equation model was solved from the original initial conditions (see ) until convergence to a stable limit cycle. The differential equations were solved in SUNDIALS 2.3  using a wrapper for PySundials http://pysundials.sourceforge.net. The resulting data set consists of values of 16 state variables (corresponding to the 16 differential equations in the model) calculated over 200 time steps each for the set of 8192 combinations of values of the nine input parameters. The test for convergence was done as follows: First the system was solved with rootfinding for variable MB to extract two complete cycles. Convergence of the cycle period was tested by requiring that the period difference relative to the mean of the periods for the two cycles should be less than 0.001. Convergence to synchronous oscillations was tested by (i) interpolating all 16 state variables at 200 equally spaced time points for each cycle, (ii) linearly transforming each state variable such that the minimum and maximum values of each cycle was 0 and 1, respectively, and (iii) requiring that the sum of absolute difference between the two cycles across all the 3200 interpolated time points should be less than 0.0001.
Mouse ventricular myocyte
Data for the murine heart cell function was generated using the model recently published by Li et al. , built to account for the action potential and calcium transient of the cardiomyocyte in terms of its constituent ion currents and gating channels. The model extends that of Bondarenko et al. , with more realistic calcium handling, detailed re-parameterisation to experimental data and consistency checking by conservation of charge. State variables include concentrations of sodium, potassium and calcium in the cytosol, calcium concentration in the sarcoplasmic reticulum and the state distribution of ion channels. Formulated as a system of 36 coupled ODEs, this model provides a comprehensive representation of membrane-bound channels and transporter functions as well as fluxes between the cytosol and intracellular organelles.
Description and range of the parameters varied in the simulations with the mouse ventricular myocyte model
Extracellular potassium concentration
Extracellular sodium concentration
Extracellular calcium concentration
SERCA, calcium uptake from cytosol to sarcoplasmic reticulum
L-type calcium current
NCX, scaling coefficient for the sodium-calcium exchanger current
Fast sodium current
Time-independent potassium current
Rapid delayed rectifier potassium current
Only the state variables were included in the regression analysis (not the auxiliary variables), and the constant state variable iKss was omitted. For this data set 15793 of the 59049 simulations (26.7%)) failed, and were therefore omitted in the analysis. In order for the time series to be comparable, interpolation to a fixed set of time points was needed. For the entire time series to be analysed, the data set had to be separated into three parts, according to the stimulus period (the values of the stimulus period are given in Table 2). The time series for each part of the data set were then interpolated using the MATLAB® system function "interp1.m" into 200 equally spaced time points between 0 and the stimulus period.
Implementation of the HC-PLSR was based on an initial global second order polynomial PLSR using all observations in the calibration set with a 10-fold cross-validation (i.e. the data was separated into ten randomly chosen segments, and ten PLSR models were made where one segment was successively taken out and predicted using a PLSR model based on the other nine segments) to identify a preliminary (global) model (mean of the ten models made in the cross-validation). The global model (i.e. the optimal number of PLS components (PCs)) was then chosen according to the minimum cross-validated mean squared error (MSE) of prediction of the response matrix Y, with the extra requirement that each included component accounts for at least 1% of the total cross-validated Y-variance. Using cross-validation to find the optimal rank is a well established technique to minimise the risk of overfitting. When modelling empirical data, the portion of the variance accounted for by the omitted PCs are assumed to be caused by noise. However, in this case, we used data from deterministic modelling which is free from noise, but since we aimed for an as simple and easily interpretable metamodel as possible (but showing adequate predictive ability), we regarded variations not accounted for by a second order polynomial as stochastic noise and used cross-validation to truncate the model. The cross-terms and second order terms were calculated from mean-centred parameter values.
Thereafter the observations were clustered by fuzzy C-means (FCM) clustering [44, 45] of the X- or Y-scores using Euclidian distance for a chosen number of PLS components (X is the regressor matrix, while Y is the response variable matrix). This allowed separation of the overall data set into subsets where local polynomial regression was hypothetically more likely to improve prediction results. In our MATLAB® implementation the FCM fuzzifier parameter was chosen equal to 2 (a standard choice). The fuzzifier parameter can be interpreted as an increased sharing of points among all clusters. Fuzzy clustering was chosen because the FCM algorithm is simple, efficient and less prone to local minima than crisp clustering algorithms. The FCM algorithm is also flexible with respect to the distance measure used, and it is easy to incorporate various types of penalties in the distance measure . It is therefore a good choice in making the HC-PLSR method widely applicable.
New observations (or test set observations) were classified based on X- or Y-scores calculated by projection into the global PLSR model. X-scores for test set observations were calculated based on loading weights from the global PLSR model (see Additional file 1: Appendix 1, Eq. A4). If Y-scores were used in the clustering, the global PLSR model was used for prediction of Y to acquire the Y-scores from the loading weights of the global PLSR model (see Additional file 1: Appendix 1, Eq. A4 and A6). Four different ways of classifying an observation into the appropriate cluster were compared: i) the nearest cluster centre from the FCM clustering (based on Euclidian distance), ii) Linear Discriminant Analysis (LDA) , iii) Quadratic Discriminant Analysis (QDA)  and iv) Naive Bayes (NB) classification . We used the MATLAB® functions "classify.m" and "NaiveBayes.m" from the Statistics Toolbox™ v7.2 for classification options ii)-iv), and in-house MATLAB® code for option i).
Predictions of response variables for the test set observations were done both by a) selecting the local model for the most probable cluster and by b) computing the regression coefficients as a probability-weighted sum of the local regression models. Outlier-check of new observations was based on the Euclidian distance matrix from the clustering of all non-outliers in the calibration set (based on the X- or Y-scores depending on what was used for clustering and classification). The maximal occurring distance from the cluster centre was found for each cluster by inspection of the calibration results, and 1.5 times these values were used as outlier limits for the respective clusters. Prediction of the outliers (corresponding to no appropriate local model) was by convention handled by the global PLSR model.
To further improve the prediction accuracy we also considered an alternative HC-PLSR approach based on hierarchical regression modelling of the residuals from the global PLSR model. Local modelling was here used only on the Y-residuals (still using the original regressor matrix), and the final predicted Y was the sum of the global predictions and the local predictions. The clusters were based on the global X-scores as described above, and we therefore used the same number of clusters as for the hierarchical regression modelling of the entire Y-matrix. This gave a small gain in prediction accuracy (the results for the mammalian circadian clock data are shown in Additional file 1: Appendix 3, Figure A3.4). However, this approach was not chosen since some interpretability is lost due to the more complex Y-loadings based on the Y-residuals.
Evaluation of the metamodelling performance of HC-PLSR
The three test applications were chosen in order to represent a wide range of dynamic model types, in order to obtain a valuable evaluation of the metamodelling performance of HC-PLSR in systems biology. This is important in order for the method to be generally applicable e.g. as part of model repositories like the CellML repository [7–9]. The first model setting represents very simple models, where differences between various types of feedback can be explored. The two other model settings represent more complex models with many parameters and state variables, connected through various types of feedback loops. As described above, model setting 2 and 3 represent an optimisation study and a screening study, respectively.
In the applications described above, a set of parameter combinations and state variable initial conditions were used as input to an ODE-based dynamic model, and the output was a set of state variables calculated over a number of time steps. In metamodelling of the dynamic models, the parameter values and initial conditions were then used as regressors (X) and the state variable time series were used as response variables (Y) in a regression analysis, in order to predict the state variable time series directly from the parameters and initial conditions.
Overview of the regressor- and response matrices used in the regression analyses and the test sets used for the three application examples
Design variables (D)
Regressor matrix (X)
Response matrix (Y)
Test set used
Gene regulatory networks
125 initial conditions for the three state variables X1, X2 and X3 (at time zero) in a 53 FFD (dimensions: 125 × 3)
[D sin(D) cos(D)]
The concatenated time series for the state variables X1, X2 and X3 (Yi = 3 × 300 time points,
i = 1 to nr. of observations)
33% of the observations in D (randomly chosen), and the corresponding Y-values
Mammalian circadian clock model
Nine model parameters in an OMBR design using eight levels for each parameter
(dimensions: 8192 × 9)
[D cross-terms of D D2]
16 state variable time series modelled separately (for each state variable Yi = 200 time points,
i = 1 to nr. of observations)
8192 new parameter combinations generated by random Monte Carlo sampling (here the entire matrix D was used for calibration) and corresponding Y-values
Mouse ventricular myocyte model
Ten model parameters in a 310 FFD
(dimensions: 59049 × 10)
[D cross-terms of D D2]*
35 state variable time series modelled separately (for each state variable Yi = 200 time points,
i = 1 to nr. of observations)*
33% of the observations in D that did not fail (randomly chosen), and the corresponding Y-values
In HC-PLSR, only the local model corresponding to the most probable cluster for each observation was used in the test set prediction, since this approach gave slightly better results than using a weighted sum of all local models. The appropriate numbers of HC-PLSR clusters were chosen by inspection of the R2 and RMSEP values of Y within the calibration set, resulting from predictions using a range of different numbers of clusters. Here the calibration set observations were first used for calibration and subsequently treated as "new observations" and classified prior to prediction, that is, the same procedure as for the test set was used in the prediction stage. The minimum number of clusters in HC-PLSR giving (approximately) maximum obtained predictive ability (maximal R2 and minimal RMSEP) was chosen for each state variable time series.
Model setting 1: Gene regulatory networks
In metamodelling of the six gene regulatory networks, the initial values of the state variables were used as regressors (X) (125 starting conditions × 3 state variables) and for each gene regulatory network, the concatenated time series for the state variables X1, X2 and X3 were used as response variables (Y) (3 state variables × 300 time steps = 900 Y-variables). The state variable data generated for the gene regulatory network motifs are shown in Additional file 1: Appendix 2, Figure A2.1. Sine and cosine terms (in radians) of X were included in the regressor matrix, while cross-terms and second order terms were omitted here. This was the result of a "trial and error" procedure to choose between different types of nonlinear terms to include in the regression equation. This could be afforded here due to the relatively small size of the data set. Adding sine and cosine terms to the regressor matrix turned out to be advantageous in order to model the nonlinearities present in this data set. All regressor and response variables were mean-centred prior to the regression analysis.
In the HC-PLSR, FCM clustering on the Y-scores was chosen due to the low number of X-variables (initial conditions for three state variables), and the fact that these were derived by a full factorial design. FCM clustering on the first three PCs of the Y-scores was used to establish the local PLSR models, since this gave a very distinct clustering of the observations. QDA based on globally predicted Y-scores for new observations gave the best classification results for this data set, and was therefore used for the test set prediction, using 33% of the observations (randomly selected from the FFD) as a separate test set (the rest of the observations were used for calibration). The test set was chosen randomly from all possible combinations of parameter values in order to be representative for the population.
Model setting 2: Mammalian circadian clock
In metamodelling of the mammalian circadian clock model, the state variable time series (Y) were modelled as functions of the parameters of the dynamic model and their cross- and second order terms (X). Each of the time series for the 16 state variables (see Additional file 1: Appendix 3, Figure A3.1) was modelled separately, giving 16 regression models for each regression method. Each regression model could thus predict the response in 200 time steps for one of the 16 state variables. The state variable time series (response variables, Y) were log-transformed, and all variables were mean-centred prior to the regression analysis.
FCM clustering on the X-scores from the global PLSR model was used in the HC-PLSR, and the classification of new observations was based on the distances to the cluster centres (in predicted X-scores for new observations) from the fuzzy clustering. Clustering on the X-scores is in general more safe compared to using the Y-scores, since the Y-scores for new observations to be predicted have to be calculated from predicted Y-values from the global PLSR model. The X-scores are calculated from the X-data for new observations, and their accuracy is therefore not dependent on the quality of the global PLSR model. However, only the first three PLS components of the X-scores were used for clustering, in order for the clusters to be as relevant as possible for both Y and X (the first PCs explain the largest amount of the covariance between X and Y).
The circadian clock test set consisted of 8192 new parameter combinations generated by random Monte Carlo sampling [48, 49] from a uniform probability distribution within the same parameter ranges as for the calibration set. Hence, the test set had the same size as the calibration set. The reason why we chose to generate a separate test set based on Monte Carlo sampling instead of sampling randomly from the original design as in the gene regulatory network example is that here a highly reduced design was used instead of an FFD. A test set sampled from a highly reduced design is less likely to be fully representative for the population than a test set sampled from an FFD (which includes all possible combinations of parameter values). Furthermore, sampling from the OMBR design to generate a test set would reduce the calibration set further, and the calibrated regression model would be less representative. Since the OMBR is optimised with respect to quadratic D-optimality , using part of the design as test set would also make the calibration set less optimal for inclusion of second order polynomial terms in the regression analysis.
Model setting 3: Mouse ventricular myocyte
In metamodelling of the mouse ventricular myocyte model, the state variable time series (Y) were modelled as functions of the parameters of the dynamic model and their cross- and second order terms (X), using the same procedure as for the mammalian circadian clock model. Each of the time series for the 35 state variables (see Additional file 1: Appendix 4, Figure A4.1) was modelled separately. The parameters were mean-centred and standardised (divided by their standard deviations) due to the large differences in absolute values for the parameters, and the state variables were mean-centred prior to the regression analysis.
In this study, 33% of the observations (randomly selected from the simulations in the FFD that did not fail) were used as a separate test set. The rest of the observations were used for calibration. The data set was separated according to the stimulus period, and the regression calibration and test set predictions were carried out for each part of the data set separately. The average R2 and RMSEP values over the three data set parts were then calculated for each state variable.
Model setting 1: Gene regulatory networks
When clustering on the Y-scores as is done for the gene regulatory motif data set, a relatively good global PLSR model is needed for reasonable prediction of the Y-scores for the test set. In this example, the lowest global model R2test set value (0.66) was obtained for gene regulatory network motif 1. This is good enough for a reasonable Y-scores prediction for this data set, since HC-PLSR results in an R2test set value of 0.99 (see Figure 5). It should, however, be kept in mind that the global model needs to provide some reasonable information to be used for prediction of the Y-scores. The required quality of the global model depends on how distinct the separation of the clusters is. Less distinctly separated clusters need a higher quality of the Y-scores predictions for the Y-scores to give reliable classification results.
Model setting 2: Mammalian circadian clock
Model setting 3: Mouse ventricular myocyte
HC-PLSR provides a very flexible metamodelling methodology, since the number of local models (clusters) can be adjusted and the model adapted to suite the complexity of the response surface of the dynamic model. Hence, even very complex data sets containing a large number of different types of input-output relations can be quite accurately modelled through modelling each type of response surface separately. HC-PLSR using only one cluster is equivalent to a global PLSR model. In principle, the least complex among several types of regression methods leading to good predictions should be chosen, in order to favour model interpretation and to avoid overfitting. In our approach to the hierarchical PLSR, both the global and the local regression models are kept for comparison. Hence, for each case of metamodelling one can evaluate the gain of using local modelling compared to the global model. The combination of local and global modelling is also useful for revealing regional differences in model sensitivity to the various parameters, e.g. by exploring the loading plots or regression coefficients for the local models. The local models also provide the opportunity to zoom into regions of the parameter space and identify the operative domains of the parameter space in an efficient manner.
The gain in prediction accuracy from using HC-PLSR was larger for metamodelling of the circadian clock data set than for the mouse ventricular myocyte data, implying that the circadian clock model exhibits behaviour which is harder to capture with linear polynomial models. This indicates that the circadian clock model contains more non-monotonicity or other complex nonlinear relationships between parameters and behaviour, but the systemic explanations are not easy to pinpoint as both dynamic models contain complex wiring. Even though polynomial PLSR and OLS gave reasonable predictions for many of the state variables in both the circadian clock and the mouse ventricular myocyte data set (see Figure 8 and Figure 11), the HC-PLSR results were superior to both PLSR and OLS in terms of metamodelling robustness. OLS gave slightly better results than global PLSR for both these data sets, since the rank was reduced in PLSR compared to OLS. Projecting the data into estimated latent variables and thereby decreasing the rank of the data has great advantages with respect to interpretation. Hence, there is a trade-off between interpretability and prediction accuracy. One may argue that the present criterion for choosing the number of components to include in the PLSR model (based on cross-validation) is a bit conservative in this case, since the regression modelling is based on noise-free data from deterministic models. Using HC-PLSR, we achieve a better prediction accuracy than both global polynomial PLSR and OLS, while keeping the advantages of reduced rank and increased overview and interpretability compared to OLS by using PLSR in the local modelling.
Our HC-PLSR using FCM clustering of the observations may be considered as a more or less automatic, "top-down" approach. In contrast, most existing hierarchical approaches are "bottom-up", that is, they start by estimating latent variables (PLSR scores) for individual, a priori known data blocks, and then combine these block scores into a second level linear PLSR model [16–19]. The use of fuzzy clustering to separate the observations into groups makes the HC-PLSR approach presented here applicable also in cases where prior knowledge about different blocks of data is not available. However, when such information is available, it should be used instead of clustering to separate the observations into groups. The number of clusters to use is here specified in advance. More computationally demanding alternatives exist, however, that allow automatic identification of the optimal number of clusters [50, 51]. An alternative would be to include an automatic optimisation of the number of clusters (e.g. based on automatic exploration of several alternatives). This would allow for a semi-automatic metamodelling methodology which would be feasible for implementation e.g. as part of the CellML repository [7–9].
In contrast to crisp clustering methods, such as K-means clustering, which allocate each observation to a unique cluster, fuzzy clustering returns membership values for the different clusters for each observation . This provides the opportunity to compute the regression coefficients for each observation as a weighted sum of the regression coefficients for the different clusters, where the membership values are used as weights. Predictions of response variables for the test set observations were here done both by a) selecting the local model for the most probable cluster and by b) computing the regression coefficients as a weighted sum of the local models. The first approach gave the best prediction accuracy in our case. The reason is probably that all data sets used in this paper were generated from experimental designs, and not by random sampling in the parameter space. For data sets with very distinct separations of the clusters, the first approach performs best, while for more continuous data, a weighted sum of the local models will probably be a better choice. Moreover, using a weighted sum will help avoiding discontinuities in regions where local models meet in more continuous data sets. We found that using only the first three PLS components in the FCM clustering gave the most distinct separation of the clusters for the three data sets tested here. This ensures that the most relevant information about the covariance between X and Y is being used for clustering when using the X-scores as basis. However, using only three PLS components may not be appropriate in general. For some data sets, one may want to use a larger number of PLS components. In order to use the Y-scores as a basis for clustering and classification, the separation of the observations needs to be quite distinct, since the Y-scores are based on predicted Y-values and therefore contain some prediction error that may disrupt the classification when the clusters are not distinctly separated.
OLS is less computationally demanding than both HC-PLSR and PLSR, and is a suitable metamodelling method in cases where the response surface is monotone, moderately nonlinear and the regressor variables are linearly independent. However, for the metamodelling approach to be feasible for automation, we need a robust regression method that can be automatically adjusted according to the properties of the response surface of the dynamic model to be emulated. HC-PLSR has advantages over other regression methods such as global PLSR and OLS in cases where the response surface is complex, highly nonlinear and non-monotone (in such cases, the model errors described by the RMSEP values from predictions with a global regression method will be much higher than the level of noise in the data), and the number of clusters used can be adjusted to suit the complexity of the response surface allowing for a robust automation of the metamodelling procedure.
Our results show that it is possible to emulate dynamic models in systems biology to a high precision using multivariate regression, and that local modelling can improve the results substantially when the parameter to phenotype map is highly nonlinear. HC-PLSR was superior to both global polynomial PLSR and global polynomial OLS regression in all three model settings reported here. Since these model settings represent large classes of dynamic models and because the HC-PLSR method can be adjusted to suite the complexity of the dynamic model behaviour in a very flexible way inviting automation, it qualifies as a promising candidate approach for metamodelling within systems biology.
This study was supported by the National Program for Research in Functional Genomics in Norway (FUGE) (RCN grant no. NFR151924/S10) and by the Norwegian eScience program (eVITA) (RCN grant no. NFR178901/V30). Ingunn Berget is thanked for providing us with the code for fuzzy clustering and supervision in how to use it.
- Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S: Global Sensitivity Analysis: The Primer. 2008, Wiley-InterscienceGoogle Scholar
- Kleijnen JPC: Design and Analysis of Simulation Experiments. 2007, New York, USA: Springer, 1Google Scholar
- Santner TJ, Williams BJ, Notz W: The design and analysis of computer experiments. 2003, New York, USA: SpringerView ArticleGoogle Scholar
- Cacuci DG: Sensitivity and Uncertainty Analysis: Theory v.1: Theory Vol 1. 2003, Chapman and Hall/CRC, 1View ArticleGoogle Scholar
- Cacuci DG, Ionescu-Bujor M, Navon IM: Sensitivity and Uncertainty Analysis: Applications to Large-scale Systems Vol 2. 2005, CRC Press, 1View ArticleGoogle Scholar
- Ayyub BM, Klir GJ: Uncertainty modeling and analysis in engineering and the sciences. 2006, Chapman & Hall/CRCView ArticleGoogle Scholar
- Lloyd CM, Halstead MDB, Nielsen PF: CellML: its future, present and past. Progress in Biophysics and Molecular Biology. 2004, 85: 433-450. 10.1016/j.pbiomolbio.2004.01.004View ArticlePubMedGoogle Scholar
- Cuellar AA, Lloyd CM, Nielsen PF, Bullivant DP, Nickerson DP, Hunter PJ: An Overview of CellML 1.1, a Biological Model Description Language. SIMULATION. 2003, 79: 740-747. 10.1177/0037549703040939.View ArticleGoogle Scholar
- Lloyd CM, Lawson JR, Hunter PJ, Nielsen PF: The CellML Model Repository. Bioinformatics. 2008, 24: 2122-2123. 10.1093/bioinformatics/btn390View ArticlePubMedGoogle Scholar
- Conti S, O'Hagan A: Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference. 2010, 140: 640-651. 10.1016/j.jspi.2009.08.006.View ArticleGoogle Scholar
- Li G, Rabitz H, Yelvington PE, Oluwole OO, Bacon F, Kolb CE, Schoendorf J: Global sensitivity analysis for systems with independent and/or correlated inputs. J Phys Chem A. 2010, 114: 6022-6032. 10.1021/jp9096919View ArticlePubMedGoogle Scholar
- Berglund A, Wold S: INLR, implicit non-linear latent variable regression. J Chemometrics. 1997, 11: 141-156. 10.1002/(SICI)1099-128X(199703)11:2<141::AID-CEM461>3.0.CO;2-2.View ArticleGoogle Scholar
- Ptashne M, Backman K, Humayun M, Jeffrey A, Maurer R, Meyer B, Sauer R: Autoregulation and function of a repressor in bacteriophage lambda. Science. 1976, 194: 156-161. 10.1126/science.959843View ArticlePubMedGoogle Scholar
- Wang KL, Warner JR: Positive and negative autoregulation of REB1 transcription in Saccharomyces cerevisiae. Mol Cell Biol. 1998, 18: 4368-4376.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaplan S, Bren A, Dekel E, Alon U: The incoherent feed-forward loop can generate non-monotonic input functions for genes. Mol Syst Biol. 2008, 4: 203-PubMed CentralView ArticlePubMedGoogle Scholar
- Wold S, Kettaneh N, Tjessem K: Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection. J Chemometrics. 1996, 10: 463-482. 10.1002/(SICI)1099-128X(199609)10:5/6<463::AID-CEM445>3.0.CO;2-L.View ArticleGoogle Scholar
- Lindström A, Pettersson F, Almqvist F, Berglund A, Kihlberg J, Linusson A: Hierarchical PLS modeling for predicting the binding of a comprehensive set of structurally diverse protein-ligand complexes. J Chem Inf Model. 2006, 46: 1154-1167. 10.1021/ci050323kView ArticlePubMedGoogle Scholar
- Eriksson L, Toft M, Johansson E, Wold S, Trygg J: Separating Y-predictive and Y-orthogonal variation in multi-block spectral data. J Chemometrics. 2006, 20: 352-361. 10.1002/cem.1007.View ArticleGoogle Scholar
- Eriksson L, Trygg J, Wold S: PLS-trees®, a top-down clustering approach. J Chemometrics. 2009, 23: 569-580. 10.1002/cem.1254.View ArticleGoogle Scholar
- Wold S, Berglund A, Kettaneh N: New and old trends in chemometrics. How to deal with the increasing data volumes in R&D&P (research, development and production) - with examples from pharmaceutical research and process modeling. J Chemometrics: Special Issue: Proceedings of the 7th Scandinavian Symposium on Chemometrics. Edited by: Lars Nørgaard. 2002, 16 (8-10): 377-386.Google Scholar
- Pearson K: On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901, 2: 559-572.View ArticleGoogle Scholar
- Jolliffe IT: Principal Component Analysis. 2002, Springer, 2Google Scholar
- Martens H, Martens M: Multivariate Analysis of Quality: An Introduction. 2001, Wiley, 1Google Scholar
- Wold S, Martens H, Wold H: The multivariate calibration method in chemistry solved by the PLS method. Lecture notes in Mathematics, Matrix Pencils. 1983, 286-293. Heidelberg: Springer-VerlagGoogle Scholar
- Martens H, Næs T: Multivariate calibration. 1989, Chichester, UK: John Wiley and SonsGoogle Scholar
- Martens M, Martens H: Partial Least Squares regression. Statistical Procedures in Food Research. Edited by: J.R. Piggott. 1986, 293-360. London: Elsevier Applied SciencesGoogle Scholar
- Jolliffe IT: A Note on the Use of Principal Components in Regression. Journal of the Royal Statistical Society Series C (Applied Statistics). 1982, 31: 300-303.Google Scholar
- Kramer R: Chemometric Techniques for Quantitative Analysis. 1998, CRC Press, 1View ArticleGoogle Scholar
- Helland IS: Steps Towards a Unified Basis for Scientific Models and Methods. 2009, World ScientificGoogle Scholar
- Campbell K, McKay MD, Williams BJ: Sensitivity analysis when model outputs are functions. Reliability Engineering & System Safety. 2006, 91: 1468-1472. 10.1016/j.ress.2005.11.049.View ArticleGoogle Scholar
- Martens H, Veflingstad S, Plahte E, Martens M, Bertrand D, Omholt S: The genotype-phenotype relationship in multicellular pattern-generating models - the neglected role of pattern descriptors. BMC Syst Biol. 2009, 3: 87- 10.1186/1752-0509-3-87PubMed CentralView ArticlePubMedGoogle Scholar
- Gjuvsland AB, Plahte E, Omholt SW: Threshold-dominated regulation hides genetic variation in gene expression networks. BMC Syst Biol. 2007, 1: 57- 10.1186/1752-0509-1-57PubMed CentralView ArticlePubMedGoogle Scholar
- Gjuvsland AB, Hayes BJ, Omholt SW, Carlborg O: Statistical Epistasis Is a Generic Feature of Gene Regulatory Networks. Genetics. 2007, 175: 411-420.PubMed CentralView ArticlePubMedGoogle Scholar
- Leloup J-C, Goldbeter A: Modeling the mammalian circadian clock: Sensitivity analysis and multiplicity of oscillatory mechanisms. J Theor Biol. 2004, 230: 541-562. 10.1016/j.jtbi.2004.04.040View ArticlePubMedGoogle Scholar
- Li L, Niederer SA, Idigo W, Zhang YH, Swietach P, Casadei B, Smith NP: A mathematical model of the murine ventricular myocyte: a data-driven biophysically based approach applied to mice overexpressing the canine NCX isoform. American Journal of Physiology - Heart and Circulatory Physiology. 2010, 299: H1045-H1063. 10.1152/ajpheart.00219.2010View ArticlePubMedGoogle Scholar
- Mestl T, Plahte E, Omholt SW: A mathematical framework for describing and analysing gene regulatory networks. J Theor Biol. 1995, 176: 291-300. 10.1006/jtbi.1995.0199View ArticlePubMedGoogle Scholar
- Plahte E, Mestl T, Omholt SW: A methodological basis for description and analysis of systems with complex switch-like interactions. Journal of Mathematical Biology. 1998, 36: 321-348. http://www.springerlink.com/content/1d4fv3dh7c995ng0/View ArticlePubMedGoogle Scholar
- , : The MathWorks™ 2009.
- Martens H, Måge I, Tøndel K, Isaeva J, Høy M, Sæbø S: Multi-level Binary Replacement (MBR) design for computer experiments in high-dimensional nonlinear systems. J Chemometrics. 2010, 24: 748-756. 10.1002/cem.1366.View ArticleGoogle Scholar
- Tøndel K, Gjuvsland AB, Måge I, Martens H: Screening design for computer experiments: Metamodelling of a deterministic mathematical model of the mammalian circadian clock. J Chemometrics. 2010, 24: 738-747. 10.1002/cem.1363.View ArticleGoogle Scholar
- Hindmarsh AC, Brown PN, Grant KE, Lee SL, Serban R, Shumaker DE, Woodward CS: SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Trans Math Softw. 2005, 31: 363-396. 10.1145/1089014.1089020.View ArticleGoogle Scholar
- Bondarenko VE, Szigeti GP, Bett GCL, Kim S-J, Rasmusson RL: Computer model of action potential of mouse ventricular myocytes. Am J Physiol Heart Circ Physiol. 2004, 287: H1378-1403. 10.1152/ajpheart.00185.2003View ArticlePubMedGoogle Scholar
- Cohen SD, Hindmarsh AC: CVODE, a stiff/nonstiff ODE solver in C. Comput Phys. 1996, 10: 138-143.View ArticleGoogle Scholar
- Bezdek JC: Pattern Recognition with Fuzzy Objective Function Algorithms. 1981, Kluwer Academic PublishersView ArticleGoogle Scholar
- Berget I, Mevik B-H, Næs T: New modifications and applications of fuzzy C-means methodology. Comput Stat Data Anal. 2008, 52: 2403-2418. 10.1016/j.csda.2007.10.020.View ArticleGoogle Scholar
- McLachlan GJ: Discriminant Analysis and Statistical Pattern Recognition. 1992, Wiley-Interscience, 1View ArticleGoogle Scholar
- Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning. 2003, Corrected. SpringerGoogle Scholar
- Kalos MH, Whitlock PA: Monte Carlo Methods Volume 1: Basics. 1986, New York, USA: John Wiley & Sons, Inc, 1View ArticleGoogle Scholar
- Liu JS: Monte Carlo Strategies in Scientific Computing. 2008, New York, USA: SpringerGoogle Scholar
- Gath I, Geva AB: Unsupervised Optimal Fuzzy Clustering. IEEE Trans Pattern Anal Mach Intell. 1989, 11: 773-780. 10.1109/34.192473.View ArticleGoogle Scholar
- Frigui H, Krishnapuram R: A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell. 1999, 21: 450-465. 10.1109/34.765656.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.