In the following sections we will provide our approach to model integration. First we will introduce a workflow which subdivides the general integration task into three major steps. We then discuss challenges in structural as well as in behavioral model integration and present possible solution strategies. To illustrate challenges and solutions arising in behavioral integration, two models describing EGF and NGF signaling originally presented in [10] are integrated.
A semi-automatic workflow based on existing standards and software
As pointed out in the previous section, there exists a variety of standards and tools that support structural model integration. We will present a semi-automatic workflow which incorporates many of these. This workflow consists of three major steps: ‘Model Preparation’, ‘Model Merging’ and ‘Model Reparameterization’ (see Fig. 1 for an outline of the workflow).
Model preparation
Prior to the merging of network models in SBML the models have to be prepared appropriately. Thereby the first task is to ensure that the units used in both models match. This might require a conversion step, where the units used in one model are converted to match those of the second model.
The goal of the model preparation step is to facilitate a unique identification of model elements. Here the software semanticSBML provides convenient features, as it allows, for example, to search a large collection of Databases for suitable annotations using keywords. Furthermore we recommend to use the SBML Validator [29] to ensure that the model is in valid SBML. In the section ‘Challenges in structural model integration’ we will point out that established annotation standards like SBO and MIRIAM are often not sufficient to discover identical model elements when signal transduction models are considered. This requires an appropriate naming scheme in combination with annotations (see Additional file 1). The names of model elements can comfortably be edited with the SBMLeditor [30]. The outcome of this step are two well prepared models in SBML format.
Model merging (structural model integration)
We recommend to use the software semanticSBML. In semanticSBML an initial matching of model elements can be calculated automatically. To this end information about the model elements is required to identify the overlap of two models automatically. This information has to be assigned in form of annotations and names in the prior model preparation step. The initial matching is calculated solely based on the annotations. A manual post-editing of this matching is supported by the software. Here the element names can be incorporated to solve conflicts, clear wrong matches or add matches which have not been found automatically. The outcome of this step is a new model with a fixed network structure.
Model reparameterization (behavioral model integration)
After the merging step the obtained model has to be tested if it is in valid SBML and if it is consistent with the experimental data (consistency check). If the merged model is not consistent with the experimental data the parameters have to be adapted. It might be necessary to pass through the reparameterization and consistency check of the model repeatedly.
Challenges in structural model integration
When integrating two models, whether semi-automatic or by hand, the overlap of the models has to be recognized and handled, that is, identical elements have to be identified and combined in an appropriate way to obtain the merged model. In this section we describe potential complications and, where available, comment on how to resolve these.
Modification on different sites
In the MIRIAM guidelines it is defined how models and model elements can be related to entries of various databases like UniProt [31], Kegg [32], Gene Ontology [33] or ChEBI [34] using the Resource Description Framework (RDF). Following the MIRIAM standard may be sufficient to discover identical elements in metabolic models, because in general, every model component can be related to web resources. Whereas for signaling systems the annotation may not be sufficient to uniquely identify elements, as often only basic forms of molecules are available in databases. Problems will then arise because in signal transduction species often describe molecules with multiple modifications or complexes composed of several molecules with various stoichiometry. These species cannot be identified using the database annotation alone. Moreover, sometimes elements can not be found in databases. In Fig. 2 an example is shown.
Here a solution is to encode additional information in the names of model elements, for example information on modification sites and the stoichiometry in complexes. To ensure a unique identification of the model elements a naming scheme can be used. In Additional file 1 we provide guidelines how a combination of rdf annotations, SBO annotations [35] and names, following a naming scheme can be used to ensure a unique identification of model elements for signaling. These guidelines have been developed within the framework of the Virtual Liver [36].
A modeler may get the impression that annotating models and following common standards is connected with a high work load. There’s no denying, but the effort put in the annotation of models prior to structural integration is definitely not in vain. Standardized formats, annotations and curated model repositories are a general trend in systems biology to make models available and more reusable for other modelers. This trend is reinforced by many journals where models have to be uploaded in repositories in a standardized format. And, in the context of this work, if the models are well prepared, software tools can be used to perform the structural integration in a semi-automatic manner.
Different level of detail in reactions
Another challenging task is the identification of the same overall reactions which are modeled on a different level of detail (see Fig. 3). This is a task which can currently not be automatized. If the model is well prepared and the elements are annotated and named as proposed in our guidelines (see Additional file 1) the reactant and product species of the overall reaction can be recognized as equal. In many cases a graphical visualization of the model may also be helpful. The decision whether the integrated model should contain the detailed or the lumped description mainly depends on the goal of the integration task.
Differently modeled reactions
A reaction may be represented differently in different models. Both models might, for example, contain the production of S2 controlled by S1. But in one model S1 acts as a modifier, while in the other model S1 acts as a reactant (see Fig. 4). A similar situation arises, when both models contain the reaction from S1 to S2 but use different kinetic laws. In this case a decision has to be made which of the two reactions should be chosen.
Molecules in different compartments
If the models contain the same molecule but in different compartments a review of the compartment names and annotations should be the first step. Depending on the integration goal, an adoption of both species and an additional transport reaction between the compartments may be a solution.
Molecules in different states
Frequently two models contain a molecule in different states, e.g. one model contains a molecule only in an unmodified state; the other model contains the molecule only in a modified state. In most cases an adoption of both molecules and an additional modification or complexation reaction is a solution.
Challenges in behavioral model integration
The outcome of the structural integration step is a merged model, that is, a combined model containing the elements of both models. During this structural integration parameter values (reaction rate constants and initial values) are assigned to the appropriate model elements.
The aim of behavioral model integration is to obtain a parameterized integrated model that is consistent with experimental data of the original models. As we will demonstrate below, this will usually not be the case, if all parameter values of the original models are re-used in the integrate model. Rather, parameter values assigned to model elements will have to be adjusted.
One reason is the inherent ambiguity in assigning parameters to model elements. While the choice of parameter values is easy for non-overlapping model parts where only one parameterization exists, it can be challenging for reactions in the overlap, where it is not a priori clear which parameterization is suited best. Choosing either one will almost certainly affect the ability of the integrated model to explain experimental data.
To illustrate this, two models describing EGF and NGF signaling have been merged. These models originate in [10], details are given in the ‘Methods’ section. For reactions that occur only in one of the models (green and blue box in Fig. 5), only one parameter set is available. But for reactions in the overlap (red box in Fig. 5) two possible parameter sets exist. For demonstration purposes the parameter values of the EGF model have been chosen for the reactions in the overlap. Consequentially, simulation results of the original EGF model can be reproduced, simulation results of the original NGF model can not be reproduced (see lower part of Fig. 5). Hence we argue that the model is not consistent with the experimental data of the original NGF model. (Whereby we assume that the original models have been consistent with experimental data. Hence, if the integrated model is able to reproduce the time courses it is also consistent with the corresponding experimental data).
Generally speaking, whenever two models A and B are integrated, choosing parameter values of model A for the overlap is expected to result in simulation results similar to those of model A; likewise, choosing parameter values of model B for the overlap is expected to result in simulation results similar to those of model B. Thus, in general, parametrizing the model overlap with values belonging to one of the models is expected to result in an integrated model that is not able to reproduce the simulation results of both original models and hence is not consistent with the experimental data of at least one model.
Consistency conditions
We propose to judge consistency with experimental data by means of input/output relations: whenever an input is applied a model produces a corresponding output.
Informally speaking, if a signal is presented to the inputs of the integrated model that come from model A, while the inputs coming from model B are set to zero, then those output signals of the integrated model coming from A should be ‘similar’ to the output of A for the same input signal (cf. Fig. 6).
To be more precise, let u
A
denote the inputs of the merged model which originate from model A, ů
A
the inputs of model A, u
B
the inputs of the merged model which originate from model B, and ů
B
the inputs of model B. Likewise, let v
A
denote the outputs of the merged model which originate from model A, v ̈
A
the outputs of model A, v
B
the outputs of the merged model which originate from model B and v ̈
B
the outputs of model B. Finally, let u=(u
A
,u
B
) and v=(v
A
,v
B
) denote input and output of the integrated model, where identical inputs and outputs are listed only once. Then u=(u
A
,0) denotes a signal where all inputs that originate from model A receive a signal while all those belonging only to B are set to zero and u=(0,u
B
) denotes a signal where the roles of A and B have been exchanged. Similarly, v=(v
A
,?) denotes a signal where all outputs originating from A show a specific value, while those belonging only to B may take any value and v=(?,v
B
) denotes a signal where the roles of A and B have been exchanged.
We say an integrated model is consistent with experimental data, if the following relations hold for all signal pairs ů
A
, v ̈
A
used to parametrize model A and for all signal pairs ů
B
, v ̈
B
used to parametrize model B (cf. Fig. 6):
-
1.
Inputs u=(ů
A
,0) yield output v≈(v ̈
A
,?)
-
2.
Inputs u=(0,ů
B
) yield output v≈(?,v ̈
B
)
To assess the similarity of the output curves we suggest to use the χ
2 merit function which is often optimized in parameter estimation (see, e.g., the software PottersWheel [37]):
$$ \chi^{2}(p) = \sum_{i=1}^{N} \left(\frac{y_{i}-y(t_{i};p)}{\sigma_{i}}\right)^{2} $$
((1))
In the above formula y
i
is the data point i with the standard deviation σ
i
and y(t
i
;p) is the model value at time point i for parameter values p (see, for example, [37]).
In the following we suggest to compute two χ
2-values to assess the similarity of the output of the integrated model to that of models A and B: \({\chi ^{2}_{A}}/N_{A}\) and \({\chi ^{2}_{B}}/N_{B}\). Thereby N
A
and N
B
are the number of data points of the respective model, y
A
(t
i
;p
A
) and y
B
(t
i
;p
B
) denote the value of the output signals of A and B at time point i and y
I
(t
i
;p
I
) the value of the output signal of the integrated model at time point i. The values \({\chi ^{2}_{A}}/N_{A}\) and \({\chi ^{2}_{B}}/N_{B}\) are then calculated as follows:
$$ \frac{{\chi^{2}_{A}}}{N_{A}} = \sum_{i=1}^{N_{A}} \left(\frac{y_{A}(t_{i};p_{A})-y_{I}(t_{i};p_{I})}{\sigma_{i}} \right) $$
((2))
$$ \frac{{\chi^{2}_{B}}}{N_{B}} = \sum_{i=1}^{N_{B}} \left(\frac{y_{B}(t_{i};p_{B})-y_{I}(t_{i};p_{I})}{\sigma_{i}} \right) $$
((3))
Now we say the integrated model is consistent, if
-
1.
\({\chi ^{2}_{A}}/N_{A} < 1\) and
-
2.
\({\chi ^{2}_{B}}/N_{B} < 1\).
With this consistency condition it is possible to formulate integration goals. It is often not required to obtain consistency of the integrated model with both original models equally well. An exemplary integration goal could be:
-
Reproduce simulation results of model A almost exactly and reproduce simulation results of model B as well as possible, that is, obtain a parameterization such that the integrated model satisfies the condition
$$ {\chi^{2}_{A}}/N_{A} <1, {\chi^{2}_{B}}/N_{B} \approx 1. $$
((4))
For the integration example in Fig. 5 we define the following integration goal: reproduce the time courses of the EGF model almost exactly and those of the NGF model as well as possible (i.e. \(\chi ^{2}_{EGF}/N_{EGF} < 1\) and \(\chi ^{2}_{NGF}/N_{NGF}\approx 1\)). This integration goal guided our choice of parameter values for the elements of the overlap: we assigned the parameter values of the EGF model to the overlap.
Generally speaking, setting an integration goal can guide the initial choice of parameter values for the overlap. If the goal is to reproduce the simulation results of one of the original models almost exactly, the parameter values of this original model should be chosen for the overlap. For the non-overlapping model parts parameter values of the original models can be chosen for the integrated model. In this sense an integration goal influences structural integration.
Note that to compare the output signals of integrated and original model one may either use experimental or simulation data. To check consistency of the model presented in Fig. 5 we make use of the simulation data of the original models. For this purpose we interpret model values y
A
(t
i
;p) as synthetic data points y
A;i
and assume normally distributed errors for these data points (10 % relative and 5 % of the maximum as the absolute error). This approach works on the assumption that the original models have been consistent with the experimental data (as is the case for the example models). If the experimental data which has been used to parametrize the original models is available an alternative approach is to use these data to judge the merged model instead of utilizing the approach with synthetic data points.
For the integration example shown in Fig. 5 we obtain \(\chi ^{2}_{EGF}/N_{EGF} = 0.0004\) and \(\chi ^{2}_{NGF}/N_{NGF} = 1689.3\)). Hence the model is not consistent with experimental data.
Parameter re-fitting
To achieve a consistent model, parameter values have to be modified. Thereby the identification of those parameters that have to be re-fitted is a crucial question that is influenced by the integration goal and the position of the overlap of the two original models.
However, no general guidelines can be formulated for the identification of those parameters that have to be re-fitted. Instead a detailed understanding of merged model and integration goal are essential. In our example the aim is to preserve the time courses of the outputs of the EGF model almost exactly. Hence we select those parameters of the NGF model that do not belong to the overlap (blue box in Fig. 5).
Contrary to the structural integration there is a lack of tools and software to support the whole process of reparameterization. Software tools for parameter fitting like PottersWheel [38] or the Systems Biology Toolbox 2 (SBTOOLBOX2) for MATLAB [39] can be used instead.
In Fig. 7 the simulation results after the merging step (left column) and after the reparameterization step (right column) are depicted for the integration example. Solid lines represent the original NGF model; dotted lines the integrated model after the respective step of the workflow. The time courses of the three output states pTrkA, pAkt and pS6 are shown after stimulation with different concentrations of NGF (colors correspond to the different NGF-stimuli used in Fig. 5). The early response can not be reproduced very well, even after reparameterization. Nonetheless, steady steady values almost coincide after the reparameterization step. Comparing entire output signals via the \(\frac {\chi ^{2}}{N}\) values reveals the similarity of the signals: \(\frac {\chi ^{2}_{EGF}}{N} = 0.289\) and \(\frac {\chi ^{2}_{NGF}}{N} = 0.899\). Hence the integration goal is achieved. Moreover, the model is consistent according to our definition. Especially for the output pS6 the time courses can be reproduced almost exactly after reparameterization. From our point of view this output is more important than the other two because it forms the end of the signaling cascade.