Challenges in horizontal model integration

Background: Systems Biology has motivated dynamic models of important intracellular processes at the pathway level, for example, in signal transduction and cell cycle control. To answer important biomedical questions, however, one has to go beyond the study of isolated pathways towards the joint study of interacting signaling pathways or the joint study of signal transduction and cell cycle control. In doing so, the reuse of established models is preferable, as it will generally reduce the modeling effort and increase the acceptance of the combined model in the field.
Results: Obtaining a combined model can be challenging, especially if the submodels are large and/or come from different working groups (as is generally the case when models stored in established repositories are used). To support this task, we describe a semi-automatic workflow based on established software tools. In particular, two frequent challenges are described: identification of the overlap and subsequent (re)parameterization of the integrated model.
Conclusions: The reparameterization step is crucial if the goal is to obtain a model that can reproduce the data explained by the individual models. For demonstration purposes we apply our workflow to integrate two signaling pathways (EGF and NGF) from the BioModels Database.
Electronic supplementary material: The online version of this article (doi:10.1186/s12918-016-0266-3) contains supplementary material, which is available to authorized users.


Introduction
We focus on a model integration process in which established quantitative models of signal transduction, set up by different groups, are combined to obtain a larger model. To avoid redundancy in the integrated model, it is important to detect overlapping parts. Model elements therefore have to be identifiable, so that one can decide which elements of the different models describe the same biological objects. This document provides a guideline on how to prepare mathematical models of biological systems encoded in the Systems Biology Markup Language (SBML, [1], [2]). We propose to use the software semanticSBML [3,4] to perform the integration of network models. All models intended for integration with other models should be consistent with the listed requirements and recommendations.

Requirements for Models
In order to detect overlapping parts, models prepared for integration have to fulfill all the requirements listed below. Where applicable, an explanation of how to meet a requirement follows it directly.
a) The two models have to be encoded in SBML, using level 2 version 3 or 4 (L2V3 or L2V4) or higher. Furthermore, both models must use the same level and version of SBML. This requirement is tailored to integration with the software semanticSBML: SBML level 3 core has recently been released, but at the moment semanticSBML only handles models specified in SBML level 2.
To convert your SBML model into another level and version, we recommend using either Copasi [5] or CellDesigner [6]. You can import your model into one of these tools and choose the desired level and version on export.
b) The models have to be in valid SBML.
You can check the validity of your SBML model with the SBML-Validator [7]. The validation rules can be found in the appendix of the SBML Level 2 specification [2].
c) The modeling formalism has to be the same for both models. Furthermore, the models have to be deterministic, non-spatial and continuous. In principle, models containing events can also be combined, but events should be avoided because they need special handling.
The modeling formalism has to be annotated using the Systems Biology Ontology (SBO) [8]. The corresponding SBO-term for a non-spatial continuous framework is SBO:0000293.
d) The organism has to be the same for the two models, and the taxonomy information has to be available in the model annotation (see point e)).
e) Each model prepared for integration has to be annotated according to the MIRIAM standard [9].
The following information needs to be present in the model: model creator and contact information, creation date, last modification date, taxonomy information, biological assumptions, and a reference to a publication or report containing detailed information on the model, i.e. one that describes its function and makes the expected results reproducible. These so-called model annotations can be edited directly during model setup, e.g. with the modeling software Copasi. For subsequent model annotation, semanticSBML provides convenient editing functionality.
f) Instantiation of model simulations must be possible; therefore, initial conditions and parameters have to be available. Furthermore, kinetic expressions have to be defined for all reactions, and computational cycles have to be avoided, i.e. for each quantity there is exactly one assignment. This is also described in [10].
To instantiate simulations, the software Copasi can be used. Algebraic loops should not occur because <algebraicRule> tags are not allowed in models prepared for integration (see point h)). Loops that arise from <assignmentRule> tags should be detected by the SBML-Validator.
g) Each species, reaction and compartment must contain at least one MIRIAM-compliant annotation. Naming conventions and guidelines for the annotation of unmodified and modified molecules, complexes, reactions and compartments are described in the next section.
The annotation of model components can also be edited with semanticSBML.
h) For integration with the recommended software semanticSBML, the models must not contain algebraic rules.
The software tool we recommend for model integration does not support algebraic rules (defined by <algebraicRule> .. </algebraicRule> in the SBML model). You can try to change the algebraic rule into an assignment rule for one variable defined by this rule. If this is not possible, create an additional parameter and use an assignment rule for this parameter, or remove the algebraic rule and add it in revised form to the integrated model. Additional parameters need to be corrected after integration as well.
i) Conservation relations have to be fulfilled.
With semanticSBML the conservation of atom numbers within a reaction can be checked. Mass conservations can be checked with Copasi.
j) The semanticSBML check must not report errors for either model.
k) The same units have to be used for substances, volume, time, area and length in both models. Only SI units are allowed; these are declared in the SBML specification [2]. The units of all parameters have to be clear and consistent.
l) Choose intuitive and clear names for the elements (according to the proposed naming scheme, see section 4).
If it is not feasible to change the names of species, put the name that complies with the naming scheme into the notes section of the respective species. For this purpose use the following prefix: 'species name for model integration according naming scheme:'. The scheme-compliant name has to be specified either as the name of the species or in the notes section. The name field of the species is the preferred location.
m) If cofactors like ATP/ADP or NAD/NADH are incorporated explicitly in the model, they have to be described by global parameters or species in SBML. If they are used as local parameters, an annotation is not possible.
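Some of the requirements above (a, g and h, plus the presence of events) can be pre-screened with the Python standard library before the models are passed to semanticSBML or the SBML-Validator. The following is only a minimal sketch: the function names `preflight` and `same_level_and_version` and the toy model are hypothetical, and the annotation test merely looks for an <annotation> child instead of verifying MIRIAM compliance.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def preflight(sbml_text):
    """Return (level, version, problems) for one SBML document given as a string."""
    root = ET.fromstring(sbml_text)
    level, version = int(root.get("level")), int(root.get("version"))
    problems = []
    for elem in root.iter():
        name = local(elem.tag)
        if name == "algebraicRule":
            problems.append("algebraic rule present (requirement h)")
        elif name == "event":
            problems.append("event present (not supported by semanticSBML)")
        elif name in ("species", "reaction", "compartment"):
            # Requirement g, simplified: only the presence of an <annotation>
            # child is tested, not whether it is MIRIAM compliant.
            if not any(local(child.tag) == "annotation" for child in elem):
                problems.append(f"{name} '{elem.get('id')}' has no annotation")
    return level, version, problems


def same_level_and_version(text_a, text_b):
    """Requirement a: both models must share SBML level and version."""
    return preflight(text_a)[:2] == preflight(text_b)[:2]


# Hypothetical toy model used only to exercise the checks.
TOY = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments><compartment id="cytosol"/></listOfCompartments>
    <listOfSpecies><species id="ATP" compartment="cytosol"/></listOfSpecies>
    <listOfRules><algebraicRule/></listOfRules>
  </model>
</sbml>"""

level, version, problems = preflight(TOY)
for problem in problems:
    print(problem)
```

Such a screen does not replace the validity check of requirement b) or the semanticSBML check of requirement j); it only catches the most common obstacles early.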

Further Recommendations
• Use unique names and controlled vocabularies for kinetic parameters and reactions, math expressions and compartments.
The SBML elements should have assigned SBO terms if possible. The most specific one should be taken (e.g. SBO:0000190 for a Hill coefficient).
• If semanticSBML is used (as is recommended) event tags should be removed from the model prior to integration.
Events are currently not supported in model integration with semanticSBML. They have to be added into the integrated model in a revised form.
• Try to avoid lumped reactions and species in both models because they may restrict integration (this is also described in [10]). If they are needed, use them in a systematic manner or at least describe them unambiguously. If annotated correctly, lumped reactions can be detected, as they are annotated with 'has part' in most cases. However, it may be difficult to identify two reactions as equal if they are modeled at different levels of detail. Generally, whether the overlap of two models is well identifiable and the level of detail matches has to be decided case by case.
• Try to use realistic non-negative values for concentrations to enable incorporation of this information in the integration process.
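Two of these recommendations lend themselves to a quick automated screen: SBO terms on reactions, parameters and compartments, and non-negative initial values. The sketch below uses only the Python standard library and assumes the SBML Level 2 attribute names `sboTerm`, `initialConcentration` and `initialAmount`; the function name `recommendation_report` and the toy model are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def recommendation_report(sbml_text):
    """Return (elements without an SBO term, species with negative initial values)."""
    root = ET.fromstring(sbml_text)
    missing_sbo, negative = [], []
    for elem in root.iter():
        name = local(elem.tag)
        if name in ("reaction", "parameter", "compartment") and elem.get("sboTerm") is None:
            missing_sbo.append((name, elem.get("id")))
        if name == "species":
            for attr in ("initialConcentration", "initialAmount"):
                value = elem.get(attr)
                if value is not None and float(value) < 0:
                    negative.append((elem.get("id"), attr, float(value)))
    return missing_sbo, negative


# Hypothetical toy model: one annotated Hill coefficient, one bare reaction,
# and one species with an unrealistic negative concentration.
TOY = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfSpecies>
      <species id="S1" compartment="cytosol" initialConcentration="-1.0"/>
    </listOfSpecies>
    <listOfParameters>
      <parameter id="n" value="2" sboTerm="SBO:0000190"/>
    </listOfParameters>
    <listOfReactions><reaction id="r1"/></listOfReactions>
  </model>
</sbml>"""

missing_sbo, negative = recommendation_report(TOY)
print("without SBO term:", missing_sbo)
print("negative initial values:", negative)
```

The report only tells whether an SBO term is present; choosing the most specific term remains a manual decision.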

Naming and Annotation of Model Elements
This section provides a guideline for naming and annotation of elements in SBML models. Annotations for model elements have to follow the MIRIAM standard [9], [12]. Species can be annotated with respect to databases like UniProt for proteins, KEGG for proteins, genes, metabolites and cofactors, and ChEBI for cofactors, metabolites and small molecules. In conjunction with element annotation, the proposed naming scheme will facilitate model reuse and integration. Currently, there is no common naming scheme which is uniformly applied in systems biology models. The proposed naming conventions only consider species and compartments. Identifying reactions, parameters and mathematical expressions by name is not appropriate. Nevertheless, intuitive names will make the identification much easier.
The following sections will describe how compartments, unmodified and modified molecules and complexes with various stoichiometry of parts should be annotated and named. Furthermore some general annotation guidelines will be given.

Past and Related Work
Different guidelines and standards exist for the naming of genes and proteins, especially for enzymes. A reference collection can be found in [13]. Mostly, the nomenclatures are organism specific (e.g. for yeast, human or mouse) and consider only the basic forms of species (unmodified and not complexes composed of smaller identifiable parts). Besides systematic names, synonyms (often including short and common names) are provided in the databases we suggest for the annotation of species. We propose to use the short names because systematic and common names are often too long to be feasible in models.
Some software tools support species names that describe modifications and the composition of complexes, e.g. BioNetGen [14] and PottersWheel [15]. But if all molecules had to follow these rules, this would also lead to complicated names for simple molecules. With our naming conventions we try to support short names for simple species; the construction of names for complexes is influenced by the BioNetGen language. Naming for modified molecules is similar to SBGN [16].

Naming and Annotation of Species Types
For a better identification of species a combination of the machine-readable annotations and the human-readable species names should be used. The names are composed like the species themselves. The smallest unit is a basic species (described in 4.2.3) which is a simple molecule like a protein or a metabolite. Modified forms of these basic species are named modified species (see 4.2.4). A complex (4.2.5) is a combination of basic and/or modified species with defined stoichiometry.

rdf Annotation
Established databases, e.g. KEGG compound, UniProt for proteins, ChEBI for cofactors and metabolites have to be used for the annotation and naming of species. All allowed databases are listed on [17]. To prevent synonyms, the closest, most similar annotation for each basic species must be used.
In SBML each element can contain a set of machine-readable rdf annotations following the MIRIAM standard. A MIRIAM compliant rdf annotation is a relation from an SBML element (e.g. a species) to a database entry. Predefined qualifiers can be used for this relation, e.g. hasPart, hasProperty, isVersionOf and so forth. All allowed qualifiers are listed on [18].
The three species types (basic species, modified species, complexes) have to be annotated as follows:
• A basic species (unmodified molecule) has to contain exactly one annotation with the is qualifier.
• A modified species (basic species with covalent modification) has to contain exactly one annotation using the qualifier isVersionOf. Furthermore, for each type of modification one annotation with hasPart qualifier has to be specified. For this purpose the ChEBI or the KEGG Compound database should be used if possible (e.g. a phosphorylation can be annotated with the identifier: C00009 from the KEGG Compound database).
• Complexes must be assigned a set of annotations with the qualifier hasPart for the basic species and a set of annotations with the qualifier hasVersion for the modified species contained in the complex.
For example: For the phosphorylated receptor-ligand complex EGF.EGFR P, assign a hasPart annotation for the basic species EGF and a hasVersion annotation for the modified species EGFR P.
Besides the annotations with the qualifiers specified above, a species can contain additional annotations, e.g. isDescribedBy to link the species to the literature that describes its concentration. Furthermore, if there are species that describe the same substance but are located in different compartments, the location should be specified with an occursIn qualifier. This holds for the entire species. Every species should have only a single annotation for the location. If a species can be annotated and named either as a modified form or as a basic species, it must be expressed as a basic species, for example use 'ATP' instead of 'ADP P'. Furthermore, the primary ID from the database should be taken.
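The qualifier rules for the three species types can be written down as a small consistency check. The following sketch is one possible reading of the rules, not part of semanticSBML; the function name `check_species_annotations` and its arguments are hypothetical, and the qualifiers are assumed to have been extracted from the rdf annotation beforehand as plain strings.

```python
from collections import Counter


def check_species_annotations(kind, qualifiers, n_modification_types=0):
    """Check the qualifier names of one species against the rules for its type.

    kind: 'basic', 'modified' or 'complex'
    qualifiers: qualifier names found on the species, e.g. ['is', 'occursIn']
    n_modification_types: number of distinct modification types (modified species only)
    """
    counts = Counter(qualifiers)
    problems = []
    if kind == "basic":
        if counts["is"] != 1:
            problems.append("basic species needs exactly one 'is' annotation")
    elif kind == "modified":
        if counts["isVersionOf"] != 1:
            problems.append("modified species needs exactly one 'isVersionOf' annotation")
        if counts["hasPart"] != n_modification_types:
            problems.append("modified species needs one 'hasPart' annotation per modification type")
    elif kind == "complex":
        if counts["hasPart"] + counts["hasVersion"] == 0:
            problems.append("complex needs 'hasPart' and/or 'hasVersion' annotations")
    else:
        raise ValueError(f"unknown species kind: {kind!r}")
    # Every species should carry at most one location annotation.
    if counts["occursIn"] > 1:
        problems.append("only one location annotation is allowed")
    return problems


# The phosphorylated receptor-ligand complex: EGF as hasPart, modified EGFR as hasVersion.
print(check_species_annotations("complex", ["hasPart", "hasVersion"]))
# A basic species that was annotated twice with 'is'.
print(check_species_annotations("basic", ["is", "is"]))
```

Additional qualifiers such as isDescribedBy are ignored by the check, since they are allowed on every species type.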

SBO Annotation
Besides rdf annotations, SBML elements can contain annotations with SBO terms [19], [20], [21]. SBO terms can be used in SBML L2V3 and higher versions. They should be assigned to SBML elements whenever they can be applied, because this may simplify decisions in the manual adjustment of the initial matching based on rdf annotations. We recommend assigning SBO terms to reactions, parameters, math expressions and compartments (e.g. a phosphorylation reaction should be assigned the SBO term SBO:0000216). The ontology can be browsed at [8].

Basic Species
The name of a basic species should be informative for other modelers and describe the intended meaning of simple molecules. These basic species can be proteins, mRNA, genes, cofactors, metabolites or other small molecules.
Existing standards and conventions have to be applied whenever possible, e.g. for gene and protein names. If available, it is recommended to use either the short name from the database entry that is also used for the annotation, or an explicit and concise name that is commonly used in the literature. If several isoforms exist, the name has to reflect which isoform is meant. If one substance that occurs in several compartments is modeled as different species, the names of these species would be equal. For these cases we propose to precede the species name by the compartment name. This prefix has to be separated from the name by a tilde. This holds for the complete species. For example, use 'cyt~ATP' for cytosolic ATP. Abbreviations for the compartment names are given in Table 3. Furthermore, context-independent names have to be used, i.e. the name should be understandable independent of the model context. Thus use the protein name instead of the role, e.g. 'Mapk1' instead of 'kinase' (receptor, ligand, etc.). Avoid references to the taxonomic species, e.g. human, mouse, rat, within the name. The taxonomy information has to be available in the model annotation (see Section 2). Furthermore, the names for basic species must be used consistently throughout the model, i.e. if the molecule also occurs in modified form or in complexes, the basic species name must be part of the names of these molecules.
• No formatting (e.g. italic, bold or underlined characters), accents, superscripts or subscripts are allowed within the name field of SBML. To distinguish between a gene and a protein SBO terms or an optional prefix can be used (see below).
• Take care of case sensitivity: Capitalization has to follow the conventions for protein or gene symbols, e.g. all uppercase for proteins of human, mouse or rat. For human genes all letters must be in uppercase. For mouse and rat genes the first letter must be uppercase and the remaining letters in lowercase.
• Names should be given in English language. If different spellings exist, American spelling must be used.
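The capitalization rule in the list above is mechanical enough to be applied automatically. A minimal sketch follows; the function name `format_symbol` and its arguments are hypothetical, and only the three organisms named in the rule are covered.

```python
def format_symbol(symbol, organism, kind):
    """Apply the capitalization convention to a gene or protein symbol.

    organism: 'human', 'mouse' or 'rat'
    kind: 'protein' or 'gene'
    """
    organism, kind = organism.lower(), kind.lower()
    if organism not in ("human", "mouse", "rat"):
        raise ValueError(f"no convention stated for organism {organism!r}")
    if kind == "protein":
        # Proteins of human, mouse or rat: all uppercase.
        return symbol.upper()
    if kind == "gene":
        if organism == "human":
            # Human genes: all uppercase.
            return symbol.upper()
        # Mouse and rat genes: first letter uppercase, the rest lowercase.
        return symbol[:1].upper() + symbol[1:].lower()
    raise ValueError(f"unknown kind {kind!r}")


print(format_symbol("egfr", "human", "gene"))
print(format_symbol("MAPK1", "mouse", "gene"))
print(format_symbol("Mapk1", "rat", "protein"))
```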
Material type and conceptual type: If the SBO annotation field of the species is not needed for any other information the type of a species (e.g. protein or mRNA) has to be annotated with SBO, see Table 1. Otherwise the type can be encoded in the basic species name as prefix using controlled vocabulary. To distinguish the type from the name for the basic species a defined syntax has to be used. The prefix has to be composed of 'mt' or 'ct' for material or conceptual type, separated by ':' from the label, see Table 1. The type and the remaining basic species name are separated by ' '. Examples are 'mt:prot ERK', 'mt:prot TIM' or 'ct:mRNA TIM'. The distinction between material type and conceptual type and the labels are introduced in the SBGN specifications [22].
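A type prefix of this form can be split off a species name with a short regular expression. The sketch below is an illustration only: the function name `split_type_prefix` is hypothetical, the separator between type and name is taken as it is printed in this text (a single space) and can be overridden, and because Table 1 is not reproduced here any alphanumeric label is accepted.

```python
import re

# 'mt' or 'ct', a colon, then a label; the labels of Table 1 are not checked.
TYPE_PREFIX = re.compile(r"^(mt|ct):([A-Za-z0-9]+)$")


def split_type_prefix(name, sep=" "):
    """Return (type_kind, label, remaining_name); both type fields are None without a prefix."""
    head, found, rest = name.partition(sep)
    match = TYPE_PREFIX.match(head)
    if found and match:
        return match.group(1), match.group(2), rest
    return None, None, name


print(split_type_prefix("mt:prot ERK"))
print(split_type_prefix("ct:mRNA TIM"))
print(split_type_prefix("ERK P@T188"))
```

Names without a prefix are returned unchanged, so the function can be applied to every species name in a model.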

Modified Species
Names for modified species contain the name of the basic species and any modifications present. For each modification, the kind of modification has to be specified and, optionally, the site that is modified. Specifying sites makes it possible to distinguish between species that carry the same modification at different sites. For example, ERK phosphorylated at Y190 and ERK phosphorylated at T188 each carry exactly one phosphorylation. To add the information about a modification, a special syntax has to be applied. A modification is added to the basic species name as a suffix separated by ' '. Further modifications can also be added, separated by ' '. The label characterizes which kind of modification is present. Possible labels are taken from the SBGN specification [22], see Table 2. An example of a modified species name is 'EGFR P P'. A modification can also describe a conformation or the state of a domain. Additional labels for states or covalent modifications can therefore be defined, but they need to be documented. Values for state labels may be 'active', 'inactive', 'open' or 'closed', e.g. 'ERK active'. Labels are not allowed to start with a number, so that modifications can be distinguished from stoichiometry within a complex.
Modification sites: The modification site specifies the location of the modification. Each modification label can be followed by a site, separated by '@'. This syntax is similar to the 'state@variable' notation for modifications in SBGN. Only one identifier must be used for each site. Avoid synonyms for the same site, e.g. 'ERK P@T188', 'ERK P@T' or 'ERK P'; instead, use only one of them consistently within the whole model. 'ERK P@T188' is recommended, as it characterizes the modification best while still being reasonably short.
It is recommended that site identifiers include the domain, amino acid, nucleic acid or chemical symbol together with the position in the sequence. This provides context for the position information and reduces ambiguity. Examples are 'Fructose P@C1 P@C6' or 'ERK P@T188'.
Names of modification sites are composed of ASCII letters and digits. Site identifiers that start with a number or consist only of a number are valid, e.g. 'Fructose P@1 P@6'.
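The syntax for modified species can be illustrated with a small parser. This is a sketch under stated assumptions: the function name `parse_modified_name` is hypothetical, the separator is the single space printed in this text (override `sep` if a different character is used), and the name is expected to carry neither a type prefix nor a compartment prefix and not to be a complex.

```python
def parse_modified_name(name, sep=" "):
    """Split a modified species name into (basic_name, [(label, site_or_None), ...])."""
    base, *tokens = name.split(sep)
    modifications = []
    for token in tokens:
        # A site, if given, follows the label after '@'.
        label, _, site = token.partition("@")
        if not label or label[0].isdigit():
            # Labels must not start with a number; a leading digit would be
            # read as stoichiometry within a complex instead.
            raise ValueError(f"invalid modification label in {token!r}")
        modifications.append((label, site or None))
    return base, modifications


print(parse_modified_name("ERK P@T188"))
print(parse_modified_name("EGFR P P"))
print(parse_modified_name("Fructose P@1 P@6"))
```

Comparing the parsed sites of all species in a model is one way to spot synonyms such as 'T188' and 'T' for the same site.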

Complexes
Names for complexes are composed of the names of all contained modified and basic species. Names for modified species are specified as given in the previous section. The separator between the parts of a complex is '.' (as in the BioNetGen language). For example, 'EGF.EGFR' describes the complex between EGF and the EGF receptor. The parts of a complex have to be concatenated in alphabetical order.
Modifications or different states of the whole complex can be specified as a suffix to the complex, which is then enclosed in parentheses. However, it is recommended to express modifications of the complex as modifications of its parts. For example, use 'EGF.EGFR P' instead of '(EGF.EGFR) P'.
If the same part occurs multiple times in a complex, there are two ways to express this in the species name. The part can be listed multiple times, or the stoichiometry can be specified as a positive integer separated by ' ' from the part. For example, the EGF receptor dimer can be written as 'EGFR.EGFR' or 'EGFR 2'. A stoichiometry of one is the default and need not be given. Stoichiometry can also be given for more than one part; for example, the dimer of the EGF receptor with two ligands bound can be specified as '(EGF.EGFR) 2' or 'EGF 2.EGFR 2'.
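The rules for complex names (alphabetical order, '.' as separator, optional stoichiometry) can be combined into one canonical form. The sketch below chooses the stoichiometry notation over listing a part several times; the function name `complex_name` is hypothetical, the separator before the stoichiometry is the single space printed in this text, and alphabetical order is approximated by Python's default string ordering, which is case sensitive.

```python
from collections import Counter


def complex_name(parts, sep=" "):
    """Build a complex name from the names of its parts (repeats allowed)."""
    counts = Counter(parts)
    pieces = []
    for part in sorted(counts):
        n = counts[part]
        # A stoichiometry of one is the default and is left out.
        pieces.append(part if n == 1 else f"{part}{sep}{n}")
    return ".".join(pieces)


print(complex_name(["EGFR P", "EGF"]))
print(complex_name(["EGFR", "EGFR"]))
print(complex_name(["EGF", "EGFR", "EGF", "EGFR"]))
```

Generating names this way ensures that the same complex receives the same name in both models, regardless of the order in which its parts were entered.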

Naming and Annotation of Compartments
Machine-readable rdf annotations (explained in 4.2.1) and appropriate names are also essential to identify compartments. All compartments of the model have to be annotated with the Gene Ontology (GO) [23]. The allowed annotations are listed in Table 3. The corresponding names for the compartments have to be the same as defined in GO: membrane, cytosol, nucleus, vacuole, mitochondrion, extracellular (a synonym for the extracellular space). Note that cell is not a valid compartment. If no compartment is defined, the default value is cytosol.
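The compartment rules can be captured in a few lines. In the sketch below the function names `compartment_name` and `split_compartment_prefix` are hypothetical; the allowed names are those listed above, and because the abbreviations of Table 3 are not reproduced here, the prefix of a species name is only split off, not validated.

```python
ALLOWED_COMPARTMENTS = (
    "membrane", "cytosol", "nucleus", "vacuole", "mitochondrion", "extracellular",
)


def compartment_name(name=None):
    """Return a valid compartment name; an undefined compartment defaults to cytosol."""
    if name is None:
        return "cytosol"
    if name not in ALLOWED_COMPARTMENTS:
        raise ValueError(f"{name!r} is not an allowed compartment name")
    return name


def split_compartment_prefix(species_name):
    """Split 'cyt~ATP' into ('cyt', 'ATP'); names without a prefix yield (None, name)."""
    # Treat the tilde operator U+223C, which some typeset versions use, like '~'.
    normalized = species_name.replace("\u223c", "~")
    prefix, found, rest = normalized.partition("~")
    return (prefix, rest) if found else (None, normalized)


print(compartment_name())
print(split_compartment_prefix("cyt~ATP"))
```

Calling `compartment_name("cell")` raises a `ValueError`, in line with the note that cell is not a valid compartment.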

Naming and Annotation of Reactions and Parameters
The naming and annotation of reactions should also be done as precisely as possible. We propose to use a descriptive name of the process as the name of the SBML element (e.g. 'MK2 phosphorylation'). Frequent processes and the recommended rdf and SBO annotations are given in Table 4. For parameters we also recommend using precise descriptive names and the SBO annotation SBO:0000002.