# Learning stochastic process-based models of dynamical systems from knowledge and data

- Jovan Tanevski
^{1, 2}Email authorView ORCID ID profile, - Ljupčo Todorovski
^{3}and - Sašo Džeroski
^{1, 2}

**10**:30

https://doi.org/10.1186/s12918-016-0273-4

© Tanevski et al. 2016

**Received: **1 December 2015

**Accepted: **6 March 2016

**Published: **22 March 2016

## Abstract

### Background

Identifying a proper model structure, using methods that address both structural and parameter uncertainty, is a crucial problem within the systems approach to biology. And yet, it has a marginal presence in the recent literature. While many existing approaches integrate methods for simulation and parameter estimation of a single model to address parameter uncertainty, only few of them address structural uncertainty at the same time. The methods for handling structure uncertainty often oversimplify the problem by allowing the human modeler to explicitly enumerate a relatively small number of alternative model structures. On the other hand, process-based modeling methods provide flexible modular formalisms for specifying large classes of plausible model structures, but their scope is limited to deterministic models. Here, we aim at extending the scope of process-based modeling methods to inductively learn stochastic models from knowledge and data.

### Results

We combine the flexibility of process-based modeling in terms of addressing structural uncertainty with the benefits of stochastic modeling. The proposed method combines search trough the space of plausible model structures, the parsimony principle and parameter estimation to identify a model with optimal structure and parameters. We illustrate the utility of the proposed method on four stochastic modeling tasks in two domains: gene regulatory networks and epidemiology. Within the first domain, using synthetically generated data, the method successfully recovers the structure and parameters of known regulatory networks from simulations. In the epidemiology domain, the method successfully reconstructs previously established models of epidemic outbreaks from real, sparse and noisy measurement data.

### Conclusions

The method represents a unified approach to modeling dynamical systems that allows for flexible formalization of the space of candidate model structures, deterministic and stochastic interpretation of model dynamics, and automated induction of model structure and parameters from data. The method is able to reconstruct models of dynamical systems from synthetic and real data.

## Keywords

## Background

Most systems in biology exhibit dynamical behavior. Their properties change as a function of time and space in a complex manner. Considering a dynamical biological system to be a well-stirred mixture of its constituents, the most commonly used mathematical model of its dynamics takes the form of a system of coupled ordinary differential equations, treating the entity properties as continuous and assuming they evolve deterministically through time. However, the deterministic nature of ordinary differential equations renders them inadequate for systems with a small number of copies (only few orders of magnitude above one) of its constituents. Furthermore, ordinary differential equations fail to account for the underlying stochasticity of natural systems [1, 2]. In molecular systems, stochastic fluctuations are responsible for the divergence in phenotype and genetic activities [3–5]. In such cases, models based on stochastic kinetics are more suitable, as they allow for treating of the modeled systems as either discrete or continuous in terms of the properties of the observed entities and stochastic in terms of the reactions between them.

Establishing a deterministic or a stochastic model of an observed biological system is an omnipresent and often complex, tedious task. This task comprises the two subtasks of structure identification, i.e., selecting an appropriate model structure, and parameter estimation, i.e., determining values of the model parameters that, together with the selected structure, lead to accurate reconstruction of the observed system behavior. While many existing approaches integrate methods for simulation and parameter estimation of a single model, only few of them provide support for the task of structure identification [6, 7]. In this paper, we design and implement a computational tool that can deal with uncertainty in both model structure and the values of model parameters for both deterministic and stochastic models. The central component of our tool is the process-based modeling formalism that allows for modular, compositional specification of the space of candidate model structures.

The formalisms of differential equations allow for encoding models with all the details needed for their execution, i.e., simulation of the behavior of the corresponding dynamical systems. Ordinary differential equations are limited to deterministic model interpretation, while stochastic differential equations are used for stochastic modeling. Differential equations are models at the lowest abstraction level, where every detail has to be fully specified and are used to encode a single model; on their own, they allow neither for parameter nor structural uncertainty.

At a higher abstraction level, the models in the domain of biology are often casted in the formalism of reaction equations. Following this formalism, the biological system is described as a reaction network. When coupled with appropriate kinetic rates, the model defines a network of possible transitions between system states. Reaction equations allow for both deterministic and probabilistic interpretation stemming from the propensity of each reaction [9].

The systems biology markup language, SBML [10], is a standard modeling formalism in system biology. It allows for encoding and exchange of individual models based on ordinary differential equations or reaction equations. Like equation-based formalisms, SBML focuses on encoding a single model structure with parameter uncertainties and does not support the specification of structural uncertainties.

Furthermore, a number of formalisms have emerged that deal with the issue of combinatorial complexity, i.e., the exponential complexity of the space of the combinations of elementary interactions between the entities observed in a given biological system. These formalisms allow for specifying rules (constraints) that limit the space of potential interactions between entities based on their properties. Note that the encoded constraints do not address the issue of structural uncertainty: Their application in the context of a given observed system leads to a single model structure. There are several classes of such formalisms.

The first group of rule-based (also referred to as interaction-based) languages, most notably BioNetGen [11] and kappa [12], define the constituent entities of a system at the level of objects with different properties. The network of interactions between the system entities is implicitly described by a set of rules that transform properties or create new entities (by forming complexes of existing entities). By defining the rules directly on the properties, the rule-based modeling approach efficiently deals with the problem of combinatorial complexity, which may arise when modeling protein-protein networks within complex signalization pathways. The rules are encoded using formalisms based on reaction equations.

The second group of agent-based formalisms, which includes process algebras [13, 14], model individual entities as agents in a complex system that act according to a set of predefined rules for communication with other agents. The process algebras describe the behavior of each agent through processes describing the inter-agent communications via different channels. A biological system described using process calculi is treated as a constrained distributed system of communication. This formal description allows for more detailed representation of the basic principles of interaction. Examples of process algebra extensions that have been adapted to and are being used in the domain of biology are the stochastic pi-calculus [15], Bio-PEPA [16] and beta binders [17].

Related to the process algebras group, the formalisms in the third group are based on constraint programming [18]. In contrast to the process calculi, the constraint programming approaches allow for defining interactions not only through specific communication channels, but by concurrently posting global constraints on the properties of the agent entities.

The limitation that is common to all aforementioned formalisms is that they can not properly represent the structural uncertainty. Uncertainty in parameter values is typically addressed by various formalism extensions that are complementary to the computational tools that offer support for them. COPASI [19] is an example of a such a tool that allows for introducing uncertainties in model parameter values and performing parameter estimation for models based on equations. The MathWorks SimBiology toolbox [20] is a proprietary software for modeling and analysis of dynamical systems in biology providing features similar to the ones of COPASI. Both tools provide a range of methods for the analysis of models (e.g., sensitivity and identifiability of model parameters), but do not provide computational methods for addressing structural uncertainty; users can only perform manual comparative analysis of different model structures.

Network inference methods [21] explicitly address structural uncertainty: most often, given gene expression data, the methods seek for a network of interactions between the observed genes. Since these methods focus on the structure of the observed network of interactions, they seldom deal with the reconstruction of the dynamical behavior of the observed system. Several methods are exception to this general rule and cast the reconstructed networks into the formalism of ordinary differential equations [21–24]. In contrast to the process-based modeling approach presented in this paper, these methods are limited to deterministic models. Furthermore, these methods follow the assumption that the same interaction dynamics applies to all of the network interactions: the process-based modeling formalism can encode different classes of model structures (interactions/processes) with different assumptions about the interaction dynamics. Finally, when it comes to constraining the space of possible model structures, some methods employ data-driven heuristics [22, 25], while some of them additionally limit the search for plausible structures based on the interactions already documented in the literature [22, 24]. The method by Wahl et al. [22] also allows for user-defined Boolean constraints specifying implausible network interactions.

Finally, ABC-SysBio [6] is most closely related to the process-based modeling approach presented here. It builds on SBML and addresses structural uncertainty by allowing the user to explicitly enumerate the alternative model structures. The process-based modeling formalism the we propose addresses exactly this limitation of the existing formalisms, i.e., the ability to properly address structural uncertainty. It allows for modular and flexible specification of the space of candidate model structures to be considered in the modeling process. Instead of specifying a fixed list of candidates, like ABC-SysBio [6], our formalism allows users to specify model components, which are then used in a compositional modeling setting, where combinations of components correspond to candidate model structures. Thus, we approach the structure identification task as a search problem [26], where we search for the most appropriate combination of model components.

This paper builds upon our previous work on inductive process-based modeling that combines knowledge and data to automatically build explanatory models of dynamical systems [27–30]. While inductive process-based modeling has been successfully applied to modeling tasks in the domain of systems biology [7, 31, 32], its scope has been limited to building deterministic models of dynamical systems cast as ordinary differential equations.

Here, we extend the scope of the process-based modeling formalism to models cast as reaction equations, hence the arrow in the top-right corner of Fig. 1. In this way, we combine the benefits of process-based modeling (in terms of addressing structural uncertainty) with the benefits of different model interpretations (including the stochastic interpretation). Finally, the formalism is implemented within a computational tool ProBMoTs for automated induction of models that combines domain knowledge represented in our formalism with measurements of the observed system behavior.

In the remainder of the paper, we first introduce the process-based modeling formalism, its extensions towards handling stochastic models of biochemical systems and the computational tool for process-based modeling, ProBMoTs. We present then two examples of use of the proposed computational tool, i.e. modeling gene regulatory networks and modeling the spread of pathogens, illustrating the use of the proposed tool and evaluating its utility. Finally, we discuss the results of the evaluation, put them in the context of existing work and outline directions for further research.

## Methods

In this section, we introduce the notion of process-based models and a formalism for their representation. We illustrate the formalism use on an example of encoding knowledge for modeling gene regulatory networks and a process-based model of a specific network, the repressilator [33]. We then introduce methods for inducing process-based models from knowledge and data by selecting appropriate model structures and parameters.

### Process-based modeling

Scientists often describe dynamical systems in terms of processes that govern the system dynamics and the entities involved in the processes^{1}. Following this high-level model description, modelers assign lower-level detailed equation-based specifications of the dynamics to individual processes and combine them into a system of coupled differential equations. The differential equations can be in turn used to simulate the behavior of the observed system or to extrapolate the simulation and predict future system behavior. However, by transforming the high-level model description into equations, its explanatory power is lost, since the equations fail to reveal (in an accessible manner) the structure of the observed system in terms of the interacting entities and processes.

Process-based modeling (PBM) clearly relates a high-level model description (entities and processes), that carries significant explanatory power, and a lower-level mathematical model (equations), that allows for simulation and prediction. To build process-based models, we first formalize the modeling knowledge by establishing templates of generic (template) entities that appear in the generic (template) processes that govern the dynamics of systems in the particular domain. Each process-based model then refers to these template components and instantiates them into specific components of the studied system.

Existing process-based formalisms rely on a coarse description of dynamics, based on fragments of differential equations. The formalism introduced in this paper relies on reaction equations, which are closer to the basic principles of system biology and are more comprehensible to biologists. A reaction equation *R*
_{
s
}→*P*
_{
s
} [*rate*] specifies a set of reactants *R*
_{
s
} and a set of products *P*
_{
s
}, as well as the reaction *rate*. Reaction equations are a powerful and flexible formalism for modeling the temporal evolution of dynamical systems.

#### Representation of modeling knowledge

*gene*, whose instances represent nodes in gene regulatory networks. We assume gene entities to represent protein-coding genes and describe them using five numerical properties. The variable properties (vars section of the entity specification) denote two gene properties that change through time: Pmol is the number of encoded protein molecules and mRNAmol is the number of mRNA transcripts. The other three properties do not change over time; they denote the constant kinetic rates of the uncontrolled gene expression alpha0, the translation of mRNA into proteins and their degradation beta, as well as the mRNA molecules degradation delta.

Templates of entities and processes for modeling gene regulatory networks. The template entity *gene* typifies network nodes, while the process templates represent gene regulation, as well as translation and protein degradation processes. The empty set symbol *∅* denotes the absence of reactants or products

Furthermore, the library specifies templates for modeling the processes of gene interaction, gene translation into proteins, and protein degradation. The degradation template specifies two reaction equations that correspond to the degradation of the encoded protein molecules with the kinetic rate of g.beta (i.e., the degradation kinetic rate for the particular gene g) and the degradation of the mRNA molecules with the rate of g.delta. Similarly, the translation process integrates the reaction equations of the gene transcription to mRNA and the mRNA translation to protein molecules.

Finally, the regulation process template represents gene interactions via their protein products. It has two mutually exclusive alternatives of activation and inhibition. The first corresponds to the case where one gene increases the transcription rate of the other, while the second alternative models repression, where one gene decreases the transcription rate of the other by binding the source gene protein to the promoter region of the repressed gene. In both cases, the reaction rate (specified between the brackets) is modeled using a Hill function, derived as a steady-state approximation of the biochemical kinetics [34].

The templates from Table 1 represent generic knowledge on modeling gene regulatory networks. They can be instantiated to entities and processes of an arbitrary network model. Note the hierarchical structure of the regulation template process: it constrains the space of instantiations by rendering the two subordinate templates of activation and inhibition mutually exclusive. This reflects the simple fact that only one regulation type applies to a given pair of genes. In the following, we will illustrate the use of this knowledge for modeling a simple regulatory network.

#### Process-based models

*E. coli*has been proven to exhibit the desired behavior. The three genes involved are TetR, often used for fine regulation in synthetic gene networks, and two repressor genes, cI and LacI.

A process-based model of the repressilator built using the templates for modeling gene regulatory networks from Table 1

The process-based model retains the understandability of the graphical model representation and provides a clear, high-level insight into the structure of the studied system. At the same time, by using the detailed knowledge of the reaction equations encoded in the templates, we can automatically translate the high-level description into a mathematical model and use it for simulation and analysis. Consider the process translation1 in Table 2: By combining it with the template translation from Table 1, we instantiate a set of two reaction equations modeling the uncontrolled transcription of *TetR* to mRNA ( ∅ -> TetR.mRNAmol with a kinetic rate of TetR.alpha0 = 0) and the translation of the mRNA to the TetR protein molecules (TetR.mRNAmol -> TetR.Pmol + TetR.mRNAmol with a kinetic rate of TetR.beta = 9.75).

*x*(

*t*), a vector of the number of molecules of each reactant at time

*t*. The repressilator state includes six variables: TetR.Pmol, TetR.mRNAmol, LacI.Pmol, LacI.mRNAmol, cI.Pmol and cI.mRNAmol. In any given state

*x*, we can calculate the propensity, i.e., the probability that the reaction

*R*

_{ j }will be active in the infinitesimal time interval [

*t, t*+

*d*

*t*), using the formula

*a*

_{ j }(

*x*)=

*c*

_{ j }

*h*

_{ j }(

*x*)

*d*

*t*, where

*c*

_{ j }denotes the reaction rate and

*h*

_{ j }(

*x*) denotes the number of distinct combinations of reactant molecules in state

*x*.

List of reaction equations stemming from the process-based model of the repressilator from Table 2

*P*(

*x,t*|

*x*

_{0},

*t*

_{0}) that the system is in a state

*x*at a given time

*t*, given the initial state

*x*

_{0}at time

*t*

_{0}, can be then defined using the following ordinary differential equation (also known as the Master Equation) [35]:

for *d*
*t*→0, where *ν*
_{
j
} is a vector specifying the changes of the number of reactant molecules after the reaction *R*
_{
j
}. We can then model the system dynamics using coupled differential equations, where each equation models the probability that the system state equals a unique combination of values of the state variables *x*.

For real biological systems, the Master Equation is too complex to be solved analytically or numerically. To this end, alternative approaches to estimating the exact or approximate probabilities have been developed. One of the most popular exact approaches is based on Monte Carlo sampling and is known as the Stochastic Simulation Algorithm (SSA) proposed by Gillespie, where others include the Gibson-Bruck method of next reaction and the class of *τ*-leaping methods [9].

If we assume that the propensity does not significantly change in infinitesimal time intervals and that the expected number of firings of each reaction is significantly large (i.e., the number of reactant molecules is large compared to the probability rate constant), we can derive the Langevin Equation. It represents a mathematical model of the reaction equations cast in terms of coupled Itō stochastic differential equations [36]. These stochastic differential equations can further be reduced to ordinary differential equations, under the assumption that we observe a negligible amount of noise in a system with a large number of reactants.

*TetR*molecules obtained by simulating the reaction equations (left-hand side) and the system of ordinary differential equations (right-hand side) inferred from the process-based model of the repressilator.

To summarize, process-based models have four important properties that make them particularly suitable for modeling dynamical systems. First, they retain the *understandability* and explanatory power of graphical model representations by providing clear insight into the structure of the observed system. At the same time, they inherit the *utility* of mathematical models for simulation and analysis of system behavior. Third, process-based models provide *general* model descriptions that support both stochastic and deterministic approaches to modeling, simulation and analysis. The fourth property is the *modularity* provided by the knowledge representation formalism: the templates can be instantiated into a number of model components. This last property is particularly relevant for the algorithms that induce process-based models from data.

#### Inducing process-based models

The formalized knowledge on modeling gene regulatory networks brings another benefit. It represents a source of constraints that limit the space of candidate model structures to be explored when modeling a particular gene regulatory network. Consider the repressilator model again and assume that we are only provided with information that it involves the three genes of *TetR*, *LacI* and *cI*. Now we can infer all the instances of the process templates from Table 1: the degradation process template that involves one gene, leads to three process instantiations, one for each gene. Similarly, the translation template leads to three processes. Finally, each pair of genes results in one instance of the activation and one instance of the *inhibition* template. Thus, for the three repressilator genes, we obtain six instances of the *activation* and six instances of the *inhibition* template. In sum, the three repressilator genes lead to 18 process instances.

Each of the process instances represents a valid model component. Following a naïve approach, one can consider any subset of components as a legitimate model structure, which yields 2^{18}=262,144 candidates. However, these include many implausible models, e.g., ones that do not include gene translation for some of the genes. To avoid implausible models, the inductive process modeling approach relies on the use of constraints that limit the ways model components are combined. For example, a constraint ruling out models that do not include translation and degradation processes for all the genes, reduces the search space to 2^{12}=4096 candidates. Furthermore, the constraint specifying the mutual exclusivity of the activation and inhibition processes for a given ordered pair of gene entities further reduces the number of candidates to 3^{6}=729 (for each of the six possible pairs of repressilator genes, we consider three modeling alternatives: absence of regulatory influence; activation; and inhibition).

An incomplete process-based model of a gene regulatory network specifying the model structures as depicted in Fig. 5

Finally, the inductive process-based modeling approach validates each candidate model structure by matching its simulation against the observed system behavior. In order to simulate the model (and assess its quality), we first have to determine the values of its constant parameters. To this end, we employ parameter estimation and find parameter values that lead to a model reproducing the observed behavior as closely as possible. We formulate the parameter estimation task as an optimization problem: We aim at minimizing an objective function that measures the goodness of fit of the model simulation to the observed behavior using the maximum-likelihood estimator [37].

Different implementations of the induction algorithm make different design choices. In the following, we provide a brief overview of the different implementations: A detailed overview is given by Džeroski and Todorovski [7]. Lagramge 2.0 [38] transforms the library and constraints into a grammar that enumerates candidate model structures. IPM [28] takes a naïve approach and uses constraints on the number of components involved in the model to address combinatorial explosion. HIPM [29] encodes the constraints into a hierarchy of process templates and approaches enumeration as a combinatorial search problem. SCIPM [39] explicitly encodes the constraints and approaches the enumeration using constraint satisfaction methods. Finally, ProBMoT [40] extends HIPM with explicit constraints referring to the particular system at hand and meta-heuristic optimization methods for parameter estimation.

Note, however, that the above inductive process-based modeling approaches have limited their focus on inducing deterministic models cast as ordinary differential equations. ProBMoTs, our extension of ProBMoT presented in this paper that allows for inducing stochastic models of dynamical systems cast as reaction equations. The extension is based on the novel formalism for encoding a library of components that supports the specification of reaction equations as models of individual processes. ProBMoTs also integrates standard simulators for reaction equations [41].

Both ProBMoT and ProBMoTs are released as open-source software packages available for download at http://probmot.ijs.si
^{2}.

### Experimental setup and model selection

To evaluate the algorithm for inducing stochastic process-based models, we apply it to several problems of modeling dynamical behavior of biological systems at different scales. We consider two synthetic modeling problems from the domain of gene regulatory networks and two real modeling problems from the domain of epidemiology. In each domain, we first encode process-based knowledge for modeling dynamical systems. In this paper, our focus is limited to encoding domain knowledge in two domains, covering models on fundamentally different scales. Note, however, that the process-based modeling approach can be applied to other domains as well, given that modeling knowledge about the domain of interest is encoded as a library of entity and process templates. For example, when modeling metabolic networks, the central entity templates will represent enzymes and metabolites, while process templates would represent different metabolic reactions (with different kinetics), formulating different models of the dynamical interactions among them. For further examples of domain knowledge libraries for process-based modeling, we refer the reader to the ProBMoT web site. Second, for the synthetic modeling problems, we select a target model and simulate it to obtain a data set for inducing models. On the other hand, the real modeling problems come with data sets of measured system behavior. Third, for each modeling problem, we define an ordered list of plausible model structures *P*. For the synthetic problems, this list includes the target model only, while for the real modeling problems, it includes all the structures of the models that have been reported in the literature as plausible explanations of the measurements. Note that for all problems, the list of candidate models considered by the induction algorithm includes all the model structures from the list *P*.

To perform induction, for each modeling problem we run ProBMoTs using the corresponding modeling knowledge (including the constraints) and the data set as inputs. Recall that the modeling knowledge defines the space of candidate model structures. The values of the model parameters are estimated by using the Differential Evolution method [42] with the recommended parameter settings: crossover probability of 0.9, differential weight of 0.8, population size 50 and the rand/1/bin strategy. We set the number of evaluations of the objective function to 1000 times the number of constant model parameters. To assess the stability of the parameter estimator, we use 10 restarts of the Differential Evolution method. For simulating the reaction equations, ProBMoTs employs the Gillespie direct method [9] to obtain 20 realizations.

*K*in the equation) are averaged just as the observed data:

where *m* denotes the model, *i* iterates over the observed variables *x*
_{
i
} and *k* iterates over the realizations, where \( \widehat {x}_{i}^{k} \) denotes the *k*-th realization of *x*
_{
i
}, and *N* is the number of observed time points.

Recall that the result of ProBMoTs is a list of models ranked with respect to their descending fit against the measured data, in our case, ascending model error. The trivial model selection strategy would be to select the model with the optimal value of the objective function. Note, however, that error-based estimates of model performance tend to overfit observations, a problem especially relevant in the context of noisy experimental data. To address the problem of overfitting, we use an alternative model selection approach that introduces a penalty for model complexity, measured as the number of reaction equations in the model. To additively combine the model complexity and the degree of fit into a single score, we normalize both to the [ 0,1] scale. The normalization is based on the minimal and maximal values of the degree of fit and complexity, respectively, over all the candidate model structures considered by ProBMoTs.

Finally, to evaluate the results of induction, we compare the list of selected models to the list of plausible models *P* using a triple of metrics (recall, hit, plateau_size). The recall is the proportion of the plausible models in the first plateau of the error profile. The indicator hit tells us whether the first plateau contains the first model structure in *P*. The size of the plateau (plateau_size) indicates the discriminative power of the induction method: the smaller the plateau, the larger the discriminative power. The ideally performing induction method would lead to the triple (100 %, true, |*P*|).

## Results

In this section, we present the results of the evaluation of ProBMoTs on the four problems of inducing stochastic process-based models from knowledge and data. The first two are from the domain of gene regulatory networks, the other from the domain of epidemiology.

### Gene regulatory networks

We first address the task of modeling the simple gene regulatory network of the repressilator, introduced in the previous section. We select the model from Table 3 as a target model and set the list of plausible model structures *P* to contain a single structure that corresponds to the target model. We then perform two experiments. In the first, we assume that the kinetic rates in processes belonging to a single class of regulatory processes (degradation, translation and regulation) have the same values. To this end, we restructure the library of templates to introduce an global template entity that declares the global kinetic rates, which are then used by the process templates. In the second experiment, we perform induction without the assumption of global kinetic rates and therefore use the library of templates as presented in the previous section.

#### Global kinetic rates

The model of the repressilator considered here has been already addressed in other studies [6, 43]. Note, however, that both studies address only the task of parameter estimation from synthetic data assuming a single model structure. In our experiment, we also aim at identifying the structure of the model. We select the single model structure used in previous studies as our target and use the following values of the global kinetic rates: (*a*
*l*
*p*
*h*
*a*0,*alpha, beta, delta, n*)=(0.0,250.0,5.0,1.0,2.1). To obtain experimental data, we average 20 realizations of the target model in the time interval *t*∈[ 0,35]. Accordingly, we use the *RMSE*
_{
AR
} objective function.

^{6}=729 possible model structures, one of which is the target model structure of the repressilator.

#### Local kinetic rates

To test the robustness of our method, we remove the assumption of global kinetic rates from the modeling scenario. Thus, we forget the changes we made to the library in the previous experiment and use the library as described in Table 1. Other than the different formalization of the domain knowledge, given the relaxed assumptions, the task remains the same: we use the same target model, the data set, the objective function (*RMSE*
_{
AR
}) and the list of plausible models as in the first experiment. The relaxed assumptions lead to an explosion in the parameter space, while the structure space remains the same. We want to test whether (and how) the relaxed modeling assumption will influence (deteriorate) the results of ProBMoTs.

### Compartmental epidemiological models

In the domain of epidemiology, we first formalize the knowledge to be used for establishing stochastic models, using the basic principles of compartmental modeling as presented by Brauer et al. [44]. There, the spread of disease is modeled by the flows of individuals between healthy and infected populations, referred to as compartments. Each flow is modeled using a reaction equation, where reactants and products correspond to compartments.

At the point of introduction of a pathogen in the population, the entire population can be considered to consist of susceptible individuals (in the compartment S), except for the individuals by whom the pathogen is introduced. From this point on, we can observe different processes of flow between compartments. One way to model the infection of individuals is to assume that all infected individuals manifest the disease symptoms. In this case, the A compartment is not populated. An alternative, more complex, model assumes that we can also have infected individuals that do not manifest the symptoms. In both cases, the infection might cause a direct flow from S to I (and/or A) or indirect flow through the L compartment of latently infected individuals.

The recovery of individuals from a disease can either cause flows from the A and I compartments to the population of recovered (or removed) individuals R or cause flows from the A and I to the population of susceptible individuals S. In any case, the recovery of the individuals from I can be controlled by moving the infected individuals to the quarantine compartment Q. Finally, the general model involves a flow of individuals from the recovery compartment to the population of susceptible individuals.

The general model can be instantiated to a number of variants, ranging from the simple SIR model that assumes only three compartments of susceptible, infected and recovered individuals, through the SLIR model that introduces the population of latently infected individuals, to the most complex SLIAQRS model that comprises all the compartments depicted in Fig. 9. For example, the SIR model includes two processes. The first instantiates the template process of infection_symptomatic that includes a single reaction equation: S.noi + I.noi → I.noi + I.noi [i], where *i* represents the rate of infection. The other process represents the template of recovery_symptomatic that includes the reaction equation I.noi → R.noi [r], where *r* denotes the recovery rate.

In contrast to the previous synthetic tasks, here we use two data sets of real measurements for induction. These come from two epidemic outbreaks, the outbreak of the Great Plague in Eyam in 1666 [45] and the outbreak of influenza type A subtype H3N2 in Tristan da Cunha in 1967 [46, 47]. The measurements for the case of the outbreak in Eyam are taken bimonthly at seven time points in the period from 3rd of July to 20th of October 1666. They include two variables: number of healthy individuals and the number of individuals that have complained of symptoms. The measurements from Tristan da Cunha are taken daily at 21 time points in October 1967. They also include two variables: number of individuals showing symptoms of infection and the number of recovered individuals.

To match the compartment variables to the variables in the data sets, we calculate the number of healthy (individuals not showing any symptoms of infection) as the sum of the number of individuals in the S, L and A compartments, the number of infected as the sum of the number of individuals in the I and Q compartments and the number of recovered as the number of individuals in the R compartment.

In accordance with the experimental setting for obtaining the measurements, we use the second objective function *RMSE*
_{
SR
}. Since the experimental data comes from real and therefore noisy measurements, we take into account model complexity to obtain the model selection score.

#### Eyam plague outbreak

For this task, we consider all possible instances of the general model as previously described, by introducing a small set of constraints of mutual exclusivity of symptomatic and asymptomatic infection, thus instantiating only the corresponding recovery for each type of infection. The total number of model structures under these constraints is 24. The initial conditions at the first time point were set to 254 individuals in the S, 7 individuals in the I and 0 in the other compartments, which exactly matches the initial conditions from the original study by Ragget [45]. The same paper proposes two plausible model structures: SIR, the structure that has been analyzed in the paper, and SLIR, suggested as the most promising one for further study. Thus, our list of plausible models structures *P* is (SIR, SLIR).

#### Tristan da Cunha influenza outbreak

For this task, we consider the same set of 24 model structures that instantiate the general model from Fig. 9. Based on the data available, we set the initial number of infected individuals to 1, other initial values to 0, except for the initial number of susceptible individuals that was fitted as a model parameter. We selected the two best performing model structures from Toni et al. [6] as plausible and set *P* to (SLIR, SIR). The other two model structures considered in the study are a modified SLIR structure, that includes time-delayed flow models, and a SIRS structure.

*P*, leading to a recall of 50 %, the hit indicator is true and the plateau size is 1. The second ranked model in the second plateau has the SIR structure of the second model in

*P*. As in the case of the Eyam plague experiments, ProBMoTs perfectly reconstructed the results of the previous modeling experiments reported by Toni et al. [6].

## Discussion

The formalism for stochastic process-based modeling, that we introduce in this work, retains the modular and straight-forward specification of entire classes of model structures from its deterministic counterpart. In contrast to the formalisms commonly used in systems biology that employ different levels of abstraction but focus primarily on the efficient description of a single model structure [10–13], the process-based formalism allows for describing uncertainty in both the structure and parameter values of a model by representing classes of model structures. The introduction of reaction-equation based description of processes improves the understandability of process-based models and allows for their stochastic interpretation, improving the generality and utility of the process-based modeling approach and bringing it closer to the domain of biology.

The experimental evaluation shows that our approach can be successfully applied to a range of problems of learning stochastic models. These can come from different biological domains and represent phenomena at different scales. Our approach exhibits excellent performance on the considered tasks, producing accurate and understandable models and successfully reconstructing the results of previous modeling efforts. The proposed approach can be applied to an arbitrary domain of interest by encoding an appropriate library of template entities and processes encountered in the particular domain.

However, several limiting issues may arise during the application of the proposed approach.

First, solving the parameter estimation task for each model structure can lead to both identifiability and distinguishability problems. The identifiability of the model parameters is a problem often encountered when modeling biological systems [48]. Performing identifiability analysis for each candidate structure is in principle possible. However, this can be challenging in terms of computational complexity when considering a large number of candidate model structures. A more tractable problem is the one of distinguishability of the candidate model structures in terms of the applied model selection criteria. Within process-based modeling, this problem presents itself in the form of long plateaus in the error profile. This problem has been studied for the task of learning deterministic models of dynamical systems from data and domain knowledge [32]. The study shows that the problem of distinguishability can be successfully addressed by the introduction of problem specific, domain dependent criteria for parameter optimization and model selection. Although this study is limited to the case of deterministic models, further work can extend its scope to stochastic modeling.

Second, the encoding of very large and complex systems in the proposed formalisms may be cumbersome. The use of reaction equations to encode a system with many entities, that is comprised of a large number of simple and repeating interactions might lead to excessively lengthy descriptions. Following concepts from related work, the issue of combinatorial complexity of composing models from elementary interactions may be solved by rule- or agent-based formalisms by introducing further abstraction.

Finally, the third limitation of our approach is related to computational cost. The simulation of a candidate model is the computationally most expensive step of the process of model induction. Therefore, the computational cost is proportional to the number of evaluations (and the number of simulations per evaluation) needed for each model during the parameter estimation task. Our method requires exhaustive enumeration and optimization of a number of candidate model structures defined by entity/process templates organized in multiple-level hierarchies of alternatives within a library of domain knowledge. Subsequently, a combinatorial explosion is possible if the problem is not well constrained. This is exactly why the process-based modeling approach includes the facility for imposing constraints on the space of possible model structures by allowing for the definition of an incomplete model.

## Conclusion

The area of computational biology lacks a unified methodology for modeling dynamical systems that would include a formalism for representing complex dynamics in a manner easily understandable to biologists and modeling experts. In this paper, we advocate the use of process-based modeling for this purpose. It allows for understandable description of a space of candidate model structures for a given modeling task. It allows for both deterministic and stochastic interpretation of process-based models. Also, it allows for automated induction of models from data and knowledge.

In order to bridge the gap between the existing and commonly used tools for modeling the dynamics of biological systems and the machine learning approaches to computational scientific discovery, we have extended the scope of process-based modeling approaches, specifically ProBMoT, to include stochastic models. As an intermediate representation, our ProBMoTs formalism includes the finer, more intuitive and easier to comprehend representation of reaction equations, which should increase the ease of use of process-based modeling in biology. This finer-grained representation of processes is a feature that broadens the possibilities of interpretation, mainly in the direction of capturing the inherent stochasticity of dynamical systems in biology.

Through the tasks considered in this work, we have shown that our approach can deal with complex parameter and structure search spaces, in lightly constrained settings, with synthetically generated tasks and in less constrained real world problems. We have thus demonstrated the potential of our approach for automated discovery of novel scientific knowledge in domains that require stochastic modeling of dynamical systems. Our results also point at an array of possibilities for further evaluation and improvement.

The presented extension of the process-based formalism integrates reaction equations as a proxy that allows for multiple interpretations of the process-based models. However, we can continue this initial step by integrating other higher-level formalisms. Combining rule-based modeling languages with the process templates from the process-based modeling formalism can be considered as a first direction for further work. The introduction of rule-based constraints would allow for automated modeling of more complex systems.

Another direction for further work stems naturally from the formulation of the modeling task as a combinatorial search problem. It concerns the implementation of incomplete, heuristics-based search strategies over the space of candidate models. Although a comparative evaluation with the method using exhaustive search is needed to establish its utility, this extension will scale-up our approach towards applications to large-scale modeling problems.

Other factors might also contribute to the overall success of our approach, e.g., the choice of a parameter estimation method and a method for simulation. Existing literature offers comparisons of the performance of different parameter estimation methods on single model structures for modeling tasks from the domain of systems biology (both deterministic and stochastic) [49, 50]. A comparison of the performance of different parameter estimation methods has also been performed in the context of deterministic process-based modeling of aquatic ecosystems [40]. The conclusions from these studies are a good starting point to investigate their performance in the context of stochastic process-based modeling tasks.

## Ethics approval

No aspect of this study required ethics approval.

## Availability of data and materials

The libraries of domain knowledge, incomplete models and data supporting the conclusions of this article are available in the Zenodo repository https://zenodo.org/record/45503 (doi:n).

## Endnotes

^{1} The notions of entities and processes are ontologically well-grounded and are present (as continuants and occurrents) in the Basic Formal Ontology.

^{2} ProBMoT and ProBMoTs are released under the terms of the BSD license http://probmot.ijs.si/licence.html.

## Declarations

### Acknowledgments

We acknowledge the financial support of the Slovene Human Resources Development and Scholarship Fund, the Slovenian Research Agency (Grants P2-0103 and P5-0093 (B)), and the European Commission (Grants ICT-2013-612944 MAESTRA and ICT-2013-604102 HBP).

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Wilkinson DJ. Stochastic Modelling for Systems Biology. Boca Raton: CRC Press; 2006.Google Scholar
- Lecca P, Laurenzi I, Jordan F. Deterministic Versus Stochastic Modelling in Biochemistry and Systems Biology. Cambridge: Woodhead Publishing; 2013.View ArticleGoogle Scholar
- McAdams HH, Arkin A. Stochastic mechanisms in gene expression. Proc Natl Acad Sci. 1997; 94(3):814–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Arkin A, Ross J, McAdams HH. Stochastic kinetic analysis of developmental pathway bifurcation in phage
*λ*- infected escherichia coli cells. Genetics. 1998; 149(4):1633–48.PubMedPubMed CentralGoogle Scholar - Samoilov M, Plyasunov S, Arkin AP. Stochastic amplification and signaling in enzymatic futile cycles through noise-induced bistability with oscillations. Proc Natl Acad Sci. 2005; 102(7):2310–315.View ArticlePubMedPubMed CentralGoogle Scholar
- Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MPH. Approximate bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface. 2009; 6:187–202.View ArticlePubMedPubMed CentralGoogle Scholar
- Džeroski S, Todorovski L. Equation discovery for systems biology: finding the structure and dynamics of biological networks from time course data. Curr Opin Biotechnol. 2008; 19(4):360–8.View ArticlePubMedGoogle Scholar
- Regev A, Shapiro E. Cellular abstractions: cells as computation. Nature. 2002; 419(6905):343.View ArticlePubMedGoogle Scholar
- Gillespie DT. Stochastic simulation of chemical kinetics. Annu Rev Phys Chem. 2007; 58(1):35–55.View ArticlePubMedGoogle Scholar
- Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, the rest of the SBML Forum, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J. The systems biology markup language (sbml): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003; 19(4):524–31.View ArticlePubMedGoogle Scholar
- Faeder JR, Blinov ML, Hlavacek WS. Rule-based modeling of biochemical systems with BioNetGen. In: Systems Biology. Methods in Molecular Biology. New York: Humana Press: 2009. p. 113–67.Google Scholar
- Danos V, Feret J, Fontana W, Harmer R, Krivine J. Rule-Based Modelling, Symmetries, Refinements In: Fisher J, editor. Formal Methods in Systems Biology. Lecture Notes in Computer Science. Berlin Heidelberg: Springer: 2008. p. 103–22.Google Scholar
- Priami C, Quaglia P. Modelling the dynamics of biosystems. Brief Bioinform. 2004; 5(3):259–69.View ArticlePubMedGoogle Scholar
- Clark A, Gilmore S, Hillston J, Tribastone M. Stochastic process algebras. In: Formal Methods for Performance Evaluation. Lecture Notes in Computer Science. Berlin Heidelberg: Springer: 2007. p. 132–79.Google Scholar
- Blossey R, Cardelli L, Phillips A. A compositional approach to the stochastic dynamics of gene networks. Trans Comput Syst Biol. 2006; 3939(3939):99–122.Google Scholar
- Ciocchetta F, Hillston J. Bio-pepa: A framework for the modelling and analysis of biological systems. Theor Comput Sci. 2009; 410(33–34):3065–084.View ArticleGoogle Scholar
- Priami C, Quaglia P. Beta binders for biological interactions. In: Computational Methods in Systems Biology. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer: 2005. p. 20–33.Google Scholar
- Bortolussi L, Policriti A. Modeling biological systems in stochastic concurrent constraint programming. Constraints. 2008; 13(1-2):66–90.View ArticleGoogle Scholar
- Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U. Copasi - a complex pathway simulator. Bioinformatics. 2006; 22(24):3067–074.View ArticlePubMedGoogle Scholar
- Mathworks SimBiology toolbox. http://www.mathworks.com/products/simbiology/. Accessed 01 Feb 2016.
- Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Syst Biol. 2006; 3:78–8.Google Scholar
- Wahl SA, Haunschild MD, Oldiges M, Wiechert W. Unravelling the regulatory structure of biochemical networks using stimulus response experiments and large-scale model selection. IEE Proc Syst Biol. 2006; 153:275–8510.View ArticleGoogle Scholar
- Oates CJ, Dondelinger F, Bayani N, Korkola J, Gray JW, Mukherjee S. Causal network inference using biochemical kinetics. Bioinformatics. 2014; 30(17):468–74.View ArticleGoogle Scholar
- Henriques D, Rocha M, Saez-Rodriguez J, Banga JR. Reverse engineering of logic-based differential equation models using a mixed-integer dynamic optimization approach. Bioinformatics. 2015; 31(18):2999–3007.View ArticlePubMedPubMed CentralGoogle Scholar
- Bonneau R, Reiss D, Shannon P, Facciotti M, Hood L, Baliga N, Thorsson V. The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol. 2006; 7(5):36.View ArticleGoogle Scholar
- Langley P, Simon HA, Bradshaw GL, Zytkow JM. Scientific Discovery: Computational Explorations of the Creative Processes. Cambridge: MIT Press; 1987.Google Scholar
- Džeroski S, Todorovski L. Encoding and using domain knowledge on population dynamics for equation discovery In: Magnani L, Nersessian NJ, editors. Logical and Computational Aspects of Model-Based Reasoning. Netherlands: Springer: 2002. p. 227–47.Google Scholar
- Bridewell W, Langley P, Todorovski L, Džeroski S. Inductive process modelling. Mach Learn. 2008; 71:109–30.View ArticleGoogle Scholar
- Todorovski L, Bridewell W, Shiran O, Langley P. Inducing hierarchical process models in dynamic domains. In: Proceedings of the Twentieth National Conference on Artificial Intelligence. Pittsburgh: AAAI Press: 2005. p. 892–7.Google Scholar
- Čerepnalkoski D. Process-based models of dynamical systems: representation and induction. PhD thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. 2013.Google Scholar
- Džeroski S, Todorovski L. Modeling the dynamics of biological networks from time course data In: Choi S, editor. Systems Biology for Signaling Networks. Berlin Heidelberg: Springer: 2010. p. 275–94.Google Scholar
- Tanevski J, Todorovski L, Kalaidzidis Y, Dzeroski S. Domain-specific model selection for structural identification of the rab5-rab7 dynamics in endocytosis. BMC Syst Biol. 2015; 9(1):31.View ArticlePubMedPubMed CentralGoogle Scholar
- Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature. 2000; 403:335–8.View ArticlePubMedGoogle Scholar
- Sanft KR, Gillespie DT, Petzold LR. Legitimacy of the stochastic michaelis-menten approximation. IET Syst Biol. 2011; 5(1):58–69.View ArticlePubMedGoogle Scholar
- Gillespie DT. A rigorous derivation of the chemical master equation. Physica A Stat Mech Appl. 1992; 188(1–3):404–25.View ArticleGoogle Scholar
- Gillespie DT. The chemical langevin equation. J Chem Phys. 2000; 113(1):297–306.View ArticleGoogle Scholar
- Rice JA. Mathematical Statistics and Data Analysis. Boston: Cengage Learning; 2006.Google Scholar
- Todorovski L, Džeroski S. Integrating domain knowledge in equation discovery In: Džeroski S, Todorovski L, editors. Computational Discovery of Scientific Knowledge. Berlin Heidelberg: Springer: 2007. p. 69–97.Google Scholar
- Bridewell W, Langley P. Two kinds of knowledge in scientific discovery. Top Cogn Sci. 2010; 2(1):36–52.View ArticlePubMedGoogle Scholar
- Čerepnalkoski D, Taškova K, Todorovski L, Atanasova N, Džeroski S. The influence of parameter fitting methods on model structure selection in automated modeling of aquatic ecosystems. Ecol Model. 2012; 245:136–65.View ArticleGoogle Scholar
- Ramsey S, Orrell D, Bolouri H. Dizzy: Stochastic simulation of large-scale genetic regulatory networks. J Bioinforma Comput Biol. 2005; 03(02):415–36.View ArticleGoogle Scholar
- Price K, Storn RM, Lampinen JA. Differential Evolution: A Practical Approach to Global Optimization. Berlin Heidelberg: Springer; 2005.Google Scholar
- Tomshine J, Kaznessis YN. Optimization of a stochastically simulated gene network model via simulated annealing. Biophys J. 2006; 91(9):3196–205.View ArticlePubMedPubMed CentralGoogle Scholar
- Brauer F, van den Driessche P, Wu J, Allen LJS. Mathematical Epidemiology. Berlin Heidelberg: Springer; 2008.View ArticleGoogle Scholar
- Ragget GF. Modelling the eyam plague. IMA J. 1982; 18:221–6.Google Scholar
- Hammond BJ, Tyrrell DAJ. A mathematical model of common-cold epidemics on tristan da cunha. J Hyg. 1971; 69:423–33.View ArticlePubMedPubMed CentralGoogle Scholar
- Shibli M, Gooch S, Lewis HE, Tyrrell DAJ. Common colds on tristan da cunha. J Hyg. 1971; 69:255–62.View ArticlePubMedPubMed CentralGoogle Scholar
- Gutenkunst RN, Waterfall JJ, Casey FP, Brown KS, Myers CR, Sethna JP. Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol. 2007; 3(10):1–8.View ArticleGoogle Scholar
- Moles CG, Mendes P, Banga JR. Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res. 2003; 13:2467–474.View ArticlePubMedPubMed CentralGoogle Scholar
- Sun J, Garibaldi JM, Hodgman C. Parameter estimation using metaheuristics in systems biology: A comprehensive review. IEEE/ACM Trans Compu Biol Bioinforma. 2012; 9(1):185–202.View ArticleGoogle Scholar