#### Search Space Constraints

We do not need or want to search the entire, unbounded parameter space; there are certain values that we know *a priori* that parameters should not take (negative production, diffusion or decay rates, for instance), and we do not want regulatory parameters to grow without bound along the saturated arms of the sigmoid expression-regulation function. To represent this, we may either impose absolute criteria that prevent the optimisation algorithm from assigning out-of-bounds values, or we may use a penalty function, which increases as the parameters move further into unacceptable regions. The penalty function may be added to the objective function (as occurs in Simulated Annealing), or it may be kept separate and handled in an algorithm-specific way (as occurs in the Evolution Strategy).

For the gene network problem, we use a penalty function (equation 5) for the regulatory parameters *T*^{ab}, *m*^{a} and *h*^{a}, where Λ is a control parameter and *v*_{max}^{b} is the maximum observed intensity of gene *b* in the data. The justification for using this penalty function is that it remains 0 when the absolute value of the total regulatory input is below a certain threshold, but rises steeply outside those bounds. This allows individual parameter values to become quite large, while keeping the total regulatory input within strict limits. We took Λ = 10^{-4}. Based on earlier results [20], we fix the *h*^{a} parameters to a value of -2.5, which reduces the number of parameters to be determined from *n* = 66 to 62.
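Equation 5 itself is not reproduced here, so the sketch below only illustrates the qualitative behaviour just described; the exponential form, the threshold argument and all names are assumptions:

```python
import math

def regulatory_penalty(u_total, threshold, lam=1e-4):
    """Hypothetical penalty on the total regulatory input u_total: zero
    while |u_total| stays below `threshold`, rising steeply outside."""
    excess = abs(u_total) - threshold
    if excess <= 0.0:
        return 0.0
    # Steep (here exponential) growth outside the acceptable band,
    # scaled by the control parameter lam.
    return lam * (math.exp(excess) - 1.0)
```

Inside the band the penalty vanishes, so individual parameters are free to vary; outside it the penalty quickly dominates the objective.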

The production, decay and diffusion rates, *R*^{a}, *λ*^{a} and *D*^{a}, are given explicit limits, such that any parameter value outside these limits is considered unacceptable and rejected. The ranges are 10 < *R*^{a} < 30, 0 < *D*^{a} < 0.3 and 5 < ln 2/*λ*^{a} < 20 for all *a*. These search space constraints are identical to those used in earlier studies [19, 20].
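The rejection test is a direct translation of these bounds (the function name is illustrative; ln 2/*λ*^{a} is the protein half-life):

```python
import math

def within_limits(R, D, lam):
    """Hard bounds from the text: 10 < R < 30, 0 < D < 0.3 and
    5 < ln 2 / lambda < 20; any violation means the value is rejected."""
    half_life = math.log(2) / lam
    return (10 < R < 30) and (0 < D < 0.3) and (5 < half_life < 20)
```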

#### Serial Island Evolution Strategy

The evolutionary algorithm we are using is a parallel Island (*μ,λ*)-Evolution Strategy (piES). It is a parallel version of the serial Island ES algorithm developed by Fomekong *et al*., which was successfully applied to the same gene network problem [19]. We will describe this serial algorithm first. Wherever possible, we have used the same values for optimisation parameters as used in the earlier study.

The island ES algorithm operates on *N*_{isl} populations of individuals, each with a population size *λ*. Each population is initialised independently, and selection, recombination and mutation are performed only within populations. A migration operation links the populations; in the serial algorithm, the best individual from each island is copied to a population selected uniformly at random. Migration occurs every *m* generations. We took *m* = 200 (experimental runs showed little variation within the range of 50 to 200; data not shown).

We denote the set of all possible individuals as *I*, with individuals *i* ∈ {1, ..., *λ*} in population *p* ∈ {1, ..., *N*_{isl}}. Each individual has a parameter vector associated with it, as given in equation 4. We define a fitness function Φ: *I* → ℝ, which is equal to the objective function *E* as defined in equation 3, and use the penalty function Π: *I* → ℝ defined in equation 5.

Selection is performed to produce a set of *μ* offspring (here, we take *μ* = *λ*/5), according to a selection operator mapping *I*^{λ} → *I*^{μ}. The operator selects the top *μ* individuals of the population, according to a stochastic ranking procedure based on both the fitness function and the penalty function. This stochastic ranking method, introduced by Runarsson *et al*. [35], uses a bubble-sort-like procedure: an arbitrary initial ranking is produced, and *λ* sweeps are performed, in which each individual in turn (starting from the top) is compared to the one directly below it. If the penalty value of both individuals is less than or equal to zero, the pair is ordered by fitness. If the penalty value of the top individual is greater than zero, then with probability *P*_{f} the pair is ordered by fitness, and with probability 1 - *P*_{f} by penalty value. The procedure ends early if a sweep produces no change in the order. We use *P*_{f} = 0.45, which is the value used in [19].
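The ranking procedure can be written down directly; a minimal sketch, with fitness and penalty passed as lists and lower fitness ranked higher (as for an objective function that is minimised):

```python
import random

def stochastic_rank(fitness, penalty, p_f=0.45, seed=None):
    """Stochastic ranking (Runarsson & Yao): bubble-sort-like sweeps in
    which adjacent pairs are compared by fitness when both are feasible
    (penalty <= 0), and otherwise by fitness with probability p_f or by
    penalty with probability 1 - p_f.  Returns indices, best first."""
    rng = random.Random(seed)
    n = len(fitness)
    order = list(range(n))
    for _ in range(n):  # at most lambda sweeps
        swapped = False
        for k in range(n - 1):
            a, b = order[k], order[k + 1]
            if penalty[a] <= 0 and penalty[b] <= 0:
                by_fitness = True
            else:
                by_fitness = rng.random() < p_f
            if by_fitness:
                out_of_order = fitness[a] > fitness[b]
            else:
                out_of_order = penalty[a] > penalty[b]
            if out_of_order:
                order[k], order[k + 1] = b, a
                swapped = True
        if not swapped:  # no change in a sweep: terminate early
            break
    return order
```

When every individual is feasible, the procedure reduces to a deterministic sort by fitness.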

Recombination is then performed on the offspring, using a recombination operator *r*_{λ}: *I*^{μ} → *I*^{λ}. The operator first produces *λ* − *μ* individuals as direct (asexual) copies of the selected individuals in the parent population, with the fittest *λ* mod *μ* individuals being represented twice if *λ* is not a multiple of *μ*.

Second, *μ* additional individuals are produced by recombination: each individual in the parent population produces an offspring whose parameter vector is formed by recombination between its own parameter vector, the next fittest individual's parameter vector and the fittest individual's parameter vector, weighted by a recombination factor. We have taken the recombination factor to be 0.85, as in [19].
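Since the exact weighting is given in the paper's own equation, the sketch below uses an assumed convex combination of the three parameter vectors; it illustrates the structure of the operator, not the precise formula:

```python
def recombine(x_self, x_next, x_best, chi=0.85):
    """Illustrative recombination of an individual's parameter vector
    x_self with the next-fittest (x_next) and fittest (x_best) vectors,
    pulled towards their average by the recombination factor chi.
    The convex weighting is an assumption, not the paper's equation."""
    return [xs + chi * (0.5 * (xn + xb) - xs)
            for xs, xn, xb in zip(x_self, x_next, x_best)]
```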

The *λ* − *μ* individuals that do not undergo recombination are then mutated, according to the local mutation operator *m*_{φ, α}: *I* → *I*. Mutation is performed according to a non-isotropic self-adaptive mutation rule, in the sense that each individual *i* has associated with it a step size for mutation (a mutation rate) *σ*_{ij} for each parameter *j*. This allows the step size to undergo evolution under selection, giving an adaptive step size without having to specify an explicit adaptation rule. Mutation starts with a random change to the mutation rate,

*σ*'_{ij} = *σ*_{ij} exp(*τ*' *N*_{i} + *τ* *N*_{ij}),

for *i* ∈ {*μ* + 1, ..., *λ*} and *j* ∈ {1, ..., *n*}, where *τ*' and *τ* are tuning parameters set via *φ**. We have used *φ** = 1. *N*_{i} and *N*_{ij} are a vector and a matrix of random values sampled from a normal distribution with zero mean and unit variance, which is generated afresh each generation.

Next, we mutate the parameters themselves, using

*x*'_{ij} = *x*_{ij} + *σ*'_{ij} *N*_{ij}(0, 1),

for *i* ∈ {*μ* + 1, ..., *λ*} and *j* ∈ {1, ..., *n*}, where *N*_{ij}(0, 1) is another set of randomly sampled standard normal values, generated at each generation.

Finally, we apply exponential smoothing to the step sizes, to reduce fluctuations:

*σ*''_{ij} = *α* *σ*'_{ij} + (1 − *α*) *σ*_{ij},

for *i* ∈ {*μ* + 1, ..., *λ*} and *j* ∈ {1, ..., *n*}, where *α* is a smoothing parameter. We have taken *α* = 0.2. The smoothed step size *σ*''_{ij} then becomes the mutation rate for the next round of mutation [19].
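The mutation and smoothing steps can be sketched for one individual as follows; the learning rates *τ*' = *φ**/√(2*n*) and *τ* = *φ**/√(2√*n*) are the standard ES choices and are an assumption here, as are the function and variable names:

```python
import math
import random

def mutate(x, sigma, sigma_prev, phi_star=1.0, alpha=0.2, seed=None):
    """Non-isotropic self-adaptive mutation of one individual's
    parameters x with per-parameter step sizes sigma, followed by
    exponential smoothing against the previous step sizes sigma_prev."""
    rng = random.Random(seed)
    n = len(x)
    tau_global = phi_star / math.sqrt(2.0 * n)            # assumed tau'
    tau_local = phi_star / math.sqrt(2.0 * math.sqrt(n))  # assumed tau
    shared = rng.gauss(0.0, 1.0)  # N_i: one draw shared per individual
    # Random change to the mutation rates.
    new_sigma = [s * math.exp(tau_global * shared +
                              tau_local * rng.gauss(0.0, 1.0))
                 for s in sigma]
    # Mutate the parameters themselves.
    new_x = [xi + s * rng.gauss(0.0, 1.0) for xi, s in zip(x, new_sigma)]
    # Exponential smoothing of the step sizes, with alpha = 0.2.
    smoothed = [alpha * s_new + (1.0 - alpha) * s_old
                for s_new, s_old in zip(new_sigma, sigma_prev)]
    return new_x, smoothed
```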

Every *τ* generations, the populations are checked to see if they have met the termination criterion (we take *τ* = 20); at the same time, information on descent speed (the current fittest individual and time since the program was started) is saved to a log. The algorithm has two termination conditions: it either runs for a preset number of generations, or it halts when the lowest value of the objective function remains below a particular preset amount for ρ × *τ* generations. Preliminary investigation showed that the convergence time was relatively constant across runs, so we use a constant number of generations. We ran all runs for 40000 generations, which we found to be long enough for virtually all runs to converge.

#### Parallel Island Evolution Strategy

Parallelisation of the serial island-ES (iES) relies on running each population on a separate processor. Since selection, recombination and mutation operate strictly within populations, only the migration operation, checking termination criteria and recording information for log files need to be parallelised. The simplest parallelisation of the serial iES is a synchronous parallel island-ES (piES). The algorithm is synchronous in the sense that all communication occurs simultaneously across all processes; when migration or other exchange of information is required, each processor halts until all other processes have caught up, and then all information is exchanged. The synchronous algorithm does not modify the behaviour of the serial algorithm, and is deterministic in the sense that serial and parallel runs with the same set of random seeds will produce exactly the same solution.

Migration occurs according to the following scheme: a node designated as the master generates a migration schedule, in which every population is assigned another population to which it will migrate an individual, and this schedule is broadcast to all nodes. The nodes then communicate with each other point-to-point, with each node sending the parameter values of its highest-ranking individual to its designated receiver, and replacing its lowest-ranking individual with the best individual of each population for which it is the designated receiver.
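The master's schedule generation might be sketched as follows (names are illustrative; "another population" is read as excluding the source itself):

```python
import random

def migration_schedule(n_islands, seed=None):
    """Master-node schedule for synchronous migration: each island is
    assigned another island to send its best individual to, chosen
    uniformly at random.  Returned as {source: destination}."""
    rng = random.Random(seed)
    return {src: rng.choice([dst for dst in range(n_islands) if dst != src])
            for src in range(n_islands)}
```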

The collection of data related to descent speed and the checking of termination criteria are performed together. Every *τ* generations, the best individual in each population is backed up, and the processors communicate with each other to find the lowest value of the objective function of any individual across all populations, which the master node records to a log file along with a time-stamp. As every process is aware of the fitness of the fittest individuals in all other populations, both termination criteria can be evaluated separately on each processor.

The disadvantage of the synchronous algorithm is that processors spend a significant amount of time idle. The asynchronous piES algorithm avoids this by having the processors communicate asynchronously; for migration and other communication, each processor sends information to a memory buffer associated with the process it is communicating with, which can then receive it at a later time (whenever it is ready to receive), avoiding waiting times.

For migration, every *m* generations each process copies its best individual to the buffer of a randomly selected population, adds any individuals waiting in its own buffer to its population (no individuals are added if the buffer is empty), and then clears its buffer. To avoid buffer sizes growing without bound, each processor will place a maximum of 10 individuals in a given population buffer before waiting for them to be picked up (this is a rare event, and we have not observed it in practice). Logging descent information and checking the termination criterion also occur asynchronously. Every *τ* generations, each processor sends the fitness of its fittest individual to the buffer of a designated master node; the master node collects these fitness values, records them to a log and checks the termination criterion. If the termination conditions are met, the master node sends a terminate signal to the buffers of all processors. As with migration, we avoid buffer overflows by having each processor leave a maximum of 50 messages in the master node's buffer before waiting for them to be read (once again, we have never observed this in practice).
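The bounded migration buffers can be modelled with a blocking queue; a minimal single-process sketch (class and method names are illustrative, and the cap of 10 matches the text):

```python
from queue import Queue

class MigrationBuffer:
    """Bounded per-island inbox for asynchronous migration: senders
    deposit individuals, the owning island drains them when ready."""

    def __init__(self, maxsize=10):
        self._q = Queue(maxsize=maxsize)

    def deposit(self, individual):
        # Blocks only when the buffer already holds `maxsize`
        # individuals, i.e. a sender waits for pick-up (a rare event).
        self._q.put(individual)

    def drain(self):
        # The owning island empties its inbox whenever it is ready.
        items = []
        while not self._q.empty():
            items.append(self._q.get())
        return items
```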

#### Parallel Lam Simulated Annealing

We use the parallel Lam Simulated Annealing (pLSA) algorithm developed by Chu *et al*. [27], running on *K* processors with one processor arbitrarily designated as the master node. Optimisation parameters are taken from Jaeger *et al*. [14]. The algorithm is described in depth elsewhere [25–27, 36]. Briefly, SA functions by defining an energy, given by the objective function *E* plus the penalty function Π (equations 3 and 5, respectively). During each iteration (or move) of the algorithm, the *K* processors each change their parameter set to a new state, according to an adaptive move generation strategy. They then evaluate the energy difference Δ*E* between the new and old states; if this is negative, the move is accepted; if not, it is still accepted with probability exp(−Δ*E*/*T*). The temperature starts at *T*_{0} and is decreased according to the Lam schedule, which gives the fastest decrease in energy possible while keeping the system in quasi-equilibrium [25]. The Lam schedule requires information on the mean and variance of the energy over time; a set of running estimates of these is calculated using specially designed estimators [26]. The same temperature is used across all nodes, so all processors must periodically pool their local statistics (every *τ* iterations) to allow the temperature schedule to be maintained [27]. This pooling of statistics allows the temperature to be lowered *K* times as fast as in the serial case.
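The acceptance rule is the standard Metropolis criterion; a minimal sketch (the injectable `rng` argument is a testing convenience, not part of the algorithm):

```python
import math
import random

def metropolis_accept(dE, T, rng=random):
    """Metropolis criterion: downhill moves (dE < 0) are always
    accepted; uphill moves with probability exp(-dE/T)."""
    if dE <= 0.0:
        return True
    return rng.random() < math.exp(-dE / T)
```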

To compensate for processes which leave the quasi-equilibrium regime (due to the increased rate of temperature decrease compared to the serial case), a mixing of states is performed every *m* iterations. The energy states of all processes are collected and redistributed according to Boltzmann probabilities: the probability of any given processor being assigned the state of process *i* is proportional to exp(−*E*_{i}/*T*), where *E*_{i} is the current energy of process *i*. This mixing of states allows the best results to propagate, but also allows nodes to explore higher-energy solutions. It strongly resembles the selection procedure used in evolutionary algorithms.
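A sketch of the redistribution step, normalising the Boltzmann weights over the *K* current energies (names are illustrative; subtracting the minimum energy is a standard numerical-stability device, not part of the description above):

```python
import math
import random

def boltzmann_mix(energies, T, seed=None):
    """Each process independently draws a state index i with
    probability proportional to exp(-E_i / T)."""
    rng = random.Random(seed)
    m = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - m) / T) for e in energies]
    total = sum(weights)
    probs = [w / total for w in weights]
    assignments = []
    for _ in energies:  # one draw per process
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                assignments.append(i)
                break
        else:
            assignments.append(len(probs) - 1)
    return assignments
```

At low temperature the lowest-energy state dominates the draw, so good states propagate; at higher temperature the distribution flattens and higher-energy states survive mixing.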

In order to avoid the final solution being affected by the initial conditions, the algorithm performs an 'initial burn', in which each processor spends *n*_{init} iterations running as normal serial SA at a constant temperature *T*_{0}. After that, the algorithm runs for another *n*_{init}/*K* iterations to calculate initial statistics.

There are two potential stopping conditions for the algorithm: the absolute condition and the freeze condition. In the absolute condition, the algorithm terminates once the mean value of *E* has remained below a target energy for a certain period of time. In the freeze condition, the algorithm terminates once the energy *E* has changed by less than a certain proportion over a certain period of time. As preliminary investigations showed that both the final energy and the convergence time were highly variable, we exclusively use the latter.
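The freeze condition can be sketched as a check over a recorded energy window; the window length and relative tolerance are left unspecified in the text, so both are assumptions here:

```python
def frozen(energy_history, rel_tol=1e-3):
    """Freeze criterion sketch: terminate when the energy has changed
    by less than a fraction rel_tol over the recorded window.  The
    tolerance and the window (the whole list) are assumed values."""
    if len(energy_history) < 2:
        return False
    first, last = energy_history[0], energy_history[-1]
    return abs(last - first) <= rel_tol * abs(first)
```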