The model
We follow an approach taken in a previous study of large-scale metabolic networks [5]. We define a metabolic genotype as the set of biochemical reactions that may take place in an organism, and that are catalyzed by gene-encoded enzymes. The set of all reactions used in this work is a subset of 1221 reactions out of 5871 reactions we curated previously [5] from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [11]. These reactions comprise all elementally-balanced reactions that involve sulfur containing metabolites (see methods for details).
A metabolic genotype can be represented in at least 2 different ways (Figure 1). The first views it as a metabolic network graph whose nodes are metabolites. Reactions are represented as directed links from substrate metabolites to product metabolites (Figure 1A). The second views it as a list of reactions (Figure 1C), or, equivalently a binary vector whose length - 1221 reactions in our case - corresponds to the number of reactions in a known reaction "universe". Each position i in this vector corresponds to a reaction. Its values ('0') or ('1') at position i indicate the inability or ability of the organism to catalyze the corresponding reaction (Figure 1C).
We define the phenotype of a metabolic network as the subset of sulfur sources (out of 124 possible sources we consider, see Methods) that allow the network (metabolic genotype) to synthesize all biomass components, if one of the sulfur sources is provided as the sole sulfur source to the organism. We represent this phenotype as a binary vector of length 124 whose entry at position i indicates viability if sulfur source i is the sole sulfur source (Figure 1D). This is not the only way to define a metabolic phenotype, but it is appropriate for our purpose. An obvious alternative phenotype definition would count the number of biomass metabolites that a network can produce in a given environment. However, because all these metabolites are essential for survival of an organism, networks that cannot synthesize some of them are of limited biological relevance. Additional advantages of the phenotype definition we chose are that it allows a straightforward and systematic comparison of phenotypes, and enables us to study metabolic innovation in a biologically sensible way. Using this phenotype definition, a metabolic innovation is the ability to synthesize biomass metabolites from a new, previously unusable sulfur source.
To determine metabolic phenotypes from genotypes, we use flux balance analysis [12], a computational method that finds a growth-maximizing steady-state metabolic flux through all reactions in a metabolic network. This method requires information about the stoichiometry of every metabolic reaction, a maximally allowed flux of each metabolite in and out of the environment, and information about an organism's biomass composition (see Methods for details). We focus on a metabolic network's qualitative ability to produce all sulfur-containing biomass precursors. We will study networks that are able to do so from each one of a specific set of sole sulfur sources. For brevity, we call such networks viable. We will also refer to the number S of sulfur sources that a metabolic network must be viable on as the environmental demand imposed on the network.
We next introduce the concept of a genotype network for metabolic networks (Figure 1B) [5]. The nodes in this network correspond to individual genotypes (metabolic networks) with the same phenotype. Two genotypes are linked - they are neighbors - if they differ in a single reaction. A genotype network thus is a network of metabolic networks. This concept is useful when we examine the evolution of metabolic networks through the addition and elimination of metabolic reactions, which can occur, for example, by horizontal gene transfer [13, 14], or through loss-of-function mutations in enzyme-coding genes. Consider the metabolic network genotype G
1
of some organism. This genotype is a node on the genotype network associated with this genotype's phenotype. If some variant G
2
of this network - obtained through an addition or a deletion of a reaction - has the same phenotype as G
1
, it will be a neighbor of G
1
on the same genotype network. In this manner, one can envision phenotype-preserving evolutionary change of metabolic genotypes as a path through a genotype network. Such paths correspond to successive hops from genotype to genotype, by way of the edges connecting neighboring genotypes (Figure 1B). For our analysis, it will be useful to define a distance >D between two metabolic network genotypes as the fraction of reactions in which two metabolic networks differ, or
where R
c
is the number of reactions shared by both networks, and N
1
and N
2
are the total numbers of reactions in networks G
1
and G
2
, respectively. This formula simplifies to D = 1-RC/N when both networks have the same size N..
Minimal viable metabolic networks can be diverse and contain many superessential reactions
We begin with an analysis of minimal viable networks, which provides insights into the reactions that are essential to utilize a specific set of sulfur sources. A minimal metabolic network is a network in which all reactions are essential and none can be removed without rendering it inviable. For any one given phenotype P, there may be multiple viable minimal networks. Random minimal networks can be generated by starting from a network comprised of all 1221 reactions - which is viable on all sulfur sources - and eliminating randomly chosen reactions one-by-one, until no reactions can be further removed without rendering the network inviable on the sulfur sources defined by P. We note that a minimal network is not the same as the network with the smallest possible number of reactions with a given phenotype, which could be very difficult to find in a vast metabolic genotype space.
We generated 1000 random minimal metabolic networks viable on a given number S of sulfur sources (see methods). Specifically, we generated 100 minimal networks for 10 random sets of sulfur sources with the same number S - but not necessarily identity - of sources. We note in passing that such networks often also happen to be viable on additional sulfur sources that we did not require them to be viable on (Additional file 1). Figure 2A shows the distribution of genotype distances for pairs of minimal metabolic networks viable on S = 1, 20, or 60 sulfur sources. The figure demonstrates that, first, random minimal metabolic networks can be very different from one another. Their genotype distance may exceed D = 0.8, meaning that they may share fewer than 20 percent of reactions. Second, their average distance depends on the number of sulfur sources a network needs to be viable on. Specifically, the average genotype distance is largest D
avg
= 0.6 for minimal metabolic networks viable on S = 1 sulfur source, and decreases to D
avg
= 0.3 for networks viable on S = 60 sulfur sources. Third, the distribution of genotype distances is much wider for metabolic networks subject to few environmental demands (S = 1) where it ranges from D
avg
= 0.2 to D
avg
= 0.8, than for metabolic networks subject to many environmental demands (S = 60) where it ranges from D
avg
= 0.2 to D
avg
= 0.4.
Figure 2B (filled circles) shows the average size of minimal networks N
min
as a function of the number of sulfur sources they are viable on. It ranges from N
min
= 14 reactions for S = 1 to N
min
= 87 reactions for S = 60. By definition, all reactions in a minimal network are essential, but some of these reactions are special because they occur in all minimal networks viable on a given set of sulfur sources. We call these reactions superessential reactions [6]. The open circles in Figure 2B shows the number of superessential reactions as a function of the environmental demands S on a network. The number of superessential reactions R
SE
increases with S, but it is generally much lower than the total number of reactions in a minimal metabolic network. For example, at S = 1, 4 out of 14 reactions are superessential. At S = 60, 44 out of 87 reactions are superessential. We will show that the number of superessential reactions plays an important role in one of our analyses below.
Many viable sulfur metabolic network genotypes are connected via paths that lead far through metabolic genotype space
We next extended our previous work on carbon metabolism to ask about the existence of genotype networks in the space of sulfur-involving reactions, and of neutral paths that traverse such networks while preserving a metabolic phenotype. We define a neutral path as a series of mutations (reaction additions or deletions) that leave a phenotype intact (Figure 1B). We emphasize that we do not use the term neutrality in its meaning of unchanged fitness in the field of molecular evolution [15], but merely for brevity, in the sense of preserving viability on a specific set of sulfur sources. Changes to metabolic networks such as additions or deletions of chemical reactions can potentially have a positive or negative effect on fitness. The addition of a chemical reaction may have a beneficial effect if it increases the rate at which biomass is synthesized, or it may have a deleterious effect if it generates metabolites that interfere with cell physiology. Similarly, the deletion of a reaction may have both beneficial and deleterious fitness effects [16]. Studies on compensatory mutations in macromolecules, mutations that compensate for the fitness effects of previous mutations with negative effects, show that fitness neutrality is not a prerequisite for a population's genetic change on a genotype network [17, 18].
We are especially interested in two questions. How far does a neutral path typically lead through genotype space? And how does this distance depend on the number N of reactions in a network, and on the environmental demands on the network? To answer these questions, we performed 200 random walks of 10 000 mutations each for metabolic networks of various sizes, and for various environmental demands. Specifically, for networks of each size we performed 20 random walks for each of 10 different sets of S of sole sulfur sources that we required the network to be viable on. Each random walk started from a random initial viable metabolic network comprising N reactions (see methods for details). We allowed N to vary by no more than one reaction during the random walk. Moreover, each step in the random walk had to preserve viability. Finally, none of the steps was allowed to decrease the distance to the starting network, in order to maximize the distance from this network (see methods for details).
Figure 3A shows the maximum genotype distance D
max
obtained in such random walks for networks up to 300 reactions, where we required viability on S= 1, 5, 10, 20, 40, or 60 different sole sulfur sources. This distance is in general large. For example, D
max
is greater than 0.7 for all metabolic networks with more than 200 reactions. For each value of S, the data point at the smallest value of N (horizontal axis) corresponds to the minimal metabolic networks we discussed earlier. Perhaps surprisingly, these minimal networks can not only be very diverse, as we saw earlier, but neutral paths starting from any one such network can also reach far through genotype space. For example, the maximal length of neutral paths is D
max
= 0.65 for minimal metabolic networks viable on S = 1 sulfur source, and still a sizeable D
max
= 0.38 for metabolic networks viable on S = 60 sulfur sources. To provide a point of reference, the E. coli metabolic network has 142 reactions involving sulfur. Random viable metabolic networks of this size would have maximum genotype distances between D
max
= 0.96 (for S = 1) and D
max
= 0.60 (for S = 60).
Maximal genotype distance and robustness of metabolic networks are well approximated by simple properties of minimal networks
We asked whether the maximal genotype distance of networks of a given size, as well as their robustness to reaction removal can be predicted from properties of the underlying minimal networks. The answer is yes.
Figure 3A shows that the maximal possible genotype distance D
max
between metabolic networks of the same size increases with metabolic network size N. The solid lines show the relationship between the maximal genotype distance D
max
and metabolic network size N as predicted by the equation
(1)
Here, R
SE
is the number of reactions that are super-essential for a given environmental demand S. We had estimated this number in our previous analysis of minimal networks (Figure 2B). The simple relationship of equation (1) fits our numerical data (Figure 3A) remarkably well and corresponds exactly to our distance function when the number of superessential reactions R
SE
replaces the number of common reactions R
C
between networks of the same size. This implies that the only common reactions between maximally distant networks are the superessential reactions. Therefore, as the size of a network increases, more phenotype-preserving changes become possible. For networks of the smallest size, D
max
, systematically overestimates the maximal genotype distance, but it does so by no more than 10% percent. We note that our estimates of maximum genotype distances are only lower bounds, such that this discrepancy may result from our limited ability to find maximal genotype distances accurately. In sum, a simple, linear function of the number of superessential reactions at any one environmental demand S approximates the maximal genotype distance between networks well.
Next we examined how network robustness depends on the size of metabolic networks and on environmental demands. We define robustness as the fraction of non-essential reactions in a metabolic network. Figure 3B shows the robustness of metabolic networks as a function of network size N and varying environmental demands S on a network. Large metabolic networks with 200 reactions or more have a robustness ρ > 0.6 for all values of S. For smaller metabolic networks (N < 200), robustness ranges from ρ = 0.9 under low environmental demands (S = 1) to ρ = 0.2 under high environmental demands (S = 60). The relationship between ρ and network size N can be explained by noting that ρ = 1-Ress/N where R
ess
is the number of essential reactions. We find that R
ess
decreases linearly with increasing metabolic network size (Additional file 2) and is described by the function Ress = Nmin(1+m)-Nm, In this equation, N
min
is the average size of minimal metabolic networks (estimated above for given S) and m is the rate at which the number of essential reactions decreases with increasing metabolic network size (estimated from data in Additional file 2). The question why the number of essential reactions R
ess
decreases with increasing size has a simple answer. As one increases the size of a metabolic network by adding reactions and entire pathways to minimal metabolic networks, some reactions may become non-essential because the added reactions create alternative pathways for biomass metabolite synthesis. Describing ρ in terms of N, N
min
and m, we arrive at the following relation
(2)
which is plotted as the solid lines in Figure 3B and fits the data very well.
This relationship means that network robustness is a linear function of the ratio N
min
/N, whose inverse indicates how much larger a given network is than a minimal network for a given S, and of the rate at which reaction essentiality declines (robustness increases) with increasing N.
The diversity of phenotypes found in the neighborhood of two metabolic networks changes rapidly with their genotype distance
Thus far, we have concentrated on the characteristics of individual sets of genotypes viable on a given number S of sulfur sources, and on the genotype networks they form. Long paths through a genotype network can contribute to evolutionary innovation in metabolic phenotypes, if many novel phenotypes can be encountered near such a path. We next asked whether this is the case, and how this number of novel phenotypes depends on environmental demands on a network. We consider a phenotype to be novel if it confers viability in a set of new sulfur sources, in addition to those required by the environmental demands imposed on the metabolic network.
We first introduce the notion of a (1-mutant) neighborhood around a metabolic network genotype, which comprises all networks that differ from the genotype by a single reaction (Figure 1B). Because our genotype space has 1221 metabolic reactions, each metabolic network has 1221 neighbors. Of all these neighbors, some will be inviable in any given environment (these are the mutants that have lost an essential reaction), some will maintain the same phenotype, and some will have a novel phenotype while being viable in this environment. That is, they will have gained viability on a new sulfur source. We focus on the latter class of neighbors in this section.
We asked how different are the novel phenotypes in the neighborhood of two metabolic networks G and G
k
on the same genotype network, where G
k
is a metabolic network derived from G through k random reaction changes. That is, we determined the fraction of novel phenotypes that occurred in the neighborhood of only one but not the other network. Below we refer to it as the fraction of novel phenotypes unique to one neighborhood. If this fraction is very small even for large k, then networks in different regions of a genotype space will have mutational access to similar novel phenotypes. Figure 4 shows that the opposite is the case. We obtained the data shown during phenotype-preserving random walks starting from an initial network, by recording the fraction of novel phenotypes that occur in the neighborhood of the changing metabolic network, but not of the initial network. Every data point is an average over 20 random walks each for 10 different initial metabolic networks (thus, 200 random walks in total) at every value of S. Figure 4 shows that the fraction of unique novel phenotypes reaches high values for modest distance between two metabolic networks - small compared to the maximum genotype distance - and does not depend much on the number of sulfur sources S on which viability is required. It also does not depend strongly on metabolic network size (results not shown). In sum, the neighborhood of moderately different metabolic networks contains very different novel phenotypes.
The ability of metabolic networks to encounter novel phenotypes does not depend monotonically on their phenotypic robustness
The question of how robustness relates to phenotypic variability has raised considerable interest in recent years [19, 20]. Macromolecules - RNA and protein - whose phenotypes are more robust to mutations can access more novel phenotypes than less robust phenotypes [7, 8]. This holds for both large and small evolving populations of such molecules, at least in RNA [8]. We next asked whether these observations are specific to macromolecules, or whether they hold more generally, that is, also for the genotype-phenotype map of metabolic networks.
Above we considered the robustness of a metabolic genotype as its fraction of non-essential reactions. Analogously, we now consider the robustness of a metabolic phenotype as the average fraction of non-essential reactions of all networks with this phenotype [8]. We showed that robustness decreases as networks are required to be viable on more and more sulfur sources (Figure 3B). That is, for networks of any given size, the number S of sulfur sources on which they are viable can serve as a proxy for phenotypic robustness. The greater a phenotype's S is, the smaller is its robustness.
When analyzing how evolving populations explore a genotype network, we need to distinguish between two different kinds of populations. The first are populations where the product of population size and mutation rate is much smaller than one. For brevity, we refer to such populations here as small populations. The second are populations where this product is much greater than one. We refer to these as large populations.
Small populations are genotypically monomorphic most of the time [15], and effectively explore a genotype network much like a single changing network would, i.e., through a random walk on the genotype network. During such a random walk, the changing network encounters different phenotypes in its neighborhood. We determined the cumulative number of different novel phenotypes found in the neighborhood of a random walker. That is, if a phenotype was encountered twice, either in the same neighborhood, or in a neighborhood encountered during an earlier step, we counted it only once. We did so for networks of varying size N and number of sulfur sources S. Specifically, for each N and S, we carried out 200 random walks of 10 000 mutations each (20 walks for 10 different sets of sulfur sources at each S). Figure 5A shows the resulting data. The cumulative number of novel phenotypes is a unimodal function of S, indicating that metabolic networks under few and many environmental demands encounter fewer novel phenotypes than under an intermediate number of environmental demands (S≈20). The cumulative number of novel phenotypes depends strongly on metabolic network size for S < 20, where larger metabolic networks encounter more novel phenotypes throughout the random walk. It is not sensitive to N for larger values of S.
We next examine large evolving populations. Such populations are polymorphic most of the time. To model their evolutionary dynamics, one needs to track every individual in the population, unlike for monomorphic populations. We determined the cumulative number of novel phenotypes that are mutationally accessible to a population of metabolic networks evolving on (and restricted to) a specific genotype network. This number can be determined by examining, for each generation, the neighborhood of each individual in the population, and by counting the total number of different novel phenotypes encountered. We simulated populations of 100 individuals evolving for 2000 generations (see Methods for details). Additional file 3 shows the average number of cumulative unique novel phenotypes accessible to a population through generation 2000. Each data point represents an average and standard deviation over 200 simulations (20 simulations for 10 random sets of sulfur sources at a given S). Qualitatively, the figure resembles our observations for a single random walk (Figure 5A), except that the absolute number of cumulative unique phenotypes encountered is higher in evolving populations.
Taken together, these observations show that the number of novel phenotypes accessible to a population does not increase monotonically with phenotypic robustness. It decreases with increasing robustness (decreasing S) for low values of S, and it increases with robustness at higher values of S. Additional file 4 demonstrates this relationship in a 3 D plot of the number of novel phenotypes versus robustness and network size (A) or environmental demands (B).
We next examined two candidate explanations of this pattern. The first is that environmental demands and network size affect how rapidly a population can diversify on its genotype network, and thus also how many novel phenotypes it can access. To find out whether this diversification rate matters, we examined the average pairwise genotype distance of our evolving populations. The smaller this difference is, the more slowly a population diversifies. Figure 5B shows a plot of pairwise genotype distances, averaged over an entire population, at the end of 2000 generations. One can see that populations of smaller networks are less diverse. However, environmental demand (S) influences genotypic diversity only weakly, and not in the same unimodal way as seen in Additional file 3. Thus, population dynamic processes alone cannot explain the pattern observed in Figure 5A and Additional file 3.
The second candidate explanation is that the patterns of Figure 5A and Additional file 3 may simply reflect how the number of novel phenotypes in the neighborhood of random metabolic networks varies with N and S. Figure 5C shows the number of novel phenotypes in the neighborhood of random viable metabolic networks of varying size, and with varying environmental demands on the network. This figure is based on random samples of 200 metabolic networks (see Methods) for every value of N and S (20 metabolic networks for 10 different sets of sulfur sources at each S). The vertical axis of this figure shows the mean and standard deviation of the number of unique novel phenotypes in the neighborhood of the examined networks. It shows similar unimodal characteristics as the data in Figure 5A and Additional file 3. The figure demonstrates that the number of novel phenotypes depends strongly on metabolic network size for environments with S < 20. In this regime, larger metabolic networks have more novel phenotypes in their neighborhood than smaller networks. For S > 20, the dependency on metabolic network size disappears and the number of accessible novel phenotypes declines again. In sum, the different accessibility of novel phenotypes in evolving populations, at least qualitatively, emerges from how novel phenotypes are distributed in genotype neighborhoods, and how this distribution depends on S and N.