Mathematical solution of the model
Degree distribution
First, we derive an analytical solution for the degree distribution of the model by mean-field analysis [38–40]. The mean-field approximation reduces a many-body problem to a one-body problem and is widely used in the statistical mechanics of complex networks. For this model it yields closed-form solutions.
We consider the time evolution of $k_i$, the degree (number of edges) of node $i$. When Event I (a new metabolite and a new reaction) occurs, the degree of node $i$ increases by one with probability $1/N$, where $N$ is the total number of nodes. When Event II occurs, two existing nodes are selected and the degree of each increases by one. The first node is selected at random, so node $i$ is chosen with probability $1/N$. The second node is reached by a random walk from the first, so node $i$ is chosen with probability $k_i/\sum_j k_j$: the probability that a walker arrives at a node is reported to equal $k_i/\sum_j k_j$ irrespective of the number of steps in the random walk [41]. Note that this probability is the same as that of preferential attachment [38], which reproduces heterogeneous connectivity. Since Event I occurs with probability $1 - p$ and Event II with probability $p$, the time evolution of $k_i$ is
$$\frac{dk_i}{dt} = (1 - p)\frac{1}{N} + p\left[\frac{1}{N} + \frac{k_i}{\sum_j k_j}\right], \tag{1}$$
where $N = (1 - p)t$ because the number of nodes increases by one with probability $1 - p$ at each time step, and $\sum_j k_j = 2t$ because one edge is added at every time step. Note that this equation is independent of the bypassed path length (the parameter $q$). The solution of Equation (1) with the initial condition $k_i(t = s) = 1$ is
$$k_i(t) = [A(p) + 1]\left(\frac{t}{s}\right)^{p/2} - A(p), \tag{2}$$
where $A(p) = 2/[p(1 - p)]$.
From Equation (2), because $s/t = P(\geq k)$, the cumulative distribution $P(\geq k)$ is
$$P(\geq k) = [A(p) + 1]^{2/p}\,[k + A(p)]^{-2/p}. \tag{3}$$
Since $P(k) = -\frac{d}{dk}P(\geq k)$, we finally obtain the degree distribution
$$P(k) = (\gamma - 1)\,[A(p) + 1]^{\gamma - 1}\,[k + A(p)]^{-\gamma}, \tag{4}$$
where the degree exponent $\gamma$ is
$$\gamma = \frac{2}{p} + 1. \tag{5}$$
As shown in Equation (4), the degree distribution follows a power law with a cutoff at small degrees.
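As a numerical cross-check of the mean-field results, the following Python sketch evaluates the cutoff constant, the degree exponent, the degree trajectory of a node, and the two distributions. The function names are illustrative and are not part of the original analysis. The formulas are those stated in the text for Equations (2) to (5).

```python
def cutoff_A(p):
    """Cutoff constant A(p) = 2 / [p (1 - p)], defined for 0 < p < 1."""
    return 2.0 / (p * (1.0 - p))


def degree_exponent(p):
    """Degree exponent gamma = 2/p + 1."""
    return 2.0 / p + 1.0


def degree_at_time(t, s, p):
    """Mean-field degree at time t of a node that entered at time s (k = 1 at t = s)."""
    a = cutoff_A(p)
    return (a + 1.0) * (t / s) ** (p / 2.0) - a


def cumulative_degree_dist(k, p):
    """P(>= k) = [(A + 1) / (k + A)]^(2/p)."""
    a = cutoff_A(p)
    return ((a + 1.0) / (k + a)) ** (2.0 / p)


def degree_dist(k, p):
    """P(k) = (gamma - 1) (A + 1)^(gamma - 1) (k + A)^(-gamma)."""
    a = cutoff_A(p)
    g = degree_exponent(p)
    return (g - 1.0) * (a + 1.0) ** (g - 1.0) * (k + a) ** (-g)
```

A quick consistency check is that `cumulative_degree_dist(degree_at_time(t, s, p), p)` returns `s / t` up to rounding, which is the relation used to pass from Equation (2) to Equation (3).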
Degree-dependent clustering coefficient
Next, we show an analytical solution for the degree-dependent clustering coefficient of the model via mean-field analysis based on [39].
The clustering coefficient [10, 12] of node $i$ is defined as
$$C_i = \frac{2M_i}{k_i(k_i - 1)}, \tag{6}$$
where $M_i$ is the number of edges among the neighbors of node $i$. Here we consider the time evolution of $M_i$. $M_i$ can increase only when Event II occurs and a path of length 2 is bypassed, so that a triangle is generated. This happens with probability $p \times q$, and bypassed paths of length greater than 2 need not be considered. $M_i$ of each node belonging to the new triangle then increases approximately by one. One of these nodes is selected at random, so node $i$ is that node with probability $1/N$. The other two nodes are selected by a random walk, so node $i$ is each of them with probability $k_i/\sum_j k_j$. Therefore, the time evolution of $M_i$ is
$$\frac{dM_i}{dt} = pq\left[\frac{1}{N} + \frac{2k_i}{\sum_j k_j}\right], \tag{7}$$
where $N = (1 - p)t$ and $\sum_j k_j = 2t$. Moreover, $k_i = [A(p) + 1](t/s)^{p/2} - A(p)$, as shown in Equation (2).
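The solution of the above equation with the initial condition $M_i(t = s) = 0$ is
$$M_i(t) = q\left\{2[A(p) + 1]\left[\left(\frac{t}{s}\right)^{p/2} - 1\right] - \frac{2 - p}{1 - p}\ln\frac{t}{s}\right\}, \tag{8}$$
where $A(p) = 2/[p(1 - p)]$. From Equation (2), since $k_i = [A(p) + 1](t/s)^{p/2} - A(p)$, this equation is rewritten as
$$M(k) = 2q(k - 1) - q(2 - p)A(p)\ln\frac{k + A(p)}{A(p) + 1}. \tag{9}$$
Substituting this equation into Equation (6), we finally get the degree-dependent clustering coefficient
$$C(k) = \frac{4q}{k} - \frac{2q(2 - p)A(p)}{k(k - 1)}\ln\frac{k + A(p)}{A(p) + 1}. \tag{10}$$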
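The degree-dependent clustering coefficient can be evaluated with a short Python sketch. The closed form coded here is an assumption: it is what one obtains by integrating the rate of change of $M_i$ described in the text (rate $pq$, one randomly chosen node and two random-walk nodes per new triangle) and substituting the degree trajectory of Equation (2). It should be checked against Equations (9) and (10) before use. Function names are illustrative.

```python
import math


def cutoff_A(p):
    """Cutoff constant A(p) = 2 / [p (1 - p)]."""
    return 2.0 / (p * (1.0 - p))


def triangles_of_degree(k, p, q):
    """Mean-field number of edges among the neighbours of a degree-k node.

    Assumed form: M(k) = 2q(k - 1) - q(2 - p) A ln[(k + A) / (A + 1)].
    """
    a = cutoff_A(p)
    return 2.0 * q * (k - 1.0) - q * (2.0 - p) * a * math.log((k + a) / (a + 1.0))


def clustering_of_degree(k, p, q):
    """C(k) = 2 M(k) / [k (k - 1)], defined for k > 1."""
    if k <= 1.0:
        raise ValueError("C(k) is defined only for k > 1")
    return 2.0 * triangles_of_degree(k, p, q) / (k * (k - 1.0))
```

Under this form, $k\,C(k)$ approaches $4q$ for large $k$, so the sketch predicts $C(k) \sim k^{-1}$ at high degree.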
Average clustering coefficient
Finally, we show a mathematical solution for the average clustering coefficient of the model. Since the average clustering coefficient is expressed as the summation of the product of the degree distribution and the degree-dependent clustering coefficient, it can be described as
(11)
where $K_m$ is the maximum degree. The maximum degree is the degree at which the cumulative probability equals $1/N$; thus $P(\geq K_m) = 1/N$, and from Equation (3), $K_m$ can be expressed as
$$K_m = N^{p/2}[A(p) + 1] - A(p). \tag{12}$$
Equation (11) is evaluated by numerical integration because it cannot be solved analytically.
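The numerical integration can be sketched in plain Python with a composite Simpson rule. Several points here are assumptions rather than statements from the text: the lower integration limit `k_low = 2.0` (the smallest degree at which a clustering coefficient is defined), the number of Simpson intervals, and the closed form used for $C(k)$, which is the one obtained by integrating the rate of change of $M_i$ as described above. All function names are illustrative.

```python
import math


def cutoff_A(p):
    return 2.0 / (p * (1.0 - p))


def degree_dist(k, p):
    """P(k) = (gamma - 1) (A + 1)^(gamma - 1) (k + A)^(-gamma), gamma = 2/p + 1."""
    a = cutoff_A(p)
    g = 2.0 / p + 1.0
    return (g - 1.0) * (a + 1.0) ** (g - 1.0) * (k + a) ** (-g)


def clustering_of_degree(k, p, q):
    """Assumed mean-field C(k) = 2 M(k) / [k (k - 1)] for k > 1."""
    a = cutoff_A(p)
    m = 2.0 * q * (k - 1.0) - q * (2.0 - p) * a * math.log((k + a) / (a + 1.0))
    return 2.0 * m / (k * (k - 1.0))


def max_degree(n_nodes, p):
    """K_m = N^(p/2) [A(p) + 1] - A(p), from P(>= K_m) = 1/N."""
    a = cutoff_A(p)
    return n_nodes ** (p / 2.0) * (a + 1.0) - a


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of intervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0


def average_clustering(n_nodes, p, q, k_low=2.0):
    """Integral of P(k) C(k) from k_low (assumed) to the maximum degree K_m."""
    k_max = max_degree(n_nodes, p)
    if k_max <= k_low:
        raise ValueError("maximum degree does not exceed the lower limit")
    return simpson(lambda k: degree_dist(k, p) * clustering_of_degree(k, p, q),
                   k_low, k_max)
```

Because $q$ enters $C(k)$ only as an overall factor, the value returned by `average_clustering` is proportional to `q` for fixed `n_nodes` and `p`.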
Estimation of model parameters
This model has two parameters p and q. In order to reproduce structural properties in metabolic networks, we need to estimate these parameters in real-world networks. In this section, we show how to estimate the parameters.
The case of the parameter p
Here, we consider the average degree $\langle k \rangle$ of this model.
The average degree is defined as $\langle k \rangle = \sum_i k_i / N$. As shown in the previous section, $N = (1 - p)t$ and $\sum_i k_i = 2t$. That is, the average degree of this model is
$$\langle k \rangle = \frac{2}{1 - p}. \tag{13}$$
From this equation, the parameter $p$ is estimated by
$$p = 1 - \frac{2}{\langle k \rangle}, \tag{14}$$
where $\langle k \rangle$ is obtained from real metabolic networks.
The case of the parameter q
Here, we consider the number of triangles $T$ of this model.
In this model, the number of triangles increases approximately by one with probability $p \times q$ at each time step, because a triangle is generated with probability $q$ when Event II occurs. That is,
$$T \simeq pqt. \tag{15}$$
Since $N = (1 - p)t$, this equation is rewritten as
$$T \simeq \frac{pq}{1 - p}N. \tag{16}$$
From this equation, the parameter $q$ is estimated by
$$q = \frac{(1 - p)\,T}{pN}, \tag{17}$$
where $T$ and $N$ are obtained from real metabolic networks.
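Both estimates can be computed directly from an undirected edge list, as in the following Python sketch. It uses $p = 1 - 2/\langle k \rangle$ and $q = (1 - p)T/(pN)$, which follow from $N = (1 - p)t$, $\sum_i k_i = 2t$, and $T \simeq pqt$. The function names and the toy graph in the usage note are illustrative assumptions, not data from the study.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list; self-loops are skipped."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(adj):
    """Number of triangles T; each triangle is seen once from each of its 3 nodes."""
    total = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                total += 1
    return total // 3


def estimate_p(adj):
    """p = 1 - 2 / <k>, with <k> the average degree."""
    mean_k = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
    return 1.0 - 2.0 / mean_k


def estimate_q(adj, p):
    """q = (1 - p) T / (p N)."""
    return (1.0 - p) * count_triangles(adj) / (p * len(adj))
```

For a toy graph made of a six-node ring `a-b-c-d-e-f-a` with the two extra edges `(a, c)` and `(b, e)`, there are 6 nodes, 8 edges, and 1 triangle, so the sketch gives `p = 0.25` and `q = 0.5`. Estimates outside the interval [0, 1] would indicate that the mean-field relations do not hold for the input network.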
Data set
We used the metabolic networks of 113 organisms, which were previously investigated in Reference [18]. These metabolic networks are represented by undirected graphs in which nodes and edges correspond to metabolites and substrate-product relationships, respectively. For example, consider a reaction S1+S2 → P1+P2. In this case, each of the substrates S1 and S2 is connected to each of the products P1 and P2; that is, the edge list is (S1, P1), (S1, P2), (S2, P1), (S2, P2). Stoichiometric coefficients in the metabolic data, if any, are neglected. In order to accentuate constitutive pathways, these networks exclude 13 ubiquitous metabolites that serve for energy exchange, exchange of a proton or a phosphate moiety, and so on. Specifically, the following metabolites are excluded: water, ATP, ADP, NAD, NADH, NADPH, carbon dioxide, ammonia, sulfate, thioredoxin, (ortho) phosphate (P), pyrophosphate (PP), and H+. We focused only on the largest connected component of each metabolic network in order to evaluate the structural properties more accurately.
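The network construction described above can be sketched in Python as follows. The string labels used for the 13 excluded metabolites, the input format (a list of substrate-list and product-list pairs), and all function names are hypothetical choices made for illustration; the source data may use different identifiers.

```python
from collections import deque

# Hypothetical labels for the 13 ubiquitous metabolites listed in the text.
UBIQUITOUS = frozenset({
    "water", "ATP", "ADP", "NAD", "NADH", "NADPH", "CO2", "ammonia",
    "sulfate", "thioredoxin", "phosphate", "pyrophosphate", "H+",
})


def reaction_edges(substrates, products, excluded=UBIQUITOUS):
    """Undirected substrate-product edges of one reaction; stoichiometry is ignored."""
    subs = [m for m in substrates if m not in excluded]
    prods = [m for m in products if m not in excluded]
    return {tuple(sorted((s, p))) for s in subs for p in prods if s != p}


def build_adjacency(reactions, excluded=UBIQUITOUS):
    """Adjacency sets of the metabolite graph from (substrates, products) pairs."""
    adj = {}
    for substrates, products in reactions:
        for u, v in reaction_edges(substrates, products, excluded):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def largest_component(adj):
    """Sub-graph induced by the largest connected component (breadth-first search)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}
```

For the reaction S1+S2 → P1+P2, `reaction_edges(["S1", "S2"], ["P1", "P2"])` yields the four substrate-product pairs given in the text, stored as sorted tuples so that each undirected edge appears once.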
Maximum likelihood method considering a cutoff
In order to obtain the degree exponent from real metabolic networks, we used the maximum likelihood method [27]. However, the original method does not account for the cutoff in the degree distribution, which we denote by the constant $A(p)$, so it is difficult to compare the degree exponent between the model and the real data. Consequently, we consider an extended maximum likelihood method:
$$\gamma = 1 + N\left[\sum_{i=1}^{N}\ln\frac{k_i + A(p)}{k_{\min} + A(p)}\right]^{-1}, \tag{18}$$
where $k_{\min}$ is the minimum degree in a network.
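A maximum likelihood estimate with a cutoff can be sketched in Python as below. The estimator coded here is an assumption: it is the continuous-variable maximum likelihood solution for a density proportional to $(k + A)^{-\gamma}$ on $k \geq k_{\min}$, namely $\gamma = 1 + n\,[\sum_i \ln((k_i + A)/(k_{\min} + A))]^{-1}$, which reduces to the standard estimator when $A = 0$. The function name and arguments are illustrative.

```python
import math


def mle_exponent_with_cutoff(degrees, a, k_min=None):
    """Estimate gamma for P(k) proportional to (k + a)^(-gamma), k >= k_min.

    degrees : iterable of observed node degrees
    a       : cutoff constant, e.g. A(p) = 2 / [p (1 - p)]
    k_min   : minimum degree; defaults to the smallest observed degree
    """
    degrees = list(degrees)
    if k_min is None:
        k_min = min(degrees)
    sample = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log((k + a) / (k_min + a)) for k in sample)
    if log_sum <= 0.0:
        raise ValueError("all degrees equal k_min; the exponent is undefined")
    return 1.0 + len(sample) / log_sum
```

With `a = 0` and degrees `[1, 2, 4]`, the log terms sum to `3 ln 2`, so the estimate is `1 + 1 / ln 2`.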
Null model
We used a null model to validate our model. The null model is an uncorrelated random scale-free network [28, 29], which is widely used. Assuming a power-law degree distribution, the null model gives a null hypothesis for the degree-dependent clustering coefficient $C(k)$ and the average clustering coefficient $C$:
$$C(k) = C = \frac{(\langle k^2 \rangle - \langle k \rangle)^2}{N\langle k \rangle^3}, \tag{19}$$
where $\langle \cdots \rangle$ denotes the average over all nodes. The values $\langle k \rangle$, $\langle k^2 \rangle$, and $N$ are obtained from real metabolic networks.
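The null-model expectation can be computed from a degree sequence alone. The Python sketch below uses the standard result for uncorrelated random networks, $C = (\langle k^2 \rangle - \langle k \rangle)^2 / (N \langle k \rangle^3)$, which does not depend on $k$. That this is the exact expression used in the study is an assumption, and the function name is illustrative.

```python
def null_model_clustering(degrees):
    """Clustering coefficient expected in an uncorrelated random network.

    Uses C = (<k^2> - <k>)^2 / (N <k>^3), where the averages run over all nodes.
    """
    degrees = list(degrees)
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return (mean_k2 - mean_k) ** 2 / (n * mean_k ** 3)
```

For a regular network of 10 nodes with degree 2, the averages are 2 and 4, so the function returns `(4 - 2)^2 / (10 * 8) = 0.05`.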
Indices for cyclic property
In order to characterize cyclic properties of networks, we define two indices inspired by the cyclic coefficient [30].
One is the cycle index of node i, defined as
(20)
where
(21)
and $\langle jh \rangle$ denotes all pairs of neighbors of node $i$. In addition, $k_i$ is the degree of node $i$. This index can be understood as an extended clustering coefficient: it considers cycles of length 3 or more, whereas the original clustering coefficient considers only cycles of length 3.
The second index is the cycle length index of node i, defined as
(22)
where is the length of the smallest cycle that passes through node i and its two neighbors j and h.
In order to characterize global cyclic properties, we focus on the average indices $\langle r_c \rangle$ and $\langle r_l \rangle$, which are the two indices averaged over all nodes, where $N$ is the total number of nodes. A small $\langle r_c \rangle$ indicates a low frequency of cycles in a network, and a small $\langle r_l \rangle$ means that cycles are long on the whole.
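The basic quantity behind both indices, the length of the smallest cycle through node $i$ and two of its neighbors $j$ and $h$, can be found by breadth-first search. The Python sketch below assumes that such a cycle uses the edges $(i, j)$ and $(i, h)$, so its length is 2 plus the shortest-path distance between $j$ and $h$ in the graph with node $i$ removed. The function name and the adjacency-set representation are illustrative, and the sketch does not implement the index formulas themselves.

```python
from collections import deque


def smallest_cycle_length(adj, i, j, h):
    """Length of the smallest cycle through i and its neighbours j and h.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns None when j and h are connected only through i (no such cycle).
    """
    if j == h or j not in adj[i] or h not in adj[i]:
        raise ValueError("j and h must be two distinct neighbours of i")
    dist = {j: 0}
    queue = deque([j])
    while queue:
        u = queue.popleft()
        if u == h:
            # Path j -> ... -> h avoiding i, closed by the edges (h, i) and (i, j).
            return dist[u] + 2
        for v in adj[u]:
            if v != i and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None
```

A triangle gives length 3, which is the only case counted by the ordinary clustering coefficient. A four-node ring gives length 4, and a star graph gives `None` for every pair of leaves.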
Ignoring cycles generated by the network representation
Using these indices, we investigated the cyclic properties of the metabolic networks of 113 organisms in order to test the hypotheses about cycles generated by the emergence of short-cut paths. However, we cannot directly use the metabolic networks analyzed in Reference [18], because they include cycles that are produced by the network representation itself. For example, consider a reaction S1+S2→P1+P2. In this case, a cycle of length 4 is generated, as shown in Figure 11.
In this manner, cycles due to the network representation arise when all metabolites in a reaction are different and both the left-hand side and the right-hand side consist of multiple metabolites. Therefore, we ignored such cycles when calculating the cycle indices.
Statistical analysis
In order to assess the significance of the observed correlations, we used Pearson's correlation coefficient $r$, Spearman's rank correlation coefficient $r_s$, and their associated P-values $P$. We consider a correlation between a structural property and optimal growth temperature significant when $P < 0.05$.