Both computational and experimental approaches have been used to determine the minimal gene set required to sustain a bacterial cell. Such studies have provided clues to the minimal cellular-function set needed for life. We evaluate a minimal cellular-function set directly, instead of a gene set.
We estimated the essentialities of KEGG pathway maps as the entities of cellular functions, based on comparative genomics and metabolic network analyses. The former examined the evolutionary conservation of each pathway map by homology searches, and detected "conserved pathway maps". The latter identified "organism-specific pathway maps" that supply compounds required for the conserved pathway maps. We defined both pathway maps as "autonomous pathway maps". Among the set of autonomous pathway maps, the one that could synthesize all of the biomass components (the essential constituents for the cellular component of Escherichia coli/Bacillus subtilis), and that was composed of a minimal number of pathway maps, was determined for each of E. coli and B. subtilis, as "minimal pathway maps". We consider that they correspond to a minimal cellular-function set. The network of minimal pathway maps, composed of 20 conserved pathway maps and 21 organism-specific pathway maps for E. coli, starts a sequence of catabolic processes from carbohydrate metabolism. The catabolized compounds are used for anabolism, thus creating materials for cell components and for genetic information processing.
Our analyses of these pathway maps revealed that those functioning in "genetic information processing" are likely to be conserved, but those for catabolism are not, reflecting an evolutionary aspect of cellular functions. Minimal pathway maps were compared with a systematic gene knockout experiment, other computational results and parasitic genomes, and showed qualitative agreement, with some reasonable exceptions due to the experimental conditions or differences of computational methods. Our method provides an alternative way to explore the minimal cellular function set.
Advances in sequencing technology have allowed the complete genome sequences of more than 750 prokaryotes and 20 eukaryotes to be determined thus far. One problem that can be addressed with this wealth of data is the identification of a minimal gene set, i.e., an estimation of the genes that are necessary and sufficient for sustaining a functional cell under certain conditions. This type of research has attracted considerable attention, not only for its scientific significance, but also for its industrial applications. Both computational and experimental approaches have been employed to estimate minimal gene sets.
In the computational approach, it is assumed that the genes shared by distantly related organisms are likely to be essential, and that a catalogue of these genes might comprise a minimal gene set for cellular life. Soon after the first two bacterial genomes, from Haemophilus influenzae and Mycoplasma genitalium, were sequenced, Mushegian and Koonin compared them and proposed 256 genes as a close estimate of a minimal gene set. After this pioneering work, many computational analyses were performed [5–14]. In general, computational analysis is likely to underestimate a minimal gene set, because it considers only orthologous genes. By contrast, for a substantial number of essential functions, non-orthologous, and in some cases non-homologous, genes play the same role in different organisms. The existence of two or more distinct (distantly related or non-homologous) sets of genes that are responsible for the same function in different organisms is called non-orthologous gene displacement (NOGD). Wider genome comparisons have revealed that NOGD occurs even with essential genes, including the central components of the translation, transcription and, especially, replication machineries.
In the experimental approach, the essential genes that are indispensable for cell growth are determined by large-scale gene disruption, and they are considered to comprise a minimal gene set. The first experimental attempt along this line was performed by Itaya, before the advent of comparative genomics. He investigated 79 random gene knockouts in Bacillus subtilis, and found that six of them were lethal. Based on this ratio, he estimated that the minimal genome size could be 318–562 kbp (270–470 genes, if one protein is 400 aa long). Many subsequent experimental reports utilized individual knockouts [16–18], RNA interference, transposon mutagenesis [20–25], antisense RNA [26, 27] and high-throughput gene disruption. Because a gene knockout may merely retard cell growth, the number of essential genes tends to be overestimated. In contrast, individual gene-knockout studies might underestimate the size of a minimal gene set for a metabolic system, because simultaneous gene knockouts tend to be lethal. In addition, the estimation of essential genes depends on the experimental conditions, such as the nutrients contained in the culture media.
Considering these difficulties in detecting a minimal gene set by both the computational and experimental approaches, we adopted a different strategy. Instead of a minimal gene set, we computationally explored a minimal cellular-function set. The cellular functions are functional modules composed of a group of genes, for example, glycolysis, TCA cycle and aminoacyl-tRNA biosynthesis. Since one of the final aims of minimal gene set determination is to reveal the functional components of a living cell, and these components are sometimes debated in terms of the combination of the cellular functions, detection of the minimal cellular-function set is a more direct method. In addition, this approach is more robust, because the acceptance of a given cellular function could be possible regardless of NOGD (see Methods). In this work, we regard the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway maps as the entities of the cellular functions. The KEGG database classifies the genes of sequenced genomes into more than 100 functional modules, named pathway maps, in which the reactions, substrates and products of the corresponding genes (proteins) are shown. Based on this information, not only the network of genes and compounds, but also the network of pathway maps (cellular-functions) can be illustrated.
Figure 1 shows the schematic procedure for determining a pathway map set. It is divided into three parts. In the first part, the pathway maps conserved among many genomes are determined, based on comparative genomics. In the second part, by checking the compounds imported to and synthesized in the conserved pathway maps, an initial pathway-map network is obtained. Finally, the pathway maps that provide the necessary compounds for the conserved pathway maps are connected to the pathway-map network, until there are no more suitable pathway maps to be added. We regard the final pathway-map network as a candidate for the minimal pathway map (cellular-function) set, and assess it in terms of biomass production and the number of components (explained later).
Representative genomes and orthologous genes
As of December 2008, more than 700 bacterial genomes had been completely sequenced. We divided them into three lineages: proteobacteria, firmicutes, and others, with reference to the classical phylogenetic classification (data from KEGG). We selected 10 representative genomes from each lineage, taking the genome size, the phylogenetic distance, and the annotation availability into account (see the Legend of Additional file 1: Table S1 for details). Thus there are three lineages, each composed of 10 genomes, for a total of 30 genomes (Additional file 1: Table S1). Within a given triplet composed of three genomes, one from each lineage, exhaustive homology searches were performed. Conserved orthologous genes among the three genomes were detected by the bidirectional best-hits method. All of the triplet combinations (10 × 10 × 10 = 1,000) were examined. The influence of the selection of representative genomes is shown in Additional file 1: Table S2.
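The triplet-based ortholog detection can be illustrated with a minimal Python sketch. It assumes that the best hit of every gene in every other genome has already been computed by a homology search, and that a gene triple counts as conserved only when all three genome pairs are bidirectional best hits; the text does not spell out this last criterion, so it is an assumption. The function names, genome labels and gene identifiers are hypothetical.

```python
from itertools import product

def bidirectional_best_hits(best_ab, best_ba):
    """Pairs (a, b) such that b is the best hit of a and a is the best hit of b."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

def triplet_orthologs(best):
    """best[(X, Y)] maps each gene of genome X to its best hit in genome Y.

    Assumption: a triple (a, b, c) is kept only if A-B, B-C and C-A
    are all bidirectional best hits.
    """
    ab = bidirectional_best_hits(best[("A", "B")], best[("B", "A")])
    bc = bidirectional_best_hits(best[("B", "C")], best[("C", "B")])
    ca = bidirectional_best_hits(best[("C", "A")], best[("A", "C")])
    return {(a, b, c)
            for (a, b) in ab
            for (b2, c) in bc
            if b == b2 and (c, a) in ca}

# Toy best-hit tables (hypothetical gene identifiers).
best = {
    ("A", "B"): {"a1": "b1", "a2": "b2"},
    ("B", "A"): {"b1": "a1", "b2": "a3"},   # b2 does not point back to a2
    ("B", "C"): {"b1": "c1"},
    ("C", "B"): {"c1": "b1"},
    ("C", "A"): {"c1": "a1"},
    ("A", "C"): {"a1": "c1"},
}
conserved = triplet_orthologs(best)

# One genome from each of the three lineages: 10 x 10 x 10 triplets.
proteobacteria = [f"P{i}" for i in range(10)]
firmicutes = [f"F{i}" for i in range(10)]
others = [f"O{i}" for i in range(10)]
triplets = list(product(proteobacteria, firmicutes, others))
```

In this toy case only the triple (a1, b1, c1) survives, and enumerating one genome per lineage yields the 1,000 triplets mentioned in the text.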
The KEGG database provides classifications of functionally identified genes of available genomes and presents them as more than 100 functional modules, "reference pathway maps". For each reference pathway map, the customized pathway map for each genome (organism) was constructed by highlighting the genes assigned to the pathway map. We evaluated the evolutionary conservation of each pathway map, based on the conserved orthologous genes derived as above. For the KEGG pathway maps of a given genome in a triplet, all of the conserved orthologous genes were assigned. Since the total number of genes assigned to a pathway map depends on the genome (organism), we defined the ortholog fraction (OF) of pathway map i as,
$$\mathrm{OF}_i = \frac{N_{i,\mathrm{orth}}}{N^{a}_{i,\min}}$$
where $N_{i,\mathrm{orth}}$ is the number of orthologous genes detected in pathway map i using a given triplet, and $N^{a}_{i,\min}$ is the minimum total number of genes appearing in pathway map i among the three genomes. $\mathrm{OF}_i$ was averaged over the 1,000 triplets. The pathway maps with high OF values are considered to be the "conserved pathway maps", and they form the "core" of a minimal cellular-function set. Referring to the descending order of OF values, we chose the top γ pathway maps and classified them as the conserved pathway maps_γ.
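As a minimal sketch of this calculation, the following Python code takes OF for one triplet as the number of orthologous genes in a pathway map divided by the smallest gene count of that map among the three genomes, averages it over triplets, and picks the top γ maps. The function names, the treatment of an empty pathway map (OF set to 0) and the example values are assumptions made for illustration.

```python
def ortholog_fraction(n_orth, genes_per_genome):
    """OF of one pathway map for one triplet.

    n_orth: number of conserved orthologous genes assigned to the map.
    genes_per_genome: gene counts of the map in the three genomes.
    """
    n_min = min(genes_per_genome)
    if n_min == 0:          # assumption: an empty map has OF = 0
        return 0.0
    return n_orth / n_min

def mean_ortholog_fraction(per_triplet):
    """Average OF over triplets; per_triplet is a list of (n_orth, counts)."""
    values = [ortholog_fraction(n, counts) for n, counts in per_triplet]
    return sum(values) / len(values)

def conserved_pathway_maps(of_by_map, gamma):
    """Top-gamma pathway maps in descending order of mean OF."""
    ranked = sorted(of_by_map, key=of_by_map.get, reverse=True)
    return ranked[:gamma]

# Two hypothetical triplets for one pathway map: 4/8 and 6/8.
example_of = mean_ortholog_fraction([(4, (10, 8, 12)), (6, (9, 8, 8))])

# Hypothetical mean OF values for three pathway maps.
top2 = conserved_pathway_maps(
    {"ribosome": 0.90, "glycolysis": 0.30, "riboflavin": 0.95}, gamma=2
)
```

Here `example_of` is 0.625 and `top2` is `["riboflavin", "ribosome"]`.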
Reconstruction of the metabolic network in a pathway map
A metabolic network within each pathway map was reconstructed, based on the enzymatic reactions of a particular organism provided by the KEGG API service and the reaction_mapformula.lst file on the KEGG FTP site, where the substrates and the products for all reactions in a pathway map are described, mainly as binary relations (Figure 2a). We used Escherichia coli (E. coli) and Bacillus subtilis (B. subtilis) as the model organisms in this work, because many experimental and computational data are available for comparison. In the network reconstruction process, we regarded identical compounds as one node, and a substrate-product relation as an edge. Therefore a set of reactions in a pathway map was transformed into a unique network (Figure 2b). The reversibility or irreversibility assigned to each reaction in the reaction_mapformula.lst file indicates the direction of compound flow in the reconstructed network. We regarded compounds at the upstream termini as the initial substrates (circles in Figure 2b), and all the compounds in the network as products of the pathway map. When a network terminus consists of several compounds that are connected by reversible reactions, any one of them can be the initial substrate at the terminus (Figure 2c). We assumed that if all the initial substrates of a pathway map were provided, then all of the reactions would take place there, and all products in the pathway map would be synthesized. Cofactors were taken into account only if they were described in the chemical reactions as substrates or products. Because such cases are unlikely to be common, cofactors were implicitly assumed to be abundant in the cell, and used if they were required. When a compound is only produced or only consumed in the pathway map, a gap (dead end) exists. At this stage we did not take the dead ends of the pathway maps into account. The only-produced compounds are the downstream termini of the network, and the only-consumed compounds are the initial substrates.
When a chemical reaction cannot be described by a single binary relation, as in A + B → C + D, both A and B are regarded as substrates for each of C and D. The network in this case is shown in Figure 2d.
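The reconstruction rules above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: reactions are given as hypothetical (substrates, products, reversible) tuples rather than parsed from reaction_mapformula.lst, and an upstream terminus is taken to be a group of mutually reachable compounds with no incoming edge from outside the group, so that any one member of a group can serve as the initial substrate.

```python
from collections import defaultdict

def build_network(reactions):
    """Compounds become nodes; every substrate-product pair becomes an edge.

    reactions: iterable of (substrates, products, reversible).
    A + B -> C + D yields the edges A->C, A->D, B->C and B->D.
    """
    nodes, edges = set(), defaultdict(set)
    for substrates, products, reversible in reactions:
        nodes.update(substrates)
        nodes.update(products)
        for s in substrates:
            for p in products:
                edges[s].add(p)
                if reversible:
                    edges[p].add(s)
    return nodes, edges

def reachable(start, edges):
    """All compounds reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def initial_substrate_groups(nodes, edges):
    """Upstream termini: mutually reachable groups with no upstream compound."""
    reach = {n: reachable(n, edges) for n in nodes}
    groups, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        group = {m for m in reach[n] if n in reach[m]}
        assigned |= group
        fed_from_outside = any(group & edges.get(u, set()) for u in nodes - group)
        if not fed_from_outside:
            groups.append(sorted(group))
    return groups

# Toy pathway map (hypothetical compounds):
# A <-> B (reversible), B -> C, and D + E -> C + F.
toy_reactions = [
    (["A"], ["B"], True),
    (["B"], ["C"], False),
    (["D", "E"], ["C", "F"], False),
]
toy_nodes, toy_edges = build_network(toy_reactions)
toy_termini = initial_substrate_groups(toy_nodes, toy_edges)
```

For the toy map the termini are `[["A", "B"], ["D"], ["E"]]`: either A or B can feed the reversible terminus, while D and E are both required because they enter the same multi-substrate reaction.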
Construction of the pathway-map network
We randomly selected an initial substrate (in the above case, exceptionally, both A and B) in a pathway map, and tried to connect it to the same compound in another pathway map, regardless of whether it was a dead-end compound. If this was possible, the two pathway maps were linked with a directed edge, and a new, larger network with a new set of initial substrates and products was created. By repeating this procedure for all conserved pathway maps_γ, an initial pathway-map network was constructed. It should be noted that the initial substrates and the configuration of the initial pathway-map network depended on the order in which the initial substrates and the products to be connected were selected. In our method, the only-consumed dead-end compounds were supplied from other pathway maps, and the only-produced dead-end compounds were assumed to be initial substrates for other pathway maps or to be removed by virtual transporters.
Extension of the pathway-map network
A pathway map that can provide initial substrates of the pathway-map network was chosen randomly from those not yet included in the network, and was connected to the pathway-map network to create a new one. In this process, only a part of the pathway map, i.e., the minimal sequence of chemical reactions necessary to synthesize the initial substrate, was connected, so that reactions superfluous to the initial pathway-map network were excluded. Since the added pathway maps are non-conserved and depend on the organism, we called them the organism-specific pathway maps. Referring to the E. coli and the B. subtilis pathway maps in turn, this selection and connection process was repeated until there were no more pathway maps to be connected. The resultant network defined the network of "autonomous pathway maps", because it was expected to synthesize most of the necessary compounds inside the network. The initial substrates of the autonomous pathway maps were defined as nutrients imported from the extracellular environment. We generated 10,000 autonomous pathway maps from the initial pathway-map construction process using different random seeds. As the autonomous pathway-map network depends on the γ parameter (the number of conserved pathway maps), we started from the conserved pathway maps_γ and constructed 10,000 patterns of the autonomous pathway maps_γ at each γ.
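The extension step can be sketched as a seeded random loop. The sketch below is a simplification under stated assumptions: each pathway map is reduced to a set of initial substrates ("needs") and a set of compounds it synthesizes ("products"), and a selected map is connected as a whole, whereas the procedure described above connects only the reactions needed to synthesize the missing substrate. The map names, compound names and data layout are hypothetical.

```python
import random

def extend_network(maps, conserved, seed):
    """Grow a pathway-map network until no remaining map supplies a missing substrate.

    maps: {name: {"needs": set, "products": set}} (hypothetical layout).
    Returns the selected maps and the substrates still unmet, which are
    interpreted as nutrients imported from the environment.
    """
    rng = random.Random(seed)
    selected = set(conserved)
    while True:
        produced = set().union(*(maps[m]["products"] for m in selected))
        needed = set().union(*(maps[m]["needs"] for m in selected)) - produced
        candidates = sorted(
            m for m in maps
            if m not in selected and maps[m]["products"] & needed
        )
        if not candidates:
            return selected, needed
        selected.add(rng.choice(candidates))

# Toy data: one conserved map and three candidate maps.
toy_maps = {
    "aminoacyl-tRNA": {"needs": {"amino acid"}, "products": {"aa-tRNA"}},
    "amino acid synthesis": {"needs": {"pyruvate"}, "products": {"amino acid"}},
    "glycolysis": {"needs": {"glucose"}, "products": {"pyruvate"}},
    "unrelated": {"needs": {"x"}, "products": {"y"}},
}
autonomous, nutrients = extend_network(toy_maps, {"aminoacyl-tRNA"}, seed=0)
```

In this toy case the result does not depend on the seed: the network grows to three maps and glucose remains as the only imported nutrient. With real data, different seeds would give different autonomous networks, which is why many runs are generated at each γ.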
Estimation of minimal pathway maps
We assumed that the autonomous pathway maps must synthesize, by themselves, at least the compounds indispensable for the organism. The autonomous pathway maps obtained were assessed to determine whether they satisfied this condition. As the indispensable compounds, we employed the biomass components estimated by Feist et al. for E. coli and by Oh et al. for B. subtilis, the numbers of which are 61 and 64, respectively. The biomass components are the major and essential constituents that make up the cellular content of organisms. For E. coli, they were determined quantitatively using the dry weight composition data for an average E. coli B/r cell, which grew exponentially at 37°C under aerobic conditions in a glucose minimal medium. Among the set of autonomous pathway maps, those that could synthesize all of the biomass components were selected, and denoted as the "autonomous pathway maps*". Minimal pathway maps were defined as the autonomous pathway maps* composed of a minimal number of pathway maps.
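The selection rule can be written as a two-stage filter. The following Python sketch assumes each candidate is stored as a hypothetical record with its pathway maps and the set of compounds it can synthesize; it keeps the candidates that cover every biomass component and then returns those with the fewest pathway maps.

```python
def minimal_pathway_maps(candidates, biomass):
    """Candidates that synthesize all biomass components with the fewest maps.

    candidates: list of {"maps": set, "products": set} (hypothetical layout).
    biomass: set of indispensable compounds.
    """
    feasible = [c for c in candidates if biomass <= c["products"]]
    if not feasible:
        return []
    smallest = min(len(c["maps"]) for c in feasible)
    return [c for c in feasible if len(c["maps"]) == smallest]

# Toy candidates (hypothetical maps m1..m4 and compounds x, y, z).
toy_candidates = [
    {"maps": {"m1", "m2", "m3"}, "products": {"x", "y", "z"}},
    {"maps": {"m1", "m2"}, "products": {"x", "y"}},
    {"maps": {"m1", "m4"}, "products": {"x", "z"}},   # misses y
]
toy_minimal = minimal_pathway_maps(toy_candidates, biomass={"x", "y"})
```

Only the two-map candidate that produces both x and y is returned; the three-map candidate is feasible but not minimal, and the third candidate fails the biomass condition.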
Results and discussion
Pathway maps with high ortholog fractions
The ortholog fraction was calculated for each pathway map (see Table 1). It revealed that the pathway maps classified as "genetic information processing", i.e., "ribosome", "aminoacyl-tRNA biosynthesis", "RNA polymerase" and "protein export", have high OF values (Table 1. Also refer to the classification columns in Table 2). "DNA polymerase" is also involved in "genetic information processing", but its OF value is lower than the others. This pathway map includes DNA polymerases I-V and DNA polymerase bacteriophage-type. The genes encoding them, except for DNA polymerase III, are not well conserved. The most conserved pathway map is "riboflavin metabolism", which is a member of the "cofactors and vitamins" category. The other members belonging to this class, i.e., "one carbon pool by folate", "pantothenate and CoA biosynthesis" and "porphyrin and chlorophyll metabolism", also have high OF values. This may reflect the importance of these compounds because cofactors and vitamins are involved in many kinds of reactions.
Decision of minimal pathway maps
We defined the top γ pathway maps in descending order of OF as the conserved pathway maps_γ. Starting from the conserved pathway maps_γ, 10,000 sets of organism-specific pathway maps_γ were determined by expanding the pathway-map networks as described in the Methods. The autonomous pathway maps_γ of E. coli and B. subtilis, at each γ, consisted of the conserved pathway maps_γ and the organism-specific pathway maps_γ. These sets of autonomous pathway maps_γ were, in turn, used to determine the minimal pathway maps. Initially, the autonomous pathway maps_γ that could synthesize all the biomass components were selected (i.e., the autonomous pathway maps_γ*). In Figure 3, the least number of pathway maps among the autonomous pathway maps_γ* at each γ is plotted against the number of conserved maps, γ. This least number is at a minimum when γ is 20 (E. coli) and 25 (B. subtilis). Therefore, minimal pathway maps for E. coli and B. subtilis were determined from the autonomous pathway maps_20* and the autonomous pathway maps_25*, respectively, as those composed of the minimal number of pathway maps.
In Figure 3, the least number of the autonomous pathway maps_γ* for B. subtilis only increases gradually with γ. This indicates that γ = 25 is the smallest value at which the autonomous pathway maps can synthesize all the biomass components. Note that synthesizing all the biomass components is easy if the total number of autonomous pathway maps is large, because the larger the total number is, the greater the number of products. By contrast, if the initial conserved pathway maps to be extended are not appropriately prepared, the extension process will terminate before the pathway-map network develops. Thus, the conserved pathway maps_25 of B. subtilis provide a good starting point, so that the resultant autonomous pathway maps can synthesize all the biomass components. In the case of E. coli, the autonomous pathway maps_γ are able to synthesize all the biomass components at γ = 10, but the least number of the autonomous pathway maps_γ* decreases as γ increases, and reaches a minimum value at γ = 20. After that, it increases as in B. subtilis. This also indicates that the conserved pathway maps_20 of E. coli provide a good starting point for obtaining autonomous pathway maps* composed of a necessary and sufficient number of pathway maps.
Conserved pathway maps
The components of minimal pathway maps for each of the organisms are shown in Table 2. The conserved pathway maps are denoted as "C" (Conserved) in the "MPM" (Minimum Pathway Maps) columns in Table 2. In the left column of the Table, the KEGG classifications of the pathway maps are also indicated. The conserved pathway maps of both organisms contain those classified as "carbohydrate", "lipid", "nucleotide", "amino acid", "glycan", "cofactors and vitamins" and "genetic information processing". They do not, however, include any pathway maps classified as "energy", "other amino acids", "polyketides and nonribosomal peptides", "secondary metabolites", "xenobiotics", "environmental information processing" and "cellular processes".
Organism-specific pathway maps
The organism-specific pathway maps in the minimal pathway maps are denoted as "OS" (Organism Specific) in the "MPM" columns in Table 2. In the "carbohydrate" category, "glycolysis/gluconeogenesis", "citrate cycle (TCA cycle)" and "pentose phosphate pathway" were selected. It is also notable that in the "amino acid" category, the components of minimal pathway maps are identical in E. coli and B. subtilis, although in B. subtilis, "alanine and aspartate metabolism" and "arginine and proline metabolism" were identified as conserved pathway maps, whereas they were chosen as organism-specific pathway maps in E. coli. This indicates that even though their degrees of conservation are marginal, they are necessary to constitute the minimal pathway maps. In the "other amino acids" category, β-amino acid (only in E. coli) and D-amino acid metabolisms were required for the minimal pathway maps. In addition, pathway maps in the "energy", "lipid", "cofactors and vitamins" and "secondary metabolites" (only in B. subtilis) categories were also selected. In the "carbohydrate" and "energy" categories, all of the pathway maps other than "aminosugars metabolism" were selected as organism-specific pathway maps, not as conserved pathway maps. This result means that catabolism is essential for cellular life, but is not conserved throughout the bacterial genomes, probably because bacteria have evolved to adapt to a variety of environments. During the evolutionary process, bacteria that succeeded in obtaining or adjusting genes to effectively utilize the nutrients around them could survive. Consequently, their catabolic pathway maps have diverged.
As described in the Methods, 10,000 sets of autonomous pathway maps were determined for each of E. coli and B. subtilis at every γ, to decide the minimal pathway maps. The autonomous pathway maps thus obtained show various combinations of pathway maps, reflecting the variety of the organism-specific pathway maps selected. Only a part of them, named "autonomous pathway maps*", can synthesize all biomass components required to be an autonomous cell. To characterize the organism-specific pathway maps included in the minimal pathway maps, we calculated the percentage appearance of each pathway map in about 54 sets of autonomous pathway maps_20* in E. coli and 3,391 sets of autonomous pathway maps_25* in B. subtilis ("%" columns in Table 2). The discrepancy in the total number of the autonomous pathway maps* for each organism is due to the difference in the total number of conserved pathway maps. It is difficult to obtain autonomous pathway maps* using a small number of conserved pathway maps. Most organism-specific pathway maps in the minimal pathway maps are adopted almost always in the autonomous pathway maps*. For example, the organism-specific pathway maps in the "amino acid" category all appear with a frequency of more than, or nearly equal to, 90%. By contrast, some examples show that participation in the minimal pathway maps and the percentage of appearance in the autonomous pathway maps* are not necessarily related: some participants in the minimal pathway maps, e.g., "citrate cycle (TCA cycle)", are not frequently accepted in the autonomous pathway maps*, whereas other pathway maps, e.g., "methane metabolism" in E. coli, are accepted with a very high ratio, but do not participate in the minimal pathway maps. This is partly explained by the number of products and initial substrates of the pathway map.
When a pathway map includes several products that can be initial substrates of the pathway-map network, this pathway map will be frequently accepted in the "extension of the pathway-map network" process. However, if the newly connected pathway map requires many kinds of initial substrates, the extended pathway-map network needs many additional pathway maps to supply those initial substrates. Consequently, the resultant autonomous pathway maps* cannot be minimal pathway maps. We found that the "reductive carboxylate cycle (CO2 fixation)" is almost equivalent (reverse reaction) to the "citrate cycle (TCA cycle)" from the viewpoint of the logistics of compounds, and they were alternatively selected in many cases (the sum of their percentages of appearance is around 100%). However, there are dead ends in the "reductive carboxylate cycle (CO2 fixation)" (at least ATP citrate synthase and 2-oxoglutarate synthase are missing in E. coli). Consequently, this pathway map requires more initial substrates than the "citrate cycle (TCA cycle)". Furthermore, we noticed that 64 pathway maps (excluding the conserved pathway maps) appeared at least once in the autonomous pathway maps_20* in E. coli (Table 2). Considering that the number of organism-specific pathway maps in the minimal pathway maps is 21, only one third (21/64) can be components of the minimal pathway maps. This ratio is almost the same for B. subtilis (16/52). These results indicate that there are many possibilities to synthesize the biomass components from extracellular nutrients, i.e., many solutions to realize the autonomous pathway maps*, if the total number of pathway maps is unlimited.
Network of minimal pathway maps
The network of minimal pathway maps of E. coli is shown in Figure 4. The compounds exchanged between pathway maps, and nutrients imported from the extracellular environment are listed in Additional file 2: Table S3. The network starts from pathway maps mostly in the "carbohydrate" (orange nodes) and "energy" (brown nodes) categories that function in catabolism. "Glycolysis/gluconeogenesis", "pentose phosphate pathway" and "propanoate metabolism" catabolize imported nutrients into products and provide them as the initial substrates for "citrate cycle (TCA cycle)", "carbon fixation" and "C5-Branched dibasic acid metabolism", and further catabolism occurs there. Subsequently, the products are used for anabolism. Most pathway maps of the "amino acid" category (red nodes) utilize products of pathway maps in the "carbohydrate" or "energy" category, but a few of them utilize products of pathway maps in the "nucleotide" (purple nodes), "other amino acids" (magenta nodes) or "genetic information processing" (yellow nodes) category. Around one of the termini of the network, "aminoacyl-tRNA biosynthesis" (yellow rectangle) exists. Most of the amino acids fed into this pathway map are synthesized by the pathway maps in the "amino acid" category. However, glycine and cysteine are provided from "glutathione metabolism", glutamine is from "nitrogen metabolism" and glutamate is from "D-glutamine and D-glutamate metabolism". This is due to the overlaps of pathway maps, that is, some of the amino acids synthetic reactions occur in pathway maps that are not classified as "amino acid". Note that the number of substrates into "aminoacyl-tRNA biosynthesis" is 21. This is because "aminoacyl-tRNA biosynthesis" synthesizes not only 20 aminoacyl-tRNAs, but also N-formylmethionyl-tRNA using 10-formyl-THF. Most pathway maps belonging to the "lipid" (blue nodes) or "cofactors and vitamins" (green nodes) category are closely connected in the network. 
All the pathway maps are connected into one network, except for four pathway maps belonging to the "genetic information processing" category. Because no reactions are described for them in the reaction_mapformula.lst file, they cannot be linked to the other pathway maps via substrates and products. This overall compound flow looks reasonable for an ideal metabolism of a living cell.
Comparison with experimental data
We compared the minimal pathway maps of E. coli with the data in the "Profiling of E. coli Chromosome" (PEC) database, which presents 302 essential genes. The minimal pathway maps of B. subtilis were likewise compared with the result of a systematic gene knockout experiment of B. subtilis that detected 271 essential genes. We defined the "experimentally-derived essential pathway maps" as those including at least two essential genes. The threshold of two was selected because, using this value, we could define a number of "experimentally-derived essential pathway maps" comparable to the number of minimal pathway maps. It should be noted that approximately half of the genes encoded in the E. coli genome are assigned to two or more pathway maps. Therefore, a pathway map that includes only one essential gene does not necessarily represent an experimentally-derived essential pathway map. We also employed the definition by which we could achieve the highest correlation between our results and the experimental results. If the two results were fundamentally different, no correlation would emerge even with a threshold chosen in this biased way. We discuss other definitions in the "Remarkable features of network construction" section.
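The threshold-based conversion from essential genes to essential pathway maps can be sketched as a simple count. The gene identifiers and the gene-to-map assignment below are hypothetical; in practice the assignment would come from the KEGG pathway annotation of the organism.

```python
from collections import Counter

def essential_pathway_maps(gene_to_maps, essential_genes, threshold=2):
    """Pathway maps that contain at least `threshold` essential genes.

    gene_to_maps: {gene: iterable of pathway maps}; a gene may belong
    to several maps and is then counted once in each of them.
    """
    counts = Counter(
        m for g in essential_genes for m in gene_to_maps.get(g, ())
    )
    return {m for m, n in counts.items() if n >= threshold}

# Toy assignment (hypothetical genes g1..g3).
toy_assignment = {
    "g1": ["ribosome"],
    "g2": ["ribosome", "glycolysis"],
    "g3": ["TCA cycle"],
}
toy_essential = {"g1", "g2", "g3"}
strict = essential_pathway_maps(toy_assignment, toy_essential, threshold=2)
loose = essential_pathway_maps(toy_assignment, toy_essential, threshold=1)
```

With a threshold of two only "ribosome" qualifies, whereas a threshold of one also admits "glycolysis" through the multiply assigned gene g2, which illustrates why a single essential gene is weak evidence for an essential pathway map.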
Among the 114 pathway maps presented in the KEGG database, 38 were determined to be experimentally essential for E. coli. Among those, 30 are in common with the minimal pathway maps of E. coli. Similarly, 37 were determined to be experimentally essential for B. subtilis. They include 27 pathway maps that are in common with the minimal pathway maps of B. subtilis (see the pathway maps marked by "C" or "OS" in the "MPM" columns and "X" in the "Exp" columns in Table 2). We calculated the Jaccard coefficients between our results and the experimental results (the number of pathway maps in both results/the number of pathway maps appearing in either or both) for every minor classification of metabolism in Table 2, and summarized them in Table 3. In E. coli, the "carbohydrate", "nucleotide", "amino acid", "glycan" and "cofactors and vitamins" categories are very consistent (Jaccard coefficients ≥ 0.5), but "energy", "lipid", "other amino acids" and "secondary metabolites" are less consistent (< 0.5). The Jaccard coefficients for the whole set of pathway maps (all categories) are 0.61 for E. coli and 0.53 for B. subtilis ("All" row in Table 3). Some of the discrepancies are due to the experimental environment. The inactivation experiments were carried out with a rich medium. When we derived minimal pathway maps, however, the initial substrates were the materials that cannot be synthesized in any pathway map. Catabolism ("carbohydrate" and "energy" categories) and the "cofactors and vitamins" category are considered to be strongly affected by nutrient composition. In the experiment, some cofactors, e.g., folate and pho-CoA, might be transported from the medium into the cell and thus do not need to be synthesized, while we required minimal pathway maps to synthesize as many requisite compounds as possible.
The very high consistencies seen in the "amino acid" and "nucleotide" categories might show that in the experiments, amino acids and nucleotides were provided in the LB medium, but not abundantly.
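The Jaccard coefficient used here can be computed either from two sets of pathway maps or directly from their sizes and overlap. The short Python sketch below checks the "All" values quoted above from the counts given in the text, assuming that the minimal pathway maps comprise 41 maps in both organisms (20 conserved plus 21 organism-specific for E. coli; 25 conserved plus 16 organism-specific for B. subtilis, the latter inferred from the 16/52 ratio reported earlier). The function names are illustrative.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of pathway maps."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_from_counts(n_a, n_b, n_common):
    """Same coefficient from set sizes: |A and B| / (|A| + |B| - |A and B|)."""
    return n_common / (n_a + n_b - n_common)

# Minimal pathway maps (41, an inferred total) versus experimental results.
ecoli_exp = jaccard_from_counts(41, 38, 30)      # 30/49
bsub_exp = jaccard_from_counts(41, 37, 27)       # 27/51

# Small set-based example with hypothetical map names.
toy = jaccard({"ribosome", "glycolysis", "TCA cycle"},
              {"ribosome", "glycolysis", "purine metabolism"})
```

Rounded to two decimals, the two count-based values are 0.61 and 0.53, in agreement with the "All" row, and the toy example gives 2/4 = 0.5.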
Comparison with computational data
The results of computational studies, the persistent genes of E. coli and the functional genomic core of B. subtilis, were compared with our results. The former results are the orthologous genes conserved in most of 228 bacterial genomes, and the latter results are the genes adopting highly biased codon usage. These two data sets were selected from many studies on the minimal gene sets, because they are based on E. coli and B. subtilis genomes, anonymously accessible, and easy to convert from the original ID to the KEGG ID. The "computationally-derived essential pathway maps" were assigned by the same process described in the "Comparison with experimental data" section, but we used thresholds of 3 and 1 for E. coli and B. subtilis, respectively, because these values yielded a number of computationally-derived essential pathway maps comparable to the number of minimal pathway maps.
For E. coli and B. subtilis, 42 and 46 computationally-derived essential pathway maps were defined, respectively. Among them, 31 and 27 were in common with the minimal pathway maps of E. coli and B. subtilis, respectively. The good consistency observed in E. coli (a total Jaccard coefficient of 0.60; see Table 3) could be due to the similarity of the methodologies that detect minimal pathway maps and persistent genes. Both methods relied on the conservation of genes among many bacterial genomes. The lower consistency in B. subtilis (0.45) might be explained as follows. We noticed that the computationally-derived essential pathway maps for B. subtilis were abundant in catabolism ("carbohydrate" and "energy" categories). Such pathway maps account for 35% of all computationally-derived essential pathway maps. On the other hand, the proportions of catabolic pathway maps in our results and in the experimental results are only 17% and 22%, respectively. This is probably because the genes used in catabolism are likely to adopt highly biased codon usage, so as to be expressed abundantly and ubiquitously, so that the bacteria can adapt to the specific environment in which they must survive.
Comparison with parasitic genomes
Buchnera aphidicola (B. aphidicola) strains (APS, Sg, Bp, Cc, 5A and Tuc7) and Wigglesworthia glossinidia (W. glossinidia) are parasitic organisms and phylogenetically close to E. coli. We compared the minimal pathway maps of E. coli with their genomes. The pathway maps present in these genomes were taken from the KEGG database. In Table 2, the pathway maps of B. aphidicola APS are shown in the "Par" (Parasite) column. The Jaccard coefficients of the minimal pathway maps of E. coli against these data were calculated and are shown in the "Parasite" column in Table 3.
In the "carbohydrate" and "lipid" categories, the coefficients were lower than those against the experimental and the computational results, but in the "energy" and "other amino acids" categories, the coefficients were higher than those against the experimental and the computational results. The total Jaccard coefficients against the data of B. aphidicola APS, Sg, Bp, Cc, 5A, Tuc7 and W. glossinidia were 0.59, 0.59, 0.59, 0.56, 0.59, 0.60 and 0.52, respectively, indicating that the minimal pathway maps show better consistency with the data of the B. aphidicola strains than with the data of W. glossinidia. The minimal pathway maps were also compared with the KEGG pathway maps of Mycoplasma genitalium, which is not a close relative of E. coli. The total Jaccard coefficient was 0.49.
The high Jaccard coefficients between the minimal pathway maps of E. coli and the pathway maps of several parasitic genomes imply that the minimal cellular functions are represented in the minimal pathway maps of E. coli. In addition, the slight differences seen in the Jaccard coefficients may reflect the phylogenetic distances between E. coli and the parasites.
Remarkable features of network construction
We demonstrated good consistency between the minimal pathway maps and the experimental, computational and parasitic data. However, we noticed that these results depended on how the essential genes were converted to essential pathway maps. We used the number of essential genes in each pathway map to define the essential pathway maps. When we instead employed the fraction of essential genes in each pathway map, as was done to define the conserved pathway maps (OF value), the results were slightly different. In both the experimental and computational results for E. coli, the fractions of essential genes for all pathway maps in "genetic information processing" were higher than 30% (data not shown). However, the fractions of essential genes for all pathway maps in catabolism (the "carbohydrate" and "energy" categories) were lower than 30%, except for "C5-Branched dibasic acid metabolism" in the computational results (50%). This is because the computationally-determined essential genes are the conserved genes, and genes in "genetic information processing" are strongly conserved. These genes also tend to encode proteins that have no substitutes, e.g., each of the ribosomal proteins, so disrupting them is likely to be lethal. Evidently, a computational method that refers only to gene conservation can hardly clarify the significance of catabolism. In contrast, our method first identifies the conserved pathway maps that include conserved orthologous genes, and subsequently identifies the pathway maps that supply the substrates for the conserved pathway maps. In this way, the pathway maps in catabolism are naturally introduced. Although this method depends on the reliability of the chemical reaction or pathway-map network data employed, the framework is very simple.
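The difference between the number-based and fraction-based criteria can be illustrated with a small sketch. Everything below is an assumption made for illustration: the map names and gene counts are invented toy values, the helper names are hypothetical, and the use of inclusive thresholds (`>=`) is our choice rather than a detail confirmed by the analysis. The thresholds of 3 genes and 30% are borrowed from the text only as example parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathwayMap:
    name: str
    total_genes: int      # genes assigned to the map
    essential_genes: int  # of those, genes judged essential


def essential_by_number(maps, min_genes):
    """Names of maps holding at least `min_genes` essential genes."""
    return {m.name for m in maps if m.essential_genes >= min_genes}


def essential_by_fraction(maps, min_fraction):
    """Names of maps whose essential-gene fraction reaches `min_fraction`."""
    return {
        m.name
        for m in maps
        if m.total_genes > 0
        and m.essential_genes / m.total_genes >= min_fraction
    }


# Invented toy counts, NOT taken from KEGG or from the paper's tables.
toy_maps = [
    PathwayMap("toy_translation_map", total_genes=55, essential_genes=50),
    PathwayMap("toy_large_catabolic_map", total_genes=40, essential_genes=6),
    PathwayMap("toy_small_map", total_genes=4, essential_genes=2),
]

by_number = essential_by_number(toy_maps, min_genes=3)
by_fraction = essential_by_fraction(toy_maps, min_fraction=0.30)

print(sorted(by_number))    # ['toy_large_catabolic_map', 'toy_translation_map']
print(sorted(by_fraction))  # ['toy_small_map', 'toy_translation_map']
```

In this toy case a large catabolic map with few essential genes passes the number criterion but fails the fraction criterion, which mirrors the sensitivity described above.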
The significance of catabolism may still be under discussion, and one may consider that the genes in "genetic information processing" plus a small number of additional genes are enough to constitute a living cell in very rich media, even without any genes for catabolism. We cannot assess the possibility of such a virtual organism at this stage, but we can point out that genes for catabolism make up a considerable portion of each bacterial genome (7% in E. coli), and that our method is effective for shedding light on their significance. We consider that this methodology provides an alternative way to explore a minimal cellular-function set, alongside current experimental and computational approaches.
The method to evaluate the minimal KEGG pathway maps proposed here identified 41 pathway maps, including 20 conserved pathway maps and 21 organism-specific pathway maps for E. coli, and 41 pathway maps, including 25 conserved pathway maps and 16 organism-specific pathway maps for B. subtilis. The conserved pathway maps include many pathway maps classified as "genetic information processing", whereas the organism-specific pathway maps mainly include pathway maps for catabolism, reflecting evolutionary aspects. The consistencies between minimal pathway maps and the experimental, computational, and parasitic data indicate that our procedure is realistic.
In the case of KEGG data analysis, since our method to detect organism-specific pathway maps is applicable only to enzymatic reactions, it is insufficient to estimate the essentialities of "membrane transport", "signal transduction" or "cell motility", for which the chemical reactions are not provided in the reaction_mapformula.lst file. However, our method can be applied to other biochemical reaction data instead of the KEGG data, e.g., the genome-scale metabolic model of E. coli, iAF1260, and that of B. subtilis, iYO844. By analyzing such data as well as modifying our algorithm, we can refine our results.
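Because the detection of supplying pathway maps needs only the substrates and products of each reaction, it can in principle be run on any reaction source. The sketch below is a deliberately simplified, single-step illustration and not the algorithm used in this study: the function name `supplier_maps`, the tuple layout, and all map and compound names are hypothetical, and no particular file format is assumed.

```python
def supplier_maps(reactions, conserved):
    """Return non-conserved maps that produce a compound which a conserved
    map consumes and which no conserved map produces itself.

    reactions: list of (map_id, substrates, products) tuples, where
               substrates and products are sets of compound ids.
    conserved: set of map ids regarded as conserved.
    """
    needed = set()
    made_by_conserved = set()
    for map_id, substrates, products in reactions:
        if map_id in conserved:
            needed.update(substrates)
            made_by_conserved.update(products)

    # Compounds the conserved maps must obtain from elsewhere.
    external = needed - made_by_conserved

    return {
        map_id
        for map_id, _, products in reactions
        if map_id not in conserved and external & set(products)
    }


# Hypothetical toy network; the ids are placeholders, not KEGG identifiers.
toy_reactions = [
    ("map_translation", {"amino_acid"}, {"protein"}),
    ("map_aa_synthesis", {"pyruvate"}, {"amino_acid"}),
    ("map_glycolysis", {"glucose"}, {"pyruvate"}),
    ("map_unrelated", {"x"}, {"y"}),
]

suppliers = supplier_maps(toy_reactions, conserved={"map_translation"})
print(suppliers)  # {'map_aa_synthesis'}
```

Only direct suppliers are returned here. Repeating the step with the suppliers added to the working set would pull in upstream maps such as the toy glycolysis map, which is how catabolic maps would enter under this simplified scheme.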
As mentioned in the Background, a minimal genome and a minimal cellular-function set depend on the environment or nutrients, and their general definitions are very difficult. However, an estimation of minimal cellular-functions to realize a specific biological system is useful to design an efficient biological process, from the viewpoints of synthetic biology and cell engineering. For instance, we could design a bacterial genome that will degrade harmful chemicals, such as dioxin, or synthesize beneficial materials, such as ethanol, in large quantities through photosynthesis, by considering only minimal pathway maps related to their efficient catalysis.
We would like to thank Kenta Nakai, Kengo Kinoshita and Shuichi Onami for their help with this research. YA is grateful to his colleagues in the Ota laboratory (at Tokyo Institute of Technology and Nagoya University), the Nakai and Kinoshita laboratory (at The University of Tokyo) and the Onami laboratory (at RIKEN) for discussions. YA is supported by the Global COE program awarded to the Graduate School of Bioscience and Biotechnology at Tokyo Institute of Technology.
Authors and Affiliations
Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Nagatsuta-cho, Midori-ku, Yokohama, 226-8501, Japan
Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan
Additional file 1: Tables S1 and S2. The 30 representative genomes (Table S1) and the pathway maps with high OF values recalculated using 30 genomes that were partly different from the original 30 genomes (Table S2). (DOC 111 KB)
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.