Integration of metabolic databases for the reconstruction of genome-scale metabolic networks
© Radrich et al; licensee BioMed Central Ltd. 2010
Received: 27 January 2010
Accepted: 16 August 2010
Published: 16 August 2010
Genome-scale metabolic reconstructions have been recognised as a valuable tool for a variety of applications ranging from metabolic engineering to evolutionary studies. However, the reconstruction of such networks remains an arduous process requiring a high level of human intervention. This process is further complicated by occurrences of missing or conflicting information and the absence of common annotation standards between different data sources.
In this article, we report a semi-automated methodology aimed at streamlining the process of metabolic network reconstruction by enabling the integration of different genome-wide databases of metabolic reactions. We present results obtained by applying this methodology to the metabolic network of the plant Arabidopsis thaliana. A systematic comparison of compounds and reactions between two genome-wide databases allowed us to obtain a high-quality core consensus reconstruction, which was validated for stoichiometric consistency. A lower level of consensus led to a larger reconstruction, which has a lower quality standard but provides a baseline for further manual curation.
This semi-automated methodology may be applied to other organisms and help to streamline the process of genome-scale network reconstruction in order to accelerate the transfer of such models to applications.
Metabolism is perhaps the best characterised of all molecular interaction networks in biology. Large amounts of data relating to metabolic reactions are available to date, but despite this wealth of information metabolic phenotypes remain difficult to predict accurately . The reconstruction of the genome-scale metabolic network of an organism represents a major milestone toward better understanding of its properties. While metabolic pathways are convenient abstractions to represent routes of biochemical conversions of small molecules in an organism, their definition is often arbitrary and varies between sources . The pathway paradigm fails to provide an integrated view of interactions, regulatory and control mechanisms, which act across and with no regard of pathway boundaries. For that reason, a growing number of genome-scale metabolic networks have been constructed in recent years, most notably for microorganisms [3–7] but also for animals and humans [8–10]. The applications of such reconstructions are plentiful, encompassing metabolic engineering studies to design strains overproducing desired products, the prediction of genes responsible for orphan reactions, the determination of active reactions for a given environmental condition, the identification of coupled reaction sets, and evolutionary studies . Genome-scale metabolic reconstructions were also used to predict potential new antibiotic targets .
However, the process of reconstructing the genome-scale metabolic network of an organism remains very labour-intensive. It has been observed that the number of available metabolic models is smaller than 1% of that of fully sequenced genomes . There is therefore a clear need to accelerate and streamline the process of network reconstruction. Fully automated reconstructions on the other hand remain imprecise, mainly because of inherent limitations and errors contained in individual databases . Common sources of problems are the non-uniqueness of metabolite identifiers (some compounds being represented by generic classes such as "alcohol"), unbalanced atomic species arising from an incorrect stoichiometry or formula for one or more reactants, incorrect or missing cofactors, and enzymes catalysing more than one reaction . Additional problems are caused by the lack of usage of standards for the annotation of metabolites and reactions, making the comparison of different models extremely difficult.
In this work, we present an original semi-automated methodology aimed at accelerating the process of metabolic network reconstruction, while at the same time avoiding the loss of accuracy consecutive to a fully automated reconstruction. The underlying principle of our approach is to confront and integrate different sources of data. Different databases may contain errors or gaps about enzymes, reactions and metabolites present in an organism, but if data sources are independent, the likelihood that the same error would appear twice would be expected to be of an order of magnitude smaller than the frequency of errors in each data source. By creating an intersection between two databases, differences can be identified and assessed for further curation. The resulting process can be defined as semi-automatic because the processes of extraction and comparison between databases can be largely automated, yet a manual curation remains necessary to analyse and solve discrepancies.
As an application of our methodology, we present results of the integration of metabolite and reaction data for the model plant Arabidopsis thaliana. Plants offer a wide range of potential applications for metabolic modelling and engineering, which include the generation of pharmaceutical compounds, increasing the production of key secondary metabolites of commercial interest, improving yield and nutritional quality of crops. Despite such promising applications, the challenges of metabolic reconstruction are compounded in plants and only recently were the first genome-scale metabolic models of Arabidopsis reported [16, 17], which were constructed by manual curation. The reconstructions presented in this study were developed independently through a semi-automated process integrating data from two sources, namely AraCyc  and Kegg .
In order to map metabolites between Kegg and AraCyc, several features of the data were taken into account. Two compounds were defined as being identical only if all features were positively matched. These features included compound names, chemical formulae and structures, and enzymes catalysing reactions where the considered metabolites participate.
Compound name similarity
Names of metabolites may differ between databases for several reasons. Many chemical compounds are commonly known under multiple names, and all of their names are not necessarily indicated in all databases. Furthermore, there is no universal and common identifier between Kegg and AraCyc, as compounds are sometimes referred by their ChEBI (Chemical Entities of Biological Interest ), CAS (Chemical Abstracts Service ) or PubChem  identifier. For example, there are five different names listed for the Kegg entry C00022 (pyruvate, pyruvic acid, 2-oxopropanoate, 2-oxopropanoic acid, pyroracemic acid), and nine different possibilities for the same compound in AraCyc (pyruvate, BTS, alpha-ketopropionic acid, acetylformic acid, pyroracemic acid, 2-oxopropanoic acid, pyruvic acid, 2-oxopropanoate, 2-oxo-propionic acid). If at least one of these entries is the same in both databases, the identification of matching metabolites is straightforward. However there are many cases where no perfect match can be found. For example the Kegg compound C10434 is named 5-O-Caffeoylshikimic acid, and the same compound in AraCyc has the name caffeoylshikimate.
For this reason a string similarity algorithm, originally developed for the identification of gene/protein name similarity, has been employed to compare metabolite names . This algorithm uses logistic regression to compute the similarity between strings by incorporating a variety of features. A training data set has to be supplied in order to teach the program which differences can be treated as similar and which ones are not allowed. It is important to use features that can well characterise a string pair by capturing the similarity between different variations while highlighting the difference between terms which are not synonymous. The considered features are character bigrams, prefixes and suffixes, numbers, acronyms, common and different tokens. An appropriate training set was prepared by processing sets of multiple names of the same metabolite in each of the databases. We found that the characteristic differences that occur between metabolite names are essentially of a very similar nature as those occurring between protein names, allowing the algorithm to perform efficiently. The result of this process consisted in a list of matched names with an associated of their identity.
Chemical structure similarity
The formula of a metabolite theoretically confines the search for matching metabolites considerably. However it is not usable as a unique identifier. Two metabolites can have the same global formula and be completely different chemically because of the various possible structures of atoms. For example, 2-carboxy-D-arabinitol 1-phosphate, L-galactose-1-phosphate, D-hexose 6-phosphate and alpha-D-mannose 1-phosphate all have the same formula C6H13O9P. On the other hand, two metabolites that are identical may be represented by slightly different chemical formulae, either because an error is present in one source or because of chemical modifications (such as pH-dependent breakdowns of carboxylic groups).
All chemical structure processing was implemented in Pipeline Pilot workflows . Input structures from KEGG and AraCyc were first converted to canonical SMILES, a unique line representation of two-dimensional molecular structure . Exact structural matches occurred where SMILES strings were identical.
In previous metabolic network reconstructions it was observed that equivalent metabolites differed across sources in a number of ways. These included stereochemical differences (due to varying levels of detail about chiral centres or configurations about double bonds), tautomeric variants (where proton localisation differs), and charge states. To identify differences of these types SMILES strings were matched after purging stereochemical information from structure, or after calculation of the canonical tautomer, or following recalculation of ionisation at pH 7.4.
One further important way in which equivalent metabolites from different sources were noted to differ was by simple structural errors in their representation, such as incorrect bonding or missing functional groups. Providing that the majority of the structure is correctly specified then corrections for errors of this type can be suggested by standard cheminformatics molecular similarity measures  (here using the ECFP_4 molecular fingerprint and Tanimoto similarity metric).
Whereas exact matches of the original canonical SMILES strings provides an unambiguous mapping of metabolites across data sources, the other types of matches are only approximate (they may or may not be correct) and so require further manual checking. The approximate matching algorithms do, however, significantly reduce the manual checking workload.
At the end of the first iteration step, new compounds were added to the list of matched compounds and a new iteration was started. The whole process was repeated several times until no additional matched compounds could be found. During the process, compounds that gave a positive result in some but not all of the before-mentioned features were copied into a separate list that was manually examined. The positive results of these checks were added to the list of compounds in order to improve the outcome of the next iteration.
The final lists of matched compounds and reactions were given new unique identifiers for the new reconstructions. These matched compounds and reactions constitute the first or core (yellow) metabolic network. Compounds and reactions that were not matched in both databases were assigned to the second or third network depending on an additional process. In the intermediate (green) network, we included reactions for which either the full set of substrates or the full set of products belonged to the core network. Such reactions are likely to be valid as they have a strong connection to the core network, but may be insufficiently or inconsistently described resulting in discrepancies between databases or their absence in one of them. Compounds involved in such reactions and not already included in the core model were also added to the intermediate network. All remaining compounds and reactions were included into the complete (blue) network.
A general problem in metabolic models is caused by the fact that many compounds have different protonation states depending on the pH. There is no pH consistency in formulae found in databases, and AraCyc does not represent protons in reactions. As a consequence, we neglected differences in proton content between formulae and did not represent protons in reactions either. A strict balance of hydrogen atoms thus cannot be expected in our reconstructions.
Stoichiometric consistency validation
Incorrect definition of reaction stoichiometries often results in stoichiometric inconsistencies - a common type of modelling error, defined as contradictions between the fundamental physical constraints of mass positivity and mass conservation . An example is shown below:
R1: A ↔ B
R2: A ↔ B + C
In this network, the metabolite C cannot be assigned any positive molecular mass without violating the mass balance in the whole system. Stoichiometric inconsistencies are often caused by violations of atomic balance, by ambiguous generic metabolite definitions (e.g. "primary alcohol") and by inclusion of polymers with variable polymerisation degrees and units (e.g. starch and protein).
Stoichiometric consistency validation involves inspection of the left null-space (the null-space of the transposed stoichiometry matrix) and may include the following optional steps: Firstly, the stoichiometric consistency of the full network is verified. If the network is inconsistent, the non-conserved metabolites  are detected (e.g. C in the above example). Further, for each non-conserved metabolite, the minimal inconsistent net stoichiometries involving it are calculated (these are defined as net stoichiometries with empty sets of substrate or products, e.g. by subtracting the reaction R1 from R2, we obtain Ø ↔ C). Finally, the elementary leakage modes (minimal linear combinations of reactions resulting in inconsistent net stoichiometries) can be detected, e.g. (-R1, R2) in the above example. The localisation of such minimal inconsistent subnetworks helps to detect input errors in the reactions.
Construction of SBML version
An SBML version of the core network was constructed using libSBML . The file is MIRIAM compliant  with all species and reactions annotated with ontological terms, allowing their unambiguous identification and interpretation by third party software tools. Metabolites have been annotated with ChEBI, Kegg, PubChem, 3DMET and Lipidbank terms, along with InChI strings , while all enzymes are provided with an annotation linking them to the appropriate gene in the TAIR database . All annotations are associated to an appropriate systems biology ontology (SBO) term.
Network visualisation and analysis
Properties of the three metabolic networks.
Number of nodes
Number of edges
Number of self-loops
Average path length
The number of connected components indicates how many disjoint subnetworks the network is broken into. A self-loop is a node connected to itself. The number of shared neighbours between two nodes is the number of nodes that are neighbours of both of them. The shortest path length, also called the distance between two nodes is the smallest number of edges that have to be crossed to go from one edge to the other. The characteristic path length is the average distance and the network diameter is the largest distance between two nodes in the network.
The connectivity of a node is the number of edges connected to it. The network density is a measure of how densely the network is populated with edges. A network that contains only isolated nodes has a density of 0, whereas a clique has a density of 1. The network centralisation is a measure of how strongly a network is focused around central nodes. Networks resembling a star have a centralisation close to 1, whereas decentralised networks have a centralisation close to 0. The network heterogeneity measures the variance of the connectivity and reflects the tendency of a network to contain hub nodes. The clustering coefficient of a given node n is a ratio between the number of edges between the neighbours of n and the maximum number of edges that could possibly exist between them.
where s and t are nodes in the network different from n, σ st is the number of shortest paths from s to t, σ st (n) is the number of shortest paths from s to t that n lies on, and N is the total number of nodes in the connected component that n belongs to. The betweenness centrality of each node is a number between 0 and 1. It reflects the amount of control that a node exerts over the interactions between other nodes in the network. A node acting as a bridge between different communities has a high betweenness centrality, while a node that lies inside a community has a low one.
where L(n, m) is the length of the shortest path between n and m, and N is the total number of nodes in the connected component that n belongs to. The closeness centrality is a measure of how fast information can spread from a given node to other reachable nodes in the network.
Metabolic reconstructions of A. thaliana
We present three metabolic networks of A. thaliana corresponding to different levels of confidence. The core (yellow) metabolic network only contains compounds and reactions that have been unambiguously identified and matched in the two databases (Additional file 1). These metabolites and reactions are expected to have been well characterised and experimentally observed, and are more likely to play an important role in the plant. This model is expected to cover most of the core metabolism of Arabidopsis thaliana (Table 1).
The intermediate (green) network additionally contains reactions from both databases where either the full set of substrates or the full set of products belongs to the core network (Additional file 2). Compounds involved in these reactions and not already included in the core network were also added to the intermediate network. Such reactions and compounds have a strong connection to the core network, but their confidence status is lower since they have not been unambiguously matched between both databases. The fact that a compound or reaction was found in only one database may be due to several factors: (i) our reconstruction algorithm may have failed to find the corresponding compound or reaction in the second database; (ii) a metabolite may be represented by a generic class in one database but by a specific compound in the other; (iii) one database may contain an inaccuracy, so that a metabolite or a cofactor is missing or incorrect in a reaction; (iv) a reaction or compound may indeed be absent from one database. Some of these causes might lead to the occurrence of double entries in the intermediate network.
All remaining reactions and compounds were added to the complete (blue) metabolic network, so that a comprehensive data set of all metabolic reactions and compounds described in A. thaliana and reactions contained is presented (Additional file 3). This reconstruction should be considered as a development set whose validity needs to be confirmed from additional sources. This file additionally contains a list of genes associated through each reaction in both databases.
Detailed coverage of metabolic pathways
We investigated the distributions of enzymes belonging to the three metabolic networks among Kegg pathways (Additional file 4). In most of the carbohydrate metabolism pathways, the core (yellow) metabolic network covers between 70% and 80% of all enzymes attributed by Kegg to these pathways. This proportion generally rises above 85% in the intermediate (green) network. In nucleotide metabolism pathways, 87% of the enzymes are covered by the core network and 91% by the intermediate network. For amino acid metabolism and secondary metabolites biosynthesis, these values are most of the time between 50% and 70% in the core network and 75% in the intermediate network. Lipid metabolism has a lower coverage with 40% to 60% of enzymes being in the core network and around 70% in the intermediate network. It is not surprising that core metabolic pathways are generally better covered, as these pathways should have been the most intensively experimentally analysed and the most accurately described.
(i) The `was not included into the core network due to an ambiguity between various forms of ubiquinones and ubiquinols acting as cofactors. These compounds are represented by generic classes in Kegg, but by specific ubiquinone-8 and ubiquinol-8 compounds in AraCyc.
(ii) The transition between 2-oxoglutarate and succinyl-CoA is represented by a direct alpha-ketoglutarate dehydrogenase reaction in AraCyc. This reaction is not present in Kegg, which instead represents the transition by three different steps involving 3-carboxy-1-hydroxy-propyl-thiamine diphosphate and succinyldihydro-lipoamide-E.
(iii) Similarly, the transition between isocitrate and 2-oxoglutarate via oxalosuccinate is represented in Kegg but not in AraCyc. The direct isocitrate dehydrogenase reaction does appear in both databases.
Stoichiometric consistency validation
The intermediate (green) and complete (blue) metabolic networks unsurprisingly contain many stoichiometric inconsistencies because these reconstructions contain generic metabolite classes (e.g. "alcohols"). The stoichiometric consistency validation of the core (yellow) network was successful with the exception of molecular hydrogen. This inconsistency inevitably follows from skipping protons from reaction definitions and currently cannot be resolved, given the inaccuracies in the input data. Achieving strict hydrogen and charge balancing would nevertheless be important to obtain a high quality genome-scale metabolic model, particularly for a photosynthetic organism .
Distributions of the most important network properties are plotted in Figure 5 for the complete (blue) metabolic network. The intermediate (green) and core (yellow) metabolic networks exhibit similar distributions. Average network parameters are shown for the three networks in Table 1. The connectivity distribution of our metabolic reconstructions has the same allure, resembling a power-law, as universally observed in metabolic networks [37–39]. The average clustering coefficient is our reconstructions is close to 0.2, to be compared with reported values of 0.20 for E. coli, 0.23 for S. cerevisiae, and 0.28 for H. pylori based on the same network representation . The average path length in our reconstructions is close to 3, which is the same as observed in many other metabolic networks based on the same network representation . It is worth noticing that this value becomes significantly higher when common small molecules are removed from the network  or when an atomistic representation of metabolism is adopted .
In Arita's atomistic representation, the substrates and products of a reaction are connected only if carbon atoms can be traced between them. The Kegg RPAIR database provides atomic mappings between Kegg compounds, making it possible to construct an atomistic network for our core model. However, our intermediate and complete models contain some compounds that were not found in Kegg, therefore an atomistic network constructed from these models would be partially biased. For this reason, we provide a comparison between both network representations for the core model (Additional file 5), but the conclusions of this comparison can be extended to the entire network. In the atomistic network, there is a higher number of nodes of degree 1 and the degree distribution decreases more rapidly; the clustering coefficient and closeness centrality are lower, but the betweenness centrality is higher since nodes are less densely interconnected; the average path length is higher with its distribution peaking at 4, and the diameter of the network increases to 12 instead of 8 in the classical network.
Hubs of the three metabolic networks.
Core (yellow) network
Intermediate (green) network
Complete (blue) network
Comparison with other models
Because of the lack of usage of standardised formats, nomenclatures and annotations, it is very challenging to compare different genome-scale metabolic models with each other. A systematic comparison of our reconstruction with the two other Arabidopsis models, hereafter referred to as "Poolman"  and "AraGEM" , would therefore require a long and cumbersome manual mapping of metabolites and reaction.
Comparison of tricarboxylic acid and glyoxylate cycle reactions in Arabidopsis reconstructions.
Succinate + Ubiquinone ↔ Fumarate + Ubiquinol
Fumarate + H2O ↔ Malate
Malate + NAD ↔ Oxaloacetate + NADH + H
Malate + NADP ↔ Oxaloacetate + NADPH + H
Only with NAD/NADH
Oxaloacetate + ADP + Orthophosphate + Acetyl-CoA ↔ Citrate + ATP + CoA
Citrate ↔ cis-Aconitate + H2O
cis-Aconitate + H2O ↔ Isocitrate
Isocitrate ↔ Succinate + Glyoxylate
Glyoxylate + H2O +Acetyl-CoA ↔ Malate + CoA
Isocitrate + NAD ↔ 2-oxoglutarate + CO2 + NADH + H
Isocitrate + NADP ↔ 2-oxoglutarate + CO2 + NADPH + H
Only with NAD/NADH;
Isocitrate + NADP ↔ Oxalosuccinate + NADPH + H
Oxalosuccinate + NADP ↔ 2-oxoglutarate + CO2 + NADPH + H
2-oxoglutarate + CoA + NAD ↔ Succinyl-CoA + CO2 + NADH
2-oxoglutarate + Thiamine-pyrophosphate ↔ 3-carboxy-1-hydroxypropyl-ThPP + Co2
3-carboxy-1-hydroxypropyl-ThPP + Lipoamide-E ↔ Succinyldihydrolipoamide-E + Thiamine-pyrophosphate
Succinyldihydrolipoamide-E + CoA ↔ Succinyl-CoA + Dihydropiloamide-E
Succinyl-CoA + ADP + Orthophosphate ↔ Succinate + ATP + CoA
Succinyl-CoA + GDP + Orthophosphate ↔ Succinate + GTP + CoA
Only with ADP/ATP
Like the Poolman model, our reconstruction does not distinguish between cellular compartments, but the AraGEM model distinguishes between cytosol, mitochondria, plastid, vacuole and peroxisome. Since the aim of this work is to present a methodology for the integration of different databases, our reconstruction is not immediately intended for FBA simulations. There is a trade-off between the search for a consensus and the gap filling required for FBA.
The number of published genome-scale metabolic network reconstructions has grown rapidly in recent years. After the first reconstructions were published for E. coli and S. cerevisiae [42, 43] the number of such reconstruction has been growing quickly in recent years, covering many microorganisms, animals and human. A comprehensive description of the motivations and applications of such reconstructions has been presented by Feist and Palsson . These applications include network property analysis, metabolic engineering, biological discovery, phenotypic assessment, and evolutionary analysis.
Very few reconstructions of plant metabolic networks have been undertaken so far, and yet many of the applications mentioned before take even higher relevance in plants. Metabolic engineering is of particular significance in plants and offers promising perspectives to improving production yields, enhancing the nutritional value of crops, and generating valuable molecules for pharmacology and energy production. High-quality and comprehensive models of plant metabolism will be crucial to allow these applications to be developed. The metabolic networks of plants are of a higher complexity than those of most other living species; it is therefore both relevant and timely to start investing efforts in the construction of such models.
Although Arabidopsis thaliana has been widely used as a model plant, its metabolic network has not been studied in great details and at a large scale. There has been renewed interest in A. thaliana metabolism recently. More than 170 secondary metabolites from seven different classes have been identified in A. thaliana , whose putative functions cover the defence against pathogens and herbivores, UV protection, resistance to oxidative stress, auxin transport, etc. Glucosinolates are known for their benefits to human nutrition and were found to play a fundamental role in the defence response against microbial and fungal pathogens [45, 46]. Biosynthesis pathways of tocochromanols, a group of lipid antioxidants that are essential in human nutrition, have raised promising interest . A. thaliana was also used as a model plant to study polyamine metabolism, which plays an essential role in stress tolerance , and flavonoid production, which inhibit or stimulate cell proliferation in different human cancer cell lines .
However many issues presently hamper the use of genome-scale metabolic reconstructions by the wider research community and industry. These issues include:
(i) A limited usage of standardised formats, nomenclatures and annotations. Many metabolic reconstructions are published in spreadsheet format, using customs identifiers and nomenclature. This is a considerable obstacle to the transfer of these reconstructions to other applications and to comparisons between different networks, making it challenging to compare Arabidopsis models with each other.
(ii) Limited coordination between different reconstruction efforts. One such coordination has recently lead to the publication of a consensus metabolic reconstruction of S. cerevisiae  and a similar effort is currently under way for the human metabolic network .
(iii) The absence of update mechanisms enabling the integration of the latest scientific discoveries by the wider research community into existing models.
(iv) The absence of universal quality and validation standards.
Although this work does not claim to address all these issues, we introduced in this work a few principles seeking to propose avenues for solutions and to raise awareness about current limitations. First, we provide an annotated SBML version of the core (yellow) metabolic network of A. thaliana (Additional file 6) and non-annotated SBML versions of the two other networks (Additional files 7 and 8). SBML has become the de facto standard for systems biology models and allows them to be used by the widest range of tools. Standardised annotations following SBO specifications ensure that metabolites and enzymes can be easily identified and linked with existing databases. However, for such formats to be universally adopted by the biochemical research community, efficient and user-friendly tools will need to be developed allowing the easy input and conversion of models to a well annotated and standardised format.
While large international meetings have proven successful to confront and integrate different existing metabolic reconstructions, mechanisms allowing a convenient integration of models as they are developed would be more efficient. We showed that by confronting and integrating two independent sources, we were able to semi-automatically reconstruct a core metabolic network of A. thaliana, whose properties are comparable to existing manually reconstructed networks of other species. Such mechanisms could be generalised by the use of common repositories, following the model used for gene sequences or protein structures, allowing users to deposit new reconstructions and enhance existing ones through a seamless integration process.
Last, all applications using genome-scale metabolic reconstructions do not necessarily require the same level of data quality. For network analysis, a relatively straightforward or automatic reconstruction may be sufficient, while for metabolic engineering or experimental design a highly accurate and well-annotated model is generally necessary. It is therefore important to keep track of the sources and validation level of data used in reconstructions, so that users are able to select the data with the appropriate level of confidence for their application. As a first step towards such a process, we here provide three reconstructions with different levels of confidence. The core (yellow) network has the highest confidence level and was proven to be stoichiometrically consistent, but has some gaps. For applications such as Flux Balance Analysis, a more continuous model can be preferable even though some reactions might be of a lower confidence level. The intermediate (green) network attempts to suit such needs by filling gaps through the inclusion of partial information. The complete (blue) network eventually contains the largest amount of available information, but with the restriction that some reactions may be unconfirmed and the risk of duplications. These different levels of network reconstructions provide baselines upon which future improvements can be built by the community to ultimately obtain a high-quality genome-scale of Arabidopsis metabolism .
The software tools developed for this work are provided in Additional file 9 as Java source code, together with a protocol describing their function. The protocol used to map chemical structures is additionally available online from the myExperiment website .
In this work, we presented a methodology allowing an efficient semi-automatic reconstruction of metabolic networks via the integration of different databases, and applied this methodology to the plant Arabidopsis thaliana. The integration of different data sources significantly enhances the quality of a reconstruction and leads to quality standards that are comparable to manual reconstructions. A long-term and coordinated international effort will be desirable to provide comprehensive and accurate genome-scale metabolic models of plants, and to provide the infrastructure allowing widespread diffusion, frequent update, unimpeded compatibility and convenience of use of such models by the widest research community and industry.
NS thanks the Engineering and Physical Sciences Research Council (EPSRC) and the Biotechnology and Biological Sciences Research Council (BBSRC) for funding of the Manchester Centre for Integrative Systems Biology. PD is funded by BBSRC grant BB/F006012/1. The authors thank Ralf Steuer, Hans Westerhoff, Pedro Mendes, Mark Poolman and David Fell for fruitful comments and discussions.
- Sweetlove LJ, Fell DA, Fernie AR: Getting to grips with the plant metabolic network. Biochem J. 2008, 409: 27-41. 10.1042/BJ20071115View ArticlePubMedGoogle Scholar
- Green ML, Karp PD: The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Res. 2006, 34: 3687-97. 10.1093/nar/gkl438PubMed CentralView ArticlePubMedGoogle Scholar
- Heinemann M, Kümmel A, Ruinatscha R, Panke S: In silico genome-scale reconstruction and validation of the Staphylococcus aureus metabolic network. Biotechnol Bioeng. 2005, 92: 850-64. 10.1002/bit.20663View ArticlePubMedGoogle Scholar
- Oh YK, Palsson BØ, Park SM, Schilling CH, Mahadevan R: Genome-scale reconstruction of metabolic network in Bacillus subtilis based on high-throughput phenotyping and gene essentiality data. J Biol Chem. 2007, 282: 28791-9. 10.1074/jbc.M703759200View ArticlePubMedGoogle Scholar
- Resendis-Antonio O, Reed JL, Encarnación S, Collado-Vides J, Palsson BØ: Metabolic reconstruction and modeling of nitrogen fixation in Rhizobium etli. PLoS Comput Biol. 2007, 3: 1887-95. 10.1371/journal.pcbi.0030192View ArticlePubMedGoogle Scholar
- Andersen MR, Nielsen ML, Nielsen J: Metabolic model integration of the bibliome, genome, metabolome and reactome of Aspergillus niger. Mol Syst Biol. 2008, 4: 178- 10.1038/msb.2008.12PubMed CentralView ArticlePubMedGoogle Scholar
- Herrgård MJ, Swainston N, Dobson P, Dunn WB, Arga KY, Arvas M, Blüthgen N, Borger S, Costenoble R, Heinemann M, Hucka M, Le Novère N, Li P, Liebermeister W, Mo ML, Oliveira AP, Petranovic D, Pettifer S, Simeonidis E, Smallbone K, Spasić I, Weichart D, Brent R, Broomhead DS, Westerhoff HV, Kırdar B, Penttilä M, Klipp E, Palsson BØ, Sauer U, Oliver SG, Mendes P, Nielsen J, Kell DB: A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat Biotechnol. 2008, 26: 1155-60. 10.1038/nbt1492PubMed CentralView ArticlePubMedGoogle Scholar
- Sheikh K, Förster J, Nielsen LK: Modeling hybridoma cell metabolism using a generic genome-scale metabolic model of Mus musculus. Biotechnol Prog. 2005, 21: 112-21. 10.1021/bp0498138View ArticlePubMedGoogle Scholar
- Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R, Palsson BØ: Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc Natl Acad Sci USA. 2007, 6: 1777-82. 10.1073/pnas.0610772104.View ArticleGoogle Scholar
- Ma H, Sorokin A, Mazein A, Selkov A, Selkov E, Demin O, Goryanin I: The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol. 2007, 3: 135- 10.1038/msb4100177PubMed CentralView ArticlePubMedGoogle Scholar
- Feist AM, Palsson BØ: The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nat Biotechnol. 2008, 26: 659-67. 10.1038/nbt1401PubMed CentralView ArticlePubMedGoogle Scholar
- Lee DS, Burd H, Liu J, Almaas E, Wiest O, Barabási AL, Oltvai ZN, Kapatral V: Comparative genome-scale metabolic reconstruction and flux balance analysis of multiple Staphylococcus aureus genomes identify novel anti-microbial drug targets. J Bacteriol. 2009, 191: 4015-24. 10.1128/JB.01743-08PubMed CentralView ArticlePubMedGoogle Scholar
- Suthers PF, Dasika MS, Kumar VS, Denisov G, Glass JI, Maranas CD: A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput Biol. 2009, 5: e1000285- 10.1371/journal.pcbi.1000285PubMed CentralView ArticlePubMedGoogle Scholar
- Ginsburg H: Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium. Trends Parasitol. 2009, 25: 37-43. 10.1016/j.pt.2008.08.012View ArticlePubMedGoogle Scholar
- Poolman MG, Bonde BK, Gevorgyan A, Patel HH, Fell DA: Challenges to be faced in the reconstruction of metabolic networks from public databases. Syst Biol. 2006, 153: 379-84.View ArticleGoogle Scholar
- Poolman MG, Miguet L, Sweetlove LJ, Fell DA: A genome-scale metabolic model of Arabidopsis thaliana and some of its properties. Plant Physiol. 2009, 151: 1570-81. 10.1104/pp.109.141267PubMed CentralView ArticlePubMedGoogle Scholar
- de Oliveira Dal'Molin CG, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK: AraGEM - A genome-scale reconstruction of the primary metabolic network in Arabidopsis thaliana. Plant Physiol. 2010, 152: 579-89. 10.1104/pp.109.148817PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang P, Foerster H, Tissier CP, Mueller L, Paley S, Karp PD, Rhee SY: MetaCyc and AraCyc. Metabolic pathway databases for plant research. Plant Physiol. 2005, 138: 27-37. 10.1104/pp.105.060376PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36: D480-4. 10.1093/nar/gkm882PubMed CentralView ArticlePubMedGoogle Scholar
- Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2008, 36: D344-50. 10.1093/nar/gkm791PubMed CentralView ArticlePubMedGoogle Scholar
- Buntrock RE: Chemical registries - in the fourth decade of service. J Chem Inf Comput Sci. 2001, 41: 259-63.View ArticlePubMedGoogle Scholar
- Austin CP, Brady LS, Insel TR, Collins FS: NIH Molecular Libraries Initiative. Science. 2004, 306: 1138-39. 10.1126/science.1105511View ArticlePubMedGoogle Scholar
- Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S: Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics. 2007, 23: 2768-74. 10.1093/bioinformatics/btm393View ArticlePubMedGoogle Scholar
- Hassan M, Brown RD, Varma-O'Brien S, Rogers D: Cheminformatics analysis and learning in a data pipelining environment. Mol Divers. 2006, 10: 283-99. 10.1007/s11030-006-9041-5View ArticlePubMedGoogle Scholar
- Weininger D, Weininger A, Weininger JL: SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci. 1989, 29: 97-101.View ArticleGoogle Scholar
- Willett P: Similarity-based virtual screening using 2 D fingerprints. Drug Discov Today. 2006, 11: 1046-53. 10.1016/j.drudis.2006.10.005View ArticlePubMedGoogle Scholar
- Gevorgyan A, Poolman MG, Fell DA: Detection of stoichiometric inconsistencies in biomolecular models. Bioinformatics. 2008, 24: 2245-51. 10.1093/bioinformatics/btn425View ArticlePubMedGoogle Scholar
- Nikolaev EV, Burgard AP, Maranas CD: Elucidation and structural analysis of conserved pools for genome-scale metabolic reconstructions. Biophys J. 2005, 88: 37-49. 10.1529/biophysj.104.043489PubMed CentralView ArticlePubMedGoogle Scholar
- Bornstein BJ, Keating SM, Jouraku A, Hucka M: LibSBML: an API library for SBML. Bioinformatics. 2008, 24: 880-1. 10.1093/bioinformatics/btn051PubMed CentralView ArticlePubMedGoogle Scholar
- Le Novère N, Finney A, Hucka M, Bhalla US, Campagne F, Collado-Vides J, Crampin EJ, Halstead M, Klipp E, Mendes P, Nielsen P, Sauro H, Shapiro B, Snoep JL, Spence HD, Wanner BL: Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotechnol. 2005, 23: 1509-15. 10.1038/nbt1156View ArticlePubMedGoogle Scholar
- Stein SE, Heller SR, Tchekhovskoi D: An open standard for chemical structure representation: The IUPAC chemical identifier. Proceedings of the 2003 International Chemical Information Conference (Nîmes), Infonortics. 2003, 131-43.Google Scholar
- Huala E, Dickerman A, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang J, Huang W, Mueller L, Bhattacharyya D, Bhaya D, Sobral B, Beavis B, Somerville C, Rhee SY: The Arabidopsis Information Resource (TAIR): A comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 2001, 29: 102-5. 10.1093/nar/29.1.102PubMed CentralView ArticlePubMedGoogle Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 2498-504. 10.1101/gr.1239303PubMed CentralView ArticlePubMedGoogle Scholar
- Assenov Y, Ramírez F, Schelhorn SE, Lengauer T, Albrecht A: Computing topological parameters of biological networks. Bioinformatics. 2008, 24: 282-4. 10.1093/bioinformatics/btm554View ArticlePubMedGoogle Scholar
- Dong J, Horvath S: Understanding network concepts in modules. BMC Syst Biol. 2007, 1: 24- 10.1186/1752-0509-1-24PubMed CentralView ArticlePubMedGoogle Scholar
- Reed JL, Vo TD, Schilling CH, Palsson BØ: An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biol. 2003, 4: R54- 10.1186/gb-2003-4-9-r54PubMed CentralView ArticlePubMedGoogle Scholar
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL: The large-scale organization of metabolic networks. Nature. 2000, 407: 651-4. 10.1038/35036627View ArticlePubMedGoogle Scholar
- Wagner A, Fell DA: The small world inside large metabolic networks. Proc R Soc Lond B Biol Sci. 2001, 268: 1803-10. 10.1098/rspb.2001.1711.View ArticleGoogle Scholar
- Almaas E: Biological impacts and context of network theory. J Exp Biol. 2007, 210: 1548-58. 10.1242/jeb.003731View ArticlePubMedGoogle Scholar
- Ma H, Zeng AP: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics. 2003, 19: 270-7. 10.1093/bioinformatics/19.2.270View ArticlePubMedGoogle Scholar
- Arita M: The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci USA. 2004, 101: 1543-47. 10.1073/pnas.0306458101PubMed CentralView ArticlePubMedGoogle Scholar
- Edwards JS, Palsson BØ: The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci USA. 2000, 97: 5528-33. 10.1073/pnas.97.10.5528PubMed CentralView ArticlePubMedGoogle Scholar
- Förster J, Famili I, Fu P, Palsson BØ, Nielsen J: Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 2003, 13: 244-53. 10.1101/gr.234503PubMed CentralView ArticlePubMedGoogle Scholar
- D'Auria JC, Gershenzon J: The secondary metabolism of Arabidopsis thaliana: growing like a weed. Curr Opin Plant Biol. 2005, 8: 308-16. 10.1016/j.pbi.2005.03.012View ArticlePubMedGoogle Scholar
- Clay NK, Adio AM, Denoux C, Jander G, Ausubel FM: Glucosinolate metabolites required for an Arabidopsis innate immune response. Science. 2009, 323: 95-101. 10.1126/science.1164627PubMed CentralView ArticlePubMedGoogle Scholar
- Bednarek P, Pislewska-Bednarek M, Svatos A, Schneider B, Doubsky J, Mansurova M, Humphry M, Consonni C, Panstruga R, Sanchez-Vallet A, Molina A, Schulze-Lefert P: A glucosinolate metabolism pathway in living plant cells mediates broad-spectrum antifungal defense. Science. 2009, 323: 101-6. 10.1126/science.1163732View ArticlePubMedGoogle Scholar
- Dörmann P: Functional diversity of tocochromanols in plants. Planta. 2007, 225: 269-76. 10.1007/s00425-006-0438-2View ArticlePubMedGoogle Scholar
- Alcázar R, Marco F, Cuevas JC, Patron M, Ferrando A, Carrasco P, Tiburcio AF, Altabella T: Involvement of polyamines in plant response to abiotic stress. Biotechnol Lett. 2006, 28: 1867-76. 10.1007/s10529-006-9179-3View ArticlePubMedGoogle Scholar
- Woo HH, Jeong BR, Hawes MC: Flavonoids: from cell cycle regulation to biotechnology. Biotechnol Lett. 2005, 27: 365-74. 10.1007/s10529-005-1521-7View ArticlePubMedGoogle Scholar
- Mo ML, Palsson BØ: Understanding human metabolic physiology: a genome-to-systems approach. Trends Biotechnol. 2009, 27: 37-44. 10.1016/j.tibtech.2008.09.007View ArticlePubMedGoogle Scholar
- Thiele I, Palsson BØ: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protocols. 2010, 5: 93-121. 10.1038/nprot.2009.203.View ArticleGoogle Scholar
- myExperiment., http://www.myexperiment.org/packs/121