A curated C. difficile strain 630 metabolic network: prediction of essential targets and inhibitors

Background Clostridium difficile is the leading cause of hospital-borne infections occurring when the natural intestinal flora is depleted following antibiotic treatment. Current treatments for Clostridium difficile infections present high relapse rates and new hyper-virulent and multi-resistant strains are emerging, making the study of this nosocomial pathogen necessary to find novel therapeutic targets. Results We present iMLTC806cdf, an extensively curated reconstructed metabolic network for the C. difficile pathogenic strain 630. iMLTC806cdf contains 806 genes, 703 metabolites and 769 metabolic, 117 exchange and 145 transport reactions. iMLTC806cdf is the most complete and accurate metabolic reconstruction of a gram-positive anaerobic bacteria to date. We validate the model with simulated growth assays in different media and carbon sources and use it to predict essential genes. We obtain 89.2% accuracy in the prediction of gene essentiality when compared to experimental data for B. subtilis homologs (the closest organism for which such data exists). We predict the existence of 76 essential genes and 39 essential gene pairs, a number of which are unique to C. difficile and have non-existing or predicted non-essential human homologs. For 29 of these potential therapeutic targets, we find 125 inhibitors of homologous proteins including approved drugs with the potential for drug repositioning, that when validated experimentally could serve as starting points in the development of new antibiotics. Conclusions We created a highly curated metabolic network model of C. difficile strain 630 and used it to predict essential genes as potential new therapeutic targets in the fight against Clostridium difficile infections. Electronic supplementary material The online version of this article (doi:10.1186/s12918-014-0117-z) contains supplementary material, which is available to authorized users.


Background
Clostridium difficile is an opportunistic, gram-positive anaerobic spore-forming pathogen found in the environment and in the intestinal flora in up to 3% of healthy adults. Toxigenic strains of C. difficile are resistant to a wide variety of antibiotics and produce the enterotoxin TcdA and the cytotoxin TcdB. These toxins are responsible for the clinical symptoms of C. difficile infection (CDI) [1,2]. CDI is the leading cause of hospital-borne infections occurring when the natural intestinal flora is depleted following antibiotic treatment. CDI is the major cause of antibiotic-associated diarrhea and is responsible for pseudomembranous colitis, a form of severe intestinal inflammation. For the most part, CDI can still be treated with metronidazole or vancomycin for which resistance levels remain low or the recently approved fidaxomicin. Single or multiple relapses after initial treatment are common and bring about more severe symptoms. A recent clinical study reports a relapse rate of 24% and 13% with vancomycin or fidaxomicin treatment respectively [3]. Between 50% and 80% of recurrences are due to spore-mediated re-infection [2]. Unfortunately, patient-to-patient transmission and relapses are difficult to prevent due to the production of C. difficile spores that are resistant to antibiotics, heat, radiation and various chemicals.
CDI is directly responsible for an average 4.6 per 1000 patients admitted in hospitals with a 5.7% mortality rate after 30 days directly attributed to CDI [4]. In the US, over 250,000 cases are registered per year in hospitals alone and many more cases in outpatient settings [5], costing around USD$4,000 to USD$8,000 per case of primary infection and USD$8,000 to USD$15,000 per relapsing infection [6] leading to a burden of over USD$500 million [7]. More important than economic costs, CDI in older patients, those with concurrent debilitating conditions, and severely relapsing or fulminant cases may result in death.
In recent years there has been an increase in the rate of infections as well as the emergence of communityassociated, virulent and antibiotic-resistant strains [8,9]. Treatments of CDI that offer alternatives to the use of small-molecules [10] involving phages [11] or intestinal microbial flora transplants [12] are likely to meet resistance from patients. Others involving antibodies [13] or vaccines [14] are still under development.
The complexity inherent to preventing and treating CDI requires the continuous search for new ways to target C. difficile. In recent year, the growth of biological databases led to the development of the field of systems biology making it possible to build and analyze genomic-scale reconstructed metabolic networks [15]. There is a large number of highly curated reconstructed metabolic networks for a number of organisms, from E. coli [16] to human [17]. The prediction of essential genes is often used to detect potential drug targets [18][19][20]. Two techniques, Flux Balance Analysis (FBA) [21] and Synthetic Accessibility (SA) [22] are among those available to predict essential genes at a genomic scale through in silico gene deletion studies. The comparison of results obtained with either FBA or SA and experimentally determined essential genes shows equivalent levels of accuracy with either technique around 94% for B. subtilis [23], 83% for S. cerevisiae and 60-70% for E. coli [22]. The success rates are likely reflecting the quality of the metabolic network reconstructions.
The combination of systems pharmacology and metabolic network analyses can help predict off-target effects of drugs as well as open new opportunities with the repositioning of existing drugs [24]. In the present study we create and validate a highly curated metabolic network reconstruction for the pathogenic C. difficile strain 630. We then utilize it to predict essential genes or gene pairs. We employ a combination of systems biology, bioinformatics and structural computational biology methods to detect potential human cross-reactivity targets for and detect small-molecules, including existing approved drugs that may bind a number of the predicted C. difficile targets.

Creation of the network
The genome of C. difficile strain 630 is composed of a circular chromosome of 4,290,252 bp coding for 3968 open reading frames (ORFs) as well as a plasmid of 7881 bp coding for 11 ORFs [25]. The C. difficile strain 630 draft reconstructed metabolic network presented here covers 20.3% of the ORFs present in the chromosomal genome of the bacteria with 806 ORFs. These 806 genes code for proteins catalyzing 769 metabolic, 145 transport and 117 exchange reactions. A total of 592 unique metabolites (703 in total not considering extracellular or intracellular state) are involved in the 1031 reactions in the network. The coverage of the genome is similar to those of previously published reconstructed metabolic networks such as B. subtilis with 20% [23] and higher than the most recent network of C. acetobutylicum with 13.0% [26]. Most reactions have at least one gene association (77.9%). Reactions without any gene association were added based on the existence of evidence from the literature such as in the case of Stickland reactions [27][28][29][30], presence in databases such as xanthine amido hydrase or to fill functional gaps to obtain a functional network as in the case of putative transporters for end-products of fermentation.
The final version of the network is available in 3 different formats: 1. An excel file that shows on different spreadsheets the reactions, metabolites, genes, and compartments that comprise the network and the definitions of the network based on the standard described in the RAVEN toolbox [31]. This file is meant to be easily readable by humans; 2. A tab-separated format of the network amenable to analysis in the R Statistical Computing environment (www.r-project.org) using Sybil [32]; and lastly, 3. A SBML Level 2 formatted network [33] that can be used with tools such as Matlab or other SBML compliant software. While a naming convention has been suggested recently for metabolic reconstructions [16], we feel that a naming convention that does not allude to the name of the organism is insufficient. Therefore, in the present work, the C. difficile strain 630 metabolic network reconstruction is called iMLTC806cdf as per the suggested convention with the added cdf suffix denoting the KEGG [34] three-letter organism ID representing C. difficile strain 630. The SBML version of the model has been deposited to the BioModels database [35] and assigned the identifier MODEL1409240004 .

Validation
Four types of growth media were simulated in silico via modulation of the exchange reactions for the import of metabolites present in the simulated growth media. All four tested media (Table 1) produced biomass based on FBA and SA analysis [22]. For SA, 4 proteins (oxidized ferredoxin, oxidized thioredoxin, acyl and sulfur carrier proteins) that cannot be produced due to the absence of protein biosynthesis reactions in the network but at equilibrium in FBA had to be supplemented to the media to make all reactions possible in FBA also accessible in SA. ATP and nicotinate were supplemented to the minimal medium to allow biomass production in SA while ATP was added to the complex medium.

Essential metabolites
In order to validate the network, experiments involving the removal or addition of certain metabolites from the media were reproduced in silico. Each essential amino acid (cysteine, leucine, Isoleucine, proline, tryptophan and valine) was confirmed essential in the network as their removal prevented the production of biomass in any medium. None of the three essential vitamins are essential in the network (Additional file 1: Table S1). The essentiality of two of these, biotin and pyridoxine, is due to their implication in the regulation of processes which could not be simulated in the metabolic network. Panthothenate (which is important in lipid metabolism) was in the past determined to be essential for a number of C. difficile strains tested [37]. More recently, a new ketopantoate reductase (KPR) gene panG was discovered in a number of pathogenic bacteria and found to have a homolog in C. difficile strain 630 [40]. Therefore, whereas panthothenate is commonly thought to be essential, this essentiality is strain specific and absent in C. difficile strain 630. A ΔpanG mutant in Francisella tularensis did not have any differences compared to wild type infections in a mouse model for pneumonic tularemia [40]. This is likely due to the fact that panthothenate (vitamin B5) is widely available in food and most bacteria are able to import panthothenate through a sodium co-transport mechanism [40].

Non-essential metabolites
Removal of non-essential metabolites did not have an important effect on growth. The non-essential amino acid methionine is known to enhance growth of the bacteria and is used in the minimal medium to increase growth rate. Interestingly, the removal of methionine from the minimal medium leads to a slight reduction in biomass production (less than 1%), a small but qualitatively correct effect. We simulated the removal of arginine and histidine (both nonessential amino acids) from the rich medium and in both cases this lead to the qualitatively correct result of a decrease in biomass production in agreement with the experimental evidence [36]. Most (8 out of 11) non-essential amino acids that when removed from a complete media do not affect growth rate experimentally also have no effect in silico when removed from the complex medium. Furthermore, the addition of certain non-essential amino acids (8 out of 13) or any nucleosides to the minimal medium leads to an augmentation of biomass production in agreement with experimental data (Additional file 1: Table S1).

Carbon sources
The utilisation of different carbon sources in the absence of glucose in the network was simulated and compared to experimental data. Such data is not always specific to C. difficile strain 630 and small differences among strains do exist [36] (Additional file 1: Table S2). The removal of glucose produced the largest decrease in biomass production (~33%) but had a smaller effect in the complex medium (~6%). For 15 experimentally tested carbon sources (out of 20 carbon sources tested in silico) we obtain 100% agreement between the predicted utilization of alternative carbon sources by C. difficile, including the impossibility to use lactose as a carbon source. Furthermore, we predict that C. difficile strain 630 should not be able to use rhamnose or myo-inositol (for which there is some evidence of usage in other strains but no experimental information for strain 630) while malate, glycerol and chorismate would lead to increased growth rates.

Comparison with existing metabolic network reconstructions
We compared iMLTC806cdf to the recently created automatically-generated non-curated reconstructed metabolic network of C. difficile [41]. The automated network contains 3211 reactions, 1548 unique metabolites and 1337 genes resulting in over 2762 genes/reactions associations. One fundamental requirement for a reconstructed metabolic network is its ability to produce biomass. As noted by its creators, the automatic network cannot produce biomass. This is likely due to the numerous flaws present in the automated network that are absent or present in a lesser number in the curated iMLTC806cdf reconstruction presented here. Among these: generic  [37]. c [38,39]. d [23]. e The following metabolites or proteins are added in SA analysis to permit the production of biomass: ATP, nicotinate, oxidized ferredoxin, oxidized thioredoxin, acyl and sulfur carrier proteins and are present in unchanging equilibrium concentrations in FBA.
metabolites, incorrect reaction stoichiometry, repetitions, unclear reactions, dead-end metabolites and non-metabolic genes and reactions. For example, the automated network contains 485 generic metabolites representing 31.3% of the metabolome opposed to 23 in iMLTC806cdf (representing 3.9% of the metabolome). Curiously the automated network reconstruction contains 29 reactions involving oxygen, which should not be present in an anaerobic organism. Other important flaws in the automatically generated C. difficile network include the presence of 1562 export reactions and 648 dead-end metabolites. Overall, 532 reactions (58.2% of the reactions in iMLTC806cdf) are common to both reconstructions out of 3211 present in the automatically generated network. A detailed comparison is presented in Table 2 (or in the form of a Venn diagram in Additional file 1: Figure S1) and clearly shows the vast differences between the two models and that a large number of problems associated to the automatically generated network are absent in iMLTC806cdf.
We also compared iMLTC806cdf to a curated metabolic network reconstruction of the closely related bacterium Clostridium acetobutylicum. Three curated metabolic networks exist for this organism [26,42,43] all focusing on metabolic engineering of the bacteria to maximize the production of butanol. The latest network [26], called iCAC490, was chosen for the comparative analysis as it was the largest, the most recent and the only one available in a commonly accepted usable format (SBML). The analysis of reactions shared between the two reconstructions was possible due to the extensive use of KEGG identifiers Reactions involving oxygen are allowed in C. acetobutylicum thus were not quantified.
in both networks. In the case of transport reactions, two transporters were considered similar if they transported the same molecule with one transporter from one network potentially matching multiple transporters in the other network. We did not differentiate between phosphoenolpyruvate (PEP):carbohydrate phosphotransferase system (PTS) [44], ion channels [45] or ATP driven transporters [46] as long as the transported molecules were the same in the two networks. iMLTC806cdf contains 174 more reactions than the reconstructed network of C. acetobutylicum and 416 in common with it (representing 45.5% of our reactions). iMLTC806cdf has 62 less unique metabolites and 389 metabolites in common (representing 66% of our unique metabolites). The presence of 186 (28% of the metabolome) dead-end metabolites (a complete list can be found in Additional file 1: Table S3) suggests that the C. acetobutylicum network still has incomplete pathways and gaps that could affect biomass production. A detailed comparison is presented in Table 2 or in the form of a Venn diagram in Additional file 1: Figure S2.

Single gene deletions
We performed an in silico gene deletion study using both Synthetic Accessibility (SA) and Flux Balance Analysis (FBA) on iMLTC806cdf in order to identify potential essential genes that may lead to the discovery of novel therapeutic targets. This analysis removed reactions that were catalyzed by each gene alone (or in pairs, next section) or by a complex that involved that gene product and then measured the capacity of the network to produce biomass (either a flux in biomass production for FBA or S net for SA, see Methods). The complex medium (as describe in Table 1) was the one used for the gene deletion studies since it reproduces the high concentration and diversity of nutriments found in the intestinal tract. Furthermore, the medium used is the same as the simulation of the Bacillus subtilis metabolic network [23], which in turn is an approximation of the one used to perform the experimental validation of lethality of single gene deletions in that organism [47].
A total of 66 out of 806 genes deletions were identified via FBA analysis as deleterious based on a 5% variation threshold [48]. An additional 10 genes were found by SA to increase the number of reactions necessary to produce biomass and deemed essential based on this criterion. Overall, 50 of the 76 predicted essentials genes are essential according to both FBA and SA (see Table 3). We observe an agreement rate of 96.8% between the two techniques in terms of the prediction of both lethal and non-lethal genes, which is similar to what was found when comparing both techniques on E. coli and S. cerevisiae [22]. Since the two techniques provide different insights on the metabolism and characteristics of the network (see Discussion), we consider genes identified as essential by either of these techniques as relevant as those identified by both.
Essential genes were compared to experimental results for the gram-positive bacteria Bacillus subtilis [47], which is the closest relative of Clostridium difficile with experimental essentiality data for all of its genes. Since the simulation involved the deletion of genes via deletion of metabolic reactions, functional homologs (genes responsible for reactions that share the same EC number and catalyze similar reactions) were used for the comparison. Among the 76 genes with a predicted effect on biomass production in C. difficile, 46 have homologs that are essential in B. subtilis, 8 did not have any functional homolog and 22 had a homolog that was not essential in B. subtilis (Table 3). We were able to compare 618 C. difficile genes (76.8% of the genes in iMLTC806cdf) for which we could detect a B. subtilis functional homolog with an overall prediction accuracy of 89.2% (Additional file 1: Table S4). A similar rate of 89.0% was obtained based on the comparison of 525 genes using sequence homology (E value <1e −5 , sequence identity above 30% and alignment overlap over 80% of the C. difficile sequence, Additional file 1: Table S4). While an accuracy of around 89% is extremely high, it is important to keep in mind that despite being closely related, differences are expected between the B. subtilis and C. difficile.
We utilized iCAC490 to perform the prediction of essential genes in C. acetobutylicum and compared the results to those above for C. difficile. Essential genes in the C. acetobutylicum network were identified in a similar manner than in iMLTC806cdf, using Sybil [32] within the R environment for statistical computing. Of the 658 C. difficile proteins with C. acetobutylicum sequence homologs, 368 are present in iCAC490. Based on these 368 proteins, we obtain an agreement of 72.3% between iMLTC806cdf iCAC490 (Additional file 1: Table S5). It is interesting to note that we obtain a higher agreement between our computational results for C. difficile and experimental results for B. subtilis than with computational results for iCAC490 representing a more closely related species.
The difficulty of performing genetic manipulations in C. difficile is notorious and severely restricts our ability to compare our results with experimental information. Despite the lack of extensive information on experimentally verified essential genes, the little evidence that exists, supports our predicted essential roles for a number of genes: metK [49], guaA [50] and ntpA-B-C-D-E-F-K [51]. Other known essential genes such as secA1-A2 [52], metG and gyrA-B [53], trpS [49] and gldA [54] are not present in the network and are involved in non-metabolic processes. Lastly, the gene prdF has been mutated and was shown to be non-essential [55], in agreement with its predicted non-essential role in iMLTC806cdf.
We compared the list of 76 genes predicted to be essential in C. difficile using iMLTC806cdf to the genes in the Database of Essential Genes (DEG) [56]. Interestingly, and serving as further validation of iMLTC806cdf, a total of 61 of these genes are present in DEG, i.e., these genes have homologs known to be essential in other species. The remaining 15 predicted essential genes that are not present in DEG (Table 3) include ntpA,D,E,F,K (all subunits V-type ATP synthase) as well as metF (involved in amino acid synthesis) and xpt (guanine synthesis) among others (Additional file 1: Table S6).
The inhibition of the product of essential genes that are upregulated during CDI may require a smaller drug dose to generate an effective response, thus decreasing side effects. We utilized transcriptomics data associated to the differential expression of genes during infection [57] to annotate predicted essential genes in view of their potential use as therapeutic targets. Eight predicted essential genes are downregulated in vivo while 6 are upregulated during infection. Some of the genes that are upregulated during infection and predicted to be essential such as fabD, serA or CD2549, are of additional interest as they could not only affect growth, but also colonisation and pathogenesis processes [57] (Table 3).

Detection of potential human cross-reactivity targets Sequence and functional similarities
One of the main goals in detecting essential genes is to assess their potential as therapeutic targets. One factor weighting in favour of a potential therapeutic target is the lack of a human homolog, as this decreases the chances of side effects of a potential drug off-targeting the gene product of the human homolog. We again utilize here two definitions of homology, the standard sequence homology that relates two genes through evolution and functional homology, which relates two genes via common function of their gene products, specifically the same E.C. number. Functional homology is stricter than sequence homology as it is not based on any level of similarity between the two proteins, only based on the fact that the two enzymes catalyze the same reaction. Fifty-four genes have no sequence homologs in H. sapiens. Thirty of these have functional homologs, while 24 genes identified as potential targets do not have any human functional or sequence homolog (Additional file 1: Table S6).

Local structural similarities
There is a possibility that potential cross-reactivity targets may perform different functions (EC numbers) and have little sequence similarity yet still have sufficient 3D atomic binding-site similarities to be inhibited by a drug developed against a C. difficile target. We proceeded to analyse binding-site similarities to detect potential crossreactivity targets for the 24 C. difficile proteins where sequence or functional similarity search did not detect any human homolog. Binding-site similarities are measured in terms of detected geometrically and chemically equivalent atoms in common between two binding-sites [58,59]. We found an average of 36 atoms in common between binding-sites (average p-value 0.038 and zscore 3.05) for 21 out of 24 models with members of Pfam families [60] that include human proteins. In three cases, ispF, CD2549 and dapH, no significant level of binding-site similarity was found to a Pfam family that contains human homologs (Additional file 1: Table S7).
↑ Genes shown to be up-regulated in vivo (6 in total) [57]. * Genes not present in the Database of Essential Genes (DEG) (15 in total) [56].
catalyzed by the modelled C. difficile protein. In seven of those cases, the top matching Pfam family that contains human homologs binds a similar ligand (Additional file 1: Table S6). Taking as an example the case of the enzyme encoded by the asd gene, we detect 39 atoms (Z-score 3.92, p-value 0.012) in common to a glyceraldehyde-phosphate dehydrogenase from spinach (PDB ID 2PKQ) bound to NADPH, a member of Pfam family PF00044 that has human homologs (Figure 1). Five out of the top 7 most similar binding-sites also bind NADPH or NADP, all from different Pfam families. The superimposition of these diverse binding-sites based on their similarities to the asd gene product binding-site leads to an extremely good superposition of their respective bound-ligands (Inset Figure 1). This suggests that the detected similarities are biologically significant. The quality of the resulting superimpositions together with the detection of similarities across families that bind similar ligands to those that bind the C. difficile targets reinforces the confidence in the biological significance of our predictions. The quality of the alignment of the NADP molecules across different families via the detected similarities suggests that these capture the molecular determinants responsible for binding. As such determinants are conserved across protein families, there is a possibility that these are also conserved within families and thus present in the human homolog. In most cases, the similarities that were detected actually represent commonly used cofactors or other ubiquitously used ligands, such as NADP above or ATP. These results do not necessarily mean that a drug targeting the C. difficile protein will bind the human homolog belonging to the detected Pfam families, but these should be used as potential cross-reactivity targets in the rational design of inhibitors against the C. difficile protein in question. Furthermore, given that the detected similarities focus on common cofactors and ubiquitous molecules such as ATP, the results also suggest that targeting the subpockets of less ubiquitously used substrates may reduce the chance of cross-reactivity.

Metabolic essentiality
The inhibition of potential human cross-reactivity targets detected by sequence, functional or 3D binding-site similarities may not necessarily lead to any serious side effects. As a result of the differences between human and C. difficile metabolism, a protein may be essential in the former but non-essential in the later. We sought to use FBA to determine if inhibition of human homologs of predicted essential C. difficile proteins could have any effect on the human cell. To do so, we performed a gene deletion FBA analysis on the latest draft of the human reconstructed metabolic network RECON2 [17]. In the case of functional homologs all reactions associated to the human homolog of the C. difficile target were removed for the FBA analysis from the human network. Likewise, in the case of the 21 C. difficile potential targets without human functional homologs but with detected binding-site similarities, we identified all human proteins from the Pfam families where similarities were detected and deleted all RECON2 reactions associated (from 1 to 121 reactions at once depending on the C. difficile protein). We opted for this conservative approach to simulate a situation in which a potential drug would inhibit all potential cross-reactivity targets. Only 4 predicted essential C. difficile genes have predicted essential human cross-reactivity targets (underlined in Table 3). For all others, the presence of a human potential cross-reactivity target may not be sufficient to discard it as a potential target.

Double gene deletions
We performed double gene deletions to identify potential polypharmacological targets and to target reactions that are catalyzed by isoenzymes. Based on FBA analysis, 203 gene pairs involving 69 unique genes that had small or no effect in single gene deletion were deleterious when removed in pairs. An additional 3 essential gene pairs involving 6 new unique genes were found using SA. Eight gene pairs were considered essential in both SA and FBA (Additional file 1: Table S8). Some double mutants show a synergistic effect, defined as an effect greater that an additional 1% reduction in biomass production in the double mutant than the sum of effects of each single mutant. The 39 synergistic double mutants were analysed in more detail (Additional file 1: Table S9). Thirteen of these synergistic gene combinations resulted in total abolition of biomass production in FBA or prevented the biosynthesis of at least one element of the biomass in SA, 11 had an effect between 10% and 20% and the remaining 15 had a small effect on biomass production (between 5 and 10%). Twelve of the essential pairs of genes are isoenzymes that catalyze the same reactions. Sixteen gene pairs represent enzymes involved in pathways with the same functional category while the remaining 11 gene pairs affect different pathways ( Figure 2). All essential pairs identified by both SA and FBA are isoenzymes whose removal results in a total arrest of biomass production.
Isoenzymes usually result in a higher biomass loss than relatively distant pairs (Figure 2). For the 12 isoenzymes, the deletion of the two genes in a pair is required to remove a reaction that is catalysed by both. For the remaining essential gene pairs, the reactions associated with both genes are used in parallel in the wild type, a case of metabolic plasticity [61], or the reactions from only one of the genes is used while the reaction from the other member of the pair can act as a backup, a case of metabolic redundancy [61]. Depending on the category in which an essential gene pair falls, different strategies may be required in order to target the pair [61]. From the 39 essential genes pairs, 23 represent cases of plasticity of the network and 4 cases of redundancy in the network (Additional file 1: Table S9).

Distribution of predicted essential genes across pathways
In order to see which parts of metabolism are more enriched with essential genes, we classified reactions into 8 functional pathway classifications (Additional file 1: Table S10). A hypergeometric test [62] for over and under-representation was performed to identify pathways enrichment in essential genes. This analysis confirmed that linear pathways, like lipid synthesis, tend to have more essential genes (25 out of 71 single gene deletions, overrepresentation p-value 1.81e −10 ) due to the lack of alternative ways leading to the production of biomass metabolites ( Figure 3, Additional file 1: Table S11).

Potential compounds binding predicted essential targets
Molecules potentially binding the proteins encoded by the 123 genes identified as potential targets on their own Figure 2 Effect of the deletion of essential genes and deletion of essential pairs of genes in term of biomass lost. Essential genes removal identified by SA were considered to give a null biomass if one of the component of the biomass was impossible to produce otherwise they were arbitrary attributed an effect of 5-10% since the number of reaction required to produce biomass augmented by less than 10% in every cases. The number of cases in the unlabelled sections of the pie chart is in clockwise order 1, 1, 3 and 1.
(76 genes) or as part of pairs (47 unique genes, forming 39 different pairs with synergetic deleterious effect when both genes are removed) were identified based on sequence homology (E-value < 1.0e −5 , sequence identity above 30%, and overlap over 80%) between predicted essential proteins and entries in the DrugBank database Version 4 beta [63] (Table 4). A total of 125 molecules bind 41 protein entries from DrugBank with homologs among 29 predicted essential C. difficile targets. While the list includes cofactors, binders, inhibitors and activators, all such molecules bind the homologs of the predicted essential C. difficile targets. Most of these molecules are still experimental. Interestingly, 22 molecules are approved drugs based on DrugBank annotation. Among these we have Celurelin predicted to bind to the product of two distinct predicted essential genes in the same pathway: fabF and fabH (with the potential for polypharmacology). Mycophenolic acid and Ribavirin are predicted to bind the predicted essential product of guaB and, TCL a potential inhibitor of the product of CD2577 which is essential only as part of double mutant with fabG. The double inhibition could be achieved with the use of the experimental molecules linked with fabG such as EMO, MAX or TDB. Lastly, pyridoxal phosphate is a potential binder of two proteins that are part of a pair whose double mutation is predicted to be lethal (glyA and CD2834). The identification of any potential binding small-molecules (based on target homology) is useful since there is a chance that these molecules may also bind the predicted essential C. difficile homolog protein and this information could be used as a basis for the development of more potent and selective inhibitors. This analysis also helps elucidate the role of pyridoxine, an essential vitamin that has no direct effect on biomass, since pyridoxal phosphate is a cofactor that binds two genes (glyA and CD2834) part of the same essential pair. In a similar way, we can elucidate one of the reasons for biotin essentiality via its identification as a cofactor for accC, a predicted essential gene as single mutant. This list remains to be experimentally validated but is meant as a starting point in targeting any one of the genes predicted as being essential in the network.

Discussion
In this work we present iMLTC806cdf, a highly curated metabolic network of the nosocomial pathogen Clostridium difficile strain 630. This metabolic network is functional in the sense of being amenable to simulations using Flux Balance Analysis to measure biomass production in diverse types of media. iMLTC806cdf is available in SBML, TSV and Excel formats. The network is based on the aggregation of metabolic data present in databases, augmented with information from various sources of experimental data from the literature and manually curated in order to ensure a high quality of the resulting network.
Metabolic networks bring together information from various databases. Due to inaccuracies, contradictions and missing information in each database [64][65][66], the noncurated network resulting from the simple aggregation of all these sources of information is often incomplete and includes a large number of errors, missing data and repetitions, which make the generated networks unusable [41]. Different databases will often have multiple identifiers for highly similar entries, which cause duplications. Surprisingly, this is true not only for reactions and metabolites where curation is more difficult but also for genes. Manual curation is essential to correct inaccuracies present in metabolic databases and particularly to fill gaps (missing reactions) necessary to remove dead-end metabolites to create a functional network. Manual curation is a time consuming, expensive and non-scalable process. Unfortunately, it is also indispensable at this time as can be attested by the 14 predicted essential genes (alone or as The target gene is part of more than one essential pairs, one with each of the proteins in parenthesis. part of a double mutant) that do not even appear in the automatically generated C. difficile network. It is clear that the automated generation of functional metabolic networks is a worthwhile goal. However, at present automatically generated networks are not functional. At this time, such networks can only be used at best as a starting point prior to intensive manual curation. Their use as starting points however subtracts very little to the manual work required to produce a functional network. In some sense, metabolic networks serve as a tool to aggregate all existing knowledge about the metabolism of an organism. Clearly, the more well studied an organism is, the more information exists to build and validate such a network. C. difficile is an organism that is not well studied due to a number of factors. First, being a pathogen severely restricts the number of active researchers studying it given the experimental requirements in terms of biosafety. Second, C. difficile presents particular challenges that make genetic manipulations notoriously difficult. Until recently, genetic studies in C. difficile were restrained by the lack of efficient tools to inactivate specific genes. The recent development of a universal gene knockout system in clostridia has opened new possibilities and it is now somewhat easier to disrupt genes in a very specific and directed manner [67,68]. This system called ClosTron is based on retargeting of the Lactococcus lactis Ll.LTRB group II intron so that upon transfer into C. difficile by conjugation, the intron integrates at a specific user-defined chromosomal site. With the advent of ClosTron our knowledge of the biology of C. difficile and the identification of essential genes is bound to increase but at the moment, there is a lack of C. difficile specific literature and that limits the completeness and validation of the network. Some information such as biomass constituents and protein-protein interactions had to be extrapolated from other closely related species. The metabolic pathways involved in linking dead-end metabolites to the rest of the network also sometimes had to be extrapolated due to lack of experimental data. No in-depth studies exist of the directionality of reactions as it was done in E. coli [69], or association between biomass production and predicted growth rate as it was done in B. subtilis [23], due to the absence of metabolomics data and precise growth rate experiments for C. difficile. While information can be extrapolated from closely related organisms, fundamental differences still exist and may be a source of potential errors. Notwithstanding the existing limitations in creating and validating a C. difficile specific network, iMLTC806cdf is as complete or more than existing curated networks and accounts for all existing relevant experimental information. We hope that in addition to being a tool to aggregate existing knowledge, iMLTC806cdf will prove to be valuable as a nucleation point in developing our understanding of this important human pathogen driving the generation of new experimental hypotheses.
The analysis of a metabolic network in isolation in the absence of other relevant processes present in the organism poses its own problems. For example, the effect of metabolites (such as pyridoxine and biotin) [37] involved in nonmetabolic processes could not be simulated. Likewise, methionine added to minimal media greatly increases growth in vitro [36] but the addition of methionine to the minimal media in iMLTC806cdf increases biomass by less than 1%. Methionine is mostly used in the bacteria as S-adenosyl methionine involved in the biosynthesis of cofactors and vitamins which are not directly involved in biomass synthesis and have an effect that cannot be simulated in the metabolic network [70]. While the removal of methionine produced a qualitatively correct outcome, the loss of biomass in the network when removing methionine from the minimal medium is only due to the additional reactions required to produce enough methionine as required for biomass production.
The case of pantothenate, the only essential vitamin with a clear metabolic effect is unique, as its essentiality is strain-dependant. A biosynthesis pathway for panthothenate was recently identified in C. difficile strain 630 [40] and is present in some other strains based on MetaCyc. As a result, this vitamin is a non-essential component of the growth medium based on our in silico analysis. This vitamin is however essential in a number of strains previously tested [37], which did not include either strain 630 or the others containing this pathway in MetaCyc.
The comparison with C. acetobutylicum network [26] indicates that both bacteria share the same metabolic core. Existing differences in reactions and associated genes may explain the differences obtained while comparing the effects of deleted genes and reactions. The different media utilized for both bacteria may also cause some differences.
The comparison with experimental results for B. subtillis [47] was used as validation due to the absence of experimental results for the gene essentiality in C. difficile. While essential differences exist between the two organisms, a large degree of conservation is also present. Therefore one should expect that a large number of genes conserve their essentiality across these two species. The high level of accuracy (according to functional or sequence homology) between the experimental results and our predictions serves as a validation of iMLTC806cdf as a mature draft metabolic network and increases our confidence in the list of predicted essential genes. One important difference between the metabolism of C. difficile and B. subtilis is that the later can use oxygen to produce energy while the former cannot. Therefore, predicted essential C. difficile genes involved in fermentation such as pykF or ackA are not essential in B. subtilis. Other genes such as fabH, CD1966 and ribC are only present in one copy in C. difficile while more than one gene catalyses the same reactions in B. subtilis [71], explaining why such genes are non-essential in the latter. Some genes whose inactivation is deleterious in vitro are not identified in silico due to their implication in nonmetabolic processes. Both iMLTC806cdf and the network of Bacillus subtilis [23] fail to identify the essentiality of CD1536 (yumC in B. subtilis) due to its implication in regulatory processes. In addition, the toxicity of a molecule cannot be simulated in silico either. Therefore, the essentiality of genes involved in detoxification or whose deletion leads to the accumulation of toxic molecules cannot be simulated. For example, the removal of CD3543 would lead to an accumulation of nicotinate that could be toxic. This effect cannot be simulated in the network, therefore CD3543 is not considered essential in the network but is essential in vivo in B. subtilis [47].
The combined use of FBA and SA allowed us to detect more essential genes than using either technique alone. Their joint use increases our confidence on the predictions for those genes where the two techniques agree. At the same time, the two techniques complement each other. For example, the gene acpS catalyses the only reaction that leads to the production of a holo-acyl-carrier protein from the apo version of the protein. This reaction is essential since the holo form of the protein is required to perform the elongation of lipids. The presence of a cycle that allows for the reutilisation of the released holo-acyl-carrier protein at the end of lipid elongation prevents FBA to identify the deletion of acpS as deleterious. The analysis by SA uses the apo version of the acylcarrier protein and predicts acpS as essential since without it, the holo form cannot be produced. New targets were also found when using SA for double mutants. Targets identified by SA are mostly isoenzymes that lead to deletion of new reactions in the shortest possible pathway leading to the production of essential biomass metabolites.
The gene deletion analysis identified interesting potential therapeutic targets. Targets such as the aspartatesemialdehyde dehydrogenase (E.C. 1.2.1.11) asd (UniProt ID Q17ZW9) or the diaminopimelate epimerase (E.C. 5.1.1.7) dapF (UniProt ID Q182T1, also known to be essential in B. subtilis) that do not have human functional homologs, decrease the chance of side effects due to cross-reactivity. Another predicted essential gene, the aspartate-ammonia ligase (E.C. 6.3.1.1) asnA (UniProt ID Q183C9) is up-regulated in vivo and could be important for the pathogenesis of the bacteria [57]. Most targets like CD2549, dxr and ispF have more than one of these characteristics and would be interesting for more than one reason.
As a result of the conservation of local binding site environments [58,59,72], drugs often targets proteins in a way that might not be sequence-dependent. To account for that effect, we used functional homologs to identify potential human cross-reactivity targets for predicted essential C. difficile proteins. This made for a more stringent analysis since the number of potential human functional homologs is almost twice as large as those based on sequence similarity alone. The absence of a human homolog is often used as a criteria for identification of potential drug target [73]. If no homolog is present, there is a smaller probability that a drug targeting this specific protein have an effect in humans.
For those cases where sequence or functional homology did not detect potential human cross-reactivity targets, we utilized the detection of local binding site similarities. This analysis identified protein families with human representatives that harbour large binding-site similarities to the C. difficile targets in the absence of sequence or functional similarities. The detected similarities are primarily localized to binding-sites of cofactors and ubiquitously used ligands such as NADP or ATP. It is important to keep in mind that it is not possible to determine a minimum similarity threshold other than 100% above which one can be certain that the detected human proteins will act as cross-reactivity targets as small differences can bring about drastic effects [59].
The presence of a human potential cross-reactivity target (a homolog or a protein with sufficient binding-site similarities) is not sufficient to evaluate whether or not targeting a particular target might have important side effects since the human protein might not be essential. The use of predicted essentiality of human functional homologs or those with detected binding-site similarities in RECON2 [17] in conjunction with their essentiality in C. difficile represents a more consistent analysis of targets across hosts and pathogens. To our knowledge this use metabolic networks across species to determine the potential of a target to have cross-reactivity targets leading to side effects is novel.
A "perfect" predicted essential target would be one without (or with non-essential) potential cross-reactivity targets in human and E. coli (as a proxy for gram-negative and gut flora in general), with essential homologs in B. subtilis and up-regulated in vivo. Although no C. difficile target could be found fulfilling all properties at once, the 123 potentially essential targets identified (as single or double mutants) fulfil several of these properties and could, once validated experimentally serve as a target for the development of new antibiotics.
The list of active molecules that potentially bind predicted essential targets includes many molecules that could help in the validation of the targets in C. difficile and the development of novel drugs [74]. Experimental validation is required to determine if the identified smallmolecules do bind the C. difficile homologs. Some of these small-molecules, such as the approved anti-viral Ribavirin, could speed the approval of C. difficile specific inhibitors through drug repositioning. In the case of Ribavirin, the molecule is a rapidly absorbed guanosine analog currently used in the treatment of Influenza [75] and hepatitis C [76]. Cerulenin has anti-fungal and anti-bacterial activity targeting FabF in B. subtilis [77], thus very likely targeting the same protein in C. difficile as we predicted. In all cases, the multiple small molecules predicted to bind the predicted essential C. difficile proteins could be used to bias library selection for the rational development of new inhibitors.

Conclusions
In the current work we present the first extensively curated metabolic network reconstruction for C. difficile (strain 630) iMLTC806cdf and validate it with experimental data on essential metabolites and carbon sources. We compare iMLTC806cdf to existing networks showing the importance of manual curation and use the network to predict essential genes. The predictions agree with experimental data for B. subtilis (the closest organism for which such data is available). We detect potential cross-reactivity targets for each of these genes using a variety of methods combining systems and structural computational biology and determine that for only 4 out of 76 predicted essential genes, if exiting, the potential human cross-reactivity targets are themselves essential in the human metabolic network reconstruction RECON2. For a number of essential genes we find potential binding small molecules, including approved drugs such as Ribavirin, which may inhibit the respective gene products. We hope that iMLTC806cdf will find further use in the community and the results here lead to the development of novel antibiotics against C. difficile infections.

Creation and curation of the draft metabolic network
We created a collection of metabolic and transport reactions associated with C. difficile strain 630 from the KEGG [34], MetaCyc [78] and TransportDB [79] databases. Reactions involving polymers (glycogen, peptides and others) were ignored to avoid the spontaneous creation of matter due to the presence of molecules of undefined length. Also, some of these polymers, like glycogen, are only used for energy storage [80] and would not have any impact in simulations that optimize biomass production. Reactions involved in spore formation, the conjugation process, RNA, peptide or DNA modification, cell repair, and other non-metabolic enzymatic reactions were not added to the network since these reactions are not directly contributing to the production of biomass constituents. Finally, the reconstructed network concentrates solely on metabolism without including signalling, gene regulation and posttranslational modification of proteins even if these can clearly affect metabolism.
The initial draft network is little more than a collection of reactions. In that state it cannot be used for any sensible application such as the simulation of biomass production. This is due to the numerous inaccuracies present in the databases. These inaccuracies consist of, but are not limited to, the presence of generic and dead-end metabolites, missing or erroneous pathways, missing genes and unbalanced reactions (Additional file 1: Table S12). Two cases exemplify some of these inaccuracies. First, noting that aerobic pathways should not be present at all in the C. difficile, the inclusion of incomplete versions of both aerobic and anaerobic pathways for the biosynthesis of vitamin B12 is problematic (Additional file 1: Figure S3). Second, the omission of most reactions in the Stickland and amino acid fermentation pathways (Additional file 1: Figure  S4), important sources of energy for the bacteria, which had to be completed based on literature. Both of these corrections a numerous others in of the same nature were necessary to create a functional network (Figure 4).
Most transport reactions were based on TransportDB [79] although putative transporters not present in the databases had to be added for molecules known to be exported or imported by the bacteria. An exchange reaction (reaction that simulates interaction with the media via the appearance or disappearance of the given metabolites in the network) was created for each metabolite with an extracellular version. These exchange reactions are set to allow the export of a metabolite unless it is part of the tested growth medium in which case import is also possible.
The curation not only involved the addition and suppression of reactions from the initial draft, many characteristics of each reaction such as the directionality, the presence of complexes or the assignment of gene-reaction associations and their inclusion as part of a pathway had to be analyzed manually. The directionality of reactions was based on information obtained from the MetaCyc database when available. Reactions found exclusively in KEGG were kept bidirectional unless leafing to the production of a highly energetic compounds (ATP, NAD+, NADP+, etc.) that were not known to be produced in such a manner. Examples of reactions producing highly energetic compounds are ATP synthases, amino acid fermentation and glycolysis.
Protein-protein interactions are highly important in the network since they can modify the essentiality of a gene based on knowledge of the involvement of its protein product as part of a protein complex. The possibility that a protein complex is responsible for catalysis was investigated for every reaction that had more than one gene associated to the reaction and for any gene whose protein is identified as a subunit of a complex. We used information from Trans-portDB, UniProt, Brenda and literature data (either for the C. difficile protein of interests or a homolog of same function that could indicate a similar interaction between genes products). In every case the STRING version 9.1 [81] score was calculated to evaluate the confidence score attributed to each complex (Additional file 1: Table S13).
We used KEGG pathway identifiers since they are generic and allow maintaining a small number of pathways in the network. Pathways containing less than five reactions were manually merged to the closest relevant pathway. Reactions that were not associated to any pathway in KEGG (32% of the reactions in KEGG) were linked to existing pathways. The pathway assigned to the reaction was the one in which the less linked metabolite of the reaction was most often present in the network (Additional file 1: Table S10).

Lipid biosynthesis and cell membrane composition
A problem with lipid biosynthesis in metabolic networks is the fact that a variety of different fatty acids exist in cells with different lengths and saturation states and can be used in numerous cellular processes, both metabolic and non-metabolic. Also, the lipid composition of the cellular membrane is a mix of various phospholipids and glycerolipids of varying lengths and composition and this composition varies depending on the growth conditions [82]. A complete definition of lipid metabolism is likely impossible to define at this moment given the lack of experimental information specific to C. difficile membranes. Furthermore, the resulting network would be dominated by lipid reactions with the same few genes repeated for every possible length and saturation state for every lipid type. Therefore, having an exhaustive definition of lipid metabolism would not bring any additional relevant information on metabolism. In most reconstructed metabolic networks, lipids are used almost exclusively in membrane formation. Following [42], in order to reduce the complexity of lipid metabolism while keeping it as close to the real bacteria as possible, the fatty acid composition of all lipids is held at a constant 16:0 (carbon chain length: number of double bonds), which is the most abundant fatty acid observed in C. acetobutylicum [83]. The last step in simplifying lipid biosynthesis was to combine the elongation of fatty acids from acetyl (2:0) to palmitate (16:0) into a single reaction and beta-oxidation of the palmitate back into acetyl as another single reaction.

Biomass
In order to simulate bacterial growth, the biomass (ensemble of macromolecules necessary for cellular growth and division) composed of DNA, RNA, cell wall, proteins, solute pool and lipids was defined based on the following elements. Nucleic acid composition of DNA used in the network was calculated based on the nucleotide content of the genome and plasmid of Clostridium difficile strain 630, RNA composition is based on the content of the transcriptome using the UCSD genome browser [84] and protein composition using the proteome. Lipid, cell wall and solute pool composition as well as the overall biomass composition were taken from the C. acetobutylicum network [42] due to a lack of literature specific to C. difficile. A detailed composition of the biomass is available in supplementary Additional file 1: Table S14.

Simulation of growth and gene essentiality
Two methods were used to simulate bacterial growth and determine gene essentiality: flux balance analysis (FBA) [21] and synthetic availability (SA) [22].

Flux Balance Analysis (FBA)
Flux balance analysis [21] is a constraint-based modeling method commonly used in the study of genome-scale metabolic networks. FBA firsts creates a stoichiometry matrix (S) from the network where each row represents a metabolite and each column a reaction. The values in this matrix correspond to the stoichiometry of the metabolite in the reaction with a negative number representing consumption and a positive number representing production of the metabolite. A system of linear equations is produced by multiplying S with a column vector v representing the fluxes through each reaction. FBA creates a steady-state distribution of fluxes where the product of the previous multiplication must equal zero S ⋅ v = 0. Since the resulting system of linear equations is undetermined, FBA uses linear programming to maximize a particular objective function Z, in our case biomass as the representation of growth, by maximizing the multiplication of a row vector c containing the weight of each reaction on Z with the column vector v used previously (maximize Z = c ⋅ v). Values in v are constrained by lower and upper bounds representing various factors like enzyme directionality of the reaction, capacity, uptake, secretion rates, etc. FBA finds a distribution of fluxes in the network respecting the constraints on v and maximizing Z at the same time. In our studies, the Sybil package version 1.1.11 [32] available for the free R environment for statistical computing (version 2.15.2) was used in order to run FBA simulations. Other tools also exist that can use FBA with different interfaces like the COBRA package [85] that runs on the proprietary Matlab computing environment.

Synthetic Accessibility (SA)
Synthetic accessibility [22] is a parameter-free method to predict the essentiality of genes through their deletions in metabolic networks. SA uses network topology to calculate the number of reaction steps needed to produce the outputs (biomass) of the network from the inputs metabolites available in the growth medium. SA works by examining all reactions that use only input metabolites and marks those reactions and their products as 'accessible'. In an iterative manner, successive iterations search for new reactions that have all required substrates marked as accessible until no new reaction can be added. Each metabolite j have a synthetic accessibility value S j representing the number of iterations before this metabolite became accessible and the Synthetic Accessibility of the network as a whole, S net , is the sum of S j of each of the output metabolites. Increases in S net resulting from gene deletions are predicted as being deleterious. Synthetic accessibility is a simpler method than FBA and gives comparable results on the prediction of essential genes demonstrating that the topology of the network is the principal factor influencing essentiality. We use our own implementation of the algorithm.

Local structure similarity
The identification of local structure similarities is separated into three steps: the creation of a three dimensional model of the protein, the identification of the probable binding site, the comparison of the binding site to a dataset of known binding sites. The models used where created using I-Tasser [86], a tool which builds 3D models based on multiple-threading alignment and iterative fragment based simulations. Detailed methodology for I-Tasser can be found elsewhere [86]. We used the Isocleft Finder (Kurbatova et al., [58]) web-interface to compare the largest cavity of each of the models (representing the binding site in 83% of cases [87]), identified using their own implementation of the SURFNET algorithm [88], to a non-redundant dataset of 7339 binding-sites of unique combinations of protein families bound to distinct ligands [58] using Isocleft [58,59]. IsoCleft is a graph-matching based method for the detection of structural and chemical similarities between pairs of protein cavities. In each case, we selected the most similar match found to a protein in the IsoCleft Finder non-redundant dataset that contains human homologs. It is important to note however that the representative of such family in the non-redundant dataset may itself not be necessarily a human protein.

Additional file
Additional file 1: Figure S1. Comparison of the metabolites (A) and genes (B) from iMLTC806cdf to those in the automatic network. Figure S2. Comparison of the metabolites (A), genes (B) and reaction (C) from the Clostridium difficile metabolic network to those from the network of Clostridium acetobutylicum. Figure S3. Example of reaction involving oxygen and their curation. Figure S4. Stickland reactions. Stickland reactions are an example of reactions often incomplete in metabolic databases that were added to the network based on Clostridium difficile specific literature. Table S1. Validation: addition or removal of various metabolites and comparison with literature data. Table S2. Utilization of various metabolites as carbon source and comparison with literature data. Table S3. List of dead-end metabolites identified in Clostridium acetobutilicum network (iCAC490). Table S4. Predicted Clostridium difficile gene essentiality vs. experimental data for functional homologs in Bacillus subtilis. Table S5. Comparison of predicted C. difficile and C. acetobutylicum gene essentiality. Table S6. Details of the predicted essential genes and comparison with targets identified in other target databases. Table S7. Binding site environment similarity used to find possible cross-reactive targets in human for Clostridium difficile essential genes without human homologs. Table S8. List of essential gene pairs with (A) or without (B) synergetic effect. Table S9. Details of the double mutant deleterious gene pairs with synergetic effect. Table S10. Functional classification of metabolic pathways. Table S11. Distribution of essential genes and genes involved in essential pairs in pathways. Table S12. Treatment of dead-end metabolites in the network. Table S13. Full list of protein complexes present in the network. Table S14. Clostridium difficile macromolecule and biomass constitution.