Investigating large, complex systems on a global scale makes it impractical, and perhaps impossible, to know the details of all the metabolites, genes, and proteins involved. At the same time, detailed knowledge of metabolic subsystems and/or enzyme activities is needed to formulate hypotheses about particular metabolic fates and novel reactions. In this study we used a systems biology approach to characterise and fill gaps in human metabolism. The key results are that i) many dead-end metabolites affect reaction cascades, ii) computationally predicted solutions require thorough manual curation and biochemical insight, and iii) four biologically plausible hypotheses were identified. This work highlights that finding gene candidates for metabolic functions in the human genome is not trivial. The extensive manual effort needed to curate the computational predictions of candidate reactions also highlights the overall quality and quantity of data included in RECON 1.
We characterised knowledge gaps of human metabolism, represented by blocked reactions and dead-end metabolites identified in RECON 1, for which solutions, in the form of non-organism-specific metabolic reactions, could be found using the computer algorithm SMILEY (Figure 1). We identified 175 blocked reactions, 70% of which had high confidence scores, and observed that while they were unevenly distributed across human metabolic pathways, most were found in the cytosol (Figure 2). Furthermore, we found that they arose from different types of dead-end metabolites, which in some cases were responsible for up to 14 blocked reactions. These properties are likely to affect how easily these knowledge gaps can be addressed experimentally, and they suggest that the impact of resolving the gaps, both in terms of novel metabolic discovery and in terms of their influence on RECON 1, will differ. For example, determining the fate of a metabolite that causes multiple blocked reactions will have a different impact on the biological accuracy of RECON 1 than resolving a single blocked reaction. Nevertheless, a single blocked reaction could be of great interest as a candidate for novel metabolic discovery, as its components could represent a drug target and resolving the gap could have unforeseen effects on network robustness, i.e., on human metabolism. Also, assaying cytosolic reactions will be more straightforward than determining the function of compartmentalised reactions. Consequently, which knowledge gaps are chosen for experimental research is ultimately a human decision that depends on research goals, biological novelty, ease of experimental validation, and the evidence underlying the validity of the knowledge gap.
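The way a single dead-end metabolite can block a cascade of reactions can be illustrated with a small sketch. This is a minimal, purely topological approximation on a hypothetical toy network, not the procedure applied to RECON 1: the reaction names, the network, and the helper functions `dead_end_metabolites` and `blocked_reactions` are all assumptions made for illustration. Genome-scale analyses would normally rely on flux-based methods, which can detect blocked reactions that this simple check misses.

```python
from collections import defaultdict

# Hypothetical toy network: reaction id -> (stoichiometry, reversible).
# Negative coefficients are substrates, positive coefficients are products.
TOY_NETWORK = {
    "EX_A": ({"A": 1}, False),           # uptake of A
    "R1":   ({"A": -1, "B": 1}, False),
    "R2":   ({"B": -1, "C": 1}, False),
    "EX_C": ({"C": -1}, False),          # secretion of C
    "R3":   ({"B": -1, "D": 1}, False),
    "R4":   ({"D": -1, "E": 1}, False),  # E is produced but never consumed
}


def dead_end_metabolites(reactions):
    """Return metabolites that cannot be both produced and consumed."""
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for rid, (stoich, reversible) in reactions.items():
        for met, coeff in stoich.items():
            if coeff > 0 or reversible:
                producers[met].add(rid)
            if coeff < 0 or reversible:
                consumers[met].add(rid)
    mets = set(producers) | set(consumers)
    dead = set()
    for met in mets:
        touching = producers[met] | consumers[met]
        # A metabolite touched by a single reaction is a dead end even if
        # that reaction is reversible, because steady state forces zero flux.
        if not producers[met] or not consumers[met] or len(touching) < 2:
            dead.add(met)
    return dead


def blocked_reactions(reactions):
    """Iteratively remove reactions that touch a dead-end metabolite.

    Returns reaction id -> dead-end metabolites that blocked it.
    Removing a blocked reaction can expose new dead ends upstream,
    which is how one gap propagates into a cascade.
    """
    active = dict(reactions)
    blocked = {}
    while True:
        dead = dead_end_metabolites(active)
        hit = {
            rid: sorted(dead & set(stoich))
            for rid, (stoich, _) in active.items()
            if dead & set(stoich)
        }
        if not hit:
            return blocked
        blocked.update(hit)
        for rid in hit:
            del active[rid]


if __name__ == "__main__":
    for rid, causes in blocked_reactions(TOY_NETWORK).items():
        print(rid, "blocked by", causes)
```

In this toy network only E is a dead end at first, yet both R4 and R3 end up blocked, because removing R4 turns D into a dead end in the next pass. Counting how many reactions trace back to each initial dead-end metabolite gives a rough measure of how much of the network a single resolved gap could unblock.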
We highlighted four examples of missing knowledge in human metabolism (Figures 5, 6, 7 and 8) that resulted in biologically plausible hypotheses using a combined algorithmic and manual approach. The hypotheses were strengthened with published experimental data. In the case of iduronic acid (Figure 6), a major constituent of glycosaminoglycans, we argued that an extracellular transport reaction needs to be added to RECON 1. Although no direct evidence for such transport could be identified in the human genome, the presence of iduronic acid in human urine (M. Fuller, personal communication) suggests that a transporter may be a biologically plausible solution. Further evidence is that the build-up of iduronic acid in the lysosome has been linked to a lysosomal storage disorder caused by defects in the sialic acid lysosomal transporter. Similarly, defects in α-L-iduronidase (EC 3.2.1.76), the exo-glycohydrolase that cleaves iduronic acid off the non-reducing end of dermatan sulfate and heparan sulfate, are known to cause a different type of lysosomal storage disorder, called mucopolysaccharidosis I. Despite its apparent involvement in disease, the metabolic fate of iduronic acid is unknown. The present work highlights knowledge gaps in human metabolic processes, such as the fate of iduronic acid, which have not been relevant when investigating lysosomal storage disorders caused by protein deficiencies but are now required to generate a complete picture of human metabolism. Our gap filling examples showed that algorithms such as SMILEY can be used to direct hypotheses of novel functions in human metabolism. Nevertheless, a semi-automated approach was required to assist with the identification of plausible gap filling candidates for experimental verification.
Multiple gap finding and gap filling algorithms exist, including GapFind/GapFill and GrowMatch, and the use of alternative algorithms will undoubtedly increase the number of possible hypotheses, as they employ different heuristics and data sources (e.g., universal databases). This work does not provide a comprehensive list of possible gap filling reaction solutions but rather assesses the use of (semi-)automated computational approaches for identifying and completing missing functions in human metabolism on a large scale. We found that computational tools such as SMILEY do not necessarily suggest biologically plausible gap filling hypotheses. The generated hypotheses need to be evaluated in a manual, time-consuming manner, similar to the gap filling process employed during the reconstruction approach [4, 26]. The search for novel functions is therefore only semi-automated. Automated algorithms could, however, be trained on the results of this manual effort to prioritize or exclude certain types of solutions. In addition, approaches could be developed that incorporate methods to build hypotheses of genes associated with orphan reactions [56–60], which SMILEY does not directly do.
Identifying genes associated with the biologically plausible hypotheses suggested by SMILEY was a major challenge. Relatively few knowledge gaps were resolved using known metabolic functions (Figure 3), and the solutions required detailed literature review so that the homology of what were often prokaryotic genes/proteins to possible human counterparts could be assessed. In light of our results, we believe that existing automatic gap filling approaches for uncovering gene function will be of limited use for mammals. This limitation arises from a lack of phylogenetic information, which is used extensively for annotating microbial genomes [58, 61]. Although various homology databases exist for mammalian genomes, covering up to seventy mammalian species, the majority of phenotypic, genetic, and biochemical studies have been performed using mice and, to a lesser extent, human cells. Information derived from these databases therefore originates from few organisms, making them less useful for annotation purposes. Furthermore, co-expression analysis is used in microbes to determine genes with related function [63–65] and could serve as a strategy for gene finding in the human genome. However, analysis of regions of correlated transcription (RCTs) in human and mouse identified both related and unrelated genes being co-expressed. Most RCTs were not found in both human and mouse, which the authors attributed to i) a missing definition of homology and/or synteny, ii) the absence of a conserved pattern, and/or iii) physiological differences between human and mouse. This example highlights the challenges associated with finding novel gene functions in the human genome using established methods from the bacterial world. Novel approaches may include the use of protein-protein interaction data [67–69], tissue-specific information [70, 71] and disease information combined with gap filling algorithms.
In particular, the latter work observed a high degree of correlation between known co-occurring (co-morbid) diseases in patients and flux coupling of the reactions that are perturbed in association with each of the disease states. Flux-coupled reaction sets, or perfectly coupled reaction sets (Co-sets), have been calculated in genome-scale metabolic models. Co-sets often lie along linear pathways. Thus, a low co-morbidity of two metabolically linked diseases would indicate a missing link along a Co-set, which would break the flux coupling by creating a pathway split. Similarly, single nucleotide polymorphisms have been mapped onto metabolic networks and may be used for identifying missing functions in human metabolism.
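The idea that a missing link would break flux coupling by splitting a pathway can be sketched on a toy example. The code below is a minimal illustration under stated assumptions, not the flux-coupling method used in the cited work: it relies on a simple sufficient condition, namely that if a metabolite participates in exactly two reactions, steady-state mass balance forces their fluxes to be proportional. The network, the reaction names, and the function `coupled_sets` are hypothetical.

```python
from collections import defaultdict


def coupled_sets(reactions):
    """Group reactions linked by a metabolite that only two reactions touch.

    `reactions` maps reaction id -> {metabolite: stoichiometric coefficient}.
    This is a topological sufficient condition for coupling; it does not
    find every coupled pair that a flux-based analysis would report.
    """
    parent = {rid: rid for rid in reactions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    touching = defaultdict(list)
    for rid, stoich in reactions.items():
        for met in stoich:
            touching[met].append(rid)

    for rids in touching.values():
        if len(rids) == 2:
            parent[find(rids[0])] = find(rids[1])

    groups = defaultdict(set)
    for rid in reactions:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical linear pathway: A -> B -> C with exchange reactions.
LINEAR = {
    "EX_A": {"A": 1},
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "EX_C": {"C": -1},
}

# The same pathway after adding a hypothetical branch at B.
BRANCHED = dict(LINEAR)
BRANCHED["R_new"] = {"B": -1, "X": 1}
BRANCHED["EX_X"] = {"X": -1}

if __name__ == "__main__":
    print(coupled_sets(LINEAR))
    print(coupled_sets(BRANCHED))
```

In the linear pathway all four reactions fall into one set. Once the branch at B is added, the set splits into three smaller ones, which mirrors the argument that an undiscovered branch reaction would decouple two reactions that the current model treats as coupled.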
Ubiquitous unknowns, e.g., genes with unknown function and orphan enzymes belonging to orthologous families, have been identified as top targets for functional elucidation in terms of biological knowledge payoff, as these are ancient in origin and therefore likely to be involved in essential metabolic processes [22, 25, 76]. We believe that combining a metabolic network approach with knowledge of ubiquitous unknowns could also represent an ideal method for organism-specific novel function identification.