Overview of modeling and model validation framework
The research framework relies on two data sets that represent different states of knowledge for the same gene regulatory network. The gene regulatory network includes a set of genes important for the development of the C. elegans embryonic C cell lineage, a cell that gives rise to two distinct cell types: mesoderm and ectoderm [10]. The C lineage is dependent on the transcription factor PAL-1, and a number of genes have been identified that influence how cells in the lineage choose between the two possible cell type outcomes [12, 13]. The test data set is based largely on results of Yanai et al.[14], who completed gene perturbation experiments followed by transcript abundance analysis for all genes in the network, and yeast one hybrid (DNA binding) analysis for all transcription factors and the upstream sequences for each gene in the network. These data (along with additional data curated from the literature, see Methods) are used to develop the Gold Standard Network (GSN). The source data set for the Mathematically Inferred Model (MIM) is provided in [13]. It includes time-course analysis of gene transcript abundance in essentially wild-type animals. In addition, we have formalized the knowledge-driven model proposed by Baugh et al. (Figure 9 of [13]) as the Wild Type Model (WTM), for comparison to the MIM. Consequently, both the mathematical model (MIM) and the knowledge-driven biological model (WTM) are derived from the same data set and reflect a comparable state of knowledge of the gene regulatory network. These two models are then compared to the Gold Standard Network derived from the test data set. To select the genes to include in the network, we used all of the genes in the network proposed by Baugh et al.[13], plus lin-26. lin-26 had been previously identified as a critical factor for development of epidermal cell types [15–17]. This approach of separating the model-building data from the model-testing data provides a rigorous test of our mathematical modeling strategies using data that are pre-existing in the literature.
We will use the word “module” to refer to groups of genes, following the groupings of Yanai et al.[14], with additions from Baugh et al.[13]. The specific groups are “initiation” for pal-1, tbx-8, and tbx-9; “ectoderm” for elt-1, elt-3, lin-26, and nhr-25; “mesoderm” for hlh-1, hnd-1, and unc-120; “mixed” for nob-1, scrt-1, and vab-7; and “other” for cwn-1 and mab-21. In all figures we have color coded the nodes according the module: blue for initiation, yellow for ectoderm, gray for mesoderm, brown for mixed, and green for other.
The gold standard network is a comprehensive knowledge-driven model of the gene regulatory network controlled by PAL-1
In order to assess the performance of a model, a representation of the "truth", or a so-called gold standard network, is required. A network is defined as a gold standard if it is used to validate the performance of a method or a model; in essence it is considered to be the sought-after solution [18, 19]. While it is common to construct a gold standard from a synthetic or a simulated network, we aim to assess the predictive power of two types of models (i.e., data-driven versus knowledge-driven) in the presence of future knowledge. Therefore our gold standard, constructed from interactions obtained from experiments subsequent to those in [13], is intended to be a comprehensive model of the regulatory network controlled by PAL-1.
The primary source of data for our gold standard network, which we label GSN, regulating cell fate decisions in the C. elegans C lineage is [14], with additional data curated from the literature (Additional file 1: Table S 1, tabs “Gene interactions” and “Gene interactions – refs”). Yanai et al. incorporated their results into a knowledge-driven model represented as a set of directed graphs (Figure 4 of [14]), which we refer to as the Experimentally Derived Model (EDM). While this model reflects the data, we have produced an alternative model that is derived from a systematic approach to data interpretation (see Methods). The rationale for production of this alternative model is to demonstrate one way in which scientists lacking specialized expertise in a particular biological system can use existing data to build a knowledge-driven model, and derive testable hypotheses from that model. We define this as the GSN. Due to its size, the GSN is represented as a pair of graphs: one showing the interactions between genes within a regulatory module (e.g., ectoderm, mesoderm) and the other showing the interaction between genes in different modules (Figure2). The model includes both directed and undirected edges, depending on the type of experimental data that predict the edge. For consistency, we use the gene (rather than protein) names in all of the models.
To evaluate the GSN, we compared it to the EDM, including only pal-1 and the genes of the ectoderm and mesoderm modules, to match the choices in [14]. In Figure3 we illustrate the similarities (in black) and differences (in red) between the EDM and the GSN. The GSN shares extensive similarity with the EDM, a result that is not unexpected given that the two draw on the same data sources. Two edges in the ectodermal module (from elt-1 to nhr-25, and elt-3 to lin-26) are included in the EDM, whereas they are predicted to be absent by the GSN. Five additional edges are predicted by the EDM, but the data were insufficient for the GSN to either support the edge, or provide a direction for the edge. We hypothesize that the requirement of two data features to support inclusion of an edge in the GSN will result in a more conservative network than provided by the EDM. Altogether, the GSN is a network that shares similarity with one derived independently by experimental specialists.
One benefit of the GSN is that it incorporates data for all genes in the network, allowing description of features that are not included in the EDM. For example, we observe that the initiation genes pal-1 and tbx-8/tbx-9 are distinct in their interactions within the GSN, whereas no such distinction is made in the EDM. In particular, pal-1 and tbx-8/tbx-9 have different target gene sets. In addition, tbx-8/tbx-9 are regulated by other genes in the network (nhr-25, lin-2 6 and nob-1), whereas pal-1 has no in-network regulators. In the GSN all three of the initiation genes (along with others) regulate elt-3, whereas in the EDM pal-1 is the only regulator from the gene set. In the GSN, we see that pal-1 regulates all mesodermal genes, which matches the predictions in the EDM; however, we also see that the other two initiation genes tbx-8/tbx-9 also regulate hlh-1. Altogether, the GSN includes a greater complexity of interactions for the initiation gene set than does the EDM.
The EDM incorporates data for two pal-1-regulated modules: ectoderm and mesoderm. The EDM and GSN exhibit similarities within the ectoderm module, but there are a number of differences for the mesoderm module. The EDM predicts that all three genes regulate each other. The GSN is in agreement between hnd-1 and hlh-1; however there is disagreement around unc-120. The difference arises from there not being enough support for the regulatory interactions involving unc-120. In fact, the GSN suggests that the regulatory interaction from hnd-1 to unc-120 may happen through nob-1; that is, there is a directed path of hnd-1 → nob-1 → unc-120. While additional data might modify the interpretation, the more conservative GSN identifies the potential for indirect regulatory relationships.
For the mixed module, the EDM provides no predictions for these genes. While we cannot make a direct comparison, we will highlight some predictions of the GSN. The GSN identifies the mixed gene vab-7 as a key regulator of the mesoderm module, as only it and pal-1 regulate all three genes of the module. The GSN also identifies nob-1 as an important regulator in the whole network. nob-1 interacts with each module through regulation of elt-1 (ectoderm), unc-120 (mesoderm), and tbx-8/tbx-9 (initiation). nob-1 also regulates the other genes within the mixed module. In contrast, scrt-1 exhibits no interactions beyond the mixed module. Thus the GSN identifies the cross-network nature of genes in the mixed module, and also demonstrates functional differences among the mixed module genes.
The authors of the EDM proposed an “expert-inferred” model (i.e., one not strictly based on their data) for the regulation of the ectoderm and mesoderm modules by pal-1 (see Figure 6 of [14]); we refer to this model as the C lineage model. For completeness’ sake, we compare the GSN to those additional findings. In the C lineage model, the authors predict that the ectoderm gene elt-1 regulates the mesoderm module; the GSN narrows the target set to unc-120 and hlh-1, and includes lin-26 as another regulator of these two mesoderm genes. Additionally, in the GSN the mesoderm module regulates elt-1, which matches the predictions in the C lineage model. Thus the GSN also makes predictions that match expert-driven network features that reflect expert experience, knowledge of the literature, and knowledge of the experimental system.
In building the GSN, we also collected information about which genes are observed to not regulate other genes, referred to as “non-interactions”. Non-interactions are distinct from relationships for which there is insufficient data, and identify prospective cases of network hierarchy or other network constraints. For example, a number of non-interactions were identified between initiation genes and genes from other categories. The genes elt-1, elt-3, hlh-1, unc-120, and scrt-1 were found to not regulate any of the initiation genes. Additionally lin-26, nhr-26, hnd-1, vab-7, and nob-1 do not regulate pal-1. In the ectoderm module, we observe that nhr-25 and elt-3 do not regulate lin-26; elt-1 does not regulate nhr-25; and lin-26 does not regulate elt-3. In the mesoderm module, however, no non-interactions were discovered. These results identify differences in either the interconnectedness or the functional redundancy between the two modules.
To further evaluate the GSN, we evaluated a set of “statistics” from graph theory for which common features have been discovered in modeling gene regulatory networks [20]. In many biological networks, the average path length (average minimal number of edges between any two nodes) is less than four. With an average path length of about three, the GSN is consistent with this prediction. The average out-degree (the number of outgoing edges) is five and the average in-degree of the nodes is five. Consistent with other gene regulatory networks, the GSN has relatively few nodes with an out-degree of greater than half of the network in their target set: pal-1 lin-26, and nob-1. These genes can be considered the network hubs. While there are also relatively few nodes with an in-degree of greater than half of the network (hlh-1 and elt-1), the GSN is unusual for a transcriptional network as it has a relatively large range for the in-degree of the nodes (0–10 in a 15 node network). Overall, the GSN exhibits network features seen in other gene regulatory networks, and serves as a knowledge-based model for C cell lineage development against which other models can be tested.
A mathematically inferred model of the gene regulatory network controlled by PAL-1
We derived a Mathematically Inferred Model (MIM) for the gene network regulating cell fate decisions in the C. elegans C cell lineage using wild-type gene expression time-course data from [13]. The model results from applying two different modeling methods in series. Such approaches have been termed “pipelines,” and they exhibit improved performance over individual methods alone [21]. To utilize the benefits and minimize the weaknesses of methods from different modeling classes, we applied statistical (covariance (COV)) and algebraic (the Minimal Sets Algorithm (MSA)) methods in series (see Methods). Covariance can discover a larger number of gene relationships, but does not provide information about the regulatory relationship between genes that covary. MSA, on the other hand, identifies fewer possible relationships, but it predicts directional edges since it incorporates the observed gene expression changes from one time point to the next. Consequently, the model includes both directed and undirected edges, reflecting the types of edges predicted by the different methods. The MIM is represented as a set of three figures: one showing the interactions between genes within a regulatory module, one showing directed interactions between genes in different modules, and one showing undirected interactions between genes in different modules (Figure4).
There are two principal ways to combine the methods used in the MIM: COV followed by MSA, denoted COV-MSA, and vice versa. Intuitively it seems natural to first allow COV to decide which genes interact and then to let MSA determine the direction of the interaction; however, either order is technically feasible. We built a model using both pipeline orders, compared them using the assessment methods described below, and found that they yield comparable results (see Methods). These findings suggest that, at least in this case, the order of modeling methods within the pipeline does not drastically impact the performance of the model. The MIM of Figure4 results from application of MSA-COV, as this model is modestly better than that built from the reverse order. The graphs from application of COV-MSA are shown in Figures S1 and S2 in Additional file 2.
To assess the MIM, we compared it to the GSN, and to the model offered by Baugh et al. ([13], their Figure 9). We term this model the Wild Type Model (WTM; Figure5), as it is a knowledge-based model derived from the same wild-type data used for the MIM. As a first comparison, we evaluated the number of predictions in each model that are validated by the GSN. The WTM model makes 39 predictions, which includes true positives, true negatives, false positives, and edges considered to be half-right (see Methods) with 32 or 82% being correct, whereas the MIM makes 88 predictions with 57 or 65% being correct. Thus although the WTM achieves a higher percentage of correct predictions, the MIM makes over twice as many predictions, without a comparable loss of correctness. For a more detailed comparison of the models, we used precision-recall (PR) and receiver operating characteristic (ROC) plots (Figure6). In both graphs, points that lie above the dashed line have a stronger predictive value than random. The data used to produce these figures are included in Additional file 1: Table S 1, Tab “Overall.”
We assessed the models’ overall performance, as well as performance on ability to identify targets of PAL-1 and to identify the ectoderm and mesoderm subnetworks. More specifically, we considered the subnetworks generated by the genes within a module, that is the subgraph of edges incident only to genes within a module. We also considered the subnetworks of the modules within the context of the entire network; that is the subgraph induced by those edges incident to genes within a module and possibly incident to genes outside the module. The subnetwork of the first type is denoted by “s” and of the second type, by “w” in Figure6. Note that two nodes are adjacent if they share an edge, and an edge is incident to a node if the node is an endpoint of the edge.
A global comparison of the models can be obtained by evaluating the distance of the resulting model from one whose predictions are random guesses (see Methods). The MIM has a total distance of 2.8, whereas a best performing model has a total distance of about 8.49. In contrast, the total distance of the WTM from random is 1.3, arguing that the MIM globally has a 111% increase in predictability over the WTM. The predictions that the WTM (blue triangles) makes are correct in the context of the GSN, accounting for high precision (92%) and low FPR (1%); however, the WTM misses many of the interactions in the GSN, which is evident in its low recall (14%). It is due to the low recall value that the predictions of the WTM have a distance of 0.04 and 0.09 from random guesses in the PR and ROC plots, respectively. While the MIM (orange squares) has more misidentified predictions (lower precision at 71% and higher FPR at 46%) than the WTM, it has greater recall (55%). These changes result in a distance of 0.18 and 0.06 from random guesses in the PR and ROC plots, respectively, indicating a 316% increase in PR space and a 29% decrease in ROC space. We see that the MIM correctly identified more targets of PAL-1 (78% recall) than the WTM (33% recall); both models have 100% precision and 0% FPR for the targets of PAL-1. The MIM’s PR and ROC points are a distance of 0.55 from a random guess, compared to the WTM’s points at a distance of 0.24; this value of 0.55 is the largest distance achieved by either model. The MIM also identified more interactions in the mesoderm module (59-67% recall; the range reflects differences between subnetwork “s” vs “w”) than the WTM (15-33% recall). While the WTM has 100% precision and 0% FPR on the mesoderm module, the MIM has 82-100% precision and 0-33% FPR. The significance is that the MIM does not lose much in the way of making mistakes, while it gains much in the way of identifying other interactions, as compared to the WTM. Where the mathematical model suffers is in the prediction of the ectoderm module, with a precision of 50-71%, a recall of 50-67% and a FPR of 45-100%. As above, we use the distance to summarize the findings: the distances of the MIM’s PR and ROC points range from −0.24 to 0.15. We point out that negative distances reflect worse than random predictions. Notably, the single false positive edge prediction in the WTM is also in the ectoderm module, with 71-78% precision, 12-28% recall and 4-17% FPR, with distances between −0.07 and 0.08. The false positives present in both models may be indicative of features of the ectoderm module - such as complexity - that uncover the limitations of our modeling strategy. Alternatively, the false positives could suggest genetic redundancy in this module. For example, redundancy in the network would lead to under-prediction of regulatory edges based on single gene perturbation experiments, which are the primary source of data for the GSN.
Now we compare specific features of the MIM to the GSN. One notable difference between the MIM and the GSN is in the predictions for tbx-8 and tbx-9. In the GSN, these initiation genes are indistinguishable, which is not the case in the MIM. For example, the MIM suggests that tbx-8 interacts with lin-26, hlh-1, and nob-1, whereas there is no such prediction for tbx-9. Furthermore, of these two initiation genes, only tbx-9 is found to regulate elt-3 in the MIM, while in the GSN both regulate elt-3. On the chromosome, tbx-8 and tbx-9 are adjacent genes with a shared upstream region (they are on opposite strands of the DNA), and this genomic organization has suggested that they are co-regulated [22]. However, because the MIM is built from gene expression data, these model differences reflect differences in the observed behavior of the tbx-8 and tbx-9 transcripts, and suggest that the regulation of these genes may be more complicated than previously thought.
An important feature of the MIM is the involvement of pal-1 not just as a key regulator of other genes, but also as a participant in feedback loops commonly found in gene regulatory networks. Both the GSN and the MIM identify tbx-8 and tbx-9 as regulators of pal-1, and include the two-cycles pal-1 ← → tbx-8/tbx-9. The mathematical model additionally suggests non-initiation genes as regulators of pal-1, namely, hlh-1 nob-1, and cwn-1. The genes pal-1 → nob-1 → cwn-1 form a feedforward loop (with each gene also providing feedback on pal-1), while pal-1 → scrt-1 → hlh-1 and pal-1 → (nhr-25 lin-26) → nob-1 form feedback loops. (Note that in a feedforward loop the first element goes to the last, whereas in a feedback loop the last goes to the first.) Other feedforward loops shared by the GSN and the MIM are pal-1 → hnd-1 → lin-26 and pal-1 → lin-26 → hlh-1. Since pal-1 is both provided maternally and transcribed zygotically, feedback loops in the MIM identify potential zygotic activities for pal-1. While other studies have demonstrated the general importance of zygotic pal-1[23–25], the ability to separate prospective zygotic from maternal roles is a unique features of the MIM compared to the other models.
The MIM identifies possible alternative network architectures compared to the GSN. Both the GSN and MIM predict the directed paths lin-26 → nhr-25 → elt-1 and hnd-1 → lin-26 → hlh-1, the latter suggesting an alternate path of the direct regulation of hlh-1 by hnd-1 represented in the GSN. Another example of an alternative path is that from scrt-1 to vab-7, shown to be direct in the GSN; however, another explanation is that scrt-1 acts on vab-7 through hnd-1, as suggested by the MIM. In building the GSN, we assumed that experimentally confirmed edges were direct. However, we recognize that gene expression changes might result from either direct or indirect effects. The MIM suggests cases where alternative (indirect) regulatory relationships might explain the observed data.
A comparison of the non-interactions between the MIM and the GSN could not be performed. While such information can in principle be extracted from the chosen modeling methods, there were no non-interactions being reported by MSA and adjusting the threshold for COV resulted in all genes not interacting. As there were no conclusive results, we exclude comparison of non-interactions involving the MIM.
We evaluated the MIM for the same graph theory metrics as we used for the GSN. The average path length is about 3, and therefore less than 4. The nodes with an out-degree greater than half of the network are pal-1 and nhr-25. The average in-degree of the nodes is 3, and - compared to the GSN - the MIM exhibits a narrower in-degree range (2–6). No node exhibits an in-degree greater than half of the network, but the nodes with greatest in-degree are pal-1 (indeg = 6), nhr-25 (indeg = 5), and nob-1 (indeg = 5). We conclude that the MIM derived from wild-type data exhibit similar network features to those of other gene regulatory networks.
Altogether, we have mathematically reverse engineered a model based on wild-type gene expression data. We employed a pipeline modeling strategy that benefits from the unique strengths of two distinct modeling methods, and show that the order of method in the pipeline does not have a big impact on outcome. The model is enriched for positive predictions compared to random guesses, as well as compared to a knowledge-driven model built from the same data. A key benefit of the model is that it extracts information and offers predictions beyond the focus of the original experimental framework. We conclude that complementing knowledge-driven insight with systematic modeling approaches has the potential to improve predictability, prioritize future experiments, and suggest new network features compared to reliance only on knowledge-driven models.