Topological effects of data incompleteness of gene regulatory networks
© Sanz et al.; licensee BioMed Central Ltd. 2012
Received: 18 May 2012
Accepted: 6 July 2012
Published: 25 August 2012
The topological analysis of biological networks has been a prolific topic in network science during the last decade. A persistent problem with this approach is the inherent uncertainty and noisy nature of the data. One of the cases in which this situation is most marked is that of transcriptional regulatory networks (TRNs) in bacteria. The datasets are incomplete because the regulatory pathways associated with a relevant fraction of bacterial genes remain unknown. Furthermore, the directions, strengths and signs of the links are sometimes unknown or simply overlooked. Finally, the experimental approaches used to infer the regulations are highly heterogeneous, in a way that induces systematic experimental-topological correlations. And yet, the quality of the available data increases constantly.
In this work we capitalize on these advances to point out the influence of data (in)completeness and quality on some classical results of the topological analysis of TRNs, especially regarding modularity at different levels.
In doing so, we identify the most relevant factors affecting the validity of previous findings, and we highlight important caveats for future topological analyses of prokaryotic TRNs.
As is commonly noted in the literature, gene regulation is a complex process involving different phases and biochemical phenomenologies [1, 2]. Among these mechanisms, transcriptional control constitutes one of the main resources the cell relies on to respond biochemically to environmental fluctuations and challenges. As a consequence, the systematic characterization of TRNs has become a subject of high scientific interest. Topological features of TRNs are customarily characterized at all scales using different metrics. At the large scale, genome-wide TRNs are signed and directed networks with the following features: (i) regulatory proteins, the origin of the regulatory interactions of the whole system, represent a small fraction of the total number of nodes; (ii) out-going connectivity patterns are very heterogeneous, with a small percentage of global regulators (hubs) sending most of the links; and (iii) in-coming link distributions are quite compact: there is a characteristic scale that defines the typical number of regulations each protein receives.
Turning to the mesoscale, modularity also appears in TRNs as a key feature for understanding the dynamical function of the system. In genome-wide TRNs, each regulator defines its own regulon as the set of nodes directly or indirectly regulated by it. Regulons are therefore subnetworks that can sometimes be hierarchically organized; on other occasions, regulons partially overlap in non-trivial ways. Thus, the identification of groups of regulons (or parts of them) interconnected through atypically dense patterns is expected to carry information about the biological role of the proteins within them. The underlying idea is that community structure in biological networks might help to unveil functional modularity.
However, perhaps one of the most striking results of the topological analysis of TRNs relates to small-scale connectivity patterns (sets of 3 or 4 nodes), whose statistics are anything but contingent. Some of these patterns (or motifs) have been found to appear much more frequently than expected at random, while others are underrepresented in real networks. These statistical profiles, measured on different systems, allow network families to emerge, each of which provides a general framework for understanding the origin and the dynamical principles of the systems within it.
In addition to the aforementioned issues, the experimental challenges underlying the systemic characterization of TRNs are far from being solved. The quantity and quality of available data on genome-wide transcriptional regulation are significant only for a small set of model organisms. Besides scarcity, the usual problem is the heterogeneous quality of the experimental evidence for the regulatory interactions, the building blocks of TRNs. Despite these problems, the amount of high-quality experimental information about transcriptional regulation at the systemic level is growing each day, and not only for model prokaryotes.
In this work, we analyze three of the best known prokaryotic TRNs, for which these data quality improvements are being most thoroughly incorporated into publicly available data sets. Two of them correspond to the model bacteria Escherichia coli and Bacillus subtilis, while the third corresponds to the pathogen Mycobacterium tuberculosis, whose first network characterizations [10–12] are more recent and incomplete because of the much greater difficulty of its wet-lab treatments and protocols. Specifically, the general question we set out to answer here is whether robust and biologically relevant conclusions about TRNs can be reached given the current incompleteness of the data, going a step further than other works that have previously addressed this question. We also show that some topological metrics, in particular the structure of the mesoscale, do depend on the level of detail incorporated in TR maps. Our findings show that extreme care should be taken when strong claims are made based on partial data. This is the case for the TRN superfamilies, which we argue in fact form a single class.
The identification of modules in complex networks has attracted much attention from the scientific community in recent years. A modular view of a network offers a coarse-grained perspective in which nodes are grouped not according to knowledge-based criteria (function, composition, etc.), but rather on a topological basis (who is connected to whom). To this end Newman put forward the concept of modularity Q, which quantifies how far a certain partition is from a random counterpart. Building on this definition, ever faster and more efficient algorithms and heuristics to optimize Q have appeared, and generalizations to directed, weighted and signed networks are also available in the literature [16, 17]. All these efforts have considerably improved the quality of the community structure detected in networks, and thus a more complete topological knowledge at this level has been attained. Behind this interest lies the intuition that the relation between network structure and dynamics is strongly mediated by the mesoscale, and that community structure plays a central role in network formation and functioning. And yet, with few exceptions, link attributes are seldom taken into account.
In this section we intend to underline that interaction direction and sign critically shape the detected community structure of a network. This is even more dramatic in the case of TRNs, where a sharp distinction must be made between regulators (which mostly emit links) and the rest of the network, which mainly receives them. It is also characteristic of these systems (though not exclusive to them) to allow for positive (activating) and negative (inhibitory) relationships. In practice, directions and signs are not always available in the datasets. Regarding directionality, we analyze a system, the TRN of M. tuberculosis, for which this is not an actual problem, as regulatory proteins are well identified, i.e. their function as link sources is known. Nevertheless, there are many organisms whose regulatory pathways have not been explicitly identified, and in those cases the real topology is usually replaced by a co-expression network, which acts as an undirected proxy for the true underlying regulatory structure. Unavailability of interaction signs is, on the other hand, a more persistent problem: many experimental approaches to inferring a transcriptional regulation do not report the sign of the interaction. Furthermore, some interaction signs depend on environmental conditions. Therefore, given the unavoidable incompleteness of the data, we explore whether link attributes determine the network modular structure, and to what extent.
To address the previous question, we perform a systematic comparison of the effects of preserving the original information (sign and direction) on modularity measures and community structure in TRNs. To this end, we analyze the TRN of M. tuberculosis, for which we consider three different topologies: one that preserves all available information (directed-signed, DS); an intermediate one that preserves directions but not signs (directed-unsigned, DU); and a last one where all fine-grained information is ignored (undirected-unsigned, UU). From the output of this analysis, we provide a way to quantify how much biological information is lost when directions and/or signs are dropped. Note that the three versions of the network have the same number of nodes N and number of links L, the only differences being those regarding the direction and/or the sign of the interactions. Interaction signs have been compiled from the experimental works listed in , although signs were not reported there (see ).
which accounts for the deviation of the actual positive weights from a null-case random network; the negative counterpart Q− is defined accordingly, simply by using the negative weights in the expression. In our current object of study, links can only take the values +1 or −1, and are originally defined as directed.
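The following is only a sketch of how such a quantity can be evaluated for a given partition. It assumes one commonly used formulation for directed, signed networks, in which Q+ = (1/w+) Σ_ij [w+_ij − w+_i(out) · w+_j(in) / w+] δ(C_i, C_j), Q− is built in the same way from the negative links, and both are combined as Q = (w+ · Q+ − w− · Q−) / (w+ + w−). The function name and the toy gene labels are hypothetical, and the resolution parameter r used in the analysis is not modelled here.

```python
from collections import defaultdict

def signed_directed_modularity(edges, community):
    """Return (Q, Q_plus, Q_minus) for a partition of a directed, signed network.

    edges:     {(source, target): +1 or -1}
    community: {node: community label}
    Assumed formulation: Q = (w+ * Q+ - w- * Q-) / (w+ + w-).
    """
    def partial_q(links):
        total = float(len(links))            # every link has unit weight
        if total == 0:
            return 0.0, 0.0
        out_c = defaultdict(float)           # out-strength summed per community
        in_c = defaultdict(float)            # in-strength summed per community
        observed = 0.0                       # links falling inside a community
        for src, dst in links:
            out_c[community[src]] += 1.0
            in_c[community[dst]] += 1.0
            if community[src] == community[dst]:
                observed += 1.0
        # Null-model expectation: sum over communities of out * in / total
        expected = sum(out_c[c] * in_c[c] for c in list(out_c)) / total
        return (observed - expected) / total, total

    positive = [e for e, sign in edges.items() if sign > 0]
    negative = [e for e, sign in edges.items() if sign < 0]
    q_pos, w_pos = partial_q(positive)
    q_neg, w_neg = partial_q(negative)
    w = w_pos + w_neg
    q = (w_pos * q_pos - w_neg * q_neg) / w if w else 0.0
    return q, q_pos, q_neg

# Toy example (hypothetical genes): activations inside groups, one repression across
toy_edges = {("a", "b"): 1, ("b", "a"): 1, ("c", "d"): 1, ("d", "c"): 1, ("a", "c"): -1}
toy_partition = {"a": 0, "b": 0, "c": 1, "d": 1}
q, q_pos, q_neg = signed_directed_modularity(toy_edges, toy_partition)
```

Setting every sign to +1 before calling the function gives the directed-unsigned (DU) value for the same partition, which is one simple way to see how much the sign information contributes.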
Figure 1 (top) represents the number of modules N_c that a combination of Q-maximization heuristics has detected for the three versions of the TRN of M. tuberculosis. Each topology has been scrutinized at different scales, screening the parameter r over 200 possible values, in a range chosen so that it yields an interpretable number of modules. This range changes for different topologies, so r is normalized in the plot to allow for comparison. On visual inspection it is apparent that the three topologies present plateaus, where different r values yield similar partitions in terms of N_c. This indicates that certain topological scales are robust and persistent, which might be a clue to identifying functionally relevant groups of nodes. Notably, the UU topology presents a single plateau at N_c = 205 and then fails to stabilize for larger values of r. In contrast, DU and DS, which retain more information, yield stable partitions at many levels. Although at different r values, these two topologies exhibit almost the same behavior regarding plateaus and the number of communities N_c that these plateaus present. At this point, one can say that the mesoscale analysis of the DU and DS networks allows a richer interpretation in terms of the grouping of nodes, but there is no way to confirm whether these groupings are more or less biologically sound than, for example, those of the UU topology.
The Asymmetric Wallace index can also be defined the other way around (AW_{B,A}), but this case is not considered here, because detected partitions are systematically more divisive than the functional one, i.e. we are interested in seeing how detected partitions are embedded in the functional one.
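As an illustration of the comparison just described, the sketch below assumes the standard pair-counting definition of the asymmetric Wallace coefficient: the fraction of node pairs placed in the same module by partition A that are also placed together by partition B. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def asymmetric_wallace(part_a, part_b):
    """AW_{A,B}: share of pairs co-clustered in A that are also co-clustered in B.

    part_a, part_b: {node: label}; only nodes present in both are compared.
    """
    def pairs(n):
        return n * (n - 1) // 2              # number of unordered pairs among n nodes

    nodes = part_a.keys() & part_b.keys()
    sizes_a = Counter(part_a[n] for n in nodes)
    overlap = Counter((part_a[n], part_b[n]) for n in nodes)

    same_in_a = sum(pairs(n) for n in sizes_a.values())
    if same_in_a == 0:
        return float("nan")                  # A has no co-clustered pair at all
    same_in_both = sum(pairs(n) for n in overlap.values())
    return same_in_both / same_in_a

# Toy example: a detected partition versus a coarser functional one
detected = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
functional = {"g1": "x", "g2": "x", "g3": "x", "g4": "y"}
aw_detected_functional = asymmetric_wallace(detected, functional)
aw_functional_detected = asymmetric_wallace(functional, detected)
```

The two directions generally differ, which is why the direction of the comparison has to be fixed when detected modules are finer than the reference categories.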
Figure 1 (bottom) shows the results for the proposed scheme. Initial results (early r) for the UU and DS networks are artificially high, because . Besides this, the plot indicates that only the partitions obtained from the DS topology are significantly similar to the functional one. In fact, beyond the initial stages of the resolution levels, the community structures of both DU and UU are far from being embedded in the functional categorization. Quite surprisingly, resolution levels with similar N_c do not entail similar AW_{A,B} values. For instance, the three topologies show at some point a plateau with N_c ≈ 200, but AW_{UU,F} ≈ 0.1, AW_{DU,F} ≈ 0.2 and AW_{DS,F} ≈ 0.5.
These results suggest that the more complete the knowledge about link attributes, the richer the representation of the mesoscale, in which different levels of topological coarse-graining can be well identified, with possible bio-dynamical implications that need to be explored.
The exhaustive search for common topological footprints and systematic differences between real systems has been an important topic in network theory since its very beginning. Along these lines, the classification of networks into families sheds light on the evolutionary principles that ultimately lead to the complex topologies that real, evolving systems like TRNs show today. In this sense, the work by Alon and coworkers constitutes a milestone.
Therefore, computing the Z score for all possible triads in a network yields a 13-dimensional vector that, when normalized, represents the so-called triad significance profile (TSP). From the analysis of the profiles of different systems, four superfamilies with common TSPs were identified: two families of non-biological networks (semantic word-adjacency maps and social systems) and two families of biological, information-processing networks.
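The construction of a TSP can be sketched as follows, assuming the customary definition in which each Z score compares the triad count in the real network with the mean and standard deviation of the same count over a randomized ensemble, and the profile is the Z vector divided by its Euclidean norm. The function names, the use of the population standard deviation and the toy counts (three triads instead of thirteen) are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def z_scores(real_counts, random_counts):
    """Z score per triad; None where the null ensemble shows no variation."""
    scores = []
    for i, n_real in enumerate(real_counts):
        sample = [counts[i] for counts in random_counts]
        spread = pstdev(sample)              # assumption: population std. dev.
        if spread > 0:
            scores.append((n_real - mean(sample)) / spread)
        else:
            scores.append(None)              # Z is undefined for this triad
    return scores

def significance_profile(scores):
    """Normalize the defined Z scores to unit Euclidean length."""
    norm = math.sqrt(sum(z * z for z in scores if z is not None))
    if norm == 0:
        return list(scores)
    return [None if z is None else z / norm for z in scores]

# Toy example: counts in the real network and in three randomized networks
real = [10, 0, 5]
randomized = [[4, 0, 5], [6, 0, 7], [5, 0, 3]]
zs = z_scores(real, randomized)
tsp = significance_profile(zs)
```

Returning None rather than zero for a triad whose count never varies in the null ensemble keeps undefined entries distinguishable from triads that are simply not significant.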
The biological interpretation of the emergence of the two superfamilies of TRNs (or, more generally, of bio-information processing networks) proposed in  has to do with the typical response times of each group of systems. These times are similar to those of single interactions for the networks in the first group (rate-limited networks), but remarkably greater than the characteristic interaction times for the systems within the second superfamily (unrate-limited networks).
The recent addition of the TRN of M. tuberculosis to this scheme poses an intriguing question. As is visible to the naked eye in Figure 2 (panel B), its TSP, although belonging to a unicellular organism, has a greater correlation with the representative of the unrate-limited superfamily. The fact that the TRN of M. tuberculosis has these developmental-like topological features might be interpreted within a coherent biological picture. The pathogen has an evolutionary history tightly bound to its condition as an obligate intracellular parasite of humans, which could eventually have caused the bacterium to adapt to the rhythms and response dynamics of host cells. Indeed, certain stimuli, like hypoxia, yield anomalously slow shifts in Mycobacterium tuberculosis gene expression patterns, which can take as much as 80 days to stabilize.
The third panel in Figure 2, however, invalidates the previous hypothesis. It presents the TSPs of the updated TRNs of two bacteria that were initially characterized as rate-limited according to their TSPs. As is visible at a glance, the update of the datasets has shifted their TSPs from one superfamily to the other, in a way that suggests that the division of the information-processing networks into two groups was an effect of data incompleteness.
The key to the change observed in the TSPs lies in the small number of two-node feedback loops observed in the TRNs of unicellular organisms. Indeed, this possibility was already foreseen in  (see footnote 12 there). When feedback loops are completely absent from the system under study, they will also be absent from the null ensemble, because the randomizing algorithm preserves their number. This situation makes the Z scores associated with triads 4, 5, 6, 9, 10, 11, 12 and 13 undefined, as in Figure 2, panel A. As time has gone by, such cases have become obsolete: new links have been discovered and added to the growing datasets, and some of them generate feedback loops, which are now present in the triads listed above. In the three updated systems studied, we have found as many as 12 feedback loops in the E. coli TRN, 9 in B. subtilis and 6 in M. tuberculosis. As a result of the incorporation of these new feedback loops, the division into two superfamilies of biological information-processing networks according to their TSPs disappears, which affects the biological interpretation of the possible relationship between response times and motif statistics.
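Counting the two-node feedback loops discussed above is straightforward once the network is available as a list of directed links. The sketch below treats a two-node feedback loop as a pair of distinct genes that regulate each other, ignoring self-regulation; the function name and gene labels are hypothetical.

```python
def two_node_feedback_loops(edges):
    """Return the sorted list of unordered pairs (a, b) with both a->b and b->a."""
    # Self-loops (autoregulation) are excluded: they involve a single node
    directed = {(src, dst) for src, dst in edges if src != dst}
    mutual = {tuple(sorted(link)) for link in directed
              if (link[1], link[0]) in directed}
    return sorted(mutual)

# Toy example: one mutual regulation, one one-way link and one autoregulation
toy_links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "c")]
loops = two_node_feedback_loops(toy_links)
```

If the returned list is empty, a degree-preserving randomization that also preserves the number of mutual links cannot create any, which is the situation in which the corresponding triad Z scores are undefined.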
Beyond the discussion on the robustness of motif statistics, which is addressed here in a spirit similar to that of previous works, much has been written about the potentially deep biological implications of anomalous network motif statistics as a ubiquitous topological property of gene regulatory networks. On the one hand, environmental evolutionary adaptation has been claimed to underlie this ubiquitous topological trait of gene regulatory circuits. According to this point of view, different environmental requirements could exert different evolutionary pressures on gene expression dynamics, which may be correlated with network topology at the level of motifs, each of which is believed to offer a different dynamical performance, as has been observed in several specific cases [34, 37–39]. Complementarily, recent theoretical studies have addressed how functional, artificial networks required to perform different dynamical functions yield divergent motif contents.
However, as has been stressed in several works, evolutionary pressures are not the only mechanism able to generate non-random statistics of network motifs. Simple models incorporating the spatial distribution of nodes, or typical mechanisms of network growth comparable to those that drive gene-regulatory changes over evolutionary time, have been found to generate network motifs without any evolutionary pressure. Under this kind of interpretation, network motifs could appear not as a consequence of environmental adaptation but rather as a side effect of some “intrinsic constraints” related to typical mechanisms of genetic material transformation, such as duplication, deletion or inversion of DNA fragments. In support of this hypothesis, a simple but powerful argument is often put forward: a topological bias at the level of TR networks could hardly be a consequence of dynamics-based natural selection, since in a vast number of cases transcriptional regulatory mechanisms constitute only one layer of more complex regulatory pathways that also couple translational and post-translational interactions, which are ultimately responsible for the complex dynamical patterns of gene expression observed in the cell. However, comparisons between motifs in the gene regulatory networks of different bacteria, which should have been subject to entirely comparable “intrinsic constraints”, yield slight “fine-tuning” differences in motif statistics that can reasonably be related to environmental adaptation.
The present work does not intend to introduce any additional argument into this debate, which can hardly be considered closed. The reason may be that, as has been pointed out elsewhere, intrinsic constraints and evolutionary pressures are definitely not mutually exclusive mechanisms of network transformation, and quantifying the relative relevance of each mechanism could prove an even harder task. Our main purpose in this section is, rather, to increase our understanding of the robustness of certain topological traits of gene regulatory networks against data incompleteness, as well as to warn about how this analysis affects the network taxonomy scheme proposed in .
Experimental techniques used in the inference of transcriptional regulation are numerous and often subtle. However, the usual approaches can be grouped into two main categories. The first approach is based on the explicit detection of the physical protein-DNA interaction between regulators and the promoters of target genes. This has the advantage that only direct actions of regulators on targets can be observed. However, the existence of a protein-DNA interaction under certain in vitro conditions does not guarantee that it is physiologically relevant in terms of target expression levels.
The alternative approach is essentially based on the generation of mutant strains in which the functionality and/or the expression levels of a certain binding factor are significantly altered with respect to those of the wild type. Then, expression levels of genes which are potentially regulated by the binding factor under study are registered and compared between wild type and mutant strains. In this way, if these different levels of regulator activity yield significantly different target expression measures, one might assume that the regulator is actually acting on the target.
Table. Well and poorly characterized links: statistics of suspicious links. Null-hypothesis (H0) p-values: < 10^−17, < 0.007, < 0.019.
This indicates that suspicious links constitute a topologically defined subset of interactions which is systematically and significantly less reliably characterized than the average in all the systems under study. This observation is in agreement with the hypothesis that insufficient experimental methods of transcriptional regulation inference can lead to the systematic observation of topologically biased spurious links. The problem addressed here seems to critically affect the characterization of the activity of sigma factors. In fact, when we reconstruct the networks under study by considering only transcription factors as regulators and excluding sigma factors, the whole picture changes significantly. Indeed, the percentage of suspicious links that are well characterized is then even greater than the background, both for B. subtilis (45.0% vs 42.2%) and for E. coli (46.0% vs 43.1%). For M. tuberculosis, the analysis can hardly be conclusive because of the loss of statistics after the removal of sigma factors (no well characterized link is located within the set of suspicious interactions, which now number fewer than 100 in the whole signed network). These findings, taken together, suggest that the characterization of sigma factor regulons is more sensitive to the aforementioned issues.
Table. Variation in Z scores: −4.7 ± 0.2; −6.9 ± 0.4; −2.2 ± 0.1; −2.9 ± 0.4; −6.9 ± 0.8; −1.1 ± 0.3; 4.7 ± 0.2; 6.9 ± 0.4; 2.2 ± 0.1; 2.5 ± 0.7; 7.0 ± 0.8; 1.1 ± 0.3.
As we have shown here, sources of unreliability can be of diverse nature: from the often unjustified lack of detail in link attributes to the lack of key interactions, whose inclusion radically modifies the TSPs. Our first finding shows that data incompleteness can exert a relevant influence on the topological characterization of the mesoscale in prokaryotic TRNs. More precisely, we have shown how a complete knowledge of link attributes (directions and signs) can yield richer mesoscale structures in TRNs. Secondly, we have shown that a mere update of the interactions that make up a TRN, in which key regulatory interactions are incorporated, radically modifies previous results based on the analysis of motif occurrences. In fact, some of the previous conclusions no longer hold. We have observed that prokaryotic TRNs show motif significance profiles very similar to those of multicellular, developmental TRNs, signal transduction and neural systems. Finally, the experimental mischaracterization of links has also been studied, and we have found that its influence on motif statistics is limited. These results suggest that the evolutionary interplay between topology and dynamics is more similar between the regulatory systems of multicellular and unicellular organisms than expected.
Transcriptional regulatory networks have been increasingly studied during the last several years. At present, however, their characterization can only be considered provisional, as they consist of incomplete annotations of often heterogeneous and unreliable experimental evidence, computational inferences and theoretical predictions. While working with still incomplete networks can be of valuable help in uncovering unknown biochemical pathways, there are situations in which reliable conclusions cannot be obtained. Moreover, we do not even know when the latter is the case. Accuracy and robustness therefore require us to be able to assess which results depend on the noisy and uncertain nature of some annotated links. This is crucial if deep biological implications are to be claimed.
We find the community structure of the networks studied using the modularity concept introduced by Newman. To perform these costly calculations we have used a mixture of heuristics, including extremal optimization and Newman’s fast algorithm, as implemented in . The statistical significance of motifs has been calculated as is customarily done [6, 7]. Finally, for an exhaustive list of the experimental methods that have been categorized into different groups, see http://cosnet.bifi.es/researchlines/systems-biology/data.
We would like to thank José Alberto Carrodeguas and Alejandra Nelo for helpful comments. This work has been partially supported by MICINN through Grants FIS2008-01240, FIS2009-13364-C02-01, and FIS2011-25167 and by Comunidad de Aragón (Spain) through a grant to FENOL group.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.