Module-based subnetwork alignments reveal novel transcriptional regulators in malaria parasite Plasmodium falciparum

Background Malaria causes over one million deaths annually, posing an enormous health and economic burden in endemic regions. The completion of genome sequencing of the causative agents, a group of parasites in the genus Plasmodium, revealed potential drug and vaccine candidates. However, genomics-driven target discovery has been significantly hampered by our limited knowledge of the cellular networks associated with parasite development and pathogenesis. In this paper, we propose an approach based on aligning neighborhood PPI subnetworks across species to identify network components in the malaria parasite P. falciparum. Results Instead of only relying on sequence similarities to detect functional orthologs, our approach measures the conservation between the neighborhood subnetworks in protein-protein interaction (PPI) networks in two species, P. falciparum and E. coli. 1,082 P. falciparum proteins were predicted as functional orthologs of known transcriptional regulators in the E. coli network, including general transcriptional regulators, parasite-specific transcriptional regulators in the ApiAP2 protein family, and other potential regulatory proteins. They are implicated in a variety of cellular processes involving chromatin remodeling, genome integrity, secretion, invasion, protein processing, and metabolism. Conclusions In this proof-of-concept study, we demonstrate that a subnetwork alignment approach can reveal previously uncharacterized members of the subnetworks, which opens new opportunities to identify potential therapeutic targets and provide new insights into parasite biology, pathogenesis and virulence. This approach can be extended to other systems, especially those with poor genome annotation and a paucity of knowledge about cellular networks.


Conclusions:
In this proof-of-concept study, we demonstrate that a subnetwork alignment approach can reveal previously uncharacterized members of the subnetworks, which opens new opportunities to identify potential therapeutic targets and provide new insights into parasite biology, pathogenesis and virulence. This approach can be extended to other systems, especially those with poor genome annotation and a paucity of knowledge about cellular networks.

Background
Malaria is a major threat to public health and economic development in endemic regions. About 300-500 million cases are reported, and 1-2 million people die from malaria every year. Children and pregnant women are among the hardest hit of malaria victims. Five parasite species, P. falciparum, P. vivax, P. malariae, P. ovale, and P. knowlesi, cause human malaria. P. falciparum is the most virulent and widespread one.
As a proof-of-concept, we chose to predict the components in the transcriptional regulation network in P. falciparum. It was chosen because: (1) parasite employs exquisite regulatory machinery on gene expression to assure timely and accurate coordination on parasite growth, development, infection, and virulence. (2) Very little is known about the components, dynamics, and design principles of this network. New discoveries of network components could significantly fill our knowledge gaps and possibly lead to new short lists of proteins that are poorly understood and poorly annotated for functional characterization. The correspondent network used was from Escherichia coli. Detection of network similarities among Eukaryotes and among Prokaryotes have been demonstrated [73,78], but detection of similarities between these two groups is a more challenging problem. The ability to make comparisons across such a wide phylogenetic gap means, firstly, that evolutionarily conserved (and therefore significant) subnetworks can be detected and, secondly, that it is possible to search beyond more closely related strains. This is especially significant in cases like P. falciparum, where the immediate relatives reveal comparatively little about its functional subsystems.

Results and discussion
Module-based subnetwork alignments predicted 1,082 components in transcriptional regulation network in P. falciparum It is a common belief that the malaria parasite possesses a complex and orchestrated transcriptional regulatory system [79,80]. However, only a small number of transcriptional regulators have been identified, including a conserved set of basic transcription factors [81] and those predicted based on parasite developmental microarray expression profiles and motif analysis [82][83][84]. A recent study by Bischoff and Vaquero [85] combining literature searches, motif finding, and transcriptomic, proteomic, and interactomic analyses expanded this list to include proteins related to chromatin functions and remodeling.
Our functional module-based subnetwork alignments predicted that 1,082 P. falciparum proteins were functional orthologs of known transcriptional regulators in the E. coli network (Additional file 1). 37% of these predicted functional orthologs appeared to be "putative uncharacterized proteins" or "conserved Plasmodium proteins" of unknown function. This is in agreement with the fact that 10 years after the completion of genome sequencing, the proportion of ORFs with no functional assignment has only been reduced to 45% [86]. Functional enrichment analysis [87] revealed that 31 Gene Ontology (GO) terms were over-represented (p < 0.05), including those processes that are well known to be associated with transcriptional regulation such as proteolysis [72], response to stimulus, and proteasome activity ( Figure 1).

General transcriptional regulators
The predicted functional orthologs include several general transcriptional regulators ( Table 1) that are commonly present in a wide variety of species. The first is basic transcription factor 3b (Accession number PF14_0241). It was found via yeast 2-hybrid (Y2H) analysis [57] to have a direct PPI with a nascent polypeptide associated complex α chain protein (PFF1050w), the erythrocyte binding antigen-181(EBA-181, PFA0125c), and a putative coronin binding protein (PFF1110c), suggesting that it may be involved in protein folding, immune evasion, and cellular actin dynamics ( Figure 2). The second is a putative CCAAT-binding transcription factor (PF14_0374). A Y2H assay [57] showed that it had PPIs with six proteins. Two of these proteins are likely involved in global transcription, including (a) a putative NOT1 protein (PF11_0049). Proteins in the NOT1 family were shown to regulate the activity of general transcription factor TFIID [88]; and (b) a conserved Plasmodium protein (PF14_0603) that has a functional domain RPC4 which comprises a subunit of the tRNA specific polymerase RNA Pol III. The third interacting protein is a merozoite surface protein 7 (MSP7) precursor (PF13_0197), which is a regulator of parasite growth and a surface antigen regarded as a potential vaccine target [89]. The fourth protein associated with PF14_0374 is a conserved Plasmodium membrane protein (PF14_0315) that contains two FYVE/PHD zinc fingers for binding to potential target molecules. The remaining two proteins associated with PF14_0374 are 40S ribosomal proteins S10 (PF07_0080) and S20e (PF10_0038), indicating A putative YL1 nuclear protein (PF14_0608) was predicted to be a transcriptional regulator. It has two functional domains YL1 (Pfam accession PF05764) and YL1 C-terminal domain (PF08265), both of which are typical DNA binding domains. This protein may be related to chromatin remodeling. In addition, a Y2H assay using this protein as a bait pulled out a chloroquine resistance marker protein (PF14_0463) (Figure 2).

Apicomplexan-specific ApiAP2 transcriptional regulators
Most interestingly, our subnetwork alignments also predicted 11 putative transcriptional regulators belonging to the Apicomplexan-specific ApiAP2 family. A characteristic feature of this family is the presence of the Apetala2 (AP2) domain. AP2 transcription factors play a pivotal role in floral development in plants [90]. The recent discovery of AP2 in the Apicomplexa, the phylum to which malaria parasites belong, suggested that the ApiAP2 proteins were derived from bacteria or the apicoplast progenitor via transponsons, followed by lineage-specific radiation [91]. These ApiAP2 proteins, in addition to regulating heterochromatin formation and genome integrity, may develop novel parasitespecific functions such as antigenic variation, invasion, and sporozoite development [92][93][94][95]. P. falciparum possesses 27 ApiAP2 members. Among the 11 ApiAP2 proteins predicted by our network alignments, five contain a single AP2 domain, four contain two AP2 domains, and two contain three AP2 domains ( Figure 3). Analyzing the protein-protein association data from the STRING database [4], in conjunction with the data from the Y2H assays, temporal microarray experiments, proteomics, and literature, revealed that these 11 ApiAP2 proteins are associated with 1-17 proteins in the cellular networks ( Figure 4 and Additional File 2). At least four ApiAP2 proteins (PF07_0126, PFD0985w, PF11_0404 and PF10_0075) have PPIs, suggesting that they play central role in transcriptional regulation. The ApiAP2 protein with highest connectivity is PFD0985w, which has 17 interaction partners ( Figure 4). It has direct physical interactions with two other ApiAP2 proteins (PF07_0126 and MAL8P1.153). It is associated with a nucleosome assembly protein (PFI0930c) that is implicated in chromatin remodeling, and a putative Ndc80 homolog (PFF0785w) that may be a component of the mitotic spindle related to chromosome segregation. It is also associated with three surface antigens including a reticulocyte binding protein 2 homologue a (PF13_0198) which may play a role in determining host-cell invasion specificity [96], an antigen 332 (PF11_0506) in the Duffy binding-like (DBL) protein family which may be related to parasite entry to the host, and an asparagine-rich antigen (PF08_0060). This ApiAP2 protein PFD0985w also appeared to be related to a number of secreted proteins including a putative secreted ookinete protein (PFA0430c), and two proteins that are associated with Maurer's clefts [97], parasite-derived membranous structures within the host cell cytoplasm [PfSec31(PFB0640c), which is a COPII- coated vesicle component and PHISTb (PFD0080c)]. In addition, PFD0985w has direct PPIs with the 26S proteasome AAA-ATPase subunit RPT3 (PFD0665c), which is a component in ubiquitin-proteasome system for protein degradation, and pyruvate kinase (PFF1300w), an essential enzyme for glycolysis.
The ApiAP2 protein with second largest connectivity is PF07_0126. It has 15 PPI partners (Figure 4) that can be divided into five categories: (1) transcriptional regulation. It is associated with two otherApiAP2 proteins (PFD0985w and PFF0200c), and a CCAAT-box DNA binding protein subunit B (PF11_0477); (2) epigenetic regulation. It is associated with PfHMGB2 (MAL8P1.72), which has a DNAbinding domain: HMG-box (High Mobility Group box). The proteins in this family have been implicated in regulation of transcription, replication, repair, and chromatin remodeling; (3) signaling. PF07_0126 is associated with at least three putative signaling proteins, including (a) Figure 3 Phylogenetic tree of the ApiAP2 transcriptional regulator family in P. falciparum. The tree was constructed using the neighborjoining method [120]. 11 out of the 27 members were predicted by the subnetwork alignment algorithm. •: ApiAP2 protein with 1 AP2 domain ▲: ApiAP2 protein with 2 AP2 domains; ■: ApiAP2 protein with 3 AP2 domains.
PF13_0042, which contains a forkhead-associated domain that is found in a variety of regulatory proteins involved in signaling. (b) a calcium/calmodulin-dependent protein kinase (PF11_0060) that is implicated in signaling cascades. (c) a putative 14-3-3 protein (MAL8P1.69). Proteins in 14-3-3 family include regulatory ligands to various signaling molecules such as kinases and receptors; (4) surface antigens for cell adhesion and entry to the host. PF07_0126 is associated with a Duffy binding-like antigen 332 (PF11_0506), an erythrocyte membrane-associated antigen (PFD1045c), and a QF122 antigen (PF10_0115) with an RNA-binding motif; (5) metabolism. The glycolytic enzyme fructose-bisphosphate aldolase (PF14_0425) is associated with the ApiAP2 protein PF07_0126.
The role of ApiAP2 proteins in transcriptional and epigenetic regulation is also indicated by a direct PPI between a putative ApiAP2 with 3 AP2 domains PF10_0075 and a histone acetyltransferase GCN5 (PF08_0034), an enzyme for histone modification and chromatin remodeling [98]. This ApiAP2 protein may also been involved in the regulation of genome integrity through a PPI with a DNA repair protein rhp16 (PFL2440w), and cytoskeleton organization of actin (Figure 4).
Two of these 11 ApiAP2 proteins have been experimentally characterized to some extent: (1) the crystal structure of the AP2 domain of PF14_0633 has been determined, revealing a multiple-site binding pattern [99], and gene disruption assays showed that its ortholog in the rodent parasite P. berghei was an indispensible regulator for sporozoite development in the mosquito stage [94]. However, its regulatory roles and targets remain uncharacterized in P. falciparum. As shown in Figure 4, it has only two direct PPIs revealed by Y2H assays [57]: the first is a ribosomal protein P0, and the second protein PTEX150 (PF14_0344) is an important component in a translocon of exported proteins (PTEX). Located in the vacuole membrane, PTEX was recently discovered as a novel ATP-dependent protein trafficking machinery [100]. Notably, PTEX150 is only present in the Plasmodium genus. The PPI between PTEX150 and ApiAP2 suggests that this export machinery may have parasite-specific regulation. PTEX is becoming an attractive therapeutic target due to its importance to virulence and parasite survival and its distant evolutionary relatedness to the human host. (2) PF11_0442. Its counterpart in P. berghei is a transcription factor that regulates ookinetespecific gene expression for parasite invasion of the mosquito midgut. PF11_0442, however, may play a role in the red blood cell (RBC) stage: It has one PPI partner, rhoptry-associated protein 1 (RAP1, PF14_0102). RAP1 is an escort protein required to localize RAP2 to the rhoptries, apical organelles essential for RBC invasion [101].
In summary, ApiAP2 proteins are a family of stagespecific transcriptional regulators for diverse processes ranging from epigenetic modification, chromosome Node size is proportional to the degree of the node. Nodes are colored according to their functional classification in the eggNOG database [121]. The visualization is as for Figure 2. organization and dynamics, invasion, protein sorting and trafficking, protein turnover, and metabolism.

Other potentially important proteins that may be involved in transcriptional regulation
Module based subnetwork alignments predicted additional proteins that are likely involved in transcriptional regulation (Table 1). Two proteins (PFD0685c and MAL13P1.96) are members of the SMC (structural maintenance of chromosomes) superfamily; they both have a RecF/RecN/SMC N terminal domain and may be involved in chromatin cohesion and dynamics. A numbers of zinc-finger proteins were identified by network alignments as well. They exhibit different types of domain configurations, including the classical DNAbinding motif C2H2 found in transcription factors, the C3HC4 type (RING finger) typically found in proteins mediating ubiquitination, the C-x8-C-x5-C-x3-H (CCCH) type implicated in cell cycle regulation, the DHHC type found in proteins important for membrane association and trafficking, the DNL type implicated in protein translocation into mitochondria, and the CW type related to DNA-binding and protein-protein interaction. A putative transcriptional coactivator (ADA2, PF10_0143) has a ZZtype zinc finger domain. ADA2 was shown, in baker's yeast and Arabidopsis thaliana, to physically interact with GCN5, a histone acetyltransferase and a potent transcriptional activator [102,103]. The Y2H assay in P. falciparum [57] revealed that ADA2 has direct physical interactions with proteins including a minichromosome maintenance (MCM) complex subunit (PF14_0177), a pre-mRNA splicing factor (PFD0265w), a heat shock protein hsp70 interacting protein (PFE1370w), a sodium-dependent phosphate transporter (MAL13P1.206), a serine/threonine protein kinase in the FIKK family (PFA0130c), cathepsin C (PF11_0174), and a mature parasite-infected erythrocyte surface antigen (PFE0040c), suggesting its potential versatile roles in DNA replication, splicing, transport, protein processing, signal transduction, and invasion.
Other putative transcriptional regulators include PFE0870 and PF14_0170. PFE0870 contains two functional domains: a FACT complex subunit (SPT16/CDC68) domain which was reported to facilitate transcriptional initiation and interact with nucleosomes and histones [104], and a histone chaperone Rttp106-like domain. This protein may be involved in heterochromatin silencing and epigenetic regulation. PF14_0170 is a putative protein in the NOT global transcriptional regulator family. Y2H assays showed that it had direct physical interactions with CCAAT-box DNA binding protein subunit B (PF11_0477), DNA topoisomerase II (PF14_0316), and calcium dependent protein kinase 1 (PFB0815w), emphasizing its involvement in general transcriptional control and chromosome topology and signaling processes. It also has a PPI with a Pf11-1 protein (PF10_0374), which may play a role in protein trafficking processes associated with Maurer's cleft.

Conclusions
A functional-module based alignment approach was used to predict system components in transcriptional regulatory networks in malaria parasite P. falciparum. Our results predicted general transcriptional regulators that may regulate gene expression in a global or pleiotropic mode. Our results also imply a group of parasite-specific transcriptional regulators in the ApiAP2 family that play roles in diverse cellular processes ranging from chromatin remodeling, protein sorting and secretion, signal transduction, and invasion. Finally, our analysis has identified other potentially important proteins involved in transcriptional regulation. Our present knowledge of the transcriptional machinery and its regulatory capacity is rudimentary. The identification of network components in this machinery will open new avenues to the development of novel therapeutic targets and provide new insights into parasite biology, pathogenesis and virulence.
The premise of our subnetwork alignment approach is that functional annotations of the proteins can be transferred across species through conserved interactions in the aligned PPI networks. Under this framework, a priori information as to the identity or function of a gene is not necessary for the gene to be placed in a network. Thus genes identified only because of their key role in a network become potential targets. Furthermore, placement of the gene product in a systems context could, in itself, serve to identify the function of the gene product. If successfully applied, a systems biology approach circumvents the limiting factor in comparative genomics -the difficulty in obtaining reliable functional assignments.

Ortholog prediction by subnetwork querying
To predict functional orthologs for P. falciparum proteins, we formulated the problem as subnetwork querying. We first mapped the annotated E. coli transcriptional factors (GO:0003700: transcription factor activity) into the E. coli protein-protein interaction network. For each transcriptional factor, nearby neighbors were selected to form its neighborhood subnetwork. Similarly, each P. falciparum protein was mapped into the P. falciparum PPI network and a neighborhood subnetwork was built to include its nearby neighbors. Since the E. coli network and the P. falciparum network differ in size and density, the nearby neighbors were selected with a rule to control the neighborhood size. Let N k (p) denote the set of proteins that are exactly k hops from the central protein p. The neighborhood of central protein p is N(p) = N 1 (p) ∪ N 2 (p) ... ∪ N k (p) such that |N(p)| ≤ 500. Specifically, we first included the neighboring proteins that are 1 hop from the central protein. If the size of the neighborhood was less than 500, we continued to include the proteins that were 2 hops from the central protein. We kept increasing the hop distance until the neighborhood size was larger than 500. In other words, nearby proteins were selected by their distance to the central protein and the neighborhood size was kept below 500 unless the central protein has more than 500 direct neighbors in the PPI network.
After we obtained the neighborhood subnetwork for the E. coli transcriptional factors and all the P. falciparum proteins, we aligned each E. coli subnetwork against all the P. falciparum subnetworks. The central protein of the best-aligned P. falciparum subnetwork was identified as the functional ortholog of the E. coli transcriptional factor.

Aligning neighborhood subnetworks with graph kernel
To evaluate how well a P. falciparum neighborhood subnetwork aligned with an E. coli neighborhood subnetwork, we assigned a score for each possible alignment and summarized the alignment scores with a graph kernel. Graph kernels are effective approaches to measure the similarity between two labeled networks [105,106]. Given a pair of labeled graphs, a graph kernel is designed to summarize all possible isomorphic subgraphs (exact matches) in the two graphs. However, since there are an exponential number of subgraphs, it is computationally infeasible to detect all isomorphic subgraphs. A simplification is to compute the number of common paths between two graphs by a random walk on a product graph of the two compared graphs or by dynamic programming [107][108][109]. Alternatively, a graph kernel can also explicitly summarize the similarity between the shortest paths in the two graphs with each pair of shortest paths measured by a convolution kernel [110]. Since our focus is only on the paths that go through the central protein, we modified the shortest path graph kernel to only consider the paths between the central protein and the other proteins in the subnetwork. The underlying hypothesis is that each shortest path going through the central protein can characterize the functional role of the protein in the chained molecular activities along the path. As shown in Figure 5, given two subnetworks S p with central protein p and S q with central protein q, we define a simple shortest path similarity function, with the normalization parameter s = 10 measures the sequence similarity between proteins x and y based on the E-value of the sequence alignment, and dist(x, y) is the length of the shortest path connecting proteins x and y in the PPI subnetwork. Since the scores were small numbers, the computation was done in -log 10 scale. In this similarity function, we took each pair of the proteins (i1, i2) in one subnetwork and identify the (j1, j2) in the other subnetwork that gives the maximum ratio between their sequence similarity with respect to (i1, Figure 5 Computation of subnetwork alignment score. The alignment score between subnetwork S p and S q is the summation of the similarity score between all pairs of matched shortest paths ((i1, i2) and (j1, j2) in the figure), calculated based on the sequence similarities (E(i1, j1) and E(i2, j2)) and the distances in the subnetworks (dist(i1, i2) and dist(j1, j2)).
i2) and the closeness in the subnetworks. Specifically, we computed the shortest path through the central protein between all pairs of proteins in the neighborhood subnetwork. The shortest paths of two neighborhood subnetworks are then compared and scored pairwise. The total of the alignment scores was reported as the subnetwork alignment score. Our strategy is to incorporate both the sequence similarity of the proteins and the role of the central proteins in the subnetwork in the similarity measure, which summarizes the functional coherence between the two subnetworks and between the two central proteins of the two subnetworks.

Network data and analysis
The E. coli protein-protein interactions were obtained from IntAct database. IntAct database provides binary protein-protein interactions derived from literature curation or direct user submissions. The complete set of proteinprotein associations for P. falciparum was extracted from the STRING database [111]; each association between a pair of proteins has a confidence score (S) ranging from 0.15 to 0.999, based on the evidence from sequence similarity comparison, pathway (KEGG [112] and PlasmoCyc [52]) assignments, genome neighborhood analysis, phylogenetic inference, and literature co-occurrence. The associations were visualized in Cytoscape [113] and converted to an undirected weighted graph, where there is a single edge between any pair of proteins and the S value is used as the weight. The network was characterized using Net-workAnalyzer [114]. The default values were used for all three plugins. The set of proteins associated with transcriptional regulation were screened using BiNGO [115] to determine if any categories of proteins, as identified by their Gene Ontology terms, were enriched. The hypergeometric test was used with the Benjamini and Hochberg false discovery date correction. A significance level of 0.05 was selected.

Additional material
Additional file 1: Functional orthologs involved in transcriptional regulation in P. falciparum. The query genome is P. falciparum, and the target genome is E. coli. GO: Gene Ontology. BP: Biological Process. MF: Molecular Function. CC: Cellular Component.
Additional file 2: The protein-protein associations involving 11 ApiAP2 transcriptional regulators in P. falciparum.