NetGen: a novel network-based probabilistic generative model for gene set functional enrichment analysis

Background High-throughput experimental techniques have been dramatically improved and widely applied in the past decades. However, biological interpretation of the high-throughput experimental results, such as differential expression gene sets derived from microarray or RNA-seq experiments, is still a challenging task. Gene Ontology (GO) is commonly used in the functional enrichment studies. The GO terms identified via current functional enrichment analysis tools often contain direct parent or descendant terms in the GO hierarchical structure. Highly redundant terms make users difficult to analyze the underlying biological processes. Results In this paper, a novel network-based probabilistic generative model, NetGen, was proposed to perform the functional enrichment analysis. An additional protein-protein interaction (PPI) network was explicitly used to assist the identification of significantly enriched GO terms. NetGen achieved a superior performance than the existing methods in the simulation studies. The effectiveness of NetGen was explored further on four real datasets. Notably, several GO terms which were not directly linked with the active gene list for each disease were identified. These terms were closely related to the corresponding diseases when accessed to the curated literatures. NetGen has been implemented in the R package CopTea publicly available at GitHub (http://github.com/wulingyun/CopTea/). Conclusion Our procedure leads to a more reasonable and interpretable result of the functional enrichment analysis. As a novel term combination-based functional enrichment analysis method, NetGen is complementary to current individual term-based methods, and can help to explore the underlying pathogenesis of complex diseases. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0456-7) contains supplementary material, which is available to authorized users.


Classification of different methods for enrichment analysis
Basing on the model input (type of gene list: gene set only or full gene list with score) and the output (evaluation pattern of the identified GO terms: single term or term combination), the current functional enrichment analysis methods can be briefly categorized into three classes (Table S1, Class I-III).

Parameter sensitivity analysis 2.1 Workflow
Based on the original assumption of our network-based generative model, we have " > $ ≫ . However, the true parameter combination for generating active gene list is largely unknown, and may be related to the studied biological problem. The inappropriate selection of solving parameters may affect the performance of enrichment analysis. Therefore, a parameter sensitivity analysis, performed on biological process (BP) domain, was executed first to test the robustness of each model parameter.
For each model parameter combination ( " , $ , ) in sensitivity analysis, we used a preset generating values to simulate an active gene list, then a wide-range solving parameter values were used to test the robustness of method. The preset generating and solving values for each model parameter are shown in Table S2.  In our generative model, parameter stands for the influence of noise and uncontrollable error in experiment. Since the number of human genes is four orders of magnitude, we first chose = 0.001 by experience, which resulted in that the number of active genes selected due to noise was excusable, and at most one order of magnitude. Besides, we also chose = 0.01 to obtain a more comprehensive analysis result. We test the performance of enrichment analysis methods for using = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2 in the solving procedures.
Parameter " is closely related with the coverage of active terms, i.e. the proportion of active genes annotated by the active terms. We have " ≥ 0.5 in enrichment analysis by experience. Here, we selected " = 0.8 and " = 0.5 as preset values of generating parameter, and performed sensitivity analysis for using " = 0.1, 0.3, 0.5, 0.7, 0.9 in the solving precedures.
Parameter $ is the probability of peripheral gene being activated via network propagation. We selected $ = 0.3 and $ = 0.1 as preset values of generating parameter, and performed sensitivity analysis for using $ = 0.1, 0.3, 0.5, 0.7, 0.9 in the solving procedures.
Parameter is a positive number to balance the log-likelihood and the penalization on size of active term set. A larger makes the model prone to select a fewer number of terms. Here, we set default value = 3 as recommended in [1], and performed the parameter sensitivity analysis for using = 0.1, 1, 3, 10, 100, 1000 in the solving procedures.
We used " = 0.8, $ = 0.3, = 0.001 as a default generating parameter combination. For each alternative value of the corresponding parameter, we just replaced the related value and kept other parameters unchanged. The whole workflow of parameter sensitivity analysis is as follows (for clarity, we illustrate the sensitivity analysis of generating parameter " = 0.5): 1. We restricted the terms in biological process domain, with number of covered gene between 2 and 500, and then randomly selected 500 terms from this refined term set to obtain an annotation set. 2. For the above annotation set, we randomly selected 5 biological process terms 20 times as the target active term set. For each target active term set, we generated an active gene list using " = 0.5, $ = 0.3, = 0.001. 3. The above 20 active gene lists were the model input. We used " = 0.1, 0.3, 0.5, 0.7, 0.9 as model solving parameter values, and kept other parameters as $ = 0.3, = 0.001, = 3. 4. For each value of parameter " , the 20 model outputs were combined to obtain a 2×2 contingency table. Besides, the Bonferroni corrected hypergeometric test p-values were used as the significant scores for these output terms. 5. The area under the precision-recall (AUPR) was computed for each value of parameter " . 6. The above procedure was repeated 10 times and the averaged AUPR was returned for each value of parameter " . In the sensitivity analysis, we used pr.curve function provided in the R package PRROC (Version: 1.1 from https://cran.r-project.org/web/packages/PRROC/index.html) to compute the AUPR.

Results
Follow the parameter sensitivity analysis procedure introduced in the previous section, the sensitivity analysis results are shown in Figure S1-S4.    From the results we can see that, NetGen, which consistently maintained a high-level average AUPR and had a lower variance than that of GenGO, was not sensitive to the selection of solving parameter " ( Figure S1). However, the average AUPR of GenGO declined when the solving parameter " decreasing. One explicable reason may be that the existence of peripheral genes perturbs the GenGO model, whereas in NetGen model the peripheral genes may assist to identify the true functional terms and counteract the deviation caused by uncertain " . Therefore, the network information can help to improve the accuracy of enrichment analysis. The overall results on samples generated by " = 0.8 are better than that by " = 0.5. This is expected since larger " indicates smaller uncertainty.
Parameter $ is the probability of peripheral genes being activated via network propagation, which distinguishes NetGen with other related methods. However, overemphasizing the network information may counter-productive sometimes. Specifically, excessive peripheral genes did confuse the selection of true active terms. The performance of NetGen declined sharply and a larger variance was observed, when the solving parameter was far away from the preset generating value ( Figure S2). On the other hand, the maximal distance between solving parameter and the preset true value, to achieve a tolerable performance (average AUPR > 0.95), was roughly about 0.2. This hysteresis can offset the high sensitivity of the inappropriate selection of $ . The average AUPR maintained a high-level when $ was relatively small. Therefore, a relatively conserved strategy was adopted on the selection of $ for real data applications. We fixed $ = 0.1 or $ = 0.05 in the mixed parameter selection strategy as described in the main text. The result of GenGO was a straight line since GenGO itself is irrelevant with $ . It is also expected that the result of GenGO on samples generated by $ = 0.3 is much worse than that by $ = 0.1.
Parameter stands for the influence of noise and uncontrollable error in experiment. NetGen performed well when the solving parameter value was around the preset generating value of , with a lower variance ( Figure S3). Similar to the situation in sensitivity analysis of $ , the curve began to decrease, when the solving parameter value exceeded a tolerable lag. On the contrary, the performance of GenGO did not meet the optimal value of and had a continued growth. One reason may be that the active genes generated via network propagation are regarded as noise in GenGO model, which also brings a larger variance. Parameter is a positive number to balance the log-likelihood and the size penalization on the active term set. A larger makes the model prone to select a smaller number of terms. NetGen was not sensitive to the selection of on both the average AUPR and the average selected term number ( Figure S4). Besides, it seems that GenGO prone to identify several redundant terms when = 3. The inferior performance of GenGO may partly be explained by the inappropriate selection of hyperparameter .
In conclusion, NetGen showed an approximately stable performance to the selection of model parameters. The performance of NetGen was preferable when the model parameters satisfying " ≫ $ ≫ . According to the results of sensitivity analysis, we further designed a mixed parameter selection strategy (see Methods in main text), to fit the practical active gene list in real datasets.

Simulation results on other domain
The simulation studies were executed to compare the performance of NetGen with other existing methods. The detailed description of the simulation study can be found in main text. Here, the results on cellular component domain and molecular function domain were shown in Figure S5-S6, and the result on biological process was introduced in Figure 3 in the main text. The results showed that NetGen outperformed other alternative methods on both three domains.

Alternative simulation results
In our work, in addition to the simulation procedure as introduced in the main text, we also simulated the circumstance that the active gene lists were unrelated to the network information. Namely, the active gene lists were simulated under the assumption of the GenGO model. The related results can be found in Figure S7-S9.   From the results we can see that the performance of NetGen was satisfactory on both three domains, when the network propagation parameter $ is set small enough ( Figure S7-S9 C). From a holistic perspective, NetGen achieved a comparable performance with GenGO on the molecular function domain, even if $ = 0.3 ( Figure S9 A and D). This may be directly related with the structure of molecular function domain itself. Not surprisingly, the worst performance occurred when turning down the proportion of the active core genes but keeping a large network propagation parameter $ . This indeed overly amplified the role of biological network ( Figure S7-S9 B), and was absolutely not in conformity with the assumption of GenGO model. On the other hand, overemphasizing both the effect of experiment noisy ( ) and network propagation ( $ ) made NetGen perform not very well ( Figure S7-S8 D). Another notable result was that the lower recall of NetGen, which was related to the small number of identified terms. NetGen cautiously predict candidate terms with a limited number, in order to keep a higher precision. In conclusion, the performance of NetGen is closely related to its particular parameter $ . According to the sensitivity analysis results ( Figure S2), NetGen performed very well even with a $ smaller than its true value for generating data. Therefore, we suggested a relatively conserved strategy on the selection of $ in practical applications, that is, using a relative small value close to zero, for example 0.05.

The description of microarray expression datasets and GO annotation data 5.1 microarray expression datasets
In our work, four microarray gene expression datasets of human complex diseases were selected from the Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/, accession number GSE4115, GSE11223, GSE9750, GSE36895, respectively), for real datasets analyses. The selection of microarray expression profiles is based on the following criteria: 1) Homo sapiens organism disease.
2) Published (submission date) in recent ten years.
3) Hybridized using Affymetrix Human Genome U133A array or Affymetrix Human Genome U133 Plus 2.0 Array is preferred. 4) A balanced number of case and control samples and the total number is at least 50. 5) Described in or be used by at least one high quality journal paper. For lung cancer dataset (GSE4115), we combined the original primary and prospective datasets, which made a total of 97 and 90 smokers with and without lung cancer, respectively. For ulcerative colitis dataset (GSE11223), we only used the uninflamed samples in each cohort, which made a 66 ulcerative colitis patients and 69 healthy control donors. As for renal cell carcinoma (GSE36895), the paired expression profiles of 23 clear-cell RCC patients and their related normal cortex were used for further analysis. For all expression datasets, we averaged the expression values of the probes mapping on the same gene. The summaries of the detailed processed datasets are shown in Table S3.

GO annotation data
The GO annotation was extracted from R package org.Hs.eg.db in Bioconductor project. The detailed information about the GO annotation data was summarized in Table S4.

Active gene lists used in the real data applications
To obtain the active gene list of each microarray expression dataset, student's t-test was performed for each gene on the disease and the control cases. We sorted the microarray genes by ascending order of the t-test p-values. The top 100 genes were selected as the differential expression genes, which was then overlapped with the annotated genes. The annotated differential expression genes were used as the final active gene list to perform the functional enrichment analysis. For each dataset, the annotated differential expression gene set was listed as follows:

Mixed parameter selection strategy on simulation studies
In real applications, the mixed parameter selection strategy was designed to fit the generating parameters that derive the related active gene list. In this section, we want to test the effectiveness of the mixed parameter selection strategy on the simulation studies.
The workflow of this test is similar to the procedures of simulation studies as introduced in the main text. To execute the mixed parameter selection strategy, each type of the solving parameter combinations was used in the step3 of the simulation study procedure. According to the parameter sensitivity analysis result, we used the alternative value of parameter $ = 0.1 and $ = 0.05. The following measurements were used to evaluate the performance of different solving parameter combination.
1) The frequency of the term combinations with the lowest Fisher's exact test pvalue; 2) The overall AUPR of the term combinations. The final results were shown in Figure S10-S11.  First, the results in Figure S11 indicate that the generating parameters are not necessarily the best solving parameters. We can observe that the trends are similar in each panel for different generating parameters. The parameter setting " = 0.8, $ = 0.05, = 0.01 achieves the best AUPR in all cases. On the other hand, the performances of different solving parameters are very close, which also confirms that the parameters of NetGen are quite robust.
Second, the patterns of Figure S10 and S11 are very different. Under the same parameter combination of " and $ , a relatively larger is prone to decrease the Fisher's exact test p value of the found term set. However, a relatively larger often has better performance in terms of the overall AUPR. This phenomenon implies the Fisher's exact test p-value maybe not a good indicator of the term combination's quality.
Selecting an appropriate solving parameter in real application is very difficult. There may not exist the optimal solving parameter combination. Therefore, we can use the mixed parameter selection strategy to produce multiple solutions in real applications, which offers more information of the underlying biological processes for the downstream analysis.