XMRF: an R package to fit Markov Networks to highthroughput genetics data
 YingWooi Wan^{1, 2, 4},
 Genevera I. Allen^{3, 6, 4},
 Yulia Baker^{3},
 Eunho Yang^{7},
 Pradeep Ravikumar^{8},
 Matthew Anderson^{1, 5} and
 Zhandong Liu^{1, 3, 4}Email author
https://doi.org/10.1186/s1291801603130
© The Author(s) 2016
Published: 26 August 2016
Abstract
Background
Technological advances in medicine have led to a rapid proliferation of highthroughput “omics” data. Tools to mine this data and discover disrupted disease networks are needed as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epigenetic markers.
Results
We developed an R software package, XMRF, that can be used to fit Markov Networks to various types of highthroughput genomics data. Encoding the models and estimation techniques of the recently proposed exponential family Markov Random Fields (Yang et al., 2012), our software can be used to learn genetic networks from RNAsequencing data (counts via Poisson graphical models), mutation and copy number variation data (categorical via Ising models), and methylation data (continuous via Gaussian graphical models).
Conclusions
XMRF is the only tool that allows network structure learning using the native distribution of the data instead of the standard Gaussian. Moreover, the parallelization feature of the implemented algorithms computes the largescale biological networks efficiently. XMRF is available from CRAN and Github (https://github.com/zhandong/XMRF).
Keywords
Background
Markov random fields (MRFs) are a popular tool for estimating relationships between genes, finding regulatory pathways, and visually depicting genetic networks. Estimating sparse, highdimensional undirected graphical models, or Markov Networks, linking a set of p genes measured on n samples has been well studied for the Gaussian graphical model (GGM) [1] and also the binary or categorical Ising model [2]. As many highthroughput genomic data sets such as counts observed in Next Generation Sequencing data, are not approximately Gaussian or binary, existing methods for graphical models are greatly limited. To address this, a recent line of work has proposed a parametric family of graphical models based on univariate exponential family distributions with a rich theoretical foundation [3, 4]. These models can be used to estimate genetic networks from a variety of data types: gene expression networks based on RNA Sequencing via Poissonfamily graphical models, mutation and aberration networks via Ising graphical models, and epigenetic networks via Gaussian graphical models. In this paper, we introduce an R software package, XMRF, that encodes models and estimation techniques for fitting exponential family Markov Networks to highthroughput genomics data as well as software to preprocess genetic data and visualize the resulting genetic networks.
Methods
Recently, we proposed a novel class of MRF models [3] constructed by assuming that all nodeconditional distributions are univariate exponential families. These then yield a class of models appropriate for a variety of data types such as counts, categorical, continuous, and skewed continuous variables found in highthroughput genomics data. Further, we proposed a simple way of learning the network structure of these exponential family MRFs by maximizing the penalized nodeconditional likelihoods; these methods come with strong theoretical guarantees for sparse graph estimation as discussed in [3]. As the models are constructed via exponential families, the socalled neighborhood selection problems for each node reduce to that of an ℓ _{1} penalized and possibly constrained generalized linear model (GLM) [3].
Here, \(\mathcal {C}\) is the constraint region of the parameter space as discussed further for specific examples in [4]. Notice that this cooresponds to fitting a penalized GLM.
In the XMRF package, we implement the neighborhood selection graph estimation technique by proximal or projected gradient descent using warmstarts over the range of regularization parameters, λ. Note also that each nodeneighborhood problem is completely independent and can hence be computed in parallel; this is achieved using the default parallel support and the snowfall package [5] in R. The maxrule is used to construct the network from all neighborhood estimates. Selecting the sparsity of the network corresponds to selecting λ. We implement two datadriven approaches to do so: StARS which is computed over a range of λ values [6], and a stabilitybased approach for a single value of λ which for every edge, computes the proportion of times the edge is selected in the model (termed the stability score) over many bootstrap resamples [7]. Our package also implements Gibbs samplers to simulate data for our exponential family MRF models. Finally, our package includes functions for preprocessing sequencing data [8] and includes a host of network visualization and network manipulation strategies through a dependency on the igraph package [9]. Our package is developed under statistical computing environment R and compatible to be executed in version 3.0.2 or later.
Results and discussion
Recommended families to use in our XMRF package
Genetics data  Type  XMRF family  

RNASeq or miRNASeq  Counts  LPGM or SPGM  
Microarray or Methylation  Continuous  GGM  
Mutations or CNVs  Binary  ISM 
The XMRF() function will return an object of GMS class representing the fitted models. The GMS object contains the list of fitted networks, the stability of each fitted network, the full regularization path, and the index of the optimal network. The default plot method of GMS class enables drawing the learned network in graphical format and saves the output to a PDF document. The package also includes plotGML function to write the learned network in graph modeling language, which can be imported to Cytoscape [11] for further visualization customization.
Choosing the right graphical model for genomics data
The development of highthroughput technology, such as microarray, SNP array, arrayCGH, methylation array, exomesequencing, and RNAsequencing, has generated a wide variety of genetics data. Each of these genetics data varies in data types. For example, next generation sequencing (RNASeq and miRNASeq) data are read counts; expression profiles from microarray and methylation array are continuous values; mutations and CNVs are usually represented in binary, with value one represents the gene is mutated, amplified or deleted in the patient, or value zero otherwise (Table 1). To accurately estimate the underlying network structure from these data types, one needs to apply the right network inference algorithm based on the platformspecific data distribution.
In XMRF, data are modeled using their native distribution instead of normalizing the data to follow a Gaussian distribution. To accomplish this, our package implements methods for three families: Gaussian graphical models (GGM), Ising models (ISM), and Poisson family graphical models including regular Poisson graphical models (PGM) as well as several variants of the Poisson family of models such as the truncated Poisson (TPGM), sublinear Poisson (SPGM), and local Poisson (LPGM) [8]. Note that [10] proposed all these variants of the Poisson family as the regular Poisson graphical model only permits negative conditional dependencies between nodes; each of these variants relaxes restrictions resulting in both positive and negative conditional dependencies. For genomic networks based on sequencing data, we recommend using the LPGM variant as proposed in [8], noting that this local model closely approximates the proper MRF distribution of the SPGM formulation [10].
Poisson graphical model for NGS data
Count data generated by next generation sequencing (NGS) is a good example of why parametric families of Markov Networks beyond Gaussian graphical models are needed. This read count data is highly skewed and has large spikes at zero so that standardization to a Gaussian distribution is impossible. Here, we demonstrate through a real example how to process NGS data so that we can use Poisson family graphical models to learn the network structure. Our processing pipeline is given in [12] and encoded in the processSeq function of the package.
To estimate the underlying network structure of the countvalued data, XMRF implements four different models from the Poisson family graphical models: regular Poisson graphical model (PGM) that only permits negative conditional dependencies, truncated Poisson (TPGM), sublinear Poisson (SPGM), and local Poisson (LPGM) [8]. The latter three models are variants of the Poisson family that relax restrictions as imposed in a regular Poisson model, resulting in both positive and negative conditional dependencies [10]. TPGM should be used if one wants to truncate the large counts observed in NGS dataset. Alternatively, SPGM implements a sublinear truncation for the NGS data which gives a softer reduction on large counts. LPGM is a faster algorithm that approximates the Markov Network while preserving both positive and negative relationship [8].
Gaussian graphical model for expression arrays
When genomics data is profiled with microarrays, such as with mRNA arrays, miRNA arrays, or methylation arrays, Gaussian graphical models should be used to estimate the network structures. Similar workflows as presented in last section can be applied to fitting Gaussian Markov Networks to data that approximately follows a multivariate Gaussian distribution.

Obtain gene expression data for KIRC, profiled with mRNA microarray.

Obtain data for only tumor samples.

Filter genes so that the top 5 % of variable genes remain.

Use the XMRF function to learn the network structure. Note that it is always good practice to visualize the data to confirm the distributional family before model fitting. In this example as shown in Fig. 3, the data follows a Gaussian distribution and thus fitting a Gaussian Graphical model is appropriate.

Write the network in GML format and view the network via Cytoscape (Fig. 4).
Code snippets for the above workflow are provided as follows:
Ising graphical model for mutation data
To fit Markov Networks to binary data, the XMRF function with method=~ISM~ can be used. In this section, we give an example of fitting an Ising model to simulated data with a lattice graph as well as estimating interactions among mutated genes in TCGA lung squamous cell carcinoma (LUSC) samples [16].
Learning a lattice graph from simulated data
Estimate LUSC mutation networks
In this section, we estimate the relationships among mutated genes in 178 lung squamous cell carcinoma (LUSC) patients from the TCGA project [16]. We obtained the data via getTCGA from TCGA2STAT package. A total of 13655 genes for LUSC patients were obtained. Genes with a mutation rate of less than 15 % in the cohort or with an undefined gene name were filtered out before analysis. This left data with 59 genes and 179 patients. Similar to the workflow applied on simulated data presented, Ising models were fit across 20 regularization values, and the optimal network was selected from 100 iterations via the StARS approach.
Tuning parameter selection: network sparsity
Our XMRF package implements two datadriven methods to determine the sparsity of a fitted network. The first method is the stability selection over many bootstrap samples for a single regularization value [7]. The second method is the StARS selection, which is computed over a range of regularization values to select a network with the smallest regularization value that is simultaneously sparse and reproducible in random samples [6]. In this section, both of these methods are demonstrated for an example using the local Poisson graphical model (LPGM).
Stability selection
Here, we demonstrate the stability selection technique to learn the network sparsity for networks estimated via the LPGM method.
We simulate a scalefree network with 30 variables and 200 observations. We determine the network sparsity based on the stability score, which retains network edges that are estimated in more than 95 % (sth=0.95) of the 50 bootstrap repetitions (N=50). The code is given below:
StARS selection
If users want to fit a network over a series of regularization parameters instead of a single lambda as shown in last section, a numerical vector of regularization values should be given for the lambda.path parameter of the XMRF function.
Another option to study the Markov Networks over the complete regularization path is to let our XMRF method decide the path from a null model (empty network) to the full model (saturated network). In this case, the XMRF(…,method=~~,…) function will compute the maximum lambda that gives the null model and the minimum lambda that gives the full model for each of the parametric familes employed. The maximum lambda is computed based on the input data matrix, and is the maximum element from columnwise multiplication of data matrix (data matrix in n x p) normalized by the number of observations. Based on the maximum lambda value, the number of lambda (nlams) and the minimum lambda (lmin), sequence of appropriate lambda values will be computed.
Stability selection via StARS seeks to select the lambda value out of the regularization path which yields the most stable network (or, least variable to bootstrap perturbations). Specifically, the variability of each fitted network is measured based on the stability of edges inferred from the bootstrap samples. The network with the smallest penalization and variability below the user specified cutoff (beta) is selected as the final optimal network.
In the following example, we fit the XMRF(…, method=~LPGM~) to learn the same simulated scalefree network of 30 nodes from 200 observations along a path of 20 regularization parameters.
Visualization and data exportation
To enable users to visualize the inferred network in graphical form, XMRF includes three plotting functions with slight variations to serve different purposes. First, the default plot function of the GMS class will draw the optimal inferred network and save it to a PDF file with the following command:
Second, the plotNet function allows users to plot a specific network with specific layout. For example, to plot the simulated network and the inferred network in Fig. 7 with the same layout, the following commands can be used:
The third plot function allows users to view the inferred network in other graph visualizing software such as the Cytoscape. The plotGML function will write the network in the graph modeling language (GML) format which then can be imported to Cytoscape. For example, with the following command:
Conclusion
We have developed an open source R package that allows users to learn the network structure from data acquired from various highthroughput genomics technologies. Our tool is the only software that allows data to be modeled using their native distribution instead of normalizing the data to follow Gaussian distribution as most other statistical models require. In addition, the parallelization of our algorithms provides an efficient tool for computing largescale networks.
Notes
Declarations
Acknowledgements
ZL and YW are partially supported by the Houston Endowment and NSF DMS1263932; GA and YB by NSF DMS1264058 and DMS1209017; PR and EY by ARO W911NF1210390, NSF IIS1149803, IIS1320894, IIS1447574, and DMS1264033.
Declarations
The publication costs for this article were funded by the corresponding author. This article has been published as part of BMC Systems Biology Volume 10 Supplement 3, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume10supplement3.
Availability of data and materials
XMRF is available from the CRAN Project and Github at: https://github.com/zhandong/XMRF.
Authors’ contributions
YW and ZL implemented the R package. GA, PR, EY, and YB developed the theoretical foundation for XMRF. YW, ZL, and GA wrote the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Meinshausen N, Buhlmann P. Highdimensional graphs and variable selection with the lasso. Ann Stat. 2006; 34(3):1436–62.View ArticleGoogle Scholar
 Ravikumar P, Wainwright MJ, Lafferty JD. Highdimensional ising model selection using l1regularized logistic regression. Ann Stat. 2010; 38(3):1287–319.View ArticleGoogle Scholar
 Yang E, Ravikumar PD, Allen GI, Liu Z. Graphical Models via Generalized Linear Models. Adv Neural Inf Process Syst. 2012; 25:1367–1375.Google Scholar
 Yang E, Ravikumar P, Allen GI, Liu Z. Graphical models via univariate exponential family distributions. J Mach Learn Res. 2015; 16:3813–47.Google Scholar
 Knaus J. Snowfall: Easier Cluster Computing (based on Snow). R package version 1.844. 2013. http://CRAN.Rproject.org/package=snowfall. Accessed 2012.
 Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (stars) for high dimensional graphical models. Adv Neural Inf Process Syst. 2010; 24(2):1432–40.PubMedPubMed CentralGoogle Scholar
 Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Series B (Statistical Methodology). 2010; 72(4):417–73.View ArticleGoogle Scholar
 Allen GI, Liu Z. A Local Poisson Graphical Model for Inferring Networks From Sequencing Data. NanoBioscience IEEE Trans. 2013; 12(3):189–98.View ArticleGoogle Scholar
 Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006; Complex Systems:1695.Google Scholar
 Yang E, Ravikumar PD, Allen GI, Liu Z. On Poisson Graphical Models. Adv Neural Inf Process Syst. 2013;:1718–26.Google Scholar
 Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11):2498–504.View ArticlePubMedPubMed CentralGoogle Scholar
 Allen GI, Liu Z. A loglinear graphical model for inferring genetic networks from highthroughput sequencing data. In: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference: 2012. p. 1–6. doi:10.1109/BIBM.2012.6392619.Google Scholar
 Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490(7418):61–70.View ArticleGoogle Scholar
 Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S, Kok CY, Jia M, De T, Teague JW, Stratton MR, McDermott U, Campbell PJ. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2015; 43(Database issue):805–11.View ArticleGoogle Scholar
 Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013; 499(7456):43–9.View ArticleGoogle Scholar
 The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012; 489(7417):519–25.View ArticlePubMed CentralGoogle Scholar