### MONSTER: MOdeling Network State Transitions from Expression and Regulatory data

The MONSTER algorithm models the regulatory transition between two cellular states in three steps: (1) Inferring state-specific gene regulatory networks, (2) modeling the state transition matrix, and (3) computing the transcription factor involvement.

**Inferring state-specific gene regulatory networks:** Before estimating the transition matrix, **T**, we must first estimate a gene regulatory starting point for each state. While there have been many methods developed to infer such networks [1–7], we have found the bipartite framework used in PANDA [8] to have features that are particularly amenable to interpretation in the context of state transitions. PANDA begins by using genome-wide transcription factor binding data to postulate a network “prior”, and then uses message-passing to integrate multiple data sources, including state-specific gene co-expression data.

Motivated by PANDA, we developed a highly computationally efficient, classification-based network inference method that uses common patterns between transcription factor targets and gene co-expression to estimate edges and to generate a bipartite gene regulatory network connecting transcription factors to their target genes.

This approach is based on the simple concept that genes affected by a common transcription factor are likely to exhibit correlated patterns of expression. To begin, we combine gene co-expression information with information about transcription factor targeting derived from sources such as ChIP-Seq or sets of known sequence binding motifs found in the vicinity of genes. The process of building prior networks to use as input to MONSTER may be complex, but our tool is agnostic to the source of this input. Users of MONSTER should apply domain-specific knowledge to generate an appropriate prior network.

We then calculate the direct evidence for a regulatory interaction between a transcription factor and a gene, which we define as the squared partial correlation between a given transcription factor’s gene expression, \(g_{i}\), and the gene’s expression, \(g_{j}\), conditional on all other transcription factors’ gene expression:

$$\hat{d}_{i,j}=cor\left(g_{i},g_{j}|\left\{ g_{k}:k\ne i,k\in\mathbf{TF}_{\mathbf{j}}\right\} \right)^{2}, $$

where \(g_{i}\) is the gene which encodes the transcription factor \(TF_{i}\), \(g_{j}\) is any other gene in the genome, and \(\mathbf{TF}_{j}\) is the set of gene indices corresponding to known transcription factors with a binding site in the promoter region of \(g_{j}\). The correlation is conditioned on the expression of all other potential regulators of \(g_{j}\), based on the transcription factor motifs associated with \(g_{j}\). The direct evidence is motivated by the idea that changes in transcription factor expression may lead to similar changes in target gene expression. The coexpression of co-targeted genes is long established in the literature [9, 10], and evidence also points to the coexpression of transcription factor genes with the targets of those transcription factors [11, 12]. Moreover, studies across multiple tissues have shown widely varying expression of transcription factor genes, indicating that this expression can be used to predict their regulatory involvement [13]. Naturally, transcription factor behavior depends on many factors, including events that occur after translation. Nevertheless, it makes intuitive sense that the mRNA abundance of a transcription factor’s gene should correlate, to some degree, with that of its target genes.
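As an illustrative sketch (the function and variable names are our own, not part of the MONSTER implementation), the squared partial correlation can be computed by regressing the conditioning transcription factors out of both expression vectors and correlating the residuals:

```python
import numpy as np

def direct_evidence(expr, tf_idx, gene_idx, other_tf_idx):
    """Squared partial correlation between the expression of TF tf_idx and
    gene gene_idx, conditional on the other TFs' expression.

    expr: (n_samples, n_genes) expression matrix. A sketch only; the
    indexing convention here is an assumption for illustration.
    """
    # Conditioning design matrix: intercept plus the other TFs' expression.
    Z = np.column_stack([np.ones(expr.shape[0]), expr[:, other_tf_idx]])

    def residual(y):
        # Least-squares fit of y on Z; keep what Z cannot explain.
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return y - Z @ beta

    r_i = residual(expr[:, tf_idx])
    r_j = residual(expr[:, gene_idx])
    r = np.corrcoef(r_i, r_j)[0, 1]  # partial correlation
    return r ** 2                    # direct evidence d_hat
```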

Next, we fit a logistic regression model which estimates the probability of each gene, indexed *j*, being a motif target of a transcription factor, indexed *i*, based on that gene’s expression pattern across the *N* samples in each phenotypic class (so that each of the *p* genes is an observation and each sample is a predictor):

$$logit\left(P\left[\mathbf{M}_{i,j}=1\right]\right)=\beta_{0,i}+\beta_{1,i}g_{j}^{(1)}+\dots+\beta_{N,i}g_{j}^{(N)} $$

$$\hat{\theta}_{i,j}=\frac{e^{\hat{\beta}_{0,i}+\hat{\beta}_{1,i}g_{j}^{(1)}+\dots+\hat{\beta}_{N,i}g_{j}^{(N)}}}{1+e^{\hat{\beta}_{0,i}+\hat{\beta}_{1,i}g_{j}^{(1)}+\dots+\hat{\beta}_{N,i}g_{j}^{(N)}}} $$

where the response **M** is a binary *p*×*m* matrix indicating the presence of a sequence motif for the \(i^{th}\) transcription factor in the vicinity of the \(j^{th}\) gene, and \(g_{j}^{(k)}\) represents the gene expression measured for sample *k* at gene *j*. The fitted probability \(\hat{\theta}_{i,j}\) thus represents our estimated indirect evidence. Combining the scores for the direct evidence, \(\hat{d}_{i,j}\), and indirect evidence, \(\hat{\theta}_{i,j}\), via a weighted sum for each transcription factor-gene pair yields estimated edge-weights for the gene regulatory network. We score each gene according to the strength of indirect evidence for a regulatory response to each transcription factor and combine this with the direct evidence of regulation. Combining our measures of direct and indirect evidence presents some challenges: though both are bounded on [0,1], their interpretations are quite different. The direct evidence can be considered a conditional gene expression \(R^{2}\) between nodes, while the indirect evidence is interpreted as an estimated probability. We therefore use a non-parametric approach to combine the two sources of evidence. Specifically, the targets of each transcription factor are ranked and combined as a weighted sum, \(w_{i,j}=\left(1-\alpha\right)\left[rank\left(\hat{d}_{i,j}\right)\right]+\alpha\left[rank\left(\hat{\theta}_{i,j}\right)\right]\), where *α* is a constant bounded on [0,1]. By default we set *α*=0.5, corresponding to an equal contribution of direct and indirect evidence. This parameter could be adjusted if the context of a study involved reason to prefer one source of evidence over the other (see Supporting Information).
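The rank-based combination can be sketched as follows; the orientation of the evidence matrices (transcription factors as rows, genes as columns) and the function names are illustrative assumptions, not MONSTER’s API:

```python
import numpy as np

def combine_evidence(d_hat, theta_hat, alpha=0.5):
    """Weighted rank combination w = (1-alpha)*rank(d) + alpha*rank(theta).

    d_hat, theta_hat: (m_tfs, p_genes) evidence matrices; targets are
    ranked within each transcription factor (row). Sketch only.
    """
    def rank_rows(x):
        # Rank within each row, 1 = smallest. Ties are broken by order
        # rather than averaged, which is a simplification.
        return (np.argsort(np.argsort(x, axis=1), axis=1) + 1).astype(float)

    return (1 - alpha) * rank_rows(d_hat) + alpha * rank_rows(theta_hat)
```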

Applying this approach to gene expression data from two distinct phenotypes results in two *p*×*m* gene regulatory adjacency matrices, one for each phenotype. These matrices represent estimates of the targeting patterns of the *m* transcription factors onto the *p* genes. This network inference algorithm recovers validated regulatory interactions in *Escherichia coli* and yeast (*Saccharomyces cerevisiae*) data sets (see Supporting Information).

**Modeling the state transition matrix:** Many methods have been developed for inferring gene regulatory networks, and more recent work has proposed methods for estimating gene regulatory network differences [14, 15]. Once we have gene regulatory network estimates for each phenotype, we can frame the estimation of the transition matrix as a regression problem. With this formulation, we solve for the *m*×*m* matrix that best describes the transformation between phenotypes (1). More specifically, MONSTER predicts the change in edge-weights for a transcription factor, indexed *i*, based on all of the edge-weights in the baseline phenotype network:

$$E\left[b_{i}-a_{i}\right]=\tau_{1,i}a_{1}+\dots+\tau_{m,i}a_{m} $$

where \(b_{i}\) and \(a_{i}\) are column-vectors in **B** and **A** that describe the regulatory targeting of transcription factor *i* in the final and initial networks, respectively.

In the simplest case, this can be solved using the normal equations,

$$\hat{\tau}_{i}=\left(A^{T}A\right)^{-1}A^{T}(b_{i}-a_{i}) $$

to generate each of the columns of the transition matrix **T** such that

$$\hat{\mathbf{T}}=\left[\hat{\tau}_{1},\hat{\tau}_{2},\dots,\hat{\tau}_{m}\right] $$

The regression is performed *m* times, once for each of the transcription factors in the data. In this sense, columns in the transition matrix can be loosely interpreted as the optimal linear combination of columns in the initial-state adjacency matrix that predicts the corresponding column in the final-state adjacency matrix. The transition matrix is best understood by comparison with the identity matrix: a transcription factor, *i*, that does not alter its regulatory targets between states will have expected values of 0 for all entries in column *i*, with the exception of entry *i* itself. In the context of discovering changes in network configuration for a transcription factor, we are therefore most interested in evaluating the degree to which each column has non-zero values in its non-\(i^{th}\) entries. In essence, we are describing one network as a linear combination of another network. Numerous biological mechanisms, such as the formation of protein complexes, protein inactivation, post-translational modification, and epigenetic changes, allow for the systematic modification of network structures and drive the changes that are detected in the transition matrix (see Supporting Information).
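The per-column regressions above can be sketched in a few lines; here we use a least-squares solver rather than the explicit matrix inverse for numerical stability, and the function name is illustrative, not MONSTER’s API:

```python
import numpy as np

def transition_matrix(A, B):
    """Estimate the transition matrix column-by-column: for each TF i,
    solve A @ tau_i ~= (b_i - a_i) in the least-squares sense.

    A, B: (p_genes, m_tfs) adjacency matrices for the initial and final
    states. Returns the (m, m) matrix whose columns are the tau_i.
    Sketch only; no regularization is applied here.
    """
    # lstsq solves all m right-hand-side columns in one call.
    Tau, *_ = np.linalg.lstsq(A, B - A, rcond=None)
    return Tau
```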

This framework allows for the natural extension of constraints such as *L*1 and/or *L*2 regularization (see Supporting Information). For the analysis we present in this manuscript, we use the normal equations and do not impose a penalty on the regression coefficients. The inclusion or exclusion of this feature should depend primarily on assumptions of the underlying network mechanisms. L1 regularization, for example, will tend to infer a sparse transition matrix. This is reasonable in contexts in which the regulatory targeting pattern of TFs is not expected to change for the vast majority of TFs. For our four observational studies of COPD, a highly complex disease, it is not reasonable to assume that transcription factors differ in a sparse manner.
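For illustration, an L2-penalized (ridge) variant of the same estimate has the closed form \(\hat{\tau}_{i}=\left(A^{T}A+\lambda I\right)^{-1}A^{T}(b_{i}-a_{i})\). The sketch below assumes a hypothetical tuning parameter `lam` that does not appear in the manuscript:

```python
import numpy as np

def transition_matrix_ridge(A, B, lam=1.0):
    """L2-regularized (ridge) transition estimate.

    A, B: (p_genes, m_tfs) initial and final adjacency matrices.
    lam: hypothetical penalty weight (illustrative only); lam -> 0
    recovers the unpenalized normal-equations solution.
    """
    m = A.shape[1]
    G = A.T @ A + lam * np.eye(m)     # regularized Gram matrix
    return np.linalg.solve(G, A.T @ (B - A))
```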

**Computing the transcription factor involvement:** For a transition between two nearly identical states, we expect that the transition matrix would approximate the identity matrix. However, as initial and final states diverge, there should be increasing differences in their corresponding gene regulatory networks and, consequently, the transition matrix will also increasingly diverge from the identity matrix. In this model, the transcription factors that most significantly alter their regulatory targets will have the greatest “off-diagonal mass” in the transition matrix, meaning that they will have very different targets between states and so are likely to be involved in the state transition process. We define the “differential transcription factor involvement” (dTFI) as the magnitude of the off-diagonal mass associated with each transcription factor, or,

$$ \hat{dTFI_{j}}=\frac{\sum_{i=1}^{m}I\left(i\ne j\right)\hat{\tau}_{i,j}^{2}}{\sum_{i=1}^{m}\hat{\tau}_{i,j}^{2}} $$

(2)

where \(\hat{\tau}_{i,j}\) is the element in the \(i^{th}\) row and \(j^{th}\) column of the transition matrix, corresponding to the \(i^{th}\) and \(j^{th}\) transcription factors. To estimate the significance of this statistic, we randomly permute sample labels across phenotypes *n*=400 times. From these 400 permuted results, we infer the standard error of each estimate under the null. *P*-values are determined by comparing the observed estimate to this standard error, and FDR values are computed from these *p*-values (see Supporting Information).