Intrinsically Bayesian robust classifier for single-cell gene expression trajectories in gene regulatory networks

Background: Expression-based phenotype classification using either microarray or RNA-Seq measurements suffers from a lack of specificity because pathway timing is not revealed and expressions are averaged across groups of cells. This paper studies expression-based classification under the assumption that single-cell measurements are sampled at a sufficient rate to detect regulatory timing. Thus, observations are expression trajectories. In effect, classification is performed on data generated by an underlying gene regulatory network.

Results: Network regulation is modeled via a Boolean network with perturbation, with regulation not fully determined owing to inherent biological randomness. The binary assumption is not critical because the resulting Markov chain characterizes expression trajectories. We assume a partially known Gaussian observation model belonging to an uncertainty class of models. We derive the intrinsically Bayesian robust classifier to discriminate between wild-type and mutated networks based on expression trajectories. The classifier minimizes the expected error across the uncertainty class relative to the prior distribution. We test it using a mammalian cell-cycle model, discriminating between the normal network and one in which the gene p27 is mutated, thereby producing a cancerous phenotype. Tests examine all model aspects, including trajectory length, perturbation probability, and the hyperparameters governing the prior distribution over the uncertainty class.

Conclusions: Simulations show the rates at which the expected error is diminished by smaller perturbation probability, longer trajectories, and hyperparameters that tighten the prior distribution relative to the unknown true network. For average-expression measurements, methods have been proposed to obtain prior distributions. These should be extended to the more mathematically difficult, but more informative, expression trajectories.
Electronic supplementary material The online version of this article (10.1186/s12918-018-0549-y) contains supplementary material, which is available to authorized users.

Genes interact with one another, and these interactions determine how genes behave over time and define the dynamics of gene regulatory networks (GRNs). One way of representing the dynamics of GRNs over discrete time points is the Boolean network with perturbation (BNp) [3]. A BNp is a Markov chain in which the state of a gene (0 for off, 1 for on) at the current time is a function of the states of its predictor genes at the previous time, plus a small random Boolean noise. Network inference can be performed using RNA-Seq time-series measurements [4].
Suppose we have single-cell measurements sampled at a rate sufficient to detect regulatory timing. In effect, classification would then be performed on data reflecting an underlying gene regulatory network. In [5] we proposed a classifier for the state trajectories of two classes, wild-type and mutated, each having its own BNp. We derived the Bayes classifier and computed the Bayes error, and we analyzed the effects of trajectory length, perturbation probability, and different mutations on the Bayes error.
In [6] and [7] we assumed an observation model on the state dynamics of the BNps, from which the expression values of the genes are obtained. As the parameters of the model were all unknown and the network functions were partially known, we proposed an expectation-maximization (EM)-based algorithm to estimate these parameters and functions, and then plugged them into the Bayes classifier. In [6] we assumed that the expression trajectory data come from single-cell measurements and compared that with a multiple-cell scenario, in which instead of trajectories we have the expressions of all genes averaged over all cells, which translates to an average over all states.
In this paper we extend single-cell trajectory classification to the Bayesian framework. We propose the intrinsically Bayesian robust (IBR) classifier for trajectories. The IBR classifier is a specific type of the optimal Bayesian classifier (OBC), first introduced in [8, 9] for the classification of static data. The difference between the OBC and the IBR classifier is that in the OBC the expectation of the class-conditional densities is taken over the posteriors of the parameters to obtain the effective class-conditional densities, whereas in the IBR classifier the expectation is taken over the priors. The IBR/OBC concept has been applied to linear and morphological filtering [10, 11] and to IBR Kalman filtering [12]. Regarding the prior distributions, prior-construction methods using pathway knowledge have been studied in the literature, for example [13, 14]. Bayesian methods for finding differentially expressed genes can also be used for classification [15].
Here we apply the IBR classifier to the classification of trajectories. As opposed to [6,7], where we estimate the parameters, here we assume that the parameters belong to an uncertainty class governed by a prior distribution. We assume that there are two classes: wild-type (S = 0) and mutated (S = 1). We introduce a Bayesian version of the partially observed Boolean dynamical system (POBDS), proposed in [16], as the observation model. We use a beta prior distribution for the prior probability of the class S = 0 and also for the gene perturbation probability. Since the observation model given the states is Gaussian, we employ the normal-gamma distribution as the prior distribution of the mean and precision (inverse of variance) of the Gaussian model.
In the simulation part, we employ a mammalian cell-cycle gene regulatory network [17] consisting of 10 genes as the BNp for the class S = 0 and its mutated version as the BNp for the class S = 1. We analyze the effects of all the hyperparameters of the model and the length of the observed trajectories on the classification error. The proposed classifier is computationally efficient because we use the sum-product method to reduce the complexity to m × 2^n, where m is the length of the observed trajectory and n is the number of genes in the network. The linear dependence on the number of time points, m, makes the classifier fast even for long trajectories.

Methods
In a Boolean network (BN) on n genes, each gene value x_i ∈ {0, 1}, for i = 1, ..., n, at time k + 1 is determined by the values of some predictor genes at time k via a Boolean function f_i : {0, 1}^n → {0, 1}. In practice, f_i is a function of a small number of genes, K_i, called the in-degree of gene x_i in the network. The in-degree of the network is K = max_{i=1,...,n} K_i. A gene network can be represented as a graph with vertices representing genes and edges representing regulations. There is a state diagram of 2^n states corresponding to the truth table of the BN, representing the dynamics of the network. Given an initial state, a BN will eventually reach a set of states, called an attractor cycle, through which it will cycle endlessly. Each initial state corresponds to a unique attractor cycle, and the set of initial states leading to a specific attractor cycle is known as the basin of attraction (BOA) of that attractor cycle.

State model
We allow stochasticity in our state model by using Boolean networks with perturbation (BNps) instead of deterministic BNs. For BNps, perturbation is introduced with a probability p_k by which the state of a gene in the network can be randomly flipped at time k. We assume that there is an independent identically distributed (i.i.d.) random perturbation vector at each time k, denoted by n_k ∈ {0, 1}^n, such that the i-th gene flips at time k if the i-th component of n_k is equal to 1. Therefore, the dynamical model can be expressed as

X_{k+1} = f(X_k) ⊕ n_k,   (1)

where X_k = [x_1(k), x_2(k), ..., x_n(k)]^T is a binary state vector, called a gene activity profile (GAP), at time k, in which x_i(k) indicates the expression level of the i-th gene at time k (either 0 or 1); f = [f_1, f_2, ..., f_n]^T : {0, 1}^n → {0, 1}^n is the vector of network functions, in which f_i gives the expression level of the i-th gene at time k + 1 when the system lies in state X_k at time k; n_k = [n_1(k), n_2(k), ..., n_n(k)]^T is the perturbation vector at time k, in which n_1(k), n_2(k), ..., n_n(k) are i.i.d. Bernoulli random variables for every k with parameter p_k = P(n_i(k) = 1) for i = 1, ..., n; and ⊕ is component-wise modulo-2 addition. The perturbation makes the Markov chain corresponding to a BNp irreducible. Hence, the network possesses a steady-state distribution π describing its long-run behavior. If p_k is sufficiently small, π will reflect the attractor structure of the original network. If we know the truth table and the perturbation probability of a BNp, we can derive its transition probability matrix (TPM), and consequently compute the steady-state distribution π.
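As a sketch of the dynamical model X_{k+1} = f(X_k) ⊕ n_k, the snippet below simulates a BNp trajectory; the 3-gene network function and the perturbation probability are illustrative choices, not taken from the paper.

```python
import random

def f(state):
    # Hypothetical 3-gene network function (same toy form as above).
    x1, x2, x3 = state
    return (x2, x1 & x3, 1 - x1)

def bnp_trajectory(x0, p, m, rng):
    # Generate m states: apply f, then XOR each gene with an
    # independent Bernoulli(p) perturbation bit.
    traj = [x0]
    for _ in range(m - 1):
        nxt = f(traj[-1])
        noisy = tuple(b ^ (rng.random() < p) for b in nxt)
        traj.append(noisy)
    return traj

rng = random.Random(0)
traj = bnp_trajectory((1, 0, 1), p=0.01, m=5, rng=rng)
print(traj)
```

With small p the trajectory follows the deterministic dynamics almost always, but the occasional flip makes every state reachable, which is what renders the chain irreducible.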

Prior for state parameter
In this paper, we assume that we know the underlying Boolean networks for both the wild-type and mutated classes, and the only uncertain parameter at the state level is the perturbation probability p_k. Since 0 < p_k < 1, we can employ a beta prior for p_k, for all k = 1, 2, ...:

π(p_k) = Beta(p_k; a, b) = p_k^{a−1} (1 − p_k)^{b−1} / B(a, b),   (2)

where a and b are known parameters and B(a, b) is the beta function. Since in reality p_k is close to zero, we can choose a and b so that this fact is satisfied. To this end, we can use the mean and variance of p_k:

E[p_k] = a / (a + b),   Var[p_k] = ab / ((a + b)^2 (a + b + 1)).   (3)
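The moments in (3) can be inverted to pick a and b from a desired mean and variance. The following sketch does this moment matching; the target mean 0.01 and variance 10^−4 are illustrative values, chosen only to keep p_k concentrated near zero.

```python
def beta_from_moments(mean, var):
    # Moment matching for Beta(a, b):
    #   E[p] = a / (a + b)
    #   Var[p] = a b / ((a + b)^2 (a + b + 1))
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

a, b = beta_from_moments(0.01, 1e-4)
print(a, b)
```

A small mean with a correspondingly small variance yields a prior tightly concentrated near zero, matching the biological expectation that perturbations are rare.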

Observation model
We define a Bayesian partially-observed Boolean dynamical system (BPOBDS) as the model for the gene expression data. In this model, we assume that the gene expressions come from Gaussian distributions whose parameters are governed by prior distributions whose parameters (the hyperparameters of the observations) are a function of the hidden Boolean states. If y_j(k) is the expression value of the j-th gene at time k, then the observation model is

y_j(k) | x_j(k), θ_j(k), λ_j(k) ~ N(θ_j(k), 1/λ_j(k)),   (4)

for j = 1, 2, ..., n and k = 1, 2, ..., where θ_j(k) and λ_j(k) denote the mean and precision, respectively, of the Gaussian distribution.

Priors for observation parameters
We employ the well-known normal-gamma prior distribution for θ_j(k) and λ_j(k):

λ_j(k) ~ Gamma(α_0, β_0),
θ_j(k) | λ_j(k) ~ N(μ_0 + δ_0 x_j(k), 1/(κ_0 λ_j(k))),   (5)

where α_0, β_0, κ_0, μ_0, and δ_0 are known positive hyperparameters, and x_j(k) is the hidden Boolean state of gene j at time k. The intuition behind the prior (5) is that when gene j at time k is on or off, that is, x_j(k) = 1 or x_j(k) = 0, the hyper-mean of the expression for that gene at time k is μ_j(k) = μ_0 + δ_0 or μ_j(k) = μ_0, respectively. In (5), μ_0 is the baseline expression level and δ_0 is the expression coefficient. The hyperparameters α_0, β_0, and κ_0 determine the level of uncertainty, through which we can control the variance of the outputs. We assume the same hyperparameter values for all genes at all times.
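A minimal sketch of the generative hierarchy in (4)-(5): draw a precision from Gamma(α_0, β_0) (rate parameterization, so E[λ] = α_0/β_0), draw a mean from the conditional normal, and then draw an observation. The baseline μ_0 = 30 is a hypothetical value; δ_0 = 40, α_0 = 100, β_0 = 10^4, and κ_0 = 100 follow the simulation settings used later in the paper.

```python
import random

def sample_expression(x, mu0=30.0, delta0=40.0, alpha0=100.0,
                      beta0=1e4, kappa0=100.0, rng=random):
    # lambda ~ Gamma(alpha0, rate=beta0); gammavariate takes shape and SCALE,
    # so the rate beta0 becomes scale 1/beta0.
    lam = rng.gammavariate(alpha0, 1.0 / beta0)
    # theta | lambda ~ N(mu0 + delta0 * x, 1 / (kappa0 * lambda))
    theta = rng.gauss(mu0 + delta0 * x, (kappa0 * lam) ** -0.5)
    # y | theta, lambda ~ N(theta, 1 / lambda)
    return rng.gauss(theta, lam ** -0.5)

rng = random.Random(1)
y_off = sample_expression(0, rng=rng)  # gene unexpressed: mean near mu0
y_on = sample_expression(1, rng=rng)   # gene expressed: mean near mu0 + delta0
print(y_off, y_on)
```

With these settings the expressed and unexpressed means (70 vs. 30 here) are separated by δ_0 = 40, while the expected observation standard deviation is about (β_0/α_0)^{1/2} = 10.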

IBR classifier
If one knows the feature-label distribution, then the error of any classifier can be found, and an optimal (Bayes) classifier minimizes the classification error. If the feature-label distribution is unknown but belongs to an uncertainty class of feature-label distributions, then we desire a classifier that minimizes the expected error over the uncertainty class. Given a classifier ψ, from the perspective of mean-square error (MSE), the best error estimate minimizes the MSE between the true error (a function of the parameter θ) and the error estimate. This Bayesian minimum-mean-square-error (MMSE) estimate is given by the expected true error,

ε̂(ψ) = E_θ[ε_θ(ψ)],

where ε_θ(ψ) is the error of ψ on the feature-label distribution parameterized by θ and the expectation is taken relative to the prior distribution π(θ) [18].
An IBR classifier minimizes the Bayesian MMSE estimate. If a classifier is defined by ψ(x) = 0 for x ∈ R_0 and ψ(x) = 1 for x ∈ R_1, where x is a multidimensional vector of data, and R_0 and R_1 partition the sample space, then [8]

f_{Θ_y}(x|y) = ∫_{Θ_y} f_{θ_y}(x|y) π(θ_y) dθ_y   (6)

is the effective class-conditional density for class y, Θ_y being the space for θ_y, f_{θ_y}(x|y) being the class-conditional density, and c the prior probability of class 0. The IBR classifier is given by [8]

ψ_IBR(x) = 0 if c f_{Θ_0}(x|0) ≥ (1 − c) f_{Θ_1}(x|1), and ψ_IBR(x) = 1 otherwise.   (7)
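The decision rule (7) can be written generically as follows. This sketch assumes the effective class-conditional densities are already available as callables; the two Gaussian densities in the usage example are purely illustrative stand-ins for f_{Θ_0} and f_{Θ_1}.

```python
import math

def ibr_classify(x, f0_eff, f1_eff, c):
    # IBR rule: label 0 iff c * f0_eff(x) >= (1 - c) * f1_eff(x).
    return 0 if c * f0_eff(x) >= (1 - c) * f1_eff(x) else 1

def gauss(mu, var):
    # Illustrative 1-D Gaussian density used as a toy effective density.
    return lambda x: math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy usage: class 0 centered at 0, class 1 centered at 2, equal priors.
print(ibr_classify(0.2, gauss(0.0, 1.0), gauss(2.0, 1.0), c=0.5))
print(ibr_classify(1.8, gauss(0.0, 1.0), gauss(2.0, 1.0), c=0.5))
```

In the paper, f0_eff and f1_eff are the effective trajectory densities p(Y|S = 0) and p(Y|S = 1) derived below, and c is replaced by its prior mean.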

Trajectory-based IBR classifier
Let θ_s = {p_{2:m}, θ_{1:n}(1:m), λ_{1:n}(1:m)} denote the parameters of the class S = s, for s = 0, 1, where p_{2:m} means the parameters p_2, p_3, ..., p_m, and similarly for θ_{1:n}(1:m) and λ_{1:n}(1:m). Furthermore, let f_s denote the Boolean network function of the class S = s and X = [X_1, X_2, ..., X_m] denote the Boolean state trajectory at the m consecutive time points at which Y has been observed.
The state vector X_k = [x_1(k), ..., x_n(k)]^T has the states of the n genes at time k, which are hidden and not observed. Suppose that we obtain the expressions of the n genes at m consecutive time points. Let Y_k = [y_1(k), ..., y_n(k)]^T denote the n × 1 expression vector of the n genes at time k, and Y = [Y_1, ..., Y_m] denote a time trajectory of length m, containing the expression vectors at the m consecutive times. The problem is to optimally classify this observed trajectory Y to class 0 (wild-type) or class 1 (mutated). Let c and 1 − c be the prior probabilities of the class S = 0 and the class S = 1, respectively. Since we are uncertain about c, we use a beta prior, c ~ Beta(a_c, b_c), with mean E[c] = a_c/(a_c + b_c), where a_c and b_c are known parameters. According to (6), the IBR classifier for the trajectories is

ψ_IBR(Y) = 0 if E[c] p(Y|S = 0) ≥ (1 − E[c]) p(Y|S = 1), and ψ_IBR(Y) = 1 otherwise,   (9)

where p(Y|S = s) is the effective class-conditional density of the trajectory Y in the class S = s, for s = 0, 1.

Effective class-conditional densities of trajectories
The joint distribution of Y, X, and θ_s given the class S = s can be factorized as

p(Y, X, θ_s | S = s) = p(Y | X, θ_s, S = s) p(X | p_{2:m}, S = s) p(θ_s).   (10)

In deriving (10), it is assumed that the state parameters p_{2:m} are independent of the observation parameters θ_{1:n}(1:m) and λ_{1:n}(1:m). If we assume that the expressions of the n genes at the m time points are conditionally independent given the parameters, the likelihood of Y in (10) factors as

p(Y | X, θ_s, S = s) = ∏_{k=1}^{m} ∏_{j=1}^{n} p(y_j(k) | x_j(k), θ_j(k), λ_j(k)).   (13)

We should note that the independence assumption in (13) applies only to the observations; the genes have interactions at the state level, following the underlying Boolean network, so the states cannot be considered independent. Due to the Markov property in (1), p(X | p_{2:m}, S = s) in (10) can be factored as

p(X | p_{2:m}, S = s) = P(X_1 | S = s) ∏_{k=1}^{m−1} P(X_{k+1} | X_k, p_{k+1}, S = s),   (14)

where P(X_{k+1} | X_k, p_{k+1}, S = s) is the probability of transitioning from state X_k at time k to state X_{k+1} at time k + 1, given the perturbation probability p_{k+1}, in the class S = s, and P(X_1 | S = s) is the probability of the first state X_1 in the class S = s. Let x^i denote the n × 1 Boolean vector of state i, for i = 1, 2, ..., 2^n. Given the perturbation probability p_{k+1}, the entries of the conditional transition probability matrix (TPM) at time k + 1, which is a 2^n × 2^n matrix, in the class S = s can be derived from (1) as

P(X_{k+1} = x^j | X_k = x^i, p_{k+1}, S = s) = p_{k+1}^{d(x^j, f_s(x^i))} (1 − p_{k+1})^{n − d(x^j, f_s(x^i))},   (15)

where d(x^j, f_s(x^i)) is the Hamming distance between the two Boolean vectors x^j and f_s(x^i). To obtain p(Y|S = s), we need to integrate the joint distribution p(Y, X, θ_s | S = s) in (10) with respect to θ_s and sum over X:

p(Y|S = s) = Σ_X ∫ p(Y, X, θ_s | S = s) dθ_s.   (16)

Fortunately, as the priors are conjugate, we can solve the integrals in (16) analytically. We will use the following lemmas to do so.

Lemma 1 Let P = (0, 1) be the domain of p_{k+1}. The following equation holds:

∫_P p_{k+1}^d (1 − p_{k+1})^{n−d} Beta(p_{k+1}; a, b) dp_{k+1} = B(a + d, b + n − d) / B(a, b),   (17)

where B(·, ·) is the beta function and d is the Hamming distance in (15).

Proof See Appendix 1 in Additional file 1.
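Because a Lemma 1-type identity is used repeatedly to integrate out the perturbation probability, a quick numerical check is reassuring. The sketch below compares the beta-function closed form ∫_0^1 p^d (1−p)^{n−d} Beta(p; a, b) dp = B(a+d, b+n−d)/B(a, b) with a midpoint-rule quadrature, for illustrative values of a, b, n, and d.

```python
import math

def log_beta(a, b):
    # log B(a, b) via log-gamma for numerical stability.
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def closed_form(a, b, n, d):
    # B(a + d, b + n - d) / B(a, b)
    return math.exp(log_beta(a + d, b + n - d) - log_beta(a, b))

def riemann(a, b, n, d, steps=200000):
    # Midpoint-rule quadrature of p^d (1-p)^(n-d) times the Beta(a, b) density.
    total, h = 0.0, 1.0 / steps
    norm = math.exp(log_beta(a, b))
    for i in range(steps):
        p = (i + 0.5) * h
        dens = p ** (a - 1) * (1 - p) ** (b - 1) / norm
        total += p ** d * (1 - p) ** (n - d) * dens * h
    return total

a, b, n, d = 1.0, 100.0, 10, 2
print(closed_form(a, b, n, d), riemann(a, b, n, d))
```

The two values agree to many digits, which is the conjugacy that lets the paper replace each p-dependent transition probability with a ratio of beta functions.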
We assume that the observations Y occur in the steady state. As a result, in (16), P(X 1 |S = s) is the steady-state probability of the state X 1 in the class S = s, for s = 0, 1. The following lemma gives the steady-state distribution.

Lemma 2 Let π_i^(s) = P(X_1 = x^i | S = s) denote the steady-state probability of the i-th state in the class S = s, and π^(s) = [π_1^(s), ..., π_{2^n}^(s)] be the 1 × 2^n vector of the steady-state distribution. Then π^(s) can be calculated from

π^(s) M^(s) = π^(s),   Σ_{i=1}^{2^n} π_i^(s) = 1,   (19)

where M^(s) is the transition probability matrix of the class S = s with the entries

M_{ij}^(s) = B(a + d(x^j, f_s(x^i)), b + n − d(x^j, f_s(x^i))) / B(a, b).   (20)
Proof See Appendix 2 in Additional file 1.
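The fixed-point condition of Lemma 2 can be solved numerically by power iteration. The sketch below uses an illustrative 3-state transition matrix, not the 2^n × 2^n matrix M^(s) of the paper.

```python
def steady_state(M, iters=10000):
    # Power iteration for the stationary distribution: repeatedly apply
    # pi <- pi M; row-stochastic M preserves the total mass sum(pi) = 1.
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
    return pi

# Illustrative 3-state row-stochastic matrix (irreducible, like a BNp chain).
M = [[0.9, 0.05, 0.05],
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]
pi = steady_state(M)
print(pi)
```

Since the perturbation makes the BNp chain irreducible, the same approach applies to M^(s); in practice one can also solve the linear system in (19) directly.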

Summing out X
Using (16), (17), (19), and (20), we have

p(Y|S = s) = Σ_X π_{X_1}^(s) β_{X_1}(1) ∏_{k=1}^{m−1} M^(s)_{X_k, X_{k+1}} β_{X_{k+1}}(k + 1),   (22)

where states are identified with their indices in the TPM. We use the sum-product algorithm [19] to efficiently compute the summation in (22). Define the 2^n × 1 vector β(k) by

β_i(k) = ∏_{j=1}^{n} p(y_j(k) | x_j(k) = x^i_j),   (23)

for i = 1, ..., 2^n, where x^i_j is the j-th entry of the Boolean state x^i and each factor is the marginal likelihood obtained by integrating out the observation parameters. We define an auxiliary 2^n × 1 vector γ^(s)(k) at time k, for k = 1, ..., m, which is initialized and updated as follows:

γ^(s)(1) = β(1) ∘ π^(s)T,   (24)
γ^(s)(k + 1) = β(k + 1) ∘ (M^(s)T γ^(s)(k)),   (25)

for k = 1, ..., m − 1, where T denotes the transpose and ∘ is the Hadamard product. Once we have calculated γ^(s)(m), the summation in (22) equals the l_1 norm ||γ^(s)(m)||_1, that is, the sum of all 2^n entries of γ^(s)(m). Therefore, (22) can be written as

p(Y|S = s) = ||γ^(s)(m)||_1.   (26)
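The forward (sum-product) recursion described above can be sketched directly. Here `obs[k][i]` plays the role of the per-time observation-likelihood vector (the marginal likelihood of the time-k observation given state i), and the small 2-state chain is an illustrative placeholder; correctness can be checked against brute-force enumeration of all state paths.

```python
def trajectory_likelihood(pi, M, obs):
    # Forward recursion:
    #   forward(1)   = obs(1) * pi              (elementwise)
    #   forward(k+1) = obs(k+1) * (M^T forward(k))
    # and the likelihood is the sum of the final forward vector.
    n = len(pi)
    forward = [obs[0][i] * pi[i] for i in range(n)]
    for k in range(1, len(obs)):
        propagated = [sum(M[i][j] * forward[i] for i in range(n)) for j in range(n)]
        forward = [obs[k][j] * propagated[j] for j in range(n)]
    return sum(forward)

# Illustrative 2-state chain observed for m = 3 time points.
pi = [0.7, 0.3]
M = [[0.9, 0.1],
     [0.2, 0.8]]
obs = [[0.5, 0.1], [0.4, 0.2], [0.6, 0.1]]
print(trajectory_likelihood(pi, M, obs))
```

Each step costs one matrix-vector product plus an elementwise product, so the cost grows linearly in the trajectory length m, which is what makes the classifier fast for long trajectories.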

Results and discussion
In this section, we consider a mammalian cell-cycle gene regulatory network [17], depicted in Fig. 1, for evaluating our proposed trajectory-based IBR classifier. This GRN consists of n = 10 genes, whose interactions are shown in Fig. 1. The Boolean functions of the corresponding Boolean network for this GRN are given in Table 1 [17]. We define class S = 0 as the wild-type class, whose network function f_0 is given in Table 1. The value of the gene CycD is determined by extracellular signals, and we assume f_1 = 0. According to [17], one mutation that leads to cancer occurs when the gene p27 in the network is shut down and cannot be activated by its regulating genes, that is, f_3 = 0. Therefore, we define class S = 1 as the mutated class with the network function f_1, which is the same as Table 1 except that f_3 = 0. We analyze the effects of the hyperparameters and m on the classification error in Figs. 2, 3, 4, 5, 6, 7, 8 and 9. Figure 2 shows the classification error versus m for a = 1, 3, 5, when the values of the other hyperparameters are b = 100, α_0 = 100, β_0 = 10^4, δ_0 = 40, and κ_0 = 100. We see that the classification error decreases as m increases and converges to zero. This means that for long enough trajectories we can have perfect classification. The value of the hyperparameter a determines the amount of uncertainty in the gene perturbation probability p_k at time k. From a biological perspective, we know that p_k should be small. Accordingly, we have chosen b = 100 and a = 1, 3, 5, leading to the mean values E[p_k] = a/(a + b) ≈ 0.01, 0.03, 0.05, respectively. For a given b, a larger value of a allows a wider range of p_k, which results in higher classification error. Figure 3 shows the classification error versus m for different values of α_0 and β_0, when a = 1, b = 100, δ_0 = 40, and κ_0 = 100.
As the precision (inverse of variance) has a Gamma(α_0, β_0) distribution, its mean and variance are E[λ] = α_0/β_0 and Var[λ] = α_0/β_0^2. Figure 3 shows the error curves for the two cases α_0/β_0 = 10^−3 and 2 × 10^−3, each having a different value of β_0, leading to a different variance of λ. As such, we can see the effects of both the mean and the variance of the precision λ. When α_0/β_0 increases, there is lower variance in the outputs, which results in lower errors. We also notice from Fig. 3 that for a given value of α_0/β_0, increasing β_0 decreases the error, the reason being that increased β_0 yields lower variance for the precision. Figure 4 plots the error versus α_0 for a = 1, 3, 5, when m = 5, b = 100, δ_0 = 40, κ_0 = 100, and β_0 = 10^5. For all values of a, the classification error is a decreasing function of α_0. Similarly, Fig. 5 gives the error versus β_0 for a = 1, 3, 5, when α_0 = 10 and the other hyperparameters are the same as in Fig. 4. Figure 5 shows an increasing trend of the classification error as β_0 grows. Figure 6 shows the error as a function of a, when m = 5, b = 100, δ_0 = 40, κ_0 = 100, α_0 = 100, and β_0 = 10^5. When a grows, the uncertainty in the perturbation probability grows as well, and the error is an increasing function of a. Similarly, Fig. 7 plots the error versus b, when a = 1 and the other hyperparameters are the same as in Fig. 6. In Fig. 7 the classification error decreases as b increases. Figure 8 illustrates the error versus κ_0, for m = 5, a = 1, b = 100, δ_0 = 40, α_0 = 100, and β_0 = 10^5. From (5), the hyperparameter κ_0 controls the variance of the mean parameters θ_j(k). When κ_0 increases, the uncertainty in θ_j(k) is reduced and its density peaks at μ_0 and μ_0 + δ_0 for the unexpressed and expressed states, respectively. Consequently, we expect better error rates for higher κ_0. Accordingly, in Fig. 8 the classification error is a decreasing function of κ_0.
Figure 9 illustrates the error versus δ_0, the expression coefficient, for m = 5, a = 1, b = 100, κ_0 = 100, α_0 = 100, and β_0 = 10^5. A larger δ_0 means that the mean values of the data for each gene in the unexpressed and expressed cases are well separated, which leads to lower classification error. As expected, in Fig. 9 the error decreases as δ_0 gets larger.

Conclusions
In this paper, we propose a trajectory-based intrinsically Bayesian robust classifier for the classification of single-cell gene-expression trajectories. We assume that the expressions of the n genes, whose interactions are known in terms of a Boolean network, are observed at m consecutive time points, for both the wild-type class (S = 0) and the mutated class (S = 1). As the parameters have uncertainty, we assign priors to them: a beta distribution for the prior probability of the wild-type class and for the gene perturbation probability, and a normal-gamma distribution for the mean and precision of the expressions at each time and for each gene, given the underlying states. As such, we derive closed-form solutions for the effective class-conditional densities of the trajectories in each class, by which we define the IBR classifier. The performance of the IBR classifier is evaluated on a cell-cycle gene regulatory network with 10 genes, for which we know the Boolean networks of the two classes. We have analyzed the effects of m and all the hyperparameters on the classification error.
In this paper we have assumed the form of the prior distributions, including the hyperparameter values. Once this is done, the analysis is mathematical: derive the effective class-conditional densities. As is generally the case with Bayesian methods, two issues arise: how are the priors developed, and how robust are the methods to the prior assumptions? Both issues have been extensively studied in regard to the OBC, of which the IBR classifier is the special case in which there is no update to a posterior. In [9], the robustness of the OBC relative to various modeling assumptions was investigated, including false assumptions on the covariance structure and falsely assuming Gaussianity. In [20], similar robustness issues were considered for the MMSE error estimator, which is required for defining the OBC. Regarding prior construction, the question is how to transform existing scientific knowledge into a prior distribution. The IBR classifier and other IBR filtering methods involve prior distributions on the underlying scientific model, for instance, the stochastic process (power spectra) in Wiener filtering and the feature-label distribution in classification. Uncertainty arises from insufficient scientific knowledge; the prior characterizes that uncertainty and, in effect, constrains the optimization relative to it. For IBR/OBC classification in the context of uncertain gene regulatory networks, prior-construction methods based on constrained optimization have been developed for both Gaussian [21] and discrete [22] gene regulatory networks under the assumption of static observations. Future work involves extending these prior-construction methods to trajectories, which should be possible, albeit with more difficult mathematical analysis and substantially more computation.
Another issue to be addressed in future work is extending the trajectory-based IBR classifier to optimal Bayesian classification, which will require analytic representation