Huber group LASSO
Given m datasets satisfying model (5), to ensure that all A^(k)'s have the same structure, elements of the A^(k)'s in the same position are grouped together and can be inferred by the group LASSO in (6), where A_i^(k)T is the i-th row of the matrix A^(k) and x_j^(k) is the j-th column of the matrix X^(k). The weight w_k for the k-th dataset can be assigned from experience. In this study, we choose w_k = n_k / Σ_k n_k, i.e., the more observations a dataset has, the higher the weight it is assigned. The penalty term in (6) exploits the sparse nature of GRNs and has the effect of making each group of estimated parameters either all zero or all non-zero [10], i.e., the a_iℓ^(k)'s, k = 1, . . . , m, become either all zero or all non-zero. Therefore, a consistent network topology can be obtained from the group LASSO method. λ is a tuning parameter that controls the degree of sparseness of the inferred network; the larger the value of λ, the more grouped parameters become zero.
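For concreteness, a sketch of the group LASSO criterion (6) consistent with the notation above is given below; the summation ranges and any normalizing constants are assumptions.

```latex
% Sketch of the group LASSO criterion (6); ranges and constants are assumptions.
\min_{A^{(1)},\dots,A^{(m)}}\;
\sum_{i=1}^{p}\left[
\sum_{k=1}^{m} w_{k} \sum_{j=1}^{n_{k}}
\Big( y_{ij}^{(k)} - A_{i}^{(k)T} x_{j}^{(k)} \Big)^{2}
\;+\; \lambda \sum_{\ell=1}^{p}
\Big\| \big( a_{i\ell}^{(1)},\dots,a_{i\ell}^{(m)} \big) \Big\|_{2}
\right]
```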
To introduce robustness, we consider using the Huber loss function instead of the squared error loss function and propose the following Huber group LASSO method
(7)
where the Huber loss function is defined as
(8)
The squared error and Huber loss functions are illustrated in Figure 9. For small errors the two loss functions are exactly the same, while for large errors the Huber loss, which grows linearly, is smaller than the squared error loss, which grows quadratically. Because the Huber loss penalizes large errors much less than the squared error loss, the Huber group LASSO is more robust than the group LASSO when large noise or outliers are present in the data. It is also known that the Huber loss is nearly as efficient as the squared error loss for Gaussian errors [24].
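A sketch of the Huber loss (8) and of the Huber group LASSO criterion (7), consistent with the description above, is given below; whether a factor of 1/2 is used in the loss, and the exact normalization of (7), are assumptions.

```latex
% One common parameterization of the Huber loss (8) with transition points ±δ;
% the factor 1/2 is an assumption (an unscaled variant uses r^2 and 2δ|r| - δ^2).
L_{\delta}(r) =
\begin{cases}
\tfrac{1}{2} r^{2}, & |r| \le \delta, \\
\delta |r| - \tfrac{1}{2}\delta^{2}, & |r| > \delta .
\end{cases}

% The Huber group LASSO (7) replaces the squared residuals in (6) by L_δ:
\min_{A^{(1)},\dots,A^{(m)}}\;
\sum_{i=1}^{p}\left[
\sum_{k=1}^{m} w_{k} \sum_{j=1}^{n_{k}}
L_{\delta}\!\big( y_{ij}^{(k)} - A_{i}^{(k)T} x_{j}^{(k)} \big)
+ \lambda \sum_{\ell=1}^{p}
\big\| \big( a_{i\ell}^{(1)},\dots,a_{i\ell}^{(m)} \big) \big\|_{2}
\right]
```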
For convenience, we define some notation and rewrite problems (6) and (7) in more compact forms. Let Y_i = [Y_i^(1)T, . . . , Y_i^(m)T]^T be the vector stacking the observations of the i-th target gene across all datasets, where Y_i^(k)T is the i-th row of Y^(k). Let b_iℓ = [a_iℓ^(1), . . . , a_iℓ^(m)]^T be the vector containing the grouped parameters, and denote by b_i the vector containing all parameters related to the regulation of the i-th target gene. According to the order of the parameters in b_i, re-arrange the rows of X^(k) and piece them together to form the design matrix X, where X_i = diag(X_i^(1)T, . . . , X_i^(m)T) with X_i^(k)T being the i-th row of X^(k). Then (7) can be rewritten as
(9)
where x_j is the j-th column of X and y_ij is the j-th element of Y_i. (6) can be rewritten similarly.
Optimization algorithm
The minimization of problem (9) is not easy, as the penalty term is not differentiable at zero and the Huber loss does not have second-order derivatives at the transition points ±δ. Observe that, for each fixed i, problem (9) is a separate sub-optimization problem, so (9) decomposes into p sub-problems. For each, we obtain b_i by minimizing
(10)
where, for notational convenience, we omit the subscript i, and b_ℓ is the ℓ-th block of parameters of b.
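In this compact notation, the per-gene objective (10) can be sketched as follows; how the dataset weights w_k enter (here assumed absorbed into the rows of X and Y_i) is an assumption.

```latex
% Sketch of the per-gene objective (10); the treatment of the weights w_k is an assumption.
J(b) = \sum_{j=1}^{n} L_{\delta}\big( y_{j} - x_{j}^{T} b \big)
       + \lambda \sum_{\ell=1}^{p} \| b_{\ell} \|_{2},
\qquad n = \sum_{k=1}^{m} n_{k}.
```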
To optimize (10), an iterative method is developed by constructing an auxiliary function, the optimization of which keeps J (b) decreasing. As in [13], given any current estimate b(k), a function Q(b | b(k)) is an auxiliary function for J (b) if conditions
(11)
are satisfied. In this study, we construct the auxiliary function as
(12)
where γ is the largest eigenvalue of . It can be easily shown that this auxiliary function satisfies (11).
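A standard construction of this type is sketched below; taking γ to be the largest eigenvalue of X^T X is an assumption, justified only insofar as it upper-bounds the curvature of the smooth Huber part f of J.

```latex
% Sketch of conditions (11) and an auxiliary function of the form (12);
% the choice gamma = lambda_max(X^T X) is an assumption.
Q(b \mid b) = J(b), \qquad Q(b \mid b') \ge J(b) \;\;\text{for all } b,\, b',

Q\big(b \mid b^{(k)}\big) = f\big(b^{(k)}\big)
  + \big(b - b^{(k)}\big)^{T} \nabla f\big(b^{(k)}\big)
  + \frac{\gamma}{2}\, \big\| b - b^{(k)} \big\|_{2}^{2}
  + \lambda \sum_{\ell=1}^{p} \| b_{\ell} \|_{2},
\qquad \gamma = \lambda_{\max}\big( X^{T} X \big).
```

Here f(b) denotes the Huber loss part of J(b); minimizing Q over b then decreases J, since J(b(k+1)) ≤ Q(b(k+1) | b(k)) ≤ Q(b(k) | b(k)) = J(b(k)).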
Considering the block structure of b, we apply a block-wise descent strategy [14], i.e., we cyclically optimize one block of parameters, b_ℓ, at a time. Denote by b(k)(ℓ) the vector after updating the ℓ-th block. Given b(k)(ℓ − 1), update it to b(k)(ℓ) by computing
(13)
where x_j,ℓ is the block of elements in x_j corresponding to b_ℓ and (·)_+ = max(·, 0). We repeatedly update every block using (13) until convergence. For a specific value of λ, the whole procedure is described as follows (a code sketch is given after these steps):
1. Initialize b(0). Set the iteration number k = 0.
2. Cycle through (13), one block at a time, to update the ℓ-th block, ℓ = 1, . . . , p.
3. If {b(k)} converges to b∗, go to the next step. Otherwise, set k := k + 1 and go to Step 2.
4. Return the solution b∗.
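As an illustration of the block-wise procedure, the Python sketch below implements one plausible form of the update (13): a gradient step on the Huber loss part followed by group soft-thresholding of the form (1 − λ/(γ||z||_2))_+ z, consistent with the (·)_+ = max(·, 0) operation above. The Huber parameterization, the choice γ = λ_max(X^T X), the treatment of the dataset weights (assumed absorbed into X and y), and the function names are assumptions rather than the authors' exact formulation.

```python
import numpy as np

def huber_grad(r, delta):
    """Gradient of the (1/2-scaled) Huber loss applied elementwise to residuals r."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def huber_group_lasso_gene(X, y, groups, lam, delta, gamma=None,
                           max_iter=500, tol=1e-6):
    """Block-wise descent for one target gene (a sketch of the procedure above).

    X      : (n, q) stacked design for this gene
    y      : (n,) stacked observations Y_i
    groups : list of index arrays, one per block b_l
    """
    n, q = X.shape
    if gamma is None:
        # Majorization constant: largest eigenvalue of X^T X upper-bounds the
        # curvature of the smooth Huber term (its second derivative is <= 1 here).
        gamma = np.linalg.eigvalsh(X.T @ X).max()
    b = np.zeros(q)
    for k in range(max_iter):
        b_old = b.copy()
        for idx in groups:
            r = y - X @ b                               # current residuals
            g = -X[:, idx].T @ huber_grad(r, delta)     # gradient w.r.t. this block
            z = b[idx] - g / gamma                      # gradient step on the smooth part
            norm_z = np.linalg.norm(z)
            # Group soft-thresholding: the whole block is set to zero or shrunk.
            shrink = max(0.0, 1.0 - lam / (gamma * norm_z)) if norm_z > 0 else 0.0
            b[idx] = shrink * z
        if np.linalg.norm(b - b_old) < tol * (1.0 + np.linalg.norm(b_old)):
            break
    return b
```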
Note that the algorithm can be adapted to solve (6) with quite similar derivations. In the following section, we show that the sequence {b(k)} generated by the algorithm keeps the objective function J(b) decreasing. We also show that the limit point of the generated sequence is indeed the minimum point of J(b).
Convergence analysis
The convergence of the optimization algorithm for the minimization of (10) is analyzed in a way similar to [25]. We first show the descent property of the algorithm.
Lemma 1 The sequence {b(k)} generated from the optimization algorithm keeps the objective function J(b) decreasing, i.e., J(b(k)) ≥ J (b(k+1)).
Proof By (11) and (13), we have
Next, we show that if the generated sequence satisfies some conditions, it converges to the optimal solution.
Lemma 2 Assume the data (y, X) lies on a compact set and the following conditions are satisfied:
1. The sequence {b(k)} is bounded.
2. For every convergent subsequence, the successive differences converge to zero.
Then, every limit point b∞ of the sequence {b(k)} is a minimum point of the function J(b).
Proof For any and
where ∇_j represents the partial derivatives with respect to the j-th block of parameters. Denote the second term by ∂P(b_j; δ_j); it satisfies
(14)
We assume the subsequence converges to a limit point. From condition 2 and (14), we have
and
(15)
since .
As minimizes with respect to the j th block of parameters, using (14), we have
(16)
with
Due to condition 2.,
(17)
Therefore, (15), (16) and (17) yield
(18)
for any 1 ≤ j ≤ p.
For , due to the differentiability of f (b),
Finally, we show that the sequence generated from the proposed algorithm satisfies these two conditions.
Theorem 3 Assuming the data (y, X) lies on a compact set and no column of X is identically 0, the sequence {b(k)} generated from the algorithm converges to the minimum point of the objective function J (b).
Proof We only need to show that the generated sequence meets the conditions in Lemma 2.
For the sake of notational convenience, fix j. Let b(u) be the vector containing u as its j-th block of parameters, with the other blocks held at their fixed values. Assume u + δ and u represent the values of the j-th block of parameters before and after the block update, respectively. Hence, as defined in (12), u is obtained by minimizing the following function with respect to the j-th block in the algorithm:
(19)
where and . Thus, u should satisfy
(20)
where s = u/||u||_2 if u ≠ 0, and ||s||_2 ≤ 1 if u = 0. Then, we have
(21)
The second and third equalities are obtained using the mean value theorem with τ ∈ (0, 1) and (20). For the first inequality, the following property of the Huber loss function and the property of the subgradient are used.
The result from (21) gives that
(22)
where
Using (22) repeatedly across every block, for any k, we have
Note that, by Lemma 1, {J(b(k))} converges as it keeps decreasing and is bounded from below. The convergence of {J(b(k))} yields the convergence of {b(k)}. Hence, the conditions of Lemma 2 hold, which implies that the limit of {b(k)} is the minimum point of J(b).
Implementation
The tuning parameter λ controls the sparseness of the resulting network. A network solution path can be obtained by computing networks on a grid of λ values, from λmax, the smallest value that gives the empty network, down to a small value, e.g., λmin = 0.01λmax. In our previous work [12], the BIC criterion was used to pick a specific λ value, which corresponds to a single definite network topology. A method called "stability selection", recently proposed by Meinshausen and Bühlmann [15], instead produces a network with a selection probability for each edge. Stability selection runs the network inference method, e.g., the group LASSO, many times, resampling the data in each run and computing the frequency with which each edge is selected across these runs. It has been used with a linear regression method to infer GRNs from steady-state gene expression data in Haury et al. [26] and has shown promising effectiveness. In this study, we adapt the stability selection method to infer the GRN topology from multiple time-course gene expression datasets. Given a family of m time-course gene expression datasets, k = 1, . . . , m, and a specific λ ∈ Λ, the stability selection procedure is as follows:
1. Use the moving block bootstrap to draw N bootstrap samples from every dataset, forming N bootstrap families of multiple time-course datasets, b = 1, . . . , N.
2. Use the proposed Huber group LASSO to infer networks from the b-th bootstrap family of datasets, and record the network topology shared across that family.
3. Compute the frequency of each edge (i, j), i.e., from gene j to gene i, over the inferred networks
(24)
where the (i, j) entry of the inferred topology indicates whether the edge is selected and #{·} is the number of elements in that set.
For a set of λ ∈ Λ, the probability of each edge in the inferred network is
(25)
The final network topology can be obtained by setting a threshold: edges with probabilities or scores below the threshold are considered nonexistent. This study focuses only on producing a list of edges with scores; the selection of the threshold is not discussed here. The stability selection procedure can also be applied with the group LASSO method (6).
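As an illustration of how the frequencies (24) and edge probabilities (25) can be aggregated, the Python sketch below counts how often each edge is selected across bootstrap runs for each λ and then takes the maximum over λ, in the spirit of Meinshausen and Bühlmann's stability selection. The helper functions draw_bootstrap_family and infer_network are hypothetical placeholders, and the max-over-λ aggregation is an assumption about (25).

```python
import numpy as np

def stability_scores(datasets, lambdas, n_bootstrap, infer_network, draw_bootstrap_family):
    """Edge scores via stability selection (a sketch of (24)-(25)).

    infer_network(family, lam) is assumed to return a p x p adjacency matrix
    (the shared topology inferred by the Huber group LASSO); both helper
    functions are hypothetical placeholders.
    """
    p = datasets[0].shape[0]                           # number of genes (genes x time assumed)
    scores = np.zeros((p, p))
    for lam in lambdas:
        freq = np.zeros((p, p))
        for _ in range(n_bootstrap):
            family = draw_bootstrap_family(datasets)   # one bootstrap family of datasets
            freq += (infer_network(family, lam) != 0)  # count selected edges, cf. (24)
        freq /= n_bootstrap
        scores = np.maximum(scores, freq)              # aggregate over lambda, cf. (25)
    return scores
```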
Since the data used are time series, the moving block bootstrap method is employed in the first step to draw bootstrap samples from each dataset. For a dataset with n observations, the moving block bootstrap with block length l splits the data into n − l + 1 overlapping blocks: block j consists of observations j to j + l − 1, j = 1, . . . , n − l + 1. Then [n/l] blocks are randomly drawn from the n − l + 1 blocks with replacement and are concatenated in the order they are drawn to form a bootstrap sample.
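A minimal sketch of the moving block bootstrap described above, assuming the time points lie along the last axis of the data array and truncating the concatenated blocks to the original length n:

```python
import numpy as np

def moving_block_bootstrap(data, block_length, rng=None):
    """Moving block bootstrap for one time-course dataset (a sketch).

    data is assumed to have time points along its last axis; the bootstrap
    sample is truncated to the original length n.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[-1]
    n_blocks = n - block_length + 1                    # overlapping blocks j..j+l-1
    starts = rng.integers(0, n_blocks, size=int(np.ceil(n / block_length)))
    blocks = [data[..., s:s + block_length] for s in starts]
    sample = np.concatenate(blocks, axis=-1)           # blocks kept in the order drawn
    return sample[..., :n]
```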
The other tuning parameter, δ, controls the degree of robustness. It is generally chosen in proportion to σ̂, the estimated standard deviation of the error, which can in turn be computed from the MAD, the median absolute deviation of the residuals. In this study, we use least absolute deviations (LAD) regression to obtain the residuals. To avoid the overfitting of LAD, which leads to a very small δ, we adjust the estimate accordingly.
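For reference, common default choices in the robust-regression literature relate δ and σ̂ as sketched below; the specific constants 1.345 and 1.4826 are assumptions rather than values confirmed here.

```latex
% Common default choices for the Huber tuning constant and the MAD-based scale
% estimate; the constants 1.345 and 1.4826 are assumptions.
\delta = 1.345\,\hat{\sigma},
\qquad
\hat{\sigma} = 1.4826 \cdot \mathrm{MAD}(r_{1},\dots,r_{n}),
\qquad
\mathrm{MAD}(r) = \operatorname{median}_{j} \big| r_{j} - \operatorname{median}(r) \big|,
```

where the r_j are the LAD residuals.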