Vector Autoregressive Model
We now consider gene expression data from time-course microarray experiments at n time steps t ∈ {1, 2, ..., n}. Let y_{t} = (y_{t,1}, ..., y_{t,m})' be the vector of m gene expression values at time step t. We assume that y_{t} at time step t is generated from y_{t−1} at the previous time step t − 1 by a first-order VAR model:

y_{t}' = c' + y_{t−1}' B + ϵ_{t}', (1)
where c is a constant vector, B = [B_{ij}]_{1≤i,j≤m} is an m × m autoregressive coefficient matrix, ϵ_{t} is assumed to follow an m-dimensional multivariate normal distribution N(0, Σ), and the notation y_{t}' denotes the transpose of the vector y_{t}. Throughout the paper, we assume that Σ is a diagonal matrix, that is, Σ = diag(σ_{1}^{2}, ..., σ_{m}^{2}). Note that the coefficient B_{ij} measures the influence that node i exerts on node j, and a nonzero B_{ij} provides a functional connectivity which is related to Granger causality [14]. Granger causality is a concept widely used in econometrics to analyze the relationships between variables and is based on the intuitive idea that a cause always occurs before its effect. Thus, we can describe a gene network by a directed graph based on the coefficient matrix B, where nodes represent genes and edges show functional connectivities. In the estimated graph, node i is linked to node j by the edge i → j if and only if B_{ij} ≠ 0.
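As a concrete illustration (our own toy example, not from the paper), a small VAR(1) process can be simulated and the directed graph read off from the nonzero entries of B; the coefficient matrix below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 100

# Hypothetical sparse autoregressive coefficient matrix: B[i, j] is the
# influence of gene i at time t-1 on gene j at time t.
B = np.array([
    [0.5, 0.4, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.6],
    [0.0, 0.0, 0.0, 0.1],
])
c = np.zeros(m)
sigma = 0.1  # diagonal Sigma = sigma^2 * I

# Simulate y_t' = c' + y_{t-1}' B + eps_t'
Y = np.zeros((n, m))
Y[0] = rng.normal(size=m)
for t in range(1, n):
    Y[t] = c + Y[t - 1] @ B + rng.normal(scale=sigma, size=m)

# Directed edges i -> j correspond to nonzero B_ij
edges = [(i, j) for i in range(m) for j in range(m) if B[i, j] != 0.0]
print(edges)
```

Here B is upper triangular with small diagonal entries, so the simulated process is stable; the edge list is exactly the support of B.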
For simplicity of explanation, we introduce the following notation: let z_{j} = (y_{2,j}, ..., y_{n,j})' be the response vector for gene j, X = (y_{1}, ..., y_{n−1})' the (n − 1) × m design matrix of lagged observations, and β_{j} = (B_{1j}, ..., B_{mj})' the jth column of B. Since the covariance matrix Σ is diagonal, the first-order VAR model of these observations can then be considered as m linear regression models

z_{j} = Xβ_{j} + ε_{j}, j = 1, ..., m. (2)

Here, we center each input variable so that there is no intercept in (2) and scale each input variable so that the observed standard deviation is equal to one.
The major objective of our analysis is to estimate the coefficient matrix B from the observations. In particular, our interest is in its structure; that is, we want to estimate which elements of each column of B are zero or nonzero. Let A_{j} denote the set of indices corresponding to the nonzero components of β_{j}, that is, A_{j} = {i : B_{ij} ≠ 0}. The graph inference problem can be formally stated as the problem of predicting A_{1}, ..., A_{m} from the observations. This formulation is equivalent to m variable selection problems in linear regression, where each component of y_{t} is the response and y_{t−1} provides the covariates.
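Concretely, the response and design matrix for one such regression can be assembled from the series as follows; this is a minimal NumPy sketch under our own naming (Y, X, z are illustrative arrays, not the paper's data), with the centering and unit-variance scaling applied to each input variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 5
Y = rng.normal(size=(n, m))          # toy expression matrix, rows = time steps

X = Y[:-1].copy()                    # covariates y_1, ..., y_{n-1}: (n-1) x m
X -= X.mean(axis=0)                  # center each input variable (no intercept)
X /= X.std(axis=0)                   # scale to unit standard deviation

j = 0
z = Y[1:, j] - Y[1:, j].mean()       # centered response z_j for gene j

print(X.shape, z.shape)
```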
Recursive Elastic Net
From this section, we focus on a variable selection problem in a linear regression model with response y_{t,j} and covariates y_{t−1,1}, ..., y_{t−1,m}. For notational simplicity, we omit the suffix j of z_{j}, β_{j}, λ_{1j} and λ_{2j}, and use the notation z, β, λ_{1} and λ_{2}.
To obtain a sparse solution for parameter estimation with high prediction accuracy and to encourage a grouping effect between variables in linear regression, Zou and Hastie [13] proposed the elastic net, which combines l_{1} regularization and l_{2} regularization. Like the lasso, the elastic net performs automatic variable selection due to the property of the l_{1} penalty. It can also select highly correlated variables together, which is due to the benefit of the l_{2} penalty. The loss function of the elastic net for parameter estimation can be represented by

L(λ_{1}, λ_{2}, β) = ||z − Xβ||^{2} + λ_{1} Σ_{k=1}^{m} |β_{k}| + λ_{2} Σ_{k=1}^{m} β_{k}^{2}, (3)

where λ_{1} and λ_{2} are regularization parameters. The naive elastic net estimator is then the minimizer of (3):

β̂(naive) = argmin_{β} L(λ_{1}, λ_{2}, β). (4)
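A loss of this form can be minimized by simple cyclic coordinate descent; the sketch below is our own didactic illustration (not the paper's LARS-EN implementation), with made-up data and regularization values:

```python
import numpy as np

def soft(a, t):
    """Soft-thresholding operator S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def naive_elastic_net(X, z, lam1, lam2, n_iter=200):
    """Minimize ||z - Xb||^2 + lam1 * ||b||_1 + lam2 * ||b||^2
    by cyclic coordinate descent (a didactic sketch, not LARS-EN)."""
    n, m = X.shape
    b = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(m):
            r_k = z - X @ b + X[:, k] * b[k]       # partial residual
            c_k = X[:, k] @ r_k
            b[k] = soft(c_k, lam1 / 2.0) / (col_sq[k] + lam2)
    return b

# Toy data: only the first two covariates matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 6))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
z = X @ beta_true + 0.1 * rng.normal(size=60)

b_naive = naive_elastic_net(X, z, lam1=5.0, lam2=1.0)
b_enet = (1.0 + 1.0) * b_naive   # bias-corrected rescaling by (1 + lam2)
print(np.round(b_naive, 2))
```

The final line applies the (1 + λ_2) rescaling that Zou and Hastie use to correct the double shrinkage of the naive estimator.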
Although the naive elastic net overcomes the limitations of the lasso, Zou and Hastie [13] showed that the naive elastic net does not perform satisfactorily in numerical simulations, and that this deficiency is due to imposing a double shrinkage with the l_{1} and l_{2} penalties. To solve the problem, they proposed the elastic net estimator as a bias-corrected version of the naive elastic net estimator, given by:

β̂(enet) = (1 + λ_{2}) β̂(naive). (5)
However, the elastic net suffers from too many false positives in the variable selection problem for extremely high-dimensional data, which will be illustrated in the Simulation section. This is because the l_{1} penalty imposes the same amount of shrinkage on all the coefficients; that is, the same regularization parameter λ_{1} is used for each coefficient without assessing their relative importance. In a typical setting of linear regression, it has been shown that such an excessive l_{1} penalty can affect the consistency of model selection [15–18].
To enable different amounts of shrinkage for each coefficient, we define the following weighted loss function:

L(λ_{1}, λ_{2}, w, β) = ||z − Xβ||^{2} + λ_{1} Σ_{k=1}^{m} w_{k}|β_{k}| + λ_{2} Σ_{k=1}^{m} β_{k}^{2}, (6)

where λ_{1} and λ_{2} are regularization parameters, and w_{1}, ..., w_{m} are positive coefficient weights. The weighted elastic net estimator and its bias-corrected estimator are defined as:

β̂(wnaive) = argmin_{β} L(λ_{1}, λ_{2}, w, β) (7)

and

β̂(wenet) = (1 + λ_{2}) β̂(wnaive). (8)
This modification of the l_{1} penalty was first proposed by Zou [18]. This type of l_{1} regularization has been applied to parameter estimation in several statistical models, such as Cox's proportional hazards model [19], the least absolute deviation regression model [20], and the graphical Gaussian model [21].
We next consider how to set the values of w_{k}. One possible way is to select the weights so that they are inversely proportional to the true coefficients β*_{k}, that is,

w_{k} = 1/|β*_{k}| if β*_{k} ≠ 0, and w_{k} = L otherwise, (9)

where L is some specified large value. If w_{k} is small, the corresponding estimate tends to be nonzero, while it tends to be zero if w_{k} is large. By using the weights (9) and appropriate values of the regularization parameters, we can choose the true set of variables. However, in many situations, one does not know the true coefficients, and thus we need some plug-in estimator instead. In this article, we propose a new multistep regularization method, called the recursive elastic net. The proposed method tries to obtain a better estimator than the elastic net estimator by alternately solving the weighted elastic net problem and updating the weights. The recursive elastic net can be described as follows:
Recursive Elastic Net

1. Set a maximum number of iterations M and choose an initial estimator β̂^{(0)} by the naive elastic net (4).

2. For the lth iteration (l = 1, ..., M), define the coefficient weights by using the (l − 1)th weighted elastic net estimator as follows:

w_{k}^{(l)} = 1/(|β̂_{k}^{(l−1)}| + δ), k = 1, ..., m.

3. Solve the weighted elastic net problem and estimate the coefficient vector:

β̂^{(l)} = argmin_{β} ||z − Xβ||^{2} + λ_{1} Σ_{k=1}^{m} w_{k}^{(l)}|β_{k}| + λ_{2} Σ_{k=1}^{m} β_{k}^{2}. (10)

4. Repeat Step 2 and Step 3 until a stopping criterion is fulfilled or the number of iterations reaches M.
The parameter δ > 0 is introduced to avoid the weights of zero-valued components of the previous estimate taking the value of infinity. Numerical simulations (not shown) suggest that δ should be taken near 0; in this article, we set δ = 10^{−5}. Following Zou and Hastie [13], we can also consider a bias-corrected version of the recursive elastic net by replacing (10) with

β̂^{(l)} = (1 + λ_{2}) argmin_{β} ||z − Xβ||^{2} + λ_{1} Σ_{k=1}^{m} w_{k}^{(l)}|β_{k}| + λ_{2} Σ_{k=1}^{m} β_{k}^{2}. (11)
We call the iterative procedure that uses the elastic net estimator (5) as the initial variable weights and updates the coefficients by the bias-corrected weighted elastic net (11) the corrected recursive elastic net. Note that the solution of our iterative procedure is a local optimum for a given initialization of the variable weights, and we cannot guarantee convergence to a global minimum. Thus, it is important to choose a suitable starting point for the variable weights. In the recursive elastic net and the corrected recursive elastic net, we propose to initialize with the naive elastic net and the elastic net estimators, (4) and (5), that is, the unweighted solutions of (10) and (11), respectively. In the recursive elastic net and the corrected recursive elastic net, the regularization parameters determine whether or not the lth estimator is better than the (l − 1)th estimator. The selection of the regularization parameters is discussed in a later section.
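The recursion in Steps 1–4 can be sketched compactly; the following is our own illustration, where a cyclic coordinate-descent solver stands in for the LARS-WEN algorithm of the paper, and the data, λ values, and M are made up:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def weighted_enet(X, z, w, lam1, lam2, n_iter=200):
    """Coordinate descent for ||z - Xb||^2 + lam1 * sum_k w_k |b_k| + lam2 * ||b||^2."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            c_k = X[:, k] @ (z - X @ b + X[:, k] * b[k])
            b[k] = soft(c_k, lam1 * w[k] / 2.0) / (col_sq[k] + lam2)
    return b

def recursive_enet(X, z, lam1, lam2, M=5, delta=1e-5):
    """Recursive elastic net: start unweighted (naive elastic net), then
    alternately update weights w_k = 1 / (|b_k| + delta) and re-solve."""
    m = X.shape[1]
    b = weighted_enet(X, z, np.ones(m), lam1, lam2)   # initial unweighted fit
    for _ in range(M):
        w = 1.0 / (np.abs(b) + delta)                  # small |b_k| -> heavy penalty
        b = weighted_enet(X, z, w, lam1, lam2)
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
beta_true = np.zeros(8); beta_true[:2] = [1.5, -1.0]
z = X @ beta_true + 0.1 * rng.normal(size=60)

b_hat = recursive_enet(X, z, lam1=5.0, lam2=1.0)
print(np.nonzero(b_hat)[0])
```

Coefficients zeroed at one iteration receive very large weights at the next, so they stay out of the model, while strong coefficients keep small weights and are only lightly shrunk.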
Grouping Effect
In this section, we show that the recursive elastic net estimator can lead to a desirable grouping effect for correlated variables. In the framework of linear regression, the grouping effect means that the coefficients of highly correlated variables should be similar to each other. As discussed in Zou and Hastie [13], the lasso does not have the grouping effect and tends to select only one variable among a correlated set. The following lemma, quoted from Zou and Hastie [13], guarantees the grouping effect for the recursive elastic net in the situation with identical variables, since the loss function of the recursive elastic net is strictly convex when λ_{2} > 0:
Lemma 1. Assume that x_{i} = x_{j} for i, j ∈ {1, ..., m} and that β̂^{(l)} is estimated by (10); then β̂_{i}^{(l)} = β̂_{j}^{(l)} for any λ_{2} > 0.
Furthermore, the following theorem provides an upper bound on the difference between the ith and jth coefficients at the lth iteration of the proposed method:
Theorem 1 For given dataset (z, X), regularization parameters (λ_{1}, λ_{2}) and variable weights of the lth iteration , suppose that the response z is centered and the covariate matrix X is standardized. Let be the recursive elastic net estimator of the lth iteration (10). Suppose that two coefficients satisfy and define
then
where
and
is the sample correlation.
The proof of Theorem 1 is given in the Appendix. We note that the grouping effect of the recursive elastic net is driven not only by the sample correlation ρ but also by the difference between the ith and jth coefficients of the (l − 1)th iteration. If x_{i} and x_{j} are highly correlated, i.e., ρ ≐ 1, and the difference at the (l − 1)th iteration is almost zero, then Theorem 1 says that the difference at the lth iteration is also almost zero.
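Lemma 1 can be checked numerically: with two identical columns and λ_2 > 0, the minimizer assigns them equal coefficients. The demo below is our own sketch, again with a didactic coordinate-descent solver and made-up data:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def enet_cd(X, z, lam1, lam2, n_iter=300):
    """Cyclic coordinate descent for ||z - Xb||^2 + lam1*||b||_1 + lam2*||b||^2."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            c_k = X[:, k] @ (z - X @ b + X[:, k] * b[k])
            b[k] = soft(c_k, lam1 / 2.0) / (col_sq[k] + lam2)
    return b

rng = np.random.default_rng(4)
x = rng.normal(size=60)
X = np.column_stack([x, x, rng.normal(size=(60, 2))])   # columns 0 and 1 identical
z = x + 0.1 * rng.normal(size=60)

b = enet_cd(X, z, lam1=1.0, lam2=5.0)
print(b[0], b[1])   # the identical columns receive (numerically) equal coefficients
```

With λ_2 = 0 (pure lasso) the solution is not unique and a solver typically puts all the weight on one of the duplicated columns; the strictly convex ridge term is what forces the split.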
Computational Algorithm
In this section, we provide an algorithm for solving the weighted elastic net problem in (6) by modifying the LARS-EN algorithm [13]. The modified algorithm, called LARS-WEN, can be described as follows:
Algorithm: LARS-WEN

1. Define the (n − 1) × m design matrix X* by x*_{k} = x_{k}/w_{k}, k = 1, ..., m.

2. Solve the elastic net problem for (z, X*) by the LARS-EN algorithm:

β̃ = argmin_{β*} ||z − X*β*||^{2} + λ_{1} Σ_{k=1}^{m} |β*_{k}| + λ_{2} Σ_{k=1}^{m} β*_{k}^{2}.

3. Output the weighted lasso estimator by β̂_{k} = β̃_{k}/w_{k}, k = 1, ..., m.
For a fixed λ_{2}, the LARS-WEN algorithm sequentially updates the locations of the nonzero coefficients and their values, and produces the entire solution path of the weighted elastic net estimator. The R source code is available from the authors upon request. The computational cost of the LARS-WEN algorithm is the same as that of the LARS-EN algorithm except for the treatment of Step 1 and Step 3. If the algorithm is stopped after u steps, it requires O(u^{3} + mu^{2}) operations.
Selection of Regularization Parameters
The regularization parameters play a critical role in determining the performance of variable selection with the recursive elastic net. We now discuss how to choose the values of the regularization parameters λ_{1} and λ_{2} in the recursive elastic net. Traditionally, cross-validation is often used for estimating the predictive error and comparing different models. For example, leave-one-out cross-validation uses a single observation from the original sample as the validation data and the remaining observations as the training data, and repeats this calculation so that each observation in the sample is used once as the validation data. Note that we need to choose two regularization parameters for each variable j in the VAR model. Thus, cross-validation can be time-consuming and can hardly be applied to regularization parameter selection in the estimation of the VAR model with the recursive elastic net. In order to choose the two regularization parameters, we instead use the degrees of freedom as a measure of model complexity. The profile of the degrees of freedom clearly shows how the model complexity is controlled by shrinkage, which helps us to pick an optimal model among all the possible candidates [22]. In recent years, an unbiased estimate of the degrees of freedom for the lasso [22] was proposed. Using the result of Zou et al. [22], an unbiased estimate of the degrees of freedom for the recursive elastic net in the lth iteration is given by

d̂f(λ_{2})^{(l)} = tr{X_{A^{(l)}}(X_{A^{(l)}}'X_{A^{(l)}} + λ_{2}I)^{−1}X_{A^{(l)}}'},

where A^{(l)} denotes the active set of β̂^{(l)}, that is, A^{(l)} = {k : β̂_{k}^{(l)} ≠ 0}, and the corresponding design matrix X_{A^{(l)}} consists of the columns x_{k} of X with k ∈ A^{(l)}. Note that d̂f(λ_{2})^{(l)} does not involve λ_{1} and depends only on A^{(l)} and λ_{2}. Let β̂_{α}^{(l)} and A_{α}^{(l)} be the coefficient vector and the active set of the LARS-WEN step α, respectively, and let λ_{α,1} be the corresponding l_{1} regularization parameter of step α. Hence, we can use the step index α in place of λ_{α,1} as a tuning parameter.
In our numerical studies, two model selection criteria, the Bayesian information criterion (BIC) [23] and the bias-corrected Akaike information criterion (AICc) [24, 25], are considered. Substituting the degrees of freedom for the number of parameters in BIC and AICc, we can define a modified BIC and a modified AICc for selecting the regularization parameters as

BIC_{α} = (n − 1) log σ̂_{α}^{2} + log(n − 1) · d̂f(λ_{2})_{α},
AICc_{α} = (n − 1) log σ̂_{α}^{2} + (n − 1)(n − 1 + d̂f(λ_{2})_{α})/(n − 3 − d̂f(λ_{2})_{α}),

where σ̂_{α}^{2} = ||z − Xβ̂_{α}||^{2}/(n − 1) is the residual variance of the candidate model at step α. For a given λ_{2}, we choose the optimal step that gives the smallest BIC or the smallest AICc among the candidate models.
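The selection loop can be illustrated as follows; this is our own sketch (a coordinate-descent solver over a coarse λ_1 grid stands in for the LARS-WEN path, and the data are made up), with the degrees-of-freedom estimate computed as the trace over the active set:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def enet_cd(X, z, lam1, lam2, n_iter=200):
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            c_k = X[:, k] @ (z - X @ b + X[:, k] * b[k])
            b[k] = soft(c_k, lam1 / 2.0) / (col_sq[k] + lam2)
    return b

def df_hat(X, active, lam2):
    """tr{X_A (X_A' X_A + lam2 I)^{-1} X_A'}: the elastic-net degrees of
    freedom of Zou et al. restricted to the active set A."""
    if not active.any():
        return 0.0
    XA = X[:, active]
    H = XA @ np.linalg.solve(XA.T @ XA + lam2 * np.eye(XA.shape[1]), XA.T)
    return np.trace(H)

rng = np.random.default_rng(5)
n, m = 60, 8
X = rng.normal(size=(n, m))
beta_true = np.zeros(m); beta_true[:2] = [1.5, -1.0]
z = X @ beta_true + 0.1 * rng.normal(size=n)

lam2 = 1.0
best = None
for lam1 in [40.0, 20.0, 10.0, 5.0, 2.0, 1.0]:     # a coarse candidate grid
    b = enet_cd(X, z, lam1, lam2)
    rss = np.sum((z - X @ b) ** 2)
    df = df_hat(X, b != 0, lam2)
    bic = n * np.log(rss / n) + np.log(n) * df     # modified BIC with df-hat
    if best is None or bic < best[0]:
        best = (bic, lam1, b)

print("selected lam1:", best[1])
```

Because the df estimate depends only on the active set and λ_2, no refitting or resampling is needed, which is what makes this cheaper than cross-validation.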
VAR Modeling with Recursive Elastic Net
As shown in the previous section, the inference of the nonzero components in the VAR model can be formulated as m variable selection problems; that is, our objective is to estimate the sets of indices of the nonzero components of each column of B. Therefore, we can apply the recursive elastic net directly to estimate the coefficient matrix B in the VAR model. In this section, we provide an algorithm for estimating B in the VAR model with the recursive elastic net and the derived model selection criteria for a fixed λ_{2}. The algorithm is as follows:
Algorithm: RENET-VAR

1. Set a maximum number of iterations M and a maximum step size u.

2. For a fixed λ_{2}, start with the initial coefficient weights w_{k,j}^{(1)} = 1 (k, j = 1, ..., m), so that the first iteration solves the unweighted (naive) elastic net.

3. For j = 1, ..., m:

(a) For the lth iteration, given the weights w_{j}^{(l)} = (w_{1,j}^{(l)}, ..., w_{m,j}^{(l)})', the LARS-WEN algorithm after u steps produces u solutions,

β̂_{1,j}^{(l)}, ..., β̂_{u,j}^{(l)},

where β̂_{α,j}^{(l)} is the solution at step α.

(b) Calculate the values of the model selection criterion for the u candidate models, say MC_{α,j}^{(l)}(λ_{2}) (α = 1, ..., u), where A_{α,j}^{(l)} denotes the active set of β̂_{α,j}^{(l)}, and MC = AICc or BIC.

(c) Select β̂_{j}^{(l)} from among β̂_{1,j}^{(l)}, ..., β̂_{u,j}^{(l)} as the solution that minimizes MC_{α,j}^{(l)}(λ_{2}).

(d) Update the variable weights by w_{k,j}^{(l+1)} = 1/(|β̂_{k,j}^{(l)}| + δ), k = 1, ..., m.

4. Update l = l + 1 and repeat Step 3 until a stopping criterion is satisfied or the number of iterations attains M.
As a stopping criterion, we calculate the sum of the values of the model selection criterion over j = 1, ..., m:

SMC^{(l)}(λ_{2}) = Σ_{j=1}^{m} MC_{j}^{(l)}(λ_{2}).

The algorithm stops if SMC^{(l)}(λ_{2}) − SMC^{(l−1)}(λ_{2}) > 0 (l = 2, 3, ..., M) holds.
Typically, we first pick a grid of values for λ_{2}, say λ_{1,2}, ..., λ_{G,2}. For each λ_{γ,2}, the RENET-VAR algorithm produces an estimate of B. Finally, we select the estimate corresponding to the λ_{γ,2} which satisfies

γ̂ = argmin_{γ} SMC(λ_{γ,2}). (19)
We can also estimate B with the bias-corrected recursive elastic net by replacing the weighted elastic net solution in Step 3(a) with its bias-corrected counterpart (11), that is, by rescaling by (1 + λ_{2}).
We note that our iterative algorithm guarantees the decrease of the sum of the values of the model selection criterion and searches for the model minimizing (19), although it cannot guarantee the decrease of the loss function.
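Putting the pieces together, a compact end-to-end sketch (our own illustration: a plain coordinate-descent solver replaces LARS-WEN, |A| replaces the trace df estimate in the BIC, and the λ grids, M, and data are made up) estimates B column by column:

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def wenet_cd(X, z, w, lam1, lam2, n_iter=150):
    """Cyclic coordinate descent for the weighted elastic net loss
    ||z - Xb||^2 + lam1 * sum_k w_k |b_k| + lam2 * ||b||^2 (stand-in for LARS-WEN)."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            c_k = X[:, k] @ (z - X @ b + X[:, k] * b[k])
            b[k] = soft(c_k, lam1 * w[k] / 2.0) / (col_sq[k] + lam2)
    return b

def bic(X, z, b):
    """Modified BIC with the active-set size standing in for the df estimate."""
    n = len(z)
    rss = np.sum((z - X @ b) ** 2)
    return n * np.log(rss / n) + np.log(n) * np.count_nonzero(b)

def renet_column(X, z, lam1_grid, lam2, M=4, delta=1e-5):
    """Recursive elastic net for one column of B: at each iteration pick the
    lam1 minimizing BIC, update the weights, and stop once the criterion rises."""
    w = np.ones(X.shape[1])
    best_score, b_sel = np.inf, np.zeros(X.shape[1])
    for _ in range(M):
        cands = [wenet_cd(X, z, w, lam1, lam2) for lam1 in lam1_grid]
        scores = [bic(X, z, b) for b in cands]
        i = int(np.argmin(scores))
        if scores[i] > best_score:          # SMC-style stopping criterion
            break
        best_score, b_sel = scores[i], cands[i]
        w = 1.0 / (np.abs(b_sel) + delta)
    return b_sel

# Simulate a sparse VAR(1): edges 0 -> 1 and 2 -> 3 only.
rng = np.random.default_rng(6)
m, n = 4, 120
B_true = np.zeros((m, m)); B_true[0, 1] = 0.8; B_true[2, 3] = -0.7
Y = np.zeros((n, m)); Y[0] = rng.normal(size=m)
for t in range(1, n):
    Y[t] = Y[t - 1] @ B_true + rng.normal(scale=0.1, size=m)

# Center/scale covariates and responses (signs and support of B are preserved).
X = Y[:-1]; X = (X - X.mean(axis=0)) / X.std(axis=0)
Z = Y[1:];  Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

B_hat = np.column_stack([
    renet_column(X, Z[:, j], lam1_grid=[30.0, 20.0, 12.0], lam2=0.5)
    for j in range(m)
])
print(np.round(B_hat, 2))
```

Each column of B is fitted independently, which reflects the diagonal-Σ assumption; the signs of the recovered entries match the simulated edges even though the standardization changes their scale.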