Since the *l*_{1}LS sparse coding (Equation (10)) is a two-sided symmetric model, a coefficient can be zero, positive, or negative [18]. In bioinformatics, *l*_{1}LS sparse coding has been applied to the classification of microarray gene expression data [11]. The main idea is as follows. First, the training instances are collected in a dictionary. Then, a new instance is regressed by *l*_{1}LS sparse coding, yielding its sparse coefficient vector. Next, the regression residual of this instance with respect to each class is computed, and finally the instance is assigned to the class with the minimum residual.
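This scheme can be sketched numerically. Below is a minimal illustration; an ISTA-style soft-thresholding loop stands in for the *l*_{1}LS solver, and the toy data, names, and solver choice are ours, not the paper's:

```python
import numpy as np

def ista_l1ls(A, b, lam=0.1, iters=500):
    """min_x 0.5*||b - A x||_2^2 + lam*||x||_1 via ISTA (illustrative solver)."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - b) / L          # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
# dictionary: 3 training instances per class, 2 classes, unit l2-norm columns
A = np.abs(rng.normal(size=(8, 6)))
A /= np.linalg.norm(A, axis=0)
labels = np.array([0, 0, 0, 1, 1, 1])
b = A[:, 4] + 0.01 * rng.normal(size=8)        # a new instance close to class 1

x = ista_l1ls(A, b, lam=0.05)
# per-class residual: regress b using only that class's coefficients
residuals = [np.linalg.norm(b - A[:, labels == c] @ x[labels == c]) for c in (0, 1)]
pred = int(np.argmin(residuals))               # assign the minimum-residual class
```

The new instance is a noisy copy of a class-1 training instance, so the class-1 residual is the smaller one and `pred` is 1.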

We generalize this methodology so that the sparse code can be obtained under many other regularizations and constraints. For example, we can pool all training instances in a dictionary (hence *k* = *n* and A = D), and then learn the non-negative coefficient vector of a new instance, which is formulated as a one-sided model:

$$\min_{\mathbf{x}}\ \frac{1}{2}\|\mathbf{b}-\mathbf{A}\mathbf{x}\|_2^2 \quad \text{s.t.}\quad \mathbf{x}\geq \mathbf{0} \tag{17}$$

We call this model the *non-negative least squares* (NNLS) sparse coding. NNLS has two advantages over *l*_{1}LS. First, under some circumstances a non-negative coefficient vector is more easily interpretable than a coefficient vector of mixed signs. Second, NNLS is a non-parametric model. From a Bayesian viewpoint, Equation (17) is equivalent to the MAP estimation with the same Gaussian error as in Equation (6), but with the following discrete prior:

This non-negative prior implies that the elements of x are independent, and that the probability of *x*_{i} = 0 is 0.5 while the probability of *x*_{i} > 0 is 0.5 as well. (That is, the probabilities of *x*_{i} being either zero or positive are equal, and the probability of it being negative is zero.) Inspired by many sparse NMFs, *l*_{1}-regularization can additionally be used to produce sparser coefficients than NNLS above. The combination of *l*_{1}-regularization and non-negativity results in the *l*_{1}NNLS sparse coding model, formulated as:

$$\min_{\mathbf{x}}\ \frac{1}{2}\|\mathbf{b}-\mathbf{A}\mathbf{x}\|_2^2+\lambda\|\mathbf{x}\|_1 \quad \text{s.t.}\quad \mathbf{x}\geq \mathbf{0} \tag{19}$$

We call Equation (19) the *l*_{1}NNLS model. It is more flexible than NNLS, because it can produce sparser coefficients, as controlled by *λ*. This model in fact uses the following prior:

Now we give the generalized sparse-coding-based classification approach in detail. The method is depicted in Algorithm 1. We shall give the optimization algorithms required in the coding step later. The NN rule mentioned in Algorithm 1 is inspired by the usual way of using NMF as a clustering method. Suppose there are *C* classes with labels 1, ⋯, *C*. For a given new instance b, its class is *l* = arg max_{i = 1,⋯,k} *x*_{i}. This rule selects the maximum coefficient in the coefficient vector, and then assigns the class label of the corresponding training instance to the new instance. Essentially, this rule is equivalent to applying the *nearest neighbor* (NN) classifier in the column space of the training instances; in this space, the representation of the training instances is the identity matrix. The NN rule can be further generalized to the weighted *K*-*NN* rule. Suppose a *K*-length vector accommodates the *K* largest coefficients from *x*, and a second *K*-length vector holds the corresponding *K* class labels. The class label of b can then be designated as *l* = arg max_{i = 1,⋯,C} *s*_{i}, where the score *s*_{i} accumulates the selected coefficients whose training instances belong to class *i*, as defined in Equation (21).

The maximum value of *K* can be *k*, the number of dictionary atoms; in this case *K* is in fact the number of all non-zeros in x. Alternatively, the *nearest subspace* (NS) rule, proposed in [19], can be used to interpret the sparse coding. The NS rule takes advantage of the discriminative property of the sparse coefficients. It assigns the class with the minimum regression residual to b. Mathematically, it is expressed as *j* = arg min_{1≤i≤C} *r*_{i}(b), where *r*_{i}(b) = ‖b - Aδ_{i}(x)‖_{2} is the regression residual corresponding to the *i*-th class, and δ_{i}(x), which keeps only the coefficients of the *i*-th class, is defined analogously to Equation (21).
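The three decision rules can be sketched on a toy dictionary as follows; δ_{i}(x) is implemented by masking the coefficients outside class *i*, and the function names and data are ours, for illustration only:

```python
import numpy as np

def nn_rule(x, labels):
    """NN rule: class of the single largest coefficient."""
    return int(labels[int(np.argmax(x))])

def weighted_knn_rule(x, labels, K, C):
    """Weighted K-NN rule: sum the K largest coefficients per class, pick the max."""
    top = np.argsort(x)[::-1][:K]              # indices of the K largest coefficients
    s = np.zeros(C)
    for i in top:
        s[labels[i]] += x[i]                   # class score accumulates coefficient weight
    return int(np.argmax(s))

def ns_rule(x, labels, A, b, C):
    """NS rule: class with the minimum residual ||b - A*delta_c(x)||_2."""
    residuals = []
    for c in range(C):
        xc = np.where(labels == c, x, 0.0)     # delta_c(x): keep class-c coefficients only
        residuals.append(np.linalg.norm(b - A @ xc))
    return int(np.argmin(residuals))

labels = np.array([0, 0, 1, 1])
x = np.array([0.1, 0.0, 0.7, 0.3])             # a sparse coefficient vector
A = np.eye(4)                                  # toy dictionary
b = A @ x
print(nn_rule(x, labels), weighted_knn_rule(x, labels, K=2, C=2),
      ns_rule(x, labels, A, b, C=2))
```

On this toy example all three rules agree on class 1, but they differ in general: the NN rule looks at one coefficient, the weighted *K*-NN rule at *K* of them, and the NS rule at the reconstruction quality per class.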

**Algorithm 1**
*Sparse-coding-based classification*

**Input**: A_{m×n}: *n* training instances, c: class labels, B_{m×p}: *p* new instances

**Output**: p: predicted class labels of the *p* new instances

- 1. Normalize each instance to have unit *l*_{2}-norm.

- 2. Learn the sparse coefficient matrix X of the new instances by solving Equation (10), (17), or (19).

- 3. Use a sparse interpreter to predict the class labels of the new instances, e.g. the NN, *K*-NN, or NS rule.
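Algorithm 1 can be sketched end to end as below; a simple projected-gradient loop stands in for the NNLS solver of Equation (17), the NS rule is used in step 3, and all names and the toy data are our own illustrative choices:

```python
import numpy as np

def nnls_pg(A, b, iters=1000):
    """min_x 0.5*||b - A x||^2 s.t. x >= 0, by projected gradient (illustrative)."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = np.maximum(x - A.T @ (A @ x - b) / L, 0.0)  # gradient step + projection
    return x

def classify(A, labels, B, C):
    """Algorithm 1: normalize, sparse-code, then apply the NS rule."""
    A = A / np.linalg.norm(A, axis=0)                   # step 1: unit l2-norm columns
    preds = []
    for j in range(B.shape[1]):
        b = B[:, j] / np.linalg.norm(B[:, j])           # step 1 for the new instance
        x = nnls_pg(A, b)                               # step 2: NNLS sparse coding
        r = [np.linalg.norm(b - A @ np.where(labels == c, x, 0.0)) for c in range(C)]
        preds.append(int(np.argmin(r)))                 # step 3: NS rule
    return preds

rng = np.random.default_rng(1)
# block-structured dictionary: class 0 lives in the first 4 rows, class 1 in the last 4
A = np.block([[rng.uniform(1, 2, (4, 3)), rng.uniform(0, 0.2, (4, 3))],
              [rng.uniform(0, 0.2, (4, 3)), rng.uniform(1, 2, (4, 3))]])
labels = np.array([0, 0, 0, 1, 1, 1])
B = A[:, [1, 5]] + 0.05 * rng.normal(size=(8, 2))       # noisy copies of each class
print(classify(A, labels, B, C=2))
```

The two new instances are noisy copies of a class-0 and a class-1 training instance, so the printed predictions are `[0, 1]`.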

### Optimization

#### Active-set algorithm for l_{1}LS

The problem in Equation (10) is equivalent to the following non-smooth unconstrained *quadratic programming* (QP) problem:

$$\min_{\mathbf{x}}\ \frac{1}{2}\mathbf{x}^{T}\mathbf{H}\mathbf{x}+\mathbf{g}^{T}\mathbf{x}+\lambda\|\mathbf{x}\|_1$$

where H_{k×k} = A^{T}A and g = -A^{T}b. We thus see that the *l*_{1}LS problem is an *l*_{1}QP problem. It can be converted to the following smooth constrained QP problem:

$$\min_{\mathbf{x},\mathbf{u}}\ \frac{1}{2}\mathbf{x}^{T}\mathbf{H}\mathbf{x}+\mathbf{g}^{T}\mathbf{x}+\lambda\mathbf{1}^{T}\mathbf{u} \quad \text{s.t.}\quad -\mathbf{u}\leq\mathbf{x}\leq\mathbf{u}$$

where u is an auxiliary vector variable that squeezes x towards zero. It can be further written in the standard form:

$$\min_{\mathbf{x},\mathbf{u}}\ \frac{1}{2}\begin{bmatrix}\mathbf{x}\\\mathbf{u}\end{bmatrix}^{T}\begin{bmatrix}\mathbf{H}&\mathbf{0}\\\mathbf{0}&\mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{x}\\\mathbf{u}\end{bmatrix}+\begin{bmatrix}\mathbf{g}\\\lambda\mathbf{1}\end{bmatrix}^{T}\begin{bmatrix}\mathbf{x}\\\mathbf{u}\end{bmatrix} \quad \text{s.t.}\quad \begin{bmatrix}\mathbf{I}&-\mathbf{I}\\-\mathbf{I}&-\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{x}\\\mathbf{u}\end{bmatrix}\leq\mathbf{0} \tag{24}$$

where I is an identity matrix. Obviously, the Hessian in this problem is positive semi-definite, as we always suppose H is positive semi-definite in this paper.
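The reformulation can be checked numerically. The sketch below (toy data ours) verifies that the *l*_{1}LS and *l*_{1}QP objectives differ only by the constant ½‖b‖², and that the standard-form constraint matrix encodes -u ≤ x ≤ u:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
lam = 0.3
k = A.shape[1]

H, g = A.T @ A, -A.T @ b                 # H = A'A, g = -A'b

x = rng.normal(size=k)
# l1LS objective = l1QP objective + 0.5*||b||^2 (a constant independent of x)
l1ls = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
l1qp = 0.5 * x @ H @ x + g @ x + lam * np.abs(x).sum()
print(np.isclose(l1ls, l1qp + 0.5 * b @ b))

# standard-form constraints [I -I; -I -I][x; u] <= 0 encode -u <= x <= u,
# and are satisfied at u = |x|
I = np.eye(k)
Acon = np.block([[I, -I], [-I, -I]])
z = np.concatenate([x, np.abs(x)])       # [x; u] with u = |x|
print(bool((Acon @ z <= 1e-12).all()))
```

Both checks print `True` for any data, since they are algebraic identities rather than properties of the particular sample.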

A general active-set algorithm for constrained QP is provided in [20]; its main idea is that a working set is updated iteratively until it matches the true active set. In each iteration, a new solution x_{t} to the QP constrained only by the current working set is obtained. If the update step p_{t} = x_{t} - x_{t-1} is zero, then the Lagrange multipliers of the current active inequalities are computed. If all the multipliers corresponding to the working set are non-negative, the algorithm terminates with an optimal solution; otherwise, an active inequality is dropped from the current working set. If the update step p_{t} is nonzero, then an update length *α* is computed using the inequalities of the current passive set, and the new solution is updated as x_{t} = x_{t-1} + *α*p_{t}. If *α* < 1, a blocking inequality is added to the working set.

To solve our specific problem in Equation (24) efficiently, we have to modify the general method, because (i) our constraints are sparse: the *i*-th constraint is *x*_{i} - *u*_{i} ≤ 0 (if *i* ≤ *k*) or -*x*_{i} - *u*_{i} ≤ 0 (if *i* > *k*); and (ii) when *u*_{i} is not constrained in the current working set, the QP constrained by the working set is unbounded, so it is not necessary to solve this problem to obtain p_{t}. In the latter situation, p_{t} is unbounded, which can cause issues in numerical computation: solving the unbounded problem is time-consuming if the algorithm is unaware of the unboundedness, and if p_{t} contains positive or negative *∞*, the algorithm may crash.

We propose the revised active-set algorithm in Algorithm 2 for *l*_{1}LS sparse coding. To address the potential issues above, we make the following four modifications. First, we require that the working set be complete; that is, all variables in u must be constrained when computing the current update step. (Therefore all variables in x are also constrained, due to the specific structure of the constraints in our problem.) For example, if *k* = 3, the working set {1, 2, 6} is complete, as all variables *x*_{1}, *x*_{2}, *x*_{3}, *u*_{1}, *u*_{2}, *u*_{3} are constrained, while {1, 2, 4} is not complete, as *u*_{3} (and *x*_{3}) is not constrained. Second, the update steps of the variables that are constrained once in the working set are computed by solving the equality-constrained QP, while the variables constrained twice are directly set to zero. In the example above, suppose the current working set is {1, 2, 4, 6}; then *x*_{2}, *x*_{3}, *u*_{2}, *u*_{3} are computed by the constrained QP, while *x*_{1} and *u*_{1} are zero, because the only value satisfying the constraint -*u*_{1} = *x*_{1} = *u*_{1} is *x*_{1} = *u*_{1} = 0. Third, in this example we do not need to solve the equality-constrained QP with four variables; we only need two variables, by setting *u*_{2} = -*x*_{2} and *u*_{3} = *x*_{3}. Fourth, once a constraint is dropped from the working set and the set becomes incomplete, other inequalities must immediately be added to it until it is complete. In the initialization of Algorithm 2, x can alternatively be initialized by 0's; this is much more efficient than x = H^{-1}(-g) for large-scale sparse coding and very sparse problems.

#### Active-set algorithm for NNLS and l_{1}NNLS

Both the NNLS problem in Equation (17) and the *l*_{1}NNLS problem in Equation (19) can easily be reformulated as the following *non-negative QP* (NNQP) problem:

$$\min_{\mathbf{x}}\ \frac{1}{2}\mathbf{x}^{T}\mathbf{H}\mathbf{x}+\mathbf{g}^{T}\mathbf{x} \quad \text{s.t.}\quad \mathbf{x}\geq \mathbf{0}$$

**Algorithm 2**
*Active-set l*_{1}*QP algorithm*

**Input**: Hessian H_{k×k}, vector g_{k×1}, scalar *λ*

**Output**: vector x which is a solution to the *l*_{1}QP problem

% *initialize the algorithm by a feasible solution and complete working set*
x = H^{-1}(-g); u = |x|;
initialize the working set; % complete by construction
initialize the inactive (passive) set;
**while** true **do**
  % *compute update step*
  let *S* be the indices of variables constrained once by the working set;
  p_{2k×1} = 0;
  compute the entries of p over *S* by solving the equality-constrained QP, with *e*_{i} = 1 or -1 according to which of the two constraints on variable *i* is in the working set;
  **if** p = 0 **then**
    obtain the Lagrange multiplier µ from the working-set constraints, where A is the constraint matrix in Equation (24);
    **if** µ ≥ 0 **then**
      terminate successfully;
    **else**
      drop a constraint with a negative multiplier from the working set;
      add other passive constraints to the working set until it is complete;
    **end if**
  **end if**
  **if** p ≠ 0 **then**
    compute the step length *α* using the passive constraints;
    [x; u] = [x; u] + *α*p;
    **if** *α* < 1 **then**
      move the blocking constraint *i* corresponding to *α* from the passive set to the working set;
    **end if**
  **end if**
**end while**

where H = A^{T}A and g = -A^{T}b for NNLS, or g = -A^{T}b + *λ* for *l*_{1}NNLS.
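This reformulation is easy to verify numerically. The sketch below (toy data ours) checks that, for non-negative x, the NNLS and *l*_{1}NNLS objectives match the NNQP objective with g = -A^{T}b and g = -A^{T}b + λ respectively, up to the constant ½‖b‖²:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))
b = rng.normal(size=6)
lam = 0.2
H = A.T @ A
x = np.abs(rng.normal(size=4))           # any feasible (non-negative) point

# NNLS: g = -A'b
g = -A.T @ b
nnls = 0.5 * np.linalg.norm(A @ x - b) ** 2
print(np.isclose(nnls, 0.5 * x @ H @ x + g @ x + 0.5 * b @ b))

# l1NNLS: since x >= 0, lam*||x||_1 = lam * sum(x), which is absorbed into g
g1 = -A.T @ b + lam
l1nnls = nnls + lam * np.abs(x).sum()
print(np.isclose(l1nnls, 0.5 * x @ H @ x + g1 @ x + 0.5 * b @ b))
```

Both checks print `True`: on the non-negative orthant the *l*_{1} penalty is linear, which is exactly why it folds into the linear term g of the NNQP.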

Now we present the active-set algorithm for NNQP. This problem is easier to solve than *l*_{1}QP, as the Hessian of NNQP is half the scale of that of *l*_{1}QP and the constraints are much simpler. Our algorithm is obtained by generalizing the well-known active-set algorithm for NNLS of [21]. The complete algorithm is given in Algorithm 3. The warm-start point is initialized by the solution to the unconstrained QP; as in Algorithm 2, x can alternatively be initialized by 0's. The algorithm keeps adding and dropping constraints in the working set until the true active set is found.

**Algorithm 3**
*Active-set NNQP algorithm*

**Input**: Hessian H_{k×k}, vector g_{k×1}

**Output**: vector x which is a solution to the NNQP problem

x = [H^{-1}(-g)]_{+}; % x = [y]_{+} *is defined as x*_{i} = *y*_{i} *if y*_{i} > 0*, otherwise x*_{i} = 0
R = {*i* : *x*_{i} = 0}; % *initialize active set*
initialize the inactive (passive) set;
µ = Hx + g; % *the Lagrange multiplier*
**while** R ≠ ∅ and min_{i∈R}(*µ*_{i}) < -*ε* **do**
  % *ε is a small positive numerical tolerance*
  *j* = arg min_{i∈R}(*µ*_{i}); % *get the minimal negative multiplier*
  move *j* from R to the passive set and solve the QP over the passive variables for t;
  **while** the minimum passive entry of t is negative **do**
    compute the step length *α*; % *one or several indices correspond to α*
    *x* = *x* + *α*(*t* - *x*);
    move the variables driven to zero back to R and re-solve for t;
  **end while**
  *x* = *t*;
  *µ* = *Hx* + *g*;
**end while**
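Algorithm 3 can be rendered as a Lawson–Hanson-style loop. The following is a minimal sketch under our own naming, using the zero initialization that the text allows; it is an illustration of the technique, not the authors' implementation:

```python
import numpy as np

def nnqp_active_set(H, g, eps=1e-10, max_iter=100):
    """Sketch of an active-set solver for min_x 0.5*x'Hx + g'x s.t. x >= 0
    (Lawson-Hanson style; variable names are ours, not the paper's)."""
    k = len(g)
    x = np.zeros(k)                        # zero initialization
    passive = np.zeros(k, dtype=bool)      # passive set: variables free to be > 0
    mu = H @ x + g                         # Lagrange multipliers of x >= 0
    for _ in range(max_iter):
        active = ~passive
        if not active.any() or mu[active].min() >= -eps:
            break                          # KKT conditions hold: optimal
        j = np.where(active)[0][np.argmin(mu[active])]
        passive[j] = True                  # free the most negative multiplier
        while True:
            t = np.zeros(k)                # equality-constrained QP on the passive set
            t[passive] = np.linalg.solve(H[np.ix_(passive, passive)], -g[passive])
            if not passive.any() or t[passive].min() > -eps:
                break
            blocking = passive & (t < -eps)
            alpha = np.min(x[blocking] / (x[blocking] - t[blocking]))
            x = x + alpha * (t - x)        # step until the first variable hits zero
            passive &= ~(blocking & (x <= eps))  # zeroed variables become active again
        x = np.maximum(t, 0.0)             # clamp tiny negative round-off
        mu = H @ x + g
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 5))
b = rng.normal(size=8)
x = nnqp_active_set(A.T @ A, -A.T @ b)
mu = A.T @ A @ x - A.T @ b
# feasibility, dual feasibility, and complementarity of the returned solution
print(bool((x >= 0).all()), bool((mu >= -1e-6).all()), bool(abs(x @ mu) < 1e-6))
```

At termination the KKT conditions of the NNQP hold: x is non-negative, the multipliers on the zero variables are non-negative, and x and µ are complementary.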

#### Parallel active-set algorithms

The formulations of *l*_{1}QP and NNQP sparse coding for *p* new instances are, respectively,

$$\min_{\mathbf{X}}\ \sum_{j=1}^{p}\frac{1}{2}\mathbf{x}_{j}^{T}\mathbf{H}\mathbf{x}_{j}+\mathbf{g}_{j}^{T}\mathbf{x}_{j}+\lambda\|\mathbf{x}_{j}\|_1 \tag{27}$$

$$\min_{\mathbf{X}\geq \mathbf{0}}\ \sum_{j=1}^{p}\frac{1}{2}\mathbf{x}_{j}^{T}\mathbf{H}\mathbf{x}_{j}+\mathbf{g}_{j}^{T}\mathbf{x}_{j} \tag{28}$$
If we want to classify multiple new instances, the initial idea in [19] and [11] is to optimize the sparse coding problems one at a time. The interior-point algorithm proposed in [22] is a fast large-scale sparse coding algorithm, and the proximal algorithm in [23] is a fast first-order method whose advantages for non-smooth problems have recently been highlighted. If we adapt either algorithm to solve our multiple *l*_{1}QP problems in Equation (27) or NNQP problems in Equation (28), it is difficult to solve the single problems in parallel and to share computations; the time complexity of the multiple problems is then the sum of that of the individual problems. However, the multiple problems can be solved much more efficiently by active-set algorithms. We adapt Algorithms 2 and 3 to solve multiple *l*_{1}QP and NNQP problems in a parallel fashion. The individual active-set algorithms can be run in parallel while sharing the computation of matrix inverses (systems of linear equations, in essence): at each iteration, single problems having the same active set share the same system of linear equations, which therefore needs to be solved only once. For a large *p*, that is, for large-scale multiple problems, active-set algorithms have a dramatic computational advantage over the interior-point [22] and proximal [23] methods, unless those methods are given a scheme for sharing computations. Additionally, active-set methods are more precise than interior-point methods. Interior-point methods do not allow *u*_{i} = |*x*_{i}|; *u*_{i} must always be strictly greater than |*x*_{i}| to maintain feasibility. But *u*_{i} = |*x*_{i}| is naturally possible when the *i*-th constraint is active; in particular, *u*_{i} = *x*_{i} = 0 is reasonable and possible. Active-set algorithms do allow this situation.
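The computation-sharing idea can be sketched as follows: at a given iteration, problems whose passive (free) variable sets coincide share one system of linear equations. The snapshot of passive sets below is hypothetical, chosen only to show that two solves can serve many problems:

```python
import numpy as np

rng = np.random.default_rng(5)
k, p = 6, 100
M = rng.normal(size=(10, k))
H = M.T @ M                                  # shared Hessian H = A'A
G = rng.normal(size=(k, p))                  # one linear term g per new instance

# hypothetical snapshot: every problem carries a passive set at this iteration;
# problems with identical sets share the same system H[P,P] t = -g[P]
passive_sets = [frozenset({0, 2}) if j % 2 == 0 else frozenset({1, 2, 3})
                for j in range(p)]

T = np.zeros((k, p))
solves = 0
for P in set(passive_sets):                  # one factorization per distinct set
    idx = sorted(P)
    cols = [j for j in range(p) if passive_sets[j] == P]
    # solve once with a block right-hand side for all members of the group
    T[np.ix_(idx, cols)] = np.linalg.solve(H[np.ix_(idx, idx)], -G[np.ix_(idx, cols)])
    solves += 1

print(solves)  # 2 distinct systems serve all 100 problems
```

A per-problem solver would perform 100 linear solves here; grouping by active set reduces this to 2, which is the source of the claimed advantage for large *p*.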

### Kernel extensions

As the optimizations of *l*_{1}QP and NNQP require only inner products between the instances rather than the original data, our active-set algorithms can naturally be extended to solve the kernel sparse coding problem by replacing inner products with kernel matrices. The NS decision rule used in Algorithm 1 also requires only inner products, and the weighted *K*-NN rule needs only the sparse coefficient vector and class information. Therefore, the classification approach in Algorithm 1 can be extended to a kernel version. For narrative convenience, we also denote the classification approaches using *l*_{1}LS, NNLS, and *l*_{1}NNLS sparse coding as *l*_{1}LS, NNLS, and *l*_{1}NNLS, respectively; the prefix "K" is used for their kernel versions.
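Kernelization can be sketched by substituting kernel evaluations for the inner products in H and g; the RBF kernel and the projected-gradient stand-in solver below are our illustrative choices, not the paper's algorithm:

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """RBF kernel matrix between the columns of X and Y (an example kernel)."""
    d2 = (X**2).sum(0)[:, None] + (Y**2).sum(0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-gamma * d2)

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 4))                 # 4 training instances (columns)
b = rng.normal(size=(5, 1))                 # one new instance

# kernelization: replace every inner product by a kernel evaluation;
# the linear kernel X.T @ Y would recover H = A'A and g = -A'b exactly
H = rbf(A, A)                               # plays the role of A'A
g = -rbf(A, b).ravel()                      # plays the role of -A'b

# any solver touching only H and g applies unchanged; here, projected gradient
# on the NNQP form as a stand-in for the active-set algorithm
x = np.zeros(4)
L = np.linalg.norm(H, 2)                    # Lipschitz constant of the gradient
for _ in range(1000):
    x = np.maximum(x - (H @ x + g) / L, 0.0)

mu = H @ x + g                              # KKT multipliers of kernel NNLS
print(bool((x >= 0).all()), bool((mu >= -1e-6).all()))
```

The solver never touches A or b directly, only H and g, which is exactly the property that makes the kernel extension immediate.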