In this section, we describe the proposed method. First, we present an approach that finds the "right" trade-off between prediction performance and model complexity using regularization. We then describe how the forest generated by RF is mapped to a rule space, where many of the rules are removed by 1-norm regularization. Finally, we present several metrics for evaluating accuracy and interpretability.

### Balancing accuracy and model complexity with regularization

Machine learning algorithms normally learn a function *f* from input *x* to output *y*, that is,

$$y = f(x).$$

A loss function *L*(*x*, *y*, *f*) is minimized with respect to *f*. The loss function usually takes the form of an error penalty, for example, the squared error:

$$L\left(x, y, f\right) = \left(y - f\left(x\right)\right)^{2}$$

which aims at achieving a low error rate on the training data. A model constructed this way often works very well on the training data but not on test data; this phenomenon is called overfitting. To avoid overfitting, we can add a complexity penalty to the loss function, for example, an *L*_{1} regularization term:

$$\left(y - f\left(x\right)\right)^{2} + \lambda \left\|w\right\|_{1}$$

where *w* is the parameter vector of the model and *λ* is a tuning parameter that balances accuracy against complexity. This yields a relatively less complex model than the unregularized one. In this article, we use 1-norm regularization: because its solutions are sparse, the resulting model is much simplified. 1-norm regularization has been widely applied in statistics and machine learning, e.g., [4, 5], and [6]. The above optimization problem can be solved with a linear programming (LP) solver.
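As an illustration of the sparsity induced by the 1-norm penalty, the following sketch fits an off-the-shelf lasso solver (scikit-learn) on synthetic data; the data and variable names are ours, not from the method itself:

```python
# Sketch: 1-norm (lasso) regularization drives most weights to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 3 truly influence y.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1)  # alpha plays the role of lambda
model.fit(X, y)

# Most coefficients are exactly zero; the two informative ones survive.
n_nonzero = int(np.sum(np.abs(model.coef_) > 1e-8))
```

Increasing `alpha` removes more weights (a simpler model, possibly underfit); decreasing it keeps more (a more complex model, possibly overfit), which is exactly the trade-off *λ* controls.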

### Rule elimination from random-forest-mapped rules using 1-norm regularization

From the training samples, we can construct a random forest. As the path from the root node to a leaf node in a decision tree can be interpreted as a regression rule, a random forest is equivalently represented as a collection of regression rules. Because each sample traverses each tree from the root node to one and only one leaf node, we define a feature vector that captures the leaf-node structure of a RF. For a sample **x**_{i}, the corresponding feature vector that encodes the leaf node assignment is defined as **X**_{i} = [*X*_{1}, . . . , *X*_{q}]^{T}, where *q* is the total number of leaf nodes in the forest,

$$X_{j} = \begin{cases} a_{j} & \text{if } \mathbf{x}_{i} \text{ reaches the } j\text{-th leaf node}, \\ 0 & \text{otherwise}, \end{cases} \qquad i = 1, \dots, l, \quad j = 1, \dots, q,$$

(1)

where *a*_{j} is the target value at leaf node *j*. We call the space of the **X**_{i} the rule space. Each dimension of the rule space is defined by one regression rule. The above mapping extends the binary mapping applied in [7, 8] to the regression case.

Using the above mapping, we obtain a new set of training samples in the rule space,

$$\left\{\left(\mathbf{X}_{1}, y_{1}\right), \left(\mathbf{X}_{2}, y_{2}\right), \dots, \left(\mathbf{X}_{l}, y_{l}\right)\right\}.$$

In the rule space, we consider a linear model of the form

$$f\left(\mathbf{X}\right) = \mathbf{w}^{T}\mathbf{X} + b$$

(2)

where the weight vector **w** and scalar *b* define a linear regression function for the samples. The weights in (2) measure the importance of the rules: the magnitude of a weight indicates how important the corresponding rule is. Clearly, a rule can be removed safely if its weight is 0. Rule elimination is therefore formulated as the problem of learning the weight vector.

Using the technique described in the previous section, we consider the following learning problem with 1-norm regularization:

$$\begin{aligned} \min_{\mathbf{w},\, \xi_{i}} \ & \lambda \left\|\mathbf{w}\right\|_{1} + \sum_{i=1}^{l} \xi_{i} \\ \text{s.t.} \ & \left|\mathbf{w}^{T}\mathbf{X}_{i} + b - y_{i}\right| \le \xi_{i}, \\ & \xi_{i} \ge 0, \quad i = 1, \dots, l. \end{aligned}$$

(3)

The solution to the above optimization problem is usually sparse, with the degree of sparsity controlled by the regularization parameter *λ*, which is chosen by cross validation on the training set. Rules with zero weights can be removed. Figure 2 illustrates the process described in this section.
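Problem (3) can be cast directly as a linear program by splitting **w** = *u* − *v* with *u*, *v* ≥ 0, so that ‖**w**‖₁ = Σ(*u* + *v*). A hedged sketch using SciPy's `linprog` on synthetic data (the function name and data are ours):

```python
# LP formulation of problem (3): min lambda*||w||_1 + sum(xi)
# subject to |w^T X_i + b - y_i| <= xi_i, xi_i >= 0.
import numpy as np
from scipy.optimize import linprog

def l1_rule_regression(X, y, lam):
    l, q = X.shape
    # Variable layout: [u (q), v (q), b (1), xi (l)], with w = u - v.
    c = np.concatenate([lam * np.ones(2 * q), [0.0], np.ones(l)])
    # Two one-sided constraints encode the absolute value:
    #   w^T X_i + b - y_i <= xi_i   and   y_i - w^T X_i - b <= xi_i
    A_ub = np.block([
        [ X, -X,  np.ones((l, 1)), -np.eye(l)],
        [-X,  X, -np.ones((l, 1)), -np.eye(l)],
    ])
    b_ub = np.concatenate([y, -y])
    bounds = [(0, None)] * (2 * q) + [(None, None)] + [(0, None)] * l
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:q] - res.x[q:2 * q]
    return w, res.x[2 * q]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 2.0 * X[:, 0] + 1.0  # only the first "rule" matters
w, b = l1_rule_regression(X, y, lam=1.0)
# With a moderate lambda, all uninformative weights come out exactly zero.
```

Columns of `X` here stand in for the rule-space features **X**_{i}; rules whose learned weight is zero would then be discarded.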

### Combined rule extraction and feature elimination

The assumption is that only important features appear in the remaining rules. Features that do not appear in the rules extracted using (3) are removed because they have little or no effect on the regression problem. In this way, rules and features are selected together.

It is possible to further select rules from a RF built on the selected features to get a more compact set of rules. This motivates an iterative approach. Features selected in the previous iteration are used for constructing a new RF. A new set of rules is then extracted from the new RF. This process continues until the selected features do not change.

Figure 3 illustrates the overall workflow of the algorithm.
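The iterative loop can be sketched as follows. For brevity, the rule-extraction step of (3) is stubbed with a plain lasso selector on the raw features, so this is a schematic of the stopping logic rather than the full method:

```python
# Schematic of the iterative feature-selection loop: rebuild the model on
# the surviving features until the selected set stops changing.
import numpy as np
from sklearn.linear_model import Lasso

def select_features(X, y, features, alpha=0.1):
    """Stub for 'extract rules, keep features that appear in them'."""
    model = Lasso(alpha=alpha).fit(X[:, features], y)
    return [f for f, c in zip(features, model.coef_) if abs(c) > 1e-8]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = 2.0 * X[:, 0] - 1.5 * X[:, 4] + 0.05 * rng.normal(size=200)

features = list(range(X.shape[1]))
while True:
    new_features = select_features(X, y, features)
    if new_features == features:  # converged: selection is stable
        break
    features = new_features
```

Because each pass can only keep a subset of the previous pass's features, the selected set shrinks monotonically and the loop is guaranteed to terminate.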

### Evaluation of results

The R squared statistic [9] measures the goodness of fit of a model to the data; we use it to describe how well the predictions fit the test data. Let *y*′_{i} denote the prediction for the *i*-th sample and \overline{y} denote the mean of the target variable. R squared is defined as:

$$R^{2} = 1 - \frac{SS_{err}}{SS_{tot}}$$

(4)

where $SS_{err} = \sum_{i=1}^{n}\left(y_{i} - y'_{i}\right)^{2}$, $SS_{tot} = \sum_{i=1}^{n}\left(y_{i} - \overline{y}\right)^{2}$, and *n* is the number of test samples. An R squared value closer to one indicates better performance. A simple measure of the stability of a regression algorithm is the standard deviation of R squared over multiple runs. As for interpretability, a small set of concise rules is naturally easier for humans to interpret.
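Equation (4) amounts to a few lines of code (the function name is ours):

```python
# R squared per Eq. (4): 1 - SS_err / SS_tot.
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_err = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_err / ss_tot

score = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A perfect fit gives exactly 1; a model no better than predicting the mean gives 0, and R squared can go negative for fits worse than the mean.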

In general, random forest classification is more robust against noise than many other methods [10]. There is limited research, however, on whether random-forest-based regression methods are also robust. One straightforward test is to introduce some noise into the data and then compare R squared with and without the noise. The smaller the difference, the more robust the algorithm is to noise.
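A hedged sketch of this robustness check, assuming scikit-learn; the noise level and dataset are illustrative choices of ours:

```python
# Noise-robustness check: train with and without injected label noise and
# compare R^2 on the same clean test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] + 0.5 * X[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf_clean = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
r2_clean = r2_score(y_te, rf_clean.predict(X_te))

y_noisy = y_tr + rng.normal(scale=0.3, size=y_tr.shape)  # inject label noise
rf_noisy = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_noisy)
r2_noisy = r2_score(y_te, rf_noisy.predict(X_te))

gap = r2_clean - r2_noisy  # smaller gap -> more robust to noise
```

Averaging the gap over multiple noise draws and noise levels gives a more reliable picture than a single run.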