Learning accurate and interpretable models based on regularized random forests regression

Background Many biology related research works combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus it will be beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving the prediction performance. Methods In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence where we can hardly find a direct linear relationship between input and output. Nonlinear regression techniques can reveal nonlinear relationship of data, but are generally hard for human to interpret. We propose a rule based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features. Results We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression. Conclusion It demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied.


Background
In many real applications, it is vital to have an interpretable model (e.g., relevant features and predictive rules) and high performance prediction at the same time to understand the underlying problem well. Some of the state-of-the-art algorithms like Support Vector Machines (SVM) [1], Artificial Neural Network (ANN) [2], and Random Forests (RF) [3], generally predict the outcome with high accuracy. But other than accuracy, it is hard to interpret the models built since they either are "black box" model, or include so many decision rules that human cannot explain them clearly. On the other hand, some algorithms, especially those based on decision trees, are easy to interpret. However, the prediction performance is usually low compared to SVM, ANN, or RF. See Figure 1 for an illustration regarding the interpretability-prediction performance space. Basically, to help explain the generated model, it is desirable to have an algorithm that falls on the region C. Finding a right tradeoff between prediction performance and model interpretability is thus important.
Decision trees use a tree structure to represent a partition of the space. From the root node to each leaf node of a decision tree, we can consider it as a decision rule. Decision rule based algorithms are well known for their capability of shedding light on the decision process in addition to making a prediction. Another factor affecting the interpretation of model generated from data is feature selection. In general, fewer features involved in the model will make it less complex and more interpretable. There is a rich resource of prior work on rule-based learning and feature selection in the fields of bioinformatics and statistical learning. It is beyond the scope of this article to supply a complete survey of the respective areas. Below we review some of the main findings most closely related to this article.

Our contribution
In many biological problems, building a good predictive model that explains the problem well is the ultimate goal of modelling.
High performance and concise representation (i.e., a small rule set and a small feature set) are two important requirements of rule learning methods. Regression tree based methods usually generate a small set of rules. However, their performance is relatively low compared with those using regression with SVM (Support Vector Regression) and random forests. An RF generally has high performance, but generates a large number of rules. It is difficult to interpret the model using a large number of rules. In this article, we take an iterative approach to regularize random forests to obtain refined rules without compromising the performance. RF has an ensemble of regression trees and covers more candidate rules compared with a single decision tree. Regularization keeps only a small number of rules that are the most discriminative. We take an embedded approach with a greedy backward elimination strategy for feature elimination.
We combine rule extraction and feature elimination method iteratively. The result of rule extraction is used for feature elimination. The selected features are then fed into RF and there is 1-norm regularization step to extract important rules. The iterative alternating approach continues until the selected subset of features does not change. Only a few rule learning algorithms are geared toward regression problems as opposed to classification problems. In addition to application to classification case, we apply this iterative approach to another category of learning algorithm -regression rule learning, extending its domain of usage.
It is important to evaluate the quality of the algorithm in terms of prediction performance and interpretability. We use a set of metric to evaluate rule quality as follows:

Methods
In this section, we describe the proposed method. First, we present an approach to find the \right" trade off between prediction performance and model complexity using regularization. We then describe our approach by showing a mapping of the forest generated by RF to rule space where many of rules are being removed by 1-norm regularization. Then we present several metrics for evaluation of accuracy and interpretability respectively.

Balancing accuracy and model complexity with regularization
Machine learning algorithms normally learn a function f from input x to output y, that is, A loss function L(x, y, f) is minimized with respect to x, y, and f. The loss function usually takes the form of error penalty, for example, the squared error: which aims at achieving low error rate on training data. It is common that model constructed this way works very well on training data, but not on test data. This is called overfitting. To avoid the overfitting problem, we can add a complexity penalty to the loss function, for example, L 1 regularization: where w is parameter in the model, l is tuning parameter to balancing the accuracy and complexity. It can generates relatively less complex model comparing with the previous one. In this article, we use 1-norm regularization. Due to the sparse solution of 1-norm regularization, the model constructed above is much simplified. 1-norm regularization has been widely applied in statistics and machine learning, e.g., [4,5], and [6]. The above optimization can be solved by a linear program solver (LP).
Rule elimination using 1-norm regularization from random forests mapped rules From training samples, we can construct a random forest. As the path from a root node to a leaf node in a decision tree is interpreted as a regression rule, a random forests is equivalently represented as a collection of regression rules. Because each sample traverses each tree from root node to one and only one leaf node, we define a feature vector to capture the leaf node structure of a RF. For sample x i , the corresponding feature vector that encodes the leaf node assignment is defined as , . . . , X q ] T where q is the total number of leaf nodes in the forest, where a j is the target value at leaf node j. We call the space of X i ' s the rule space. Each dimension of the rule space is defined by one regression rule. The above mapping is an extension of binary mapping applied in [7,8] to the regression case.
In rule space, we consider the following form where weight vector w and scalar b define linear regression function for the sample. The weights in (2) measure the importance of rules: the magnitude of a weight indicates the importance of the rule. Clearly, a rule can be removed safely if its weight is 0. Rule elimination is therefore formulated as a problem of learning the weight vectors.
We use the technique described in previous section, consider the following learning problem using 1-norm regularization: The solution to the above optimization problem is usually sparse, controlled by regularization parameter l. l is chosen by cross validation on the training set. Rules with zero weights w can be removed. Figure 2 illustrate the process shown in this section.

Combined rule extraction and feature elimination
It is assumed that only important features are kept in the remaining rule. Features that do not appear in the rules extracted using (3) are removed because they have no or little effect on the regression problem. In this way, we can select rules and features together.
It is possible to further select rules from a RF built on the selected features to get a more compact set of rules. This motivates an iterative approach. Features selected in the previous iteration are used for constructing a new RF. A new set of rules is then extracted from the new RF. This process continues until the selected features do not change. Figure 3 illustrate the overall workflow of the algorithm. Figure 2 Mapping samples to rule space and eliminate rules. Random forests model is generated from training samples. Viewing a path from root node and a leaf node of a decision tree as a decision rule, we can encode data into rule space. Some of rules can be eliminated without affecting the prediction performance.

Evaluation of results
R squared [9] statistics measures the goodness of fit of the models to the data. It is used to describe how well the predictions fit on test data. Let y ′ denote prediction values from the algorithm, y denote mean value of target variable, the formulation of R squared is: where SS err = (y i − y ) 2 , SS tot = (y i −ȳ) 2 , i = 1, . . . , n, n is number of test samples. An R squared value closer to one indicates better performance. A simple evaluation on the quality of a regression algorithm is the standard deviation of R squared based on multiple runs. For the interpretability, a small set of rules and concise rules are naturally easier for human to interpret.
In general, random forests classification is more robust against noise compared with many other methods [10]. There is a limited research, however, on whether random forests regression based methods are also robust. One straightforward method is to introduce some noise into the data and then compare the difference between R squared with and without noise. The smaller the difference is, the more robust the algorithm is to noise.

Datasets
In this section, we first describe the data sets used. We then present detailed results and discussion.
We first test our method on an artificial data set. The data is illustrated in Figure 4. The target values are 1 through 6 corresponding to different shapes and colors. For each target value, 100 samples are generated according to different mean values with standard deviation of 0.5. The relationship between target variable and input variable X1 and X2 is nonlinear.
We then applied our method to several data sets from real applications summarized in Table 1. The first data set is Stockori flowering time data set [11]. The flowering time of 697 plants were collected. The prediction of flowering time is based on 149 genotypes of the plants.
The Parkinson's Telemonitoring data set [12] contains biomedical voice measurement from 42 people with early-stage Parkinson's disease. There are 5875 total voice recordings. The goal is to predict total Uni-fied Parkinson's Disease Rating Scale (UPDRS) scores from the voice measures and other features of patients. Breast Cancer Wisconsin (Prognostic) data set [13] is constructed using a digitized image of a fine needle aspirate (FNA) of a breast mass from breast cancer patients. Characteristic features are computed from the images. The prediction is the recurrence time or disease-free time after treatment. The Relative location of computed tomography (CT) slices on axial axis data set [14] consists of 384 features extracted from CT images. These features are derived from two histograms in polar space. The response variable is relative location of an image on the axial axis ranging from 0 to 180 where 0 denotes the top of the head and 180 the soles of the feet. We randomly choose 2140 CT images for the analysis. The above three data sets are retrieved from University of California, Irvine (UCI) repository [15].
The Seacoast data set is a collection of sensor readings about different biochemical concentrations under various humidity and temperatures. Concentrations of the biochemical can be inferred from sensor responses using our approach. The data set is pre-processed by normalizing raw sensor responses, calibrating sensor data according to standard no biochemical input conditions and according to the time delay in the sensor response, if available. Humidity levels and temperatures are also factored out first by using a regression based approach. This results in sensor responses being in the same scale. 2250 times points are sampled and used.
The TCGA Glioblastoma multiforme (GBM) data is downloaded from The Cancel Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov/tcga/). 548 gene expression profiles were retrieved from the Broad Institute HT HG-U133A platform (Affymetrix, Santa Clara, CA, USA). Each gene expression profile consists of normalized expression data of 12042 genes. The survival information of patients is retrieved from TCGA clinical data. After removing gene expression samples with unknown survival information, 427 samples were used in our analysis.

Results on artificial data set
Random forests regression generates 11700 rules with R 2 of 0.87. Our method gets 7 rules with R 2 of 0.66. There is not too much loss in the prediction performance. The predicted rules are as follows: 1 IF x 2 ≤ 2.04 and x 1 > 3.84 THEN y = 5 2 IF x 2 ≤ 1.21 and x 1 > 3.75 THEN y = 5 3 IF x 2 ≤ 4.17 and x 2 > 3.91 THEN y = 6 4 IF x 1 ≤ 3.68 and x 2 ≤ 1.93 and x 1 > 2.09 THEN y = 4 5 IF x 2 > 4.21 and x 1 ≤ 2.62 THEN y = 6 6 IF x 2 ≤ 2.33 and x 2 > 0.93 and x 1 > 3.66 THEN y = 5 7 IF x 2 > 3.88 and x 1 < 2.72 THEN y = 6.
They are also illustrated in Figure 4. Numbers in text boxes are prediction values of target variable. Lines generated from rules partition the original space. Many of these rules align well with the partition. Noted that multiple run  of our approach generates different sets of rules. The number of extracted rules also changes. The partitions in those rules align well with the partition also.

Results on different data sets
The following tables present the result of our proposed methods on different data sets. Results are from test data. From Table 2, we can see that in all data sets, the number of rules is reduced significantly comparing to random forests yielding less than 1% of the original number of rules in the forest. At the same time, the performance measured by R 2 does not change too much. In most data sets, except Parkinson's Telemonitoring data set, RF gives the best performance. Support vector regression is the least competitive in the cases we tested. Our approach stands somewhere in the middle. Note that on Stockori flowering time data set, the target variable, flowering time, is ordered. Here we simply treat it as numbers. The performance is comparable with RF. In Breast Cancer Wisconsin (Prognostic) data set, the predictive performance is low indicating it is a hard problem. Our approach does not work well on this data set either. It may be resulted from over pruning the rules.
The standard deviation on the R 2 , number of rules selected, number of features selected demonstrates that the methods are stable on most of these data sets. The standard deviation of R 2 is obtained from the average of R 2 over ten runs.  where v 1 is mean radius, v 2 is mean texture, v 3 is mean perimeter, v 4 is mean area, v 5 is mean smoothness, v 9 is mean symmetry, v 11 radius standard error (SE), v 12 is texture SE, v 14 is area SE, v 16 is compactness SE, v 17 is concavity SE, v 19 is symmetry SE, v 20 is fractal dimension SE, v 21 is worst radius, v 22 is worst texture, v 23 is worst perimeter, v 25 is worst smoothness, v 29 is worst symmetry, v 30 is worst fractal dimension, and v 31 is tumor size. Among these rules, size, shape, and texture features occur more often than other features indicating these features are more important than other features in deciding breast cancer. This result is similar to conclusion made in [16] and [17].
To illustrate how prediction values matched true values, we use an approach similar to [18], which was used for clustering analysis. Here we partition the target values in different intervals, and then count how many samples fall into the same interval for both prediction and true values. The resulting confusion matrix can be visualized to get an idea how they match. Figure 5 shows that most of the sample matches are in the diagonal of the matrix which indicate correct match.
Top ten rules are extracted from Seacoast data based on absolute value of weight of rules. Figure 6 shows the number of occurrence of sensors in top ten rules. The sensors are numbered from 1 to 16 accordingly: C0 = 0 MSS556 cp29I Ethyl Cellulose, C1 = 1 MSS556 Ethyl Cellulose, C2  Figure 6, two sensors, C5 = 5 MSS556 PECH and C5 = 5 MSS557 PECH, used more often, suggesting it is more important or effective in determining chemical concentration.
For TCGA Glioblastoma multiforme data set, we can see an interesting result that the number of genes is reduced during iteration, while the number of remaining rules is almost constant after the first iteration. The prediction performance are not good for any of the three algorithms indicating it is a harder problem, and current gene expression profiles may not provide the necessary information for the survival prediction. See Figure 7.

Results with noisy data
To illustrate the robustness of our approach on noisy data, we randomly add some Gaussian noise in the Stockori flowering time data set with probability 0.3 for each  feature in a sample. Storckori flowering time was randomly sampled for training and testing sets. The sets were used for the experiments. 10 runs were done for each method. The mean of Gaussian noise is 0.5, while its standard deviation is 0.2. From Table 3, we can see that support vector regression has the highest p value on paired t test on difference in mean R 2 s of SVR on data without noise and data with noise, it was not affected too much by Gaussian noise. But its R 2 is still the lowest among all three methods. The p value shows that there is no statistical significance between results with noise and without noise. Our approach has similar values of R 2 compared with those of random forests. Increasing the probability of noise from 0.3 to 1, both random forests and the proposed approach are affected by the increased noise level.

Conclusion
We propose to use an ensemble of decision rules generated from random forests and 1-norm regularization to balance prediction performance and interpretability of regression problems. The method selects a small number of rules (using a small number of features) while retaining performance comparable to RF, better than SVR in most cases.
Due to decision trees' ability handling mixed data type, our approach is able to handles data with mixed type.
We also study robustness of our approach in the presence of noise. The prediction performance is still comparable with random forests in terms of performance within small amount of Gaussian noise.
Regression problems are generally harder than classification problems both in terms of prediction performance and interpretability [8]. Therefore, care should be taken when interpreting the results.