# Learning accurate and interpretable models based on regularized random forests regression

Sheng Liu^{1}, Shamitha Dissanayake^{2}, Sanjay Patel^{3}, Xin Dang^{4}, Todd Mlsna^{2}, Yixin Chen^{1} (corresponding author) and Dawn Wilkins^{1}

*BMC Systems Biology* 2014, **8(Suppl 3)**:S5

https://doi.org/10.1186/1752-0509-8-S3-S5

© Liu et al.; licensee BioMed Central Ltd. 2014

**Published:** 22 October 2014

## Abstract

### Background

Many biological studies combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. An effective algorithm that simultaneously extracts decision rules and selects critical features, giving good interpretability while preserving prediction performance, would therefore be beneficial.

### Methods

In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed by linear regression are relatively easy to interpret. However, many practical biological applications are nonlinear in essence, and a direct linear relationship between input and output can hardly be found. Nonlinear regression techniques can reveal nonlinear relationships in data, but are generally hard for humans to interpret. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features.

### Results

We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression.

### Conclusion

The proposed approach demonstrates high potential in aiding the prediction and interpretation of nonlinear relationships in the subject being studied.

## Background

Decision trees use a tree structure to represent a partition of the input space. Each path from the root node to a leaf node of a decision tree can be read as a decision rule. Decision rule based algorithms are well known for shedding light on the decision process in addition to making a prediction. Another factor affecting the interpretation of a model generated from data is feature selection. In general, the fewer features a model involves, the less complex and more interpretable it is. There is a rich body of prior work on rule-based learning and feature selection in bioinformatics and statistical learning. It is beyond the scope of this article to supply a complete survey of the respective areas. Below we review some of the main findings most closely related to this article.

### Our contribution

In many biological problems, building a good predictive model that explains the problem well is the ultimate goal of modelling.

High performance and concise representation (i.e., a small rule set and a small feature set) are two important requirements of rule learning methods. Regression tree based methods usually generate a small set of rules. However, their performance is relatively low compared with support vector regression (SVR) and random forests (RF). An RF generally has high performance, but generates a large number of rules, which makes the model difficult to interpret. In this article, we take an iterative approach to regularize random forests to obtain refined rules without compromising the performance. An RF is an ensemble of regression trees and covers more candidate rules than a single decision tree. Regularization keeps only a small number of rules that are the most discriminative. We take an embedded approach with a greedy backward elimination strategy for feature elimination.

We combine rule extraction and feature elimination iteratively. The result of rule extraction is used for feature elimination. The selected features are then fed into an RF, and a 1-norm regularization step extracts the important rules. This alternating process continues until the selected subset of features does not change. Only a few rule learning algorithms are geared toward regression problems as opposed to classification problems. Having previously applied this iterative approach to classification, we apply it here to another category of learning algorithm, regression rule learning, extending its domain of usage.

1. Accuracy: *R*^{2}.
2. Variance of accuracy.
3. Interpretability: number of rules.
4. Interpretability: number of variables used in a rule.
5. Robustness to noise.

## Methods

In this section, we describe the proposed method. First, we present an approach to find the "right" trade-off between prediction performance and model complexity using regularization. We then describe our approach by showing a mapping of the forest generated by RF to a rule space, where many of the rules are removed by 1-norm regularization. Finally, we present several metrics for evaluating accuracy and interpretability.

### Balancing accuracy and model complexity with regularization

Regression learns a function *f* from input *x* to output *y* such that a loss *L*(*x, y, f*) is minimized. The loss function usually takes the form of an error penalty, for example, the squared error:

$$L(x, y, f) = {\left(y - f(x)\right)}^{2}.$$

A model fitted by minimizing the error alone can be overly complex. A penalty on the model parameters controls the complexity, for example *L*_{1} regularization:

$$\min_{w} \; \sum_{i} L(x_i, y_i, f) + \lambda {\Vert w \Vert}_{1}$$

where *w* is the parameter vector of the model and *λ* is a tuning parameter that balances accuracy and complexity. It can generate a relatively less complex model than the previous one. In this article, we use 1-norm regularization. Because 1-norm regularization yields sparse solutions, the resulting model is much simplified. 1-norm regularization has been widely applied in statistics and machine learning, e.g., [4, 5], and [6]. The above optimization can be solved by a linear programming (LP) solver.
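To illustrate why a 1-norm penalty drives some weights to exactly zero, the sketch below minimizes a squared-error loss plus *λ* times the 1-norm of the weights by coordinate descent. This is not the linear programming formulation used in the article; the solver, the function names `soft_threshold` and `lasso_cd`, and the toy data are assumptions chosen to keep the example short.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * sum((y - Xw)^2) + lam * sum(|w|).

    X is a list of rows and y a list of targets. No intercept, for brevity.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_wo_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return w


# Toy data (assumed): y = 2 * x1 + 0.5 * x2, where x2 has a small scale.
X = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, -0.2]]
y = [2.1, 3.95, 6.15, 7.9]

w_small = lasso_cd(X, y, lam=0.01)  # both weights stay non-zero
w_large = lasso_cd(X, y, lam=1.0)   # the weak second weight is zeroed out
```

With the larger *λ*, the weakly contributing second weight is set exactly to zero while the first is only slightly shrunk, which is the behaviour the method relies on to discard rules.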

### Rule elimination using 1-norm regularization from random forests mapped rules

For a sample $x_i$, the corresponding feature vector that encodes the leaf node assignment is defined as $\mathbf{X}_i = {[X_1, \dots, X_q]}^{T}$, where *q* is the total number of leaf nodes in the forest and

$$X_j = \begin{cases} a_j & \text{if } x_i \text{ is assigned to leaf node } j, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where *a*_{j} is the target value at leaf node *j*. We call the space of the **X**_{i}'s the rule space. Each dimension of the rule space is defined by one regression rule. The above mapping is an extension of the binary mapping applied in [7, 8] to the regression case.
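The mapping to the rule space can be sketched in a few lines. The example assumes that each leaf is stored as a list of threshold tests plus its target value, and that a sample receives that target value for the leaf it reaches and 0 for every other leaf. The rule representation, the function names, and the two tiny trees are hypothetical.

```python
def satisfies(sample, conditions):
    """Return True if the sample meets every (feature, op, threshold) test."""
    for feature, op, threshold in conditions:
        value = sample[feature]
        if op == "<=" and not value <= threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True


def to_rule_space(sample, rules):
    """Map a sample to a vector with one entry per rule (leaf node).

    Entry j is the leaf's target value a_j if the sample reaches leaf j,
    and 0 otherwise.
    """
    return [a_j if satisfies(sample, conds) else 0.0 for conds, a_j in rules]


# Hypothetical forest of two one-split trees, flattened to four rules.
rules = [
    ([("x1", "<=", 2.0)], 4.0),  # tree 1, left leaf
    ([("x1", ">", 2.0)], 6.0),   # tree 1, right leaf
    ([("x2", "<=", 1.5)], 5.0),  # tree 2, left leaf
    ([("x2", ">", 1.5)], 7.0),   # tree 2, right leaf
]

vector = to_rule_space({"x1": 3.0, "x2": 1.0}, rules)
# vector is [0.0, 6.0, 5.0, 0.0]: exactly one active leaf per tree
```

Because a sample reaches exactly one leaf in each tree, the vector has as many non-zero entries as there are trees, as long as no leaf value is itself zero.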

A linear regression function in the rule space is then

$$f(\mathbf{X}_i) = {\mathbf{w}}^{T}\mathbf{X}_i + b \qquad (2)$$

where the weight vector **w** and scalar *b* define the linear regression function for the sample. The weights in (2) measure the importance of rules: the magnitude of a weight indicates the importance of the corresponding rule. Clearly, a rule can be removed safely if its weight is 0. Rule elimination is therefore formulated as a problem of learning the weight vector.

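The weights are learned by the 1-norm regularized optimization (3), in which the tuning parameter *λ* is chosen by cross validation on the training set. Rules with zero weights in **w** can be removed. Figure 2 illustrates the process described in this section.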
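As a small illustration of why zero-weight rules can be dropped safely, the sketch below evaluates a linear function in the rule space before and after pruning. The rule names, weights, bias, and rule-space vector are made-up values, not results from the article.

```python
def predict(x_rule, weights, bias):
    """Linear regression function in rule space: f(X) = w . X + b."""
    return sum(w * x for w, x in zip(weights, x_rule)) + bias


def prune_rules(rules, weights, tol=0.0):
    """Keep only the rules whose weight magnitude exceeds tol."""
    kept = [(r, w) for r, w in zip(rules, weights) if abs(w) > tol]
    return [r for r, _ in kept], [w for _, w in kept]


# Hypothetical rules, learned weights, and bias, for illustration only.
rules = ["rule_a", "rule_b", "rule_c", "rule_d"]
weights = [0.0, 0.8, 0.0, 0.3]
bias = 0.5
x_rule = [0.0, 6.0, 5.0, 0.0]  # rule-space vector of one sample

kept_rules, kept_weights = prune_rules(rules, weights)
kept_x = [x for x, w in zip(x_rule, weights) if abs(w) > 0.0]

full = predict(x_rule, weights, bias)         # uses all four rules
pruned = predict(kept_x, kept_weights, bias)  # uses the two surviving rules
```

Both calls return the same value, because a rule with zero weight contributes nothing to the sum regardless of whether the sample satisfies it.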

### Combined rule extraction and feature elimination

It is assumed that only important features are kept in the remaining rules. Features that do not appear in the rules extracted using (3) are removed because they have little or no effect on the regression problem. In this way, we can select rules and features together.

It is possible to further select rules from an RF built on the selected features to get a more compact set of rules. This motivates an iterative approach. Features selected in the previous iteration are used for constructing a new RF. A new set of rules is then extracted from the new RF. This process continues until the selected features do not change.
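The alternating procedure can be outlined as a fixed-point loop. In the sketch below, `extract_rules` stands for "train an RF on the current features and keep the rules that survive 1-norm regularization"; it is supplied by the caller, and `toy_extract_rules` is only a stand-in so the loop can run. All names and the iteration cap are assumptions.

```python
def features_in_rules(rules):
    """Collect the feature names that appear in at least one rule."""
    return {feature for conditions in rules for feature, _, _ in conditions}


def iterate_selection(all_features, extract_rules, max_iter=50):
    """Alternate rule extraction and feature elimination until stable.

    extract_rules(features) must return the surviving rules, each given as
    a list of (feature, op, threshold) conditions.
    """
    features = set(all_features)
    rules = []
    for _ in range(max_iter):
        rules = extract_rules(features)
        selected = features_in_rules(rules)
        if selected == features:  # feature subset unchanged: stop
            break
        features = selected
    return features, rules


def toy_extract_rules(features):
    """Stand-in for forest training plus regularization (not a real model).

    Keeps one rule per feature except the alphabetically last one, and
    stops shrinking once two features remain.
    """
    ordered = sorted(features)
    kept = ordered if len(ordered) <= 2 else ordered[:-1]
    return [[(name, "<=", 0.0)] for name in kept]


final_features, final_rules = iterate_selection(
    ["f1", "f2", "f3", "f4"], toy_extract_rules
)
```

The loop stops as soon as the features used by the extracted rules match the features the forest was built on, which mirrors the stopping condition described above.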

### Evaluation of results

Let $y_i$ denote the true values of the target variable, ${y}_{i}^{\prime}$ denote the prediction values from the algorithm, and $\overline{y}$ denote the mean value of the target variable. The formulation of R squared is:

$$R^{2} = 1 - \frac{SS_{err}}{SS_{tot}}$$

where $SS_{err}=\sum_{i} {\left({y}_{i}-{y}_{i}^{\prime}\right)}^{2}$, $SS_{tot}=\sum_{i} {\left({y}_{i}-\overline{y}\right)}^{2}$, $i=1,\dots,n$, and $n$ is the number of test samples. An R squared value closer to one indicates better performance. A simple measure of the stability of a regression algorithm is the standard deviation of R squared over multiple runs. For interpretability, a small set of concise rules is naturally easier for humans to interpret.
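A direct computation of R squared from its definition is shown below as a minimal sketch; the function name and the toy values are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_err / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_err = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_err / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # 1.0
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # 0.0
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # -3.0
```

The value is 0 for a model that always predicts the mean and becomes negative for a model that does worse than that, so R squared measured on test data is not bounded below by zero.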

In general, random forests classification is more robust against noise than many other methods [10]. There is limited research, however, on whether random forests regression based methods are also robust. One straightforward method is to introduce some noise into the data and then compare the difference between R squared with and without noise. The smaller the difference, the more robust the algorithm is to noise.
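The robustness check can be sketched as follows. The example assumes that Gaussian noise is added to each input value independently with a given probability, and it uses a fixed function in place of a trained regressor; the noise level, the seed, and all names are assumptions made for illustration.

```python
import random


def add_gaussian_noise(values, prob, sigma, rng):
    """With probability prob, add N(0, sigma^2) noise to each value."""
    return [
        v + rng.gauss(0.0, sigma) if rng.random() < prob else v
        for v in values
    ]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_err / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_err = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_err / ss_tot


def model(v):
    """Stand-in for a trained regressor (assumed, not from the article)."""
    return 2.0 * v


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [float(i) for i in range(1, 21)]
y = [2.0 * v for v in x]

x_noisy = add_gaussian_noise(x, prob=0.3, sigma=1.0, rng=rng)
r2_clean = r_squared(y, [model(v) for v in x])
r2_noisy = r_squared(y, [model(v) for v in x_noisy])
gap = r2_clean - r2_noisy  # smaller gap means more robust
```

In an actual experiment the model would be retrained or re-evaluated for each method, and the gap would be averaged over several runs.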

## Results and discussion

### Datasets

In this section, we first describe the data sets used. We then present detailed results and discussion.

Some statistics of data sets.

Data | Number of Samples | Number of Features |
---|---|---|
Stockori Flowering Time | 697 | 149 |
Parkinson's Telemonitoring | 5875 | 19 |
Breast Cancer Wisconsin (Prognostic) | 198 | 32 |
Relative location of computed tomography (CT) slices on axial axis | 2140 | 384 |
Seacoast | 2250 | 16 |
TCGA Glioblastoma multiforme | 427 | 12042 |

The Parkinson's Telemonitoring data set [12] contains biomedical voice measurements from 42 people with early-stage Parkinson's disease. There are 5875 voice recordings in total. The goal is to predict total Unified Parkinson's Disease Rating Scale (UPDRS) scores from the voice measures and other features of patients. The Breast Cancer Wisconsin (Prognostic) data set [13] is constructed using digitized images of a fine needle aspirate (FNA) of a breast mass from breast cancer patients. Characteristic features are computed from the images. The prediction target is the recurrence time or disease-free time after treatment. The Relative location of computed tomography (CT) slices on axial axis data set [14] consists of 384 features extracted from CT images. These features are derived from two histograms in polar space. The response variable is the relative location of an image on the axial axis, ranging from 0 to 180, where 0 denotes the top of the head and 180 the soles of the feet. We randomly choose 2140 CT images for the analysis. The above three data sets are retrieved from the University of California, Irvine (UCI) repository [15].

The Seacoast data set is a collection of sensor readings of different biochemical concentrations under various humidity levels and temperatures. Concentrations of the biochemical can be inferred from sensor responses using our approach. The data set is pre-processed by normalizing raw sensor responses and calibrating sensor data according to standard no-biochemical input conditions and according to the time delay in the sensor response, if available. Humidity levels and temperatures are also factored out first by using a regression based approach. This puts the sensor responses on the same scale. 2250 time points are sampled and used.

The TCGA Glioblastoma multiforme (GBM) data is downloaded from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov/tcga/). 548 gene expression profiles were retrieved from the Broad Institute HT HG-U133A platform (Affymetrix, Santa Clara, CA, USA). Each gene expression profile consists of normalized expression data of 12042 genes. The survival information of patients is retrieved from TCGA clinical data. After removing gene expression samples with unknown survival information, 427 samples were used in our analysis.

### Results on artificial data set

The full random forest reaches an *R*^{2} of 0.87. Our method gets 7 rules with an *R*^{2} of 0.66. There is not too much loss in the prediction performance. The extracted rules are as follows:

1. **IF** *x*_{2} ≤ 2.04 and *x*_{1} > 3.84 **THEN** *y* = 5
2. **IF** *x*_{2} ≤ 1.21 and *x*_{1} > 3.75 **THEN** *y* = 5
3. **IF** *x*_{2} ≤ 4.17 and *x*_{2} > 3.91 **THEN** *y* = 6
4. **IF** *x*_{1} ≤ 3.68 and *x*_{2} ≤ 1.93 and *x*_{1} > 2.09 **THEN** *y* = 4
5. **IF** *x*_{2} > 4.21 and *x*_{1} ≤ 2.62 **THEN** *y* = 6
6. **IF** *x*_{2} ≤ 2.33 and *x*_{2} > 0.93 and *x*_{1} > 3.66 **THEN** *y* = 5
7. **IF** *x*_{2} > 3.88 and *x*_{1} < 2.72 **THEN** *y* = 6.

They are also illustrated in Figure 4. Numbers in text boxes are the predicted values of the target variable. Lines generated from the rules partition the original space. Many of these rules align well with the partition. Note that multiple runs of our approach generate different sets of rules, and the number of extracted rules also changes. The partitions in those rules also align well with the partition.
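The seven rules listed above can be encoded directly to check which of them a given point satisfies. The sketch below only reports the firing rules and their values; the final prediction in the method also depends on the learned weights and bias, which the text does not list. The data structure, the function name, and the query point are illustrative.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt, "<": operator.lt}

# The seven extracted rules: (conditions, predicted y).
RULES = [
    ([("x2", "<=", 2.04), ("x1", ">", 3.84)], 5),
    ([("x2", "<=", 1.21), ("x1", ">", 3.75)], 5),
    ([("x2", "<=", 4.17), ("x2", ">", 3.91)], 6),
    ([("x1", "<=", 3.68), ("x2", "<=", 1.93), ("x1", ">", 2.09)], 4),
    ([("x2", ">", 4.21), ("x1", "<=", 2.62)], 6),
    ([("x2", "<=", 2.33), ("x2", ">", 0.93), ("x1", ">", 3.66)], 5),
    ([("x2", ">", 3.88), ("x1", "<", 2.72)], 6),
]


def firing_rules(point):
    """Return (rule number, rule value) for every rule the point satisfies."""
    fired = []
    for number, (conditions, value) in enumerate(RULES, start=1):
        if all(OPS[op](point[name], thr) for name, op, thr in conditions):
            fired.append((number, value))
    return fired


fired = firing_rules({"x1": 4.0, "x2": 1.0})
# fired is [(1, 5), (2, 5), (6, 5)]
```

For the point (4.0, 1.0), rules 1, 2, and 6 fire and all three agree on the value 5, which shows how overlapping rules from different trees can describe the same region of the input space.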

### Results on different data sets

The following tables present the results of our proposed method on different data sets. Results are from test data.

*R*^{2} does not change too much. In most data sets, except the Parkinson's Telemonitoring data set, RF gives the best performance. Support vector regression is the least competitive in the cases we tested. Our approach stands somewhere in the middle. Note that on the Stockori flowering time data set, the target variable, flowering time, is ordered. Here we simply treat it as a number. The performance is comparable with RF. In the Breast Cancer Wisconsin (Prognostic) data set, the predictive performance is low, indicating it is a hard problem. Our approach does not work well on this data set either. This may result from over-pruning the rules.

Results on different data sets. Numbers after ± are standard deviation. SVR is support vector regression.

Data set | Measure | Random Forests | Our Approach | SVR |
---|---|---|---|---|
Stockori Flowering Time | *R*^{2} | 0.54 | 0.45 | 0.28 |
Stockori Flowering Time | Number of Rules Selected | 66020 | 348 | NA |
Stockori Flowering Time | Number of Features Used in a Rule | 8.8 | 7.5 | NA |
Stockori Flowering Time | Number of Features Selected | 149 | 135 | 149 |
Parkinson's Telemonitoring | *R*^{2} | 0.15 | 0.06 | 0.17 |
Parkinson's Telemonitoring | Number of Rules Selected | 644789 | 3796 | NA |
Parkinson's Telemonitoring | Number of Features Used in a Rule | 9.72 | 7.4 | NA |
Parkinson's Telemonitoring | Number of Features Selected | 19 | 19 | 19 |
Breast Cancer Wisconsin (Prognostic) | *R*^{2} | 0.04 | -0.19 | -0.04 |
Breast Cancer Wisconsin (Prognostic) | Number of Rules Selected | 43907 | 126 | NA |
Breast Cancer Wisconsin (Prognostic) | Number of Features Used in a Rule | 7 | 3 | NA |
Breast Cancer Wisconsin (Prognostic) | Number of Features Selected | 32 | 31 | 32 |
Relative location of CT slices on axial axis | *R*^{2} | 0.92 | 0.77 | 0.26 |
Relative location of CT slices on axial axis | Number of Rules Selected | 172984 | 901 | NA |
Relative location of CT slices on axial axis | Number of Features Used in a Rule | 12 | 8 | NA |
Relative location of CT slices on axial axis | Number of Features Selected | 384 | 20 | 384 |
Seacoast | *R*^{2} | 0.64 | 0.59 | -0.19 |
Seacoast | Number of Rules Selected | 120771 | 385 ± 5 | NA |
Seacoast | Number of Features Used in a Rule | 14 | 6 | NA |
Seacoast | Number of Features Selected | 16 | 16 | 16 |
TCGA Glioblastoma multiforme | *R*^{2} | 0.04 | -1.94 | -0.09 |
TCGA Glioblastoma multiforme | Number of Rules Selected | 53539 | 279 | NA |
TCGA Glioblastoma multiforme | Number of Features Used in a Rule | 3 | 2 | NA |
TCGA Glioblastoma multiforme | Number of Features Selected | 12042 | 2 | 12042 |

The standard deviations of *R*^{2}, the number of rules selected, and the number of features selected demonstrate that the methods are stable on most of these data sets. The mean and standard deviation of *R*^{2} are computed over ten runs.

1. **IF** *v*_{22} > 30.27 and *v*_{1} ≤ 17.23 and *v*_{5} > 0.09 and *v*_{11} > 0.24 and *v*_{25} ≤ 0.16 and *v*_{14} > 28.38 and *v*_{20} > 0.00 and *v*_{20} ≤ 0.01 **THEN** *y* = 64.5
2. **IF** *v*_{12} ≤ 1.17 and *v*_{9} ≤ 0.18 and *v*_{3} > 88.13 and *v*_{16} > 0.02 and *v*_{21} > 23.37 **THEN** *y* = 57
3. **IF** *v*_{4} > 814.40 and *v*_{12} ≤ 0.70 **THEN** *y* = 101.33
4. **IF** *v*_{17} ≤ 0.05 and *v*_{30} ≤ 0.10 and *v*_{19} ≤ 0.01 and *v*_{31} > 0.70 and *v*_{16} ≤ 0.03 and *v*_{23} ≤ 130.75 **THEN** *y* = 69.25
5. **IF** *v*_{23} > 123.70 and *v*_{29} > 0.26 and *v*_{2} ≤ 18.43 and *v*_{17} > 0.03 **THEN** *y* = 109.2

where *v*_{1} is mean radius, *v*_{2} is mean texture, *v*_{3} is mean perimeter, *v*_{4} is mean area, *v*_{5} is mean smoothness, *v*_{9} is mean symmetry, *v*_{11} is radius standard error (SE), *v*_{12} is texture SE, *v*_{14} is area SE, *v*_{16} is compactness SE, *v*_{17} is concavity SE, *v*_{19} is symmetry SE, *v*_{20} is fractal dimension SE, *v*_{21} is worst radius, *v*_{22} is worst texture, *v*_{23} is worst perimeter, *v*_{25} is worst smoothness, *v*_{29} is worst symmetry, *v*_{30} is worst fractal dimension, and *v*_{31} is tumor size. Among these rules, size, shape, and texture features occur more often than other features, indicating that they are more important than other features for breast cancer prediction. This result is similar to the conclusions made in [16] and [17].

### Results with noisy data

Judging by the *p* value of a paired t test on the difference in mean *R*^{2} of SVR between data without noise and data with noise, SVR was not affected much by Gaussian noise, but its *R*^{2} is still the lowest among all three methods. The *p* value shows that there is no statistically significant difference between results with noise and without noise. Our approach has values of *R*^{2} similar to those of random forests. When the probability of noise increases from 0.3 to 1, both random forests and the proposed approach are affected by the increased noise level.
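For reference, the t statistic of a paired t test on per-run *R*^{2} values can be computed as in the sketch below. The *R*^{2} values are invented for illustration and are not results from the article; converting the statistic to a *p* value additionally requires the Student t distribution, which is not part of the Python standard library.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return the paired t statistic and degrees of freedom for a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical R^2 values from five paired runs (not from the article).
r2_clean = [0.30, 0.27, 0.29, 0.26, 0.28]
r2_noisy = [0.25, 0.26, 0.22, 0.25, 0.22]

t_stat, dof = paired_t_statistic(r2_clean, r2_noisy)
```

A large absolute value of the statistic relative to the t distribution with `dof` degrees of freedom would indicate that noise changes the mean *R*^{2} significantly.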

Result on Stockori flowering time data set with noise. Numbers after ± are standard deviation. SVR is support vector regression.

Measure | Random Forests | Our Approach | SVR |
---|---|---|---|
*R*^{2} | 0.43 | 0.36 | 0.24 |
D | 0.13 | 0.1 | 0.01 |

## Conclusion

We propose to use an ensemble of decision rules generated from random forests and 1-norm regularization to balance prediction performance and interpretability of regression problems. The method selects a small number of rules (using a small number of features) while retaining performance comparable to RF, better than SVR in most cases.

Because decision trees can handle mixed data types, our approach is also able to handle data with mixed types.

We also study the robustness of our approach in the presence of noise. With a small amount of Gaussian noise, its prediction performance is still comparable with that of random forests.

Regression problems are generally harder than classification problems both in terms of prediction performance and interpretability [8]. Therefore, care should be taken when interpreting the results.

## Declarations

### Acknowledgements

This work is supported in part by the US National Science Foundation under award numbers EPS-0903787 and EPS 1006883.

**Declarations**

The full funding for the publication fee came from Mississippi Experimental Program to Stimulate Competitive Research (EPSCoR).

This article has been published as part of *BMC Systems Biology* Volume 8 Supplement 3, 2014: IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Systems Biology Approaches to Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S3.

## References

1. Cortes C, Vapnik V: Support-Vector Networks. Machine Learning. 1995, 20 (3): 273-297. [http://dx.doi.org/10.1023/A:1022627411411]
2. McCulloch W, Pitts W: A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics. 1943, 5 (4): 115-133. 10.1007/BF02478259.
3. Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
4. Tibshirani R: Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B. 1996, 58: 267-288.
5. Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm Support Vector Machines. Advances in Neural Information Processing Systems 16: December 8-13, 2003; Vancouver and Whistler, British Columbia, Canada. Edited by: Thrun S, Saul LK, Schölkopf B. 2003, 49-56.
6. Zou H: An Improved 1-norm SVM for Simultaneous Classification and Variable Selection. Journal of Machine Learning Research - Proceedings Track. 2007, 2: 675-681.
7. Liu S, Chen Y, Wilkins D: Large Margin Classifiers and Random Forests for Integrated Biological Prediction on Mixed Type Data. Proceedings of the 7th Annual Biotechnology and Bioinformatics Symposium (BIOT): October 14-15, 2010, Lafayette, Louisiana, USA. 2010, 11-19.
8. Liu S, Patel RY, Daga PR, Liu H, Fu G, Doerksen RJ, Chen Y, Wilkins DE: Combined rule extraction and feature elimination in supervised classification. IEEE Transactions on NanoBioscience. 2012, 11 (3): 228-236.
9. Steel R, Torrie J: Principles and procedures of statistics, with special reference to the biological sciences. 1960, New York: McGraw-Hill.
10. Hamza M, Larocque D: An Empirical Comparison of Ensemble Methods Based on Classification Trees. Journal of Statistical Computation and Simulation. 2005, 75: 629-643. 10.1080/00949650410001729472.
11. Stockori: Stockori qtl dataset. [http://agbs.kyb.tuebingen.mpg.de/wikis/bg/stockori.zip]
12. Tsanas A, Little M, McSharry P, Ramig L: Accurate Telemonitoring of Parkinson's Disease Progression by Noninvasive Speech Tests. IEEE Transactions on Biomedical Engineering. 2010, 57 (4): 884-893.
13. Street WN, Mangasarian OL, Wolberg WH: An Inductive Learning Approach to Prognostic Prediction. Proceedings of the Twelfth International Conference on Machine Learning: July 9-12, 1995; Tahoe City, California, USA. Edited by: Prieditis A, Russell SJ. 1995, Burlington: Morgan Kaufmann, 522-530.
14. Graf F, Kriegel HP, Schubert M, Pölsterl S, Cavallaro A: 2D Image Registration in CT Images Using Radial Image Descriptors. Medical Image Computing and Computer-Assisted Intervention. Edited by: Fichtinger G, Martel A, Peters T. 2011, Berlin Heidelberg: Springer-Verlag, 607-614. Lecture Notes in Computer Science, vol. 6892.
15. Frank A, Asuncion A: 2010.
16. Wolberg WH, Nick Street W, Mangasarian OL: Importance of Nuclear Morphology in Breast Cancer Prognosis. Clinical Cancer Research. 1999, 5 (11): 3542-3548.
17. Narasimha A, Vasavi B, Harendra Kumar M: Significance of nuclear morphometry in benign and malignant breast aspirates. International Journal of Applied and Basic Medical Research. 2013, 3: 22-26. 10.4103/2229-516X.112237.
18. Monti S, Tamayo P, Mesirov J, Golub T: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning. 2003, 52 (1-2): 91-118.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.