Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data

Table 8 Model selection and evaluation metrics (overall and per class) for the top 5 models out of 36 possible instantiations of the pipeline on the LC data set

Rank	FS	Sampling	Classifier	CV F1 (mean ± std)	CV Precision (mean ± std)	CV Recall (mean ± std)	Train F1	Test F1	Test Precision	Test Recall	Test F1 (0)	Test Precision (0)	Test Recall (0)	Test F1 (1)	Test Precision (1)	Test Recall (1)	Model parameters
1	RFE-LR	Up-sampling	RF	0.72 ± 0.054	0.686 ± 0.102	0.79 ± 0.039	1	0.722	0.778	0.729	0.871	0.964	0.794	0.2	0.125	0.5	n_estimators = 30
2	RLR-L1	SMOTE-sampling	KNN	0.712 ± 0.087	0.68 ± 0.122	0.762 ± 0.066	0.777	0.741	0.806	0.844	0.889	1	0.8	0.222	0.125	1	n_neighbors = 5, C = 100
3	ANOVA	No sampling	RF	0.698 ± 0.077	0.651 ± 0.12	0.776 ± 0.061	1	0.652	0.722	0.595	0.839	0.929	0.765	0	0	0	n_estimators = 30
4	RFE-LR	SMOTE-sampling	RF	0.689 ± 0.077	0.648 ± 0.119	0.761 ± 0.071	1	0.681	0.778	0.605	0.875	1	0.778	0	0	0	n_estimators = 30
5	ANOVA	No sampling	Linear SVM	0.687 ± 0.113	0.687 ± 0.136	0.707 ± 0.112	1	0.811	0.833	0.823	0.9	0.964	0.844	0.5	0.375	0.75	C = 0.1

Models are ordered by CV F1. FS stands for feature selection and CV for cross-validation. F1 is the model evaluation measure, defined as 2 × Precision × Recall / (Precision + Recall). Precision is the proportion of examples classified as positive that are truly positive, and Recall is the proportion of truly positive examples that are classified as positive. Std stands for standard deviation. Train indicates that the metric was computed on the training set, and Test that it was computed on the test set. (0) marks an evaluation metric for class 0 and (1) an evaluation metric for class 1.
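
As a quick consistency check, the per-class F1 values in the table can be recomputed from the per-class precision and recall using the harmonic-mean form of F1, 2PR / (P + R). The sketch below is illustrative only: the helper name `f1_from_pr` and the convention of returning 0 when both precision and recall are 0 are assumptions, not something the article specifies, and the check covers only the per-class columns, not the overall Test F1.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are 0 (assumed convention, which avoids
    a division by zero for classes that are never predicted correctly).
    """
    denom = precision + recall
    if denom == 0:
        return 0.0
    return 2 * precision * recall / denom


# (precision, recall, reported F1) triples copied from the per-class
# test columns of Table 8.
per_class_rows = {
    "model 1, class 0": (0.964, 0.794, 0.871),
    "model 1, class 1": (0.125, 0.5, 0.2),
    "model 2, class 1": (0.125, 1.0, 0.222),
    "model 3, class 1": (0.0, 0.0, 0.0),
    "model 5, class 1": (0.375, 0.75, 0.5),
}

for label, (p, r, reported) in per_class_rows.items():
    computed = f1_from_pr(p, r)
    print(f"{label}: computed {computed:.3f}, reported {reported}")
```

Each recomputed value agrees with the reported one to three decimal places, which is what one would expect if the per-class F1 columns were derived from the per-class precision and recall in this way.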

ISSN: 1752-0509