Research on Feature Selection Methods based on Random Forest

Abstract: To deal with irrelevant or redundant features, this paper proposes eight feature selection methods. The first seven are CART and Random Forests (CART-RF), CHAID and Random Forests (CHAID-RF), SVM and Random Forests (SVM-RF), Bayesian Network and Random Forests (BN-RF), Neural Network and Random Forests (NN-RF), K-Means and Random Forests (K-Means-RF), and Kohonen and Random Forests (Kohonen-RF). These methods use CART, CHAID, SVM, BN, NN, K-Means and Kohonen to evaluate and rank feature importance, and then obtain feature subsets through the RF algorithm. The eighth method, the hybrid integration method with random forests (Integrate-RF), uses the average importance over the seven methods, and the optimal feature subset is selected based on the OOB classification error rate. Experimental results indicate that the proposed feature selection methods can effectively select features and reduce the data dimension.


INTRODUCTION
As the data dimension increases, there are more and more irrelevant and redundant features, so the learner cannot fully obtain the effective information from the data, which reduces its performance. Especially for small-sample, high-dimensional data, the "curse of dimensionality" and over-fitting occur easily. To deal with such problems, feature selection [1][2] is an effective method; it removes irrelevant and redundant features to reduce the number of features and finds the smallest feature subset that maximizes the learner's recognition performance. Feature selection [3] is very useful for dimension reduction of high-dimensional data, especially for modelling small-sample, high-dimensional data.
Feature selection [4] is a process of cyclically searching for the optimal feature subset; it mainly includes feature importance evaluation, feature subset generation, feature subset evaluation, search termination conditions, and result verification and confirmation. Among them, feature importance evaluation [5][6][7][8][9][10][11] is one of the most important steps in feature selection. Feature selection algorithms can be divided into supervised feature selection [12] and unsupervised feature selection [13], and can also be divided into four categories [14][15][16]: filter methods, wrapper methods, embedded methods and hybrid methods.
This paper proposes eight feature selection methods, the eighth of which uses a hybrid integration of different models and random forests (Integrate-RF). The models include CART [17], CHAID [18], SVM [19], Bayesian Networks (BN) [20], Neural Networks (NN) [21], K-Means [22] and Kohonen [23]. The Integrate-RF [24] method obtains a feature importance ordering by integrating these supervised and unsupervised feature evaluation methods, generates feature subsets by a forward search strategy, and then evaluates the subsets in terms of the minimum OOB error of the random forests, the number of features at the lowest OOB error, the average OOB error, and the variance of the OOB error rate.

METHOD

Feature-Importance-Evaluation Methods
A hybrid feature selection method includes feature importance evaluation, feature subset generation and feature subset evaluation. Feature importance evaluation is one of the most important steps in feature selection. The Integrate-RF method obtains a feature importance ordering through a hybrid integration of different models. CART, CHAID, SVM, Bayesian Networks (BN) and Neural Networks (NN) are supervised embedded feature selection algorithms; K-Means and Kohonen are unsupervised embedded feature selection algorithms. All of them are available for feature importance evaluation.

Supervised Methods
Feature importance can be determined by computing the reduction in variance of the target attributable to each feature, via a sensitivity analysis. This method of computing feature importance is used in the CART, CHAID, SVM, BN and NN models. Let Y be the target and Xj a feature, where j = 1, ..., k and k is the number of features, so that the model for Y is Y = f(X1, X2, ..., Xk). Features are ranked according to the sensitivity measure defined as follows:

Si = V(E(Y | Xi)) / V(Y),

where V(Y) is the unconditional output variance. In the numerator, the expectation operator E calls for an integral over X−i, that is, over all factors but Xi, and the variance operator V then implies a further integral over Xi. Feature importance is then computed as the normalized sensitivity.
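As an illustration only (not the exact routine used by the modelling software in this paper), the normalized sensitivity importance can be approximated for any fitted model with numeric predictions by binning each feature and measuring how much of the output variance the within-bin mean predictions explain. The sketch below assumes a fitted model `fit` with a `predict()` method and a data frame `X` of features; both names are placeholders.

```r
# Minimal sketch: crude estimate of Si = V(E(Y | Xi)) / V(Y) by binning feature i
# and taking the variance of the within-bin mean predictions; the scores are then
# normalized to sum to one. Assumes predict() returns numeric values.
sensitivity_importance <- function(fit, X, n_bins = 10) {
  y_hat <- predict(fit, newdata = X)                 # model output Y = f(X1, ..., Xk)
  v_y   <- var(y_hat)                                # unconditional output variance V(Y)
  s <- sapply(names(X), function(f) {
    bins  <- cut(rank(X[[f]], ties.method = "first"), breaks = n_bins)
    means <- tapply(y_hat, bins, mean)               # E(Y | Xi) within each bin
    var(means, na.rm = TRUE) / v_y                   # first-order sensitivity Si
  })
  s / sum(s)                                         # normalized importance
}
```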
Classification and regression tree (CART) splits the data into two subsets so that the samples within each subset are more homogeneous than in the previous subset. It is a recursive process; each of those two subsets is then split again, and the process is repeated until the homogeneity criterion is reached or some other stopping criterion is satisfied. The same predictor field may be used several times at different levels in the tree. It uses surrogate splitting to make the best use of data with missing values. CART is quite flexible. It allows unequal misclassification costs to be considered in the tree growing process. It also allows you to specify the prior probability distribution in a classification problem.
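For reference, a rough open-source analogue of this step (not the CART node actually used in the paper) is the rpart package in R, which also reports a variable importance vector for the grown tree:

```r
# Minimal sketch using rpart as a stand-in for CART: grow a classification tree on
# the built-in iris data and read off its variable importance scores.
library(rpart)
cart_fit <- rpart(Species ~ ., data = iris, method = "class")
cart_fit$variable.importance   # named vector; larger value = more important feature
```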
CHAID stands for Chi-squared Automatic Interaction Detector. It is a highly efficient statistical technique for segmentation, or tree growing. Using the significance of a statistical test as a criterion, CHAID evaluates all of the values of a potential predictor field. It merges values that are judged to be statistically homogeneous (similar) with respect to the target variable and maintains all other values that are heterogeneous (dissimilar). It then selects the best predictor to form the first branch in the decision tree, such that each child node is made of a group of homogeneous values of the selected field. This process continues recursively until the tree is fully grown. The statistical test used depends upon the measurement level of the target field. If the target field is continuous, an F test is used. If the target field is categorical, a chi-squared test is used. CHAID is not a binary tree method; that is, it can produce more than two categories at any particular level in the tree. Therefore, it tends to create a wider tree than the binary growing methods. It works for all types of variables, and it accepts both case weights and frequency variables. It handles missing values by treating them all as a single valid category.
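The category-merging decision described above rests on an ordinary chi-squared test of independence. A minimal illustration in R follows; `pred` and `target` are hypothetical variables standing in for two categories of a predictor and the target field, and this is only the merging test, not the full CHAID procedure.

```r
# Minimal sketch of the CHAID merging test: cross-tabulate a predictor's categories
# against the target. A large p-value means the categories look homogeneous with
# respect to the target and would be merged; a small p-value keeps them separate.
tab <- table(pred, target)     # assumed contingency table of predictor categories vs. target
chisq.test(tab)$p.value
```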
The Support Vector Machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labelled training data. The mapping function can be either a classification function or a regression function. For classification, nonlinear kernel functions are often used to transform the input data to a high-dimensional feature space in which the data become more separable than in the original input space. Maximum-margin hyperplanes are then created. The produced model depends on only a subset of the training data near the class boundaries. Similarly, the model produced by Support Vector Regression ignores any training data that is sufficiently close to the model prediction. (Support vectors can appear only on the error tube boundary or outside the tube.)
Bayesian Networks (BN) provide a succinct way of describing the joint probability distribution for a given set of random variables. Let V be a set of categorical random variables and G = (V, E) be a directed acyclic graph with nodes V and a set of directed edges E. A Bayesian network model consists of the graph G together with a conditional probability table for each node given the values of its parent nodes. Given the values of its parents, each node is assumed to be independent of all the nodes that are not its descendants. The joint probability distribution for the variables V can then be computed as a product of conditional probabilities for all nodes, given the values of each node's parents. Given a set of variables V and a corresponding sample dataset, we are presented with the task of fitting an appropriate Bayesian network model. The task of determining the appropriate edges in the graph G is called structure learning, while the task of estimating the conditional probability tables given the parents of each node is called parameter learning.
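Written compactly, with Pa(Xi) denoting the parents of node Xi in G, the factorization mentioned above is

P(X1, X2, ..., Xn) = P(X1 | Pa(X1)) × P(X2 | Pa(X2)) × ... × P(Xn | Pa(Xn)),

so the full joint distribution is recovered from the conditional probability table of each node given its parents.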
Neural networks (NN) predict a continuous or categorical target based on one or more predictors by finding unknown and possibly complex patterns in the data. The multilayer perceptron (MLP) is a feed-forward, supervised learning network with up to two hidden layers. The MLP network is a function of one or more predictors that minimizes the prediction error of one or more targets. Predictors and targets can be a mix of categorical and continuous fields.

Unsupervised Methods
This method uses the following models to compute the importance of predictors: K-Means and Kohonen. Here Y is the target, Xj is a predictor, Ω denotes the set of predictor and evaluation features, and sigi is the significance or p-value computed from the test described below. If sigi equals zero, set sigi = MinDouble, where MinDouble is the minimal double value. Across clusters, the p-value for a categorical feature is based on Pearson's chi-square, and the p-value for a continuous feature is based on an F test. Within clusters, the null hypothesis for a categorical feature is that the proportion of cases in each category in cluster j is the same as the overall proportion, and the p-value is based on Pearson's chi-square; the null hypothesis for a continuous feature is that the mean in cluster j is the same as the overall mean, and the p-value is based on Student's t statistic.
The K-Means method is a clustering method used to group records based on the similarity of values for a set of input fields. The basic idea is to discover k clusters such that the records within each cluster are similar to each other and distinct from records in other clusters. K-Means is an iterative algorithm: an initial set of clusters is defined, and the clusters are repeatedly updated until no more improvement is possible (or the number of iterations exceeds a specified limit).
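A minimal R sketch of this idea for continuous features follows; it assumes a purely numeric data frame `X` and a fixed k = 3, and it is an illustration rather than the exact computation used by the modelling software. The records are clustered with k-means and each feature is ranked by the F-test p-value of that feature across the clusters, smaller p-values indicating features that separate the clusters more strongly.

```r
# Minimal sketch: K-Means clustering followed by per-feature F tests across clusters.
set.seed(1)
cl <- kmeans(scale(X), centers = 3)$cluster                 # cluster label for each record
p_vals <- sapply(X, function(x) anova(lm(x ~ factor(cl)))[["Pr(>F)"]][1])
p_vals[p_vals == 0] <- .Machine$double.xmin                 # guard against sigi = 0, as above
importance_rank <- names(sort(p_vals))                      # most important feature first
```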
Kohonen models are a special kind of neural network model that performs unsupervised learning. It takes the input vectors and performs a type of spatially organized clustering, or feature mapping, to group similar records together and collapse the input space to a two-dimensional space that approximates the multidimensional proximity relationships between the clusters. The Kohonen network model consists of two layers of neurons or units: an input layer and an output layer. The input layer is fully connected to the output layer, and each connection has an associated weight. Another way to think of the network structure is to think of each output layer unit having an associated center, represented as a vector of inputs to which it most strongly responds (where each element of the center vector is a weight from the output unit to the corresponding input unit).
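For completeness, a comparable unsupervised step can be sketched with the CRAN kohonen package (an assumption about tooling; the paper itself obtains this model from its modelling software). Once each record is assigned to an output unit, feature significance can be computed across those assignments exactly as in the K-Means sketch above.

```r
# Minimal sketch: train a small self-organizing map and read the unit (cluster)
# assigned to each record; `X` is again an assumed numeric data frame.
library(kohonen)
som_fit  <- som(as.matrix(scale(X)), grid = somgrid(4, 4, "hexagonal"))
clusters <- som_fit$unit.classif   # winning output unit for each record
```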

Feature Subset Generation and Evaluation
RF is a classifier consisting of a set of CARTs; it introduces random feature selection on top of Bagging and can be used for classification, regression and variable importance analysis. It can process data with missing values and achieves good results on class-imbalanced data; in addition, it has high stability and strong generalization, and it can run tests internally and obtain classification errors (the OOB error estimate). Therefore, this paper uses random forests as the classifier to evaluate the feature subsets.
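A minimal sketch of this evaluation step with the R randomForest package follows; it assumes predictors `X`, class labels `y`, and an importance ranking `ranking` given as a vector of feature names (all placeholder names), and it builds a forest on the first k ranked features before reading the OOB error estimate.

```r
# Minimal sketch: OOB error of a random forest grown on the top-k ranked features.
library(randomForest)
oob_error <- function(X, y, ranking, k, ntree = 100) {
  rf <- randomForest(x = X[, ranking[1:k], drop = FALSE],
                     y = as.factor(y), ntree = ntree)
  rf$err.rate[ntree, "OOB"]      # overall OOB error estimate after `ntree` trees
}
```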

PREPROCESSING
CART, CHAID, SVM, BN, NN, K-Means and Kohonen are seven effective machine learning methods. This paper uses these seven methods for feature importance evaluation and random forests to evaluate the feature subsets, which yields seven feature selection methods; the model structure is shown in Fig. 1.

Figure 1 Structure of the eight feature selection methods
Seven basic feature selection methods are proposed: CART Random Forests (CART-RF), CHAID Random Forests (CHAID-RF), SVM Random Forests (SVM-RF), BN Random Forests (BN-RF), NN Random Forests (NN-RF), K-Means Random Forests (K-Means-RF) and Kohonen Random Forests (Kohonen-RF). These methods use CART, CHAID, SVM, BN, NN, K-Means or Kohonen to obtain the feature importance ranking. Then the forward search method generates the feature subset according to the feature importance. The forward search starts from the empty subset and greedily adds the feature with the highest score to the subset each time. After each feature is added, the corresponding model is trained and tested; finally, the classification ability of the feature subset is determined by the OOB error rate of the random forests, as sketched below.
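Assuming the `oob_error()` helper sketched in the previous section (a naming assumption) and a ranking `rank_order` of feature names produced by one of the seven methods, the forward search can be expressed as:

```r
# Minimal sketch of the forward search: add features one at a time in ranked order,
# record the OOB error of each subset, and keep the subset with the lowest error.
errors      <- sapply(seq_along(rank_order), function(k) oob_error(X, y, rank_order, k))
best_k      <- which.min(errors)
best_subset <- rank_order[1:best_k]
```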
The Integrate-RF model structure is also shown in Fig. 1. This method uses the seven feature importance evaluation methods to obtain seven feature importance rankings, takes the average of the seven results as the final feature importance ranking, and then uses the forward search method to generate feature subsets. Finally, the classification ability of each feature subset is determined by the OOB error rate of the random forests.
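The averaging step can be sketched as follows, assuming `imp` is a hypothetical (features × methods) matrix whose rows are named by feature and whose columns hold the seven normalized importance scores:

```r
# Minimal sketch of the Integrate-RF ranking: average the seven importance scores per
# feature (rownames(imp) are the feature names) and order by that average, largest first.
avg_importance    <- rowMeans(imp)
integrate_ranking <- names(sort(avg_importance, decreasing = TRUE))
```

The forward search and OOB evaluation then proceed exactly as for the seven basic methods, using `integrate_ranking` in place of a single method's ranking.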

RESULTS AND DISCUSSION
In order to verify the effectiveness of the eight feature selection methods, the paper designs three sets of experiments. The first set uses CART, CHAID, SVM, BN, NN, K-Means and Kohonen to calculate the importance ranking of the features and takes the average of these seven rankings as the feature importance ranking of the Integrate-RF method. The second set rearranges the features in each data table according to the feature importance to obtain a new data table, and then builds random forests models with the number of decision trees ranging from 1 to 100 for each new data table, in order to find the number of trees at which the random forests model performs best. The third set compares, for each method, the minimum OOB error, the OOB error mean and variance, and the number of features in the optimal feature subset as features are introduced.

Data Sets Introduction
The experiments use six UCI classification data sets; a brief description is shown in Tab. 1.

Feature Importance Ranking
Feature importance evaluation is an important part of feature selection. This experiment uses the CART, CHAID, SVM, BN, NN, K-Means and Kohonen algorithms in SPSS Modeler to obtain seven feature importance rankings. The average of these rankings gives the feature importance ranking of Integrate-RF. The results are shown in Tab. 2.

Parameter Selection of Classifier
The number of decision trees has a certain impact on the performance of random forests. The main task in this section is to explore and analyse the relationship between the number of decision trees and the OOB error estimates of random forests, and to find the number of trees at which the random forests perform best. The specific experimental process is as follows: first, reorganize each data table according to the feature importance ranking of each model; second, let the number of trees in the random forests range from 1 to 100 with a step size of 20, classify the new data table and record the OOB error estimation rate. The relationship between the OOB error estimation rate of each method and the number of trees for the six data sets is shown in Fig. 2. In Fig. 2, there are a total of 46 sub-graphs: the first four data sets each have 8 sub-graphs, and the last two data sets each have 7 sub-graphs. For example, sub-graph (aa), titled Lenses.CART.rf, means that the CART method obtains the feature importance ranking of the Lenses data and reorders the features in the Lenses data table, and random forests then classify the new data table; the abscissa is the number of decision trees, and the ordinate is the OOB error estimation rate. The dotted lines in each sub-graph show the trend of the OOB error estimation rate for each class over the number of decision trees, and the black solid line shows the trend of the overall OOB error estimate. From Fig. 2, it can be concluded that the OOB error estimate of the random forests decreases and then gradually levels off as the number of decision trees grows, and the number of decision trees for the best model of each data set is as follows: 25 for Lenses, 25 for Iris, 50 for Breast Cancer Wisconsin (Original), 50 for Breast Cancer Wisconsin (Diagnostic), 50 for lung-cancer, and 100 for SCADI.
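Because the R randomForest package reports the cumulative OOB error after each tree, the whole curve for one reordered data table can be obtained from a single fit; a minimal sketch (assuming predictors `X` and class labels `y`) is:

```r
# Minimal sketch: OOB error estimate as a function of the number of trees.
library(randomForest)
rf <- randomForest(x = X, y = as.factor(y), ntree = 100)
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of decision trees", ylab = "OOB error estimate")
```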

OOB Estimate of Error Rate Analysis
Analysing each sub-graph in Fig. 3, we can see that, as features are introduced sequentially, the OOB error estimation rate of the random forests decreases and then levels off, indicating that the above eight feature importance evaluation methods are all effective. Comparing all methods, it is found that the Integrate-RF method is more stable.
In order to further compare the eight methods, we calculate the minimum value, average value and variance of the OOB error estimation rate for each data set in the above experiments. The results are shown in Fig. 3. For each data set, the number of features at the minimum OOB error of each method is shown in Tab. 3.
Combining Fig. 3, Tab. 3 and Tab. 4, we can observe the following. In Lenses, there are a total of 4 features; CART-RF, SVM-RF and BN-RF introduce the first three features to get the minimum OOB error rate of 16.67%; K-Means-RF introduces the first four features to get the minimum OOB error rate of 25%; Integrate-RF introduces the first two features to get the minimum OOB error rate of 16.67%.
In Iris, there are a total of four features; CART-RF, BN-RF, K-Means-RF, Kohonen-RF and Integrate-RF introduce the first three features to get the minimum OOB error rate of 4.00%; CHAID-RF and SVM-RF introduce the first four features to get the minimum OOB error rate of 3.33%; NN-RF introduces the first four features to get the minimum OOB error rate of 5.33%.
In BCWO, there are a total of nine features; CART-RF introduces the first seven features to get the minimum OOB error rate of 3.43%; CHAID-RF introduces the first nine features to get the minimum OOB error rate of 3.58%; SVM-RF introduces the first eight features to get the minimum OOB error rate of 3.00%; BN-RF introduces the first nine features to get the minimum OOB error rate of 3.43%; NN-RF introduces the first nine features to get the minimum OOB error rate of 2.86%; K-Means-RF introduces the first nine features to get the minimum OOB error rate of 3.29%; Kohonen-RF introduces the first seven features to get the minimum OOB error rate of 3.29%; Integrate-RF introduces the first seven features to get the minimum OOB error rate of 3.00%.
In BCWD, there are a total of thirty features; CART-RF introduces all thirty features to get the minimum OOB error rate of 3.69%; CHAID-RF introduces the first eight features to get the minimum OOB error rate of 3.16%; SVM-RF introduces the first twenty-five features to get the minimum OOB error rate of 2.81%; BN-RF introduces the first fourteen features to get the minimum OOB error rate of 2.99%; NN-RF introduces the first twenty-four features to get the minimum OOB error rate of 3.69%; K-Means-RF introduces all thirty features to get the minimum OOB error rate of 2.99%; Kohonen-RF introduces the first twenty-nine features to get the minimum OOB error rate of 3.34%; Integrate-RF introduces the first twenty features to get the minimum OOB error rate of 3.69%.
In lung-cancer, there are a total of fifty-six features; CART-RF introduces the first three features to get the minimum OOB error rate of 31.25%; CHAID-RF introduces the first two features to get the minimum OOB error rate of 28.12%; SVM-RF introduces the first twenty features to get the minimum OOB error rate of 31.25%; NN-RF introduces the first six features to get the minimum OOB error rate of 34.38%; K-Means-RF introduces the first thirty-three features to get the minimum OOB error rate of 40.62%; Kohonen-RF introduces the first thirty features to get the minimum OOB error rate of 37.50%; Integrate-RF introduces the first nine features to get the minimum OOB error rate of 28.12%.
In SCADI, there are a total of 205 features; CART-RF introduces the first four features to get the minimum OOB error rate of 12.86%; CHAID-RF introduces the first four features to get the minimum OOB error rate of 14.29%; SVM-RF introduces the first one hundred and thirteen features to get the minimum OOB error rate of 14.29%; NN-RF introduces the first one hundred and twenty-six features to get the minimum OOB error rate of 14.29%; K-Means-RF introduces the first forty-one features to get the minimum OOB error rate of 12.86%; Kohonen-RF introduces the first fifteen features to get the minimum OOB error rate of 14.29%; Integrate-RF introduces the first twenty-seven features to get the minimum OOB error rate of 14.29%.
From the analysis of the above results, three points can be drawn. First, the eight feature importance evaluation methods are feasible and effective, and the effect is more obvious when the data set contains more features. Second, Integrate-RF performs well on all data sets, indicating that Integrate-RF is more adaptable. Third, ordering the same features differently in the data table produces different results, indicating that the random feature extraction of the random forests implementation in the R software package is related to the ordering of features in the data table.
Figure 3 The relationship between OOB estimate of error rate and the number of selected features using eight feature-importance-evaluation and random forests classification methods (BCWO denotes the Breast Cancer Wisconsin (Original) data set, BCWD the Breast Cancer Wisconsin (Diagnostic) data set, and LC the lung-cancer data set)
Table 3 The value of OOB estimate of error rate for different feature set sizes using eight feature-importance-evaluation and random forests methods (Min / %, AV + V / %)

CONCLUSION
In order to overcome the curse of dimensionality and over-fitting problems and improve the efficiency of data analysis, this paper proposes eight feature selection methods: CART-RF, CHAID-RF, SVM-RF, BN-RF, NN-RF, K-Means-RF, Kohonen-RF and Integrate-RF. The first seven methods use CART, CHAID, SVM, BN, NN, K-Means or Kohonen to evaluate feature importance, then use a forward search strategy to generate and update feature subsets, and finally use random forests to evaluate the feature subsets. The eighth method uses a hybrid integration of the different models and random forests. Experiments on six UCI data sets show that the eight methods can effectively select features and reduce the data dimension.

DECLARATION
(1) We note that a shorter conference version of this paper appeared in the 2021 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), 27-29 August 2021, Shanghai, China. This manuscript supplements and expands the following contents in more detail: the structure of the seven basic feature selection methods, the feature importance rankings, the relationship between the OOB estimate of error rate and the number of trees, the value of the OOB estimate of error rate for different feature set sizes using the eight feature-importance-evaluation and random forests classification methods, etc.
(2) The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
(4) This work uses the software IBM SPSS Modeler 14.1 to obtain the feature importance rankings, and then uses R to perform feature subset generation, evaluation and selection.
Note: there are a total of 46 sub-graphs. The first four data sets each have 8 sub-graphs, and the last two data sets each have 7 sub-graphs.

Figure 2 The relationship between OOB estimate of error rate and the number of trees

Table 1
Data sets introduction

Table 2
The results of feature importance ranking. Because Breast Cancer Wisconsin (Diagnostic), lung-cancer and SCADI have many features, only the top 10 features by average importance are shown.

Table 4
Number of convergent features using eight feature-importance-evaluation and random forests classification methods