Wednesday, April 20, 2016

Number of features to pick when growing a random forest for regression

I am building an RF for regression in MATLAB. The RF is working quite well for my case, but I am trying to get a better OOB error / generalization error out of it.

One important parameter to tune is the number of features to select at random for each decision split, known in MATLAB as "NVarToSample".
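
For reference, here is roughly how I grow the forest at the moment (a minimal sketch; X stands for my n-by-p predictor matrix, Y for the response vector, and 200 trees is an arbitrary placeholder):

    p = size(X, 2);
    B = TreeBagger(200, X, Y, ...
        'Method', 'regression', ...
        'OOBPrediction', 'on', ...      % keep out-of-bag predictions
        'NVarToSample', round(p/3));    % MATLAB's default for regression
    mse = oobError(B);                  % OOB MSE as the ensemble grows
    fprintf('OOB MSE with all trees: %.4f\n', mse(end));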

However, the recommendations differ across sources. The MATLAB help documentation says:

Default is the square root of the number of variables for classification and one third of the number of variables for regression.

whereas the scikit-learn documentation for Python says:

Empirical good default values are max_features=n_features for regression problems

and the original random forest paper by Leo Breiman says:

An interesting difference between regression and classification is that the correlation increases quite slowly as the number of features used increases. The major effect is the decrease in PE∗(tree). Therefore, a relatively large number of features are required to reduce PE∗(tree) and get near optimal test-set error.

So I increased the number from one third of the features to all of the features, and the performance did improve with this change (a sketch of the sweep I ran is shown after the questions below). But my concerns are:

  1. Does using such a large number of features reduce the randomness of the random forest?

  2. How many features did you choose when solving your own problems with RF, and why?
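
For concreteness, here is a sketch of the sweep I mentioned above (again X and Y stand for my predictor matrix and response, and the candidate grid and tree count are arbitrary):

    % Sweep NVarToSample from p/3 up to p and compare final OOB errors.
    p = size(X, 2);
    candidates = unique([round(p/3), round(p/2), round(2*p/3), p]);
    oobMSE = zeros(size(candidates));
    for i = 1:numel(candidates)
        B = TreeBagger(200, X, Y, ...
            'Method', 'regression', ...
            'OOBPrediction', 'on', ...
            'NVarToSample', candidates(i));
        e = oobError(B);
        oobMSE(i) = e(end);    % OOB MSE with the full ensemble
    end
    [bestMSE, k] = min(oobMSE);
    fprintf('Best NVarToSample = %d (OOB MSE = %.4f)\n', candidates(k), bestMSE);

Since the OOB error of a single forest is itself noisy, it is probably prudent to repeat the sweep with a few random seeds before committing to a value.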

Thank you!



