I cannot tell whether Spark performs any (outcome) label balancing when it subsamples data for its ensemble models. Neither of the argument descriptions below mentions a specific sampling implementation, so I would like to know whether the sample is balanced in any way.
If it is a simple bootstrap sample, what happens when a rare outcome is being modeled and a subsample is drawn that contains no instances of one label?
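To make the concern concrete, here is a small sketch (not Spark's actual implementation, which I have not been able to confirm) that estimates how often a plain unstratified subsample would contain zero positive rows. The function name and parameters are hypothetical; it assumes sampling with replacement, one draw per subsampled row:

```python
import random

def prob_no_positives(n_rows, pos_frac, subsampling_rate, trials=10_000):
    """Monte Carlo estimate of the chance that a simple bootstrap
    subsample (sampling with replacement, no stratification) draws
    zero rows of the rare positive label."""
    k = int(n_rows * subsampling_rate)  # rows drawn per tree
    misses = 0
    for _ in range(trials):
        # A draw misses the positive class with probability (1 - pos_frac)
        if all(random.random() >= pos_frac for _ in range(k)):
            misses += 1
    return misses / trials

# Example: 1,000 rows, 0.5% positive rate, subsamplingRate = 0.1
# => each tree sees 100 rows; the closed-form chance of zero
# positives is (1 - 0.005) ** 100, roughly 0.61.
estimate = prob_no_positives(1_000, 0.005, 0.1)
```

If nothing balances the draw, a majority of trees in this scenario would be trained without a single positive example, which is exactly why I am asking whether any stratification happens under the hood.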
From the Ensembles documentation:
subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
From the Decision Tree documentation:
subsamplingRate: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using RandomForest and GradientBoostedTrees), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.