I'd like to do something similar to this post about Pandas dataframes but in R, and ideally regardless of data types (e.g. a dataframe with both factors and numeric columns).
I want to get a random sample of an R dataframe in which each variable is relatively representative of the population.
I have seen ways to create a stratified sample based on a single variable, but I want to ensure representation across multiple columns, and not just factors.
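For context, a single-variable stratified sample of the kind mentioned above can be done in a couple of lines; this is a hypothetical sketch using dplyr's `slice_sample()` on the built-in `iris` data, with a row id added so the complement can be recovered with `anti_join()`:

```r
library(dplyr)

set.seed(42)
df <- mutate(iris, .id = row_number())  # row id to identify the complement

# Sample 20% within each level of the stratification factor
test <- df %>%
  group_by(Species) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

# Everything not in the sample becomes the training set
train <- anti_join(df, test, by = ".id")
```

This guarantees proportional representation on `Species`, but says nothing about the other columns, which is exactly the gap the question is about.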
I wrote a simple algorithm to approach this across numeric variables, using the Wilcoxon rank-sum test on each variable: the split is accepted if every numeric column in the sample (test set) appears to come from the same population as the corresponding column in the remaining set (train set). You take a random sample, validate it with the following function, and resample and validate until you get a sample that meets your minimum representativeness across all variables:
validate_split = function(train, test, feats_lst, feats_p_val_lst, alpha = .5) {
  # Conducts a Wilcoxon rank-sum test column by column to check whether train
  # and test represent a similar superset (i.e., is the split stratified on
  # every feature?). Both train and test should have the same features. There
  # should be at least one numeric (i.e., continuous) feature, as the test
  # will only be performed on these columns -- this does limit the test.
  # Parameters:
  #   train: (data.frame) A subset of the original set to compare to the
  #     other subset, test.
  #   test: (data.frame) A subset of the original set to compare to the
  #     other subset, train.
  #   feats_lst: (list(character)) List of features to test.
  #   feats_p_val_lst: (list(character : list(double))) Dictionary of
  #     p-values, to track which features are hardest to stratify.
  #   alpha: (numeric) Threshold for rejecting the null hypothesis.
  #     H0 = feature n of train and test does not represent different sets
  #       (i.e., a representative split).
  #     H1 = feature n of train and test represents a different superset.
  # Return:
  #   valid: (bool) Are the sets representative of the same superset?
  valid = TRUE
  for (feat in feats_lst) {
    if (valid & feat %in% colnames(train) & feat %in% colnames(test)) {
      results = wilcox.test(
        x = as.double(train[[feat]]),
        y = as.double(test[[feat]])
      )
      if (results$p.value <= alpha) {
        # print("Reject null hypothesis that split is not unrepresentative:")
        valid = FALSE
      }
      # print(feat)
      # print(results$p.value)
      feats_p_val_lst[[feat]] = c(feats_p_val_lst[[feat]], results$p.value)
    }
  }
  return(list('valid' = valid, 'p_vals' = feats_p_val_lst))
}
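The resample-and-validate loop described above looks roughly like the following self-contained sketch. To keep it runnable on its own, the Wilcoxon check is inlined rather than calling the function, and the data frame, the 20% holdout fraction, and the feature names are made-up assumptions:

```r
set.seed(1)
df <- data.frame(a = rnorm(200), b = runif(200),
                 g = factor(sample(letters[1:3], 200, replace = TRUE)))
feats_lst <- c("a", "b")  # numeric features to test
alpha <- .5

repeat {
  # Draw a fresh random 20% holdout
  idx   <- sample(nrow(df), size = floor(0.2 * nrow(df)))
  test  <- df[idx, ]
  train <- df[-idx, ]
  # Wilcoxon rank-sum p-value for each numeric feature
  p_vals <- sapply(feats_lst, function(f)
    wilcox.test(as.double(train[[f]]), as.double(test[[f]]))$p.value)
  # Keep the split only if every feature clears the threshold
  if (all(p_vals > alpha)) break
}
```

Since each p-value is roughly uniform under the null, requiring p > .5 on every feature means each candidate split passes with probability around .5 per feature, which is why the number of resamples needed blows up as the feature count grows.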
Again, this leaves out factors and integers entirely, unless you cast them as doubles, but that would violate the Wilcoxon assumption that the data is continuous.
Given that my current dataset contains about 80 variables, almost half of which are doubles, this suffices because the factors are probably pretty representative if all the doubles are.
But it takes forever to run and to get p > .5 on every variable (i.e., to fail to reject the null hypothesis that these data sets are not from different populations, which is to say that the split is not unrepresentative). And what about a data set with all or most of its variables as factors or integers?
Is there a better way, from a statistical perspective, an R/programming perspective, or both?