I want to draw a random sample from my dataset, using different proportions for each value of a factor variable, as well as using weights stored in some other column. dplyr
solution in pipes will be preferred as it can be inserted easily in long code.
Let's take the example of iris
dataset. Species
column is divided into three values 50 rows each. Let's also assume the sample weights are stored in column Sepal.Length
. If I have to sample equal proportions (or equal rows) per species, the problem is easy to solve
library(tidyverse)
iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)
# A tibble: 15 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.4 3.7 1.5 0.2 setosa
2 5.3 3.7 1.5 0.2 setosa
3 5.7 4.4 1.5 0.4 setosa
4 5 3.5 1.6 0.6 setosa
5 4.8 3.1 1.6 0.2 setosa
6 6.1 2.9 4.7 1.4 versicolor
7 6.7 3.1 4.7 1.5 versicolor
8 5 2 3.5 1 versicolor
9 7 3.2 4.7 1.4 versicolor
10 5.7 2.9 4.2 1.3 versicolor
11 7.2 3.2 6 1.8 virginica
12 6.7 2.5 5.8 1.8 virginica
13 6.4 2.8 5.6 2.1 virginica
14 6.3 3.3 6 2.5 virginica
15 7.2 3 5.8 1.6 virginica
But I got stuck when I have to choose/sample different proportions for each species, say 10%, 20%, 25% respectively.
iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)
#Error: `prop` must be a single number
OR
iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0
Please help
Aucun commentaire:
Enregistrer un commentaire