Sunday, 29 July 2018

Random row selection based on a column value and probability

Here is the dummy data set:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)


1                   -1.22049503   blue
2                    1.61641224   blue
3                    0.09079087   blue
4                    0.32325956   blue
5                   -0.62733486    red
6                    0.43102051    red
7                    0.61619844    red
8                   -0.17718356    red
9                    1.18737562 yellow
10                  -0.19035444 yellow
11                  -0.49158052 yellow
12                  -1.47425432 yellow
13                   0.22942192   pink
14                   0.76779548   pink
15                   0.97631652   pink
16                  -0.33513712   pink

What I am trying to get is this: if df$color is blue, those rows are always selected; if df$color is red, the row has a high probability of being selected; if df$color is yellow, it has a lower probability; and if df$color is pink, it has a very low probability.

This is what I came up with:

my.data.frame <- df[(df$color == 'pink') | (df$color == 'blue') & runif(1) < .6 | (df$color == 'red') & runif(1) < .6|(df$color == 'yellow') & runif(1) < .3, ]

But here is the output from two runs. First run:

1                   -1.22049503  blue
2                    1.61641224  blue
3                    0.09079087  blue
4                    0.32325956  blue
13                   0.22942192  pink
14                   0.76779548  pink
15                   0.97631652  pink
16                  -0.33513712  pink

Second run:

1                   -1.22049503  blue
2                    1.61641224  blue
3                    0.09079087  blue
4                    0.32325956  blue
5                   -0.62733486   red
6                    0.43102051   red
7                    0.61619844   red
8                   -0.17718356   red
13                   0.22942192  pink
14                   0.76779548  pink
15                   0.97631652  pink
16                  -0.33513712  pink

So the blue rows are always selected, as expected. But for the other colors it is all or nothing: in the first run all the pink rows are selected and none of the red ones, and in the second run all the pink and all the red rows are selected, instead of some of the red rows and even fewer of the pink ones.

What am I missing? Or is there a better way of doing this?
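
One possible reading of the problem, offered as a sketch rather than a definitive answer: `runif(1)` returns a single number, which R recycles across every row, so each color group is kept or dropped as a whole. Also, `&` binds more tightly than `|`, so the `pink` test in the attempt has no probability attached and pink rows are always kept. The sketch below draws one uniform number per row and compares it with a per-color probability. The names `keep_prob` and `draws` are hypothetical, and the probability for pink (0.1) is an assumed value, since the attempt gives none; 0.6 and 0.3 are taken from it.

```r
set.seed(42)  # only to make this sketch reproducible

# Dummy data built the same way as in the question
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Assumed keep-probability per color (the pink value is illustrative)
keep_prob <- c(blue = 1, red = 0.6, yellow = 0.3, pink = 0.1)

# One uniform draw per row, not a single draw shared by all rows
draws <- runif(nrow(df))

# Look up each row's probability by its color name and keep the row
# when its own draw falls below that probability
my.data.frame <- df[draws < keep_prob[as.character(df$color)], ]

# Count how many rows of each color survived
table(my.data.frame$color)
```

Because `runif` never returns exactly 1, a probability of 1 keeps every blue row, while the other colors are thinned row by row. If a fixed number of rows per color is wanted instead of independent per-row chances, `sample()` applied within each color group would be a different design to consider.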



