Tuesday, 2 July 2019

Stepwise sampling in a tibble

I am trying to simulate some data by sampling in multiple steps.

The first step (create x) works fine.

In the second step, I want to create the variable y by sampling from different vectors based on the value of x.

My code runs without errors, but it does not do what I want: it samples only one value for, e.g., x == "A", and then reuses that value for every row where x == "A". I want it to draw a new sample for each row where x == "A".

Code:

library(tidyverse)
set.seed(1)

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE),
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = 1, prob = c(0.3, 0.4, 0.3)),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = 1, prob = c(0.3, 0.4, 0.3)),
  ))

unique(data$x)
[1] "C" "A" "B"

unique(data$y)
[1] "C1" "A2" "B3"

If the code worked as intended, unique(data$y) would return something like [1] "A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3" (in some order).
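The output above looks consistent with ordinary length-one recycling: each sample(..., size = 1) call is evaluated once, and case_when() reuses that single value for every matching row. A minimal sketch of the effect, where the toy vector x and the helper variable draw are made up for illustration:

```r
library(dplyr)  # for case_when()

set.seed(1)

# Five rows that all fall into the same branch.
x <- c("A", "A", "A", "A", "A")

# sample() runs once here and returns a single value ...
draw <- sample(c("A1", "A2", "A3"), size = 1, prob = c(0.3, 0.4, 0.3))

# ... which case_when() recycles across all matching rows.
y <- case_when(x == "A" ~ draw)

length(y)          # one entry per row
length(unique(y))  # but only one distinct value
```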

I know the problem is the size = 1 argument in sample(), but what can I replace it with? Removing it returns the error:

Error: `x == "A" ~ sample(c("A1", "A2", "A3"), prob = c(0.3, 0.4, 0.3))` must be length 100 or one, not 3

I have also tried size = nrow(.data) and size = nrow(.), but both return an error as well.

Is there a simple solution to this?
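One possible approach, as a sketch rather than a definitive answer: draw a full-length vector in every branch with size = length(x) and replace = TRUE, and let case_when() pick the matching element row by row. This relies on tibble() building columns sequentially, so x already exists when y is defined. Each branch draws a value for every row, including rows it will not use, which is wasteful but simple.

```r
library(dplyr)   # case_when()
library(tibble)  # tibble()

set.seed(1)

data <- tibble(
  x = sample(c("A", "B", "C"), size = 10000, prob = c(0.1, 0.2, 0.7), replace = TRUE),
  # Every branch returns one draw per row; case_when() keeps, for each row,
  # the draw from the first branch whose condition is TRUE.
  y = case_when(
    x == "A" ~ sample(c("A1", "A2", "A3"), size = length(x), prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "B" ~ sample(c("B1", "B2", "B3"), size = length(x), prob = c(0.3, 0.4, 0.3), replace = TRUE),
    x == "C" ~ sample(c("C1", "C2", "C3"), size = length(x), prob = c(0.3, 0.4, 0.3), replace = TRUE)
  )
)

# Inspect the distinct values and how often each one occurs.
sort(unique(data$y))
count(data, x, y)
```

With this version every y value starts with the letter of its x, and the draws vary from row to row within each group.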



