Wednesday, 2 December 2020

How do I directly sample from summarised data with a count (number of replicates) column?

The problem: I need to give each row of a 100-row data frame (A) an IntHours value that is sampled from a different data frame (B), without using a loop.

I have a summary data frame which is:

 C <- data.frame(IntHours = c(1, 2, 3), HoursCount = c(274, 50, 46))

The IntHours come from B, which has IntHours values up to 8. I only need the values 1 through 3. I do not require the other columns from B. C represents the actual filtered, grouped, summarised data from B.

How do I take a random sample of 100 values of 1, 2, and 3, without replacement, from C? The hours count shows the number of underlying rows for each value of 1, 2, and 3.

I know how to sample from C using a loop with vectors and an index, and how to expand C into 370 rows and randomly sample treating the IntHours as a grouped variable.

But how can I directly sample 100 IntHours values without doing any expansion? When HoursCount is supplied as a weight, it is treated strictly as a sampling weight and not as a count of replicates. So slice_sample() in dplyr only returns the three rows of C, in descending order of HoursCount. The base R sample() fails, logically, with an error that the population is too small to provide a sample of 100 without replacement.
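
To make the base R failure concrete, here is a minimal sketch that reproduces it. It covers only the `sample()` case, and the `tryCatch()` wrapper and the name `res` are added here purely to capture the error message for illustration.

```r
# Summary table from the question
C <- data.frame(IntHours = c(1, 2, 3), HoursCount = c(274, 50, 46))

# prob = only weights the three existing values; it does not replicate them.
# Asking for 100 draws without replacement from 3 values is therefore an error.
res <- tryCatch(
  sample(C$IntHours, size = 100, replace = FALSE, prob = C$HoursCount),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
print(res)
```

The captured message confirms that `sample()` sees a population of three values, regardless of the weights.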

Desired outcome: construct a 1-column data frame of 100 rows, consisting of the sampled IntHours, using sampling without replacement and without using a loop. I will then use bind_cols() to attach it to the 100-row data frame for which I need these values.

I'm still writing my package (!) and I'm trying to keep the code as short as possible. This includes removing all non-essential loops but also using code that is easy to read.

Is there a direct way of doing this? I've searched with the [R] and [sample] tags, and I can't find anyone who wants to sample from a summary table/data frame who didn't expand the summary data first. A Google search provided Pandas answers.

Edited: this is one approach. Expand the data and then slice_sample() from it.

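library(dplyr)
# Expand the summary into one row per underlying observation (370 rows)
D <- data.frame(IntHours = rep(c(1, 2, 3), times = c(274, 50, 46)))
# Draw 100 of those rows without replacement
E <- D %>%
  slice_sample(n = 100, replace = FALSE)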

This gives the random sample of 100. But is there a way of doing this directly from C?
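
One possible direction, sketched here as an assumption rather than an established dplyr or base R idiom: draw 100 row positions from the notional 370-row expansion with `sample.int()`, then map each position back to its IntHours value using the cumulative counts. The names `idx`, `breaks` and `sampled` are hypothetical.

```r
# Summary table from the question
C <- data.frame(IntHours = c(1, 2, 3), HoursCount = c(274, 50, 46))

set.seed(1)  # only so the sketch is reproducible

# Positions 1..370 stand for the rows of the expanded data, which is never built.
idx <- sample.int(sum(C$HoursCount), size = 100, replace = FALSE)

# Cumulative counts mark where each IntHours block ends: 274, 324, 370.
breaks <- cumsum(C$HoursCount)

# With left.open = TRUE, positions 1..274 map to 0, 275..324 to 1, 325..370 to 2;
# adding 1 turns that into a row index of C.
sampled <- data.frame(
  IntHours = C$IntHours[findInterval(idx, breaks, left.open = TRUE) + 1]
)
```

Because the positions are drawn without replacement, no IntHours value can appear more often than its HoursCount. The resulting 100-row, 1-column data frame could then be passed to bind_cols().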



