vendredi 30 mars 2018

sampling based on specified column values in R

I have a data like this, where Average is the average of X, Y, and Z.

head(df)
ID  X   Y   Z   Average
A   2   2   5   3
A   4   3   2   3
A   4   3   2   3
B   5   3   1   3
B   3   4   2   3
B   1   5   3   3
C   5   3   1   3
C   2   3   4   3
C   5   3   1   3
D   2   3   4   3
D   3   2   4   3
D   3   2   4   3
E   5   3   1   3
E   4   3   2   3
E   3   4   2   3

To reproduce this, we can use

df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
                     X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
                     Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L), 
                     Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L), 
                     Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))

From this, I want to extract one observation per ID such that we don't get same (as much as is possible) values of the combination of X, Y, and Z. I tried

library(dplyr)
df %>% sample_n(size = nrow(.), replace = FALSE) %>% distinct(ID, .keep_all = T)

But, on a larger dataset, I see too many repetitions of the combination of X, Y, Z. To the extent possible, I need the output with equal or close to equal representation of cases (i.e. the combination of X, Y, Y) like this:

   ID   X   Y   Z   Average
    A   2   2   5   3
    B   5   3   1   3
    C   2   3   4   3
    D   3   2   4   3
    E   4   3   2   3




Aucun commentaire:

Enregistrer un commentaire