jeudi 14 décembre 2017

Combining judgments of randomly selected participants with dplyr

I have the following data frame 'df'. Each participant (here 10 participants) saw several stimuli (here 100), and made a judgment about it (here a random number). For each stimuli, I know the true answer (here a random number; a different number for each stimuli but always the same answer for all participanst)

participant <- rep(1:10, each=100)
stimuli <- rep(1:100, 10)
judgment <- rnorm(1000)
df1 <- data.frame(participant, stimuli, judgment)
df2 <- data.frame(stimuli=1:100, criterion=rnorm(100))
df <- merge(df1, df2, by='stimuli') %>% arrange(participant, stimuli)

Here is what I am trying to do:

1) Taking n randomly selected participants (here n is between 1 and 10).

2) Computing the mean of their judgments per stimuli

3) Computing the correlation between this mean and the true answer

I want to perform step 1-3 for all n (that is, I want to take 1 randomly selected participants and perform steps 1-3, then I want to take 2 randomly selected participants and perform steps 1-3 ... 10 randomly selected participants and perform steps 1-3. The results should be a data frame with 10 rows and 2 variables: N and the correlation. I want to work only with dplyr.

Here is my solution:

MyFun = function(Data) {

HelpFun = function(x, Data) { 
# x is the index for the number of participants.
# It Will be used in the lapply call bellow
participants_id = unique (Data$participant)      
participants_x = sample(participants_id, x)
filter(Data, participant %in% participants_x) %>% 
  group_by(stimuli) %>% 
  summarise( mean_x = mean(judgment),
             criterion = unique(criterion) ) %>%
  summarise(cor = cor(.$mean_x, .$criterion))
  }
 N <- length(unique(Data$participant))

lapply(1:N, HelpFun, Data) %>% bind_rows()
}  

MyFun(df) 

The problem is that this code is slow. Since every selection is random, I perform all this 100 times. And this slow... very slow. Even with a working memory of 16 Gb.

system.time(replicate(100, MyFun(df), simplify = FALSE ) %>% bind_rows())

Any idea about making all of this faster?




Aucun commentaire:

Enregistrer un commentaire