I have the following data frame 'df'. Each participant (here 10 participants) saw several stimuli (here 100), and made a judgment about it (here a random number). For each stimuli, I know the true answer (here a random number; a different number for each stimuli but always the same answer for all participanst)
participant <- rep(1:10, each=100)
stimuli <- rep(1:100, 10)
judgment <- rnorm(1000)
df1 <- data.frame(participant, stimuli, judgment)
df2 <- data.frame(stimuli=1:100, criterion=rnorm(100))
df <- merge(df1, df2, by='stimuli') %>% arrange(participant, stimuli)
Here is what I am trying to do:
1) Taking n randomly selected participants (here n is between 1 and 10).
2) Computing the mean of their judgments per stimuli
3) Computing the correlation between this mean and the true answer
I want to perform step 1-3 for all n (that is, I want to take 1 randomly selected participants and perform steps 1-3, then I want to take 2 randomly selected participants and perform steps 1-3 ... 10 randomly selected participants and perform steps 1-3. The results should be a data frame with 10 rows and 2 variables: N and the correlation. I want to work only with dplyr.
Here is my solution:
MyFun = function(Data) {
HelpFun = function(x, Data) {
# x is the index for the number of participants.
# It Will be used in the lapply call bellow
participants_id = unique (Data$participant)
participants_x = sample(participants_id, x)
filter(Data, participant %in% participants_x) %>%
group_by(stimuli) %>%
summarise( mean_x = mean(judgment),
criterion = unique(criterion) ) %>%
summarise(cor = cor(.$mean_x, .$criterion))
}
N <- length(unique(Data$participant))
lapply(1:N, HelpFun, Data) %>% bind_rows()
}
MyFun(df)
The problem is that this code is slow. Since every selection is random, I perform all this 100 times. And this slow... very slow. Even with a working memory of 16 Gb.
system.time(replicate(100, MyFun(df), simplify = FALSE ) %>% bind_rows())
Any idea about making all of this faster?
Aucun commentaire:
Enregistrer un commentaire