Monday, March 26, 2018

Performance of for-loop with high number of cases

I have 88,000 observations, all coded as 1:

obs <- rep(1,88000)

In addition, I have the following function, which performs a random experiment: a value p is compared with a random number, and depending on the result, x is incremented by 1 or stays the same.

# note: this name masks base R's rexp() (exponential random numbers)
rexp <- function(x, p){
  if(runif(1) <= p) return(x + 1)
  return(x)
}
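Since each call to rexp draws a single uniform number, the same experiment can be expressed for a whole vector in one call. A minimal sketch (the function name rexp_vec is not from the original question):

```r
# Vectorized equivalent (sketch): one uniform draw per observation,
# incrementing each element with probability p in a single call.
# (runif(length(x)) <= p) is a logical vector that coerces to 0/1.
rexp_vec <- function(x, p) x + (runif(length(x)) <= p)

y <- rexp_vec(rep(1, 88000), 0.03)  # roughly 3% of entries become 2
```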

Besides "obs" and "rexp", an empty data frame "dat" with 500 rows and 0 columns is given, along with a placeholder column "result":

dat <- data.frame(row.names = 1:500)
# integer placeholder; a character placeholder such as rep(',', 500)
# would coerce the counts to strings when they are assigned later
dat$result <- rep(NA_integer_, 500)

I use the following loop to apply the function "rexp" (with p = 0.03) 500 times to the vector "obs" and save the number of changes to "obs" caused by the random experiment as "result" in the data frame "dat":

for(i in 1:500){
  x <- sapply(obs, rexp, 0.03)   # apply the experiment to each observation
  x <- table(x)
  x <- x[names(x) == "2"]        # count observations that changed from 1 to 2
  dat$result[i] <- if (length(x) > 0) x else 0  # guard: table may have no "2" entry
}

Now to the problem: the for-loop above basically works, but its performance is very poor. The execution takes very long, and the loop often even gets stuck. In the example above only 88,000 observations are used; working with something like 880,000 seems almost impossible. I'm not sure why the performance is so bad. For comparison, on my machine the same procedure finishes in under a minute in Stata (even with 880,000 observations). I know that for-loops should generally be avoided in R, but I do not know how to perform the procedure otherwise. I would be grateful for any hint that explains and improves the performance of the described loop!
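One hint along these lines, as a sketch and assuming only the count of changed observations per replicate is needed: since every observation is incremented independently with probability p, the number of changes in one replicate follows a Binomial(88000, 0.03) distribution, so all 500 replicates can be drawn with a single rbinom() call and no loop at all:

```r
n_obs <- 88000   # number of observations, all starting at 1
n_rep <- 500     # number of replicates
p     <- 0.03    # probability that an observation is incremented

# Each replicate's count of changes is one Binomial(n_obs, p) draw,
# so the entire loop collapses into a single vectorized call.
dat <- data.frame(result = rbinom(n_rep, size = n_obs, prob = p))
```

This produces the same distribution of results as the sapply/table loop, but in milliseconds rather than minutes, and it scales to 880,000 observations without any change in running time.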



