samedi 30 janvier 2021

Speed of random sampling from distribution in R

I was trying to investigate a probability distribution whose moments were the Catalan numbers, and came up with

qcatmo <- function(p, k=4){ (qbeta(p/2+1/2, 3/2, 3/2)*2 - 1)^2 * k } 
colMeans(outer(qcatmo(ppoints(10^6)), 0:10, "^"))
#      1     1     2     5    14    42   132   429  1430  4862 16796

which worked out nicely. But then I tried to generate random values from this distribution, and found three possible approaches (A using the quantile function I already knew worked applied to runif, B slightly more direct using the built-in rbeta function, and C using a form of rejection sampling with runif) with noticeably different speeds when used on large samples:

rcatmoA <- function(n, k=4){ qcatmo(runif(n), k) }
rcatmoB <- function(n, k=4){ (rbeta(n, 3/2, 3/2)*2 - 1)^2 * k }
rcatmoC <- function(n, k=4){
             n0 <- ceiling(n*4/pi + 7*sqrt(n) + 35)
             x0 <- runif(n0)^2 
             y0 <- runif(n0)^2 
             x0[x0 + y0 < 1][1:n] * k
             }

which when benchmarked gave

library(microbenchmark)
n <- 10^4
microbenchmark(
  rcatmoA(n,4),
  rcatmoB(n,4),
  rcatmoC(n,4)
  ) 
#Unit: microseconds
#          expr     min       lq      mean   median       uq     max neval cld
# rcatmoA(n, 4) 22817.2 23014.95 23259.688 23186.95 23322.80 25128.9   100   c
# rcatmoB(n, 4)  1526.5  1534.40  1615.255  1541.30  1607.15  4952.1   100  b 
# rcatmoC(n, 4)   781.5   788.70   884.339   795.00   813.80  7266.2   100 a  

My questions are:

  • Why is the B version so much faster than the A version?
  • If the B version is faster because it avoids applying a function to runif data, why is the C version even faster?
  • Is there any general advice on how best to make random samples in this sort of situation?



Aucun commentaire:

Enregistrer un commentaire