mardi 14 février 2023

Sampling from truncated multivariate normal distribution in R given a frequency table of values (of one variable) needed

Let's assume we are interested in getting a set of individuals with following ages from a multivariate normal:

age n
25 20
30 43
35 66

The multivariate normal has following parameters:

mu = c(32, 170) #The first element is mean age and the second on is mean is height, but it plays no role here.
sigma = matrix(c(9, 10,
                10, 100), nrow = 2)

The samples (2 in this example) can be drawn using e.g. mvrnorm function from MASS package, i.e.:

MASS::mvrnorm(n = 2, mu = mu, Sigma = sigma)
         [,1]     [,2]
[1,] 32.50403 168.4208
[2,] 31.45465 153.8083

As I mentioned above and can be seen from the table, I would like to get 20 25-year-olds etc.

What would be the best way of doing this?

What initially came into my mind was to loop through the rows of the frequency table, and generate sample after sample, until I have as many individuals with the correct age, e.g.

frqtable = data.frame(age = c(25, 30, 35), n = c(20, 43, 66))

tempmatrix <- matrix(nrow = 0, ncol = length(mu))
for(row in 1:nrow(frqtable)) {
    rowtempmatrix <- matrix(nrow = 0, ncol = length(mu))
    while(nrow(rowtempmatrix) < frqtable[row, 'n']) {
        sample <- MASS::mvrnorm(n = frqtable[row, 'n'] - nrow(rowtempmatrix), mu = mu, Sigma = sigma) #I think, it is best to sample at least as many as still needed...
        rowtempmatrix <- rbind(rowtempmatrix, sample[sample[,1] >= frqtable[row, 'n'] & sample[,1] < frqtable[row, 'n'] + 1,]) #I define being 25 years old as >= 25, but < 26.
    }
    tempmatrix <- rbind(tempmatrix, rowtempmatrix)
}

However, this does not work (infinite while loop), and this solution is quite complex and ugly in general.

Any suggestions, how to get this done effectively and efficiently? I know, I could draw like one million samples, and reject the ones I do not need, but if I happen to have a mu vector and covariance matrix with more elements, this may sometimes need a bit too much memory.

(I do not know, if dplyr package could be utilized here, since as far as I know, it has no functions to filter matrices.)




Aucun commentaire:

Enregistrer un commentaire