I have written a following function to sample from multivariate normal...
library(dplyr)
rmvtnormtibble <- function(n, mu, Sigma, names) {
(matrix(rnorm(n * length(mu)), ncol = length(mu)) %*% chol(Sigma)) %>%
{t(.) + mu} %>%
t %>%
as_tibble %>%
setNames(names)
}
...that returns a tibble having columns that correspond different variables.
The function can be called this way:
rmvtnormtibble(n = 120 + 86 + 56, mu = c(A = 24, B = 48, C = 30), Sigma = diag(3), names = c("A","B","C"))
Then I have the following sampling scheme:
A | B | freq |
---|---|---|
30 | 40 | 120 |
25 | 41 | 86 |
27 | 22 | 56 |
sample_scheme <- tibble(A = c(30,25,27), B = c(40,41,22), freq = c(120,86,56))
And now, how do we read this sample scheme table? We want exactly 120 observations having A column to be between 30 and 31 AND B column to be between 40 and 41. 86 observations A being between 25 and 26 AND B being between 41 and 42. 56 observations being A being between 27 and 28 AND B being between 22 and 23.
So any number (x), in columns A and B, defines an interval [x, x+1].
Then we can start sampling. If we set the n argument extremely high, we would most likely get what we need plus some extra observations. However, if the sampling scheme has high frequencies and/or the A and B values tend to deviate from their respective means and/or the multivariate normal has many dimensions, we might run into memory issues. Therefore n needs to be set according to these but also according to possible memory constraints of the environment.
So my question is, that how could we accrue a tibble with new observations when we loop the rmvtnormtibble function and keep adding new observations according to the sample scheme most effectively? So when we get new observations that we need, we add them, when we get observations we do not need, we omit them.
Aucun commentaire:
Enregistrer un commentaire