I want to generate multivariate discrete survey data with a correlation structure, preferably in base R. The marginal probabilities (e.g. the probability of a 1 in a dichotomous variable) should be definable. Among the several methods and packages I've tried, the one from Barbiero and Ferrari (2017), implemented in the GenOrd package, seems quite straightforward for someone who is a researcher rather than a statistician. (I've seen something similar in base R, though.) For example, a bivariate simulation is obtained by:
library(GenOrd)
set.seed(42)
marginal <- list(c(0.2, 0.5, 0.7, 0.9), # define marginal probabilities (cumulative!)
c(0.1, 0.3, 0.4, 0.5))
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2) # define correlation structure (CMX*)
n <- 1e3
sim <- ordsample(n, marginal, Sigma)
> head(sim)
[,1] [,2]
[1,] 3 5
[2,] 2 3
[3,] 2 5
[4,] 3 5
[5,] 4 4
[6,] 3 3
> cor(sim)
[,1] [,2]
[1,] 1.000000 0.498474
[2,] 0.498474 1.000000
However, simulating multivariate data quickly becomes painful, since one has to define a large correlation matrix (CMX), which has to be valid as well. Since the CMX can be random, I followed Ravi Varadhan, who proposed an "easy way to produce a valid CMX". When I combine his proposal with the GenOrd approach for 6 variables, though, I always get errors:
library(GenOrd)
marginal <- list(.5, c(.1, .3, .7), c(.15, .4, .75), .51,
c(.15, .31, .48, .60, .71, .80, .87, .92, .95, .97, .98, .99),
c(.9, .95))
support <- list(0:1 , 0:3, 0:3, 0:1, 18:30, 1:3)
set.seed(259616)
R <- matrix(runif(36), ncol=6) # Ravi Varadhan
RtR <- R %*% t(R)
Sigma <- cov2cor(RtR)
ordcont(marginal, Sigma, support)
Error in ordcont(marginal, Sigma, support) : Some correlation coefficients are not feasible! Please use function corrcheck to get lower and upper bounds!
I know this has to do with the upper and lower bounds obtained by GenOrd::corrcheck(marginal, support), within which the entries of Sigma have to lie. Maybe creating a valid CMX is much easier than I think.
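To illustrate the bounds issue, here is a minimal base R sketch, not part of GenOrd: it shrinks a random correlation matrix toward the identity until every off-diagonal entry lies inside given bounds. A weighted average of a valid correlation matrix and the identity is still a valid correlation matrix, so validity is preserved. The helper `shrink_to_bounds` is hypothetical, and the `lower`/`upper` matrices below are placeholders; in practice they would presumably be the two matrices returned by GenOrd::corrcheck(marginal, support).

```r
# Hypothetical helper: shrink Sigma toward the identity matrix until all
# off-diagonal entries lie strictly between 'lower' and 'upper'.
shrink_to_bounds <- function(Sigma, lower, upper, step = 0.05) {
  off <- row(Sigma) != col(Sigma)        # mask for off-diagonal entries
  I <- diag(nrow(Sigma))
  for (w in seq(1, 0, by = -step)) {     # w = 1 keeps Sigma, w = 0 gives I
    S <- w * Sigma + (1 - w) * I         # still a valid correlation matrix
    if (all(S[off] > lower[off] & S[off] < upper[off])) return(S)
  }
  stop("no feasible weight found; do the bounds exclude 0?")
}

set.seed(259616)
R <- matrix(runif(36), ncol = 6)         # random matrix as in the question
Sigma <- cov2cor(R %*% t(R))
# Placeholder bounds for illustration only (assumption, not real output)
lower <- matrix(-0.4, 6, 6)
upper <- matrix(0.4, 6, 6)
Sigma_ok <- shrink_to_bounds(Sigma, lower, upper)
```

Shrinking scales all correlations by the same factor, so the pattern of the random matrix is kept while its strength is reduced; `step` controls how finely the weight is searched.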
Anyway, my question is how to (easily) generate multivariate ordinal survey data with a random correlation structure and predefined marginal probabilities. Ideally the resulting CMX would be adjustable (the random CMX is only meant as a shortcut, so that I don't spend hours defining a large valid CMX by hand). How could I achieve this in an easy way?
A base R solution would be preferable, since the solution should not depend too much on a package.
The expected output should be a matrix, where columns are composed of variables with defined marginal probabilities,
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 4 8 3 3 2 0
[2,] 8 4 4 9 1 1
[3,] 2 0 1 0 1 0
[4,] 5 0 6 2 1 1
[5,] 8 2 7 3 0 0
and where the columns should be randomly (adjustably?) correlated, to get some results in regression analyses.
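For reference, one base R approach to such an output could look like the following sketch: draw correlated standard normals and cut each column at the normal quantiles of the cumulative marginal probabilities. This is an assumption about what a solution might look like, not GenOrd's algorithm: here `Sigma` is the correlation of the latent normals, so the correlations of the resulting ordinal variables will generally be somewhat weaker than the entries of `Sigma`. The function name `sim_ordinal` and the equicorrelated `Sigma` are made up for illustration.

```r
# Hypothetical helper: correlated ordinal data from thresholded normals.
# 'marginal' holds cumulative probabilities per variable (as in GenOrd),
# 'Sigma' is the correlation matrix of the LATENT normal variables.
sim_ordinal <- function(n, marginal, Sigma, support = NULL) {
  k <- length(marginal)
  # Latent normals with correlation Sigma (chol() returns U with U'U = Sigma)
  Z <- matrix(rnorm(n * k), n, k) %*% chol(Sigma)
  X <- sapply(seq_len(k), function(j) {
    # Category index = 1 + number of thresholds at or below the latent value
    findInterval(Z[, j], qnorm(marginal[[j]])) + 1L
  })
  if (!is.null(support)) {
    # Map category indices to the requested support values
    for (j in seq_len(k)) X[, j] <- support[[j]][X[, j]]
  }
  X
}

marginal <- list(.5, c(.1, .3, .7), c(.15, .4, .75), .51,
                 c(.15, .31, .48, .60, .71, .80, .87, .92, .95, .97, .98, .99),
                 c(.9, .95))
support <- list(0:1, 0:3, 0:3, 0:1, 18:30, 1:3)
Sigma <- matrix(0.3, 6, 6)   # illustrative equicorrelation (assumption)
diag(Sigma) <- 1
set.seed(42)
sim <- sim_ordinal(1000, marginal, Sigma, support)
```

Running `cor(sim)` and `prop.table(table(sim[, 1]))` afterwards shows how closely the simulated correlations and marginals match the targets.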