Sunday, 26 April 2020

R - Randomly sample from a matrix using a distribution to denote the number of zeros in each column

I am trying to randomly sample rows from a matrix (b below), but I want the resulting matrix of samples to have the same proportion of zeros in each column as another matrix (a below). I am trying to use the sample() function to do this, but I'm not having much joy. Some reproducible code is below, which will hopefully explain my problem:

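set.seed(1234)
# matrix a is the matrix that holds the distribution of zeros I want to match
# (4 rows x 5 columns, matching the printed output below)
a <- matrix(as.integer(rexp(20, rate = 0.1)), ncol = 5)
# matrix b is the matrix to be sampled from
# (20 rows x 5 columns, matching the printed output below)
b <- matrix(as.integer(rexp(100, rate = 0.1)), ncol = 5)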

a looks like:

     [,1] [,2] [,3] [,4] [,5]
[1,]    6    0    6    1   22
[2,]   19    6    0   23   19
[3,]    8   22    8    5    0
[4,]   24   17   28    3    0

b looks like:

      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    1   10    5    9
 [2,]   26    1    3    2    2
 [3,]    4    8    3    0    0
 [4,]    2   10   35    3   11
 [5,]    1    3   16    0    6
 [6,]    2    4    2   16    2
 [7,]    3   18   13    6   17
 [8,]    0    2    9    0   13
 [9,]    2   15    6   27   30
[10,]    1    2    7    9   15
[11,]   13    0    5    1    2
[12,]   18   12    9   27   33
[13,]    0   20    3   18    1
[14,]    5    7    7   16    4
[15,]    5    6    4    5    2
[16,]    0    7    5   10    7
[17,]    3   20    5   14   34
[18,]   28    0   10    5    8
[19,]   33    0    2    6   13
[20,]    7   28    0   11    8

I extract the proportion of non-zero entries in each column of a (that is, one minus the proportion of zeros) to use in the sampling:

dist<-apply(a,2, function(x) sum(x!=0)/length(x)) 
dist
[1] 1.00 0.75 0.75 1.00 0.50
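
As a cross-check, the same per-column quantities can be computed with colMeans() and colSums(). The sketch below is only illustrative: a_demo is a hypothetical stand-in typed in from the printed values of a above, and prop_nonzero, prop_zero and n_zero are names introduced here, not part of the original code.

```r
# Stand-in matrix typed in from the printed values of `a` (4 rows x 5 columns)
a_demo <- matrix(c( 6,  0,  6,  1, 22,
                   19,  6,  0, 23, 19,
                    8, 22,  8,  5,  0,
                   24, 17, 28,  3,  0),
                 nrow = 4, byrow = TRUE)

# Proportion of non-zero entries per column (the same quantity as `dist`)
prop_nonzero <- colMeans(a_demo != 0)

# Proportion and count of zeros per column
prop_zero <- colMeans(a_demo == 0)
n_zero    <- colSums(a_demo == 0)
```

Note that dist holds the non-zero proportions, so the zero proportions are 1 - dist.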

I then sample rows from b so that the sample has the same number of rows as a:

b_sample <- b[sample(x = nrow(b), size = 4, replace = FALSE), ]

This works, but I want b_sample to have the same proportion of zeros in each column as a. I have tried to do this:

b_sample <- b[sample(x = nrow(b), size = 4, replace = FALSE, prob = dist), ]

but I get an error:

Error in sample.int(x, size, replace, prob) : 
  incorrect number of probabilities
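
As far as I can tell, the error arises because prob has to supply one weight per item being sampled. Here the items are the row indices 1 to nrow(b), whereas dist has one entry per column. The sketch below reproduces this with hypothetical stand-in values (n_rows, col_weights and row_weights are names introduced only for illustration):

```r
n_rows      <- 20                        # stands in for nrow(b)
col_weights <- c(1, 0.75, 0.75, 1, 0.5)  # one weight per column, like dist

# prob has length 5 but there are 20 items to choose from, so sample() fails;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(
  sample(x = n_rows, size = 4, replace = FALSE, prob = col_weights),
  error = function(e) conditionMessage(e)
)

# With one weight per row, the call goes through
row_weights <- rep(1, n_rows)
ok <- sample(x = n_rows, size = 4, replace = FALSE, prob = row_weights)
```

So prob can weight which rows are drawn, but on its own it does not seem able to express a separate target for each column.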

I am not sure whether I have the format wrong or whether sample() is simply not the correct function to use here. Any help would be greatly appreciated!
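
One possible direction, offered only as a rough sketch, is to build each column of the sample separately: fix the number of zeros from the matching column of a, then fill the remaining slots with non-zero values drawn from the same column of b. This rests on assumptions not stated in the question: columns are sampled independently, so rows of b are not kept intact, and zeros are filled in directly rather than drawn from b. The helper sample_matching_zeros() is a hypothetical name, and the matrices are regenerated with the 4 x 5 and 20 x 5 shapes of the printed output.

```r
set.seed(1234)
a <- matrix(as.integer(rexp(20,  rate = 0.1)), ncol = 5)  # target zero pattern
b <- matrix(as.integer(rexp(100, rate = 0.1)), ncol = 5)  # values to sample from

# Hypothetical helper (not from the original post): builds an n-row matrix whose
# columns have the same share of zeros as the columns of `a`, with the non-zero
# entries drawn without replacement from the matching column of `b`.
sample_matching_zeros <- function(a, b, n = nrow(a)) {
  stopifnot(ncol(a) == ncol(b))
  n_zero <- round(colMeans(a == 0) * n)   # target number of zeros per column
  out <- matrix(0L, nrow = n, ncol = ncol(b))
  for (j in seq_len(ncol(b))) {
    nonzero_idx <- which(b[, j] != 0)
    n_nonzero   <- n - n_zero[j]
    stopifnot(length(nonzero_idx) >= n_nonzero)  # enough non-zero values in b?
    # sample.int() on positions avoids sample()'s special case for length-1 input
    picked   <- b[nonzero_idx[sample.int(length(nonzero_idx), n_nonzero)], j]
    col_vals <- c(rep(0L, n_zero[j]), picked)
    out[, j] <- col_vals[sample.int(n)]          # shuffle zeros and non-zeros
  }
  out
}

b_sample <- sample_matching_zeros(a, b)

# The two vectors of zero proportions should agree column by column
colMeans(b_sample == 0)
colMeans(a == 0)
```

If whole rows of b must be preserved, this sketch does not apply as written, since matching the zero proportion in every column at once with intact rows is a more constrained problem.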



