random: R - Resampling a dataset, but simulated data must abide by conditions specified by original dataset

Friday, 14 September 2018

I have a list ("input"), and each element of the list is a character vector of IDs representing a subgroup of a larger population of individuals:

>head(input)
[[1]]
[1] "A"  "C"

[[2]]
[1] "D"  "E" "A"

[[3]]
[1] "A" "B" "J" "E"

[[4]]
[1] "B"

[[5]]
[1] "C" "F" "A"

[[6]]
[1] "H"

#the population
ids = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

To create the original dataset, I ran a short loop:

input = list()

for (i in 1:1000) {

    # pick a subgroup size between 1 and 4
    id.count = sample(1:4, 1)
    # draw that many distinct IDs from the population
    id.subgroup = sample(ids, id.count, replace = FALSE)

    input[[i]] = id.subgroup

}

I want to randomly simulate a new dataset, keeping constant the following from the original dataset:

(a) the number of appearances of each ID (in the above example, 'A' shows up 4 times, 'H' one time, etc)

(b) the distribution of subgroup sizes (in the above example, there is one group of 4, two groups of 3, one group of 2, and two groups of 1)
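As a minimal sketch, the two quantities to hold constant can be tabulated directly. The toy list below is copied from the head(input) printout above; the names id.counts and size.counts are only illustrative.

```r
# Toy data: the six subgroups shown in the head(input) printout
input <- list(c("A", "C"),
              c("D", "E", "A"),
              c("A", "B", "J", "E"),
              "B",
              c("C", "F", "A"),
              "H")

# (a) number of appearances of each ID across all subgroups
id.counts <- table(unlist(input))

# (b) number of subgroups of each size
size.counts <- table(lengths(input))

print(id.counts)
print(size.counts)
```

Any acceptable simulated dataset should reproduce both tables exactly.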

So far, I run through the original list (input), identify the length at each index, and randomly sample that many IDs from the original data. I use these samples to create a new simulated dataset.

However, sampling without replacement is not enough on its own. The pooled vector 'all.ids' contains each ID as many times as it appears in the data, so the same ID can still be drawn more than once within a subgroup. And because every iteration samples from the full pool again, the code below also fails to keep the total number of appearances of each ID the same across the dataset.

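# pool of all IDs, each repeated as often as it appears in the original data
all.ids = unlist(input)

simulated = list()

for (i in seq_along(input)) {

    temp.length = length(input[[i]])
    # problem: the full pool is reused on every iteration and contains repeated IDs
    temp.sample = sample(all.ids, temp.length, replace = FALSE)

    simulated[[i]] = temp.sample

}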

Maybe I shouldn't use the 'sample' function on its own, since what I really want is a constrained random sample (no two IDs the same within a subgroup). In addition, every time I draw an ID from 'all.ids', I want to remove that occurrence from 'all.ids', so that the total number of appearances of each ID stays the same. Essentially, on each iteration of the loop I want to sample randomly from the remaining IDs, while making sure that no ID appears more than once within any subgroup.
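One possible way to meet all three requirements is sketched below. It swaps in a different technique from the draw-and-remove loop described above: it starts from the original list, which already satisfies every constraint, and repeatedly exchanges single IDs between two randomly chosen subgroups, rejecting any exchange that would create a duplicate. The helper name swap_resample, the default number of swaps, and the fixed seed are assumptions made for illustration. The sketch also assumes no subgroup is empty, and it makes no claim that the result is uniformly distributed over all valid datasets.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

ids <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
input <- lapply(1:1000, function(i) sample(ids, sample(1:4, 1)))

# Hypothetical helper: randomise subgroup membership by pairwise swaps.
# Every accepted swap keeps subgroup sizes and per-ID totals unchanged.
swap_resample <- function(groups, n.swaps = 20 * length(unlist(groups))) {
  n <- length(groups)
  if (n < 2) return(groups)
  for (k in seq_len(n.swaps)) {
    pair <- sample.int(n, 2)            # two distinct subgroups
    g1 <- groups[[pair[1]]]
    g2 <- groups[[pair[2]]]
    p1 <- sample.int(length(g1), 1)     # one position in each subgroup
    p2 <- sample.int(length(g2), 1)
    a <- g1[p1]
    b <- g2[p2]
    # accept the swap only if it cannot create a duplicate in either subgroup
    if (!(a %in% g2) && !(b %in% g1)) {
      groups[[pair[1]]][p1] <- b
      groups[[pair[2]]][p2] <- a
    }
  }
  groups
}

simulated <- swap_resample(input)
```

Because a swap only moves one ID out of each subgroup and one ID in, the size of simulated[[i]] always equals the size of input[[i]], and the pooled multiset of IDs never changes. More swaps mix the data more thoroughly at the cost of run time.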

A successful solution to the problem would look like this:

>head(simulated)
[[1]]
[1] "F"  "A"

[[2]]
[1] "A"  "E" "C"

[[3]]
[1] "D" "B" "H" "E"

[[4]]
[1] "A"

[[5]]
[1] "C" "A" "B"

[[6]]
[1] "J"
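A candidate result can be checked against the three requirements with a small helper. This is a sketch only: the function name is_valid_resample is hypothetical, and the two toy lists are copied from the head(input) and head(simulated) printouts above.

```r
# Hypothetical helper: TRUE only if `sim` keeps the ID counts and the
# distribution of subgroup sizes of `orig`, with no ID repeated in a subgroup
is_valid_resample <- function(orig, sim) {
  same.sizes  <- identical(sort(lengths(sim)), sort(lengths(orig)))
  same.counts <- identical(sort(unlist(sim)), sort(unlist(orig)))
  no.repeats  <- all(vapply(sim, anyDuplicated, integer(1)) == 0L)
  same.sizes && same.counts && no.repeats
}

input <- list(c("A", "C"), c("D", "E", "A"), c("A", "B", "J", "E"),
              "B", c("C", "F", "A"), "H")
simulated <- list(c("F", "A"), c("A", "E", "C"), c("D", "B", "H", "E"),
                  "A", c("C", "A", "B"), "J")

ok <- is_valid_resample(input, simulated)
print(ok)
```

For contrast, a list such as list(c("A", "A")) checked against list(c("A", "B")) would fail, since it repeats an ID within a subgroup and changes the ID counts.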
