I am currently trying to come up with a semi-decent (considering complexity, statistical properties and common sense) algorithm for sampling.
The data is currently contained inside a hash table, where each key is an item and the key's value is the item's frequency in the original distribution.
If one wanted to sample from such histogram, how would he go about doing that if he wanted to preserve the original probabilities of the items and transfer them into the sample?
Also, we require that there is a flag of whether duplicate items are allowed in the sample. In the case of not allowing the duplicates, the best I came up with is to apply the algorithm from the paragraph above and delete the item from the hash table once it is sampled. This way, at least the relative probabilities are preserved amongst the remaining items. However, I am unsure of whether this is an accepted practice statistically.
Is there a generally accepted algorithm for doing this? If it helps, we need to implement it in Common Lisp.
Aucun commentaire:
Enregistrer un commentaire