I am trying to implement a range mapper for TRNG output files for a C application with ranges of up to 4 bits in size. Due to the pigeonhole bias problem I have settled on using a discard algorithm.
My idea for a parsimonious algorithm would be something like:
-- Read 16 bytes from file and store as an indexed 128 bit unsigned integer bitbucket to be bitmask selected n bits at a time.
-- Predetermine as much as possible the ranges/buckets required for each input and store in an array.
-- For each n bits in the bitbucket select an input from the array that will not discard it if one exists. If 2 bits cannot find an input try 3 bits and if that cannot find an input try with 4 bits. At first when there are many inputs it should be easy not to discard, but as the choice of inputs gets low discards will become more common. I am not entirely sure if it is better to start with fewer bits and work my way up or to do the opposite.
The downside of this bit sipping range mapper seems to be that I need to assume about twice as much random input data as would be required with biased scaling methods. For instance a 9 bucket input from a 4 bit rand output will miss about 43% of the time.
Existing implementations/algorithms: This seems like an example of a more complex and efficient method of parsimonious range mapping but I find his explanation entirely impenetrable. Can anyone explain it to me in English or suggest a book I might read or a university class I might take that would give me a background to understand it?
There is also arc4random which seems to be a runtime optimized unbiased modulo discard implementation. Like almost all unbiased range mapper implementations I have found this seems not to particularly care about how much data it uses. That does not however mean that it is necessarily less data efficient because it has the advantage of fewer misses.
The basic idea of arc4random seems to be that as long as the number of pigeons (max_randvalue_output) is evenly divisible by the number of holes (rangeupperbound) the modulo function itself is an elegant and unbiased range mapper. However modulo only seems to be relevant when you are not bit sipping, i.e. when the output from the random source is more than ceil(log2(buckets)) bits.
There seems to be a tradeoff between the number of 'wasted' random bits and the percentage of discards. The percentage of misses is inversely proportional to the number of excess bits in the input to the range mapper. It seems like there should be a mathematical way to compare the data efficiency of a bit sipping range mapper with a more bit hungry version with fewer misses, but I don't know it.
So my plan is to just write two implementations: a bit sipping parsimonious type of range mapper that may or may not be a little like the mathforum example (which I don't understand) and an invariant byte input modulo range mapper which accepts byte inputs from a TRNG and uses a discard-from-the-top-of-largest-multiple modulo method of debiasing to match (x)n pigeons to n holes which is intended to be like arc4random. When finished I plan to post them on codereview.
I am basically looking for help or advice with any of these issues that might help me to write a more parsimonious but still unbiased range mapper particularly with respect to my parsimonious algorithm. Runtime efficiency is not a priority.
Aucun commentaire:
Enregistrer un commentaire