I am trying to generate a random sample from a population that uses a cryptographically secure DRBG instead of the 'insecure' Mersenne Twister that is used by default in Python.
This library provides the DRBG and I repurposed the random.sample
generation algorithm in the Python source code as the basis for my sampling algorithm.
The code works but for only samples substantially smaller than the population whereas the builtin Mersenne Twister does not fail even if the sample and population are equal.
The program will randomly select locations in an array for a character to reside in it. Overriding is not allowed as the string has to be reconstructed from the array after it has been placed into it (this part of the code has not been shown by explained in case it helps explain the problem).
Code:
import string
import secrets as secret
import random as rd
import numpy as np
from math import ceil, log
from aes_drbg import AES_DRBG
import timeit
# Initialize the AES DRBG
key = 1
aes_drbg = AES_DRBG(256)
aes_drbg.instantiate(bytes(key))
def random_number(limit):
"""Generate a random number within the limit using the AES DRBG."""
sequence_length = 1
if limit >= 4_294_967_296:
sequence_length = 8
elif limit >= 65_536:
sequence_length = 4
elif limit >= 256:
sequence_length = 2
random_number_bytes = aes_drbg.generate(sequence_length)
random_number = int.from_bytes(random_number_bytes, byteorder="big")
return random_number % limit
def sample(population, k):
"""Generate a random sample of size k from the given population"""
randbelow = random_number
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** ceil(log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n - i)
result[i] = pool[j]
pool[j] = pool[n - i - 1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
def string_generator(size):
"""Generate a random string of a given size."""
chars = string.ascii_uppercase + string.ascii_lowercase
return ''.join(secret.choice(chars) for _ in range(size))
def main():
test = string_generator(50_000_000) # Generate random string as test data.
array = np.zeros([100_000_000]) # Initialize empty array.
rd.seed(1)
start_time = timeit.default_timer()
rand_sample = sample(range(len(array)), len(test)) # use `sample` for AES DRBG; `rd.sample` for the Mersenne Twister
end_time = timeit.default_timer()
print(rand_sample)
print(f"Execution time: {end_time - start_time}")
if __name__ == "__main__":
main()
As the value of the len(test)
approaches the value of the len(array)
is when the DRBG stops outputting and runs indefinitely. I am not sure where I have gone wrong but I cannot diagnose the mistake.
The MWE is a simplification of my actual code by demonstrates the problem adequately as far as I can gather. The numbers for test
and array
are far larger in production (billions) so I'm open to any other more efficient algorithms or other cryptographically secure DRBGs such as XCHACHA20 or even other libraries that have secure DRBGs available that allow full control over the seeds so the results can be repeated. Any guidance is highly appreciated.
Aucun commentaire:
Enregistrer un commentaire