mercredi 26 mai 2021

Generating a random sample from a population using NIST CTR DRBG in Python

I am trying to generate a random sample from a population that uses a cryptographically secure DRBG instead of the 'insecure' Mersenne Twister that is used by default in Python.

This library provides the DRBG and I repurposed the random.sample generation algorithm in the Python source code as the basis for my sampling algorithm.

The code works but for only samples substantially smaller than the population whereas the builtin Mersenne Twister does not fail even if the sample and population are equal.

The program will randomly select locations in an array for a character to reside in it. Overriding is not allowed as the string has to be reconstructed from the array after it has been placed into it (this part of the code has not been shown by explained in case it helps explain the problem).

Code:

import string
import secrets as secret
import random as rd
import numpy as np
from math import ceil, log
from aes_drbg import AES_DRBG
import timeit


# Initialize the AES DRBG
key = 1
aes_drbg = AES_DRBG(256)
aes_drbg.instantiate(bytes(key))

def random_number(limit):
    """Generate a random number within the limit using the AES DRBG."""
    sequence_length = 1
    if limit >= 4_294_967_296:
        sequence_length = 8
    elif limit >= 65_536:
        sequence_length = 4
    elif limit >= 256:
        sequence_length = 2

    random_number_bytes = aes_drbg.generate(sequence_length)
    random_number = int.from_bytes(random_number_bytes, byteorder="big")

    return random_number % limit

def sample(population, k):
    """Generate a random sample of size k from the given population"""
    randbelow = random_number
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21  # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))  # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):  # invariant:  non-selected at [0,n-i)
            j = randbelow(n - i)
            result[i] = pool[j]
            pool[j] = pool[n - i - 1]  # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result


def string_generator(size):
    """Generate a random string of a given size."""
    chars = string.ascii_uppercase + string.ascii_lowercase
    return ''.join(secret.choice(chars) for _ in range(size))

def main():
    test = string_generator(50_000_000)  # Generate random string as test data.

    array = np.zeros([100_000_000])  # Initialize empty array.

    rd.seed(1)

    start_time = timeit.default_timer()
    rand_sample = sample(range(len(array)), len(test))  # use `sample` for AES DRBG; `rd.sample` for the Mersenne Twister
    end_time = timeit.default_timer()

    print(rand_sample)
    print(f"Execution time: {end_time - start_time}")

if __name__ == "__main__":
    main()

As the value of the len(test) approaches the value of the len(array) is when the DRBG stops outputting and runs indefinitely. I am not sure where I have gone wrong but I cannot diagnose the mistake.

The MWE is a simplification of my actual code by demonstrates the problem adequately as far as I can gather. The numbers for test and array are far larger in production (billions) so I'm open to any other more efficient algorithms or other cryptographically secure DRBGs such as XCHACHA20 or even other libraries that have secure DRBGs available that allow full control over the seeds so the results can be repeated. Any guidance is highly appreciated.




Aucun commentaire:

Enregistrer un commentaire