Thursday, April 20, 2023

Is my approach valid for showing that my algorithm (which uses Python's random.sample) is NOT susceptible to the Birthday Problem?

I tried to prove to myself, one way or the other, whether the way I was generating these "codes" was susceptible to the Birthday Problem. If it were, I tell myself, collisions should eventually grow with the square of the number of reserved codes, once enough codes have been reserved.

I believe the approximation I am recalling is collision_probability ≈ k**2 / (2 * n), where k is the number of items (the codes which will be reserved) and n is the number of buckets (the possible codes).
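
To make that concrete, here is a quick sketch of what that approximation predicts, assuming n = 10**6 possible codes (the same total_buckets value the script below computes); the k values are just small arbitrary examples:

    # Birthday approximation P(collision) ≈ k**2 / (2 * n), evaluated for a few
    # small k, with n = 10**6 possible codes (an assumption matching total_buckets
    # in the script below). The approximation is only meaningful while the result
    # stays well below 1, i.e. for k small relative to sqrt(n).
    def approx_collision_probability(k, n):
        return k ** 2 / (2 * n)

    n = 10 ** 6
    for k in (100, 500, 1_000):
        print(f"k={k}: approx P(collision) = {approx_collision_probability(k, n):.4f}")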

Below, I calculate P(collision) at each level of reserved codes, from 0 up to 1 million. Then I fit the observed probabilities to a line of best fit. My most important question is: did I assume what I was trying to prove somewhere along the way? Or have I otherwise missed the point of the Birthday Problem's effect on the probability of putting k items into n buckets?


    import itertools
    import sys
    import csv
    import random
    import string
    from functools import partial
    
    
    def calculate_num_collisions(gen_code, iterations, codes):
        """Estimate the probability that a newly generated code collides with
        one of the already-reserved codes."""
        num_collisions = 0
        for _ in range(iterations):
            code = gen_code()
            if code in codes:
                num_collisions += 1

        # Round to 4 decimal places, for probability visibility down to 0.0001,
        # i.e. 0.01%.
        return round(num_collisions / iterations, 4)
    
    
    def generate_code(possible_values, length):
        # Same algorithm as the one used elsewhere in my codebase: random.sample
        # picks `length` distinct characters from `possible_values` (i.e. it
        # samples without replacement).
        return "".join(random.sample(possible_values, length))
    
    
    if __name__ == "__main__":
    
        if len(sys.argv) < 2:
            print("Please supply output file.")
            sys.exit(1)
    
        possible_values = string.digits
        length = 6
        # How many samples to take at each level of codes already assigned.
        iterations_to_check = 40
    
        _gen_code = partial(generate_code, possible_values, length)
    
        with open(sys.argv[1], "w", newline="") as csvfile:
    
            codes = set()
            total_buckets = len(possible_values) ** length
            fieldnames = ["current_buckets", "observed_collision_probability"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    
            # Rather than generating codes randomly in this outer loop, create a
            # generator that produces every possible code in order; it wastes less time.
            all_possibilities = itertools.product(possible_values, repeat=length)
            for _ in range(total_buckets):
    
                observed_collision_probability = calculate_num_collisions(
                    gen_code=_gen_code, iterations=iterations_to_check, codes=codes
                )
                # Buckets (possible codes) that have not been reserved yet.
                current_buckets = total_buckets - len(codes)
                writer.writerow(
                    dict(
                        observed_collision_probability=observed_collision_probability,
                        current_buckets=current_buckets,
                    )
                )
                try:
                    codes.add("".join(next(all_possibilities)))
                except StopIteration:
                    break

So, when I take the data generated by this script and plot it along with a fitted linear equation...

[Image: usage of curve_fit]
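
In rough form, that fit looks something like the following sketch (the column names match the CSV written by the script above; the file name is a placeholder, and the actual call in the screenshot may differ in its details):

    import numpy as np
    from scipy.optimize import curve_fit

    def linear(x, a, b):
        return a * x + b

    # Load the CSV written by the script above ("observed.csv" is a placeholder name).
    data = np.genfromtxt("observed.csv", delimiter=",", names=True)
    x = data["current_buckets"]
    y = data["observed_collision_probability"]

    # curve_fit returns the optimal parameters and their covariance matrix.
    (slope, intercept), _ = curve_fit(linear, x, y)
    print(f"slope = {slope:.3e}, intercept = {intercept:.3e}")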

I get the following:

[Image: data fitted to a linear model]

The point is, it looks very linear.

Choosing a quadratic objective function instead results in a coefficient of 0 on the quadratic component. (A smaller question might be whether I used the scipy.optimize.curve_fit function correctly here, as I am new to Python's data-science libraries.)
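
The quadratic check is along these lines (again just a sketch, reusing x and y from the linear snippet above):

    from scipy.optimize import curve_fit

    def quadratic(x, a, b, c):
        return a * x ** 2 + b * x + c

    # Fit the same data (x, y from the linear sketch) with a second-order model;
    # the fitted quadratic coefficient `a` comes back at 0, as described above.
    (a, b, c), _ = curve_fit(quadratic, x, y)
    print(f"a = {a:.3e}, b = {b:.3e}, c = {c:.3e}")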

Thanks in advance, and sorry for the open-endedness of the question.

I referred to some related questions on this topic, but I'm still confused.