Thursday, April 20, 2023

Is my approach valid for showing that my algorithm (which uses Python's random.sample) is NOT susceptible to the Birthday Problem?

I tried to prove to myself, one way or the other, whether the way I was generating these "codes" was susceptible to the Birthday Problem. If it were, I tell myself, collisions should eventually grow with the square of the number of reserved codes, once enough codes have been reserved.

I believe the approximation I am recalling is collision_probability ≈ k**2 / (2 * n), where k is the number of items (the codes which will be reserved) and n is the number of buckets (the possible codes).
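
To make that concrete, here is a quick sketch of what that approximation predicts, assuming n = 10**6 possible codes (the same total_buckets value the script below computes); the k values are just small arbitrary examples:

    # Birthday approximation P(collision) ≈ k**2 / (2 * n), evaluated for a few
    # small k, with n = 10**6 possible codes (an assumption matching total_buckets
    # in the script below). The approximation is only meaningful while the result
    # stays well below 1, i.e. for k small relative to sqrt(n).
    def approx_collision_probability(k, n):
        return k ** 2 / (2 * n)

    n = 10 ** 6
    for k in (100, 500, 1_000):
        print(f"k={k}: approx P(collision) = {approx_collision_probability(k, n):.4f}")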

Below, I calculate P(collision) at each level of reserved codes, from 0 up to 1 million. Then I fit the observed probabilities to a line of best fit. My most important question is: did I assume what I was trying to prove somewhere along the way? Or have I otherwise missed the point of the Birthday Problem's effect on the probability of putting k items into n buckets?


    import itertools
    import sys
    import csv
    import random
    import string
    from functools import partial
    
    
    def calculate_num_collisions(gen_code, iterations, codes):
        """Estimate the probability that a newly generated code collides with
        one of the already-reserved codes."""
        num_collisions = 0
        for _ in range(iterations):
            code = gen_code()
            if code in codes:
                num_collisions += 1

        # Round to 4 decimal places, for probability visibility down to 0.0001,
        # i.e. 0.01%.
        return round(num_collisions / iterations, 4)
    
    
    def generate_code(possible_values, length):
        # Same algorithm as the one used elsewhere in my codebase: random.sample
        # picks `length` distinct characters from `possible_values` (i.e. it
        # samples without replacement).
        return "".join(random.sample(possible_values, length))
    
    
    if __name__ == "__main__":
    
        if len(sys.argv) < 2:
            print("Please supply output file.")
            sys.exit(1)
    
        possible_values = string.digits
        length = 6
        # How many samples to take at each level of codes already assigned.
        iterations_to_check = 40
    
        _gen_code = partial(generate_code, possible_values, length)
    
        with open(sys.argv[1], "w", newline="") as csvfile:
    
            codes = set()
            total_buckets = len(possible_values) ** length
            fieldnames = ["current_buckets", "observed_collision_probability"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    
            # Rather than generating codes randomly in this outer loop, create a
            # generator that produces every possible code in order; it wastes less time.
            all_possibilities = itertools.product(possible_values, repeat=length)
            for _ in range(total_buckets):
    
                observed_collision_probability = calculate_num_collisions(
                    gen_code=_gen_code, iterations=iterations_to_check, codes=codes
                )
                # Buckets (possible codes) that have not been reserved yet.
                current_buckets = total_buckets - len(codes)
                writer.writerow(
                    dict(
                        observed_collision_probability=observed_collision_probability,
                        current_buckets=current_buckets,
                    )
                )
                try:
                    codes.add("".join(next(all_possibilities)))
                except StopIteration:
                    break

So, when I take the data generated by this script and plot it along with a fitted linear equation...

[Image: usage of curve_fit]
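
In rough form, that fit looks something like the following sketch (the column names match the CSV written by the script above; the file name is a placeholder, and the actual call in the screenshot may differ in its details):

    import numpy as np
    from scipy.optimize import curve_fit

    def linear(x, a, b):
        return a * x + b

    # Load the CSV written by the script above ("observed.csv" is a placeholder name).
    data = np.genfromtxt("observed.csv", delimiter=",", names=True)
    x = data["current_buckets"]
    y = data["observed_collision_probability"]

    # curve_fit returns the optimal parameters and their covariance matrix.
    (slope, intercept), _ = curve_fit(linear, x, y)
    print(f"slope = {slope:.3e}, intercept = {intercept:.3e}")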

I get the following:

[Image: data fitted to a linear model]

The point is, it looks very linear.

Choosing a quadratic objective function instead results in a coefficient of 0 on the quadratic component. (A smaller question might be whether I used the scipy.optimize.curve_fit function correctly here, as I am new to Python's data-science libraries.)
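
The quadratic check is along these lines (again just a sketch, reusing x and y from the linear snippet above):

    from scipy.optimize import curve_fit

    def quadratic(x, a, b, c):
        return a * x ** 2 + b * x + c

    # Fit the same data (x, y from the linear sketch) with a second-order model;
    # the fitted quadratic coefficient `a` comes back at 0, as described above.
    (a, b, c), _ = curve_fit(quadratic, x, y)
    print(f"a = {a:.3e}, b = {b:.3e}, c = {c:.3e}")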

Thanks in advance, and sorry for the open-endedness of the question.

I referred to some related questions on this topic, but I'm still confused.