I tried to prove to myself one way or the other whether the way I was generating these "codes" was susceptible to the Birthday Problem, in which case--I tell myself--collisions should eventually grow with the square of the number of reserved codes, with a large enough number of reserved codes.
I believe the approximated formula I am recalling was collision_probability = k**2 / 2n
, where k
is the number of items (or codes which will be reserved) and n
is the number of buckets (or possible codes).
Below, I go on to calculate P(collision) for each level of reserved codes, 0 - 1 million. Then I fit the probabilities at each point to a line of best fit. --- My most important question is, did I assume what was I trying to prove somewhere along the way here? Or otherwise, miss the point of the birthday problem's effect on the probability of putting K items into N buckets?
import itertools
import sys
import csv
import random
import string
from functools import partial
def calculate_num_collisions(gen_code, iterations, codes=None):
num_collisions = 0
for _ in range(iterations):
code = gen_code()
if code in codes:
num_collisions += 1
# Round to 4 digits, for probability visibility down to e.g. 0.0001,
# or 0.01% probability, in other words, really 2 digits
return round(num_collisions / iterations, 4)
def generate_code(possible_values, length):
# Try to use the exact same algorithm as in my codebase, another place
return "".join(random.sample(possible_values, length))
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Please supply output file.")
sys.exit(1)
possible_values = string.digits
length = 6
iterations_to_check = 40
# iterations is how many samples to take at each level of codes already assigned.
_gen_code = partial(generate_code, possible_values, length)
with open(sys.argv[1], "w") as csvfile:
codes = set()
total_buckets = len(possible_values) ** length
fieldnames = ["current_buckets", "observed_collision_probability"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
# rather than generate codes randomly in this outer loop, create a generator
# that will produce all possible such codes -- wastes less time
all_possibilities = itertools.product(possible_values, repeat=length)
for i in range(total_buckets):
observed_collision_probability = calculate_num_collisions(
gen_code=_gen_code, iterations=iterations_to_check, codes=codes
)
current_buckets = total_buckets - len(codes)
writer.writerow(
dict(
observed_collision_probability=observed_collision_probability,
current_buckets=current_buckets,
)
)
try:
codes.add("".join(next(all_possibilities)))
except StopIteration:
break
So, when I take the data generated from this script, and plot it along with a fitted linear equation...
I get the following:
The point is, it looks very linear.
... Choosing a quadratic objective function instead results in a coefficient of 0 on the quadratic component. (Smaller questions might include whether I used the scipy.optimize.curve_fit
function correctly here, as I am new to usage of data science python libraries.)
Thanks in advance, sorry for openness of question.
I referred to these questions, but I'm still confused:
- What is the probability of collision with a 6 digit random alphanumeric code?
- random.choice not random
- Random is barely random at all?
- What is the difference between random.sample and random.shuffle in Python
Aucun commentaire:
Enregistrer un commentaire