Tuesday, April 25, 2023

Two levels of sampling

I have a bunch of Things.

A Thing is a struct with a field, source, typed as a string.

Currently I select a deterministic sample of Things by hashing each Thing and keeping it if the hash, modulo 100, falls below a percentage threshold.

def is_thing_sampled(t: Thing):
    hashed_thing = my_deterministic_hash(t)
    return hashed_thing % 100 < sample_size_pct
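
To make the snippet above concrete, here is a minimal runnable sketch. It rests on assumptions that are not in my real code: the `id` field on Thing is hypothetical (added so that two Things with the same source hash differently), SHA-256 stands in for my_deterministic_hash, and sample_size_pct is passed as an argument instead of being a global.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    source: str
    id: int  # hypothetical field, only here so Things are distinguishable


def my_deterministic_hash(data: str) -> int:
    # Assumed stand-in for the real hash. SHA-256 gives the same value on
    # every run, unlike Python's built-in hash() on strings.
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def is_thing_sampled(t: Thing, sample_size_pct: int) -> bool:
    # Map the hash to a bucket in 0..99 and keep the lowest buckets.
    return my_deterministic_hash(f"{t.source}:{t.id}") % 100 < sample_size_pct
```

The same Thing always lands in the same bucket, so the decision is repeatable across runs and machines.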

Now I want to extend this function so that it additionally samples Things of a specific source. If the source is "foo", I want to apply a second level of sampling on top of the first.

def is_thing_sampled(t: Thing):
    hashed_thing = my_deterministic_hash(t)
    base = hashed_thing % 100 < sample_size_pct
    if base and t.source == "foo":
        # try to sample again. How do I do this??
        double_hash = my_deterministic_hash(hashed_thing)
        return double_hash % 100 < foo_sample_size_pct

    return base
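
One variant of the second level, sketched under assumptions and not something I know to be the right answer: instead of rehashing the first hash, hash the Thing again with a different salt, so the two decisions come from two separate hash values. The `bucket` helper, the salt strings, the `id` field, and the use of SHA-256 are all hypothetical choices for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    source: str
    id: int  # hypothetical field, only here so Things are distinguishable


def bucket(key: str, salt: str) -> int:
    # Hypothetical helper: salted SHA-256 mapped to a bucket in 0..99.
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_thing_sampled(t: Thing, sample_size_pct: int, foo_sample_size_pct: int) -> bool:
    key = f"{t.source}:{t.id}"
    # First level: applies to every Thing.
    base = bucket(key, "level-1") < sample_size_pct
    if base and t.source == "foo":
        # Second level: a different salt gives a fresh bucket for the same Thing.
        return bucket(key, "level-2") < foo_sample_size_pct
    return base
```

If the two buckets behave independently, the expected overall rate for "foo" is the product of the two percentages: with sample_size_pct = 50 and foo_sample_size_pct = 20, roughly 10% of "foo" Things should pass, while other sources stay at roughly 50%. As far as I can tell, rehashing the first hash should give a similar result when the hash mixes well; the salt just makes the separation explicit.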

Can someone help me understand what's the right approach? I'd love some pointers - I'm a total noob at statistics.
