Wednesday, 15 September 2021

Function to hash short strings into [0, N) for small N

I'm looking for a hash function f that maps short strings (say, no longer than 100 characters) into the integer interval [0, N), where N is typically between 10 and 100, and that distributes values as uniformly as possible across the buckets 0, 1, ..., N-1.

The solution I currently have combines SHA1, CRC32 and a bunch of reversible int -> int transformations (as outlined, for example, in this answer), followed by a final modulo operation. It works rather well, but I feel there is still room for improvement, since the buckets I get are not always as evenly sized as I had hoped.
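For concreteness, here is a minimal Python sketch of the general "hash, then modulo" shape described above. It is not the exact pipeline from the question: it uses only SHA-1 from the standard library and leaves out the CRC32 and int -> int mixing steps, and the name `bucket` and the choice of taking the first 8 digest bytes are assumptions made for illustration.

```python
import hashlib


def bucket(s: str, n: int) -> int:
    """Map a string to an integer in [0, n) via SHA-1 and a final modulo.

    Hypothetical helper: only SHA-1 is used here; the CRC32 and
    int -> int mixing steps mentioned in the question are omitted.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned 64-bit integer.
    value = int.from_bytes(digest[:8], "big")
    # For n between 10 and 100 the modulo bias on a 64-bit value is negligible.
    return value % n
```

A call such as `bucket("row-42", 10)` returns the same value on every run and on every machine, since it depends only on the bytes of the string.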

Last but not least, let me briefly outline my use case: I use this hash function to split a labeled data set into training, validation and test sets for supervised machine learning, based on the string identifiers of the individual rows. So, given a hash into [0, 10), I define my training data to have hashes {0, 1, 2, 3, 4, 5}, my validation data to have hashes {6, 7}, and my test data to have hashes {8, 9}. I could of course just use a random split, but the hash-based method is very appealing to me because it's stable, flexible and transparent.
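The split described above could be sketched as follows in Python. The bucket ranges come straight from the description; the function names `bucket` and `assign_split`, and the use of plain SHA-1 as the underlying hash, are assumptions for illustration rather than the actual implementation.

```python
import hashlib


def bucket(s: str, n: int = 10) -> int:
    """Hypothetical helper: hash a string into [0, n) with SHA-1 and modulo."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def assign_split(row_id: str) -> str:
    """Assign a row to a split based only on its string identifier."""
    b = bucket(row_id, 10)
    if b <= 5:   # hashes {0, 1, 2, 3, 4, 5}
        return "train"
    if b <= 7:   # hashes {6, 7}
        return "validation"
    return "test"  # hashes {8, 9}
```

Because the assignment depends only on the identifier, a row stays in the same split when the data set is reshuffled or extended, and with a uniform hash the three splits should come out at roughly 60%, 20% and 20% of the rows.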

To sum up: what kind of hash function with the aforementioned properties would you suggest for the use case I've just described?



