I'm looking for a hash function f that maps short strings (say, no longer than 100 characters) into the integer interval [0, N), where N is typically between 10 and 100, and that distributes values as uniformly as possible across the buckets 0, 1, ..., N-1.
The solution I currently have combines SHA1, CRC32 and a bunch of reversible int -> int transformations (as outlined, for example, in this answer), followed by a final modulo operation. It works rather well, but I feel there is still room for improvement, since the buckets I get are not always as evenly sized as I had hoped.
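To make "evenly sized" concrete, here is a minimal sketch of how bucket sizes could be compared for different candidate functions. It is not my actual pipeline: the helper names (`sha1_int`, `crc32_int`, `bucket_counts`, `max_relative_deviation`) and the synthetic `row-<i>` identifiers are assumptions made purely for illustration.

```python
import hashlib
import zlib
from collections import Counter


def sha1_int(data: bytes) -> int:
    # Interpret the first 8 bytes of the SHA1 digest as an unsigned integer.
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def crc32_int(data: bytes) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return zlib.crc32(data)


def bucket_counts(identifiers, n_buckets, hash_fn):
    # Count how many identifiers land in each bucket 0 .. n_buckets-1.
    counts = Counter(hash_fn(s.encode("utf-8")) % n_buckets for s in identifiers)
    return [counts.get(b, 0) for b in range(n_buckets)]


def max_relative_deviation(counts):
    # Largest deviation of any bucket from the ideal size, as a fraction of it.
    expected = sum(counts) / len(counts)
    return max(abs(c - expected) for c in counts) / expected


if __name__ == "__main__":
    ids = [f"row-{i}" for i in range(10_000)]  # hypothetical identifiers
    for name, fn in [("sha1", sha1_int), ("crc32", crc32_int)]:
        counts = bucket_counts(ids, 10, fn)
        print(name, counts, round(max_relative_deviation(counts), 4))
```

Running the same comparison on the real identifiers, rather than synthetic ones, is what would actually matter for my data.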
Last but not least, let me briefly outline my use case: I use this hash function to split a labeled data set into training, validation and test sets for supervised machine learning, based on the string identifiers of the individual rows. So, given a hash into [0, 10), I define my training data as the rows with hashes {0, 1, 2, 3, 4, 5}, my validation data as those with hashes {6, 7}, and my test data as those with hashes {8, 9}. I could of course just use a random split, but the hash-based method seems very appealing to me because it's stable, flexible and transparent.
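The split described above can be sketched roughly as follows. This is a simplified stand-in that uses plain SHA1 followed by a modulo instead of my full combination of transformations; the function names `hash_bucket`, `split_for_bucket` and `assign_split`, and the example identifiers, are hypothetical.

```python
import hashlib


def hash_bucket(identifier: str, n_buckets: int = 10) -> int:
    # Simplified stand-in: first 8 bytes of the SHA1 digest, reduced modulo n_buckets.
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def split_for_bucket(bucket: int) -> str:
    # Buckets 0-5 -> training, 6-7 -> validation, 8-9 -> test.
    if bucket <= 5:
        return "train"
    if bucket <= 7:
        return "validation"
    return "test"


def assign_split(identifier: str) -> str:
    # The split depends only on the identifier, so it is stable across runs.
    return split_for_bucket(hash_bucket(identifier))


if __name__ == "__main__":
    for row_id in ["row-1", "row-2", "row-3"]:  # hypothetical identifiers
        print(row_id, assign_split(row_id))
```

Because a row's split depends only on its identifier, adding or removing other rows never moves it to a different set.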
To sum things up: What kind of hash function would you suggest with the aforementioned properties, for the use case I've just described?