I'm looking for a hash function f that maps short strings (say, no longer than 100 characters) into the integer interval [0, N), where N is typically between 10 and 100, and that distributes values as uniformly as possible across the buckets 0, 1, ..., N-1.
The solution I currently have combines SHA1, CRC32 and a bunch of reversible int -> int transformations, as outlined for example in this answer, followed by a final modulo operation. It works rather well, but I feel there is still room for improvement, since the buckets I get are not always as evenly sized as I had hoped.
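For reference, here is a minimal sketch of the "hash, then modulo" idea, together with a quick way to look at the bucket sizes. It is not my exact pipeline: it uses SHA-1 alone, and the function name `bucket` and the `row-<i>` identifiers are made up for illustration.

```python
import hashlib
from collections import Counter


def bucket(identifier: str, n: int) -> int:
    """Map a string to a bucket in [0, n) via SHA-1 and a final modulo."""
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned 64-bit integer.
    # Since 2**64 is vastly larger than n, the modulo bias is negligible.
    value = int.from_bytes(digest[:8], "big")
    return value % n


# Hypothetical identifiers, only used to inspect how full each bucket gets.
ids = [f"row-{i}" for i in range(10_000)]
counts = Counter(bucket(s, 10) for s in ids)

for b in range(10):
    print(b, counts[b])
```

Note that even a hash that behaves like a uniform random function will not give perfectly equal buckets: each count fluctuates around len(ids) / N, so some unevenness is to be expected, especially for small data sets.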
Last but not least, let me briefly outline my use case: I use this hash function to split a labeled data set into train, validation and test sets for supervised machine learning, based on the string identifiers of the individual rows. So, given a hash into [0, 10), I define my training data to have hashes {0, 1, 2, 3, 4, 5}, my validation data to have hashes {6, 7}, and finally my test data to have hashes {8, 9}. I could of course also just use a random split, but the method with the hashes seems very appealing to me, because it's stable, flexible and transparent.
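To make the split concrete, this is roughly what the assignment looks like. Again only a sketch under simplifying assumptions: a plain SHA-1 hash stands in for my actual combination, and `split_for` and the example identifiers are hypothetical names.

```python
import hashlib


def split_for(identifier: str) -> str:
    """Assign a row to 'train', 'validation' or 'test' from its string id."""
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    # Hash into [0, 10): first 8 bytes as an unsigned integer, then modulo.
    h = int.from_bytes(digest[:8], "big") % 10
    if h <= 5:   # hashes {0, 1, 2, 3, 4, 5}
        return "train"
    if h <= 7:   # hashes {6, 7}
        return "validation"
    return "test"  # hashes {8, 9}


# The assignment depends only on the identifier, so it stays the same
# across runs and when new rows are added to the data set.
for row_id in ["sample-001", "sample-002", "sample-003"]:
    print(row_id, split_for(row_id))
```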
To sum things up: What kind of hash function would you suggest with the aforementioned properties, for the use case I've just described?