Friday, 30 June 2017

How can I shuffle a very large list stored in a file in Python?

I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.

This would be the naive (and totally nonfunctional) way of doing it, just to make clear what I want.

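import random

# In Python 3, range() is lazy, so it has to be materialized before shuffling.
numbers = list(range(2**32))
random.seed(0)           # fixed seed so the result is deterministic
random.shuffle(numbers)  # shuffles the list in place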

I've tried making the list with numpy.arange() and using pycrypto's random.shuffle() to shuffle it. Making the list used about 8 GB of RAM, and shuffling raised that to around 25 GB. I only have 32 GB to give. But that doesn't matter because...

I've tried cutting the list into 1024 slices and trying the above, but even one of these slices takes way too long. I cut one of these slices into 128 even smaller slices, and those took about 620 ms each. If the time grew linearly, the whole thing would take about 22 and a half hours to complete. That sounds all right, but it doesn't grow linearly.
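For reference, this is roughly how the linear estimate works out. It is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and the linear-scaling assumption is exactly the part that does not hold in practice.

```python
# Back-of-the-envelope estimate, ASSUMING shuffle time scales linearly
# with the number of sub-slices (which it does not in practice).
SLICES = 1024                   # first-level slices of the full list
SUB_SLICES = 128                # sub-slices per first-level slice
SECONDS_PER_SUB_SLICE = 0.620   # measured time for one sub-slice


def linear_estimate_hours(slices=SLICES,
                          sub_slices=SUB_SLICES,
                          seconds_each=SECONDS_PER_SUB_SLICE):
    """Total time in hours if every sub-slice took the same time."""
    total_seconds = slices * sub_slices * seconds_each
    return total_seconds / 3600


estimate = linear_estimate_hours()
print(f"{estimate:.2f} hours")  # about 22 and a half hours
```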

Another thing I've tried is generating a random number for every entry and using it as the index of that entry's new location. I then go down the list and attempt to place each number at its new index. If that index is already in use, the index is incremented until a free one is found. This works in theory, and it gets about halfway through, but near the end it keeps having to search for free spots, wrapping around the list several times.
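A scaled-down sketch of that placement scheme, to make the behaviour concrete. The function name `probe_shuffle`, the probe counter, and the small size used here are assumptions for illustration only; this is not the exact code I ran on the full range.

```python
import random


def probe_shuffle(n, seed=0):
    """Place each number at a random index, stepping forward on collisions.

    Returns the filled list and the total number of extra probe steps.
    """
    rng = random.Random(seed)   # seeded so the result is deterministic
    slots = [None] * n          # None marks a free slot
    total_probes = 0
    for value in range(n):
        index = rng.randrange(n)
        # Walk forward (wrapping around) until a free slot turns up.
        while slots[index] is not None:
            index = (index + 1) % n
            total_probes += 1
        slots[index] = value
    return slots, total_probes


shuffled, probes = probe_shuffle(10_000)
print(len(shuffled), "entries placed with", probes, "extra probe steps")
```

Every number ends up in exactly one slot, but the count of extra probe steps climbs sharply as the list fills up, which matches the slowdown I see near the end.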

Is there any way to pull this off? Is this a feasible goal at all?



