I have an array of dimension (20000000, 247), around 30 GB in size, stored in a .npy file, and 32 GB of available memory. I need to shuffle the data along the rows. I have opened the file in mmap_mode, but if I try anything other than in-place modification, for example np.random.permutation, or building a permuted index array p and then taking array[p], I get a MemoryError. I have also tried shuffling the array in chunks and then stacking the chunks to rebuild the full array, but that raises a MemoryError as well. A sketch of these failing attempts is below.
The only solution I have found so far is loading the file with mmap_mode='r+' and then calling np.random.shuffle. However, it takes forever; it has been five hours and the array is still being shuffled.
Current code:
import numpy as np

# Memory-mapped in r+ mode, so rows are swapped on disk rather than in RAM
array = np.load('data.npy', mmap_mode='r+')
np.random.seed(1)
np.random.shuffle(array)  # in-place shuffle along the first axis
Is there any faster method to do this without breaking the memory constraint?