Wednesday, September 25, 2019

Shuffling large memory-mapped numpy array

I have an array of shape (20000000, 247), around 30 GB on disk in a .npy file, and 32 GB of available memory. I need to shuffle the data along its rows. I have opened the file in mmap_mode, but anything other than in-place modification raises a MemoryError: for example, np.random.permutation, or building a permuted index array p and returning array[p]. I have also tried shuffling the array in chunks and then stacking the chunks to rebuild the full array, but that raises a MemoryError as well. The only solution I have found so far is to load the file with mmap_mode='r+' and then run np.random.shuffle. However, it takes forever (it has been running for 5 hours and is still shuffling).

Current code:

import numpy as np

# Open the .npy file as a writable memory map so rows are swapped on disk.
array = np.load('data.npy', mmap_mode='r+')
np.random.seed(1)
# Fisher-Yates shuffle along the first axis, in place.
np.random.shuffle(array)
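
One chunk-wise variant I am considering would precompute a permutation of the row indices (20 million int64 indices is only about 160 MB) and copy rows batch by batch into a second memory-mapped .npy file, so only one batch is ever held in RAM. A rough sketch, assuming there is disk space for a second 30 GB file ('shuffled.npy' and the batch size are placeholders):

import numpy as np

src = np.load('data.npy', mmap_mode='r')  # read-only memmap of the source
n_rows = src.shape[0]

rng = np.random.RandomState(1)
p = rng.permutation(n_rows)               # ~160 MB of indices in RAM

# Destination memmap with the same shape and dtype, filled batch by batch.
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)
batch = 100_000
for start in range(0, n_rows, batch):
    idx = p[start:start + batch]
    dst[start:start + batch] = src[idx]   # only one batch of rows lives in RAM
dst.flush()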

Is there any faster method to do this without breaking the memory constraint?



