Monday, August 3, 2020

PyTorch: why does numpy.random produce a repeated or identical sequence of values in DataLoader workers?

I know there is a lot of discussion about making results reproducible so that repeated runs return the same output; here, however, I want to add some randomness inside my PyTorch Dataset. I am using PyTorch and its DataLoader. An attribute of my data should be generated randomly every time the dataset's __getitem__ is called, and it should be different every epoch; I do not want the value to be created once and then stored. When I set num_workers to 0 this works, but when I set num_workers > 0 the random attribute is always the same for identical elements in the batch. My first question is why exactly this happens (I assume PyTorch caches the dataset once it is generated, but then why do multiple workers behave differently here?), and my second question is how I can obtain the desired behavior most efficiently.

Small code sample: I want each call of __getitem__ to return different values:

import numpy as np
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __getitem__(self, index):
        # Should produce a fresh random value on every call.
        return np.random.uniform(0, 1)

    def __len__(self):
        return 5


dataset = MyDataset()
loader = DataLoader(
    dataset,
    num_workers=2,
    shuffle=False
)
for data in loader:
    print(data)

print('----')
# Second pass over the loader: with num_workers > 0 the same values appear again.
for data in loader:
    print(data)

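A minimal sketch of one possible fix, assuming the repetition comes from every worker process starting out with the same global NumPy RNG state: reseed NumPy inside each worker through the DataLoader's worker_init_fn, deriving the seed from torch.initial_seed(), which PyTorch already makes unique per worker and per epoch. The helper name seed_numpy_per_worker is purely illustrative.

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_numpy_per_worker(worker_id):
    # torch.initial_seed() differs for every worker and for every new iteration
    # of the loader, so reseeding NumPy from it gives each worker its own stream.
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(
    MyDataset(),  # the Dataset defined above
    num_workers=2,
    shuffle=False,
    worker_init_fn=seed_numpy_per_worker,
)

for data in loader:
    print(data)

Alternatively, drawing the value with torch.rand(1) inside __getitem__ sidesteps the problem, because PyTorch automatically seeds its own RNG differently in every worker.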


