Monday, 13 February 2017

NumPy randint not really random?

The situation: I have a big dataset with more than 18 million examples. I train several models and want to track the accuracy.

When I run a forward pass over all examples and compute the accuracy, it comes out at approximately 83 percent. But this takes a long time.

So I try to sample a small subset of the whole dataset and compute the accuracy on that. I expect to see approximately the same number (around 80 percent):

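import numpy as np

total = 4096  # number of examples per sampled batch
N = dataset.shape[0]
# randint's upper bound is exclusive, so pass N (not N-1) to include the last row
indices = np.random.randint(N, size=total)
batch = dataset[indices, :]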

However, when I run this for 10 'random' batches, the output looks like this:

> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4096/4096
> 1.0

So the sampled accuracy is always far too high, as if the sampling draws almost exclusively from the roughly 80 percent of examples that are classified correctly. What can I do to make the sampling truly random, so that it gives a representative view of the accuracy?
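One way to check whether the index draw itself is at fault is to run the same kind of draw against synthetic data where the true accuracy is known. The sketch below is a stand-in under stated assumptions, not the original pipeline: `correct` is a hypothetical boolean array marking whether each example was predicted correctly, filled so that about 83 percent are true, and `N` is shrunk to one million to keep it quick. If uniform indices land near 0.83 here, the draw is probably not the culprit, and the accuracy computation or the contents of `dataset` would be the next thing to inspect.

```python
import numpy as np

# Hypothetical stand-in for per-example correctness (not the real dataset).
rng = np.random.default_rng(0)
N = 1_000_000
total = 4096
correct = rng.random(N) < 0.83  # about 83 percent of entries are True

full_accuracy = correct.mean()

# Draw 10 batches of uniformly random indices, as in the question.
subset_accuracies = []
for _ in range(10):
    indices = rng.integers(0, N, size=total)  # upper bound is exclusive
    subset_accuracies.append(correct[indices].mean())

# Binomial standard error of an accuracy estimate from `total` samples.
std_error = np.sqrt(full_accuracy * (1 - full_accuracy) / total)

print("full accuracy:", full_accuracy)
print("subset accuracies:", np.round(subset_accuracies, 4))
print("expected standard error:", std_error)
```

With 4096 samples the standard error is well under one percentage point, so a uniformly drawn subset should stay close to the full-dataset figure rather than near 100 percent.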

This also makes the training go wrong, because only the good examples are sampled for the next training batch.
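For the training batches, a common alternative to independent random draws is to shuffle all indices once per epoch and walk through them in slices, so that every example is visited exactly once. The sketch below assumes that this fits the training loop; `iterate_minibatches` is a hypothetical helper name, and the small array only stands in for `dataset`.

```python
import numpy as np

def iterate_minibatches(dataset, batch_size, rng):
    """Yield batches that cover every row once, in shuffled order."""
    order = rng.permutation(dataset.shape[0])
    for start in range(0, len(order), batch_size):
        yield dataset[order[start:start + batch_size], :]

# Small stand-in array; the real dataset has more than 18 million rows.
rng = np.random.default_rng(0)
dataset = np.arange(20).reshape(10, 2)
batches = list(iterate_minibatches(dataset, batch_size=4, rng=rng))
seen_rows = np.concatenate(batches, axis=0)
```

The last batch may be smaller than `batch_size` when the number of rows is not a multiple of it.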



