The situation: I have a big dataset with more than 18 million examples. I train several models and want to track the accuracy.
When I forward all examples and compute the accuracy, it comes out at approximately 83 percent. But this takes a long time.
So I tried sampling a small subset of the whole dataset and computing the accuracy on that. I expect to see approximately the same number (around 80 percent):
import numpy as np

total = 4096
N = dataset.shape[0]
# randint's upper bound is exclusive, so N (not N-1) covers every row index
indices = np.random.randint(N, size=total)
batch = dataset[indices, :]
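To put a number on "approximately the same", here is a minimal sketch. It assumes the 4096 indices are drawn uniformly and takes the full-dataset accuracy of 83 percent as the true rate. The helper name `expected_accuracy_range` is hypothetical, and the interval uses the usual normal approximation to the binomial.

```python
import math

def expected_accuracy_range(p, n, z=1.96):
    """Approximate 95% interval for accuracy measured on n uniformly sampled examples.

    p is the assumed true accuracy, n the subset size (both assumptions for this sketch).
    """
    se = math.sqrt(p * (1.0 - p) / n)  # binomial standard error of the sample mean
    return p - z * se, p + z * se

low, high = expected_accuracy_range(0.83, 4096)
print(f"{low:.3f} .. {high:.3f}")
```

Under these assumptions, a 4096-example subset should land roughly between 0.82 and 0.84.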
However, when I run it for 10 'random' batches, the output looks like this:
> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4096/4096
> 1.0
So on these batches the model always performs far too well, as if the sampling drew almost exclusively from the roughly 80 percent of examples it gets right. What can I do to make the sampling truly random, so that it gives a realistic view of the accuracy?
This also makes the training go wrong, because for the next training batch only the good examples are sampled.
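As a sanity check on the index sampling itself, here is a minimal sketch on synthetic data. The array `correct` is a hypothetical stand-in for per-example correctness flags (it is not part of the original code), and the seeded `rng` is only there to make the sketch reproducible. It draws indices uniformly without replacement using NumPy's `Generator.choice` and compares the subset accuracy with the full accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical stand-in for the real data: one flag per example,
# True where the model is correct (about 83 percent of the time).
N = 1_000_000
correct = rng.random(N) < 0.83

total = 4096
# Uniform indices in [0, N), drawn without replacement.
indices = rng.choice(N, size=total, replace=False)

full_accuracy = correct.mean()
subset_accuracy = correct[indices].mean()
print(full_accuracy, subset_accuracy)
```

If the two numbers agree on synthetic flags like these but not on the real data, the index sampling is unlikely to be the source of the difference.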