Wednesday, April 7, 2021

Reproducibility in Latent Dirichlet Allocation using numpy.random.seed

I am using LDA for topic modeling in a larger software project. When the application is first initialized, we run several tests to make sure everything is working. One of these tests runs LDA on a toy dataset and checks the output against a known result for that same dataset. The problem is that LDA is inherently random: if I run LDA on the dataset, store the results, and compare them against a later run on the same dataset, the results will be quite different.

So I tried calling np.random.seed(111) before running LDA, both in the initial phase and in the test phase. This gives me exactly consistent results, so I can check them for strict equality (which makes things much easier). That said, I've noticed that this is not considered best practice, because it changes the global random number generator and can therefore affect all kinds of other results within the application. What I would like to do is set the seed only during my testing and then reset it once the testing is complete. At this point, I would prefer not to pass an rng = np.random.default_rng(111) object around, because all the navigation happens through routes, and that would be a super pain in the ass.

I'm looking for suggestions on how best to accomplish this. Like I said, I think the easiest thing would be to call np.random.seed(111) and then something like np.random.reset_seed (which is not an actual function, just something I'm hoping for).
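
To make the idea concrete, here is a minimal sketch of the "seed, then reset" behaviour I have in mind. It assumes the LDA implementation draws from NumPy's legacy global generator (the one np.random.seed controls). The name temporary_seed is hypothetical, and np.random.rand only stands in for the actual LDA call. It saves the global state with np.random.get_state(), seeds, and restores the saved state with np.random.set_state() afterwards.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def temporary_seed(seed):
    """Seed NumPy's global RNG inside the block, then restore the old state.

    Hypothetical helper: only affects code that uses the legacy global
    generator (np.random.seed, np.random.rand, ...), not Generator objects
    created with np.random.default_rng.
    """
    saved_state = np.random.get_state()  # snapshot of the global RNG state
    np.random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the test body raises, so the seed never leaks out.
        np.random.set_state(saved_state)


# Stand-in for "run LDA on the toy dataset"; the real call would go here.
with temporary_seed(111):
    first_run = np.random.rand(5)

with temporary_seed(111):
    second_run = np.random.rand(5)

# Both runs used the same seed, so strict equality holds.
print(np.array_equal(first_run, second_run))
```

Outside the with block the global generator continues from wherever it was before, so the rest of the application should not see the test seed. I don't know whether this is thread-safe, so it probably only makes sense if the tests run in a single thread.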



