Friday, August 26, 2016

How to organize code that uses random numbers in a version control system for analysis?

This is mostly a question about how to organize files in a VCS such as git, but in order to give a good picture of my problem, here is a quick introduction to the subject matter:

Project Overview

I am working on a project in which neural-network-like probabilistic models are implemented and tested for different sets of parameters. We currently implement in Python, although the problem is probably relevant for other programming languages as well. The output is usually error measurements, graphs, or something similar. At the moment our project works as follows:

  • several people are working on the code base of the project and are implementing new features

  • some other people are already trying to explore the behavior of the model for different parameter sets, i.e. figuring out for which parameter ranges the model shows qualitatively different behavior

At the moment we use git with GitHub as our VCS, with one master branch for the current stable version and one branch per team member for active development. We exchange code by merging between branches, and we merge into master whatever seems to be a stable new feature.

One big problem in general is that this is a research project without a clear project outline. Sometimes we are specifically fixing bugs or implementing something planned using feature branches. But sometimes it is not clear what exactly the next feature will be, or whether what we have in mind can be implemented at all. Some of us are basically exploring the behavior of our model in a more or less structured way. I know. But that's how it is.

Controlling Probabilistic Behavior

Our model is probabilistic on many levels. Various parts are initialized with random numbers, and random numbers are also used while the model simulation is running.

Of course the best way to explore a probabilistic model is to let it run many times and analyze the results statistically. But for demonstration purposes, or in order to explore some specific behavior more deeply, you also want individual cases to be reproducible. Currently we do this by setting the seed of the random number generator at the beginning, e.g. with NumPy in Python:

import numpy as np
np.random.seed(42)
a = np.random.rand() # -> will always be 0.3745401188473625
b = np.random.rand() # -> will always be 0.9507143064099162
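
For completeness, the same reproducible sequence can also be obtained from a dedicated generator object rather than the global seed. This is only an illustrative sketch using NumPy's RandomState (we currently just set the global seed as shown above):

import numpy as np

# Minimal sketch: a dedicated generator object instead of the global seed.
# Each experiment owns its own RandomState, so its draws do not depend on
# what other code pulls from np.random.
rng = np.random.RandomState(42)
a = rng.rand() # -> will always be 0.3745401188473625
b = rng.rand() # -> will always be 0.9507143064099162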

Version Control Problems

We have identified two issues with our current setup:

1) How do we store snapshots of a specific behavior for later exploration?

In order to label snapshots appropriately, we thought about using both branches and tags for specific experiments and for found sets of parameters, like this:

* master
|
*---------------
|\              \
* * experiment1  * experiment2
| |              |
. * tag setting1 * tag setting1
. |              |
. * tag setting2 * tag setting2

The problem here is that, as far as we understand, tagged commits are not meant to be changed later. Since we might keep working on these settings, we would have to branch off again from the specific tag.

Another option would be to use only branches, one for every found setting, so that each branch head corresponds to one working state of the system. But this would lead to a huge number of branches for everything we identify.

So how would you organize a structure like this? Especially with the following problem in mind:

2) How do we merge changes into stored snapshots without changing the probabilistic behavior?

Suppose one of our developers finds a bug in the implementation we have so far, or implements a very useful feature, and fixes it in the master branch. It might then be very beneficial to use these changes for later analysis of one of the identified behaviors of the model. The problem is that if the changes use random numbers, chances are the behavior of the model will be completely different after merging.

import numpy as np
np.random.seed(42)
a = np.random.rand() # -> will always be 0.3745401188473625

# fixing some stuff here
c = np.random.rand()
# -> will be 0.9507143064099162 as was previously 'b'

b = np.random.rand()
# -> will now be 0.7319939418114051 instead of 0.9507143064099162

# ...
# code using 'b' will behave differently

This is really a big problem, because it means that:

  • either we cannot use new features or apply bugfixes in the analysis of already identified interesting sets of parameters and random conditions (unless the changes leave the random numbers untouched),

  • or we have to identify these settings again and again after every change that uses random numbers

Of course the problem is easy to see in the code shown here, which involves only a few random calls. But in the models we have, random numbers are generated many times, and the number of iterations is often itself influenced by the output of computations in other parts that also involve random numbers.
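
To make this cascading effect concrete, here is a minimal toy sketch (the simulate function and its structure are invented purely for illustration): a single extra upstream draw changes the randomly chosen number of iterations, and with it every value that follows.

import numpy as np

def simulate(extra_draw=False):
    # Toy model: the number of iterations is itself drawn from the RNG,
    # so one extra upstream draw shifts everything downstream.
    np.random.seed(42)
    if extra_draw:
        _ = np.random.rand() # e.g. a new feature consuming one random number
    n_steps = np.random.randint(1, 10) # iteration count depends on the RNG state
    return [np.random.rand() for _ in range(n_steps)]

print(simulate()) # one reproducible trajectory
print(simulate(extra_draw=True)) # different length and completely different values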

Do you have any recommendations concerning this issue?



