mardi 13 août 2019

Randomly split up elements from a stream of data without knowing the total number of elements

Given a "split ratio", I am trying to randomly split up a dataset into two groups. The catch is, that I do not know beforehand how many items the dataset contains. My library receives the data one by one from an input stream and is expected to return the data to two output streams.

Illustration:

                            ┌─► stream A
 input stream ──► LIBRARY ──┤
                            └─► stream B

For example, given a split ratio of 30/70, stream A would be expected to receive 30% of the elements from the input stream and stream B the remaining 70%. The order must remain.


My ideas so far:

Idea 1: "Roll the dice" for each element

The obvious approach: For each element the algorithm randomly decides if the element should go into stream A or B. The problem is, that the resulting data sets might be far off from the expected split ratio. Given a split ratio of 50/50, the resulting data split might be something far off (could even be 100/0 for very small data sets). The goal is to keep the resulting split ratio as close as possible to the desired split ratio.

Idea 2: Use a cache and randomize the cached data

Another idea is to cache a fixed number of elements before passing them on. This would result in caching 1000 elements and shuffling the data (or their corresponding indices to keep the order stable), splitting them up and passing the resulting data sets on. This should work very well, but I'm unsure if the randomization is really random for large data sets (I imagine there will patterns when looking at the distribution).

Both algorithms are not optimal, so I hope you can help me.


Background

This is about a layer-based data science tool, where each layer receives data from the previous layer via a stream. This layer is expected to split the data (vectors) up into a training and test set before passing them on. The input data can range from just a few elements to a never ending stream of data (hence, the streams). The code is developed in JavaScript, but this question is more about the algorithm than the actual implementation.




Aucun commentaire:

Enregistrer un commentaire