I am new to Spark and am looking to segment data randomly.
Say I have data like so:
+------+-------+-------+-------+
| Col1 | Col2  | Col3  | Col4  |
+------+-------+-------+-------+
| A    | 0.532 | 0.234 | 0.145 |
| B    | 0.242 | 0.224 | 0.984 |
| A    | 0.152 | 0.753 | 1.413 |
| C    | 0.149 | 0.983 | 0.786 |
| D    | 0.635 | 0.429 | 0.683 |
| E    | 0.938 | 0.365 | 0.328 |
| C    | 0.293 | 0.956 | 0.963 |
| D    | 0.294 | 0.234 | 0.298 |
| E    | 0.294 | 0.394 | 0.928 |
| D    | 0.294 | 0.258 | 0.689 |
| A    | 0.687 | 0.666 | 0.642 |
| C    | 0.232 | 0.494 | 0.494 |
| D    | 0.575 | 0.845 | 0.284 |
+------+-------+-------+-------+
In my real data, Col1 has many more distinct groups/categories. I want to assign records randomly by Col1, meaning that if the label A is randomly selected, all records with Col1 = A will go to RDD1. The split I want is:
- 30% to go to one RDD
- Another 30% to go to a second RDD
- Another 30% to go to a third RDD
- The final 10% to go to a fourth RDD
When I say 30%, I mean 30% of the unique values of Col1. So if my Col1 labels are [A, B, C, D, E, F, G, H, I, J], then 3 of those labels, together with all of their associated rows, go to the first grouping, and so on.
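For concreteness, this is roughly what I mean by splitting the labels themselves (plain Python on the driver; the label list and the seed value 42 are made up purely for illustration):

    import random

    # Hypothetical list of distinct Col1 labels, already collected to the driver.
    labels = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

    rng = random.Random(42)   # fixed seed so the split is reproducible
    shuffled = labels[:]
    rng.shuffle(shuffled)

    n = len(shuffled)
    cut1, cut2, cut3 = int(0.3 * n), int(0.6 * n), int(0.9 * n)

    label_groups = [
        set(shuffled[:cut1]),      # first 30% of the labels
        set(shuffled[cut1:cut2]),  # next 30%
        set(shuffled[cut2:cut3]),  # next 30%
        set(shuffled[cut3:]),      # final 10%
    ]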
I can imagine a chain of .map() calls where I emit a tuple of (Col1_Label, bucket), with the bucket being a random value from 0 to 3 drawn with probabilities 0.3, 0.3, 0.3 and 0.1, and then do the subsequent filtering, roughly as in the sketch below.
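Put more concretely, something like this rough PySpark sketch is what I have in mind (sc and rdd are placeholders, the rows are assumed to be tuples with the Col1 label in position 0, and I've moved the random draw to the driver so that every row with the same label lands in the same bucket):

    import random

    # `rdd` is assumed to hold tuples like ("A", 0.532, 0.234, 0.145).
    # Sorting the collected labels keeps the seeded draws deterministic.
    labels = sorted(rdd.map(lambda row: row[0]).distinct().collect())

    rng = random.Random(42)   # fixed seed so the assignment can be replicated
    label_to_bucket = {
        lab: rng.choices([0, 1, 2, 3], weights=[0.3, 0.3, 0.3, 0.1])[0]
        for lab in labels
    }
    bucket_bc = sc.broadcast(label_to_bucket)   # ship the mapping to the executors

    rdd0 = rdd.filter(lambda row: bucket_bc.value[row[0]] == 0)
    rdd1 = rdd.filter(lambda row: bucket_bc.value[row[0]] == 1)
    rdd2 = rdd.filter(lambda row: bucket_bc.value[row[0]] == 2)
    rdd3 = rdd.filter(lambda row: bucket_bc.value[row[0]] == 3)

This works in my head, but it feels clunky.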
Is there an easier way, ideally one that lets me set a seed so I can replicate the results?