Wednesday, July 22, 2020

PySpark: Sample full groups of data based on an indicator column

I am new to Spark, and looking to segment data randomly.

Say I have data like so:

+------+-------+-------+-------+
| Col1 | Col2  | Col3  | Col4  |
+------+-------+-------+-------+
| A    | 0.532 | 0.234 | 0.145 |
| B    | 0.242 | 0.224 | 0.984 |
| A    | 0.152 | 0.753 | 1.413 |
| C    | 0.149 | 0.983 | 0.786 |
| D    | 0.635 | 0.429 | 0.683 |
| E    | 0.938 | 0.365 | 0.328 |
| C    | 0.293 | 0.956 | 0.963 |
| D    | 0.294 | 0.234 | 0.298 |
| E    | 0.294 | 0.394 | 0.928 |
| D    | 0.294 | 0.258 | 0.689 |
| A    | 0.687 | 0.666 | 0.642 |
| C    | 0.232 | 0.494 | 0.494 |
| D    | 0.575 | 0.845 | 0.284 |
+------+-------+-------+-------+

But Col1 actually has many more distinct groups / categories than shown here. I want to assign the data randomly by Col1, meaning that if label A is randomly selected for RDD1, all records with Col1 = A go to RDD1. Concretely, I want:

  • 30% of the groups to go to one RDD
  • Another 30% to go to a second RDD
  • Another 30% to go to a third RDD
  • The final 10% to go to a fourth RDD

When I say 30%, I mean 30% of the unique values of Col1. So, if my Col1 labels are [A, B, C, D, E, F, G, H, I, J], then 3 of those labels, along with all of their associated rows, go to the first grouping, and so on.
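For concreteness, here is a plain-Python (non-Spark) sketch of the label assignment I'm after; the hardcoded label list and the seed of 42 are just placeholders:

import random

# Shuffle the unique Col1 labels with a fixed seed, then cut them 30/30/30/10.
labels = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]  # unique Col1 values
rng = random.Random(42)  # example seed
rng.shuffle(labels)

n = len(labels)
cuts = [int(n * 0.3), int(n * 0.6), int(n * 0.9)]
group1 = labels[:cuts[0]]
group2 = labels[cuts[0]:cuts[1]]
group3 = labels[cuts[1]:cuts[2]]
group4 = labels[cuts[2]:]
# Every row whose Col1 is in group1 would then go to the first RDD, and so on.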

I can imagine a series of .map() calls where I emit a tuple of (Col1_label, random value in {0, 1, 2, 3} drawn with probabilities 0.3, 0.3, 0.3, 0.1) and then do subsequent filtering on that value.
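Roughly, I picture something like the sketch below, assuming the table above is already loaded as a DataFrame named df (the seed of 42 is again just a placeholder):

import random

# Assume `df` is a Spark DataFrame holding the table above.
pairs = df.rdd.map(lambda r: (r["Col1"], r))

# Seeded, reproducible bucket assignment per distinct Col1 label:
# each label independently lands in bucket 0-3 with probability 0.3/0.3/0.3/0.1.
rng = random.Random(42)
labels = pairs.keys().distinct().collect()
bucket = {lbl: rng.choices([0, 1, 2, 3], weights=[0.3, 0.3, 0.3, 0.1])[0]
          for lbl in labels}

# All rows sharing a Col1 label end up in the same RDD.
rdd1 = pairs.filter(lambda kv: bucket[kv[0]] == 0).values()
rdd2 = pairs.filter(lambda kv: bucket[kv[0]] == 1).values()
rdd3 = pairs.filter(lambda kv: bucket[kv[0]] == 2).values()
rdd4 = pairs.filter(lambda kv: bucket[kv[0]] == 3).values()

(With this approach each label is bucketed independently, so the 30/30/30/10 split only holds in expectation rather than exactly.)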

Is there an easier way, ideally one that lets me set a seed so the results are reproducible?



