Monday, 27 April 2020

Generate a random value in a new column, based on the groups formed by other columns, in Spark

We have the following DataFrame:

------
G1|G2|
1 | 1|
1 | 1|
1 | 2|
2 | 1|
2 | 2|
2 | 3|

Based on columns G1 and G2, we have 5 groups: 1-1, 1-2, 2-1, 2-2, 2-3.

I would like to create a new column isSelected with the following rule: for a group of N rows, randomly assign the value 1 to at least 50% of the rows and 0 to the rest. Every group must have at least one row with isSelected = 1, and [number of 1 rows] - [number of 0 rows] should be at most 1.

The following is one valid result:

----------------
G1|G2|isSelected
1 | 1|1
1 | 1|0
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|1

The following is not valid:

----------------
G1|G2|isSelected
1 | 1|1
1 | 1|1 --> Not OK: this group has two rows with 1 and no row with 0.
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|0 --> Not OK: this group has no row with 1.

How can I do this directly in Spark?
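One possible approach, offered as a sketch rather than a confirmed answer: reading the last condition as "at most 1" (which the valid example suggests, since a single-row group has one 1 and no 0), the three conditions together leave only one choice for the number of selected rows in a group of N rows, namely ceil(N / 2). It should then be enough to shuffle the rows inside each group and mark the first ceil(N / 2) of them. The PySpark code below assumes the columns are named G1 and G2 as in the question; the function names selected_count and add_is_selected and the temporary columns _rnd, _rank and _n are hypothetical names introduced for illustration.

```python
def selected_count(n: int) -> int:
    """Rows to mark with 1 in a group of n rows: ceil(n / 2), in integer arithmetic."""
    return (n + 1) // 2


def add_is_selected(df, group_cols=("G1", "G2")):
    """Return df with an isSelected column (hypothetical helper, not from the question).

    Assumes df is a PySpark DataFrame that contains the columns in group_cols.
    """
    # Imported here so that selected_count stays usable without Spark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    group = Window.partitionBy(*group_cols)   # all rows of one group
    shuffled = group.orderBy(F.col("_rnd"))   # same rows, in random order

    return (
        df.withColumn("_rnd", F.rand())                            # random sort key per row
          .withColumn("_rank", F.row_number().over(shuffled))      # 1..N inside each group
          .withColumn("_n", F.count(F.lit(1)).over(group))         # N, the group size
          .withColumn(
              "isSelected",
              # Mark the first ceil(N / 2) rows of the shuffled group with 1, the rest with 0.
              (F.col("_rank") <= F.ceil(F.col("_n") / 2)).cast("int"),
          )
          .drop("_rnd", "_rank", "_n")
    )
```

With the DataFrame from the question this would be called as add_is_selected(df). Because the sort key comes from rand(), which rows are selected changes from run to run, while the number of selected rows per group stays fixed at ceil(N / 2).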



