random: Generate random value on new column, based on group value of other columns in Spark

Monday, 27 April 2020

We have the following DataFrame:

------
G1|G2|
1 | 1|
1 | 1|
1 | 2|
2 | 1|
2 | 2|
2 | 3|

So, based on columns G1 and G2, we have 5 groups: 1-1, 1-2, 2-1, 2-2, 2-3.

I would like to create a new column isSelected with the following rule: of the N rows belonging to each group, a random selection of at least 50% should get the value 1, and the rest 0. Every group must have at least one row with isSelected = 1, and [number of 1 rows] - [number of 0 rows] should be at most 1 (so a group of N rows gets exactly ceil(N/2) ones).
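To make the rule concrete, here is a minimal plain-Python sketch of it (not Spark code). It assumes the rule boils down to "pick ceil(N/2) rows at random in each group"; the function names assign_is_selected and is_valid are hypothetical and only serve this illustration.

```python
import math
import random
from collections import defaultdict


def assign_is_selected(rows, seed=None):
    """Return (g1, g2, is_selected) tuples; ceil(N/2) random rows per group get 1.

    Hypothetical helper for illustration: `rows` is a list of (G1, G2) pairs.
    """
    rng = random.Random(seed)

    # Collect the positions of the rows that belong to each (G1, G2) group.
    groups = defaultdict(list)
    for idx, key in enumerate(rows):
        groups[key].append(idx)

    selected = [0] * len(rows)
    for indices in groups.values():
        # Assumed quota: ceil(N/2) is at least 50% and at least 1 for any group.
        quota = math.ceil(len(indices) / 2)
        for idx in rng.sample(indices, quota):
            selected[idx] = 1

    return [(g1, g2, selected[i]) for i, (g1, g2) in enumerate(rows)]


def is_valid(result):
    """Check the rule: every group has a 1, and (#ones - #zeros) is 0 or 1."""
    ones = defaultdict(int)
    zeros = defaultdict(int)
    for g1, g2, flag in result:
        if flag == 1:
            ones[(g1, g2)] += 1
        else:
            zeros[(g1, g2)] += 1
    keys = set(ones) | set(zeros)
    return all(ones[k] >= 1 and 0 <= ones[k] - zeros[k] <= 1 for k in keys)


rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
result = assign_is_selected(rows, seed=0)
```

In Spark, the same idea could probably be expressed by ranking the rows of each group in random order (for example row_number() over a window partitioned by G1 and G2 and ordered by rand()) and comparing that rank with the group's row count. That mapping is a suggestion only and is not tested here.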

The following is one valid assignment:

----------------
G1|G2|isSelected
1 | 1|1
1 | 1|0
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|1

The following is not valid:

----------------
G1|G2|isSelected
1 | 1|1
1 | 1|1 --> Not OK, this group has two rows with 1 and no row with 0.
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|0 --> Not OK, this group has no row with 1.

How can I do this directly in Spark?
