We have the following DataFrame:
------
G1|G2|
1 | 1|
1 | 1|
1 | 2|
2 | 1|
2 | 2|
2 | 3|
Based on columns G1 and G2, we have 5 groups: 1-1, 1-2, 2-1, 2-2, 2-3.
I would like to create a new column isSelected with the following rule: of the N rows belonging to each group, a random selection of at least 50% of the rows gets the value 1, and the remaining rows get 0. Every group must have at least 1 row with isSelected = 1, and [number of 1 rows] - [number of 0 rows] should be at most 1.
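Taken together, these constraints seem to fix the number of 1 rows in a group of N rows at ceil(N/2), assuming a difference of exactly 1 is allowed for odd-sized groups (as in the valid example). Only the choice of which rows get the 1 is random. Here is a minimal plain-Python sketch of the rule, not Spark code; the function name `assign_is_selected` and the `seed` parameter are made up for illustration:

```python
import math
import random
from collections import defaultdict

def assign_is_selected(rows, seed=None):
    """rows: list of (G1, G2) tuples. Returns a list of (G1, G2, isSelected)."""
    rng = random.Random(seed)
    # Collect the row positions that belong to each (G1, G2) group.
    positions = defaultdict(list)
    for i, key in enumerate(rows):
        positions[key].append(i)
    flags = [0] * len(rows)
    for idxs in positions.values():
        # Assumption: exactly ceil(N/2) rows of each group are set to 1.
        k = math.ceil(len(idxs) / 2)
        for i in rng.sample(idxs, k):
            flags[i] = 1
    return [(g1, g2, flag) for (g1, g2), flag in zip(rows, flags)]

rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
for row in assign_is_selected(rows, seed=0):
    print(row)
```

In Spark, I imagine the same idea would be to number the rows of each group in random order (a window partitioned by G1 and G2) and to set isSelected = 1 where that number is at most ceil(N/2), but I have not worked that out.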
The following is one valid generation:
----------------
G1|G2|isSelected
1 | 1|1
1 | 1|0
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|1
The following is not valid:
----------------
G1|G2|isSelected
1 | 1|1
1 | 1|1 --> Not OK, this group has two 1 rows and no 0 rows.
1 | 2|1
2 | 1|1
2 | 2|1
2 | 3|0 --> Not OK, this group has no 1 row.
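To make the two examples concrete, here is a small plain-Python check of the rule as I understand it, reading the difference condition as "at most 1". The helper name `is_valid` is hypothetical and this is only a sketch for illustration, not Spark code:

```python
from collections import Counter

def is_valid(rows):
    """rows: list of (G1, G2, isSelected) tuples."""
    totals = Counter()
    ones = Counter()
    for g1, g2, flag in rows:
        totals[(g1, g2)] += 1
        ones[(g1, g2)] += flag
    for group, n in totals.items():
        one_rows = ones[group]
        zero_rows = n - one_rows
        if one_rows < 1:                 # every group needs at least one 1 row
            return False
        if 2 * one_rows < n:             # at least 50% of the rows must be 1
            return False
        if one_rows - zero_rows > 1:     # assumed reading: difference at most 1
            return False
    return True

valid = [(1, 1, 1), (1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1)]
invalid = [(1, 1, 1), (1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 0)]
print(is_valid(valid))    # True
print(is_valid(invalid))  # False
```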
How can I do this directly in Spark?