lundi 1 avril 2019

Select single random sample from group by in Hive

I have a table that looks like so:

Name      Age       Num_Hobbies     Num Shoes
Jane      31        10              2
Bob       23        3               4
Jane      60        2               200
Jane      31        100             6
Bob       10        8               7
etc etc

I would like to group this table by Name and Age, and at random pick one row from the rest of the columns.

In pandas, I would do the following:

df.groupby(['Name', 'Age']).apply(lambda x: x.sample(n=1))

In hive, I know how to create the group, but not how to choose a single random sample from group.

I saw this question on stack overflow: How to sample for each group in hive?

However, I do not understand how to apply Dynamic partitions or Hive bucketing to select a single sample from a group.




Aucun commentaire:

Enregistrer un commentaire