Sunday, September 3, 2017

Pyspark - Lambda Expressions for specific columns

I have a PySpark DataFrame that looks like this:

+---------------+---+---+---+---+---+---+
|         Entity| id|  7| 15| 19| 21| 27|
+---------------+---+---+---+---+---+---+
|              a|  0|  0|  1|  0|  0|  0|
|              b|  1|  0|  0|  0|  1|  0|
|              c|  2|  0|  0|  0|  1|  0|
|              d|  3|  2|  0|  0|  0|  0|
|              e|  4|  0|  3|  0|  0|  0|
|              f|  5|  0| 25|  0|  0|  0|
|              g|  6|  2|  0|  0|  0|  0|

I want to add a random value between 0 and 1 to every element in every column except Entity and id. There could also be any number of columns after Entity and id (in this case there are 5, but there could be 100, 1,000, or more).
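To make that concrete on a single row (plain Python, using the column layout from the example above):

    import random
    row = ["a", 0, 0, 1, 0, 0, 0]   # Entity, id, then the value columns
    jittered = row[:2] + [v + random.random() for v in row[2:]]
    # Entity and id are untouched; every other value gains a random offset in [0, 1)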

Here's what I have so far:

    from pyspark.sql import Row
    import random
    # adds a random value to every numeric field in the row
    tie_break_df = data.select("*").rdd.map(
        lambda x, r=random: Row(*[str(v) if isinstance(v, unicode) else
                                  float(r.random() + v) for v in x])).toDF(data.columns)

However, this also adds a random value to the id column. Normally, if I knew the number of columns ahead of time and knew they were fixed, I could explicitly call them out in the lambda expression:

data.select("*").rdd.map(lambda a,b,c,d,e,f,g: 
         Row(a,b, r.random() + c r.random() + d, r.random() + e, r.random() 
               + f, r.random() + g))

But, unfortunately, that won't work because I don't know how many columns I'll have ahead of time. The closest I've come up with is to skip the first two fields by position, as in the sketch below, but I'm not sure it's the right approach. Thoughts? I really appreciate the help!
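Here's that rough sketch. It reuses the Python 2 / unicode setup from the first snippet, with a small illustrative helper (call it add_noise) that skips the first two fields by position, so the number of trailing columns never has to be known:

    from pyspark.sql import Row
    import random

    def add_noise(row, r=random):
        # keep Entity and id as-is, add a random offset to every remaining field
        vals = list(row)
        return Row(*(vals[:2] + [float(v + r.random()) for v in vals[2:]]))

    tie_break_df = data.rdd.map(add_noise).toDF(data.columns)

The same idea could probably be expressed with the DataFrame API instead, by selecting Entity and id as-is and adding pyspark.sql.functions.rand() to every other column, but I'm not sure which way is better.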



