I have a pyspark dataframe that looks like:
+---------------+---+---+---+---+---+---+
|         Entity| id|  7| 15| 19| 21| 27|
+---------------+---+---+---+---+---+---+
|              a|  0|  0|  1|  0|  0|  0|
|              b|  1|  0|  0|  0|  1|  0|
|              c|  2|  0|  0|  0|  1|  0|
|              d|  3|  2|  0|  0|  0|  0|
|              e|  4|  0|  3|  0|  0|  0|
|              f|  5|  0| 25|  0|  0|  0|
|              g|  6|  2|  0|  0|  0|  0|
+---------------+---+---+---+---+---+---+
I want to add a random value between 0 and 1 to every element in every column except Entity and id. There could be any number of columns after Entity and id (in this case there are 5, but there could be 100, or 1,000, or more).
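For reference, the dataframe above can be rebuilt with something like this (assuming an existing SparkSession named spark):

# Minimal reproduction of the dataframe shown above
data = spark.createDataFrame(
    [("a", 0, 0, 1, 0, 0, 0),
     ("b", 1, 0, 0, 0, 1, 0),
     ("c", 2, 0, 0, 0, 1, 0),
     ("d", 3, 2, 0, 0, 0, 0),
     ("e", 4, 0, 3, 0, 0, 0),
     ("f", 5, 0, 25, 0, 0, 0),
     ("g", 6, 2, 0, 0, 0, 0)],
    ["Entity", "id", "7", "15", "19", "21", "27"])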
Here's what I have so far:
import random
from pyspark.sql import Row

tie_break_df = data.select("*").rdd.map(
    lambda x, r=random: [Row(str(row)) if isinstance(row, unicode)
                         else Row(float(r.random() + row)) for row in x]
).toDF(data.columns)
However, this also adds a random value to the id column. Normally, if I knew the number of columns ahead of time and knew they were fixed, I could call them out explicitly in the lambda expression:
data.select("*").rdd.map(
    lambda a, b, c, d, e, f, g: Row(a, b, r.random() + c, r.random() + d,
                                    r.random() + e, r.random() + f,
                                    r.random() + g))
But, unfortunately, this won't work because I don't know how many columns I'll have ahead of time. Thoughts? I really appreciate the help!
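In case it helps frame the discussion, one direction I've been considering is staying in the DataFrame API and building the column list dynamically with pyspark.sql.functions.rand(). This is an untested sketch, not something I'm sure is idiomatic:

from pyspark.sql import functions as F

key_cols = ["Entity", "id"]  # columns to leave untouched
value_cols = [c for c in data.columns if c not in key_cols]

# rand() produces a uniform value in [0, 1) per row for each column expression
tie_break_df = data.select(
    key_cols + [(F.col(c) + F.rand()).alias(c) for c in value_cols])

I don't know whether that approach has drawbacks compared to the RDD route, so thoughts on that would be welcome too.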