Monday, July 29, 2019

PySpark DataFrame - Append Random Permutation of a Single Column

I'm using PySpark (a new thing for me). Suppose I have the following table:

```
+-------+-------+----------+
| Col1  | Col2  | Question |
+-------+-------+----------+
| val11 | val12 | q1       |
| val21 | val22 | q2       |
| val31 | val32 | q3       |
+-------+-------+----------+
```

I would like to append to it a new column, random_question, which is in fact a permutation of the values in the Question column, so the result might look like this:

```
+-------+-------+----------+-----------------+
| Col1  | Col2  | Question | random_question |
+-------+-------+----------+-----------------+
| val11 | val12 | q1       | q2              |
| val21 | val22 | q2       | q3              |
| val31 | val32 | q3       | q1              |
+-------+-------+----------+-----------------+
```

I've tried to do that as follows:

```python
df.withColumn(
    'random_question',
    df.orderBy(rand(seed=0))['question']
).createOrReplaceTempView('with_random_questions')
```

The problem is that the above code does append the required column but WITHOUT permuting the values in it.

What am I doing wrong and how can I fix this?

Thank you,

Gilad
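
For reference, here is a minimal sketch of one common way to get an actual permutation (an editorial addition, not part of the original post). The original attempt likely fails because the column expression `df.orderBy(...)['question']` is resolved against `df` itself, so the sort has no effect. The sketch instead numbers the rows of the original DataFrame, independently numbers a randomly ordered copy of the Question column, and joins the two on that number. The helper column `row_id` and the toy data are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data matching the table in the question (assumed for illustration).
df = spark.createDataFrame(
    [("val11", "val12", "q1"),
     ("val21", "val22", "q2"),
     ("val31", "val32", "q3")],
    ["Col1", "Col2", "Question"],
)

# Number the original rows in their current order.
left = df.withColumn(
    "row_id",
    F.row_number().over(Window.orderBy(F.monotonically_increasing_id())),
)

# Number a randomly ordered copy of the Question column.
right = (
    df.select(F.col("Question").alias("random_question"))
      .withColumn(
          "row_id",
          F.row_number().over(Window.orderBy(F.rand(seed=0))),
      )
)

# Pair each original row with one shuffled value, then drop the key.
left.join(right, on="row_id").drop("row_id") \
    .createOrReplaceTempView("with_random_questions")

spark.table("with_random_questions").show()
```

Note that both windows lack a `partitionBy`, so Spark will warn and pull all rows onto a single partition; that is fine for small-to-medium tables, but for large data an RDD-based pairing (e.g. `zipWithIndex`) would scale better.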



