Friday, May 24, 2019

Replace a certain group in a pyspark dataframe column with a random item from a list

Let's say the dataframe looks like this:

import pandas as pd

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
ls = [
    ['1', -9.78],
    ['2', 5.38],
    ['1', 8.86],
    ['2', -0.47],
    ['1', -0.19],
    ['1', 4.78],
    ['1', -9.23],
    ['2', -89.32]
]
test = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
test.show()

output:

+----+------+
|col1|  col2|
+----+------+
|   1| -9.78|
|   2|  5.38|
|   1|  8.86|
|   2| -0.47|
|   1| -0.19|
|   1|  4.78|
|   1| -9.23|
|   2|-89.32|
+----+------+

I want to replace every value in col1 that equals '1' with a random pick from a list of items, ['a', 'b', 'c'], sampled with replacement so the same item can appear more than once.

For example, the result would look like this:

+----+------+
|col1|  col2|
+----+------+
|   a| -9.78|
|   2|  5.38|
|   a|  8.86|
|   2| -0.47|
|   c| -0.19|
|   b|  4.78|
|   a| -9.23|
|   2|-89.32|
+----+------+

How can I do this in pyspark?



