Saturday, March 20, 2021

Why do Spark 3.0.1 and Spark 2.4.7 return different samples for the same dataframe (with the same ordered partitions)?

Could anyone explain how sample can return different results on Spark 2.4.7 and Spark 3.0.1? I am aware that partitioning and row order matter, which is why I test with a dataframe that has a single, sorted partition:

scala> spark.version
res28: String = 3.0.1

scala> val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> val cols = df.columns.map(c => df(c))
cols: Array[org.apache.spark.sql.Column] = Array(value)

scala> df.repartition(1).sortWithinPartitions(cols: _*).sample(false, 0.5, 42).show
+-----+
|value|
+-----+
|    4|
|    8|
+-----+

whereas on Spark 2.4.7:

scala> spark.version
res21: String = 2.4.7

scala> val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF
df: org.apache.spark.sql.DataFrame = [value: int]

scala> val cols = df.columns.map(c => df(c))
cols: Array[org.apache.spark.sql.Column] = Array(value)

scala> df.repartition(1).sortWithinPartitions(cols: _*).sample(false, 0.5, 42).show
+-----+
|value|
+-----+
|    5|
|    7|
|    8|
|    9|
+-----+

How can this difference be explained? Is it possible to make the result consistent across versions? I tried to find the relevant difference in the Spark source code, but I haven't managed to.
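For reference, one workaround that might sidestep the question entirely is to avoid sample and instead keep rows based on a hash of their content, so the decision for each row does not depend on Spark's random number generator, the partitioning, or the row order. The following is only a minimal sketch under assumptions: the helper name hashSample and the bucket count are hypothetical, it relies on crc32 being a standard checksum whose result should not vary between Spark versions, and it is not equivalent to sample (identical rows always get the same decision, and the kept fraction is only approximate).

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{concat_ws, crc32, lit, pmod}

// Hypothetical helper (not part of Spark): keeps a row when a CRC32 checksum
// of its seed-prefixed, stringified columns falls into the first
// `fraction` share of the buckets.
def hashSample(df: DataFrame, fraction: Double, seed: Long): DataFrame = {
  val buckets = 10000L // assumed resolution of the sampling fraction

  // Build one string key per row: the seed followed by every column value.
  // Note that concat_ws skips null values, so rows that differ only in
  // which column is null can end up with the same key.
  val parts: Array[Column] =
    lit(seed.toString) +: df.columns.map(c => df(c).cast("string"))
  val key: Column = concat_ws("|", parts: _*)

  // crc32 returns a non-negative bigint; pmod maps it to [0, buckets).
  val threshold = (fraction * buckets).toLong
  df.filter(pmod(crc32(key), lit(buckets)) < lit(threshold))
}

// Usage in spark-shell, with the same dataframe as above:
// hashSample(df, 0.5, 42L).show
```

Because the filter is a pure function of each row's values and the seed, the selected rows should be the same regardless of how the dataframe is partitioned or sorted. I have not verified the output on both versions, so treat it as an idea to test rather than a confirmed fix.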



