Could anyone explain how it is possible that sample returns different results on Spark 2.4.7 and Spark 3.0.1? I am aware that partitioning and row order matter, which is why I test a DataFrame with a single sorted partition:
scala> spark.version
res28: String = 3.0.1
scala> val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val cols = df.columns.map(c => df(c))
cols: Array[org.apache.spark.sql.Column] = Array(value)
scala> df.repartition(1).sortWithinPartitions(cols: _*).sample(false, 0.5, 42).show
+-----+
|value|
+-----+
|    4|
|    8|
+-----+
whereas on Spark 2.4.7:
scala> spark.version
res21: String = 2.4.7
scala> val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF
df: org.apache.spark.sql.DataFrame = [value: int]
scala> val cols = df.columns.map(c => df(c))
cols: Array[org.apache.spark.sql.Column] = Array(value)
scala> df.repartition(1).sortWithinPartitions(cols: _*).sample(false, 0.5, 42).show
+-----+
|value|
+-----+
|    5|
|    7|
|    8|
|    9|
+-----+
How can this difference be explained, and is it possible to make the results consistent across versions? I tried to find the relevant difference in the source code, but have not managed to.