I'm trying to execute a simple random sample with Scala
from an existing table of possibly ~100e6 records.
import org.apache.spark.sql.SaveMode
val nSamples = 3e5.toInt
val frac = 1e-5
val table = spark.table("db_name.table_name").sample(false, frac).limit(nSamples)
(table
.write
.mode(SaveMode.Overwrite)
.saveAsTable("db_name.new_name")
)
But it is taking too long (~5h by my estimate).
Useful information:
- I have ~6 workers. Analyzing the number of partitions of the table, I get 11433.
- I'm not sure if the partitions/workers ratio is reasonable.
- I'm running Spark 2.1.0 using Scala.
I have tried:
- Removing the .limit() part.
- Changing frac to 1.0, 0.1, etc.
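For reference, the no-limit variant I tried looks like this (a sketch using the same placeholder table names as above; only .limit() is dropped):

```scala
import org.apache.spark.sql.SaveMode

// Same sampling fraction as before; db_name.table_name and
// db_name.new_name are placeholders for my actual table names.
val frac = 1e-5
val sampled = spark.table("db_name.table_name").sample(false, frac)

(sampled
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name"))
```

This still scans the full table, so the runtime was essentially unchanged.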
Best,