Tuesday, April 24, 2018

Spark sample is too slow

I'm trying to draw a simple random sample with Scala from an existing table of roughly 100e6 records.

import org.apache.spark.sql.SaveMode

val nSamples = 3e5.toInt  // target sample size
val frac = 1e-5           // sampling fraction

// Bernoulli sample (no replacement), capped at nSamples rows
val table = spark.table("db_name.table_name").sample(false, frac).limit(nSamples)

table
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")
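As a sanity check on the numbers above (treating the ~100e6 row count as an assumption), the expected size of a Bernoulli sample is frac × rows, which suggests the .limit() clause may never actually bind:

```scala
// Back-of-the-envelope check of the expected sample size,
// assuming the table holds ~100e6 records as stated above.
val totalRows = 1e8.toLong
val frac = 1e-5
val nSamples = 3e5.toInt

// sample(false, frac) keeps each row independently with probability frac,
// so the expected output is roughly frac * totalRows rows.
val expectedRows = (totalRows * frac).toLong

println(expectedRows)             // 1000
println(expectedRows >= nSamples) // false: the limit never binds
```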

But it is taking far too long (~5h by my estimate).

Useful information:

  1. I have ~6 workers. Inspecting the table, I count 11433 partitions.

  2. I'm not sure whether that partitions-to-workers ratio is reasonable.

  3. I'm running Spark 2.1.0 using Scala.
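To make point 2 concrete, here is the arithmetic behind the ratio I'm unsure about, using the partition and worker counts stated above:

```scala
// Partitions-to-workers ratio from the numbers above
// (numWorkers is approximate, as stated in the post).
val numPartitions = 11433
val numWorkers = 6

val partitionsPerWorker = numPartitions / numWorkers
println(partitionsPerWorker) // 1905 partitions per worker
```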

I have tried:

  1. Removing the .limit() part.

  2. Changing frac to 1.0, 0.1, etc.

Best,



