Sunday, 11 August 2019

Stratified Sampling on DataFrame - Spark 2.3 + Scala

I have read a lot about stratified sampling in Spark for RDDs. DataFrames have sample, while pair RDDs have sampleByKey.

Questions:

  1. Does sample on a DataFrame use a stratified sampling strategy?
  2. This post shows how to use sampleByKey on DataFrames. Will performance degrade when converting back and forth, DF --> RDD --> DF?
  3. What are the major differences between sample and sampleByKey?
  4. What is the best way to sample a 10 GB CSV file with approximately 50M rows? Should I use DF sample or RDD sampleByKey? (I have no keys for this file.)
  5. I'm open to any other suggestions using Spark + Scala for question 4.

What I have tried so far:

val iFile = "/user/me/data.txt"
// The CSV reader option for the delimiter is "sep", not "seq"
val data = spark.read.format("csv").option("sep", "|").load(iFile)
data.sample(0.2) // 20% sample, approx. 10M rows

and

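val iFile = "/user/me/data.txt"
val data = spark.read.format("csv").option("sep", "|").load(iFile)
val keyed = data.rdd.keyBy(x => x(0)) // key each row by its first column
// sampleByKey takes a Map of key -> fraction, not a single Double
val fractions = keyed.keys.distinct.collect.map(k => (k, 0.2)).toMap
keyed.sampleByKey(false, fractions) // approx. 20% of each key, approx. 10M rows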
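
For comparison with the two attempts above, here is a minimal sketch of per-stratum sampling that stays on the DataFrame API by using stat.sampleBy, so there is no DF --> RDD --> DF round trip. The helper name stratifiedSample, the toy in-memory data, the column names _c0 and _c1, and the seed are assumptions made for illustration only; the sketch also assumes the key column is a string, as it would be for a CSV read without a schema. Treat it as one possible approach, not a definitive answer to the questions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StratifiedSampleSketch {
  // Hypothetical helper: sample every stratum of a string column with the same fraction.
  def stratifiedSample(df: DataFrame, keyCol: String, fraction: Double, seed: Long): DataFrame = {
    // Collect the distinct non-null keys; this assumes the number of strata is small.
    val keys = df.select(keyCol).na.drop().distinct().collect().map(_.getString(0))
    // sampleBy treats any key missing from this map as fraction 0, so null keys are excluded.
    val fractions = keys.map(k => k -> fraction).toMap
    df.stat.sampleBy(keyCol, fractions, seed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stratified-sample-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the CSV file: 900 "common" rows and 100 "rare" rows.
    val data = (1 to 1000)
      .map(i => (if (i % 10 == 0) "rare" else "common", i))
      .toDF("_c0", "_c1")

    // Uniform sample: every row is kept with probability 0.2, regardless of key.
    val uniform = data.sample(withReplacement = false, fraction = 0.2, seed = 42L)

    // Per-stratum sample: each key gets its own fraction (here the same 0.2 for all).
    val stratified = stratifiedSample(data, "_c0", 0.2, 42L)

    uniform.groupBy("_c0").count().show()
    stratified.groupBy("_c0").count().show()

    spark.stop()
  }
}
```

Both sample and sampleBy decide row by row with a random draw, so the per-key counts are only approximately 20%. With the same fraction for every key the two behave much alike; sampleBy becomes useful when the strata need different fractions. If exact per-stratum sizes matter, the RDD API also offers sampleByKeyExact, at the cost of extra passes over the data.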



