Sunday, 11 August 2019

Stratified Sampling on Dataframe - Spark 2.3 + Scala

I have read a lot about stratified sampling on RDDs in Spark. DataFrames have sample, while RDDs have sampleByKey.


  1. Does sample on a DataFrame use a stratified sampling strategy?
  2. This post shows how to use sampleByKey on DataFrames. Will performance degrade when converting back and forth, DF --> RDD --> DF?
  3. What are the major differences between sample and sampleByKey?
  4. What is the best way to sample a 10 GB CSV file with approximately 50M rows? Should I use DF sample or RDD sampleByKey? (I have no keys for this file.)
  5. I'm open to any other suggestions using Spark + Scala for question 4.

What I have tried so far

val iFile = "/user/me/data.txt"
// The CSV delimiter option is "sep" (not "seq")
val data = spark.read.format("csv").option("sep", "|").load(iFile)
data.sample(0.2) // 20% sample, approx. 10M rows


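val iFile = "/user/me/data.txt"
val data = spark.read.format("csv").option("sep", "|").load(iFile)
val keyed = data.rdd.keyBy(x => x(0)) // use the first column as the key
// sampleByKey expects a Map of key -> fraction rather than a single Double
val fractions = keyed.keys.distinct.collect.map(k => (k, 0.2)).toMap
keyed.sampleByKey(false, fractions) // approx. 20% of the rows for each key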
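
For comparison, here is a minimal sketch of per-stratum sampling that stays on the DataFrame side, using `stat.sampleBy` instead of going through an RDD. The helper `stratifiedSample`, the object name, and the column name `stratum` are hypothetical, introduced only for illustration. The sketch assumes the number of distinct strata is small enough to collect to the driver. Like `sampleByKey`, `sampleBy` samples each row independently, so the per-stratum sizes are approximate rather than exact.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StratifiedSampleSketch {
  // Hypothetical helper: apply the same fraction to every distinct value of `col`.
  def stratifiedSample(df: DataFrame, col: String, fraction: Double, seed: Long): DataFrame = {
    // Assumption: the set of distinct strata is small enough to collect to the driver.
    val fractions = df.select(col).distinct().collect()
      .map(r => (r.get(0), fraction))
      .toMap
    // Strata absent from the map are treated as fraction 0 by sampleBy.
    df.stat.sampleBy(col, fractions, seed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stratified-sample-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the CSV file; "stratum" is a made-up column name.
    val df = (1 to 1000).map(i => (if (i % 4 == 0) "a" else "b", i)).toDF("stratum", "value")

    val sampled = stratifiedSample(df, "stratum", 0.2, seed = 42L)
    sampled.groupBy("stratum").count().show()

    spark.stop()
  }
}
```

If the file has no natural key, as in question 4, there is nothing to stratify on, and the sketch only applies once a grouping column has been chosen.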
