I have read a lot about stratified sampling in Spark for RDDs. I came across sample for DataFrames, while RDDs have sampleByKey.
Questions:
- Does sample on a DataFrame sample the data using a stratified strategy?
- This post shows how to use sampleByKey on DataFrames. Will performance degrade when we convert back and forth, DF --> RDD --> DF?
- What are the major differences between sample and sampleByKey?
- What is the best way to sample a 10 GB CSV file of approximately 50M rows? Should I use DataFrame sample or RDD sampleByKey? (I have no keys for this file.)
- I'm open to any other suggestions using Spark + Scala for question 4.
What I have tried so far:
val iFile = "/user/me/data.txt"
// the delimiter option is "sep", not "seq"
val data = spark.read.format("csv").option("sep", "|").load(iFile)
data.sample(0.2) // 20% sample, i.e. approx 10M rows
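As a side note, Spark's DataFrame API also exposes a stratified sampler directly, df.stat.sampleBy, which takes a column name and a map of per-stratum fractions, so no RDD conversion is needed. A minimal sketch, assuming a headerless CSV (so the first column is named _c0) and placeholder stratum values "A" and "B":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stratified-sample").getOrCreate()
// "sep" sets the field delimiter; headerless CSV columns are named _c0, _c1, ...
val data = spark.read.format("csv").option("sep", "|").load("/user/me/data.txt")

// Keep 20% of the rows for each listed value of _c0. Values "A" and "B"
// are assumptions for illustration; rows whose key is absent from the
// map are dropped entirely.
val fractions = Map("A" -> 0.2, "B" -> 0.2)
val stratified = data.stat.sampleBy("_c0", fractions, seed = 42L)
```

Note that sampleBy, like sampleByKey, requires the stratum values to be known up front to build the fractions map.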
and
val iFile = "/user/me/data.txt"
val data = spark.read.format("csv").option("sep", "|").load(iFile)
// sampleByKey takes a Map of per-key fractions, not a single Double
val keyed = data.rdd.keyBy(x => x(0))
val fractions = keyed.keys.distinct().collect().map(k => (k, 0.2)).toMap
keyed.sampleByKey(false, fractions) // approx 20% per key, about 10M rows overall
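For completeness, here is a self-contained sketch of the full DF --> RDD --> DF round trip that the second question asks about, with the per-key fractions built from the data itself (the choice of the first column as the key is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sample-by-key").getOrCreate()
val data = spark.read.format("csv").option("sep", "|").load("/user/me/data.txt")

// Key each Row by its first column, then assign the same 20% fraction
// to every distinct key found in the data (this triggers an extra job
// to collect the keys).
val keyed = data.rdd.keyBy(row => row(0))
val fractions = keyed.keys.distinct().collect().map(k => (k, 0.2)).toMap
val sampled = keyed.sampleByKey(withReplacement = false, fractions)

// Back to a DataFrame: drop the key and reapply the original schema.
val sampledDF = spark.createDataFrame(sampled.values, data.schema)
```

The round trip itself mainly costs the Row serialization in and out of the RDD API, plus the extra pass to collect the distinct keys.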