Thursday, 29 October 2020

PySpark - how to mark parts of a sampled DataFrame?

I took a sample of an original DataFrame, which gives me approximately 200000 records. I want to mark the sample so that it is split into three parts: 1/3 of the records marked as sample_metric_1, another 1/3 marked as sample_metric_2, and the remaining records marked as sample_metric_3.

sample_df = original_df.sample(False, .09, 5)

One way that comes to mind is to apply .withColumn() to parts of the DataFrame, but I am not sure how to correctly extract the needed subsets from the sample.

Or maybe there is a better approach here?
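
One possible approach, given only as a sketch: rather than extracting subsets, add a single column whose value depends on a random number drawn per row. The names cut_points, mark_sample, the temporary column _u and the output column sample_mark are assumptions made for illustration and are not part of the question. Because the split is random, each part holds roughly, not exactly, one third of the rows.

```python
LABELS = ["sample_metric_1", "sample_metric_2", "sample_metric_3"]


def cut_points(labels):
    """Pair each label with the upper bound of its equal-width slice of [0, 1)."""
    n = len(labels)
    return [(label, (i + 1) / n) for i, label in enumerate(labels)]


def mark_sample(sample_df, labels=LABELS, seed=5):
    """Return sample_df with a string column `sample_mark` (hypothetical name)."""
    # Imported here so that cut_points() can be used without a Spark installation.
    from pyspark.sql import functions as F

    # One uniform draw in [0, 1) per row, stored in a temporary column so that
    # every branch below compares against the same value.
    with_draw = sample_df.withColumn("_u", F.rand(seed))

    bounds = cut_points(labels)
    first_label, first_upper = bounds[0]
    mark = F.when(F.col("_u") < first_upper, first_label)
    for label, upper in bounds[1:-1]:
        mark = mark.when(F.col("_u") < upper, label)
    # Whatever is left falls into the last part.
    mark = mark.otherwise(bounds[-1][0])

    return with_draw.withColumn("sample_mark", mark).drop("_u")
```

With the sample from the question this would be called as marked_df = mark_sample(sample_df), and marked_df.groupBy("sample_mark").count().show() can be used to check the size of each part. If exact thirds matter, F.ntile(3) over a window ordered by F.rand(seed) is an alternative, at the cost of moving all rows into a single partition.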



