I took a sample of an original DataFrame, which gives me approximately 200000 records. I want to mark the sample so that it consists of three groups of marked records: 1/3 of the records marked as sample_metric_1, another 1/3 marked as sample_metric_2, and the rest marked as sample_metric_3.
sample_df = original_df.sample(False, .09, 5)
One way that comes to mind is to apply .withColumn() to each part of the DataFrame, but I am not sure how to correctly extract the needed subsets from the sample.
Or maybe there is a better approach here?