Friday, September 17, 2021

Select random rows from a pandas DataFrame based on column distributions and conditions

I have 12 rows in my input data and need to select 4 random rows in a way that preserves the column distributions of the input.

This is sample data; the original data contains millions of rows.

Sample input data:

import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio','Michigan',
             'Michigan','Ohio','Florida','New York','Washington']})

Expected output data:

output_data = pd.DataFrame({
    'Id': ['A','A','B','C'],
    'Fruit': ['Apple','Mango','Apple','Orange'],
    'City': ['California','Ohio','Michigan','New York']})

My random selection should take the following three parameters into account:

  1. The Id distribution: of the 4 rows, 2 should be selected from A, 1 from B and 1 from C.

  2. The Fruit distribution: 2 rows for Apple, 1 for Mango and 1 for Orange.

  3. The selection should prioritize the higher-frequency Cities.
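To make the target numbers concrete, here is a minimal sketch of how the per-value row counts could be derived from the input frequencies. The helper name target_counts and the largest-remainder rounding rule are assumptions made for illustration; they are not something pandas provides.

```python
import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple',
              'Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio',
             'Michigan','Michigan','Ohio','Florida','New York','Washington']})

def target_counts(column, n):
    """Hypothetical helper: split n rows across the values of a column
    in proportion to how often each value occurs."""
    exact = column.value_counts(normalize=True) * n   # fractional quotas
    counts = exact.astype(int)                        # round down first
    leftover = int(n - counts.sum())
    # Assumed rounding rule: give the remaining rows to the values
    # with the largest fractional remainders.
    remainders = (exact - counts).sort_values(ascending=False)
    for value in remainders.index[:leftover]:
        counts[value] += 1
    return counts

id_counts = target_counts(input_data['Id'], 4)
fruit_counts = target_counts(input_data['Fruit'], 4)
print(id_counts.to_dict())
print(fruit_counts.to_dict())
```

For the sample data this gives 2/1/1 for Id (A/B/C) and 2/1/1 for Fruit (Apple/Mango/Orange), which are the quotas listed above.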

I am aware of the pandas sample function and tried it, but it gives me an unbalanced selection:

input_data.sample(n = 4)
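For comparison, here is a minimal sketch of a stratified variant, assuming pandas 1.1 or later (where groupby objects have a sample method). It samples the same fraction from every Id group and weights each row by how often its City occurs. This is only a partial approach: it keeps the Id distribution and favors frequent Cities, but it does not enforce the Fruit distribution. The names city_weight and sampled are illustrative.

```python
import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple',
              'Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio',
             'Michigan','Michigan','Ohio','Florida','New York','Washington']})

n = 4

# Weight each row by the overall frequency of its City, so rows from
# common Cities are more likely to be drawn (assumed reading of
# "prioritize the higher-frequency Cities").
city_weight = input_data['City'].map(input_data['City'].value_counts())

# Draw the same fraction from every Id group, which keeps the Id proportions.
sampled = input_data.groupby('Id').sample(
    frac=n / len(input_data),
    weights=city_weight.to_numpy(),
    random_state=0,  # fixed seed only to make the sketch reproducible
)
print(sampled)
```

Because the fraction is applied per group, the result always has 2 rows from A, 1 from B and 1 from C on this sample; which Fruits end up in the selection is still left to chance.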

Any leads on how to approach this problem would be much appreciated!



