I have 12 rows in my input data and need to select 4 random rows while preserving the column distributions during the random selection.
This is sample data; the original data contains millions of rows.
Input data sample:
import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio','Michigan',
             'Michigan','Ohio','Florida','New York','Washington']})
Expected output data:
output_data = pd.DataFrame({'Id': ['A','A','B','C'],
'Fruit': ['Apple','Mango','Apple','Orange'],
'City':['California','Ohio','Michigan','New York']})
My random selection should satisfy the following three requirements:
- The Id distribution: out of the 4 rows, 2 should be selected from A, 1 from B, and 1 from C.
- The Fruit distribution: 2 rows for Apple, 1 for Mango, and 1 for Orange.
- The selection should prioritize the higher-frequency cities.
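For reference, the target counts above come from each value's share of the 12 input rows, scaled to a sample of 4. Below is a minimal sketch of that arithmetic, assuming plain rounding is acceptable. The helper name `target_counts` and the variable `n` are only illustrative, and on other data rounded shares are not guaranteed to sum to `n`.

```python
import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango',
              'Apple','Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio',
             'Michigan','Michigan','Ohio','Florida','New York','Washington']})

n = 4  # desired sample size

def target_counts(series, n):
    """Proportional share of n for each distinct value, rounded to an integer."""
    shares = series.value_counts(normalize=True) * n
    return shares.round().astype(int)

id_targets = target_counts(input_data['Id'], n)        # rows wanted per Id
fruit_targets = target_counts(input_data['Fruit'], n)  # rows wanted per Fruit
city_freq = input_data['City'].value_counts()          # how often each City occurs

print(id_targets)
print(fruit_targets)
print(city_freq)
```

On this sample, Michigan is the most frequent city (3 rows), so it is the one I would expect to be favoured.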
I am aware of the pandas sample function and tried it, but it gives me an unbalanced selection:
input_data.sample(n = 4)
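A partial direction, sketched here with several assumptions, is to sample within each Id group and weight rows by how often their City occurs. This keeps the Id proportions and favours frequent cities, but it does not enforce the Fruit distribution, so it is not a full solution. The names `n`, `city_weight` and `id_quota` and the fixed `random_state=0` are only illustrative.

```python
import pandas as pd

input_data = pd.DataFrame({
    'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
    'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango',
              'Apple','Mango','Apple','Apple','Apple','Orange'],
    'City': ['California','California','Chicago','Michigan','New York','Ohio',
             'Michigan','Michigan','Ohio','Florida','New York','Washington']})

n = 4  # desired sample size

# Weight of each row = frequency of its City in the whole data set
city_weight = input_data['City'].map(input_data['City'].value_counts())

# Rows to draw per Id, proportional to the Id's share (simple rounding, an assumption)
id_quota = (input_data['Id'].value_counts(normalize=True) * n).round().astype(int)

parts = []
for id_value, group in input_data.groupby('Id'):
    parts.append(
        group.sample(
            n=int(id_quota[id_value]),
            weights=city_weight.loc[group.index],  # favour frequent cities
            random_state=0,                        # fixed seed only for reproducibility
        )
    )

sample = pd.concat(parts)
print(sample)
```

This still leaves the Fruit counts to chance, which is the part I am stuck on.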
Any leads on how to approach this problem would be really appreciated!