I have a df of 300000 rows and 25 columns.
I have added a unique index to all the rows, using uuid.uuid4().
Now I only want a random portion of the dataset (say 25%). Here is what I am trying to do to get it, but it's not working:
def gen_uuid(self, df, percentage = 1.0, uuid_list = []):
    for i in range(df.shape[0]):
        uuid_list.append(str(uuid.uuid4()))
    uuid_pd = pd.Series(uuid_list)
    df_uuid = df.copy()
    df_uuid['id'] = uuid_pd
    df_uuid = df_uuid.set_index('id')
    if (percentage == 1.0) : return df_uuid
    else:
        uuid_list_sample = random.sample(uuid_list, int(len(uuid_list) * percentage))
        return df_uuid[df_uuid.index.any() in uuid_list_sample]
But this raises KeyError: False.
The uuid_list_sample that I generate has the correct length.
So I have 2 questions:
- How do I get the above code to work as intended, i.e. return a random portion of the pandas df based on its index?
- How do I, in general, get a percentage of the rows of a pandas DataFrame? I was looking at pandas.DataFrame.quantile, but I'm not sure if that does what I'm looking for.
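For reference, here is a minimal sketch of one way both questions might be approached. It is not a definitive fix. The helper names add_uuid_index and sample_fraction, the seed parameter, and the small demo frame are made up for illustration. The KeyError most likely arises because `df_uuid.index.any() in uuid_list_sample` evaluates to a single Python bool (False), which pandas then looks up as a column label. Selecting by the sampled labels with .loc, or building a boolean mask with Index.isin, avoids that. For the general case, DataFrame.sample(frac=...) draws a random fraction of rows directly, whereas DataFrame.quantile computes value quantiles per column and does not select rows.

```python
import random
import uuid

import pandas as pd


def add_uuid_index(df):
    """Return a copy of df indexed by freshly generated UUID strings."""
    out = df.copy()
    # Assign the index directly instead of going through a Series, so the
    # new labels cannot be misaligned with the frame's existing index.
    out.index = pd.Index([str(uuid.uuid4()) for _ in range(len(out))], name="id")
    return out


def sample_fraction(df, fraction=1.0, seed=None):
    """Return a random fraction of the rows, sampled without replacement."""
    if fraction >= 1.0:
        return df
    return df.sample(frac=fraction, random_state=seed)


# Small stand-in for the 300000 x 25 frame from the question (assumption).
demo = pd.DataFrame({"a": range(1000), "b": range(1000)})
demo_uuid = add_uuid_index(demo)

# Question 2: take 25% of the rows directly.
quarter = sample_fraction(demo_uuid, fraction=0.25, seed=0)

# Question 1: keep the original idea of sampling index labels first,
# then select those rows by label.
labels = random.sample(list(demo_uuid.index), int(len(demo_uuid) * 0.25))
by_labels = demo_uuid.loc[labels]

# Boolean-mask variant; keeps the rows in their original order.
by_mask = demo_uuid[demo_uuid.index.isin(labels)]
```

Generating the UUID list inside the function, rather than passing it as a default argument `uuid_list = []`, also avoids the list growing across repeated calls, since Python creates a mutable default only once.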