Wednesday, March 9, 2022

How to get a percentage of a pandas dataframe

I have a df of 300000 rows and 25 columns.

I have added a unique index to all the rows, using uuid.uuid4().

Now I only want a random portion of the dataset (say 25%). Here is what I am trying to do to get it, but it's not working:

def gen_uuid(self, df, percentage = 1.0, uuid_list = []):
        for i in range(df.shape[0]):
            uuid_list.append(str(uuid.uuid4()))
        uuid_pd = pd.Series(uuid_list)
        df_uuid = df.copy()
        df_uuid['id'] = uuid_pd
        df_uuid = df_uuid.set_index('id')
        if (percentage == 1.0) : return df_uuid
        else:
            uuid_list_sample = random.sample(uuid_list, int(len(uuid_list) * percentage))
            return df_uuid[df_uuid.index.any() in uuid_list_sample]

But this raises an error: KeyError: False
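
A possible explanation, offered as a sketch rather than a definitive diagnosis: `df_uuid.index.any() in uuid_list_sample` evaluates to a single Python bool, not one value per row, so `df_uuid[False]` is treated as a lookup for a column labeled `False`. The version below uses `Index.isin` to build a per-row boolean mask instead. The function name `add_uuid_index` is hypothetical, and the UUID list is built inside the function because a mutable default such as `uuid_list = []` is shared across calls and keeps growing.

```python
import random
import uuid

import pandas as pd


def add_uuid_index(df, percentage=1.0):
    """Return a copy of df indexed by fresh UUID strings, optionally subsampled."""
    # One UUID string per row, created locally on every call.
    uuid_list = [str(uuid.uuid4()) for _ in range(len(df))]

    df_uuid = df.copy()
    # Assign the index directly so nothing depends on label alignment.
    df_uuid.index = pd.Index(uuid_list, name="id")

    if percentage == 1.0:
        return df_uuid

    # Pick the ids to keep, then build one True/False value per row.
    sample_ids = random.sample(uuid_list, int(len(uuid_list) * percentage))
    return df_uuid[df_uuid.index.isin(sample_ids)]
```

Calling `add_uuid_index(df, 0.25)` on a 300000-row frame should return 75000 rows in their original order. `df_uuid.loc[sample_ids]` would also work, but it returns the rows in the sampled order.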

The uuid_list_sample that I generate has the correct length.

So I have 2 questions:

  1. How do I get the above code to work as intended, i.e. return a random portion of the pandas df based on the index?
  2. How do I, in general, get a percentage of the whole pandas data frame? I was looking at pandas.DataFrame.quantile, but I'm not sure if that does what I'm looking for.
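
For the second question, a minimal sketch assuming that a random subset of rows is all that is needed: `DataFrame.sample` accepts a `frac` argument for the fraction of rows to draw. The small frame below is made-up stand-in data, and `random_state=0` is only there to make the draw repeatable. `quantile`, by contrast, computes a value-based statistic for a column and does not select rows.

```python
import pandas as pd

# Stand-in data for illustration; the real frame has 300000 rows and 25 columns.
df = pd.DataFrame({"a": range(300), "b": range(300, 600)})

# Draw 25% of the rows at random, without replacement by default.
quarter = df.sample(frac=0.25, random_state=0)
print(len(quarter))  # 75

# quantile returns a number describing the column's values, not a subset of rows.
q = df["a"].quantile(0.25)
```

The sampled frame keeps the original index labels, so a UUID index set beforehand is carried over to the sample.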


