Friday, February 24, 2023

How to randomly sample balanced pairs of rows from a Pandas DataFrame

Suppose I have a dataset containing labels, filenames, and potentially other columns of metadata. The full dataset may have as many as 200,000 examples. The snippet below simulates a smaller version of this setup.

import pandas as pd
import numpy as np
import IPython.display as ipd

size = 20000
df = []
rng = np.random.default_rng(0)
for i in range(size):
    l = rng.choice(('cat', 'dog', 'mouse', 'bird', 'horse', 'lion', 'rabbit'))
    fp = str(rng.integers(100_000)).zfill(6) + '.jpg'  # Generator.integers requires an int bound
    df.append((l, fp))
df = pd.DataFrame(df, columns=['label', 'filenames'])
ipd.display(df)

I would like to efficiently produce N randomly generated pairs of data, with the condition that the dataset is balanced between positive and negative pairs, e.g.,

# df_out would contain N pairs
df_out = pd.DataFrame([], columns=['label_1', 'label_2', 'filenames_1', 'filenames_2'])

Here I am defining a positive pair as one where label_1 equals label_2, and a negative pair as one where the two labels differ. So the goal is for df_out to contain roughly 50% positive and 50% negative pairs.
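For intuition, here's a quick back-of-the-envelope check (added for illustration, not part of the original approach): with seven roughly equally likely labels, a uniformly random pair is positive only about 1/7 of the time, which is why naive pairing lands nowhere near the 50% target.

```python
import numpy as np

# Label distribution from the toy dataset: 7 equally likely classes.
p = np.full(7, 1 / 7)

# Probability that two independent draws land on the same label.
p_positive = (p ** 2).sum()
print(p_positive)  # ~0.1428, i.e. about 1/7
```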

The first approach I tried samples 2N rows from the DataFrame and then collapses them into pairs.

N = 20
ii = rng.permutation(len(df))[:N*2]  # 2N distinct random row indices (assumes 2N <= len(df))
func = lambda x: x.dropna().astype(str).str.cat(sep=',')
df_out = df.iloc[ii].reset_index(drop=True)  # subsample 
df_out = df_out.groupby(df_out.index//2)  # collapse every two rows into one row
df_out = df_out.agg(func).reset_index(drop=True)  # use `func` to combine rows
for k in df.columns:
    df_out[[f'{k}_1',f'{k}_2']] = df_out[k].str.split(',', expand=True)
    del df_out[k]

So this works to make pairs of rows, but it takes no account of whether each pair is positive or negative.
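As an aside, the same collapse can be done without round-tripping through strings (a sketch on a small toy frame; it drops the implicit assumption that no value contains a comma): pair row 2k with row 2k+1 by slicing even and odd rows and concatenating them side by side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'label': rng.choice(('cat', 'dog'), size=100),
    'filenames': [f'{i:06d}.jpg' for i in range(100)],
})

N = 20
ii = rng.permutation(len(toy))[:N * 2]  # 2N distinct random rows
sampled = toy.iloc[ii].reset_index(drop=True)

# Pair row 2k with row 2k+1: even rows become *_1, odd rows become *_2.
pairs = pd.concat([
    sampled.iloc[0::2].reset_index(drop=True).add_suffix('_1'),
    sampled.iloc[1::2].reset_index(drop=True).add_suffix('_2'),
], axis=1)
print(pairs.shape)
```

This keeps the original dtypes intact instead of forcing everything through str.cat and str.split.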

# as one would expect, this fraction is nowhere near 50%
print(sum(df_out.eval('label_1==label_2')) / N)
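One straightforward way to enforce the 50/50 balance (a sketch under my own assumptions; `sample_balanced_pairs` is a helper name I'm introducing, not from the original) is to build the two halves separately: positive pairs by picking a label and then two distinct rows with that label, negative pairs by picking two distinct labels and one row from each, then shuffling the result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Rebuild the toy dataset from above.
names = ('cat', 'dog', 'mouse', 'bird', 'horse', 'lion', 'rabbit')
size = 20000
df = pd.DataFrame({
    'label': rng.choice(names, size=size),
    'filenames': [str(rng.integers(100_000)).zfill(6) + '.jpg' for _ in range(size)],
})

def sample_balanced_pairs(df, n, rng):
    """Return n pairs: half positive (same label), half negative (different labels)."""
    n_pos = n // 2
    n_neg = n - n_pos
    labels_arr = df['label'].to_numpy()
    # Positional row indices per label, so .iloc works regardless of df's index.
    by_label = {k: np.flatnonzero(labels_arr == k) for k in np.unique(labels_arr)}
    keys = np.array([k for k, idx in by_label.items() if len(idx) >= 2])

    # Positive pairs: pick a label, then two distinct rows with that label.
    pos = np.array([rng.choice(by_label[k], size=2, replace=False)
                    for k in rng.choice(keys, size=n_pos)])

    # Negative pairs: pick two distinct labels, then one row from each.
    neg = np.array([[rng.choice(by_label[a]), rng.choice(by_label[b])]
                    for a, b in (rng.choice(keys, size=2, replace=False)
                                 for _ in range(n_neg))])

    ii = np.vstack([pos, neg])
    out = pd.concat([
        df.iloc[ii[:, 0]].reset_index(drop=True).add_suffix('_1'),
        df.iloc[ii[:, 1]].reset_index(drop=True).add_suffix('_2'),
    ], axis=1)
    # Shuffle so positives and negatives are interleaved.
    return out.sample(frac=1, random_state=0).reset_index(drop=True)

df_out = sample_balanced_pairs(df, 20, rng)
print((df_out['label_1'] == df_out['label_2']).mean())  # 0.5 by construction
```

Note one modelling choice: negatives are sampled over label *pairs* rather than row pairs, which slightly over-represents rare labels relative to uniform row sampling; whether that matters depends on the downstream use. The per-pair Python loops are also not vectorized, so for millions of pairs you would want a batched version.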


