Friday, July 8, 2022

Repeated repeated sampling without replacement

I have a Slots dataframe with ~1,000,000 race+gender slots (so efficiency is a slight concern, I think). There are 3 race categories (White, Black, Hispanic) and 2 gender categories (Male, Female), for a total of 6 race-gender categories (e.g. White-Male (WM), Black-Female (BF), etc.). The dataframe is grouped into blocks of 10 slots that have different race+gender compositions (e.g. one may have 2 WM, 2 BF, 3 HF, 1 HM, 1 WF, and 1 BM), though there is always at least one of each category. The people in a given block have to be unique.

Here's part of the Slots dataframe:

Block Race_Gender Person Name
a WM
a WM
a BF
a HF
a HM
a HM
a HM
a HM
a BM
a WF
b BM
b BM
b BM
b HF
b HM
b WF
b WF
b BF
b WM
b WM
... ... ...

I need to fill those slots using a population of 60 people. The population of 60 consists of 10 people from each of the 6 race-genders. For simplicity, we'll name the people according to their race-gender, e.g. White Male1, ..., White Male10, Black Female1, ..., Black Female10, etc. The Population dataframe looks like this:

Race_Gender Person Name
WM White Male1
WM White Male2
... ...
WM White Male10
BF Black Female1
... ...
BF Black Female10

I want to randomly, without replacement, fill the slots. Of course, there are more slots than people, but I want the "without replacement" piece to happen repeatedly until the people are used up. Specifically, once I use up all 10 people in a given race-gender category, I would re-populate with the original 10 and then repeat, as needed. In this way, I'm doing repeated repeated sampling.
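To make the "re-populate and repeat" rule concrete, here is a minimal sketch for a single race-gender category. The helper name draw_cycling is hypothetical (it is not part of my actual code); it shuffles the 10 people, hands them out one at a time, and reshuffles the original 10 whenever the pool runs dry.

```python
import random

def draw_cycling(pool, n_draws, rng):
    """Draw n_draws people; nobody repeats until the whole pool is used up."""
    remaining = []  # people not yet drawn in the current cycle
    draws = []
    for _ in range(n_draws):
        if not remaining:
            # pool exhausted: re-populate with the original people, reshuffled
            remaining = list(pool)
            rng.shuffle(remaining)
        draws.append(remaining.pop())
    return draws

rng = random.Random(0)
pool = [f"White Male{i}" for i in range(1, 11)]
draws = draw_cycling(pool, 25, rng)  # two full cycles plus 5 more draws
```

Each consecutive run of 10 draws contains every person exactly once, which is the behaviour I want per category. Note that this sketch alone does not address uniqueness within a block.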

So the resulting dataframe would look like this (notice no repeats across blocks):

Block Race_Gender Person Name
a WM White Male10
a WM White Male4
a BF Black Female2
a HF Hispanic Female2
a HM Hispanic Male3
a HM Hispanic Male7
a HM Hispanic Male6
a HM Hispanic Male1
a BM Black Male5
a WF White Female7
b BM Black Male1
b BM Black Male3
b BM Black Male10
b HF Hispanic Female1
b HM Hispanic Male5
b WF White Female4
b WF White Female3
b BF Black Female1
b WM White Male6
b WM White Male7
... ... ...

My two main ideas are:

  1. Build a dataframe from the population of 60 so that it's big enough to cover the Slots dataframe. Then sample all the people needed from each race-gender and merge them on. The problem, I think, is ensuring no duplicates within a block.
  2. My current idea: Make a copy of the Population dataframe, then iterate (itertuples(), df.values, or convert to dict and loop) over the Slots dataframe, randomly choosing an appropriate person from the Population and removing them from the copy. If there are no more from that subpopulation, then re-populate it with the originals from the Population dataframe. The only problem is that this is running very slowly. Just 15,000 rows takes 30 seconds to populate!
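For reference, idea 1 could be sketched roughly as follows. This is only an illustration under assumptions: the function name fill_vectorised and the small demo frames are made up for the example, and the column names Block, Race_Gender, and Name mirror the ones used in this question. For each category it concatenates enough independent permutations of the 10 names to cover that category's slots, then assigns them in row order. As noted in idea 1, nothing here prevents a duplicate inside a block when a block straddles two permutations, so the last line counts such duplicates rather than ruling them out.

```python
import numpy as np
import pandas as pd

def fill_vectorised(slots, pop, seed=0):
    """Assign names per category from back-to-back random permutations."""
    rng = np.random.default_rng(seed)
    names_out = np.empty(len(slots), dtype=object)
    # .indices maps each category to the row positions of its slots
    for rg, pos in slots.groupby("Race_Gender").indices.items():
        pos = np.sort(pos)
        names = pop.loc[pop["Race_Gender"] == rg, "Name"].to_numpy()
        n_cycles = -(-len(pos) // len(names))  # ceiling division
        cycled = np.concatenate([rng.permutation(names) for _ in range(n_cycles)])
        names_out[pos] = cycled[: len(pos)]
    out = slots.copy()
    out["Name"] = names_out
    return out

# Small demo data (assumed shapes, not the real 1,000,000-row frame)
pop = pd.DataFrame(
    [(f"{r} {g}{i}", r[0] + g[0])
     for r in ["White", "Black", "Hispanic"]
     for g in ["Male", "Female"]
     for i in range(1, 11)],
    columns=["Name", "Race_Gender"],
)
slots = pd.DataFrame({
    "Block": np.repeat(np.arange(1, 31), 10),
    "Race_Gender": np.tile(
        ["BM", "WM", "HM", "BF", "WF", "HF", "WM", "HM", "BF", "HF"], 30),
})
filled = fill_vectorised(slots, pop)

# Within-block duplicates still have to be detected (and repaired) separately
dupes_in_block = int(filled.duplicated(["Block", "Name"]).sum())
```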

Any better ideas?

Here's the code for my current idea:

import pandas as pd
import random

random.seed(0)

########################
#  FOR REPLICATION
########################
## GENERATE POPULATION DATAFRAME
def generate_pop():
    races = ["White","Black","Hispanic"]
    genders = ["Male","Female"]
    names = []
    race_genders = []
    for i in range(1,11):
        for r in races:
            for g in genders:
                names.append(f"{r} {g}{str(i)}")
                race_genders.append(r[0]+g[0])
    return pd.DataFrame({"Name":names,"Race_Gender":race_genders})

## GENERATE SLOTS DATAFRAME
def generate_slots():
    # Fixed parameters
    race_genders = ["BM","WM","HM","BF","WF","HF"]
    blockSize = 10

    # adjustable parameters
    numBlocksToGenerate = 1000

    # create as many blocks as is specified
    slots = []
    blockID = 1 # this identifies the blocks of 10
    for c in range(1,numBlocksToGenerate+1):
        # initialize block to have at least one of each race_gender, the block ID, and an empty name cell
        block = [(rg, blockID, "") for rg in race_genders]

        # keep adding randomly selected race-genders until the block is full
        while len(block) < blockSize:
            block.append((random.choice(race_genders), blockID, ""))

        # add the block to the list of all blocks
        slots.extend(block)

        blockID += 1

    return pd.DataFrame(slots, columns=["Race_Gender","Block", "Name"])

popDF = generate_pop()
slotsDF = generate_slots()


###################
#   MAIN CODE
###################
# a dynamic copy of the Population of 60 people (with Name and Race_Gender columns).
# It will shrink and grow as people are randomly selected out and re-populated back in
popDF_dynamic = popDF.copy() 

# convert slots DF to dict for faster processing
slots_dict = slotsDF.to_dict('records')

def fill_slot(row):
    global popDF_dynamic
    # get the subset of people in the current race-gender category
    subpop = popDF_dynamic[popDF_dynamic["Race_Gender"] ==  row["Race_Gender"]]

    # if all people in the subpop are used up, re-populate from the original
    if len(subpop) == 0:
        popDF_dynamic = pd.concat([popDF_dynamic, popDF[popDF["Race_Gender"] == row["Race_Gender"]]])
        subpop = popDF_dynamic[popDF_dynamic["Race_Gender"] ==  row["Race_Gender"]]

    # randomly select one
    choice = subpop.sample()

    # drop it from the dynamic Population dataframe
    popDF_dynamic.drop(choice.index, inplace=True)

    # add the person to the row
    row["Name"] = choice.iloc[0]["Name"]     

    return row   

# final dataframe
final = pd.DataFrame([fill_slot(row) for row in slots_dict])
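One variation on the loop above that I could try, sketched here with assumptions: keep the remaining people in plain Python lists keyed by race-gender instead of filtering and dropping DataFrame rows for every slot, and track who is already in the current block so a refill cannot produce a within-block duplicate. The function name fill_slots and the (block, race_gender) tuple layout are hypothetical, and the sketch assumes the rows of each block are contiguous and that no block needs more than 5 people from one category (true for blocks of 10 with at least one of each).

```python
import random

def fill_slots(slot_rows, people_by_group, rng):
    """slot_rows: iterable of (block, race_gender). Returns one name per slot."""
    remaining = {rg: [] for rg in people_by_group}  # current cycle per category
    used_in_block, current_block = set(), None
    names = []
    for block, rg in slot_rows:
        if block != current_block:
            current_block, used_in_block = block, set()
        pool = remaining[rg]
        if not pool:
            # category used up: re-populate from the originals and reshuffle
            pool.extend(people_by_group[rg])
            rng.shuffle(pool)
        # take the last person in the pool who is not already in this block;
        # under the stated assumptions at least one such person always exists
        pos = next(i for i in range(len(pool) - 1, -1, -1)
                   if pool[i] not in used_in_block)
        person = pool.pop(pos)
        used_in_block.add(person)
        names.append(person)
    return names

rng = random.Random(0)
people_by_group = {
    r[0] + g[0]: [f"{r} {g}{i}" for i in range(1, 11)]
    for r in ["White", "Black", "Hispanic"]
    for g in ["Male", "Female"]
}
groups = list(people_by_group)
slot_rows = []
for block in range(1, 1001):
    # at least one of each category, plus 4 random extras
    for rg in groups + [rng.choice(groups) for _ in range(4)]:
        slot_rows.append((block, rg))
names = fill_slots(slot_rows, people_by_group, rng)
```

A person skipped because they are already in the block stays in the pool and is drawn later, so every cycle of 10 draws in a category still uses each person exactly once. The resulting list can be assigned back as the Name column of the Slots dataframe.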


