My data contains employees' hire dates and their pay grades. Pay grades are divided into categories (1 = Intern, 2 = Junior, 3 = Senior, ...).
Based on this data, I'm trying to generate approximate birth dates for these employees, taking into account that an employee would be at least 23 years old when hired.
This is the function I developed:
import math
import random
from datetime import datetime

def generate_birth_date(paygrade, hire_date_str):
    if isinstance(hire_date_str, float) and math.isnan(hire_date_str):
        # Handle the case when hire_date_str is NaN
        return None
    if isinstance(hire_date_str, float):
        hire_date_str = str(int(hire_date_str))
    hire_date = datetime.strptime(hire_date_str, "%y-%m-%d").date()
    if paygrade == 'Intern':
        birth_year = random.randint(1998, 2000)
    elif paygrade == 'Junior':
        birth_year = random.randint(1996, 1998)
    elif paygrade == 'Senior':
        birth_year = random.randint(1994, 1996)
    elif paygrade == 'Manager':
        birth_year = random.randint(1992, 1994)
    elif paygrade == 'Senior Manager':
        birth_year = random.randint(1990, 1992)
    elif paygrade == 'Director':
        birth_year = random.randint(1988, 1990)
    else:
        birth_year = random.randint(1982, 1984)
    birth_month = random.randint(1, 12)
    birth_day = random.randint(1, 28)  # Assuming maximum of 28 days in a month
    birth_date = datetime(birth_year, birth_month, birth_day)
    return birth_date.date()
And this is how I'm calling it:
# Apply the function to the PAY_GRADE and HIRE_DATE columns to generate birth dates
df['BIRTH_DATE'] = df.apply(lambda row: generate_birth_date(row['PAY_GRADE'], row['HIRE_DATE']), axis=1)
The results are not 100% accurate, because I feel like the function sometimes takes into account only the pay grade and sometimes only the hire date. For instance, an employee may have been hired in 2006 with pay grade 2, meaning he's a Junior and was at least 23 years old at that time. That means he would be almost 40 years old, at the least, by now. How can I correct my function to get realistic results?
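To make the expected behaviour concrete, here is a rough sketch of one possible direction: derive the birth year from the hire year minus an age at hire, instead of from fixed year ranges. Apart from the floor of 23, everything in it is an assumption made for illustration: the name `birth_date_from_hire`, the `MIN_AGE_AT_HIRE` values per pay grade, and the `AGE_SPREAD` of 3 years. It also assumes the hire date has already been parsed into a `date` object.

```python
import random
from datetime import date

# Hypothetical minimum ages at hire per pay grade. Only the floor of 23
# comes from the question; the other values are placeholders to adjust.
MIN_AGE_AT_HIRE = {
    "Intern": 23,
    "Junior": 23,
    "Senior": 27,
    "Manager": 30,
    "Senior Manager": 33,
    "Director": 36,
}
DEFAULT_MIN_AGE = 23  # fallback for unknown pay grades (assumption)
AGE_SPREAD = 3        # random extra years on top of the minimum (assumption)


def birth_date_from_hire(paygrade, hire_date):
    """Return a random birth date consistent with the pay grade and hire date."""
    min_age = MIN_AGE_AT_HIRE.get(paygrade, DEFAULT_MIN_AGE)
    age_at_hire = min_age + random.randint(0, AGE_SPREAD)

    birth_month = random.randint(1, 12)
    birth_day = random.randint(1, 28)  # 28 keeps every month valid
    birth = date(hire_date.year - age_at_hire, birth_month, birth_day)

    # If the birthday falls later in the year than the hire date, the
    # employee would be one year too young on the hire date, so shift
    # the birth year back by one.
    if (birth.month, birth.day) > (hire_date.month, hire_date.day):
        birth = birth.replace(year=birth.year - 1)
    return birth
```

With this approach, a Junior hired in 2006 always gets a birth year of 1983 or earlier, which matches the example above. In the `df.apply` call, `row['HIRE_DATE']` would first have to be converted to a date, using whatever format the column actually has.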