Wednesday, July 11, 2018

Python - Create a data set with correlated numeric variables

I want to create a dataset with years of experience from 1 to 10 and salaries from 30k to 100k. The salaries should be random but broadly increase with years of experience, so that a person with more experience may sometimes earn less than a person with less experience.

For example:

years of experience | Salary
1                   | 30050
2                   | 28500
3                   | 36000
...
10                  | 100500

Here is what I have done so far:

import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1, 1) * 5000  # plus/minus 5k
data = pd.DataFrame({'experience': years, 'salary': salary})
print(data)

Which gives me:

   experience         salary
0         1.0   31060.903965
1         2.0   38838.681742
2         3.0   46616.459520
3         4.0   54394.237298
4         5.0   62172.015076
5         6.0   69949.792853
6         7.0   77727.570631
7         8.0   85505.348409
8         9.0   93283.126187
9        10.0  101060.903965

As we can see, there are no records where a person with more experience earns less than a person with less experience. How can I fix this? I also want to scale this up to 1000 rows.
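
One likely cause is that random.uniform(-1, 1) returns a single float, so the same offset is added to all ten salaries (note that the first and last values in the output both end in 1060.903965). A minimal sketch of one possible fix is to draw an independent offset for each row. The function name make_salary_data and its parameters n_rows, noise and seed are hypothetical names chosen for illustration, and uniform noise of plus/minus 5k around a straight-line trend is an assumption carried over from the code above, not the only reasonable model:

```python
import numpy as np
import pandas as pd


def make_salary_data(n_rows=1000, noise=5000.0, seed=0):
    """Build a DataFrame of experience (1-10 years) and noisy salaries.

    Hypothetical helper for illustration: linear trend plus uniform noise.
    """
    rng = np.random.default_rng(seed)
    # Integer years of experience, 1 through 10 inclusive.
    years = rng.integers(1, 11, size=n_rows)
    # Linear trend: 30k at 1 year of experience, 100k at 10 years.
    base = 30000.0 + (years - 1) * (100000.0 - 30000.0) / 9.0
    # One independent draw per row, instead of a single shared offset.
    jitter = rng.uniform(-noise, noise, size=n_rows)
    return pd.DataFrame({'experience': years, 'salary': base + jitter})


data = make_salary_data()
print(data.head())
# Correlation between experience and salary; larger noise lowers it.
print(data['experience'].corr(data['salary']))
```

Increasing noise should produce more cases where a more experienced person earns less, at the cost of a weaker correlation. Replacing rng.uniform with rng.normal(0, noise, size=n_rows) would give bell-shaped rather than flat noise, if that suits the data better.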



