Monday, 21 September 2020

How can I create a stratified sample based on country and the ratio of positives to negatives?

Hi all,

I am working on a project to predict default. I have data for three countries (Morocco, Spain and India), and each has a different ratio of non-defaulters to defaulters. I would like to train on a sample of the combined data, but I also want to account for the fact that each country has a different level of class imbalance.

If I were sampling a single country, I would use stratified sampling. But how can I also take the country into account? Below, I combine all the data and then apply stratified sampling, but in the resulting sample the percentage of defaulters for Spain, for example, is not the same as in the original data.

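import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# ignore_index avoids duplicate row labels from the three source frames
df = pd.concat([moroc, spain, india], ignore_index=True)
y = df['status']
df.drop(columns=['status'], inplace=True)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# Stratifies on status only, so per-country default rates are not preserved
for train_index, test_index in sss.split(df, y):
    x_train, x_test = df.iloc[train_index], df.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

sample = pd.concat([x_train, y_train], axis=1)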

What can I do to take the country into account as well in the code above? The country datasets all have different sizes.
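
One possible approach, as a minimal sketch rather than a definitive answer: stratify on a combined (country, status) key instead of on status alone, so each country keeps roughly its own defaulter rate in the sample. This assumes each country's DataFrame can be tagged with a `country` column before combining. The helper names `stratified_sample` and `make_frame`, the `amount` column, and the toy row counts are all made up for illustration. It uses `train_test_split` with `stratify=` in place of the `StratifiedShuffleSplit` loop.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(frames, target="status", frac=0.8, seed=0):
    """Sample `frac` of the combined rows, stratified by country and target.

    `frames` is a dict mapping a country name to its DataFrame
    (hypothetical layout; each frame must contain the `target` column).
    """
    # Tag every row with its country so the label survives the concat.
    df = pd.concat(
        [frame.assign(country=name) for name, frame in frames.items()],
        ignore_index=True,
    )
    # One stratum per (country, status) pair, e.g. "spain_1".
    strata = df["country"].astype(str) + "_" + df[target].astype(str)
    sample, _ = train_test_split(
        df, train_size=frac, stratify=strata, random_state=seed
    )
    return sample


def make_frame(n_rows, n_default):
    """Build a toy frame with a known number of defaulters (illustrative only)."""
    status = [1] * n_default + [0] * (n_rows - n_default)
    return pd.DataFrame({"amount": range(n_rows), "status": status})


# Toy data: different sizes and different default rates per country.
frames = {
    "morocco": make_frame(1000, 100),
    "spain": make_frame(500, 25),
    "india": make_frame(2000, 400),
}

sample = stratified_sample(frames)

# Compare the default rate per country in the sample to the original rate.
print(sample.groupby("country")["status"].mean())
```

Because the strata are proportional, larger countries still contribute more rows to the sample; if the goal is an equal number of rows per country, each country would need to be sampled separately instead. Note that every (country, status) combination needs at least two rows for the stratified split to work.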



