Hi all,
I am working on a project to predict default. I have data for three countries, Morocco, Spain and India, and each has a different ratio of non-defaulters to defaulters. I would like to train on a sample of the combined data, but I also want to take into account that the level of imbalance differs by country.
If I were sampling a single country, I would use stratified sampling. But how can I also take the country into account? For example, below I combine all the data and then apply stratified sampling, but in the resulting sample the percentage of defaulters for Spain is not the same as in the original Spain data.
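import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# ignore_index avoids duplicate row labels across the three country frames
df = pd.concat([moroc, spain, india], ignore_index=True)
y = df['status']
df = df.drop(columns=['status'])
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# this stratifies on status only, so per-country default rates are not preserved
for train_index, test_index in sss.split(df, y):
    x_train, x_test = df.iloc[train_index], df.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
sample = pd.concat([x_train, y_train], axis=1)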
What can I do to take the country into account as well in the above? The country datasets all have different sizes.
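One possible direction, offered as a sketch rather than a definitive answer: stratify on a combined country/status key instead of on status alone, so that every (country, status) pair becomes its own stratum. The code below assumes each frame carries a `country` column (the original frames may not have one, so it would need to be added before concatenating). The `make_country` helper, the `income` feature and the default rates are hypothetical toy data used only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

def make_country(name, n, default_rate):
    # Hypothetical toy data standing in for the real country frames.
    n_def = int(round(n * default_rate))
    status = np.array([1] * n_def + [0] * (n - n_def))
    return pd.DataFrame({
        "income": rng.normal(size=n),  # placeholder feature
        "status": status,              # 1 = defaulter, 0 = non-defaulter
        "country": name,               # assumed column, added per frame
    })

moroc = make_country("morocco", 1000, 0.10)
spain = make_country("spain", 2000, 0.05)
india = make_country("india", 5000, 0.20)

df = pd.concat([moroc, spain, india], ignore_index=True)

# Combined key: one stratum per (country, status) pair.
strata = df["country"] + "_" + df["status"].astype(str)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(df, strata))
sample = df.iloc[train_index]

# Compare per-country default rates before and after sampling.
orig_rates = df.groupby("country")["status"].mean()
sample_rates = sample.groupby("country")["status"].mean()
print(pd.DataFrame({"original": orig_rates, "sample": sample_rates}))
```

Because the split is proportional within each stratum, the sample should keep both each country's share of the rows and each country's default rate, up to rounding. Every (country, status) combination needs at least two rows for the split to work. The same combined key can also be passed to `train_test_split` through its `stratify` argument.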