Thursday, December 2, 2021

Sample dataframe maintaining multiple frequency distributions

I have an example pandas dataframe, df, below:

{'column_a': {0: 'b', 1: 'b', 2: 'a', 3: 'b', 4: 'd', 5: 'a', 6: 'b', 7: 'b', 8: 'c', 9: 'a', 10: 'a', 11: 'a', 12: 'a', 13: 'c', 14: 'c', 15: 'c', 16: 'b', 17: 'a', 18: 'a', 19: 'b', 20: 'd', 21: 'c', 22: 'a', 23: 'b', 24: 'c', 25: 'c', 26: 'c', 27: 'e', 28: 'e', 29: 'e', 30: 'e', 31: 'c', 32: 'e', 33: 'e', 34: 'd', 35: 'e', 36: 'd', 37: 'e', 38: 'd', 39: 'b', 40: 'd', 41: 'c', 42: 'b', 43: 'd', 44: 'c', 45: 'e', 46: 'd', 47: 'c', 48: 'e', 49: 'b', 50: 'c'}, 'column_b': {0: 'c', 1: 'b', 2: 'b', 3: 'd', 4: 'b', 5: 'a', 6: 'd', 7: 'c', 8: 'c', 9: 'd', 10: 'a', 11: 'a', 12: 'b', 13: 'a', 14: 'c', 15: 'd', 16: 'd', 17: 'c', 18: 'b', 19: 'd', 20: 'a', 21: 'a', 22: 'd', 23: 'b', 24: 'a', 25: 'c', 26: 'e', 27: 'd', 28: 'b', 29: 'c', 30: 'd', 31: 'b', 32: 'e', 33: 'b', 34: 'b', 35: 'c', 36: 'b', 37: 'b', 38: 'd', 39: 'c', 40: 'b', 41: 'a', 42: 'b', 43: 'e', 44: 'e', 45: 'c', 46: 'e', 47: 'c', 48: 'b', 49: 'b', 50: 'c'}, 'column_c': {0: 'b', 1: 'd', 2: 'b', 3: 'b', 4: 'd', 5: 'c', 6: 'b', 7: 'a', 8: 'a', 9: 'a', 10: 'a', 11: 'b', 12: 'd', 13: 'c', 14: 'b', 15: 'a', 16: 'a', 17: 'a', 18: 'b', 19: 'c', 20: 'a', 21: 'a', 22: 'b', 23: 'd', 24: 'd', 25: 'c', 26: 'd', 27: 'c', 28: 'c', 29: 'e', 30: 'd', 31: 'c', 32: 'd', 33: 'c', 34: 'b', 35: 'b', 36: 'd', 37: 'd', 38: 'd', 39: 'b', 40: 'c', 41: 'e', 42: 'e', 43: 'b', 44: 'b', 45: 'd', 46: 'd', 47: 'c', 48: 'e', 49: 'd', 50: 'b'}, 'column_d': {0: 'b', 1: 'c', 2: 'd', 3: 'd', 4: 'b', 5: 'b', 6: 'd', 7: 'd', 8: 'd', 9: 'b', 10: 'd', 11: 'c', 12: 'b', 13: 'a', 14: 'c', 15: 'c', 16: 'd', 17: 'c', 18: 'd', 19: 'a', 20: 'd', 21: 'b', 22: 'd', 23: 'b', 24: 'd', 25: 'e', 26: 'c', 27: 'c', 28: 'c', 29: 'd', 30: 'c', 31: 'e', 32: 'd', 33: 'd', 34: 'd', 35: 'b', 36: 'c', 37: 'e', 38: 'b', 39: 'e', 40: 'b', 41: 'c', 42: 'b', 43: 'e', 44: 'b', 45: 'c', 46: 'd', 47: 'c', 48: 'c', 49: 'b', 50: 'd'}, 'target': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 
1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 0, 26: 0, 27: 0, 28: 0, 29: 0, 30: 0, 31: 0, 32: 0, 33: 0, 34: 0, 35: 0, 36: 0, 37: 0, 38: 0, 39: 0, 40: 0, 41: 0, 42: 0, 43: 0, 44: 0, 45: 0, 46: 0, 47: 0, 48: 0, 49: 0, 50: 0}}

What I am trying to accomplish is to select a subsample of this dataframe of an arbitrary length or percentage, but in doing so, I want to maintain (as closely as possible) the frequency distributions of each value for each class.

For example, if I simply want to subsample the dataframe, I can use the .sample() method:

smaller_df = df.sample(n=100) or smaller_df = df.sample(frac=0.1)

However, a plain random sample may not preserve the distribution of values in each column within each class. I need to preserve these value densities while reducing the size of my dataset.
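As a starting point, here is a minimal sketch of sampling the same fraction from every target class with groupby(...).sample(), which I believe is available from pandas 1.1 onward. The toy frame and the names toy and smaller are made up for illustration. This only keeps the class sizes proportional; it does not by itself guarantee that each column's value frequencies inside a class are preserved.

```python
import pandas as pd

# Hypothetical toy data standing in for df (not the data from the question).
toy = pd.DataFrame({
    "column_a": list("abcab" * 20),
    "target": [1] * 50 + [0] * 50,
})

# Draw the same fraction from each target class so that the
# class proportions in the sample match those of the full frame.
smaller = toy.groupby("target").sample(frac=0.2, random_state=0)

print(smaller["target"].value_counts())
```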

I can see these frequency densities with:

for col in df.columns.drop('target'):
    # per-class relative frequency of each value in this column
    print(df.groupby('target')[col].value_counts(normalize=True))

That output looks like:

target  column_a
0       e           0.384615
        c           0.269231
        d           0.230769
        b           0.115385
1       a           0.360000
        b           0.320000
        c           0.240000
        d           0.080000

I have seen this post on Stack Overflow, which seems to answer this for a single distribution but not for multiple distributions.

Ideally, how can I downsample my dataframe with fewer samples while maintaining each column's frequency distribution? My actual dataset has shape (8370994, 731).
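One simple heuristic, sketched here under assumptions rather than offered as a definitive answer, is to draw several class-stratified candidates and keep the one whose per-class frequencies sit closest to the full data. The helpers class_freqs and best_of_n, the toy frame, and the max-gap criterion are all hypothetical choices. One-hot encoding 731 columns over millions of rows may need a lot of memory, so in practice the scoring might have to run on chunks or on a subset of columns.

```python
import pandas as pd

def class_freqs(frame, class_col="target"):
    # One-hot encode the feature columns; the per-class mean of each
    # dummy column is the frequency of that value within the class.
    dummies = pd.get_dummies(frame.drop(columns=class_col), dtype=float)
    return dummies.groupby(frame[class_col]).mean()

def best_of_n(frame, frac, n_tries=20, class_col="target", seed=0):
    reference = class_freqs(frame, class_col)
    best, best_gap = None, float("inf")
    for i in range(n_tries):
        # Candidate: same fraction drawn from every class.
        cand = frame.groupby(class_col).sample(frac=frac, random_state=seed + i)
        # Values missing from the candidate get frequency 0.
        freqs = class_freqs(cand, class_col).reindex(
            index=reference.index, columns=reference.columns, fill_value=0.0
        )
        gap = (reference - freqs).abs().to_numpy().max()
        if gap < best_gap:
            best, best_gap = cand, gap
    return best, best_gap

# Hypothetical toy data, 100 rows per class.
toy = pd.DataFrame({
    "column_a": list("aabbc" * 40),
    "column_b": list("xyxyz" * 40),
    "target": [0, 1] * 100,
})
best, gap = best_of_n(toy, frac=0.25)
print(len(best), gap)
```

Raising n_tries trades run time for a better chance of a close match; it does not guarantee an exact one.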



