Tuesday, 19 November 2019

Using a pandas DataFrame, how to randomly select rows under multiple conditions

I have a data set with the columns NDVI, Lat, Long, Group, Village and Taluka.

I want to randomly select 10 villages in each taluka (block) and, within each selected village, randomly select 5 rows, so that each taluka contributes 50 rows in total. I am stuck on how to write the random selection. A village should only be eligible if it has at least 5 entries. Within a village, the 5 points should be allocated in proportion to the "Group" column. For example, if village XYZ has 70% of its area in the group "Very Poor", then n = 5 * 0.70 = 3.5, which rounds to 4, so 4 rows are selected from that group. If village XYZ has 30% of its area in the group "Good", then n = 5 * 0.30 = 1.5, which rounds to 2, so 2 rows are randomly selected from that group.
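
The rounding rule in the example above can be sketched as follows. This is only an illustration, not a tested solution: the helper name `allocate_rows` is hypothetical, and it assumes that a group's share of a village can be measured by its share of rows, whereas the question speaks of share of area.

```python
import pandas as pd

def allocate_rows(groups: pd.Series, n: int = 5) -> pd.Series:
    """Rows to draw from each Group: n * share, rounded half up."""
    counts = groups.value_counts()
    total = int(counts.sum())
    # Integer form of floor(n * count / total + 0.5), which avoids
    # floating-point surprises at exact halves such as 3.5 and 1.5.
    return (2 * n * counts + total) // (2 * total)

# Hypothetical village: 70% of its rows are "Very Poor", 30% are "Good"
village_groups = pd.Series(["Very Poor"] * 7 + ["Good"] * 3)
allocation = allocate_rows(village_groups)
print(allocation.to_dict())
```

For this village the allocation is 4 rows for "Very Poor" and 2 rows for "Good". Note that rounding each group separately gives 6 rows here rather than 5, so the per-village total may need to be trimmed afterwards.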

    import pandas as pd
    import numpy as np

    df = pd.read_excel("/home/desktop/Music/Data-Balaghat.xlsx")

    # Number of rows available for each village within its taluka
    df['No.of Points'] = df.groupby(['Taluka', 'Village'])['NDVI'].transform('count')
    # Keep only the villages that have at least 5 rows
    sample = df.loc[df['No.of Points'] >= 5]
    def f(x):
        labels = ['Very Poor', 'Poor', 'Average', 'Good']
        # sort_values takes the sort columns as a list
        x = x.sort_values(['Village', 'NDVI'], ascending=False)
        # Split NDVI into four quantile-based levels
        x['Level'] = pd.qcut(x['NDVI'], 4, labels=labels)
        x['Sum_Level_wise'] = x.groupby(['Village', 'Level'])['NDVI'].transform('sum')
        # Share of the group's total NDVI that falls in each level
        x['Probability'] = x['Sum_Level_wise'].div(x['NDVI'].sum()).round(2)
        x['Sample'] = x['Probability'] * x.groupby('Level')['NDVI'].transform('size')
        x['Selected villages'] = x['Sample'].apply(np.ceil).astype(int)
        x['Selected village'] = x.groupby('Level').apply(lambda g: g['Village'].head(g['Selected villages'].iat[0])).reset_index(level=0)['Village']
        x['Selected village'] = x['Selected village'].fillna('')
        return x

    df1 = df.groupby(['Taluka', 'Village']).apply(f)
    # pd.np is deprecated; use numpy directly and avoid inplace on a column
    df1['Selected village'] = df1['Selected village'].replace('', np.nan)
    df1 = df1.dropna(subset=['Selected village'])
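
For comparison, the two-stage selection described in the question (10 villages per taluka, then 5 rows per village) could be sketched as below. It is a minimal sketch under stated assumptions, not a verified answer: the function name `two_stage_sample`, its parameters and the toy data are all hypothetical, and the rows inside a village are drawn by simple random sampling, which follows the "Group" shares only on average rather than by the exact rounded allocation.

```python
import pandas as pd

def two_stage_sample(df, n_villages=10, n_rows=5, seed=0):
    # Keep only villages that have at least n_rows rows
    size = df.groupby(["Taluka", "Village"])["NDVI"].transform("size")
    eligible = df[size >= n_rows]

    # Stage 1: shuffle the distinct villages, then keep the first
    # n_villages of each taluka (a random draw without replacement)
    villages = eligible[["Taluka", "Village"]].drop_duplicates()
    chosen = (
        villages.sample(frac=1, random_state=seed)
        .groupby("Taluka")
        .head(n_villages)
    )

    # Stage 2: shuffle the rows of the chosen villages, then keep the
    # first n_rows of each village
    picked = eligible.merge(chosen, on=["Taluka", "Village"])
    return (
        picked.sample(frac=1, random_state=seed)
        .groupby(["Taluka", "Village"])
        .head(n_rows)
    )

# Toy data (hypothetical): 2 talukas, 12 villages of 6 rows each,
# plus one village that is too small to be eligible
rows = [(t, f"{t}-V{v}", 0.1 * r) for t in "AB" for v in range(12) for r in range(6)]
rows += [("A", "A-small", 0.5)] * 3
toy = pd.DataFrame(rows, columns=["Taluka", "Village", "NDVI"])
result = two_stage_sample(toy)
print(result.groupby("Taluka").size())
```

Shuffling with `sample(frac=1)` and then taking `head(n)` per group keeps the sketch free of `groupby.apply`, whose handling of grouping columns differs between pandas versions. A taluka with fewer than 10 eligible villages simply contributes all of them.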

Data set



