Monday, 23 October 2017

Selecting x samples randomly with replacement using numpy

I have a dataframe that looks like this:

    revisionId  itemId wikidataType
1    307190482      23           Q5
6    305019084      80           Q5
8    303692414     181           Q5
9    306600439     192           Q5
11   294597048     206           Q5

In the complete dataframe, the wikidataType column contains 100 different values. It's a large dataframe, so I want to restrict it to 1000 records per wikidataType. I selected them randomly using the following:

df = df.groupby('wikidataType', group_keys=False).apply(lambda x: x.sample(1000) if len(x) > 1000 else x.sample(1000, replace=True))

If a type does not have 1000 records, I want to use a "with replacement" strategy so that every wikidataType ends up with the same number of records. Hence, I used replace=True. For example, if wikidataType Q5 has 900 records, I want to repeat 100 randomly chosen records drawn from those 900, so that I end up with 900 distinct records, 100 of which appear again as repeats. I used x.sample(1000, replace=True), but that gave me only 4 distinct records for type Q5, each repeated many times, even though I actually had 1000 distinct records for that type. Does anyone have an idea how to resolve this?
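To make the intended behaviour concrete, here is a minimal sketch of what I am after, not a confirmed fix. The helper names `sample_fixed_size` and `balance_by_type` are hypothetical and not part of pandas. The sketch keeps every original row of a small group and draws only the shortfall with replacement, and it uses `>=` so that a group with exactly 1000 rows is not resampled with replacement (with `len(x) > 1000`, such a group falls into the `replace=True` branch).

```python
import pandas as pd


def sample_fixed_size(group, n=1000, seed=None):
    """Return exactly n rows from one group (hypothetical helper)."""
    if len(group) >= n:
        # Enough rows: draw n distinct rows without replacement.
        return group.sample(n, random_state=seed)
    # Too few rows: keep every original row, then top up the shortfall
    # with rows drawn with replacement from the same group.
    extra = group.sample(n - len(group), replace=True, random_state=seed)
    return pd.concat([group, extra])


def balance_by_type(df, column="wikidataType", n=1000, seed=None):
    """Apply sample_fixed_size to each group and stitch the parts together."""
    parts = [sample_fixed_size(g, n, seed) for _, g in df.groupby(column)]
    return pd.concat(parts)
```

It would be called as `df = balance_by_type(df)`. Iterating over the groups instead of calling `groupby(...).apply(...)` is only a choice made for this sketch, so that the wikidataType column stays in each part regardless of how `apply` treats grouping columns.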

Thanks in advance.



