I have a dataframe which looks like this:
revisionId itemId wikidataType
1 307190482 23 Q5
6 305019084 80 Q5
8 303692414 181 Q5
9 306600439 192 Q5
11 294597048 206 Q5
In the complete dataframe, there are 100 different values in the column wikidataType. It's a large dataframe, so I want to restrict it to 1000 records per wikidataType, which I selected randomly using the following:
df = df.groupby('wikidataType', group_keys=False).apply(lambda x: x.sample(1000) if len(x) > 1000 else x.sample(1000, replace=True))
If some types do not have 1000 records, I want to sample with replacement so that all wikidataTypes have the same number of records; hence I used replace=True. For example, if I have 900 records for wikidataType Q5, I want to repeat 100 random records drawn from those 900, so that in the end I have 900 distinct records plus 100 repeats. I used x.sample(1000, replace=True), but that gave me only 4 distinct records for type Q5, each repeated many times, even though I actually had 1000 distinct records for that type. Does anyone have an idea how to resolve this?
Thanks in advance.
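One way to get the behaviour described above (keep every existing row, and draw only the missing rows with replacement) could be sketched as follows. This is a minimal sketch, not a confirmed fix for the 4-distinct-records symptom: the helper name `sample_to_size`, the `seed` parameter, and the toy data with a target of 10 rows per type are all assumptions for illustration. It uses `>=` so that a group with exactly the target number of rows is returned without any duplication.

```python
import pandas as pd


def sample_to_size(group: pd.DataFrame, n: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Return exactly n rows from group (hypothetical helper)."""
    if len(group) >= n:
        # Enough rows: plain sampling without replacement.
        return group.sample(n, random_state=seed)
    # Too few rows: keep all of them, then draw only the shortfall
    # with replacement from the same group.
    extra = group.sample(n - len(group), replace=True, random_state=seed)
    return pd.concat([group, extra])


# Toy data (assumed): 9 rows of type Q5 and 21 rows of type Q6.
df = pd.DataFrame({
    "revisionId": range(30),
    "itemId": range(100, 130),
    "wikidataType": ["Q5"] * 9 + ["Q6"] * 21,
})

n = 10  # stands in for 1000 in the real dataframe
parts = [sample_to_size(g, n) for _, g in df.groupby("wikidataType")]
balanced = pd.concat(parts)

# Rows per type, and distinct revisionIds per type, to check the result.
print(balanced["wikidataType"].value_counts())
print(balanced.groupby("wikidataType")["revisionId"].nunique())
```

Iterating over the groups and concatenating, rather than calling `groupby(...).apply(...)`, is only a design choice here: it keeps the wikidataType column in each piece regardless of pandas version. Comparing `nunique()` per type before and after sampling is also a quick way to see whether the original dataframe really contains as many distinct rows per type as expected.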