I have a MongoDB collection with ~600,000 documents. Exactly half of them have a particular field set to 0, and the other half have that field set to 1. When I try to get a random sample from this collection using the $sample stage of the aggregation pipeline (via PyMongo), the result skews heavily toward the value 1.
In a 25,000-document sample, there might be 300-400 documents where the field is 0 and 24,000+ documents where it is 1.
If the initial collection is equally distributed, why is this use of $sample returning results with such a vastly different distribution, and how can I get a representative sample from a collection?
Here's the PyMongo line I'm using for the query:
cursor = foo_database.bar_collection.aggregate([{"$sample": {"size": 25000}}])
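For reference, the counts above can be reproduced with something like the following. This is only a minimal sketch: the field name "flag" and both helper functions are hypothetical placeholders, not names from my actual code. The first helper builds a pipeline that samples and then counts per value on the server with $group; the second counts values client-side from any iterable of documents.

```python
from collections import Counter


def sample_and_count_pipeline(field, size):
    """Build a pipeline that samples `size` documents and then
    counts the sampled documents per distinct value of `field`."""
    return [
        {"$sample": {"size": size}},
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
    ]


def field_distribution(docs, field):
    """Count how often each value of `field` occurs in an iterable
    of documents (for example, a PyMongo cursor)."""
    return Counter(doc.get(field) for doc in docs)


# Usage against a live server; "flag" is an assumed field name:
# pipeline = sample_and_count_pipeline("flag", 25000)
# for row in foo_database.bar_collection.aggregate(pipeline):
#     print(row["_id"], row["count"])
```

Running the same $group stage without the $sample stage in front gives the collection-wide counts to compare against.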