"Random" sample from MongoDB returning heavily skewed results

I have a MongoDB collection with ~600,000 documents. Exactly half of them have a particular field set to 0, and the other half have that same field set to 1. When I draw a random sample from this collection using the $sample stage of the aggregation pipeline (via PyMongo), the result skews heavily toward the value 1.

In a 25,000-record sample, there might be 300-400 records where the field is 0 and 24,000+ records where it is 1.
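To put those numbers in context, here is a rough sketch of what a truly uniform sample would be expected to contain. It assumes an idealized draw without replacement (a hypergeometric model), which may not match how $sample works internally; the function name is made up for illustration.

```python
import math

def expected_zero_count(population, zeros, sample_size):
    """Mean and standard deviation of the number of zero-valued documents
    in a uniform sample drawn without replacement (hypergeometric model)."""
    p = zeros / population
    mean = sample_size * p
    # Finite-population correction: no document is drawn twice in this model.
    fpc = (population - sample_size) / (population - 1)
    std = math.sqrt(sample_size * p * (1 - p) * fpc)
    return mean, std

# Figures from the question: 600,000 documents, half of them 0, sample of 25,000.
mean, std = expected_zero_count(600_000, 300_000, 25_000)
print(f"expected zeros: {mean:.0f} +/- {std:.0f}")
```

Under that assumption the count of 0 records should sit near 12,500 with a standard deviation well under 100, so 300-400 is far outside anything chance would explain.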

If the collection is evenly split, why does $sample return results with such a vastly different distribution, and how can I get a representative sample from the collection?

Here's the PyMongo line I'm using for the query:

cursor = foo_database.bar_collection.aggregate([{"$sample": {"size": 25000}}])
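For reference, this is roughly how the skew can be measured once the sample comes back. It is a minimal sketch: the field name "flag" and the helper name are placeholders (the real field name is not shown here), and a small list of stand-in documents is used so the snippet runs without a database.

```python
from collections import Counter

def field_distribution(docs, field):
    """Count how often each value of `field` occurs in an iterable of documents."""
    return Counter(doc.get(field) for doc in docs)

# Stand-in documents; "flag" is a hypothetical field name.
# In practice `docs` would be the cursor returned by aggregate().
docs = [{"flag": 0}, {"flag": 1}, {"flag": 1}]
counts = field_distribution(docs, "flag")
print(counts[0], counts[1])
```

Passing the cursor from the aggregate() call in place of the stand-in list should give the 0 and 1 counts for the sample directly, since a PyMongo cursor yields documents as dictionaries.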

Sunday, October 22, 2017
