random: Setting minimum sample size for multiple sub-populations based on smallest sub-population

Monday, 27 February 2017

I have one population of users, split into sub-populations based on date of birth. About 20 buckets of users fall into the desired age groups.

The goal is to see how the different buckets interact with a system over time.

Bucket sizes vary: the biggest bucket has about 20,000 users (at the midpoint of the distribution), while both tail ends have fewer than 200 users each.

To answer the question of system usage over time, I have cleaned the data and am taking a sample from each bucket whose size is 0.9 times the size of the smallest sub-population.

Then I resample with replacement N times (N can be anywhere from 100 to 10,000). The average of these resamples closely approaches the sub-population mean of each bucket. What I find is that, for most interaction metrics over time (1, 2, 3, 4, 5 and 6 months), the tail end with the fewest users is the most active. (This could suggest that the larger buckets contain a large proportion of inactive users, or that their active users are simply not as active as those in other buckets.)
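The procedure described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `resampled_means`, the dict-of-lists layout, and the synthetic usage counts are all hypothetical and not taken from my actual pipeline.

```python
import random
import statistics

def resampled_means(buckets, fraction=0.9, n_resamples=1000, seed=0):
    """buckets: dict mapping a bucket label to a list of per-user usage counts.

    Hypothetical helper: returns the average of the resample means per bucket.
    """
    rng = random.Random(seed)
    # Common sample size: a fraction (0.9) of the smallest bucket's size.
    k = int(fraction * min(len(values) for values in buckets.values()))
    result = {}
    for label, values in buckets.items():
        # Draw k users without replacement from this bucket.
        sample = rng.sample(values, k)
        # Resample that sample with replacement and record each resample mean.
        means = [
            statistics.fmean(rng.choices(sample, k=k))
            for _ in range(n_resamples)
        ]
        result[label] = statistics.fmean(means)
    return result

# Synthetic data for illustration only (not real usage figures).
demo_rng = random.Random(1)
buckets = {
    "tail": [demo_rng.randint(0, 30) for _ in range(200)],
    "middle": [demo_rng.randint(0, 10) for _ in range(20000)],
}
print(resampled_means(buckets, n_resamples=200))
```

With this layout every bucket contributes the same number of users to each resample, which is the "fair" comparison I am aiming for.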

I took a quick summary of each bucket to make sure there are no irregularities, and indeed the data shows that the bucket with the fewest users has higher quartiles, mean, minimum and maximum values than the other buckets.
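A per-bucket summary of this kind could look like the sketch below. The helper name `summarize` and the toy input are assumptions for illustration; `statistics.quantiles` needs Python 3.8 or later.

```python
import statistics

def summarize(values):
    """Hypothetical helper: quartiles, mean, min and max for one bucket."""
    # n=4 returns the three cut points that split the data into quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
    }

# Toy usage counts for one bucket (illustrative only).
print(summarize([1, 2, 3, 4, 5, 6, 7]))
```

Running it on every bucket and lining the rows up side by side is enough to spot a bucket whose whole distribution is shifted upward.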

I also went over the data collection methodology to make sure there are no errors in how the data was obtained, and spot checks of various data points support what the graphs of the resampled values show.

My question is: should I set the sample size for each bucket independently? My gut tells me no, since all the buckets belong to the same population, so if I sample from the buckets each sample has to be fair and should therefore use the same number of data points, based on the smallest bucket.
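One way to look at the two options empirically is to compare how much the resample means vary for a large bucket under each sample size. The sketch below is only an assumption-laden illustration (the name `bootstrap_se`, the sizes 180 and 2000, and the synthetic data are all hypothetical); it does not decide which scheme is right, it only makes the difference in precision visible.

```python
import random
import statistics

def bootstrap_se(values, sample_size, n_resamples=200, seed=0):
    """Hypothetical helper: spread (standard deviation) of the resample means."""
    rng = random.Random(seed)
    # Subsample the bucket without replacement, then bootstrap that subsample.
    sample = rng.sample(values, sample_size)
    means = [
        statistics.fmean(rng.choices(sample, k=sample_size))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)

# Synthetic large bucket for illustration only.
demo_rng = random.Random(2)
big_bucket = [demo_rng.randint(0, 10) for _ in range(20000)]

# Size tied to the smallest bucket (0.9 * 200) versus a larger per-bucket size.
se_common = bootstrap_se(big_bucket, sample_size=180)
se_own = bootstrap_se(big_bucket, sample_size=2000)
print(se_common, se_own)
```

The larger per-bucket sample gives a tighter estimate of that bucket's mean, since the standard error of a mean shrinks roughly with the square root of the sample size.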

There is no modelling involved; this is just looking at the average usage per user in each bucket per month.

Is my approach more or less on the right track?
