Monday, May 25, 2020

How to optimize a simulation on more than one variable in R?

I have a data frame organized like this: df <- data.frame(individual, group_name, Z1, Z2, Z3). Each individual in my dataset is a member of a group. I am interested in only a subset of the data (e.g. 15000 out of 25000 observations), and the dataset contains too many zeros. I want to run two different simulations:

  1. Find all possible combinations of individuals for which mean(Z1) is approximately 1, and find the resulting range of Z2 and Z3.
  2. Find all possible combinations of individuals for which mean(Z1), mean(Z2) and mean(Z3) are all approximately 1.

Z1 is shown in a histogram (Histogram), and its boxplot shows too many outliers (Boxplot). To give an overview of my dataset:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.010   0.060   1.854   0.470 108.130

I tried to run the simulation with lapply, assigning sampling weights to the values in my dataset (here Z = Z1):

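# Draw 5000 weighted samples of size 15000 from Z, with replacement
LO <- lapply(1:5000, function(i) sample(Z, 15000, replace = TRUE, prob = 1 / (Z + 8) + 0.2 * Z))
# Mean of each sample
MEANS <- vapply(LO, mean, numeric(1))
hist(MEANS)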
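
As a side note on why changing prob moves the histogram: when values are drawn with replacement using weights p, the expected sample mean is sum(p * Z) / sum(p), so the center can be computed without running the simulation. The sketch below checks this on made-up data; the synthetic Z (built with rlnorm) is only an assumed stand-in for the real Z1, and the number of replicates is reduced to keep it quick.

```r
set.seed(42)
# Hypothetical stand-in for Z1 (NOT the real data): many zeros plus a skewed tail
Z <- c(rep(0, 5000), rlnorm(20000, meanlog = -2, sdlog = 1.5))

# Same weight formula as in the question
p <- 1 / (Z + 8) + 0.2 * Z

# Expected value of the mean of a weighted sample drawn with replacement
expected_mean <- sum(p * Z) / sum(p)

# Check by simulation (fewer replicates than in the question)
MEANS <- replicate(200, mean(sample(Z, 15000, replace = TRUE, prob = p)))
c(expected = expected_mean, simulated = mean(MEANS))
```

If the two numbers agree, tuning prob amounts to choosing weights whose weighted mean of Z equals the target.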

With this approach I have to adjust prob by hand until the histogram is centered on 1. Is this a good way to solve my first problem? For the second problem, how can I optimize the simulation over three variables at once? Should I use a loop with an if condition? As a side question: how can I weight the sampling by the population of each group, so that the larger a group's population, the higher the probability that its individuals are chosen for my sample of 15000?
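
One possible way to avoid the manual tuning, sketched here under stated assumptions, is to pick the weights from a parametric family and let an optimizer find the parameters. The sketch uses exponential tilting, w_i = exp(-sum_j theta_j * (X_ij - target_j)), which is a substitute for the hand-made prob formula above, not the method from the question. The data are synthetic stand-ins, and names such as make_col, objective and gradient are made up for illustration. It does not enumerate all combinations; it only produces samples whose means are centered on the targets.

```r
set.seed(1)
n <- 25000
# Hypothetical stand-in data (NOT the real dataset): zero-inflated, right-skewed columns
make_col <- function() ifelse(runif(n) < 0.2, 0, rlnorm(n, meanlog = -0.5, sdlog = 1.5))
X <- cbind(Z1 = make_col(), Z2 = make_col(), Z3 = make_col())
target <- c(1, 1, 1)

# Centred data: D[i, j] = X[i, j] - target[j]
D <- sweep(X, 2, target)

# Minimising log(mean(exp(-D %*% theta))) makes the weighted column means hit target
objective <- function(theta) {
  e <- as.vector(-D %*% theta)
  m <- max(e)
  m + log(mean(exp(e - m)))      # log-mean-exp, stabilised against overflow
}
gradient <- function(theta) {
  e <- as.vector(-D %*% theta)
  w <- exp(e - max(e))
  -colSums(w * D) / sum(w)       # minus (weighted mean of X - target)
}

fit <- optim(rep(0, ncol(X)), objective, gradient, method = "BFGS",
             control = list(reltol = 1e-12))

# Sampling weights at the fitted theta
e <- as.vector(-D %*% fit$par)
w <- exp(e - max(e))
weighted_means <- colSums(w * X) / sum(w)   # should be close to target

# One weighted sample of 15000 row indices, as in the question
idx <- sample(n, 15000, replace = TRUE, prob = w)
sample_means <- colMeans(X[idx, ])
```

For the first problem, the same code should work with a single column (X[, "Z1", drop = FALSE] and target 1), after which the range of Z2 and Z3 can be read off the sampled rows. No explicit loop with an if condition is needed in either case.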



