Friday, October 21, 2022

Drawing samples of a dataset with specific mean and artificially increased standard deviation

I want to create a list of 10 artificial dataframes that have the same means as my original dataframe but artificially increased standard deviations. In each artificial dataframe, the columns should have the same names and the same number of rows as in the original dataset. Each column should have the same expected mean as the corresponding column in the original dataframe, while its expected standard deviation is increased by x% relative to the standard deviation of the original column, with x taken from seq(10, 100, 10). So the first artificial dataframe has expected standard deviations increased by 10%, the second by 20%, and so on up to 100%.

Based on this other post, "Generate random numbers with fixed mean and sd", this is my attempt so far:

#First define the function to draw random numbers given specific values of n, mean, sd

    rnorm2 <- function(n,mean,sd) {mean+sd*scale(rnorm(n))} 

#Create random df for replicability
df = data.frame(replicate(9,sample(0:1,100,rep=TRUE)))

names(df) = c("a", "b", "c" , "d", 
            "e" , "f", "g" ,
            "h" , "i")

#Compute and store mean, standard deviation and size of each column in my dataset:

# First, initialize vectors

columns = c("a", "b", "c" , "d", 
            "e" , "f", "g" ,
            "h" , "i")

ncols = length(df)

column_means <- vector(mode = "numeric", length = ncols)
column_sd <- vector(mode = "numeric", length = ncols)

#Now loop through each column to obtain mean, standard deviation and increased standard deviation by x

xs = seq(10, 100, 10)

for(x in seq_along(xs)){
    for (i in seq_along(columns)){
     column_means[i] <- mean(df[[i]], na.rm = TRUE)
     column_sd[i] <- sd(df[[i]], na.rm = TRUE)
     column_sd_new[[i]][x] <- column_sd[i] + ((x/100)*column_sd[i])
 } 
}

However, this gives me the following error:

Error in `*tmp*`[[i]] : subscript out of bounds
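
A note on the likely cause, assuming `column_sd_new` was created as an empty list (its initialization is not shown in the code): the assignment `column_sd_new[[i]][x] <- ...` first has to read `column_sd_new[[i]]`, which does not exist yet, so the subscript is out of bounds. Note also that `x` runs over `seq_along(xs)`, i.e. 1 to 10, so `x/100` gives 1% to 10% rather than 10% to 100%. The following is a minimal sketch that sidesteps both issues by computing all increased standard deviations at once. The seed, the `letters[1:9]` column names and the `plus_` labels are illustrative assumptions, not part of the original code.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# Same kind of toy data as in the question: 9 columns of 100 random 0/1 values
df <- data.frame(replicate(9, sample(0:1, 100, replace = TRUE)))
names(df) <- letters[1:9]  # "a" to "i"

xs <- seq(10, 100, 10)     # percentage increases

# Per-column mean and standard deviation, without an explicit loop
column_means <- sapply(df, mean, na.rm = TRUE)
column_sd    <- sapply(df, sd, na.rm = TRUE)

# 9 x 10 matrix: row i holds the sd of column i increased by 10%, 20%, ..., 100%
column_sd_new <- outer(column_sd, 1 + xs / 100)
dimnames(column_sd_new) <- list(names(df), paste0("plus_", xs))
```

With this layout, `column_sd_new["a", "plus_30"]` is the standard deviation of column `a` increased by 30%.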

Also, I cannot find a way to apply the rnorm2 function to obtain a list of 10 artificial dataframes.
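
One possible way to get the list, sketched under assumptions: the helper name `make_inflated_df`, the seed and the `sd_plus_` list names are hypothetical and only serve the illustration. The idea is to apply `rnorm2` to every column for a given percentage, then repeat that for each value in `xs` with `lapply`.

```r
set.seed(1)  # assumption: seed added only to make the sketch reproducible

# rnorm2 as in the question, wrapped in as.numeric() because scale() returns a matrix
rnorm2 <- function(n, mean, sd) as.numeric(mean + sd * scale(rnorm(n)))

# Toy data as in the question
df <- data.frame(replicate(9, sample(0:1, 100, replace = TRUE)))
names(df) <- letters[1:9]

xs <- seq(10, 100, 10)  # percentage increases

# Hypothetical helper: one artificial dataframe for a given percentage increase
make_inflated_df <- function(data, pct) {
  out <- lapply(data, function(col) {
    rnorm2(n    = length(col),
           mean = mean(col, na.rm = TRUE),
           sd   = sd(col, na.rm = TRUE) * (1 + pct / 100))
  })
  as.data.frame(out)  # keeps the original column names and number of rows
}

# List of 10 dataframes, one per percentage in xs
artificial_dfs <- lapply(xs, make_inflated_df, data = df)
names(artificial_dfs) <- paste0("sd_plus_", xs)
```

Because `rnorm2` rescales each draw, every artificial column has exactly the target sample mean and standard deviation. If only the expected values need to match, plain `rnorm(n, mean, sd)` could be used in its place. Either way, the generated values are continuous rather than 0/1.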

Any help would be greatly appreciated!



