So in the biotech company I work for, we scan for small biomarkers that are present in various sample types from experiments being performed by research institutions. For example, some research institution/department is studying what biomarkers may be present in certain individuals of some population of interest. Perhaps over time, they do this kind of a study three times (not necessarily with the same individuals, though some of the same individuals may be present in the future studies -- this is case by case dependent), and then they want to merge the data into one single, finalized data frame. Sounds easy enough, right? Well, it can be a bit more complicated than just stacking the data sets on top of each other.
As you might imagine, sometimes a biomarker shows up in some individuals and not in others. Furthermore, sometimes, a biomarker shows up solely in one of the sampled populations and not any of the other ones sampled. I'm working on an internal R&D project for my company, and am trying to simulate data that we might get from three separate experiments (which are stored in the three "toy" data frames I've created in the code provided below). If you run what I've created below, it will result in one "all" data frame at the end that consists of 30 observations from 30 (fake) individuals, where each biomarker is a column labeled "x1", "x2", etc. Again, the point here is to try and simulate real data for an internal research project I'm working on. I've tried to simulate the fact that sometimes, a biomarker is present in one set and not all the others. This is why the column names aren't all the same and some have names that aren't present in the others.
# bringning in the best data management package there is!
library(dplyr)
# making a couple toy data frames used to try and create dummy examples of all the merged files
set.seed(42)
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")
# adding a dummy SSID to each toy dataframe
toy_df1$SSID <- as.numeric(rep(24001, nrow(toy_df1))) # Sample set ID from the first study
toy_df2$SSID <- as.numeric(rep(24002, nrow(toy_df2))) # Sample set ID from the second study
toy_df3$SSID <- as.numeric(rep(24003, nrow(toy_df3))) # Sample set ID from the third study
# inserting each dummy dataframe into a list object for storage (might be useful later)
toy_data_list <- list(toy_df1, toy_df2, toy_df3)
# merging the toy data sheets into the "Data All"-esque file; this takes each dataframe and stacks them
# and stacks them on top of each other in sequential order of the SSIDs.
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)
My issue is this: I'm having trouble trying to simulate the randomness of the NA occurrences within each "toy" dataframe. For instance, in toy_df_1
, perhaps biomarker x1
was observed in all but the first, third, and seventh individuals. Also still in toy_df_1
, perhaps x3
was only observed in only the even numbered observations, so observations 1, 3, 5, 7, and 9, have NA
values. Etc.
I don't know how to best simulate this random data in each data frame, per each biomarker, and this is what I'm seeking help/input on here. I have provided below the currently (very rough) code ideas I'm working with. My current idea/thought process is this: I want to loop through each toy data frame stored in this toy dataframe list, which I have named toy_data_list
, and then loop through each column within each of these toy data frames and take a random sample of the observations, and replace them with NA
. This is the part I am having trouble with the most I think. I don't want the randomly selected amount of observations to always be 3 or 4. Ideally, sometimes it'd be all 10, 0, 1, 5, 3, 2, etc., see what I mean? The below code is my best attempt at doing this but I'm running into issues with that sample()
function . If anybody has any better ideas or ways of doing this I'm all ears. Thank you!
# adding NA values to the original toy dataframes to simulate real data
amount_sampled <- c(0:10)
for (i in 1:length(df_list)){
for (j in 1:nrow(df_list[[i]])){
df_list[[i]][,j][sample(df_list[[i]][,j], size = sample(amount_sampled, 1))] <- NA
}
}