Tuesday, December 28, 2021

R: Selecting Rows Based on Conditions Stored in a Data Frame

I am working with the R programming language.

I have the following dataset:

num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
# note: this draws from factor_4, so factor_5 is unused (the output below reflects this)
factor_var_5 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

id <- 1:1000  # row identifier, matching the id column in the output below

my_data <- data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)


> head(my_data)
  id num_var_1 num_var_2 num_var_3 num_var_4  num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1  1  9.439524  5.021006  4.883963  8.496925  11.965498            B           AA          AAA         CCCC         AAAA
2  2  9.769823  4.800225 12.369379  6.722429  16.501132            B           AA          AAA         AAAA         AAAA
3  3 11.558708  9.910099  4.584108 -4.481653  16.710042            C           AA          BBB         AAAA         CCCC
4  4 10.070508  9.339124 22.192276  3.027154  -2.841578            B           CC          DDD         BBBB         AAAA
5  5 10.129288 -2.746714 11.741359 35.984902 -10.261096            B           AA          AAA         DDDD         DDDD
6  6 11.715065 15.202867  3.847317  9.625850  32.053261            B           AA          CCC         BBBB         EEEE

Based on an answer to a previous question (R: Randomly Sampling Mixed Variables), I learned how to draw random samples from this data:

library(dplyr)
library(purrr)

# probability that a given variable is selected in a draw
var_num <- ncol(my_data) - 1
var_select_ratio <- sum(1:var_num) / (var_num^2)

num_func <- function(vec, iter_num) {
  random_val = runif(iter_num, min(vec), max(vec))
  is_select <- sample(c(NA, 1), iter_num, 
                      prob = c(1 - var_select_ratio, var_select_ratio), replace = TRUE)
  return(random_val * is_select)
}

fac_func <- function(vec, iter_num) {
  nlevels <- sample.int(length(levels(vec)), iter_num, replace = TRUE)
  is_select <- sample(c(0, 1), iter_num, 
                      prob = c(1 - var_select_ratio, var_select_ratio), replace = TRUE)
  out <- map2(nlevels, is_select,  # NOTE: this process isn't vectorized
              function(nl, ic){
                if(ic == 0) NULL else sample(vec, nl)
              })
  return(out)
}

integ_func <- function(vec, iter_num) {
  if(is.factor(vec)) fac_func(vec, iter_num) else num_func(vec, iter_num)
}

# build the samples: apply integ_func to every column except id
# (assumed call; 10 draws, matching the 10 rows of output shown below)
res <- my_data %>%
  select(-id) %>%
  map(integ_func, iter_num = 10) %>%
  as_tibble()


# optionally collapse each factor list-column into one space-separated string
res2 <- res %>% 
  mutate_if(is.list, function(col) map_chr(col, function(cell) paste(sort(cell), collapse = " "))) %>%   # collapse
  mutate_if(is.character, function(col) na_if(col, ""))  # replace "" with NA

This produces the following results:

> res2 = data.frame(res2)

> res2
   num_var_1 num_var_2  num_var_3 num_var_4  num_var_5 factor_var_1 factor_var_2    factor_var_3             factor_var_4             factor_var_5
1   8.251683 27.791314  30.525573  33.95768   2.388074            B         <NA>             AAA                     AAAA                     DDDD
2   9.012602        NA         NA        NA  20.236515            A        AA BB            <NA>                     <NA>                     BBBB
3         NA 16.778085  28.097324   5.69020         NA            B           BB     CCC DDD DDD                     <NA> AAAA BBBB CCCC CCCC CCCC
4  12.838667 -3.694075  13.411877  -2.20004         NA         <NA>     AA AA BB     AAA BBB CCC                     <NA> AAAA AAAA BBBB CCCC DDDD
5         NA        NA  11.922439  17.63757         NA          A B     AA AA BB            <NA>                AAAA AAAA                     BBBB
6  12.768595        NA  28.507646        NA         NA            C           AA     BBB DDD DDD      AAAA AAAA CCCC DDDD AAAA AAAA BBBB EEEE EEEE
7         NA        NA -20.424906        NA  20.147004         <NA>        AA AA            <NA> AAAA AAAA AAAA CCCC EEEE                     <NA>
8         NA  6.299722   8.569485  24.82825 -17.715862         <NA>           BB AAA AAA BBB CCC                     <NA>                BBBB EEEE
9  10.846757        NA         NA        NA         NA        A B C     AA BB CC            <NA>                     <NA>                BBBB BBBB
10        NA  4.663916  22.335404        NA         NA        B B C        AA BB AAA AAA AAA DDD AAAA AAAA CCCC EEEE EEEE                     <NA>

My Question: Is it possible to take conditions from different rows in "res2" and use them to perform operations on "my_data"? For example:

  • Take the 6th and 10th rows of "res2": res2[c(6,10),]

  • In Row 6: num_var_2 = NA, num_var_4 = NA and num_var_5 = NA

  • In Row 6: num_var_1 = 12.768595, num_var_3 = 28.507646, factor_var_1 = C, factor_var_2 = AA, factor_var_3 = BBB DDD DDD, factor_var_4 = AAAA AAAA CCCC DDDD, factor_var_5 = AAAA AAAA BBBB EEEE EEEE

  • In Row 10: num_var_1 = NA, num_var_4 = NA, num_var_5 = NA, factor_var_5 = NA

  • In Row 10: num_var_2 = 4.663916, num_var_3 = 22.335404, factor_var_1 = B B C, factor_var_2 = AA BB, factor_var_3 = AAA AAA AAA DDD, factor_var_4 = AAAA AAAA CCCC EEEE EEEE
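
One way to read a row of "res2" as a rule: columns that hold a value are conditions, and columns that hold NA are the ones to modify. The sketch below splits a single row along those lines. `parse_rule` is a hypothetical helper (it is not from the post), and collapsing repeated levels with `unique()` is an assumption, made on the grounds that duplicates do not change which rows match.

```r
# Hypothetical helper: split one row of res2 into conditions and target columns.
parse_rule <- function(rule_row) {
  stopifnot(nrow(rule_row) == 1)
  # TRUE for every column whose single cell is NA
  is_missing <- vapply(rule_row, function(x) is.na(x), logical(1))
  cond_cols <- names(rule_row)[!is_missing]
  conditions <- lapply(rule_row[cond_cols], function(x) {
    if (is.numeric(x)) {
      x  # numeric cell: keep the value as-is
    } else {
      # pasted factor cell: split on spaces, drop repeats (ASSUMPTION)
      unique(strsplit(as.character(x), " ", fixed = TRUE)[[1]])
    }
  })
  list(conditions = conditions, targets = names(rule_row)[is_missing])
}

# Row 6 of res2 as printed in the post, restricted to five columns for brevity
rule6 <- data.frame(
  num_var_1 = 12.768595,
  num_var_2 = NA_real_,
  num_var_3 = 28.507646,
  factor_var_1 = "C",
  factor_var_3 = "BBB DDD DDD",
  stringsAsFactors = FALSE
)

parsed <- parse_rule(rule6)
parsed$targets                  # columns to modify
parsed$conditions$factor_var_3  # levels to match
```

With the full data, the same call would be `parse_rule(res2[6, ])`.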

I want to perform the following operation:

Step 1: Select the rows where my_data$num_var_1 = 12.768595, my_data$num_var_3 = 28.507646, my_data$factor_var_1 = C, my_data$factor_var_2 = AA, my_data$factor_var_3 = BBB DDD DDD, my_data$factor_var_4 = AAAA AAAA CCCC DDDD, my_data$factor_var_5 = AAAA AAAA BBBB EEEE EEEE

Step 2 (using the data from Step 1): Take the columns my_data$num_var_2, my_data$num_var_4 and my_data$num_var_5, and replace a random 30% of their elements with 0

Step 3 (using the data from Step 2): Select the rows where my_data$num_var_2 = 4.663916, my_data$num_var_3 = 22.335404, my_data$factor_var_1 = B B C, my_data$factor_var_2 = AA BB, my_data$factor_var_3 = AAA AAA AAA DDD, my_data$factor_var_4 = AAAA AAAA CCCC EEEE EEEE

Step 4 (using the data from Step 3): Take the columns my_data$num_var_1, my_data$num_var_4, my_data$num_var_5 and my_data$factor_var_5, and replace a random 35% of their elements with 0
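
The four steps above could be chained with a small helper along these lines. This is a sketch under assumptions, not a definitive answer. `apply_rule` and the `toy` data are hypothetical. A numeric condition is treated as an upper bound (`<=`), since exact equality on continuous values would match no rows. A factor condition matches any of the listed levels. The random fraction is drawn separately for each target column. Because a factor cannot store 0 directly, the sketch adds "0" as a level first.

```r
# Hypothetical helper: apply one rule to a data frame.
# conditions: named list; numeric entry = upper bound (ASSUMPTION),
#             character entry = allowed factor levels
# targets:    columns in which a random fraction of matching cells becomes 0
apply_rule <- function(data, conditions, targets, frac) {
  keep <- rep(TRUE, nrow(data))
  for (col in names(conditions)) {
    value <- conditions[[col]]
    if (is.numeric(value)) {
      keep <- keep & data[[col]] <= value
    } else {
      keep <- keep & as.character(data[[col]]) %in% value
    }
  }
  rows <- which(keep)
  if (length(rows) == 0) return(data)  # nothing matches, nothing to change
  n_zero <- round(frac * length(rows))
  for (col in targets) {
    if (is.factor(data[[col]])) {
      # a factor can only hold 0 if "0" is one of its levels
      levels(data[[col]]) <- union(levels(data[[col]]), "0")
      zero <- "0"
    } else {
      zero <- 0
    }
    hit <- rows[sample.int(length(rows), n_zero)]  # independent draw per column
    data[[col]][hit] <- zero
  }
  data
}

# Toy stand-in for my_data (hypothetical, smaller than the data in the post)
set.seed(1)
n <- 200
toy <- data.frame(
  num_var_1 = rnorm(n, 10, 1),
  num_var_2 = rnorm(n, 10, 5),
  factor_var_1 = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  factor_var_5 = factor(sample(c("AAAA", "BBBB"), n, replace = TRUE))
)

# Steps 1-2: first rule, zero 30% of the matching cells in num_var_2
step2 <- apply_rule(toy,
  conditions = list(num_var_1 = 12.768595, factor_var_1 = "C"),
  targets = "num_var_2", frac = 0.30)

# Steps 3-4: second rule, applied to the output of the first, zero 35%
step4 <- apply_rule(step2,
  conditions = list(num_var_2 = 4.663916, factor_var_1 = c("B", "C")),
  targets = c("num_var_1", "factor_var_5"), frac = 0.35)
```

Each call returns the whole data frame with only the matching rows altered, so the second rule sees the zeros written by the first.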

Is it possible to perform Steps 1 to 4 directly using "my_data" and "res2"?

Currently, I can do this manually (R: Randomly Changing Values in a Dataframe).

Can someone please show me how to do this?

Thanks!



