I need to randomly fill in the missing 'Failure' values within each 'Bucket'.
For instance:
| Bucket | Failure | Id |
|--------|---------|----|
| B1 | F1 | 1 |
| B1 | F2 | 2 |
| B1 | F1 | 3 |
| B1 | null | 4 |
| B1 | null | 5 |
| B2 | F3 | 6 |
| B2 | F4 | 7 |
| B2 | null | 8 |
In the table above, each Bucket can contain many records. Some of those records have Failure populated, but most do not. My goal is to randomly assign a Failure to the empty records based on the proportion of each Failure within the bucket. For instance, among the B1 records with Failure populated, the combination {B1, F1} accounts for 2/3 and {B1, F2} accounts for 1/3.
Therefore the B1 records with a null Failure column (Id = 4, 5) should each randomly get either F1 or F2, in the same proportions: 2/3 for F1 and 1/3 for F2. This logic needs to be applied to every bucket in the table.
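The target proportions described above can be checked with a short base R sketch. The vectors `b1` and `b2` are hypothetical stand-ins that simply mirror the Failure column of the B1 and B2 rows in the table.

```r
# Hypothetical vectors mirroring the Failure column for buckets B1 and B2
b1 <- c("F1", "F2", "F1", NA, NA)
b2 <- c("F3", "F4", NA)

# table() drops NA by default, so only populated failures are counted;
# prop.table() turns the counts into proportions within the bucket
p_b1 <- prop.table(table(b1))
p_b2 <- prop.table(table(b2))

print(p_b1)
print(p_b2)
```

These proportions are the sampling weights that the null records in each bucket should follow.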
I realize this is a complicated thing. I'm relatively new to R, so any code examples would be much appreciated.
I also found this question, but its solution doesn't run for me: Fill missing value based on probability of occurrence
See sample code:
test <- data.frame(
  bucket = c(rep('B1', 5), rep('B2', 3))
  , failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA)
  , Id = 1:8
)
test
sample_fill_na = function(x) {
  x_na = is.na(x)
  x[x_na] = sample(x[!x_na], size = sum(x_na), replace = TRUE)
  return(x)
}
test[, failure := sample_fill_na(failure), by = bucket]
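A likely reason the last line fails is that `:=` combined with `by` is data.table syntax, while `test` is a plain data.frame. With the data.table package loaded, converting first with `setDT(test)` should let that line run. As an alternative that needs no extra package, here is a minimal base R sketch using `ave()` to apply the fill per bucket. The two guards (nothing to fill, nothing to draw from) and the `set.seed()` call are additions of mine, not part of the linked solution.

```r
test <- data.frame(
  bucket  = c(rep("B1", 5), rep("B2", 3)),
  failure = c("F1", "F2", "F1", NA, NA, "F3", "F4", NA),
  Id      = 1:8,
  stringsAsFactors = FALSE
)

sample_fill_na <- function(x) {
  x_na <- is.na(x)
  pool <- x[!x_na]
  # Nothing to fill, or no populated values to draw from: return unchanged
  if (!any(x_na) || length(pool) == 0) return(x)
  # Draw positions in the pool with replacement; values that occur more often
  # in the pool are proportionally more likely to be picked
  x[x_na] <- pool[sample.int(length(pool), size = sum(x_na), replace = TRUE)]
  x
}

set.seed(42)  # arbitrary seed, only to make the random draw reproducible
# ave() splits failure by bucket, applies the function, and reassembles
test$failure <- ave(test$failure, test$bucket, FUN = sample_fill_na)
print(test)
```

Because the draw is made from the populated values themselves, each null in B1 gets F1 with probability 2/3 and F2 with probability 1/3. With only a few nulls per bucket, the realized split will not match those proportions exactly.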