Friday, January 26, 2018

Replace NA values in a data frame column based on the probability of occurrence of the non-NA values

I need to randomly fill in the missing 'Failure' values within each 'Bucket'.

For instance,

| Bucket | Failure | Id |
|--------|---------|----|
| B1     | F1      | 1  |
| B1     | F2      | 2  |
| B1     | F1      | 3  |
| B1     | null    | 4  |
| B1     | null    | 5  |
| B2     | F3      | 6  |
| B2     | F4      | 7  |
| B2     | null    | 8  |

In the table above, each Bucket can contain many records. Some of those records have Failure populated, but most do not. My goal is to randomly assign a Failure to the remaining records based on the proportion of each Failure within its bucket. For instance, among the B1 records with Failure populated, {B1, F1} accounts for 2/3 and {B1, F2} for 1/3.

Therefore, the B1 records with a null Failure (Id = 4, 5) should each be randomly assigned either F1 or F2, with probability 2/3 for F1 and 1/3 for F2. This logic needs to be applied to every bucket in the table.
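
To make the target proportions concrete, here is a minimal base R sketch that computes them from the sample data. The object names `counts` and `props` are illustrative only, and `stringsAsFactors = FALSE` is an assumption added so the column stays character on older R versions.

```r
# Sample data from the question.
test <- data.frame(
  bucket  = c(rep("B1", 5), rep("B2", 3)),
  failure = c("F1", "F2", "F1", NA, NA, "F3", "F4", NA),
  Id      = 1:8,
  stringsAsFactors = FALSE
)

# Cross-tabulate bucket against failure; table() drops NA values by default,
# so only records with a populated failure are counted.
counts <- table(test$bucket, test$failure)

# Turn the counts into row-wise proportions, so each bucket sums to 1.
props <- prop.table(counts, margin = 1)
props
```

For bucket B1 this should give 2/3 for F1 and 1/3 for F2, which are the sampling probabilities the missing rows are meant to follow.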

I realize this is a fairly complicated task. I'm relatively new to R, so any code examples would be much appreciated.

In the meantime, I found this question, but its solution doesn't run for me: Fill missing value based on probability of occurrence

See sample code:

test <- data.frame(
    bucket = c(rep('B1', 5), rep('B2', 3))
    , failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA)
    , Id = 1:8
)

test

sample_fill_na = function(x) {
    x_na = is.na(x)
    x[x_na] = sample(x[!x_na], size = sum(x_na), replace = TRUE)
    return(x)
}

test[, failure := sample_fill_na(failure), by = bucket]
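
The last line most likely fails because `:=` and `by =` inside `[` are data.table syntax, while `test` is a plain data.frame. A minimal sketch of a version that should run, assuming the data.table package is installed. The guard for buckets with nothing to sample from, the index-based sampling, and `set.seed(42)` are additions for illustration and are not part of the linked answer.

```r
library(data.table)  # assumption: the data.table package is installed

test <- data.frame(
  bucket  = c(rep("B1", 5), rep("B2", 3)),
  failure = c("F1", "F2", "F1", NA, NA, "F3", "F4", NA),
  Id      = 1:8,
  stringsAsFactors = FALSE
)

sample_fill_na <- function(x) {
  x_na <- is.na(x)
  # Nothing to fill, or no observed values to draw from: return x unchanged.
  if (!any(x_na) || all(x_na)) return(x)
  pool <- x[!x_na]
  # Drawing with replacement from the observed values reproduces their
  # empirical proportions in expectation. Sampling indices rather than the
  # values themselves avoids sample()'s special case for a single number.
  idx <- sample.int(length(pool), size = sum(x_na), replace = TRUE)
  x[x_na] <- pool[idx]
  x
}

setDT(test)   # convert the data.frame to a data.table by reference
set.seed(42)  # only to make the random draw reproducible
test[, failure := sample_fill_na(failure), by = bucket]
test
```

Because the fill is random, rows 4 and 5 only match the 2/3 and 1/3 split on average; a small bucket can easily end up with, say, two F1 values.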



