I am working on a multinomial logistic regression problem (i.e., where I want to classify some unordered, independent levels of a nominal outcome variable). My issue is that I know the levels of the outcome variable (in this example, y = c('a','b','c')) and I know the predictor variables, their levels, and their class (here, either numeric/integer or nominal). I also know what the approximate distributions should be between each predictor and the outcome (e.g., higher values of x appear more frequently with y = 'a', while low values of x are randomly distributed across the other levels of y).
Essentially, I want to do four things: 1) generate a dataset of these variables that approximates my specified distributions; 2) run a multinomial logistic regression on the data with nnet::multinom(y ~ ., df); 3) use the resulting model to predict() the probability of each level of y on new data; and 4) retrieve those probabilities for further processing. I am not interested in the MLR model's accuracy or p-values, so I don't need to split my data into train/test samples, do k-fold cross-validation, or anything like that.
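For concreteness, steps 2-4 are simple once the data exist; a minimal sketch, assuming a data frame df of generated data and a data frame newdata holding rows of the same predictors:

library(nnet)
m <- multinom(y ~ ., df)                      # step 2: fit the MLR
probs <- predict(m, newdata, type = 'probs')  # step 3: predicted probability of each y level
head(probs)                                   # step 4: a matrix with one column per level of y

It's step 1, generating the data, that I'm stuck on.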
My initial thought was that this kind of reverse-engineering of a dataset from user-specified distributions can't be too uncommon, and that there was probably an R package/function for it. I haven't found one so far. My first approach was to manually specify the distributions for each level of each predictor against each level of the outcome, like so:
rm(list=ls())
set.seed(123)
# specify vars and levels -- y=outcome var
y <- c('a','b','c')
x <- c(1:5)
p <- c(1:4)
r <- c(1:8)
q <- c('foo','bar','hello','world') # nominal var
# sample data based on user-specified distributions/probs
df1 <- data.frame(x1=sample(x,100,T,prob=c(0.1,0.1,0.2,0.25,0.35)),
                  y='b')
df2 <- data.frame(x1=sample(x,200,T,prob=c(0.35,0.25,0.2,0.1,0.1)),
                  y=sample(c('a','c'),200,T))
df <- rbind(df1,df2)
# check distribution of x1 levels v. y levels
table(df$x1,df$y)
     b  a  c
  1  7 38 30
  2 11 29 26
  3 22 17 22
  4 26 14  7
  5 34 12  5
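To cut some of the copy-paste, the per-level specification can be wrapped in a helper; a rough sketch (the gen_pred() name and interface are made up):

# hypothetical helper: build one predictor column from a named list of
# per-outcome probability vectors and per-outcome sample sizes
gen_pred <- function(lvls, spec, n_per) {
  do.call(rbind, lapply(names(spec), function(yy)
    data.frame(x1 = sample(lvls, n_per[[yy]], TRUE, prob = spec[[yy]]),
               y  = yy)))
}
df <- gen_pred(x,
               spec  = list(b = c(0.1,0.1,0.2,0.25,0.35),
                            a = c(0.35,0.25,0.2,0.1,0.1),
                            c = c(0.35,0.25,0.2,0.1,0.1)),
               n_per = list(b = 100, a = 100, c = 100))
table(df$x1, df$y)

But that still builds one predictor at a time, which is part of the problem.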
Even so, this is tedious as the number of predictors becomes larger and they have more levels. My next approach was to generate a random sample of data, run the MLR model, and tweak the model weights.
library(plyr)   # ldply
library(dplyr)
library(tidyr)
# create random sample (assumes only the vectors y, x, p, r, q are in the environment)
df <- ldply(mget(ls()),
            function(x) sample(x,1000,T)) %>%
  gather(k,v,-`.id`) %>%
  spread(`.id`,v) %>% select(-k)
str(df)
# convert the numeric vars back from character
df[,c('p','r','x')] <-
  apply(df[,c('p','r','x')],2,as.numeric)
glimpse(df)
Observations: 1,000
Variables: 5
$ p <dbl> 2, 2, 3, 1, 3, 2, 2, 4, 2, 4, 4, 3, 2, 4, 1, 4, 2, 1, 4, 3, 1, 3, 4, 3, 2, 2, 3...
$ q <chr> "bar", "bar", "foo", "bar", "world", "hello", "foo", "hello", "world", "hello",...
$ r <dbl> 2, 2, 1, 6, 6, 3, 4, 8, 6, 6, 2, 2, 8, 7, 7, 6, 3, 2, 4, 5, 2, 7, 1, 6, 3, 7, 8...
$ x <dbl> 2, 5, 1, 3, 3, 5, 2, 4, 1, 3, 5, 1, 5, 5, 2, 1, 1, 4, 4, 1, 5, 1, 5, 4, 4, 3, 2...
$ y <chr> "a", "c", "b", "a", "b", "a", "b", "c", "c", "b", "c", "c", "b", "a", "c", "b",...
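(In hindsight, the ldply/gather/spread dance above is just a roundabout way of building the frame directly; a simpler equivalent sketch, skipping the reshaping and re-conversion:)

# equivalent direct construction of the random starting sample
n <- 1000
df <- data.frame(p = sample(p, n, T), q = sample(q, n, T),
                 r = sample(r, n, T), x = sample(x, n, T),
                 y = sample(y, n, T), stringsAsFactors = FALSE)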
# graph distribution of each predictor against each outcome -- not run here
# df %>% gather(k,v,-y) %>% group_by(y,k,v) %>%
# summarise(n=n()) %>%
# mutate(prop=n/sum(n)) %>%
# ggplot(aes(y,prop,fill=v)) + geom_bar(stat='identity',position='dodge') +
# facet_wrap(~k,scales='free') + theme(legend.position = 'none')
# run MLR model
library(nnet)
m <- multinom(y~.,df)
summary(m)$coefficients
m$wts # raw weight vector of the underlying nnet fit
# adjust weight 16, which is x against y=b
m$wts[16] <- 1
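Before and after an edit like that, I compare the reported coefficients to the raw weight vector and re-check the fitted probabilities (a sanity-check sketch; my understanding that m$wts also carries a zeroed block for the reference level is an assumption about nnet internals, worth verifying):

coef(m)        # coefficient matrix; the reference level 'a' is dropped
length(m$wts)  # raw weight vector is longer -- it includes the masked
               # reference-level block, hence the index arithmetic above
# re-check the effect of the edit on the fitted probabilities
head(predict(m, type = 'probs'))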
Again, though, this is tedious when the number of predictors and levels is large. Plus, as I continue altering the model weights and predicting on new data, I get some unexpected probabilities (clearly, I don't know enough about MLR's internals to use this method confidently).
So I am more or less stuck at this stage. I have considered using multiple imputation or bootstrapping to generate the desired data, but I don't think either method applies here. MI imputes values for incomplete cases, whereas I want to specify a limited number of complete cases and extrapolate from there. Similarly, bootstrapping resamples the data on the assumption that the sample distribution approximates the population distribution, and I don't see how a limited number of hand-specified cases can validly do that (perhaps bootstrapping plus permutation/shuffling?).
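For clarity, the bootstrap-plus-shuffling idea I have in mind is roughly this (a sketch only, starting from hand-written 'prototype' rows):

# a few hand-written 'prototype' rows encoding the desired tendencies
seed <- data.frame(x1 = c(5, 4, 1, 2, 1),
                   q  = c('foo', 'foo', 'bar', 'hello', 'world'),
                   y  = c('b', 'b', 'a', 'c', 'a'))
# resample them with replacement...
boot <- seed[sample(nrow(seed), 1000, replace = TRUE), ]
# ...then shuffle a predictor within outcome levels, so predictors stay
# tied to y but not to each other
boot$x1 <- ave(boot$x1, boot$y, FUN = function(v) v[sample.int(length(v))])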
Anyway, any help or suggestions are greatly appreciated. And thanks to anyone who actually read this lengthy post!