Saturday, July 20, 2019

Reverse engineer multinomial logistic regression data

I am working on a multinomial logistic regression problem (i.e., where I want to classify some unordered, independent levels of a nominal outcome variable). My situation is that I know the levels of the outcome variable (in this example, y=c('a','b','c')), and I know the predictor variables, their levels, and their classes (here, either numeric/integer or nominal). I also know what the approximate distribution between each predictor and the outcome should be (e.g., higher values of x appear more frequently with y='a', while lower values of x are spread roughly evenly across the other levels of y).

Essentially, I want to do 4 things: 1) generate a dataset of these variables that approximates my specified distributions; 2) run a multinomial logistic regression on the data, nnet::multinom(y~.,df); 3) use the resulting model to predict() the probability of each y level for new data; and 4) retrieve the probabilities for further processing. I am not interested in the MLR model's accuracy or p-values, so I don't need to split my data into train/test samples and do k-fold cross-validation or anything.
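For concreteness, steps 2-4 would look something like this (a minimal sketch -- newdata is just a placeholder for whatever new observations I want scored):

library(nnet)

# step 2: fit the multinomial logistic regression
m <- multinom(y~.,df)

# step 3: predicted probability of each y level for new observations
probs <- predict(m, newdata=newdata, type='probs')

# step 4: probs is a matrix with one row per observation and one column
# per level of y, ready for further processing
head(probs)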

Step 1 (generating the data) is the hard part. My initial thought was that reverse-engineering a dataset from user-specified distributions like this can't be too uncommon, and that there was probably an R package/function for it. I haven't found one so far. My first approach was to manually specify the distributions for each level of each predictor against each level of the outcome, like so:

rm(list=ls())
set.seed(123)

# packages used below: plyr (ldply), dplyr (%>%, select, glimpse),
# tidyr (gather, spread), nnet (multinom); load plyr before dplyr
library(plyr)
library(dplyr)
library(tidyr)
library(nnet)

# specify vars and levels -- y=outcome var
y <- c('a','b','c')
x <- c(1:5)
p <- c(1:4)
r <- c(1:8)
q <- c('foo','bar','hello','world') # nominal var

# sample data based on user-specified distributions/probs
df1 <- data.frame(x1=sample(x,100,T,prob=c(0.1,0.1,0.2,0.25,0.35)),
                  y='b')
df2 <- data.frame(x1=sample(x,200,T,prob=c(0.35,0.25,0.2,0.1,0.1)),
                  y=sample(c('a','c'),200,T))
df <- rbind(df1,df2)

# check distribution of x1 levels v. y levels
table(df$x1,df$y)
     a  b  c
  1 38  7 30
  2 29 11 26
  3 17 22 22
  4 14 26  7
  5 12 34  5
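
In principle, this pattern can be wrapped in a small helper so each chunk of data is just an entry in a list of specs (a sketch using a made-up spec format; same sampling as above):

# drive the sampling from a list of specs rather than separate data frames
specs <- list(
  list(levels=x, probs=c(0.1,0.1,0.2,0.25,0.35), y='b',        n=100),
  list(levels=x, probs=c(0.35,0.25,0.2,0.1,0.1), y=c('a','c'), n=200)
)
df <- do.call(rbind, lapply(specs, function(s)
  data.frame(x1=sample(s$levels, s$n, T, prob=s$probs),
             y=sample(s$y, s$n, T))))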

Even wrapped up like that, though, this is tedious as the number of predictors becomes larger and they have more levels, since every predictor-by-outcome distribution still has to be written out. My next approach was to generate a random sample of data, run the MLR model, and tweak the model weights.

# create random sample -- mget() the spec vectors by name so the
# data frames created above (df, df1, df2) don't get swept in by ls()
df <- ldply(mget(c('p','q','r','x','y')),
            function(x) sample(x,1000,T)) %>% 
  gather(k,v,-`.id`) %>%
  spread(`.id`,v) %>% select(-k)
str(df)
# spread() coerced every column to character, so convert the numeric vars back
df[,c('p','r','x')] <- 
  apply(df[,c('p','r','x')],2,as.numeric)

glimpse(df)

Observations: 1,000
Variables: 5
$ p <dbl> 2, 2, 3, 1, 3, 2, 2, 4, 2, 4, 4, 3, 2, 4, 1, 4, 2, 1, 4, 3, 1, 3, 4, 3, 2, 2, 3...
$ q <chr> "bar", "bar", "foo", "bar", "world", "hello", "foo", "hello", "world", "hello",...
$ r <dbl> 2, 2, 1, 6, 6, 3, 4, 8, 6, 6, 2, 2, 8, 7, 7, 6, 3, 2, 4, 5, 2, 7, 1, 6, 3, 7, 8...
$ x <dbl> 2, 5, 1, 3, 3, 5, 2, 4, 1, 3, 5, 1, 5, 5, 2, 1, 1, 4, 4, 1, 5, 1, 5, 4, 4, 3, 2...
$ y <chr> "a", "c", "b", "a", "b", "a", "b", "c", "c", "b", "c", "c", "b", "a", "c", "b",...

# graph distribution of each predictor against each outcome (requires ggplot2) -- not run here
# df %>% gather(k,v,-y) %>% group_by(y,k,v) %>%
#   summarise(n=n()) %>%
#   mutate(prop=n/sum(n)) %>%
#   ggplot(aes(y,prop,fill=v)) + geom_bar(stat='identity',position='dodge') +
#   facet_wrap(~k,scales='free') + theme(legend.position = 'none')

# run MLR model
m <- multinom(y~.,df)
summary(m)$coefficients # coefficient matrix (one row per non-baseline y level)
m$wts # underlying nnet weight vector, which contains the coefficients

# adjust weight 16, which is x against y=b
m$wts[16] <- 1
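
Scoring some new data then shows how the tweak propagates -- predict.multinom() uses m$wts, so the manual edit feeds straight into the predicted probabilities (sketch):

# score a few new rows with the tweaked weights
nd <- data.frame(p=sample(p,5,T), q=sample(q,5,T),
                 r=sample(r,5,T), x=sample(x,5,T))
predict(m, newdata=nd, type='probs')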

Again, though, this is tedious when the number of predictors and levels is large. Plus, as I continue altering the model weights and predicting new data, I get some unexpected probabilities (clearly I don't know enough about MLR to use this method with confidence).

So, I am more or less stuck at this stage. I have considered using multiple imputation or bootstrapping to generate the desired data, but I don't think either method applies here. MI imputes values for incomplete cases, whereas I want to specify a small number of complete cases and extrapolate from there. Similarly, bootstrapping resamples the data on the assumption that the sample distribution approximates the population distribution, and I don't see how I can specify a small number of cases that would validly do that (perhaps bootstrapping plus permutation/shuffling?).
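To be clear about what I mean by reverse engineering, here is the one-predictor version of what I am after: pick the multinomial-logit coefficients by hand and simulate y from the softmax probabilities they imply (a sketch with made-up coefficient values). Scaling this up across many predictors and levels is exactly the part I do not know how to do.

# sketch: simulate y directly from hand-picked multinomial-logit coefficients
n <- 1000
x1 <- sample(1:5, n, T)
eta_b <- -2 + 0.9*x1  # linear predictor for 'b' vs. baseline 'a' -- made up
eta_c <- 0.5 + 0*x1   # 'c' indifferent to x1 -- made up
denom <- 1 + exp(eta_b) + exp(eta_c)
pr <- cbind(a=1/denom, b=exp(eta_b)/denom, c=exp(eta_c)/denom)
y <- apply(pr, 1, function(p) sample(c('a','b','c'), 1, prob=p))
table(x1, y) # higher x1 should now co-occur with 'b' more often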

Anyway, any help/suggestions are greatly appreciated. And thanks to anyone who actually read this lengthy post!



