Wednesday, 10 July 2019

Sampling not completely at random, with boundary conditions

I have summary-level data that tells me how often a group of patients went to the doctor up to a certain cut-off date. I do not have individual data; I only know that, e.g., some went 5 times and some only once. I also know that some were already patients at the beginning of the observation interval and would be expected to come more often, whereas others were new patients who entered later. Someone who joined only a month before the cut-off date would be expected to come less often than someone who was in the group from the beginning.

Of course, the patients are not well behaved: they sometimes miss a visit, or they come more often than expected. I therefore set boundary conditions that define the minimum and maximum number of doctor visits expected relative to the month in which a patient first appeared at the doctor.

Now, I want to distribute the actual summary level data to individuals, i.e. create a data frame that tells me during which month each individual started appearing at the doctor, and how many times they came for check-up until the cut-off date.

I am assuming this can be done with some type of random sampling, but the result needs to fit both the summary level information I have about the actual subjects as well as the boundary conditions telling how often a subject would be expected to come to the doctor relative to their joining time.

Here is some code that generates the target data frame that contains the month when the observation period starts, the respective number of doctor's visits that is expected (including boundary for minimum and maximum visits), and the associated percentage of subjects who start coming to the doctor during this month:

library(tidyverse)

months <- c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr")
target.visits <- c(6,5,4,3,2,1)
percent <- c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01)

df.target <- data.frame(month = months, target.visits = target.visits,
                        percent = percent) %>%
  mutate(max.visits = c(7,6,5,4,3,2),
         min.visits = c(5,4,3,2,1,1))

This is the data frame:

   month target.visits percent max.visits min.visits
   Nov             6    0.80          7          5
   Dec             5    0.10          6          4
   Jan             4    0.05          5          3
   Feb             3    0.03          4          2
   Mar             2    0.01          3          1
   Apr             1    0.01          2          1
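The min.visits and max.visits columns determine which starting months are admissible for a given number of visits. A minimal base R sketch of that lookup follows. The names `allowed.months` and `allowed` are hypothetical, introduced only for illustration, and the table is rebuilt without the tidyverse (and without target.visits, which the lookup does not need) so that the snippet runs on its own.

```r
# Target table from the question, rebuilt in base R.
df.target <- data.frame(
  month      = c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr"),
  percent    = c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01),
  max.visits = c(7, 6, 5, 4, 3, 2),
  min.visits = c(5, 4, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: months whose [min.visits, max.visits] window contains v.
allowed.months <- function(v, target = df.target) {
  target$month[target$min.visits <= v & target$max.visits >= v]
}

# One entry per possible visit count, 1 to 7.
allowed <- lapply(1:7, allowed.months)
names(allowed) <- 1:7

allowed[["4"]]
# "Dec" "Jan" "Feb"
```

Subjects with 7 visits can only have started in Nov, and subjects with 1 visit only in Mar or Apr.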

In addition, I can create a data frame that shows the actual number of subjects for each actual number of visits:

subj.n <- 1000
actual.visits <- c(7,6,5,4,3,2,1)
actual.subject.perc <- c(0.05,0.6,0.2,0.06,0.035,0.035,0.02)

df.observed <- data.frame(actual.visits = actual.visits,
                          actual.subj.perc = actual.subject.perc,
                          actual.subj.n = subj.n * actual.subject.perc)

Here is the data frame with the actual observations:

actual.visits actual.subj.perc actual.subj.n
             7            0.050            50
             6            0.600           600
             5            0.200           200
             4            0.060            60
             3            0.035            35
             2            0.035            35
             1            0.020            20
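Before any sampling, it may be worth checking whether the observed visit counts can fit the month shares exactly. The sketch below applies one necessary (not sufficient) condition: subjects with at most k visits can only start in months with min.visits <= k, so they must not outnumber the subjects those months hold. It assumes that `percent` is meant as an exact share of the 1000 subjects; the names `month.n`, `actual.n` and `check` are hypothetical.

```r
subj.n <- 1000

# Month capacities implied by `percent` (Nov ... Apr) and the lower visit bounds.
month.n    <- round(subj.n * c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01))
min.visits <- c(5, 4, 3, 2, 1, 1)

# Observed subject counts for 7, 6, ..., 1 visits.
actual.visits <- c(7, 6, 5, 4, 3, 2, 1)
actual.n      <- round(subj.n * c(0.05, 0.6, 0.2, 0.06, 0.035, 0.035, 0.02))

# For each k: subjects with at most k visits vs. room in months with min.visits <= k.
check <- data.frame(
  k        = 1:7,
  subjects = sapply(1:7, function(k) sum(actual.n[actual.visits <= k])),
  capacity = sapply(1:7, function(k) sum(month.n[min.visits <= k]))
)
check$fits <- check$subjects <= check$capacity
check
```

With the numbers above, the row for k = 2 fails: 55 subjects have at most 2 visits, but Feb, Mar and Apr together hold only 50. Under this reading, the month shares cannot be reproduced exactly and can at best serve as approximate weights.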

Unfortunately, I do not have any idea how to bring these together. I just know that if I have, e.g., 60 subjects who come to the doctor 4 times during their observation period, I would like to randomly assign a starting month to each of them. However, based on the boundary conditions min.visits and max.visits, I know that it can only be a month from Dec to Feb. Any thoughts are much appreciated.
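One possible starting point, sketched in base R: for each observed visit count, draw a starting month for every subject from the months whose window admits that count, using `percent` as sampling weights. This is only a sketch under that assumption. The helper `sample.months`, the result `df.individual` and the seed are hypothetical choices. The visit counts and the boundary conditions hold by construction, but the month shares are only used as weights.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

df.target <- data.frame(
  month      = c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr"),
  percent    = c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01),
  max.visits = c(7, 6, 5, 4, 3, 2),
  min.visits = c(5, 4, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)
df.observed <- data.frame(
  actual.visits = 7:1,
  actual.subj.n = round(1000 * c(0.05, 0.6, 0.2, 0.06, 0.035, 0.035, 0.02))
)

# Hypothetical helper: draw n starting months for subjects with v visits.
sample.months <- function(v, n, target = df.target) {
  ok <- target$min.visits <= v & target$max.visits >= v
  # Draw indices of the admissible months, weighted by their share.
  idx <- sample.int(sum(ok), size = n, replace = TRUE, prob = target$percent[ok])
  target$month[ok][idx]
}

# One row per subject: number of visits and sampled starting month.
df.individual <- do.call(rbind, lapply(seq_len(nrow(df.observed)), function(i) {
  v <- df.observed$actual.visits[i]
  n <- df.observed$actual.subj.n[i]
  data.frame(visits = rep(v, n), month = sample.months(v, n),
             stringsAsFactors = FALSE)
}))
df.individual$id <- seq_len(nrow(df.individual))

# Cross-tabulation of starting month by number of visits.
table(df.individual$month, df.individual$visits)
```

The resulting month distribution will generally differ from `percent`, because each visit group is sampled independently. Matching both sets of totals at once would need a different technique, for example filling a month-by-visits table under both constraints.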



